The Impact of AI on the Field of Proteomics

Countless questions continue to linger around the impact of AI and related technologies in virtually all areas of analytical science. The field of proteomics, with its rapidly advancing technologies and expanding scope, continues to revolutionize our understanding of health, disease, and beyond. The integration of AI, particularly through deep learning models and advanced data analysis tools, is poised to propel proteomics into a new era, enhancing our ability to make precise predictions based on past observations.

The integration of AI into proteomics

Mass spectrometry, a cornerstone technique in proteomics analysis, produces vast amounts of raw data that must be matched to in silico representations of peptides to identify and quantify the underlying proteins. While search algorithms for proteomic data have existed since 1993, the demand for better sensitivity and reproducibility calls for novel data analysis concepts.
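
To make the idea of an in silico peptide representation concrete, the sketch below computes theoretical fragment m/z values for a peptide and checks which of them appear among observed peaks. It is a minimal illustration rather than the method of any particular search engine: the function names `fragment_mz` and `matched_ions` and the 0.02 m/z tolerance are assumptions made here, and only unmodified peptides and singly charged b- and y-ions are considered.

```python
# Standard monoisotopic residue masses in daltons, rounded to 5 decimals.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
PROTON = 1.007276   # mass of a proton
WATER = 18.010565   # mass of H2O, carried by y-ions


def fragment_mz(peptide):
    """Return singly charged b- and y-ion m/z values, keyed by ion name."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    ions = {}
    for i in range(1, len(masses)):
        # b-ions contain the first i residues, y-ions the last i residues.
        ions[f"b{i}"] = sum(masses[:i]) + PROTON
        ions[f"y{i}"] = sum(masses[-i:]) + WATER + PROTON
    return ions


def matched_ions(observed_mz, theoretical, tol=0.02):
    """Names of theoretical ions that have an observed peak within tol (hypothetical tolerance)."""
    return sorted(
        name for name, mz in theoretical.items()
        if any(abs(obs - mz) <= tol for obs in observed_mz)
    )


ions = fragment_mz("PEPTIDE")
print(matched_ions([227.10, 148.06, 500.0], ions))
```

Counting matched positions in this way ignores peak intensities entirely.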

Martin Frejno, CEO of proteomic data analysis software company MSAID, explains that in recent years, advancements from companies such as Meta and Google have led to significant progress in machine learning models focused on tasks such as image and language interpretation. A major area of growth is natural language processing (NLP), and the resulting large language models have been applied to proteomics data analysis. "In this context, peptides can be thought of as words, which can be translated into spectra, similar to translating between languages," elaborates Frejno.

Frejno notes that the first major impact of deep learning in proteomics has been the accurate prediction of peptide fragmentation spectra. Peptides generate conserved fragmentation patterns in the mass spectrometer that directly relate to the amino acid sequence. These learned patterns can be generalized to any peptide sequence by the models and serve as references to compare against experimental spectra, allowing for the better identification of peptides through the calculation of spectral similarity measures. Traditionally, identification relied on matching fragment ion mass-to-charge (m/z) positions. Now, deep learning enables the interpretation of the intensity dimension of mass spectra as well.
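
One way to express such a spectral similarity measure is the normalized spectral angle between predicted and observed fragment intensities. The sketch below is illustrative only: it assumes the two intensity lists are already aligned, so that position i in both refers to the same fragment ion, and the function name `spectral_angle` is chosen here for the example.

```python
import math


def spectral_angle(predicted, observed):
    """Normalized spectral angle: 1.0 for identical patterns, 0.0 for orthogonal ones."""
    dot = sum(p * o for p, o in zip(predicted, observed))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    norm_o = math.sqrt(sum(o * o for o in observed))
    if norm_p == 0 or norm_o == 0:
        return 0.0  # an empty spectrum carries no evidence of agreement
    # Clamp to guard against floating-point values slightly outside [-1, 1].
    cosine = max(-1.0, min(1.0, dot / (norm_p * norm_o)))
    return 1 - 2 * math.acos(cosine) / math.pi


# Same fragment ions, same relative intensities (absolute scale does not matter).
print(spectral_angle([0.1, 0.8, 0.3], [1.0, 8.0, 3.0]))
# Same fragment positions, but a very different intensity pattern.
print(spectral_angle([0.1, 0.8, 0.3], [0.9, 0.1, 0.2]))
```

Both candidates in the usage lines would match the same m/z positions, so only the intensity dimension distinguishes them.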

"And, of course, it doesn't stop at fragmentation prediction," offers Frejno. He adds that there are now multiple models that predict other properties, such as retention time, collisional cross-section, ion mobility, and peptide detectability in a sample. "All of this additional information is useful in helping to better separate true identifications from false ones."

A specific example of the application of AI in proteomics is Prosit, a deep learning model introduced by a team at the laboratory of Bernhard Kuster at the Technical University of Munich. "This model showed for the first time how accurately you can predict fragmentation and has been widely applied," says Frejno. He explains that another application is rescoring, in which predicted spectra are used to corroborate identifications. The data are first searched with a classical database search engine, which interprets each spectrum in the traditional way (comparing experimental spectra to those in the database). The scores from this search are then augmented with intensity-based scores derived from deep learning predictions, improving the depth of analysis and the confidence in the identifications.
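
The rescoring idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of any specific tool: the `PSM` class, the `rescore` function, the fixed equal weights, the example peptides, and the assumption that both scores are scaled to the range 0 to 1 are all hypothetical. Real rescoring pipelines generally learn how to weight such features from the data rather than fixing the weights by hand.

```python
from dataclasses import dataclass


@dataclass
class PSM:
    """A peptide-spectrum match with two kinds of evidence (hypothetical structure)."""
    peptide: str
    search_score: float     # classical search engine score, assumed scaled to 0..1
    intensity_score: float  # agreement with predicted intensities, assumed 0..1


def rescore(psms, w_search=0.5, w_intensity=0.5):
    """Rank PSMs by a weighted sum of both scores (illustrative weights)."""
    def combined(psm):
        return w_search * psm.search_score + w_intensity * psm.intensity_score
    return sorted(psms, key=combined, reverse=True)


candidates = [
    PSM("LGEYGFQNAK", search_score=0.80, intensity_score=0.35),
    PSM("VLEDSDLKK", search_score=0.78, intensity_score=0.90),
]
for psm in rescore(candidates):
    print(psm.peptide)
```

In this example the two candidates are nearly tied on the classical score, and the intensity-based score changes which one ranks first.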

Frejno notes that other companies are also working on deep learning-based approaches. For example, Talus Bio is conducting extensive research into using machine learning and neural networks for spectrum interpretation. "Recently, we've seen the development of models that don't rely on traditional scoring features, such as counting peaks or calculating spectral similarity. Instead, these models embed spectra and peptides into the same space to determine the similarity of a peptide sequence to a spectrum, thereby assessing the correct sequence."

Challenges of integrating AI into proteomics studies

A major challenge for companies is to make deep learning-based tools readily available to the scientific researcher. "The main issue is that predictions of fragmentation spectra are slow on average desktop computers, as the models perform best with graphics processing units (GPUs)," Frejno says. "Most people running proteomics experiments don't have access to GPUs, relying instead on average laptops or desktops." This deployment challenge led Frejno and his team to choose a cloud-based approach. "By using the cloud, we can rent GPUs on demand, eliminating the need for users to invest in specialized hardware to support AI predictions in proteomics."

Another challenge Frejno highlights is the datasets on which the models are trained. "There's a vast pool of publicly available data when it comes to spectrum prediction of fragmentation that one can tap into, but there are challenges in making that data usable for machine learning," says Frejno. He warns that instrument setups, collision energies, and protocols differ between studies, resulting in unharmonized data. Frejno adds that when models are trained on such diverse and unstandardized data, they risk being biased by these differences, leading to suboptimal performance and the perception that specific models are needed for particular instruments or organisms. However, Frejno believes that a well-trained model should generalize across diverse, unharmonized datasets if the differences in data acquisition are properly controlled.

The future of AI in proteomics

Frejno explains that the models we are seeing involve end-to-end learning of a particular problem, be that peptide identification or other challenges in the biomedical sciences. He remarks that these models learn directly from data without being biased by researcher-designed scoring methods, which can be advantageous. However, he reiterates that the challenge lies in determining what data to train these models on, as they are far removed from the underlying science. "It's unclear how well such models generalize to different species or enzymes or other variations in the data." He adds that while some claims suggest good generalization, the vast problem space makes it challenging to ensure the training data is sufficient for all possible scenarios. There are innovative approaches, such as using one model's predictions to train another, but it's uncertain if these methods will be applicable in proteomics.

As the field of proteomics continues to expand, so does the potential for leveraging deep learning models and related tools. These technologies improve peptide identification through predicted spectra, broadening the possibilities of proteomic analyses. As with any advanced tool, AI-driven models come with challenges, such as the need for substantial computational resources, dataset harmonization, and reliable validation methods. Looking forward, the continued development of AI is set to produce increasingly robust and reliable models, paving the way for deeper insights into the proteome.