ABSTRACT
Peptide search engines are algorithms that identify peptides (i.e., short proteins or parts of proteins) from mass spectra of biological samples. These identification algorithms report the best matching peptide for a given spectrum together with a score that represents the quality of the match; usually, the higher this score, the more reliable the match. To estimate the specificity and sensitivity of search engines, the identification algorithm is given a set of target sequences as well as so-called decoy sequences, which are randomly created or scrambled versions of real sequences; decoy sequences should be assigned low scores, whereas target sequences should be assigned high scores.
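The target-decoy idea described above can be illustrated with a minimal sketch. It assumes that each match is a (score, is_decoy) pair and that the false discovery rate (FDR) is estimated as the ratio of decoy to target matches above a score threshold; this estimator, the function names, and the 1% default are assumptions made for illustration and are not taken from the paper.

```python
from typing import List, Tuple

# A match is (score, is_decoy); this data layout is an assumption for illustration.
Match = Tuple[float, bool]


def estimate_fdr(matches: List[Match], threshold: float) -> float:
    """Estimate the FDR at a score threshold as decoys / targets (assumed estimator)."""
    targets = sum(1 for score, is_decoy in matches if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets


def targets_at_fdr(matches: List[Match], max_fdr: float = 0.01) -> int:
    """Count the target matches accepted at the most permissive score cutoff
    whose estimated FDR does not exceed max_fdr.

    Matches are visited from the highest score downwards; tied scores are
    visited in input order, because Python's sort is stable.
    """
    best = 0
    targets = 0
    decoys = 0
    for _, is_decoy in sorted(matches, key=lambda m: m[0], reverse=True):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets > 0 and decoys / targets <= max_fdr:
            best = targets
    return best
```

Under these assumptions, a scoring function separates targets from decoys better if `targets_at_fdr` returns a larger count for the same set of spectra.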
In this paper, we present an approach based on symbolic regression (using genetic programming) that helps to distinguish between target and decoy matches. On the basis of features calculated for matched sequences, and using the information on the original sequence set (target or decoy), we learn mathematical models that calculate updated scores. As an alternative to this white box modeling approach, we also use a black box modeling method, namely random forests.
As we show in the empirical section of this paper, this approach yields scores that increase the number of reliably identified samples, compared with the original scores of the MS Amanda identification algorithm, for both high-resolution and low-resolution mass spectra.
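To make the white box idea concrete, the following sketch evaluates a symbolic model, stored as a small expression tree, on the features of a single match to obtain an updated score. The feature names (`original_score`, `matched_fraction`), the example formula, and the protected division are hypothetical and chosen only for illustration; they are not the models or features reported in the paper.

```python
from typing import Dict, Tuple, Union

# An expression is a constant, a feature name, or a tuple (operator, left, right).
Expr = Union[float, str, Tuple]


def _protected_div(a: float, b: float) -> float:
    """Division that returns 1.0 for a zero divisor (a common GP convention, assumed here)."""
    return a / b if b != 0 else 1.0


OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": _protected_div,
}


def evaluate(expr: Expr, features: Dict[str, float]) -> float:
    """Recursively evaluate an expression tree on one match's feature values."""
    if isinstance(expr, (int, float)):
        return float(expr)
    if isinstance(expr, str):
        return features[expr]
    op, left, right = expr
    return OPS[op](evaluate(left, features), evaluate(right, features))


# Hypothetical model: original_score * 1.5 + matched_fraction * 20
EXAMPLE_MODEL: Expr = (
    "+",
    ("*", "original_score", 1.5),
    ("*", "matched_fraction", 20.0),
)

updated_score = evaluate(
    EXAMPLE_MODEL, {"original_score": 100.0, "matched_fraction": 0.5}
)
```

Because the model is an explicit formula, the contribution of each feature to the updated score can be read directly from the tree, which is what distinguishes it from a black box model such as a random forest.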
Index Terms
- A Symbolic Regression Based Scoring System Improving Peptide Identifications for MS Amanda