DOI: 10.1145/2739482.2768509
research-article

A Symbolic Regression Based Scoring System Improving Peptide Identifications for MS Amanda

Published: 11 July 2015

ABSTRACT

Peptide search engines are algorithms that identify peptides (i.e., short proteins or parts of proteins) from mass spectra of biological samples. For a given spectrum, these identification algorithms report the best matching peptide together with a score that represents the quality of the match; usually, the higher the score, the more reliable the match. To estimate the specificity and sensitivity of search engines, the identification algorithm is given sets of target sequences as well as so-called decoy sequences, which are randomly created or scrambled versions of real sequences; decoy sequences should be assigned low scores, whereas target sequences should be assigned high scores.
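To make the role of target and decoy matches concrete, the following is a minimal sketch of how sensitivity and specificity could be computed at a given score cutoff. It assumes that every target match counts as a positive and every decoy match as a negative, which is a simplification; the function name, the cutoff, and the example scores are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

# Each match is (score, is_target); is_target is False for decoy matches.
Match = Tuple[float, bool]

def sensitivity_specificity(matches: List[Match], cutoff: float) -> Tuple[float, float]:
    """Accept every match with score >= cutoff and compare with its origin.

    Simplifying assumption: target matches are treated as positives and
    decoy matches as negatives.
    """
    tp = sum(1 for score, is_target in matches if is_target and score >= cutoff)
    fn = sum(1 for score, is_target in matches if is_target and score < cutoff)
    tn = sum(1 for score, is_target in matches if not is_target and score < cutoff)
    fp = sum(1 for score, is_target in matches if not is_target and score >= cutoff)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity

# Hypothetical scores for three target and three decoy matches.
example_matches = [
    (210.0, True), (180.0, True), (95.0, True),
    (120.0, False), (60.0, False), (40.0, False),
]
sens, spec = sensitivity_specificity(example_matches, cutoff=100.0)
```

A scoring function that separates the two groups better moves both values toward 1 at a suitable cutoff.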

In this paper we present an approach based on symbolic regression (using genetic programming) that helps to distinguish between target and decoy matches. Using features calculated for matched sequences and the information on the original sequence set (target or decoy), we learn mathematical models that calculate updated scores. As an alternative to this white-box modeling approach, we also use a black-box modeling method, namely random forests.
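As an illustration of what a white-box model of this kind might look like, the sketch below represents a candidate formula as a small expression tree, evaluates it on the features of a match, and measures its error against the target/decoy label (1 for target, 0 for decoy). This is only an assumed setup: the feature names (`amanda_score`, `matched_peaks`, `peptide_length`), the operator set, the protected division, and the mean squared error objective are hypothetical choices, and the evolutionary search itself is omitted.

```python
from typing import Dict, List, Tuple, Union

# An expression is a constant, a feature name, or (operator, left, right).
Expr = Union[float, str, tuple]

def evaluate(expr: Expr, features: Dict[str, float]) -> float:
    """Recursively evaluate an expression tree on one feature vector."""
    if isinstance(expr, (int, float)):
        return float(expr)
    if isinstance(expr, str):
        return features[expr]
    op, left, right = expr
    a, b = evaluate(left, features), evaluate(right, features)
    if op == "+":
        return a + b
    if op == "-":
        return a - b
    if op == "*":
        return a * b
    if op == "/":
        # Protected division: return 1.0 instead of failing on a zero divisor.
        return a / b if b != 0 else 1.0
    raise ValueError(f"unknown operator: {op}")

def mean_squared_error(expr: Expr, rows: List[Tuple[Dict[str, float], float]]) -> float:
    """Error of a candidate model against labels (1.0 = target, 0.0 = decoy)."""
    return sum((evaluate(expr, f) - label) ** 2 for f, label in rows) / len(rows)

# Hypothetical candidate model and two hypothetical labeled matches.
candidate = ("+", ("*", 0.004, "amanda_score"),
             ("/", "matched_peaks", ("*", 4.0, "peptide_length")))
rows = [
    ({"amanda_score": 200.0, "matched_peaks": 16.0, "peptide_length": 10.0}, 1.0),
    ({"amanda_score": 50.0, "matched_peaks": 4.0, "peptide_length": 12.0}, 0.0),
]
error = mean_squared_error(candidate, rows)
```

In a genetic programming run, an error of this kind would serve as the fitness that guides which formulas are kept and recombined; the resulting formula stays readable, which is the sense in which the approach is white-box.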

As we show in the empirical section of this paper, the updated scores increase the number of reliably identified samples among those originally scored with the MS Amanda identification algorithm, for high-resolution as well as low-resolution mass spectra.
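The abstract does not state how "reliably identified" is measured. One common convention in target-decoy searches is to count the target matches that can be accepted while the decoy-based false discovery rate estimate stays below a limit. The sketch below follows that convention as an assumption; the function name, the FDR estimate (decoys divided by targets), and the example scores are hypothetical.

```python
from typing import List, Tuple

Match = Tuple[float, bool]  # (score, is_target)

def targets_at_fdr(matches: List[Match], max_fdr: float = 0.01) -> int:
    """Largest number of accepted target matches with estimated FDR <= max_fdr.

    Matches are ranked by descending score; at each rank the FDR is
    estimated as decoys / targets among the matches accepted so far.
    Score ties are not treated specially in this sketch.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    targets = decoys = best = 0
    for _, is_target in ranked:
        if is_target:
            targets += 1
        else:
            decoys += 1
        if targets > 0 and decoys / targets <= max_fdr:
            best = targets
    return best

# Hypothetical scores before and after rescoring the same six matches.
original = [(9.0, True), (8.0, True), (7.0, False), (6.0, True), (5.0, True), (4.0, False)]
rescored = [(9.0, True), (8.0, True), (7.0, True), (6.0, True), (5.0, False), (4.0, False)]
gain = targets_at_fdr(rescored, 0.0) - targets_at_fdr(original, 0.0)
```

Comparing the count for the original scores with the count for the updated scores gives one way to quantify the improvement described above.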

              GECCO Companion '15: Proceedings of the Companion Publication of the 2015 Annual Conference on Genetic and Evolutionary Computation
              July 2015
              1568 pages
              ISBN:9781450334884
              DOI:10.1145/2739482

              Copyright © 2015 ACM

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Acceptance Rates

              Overall Acceptance Rate: 1,669 of 4,410 submissions, 38%