research-article

A Symbolic Regression Based Scoring System Improving Peptide Identifications for MS Amanda

Authors:
Viktoria Dorfer, University of Applied Sciences Upper Austria, Hagenberg, Austria
Sergey Maltsev, Research Institute of Molecular Pathology, Vienna, Austria
Stephan Dreiseitl, University of Applied Sciences Upper Austria, Hagenberg, Austria
Karl Mechtler, Research Institute of Molecular Pathology, Vienna, Austria
Stephan M. Winkler, University of Applied Sciences Upper Austria, Hagenberg, Austria

GECCO Companion '15: Proceedings of the Companion Publication of the 2015 Annual Conference on Genetic and Evolutionary Computation, July 2015, Pages 1335–1341
https://doi.org/10.1145/2739482.2768509

Published: 11 July 2015

ABSTRACT

Peptide search engines are algorithms that identify peptides (i.e., short proteins or parts of proteins) from mass spectra of biological samples. For a given spectrum, these identification algorithms report the best matching peptide together with a score that represents the quality of the match; usually, the higher the score, the more reliable the match. To estimate the specificity and sensitivity of a search engine, the identification algorithm is given a set of target sequences as well as so-called decoy sequences, which are randomly created or scrambled versions of real sequences; decoy sequences should be assigned low scores, whereas target sequences should be assigned high scores.
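The target/decoy idea can be illustrated with a minimal sketch. It assumes the common estimate in which the false discovery rate (FDR) at a score cutoff is the number of decoy matches divided by the number of target matches above that cutoff; the abstract does not state which estimator the paper uses, and the function name `targets_at_fdr` and the example scores are hypothetical.

```python
from typing import List, Tuple


def targets_at_fdr(psms: List[Tuple[float, bool]], max_fdr: float = 0.01) -> int:
    """Count target matches accepted at an estimated FDR of at most max_fdr.

    psms holds (score, is_decoy) pairs; a higher score means a better match.
    Assumption: FDR at a cutoff is estimated as #decoys / #targets above it.
    Tied scores are processed in sorted order; no special tie handling is done.
    """
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)
    targets = decoys = best = 0
    for _score, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # Remember the largest target count whose estimated FDR is acceptable.
        if targets > 0 and decoys / targets <= max_fdr:
            best = targets
    return best


# Hypothetical matches: (score, is_decoy)
example = [(9.0, False), (8.0, False), (7.0, False),
           (6.0, True), (5.0, False), (4.0, True)]
accepted = targets_at_fdr(example, max_fdr=0.25)
```

A scoring function that pushes decoys further down this ranking lets more target matches pass the same FDR limit, which is the sense in which an updated score can increase the number of reliable identifications.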

In this paper, we present an approach based on symbolic regression (using genetic programming) that helps to distinguish between target and decoy matches. On the basis of features calculated for matched sequences, and using the information on the original sequence set (target or decoy), we learn mathematical models that calculate updated scores. As an alternative to this white-box modeling approach, we also use a black-box modeling method, namely random forests.
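As a rough illustration of the white-box idea, the sketch below evaluates two hand-written candidate formulas against target/decoy labels. A genetic programming run would generate and refine such formulas automatically; only the fitness evaluation is shown here. The feature names (`score`, `matched_fraction`), the 1/0 label encoding, the mean squared error fitness, and both formulas are assumptions made for illustration and are not taken from the paper.

```python
from typing import Callable, Dict, List

Features = Dict[str, float]


def mse_fitness(model: Callable[[Features], float],
                rows: List[Features],
                is_target: List[bool]) -> float:
    """Mean squared error between model output and the label (target=1, decoy=0)."""
    errors = [(model(x) - (1.0 if t else 0.0)) ** 2
              for x, t in zip(rows, is_target)]
    return sum(errors) / len(errors)


# Hypothetical features for four matches (names and values are invented).
rows = [
    {"score": 120.0, "matched_fraction": 0.8},
    {"score": 90.0, "matched_fraction": 0.7},
    {"score": 40.0, "matched_fraction": 0.2},
    {"score": 60.0, "matched_fraction": 0.1},
]
is_target = [True, True, False, False]


def baseline(x: Features) -> float:
    # Rescaled original score only.
    return x["score"] / 120.0


def candidate(x: Features) -> float:
    # A readable formula combining two features, as symbolic regression might propose.
    return 0.5 * x["score"] / 120.0 + 0.625 * x["matched_fraction"]


baseline_error = mse_fitness(baseline, rows, is_target)
candidate_error = mse_fitness(candidate, rows, is_target)
```

On this toy data the candidate formula has a lower error than the rescaled original score, and because it is an explicit expression, it can be read and inspected directly, unlike a random forest.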

As we show in the empirical section of this paper, this approach leads to scores that increase the number of reliably identified samples, originally scored using the MS Amanda identification algorithm, for both high-resolution and low-resolution mass spectra.


Published in
GECCO Companion '15: Proceedings of the Companion Publication of the 2015 Annual Conference on Genetic and Evolutionary Computation
July 2015, 1568 pages
ISBN: 9781450334884
DOI: 10.1145/2739482
Editor: Sara Silva, Universidade de Lisboa, Portugal
General Chair: Anna I. Esparcia-Alcázar, Universitat Politècnica de València, Spain
Copyright © 2015 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags: peptide identification, proteomics, symbolic regression