Protein secondary structure prediction evaluation and a novel transition site model with new encoding schemes

Date

2017-05-09

Authors

Zamani, Masood

Publisher

University of Guelph

Abstract

Rapid progress in genomics has led to the discovery of millions of protein sequences, yet the structures of fewer than 0.2% of sequenced proteins have been resolved by X-ray crystallography or NMR spectroscopy, both of which are complex, time-consuming, and expensive. Advanced computational techniques for predicting protein structure at the secondary and tertiary levels offer an alternative way to accelerate the process and to compensate for the extremely low percentage of protein structures that have been determined. State-of-the-art protein secondary structure (PSS) prediction methods employ machine learning (ML) techniques, in contrast to early approaches based on statistical information and sequence homology. In this research, we develop a two-stage PSS prediction model based on Artificial Neural Networks (ANNs) and Genetic Programming (GP), built on a novel framework of PSS transition sites and on new amino acid encoding schemes derived from genetic codon mappings, clustering, and information theory. PSS transition sites represent structural information about protein backbones and reduce the input space and the number of learning parameters in the PSS prediction model. They can also be used in Homology Modeling (HM) to define the boundaries of secondary structure elements. The prediction performance of the proposed method is evaluated using Q3 and segment overlap (SOV) scores on two standard datasets, RS126 and CB513, and on the latest protein dataset, PISCES, which is compiled with very strict homology measures such that every pair of sequences has a similarity below the twilight zone, that is, less than 25%. The experimental results and statistical analyses indicate that the proposed PSS model achieves statistically significant improvements in prediction accuracy over state-of-the-art ML techniques, which commonly employ cascaded ANNs and SVMs.
The proposed encoding schemes show advantages in extracting sequence and profile information, in reducing the number of input parameters, and in training performance. A successful PSS prediction model can be used in homology detection tools for distant protein sequences and in protein tertiary structure prediction methods to reduce the complexity of protein structure prediction, which has important applications in medicine, agriculture, and the biological sciences.
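
As a rough illustration of two ideas named in the abstract, the sketch below computes a Q3 score (the percentage of residues whose three-state label, H, E, or C, is predicted correctly) and lists candidate transition sites in a secondary structure string. The function names are hypothetical, and treating a transition site as any position where the state differs from the previous residue is an assumption made for this example; the thesis may define transition sites differently. SOV is not covered here.

```python
from typing import List


def q3_score(observed: str, predicted: str) -> float:
    """Percentage of residues whose three-state label matches (standard Q3)."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    correct = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * correct / len(observed)


def transition_sites(states: str) -> List[int]:
    """Indices where the state differs from the previous residue.

    Assumption: this boundary-based definition is illustrative only and
    is not taken from the thesis.
    """
    return [i for i in range(1, len(states)) if states[i] != states[i - 1]]


if __name__ == "__main__":
    observed = "CCHHHHEEEC"   # made-up example labels, not real data
    predicted = "CCHHHCEEEC"
    print(q3_score(observed, predicted))
    print(transition_sites(observed))
```

Under this reading, a handful of transition indices is enough to recover the segment boundaries of a whole sequence, which is one way to see how such sites could shrink the input space of a predictor.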

Keywords

Protein Structure, Machine Learning
