Created by W.Langdon from gp-bibliography.bib Revision:1.7546
Life is a tremendously complex system consisting of a large number of inter-connected actors. In order to further understand biological processes, it is often not possible to pick individual blocks and investigate these in isolation. Instead, the system should be considered, modeled and researched with as much context as possible. Inside each cell, DNA acts as a blueprint containing all necessary information to construct and maintain the organism. Certain regions of the DNA, called genes, can be expressed into proteins, which perform the vast number of functions in the body. As each cell in an organism contains the same DNA, changes in gene expression are the driving factor in cell differentiation and a key system to respond to external factors. One of the major mechanisms cells use to influence gene expression is transcriptional regulation. In this process, certain proteins called transcription factors work in a combinatorial fashion to tune the amount of produced RNA through various mechanisms.
In biochemistry, wet lab techniques have been developed which can measure the expression of genes genome-wide at a certain moment in time. These snapshots of gene activity have proven to be indispensable tools in systems biology. A large fraction of these gene expression measurements, using techniques such as microarrays or RNA-Seq, have been performed in a context where relative changes in expression due to certain perturbations, conditions or time aspects are investigated with a specific hypothesis or goal in mind.
Similarly to the paradigm shift described earlier, we can also consider such collections of genome-wide snapshots as general data, hiding a potential wealth of knowledge unrelated to the original purpose. In particular, algorithms have been described that use collections of gene expression snapshots to deduce transcriptional or other regulating effects between genes and present these results in the form of networks. These algorithms have to work in an extremely challenging setting, as the number of genes that are being measured by far exceeds the number of data points that are available. This work will discuss the use of machine learning methods to infer knowledge in the form of networks from gene expression measurements. In the first part of this dissertation, we propose a general framework using machine learning techniques to infer gene regulatory networks. In these networks, a gene A has an outgoing edge to a gene B if gene A, through its gene products, causes a (direct) effect on the transcription rate of gene B. Our framework generalizes the successful method GENIE3, which decomposes the network inference task into separate regression problems. For each gene in the network, the expression values of that target gene are predicted using all other genes as possible predictors. Next, using tree-based ensemble methods, an importance measure for each predictor gene is calculated with respect to the target gene, and a high feature importance is considered as putative evidence of a regulatory link existing between both genes. We generalize GENIE3 by proposing a subsampling approach which allows any feature selection algorithm that produces a feature ranking to be cast into an ensemble feature importance algorithm. We demonstrate that the ensemble setting is key to the network inference task, as only ensemble variants achieve top performance.
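The decomposition into one ranking problem per target gene, combined with subsampling, can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework from the dissertation: the base ranker (absolute Pearson correlation), the rank-to-score mapping, the subsample fraction and all function names are hypothetical choices made for the example. Expression data is assumed to be a dict mapping each gene name to a list of values, one per sample.

```python
import random
from statistics import mean


def abs_pearson(x, y):
    """Absolute Pearson correlation; 0.0 if either vector is constant."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def rank_by_correlation(expr, target, rows):
    """Stand-in base ranker (assumption): order predictors by |correlation|
    with the target, computed on the given sample indices only."""
    y = [expr[target][r] for r in rows]
    scores = {g: abs_pearson([expr[g][r] for r in rows], y)
              for g in expr if g != target}
    return sorted(scores, key=scores.get, reverse=True)


def ensemble_importance(expr, target, ranker=rank_by_correlation,
                        n_rounds=50, frac=0.7, seed=0):
    """Turn any feature ranker into an ensemble importance score by
    averaging rank-derived scores over random subsamples of the samples."""
    rng = random.Random(seed)
    n_samples = len(expr[target])
    k = min(n_samples, max(2, int(frac * n_samples)))
    totals = {g: 0.0 for g in expr if g != target}
    n_pred = len(totals)
    for _ in range(n_rounds):
        rows = rng.sample(range(n_samples), k)
        for pos, gene in enumerate(ranker(expr, target, rows)):
            # Top-ranked predictor contributes 1.0, the last one 1/n_pred.
            totals[gene] += (n_pred - pos) / n_pred
    return {g: s / n_rounds for g, s in totals.items()}


def infer_network(expr, **kwargs):
    """Solve one ranking problem per target gene and pool all candidate
    edges (regulator, target, score), strongest first."""
    edges = []
    for target in expr:
        for reg, score in ensemble_importance(expr, target, **kwargs).items():
            edges.append((reg, target, score))
    return sorted(edges, key=lambda e: e[2], reverse=True)
```

Any other function with the same signature as `rank_by_correlation` could be passed as `ranker`, which is the point of the generalization: the ensemble wrapper only needs a ranking, not a particular model.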
In addition, we explore the effect of using rankwise averaged predictions of multiple ensemble algorithms as opposed to only one. We name this approach NIMEFI (Network Inference using Multiple Ensemble Feature Importance algorithms) and show that this approach outperforms all individual methods in general, although on a specific network a single method can perform better.
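Rankwise averaging of several methods can be illustrated with a short sketch. The details here are assumptions made for the example rather than the NIMEFI implementation: each method is represented as a dict from an edge `(regulator, target)` to a score where higher means more confident, tied scores share their average rank, and only edges scored by every method are combined.

```python
def to_ranks(scores):
    """Map each edge to its rank (1 = strongest); ties share the average rank."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j over the run of edges tied with ordered[i].
        while j + 1 < len(ordered) and scores[ordered[j + 1]] == scores[ordered[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for edge in ordered[i:j + 1]:
            ranks[edge] = avg_rank
        i = j + 1
    return ranks


def rank_average(predictions):
    """Combine several edge-score dicts by averaging per-method ranks.
    Returns (edge, mean_rank) pairs, best (lowest mean rank) first."""
    common = set(predictions[0])
    for p in predictions[1:]:
        common &= set(p)
    tables = [to_ranks({e: p[e] for e in common}) for p in predictions]
    mean_rank = {e: sum(t[e] for t in tables) / len(tables) for e in common}
    return sorted(mean_rank.items(), key=lambda kv: (kv[1], kv[0]))
```

Working on ranks rather than raw scores means the methods do not need to produce scores on a comparable scale.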
In the second part of this thesis, we propose a post-processing algorithm for gene regulatory network predictions, Netter, which uses graphlets and several other graph-invariant properties to transform the network into a more accurate prediction. Common inference strategies of GRN inference algorithms include the calculation of local pairwise measures between genes or the transformation of the problem into independent regression subproblems to derive connections between genes. Using such schemes, the algorithm is unaware that the goal is to infer an actual network topology, and the global network structure cannot influence the inference process. Netter is a flexible system which can be applied in unison with any method producing a ranking from omics data and can be tailored to specific prior knowledge by expert users or applied in general use cases. We re-rank predictions of six different state-of-the-art algorithms using three simple network properties as optimization criteria and show that Netter can improve the predictions made on both artificially generated data and commonly used benchmark data. Furthermore, Netter compares favorably to other post-processing algorithms and is not restricted to correlation-like predictions. Lastly, we demonstrate that the performance increase is robust for a wide range of parameter settings.
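The general idea of letting a structural criterion influence an edge ranking can be shown with a toy example. This is not Netter's algorithm: the cost function (number of distinct regulators among the top-k edges), the divergence penalty, the adjacent-swap hill climbing and all names below are hypothetical choices, used only to show how a ranking might be adjusted toward a desired network property while staying close to the original prediction.

```python
def structural_cost(ranking, k):
    """Hypothetical criterion: distinct regulators among the top-k edges."""
    return len({reg for reg, _ in ranking[:k]})


def divergence(ranking, original_pos):
    """Total displacement of edges relative to the original ranking."""
    return sum(abs(i - original_pos[e]) for i, e in enumerate(ranking))


def rerank(original, k, weight=0.1, max_passes=20):
    """Greedy hill climbing over adjacent swaps. A swap is kept only if it
    lowers structural cost plus the weighted divergence penalty."""
    original_pos = {e: i for i, e in enumerate(original)}

    def total(r):
        return structural_cost(r, k) + weight * divergence(r, original_pos)

    current = list(original)
    best = total(current)
    for _ in range(max_passes):
        improved = False
        for i in range(len(current) - 1):
            current[i], current[i + 1] = current[i + 1], current[i]
            cost = total(current)
            if cost < best:
                best, improved = cost, True
            else:
                # Undo the swap when it does not help.
                current[i], current[i + 1] = current[i + 1], current[i]
        if not improved:
            break
    return current
```

The `weight` parameter controls the trade-off: a larger value keeps the result closer to the input ranking, a smaller value lets the structural criterion dominate.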
Thirdly, we also apply our gene regulatory network inference algorithms to a practical use case. More specifically, we collected and processed a large compendium of microarrays gathered in the context of a large international immunological cell project named
(BibTeX entry too long. Truncated)
Genetic Programming entries for Joeri Ruyssinck