Machine Learning for Biological Network Inference

author = "Joeri Ruyssinck",

title = "Machine Learning for Biological Network Inference",

school = "Department of Information technology, Faculty of Engineering and Architecture, Ghent University",

year = "2017",

address = "Belgium",

month = "19 " # jun,

keywords = "genetic algorithms, genetic programming, IBCN",

isbn13 = "978-94-6355-015-4",

language = "eng",

URL = "https://www.irc.ugent.be/index.php?id=phdtheses&type=98",

URL = "https://biblio.ugent.be/publication/8526161",

URL = "https://biblio.ugent.be/publication/8526161/file/8526164.pdf",

size = "160 pages",

abstract = "There is a recent and strong belief in both industry and academia that data is the new gold. Traditionally, turning data into information or knowledge has been an exclusively human task. Although computers have substantially increased our capabilities to handle larger amounts and more complex data, they have always remained the tool in this process, firmly placed in the hands of the intelligent human. This paradigm severely limits the possibilities to extract information or solve problems in a setting where it is not clear how a computer should process the data. Many tasks that we consider to be results of intelligent behavior are not learned by following a clear set of instructions but by learning from examples. For example, learning how to drive a car in traffic cannot be learned by reading the instruction manual but requires practice to provide example situations to learn from. Similarly, if we wish to extract more knowledge from data or use this data to perform more complex tasks, artificial intelligence needs to become a core part of the machine or computer. Machine learning investigates how computers or machines can perform tasks by learning from data without the need to be explicitly programmed. Recent successes and advances in machine learning have caused a strong reinforcing loop in which more data is being gathered without a specific goal in mind and is subsequently mined for gold. In this dissertation, we apply machine learning techniques in order to gain knowledge about relations between biological entities in the cell.

Life is a tremendously complex system consisting of a large number of inter-connected actors. In order to further understand biological processes, it is often not possible to pick individual blocks and investigate these in isolation. Instead, the system should be considered, modeled and researched with as much context as possible. Inside each cell, DNA acts as a blueprint containing all necessary information to construct and maintain the organism. Certain regions of the DNA, called genes, can be expressed into proteins, which perform the vast number of functions in the body. As each cell in an organism contains the same DNA, changes in gene expression are the driving factor in cell differentiation and a key system to respond to external factors. One of the major mechanisms cells use to influence gene expression is transcriptional regulation. In this process, certain proteins called transcription factors work in a combinatorial fashion to tune the amount of produced RNA through various mechanisms.

In biochemistry, wet lab techniques have been developed which can measure the genome-wide expression of genes at a certain moment in time. These snapshots of gene activity have proven to be indispensable tools in systems biology. A large fraction of these gene expression measurements, using techniques such as microarrays or RNA-Seq, have been performed in a context where relative changes in expression due to certain perturbations, conditions or time aspects are investigated with a specific hypothesis or goal in mind.

Similarly to the paradigm shift described earlier, we can also consider such collections of genome-wide snapshots as general data, hiding a potential wealth of knowledge unrelated to the original purpose. In particular, algorithms have been described that use collections of gene expression snapshots to deduce transcriptional or other regulating effects between genes and present these results in the form of networks. These algorithms have to work in an extremely challenging setting, as the number of genes that are being measured far exceeds the number of data points that are available. This work will discuss the use of machine learning methods to infer knowledge in the form of networks from gene expression measurements. In a first part of this dissertation, we propose a general framework using machine learning techniques to infer gene regulatory networks. In these networks, a gene A has an outgoing edge to a gene B if gene A, through its gene products, causes a (direct) effect on the transcription rate of gene B. Our framework generalizes the successful method GENIE3, which decomposes the network inference task into separate regression problems. For each gene in the network, the expression values of that target gene are predicted using all other genes as possible predictors. Next, using tree-based ensemble methods, an importance measure for each predictor gene is calculated with respect to the target gene, and a high feature importance is considered as putative evidence of a regulatory link existing between both genes. We generalize GENIE3 by proposing a subsampling approach which allows any feature selection algorithm that produces a feature ranking to be cast into an ensemble feature importance algorithm. We demonstrate that the ensemble setting is key to the network inference task, as only ensemble variants achieve top performance.
In addition, we explore the effect of using rankwise averaged predictions of multiple ensemble algorithms as opposed to only one. We name this approach NIMEFI (Network Inference using Multiple Ensemble Feature Importance algorithms) and show that this approach outperforms all individual methods in general, although on a specific network a single method can perform better.
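The decomposition, subsampling and rank-wise averaging steps described above can be illustrated with a minimal sketch. This is not the thesis implementation. All function names are hypothetical. The base ranker uses absolute Pearson correlation as a simple stand-in for the tree-based and other feature selection algorithms the abstract refers to. The aggregation rule (fraction of subsamples in which a predictor ranks in the top k) is one plausible choice, assumed here for illustration.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def rank_by_abs_correlation(X, y):
    """Hypothetical base ranker: predictor indices, best first, by |Pearson r| with y."""
    n_feat = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_feat)]
    return sorted(range(n_feat), key=lambda j: -scores[j])


def ensemble_importance(X, y, ranker, n_rounds=50, frac=0.8, top_k=1, seed=0):
    """Turn any ranker into an ensemble score: the fraction of random
    subsamples of the data in which a predictor lands in the ranker's top_k."""
    rng = random.Random(seed)
    n = len(X)
    m = max(2, int(frac * n))
    counts = [0] * len(X[0])
    for _ in range(n_rounds):
        idx = rng.sample(range(n), m)  # subsample without replacement
        ranking = ranker([X[i] for i in idx], [y[i] for i in idx])
        for j in ranking[:top_k]:
            counts[j] += 1
    return [c / n_rounds for c in counts]


def infer_network(expr, genes, ranker, **kwargs):
    """One regression-style subproblem per target gene, all other genes as predictors.
    Returns (regulator, target, score) tuples sorted by decreasing score."""
    edges = []
    for t, target in enumerate(genes):
        others = [g for g in range(len(genes)) if g != t]
        X = [[row[g] for g in others] for row in expr]
        y = [row[t] for row in expr]
        scores = ensemble_importance(X, y, ranker, **kwargs)
        edges += [(genes[g], target, s) for g, s in zip(others, scores)]
    return sorted(edges, key=lambda e: -e[2])


def average_ranks(*edge_lists):
    """Rank-wise average of several sorted edge lists (lower mean rank is better)."""
    ranks = {}
    for edge_list in edge_lists:
        for r, (src, dst, _) in enumerate(edge_list, start=1):
            ranks.setdefault((src, dst), []).append(r)
    return sorted(((e, mean(rs)) for e, rs in ranks.items()), key=lambda p: p[1])


# Toy data (made up): gene B follows gene A closely, gene C is independent noise.
_rng = random.Random(1)
a_vals = [_rng.gauss(0, 1) for _ in range(30)]
b_vals = [2 * a + _rng.gauss(0, 0.1) for a in a_vals]
c_vals = [_rng.gauss(0, 1) for _ in range(30)]
expr = [list(row) for row in zip(a_vals, b_vals, c_vals)]

edges = infer_network(expr, ["A", "B", "C"], rank_by_abs_correlation)
combined = average_ranks(edges, edges)  # in practice: lists from different methods
```

Because the ranker is passed in as a function, any other algorithm that returns a feature ranking could be substituted without changing the subsampling or aggregation code, which is the property the abstract attributes to the proposed framework.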

In a second part of this thesis, we propose a post-processing algorithm for gene regulatory network predictions, Netter, which uses graphlets and several other graph-invariant properties to transform the network into a more accurate prediction. Common inference strategies of GRN inference algorithms include the calculation of local pairwise measures between genes or the transformation of the problem into independent regression subproblems to derive connections between genes. Using such schemes, the algorithm is unaware that the goal is to infer an actual network topology, and the global network structure cannot influence the inference process. Netter is a flexible system which can be applied in unison with any method producing a ranking from omics data and can be tailored to specific prior knowledge by expert users or applied in general use cases. We re-rank predictions of six different state-of-the-art algorithms using three simple network properties as optimization criteria and show that Netter can improve the predictions made on both artificially generated data and commonly used benchmark data. Furthermore, Netter compares favorably to other post-processing algorithms and is not restricted to correlation-like predictions. Lastly, we demonstrate that the performance increase is robust for a wide range of parameter settings.

Thirdly, we also apply our gene regulatory network inference algorithms to a practical use case. More specifically, we collected and processed a large compendium of microarrays gathered in the context of a large international immunological cell project named [BibTeX entry too long. Truncated]