Exploring genetic influences on adverse outcome pathways using heuristic simulation and graph data science
Introduction
Informatics and computational methods have revolutionized biomedical research and enabled scientists to explore questions that are infeasible or impossible to address through traditional experimentation alone [32]. In environmental health and toxicology, common computational tasks include building and training models that predict chemical properties, conducting statistical analyses of observational and epidemiological data to better understand exposure-related health outcomes, and performing network analyses to discover key processes in biochemical pathways, among others [17], [38]. Despite the successes of these methods, two key deficiencies have become apparent in toxicological research: a lack of richly structured, multimodal biomedical data describing chemicals and the biological systems that respond to chemical exposure [31], and a paucity of novel methods for discovering new knowledge from these complex data resources [36]. In this paper, we address both, combining richly structured data with novel analytical methods to gain new insights into a phenomenon of growing interest: the influence of genetics on susceptibility to an adverse outcome following specific chemical exposures.
Adverse Outcome Pathways (AOPs) are pathway-like descriptions that outline the mechanistic associations between molecular exposure events and the higher-order clinical and population-level outcomes that may arise from the exposure [2], [26]. AOPs consist of molecular initiating events (MIEs), key events (KEs), and adverse outcomes (AOs). By definition, a KE is any internal step within an AOP at some level of biological organization, and an MIE is a particular kind of KE that both initiates an AOP and consists of a molecular interaction between a toxicant and a body component. AOPs are classified according to their respective health outcomes, and AOPs associated with similar outcomes often overlap to create an ‘AOP network.’ An AOP’s set of KEs can include genetic polymorphisms that are associated with a higher risk of the adverse outcome. For example, colon cancer AOPs include 53 unique SNP associations originally derived from GWAS [37]. This study examines the influence of genetic variation on susceptibility to adverse outcomes after specific chemical exposures, using AOPs as a reference framework.
Methodologically, one area that has experienced rapid growth, and that holds great promise in all areas of biomedicine, is artificial intelligence (AI). AI broadly aims to construct computational systems that make intelligent decisions based on available data, knowledge, and/or human input. The scope of AI is broad and often nebulously defined. In this paper, we explore two areas within AI: evolutionary algorithms and graph data science. Evolutionary algorithms are a family of algorithms that imitate processes found in biological evolution to optimize a system (e.g., a predictive model, a symbolic mathematical equation, or even another algorithm). Unsurprisingly, evolutionary computation is often used in computational biology, for example, to simulate natural systems or processes [10], [11], [22] and to build machine learning classifiers that perform well on a specific task [16], [23], [28]. Graph data science refers to the quantitative analysis of graphs, sometimes known as networks (e.g., biological networks), which consist of a set of nodes connected by a set of edges that define relationships between those nodes [7], [27]. Tasks within graph data science include community detection [9], identification of the shortest paths linking two nodes in a graph [12], determination of ‘hub nodes’ that play critical roles in the global structure of a graph [7], [41], and the use of computational algorithms that yield a quantitative understanding of the behavior and characteristics of a given graph [1], [15]. Since AOPs can be represented as graphs, graph data science provides a powerful set of tools for discovering properties of AOPs that are not obvious through manual inspection.
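To make the graph view of an AOP concrete, the following is a minimal sketch in Python using only the standard library. The node names and edges are hypothetical and are not taken from the AOP-DB; the sketch simply shows two of the tasks named above, a shortest path from an MIE to an AO (breadth-first search) and a degree-based notion of a hub node. Other definitions of hub nodes (e.g., based on betweenness) are equally plausible.

```python
from collections import deque

# Hypothetical toy AOP network; node names are illustrative, not from AOP-DB.
EDGES = [
    ("MIE_A", "KE_1"),
    ("MIE_B", "KE_1"),
    ("KE_1", "KE_2"),
    ("KE_1", "KE_3"),
    ("KE_2", "AO"),
    ("KE_3", "KE_2"),
]


def build_graph(edges):
    """Build a directed adjacency list from (source, target) pairs."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
    return graph


def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest path or None if unreachable."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


def total_degree(graph):
    """In-degree plus out-degree for every node."""
    deg = {node: len(targets) for node, targets in graph.items()}
    for targets in graph.values():
        for dst in targets:
            deg[dst] += 1
    return deg


graph = build_graph(EDGES)
path = shortest_path(graph, "MIE_A", "AO")
degrees = total_degree(graph)
hub = max(degrees, key=degrees.get)  # node with the most connections
```

In this toy network, the key event shared by both initiating events has the highest degree, which is the kind of structural property that is easy to miss when AOPs are inspected one at a time.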
Here, we propose a novel approach that leverages these two areas of AI to understand the mechanisms underlying genetic influences on toxic adverse outcomes, without requiring associated case-control information, and we evaluate the approach in the context of toxicity-mediated adverse outcome pathways involved in liver cancer (LC). Briefly, we use the HIBACHI software (Heuristic Identification of Biological Architectures for simulating Complex Hierarchical genetic Interactions) to train interpretable generative models that construct synthetic datasets resembling real-world LC AOP genotype data, and we then inspect the best models produced by HIBACHI to identify the most prominent AOP SNPs that influence LC outcomes. HIBACHI is a command line utility based on genetic programming (GP) that generates synthetic datasets with interactions between input features [24], [25]. It uses the (μ + λ) evolutionary algorithm [6] to construct trees of primitive mathematical operations that can represent interactions between independent variables. When applied to genetic data, for example, these feature interactions may represent epistasis or mechanisms underlying polygenic traits. HIBACHI can take an existing dataset (referred to in the context of GP as a model) as input, which is then used to evaluate the fitness of candidate output datasets. Our hypothesis is that HIBACHI can create synthetic datasets of SNPs involved in AOPs that behave the same as real data for the same AOPs. This allows us to explore the interpretable generative models used to create the synthetic data, which in turn gives insights into interactions between specific features in the real data used to train HIBACHI. Conceptually, this process can be likened to a brute-force version of symbolic regression [18] that avoids pitfalls arising from statistical analyses of genetic data with complex interactions between features [40].
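The (μ + λ) scheme over expression trees can be sketched as follows. This is an illustrative standard-library Python sketch and not HIBACHI's implementation: the primitive set, the mutation operator, the parity-based fitness function, and the toy genotype data are all assumptions made for the example. The defining feature it does show is that μ parents and λ offspring compete in a single pool, so the best individual is never lost between generations.

```python
import operator
import random

# Assumed primitive set for illustration; HIBACHI's own primitives may differ.
PRIMITIVES = [operator.add, operator.sub, operator.mul, max, min]
N_FEATURES = 3  # number of SNP columns in the toy data


def random_tree(rng, depth=2):
    """Grow a random expression tree; leaves are feature (SNP) indices."""
    if depth == 0 or rng.random() < 0.3:
        return rng.randrange(N_FEATURES)
    return (rng.choice(PRIMITIVES),
            random_tree(rng, depth - 1),
            random_tree(rng, depth - 1))


def evaluate(tree, row):
    """Recursively evaluate a tree on one row of genotypes."""
    if isinstance(tree, int):
        return row[tree]
    op, left, right = tree
    return op(evaluate(left, row), evaluate(right, row))


def mutate(tree, rng):
    """Replace one randomly chosen subtree with a newly grown one."""
    if isinstance(tree, int) or rng.random() < 0.3:
        return random_tree(rng)
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(left, rng), right)
    return (op, left, mutate(right, rng))


def fitness(tree, rows, labels):
    """Fraction of rows where the parity of the tree output matches the label."""
    hits = sum((evaluate(tree, r) % 2) == y for r, y in zip(rows, labels))
    return hits / len(rows)


def mu_plus_lambda(rows, labels, mu=10, lam=30, generations=20, seed=0):
    """Parents and offspring compete together; the best mu survive."""
    rng = random.Random(seed)
    population = [random_tree(rng) for _ in range(mu)]
    history = []
    for _ in range(generations):
        offspring = [mutate(rng.choice(population), rng) for _ in range(lam)]
        pool = population + offspring
        pool.sort(key=lambda t: fitness(t, rows, labels), reverse=True)
        population = pool[:mu]
        history.append(fitness(population[0], rows, labels))
    return population[0], history


# Toy reference data: genotypes coded 0/1/2, with a label that depends on an
# interaction between SNP 0 and SNP 1 (SNP 2 is irrelevant by construction).
data_rng = random.Random(1)
rows = [[data_rng.randrange(3) for _ in range(N_FEATURES)] for _ in range(200)]
labels = [(r[0] + r[1]) % 2 for r in rows]

best_tree, history = mu_plus_lambda(rows, labels)
```

Because the surviving tree is an explicit composition of primitives over named features, it can be read directly to see which SNP columns it uses, which is the sense in which such generative models are interpretable.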
Importantly, this approach utilizes genomic and phenotypic data from real-world populations, combined with information and knowledge sourced from publicly available, open access databases describing mechanisms of toxicity. Our methods are generalizable to other diseases of interest and provide a new framework for toxicologists to explore genetic mechanisms that underlie toxic adverse outcome susceptibility.
Data sources
Our analysis uses data from the US Environmental Protection Agency’s Adverse Outcome Pathway Database (AOP-DB) and the UK Biobank (UKBB). The AOP-DB provides a formal structure for AOPs and their contained key events, as well as the relationships and associations between key events, genes (and their variants), metabolic pathways, diseases, and other relationships of toxicological interest. Data in the AOP-DB are aggregated from third-party public databases, including automated data pulls from
AOPs and SNPs associated with liver cancer
Our initial query for LC AOPs finds 16 liver-related AOPs and 189 SNPs associated with these AOPs. AOPs 1, 37, 41, 46, 107, 108, and 117 are specific, each describing a particular etiology of LC or hepatocellular carcinoma, while the other AOPs describe LC in a more general context. Interestingly, the AOPs describing liver fibrosis, hepatotoxicity, and liver injury contain no SNP associations, although a number of these AOPs are still under development. The AOPs that feature SNP associations often
Discussion
The 4 SNPs implicated by HIBACHI are members of 2 AOPs: Cholestatic Liver Injury induced by Inhibition of the Bile Salt Export Pump (ABCB11), and Sustained AhR Activation leading to Rodent Liver Tumors. Since these two AOPs directly implicate key roles played by the Abcb11 and Ahr genes, these can be thought of as the central mediators of genetic risk to toxicity-induced LC. However, although these genes may be the most important in terms of disease etiology, the HIBACHI-identified SNPs may
Conclusions
In this study, we show that genetic programming and graph data science can be leveraged to uncover patterns of genetic regulation in adverse outcome pathways using real-world observational data. Our approach provides one of the first concrete examples of using HIBACHI – an open-source software tool originally designed to create synthetic datasets with interactions between features – on a task that increases our understanding of biological phenomena. We describe a novel association between
CRediT authorship contribution statement
Joseph D. Romano: Visualization. Liang Mei: Visualization. Jonathan Senn: Visualization. Jason H. Moore: Software. Holly M. Mortensen: Conceptualized the study methods.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work is supported by the Environmental Protection Agency's National Research Program in Chemical Safety and Sustainability, Adverse Outcome Pathway Discovery and Development (FY22 CSS AOPDD 4.3.2.2). This research has been conducted using data from UK Biobank, a major biomedical database. The work was additionally funded using grant support from the US National Institutes of Health: K99-LM013646 (PI: Romano), R01-AG066833, R01-LM010098, R01-LM013463 (PI: Moore), and P30-ES013508 (PI:
EPA Disclaimer
This manuscript has been reviewed by the Center for Public Health and Environmental Assessment, United States Environmental Protection Agency and approved for publication. Approval does not signify that the contents necessarily reflect the views and policies of the Agency nor does mention of trade names or commercial products constitute endorsement or recommendation for use. The authors declare no conflict of interest.
References (41)
Community detection in graphs. Physics Reports (2010)
UK Biobank: Bank on it. Lancet (London, England) (2007)
et al. Network hubs in the human brain. Trends in Cognitive Sciences (2013)
Graph-based methods for analysing networks in cell biology. Briefings in Bioinformatics (2006)
et al. Adverse outcome pathways: A conceptual framework to support ecotoxicology research and risk assessment. Environmental Toxicology and Chemistry (2010)
et al. Search-and-replace genome editing without double-strand breaks or donor DNA. Nature (2019)
et al. A global reference for human genetic variation. Nature (2015)
et al. Statistical primer: Propensity score matching and its alternatives. European Journal of Cardio-Thoracic Surgery: Official Journal of the European Association for Cardio-Thoracic Surgery (2018)
et al. Evolution strategies – A comprehensive introduction. Natural Computing (2002)
Modern Graph Theory (1998)