The classification of cancer based on DNA microarray data that uses diverse ensemble genetic programming
Introduction
The classification of cancer is a major research area in the medical field. Such classification is an important step in determining treatment and prognosis [1], [2]. Accurate diagnosis leads to better treatment and minimizes toxicity for patients. Current morphological and clinical approaches to classifying tumors are not sufficient to recognize all the various types of tumors correctly: patients may suffer from different types of tumors even though they show morphologically similar symptoms. A disease like a tumor is fundamentally a malfunction of genes, so utilizing gene expression data might be the most direct diagnostic approach [1].
DNA microarray technology is a promising tool for cancer diagnosis. It generates large-scale gene expression profiles that contain valuable information on biological organization as well as on cancer [3]. Although microarray technology requires further development, it already allows for a more systematic approach to cancer classification using gene expression profiles [2], [4].
It is difficult to interpret gene expression data directly. Thus, many machine-learning techniques have been applied to classify the data. These techniques include the artificial neural network [5], [6], [7], [8], Bayesian approaches [9], [10], support vector machines [11], [12], [13], decision trees [14], [15], and k nearest neighbors [16].
Evolutionary techniques have also been used to analyze gene expression data. The genetic algorithm is mainly used to select useful features, while genetic programming is used to discover classification rules. Li et al. proposed a hybrid model of the genetic algorithm and k nearest neighbors to obtain effective gene selection [16], and Deutsch investigated evolutionary algorithms in order to find optimal gene sets [17]. Karzynsci et al. proposed a hybrid model of the genetic algorithm and a perceptron for the prediction of cancer [18]. Langdon and Buxton applied genetic programming to classifying DNA chip data [19]. Ensemble approaches have also been attempted to obtain highly accurate cancer classification, by Valentini [20], Park and Cho [21], and Tan and Gilbert [22].
Highly accurate cancer classification is difficult to achieve. Since gene expression profiles consist of only a few samples, each representing a large number of genes, many machine-learning techniques are apt to over-fit. Ensemble approaches offer increased accuracy and reliability when dealing with such problems. Approaches that combine multiple classifiers have received much attention in the past decade, and combining classifiers is now a standard approach to improving classification performance in machine learning [23], [24]. An ensemble classifier aims to achieve more accurate and reliable performance than an individual classifier. Two representative issues, "how to generate diverse base classifiers" and "how to combine base classifiers", have been actively investigated in the ensemble approach.
The first issue, "how to generate diverse base classifiers", is very important in the ensemble approach. As is well known, an ensemble that uses a set of identical classifiers offers no performance benefit over its individual members. Improvement can be obtained only when the base classifiers are complementary. Ideally, as long as the base classifiers' errors are independent and each classifier's error rate is less than 0.5, the error rate of the ensemble can be driven toward zero by increasing the number of base classifiers. In practice, the results are different, since there is a trade-off between diversity and individual error [25]. Many researchers have therefore tried to generate sets of classifiers that are accurate as well as diverse. Generating base classifiers for ensemble approaches is often called ensemble learning. There are two representative ensemble-learning methods: bagging and boosting [26].
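The idealized claim above can be checked with a small calculation. Under the (strong) assumption of independent base classifiers combined by majority vote, the ensemble error is a binomial tail probability; the sketch below is an illustration of that textbook argument, not the paper's own experiment.

```python
from math import comb

def majority_vote_error(p, n):
    """Probability that a majority of n independent base classifiers,
    each with individual error rate p, are simultaneously wrong."""
    k_min = n // 2 + 1  # smallest number of wrong votes that flips the decision
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(k_min, n + 1))

# With p = 0.3 < 0.5, the majority-vote error shrinks as n grows
for n in (1, 5, 25, 101):  # odd n avoids ties
    print(n, round(majority_vote_error(0.3, n), 6))
```

With correlated classifiers (the realistic case for gene expression data, where all base classifiers see the same few samples) this decay no longer holds, which is exactly why diversity must be engineered rather than assumed.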
Bagging (bootstrap aggregating) was introduced by Breiman. This method generates base classifiers by training each one on a set of samples drawn randomly, with replacement, from the original data. Bagging tries to take advantage of the randomness of machine-learning techniques. Boosting, introduced by Schapire, produces a series of base classifiers: the training set for each classifier is chosen based on the results of the previous classifiers in the series, so that samples incorrectly classified by earlier classifiers are given further chances to be selected. Arcing and Ada-Boosting are currently used as promising boosting techniques [25], [26].
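As a minimal sketch of the bagging procedure just described, the toy example below bags a deliberately weak one-dimensional threshold classifier over bootstrap resamples; the data, the stump learner, and all names here are illustrative assumptions, not the classifiers used in this paper.

```python
import random

def bootstrap(data):
    """Draw a sample of the same size as `data`, with replacement."""
    return [random.choice(data) for _ in data]

def train_stump(sample):
    """Fit a threshold midway between the two class means of (x, label) pairs."""
    m0 = sum(x for x, y in sample if y == 0) / max(1, sum(1 for _, y in sample if y == 0))
    m1 = sum(x for x, y in sample if y == 1) / max(1, sum(1 for _, y in sample if y == 1))
    t = (m0 + m1) / 2
    return lambda x: int(x > t) if m1 > m0 else int(x <= t)

def bagged_predict(classifiers, x):
    """Combine the base classifiers by majority vote."""
    votes = sum(clf(x) for clf in classifiers)
    return int(votes * 2 > len(classifiers))

random.seed(0)
data = [(random.gauss(0, 1), 0) for _ in range(50)] + \
       [(random.gauss(3, 1), 1) for _ in range(50)]
ensemble = [train_stump(bootstrap(data)) for _ in range(25)]
accuracy = sum(bagged_predict(ensemble, x) == y for x, y in data) / len(data)
```

Each resample yields a slightly different threshold, so the base classifiers disagree near the class boundary; the majority vote smooths out those individual fluctuations.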
Various other approaches have been proposed to generate diverse base classifiers. Webb and Zheng proposed a multistrategy ensemble-learning method [25], while Opitz and Maclin provided an empirical study of popular ensemble methods [26]. Bryll et al. introduced attribute bagging, which generates diverse base classifiers using random feature subsets [24]. Islam et al. trained a set of neural networks to be negatively correlated with each other [27]. Other works have tried to estimate diversity and to select a subset of base classifiers for constructing an ensemble classifier [28], [29].
The second issue, "how to combine base classifiers", is equally important. Once base classifiers have been obtained, the choice of a proper fusion strategy can maximize the ensemble effect. There are many simple combination strategies, including majority vote, average, weighted average, minimum, median, maximum, product, and Borda count. These strategies consider only each classifier's output for the current sample. Other combination strategies (such as naïve Bayes, behavior-knowledge space, decision templates, Dempster–Shafer combination, and the fuzzy integral) instead require a training process to construct decision matrices. Finally, the oracle strategy, which counts a sample as correct if even one classifier classifies it correctly, is often employed to provide an upper bound on the achievable improvement in classification accuracy.
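Three of the training-free fusion rules listed above can be sketched in a few lines. Each row below is one base classifier's class-probability output for a single sample; the numbers are made up to show that different rules can disagree on the same outputs.

```python
# One hypothetical sample, three base classifiers, two classes.
outputs = [
    [0.6, 0.4],
    [0.7, 0.3],
    [0.1, 0.9],
]

def argmax(v):
    return max(range(len(v)), key=v.__getitem__)

def fuse_average(outs):
    """Average rule: pick the class with the highest mean support."""
    return argmax([sum(col) / len(outs) for col in zip(*outs)])

def fuse_product(outs):
    """Product rule: multiply per-class supports across classifiers."""
    prods = [1.0] * len(outs[0])
    for row in outs:
        prods = [p * r for p, r in zip(prods, row)]
    return argmax(prods)

def fuse_majority(outs):
    """Majority vote over each classifier's own top choice."""
    votes = [argmax(row) for row in outs]
    return max(set(votes), key=votes.count)
```

Here average and product both favor class 1 (the third classifier is very confident), while majority vote picks class 0 (two of three classifiers prefer it), which is why the choice of fusion strategy matters.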
There has been much research on combination strategies. Verikas et al. comparatively tested various fusion methods on several datasets [30], and Kuncheva derived formulas for the classification errors of simple combination strategies [23]. Tax compared averaging and multiplying as ways of combining multiple classifiers [31], while Alexandre et al. compared the sum and product rules [32]. The decision templates strategy, proposed by Kuncheva et al., has been compared with conventional methods [33]. Kim and Cho applied fuzzy integration of structure-adaptive self-organizing maps (SOMs) to web content mining [34]. Shipp and Kuncheva examined relationships between combination methods and measures of diversity [29].
In this paper, we address diversity in ensemble approaches and propose an effective ensemble approach that exploits additional diversity in genetic programming. A set of classification rules is generated by genetic programming, and diverse rules are then selected from among them to construct an ensemble classifier. In contrast to conventional approaches, diversity is measured by matching the structure of the rules, exploiting the interpretability of genetic programming. The paper also examines several representative feature selection methods and combination methods. Three popular gene expression datasets (the lymphoma cancer dataset, the lung cancer dataset, and the ovarian cancer dataset) were used for the experiments.
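To make "matching the structure of the rules" concrete, one simple possibility is to compare two tree-shaped rules node by node. The sketch below assumes rules encoded as nested tuples `(op, left, right)` with gene identifiers at the leaves; this is a hypothetical illustration of a structural similarity measure, and the paper's actual matching scheme may differ.

```python
def structural_similarity(a, b):
    """Score in [0, 1]: fraction of tree positions at which a and b agree."""
    if not isinstance(a, tuple) or not isinstance(b, tuple):
        return 1.0 if a == b else 0.0   # terminals: exact match or not
    if a[0] != b[0] or len(a) != len(b):
        return 0.0                      # different operator: subtrees diverge
    child_scores = [structural_similarity(x, y) for x, y in zip(a[1:], b[1:])]
    return (1 + sum(child_scores)) / len(a)

# Two hypothetical rules that differ in a single leaf gene.
rule1 = ('+', ('*', 'g1', 'g2'), 'g3')
rule2 = ('+', ('*', 'g1', 'g4'), 'g3')
diversity = 1 - structural_similarity(rule1, rule2)
```

Selecting rules with high pairwise diversity (low structural similarity) then yields base classifiers that are likely to make complementary errors, which is the property the ensemble needs.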
Section snippets
Ensemble genetic programming
Genetic programming was proposed by Koza in order to automatically generate a program that solves a given problem [35]. It is similar to the genetic algorithm in many ways but differs in representation: an individual is represented as a tree composed of functions and terminal symbols. Various function and terminal-symbol sets have been developed for particular target applications, and classification is one of the tasks to which genetic programming has been applied.
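A minimal sketch of how such a tree individual can act as a classification rule over gene expression values follows; the particular function set, the threshold-at-zero decision, and the gene names are assumptions for illustration, not the configuration used in the paper.

```python
import operator

# Assumed function set: arithmetic operators at internal nodes.
FUNCTIONS = {'+': operator.add, '-': operator.sub, '*': operator.mul}

def evaluate(tree, sample):
    """Recursively evaluate a tree; string leaves name features in `sample`."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return FUNCTIONS[op](evaluate(left, sample), evaluate(right, sample))
    if isinstance(tree, str):
        return sample[tree]   # terminal symbol: a gene's expression level
    return tree               # terminal symbol: a numeric constant

def classify(tree, sample, threshold=0.0):
    """Predict the positive class when the tree's value exceeds the threshold."""
    return int(evaluate(tree, sample) > threshold)

# Hypothetical evolved rule: predicts class 1 iff 2*g1 - g2 > 0.
rule = ('-', ('*', 'g1', 2.0), 'g2')
```

Because the rule is an explicit expression over named genes, it can be read and compared structurally, which is the interpretability the proposed diversity measure relies on.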
There are several works on ensemble
Diversity-based ensembling for accurate cancer classification
The proposed method consists of two parts: generating individual classification rules and combining them to construct an ensemble classifier as shown in Fig. 1. The process of generating individual classification rules is similar to approaches that have been used in previous work [41]. Feature selection is performed first to reduce the dimensionality of data, and a classification rule is generated by ensemble genetic programming. A number of individual classification rules are prepared by
Experimental environment
There are several DNA microarray datasets from published cancer gene expression studies. These include breast cancer datasets, central nervous system cancer datasets, colon cancer datasets, leukemia cancer datasets, lung cancer datasets, lymphoma cancer datasets, NCI60 datasets, ovarian cancer datasets, and prostate cancer datasets. Among them, three representative datasets were used in this paper. The first and second datasets involve samples from two variants of the same disease and the third
Conclusion
The classification of cancer, based on gene expression profiles, is a challenging task in bioinformatics. Many machine-learning techniques have been developed to obtain highly accurate classification performance. In this paper, we have proposed an effective ensemble approach that uses diversity in ensemble genetic programming to classify gene expression data. The ensemble helps improve classification performance, but diversity is also an important factor in constructing an ensemble classifier.
Acknowledgement
This research was supported by the Brain Science and Engineering Research Program sponsored by the Korean Ministry of Commerce, Industry and Energy.
References (52)
- et al. Cancer diagnosis and microarrays. Int J Biochem Cell Biol (2003)
- et al. Cancer classification using gene expression data. Inform Syst (2003)
- et al. Characteristic attributes in cancer microarrays. J Biomed Inform (2002)
- et al. A primer on gene expression and microarrays for machine learning researchers. J Biomed Inform (2004)
- et al. An epicurean learning approach to gene-expression data classification. Artif Intell Med (2003)
- et al. Cancer classification and prediction using logistic regression with Bayesian gene selection. J Biomed Inform (2004)
- et al. Comprehensive vertical sample-based KNN/LSVM classification for gene expression analysis. J Biomed Inform (2004)
- Gene expression data analysis of human lymphoma using support vector machines and output coding ensembles. Artif Intell Med (2002)
- et al. Attribute bagging: improving accuracy of classifier ensembles by using random feature subsets. Pattern Recog (2003)
- et al. Ensemble neural networks: many could be better than all. Artif Intell (2002)
- Relationships between combination methods and measures of diversity in combining classifiers. Inform Fusion
- Soft combination of neural classifiers: a comparative study. Pattern Recog Lett
- Combining multiple classifiers by averaging or by multiplying? Pattern Recog
- On combining classifiers using sum and product rules. Pattern Recog Lett
- Decision templates for multiple classifier fusion: an experimental comparison. Pattern Recog
- Fuzzy integration of structure adaptive SOMs for web content mining. Fuzzy Sets Syst
- Genetic programming in classifying large-scale data: an ensemble method. Inform Sci
- Feature selection in gene expression-based tumor classification. Mol Genet Metab
- Diversity measures for multiple classifier system analysis and design. Inform Fusion
- Use of proteomic patterns in serum to identify ovarian cancer. Lancet
- A computational neural approach to support the discovery of gene function and classes of cancer. IEEE Trans Biomed Eng
- Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med
- Application of probabilistic neural networks to the class prediction of leukemia and embryonal tumor of central nervous system. Neural Process Lett
- Bayesian class discovery in microarray datasets. IEEE Trans Biomed Eng
- Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics
- Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci