The classification of cancer based on DNA microarray data that uses diverse ensemble genetic programming

https://doi.org/10.1016/j.artmed.2005.06.002

Summary

Objective

The classification of cancer based on gene expression data is one of the most important procedures in bioinformatics. To obtain highly accurate results, ensemble approaches have been applied to the classification of DNA microarray data. Diversity is very important in these ensemble approaches, but it is difficult to apply conventional diversity measures when only a few training samples are available. A key issue under such circumstances is the development of a new ensemble approach that can improve classification on these datasets.

Materials and methods

An effective ensemble approach that uses diversity in genetic programming is proposed. This diversity is measured by comparing the structure of the classification rules rather than by output-based diversity estimation.

Results

Experiments performed on common gene expression datasets (such as the lymphoma, lung, and ovarian cancer datasets) demonstrate the performance of the proposed method relative to conventional approaches.

Conclusion

Diversity measured by comparing the structure of the classification rules obtained by genetic programming is useful for improving the performance of the ensemble classifier.

Introduction

The classification of cancer is a major research area in the medical field. Such classification is an important step in determining treatment and prognosis [1], [2]. Accurate diagnosis leads to better treatment and minimizes toxicity for patients. Current morphological and clinical approaches to tumor classification are not sufficient to recognize all the various types of tumors correctly. Patients may suffer from different types of tumors even though they show morphologically similar symptoms. A disease like a tumor is fundamentally a malfunction of genes, so using gene expression data might be the most direct diagnostic approach [1].

DNA microarray technology is a promising tool for cancer diagnosis. It generates large-scale gene expression profiles that include valuable information on organization as well as cancer [3]. Although microarray technology requires further development, it already allows for a more systematic approach to cancer classification using gene expression profiles [2], [4].

It is difficult to interpret gene expression data directly. Thus, many machine-learning techniques have been applied to classify the data. These techniques include artificial neural networks [5], [6], [7], [8], Bayesian approaches [9], [10], support vector machines [11], [12], [13], decision trees [14], [15], and k-nearest neighbors [16].

Evolutionary techniques have also been used to analyze gene expression data. The genetic algorithm is mainly used to select useful features, while genetic programming is used to discover classification rules. Li et al. proposed a hybrid model of the genetic algorithm and k-nearest neighbors to obtain effective gene selection [16], and Deutsch investigated evolutionary algorithms in order to find optimal gene sets [17]. Karzynski et al. proposed a hybrid model of the genetic algorithm and a perceptron for the prediction of cancer [18]. Langdon and Buxton applied genetic programming to the classification of DNA chip data [19]. Ensemble approaches have also been used to obtain highly accurate cancer classification by Valentini [20], Park and Cho [21], and Tan and Gilbert [22].

Highly accurate cancer classification is difficult to achieve. Since gene expression profiles consist of only a few samples that represent a large number of genes, many machine-learning techniques are prone to over-fitting. Ensemble approaches offer increased accuracy and reliability when dealing with such problems. Approaches that combine multiple classifiers have received much attention in the past decade, and this is now a standard way of improving classification performance in machine learning [23], [24]. The ensemble classifier aims to deliver more accurate and reliable performance than an individual classifier. Two representative issues, “how to generate diverse base classifiers” and “how to combine base classifiers”, have been actively investigated in the ensemble approach.

The first issue, “how to generate diverse base classifiers”, is very important in the ensemble approach. It is well known that an ensemble built from a set of identical classifiers offers no performance benefit over its individual members. Improvement can be obtained only when the base classifiers are complementary. Ideally, as long as the error of each classifier is less than 0.5 and the classifiers err independently of one another, the error rate of the ensemble can be driven toward zero by increasing the number of base classifiers. In practical experiments, however, the results are different, since there is a trade-off between diversity and individual error [25]. Many researchers have tried to generate a set of classifiers that are both accurate and diverse. Generating base classifiers for ensemble approaches is often called ensemble learning. There are two representative ensemble-learning methods: bagging and boosting [26].
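The idealized claim about an individual error below 0.5 can be checked numerically. The following is a minimal sketch, assuming a two-class problem, simple majority voting, an odd number of classifiers, and statistically independent errors with a common rate; the function name `majority_vote_error` is illustrative and does not come from the paper.

```python
from math import comb


def majority_vote_error(n_classifiers: int, base_error: float) -> float:
    """Probability that a majority of n independent classifiers is wrong.

    Assumptions (not taken from the paper): two classes, an odd number of
    classifiers so that no tie can occur, and independent errors that all
    share the same rate `base_error`.
    """
    if n_classifiers % 2 == 0:
        raise ValueError("use an odd number of classifiers to avoid ties")
    # Smallest number of wrong votes that makes the ensemble wrong.
    k_min = n_classifiers // 2 + 1
    return sum(
        comb(n_classifiers, k)
        * base_error**k
        * (1 - base_error) ** (n_classifiers - k)
        for k in range(k_min, n_classifiers + 1)
    )


if __name__ == "__main__":
    # With a base error of 0.3 the ensemble error shrinks as n grows.
    for n in (1, 5, 25, 101):
        print(n, majority_vote_error(n, 0.3))
```

The independence assumption is exactly what real base classifiers violate, which is one way to read the trade-off between diversity and individual error mentioned in the text.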

Bagging (bootstrap aggregating) was introduced by Breiman. This method generates base classifiers, each trained on a set of samples drawn at random with replacement from the original data. Bagging tries to take advantage of the randomness of machine-learning techniques. Boosting, introduced by Schapire, produces a series of base classifiers. A set of samples is chosen based on the results of previous classifiers in the series. Samples that were incorrectly classified by previous classifiers are given a higher chance of being selected for the next training set. Arcing and Ada-Boosting are currently used as promising boosting techniques [25], [26].
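The difference between the two sampling schemes can be sketched in a few lines. This is an illustration only: the helper names are hypothetical, and the `factor` used to raise the weight of misclassified samples is an arbitrary constant, not the update rule of Arcing or Ada-Boosting.

```python
import random


def bootstrap_sample(samples, rng):
    """Bagging-style training set: draw len(samples) items with replacement."""
    return [rng.choice(samples) for _ in samples]


def reweight(weights, misclassified, factor=2.0):
    """Boosting-style update: raise the weight of misclassified samples.

    `factor` is an illustrative constant (an assumption of this sketch),
    not the weight update used by any specific boosting algorithm.
    """
    raised = [w * factor if wrong else w for w, wrong in zip(weights, misclassified)]
    total = sum(raised)
    return [w / total for w in raised]  # renormalize so the weights sum to 1


def weighted_sample(samples, weights, rng):
    """Draw a training set in which heavier samples are more likely to appear."""
    return rng.choices(samples, weights=weights, k=len(samples))


if __name__ == "__main__":
    rng = random.Random(0)
    data = list(range(10))
    print(bootstrap_sample(data, rng))
    weights = reweight([0.1] * 10, [i < 2 for i in range(10)])
    print(weighted_sample(data, weights, rng))
```

In bagging every classifier sees an independent resample, whereas in boosting each training set depends on the mistakes of the classifiers built before it.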

Various other approaches have been proposed to generate diverse base classifiers. Webb and Zheng proposed a multistrategy ensemble-learning method [25], while Opitz and Maclin provided an empirical study of popular ensemble methods [26]. Bryll et al. introduced attribute bagging, which generates diverse base classifiers using random feature subsets [24]. Islam et al. trained a set of neural networks to be negatively correlated with each other [27]. Other works have tried to estimate diversity and to select a subset of base classifiers for constructing an ensemble classifier [28], [29].

The second issue, “how to combine base classifiers”, is as important as the first. Once base classifiers are obtained, the choice of a proper fusion strategy can maximize the ensemble effect. There are many simple combination strategies, including majority vote, average, weighted average, minimum, median, maximum, product, and Borda count. These strategies consider only the current outputs of each classifier for a sample. In contrast, other combination strategies (such as Naïve Bayes, behavior-knowledge space, decision templates, Dempster–Shafer combination, and fuzzy integral) require a training process to construct decision matrices. Finally, the oracle strategy, which counts a sample as correctly classified if at least one classifier classifies it correctly, is often employed to provide a possible upper bound on the improvement in classification accuracy.
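Several of the simple combination strategies listed above can be written compactly. The sketch below is illustrative rather than the paper's implementation: it assumes each base classifier outputs either a class label, a list of per-class support values, or a ranking of classes, and all function names are hypothetical. Weighted average is omitted because it needs classifier weights that the text does not specify.

```python
from collections import Counter
from math import prod
from statistics import mean, median


def majority_vote(labels):
    """Return the label predicted by the most classifiers (ties: first seen)."""
    return Counter(labels).most_common(1)[0][0]


def combine_scores(scores, rule):
    """Fuse per-classifier class scores with a simple rule; return the best class.

    scores[i][c] is the support classifier i gives to class c for one sample.
    """
    rules = {"average": mean, "median": median, "minimum": min,
             "maximum": max, "product": prod}
    fuse = rules[rule]
    n_classes = len(scores[0])
    fused = [fuse([s[c] for s in scores]) for c in range(n_classes)]
    return max(range(n_classes), key=fused.__getitem__)


def borda_count(rankings):
    """Each ranking lists the classes from most to least preferred."""
    n_classes = len(rankings[0])
    points = Counter()
    for ranking in rankings:
        for position, cls in enumerate(ranking):
            points[cls] += n_classes - 1 - position
    return max(points, key=points.get)


def oracle_correct(labels, true_label):
    """Oracle: the sample counts as correct if any classifier is right."""
    return true_label in labels


if __name__ == "__main__":
    scores = [[0.55, 0.45], [0.55, 0.45], [0.05, 0.95]]
    for rule in ("average", "median", "minimum", "maximum", "product"):
        print(rule, combine_scores(scores, rule))
```

With the example scores, two weakly confident classifiers favor class 0 while one strongly confident classifier favors class 1, so the median and the majority vote pick class 0 while the average, product, minimum, and maximum pick class 1. This shows that the choice of fusion rule can change the ensemble decision even when the base classifiers are fixed.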

There has been much research on combination strategies. Verikas et al. comparatively tested various fusion methods on several datasets [30], and Kuncheva provided a formula for classification errors in simple combination strategies [23]. Tax et al. compared averaging and multiplying as ways of combining multiple classifiers [31], while Alexandre et al. compared sum and product rules [32]. The decision templates strategy, which was proposed by Kuncheva et al., has been compared with conventional methods [33]. Kim and Cho applied fuzzy integration of structure adaptive self-organizing maps (SOMs) to web content mining [34]. Shipp and Kuncheva tried to show relationships between combination methods and measures of diversity [29].

In this paper, we address diversity in ensemble approaches and propose an effective ensemble approach that takes further account of diversity in genetic programming. A set of classification rules was generated by genetic programming, and diverse ones were then selected from among them in order to construct an ensemble classifier. In contrast to conventional approaches, diversity was measured by matching the structure of the rules, which is possible because rules produced by genetic programming are interpretable. The paper also examines several representative feature selection methods and combination methods. Three popular gene expression datasets (the lymphoma, lung, and ovarian cancer datasets) were used for the experiments.
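To make the idea of structure-based diversity concrete, the sketch below compares rule trees node by node and then picks a diverse subset. It is a hypothetical illustration, not the measure defined in the paper: the tuple encoding of rules, the mismatch-counting distance, and the greedy max-min selection are all assumptions made for this example.

```python
from itertools import zip_longest


def size(tree):
    """Number of nodes in a rule tree encoded as (function, child, child, ...)."""
    if not isinstance(tree, tuple):
        return 1  # a terminal symbol, e.g. a gene name
    return 1 + sum(size(child) for child in tree[1:])


def structural_distance(a, b):
    """Count mismatching nodes when two rule trees are overlaid from the root.

    Assumed measure for illustration only; None marks a missing child.
    """
    if a is None:
        return size(b)
    if b is None:
        return size(a)
    a_leaf, b_leaf = not isinstance(a, tuple), not isinstance(b, tuple)
    if a_leaf and b_leaf:
        return 0 if a == b else 1
    if a_leaf or b_leaf:
        # A terminal overlaid on a subtree: the whole subtree is unmatched.
        return size(b if a_leaf else a)
    root = 0 if a[0] == b[0] else 1
    return root + sum(
        structural_distance(x, y) for x, y in zip_longest(a[1:], b[1:])
    )


def select_diverse(rules, k):
    """Greedy max-min selection: add the rule farthest from those chosen so far."""
    chosen = [0]  # arbitrary starting point: the first rule
    while len(chosen) < min(k, len(rules)):
        candidates = [i for i in range(len(rules)) if i not in chosen]
        best = max(
            candidates,
            key=lambda i: min(structural_distance(rules[i], rules[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    r1 = ("+", "g1", "g2")
    r2 = ("+", "g1", "g3")
    r3 = ("-", ("*", "g4", "g5"), "g2")
    print(select_diverse([r1, r2, r3], 2))
```

A measure of this kind needs no class labels or classifier outputs, which is why a structural comparison remains usable when only a few training samples are available.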

Section snippets

Ensemble genetic programming

Genetic programming was proposed by Koza in order to automatically generate a program that could solve a given problem [35]. It is similar to the genetic algorithm in many ways but differs in representation: an individual is represented as a tree composed of functions and terminal symbols. Various functions and terminal symbols have been developed for target applications, and classification is one of the goals of genetic programming.
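The tree representation can be illustrated with a small evaluator. This is a sketch under stated assumptions rather than the paper's rule format: the function set (`+`, `-`, `*`), the use of gene names and numeric constants as terminal symbols, and the decision threshold at zero are all chosen here for illustration.

```python
import operator

# Assumed function set for this sketch; the paper's set may differ.
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def evaluate(tree, sample):
    """Recursively evaluate a rule tree on one sample.

    tree   : (function, left, right), a gene name, or a numeric constant
    sample : mapping from gene name to expression level
    """
    if isinstance(tree, tuple):
        name, left, right = tree
        return FUNCTIONS[name](evaluate(left, sample), evaluate(right, sample))
    if isinstance(tree, str):
        return sample[tree]  # terminal symbol: expression level of a gene
    return tree  # terminal symbol: numeric constant


def classify(tree, sample):
    """Assign class 1 if the rule output is positive, otherwise class 0."""
    return 1 if evaluate(tree, sample) > 0 else 0


if __name__ == "__main__":
    rule = ("-", ("*", "g1", 2.0), "g2")  # reads as 2 * g1 - g2
    print(classify(rule, {"g1": 0.8, "g2": 1.0}))
    print(classify(rule, {"g1": 0.2, "g2": 1.0}))
```

Because the rule is an explicit expression over named genes, it can be read and compared directly, unlike the weights of a neural network.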

There are several works on ensemble

Diversity-based ensembling for accurate cancer classification

The proposed method consists of two parts: generating individual classification rules and combining them to construct an ensemble classifier as shown in Fig. 1. The process of generating individual classification rules is similar to approaches that have been used in previous work [41]. Feature selection is performed first to reduce the dimensionality of data, and a classification rule is generated by ensemble genetic programming. A number of individual classification rules are prepared by

Experimental environment

There are several DNA microarray datasets from published cancer gene expression studies. These include breast cancer datasets, central nervous system cancer datasets, colon cancer datasets, leukemia cancer datasets, lung cancer datasets, lymphoma cancer datasets, NCI60 datasets, ovarian cancer datasets, and prostate cancer datasets. Among them, three representative datasets were used in this paper. The first and second datasets involve samples from two variants of the same disease and the third

Conclusion

The classification of cancer, based on gene expression profiles, is a challenging task in bioinformatics. Many machine-learning techniques have been developed to obtain highly accurate classification performance. In this paper, we have proposed an effective ensemble approach that uses diversity in ensemble genetic programming to classify gene expression data. The ensemble helps improve classification performance, but diversity is also an important factor in constructing an ensemble classifier.

Acknowledgement

This research was supported by the Brain Science and Engineering Research Program sponsored by the Korean Ministry of Commerce, Industry and Energy.

References (52)

  • C. Shipp et al. Relationships between combination methods and measures of diversity in combining classifiers. Inform Fusion (2002)
  • A. Verikas et al. Soft combination of neural classifiers: a comparative study. Pattern Recog Lett (1999)
  • D. Tax et al. Combining multiple classifiers by averaging or by multiplying? Pattern Recog (2000)
  • L. Alexandre et al. On combining classifiers using sum and product rules. Pattern Recog Lett (2001)
  • L. Kuncheva et al. Decision templates for multiple classifier fusion: an experimental comparison. Pattern Recog (2001)
  • K. Kim et al. Fuzzy integration of structure adaptive SOMs for web content mining. Fuzzy Sets Syst (2004)
  • Y. Zhang et al. Genetic programming in classifying large-scale data: an ensemble method. Inform Sci (2004)
  • M. Xiong et al. Feature selection in gene expression-based tumor classification. Mol Genet Metab (2001)
  • T. Windeatt. Diversity measures for multiple classifier system analysis and design. Inform Fusion (2005)
  • E. Petricoin et al. Use of proteomic patterns in serum to identify ovarian cancer. Lancet (2002)
  • F. Azuaje. A computational neural approach to support the discovery of gene function and classes of cancer. IEEE Trans Biomed Eng (2001)
  • J. Khan et al. Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med (2001)
  • C. Huang et al. Application of probabilistic neural networks to the class prediction of leukemia and embryonal tumor of central nervous system. Neural Process Lett (2004)
  • V. Roth et al. Bayesian class discovery in microarray datasets. IEEE Trans Biomed Eng (2004)
  • C. Ding et al. Multi-class protein fold recognition using support vector machines and neural networks. Bioinformatics (2001)
  • S. Ramaswamy et al. Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci (2001)