
Information Sciences

Volume 222, 10 February 2013, Pages 805-823

Repairing fractures between data using genetic programming-based feature extraction: A case study in cancer diagnosis

https://doi.org/10.1016/j.ins.2010.09.018

Abstract

Most model-building processes rest on an underlying assumption: a learned classifier should be usable to explain unseen data from the same problem. Despite this seemingly reasonable assumption, it tends to fail when dealing with biological data, where data generated using the same protocols in two different laboratories can lead to two different, non-interchangeable classifiers. There are usually too many uncontrollable variables in the laboratory data-generation process, together with biological variation, and small differences can lead to very different data distributions: a fracture between the data.

This paper presents a genetics-based machine learning approach that performs feature extraction on the data from one laboratory in order to increase the classification performance of an existing classifier, built from data obtained under the same protocols in a different laboratory, while also learning about the shape of the fracture between the data that caused the poor performance.

The experimental analysis over benchmark problems, together with a real-world problem in prostate cancer diagnosis, shows the good behavior of the proposed algorithm.

Introduction

The assumption that a properly trained classifier will be able to predict the behavior of unseen data from the same problem is at the core of any automatic classification process. However, this hypothesis tends to prove unreliable when dealing with biological data (or data from other experimental sciences), especially when such data is provided by more than one laboratory, even if they follow the same protocols to obtain it.

The specific problem this paper attempts to solve is the following: we have data from one laboratory (dataset A), and we derive from it a classifier that predicts its classes accurately. We are then presented with data from a second laboratory (dataset B). This second dataset is not accurately predicted by the classifier we had previously built, due to a fracture between the data of the two laboratories. We intend to find a transformation of dataset B into a new dataset S on which the classifier works.
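The setup above can be sketched in a few lines. Everything here (the one-dimensional data, the threshold classifier, the additive offset playing the role of the fracture) is a toy illustration, not the paper's actual data or models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Dataset A: lab A's measurements; the true class is 1 when the
# (single) feature exceeds 0.5.
X_a = rng.uniform(0, 1, size=(200, 1))
y = (X_a[:, 0] > 0.5).astype(int)

# A classifier "learned" from dataset A (here simply the threshold rule).
def classifier(X):
    return (X[:, 0] > 0.5).astype(int)

# Dataset B: the same phenomenon measured in lab B, whose instrument
# adds a systematic offset -- a simple fracture between the data.
X_b = X_a + 0.3

# Dataset S: a transformation of B under which the classifier works again.
X_s = X_b - 0.3

acc_a = (classifier(X_a) == y).mean()   # high: the classifier fits A
acc_b = (classifier(X_b) == y).mean()   # degraded by the fracture
acc_s = (classifier(X_s) == y).mean()   # restored by the transformation
```

The difficulty, of course, is that the transformation B → S is not known in advance; discovering it automatically is the task addressed in this paper.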

Evolutionary computing, as introduced by Holland [27], is based on the idea of the survival of the fittest, evoked by the natural evolutionary process. In genetic algorithms (GAs) [21], the fitter a solution (chromosome) is, the more likely it is to reproduce, and sporadic random mutations help maintain population diversity. Genetic programming (GP) [33] is a development of those techniques, and follows a similar pattern to evolve tree-shaped solutions using variable-length chromosomes.
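A GP individual is typically an expression tree whose internal nodes are primitive functions and whose leaves are inputs. A minimal sketch of that representation, with an illustrative tree and primitive set (not the function set used in this paper):

```python
import operator

# A GP individual as a nested tuple: (function, child, child, ...),
# with feature names as leaves.  The tree below encodes x0*x1 + (-x1).
tree = ("add", ("mul", "x0", "x1"), ("neg", "x1"))

# Hypothetical primitive set for the sketch.
PRIMITIVES = {
    "add": operator.add,
    "mul": operator.mul,
    "neg": operator.neg,
}

def evaluate(node, env):
    """Recursively evaluate an expression tree on a sample `env`."""
    if isinstance(node, str):                 # leaf: look up the feature value
        return env[node]
    fn, *children = node
    args = [evaluate(c, env) for c in children]
    return PRIMITIVES[fn](*args)

result = evaluate(tree, {"x0": 2.0, "x1": 3.0})   # 2*3 + (-3) = 3.0
```

Because trees vary in shape and depth, the corresponding chromosomes are naturally variable-length, which is what distinguishes GP from fixed-length GA encodings.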

Feature extraction, as defined by Wyse et al. [56], ‘consists of the extraction of a set of new features from the original features through some functional mapping’. Our approach to the problem can be seen as feature extraction, since we build a new set of features which are functions of the old ones. However, our goal differs from that of classical feature extraction: our intention is to fit a dataset to an already existing classifier, not to improve the performance of a future one.
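In code, such a functional mapping simply builds each new feature as a function of the original ones. The particular mappings below are arbitrary examples chosen for illustration:

```python
import numpy as np

# Two samples with two original features each.
X = np.array([[1.0, 2.0],
              [3.0, 4.0]])

def extract(X):
    """Map the original feature set to a new one (illustrative mappings)."""
    f1 = X[:, 0] + X[:, 1]        # sum of the original features
    f2 = X[:, 0] * X[:, 1]        # product of the original features
    f3 = np.log1p(X[:, 0])        # nonlinear transform of one feature
    return np.column_stack([f1, f2, f3])

S = extract(X)                    # shape (2, 3): three new features
```

In classical feature extraction the mapping is chosen to help a classifier that will be trained afterwards; here, the mapping is chosen so that a classifier trained beforehand performs well on the transformed data.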

In this work, we intend to demonstrate the use of GP-based feature extraction to unveil transformations that improve the accuracy of a previously built classifier, by performing feature extraction on a dataset where said classifier should, in principle, work, but where it does not perform accurately enough. We test our algorithm first on artificially built problems (where we apply ad hoc transformations to datasets from which a classifier has been built, and use the resulting transformed dataset as our problem dataset), and then on a real-world application where biological data from two different medical laboratories regarding prostate cancer diagnosis are used as datasets A and B.
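The driving idea can be sketched as an evolutionary loop: evolve candidate transformations of dataset B, scoring each candidate by the fixed classifier's accuracy on the transformed data. For brevity this sketch evolves only affine coefficients, whereas GP-RFD evolves full expression trees; the data, classifier, and parameter values are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed classifier learned from lab A's data (a threshold rule stands in).
def classifier(X):
    return (X[:, 0] > 0.5).astype(int)

# Lab B's data: same labels, but shifted and scaled (the fracture).
X_a = rng.uniform(0, 1, size=(300, 1))
y = (X_a[:, 0] > 0.5).astype(int)
X_b = 2.0 * X_a + 0.4

def fitness(params, X, y):
    """Accuracy of the fixed classifier on the transformed data."""
    a, b = params
    return (classifier(a * X + b) == y).mean()

# A tiny elitist (mu + lambda)-style loop over affine candidates (a, b).
pop = [rng.normal(size=2) for _ in range(20)]
for _ in range(60):
    pop.sort(key=lambda p: fitness(p, X_b, y), reverse=True)
    parents = pop[:5]                              # keep the best five
    children = [p + rng.normal(scale=0.1, size=2)  # mutated offspring
                for p in parents for _ in range(3)]
    pop = parents + children

best = max(pop, key=lambda p: fitness(p, X_b, y))
best_acc = fitness(best, X_b, y)
```

The fitness function never retrains the classifier: only the data moves, which is exactly the inversion of the usual adaptation direction described above.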

Even though the method proposed in this paper does not attempt to reduce the number of features or instances in the dataset, it can still be regarded as a form of data reduction because it unifies the data distributions of two datasets, making it possible to apply the same classifier to both of them instead of needing a separate classifier for each.

The remainder of this paper is organized as follows: in Section 2, some preliminaries about the techniques used and some approaches to similar problems in the literature are presented. Section 3 details the real-world biological problem that motivates this paper. Section 4 describes the proposed algorithm, GP-RFD; and Section 5 includes the experimental setup, the results obtained, and their analysis. Finally, in Section 6 some concluding remarks are made.

Section snippets

Preliminaries

This section is divided as follows: in Subsection 2.1 we introduce the notation used in this paper. We then include an introduction to GP in Subsection 2.2, a brief summary of previous work on feature extraction in Subsection 2.3, and a short review of the approaches found in the specialized literature on the use of GP for feature extraction in Subsection 2.4. We conclude by mentioning some works related to finding and repairing fractures between data…

Case study: prostate cancer diagnosis

This section begins with an introduction to the importance of the problem in Subsection 3.1. The diagnostic procedure is summarized in Subsection 3.2, and the reasons to apply GP-RFD to this problem are given in Subsection 3.3. Finally, the preprocessing applied to the data is presented in Subsection 3.4.

A proposal for GP-based feature extraction for the repairing of fractures between data (GP-RFD)

This section is presented in the following way: first, a justification for the choice of GP is included. Subsection 4.1 details how the solutions are represented; the fitness evaluation procedure and the genetic operators are then introduced in Subsections 4.2 and 4.3, respectively. The parameter choices are explained in Subsection 4.4, while the function set is given in Subsection 4.5. Finally, the execution flow of the whole procedure is shown in Subsection 4.6…

Experimental study

This section is organized in the following way: to begin with, a general description of the experimental procedure is presented in Subsection 5.1, along with the datasets used in our tests (both the benchmark problems and the prostate cancer dataset) and, for the benchmarks, the transformations performed on each of them. The parameters used for each experiment are shown in Subsection 5.2, followed by a presentation of the benchmark experimental results in Subsection 5.3…

Concluding remarks

We have presented GP-RFD, a new algorithm that addresses a common real-world problem for which few solutions have been proposed in evolutionary computing: repairing fractures between data by adjusting the data itself, not the classifiers built from it.

We have developed a solution to the problem by means of a GP-based algorithm that performs feature extraction on the problem dataset driven by the accuracy of the previously built classifier.

We have tested…

Acknowledgments

Jose García Moreno-Torres was supported by a scholarship from ‘Obra Social la Caixa’ and is currently supported by an FPU grant from the Ministerio de Educación y Ciencia of the Spanish Government, and also by the KEEL project (TIN2008-06681-C06-01). Rohit Bhargava would like to acknowledge collaborators over the years, especially Dr. Stephen M. Hewitt and Dr. Ira W. Levin of the National Institutes of Health, for numerous useful discussions and guidance. Funding for this work was provided in…

References (62)

  • A. Arcuri, X. Yao, Co-evolutionary automatic programming for software development, Information Sciences (2010), in...
  • A. Asuncion et al., UCI machine learning repository (2007)
  • W. Banzhaf et al., The effect of extensive use of the mutation operator on generalization in genetic programming using sparse data sets
  • A.S. Bickel et al., Tree structured rules in genetic algorithms (1987)
  • M.C.J. Bot, Feature extraction for the k-nearest neighbour classifier with genetic programming (2001)
  • D.A. Cieslak et al., A framework for monitoring classifiers’ performance: when and why failure occurs?, Knowledge and Information Systems (2009)
  • N.L. Cramer, A representation for the adaptive generation of simple sequential programs (1985)
  • C. Darwin, On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life (1859)
  • J. Demšar, Statistical comparisons of classifiers over multiple data sets, Journal of Machine Learning Research (2006)
  • P.G. Espejo et al., A survey on the application of genetic programming to classification, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) (2010)
  • M. Evett et al., Numeric mutation improves the discovery of numeric constants in genetic programming
  • D.C. Fernandez et al., Infrared spectroscopic imaging for histopathologic recognition, Nature Biotechnology (2005)
  • C. Fujiko et al., Using the genetic algorithm to generate lisp source code to solve the prisoner’s dilemma
  • C. Gagné et al., Genericity in evolutionary computation software tools: principles and case study, International Journal on Artificial Intelligence Tools (2006)
  • S. García et al., A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability, Soft Computing (2009)
  • S. García et al., An extension on ‘statistical comparisons of classifiers over multiple data sets’ for all pairwise comparisons, Journal of Machine Learning Research (2008)
  • D.E. Goldberg, Genetic Algorithms in Search, Optimization, and Machine Learning (1989)
  • A. Guérin-Dugué et al., Deliverable R3-B1-P – Task B1: Databases, Technical report, Elena-NervesII “Enhanced Learning...
  • I. Guyon et al., An introduction to variable and feature selection, Journal of Machine Learning Research (2003)
  • C. Harris, An investigation into the Application of Genetic Programming techniques to Signal Analysis and Feature...