Repairing fractures between data using genetic programming-based feature extraction: A case study in cancer diagnosis
Introduction
The assumption that a properly trained classifier will be able to predict the behavior of unseen data from the same problem is at the core of any automatic classification process. However, this assumption tends to prove unreliable when dealing with data from biology (or other experimental sciences), especially when such data is provided by more than one laboratory, even if the laboratories follow the same protocols to obtain it.
The specific problem this paper attempts to solve is the following: we have data from one laboratory (dataset A), and derive from it a classifier that accurately predicts the class of its samples. We are then presented with data from a second laboratory (dataset B). This second dataset is not accurately predicted by the classifier we had previously built, due to a fracture between the data of the two laboratories. We intend to find a transformation of dataset B (dataset S) on which the classifier works.
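The setting described above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration: the nearest-centroid classifier, the synthetic Gaussian data, and the fracture (a scaling by 2 and a shift of 10) are hypothetical and are not taken from the paper. The sketch only shows that a classifier built from dataset A can fail on dataset B and work again on a transformed dataset S, without being retrained.

```python
import random


def train_centroids(X, y):
    """Nearest-centroid classifier: return {label: mean feature vector}."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for i, value in enumerate(row):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict(centroids, row):
    """Return the label whose centroid is closest to the row."""
    def sq_dist(lab):
        return sum((a - b) ** 2 for a, b in zip(row, centroids[lab]))
    return min(centroids, key=sq_dist)


def accuracy(centroids, X, y):
    hits = sum(predict(centroids, row) == label for row, label in zip(X, y))
    return hits / len(y)


rng = random.Random(0)

# Dataset A: two well-separated synthetic classes (hypothetical data).
X_a, y_a = [], []
for label, centre in ((0, 0.0), (1, 4.0)):
    for _ in range(100):
        X_a.append([rng.gauss(centre, 0.5), rng.gauss(centre, 0.5)])
        y_a.append(label)

clf = train_centroids(X_a, y_a)  # built once from A, never retrained

# Dataset B: the same samples seen through a hypothetical fracture.
X_b = [[2.0 * v + 10.0 for v in row] for row in X_a]
y_b = list(y_a)

# Dataset S: B after a repairing transformation (here known by construction).
X_s = [[(v - 10.0) / 2.0 for v in row] for row in X_b]

acc_a = accuracy(clf, X_a, y_a)
acc_b = accuracy(clf, X_b, y_b)
acc_s = accuracy(clf, X_s, y_b)
print(f"A: {acc_a:.2f}  B: {acc_b:.2f}  S: {acc_s:.2f}")
```

In this toy case the repairing transformation is known in advance; in the problem the paper addresses it is unknown and has to be searched for.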
Evolutionary computing, as introduced by Holland [27], is based on the idea of the survival of the fittest, evoked by the natural evolutionary process. In genetic algorithms (GAs) [21], solutions (genes) are more likely to reproduce the fitter they are, and sporadic random mutations help maintain population diversity. Genetic Programming (GP) [33] is a development of those techniques, and follows a similar pattern to evolve tree-shaped solutions using variable-length chromosomes.
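As a rough illustration of the GA loop just described, the following minimal sketch evolves bit strings on the textbook OneMax problem (fitness is the number of ones). The problem, the function name `evolve`, and all parameter values are assumptions chosen for brevity; none of them come from the paper.

```python
import random


def evolve(n_bits=20, pop_size=30, generations=30, p_mut=0.02, seed=0):
    """Minimal generational GA on OneMax; returns the best individual seen."""
    rng = random.Random(seed)
    fitness = sum  # OneMax: the fitness of a bit string is its number of ones
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        # Fitter individuals are proportionally more likely to reproduce.
        # The +1 keeps the total weight positive even if all fitnesses are 0.
        weights = [fitness(ind) + 1 for ind in pop]
        new_pop = []
        for _ in range(pop_size):
            mum, dad = rng.choices(pop, weights=weights, k=2)
            cut = rng.randrange(1, n_bits)  # one-point crossover
            child = mum[:cut] + dad[cut:]
            # Sporadic bit flips help maintain population diversity.
            child = [1 - b if rng.random() < p_mut else b for b in child]
            new_pop.append(child)
        pop = new_pop
        best = max(pop + [best], key=fitness)
    return best


if __name__ == "__main__":
    winner = evolve()
    print(sum(winner), "ones out of", len(winner))
```

GP follows the same select, recombine and mutate cycle, but the individuals are expression trees of variable size rather than fixed-length bit strings.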
Feature extraction, as defined by Wyse et al. [56], ‘consists of the extraction of a set of new features from the original features through some functional mapping’. Our approach to the problem can be seen as feature extraction, since we build a new set of features which are functions of the old ones. However, our goal differs from that of classical feature extraction, since our intention is to fit a dataset to an already existing classifier, not to improve the performance of a future one.
In this work, we intend to demonstrate the use of GP-based feature extraction to unveil transformations that improve the accuracy of a previously built classifier, by performing feature extraction on a dataset where said classifier should, in principle, work, but where it does not perform accurately enough. We test our algorithm first on artificially built problems (where we apply ad hoc transformations to datasets from which a classifier has been built, and use the dataset resulting from those transformations as our problem dataset), and then on a real-world application where biological data regarding prostate cancer diagnosis from two different medical laboratories are used as datasets A and B.
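The idea of searching for a transformation guided by the accuracy of an already built classifier can be sketched as follows. This is a deliberately simplified stand-in and not GP-RFD itself: it uses mutation-only hill climbing over expression trees, whereas the genetic operators, function set and parameters of the actual algorithm are not given in this excerpt. The function set (+, -, *), the fixed rule `frozen_classifier`, the synthetic data and the +10 fracture are all hypothetical.

```python
import operator
import random

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}  # assumed function set


def random_tree(rng, n_features, depth=2):
    """Grow a random expression tree over the input features."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.7:
            return ("x", rng.randrange(n_features))
        return ("const", rng.uniform(-15.0, 15.0))
    op = rng.choice(sorted(OPS))
    return (op,
            random_tree(rng, n_features, depth - 1),
            random_tree(rng, n_features, depth - 1))


def evaluate(tree, row):
    """Evaluate an expression tree on one sample."""
    kind = tree[0]
    if kind == "x":
        return row[tree[1]]
    if kind == "const":
        return tree[1]
    return OPS[kind](evaluate(tree[1], row), evaluate(tree[2], row))


def mutate(tree, rng, n_features):
    """Replace a randomly chosen subtree with a freshly grown one."""
    if tree[0] in ("x", "const") or rng.random() < 0.3:
        return random_tree(rng, n_features)
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(left, rng, n_features), right)
    return (op, left, mutate(right, rng, n_features))


def frozen_classifier(row):
    """Hypothetical classifier standing in for the one built from dataset A."""
    return 1 if row[0] + row[1] > 4.0 else 0


def fitness(individual, X, y):
    """Accuracy of the frozen classifier on the transformed samples."""
    hits = 0
    for row, label in zip(X, y):
        s_row = [evaluate(tree, row) for tree in individual]  # one tree per feature
        hits += frozen_classifier(s_row) == label
    return hits / len(y)


def repair(X, y, generations=50, offspring=20, seed=0):
    """Hill-climb over lists of trees, keeping strict improvements only."""
    rng = random.Random(seed)
    n = len(X[0])
    best = [("x", i) for i in range(n)]  # start from the identity transformation
    best_fit = fitness(best, X, y)
    for _ in range(generations):
        for _ in range(offspring):
            child = list(best)
            i = rng.randrange(n)
            child[i] = mutate(child[i], rng, n)
            child_fit = fitness(child, X, y)
            if child_fit > best_fit:
                best, best_fit = child, child_fit
    return best, best_fit


# Dataset B: synthetic samples under a hypothetical fracture (+10 on every feature).
data_rng = random.Random(1)
X_b, y_b = [], []
for label, centre in ((0, 0.0), (1, 4.0)):
    for _ in range(50):
        X_b.append([data_rng.gauss(centre, 0.5) + 10.0,
                    data_rng.gauss(centre, 0.5) + 10.0])
        y_b.append(label)

baseline = fitness([("x", 0), ("x", 1)], X_b, y_b)
best, best_fit = repair(X_b, y_b)
print("accuracy on B:", baseline, "accuracy on S:", best_fit)
```

The classifier is never modified in this sketch; only the mapping from dataset B to dataset S changes, which is the distinction the paragraph above draws with classical feature extraction.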
Even though the method proposed in this paper does not attempt to reduce the number of features or instances in the dataset, it can still be regarded as a form of data reduction because it unifies the data distributions of two datasets, which makes it possible to apply the same classifier to both of them instead of needing a different classifier for each dataset.
The remainder of this paper is organized as follows: in Section 2, some preliminaries about the techniques used and some approaches to similar problems in the literature are presented. Section 3 details the real-world biological problem that motivates this paper. Section 4 describes the proposed algorithm, GP-RFD, and Section 5 includes the experimental setup, along with the results obtained and their analysis. Finally, in Section 6 some concluding remarks are made.
Preliminaries
This section is divided in the following way: in Subsection 2.1 we introduce the notation that has been used in this paper. Then we include an introduction to GP in Subsection 2.2, a brief summary of what has been done in feature extraction in Subsection 2.3, and a short review of the different approaches we found in the specialized literature on the use of GP for feature extraction in Subsection 2.4. We conclude by mentioning some works related to the finding and repair of fractures between data
Case study: prostate cancer diagnosis
This section begins with an introduction to the importance of the problem in Subsection 3.1. The diagnostic procedure is summarized in Subsection 3.2, and the reason to apply GP-RFD to this problem is shown in Subsection 3.3. Finally, the preprocessing the data went through is presented in Subsection 3.4.
A proposal for GP-based feature extraction for the repairing of fractures between data (GP-RFD)
This section is presented in the following way: first, a justification for the choice of GP is included. Subsection 4.1 details how the solutions are represented; the fitness evaluation procedure and the genetic operators are then introduced in Subsections 4.2 and 4.3, respectively. The parameter choices are explained in Subsection 4.4, and the function set in Subsection 4.5. Finally, the execution flow of the whole procedure is shown in Subsection 4.6
Experimental study
This section is organized in the following way: to begin with, a general description of the experimental procedure is presented in Subsection 5.1, along with the datasets that we have used for our testing (both the benchmark problems and the prostate cancer dataset) and, in the case of the benchmarks, the transformations performed on each of them. The parameters used for each experiment are shown in Subsection 5.2, followed by a presentation of the benchmark experimental results in Subsection 5.3
Concluding remarks
We have presented GP-RFD, a new algorithm that approaches a common problem in real life for which not many solutions have been proposed in evolutionary computing. The problem in question is the repairing of fractures between data by adjusting the data itself, not the classifiers built from it.
We have developed a solution to the problem by means of a GP-based algorithm that performs feature extraction on the problem dataset driven by the accuracy of the previously built classifier.
We have tested
Acknowledgments
Jose García Moreno-Torres was supported by a scholarship from ‘Obra Social la Caixa’ and is currently supported by an FPU grant from the Ministerio de Educación y Ciencia of the Spanish Government, and also by the KEEL project (TIN2008-06681-C06-01). Rohit Bhargava would like to acknowledge collaborators over the years, especially Dr. Stephen M. Hewitt and Dr. Ira W. Levin of the National Institutes of Health, for numerous useful discussions and guidance. Funding for this work was provided in
References (62)
- et al. GP-COACH: genetic programming-based learning of compact and accurate fuzzy rule-based classification systems for high-dimensional problems. Information Sciences (2010)
- et al. Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: experimental analysis of power. Information Sciences (2010)
- et al. Breast cancer diagnosis using genetic programming generated feature. Pattern Recognition (2006)
- et al. Population variation in genetic programming. Information Sciences (2007)
- et al. Dynamic population variation in genetic programming. Information Sciences (2009)
- et al. Routine high-return human-competitive automated problem-solving by means of genetic programming. Information Sciences (2008)
- et al. Classifier design with feature selection and feature extraction using layered genetic programming. Expert Systems with Applications (2008)
- et al. Conceptual equivalence for contrast mining in classification learning. Data & Knowledge Engineering (2008)
- et al. Keel: a software tool to assess evolutionary algorithms for data mining problems. Soft Computing - A Fusion of Foundations, Methodologies and Applications (2008)
- American Cancer Society. How many men get prostate cancer?...
- UCI machine learning repository
- The effect of extensive use of the mutation operator on generalization in genetic programming using sparse data sets
- Tree structured rules in genetic algorithms
- Feature extraction for the k-nearest neighbour classifier with genetic programming
- A framework for monitoring classifiers’ performance: when and why failure occurs? Knowledge and Information Systems
- A representation for the adaptive generation of simple sequential programs
- On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life
- Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research
- A survey on the application of genetic programming to classification. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews)
- Numeric mutation improves the discovery of numeric constants in genetic programming
- Infrared spectroscopic imaging for histopathologic recognition. Nature Biotechnology
- Using the genetic algorithm to generate lisp source code to solve the prisoner’s dilemma
- Genericity in evolutionary computation software tools: principles and case study. International Journal on Artificial Intelligence Tools
- A study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability. Soft Computing
- An extension on ‘statistical comparisons of classifiers over multiple data sets’ for all pairwise comparisons. Journal of Machine Learning Research
- Genetic Algorithms in Search, Optimization, and Machine Learning
- An introduction to variable and feature selection. Journal of Machine Learning Research