Elsevier

Knowledge-Based Systems

Volume 185, 1 December 2019, 104982
Knowledge-Based Systems

An evolutionary framework for machine learning applied to medical data

https://doi.org/10.1016/j.knosys.2019.104982Get rights and content

Abstract

Supervised learning problems can be faced by using a wide variety of approaches supported in machine learning. In recent years there has been an increasing interest in using the evolutionary computation paradigm as a search method for classifiers, helping the applied machine learning technique. In this context, the knowledge representation in the form of logical rules has been one of the most accepted machine learning approaches, because of its level of expressiveness. This paper proposes an evolutionary framework for rule-based classifier induction. Our proposal introduces genetic programming to build a search method for classification-rules (IF/THEN). From this approach, we deal with problems such as, maximum rule length and rule intersection. The experiments have been carried out on our domain of interest, medical data. The achieved results define a methodology to follow in the learning method evaluation for knowledge discovery from medical data. Moreover, the results compared to other methods have shown that our proposal can be very useful in data analysis and classification coming from the medical domain.

Introduction

Machine learning can be approached as the systematic study of algorithms and systems, which improve their knowledge or performance based on experience [1], [2], [3], [4], [5], [6]. The building of machines able to learn from experience has been for a long time a matter of debate since such machines have proven to have a meaningful level of learning ability. Thus, the introduction of machine learning techniques in computer science troubleshooting is of vital importance since there exist problems that cannot be solved through common programming techniques [7], [8], [9], [10]. For such problems there is not an available consistent mathematical model able to guide the programmer. But solving those problems has the potential of reforming aspects of our life and the used machine learning methods may provide the key to their solutions [11].

According to [12], the tasks involving machine learning can be classified into three main categories as follows: supervised learning, which builds a model from a set of inputs and a corresponding set of outputs. The goal is to find a mapping relating inputs with outputs. In contrast, unsupervised learning is not based on experience as the case of supervised learning and there are no labels on the data. The goal of this approach is to capture the true structure of the data to disclose knowledge. In reinforcement learning, intelligent processes (which can be called agents) interact with each other in a dynamic environment to reach their targets. In this approach, agents learn from a series of reinforcements, rewards or punishments, which makes the difference with supervised learning. Since such agents are endowed of a reinforcement process, they can evolve by learning from their environment [13].

The approach using examples (also called instances or patterns) to create programs is known as learning methodology and the set of examples can be referred to as the training data (or training set). The estimate of the target function, which is learned from the learning algorithm and maps inputs with outputs is known as the solution of the learning problem (or decision function). Usually, a set of candidate functions known as hypotheses is selected before to start learning the correct function. Therefore, the set of hypotheses can be seen as the key of the learning strategy. On the other hand, the method taking the training data as input and choosing the hypothesis from the space is the second key of the learning strategy, which is known as the learning method. A learning problem with binary outputs is called binary classification, one with a finite number of outputs is known as multi-class classification and one with real value outputs (continuous values) is known as regression.

The exponential growth of the amount of available medical data raises the problems of efficient storage and management of information as well as disclosing useful information from the data. The problem above is a challenge in computational medicine, claiming the development of methods and tools able to transform data into medical knowledge on the underlying mechanism. Those tools (methods) allow us to go beyond a simple description of the data and provide knowledge in form of models. Through this data abstraction involving a model, we will be able to obtain predictions of systems [8], [14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25].

There are several medical domains where machine learning techniques have been applied to discover knowledge, such as: diagnosis and prognosis, medical imaging and signal processing, planning and scheduling. Diagnosis and prognosis are the most common within this domain. Diagnosis is the process of selectively collecting information concerning a patient for its subsequent interpretation according to previous knowledge, as evidence for or against the presence or absence of disorders [26]. In a prognostic process, information is also collected and interpreted through the patient. But in this case, the goal is to predict the future behavior of the patient’s condition. For its predictive nature, prognosis systems are often used as tools to state medical treatments [27]. The goal of machine learning in the context of diagnosis and prognosis is knowledge discovery needed to interpret the gathered information. In some cases, this knowledge has been expressed as probabilistic relationships between clinical features and the proposed diagnosis (or prognosis). In other cases, a rule-based representation has been selected, so as to provide the expert with an explanation of the decision. Moreover, there are other cases where the system is designed as a black box decision maker, which is totally indifferent to the interpretation of its decisions. In summary, machine learning techniques are well suited to solve these kinds of problems due to their ability to carry out searches in extremely complex spaces.

We are today racking up a huge amount of data involving medical domains and we are also interested in transforming it into knowledge. Hence, the latter can be used as an auxiliary assistant in the decision-making process performed in the diagnosis of diseases [17], [20], [25]. Disclosing knowledge from data is one of the tasks supported by learning algorithms. In the case of the medical data domain, we are interested about inducing classifiers from a set of labeled patterns as a kind of supervised learning. We aim at developing classifiers in the form of a set of rules. To achieve this, we will devise an evolutionary algorithm, i.e., an algorithm inspired by the principles of evolution by natural selection, and the principles of genetics, tuned for this kind of problem.

The task of particular interest that we are going to implement aims at the development of interpretable solutions, enabling us to transform clinical data into medical knowledge and to incorporate existing clinical evidence. The concrete goal of this task is threefold: (1) to develop a novel algorithm for the building of a classifier from labeled examples and (2–3) to test and validate the algorithm with medical data sets.

As a general approach and in relation to evolutionary computation, we have developed an evolutionary framework based on genetic programming to induce rules of form IFconditions THEN class, which will be able to classify patterns coming from the search space. Therefore, the implementation of the framework has been intending to achieve rule-based classifiers by basing on the idea of sequential covering (or separate and conquer) to render each rule [28]. In such a case, we assume individuals evolving from our learning algorithm as single classification rules (Michigan-style) [29]. The biggest challenge faced by the proposed framework was to solve the intersection problem of rules, i.e., patterns classified by rules in different classes. To solve such a problem, we have proposed an ensemble method to classify patterns at the intersection.

To achieve the aims above, the remainder of this paper has been outlined into the following sections: Section 2 deals with the existing background about our proposal. Section 3 presents our proposal, an evolutionary framework to induce rule-based classifiers. In this context, details such as, encoding and search strategy, rule generalization, fitness functions, genetic operators, classification model, rule intersection, the evolutionary algorithm and classifier improvement have been included in this section. Section 4 outlines the main features of the four clinical data sets involved in the experiments as well as the results reached by this proposal compared with other machine learning methods. Section 5 explains the conclusions of this work. Appendix A Example of a rule-based classifier for dataset#1, Appendix B Proof of theorems present an example of a rule-based classifier and the theoretical results respectively. Additionally, this paper provides a supplemental material in the Supplemental Material Document (SMD), which supports and complements the contributions reached in this research.

Section snippets

Background

Genetic programming (GP) represents a flexible and powerful evolutionary technique that uses a set of functions and terminals to produce computable expressions [30], [31], [32]. This allows us to find general solutions, in form of IF-THEN rules, able to classify patterns from a determined problem [28], [29], [33].

When GP is used for rule induction by generating classifiers becomes one of the most important applications in this field, since it allows us to capture the main features of a given

An evolutionary framework for rule induction

This section describes our evolutionary proposal, which defines a Rule Induction Method based on Genetic Programming (called RIM-GP). Before starting with an explanation of the method, we are going to definite classification-rule and rule-based classifier according to the approach assumed in this work. This will be useful in the analysis and understanding of RIM-GP.

Definition 1 Classification Rule

Let Dd (or D for short) be a data set of dimension d, which has been partitioned into a set of classes {C0,C1,,Cm}, whose CiCj=ϕ

Results on medical data

This section describes the experiments carried out by the proposed approach on four clinical datasets from the public repository of the Center for Machine Learning and Intelligent Systems, http://archive.ics.uci.edu/ml/datasets/. For each case, we describe the essential of the data set and provide the results achieved by the proposal. In particular, a visual analysis of the used data sets in conjunction with the RIM-GP results has been presented in the Supplemental Material Document (SMD). At

Conclusions

Machine learning, as a practical matter, deals with the extraction of the right features from the data to build the right models achieving the right tasks [6], [60]. We have also seen that the most common approaches used in machine learning are classification and regression. In that sense, we can say that machine learning processes aim to render classification expressions as simple as possible for humans to understand [67]. The creation and assessment of intelligent machines whose learning is

CRediT authorship contribution statement

José A. Castellanos-Garzón: Conceptualization, Formal analysis, Investigation, Writing-original draft. Ernesto Costa: Supervision, Funding acquisition, Investigation, Validation, Methodology. José Luis Jaimes S.: Formal analysis, Methodology, Writing-review & editing. Juan M. Corchado: Supervision, Funding acquisition, Project administration, Validation.

Acknowledgments

This work has been carried out under the iCIS, Spain project (CENTRO-07-ST24-FEDER-002003), which has been co-financed by QREN, Spain, in the scope of the Mais Centro Program and European Union’s FEDER. This work has also been partially supported by the Interreg V-A Spain-Portugal Program (PocTep) and the European Regional Development Fund (ERDF) under the IOTEC project (grant 0123_IOTEC_3_E).

References (67)

  • NaredoE. et al.

    Evolving genetic programming classifiers with novelty search

    Inform. Sci.

    (2016)
  • De FalcoI. et al.

    Discovering interesting classification rules with genetic programming

    Appl. Soft Comput.

    (2002)
  • Castellanos-GarzónJ.A. et al.

    An evolutionary computational model applied to cluster analysis of DNA microarray data

    Expert Syst. Appl.

    (2013)
  • Castellanos-GarzónJ.A. et al.

    A visual analytics framework for cluster analysis of DNA microarray data

    Expert Syst. Appl.

    (2013)
  • DudaR.O. et al.

    Pattern Classification

    (2001)
  • EmbreyD.

    Human Error, Human Reliability Associates

    (2005)
  • WuX. et al.

    Top 10 algorithms in data mining

    Knowl. Inf. Syst.

    (2008)
  • FlachP.

    Machine Learning: The Art and Science of Algorithms that Make Sense of Data

    (2012)
  • FujitaH. et al.

    Resilience analysis of critical infrastructures: A cognitive approach based on granular computing

    IEEE Trans. Cybern.

    (2019)
  • CristianiniN. et al.

    An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods

    (2000)
  • RussellS. et al.

    Artificial Intelligence: A Modern Approach

    (2003)
  • BishopC.M.

    Pattern Recognition and Machine Learning

    (2006)
  • BandyopadhyayS. et al.
  • KumarT.P. et al.

    Prediction of cancer class with majority voting genetic programming classifier using gene expression data

    IEEE/ACM Trans. Comput. Biol. Bioinform.

    (2009)
  • KumarR. et al.

    Classification rule discovery for diabetes patients by using genetic programming

    Int. J. Soft Comput. Eng.

    (2012)
  • LarrañagaP. et al.

    Machine learning in bioinformatics

    Brief. Bioinform.

    (2006)
  • LiuK.-H. et al.

    A genetic programming-based approach to the classification of multiclass microarray datasets

    Bioinformatics

    (2009)
  • MaulikU. et al.

    Multiobjective Genetic Algorithms for Clustering: Applications in Data Mining and Bioinformatics

    (2011)
  • PodgorelecV. et al.

    Knowledge discovery with classification rules in a cardiovascular dataset

    Comput. Methods Programs Biomed.

    (2005)
  • SoniJ. et al.

    Intelligent and effective heart disease prediction system using weighted associative classifiers

    Int. J. Comput. Sci. Eng.

    (2011)
  • VargasC.M.B. et al.

    Computational Biology and Applied Bioinformatics

    (2011)
  • FreitasA.A.

    Soft Computing for Knowledge Discovery and Data Mining, Part II

    (2008)
  • PappaG.L. et al.

    Automating the Design of Data Mining Algorithms: An Evolutionary Computation Approach

    (2010)
  • Cited by (19)

    • Classification model of machine learning for medical data analysis

      2022, Statistical Modeling in Machine Learning: Concepts and Applications
    • An intelligent hybrid classification algorithm integrating fuzzy rule-based extraction and harmony search optimization: Medical diagnosis applications

      2021, Knowledge-Based Systems
      Citation Excerpt :

      developed a new learning approach for the kernel extreme learning machine based on the chaotic moth-flame optimization strategy where the model was used for medical diagnosis problems of Parkinson’s disease and breast cancer. [18] utilized a machine learning approach, i.e. deep learning, to diagnose and classify diabetes using a specifically selected set of characteristics. [19] used an evolutionary algorithm integrated with a rule-based classification approach on four medical datasets in which genetic programming was applied to make a search method for if-then classification rules.

    • Using machine learning techniques for rising star prediction in basketball

      2021, Knowledge-Based Systems
      Citation Excerpt :

      classical machine learning technique are not much efficient in situations where the decisions are time-dependent, for such situations, [38] presented a machine learning model that have the ability to work in time varying systems. Machine learning techniques based on evolutionary framework has been used in medical domain on clinical data [39]. For prediction of breast cancer, Support Vector Machines and Artificial Neural Networks has been applied by [40] on Wisconsin Breast Cancer dataset.

    • Hyper-heuristic local search for combinatorial optimisation problems

      2020, Knowledge-Based Systems
      Citation Excerpt :

      Many COPs are known as NP-hard problems, hence impractical for exact methods. Instead, meta-heuristic algorithms have been used to deal with the COPs because they are able to find reasonably good solutions within an acceptable timeframe [2–9]. Local search (LS) algorithms are a type of meta-heuristic which have been shown to be very effective on a range of COPs [10–14].

    View all citing articles on Scopus

    No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.knosys.2019.104982.

    1

    http://bisite.usal.es.

    View full text