Evolutionary computing for knowledge discovery in medical diagnosis

https://doi.org/10.1016/S0933-3657(03)00002-2

Abstract

One of the major challenges in the medical domain is the extraction of comprehensible knowledge from medical diagnosis data. In this paper, a two-phase hybrid evolutionary classification technique is proposed to extract classification rules that can be used in clinical practice for better understanding and prevention of unwanted medical events. In the first phase, a hybrid evolutionary algorithm (EA) is utilized to confine the search space by evolving a pool of good candidate rules: genetic programming (GP) is applied to evolve nominal attributes for free structured rules, and a genetic algorithm (GA) is used to optimize the numeric attributes for concise classification rules without the need for discretization. These candidate rules are then used in the second phase, which optimizes the order and number of rules to form accurate and comprehensible rule sets. The proposed evolutionary classifier (EvoC) is validated on the hepatitis and breast cancer datasets obtained from the UCI machine-learning repository. Simulation results show that the evolutionary classifier produces comprehensible rules and good classification accuracy for the medical datasets. Results obtained from t-tests further support its robustness and invariance to random partitioning of the datasets.

Introduction

Clinical medicine is facing the challenge of discovering knowledge from a growing volume of data. Enormous amounts of information are now collected continuously by monitoring the physiological parameters of patients. The growing amount of data has made manual analysis by medical experts tedious and sometimes impossible, and many hidden and potentially useful relationships may go unrecognized by the analyst. This explosive growth of data calls for an automated way to extract useful knowledge. One possible approach to this problem is data mining, or knowledge discovery from databases (KDD) [1], [3]. Through data mining, interesting knowledge and regularities can be extracted, and the discovered knowledge can be applied in the corresponding field to increase working efficiency and to improve the quality of decision making.

An important task in knowledge discovery is to extract comprehensible classification rules from the data. Classification rules are particularly useful for medical problems and have been widely applied in the area of medical diagnosis [10], [27]. Such rules can be verified by medical experts and may provide a better understanding of the problem at hand. Numerous techniques have been applied to classification in data mining over the past few decades, such as expert systems, artificial neural networks, linear programming, database systems, and evolutionary algorithms [4], [6], [21], [36], [40], [42]. Among these approaches, evolutionary algorithms have emerged as a promising technique for dealing with the increasing challenge of data mining in the medical domain [27]. The evolutionary algorithm (EA) is a class of computational techniques inspired by natural evolution; it imitates the mechanisms of natural selection and survival of the fittest to solve real-life problems [24], [35].

Genetic programming (GP) [20] and genetic algorithms (GA) [12] are two popular approaches in evolutionary algorithms. Although GP and GA are based on the same evolutionary principle, they often adopt different chromosome representations: GA uses a fixed-length chromosome structure, while GP applies a tree-based chromosome representation. Recently, GA has been utilized at different stages of the knowledge discovery process in medical data mining applications. Kim and Han [18] and Liu et al. [22] applied GA at the pre-processing stage to reduce the dimension/difficulty of the problem and to increase the learning efficiency in data mining. Komosinski and Krawiec [19] proposed an evolutionary algorithm for feature weighting that gives quantitative information about the relative importance of the features. Hruschka and Ebecken [14] and Meesad and Yen [23] used GA at the post-processing stage to extract rules from a neural network. Other GA approaches for generating classification rules in data mining include [7], [8], [38].
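
To make the representational difference concrete, the following minimal sketch contrasts a fixed-length GA chromosome with a variable-shape GP tree. It is an illustration only: the function names, the Boolean operators, and the terminal conditions are assumptions chosen for the example and are not taken from the paper.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


# GA: a fixed-length chromosome, here a plain vector of numeric genes.
def random_ga_chromosome(length: int, rng: random.Random) -> List[float]:
    return [rng.uniform(0.0, 1.0) for _ in range(length)]


def one_point_crossover(a: List[float], b: List[float],
                        rng: random.Random) -> Tuple[List[float], List[float]]:
    """Swap tails at a random cut point; offspring keep the parents' length.

    Assumes both parents have the same length, which must be at least 2.
    """
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]


# GP: a tree-based chromosome whose size and shape may differ per individual.
@dataclass
class Node:
    label: str                       # operator ("AND"/"OR") or a terminal condition
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def size(self) -> int:
        children = (self.left, self.right)
        return 1 + sum(c.size() for c in children if c is not None)

    def __str__(self) -> str:
        if self.left is None and self.right is None:
            return self.label
        return f"({self.left} {self.label} {self.right})"


def random_gp_tree(depth: int, terminals: List[str], rng: random.Random) -> Node:
    """Grow a random Boolean expression tree of at most the given depth."""
    if depth == 0 or rng.random() < 0.3:
        return Node(rng.choice(terminals))
    return Node(rng.choice(["AND", "OR"]),
                random_gp_tree(depth - 1, terminals, rng),
                random_gp_tree(depth - 1, terminals, rng))


rng = random.Random(0)
parent1 = random_ga_chromosome(4, rng)
parent2 = random_ga_chromosome(4, rng)
child1, child2 = one_point_crossover(parent1, parent2, rng)
print(len(child1), len(child2))      # GA offspring always have length 4 here

# Hypothetical nominal conditions used as GP terminals.
tree = random_gp_tree(3, ["sex = male", "ascites = yes", "fatigue = no"], rng)
print(tree, tree.size())             # shape and size vary from tree to tree
```

The GA offspring can only recombine values at fixed positions, whereas the GP tree can grow or shrink, which is why a tree is a natural carrier for free-form rule structure and a vector for numeric parameters.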

Brameier and Banzhaf [3] proposed a linear genetic programming (LGP) classification approach for data mining in the medical domain, but the issue of comprehensibility of the classification rules was not addressed. Wong and Leung [41] proposed a grammar-based GP for constructing classification rules. However, the grammar is domain specific, so a new grammar has to be provided for each new problem, and the use of a grammar also reduces the autonomy of GP in discovering novel knowledge. To address the issue of comprehensibility of classification rules, Bojarczuk et al. [2] proposed a non-standard tree-structured GP in which functions are constructed via Boolean operators and terminal sets are chosen based on Booleanized attributes. However, the numeric attributes in this approach need to be discretized into nominal boundaries a priori in order to use the Booleanized attributes. This restricts the search capability of GP, i.e. the classification accuracy depends heavily on how well the boundaries were defined.

One possible approach to handling both nominal and numeric attributes in evolutionary data classification is the hybridization of GA and GP. Howard and D’Angelo [13] proposed a hybrid of GA and GP called the genetic algorithm-program (GA-P), which has been applied to evolve expressions for symbolic regression problems. In their approach, GP was used to construct the expression tree and GA was applied to find the numeric constants and coefficients of nominal attributes in the expression. Although GA-P utilized the concept of GA and GP hybridization, it was designed for regression, which differs from the problem of data classification addressed in this paper. Unlike GA-P, our approach adopts a two-phase evolutionary process: the hybrid evolutionary algorithm is applied to generate good rules in the first phase, and these rules are then used to evolve comprehensible rule sets in the second phase.

Besides evolutionary algorithm-based approaches, a number of algorithms based on artificial neural networks have also been applied to the classification problem in medical diagnosis. However, one common problem of artificial neural networks is that they are essentially a “black-box” system. Although good predictive accuracy can often be achieved, the user is prevented from knowing what is going on inside the “black-box”. Setiono [31], [32] proposed the NeuralRule approach for rule extraction from artificial neural networks, which attempts to extract comprehensible information while preserving the high accuracy of the network. Although the rule extraction algorithm is capable of obtaining compact rule sets, it often needs independent processes of network training/pruning and rule extraction in data mining. Taha and Ghosh [34] proposed the BIO-RE algorithm for extracting rules from artificial neural networks, but the approach can only be applied to data with binary attributes. Peña-Reyes and Sipper [26] proposed a fuzzy-genetic approach that applies a genetic algorithm to generate fuzzy classification rules and reported good results on the Wisconsin diagnostic breast cancer (WDBC) dataset [33].

This paper proposes a two-phase hybrid evolutionary rule extraction algorithm that incorporates both GA and GP to discover comprehensible classification rules for data mining in medical applications. The paper is organized as follows: Section 2 gives an overview of classification rule learning in medical diagnostic problems. Section 3 describes the proposed two-phase hybrid evolutionary classifier (EvoC) in detail. The hepatitis and breast cancer datasets are described in Section 4, where the classification results of the proposed evolutionary classifier are also compared with existing approaches. Conclusions are drawn in Section 5.

Decision rules in classification

Given a set of labeled instances, the objective of classification is to discover the hidden relations or regulations between attributes and classes. The classification rules are extracted in the hope that they can be used to automate classification of future instances. In the classification task, the discovered knowledge is usually represented in the form of decision trees or IF–THEN classification rules, which has the advantage of being a high-level and symbolic knowledge representation that
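
As an illustration of the IF–THEN representation, the sketch below encodes an ordered rule list in which the first matching rule determines the class and a default class covers all remaining instances. The attribute names, thresholds, and class labels are made up for the example; they are not rules reported in the paper, and the class and function names are hypothetical.

```python
import operator
from dataclasses import dataclass
from typing import Any, Dict, List

Instance = Dict[str, Any]

# Comparison operators allowed in a condition: equality tests suit nominal
# attributes, threshold tests suit numeric attributes (no discretization needed).
OPS = {"==": operator.eq, "!=": operator.ne, "<=": operator.le, ">": operator.gt}


@dataclass
class Condition:
    attribute: str
    op: str
    value: Any

    def matches(self, x: Instance) -> bool:
        return OPS[self.op](x[self.attribute], self.value)


@dataclass
class Rule:
    conditions: List[Condition]   # IF part: conjunction (AND) of conditions
    label: str                    # THEN part: predicted class

    def covers(self, x: Instance) -> bool:
        return all(c.matches(x) for c in self.conditions)


def classify(rules: List[Rule], default: str, x: Instance) -> str:
    """Return the class of the first rule that covers x, else the default class."""
    for rule in rules:
        if rule.covers(x):
            return rule.label
    return default


# Hypothetical rule list (illustrative values only):
#   IF ascites == yes AND albumin <= 2.8 THEN die
#   ELSE IF bilirubin > 3.0 THEN die
#   ELSE live
rules = [
    Rule([Condition("ascites", "==", "yes"), Condition("albumin", "<=", 2.8)], "die"),
    Rule([Condition("bilirubin", ">", 3.0)], "die"),
]

patient = {"ascites": "no", "albumin": 4.0, "bilirubin": 1.0}
print(classify(rules, "live", patient))
```

Because each rule is a readable conjunction of attribute tests, a domain expert can inspect and verify the rule list directly, which is the comprehensibility argument made for this representation.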

A two-phase hybrid evolutionary classifier

In this paper, the classification task is formulated as a complex search optimization problem, where the hidden relationships between attributes and classes are the targeted knowledge to be discovered. The candidate solution, in the form of a comprehensible Boolean rule set, is obtained through a two-phase evolution mechanism as shown in Fig. 1. The first phase searches for a pool of good candidate rules using the Michigan coding approach [24], while the second phase finds the best Boolean rule set by
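
The second phase, described in the abstract as optimizing the order and number of rules, can be sketched as a search over ordered, variable-length lists of indices into the phase-one rule pool. The code below is a rough sketch under several assumptions that are not from the paper: the pool is hand-written rather than evolved, the search is a simple elitist, mutation-only loop, fitness is accuracy minus a small parsimony penalty, and the toy data and attribute names are invented.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Instance = Dict[str, float]
Rule = Tuple[Callable[[Instance], bool], str]        # (IF part, THEN class)
Dataset = Sequence[Tuple[Instance, str]]


def classify(rule_list: Sequence[Rule], default: str, x: Instance) -> str:
    """Ordered rule list: the first rule whose IF part holds decides the class."""
    for condition, label in rule_list:
        if condition(x):
            return label
    return default


def accuracy(rule_list: Sequence[Rule], default: str, data: Dataset) -> float:
    return sum(classify(rule_list, default, x) == y for x, y in data) / len(data)


def mutate(individual: List[int], pool_size: int, rng: random.Random) -> List[int]:
    """Change either the number of rules (add/drop) or their order (swap)."""
    child = list(individual)
    move = rng.choice(["add", "drop", "swap"])
    if move == "add" or not child:
        child.insert(rng.randint(0, len(child)), rng.randrange(pool_size))
    elif move == "drop" and len(child) > 1:
        child.pop(rng.randrange(len(child)))
    elif len(child) > 1:
        i, j = rng.sample(range(len(child)), 2)
        child[i], child[j] = child[j], child[i]
    return child


def evolve_rule_set(pool: List[Rule], default: str, data: Dataset,
                    generations: int = 30, pop_size: int = 20,
                    seed: int = 0) -> Tuple[List[int], float]:
    rng = random.Random(seed)

    def fitness(individual: List[int]) -> float:
        # Assumed parsimony pressure: prefer shorter rule sets at equal accuracy.
        acc = accuracy([pool[i] for i in individual], default, data)
        return acc - 0.01 * len(individual)

    # Each individual is a whole rule set, encoded as an ordered list of pool indices.
    population = [[rng.randrange(len(pool))] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]            # elitism: keep the better half
        offspring = [mutate(rng.choice(parents), len(pool), rng)
                     for _ in range(pop_size - len(parents))]
        population = parents + offspring
    best = max(population, key=fitness)
    return best, accuracy([pool[i] for i in best], default, data)


# Toy candidate pool and data (invented values, for illustration only).
pool: List[Rule] = [
    (lambda x: x["albumin"] <= 2.8, "die"),
    (lambda x: x["bilirubin"] > 3.0, "die"),
    (lambda x: x["albumin"] > 3.5, "live"),
]
data = [
    ({"albumin": 2.5, "bilirubin": 1.0}, "die"),
    ({"albumin": 4.0, "bilirubin": 4.2}, "die"),
    ({"albumin": 4.1, "bilirubin": 0.9}, "live"),
    ({"albumin": 3.0, "bilirubin": 1.2}, "live"),
]
best, acc = evolve_rule_set(pool, "live", data)
print(best, acc)
```

Restricting the second phase to indices into a fixed pool keeps its search space small, which matches the stated purpose of the first phase: to confine the search before rule sets are assembled.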

The medical diagnosis datasets

The medical diagnosis datasets used in this study are the hepatitis dataset and breast cancer diagnosis databases obtained from the University of California, Irvine (UCI) machine-learning repository at http://www.ics.uci.edu/~mlearn/MLRepository.html. The hepatitis dataset was collected at Carnegie-Mellon University [5] and donated to the UCI ML repository in 1988. The two breast cancer diagnosis datasets, i.e. Wisconsin breast cancer database (WBCD) and Wisconsin diagnostic breast cancer (WDBC), were

Conclusions

A two-phase hybrid evolutionary classifier capable of extracting comprehensible classification rules with good accuracy in medical diagnosis has been proposed in this paper. In the first phase, genetic programming has been applied to evolve nominal attributes for free structured rules while genetic algorithms have been used to optimize the numeric attributes for concise classification rules without the need for discretization. The second phase then formulates accurate rule sets by optimizing the

Acknowledgements

The authors would like to thank Gail Gong and Dr. W.H. Wolberg for making the hepatitis and breast cancer datasets, respectively, publicly available, and the WEKA Development Group for providing the source code of WEKA. The authors also wish to thank Prof. Klaus-Peter Adlassnig and the anonymous reviewers for their valuable comments and helpful suggestions, which greatly improved the paper’s quality.

References (43)

  • Bojarczuk CC, et al. Genetic programming for knowledge discovery in chest-pain diagnosis. IEEE Eng. Med. Biol. Mag. 2000.
  • Brameier M, et al. A comparison of linear genetic programming and neural networks in medical data mining. IEEE Trans. Evol. Comput. 2001.
  • Cattral R, et al. Rule acquisition with a genetic algorithm. Proc. IEEE Cong. Evol. Comput. 1999.
  • Cestnik G, Kononenko I, Bratko I. Assistant-86: a knowledge-elicitation tool for sophisticated users. In: Bratko I,...
  • Chang YH, Zheng B, Wang XH, Good WF. Computer-aided diagnosis of breast cancer using artificial neural networks:...
  • Congdon CB. Classification of epidemiological data: a comparison of genetic algorithm and decision tree approaches. In:...
  • Fidelis MV, Lopes HS, Freitas AA. Discovering comprehensible classification rules with a genetic algorithm. In:...
  • Frank E, Witten IH. Generating accurate rule sets without global optimization. In: Proceedings of the 15th...
  • Freitas AA. A survey of evolutionary algorithms for data mining and knowledge discovery. In: Ghosh A, Tsutsui S,...
  • Garner SR. WEKA: the Waikato environment for knowledge analysis. In: Proceedings of the New Zealand Computer Science...
  • Goldberg DE. Genetic algorithms in search, optimization and machine learning. Reading (MA): Addison-Wesley;...