Evolutionary computing for knowledge discovery in medical diagnosis
Introduction
Clinical medicine faces the challenge of discovering knowledge from a growing volume of data. Enormous amounts of information are now collected continuously by monitoring the physiological parameters of patients. The growing amount of data has made manual analysis by medical experts tedious and sometimes impossible, and many hidden and potentially useful relationships may go unrecognized by the analyst. This explosive growth calls for an automated way to extract useful knowledge. One possible approach is data mining, or knowledge discovery from databases (KDD) [1], [3]. Through data mining, interesting knowledge and regularities can be extracted, and the discovered knowledge can be applied in the corresponding field to increase working efficiency and improve the quality of decision making.
An important task in knowledge discovery is to extract comprehensible classification rules from the data. Classification rules are particularly useful for medical problems and have been widely applied in the area of medical diagnosis [10], [27]. Such rules can be verified by medical experts and may provide a better understanding of the problem at hand. Numerous techniques have been applied to classification in data mining over the past few decades, such as expert systems, artificial neural networks, linear programming, database systems, and evolutionary algorithms [4], [6], [21], [36], [40], [42]. Among these approaches, evolutionary algorithms have emerged as a promising technique for dealing with the increasing challenge of data mining in the medical domain [27]. Evolutionary algorithms (EAs) are a class of computational techniques inspired by natural evolution that imitate the mechanisms of natural selection and survival of the fittest in solving real-life problems [24], [35].
Genetic programming (GP) [20] and genetic algorithms (GA) [12] are two popular approaches in evolutionary algorithms. Although GP and GA are based on the same evolutionary principle, they often adopt different chromosome representations: GA uses a fixed-length chromosome structure, while GP applies a tree-based chromosome representation. Recently, GA has been used at different stages of the knowledge discovery process in medical data mining applications. Kim and Han [18] and Liu et al. [22] applied GA at the pre-processing stage to reduce the dimension and difficulty of the problem and to increase learning efficiency in data mining. Komosinski and Krawiec [19] proposed an evolutionary algorithm for feature weighting that gives quantitative information about the relative importance of the features. Hruschka and Ebecken [14] and Meesad and Yen [23] used GA at the post-processing stage to extract rules from a neural network. Other GA approaches for generating classification rules in data mining include [7], [8], [38].
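To make the representational difference concrete, the following minimal sketch contrasts a fixed-length GA chromosome with a tree-based GP chromosome. The class names (`Leaf`, `Node`), the helper `tree_size`, and the attribute names and values are hypothetical and chosen only for illustration; they are not taken from the cited works.

```python
from dataclasses import dataclass
from typing import List, Union

# GA-style chromosome: a fixed-length vector of genes (here, numeric values).
GAChromosome = List[float]


@dataclass
class Leaf:
    """Terminal node: an equality test on one nominal attribute (hypothetical)."""
    attribute: str
    value: str

    def evaluate(self, instance: dict) -> bool:
        return instance.get(self.attribute) == self.value


@dataclass
class Node:
    """Function node: a Boolean operator applied to child subtrees."""
    op: str  # "AND", "OR", or "NOT"
    children: List[Union["Node", Leaf]]

    def evaluate(self, instance: dict) -> bool:
        results = [child.evaluate(instance) for child in self.children]
        if self.op == "AND":
            return all(results)
        if self.op == "OR":
            return any(results)
        if self.op == "NOT":
            return not results[0]
        raise ValueError(f"unknown operator: {self.op}")


def tree_size(tree: Union[Node, Leaf]) -> int:
    """Count nodes; unlike a GA chromosome, this size can vary between individuals."""
    if isinstance(tree, Leaf):
        return 1
    return 1 + sum(tree_size(child) for child in tree.children)


# A GA individual always has the same number of genes.
ga_individual: GAChromosome = [0.5, 1.2, 40.0]

# A GP individual is a tree whose shape and size are free to change.
gp_individual = Node("AND", [
    Leaf("sex", "male"),
    Node("NOT", [Leaf("fatigue", "yes")]),
])
```

Crossover on the GA vector swaps gene positions of a known length, whereas crossover on the GP tree typically swaps whole subtrees, which is why the two representations suit different kinds of search.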
Brameier and Banzhaf [3] proposed a linear genetic programming (LGP) classification approach for data mining in the medical domain, but the comprehensibility of the classification rules was not addressed. Wong and Leung [41] proposed a grammar-based GP for constructing classification rules. However, the grammar is domain specific, so a new grammar has to be provided for each new problem, and the use of a grammar also reduces the autonomy of GP in discovering novel knowledge. To address the comprehensibility of classification rules, Bojarczuk et al. [2] proposed a non-standard tree-structured GP in which functions are constructed from Boolean operators and terminal sets are chosen from Booleanized attributes. However, the numeric attributes in this approach must be discretized into nominal boundaries a priori in order to use the Booleanized attributes. This restricts the search capability of GP, i.e. the classification accuracy depends heavily on how well the boundaries were defined.
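The dependence on a priori boundaries can be illustrated with a small sketch. It Booleanizes one numeric attribute with a fixed cut point and shows that the same simple rule scores differently under two choices of boundary. The attribute, the cut points, and the toy records are invented for illustration and do not come from Bojarczuk et al. [2].

```python
from typing import List, Tuple

# Toy records: (age, has_condition). Values are invented for illustration only.
RECORDS: List[Tuple[float, bool]] = [
    (25, False), (32, False), (38, False), (44, True),
    (47, True), (52, True), (58, True), (63, True),
]


def booleanize(value: float, cut: float) -> bool:
    """Turn a numeric attribute into the Boolean attribute 'value >= cut'."""
    return value >= cut


def rule_accuracy(cut: float) -> float:
    """Accuracy of the rule IF age >= cut THEN condition, for a fixed cut."""
    correct = sum(booleanize(age, cut) == label for age, label in RECORDS)
    return correct / len(RECORDS)


# The rule structure is identical; only the predefined boundary differs.
acc_well_placed = rule_accuracy(40.0)    # boundary separates the toy classes
acc_poorly_placed = rule_accuracy(55.0)  # boundary cuts through one class
```

Because the cut point is fixed before the search starts, no amount of evolution over rule structure can recover the accuracy lost to a poorly placed boundary.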
One possible way of handling both nominal and numeric attributes in evolutionary data classification is to hybridize GA and GP. Howard and D’Angelo [13] proposed a hybrid of GA and GP called the genetic algorithm-program (GA-P), which has been applied to evolve expressions for symbolic regression problems. In their approach, GP was used to construct the expression tree and GA was applied to find the numeric constants and coefficients of nominal attributes in the expression. Although GA-P uses the concept of GA and GP hybridization, it was designed for regression, which differs from the data classification problem addressed in this paper. Unlike GA-P, our approach adopts a two-phase evolutionary process: the hybrid evolutionary algorithm is applied to generate good rules in the first phase, and these rules are then used to evolve comprehensible rule sets in the second phase.
Besides evolutionary algorithm-based approaches, a number of algorithms based on artificial neural networks have also been applied to the classification problem in medical diagnosis. However, one common problem of artificial neural networks is that they are essentially “black-box” systems. Although good predictive accuracy can often be achieved, the user is prevented from knowing what is going on inside the “black-box”. Setiono [31], [32] proposed the NeuralRule approach for rule extraction from artificial neural networks, which attempts to extract comprehensible information while preserving the high accuracy of the network. Although the rule extraction algorithm is capable of obtaining compact rule sets, it often needs independent processes of network training/pruning and rule extraction. Taha and Gosh [34] proposed the BIO-RE algorithm for extracting rules from artificial neural networks, but the approach can only be applied to data with binary attributes. Peña-Reyes and Sipper [26] proposed a fuzzy-genetic approach that applies a genetic algorithm (GA) to generate fuzzy classification rules, and reported good results on the Wisconsin diagnostic breast cancer (WDBC) dataset [33].
This paper proposes a two-phase hybrid evolutionary rule extraction algorithm that incorporates both GA and GP to discover comprehensible classification rules for data mining in medical applications. The paper is organized as follows. Section 2 gives an overview of classification rule learning in medical diagnostic problems. Section 3 describes the proposed two-phase hybrid evolutionary classifier (EvoC) in detail. Section 4 describes the hepatitis and breast cancer datasets and compares the classification results of the proposed evolutionary classifier with those of existing approaches. Conclusions are drawn in Section 5.
Decision rules in classification
Given a set of labeled instances, the objective of classification is to discover the hidden relations or regularities between attributes and classes. The classification rules are extracted in the hope that they can be used to automate the classification of future instances. In the classification task, the discovered knowledge is usually represented in the form of decision trees or IF–THEN classification rules, which have the advantage of being a high-level and symbolic knowledge representation that
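As an illustration of the IF–THEN representation mentioned above, the sketch below applies an ordered list of rules to an instance and falls back to a default class when no rule fires. The `Rule` class, the `classify` helper, the first-match policy, and the attribute names and thresholds are all assumptions made for this example, not the paper's formulation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Instance = Dict[str, float]


@dataclass
class Rule:
    """One IF-THEN rule: a readable description, a condition, and a class."""
    description: str
    condition: Callable[[Instance], bool]
    predicted_class: str


def classify(rules: List[Rule], instance: Instance, default: str) -> str:
    """Return the class of the first rule whose condition fires, else default."""
    for rule in rules:
        if rule.condition(instance):
            return rule.predicted_class
    return default


# Hypothetical rules over made-up attributes attr_a and attr_b.
rules = [
    Rule("IF attr_a > 4 AND attr_b > 5 THEN positive",
         lambda x: x["attr_a"] > 4 and x["attr_b"] > 5, "positive"),
    Rule("IF attr_a <= 2 THEN negative",
         lambda x: x["attr_a"] <= 2, "negative"),
]

label_fired_first = classify(rules, {"attr_a": 6, "attr_b": 7}, default="unknown")
label_fired_second = classify(rules, {"attr_a": 1, "attr_b": 9}, default="unknown")
label_no_rule = classify(rules, {"attr_a": 3, "attr_b": 3}, default="unknown")
```

Each rule keeps its human-readable description alongside the executable condition, which is the property that lets a domain expert inspect and verify what the classifier does.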
A two-phase hybrid evolutionary classifier
In this paper, the classification task is formulated as a complex search optimization problem, where the hidden relationships between the attributes and the class are the knowledge to be discovered. The candidate solution, in the form of a comprehensible Boolean rule set, is obtained through a two-phase evolution mechanism as shown in Fig. 1. The first phase searches for a pool of good candidate rules using the Michigan coding approach [24], while the second phase finds the best Boolean rule set by
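The overall shape of such a two-phase search can be sketched as follows, under strong simplifying assumptions. In the first phase each individual is a single rule scored on its own, which is the defining idea of Michigan coding. The description of the second phase is cut off above, so the sketch substitutes an exhaustive search over subsets of the pool, combined with a Boolean OR, as a stand-in; it is not the paper's second-phase procedure, and no evolutionary operators are shown. The toy data, candidate rules, and function names are invented.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple

Instance = Tuple[float, float]
Predictor = Callable[[Instance], bool]  # True means "predict the positive class"

# Toy labelled data, invented for illustration: ((x0, x1), label).
DATA: List[Tuple[Instance, bool]] = [
    ((1.0, 9.0), True), ((2.0, 8.0), True), ((8.0, 1.0), True),
    ((9.0, 2.0), True), ((5.0, 5.0), False), ((4.0, 6.0), False),
]

# Candidate single rules. In a real run these would be evolved, not hand-written.
CANDIDATES: Dict[str, Predictor] = {
    "x0 < 3": lambda x: x[0] < 3,
    "x0 > 7": lambda x: x[0] > 7,
    "x1 > 4": lambda x: x[1] > 4,
}


def accuracy(predict: Predictor, data: List[Tuple[Instance, bool]]) -> float:
    return sum(predict(x) == y for x, y in data) / len(data)


def phase_one(pool_size: int) -> List[str]:
    """Score every rule individually and keep the best ones as the pool."""
    ranked = sorted(CANDIDATES, key=lambda n: accuracy(CANDIDATES[n], DATA),
                    reverse=True)
    return ranked[:pool_size]


def phase_two(pool: List[str]) -> Tuple[Tuple[str, ...], float]:
    """Stand-in search: try every non-empty subset of the pool as an OR rule set."""
    best_set: Tuple[str, ...] = ()
    best_acc = -1.0
    for k in range(1, len(pool) + 1):
        for subset in combinations(pool, k):
            def predict(x: Instance, s: Tuple[str, ...] = subset) -> bool:
                return any(CANDIDATES[name](x) for name in s)
            acc = accuracy(predict, DATA)
            if acc > best_acc:
                best_set, best_acc = subset, acc
    return best_set, best_acc


pool = phase_one(pool_size=2)
best_set, best_acc = phase_two(pool)
```

On this toy data neither pooled rule is sufficient alone, but their combination classifies every record correctly, which is the motivation for separating rule discovery from rule-set construction.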
The medical diagnosis datasets
The medical diagnosis datasets used in this study are the hepatitis dataset and the breast cancer diagnosis databases obtained from the University of California, Irvine (UCI) machine-learning repository at http://www.ics.uci.edu/~mlearn/MLRepository.html. The hepatitis dataset was collected at Carnegie-Mellon University [5] and donated to the UCI ML repository in 1988. The two breast cancer diagnosis datasets, i.e. Wisconsin breast cancer database (WBCD) and Wisconsin diagnostic breast cancer (WDBC), were
Conclusions
A two-phase hybrid evolutionary classifier capable of extracting comprehensible classification rules with good accuracy in medical diagnosis has been proposed in this paper. In the first phase, genetic programming has been applied to evolve nominal attributes for free structured rules while genetic algorithms have been used to optimize the numeric attributes for concise classification rules without the need of discretization. The second phase then formulates accurate rule sets by optimizing the
Acknowledgements
The authors would like to thank Gail Gong and Dr. W.H. Wolberg for making the hepatitis and breast cancer datasets, respectively, publicly available, and the WEKA Development Group for providing the source code of WEKA. The authors also wish to thank Prof. Klaus-Peter Adlassnig and the anonymous reviewers for their valuable comments and helpful suggestions, which greatly improved the paper’s quality.
References (43)
- et al. Genetic algorithms approach to feature discretization in artificial neural networks for the prediction of stock price index. Expert Syst. Appl. (2000)
- et al. Evolutionary weighting of image features for diagnosing of CNS tumors. Artif. Intell. Med. (2000)
- et al. A fuzzy-genetic approach to breast cancer diagnosis. Artif. Intell. Med. (1999)
- et al. Evolutionary computation in medicine: an overview. Artif. Intell. Med. (2000)
- Some notes on neural learning algorithm benchmarking. NeuralComputing (1995)
- Extracting rules from pruned neural networks for breast cancer diagnosis. Artif. Intell. Med. (1996)
- Generating concise and accurate classification rules for breast cancer diagnosis. Artif. Intell. Med. (2000)
- et al. Automating the drug scheduling of cancer chemotherapy via evolutionary computation. Artif. Intell. Med. (2002)
- et al. Integrating membership functions and fuzzy rule sets from multiple knowledge sources. Fuzzy Sets Syst. (2000)
- Banzhaf W, Nordin P, Keller RE, Francone FD. Genetic programming: an introduction on the automatic evolution of...
- Genetic programming for knowledge discovery in chest-pain diagnosis. IEEE Eng. Med. Biol. Mag.
- A comparison of linear genetic programming neural networks in medical data mining. IEEE Trans. Evol. Comput.
- Rule acquisition with a genetic algorithm. Proc. IEEE Cong. Evol. Comput.