A fuzzy genetic programming-based algorithm for subgroup discovery and the application to one problem of pathogenesis of acute sore throat conditions in humans
Introduction
Subgroup discovery (SD) is a descriptive data mining technique for describing unusual features with respect to a monitored property of interest [40], [66]. This task contributes interesting knowledge to the scientific community from two viewpoints: interest and precision. SD has been included within the concept of Supervised Descriptive Rule Discovery [42], together with other descriptive techniques such as emerging patterns [18] and contrast set mining [5].
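To make the two viewpoints concrete, the following minimal sketch computes the confidence of a crisp rule (precision) and its unusualness, also known as weighted relative accuracy or WRAcc (interest), a measure widely used in the SD literature. The function names and the toy counts are illustrative assumptions; they are not taken from this paper, and this section does not state which measures the paper itself uses.

```python
def confidence(n_cond_target: int, n_cond: int) -> float:
    """Fraction of the examples covered by the rule that belong to the target class."""
    return n_cond_target / n_cond if n_cond else 0.0


def wracc(n_cond_target: int, n_cond: int, n_target: int, n_total: int) -> float:
    """Unusualness: coverage times the gain in accuracy over the class prior."""
    if n_cond == 0 or n_total == 0:
        return 0.0
    coverage = n_cond / n_total
    return coverage * (n_cond_target / n_cond - n_target / n_total)


# Hypothetical counts: 100 examples, 40 in the target class;
# the rule covers 20 examples, 15 of which are in the target class.
rule_confidence = confidence(15, 20)
rule_wracc = wracc(15, 20, 40, 100)
print(rule_confidence, round(rule_wracc, 4))
```

With these counts the rule is precise (three quarters of the covered examples are in the target class) and also unusual, since that proportion is well above the 40% class prior.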
Different SD algorithms have been proposed in the literature to solve SD tasks, based on beam search, such as CN2-SD [45] or SD [27]; on exhaustive search, such as SD-Map [3]; or on genetic algorithms, such as SDIGA [15] and NMEEF-SD [8], amongst others.
Genetic programming [41] is a methodology based on evolutionary algorithms (EAs) that has been used for classification [20], rule learning [43], [44] and genetic-based machine learning [21]. Its advantages include the following:
- flexibility in the learning process, due to the use of populations of dynamic size and individuals of variable structure and size. This property facilitates the obtaining of descriptive rules for the search space,
- simplicity, since it allows rules to be learned in a flexible way without the need to include all variables in the individuals,
- diversity amongst the rules, which is achieved through specific operators promoting diversity at phenotype or genotype level.
This paper presents a new approach named Fuzzy Genetic Programming-based learning for Subgroup Discovery (FuGePSD). This algorithm is an evolutionary fuzzy system (EFS) [34] based on genetic programming [41] which employs a variable-length tree structure to represent the individuals of the population. FuGePSD employs several genetic operators in order to obtain rules which are as general and precise as possible and which describe new information about the search space. In this way, FuGePSD includes an operator to promote diversity at genotype level, where rules describing the same examples are penalised. Moreover, the drop and insertion genetic operators enhance the precision and generality of the rules.
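The idea of penalising rules that describe the same examples can be sketched with a token-competition scheme, a common diversity mechanism in genetic rule learning. This is only an illustration under stated assumptions: this section does not specify the operator FuGePSD actually uses, and the `Rule` class, its fields and the toy fitness values are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Rule:
    name: str
    fitness: float
    covered: Set[int]              # indices of the examples the rule describes
    penalised_fitness: float = 0.0


def token_competition(rules: List[Rule]) -> List[Rule]:
    """Each example is a token; fitter rules seize the tokens they cover first.

    A rule's fitness is scaled by the share of its covered examples that
    no fitter rule has already seized.
    """
    taken: Set[int] = set()
    ranked = sorted(rules, key=lambda r: r.fitness, reverse=True)
    for rule in ranked:
        won = rule.covered - taken
        ratio = len(won) / len(rule.covered) if rule.covered else 0.0
        rule.penalised_fitness = rule.fitness * ratio
        taken |= won
    return ranked


# Hypothetical population: r2 only describes examples already described by r1.
population = [
    Rule("r1", 0.9, {0, 1, 2, 3}),
    Rule("r2", 0.8, {2, 3}),
    Rule("r3", 0.5, {3, 4}),
]
for rule in token_competition(population):
    print(rule.name, rule.penalised_fitness)
```

In this toy population the redundant rule `r2` loses all of its fitness, while `r3` keeps half of its fitness because it describes one example that no fitter rule covers.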
The benefits offered by FuGePSD are demonstrated in a complete experimental study supported by appropriate statistical tests. The study is focused on datasets with continuous variables, and the validity of FuGePSD is analysed with respect to alternative EAs for SD. The statistical tests confirm the highly effective performance and suitability of this new approach. Moreover, the behaviour of FuGePSD in real problems is examined in a case study related to sore throat. This condition is an acute upper respiratory tract infection that impinges on the throat’s respiratory mucosa, and can be linked with fever, headache and general malaise. The dataset analysed is distinguished by its high dimensionality, with a large number of features. The results obtained show the quality of the new proposal presented in this paper, and are highlighted by experts in this field.
The paper is organised as follows. Firstly, preliminary concepts are described in Section 2. Next, Section 3 presents the new approach, including a description of the algorithm, its operation scheme, fitness functions and genetic operators. Sections 4 and 5 present the experimental framework and the experimental study, respectively. In Section 6, a case study is presented, and the results arising therefrom are discussed by researchers with expertise in this field. Finally, the major conclusions are outlined.
Preliminaries
This section introduces the main concepts used for the algorithm presented. Firstly, a brief introduction to EFSs, and a short review of the SD proposals based on EAs in the specialised literature are presented in Section 2.1. Secondly, the definition, main properties and elements of the SD technique are outlined in Section 2.2. Thirdly, major properties and quality measures for fuzzy rules in SD are summarised in Section 2.3. Finally, the use of EFSs in SD throughout the literature is presented in
FuGePSD: Fuzzy Genetic Programming-based learning for Subgroup Discovery
This section presents the approach for SD called Fuzzy Genetic Programming-based learning for Subgroup Discovery, FuGePSD. It involves an EFS based on a genetic programming algorithm [41] with the ability to extract descriptive fuzzy rules for the SD task.
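As a rough illustration of what a descriptive fuzzy rule looks like in practice, the sketch below evaluates how strongly an example matches a rule whose antecedent uses linguistic labels. It assumes uniform triangular partitions and the minimum t-norm, which are common choices in EFSs but are not confirmed by this section; the variable names, ranges and values are hypothetical.

```python
def triangular(x: float, a: float, b: float, c: float) -> float:
    """Membership of x in a triangular fuzzy set with feet a, c and peak b."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)


def uniform_labels(lo: float, hi: float, n: int) -> list:
    """n evenly spaced triangular labels over [lo, hi], as (a, b, c) tuples (n >= 2)."""
    step = (hi - lo) / (n - 1)
    return [(lo + (i - 1) * step, lo + i * step, lo + (i + 1) * step) for i in range(n)]


def matching_degree(example: dict, antecedent: dict, partitions: dict) -> float:
    """Minimum membership over the variables that appear in the rule.

    Variables absent from the antecedent are ignored, so a rule does not
    need to mention every variable of the dataset.
    """
    return min(
        triangular(example[var], *partitions[var][label])
        for var, label in antecedent.items()
    )


# Hypothetical continuous variables, each with three linguistic labels (0, 1, 2).
partitions = {
    "x1": uniform_labels(0.0, 10.0, 3),
    "x2": uniform_labels(0.0, 1.0, 3),
    "x3": uniform_labels(0.0, 100.0, 3),
}
example = {"x1": 2.5, "x2": 0.9, "x3": 42.0}
rule_antecedent = {"x1": 0, "x2": 2}   # IF x1 is label 0 AND x2 is label 2
print(matching_degree(example, rule_antecedent, partitions))
```

The rule leaves `x3` out of its antecedent, which reflects the property that individuals need not include all variables.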
The following subsections present the main concepts of FuGePSD. Firstly, a complete description of the algorithm, its components and scheme is given in Section 3.1. Secondly, a representation of the fuzzy individuals through the context-free
Experimental framework
This section outlines the main details of the experimental study performed. Specifically, Section 4.1 summarises the datasets analysed in the study, and Section 4.2 the SD algorithms compared. Finally, Section 4.3 presents the statistical tests applied in order to analyse the results obtained with respect to different EAs for the SD task.
Experimental study
This section presents the results for each quality measure with respect to the algorithms considered in the study. NMEEF-SD is abbreviated to NMEEF, CGBA-SD to CGBA and FuGePSD to FuGeP. An analysis of the best parameters (minimum confidence and number of linguistic labels) has previously been carried out. The results shown in Table 4 are the average results obtained with these parameters.
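The role of a minimum confidence parameter can be illustrated with a short sketch. It assumes the fuzzy confidence definition commonly used for fuzzy rules in SD (the sum of matching degrees of the target-class examples divided by the sum of matching degrees of all examples); the function names, rule names and threshold value are hypothetical and are not the parameter values analysed in the paper.

```python
def fuzzy_confidence(degrees: list, labels: list, target: str) -> float:
    """Share of the total matching degree contributed by target-class examples."""
    total = sum(degrees)
    if total == 0:
        return 0.0
    hit = sum(d for d, y in zip(degrees, labels) if y == target)
    return hit / total


def filter_by_min_confidence(confidences: dict, min_conf: float) -> list:
    """Names of the rules whose confidence reaches the threshold."""
    return [name for name, conf in confidences.items() if conf >= min_conf]


# Hypothetical matching degrees of one rule over four examples.
degrees = [1.0, 0.5, 0.5, 0.0]
labels = ["pos", "pos", "neg", "neg"]
conf = fuzzy_confidence(degrees, labels, "pos")
print(conf)

# Hypothetical rule set filtered with an assumed threshold of 0.7.
print(filter_by_min_confidence({"r1": conf, "r2": 0.55}, 0.7))
```

Raising the threshold keeps only more precise rules, usually at the cost of returning fewer and less general subgroups, which is why such a parameter is typically tuned before comparing algorithms.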
FuGePSD obtains the highest values in the AVERAGE of the three quality measures analysed in this paper. However, the results
A case study: pathogenesis of acute sore throat conditions in humans
Sore throat (sometimes known as ‘pharyngitis’ or ‘tonsillitis’) is an acute upper respiratory tract infection that impinges on the throat’s respiratory mucosa, and can be linked with fever, headache and general malaise. Moreover, acute otitis media, acute sinusitis and peritonsillar abscess represent suppurative complications of this condition, predominantly the first of these. 85–95% of adult acute sore throat conditions are ascribable to viruses, as are 70% of those in children aged 5–16 years
Concluding remarks
In this paper, a new proposal based on genetic programming for SD has been presented. Genetic programming together with fuzzy logic brings a range of advantages to the FuGePSD algorithm:
- Flexibility in the generation of the individuals since they are constructed with tree structures and the variables are included in a dynamic manner. In this way, the use of genetic programming allows individuals to evolve without the need to include all variables in the representation facilitating the
Acknowledgment
This work was partially supported by the Spanish Ministry of Economy and Competitiveness under Project TIN2012-33856 (FEDER Funds).
Profs. Grootveld, Elizondo and Carmona are very grateful to De Montfort University, Leicester, UK for the provision of a HEIF collaborative award to support this project.
References (68)
- et al. Fuzzy rules for describing subgroups from Influenza A virus using a multi-objective evolutionary algorithm. Appl. Soft Comput. (2013)
- et al. MEFES: an evolutionary proposal for the detection of exceptions in subgroup discovery. An application to concentrating photovoltaic technology. Knowl.-Based Syst. (2013)
- et al. Web usage mining to improve the design of an e-commerce website: OrOliveSur.com. Expert Syst. Appl. (2012)
- et al. Examination of the metabolic status of rat air pouch inflammatory exudate by high field proton NMR spectroscopy. Biochim. Biophys. Acta-Molec. Basis Dis. (1999)
- et al. Enhancing evolutionary instance selection algorithms by means of fuzzy rough set based feature selection. Inform. Sci. (2012)
- et al. An experimental comparison of performance measures for classification. Pattern Recogn. Lett. (2009)
- et al. METSK-HDe: a multiobjective evolutionary algorithm to learn accurate TSK-fuzzy systems in high-dimensional and large-scale regression problems. Inform. Sci. (2014)
- et al. Cognitive systems based on adaptive algorithms
- et al. QAR-CIP-NSGA-II: a new multi-objective evolutionary algorithm to mine quantitative association rules. Inform. Sci. (2014)
- et al. Integrating fuzzy knowledge by genetic algorithms. IEEE Trans. Evol. Comput. (1998)