Elsevier

Information Sciences

Volume 298, 20 March 2015, Pages 180-197

A fuzzy genetic programming-based algorithm for subgroup discovery and the application to one problem of pathogenesis of acute sore throat conditions in humans

https://doi.org/10.1016/j.ins.2014.11.030

Abstract

This paper proposes a novel algorithm for the subgroup discovery (SD) task based on genetic programming and fuzzy logic, called Fuzzy Genetic Programming-based for Subgroup Discovery (FuGePSD). Genetic programming makes it possible to learn compact expressions, with the main objective of obtaining rules that describe simple, interesting and interpretable subgroups. The algorithm incorporates specific operators in the search process to promote diversity among the individuals. The evolutionary scheme of FuGePSD follows the genetic cooperative-competitive approach, which promotes both competition and cooperation between the individuals of the population in order to find optimal solutions for the SD task.

FuGePSD shows its potential with high-quality results in a wide experimental study that compares it with other evolutionary algorithms for subgroup discovery. Moreover, the proposal is applied to a case study related to acute sore throat problems.

Introduction

Subgroup discovery (SD) is a descriptive data mining technique for describing unusual features with monitored properties of interest [40], [66]. This task contributes interesting knowledge to the scientific community from two viewpoints, namely the interest and the precision of the subgroups described. SD has been included within the concept of Supervised Descriptive Rule Discovery [42], together with further descriptive techniques such as emerging patterns [18] and contrast set mining [5].
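
As a rough illustration of what interest and precision can mean for a subgroup, the sketch below computes two measures that are commonly used in the SD literature, unusualness (weighted relative accuracy) and confidence, for a crisp rule. The function name `subgroup_measures`, the toy data and the choice of these two measures are assumptions made for illustration; the text above does not state which measures the paper uses.

```python
from typing import Callable, Dict, List, Tuple

Example = Dict[str, str]


def subgroup_measures(
    data: List[Tuple[Example, str]],
    condition: Callable[[Example], bool],
    target: str,
) -> Tuple[float, float]:
    """Return (unusualness, confidence) of the rule `condition -> target`."""
    n = len(data)
    n_cond = sum(1 for x, _ in data if condition(x))
    n_target = sum(1 for _, y in data if y == target)
    n_both = sum(1 for x, y in data if condition(x) and y == target)
    if n == 0 or n_cond == 0:
        return 0.0, 0.0
    # Precision: share of covered examples that belong to the target class.
    confidence = n_both / n_cond
    # Interest: coverage times the gain in precision over the class prior.
    unusualness = (n_cond / n) * (confidence - n_target / n)
    return unusualness, confidence


# Hypothetical toy data: (features, class label).
toy = [
    ({"x1": "high"}, "pos"),
    ({"x1": "high"}, "pos"),
    ({"x1": "high"}, "neg"),
    ({"x1": "low"}, "neg"),
    ({"x1": "low"}, "neg"),
    ({"x1": "low"}, "pos"),
    ({"x1": "low"}, "neg"),
    ({"x1": "low"}, "neg"),
]

wracc, conf = subgroup_measures(toy, lambda x: x["x1"] == "high", "pos")
```

A rule that covers many examples but is no more precise than the class prior scores an unusualness close to zero, which is why the two viewpoints are usually reported together.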

Different SD algorithms have been proposed in the literature to solve SD tasks: algorithms based on beam search, such as CN2-SD [45] or SD [27]; exhaustive algorithms, such as SD-Map [3]; and genetic algorithms, such as SDIGA [15] and NMEEF-SD [8], amongst others.

Genetic programming [41] is a methodology based on evolutionary algorithms (EAs) that has been used for classification [20], rule learning [43], [44] and genetics-based machine learning [21]. Its advantages include the following:

  • flexibility in the learning process, due to the use of populations with dynamic size and individuals with variable structure and size. This property makes it easier to obtain descriptive rules for the search space,

  • simplicity, since rules can be learned in a flexible way without the need to include all variables in the individuals,

  • diversity amongst the rules, which is achieved through specific operators that promote diversity at the phenotype or genotype level.

This paper presents a new approach named Fuzzy Genetic Programming-based for Subgroup Discovery: FuGePSD. This algorithm is an evolutionary fuzzy system (EFS) [34] based on genetic programming [41] which employs a variable-length tree structure to represent the individuals of the population. FuGePSD employs several genetic operators in order to obtain rules which are as general and precise as possible and which describe new information about the search space. To this end, FuGePSD includes an operator that promotes diversity at the genotype level by penalising rules that describe the same examples. Moreover, the drop and insertion genetic operators help to increase the precision and generality of the rules.
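
The two ideas in this paragraph, a variable-length tree per individual and a penalty for rules that describe the same examples, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Leaf` and `And` node types are hypothetical, and token competition is used as a stand-in for the diversity operator, which the text above does not name.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Union


@dataclass(frozen=True)
class Leaf:
    """A single condition of the form 'variable IS label'."""
    variable: str
    label: str


@dataclass(frozen=True)
class And:
    """Conjunction of two sub-expressions."""
    left: "Node"
    right: "Node"


Node = Union[Leaf, And]


def variables(node: Node) -> List[str]:
    """List the variables used in a tree, left to right."""
    if isinstance(node, Leaf):
        return [node.variable]
    return variables(node.left) + variables(node.right)


def token_competition(
    fitness: Dict[str, float], covered: Dict[str, Set[int]]
) -> Dict[str, float]:
    """Scale each rule's fitness by the share of its covered examples
    that no fitter rule has already claimed (assumed scheme)."""
    claimed: Set[int] = set()
    adjusted: Dict[str, float] = {}
    for name in sorted(fitness, key=lambda r: fitness[r], reverse=True):
        ideal = covered[name]
        won = ideal - claimed  # examples still unclaimed
        claimed |= won
        adjusted[name] = fitness[name] * (len(won) / len(ideal)) if ideal else 0.0
    return adjusted


# A rule over two of the available variables; the tree need not use them all.
rule = And(Leaf("x1", "Low"), Leaf("x3", "High"))
used = variables(rule)

# Hypothetical fitness values and covered example indices.
fitness = {"r1": 0.9, "r2": 0.8, "r3": 0.5}
covered = {"r1": {0, 1, 2}, "r2": {1, 2}, "r3": {2, 3}}
adjusted = token_competition(fitness, covered)
```

Here `r2` covers only examples already claimed by the fitter `r1`, so its adjusted fitness drops to zero, while `r3` keeps half of its fitness because one of its two examples is new.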

The benefits offered by FuGePSD are shown in a complete experimental study supported by appropriate statistical tests. The study focuses on datasets with continuous variables, and the validity of FuGePSD is analysed with respect to alternative EAs for SD. Statistical tests confirm the effective performance and suitability of this new approach. Moreover, FuGePSD is applied to a real problem, namely a study related to sore throat. This condition is an acute upper respiratory tract infection that impinges on the throat’s respiratory mucosa, and can be linked with fever, headache and general malaise. The dataset analysed is notable for its high dimensionality, with a large number of features. The results obtained show the quality of the proposal and are highlighted by experts in this field.

The paper is organised as follows. Firstly, preliminary concepts are described in Section 2. Next, Section 3 presents the new approach, including a description of the algorithm, its operation scheme, fitness functions and genetic operators. Sections 4 and 5 present the experimental framework and the experimental study, respectively. In Section 6, a case study is presented, and the results arising therefrom are discussed by researchers with expertise in this field. Finally, the main conclusions are outlined.

Section snippets

Preliminaries

This section introduces the main concepts used by the algorithm presented. Firstly, a brief introduction to EFSs and a short review of the SD proposals based on EAs in the specialised literature are presented in Section 2.1. Secondly, the definition, main properties and elements of the SD technique are outlined in Section 2.2. Thirdly, major properties and quality measures for fuzzy rules in SD are summarised in Section 2.3. Finally, the use of EFSs in SD throughout the literature is presented in

FuGePSD: Fuzzy Genetic Programming-based learning for Subgroup Discovery

This section presents the approach for SD called Fuzzy Genetic Programming-based learning for Subgroup Discovery, FuGePSD. It involves an EFS based on a genetic programming algorithm [41] with the ability to extract descriptive fuzzy rules for the SD task.
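
To make the notion of a descriptive fuzzy rule more concrete, the sketch below builds evenly spaced triangular linguistic labels for a continuous variable and computes a fuzzy confidence as the matching degree accumulated on target-class examples divided by the matching degree accumulated on all examples. The uniform triangular partition, the helper names and the sample values are assumptions for illustration only; the text above does not specify the membership functions or the exact quality measures used by FuGePSD.

```python
from typing import List, Sequence, Tuple

Triangle = Tuple[float, float, float]


def triangular(x: float, a: float, b: float, c: float) -> float:
    """Membership of x in a triangular fuzzy set with feet a, c and peak b."""
    if x == b:
        return 1.0
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)


def uniform_labels(lo: float, hi: float, n: int) -> List[Triangle]:
    """Build n >= 2 evenly spaced triangular labels covering [lo, hi]."""
    step = (hi - lo) / (n - 1)
    return [(lo + (i - 1) * step, lo + i * step, lo + (i + 1) * step)
            for i in range(n)]


def fuzzy_confidence(degrees: Sequence[float], is_target: Sequence[bool]) -> float:
    """Matching degree on target examples over matching degree on all examples."""
    total = sum(degrees)
    if total == 0:
        return 0.0
    return sum(d for d, t in zip(degrees, is_target) if t) / total


# Three labels (Low, Medium, High) on a hypothetical variable in [0, 10].
labels = uniform_labels(0.0, 10.0, 3)
medium = labels[1]

# Hypothetical values of the variable and whether each example is in the target class.
values = [2.5, 5.0, 7.5, 10.0]
is_target = [True, True, False, False]

# Degree to which each example matches the antecedent 'variable IS Medium'.
degrees = [triangular(v, *medium) for v in values]
conf = fuzzy_confidence(degrees, is_target)
```

With more linguistic labels each triangle becomes narrower, so rules get more specific; this is why the number of labels is typically treated as a parameter to tune.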

The following subsections present the main concepts of FuGePSD. Firstly, a complete description of the algorithm, its components and its scheme is given in Section 3.1. Secondly, a representation of the fuzzy individuals through the context-free

Experimental framework

This section outlines the main details of the experimental study performed. Specifically, Section 4.1 summarises the datasets analysed in the study for the SD algorithms presented (Section 4.2). Finally, Section 4.3 presents the statistical tests applied in order to analyse the results obtained with respect to different EAs for the SD task.
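
The text above does not say which statistical tests are applied. Since the reference list includes Friedman's rank test and Demsar's guidelines for comparing methods over multiple datasets, the sketch below assumes a Friedman-style analysis: each algorithm is ranked on every dataset, the ranks are averaged, and the Friedman statistic is computed from those averages. The scores are made-up toy numbers, not results from the paper.

```python
from typing import List


def average_ranks(scores: List[List[float]]) -> List[float]:
    """scores[d][a] is the score of algorithm a on dataset d (higher is better).
    Returns the mean rank of each algorithm; rank 1 is best, ties share a rank."""
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda a: -row[a])
        i = 0
        while i < k:
            j = i
            # Extend j over the run of algorithms tied with position i.
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            shared = (i + j) / 2 + 1  # mean of the 1-based positions i..j
            for pos in range(i, j + 1):
                totals[order[pos]] += shared
            i = j + 1
    return [t / len(scores) for t in totals]


def friedman_statistic(scores: List[List[float]]) -> float:
    """Friedman chi-square statistic computed from the average ranks."""
    n, k = len(scores), len(scores[0])
    ranks = average_ranks(scores)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in ranks) - k * (k + 1) ** 2 / 4)


# Hypothetical scores: 4 datasets (rows) x 3 algorithms (columns).
toy_scores = [
    [0.9, 0.8, 0.7],
    [0.6, 0.7, 0.5],
    [0.8, 0.8, 0.6],
    [0.9, 0.7, 0.8],
]

ranks = average_ranks(toy_scores)
stat = friedman_statistic(toy_scores)
```

The statistic is then compared against a chi-square distribution with k - 1 degrees of freedom; a post-hoc procedure is needed to say which pairs of algorithms differ.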

Experimental study

This section presents the results for each quality measure with respect to the algorithms considered in the study. NMEEF-SD is abbreviated to NMEEF, CGBA-SD to CGBA and FuGePSD to FuGeP. An analysis of the best parameters (minimum confidence and number of linguistic labels) has previously been carried out. The results shown in Table 4 are the average results obtained with those parameters.

FuGePSD obtains the highest values in the AVERAGE of the three quality measures analysed in this paper. However, the results

A case study: pathogenesis of acute sore throat conditions in humans

Sore throat (sometimes known as ‘pharyngitis’ or ‘tonsillitis’) is an acute upper respiratory tract infection that impinges on the throat’s respiratory mucosa, and can be linked with fever, headache and general malaise. Moreover, acute otitis media, acute sinusitis and peritonsillar abscess represent suppurative complications of this condition, predominantly the first of these. 85–95% of adult acute sore throat conditions are ascribable to viruses, as are 70% of those in children aged 5–16 years

Concluding remarks

In this paper, a new proposal based on genetic programming for SD has been presented. Genetic programming together with fuzzy logic brings a range of advantages to the FuGePSD algorithm:

  • Flexibility in the generation of the individuals, since they are constructed with tree structures and the variables are included in a dynamic manner. In this way, the use of genetic programming allows individuals to evolve without the need to include all variables in the representation facilitating the

Acknowledgment

This work was partially supported by the Spanish Ministry of Economy and Competitiveness under Project TIN2012-33856 (FEDER Funds).

Profs. Grootveld, Elizondo and Carmona are very grateful to De Montfort University, Leicester, UK for the provision of a HEIF collaborative award to support this project.

References (68)

  • L.A. Zadeh, The concept of a linguistic variable and its applications to approximate reasoning. Parts I, II, III, Inform. Sci. (1975)
  • R. Agrawal et al., Fast discovery of association rules
  • A. Asuncion, D.J. Newman, UCI Machine Learning Repository, 2007....
  • M. Atzmueller et al., SD-Map – a fast algorithm for exhaustive subgroup discovery
  • M. Atzmueller, F. Puppe, H.P. Buscher, Towards knowledge-intensive subgroup discovery, in: Proceedings of the Lernen,...
  • S. Bay et al., Detecting group differences: mining contrast sets, Data Min. Knowl. Disc. (2001)
  • P.S. Callery et al., Biosynthesis of 5-aminopentanoic acid and 2-piperidone from cadaverine and 1-piperideine in the mouse, J. Neurochem. (1984)
  • C.J. Carmona et al., NMEEF-SD: non-dominated multi-objective evolutionary algorithm for extracting fuzzy rules in subgroup discovery, IEEE Trans. Fuzzy Syst. (2010)
  • C.J. Carmona et al., Overview on evolutionary subgroup discovery: analysis of the suitability and potential of the search performed by evolutionary algorithms, WIREs Data Min. Knowl. Disc. (2014)
  • C.J. Carmona, P. González, M.J. del Jesus, C. Romero, S. Ventura, Evolutionary algorithms for subgroup discovery...
  • K. Deb et al., A fast and elitist multiobjective genetic algorithm: NSGA-II, IEEE Trans. Evol. Comput. (2002)
  • M.J. del Jesus et al., Evolutionary fuzzy rule induction process for subgroup discovery: a case study in marketing, IEEE Trans. Fuzzy Syst. (2007)
  • J. Demsar, Statistical comparisons of classifiers over multiple data sets, J. Mach. Learn. Res. (2006)
  • G.Z. Dong et al., Mining border descriptions of emerging patterns from dataset pairs, Knowl. Inform. Syst. (2005)
  • A.E. Eiben et al., Introduction to Evolutionary Computation (2003)
  • P. Espejo et al., A survey on the application of genetic programming to classification, IEEE Trans. Syst. Man Cybernet. – Part C: Appl. Rev. (2010)
  • A. Fernández et al., Genetics-based machine learning for rule induction: state of the art, taxonomy, and comparative study, IEEE Trans. Evol. Comput. (2010)
  • D.B. Fogel, Evolutionary Computation – Toward a New Philosophy of Machine Intelligence (1995)
  • J.C. Fothergill et al., Catabolism of l-lysine by Pseudomonas aeruginosa, J. Gen. Microbiol. (1977)
  • M. Friedman, The use of ranks to avoid the assumption of normality implicit in the analysis of variance, J. Am. Stat. Assoc. (1937)
  • D. Gamberger et al., Expert-guided subgroup discovery: methodology and application, J. Artif. Intell. Res. (2002)
  • S. García et al., Study of statistical techniques and performance measures for genetics-based machine learning: accuracy and interpretability, Soft Comput. (2009)
  • S. García et al., An extension on “Statistical comparisons of classifiers over multiple data sets” for all pairwise comparisons, J. Mach. Learn. Res. (2008)
  • D.E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning (1989)