Genetic programming and frequent itemset mining to identify feature selection patterns of iEEG and fMRI epilepsy data

https://doi.org/10.1016/j.engappai.2014.12.008

Highlights

  • We used FIM to compute confidence intervals for GP-based feature selection.

  • We found intra-subject consistency and inter-subject variability in feature subsets.

  • We achieved over 60% median sensitivity and selectivity for both modalities.

Abstract

Pattern classification for intracranial electroencephalogram (iEEG) and functional magnetic resonance imaging (fMRI) signals has furthered epilepsy research toward understanding the origin of epileptic seizures and localizing dysfunctional brain tissue for treatment. Prior research has demonstrated that implicitly selecting features with a genetic programming (GP) algorithm more effectively determined the proper features to discern biomarker and non-biomarker interictal iEEG and fMRI activity than conventional feature selection approaches. However, for each of the iEEG and fMRI modalities, it is still uncertain whether the stochastic properties of indirect feature selection with a GP yield (a) consistent results within a patient data set and (b) features that are specific or universal across multiple patient data sets. We examined the reproducibility of implicitly selecting features to classify interictal activity using a GP algorithm by performing several selection trials and subsequent frequent itemset mining (FIM) for separate iEEG and fMRI epilepsy patient data. We observed within-subject consistency and across-subject variability with some small similarity for selected features, indicating a clear need for patient-specific features and a possible need for patient-specific feature selection and/or classification. For the fMRI, using nearest-neighbor classification and 30 GP generations, we obtained over 60% median sensitivity and over 60% median selectivity. For the iEEG, using nearest-neighbor classification and 30 GP generations, we obtained over 65% median sensitivity and over 65% median selectivity except for one patient.

Introduction

Epilepsy is a neurological disorder that impairs millions worldwide with recurrent, often uncontrollable seizures (World Health Organization, 2012). Physicians at epilepsy centers acquire combinations of various noninvasive and/or invasive brain signal modalities to diagnose the seizure onsets in patients, plan neurosurgical treatment, or fathom ictogenesis mechanisms and epilepsy symptoms (Bragin et al., 2010, Donaire et al., 2009a, Donaire et al., 2009b, Engel, 1993, Engel et al., 2010, Fried, 1995, Rosenow and Luders, 2001, Sierra-Marcos et al., 2013, Staba and Bragin, 2011). These modalities include scalp electroencephalography (EEG), intracranial electroencephalography (iEEG), magnetoencephalography (MEG), functional magnetic resonance imaging (fMRI), and other neuroimaging data. Traditionally, clinicians resort to subjective manual procedures when screening iEEG or fMRI for epilepsy patient diagnoses. However, signal processing techniques in pattern classification and recognition for fMRI and iEEG have become useful tools in epilepsy research to help clinicians more objectively discern differences between functional and dysfunctional brain regions, identifying diseased brain tissue for therapy (Ayoubian et al., 2012, Donaire et al., 2009a, Donaire et al., 2009b, Fernandez-Blanco et al., 2012, Gaspard et al., 2014, Gotman et al., 1995, Grewal and Gotman, 2005, Halford, 2009, Han et al., 2011, Keogh and Cordes, 2007, Lee et al., 2009, Navakatikyan et al., 2006, Osorio et al., 1995, Osorio et al., 1998, Qu and Gotman, 1997, Saab and Gotman, 2005, Tzallas et al., 2009, Tzallas et al., 2012, Wilson and Emerson, 2002, Worrell et al., 2012). With further development, validation, and acceptance across several research groups, such algorithms may translate to practical clinical use as decision-support tools for physicians.

One approach to developing semi-automated pattern classification and decision-support tools for epilepsy data has been to implement evolutionary computation techniques to select, combine, or create measures (extracted features) that quantify the difference between interictal biomarkers (e.g., pathological gamma oscillations in iEEG, resting-state blood oxygenation changes in fMRI) and interictal background or basal activity (Burrell et al., 2007a, Smart et al., 2007, Smart et al., 2011). Scant research has been published on the application of evolutionary computation to interictal resting-state fMRI signals from epilepsy patients (Burrell et al., 2007b). Instead, pattern detection applications for ictal (seizure) activity dominate prior research involving evolutionary computation techniques applied to iEEG and EEG, including mostly genetic algorithms (Haydari et al., 2011, Hsu and Yu, 2010, Ocak, 2008, Patnaik and Manyam, 2008, Rivero et al., 2013, Shen et al., 2013), genetic programming (Sotelo et al., 2013a, 2013b), and harmony search optimization (Gandhi et al., 2012, Zainuddin et al., 2013), although some projects have focused on spike detection applications (Haydari et al., 2011, Kinnear et al., 1999, Marchesi et al., 1997a, Shen et al., 2013). Additional studies have used evolutionary computation in other manners for epilepsy data (Bandarabadi et al., 2011, Firpi et al., 2005a, Harikumar et al., 2004, Rivero et al., 2013, Wei et al., 2010). Only a few groups apply evolutionary computation techniques to gamma oscillation pattern detection for encephalography (Firpi et al., 2007, Smart et al., 2007, Smart et al., 2011).
We have focused on pattern classification research for interictal rather than ictal events of interest based on the following positions: (1) certain analyses of interictal brain signals can lead to diagnostic and prognostic information for identifying seizure onset zones equivalent to or complementary to certain analyses of ictal brain signals (Bettus et al., 2011, Crepon et al., 2010, Heers et al., 2014, Korzeniewska et al., 2014, Lu et al., 2014, Matsumoto et al., 2013, Spencer et al., 2008, Thornton et al., 2011, Valentin et al., 2014, Worrell and Gotman, 2011, Zhang et al., 2014); and (2) some patients do not have seizures during epilepsy monitoring, so using brain signals not dependent on seizures provides a means to offer some clinical analysis for patients rather than send them home without any diagnosis. Because it is not a trivial problem to detect interictal biomarkers within iEEG or fMRI signals, especially depending on the signal-to-noise ratio for each brain signal modality, we have used evolutionary computation techniques to search for robust, optimal, albeit relatively complex and somewhat human-intractable solutions. Commonly referenced alternative approaches for interictal biomarker detectors using iEEG signals still involve a human verification stage to discard numerous false positive detections (Crepon et al., 2010, Gardner et al., 2007, Worrell et al., 2008) despite some algorithm developments (Zelmann et al., 2012). Such human involvement counteracts the main purpose of the semi-automated approach and, depending on the false-positive rate of the detection method, might be as laborious as marking the actual true-positive events without running the pattern classification.
On the other hand, we demonstrated via our prior work on interictal biomarker detection algorithms that – for at least our epilepsy brain signal data – one may gain higher pattern classification performance for features selected using an evolutionary computation method than for features selected by conventional or popular-in-literature methods for iEEG (Firpi et al., 2007, Smart et al., 2007, Smart et al., 2011) and for fMRI (Burrell et al., 2007a).

Evolutionary computation is a discipline comprising the study and development of evolution-based (Darwin, 1978) search optimization algorithms: for instance, an initial search space contains a population of organisms (possible optimal solutions) that stochastically undergoes mutation and recombination before survival selection (i.e., choosing the most fit organisms) and generational (iterative) production of a new population of organisms from the prior population (refining the possible optimal solutions) until no more evolution occurs (optimization convergence). Since its pioneering inception in the 1950s through 1960s (Baeck et al., 1997, Barricelli, 1957, Barricelli, 1962, Barricelli, 1963, Fogel, 1998, Fogel et al., 1968, Fraser, 1960, Turing, 1950), evolutionary computation has spawned numerous computer science techniques that today fall into five classes or dialects of research areas: (1) evolutionary programming (Fogel and Fogel, 1986, Fogel et al., 1968, Sebald and Fogel, 1994), (2) genetic algorithms (Booker et al., 1989, Holland, 1992a, Holland, 1992b, Holland, 1995), (3) evolution strategy (Schwefel, 1981, Schwefel, 1995), (4) genetic programming (Koza, 1989, Koza, 1992, Koza, 1994, Koza, 1996, Koza, 1997, Koza, 1999, Koza et al., 2006), and (5) swarm intelligence (Beni, 2005, Beni and Wang, 1989, Blum and Merkle, 2008, Bonabeau et al., 1999, Dorigo and Gambardella, 1997, Kennedy et al., 2001). In particular, the genetic programming (GP) methodology is a global search optimization procedure that heuristically develops possible solutions (programs) to a predefined problem statement using biological evolution concepts (i.e., mutation, crossover, and selection) (Koza, 1989, Koza, 1992, Koza, 1994).
The GP algorithm (see Algorithm 1) initializes a set (population) of solutions (individuals) of size P, with each element representing a mathematical operation on the input of the GP in the form of a tree structure, uses an objective function to compute an index (fitness) for each individual, and executes the evolutionary processes to create new populations that optimize the fitness to find the best individual. The evolutionary processes have many variations in implementation but the same basic concepts: the selection stage chooses a current population subset (intermediate population) based upon individual fitness; the crossover stage creates new individuals using combinations of paired individuals from the intermediate population, forming a new population; the mutation stage introduces diversity into the new population by randomly altering the makeup of a subset of individuals in the new population; and the survival stage simply selects the fittest individuals from the new population, creating a new initial population of size P for subsequent GP iterations (generations). The algorithm ends upon attaining a predefined number of generations or a predefined fitness value. From this final population of solutions, one may select the best (optimal) solution according to the chosen fitness function.

Algorithm 1

Pseudocode for GP Algorithm.
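
As a rough illustration of the loop described above (and not the authors' Algorithm 1), the following Python sketch runs selection, crossover, mutation, and survival over small expression trees whose leaves are feature indices. The function set, the toy fitness measure, binary tournament selection, the node cap for bloat control, and the elitist survival step are all assumptions chosen for brevity; the data are assumed to contain both classes (labels 0 and 1).

```python
import operator
import random

FUNCS = [operator.add, operator.sub, operator.mul]  # hypothetical function set

def random_tree(rng, n_feat, depth=2):
    """Grow a random tree; leaves are feature indices (the GP terminals)."""
    if depth == 0 or rng.random() < 0.3:
        return rng.randrange(n_feat)
    return (rng.choice(FUNCS), random_tree(rng, n_feat, depth - 1),
            random_tree(rng, n_feat, depth - 1))

def evaluate(tree, x):
    """Apply the expression tree to one feature vector x."""
    if isinstance(tree, int):
        return x[tree]
    func, left, right = tree
    return func(evaluate(left, x), evaluate(right, x))

def used_features(tree):
    """Feature indices appearing as leaves: the implicitly selected subset."""
    if isinstance(tree, int):
        return {tree}
    return used_features(tree[1]) | used_features(tree[2])

def fitness(tree, X, y):
    """Toy objective (assumption): class-mean gap of the output over its range."""
    out = [evaluate(tree, x) for x in X]
    pos = [o for o, label in zip(out, y) if label == 1]
    neg = [o for o, label in zip(out, y) if label == 0]
    spread = (max(out) - min(out)) or 1.0
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg)) / spread

def nodes(tree, path=()):
    """Yield (path, subtree) for every node of the tree."""
    yield path, tree
    if not isinstance(tree, int):
        yield from nodes(tree[1], path + (1,))
        yield from nodes(tree[2], path + (2,))

def graft(tree, path, new):
    """Return a copy of tree with the node at path replaced by new."""
    if not path:
        return new
    parts = list(tree)
    parts[path[0]] = graft(tree[path[0]], path[1:], new)
    return tuple(parts)

def capped(child, parent, max_nodes=31):
    """Crude bloat control (assumption): reject oversized offspring."""
    return child if sum(1 for _ in nodes(child)) <= max_nodes else parent

def crossover(rng, a, b):
    """Replace a random node of a with a random subtree of b."""
    path, _ = rng.choice(list(nodes(a)))
    _, donor = rng.choice(list(nodes(b)))
    return capped(graft(a, path, donor), a)

def mutate(rng, tree, n_feat):
    """Replace a random node with a freshly grown shallow subtree."""
    path, _ = rng.choice(list(nodes(tree)))
    return capped(graft(tree, path, random_tree(rng, n_feat, depth=1)), tree)

def run_gp(X, y, pop_size=20, generations=30, p_mut=0.2, seed=0):
    rng = random.Random(seed)
    n_feat = len(X[0])
    score = lambda t: fitness(t, X, y)
    pop = [random_tree(rng, n_feat) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: binary tournaments build the intermediate population.
        inter = [max(rng.sample(pop, 2), key=score) for _ in range(pop_size)]
        # Crossover: pair intermediate individuals to form a new population.
        new = [crossover(rng, rng.choice(inter), rng.choice(inter))
               for _ in range(pop_size)]
        # Mutation: randomly alter a subset of the new population.
        new = [mutate(rng, t, n_feat) if rng.random() < p_mut else t for t in new]
        # Survival (elitist variant, an assumption): keep the fittest P overall.
        pop = sorted(pop + new, key=score, reverse=True)[:pop_size]
    return pop[0]
```

Calling `used_features(run_gp(X, y, seed=k))` for several seeds `k` would give one feature subset per trial, which is the kind of stochastic output whose reproducibility the study examines.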

For pattern classification problems to quantitatively discriminate non-biomarker (e.g., basal or baseline activity) and biomarker (e.g., spikes, seizures, PGOs, abnormal CBF activations) brain signals from epilepsy patients, GP has been used for both noninvasive and invasive electroencephalography (Fernández-Blanco et al., 2013, Firpi et al., 2005b, Firpi et al., 2005c, Firpi et al., 2006, Guo et al., 2011, Lopes, 2007, Marchesi et al., 1997b, Smart et al., 2007, Sotelo et al., 2013a, Sotelo et al., 2013b), MEG (Georgopoulos et al., 2009, Theofilatos et al., 2009), and fMRI (Burrell et al., 2007b). However, GP is not a feature selection algorithm. Technically, its use in this way is a mischaracterized application, where algorithmic issues such as bloat, fitness function definition, and choices for the terminals (features) and functions can substantially affect ‘feature selection’ results. Also, the GP algorithm can output practically inelegant solutions since theoretical parsimony is not a guaranteed effect (Kelly, 1995). These limitations accentuate the importance of examining whether GP-based feature-selection solutions demonstrate reproducible results and useful patterns for pattern classification of epilepsy data. Consequently, we investigated GP-based feature selection (i.e., implicit feature selection with a GP algorithm) for interictal resting-state brain recordings with a focus on two main questions regarding the computed feature subsets. Across patients, do selected subsets exhibit the same feature subsets, indicating universal measures, or different subsets, indicating unconventional case-by-case measures? Per patient, are the selected subsets similar if not the same in content, indicating consistency in solutions?
Since the confidence interval concept embodies the computation of consistency or reliability in an estimated value or parameter set (Neyman, 1937), we investigated these two questions under the same aim: construct and evaluate confidence intervals for GP-based feature selection. Since a feature subset list is qualitative rather than quantitative data, we considered frequent itemset mining (FIM) as an approach for confidence interval construction.

Frequent itemset mining is the first stage in association rule learning, an established data-mining method to discover highly replicable information within a multitude of data (Agrawal et al., 1993, Agrawal et al., 1996, Agrawal and Srikant, 1994, Agrawal and Srikant, 1995, Rakesh and Ramakrishnan, 1994, Rakesh and Ramakrishnan, 1995, Rakesh and Ramakrishnan, 1998, Rakesh et al., 1993, Zaki, 2000). As its name implies, an FIM algorithm discovers (i.e., mines) frequently occurring itemsets (i.e., collections of data variables) for pattern observation. An event (item) represents some variable of interest in the data-mining framework. A set of items (itemset), sometimes called a transaction, is an observed combination of co-occurring events. A collection of multiple itemsets (database) is the input for FIM analysis to identify patterns. An itemset's percentage (support), s, indicates its regularity or frequency within the database, where an itemset with n events whose support exceeds a support threshold (or likelihood level), λ, is termed a λ-frequent n-itemset. A maximal (max) λ-frequent n-itemset is an itemset such that any (n+1)-itemset of which it is a subset has s<λ. Alternatively stated, a max λ-frequent itemset is a λ-frequent itemset whose proper supersets are all infrequent (s<λ). It is important to note that for a given support threshold λ, FIM may output multiple itemsets as max frequent n-itemsets and these max itemsets may range in cardinality (e.g., 2-itemsets and 4-itemsets without 3-itemsets) (Fig. 3). As illustrated (Fig. 3), given a database (upper left), the FIM algorithm computes frequent itemsets (gray rectangles) and max-frequent itemsets (black rectangles) with the support threshold λ, while omitting sub-threshold trials within the database (white rectangles) from the final output. The max-frequent itemset with the highest occurrence and largest size in the database represents the final solution (e.g., Fig. 1D).
Because numerous potentially coincident event combinations must be evaluated to identify at least one pattern in a database, data-mining algorithms such as FIM provide efficient computational execution in terms of memory usage, disk access, and computational burden to search for putative patterns, aiming to avoid spurious results. Among many different FIM implementations (Agrawal et al., 1996, Borgelt, 2005, Zaki, 2000), the APRIORI algorithm (see Algorithm 2) is likely the best known and most often used approach over decades (Bodon, 2003). Yet, there exist few applications of FIM to epilepsy data (Bourien et al., 2005, Bourien et al., 2004, Exarchos et al., 2006, Smart et al., 2012) and none apply FIM in the same manner that we present with this work.
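
To make the support and max-frequent definitions concrete, here is a small brute-force Python sketch on a hypothetical five-transaction database (the feature names `f1` to `f4` and the threshold 0.6 are invented for illustration). It treats an itemset as frequent when s ≥ λ, which is an assumption consistent with infrequent meaning s<λ. The enumeration is exhaustive and only suitable for toy inputs.

```python
from itertools import combinations

def support(itemset, database):
    """Fraction of transactions that contain every item of itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in database) / len(database)

def frequent_itemsets(database, lam):
    """Exhaustively list every itemset whose support is at least lam."""
    items = sorted(set().union(*database))
    found = {}
    for n in range(1, len(items) + 1):
        for combo in combinations(items, n):
            s = support(combo, database)
            if s >= lam:
                found[frozenset(combo)] = s
    return found

def maximal(frequent):
    """Keep the frequent itemsets that have no frequent proper superset."""
    return {a: s for a, s in frequent.items()
            if not any(a < b for b in frequent)}

# Hypothetical database: one transaction per feature-selection trial.
database = [frozenset(t) for t in (
    {"f1", "f2", "f3"},
    {"f1", "f2"},
    {"f1", "f2", "f4"},
    {"f3", "f4"},
    {"f1", "f2", "f3"},
)]

max_sets = maximal(frequent_itemsets(database, lam=0.6))
for itemset, s in max_sets.items():
    print(sorted(itemset), s)
```

With these inputs the max-frequent itemsets are {f1, f2} (support 0.8) and {f3} (support 0.6), so a single threshold yields maximal itemsets of different cardinalities, as noted above.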

Algorithm 2

Pseudocode for APRIORI FIM Algorithm.
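
For orientation only (this is not the authors' Algorithm 2), a minimal Python sketch of the textbook level-wise Apriori search follows. It relies on the property that every subset of a frequent itemset is itself frequent, so candidate k-itemsets are built solely from frequent (k-1)-itemsets. The function name `apriori`, the s ≥ λ convention, and the toy transactions are assumptions.

```python
from itertools import combinations

def apriori(database, lam):
    """Return {itemset: support} for all itemsets with support >= lam."""
    n_trans = len(database)

    def sup(candidate):
        return sum(candidate <= t for t in database) / n_trans

    # Level 1: frequent single items.
    singles = {frozenset([item]) for t in database for item in t}
    level = {c: s for c, s in ((c, sup(c)) for c in singles) if s >= lam}
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join: unions of two frequent (k-1)-itemsets that contain exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: drop candidates that have any infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev
                             for sub in combinations(c, k - 1))}
        # Count: scan the database only for the surviving candidates.
        scored = {c: sup(c) for c in candidates}
        level = {c: s for c, s in scored.items() if s >= lam}
        frequent.update(level)
        k += 1
    return frequent

# Hypothetical transactions, e.g. feature subsets from repeated selection trials.
trials = [frozenset(t) for t in (
    {"f1", "f2", "f3"},
    {"f1", "f2"},
    {"f1", "f2", "f4"},
    {"f3", "f4"},
    {"f1", "f2", "f3"},
)]
result = apriori(trials, lam=0.6)
```

The pruning step is what keeps the search tractable: an item such as `f4` that fails the threshold at level 1 never appears in any later candidate.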

We present a framework to essentially compute confidence intervals for GP-based feature selection that categorizes epileptic biomarkers (not seizures but interictal resting-state activity) and brain activity not considered as epileptic biomarkers by implementing the APRIORI FIM algorithm after several GP feature-selection trials. This approach (Fig. 1) transforms stochastic results of the GP analysis into more deterministic results via FIM. In Section 2, we explain the approach details: Section 2.1 for the acquisition of each of the iEEG and fMRI signals (Fig. 1A); Section 2.2 for computation of signal measures via the feature extraction process (e.g., Fig. 1B); Section 2.3 for selection of a subset of these measures using GP, a process that we repeated in multiple trials for application of FIM (Figs. 1C and 2); Section 2.4 for recognition of patterns in repeatedly selected measures (features) via FIM (Fig. 1D); and Section 2.5 for our three main computational experiments to apply and validate the framework.

Section snippets

Signal acquisition

We analyzed fMRI collected from one patient group and iEEG collected from another patient group. For each de-identified dataset, the Institutional Review Boards at the Georgia Institute of Technology, Emory University, and the University of Pennsylvania approved data analysis. For each dataset, a board-certified clinician annotated “gold standard” epileptic biomarkers, which provided classification labels (i.e., biomarker, non-biomarker) for the in silico experiments.

We retrospectively analyzed

Observed patterns with FIM features

In our first experiment, we observed whether any pattern resulted from mining the GP-selected features across patients and generations to evaluate the reproducibility and subset size of the GP-based feature-selection. We computed the max frequent itemsets, or the most frequently occurring feature subsets among the 100 trials, for each patient, biological data modality, and number of GP iterations (Table 4, Table 5). For each max frequent itemset, the first item occurred most and the last item

Discussion

Applying FIM to repeated GP-based feature selection, we found patterns in the cardinality of the selected feature subsets, reproducibility of the subsets, and correlations between infrequently selected measures as well as a validation of patient-specific feature subsets.

Conclusions

We developed a method to categorize biomarker and non-biomarker epileptic activity in iEEG and fMRI signals from epilepsy patients by combining GP and FIM techniques. We used FIM to compute qualitative confidence intervals for features selected via a GP algorithm. We observed within-subject consistency and across-subject variability for GP-based feature selection for both fMRI and iEEG signals. We concluded that the problem of detecting interictal biomarkers for each iEEG and fMRI signal

Acknowledgments

Grant funds from the United Negro College Fund Special Programs Corporation NASA Harriett G. Jenkins Pre-doctoral Fellowship Program to Dr. Smart and the National Institute of Neurological Disorders and Stroke (1R01NS048598-01A2) to both Drs. Burrell and Smart provided partial research support for this work. The authors thank the physicians from the Center for Functional Neuroimaging and Department of Neurology at the University of Pennsylvania, the Children’s Hospital of Philadelphia, and the

References (138)

  • J.J. Halford

    Computerized epileptiform transient detection in the scalp electroencephalogram: obstacles to progress and the example of computerized ECG interpretation

    Clin. Neurophysiol.

    (2009)
  • Y. Han et al.

    Features and futures: seizure detection in partial epilepsies

    Neurosurg. Clin. N. Am.

    (2011)
  • K.-C. Hsu et al.

    Detection of seizures in EEG using subband nonlinear parameters and genetic algorithm

    Comput. Biol. Med.

    (2010)
  • H.S. Lopes

    Genetic programming for epileptic pattern recognition in electroencephalographic signals

    Appl. Soft Comput.

    (2007)
  • M.A. Navakatikyan et al.

    Seizure detection algorithm for neonates based on wave-sequence analysis

    Clin. Neurophysiol.

    (2006)
  • H. Ocak

    Optimal classification of epileptic seizures in EEG using wavelet analysis and genetic algorithm

    Signal Process.

    (2008)
  • J.W. Pan et al.

    Intracranial EEG power and metabolism in human epilepsy

    Epilepsy Res.

    (2009)
  • L.M. Patnaik et al.

    Epileptic EEG detection using neural networks and post-classification

    Comput. Methods Programs Biomed.

    (2008)
  • R. Agrawal et al.

    Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data

    (1993)
  • R. Agrawal et al.

    Fast discovery of association rules, advances in knowledge discovery and data mining

    Am. Assoc. Artif. Intell.

    (1996)
  • R. Agrawal et al.

    Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th International Conference on Very Large Data Bases

    (1994)
  • R. Agrawal et al.

    Mining sequential patterns. In: Proceedings of the Eleventh International Conference on Data Engineering

    (1995)
  • L. Ayoubian et al.

    Automatic seizure detection in SEEG using high frequency activities in wavelet domain

    Med. Eng. Phys.

    (2012)
  • T. Baeck et al.

    Handbook of Evolutionary Computation

    (1997)
  • M. Bandarabadi et al.

    Wepilet, optimal orthogonal wavelets for epileptic seizure prediction with one single surface channel

    Conf. Proc. IEEE Eng. Med. Biol. Soc.

    (2011)
  • Barricelli, N.A., 1957. Symbiogenetic Evolution Processes Realized by Artificial...
  • N.A. Barricelli

    Numerical testing of evolution theories

    Acta Biotheor.

    (1962)
  • N.A. Barricelli

    Numerical testing of evolution theories. Part II. Preliminary tests of performance, symbiogenesis and terrestrial life

    Acta Biotheor.

    (1963)
  • G. Beni

    From swarm intelligence to swarm robotics. In: Proceedings of the 2004 International Conference on Swarm Robotics

    (2005)
  • Beni, G., Wang, J., 1989. Swarm Intelligence in Cellular Robotic Systems, NATO Advanced Workshop on Robotics and...
  • G. Bettus et al.

    Interictal functional connectivity of human epileptic networks assessed by intracerebral EEG and BOLD signal fluctuations

    PLoS One

    (2011)
  • C. Blum et al.

    Swarm Intelligence: Introduction and Applications

    (2008)
  • Bodon, F., 2003. A fast APRIORI implementation. In: CEUR Workshop Proceedings. IEEE ICDM Workshop on Frequent Itemset...
  • F. Bodon

    A trie-based APRIORI implementation for mining frequent item sequences. In: Proceedings of the First International Workshop on Open Source Data Mining: Frequent Pattern Mining Implementations

    (2005)
  • E. Bonabeau et al.

    Swarm Intelligence: From Natural to Artificial Systems

    (1999)
  • C. Borgelt

    An implementation of the FP-growth algorithm. In: Proceedings of the First International Workshop on Open Source Data Mining: Frequent Pattern Mining Implementations

    (2005)
  • J. Bourien et al.

    Mining reproducible activation patterns in epileptic intracerebral EEG signals: application to interictal activity

    IEEE Trans. Biomed. Eng.

    (2004)
  • A. Bragin et al.

    High-frequency oscillations in human brain

    Hippocampus

    (1999)
  • A. Bragin et al.

    High-frequency oscillations in epileptic brain

    Curr. Opin. Neurol.

    (2010)
  • Burrell, L., Vachtsevanos, G.J., Glynn, S., Litt, B., 2007a. Feature analysis of functional MRI for discrimination...
  • Burrell, L.S., Glynn, S.M., Vachtsevanos, G.J., Litt, B., 2007b. Feature analysis of functional MRI for discrimination...
  • J.M. Chambers

    Graphical Methods for Data Analysis

    (1983)
  • B. Crepon et al.

    Mapping interictal oscillations greater than 200 Hz recorded with intracranial macroelectrodes in human epilepsy

    Brain

    (2010)
  • C. Darwin

    The Origin of Species by Means of Natural Selection

    (1978)
  • L. Davis

    Adapting operator probabilities in genetic algorithms. In: Proceedings of the Third International Conference on Genetic Algorithms

    (1989)
  • A. Donaire et al.

    Sequential analysis of fMRI images: a new approach to study human epileptic networks

    Epilepsia

    (2009)
  • M. Dorigo et al.

    Ant colony system: a cooperative learning approach to the traveling salesman problem

    IEEE Trans. Evol. Comput.

    (1997)
  • R.O. Duda et al.

    Pattern Classification

    (2000)
  • J. Engel

    Surgical Treatment of the Epilepsies

    New York, NY

    (1987)
  • J. Engel

    Clinical neurophysiology, neuroimaging, and the surgical treatment of epilepsy

    Curr. Opin. Neurol. Neurosurg.

    (1993)

Cited by (12)

    • Image feature selection using genetic programming for figure-ground segmentation

      2017, Engineering Applications of Artificial Intelligence
      Citation Excerpt :

      However, this method may be inefficient for problems with a large number of samples or classes. Smart and Burrell (2015) apply GP to design a filter based method to select features for pattern classification problems on functional magnetic resonance imaging (fMRI) and intra-cranial electroencephalogram (iEEG) signals. The lexicographic parsimony pressure is used to control bloat in GP.

    • A novel genetic programming approach for epileptic seizure detection

      2016, Computer Methods and Programs in Biomedicine
      Citation Excerpt :

      Wang et al. [17] modified the feature extraction with the use of Wavelet Transform along with Shannon Entropy. Smart et al. [18] demonstrated that implicitly selecting features with a genetic programming (GP) algorithm more effectively determined the proper features to discern biomarker and non-biomarker interictal iEEG and fMRI activity than conventional feature selection approaches. Nicolaou et al. [19] integrated the concept of permutation entropy with the support vector machine to achieve very high classification accuracy.

    • Effectiveness of Feature Selection in Text Summarization

      2023, Proceedings - 11th IEEE International Conference on Intelligent Computing and Information Systems, ICICIS 2023