
Information Sciences

Volume 274, 1 August 2014, Pages 95-107

Evolutionary combination of kernels for nonlinear feature transformation

https://doi.org/10.1016/j.ins.2014.02.140

Abstract

The performance of kernel-based feature transformation methods depends on the choice of kernel function and its parameters. Moreover, most of these methods do not take classification information or classification error into account when mapping the features. In this paper, we propose to determine a kernel function for kernel principal component analysis (KPCA) and kernel linear discriminant analysis (KLDA) that considers the classification information. To this end, we combine conventional kernel functions in linear and non-linear forms using a genetic algorithm and genetic programming, respectively. We use the classification error and the mutual information between features and classes in the kernel feature space as evolutionary fitness functions. The proposed methods are evaluated on University of California Irvine (UCI) datasets and the Aurora2 speech database, using clustering validity indices and classification accuracy. The experimental results demonstrate that KPCA using a nonlinear combination of kernels obtained by genetic programming with the classification-error fitness function outperforms both conventional KPCA using a Gaussian kernel and KPCA using a linear combination of kernels.

Introduction

Feature extraction is a crucial step in the pattern recognition process that greatly affects the performance of pattern recognition systems. In this step, useful discriminative information should be extracted from the pattern in such a way that a classifier can distinguish different patterns. Several methods have been proposed to extract features that are highly discriminative and also robust to noise. One group of these methods is based on feature mapping using linear or non-linear transformations.

Well-known examples of linear transformation methods are Principal Component Analysis (PCA) [8], Linear Discriminant Analysis (LDA) [8] and its family, including Heteroscedastic LDA (HLDA) [22], pairwise LDA (PLDA) [23], null-space-based linear discriminant analysis (NLDA) [24] and evolutionary LDA [28]. These methods project the original feature vectors into a new feature space via a linear transformation based on different mapping criteria. For example, PCA projects the data onto the directions of largest variance in the original feature space, whereas LDA maximizes the ratio of between-class to within-class variation when projecting features into a subspace. However, the drawback of these transformations is that their mapping criteria differ from the classifier error criterion [37], [40] and can therefore degrade classifier performance. There are several ways to overcome this drawback [19]. One solution, proposed in [40], improves these transformations by minimizing the Hidden Markov Model (HMM) classification error.
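To make the two linear mapping criteria concrete, the following minimal sketch projects the same data with PCA, which ignores class labels and follows the directions of largest variance, and with LDA, which uses the labels to maximize the between-class to within-class variation ratio. scikit-learn and the Iris data are assumptions made for this illustration only; they are not prescribed by the paper.

```python
# Illustrative sketch of the two linear mapping criteria; scikit-learn
# and the Iris dataset are assumptions of this example, not the paper's setup.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

# PCA: unsupervised, keeps the directions of largest variance.
X_pca = PCA(n_components=2).fit_transform(X)

# LDA: supervised, maximizes between-class vs. within-class variation.
X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)

print(X_pca.shape, X_lda.shape)  # (150, 2) (150, 2)
```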

In nonlinear feature transformation approaches, Cover's theorem states that if the input feature space is mapped nonlinearly to a high-dimensional space, non-separable patterns can become linearly separable in the transformed space [18]. This mapping is usually done via a kernel-based transformation such as kernel PCA (KPCA) [18], [35], kernel LDA (KLDA) [18], [27] and kernel class-wise locality preserving projection (KCLPP) [20], or via a neural network such as nonlinear PCA (NLPCA) [25].
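As a small, hedged illustration of this effect (the two-circles toy data and scikit-learn are assumptions of this sketch, not part of the paper), a Gaussian-kernel KPCA mapping turns a radially separable but linearly non-separable problem into one that is close to linearly separable in the leading kernel principal components:

```python
# Sketch: concentric circles are not linearly separable in the input
# space, but after an RBF-kernel KPCA mapping the two classes separate
# along the leading kernel principal components.
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA

X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10.0)
X_kpca = kpca.fit_transform(X)
print(X_kpca.shape)  # (400, 2)
```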

However, the performance of kernel-based transformation methods and kernel-based classification methods depends on the choice of kernel function and its parameters. Consequently, many approaches have been proposed for determining a suitable kernel function, especially for kernel-based classifiers. The best-known methods are kernel estimation [36], kernel parameter optimization [21], [43], multiple kernel learning [2], [11], [12], [42], determining the kernel based on mutual information [5], and kernel optimization using evolutionary algorithms and convex optimization methods [15], [16]. Genetic programming (GP) has also been used to improve the support vector machine (SVM) kernel function for higher classification accuracy [10], [14], [34]. In [14], the authors find a near-optimal kernel function using strongly typed genetic programming (STGP) [29]. In [3], the authors propose a GP-based kernel construction method for the relevance vector machine (RVM). In [26], [30], [31], the authors obtain SVM kernel functions and their parameters using a genetic algorithm whose fitness function is the SVM classification error.

In this paper, we propose two methods to obtain more suitable kernel functions for the kernel-based feature transformation methods KPCA and KLDA, motivated by the fact that their mapping criteria do not consider the classifier error criterion. To this end, we use the classification error and the mutual information between features and classes in the kernel feature space as criteria for determining the kernel function. The new kernel functions are constructed as linear and non-linear combinations of basic kernel functions: the linear combination is found using a genetic algorithm, whereas genetic programming is used for the non-linear combination. The classification error and the mutual information between features and classes are used as fitness functions in both the genetic algorithm and the genetic programming.
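A minimal sketch of the linear-combination variant follows. It is not the authors' implementation: the base kernels, the GA settings, and the use of a precomputed-kernel SVM with cross-validated accuracy as a stand-in for the classification-error fitness are all assumptions made for illustration.

```python
# Sketch of the linear kernel-combination idea: a small genetic algorithm
# evolves non-negative weights w_i for a combined kernel K = sum_i w_i * K_i;
# each candidate is scored by a classification-accuracy fitness.
# The dataset, base kernels, GA settings and SVM surrogate are illustrative
# assumptions, not the paper's configuration.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel, linear_kernel
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X, y = load_wine(return_X_y=True)
X = (X - X.mean(0)) / X.std(0)

# Base kernel matrices K_i evaluated on the training data.
base_kernels = [rbf_kernel(X, gamma=0.1),
                polynomial_kernel(X, degree=2),
                linear_kernel(X)]

def fitness(w):
    # Combined kernel; fitness is cross-validated accuracy (1 - error).
    K = sum(wi * Ki for wi, Ki in zip(w, base_kernels))
    clf = SVC(kernel="precomputed", C=1.0)
    return cross_val_score(clf, K, y, cv=3).mean()

# Very small GA: truncation selection, arithmetic crossover, Gaussian mutation.
pop = rng.random((20, len(base_kernels)))
for gen in range(15):
    scores = np.array([fitness(w) for w in pop])
    parents = pop[np.argsort(scores)[-10:]]            # keep the best half
    children = []
    for _ in range(10):
        a, b = parents[rng.integers(10, size=2)]
        child = 0.5 * (a + b) + 0.05 * rng.standard_normal(len(a))
        children.append(np.clip(child, 0.0, None))     # keep weights non-negative
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(w) for w in pop])]
print("best kernel weights:", best)
```

In the non-linear variant, the same kind of fitness would instead score GP expression trees whose leaves are base kernels and whose internal nodes are kernel-preserving operators.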

The rest of the paper is organized as follows. In Section 2, we briefly explain the kernel idea, KPCA and KLDA. Section 3 explains kernel combination and our methods for combining kernels linearly and nonlinearly using a genetic algorithm and genetic programming, respectively. Section 4 contains our experiments and results. Finally, we give our conclusion in Section 5.


Kernel-based feature mapping

Kernel-based nonlinear feature transformation and classification form an active research area in machine learning. Besides kernel-based classifiers such as the well-known SVM, kernel-based methods are used to transform and map the feature space. KPCA and KLDA are well-known examples of kernel-based feature transformation methods.

Kernel functions allow us to compute dot products in the mapped higher-dimensional space without explicitly performing the mapping. Based on Mercer's theorem, a continuous, symmetric, positive semi-definite kernel k(x, y) can be written as an inner product in some feature space, k(x, y) = ⟨φ(x), φ(y)⟩, where φ is the (possibly implicit) mapping.
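As a quick numerical check of this property (a toy example using the standard explicit feature map of a degree-2 polynomial kernel; not taken from the paper), the kernel value computed in the input space matches the dot product of the explicitly mapped vectors:

```python
# Numerical check of the kernel trick for k(x, z) = (x . z)^2 in 2-D:
# the implicit kernel evaluation equals the dot product of the explicit
# quadratic feature maps phi(x) . phi(z).  Purely illustrative.
import numpy as np

def phi(x):
    # Explicit feature map for k(x, z) = (x^T z)^2 in two dimensions:
    # phi(x) = (x1^2, x2^2, sqrt(2) * x1 * x2)
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

k_implicit = np.dot(x, z) ** 2          # kernel evaluation in input space
k_explicit = np.dot(phi(x), phi(z))     # dot product after explicit mapping

print(k_implicit, k_explicit)           # both equal 1.0
```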

Kernel combination

The performance of kernel methods usually depends significantly on an appropriate choice of the kernel function. Several methods have been proposed to learn the kernel from data or to combine kernels [34], [38]. Most of these methods target classification with SVMs, but kernel learning for regression [38] and clustering [38] has also been studied.
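For intuition on what a non-linear combination can look like, the sketch below builds a kernel matrix from an arbitrary expression over base kernels and checks that it remains symmetric positive semi-definite. The closure properties used (sums, element-wise products, positive scalings and element-wise exponentials of valid kernels are again valid kernels) are standard; the particular expression, base kernels and library calls are assumptions of this example, not an individual found by the paper's GP search.

```python
# Hedged sketch of a non-linear combination of base kernel matrices.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel, polynomial_kernel, linear_kernel

X = np.random.default_rng(0).standard_normal((50, 4))

K1 = rbf_kernel(X, gamma=0.5)
K2 = polynomial_kernel(X, degree=2)
K3 = linear_kernel(X)

# Example expression: K = K1 * K2 + 0.3 * exp(K3 / max(K3))
K = K1 * K2 + 0.3 * np.exp(K3 / K3.max())

# A valid kernel matrix should be symmetric positive semi-definite.
eigvals = np.linalg.eigvalsh((K + K.T) / 2)
print("symmetric:", np.allclose(K, K.T), " min eigenvalue:", eigvals.min())
```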

As mentioned earlier, KPCA and KLDA have the disadvantage that their mapping criteria differ from the classifier error criterion and from the classifier itself. In this

Experiments and results

In this section, we compare the proposed methods with conventional KLDA and KPCA using a Gaussian kernel and report the results in the following sub-sections. We employ several performance measures, including the Dunn, SD and Davies-Bouldin (DB) indices, to show class separation and discrimination before and after applying the transformations to the features. We also report the classification accuracy after applying the mentioned feature transformation techniques to the UCI datasets and the Aurora2 speech database.
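A hedged sketch of this kind of evaluation (using scikit-learn's built-in Davies-Bouldin score and the UCI Wine data as stand-ins; the paper's own index implementations and parameter choices are not reproduced here) compares class separation before and after a kernel mapping:

```python
# Compare class separation before and after a kernel mapping using the
# Davies-Bouldin index (lower is better).  Dataset, kernel parameters and
# the use of scikit-learn's DB score are assumptions of this sketch.
from sklearn.datasets import load_wine
from sklearn.decomposition import KernelPCA
from sklearn.metrics import davies_bouldin_score
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X = StandardScaler().fit_transform(X)

X_kpca = KernelPCA(n_components=5, kernel="rbf", gamma=0.05).fit_transform(X)

print("DB before mapping:", davies_bouldin_score(X, y))
print("DB after  KPCA   :", davies_bouldin_score(X_kpca, y))
```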

Conclusion

In this paper, we proposed two methods to obtain a kernel function for KPCA-based feature mapping such that classification error and classification information are taken into account in determining the kernel function. For this purpose, we combined kernel functions in linear and non-linear manners using a genetic algorithm and genetic programming, respectively, where the fitness functions for these evolutionary algorithms are computed from the classification error and the mutual information between features and classes in the kernel feature space.

References (43)

  • W. Bing, Z. Wen-qiong, C. Ling, L. Jia-hong, A GP-based kernel construction and optimization method for RVM, in: The...
  • C. Blake, E. Keogh, C.J. Merz, UCI Repository of Machine Learning Databases, 1998.
  • M. Cuturi et al., A mutual information kernel for sequence, IEEE Int. Joint Conf. Neural Netw. (2004).
  • D.L. Davies et al., A cluster separation measure, IEEE Trans. Pattern Anal. Mach. Intell. (1979).
  • K. Deb, Multi-Objective Optimization using Evolutionary Algorithms (2001).
  • R.O. Duda et al., Pattern Classification (2001).
  • J.C. Dunn, Well separated clusters and optimal fuzzy partitions, J. Cybern. (1974).
  • C. Gagne et al., Genetic programming for kernel-based learning with co-evolving subsets selection, PPSN (2006).
  • M. Gonen et al., Multiple kernel learning algorithms, J. Mach. Learn. Res. (2011).
  • M. Halkidi, M. Vazirgiannis, Y. Batistakis, Quality scheme assessment in the clustering process, in: Proceedings of the 4th...
  • T. Howley et al., The genetic kernel support vector machine: description and evaluation, Artif. Intell. Rev. (2005).