Evolutionary combination of kernels for nonlinear feature transformation
Introduction
Feature extraction is a crucial step in the pattern recognition process and greatly affects the performance of pattern recognition systems. In this step, useful discriminative information should be extracted from the pattern in such a way that a classifier can distinguish different patterns. Several methods have been proposed to extract features that are highly discriminative and also robust to noise. One group of these methods is based on feature mapping using linear or nonlinear transformation approaches.
Some well-known examples of linear transformation methods are Principal Component Analysis (PCA) [8], Linear Discriminant Analysis (LDA) [8] and its family, including heteroscedastic LDA (HLDA) [22], pairwise LDA (PLDA) [23], null space based linear discriminant analysis (NLDA) [24] and evolutionary-based LDA [28]. These methods project the original feature vectors into a new feature space via a linear transformation based on different mapping criteria. For example, PCA projects the data onto the directions of largest variance in the original feature space, while LDA maximizes the ratio of between-class variation to within-class variation when projecting features into a subspace. However, the drawback of these transformations is that their mapping criteria differ from the classifier error criterion [37], [40], which can degrade classifier performance. There are several ways to overcome this drawback [19]. For example, the authors of [40] improve these transformations by minimizing the Hidden Markov Model (HMM) classification error.
In the nonlinear feature transformation approaches, by Cover's theorem, if the input feature space is mapped nonlinearly to a high-dimensional space, non-separable patterns can become linearly separable in the transformed space [18]. This mapping is usually done via a kernel-based transformation such as kernel PCA (KPCA) [18], [35], kernel LDA (KLDA) [18], [27] and kernel class-wise locality preserving projection (KCLPP) [20], or via a neural network, as in nonlinear PCA (NLPCA) [25].
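As a toy illustration of Cover's theorem (our example, not taken from the paper), the XOR pattern is not linearly separable in two dimensions, but appending the single product feature x1·x2 makes it separable by a hyperplane in the mapped space:

```python
import numpy as np

# XOR patterns: no single line separates the two classes in the 2-D space.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

def phi(x):
    """Nonlinear map into 3-D: append the product feature x1 * x2."""
    return np.array([x[0], x[1], x[0] * x[1]])

Z = np.array([phi(x) for x in X])

# In the mapped space, the hyperplane z1 + z2 - 2*z3 = 0.5 separates the classes.
scores = Z @ np.array([1.0, 1.0, -2.0]) - 0.5
pred = (scores > 0).astype(int)
print(pred)  # [0 1 1 0], matching y
```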
However, the performance of kernel-based transformation methods and kernel-based classification methods depends on the choice of kernel function and its parameters. Consequently, many approaches have been proposed to determine a suitable kernel function, especially for kernel-based classifiers. The best-known methods are kernel estimation [36], kernel parameter optimization [21], [43], multiple kernel learning [2], [11], [12], [42], determining the kernel based on mutual information [5] and kernel optimization using evolutionary algorithms and convex optimization methods [15], [16]. As another approach, genetic programming (GP) has been used to improve the support vector machine (SVM) kernel function for higher classification accuracy [10], [14], [34]. In [14], the authors find a near-optimal kernel function using strongly typed genetic programming (STGP [29]). In [3], the authors propose a GP-based kernel construction method for the relevance vector machine (RVM). In [26], [30], [31], the authors obtain SVM kernel functions and their parameters using a genetic algorithm whose fitness function is the SVM classification error.
In this paper, we propose two methods to obtain more suitable kernel functions for kernel-based feature transformation methods (KPCA and KLDA), given that their mapping criteria do not consider the classifier error criterion. To this end, we use the classification error and the mutual information between features and classes in the kernel feature space as criteria for determining the kernel function. For this purpose, we construct new kernel functions as linear and nonlinear combinations of basic kernel functions. The linear combination is performed using a genetic algorithm, while genetic programming is used for the nonlinear combination. The classification error and the mutual information between features and classes serve as fitness functions for the genetic algorithm and genetic programming.
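A minimal sketch of the linear-combination idea follows. This is our simplification, not the paper's exact configuration: the choice of base kernels, the mutation-only evolutionary loop and the SVM cross-validation accuracy surrogate for the fitness are all illustrative assumptions (the paper's fitness functions are the classification error and the mutual information between features and classes).

```python
import numpy as np
from sklearn.metrics.pairwise import linear_kernel, polynomial_kernel, rbf_kernel
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Base kernels to combine linearly: K_w(x, y) = sum_i w_i * K_i(x, y).
# A non-negative weighted sum of Mercer kernels is itself a Mercer kernel.
BASE_KERNELS = [
    linear_kernel,
    lambda A, B: polynomial_kernel(A, B, degree=2),  # degree is an assumption
    lambda A, B: rbf_kernel(A, B, gamma=0.5),        # gamma is an assumption
]

def combined_gram(w, X):
    """Gram matrix of the weighted kernel combination."""
    return sum(wi * k(X, X) for wi, k in zip(w, BASE_KERNELS))

def fitness(w, X, y):
    """Surrogate fitness: cross-validated accuracy of an SVM on the
    precomputed combined kernel (a stand-in for the paper's criteria)."""
    clf = SVC(kernel="precomputed")
    return cross_val_score(clf, combined_gram(w, X), y, cv=3).mean()

def evolve_weights(X, y, pop_size=20, generations=30, seed=0):
    """Mutation-only evolutionary loop with truncation selection."""
    rng = np.random.default_rng(seed)
    pop = rng.random((pop_size, len(BASE_KERNELS)))  # non-negative weights
    for _ in range(generations):
        scores = np.array([fitness(w, X, y) for w in pop])
        parents = pop[np.argsort(scores)[-(pop_size // 2):]]  # keep the best half
        children = np.abs(parents + rng.normal(0.0, 0.1, parents.shape))
        pop = np.vstack([parents, children])
    scores = np.array([fitness(w, X, y) for w in pop])
    return pop[np.argmax(scores)]
```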
The rest of the paper is organized as follows. In Section 2, we briefly explain the kernel idea, KPCA and KLDA. Section 3 explains kernel combination and our methods for combining kernels linearly and nonlinearly using a genetic algorithm and genetic programming, respectively. Section 4 contains our experiments and results. Finally, we give our conclusion in Section 5.
Kernel-based feature mapping
Kernel-based nonlinear feature transformation and classification form a growing research area in machine learning. Besides their use in kernel-based classifiers such as the well-known SVM, kernel-based methods are used to transform and map the feature space. KPCA and KLDA are well-known examples of kernel-based feature transformation methods.
Kernel functions allow us to compute dot products in the mapped higher-dimensional space without explicitly performing the mapping. Based on Mercer's theorem, any continuous, symmetric, positive semi-definite kernel K(x, y) can be expressed as an inner product in some feature space, K(x, y) = ⟨φ(x), φ(y)⟩, where φ is the (possibly implicit) nonlinear mapping.
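For instance (our illustration), the homogeneous polynomial kernel of degree 2 on 2-D inputs, K(x, y) = (x·y)², equals the dot product under the explicit map φ(x) = (x1², √2·x1x2, x2²); the following snippet verifies this equivalence:

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for 2-D inputs."""
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

x, y = np.array([1.0, 2.0]), np.array([3.0, 0.5])

k_implicit = np.dot(x, y) ** 2        # kernel trick: no explicit mapping
k_explicit = np.dot(phi(x), phi(y))   # dot product in the mapped space
assert np.isclose(k_implicit, k_explicit)  # both equal 16.0
```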
Kernel combination
The performance of kernel methods usually depends significantly on an appropriate choice of the kernel function. Several methods have been proposed to learn the kernel from data or to combine kernels [34], [38]. Most of these methods are used for classification with SVMs, but regression [38] and clustering [38] have also been studied.
As mentioned earlier, KPCA and KLDA have the disadvantage that their mapping criteria differ from the classifier error criterion and are independent of the classifier. In this section, we therefore combine basic kernel functions, linearly via a genetic algorithm and nonlinearly via genetic programming, so that the classification error and the mutual information between features and classes guide the choice of kernel.
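Nonlinear combinations remain valid kernels because Mercer kernels are closed under addition, multiplication, positive scaling and pointwise exponentiation, so a GP tree built from these operators always evaluates to a valid kernel. A minimal sketch of one such tree (the structure and constants are hypothetical, not evolved):

```python
import numpy as np
from sklearn.metrics.pairwise import polynomial_kernel, rbf_kernel

def gp_tree_kernel(A, B):
    """Example nonlinear combination: (K_rbf * K_poly) + 0.5 * exp(K_rbf)."""
    k1 = rbf_kernel(A, B, gamma=0.1)
    k2 = polynomial_kernel(A, B, degree=2)
    return k1 * k2 + 0.5 * np.exp(k1)

# Genetic programming would evolve trees of this form, scoring each candidate
# kernel by classification error or mutual information in the mapped space.
```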
Experiments and results
In this section, we compare the proposed methods with conventional KLDA and KPCA using a Gaussian kernel and report the results in the following sub-sections. We employ various performance measures, including the Dunn, SD and Davies–Bouldin (DB) indices, in order to show class separation and discrimination before and after applying the transformations to the features. We also report the classification accuracy after applying the mentioned feature transformation techniques to the UCI datasets and the Aurora2 speech database.
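For reference (our sketch, not the paper's evaluation code), the Davies–Bouldin index is available in scikit-learn; lower values indicate more compact, better-separated classes. Assuming scikit-learn's KernelPCA as a stand-in transform:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import KernelPCA
from sklearn.metrics import davies_bouldin_score

X, y = load_iris(return_X_y=True)

# Separability of the class labels in the original feature space.
db_before = davies_bouldin_score(X, y)

# Separability after a kernel-based transformation (RBF KPCA here).
Z = KernelPCA(n_components=2, kernel="rbf", gamma=0.1).fit_transform(X)
db_after = davies_bouldin_score(Z, y)

print(f"DB before: {db_before:.3f}, after KPCA: {db_after:.3f}")
```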
Conclusion
In this paper, we proposed two methods to obtain a kernel function for KPCA-based feature mapping such that the classification error and mutual information can be taken into account in determining the kernel function. For this purpose, we combined kernel functions in linear and nonlinear manners using a genetic algorithm and genetic programming, respectively, where the fitness functions for these evolutionary algorithms are computed from the classification error and the mutual information between features and classes.
References
- et al., Averaging of kernel functions, Neurocomputing (2013)
- et al., Localized algorithms for multiple kernel learning, Pattern Recogn. (2013)
- et al., The optimization of the kind and parameters of kernel function in KPCA for process monitoring, Comput. Chem. Eng. (2012)
- et al., Feature extraction using a fast null space based linear discriminant analysis algorithm, Inform. Sci. (2012)
- et al., Improving linear discriminant analysis with artificial immune system-based evolutionary algorithms, Inform. Sci. (2012)
- et al., Semi-supervised kernel density estimation for video annotation, Comput. Vis. Image Underst. (2009)
- et al., Feature extraction and dimensionality reduction algorithms and their applications in vowel recognition, Pattern Recogn. (2003)
- et al., Learning the kernel matrix by maximizing a KFD-based class separability criterion, Pattern Recogn. (2007)
- et al., Optimized discriminative transformations for speech features based on minimum classification error, Pattern Recogn. Lett. (2011)
- et al., A mutual information kernel for sequence, IEEE Int. Joint Conf. Neural Netw. (2008)
- A cluster separation measure, IEEE Trans. Pattern Anal. Mach. Intell.
- Multi-Objective Optimization using Evolutionary Algorithms
- Pattern Classification
- Well separated clusters and optimal fuzzy partitions, J. Cybern.
- Genetic programming for kernel-based learning with co-evolving subsets selection, PPSN
- Multiple kernel learning algorithms, J. Mach. Learn. Res.
- The genetic kernel support vector machine: description and evaluation, Artif. Intell. Rev.