Elsevier

Applied Soft Computing

Volume 40, March 2016, Pages 569-580

PGGP: Prototype Generation via Genetic Programming

https://doi.org/10.1016/j.asoc.2015.12.015

Highlights

  • A genetic program for prototype generation (PGGP) is proposed.

  • PGGP learns to combine instances via genetic programming to generate prototypes.

  • An extensive experimental evaluation is performed.

  • The proposed method is compared to many other techniques.

  • The proposed approach compares favorably with most generation methods proposed so far.

Abstract

Prototype generation (PG) methods aim to find a small set of instances, derived from a large training data set, such that the classification performance obtained with these prototypes (commonly, using a 1NN classifier) is equal to or better than that obtained with the original training set. Several PG methods have been proposed so far; most of them take a small subset of training instances as initial prototypes and modify them in an attempt to maximize classification performance on the whole training set. Although some of these methods have obtained acceptable results, training instances may be under-exploited, because most of the time they are used only to guide the search process. This paper introduces a PG method based on genetic programming in which many training samples are combined through arithmetic operators to build highly effective prototypes. The genetic program aims to generate prototypes that maximize an estimate of the generalization performance of a 1NN classifier. Experimental results are reported on a benchmark for assessing PG methods. Several aspects of the genetic program are evaluated, and its performance is compared with that of many alternative PG methods. The empirical assessment shows the effectiveness of the proposed approach, which outperforms most state-of-the-art PG techniques on both small and large data sets. Better results were obtained for data sets with only numeric attributes, although the performance of the proposed technique on mixed data was very competitive as well.

Introduction

Pattern classification is the task of associating objects with labels, where objects are usually represented by numerical vectors. The field has been studied extensively and a wide diversity of methods is available (see, e.g., [13]). Among the most popular pattern classification methods are those based on similarity or distance estimation. Methods of this type rely on similarity measures to assign labels to new objects; a representative classifier of this kind is KNN [3]. Similarity-based methods have proved to be very effective on classical pattern classification tasks, including handwritten digit recognition and text categorization. However, despite their acceptable performance, they require computing similarity estimates against all of the training objects whenever a new instance needs to be classified, which can be computationally expensive. Besides, methods of this type require considerable storage resources and can be sensitive to noisy instances.
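To make the cost argument concrete, a minimal 1NN classifier can be sketched as follows (an illustrative example, not code from the paper). Note that classifying a single new object requires one distance computation per stored training instance, which is exactly what prototype-based methods try to reduce:

```python
import numpy as np

def one_nn_predict(X_train, y_train, x):
    """Assign x the label of its nearest training instance.

    The cost is one distance computation per stored example, so the
    per-query time grows linearly with the size of the training set.
    """
    dists = np.linalg.norm(X_train - x, axis=1)  # distance to every stored instance
    return y_train[np.argmin(dists)]

# Toy data: two well-separated classes in 2D.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
print(one_nn_predict(X_train, y_train, np.array([0.2, 0.1])))  # 0
print(one_nn_predict(X_train, y_train, np.array([4.8, 5.1])))  # 1
```

Replacing `X_train` with a handful of prototypes leaves this prediction rule unchanged while shrinking both the per-query cost and the storage footprint.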

Prototype-based classifiers aim to alleviate the above issues by using only a subset of representative instances for classification instead of the whole training set. Their main goal is to achieve performance comparable to methods that use the whole data set, while reducing computational cost and storage requirements. The key issue in prototype-based classification is determining which prototypes to use for classification. There are two main approaches to this problem: selection and generation of prototypes. In the former, a subset of the training objects is selected as the set of prototypes [20], [10]. The latter consists of generating a set of representative instances using information from the data set [23]. Although there are no comprehensive studies comparing generation and selection strategies,1 generation methods are more general than selection ones; in fact, prototype selection can be considered a special case of prototype generation (PG) [23].

This paper introduces a genetic programming approach to the PG problem. The proposed method combines instances from the original training set to produce prototypes. The instances to be merged and the combination strategy are determined automatically via genetic programming, where the genetic program aims to generate prototypes that maximize an estimate of the generalization performance of a 1NN classification rule. The proposed strategy automatically selects the number of prototypes per class. It also generates prototypes by combining many training examples, in contrast to previous work that uses training instances only to guide the search process. Additionally, the formulation allows us to generate prototypes that are non-linear combinations of instances, which may help to better characterize the original input space.
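The core idea can be illustrated with a deliberately simplified sketch (ours, not the authors' implementation): prototypes start as class means, candidate prototypes are produced by arithmetic combinations of same-class training instances, and a candidate replaces the current prototype for its class only when held-out 1NN accuracy does not decrease. A full genetic program would evolve expression trees over instances with crossover and mutation; the hill-climbing loop below merely stands in for that search, and the operator set and scaling are hypothetical choices:

```python
import random
import numpy as np

def combine(instances, rng):
    # Hypothetical operator set: pick two same-class instances and
    # merge them with a random arithmetic operator.
    a, b = rng.sample(list(instances), 2)
    op = rng.choice([np.add, np.subtract, np.multiply])
    return op(a, b) / 2.0  # rescale to keep values near the data range

def one_nn_accuracy(protos, proto_labels, X, y):
    # Accuracy of the 1NN rule when only the prototypes are stored.
    hits = sum(proto_labels[np.argmin(np.linalg.norm(protos - x, axis=1))] == t
               for x, t in zip(X, y))
    return hits / len(y)

def generate_prototypes(X_tr, y_tr, X_val, y_val, n_iter=300, seed=1):
    rng = random.Random(seed)
    classes = np.unique(y_tr)
    # One prototype per class, initialized to the class mean.
    protos = np.array([X_tr[y_tr == c].mean(axis=0) for c in classes])
    best = one_nn_accuracy(protos, classes, X_val, y_val)
    for _ in range(n_iter):
        i = rng.randrange(len(classes))  # class whose prototype we perturb
        trial = protos.copy()
        trial[i] = combine(X_tr[y_tr == classes[i]], rng)
        acc = one_nn_accuracy(trial, classes, X_val, y_val)
        if acc >= best:  # fitness: estimated (held-out) 1NN accuracy
            protos, best = trial, acc
    return protos, best

# Toy demo: two well-separated Gaussian clusters.
g = np.random.default_rng(0)
X0 = g.normal(0.0, 0.5, size=(30, 2))
X1 = g.normal(3.0, 0.5, size=(30, 2))
X_tr = np.vstack([X0[:20], X1[:20]]); y_tr = np.array([0] * 20 + [1] * 20)
X_val = np.vstack([X0[20:], X1[20:]]); y_val = np.array([0] * 10 + [1] * 10)
protos, acc = generate_prototypes(X_tr, y_tr, X_val, y_val)
```

Unlike this sketch, the actual method lets the genetic program decide how many prototypes each class gets and searches over richer, possibly non-linear, combinations of many instances.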

An experimental assessment of the proposed strategy is carried out on a suite of benchmark pattern-classification problems [23]. This benchmark allows us to compare the proposed approach with the most representative PG methods. In terms of accuracy and data set reduction, the proposed method outperforms most alternative approaches on both small and large data sets. Despite its effectiveness, the intuition behind the proposed method is very simple, and there are many ways in which the approach can be extended.

One should note that although our main goal is to generate prototypes for pattern classification with KNN techniques, there are other tasks and problems that could benefit from PG techniques like ours, including: oversampling, where the goal is to generate new/artificial instances to reduce the class-imbalance problem; codebook learning, where one wants to learn a set of instances that serve as references for representing more complex objects (e.g., images [5] or videos [25]); and instance weighting for domain adaptation, where one wants to find important instances in the source domain to be exploited in the target domain. Besides, it has been shown very recently that a similar formulation can be adopted to generate features [11].

The rest of this paper is organized as follows. The next section reviews related work on PG, emphasizing methods based on heuristic optimization. Section 3 introduces the proposed GP approach. Section 4 describes the experimental settings and reports the results obtained with the proposed strategy. Finally, Section 5 outlines conclusions and directions for future work.

Section snippets

Related work

Triguero et al. [23] presented a taxonomy and a comparative study of several PG methods.2 A total of 32 different strategies are classified, and an experimental comparison among 25 of these methods is reported. In that study, the method achieving the highest classification performance is GENN (Generalized Editing using NN) [15], a decremental method that removes and relabels instances. GENN is a conservative method because it aims to edit only to an extent

PGGP: Prototype Generation via Genetic Programming

This paper introduces PGGP: a method for Prototype Generation via Genetic Programming. PGGP automatically combines instances from a particular class to generate classification prototypes for that class. The combination strategy is determined by a genetic program that aims at maximizing an estimate of the generalization performance of a 1NN classifier. Although the prototypes of a class are determined only by examples of its class, under the proposed approach the prototypes for all classes are

Experiments and results

This section reports experimental results obtained with PGGP on a benchmark for evaluating PG methods. The goals of this experimental assessment are to analyze the performance of PGGP under different settings, to determine the impact that the mitosis operator has on the performance of PGGP, and to compare the performance of PGGP with that of other state-of-the-art PG methods.

Conclusions

We introduced a PG method based on genetic programming in which many training samples are combined through arithmetic operators to build highly effective prototypes. The genetic program aims to generate prototypes that maximize an estimate of the generalization performance of a 1NN classifier. We extensively evaluated the performance of the proposed method and compared it to a wide variety of methods, including the best PG methods proposed so far. Experimental results allow us to draw the

Acknowledgments

This work was partially supported by the LACCIR programme under project ID R1212LAC006. Hugo Jair Escalante was supported by the internships programme of CONACyT under Grant No. 234415.

References (25)

  • F. Fernandez et al., Evolutionary design of nearest prototype classifiers, J. Heuristics (2004)

  • U. Garain, Prototype reduction using an artificial immune system, Pattern Anal. Appl. (2008)