ABSTRACT
Feature selection is an important process within machine learning problems. Through pressures imposed on models during evolution, genetic programming performs basic feature selection, and so analysis of the evolved models can provide some insights into the utility of input features. Previous work has tended towards a presence model of feature selection, where the frequency of a feature appearing within evolved models is a metric for its utility. In this paper, we identify some drawbacks with using this approach, and instead propose the integration of importance measures for feature selection that measure the influence of a feature within a model. Using sensitivity-like analysis methods inspired by importance measures used in random forest regression, we demonstrate that genetic programming introduces many features into evolved models that have little impact on a given model's behaviour, and this can mask the true importance of salient features. The paper concludes by exploring bloat control methods and adaptive terminal selection methods to influence the identification of useful features within the search performed by genetic programming, with results suggesting that a combination of adaptive terminal selection and bloat control may help to improve generalisation performance.
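The contrast between a presence-based measure and an importance-based measure can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a permutation-style importance (the increase in mean squared error when one feature column is shuffled), similar in spirit to the importance measures used in random forests. The function `permutation_importance`, the expression `evolved_model`, and the synthetic data are all hypothetical and exist only for illustration.

```python
import numpy as np


def permutation_importance(model, X, y, n_repeats=10, seed=0):
    """Mean increase in MSE when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    base_mse = np.mean((model(X) - y) ** 2)
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            X_perm = X.copy()
            # Shuffling column j breaks its relationship with the target.
            X_perm[:, j] = rng.permutation(X[:, j])
            importance[j] += np.mean((model(X_perm) - y) ** 2) - base_mse
    return importance / n_repeats


def evolved_model(X):
    """Hypothetical evolved expression: x2 appears twice but cancels out."""
    x0, x1, x2 = X[:, 0], X[:, 1], X[:, 2]
    return 3.0 * x0 + x1 + (x2 - x2)


# Presence measure: how often each feature appears as a terminal in the model.
presence = np.array([1, 1, 2])

# Synthetic data (an assumption): the target depends only on x0 and x1.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 3.0 * X[:, 0] + X[:, 1]

importance = permutation_importance(evolved_model, X, y)
print("presence:  ", presence)
print("importance:", importance)
```

In this toy case the presence count ranks x2 highest because it occurs most often in the expression, while the permutation-based measure assigns it no influence, since shuffling x2 leaves the model's output unchanged.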
Index Terms
Sensitivity-like analysis for feature selection in genetic programming