ABSTRACT
Missing values are an unavoidable problem in many real-world datasets. Handling incomplete data is a crucial requirement for classification because inadequate treatment of missing values often causes large classification errors. Feature construction has been successfully applied to improve classification with complete data, but it has seldom been applied to incomplete data. Genetic programming-based multiple feature construction (GPMFC) is a recent, promising feature construction method that uses genetic programming to evolve multiple new features from the original features for classification tasks. GPMFC can improve the accuracy and reduce the complexity of many decision tree and rule-based classifiers; however, it cannot work directly with incomplete data. This paper proposes IGPMFC, an extension of GPMFC that handles incomplete data. IGPMFC uses genetic programming with interval functions to directly evolve multiple features for classification with incomplete data. Experimental results show that IGPMFC can not only substantially improve the accuracy but also reduce the complexity of classifiers learnt from incomplete data.
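The idea of evaluating a constructed feature with interval functions can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how intervals are built, so the sketch assumes that a missing value is replaced by the feature's observed range and a known value by a degenerate interval. The helper names (`to_interval`, `constructed_feature`) and the example expression `(x0 + x1) * x2` are hypothetical.

```python
from typing import List, Optional, Tuple

Interval = Tuple[float, float]


def to_interval(value: Optional[float], lo: float, hi: float) -> Interval:
    """Assumption: a missing value (None) becomes the feature's observed range
    [lo, hi]; a known value becomes the degenerate interval [value, value]."""
    if value is None:
        return (lo, hi)
    return (value, value)


def add(a: Interval, b: Interval) -> Interval:
    """Interval addition: add the lower bounds and the upper bounds."""
    return (a[0] + b[0], a[1] + b[1])


def sub(a: Interval, b: Interval) -> Interval:
    """Interval subtraction: smallest result is a.lo - b.hi, largest is a.hi - b.lo."""
    return (a[0] - b[1], a[1] - b[0])


def mul(a: Interval, b: Interval) -> Interval:
    """Interval multiplication: bounds are the min and max of the endpoint products."""
    products = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(products), max(products))


def constructed_feature(
    instance: List[Optional[float]], ranges: List[Interval]
) -> Interval:
    """Hypothetical evolved expression (x0 + x1) * x2, evaluated on intervals."""
    x0, x1, x2 = (to_interval(v, lo, hi) for v, (lo, hi) in zip(instance, ranges))
    return mul(add(x0, x1), x2)


# Assumed per-feature ranges, e.g. taken from the training data.
ranges = [(0.0, 10.0), (1.0, 5.0), (-2.0, 2.0)]

complete = constructed_feature([2.0, 3.0, 1.5], ranges)
incomplete = constructed_feature([2.0, None, 1.5], ranges)

print(complete)    # (7.5, 7.5)
print(incomplete)  # (4.5, 10.5)
```

For a complete instance the result collapses to a single value, so the constructed feature behaves like an ordinary one. For an incomplete instance the result is an interval that bounds every value the feature could take, which is what lets the expression be evaluated without imputing the missing value first.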
Index Terms
Genetic programming based feature construction for classification with incomplete data