Genetic Programming-Based Selection of Imputation Methods in Symbolic Regression with Missing Values

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12576)

Abstract

Data incompleteness is a serious issue in real-world applications of machine learning. Imputation methods are algorithms that restore missing values in the data based on the other available entries. Because the imputation method influences learning performance on incomplete data, choosing the right one plays an important role in constructing prediction models. It is common to use a single imputation method for all incomplete features. However, an imputation method that works well for some features might not suit others, so it would be more useful to select the right imputation method for each feature. In fact, selecting an imputation method for the whole data set is still challenging, let alone selecting different imputation methods for individual incomplete features. Therefore, this work proposes using genetic programming (GP) to search for the right combination of imputation methods for symbolic regression. The role of GP is both to select imputation methods for the incomplete features and to evolve the symbolic regression model. The method incorporates a heterogeneous set of imputation methods as part of the symbolic regression process. The results show that the proposed method can automatically find the most effective combination of imputation methods for a variety of incomplete regression data sets.
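The idea of choosing one imputation method per incomplete feature can be illustrated with a minimal sketch. It is not the method of the paper: the GP-evolved symbolic regression model is replaced here by a leave-one-out 1-nearest-neighbour regressor, and GP itself by a simple mutation-only evolutionary search. The imputer set (`mean`, `median`, `min`), the function names, and the toy data are all hypothetical and chosen only for illustration.

```python
import random
import statistics

# Hypothetical imputer set: each maps a column's observed values to a fill value.
IMPUTERS = {"mean": statistics.fmean, "median": statistics.median, "min": min}


def impute(rows, choice):
    """Fill None entries column by column, using the method named in choice[j]."""
    filled = []
    for column, name in zip(zip(*rows), choice):
        observed = [v for v in column if v is not None]
        fill = IMPUTERS[name](observed)
        filled.append([fill if v is None else v for v in column])
    return [list(row) for row in zip(*filled)]


def loo_1nn_mse(X, y):
    """Leave-one-out MSE of 1-NN regression (a stand-in for the evolved model)."""
    total = 0.0
    for i, xi in enumerate(X):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(xi, X[j])),
        )
        total += (y[i] - y[nearest]) ** 2
    return total / len(X)


def fitness(rows, y, choice):
    """Regression error obtained after imputing with the given per-feature choice."""
    return loo_1nn_mse(impute(rows, choice), y)


def search(rows, y, generations=20, offspring=6, seed=0):
    """Mutation-only evolutionary search over per-feature imputer choices."""
    rng = random.Random(seed)
    names = list(IMPUTERS)
    n_features = len(rows[0])
    # Start from the single-method baselines; elitist selection then guarantees
    # the result is never worse than using one method for every feature.
    population = [tuple([name] * n_features) for name in names]
    for _ in range(generations):
        children = []
        for _ in range(offspring):
            child = list(rng.choice(population))
            child[rng.randrange(n_features)] = rng.choice(names)
            children.append(tuple(child))
        pool = set(population) | set(children)
        # Sort by (error, choice) so ties are broken deterministically.
        ranked = sorted(pool, key=lambda c: (fitness(rows, y, c), c))
        population = ranked[: len(names)]
    return population[0]


# Toy incomplete regression data; None marks a missing value.
rows = [[1.0, 10.0], [2.0, None], [None, 30.0],
        [4.0, 40.0], [5.0, None], [6.0, 60.0]]
y = [11.0, 22.0, 33.0, 44.0, 55.0, 66.0]

best = search(rows, y)
print(best, fitness(rows, y, best))
```

Because the search starts from the uniform single-method choices and always keeps the best candidates, the returned combination is at least as good, under this stand-in fitness, as applying any one method to all features.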

Author information

Corresponding authors

Correspondence to Baligh Al-Helali, Qi Chen, Bing Xue or Mengjie Zhang.

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Al-Helali, B., Chen, Q., Xue, B., Zhang, M. (2020). Genetic Programming-Based Selection of Imputation Methods in Symbolic Regression with Missing Values. In: Gallagher, M., Moustafa, N., Lakshika, E. (eds) AI 2020: Advances in Artificial Intelligence. AI 2020. Lecture Notes in Computer Science, vol 12576. Springer, Cham. https://doi.org/10.1007/978-3-030-64984-5_13

  • DOI: https://doi.org/10.1007/978-3-030-64984-5_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-64983-8

  • Online ISBN: 978-3-030-64984-5

  • eBook Packages: Computer Science, Computer Science (R0)
