Choosing function sets with better generalisation performance for symbolic regression models

Genetic Programming and Evolvable Machines

Abstract

Supervised learning by means of Genetic Programming (GP) aims at the evolutionary synthesis of a model that achieves a balance between approximating the target function on the training data and generalising on new data. The model space searched by the Evolutionary Algorithm is populated by compositions of primitive functions defined in a function set. Since the target function is unknown, the choice of a function set’s constituent elements is primarily guided by the makeup of function sets traditionally used in the GP literature. Our work builds upon previous research on the effects of protected arithmetic operators (i.e. division, logarithm, power) on the output value of an evolved model for input data points not encountered during training. The goal is to benchmark the approximation/generalisation of models evolved using different function set choices across a range of 43 symbolic regression problems. The salient outcomes are as follows. Firstly, Koza’s protected operators of division and exponentiation have a detrimental effect on generalisation, and should therefore be avoided. This result holds regardless of whether moderately sized validation sets are used for model selection. Secondly, the performance of the recently introduced analytic quotient operator is comparable to that of the sinusoidal operator on average, with their combination being advantageous to both approximation and generalisation. These findings are consistent across two different system implementations, those of standard expression-tree GP and linear Grammatical Evolution. We highlight that this study employed very large test sets, which lend confidence when benchmarking the effect of different combinations of primitive functions on model generalisation. Our aim is to encourage GP researchers and practitioners to use similarly stringent means of assessing the generalisation of evolved models where possible, and also to avoid certain primitive functions that are known to be inappropriate.
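
To make the operators named in the abstract concrete, the following is a minimal Python sketch, not the authors' code. It assumes the usual convention for Koza's protected division (return 1 when the divisor is 0) and the analytic quotient form a / sqrt(1 + b^2) from Ni et al. [50]; the function names are illustrative only.

```python
import math


def protected_div(a: float, b: float) -> float:
    """Koza-style protected division (assumed convention): 1.0 when b is zero."""
    return 1.0 if b == 0.0 else a / b


def analytic_quotient(a: float, b: float) -> float:
    """Analytic quotient: a / sqrt(1 + b^2), defined for every real b."""
    return a / math.sqrt(1.0 + b * b)


if __name__ == "__main__":
    # Near b = 0 the protected operator jumps between huge values and 1.0,
    # while the analytic quotient stays close to a.
    for b in (1e-9, 0.0, -1e-9):
        print(b, protected_div(1.0, b), analytic_quotient(1.0, b))
```

The sketch illustrates one plausible reason a protected operator can behave badly on unseen inputs: its output is discontinuous around the protected point, whereas the analytic quotient varies smoothly there.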

Notes

  1. In reality, ppow was defined [35] as \(srexpt(x_1, x_2) = |x_1|^{x_2}\), but this does not take into account the fact that raising 0 to a negative number results in a division by zero.

  2. In such cases, Keijzer [30] proposes using the ranges observed in the training set, but those ranges can change with unseen data, particularly when extrapolating, or when modelling time-series data.
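
The two notes above can be illustrated with a short Python sketch under stated assumptions. `srexpt` follows the definition quoted in Note 1; `guarded_srexpt`, its `fallback` value of 1.0, and the helpers `observed_range` and `divisor_interval_is_safe` are hypothetical names introduced here for illustration. The range check is a heavily simplified stand-in for the interval-arithmetic idea in Note 2, not Keijzer's method.

```python
from typing import Sequence, Tuple


def srexpt(x1: float, x2: float) -> float:
    """Power operator as quoted in Note 1: |x1| raised to x2."""
    return abs(x1) ** x2


def guarded_srexpt(x1: float, x2: float, fallback: float = 1.0) -> float:
    """Hypothetical guard: return `fallback` for a zero base with a negative exponent."""
    if x1 == 0.0 and x2 < 0.0:
        return fallback
    return abs(x1) ** x2


def observed_range(values: Sequence[float]) -> Tuple[float, float]:
    """Smallest and largest value seen for one variable."""
    return min(values), max(values)


def divisor_interval_is_safe(bounds: Tuple[float, float]) -> bool:
    """True when the interval [lo, hi] does not contain zero."""
    lo, hi = bounds
    return not (lo <= 0.0 <= hi)


if __name__ == "__main__":
    # With plain Python floats, a zero base and negative exponent raises.
    try:
        srexpt(0.0, -1.0)
    except ZeroDivisionError as err:
        print("srexpt(0, -1) failed:", err)
    print(guarded_srexpt(0.0, -1.0))

    # A divisor that looks safe on training data may not be safe on unseen data.
    train = [0.5, 1.0, 2.0]
    unseen = [-0.5, 0.0, 3.0]
    print(divisor_interval_is_safe(observed_range(train)))
    print(divisor_interval_is_safe(observed_range(train + unseen)))
```

The second half mirrors the caveat in Note 2: a range estimated from the training set excludes zero here, but the same check fails once the unseen points are included.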

References

  1. A. Agapitos, R. Loughran, M. Nicolau, S. Lucas, M. O’Neill, A. Brabazon, A survey of statistical machine learning elements in genetic programming. IEEE Trans. Evol. Comput. 23(6), 1029–1048 (2019)

  2. R.M.A. Azad, C. Ryan, Variance based selection to improve test set performance in genetic programming, in GECCO’11: Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation (ACM, Dublin, 2011), pp. 1315–1322

  3. P. Barmpalexis, A. Karagianni, G. Karasavvaides, K. Kachrimanis, Comparison of multi-linear regression, particle swarm optimization artificial neural networks and genetic programming in the development of mini-tablets. Int. J. Pharm. 551(1), 166–176 (2018). https://doi.org/10.1016/j.ijpharm.2018.09.026

  4. M. Castelli, L. Manzoni, S. Silva, L. Vanneschi, A comparison of the generalization ability of different genetic programming frameworks, in IEEE Congress on Evolutionary Computation (CEC 2010) (IEEE Press, Barcelona, 2010)

  5. Q. Chen, B. Xue, M. Zhang, Improving generalisation of genetic programming for symbolic regression with angle-driven geometric semantic operators. IEEE Trans. Evol. Comput. 23(3), 488–502 (2019). https://doi.org/10.1109/TEVC.2018.2869621

  6. Q. Chen, M. Zhang, B. Xue, Feature selection to improve generalisation of genetic programming for high-dimensional symbolic regression. IEEE Trans. Evol. Comput. 21(5), 792–806 (2017). https://doi.org/10.1109/TEVC.2017.2683489

  7. Q. Chen, M. Zhang, B. Xue, Structural risk minimisation-driven genetic programming for enhancing generalisation in symbolic regression. IEEE Trans. Evol. Comput. 23(4), 703–717 (2019). https://doi.org/10.1109/TEVC.2018.2881392

  8. O. Claveria, E. Monte, S. Torra, Assessment of the effect of the financial crisis on agents expectations through symbolic regression. Appl. Econ. Lett. 24(9), 648–652 (2017). https://doi.org/10.1080/13504851.2016.1218419

  9. L.F. dal Piccol Sotto, V.V. de Melo, Studying bloat control and maintenance of effective code in linear genetic programming for symbolic regression. Neurocomputing 180, 79–93 (2016). https://doi.org/10.1016/j.neucom.2015.10.109. Progress in Intelligent Systems Design Selected papers from the 4th Brazilian Conference on Intelligent Systems (BRACIS 2014)

  10. O. Claveria, E. Monte, S. Torra, Using survey data to forecast real activity with evolutionary algorithms: a cross-country analysis. J. Appl. Econ. 20(2), 329–349 (2017). https://doi.org/10.1016/S1514-0326(17)30015-6

  11. G. D’Angelo, R. Pilla, C. Tascini, S. Rampone, A proposal for distinguishing between bacterial and viral meningitis using genetic programming and decision trees. Soft Comput. 23(22), 11775–11791 (2019). https://doi.org/10.1007/s00500-018-03729-y

  12. I. De Falco, A. Della Cioppa, T. Koutny, M. Krcma, U. Scafuri, E. Tarantino, Genetic programming-based induction of a glucose-dynamics model for telemedicine. J. Netw. Comput. Appl. 119, 1–13 (2018). https://doi.org/10.1016/j.jnca.2018.06.007

  13. F.O. de Franca, A greedy search tree heuristic for symbolic regression. Inf. Sci. 442, 18–32 (2018). https://doi.org/10.1016/j.ins.2018.02.040

  14. V.V. de Melo, W. Banzhaf, Improving the prediction of material properties of concrete using kaizen programming with simulated annealing. Neurocomputing 246, 25–44 (2017). https://doi.org/10.1016/j.neucom.2016.12.077

  15. V.V. de Melo, W. Banzhaf, Automatic feature engineering for regression models with machine learning: an evolutionary computation and statistics hybrid. Inf. Sci. 430–431, 287–313 (2018). https://doi.org/10.1016/j.ins.2017.11.041

  16. G. Dick, Revisiting interval arithmetic for regression problems in genetic programming, in Genetic and Evolutionary Computation Conference—GECCO 2017, Berlin, Germany, July 15–19, 2017, Companion, Proceedings, ed. by G. Ochoa (ACM 2017), pp. 129–130. https://doi.org/10.1145/3067695.3076107

  17. A.I. Diveev, N.B. Konyrbaev, E.A. Sofronova, Method of binary analytic programming to look for optimal mathematical expression. Procedia Comput. Sci. 103, 597–604 (2017). XII International Symposium Intelligent Systems, INTELS 2016, 5–7 October 2016, Moscow, Russia. https://doi.org/10.1016/j.procs.2017.01.073

  18. T. Dou, P. Rockett, Comparison of semantic-based local search methods for multiobjective genetic programming. Genet. Program Evol. Mach. 19(4), 535–563 (2018). https://doi.org/10.1007/s10710-018-9325-4

  19. I. Fajfar, T. Tuma, Creation of numerical constants in robust gene expression programming. Entropy 20(10), 756 (2018). https://doi.org/10.3390/e20100756

  20. M. Fenton, J. McDermott, D. Fagan, S. Forstenlechner, M. O’Neill, E. Hemberg, PonyGE2: Grammatical Evolution in Python. http://ncra.ucd.ie/Site/GEVA.html (2019)

  21. F.A. Fortin, F.M. De Rainville, M.A. Gardner, M. Parizeau, C. Gagné, DEAP. https://github.com/DEAP/deap (2019)

  22. J.H. Friedman, Greedy function approximation: a gradient boosting machine. Ann. Stat. 29, 1189–1232 (2000)

  23. A. Garg, J.S.L. Lam, B.N. Panda, A hybrid computational intelligence framework in modelling of coal-oil agglomeration phenomenon. Appl. Soft Comput. 55, 402–412 (2017). https://doi.org/10.1016/j.asoc.2017.01.054

  24. B. Ghaddar, N. Sakr, Y. Asiedu, Spare parts stocking analysis using genetic programming. Eur. J. Oper. Res. 252(1), 136–144 (2016). https://doi.org/10.1016/j.ejor.2015.12.041

  25. E.M. Golafshani, A. Behnood, Automatic regression methods for formulation of elastic modulus of recycled aggregate concrete. Appl. Soft Comput. 64, 377–400 (2018). https://doi.org/10.1016/j.asoc.2017.12.030

  26. I. Gonzalez-Taboada, B. Gonzalez-Fonteboa, F. Martinez-Abella, J.L. Perez-Ordonez, Prediction of the mechanical properties of structural recycled concrete using multivariable regression and genetic programming. Constr. Build. Mater. 106, 480–499 (2016). https://doi.org/10.1016/j.conbuildmat.2015.12.136

  27. M.A. Haeri, M.M. Ebadzadeh, G. Folino, Statistical genetic programming for symbolic regression. Appl. Soft Comput. 60, 447–469 (2017). https://doi.org/10.1016/j.asoc.2017.06.050

  28. S.A. Hosseini, A. Tavana, S.M. Abdolahi, S. Darvishmaslak, Prediction of blast induced ground vibrations in quarry sites: a comparison of GP, RSM and MARS. Soil Dyn. Earthq. Eng. 119, 118–129 (2019). https://doi.org/10.1016/j.soildyn.2019.01.011

  29. A. Kattan, A. Agapitos, Y.S. Ong, A.A. Alghamedi, M. O’Neill, GP made faster with semantic surrogate modelling. Inf. Sci. 355–356, 169–185 (2016). https://doi.org/10.1016/j.ins.2016.03.030

  30. M. Keijzer, Improving symbolic regression with interval arithmetic and linear scaling, in Genetic Programming. Proceedings of EuroGP’2003, LNCS, vol. 2610, ed. by C. Ryan, T. Soule, M. Keijzer, E. Tsang, R. Poli, E. Costa (Springer, Essex, 2003), pp. 70–82

  31. M. Khandelwal, R.S. Faradonbeh, M. Monjezi, D.J. Armaghani, M.Z.B.A. Majid, S. Yagiz, Function development for appraising brittleness of intact rocks using genetic programming and non-linear multiple regression models. Eng. Comput. 33(1), 13–21 (2017). https://doi.org/10.1007/s00366-016-0452-3

  32. M.F. Korns, Accuracy in symbolic regression, in Genetic Programming Theory and Practice IX, Genetic and Evolutionary Computation, ed. by R. Riolo, E. Vladislavleva, J.H. Moore (Springer, New York, 2011), pp. 129–151

  33. M. Kovacic, A. Mihevc, M. Tercelj, Roll wear modeling using genetic programming – industry case study. Mater. Technol. (2019). https://doi.org/10.17222/mit.2018.104

  34. M. Kovacic, A. Turnsek, D. Ocvirk, G. Gantar, Increasing the tensile strength and elongation of 16MnCrS5 steel using genetic programming. Mater. Technol. 51(6), 883–888 (2017). https://doi.org/10.17222/mit.2016.293

  35. J. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection (MIT Press, Cambridge, 1992)

  36. G. Kronberger, M. Kommenda, E. Lughofer, S. Saminger-Platz, A. Promberger, F. Nickel, S. Winkler, M. Affenzeller, Using robust generalized fuzzy modeling and enhanced symbolic regression to model tribological systems. Appl. Soft Comput. 69, 610–624 (2018). https://doi.org/10.1016/j.asoc.2018.04.048

  37. J. Kubalik, E. Alibekov, R. Babuska, Optimal control via reinforcement learning with symbolic policy approximation. IFAC-PapersOnLine 50(1), 4162–4167 (2017). https://doi.org/10.1016/j.ifacol.2017.08.805. 20th IFAC World Congress

  38. J. Kubalik, E. Alibekov, J. Zegklitz, R. Babuska, Hybrid single node genetic programming for symbolic regression. Trans. Comput. Collect. Intell. 9770, 61–82 (2016). https://doi.org/10.1007/978-3-662-53525-7_4

  39. H.C. Kwak, S. Kho, Predicting crash risk and identifying crash precursors on Korean expressways using loop detector data. Accid. Anal. Prev. 88, 9–19 (2016). https://doi.org/10.1016/j.aap.2015.12.004

  40. W.B. Langdon, The Genetic Programming Bibliography. http://www.gpbib.cs.ucl.ac.uk/ (2020)

  41. W.B. Langdon, R. Poli, Foundations of Genetic Programming (Springer, New York, 2002)

  42. M. Lichman, UCI Machine Learning Repository. http://archive.ics.uci.edu/ml (2013)

  43. H. Liu, H. Lin, X. Jiang, X. Mao, Q. Liu, B. Li, Estimation of mass matrix in machine tool’s weak components research by using symbolic regression. Comput. Ind. Eng. 127, 998–1011 (2019). https://doi.org/10.1016/j.cie.2018.11.033

  44. Q. Lu, J. Ren, Z. Wang, Using genetic programming with prior formula knowledge to solve symbolic regression problem. Comput. Intell. Neurosci. (2016). https://doi.org/10.1155/2016/1021378

  45. S. Luke, E.O. Scott, L. Panait, G. Balan, S. Paus, Z. Skolicki, R. Kicinger, E. Popovici, K. Sullivan, J. Harrison, J. Bassett, R. Hubley, A. Desai, A. Chircop, J. Compton, W. Haddon, S. Donnelly, B. Jamil, J. Zelibor, E. Kangas, F. Abidi, H. Mooers, J. O’Beirne, L. Manzoni, K.A. Talukder, S. McKay, J. McDermott, J. Zou, A. Rutherford, D. Freelan, E. Wei, S. Rajendran, A. Dhawan, B. Brumbac, J. Hilty, A. Kabir, ECJ 27: A Java-based Evolutionary Computation Research System. https://cs.gmu.edu/~eclab/projects/ecj (2019)

  46. S. Mahler, D. Robilliard, C. Fonlupt, Tarpeian bloat control and generalization accuracy, in Proceedings of the 8th European Conference on Genetic Programming, Lecture Notes in Computer Science, vol. 3447, ed. by M. Keijzer, A. Tettamanzi, P. Collet, J.I. van Hemert, M. Tomassini (Springer, Lausanne, 2005), pp. 203–214

  47. L.F. Miranda, L.O.V.B. Oliveira, J.F.B.S. Martins, G.L. Pappa, How noisy data affects geometric semantic genetic programming, in Genetic and Evolutionary Computation Conference—GECCO 2017, Berlin, Germany, July 15–19, 2017, Companion, Proceedings, ed. by G. Ochoa (ACM, 2017), pp. 985–992. https://doi.org/10.1145/3071178.3071300

  48. J.L. Montana, C.L. Alonso, C.E. Borges, C. Tirnauca, Model-driven regularization approach to straight line program genetic programming. Expert Syst. Appl. 57, 76–90 (2016). https://doi.org/10.1016/j.eswa.2016.03.003

  49. S.S. Mousavi Astarabadi, M.M. Ebadzadeh, A decomposition method for symbolic regression problems. Appl. Soft Comput. 62, 514–523 (2018). https://doi.org/10.1016/j.asoc.2017.10.041

  50. J. Ni, R.H. Drieberg, P.I. Rockett, The use of an analytic quotient operator in genetic programming. IEEE Trans. Evol. Comput. 17(1), 146–152 (2013)

  51. J. Ni, P. Rockett, Tikhonov regularization as a complexity measure in multiobjective genetic programming. IEEE Trans. Evol. Comput. 19(2), 157–166 (2015)

  52. M. Nicolau, Understanding grammatical evolution: initialisation. Genet. Program Evol. Mach. 18(4), 1–41 (2017). https://doi.org/10.1007/s10710-017-9309-9

  53. M. Nicolau, I. Dempsey, Introducing grammar based extensions for grammatical evolution, in IEEE Congress on Evolutionary Computation (CEC 2006), pp. 2663–2670

  54. M. Nicolau, M. O’Neill, A. Brabazon, Termination in grammatical evolution: Grammar design, wrapping, and tails, in IEEE Congress on Evolutionary Computation (CEC 2012) (2012), pp. 1–8

  55. N.Y. Nikolaev, H. Iba, Regularization approach to inductive genetic programming. IEEE Trans. Evol. Comput. 5(4), 359–375 (2001)

  56. M. O’Neill, C. Ryan, Grammatical Evolution: Evolutionary Automatic Programming in an Arbitrary Language, Genetic programming, vol. 4 (Kluwer, Alphen aan den Rijn, 2003)

  57. L. Pagie, P. Hogeweg, Evolutionary consequences of coevolving targets. Evol. Comput. 5(4), 401–418 (1997)

  58. X. Pan, M.K. Uddin, B. Ai, X. Pan, Influential factors of carbon emissions intensity in OECD countries: evidence from symbolic regression. J. Clean. Prod. (2019). https://doi.org/10.1016/j.jclepro.2019.02.195

  59. T.P. Pawlak, K. Krawiec, Competent geometric semantic genetic programming for symbolic regression and boolean function synthesis. Evol. Comput. 26(2), 177–212 (2018). https://doi.org/10.1162/EVCO_a_00205

  60. S. Polanco-Martagon, J. Ruiz-Ascencio, M.A. Duarte-Villasenor, Symbolic modeling of the Pareto-optimal sets of two unity gain cells. DYNA 83(197), 128–137 (2016). https://doi.org/10.15446/dyna.v83n197.50919

  61. M. Quade, M. Abel, K. Shafi, R.K. Niven, B.R. Noack, Prediction of dynamical systems by symbolic regression. Phys. Rev. E 94, 012214 (2016). https://doi.org/10.1103/PhysRevE.94.012214

  62. S.S. Rathore, S. Kumar, Towards an ensemble based system for predicting the number of software faults. Expert Syst. Appl. 82, 357–382 (2017). https://doi.org/10.1016/j.eswa.2017.04.014

  63. S. Silva, GPLAB: A Genetic Programming Toolbox for MATLAB. http://gplab.sourceforge.net/download.html (2019)

  64. S. Silva, L. Vanneschi, Operator equalisation, bloat and overfitting: a study on human oral bioavailability prediction, in GECCO’09: Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation (ACM, Montreal, 2009), pp. 1115–1122

  65. A. Sohani, M. Zabihigivi, M.H. Moradi, H. Sayyaadi, H.H. Balyani, A comprehensive performance investigation of cellulose evaporative cooling pad systems using predictive approaches. Appl. Therm. Eng. 110, 1589–1608 (2017). https://doi.org/10.1016/j.applthermaleng.2016.08.216

  66. R. Taghizadeh-Mehrjardi, K. Nabiollahi, R. Kerry, Digital mapping of soil organic carbon at multiple depths using different data mining techniques in Baneh Region, Iran. Geoderma 266, 98–110 (2016). https://doi.org/10.1016/j.geoderma.2015.12.003

  67. A. Tahmassebi, A.H. Gandomi, Building energy consumption forecast using multi-objective genetic programming. Measurement 118, 164–171 (2018). https://doi.org/10.1016/j.measurement.2018.01.032

  68. Y. Tao, Y.J. Chen, X. Fu, B. Jiang, Y. Zhang, Evolutionary ensemble learning algorithm to modeling of warfarin dose prediction for Chinese. IEEE J. Biomed. Health Inform. (2018). https://doi.org/10.1109/JBHI.2018.2812165

  69. P.T. Thuong, N.X. Hoai, X. Yao, Combining conformal prediction and genetic programming for symbolic interval regression, in Genetic and Evolutionary Computation Conference—GECCO 2017, Berlin, Germany, July 15–19, 2017, Companion, Proceedings, ed. by G. Ochoa, (ACM, 2017), pp. 1001–1008. https://doi.org/10.1145/3071178.3071280

  70. L. Vanneschi, M. Castelli, S. Silva, Measuring bloat, overfitting and functional complexity in genetic programming, in GECCO’10: Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation (ACM, Portland, USA, 2010), pp. 877–884

  71. V. Vapnik, The Nature of Statistical Learning Theory (Springer, New York, 1999)

  72. E.J. Vladislavleva, G.F. Smits, D. den Hertog, Order of nonlinearity as a complexity measure for models generated by symbolic regression via Pareto genetic programming. IEEE Trans. Evol. Comput. 13(2), 333–349 (2009)

  73. P. Whigham, G. Dick, J. Maclaurin, C.A. Owen, libgges: Grammar-Guided Evolutionary Search. https://github.com/DEAP/deap (2019)

  74. X. Yao, Universal approximation by genetic programming, in Foundations of Genetic Programming, ed. by T. Haynes, W.B. Langdon, U.M. O’Reilly, R. Poli, J. Rosca (Orlando, Florida, USA, 1999), pp. 66–67

  75. Y.S. Yeun, W.S. Ruy, Y.S. Yang, N.J. Kim, Implementing linear models in genetic programming. IEEE Trans. Evol. Comput. 8(6), 542–566 (2004)

  76. E. Flores, M. Abatal, A. Bassam, L. Trujillo, P. Juarez-Smith, Y. El Hamzaoui, Modeling the adsorption of phenols and nitrophenols by activated carbon using genetic programming. J. Clean. Prod. 161, 860–870 (2017). https://doi.org/10.1016/j.jclepro.2017.05.192

  77. A. Zameer, J. Arshad, A. Khan, M.A.Z. Raja, Intelligent and robust prediction of short term wind power using genetic programming based ensemble of neural networks. Energy Convers. Manag. 134, 361–372 (2017). https://doi.org/10.1016/j.enconman.2016.12.032

  78. J. Zhong, W. Cai, M. Lees, L. Luo, Automatic model construction for the behavior of human crowds. Appl. Soft Comput. 56, 368–378 (2017). https://doi.org/10.1016/j.asoc.2017.03.020

Author information

Corresponding author

Correspondence to Miguel Nicolau.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Nicolau, M., Agapitos, A. Choosing function sets with better generalisation performance for symbolic regression models. Genet Program Evolvable Mach 22, 73–100 (2021). https://doi.org/10.1007/s10710-020-09391-4

  • DOI: https://doi.org/10.1007/s10710-020-09391-4
