Abstract
Supervised learning by means of Genetic Programming (GP) aims at the evolutionary synthesis of a model that achieves a balance between approximating the target function on the training data and generalising on new data. The model space searched by the Evolutionary Algorithm is populated by compositions of primitive functions defined in a function set. Since the target function is unknown, the choice of the function set’s constituent elements is primarily guided by the makeup of function sets traditionally used in the GP literature. Our work builds upon previous research on the effects of protected arithmetic operators (i.e. division, logarithm, power) on the output value of an evolved model for input data points not encountered during training. The aim is to benchmark the approximation/generalisation of models evolved using different function set choices across a range of 43 symbolic regression problems. The salient outcomes are as follows. Firstly, Koza’s protected operators of division and exponentiation have a detrimental effect on generalisation, and should therefore be avoided. This result holds regardless of whether moderately sized validation sets are used for model selection. Secondly, the performance of the recently introduced analytic quotient operator is comparable to that of the sinusoidal operator on average, with their combination being advantageous to both approximation and generalisation. These findings are consistent across two different system implementations, those of standard expression-tree GP and linear Grammatical Evolution. We highlight that this study employed very large test sets, which lend confidence when benchmarking the effect of different combinations of primitive functions on model generalisation. Our aim is to encourage GP researchers and practitioners to use similarly stringent means of assessing the generalisation of evolved models where possible, and also to avoid certain primitive functions that are known to be inappropriate.
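To make the contrast between the operators concrete, the following is a minimal Python sketch. The abstract does not give the operator definitions, so both are assumptions taken from their usual statements in the literature: Koza-style protected division returns 1 when the denominator is zero, and the analytic quotient of Ni et al. is a / sqrt(1 + b^2). The function names are illustrative and do not come from any particular GP system.

```python
import math


def protected_div(a: float, b: float) -> float:
    """Koza-style protected division (assumed form): 1.0 when b is zero."""
    return 1.0 if b == 0.0 else a / b


def analytic_quotient(a: float, b: float) -> float:
    """Analytic quotient (assumed form): a / sqrt(1 + b^2), defined for every b."""
    return a / math.sqrt(1.0 + b * b)


if __name__ == "__main__":
    # Denominators close to zero: protected division jumps between very large
    # magnitudes and the fallback value 1, while the analytic quotient stays
    # close to the numerator.
    for b in (1e-9, 0.0, -1e-9):
        print(b, protected_div(1.0, b), analytic_quotient(1.0, b))
```

The discontinuity of protected division around a zero denominator is one way an evolved model can fit the training points well and still return extreme values on nearby unseen inputs.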
Notes
In reality, ppow was defined [35] as \(srexpt(x_1, x_2) = |x_1|^{x_2}\), but this does not take into account the fact that raising 0 to a negative number results in a division by zero.
In such cases, Keijzer [30] proposes using the ranges observed in the training set, but those ranges can change with unseen data, particularly when extrapolating, or when modelling time-series data.
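Both notes can be illustrated with a short Python sketch. The `srexpt` function follows the formula quoted in the first note; the divisor `x - 6`, the training range [1, 5] and the helper names are hypothetical and chosen only for illustration. The range check is a simplified stand-in for interval arithmetic, not Keijzer's implementation.

```python
def srexpt(x1: float, x2: float) -> float:
    """|x1| ** x2 as quoted in the note; undefined for x1 == 0 and x2 < 0."""
    return abs(x1) ** x2


def interval_excludes_zero(lo: float, hi: float) -> bool:
    """True when the closed interval [lo, hi] does not contain zero."""
    return lo > 0.0 or hi < 0.0


def denominator(x: float) -> float:
    """Hypothetical evolved sub-expression used as a divisor: x - 6."""
    return x - 6.0


if __name__ == "__main__":
    # Note 1: taking the absolute value of the base does not protect against
    # a zero base combined with a negative exponent.
    try:
        srexpt(0.0, -1.0)
    except ZeroDivisionError as err:
        print("srexpt(0, -1) failed:", err)

    # Note 2: with training inputs in [1, 5] the divisor (monotone in x)
    # ranges over [-5, -1], so a check based on training ranges accepts it.
    d_lo, d_hi = denominator(1.0), denominator(5.0)
    print("safe on training range:", interval_excludes_zero(d_lo, d_hi))

    # An unseen input outside that range still produces a zero divisor.
    print("zero divisor at x = 6:", denominator(6.0) == 0.0)
```

The second part shows why ranges observed during training offer no guarantee under extrapolation: the check passes, yet a single out-of-range input hits the singularity.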
References
A. Agapitos, R. Loughran, M. Nicolau, S. Lucas, M. O’Neill, A. Brabazon, A survey of statistical machine learning elements in genetic programming. IEEE Trans. Evol. Comput. 23(6), 1029–1048 (2019)
R.M.A. Azad, C. Ryan, Variance based selection to improve test set performance in genetic programming, in GECCO’11: Proceedings of the 13th Annual Conference on Genetic and Evolutionary Computation (ACM, Dublin, 2011), pp. 1315–1322
P. Barmpalexis, A. Karagianni, G. Karasavvaides, K. Kachrimanis, Comparison of multi-linear regression, particle swarm optimization artificial neural networks and genetic programming in the development of mini-tablets. Int. J. Pharm. 551(1), 166–176 (2018). https://doi.org/10.1016/j.ijpharm.2018.09.026
M. Castelli, L. Manzoni, S. Silva, L. Vanneschi, A comparison of the generalization ability of different genetic programming frameworks, in IEEE Congress on Evolutionary Computation (CEC 2010) (IEEE Press, Barcelona, 2010)
Q. Chen, B. Xue, M. Zhang, Improving generalisation of genetic programming for symbolic regression with angle-driven geometric semantic operators. IEEE Trans. Evol. Comput. 23(3), 488–502 (2019). https://doi.org/10.1109/TEVC.2018.2869621
Q. Chen, M. Zhang, B. Xue, Feature selection to improve generalisation of genetic programming for high-dimensional symbolic regression. IEEE Trans. Evol. Comput. 21(5), 792–806 (2017). https://doi.org/10.1109/TEVC.2017.2683489
Q. Chen, M. Zhang, B. Xue, Structural risk minimisation-driven genetic programming for enhancing generalisation in symbolic regression. IEEE Trans. Evol. Comput. 23(4), 703–717 (2019). https://doi.org/10.1109/TEVC.2018.2881392
O. Claveria, E. Monte, S. Torra, Assessment of the effect of the financial crisis on agents expectations through symbolic regression. Appl. Econ. Lett. 24(9), 648–652 (2017). https://doi.org/10.1080/13504851.2016.1218419
L.F. dal Piccol Sotto, V.V. de Melo, Studying bloat control and maintenance of effective code in linear genetic programming for symbolic regression. Neurocomputing 180, 79–93 (2016). https://doi.org/10.1016/j.neucom.2015.10.109. Progress in Intelligent Systems Design Selected papers from the 4th Brazilian Conference on Intelligent Systems (BRACIS 2014)
O. Claveria, E. Monte, S. Torra, Using survey data to forecast real activity with evolutionary algorithms: a cross-country analysis. J. Appl. Econ. 20(2), 329–349 (2017). https://doi.org/10.1016/S1514-0326(17)30015-6
G. D’Angelo, R. Pilla, C. Tascini, S. Rampone, A proposal for distinguishing between bacterial and viral meningitis using genetic programming and decision trees. Soft Comput. 23(22), 11775–11791 (2019). https://doi.org/10.1007/s00500-018-03729-y
I. De Falco, A. Della Cioppa, T. Koutny, M. Krcma, U. Scafuri, E. Tarantino, Genetic programming-based induction of a glucose-dynamics model for telemedicine. J. Netw. Comput. Appl. 119, 1–13 (2018). https://doi.org/10.1016/j.jnca.2018.06.007
F.O. de Franca, A greedy search tree heuristic for symbolic regression. Inf. Sci. 442, 18–32 (2018). https://doi.org/10.1016/j.ins.2018.02.040
V.V. de Melo, W. Banzhaf, Improving the prediction of material properties of concrete using kaizen programming with simulated annealing. Neurocomputing 246, 25–44 (2017). https://doi.org/10.1016/j.neucom.2016.12.077
V.V. de Melo, W. Banzhaf, Automatic feature engineering for regression models with machine learning: an evolutionary computation and statistics hybrid. Inf. Sci. 430–431, 287–313 (2018). https://doi.org/10.1016/j.ins.2017.11.041
G. Dick, Revisiting interval arithmetic for regression problems in genetic programming, in Genetic and Evolutionary Computation Conference—GECCO 2017, Berlin, Germany, July 15–19, 2017, Companion, Proceedings, ed. by G. Ochoa (ACM, 2017), pp. 129–130. https://doi.org/10.1145/3067695.3076107
A.I. Diveev, N.B. Konyrbaev, E.A. Sofronova, Method of binary analytic programming to look for optimal mathematical expression. Procedia Comput. Sci. 103, 597–604 (2017). https://doi.org/10.1016/j.procs.2017.01.073. XII International Symposium Intelligent Systems, INTELS 2016, 5–7 October 2016, Moscow, Russia
T. Dou, P. Rockett, Comparison of semantic-based local search methods for multiobjective genetic programming. Genet. Program. Evolvable Mach. 19(4), 535–563 (2018). https://doi.org/10.1007/s10710-018-9325-4
I. Fajfar, T. Tuma, Creation of numerical constants in robust gene expression programming. Entropy 20(10), 756 (2018). https://doi.org/10.3390/e20100756
M. Fenton, J. McDermott, D. Fagan, S. Forstenlechner, M. O’Neill, E. Hemberg, PonyGE2: Grammatical Evolution in Python. http://ncra.ucd.ie/Site/GEVA.html (2019)
F.A. Fortin, F.M. De Rainville, M.A. Gardner, M. Parizeau, C. Gagné, DEAP. https://github.com/DEAP/deap (2019)
J.H. Friedman, Greedy function approximation: a gradient boosting machine. Ann. Stat. 29(5), 1189–1232 (2001)
A. Garg, J.S.L. Lam, B.N. Panda, A hybrid computational intelligence framework in modelling of coal-oil agglomeration phenomenon. Appl. Soft Comput. 55, 402–412 (2017). https://doi.org/10.1016/j.asoc.2017.01.054
B. Ghaddar, N. Sakr, Y. Asiedu, Spare parts stocking analysis using genetic programming. Eur. J. Oper. Res. 252(1), 136–144 (2016). https://doi.org/10.1016/j.ejor.2015.12.041
E.M. Golafshani, A. Behnood, Automatic regression methods for formulation of elastic modulus of recycled aggregate concrete. Appl. Soft Comput. 64, 377–400 (2018). https://doi.org/10.1016/j.asoc.2017.12.030
I. Gonzalez-Taboada, B. Gonzalez-Fonteboa, F. Martinez-Abella, J.L. Perez-Ordonez, Prediction of the mechanical properties of structural recycled concrete using multivariable regression and genetic programming. Constr. Build. Mater. 106, 480–499 (2016). https://doi.org/10.1016/j.conbuildmat.2015.12.136
M.A. Haeri, M.M. Ebadzadeh, G. Folino, Statistical genetic programming for symbolic regression. Appl. Soft Comput. 60, 447–469 (2017). https://doi.org/10.1016/j.asoc.2017.06.050
S.A. Hosseini, A. Tavana, S.M. Abdolahi, S. Darvishmaslak, Prediction of blast induced ground vibrations in quarry sites: a comparison of GP, RSM and MARS. Soil Dyn. Earthq. Eng. 119, 118–129 (2019). https://doi.org/10.1016/j.soildyn.2019.01.011
A. Kattan, A. Agapitos, Y.S. Ong, A.A. Alghamedi, M. O’Neill, GP made faster with semantic surrogate modelling. Inf. Sci. 355–356, 169–185 (2016). https://doi.org/10.1016/j.ins.2016.03.030
M. Keijzer, Improving symbolic regression with interval arithmetic and linear scaling, in Genetic Programming. Proceedings of EuroGP’2003, LNCS, vol. 2610, ed. by C. Ryan, T. Soule, M. Keijzer, E. Tsang, R. Poli, E. Costa (Springer, Essex, 2003), pp. 70–82
M. Khandelwal, R.S. Faradonbeh, M. Monjezi, D.J. Armaghani, M.Z.B.A. Majid, S. Yagiz, Function development for appraising brittleness of intact rocks using genetic programming and non-linear multiple regression models. Eng. Comput. 33(1), 13–21 (2017). https://doi.org/10.1007/s00366-016-0452-3
M.F. Korns, Accuracy in symbolic regression, in Genetic Programming Theory and Practice IX, Genetic and Evolutionary Computation, ed. by R. Riolo, E. Vladislavleva, J.H. Moore (Springer, New York, 2011), pp. 129–151
M. Kovacic, A. Mihevc, M. Tercelj, Roll wear modeling using genetic programming—industry case study. Mater. Technol. (2019). https://doi.org/10.17222/mit.2018.104
M. Kovacic, A. Turnsek, D. Ocvirk, G. Gantar, Increasing the tensile strength and elongation of 16MnCrS5 steel using genetic programming. Mater. Technol. 51(6), 883–888 (2017). https://doi.org/10.17222/mit.2016.293
J. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection (MIT Press, Cambridge, 1992)
G. Kronberger, M. Kommenda, E. Lughofer, S. Saminger-Platz, A. Promberger, F. Nickel, S. Winkler, M. Affenzeller, Using robust generalized fuzzy modeling and enhanced symbolic regression to model tribological systems. Appl. Soft Comput. 69, 610–624 (2018). https://doi.org/10.1016/j.asoc.2018.04.048
J. Kubalik, E. Alibekov, R. Babuska, Optimal control via reinforcement learning with symbolic policy approximation. IFAC-PapersOnLine 50(1), 4162–4167 (2017). https://doi.org/10.1016/j.ifacol.2017.08.805. 20th IFAC World Congress
J. Kubalik, E. Alibekov, J. Zegklitz, R. Babuska, Hybrid single node genetic programming for symbolic regression. Trans. Comput. Collect. Intell. 9770, 61–82 (2016). https://doi.org/10.1007/978-3-662-53525-7_4
H.C. Kwak, S. Kho, Predicting crash risk and identifying crash precursors on Korean expressways using loop detector data. Accid. Anal. Prev. 88, 9–19 (2016). https://doi.org/10.1016/j.aap.2015.12.004
W.B. Langdon, The Genetic Programming Bibliography. http://www.gpbib.cs.ucl.ac.uk/ (2020)
W.B. Langdon, R. Poli, Foundations of Genetic Programming (Springer, New York, 2002)
M. Lichman, UCI Machine Learning Repository. http://archive.ics.uci.edu/ml (2013)
H. Liu, H. Lin, X. Jiang, X. Mao, Q. Liu, B. Li, Estimation of mass matrix in machine tool’s weak components research by using symbolic regression. Comput. Ind. Eng. 127, 998–1011 (2019). https://doi.org/10.1016/j.cie.2018.11.033
Q. Lu, J. Ren, Z. Wang, Using genetic programming with prior formula knowledge to solve symbolic regression problem. Comput. Intell. Neurosci. (2016). https://doi.org/10.1155/2016/1021378
S. Luke, E.O. Scott, L. Panait, G. Balan, S. Paus, Z. Skolicki, R. Kicinger, E. Popovici, K. Sullivan, J. Harrison, J. Bassett, R. Hubley, A. Desai, A. Chircop, J. Compton, W. Haddon, S. Donnelly, B. Jamil, J. Zelibor, E. Kangas, F. Abidi, H. Mooers, J. O’Beirne, L. Manzoni, K.A. Talukder, S. McKay, J. McDermott, J. Zou, A. Rutherford, D. Freelan, E. Wei, S. Rajendran, A. Dhawan, B. Brumbac, J. Hilty, A. Kabir, ECJ 27: A Java-based Evolutionary Computation Research System. https://cs.gmu.edu/~eclab/projects/ecj (2019)
S. Mahler, D. Robilliard, C. Fonlupt, Tarpeian bloat control and generalization accuracy, in Proceedings of the 8th European Conference on Genetic Programming, Lecture Notes in Computer Science, vol. 3447, ed. by M. Keijzer, A. Tettamanzi, P. Collet, J.I. van Hemert, M. Tomassini (Springer, Lausanne, 2005), pp. 203–214
L.F. Miranda, L.O.V.B. Oliveira, J.F.B.S. Martins, G.L. Pappa, How noisy data affects geometric semantic genetic programming, in Genetic and Evolutionary Computation Conference—GECCO 2017, Berlin, Germany, July 15–19, 2017, Companion, Proceedings, ed. by G. Ochoa (ACM, 2017), pp. 985–992. https://doi.org/10.1145/3071178.3071300
J.L. Montana, C.L. Alonso, C.E. Borges, C. Tirnauca, Model-driven regularization approach to straight line program genetic programming. Expert Syst. Appl. 57, 76–90 (2016). https://doi.org/10.1016/j.eswa.2016.03.003
S.S. Mousavi Astarabadi, M.M. Ebadzadeh, A decomposition method for symbolic regression problems. Appl. Soft Comput. 62, 514–523 (2018). https://doi.org/10.1016/j.asoc.2017.10.041
J. Ni, R.H. Drieberg, P.I. Rockett, The use of an analytic quotient operator in genetic programming. IEEE Trans. Evol. Comput. 17(1), 146–152 (2013)
J. Ni, P. Rockett, Tikhonov regularization as a complexity measure in multiobjective genetic programming. IEEE Trans. Evol. Comput. 19(2), 157–166 (2015)
M. Nicolau, Understanding grammatical evolution: initialisation. Genet. Program. Evolvable Mach. 18(4), 1–41 (2017). https://doi.org/10.1007/s10710-017-9309-9
M. Nicolau, I. Dempsey, Introducing grammar based extensions for grammatical evolution, in IEEE Congress on Evolutionary Computation (CEC 2006), pp. 2663–2670
M. Nicolau, M. O’Neill, A. Brabazon, Termination in grammatical evolution: Grammar design, wrapping, and tails, in IEEE Congress on Evolutionary Computation (CEC 2012) (2012), pp. 1–8
N.Y. Nikolaev, H. Iba, Regularization approach to inductive genetic programming. IEEE Trans. Evol. Comput. 5(4), 359–375 (2001)
M. O’Neill, C. Ryan, Grammatical Evolution: Evolutionary Automatic Programming in an Arbitrary Language, Genetic programming, vol. 4 (Kluwer, Alphen aan den Rijn, 2003)
L. Pagie, P. Hogeweg, Evolutionary consequences of coevolving targets. Evol. Comput. 5(4), 401–418 (1997)
X. Pan, M.K. Uddin, B. Ai, X. Pan, Influential factors of carbon emissions intensity in OECD countries: evidence from symbolic regression. J. Clean. Prod. (2019). https://doi.org/10.1016/j.jclepro.2019.02.195
T.P. Pawlak, K. Krawiec, Competent geometric semantic genetic programming for symbolic regression and Boolean function synthesis. Evol. Comput. 26(2), 177–212 (2018). https://doi.org/10.1162/EVCO_a_00205
S. Polanco-Martagon, J. Ruiz-Ascencio, M.A. Duarte-Villasenor, Symbolic modeling of the Pareto-optimal sets of two unity gain cells. DYNA 83(197), 128–137 (2016). https://doi.org/10.15446/dyna.v83n197.50919
M. Quade, M. Abel, K. Shafi, R.K. Niven, B.R. Noack, Prediction of dynamical systems by symbolic regression. Phys. Rev. E 94, 012214 (2016). https://doi.org/10.1103/PhysRevE.94.012214
S.S. Rathore, S. Kumar, Towards an ensemble based system for predicting the number of software faults. Expert Syst. Appl. 82, 357–382 (2017). https://doi.org/10.1016/j.eswa.2017.04.014
S. Silva, GPLAB: A Genetic Programming Toolbox for MATLAB. http://gplab.sourceforge.net/download.html (2019)
S. Silva, L. Vanneschi, Operator equalisation, bloat and overfitting: a study on human oral bioavailability prediction, in GECCO’09: Proceedings of the 11th Annual Conference on Genetic and Evolutionary Computation (ACM, Montreal, 2009), pp. 1115–1122
A. Sohani, M. Zabihigivi, M.H. Moradi, H. Sayyaadi, H.H. Balyani, A comprehensive performance investigation of cellulose evaporative cooling pad systems using predictive approaches. Appl. Therm. Eng. 110, 1589–1608 (2017). https://doi.org/10.1016/j.applthermaleng.2016.08.216
R. Taghizadeh-Mehrjardi, K. Nabiollahi, R. Kerry, Digital mapping of soil organic carbon at multiple depths using different data mining techniques in Baneh Region, Iran. Geoderma 266, 98–110 (2016). https://doi.org/10.1016/j.geoderma.2015.12.003
A. Tahmassebi, A.H. Gandomi, Building energy consumption forecast using multi-objective genetic programming. Measurement 118, 164–171 (2018). https://doi.org/10.1016/j.measurement.2018.01.032
Y. Tao, Y.J. Chen, X. Fu, B. Jiang, Y. Zhang, Evolutionary ensemble learning algorithm to modeling of warfarin dose prediction for Chinese. IEEE J. Biomed. Health Inform. (2018). https://doi.org/10.1109/JBHI.2018.2812165
P.T. Thuong, N.X. Hoai, X. Yao, Combining conformal prediction and genetic programming for symbolic interval regression, in Genetic and Evolutionary Computation Conference—GECCO 2017, Berlin, Germany, July 15–19, 2017, Companion, Proceedings, ed. by G. Ochoa (ACM, 2017), pp. 1001–1008. https://doi.org/10.1145/3071178.3071280
L. Vanneschi, M. Castelli, S. Silva, Measuring bloat, overfitting and functional complexity in genetic programming, in GECCO’10: Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation (ACM, Portland, USA, 2010), pp. 877–884
V. Vapnik, The Nature of Statistical Learning Theory (Springer, New York, 1999)
E.J. Vladislavleva, G.F. Smits, D. den Hertog, Order of nonlinearity as a complexity measure for models generated by symbolic regression via Pareto genetic programming. IEEE Trans. Evol. Comput. 13(2), 333–349 (2009)
P. Whigham, G. Dick, J. Maclaurin, C.A. Owen, libgges: Grammar-Guided Evolutionary Search. https://github.com/DEAP/deap (2019)
X. Yao, Universal approximation by genetic programming, in Foundations of Genetic Programming, ed. by T. Haynes, W.B. Langdon, U.M. O’Reilly, R. Poli, J. Rosca (Orlando, Florida, USA, 1999), pp. 66–67
Y.S. Yeun, W.S. Ruy, Y.S. Yang, N.J. Kim, Implementing linear models in genetic programming. IEEE Trans. Evol. Comput. 8(6), 542–566 (2004)
E. Flores, M. Abatal, A. Bassam, L. Trujillo, P. Juarez-Smith, Y. El Hamzaoui, Modeling the adsorption of phenols and nitrophenols by activated carbon using genetic programming. J. Clean. Prod. 161, 860–870 (2017). https://doi.org/10.1016/j.jclepro.2017.05.192
A. Zameer, J. Arshad, A. Khan, M.A.Z. Raja, Intelligent and robust prediction of short term wind power using genetic programming based ensemble of neural networks. Energy Convers. Manag. 134, 361–372 (2017). https://doi.org/10.1016/j.enconman.2016.12.032
J. Zhong, W. Cai, M. Lees, L. Luo, Automatic model construction for the behavior of human crowds. Appl. Soft Comput. 56, 368–378 (2017). https://doi.org/10.1016/j.asoc.2017.03.020
Cite this article
Nicolau, M., Agapitos, A. Choosing function sets with better generalisation performance for symbolic regression models. Genet Program Evolvable Mach 22, 73–100 (2021). https://doi.org/10.1007/s10710-020-09391-4
DOI: https://doi.org/10.1007/s10710-020-09391-4