Automatic feature engineering for regression models with machine learning: An evolutionary computation and statistics hybrid
Introduction
In a traditional regression task, one seeks to model the relationship between a dependent variable (the response) and one or more independent variables (also called explanatory variables). In a statistical regression approach, the practitioner employs a predetermined function f (or manually develops a variation of it) to combine the explanatory variables x and calculate the output y = f(x, β) + ϵ, where β is a set of parameters (constants, one for each variable) and ϵ is a measure of error. An optimization method then adjusts β to minimize the “lack of fit”.
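As a concrete illustration (not from the paper), the following sketch fits the simple linear model y = β₀ + β₁x + ϵ by Ordinary Least Squares, the classical way of minimizing the lack of fit; the data and coefficient values are made up for the example.

```python
import numpy as np

# Made-up data: one explanatory variable and a noisy linear response.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(0, 0.5, size=50)

# Predetermined model f(x, beta) = beta0 + beta1 * x; optimize beta to
# minimize the sum of squared residuals (the "lack of fit").
X = np.column_stack([np.ones_like(x), x])     # design matrix [1, x]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # Ordinary Least Squares
eps = y - X @ beta                            # estimated errors
print(beta)                                   # close to [2.0, 3.0]
```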
Symbolic Regression (SR), on the other hand, is a non-linear regression analysis technique that generates mathematical expressions to fit a given dataset. Being an optimization algorithm, SR optimizes the mathematical expressions according to some criterion, such as goodness-of-fit and/or expression complexity. Another particular aspect of SR is that it assumes no a priori model; nevertheless, one may be provided.
An initial expression, or group of expressions, is therefore randomly generated from the operand and operator sets provided by the user. Operands are the features of the dataset and constants, such as π. Operators are functions that generate data (random distributions, for instance) or functions applied to the operands (arithmetical, geometrical, etc.). As SR may start with random expressions, and there is usually no mechanism to forbid specific constructions, the algorithm is free to explore the search space of solutions. Thus, it may find high-quality models that would never be discovered by humans, because the relationships among the variables might not make sense from a human perspective. Nonetheless, if necessary, domain knowledge and bias can be introduced through grammar-based SR algorithms [34].
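A minimal sketch of how such initial expressions can be generated (illustrative only; the operator/operand sets and growth policy here are assumptions, not the paper's implementation):

```python
import math
import random

# Assumed operator and operand sets for illustration; users supply their own.
OPERATORS = {"+": 2, "-": 2, "*": 2, "sin": 1}   # name -> arity
OPERANDS = ["x", str(math.pi)]                   # dataset features, constants

def random_expression(depth):
    """Grow a random expression string up to the given depth."""
    if depth == 0 or random.random() < 0.3:      # stop early sometimes
        return random.choice(OPERANDS)
    op = random.choice(list(OPERATORS))
    args = [random_expression(depth - 1) for _ in range(OPERATORS[op])]
    return f"{op}({args[0]})" if OPERATORS[op] == 1 else f"({args[0]} {op} {args[1]})"

random.seed(42)
print(random_expression(3))   # e.g. something like "(x * sin(3.14...))"
```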
One may notice that SR is a more general, mixed-type problem, where not only the parameters β must be optimized but also an appropriate function f must be found. Therefore, SR must optimize both the model structure and its parameters, while traditional regression techniques optimize the parameters of a model supplied by the user. Clearly, SR solves a substantially more difficult problem and requires particular algorithms in order to work properly; consequently, many researchers are still looking for good heuristics to improve the search.
In recent years, SR has been widely studied with Evolutionary Computation (EC) techniques able to produce computer code. As examples one may cite Genetic Programming (GP, [25], [5]), Multi Expression Programming (MEP, [31]), Gene Expression Programming (GEP, [15]), Grammatical Evolution (GE, [34]), Linear Genetic Programming (LGP, [8]), Cartesian Genetic Programming (CGP, [29]), Behavioral Programming (BP, [26]), and Stack-based Genetic Programming (Stack-based GP, [32], [39]). These methods evolve populations of individuals, each being a single model. Related non-EC techniques may also be found in the literature, for instance Fast Function Extraction (FFX, [27]) and Prioritized Grammar Enumeration (PGE, [42]); unlike the EC methods, these two are very successful at feature engineering, usually defined as the process of creating relevant features from the original features in the data in order to increase the predictive power of the learning algorithm.
As previously stated, Evolutionary Computation is largely used for finding models composed of a single highly predictive feature. In EC methods, the differential survival of fitter solutions is one of the main ingredients. In most cases, a population of individuals is evolved to solve a particular task, where an individual represents a complete solution. Competition among the individuals is used to control evolution, allowing the population to converge to the best individual. Due to the stochasticity of the process there is, of course, no guarantee of convergence to the optimum, only of approximation. Thus, it is important to examine techniques that can provide better guidance in stochastic global optimization tasks such as SR.
An interesting proposal for better guidance is the Cooperative Co-evolution Algorithm (CCeA, [33]). It was proposed as a Genetic Algorithm extension providing an environment in which subcomponents could “emerge” and collaboration could appear automatically. CCeA is an evolutionary approach, so one expects emergent behavior to occur. Over the years, several issues of this approach have been addressed, e.g., the need for larger populations than in traditional Evolutionary Algorithms, or for multiple populations (one per subcomponent); the credit assignment problem; and the selection of subcomponents for combination, either at random or based on their individual fitnesses (good subcomponents do not always produce good solutions when put together). CCeAs have been applied with success to several tasks, including SR [1], [6], [33], [41].
De Melo [9] proposed Kaizen Programming (KP), which also searches for collaboration among subcomponents. Unlike CCeA, KP was proposed as an iterative approach focused on efficient problem-solving techniques that may come from Statistics, ML, classical Artificial Intelligence (AI), Econometrics, or other related areas. For instance, KP has been used with Logistic Regression [10], [12], CART decision trees [13], and Random Forests [36]. A greedy variant has also been developed for solving a control problem known as the virtual Lawn Mower [37].
These techniques used by KP can be seen as powerful local optimizers that may need good starting points. To provide such points, KP searches the solution space through random search, recombination, variation, and sampling, among other methods. KP then uses those starting points as subcomponents, and the local optimization techniques try to find the combination of subcomponents that yields the highest-quality solution. Afterwards, one can identify what was combined and the importance assigned by the local optimizers to each subcomponent.
The application of KP to SR means that, instead of directly searching for a solution, KP will search for a set of features (mostly non-linear, but single features may be selected) for a known model; in this paper, a standard linear model optimized by Ordinary Least Squares. A relevant characteristic of such an approach is the posterior use of other statistical tools for further feature and model selection (AIC, BIC, among others), to calculate prediction and confidence intervals, and to perform residual analysis.
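The following toy sketch illustrates this division of labor (names and data are invented for the example): candidate nonlinear transformations play the role of KP's partial solutions, and OLS combines them into a linear model whose coefficients reveal each feature's contribution.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=200)
y = np.sin(x) + x ** 2                  # unknown target relationship

# Candidate transformations play the role of KP's partial solutions.
candidates = {"x": x, "x^2": x ** 2, "sin(x)": np.sin(x), "exp(x)": np.exp(x)}
names = list(candidates)
X = np.column_stack([np.ones_like(x)] + [candidates[n] for n in names])

# The known model: a linear combination of the features, fit by OLS.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
for name, b in zip(["intercept"] + names, beta):
    print(f"{name:>9s}: {b:+.3f}")      # sin(x) and x^2 receive weight ~1
```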
A known statistical approach related to KP for regression tasks is called basis expansion ([20], Chapter 5, Page 115):
The core idea in this chapter is to augment/replace the vector of inputs X with additional variables, which are transformations of X, and then use linear models in this new space of derived input features.
This basis expansion strategy is used by techniques such as MARS [17]. MARS starts with a model having only the mean of the response values as the intercept term. It then greedily iterates, adding to the model the pair of basis functions that gives the maximum reduction in the loss function (residual sum of squares). The basis functions are simply hinge functions that return the positive part of their argument in one partition of the input space, and zero otherwise.
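A hinge function pair can be written directly; this tiny example (illustrative, with an arbitrary knot t = 3) shows the mirrored pair MARS adds per knot:

```python
# Hinge functions for a knot t: each returns the positive part of its
# argument on one side of t and zero on the other; MARS adds them in pairs.
def hinge_pos(x, t):
    return max(x - t, 0.0)   # nonzero only for x > t

def hinge_neg(x, t):
    return max(t - x, 0.0)   # nonzero only for x < t

print(hinge_pos(5.0, 3.0), hinge_neg(5.0, 3.0))  # 2.0 0.0
```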
Another related deterministic method is Fast Function Extraction (FFX, [27]). FFX uses path-wise regularized linear learning techniques to create generalized linear models, i.e., compositions of nonlinear basis functions (formulas) with linearly-learned coefficients. The basis functions for a given complexity (tree depth) are all created and evaluated at once, and complexity increases over the iterations, resulting in an exponential growth in the number of basis functions. Moreover, as all combinations are created, there is no learning step to identify poor basis functions; they are greedily selected by a stepwise procedure. Many models are built with different levels of complexity and are filtered using a multi-objective domination procedure.
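The following is a rough sketch of this idea (illustrative only, not the FFX implementation): simple basis functions are enumerated all at once, and a small coordinate-descent Lasso stands in for the path-wise regularized learner, with stronger penalties keeping fewer bases.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0.1, 2.0, size=100)
y = 2.0 * np.log(x) + x ** 2                 # target built from two bases

# Enumerate all basis functions of a small, fixed complexity at once.
bases = {"x": x, "x^2": x ** 2, "log(x)": np.log(x),
         "1/x": 1.0 / x, "sqrt(x)": np.sqrt(x)}
names = list(bases)
B = np.column_stack([bases[n] for n in names])

def lasso_cd(B, y, alpha, sweeps=500):
    """Coordinate descent for (1/2n)||y - Bw||^2 + alpha * ||w||_1."""
    n, p = B.shape
    w = np.zeros(p)
    col_sq = (B ** 2).sum(axis=0)
    for _ in range(sweeps):
        for j in range(p):
            r = y - B @ w + B[:, j] * w[j]   # residual excluding basis j
            rho = B[:, j] @ r
            w[j] = np.sign(rho) * max(abs(rho) - n * alpha, 0.0) / col_sq[j]
    return w

# Stronger penalties keep fewer bases; weak penalties approach plain OLS.
for alpha in (0.5, 1e-4):
    w = lasso_cd(B, y, alpha)
    kept = [n for n, wj in zip(names, w) if abs(wj) > 1e-6]
    print(f"alpha={alpha}: kept {kept}")
```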
Traditional statistical methods for basis expansion are deterministic, mostly limited in the functions that can be used and in the complexity of the discovered bases (due to the use of exhaustive search). KP, on the other hand, has experts that can be either deterministic or non-deterministic, allowing, for instance, rejected basis functions to reappear in a later cycle and prove significant.
A similar approach was taken by Icke and Bongard [21], who proposed a hybrid version in which the features resulting from many models of an FFX run are passed on to GP for another step of model building. The authors hypothesized that such an approach would increase the chances of GP succeeding by letting FFX extract informative features while GP builds more complex models. The paper reports that the hybrid algorithm provided an advantage over GP only for the larger datasets, with 10 and 25 variables. As one may notice, in this case GP is the method supposed to solve the problem; this is exactly the opposite of our approach.
In this paper, we present a deeper investigation of the technique presented in [9]. The main contributions here are more details on the framework, how we used GP’s components in KP, a comprehensive experimental section, a comparison with related work from the literature, and a longer discussion on the pros and cons of KP. Nevertheless, it is important to be clear that the focus of this paper is on SR benchmark functions, while the performance on real-world datasets has been investigated elsewhere [11], [14].
This paper is organized as follows: Section 2 discusses the most related work, Section 3 introduces the proposed hybrid method, Section 4 reports on the experiments using a number of selected benchmark functions from the literature, and Section 5 concludes. A further discussion of KP properties, including identified weaknesses, is presented in Appendix C. Supplementary material containing more detailed descriptive statistics of the experiments reported here is available.
Related work
Although there are many SR techniques employing some kind of constant optimization, here we present the most related ones, meaning they are EC techniques that build linear models through the combination of individuals or partial solutions. Other related approaches were described in the Introduction.
Keijzer [22] investigated the use of linear regression on the output of arbitrary symbolic expressions, with application in symbolic regression. He showed that the use of a scaled error measure, in which the program output is linearly adjusted to the targets before the error is computed, considerably improves GP performance on symbolic regression.
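Linear scaling of this kind can be sketched as follows (an illustrative implementation, not Keijzer's code): the program output g is rescaled by the closed-form optimal slope and intercept before the error is measured.

```python
import numpy as np

def scaled_mse(y, g):
    """MSE after rescaling the program output g by the best-fit a + b*g."""
    b = np.cov(y, g, bias=True)[0, 1] / np.var(g)  # optimal slope
    a = y.mean() - b * g.mean()                    # optimal intercept
    return np.mean((y - (a + b * g)) ** 2)

y = np.array([1.0, 2.0, 3.0])
g = np.array([10.0, 20.0, 30.0])   # right shape, wrong scale
print(scaled_mse(y, g))            # ~0: scaling absorbs the linear mismatch
```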
The hybrid method
Kaizen Programming (KP) was proposed in [9] as a hybrid technique based on the concepts of the Kaizen methodology. KP is an abstraction of important parts of both the Kaizen methodology and the Plan-Do-Check-Act (PDCA) [13] cycle, with the goal of serving as a Computational Intelligence tool. Therefore, it is not a simulation of a Kaizen event.
For easier understanding of KP, Table 1 relates important terms used here (taken from the Kaizen methodology) to concepts from EC and ML. From now
Experimental analysis
This section presents a comparison of KP with related methods in solving three sets of well-known symbolic regression benchmark functions. These functions present different levels of difficulty, being largely employed in the literature to evaluate the performance of symbolic regression techniques. Investigation on real-world datasets is out of the scope of this paper.
Another focus of this paper is the comparison with results presented in the literature instead of running the algorithms. This
Summary and conclusions
Kaizen Programming (KP) is a hybrid algorithm that uses a collaborative problem-solving approach in which partial solutions have to be joined to produce a complete solution.
KP was evaluated with three well-known symbolic regression benchmark sets (Nguyen, Keijzer, and Korns) in three experiments with distinct configurations. KP results were compared with those from GP with semantically-based crossover, Abstract Expression Grammar, and Prioritized Grammar Enumeration. A sensitivity analysis was
Acknowledgments
This paper was supported by the Brazilian Government CNPq (Universal) grant (486950/2013-1) and CAPES (Science without Borders) grant (12180-13-0) to V.V.M., and Canada’s NSERC Discovery grant RGPIN 283304-2012 to W.B.
References (42)
- et al., Cooperative co-evolution inspired operators for classical GP schemes, Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), 2008.
- et al., Multiple regression genetic programming, Proceedings of the 16th Conference on Genetic and Evolutionary Computation, 2014.
- Selective pressure in evolutionary algorithms: a characterization of selection mechanisms, Proceedings of the First IEEE Conference on Evolutionary Computation, IEEE World Congress on Computational Intelligence, 1994.
- et al., Evolutionary Computation 1: Basic Algorithms and Operators, 2000.
- et al., Genetic Programming: An Introduction: On the Automatic Evolution of Computer Programs and Its Applications, 1998.
- O. Barrière, E. Lutton, P.-H. Wuillemin, C. Baudrit, M. Sicard, N. Perrot, EVOLVE - A Bridge between Probability, Set...
- et al., Convex Optimization, 2004.
- et al., Evolving teams of predictors with linear genetic programming, Genet. Program. Evolvable Mach., 2001.
- Kaizen programming, Proceedings of the 16th Conference on Genetic and Evolutionary Computation, 2014.
- Gene expression programming in problem solving, Soft Computing and Industry, 2002.
- DEAP: evolutionary algorithms made easy, J. Mach. Learn. Res.
- Multivariate adaptive regression splines, Ann. Stat.
- Tools and Methods for the Improvement of Quality.
- The WEKA data mining software: an update, SIGKDD Explor. Newsl.
- The elements of statistical learning: data mining, inference and prediction, Math. Intell.
- Improving genetic programming based symbolic regression using deterministic machine learning, Proceedings of the 2013 IEEE Congress on Evolutionary Computation (CEC).
- Scaled symbolic regression, Genet. Program. Evolvable Mach.
- Abstract expression grammar symbolic regression, Genetic Programming Theory and Practice VIII.
- Trustable symbolic regression models: using ensembles, interval arithmetic and Pareto fronts to develop robust and trust-aware models, Genetic Programming Theory and Practice V.
- Genetic Programming: On the Programming of Computers by Means of Natural Selection.
- Behavioral programming: a broader and more detailed take on semantic GP, Proceedings of the 16th Conference on Genetic and Evolutionary Computation.