Information Sciences

Volumes 430–431, March 2018, Pages 287–313

Automatic feature engineering for regression models with machine learning: An evolutionary computation and statistics hybrid

https://doi.org/10.1016/j.ins.2017.11.041

Abstract

Symbolic Regression (SR) is a well-studied task in Evolutionary Computation (EC), in which adequate free-form mathematical models must be automatically discovered from observed data. Statisticians, engineers, and general data scientists still prefer traditional regression methods over EC methods because of their solid mathematical foundations, the interpretability of the models, and the absence of randomness, even though such deterministic methods tend to provide lower-quality predictions than stochastic EC methods. On the other hand, while EC solutions can be large and uninterpretable, they can be created with less bias, finding high-quality solutions that would be avoided by human researchers. Another interesting possibility is using EC methods to perform automatic feature engineering for a deterministic regression method instead of evolving a single model; this may lead to smaller solutions that are easier to understand. In this contribution, we evaluate an approach called Kaizen Programming (KP) to develop a hybrid method employing EC and Statistics. While the EC method builds the features, the statistical method efficiently builds the models, which are also used to provide the importance of the features; thus, features are improved over the iterations, resulting in better models. Here we examine a large set of benchmark SR problems known from the EC literature. Our experiments show that KP outperforms traditional Genetic Programming (a popular EC method for SR) and also shows improvements over other methods, including other hybrids and well-known statistical and Machine Learning (ML) ones. More in line with ML than EC approaches, KP is able to provide high-quality solutions while requiring only a small number of function evaluations.

Introduction

In a traditional regression task, one seeks to model the relationship between a dependent variable (the response) and one or more independent variables (also called explanatory variables). In a statistical regression approach, the practitioner employs a predetermined function f (or manually develops a variation) to combine the explanatory variables x in order to calculate the output y: y = f(x, β) + ϵ, where β is a set of parameters (constants, one for each variable) and ϵ is a measure of error. An optimization method then adjusts β to minimize the “lack of fit”.
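As a concrete illustration, the sketch below fits β by ordinary least squares for a predetermined linear f. The data and coefficients are synthetic and purely illustrative.

```python
# Minimal sketch: fitting the parameters beta of a predetermined linear
# function f(x, beta) by ordinary least squares, using NumPy only.
# The data below are synthetic and purely illustrative.
import numpy as np

rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(100, 2))        # explanatory variables
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.5 + rng.normal(0, 0.1, 100)

# Augment with an intercept column and solve min ||X beta - y||^2.
X1 = np.column_stack([np.ones(len(X)), X])
beta, residuals, rank, _ = np.linalg.lstsq(X1, y, rcond=None)
print("estimated beta:", beta)               # ~ [0.5, 3.0, -2.0]
```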

Symbolic Regression (SR), on the other hand, is a non-linear regression analysis technique that generates mathematical expressions to fit a given dataset. As an optimization algorithm, SR optimizes the mathematical expressions according to some criterion, such as goodness-of-fit and/or expression complexity. Another particular aspect of SR is that it assumes no a priori model; nevertheless, one may be provided.

The search therefore starts from an initial expression, or group of expressions, randomly generated from the operand and operator sets provided by the user. Operands are the features of the dataset and constants, such as π. Operators are functions that generate data (random distributions, for instance) or functions to be applied to the operands (arithmetical, geometrical, etc.). As SR may start with random expressions, and usually there is no mechanism to avoid specific constructions, the algorithm is free to explore the search space of solutions. Thus, it may find high-quality models that would never be discovered by humans, because the relationships among the variables might not make sense from a human perspective. Nonetheless, if necessary, domain knowledge and bias can be incorporated via grammar-based SR algorithms [34].
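To make this concrete, here is a minimal, hypothetical sketch of such a random expression generator; the operator and operand sets and the growth policy are illustrative assumptions, not any particular system's implementation.

```python
# Illustrative sketch: randomly growing an expression tree from
# user-supplied operand and operator sets.
import math
import random

OPERATORS = {"+": 2, "-": 2, "*": 2, "sin": 1}   # name -> arity
OPERANDS = ["x0", "x1", str(math.pi)]            # features and constants

def random_expr(depth):
    """Grow a random expression; leaves are operands, inner nodes operators."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(OPERANDS)
    op = random.choice(list(OPERATORS))
    args = [random_expr(depth - 1) for _ in range(OPERATORS[op])]
    if OPERATORS[op] == 1:
        return f"{op}({args[0]})"
    return f"({args[0]} {op} {args[1]})"

random.seed(1)
print(random_expr(3))   # e.g. an expression such as "(sin(x0) * (x1 + 3.14...))"
```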

One may notice that SR is a more general, mixed-type problem, where not only the parameters β must be optimized but also an appropriate function f must be found. Therefore, SR must optimize both the model structure and its parameters, while traditional regression techniques optimize the parameters of a model supplied by the user. Clearly, SR solves a substantially more difficult problem and requires particular algorithms in order to work properly; consequently, many researchers are still looking for good heuristics to improve the search.

Over the years, SR has been widely studied with Evolutionary Computation (EC) techniques able to produce computer code. As examples one may cite Genetic Programming (GP, [25], [5]), Multi Expression Programming (MEP, [31]), Gene Expression Programming (GEP, [15]), Grammatical Evolution (GE, [34]), Linear Genetic Programming (LGP, [8]), Cartesian Genetic Programming (CGP, [29]), Behavioral Programming (BP, [26]), and Stack-based Genetic Programming (Stack-based GP, [32], [39]). These methods evolve populations of individuals, each being a single model. Related non-EC techniques may also be found in the literature, for instance Fast Function Extraction (FFX, [27]) and Prioritized Grammar Enumeration (PGE, [42]); different from the EC methods, these two are very successful in performing feature engineering, which is usually defined as the process of creating relevant features from the original features in the data in order to increase the predictive power of the learning algorithm.

As previously stated, Evolutionary Computation is largely used for finding models composed of a single, highly predictive feature. In EC methods, the differential survival of fitter solutions is one of the main ingredients. In most cases, a population of individuals is evolved to solve a particular task, where an individual represents a complete solution. Competition among the individuals is used to control evolution, allowing the population to converge to the best individual. Due to the stochasticity of the process there is, of course, no guarantee of convergence to the optimum, only of approximation. Thus, it is important to examine techniques that can provide better guidance in stochastic global optimization tasks such as SR.

An interesting proposal for better guidance is the Cooperative Co-evolutionary Algorithm (CCeA, [33]). It was proposed as a Genetic Algorithm extension to provide an environment where subcomponents could “emerge” and collaboration could appear automatically. CCeA is an evolutionary approach; therefore, one expects emergent behavior to occur. Over the years, several issues of this approach have been addressed, e.g., the need for larger populations than in traditional Evolutionary Algorithms, or for multiple populations (one for each subcomponent); the credit assignment problem; and the selection of subcomponents for combination, either at random or based on their individual fitnesses (good subcomponents do not always produce good solutions when put together). CCeAs have been applied with success to several tasks, including SR [1], [6], [33], [41].

De Melo [9] proposed Kaizen Programming (KP), which also searches for collaboration among subcomponents. Different from CCeA, KP was proposed as an iterative approach focused on efficient problem-solving techniques that could come from Statistics, ML, classical Artificial Intelligence (AI), Econometrics, or other related areas. For instance, KP has been used with Logistic Regression [12], [10], the CART decision tree [13], and Random Forests [36]. Also, a greedy approach was developed for solving a control problem known as the virtual Lawn Mower [37].

The techniques used by KP can be seen as powerful local optimizers that may need good starting points. To provide such points, KP searches the solution space through random search, recombination, variation, and sampling, among other methods. KP then uses those starting points as subcomponents, and the local optimization techniques try to find the combination of subcomponents that yields the highest-quality solution. Afterwards, one can identify what was combined and the importance given by the local optimizers to each subcomponent.

The application of KP to SR means that, instead of directly searching for a solution, KP searches for a set of features (mostly non-linear, though single features may be selected) for a known model; in this paper, a standard linear model optimized by Ordinary Least Squares. A relevant characteristic of such an approach is the posterior use of other statistical tools for further feature and model selection (AIC, BIC, among others), to calculate prediction and confidence intervals, and to perform residual analysis, among others.
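A minimal sketch of one evaluation step under this setup might look as follows. The candidate features, data, and the use of OLS p-values as a measure of importance are illustrative assumptions made for this sketch, not the paper's exact procedure.

```python
# Sketch: candidate nonlinear features are evaluated on the data, a linear
# model is fitted by OLS, and the fit's p-values serve as feature importance
# to guide the next cycle. Feature expressions and data are hypothetical.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(0, 2, 200)
y = x**2 + np.sin(x) + rng.normal(0, 0.05, 200)   # unknown target model

# Candidate features proposed by the "experts" (e.g., built with GP operators).
features = {"x": x, "x^2": x**2, "sin(x)": np.sin(x), "exp(x)": np.exp(x)}
X = sm.add_constant(np.column_stack(list(features.values())))

fit = sm.OLS(y, X).fit()
for name, p in zip(["intercept"] + list(features), fit.pvalues):
    print(f"{name:>10}: p-value = {p:.3g}")
# Low p-values flag important features; high ones mark candidates to replace.
```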

A known statistical approach related to KP for regression tasks is called basis expansion ([20], Chapter 5, Page 115):

The core idea in this chapter is to augment/replace the vector of inputs X with additional variables, which are transformations of X, and then use linear models in this new space of derived input features.

This basis expansion strategy is used by techniques such as MARS [17]. MARS starts with a model containing only the mean of the response values as the intercept term. It then greedily iterates, at each step adding to the model the pair of basis functions that gives the maximum reduction in the loss function (residual sum of squares). The basis functions are simply hinge functions, which return the positive part of their argument and zero otherwise.
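For illustration, a reflected hinge pair at a knot t can be sketched as below; this is a minimal sketch of the basis functions only, not of MARS itself.

```python
# Hinge ("hockey-stick") basis pair as used by MARS: for a knot t, the pair
# h+(x) = max(0, x - t) and h-(x) = max(0, t - x) splits the input into two
# linear regimes; MARS adds the pair that most reduces the residual error.
import numpy as np

def hinge_pair(x, t):
    """Return the reflected pair of hinge basis functions at knot t."""
    return np.maximum(0.0, x - t), np.maximum(0.0, t - x)

x = np.linspace(-2, 2, 9)
h_plus, h_minus = hinge_pair(x, 0.5)
print(h_plus)    # positive part of (x - 0.5), zero elsewhere
print(h_minus)   # positive part of (0.5 - x), zero elsewhere
```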

Another deterministic related method is Fast Function Extraction (FFX, [27]). FFX uses path-wise regularized linear learning techniques to create generalized linear models, i.e., compositions of nonlinear basis functions (formulas) with linearly learned coefficients. The basis functions for a given complexity (tree depth) are all created and evaluated at once, and complexity increases over the iterations, resulting in exponential growth in the number of basis functions. Also, as all combinations are created, there is no learning to identify poor basis functions; instead, basis functions are greedily selected by a stepwise procedure. Many models of different levels of complexity are built and then filtered using a multi-objective domination procedure.
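The flavor of this enumerate-then-regularize strategy can be sketched as below. The basis set, data, and the use of scikit-learn's lasso_path are assumptions made for this sketch, not FFX's actual implementation.

```python
# Sketch of an FFX-style basis expansion: all basis functions up to a small
# complexity are generated at once, then a pathwise regularized linear
# learner (here the lasso path) selects sparse subsets of them.
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(3)
x = rng.uniform(0.1, 2, 150)
y = 2 * np.log(x) + x**2 + rng.normal(0, 0.05, 150)

# Exhaustively generated bases for one variable at depth 1 (illustrative).
bases = {"x": x, "x^2": x**2, "sqrt(x)": np.sqrt(x),
         "log(x)": np.log(x), "1/x": 1 / x}
X = np.column_stack(list(bases.values()))

# lasso_path fits no intercept, so center the data first.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
alphas, coefs, _ = lasso_path(Xc, yc, n_alphas=20)
# Each column of `coefs` is a model on the regularization path; sparser
# models (larger alpha) come first, mimicking FFX's complexity filtering.
print(dict(zip(bases, coefs[:, -1].round(3))))
```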

Traditional statistical methods for basis expansion are deterministic and mostly limited in the functions that can be used and in the complexity of the discovered bases (due to the use of exhaustive search). KP, on the other hand, has experts that can be either deterministic or non-deterministic, allowing, for instance, basis functions rejected in one cycle to reappear in a later cycle and turn out to be significant.

A similar approach was taken by Icke and Bongard [21], who proposed a hybrid in which the features resulting from the many models of an FFX run are passed on to GP for a further model-building step. The authors hypothesized that such an approach would increase the chances of GP succeeding, by letting FFX extract informative features while GP builds more complex models. The paper reports that the hybrid algorithm provided an advantage over GP only for the larger datasets, with 10 and 25 variables. As one may notice, in this case GP is the method supposed to solve the problem; this is exactly the opposite of our approach.

In this paper, we present a deeper investigation of the technique presented in [9]. The main contributions here are more details on the framework, a description of how we used GP's components in KP, a comprehensive experimental section, a comparison with related work from the literature, and a longer discussion of the pros and cons of KP. Nevertheless, it is important to be clear that the focus of this paper is on SR benchmark functions; the performance on real-world datasets has been investigated elsewhere [11], [14].

This paper is organized as follows: Section 2 discusses the most closely related work, Section 3 introduces the proposed hybrid method, Section 4 reports on experiments using a number of selected benchmark functions from the literature, and Section 5 concludes. A further discussion of KP's properties, including identified weaknesses, is presented in Appendix C. Supplementary material containing more detailed descriptive statistics of the experiments reported here is available.

Section snippets

Related work

Although there are many SR techniques employing some kind of constant optimization, here we present only the most closely related ones: EC techniques that build linear models through the combination of individuals or partial solutions. Other related approaches were described in the Introduction.

Keijzer [22] investigated the use of linear regression on the outputs of arbitrary symbolic expressions, with application to symbolic regression. He showed that the use of a scaled error measure, in which the output of an expression is linearly adjusted to the targets (slope and intercept fitted by least squares) before the error is computed, considerably improves GP performance on SR problems.
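Under that reading, the scaled error can be sketched as below; this is a minimal sketch of linear scaling as commonly described for [22], with synthetic data.

```python
# Keijzer-style linear scaling, sketched: before scoring an expression's
# output y_hat against targets t, fit the optimal slope b and intercept a
# by least squares and measure the error of a + b * y_hat instead.
import numpy as np

def scaled_mse(t, y_hat):
    """MSE after optimal linear scaling of y_hat onto t."""
    b = np.cov(t, y_hat, bias=True)[0, 1] / np.var(y_hat)  # assumes var > 0
    a = t.mean() - b * y_hat.mean()
    return np.mean((t - (a + b * y_hat)) ** 2)

t = np.array([1.0, 2.0, 3.0, 4.0])
y_hat = np.array([10.0, 20.0, 30.0, 40.0])    # right shape, wrong scale
print(scaled_mse(t, y_hat))                   # ~0: scaling repairs the scale
```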

The hybrid method

Kaizen Programming (KP) was proposed in [9] as a hybrid technique based on the concepts of the Kaizen methodology. KP is an abstraction of important parts of both the Kaizen methodology and the Plan-Do-Check-Act (PDCA) cycle [13], intended to serve as a Computational Intelligence tool. Therefore, it is not a simulation of a Kaizen event.

For easier understanding of KP, Table 1 relates important terms used here (taken from the Kaizen methodology) to concepts from EC and ML. From now

Experimental analysis

This section presents a comparison of KP with related methods on three sets of well-known symbolic regression benchmark functions. These functions present different levels of difficulty and are widely employed in the literature to evaluate the performance of symbolic regression techniques. An investigation on real-world datasets is out of the scope of this paper.

Another focus of this paper is the comparison with results presented in the literature, rather than re-running the algorithms. This

Summary and conclusions

Kaizen Programming (KP) is a hybrid algorithm that uses a collaborative problem-solving approach in which partial solutions have to be joined to produce a complete solution.

KP was evaluated with three well-known symbolic regression benchmark sets (Nguyen, Keijzer, and Korns) in three experiments with distinct configurations. KP results were compared with those from GP with semantically-based crossover, Abstract Expression Grammar, and Prioritized Grammar Enumeration. A sensitivity analysis was

Acknowledgments

This paper was supported by the Brazilian Government CNPq (Universal) grant (486950/2013-1) and CAPES (Science without Borders) grant (12180-13-0) to V.V.M., and Canada’s NSERC Discovery grant RGPIN 283304-2012 to W.B.

References (42)

  • M. Aichour et al., Cooperative co-evolution inspired operators for classical GP schemes, in: Nature Inspired Cooperative Strategies for Optimization (NICSO 2007), 2008.
  • I. Arnaldo et al., Multiple regression genetic programming, in: Proceedings of the 16th Conference on Genetic and Evolutionary Computation, 2014.
  • T. Bäck, Selective pressure in evolutionary algorithms: a characterization of selection mechanisms, in: Proceedings of the First IEEE Conference on Evolutionary Computation, IEEE World Congress on Computational Intelligence, 1994.
  • T. Bäck et al., Evolutionary Computation 1: Basic Algorithms and Operators, 2000.
  • W. Banzhaf et al., Genetic Programming: An Introduction: On the Automatic Evolution of Computer Programs and Its Applications, 1998.
  • O. Barrière, E. Lutton, P.-H. Wuillemin, C. Baudrit, M. Sicard, N. Perrot, EVOLVE - A Bridge between Probability, Set...
  • S. Boyd et al., Convex Optimization, 2004.
  • M. Brameier et al., Evolving teams of predictors with linear genetic programming, Genet. Program. Evolvable Mach., 2001.
  • V.V. De Melo, Kaizen programming, in: Proceedings of the 16th Conference on Genetic and Evolutionary Computation, 2014.
  • C. Ferreira, Gene expression programming in problem solving, in: Soft Computing and Industry, 2002.
  • F.-A. Fortin et al., DEAP: evolutionary algorithms made easy, J. Mach. Learn. Res., 2012.
  • J.H. Friedman, Multivariate adaptive regression splines, Ann. Stat., 1991.
  • H. Gitlow et al., Tools and Methods for the Improvement of Quality, 1989.
  • M. Hall et al., The WEKA data mining software: an update, SIGKDD Explor. Newsl., 2009.
  • T. Hastie et al., The Elements of Statistical Learning: Data Mining, Inference and Prediction, Math. Intell., 2005.
  • I. Icke et al., Improving genetic programming based symbolic regression using deterministic machine learning, in: 2013 IEEE Congress on Evolutionary Computation (CEC), 2013.
  • M. Keijzer, Scaled symbolic regression, Genet. Program. Evolvable Mach., 2004.
  • M.F. Korns, Abstract expression grammar symbolic regression, in: Genetic Programming Theory and Practice VIII, 2011.
  • M. Kotanchek et al., Trustable symbolic regression models: using ensembles, interval arithmetic and Pareto fronts to develop robust and trust-aware models, in: Genetic Programming Theory and Practice V, 2008.
  • J.R. Koza, Genetic Programming: On the Programming of Computers by Means of Natural Selection, 1993.
  • K. Krawiec et al., Behavioral programming: a broader and more detailed take on semantic GP, in: Proceedings of the 16th Conference on Genetic and Evolutionary Computation, 2014.