A greedy search tree heuristic for symbolic regression

https://doi.org/10.1016/j.ins.2018.02.040

Abstract

Symbolic Regression tries to find a mathematical expression that describes the relationship of a set of explanatory variables to a measured variable. The main objective is to find a model that minimizes the error and, optionally, that also minimizes the expression size. A smaller expression can be seen as an interpretable model, which is considered a reliable decision model. This is often performed with Genetic Programming, which represents its solutions as expression trees. The shortcoming of this algorithm lies in this representation, which defines a rugged search space and contains expressions of any size and difficulty. This makes it challenging to find the optimal solution under computational constraints. This paper introduces a new data structure, called Interaction-Transformation (IT), that constrains the search space in order to exclude a region of larger and more complicated expressions. In order to test this data structure, a heuristic called SymTree is also introduced. The obtained results show evidence that SymTree is capable of obtaining the optimal solution whenever the target function is within the search space of the IT data structure, and competitive results when it is not. Overall, the algorithm found a good compromise between accuracy and simplicity for all the generated models.

Introduction

Many decision-making processes can be automated by learning a computational model from a set of observed data. For example, credit risk can be estimated by using explanatory variables related to consumer behavior [9]. A recommender system can estimate the likelihood of a given person consuming an item given their past transactions [1].

There are many techniques devised to generate such models, from the simple Linear Regression [26] to more advanced universal approximators like Neural Networks [13]. The former has the advantage of being simple and easily interpretable, but the relationship must be close to linear for the approximation to be acceptable. The latter can numerically approximate any function, under the constraint that the final form of the function is pre-determined, usually as a weighted sum of a nonlinear function applied to a linear combination of the original variables. This constraint makes the regression model hard to understand, since the implications of changing the value of an input variable are not easily traced to the target variable. Because of that, these models are often called Black Box Models.

The concerns with using black box models for decision-making processes are the inability to predict what these models will do in critical scenarios and whether their responses are biased by the data used to adjust the parameters of the model.

For example, there is recent concern about how driverless cars will deal with variants of the famous Trolley problem [5], [29]. Driverless cars use classification and regression models to decide the next action to perform, such as speed up, slow down, brake, turn left or right by some degrees, etc. If the model used by these cars is difficult to understand, the manufacturer cannot be sure which actions the car will take in extreme situations. Faced with the decision of killing pedestrians or killing the driver, what choice will it make? Even though these situations may be rare, it is important to understand whether the model covers all possible alternatives to prevent loss of life.

Another recent example concerns the regression models used to choose which online ad to show to a given user. It was found in [10] that the model presented a bias towards the gender of the user. Whenever the user was identified as male, the model chose ads for higher-paying jobs than when the user was identified as female. In this case, the bias was introduced by the data used as a reference to adjust the parameters of the model. Historically, the income distribution of men is skewed towards higher salaries than that of women [28].

An interpretable model could provide better insight into such concerns, since everything will be explicitly described in the mathematical expression of the model. In the example of the driverless car, an inspection of the use of variables corresponding to the location of bystanders around the car could reveal the probable actions taken by the model. Similarly, inspecting the mathematical expression used to choose the online ads could reveal a negative correlation for the combination of salary and the female gender.

As such, an interpretable model should have high accuracy regarding the target variable and, at the same time, be as simple as possible to allow the interpretation of the decision-making process.

Currently this type of model is being studied through Symbolic Regression [4], a field of study that aims to find a symbolic expression that fits an exemplary data set accurately. Often, a secondary objective is that such an expression be as simple as possible. This is often solved by means of Genetic Programming [19], a metaheuristic from the Evolutionary Algorithms [3] field that evolves an expression tree by minimizing the model error and maximizing the simplicity of such a tree. Currently, the main challenges in such an approach are that the search space induced by the tree representation does not always allow a smooth transition from the current solution towards an incremental improvement and that, since the search space is unrestricted, it allows the representation of black box models as well.

The main objective of this paper is to introduce a new data structure, named Interaction-Transformation (IT), for representing mathematical expressions that constrains the search space by removing the region comprising uninterpretable expressions. Additionally, a greedy divisive search heuristic called SymTree is proposed to verify the suitability of such a data structure for generating smaller Symbolic Regression models.

The data structure simply describes a mathematical expression as the summation of polynomial functions and transformation functions applied to the original set of variables. This data structure restricts the search space of mathematical expressions and, as such, is not capable of representing every possible expression.
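To make the idea of such a representation concrete, the following is a minimal sketch, not the paper's implementation. It assumes that each term is a transformation function applied to a product of powers of the input variables, and that the coefficients of a fixed list of terms are fitted by ordinary least squares. The names `TRANSFORMS`, `it_features`, `fit_it`, and `predict_it` are hypothetical, and no search over terms is performed here.

```python
import numpy as np

# Hypothetical encoding of one term: a tuple of integer exponents (the
# "interaction") plus the name of a one-argument function (the "transformation").
TRANSFORMS = {"id": lambda z: z, "sin": np.sin, "cos": np.cos}


def it_features(X, terms):
    """Evaluate t(prod_j x_j ** k_j) for every term on every row of X."""
    columns = []
    for exponents, name in terms:
        interaction = np.prod(X ** np.asarray(exponents), axis=1)
        columns.append(TRANSFORMS[name](interaction))
    return np.column_stack(columns)


def fit_it(X, y, terms):
    """Least-squares coefficients for a fixed list of terms (assumed fitting step)."""
    Phi = it_features(X, terms)
    weights, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    return weights


def predict_it(X, terms, weights):
    """Weighted sum of the evaluated terms."""
    return it_features(X, terms) @ weights


# Toy target chosen to be representable: y = 2*x0*x1 + 3*sin(x0**2)
rng = np.random.default_rng(0)
X = rng.uniform(0.5, 2.0, size=(50, 2))
y = 2.0 * X[:, 0] * X[:, 1] + 3.0 * np.sin(X[:, 0] ** 2)
terms = [((1, 1), "id"), ((2, 0), "sin")]
weights = fit_it(X, y, terms)
```

In this sketch an expression is fully described by a short list of (exponents, transformation) pairs. An expression that cannot be written as such a weighted sum, for instance a nested transformation like sin(cos(x0)), has no encoding here, which illustrates the sense in which the search space is restricted.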

As such, there are two hypotheses being tested in this paper:

  • H1. The IT data structure constrains the search space such that it is only possible to generate smaller expressions.

  • H2. Even though the search space is constrained, this data structure is capable of finding function approximations with competitive accuracy when compared to black box models.

In order to test these hypotheses, the SymTree algorithm will be applied to standard benchmark functions commonly used in the literature. These are low-dimensional functions that still pose a challenge to many Symbolic Regression algorithms. The generated models will be evaluated by means of Mean Squared Error and the number of nodes in the tree representation of the generated expression. Finally, these results will be compared to three standard regression approaches (linear and nonlinear), three recent variations of Genetic Programming applied to this problem, and two other Symbolic Regression algorithms from the literature.
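The two evaluation measures named above can be sketched as follows. This is an illustrative example only: the `Node` class, the small operator set, and the convention that every operator, variable, and constant counts as one node are assumptions made here, not details taken from the paper.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    """One node of an expression tree: an operator, a variable, or a constant."""
    label: str                      # e.g. "+", "*", "sin", "x0", or "3.0"
    children: list = field(default_factory=list)


def count_nodes(node):
    """Size of the expression: every operator, variable, and constant counts once."""
    return 1 + sum(count_nodes(child) for child in node.children)


def evaluate(node, x):
    """Evaluate the tree on one data point x (a sequence of variable values)."""
    args = [evaluate(child, x) for child in node.children]
    if node.label == "+":
        return args[0] + args[1]
    if node.label == "*":
        return args[0] * args[1]
    if node.label == "sin":
        return math.sin(args[0])
    if node.label.startswith("x"):
        return x[int(node.label[1:])]
    return float(node.label)        # anything else is treated as a constant


def mean_squared_error(node, X, y):
    """Average squared difference between the tree's output and the targets."""
    return sum((evaluate(node, xi) - yi) ** 2 for xi, yi in zip(X, y)) / len(y)


# Tree for the expression x0 * x1 + sin(x0)
tree = Node("+", [Node("*", [Node("x0"), Node("x1")]),
                  Node("sin", [Node("x0")])])
X = [(0.0, 1.0), (1.0, 2.0), (2.0, 0.5)]
y = [x0 * x1 + math.sin(x0) for x0, x1 in X]

size = count_nodes(tree)
error = mean_squared_error(tree, X, y)
```

Here `size` counts the six nodes of the example tree, and `error` is zero because the targets were generated from the same expression. Reporting both numbers side by side is what allows accuracy and simplicity to be compared across models.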

The experimental results will show that the proposed algorithm coupled with this data structure is indeed capable of finding the original form of the target functions whenever the particular function is representable by the structure. Also, when the function is not representable by the IT data structure, the algorithm can still manage to find an approximation that balances simplicity and accuracy. Regarding numerical results, the algorithm performed better than the tested Symbolic Regression algorithms in most benchmarks and was competitive when compared against an advanced black box model extensively used in the literature.

The remainder of this paper is organized as follows. Section 2 gives a brief explanation of Symbolic Regression and its classical solution through Genetic Programming. In Section 3 some recent work on this application is reported along with its contributions. Section 4 describes the proposed algorithm in detail, highlighting its advantages and limitations. Section 5 explains the experiments performed to assess the performance of the proposed algorithm and compares its results with the algorithms described in Section 3. Finally, Section 6 summarizes the contributions of this work and discusses some of the possibilities for future research.

Section snippets

Symbolic regression

Consider the problem where we have collected $n$ data points $X = \{x_1, \ldots, x_n\}$, called explanatory variables, and a set of $n$ corresponding target variables $Y = \{y_1, \ldots, y_n\}$. Each data point is described as a vector with $d$ measurable variables, $x_i \in \mathbb{R}^d$. The goal is to find a function $\hat{f}(x) : X \to Y$, also called a model, that approximates the relationship of a given $x_i$ with its corresponding $y_i$.

Sometimes, this can be accomplished by a linear regression where the model is described by a linear function assuming that

Literature review

Recently, many extensions to the canonical GP, or even new algorithms, were proposed in order to cope with the shortcomings pointed out in the previous section. This section will highlight some of the recent publications that reported improvements over previous works.

In [30] the authors propose neat-GP, a GP-based algorithm with mechanisms borrowed from the NeuroEvolution of Augmenting Topologies (NEAT) algorithm [27] and the Flat Operator Equalization bloat control method for

Constrained representation for Symbolic Regression

The overall idea introduced in this section is that if a given representation for Symbolic Regression does not comprise bloated mathematical expressions, it will allow the algorithms to focus the search only on the subset of expressions that are interpretable.

For this purpose, a new data structure used to represent the search space of mathematical expressions will be introduced, followed by a heuristic algorithm that makes use of such a representation to find approximations to nonlinear

Experimental results

The goal of the IT data structure and the SymTree algorithm proposed in this paper is to achieve a concise and descriptive approximation function that minimizes the error to the measured data. As such, we should minimize not only an error metric (e.g., absolute or squared error) but also the size of the final expression.

So, in order to test both SymTree and the representation, we have performed experiments with a total of 17 different benchmark functions commonly used in the

Conclusion

In this paper a new data structure for mathematical expressions, named Interaction-Transformation, was proposed with the goal of constraining the search space to only simple and interpretable expressions, represented as a linear combination of compositions of non-linear functions with polynomial functions. Also, in order to test this data structure, a heuristic approach called SymTree was introduced to address the Symbolic Regression problem. The heuristic can be classified as a greedy search

References (31)

  • J.-F. Bonnefon, A. Shariff, I. Rahwan, Autonomous vehicles need experimental ethics: Are we ready for utilitarian...
  • T. Chen, C. Guestrin, Xgboost: a scalable tree boosting system, arXiv preprint arXiv:1603.02754...
  • T. Chen et al., Higgs boson discovery with boosted trees
  • T. Chen, T. He, xgboost: extreme gradient boosting, R package version...
  • A. Datta et al., Automated experiments on ad privacy settings, Proc. Priv. Enhanc. Technol. (2015)