Elsevier

Information Sciences

Volume 547, 8 February 2021, Pages 553-567

Enhancing gene expression programming based on space partition and jump for symbolic regression

https://doi.org/10.1016/j.ins.2020.08.061

Highlights

  • Divide the space of mathematical expression into many subspaces.

  • Let the population jump among these subspaces for keeping its diversity.

  • Combine the multi-armed bandit and the ε-greedy strategy to choose a subspace.

  • Analyze the time and space complexity of the algorithm.

  • Evaluate the reasonable range of the number of subspaces.

Abstract

When solving a symbolic regression (SR) problem, the gene expression programming (GEP) algorithm can converge prematurely, which terminates the optimization process too early and may leave it at a poor local optimum. To address the premature convergence problem of GEP, we propose a novel algorithm named SPJ-GEP, which maintains the diversity of the GEP population and improves the accuracy of the GEP search by allowing the population to jump efficiently between segmented subspaces. SPJ-GEP first divides the space of mathematical expressions into k mutually exclusive subspaces. It then uses a subspace selection method that combines the multi-armed bandit and the ε-greedy strategy to choose a subspace to jump to. We analyze the population diversity and the range of the number of subspaces. The analysis shows that SPJ-GEP does not significantly increase the time and space complexity compared with classical GEP methods. In addition, we evaluate SPJ-GEP on a set of standard SR benchmarks. The evaluation results show that SPJ-GEP keeps a higher population diversity and achieves better accuracy than three baseline GEP methods.

Introduction

Symbolic regression (SR) is a regression analysis that discovers a model that best fits a given dataset in the space of mathematical expressions. Unlike machine learning or neural network regression analysis that focuses on optimizing parameters in a predefined model, SR aims to find appropriate models and their parameters at the same time. Genetic programming (GP) [1] is a commonly used approach in SR to search for the optimal model. GP evolves to change individual structures of the population to generate fitted models or computer programs by the three key genetic algorithm (GA) operations: selection, crossover, and mutation. To represent a mathematical expression, classical GPs usually describe individual encodings in trees [1], [2], [3], [4], [5]. Graph-based GPs, such as graph encoding GP [6], [7] and Cartesian genetic programming [8], [9], encode individuals into graphs. Linear GPs, such as gene expression programming (GEP) [10], [11], [12] and linear GP [13], convert individuals into linear strings.

Since these GPs all rely on GA operations, they are, like genetic algorithms (GA), prone to premature convergence [14]. From the perspective of exploration and exploitation [15], premature convergence occurs because the individuals of a population are similar, so they tend to exploit their own neighborhood instead of exploring new regions. Therefore, maintaining the population diversity is a crucial task in evolutionary algorithms. A diverse population can encourage global exploration and reduce premature convergence [16], [17].

To preserve the population diversity, the evolutionary computing (EC) community often uses two strategies: 1) parameter control and 2) space partition. The parameter control strategy [18] adjusts the parameters of evolutionary algorithms based on population diversity, for example by varying the population size [19], [20] and by dynamically adjusting the probability of crossover [21], [22] and mutation [23], [24]. The strategy is easy to implement and does not require additional storage space. However, it does not know or remember where individuals are in the search space, so it can produce invalid individuals, such as individuals similar to those of previous generations.

The space partition strategy [25], [26], [27], [28], [29], [30] splits a search space into many subspaces and generates individuals in different subspaces. As individuals in different subspaces have different phenotypes or genotypes, the strategy makes it easy to control population diversity quantitatively by generating individuals from different subspaces. Meanwhile, the strategy remembers an individual’s approximate position in the search space according to the individual’s subspace. Although the space partition strategy has been successfully applied in GA, it is not suitable for the SR problem, because the whole search space of SR is so large that maintaining fine-grained subspaces is computationally intractable.

In this paper, we propose a new gene expression programming algorithm based on space partition and jump (named SPJ-GEP) to maintain the population diversity. SPJ-GEP has the advantages of the above two strategies: it requires little additional storage space, remembers the position of an individual in the search space, and maintains quantitative population diversity. SPJ-GEP partitions the space of mathematical expressions into k subspaces based on the chromosome coding. It then initializes individuals in one of the k subspaces, as shown in Step 1 in Fig. 1.
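One way to picture a coding-based partition is to map each linear chromosome to a subspace index computed from its leading symbols. The sketch below is only an illustration: the symbol alphabet `SYMBOLS`, the helper `subspace_of`, and the choice of a symbol prefix as the partition key are all assumptions made here, not the partition rule defined for SPJ-GEP.

```python
# Hypothetical symbol alphabet; the paper's function and terminal sets differ.
SYMBOLS = ["+", "-", "*", "/", "x", "c"]


def subspace_of(chromosome, prefix_len=1):
    """Map a linear chromosome to a subspace index from its first symbols.

    Assumption: subspaces are keyed by the leading `prefix_len` symbols, which
    gives k = len(SYMBOLS) ** prefix_len mutually exclusive subspaces.
    """
    index = 0
    for symbol in chromosome[:prefix_len]:
        index = index * len(SYMBOLS) + SYMBOLS.index(symbol)
    return index


# Group a toy population by subspace index.
population = [list("+x*xc"), list("*xxcc"), list("+cxxx")]
groups = {}
for individual in population:
    groups.setdefault(subspace_of(individual), []).append("".join(individual))
```

Because every chromosome has exactly one prefix, each individual falls into exactly one subspace, and storing the index costs one integer per individual.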

Next, SPJ-GEP selects a suitable subspace in which to search for individuals with better fitness, using a subspace selection method that combines the multi-armed bandit (MAB) [31] and the ε-greedy strategy [32], as shown in Steps 2 and 3 in Fig. 1. This method uses MAB to choose one of the subspaces because MAB can balance exploration of other subspaces against exploitation of the selected subspace. However, MAB becomes ineffective once the number of visits to subspaces exceeds a specific value. To preserve population diversity, the method then switches to the ε-greedy strategy to choose a subspace. A proposed time formula decides when to make this switch.
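The selection logic described above can be sketched roughly as follows. This is a minimal illustration, not the paper's method: it uses the textbook UCB1 score as the bandit rule, assumes rewards are to be maximized, and replaces the paper's time formula with a hypothetical `switch_after` threshold on the total visit count. The names `ucb_scores` and `select_subspace` are introduced here for illustration.

```python
import math
import random


def ucb_scores(visits, rewards, c=math.sqrt(2)):
    """UCB1 score per subspace: reward estimate plus an exploration bonus."""
    total = sum(visits)
    scores = []
    for n, r in zip(visits, rewards):
        if n == 0:
            scores.append(math.inf)  # unvisited subspaces are tried first
        else:
            scores.append(r + c * math.sqrt(math.log(total) / n))
    return scores


def select_subspace(visits, rewards, switch_after, epsilon=0.1, rng=random):
    """Pick a subspace index by UCB, then by epsilon-greedy.

    `switch_after` is a hypothetical stand-in for the paper's time formula:
    once the total visit count exceeds it, selection becomes epsilon-greedy.
    """
    k = len(visits)
    if sum(visits) <= switch_after:
        scores = ucb_scores(visits, rewards)
        return max(range(k), key=scores.__getitem__)
    if rng.random() < epsilon:
        return rng.randrange(k)  # explore: uniformly random subspace
    return max(range(k), key=rewards.__getitem__)  # exploit: best reward so far
```

For example, `select_subspace([0, 3, 2], [0.0, 0.5, 0.9], switch_after=100)` returns the unvisited subspace 0, while after the switch with `epsilon=0.0` the call always returns the subspace with the highest recorded reward.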

Finally, SPJ-GEP uses a new crossover method to make individuals jump from the original subspace to the selected subspace, as shown in Step 4 in Fig. 1. The method crosses these newly selected individuals with the best individual in the selected subspace so that they start searching from the most recent local optimum.
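To make the idea of a jump by crossover concrete, the sketch below applies a plain one-point crossover between an individual and the best individual of the target subspace. It assumes, hypothetically, that a subspace is identified by the leading symbols of the chromosome, so keeping the leading segment of the target's best individual moves the child into that subspace. The function `jump_crossover` is an illustrative name, and the operator defined in the paper may differ.

```python
import random


def jump_crossover(individual, best_in_target, rng=random):
    """One-point crossover that moves `individual` toward a target subspace.

    The child keeps the leading segment of the target subspace's best
    individual and the trailing segment of `individual`.
    Assumption: both chromosomes are equal-length symbol lists of length >= 2,
    as in GEP's fixed-length linear encoding.
    """
    if len(individual) != len(best_in_target):
        raise ValueError("chromosomes must have equal length")
    point = rng.randrange(1, len(individual))  # cut point in 1..len-1
    return best_in_target[:point] + individual[point:]


# Toy example with a seeded generator for repeatability.
parent = list("+xxcc")
best = list("*cxxx")
child = jump_crossover(parent, best, random.Random(1))
```

Because the cut point is at least 1, the child always inherits the first symbol of the target's best individual, and each remaining position comes from one of the two parents at the same position.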

The characteristics of SPJ-GEP indicate that classical GEPs [10], [11], [12] are a special case of SPJ-GEP when the number of subspaces k equals 1. On the other hand, if k is so large that each subspace has only one individual, SPJ-GEP degenerates into a random subspace selection algorithm. Therefore, k is a critical parameter in SPJ-GEP. In this paper, the range of k is decided by the population diversity and the probability of jumping between subspaces. We analyze the time and space complexity of SPJ-GEP and prove that SPJ-GEP does not significantly increase the time and space complexity compared with classical GEPs.

The main contributions in the paper are summarized as follows:

  • We propose the SPJ-GEP algorithm, which allows individuals in a population to jump between subspaces according to the MAB and the ε-greedy strategy. This approach maintains the population diversity.

  • We provide a solid analysis approach to evaluate the range of the number of subspaces k, and we analyze the algorithm’s time and space complexity.

  • Our evaluation results show that SPJ-GEP surpasses the three baseline GEP methods: GEP [10], GEP-ADF [11], and self-learning GEP [12].

The rest of this paper is organized as follows. Section 2 describes related background techniques. Sections 3 and 4 provide details of the proposed SPJ-GEP algorithm and its analysis, respectively. Section 5 shows the experimental results and analysis. Section 6 discusses related work. Section 7 concludes the paper and points out possible future work.

Section snippets

Gene expression programming for symbolic regression

For a given dataset (X, Y), the goal of symbolic regression is to discover a function f(X) = Ŷ that minimizes the error between Y and Ŷ, drawn from the space of mathematical expressions that consists of function symbols (e.g., +, -, ×, /, sin, cos, …) and terminal symbols (e.g., variables and coefficients). To find the best-fitting function, GP encodes an individual f(X) as a tree (as shown in Fig. 2) and applies crossover and mutation to the individuals to optimize f.
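As a small illustration of scoring a candidate expression against a dataset, the sketch below represents an expression tree as nested tuples and measures the error with the root-mean-square error (RMSE). The tuple representation, the `FUNCTIONS` table with protected division, and the use of RMSE are assumptions chosen for the example; the paper's encoding and fitness measure may differ.

```python
import math

# Hypothetical function set; protected division avoids division by zero.
FUNCTIONS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "*": lambda a, b: a * b,
    "/": lambda a, b: a / b if b != 0 else 1.0,
    "sin": math.sin,
    "cos": math.cos,
}


def evaluate(tree, x):
    """Evaluate an expression tree at a scalar input x.

    A tree is the variable name "x", a numeric constant, or a tuple
    (function_symbol, child, ...).
    """
    if tree == "x":
        return x
    if isinstance(tree, (int, float)):
        return float(tree)
    symbol, *children = tree
    return FUNCTIONS[symbol](*(evaluate(child, x) for child in children))


def rmse(tree, xs, ys):
    """Root-mean-square error between targets ys and predictions f(xs)."""
    squared = [(evaluate(tree, x) - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


# Candidate f(x) = x * x + x, scored against samples of the same function.
candidate = ("+", ("*", "x", "x"), "x")
xs = [0.0, 1.0, 2.0, 3.0]
ys = [x * x + x for x in xs]
error = rmse(candidate, xs, ys)
```

Here the candidate matches the data-generating function, so its error is zero. A search algorithm would compare such errors across many candidate trees.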

Gene expression programming based on space partition and jump

The gene expression programming based on space partition and jump (SPJ-GEP) has four components: space partition, subspace selection, crossover, and mutation, as shown in Algorithm 1. Its frequently used notations are listed in Table 1.

SPJ-GEP first executes the space partition component (line 1) to split the mathematical expression space Ω into k subspaces ω_1, …, ω_k. Then, according to the UCB or ε-greedy method (lines 6–9), it runs the subspace selection strategy to choose a subspace ω_i for exploring

Time and space complexity

Compared with classical GEPs [33], [37], [38], SPJ-GEP requires additional structures to record the number of visits (n_{ω_i}) and the best fitness (f_{ω_i}), as well as extra computation to obtain UCB_{ω_i} in each subspace. The additional time and space complexity are related to the number (k) of all subspaces. Suppose the time and space complexities for classical GEPs within g iterations are O(gep) and Θ(gep), respectively. For SPJ-GEP, they are O(gep + g×c×k) = O(gep + n×k) and Θ(gep + m×k), where n = c×k, k > 1,

Dataset and experimental parameters

In this paper, the dataset consists of 20 SR test problems that are derived from the GP benchmarks [40], as shown in Table 2. The functions and constants of the dataset are shown in Table 3. To evaluate the proposed algorithm SPJ-GEP, we have created three algorithms, SPJ-GEP, SPJ-GEP-ADF, and SPJ-SL-GEP, based on the three baseline GEPs: GEP [10], GEP-ADF [11], and SL-GEP [12], respectively. The three new algorithms have the same parameters as these GEPs except for the additional

Related work

Similar to our idea of space partition that maintains the population diversity in this study, a method named NrGA was proposed to use a binary space partitioning (BSP) tree. NrGA recursively subdivides space into two and stores individual visiting information, and the method is integrated with GA so that individual revisits are completely eliminated [25], [26], [27]. Although NrGA can maintain the population diversity by visiting the BSP tree, it needs a lot of additional time and space to

Conclusion and future work

In this paper, we propose a novel algorithm, SPJ-GEP, to deal with the SR problem. By partitioning the space of mathematical expressions into subspaces, SPJ-GEP guides the population to jump effectively among these subspaces with a subspace selection method. SPJ-GEP maintains the population diversity while keeping the balance between subspace exploration and exploitation. Therefore, the proposed SPJ-GEP has the following advantages. SPJ-GEP can be easily embedded in other

CRediT authorship contribution statement

Qiang Lu: Conceptualization, Methodology, Validation, Writing - original draft, Writing - review & editing. Shuo Zhou: Formal analysis, Data curation, Software, Visualization. Fan Tao: Visualization, Validation. Jake Luo: Writing - review & editing. Zhiguang Wang: Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

This work was supported by China National Key Research Project (No. 2019YFC0312003), National Natural Science Foundation of China (No. 61402532) and the Science Foundation of China University of Petroleum-Beijing (No. 01JB0415).

References (50)

  • J.F. Miller et al., Cartesian Genetic Programming
  • J.F. Miller, Cartesian genetic programming: its status and future, Genet. Program Evolvable Mach. (2019)
  • C. Ferreira, Gene expression programming: a new adaptive algorithm for solving problems, Complex Syst. (2001)
  • C. Ferreira, Automatically defined functions in gene expression programming
  • J. Zhong et al., Self-learning gene expression programming, IEEE Trans. Evol. Comput. (2016)
  • M.F. Brameier et al., Linear genetic programming (2007)
  • Yee Leung et al., Degree of population diversity: a perspective on premature convergence in genetic algorithms and its Markov chain analysis, IEEE Trans. Neural Networks (1997)
  • M. Črepinšek, S.-H. Liu, M. Mernik, Exploration and Exploitation in Evolutionary Algorithms: A Survey, ACM Comput....
  • E. Burke et al., Diversity in genetic programming: an analysis of measures and correlation with fitness, IEEE Trans. Evol. Comput. (2004)
  • D. Sudholt, The Benefits of Population Diversity in Evolutionary Algorithms: A Survey of Rigorous Runtime Analyses....
  • G. Karafotias et al., Parameter control in evolutionary algorithms: trends and challenges, IEEE Trans. Evol. Comput. (2015)
  • R.E. Smith et al., Adaptively resizing populations: algorithm, analysis, and first results, Complex Syst. (1995)
  • G.R. Harik, F.G. Lobo, A Parameter-less Genetic Algorithm, in: Proceedings of the 1st Annual Conference on Genetic and...
  • L.J. Eshelman et al.
  • Sibylle D. Müller, Nicol N. Schraudolph, Petros D. Koumoutsakos, Step size adaptation in evolution strategies using...