Enhancing gene expression programming based on space partition and jump for symbolic regression
Introduction
Symbolic regression (SR) is a form of regression analysis that searches the space of mathematical expressions for the model that best fits a given dataset. Unlike machine learning or neural network regression, which optimizes the parameters of a predefined model, SR aims to find an appropriate model and its parameters at the same time. Genetic programming (GP) [1] is a commonly used approach in SR to search for the optimal model. GP evolves the structures of individuals in a population to generate fitted models or computer programs through the three key genetic algorithm (GA) operations: selection, crossover, and mutation. To represent a mathematical expression, classical GPs usually encode individuals as trees [1], [2], [3], [4], [5]. Graph-based GPs, such as graph encoding GP [6], [7] and Cartesian genetic programming [8], [9], encode individuals as graphs. Linear GPs, such as gene expression programming (GEP) [10], [11], [12] and linear GP [13], encode individuals as linear strings.
Because these GPs all rely on GA operations, they are, like GAs themselves, prone to premature convergence [14]. From the perspective of exploration and exploitation [15], premature convergence occurs because the individuals of a population become similar and therefore tend to exploit their own neighborhood instead of exploring new regions. Maintaining population diversity is thus a crucial task in evolutionary algorithms. A diverse population encourages global exploration and reduces premature convergence [16], [17].
To preserve population diversity, the evolutionary computing (EC) community often uses two strategies: 1) parameter control and 2) space partition. The parameter control strategy [18] adjusts the parameters of an evolutionary algorithm based on population diversity, for example by varying the population size [19], [20] or by dynamically adjusting the probabilities of crossover [21], [22] and mutation [23], [24]. This strategy is easy to implement and does not require additional storage space. However, it neither knows nor remembers where individuals are in the search space, so it can produce invalid individuals, such as individuals similar to those of previous generations.
The space partition strategy [25], [26], [27], [28], [29], [30] splits a search space into many subspaces and generates individuals in different subspaces. Because individuals in different subspaces have different phenotypes or genotypes, the strategy makes it easy to control population diversity quantitatively by generating individuals from different subspaces. The strategy also remembers an individual's approximate position in the search space through the subspace the individual belongs to. Although the space partition strategy has been applied successfully in GA, it is not suitable for the SR problem, because the whole search space of SR is so large that maintaining fine-grained subspaces is computationally intractable.
In this paper, we propose a new gene expression programming algorithm based on space partition and jump (named SPJ-GEP) to maintain population diversity. SPJ-GEP combines the advantages of the two strategies above: it requires little additional storage space, remembers the position of an individual in the search space, and maintains population diversity quantitatively. SPJ-GEP partitions the space of mathematical expressions into k subspaces based on the chromosome coding and initializes individuals in one of the k subspaces, as shown in Step 1 in Fig. 1.
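As a rough illustration of partitioning by chromosome coding, the sketch below assigns a chromosome to a subspace according to its leading symbols. This prefix rule, the symbol set `SYMBOLS`, and the helper names are assumptions made for the example; the text above does not specify how the k subspaces are derived from the coding.

```python
from itertools import product


def build_subspace_index(symbols, prefix_len):
    """Map every possible chromosome prefix to a subspace id (0 .. k-1)."""
    prefixes = product(symbols, repeat=prefix_len)
    return {"".join(p): i for i, p in enumerate(prefixes)}


def subspace_of(chromosome, index, prefix_len):
    """Return the subspace id of a chromosome from its leading symbols."""
    return index[chromosome[:prefix_len]]


# Hypothetical symbol set: four operators and one variable.
SYMBOLS = "+-*/x"
PREFIX_LEN = 2

index = build_subspace_index(SYMBOLS, PREFIX_LEN)
k = len(index)  # k = len(SYMBOLS) ** PREFIX_LEN subspaces

print(k, subspace_of("+*xxx", index, PREFIX_LEN))
```

Under this assumed rule, only the prefix-to-id table is stored, so the extra memory grows with k rather than with the size of the expression space.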
Next, SPJ-GEP selects a suitable subspace in which to search for individuals with better fitness, using a subspace selection method that combines the multi-armed bandit (MAB) [31] and the ε-greedy strategy [32], as shown in Steps 2 and 3 in Fig. 1. The method first uses MAB to choose one of the subspaces, because MAB balances exploration of other subspaces against exploitation of the selected subspace. However, MAB becomes ineffective once the number of subspace visits exceeds a specific value. To preserve population diversity, the method then switches to the ε-greedy strategy to choose a subspace. A proposed time formula decides when to make this switch.
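The two-phase selection can be sketched as follows. This is a minimal illustration rather than the paper's method: it uses the standard UCB1 rule as the bandit, assumes that higher fitness is better and scaled to [0, 1], and replaces the time formula (not reproduced in this text) with a hypothetical fixed threshold `switch_after`.

```python
import math
import random


def ucb1_choice(visits, best_fitness, c=math.sqrt(2)):
    """Standard UCB1: try unvisited subspaces first, then maximize
    recorded fitness plus an exploration bonus."""
    for i, n in enumerate(visits):
        if n == 0:
            return i
    total = sum(visits)
    scores = [
        best_fitness[i] + c * math.sqrt(math.log(total) / visits[i])
        for i in range(len(visits))
    ]
    return max(range(len(visits)), key=scores.__getitem__)


def epsilon_greedy_choice(best_fitness, epsilon, rng):
    """With probability epsilon pick a random subspace, else the best one."""
    if rng.random() < epsilon:
        return rng.randrange(len(best_fitness))
    return max(range(len(best_fitness)), key=best_fitness.__getitem__)


def select_subspace(visits, best_fitness, switch_after, epsilon, rng):
    """Use UCB1 early on, then switch to epsilon-greedy.

    `switch_after` is a placeholder for the paper's time formula.
    """
    if sum(visits) < switch_after:
        return ucb1_choice(visits, best_fitness)
    return epsilon_greedy_choice(best_fitness, epsilon, rng)


rng = random.Random(0)
visits = [10, 1, 1]
best_fitness = [0.5, 0.4, 0.4]
print(select_subspace(visits, best_fitness, switch_after=100, epsilon=0.1, rng=rng))
```

In the usage line, the rarely visited subspaces receive a large exploration bonus, so UCB1 prefers one of them even though subspace 0 has the best recorded fitness.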
Finally, SPJ-GEP uses a new crossover method to make individuals jump from their original subspace to the selected subspace, as shown in Step 4 in Fig. 1. The method crosses these newly selected individuals with the best individual in the selected subspace so that they start searching from the latest local optimum.
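One way to picture such a jump is a one-point crossover in which the offspring inherits the leading symbols of the target subspace's best individual. The sketch below assumes, hypothetically, that a subspace is identified by the first `prefix_len` symbols of a fixed-length linear chromosome; the function name and this rule are illustrative and not taken from the paper.

```python
import random


def jump_crossover(individual, target_best, prefix_len, rng):
    """One-point crossover that moves `individual` into the subspace of
    `target_best` by copying at least the first `prefix_len` symbols."""
    if len(individual) != len(target_best):
        raise ValueError("chromosomes must have equal length")
    if not 0 < prefix_len < len(individual):
        raise ValueError("prefix_len must leave room for a cut point")
    # The cut point lies at or after the prefix, so the child keeps the
    # target's prefix and at least one symbol of the jumping individual.
    cut = rng.randint(prefix_len, len(individual) - 1)
    return target_best[:cut] + individual[cut:]


rng = random.Random(0)
child = jump_crossover("+x*xxxx", "*-xxxyy", prefix_len=2, rng=rng)
print(child)
```

Because both parents have the same length and the cut preserves symbol positions, the child keeps the head and tail layout of a GEP chromosome while retaining part of the jumping individual's genetic material.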
The characteristics of SPJ-GEP indicate that classical GEPs [10], [11], [12] are a special case of SPJ-GEP in which the number of subspaces k equals 1. On the other hand, if k is so large that each subspace contains only one individual, SPJ-GEP degenerates into an algorithm that selects subspaces at random. Therefore, k is a critical parameter in SPJ-GEP. In this paper, the range of k is determined by the population diversity and the probability of jumping between subspaces. We analyze the time and space complexity of SPJ-GEP and prove that SPJ-GEP does not significantly increase either compared with classical GEPs.
The main contributions in the paper are summarized as follows:
- We propose the SPJ-GEP algorithm, which allows individuals in a population to jump between subspaces according to the MAB and the ε-greedy strategy. This approach maintains the population diversity.
- We provide a solid analysis approach to evaluate the range of the number of subspaces k, and we analyze the algorithm's time and space complexity.
- Our evaluation results show that SPJ-GEP surpasses the three baseline GEP methods: GEP [10], GEP-ADF [11], and self-learning GEP [12].
Section snippets
Gene expression programming for symbolic regression
For a given dataset (X, Y), the goal of symbolic regression is to discover a function f that minimizes the error between f(X) and Y. The function is drawn from the space of mathematical expressions built from function symbols (e.g., arithmetic operators) and terminal symbols (e.g., variables and coefficients). To find the best-fitting function, GP encodes an individual as a tree (as shown in Fig. 2) and applies crossover and mutation to modify individuals while optimizing f.
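The objective can be made concrete with a small sketch that scores candidate expressions against a dataset. The root-mean-square error, the toy target, and the candidate set are assumptions chosen for illustration; the text above does not state which error measure is used.

```python
import math


def rmse(f, xs, ys):
    """Root-mean-square error between f(x) and y over the dataset."""
    return math.sqrt(sum((f(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


# Hypothetical dataset sampled from the target x*x + x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [x * x + x for x in xs]

# Hypothetical candidate expressions built from +, * and the variable x.
candidates = {
    "x*x + x": lambda x: x * x + x,
    "2*x": lambda x: 2 * x,
    "x*x": lambda x: x * x,
}

# Symbolic regression searches for the candidate with the smallest error.
best = min(candidates, key=lambda name: rmse(candidates[name], xs, ys))
print(best, rmse(candidates[best], xs, ys))
```

A GP or GEP run replaces the fixed candidate dictionary with a population that crossover and mutation keep changing, but each individual is scored in the same way.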
Gene expression programming based on space partition and jump
The gene expression programming based on space partition and jump (SPJ-GEP) has four components: space partition, subspace selection, crossover, and mutation, as shown in Algorithm 1. Its frequently used notations are listed in Table 1.
SPJ-GEP first executes the space partition component (line 1) to split the mathematical expression space into k subspaces. Then, according to the UCB or ε-greedy method (lines 6–9), it runs the subspace selection strategy to choose a subspace for exploring
Time and space complexity
Compared with classical GEPs [33], [37], [38], SPJ-GEP requires additional structures to record the visiting times () and the best fitness (), as well as extra computation to obtain in each subspace. The additional time and space complexity are related to the number (k) of all subspaces. Suppose the time and space complexities for classical GEPs within g iterations are and , respectively. For SPJ-GEP, they are and , where ,
Dataset and experimental parameters
In this paper, the dataset consists of 20 SR test problems derived from the GP benchmarks [40], as shown in Table 2. The functions and constants of the dataset are shown in Table 3. To evaluate the proposed approach, we have created three algorithms, SPJ-GEP, SPJ-GEP-ADF, and SPJ-SL-GEP, based on the three baseline GEPs: GEP [10], GEP-ADF [11], and SL-GEP [12], respectively. The three new algorithms have the same parameters as these GEPs except for the additional
Related work
Similar to our use of space partition to maintain population diversity, a method named NrGA uses a binary space partitioning (BSP) tree. NrGA recursively subdivides the space into two and stores the visiting information of individuals; integrated with GA, it completely eliminates individual revisits [25], [26], [27]. Although NrGA can maintain the population diversity by visiting the BSP tree, it needs a lot of additional time and space to
Conclusion and future work
In this paper, we propose a novel algorithm, SPJ-GEP, to deal with the SR problem. By partitioning the space of mathematical expressions into subspaces, SPJ-GEP guides the population to jump effectively among these subspaces with a subspace selection method. SPJ-GEP maintains the population diversity while keeping the balance between subspace exploration and exploitation. Therefore, the proposed SPJ-GEP has the following advantages. SPJ-GEP can be easily embedded in other
CRediT authorship contribution statement
Qiang Lu: Conceptualization, Methodology, Validation, Writing - original draft, Writing - review & editing. Shuo Zhou: Formal analysis, Data curation, Software, Visualization. Fan Tao: Visualization, Validation. Jake Luo: Writing - review & editing. Zhiguang Wang: Supervision.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgements
This work was supported by China National Key Research Project (No. 2019YFC0312003), National Natural Science Foundation of China (No. 61402532) and the Science Foundation of China University of Petroleum-Beijing (No. 01JB0415).
References (50)
- et al., Enhancing the search ability of differential evolution through orthogonal crossover, Inf. Sci. (2012)
- et al., Distributed evolutionary algorithms and their models: a survey of the state-of-the-art, Appl. Soft Comput. (2015)
- et al., A space search optimization algorithm with accelerated convergence strategies, Appl. Soft Comput. (2013)
- Genetic programming as a means for programming computers by natural selection, Stat. Comput. (1994)
- et al., Grammar-based genetic programming: a survey, Genet. Program Evolvable Mach. (2010)
- et al., Using genetic programming with prior formula knowledge to solve symbolic regression problem, Comput. Intell. Neurosci. (2016)
- A. Moraglio, K. Krawiec, C.G. Johnson, Geometric Semantic Genetic Programming, in: Parallel Problem Solving from Nature...
- et al., Improving generalization of genetic programming for symbolic regression with angle-driven geometric semantic operators, IEEE Trans. Evol. Comput. (2019)
- et al., Comparison of tree and graph encodings as function of problem complexity
- et al., Distilling free-form natural laws from experimental data, Science (2009)
- Cartesian Genetic Programming
- Cartesian genetic programming: its status and future, Genet. Program Evolvable Mach.
- Gene expression programming: a new adaptive algorithm for solving problems, Complex Syst.
- Automatically defined functions in gene expression programming
- Self-learning gene expression programming, IEEE Trans. Evol. Comput.
- Linear genetic programming
- Degree of population diversity: a perspective on premature convergence in genetic algorithms and its Markov chain analysis, IEEE Trans. Neural Networks
- Diversity in genetic programming: an analysis of measures and correlation with fitness, IEEE Trans. Evol. Comput.
- Parameter control in evolutionary algorithms: trends and challenges, IEEE Trans. Evol. Comput.
- Adaptively resizing populations: algorithm, analysis, and first results, Complex Syst.