Introduction

Genetic programming (GP) is a popular and powerful evolutionary optimization algorithm that solves user-defined tasks by the evolution of computer programs [7, 10, 23]. It has attracted increasing attention from researchers in various research fields and a number of enhanced GP variants have been proposed recently, such as Cartesian genetic programming [14], Semantic Genetic Programming (SGP) [4, 9, 15], Grammatical evolution (GE) [16], Gene Expression Programming (GEP) [8, 22, 23], Linear Genetic Programming (LGP) [3] and Multiple Regression Genetic Programming (MRGP) [1]. So far, GP and its variants have been applied to a number of practical applications, including symbolic regression [24], classification [7], time series prediction problem [23] and program synthesis [11, 20].

In GPs, a solution is constructed by combining a set of primitives consisting of functions and input features. Thus, without exploiting the structure information in advance, to some extent, the searching process is merely recombination with the given building blocks. Most existing GPs [4, 8, 14, 16] search solutions in the entire search space directly without utilizing the relationships among input features to accelerate the search. As a result, they often suffer from low search efficacy on complicated symbolic regression problems.

In practical applications, however, the physical system that generates the observed data usually can be decomposed into a number of separable sub systems. Accordingly, the analytical function to model the system can be decomposed into a number of separable sub functions. By searching the sub functions in smaller sub spaces first and then building the final analytical function in a bottom-up manner, the solution structure can be fully exploited and the search efficacy could be improved significantly. Inspired by this, Luo et al. [12] recently proposed a block building programming (BBP) for symbolic regression. The BBP adopted a separability detection method to judge whether the target function is separable. Then, based on the detection results, the original target function is divided into several blocks and further into factors. The factors are then modeled by an optimization engine such as GP. However, to accomplish the separability detection task, the BBP requires the system being able to generate desired training data that satisfy certain constrains, which limits its application scopes. Udrescu and Tegmark [19] recently also developed a symbolic regression system (named AI Feynman) with a separability detection method. However, the separability detection method in AI Feynman requires a highly accurate neural network to predict the output of given input features, which is hard to construct. Besides, the separability detection method in AI Feynman has not yet been integrated with GP to improve the performance of GP.

To address the aforementioned issues, this paper proposes an efficient Genetic Programming with Separability Detection (SD-GP) for symbolic regression. In the proposed SD-GP, the separability detection method in BBP is extended to detect additive separable characteristic of the target model. Besides, by using a gaussian process surrogate model, the proposed method does not require the system be able to generate new data for separability detection, which is more flexible for practical use. Then, based on the separability detection results, a chromosome representation is proposed to encode solutions. The proposed chromosome representation consists of multiple sub chromosomes. These sub chromosomes generally consist of two kinds: the first kind is simply encoded by all the input features, while the second kind is the separated chromosomes that are encoded by the separable features. Notably, the chromosome encoded by all the input features is still maintained so as to enhance the stability of our method, considering that there can be some ground truth equation having no separable features. The final solution is a weighted sum of all sub functions, and the weights of sub functions are optimized using the least squares method. In this way, the proposed SD-GP can hopefully attain optimal solution efficiently and can ensure the global search ability even when the separability detection is not accurate enough.

The rest of the paper is organized as follows. “Preliminaries” presents related background techniques. “The proposed algorithm” describes the proposed SD-GP algorithm, followed by the experiment study in “Experiments and results”. Finally, “Conclusion” draws the conclusions and future work.

Preliminaries

In [12], Luo et al. proposed a separability detection technique named as the Bi-Correlation Test (BiCT). In this paper, we further extended the BiCT to detect the additive separable characteristics of input features in a more flexible manner. To facilitate reader better comprehend our method, we briefly describe the BiCT technique in this section. Generally, the BiCT aims to detect whether the given system is partially separable by calculating the correlation coefficient of certain sample data generated by the system.

Definition 1

A scalar function \(f(\mathbf {x})\) with n continuous variables \(\mathbf {x}=[x_{1}, x_{2},\) \(\dots , x_{n}]\) (\(f:\mathbb {R}^{n}\mapsto \mathbb {R}\), \(x\in \mathbb {R}^{n}\)) is partially separable if and only if it can be rewritten as

$$\begin{aligned} f(\mathbf {x})=c_{0}\otimes \varphi _{1}(\mathbf {x_{1}})\otimes \varphi _{2}(\mathbf {x_{2}})\otimes \dots \otimes \varphi _{c}(\mathbf {x}_{c}) \end{aligned}$$
(1)

where \(c_{0}\) is a constant, \(\mathbf {x_{i}}\) represents a partition of the vector \(\mathbf {x}\) that contains \(n_{i}\) variables, \(\varphi _{i}\) is a scalar sub function (\(\varphi _{i}:\mathbb {R}^{n_{i}}\mapsto \mathbb {R}\)), and the binary operator \(\otimes _{i}\) can be plus (\(+\)), minus (−) and times (\(*\)), c is the number of separable partitions.

In order to test whether \(\mathbf {x_{test}}\) containing a certain part of the input variables can be separated from the input vector \(\mathbf {x}\) to form \(\mathbf {x_{i}}\) in Eq. (1), BiCT first constructs four input matrices \(\mathbf {X_{A}}\), \(\mathbf {X_{B}}\), \(\mathbf {X_{C}}\) and \(\mathbf {X_{D}}\) for sampling. The structures of matrices \(\mathbf {X_{A}}\) and \(\mathbf {X_{B}}\) are described in (2).

$$\begin{aligned} \mathbf {X_{A}}= & {} \mathbf {(X_{ fixed},X_{ random\_A})}\nonumber \\= & {} \left( \begin{array}{ccccc} x_{1,1}^\mathrm{(fixed)} &{} \cdots &{} x_{1,m}^\mathrm{(fixed)} &{} \cdots &{} x_{1,n}^\mathrm{(random\_A)}\\ x_{2,1}^\mathrm{(fixed)} &{} \cdots &{} x_{2,m}^\mathrm{(fixed)} &{} \cdots &{} x_{2,n}^\mathrm{(random\_A)}\\ \vdots &{} \ddots &{} \vdots &{} \ddots &{} \vdots \\ x_{N,1}^\mathrm{(fixed)} &{} \cdots &{} x_{N,m}^\mathrm{(fixed)} &{} \cdots &{} x_{N,n}^\mathrm{(random\_A)}\\ \end{array} \right) \nonumber \\ \mathbf {X_{B}}= & {} \mathbf {(X_{fixed},X_{random\_B})}\nonumber \\= & {} \left( \begin{array}{ccccc} x_{1,1}^\mathrm{(fixed)} &{} \cdots &{} x_{1,m}^\mathrm{(fixed)} &{} \cdots &{} x_{1,n}^\mathrm{(random\_B)}\\ x_{2,1}^\mathrm{(fixed)} &{} \cdots &{} x_{2,m}^\mathrm{(fixed)} &{} \cdots &{} x_{2,n}^\mathrm{(random\_B)}\\ \vdots &{} \ddots &{} \vdots &{} \ddots &{} \vdots \\ x_{N,1}^\mathrm{(fixed)} &{} \cdots &{} x_{N,m}^\mathrm{(fixed)} &{} \cdots &{} x_{N,n}^\mathrm{(random\_B)}\\ \end{array} \right) \end{aligned}$$
(2)

where m and n are the variable numbers of vector \(\mathbf {x_{test}}=(x_{1},x_{2},\dots ,x_{m})\) and \(\mathbf {x}=(x_{1},x_{2},\dots ,x_{n})\), and N is the number of samples. The values of the first m columns which correspond to the sampling values of the variables in \(\mathbf {x_{test}}\), are the same in both \(\mathbf {X_{A}}\) and \(\mathbf {X_{B}}\) and make up the fixed matrix \(\mathbf {X_{fixed}}\). The rest \(n-m\) columns which represent the samples of remaining variables in \(\mathbf {x}\), are randomly sampled and form random matrices \(\mathbf {X_{random\_A}}\) and \(\mathbf {X_{random\_B}}\), respectively. The structures of matrices \(\mathbf {X_{C}}\) and \(\mathbf {X_{D}}\) are similar to \(\mathbf {X_{A}}\) and \(\mathbf {X_{B}}\), but the random matrices in \(\mathbf {X_{C}}\) and \(\mathbf {X_{D}}\) correspond to the variables in \(\mathbf {x_{test}}\), while the fixed matrix represents the samples of the remaining variables in the vector \(\mathbf {x}\).

The corresponding sampled outputs of \(\mathbf {X_{A}}\), \(\mathbf {X_{B}}\), \(\mathbf {X_{C}}\) and \(\mathbf {X_{D}}\) are represented as \(\mathbf {Y_{A}}\), \(\mathbf {Y_{B}}\), \(\mathbf {Y_{C}}\) and \(\mathbf {Y_{D}}\), respectively. Then, the correlation coefficient \(r_{AB}\) between \(\mathbf {Y_{A}}\) and \(\mathbf {Y_{B}}\) is estimated as follows:

$$\begin{aligned} { r_{AB}=\frac{1}{N-1}\sum \limits _{i=1}^{N}\frac{a_{i}-{\bar{a}}}{\sigma _{a}}\cdot \frac{b_{i}-{\bar{b}}}{\sigma _{b}} } \end{aligned}$$
(3)

where N is the number of samples in \(\mathbf {Y_{A}}\) and \(\mathbf {Y_{b}}\), \(a_i\) and \(b_i\) are the i-th values in \(\mathbf {Y_{A}}\) and \(\mathbf {Y_{b}}\), \({\bar{a}}\) and \({\bar{b}}\) are the sample mean values of \(\mathbf {Y_{A}}\) and \(\mathbf {Y_{b}}\), and \(\sigma _{a}\) and \(\sigma _{b}\) are the sample standard deviations of \(\mathbf {Y_{A}}\) and \(\mathbf {Y_{b}}\). Similarly, \(r_{CD}\) can be obtained based on \(\mathbf {Y_{C}}\) and \(\mathbf {Y_{D}}\). If both \(r_{AB}\) and \(r_{CD}\) are both equal to 1, then it can be judged that the variables in \(\mathbf {x_{test}}\) can be separated from the input variables as a whole and the original data are partially separable.

Based on the above descriptions, we can find that the BiCT technique could not distinguish whether the system is separable in the form of addition or multiplication. Besides, the BiCT requires that the system can provide the outputs of randomly generated input features (i.e., \(\mathbf {X_{A}}\), \(\mathbf {X_{B}}\), \(\mathbf {X_{C}}\) and \(\mathbf {X_{D}}\)). This requirement is usually not easy to satisfy when the given problem only provides a fixed set of data for training.

The proposed algorithm

In this section, we first describes the proposed surrogate-assisted additive separability detection method named as Surrogate-assisted Additive Bi-Correlation Test (SAA-BiCT). Then, the proposed genetic programming with SAA-BiCT is presented.

The surrogate-assisted additive bi-correlation test method

Although BiCT can effectively detect the separability from the observed data automatically, there are two shortcomings that limit its application. First, BiCT can only detect the features with separability, but is disable to distinguish whether the separable features are additive or multiplicative. For example, variables in \(f_1(x_{1},x_{2},x_{3})=x_{1}+x_{2}+x_{3}\) and \(f_2(x_{1},x_{2},x_{3})=x_{1}*x_{2}*x_{3}\) are considered to be partially separable when tested by BiCT, but the former function is additive combination and the latter is multiplicative combination. Second, if the problem only provides a fixed set of data for training, the special sampling input and output matrices for detection may not be available, which makes it inapplicable. To overcome the aforementioned issues, we proposed an enhanced additive separability detection method named SAA-BiCT.

In the SAA-BiCT, the same method as mentioned in Sect. 2 is utilized to construct input matrices \(\mathbf {X_{A}}\), \(\mathbf {X_{B}}\), \(\mathbf {X_{C}}\) and \(\mathbf {X_{D}}\). To avoid the limitation caused by insufficient data in the original data set, we adopt a Gaussian process regression (GPR) as the surrogate model to predict the output matrices of the four input matrices. Specially, in this paper, the Squared Exponential with Periodic Element is adopted as the kernel function of the GPR model, which can be expressed as follows:

$$\begin{aligned} k(x,x^{'})=\sigma ^2_f \mathrm{exp}{\left[ \frac{-(x-x^{'})^2}{2l^2}\right] } + \sigma ^2_n \delta (x,x'), \end{aligned}$$
(4)

where \(\sigma _f\), l and \(\sigma _n\) are hyper-parameters of the kernel function. The JADE [21] is adopted as the solver to optimize the hyper parameters of the GPR model.

After generating the input and output matrices, the proposed SAA-BiCT does not directly use their corresponding output matrices to calculate the linear correlation coefficients. Assuming the data dimension is n, and \(x_{1},\) \(x_{2}, \dots ,x_{m}\) are tested variables and \(\mathbf {x_{test}}=(x_{1},x_{2},\dots ,x_{m})\), we define the matrices \({\Gamma _\mathbf{test}}\) and \({\Gamma _\mathbf{rest}}\) as follows:

$$\begin{aligned} {\Gamma _\mathbf{test}}= \left( \begin{array}{c} \zeta (\mathbf {x_{test}}^{(1)})\\ \zeta (\mathbf {x_{test}}^{(2)})\\ \vdots \\ \zeta (\mathbf {x_{test}}^{(N)})\\ \end{array} \right) ,\ {\Gamma _\mathbf{rest}}= \left( \begin{array}{c} \zeta (\mathbf {x_{rest}}^{(1)})\\ \zeta (\mathbf {x_{rest}}^{(2)})\\ \vdots \\ \zeta (\mathbf {x_{rest}}^{(N)})\\ \end{array} \right) , \end{aligned}$$
(5)

where N is the number of samples, \(\mathbf {x_{rest}}=(x_{1},x_{2},\dots ,x_{n-m})\) is a vector made up of the remaining input variables, and \(\zeta \) is a nonlinear function. Then matrices \(\mathbf {Y^{'}_{A}}\), \(\mathbf {Y^{'}_{B}}\), \(\mathbf {Y^{'}_{C}}\) and \(\mathbf {Y^{'}_{D}}\) can be obtained by

$$\begin{aligned} \begin{aligned} \mathbf {Y^{'}_{A}}=\mathbf {Y_{A}}+\Gamma _{\mathbf {test}},\ \mathbf {Y^{'}_{B}}=\mathbf {Y_{B}}+\Gamma _{\mathbf {test}},\\ \mathbf {Y^{'}_{C}}=\mathbf {Y_{C}}+\Gamma _{\mathbf {rest}},\ \mathbf {Y^{'}_{D}}=\mathbf {Y_{D}}+\Gamma _{\mathbf {rest}}. \end{aligned} \end{aligned}$$
(6)

We use the newly constructed \(\mathbf {Y^{'}_{A}}\) and \(\mathbf {Y^{'}_{B}}\) to calculate the correlation coefficient \(r_{AB}\) and \(\mathbf {Y^{'}_{C}}\), and \(\mathbf {Y^{'}_{D}}\) to calculate \(r_{CD}\) by Eq. (3). If the results of \(r_{AB}\) and \(r_{CD}\) are 1, the variables in \(\mathbf {x_{test}}\) can be considered to be additively separable.

Based on the above mechanism, the proposed SAA-BiCT can distinguish the additive separable structures from the original data model. SAA-BiCT applies an additive separability factor to the process of testing, so that it can only test the model from the perspective of additive separability. For example, function \(f(x_{1},x_{2},x_{3},x_{4})=x_{1}*x_{2}*x_{3}*x_{4}\) is a multiplicative model. When SAA-BiCT tests whether \(x_{1}\) can be additively separable, a factor \(\zeta (x_{1})\) is added to the original model and extends the function to be \(f(x_{1},x_{2},x_{3},x_{4})=x_{1}*x_{2}*x_{3}*x_{4}+\zeta (x_{1})\). Then by calculating the correlation coefficients, SAA-BiCT can judge that \(x_{1}\) is not separable. It is worth pointing out that different \(\zeta \) forms can lead to different detection effects. In this paper, we adopt a simple \(\zeta \) function as shown in (7) which works well based on our empirical test.

$$\begin{aligned} \zeta (\mathbf {x})=\sum \limits _{x_{i}\in \mathbf {x}}x_{i}^{2} \end{aligned}$$
(7)

Genetic programming with separability detection

To harness the separability characteristic of the given problem to improve the search performance, the proposed SD-GP adopts a multi-chromosome representation method. In SD-GP, each sub chromosome represents a sub function of the model to be solved, and the final model is the weighted sum of all sub functions. Figure 1 illustrates the general framework of the proposed SD-GP. It first uses the SAA-BiCT to detect the additive separability of the observed model. Then, the original input variables are disassembled for each sub chromosome according to the additive separation features. The least squares estimator method is utilized to optimize the weight of each sub function. The population of chromosomes is repeatedly evolved using genetic operators such as crossover, mutation and selection, until the termination condition is met.

Fig. 1
figure 1

General framework of SD-GP

When testing separability on the observed data, traversing all combinations takes a lot of time. However, it is not necessary to perform the SAA-BiCT on all input variable combinations one by one under some circumstances. For instance, if the input variables are \(x_{1},x_{2},x_{3}\), there are 7 combinations, which are \((x_{1})\), \((x_{2})\), \((x_{3})\), \((x_{1},x_{2})\), \((x_{1},x_{3})\), \((x_{2},x_{3})\) and \((x_{1},x_{2},x_{3})\). A data set that contains n input variables contains \(C^{1}_{n}+C^{2}_{n}+\dots +C^{n}_{n}\) combinations of variables. As the number of variables increases, the computational time will increase dramatically. Whereas, in SAA-BiCT, we only need to test \((x_{1})\), \((x_{2})\) and \((x_{3})\) and totally test \(C^{1}_{n}+C^{2}_{n}+\dots +C^{\lfloor {n/2}\rfloor }_{n}\) times to achieve the same detection effect. In this paper, we aim to reduce the number of variables in each separated part so as to reduce the difficulty of searching the sub functions. Therefore, the proposed SAA-BiCT starts with combinations with fewer variables. That is, each single variable is tested at the beginning, and then the combinations of them. In addition, all variables to be tested are put into a vector before detection. Once variables or their combinations is separable, these variables are removed from the vector. If only combinations formed by the variables in the vector are detected, unnecessary detections can be avoided and the separable feature set can be obtained.

The encoded solution I of a chromosome in SD-GP is obtained by

$$\begin{aligned} I=\beta _{0}+\beta _{1}C_{1}+\beta _{2}C_{2}+\dots +\beta _{k}C_{k}, \end{aligned}$$
(8)

where \(C_{i}\) is one of the sub chromosomes, \(\beta _{i}\) is the weight calculated by the least squares estimator method and k is the number of sub chromosomes. In this paper, we adopts the fixed-length gene expression encoding method to encode each sub chromosome \(C_{i}\).

Basic structure and implementation details of genetic programming

The basic structure of the gene expression chromosome contains a Head section and a Tail section. The Head section can contain both function symbols (e.g., \(+\), \(*\), \(\sin \)) and terminal symbols (e.g., \(x_{1}, x_{2}\)), while the Tail section can only contain terminal symbols. To ensure the whole chromosome can be decoded correctly, the length of Head (h) and Tail (t) must satisfy \(t=h\cdot (u-1)+1\), where u is the number of maximum operands of function symbols.

Fig. 2
figure 2

An example of decoding a chromosome

Table 1 Synthetic problems

The process of using a chromosome to represent a mathematical function is as shown in Fig. 2. Given a string-based representation in Fig. 2, one should treat the first operator as the root node in encoding tree. Then, the rest symbols in the string are traversed in a breath-first-search fashion. That is, \(+\) and \(x_1\) should be the children of root node −, and \(*\) and \(\sin \) should be the children of the second symbol \(+\), and so on. With the constructed encoding tree, the equation \(x_1*x_1+\sin (x_2)-x_1\) can be obtained naturally.

The generation of sub chromosomes generally depends on the obtained additive separation features. Each separable group of variables is assigned to a sub chromosome, and the sub chromosome can only utilize the separable group of variables to construct sub function. To ensure the global search ability of the algorithm, an additional sub chromosome is added to each chromosome, which can utilize all variables involved to construct sub function. In SD-GP, the minimum number of sub chromosomes \(K_{min}\) (e.g., \(K_{min} = 5\)) is set in advance. When the number of separable variable sets found by SAA-BiCT is smaller than \(K_{min}\), the remaining sub chromosomes can utilize all variables to construct sub functions.

The SD-GP adopts SL-GEP [22] as the GP solver to evolve the above multi-chromosomes. The SL-GEP is a recently published GP variant which can efficiently search solution by using extended genetic operators borrowed from differential evolution (DE) [17, 21]. It should be noted that the SL-GEP adopts a gene expression chromosome representation with automatically defined functions (C-ADFs). In this paper, we only borrow the genetic operators of SL-GEP to evolve the proposed chromosomes. Generally, the evolutionary process in SL-GEP is composed of four main steps, which are initialization, mutation, crossover and selection. In the initialization, according to the chromosome structure, corresponding symbols are randomly assigned to the Head and Tail sections to form the initial chromosomes. Then the chromosomes are iteratively evolved through mutation, crossover and selection, until the termination condition is satisfied. In this paper, the algorithm will terminate when a pre-defined maximum number of generations is reached or a pre-defined satisfying solution is found.

Table 2 Important parameter setting of SD-GP
Table 3 Comparison results

Experiments and results

This section investigates the performance of the proposed SD-GP algorithm. First, the experimental settings, including the synthetic test problems, the compared algorithms and the parameter settings of all algorithms are described. Then, the comparison results are presented. Finally, the effectiveness of the separability detection of SAA-BiCT is discussed.

Experiment settings

To test the effectiveness of the proposed SD-GP for solving problems with separable additive structure, a synthetic problem set containing 15 symbolic regression problems with different features is designed for testing, as listed in Table 1. The problems can be divided into three categories: Category I is made up of basic multivariate problems. The problems in Category II have more complex forms with more square or even cubic structures, which are harder to be solved. Category III contains more function primitives, including \(\sin \), \(\cos \), \(\exp \) and \(\ln \). The format U[abN] in column Data Set means that the number of samples is N and each input variable is randomly sampled in the interval (ab).

To better evaluate the performance of the proposed SD-GP, we choose three other GP variants for comparison. The first algorithm is a modified SL-GEP [22] (labelled as S-SLGEP), which removes the ADFs in chromosome representation and uses least squares estimator method to create constants. Since the S-SLGEP is adopted as the fundamental component in the proposed method, comparison with S-SLGEP indicates the effectiveness of the proposed preprocessing component, SAA-BiCT.

Fig. 3
figure 3

Comparison on convergence speeds of problem 1–9

Fig. 4
figure 4

Comparison on convergence speeds of problem 10–15

The second algorithm is a multi-gene GP (SMGGP), which adopts the chromosome representation and genetic operators in SD-GP but abandons the separability detection technique. Comparison with SMGGP serves as component analysis to indicate the efficacy of the separability component. The last algorithm is the Genetic Programming Toolbox for the Identification of Physical Systems version 2 (GPTIPS-2) [18], a widely used open source genetic programming toolbox for multi-gene symbolic regression on MATLAB platform [2, 5, 6, 13]. SMGGP and GPTIPS-2 are multi-gene algorithms, but unlike SD-GP, they do not use the knowledge obtained from the observed data to assist the searching process, so the total number of their sub chromosomes (K) is fixed in advance. In our experiments, the parameter K of SMGGP and GPTIPS-2 is set to 5. For SD-GP, we set the minimum number of sub chromosomes \(K_{\min }=5\). The length of the string representing the sub chromosome (L) in SD-GP and SMGGP is 21. Because S-SLGEP is a single-chromosome GP algorithm, so to enhance its expression ability, its chromosome can contain more symbols than other three algorithms and its L is set to 81. For GPTIPS-2, the maximum depth (\(D_{\max }\)) of the GP tree is set to 4. In addition, the fitness value V of all algorithms is calculated by

$$\begin{aligned} V=\frac{1}{N}\sum \limits _{i=1}^{N}(y_{i}-o_{i})^{2}, \end{aligned}$$
(9)

where \(y_{i}\) is the output value calculated by the indicate chromosome and \(o_{i}\) is the real output value in the observed data set. The common parameters of SD-GP and S-SLGEP are set the same as suggested in SL-GEP [22]. Table 2 lists the important parameter settings of the proposed SD-GP.

Results and discussion

For each test problem, we run all algorithms for 30 independent times and count the total successful times for solving problems correctly. The experimental results are recorded in Table 3. Column SAA-BiCT in SD-GP shows whether SAA-BiCT is able to detect all additive separable features correctly from the input variables.

The experimental results show that the proposed SD-GP performs much better than other compared algorithms. In general, SMGGP, GPTIPS-2 and SD-GP, which adopt multi-gene encoding technique, perform significantly better than S-SLGEP which uses only a single chromosome. Among the three multi-gene algorithms, SD-GP performs the best, especially when SAA-BiCT can correctly detect all the additive separable features of the observed data. It is difficult for SMGGP and GPTIPS-2 to solve multivariate problems which contain many additive high-power structures, such as problems in Category II. However, with the detected knowledge, SD-GP can search efficiently on these problems.

In SD-GP, the number of sub chromosomes is determined by the number of separable features detected. In this way, SD-GP can adaptively adjust the chromosome size so as to fit the complexity of the problem. Meanwhile, both SMGGP and GPTIPS-2 use a fixed size of sub chromosomes to solve all problems, which is less flexible than the proposed SD-GP. For example, when they solve Problem 11 that requires 8 sub chromosomes, they cannot automatically increase the number of sub chromosomes, thereby weakening their problem-solving ability. Owing to SAA-BiCT, the SD-GP can overcome the above weakness. It should be noted that even when SAA-BiCT fails to detect correct separable features, SD-GP still has promising anti-interference ability to correct errors, just as what happened during the solution of Problem 3, Problem 10, Problem 11 and Problem 15. On Problem 10, SAA-BiCT wrongly considers that both \(x_{2}\) and \(x_{3}\) can be separated independently, but in reality they are only separable as a whole. However, since SD-GP uses extra sub chromosomes to contain all variables and limits the minimum number of sub chromosomes to 5, the separable part \(sin(x_{2}+x_{3})\) can be found by additional sub chromosomes successfully.

In order to further test the efficiency of SD-GP in finding correct solutions, we plot the convergence graphs of fitness values on nine canonical problems in Fig. 3, and six problems with relatively complex primitives in Fig. 4. The fitness values are obtained by Eq. (9). Thus the closer the fitness is to zero, the better the solution is. The results in Fig. 3 and Fig. 4 show that SD-GP always converges the fastest among the compared algorithms, which indicates that the proposed method is effective to improve the search efficiency of GP.

Table 4 Additive separability test for misjudgment

Finally, we test the additive separability detection capability of SAA-BiCT and BiCT. The variables that are mistakenly judged to be additive separable are listed in Table 4. It can be observed that BiCT has misjudgments on all multiplicative functions, which donates that it has no discerning ability to the multiplicative separation and additive separation. Meanwhile, SAA-BiCT can correctly identify the additive separability features on most cases. It is worth pointing out that the detection accuracy of SAA-BiCT is limited by the prediction accuracy and the calculation error, so misjudgments may occur sometimes.

Conclusion

In this paper, we proposed a genetic programming with separability detection technique (SD-GP) to improve the search efficacy of GP on symbolic regression problems with separable structures. SD-GP uses a separability detection method called SAA-BiCT to discover the additive separable features in the observed model, and then it generates and accelerates the evolution of sub chromosomes with the obtained knowledge. Experiments have indicated that SD-GP has advantages in problems with seperable structures compared to S-SLGEP, SMGGP and the widely used MATLAB toolbox GPTIPS-2. There are several future research directions. First, the proposed method can be further improved by designing a more accurate prediction method to assist the separability detection. Second, different forms of \(\zeta \) in Eq. (7) can be tried to achieve better separability detection results. Hereafter, with the separability detection component, the proposed algorithm can be further extended to high-dimensional scenario by reducing the computation overhead. In addition, the proposed algorithm framework can be further improved by incorporating other useful knowledge learned from the observed data, such as symmetry characteristic of input features.