Ecological Modelling

Volume 146, Issues 1–3, 1 December 2001, Pages 231-241
Variants of genetic programming for species distribution modelling — fitness sharing, partial functions, population evaluation

https://doi.org/10.1016/S0304-3800(01)00309-X

Abstract

We investigate the use of partial functions, fitness sharing and committee learning in genetic programming. The primary intended application of the work is in learning spatial relationships for ecological modelling. The approaches are evaluated using a well-studied ecological modelling problem, the greater glider population density problem. Combinations of the three treatments (partial functions, fitness sharing and committee learning) are compared on the dimensions of accuracy and computational cost. Fitness sharing significantly improves learning accuracy, and populations of partial functions substantially reduce computational cost. The results of committee learning are more equivocal, and require further investigation. The learned models are both highly predictive and highly explanatory.

Introduction

Spatial relationships lie at the core of many ecological relationships. A number of methods for learning spatial relationships have been proposed (Whigham et al., 1992, Dibble, 1994, Bowman, 1997). Genetic Programming has shown itself to be one of the more successful approaches to learning such spatial relationships (Whigham, 1996) but suffers from high computational cost. The work reported here aims to reduce the computational cost of genetic programming for learning in search spaces such as those encountered in ecological modelling.

The greater glider dataset is described in detail by Stockwell et al. (1990). Briefly, it consists of a 20×20 grid of cells. For each cell, the values of seven independent variables are recorded: the degree of development (D, 3 categories); whether there exists a stream corridor (ST, 2 categories); stand condition from a forestry perspective (SC, 6 categories); site quality from a forestry perspective (SQ, 4 categories); floristic nutrients (FN, 4 categories); slope (S, 3 categories); and erosion (E, 3 categories). Note that in the study area all sites were highly eroded (E=3), so the erosion attribute may effectively be ignored. For each cell, we also have a value for the putative dependent variable, the greater glider density (GD, 4 categories, ranging from 0, absent, to 3, abundant).

An important aspect of this dataset is that it was originally studied from a non-spatial perspective. Discussions with Stockwell raised the possibility that spatial relationships might be important, and this was subsequently confirmed (McKay et al., 1997).

Genetic programming, like other forms of evolutionary computation, can suffer from premature convergence, where variation is eliminated from a population before the desired solution is achieved. The problem can be alleviated with larger populations, but this is not a realistic option in domains such as spatial learning, where evaluating individuals may carry a very high computational cost.

A number of alternative approaches (fitness sharing, partial functions, population evaluation) have been investigated with toy problems (McKay, 2000a, 2000b, 2000c). This paper extends that work to a realistic ecological modelling problem.

Fitness sharing was introduced by Deb and Goldberg (1989), in a form (explicit fitness sharing) which relied on a distance metric to define the similarity between individuals. Similar individuals are punished for that similarity by being required to share the raw fitness they receive. However, this approach has a disadvantage: it requires a priori definition of the distance function, before knowledge of the shape of the search space has been acquired.
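As an illustration only, explicit fitness sharing can be sketched as below. The sketch uses the commonly cited triangular sharing function, in which each raw fitness is divided by a niche count; the parameter names `sigma` and `alpha`, and the choice of Hamming distance over bit strings, are assumptions made for the example and are not taken from this paper.

```python
from typing import Callable, List, Sequence


def sharing(d: float, sigma: float, alpha: float = 1.0) -> float:
    """Triangular sharing function: 1 at distance 0, falling to 0 at sigma."""
    return 1.0 - (d / sigma) ** alpha if d < sigma else 0.0


def explicit_shared_fitness(
    population: Sequence[str],
    raw_fitness: Sequence[float],
    distance: Callable[[str, str], float],
    sigma: float,
    alpha: float = 1.0,
) -> List[float]:
    """Divide each raw fitness by its niche count.

    The niche count sums the sharing function over the whole population,
    including the individual itself, so it is always at least 1.
    """
    shared = []
    for i, individual in enumerate(population):
        niche_count = sum(
            sharing(distance(individual, other), sigma, alpha)
            for other in population
        )
        shared.append(raw_fitness[i] / niche_count)
    return shared


def hamming(a: str, b: str) -> int:
    """Assumed distance metric: number of differing positions."""
    return sum(x != y for x, y in zip(a, b))
```

The need to choose `distance` and `sigma` before anything is known about the search space is exactly the disadvantage described above.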

Smith et al. (1992) noted that the requirement to define a distance metric could be avoided in problems where the fitness of an individual is built up from the payoff from discrete sub-problems. In this case, the payoff for a sub-problem could simply be shared amongst the individuals which perform well on that sub-problem. The resultant approach is known as implicit fitness sharing. Since most ecological modelling problems share this payoff structure, implicit fitness sharing is highly relevant.
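A minimal sketch of implicit fitness sharing follows, assuming that each training case is one sub-problem with a payoff of 1, and that an individual "performs well" on a case when it predicts the target exactly. The function name and the data layout (`predictions[i][c]` is the prediction of individual `i` on case `c`) are hypothetical.

```python
from typing import List, Optional, Sequence


def implicit_shared_fitness(
    predictions: Sequence[Sequence[Optional[int]]],
    targets: Sequence[int],
) -> List[float]:
    """Share the unit payoff of each case among the individuals that solve it.

    A prediction of None (an undefined output) never matches a target,
    so it earns no share of the payoff for that case.
    """
    n_cases = len(targets)
    # Number of individuals that predict each case correctly.
    solvers = [
        sum(1 for p in predictions if p[c] == targets[c])
        for c in range(n_cases)
    ]
    # Each individual collects 1/solvers[c] for every case c it solves.
    return [
        sum(1.0 / solvers[c] for c in range(n_cases) if p[c] == targets[c])
        for p in predictions
    ]
```

No distance metric is needed: an individual that solves a case few others solve keeps most of that payoff, which rewards diversity directly.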

Most work in evolutionary computation makes use of total functions (functions whose values are fully defined for all possible inputs). In many cases this is a matter of simple necessity. For example, the genome structure in standard genetic algorithms does not support partially defined functions (the ‘don't care’ symbols used in genetic algorithms simply indicate that a particular input value is ignored, not that the output value is undefined for a given input). But partial functions are rarely used in genetic programming, where they do have a sensible meaning.
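The distinction can be illustrated with a small sketch. Both rules below are hypothetical and are not models learned in this work; the sketch also assumes that site quality (SQ) is coded 1 to 4 and that an undefined output is represented by `None`.

```python
from typing import Callable, Dict, Optional, Sequence

Cell = Dict[str, int]  # attribute name -> category code (assumed encoding)


def total_rule(cell: Cell) -> int:
    """Total function: returns a glider density class for every cell."""
    return 2 if cell["SQ"] >= 3 else 0


def partial_rule(cell: Cell) -> Optional[int]:
    """Partial function: defined only where site quality is high."""
    return 2 if cell["SQ"] >= 3 else None  # None stands for 'undef'


def coverage(rule: Callable[[Cell], Optional[int]], cells: Sequence[Cell]) -> float:
    """Fraction of cells on which the rule has a defined output."""
    return sum(rule(c) is not None for c in cells) / len(cells)
```

Unlike a 'don't care' symbol, which ignores an input, `partial_rule` makes no claim at all about low-quality sites.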

On the other hand, total functions are under evolutionary pressure to solve all parts of a problem; this pressure tends to reduce diversity, perhaps fatally so if a critical part of an optimal solution is more difficult to find than equivalent parts of a local optimum. Equally important, since a partial function is in general less complex than a total function of the same depth, we can expect the evaluation costs of individual partial functions to be lower than those of total functions.

One difficulty with populations of partial functions lies in determining how to use the population to predict new cases. The ‘standard’ approach, of using the fittest individual, fails because the fittest individual may not have a defined output for the new case. One possibility is to wait until the learning process has converged, then impose an evolutionary pressure on the population toward totality. Apart from the additional computational cost this may impose (the system is required to respond to two different evolutionary pressures, first the learning problem itself, then the totality problem), there is considerable difficulty in determining the appropriate stage at which to introduce the totality pressure.

An alternative is to use population evaluation, in which the population vote on the predicted value for the new case. The vote is weighted by the previously evaluated fitness of the individuals, and individuals which are undefined for the given case simply abstain from voting. In theory, population evaluation should also give improved generalisation, when compared with evaluation by the fittest individual.
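The voting scheme can be sketched as follows, assuming that individuals are callables returning `None` where they are undefined. Returning `None` when every individual abstains, and breaking ties in favour of the class that received a vote first, are choices made for this example rather than details taken from the paper.

```python
from collections import defaultdict
from typing import Callable, Dict, Optional, Sequence

Individual = Callable[[dict], Optional[int]]


def population_vote(
    population: Sequence[Individual],
    fitness: Sequence[float],
    case: dict,
) -> Optional[int]:
    """Fitness-weighted vote; undefined individuals abstain."""
    tally: Dict[int, float] = defaultdict(float)
    for individual, weight in zip(population, fitness):
        prediction = individual(case)
        if prediction is None:
            continue  # undefined for this case: abstain
        tally[prediction] += weight
    if not tally:
        return None  # nobody in the population is defined for this case
    # max() keeps the first key reaching the maximum, so ties go to the
    # class that was voted for earliest.
    return max(tally, key=tally.get)
```

Several weaker individuals that agree can outvote a single fitter one, and the fittest individual has no influence on cases where it is undefined.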

Details of approach

Comparisons have been carried out on the greater glider problem. The experiments compare the use of implicit fitness sharing, raw fitness, and a combination of the two. Each treatment is repeated twice, using a population of total functions, and a population of partial functions. The results are evaluated both on the training set, and on a set of unseen test cases, and using both the fittest individual and a voting mechanism.

Experimental design

The search space for these experiments is defined by the grammar in Table 2 (based on Whigham, 1996; for total functions, the productions leading to ‘undef’ are deleted).

The grammar productions from spab to ‘undef’ deserve comment. The initial version of this grammar had a direct production spab -> ‘undef’. This resulted in far too many ‘undef’ leaves in the trees, with only a tiny fraction of the population being defined anywhere. This results from a well-known problem with grammar generation

Results

The numbers of cases (out of a possible 200) correctly predicted at the end of a run of 50 generations, averaged over 25 runs, are shown in Table 4 and Table 5. Computational requirements are shown in Table 6.

Fig. 1 shows the number of training cases correctly predicted by the fittest individual, plotted against generation. In the legend, ‘total’ and ‘partial’ refer to populations of total and partial functions, respectively. ‘Raw’, ‘share’ and ‘ramp’ refer to the use of raw fitness

Discussion

Fitness sharing clearly gives notably improved performance compared with raw fitness for all treatments, with the single exception of the combination of partial functions with fitness sharing throughout the run, under ‘best individual’ evaluation. This latter result is expected, because in that case, the best individual may well not be total, and so cannot be expected to make as many correct predictions as the corresponding

Conclusions

The results reported here confirm those previously reported from experiments with very different toy problems: fitness sharing provides clear benefits over raw fitness; these benefits are enhanced by population evaluation mechanisms; and populations of partial functions may yield further benefits when computational costs are taken into account. The methods discussed were particularly aimed at spatial modelling of ecological systems, and are expected to extend well to a range of similar problems.

Acknowledgements

The ideas in this paper have benefited greatly from discussions over the years at UNSW and CSIRO with David Stockwell, Richard Davis, Peter Laut, Paul Darwen, Peter Whigham, Xin Yao and Ko-Hsin Liang. The system has been implemented through modifications to Brian Ross’ innovative DCTG-GP system, which contributed greatly to the rapidity with which a wide range of different algorithms could be prototyped in the development of this system, and supported a simple mechanism for defining and

References (19)

  • Cohen, W.W., 1994. Grammatically biased learning: learning logic programs using an explicit antecedent description language. Art. Intel.
  • W Banzhaf et al.
  • Bowman, A., 1997. S+ SpatialStats: A Spatial Statistics module for S-Plus. Maths&Stats Newsletter. CTI Statistics,...
  • Deb, K., et al. An investigation of niche species formation in genetic function optimization
  • Dibble, C., 1994. Beyond Data: Handling Spatial and Analytical Contexts with Genetics Based Machine Learning Sixth...
  • McKay, R.I., et al. Learning spatial relationships: some approaches
  • McKay, R.I., 2000. Fitness Sharing in Genetic Programming, Genetic and Evolutionary Computation Conference GECCO-2000,...
  • McKay, R.I., 2000. Partial Functions in Fitness-Shared Genetic Programming, Congress on Evolutionary Computation CEC...
  • McKay, R.I., 2000. Committee Learning of Partial Functions in Fitness-Shared Genetic Programming, Asia-Pacific...
There are more references available in the full text version of this article.
