Machine learning of poorly predictable ecological data

https://doi.org/10.1016/j.ecolmodel.2005.11.015

Abstract

This paper reports on research applying a variety of machine learning techniques to a difficult modelling problem: the spatial distribution of an endangered Australian marsupial, the southern brown bandicoot (Isoodon obesulus). Four learning techniques – decision trees/rules, neural networks, support vector machines and genetic programming – were applied to the problem. Support vector and neural network approaches gave marginally better predictive accuracy, but in the context of low overall accuracy, decision trees and genetic programming gave more useful results because their models are comprehensible to humans.

Introduction

Spatial phenomena and spatial interactions in the real world are highly intricate. This makes mathematical models, which often rely on strong theoretical knowledge, very hard to build, since in many cases the available theories are suspect or, at best, weak. Machine learning has consequently been used to explore such datasets and, where the data are highly regular and the effects are strong, has yielded valuable insights. However, ecological modelling problems frequently combine small datasets, noisy data and weak domain theories, and thus pose severe challenges to machine learning techniques. This paper presents a comparative study of what can be obtained from a variety of machine learning techniques on one such difficult problem: the distribution of a native Australian animal, the southern brown bandicoot (Isoodon obesulus).

The southern brown bandicoot is a small, omnivorous, ground-dwelling marsupial which occurs in southern and eastern Australia. Habitat fragmentation, feral predators and other factors have led to a continuing decline in the distribution and abundance of the bandicoot.

To protect it, it is necessary to understand the relationships between its population and these factors. In 1998 and 1999, over 300 sites were surveyed in South Australia, and population data for the bandicoot, together with surrounding factors such as vegetation, soil, fire history and geomorphology, were obtained. In this paper, four machine learning techniques (decision tree/rule learning, genetic programming, neural networks and support vector machines) are used to try to identify the potential relationships between the bandicoot and geographical factors.

The rest of this paper is organised as follows. Section 2 describes the modelling problem and underlying dataset, while Section 3 surveys the learning techniques used in this research. The data preparation and experimental setup are described in Section 4, while Section 5 presents the results. The results are discussed in Section 6, and the conclusions are presented in Section 7.

Section snippets

Bandicoot conservation and survey data

The southern brown bandicoot I. obesulus is a small (∼0.5–1.5 kg), omnivorous, ground-dwelling marsupial occurring in southern and eastern Australia. Habitat fragmentation, feral predators and other factors have caused a decline in its abundance, and one subspecies, I. o. obesulus, which is the focus of this paper, is listed as endangered under Australia's Environment Protection and Biodiversity Conservation Act 1999.

In order to conserve the southern brown bandicoot, it is necessary to understand

Machine learning techniques

Previous investigations on the dataset using Generalised Linear Modelling had resulted in little understanding of the data. For example, the best forward selection model for the dataset included Gld, Drscore and Xanscore, with a residual deviance of 281.4 and only 97.4 of the sum deviance explained (Paull, unpublished data). Hence it was seen as a suitable candidate for investigation with machine learning techniques, as the potential would be—an understanding of the influences on the

Data preparation and experimental setup

The original data contained 344 sites (i.e. instances) described in terms of the attributes discussed in Section 3. One of these sites contained a missing value for one attribute. It was deemed expedient to omit that instance rather than needlessly raise the complex issues involved in dealing with missing values.
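The omission of the incomplete instance can be sketched as a simple filter. This is a minimal illustration only: the attribute names and values below are hypothetical and do not come from the survey data.

```python
# Hypothetical survey records; attribute names and values are invented
# for illustration and are not taken from the bandicoot dataset.
sites = [
    {"site": 1, "soil": "sand", "fire_age": 12, "diggings_class": 0},
    {"site": 2, "soil": "loam", "fire_age": None, "diggings_class": 1},  # missing value
    {"site": 3, "soil": "clay", "fire_age": 30, "diggings_class": 2},
]


def drop_incomplete(records):
    """Return only the records in which every attribute has a value."""
    return [r for r in records if all(v is not None for v in r.values())]


complete_sites = drop_incomplete(sites)
print(len(sites), "->", len(complete_sites))
```

Dropping instances is reasonable when, as here, only one instance is affected; with many incomplete records, imputation would usually be worth considering instead.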

The class distribution of the instances is highly skewed: there are only 13 instances of class 3 (> 20 diggings per 100 m), 48 of class 2 (6–20 diggings), 123 of class 1 (1–5 diggings) and

Results

The results of the runs are shown in Table 5. Each entry gives the mean and standard deviation, over ten runs, of the error rate (as a percentage) for the particular combination of learning mechanism and learning parameters.
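As a rough sketch of how one such table entry could be computed, the snippet below summarises ten error rates with Python's standard library. The error rates are hypothetical placeholders, and the use of the sample (n − 1) standard deviation is an assumption, since the text does not say which form was used.

```python
from statistics import mean, stdev

# Hypothetical test-set error rates (%) from ten runs of one
# learner/parameter combination; these are not the paper's results.
error_rates = [52.0, 48.5, 55.1, 50.3, 49.8, 53.6, 51.2, 47.9, 54.4, 50.7]


def summarise_runs(rates):
    """Return (mean, sample standard deviation) of a list of error rates."""
    return mean(rates), stdev(rates)


avg, sd = summarise_runs(error_rates)
print(f"{avg:.1f} ± {sd:.1f}")
```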

The first point to note is that none of the performances is particularly good. While some of the algorithms fit the training data accurately, this is simply the result of overfitting, as the test-set performance is much weaker.
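The gap between training and test performance can be illustrated with a deliberately extreme sketch. It uses a one-nearest-neighbour classifier, which is not one of the four techniques studied here, and synthetic data whose labels are unrelated to the single feature. Both choices are assumptions made purely for illustration.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable


def make_noise_data(n):
    """n points with one random feature and a class label unrelated to it."""
    return [(rng.random(), rng.randrange(4)) for _ in range(n)]


def predict_1nn(train, x):
    """Label of the training point whose feature is closest to x."""
    return min(train, key=lambda p: abs(p[0] - x))[1]


def error_rate(train, data):
    """Percentage of points in data misclassified by 1-NN fitted on train."""
    wrong = sum(predict_1nn(train, x) != y for x, y in data)
    return 100.0 * wrong / len(data)


train, test = make_noise_data(200), make_noise_data(100)
train_err = error_rate(train, train)  # memorised: each point is its own neighbour
test_err = error_rate(train, test)    # labels carry no signal, so this stays high
print(f"train {train_err:.1f}%  test {test_err:.1f}%")
```

A learner that memorises its training set scores perfectly on it while doing no better than chance on unseen data, which is why only the test-set figures are informative.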

For the unbalanced-class data, none of the

Discussion

Although there is some indication in the data that non-comprehensible representation methods may have given slightly better learning performance on the balanced-class data, this effect is absent in the unbalanced-class data. On the other hand, the usefulness of a black-box predictor with an error rate of close to 50% (unbalanced) or 60% (balanced) is open to serious question. With comprehensible representations, since the predictive performance of the classifiers has been validated on an

Conclusions

We investigated the performance of a range of machine learning methods in generating models from a species distribution dataset. The conservation problem underlying the dataset is of some importance, and methods to extract an understanding of the data are highly desirable. However, the dataset has so far proven highly resistant to analysis.

Where the comprehensibility of the modelling is not an issue, and predictive accuracy is all that is required, the results suggest that support vector

Acknowledgements

We would like to thank ForestrySA for access to the study sites and archived fire records. We would particularly like to acknowledge the actions of the authors of the relevant software systems – Ross Quinlan, Jeff Elman, Chih-Chung Chang, Chih-Jen Lin and Brian Ross – in making their software freely available to researchers.

References (18)

  • Whigham, P.A., 2000. Induction of a marsupial density model using genetic programming and spatial relationships. Ecol. Model.
  • Chang, C.-C., Lin, C.-J., 2001. LIBSVM: A Library for Support Vector Machines....
  • Haykin, S., 1994. Neural Networks, A Comprehensive Foundation.
  • Koza, J.R., 1992. Genetic Programming: On the Programming of Computers by Means of Natural Selection.
  • McDonald, R.C., et al., 1990. Australian Soil and Land Survey Field Handbook.
  • Munsell® Color, 1994. Munsell® Soil Color Charts, Macbeth Division of Kollmorgen Instruments, New Windsor, New...
  • Nuñez, H., Angulo, C., Catala, A., 2000. Rule Extraction from Support Vector Machines. In: Verleysen, M. (Ed.),...
  • Paull, D.J., 1993. The distribution, ecology and conservation of the southern brown bandicoot (Isoodon obesulus...
  • Paull, D.J., 1995. The distribution of the southern brown bandicoot Isoodon obesulus obesulus in South Australia. Wildl. Res.
There are more references available in the full text version of this article.
