Information Sciences

Volume 526, July 2020, Pages 86-101

SOAP: Semantic outliers automatic preprocessing

https://doi.org/10.1016/j.ins.2020.03.071

Abstract

Genetic Programming (GP) is an evolutionary algorithm for the automatic generation of symbolic models expressed as syntax trees. GP has been successfully applied in many domains, but most research in this area has not considered the presence of outliers in the training set. Outliers make supervised learning problems difficult, and sometimes impossible, to solve. For instance, robust regression methods cannot handle more than 50% of outlier contamination, a limit referred to as their breakdown point. This paper studies problems where outlier contamination is high, reaching levels of up to 90%, extreme cases that can appear in some domains. This work shows, for the first time, that a random population of GP individuals can detect outliers in the output variable. From this property, a new filtering algorithm is proposed, called Semantic Outlier Automatic Preprocessing (SOAP), which can be used with any learning algorithm to differentiate between inliers and outliers. Since the method uses a GP population, the filtering can be obtained essentially for free in a GP symbolic regression system. The approach is the only method that can automatically clean a dataset without incurring a cost that grows exponentially with the percentage of outliers.

Introduction

Genetic programming (GP) has proven to be a powerful modeling technique for regression tasks, particularly when compact and interpretable models are sought [47]. However, like any other data-driven technique, the quality of the final solution depends on the characteristics of the data used to train, or in this case evolve, a model. Even the most sophisticated algorithm cannot produce an accurate model if the input features (predictors) are completely unrelated to the output variable (target). A difficult learning task is encountered when the training dataset is corrupted by outliers, which can severely impede the learning process.

There are several approaches to deal with outliers, but all of them are limited in scope. For instance, a robust objective function can be used to guide a regression algorithm, such as Least Median Squares (LMS) [13]. However, these functions cannot deal with more than 50% of outliers in the training dataset; in such cases a sampling technique such as Random Sample Consensus (RANSAC) [10], [26] is required. A shortcoming of RANSAC is that it relies on extracting multiple samples from the training set over several iterations, and the number of required samples grows exponentially with the percentage of outliers in the dataset, as illustrated below. There are also filtering techniques for time series data [36]. However, filtering techniques also break down when the number of outliers exceeds 50% of the total samples, and they are difficult to apply in multi-dimensional problems. For this reason, many authors simply recommend visual inspection and manual removal of outliers. Indeed, there are few options to deal with such extreme cases of outlier contamination [6], [18], [36]. It is important to note, moreover, that such cases are not just theoretical; they can appear in real-world scenarios. For instance, RANSAC is widely used in computer vision [10], and has also been used to build Quantitative Structure Activity Relationship (QSAR) models [21] and to analyze microbial metabolomics data [44]. Other studies that have dealt with datasets containing many outliers can be found in areas such as complex non-linear systems [50], on-line process monitoring [8], astronomy [12] and multi-target regression [9].
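The exponential cost of RANSAC follows from its standard sample-count bound: to obtain, with confidence p, at least one outlier-free sample of size s from data with outlier fraction ε, RANSAC needs at least log(1 − p)/log(1 − (1 − ε)^s) iterations. A minimal sketch of this calculation (a textbook formula, not code from the paper; names are ours):

```python
import math

def ransac_iterations(p_success: float, outlier_ratio: float, sample_size: int) -> int:
    """Standard RANSAC bound: smallest N such that
    1 - (1 - (1 - outlier_ratio)**sample_size)**N >= p_success."""
    inlier_sample_prob = (1.0 - outlier_ratio) ** sample_size
    return math.ceil(math.log(1.0 - p_success) / math.log(1.0 - inlier_sample_prob))

# Required iterations explode as contamination grows (p = 0.99, s = 4):
for eps in (0.3, 0.5, 0.7, 0.9):
    print(f"{eps:.0%} outliers -> {ransac_iterations(0.99, eps, 4)} iterations")
# 30% -> 17, 50% -> 72, 70% -> 567, 90% -> 46050
```

At 90% contamination, even a minimal sample size of four already requires tens of thousands of iterations, which motivates the search for a non-sampling alternative.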

Automatically dealing with large amounts of outliers could allow us to develop new applications and technologies. For instance, such methods can be used to rethink the design and use of sensors. Large amounts of resources are invested to develop accurate and reliable sensors, under the assumption that measurements must be very reliable if we want to understand and model a phenomenon. It may be possible, however, to leverage a powerful outlier detection algorithm to design low-cost sensor networks that can be completely wrong most of the time and still provide enough of a signal to model a particular system or process. Similarly, such methods can be used in the development of systems that interact with humans, such as recommendation systems [5], [37], [43]. These types of systems attempt to measure some form of human behavior, which is prone to spontaneity and error. The difficulty in adequately modeling human actions or responses can be partially attributed to the prevalence of outlier behavior by humans in unstructured environments. Therefore, modeling techniques that can account for, and possibly detect, outliers can be useful in this domain.

In the case of supervised learning with GP, dealing with data contamination by outliers has received insufficient research attention [22], [26], [30], [46]. In fact, literature on this topic is scarce, and only our previous works have studied how to handle more than 50% of outliers in the output variable, by using RANSAC [26] or a preliminary version of the proposal in the present work [25] that was only evaluated on very simple univariate synthetic problems. In [26], it is shown that even a small number of outliers can severely degrade the accuracy of symbolic regression with GP. However, the relation between random GP trees and outliers has not been studied in detail.

The goal of this paper is twofold: first, to characterize the behavior of GP trees by evaluating their ability to model inlier and outlier data, an aspect of GP trees that has not been studied before; second, based on that analysis, to propose and develop an outlier detection algorithm. The study focuses on regression problems and on outliers in the target variable.

Therefore, the first contribution of this paper is to reveal a previously unknown, and unexpected, property of randomly generated GP trees, or more generally of random syntax trees. It is empirically shown that a large proportion of randomly generated GP trees can be used to identify which instances in a training set are outliers and which are not. This appears to be an intrinsic property of randomly generated GP individuals, since it holds across several multi-dimensional problems and over many contamination levels, reaching up to 90% of outliers in the training set in some cases. We provide a conceptual hypothesis to explain this property, based on the nature of the space of all possible program outputs given a particular training set, also referred to as the semantic space of a problem. While it is normally assumed that the semantic space of a GP system for a problem with n data samples (or fitness cases) is ℜ^n, our results indicate that this is not the case. Randomly generated GP trees are confined to specific regions of semantic space, such that data points lying outside these regions can be detected as outliers [24], [34], [35].
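To make the notion of semantic space concrete: the semantics of a tree on a training set with n fitness cases is simply its vector of n outputs, i.e. a point in ℜ^n. A minimal illustrative sketch, assuming nothing about the paper's actual GP implementation (the primitive set and tree generator below are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
N_FEATURES = 3

def pdiv(a, b):
    """Protected division, a common GP primitive: returns 1 where |b| is tiny."""
    out = np.ones_like(a, dtype=float)
    np.divide(a, b, out=out, where=np.abs(b) > 1e-6)
    return out

BINARY_OPS = [np.add, np.subtract, np.multiply, pdiv]

def random_tree(depth):
    """Grow a random syntax tree; leaves are input variables or constants.
    Each tree is returned as a callable X -> vector of n outputs."""
    if depth == 0 or rng.random() < 0.3:
        if rng.random() < 0.5:
            i = int(rng.integers(0, N_FEATURES))
            return lambda X: X[:, i]
        c = rng.uniform(-1.0, 1.0)
        return lambda X: np.full(X.shape[0], c)
    op = BINARY_OPS[rng.integers(0, len(BINARY_OPS))]
    left, right = random_tree(depth - 1), random_tree(depth - 1)
    return lambda X: op(left(X), right(X))

X = rng.normal(size=(50, N_FEATURES))  # 50 fitness cases
tree = random_tree(depth=4)
semantics = tree(X)  # the tree's point in semantic space: a vector in R^50
```

Sampling many such trees and inspecting where their semantic vectors fall gives an empirical picture of the confined regions discussed above.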

Based on the above discovery, the second contribution of this paper is to propose a new algorithm to filter outliers from a contaminated dataset, which we call Semantic Outlier Automatic Preprocessing (SOAP). The proposed algorithm is noteworthy for two reasons. The first reason is performance. Using a comprehensive evaluation process, results show that SOAP can remove outliers from multi-dimensional real-world problems that are contaminated by as much as 90% of outliers. It must be stressed that no other method can deal with such a scenario in an automatic manner. The second reason is simplicity. SOAP only requires a population (as small as 100 individuals) of randomly generated GP trees to determine which instances in a dataset are outliers and which are not. After removing the outliers, the dataset can then be modeled by any regression algorithm. If GP is used, the filtering process can be obtained for free using the initial population of the evolutionary process.

The remainder of this paper proceeds as follows. Section 2 provides an overview of basic concepts on robust regression and outlier detection. Section 3 reviews related work on outlier detection with evolutionary methods. Section 4 studies how a random population of GP trees responds to a contaminated dataset, using five real-world problems, and discusses the possible implications of these results. Then, Section 5 presents the SOAP algorithm and describes how it can be used in a real-world setting, contrasting it with related studies. A detailed discussion of the results and proposed methods is presented in Section 6. Finally, Section 7 presents conclusions and outlines future work.

Section snippets

Outliers

All regression methods are heavily influenced by anomalies in the training dataset [36]. These anomalies are referred to as outliers, and can be present in the input variables (also called horizontal outliers), the output variable (also called vertical outliers), or in both. Outliers can be generated by several causes, such as human error, equipment malfunction, extreme random noise or missing data [6], [18], [36]. When outliers are rare, it is possible to define them as data points that…
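Although the definition above is truncated here, the notion of vertical contamination can be illustrated concretely. A minimal sketch (our own synthetic setup, not the paper's experimental protocol) that corrupts only target values, beyond the 50% breakdown point:

```python
import numpy as np

rng = np.random.default_rng(42)

# Inlier data: a smooth relation with mild noise.
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0.0, 0.05, size=200)

# Vertical contamination: corrupt only the *target* values of 70% of the
# instances; the inputs X are left untouched.
contamination = 0.7
out_idx = rng.choice(len(y), size=int(contamination * len(y)), replace=False)
y[out_idx] = rng.uniform(y.min() - 10.0, y.max() + 10.0, size=len(out_idx))

is_outlier = np.zeros(len(y), dtype=bool)
is_outlier[out_idx] = True  # ground truth, useful for scoring a filter
```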

Related work

As stated in Section 2, [26] presents several results that are relevant to robust regression in GP. First, that work showed that both LMS and LTS are applicable to GP, and that, at least empirically, their breakdown point can be confirmed for symbolic regression. Second, given the general usefulness of sampling and subset selection of training instances for robust regression [19], the work also tested the applicability of sampling techniques in GP, such as interleaved sampling [14] and Lexicase…
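For reference, the LMS and LTS objectives mentioned above are simple to state: LMS minimizes the median of the squared residuals, while LTS minimizes the sum of the h smallest squared residuals. A minimal sketch of both as drop-in GP fitness functions (helper names are ours):

```python
import numpy as np

def lms_fitness(y_true, y_pred):
    """Least Median of Squares: the median of the squared residuals."""
    return float(np.median((y_true - y_pred) ** 2))

def lts_fitness(y_true, y_pred, h=None):
    """Least Trimmed Squares: the sum of the h smallest squared residuals.
    With the usual h of roughly n/2, up to ~50% of the residuals are
    ignored, which is why the breakdown point sits at 50%."""
    r2 = np.sort((y_true - y_pred) ** 2)
    if h is None:
        h = (len(r2) + 1) // 2
    return float(r2[:h].sum())
```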

Response of GP trees to outliers

The goal of this section is to analyze how GP (syntax) trees respond to inliers and outliers in the training set; in other words, to analyze whether there is any difference between the ability of GP trees to model inliers as opposed to outliers. Even a small number of outliers skews a GP search when fitness is determined by a standard error measure such as the root mean squared error (RMSE) [26]. The search produces very poor models, with high training and testing errors. This section presents a…
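A toy example (our own numbers, not taken from [26]) shows how a quadratic error measure lets a single corrupted target value dominate fitness:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 100)
y = 2.0 * x          # true relation
y[0] = 1000.0        # a single vertical outlier

pred_true = 2.0 * x                      # the correct model
pred_const = np.full_like(x, y.mean())   # a useless constant model

def rmse(pred):
    return np.sqrt(np.mean((y - pred) ** 2))

print(rmse(pred_true))   # ~100.0: fitness dominated by the one bad point
print(rmse(pred_const))  # ~99.4: the constant model now looks slightly better
```

Under RMSE, the search has no incentive to prefer the correct model over a trivial one, even with 1% contamination.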

Proposed outlier filter

Given the above results, a new approach for removing vertical outliers is proposed. The method is called Semantic Outlier Automatic Preprocessing, or SOAP, and is summarized in Algorithm 1. The algorithm proceeds as follows. First, a random set P of GP trees is generated and their fitness is computed. Kernel density estimation is then performed over these fitness values, and the peak density value g* is computed. Then, we select a subset of individuals P′, such that the value of the estimated density function g(f(k)) for all…
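Since the description above is truncated here, the following is only a sketch of the pipeline as we read it: the density threshold and the per-instance scoring rule are illustrative assumptions of ours, not the published Algorithm 1 (consult the full text for the exact rules):

```python
import numpy as np
from scipy.stats import gaussian_kde

def soap_filter(trees, X, y, density_frac=0.5, inlier_quantile=0.5):
    """Illustrative SOAP-style filter. `trees` are random GP trees, each a
    callable X -> predictions (e.g. from the random_tree sketch above)."""
    preds = np.stack([t(X) for t in trees])               # shape (|P|, n)
    fitness = np.sqrt(np.mean((preds - y) ** 2, axis=1))  # RMSE of each tree

    # KDE over the fitness values; keep the subset P' of trees whose
    # density g(f(k)) is close to the peak g* (threshold is our choice).
    g = gaussian_kde(fitness)(fitness)
    selected = g >= density_frac * g.max()

    # Score each instance by its median absolute residual across P'.
    # Instances that random trees fit poorly are flagged as outliers;
    # the quantile cutoff below is illustrative, not the paper's rule.
    score = np.median(np.abs(preds[selected] - y), axis=0)
    return score <= np.quantile(score, inlier_quantile)   # True = inlier
```

A population as small as 100 random trees suffices according to the paper, and if GP is the downstream learner, these trees double as the initial population of the evolutionary run.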

Discussion and open issues

There are three general strategies to deal with outliers automatically. The first approach is to use a regression process to build a model while excluding the outliers. This approach is taken by most of the robust techniques, such as LMS or LTS, since the determination of which points are outliers depends on obtaining the residuals from a fitted model. The second approach is to use a filtering process, such as the Hampel identifier. Finally, the third approach is to use a sampling process, such…
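For reference, the Hampel identifier mentioned above flags any point lying more than k robust standard deviations from the median, with the standard deviation estimated as 1.4826 times the median absolute deviation (MAD). A minimal sketch of this classic rule:

```python
import numpy as np

def hampel_identifier(y, k=3.0):
    """Classic Hampel identifier: flag points more than k robust standard
    deviations from the median, with sigma estimated as 1.4826 * MAD.
    Like other filters, it breaks down once outliers exceed ~50%."""
    med = np.median(y)
    mad = np.median(np.abs(y - med))
    return np.abs(y - med) > k * 1.4826 * mad  # True where y[i] is flagged
```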

Conclusion

This paper studies the response of random GP trees, or more generally of random syntax trees, to outliers in a training set. The first contribution was to show that random GP trees respond differently to inliers than to outliers. In particular, they are able to fit inlier data points better than outliers. It is hypothesized that the GP trees are exploiting an underlying regularity in the input and target variables; i.e., that semantic space cannot be sampled randomly, but instead that GP trees reside…

CRediT authorship contribution statement

Leonardo Trujillo: Conceptualization, Methodology, Supervision, Funding acquisition, Writing - original draft, Resources, Project administration. Uriel López: Conceptualization, Investigation, Methodology, Software, Validation, Visualization, Writing - review & editing, Formal analysis. Pierrick Legrand: Formal analysis, Writing - review & editing, Funding acquisition, Supervision, Project administration.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This research was funded by CONACYT (Mexico) Fronteras de la Ciencia 2015-2 Project No. FC-2015-2/944, and the second author is supported by CONACYT graduate scholarship No. 573397. Funding was also provided by the FP7-Marie Curie-IRSES 2013 European Commission program through project ACoBSEC, contract No. 612689.

References (51)

  • I.C. Yeh, Modeling of strength of high-performance concrete using artificial neural networks (1998)
  • A.W. Bowman et al., Applied Smoothing Techniques for Data Analysis (1997)
  • A. Alfons et al., Sparse least trimmed squares regression for analyzing high-dimensional large data sets, Ann. Appl. Stat. (2013)
  • L. Beadle et al., Semantic analysis of program initialisation in genetic programming, Genet. Program. Evolvable Mach. (2009)
  • D. Bertsimas, R. Mazumder, Least quantile regression via modern optimization (2013), arXiv
  • V. Chandola et al., Anomaly detection: a survey, ACM Comput. Surv. (2009)
  • Z. Chen et al., Evolutionary multi-objective optimization based ensemble autoencoders for image outlier detection, Neurocomputing (2018)
  • M.A. Fischler et al., Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM (1981)
  • F.A. Fortin et al., DEAP: evolutionary algorithms made easy, J. Mach. Learn. Res. (2012)
  • D. Fustes et al., SOM ensemble for unsupervised outlier analysis. Application to outlier identification in the Gaia astronomical survey, Expert Syst. Appl. (2013)
  • I. Goodfellow et al., Deep Learning (2016)
  • R.I. Hartley et al., Multiple View Geometry in Computer Vision, 2nd edition (2004)
  • A. Hast et al., Optimal RANSAC - towards a repeatable algorithm for finding the optimal set, J. WSCG (2013)
  • V.J. Hodge et al., A survey of outlier detection methodologies, Artif. Intell. Rev. (2004)
  • M. Hubert et al., High-breakdown robust multivariate methods, Stat. Sci. (2008)