SOAP: Semantic outliers automatic preprocessing
Introduction
Genetic programming (GP) has proven to be a powerful modeling technique for regression tasks, particularly when compact and interpretable models are sought [47]. However, like any other data-driven technique, the quality of the final solution depends on the characteristics of the data used to train, or in this case evolve, a model. Even the most sophisticated algorithm cannot produce an accurate model if the input features (predictors) are completely unrelated to the output variable (target). A difficult learning task is encountered when the training dataset is corrupted by outliers, which can severely impede the learning process.
There are several approaches to deal with outliers, but all of them are limited in scope. For instance, a robust objective function can be used to guide a regression algorithm, such as Least Median Squares [13]. However, these functions cannot deal with more than 50% of outliers in the training dataset; in such cases a sampling technique is required, such as Random Sampling Consensus (RANSAC) [10], [26]. A shortcoming of RANSAC is that it relies on extracting multiple samples from the training set over several iterations, and the number of required samples grows exponentially with the percentage of outliers in the dataset. There are also filtering techniques for time series data [36]. However, filtering techniques also break down when the number of outliers exceeds 50% of the total samples, and they are difficult to apply in multi-dimensional problems. For this reason, many authors simply recommend visual inspection and manual removal of outliers. Indeed, there are few options to deal with such extreme cases of outlier contamination [6], [18], [36]. It is important to note, moreover, that such cases are not merely theoretical; they can appear in real-world scenarios. For instance, RANSAC is widely used in computer vision [10], and has also been used to build Quantitative Structure Activity Relationship (QSAR) models [21] and to analyze microbial metabolomics data [44]. Other studies that have dealt with datasets containing many outliers can be found in areas such as complex non-linear systems [50], on-line process monitoring [8], astronomy [12] and multi-target regression [9].
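The exponential growth in the number of required RANSAC samples can be illustrated with the standard iteration-count formula, N = log(1 - p) / log(1 - (1 - ε)^m), where ε is the outlier fraction, m the minimal sample size, and p the desired confidence of drawing at least one outlier-free sample. The sketch below uses an illustrative sample size and confidence level, not parameters from the paper:

```python
import math

def ransac_trials(outlier_frac, sample_size, confidence=0.99):
    """Iterations needed so that, with probability `confidence`,
    at least one minimal sample contains only inliers."""
    w = 1.0 - outlier_frac            # probability a drawn point is an inlier
    p_clean = w ** sample_size        # probability a whole sample is clean
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_clean))

# Required iterations explode as contamination grows (sample size 4):
for eps in (0.2, 0.5, 0.7, 0.9):
    print(eps, ransac_trials(eps, 4))
```

At 90% contamination the count is three orders of magnitude larger than at 50%, which is the practical obstacle the text refers to.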
Automatically dealing with large amounts of outliers could allow us to develop new applications and technologies. For instance, such methods can be used to rethink the design and use of sensors. Large amounts of resources are invested in developing accurate and reliable sensors, under the assumption that measurements must be very reliable if we want to understand and model a phenomenon. It may be possible, however, to leverage a powerful outlier detection algorithm to design low-cost sensor networks that can be completely wrong most of the time and still provide enough of a signal to model a particular system or process. Such methods could also be used in the development of systems that interact with humans, such as recommendation systems [5], [37], [43]. These types of systems attempt to measure some form of human behavior, which is prone to spontaneity and error. The difficulty in adequately modeling human actions or responses can be partially attributed to the prevalence of outlier behavior by humans in unstructured environments. Therefore, modeling techniques that can account for, and possibly detect, outliers can be useful in this domain.
In the case of supervised learning with GP, dealing with data contamination by outliers has received insufficient research attention [22], [26], [30], [46]. In fact, literature on this topic is scarce, and only our previous works have studied how to handle more than 50% of outliers in the output variable, by using RANSAC [26] or a preliminary version of the proposal in the present work [25] that was only evaluated on very simple univariate synthetic problems. In [26], it is shown that even a small number of outliers can severely degrade the accuracy of symbolic regression with GP. However, the relation between random GP trees and outliers has not been studied in detail.
The goal of this paper is twofold. The first is to characterize the behavior of GP trees by evaluating their ability to model inlier and outlier data, an aspect of GP trees that has not been studied before. The second is, based on this analysis, to propose and develop an outlier detection algorithm. The study focuses on regression problems and on outliers in the target variable.
Therefore, the first contribution of this paper is to reveal a previously unknown, and unexpected, property of randomly generated GP trees, or in general of random syntax trees. It is empirically shown that a large proportion of randomly generated GP trees can be used to identify which instances in a training set are outliers and which are not. This appears to be an intrinsic property of randomly generated GP individuals, since it holds across several multi-dimensional problems and over many contamination levels, reaching up to 90% of outliers in the training set in some cases. We provide a conceptual hypothesis to explain this property, based on the nature of the space of all possible program outputs given a particular training set, also referred to as the semantic space of a problem. While it is normally assumed that the semantic space of a GP system for a problem with n data samples (or fitness cases) is ℝ^n, our results indicate that this is not the case. Randomly generated GP trees are confined to specific regions of semantic space, such that data points lying outside these regions can be detected as outliers [24], [34], [35].
Based on the above discovery, the second contribution of this paper is to propose a new algorithm to filter outliers from a contaminated dataset, which we call Semantic Outlier Automatic Preprocessing (SOAP). The proposed algorithm is noteworthy for two reasons. The first reason is performance. Using a comprehensive evaluation process, results show that SOAP can remove outliers from multi-dimensional real-world problems that are contaminated by as much as 90% of outliers. It must be stressed that no other method can deal with such a scenario in an automatic manner. The second reason is simplicity. SOAP only requires a population (as small as 100 individuals) of randomly generated GP trees to determine which instances in a dataset are outliers and which are not. After removing the outliers, the dataset can then be modeled by any regression algorithm. If GP is used, the filtering process can be obtained for free using the initial population of the evolutionary process.
The remainder of this paper proceeds as follows. Section 2 provides an overview of basic concepts on robust regression and outlier detection. Section 3 reviews related work on outlier detection with evolutionary methods. Section 4 studies how a random population of GP trees responds to a contaminated dataset, using five real-world problems, and discusses the possible implications of these results. Then, Section 5 presents the SOAP algorithm and describes how it can be used in a real-world setting, contrasting it with related studies. A detailed discussion of the results and proposed methods is presented in Section 6. Finally, Section 7 presents conclusions and outlines future work.
Section snippets
Outliers
All regression methods are heavily influenced by anomalies in the training dataset [36]. These anomalies are referred to as outliers, and can be present in the input variables (also called horizontal outliers), the output variable (also called vertical outliers), or in both. Outliers can be generated by several causes, such as human error, equipment malfunction, extreme random noise or missing data [6], [18], [36]. When outliers are rare, then it is possible to define them as data points that
Related work
As stated in Section 2, [26] presents several results that are relevant to robust regression in GP. First, that work showed that both LMS and LTS are applicable to GP, and that, at least empirically, their breakdown point can be confirmed for symbolic regression. Second, given the general usefulness of sampling and subset selection of training instances for robust regression [19], the work also tested the applicability of sampling techniques in GP, such as interleaved sampling [14] and Lexicase
Response of GP trees to outliers
The goal of this section is to analyze how GP (syntax) trees respond to inliers and outliers in the training set. In other words, to analyze if there is any difference between the ability of GP trees to model inliers as opposed to outliers. Even a small number of outliers skews a GP search when fitness is determined by a standard error measure such as the root mean squared error (RMSE) [26]. The search produces very poor models, with high training and testing errors. This section presents a
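A toy version of this analysis can be reproduced in a few lines of code. The setup below is purely illustrative and not from the paper: the tree grammar over {+, -, *}, the target function, and the contamination scheme are our own choices. It evaluates many random syntax trees on a univariate dataset whose targets have been shifted far away for a few outlier instances, and compares the mean absolute residual accumulated by each instance:

```python
import random

def random_tree(depth=3):
    """Grow a random syntax tree over {+, -, *}, variable x and constants."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(['x', round(random.uniform(-1, 1), 3)])
    return (random.choice('+-*'), random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, x):
    if tree == 'x':
        return x
    if not isinstance(tree, tuple):
        return tree                     # a numeric constant leaf
    op, left, right = tree
    a, b = evaluate(left, x), evaluate(right, x)
    return a + b if op == '+' else a - b if op == '-' else a * b

random.seed(0)
xs = [i / 10 - 1 for i in range(21)]    # 21 points in [-1, 1]
ys = [x * x for x in xs]                # inlier target: y = x^2
outliers = {2, 7, 13, 19}               # inject 4 vertical outliers
ys = [y + 50 if i in outliers else y for i, y in enumerate(ys)]

trees = [random_tree() for _ in range(200)]
mean_res = [sum(abs(evaluate(t, x) - y) for t in trees) / len(trees)
            for x, y in zip(xs, ys)]

# Every outlier instance accumulates a far larger residual than any inlier,
# because the outputs of random trees stay confined to a bounded region.
worst_inlier = max(r for i, r in enumerate(mean_res) if i not in outliers)
best_outlier = min(r for i, r in enumerate(mean_res) if i in outliers)
print(worst_inlier < best_outlier)
```

In this toy construction the separation is easy to see analytically: with leaves bounded by 1 and at most three levels of operators, every tree output lies in [-8, 8], so outlier residuals dwarf inlier residuals for every tree.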
Proposed outlier filter
Given the above results, a new approach for removing vertical outliers is proposed. The method is called Semantic Outlier Automatic Preprocessing, or SOAP, which is summarized in Algorithm 1. The algorithm proceeds as follows. First, a random set P of GP trees is generated and their fitness values are computed. Kernel density estimation is then performed over these fitness values, and the peak value g* is computed. Then, we select a subset of individuals P′, such that the value of the estimated density function g(f(k)) for all
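The steps described above can be sketched as follows. This is not the authors' implementation: the tree grammar, the KDE bandwidth, the density threshold, and in particular the per-instance voting rule applied after the subset selection are all illustrative assumptions made for this sketch:

```python
import math, random

def random_tree(depth=3):
    """Random syntax tree over {+, -, *}, variable x and constants."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(['x', round(random.uniform(-1, 1), 3)])
    return (random.choice('+-*'), random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, x):
    if tree == 'x':
        return x
    if not isinstance(tree, tuple):
        return tree
    op, l, r = tree
    a, b = evaluate(l, x), evaluate(r, x)
    return a + b if op == '+' else a - b if op == '-' else a * b

def rmse(tree, xs, ys):
    return math.sqrt(sum((evaluate(tree, x) - y) ** 2
                         for x, y in zip(xs, ys)) / len(xs))

def soap_filter(xs, ys, pop_size=100, density_frac=0.5):
    """Illustrative SOAP-style filter: random trees -> fitness KDE ->
    subset near the density peak -> per-instance residual vote."""
    trees = [random_tree() for _ in range(pop_size)]
    fit = [rmse(t, xs, ys) for t in trees]
    # Simple Gaussian KDE over fitness values (Silverman-style bandwidth).
    mean_f = sum(fit) / len(fit)
    std = (sum((f - mean_f) ** 2 for f in fit) / len(fit)) ** 0.5
    h = max(1.06 * std * len(fit) ** -0.2, 1e-6)
    def g(t):
        return sum(math.exp(-0.5 * ((t - f) / h) ** 2) for f in fit)
    dens = [g(f) for f in fit]
    peak = max(dens)                     # g*, the density peak
    # Keep trees whose fitness lies in a high-density region.
    subset = [t for t, d in zip(trees, dens) if d >= density_frac * peak]
    # Assumed voting rule: an instance is flagged if, for most selected
    # trees, its residual exceeds that tree's mean residual.
    votes = [0] * len(xs)
    for t in subset:
        res = [abs(evaluate(t, x) - y) for x, y in zip(xs, ys)]
        mean_r = sum(res) / len(res)
        for i, r in enumerate(res):
            votes[i] += r > mean_r
    return {i for i, v in enumerate(votes) if v > len(subset) / 2}
```

After filtering, the retained instances can be passed to any regression algorithm; as the text notes, if GP is used for the regression itself, the random population can double as the initial generation of the evolutionary run.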
Discussion and open issues
There are three general strategies to deal with outliers automatically. The first approach is to use a regression process to build a model while excluding the outliers. This approach is taken by most of the robust techniques, such as LMS or LTS, since the determination of which points are outliers depends on obtaining the residuals from a fitted model. The second approach is to use a filtering process, such as the Hampel identifier. Finally, the third approach is to use a sampling process, such
Conclusion
This paper studies the response of random GP trees, or more generally of random syntax trees, to outliers in a training set. The first contribution was to show that random GP trees respond differently to inliers than outliers. Particularly, they are able to fit inlier data points better than outliers. It is hypothesized that the GP trees are exploiting an underlying regularity in the input and target variables; i.e. that semantic space cannot be sampled randomly but instead that GP trees reside
CRediT authorship contribution statement
Leonardo Trujillo: Conceptualization, Methodology, Supervision, Funding acquisition, Writing - original draft, Resources, Project administration. Uriel López: Conceptualization, Investigation, Methodology, Software, Validation, Visualization, Writing - review & editing, Formal analysis. Pierrick Legrand: Formal analysis, Writing - review & editing, Funding acquisition, Supervision, Project administration.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgment
This research was funded by CONACYT (Mexico) Fronteras de la Ciencia 2015-2 Project No. FC-2015-2/944, and the second author is supported by CONACYT graduate scholarship No. 573397. Funding was also provided by the FP7-Marie Curie-IRSES 2013 European Commission program through project ACoBSEC with contract No. 612689.
References (51)
- et al., Recommender systems survey, Knowl. Based Syst. (2013)
- et al., Exploring process data with the use of robust outlier detection algorithms, J. Process Control (2003)
- et al., Outlier robust extreme machine learning for multi-target regression, Expert Syst. Appl. (2020)
- et al., Least trimmed squares regression, least median squares regression, and mathematical programming, Math. Comput. Model. (2002)
- et al., Balancing learning and overfitting in genetic programming with interleaved sampling of training data
- et al., Competent geometric semantic genetic programming for symbolic regression and boolean function synthesis, Evol. Comput. (2018)
- et al., Random sample consensus combined with partial least squares regression (RANSAC-PLS) for microbial metabolomics data mining and phenotype improvement, J. Biosci. Bioeng. (2016)
- Neat genetic programming: controlling bloat naturally, Inf. Sci. (2016)
- et al., Accurate quantitative estimation of energy performance of residential buildings using statistical machine learning tools, Energy Build. (2012)
- et al., Detecting outliers for complex nonlinear systems with dynamic ensemble learning, Chaos Solitons Fractals (2019)
- Modeling of strength of high-performance concrete using artificial neural networks
- Applied Smoothing Techniques for Data Analysis
- Sparse least trimmed squares regression for analyzing high-dimensional large data sets, Ann. Appl. Stat.
- Semantic analysis of program initialisation in genetic programming, Genet. Program. Evolvable Mach.
- Anomaly detection: a survey, ACM Comput. Surv.
- Evolutionary multi-objective optimization based ensemble autoencoders for image outlier detection, Neurocomputing
- Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography, Commun. ACM
- DEAP: evolutionary algorithms made easy, J. Mach. Learn. Res.
- SOM ensemble for unsupervised outlier analysis. Application to outlier identification in the Gaia astronomical survey, Expert Syst. Appl.
- Deep Learning
- Multiple View Geometry in Computer Vision, 2nd Edition
- Optimal RANSAC-towards a repeatable algorithm for finding the optimal set, J. WSC
- A survey of outlier detection methodologies, Artif. Intell. Rev.
- High-breakdown robust multivariate methods, Stat. Sci.