Selection of significant input variables for time series forecasting
Introduction
Time series data are a collection of observations made sequentially in time, and their past values are used for time series forecasting. Time series forecasting provides useful information for decision-making activities in operational planning and management (Makridakis et al., 2003). Examples of time series forecasting applications are numerous, including electrical load forecasting (Hahn et al., 2009), water quality forecasting (Muttil and Chau, 2006) and hydrological forecasting (Wang et al., 2009, Abbot and Marohasy, 2014).
Selection of significant input variables is an important step in the development of time series forecasting models and is defined as the task of ‘appropriately’ selecting a subset S of s significant input variables from a pool C of c candidate input variables (May et al., 2008). The appropriateness of the selection is generally judged on the basis of model performance. On the one hand, when S is under-specified, model performance is poor because the selected variables do not fully describe the behaviour of the modelled system. On the other hand, when S is over-specified with irrelevant or redundant input variables, model performance again deteriorates because these variables add extra noise that reduces the accuracy of the model. Investigations have clearly shown that prediction results degrade when correlation between input variables is present (Alexandridis et al., 2005). In addition, irrelevant variables can obscure true relationships between important variables, and irrelevant or redundant variables increase the complexity and the number of parameters of the model. Furthermore, the inclusion of redundant input variables can increase the time and cost of data collection (Austin and Tu, 2004).
Input selection can be broadly divided into two approaches: model-free and model-based (Fernando et al., 2009, Alves Da Silva et al., 2008). The major difference between the two is that the model-free approach does not depend on model structure and model calibration, as the model-based approach does (Alves Da Silva et al., 2008). Instead, the model-free approach considers the statistical relationship, in the form of linear or non-linear correlation, between model inputs and model output. Pearson's correlation is the simplest model-free technique and can detect linear correlation. Partial mutual information has been developed as a model-free technique that can detect both linear and non-linear correlation (May et al., 2008). Both techniques select the significant input variables based on one or more statistical tests to avoid selection by chance: input variables that have a statistically significant correlation with the model output are selected as significant input variables for model development.
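As a minimal sketch of the model-free idea (function and variable names are ours, not from the paper), the following Python snippet ranks lagged copies of a series by their Pearson correlation with the current value, keeping lags whose correlation exceeds the approximate 95% significance bound of 1.96/√n for autocorrelation under a white-noise null:

```python
import numpy as np

def pearson_lag_selection(series, max_lag=10, threshold=None):
    """Select lags whose Pearson correlation with the current value
    exceeds a significance threshold (1.96/sqrt(n) by default)."""
    n = len(series)
    if threshold is None:
        threshold = 1.96 / np.sqrt(n)  # approximate 95% bound under white noise
    selected = []
    for lag in range(1, max_lag + 1):
        x, y = series[:-lag], series[lag:]
        r = np.corrcoef(x, y)[0, 1]
        if abs(r) > threshold:
            selected.append((lag, r))
    return selected

# Illustrative AR(1) series: only lag 1 drives the process.
rng = np.random.default_rng(0)
s = np.zeros(1000)
for t in range(1, 1000):
    s[t] = 0.8 * s[t - 1] + rng.standard_normal()
print(pearson_lag_selection(s, max_lag=5))
```

Note that for this AR(1) example all five lags pass the test, since the autocorrelation decays only geometrically (0.8^k): linear correlation flags lags that are merely redundant with lag 1, which is precisely the limitation that conditioning techniques such as partial mutual information address.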
The model-based approach, on the other hand, considers candidate input variables as significant if they contribute relatively more to one or more of the modelling performance indicators than the other input variables (Galelli and Castelletti, 2013). Two such indicators that are commonly used are the mean square error (MSE) between observed and forecasted outputs (Li and Peng, 2007) and the coefficient of determination (R2) (Austin and Tu, 2004). To obtain the relative contribution of each input variable, selection procedures including forward, backward and stepwise selection (Austin and Tu, 2004, Muttil and Chau, 2007) and optimization-based selection (Llobet et al., 2004) are commonly used in the model-based input selection approach.
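The two indicators above are straightforward to compute; a minimal sketch in Python (function names are ours):

```python
import numpy as np

def mse(y_obs, y_fc):
    """Mean square error between observed and forecasted outputs."""
    y_obs, y_fc = np.asarray(y_obs, float), np.asarray(y_fc, float)
    return np.mean((y_obs - y_fc) ** 2)

def r2(y_obs, y_fc):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_obs, y_fc = np.asarray(y_obs, float), np.asarray(y_fc, float)
    ss_res = np.sum((y_obs - y_fc) ** 2)
    ss_tot = np.sum((y_obs - y_obs.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
print(mse(y, y), r2(y, y))            # perfect forecast: 0.0, 1.0
print(r2(y, np.full(4, y.mean())))    # mean-only forecast: 0.0
```

A forecast no better than the observed mean scores R2 = 0, which is why R2 is a convenient relative indicator when comparing candidate input subsets.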
Forward selection starts with no input variables in the model, adds input variables one at a time, and keeps an input variable in the model if it passes some test of statistical significance. Alternatively, instead of using statistical tests, individual input variables can be ranked according to the change in model performance they cause during the forward selection process, and highly ranked variables selected as significant inputs. Backward selection works similarly but in the reverse direction, starting with all candidate input variables in the model and removing one input variable at a time. Stepwise selection is a combination of forward and backward selection in which, at any step, input variables are subject to inclusion in or exclusion from the model based on their performance on some tests of statistical significance. An alternative to stepwise selection is optimization-based selection, in which significant input variables are selected by an automatic search process that optimizes (i.e. minimises or maximises) some indicator of model performance such as MSE or R2.
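Forward selection by performance ranking can be sketched as follows; an ordinary least-squares model is used purely for illustration (the model-based technique evaluated in this paper is genetic programming, not linear regression), and the stopping tolerance is an assumption of ours:

```python
import numpy as np

def mse_of_subset(X, y, cols):
    """MSE of an ordinary-least-squares fit using only the given columns."""
    if not cols:
        return np.mean((y - y.mean()) ** 2)  # no inputs: predict the mean
    A = np.column_stack([X[:, cols], np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return np.mean((y - A @ coef) ** 2)

def forward_selection(X, y, min_improvement=1e-3):
    """Greedy forward selection: repeatedly add the candidate input that
    most reduces MSE; stop when no addition improves MSE enough."""
    selected, remaining = [], list(range(X.shape[1]))
    best_mse = mse_of_subset(X, y, selected)
    while remaining:
        trial_mse, j = min((mse_of_subset(X, y, selected + [j]), j)
                           for j in remaining)
        if best_mse - trial_mse < min_improvement:
            break
        selected.append(j)
        remaining.remove(j)
        best_mse = trial_mse
    return selected

# Synthetic check: only columns 0 and 3 actually drive the output.
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 5))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.1 * rng.standard_normal(500)
print(forward_selection(X, y))  # expected: columns 0 and 3
```

Backward selection follows the mirror-image loop (start with all columns, drop the one whose removal least degrades MSE), and stepwise selection interleaves the two moves.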
The model-based approach is biased in the sense that the input variables are selected against prescribed modelling performance measures such as MSE and R2, and these measures can be inherently subjective (Li and Peng, 2007). However, Alves Da Silva et al. (2008) argued that because the model-based technique selects significant inputs based on such performance measures, it is more likely to achieve good modelling outcomes when the model using these inputs is applied to unseen test data.
A literature survey conducted as part of this study identified that there are few studies on the comparative evaluation of model-based and model-free input selection methods. Galelli et al. (2014) proposed a performance evaluation framework, including a quantitative metric and several qualitative criteria, for input variable selection techniques, and compared the performance of model-based and model-free methods on 26 synthetic datasets. However, their study did not perform such an analysis on real datasets, nor did it evaluate the benefit of the selected inputs for forecasting performance. Muttil and Chau (2007) compared two model-based input selection methods (artificial neural network and genetic programming) and showed that both produced similar results. The contribution of this study is therefore a performance-based evaluation of popular input selection techniques for environmental systems, demonstrated through the application of three input selection techniques to four hypothetical and two real datasets. The first two are model-free techniques based on the linear correlation function (Tsay, 2005) and the partial mutual information technique (May et al., 2008), while the third is a model-based technique using genetic programming (Muttil and Chau, 2007). The aim of this investigation is to test the ability of these techniques to select the correct significant inputs. The forecasting performance of the inputs selected by the different techniques was also evaluated.
Datasets
The first four datasets are hypothetical time series datasets and the last two are real-world time series datasets. Hypothetical datasets were used for benchmarking and analysis because the relationships between inputs and outputs, and the structure of the data-generating models, were known. The use of hypothetical datasets therefore allows an appropriate evaluation of the performance of the input selection techniques. The real datasets were used to illustrate the application of the input selection techniques.
Methods and techniques
This section briefly describes the three input selection techniques used in this study, namely the partial linear correlation (PLC), partial mutual information (PMI) and genetic programming (GP) techniques. For a detailed description of these techniques, the reader is referred to the citations given in the following sub-sections.
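PLC, PMI and GP are detailed in the cited references; as a flavour of why a mutual-information measure helps where linear correlation fails, the following simplified histogram MI estimator (our illustrative stand-in for a single PMI step, which additionally conditions on already-selected inputs) detects the dependence in y = x², a relationship with near-zero Pearson correlation:

```python
import numpy as np

def mutual_information(x, y, bins=16):
    """Histogram estimate of mutual information I(X;Y) in nats.
    Simplified illustration only; the PMI procedure of May et al. (2008)
    also conditions on previously selected inputs."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy /= pxy.sum()                       # joint probability table
    px = pxy.sum(axis=1, keepdims=True)    # marginal of X
    py = pxy.sum(axis=0, keepdims=True)    # marginal of Y
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

rng = np.random.default_rng(2)
x = rng.standard_normal(5000)
noise = rng.standard_normal(5000)
mi_dep = mutual_information(x, x ** 2)   # strong non-linear dependence
mi_ind = mutual_information(x, noise)    # independent: near zero (small bias)
print(mi_dep, mi_ind)
print(np.corrcoef(x, x ** 2)[0, 1])      # Pearson r: near zero despite dependence
```

The contrast between the large MI and the near-zero Pearson correlation for y = x² is exactly the non-linear case that motivates PMI over PLC.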
Results and discussion
The results reported in this section were based on the following four assumptions:
- i) For each of the 4 hypothetical datasets, 1000 data points were generated but only the last 700 data points were used for identifying significant inputs, to avoid initialization effects. These effects arise from the choice of initial values used to generate the time series, which are often selected randomly and thus may produce a pattern different from the true pattern of the time series data.
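This burn-in step can be sketched as follows, using an illustrative AR(2) generator (the coefficients, noise, and function name here are ours, not the paper's hypothetical models):

```python
import numpy as np

def generate_ar2(n_total=1000, n_keep=700, phi1=0.5, phi2=0.3, seed=0):
    """Generate an AR(2) series of n_total points from random initial
    values and keep only the last n_keep points, discarding the burn-in
    period that is contaminated by the arbitrary initialization."""
    rng = np.random.default_rng(seed)
    s = np.empty(n_total)
    s[0], s[1] = rng.standard_normal(2)  # randomly chosen initial values
    for t in range(2, n_total):
        s[t] = phi1 * s[t - 1] + phi2 * s[t - 2] + rng.standard_normal()
    return s[-n_keep:]

series = generate_ar2()
print(len(series))  # 700
```

Discarding the first 300 points ensures the retained segment reflects the stationary behaviour of the process rather than the transient induced by the initial values.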
Discussion
The relatively better performance of the GP models observed in this study could stem from its evolutionary search domain, which comprises all polynomials of any form over the input variables and constants. This search domain allows the detection of complex non-linear relationships between model output and input variables, and of all possible interactions between the input variables (Jayawardena et al., 2005). The results from the PMI technique are comparable to those of the GP technique, although it has a
Conclusions
This paper presents an investigation of the performance of three input variable selection techniques: two model-free techniques based on PLC and PMI, and one model-based technique using GP. Four hypothetical datasets, in which the true significant inputs and the underlying data-generating processes were known, were used to test the selection performance of the three techniques. Two real datasets were used to show how the three input selection techniques can be applied to real datasets.
Acknowledgement
The authors would like to thank Dr. Robert May for providing the software tool to implement the PMI input selection technique.
References
- et al. (2014). Input selection and optimisation for monthly rainfall forecasting in Queensland, Australia, using artificial neural networks. Atmos. Res.
- et al. (2005). A two-stage evolutionary algorithm for variable selection in the development of RBF neural network models. Chemom. Intell. Lab. Syst.
- et al. (2008). Input space to neural network based load forecasters. Int. J. Forecast.
- et al. (2004). Automated variable selection methods for logistic regression produced unstable models for predicting acute myocardial infarction mortality. J. Clin. Epidemiol.
- et al. (2008). Jordan recurrent neural network versus IHACRES in modelling daily streamflows. J. Hydrol.
- et al. (2009). Selection of input variables for data driven models: an average shifted histogram partial mutual information estimator approach. J. Hydrol.
- et al. (2014). An evaluation framework for input variable selection algorithms for environmental data-driven models. Environ. Model. Softw.
- et al. (2009). Electric load forecasting methods: tools for decision making. Eur. J. Oper. Res.
- et al. (2007). Neural input selection – a fast model-based approach. Neurocomputing.
- et al. (2004). Building parsimonious fuzzy ARTMAP models by variable selection with a cascaded genetic algorithm: application to multisensor systems for gas analysis. Sensors Actuat. B: Chem.
- Non-linear variable selection for artificial neural networks using partial mutual information. Environ. Model. Softw.
- Machine-learning paradigms for selecting ecologically significant input factors. Eng. Appl. Artif. Intell.
- Seasonal to interannual rainfall probabilistic forecasts for improved water supply management: Part 1 – a strategy for system predictor identification. J. Hydrol.