Elsevier

Environmental Modelling & Software

Volume 64, February 2015, Pages 156-163

Selection of significant input variables for time series forecasting

https://doi.org/10.1016/j.envsoft.2014.11.018

Highlights

  • Comparative evaluation of two model-free input selection methods and one model-based method.

  • Four synthetic and two real datasets are used for the comparative evaluation.

  • Model-free techniques: partial linear correlation (PLC) and partial mutual information.

  • Model-based technique based on genetic programming (GP).

  • Inputs selected by both PLC and GP are recommended as the significant inputs.

Abstract

Appropriate selection of inputs for time series forecasting models is important because it not only has the potential to improve the performance of forecasting models, but also helps reduce the cost of data collection. This paper investigates the selection performance of three input selection techniques: two model-free techniques, partial linear correlation (PLC) and partial mutual information (PMI), and a model-based technique based on genetic programming (GP). Four hypothetical datasets and two real datasets were used to demonstrate the performance of the three techniques. The results suggest that two techniques can be recommended for the input selection task: the model-free PLC technique, owing to its computational simplicity, and the model-based GP technique, owing to its ability to detect non-linear relationships (demonstrated by its relatively good performance on a hypothetical complex non-linear dataset). Candidate inputs that are selected by both of these recommended techniques should be considered significant inputs.

Introduction

Time series data are a collection of observations made sequentially in time, and their past values are used to forecast future values. Time series forecasting provides useful information for decision-making activities in operational planning and management (Makridakis et al., 2003). Examples of time series forecasting applications are numerous and include electrical load forecasting (Hahn et al., 2009), water quality forecasting (Muttil and Chau, 2006) and hydrological forecasting (Wang et al., 2009, Abbot and Marohasy, 2014).

Selection of significant input variables is an important step in the development of time series forecasting models and is defined as the task of ‘appropriately’ selecting a subset S of s significant input variables from a pool set C of c candidate input variables (May et al., 2008). The appropriateness of selection is in general decided on the basis of model performance. On the one hand, when S is under-specified, the model performs poorly because the selected variables do not fully describe the behaviour of the modelled system. On the other hand, when S is over-specified and therefore contains irrelevant or redundant input variables, the model again performs poorly because these variables add extra noise that reduces the accuracy of the model. Investigations have clearly shown that prediction results degrade when the input variables are correlated with one another (Alexandridis et al., 2005). In addition, irrelevant variables can cloud true relationships that exist between important variables. Irrelevant or redundant variables also increase the complexity and the number of parameters of the model. Furthermore, the inclusion of redundant input variables can increase the time and cost of data collection (Austin and Tu, 2004).

Input selection can broadly be divided into two approaches, namely, model-free and model-based (Fernando et al., 2009, Alves Da Silva et al., 2008). The major difference between the two approaches is that the model-free approach, unlike the model-based approach, does not depend on the model structure or the model calibration (Alves Da Silva et al., 2008). Instead, the model-free approach considers the statistical relationship, in the form of linear or non-linear correlation, between model inputs and model output. Pearson's correlation is the simplest model-free technique and can detect linear correlation. Partial mutual information has been developed as a model-free technique that can detect both linear and non-linear correlation (May et al., 2008). Both techniques (i.e. Pearson's correlation and partial mutual information) select the significant input variables using one or more statistical tests to avoid selection by chance. Input variables that have a statistically significant correlation with the model output can be selected as significant input variables for model development.
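As a rough illustration of model-free screening, the sketch below computes Pearson's correlation between each candidate input and the output and keeps the candidates whose correlation exceeds an approximate 95% bound of 1.96/sqrt(n). This is plain Pearson correlation, not the partial linear correlation or partial mutual information procedures evaluated in the paper. The function names, the synthetic data and the choice of bound are assumptions made for the example only.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def screen_inputs(candidates, target, z_crit=1.96):
    """Keep candidates whose |r| exceeds an approximate 95% bound for r = 0.

    The bound z_crit / sqrt(n) is a common large-sample approximation and is
    an assumption of this sketch, not a test taken from the paper.
    """
    bound = z_crit / math.sqrt(len(target))
    return [name for name, col in candidates.items()
            if abs(pearson(col, target)) > bound]


# Hypothetical data: four candidate inputs, of which only x1 drives the output.
random.seed(0)
n = 700
candidates = {f"x{j}": [random.gauss(0.0, 1.0) for _ in range(n)] for j in range(4)}
target = [2.0 * a + random.gauss(0.0, 0.1) for a in candidates["x1"]]

selected = screen_inputs(candidates, target)
print(selected)
```

Because each irrelevant candidate still has roughly a 5% chance of passing the bound, a screen of this kind can occasionally select an input by chance, which is the problem the statistical tests mentioned above are meant to limit.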

The model-based approach, on the other hand, considers candidate input variables to be significant if they contribute more to one or more modelling performance indicators than the other input variables do (Galelli and Castelletti, 2013). Two commonly used indicators are the mean square error (MSE) between observed and forecasted outputs (Li and Peng, 2007) and the coefficient of determination (R²) (Austin and Tu, 2004). To obtain the relative contribution of each input variable, selection procedures such as forward, backward, and stepwise selection (Austin and Tu, 2004, Muttil and Chau, 2007) and optimization selection (Llobet et al., 2004) are commonly used in the model-based input selection approach.
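The two performance indicators named above can be computed as in the following minimal sketch. It assumes the common definition of the coefficient of determination as one minus the ratio of the residual sum of squares to the total sum of squares; the paper may use a different variant, such as the squared correlation. The function names and sample values are hypothetical.

```python
def mse(observed, forecast):
    """Mean square error between observed and forecasted values."""
    return sum((o - f) ** 2 for o, f in zip(observed, forecast)) / len(observed)


def r_squared(observed, forecast):
    """Coefficient of determination, assumed here to be 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, forecast))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical observed and forecasted values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0]
forecast = [1.1, 1.9, 3.2, 3.8]

print(mse(observed, forecast), r_squared(observed, forecast))
```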

Forward selection starts with no input variables in the model, adds input variables one at a time, and keeps an input variable in the model if it passes some test for statistical significance. Alternatively, instead of using statistical tests, individual input variables can be ranked according to the change in model performance that they cause during the forward selection process. Input variables with a high ranking can then be selected as significant inputs. Backward selection works similarly to forward selection but in the reverse direction: it starts with all candidate input variables in the model and removes one input variable at a time. Stepwise selection is a combination of forward and backward selection in which, at any step, input variables are subject to inclusion in or exclusion from the model, based on their performance on some tests for statistical significance. An alternative to stepwise selection is optimization selection, in which significant input variables are selected by an automatic search process that optimizes (i.e. minimises or maximises) some indicators of model performance such as MSE and R².
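The forward selection procedure described above can be sketched as a greedy loop. The example below is an illustration under stated assumptions rather than the procedure used in any of the cited studies: it uses an ordinary least-squares linear model, scores each trial by in-sample MSE, and stops when the best remaining candidate reduces the MSE by less than a hypothetical 5% threshold instead of applying a formal significance test.

```python
import numpy as np


def fit_mse(X, y, cols):
    """In-sample MSE of a least-squares fit of y on the chosen columns plus an intercept."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return float(np.mean((y - A @ coef) ** 2))


def forward_selection(X, y, min_rel_gain=0.05):
    """Greedy forward selection; min_rel_gain is a hypothetical stopping threshold."""
    selected = []
    remaining = list(range(X.shape[1]))
    best_mse = fit_mse(X, y, selected)  # intercept-only model
    while remaining:
        # Try adding each remaining candidate and record the resulting MSE.
        trial = {c: fit_mse(X, y, selected + [c]) for c in remaining}
        best_c = min(trial, key=trial.get)
        # Stop when the best candidate no longer improves the fit enough.
        if best_mse - trial[best_c] < min_rel_gain * best_mse:
            break
        selected.append(best_c)
        remaining.remove(best_c)
        best_mse = trial[best_c]
    return selected


# Hypothetical data: five candidates, with the output driven by columns 0 and 3.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=500)

selected = forward_selection(X, y)
print(selected)
```

The order in which inputs enter the model gives the ranking mentioned above. Backward selection would follow the same pattern, starting from all columns and removing the one whose removal increases the MSE least.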

It can be seen that the model-based approach is biased in the sense that the input variables are selected on the basis of prescribed performance indicators such as MSE and R². These indicators can be inherently subjective (Li and Peng, 2007). However, Alves Da Silva et al. (2008) argued that because the model-based technique selects significant inputs according to modelling performance, a model using these inputs is more likely to perform well when applied to unseen test data.

A literature survey conducted as part of this study found few studies that comparatively evaluate model-based and model-free input selection methods. Galelli et al. (2014) proposed a performance evaluation framework, including a quantitative metric and several qualitative criteria, for input variable selection techniques. They compared the performance of model-based and model-free methods for input selection on 26 synthetic datasets. However, their study did not perform such an analysis on real datasets, nor did it evaluate the benefit of the selected inputs for forecasting performance. On the other hand, Muttil and Chau (2007) compared two model-based input selection methods (artificial neural network and genetic programming) and showed that both methods produced similar results. Therefore, the contribution of this study is a performance-based evaluation of popularly used input selection techniques for use in environmental systems, demonstrated through the application of three input selection techniques to four hypothetical and two real datasets. The first two techniques are model-free techniques based on the linear correlation function (Tsay, 2005) and the partial mutual information technique (May et al., 2008), while the third is a model-based technique based on genetic programming (Muttil and Chau, 2007). The aim of this investigation is to test the ability of these techniques to select the correct significant inputs. The forecasting performance of the inputs selected by the different techniques was also evaluated.

Datasets

The first four datasets are hypothetical time series datasets and the last two are real-world time series datasets. Hypothetical datasets were used in this study for benchmarking and analysis since the relationships between inputs and outputs, and the structure of the data generating models were known. Therefore, the use of hypothetical datasets allows appropriate evaluation of the performance of the input selection techniques. The real datasets were used to illustrate the application of the

Methods and techniques

This section briefly describes the three input selection techniques, namely, partial linear correlation (PLC), partial mutual information (PMI), and genetic programming (GP) techniques used in this study. For a detailed description of these techniques, the reader is referred to citations presented in each of the following sub-sections.

Results and discussion

The results reported in this section were based on the following four assumptions:

  • i) For each of the four hypothetical datasets, 1000 data points were generated but only the last 700 data points were used for identifying significant inputs, to avoid initialization effects. Initialization effects arise from the choice of initial values for generating the time series; these values are often selected randomly and can therefore produce a pattern that differs from the true pattern of the time series data.
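The practice of discarding the early part of a generated series can be sketched as follows. The data-generating models of the four hypothetical datasets are not given in this excerpt, so the example assumes a simple first-order autoregressive process with a hypothetical coefficient of 0.7; only the counts of 1000 generated and 700 retained points come from the text.

```python
import random


def simulate_ar1(n_total=1000, n_keep=700, phi=0.7, seed=0):
    """Generate an AR(1) series and keep only the last n_keep points.

    The AR(1) form and phi are assumptions for illustration. Dropping the
    first n_total - n_keep points removes the influence of the arbitrary
    starting value.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0)]  # randomly chosen initial value
    for _ in range(1, n_total):
        x.append(phi * x[-1] + rng.gauss(0.0, 1.0))
    return x[-n_keep:]


series = simulate_ar1()
print(len(series))
```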

Discussion

The relatively better performance of GP models observed in this study could be attributed to its evolutionary search domain, which consists of all polynomials of any form over the input variables and constants. This search domain allows the detection of complex non-linear relationships between the model output and the input variables, as well as all possible interactions between the input variables (Jayawardena et al., 2005). The results from the PMI technique are comparable to those of the GP technique, although it has a

Conclusions

This paper presents an investigation of the performance of three input variable selection techniques, which include two model-free techniques based on PLC and PMI and one model-based technique using GP. Four hypothetical datasets, in which the true significant inputs and the underlying data-generating processes were known, were used to test the selection performance of the three techniques. Two real datasets were used to show how the three input selection techniques can be applied to real

Acknowledgement

The authors would like to thank Dr. Robert May for providing the software tool to implement the PMI input selection technique.
