Prediction of tropospheric ozone concentrations: Application of a methodology based on the Darwin’s Theory of Evolution
Introduction
Ozone (O3) is a strong photochemical oxidant present in different layers of the atmosphere. In the troposphere, this irritating and reactive gas has negative impacts on human health, climate, vegetation and materials (Bytnerowicz et al., 2006, Pires et al., 2008). Tropospheric ozone is the result of three basic processes: (i) photochemical production by the interaction of hydrocarbons and nitrogen oxides under the action of suitable ambient meteorological conditions (Guerra et al., 2004, Zolghadri et al., 2004); (ii) vertical transport of stratospheric air, rich in ozone, into the troposphere (Dueñas, Fernández, Cañete, Carretero, & Liger, 2002); and (iii) horizontal transport due to the wind that brings O3 produced in other regions.
The formation of O3 is a complex, nonlinear process that varies in time and space. Accordingly, several studies have presented different statistical approaches to predict O3 concentrations (Al-Alawi et al., 2008, Coman et al., 2008, Omidvari et al., 2008, Pires et al., 2008a, Sousa et al., 2007, Sousa et al., 2006, Sousa et al., 2009), including linear and nonlinear models. The linear models applied in the literature were: (i) multiple linear regression (MLR); (ii) principal component regression (PCR); (iii) quantile regression; and (iv) time series. The most common nonlinear model was the artificial neural network (ANN). The selection of a model must consider features such as complexity, flexibility, accuracy and speed of computation (Pires, Martins, Sousa, Alvim-Ferraz, & Pereira, 2008b). ANN models usually performed better than the linear ones (Al-Alawi et al., 2008, Sousa et al., 2006, Sousa et al., 2007) because of the nonlinear behaviour associated with O3 formation. However, they belong to the group of black-box models, which offer limited interpretability. Moreover, the selection of the optimal network architecture and the computation time are the main disadvantages of these models.
Besides the structure, the success of a statistical model depends on several factors: (i) the data size; (ii) the method used to optimize its parameters; (iii) the input variables; and (iv) the collinearity between the input variables. The collinearity between the input variables can be eliminated through the application of principal component analysis (PCA). It is mathematically defined as an orthogonal linear transformation that maps the original data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on (Pires, Sousa, et al., 2008). Thus, the principal components (PC) are orthogonal and uncorrelated with each other, being determined by linear combinations of the original variables. The directions of the new coordinate axes are given by the eigenvectors of the covariance matrix of the original variables. The magnitude of each eigenvalue corresponds to the variance of the data along the direction of its eigenvector. Varimax rotation is the most widely employed orthogonal rotation in PCA, because it tends to simplify the unrotated loadings, making the results easier to interpret. It simplifies the loadings by rigidly rotating the PC axes such that the variable projections (loadings) on each PC tend to be either high or low. After the rotation, the loadings show the relative contributions of the original variables to each PC.
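The eigen-decomposition described above can be illustrated with a minimal sketch in Python with NumPy. This is not the authors' code (the paper reports a Matlab implementation); the function name `pca` and the synthetic, deliberately collinear data are assumptions made for illustration, and the varimax rotation step is not included.

```python
import numpy as np

def pca(X):
    """PCA via eigen-decomposition of the covariance matrix.

    X is an (n_samples, n_variables) array. Returns the eigenvalues,
    the eigenvectors (as columns) and the scores, with components
    ordered by decreasing variance.
    """
    Xc = X - X.mean(axis=0)                 # centre each variable
    cov = np.cov(Xc, rowvar=False)          # covariance matrix of the variables
    eigvals, eigvecs = np.linalg.eigh(cov)  # symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]       # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = Xc @ eigvecs                   # project the data onto the new axes
    return eigvals, eigvecs, scores

# Synthetic, deliberately collinear data (NOT the measurements used in the study).
rng = np.random.default_rng(0)
a = rng.normal(size=500)
X = np.column_stack([a, 2.0 * a + 0.1 * rng.normal(size=500), rng.normal(size=500)])

eigvals, eigvecs, scores = pca(X)
score_cov = np.cov(scores, rowvar=False)    # diagonal: the PC are uncorrelated
```

In this sketch the covariance matrix of the scores is diagonal, with the eigenvalues on the diagonal, which is the sense in which PCA removes the collinearity between the inputs.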
As many factors can influence the performance of models, their development should have more degrees of freedom. For the models referred to above, the structure is fixed in advance and only the parameters are optimized. In stochastic processes, such as the prediction of O3 concentrations, the structure of the models should be more flexible. In this context, genetic programming (GP) could be a successful methodology, as it does not assume any model structure in advance. GP can optimize both the structure of the model and its parameters simultaneously. To the authors' knowledge, no study has been published applying GP to predict air pollutant concentrations. This study aims to predict the next day hourly average O3 concentrations by applying GP to the original input variables and to their corresponding principal components.
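To make the idea of an unfixed model structure concrete, the following Python sketch represents candidate models as expression trees, a common GP representation. It covers only tree generation and evaluation; selection, crossover and mutation are omitted. The function set, the 0.3 terminal probability and all names (`random_tree`, `evaluate`, `depth`) are hypothetical choices for illustration, not the settings used in the paper.

```python
import operator
import random

# Hypothetical function set; the set used in the paper is not given in this excerpt.
FUNCTIONS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}

def random_tree(n_inputs, max_depth, rng):
    """Grow a random expression tree whose shape is not fixed in advance."""
    if max_depth == 1 or rng.random() < 0.3:
        # Terminal node: either an input variable or a numeric constant.
        if rng.random() < 0.5:
            return ("x", rng.randrange(n_inputs))
        return ("const", rng.uniform(-1.0, 1.0))
    name = rng.choice(sorted(FUNCTIONS))
    return (name,
            random_tree(n_inputs, max_depth - 1, rng),
            random_tree(n_inputs, max_depth - 1, rng))

def evaluate(tree, x):
    """Evaluate a tree for one input vector x."""
    if tree[0] == "x":
        return x[tree[1]]
    if tree[0] == "const":
        return tree[1]
    return FUNCTIONS[tree[0]](evaluate(tree[1], x), evaluate(tree[2], x))

def depth(tree):
    """Number of levels in the tree."""
    if tree[0] in ("x", "const"):
        return 1
    return 1 + max(depth(tree[1]), depth(tree[2]))

rng = random.Random(42)
population = [random_tree(n_inputs=3, max_depth=4, rng=rng) for _ in range(5)]

# A hand-written individual encoding the model 2*x0 + x1.
example = ("add", ("mul", ("x", 0), ("const", 2.0)), ("x", 1))
```

Each individual in `population` can have a different shape and different constants, so a search over such trees varies the model structure and its parameters together.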
Section snippets
Genetic programming
Genetic programming (GP) is an artificial intelligence methodology that uses principles of Darwin’s Theory of Evolution. Its search strategy is based on genetic algorithms (GA), introduced by John Holland in the 1960s (Goldberg, 1989). GA use bit strings as chromosomes and are commonly applied in function optimization. This algorithm has several disadvantages; for example, the length of the strings is static (Koza, 1992). Additionally, the size and the shape of the model, solution of a given
Data
The inputs of the GP models were the hourly averages of air pollutant concentrations and meteorological variables measured 24 h before. The atmospheric concentrations of carbon monoxide (CO), nitrogen oxide (NO), nitrogen dioxide (NO2) and O3 were collected at an urban site (Antas) with traffic influences, situated in Oporto, Northern Portugal. This site belongs to the air quality monitoring network of Oporto Metropolitan Area that is managed by the Regional Commission of Coordination and
Results and discussion
The GP procedure was coded by the authors in Matlab. Table 1 shows the main control parameters of GP. The tree size is defined as the number of levels in the tree. For example, in Fig. 1, the tree size is 4. The fittest individuals are those that presented the lowest errors in the training step. As the results obtained by the GP method are probabilistic, several runs should be made before drawing conclusions. In this study, four different runs were done using 3, 4 and 5 populations at
Conclusions
To predict the next day hourly average O3 concentrations, GP was applied using the OV and their PC as inputs. This methodology was able to select the relevant variables. When GP was applied to the original variables, T, RH and O3 were considered significant inputs for prediction. On the other hand, when applied to the PC, the selected ones had important contributions from the same variables and also from NO2. GP models using the OV presented better performance in the training period and worse
Acknowledgements
The authors are grateful to Comissão de Coordenação da Direcção Regional-Norte and to Instituto Geofísico da Universidade do Porto for kindly providing the air quality and meteorological data. This work was supported by Fundação para a Ciência e Tecnologia (FCT). J.C.M. Pires also thanks the FCT for the fellowship SFRH/BD/23302/2005.
References (20)
- et al. (2008). Combining principal component regression and artificial neural networks for more accurate predictions of ground-level ozone. Environmental Modelling and Software.
- et al. (2008). Dynamic proportion portfolio insurance using genetic programming with principal component analysis. Expert Systems with Applications.
- et al. (2008). Hourly ozone prediction for a 24-h horizon using neural networks. Environmental Modelling and Software.
- et al. (2002). Assessment of ozone variations and meteorological effects in an urban area in the Mediterranean Coast. Science of the Total Environment.
- et al. (2004). Adaptive genetic programming for steady-state process modelling. Computers and Chemical Engineering.
- et al. (2004). Study on the formation and transport of ozone in relation to the air quality management and vegetation protection in Tenerife (Canary Islands). Chemosphere.
- et al. (2008). Time series analysis of ozone data in Isfahan. Physica A: Statistical Mechanics and its Applications.
- et al. (1997). Parallel genetic programming and its application to trading model induction. Parallel Computing.
- et al. (2004). Genetic programming techniques for hand written digit recognition. Signal Processing.
- et al. (2008). Management of air quality monitoring using principal component and cluster analysis–Part II: CO, NO2 and O3. Atmospheric Environment.