Prediction of tropospheric ozone concentrations: Application of a methodology based on the Darwin’s Theory of Evolution

doi:10.1016/j.eswa.2010.07.122

Expert Systems with Applications

Volume 38, Issue 3, March 2011, Pages 1903-1908

https://doi.org/10.1016/j.eswa.2010.07.122

Abstract

This study aims to predict the next-day hourly average tropospheric ozone (O₃) concentrations using genetic programming (GP). Given the complexity of this problem, GP is an adequate methodology because it can optimize the structure of the model and its parameters simultaneously. It is an artificial intelligence methodology that follows the principles of the Darwinian Theory of Evolution. GP automatically generates mathematical expressions, which are then modified through an iterative process that applies genetic operations.

The inputs of the models were the hourly average concentrations of carbon monoxide (CO), nitrogen oxide (NO), nitrogen dioxide (NO₂) and O₃, and some meteorological variables (temperature – T; solar radiation – SR; relative humidity – RH; and wind speed – WS) measured 24 h before. GP was also applied to the principal components (PC) obtained from these variables. The analysed period, from May to July 2004, was divided into training and test periods.

GP was able to select the most relevant variables for the prediction of O₃ concentrations. Among the original variables, T, RH and O₃ measured 24 h before were considered significant inputs for prediction. The selected PC also had important contributions from the same variables and from NO₂. Compared with the models obtained using PC, GP models using the original variables performed better in the training period and worse in the test period. The results achieved with the GP methodology demonstrated that it can be very useful for solving complex environmental problems.

Introduction

Ozone (O₃) is a strong photochemical oxidant present in different layers of the atmosphere. In the troposphere, this irritating and reactive gas has negative impacts on human health, climate, vegetation and materials (Bytnerowicz et al., 2006, Pires et al., 2008). Tropospheric ozone is the result of three basic processes: (i) photochemical production by the interaction of hydrocarbons and nitrogen oxides under the action of suitable ambient meteorological conditions (Guerra et al., 2004, Zolghadri et al., 2004); (ii) vertical transport of stratospheric air, rich in ozone, into the troposphere (Dueñas, Fernández, Cañete, Carretero, & Liger, 2002); and (iii) horizontal transport due to the wind that brings O₃ produced in other regions.

The formation of O₃ is a complex, nonlinear process that varies in time and space. Accordingly, several studies have presented different statistical approaches to predict O₃ concentrations (Al-Alawi et al., 2008, Coman et al., 2008, Omidvari et al., 2008, Pires et al., 2008a, Sousa et al., 2007, Sousa et al., 2006, Sousa et al., 2009), including linear and nonlinear models. The linear models applied in the literature were: (i) multiple linear regression (MLR); (ii) principal component regression (PCR); (iii) quantile regression; and (iv) time series. The most common nonlinear model was the artificial neural network (ANN). The selection of a model must consider features such as complexity, flexibility, accuracy and speed of computation (Pires, Martins, Sousa, Alvim-Ferraz, & Pereira, 2008b). ANN models usually performed better than the linear ones (Al-Alawi et al., 2008, Sousa et al., 2006, Sousa et al., 2007) because of the nonlinear behaviour associated with O₃ formation. However, they are black box models and therefore offer limited interpretability. Moreover, the selection of the optimal network architecture and the computation time are the main disadvantages of these models.

Besides the structure, the success of a statistical model depends on several factors: (i) the data size; (ii) the method used to optimize its parameters; (iii) the input variables; and (iv) the collinearity between the input variables. The collinearity between the input variables can be eliminated through the application of principal component analysis (PCA). It is mathematically defined as an orthogonal linear transformation that maps the original data to a new coordinate system such that the greatest variance by any projection of the data lies on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on (Pires, Sousa, et al., 2008). Thus, the principal components (PC) are orthogonal and uncorrelated with each other, and are determined by linear combinations of the original variables. The directions of the new coordinate axes are given by the eigenvectors of the covariance matrix of the original variables. The magnitude of each eigenvalue corresponds to the variance of the data along the direction of its eigenvector. Varimax rotation is the most widely employed orthogonal rotation in PCA, because it tends to simplify the unrotated loadings and so makes the results easier to interpret. It simplifies the loadings by rigidly rotating the PC axes such that the variable projections (loadings) on each PC tend to be either high or low. After the rotation, the loadings show the relative contributions of the original variables to each PC.
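The relation between eigenvectors, eigenvalues and variance described above can be checked numerically. The following is a minimal sketch in plain Python, assuming a small hypothetical two-variable data set; the helper names (covariance_matrix, first_principal_component) and the use of power iteration to find the leading eigenvector are illustrative choices, not the procedure used in the study. Varimax rotation is not covered here.

```python
import math

def covariance_matrix(rows):
    """Return the sample covariance matrix and the mean-centred rows."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centred) / (n - 1) for j in range(p)]
           for i in range(p)]
    return cov, centred

def first_principal_component(cov, iterations=500):
    """Leading eigenvector and eigenvalue of a symmetric matrix (power iteration)."""
    p = len(cov)
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^T C v gives the eigenvalue for the unit vector v.
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return v, eigenvalue

# Hypothetical toy data: two strongly correlated variables.
data = [[2.0, 1.9], [4.0, 4.2], [6.0, 5.8], [8.0, 8.1], [10.0, 10.0]]
cov, centred = covariance_matrix(data)
direction, eigenvalue = first_principal_component(cov)

# Project the centred data on the first PC and measure the variance there.
scores = [sum(c[j] * direction[j] for j in range(len(direction))) for c in centred]
score_variance = sum(s * s for s in scores) / (len(scores) - 1)
print(direction, eigenvalue, score_variance)
```

In this sketch the variance of the projected scores matches the leading eigenvalue, and that eigenvalue is at least as large as the variance of either original variable, which is the property the first principal component is defined by.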

As many factors can influence the performance of models, their development should have more degrees of freedom. For the models mentioned above, the structure is fixed in advance and only the parameters are optimized. In stochastic processes, such as the prediction of O₃ concentrations, the structure of the models should be more flexible. In this context, genetic programming (GP) could be a successful methodology, as it does not assume any structure for the model in advance. GP can optimize both the structure of the model and its parameters simultaneously. To the authors' knowledge, no study applying GP to the prediction of air pollutant concentrations has been published. This study aims to predict the next-day hourly average O₃ concentrations by applying GP to the original input variables and to their corresponding principal components.

Section snippets

Genetic programming

Genetic programming (GP) is an artificial intelligence methodology that uses the principles of Darwin’s Theory of Evolution. Its search strategy is based on genetic algorithms (GA), introduced by John Holland in the 1960s (Goldberg, 1989). GA use bit strings as chromosomes and are commonly applied in function optimization. This approach has several disadvantages; for example, the length of the strings is static (Koza, 1992). Additionally, the size and the shape of the model, solution of a given
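To illustrate how GP differs from fixed-length bit strings, the sketch below represents a candidate model as an expression tree whose size and shape can change. It is a hypothetical illustration only: the nested-tuple encoding, the function set (+, -, *), the example expression and the subtree mutation operator are assumptions made for this example, not the authors' implementation. The variable names T, RH and O3 are borrowed from the inputs listed in the abstract.

```python
import operator
import random

# Assumed function set for the illustration.
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(tree, variables):
    """Evaluate a tree: a number, a variable name, or (op, left, right)."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return FUNCTIONS[op](evaluate(left, variables), evaluate(right, variables))
    if isinstance(tree, str):
        return variables[tree]
    return tree

def random_tree(rng, names, depth):
    """Grow a random tree with at most `depth` levels."""
    if depth <= 1 or rng.random() < 0.3:
        if rng.random() < 0.5:
            return rng.choice(names)
        return round(rng.uniform(-1, 1), 2)
    op = rng.choice(sorted(FUNCTIONS))
    return (op,
            random_tree(rng, names, depth - 1),
            random_tree(rng, names, depth - 1))

def mutate(tree, rng, names, depth=3):
    """Replace one randomly chosen subtree with a freshly grown one."""
    if not isinstance(tree, tuple) or rng.random() < 0.3:
        return random_tree(rng, names, depth)
    op, left, right = tree
    if rng.random() < 0.5:
        return (op, mutate(left, rng, names, depth), right)
    return (op, left, mutate(right, rng, names, depth))

rng = random.Random(0)
names = ["T", "RH", "O3"]
inputs = {"T": 20.0, "RH": 60.0, "O3": 50.0}

# Hypothetical parent model: 0.5 * O3 + (T - RH)
parent = ("+", ("*", 0.5, "O3"), ("-", "T", "RH"))
child = mutate(parent, rng, names)
value = evaluate(parent, inputs)
print(value, child, evaluate(child, inputs))
```

Because mutation swaps whole subtrees, the child can be larger or smaller than its parent, which is the flexibility that a static bit string lacks.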

Data

The inputs of the GP models were the hourly averages of air pollutant concentrations and meteorological variables measured 24 h before. The atmospheric concentrations of carbon monoxide (CO), nitrogen oxide (NO), nitrogen dioxide (NO₂) and O₃ were collected at an urban site (Antas) with traffic influences, situated in Oporto, Northern Portugal. This site belongs to the air quality monitoring network of Oporto Metropolitan Area that is managed by the Regional Commission of Coordination and
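Pairing each hourly O₃ value with the measurements taken 24 h earlier can be sketched as follows. This is a minimal illustration under stated assumptions: the record layout (one dictionary per hour), the function name lagged_dataset and the synthetic values are hypothetical and are not taken from the monitoring data.

```python
def lagged_dataset(records, target="O3", lag=24):
    """Pair the target at hour t with the full record observed at hour t - lag."""
    if lag <= 0 or lag >= len(records):
        raise ValueError("lag must be positive and shorter than the series")
    inputs = [records[t - lag] for t in range(lag, len(records))]
    targets = [records[t][target] for t in range(lag, len(records))]
    return inputs, targets

# Synthetic hourly records for three days (values are placeholders).
records = [{"O3": 40.0 + h, "T": 15.0 + 0.1 * h} for h in range(72)]
X, y = lagged_dataset(records)
print(len(X), X[0], y[0])
```

The first 24 hours have no earlier measurement to pair with, so a series of 72 hourly records yields 48 input/target pairs.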

Results and discussion

The GP procedure was coded by the authors in Matlab. Table 1 shows the main control parameters of GP. The tree size is defined as the number of levels in the tree. For example, in Fig. 1, the tree size is 4. The fittest individuals are the ones that presented the lowest errors in the training step. As the results obtained by the GP method are probabilistic, several runs should be made before drawing conclusions. In this study, four different runs were performed using 3, 4 and 5 populations at
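The two notions used above, tree size as the number of levels and fitness as the lowest training error, can be sketched in a few lines. The example is in Python rather than Matlab and is not the authors' code: the nested-tuple tree encoding, the choice of root mean square error as the error measure, the convention that a single leaf counts as one level, and the toy training data are all assumptions made for illustration.

```python
import math
import operator

FUNCTIONS = {"+": operator.add, "*": operator.mul}

def evaluate(tree, row):
    """Evaluate a tree: a number, a variable name, or (op, left, right)."""
    if isinstance(tree, tuple):
        op, left, right = tree
        return FUNCTIONS[op](evaluate(left, row), evaluate(right, row))
    if isinstance(tree, str):
        return row[tree]
    return tree

def tree_size(tree):
    """Number of levels in the tree (assumption: a lone leaf is one level)."""
    if isinstance(tree, tuple):
        return 1 + max(tree_size(child) for child in tree[1:])
    return 1

def rmse(tree, rows, targets):
    """Root mean square error of the tree's predictions on the training rows."""
    squared = [(evaluate(tree, r) - t) ** 2 for r, t in zip(rows, targets)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical training data and two candidate individuals.
rows = [{"O3": 40.0, "T": 18.0}, {"O3": 60.0, "T": 22.0}, {"O3": 80.0, "T": 26.0}]
targets = [50.0, 70.0, 90.0]
candidates = [("+", "O3", 10.0), ("*", "T", ("+", 2.0, 1.0))]

# The fittest individual is the one with the lowest training error.
fittest = min(candidates, key=lambda tree: rmse(tree, rows, targets))
print(fittest, tree_size(fittest), rmse(fittest, rows, targets))
```

Since GP runs are probabilistic, a selection like this would be repeated over several independent runs before comparing models.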

Conclusions

To predict the next-day hourly average O₃ concentrations, GP was applied using as inputs the original variables (OV) and their PC. This methodology was able to select the relevant variables. When GP was applied to the original variables, T, RH and O₃ were considered significant inputs for prediction. When it was applied to the PC, the selected ones had important contributions from the same variables and also from NO₂. GP models using the OV presented better performance in training period and worse

Acknowledgements

The authors are grateful to Comissão de Coordenação da Direcção Regional-Norte and to Instituto Geofísico da Universidade do Porto for kindly providing the air quality and meteorological data. This work was supported by Fundação para a Ciência e Tecnologia (FCT). J.C.M. Pires also thanks the FCT for the fellowship SFRH/BD/23302/2005.

References (20)

S.M. Al-Alawi et al. Combining principal component regression and artificial neural networks for more accurate predictions of ground-level ozone. Environmental Modelling and Software (2008)
J.S. Chen et al. Dynamic proportion portfolio insurance using genetic programming with principal component analysis. Expert Systems with Applications (2008)
A. Coman et al. Hourly ozone prediction for a 24-h horizon using neural networks. Environmental Modelling and Software (2008)
C. Dueñas et al. Assessment of ozone variations and meteorological effects in an urban area in the Mediterranean Coast. Science of the Total Environment (2002)
B. Grosman et al. Adaptive genetic programming for steady-state process modelling. Computers and Chemical Engineering (2004)
J.-C. Guerra et al. Study on the formation and transport of ozone in relation to the air quality management and vegetation protection in Tenerife (Canary Islands). Chemosphere (2004)
M. Omidvari et al. Time series analysis of ozone data in Isfahan. Physica A: Statistical Mechanics and its Applications (2008)
M. Oussaidène et al. Parallel genetic programming and its application to trading model induction. Parallel Computing (1997)
A.D. Parkins et al. Genetic programming techniques for hand written digit recognition. Signal Processing (2004)
J.C.M. Pires et al. Management of air quality monitoring using principal component and cluster analysis–Part II: CO, NO₂ and O₃. Atmospheric Environment (2008)


Cited by (12)

A novel dual-scale ensemble learning paradigm with error correction for predicting daily ozone concentration based on multi-decomposition process and intelligent algorithm optimization, and its application in heavily polluted regions of China
2022, Atmospheric Pollution Research
Citation Excerpt :
High concentrations of secondary pollutants are generated near the ground due to photochemical reactions of these precursors under specific meteorological conditions, of which ozone is a typical case (Guerra et al., 2004). Ozone is a widely distributed air pollutant, and excessive ambient ozone levels are a primary hallmark of photochemical pollution (Pires et al., 2011). While the Chinese government has made significant progress in controlling fine particle pollution, another daunting anti-pollution task lies ahead of the authorities as the problem of ozone-based photochemical smog has become increasingly severe in recent years.
Accurate prediction of daily ozone concentration is imperative to prevent and control photochemical pollution in China. A novel ensemble learning paradigm is proposed in this paper to perform high-precision ozone concentrations prediction, including model preprocess, optimized dual-scale prediction, and error correction. Firstly, the original ozone series is decomposed into two detailed sequences and two approximate sequences by wavelet packet decomposition (WPD). The detailed sequences are reconstructed into several new subsequences by complementary ensemble empirical mode decomposition (CEEMD) and fuzzy entropy (FE). Secondly, the extreme learning machine (ELM) and the support vector machine (SVM) are both optimized by the bald eagle search (BES) algorithm to carry out dual-scale parallel prediction. Finally, the error series is decomposed and predicted by the variational mode decomposition (VMD) and the optimized ELM to correct the previously predicted ozone and obtain the final ozone prediction. The actual ozone data from two typical cities in China are collected as the inputs of the model for empirical analysis under two scenarios, and one-day ahead prediction results show that: the RMSE values of the proposed model are 2.5319 and 3.2069 at Taiyuan and Shanghai sites when predicting low levels of ozone, respectively, while the RMSE values of the proposed model are 2.8451 and 3.8702 at two sites when predicting high levels of ozone, respectively; the proposed model outperforms the comparison models and four existing models under each scenario, demonstrating its robustness and serviceability. The results of this study can provide a valuable reference for analyzing the tendency of pollution.
Performance and emission characteristics of a CI engine using nano particles additives in biodiesel-diesel blends and modeling with GP approach
2017, Fuel
Citation Excerpt :
GP has been applied to a wide range of problems in artificial intelligence, engineering and science, chemical and biological processes and mechanical issues [18–22]. Pires, et al. [23] used GP method to predict the next day hourly average tropospheric ozone (O3) concentrations. The results showed very good agreement between predicted and measured data.
The performance and the exhaust emissions of a diesel engine operating on nano-diesel-biodiesel blended fuels has been investigated. Multi wall carbon nano tubes (CNT) (40, 80 and 120 ppm) and nano silver particles (40, 80 and 120 ppm) were produced and added as additive to the biodiesel-diesel blended fuel. Six cylinders, four-stroke diesel engine was fuelled with these new blended fuels and operated at different engine speeds. Experimental test results indicated the fact that adding nano particles to diesel and biodiesel fuels, increased diesel engine performance variables including engine power and torque output up to 2% and brake specific fuel consumption (bsfc) was decreased 7.08% compared to the net diesel fuel. CO₂ emission increased maximum 17.03% and CO emission in a biodiesel-diesel fuel with nano-particles was lower significantly (25.17%) compared to pure diesel fuel. UHC emission with silver nano-diesel-biodiesel blended fuel decreased (28.56%) while with fuels that contains CNT nano particles increased maximum 14.21%. With adding nano particles to the blended fuels, NOx increased 25.32% compared to the net diesel fuel. This study also presents genetic programming (GP) based model to predict the performance and emission parameters of a CI engine in terms of nano-fuels and engine speed. Experimental studies were completed to obtain training and testing data. The optimum models were selected according to statistical criteria of root mean square error (RMSE) and coefficient of determination (R²). It was observed that the GP model can predict engine performance and emission parameters with correlation coefficient (R²) in the range of 0.93–1 and RMSE was found to be near zero. The simulation results demonstrated that GP model is a good tool to predict the CI engine performance and emission parameters.
Prediction of 8h-average ozone concentration using a supervised hidden Markov model combined with generalized linear models
2013, Atmospheric Environment
An ozone prediction model based on a supervised hidden Markov model (HMM) and generalized linear models (GLMs) has been developed and tested on data from Livermore Valley, CA. Hidden states in the supervised HMM are assigned to represent different ozone concentration ranges which make the parameters of the supervised HMM easy to be explained. Using the Viterbi algorithm (VA), not only the most likely state of 8 h-average ozone concentrations but also the relative probabilities of different concentration ranges can be obtained. Then, GLMs corresponding to different ozone concentration ranges are used to quantitatively predict surface ozone levels. Using the relative probabilities and ozone levels predicted by GLMs, an ozone concentration value in the most likely concentration range can be finally determined. In this paper, data from 8 ozone seasons spanning 2000 to 2007 are used to build the prediction model and data from 2008 to 2009 are used for validation. The results show that this model can be used to predict all ozone exceedance days correctly. Compared to the generalized linear mixed effects model (GLMM), which is also used to model grouped data, the true prediction rate (TPR) of the proposed model is higher by 27%. Compared to the prediction results using the supervised HMM alone, the mean absolute error (MAE) of ozone exceedance days predicted by the proposed model is reduced by 72%.
Developing a predictive tropospheric ozone model for Tabriz
2013, Atmospheric Environment
Citation Excerpt :
In a study of Kaohsiung in Taiwan, a genetic algorithm-based model was developed by Tseng and Chang (2001) for assessing the relocation strategy of the urban air quality monitoring network with respect to the multi-objective and multi-pollutant design criteria. Pires et al. (2011) used genetic programming to predict the next day hourly average ozone concentrations in Oporto, Portugal, using hourly average concentrations of environmental variables (CO, NO, NO2) and meteorological variables (e.g. temperature, solar radiation) measured 24 h earlier. GEP was developed by Ferreira (2001a,b) but the authors are not aware of its past applications to modeling tropospheric ozone time series.
Predictive ozone models are becoming indispensable tools by providing a capability for pollution alerts to serve people who are vulnerable to the risks. We have developed a tropospheric ozone prediction capability for Tabriz, Iran, by using the following five modeling strategies: three regression-type methods: Multiple Linear Regression (MLR), Artificial Neural Networks (ANNs), and Gene Expression Programming (GEP); and two auto-regression-type models: Nonlinear Local Prediction (NLP) to implement chaos theory and Auto-Regressive Integrated Moving Average (ARIMA) models. The regression-type modeling strategies explain the data in terms of: temperature, solar radiation, dew point temperature, and wind speed, by regressing present ozone values to their past values. The ozone time series are available at various time intervals, including hourly intervals, from August 2010 to March 2011. The results for MLR, ANN and GEP models are not overly good but those produced by NLP and ARIMA are promising for the establishing a forecasting capability.
Correction methods for statistical models in tropospheric ozone forecasting
2011, Atmospheric Environment
Citation Excerpt :
The formation of O3 is a complex, non-linear, time and space varying process. Accordingly, several studies presented different statistical approaches to predict O3 concentrations (Yi and Prybutok, 1996; Spellman, 1999; Abdul-Wahab and Al-Alawi, 2002; Ballester et al., 2002; Wang et al., 2003; Baur et al., 2004; Corani, 2005; Gómez-Sanchis et al., 2006; Schlink et al., 2006; Wang and Lu, 2006; Al-Alawi et al., 2008; Coman et al., 2008; Omidvari et al., 2008; Pires et al., 2008b, 2010, 2011; Ortiz-García et al., 2010), including linear and non-linear models. The applied linear models found in the literature were: (i) multiple linear regression (MLR); (ii) principal component regression (PCR); (iii) independent component regression; (iv) quantile regression; (v) partial least squares regression; and (vi) time series.
This study proposes two methods to enhance the performance of statistical models for prediction tropospheric ozone concentrations. The first method corrects the statistical model based on the average daily profile of the model errors in training set. The second method estimates the model error by making the analogy with three basic modes of feedback control: proportional, integral and derivative. These correction methods were tested with multiple linear regression (MLR) and artificial neural networks (ANN) for prediction of hourly average tropospheric ozone (O₃) concentrations.
The inputs of the models were the hourly average concentrations of sulphur dioxide (SO₂), carbon monoxide (CO), nitrogen oxide (NO), nitrogen dioxide (NO₂) and O₃, and some meteorological variables (temperature – T; relative humidity – RH; and wind speed – WS) measured 24 h before. The analysed period was from May to June 2003 divided in training and test periods.
ANN presented slightly better performance than MLR model for prediction of O₃ concentrations. Both models presented improvements with the proposed correction methods. The first method achieved the highest improvements with ANN model. However, the second method was the one that obtained the best predictions of hourly average O₃ concentrations with the correction of MLR model.
Ozone concentration forecasting using statistical learning approaches
2017, Journal of Materials and Environmental Science
