Article

AQI Prediction Based on CEEMDAN-ARMA-LSTM

1
Institute of Quantitative and Technological Economics, Chinese Academy of Social Sciences, Beijing 100732, China
2
College of Quantitative and Technological Economics, University of Chinese Academy of Social Sciences, Beijing 102488, China
*
Author to whom correspondence should be addressed.
Sustainability 2022, 14(19), 12182; https://doi.org/10.3390/su141912182
Submission received: 18 August 2022 / Revised: 19 September 2022 / Accepted: 21 September 2022 / Published: 26 September 2022
(This article belongs to the Special Issue Aerosols and Air Pollution)

Abstract

In the context of carbon neutrality and air pollution prevention, high-accuracy prediction of the air quality index (AQI) is of great research significance. In this paper, Beijing is used as the study area; data from January 2014 to December 2019 are used as the training set, and data from January 2020 to December 2021 are used as the test set. The CEEMDAN-ARMA-LSTM model constructed in this paper is used for prediction and analysis. The CEEMDAN model is used to decompose the data so that more of the information in the series is used. The stationary non-white-noise components are fed into the ARMA model, and the remaining components and the residual are fed into the LSTM model. The results show that the MAE, MAPE, MSE, and RMSE of this model are the smallest. Compared with the CEEMDAN-LSTM, LSTM, and ARMA-GARCH models, MAE improved by 22.5%, 53.4%, and 21.5%, MAPE improved by 21.4%, 55.3%, and 26.1%, MSE improved by 39.9%, 76.9%, and 28.5%, and RMSE improved by 22.5%, 52.0%, and 15.4%. The accuracy improvement is significant, and the model has good application prospects.
Keywords:
CEEMDAN; ARMA-GARCH; LSTM; AQI

1. Introduction

In 2006, the World Health Organization (WHO) conducted air quality tests in many cities around the world, assessing the concentrations of three pollutants (NO2, SO2 and PM2.5) in urban air. The results showed that, among the cities tested in China, Beijing, Changsha, Shijiazhuang, Linfen and other cities had high pollution levels, which are likely to cause harm to humans and hinder economic development [1,2]. In recent years, many air pollution prevention and control initiatives have been launched at home and abroad to improve the situation [3].
At present, in the context of pollution prevention and control and carbon neutrality, high-precision prediction of the air quality index (AQI) is an important research topic with positive significance for urban development and national health. In recent years, air pollution, with PM2.5 as the main pollutant, has become increasingly severe, and hazy weather has appeared in most areas of China. Air quality monitoring in several cities across the country has led to severe pollution warnings, and air pollution has become a key environmental issue of social concern. The Beijing-Tianjin-Hebei region is a focus of national regional development, and accurate prediction of the AQI in this region is of great research significance for its green and sustainable development. The AQI reflects the dynamic trend of air pollution and provides data support for the implementation of specific measures to mitigate air pollution. However, because the AQI is stochastic and non-stationary, predictions often have low accuracy and poor stability. In addition, the atmosphere is a very complex dynamic system whose trend is easily affected by pollutant concentrations, a variety of meteorological factors and other factors, which makes it difficult to model [4,5]. Therefore, the accurate prediction of AQI is a challenging and important task.
Throughout the research, the main prediction models of time series are as follows.
(1) Traditional models: OLS, GM(1,1) [6], MM5-CAMx [7], ARMA and ARIMA [8]. Among them, ARMA is an important method for studying time series. It combines an autoregressive model (AR model for short) and a moving average model (MA model for short) and needs only endogenous variables, without the help of exogenous variables. Tan used the air quality monitoring data from 51 monitoring stations and meteorological data in Hubei Province throughout 2016 to model the PM2.5 concentration data of each city in Hubei Province using the ARMA method and the stepwise regression method [9]. Li et al. developed an ARMA prediction model using the data of 2340 hazardous materials accidents that occurred during road transportation in China from 2013 to 2019 [10]. Zhou used the ARIMA model to predict the grain yield in China with high accuracy [11]. These studies showed that the established ARMA-type models fit well and can predict the time series accurately.
(2) Machine learning models: SVM, BP and LSTM. The long short-term memory neural network, LSTM, was proposed by Hochreiter et al. in 1997. LSTM has a special recurrent structure that avoids the gradient problem and can therefore learn data sequences with a long time span. The LSTM neural network is particularly suitable for air quality prediction research because current AQI values are often correlated with historical AQI values, owing to the condensation and accumulation of air pollutants such as PM2.5 in the atmospheric environment. Zeng et al. studied air quality data in Beijing from 2018 to 2020. Based on a pollutant concentration correlation analysis, they developed a recurrent neural network model based on the LSTM algorithm to predict the Beijing AQI, and the model had a high prediction accuracy [12]. Yan et al. compared CNN, LSTM, and CNN-LSTM for multi-hour and multi-site AQI forecasting in Beijing. The results of the study indicate that LSTM is the best model for multi-hour forecasting [13]. However, the accuracy of LSTM depends on a large amount of data, and the model is a “black box” that is hard to interpret [14].
Combining traditional econometric models such as ARMA and machine learning models such as LSTM, SVR and BP neural network can compensate for each other’s defects and further improve the prediction accuracy [15,16,17]. Therefore, this paper tries to combine ARMA and LSTM models to improve AQI prediction accuracy.
Second, the AQI time series of each city fluctuates strongly and appears disordered, chaotic and non-stationary. If the unprocessed AQI data are used directly for forecasting, these fluctuations interfere with the results of the subsequent forecasting models. If the fluctuations are removed, part of the information in the data is lost. Severe weather such as rain and snow can cause large fluctuations in the time series data, but this particular variation plays a key role in the subsequent data analysis. Decomposition methods such as EMD, EEMD, and CEEMDAN can effectively separate the time series into high-frequency to low-frequency components, which preserves such fluctuations, improves data utilization, and ultimately improves prediction model performance [18,19]. Therefore, in this paper, we choose the CEEMDAN decomposition method to decompose time series data such as AQI into multiple components (intrinsic mode functions (IMFs)) of different frequencies, which carry periodic trends and the volatility of random factors [20,21]. The components are input into the forecasting models, and the component forecasts are finally integrated.
The subsequent chapters of this paper are laid out as follows: Section 2 constructs and introduces the CEEMDAN-ARMA-LSTM in the study method and describes the model performance testing scheme and data sources. Section 3 analyzes the performance of the CEEMDAN-ARMA-LSTM model and analyzes the application implications of each part of the CEEMDAN-ARMA-LSTM model by comparing it with the LSTM, CEEMDAN-LSTM, and ARMA-GARCH models. Section 4 is the conclusion.

2. Materials and Methods

2.1. Methods

The study uses the CEEMDAN algorithm to decompose the time series data and extract the information. The multiple time series obtained from the decomposition are then fed into separately constructed ARMA and LSTM models and predicted. Finally, the final prediction results are obtained by summation. The data science tool used in this paper is Jupyter 6.4.6. The methods used in this paper are described in more detail as follows.

2.1.1. CEEMDAN

Complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) is an adaptive white noise decomposition method based on EMD and EEMD decomposition, proposed by Torres et al. [22]. To alleviate the influence of modal confusion on the decomposition results and to solve the problem that the sum of the EEMD decomposition results is not equal to the original sequence, the CEEMDAN algorithm adds adaptive white noise sequences at each stage of data decomposition. This effectively alleviates the modal confusion phenomenon and eliminates the influence of artificially added white noise on the completeness of the original sequence data, which improves the completeness of the decomposition and reduces the data reconstruction error. The CEEMDAN method decomposes the disordered and chaotic original data into multiple components (IMFs) of different frequencies with relative regularity and a residual component (Res). This makes non-stationary sequences relatively stationary and screens and clusters the regular information, which can effectively improve the performance of subsequent model prediction.
The steps of CEEMDAN decomposition for any time series data $AQI(t)$ are shown below.
Step 1: Determine the number $J$ of Gaussian white noise sequences $\omega_i(t)$ added in this cycle and the noise coefficient $\varepsilon$.
Step 2: Superimpose different Gaussian white noise sequences on the original data to obtain $J$ time series.
$$AQI\_Gauss_i(t) = AQI(t) + \varepsilon\,\omega_i(t)$$
where $\omega_i(t)$ is the Gaussian white noise sequence, and $AQI\_Gauss_i(t)$ is the generated sequence after superposition.
Step 3: Calculate the IMF components for the obtained $J$ time series.
$$h_i(t) = AQI\_Gauss_i(t) - n_i(t)$$
$$n_i(t) = \frac{max_i(t) + min_i(t)}{2}$$
where $max_i(t)$ and $min_i(t)$ are the time series of local maximal and local minimal values of $AQI\_Gauss_i(t)$, respectively, and form the upper and lower envelopes. $n_i(t)$ is the mean envelope. $h_i(t)$ is the intermediate signal.
Step 4: Determine whether the obtained intermediate signals $h_i(t)$ satisfy the two constraints of the intrinsic mode component.
(i) The number of extreme value points and the number of zero crossings must be equal or differ by at most one throughout the data segment.
(ii) At any moment, the average value of the upper envelope formed by the local maximal value points and the lower envelope formed by the local minimal value points is zero, i.e., the upper and lower envelopes are locally symmetric with respect to the time axis.
If the constraints are satisfied, the signal $h_i(t)$ is an IMF1 component, noted as $IMF1_i$ (in later cycles it is noted as $IMF2_i$, and so on, as in steps 5 and 6).
If the constraints are not satisfied, the signal $h_i(t)$ is noted as the residual component.
Step 5: Determine the number of extreme value points of the residual component.
For each of the constructed $J$ time series, repeat $C_j$ times, and if the number of residual component extreme value points decreases to a certain number (no more than 2), the decomposition ends. Otherwise, assign $AQI\_Gauss_i(t) = h_i(t)$ and repeat steps 3 to 5.
At the end of the loop in step 5, several residual components and $P$ components $IMF1_j(t)$ are obtained, and the final $IMF1(t)$ expression is as follows.
$$IMF1(t) = \sum_{p=1}^{P} \sum_{c=1}^{C_p} IMF1_{p,c}(t)$$
where $P \le J$.
Step 6: Repeat steps 1 to 5 with the residual component $r_1(t) = AQI(t) - IMF1(t)$ as the original signal until the number of extreme points of $r_1(t)$ is reduced to a certain number; then, CEEMDAN is completely finished.
CEEMDAN is completed after several large cycles from steps 1 to 5. Finally, the original signal is decomposed into $Q$ IMF components and 1 final residual component ($Q \le J$).
$$AQI(t) = \sum_{q=1}^{Q} IMF_q(t) + Res(t)$$
where $Q$ denotes the total number of cycle iterations of steps 1 to 5. $Res(t)$ is the final residual component.
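Steps 2 and 3 above can be illustrated with a minimal sketch. It is not the implementation used in the paper: the function names are hypothetical, the envelopes are piecewise linear (EMD implementations normally use cubic splines), and only a single sifting pass is shown rather than the full CEEMDAN loop.

```python
import random


def noisy_copies(series, n_copies, eps, seed=0):
    """Step 2: build n_copies series, each equal to series + eps * white noise."""
    rng = random.Random(seed)
    return [[x + eps * rng.gauss(0.0, 1.0) for x in series] for _ in range(n_copies)]


def local_extrema(series):
    """Indices of strict interior local maxima and local minima."""
    maxima, minima = [], []
    for t in range(1, len(series) - 1):
        if series[t - 1] < series[t] > series[t + 1]:
            maxima.append(t)
        elif series[t - 1] > series[t] < series[t + 1]:
            minima.append(t)
    return maxima, minima


def envelope(series, knots):
    """Piecewise-linear envelope through the knots (assumption: both end points are added)."""
    idx = sorted(set([0] + knots + [len(series) - 1]))
    env = []
    for left, right in zip(idx, idx[1:]):
        for t in range(left, right):
            w = (t - left) / (right - left)
            env.append((1 - w) * series[left] + w * series[right])
    env.append(series[idx[-1]])
    return env


def sift_once(series):
    """Step 3: mean envelope n(t) and intermediate signal h(t) = x(t) - n(t)."""
    maxima, minima = local_extrema(series)
    upper = envelope(series, maxima)
    lower = envelope(series, minima)
    mean_env = [(u + l) / 2 for u, l in zip(upper, lower)]
    h = [x - n for x, n in zip(series, mean_env)]
    return h, mean_env
```

By construction, adding the intermediate signal and the mean envelope returns the input series, which is the same completeness property that the final decomposition formula expresses for the IMFs and the residual.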

2.1.2. ARMA Model

In this paper, the stationary non-white-noise IMF components are modeled and predicted one by one using the ARMA model, and the non-stationary or white noise IMF components, as well as the residual component, Res, are modeled and predicted one by one using an LSTM neural network.
The ARMA model (auto-regressive and moving average model) is an important method for studying time series and consists of a “mixture” of an autoregressive model (AR model) and a moving average model (MA model). The order (p,q) of the ARMA model can be determined from the autocorrelation (ACF) and partial autocorrelation (PACF) plots, or by first building ARMA models with different parameters and selecting the best-performing model by using the AIC, BIC and other criteria. In order to make better use of the information in the residuals of the ARMA model, this paper further improves the forecasting method by combining it with the GARCH (generalized autoregressive conditional heteroskedasticity) model, which is an extension of the ARCH model. GARCH further models the variance of errors and is particularly suitable for volatility analysis and forecasting.
For example, an ARMA(p, q)-GARCH(h, b) model with multiple stationary IMFs as inputs takes the following form.
$$IMF_i(t) = \alpha_i(0) + \sum_{m=1}^{q} \alpha_i(m)\, IMF_i(t-m) + \sum_{n=1}^{p} \beta_i(n)\, e_i(t-n)$$
$$e_i(t) = \sigma_i(t)\, \varepsilon_i(t)$$
$$\sigma_i(t)^2 = \alpha_0 + \sum_{m=1}^{h} \alpha_m\, \sigma_i(t-m)^2 + \sum_{n=1}^{b} \beta_n\, e_i(t-n)^2$$
where $i$ is the ordinal number of the IMF component. $(p, q)$ and $(h, b)$ are the orders of the ARMA and GARCH processes, respectively. $\alpha$ and $\beta$ are the coefficients. $\varepsilon_i(t)$ is independent and identically distributed. $\sigma_i(t)$ is the conditional variance, and $e_i(t)$ is the residual of the conditional mean equation.
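The two recursions above can be written out as a short sketch. It only evaluates the conditional mean and conditional variance for given coefficients; it does not estimate them. The function and argument names are hypothetical, and the coefficient lists are ordered by lag (first entry is lag 1).

```python
def arma_one_step(history, residuals, const, ar_coefs, ma_coefs):
    """Conditional mean: const + sum_m ar[m] * x(t-m) + sum_n ma[n] * e(t-n)."""
    ar_part = sum(a * history[-m] for m, a in enumerate(ar_coefs, start=1))
    ma_part = sum(b * residuals[-n] for n, b in enumerate(ma_coefs, start=1))
    return const + ar_part + ma_part


def garch_variance(prev_variances, prev_residuals, omega, var_coefs, resid_coefs):
    """Conditional variance: omega + sum_m a[m] * sigma^2(t-m) + sum_n b[n] * e(t-n)^2."""
    var_part = sum(a * prev_variances[-m] for m, a in enumerate(var_coefs, start=1))
    res_part = sum(b * prev_residuals[-n] ** 2 for n, b in enumerate(resid_coefs, start=1))
    return omega + var_part + res_part
```

For instance, with past values `[1.0, 2.0]`, past residuals `[0.5, -0.5]`, a constant of 0.1, lag coefficients `[0.5, 0.2]` on the series and `[0.4]` on the residuals, the conditional mean is 0.1 + 0.5·2.0 + 0.2·1.0 + 0.4·(−0.5) = 1.1. In practice the coefficients would be estimated by a time series library rather than set by hand.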

2.1.3. LSTM

Human thinking has persistence, but traditional neural networks do not, which is a major drawback. For example, suppose you want to classify the types of events that are occurring at each point in a movie. It is not clear how a traditional neural network could use inference about previous events to inform later events. Recurrent neural networks (RNNs) solve this problem. They are networks with loops that allow information to persist, and RNNs have achieved great success in research areas such as speech recognition, language modeling, and translation. LSTMs are a special type of RNN that can learn long-term dependencies [23]. They perform well on a wide variety of problems and are now widely used.
The key to the LSTM is the cell state, the horizontal line at the top of Figure 1, through which the original information is transmitted unchanged. The gates are a transfer structure that selectively adds information, and they consist of sigmoid neural network layers and point-by-point multiplication operations, where the sigmoid layer outputs values between 0 and 1, which determine the proportion of information passed.
The structure of the LSTM neural network is shown in Figure 1, and its components and their roles are described below [24].
(1) Forgetting gate, $f_t$: determines the part of the current cell state $C_t$ that was passed from the cell state $C_{t-1}$ of the previous moment; the information to be discarded is determined by $f_t$ and $C_{t-1}$ together.
$$f_t = \sigma\left(W_f \cdot [h_{t-1}, X_t] + b_f\right)$$
(2) Input gate: Determines the part of the input value $X_t$ that can be retained in the current state $C_t$ at the current moment and updates the memory cell state.
$$i_t = \sigma\left(W_i \cdot [h_{t-1}, X_t] + b_i\right)$$
$$C_t = f_t \cdot C_{t-1} + i_t \cdot \tanh\left(W_c \cdot [h_{t-1}, X_t] + b_c\right)$$
(3) Output gate: Determines the part of the current cell state $C_t$ that can be used as an output value and generates an output using the new control parameter $C$.
$$o_t = \sigma\left(W_o \cdot [h_{t-1}, X_t] + b_o\right)$$
(4) Final output results:
$$h_t = o_t \cdot \tanh\left(C_t\right)$$
In (1) to (4), $X_t$ and $h_t$ represent the input and output values at moment $t$, respectively. $i_t$ is the new information retained, and $C_t$ is the control parameter $C$ formed by the new data. $\sigma$ is the sigmoid function, $\tanh$ is the activation function, $W_{o,i,f}$ are the corresponding weight matrices, and $b_{o,i,f}$ are the corresponding bias terms.
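The gate equations can be traced with a scalar sketch of one LSTM time step. This is a simplification for illustration only: the hidden state, cell state and input are single numbers, so each weight "matrix" reduces to a pair of weights, and the dictionary layout of `w` and `b` is an assumption of this example rather than any library's format.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, w, b):
    """One scalar LSTM step; w[k] = (weight on h_prev, weight on x_t), b[k] = bias."""
    def pre(k):
        # Pre-activation W_k . [h_{t-1}, X_t] + b_k for gate k
        return w[k][0] * h_prev + w[k][1] * x_t + b[k]

    f_t = sigmoid(pre("f"))                          # forgetting gate
    i_t = sigmoid(pre("i"))                          # input gate
    o_t = sigmoid(pre("o"))                          # output gate
    c_t = f_t * c_prev + i_t * math.tanh(pre("c"))   # new cell state
    h_t = o_t * math.tanh(c_t)                       # new output
    return h_t, c_t
```

With all weights and biases set to zero, every gate equals 0.5, so the cell state is simply halved at each step; this makes it easy to check the update by hand.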

2.1.4. CEEMDAN-ARMA-LSTM

Step 1: CEEMDAN decomposition is performed on AQI time series data to obtain multiple IMF components and one residual component, Res.
Step 2: An ADF test (for stationarity) and an LB white noise test are performed on the components.
Step 3: The stationary non-white-noise components are input into ARMA; the remaining components are input into LSTM.
Step 4: All the predicted results are summed up, which is the final result.
The flow chart of the above steps is shown in Figure 2.
Specifically, in step 1, the CEEMDAN model decomposes the raw AQI data into multiple signal series (IMF data) and a Res. Then, in step 2, the IMF data and Res are subjected to the ADF test and the LB test, which act as a conditional judgment: the IMF data that qualify are later input into the ARMA model, and the rest are input into the LSTM model. Next, in step 3, the ARMA or LSTM model outputs the prediction for each AQI component, based on the IMF and Res data given by steps 1 and 2. Last, in step 4, all predicted data are summed up. The result is the final prediction.
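The conditional judgment in steps 2 and 3 and the summation in step 4 can be sketched as follows. The p-values are assumed to come from an ADF test and an LB test run elsewhere; the function names and the 0.05 threshold argument are illustrative, not the authors' code.

```python
def route_component(adf_p, lb_p, alpha=0.05):
    """Return 'ARMA' for stationary, non-white-noise components, else 'LSTM'."""
    stationary = adf_p < alpha        # ADF null hypothesis: unit root (non-stationary)
    not_white_noise = lb_p < alpha    # LB null hypothesis: the series is white noise
    return "ARMA" if stationary and not_white_noise else "LSTM"


def combine(component_forecasts):
    """Step 4: sum the component forecasts time step by time step."""
    return [sum(values) for values in zip(*component_forecasts)]
```

A component with both p-values below the threshold goes to ARMA; a non-stationary component, or a stationary one that the LB test cannot distinguish from white noise, goes to LSTM. The final forecast is the element-wise sum of all component forecasts.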

2.2. Data Source

The air quality index (AQI) is an important indicator that describes the cleanliness or pollution level of air and its health effects. AQI presents six pollutants (SO2, NO2, PM10, PM2.5, CO and O3) in a unified evaluation standard. The AQI data used in this article come from the real-time national urban air quality release platform of the China National Environmental Monitoring Centre.
The air quality index, AQI, is the maximum value of the air quality sub index, IAQI.
$$IAQI = \frac{I_{high} - I_{low}}{C_{high} - C_{low}} \left(C - C_{low}\right) + I_{low}$$
$C$ is the pollutant concentration and is an input value.
$C_{low}$ is a concentration limit less than or equal to $C$ and is a constant.
$C_{high}$ is a concentration limit greater than or equal to $C$ and is a constant.
$I_{low}$ is an index limit corresponding to $C_{low}$ and is a constant.
$I_{high}$ is an index limit corresponding to $C_{high}$ and is a constant.
Please see Table 1 for the AQI concentration limit of pollutant items.
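As a worked illustration of the IAQI interpolation, the sketch below uses a breakpoint table for 24-hour PM2.5. The breakpoint values are an assumption of this example (taken from memory of the Chinese AQI technical regulation) and should be checked against Table 1; the function names are hypothetical.

```python
# Assumed 24-hour PM2.5 breakpoints: (C_low, C_high, I_low, I_high); verify against Table 1.
PM25_BREAKPOINTS = [
    (0, 35, 0, 50), (35, 75, 50, 100), (75, 115, 100, 150), (115, 150, 150, 200),
    (150, 250, 200, 300), (250, 350, 300, 400), (350, 500, 400, 500),
]


def iaqi(conc, breakpoints):
    """Linear interpolation between the index limits that bracket the concentration."""
    for c_low, c_high, i_low, i_high in breakpoints:
        if c_low <= conc <= c_high:
            return (i_high - i_low) / (c_high - c_low) * (conc - c_low) + i_low
    raise ValueError("concentration outside the breakpoint table")


def aqi(sub_indices):
    """AQI is the maximum of the pollutant sub-indices."""
    return max(sub_indices)
```

Under these assumed breakpoints, a PM2.5 concentration of 55 falls between 35 and 75, so IAQI = (100 − 50)/(75 − 35) · (55 − 35) + 50 = 75. If the other pollutants had lower sub-indices, the AQI would also be 75.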
The study area is Beijing, China. It is the capital of the People’s Republic of China and, as determined by the State Council’s approval, the political center, cultural center, international communication center, and science and technology innovation center of China [25]. As of 2020, the city has 16 districts under its jurisdiction, with a total area of 16,410.54 square kilometers. According to the seventh census data, as of 1 November 2020, the resident population of Beijing was 21,893,095 people [26]. The topography of Beijing is high in the northwest and low in the southeast. It is surrounded by mountains in the west, north and northeast, and a plain slopes gently toward the Bohai Sea in the southeast. Beijing is ranked as one of the top cities in the world by GaWC, a world city research institute, and the United Nations reports that Beijing ranks second in China in terms of the human development index [27]. In 2020, Beijing achieved an annual gross regional product of CNY 3610.26 billion, an increase of 1.2% over the previous year in comparable prices [28]. Because Beijing is a political, economic and cultural center with a high degree of industrialization and urbanization, its air quality has always been a concern.
Monthly AQI data are averages of the daily values, which are calculated from hourly data. Since AQI data for China have been released only since December 2013, monthly AQI data from 2014 to 2021 were chosen to keep the series continuous and consistent, as shown in Figure 3. The vertical axis indicates the AQI value. The horizontal axis indicates the months.

2.3. Model Performance Testing Criteria

To test the model performance, the sample was divided into two parts. Monthly AQI data from January 2014 to December 2019 were used as the training sample, and data from January 2020 to December 2021 were used as the test sample. Root-mean-square error (RMSE), mean square error (MSE), mean absolute percentage error (MAPE), and mean absolute error (MAE) were used as model performance evaluation criteria.
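$$RMSE = \left(\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2\right)^{\frac{1}{2}}$$
$$MSE = \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2$$
$$MAPE = \frac{1}{n} \sum_{i=1}^{n} \left|\frac{y_i - \hat{y}_i}{y_i}\right|$$
$$MAE = \frac{1}{n} \sum_{i=1}^{n} \left|y_i - \hat{y}_i\right|$$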
where $y_i$ and $\hat{y}_i$ are the i-th true value and the i-th predicted value of a validation set (written $y_i^j$ and $\hat{y}_i^j$ for the j-th experimental validation set), and n is the sample size of the validation set. To compare the models more intuitively, the ARMA-GARCH, LSTM neural network, and CEEMDAN-LSTM neural network models were chosen as benchmarks for the model employed in this paper. These three models were chosen to analyze the necessity of each part of the CEEMDAN-ARMA-LSTM model, i.e., whether each part helps to improve prediction accuracy.
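The four criteria can be computed directly from paired lists of true and predicted values. The sketch below is a plain standard-library version; the function names are chosen for this example, and MAPE is returned as a fraction rather than a percentage.

```python
import math


def mse(y_true, y_pred):
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))


def mae(y_true, y_pred):
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    # Undefined if any true value is zero; AQI values are positive.
    return sum(abs((y - p) / y) for y, p in zip(y_true, y_pred)) / len(y_true)
```

For example, with true values `[100, 200]` and predictions `[110, 180]`, the errors are −10 and 20, so MAE = 15, MSE = 250, RMSE ≈ 15.81 and MAPE = 0.1.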

3. Results

3.1. AQI Data Decomposition Based on CEEMDAN

After CEEMDAN decomposition, we obtain 6 IMF components and 1 residual, Res, as shown in Figure 4. Among them, IMF1 to 5 characterize the cyclical pattern of AQI, and IMF6 characterizes the overall decreasing trend of AQI. In other words, the CEEMDAN model divides the original AQI data into six regular IMFs, and the parts of the AQI that cannot be decomposed into regular sequences are recorded as the residual, Res. The residual contains more subtle and unordered information, so its values are usually very small. The vertical axis indicates the decomposed AQI value. The horizontal axis indicates the year.

3.2. ARMA-LSTM Result

3.2.1. Applicable Model Screening

ARMA modeling can be performed only when the series is stationary and is not white noise. Therefore, a stationarity test is performed for the IMF components and the Res component. In this section, the ADF test is used, with the lag order determined by the AIC criterion. The test results are shown in Table 2.
As can be seen from Table 2, the p-values of IMF1 to 3 are all much less than 0.05 in the ADF test, which means that IMF1 to 3 are stationary series. Additionally, IMF1 to 3 are non-white-noise series under a lag order of 30. Therefore, IMF1 to 3 can be directly used for ARMA modeling.
Secondly, the p-values of IMF4 to 6 in the ADF test are all much larger than 0.05, which means that IMF4 to 6 are non-stationary series; they cannot be directly used for ARMA modeling and would need to be differenced. Res passes the ADF test and is a stationary series, but according to the results of the LB test, Res is a white noise series, so the ARMA model is not applicable to it.
Finally, it should be noted that discarding the Res data or differencing the IMF data would lose the information content of the time series components extracted by CEEMDAN. This section nevertheless attempted to difference IMFs 4 to 6, but problems such as high differencing orders and non-convergence still occurred during data processing and modeling, as shown in Table 3.
In summary, the ARMA model was used for IMFs 1 to 3. The LSTM model was used for IMFs 4–6 and the Res component.
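The white noise screening rests on the Ljung-Box Q statistic, which can be sketched with the standard library alone. This example computes only the statistic, Q = n(n + 2) Σ ρ_k² / (n − k); turning it into a p-value requires a chi-square distribution, for which a statistics package would normally be used (statsmodels provides `acorr_ljungbox` and `adfuller` for the two tests). The function names below are hypothetical.

```python
def autocorr(series, lag):
    """Sample autocorrelation at the given lag."""
    n = len(series)
    mean = sum(series) / n
    denom = sum((x - mean) ** 2 for x in series)
    num = sum((series[t] - mean) * (series[t - lag] - mean) for t in range(lag, n))
    return num / denom


def ljung_box_q(series, max_lag):
    """Ljung-Box Q statistic over lags 1..max_lag."""
    n = len(series)
    return n * (n + 2) * sum(
        autocorr(series, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )
```

A large Q relative to the chi-square reference indicates autocorrelation, i.e. the series is not white noise. For the strongly alternating series `[1, -1, 1, -1, 1, -1, 1, -1]`, the lag-1 autocorrelation is −0.875 and Q at one lag is 8.75.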

3.2.2. ARMA Construction

This section uses the ACF and PACF to select the optimal ARMA process based on the AIC and BIC criteria. The ACF and PACF diagrams for IMF1 to 3 are shown in Figure 5, Figure 6 and Figure 7. The vertical axis indicates PACF or ACF. The horizontal axis indicates the model lag order.
Combining the numerical calculation results of ACF and PACF, the ARMA models and fitting results that best fit IMF 1 to 3 based on the AIC and BIC criteria are shown in Table 4 and Table 5.
As can be seen from Table 4, the AR(2) model built for IMF1 has a low goodness of fit, but the models built for IMF2 and 3 perform better, with a goodness of fit of 0.88 or more. As can be seen from Table 5, the main explanatory variables in the models constructed for IMF1 to 3 are almost all significant. Only MA(1) in the ARMA(4,2) model constructed for IMF3 is insignificant.
Overall, the three models are acceptable and can be used for subsequent predictions.

3.2.3. LSTM Neural Network Settings

The LSTM used in this section was modeled with a time window of 2, i.e., the first 2 values were used as input variables, and the third value was used as the predicted variable.
The network was a 2-layer LSTM neural network. The initial number of neurons was set to 4, and the output vector of the neurons in the previous layer was used as the input vector in the next layer. Finally, a fully connected layer was added to map the output of the last LSTM layer to the desired output vector.
The model was trained by applying error back propagation with 10,000 iterations, the optimizer algorithm was Adam, and the learning rate was 0.01. Each iteration updates all parameters and sets the gradient to 0. The normalization method used the maximum-minimum normalization method.
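The data preparation implied by these settings (maximum-minimum normalization and a time window of 2) can be sketched as follows. The network itself is not reproduced here, since the paper does not show its code; the function names are hypothetical.

```python
def min_max_scale(series):
    """Scale to [0, 1]; assumes the series is not constant (max > min)."""
    lo, hi = min(series), max(series)
    return [(x - lo) / (hi - lo) for x in series], lo, hi


def inverse_scale(scaled, lo, hi):
    """Map scaled predictions back to the original units."""
    return [s * (hi - lo) + lo for s in scaled]


def make_windows(series, window=2):
    """Each input is `window` consecutive values; the target is the next value."""
    inputs = [series[t:t + window] for t in range(len(series) - window)]
    targets = [series[t + window] for t in range(len(series) - window)]
    return inputs, targets
```

For the series `[1, 2, 3, 4]` and a window of 2, the inputs are `[[1, 2], [2, 3]]` and the targets are `[3, 4]`. In a forecasting setting, the scaling bounds would normally be taken from the training sample only and reused for the test sample.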

3.3. Analysis of Prediction Results

The ARMA-LSTM model constructed in Section 3.2 was used to forecast AQI in Beijing from January 2020 to December 2021.
The prediction results of AQI and its components with basic statistics are shown in Table 6.
According to Table 6, the predictions of CEEMDAN-ARMA-LSTM have smaller maxima and larger minima than the real AQI values, and the overall distribution interval is narrowed. The same situation is observed for the components IMF1, IMF2, IMF3, and IMF5 and the residual Res. However, IMF4 and IMF6 show smaller minimum values and larger maximum values, and the overall distribution interval is expanded. Overall, the components IMF1, IMF2, IMF3, and IMF5 better reflect the overall change of AQI in Beijing from 2014 to 2021. Second, it also indicates that the model used in this paper weakens the “shock” information of the time series data to a certain extent, although it is not obvious, and this is something that needs to be further improved in the subsequent study.
To further analyze the model performance, the accuracy comparison results with the benchmark models listed in Section 2.3 are shown in Table 7. Figure 8 compares the real AQI values with the predicted AQI values obtained from the different models. The vertical axis indicates the AQI value. The horizontal axis indicates the year.
According to Table 7, the CEEMDAN-ARMA-LSTM used in this paper has the best performance. The model has the smallest MAE, MAPE, MSE, and RMSE in the simulation experiment of predicting Beijing AQI from January 2020 to December 2021. Compared with the three models CEEMDAN-LSTM, LSTM and ARMA-GARCH, MAE improved by 22.5%, 53.4% and 21.5%, MAPE improved by 21.4%, 55.3% and 26.1%, MSE improved by 39.9%, 76.9% and 28.5%, and RMSE improved by 22.5%, 52.0% and 15.4%. Accuracy improvements were significant.
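The paper does not state how the improvement percentages are calculated. A natural reading, assumed here, is the relative reduction of each error measure against the benchmark model. Under that assumption the reported figures are internally consistent: because MSE is the square of RMSE, an RMSE reduction of r implies an MSE reduction of 1 − (1 − r)², and 22.5% maps to about 39.9%.

```python
def improvement(baseline_error, model_error):
    """Relative reduction of an error measure against a benchmark model."""
    return (baseline_error - model_error) / baseline_error


def mse_improvement_from_rmse(rmse_improvement):
    """MSE = RMSE ** 2, so a relative RMSE reduction r gives 1 - (1 - r) ** 2 for MSE."""
    return 1 - (1 - rmse_improvement) ** 2
```

The same check applied to the other two benchmarks gives about 77.0% for an RMSE reduction of 52.0% and about 28.4% for 15.4%, which agree with the reported 76.9% and 28.5% up to rounding. The error values passed to `improvement` would come from Table 7.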
Specifically, the CEEMDAN-LSTM has smaller errors in all categories compared to the LSTM. This result demonstrates that CEEMDAN improves data utilization by decomposing the AQI time series. Decomposition methods such as CEEMDAN continuously extract the various scale components that make up the original signal from high to low frequencies to strengthen and separate each frequency feature, which in turn improves the efficiency of model simulation, training, and prediction sessions to capture the patterns.
Second, the ARMA-GARCH model has smaller errors in all categories compared with the LSTM neural network. This result indicates that the ARMA model is more applicable for time series with strong periodicity and short length. Since this paper uses monthly AQI data with a sample size of 96, it may not meet the “big data” characteristics required for machine learning. This is also a possible reason for the low accuracy of the LSTM model.
Finally, the CEEMDAN-ARMA-LSTM model has higher accuracy compared with the CEEMDAN-LSTM model. Since ARMA cannot handle all CEEMDAN components, the CEEMDAN-ARMA-LSTM model ensures that all components are used by feeding the non-stationary and white noise components into the LSTM. A proper combination of traditional models with machine learning and neural networks can effectively improve the prediction accuracy. Additionally, this approach has the advantage that the interpretability of the model can be preserved when it is applied in multivariate contexts.

4. Conclusions

In this paper, the CEEMDAN-ARMA-LSTM model is constructed by integrating the CEEMDAN, ARMA and LSTM neural networks with the objective of improving the prediction accuracy of monthly AQI data in Beijing. Through the accuracy test, CEEMDAN effectively captures and separates the potential amount of information contained in the data. The accuracy of the CEEMDAN-ARMA-LSTM model is relatively high and stable and has good application prospects. The study of accuracy improvement can continue to be enhanced in the subsequent research.
The empirical study in this paper covers only one of the application areas of the CEEMDAN-ARMA-LSTM model. The method is expected to be applicable to the fields of medicine, public health, economics and sociology. For example, we can estimate the historical long-term PM2.5 concentration time series of the raster cell where a subject lives and combine it with health effect indicators such as the morbidity, mortality and outpatient rate of the specific diseases to be studied, in order to conduct chronic health effect studies and add evidence for the causal relationship between long-term PM2.5 exposure and disease. We could also estimate historical provincial GDP, EPU and carbon sink data and combine them with statistical yearbook data to establish econometric models for the analysis of urbanization promotion, urban cluster development, and the effectiveness of transforming the economic development mode.

Author Contributions

Y.S. and J.L. set up the problem, computed the details and polished the draft. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lu, J.; Cao, X. PM2.5 pollution in major cities in China: Pollution status, emission sources and control measures. Fresenius Environ. Bull. 2015, 24, e1349.
  2. Kampa, M.; Castanas, E. Human health effects of air pollution. Environ. Pollut. 2008, 151, 362–367.
  3. Akimoto, H. Global Air Quality and Pollution. Science 2004, 302, 1716–1719.
  4. Yang, S. Real-time air quality forecasting, part I: History, techniques, and current status. Atmos. Environ. 2012, 60, 632–655.
  5. Yang, Z.; Bocquet, M.; Mallet, V.; Seigneur, C.; Baklanov, A. Real-time air quality forecasting, part II: State of the science, current research needs, and future prospects. Atmos. Environ. 2012, 60, 656–676.
  6. Liang, J.; Wu, L.; Wu, Y.; Chen, L. AQI Prediction of Nanjing City Based on GM(1,1) Seasonal Index Model. In Proceedings of the 2019 International Conference on Applied Mathematics, Model, Simulation and Optimization, Guilin, China, 21–22 April 2019.
  7. Grell, G.A. A Description of the Fifth-Generation Penn State/NCAR Mesoscale Model (MM5); No. NCAR/TN-398+STR; University Corporation for Atmospheric Research: Boulder, CO, USA, 1995.
  8. Sun, M.; Xu, M.; Xie, P.; Cao, L. The Research of Air Quality on Harbin Based on ARMA Model. Nat. Sci. J. Harbin Norm. Univ. 2018, 34, 21–25.
  9. Tan, X. Analysis of PM2.5 Fine Particulate Matter Based on Time Series Method and Stepwise Regression Method. Ph.D. Thesis, Huazhong Agricultural University, Wuhan, China, 2019.
  10. Xiao, L.; Yong, L.; Lf, A.; Ss, A.; Tao, Z.; Mq, A. Research on the prediction of dangerous goods accidents during highway transportation based on the ARMA model. J. Loss Prev. Process. Ind. 2021, 72, 104583.
  11. Liwen, Z. Application of ARIMA model on prediction of China’s corn market. J. Phys. Conf. Ser. 2021, 1941, 12064.
  12. Zeng, G.; Jin, R. Predicting Beijing Air Quality Data Based on LSTM Method. Int. J. Trend Sci. Res. Dev. 2021, 5, 774–777.
  13. Liao, Y.; Yang, J. Multi-hour and multi-site air quality index forecasting in Beijing using CNN, LSTM, CNN-LSTM, and spatiotemporal clustering. Expert Syst. Appl. 2021, 169, 114513.
  14. Chen, G.; Ren, M.; Wei, Q. Data-Intelligence Empowerment: A New Leap of Information Systems Research. J. Manag. World 2022, 38, 180–196. (In Chinese)
  15. Zhao, Y. Air Quality Index Prediction Based on ARIMA and SVR Combined Model-Taking Jinan as an Example. Master’s Thesis, Tianjin University of Commerce, Tianjin, China, 2019.
  16. Wang, W. Research on Urban Air Quality Forecast Based on ARMA-BP Neural Network. Master’s Thesis, Northwestern Polytechnical University, Xi’an, China, 2021.
  17. Zheng, X.; Zhu, G. ARMA-ABCSVR-GABP Network Traffic Prediction Based On HP Filter. Comput. Appl. Softw. 2022, 39, 94–99.
  18. Wang, T.; Zhang, M.; Yu, Q.; Zhang, H. Comparing the applications of EMD and EEMD on time–frequency analysis of seismic signal. J. Appl. Geophys. 2012, 83, 29–34.
  19. Chen, R.X.; Tang, B.P.; Ma, J.H. Adaptive de-noising method based on ensemble empirical mode decomposition for vibration signal. J. Vib. Shock 2012, 31, 82–86.
  20. Das, A.B.; Bhuiyan, M. Discrimination of focal and non-focal EEG signals using entropy-based features in EEMD and CEEMDAN domains. In Proceedings of the 2016 9th International Conference on Electrical and Computer Engineering (ICECE), Dhaka, Bangladesh, 20–22 December 2016.
  21. Chen, J.; Cheng, S.; Yang, Y. Modified EEMD algorithm and its applications. J. Vib. Shock 2013, 32, 7.
  22. Torres, M.E.; Colominas, M.A.; Schlotthauer, G. A complete ensemble empirical mode decomposition with adaptive noise. In Proceedings of the IEEE International Conference on Acoustics, Prague, Czech Republic, 22–27 May 2011.
  23. Hochreiter, S.; Schmidhuber, J. Long Short-Term Memory. Neural Comput. 1997, 9, 1735–1780.
  24. Olah, C. Understanding LSTM Networks. Available online: http://colah.github.io/posts/2015-08-Understanding-LSTMs/ (accessed on 27 August 2015).
  25. Agency, X.N. The Central Committee of the Communist Party of China the State Council on the Approval of the Beijing Urban Master Plan (2016–2035). Available online: http://www.gov.cn/zhengce/2017-09/27/content_5227992.htm (accessed on 27 September 2017).
  26. National Bureau of Statistics of China, Seventh National Population Census Bulletin. Available online: http://www.stats.gov.cn/tjsj/zxfb/202105/t20210510_1817179.html (accessed on 11 May 2021).
  27. People.cn. Six Chinese Cities Are among the World’s “First Tier” Cities. Available online: http://house.people.com.cn/n1/2018/1115/c164220-30402942.html (accessed on 11 November 2018).
  28. Statistics, B.M.B. Beijing’s Economy Will Recover Steadily in 2020. Available online: http://tjj.beijing.gov.cn/zxfbu/202101/t20210120_2227698.html (accessed on 20 January 2021).
Figure 1. LSTM Structure Schematic. Figure from https://colah.github.io/posts/2015-08-Understanding-LSTMs/.
Figure 2. Flowchart of CEEMDAN-ARMA-LSTM algorithm.
Figure 3. AQI data from 2014 to 2021 in Beijing.
Figure 4. CEEMDAN Results.
Figure 5. Distribution of ACF and PACF of IMF1.
Figure 6. Distribution of ACF and PACF of IMF2.
Figure 7. Distribution of ACF and PACF of IMF3.
Figure 8. Line Graph of Prediction Results of Each Model.
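Table 1. AQI: concentration limit of pollutant items.

| IAQI | SO2, 24-h (μg/m3) | NO2, 24-h (μg/m3) | PM2.5, 24-h (μg/m3) | PM10, 24-h (μg/m3) | CO, 24-h (mg/m3) | O3, 8-h (μg/m3) |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 50 | 50 | 40 | 35 | 50 | 2 | 100 |
| 100 | 150 | 80 | 75 | 150 | 4 | 160 |
| 150 | 475 | 180 | 115 | 250 | 14 | 215 |
| 200 | 800 | 280 | 150 | 350 | 24 | 265 |
| 300 | 1600 | 565 | 250 | 420 | 36 | 800 |
| 400 | 2100 | 750 | 350 | 500 | 48 | Note 3 |
| 500 | 2620 | 940 | 500 | 600 | 60 | Note 3 |

| IAQI | CO, 1-h (mg/m3) | O3, 1-h (μg/m3) | SO2, 1-h (μg/m3) | NO2, 1-h (μg/m3) |
|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 |
| 50 | 5 | 160 | 150 | 100 |
| 100 | 10 | 200 | 500 | 200 |
| 150 | 35 | 300 | 650 | 700 |
| 200 | 60 | 400 | 800 | 1200 |
| 300 | 90 | 800 | Note 2 | 2340 |
| 400 | 120 | 1000 | Note 2 | 3090 |
| 500 | 150 | 1200 | Note 2 | 3840 |

All values are moving averages over the stated window; the 1-h columns are subject to Note 1.
Note 1: One-hour data is only used for real-time reporting, and 24-h data is used in daily news.
Note 2: If it exceeds 800, it will not be calculated. Calculated as 24-h moving average.
Note 3: If it exceeds 800, it will not be calculated. Calculated as 1-h moving average.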
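The breakpoints in Table 1 are normally turned into an individual index (IAQI) by linear interpolation between the two neighbouring rows, and the AQI is then taken as the largest IAQI over all pollutants. The sketch below assumes that standard interpolation rule; the function names `iaqi` and `aqi` are illustrative, and capping at the top of the scale is an assumption of this sketch rather than something stated in the table.

```python
from bisect import bisect_left

# IAQI levels and 24-h breakpoints (μg/m3) copied from Table 1.
IAQI_LEVELS = [0, 50, 100, 150, 200, 300, 400, 500]
PM25_24H = [0, 35, 75, 115, 150, 250, 350, 500]
PM10_24H = [0, 50, 150, 250, 350, 420, 500, 600]


def iaqi(concentration, breakpoints, levels=IAQI_LEVELS):
    """Interpolate linearly between the two breakpoints that bracket the value."""
    if concentration <= breakpoints[0]:
        return float(levels[0])
    if concentration >= breakpoints[-1]:
        return float(levels[-1])  # assumption: cap at the top of the scale
    hi = bisect_left(breakpoints, concentration)  # first breakpoint >= value
    lo = hi - 1
    span = breakpoints[hi] - breakpoints[lo]
    return (levels[hi] - levels[lo]) * (concentration - breakpoints[lo]) / span + levels[lo]


def aqi(sub_indices):
    """AQI as the maximum of the individual pollutant indices."""
    return max(sub_indices)


pm25_index = iaqi(55, PM25_24H)   # between the 35 and 75 breakpoints
pm10_index = iaqi(120, PM10_24H)  # between the 50 and 150 breakpoints
overall = aqi([pm25_index, pm10_index])
print(pm25_index, pm10_index, overall)
```

With these hypothetical concentrations, PM10 is the pollutant that determines the overall index.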
Table 2. Results of ADF-Test and LB-Test.

| Component | ADF-Test: t | ADF-Test: P | ADF-Test: Lags Used (AIC) | LB-Test: Result (Lags Used 30) |
|---|---|---|---|---|
| IMF1 | −7.046 | 0.000 | 3 | Non-white-noise sequence |
| IMF2 | −3.504 | 0.007 | 5 | Non-white-noise sequence |
| IMF3 | −4.237 | 0.001 | 5 | Non-white-noise sequence |
| IMF4 | −2.028 | 0.274 | 5 | Non-white-noise sequence |
| IMF5 | −2.122 | 0.236 | 5 | Non-white-noise sequence |
| IMF6 | 0.309 | 0.309 | 12 | Non-white-noise sequence |
| Res | −7.913 | 0.000 | 0 | White noise sequence |
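The test results in Table 2 decide which model each component is sent to: stationary, non-white-noise components go to ARMA, and everything else goes to the LSTM. A minimal sketch of that routing rule follows, assuming a 5% significance level for the ADF test; the names `route` and `TABLE_2` are illustrative and the p-values are copied from the table rather than recomputed.

```python
ALPHA = 0.05  # assumed significance level for the ADF test

# component -> (ADF p-value, LB test says "white noise"), copied from Table 2
TABLE_2 = {
    "IMF1": (0.000, False),
    "IMF2": (0.007, False),
    "IMF3": (0.001, False),
    "IMF4": (0.274, False),
    "IMF5": (0.236, False),
    "IMF6": (0.309, False),
    "Res": (0.000, True),
}


def route(adf_p, is_white_noise, alpha=ALPHA):
    """Send stationary, non-white-noise components to ARMA, the rest to LSTM."""
    stationary = adf_p < alpha
    return "ARMA" if stationary and not is_white_noise else "LSTM"


routing = {name: route(p, wn) for name, (p, wn) in TABLE_2.items()}
for name, model in routing.items():
    print(f"{name}: {model}")
```

Under this rule IMF1 to IMF3 are assigned to ARMA, which agrees with the components fitted in Table 4, while IMF4 to IMF6 and the residual fall to the LSTM.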
Table 3. Problems Arising from IMF4 to 6 Follow-up Operations.

| Component | Diff | Problem |
|---|---|---|
| IMF4 | 16 | Order too high |
| IMF5 | 2 | Singular matrix, SVD function, non convergence |
| IMF6 | None | It cannot be stationary by difference |
Table 4. The Selected Model and its R².

| Component | Model | R² |
|---|---|---|
| IMF1 | MA(2) | 0.278 |
| IMF2 | ARMA(2,2) | 0.881 |
| IMF3 | ARMA(4,2) | 0.917 |
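Table 5. Fitting Results of the Selected Model.

IMF1: MA(2)

| R² | Log Likelihood | S.D. of Innovations | AIC | BIC | HQIC |
|---|---|---|---|---|---|
| 0.278 | −297.337 | 14.638 | 602.673 | 611.780 | 606.299 |

| | coef | std err | z | P | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const * | −1.211 | 0.136 | −8.915 | 0.000 | −1.478 | −0.945 |
| ma.L1.IMF1 * | −0.303 | 0.103 | −2.956 | 0.003 | −0.504 | −0.102 |
| ma.L2.IMF1 * | −0.697 | 0.097 | −7.180 | 0.000 | −0.887 | −0.507 |

IMF2: ARMA(2,2)

| R² | Log Likelihood | S.D. of Innovations | AIC | BIC | HQIC |
|---|---|---|---|---|---|
| 0.881 | −166.455 | 2.338 | 344.910 | 358.570 | 350.348 |

| | coef | std err | z | P | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | −0.190 | 1.191 | −0.159 | 0.874 | −2.524 | 2.145 |
| ar.L1.IMF2 * | 1.045 | 0.106 | 9.829 | 0.000 | 0.837 | 1.253 |
| ar.L2.IMF2 * | −0.691 | 0.102 | −6.771 | 0.000 | −0.891 | −0.491 |
| ma.L1.IMF2 * | 1.260 | 0.136 | 9.257 | 0.000 | 0.993 | 1.527 |
| ma.L2.IMF2 * | 0.540 | 0.149 | 3.623 | 0.000 | 0.248 | 0.832 |

IMF3: ARMA(4,2)

| R² | Log Likelihood | S.D. of Innovations | AIC | BIC | HQIC |
|---|---|---|---|---|---|
| 0.917 | 42.226 | 0.120 | −68.452 | −50.239 | −61.201 |

| | coef | std err | z | P | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| const | −0.399 | 0.473 | −0.842 | 0.400 | −1.326 | 0.529 |
| ar.L1.IMF3 * | 3.468 | 0.059 | 58.635 | 0.000 | 3.352 | 3.584 |
| ar.L2.IMF3 * | −4.806 | 0.157 | −30.614 | 0.000 | −5.114 | −4.498 |
| ar.L3.IMF3 * | 3.130 | 0.155 | 20.161 | 0.000 | 2.825 | 3.434 |
| ar.L4.IMF3 * | −0.814 | 0.057 | −14.326 | 0.000 | −0.926 | −0.703 |
| ma.L1.IMF3 | −0.021 | 0.131 | −0.164 | 0.870 | −0.277 | 0.235 |
| ma.L2.IMF3 * | −0.248 | 0.120 | −2.059 | 0.040 | −0.484 | −0.012 |

* Labeled as significant at the 5% level.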
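As a reading aid, the MA(2) coefficients reported for IMF1 in Table 5 can be written out as an explicit equation. The form below assumes the common convention in which the MA terms enter with a plus sign and the constant is the series mean; the symbol \(\varepsilon_t\) for the innovation is introduced here and is not taken from the table.

```latex
\mathrm{IMF1}_t = -1.211 + \varepsilon_t - 0.303\,\varepsilon_{t-1} - 0.697\,\varepsilon_{t-2},
\qquad \operatorname{sd}(\varepsilon_t) \approx 14.638
```

The ARMA(2,2) and ARMA(4,2) rows can be read the same way, with the `ar.L` coefficients multiplying lagged values of the component and the `ma.L` coefficients multiplying lagged innovations.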
Table 6. Basic Statistics of AQI Prediction Results in Beijing Based on CEEMDAN-ARMA-LSTM.

| | AQI True | AQI Predict | IMF1 True | IMF1 Predict | IMF2 True | IMF2 Predict | IMF3 True | IMF3 Predict |
|---|---|---|---|---|---|---|---|---|
| Mean | 78.25 | 85.83 | −2.70 | −1.55 | −0.57 | 0.10 | −1.87 | −0.02 |
| Median | 73.50 | 84.71 | −2.18 | −1.21 | −0.20 | −0.12 | −0.86 | 0.27 |
| Max | 149.00 | 103.23 | 34.45 | 0.43 | 18.30 | 4.79 | 9.25 | 5.39 |
| Min | 50.00 | 72.97 | −28.87 | −11.08 | −27.92 | −2.99 | −14.96 | −6.32 |
| std | 21.23 | 7.45 | 15.00 | 2.01 | 11.21 | 1.66 | 7.17 | 3.47 |

| | IMF4 True | IMF4 Predict | IMF5 True | IMF5 Predict | IMF6 True | IMF6 Predict | Res True | Res Predict |
|---|---|---|---|---|---|---|---|---|
| Mean | 0.00 | 0.04 | 1.17 | 0.75 | 82.22 | 86.51 | −5.9 × 10⁻¹⁶ | −1.6 × 10⁻¹⁵ |
| Median | −0.14 | −0.29 | 1.65 | 1.16 | 81.83 | 83.04 | 0 | −2.27 × 10⁻¹⁵ |
| Max | 5.20 | 5.38 | 2.53 | 1.79 | 85.55 | 115.27 | 1.4 × 10⁻¹⁴ | 2.3 × 10⁻¹⁴ |
| Min | −5.54 | −6.66 | −2.07 | −3.03 | 80.23 | 80.66 | −1.4 × 10⁻¹⁴ | −2.3 × 10⁻¹⁴ |
| std | 3.69 | 4.08 | 1.37 | 1.25 | 1.65 | 8.52 | 4.5 × 10⁻¹⁵ | 1.1 × 10⁻¹⁴ |
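Because the decomposition is additive, the AQI forecast is the sum of the component forecasts, and the column means in Table 6 should add up accordingly. The short check below copies the means from the table and sums them; the dictionary names are illustrative, and the residual means are taken to be of order 10⁻¹⁵, which is an assumption about exponents whose signs are not legible in every copy of the table.

```python
# Column means copied from Table 6 (residual magnitudes assumed ~1e-15).
TRUE_MEANS = {
    "IMF1": -2.70, "IMF2": -0.57, "IMF3": -1.87, "IMF4": 0.00,
    "IMF5": 1.17, "IMF6": 82.22, "Res": -5.9e-16,
}
PREDICTED_MEANS = {
    "IMF1": -1.55, "IMF2": 0.10, "IMF3": -0.02, "IMF4": 0.04,
    "IMF5": 0.75, "IMF6": 86.51, "Res": -1.6e-15,
}
AQI_TRUE_MEAN = 78.25
AQI_PREDICTED_MEAN = 85.83

# Summing the components should recover the AQI means up to rounding.
true_sum = sum(TRUE_MEANS.values())
predicted_sum = sum(PREDICTED_MEANS.values())

print(f"true:      components {true_sum:.2f} vs AQI {AQI_TRUE_MEAN:.2f}")
print(f"predicted: components {predicted_sum:.2f} vs AQI {AQI_PREDICTED_MEAN:.2f}")
```

Both sums agree with the AQI columns to two decimals. Most of the gap between the predicted and true AQI means comes from IMF6, the slowly varying trend component.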
Table 7. Model Performance Comparison Analysis.

CEEMDAN-ARMA-LSTM

| | AQI | IMF1 | IMF2 | IMF3 | IMF4 | IMF5 | IMF6 | Res |
|---|---|---|---|---|---|---|---|---|
| MAE | 17.09 | 12.37 | 8.57 | 6.97 | 0.82 | 0.65 | 4.29 | 9.3 × 10⁻¹⁵ |
| MAPE | 23.2% | 109.9% | 152.9% | 301.0% | 51.9% | 60.4% | 5.1% | |
| MSE | 478.13 | 237.31 | 135.31 | 75.19 | 1.85 | 0.56 | 69.54 | 5.4 × 10⁻³⁵ |
| RMSE | 21.87 | 15.40 | 11.63 | 8.67 | 1.36 | 0.75 | 8.34 | 7.4 × 10⁻¹⁸ |

CEEMDAN-LSTM

| | AQI | IMF1 | IMF2 | IMF3 | IMF4 | IMF5 | IMF6 | Res |
|---|---|---|---|---|---|---|---|---|
| MAE | 22.05 | 20.22 | 5.93 | 1.37 | 0.82 | 0.65 | 4.29 | 9.3 × 10⁻¹⁵ |
| MAPE | 29.5% | 679.3% | 723.9% | 82.4% | 51.9% | 60.4% | 5.1% | |
| MSE | 795.67 | 791.14 | 51.97 | 4.62 | 1.85 | 0.56 | 69.54 | 5.4 × 10⁻³⁵ |
| RMSE | 28.21 | 28.13 | 7.21 | 2.15 | 1.36 | 0.75 | 8.34 | 7.4 × 10⁻¹⁸ |

LSTM and ARMA-GARCH

| | LSTM: AQI | ARMA-GARCH: AQI |
|---|---|---|
| MAE | 36.64 | 21.76 |
| MAPE | 51.9% | 31.4% |
| MSE | 2072.50 | 668.53 |
| RMSE | 45.52 | 25.86 |

In the original table, red values mark the minimum errors under model comparison. The ARMA-GARCH model in this paper is effectively an ARMA model, because no ARCH effect was found.
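The four error measures in Table 7 and the percentage improvements quoted in the abstract can be reproduced with a few lines of code. The sketch below uses the usual textbook definitions of MAE, MAPE, MSE and RMSE and defines improvement as the relative reduction in error against a baseline; these definitions are assumptions, since this section does not restate the formulas. Only the AQI columns of Table 7 are used.

```python
import math


def mae(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be non-zero)."""
    return 100 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    return math.sqrt(mse(y_true, y_pred))


def improvement(baseline_error, model_error):
    """Relative error reduction against a baseline, in percent."""
    return 100 * (baseline_error - model_error) / baseline_error


# AQI columns of Table 7 (MAPE in percent).
PROPOSED = {"MAE": 17.09, "MAPE": 23.2, "MSE": 478.13, "RMSE": 21.87}
BASELINES = {
    "CEEMDAN-LSTM": {"MAE": 22.05, "MAPE": 29.5, "MSE": 795.67, "RMSE": 28.21},
    "LSTM": {"MAE": 36.64, "MAPE": 51.9, "MSE": 2072.50, "RMSE": 45.52},
    "ARMA-GARCH": {"MAE": 21.76, "MAPE": 31.4, "MSE": 668.53, "RMSE": 25.86},
}

gains = {
    name: {m: round(improvement(errors[m], PROPOSED[m]), 1) for m in PROPOSED}
    for name, errors in BASELINES.items()
}
for name, row in gains.items():
    print(name, row)
```

The resulting percentages match those reported in the abstract, for example a 22.5% reduction in MAE relative to CEEMDAN-LSTM.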
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Sun, Y.; Liu, J. AQI Prediction Based on CEEMDAN-ARMA-LSTM. Sustainability 2022, 14, 12182. https://doi.org/10.3390/su141912182