
AQI Prediction Based on CEEMDAN-ARMA-LSTM

Yong Sun ¹ and Jiwei Liu ²,*

¹ Institute of Quantitative and Technological Economics, Chinese Academy of Social Sciences, Beijing 100732, China

² College of Quantitative and Technological Economics, University of Chinese Academy of Social Sciences, Beijing 102488, China

\* Author to whom correspondence should be addressed.

Sustainability 2022, 14(19), 12182; https://doi.org/10.3390/su141912182

Submission received: 18 August 2022 / Revised: 19 September 2022 / Accepted: 21 September 2022 / Published: 26 September 2022

(This article belongs to the Special Issue Aerosols and Air Pollution)


Abstract

In the context of carbon neutrality and air pollution prevention, high-accuracy prediction of the air quality index (AQI) is of great research significance. This paper takes Beijing as the study area, using data from January 2014 to December 2019 as the training set and data from January 2020 to December 2021 as the test set. The CEEMDAN-ARMA-LSTM model constructed in this paper is used for prediction and analysis. CEEMDAN decomposes the data to improve the utilization of the information it contains. The stationary, non-white-noise components are fed into the ARMA model, and the remaining components and the residual are fed into the LSTM model. The results show that this model has the smallest MAE, MAPE, MSE, and RMSE. Compared with the CEEMDAN-LSTM, LSTM, and ARMA-GARCH models, MAE improved by 22.5%, 53.4%, and 21.5%, MAPE improved by 21.4%, 55.3%, and 26.1%, MSE improved by 39.9%, 76.9%, and 28.5%, and RMSE improved by 22.5%, 52.0%, and 15.4%. The accuracy improvement is significant, and the model has good application prospects.

Keywords: CEEMDAN; ARMA-GARCH; LSTM; AQI

1. Introduction

In 2006, the World Health Organization (WHO) conducted air quality tests in many cities around the world, assessing the concentrations of three pollutants (NO₂, SO₂ and PM2.5) in urban air. The results showed that among the cities tested in China, Beijing, Changsha, Shijiazhuang, Linfen and other cities have high pollution levels, which are likely to cause harm to humans and hinder economic development [1,2]. In recent years, many air pollution prevention and control initiatives have been undertaken at home and abroad to improve the situation [3].

At present, in the context of pollution prevention and control and carbon neutrality, achieving high-precision prediction of the air quality index (AQI) is an important research topic with positive significance for urban development and public health. In recent years, air pollution, with PM2.5 as the main pollutant, has worsened, and hazy weather has appeared in most areas of China. Air quality monitoring in several cities across the country has issued severe pollution warnings, and air pollution has become a key environmental issue of social concern. The Beijing-Tianjin-Hebei region is a focus of national regional development, and accurate prediction of the AQI in this region is of great research significance for its green and sustainable development. The AQI reflects the dynamic trend of air pollution and provides data support for the implementation of specific measures to mitigate air pollution. However, because the AQI is stochastic and non-stationary, predictions often have low accuracy and poor stability. Moreover, the atmosphere is a very complex dynamic system whose trend is easily affected by pollutant concentrations, a variety of meteorological factors and other factors, which makes it difficult to model [4,5]. Therefore, the accurate prediction of AQI is a challenging and important task.

The main time series prediction models used in the literature are as follows.

(1) Traditional models: OLS, GM(1,1) [6], MM5-CAMx [7], ARMA and ARIMA [8]. Among them, ARMA is an important method for studying time series. It is a “hybrid” of an autoregressive (AR) model and a moving average (MA) model and needs only endogenous variables, without the help of exogenous variables. Tan used air quality monitoring data from 51 monitoring stations and meteorological data in Hubei Province throughout 2016 to model the PM2.5 concentration of each city in Hubei Province using the ARMA method and the stepwise regression method [9]. Li et al. developed an ARMA prediction model using data on 2340 hazardous materials accidents that occurred during road transportation in China from 2013 to 2019 [10]. Zhou used the ARIMA model to predict the grain yield in China with high accuracy [11]. These studies showed that the established ARMA models fit well and can predict time series accurately.

(2) Machine learning models: SVM, BP, and the long short-term memory (LSTM) neural network, which was proposed by Hochreiter et al. in 1997. LSTM has a special recurrent structure that avoids the vanishing gradient problem and can therefore learn data sequences with a long time span. The LSTM neural network is particularly suitable for air quality prediction because current AQI values are often correlated with historical AQI values, owing to the condensation and accumulation of air pollutants such as PM2.5 in the atmospheric environment. Zeng et al. studied air quality data in Beijing from 2018 to 2020. Based on a correlation analysis of pollutant concentrations, they developed a recurrent neural network model based on the LSTM algorithm to predict the Beijing AQI, and the model achieved high prediction accuracy [12]. Yan et al. compared CNN, LSTM, and CNN-LSTM for multi-hour and multi-site AQI forecasting in Beijing. Their results indicate that LSTM is the best model for multi-hour forecasting [13]. However, its accuracy depends on a large amount of data, and the model is a “black box” that is hard to interpret [14].

Combining traditional econometric models such as ARMA and machine learning models such as LSTM, SVR and BP neural network can compensate for each other’s defects and further improve the prediction accuracy [15,16,17]. Therefore, this paper tries to combine ARMA and LSTM models to improve AQI prediction accuracy.

Second, the AQI time series of each city fluctuates more seriously, showing disorder, chaos and non-stationary states. If the unprocessed AQI data are used directly for forecasting, it will interfere with the prediction results of the subsequent forecasting models because of such data fluctuations. If such fluctuations are removed, an amount of data information will be lost. Severe weather such as rain and snow can cause large fluctuations in the time series data, but this particular variation plays a key role in the subsequent data analysis. Using decomposition methods such as EMD, EEMD, and CEEMDAN can effectively separate the time series into high-frequency to low-frequency data, which in turn preserves such data fluctuations, improves data utilization, and ultimately improves prediction model performance [18,19]. Therefore, in this paper, we choose the CEEMDAN decomposition method to decompose time series data such as AQI into multiple components (intrinsic mode function (IMF)) with different frequencies with periodic trends and volatility trends of random factors [20,21], input them into the forecasting model for forecasting, and finally carry out the integrated averaging process.

The subsequent chapters of this paper are laid out as follows: Section 2 constructs and introduces the CEEMDAN-ARMA-LSTM in the study method and describes the model performance testing scheme and data sources. Section 3 analyzes the performance of the CEEMDAN-ARMA-LSTM model and analyzes the application implications of each part of the CEEMDAN-ARMA-LSTM model by comparing it with the LSTM, CEEMDAN-LSTM, and ARMA-GARCH models. Section 4 is the conclusion.

2. Materials and Methods

2.1. Methods

The study uses the CEEMDAN algorithm to decompose the time series data and extract the information. The multiple time series obtained from the decomposition are then fed into multiple ARMA-LSTM models constructed separately and predicted. Finally, the final prediction results are obtained by summation. The data science tool used in this paper is Jupyter 6.4.6. The methods used in this paper are described in detail as follows.

2.1.1. CEEMDAN

Complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) is a decomposition method with adaptive white noise, built on EMD and EEMD and proposed by Torres et al. [22]. To alleviate the influence of mode mixing on the decomposition results and to solve the problem that the sum of the EEMD components does not equal the original sequence, CEEMDAN adds adaptive white noise sequences at each stage of the decomposition. This alleviates mode mixing, eliminates the influence of the artificially added white noise on the completeness of the original sequence, and reduces the data reconstruction error. Decomposing the original data with CEEMDAN turns disordered and chaotic data into multiple relatively regular components of different frequencies (IMFs) and a residual component (Res). This makes the non-stationary sequence relatively stationary and screens and groups regular information, which can effectively improve the performance of subsequent model prediction.

The steps of CEEMDAN decomposition for any time series $AQI(t)$ are shown below.

Step 1: Determine the number J of Gaussian white noise sequences $\omega_{i}(t)$ to be added in this cycle and the noise coefficient $\varepsilon$.

Step 2: Superimpose different Gaussian white noise sequences on the original data to obtain J time series.

$$AQI\_Gauss_{i}(t) = AQI(t) + \varepsilon\,\omega_{i}(t) \tag{1}$$

where $\omega_{i}(t)$ is the Gaussian white noise sequence, and $AQI\_Gauss_{i}(t)$ is the generated sequence after superposition.

Step 3: Calculate the IMF components for the obtained J time series.

$$h_{i}(t) = AQI\_Gauss_{i}(t) - n_{i}(t) \tag{2}$$

$$n_{i}(t) = \frac{max_{i}(t) + min_{i}(t)}{2} \tag{3}$$

where $max_{i}(t)$ and $min_{i}(t)$ are the time series of local maxima and local minima of $AQI\_Gauss_{i}(t)$, respectively, and form the upper and lower envelopes. $n_{i}(t)$ is the mean envelope. $h_{i}(t)$ is the intermediate signal.

Step 4: Determine whether each intermediate signal $h_{i}(t)$ satisfies the two constraints of an intrinsic mode function.

(i) Over the whole data segment, the number of extreme points and the number of zero crossings must be equal or differ by at most one.

(ii) At any moment, the mean of the upper envelope formed by the local maxima and the lower envelope formed by the local minima is zero, i.e., the upper and lower envelopes are locally symmetric with respect to the time axis.

If the constraints are satisfied, the signal $h_{i}(t)$ is an IMF1 component, denoted IMF1$_{i}$ (in later cycles it is denoted IMF2$_{i}$, and so on, as in steps 5 and 6).

If the constraints are not satisfied, the signal $h_{i}(t)$ is denoted as the residual component.

Step 5: Determine the number of extreme points of the residual component.

For each of the J constructed time series, repeat $C_{j}$ times. If the number of extreme points of the residual component decreases to a certain number (no more than 2), the decomposition ends. Otherwise, assign $AQI\_Gauss_{i}(t) = h_{i}(t)$ and repeat steps 3 to 5.

At the end of the loop in step 5, several residual components and P $IMF1_{j}(t)$ components are obtained, and the final $IMF1(t)$ is expressed as follows.

$$IMF1(t) = \sum_{p=1}^{P} \sum_{c=1}^{C_{p}} IMF1_{p,c}(t) \tag{4}$$

where $P \leq J$.

Step 6: Repeat steps 1 to 5 with the residual component $r_{1}(t) = AQI(t) - IMF1(t)$ as the original signal until the number of extreme points of $r_{1}(t)$ is reduced to a certain number; then, CEEMDAN is completely finished.

CEEMDAN is completed after several large cycles of steps 1 to 5. Finally, the original signal is decomposed into Q IMF components and 1 final residual component ($Q \leq J$).

$$AQI(t) = \sum_{q=1}^{Q} IMFq(t) + Res(t) \tag{5}$$

where Q denotes the total number of cycle iterations of steps 1 to 5, and $Res(t)$ is the final residual component.
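Some building blocks of the steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`add_noise_copies`, `satisfies_imf_count_rule`, `reconstruction_error`), the noise coefficient 0.2, and the toy series are all made up here, and the envelope interpolation and sifting loop of a full CEEMDAN are omitted.

```python
import random

def add_noise_copies(aqi, eps, J, seed=0):
    """Eq. (1): build J noisy copies AQI(t) + eps * w_i(t), with w_i ~ N(0, 1)."""
    rng = random.Random(seed)
    return [[x + eps * rng.gauss(0.0, 1.0) for x in aqi] for _ in range(J)]

def count_extrema(h):
    """Count strict interior local maxima and minima of a sequence."""
    return sum(
        1 for k in range(1, len(h) - 1)
        if (h[k] > h[k - 1] and h[k] > h[k + 1])
        or (h[k] < h[k - 1] and h[k] < h[k + 1])
    )

def count_zero_crossings(h):
    """Count sign changes between consecutive samples."""
    return sum(1 for a, b in zip(h, h[1:]) if a * b < 0)

def satisfies_imf_count_rule(h):
    """Condition (i) of Step 4: the two counts differ by at most one."""
    return abs(count_extrema(h) - count_zero_crossings(h)) <= 1

def reconstruction_error(aqi, imfs, res):
    """Eq. (5): largest gap between AQI(t) and the sum of all IMFs plus Res."""
    return max(abs(x - (sum(col) + r)) for x, col, r in zip(aqi, zip(*imfs), res))

# Toy data (hypothetical values, not the Beijing series).
aqi = [120.0, 95.0, 80.0, 110.0, 130.0, 90.0]
copies = add_noise_copies(aqi, eps=0.2, J=5)

oscillation = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]   # symmetric about zero
shifted = [3.0, 5.0, 3.0, 5.0, 3.0, 5.0]          # oscillates but never crosses zero

# Trivial "decomposition" used only to exercise Eq. (5): one component plus a constant.
mean_level = sum(aqi) / len(aqi)
toy_imf = [x - mean_level for x in aqi]
toy_res = [mean_level] * len(aqi)
err = reconstruction_error(aqi, [toy_imf], toy_res)
```

The `shifted` series fails condition (i) because it has four interior extrema but no zero crossings, which is the kind of signal that needs further sifting before it can count as an IMF.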

2.1.2. ARMA Model

In this paper, the stationary, non-white-noise IMF components are modeled and predicted one by one using the ARMA model, and the non-stationary or white-noise IMF components, as well as the residual component, Res, are modeled and predicted one by one using an LSTM neural network.

The ARMA model (autoregressive moving average model) is an important method for studying time series and consists of a “mixture” of an autoregressive model (AR model) and a moving average model (MA model). The order (p,q) of the ARMA model can be determined from the autocorrelation (ACF) and partial autocorrelation (PACF) plots, or by first building ARMA models with different parameters and selecting the best-performing one using criteria such as AIC and BIC. To make better use of the information in the residuals of the ARMA model, this paper further improves the forecasting method by combining it with the GARCH (generalized autoregressive conditional heteroskedasticity) model, which is an extension of the ARCH model. GARCH further models the variance of the errors and is particularly suitable for volatility analysis and forecasting.

For example, an ARMA(p, q)-GARCH(h, b) model with multiple stationary IMFs as inputs takes the following form.

$$IMFi(t) = \alpha_{i}(0) + \sum_{m=1}^{q} \alpha_{i}(m)\, IMFi(t-m) + \sum_{n=1}^{p} \beta_{i}(n)\, e_{i}(t-n) \tag{6}$$

$$e_{i}(t) = \sigma_{i}(t)\, \varepsilon_{i}(t) \tag{7}$$

$$\sigma_{i}(t)^{2} = \alpha_{0} + \sum_{m=1}^{h} \alpha_{m}\, \sigma_{i}(t-m)^{2} + \sum_{n=1}^{b} \beta_{n}\, e_{i}(t-n)^{2} \tag{8}$$

where i is the ordinal number of the IMF component. ($p, q$) and ($h, b$) are the orders of the ARMA and GARCH processes, respectively. $\alpha$ and $\beta$ are the coefficients. $\varepsilon_{i}(t)$ is independent and identically distributed. $\sigma_{i}(t)^{2}$ is the conditional variance, and $e_{i}(t)$ is the residual of the conditional mean equation.
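To make the recursions in Eqs. (6) and (8) concrete, the following sketch evaluates one step of each for given coefficients. It is an illustration only: the function names, argument names and all numbers are hypothetical, and in practice the coefficients would be estimated by a statistics package rather than supplied by hand.

```python
def arma_one_step(history, residuals, const, lag_coefs, resid_coefs):
    """Right-hand side of Eq. (6): const + sum a_m * x(t-m) + sum b_n * e(t-n).

    history and residuals are ordered oldest to newest, so history[-1] is x(t-1).
    """
    lag_part = sum(a * history[-m] for m, a in enumerate(lag_coefs, start=1))
    resid_part = sum(b * residuals[-n] for n, b in enumerate(resid_coefs, start=1))
    return const + lag_part + resid_part

def garch_variance(prev_variances, prev_residuals, const, var_coefs, sq_resid_coefs):
    """Right-hand side of Eq. (8): const + sum a_m * sigma^2(t-m) + sum b_n * e(t-n)^2."""
    var_part = sum(a * prev_variances[-m] for m, a in enumerate(var_coefs, start=1))
    resid_part = sum(
        b * prev_residuals[-n] ** 2 for n, b in enumerate(sq_resid_coefs, start=1)
    )
    return const + var_part + resid_part

# Hypothetical coefficients and data, chosen only so the arithmetic is easy to follow.
mean_forecast = arma_one_step(
    history=[1.0, 2.0], residuals=[0.5, -0.5],
    const=0.1, lag_coefs=[0.5, 0.2], resid_coefs=[0.4],
)
variance_forecast = garch_variance(
    prev_variances=[1.0], prev_residuals=[2.0],
    const=0.1, var_coefs=[0.8], sq_resid_coefs=[0.1],
)
```

With these inputs the mean recursion gives 0.1 + (0.5·2.0 + 0.2·1.0) + 0.4·(−0.5) = 1.1, and the variance recursion gives 0.1 + 0.8·1.0 + 0.1·2.0² = 1.3.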

2.1.3. LSTM

Human thinking has persistence: each new thought builds on earlier ones. Traditional neural networks cannot do this, which is a major drawback. For example, suppose you want to classify the type of event occurring at each point in a movie. It is not clear how a traditional neural network could use its reasoning about earlier events to inform later ones. Recurrent neural networks (RNNs) solve this problem. They are networks with loops that allow information to persist, and they have achieved great success in research areas such as speech recognition, language modeling, and translation. LSTMs are a special type of RNN that can learn long-term dependencies [23]. They perform very well on a wide variety of problems and are now widely used.

The key to the LSTM is the cell state, the horizontal line at the top of Figure 1, through which the original information is transmitted unchanged. The gates are a transfer structure that selectively adds information, and they consist of sigmoid neural network layers and point-by-point multiplication operations, where the sigmoid layer outputs values between 0 and 1, which determine the proportion of information passed.

The structure of the LSTM neural network is shown in Figure 1, and its components and their roles are described below [24].

(1) Forget gate, $f_{t}$: determines how much of the previous cell state $C_{t-1}$ is passed on to the current cell state $C_{t}$; the information to be discarded is determined by $f_{t}$ and $C_{t-1}$ together.

$$f_{t} = \sigma(W_{f} \cdot [h_{t-1}, X_{t}] + b_{f}) \tag{9}$$

(2) Input gate: determines the part of the input value $X_{t}$ that is retained in the current cell state $C_{t}$ at the current moment and updates the memory cell state.

$$i_{t} = \sigma(W_{i} \cdot [h_{t-1}, X_{t}] + b_{i}) \tag{10}$$

$$C_{t} = f_{t} \cdot C_{t-1} + i_{t} \cdot \tanh(W_{c} \cdot [h_{t-1}, X_{t}] + b_{c}) \tag{11}$$

(3) Output gate: determines the part of the current cell state $C_{t}$ that can be used as an output value and generates an output using the new control parameter C.

$$o_{t} = \sigma(W_{o} \cdot [h_{t-1}, X_{t}] + b_{o}) \tag{12}$$

(4) Final output:

$$h_{t} = o_{t} \cdot \tanh(C_{t}) \tag{13}$$

In (1) to (4), $X_{t}$ and $h_{t}$ represent the input and output values at moment t, respectively. $i_{t}$ is the new information retained, and $C_{t}$ is the control parameter C formed by the new data. $\sigma$ is the sigmoid function, tanh is the hyperbolic tangent activation function, $W_{f}$, $W_{i}$, $W_{c}$ and $W_{o}$ are the corresponding weight matrices, and $b_{f}$, $b_{i}$, $b_{c}$ and $b_{o}$ are the corresponding bias terms.
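Equations (9) to (13) can be traced with a single scalar LSTM step. The sketch below is a simplified illustration, assuming a one-dimensional hidden state so that each weight "matrix" reduces to a pair of numbers; the function name `lstm_step` and the dictionary layout of `W` and `b` are choices made here, not part of the article.

```python
import math

def sigmoid(z):
    """Logistic function, with outputs strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One scalar LSTM step.

    W[g] = (weight on h_prev, weight on x_t) and b[g] = bias,
    for g in {"f", "i", "c", "o"}.
    """
    def affine(g):
        return W[g][0] * h_prev + W[g][1] * x_t + b[g]

    f_t = sigmoid(affine("f"))                          # Eq. (9): forget gate
    i_t = sigmoid(affine("i"))                          # Eq. (10): input gate
    c_t = f_t * c_prev + i_t * math.tanh(affine("c"))   # Eq. (11): cell state update
    o_t = sigmoid(affine("o"))                          # Eq. (12): output gate
    h_t = o_t * math.tanh(c_t)                          # Eq. (13): output value
    return h_t, c_t

# With all weights and biases at zero, every gate outputs sigmoid(0) = 0.5.
zero_W = {g: (0.0, 0.0) for g in "fico"}
zero_b = {g: 0.0 for g in "fico"}
h_t, c_t = lstm_step(x_t=1.0, h_prev=0.0, c_prev=2.0, W=zero_W, b=zero_b)
```

In this zero-weight case the forget gate keeps half of the previous cell state (2.0 becomes 1.0), the candidate term contributes nothing, and the output is half of tanh(1.0).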

2.1.4. CEEMDAN-ARMA-LSTM

Step 1: CEEMDAN decomposition is performed on the AQI time series to obtain multiple IMF components and one residual component, Res.

Step 2: An ADF test (stationarity) and an LB test (white noise) are performed on each component.

Step 3: The stationary, non-white-noise components are input into ARMA; the remaining components are input into LSTM.

Step 4: All the predicted results are summed up, which is the final result.

The flow chart of the above steps is shown in Figure 2.

Specifically, in step 1, the CEEMDAN model decomposes the raw AQI data into multiple signals (the IMFs) and a residual, Res. Then, in step 2, the IMFs and Res are subjected to the ADF test and the LB test, which act as a conditional judgment: the components that pass both are later input into the ARMA model, and the rest are input into the LSTM model. Next, in step 3, the ARMA or LSTM model outputs the prediction for each IMF or Res component according to the assignment from steps 1 and 2. Last, in step 4, all predicted components are summed. The result is the final prediction.
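The conditional judgment in step 2 and the summation in step 4 can be sketched as follows. This is an assumed reading of the workflow: the 0.05 significance level matches the threshold used later for the ADF test, but applying the same level to the LB test, the function names, and every p-value and forecast below are hypothetical.

```python
def route_components(p_values, alpha=0.05):
    """Assign each component to 'ARMA' or 'LSTM'.

    p_values maps a component name to (adf_p, lb_p).
    ADF p < alpha is read as stationary; LB p < alpha is read as not white noise.
    """
    routes = {}
    for name, (adf_p, lb_p) in p_values.items():
        stationary = adf_p < alpha
        non_white_noise = lb_p < alpha
        routes[name] = "ARMA" if stationary and non_white_noise else "LSTM"
    return routes

def combine_forecasts(component_forecasts):
    """Step 4: sum the component forecasts time step by time step."""
    return [sum(values) for values in zip(*component_forecasts.values())]

# Hypothetical test results: one clean component, one non-stationary, one white noise.
routes = route_components({
    "IMF1": (0.001, 0.001),
    "IMF4": (0.400, 0.001),
    "Res": (0.001, 0.600),
})

# Hypothetical two-step forecasts for two components.
total = combine_forecasts({"IMF1": [1.0, 2.0], "Res": [0.5, 0.5]})
```

A component goes to ARMA only when both tests are passed; failing either one sends it to the LSTM, so no component is discarded.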

2.2. Data Source

The air quality index (AQI) is an important indicator that describes the cleanliness or pollution level of air and its health effects. The AQI expresses six pollutant items (see Table 1) on a unified evaluation scale. The AQI data used in this article come from the real-time national urban air quality release platform of the China National Environmental Monitoring Centre.

The air quality index, AQI, is the maximum value of the individual air quality sub-indices, IAQI.

$$IAQI = \frac{I_{high} - I_{low}}{C_{high} - C_{low}} (C - C_{low}) + I_{low} \tag{14}$$

where C is the pollutant concentration and is an input value; $C_{low}$ is the concentration limit less than or equal to C; $C_{high}$ is the concentration limit greater than or equal to C; $I_{low}$ is the index limit corresponding to $C_{low}$; and $I_{high}$ is the index limit corresponding to $C_{high}$. All four limits are constants.

Please see Table 1 for the AQI concentration limit of pollutant items.
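Equation (14) is a piecewise linear interpolation between breakpoints, which the sketch below makes explicit. The breakpoint list is an illustrative assumption for a single pollutant (24-hour PM2.5, in µg/m³) and should be replaced by the limits in Table 1; the function names `iaqi` and `aqi` are likewise chosen here for illustration.

```python
# Illustrative breakpoints (C_low, C_high, I_low, I_high) for 24-hour PM2.5.
# These values are an assumption for this sketch; use the limits from Table 1.
PM25_BREAKPOINTS = [
    (0, 35, 0, 50),
    (35, 75, 50, 100),
    (75, 115, 100, 150),
    (115, 150, 150, 200),
    (150, 250, 200, 300),
    (250, 350, 300, 400),
    (350, 500, 400, 500),
]

def iaqi(c, breakpoints):
    """Eq. (14): interpolate the sub-index for concentration c."""
    for c_low, c_high, i_low, i_high in breakpoints:
        if c_low <= c <= c_high:
            return (i_high - i_low) / (c_high - c_low) * (c - c_low) + i_low
    raise ValueError("concentration outside the breakpoint table")

def aqi(sub_indices):
    """The AQI is the largest of the pollutant sub-indices."""
    return max(sub_indices)

pm25_index = iaqi(55, PM25_BREAKPOINTS)   # falls in the 35 to 75 segment
overall = aqi([pm25_index, 40.0, 62.0])   # other sub-indices are made-up values
```

For a concentration of 55 the interpolation gives (100 − 50)/(75 − 35) · (55 − 35) + 50 = 75, and because 75 is the largest sub-index in this made-up set, it becomes the AQI.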

The study area is Beijing, China. It is the capital of the People’s Republic of China and, as approved by the State Council, the country’s political center, cultural center, international communication center, and science and technology innovation center [25]. As of 2020, the city has 16 districts under its jurisdiction, with a total area of 16,410.54 square kilometers. According to the seventh census data, as of 1 November 2020, the resident population of Beijing was 21,893,095 people [26]. The topography of Beijing is high in the northwest and low in the southeast. It is surrounded by mountains in the west, north and northeast, with a plain that slopes gently toward the Bohai Sea in the southeast. Beijing is ranked as one of the top cities in the world by GaWC, a world city research institute, and the United Nations reports that Beijing ranks second in China in terms of the human development index [27]. In 2020, Beijing achieved an annual gross regional product of CNY 3610.26 billion, an increase of 1.2% over the previous year in comparable prices [28]. Because Beijing is a political, economic and cultural center with a high degree of industrialization and urbanization, air quality has always been a concern.

Monthly AQI values are averages of the daily values, which are in turn calculated from hourly data. Since the data for China have been released since December 2013, monthly AQI data from 2014 to 2021 were chosen to ensure a continuous and consistent series, as shown in Figure 3. The vertical axis indicates the AQI value. The horizontal axis indicates the months.
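The monthly aggregation can be sketched as a simple group-and-average over daily values. The input layout (ISO date strings mapped to daily AQI), the function name and the numbers are assumptions for illustration; the article does not describe its data-handling code.

```python
from collections import defaultdict

def monthly_mean(daily_aqi):
    """Average daily AQI values by calendar month.

    daily_aqi maps 'YYYY-MM-DD' strings to daily AQI values;
    the result maps 'YYYY-MM' strings to the monthly mean, in date order.
    """
    buckets = defaultdict(list)
    for day, value in daily_aqi.items():
        buckets[day[:7]].append(value)   # 'YYYY-MM' prefix identifies the month
    return {month: sum(vals) / len(vals) for month, vals in sorted(buckets.items())}

# Hypothetical daily values.
monthly = monthly_mean({
    "2014-01-01": 100.0,
    "2014-01-02": 200.0,
    "2014-02-01": 50.0,
})
```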

2.3. Model Performance Testing Criteria

To test the model performance, the sample was divided into two parts. Monthly AQI data from January 2014 to December 2019 were used as the training sample, and data from January 2020 to December 2021 were used as the test sample. Root-mean-square error (RMSE), mean square error (MSE), mean absolute percentage error (MAPE), and mean absolute error (MAE) were used as model performance evaluation criteria.

$$RMSE = \left(\frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}\right)^{\frac{1}{2}} \tag{15}$$

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2} \tag{16}$$

$$MAPE = \frac{1}{n} \sum_{i=1}^{n} \frac{|y_{i} - \hat{y}_{i}|}{y_{i}} \tag{17}$$

$$MAE = \frac{1}{n} \sum_{i=1}^{n} |y_{i} - \hat{y}_{i}| \tag{18}$$

where $y_{i}$ is the true value of the i-th sample of the validation set, $\hat{y}_{i}$ is the predicted value of the i-th sample of the validation set, and n is the sample size of the validation set.

To compare the models more intuitively, the ARMA-GARCH, LSTM neural network, and CEEMDAN-LSTM neural network models were chosen as benchmarks for the model employed in this paper. These three models were chosen to analyze the necessity of each part of the CEEMDAN-ARMA-LSTM model, i.e., whether each part serves to improve prediction accuracy.
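The four criteria in Eqs. (15) to (18) translate directly into code. The sketch below follows the equations as written (in particular, MAPE is returned as a fraction rather than a percentage); the function names and sample values are hypothetical.

```python
import math

def mse(y, y_hat):
    """Eq. (16): mean of squared errors."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Eq. (15): square root of the MSE."""
    return math.sqrt(mse(y, y_hat))

def mape(y, y_hat):
    """Eq. (17): mean absolute error relative to the true value (as a fraction)."""
    return sum(abs(a - b) / a for a, b in zip(y, y_hat)) / len(y)

def mae(y, y_hat):
    """Eq. (18): mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

# Hypothetical true and predicted AQI values.
y_true = [100.0, 200.0]
y_pred = [110.0, 180.0]
scores = {
    "MAE": mae(y_true, y_pred),
    "MAPE": mape(y_true, y_pred),
    "MSE": mse(y_true, y_pred),
    "RMSE": rmse(y_true, y_pred),
}
```

For these two points the errors are −10 and 20, so MAE is 15, MSE is 250, RMSE is the square root of 250, and MAPE is 0.1 because both relative errors equal 10%.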

3. Results

3.1. AQI Data Decomposition Based on CEEMDAN

After CEEMDAN decomposition, six IMF components and one residual, Res, are obtained, as shown in Figure 4. IMF1 to IMF5 characterize the cyclical patterns of AQI, and IMF6 characterizes its overall decreasing trend. The part of the AQI that cannot be decomposed into regular sequences is recorded as the residual, Res. It contains more subtle and unordered information, so its values are usually very small. The vertical axis indicates the decomposed AQI value. The horizontal axis indicates the year.

3.2. ARMA-LSTM Result

3.2.1. Applicable Model Screening

ARMA modeling can be performed only when the series is stationary and is not white noise. Therefore, a stationarity test is performed for the IMF components and the Res component. In this section, the ADF test is used, with the lag order determined by the AIC criterion. The test results are shown in Table 2.

As can be seen from Table 2, the p-values of IMF1 to 3 are all much less than 0.05 in the ADF test, which means that IMF1 to 3 are stationary series. Additionally, IMF1 to 3 are non-white-noise series under a lag order of 30. Therefore, IMF1 to 3 can be used directly for ARMA modeling.

Secondly, the p-values of IMF4 to 6 in the ADF test are all much larger than 0.05, which means that IMF4 to 6 are non-stationary series; they cannot be used directly for ARMA modeling and would need to be differenced. Res passes the ADF test and is a stationary series, but according to the results of the LB test, Res is a white noise series, so the ARMA model is not applicable to it.

Finally, it should be noted that discarding the Res data or differencing the IMF data would lose information contained in the time series components extracted by CEEMDAN. This section also attempted to difference IMFs 4 to 6, but problems such as high differencing orders and non-convergence still occurred during data processing and modeling, as shown in Table 3.

In summary, the ARMA model was used for IMFs 1 to 3. The LSTM model was used for IMFs 4–6 and the Res component.

3.2.2. ARMA Construction

This section uses the ACF and PACF to select the optimal ARMA process based on the AIC and BIC criteria. The ACF and PACF diagrams for IMF1 to 3 are shown in Figure 5, Figure 6 and Figure 7. The vertical axis indicates PACF or ACF. The horizontal axis indicates the model lag order.

Combining the numerical calculation results of ACF and PACF, the ARMA models and fitting results that best fit IMF 1 to 3 based on the AIC and BIC criteria are shown in Table 4 and Table 5.

As can be seen from Table 4, the AR(2) model built for IMF1 has a low goodness of fit, but the models built for IMF2 and IMF3 fit better, with a goodness of fit of 0.88 or more. As can be seen from Table 5, almost all of the main explanatory variables in the models constructed for IMF1 to 3 are significant. Only MA(1) in the ARMA(4,2) model constructed for IMF3 is insignificant.

Overall, the three models are acceptable and can be used for subsequent predictions.

3.2.3. LSTM Neural Network Settings

The LSTM used in this section was modeled with a time window of 2, i.e., two consecutive observations were used as input variables, and the observation that follows them was used as the predicted variable.

The neural network was structured using a 2-layer LSTM neural network. The initial number of neurons was set to 4, and the output vector of the neurons in the previous layer was used as the input vector in the next layer. Finally, a fully connected layer was added to make the input vector return to the desired output vector.

The model was trained by error back propagation for 10,000 iterations, using the Adam optimizer with a learning rate of 0.01. Each iteration updates all parameters and then resets the gradients to zero. The data were normalized using min-max normalization.
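The data preparation described in this section (min-max normalization and a time window of 2) can be sketched without any deep learning library. The function names and toy values are assumptions for illustration; the network itself and its training loop are not reproduced here.

```python
def min_max_scale(values):
    """Min-max normalization: map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def make_windows(series, window=2):
    """Build (inputs, targets): each input holds `window` consecutive values,
    and the target is the value that immediately follows them."""
    count = len(series) - window
    inputs = [series[k:k + window] for k in range(count)]
    targets = [series[k + window] for k in range(count)]
    return inputs, targets

# Hypothetical series.
scaled = min_max_scale([2.0, 4.0, 6.0])
inputs, targets = make_windows([1, 2, 3, 4], window=2)

# A 72-month training set with a window of 2 yields 70 training samples.
n_train_samples = len(make_windows(list(range(72)), window=2)[0])
```

The small number of samples this produces from monthly data is relevant to the later discussion of why the standalone LSTM may be short of training data.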

3.3. Analysis of Prediction Results

The ARMA-LSTM model constructed in Section 3.2 was used to forecast AQI in Beijing from January 2020 to December 2021.

The prediction results of AQI and its components with basic statistics are shown in Table 6.

According to Table 6, the predictions of CEEMDAN-ARMA-LSTM have smaller maxima and larger minima than the real AQI values, and the overall distribution interval is narrowed. The same situation is observed for the components IMF1, IMF2, IMF3, and IMF5 and the residual Res. However, IMF4 and IMF6 show smaller minimum values and larger maximum values, and the overall distribution interval is expanded. Overall, the components IMF1, IMF2, IMF3, and IMF5 better reflect the overall change of AQI in Beijing from 2014 to 2021. Second, it also indicates that the model used in this paper weakens the “shock” information of the time series data to a certain extent, although it is not obvious, and this is something that needs to be further improved in the subsequent study.

To further analyze the model performance, the accuracy comparison results are shown in Table 7 with reference to the benchmark models listed in Section 2.3. Figure 8 compares the real AQI values with the predicted AQI values obtained from the different models. The vertical axis indicates the AQI value. The horizontal axis indicates the year.

According to Table 7, the CEEMDAN-ARMA-LSTM model used in this paper performs best. The model has the smallest MAE, MAPE, MSE, and RMSE in the simulation experiment of predicting the Beijing AQI from January 2020 to December 2021. Compared with the three models CEEMDAN-LSTM, LSTM and ARMA-GARCH, MAE improved by 22.5%, 53.4% and 21.5%, MAPE improved by 21.4%, 55.3% and 26.1%, MSE improved by 39.9%, 76.9% and 28.5%, and RMSE improved by 22.5%, 52.0% and 15.4%. Accuracy improvements were significant.
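The article does not state how the improvement percentages are computed. A common reading, assumed here, is the relative reduction in error with respect to the benchmark model. The sketch below uses hypothetical error values and also shows a consistency check that follows from RMSE being the square root of MSE.

```python
def improvement_pct(baseline_error, model_error):
    """Relative error reduction versus a benchmark, in percent (assumed definition)."""
    return (baseline_error - model_error) / baseline_error * 100.0

def mse_improvement_from_rmse(rmse_improvement_pct):
    """If RMSE falls by r percent, MSE falls by 1 - (1 - r)^2, since MSE = RMSE^2."""
    r = rmse_improvement_pct / 100.0
    return (1.0 - (1.0 - r) ** 2) * 100.0

# Hypothetical errors: benchmark RMSE 10.0, proposed model RMSE 7.75.
rmse_gain = improvement_pct(10.0, 7.75)
implied_mse_gain = mse_improvement_from_rmse(22.5)
```

Under this reading, the reported RMSE improvement of 22.5% implies an MSE improvement of about 39.9%, and 52.0% implies about 77.0%, which agrees closely with the reported MSE figures of 39.9% and 76.9%.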

Specifically, the CEEMDAN-LSTM has smaller errors in all categories compared to the LSTM. This result demonstrates that CEEMDAN improves data utilization by decomposing the AQI time series. Decomposition methods such as CEEMDAN continuously extract the various scale components that make up the original signal from high to low frequencies to strengthen and separate each frequency feature, which in turn improves the efficiency of model simulation, training, and prediction sessions to capture the patterns.

Second, the LSTM neural network has larger errors in all categories than the ARMA-GARCH model. This result indicates that the ARMA model is more applicable to time series with strong periodicity and short length. Since this paper uses monthly AQI data with a sample size of 96, the data may not meet the “big data” requirements of machine learning. This is also a possible reason for the low accuracy of the LSTM model.

Finally, the CEEMDAN-ARMA-LSTM model has higher accuracy than the CEEMDAN-LSTM model. Since ARMA cannot handle all CEEMDAN components, the CEEMDAN-ARMA-LSTM model ensures that all components are used by feeding the non-stationary and white-noise components into the LSTM. A proper combination of traditional models with machine learning models such as neural networks can effectively improve the prediction accuracy. Additionally, this approach has the advantage that the interpretability of the model can be preserved when it is applied in multivariate contexts.

4. Conclusions

In this paper, the CEEMDAN-ARMA-LSTM model is constructed by integrating the CEEMDAN, ARMA and LSTM neural networks with the objective of improving the prediction accuracy of monthly AQI data in Beijing. Through the accuracy test, CEEMDAN effectively captures and separates the potential amount of information contained in the data. The accuracy of the CEEMDAN-ARMA-LSTM model is relatively high and stable and has good application prospects. The study of accuracy improvement can continue to be enhanced in the subsequent research.

The empirical study in this paper is only one of the application areas of the CEEMDAN-ARMA-LSTM model. The method is expected to be applicable to the fields of medicine, public health, economics and sociology. For example, one could estimate the historical long-term PM2.5 concentration time series of the raster cell where a subject lives and combine it with health effect indicators, such as the morbidity, mortality and outpatient rate of the specific diseases under study, to conduct chronic health effect studies and add evidence for the causal relationship between long-term PM2.5 exposure and disease. One could also estimate historical provincial GDP, EPU and carbon sink data and combine them with statistical yearbook data to establish econometric models for analyzing urbanization promotion, urban cluster development, and the effectiveness of transforming the economic development mode.

Author Contributions

Y.S. and J.L. set up the problem, computed the details and polished the draft. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.


Figure 1. LSTM Structure Schematic. Figure from https://colah.github.io/posts/2015-08-Understanding-LSTMs/.

Figure 2. Flowchart of CEEMDAN-ARMA-LSTM algorithm.

Figure 3. AQI data from 2014 to 2021 in Beijing.

Figure 4. CEEMDAN Results.

Figure 5. Distribution of ACF and PACF of IMF1.

Figure 6. Distribution of ACF and PACF of IMF2.

Figure 7. Distribution of ACF and PACF of IMF3.

Figure 8. Line Graph of Prediction Results of Each Model.

Table 1. AQI: concentration limit of pollutant items.

Averaging period	24-h	24-h	24-h	24-h	24-h	8-h
IAQI	SO₂	NO₂	PM₂.₅	PM₁₀	CO	O₃
Unit	μg/m³	μg/m³	μg/m³	μg/m³	mg/m³	μg/m³
0	0	0	0	0	0	0
50	50	40	35	50	2	100
100	150	80	75	150	4	160
150	475	180	115	250	14	215
200	800	280	150	350	24	265
300	1600	565	250	420	36	800
400	2100	750	350	500	48	Note 3
500	2620	940	500	600	60	Note 3
Averaging period	1-h (Note 1)	1-h (Note 1)	1-h (Note 1)	1-h (Note 1)
IAQI	CO	O₃	SO₂	NO₂
Unit	mg/m³	μg/m³	μg/m³	μg/m³
0	0	0	0	0
50	5	160	150	100
100	10	200	500	200
150	35	300	650	700
200	60	400	800	1200
300	90	800	Note 2	2340
400	120	1000	Note 2	3090
500	150	1200	Note 2	3840

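Note 1: One-hour data are only used for real-time reporting; 24-h data are used in daily reports. Note 2: If the 1-h SO₂ concentration exceeds 800 μg/m³, its IAQI is no longer calculated from the 1-h value; the 24-h moving average is used instead. Note 3: If the 8-h O₃ concentration exceeds 800 μg/m³, its IAQI is no longer calculated from the 8-h value; the 1-h moving average is used instead.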
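To show how the breakpoints in Table 1 are used, the following sketch computes an individual index (IAQI) by linear interpolation between adjacent breakpoints, using the 24-h PM₂.₅ column. The function name `iaqi` and the data layout are assumptions made for illustration, and the sketch leaves out any rounding convention and the final step of combining several pollutants into one AQI.

```python
# 24-h PM2.5 breakpoints from Table 1 as (IAQI, concentration in ug/m3).
PM25_24H = [
    (0, 0), (50, 35), (100, 75), (150, 115),
    (200, 150), (300, 250), (400, 350), (500, 500),
]


def iaqi(conc, breakpoints):
    """Interpolate linearly between the two breakpoints that bracket conc.

    Hypothetical helper: raises ValueError outside the tabulated range.
    """
    if conc < 0:
        raise ValueError("concentration must be non-negative")
    for (i_lo, c_lo), (i_hi, c_hi) in zip(breakpoints, breakpoints[1:]):
        if c_lo <= conc <= c_hi:
            return i_lo + (i_hi - i_lo) * (conc - c_lo) / (c_hi - c_lo)
    raise ValueError("concentration above the highest breakpoint")


# A 24-h PM2.5 mean of 55 ug/m3 lies between the 35 and 75 breakpoints.
print(iaqi(55, PM25_24H))  # 75.0
```

The same function applies to any other column of Table 1 once its breakpoints are listed in the same (IAQI, concentration) form.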

Table 2. Results of ADF-Test and LB-Test.

	ADF test			LB test
	t	p	Lags used (AIC)	Result (30 lags)
IMF1	−7.046	0.000	3	Non-white-noise sequence
IMF2	−3.504	0.007	5	Non-white-noise sequence
IMF3	−4.237	0.001	5	Non-white-noise sequence
IMF4	−2.028	0.274	5	Non-white-noise sequence
IMF5	−2.122	0.236	5	Non-white-noise sequence
IMF6	0.309	0.309	12	Non-white-noise sequence
Res	−7.913	0.000	0	White noise sequence
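For readers unfamiliar with the LB (Ljung-Box) test in Table 2, the sketch below computes its Q statistic, Q = n(n+2) Σ ρ_k² / (n − k) summed over lags k = 1..h, in plain Python. The function names are assumptions, and the toy series is invented purely to make the arithmetic easy to check; it is not the paper's data. A large Q relative to the chi-square critical value leads to rejecting the white-noise hypothesis.

```python
def autocorr(x, k):
    """Sample autocorrelation of the sequence x at lag k."""
    n = len(x)
    mean = sum(x) / n
    denom = sum((v - mean) ** 2 for v in x)
    num = sum((x[t] - mean) * (x[t - k] - mean) for t in range(k, n))
    return num / denom


def ljung_box_q(x, lags):
    """Ljung-Box Q statistic using lags 1..lags (hypothetical helper)."""
    n = len(x)
    return n * (n + 2) * sum(
        autocorr(x, k) ** 2 / (n - k) for k in range(1, lags + 1)
    )


# Toy example: a strictly alternating series is strongly autocorrelated.
toy = [1, -1, 1, -1, 1, -1, 1, -1]
q1 = ljung_box_q(toy, lags=1)
print(round(q1, 2))  # 8.75, above the 5% chi-square critical value 3.841 (1 d.o.f.)
```

Table 2 reports the test with 30 lags. In practice a library routine such as `acorr_ljungbox` in statsmodels would normally be used, since it also returns the p-value.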

Table 3. Problems Arising from IMF4 to 6 Follow-up Operations.

	Differencing order	Problem
IMF4	16	Order too high
IMF5	2	Singular matrix, SVD function, non-convergence
IMF6	None	Cannot be made stationary by differencing

Table 4. The Selected Model and its R².

	Model	R²
IMF1	MA(2)	0.278
IMF2	ARMA(2,2)	0.881
IMF3	ARMA(4,2)	0.917

Table 5. Fitting Results of the Selected Model.

IMF1: MA(2)	R²	Log Likelihood	S.D. of Innovations	AIC	BIC	HQIC
	0.278	−297.337	14.638	602.673	611.780	606.299
	coef	std err	z	P	[0.025	0.975]
const *	−1.211	0.136	−8.915	0.000	−1.478	−0.945
ma.L1.IMF1 *	−0.303	0.103	−2.956	0.003	−0.504	−0.102
ma.L2.IMF1 *	−0.697	0.097	−7.180	0.000	−0.887	−0.507
IMF2: ARMA(2,2)	R²	Log Likelihood	S.D. of Innovations	AIC	BIC	HQIC
	0.881	−166.455	2.338	344.910	358.570	350.348
	coef	std err	z	P	[0.025	0.975]
const	−0.190	1.191	−0.159	0.874	−2.524	2.145
ar.L1.IMF2 *	1.045	0.106	9.829	0.000	0.837	1.253
ar.L2.IMF2 *	−0.691	0.102	−6.771	0.000	−0.891	−0.491
ma.L1.IMF2 *	1.260	0.136	9.257	0.000	0.993	1.527
ma.L2.IMF2 *	0.540	0.149	3.623	0.000	0.248	0.832
IMF3: ARMA(4,2)	R²	Log Likelihood	S.D. of Innovations	AIC	BIC	HQIC
	0.917	42.226	0.120	−68.452	−50.239	−61.201
	coef	std err	z	P	[0.025	0.975]
const	−0.399	0.473	−0.842	0.400	−1.326	0.529
ar.L1.IMF3 *	3.468	0.059	58.635	0.000	3.352	3.584
ar.L2.IMF3 *	−4.806	0.157	−30.614	0.000	−5.114	−4.498
ar.L3.IMF3 *	3.130	0.155	20.161	0.000	2.825	3.434
ar.L4.IMF3 *	−0.814	0.057	−14.326	0.000	−0.926	−0.703
ma.L1.IMF3	−0.021	0.131	−0.164	0.870	−0.277	0.235
ma.L2.IMF3 *	−0.248	0.120	−2.059	0.040	−0.484	−0.012

* Labeled as significant at the 5% level.
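As a reading aid, the first two fitted models in Table 5 can be written out explicitly. The parameterization below is an assumption: it follows the common convention in which `const` is the process mean and the MA coefficients enter with a plus sign. If the paper's Section 2.1.2 uses a different sign convention, the signs of the MA terms change accordingly.

```latex
% Assumed convention: const = process mean, MA terms added with a plus sign.
\begin{align}
\text{IMF1, MA(2):}\quad
  & y_t = -1.211 + \varepsilon_t - 0.303\,\varepsilon_{t-1} - 0.697\,\varepsilon_{t-2} \\
\text{IMF2, ARMA(2,2):}\quad
  & y_t + 0.190 = 1.045\,(y_{t-1} + 0.190) - 0.691\,(y_{t-2} + 0.190)
    + \varepsilon_t + 1.260\,\varepsilon_{t-1} + 0.540\,\varepsilon_{t-2}
\end{align}
```

Here y_t denotes the value of the component at month t and ε_t the innovation. The IMF3 model follows the same pattern with four AR terms and two MA terms.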

Table 6. Basic Statistics of AQI Prediction Results in Beijing Based on CEEMDAN-ARMA-LSTM.

	AQI		IMF1		IMF2		IMF3
	True	Predict	True	Predict	True	Predict	True	Predict
Mean	78.25	85.83	−2.70	−1.55	−0.57	0.10	−1.87	−0.02
Median	73.50	84.71	−2.18	−1.21	−0.20	−0.12	−0.86	0.27
Max	149.00	103.23	34.45	0.43	18.30	4.79	9.25	5.39
Min	50.00	72.97	−28.87	−11.08	−27.92	−2.99	−14.96	−6.32
std	21.23	7.45	15.00	2.01	11.21	1.66	7.17	3.47
	IMF4		IMF5		IMF6		Res
	True	Predict	True	Predict	True	Predict	True	Predict
Mean	0.00	0.04	1.17	0.75	82.22	86.51	−5.9 × 10⁻¹⁶	−1.6 × 10⁻¹⁵
Median	−0.14	−0.29	1.65	1.16	81.83	83.04	0	−2.27 × 10⁻¹⁵
Max	5.20	5.38	2.53	1.79	85.55	115.27	1.4 × 10⁻¹⁴	2.3 × 10⁻¹⁴
Min	−5.54	−6.66	−2.07	−3.03	80.23	80.66	−1.4 × 10⁻¹⁴	−2.3 × 10⁻¹⁴
std	3.69	4.08	1.37	1.25	1.65	8.52	4.5 × 10⁻¹⁵	1.1 × 10⁻¹⁴
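CEEMDAN is an additive decomposition, so the AQI series equals the sum of its IMFs and the residual, and the means in Table 6 should add up in the same way. The short check below is a sketch with assumed variable names: it sums the component means from Table 6 and compares the totals with the reported AQI means. The residual means are of order 10⁻¹⁵ and are treated as zero.

```python
# Component means copied from Table 6; Res is ~1e-15 and is set to 0 here.
true_means = {
    "IMF1": -2.70, "IMF2": -0.57, "IMF3": -1.87, "IMF4": 0.00,
    "IMF5": 1.17, "IMF6": 82.22, "Res": 0.0,
}
pred_means = {
    "IMF1": -1.55, "IMF2": 0.10, "IMF3": -0.02, "IMF4": 0.04,
    "IMF5": 0.75, "IMF6": 86.51, "Res": 0.0,
}

true_total = sum(true_means.values())
pred_total = sum(pred_means.values())

print(round(true_total, 2))  # 78.25, the reported mean of the true AQI
print(round(pred_total, 2))  # 85.83, the reported mean of the predicted AQI
```

The agreement suggests that the final AQI forecast is obtained by summing the per-component forecasts, as the additive decomposition implies.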

Table 7. Model Performance Comparison Analysis.

	CEEMDAN-ARMA-LSTM
	AQI	IMF1	IMF2	IMF3	IMF4	IMF5	IMF6	Res
MAE	17.09	12.37	8.57	6.97	0.82	0.65	4.29	9.3 × 10⁻¹⁵
MAPE	23.2%	109.9%	152.9%	301.0%	51.9%	60.4%	5.1%	N/A
MSE	478.13	237.31	135.31	75.19	1.85	0.56	69.54	5.4 × 10⁻³⁵
RMSE	21.87	15.40	11.63	8.67	1.36	0.75	8.34	7.4 × 10⁻¹⁸
	CEEMDAN-LSTM
	AQI	IMF1	IMF2	IMF3	IMF4	IMF5	IMF6	Res
MAE	22.05	20.22	5.93	1.37	0.82	0.65	4.29	9.3 × 10⁻¹⁵
MAPE	29.5%	679.3%	723.9%	82.4%	51.9%	60.4%	5.1%	N/A
MSE	795.67	791.14	51.97	4.62	1.85	0.56	69.54	5.4 × 10⁻³⁵
RMSE	28.21	28.13	7.21	2.15	1.36	0.75	8.34	7.4 × 10⁻¹⁸
	LSTM				ARMA-GARCH
	AQI				AQI
MAE	36.64				21.76
MAPE	51.9%				31.4%
MSE	2072.50				668.53
RMSE	45.52				25.86

In the published table, red values mark the minimum error in each model comparison. The ARMA-GARCH model in this paper reduces to a pure ARMA model, because the tests show no ARCH effect.
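The four criteria in Table 7 and the relative improvements quoted in the abstract can be reproduced with a few lines of Python. This is a sketch with assumed function names. The metric definitions are the standard ones, and the improvement is taken as (baseline − proposed) / baseline, which is an assumption that happens to match the published percentages.

```python
import math


def mae(y, p):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, p)) / len(y)


def mape(y, p):
    """Mean absolute percentage error, in percent (y must be non-zero)."""
    return 100 * sum(abs((a - b) / a) for a, b in zip(y, p)) / len(y)


def mse(y, p):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y, p)) / len(y)


def rmse(y, p):
    """Root mean squared error."""
    return math.sqrt(mse(y, p))


def improvement(baseline, proposed):
    """Relative error reduction in percent, rounded to one decimal."""
    return round(100 * (baseline - proposed) / baseline, 1)


# AQI-level errors from Table 7 (MAPE in percent).
proposed = {"MAE": 17.09, "MAPE": 23.2, "MSE": 478.13, "RMSE": 21.87}
baselines = {
    "CEEMDAN-LSTM": {"MAE": 22.05, "MAPE": 29.5, "MSE": 795.67, "RMSE": 28.21},
    "LSTM": {"MAE": 36.64, "MAPE": 51.9, "MSE": 2072.50, "RMSE": 45.52},
    "ARMA-GARCH": {"MAE": 21.76, "MAPE": 31.4, "MSE": 668.53, "RMSE": 25.86},
}

gains = {
    name: {m: improvement(vals[m], proposed[m]) for m in proposed}
    for name, vals in baselines.items()
}
for name, g in gains.items():
    print(name, g)
```

For example, the MAE gain over CEEMDAN-LSTM is (22.05 − 17.09) / 22.05 ≈ 22.5%, the figure stated in the abstract.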

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Sun, Y.; Liu, J. AQI Prediction Based on CEEMDAN-ARMA-LSTM. Sustainability 2022, 14, 12182. https://doi.org/10.3390/su141912182

