Open Access Article

Ensemble Prediction Method Based on Decomposition–Reconstitution–Integration for COVID-19 Outbreak Prediction

Wenhui Ke and Yimin Lu

Key Laboratory of Spatial Data Mining & Information Sharing of Ministry of Education, National Engineering Research Centre of Geospatial Information Technology, Academy of Digital China (Fujian), Fuzhou University, Fuzhou 350116, China

Author to whom correspondence should be addressed.

Mathematics 2024, 12(3), 493; https://doi.org/10.3390/math12030493

Submission received: 24 December 2023 / Revised: 29 January 2024 / Accepted: 2 February 2024 / Published: 4 February 2024

Figure 1. Structure of CNN-LSTM-ATT network hybrid model. CNN layer: convolutional neural network layer. LSTM layer: long short-term memory layer.

Figure 2. Flowchart of the decomposition–reconstitution–integration-based prediction method for COVID-19 new cases. EEMD: ensemble empirical mode decomposition.

Figure 3. Original time series of daily new cases in the United States, France, and Russia. Data range from 1 February 2020 to 30 September 2022.

Figure 4. Intrinsic mode function (IMF) subseries via decomposing the original daily new cases time series.

Figure 5. Single-step prediction results for COVID-19 daily new cases of different prediction methods.

Figure 6. Influence curve of input sequence length on model performance: (a) R² performance curve; (b) MAE performance curve; (c) RMSE performance curve; (d) MAPE performance curve.

Figure 7. Influence curve of output sequence length on model performance: (a) R² performance curve; (b) MAE performance curve; (c) RMSE performance curve; (d) MAPE performance curve.

Figure 8. Single-step prediction results of time series of daily new cases of COVID-19 in different countries: (a) France; (b) Russia.

Abstract

Due to the non-linear and non-stationary nature of daily new 2019 coronavirus disease (COVID-19) case time series, existing prediction methods struggle to accurately forecast the number of daily new cases. To address this problem, a hybrid prediction framework is proposed in this study, which combines ensemble empirical mode decomposition (EEMD), fuzzy entropy (FE) reconstruction, and a CNN-LSTM-ATT hybrid network model. This new framework, named EEMD-FE-CNN-LSTM-ATT, is applied to predict the number of daily new COVID-19 cases. This study focuses on the daily new case dataset from the United States as the research subject to validate the feasibility of the proposed prediction framework. The results show that EEMD-FE-CNN-LSTM-ATT outperforms other baseline models in all evaluation metrics, demonstrating its efficacy in handling the non-linear and non-stationary epidemic time series. Furthermore, the generalizability of the proposed hybrid framework is validated on datasets from France and Russia. The proposed hybrid framework offers a new approach for predicting the COVID-19 pandemic, providing important technical support for future infectious disease forecasting.

Keywords:

COVID-19; ensemble prediction; ensemble empirical mode decomposition; fuzzy entropy; LSTM network

MSC:

92D30

1. Introduction

Since the outbreak of COVID-19 in December 2019, the highly infectious novel coronavirus rapidly spread across the globe within a short period of time. Despite the implementation of pharmaceutical interventions (such as vaccination) and non-pharmaceutical interventions (such as travel bans, flight cancellations, restrictions on gatherings, school closures, and public transportation shutdowns), the number of confirmed cases has continued to surge at an alarming rate [1,2]. The massive increase in infected patients has overwhelmed healthcare infrastructure, while the persistent spread of the pandemic has presented significant challenges for countries to adopt non-pharmaceutical interventions. Therefore, research on the trend of epidemic transmission has been a focal point for academia, governments, and the public. In the early stages of the pandemic, a timely and accurate understanding of the future trend and peak situation of the epidemic can provide scientific and technological support for curbing its spread and proposing targeted prevention and control policies and measures [3]. Although various countries have lifted their epidemic prevention measures at present, understanding the future trajectory of the pandemic can enable governments to timely grasp the local infection situation, remind the public to avoid infection peaks, and provide a theoretical basis for safeguarding people’s lives and health [4]. Therefore, the development of a model that is capable of accurately predicting the number and trends of COVID-19 infection cases is of paramount importance.

The dissemination of the novel coronavirus involves complex non-linear relationships, as the virus can spread through multiple transmission pathways, including droplet transmission, airborne transmission, and contact transmission [5]. Furthermore, the complexity of the pandemic’s transmission is exacerbated by the varying population densities, social behaviors, and hygiene measures implemented across different states and municipalities in the United States. During the early stages of the pandemic, the United States faced challenges in implementing effective prevention and control measures, leading to a rapid spread of the virus. Over time, however, the government gradually implemented stricter measures to curb the transmission, including implementing lockdown measures, promoting mask wearing, and enforcing social distancing protocols. These adjustments in strategies have had a significant impact on the dynamics of the outbreak. The complexity of the viral transmission is further manifested in the non-linearity and non-stationarity characteristics observed in the infection sequences [6]. Hence, the non-linearity and non-stationarity present a significant challenge in accurately predicting the number of infection cases and their trends. Therefore, the objective of this study is to develop a data-driven predictive model that addresses the challenges posed by the non-linearity and non-stationarity encountered in existing research, aiming to achieve accurate forecasting of daily new confirmed cases, in order to assist governmental authorities and the public in making more scientifically informed and rational decisions.

The remaining sections of this study are outlined as follows: Section 2 provides a comprehensive review of the relevant literature on forecasting methods for the COVID-19 pandemic, highlighting the unique contributions of this research. In Section 3, the methodology employed in this study is described, encompassing EEMD, fuzzy entropy reconstitution, and a hybrid model combining CNN-LSTM with an attention mechanism. The empirical data analysis process is presented in Section 4. Section 5 introduces ablation experiments, time window analysis, and model application. Finally, Section 6 offers a comprehensive summary of the entire study, presenting the key findings and suggesting directions for future research.

2. Literature Review

Since the onset of the pandemic, researchers worldwide have been developing and implementing predictive models for COVID-19 to understand the future trajectory of outbreaks. Some researchers have utilized epidemiological models, such as the susceptible–exposed–infectious–removed (SEIR) model and its variants, to express and model the transmission process among individuals in different infection states. These models are used to forecast the development trends of the epidemic and assess the effects of various intervention measures [7,8,9]. However, these models rely on a considerable number of assumed input parameters, including the probabilities of transitioning between the S, E, I, and R compartments of the population. Due to the strong sensitivity of the SEIR model to variations in these input parameters, the accuracy of predictions may be significantly compromised. Furthermore, the SEIR model is based on oversimplified assumptions, one of which assumes that transition rates are uniform within the population and remain constant over time. On the contrary, the transition rates of COVID-19 among the S, E, I, and R compartments vary over time and are highly sensitive to social demographics and mitigation policies.

With the increasing volume of pandemic data, data-driven predictive methods (such as the autoregressive integrated moving average model, random forests, support vector machines, and neural networks) have been utilized to uncover patterns in the historical time series of the COVID-19 outbreak and extrapolate future trends [10,11,12,13]. These models, which rely on a single source of error, play a crucial role in the short-term forecasting of the COVID-19 epidemic. However, owing to unstable incubation periods, asymptomatic patients, and epidemic prevention policies, epidemic time-series data contain complex temporal relationships, and their non-linearity and non-stationarity pose a great challenge for data-driven epidemic prediction [14,15]. Non-linear modeling techniques widely used in this domain include artificial neural networks (ANNs) [16,17], long short-term memory (LSTM) networks [18,19,20], and gated recurrent units (GRUs) [21]. These methods incorporate only rudimentary non-linear assumptions during modeling and fail to fully account for the inherent laws that underlie the spread of COVID-19. As a result, they struggle to identify key inflection points accurately, exhibit significant lags, and generalize poorly. To enhance the dependability of predictions, Kumar et al. [22] proposed using a spline function to segment the non-linear epidemic time series into different growth stages and predicting each stage of the spread with a linear modeling approach, which reduces the difficulty of prediction. Other researchers combined the strengths of several models to devise hybrid models [23,24,25,26,27].
The results found that combining convolutional neural networks with temporal recurrent neural networks (e.g., CNN-LSTM, CNN-GRU) to capture the local spatial correlation and long-term dependence of historical daily new cases has significantly better predictive performance than a single model. The CNN-LSTM hybrid model, with low requirements for sequence stationarity, has been widely applied in various fields such as gold price forecasting [28], air quality prediction [29], and water quality forecasting [30], exhibiting excellent predictive performance. However, as the sequence length increases, the robustness of the model decreases [31]. To address this issue, Ran et al. [32] proposed an LSTM travel time prediction method based on attention mechanisms, while Yang et al. [33] utilized an attention-based CNN-LSTM model for water quality prediction, and Zhang et al. [34] developed a probabilistic CNN-BiLSTM algorithm based on multi-head attention for day-ahead wind speed forecasting. According to their research, the attention mechanism can capture long-term dependencies in time-series data and improve model robustness. Predictive models based on attention mechanisms have higher prediction accuracy than LSTM and other benchmark models.

However, due to the interaction of virus transmission characteristics and epidemic prevention policies, mining deeper feature patterns is a feasible means to further improve the prediction accuracy. To capture the long-term trends, periodic variations, and random fluctuations in the time series of epidemic data, various decomposition methods such as discrete wavelet transform (DWT), ensemble empirical mode decomposition (EEMD), and other techniques have been introduced into the field of epidemic forecasting [35,36,37,38,39]. These methods aim to enhance the interpretability and regularity of the decomposed time-series data, allowing the models to better learn and represent the features of the decomposed data. Among these methods, the EEMD proposed by Wu and Huang [40] is an improvement of empirical mode decomposition (EMD), which decomposes the sequence into a series of relatively smooth subsequences based on the characteristics of the sequence itself, which contains several intrinsic mode functions (IMFs) and a residual term, and the IMFs contain the local characteristics of the original sequence at different time scales. In addition, EEMD shows better robustness in dealing with noise disturbances and outliers, which can resist the influence of noise on the decomposition results and provide more robust decomposition results. EEMD has been widely used in COVID-19 epidemic prediction due to its adaptivity and stability, and has achieved certain results. For instance, Liu et al. [41] combined EEMD and an autoregressive moving average model (ARMA) to correct the prediction results of the Global Prediction System for COVID-19 Pandemic (GPCP) developed by Lanzhou University, making the corrected trend closer to the real situation. Hasan [42] used an ANN to predict all components of EEMD separately to significantly improve the prediction accuracy. 
Although the above EEMD-based prediction methods can improve prediction accuracy, the use of simple linear or non-linear models in the choice of prediction methods still leads to the omission of the characteristic laws implied by the component series. For this, Wang et al. [43] used a non-linear autoregressive artificial neural network (NARANN) to model each IMF term of COVID-19 prevalence and mortality decomposed by EEMD, and an autoregressive integrated moving average model (ARIMA) to model the residual term to capture the non-linear and linear features of the component series, respectively, which can better fit the dynamic dependence of the epidemic time series. Thus, motivated by the EEMD-based decomposition–integration idea, capturing the features implied by the component sequences with an appropriate hybrid model is an available way to improve the model’s performance. In addition, based on the idea of decomposition–integration, some scholars introduced sample entropy, fuzzy entropy (FE), and the Hurst exponent to measure the complexity of the decomposition sequence after decomposition and then reconstituted the decomposition sequences, thus reducing the accumulation of forecast errors. Among these time-series analysis methods, fuzzy entropy stands out as an improved approach based on sample entropy. Its computation involves the use of an exponential function to fuzzify the similarity measurement formula. Notably, fuzzy entropy exhibits the remarkable capability of obtaining continuously smooth entropy values, even with relatively short data sequences. This distinctive feature enables it to effectively capture the non-linear characteristics of time series and evaluate their complexity with greater precision [44]. The decomposition–reconstitution–integration mode has shown advantages in carbon price forecasting [45], electric charge forecasting [46], and stock price forecasting [47]. 
Therefore, the introduction of the EEMD-based decomposition–reconstitution–integration idea in COVID-19 epidemic prediction is an effective means to fully extract the implied features and reduce the prediction error.

Previous studies in the literature aim to establish a model that accurately predicts the number of COVID-19 infection cases and their trends. However, due to the varying speed of virus transmission, the influence of social interventions, and changes in population behavior, COVID-19 epidemic data exhibit non-linear and non-stationary characteristics. Existing data-driven prediction methods struggle to provide precise forecasts under such circumstances. Based on their studies, a hybrid CNN-LSTM model can extract local correlations and temporal dependencies from the virus infection sequences. Meanwhile, the EEMD technique can capture long-term trends, periodic variations, and random fluctuations in the infection data. Models constructed from a feature extraction perspective can enhance prediction accuracy to a certain extent. Therefore, this study integrates the strengths of EEMD and CNN-LSTM, fully exploring the features within the COVID-19 time-series data. Furthermore, an attention mechanism is introduced in the CNN-LSTM model to compensate for the decreasing ability of LSTM to capture time dependencies as the time series grows. Moreover, following the “decomposition–reconstitution–integration” idea, this study introduces FE to reconstruct the decomposed subsequences, thereby reducing the accumulation of prediction errors. Consequently, this research develops a novel hybrid forecasting framework (EEMD-FE-CNN-LSTM-ATT) that better fits the non-linear and non-stationary features in COVID-19 data. The proposed method improves the accuracy and reliability of epidemic forecasting, offering a novel approach for COVID-19 prediction, and providing crucial technical support for future predictions of other infectious diseases.

The contributions of this research are as follows: (1) from a feature extraction perspective, this study proposes a novel hybrid prediction framework (EEMD-FE-CNN-LSTM-ATT) that integrates EEMD, FE reconstruction, and CNN-LSTM with an attention mechanism, based on the “decomposition–reconstitution–integration” idea; (2) empirical data analysis validates the effectiveness and generalizability of the proposed method in handling non-linear, non-stationary COVID-19 time-series data. The proposed method can provide decision-makers with more accurate predictions to support effective public health interventions and resource allocation decisions.

3. Proposed Method

In the field of epidemic prediction, from the perspective of feature extraction, based on the idea of “decomposition–reconstitution–integration”, this study proposes, for the first time, a novel hybrid framework that integrates EEMD, FE reconstitution, and a hybrid model of a CNN-LSTM network based on the attention mechanism to predict the number of new cases per day. Firstly, EEMD is utilized to extract features such as long-term trends, cyclical changes, and random fluctuations in the time series of daily new cases. Second, considering that predicting each subsequence separately after decomposition will inevitably result in the accumulation of prediction errors, fuzzy entropy analysis is introduced after EEMD to reconstitute the subsequence according to the complexity of the subsequence. Then, the constructed CNN-LSTM based on the attention mechanism (CNN-LSTM-ATT) is used to predict each reconstituted subsequence, using a CNN to capture the local spatial correlation of the reconstituted subsequence, LSTM to capture its temporal dependence, and the attention mechanism to make up for the decreasing ability of LSTM to capture temporal dependence with the growth of the time series. Finally, the resulting predictions are combined to obtain the complete predicted values. Next, the EEMD method, the FE reconstitution method, the construction of the CNN-LSTM-ATT network hybrid model, and the analysis process of the proposed method are described in detail.

3.1. EEMD

Because COVID-19 is contagious, the number of infections on a given day is correlated with the numbers on the preceding days, and thus the outbreak time series may show a long-term trend. In addition, the novel coronavirus is more active and more contagious in winter, so the epidemic time series exhibits a certain cyclic variation. Under the combined effect of policy, medical care, and other factors, the epidemic time series also shows some random fluctuation. Therefore, EEMD is introduced to extract the long-term trends, cyclic variations, and random fluctuations of the daily new case time series, to predict these local features individually, and to reduce the prediction lag of traditional time-series models. The decomposition steps of EEMD are as follows [40]:

Perform $M$ empirical mode decomposition (EMD) trials on the original time series, $\{s (t) : 1 \leqslant t \leqslant N\}$, of length $N$. In each trial, add a different white Gaussian noise sequence, $k n_{m} (t)$, with an equal root mean square. The $m$th decomposition yields $n$ IMF components, $c_{i, m} (t), i = 1, 2, \dots, n$, and one residual component, $r_{n, m} (t)$. The means of the IMF components and of the residual component over the $M$ trials are taken as the decomposition results of EEMD:

{\bar{c}}_{i} (t) = \frac{\sum_{m = 1}^{M} c_{i, m} (t)}{M}, i = 1,2, \dots, n

(1)

{\bar{r}}_{n} (t) = \frac{\sum_{m = 1}^{M} r_{n, m} (t)}{M}

(2)

where, according to [40] and the experimental data in this paper, $M$ is taken as 100 and $k$, the amplitude of the added white noise, is taken as 0.05. The time series of daily new cases of COVID-19 can be expressed as a linear combination of the final IMF components and the residual component, as follows:

s (t) = \sum_{i = 1}^{n} {\bar{c}}_{i} (t) + {\bar{r}}_{n} (t)

(3)
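
As a brief check on Equation (3), the following sketch shows why the averaged components recover the original series only approximately. It assumes, for illustration, that every trial yields the same number of components $n$, that each EMD trial reconstructs its noisy input exactly, and that the noise sequences $n_{m} (t)$ are independent with standard deviation $\sigma$ (a symbol introduced here, not in the original). Each trial then satisfies

s (t) + k n_{m} (t) = \sum_{i = 1}^{n} c_{i, m} (t) + r_{n, m} (t)

and averaging both sides over the $M$ trials gives

\sum_{i = 1}^{n} {\bar{c}}_{i} (t) + {\bar{r}}_{n} (t) = s (t) + \frac{k}{M} \sum_{m = 1}^{M} n_{m} (t)

Under these assumptions, the remaining noise term has a standard deviation of

\frac{k \sigma}{\sqrt{M}} = \frac{0.05 \sigma}{\sqrt{100}} = 0.005 \sigma

so with $M = 100$ and $k = 0.05$ the added noise largely cancels and Equation (3) holds up to this small residual.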

3.2. Fuzzy Entropy Reconstitution

To reduce the accumulation of prediction errors, after the original time series of daily new cases of COVID-19 is decomposed by EEMD into several IMF components and a residual component, the self-similarity (autocorrelation) of each component sequence is measured using fuzzy entropy, and the components are then reconstituted according to their entropy values. The larger the entropy value, the more complex the component series and the lower its self-similarity; the more the variation it contains influences the prediction of daily new cases, the more of it should be retained during reconstitution. The reconstitution steps based on the fuzzy entropy values are as follows:

First, each component sequence of daily new case data of length $N$ is reconstructed separately to generate $w$-dimensional vectors,

U_{j}^{w} = \{u (j), u (j + 1), \dots, u (j + w - 1)\} - u_{0} (j), \quad j = 1, \dots, N - w + 1

(4)

and $(w + 1)$-dimensional vectors,

U_{j}^{w + 1} = \{u (j), u (j + 1), \dots, u (j + w)\} - u_{0} (j), \quad j = 1, \dots, N - w

(5)

where $\{u (j), u (j + 1), \dots, u (j + w - 1)\}$ and $\{u (j), u (j + 1), \dots, u (j + w)\}$ represent the vectors consisting of $w$ and $w + 1$ consecutive data points of ${\bar{c}}_{i} (t)$, starting from the $j$th data point, respectively, and $u_{0} (j)$ is the mean value of the corresponding vector.

Second, the definition of fuzzy entropy is as follows:

FE (w, n, r) = \lim_{N \to \infty} [\ln ϕ^{w} (n, r) - \ln ϕ^{w + 1} (n, r)]

(6)

Then, the fuzzy entropy value of each component sequence of length $N$ is calculated:

FE (w, n, r, N) = \ln ϕ^{w} (n, r) - \ln ϕ^{w + 1} (n, r)

(7)

where $w$ represents the embedding dimension and, depending on the desired data length ($10^{w}$ ∼ $30^{w}$), $w$ is taken as 2; $n$ determines the gradient of the similarity tolerance boundary and, to capture more detailed information, $n$ is taken as 2; $r$ represents the width of the fuzzy function boundary and, to capture more information while reducing the sensitivity to noise, $r$ is taken as $0.15 \, sd (u)$ ($sd (u)$ is the standard deviation of the sequence); and $ϕ^{w} (n, r)$ and $ϕ^{w + 1} (n, r)$ are the similarities of the vectors $U_{j}^{w}$ and $U_{j}^{w + 1}$, respectively.

Finally, the sequences with similar fuzzy entropy values are reconstituted into $s$ sequences, $X = \{x_{1}, x_{2}, \dots, x_{N}\}$, of length $N$.
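
The finite-length fuzzy entropy of Equation (7) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. The text does not spell out how $ϕ^{w} (n, r)$ is computed, so the sketch assumes the commonly used form: the Chebyshev distance between mean-removed vectors, an exponential membership function $\exp (-(d / r)^{n})$, and an average over all pairs of distinct vectors. The function name `fuzzy_entropy` and the argument `r_factor` are hypothetical.

```python
import numpy as np


def fuzzy_entropy(u, w=2, n=2, r_factor=0.15):
    """Finite-length fuzzy entropy FE(w, n, r, N) of a 1-D sequence (Eq. 7)."""
    u = np.asarray(u, dtype=float)
    r = r_factor * u.std()  # r = 0.15 * sd(u), as in the text

    def phi(dim):
        # N - dim + 1 vectors of length dim, matching the ranges in Eqs. (4)-(5).
        count = u.size - dim + 1
        vectors = np.array([u[j:j + dim] for j in range(count)])
        # Remove each vector's own mean u_0(j).
        vectors = vectors - vectors.mean(axis=1, keepdims=True)
        # Chebyshev (maximum absolute) distance between every pair of vectors.
        dist = np.abs(vectors[:, None, :] - vectors[None, :, :]).max(axis=2)
        # ASSUMPTION: exponential membership function exp(-(d / r)^n).
        sim = np.exp(-((dist / r) ** n))
        np.fill_diagonal(sim, 0.0)  # exclude self-matches
        return sim.sum() / (count * (count - 1))

    return np.log(phi(w)) - np.log(phi(w + 1))


# Toy comparison: a smooth sine wave versus white noise of the same length.
rng = np.random.default_rng(0)
smooth = np.sin(np.linspace(0, 8 * np.pi, 400))
noisy = rng.standard_normal(400)
print(fuzzy_entropy(smooth), fuzzy_entropy(noisy))
```

On such toy inputs the regular sine wave should receive a much lower entropy than the noise, which is the property used to rank the components by complexity. The pairwise distance array grows quadratically with the sequence length, which is acceptable for series of roughly a thousand points.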

3.3. CNN-LSTM-ATT Network Hybrid Model

Due to the correlation between the number of daily new cases within a certain period, we propose a network hybrid model, the CNN-LSTM-ATT, as shown in Figure 1. The hybrid model uses a CNN to capture the local spatial correlation of the reconstructed subsequence and LSTM to capture its temporal dependency, and uses the attention mechanism to compensate for the decline in the ability of LSTM to capture temporal dependency as the time series grows, which improves the accuracy of the prediction of the short-term COVID-19 daily new cases. Each layer in the model is described as follows:

Input layer: The input layer takes the reconstituted sequence of decomposed COVID-19 daily new case data as the input of the hybrid prediction model, and the reconstituted sequence of length $N$ can be expressed as $X = [x_{1} \dots x_{t - 1}, x_{t} \dots x_{N}]^{T}$.
CNN layer: The CNN layer consists of a convolutional layer, a pooling layer, and a node expansion step. According to the features of the input sequence, a one-dimensional convolutional layer is designed, with ReLU selected as the activation function, to extract features from the input sequence. Max pooling is selected for downsampling because it retains more information about data fluctuations. After convolution and pooling, the input sequence is mapped to the hidden-layer feature space and its nodes are expanded (flattened into one dimension), so that the resulting feature vector contains the local connections between different feature values of the input sequence; this feature vector is then input to the LSTM layer for further processing. Denoting the output of the CNN layer as $H_{C}$, the training process can be expressed as

$C = f (X \otimes W_{1} + b_{1}) = \mathrm{ReLU} (X \otimes W_{1} + b_{1})$

(8)

$P = \max (C) + b_{2}$

(9)

$H_{C} = P \times W_{2}$

(10)

where $C$ is the output of the convolution layer; $P$ is the output of the pooling layer; $W_{1}$ and $W_{2}$ are the weight matrices; $b_{1}$ and $b_{2}$ are the biases; $\otimes$ and $\max ()$ are the convolution operation and the maximum function, respectively; and the output length of the CNN layer is $i$, denoted as $H_{C} = [h_{c 1} \dots h_{c t - 1} \dots h_{c t} \dots h_{c i}]$.
LSTM layer: The LSTM layer learns the feature vectors extracted by the CNN layer to predict the input sequence. A single-layer LSTM structure with 10 neurons is built to prevent model overfitting. The output of the LSTM layer is denoted as $H$, and the output at step $t$ is denoted as

$h_{t} = \mathrm{LSTM} (H_{C, t - 1}, H_{C, t}), t \in [1, i]$

(11)
Attention layer: The attention mechanism layer assigns weights to the output vector ($H$) of the LSTM layer, constantly updating the weight parameter matrix to focus on important information and ignore irrelevant information, thus obtaining higher scalability and robustness. The attention weights are calculated as

$e_{t} = u \tanh (w h_{t} + b)$

(12)

$α_{t} = \frac{\exp (e_{t})}{\sum_{j = 1}^{i} \exp (e_{j})}$

(13)

$s_{t} = \sum_{t = 1}^{i} α_{t} h_{t}$

(14)

where the probability vector composed of $α_{t}$ represents the attention distribution value of the output vector $(h_{t})$ of the LSTM network layer at moment $t$ ; $e_{t}$ represents the attention scoring function; $u$ and $w$ are the weight coefficients; $b$ is the bias coefficient; and $s_{t}$ represents the weighted average of the input information as the output of the attention layer at moment $t$ .
Output layer: The output of the attention mechanism layer is used as the input of the output layer. The output layer outputs the prediction result $Y = [y_{1}, y_{2} \dots y_{m}]^{T}$ with prediction time step $m$ through the fully connected layer. The prediction formula can be expressed as

$y_{t} = s_{t} \times w_{o}$

(15)

where $y_{t}$ denotes the output value of the hybrid prediction model at moment $t$ ; $w_{o}$ is the weight matrix.
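
The attention step in Equations (12)–(14) can be illustrated with a small NumPy forward pass. This is a sketch under stated assumptions rather than the trained model: the weights are random, the shapes of `w`, `b`, and `u` are chosen here for illustration (the text only calls them weight and bias coefficients), and the function name `attention_pool` is hypothetical. The scores are normalized with a softmax so that the weights form a probability vector.

```python
import numpy as np


def attention_pool(h, w, b, u):
    """Apply Eqs. (12)-(14) to LSTM outputs h of shape (i, d)."""
    e = np.tanh(h @ w + b) @ u            # scores e_t, shape (i,)
    e = e - e.max()                       # shift for numerical stability
    alpha = np.exp(e) / np.exp(e).sum()   # softmax weights alpha_t, sum to 1
    s = alpha @ h                         # weighted sum of the h_t, shape (d,)
    return s, alpha


rng = np.random.default_rng(0)
# ASSUMPTION: 7 time steps and an attention width of 8 are arbitrary choices;
# d = 10 mirrors the 10 LSTM neurons mentioned in the text.
i, d, d_att = 7, 10, 8
h = rng.standard_normal((i, d))
w = rng.standard_normal((d, d_att))
b = np.zeros(d_att)
u = rng.standard_normal(d_att)

s, alpha = attention_pool(h, w, b, u)
print(alpha.round(3), alpha.sum())
```

Time steps with larger scores receive larger weights, so the pooled vector `s` is dominated by the LSTM outputs that the layer treats as most informative.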

The Adam optimizer is used to obtain the optimal parameters. Adam is a first-order optimization algorithm proposed based on a traditional stochastic gradient descent, which continuously adjusts the weights of the network according to the changes to the training samples to minimize the loss function of the neural network. The mean squared error function is used as the loss function of the hybrid prediction model with the following equation:

PE = \frac{1}{n} \sum_{i = 1}^{n} (y_{i} - {\hat{y}}_{i})^{2}

(16)

where $y_{i}$ is the actual value of each recombination sequence; ${\hat{y}}_{i}$ is the output value of the hybrid prediction model; and $n$ is the length of the recombination sequence.

3.4. COVID-19 Outbreak Time-Series Hybrid Forecasting Framework

The flow chart of the COVID-19 daily new case prediction method, constructed based on EEMD, FE reconstruction, and a CNN-LSTM-ATT network hybrid model, is shown in Figure 2. First, the EEMD method is used to achieve the smoothing of the historical time series of new cases for COVID-19 to obtain a finite number of IMF components and a residual component to fully exploit the feature information on the time scale. Second, the complexity of each IMF component and the residual component is calculated by fuzzy entropy to measure the autocorrelation of sequences and the similarity between sequences, and the sequences with approximate entropy values are reconstituted into new sequences to reduce error accumulation while making full use of sequence detail information. Then, each reconstituted sequence is preprocessed and normalized to [−1,1], and a CNN-LSTM-ATT network hybrid model is built to predict each of the normalized reconstituted sequences. Finally, the prediction results of each reconstituted sequence are inverse-normalized and superimposed to obtain the final prediction value.
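
The normalization and superposition steps of this workflow can be sketched as follows. The text states only that each reconstituted sequence is normalized to [−1,1]; min-max scaling is assumed here, and the helper names `to_unit_range`, `from_unit_range`, and `combine_forecasts` are hypothetical. In practice, the scaling bounds would typically be taken from the training portion only.

```python
import numpy as np


def to_unit_range(x):
    """Min-max scale a sequence to [-1, 1]; also return (min, max) for inversion."""
    x = np.asarray(x, dtype=float)
    lo, hi = float(x.min()), float(x.max())
    return 2.0 * (x - lo) / (hi - lo) - 1.0, (lo, hi)


def from_unit_range(z, bounds):
    """Invert to_unit_range using the stored (min, max)."""
    lo, hi = bounds
    return (np.asarray(z, dtype=float) + 1.0) / 2.0 * (hi - lo) + lo


def combine_forecasts(scaled_preds, bounds_list):
    """Inverse-normalize each sub-forecast and superimpose (sum) them."""
    return sum(from_unit_range(z, b) for z, b in zip(scaled_preds, bounds_list))


# Two toy "reconstituted sequences" standing in for model outputs.
a = np.array([3.0, 5.0, 11.0])
b = np.array([-2.0, 0.0, 6.0])
za, bounds_a = to_unit_range(a)
zb, bounds_b = to_unit_range(b)
print(combine_forecasts([za, zb], [bounds_a, bounds_b]))
```

If the scaled sequences were replaced by the scaled predictions of the CNN-LSTM-ATT model for each reconstituted sequence, the same call would return the final forecast in the original units.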

4. Data Analysis

Three datasets of daily new cases in the United States, France, and Russia from 1 February 2020 to 30 September 2022 were selected from the COVID-19 epidemic dataset released by the World Health Organization (WHO), giving a total of 2919 samples as research objects. Each dataset contains the time series of daily new confirmed cases in one country. The original sequences of daily new cases in the United States, France, and Russia are plotted separately in Figure 3. As Figure 3 shows, the three original time series have no obvious linear trend and are highly volatile in the short term. In addition, in the KPSS test, the p-values of the three sequences are all lower than 0.05 and the test statistics are all greater than the critical value; that is, the null hypothesis of stationarity is rejected and the sequences are non-stationary. Therefore, these three sequences are non-linear and non-stationary time series. To validate the effectiveness of the EEMD-FE-CNN-LSTM-ATT ensemble prediction method, experiments are conducted on these three datasets. Since the data contain no missing or zero values, no interpolation is required, and anomaly detection becomes the main consideration for pre-processing. Therefore, the isolation forest algorithm [48] is used to detect and reject outliers.
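
The stated sample count can be checked with a short calculation, assuming one observation per country per day over the inclusive date range (before any outliers are rejected):

```python
from datetime import date

# Inclusive number of days from 1 February 2020 to 30 September 2022.
days_per_country = (date(2022, 9, 30) - date(2020, 2, 1)).days + 1
total_samples = 3 * days_per_country  # United States, France, Russia

print(days_per_country, total_samples)  # 973 2919
```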

To evaluate the prediction accuracy of different methods intuitively, the R² (coefficient of determination), MAE (mean absolute error), MAPE (mean absolute percentage error), and RMSE (root mean square error) are selected as evaluation indexes. These four evaluation methods are commonly used criteria in regression forecasting and are defined as follows:

R^{2} = 1 - \frac{\sum_{i = 1}^{n} (y_{i} - {\hat{y}}_{i})^{2}}{\sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}}

(17)

MAE = \frac{1}{n} \sum_{i = 1}^{n} | y_{i} - {\hat{y}}_{i} |

(18)

MAPE = \frac{1}{n} \sum_{i = 1}^{n} \frac{| y_{i} - {\hat{y}}_{i} |}{y_{i}} \times 100\%

(19)

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (y_{i} - {\hat{y}}_{i})^{2}}

(20)

where n denotes the total number of predicted days; y_{i} denotes the actual daily number of new cases on day i; \hat{y}_{i} denotes the predicted daily number of new cases on day i; and \bar{y} denotes the average of the actual daily number of new cases over the n days. The RMSE and MAE describe the actual magnitude of the difference between the predicted and true values. The MAPE describes the percentage of prediction error relative to the true values. R² describes the goodness of fit of the predicted values to the true values. The smaller the values of the MAPE, RMSE, and MAE, and the closer the value of R² is to one, the more accurate the forecast results.
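The four evaluation indexes in Equations (17)–(20) can be computed directly from paired actual and predicted values. The sketch below is a plain-Python illustration; the function name `regression_metrics` and the three-day toy data are assumptions for this example, and the MAPE term presumes that no actual value is zero, as stated for these datasets.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return R2, MAE, MAPE (in percent) and RMSE as in Equations (17)-(20)."""
    n = len(y_true)
    mean_true = sum(y_true) / n
    errors = [actual - pred for actual, pred in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)                          # residual sum of squares
    ss_tot = sum((actual - mean_true) ** 2 for actual in y_true)  # total sum of squares
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE": sum(abs(e) / actual for e, actual in zip(errors, y_true)) / n * 100.0,
        "RMSE": math.sqrt(ss_res / n),
    }


# Hypothetical three-day example.
metrics = regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 310.0])
print(metrics)
```

Because the MAPE divides each error by the actual value, days with small case counts can dominate it, which is one reason it does not always move in the same direction as the MAE and RMSE.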

To verify the prediction performance of EEMD-FE-CNN-LSTM-ATT, we set the CNN-LSTM-ATT network hybrid model with an input sequence length (T) of seven and an output sequence length (t) of one, and used the US dataset to analyze the whole process of decomposition, reconstruction, and prediction. We then compared the performance of the proposed model with that of the baseline models under the same experimental environment and with the same input and output sequence lengths. In this test, the TensorFlow and Keras frameworks were used to build the ensemble prediction model. In the prediction experiments, the learning rate was set to 0.01, and the training and testing sets were divided in a 9:1 ratio to optimize the parameter settings of the network hybrid model.
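The input and output sequence lengths translate into a sliding-window construction of training samples. The following sketch shows one way this could look for T = 7 and t = 1 with a 9:1 split; the helper names `make_windows` and `chronological_split` are hypothetical, and splitting in time order (rather than at random) is an assumption, since the text only gives the ratio.

```python
def make_windows(series, input_len=7, output_len=1):
    """Slice a series into (input window, target window) pairs."""
    inputs, targets = [], []
    last_start = len(series) - input_len - output_len
    for start in range(last_start + 1):
        inputs.append(series[start:start + input_len])
        targets.append(series[start + input_len:start + input_len + output_len])
    return inputs, targets


def chronological_split(inputs, targets, train_fraction=0.9):
    """Keep the earliest samples for training and the latest for testing."""
    cut = int(len(inputs) * train_fraction)
    return (inputs[:cut], targets[:cut]), (inputs[cut:], targets[cut:])


# Toy series of 20 "days" (hypothetical values).
toy_series = list(range(20))
X, Y = make_windows(toy_series, input_len=7, output_len=1)
(train_X, train_Y), (test_X, test_Y) = chronological_split(X, Y)
print(len(X), len(train_X), len(test_X))
print(X[0], Y[0])
```

Each sample uses seven consecutive days as input and the following day as the target, so a series of length N yields N - T - t + 1 samples.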

The results of the EEMD of the time series of daily new cases are shown in Figure 4. As can be seen from Figure 4, the original time series of daily new cases is decomposed into eight IMF components and one residual component (Res). The original series of daily new cases has pronounced non-stationary and non-linear characteristics. IMF1~IMF5 and the residual component vary significantly on the time scale, with obvious oscillation and aggregation. IMF6~IMF7 are cyclical on the time scale, showing the interannual variability in the series of daily new cases. IMF8 shows a clear upward trend on the time scale, reflecting the long-term trend in the time series of daily new cases.

The fuzzy entropy of each IMF component is calculated, and the results are shown in Table 1, where Δ is the difference between the fuzzy entropy of the current component and that of the next component. As can be seen from Table 1, the fuzzy entropy of the IMF components decreases from IMF1 to IMF8, indicating that the fluctuation frequency and complexity of the sequences decrease; the residual component has a somewhat higher entropy value, and its complexity is comparable to that of the IMF3 component. The components are divided into three groups according to their fuzzy entropy values: IMF1~IMF2 have the largest entropy values and form the high-frequency group; IMF3~IMF5 and Res have relatively large entropy values and form the medium-frequency group; and IMF6~IMF8 have the smallest entropy values and form the low-frequency group. Within the high-frequency group, the fuzzy entropy difference between IMF1 and IMF2 is large, second only to the difference between IMF2 in the high-frequency group and IMF3 in the medium-frequency group, indicating that IMF1~IMF2 carry important information affecting the temporal change in the daily new cases of COVID-19. In the medium-frequency group, the fuzzy entropy differences between IMF3 and IMF4, IMF4 and IMF5, and IMF5 and IMF6 are relatively large, suggesting that IMF3~IMF5 and Res retain detailed information on the chronological changes in the daily new cases of COVID-19. In the low-frequency group, the fuzzy entropy differences between IMF6 and IMF7 and between IMF7 and IMF8 are small, indicating that the information retained by IMF6~IMF8 on the time scale is similar. To retain more detailed information on the time scale and reduce the accumulation of prediction errors, each component of the high-frequency and medium-frequency groups is kept as a separate reconstituted sequence, whereas the low-frequency components IMF6~IMF8 are combined into a single reconstituted sequence.
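To make the complexity measure concrete, the sketch below implements fuzzy entropy along the lines of the definition by Chen et al. listed in the references, in plain Python. The parameter choices (embedding dimension m = 2, tolerance r = 0.2 times the standard deviation, exponent 2), the function names, and the two toy signals are assumptions for illustration; this section does not state the settings actually used.

```python
import math
import random


def _mean_similarity(x, dim, count, r, power):
    """Average fuzzy similarity between baseline-removed vectors of length dim."""
    vectors = []
    for i in range(count):
        segment = x[i:i + dim]
        baseline = sum(segment) / dim
        vectors.append([v - baseline for v in segment])
    total = 0.0
    for i in range(count):
        for j in range(count):
            if i != j:
                # Chebyshev distance followed by an exponential membership function.
                dist = max(abs(a - b) for a, b in zip(vectors[i], vectors[j]))
                total += math.exp(-(dist ** power) / r)
    return total / (count * (count - 1))


def fuzzy_entropy(x, m=2, r_factor=0.2, power=2):
    """Fuzzy entropy of a sequence (parameter defaults are assumptions)."""
    n = len(x)
    mean = sum(x) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    r = r_factor * std
    count = n - m  # same number of template vectors for dimensions m and m + 1
    phi_m = _mean_similarity(x, m, count, r, power)
    phi_m1 = _mean_similarity(x, m + 1, count, r, power)
    return math.log(phi_m) - math.log(phi_m1)


# Toy signals (hypothetical): a slow sine wave and Gaussian noise.
random.seed(0)
smooth = [math.sin(2 * math.pi * t / 50) for t in range(200)]
noisy = [random.gauss(0.0, 1.0) for _ in range(200)]
fe_smooth = fuzzy_entropy(smooth)
fe_noisy = fuzzy_entropy(noisy)
print("smooth:", fe_smooth, "noisy:", fe_noisy)
```

The irregular signal receives the larger value, which mirrors the pattern in Table 1, where the rapidly fluctuating early IMFs have higher fuzzy entropy than the slowly varying later ones.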

The seven reconstituted sequences are input into the CNN-LSTM-Attention network hybrid model to obtain the prediction results of each reconstituted feature sequence, and finally, the prediction results of these seven sequences are summed to obtain the final predicted value of daily new cases in the United States.
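The reconstitution and integration steps can be sketched as follows, assuming the grouping described above: the high- and medium-frequency components are kept individually and IMF6~IMF8 are summed into one sequence, giving seven sequences. The function names and the toy component values are hypothetical, and the per-sequence "predictions" are simply the sequences themselves, standing in for the outputs of the network.

```python
HIGH = ["IMF1", "IMF2"]
MEDIUM = ["IMF3", "IMF4", "IMF5", "Res"]
LOW = ["IMF6", "IMF7", "IMF8"]


def reconstitute(components):
    """Keep high/medium components as they are and merge the low-frequency group."""
    sequences = {name: list(components[name]) for name in HIGH + MEDIUM}
    length = len(components[LOW[0]])
    sequences["LOW(IMF6-IMF8)"] = [
        sum(components[name][t] for name in LOW) for t in range(length)
    ]
    return sequences


def integrate(predictions):
    """Sum the per-sequence predictions point by point."""
    length = len(next(iter(predictions.values())))
    return [sum(seq[t] for seq in predictions.values()) for t in range(length)]


# Toy components (hypothetical values): component k takes the value k + t on day t.
toy_components = {
    name: [float(k + t) for t in range(5)]
    for k, name in enumerate(HIGH + MEDIUM + LOW)
}
reconstituted = reconstitute(toy_components)
combined = integrate(reconstituted)
print(len(reconstituted), combined)
```

Since the reconstituted sequences add up to the same total as the original components, summing their individual predictions yields a forecast on the scale of the original series.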

To evaluate the effectiveness of the proposed method (EEMD-FE-CNN-LSTM-ATT), it is compared with ARIMA [10], LSTM [18], CNN-LSTM [25], EEMD-NARANN-ARIMA [43], and a prediction model that uses the discrete wavelet transform (DWT-CNN-LSTM-ATT). The results, illustrated in Figure 5, indicate that both the ARIMA and LSTM models capture extreme points inaccurately and show a noticeable lag. While the CNN-LSTM and DWT-CNN-LSTM-ATT models generally capture the trend of the epidemic accurately, CNN-LSTM still exhibits some lag. Compared with the model based on the discrete wavelet transform (DWT-CNN-LSTM-ATT), the model employing EEMD captures turning points and extreme values more accurately. This supports the effectiveness of using EEMD to stabilize the daily new COVID-19 case series and to predict local features at different time scales, which leads to a significant improvement in prediction accuracy. Furthermore, compared with EEMD-NARANN-ARIMA, the proposed method (EEMD-FE-CNN-LSTM-ATT) produces a prediction curve that fits the actual curve better and achieves higher accuracy.

Table 2 presents the accuracy evaluation results of the proposed method and the comparative methods on the US dataset. It can be observed that the proposed method achieves the lowest MAE, MAPE, and RMSE values, as well as the highest R² value, indicating the highest prediction accuracy. The ARIMA model exhibits an MAE of 34,609.37, an MAPE of 64.24%, an RMSE of 46,260.28, and an R² of 0.2450. Its R² is the lowest and its RMSE the highest of all the models, which may be attributed to ARIMA's strong requirement of sequence stationarity, whereas the daily new cases in the US are non-stationary. The LSTM model yields an MAE of 44,167.21, an MAPE of 74.03%, an RMSE of 35,246.13, and an R² of 0.3171. Compared with ARIMA, LSTM improves the RMSE and R², although its MAE and MAPE are higher; moreover, as observed in Figure 5, LSTM captures extreme points poorly. Relative to LSTM, CNN-LSTM reduces the MAE by 18,717.64, the MAPE by 30.09%, and the RMSE by 2152.94, and increases the R² by 0.2995. This significant improvement in prediction accuracy indicates the effectiveness of integrating CNNs and LSTM in extracting relevant features from the daily new case time series.

The DWT-CNN-LSTM-ATT model achieves an MAE of 23,058.23, an MAPE of 45.35%, an RMSE of 29,313.31, and an R² of 0.6992, that is, a lower MAE and RMSE and a higher R² than the models without decomposition techniques, although its MAPE is slightly higher than that of CNN-LSTM. Compared to DWT-CNN-LSTM-ATT, EEMD-NARANN-ARIMA reduces the MAE by 11,462.57, the MAPE by 24.99%, and the RMSE by 14,151.52, and increases the R² by 0.2203. Similarly, EEMD-FE-CNN-LSTM-ATT reduces the MAE by 13,402.99, the MAPE by 26.56%, and the RMSE by 17,084.67, and increases the R² by 0.2484. This is likely due to the dependence of the discrete wavelet transform on the choice of wavelet basis functions, whereas EEMD is adaptive and stable, providing advantages in extracting long-term trends, periodic changes, and random fluctuations from the daily new cases time series. Compared to EEMD-NARANN-ARIMA, EEMD-FE-CNN-LSTM-ATT reduces the MAE by 1940.42, the MAPE by 1.57%, and the RMSE by 2933.15, and increases the R² by 0.0281, indicating that an appropriate hybrid prediction model for the subsequences obtained after decomposition can provide better prediction performance. These results demonstrate the effectiveness of the proposed method in predicting the daily new COVID-19 cases.
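The improvement figures quoted above are simple differences between rows of Table 2. As a quick check, the sketch below recomputes them from the tabulated values; the dictionary layout and the helper name `improvement` are illustrative choices, not part of the original study.

```python
# Values copied from Table 2 (MAPE in percent).
TABLE2 = {
    "DWT-CNN-LSTM-ATT": {"MAE": 23058.23, "MAPE": 45.35, "RMSE": 29313.31, "R2": 0.6992},
    "EEMD-NARANN-ARIMA": {"MAE": 11595.66, "MAPE": 20.36, "RMSE": 15161.79, "R2": 0.9195},
    "EEMD-FE-CNN-LSTM-ATT": {"MAE": 9655.24, "MAPE": 18.79, "RMSE": 12228.64, "R2": 0.9476},
}


def improvement(baseline, candidate):
    """Reduction in the error metrics and gain in R2 when moving from baseline to candidate."""
    b, c = TABLE2[baseline], TABLE2[candidate]
    return {
        "MAE": round(b["MAE"] - c["MAE"], 2),
        "MAPE": round(b["MAPE"] - c["MAPE"], 2),
        "RMSE": round(b["RMSE"] - c["RMSE"], 2),
        "R2": round(c["R2"] - b["R2"], 4),
    }


dwt_vs_proposed = improvement("DWT-CNN-LSTM-ATT", "EEMD-FE-CNN-LSTM-ATT")
narann_vs_proposed = improvement("EEMD-NARANN-ARIMA", "EEMD-FE-CNN-LSTM-ATT")
print(dwt_vs_proposed)
print(narann_vs_proposed)
```

Note that the MAPE differences are in percentage points, not relative percentages.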

5. Discussion

5.1. Ablation Experiments

To explore the contributions of various components in the proposed model (EEMD-FE-CNN-LSTM-ATT) towards its overall performance, we compared the performance differences among six models: CNN-LSTM, CNN-LSTM-ATT, EEMD-CNN-LSTM, EEMD-CNN-LSTM-ATT, EEMD-FE-CNN-LSTM, and EEMD-FE-CNN-LSTM-ATT. We evaluated their performance in predicting the COVID-19 epidemic in the United States, and the results are presented in Table 3.

Compared with CNN-LSTM, the R² of CNN-LSTM-ATT increased by 0.0324, the MAE decreased by 1951.37, the MAPE decreased by 2.06%, and the RMSE decreased by 1429.55. This shows that adding the attention module to CNN-LSTM to capture the time dependence of virus infection sequences can improve the prediction accuracy. Compared with CNN-LSTM, the R² of EEMD-CNN-LSTM increased by 0.2908, the MAE decreased by 12,578.65, the MAPE decreased by 23.37%, and the RMSE decreased by 16,826.24. The prediction accuracy is greatly improved, which indicates that, for non-linear and non-stationary epidemic time series, it is necessary to use EEMD to stabilize the time series and to predict the extracted local features separately. Compared to CNN-LSTM-ATT and EEMD-CNN-LSTM, EEMD-CNN-LSTM-ATT achieved a lower MAE and RMSE and a higher R². This indicates that the simultaneous introduction of the attention module and the EEMD method can further enhance the performance of CNN-LSTM in COVID-19 epidemic prediction. Compared with EEMD-CNN-LSTM, the R² of EEMD-FE-CNN-LSTM increased by 0.0199, the MAE decreased by 1382.33, the MAPE increased by 0.81%, and the RMSE decreased by 1857.11. Compared with EEMD-CNN-LSTM-ATT, the R² of the proposed model (EEMD-FE-CNN-LSTM-ATT) increased by 0.0206, the MAE decreased by 1642.44, the MAPE decreased by 2.42%, and the RMSE decreased by 2206.80. In general, the FE reconstitution of the EEMD-decomposed subsequences effectively reduces the accumulation of prediction errors. Compared with EEMD-FE-CNN-LSTM, the R² of the proposed model (EEMD-FE-CNN-LSTM-ATT) increased by 0.0203, the MAE decreased by 1833.35, the MAPE decreased by 2.59%, and the RMSE decreased by 2181.20. These results confirm that the simultaneous integration of EEMD, FE reconstitution, and an attention module on top of CNN-LSTM can further enhance the predictive performance of the model.
In summary, the contribution of each component of the proposed model to the overall performance of the model and the reliability of the proposed model for outbreak prediction are verified.

5.2. Time Window Analysis

For the short-term forecasting of epidemic time series, the input time window refers to taking the evolution of the epidemic over a recent period as the input to the model, fully exploiting the trend, cycle, and other characteristic patterns implied in the historical sequence, and outputting the short-term fluctuations of the epidemic time series over a specific output sequence length. An input time window that is too short cannot capture the important patterns of the historical series, whereas with one that is too long the model struggles to extract the many features it contains, leading to a decline in model performance. Additionally, FE reconstitution adds complexity to the decomposed subsequences, and the CNN network and attention mechanism play crucial roles in extracting latent features within the input time window. Therefore, the selection of the input time window influences the performance of each component of the model, ultimately affecting the training effectiveness and predictive accuracy of the proposed model.

The optimal input sequence length for the proposed model is explored by selecting the input time window from T = {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14}, considering the incubation period of COVID-19, which spans from 2 to 14 days. Furthermore, experiments on three variants of the proposed model with FE reconstitution removed (EEMD-CNN-LSTM-ATT), the CNN network removed (EEMD-FE-LSTM-ATT), and the attention mechanism removed (EEMD-FE-CNN-LSTM) are conducted under different input sequence lengths to explore how the change in the input sequence length affects the performance of each component of the model, which in turn affects the prediction accuracy of the proposed model.

The output sequence length is selected from t = {1, 2, 3, 4} to further explore the changes in the prediction accuracy of the three variants and the proposed model as the output sequence length increases, and to verify the superiority of the proposed model with different output sequence lengths.

After fixing the output sequence length as t = 1 and adjusting the input sequence length (T), the result is shown in Figure 6. The prediction accuracy of the proposed model first increases and then decreases to a relatively stable level as the length of the input sequence increases. This suggests that longer input sequences contain more information about the historical COVID-19 epidemic changes, leading to improved prediction performance. However, once the input sequence surpasses a certain length threshold, it becomes challenging for the model to extract the hidden features embedded in the input sequence, causing a decline in the model's performance. The best predictive outcomes are achieved when the input sequence has a length of seven. This is consistent with previous studies, which found that time-series models with an input step size of seven yield superior predictions in COVID-19 outbreak forecasting [49,50].

In addition, compared with EEMD-CNN-LSTM-ATT, the improvement in the prediction accuracy of the proposed model in terms of the R², MAE, and RMSE first increases and then decreases as the input sequence length grows. This is because, as the sequence contains more information, the FE-reconstituted component sequences reduce the accumulation of prediction errors and effectively improve the prediction accuracy. However, as the input sequence grows further, the FE-reconstituted component sequences greatly increase the complexity of the components, leading to a loss of information in the model prediction and making it difficult to improve the prediction performance. Compared with EEMD-FE-LSTM-ATT, the improvement in the prediction accuracy of the proposed model in terms of the R², MAE, RMSE, and MAPE is very significant when the length of the input sequence is between about two and nine, which supports the effectiveness of adding the CNN module to LSTM to extract local features; the robustness of the model's predictions is also clearly enhanced as the input sequence grows further. Compared with EEMD-FE-CNN-LSTM, the performance degradation of the proposed model slows down when the input sequence length is greater than three, demonstrating the ability of the attention mechanism to capture historical dependencies. Overall, the strengths of the components of the proposed model are fully utilized when the length of the input sequence is seven, at which point the proposed model achieves its highest prediction accuracy.

After fixing the input sequence length as T = 7 and adjusting the output sequence length (t), the result is shown in Figure 7. The proposed model exhibits a similar pattern to the three variants of the model in terms of predicting daily new cases in the United States dataset. Notably, there is a gradual rise in the prediction error as the output sequence length increases. Across the four evaluation metrics, the proposed model outperforms the other three variants at every output sequence length. These results further confirm the superiority of the proposed model in predicting daily new cases of COVID-19.

5.3. Model Applications

To provide additional evidence for the efficacy and applicability of our proposed model in predicting daily new COVID-19 cases across diverse regions, we conducted further analyses by forecasting the daily new cases in France and Russia during the same time frame. As displayed in Figure 8, the predicted values fit the actual values well and capture the overall trajectory of the epidemic. However, slight inaccuracies were detected in the forecasting of abrupt changes (e.g., the number of new cases in France on 12 July 2022). In terms of the evaluation indexes, for France the MAE is 5944.34, the MAPE is 26.37%, the RMSE is 8454.89, and the R² is 0.9642, whereas for Russia the MAE is 1167.37, the MAPE is 6.54%, the RMSE is 1496.26, and the R² is 0.9939. These results suggest that the proposed model exhibits a superior goodness of fit on datasets with relatively low non-stationarity, as evidenced by the better performance in predicting the Russian data compared to the French data.

6. Conclusions

To tackle the challenges stemming from the non-stationary and non-linear nature of the COVID-19 daily new case time series, which can result in both low prediction accuracy and notable forecasting lags with single-model approaches, we proposed a novel hybrid framework grounded in EEMD, fuzzy entropy, and CNN-LSTM-Attention techniques. Using the United States, France, and Russia COVID-19 daily new case time series as validation examples, we arrived at the following conclusions:

-: The EEMD method enables the extraction of temporal features such as long-term trends, cyclical changes, and random fluctuations in daily new case data, which helps to decipher the intrinsic mechanism of daily new case time series over time. The introduction of the FE reconstruction component on this basis preserves the detailed change information on the time scale and minimizes the accumulation of prediction errors.
-: A CNN-LSTM-ATT hybrid model was formulated to facilitate the training and forecasting of daily new case time–frequency features. The hybrid network integrates an attention module into the CNN-LSTM network, thereby offering greater adherence to the COVID-19 propagation pattern and elevating the accuracy of daily new case prediction.
-: The novel EEMD-FE-CNN-LSTM-ATT ensemble prediction model exhibits substantially lower MAE, MAPE, and RMSE values, alongside significantly higher R² scores when compared to traditional time-series models in single-step prediction outcomes, which can provide a new method for the data-driven prediction of daily new cases of COVID-19.

This research, while offering valuable insights, is not without limitations. Firstly, the EEMD-FE-CNN-LSTM-ATT model employed in this study is a data-driven model, which underscores the importance of having reliable and accurate daily new case data. However, due to limited nucleic acid detection capacity and potential reporting delays, there may be a gap between the reported figures and the actual number of new COVID-19 cases. Secondly, this study focuses solely on mining characteristic patterns from the time series itself, whereas virus infection rates are influenced by various external factors such as population movements, weather changes, and government policies aimed at prevention and control. Incorporating these external factors into the prediction model could potentially lead to enhanced predictive accuracy. Future research should therefore aim to leverage the characteristics inherent to epidemic time series while accounting for the effects of external factors on the pandemic trajectory, thereby improving the reliability and accuracy of predictions of the course of the COVID-19 outbreak.

Author Contributions

Conceptualization, Y.L.; methodology, Y.L.; software, W.K.; validation, W.K. and Y.L.; formal analysis, W.K.; investigation, W.K.; resources, W.K.; data curation, W.K.; writing—original draft preparation, W.K.; writing—review and editing, W.K.; visualization, W.K.; supervision, Y.L.; project administration, Y.L.; funding acquisition, Y.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Strategic Priority Research Program of the Chinese Academy of Sciences (grant number XDA23100504), the Special Projects of the Central Government Guiding Local Science and Technology Development (grant number 2020L3005), and the National Key Research and Development Program of China (grant number 2017YFB0503500).

Data Availability Statement

The data that support the findings of this study are openly available at https://COVID19.who.int/data (accessed on 1 July 2023).

Acknowledgments

The authors would like to thank the editors and reviewers for their detailed comments and efforts toward improving our study.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Sartorius, B.; Lawson, A.B.; Pullan, R.L. Modelling and predicting the spatio-temporal spread of COVID-19, associated deaths and impact of key risk factors in England. Sci. Rep. 2021, 11, 5378.
2. Gamio, L.; Symonds, A. Global Virus Cases Reach New Peak, Driven by India and South America. 2021. Available online: https://nyti.ms/3xYVO94 (accessed on 5 May 2021).
3. Wu, J.T.; Leung, K.; Leung, G.M. Nowcasting and forecasting the potential domestic and international spread of the 2019-nCoV outbreak originating in Wuhan, China: A modelling study. Lancet 2020, 395, 689–697.
4. Prakash, S.; Jalal, A.S.; Pathak, P. Forecasting COVID-19 Pandemic using Prophet, LSTM, hybrid GRU-LSTM, CNN-LSTM, Bi-LSTM and Stacked-LSTM for India. In Proceedings of the 2023 6th International Conference on Information Systems and Computer Networks (ISCON), Mathura, India, 4 May 2023.
5. Salian, V.S.; Wright, J.A.; Vedell, P.T.; Nair, S.; Li, C.; Kandimalla, M.; Tang, X.; Carmona Porquera, E.M.; Kalari, K.R.; Kandimalla, K.K. COVID-19 Transmission, Current Treatment, and Future Therapeutic Strategies. Mol. Pharm. 2021, 18, 754–771.
6. Dickson, M.M.; Espa, G.; Giuliani, D.; Santi, F.; Savadori, L. Assessing the effect of containment measures on the spatio-temporal dynamic of COVID-19 in Italy. Nonlinear Dyn. 2020, 101, 1833–1846.
7. Fanelli, D.; Piazza, F. Analysis and forecast of COVID-19 spreading in China, Italy and France. Chaos Solitons Fractals 2020, 134, 109761.
8. Choi, S.; Ki, M. Estimating the reproductive number and the outbreak size of COVID-19 in Korea. Epidemiol. Health 2020, 42, e2020011.
9. Ray, D.; Salvatore, M.; Bhattacharyya, R.; Wang, L.; Du, J.; Mohammed, S.; Purkayastha, S.; Halder, A.; Rix, A.; Barker, D. Predictions, role of interventions and effects of a historic national lockdown in India’s response to the COVID-19 pandemic: Data science call to arms. Harv. Data Sci. Rev. 2020, 176, 139–148.
10. Alabdulrazzaq, H.; Alenezi, M.N.; Rawajfih, Y.; Alghannam, B.A.; Al-Hassan, A.A.; Al-Anzi, F.S. On the accuracy of ARIMA based prediction of COVID-19 spread. Results Phys. 2021, 27, 104509.
11. Gupta, V.K.; Gupta, A.; Kumar, D.; Sardana, A. Prediction of COVID-19 confirmed, death, and cured cases in India using random forest model. Big Data Min. Anal. 2021, 4, 116–123.
12. Singh, V.; Poonia, R.C.; Kumar, S.; Dass, P.; Agarwal, P.; Bhatnagar, V.; Raja, L. Prediction of COVID-19 corona virus pandemic based on time series data using support vector machine. J. Discret. Math. Sci. Cryptogr. 2020, 23, 1583–1597.
13. Wieczorek, M.; Siłka, J.; Woźniak, M. Neural network powered COVID-19 spread forecasting model. Chaos Solitons Fractals 2020, 140, 110203.
14. Peng, Y.; Nagata, M.H. An empirical overview of nonlinearity and overfitting in machine learning using COVID-19 data. Chaos Solitons Fractals 2020, 139, 110055.
15. Hayhoe, M.; Barreras, F.; Preciado, V.M. Multitask learning and nonlinear optimal control of the COVID-19 outbreak: A geometric programming approach. Annu. Rev. Control 2021, 52, 495–507.
16. Shafiq, A.; Batur Colak, A.; Naz Sindhu, T.; Ahmad Lone, S.; Alsubie, A.; Jarad, F. Comparative study of artificial neural network versus parametric method in COVID-19 data analysis. Results Phys. 2022, 38, 105613.
17. Conde-Gutierrez, R.A.; Colorado, D.; Hernandez-Bautista, S.L. Comparison of an artificial neural network and Gompertz model for predicting the dynamics of deaths from COVID-19 in Mexico. Nonlinear Dyn. 2021, 104, 4655–4669.
18. Kirbas, I.; Sozen, A.; Tuncer, A.D.; Kazancioglu, F.S. Comparative analysis and forecasting of COVID-19 cases in various European countries with ARIMA, NARNN and LSTM approaches. Chaos Solitons Fractals 2020, 138, 110015.
19. Shahid, F.; Zameer, A.; Muneeb, M. Predictions for COVID-19 with deep learning models of LSTM, GRU and Bi-LSTM. Chaos Solitons Fractals 2020, 140, 110212.
20. Devaraj, J.; Madurai Elavarasan, R.; Pugazhendhi, R.; Shafiullah, G.M.; Ganesan, S.; Jeysree, A.K.; Khan, I.A.; Hossain, E. Forecasting of COVID-19 cases using deep learning models: Is it reliable and practically significant? Results Phys. 2021, 21, 103817.
21. Ayoobi, N.; Sharifrazi, D.; Alizadehsani, R.; Shoeibi, A.; Gorriz, J.M.; Moosaei, H.; Khosravi, A.; Nahavandi, S.; Gholamzadeh Chofreh, A.; Goni, F.A.; et al. Time series forecasting of new cases and new deaths rate for COVID-19 using deep learning methods. Results Phys. 2021, 27, 104495.
22. Kumar, J.; Agiwal, V.; Yau, C.Y. Study of the trend pattern of COVID-19 using spline-based time series model: A Bayesian paradigm. Jpn. J. Stat. Data Sci. 2021, 5, 363–377.
23. Dairi, A.; Harrou, F.; Zeroual, A.; Hittawe, M.M.; Sun, Y. Comparative study of machine learning methods for COVID-19 transmission forecasting. J. Biomed. Inf. 2021, 118, 103791.
24. Jin, Y.; Wang, R.; Zhuang, X.; Wang, K.; Wang, H.; Wang, C.; Wang, X. Prediction of COVID-19 Data Using an ARIMA-LSTM Hybrid Forecast Model. Mathematics 2022, 10, 4001.
25. Verma, H.; Mandal, S.; Gupta, A. Temporal deep learning architecture for prediction of COVID-19 cases in India. Expert Syst. Appl. 2022, 195, 116611.
26. Pandianchery, M.S.; Sowmya, V.; Gopalakrishnan, E.; Ravi, V.; Soman, K. Centralized CNN–GRU Model by Federated Learning for COVID-19 Prediction in India. IEEE Trans. Comput. Soc. Syst. 2023, 11, 1362–1371.
27. Silk, D.S.; Bowman, V.E.; Semochkina, D.; Dalrymple, U.; Woods, D.C. Uncertainty quantification for epidemiological forecasts of COVID-19 through combinations of model predictions. Stat. Methods Med. Res. 2022, 31, 1778–1789.
28. Livieris, I.E.; Pintelas, E.; Pintelas, P. A CNN–LSTM model for gold price time-series forecasting. Neural Comput. Appl. 2020, 32, 17351–17360.
29. Li, S.; Xie, G.; Ren, J.; Guo, L.; Yang, Y.; Xu, X. Urban PM2.5 Concentration Prediction via Attention-Based CNN–LSTM. Appl. Sci. 2020, 10, 1953.
30. Barzegar, R.; Aalami, M.T.; Adamowski, J. Short-term water quality variable prediction using a hybrid CNN–LSTM deep learning model. Stoch. Environ. Res. Risk Assess. 2020, 34, 415–433.
31. Cinar, Y.G.; Mirisaee, H.; Goswami, P.; Gaussier, E.; Aït-Bachir, A. Period-aware content attention RNNs for time series forecasting with missing values. Neurocomputing 2018, 312, 177–186.
32. Ran, X.; Shan, Z.; Fang, Y.; Lin, C. An LSTM-Based Method with Attention Mechanism for Travel Time Prediction. Sensors 2019, 19, 861.
33. Yang, Y.; Xiong, Q.; Wu, C.; Zou, Q.; Yu, Y.; Yi, H.; Gao, M. A study on water quality prediction by a hybrid CNN-LSTM model with attention mechanism. Environ. Sci. Pollut. Res. 2021, 28, 55129–55139.
34. Zhang, Y.-M.; Wang, H. Multi-head attention-based probabilistic CNN-BiLSTM for day-ahead wind speed forecasting. Energy 2023, 278, 127865.
35. Liu, X.; Huang, J.; Li, C.; Zhao, Y.; Wang, D.; Huang, Z.; Yang, K. The role of seasonality in the spread of COVID-19 pandemic. Environ. Res. 2021, 195, 110874.
36. Singh, S.; Parmar, K.S.; Kumar, J.; Makkhan, S.J.S. Development of new hybrid model of discrete wavelet decomposition and autoregressive integrated moving average (ARIMA) models in application to one month forecast the casualties cases of COVID-19. Chaos Solitons Fractals 2020, 135, 109866.
37. Sharma, R.R.; Kumar, M.; Maheshwari, S.; Ray, K.P. EVDHM-ARIMA-based time series forecasting model and its application for COVID-19 cases. IEEE Trans. Instrum. Meas. 2020, 70, 3041833.
38. Ijaz, M.F.; Sperandio Nascimento, E.G.; Ortiz, J.; Furtado, A.N.; Frias, D. Using discrete wavelet transform for optimizing COVID-19 new cases and deaths prediction worldwide with deep neural networks. PLoS ONE 2023, 18, e0282621.
39. Bhattacharyya, A.; Chakraborty, T.; Rai, S.N. Stochastic forecasting of COVID-19 daily new cases across countries with a novel hybrid time series model. Nonlinear Dyn. 2022, 107, 3025–3040.
40. Wu, Z.; Huang, N.E. Ensemble empirical mode decomposition: A noise-assisted data analysis method. Adv. Adapt. Data Anal. 2009, 1, 1–41.
41. Liu, C.; Huang, J.; Ji, F.; Zhang, L.; Liu, X.; Wei, Y.; Lian, X. Improvement of the global prediction system of the COVID-19 pandemic based on the ensemble empirical mode decomposition (EEMD) and autoregressive moving average (ARMA) model in a hybrid approach. Atmos. Ocean. Sci. Lett. 2021, 14, 100019.
42. Hasan, N. A Methodological Approach for Predicting COVID-19 Epidemic Using EEMD-ANN Hybrid Model. Internet Things 2020, 11, 100228.
43. Wang, Y.; Xu, C.; Yao, S.; Wang, L.; Zhao, Y.; Ren, J.; Li, Y. Estimating the COVID-19 prevalence and mortality using a novel data-driven hybrid model based on ensemble empirical mode decomposition. Sci. Rep. 2021, 11, 21413.
44. Chen, W.; Wang, Z.; Xie, H.; Yu, W. Characterization of surface EMG signal based on fuzzy entropy. IEEE Trans. Neural Syst. Rehabil. Eng. 2007, 15, 266–272.
45. Huang, Y.; Dai, X.; Wang, Q.; Zhou, D. A hybrid model for carbon price forecasting using GARCH and long short-term memory network. Appl. Energy 2021, 285, 116485.
46. Li, K.; Huang, W.; Hu, G.; Li, J. Ultra-short term power load forecasting based on CEEMDAN-SE and LSTM neural network. Energy Build. 2023, 279, 112666.
47. Chen, Y.; Zhao, P.; Zhang, Z.; Bai, J.; Guo, Y. A Stock Price Forecasting Model Integrating Complementary Ensemble Empirical Mode Decomposition and Independent Component Analysis. Int. J. Comput. Intell. Syst. 2022, 15, 75.
48. Liu, F.T.; Ting, K.M.; Zhou, Z.-H. Isolation forest. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining, Pisa, Italy, 15–19 December 2008.
49. Zandavi, S.M.; Rashidi, T.H.; Vafaee, F. Dynamic Hybrid Model to Forecast the Spread of COVID-19 Using LSTM and Behavioral Models Under Uncertainty. IEEE Trans. Cybern. 2022, 52, 11977–11989.
50. Stewart, R.; Erwin, S.; Piburn, J.; Nagle, N.; Kaufman, J.; Peluso, A.; Christian, J.B.; Grant, J.; Sorokine, A.; Bhaduri, B. Near real time monitoring and forecasting for COVID-19 situational awareness. Appl. Geogr. 2022, 146, 102759.

Figure 1. Structure of CNN-LSTM-ATT network hybrid model. CNN layer: convolutional neural network layer. LSTM layer: long short-term memory layer.

Figure 2. Flowchart of the decomposition–reconstitution–integration-based prediction method for COVID-19 new cases. EEMD: ensemble empirical mode decomposition.

Figure 3. Original time series of daily new cases in the United States, France, and Russia. Data range from 1 February 2020 to 30 September 2022.

Figure 4. Intrinsic mode function (IMF) subseries via decomposing the original daily new cases time series.

Figure 5. Single-step prediction results for COVID-19 daily new cases of different prediction methods.

Figure 6. Influence curve of input sequence length on model performance: (a) R² performance curve; (b) MAE performance curve; (c) RMSE performance curve; (d) MAPE performance curve.

Figure 7. Influence curve of output sequence length on model performance: (a) R² performance curve; (b) MAE performance curve; (c) RMSE performance curve; (d) MAPE performance curve.

Figure 8. Single-step prediction results of time series of daily new cases of COVID-19 in different countries: (a) France; (b) Russia.

Table 1. Fuzzy entropy of each IMF component.

Component	Fuzzy Entropy	Δ
IMF1	0.199936659	0.0205864370
IMF2	0.179350222	0.1636201220
IMF3	0.015730100	0.0114509900
IMF4	0.004279110	0.0027034932
IMF5	0.001575617	0.0013026940
IMF6	0.000272923	0.0002533040
IMF7	0.000019619	0.0000177400
IMF8	0.000001879	−0.015548988
Res	0.015550867	-

Δ is the fuzzy entropy of the current component minus the fuzzy entropy of the next component.
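
As a quick check on how the Δ column is formed, the following Python sketch recomputes it for the first three rows of Table 1. The function name `successive_differences` is illustrative and not from the paper; `Decimal` is used only so that the subtraction is exact at the table's printed precision.

```python
from decimal import Decimal


def successive_differences(entropies):
    """Return FE(i) - FE(i+1) for each adjacent pair of components."""
    return [cur - nxt for cur, nxt in zip(entropies, entropies[1:])]


# Fuzzy entropy values for IMF1 to IMF4, copied from Table 1.
fe = [
    Decimal("0.199936659"),
    Decimal("0.179350222"),
    Decimal("0.015730100"),
    Decimal("0.004279110"),
]

deltas = successive_differences(fe)
for name, delta in zip(["IMF1", "IMF2", "IMF3"], deltas):
    print(name, delta)
```

The three printed values should agree with the Δ entries listed for IMF1, IMF2, and IMF3.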

Table 2. Evaluation of the accuracy of different prediction methods.

Prediction Method	MAE	MAPE/%	RMSE	R²
ARIMA	34,609.37	64.24	46,260.28	0.2450
LSTM	35,246.13	74.03	44,167.21	0.3171
CNN-LSTM	25,449.57	43.94	33,093.19	0.6166
DWT-CNN-LSTM-ATT	23,058.23	45.35	29,313.31	0.6992
EEMD-NARANN-ARIMA	11,595.66	20.36	15,161.79	0.9195
EEMD-FE-CNN-LSTM-ATT	9655.24	18.79	12,228.64	0.9476
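
The four scores reported in Table 2 can be computed along the following lines. This is a minimal sketch that assumes the standard definitions of MAE, MAPE, RMSE, and R²; the function name `regression_metrics` and the toy values are illustrative and are not taken from the paper's data.

```python
import math


def regression_metrics(y_true, y_pred):
    """Compute MAE, MAPE (in percent), RMSE, and R^2 for two equal-length lists."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    # MAPE divides by the true value, so y_true must not contain zeros.
    mape = 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    ss_res = sum(e * e for e in errors)
    rmse = math.sqrt(ss_res / n)

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot

    return {"MAE": mae, "MAPE": mape, "RMSE": rmse, "R2": r2}


# Toy values, made up purely for illustration.
y_true = [100.0, 200.0, 300.0, 400.0]
y_pred = [110.0, 190.0, 330.0, 360.0]

metrics = regression_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because RMSE weights large errors more heavily, it is never smaller than MAE on the same predictions, which offers a simple sanity check when reading such tables.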

Table 3. Evaluation table for accuracy of ablation experiments.

Method	MAE	MAPE/%	RMSE	R²
CNN-LSTM	25,449.57	43.94	33,093.19	0.6166
CNN-LSTM-ATT	23,498.20	41.88	31,663.64	0.6490
EEMD-CNN-LSTM	12,870.92	20.57	16,266.95	0.9074
EEMD-CNN-LSTM-ATT	11,297.68	21.21	14,435.44	0.9270
EEMD-FE-CNN-LSTM	11,488.59	21.38	14,409.84	0.9273
EEMD-FE-CNN-LSTM-ATT	9655.24	18.79	12,228.64	0.9476
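
To read the ablation results as relative gains, one can express each score as a percentage reduction against the CNN-LSTM baseline. The sketch below does this for MAE using the values printed in Table 3; the helper name `percent_reduction` is hypothetical.

```python
def percent_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline


# MAE values copied from Table 3.
mae = {
    "CNN-LSTM": 25449.57,
    "CNN-LSTM-ATT": 23498.20,
    "EEMD-CNN-LSTM": 12870.92,
    "EEMD-CNN-LSTM-ATT": 11297.68,
    "EEMD-FE-CNN-LSTM": 11488.59,
    "EEMD-FE-CNN-LSTM-ATT": 9655.24,
}

baseline = mae["CNN-LSTM"]
reductions = {name: percent_reduction(baseline, v) for name, v in mae.items()}
for name, r in reductions.items():
    print(f"{name}: {r:.2f}% lower MAE than CNN-LSTM")
```

The same helper applies unchanged to the MAPE and RMSE columns; R² is a higher-is-better score and would need the comparison reversed.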


© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

