Artificial Neural Networks for COVID-19 Time Series Forecasting
1. Introduction
The World Health Organization has declared the 2019 coronavirus disease (COVID-19) a pandemic and a global threat. Over the last 3 years, it has had a significant negative impact worldwide in all fields. Predicting future COVID-19 infections can therefore be extremely useful, as it may improve public health decision-making, including decisions on interventions to contain the spread of the pandemic. Using appropriate models and consistently making accurate projections can help countries allocate their resources better and prepare for the future. Anticipating future values of the pandemic, in terms of the number of infections, the evolution of the spread of the virus, or deaths, can help countries prepare their health care systems, whether they are among the most affected by the pandemic or have only recently been struggling with its spread.
Many models for forecasting the global and local spread of infection cases have been developed since the beginning of the pandemic. In this article, we provide forecasts for the confirmed new COVID-19 cases in Italy using four different time-series forecasting models and compare their performance on the daily reported data. We also analyze the forecast errors, with the objective of forming a clear expectation of future cases and thereby improving the preparedness of health care systems.
The purpose of our work is to determine the best forecasting model for the spread of Coronavirus infection data in a certain region for a given period of time.
Several studies have tried to predict the evolution of the COVID-19 pandemic using a variety of models. Khan and Gupta [1] applied an ARIMA (1, 1, 0) and a nonlinear autoregressive (NAR) model to Indian COVID-19 infected cases for a daily prediction of new cases 50 days ahead, preferring the linear ARIMA model over the NAR model, due to the fact that the most recent Indian COVID-19 new cases followed a linear trend. Batista [2] used the logistic model to predict the number of cases in China, South Korea and the rest of the world during the first semester of 2020, before the second wave occurred. Abotaleb and Makarovskikh [3] predicted future COVID-19 cases in Russia through a hybrid system, considering linear models (ARIMA and Exponential Smoothing) and nonlinear models (BATS, TBATS) for data collected until March 2021. Safi and Sanusi [4] applied an ARIMA model to predict COVID-19 cases for data collected during the first and the second pandemic wave, dividing the time series into two parts. Gecili et al. [5] applied ARIMA, Smoothing Spline and TBATS models to COVID-19 pandemic data for the USA and Italy for the period February-April 2020, preferring the first two linear models to the third. Salaheldin and Abotaleb [6] chose the exponential growth model over ARIMA for making predictions on daily COVID-19 cases in China, Italy and the USA, not considering the nonlinear models as possible forecasting models, due to the fact that in these countries COVID-19 new cases had a nonlinear trend. Tian et al. [7] applied a hidden Markov chain, a hierarchical Bayes and a long short-term memory (LSTM) model to predict future cases for six countries during the first four months of the spread of the pandemic, preferring the LSTM model over the others, since it had the lowest root mean square error.
In this paper, we aim to choose the best model among the most well-known and widespread time series forecasting models in the literature. Unlike previous works, and since the curve of new COVID-19 cases follows a nonlinear trend, this work emphasizes the importance of nonlinear methods for modeling such time series: classical linear models cannot capture the traits of nonlinear time series and would consequently give unreliable predicted values. We consider a time series containing data from the beginning of the spread of the pandemic (22 February 2020) to 10 January 2022, a period whose final months, according to the predictions of previous works, were expected to be relatively quiet in terms of new daily cases.
2. Methodology
We considered data published online by the Superior Health Institute (Epidemiology for Public Health) on COVID-19 infections and deaths in Italy for the period from 22 February 2020 to 10 January 2022, using:
- New daily national infections from 22 February 2020 to 10 January 2022 (Figure 1);
- The last 8 days for testing daily cases (2 January 2022-10 January 2022);
- The last 50 days for testing the forecasting of the third wave.
The forecasting was conducted with the R package forecast, which provides methods and tools for forecasting univariate time series. We implemented an ARIMA model, a NNAR model, a TBATS model and Holt’s linear model, and chose the best model by comparing the Mean Absolute Percentage Error (MAPE) of each, defined as follows:
$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{A_t - F_t}{A_t} \right| \qquad (1)$$
where n is the total number of observations, A_t is the actual value and F_t is the forecast value.
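As an illustration of the MAPE definition above, the following is a minimal Python sketch (the study itself used R). The function name `mape` and the sample values are hypothetical and are not taken from the Italian data.

```python
def mape(actual, forecast):
    """Mean absolute percentage error, in percent.

    actual   -- observed values A_t (must be non-zero)
    forecast -- forecast values F_t, same length as `actual`
    """
    if len(actual) != len(forecast) or len(actual) == 0:
        raise ValueError("actual and forecast must be non-empty and of equal length")
    total = 0.0
    for a, f in zip(actual, forecast):
        if a == 0:
            raise ValueError("MAPE is undefined when an actual value is zero")
        total += abs((a - f) / a)
    return 100.0 * total / len(actual)


# Hypothetical values, for illustration only.
actual = [100.0, 200.0, 400.0]
forecast = [110.0, 180.0, 400.0]
print(round(mape(actual, forecast), 3))
```

Because each error is divided by the actual value, MAPE is scale-free, which is convenient when comparing models on the same series, but it is undefined on days with zero reported cases.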
Figure 1. Daily COVID-19 infection cases in Italy.
2.1. ARIMA Model
The first model is ARIMA (Auto-Regressive Integrated Moving Average), one of the most common models for time series forecasting. It represents a time series as a function of its own past values (lags) and of the lagged forecast errors, and uses this representation to forecast future values. An ARIMA model is characterized by three terms, p, d and q:
$$y'_t = c + \phi_1 y'_{t-1} + \dots + \phi_p y'_{t-p} + \theta_1 \varepsilon_{t-1} + \dots + \theta_q \varepsilon_{t-q} + \varepsilon_t \qquad (2)$$
where y'_t is the series after d rounds of differencing, c is a constant, \phi_i and \theta_j are the AR and MA coefficients and \varepsilon_t is the error term. Here p is the order of the Auto-Regressive (AR) term and refers to the number of lags of y used as predictors, q is the order of the Moving Average (MA) term and refers to the number of lagged errors used as predictors, while d is the number of differencing steps required to make the time series stationary. The most common approach to making a series stationary is to subtract the previous value from the current value, and more than one round of differencing may be required, depending on the complexity of the series. So, d is the minimum number of differencing steps needed to make the series stationary, and if the time series is already stationary, then d = 0.
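The differencing step described above (subtracting the previous value from the current one, repeated d times) can be sketched in Python as follows. The helper name `difference` and the sample series are illustrative assumptions, not part of the original analysis.

```python
def difference(series, d=1):
    """Apply first-order differencing d times: y'_t = y_t - y_{t-1}."""
    out = list(series)
    for _ in range(d):
        # Each pass shortens the series by one observation.
        out = [curr - prev for prev, curr in zip(out, out[1:])]
    return out


# Hypothetical series with a quadratic trend (t^2 + 1 for t = 1..5).
series = [2, 5, 10, 17, 26]
print(difference(series, d=1))  # still trending after one pass
print(difference(series, d=2))  # constant after two passes
```

In this toy case one pass still leaves a trend, while two passes give a constant series, which is the sense in which d is the minimum number of differencing steps needed.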
The principal objective of the ARIMA model is to forecast future values by recognizing the stochastic mechanism of the time series. Although ARIMA is widely used for time series analysis, choosing appropriate orders for its components is not easy, so we determined the orders automatically, using the auto.arima function from the forecast package in R, which returns the best ARIMA model. This includes identifying the most suitable lags for the AR and MA components and deciding whether the variable needs differencing to induce stationarity. The model that best fitted our time series data was ARIMA (2,1,2). This model has been used in the study to forecast the number of new COVID-19 cases in Italy. The steps of the ARIMA model building methodology are presented in Figure 2:
Identification of the model: The autocorrelation function (ACF) and the partial autocorrelation function (PACF) were used to determine the best model, by defining the AR and MA model components.
Model estimation: This step involves using statistical techniques to derive the coefficients that best fit the chosen ARIMA model. The most popular approaches are the Maximum Likelihood (ML) method and the nonlinear least squares approach.
Model testing: This step includes the test for autocorrelation. In particular, the ACF and PACF plots (Figure 3) are helpful in detecting dependencies between the lags. If this test fails, the process goes back to the identification step to build a better model. The estimated model is compared with the other candidate ARIMA models in order to select the best one. The most popular model selection criteria are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC):
$$\mathrm{AIC} = 2k - 2\ln(L), \qquad \mathrm{BIC} = k\ln(n) - 2\ln(L) \qquad (3)$$
where L indicates the likelihood, k is the number of parameters and n is the number of observations.
Figure 2. Application of ARIMA model to forecasting data.
Forecasting: Once the model has been identified and its parameters have been estimated, it can be used for forecasting purposes. Its adequacy is checked using statistical tests and residual plots, which can also be used to analyze the suitability of various models to historical data.
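The model selection criteria used in the testing step can be computed directly from a fitted model's log-likelihood. The sketch below is a Python illustration under the standard definitions of AIC and BIC; the function names and the log-likelihood values of the two candidate models are hypothetical and do not come from the study.

```python
import math


def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood


def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood


# Hypothetical candidates: (label, log-likelihood, number of parameters).
candidates = [("model A", -520.0, 3), ("model B", -515.0, 5)]
n_obs = 200  # hypothetical number of observations

for label, ll, k in candidates:
    print(label, round(aic(ll, k), 2), round(bic(ll, k, n_obs), 2))

# The preferred model is the one with the lowest criterion value.
best_by_aic = min(candidates, key=lambda c: aic(c[1], c[2]))[0]
print("lowest AIC:", best_by_aic)
```

Both criteria reward a higher likelihood and penalize extra parameters; BIC penalizes them more heavily once ln(n) exceeds 2, so the two criteria can disagree on which candidate is best.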
2.2. Holt’s Linear Trend
The linear exponential smoothing model uses two smoothing parameters to forecast future values: the first is used for the overall (level) smoothing, the second for the trend smoothing. This approach therefore includes a prediction equation and two smoothing equations. The current level is obtained by adjusting the last smoothed value for the last period’s trend, and the trend is updated over time, expressed as the difference between the last two smoothed values.
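Holt’s forecast equation:
$$\hat{y}_{t+h|t} = l_t + h\, b_t \qquad (4)$$
where
$$l_t = \alpha y_t + (1-\alpha)(l_{t-1} + b_{t-1}) \qquad (5)$$
indicates the first equation (level equation), while
$$b_t = \beta (l_t - l_{t-1}) + (1-\beta)\, b_{t-1} \qquad (6)$$
indicates the trend equation, where:
α indicates the smoothing parameter for the level, 0 ≤ α ≤ 1, β is the smoothing parameter for the trend, 0 ≤ β ≤ 1, y_t is the observed value at time t, l_t indicates the estimated level of the time series at time t, b_t is the estimated trend of the time series at time t and h is the forecast horizon.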
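The level and trend recursions of Holt's method can be sketched in a few lines of Python. This is a minimal illustration, not the R implementation used in the study: the smoothing parameters are passed in by hand rather than estimated, and the initialisation (first observation as level, first difference as trend) is a simple assumption made here.

```python
def holt_linear(y, alpha, beta, horizon):
    """Holt's linear trend forecasts for `horizon` steps ahead.

    alpha -- smoothing parameter for the level, 0 <= alpha <= 1
    beta  -- smoothing parameter for the trend, 0 <= beta <= 1
    """
    if len(y) < 2:
        raise ValueError("need at least two observations")
    # Assumed initial states: level = first value, trend = first difference.
    level, trend = y[0], y[1] - y[0]
    for obs in y[1:]:
        prev_level = level
        # Level equation: blend the new observation with the previous projection.
        level = alpha * obs + (1 - alpha) * (prev_level + trend)
        # Trend equation: blend the latest level change with the previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    # Forecast equation: extrapolate the last level along the last trend.
    return [level + h * trend for h in range(1, horizon + 1)]


# Hypothetical, perfectly linear series: the forecasts continue the line.
print(holt_linear([10.0, 12.0, 14.0, 16.0], alpha=0.8, beta=0.2, horizon=3))
```

Because the forecast function is a straight line in h, the method can only extrapolate a locally linear trend, which is consistent with the near-linear forecasts reported for Holt's model later in the paper.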
2.3. Nonlinear Autoregressive Neural Networks (NARNN)
Artificial Neural Networks (ANNs) are forecasting models inspired by biological neural networks. They identify and model nonlinear relationships between the response variable and its predictors. A collection of neurons, grouped in input, hidden and output layers to form the artificial network, can perform a large number of complex tasks quite efficiently [8]. This makes ANNs a powerful tool, able to learn from previous examples and improve their performance, which gives them the ability to analyze new data based on previous results. ANNs are nonlinear models that map a set of input variables into a set of output variables through hidden layers of neurons. An ANN is composed of several layers:
- The first layer, known as the input layer, is the one that takes the data as input. The last layer, called the output layer, gives the results of the analysis or the solution to the problem. The data flow from the input layer to the output layer through one or more intermediate layers called hidden layers. This is where the data are analyzed and the requested outputs are computed. The nodes of the hidden layers detect the features in the pattern of the data and the nonlinear relationships between them. Then, the requested output is sent from the hidden layer to the output layer.
In designing a neural network, we must determine the following variables:
- The number of input nodes: corresponds to the number of variables of the input layer used to predict future values. In a time series forecasting problem, the number of input nodes corresponds to the number of lagged observations taken into consideration for the forecasting. A small number of input nodes that is still sufficient to reveal the features of the data is preferred, as too few or too many input nodes can affect the learning or prediction capability of the network [9].
- The number of hidden layers and hidden nodes: usually, one hidden layer is enough for most forecasting problems. Two or more hidden layers may be preferred when a network with a single hidden layer would require too many nodes, which can lead to unsatisfactory results or overfitting problems.
- The number of output nodes: depends directly on the considered problem. In a time series forecasting problem, the number of output nodes corresponds to the forecasting horizon, which can be one-step-ahead (using one output node) or multi-step-ahead. There are two ways of making multi-step-ahead forecasts: the iterative method, in which the forecast values are fed back as inputs for the next periods’ forecasts and only one output node is necessary, and the direct method, which requires several output nodes to forecast each future step directly [9].
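The iterative method for multi-step-ahead forecasting described above can be sketched as follows in Python. The function `iterative_forecast` and the stand-in predictor `mean_of_lags` are hypothetical: any trained one-step model that maps the last lagged values to a single forecast could take the predictor's place.

```python
def iterative_forecast(history, one_step_model, n_lags, horizon):
    """Multi-step forecast with a single-output model.

    Each forecast is appended to the input window and reused as a lag
    for the next step, so only one output node is needed.
    """
    if len(history) < n_lags:
        raise ValueError("history is shorter than the number of lags")
    window = list(history[-n_lags:])
    forecasts = []
    for _ in range(horizon):
        y_hat = one_step_model(window)
        forecasts.append(y_hat)
        # Slide the window: drop the oldest lag, add the new forecast.
        window = window[1:] + [y_hat]
    return forecasts


def mean_of_lags(lags):
    """Hypothetical stand-in for a trained one-step-ahead network."""
    return sum(lags) / len(lags)


print(iterative_forecast([1.0, 2.0, 3.0, 4.0], mean_of_lags, n_lags=2, horizon=3))
```

A practical consequence of this design is that forecast errors feed back into later inputs, so uncertainty tends to grow with the horizon; the direct method avoids the feedback at the cost of one output node per step.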
In our study, the NAR network was developed using the nnetar function of the R “forecast” package, developed by Hyndman, O’Hara, and Wang, which fits a neural network model to a time series [8]. A NNAR (p, k) model, where p indicates the number of non-seasonal lags used as inputs and k the number of nodes in the hidden layer, can be described as an AR process with nonlinear functions.
We chose a (28-5-1) network, with 28 lags as input nodes and 5 hidden layer nodes (Figure 4). It has the form of a feedforward three-layer ANN, where neurons have a one-way connection with the neurons of the next layers [10]. The data set was divided into training set (70%), testing set (15%), while the last 8 days’ data were used for the validation.
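To make the (28-5-1) structure concrete, the sketch below computes one forward pass of a feedforward network with 28 lagged inputs, 5 hidden nodes and 1 output node. It is an illustration in Python with randomly drawn, untrained weights; the logistic hidden activation, the linear output node and the assumption that the inputs are already scaled are choices made for this example and are not details confirmed by the paper.

```python
import math
import random


def nnar_forward(lags, w_hidden, b_hidden, w_out, b_out):
    """One-step forecast of a single-hidden-layer feedforward network."""
    hidden = []
    for weights, bias in zip(w_hidden, b_hidden):
        z = bias + sum(w * x for w, x in zip(weights, lags))
        hidden.append(1.0 / (1.0 + math.exp(-z)))  # logistic activation (assumed)
    # Linear output node (assumed): weighted sum of the hidden activations.
    return b_out + sum(w * h for w, h in zip(w_out, hidden))


def n_parameters(n_inputs, n_hidden):
    """Weights plus biases for an (n_inputs - n_hidden - 1) network."""
    return (n_inputs + 1) * n_hidden + (n_hidden + 1)


N_INPUTS, N_HIDDEN = 28, 5
rng = random.Random(0)  # fixed seed so the illustration is repeatable

# Untrained, randomly initialised parameters (illustration only).
w_hidden = [[rng.uniform(-0.1, 0.1) for _ in range(N_INPUTS)] for _ in range(N_HIDDEN)]
b_hidden = [0.0] * N_HIDDEN
w_out = [rng.uniform(-0.1, 0.1) for _ in range(N_HIDDEN)]
b_out = 0.0

lags = [rng.random() for _ in range(N_INPUTS)]  # hypothetical scaled inputs
print(n_parameters(N_INPUTS, N_HIDDEN))
print(nnar_forward(lags, w_hidden, b_hidden, w_out, b_out))
```

Counting weights and biases, a network of this shape has 151 free parameters, which gives a sense of the model's flexibility relative to the handful of coefficients in an ARIMA (2,1,2) model.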
2.4. TBATS Model
The last model is TBATS (Trigonometric Exponential smoothing state-space model with Box-Cox transformation, ARMA errors, Trend and Seasonal components), which combines Fourier terms with an exponential smoothing state-space model and a Box-Cox transformation, in a completely automated manner. The unit of time used in modeling was the day. The forecasting performance of all these models was evaluated using the mean absolute percentage error (MAPE), while the model fits were evaluated using the AIC (Akaike Information Criterion), reported in Table 1.
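One ingredient of TBATS, the Box-Cox transformation, is simple enough to show on its own. The Python sketch below implements the standard one-parameter Box-Cox transform and its inverse; the function names and the value of the parameter `lam` are illustrative, since in TBATS the parameter is chosen automatically during fitting.

```python
import math


def box_cox(y, lam):
    """Box-Cox transform of a positive value y with parameter lam."""
    if y <= 0:
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def inv_box_cox(z, lam):
    """Inverse Box-Cox transform, mapping forecasts back to the original scale."""
    if lam == 0:
        return math.exp(z)
    return (lam * z + 1.0) ** (1.0 / lam)


# Hypothetical daily counts spanning several orders of magnitude.
counts = [50.0, 500.0, 5000.0, 50000.0]
lam = 0.0  # lam = 0 corresponds to a log transform
transformed = [box_cox(c, lam) for c in counts]
print([round(z, 3) for z in transformed])
print([round(inv_box_cox(z, lam), 1) for z in transformed])
```

The transform stabilizes the variance of series whose fluctuations grow with their level, which is typical of epidemic case counts; forecasts are produced on the transformed scale and then mapped back with the inverse.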
3. Results
Selection and accuracy measures for the forecasting models are reported in Table 1. The RMSE, MAE, MPE, MAPE, ME and MASE accuracy indicators were used to measure the performance of the models built for the time series of new COVID-19 cases, computed on the training data. Both the graphs and the values in Table 1 show that the NARNN model gave more accurate forecasts than the ARIMA model and the other linear forecasting models. NARNN improved the forecasting accuracy by 75.7% compared with ARIMA according to MAPE, and by 38.5% according to RMSE.
Table 1. Accuracy of training data.
The NARNN model gives better results in almost all the considered indicators with a considerable difference from the indicators of the other models. It has improved the forecasting performance, according to the ME indicator, by 98.6% compared with Holt’s model and by 89.5% compared with TBATS. According to the MAE indicator, the NARNN model has improved by 42.6% compared with the ARIMA model, 54% compared with Holt’s model and 38.4% compared with TBATS.
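The paper does not state how the improvement percentages are computed. A common convention, assumed here, is the relative reduction of an error indicator with respect to a baseline model. The Python sketch below uses hypothetical error values, not those of Table 1, and the function name `improvement_pct` is illustrative.

```python
def improvement_pct(baseline_error, new_error):
    """Relative reduction (in percent) of an error indicator.

    Absolute values are used so that signed indicators such as ME,
    which can be negative, are compared by magnitude (an assumption).
    """
    base = abs(baseline_error)
    if base == 0:
        raise ValueError("baseline error must be non-zero")
    return 100.0 * (base - abs(new_error)) / base


# Hypothetical indicator values, for illustration only.
print(improvement_pct(20.0, 5.0))    # error cut from 20 to 5
print(improvement_pct(-40.0, 10.0))  # signed baseline, compared by magnitude
```

Under this convention, a positive result means the new model has the smaller error, and a negative result means it performs worse than the baseline.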
We chose the best forecasting model according to the MAPE value (Mean Absolute Percentage Error), as it is recommended as an accuracy comparing unit when using different methods on a time series, considering the most accurate model the one with the lowest MAPE value, given the considered period. NNAR model has the minimal MAPE for the considered period (14.178%).
In Table 2, we report the MAPE for the last 8 days (testing data) for cumulative data for COVID-19. We can observe that the NNAR model is again the best one for forecasting new COVID-19 cases in Italy, which confirms our choice of the best model for our time series.
We performed the forecasting for confirmed COVID-19 cases in Italy using the above models. We produced a 20-day-ahead forecast (until 30 January 2022) and compared the forecasts with the testing data for the last 8 days (02 January 2022-10 January 2022), that is, with the actual COVID-19 data. The MAPE values were calculated from the absolute percentage differences between actual and forecast values, and are reported for each forecasting model in Table 3. Based on our analysis, we concluded that the predictions of the models were close to the real data. In particular, the NNAR model gave more accurate predictions, as its MAPE values were lower than those of the other models. Its MAPE values decreased over the last 6 days of the testing period, from about 13% to 1%, while the other predictive models showed higher MAPE values. ARIMA had a worse predictive performance for the first 4 days and the last 2 days, while TBATS was the worst forecasting model when comparing the MAPE values over the 8 days.
A visual representation of the forecasts is shown in Figures 5-8.
Figure 5 presents our time series and the ARIMA forecasting model. For the fitting of the ARIMA model, the auto.arima function was used in addition to an iterative function constructed in R. This resulted in an ARIMA (2, 1, 2) model as the best forecasting model for our time series. We can observe that the ARIMA model shows a steady trend for the next 20 days, with daily new cases between 11,000 and 14,000. According to ARIMA, there is a steadily decreasing rate of new cases during the last two weeks of January 2022.
As can be seen from the graph, the predicted values follow the trend and the seasonality of our time series testing data. The confidence interval (marked in blue in Figure 5) indicates the range within which the forecasts can be expected to vary. If we compare the values of the eight days used as a test set, we notice significant differences between the values predicted by the ARIMA model and the values observed in the collected data. This is emphasized by the MAPE for the eight-day test, which for the ARIMA model reaches 18.058%. Figure 6 shows a graphical representation of the residual tests for the ARIMA (2, 1, 2) model. From the error curve, it can be seen that the ARIMA model selected through the auto.arima function has approximately normal errors with a relatively low autocorrelation between them.
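The residual check mentioned above relies on the sample autocorrelation of the model errors. The following Python sketch computes the sample ACF and flags lags that fall outside the approximate 95% white-noise bounds of ±1.96/√n. The function names and the residual values are hypothetical, and the bound is the usual large-sample approximation rather than a test taken from the paper.

```python
import math


def acf(x, max_lag):
    """Sample autocorrelation of x for lags 0..max_lag."""
    n = len(x)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the series length")
    mean = sum(x) / n
    dev = [v - mean for v in x]
    denom = sum(d * d for d in dev)
    if denom == 0:
        raise ValueError("series has zero variance")
    return [sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
            for k in range(max_lag + 1)]


def significant_lags(x, max_lag):
    """Lags whose autocorrelation exceeds the approximate 95% bounds."""
    bound = 1.96 / math.sqrt(len(x))
    r = acf(x, max_lag)
    return [k for k in range(1, max_lag + 1) if abs(r[k]) > bound]


# Hypothetical residuals that alternate in sign (strong lag-1 dependence).
residuals = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
print([round(v, 3) for v in acf(residuals, 2)])
print(significant_lags(residuals, 2))
```

For a well-specified model, few or no lags should be flagged; flagged lags suggest structure that the model has not captured and that a revised specification might absorb.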
Table 2. MAPE (%) for daily COVID-19 infection cases in Italy for testing last 8 days.
Table 3. MAPE (%) for 8 days’ accuracy of forecasting models in Italy.
Figure 5. Daily COVID-19 infections prediction with ARIMA model.
Figure 6. Residual test for ARIMA (2, 1, 2).
For the construction of the NARNN model, the data were divided into two sets: a training set and a testing set. The training set was used to create the model, while the test set was used to evaluate it [11]. The network structure was chosen with reference to the results of Zhang et al. [9], who showed, through simulation, that the best network structure corresponds to one hidden layer with a maximum of two neurons. In our case, however, the network with 5 hidden neurons performed better than the ones with 1, 2, 3 and 4 hidden neurons in terms of the difference between actual and forecast data, so we chose 5 hidden neurons, as this network had a lower RMSE than the alternatives. The number of input neurons was chosen using the nnetar and accuracy functions, which gave as output a neural network with 28 input neurons. The nnetar function selected the NARNN (28-5-1) model as the best model.
Figure 7. Daily COVID-19 infections prediction with NARNN model.
Figure 7 presents the forecasting results of the NARNN model for the following 20 days for new confirmed COVID-19 cases in Italy. The NARNN model values follow the trend of the time series very well, thanks to the training and learning process, which enables the model to better capture the features of the time series. The NARNN hybrid model fits our time series quite well. All components are well represented and the difference between the predicted values and the observed values tends to zero, as the nonlinear component of the time series is taken into consideration by the nonlinear model. In the NARNN model:
- Once trained, neural networks continued to perform quite well when they had to work with data they were not previously familiar with;
- The network itself decides on the importance of the variables;
- The network keeps learning continuously, with no need to retrain it once we want to introduce new time series data.
Through the graphic representation and the performance indicators (MAPE, RMSE, MAE) we observed that NARNN model performance was better in comparison with the ARIMA model for predicting COVID-19 new cases in Italy. Its ability to learn, work with multiple parallel inputs, as well as nonlinearity, plasticity, tolerance to fuzzy data are some of the characteristics that make neural networks efficient in finding the most suitable model for time series forecasting [12].
Figure 7 shows the trend of the number of new cases predicted by the NARNN model, constructed using 28 lagged values of the time series as inputs and 5 nodes in the hidden layer of the artificial neural network. The results show that this model’s predictions of new confirmed COVID-19 cases are closer to the observed values of the time series. This is also emphasized by the MAPE for the test set, equal to 4%, much lower than the MAPE values of the other forecasting models. According to the NARNN (28-5-1) model, there will be an exponential increase in the number of new COVID-19 infections by the end of January, in contrast to the steady trend projected by the ARIMA model.
Figure 8 shows the COVID-19 predictions obtained through Holt’s and TBATS models [13]. Both performed poorly compared to the previous two models (ARIMA and NARNN) over the 20-day forecasting period, as they have higher MAPE, RMSE, ME and MAE values. They show a relatively linear trend for the next 20 days, with a slightly increasing tendency for Holt’s linear model and a slightly decreasing tendency for the TBATS model, accompanied by relatively wide confidence intervals that correspond to a higher degree of uncertainty in the forecasts.
Figure 8. Daily COVID-19 infections prediction with Holt’s and TBATS model.
4. Conclusions and Discussions
In this article, we have evaluated four different time series forecasting models for predicting daily confirmed new COVID-19 cases in Italy. Our findings highlighted the differences in forecasting accuracy and performance between the models. Using multiple models lets us test and compare their forecasting accuracy and make an optimal selection. For our time series, the NARNN model was preferred over the other, linear forecasting models. It was chosen based on its MAPE value, which was the lowest among all the forecasting models (14.178% for the considered period). In addition, NARNN improved the forecasting accuracy by 75.7% compared with ARIMA according to MAPE, and by 38.5% according to RMSE. The NARNN model gives better results in almost all the considered indicators, with a considerable difference from the indicators of the other models. It improved the forecasting performance, according to the ME indicator, by 98.6% compared with Holt’s model and by 89.5% compared with TBATS. According to the MAE indicator, the NARNN model improved by 42.6% compared with the ARIMA model, 54% compared with Holt’s model and 38.4% compared with TBATS. The NARNN (28-5-1) model predicted an exponential increase in the number of new COVID-19 infections by the end of January. The results are valid for a short period of time, because in the long run they can be influenced by other factors such as vaccination, immunization of the population, measures taken by government authorities to limit the spread of the infection, etc.
The models considered above can also be applied to new data as they become available, in order to forecast future new confirmed COVID-19 cases and improve forecasting accuracy. It would also be interesting to include other patient parameters as possible inputs for the NARNN model, since additional data may improve forecasting performance. It would be very helpful to consider time series of deaths and recoveries too, in addition to the new confirmed COVID-19 cases for Italy. Predictions of possible future new cases would be very helpful for allocating medical resources, handling the spread of the pandemic and improving the preparedness of health care systems. Decision-makers could find them very helpful for future projections regarding interventions to reduce and control the spread of the infection.