Time Series Analysis and Prediction of COVID-19 Pandemic Using Dynamic Harmonic Regression Models

Abstract

The rapid spread of the COVID-19 virus and its variants, especially in metropolitan areas around the world, has become a major public health concern. Tracking the trend of the COVID-19 pandemic through statistical modelling is an urgent challenge in the United States for which there are few solutions. In this paper, we demonstrate how Fourier terms that capture seasonality can be combined with ARIMA errors that capture the other dynamics in the data. We analyze a national-level COVID-19 dataset of 156 weekly observations from 2020 to 2023 using a Dynamic Harmonic Regression model, including simulation analysis and accuracy improvement. Most importantly, we provide new advanced pathways which may serve as targets for developing new solutions and approaches.

Share and Cite:

Wang, L. (2023) Time Series Analysis and Prediction of COVID-19 Pandemic Using Dynamic Harmonic Regression Models. Open Journal of Statistics, 13, 222-232. doi: 10.4236/ojs.2023.132012.

1. Introduction

The COVID-19 pandemic has had a tremendous impact on the world for three years and has spread to more than 200 countries, with more than 36 million confirmed cases as of October 10, 2020. Well-respected organizations such as Johns Hopkins University, the Centers for Disease Control and Prevention, the World Health Organization and the United States Census Bureau are involved in studying and tracking the COVID-19 pandemic [1].

To respond to this urgent public health concern, we used a time series of 156 weekly observations to evaluate the seasonal patterns of COVID-19 cases and mortality in the United States, with the objective of determining the trend of the COVID-19 pandemic. In addition, the implementation in R and the simulation analysis can improve the forecasting accuracy.

Smart data analytics, which is my prospective research interest in Data Science, is giving professionals and the public more insight than ever before into the factors that drive outcomes. From assessing risks to analyzing evolving trends, we are now able to anticipate outcomes more accurately thanks to the abundance of information available to academics and professionals. Our analysis can help in understanding the trends of the disease outbreak and provide suggestions and guidance for affected countries.

In epidemiology, ARIMA models can be used to forecast future trends in disease incidence or prevalence, as well as to identify patterns in the data that may be related to seasonal or other cyclical factors. For example, an epidemiologist might use ARIMA models to forecast the number of new cases of a particular disease over the next several months or years, based on historical data on the disease incidence.

Because of the complex nature of virus transformation, traditional epidemic models such as regression and ARIMA methods have been applied to predict its spread. In particular, Dynamic Harmonic Regression (DHR) approaches have been used to predict the spreading trends of COVID-19, such as new cases and deaths. We reviewed studies that implemented these strategies [2].

Dynamic Harmonic Regression (DHR) is a nonstationary time-series analysis approach used to identify trend, seasonal, cyclical and irregular components within a state space framework. Many researchers have studied this kind of forecasting method. Dr. Kumar and Dr. Suan (2020) used an ARIMA model with daily data on cumulative COVID-19 cases worldwide and in the 10 most affected countries to forecast the impact of the virus in those countries and worldwide [3]. Dr. Fuad Ahmed Chyon Md and Dr. Nazmul Hasan Suman employed an ARIMA model to analyze the temporal dynamics of the worldwide spread of COVID-19 in the time window from January 22, 2020 to April 7, 2020 [1]. Dr. Tandan, Dr. Acharya, Dr. Pokharel and Dr. Timilsina aimed to discover symptom patterns and overall symptom rules, including rules disaggregated by age, sex, chronic condition, and mortality status, among COVID-19 patients [4].

However, dynamic harmonic regression, a statistical modeling technique for time series that contain periodic patterns, is still not fully understood. While there has been some research on this topic, gaps remain in areas such as model selection, outlier detection, estimation techniques, and uncertainty quantification. More research is therefore needed to address these gaps and further advance our understanding of this technique.

2. Methods

2.1. A Short Review of COVID-19 Situations

● In early December 2019, an outbreak of coronavirus disease 2019 (COVID-19), caused by the novel severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), occurred in Wuhan City, Hubei Province, China.

● On January 30, 2020 the World Health Organization declared the outbreak as a Public Health Emergency of International Concern (PHEIC).

● As of February 14, 2020, 49,053 laboratory-confirmed cases and 1,381 deaths had been reported globally.

● In March 2020, the Journal of the American Medical Association Ophthalmology reported that COVID-19 can be transmitted through the eye. One of the first warnings of the emergence of the SARS-CoV-2 virus came late in 2019 from a Chinese ophthalmologist, Li Wenliang, MD, who treated patients in Wuhan and later died at age 34 from COVID-19.

● On December 18, 2020, after demonstrating 94 percent efficacy, the NIH-Moderna vaccine was authorized by the U.S. Food and Drug Administration (FDA) for emergency use. Just days earlier, the similar Pfizer/BioNTech vaccine had become the first COVID-19 vaccine to be authorized for use in the United States [5] .

● In the late summer and fall of 2021, the Delta variant was the dominant strain of COVID-19 in the U.S.

● On 26 November 2021, WHO designated the variant B.1.1.529 a variant of concern, named Omicron.

● Director of the National Institute of Allergy and Infectious Diseases Anthony Fauci gave an update on the Omicron COVID-19 variant during the daily press briefing at the White House on December 1, 2021 in Washington, DC. He said that we will likely learn to live with COVID-19 like we do with the common cold and flu [2] .

● Globally, as of 6:32 pm CET, 27 January 2023, there have been 752,517,552 confirmed cases of COVID-19, including 6,804,491 deaths, reported to WHO. As of 24 January 2023, a total of 13,156,047,747 vaccine doses have been administered.

2.2. Data Collection

The data for the ongoing COVID-19 outbreak in the United States were collected from the Centers for Disease Control and Prevention. The columns of this dataset include the total number of weekly cases, weekly deaths and weekly test volume for COVID-19 patients, aggregated over all states, on a weekly basis from 29 January 2020 to 18 January 2023. Total cases per 100,000 population allow for comparisons between areas with different population sizes.

Weekly data, such as weekly stock prices, employment numbers, or other economic indicators, is difficult to work with because the seasonal period (the number of weeks in a year) is both large and non-integer. The average number of weeks in a year is 52.18. Most of the methods we have considered require the seasonal period to be an integer. Even if we approximate it by 52, most of the methods will not handle such a large seasonal period efficiently.

So far, many publications and researchers have considered relatively simple seasonal patterns, such as quarterly and monthly data. However, higher frequency time series often exhibit more complicated seasonal patterns. For example, daily data may have a weekly pattern as well as an annual pattern. Hourly data usually has three types of seasonality: a daily pattern, a weekly pattern, and an annual pattern. Even weekly data can be challenging to forecast as it typically has an annual pattern with seasonal period of 365.25/7 ≈ 52.179 on average.
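The non-integer seasonal period can be checked with a few lines of arithmetic. The following is only an illustrative sketch: the function name `seasonal_drift_weeks` and the choice of a three-year horizon (roughly the span of the 156-week dataset) are assumptions made for this example, not part of the original analysis.

```python
# Average number of weeks per year, counting one leap day every four years.
DAYS_PER_YEAR = 365.25
DAYS_PER_WEEK = 7

weeks_per_year = DAYS_PER_YEAR / DAYS_PER_WEEK


def seasonal_drift_weeks(years, assumed_period=52):
    """Weeks by which a fixed integer period falls behind the true annual cycle.

    `assumed_period` is the integer approximation (hypothetical default: 52).
    """
    return years * (weeks_per_year - assumed_period)


# Average seasonal period of weekly data.
print(round(weeks_per_year, 3))

# Accumulated misalignment after three years when the period is rounded to 52.
print(round(seasonal_drift_weeks(3), 3))
```

Rounding the period to 52 shifts the assumed seasonal cycle by a little more than half a week over three years, which is one reason a method that accepts a real-valued period is attractive here.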

Exponential smoothing models did not seem applicable, and ARIMA modelling works poorly with large integer seasonal periods (e.g. days or weeks rather than months or quarters) and also struggles with a non-integer seasonal period (i.e. 52 weeks in some years, 53 weeks in others).

3. Advanced Forecasting Model: Dynamic Harmonic Regression (DHR)

There are several methods for incorporating seasonality into a forecasting model. One common approach is to use time-series models such as SARIMA (Seasonal Autoregressive Integrated Moving Average) or Seasonal Exponential Smoothing. These models can capture the seasonal patterns in the data and adjust the forecast accordingly.

Time series processes are usually assumed to be stationary, but many applied time series, particularly those arising in economics and business, are non-stationary. Relative to the class of covariance-stationary processes, non-stationarity can arise in many different ways: a series may have a non-constant mean $\mu_t$, time-varying second moments such as a non-constant variance $\sigma_t^2$, or both of these properties [6].

When applied to COVID-19 data, taking the natural logarithm of the number of cases or deaths can help stabilize the variance of the data and make the trend more apparent, especially in the early stages of the pandemic when the growth was exponential. This can also help identify if there are any underlying patterns or seasonality in the data. After applying the log transformation, the resulting data will have a more linear trend and a constant variance, which makes it easier to model using standard statistical techniques such as linear regression or ARIMA models [7] .
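The effect of the log transformation on exponential growth can be illustrated with a small sketch. The series below is hypothetical (a starting value of 100 and a 30% weekly growth rate are assumptions chosen for illustration, not CDC data).

```python
import math

# Hypothetical weekly counts growing by 30% per week (illustrative only).
GROWTH_RATE = 1.3
cases = [100 * GROWTH_RATE ** week for week in range(8)]


def first_differences(series):
    """Week-over-week changes: series[t] - series[t-1]."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# On the raw scale the weekly increments keep growing with the level.
raw_diffs = first_differences(cases)

# On the log scale every increment equals log(GROWTH_RATE), i.e. the trend is linear.
log_cases = [math.log(c) for c in cases]
log_diffs = first_differences(log_cases)

print([round(d, 1) for d in raw_diffs])
print([round(d, 4) for d in log_diffs])
```

On the raw scale the size of the weekly changes rises with the level of the series, whereas on the log scale the changes are constant, which is the sense in which the transformation yields a more linear trend and more stable variation.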

Many models used in practice are of the simple ARIMA type, which has a long history and was formalized by Box and Jenkins [8]. ARIMA stands for Autoregressive Integrated Moving Average, where the "I" stands for integration. In an ARIMA(p, d, q) model for an observed series, p is the order of autoregression, d is the order of differencing, and q is the order of the moving average [9].

Since we are also taking the seasonal pattern into account, even if it is weak, we should also examine the seasonal ARIMA process. This model is built by adding seasonal terms to the non-seasonal ARIMA model mentioned above. One shorthand notation for the model is

$$\mathrm{ARIMA}(p, d, q)(P, D, Q)_m \tag{3.1}$$

● {(p, d, q)}: non-seasonal part.

● {(P, D, Q)m}: seasonal part.

P = seasonal AR order, D = seasonal differencing, Q = seasonal MA order.

m = seasonal period, i.e. the number of observations per year [4].

The seasonal part has terms similar to the non-seasonal components, but with backshifts of the seasonal period. For instance, consider an ARIMA(p, d, q)(P, D, Q)m model for weekly data (m = 52). Without differencing operations, this process can be formally written as:

$$\Phi(B^m)\,\phi(B)\,(x_t - \mu) = \Theta(B^m)\,\theta(B)\,w_t \tag{3.2}$$

A seasonal ARIMA model incorporates both non-seasonal and seasonal factors in a multiplicative fashion.

Time series models such as ARIMA and exponential smoothing allow for the inclusion of information from past observations of a series, but not for the inclusion of other information that may also be relevant. For example, the effects of holidays, competitor activity, changes in the law, the wider economy, or other external variables may explain some of the historical variation and may lead to more accurate forecasts. Regression models, on the other hand, allow for the inclusion of a lot of relevant information from predictor variables but do not allow for the subtle time series dynamics that can be handled with ARIMA models.

An alternative approach uses a dynamic harmonic regression model. We therefore extended ARIMA models to allow other information to be included in the models. First, we considered the regression model

$$y_t = T_t + C_t + S_t + \epsilon_t \tag{3.3}$$

The system is composed of four components: trend ($T_t$), a sustained cyclical component ($C_t$) with a period different from the seasonality, a seasonal component ($S_t$) and white noise ($\epsilon_t$) [6].

The measured values of y are the output (observations) series of a system of stochastic state space equations, which can then be broken down to allow for estimation of the four components.

So for such time series, we prefer a harmonic regression approach where the seasonal pattern is modelled using Fourier terms with short-term time series dynamics handled by an ARIMA error.

In the following example, the number of Fourier terms was selected by minimising the AICc. The order of the ARIMA model is also selected by minimising the AICc although that is done within the auto.arima() function in R.

Dynamic harmonic regression is based on the principle that a combination of sine and cosine functions can approximate any periodic function.

$$y_t = bt + \sum_{j=1}^{K}\left[\alpha_j \sin\left(\frac{2\pi j t}{m}\right) + \beta_j \cos\left(\frac{2\pi j t}{m}\right)\right] + \eta_t \tag{3.4}$$

where m is the seasonal period, αj and βj are regression coefficients, and ηt is modeled as a non-seasonal ARIMA process.
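The harmonic regressors in the model above can be generated directly from their definition. The sketch below is a minimal illustration in standard-library Python; the function names `fourier_terms` and `fourier_design` are hypothetical, and the values K = 18 and 156 observations are used only as example inputs.

```python
import math


def fourier_terms(t, K, m):
    """Return the 2*K harmonic regressors at time t for seasonal period m.

    The order is [sin_1, cos_1, sin_2, cos_2, ..., sin_K, cos_K], where
    sin_j = sin(2*pi*j*t/m) and cos_j = cos(2*pi*j*t/m).
    """
    row = []
    for j in range(1, K + 1):
        angle = 2 * math.pi * j * t / m
        row.append(math.sin(angle))
        row.append(math.cos(angle))
    return row


def fourier_design(n_obs, K, m):
    """Stack the regressors for t = 1, ..., n_obs into a list of rows."""
    return [fourier_terms(t, K, m) for t in range(1, n_obs + 1)]


# Example: three years of weekly data with a real-valued annual period.
X = fourier_design(n_obs=156, K=18, m=365.25 / 7)
print(len(X), len(X[0]))
```

Because the period m enters only through the angle, it does not need to be an integer. In R, columns of this kind would typically be passed to `auto.arima()` through its `xreg` argument, so that the ARIMA part only has to describe the remaining short-term dynamics.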

The fitted model has 18 pairs of Fourier terms and can be written as

$$y_t = bt + \sum_{j=1}^{18}\left[\alpha_j \sin\left(\frac{2\pi j t}{52.18}\right) + \beta_j \cos\left(\frac{2\pi j t}{52.18}\right)\right] + \eta_t \tag{3.5}$$

where ηt is an ARIMA(4, 1, 1) process. Because ηt is non-stationary, the model is actually estimated on the differences of the variables on both sides of this equation. There are 36 parameters to capture the seasonality, which is rather a lot but apparently required according to the AICc selection. The total number of degrees of freedom is 42 (the other six coming from the 4 AR parameters, 1 MA parameter, and the drift parameter) [10].
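Two points in this paragraph can be checked numerically: differencing turns a linear drift into a constant, and the parameter count adds up to 42. The sketch below assumes that the first term of the model is a linear drift b·t, as the reference to a single drift parameter suggests; the helper names and the value b = 2.5 are hypothetical.

```python
def difference(series):
    """First difference of a series: y[t] - y[t-1]."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


def count_parameters(K, p, q, drift=True):
    """Two coefficients per Fourier pair, plus AR, MA and optional drift terms."""
    return 2 * K + p + q + (1 if drift else 0)


# Hypothetical drift-only series y_t = b * t with b = 2.5 (illustrative value).
b = 2.5
trend = [b * t for t in range(1, 11)]

# After differencing, the linear drift becomes the constant b.
differenced = difference(trend)
print(differenced)

# 18 Fourier pairs with ARIMA(4, 1, 1) errors and a drift term.
total = count_parameters(K=18, p=4, q=1)
print(total)
```

The differencing order d = 1 does not add a parameter, which is why only the AR, MA and drift terms are counted on top of the 36 Fourier coefficients.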

The advantages of this approach are:

● Flexibility: a DHR model can be used to model data with various levels of complexity, including data with multiple seasonal patterns, irregular patterns, and non-stationary patterns. It allows seasonality of any length, and the short-term dynamics are easily handled with a simple ARIMA error. In particular, for data with more than one seasonal period, Fourier terms of different frequencies can be included.

● Smoothness: the smoothness of the seasonal pattern can be controlled by K, the number of Fourier sine and cosine pairs; the seasonal pattern is smoother for smaller values of K.

The only real disadvantage (compared to a seasonal ARIMA model) is that the seasonality is assumed to be fixed—the seasonal pattern is not allowed to change over time. But in practice, seasonality is usually remarkably constant so this is not a big disadvantage except for long time series.

4. Main Results

4.1. Forecasting Accuracy

Time series analysis and forecasting have been an active research area over the last five decades. Various kinds of forecasting models have been developed, and researchers have relied on statistical techniques to predict time series data. The accuracy of time series forecasting is fundamental to many decision processes, and hence research on improving the performance of forecasting models has never stopped. However, time series datasets are often nonlinear and irregular [11]. The interdisciplinary approach afforded by the study of Data Science critically analyzes the relevant disciplinary insights and attempts to produce a more comprehensive understanding or a holistic solution.

The author measured forecasting performance by the mean absolute error (MAE), root mean square error (RMSE), root relative squared error (RSE), and mean absolute percentage error (MAPE). The MAE criterion is most appropriate when the cost of a forecast error rises proportionally with the absolute size of the error. With RMSE, the cost of the error rises as the square of the error, so large errors can be weighted far more than proportionally. Whether MAE or RMSE is most appropriate varies according to circumstances and individual institutions, and in any case we will find that the several measures pick the same model in all but a few instances [12].

These measures were calculated using the following equations, where Pt is the predicted value at time t, Zt is the observed value at time t and N is the number of predictions.

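$$\mathrm{ME} = \frac{\sum_{t=1}^{N} (P_t - Z_t)}{N} \tag{4.1}$$

$$\mathrm{MAE} = \frac{1}{N} \sum_{t=1}^{N} \left| P_t - Z_t \right| \tag{4.2}$$

$$\mathrm{MAPE} = \frac{1}{N} \sum_{t=1}^{N} \left| \frac{P_t - Z_t}{Z_t} \right| \tag{4.3}$$

$$\mathrm{MPE} = \frac{1}{N} \sum_{t=1}^{N} \left( \frac{P_t - Z_t}{Z_t} \right) \times 100\% \tag{4.4}$$

$$\mathrm{RMSE} = \sqrt{\mathrm{MSE}} = \sqrt{\frac{\sum_{t=1}^{N} (P_t - Z_t)^2}{N}} \tag{4.5}$$

$$\mathrm{AIC} = -2 \ln(L) + 2k \tag{4.6}$$

$$\mathrm{AICc} = \mathrm{AIC} + \frac{2k(k+1)}{n - k - 1} \tag{4.7}$$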

where k is the number of parameters and n the number of samples.
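The accuracy measures and information criteria above translate directly into code. The following is a minimal sketch in standard-library Python that follows the definitions as stated in this section (MAPE as a fraction, MPE as a percentage); the function names, the three-point example series and the log-likelihood value of -100 are hypothetical and chosen only for illustration.

```python
import math


def _errors(predicted, observed):
    """Forecast errors P_t - Z_t."""
    return [p - z for p, z in zip(predicted, observed)]


def me(predicted, observed):
    errors = _errors(predicted, observed)
    return sum(errors) / len(errors)


def mae(predicted, observed):
    errors = _errors(predicted, observed)
    return sum(abs(e) for e in errors) / len(errors)


def mape(predicted, observed):
    """Mean absolute percentage error, expressed as a fraction."""
    ratios = [abs((p - z) / z) for p, z in zip(predicted, observed)]
    return sum(ratios) / len(ratios)


def mpe(predicted, observed):
    """Mean percentage error, expressed in percent."""
    ratios = [(p - z) / z for p, z in zip(predicted, observed)]
    return sum(ratios) / len(ratios) * 100


def rmse(predicted, observed):
    errors = _errors(predicted, observed)
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def aic(log_likelihood, k):
    return -2 * log_likelihood + 2 * k


def aicc(log_likelihood, k, n):
    """Small-sample corrected AIC; requires n > k + 1."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    return aic(log_likelihood, k) + 2 * k * (k + 1) / (n - k - 1)


# Hypothetical forecasts and observations (not taken from the paper's tables).
predicted = [110.0, 90.0, 105.0]
observed = [100.0, 100.0, 100.0]

print(me(predicted, observed), mae(predicted, observed), rmse(predicted, observed))
print(mape(predicted, observed), mpe(predicted, observed))
print(aic(-100.0, k=42), aicc(-100.0, k=42, n=156))
```

In this example the positive and negative errors partly cancel in ME and MPE but not in MAE, MAPE or RMSE, which is why the signed measures are read as indicators of bias and the unsigned ones as indicators of overall accuracy.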

It is important to note that these information criteria tend not to be good guides to selecting the appropriate order of differencing (d) of a model, but only for selecting the values of p and q. This is because the differencing changes the data on which the likelihood is computed, making the AIC values between models with different orders of differencing not comparable [10] .

4.2. Conclusion

In this section, the focus is on statistical methodology and forecasting results for time series datasets on the COVID-19 pandemic. Table 1 compares all the candidate forecasting models. A given forecasting model may have a systematic positive or negative bias and do a poor job of tracking the actual mean of value changes, and measures such as RMSE and MAE could well miss this defect. We therefore evaluated the models with several different criteria. The Log Transformation DHR performs best among the models: it minimizes the RMSE and MAE and shows relatively better forecasting accuracy. In Figure 1, the forecasting results show the tendency of weekly cases and weekly deaths for the following months from our selected models.

Table 1. Comparison table for forecasting model.

Collectively, these models can identify learning parameters that explain differences in COVID-19 spread across regions or populations, combine numerous intervention methods, and implement what-if scenarios by integrating data from diseases whose trends are analogous to those of the COVID-19 pandemic [9].

As the forecasts in Table 2 and Table 3 show, the numbers of weekly cases and weekly deaths are projected to continue to increase in the following weeks.

Table 2. Forecasting results for weekly cases from regression with ARIMA(3, 1, 1) errors.

Figure 1. Forecasting results.

Table 3. Forecasting results for weekly deaths with regression with ARIMA(4, 0, 1) errors.

The forecasts show a noticeable increase in the near future. Weekly cases, however, are projected to decrease at the end of May 2023, while the forecasts for weekly deaths show uncertainty and fluctuations until the end of 2023. The log-transformed DHR shows the smallest RMSE, which indicates that it is a better model than both ARIMA(p, d, q)(P, D, Q)m and the untransformed dynamic harmonic regression with ARIMA errors. These results confirm that the transformation improves accuracy when the time series has an unstable variance. They also show that, when there are long seasonal periods, a dynamic regression with Fourier terms is often better than the other models we considered on the raw datasets.

The trend analysis shows an unstable situation in infected cases and weekly deaths, and the prediction study shows an increase in the expected active cases and deaths nationally. However, time series datasets are often nonlinear and irregular. These data have been used by researchers, policymakers, and others to better understand and respond to the effects of the pandemic.

The objective in providing these statistical techniques is to enable the government and the public to make informed decisions regarding COVID-19. Most importantly, we learn how to add value to public health and apply these skills in a real-world environment. These models are essential for informing public health decision-making and resource allocation, as well as for predicting future trends in the spread of the disease.

Acknowledgements

The author would like to thank Dr. Olusegun Michael Otunuga from the College of Science and Math and Dr. Hinton Romana from the Writing Center at Augusta University for their comments and constructive suggestions. Several stimulating discussions and comments allowed me to develop original ideas and improve my paper.

Conflicts of Interest

The author declares no conflicts of interest.

References

[1] Saud, S., Jaini, G., Aishita, J., Sunny, A., Sagar, J. and Mani, R.E. (2021) Analysis and Prediction of COVID-19 Using Regression Models and Time Series Forecasting. 11th International Conference on Cloud Computing, Data Science & Engineering, Noida, 28-29 January 2021.
[2] Iman, R., Fang, C. and Amir, H.G. (2021) A Review on COVID-19 Forecasting Models. Neural Computing and Applications, 1-11.
[3] Naresh, K. and Seba, S. (2020) COVID-19 Pandemic Prediction Using Time Series Forecasting Models. The 11th ICCCNT 2020 Conference.
[4] Chyon, F.A., et al. (2022) Time Series Analysis and Predicting COVID-19 Affected Patients by ARIMA Model Using Machine Learning. Journal of Virological Methods, 301, 114433.
https://doi.org/10.1016/j.jviromet.2021.114433
[5] BBC (2020) Coronavirus: Sharp Increase in Deaths and Cases in Hubei.
https://www.bbc.co.uk/news/world-asia-china-51482994
[6] David, A.M. and Wlodzimierz, T. (2019) Dynamic Harmonic Regression and Irregular Sampling; Avoiding Pre-Processing and Minimising Modelling Assumptions. Environmental Modelling & Software, 121, 104503.
https://doi.org/10.1016/j.envsoft.2019.104503
[7] Faraway, J.J. (2014) Linear Models with R. CRC Press, Taylor and Francis Group, Boca Raton.
[8] Box, G. and Jenkins, G. (1970) Time Series Analysis: Forecasting and Control. Holden-Day, San Francisco.
[9] Ratnadip, A. (2013) An Introductory Study on Time Series Modeling and Forecasting. LAP Lambert Academic Publishing, Germany.
[10] Hyndman, R.J. and Athanasopoulos, G. (2014) Forecasting: Principles and Practice. 2nd Edition, George Athanasopoulos Monash University, Australia.
[11] Fotios, P. and Spyros, M. (2020) Forecasting the Novel Coronavirus COVID-19. PLoS ONE, 15, e0231236.
https://doi.org/10.1371/journal.pone.0231236
[12] Brockwell, P.J. and Davis, R.A. (2002) Introduction to Time Series and Forecasting. 2nd Edition, Springer, New York.
https://doi.org/10.1007/b97391

Copyright © 2024 by authors and Scientific Research Publishing Inc.

This work and the related PDF file are licensed under a Creative Commons Attribution 4.0 International License.