Modelling and predicting low count child asthma hospital readmissions using General Additive Models ()
1. INTRODUCTION
Asthma is the most common long-term medical condition in children [1]. Children have higher rates of asthma hospitalization than adults. In Australia, children aged 0 - 14 years account for 58% of total asthma admissions [2]. Asthma hospital readmissions attract disproportionate amounts of resources from healthcare systems worldwide [3,4] , including in Australia [5]. Asthma readmissions are an indicator of asthma severity, psychological comorbidity, the effectiveness of hospital and community health services and also developments in hospital admission criteria, diagnostic practice and/or emergency department management [6]. Seasonality is an important marker of total environmental load or triggers such as high pollen exposure and respiratory virus infections which are associated with asthma hospital admissions [7]. Being aware of the days/times of the year when childhood asthma readmissions are expected to increase may assist in the achievement of better and more efficient health service planning.
The utility of predictive models for management of paediatric hospital readmissions in general and their benefit for health services research have been demonstrated [8]. Predictive models for paediatric asthma in particular have also been developed. Some models have used Cox regressions [9] or logistic regressions [10-14] with demographic or clinical attributes as predictors of readmissions and are therefore important for use in case management. These models predict population at risk but do not have the capacity to predict readmission frequencies or their timing, and are therefore of little use for hospital management resource planning. Models based on mathematical neural networking techniques have also been used to predict paediatric asthma admissions frequencies [15,16] . However, neural networks have limited capacities to demonstrate associations with seasonality and long term trends [16] .
The analysis of seasonality and time trends arise naturally within a time series framework. Many studies of childhood asthma hospital admissions have developed time series models for examining the associations between aeroallergens, pollution, weather and viral infections [17-22] . However, none of these models have been used to make predictions. Furthermore, only one study of childhood asthma readmissions has used a time series model, however, the model’s capability to predict future readmission counts was not tested [23]. In this study, we aimed to developed a time series model of childhood asthma readmissions, using the Generalized Additive Model (GAM) framework [24] to predict daily childhood asthma readmissions overall, as well as for males and females separately.
2. METHODS
2.1. Data
We used daily childhood asthma hospital admissions in Victoria between 1997 and 2009 from the Department of Human Services (DHS). The data collection methodology has been described elsewhere [25]. Children between the ages of 2 and 18 were included in the study. ICD-9 codes (493) up to 1998 and ICD-10 codes (J45 or J46) for the remainder of the data to 2009 were used to determine admissions with a principal diagnosis of asthma. The daily count of readmissions was defined as the daily number of readmissions that occurred within 28 days of the separation date of their respective index admission [26,27].
2.2. Statistical Methods
We assumed that the mean varied smoothly over time, and that a Poisson process underlay the association between the mean and observed counts. We modeled the mean using a GAM [24]. As covariates we included a time variable that was used to capture any trend in the data, while a month variable modeled any seasonality. The day of the week was also included, to cover possible differences in hospital service usage on weekends compared to the rest of the week [28]. Our final model was: Mean (daily readmission count) = f1 (time) + f2 (month) + f3 (day of week) + error term, with the mean and covariates on a log scale. f1 was a smooth non-parametric function, obtained as a combination of smoothing spline basis functions. f2 and f3 were linear functions, built using model coefficient matrices for parametric components of the model. In other words, we employed a semiparametric model.
We assumed a conditional Poisson distribution and, although the results are not shown here, χ2 tests gave us no evidence to reject this assumption for either male or female readmissions separately and, therefore, for all readmissions. We also tried fitting models assuming an underlying negative binomial distribution for the outcome, but this had a poor fit to the data. To cope with data over-dispersion, we used a self-adjusting over-dispersion parameter. The GAM method we used guarded against over-fitting by incorporating a penalized regression spline approach, with increasing penalty with increasing curve “wiggleness”, that is, its second derivative. The choice of a low spline basis dimension also guards against over-fitting by reducing smoothing parameters and hence reducing model inflexibility and increasing the effective degrees of freedom [29]. By experimenting, we found that comparatively low spline basis dimensions of between 5 and 15 coped with this sort of sparse data adequately. We chose counts as the outcome rather than readmission rates, usually defined as the number of readmissions/number of admissions, because a preliminary exploration indicated that counts were a more accurate marker of population rates. Also as part of our initial exploration, we compared our final model to a fully parametric model. That is, not only time, but also day of the week and day of the year (replacing month) were modeled with GAM fitted smooth penalised regression splines instead of assuming an a priori distribution and estimating their parametric components. Using day of the year (1 - 365 or 366) did not impose any assumption of monthly seasonality and so can be regarded as a test of whether seasonality is actually substantiated by the data.
We validated our model predictions by training the model on the first 11 years of data and then using the model to predict the final two years. We used χ2 goodness-of-fit tests to compare these predictions to the actual data. Using the full 13 years of data, we forecasted readmission counts one year beyond the end of the study. We also included 95% error limits around the predictions. We constructed our graphs using either the Lattice package [30] in the open source language R, version 2.13 [5] or just using the usual plot functions in R, and Stata 10.1 (Statacorp, USA). All of the GAMs were fitted using the MGCV package in R [29]. Statistical analyses were done in R or Stata using a two-sided significance level of 0.05.
3. RESULTS
There were 2401 readmissions covering fiscal years 1997-2009, a total of 4748 days, resulting in a daily mean of 0.5057. Boys accounted for 1358 (57%) and girls 1043 (43%) readmissions. Overall, readmissions peaked in winter (30.5%) then autumn (28.6%), spring (24.6%) followed by a trough in summer (16.2%). In contrast, χ2 test showed that all other asthma admissions had a differing overall seasonal distribution peaking in autumn (29.4%) then winter (25.6%), spring (23.8%) with a shallower trough in summer (20.9%) (p < 0.0005).
In Figure 1, we display the time series graphs of the total daily readmission counts for each of the study fiscal years (1st July to 30th June). These are typical of low count time series, that is, infrequently occurring and low magnitude counts. The distribution of daily readmission counts had a very low range, between 0 and 5, with the majority of counts being 0, including the median (Table 1). The sparseness of the data can be appreciated by observing that the 75th percentile was 1 and even the 99th was only 3. For the boys’ and girls’ distributions separately, their respective 75th percentiles were even lower, both being equal to 0. Even on a per-month aggregate basis, the counts were as low as 2. Runs of consecutive zero readmission days were as long as 54 days for boys’ and 48 days for girls’ readmissions with respective inter quartile ranges and median (IQR) of (1, 3, 5) and (2, 4, 7). Consecutive zero readmission days for total readmissions had IQR (1, 2, 3) with a maximum length of 27 days. On a monthly basis, there were as many as 12 distinct blocks (interspersed by at least 1 readmission) of these consecutive zero runs for total, male and female readmissions. The IQR for blocks of consecutive zero runs per month for total and boys’ was (4, 7, 10) and for
Figure 1. Time series graphs of all daily readmissions for each study year.
girls’ readmissions (4, 7, 9).
In Figures 2(a) to (c), we display readmissions, stratified by age and gender, as yearly counts, rates per 100,000 population and readmission rates, respectively. A close correspondence between counts and population rates for all age groups can be seen, however, similarity between population and readmission rates is present only for 2 - 5 year olds while the older age groups show a stark contrast. Although not shown here, regression analyses indicated a linear correspondence between counts and population rates but not between readmission rates and population rates.
3.1. Semi Parametric Models
We fitted our model to total readmissions and to male and female readmissions separately. The model showed that both month and day of the week had significant associations with mean daily readmission counts. Higher counts were expected in March, May, June and November, and the lowest counts in January. This pattern held for both males and females with males experiencing higher peaks. The day of the week had more of an impact for girls than boys. Overall more readmissions expected earlier in the week, from Sunday onwards, and the fewest on Saturday.
The fitted model produced a dispersion scaling parameter which was very close to unity, further justifying the assumption of a Poisson process and explaining why the negative binomial was such a bad fit. The smooth function of time was also significant (p < 0.0005), and indicated that after an initial trend of declining readmission counts until about 2001-2002, they subsequently began to increase returning to roughly initial levels (Figure 3).
Exploration with auto correlation plots and DurbinWatson tests indicated a weak but statistically significant auto correlation at lag 1 of 0.1, 0.08 and 0.06 for total, male and female readmissions respectively and so violated the assumption of independence of the daily readmission counts. In order to adjust for non independence, lag one values were included as part of the GAM. They made no substantial difference to the results and were finally not included due to the predictive role of our model.
3.2. Non Parametric Models
The non parametric model also demonstrated time, day of the week as well as day of year to be significantly associated with readmission counts. The day of year variable was in complete agreement with the parametric monthly variable of our final model for both timing and amplitude of peaks and troughs throughout the year.
Table 1. Distribution of counts of readmissions within 28 days for fiscal years 1997-2009.
(a)(b)(c)
Figure 2. (a) Annual count of readmission within 28 days; (b) Annual rate of readmission within 28 days per 100,000 population; (c) Annual rate of readmission within 28 days per total number of admissions.
3.3. Model Validation
We validated the model predictions by training the model with the first 11 years of data then using the model to predict the final two years of data. The predictions were subsequently aggregated by month and compared to the actual data. χ2 goodness-of-fit tests indicated that there was no statistical difference between the predictions and the data (p = 0.65). In comparison, when using the mean of the training data as a predictor of the final 2 years of data, χ2 tests indicated that the monthly aggregated mean predictions were not a good fit to the data (p = 0.03). We also used this method to test the parametric model’s fit to the data and found very similar results to the semi parametric model’s but with slightly higher χ2 values and so the parametric model did not have as good a fit to the data as the semi parametric
(a) Boys (b) Girls
Figure 3. Fitted semi parametric GAM and time trend to boys’ (a) and girls’ (b) daily readmission counts. Dotted lines represent 95% error limits. The horizontal lines are the overall means for the respective data subsets.
model. As the semi parametric model outperformed both the mean and the non parametric model in predicting the final 2 years of readmission data this gave us some confidence in using it to predict future readmissions.
Semi parametric models were also fitted to data subsets consisting of the 6 combinations of age group and sex displayed in Figure 2. Of these subsets, the largest number of total readmissions was for 2 to 5 year old boys, N = 829, and the smallest was for 13 to 18 year old girls, N = 253. In the presence of these extremely small numbers, the GAM was still sensitive to seasonality, day of the week and the time trend, with differences in the values of these covariates for the various subsets. The results of the χ2 goodness-of-fit tests validated the fit and prediction, but were only marginally better than the mean. Due to their extremely small numbers and the marginal improvement on the mean, we could not consider the models reliable for these 6 subsets.
Table 2 shows the actual data for 2009, in fortnightly aggregates, alongside the semi parametric GAM predictions for the same period. The data predictions were obtained with the same method used for model validation. The model was trained on the first 11 years of data and then used to predict the final two years of data. The predictions largely follow the previous year’s seasonal patterns, but with smoothed troughs and peaks. There are major differences in fortnights 18 & 19 and 25 & 26 representing March and June respectively. From Figure 1, we see that the March 2009 counts were unusually high and the June 2009 counts unusually low for those times of the year and so these differences may be due to normal variability or to the smoothing effect of the model. Table 2 shows that the confidence intervals for predictions after fortnight 17 tended to be the widest which indicated a decreasing reliability with an increasing predictive length. These predictions were for two years and so predictions of not more than one year are likely to be more reliable. However, overall, the predictions are in keeping with the seasonal character of the data.