Many people are wondering why New York is taking such a heavy hit while COVID-19 spreads much more slowly in states like Florida or California. After all, California has as much tourism and immigration, and in Florida the population is older and the quarantine was announced later.
What is going on? What really affects the spread of the virus? To take a stab at this question, I used multiple linear regression to look at how much various factors, like population density and weather, influence the growth rate of COVID-19 in US states. Those who want to know the details of the analysis can check out the Method section below. Here, while I still have your attention, I’ll summarize the interesting bits.
I looked at many different factors that could potentially be at play:
- Delay (in days) between the first case and the stay-in-place (SIP) orders for the whole state (source)
- Delay (in days) between the first death and SIP orders for the whole state (source)
- Average temperature in the winter in the state (source)
- Poverty – percent of the state population living in poverty (source)
- Population age – percent of the state population that are 65 and over (source)
- Population density in the state (source)
- Population density in the largest city (multiple sources)
- Compliance with SIP – percent reduction in mobility to places of recreation (source)
- Tourism – State visitors as a percent of total US visitors (source)
There are definitely other factors that can play a role, such as the number of simultaneous starting hot spots for the virus, but the above factors were readily available and include some of the more obvious causes.
The dependent variable, or the thing that these factors are explaining, was the daily growth rate of COVID-19 cases. For example, in NY at the time of the analysis (April 14th) it was 24% per day. I put all of these factors into a regression model and looked for the model that explained the growth rate best. In the best-fitting model the following factors were statistically significant: delay between first case and SIP, delay between first death and SIP, state population density, compliance, and biggest city density. Below you can see how these factors relate to the growth rate of COVID-19 cases, with their correlations shown as black lines.
Let’s look at these factors in more detail. The strongest correlations with case growth rate were for 1) the delay between first case and SIP (r = 0.46), 2) state population density (r = 0.46), and 3) population density in the biggest city (r = 0.27).
The delay between the first case and SIP is inversely related to COVID-19 case growth. That means that shorter delays are associated with faster COVID-19 growth. What is going on here? Shouldn’t it be the other way around? What I think we are seeing here is a chicken and egg problem, or an illustration of how correlation does not imply causation. It is a lot more likely that, as in the previous analysis of mobility, the effect goes the other way: states that saw rapid initial growth in cases closed down sooner. I found the same thing when correlating death growth and mobility reduction.
Population density, on the other hand, has the expected relationship with case growth: COVID-19 cases grow faster in states with higher population density. The same is true for the density of the largest city.
This analysis suggests that, out of all of the more obvious factors, population density is the most prominent one affecting the spread of COVID-19. Not surprising, but good to get a confirmation with data.
Growth rates were computed using the data from Johns Hopkins. Data for each state was fitted with an exponential function and the growth rate was estimated from the slope of the function in log coordinates.
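To make the "slope in log coordinates" idea concrete, here is a minimal Python sketch (the analysis itself was done in R, so this is an illustration rather than the original code). It assumes the growth rate is reported as exp(slope) - 1 in percent per day, which the post does not state explicitly, and it uses a synthetic series rather than the Johns Hopkins data. The function name `daily_growth_rate` is made up for this example.

```python
import math

def daily_growth_rate(cumulative_cases):
    """Fit cases ~ a * exp(b * day) by least squares on log(cases).

    Returns the implied growth rate in percent per day
    (assumption: rate = exp(slope) - 1).
    """
    days = list(range(len(cumulative_cases)))
    logs = [math.log(c) for c in cumulative_cases]  # needs counts > 0
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs))
    slope = sxy / sxx  # slope of the fitted line in log coordinates
    return (math.exp(slope) - 1.0) * 100.0

# Synthetic series that grows exactly 24% per day (not real data).
synthetic = [100 * 1.24 ** d for d in range(10)]
rate = daily_growth_rate(synthetic)
print(round(rate, 2))
```

On real data the points will not sit exactly on a line in log coordinates, so the fitted slope is an average growth rate over the chosen window.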
Analysis was done with R’s built-in lm() function for linear models. To check if any of the variables were severely collinear, I did a VIF (variance inflation factor) analysis and found that VIF was between 1 and 2 for all factors, so it did not indicate high multicollinearity.
To determine the best model, I used the stepAIC() function from the MASS package to compare nested models by the Akaike information criterion (AIC). Starting from a full model with all variables in it, the function steps down by removing variables one by one and evaluating which of the “sub-models” explains the data better (smaller AIC).
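For readers unfamiliar with VIF: each predictor is regressed on all the other predictors, and VIF = 1 / (1 - R²) of that regression. Values near 1 mean the predictor is nearly independent of the rest; values above roughly 5 to 10 are a common rule-of-thumb warning sign. The sketch below is a plain-Python illustration of that definition, not the R code used for the analysis. The helper names and the toy data (where `b` is deliberately almost twice `a`) are invented for the example.

```python
def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def r_squared(y, columns):
    """R^2 of an OLS fit of y on the given columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * y[i] for i, row in enumerate(design)) for p in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def vif(predictors):
    """Variance inflation factor for each named predictor column."""
    out = {}
    for name, col in predictors.items():
        others = [c for other, c in predictors.items() if other != name]
        out[name] = 1.0 / (1.0 - r_squared(col, others))
    return out

# Toy data (made up): "b" is almost exactly 2 * "a", "c" is unrelated.
toy = {
    "a": [1, 2, 3, 4, 5, 6, 7, 8],
    "b": [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1],
    "c": [3, 1, 4, 1, 5, 9, 2, 6],
}
vifs = vif(toy)
for name, value in vifs.items():
    print(name, round(value, 1))
```

In this toy set `a` and `b` come out with very large VIFs, which is the pattern the check is designed to catch. The values between 1 and 2 reported above are far from it.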
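The backward search can be sketched in a few lines. This is a simplified Python illustration of the idea, not a reimplementation of stepAIC(): it assumes the AIC form n·ln(RSS/n) + 2·(number of coefficients), which is what R reports for linear models up to an additive constant, and it only ever removes variables. The toy data are constructed so that the `noise` column is exactly uncorrelated with everything else. All names are invented for the example.

```python
import math

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def rss(y, columns):
    """Residual sum of squares of an OLS fit of y on columns plus an intercept."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(design[0])
    xtx = [[sum(row[p] * row[q] for row in design) for q in range(k)] for p in range(k)]
    xty = [sum(row[p] * y[i] for i, row in enumerate(design)) for p in range(k)]
    beta = solve(xtx, xty)
    return sum((yi - sum(bj * xj for bj, xj in zip(beta, row))) ** 2
               for yi, row in zip(y, design))

def aic(n, rss_value, n_coef):
    """AIC for a linear model, up to an additive constant."""
    return n * math.log(rss_value / n) + 2 * n_coef

def backward_eliminate(y, predictors):
    """Drop one variable at a time for as long as doing so lowers the AIC."""
    current = dict(predictors)
    best = aic(len(y), rss(y, list(current.values())), len(current) + 1)
    while current:
        trials = []
        for name in current:
            cols = [c for other, c in current.items() if other != name]
            trials.append((aic(len(y), rss(y, cols), len(cols) + 1), name))
        trial_aic, drop = min(trials)
        if trial_aic >= best:
            break  # no single removal improves the model
        best = trial_aic
        del current[drop]
    return list(current), best

# Toy data (made up): y depends on "density"; "noise" carries no information.
toy = {
    "density": [1, 2, 3, 4, 5, 6, 7, 8],
    "noise": [-7, 5, 7, 3, -3, -7, -5, 7],
}
y = [5.5, 4.5, 4.5, 5.5, 7.5, 10.5, 14.5, 19.5]
kept, final_aic = backward_eliminate(y, toy)
print(kept)

# Sanity check against the model summary in this post: 49 observations,
# 7 coefficients, residual standard error 1.814 on 42 degrees of freedom.
check = aic(49, 1.814 ** 2 * 42, 7)
print(round(check, 1))
```

The last two lines recompute the AIC from the residual standard error reported in the final model and land within rounding of the AIC=64.79 shown there, which suggests this is the formula behind that number.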
Outliers were identified using Cook’s distance and one severe outlier was removed (Michigan).
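Cook’s distance combines how large a point’s residual is with how much leverage the point has on the fit. The sketch below shows the calculation for the simplest case of one predictor plus an intercept, where leverage has a closed form. The model in this post has several predictors, so R’s cooks.distance() works from the full hat matrix instead. The data are made up, and the "D > 1" cutoff is a common rule of thumb, not necessarily the threshold used here.

```python
def cooks_distance(x, y):
    """Cook's distance for each point in a one-predictor OLS fit."""
    n = len(x)
    p = 2  # fitted coefficients: intercept and slope
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e ** 2 for e in residuals) / (n - p)  # residual variance
    distances = []
    for xi, e in zip(x, residuals):
        h = 1.0 / n + (xi - mean_x) ** 2 / sxx  # leverage of this point
        distances.append((e ** 2 / (p * s2)) * h / (1.0 - h) ** 2)
    return distances

# Made-up data: the points follow y = x except for the last one.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 2, 3, 4, 5, 6, 7, 20]
d = cooks_distance(x, y)
worst = max(range(len(d)), key=lambda i: d[i])
print(worst, round(d[worst], 2))
```

The last point is flagged because it both sits at the edge of the x range (high leverage) and lies far from the fitted line (large residual).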
The final model looked like this:
    Step:  AIC=64.79
    lm(formula = Cases.growth ~ SIP.delay.deaths + SIP.delay.cases +
        Temperature + Population.Density + Compliance + Biggest.city.density,
        data = data)

    Residuals:
        Min      1Q  Median      3Q     Max
    -3.0258 -1.0464 -0.3621  1.3237  3.7465

    Coefficients:
                           Estimate Std. Error t value Pr(>|t|)
    (Intercept)           2.495e+01  2.644e+00   9.435 6.19e-12 ***
    SIP.delay.deaths      1.768e-01  3.846e-02   4.598 3.89e-05 ***
    SIP.delay.cases      -1.679e-01  2.336e-02  -7.186 7.90e-09 ***
    Temperature           7.652e-02  4.077e-02   1.877  0.06752 .
    Population.Density    4.085e-03  1.238e-03   3.301  0.00197 **
    Compliance           -1.373e-01  6.123e-02  -2.243  0.03022 *
    Biggest.city.density  2.924e-04  8.931e-05   3.274  0.00213 **
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 1.814 on 42 degrees of freedom
    Multiple R-squared:  0.6634,  Adjusted R-squared:  0.6153
    F-statistic:  13.8 on 6 and 42 DF,  p-value: 1.366e-08
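To read the coefficients: the model predicts growth rate as the intercept plus each coefficient times the value of its factor. The short Python sketch below plugs numbers into that equation using the estimates from the output above. The input values are hypothetical, not any real state’s data, and the post does not spell out the units of each predictor, so treat the result as an illustration of the arithmetic only. The growth rate is assumed to be in percent per day, in line with the 24% figure quoted for NY.

```python
# Coefficient estimates copied from the model summary above.
COEFFICIENTS = {
    "(Intercept)": 24.95,
    "SIP.delay.deaths": 0.1768,
    "SIP.delay.cases": -0.1679,
    "Temperature": 0.07652,
    "Population.Density": 0.004085,
    "Compliance": -0.1373,
    "Biggest.city.density": 0.0002924,
}

def predict_growth(values):
    """Predicted growth rate: intercept plus coefficient * value per factor.

    Factors missing from `values` are treated as zero.
    """
    total = COEFFICIENTS["(Intercept)"]
    for name, coef in COEFFICIENTS.items():
        if name != "(Intercept)":
            total += coef * values.get(name, 0.0)
    return total

# Hypothetical inputs, chosen only to show the arithmetic.
example = {
    "SIP.delay.deaths": 5,
    "SIP.delay.cases": 20,
    "Temperature": 40,
    "Population.Density": 200,
    "Compliance": 30,
    "Biggest.city.density": 5000,
}
base = predict_growth(example)

# Same inputs, but with SIP ordered 10 days later relative to the first case.
later = dict(example)
later["SIP.delay.cases"] += 10
effect = predict_growth(later) - base
print(round(effect, 3))
```

Holding everything else fixed, 10 extra days between the first case and SIP shifts the prediction by 10 × (-0.1679), about -1.7 percentage points per day. As discussed above, that negative sign most likely reflects states with fast early growth closing sooner, not a benefit of waiting.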
Limitations: There were multiple limitations to this analysis. First, some of the data for the factors was from several years ago (e.g. 2014 for tourism) and might have been outdated. Second, reporting of cases data is itself limited by testing availability and delays. Third, some of the data might not be the best representation of the specific factor (e.g. for compliance I looked at only one aspect of mobility).