To continue our analysis we will build a linear model that assesses the relationship between pollutant concentration and incidence of lung cancer. We will merge the data set with incidence rate by state with the pollution data set that has been aggregated by state over the 17 year period. In essence we would like to answer whether states that had higher mean pollutant concentrations over this 15 year period were significantly associated with higher lung cancer incidence. Thus, from building the model we can assess whether concentrations of the various pollutants are significantly associated with increased incidence rates of lung cancer. We will evaluate the predictive accuracy of the model, and cross validate the fitted model to other potential models.
inc_state =
read_excel("data/IncRate.xlsx", sheet = "State",
skip = 6) %>%
janitor::clean_names() %>%
separate(
col = breast_both_sexes_combined,
into = c("breast_total", "female_breast_only"),
sep = "-"
) %>%
mutate(
breast_male = if_else(breast_male == "n/a", "0", breast_male),
cervix_male = if_else(cervix_male == "n/a", "0", cervix_male),
colon_excluding_rectum_both_sexes_combined = if_else(colon_excluding_rectum_both_sexes_combined == "n/a", "0", colon_excluding_rectum_both_sexes_combined),
colon_excluding_rectum_female = if_else(colon_excluding_rectum_female == "n/a", "0", colon_excluding_rectum_female),
colon_excluding_rectum_male = if_else(colon_excluding_rectum_male == "n/a", "0", colon_excluding_rectum_male),
) %>%
filter(state != "Puerto Rico") %>%
select(-c("female_breast_only", starts_with("colon"), starts_with("rectum")))
pollution_incidence = read_csv("data/uspollution_us_2000_2016.csv") %>%
janitor::clean_names() %>%
select(state, date_local, no2_mean, o3_mean,
so2_mean, co_mean) %>%
separate(date_local, into = c("year", "month", "day"), sep = "\\-") %>%
select(-c("month", "day")) %>%
group_by(year, state) %>%
summarize(across(everything(), mean)) %>%
mutate_if(is.numeric, ~round(., 3)) %>%
filter(state != "Country Of Mexico") %>%
ungroup() %>%
select(state:co_mean) %>%
group_by(state) %>%
summarize(
no2 = mean(no2_mean),
o3 = mean(o3_mean),
so2 = mean(so2_mean),
co = mean(co_mean)
) %>%
merge(inc_state, by = "state") %>%
filter(
state != "Nevada"
) %>%
mutate(
lung_and_bronchus_both_sexes_combined = as.numeric(lung_and_bronchus_both_sexes_combined),
lung_and_bronchus_female = as.numeric(lung_and_bronchus_female),
lung_and_bronchus_male = as.numeric(lung_and_bronchus_male)
)
pollution_incidence %>%
select(state,"lung/bronchus combined" = lung_and_bronchus_both_sexes_combined,
"lung/bronchus male" = lung_and_bronchus_male,
"lung/bronchus female" = lung_and_bronchus_female,
no2,co,so2,o3) %>%
knitr::kable(digits = 3)
state | lung/bronchus combined | lung/bronchus male | lung/bronchus female | no2 | co | so2 | o3 |
---|---|---|---|---|---|---|---|
Alabama | 64.8 | 84.1 | 50.0 | 10.020 | 0.212 | 0.945 | 0.022 |
Alaska | 55.8 | 64.8 | 47.6 | 11.388 | 0.430 | 6.056 | 0.012 |
Arizona | 46.7 | 51.2 | 43.0 | 19.032 | 0.493 | 1.313 | 0.025 |
Arkansas | 77.2 | 95.8 | 62.7 | 9.540 | 0.418 | 1.483 | 0.026 |
California | 41.5 | 46.4 | 37.8 | 13.328 | 0.441 | 1.112 | 0.026 |
Colorado | 41.4 | 44.1 | 39.6 | 19.886 | 0.467 | 1.627 | 0.022 |
Connecticut | 59.6 | 65.0 | 55.8 | 9.971 | 0.279 | 1.159 | 0.028 |
Delaware | 66.9 | 75.3 | 60.8 | 11.643 | 0.265 | 0.969 | 0.026 |
Florida | 57.7 | 66.5 | 50.5 | 7.454 | 0.450 | 0.479 | 0.027 |
Georgia | 62.8 | 79.0 | 50.6 | 11.849 | 0.329 | 0.530 | 0.020 |
Hawaii | 45.7 | 57.3 | 36.3 | 3.182 | 0.369 | 1.020 | 0.025 |
Idaho | 49.5 | 54.3 | 45.7 | 8.553 | 0.198 | 0.420 | 0.024 |
Illinois | 63.7 | 73.8 | 56.3 | 15.460 | 0.403 | 2.753 | 0.023 |
Indiana | 72.2 | 86.4 | 61.4 | 12.393 | 0.382 | 3.289 | 0.030 |
Iowa | 63.3 | 74.7 | 54.5 | 7.060 | 0.223 | 0.408 | 0.027 |
Kansas | 56.2 | 64.9 | 49.7 | 11.564 | 0.393 | 2.578 | 0.025 |
Kentucky | 91.0 | 109.0 | 77.5 | 11.511 | 0.204 | 2.686 | 0.026 |
Louisiana | 66.2 | 82.6 | 53.6 | 13.697 | 0.404 | 2.257 | 0.023 |
Maine | 72.1 | 80.3 | 65.8 | 4.961 | 0.225 | 0.882 | 0.025 |
Maryland | 56.4 | 62.9 | 51.7 | 10.717 | 0.309 | 1.942 | 0.027 |
Massachusetts | 61.6 | 65.5 | 59.2 | 18.458 | 0.312 | 2.470 | 0.021 |
Michigan | 63.3 | 71.8 | 56.9 | 16.505 | 0.333 | 3.314 | 0.026 |
Minnesota | 56.1 | 61.5 | 52.2 | 6.702 | 0.225 | 0.601 | 0.028 |
Missouri | 72.0 | 83.6 | 63.2 | 14.005 | 0.423 | 3.031 | 0.027 |
New Hampshire | 63.7 | 67.2 | 61.8 | 7.366 | 0.337 | 1.409 | 0.028 |
New Jersey | 55.3 | 60.8 | 51.7 | 18.907 | 0.414 | 3.576 | 0.023 |
New Mexico | 38.5 | 43.9 | 34.3 | 12.343 | 0.211 | 0.613 | 0.032 |
New York | 58.7 | 66.2 | 53.4 | 18.750 | 0.357 | 4.382 | 0.024 |
North Carolina | 67.7 | 82.8 | 56.4 | 10.857 | 0.378 | 1.669 | 0.030 |
North Dakota | 57.7 | 65.4 | 52.3 | 4.773 | 0.169 | 0.221 | 0.026 |
Ohio | 67.8 | 80.0 | 58.7 | 10.629 | 0.283 | 2.551 | 0.026 |
Oklahoma | 67.4 | 80.5 | 57.1 | 6.835 | 0.139 | 0.814 | 0.031 |
Oregon | 54.1 | 58.7 | 50.6 | 9.718 | 0.306 | 0.982 | 0.019 |
Pennsylvania | 63.5 | 73.4 | 56.4 | 11.809 | 0.231 | 3.658 | 0.026 |
Rhode Island | 69.4 | 75.4 | 65.6 | 6.986 | 0.222 | 0.470 | 0.030 |
South Carolina | 64.4 | 80.1 | 52.3 | 2.909 | 0.139 | 0.721 | 0.032 |
South Dakota | 59.0 | 66.7 | 54.1 | 5.253 | 0.186 | 0.482 | 0.030 |
Tennessee | 74.5 | 91.6 | 61.4 | 1.800 | 0.414 | 0.815 | 0.037 |
Texas | 50.6 | 61.3 | 42.2 | 12.242 | 0.270 | 1.117 | 0.026 |
Utah | 26.0 | 30.2 | 22.5 | 17.066 | 0.376 | 0.475 | 0.028 |
Virginia | 56.4 | 65.9 | 49.2 | 9.888 | 0.331 | 2.262 | 0.028 |
Washington | 54.0 | 58.8 | 50.3 | 11.220 | 0.216 | 0.590 | 0.020 |
Wisconsin | 59.1 | 66.6 | 53.5 | 16.009 | 0.334 | 2.560 | 0.019 |
Wyoming | 42.2 | 45.2 | 40.0 | 3.209 | 0.098 | 0.331 | 0.038 |
Let’s begin by exploring each pollutant to see if there exists linear relationship between the pollutant concentration and incidence of lung cancer
continuous_variables =
pollution_incidence %>%
select(where(is.numeric)) %>%
select(-lung_and_bronchus_both_sexes_combined,-lung_and_bronchus_female, -lung_and_bronchus_male) %>%
colnames() %>%
as.vector()
Creating a for loop to create scatter plots that evaluate each pollutant concentration vs lung cancer incidence
continuous_variables =
continuous_variables %>%
as.list()
for (i in continuous_variables) {
plot =
ggplot(pollution_incidence, aes_string(i, "lung_and_bronchus_both_sexes_combined")) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
labs(title = "Pollutant Concentration vs. Lung Cancer Incidence", y = "Lung Cancer Incidence") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
print(plot)
}
Well, this is not the best start. From reviewing the scatter plots, none of the pollutants show a particularly strong linear relationship with incidence of lung cancer. Each plot has a linear model fitted, but as we can see none of the linear associations are strong, however we will proceed forward with a model that includes all the pollutants.The lack of strong linear association will be noted in the final discussion of the prediction accuracy of the fitted model.
fitted_model = lm(lung_and_bronchus_both_sexes_combined ~ no2 + o3 + so2 + co,
data = pollution_incidence)
fitted_model %>%
broom::tidy() %>%
select(term, estimate,p.value) %>%
knitr::kable(digits = 3)
term | estimate | p.value |
---|---|---|
(Intercept) | 62.537 | 0.000 |
no2 | -1.034 | 0.034 |
o3 | 126.200 | 0.768 |
so2 | 4.561 | 0.006 |
co | -9.008 | 0.651 |
fitted_model %>%
broom::glance() %>%
knitr::kable(digits = 3)
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
---|---|---|---|---|---|---|---|---|---|---|---|
0.221 | 0.141 | 10.705 | 2.766 | 0.041 | 4 | -164.092 | 340.185 | 350.89 | 4469.614 | 39 | 44 |
From looking at the table of values for the fitted model, we see that beta coefficients for so2 and o3 indicate a positive relationship with lung cancer incidence, whereas the beta coefficients for no2 and co indicate a negative relationship with lung cancer incidence. But more importantly, the only significant beta coefficients are for no2 and so2 using the threshold of alpha-level = 0.05. If look further at the analysis for the model, we see the R^2 value is quite low, This indicates the linear model does not not explain variation in lung cancer incidence well, thus prediction accuracy of the model is low. However, given the p-value for the overall model(p-value=0.041) is significant. Thus we will proceed to look at the residual plot for the fitted model.
pollution_incidence %>%
add_residuals(fitted_model) %>%
add_predictions(fitted_model) %>%
ggplot(aes(x = pred, y = resid)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(
title = "Residuals vs. Predictions plot",
x = "Predictions",
y = "Residuals"
) +
scale_color_viridis(discrete = TRUE) +
theme_bw()
The plot indicates a random scatter of residuals. Therefore, the assumption of homoscedasticity is not violated.
We will cross validate the fitted model to models that are a combination of the different predictors. O3 and so2 were grouped together because of the scatter plots before that indicate slight positive relationships. Similarly no2 and co were grouped together because of the scatter plots that indicated slightly negative relationships.
comparison_model = lm(lung_and_bronchus_both_sexes_combined ~ o3 + so2, data = pollution_incidence)
comparison_model2 = lm(lung_and_bronchus_both_sexes_combined ~ no2 + co, data = pollution_incidence)
cv_df = crossv_mc(pollution_incidence,100)
cv_df =
cv_df %>%
mutate(
train = map(train,as_tibble),
test = map(test,as_tibble)
)
cv_df =
cv_df %>%
mutate(
fitted_model = map(train, ~lm(lung_and_bronchus_both_sexes_combined ~ no2 + o3 + so2 + co,
data = pollution_incidence)),
comparison_model = map(train, ~lm(lung_and_bronchus_both_sexes_combined ~ o3 + so2,
data = pollution_incidence)),
comparison_model2 = map(train, ~lm(lung_and_bronchus_both_sexes_combined ~ no2 + co,
data = pollution_incidence))) %>%
mutate(
rmse_fitted_model = map2_dbl(fitted_model, test,~rmse(model=.x, data=.y)),
rmse_comparison_model = map2_dbl(comparison_model, test, ~rmse(model=.x, data=.y)),
rmse_comparison_model2 = map2_dbl(comparison_model2, test, ~rmse(model=.x, data=.y))
)
cv_df %>%
select(starts_with("rmse")) %>%
pivot_longer(
everything(),
names_to = "model",
values_to = "rmse",
names_prefix = "rmse_") %>%
mutate(model = fct_inorder(model)) %>%
ggplot(aes(x = model, y = rmse)) + geom_boxplot()
From the boxplot, we see that the fitted model has a lower RMSE compared to the comparison models. Despite the fitted model being better,the mean RMSE value is approximately equal to 10, which is not good. Thus the overall predictive accuracy of the fitted model is quite low. We will now evaluate whether stratifying lung cancer incidence by gender affects the predictive accuracy of the model.
female_model = lm(lung_and_bronchus_female ~ no2 + o3 + so2 + co, data = pollution_incidence)
female_model %>%
broom::tidy() %>%
knitr::kable(digits = 3)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 54.464 | 12.441 | 4.378 | 0.000 |
no2 | -0.640 | 0.402 | -1.594 | 0.119 |
o3 | 133.212 | 362.366 | 0.368 | 0.715 |
so2 | 3.745 | 1.341 | 2.794 | 0.008 |
co | -15.525 | 16.843 | -0.922 | 0.362 |
female_model %>%
broom::glance() %>%
knitr::kable(digits = 3)
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
---|---|---|---|---|---|---|---|---|---|---|---|
0.195 | 0.112 | 9.129 | 2.357 | 0.07 | 4 | -157.085 | 326.169 | 336.874 | 3250.356 | 39 | 44 |
male_model = lm(lung_and_bronchus_male ~ no2 + o3 + so2 + co, data = pollution_incidence)
male_model %>%
broom::tidy() %>%
knitr::kable(digits = 3)
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
(Intercept) | 72.236 | 19.060 | 3.790 | 0.001 |
no2 | -1.496 | 0.615 | -2.430 | 0.020 |
o3 | 156.136 | 555.152 | 0.281 | 0.780 |
so2 | 5.611 | 2.054 | 2.732 | 0.009 |
co | -2.349 | 25.804 | -0.091 | 0.928 |
male_model %>%
broom::glance() %>%
knitr::kable(digits = 3)
r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
---|---|---|---|---|---|---|---|---|---|---|---|
0.218 | 0.138 | 13.986 | 2.72 | 0.043 | 4 | -175.854 | 363.709 | 374.414 | 7628.856 | 39 | 44 |
From looking at the beta coefficients and model analyses for both models, predictive accuracy did not improve.