To understand the pollution-related factors that act upon cancer incidence and death, we obtained, cleaned, and analyzed pollution data that spanned from 2000-2016. This data contained concentrations from the known pollutants NO2, O3, SO2, and CO.
In the first table below, you can see the average concentrations of each pollutant in question in the United States for each year over the 17-year span.
modified_pollution =
read_csv("data/uspollution_us_2000_2016.csv") %>%
janitor::clean_names() %>%
select(state, date_local, no2_mean, o3_mean,
so2_mean, co_mean) %>%
rename(
no2 = no2_mean,
co = co_mean,
o3 = o3_mean,
so2 = so2_mean
) %>%
separate(date_local, into = c("year", "month", "day"), sep = "\\-") %>%
select(-c("month", "day")) %>%
group_by(year, state) %>%
summarize(across(everything(), mean)) %>%
mutate_if(is.numeric, ~round(., 3)) %>%
filter(state != "Country Of Mexico") %>%
group_by(year) %>%
summarize(across(everything(), mean)) %>%
select(-c(state)) %>%
mutate(
year = as.numeric(year)
)
modified_pollution %>%
mutate_if(is.numeric, ~round(., 3)) %>%
rename("no2 (ppb)" = "no2", "o3 (ppm)" = "o3", "so2 (ppb)" = "so2", "co (ppm)" = "co") %>%
knitr::kable(caption = "United States Pollutant Concentration from 2000-2016")
year | no2 (ppb) | o3 (ppm) | so2 (ppb) | co (ppm) |
---|---|---|---|---|
2000 | 17.323 | 0.025 | 4.396 | 0.582 |
2001 | 17.670 | 0.026 | 3.598 | 0.476 |
2002 | 15.537 | 0.028 | 2.899 | 0.477 |
2003 | 14.555 | 0.027 | 3.144 | 0.467 |
2004 | 14.945 | 0.024 | 3.219 | 0.437 |
2005 | 15.297 | 0.026 | 3.129 | 0.437 |
2006 | 13.284 | 0.026 | 2.877 | 0.396 |
2007 | 12.208 | 0.026 | 2.407 | 0.372 |
2008 | 10.844 | 0.026 | 1.967 | 0.323 |
2009 | 9.917 | 0.026 | 1.674 | 0.300 |
2010 | 10.668 | 0.027 | 1.424 | 0.310 |
2011 | 10.972 | 0.027 | 1.229 | 0.288 |
2012 | 10.446 | 0.028 | 0.931 | 0.292 |
2013 | 10.188 | 0.026 | 0.917 | 0.261 |
2014 | 9.591 | 0.026 | 1.011 | 0.260 |
2015 | 9.253 | 0.026 | 0.851 | 0.265 |
2016 | 10.988 | 0.026 | 0.646 | 0.280 |
Below, we have included another data frame with the means across the entire 17-year span.
modified_pollution %>%
pivot_longer(
no2:co,
names_to = "pollutant_type",
values_to = "concentration"
) %>%
mutate(measurement = case_when(
pollutant_type %in% c("no2", "so2") ~ "ppb",
pollutant_type %in% c("co", "o3") ~ "ppm")) %>%
group_by(pollutant_type, measurement) %>%
summarize(
mean_concentration = mean(concentration),
median_concentration = median(concentration),
standard_deviation = sd(concentration)) %>%
arrange(measurement) %>%
knitr::kable(digits = 3)
pollutant_type | measurement | mean_concentration | median_concentration | standard_deviation |
---|---|---|---|---|
no2 | ppb | 12.570 | 10.988 | 2.782 |
so2 | ppb | 2.136 | 1.967 | 1.153 |
co | ppm | 0.366 | 0.323 | 0.098 |
o3 | ppm | 0.026 | 0.026 | 0.001 |
To compare the distributions of the pollutants to each other, we created a series of boxplots. As you can see, NO2 has the highest average concentration of any of the pollutants in ppm, and also has a skewed-right shape. CO and O3 have much narrower distributions (in addition to much lower mean concentrations), making their shape harder to discern. SO2, however, appears to be approximately normally distributed.
ppb_plot =
modified_pollution %>%
pivot_longer(
no2:co,
names_to = "pollutant_type",
values_to = "concentration"
) %>%
group_by(pollutant_type) %>%
filter(pollutant_type %in% c("no2", "so2")) %>%
ggplot(aes(x = pollutant_type, y = concentration)) +
geom_boxplot() +
labs(
x = "Pollutant Type",
y = "Concentration (ppb)",) + theme_bw()
ppm_plot =
modified_pollution %>%
pivot_longer(
no2:co,
names_to = "pollutant_type",
values_to = "concentration"
) %>%
group_by(pollutant_type) %>%
filter(pollutant_type %in% c("o3", "co")) %>%
ggplot(aes(x = pollutant_type, y = concentration)) +
geom_boxplot() +
labs(
x = "Pollutant Type",
y = "Concentration (ppm)",) + theme_bw()
ppb_plot + ppm_plot + plot_annotation(title =
"Distribution of Concentration Values for each Pollutant Type")
The next plot we created shows the pollution concentration with respect to time. This plot reveals that while NO2 and SO2 have been decreasing in concentration in recent years, that CO and O3 have remained relatively stable.
ppm_point =
modified_pollution %>%
pivot_longer(
no2:co,
names_to = "pollutant_type",
values_to = "concentration"
) %>%
filter(pollutant_type %in% c("o3", "co")) %>%
ggplot(aes(x = year,y = concentration, group = pollutant_type, color = pollutant_type)) +
geom_point() +
geom_line() +
scale_color_viridis(discrete = TRUE) +
theme_bw() +
labs(
title = "",
x = "Year",
y = "Pollutant Concentration(ppm)"
)
ppb_point =
modified_pollution %>%
pivot_longer(
no2:co,
names_to = "pollutant_type",
values_to = "concentration"
) %>%
filter(pollutant_type %in% c("no2", "so2")) %>%
ggplot(aes(x = year,y = concentration, group = pollutant_type, color = pollutant_type)) +
geom_point() +
geom_line() +
scale_color_viridis(discrete = TRUE) +
theme_bw() +
labs(
title = "",
x = "Year",
y = "Pollutant Concentration(ppb)"
)
ppb_point / ppm_point + plot_annotation(title =
"Trends in Pollutant Concentration in the US from 2000-2016")
Here, an interactive map of the US that shows incidence rate of lung/bronchus cancer by state is at the bottom of this page, and gives a visualization of regional incidence trends. From this, it can be seen that southern states and northern Midwest states have the highest incidence rates of lung and bronchus cancers. Kentucky had the highest incidence of lung and bronchus cancer in 2013-2017 and Utah had the lowest incidence rate. We will further explore the relationship between pollution concentration and lung cancer incidence in our regression analysis.
g = list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = TRUE,
lakecolor = toRGB('white')
)
inc_state =
read_excel("data/IncRate.xlsx", sheet = "State",
skip = 6) %>%
janitor::clean_names() %>%
separate(
col = breast_both_sexes_combined,
into = c("breast_total", "female_breast_only"),
sep = "-"
) %>%
mutate(
breast_male = if_else(breast_male == "n/a", "0", breast_male),
cervix_male = if_else(cervix_male == "n/a", "0", cervix_male),
colon_excluding_rectum_both_sexes_combined =
if_else(colon_excluding_rectum_both_sexes_combined == "n/a", "0",
colon_excluding_rectum_both_sexes_combined),
colon_excluding_rectum_female = if_else(colon_excluding_rectum_female == "n/a", "0",
colon_excluding_rectum_female),
colon_excluding_rectum_male = if_else(colon_excluding_rectum_male == "n/a", "0",
colon_excluding_rectum_male)) %>%
filter(state != "Puerto Rico")
inc_state %>%
select(state, lung_and_bronchus_both_sexes_combined) %>%
mutate(code = state.abb[match(state, state.name)],
lung_and_bronchus_both_sexes_combined = as.numeric(lung_and_bronchus_both_sexes_combined)) %>%
filter(code != is.na(code)) %>%
relocate(state, code) %>%
plot_geo(., locationmode = 'USA-states') %>%
add_trace(
z = ~lung_and_bronchus_both_sexes_combined, text = ~state, locations = ~code,
color = ~lung_and_bronchus_both_sexes_combined, colors = 'inferno'
) %>%
colorbar(title = "Incidence Rate per 100,000") %>%
layout(
title = "Distribution of Lung/Bronchus Cancer Incidence Rate by State in 2013-2017",
geo = g
)
Note. The raw data for incidence rate of cancer by cancer type did not include data point for Nevada in the combined national rates. Original data recommends referring to publications and sources page on American Cancer Society website
Similarly, the interactive map of the mortality rate from lung and bronchus cancer in 2013-2017 shows comparable distribution across the states as the incidence rate.
g = list(
scope = 'usa',
projection = list(type = 'albers usa'),
showlakes = TRUE,
lakecolor = toRGB('white')
)
death_state =
read_excel("data/DeathRate.xlsx", sheet = "State",
skip = 6) %>%
janitor::clean_names() %>%
separate(
col = breast_both_sexes_combined,
into = c("breast_total", "female_breast_only"),
sep = "-"
) %>%
select(-female_breast_only) %>%
mutate(
breast_male = if_else(breast_male == "n/a", "0", breast_male),
cervix_male = if_else(cervix_male == "n/a", "0", cervix_male),
) %>%
mutate_at(vars(-("state")), as.numeric) %>%
filter(state != "Puerto Rico")
death_state %>%
select(state, lung_and_bronchus_both_sexes_combined) %>%
mutate(code = state.abb[match(state, state.name)],
lung_and_bronchus_both_sexes_combined = as.numeric(lung_and_bronchus_both_sexes_combined)) %>%
filter(code != is.na(code)) %>%
relocate(state, code) %>%
plot_geo(., locationmode = 'USA-states') %>%
add_trace(
z = ~lung_and_bronchus_both_sexes_combined, text = ~state, locations = ~code,
color = ~lung_and_bronchus_both_sexes_combined, colors = 'inferno'
) %>%
colorbar(title = "Mortality Rate per 100,000") %>%
layout(
title = "Distribution of Lung/Bronchus Cancer Mortality Rate by State in 2013-2017",
geo = g
)
read_death_time =
read_excel("data/DeathTrend.xlsx",
skip = 6) %>%
janitor::clean_names()
x = c("colorectum_female", "colorectum_male", "liver_and_intrahepatic_bile_duct_female",
"liver_and_intrahepatic_bile_duct_male", "lung_and_bronchus_female",
"lung_and_bronchus_male", "ovary_female", "uterus_cervix_and_corpus_combined_female")
remove_note = function(column_name) {
read_death_time = read_death_time %>%
separate(column_name, into = c(column_name, "note"), sep = "\\-") %>%
select(-note)}
for (i in x) {
read_death_time = remove_note(i)}
death_time =
read_death_time %>%
filter(year %in% 2000:2016) %>%
select(-c("breast_male", "ovary_male", "prostate_female",
"uterus_cervix_and_corpus_combined_male")) %>%
mutate_at(vars(-("year")), as.numeric) %>%
mutate(
year = as.numeric(year)
)
Following plot shows the death trend in lung cancer over the same period of time as the previous plot of pollutants. As can be seen, lung cancer has sharply decreased from 2000-2016, which is similar to the trend we see in NO2 and SO2.
death_time %>%
rename(
male = lung_and_bronchus_male,
female = lung_and_bronchus_female
) %>%
pivot_longer(
female:male,
names_to = "gender",
values_to = "death_rate"
) %>%
ggplot(aes(x = year,y = death_rate, group = gender, color = gender)) +
geom_point() +
geom_line() +
labs(
title = "Trend in US Death rate from Lung Cancer, 2000-2016",
x = "Year",
y = "Lung Cancer Death Rate(per 100,000)"
)
After analyzing pollutants and lung cancer mortality/incidence on their own, we combined the pollution data with our data on cancer death and incidence in order to draw conclusions on the correlative effects pollutants have on cancer.
The next series of plots compares the pollutant concentration to the lung cancer death rate, stratifying by gender. We see moderately strong, positive linear correlations between Carbon Monoxide(co), Sulfur Dioxide(so2), and nitrous dioxide(no2) and lung cancer death rate among males. Whereas for females, we only see a moderate linear correlation between so2 and carbon monoxide concentrations and lung cancer death rate.
merged_death_pollution =
merge(death_time, modified_pollution, by = "year")
pollutant_variables =
merged_death_pollution %>%
select(no2,o3,so2,co) %>%
colnames() %>%
as.vector()
Create a for loop to plot Pollutant concentration vs. Lung Cancer Death Rate
pollutant_variables =
pollutant_variables %>%
as.list()
for (i in pollutant_variables) {
plot =
ggplot(merged_death_pollution, aes_string(i, "lung_and_bronchus_male")) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
labs(title = "Pollutant Concentration vs. Lung Cancer Death Rate(Male)",
y = "Lung Cancer Death Rate(per 100,000)") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
print(plot)
}
for (i in pollutant_variables) {
plot =
ggplot(merged_death_pollution, aes_string(i, "lung_and_bronchus_female")) +
geom_point() +
geom_smooth(method = lm, se = FALSE) +
labs(title = "Pollutant Concentration vs. Lung Cancer Death Rate(Female)",
y = "Lung Cancer Death Rate(per 100,000)") +
theme(plot.title = element_text(hjust = 0.5, face = "bold"))
print(plot)
}