To understand the pollution-related factors that act upon cancer incidence and death, we obtained, cleaned, and analyzed pollution data that spanned from 2000-2016. This data contained concentrations from the known pollutants NO2, O3, SO2, and CO.


Exploring Pollution Data Set

 

In the first table below, you can see the average concentrations of each pollutant in question in the United States for each year over the 17-year span.

modified_pollution =
  read_csv("data/uspollution_us_2000_2016.csv") %>% 
  janitor::clean_names() %>%
  select(state, date_local, no2_mean, o3_mean, 
         so2_mean, co_mean) %>%
  rename(
    no2 = no2_mean,
    co = co_mean,
    o3 = o3_mean,
    so2 = so2_mean
  ) %>% 
  separate(date_local, into = c("year", "month", "day"), sep = "\\-") %>%
  select(-c("month", "day")) %>%
  group_by(year, state) %>%
  summarize(across(everything(), mean)) %>%
  mutate_if(is.numeric, ~round(., 3)) %>%
  filter(state != "Country Of Mexico") %>%
  group_by(year) %>%
  summarize(across(everything(), mean)) %>%
  select(-c(state)) %>% 
  mutate(
    year = as.numeric(year)
  )

modified_pollution %>% 
  mutate_if(is.numeric, ~round(., 3)) %>%
  rename("no2 (ppb)" = "no2", "o3 (ppm)" = "o3", "so2 (ppb)" = "so2", "co (ppm)" = "co") %>%
  knitr::kable(caption = "United States Pollutant Concentration from 2000-2016")
United States Pollutant Concentration from 2000-2016
year no2 (ppb) o3 (ppm) so2 (ppb) co (ppm)
2000 17.323 0.025 4.396 0.582
2001 17.670 0.026 3.598 0.476
2002 15.537 0.028 2.899 0.477
2003 14.555 0.027 3.144 0.467
2004 14.945 0.024 3.219 0.437
2005 15.297 0.026 3.129 0.437
2006 13.284 0.026 2.877 0.396
2007 12.208 0.026 2.407 0.372
2008 10.844 0.026 1.967 0.323
2009 9.917 0.026 1.674 0.300
2010 10.668 0.027 1.424 0.310
2011 10.972 0.027 1.229 0.288
2012 10.446 0.028 0.931 0.292
2013 10.188 0.026 0.917 0.261
2014 9.591 0.026 1.011 0.260
2015 9.253 0.026 0.851 0.265
2016 10.988 0.026 0.646 0.280

Below, we have included another data frame with the means across the entire 17-year span.

modified_pollution %>% 
   pivot_longer(
    no2:co,
    names_to = "pollutant_type",
    values_to = "concentration"
  )  %>% 
  mutate(measurement = case_when(
    pollutant_type %in% c("no2", "so2") ~ "ppb",
    pollutant_type %in% c("co", "o3") ~ "ppm")) %>%
   group_by(pollutant_type, measurement) %>% 
   summarize(
    mean_concentration = mean(concentration),
    median_concentration = median(concentration),
    standard_deviation = sd(concentration)) %>% 
  arrange(measurement) %>%
  knitr::kable(digits = 3)
pollutant_type measurement mean_concentration median_concentration standard_deviation
no2 ppb 12.570 10.988 2.782
so2 ppb 2.136 1.967 1.153
co ppm 0.366 0.323 0.098
o3 ppm 0.026 0.026 0.001

 


Comparing Distributions of Pollutant Concentrations

 

To compare the distributions of the pollutants to each other, we created a series of boxplots. As you can see, NO2 has the highest average concentration of any of the pollutants in ppm, and also has a skewed-right shape. CO and O3 have much narrower distributions (in addition to much lower mean concentrations), making their shape harder to discern. SO2, however, appears to be approximately normally distributed.

ppb_plot =
  modified_pollution %>% 
  pivot_longer(
    no2:co,
    names_to = "pollutant_type",
    values_to = "concentration"
  ) %>% 
  group_by(pollutant_type) %>% 
  filter(pollutant_type %in% c("no2", "so2")) %>%
  ggplot(aes(x = pollutant_type, y = concentration)) + 
  geom_boxplot() +
  labs(
    x = "Pollutant Type",
    y = "Concentration (ppb)",) + theme_bw() 

ppm_plot =
   modified_pollution %>% 
  pivot_longer(
    no2:co,
    names_to = "pollutant_type",
    values_to = "concentration"
  ) %>% 
  group_by(pollutant_type) %>% 
  filter(pollutant_type %in% c("o3", "co")) %>%
  ggplot(aes(x = pollutant_type, y = concentration)) + 
  geom_boxplot() +
  labs(
    x = "Pollutant Type",
    y = "Concentration (ppm)",) + theme_bw()

ppb_plot + ppm_plot + plot_annotation(title = 
                                        "Distribution of Concentration Values for each Pollutant Type")

The next plot we created shows the pollution concentration with respect to time. This plot reveals that while NO2 and SO2 have been decreasing in concentration in recent years, that CO and O3 have remained relatively stable.

ppm_point =
  modified_pollution %>% 
  pivot_longer(
    no2:co,
    names_to = "pollutant_type",
    values_to = "concentration"
  ) %>% 
  filter(pollutant_type %in% c("o3", "co")) %>%
  ggplot(aes(x = year,y = concentration, group = pollutant_type, color = pollutant_type)) +
  geom_point() +
  geom_line() +
  scale_color_viridis(discrete = TRUE) +
  theme_bw() +
  labs(
    title = "",
    x = "Year",
    y = "Pollutant Concentration(ppm)"
  ) 

ppb_point =
  modified_pollution %>% 
  pivot_longer(
    no2:co,
    names_to = "pollutant_type",
    values_to = "concentration"
  ) %>% 
  filter(pollutant_type %in% c("no2", "so2")) %>%
  ggplot(aes(x = year,y = concentration, group = pollutant_type, color = pollutant_type)) +
  geom_point() +
  geom_line() +
  scale_color_viridis(discrete = TRUE) +
  theme_bw() +
  labs(
    title = "",
    x = "Year",
    y = "Pollutant Concentration(ppb)"
  ) 

ppb_point / ppm_point + plot_annotation(title = 
                                        "Trends in Pollutant Concentration in the US from 2000-2016")


Exploring Lung Cancer Data

 

Exploring Incidence of Lung Cancer

 

Here, an interactive map of the US that shows incidence rate of lung/bronchus cancer by state is at the bottom of this page, and gives a visualization of regional incidence trends. From this, it can be seen that southern states and northern Midwest states have the highest incidence rates of lung and bronchus cancers. Kentucky had the highest incidence of lung and bronchus cancer in 2013-2017 and Utah had the lowest incidence rate. We will further explore the relationship between pollution concentration and lung cancer incidence in our regression analysis.

g = list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('white')
)

inc_state =
  read_excel("data/IncRate.xlsx", sheet = "State", 
             skip = 6) %>%
  janitor::clean_names() %>% 
  separate(
    col = breast_both_sexes_combined,
    into = c("breast_total", "female_breast_only"),
    sep = "-"
  ) %>% 
  mutate(
    breast_male = if_else(breast_male == "n/a", "0", breast_male),
    cervix_male = if_else(cervix_male == "n/a", "0", cervix_male),
    colon_excluding_rectum_both_sexes_combined = 
      if_else(colon_excluding_rectum_both_sexes_combined == "n/a", "0",
              colon_excluding_rectum_both_sexes_combined),
    colon_excluding_rectum_female = if_else(colon_excluding_rectum_female == "n/a", "0",
                                            colon_excluding_rectum_female),
    colon_excluding_rectum_male = if_else(colon_excluding_rectum_male == "n/a", "0",
                                          colon_excluding_rectum_male)) %>%
  filter(state != "Puerto Rico")

inc_state %>%
  select(state, lung_and_bronchus_both_sexes_combined) %>%
  mutate(code = state.abb[match(state, state.name)],
         lung_and_bronchus_both_sexes_combined = as.numeric(lung_and_bronchus_both_sexes_combined)) %>%
  filter(code != is.na(code)) %>%
  relocate(state, code) %>%
  plot_geo(., locationmode = 'USA-states') %>% 
  add_trace(
    z = ~lung_and_bronchus_both_sexes_combined, text = ~state, locations = ~code,
    color = ~lung_and_bronchus_both_sexes_combined, colors = 'inferno'
  ) %>% 
  colorbar(title = "Incidence Rate per 100,000") %>% 
  layout(
    title = "Distribution of Lung/Bronchus Cancer Incidence Rate by State in 2013-2017",
    geo = g
  )

Note. The raw data for incidence rate of cancer by cancer type did not include data point for Nevada in the combined national rates. Original data recommends referring to publications and sources page on American Cancer Society website

 

Exploring Mortality of Lung Cancer

 

Similarly, the interactive map of the mortality rate from lung and bronchus cancer in 2013-2017 shows comparable distribution across the states as the incidence rate.

g = list(
  scope = 'usa',
  projection = list(type = 'albers usa'),
  showlakes = TRUE,
  lakecolor = toRGB('white')
)

death_state =
  read_excel("data/DeathRate.xlsx", sheet = "State", 
             skip = 6) %>%
  janitor::clean_names() %>% 
  separate(
    col = breast_both_sexes_combined,
    into = c("breast_total", "female_breast_only"),
    sep = "-"
  ) %>% 
  select(-female_breast_only) %>%
  mutate(
    breast_male = if_else(breast_male == "n/a", "0", breast_male),
    cervix_male = if_else(cervix_male == "n/a", "0", cervix_male),
  ) %>%
  mutate_at(vars(-("state")), as.numeric) %>%
  filter(state != "Puerto Rico")


death_state %>%
  select(state, lung_and_bronchus_both_sexes_combined) %>%
  mutate(code = state.abb[match(state, state.name)],
         lung_and_bronchus_both_sexes_combined = as.numeric(lung_and_bronchus_both_sexes_combined)) %>%
  filter(code != is.na(code)) %>%
  relocate(state, code) %>%
  plot_geo(., locationmode = 'USA-states') %>% 
  add_trace(
    z = ~lung_and_bronchus_both_sexes_combined, text = ~state, locations = ~code,
    color = ~lung_and_bronchus_both_sexes_combined, colors = 'inferno'
  ) %>% 
  colorbar(title = "Mortality Rate per 100,000") %>% 
  layout(
    title = "Distribution of Lung/Bronchus Cancer Mortality Rate by State in 2013-2017",
    geo = g
  )

Exploring relationship between trend in death rate from Lung Cancer to Pollution Concentration

 

After analyzing pollutants and lung cancer mortality/incidence on their own, we combined the pollution data with our data on cancer death and incidence in order to draw conclusions on the correlative effects pollutants have on cancer.

The next series of plots compares the pollutant concentration to the lung cancer death rate, stratifying by gender. We see moderately strong, positive linear correlations between Carbon Monoxide(co), Sulfur Dioxide(so2), and nitrous dioxide(no2) and lung cancer death rate among males. Whereas for females, we only see a moderate linear correlation between so2 and carbon monoxide concentrations and lung cancer death rate.

merged_death_pollution =
merge(death_time, modified_pollution, by = "year") 


pollutant_variables =
  merged_death_pollution %>% 
  select(no2,o3,so2,co) %>% 
  colnames() %>% 
  as.vector()

Create a for loop to plot Pollutant concentration vs. Lung Cancer Death Rate

Male

pollutant_variables =
  pollutant_variables %>% 
  as.list()


for (i in pollutant_variables) {
  plot =
  ggplot(merged_death_pollution, aes_string(i,  "lung_and_bronchus_male")) + 
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  labs(title = "Pollutant Concentration vs. Lung Cancer Death Rate(Male)", 
       y = "Lung Cancer Death Rate(per 100,000)") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
  
  print(plot)
  
}

Female

for (i in pollutant_variables) {
  plot =
  ggplot(merged_death_pollution, aes_string(i,  "lung_and_bronchus_female")) + 
  geom_point() +
  geom_smooth(method = lm, se = FALSE) +
  labs(title = "Pollutant Concentration vs. Lung Cancer Death Rate(Female)", 
       y = "Lung Cancer Death Rate(per 100,000)") +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
  
  print(plot)
  
}