14 Solutions || Medical Data

suppressPackageStartupMessages(library(ggplot2)) 
suppressPackageStartupMessages(library(medicaldata))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidyr))

Use the covid_testing data set and everything you’ve learned so far to answer the following questions:

14.1 Clinics included

How many clinics participated in the study, and how many valid tests were performed on each one? Did the testing trend vary over time?

covid<-covid_testing

14.1.1 Number of clinics

clinics<- covid %>%
          dplyr::select(subject_id,clinic_name,result,pan_day) %>%
          dplyr::distinct()
length(unique(clinics$clinic_name))
## [1] 88

14.1.2 Number of valid tests

clinics %>%
  dplyr::filter(result!="invalid") %>%
  dplyr::group_by(clinic_name) %>%
  dplyr::summarize(n_test = length(clinic_name)) %>%
  dplyr::arrange(desc(n_test))
## # A tibble: 88 × 2
##    clinic_name        n_test
##    <chr>               <int>
##  1 clinical lab         7397
##  2 emergency dept       3353
##  3 oncology day hosp     509
##  4 nicu                  285
##  5 laboratory            262
##  6 picu                  258
##  7 care ntwk             231
##  8 line clinical lab-    218
##  9 inpatient ward a      207
## 10 radiation oncology    161
## # ℹ 78 more rows

14.1.3 Testing trend over time

covid %>%
  filter(pan_day<=100) %>%
  group_by(pan_day) %>%
  summarize(n=length(result)) %>%
  ggplot(.,aes(x=pan_day,y=n))+
    geom_point()+
    geom_line()+
    ylab("Number of tests per day")+
    xlab("Pandemic day")

14.2 Number of positive tests

How many patients tested positive vs negative in the first 100 days of the pandemic? Do you notice any difference with the age of the patients? Hint: You can make two age groups and calculate the percentage each age group in positive vs negative tests.

14.2.1 Number of positive tests in the first 100 days

covid %>%
  filter(result!="invalid" & pan_day<=100) %>%
  group_by(result) %>%
  summarize(n=length(subject_id))
## # A tibble: 2 × 2
##   result       n
##   <chr>    <int>
## 1 negative 13225
## 2 positive   780

14.2.2 Tests by age group

tsts_age<-covid %>%
            filter(result!="invalid" & pan_day<=100) %>%
            mutate(age_group=ifelse(age<=21,"children","adults")) %>%
            group_by(age_group,result) %>%
            summarize(n=length(subject_id)) %>%
            mutate(percent_total=n/sum(n)*100)
## `summarise()` has grouped output by 'age_group'. You can override using the `.groups` argument.
tsts_age
## # A tibble: 4 × 4
## # Groups:   age_group [2]
##   age_group result       n percent_total
##   <chr>     <chr>    <int>         <dbl>
## 1 adults    negative  2627         92.1 
## 2 adults    positive   226          7.92
## 3 children  negative 10598         95.0 
## 4 children  positive   554          4.97

14.3 Processing times

Look at the specimen processing time to receipt, did the sample processing times improve over the first 100 days of the pandemic? Plot the median processing times of each day over the course of the pandemic and then compare the summary statistics of the first 50 vs the last 50 days

covid %>%
  group_by(pan_day) %>%
  dplyr::summarise(median_col_rec_tat=median(col_rec_tat)) %>%
  ggplot(.,aes(x=pan_day,y=median_col_rec_tat)) +
  geom_point()+
  geom_line()

covid %>%
  mutate(pan_day_group=ifelse(pan_day<50,"first_50","last_50")) %>%
  group_by(pan_day_group) %>%
  dplyr::summarise(mean_col_rec_tat=mean(col_rec_tat),
                   median_col_rec_tat=median(col_rec_tat),
                   min_col_rec_tat=min(col_rec_tat),
                   max_col_rec_tat=max(col_rec_tat))
## # A tibble: 2 × 5
##   pan_day_group mean_col_rec_tat median_col_rec_tat min_col_rec_tat max_col_rec_tat
##   <chr>                    <dbl>              <dbl>           <dbl>           <dbl>
## 1 first_50                  3.26                1.9               0            865.
## 2 last_50                   9.41                1.9               0          61370.

14.4 Bonus - Viral load

Higher viral loads are detected in less PCR cycles.

What can you observe about the viral load of positive vs negative samples.

Do you notice anything differences in viral load across ages in the positive samples?

Hint: Also split the data into two age groups and try using geom_boxplot()

ggplot(covid,aes(y=ct_result,x=result,color=result))+
  geom_boxplot()

covid %>%
  mutate(age_group=ifelse(age<=21,"children","adults")) %>%
  ggplot(.,aes(y=ct_result,x=result,color=age_group))+
    geom_boxplot()