Learning objectives

  1. Simulate Poisson-distributed data with a relevant covariate
  2. Fit a Poisson log-linear GLM
  3. Create and interpret diagnostic plots for a log-linear GLM
  4. Use analysis of deviance to compare two log-linear GLMs
  5. Practice recoding and creating tables and plots

Exercises

  1. Simulate count data from a Poisson distribution (for example number of hospital visits by persons over 70 in a 3-year period), where:
    1. 10,000 persons annotated with “race” as “white” or “non-white”
    2. “white” persons have an average of 3.5 hospital visits during this time period
    3. “non-white” persons have an average of 3 hospital visits
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
##  dplyr     1.1.3      readr     2.1.4
##  forcats   1.0.0      stringr   1.5.0
##  ggplot2   3.4.4      tibble    3.2.1
##  lubridate 1.9.3      tidyr     1.3.0
##  purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
##  dplyr::filter() masks stats::filter()
##  dplyr::lag()    masks stats::lag()
##  Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
set.seed(1)
N <- 10000
simdat <- data.frame(race = sample(c("white", "non-white"), N, replace = TRUE)) %>%
  mutate(race = factor(race, levels = c("white", "non-white"))) %>%
  mutate(y = rpois(N, lambda = ifelse(race == "white", 3.5, 3.0)))
  1. Fit a log-linear Poisson model of count outcomes with “race” as the predictor. Note, in this context I tend to use the terms “predictor” and “covariate” interchangeably, to mean any variable used as a predictor in the regression model.
fit <- glm(y ~ race, data = simdat, family = poisson(link = "log"))
summary(fit)
## 
## Call:
## glm(formula = y ~ race, family = poisson(link = "log"), data = simdat)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    1.254710   0.007564  165.88   <2e-16 ***
## racenon-white -0.161428   0.011137  -14.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 11041  on 9999  degrees of freedom
## Residual deviance: 10831  on 9998  degrees of freedom
## AIC: 39419
## 
## Number of Fisher Scoring iterations: 5
  1. Use a chi-square test on deviance residuals to test null hypothesis of no relationship between mean hospital visits and race.
  • The difference in total deviance between two nested models is \(\chi^2\) distributed under \(H_0\) that the more complex model is no better at explaining the response.
    • The difference in deviance residuals is (11041 - 10831) = 210, with a difference of 1 degrees of freedom.

The critical threshold for rejection at p=0.05 is:

qchisq(0.95, df=1)
## [1] 3.841459

So we reject \(H_0\)

BEWARE OF MISSING DATA: THIS IS SAFER

fit0 <- glm(y ~ 1, data = simdat, family = poisson(link = "log"))
anova(fit0, fit, test = "LRT")
## Analysis of Deviance Table
## 
## Model 1: y ~ 1
## Model 2: y ~ race
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1      9999      11042                          
## 2      9998      10831  1   210.76 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  1. Create and discuss standard fit diagnostics plots

  1. Example: Risky Drug Use Behavior
  • Download the “needle_sharing” dataset (see Vittinghoff 8.3.1)
  • Outcome is # times the drug user shared a syringe in the past month (shared_syr)
  • Predictors: sex, ethn, homeless
library(readxl)
needledat <- read_excel("needle_sharing.xlsx")
summary(needledat$shared_syr)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   2.976   0.000  60.000       5
var(needledat$shared_syr, na.rm=TRUE)
## [1] 106.5978

Some recoding:

suppressPackageStartupMessages(library(dplyr))
needledat_cleaned <-
  mutate(needledat,
    homeless = factor(homeless, levels = 0:1, labels = c("No", "Yes")),
    sex = factor(sex, levels = c("M", "F"), labels = c("Male", "Female")),
    ethnicity = factor(ethn)
  ) %>%
  select(all_of(c("shared_syr", "ethnicity", "sex", "homeless")))
  1. Create a table of the risky drug use behavior dataset
## 
## Attaching package: 'table1'
## The following objects are masked from 'package:base':
## 
##     units, units<-
table1(~ ., data = needledat_cleaned)
Overall
(N=128)
shared_syr
Mean (SD) 2.98 (10.3)
Median [Min, Max] 0 [0, 60.0]
Missing 5 (3.9%)
ethnicity
AA 36 (28.1%)
Asian 1 (0.8%)
Filipino 2 (1.6%)
Hispanic 10 (7.8%)
Indian 1 (0.8%)
Indian & White 1 (0.8%)
White 76 (59.4%)
White & Hispa 1 (0.8%)
sex
Male 97 (75.8%)
Female 30 (23.4%)
Missing 1 (0.8%)
homeless
No 63 (49.2%)
Yes 61 (47.7%)
Missing 4 (3.1%)
  1. Plots of Risky Drug Use Behavior
  1. Create a histogram number of syringe uses

  2. Create a scatter plot of number of syringe uses versus rank of number of syringe uses

library(ggplot2)
ggplot(needledat, aes(shared_syr)) +
  geom_histogram() +
  labs(title = "Counts of Syringe Sharing Incidents Per Person") +
  xlab("Number of Incidents") +
  ylab("Number of people with that count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5 rows containing non-finite values (`stat_bin()`).

library(dplyr)
mutate(needledat, rnk = rank(shared_syr, ties.method = "first")) %>%
  ggplot(aes(x = rnk, y = shared_syr)) +
  geom_point() +
  labs(title = "Count vs Rank Count of Syringe Sharing Incidents") +
  xlab("rank of count") +
  ylab("count of syringe sharing")
## Warning: Removed 5 rows containing missing values (`geom_point()`).

  • There are a lot of zeros - Poisson model is not a good fit
  1. Fit a Poisson model to the risky drug use behavior dataset anyways

Even though we know it is a bad fit

fit.pois <- glm(shared_syr ~ sex + ethn + homeless,
                data = needledat,
                family = poisson(link = "log"))
summary(fit.pois)
## 
## Call:
## glm(formula = shared_syr ~ sex + ethn + homeless, family = poisson(link = "log"), 
##     data = needledat)
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          0.72332    0.14462   5.002 5.69e-07 ***
## sexM                -0.92480    0.12133  -7.622 2.50e-14 ***
## sexTrans           -15.08655  773.78384  -0.019   0.9844    
## ethnFilipino       -14.52887  510.68253  -0.028   0.9773    
## ethnHispanic         1.46454    0.16004   9.151  < 2e-16 ***
## ethnIndian         -14.10111  773.78385  -0.018   0.9855    
## ethnIndian & White -15.02591  773.78384  -0.019   0.9845    
## ethnWhite            0.06064    0.13348   0.454   0.6496    
## ethnWhite & Hispa    0.86195    0.39872   2.162   0.0306 *  
## homeless             1.28543    0.12664  10.150  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 1621.9  on 120  degrees of freedom
## Residual deviance: 1364.8  on 111  degrees of freedom
##   (7 observations deleted due to missingness)
## AIC: 1483.8
## 
## Number of Fisher Scoring iterations: 12
  1. Create and discuss Poisson model diagnostic plots
par(mfrow = c(2, 2))
plot(fit.pois)
## Warning: not plotting observations with leverage one:
##   17, 38, 72, 86