Poisson log-linear GLM
Towards a reasonable model
- A multiplicative model will allow us to make inference on ratios of mean emergency room usage
- Modeling the log of the mean emergency room usage ensures positive means, and does not suffer from the $\log(0)$ problem (individual counts may be zero, but the mean is positive)
- The random component of the GLM, or residuals (assumed normal, $\epsilon_i \sim N(0, \sigma^2)$, for linear regression), may still not be normal, but we can choose from other distributions
Proposed model without time
$$\log(E[y_i]) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}$$
Or equivalently:
$$E[y_i] = e^{\beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}}$$
where $E[y_i]$ is the expected number of emergency room visits for patient $i$.
- Important note: Modeling $\log(E[y_i])$ is not equivalent to modeling $E[\log(y_i)]$
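A small numeric illustration of the note above, using made-up counts:
y <- c(0, 1, 2, 4, 3)  # made-up visit counts for five patients
log(mean(y))           # log of the mean count: finite (log(2) here)
mean(log(y))           # mean of the log counts: -Inf, because one count is zero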
Accounting for follow-up time
Instead, model the mean count per unit time, where $t_i$ is the follow-up time for patient $i$:
$$\log\left(\frac{E[y_i]}{t_i}\right) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}$$
Or equivalently:
$$\log(E[y_i]) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \log(t_i)$$
- $\log(t_i)$ is not a covariate; it is called an offset
The Poisson distribution
- Count data are often modeled as Poisson distributed:
- mean $\lambda$ is greater than 0
- variance is also $\lambda$
- Probability density: $P(Y = y) = \frac{e^{-\lambda} \lambda^y}{y!}, \quad y = 0, 1, 2, \ldots$
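A quick simulation sketch of the mean-variance relationship (lambda = 4 is an arbitrary choice):
set.seed(1)
y <- rpois(1e5, lambda = 4)  # simulate Poisson counts
mean(y)                      # close to 4
var(y)                       # also close to 4
dpois(2, lambda = 4)         # P(Y = 2) = exp(-4) * 4^2 / 2!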
When the Poisson distribution works
- Individual events are low-probability (small p), but many
opportunities (large n)
- e.g. # 911 calls per day
- e.g. # emergency room visits
- Approximates the binomial distribution when n is large and p is small
- e.g. $n \geq 20$ and $p \leq 0.05$, or $n \geq 100$ and $np \leq 10$
- When the mean of the counts is approximately equal to their variance (no overdispersion)
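A minimal comparison of the two distributions in R, using illustrative values n = 1000 and p = 0.004 (so np = 4):
n <- 1000; p <- 0.004                    # many opportunities, each with low probability
k <- 0:8
round(dbinom(k, size = n, prob = p), 4)  # binomial probabilities
round(dpois(k, lambda = n * p), 4)       # Poisson approximation with lambda = n * p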
GLM with log-linear link and Poisson error model
- Model the number of counts as Poisson-distributed, $y_i \sim \mathrm{Poisson}(\lambda_i t_i)$, so that the expected number of counts per unit time is $E[y_i]/t_i = \lambda_i$
Recalling the log-linear model systematic component:
$$\log(\lambda_i) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}$$
GLM with log-linear link and Poisson error model (cont’d)
Then the systematic part of the GLM is:
$$\log\left(\frac{E[y_i]}{t_i}\right) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi}$$
Or alternatively:
$$\log(E[y_i]) = \beta_0 + \beta_1 x_{1i} + \cdots + \beta_p x_{pi} + \log(t_i)$$
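A sketch of how such a model might be fit in R, assuming a hypothetical data frame er with columns visits (visit count), futime (follow-up time), race, treat, alcohol, and drug:
# hypothetical data frame and column names, for illustration only
fit <- glm(visits ~ race + treat + alcohol + drug + offset(log(futime)),
           family = poisson(link = "log"), data = er)
summary(fit)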
Interpretation of coefficients
- Suppose that $\hat\beta_{\text{race}}$ is the estimated coefficient of $x_{\text{race}}$ in the fitted model, where $x_{\text{race}} = 1$ for white and $x_{\text{race}} = 0$ for non-white.
- The mean rate of emergency room visits per unit time for white relative to non-white, all else held equal, is estimated to be:
$$\frac{\hat\lambda_{\text{white}}}{\hat\lambda_{\text{non-white}}} = e^{\hat\beta_{\text{race}}}$$
Interpretation of coefficients (cont’d)
- If $e^{\hat\beta_{\text{race}}} = 1.65$ in the fitted model (non-whites as the reference group, coded as above):
- after adjustment for treatment group, alcohol and drug usage, whites
tend to use the emergency room at a rate 1.65 times higher than
non-whites.
- equivalently, the average rate of usage for whites is 65% higher
than that for non-whites
- Multiplicative rules apply for other coefficients as well, because
they are exponentiated to estimate the mean rate.
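Continuing the hypothetical fit sketched above, the estimated rate ratios are obtained by exponentiating the coefficients:
exp(coef(fit))     # rate ratios, e.g. white vs. non-white for the race coefficient
exp(confint(fit))  # confidence intervals on the rate-ratio scale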
Multi-collinearity
What is Multicollinearity?
- Multicollinearity exists when two or more of the independent variables in a regression are moderately or highly correlated.
- High correlation among continuous predictors or high concordance
among categorical predictors
- Impacts the ability to estimate regression coefficients
- larger standard errors for regression coefficients
- i.e., coefficients are unstable over repeated sampling
- exact collinearity produces infinite standard errors on
coefficients
- Can also result in unstable (high variance) prediction models
Identifying multicollinearity
- Pairwise correlations of the data or of the model matrix (the latter works with categorical variables)
- Heat maps
- Variance Inflation Factor (VIF) of regression coefficients
Example: US Judge Ratings dataset
See ?USJudgeRatings for the dataset and ?pairs for the plot code:
[Figure: pairwise scatterplot of continuous variables in the US Judge Ratings dataset]
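A minimal sketch of the plot and correlation check described above (the subset of columns shown is an arbitrary choice):
data(USJudgeRatings)
pairs(USJudgeRatings[, 1:6])          # pairwise scatterplots of the first six ratings
round(cor(USJudgeRatings[, 1:6]), 2)  # most of these ratings are very highly correlated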
Example: iris dataset
One categorical variable (Species), so use the model matrix and make a simple heatmap of its correlations (sketched below).
Note: multicollinearity exists between multiple predictors, not
between predictor and outcome
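A minimal base-R sketch of that heatmap (plotting choices are assumptions, not from the slide):
mm <- model.matrix(Sepal.Width ~ ., data = iris)[, -1]  # model matrix without the intercept
colnames(mm)                   # Species is expanded into dummy variables
heatmap(cor(mm), symm = TRUE)  # heatmap of pairwise correlations among the predictors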
Example: iris dataset
Confirm what we saw in the iris heatmap using the Variance Inflation Factor of a linear regression model:
fit <- lm(Sepal.Width ~ ., data = iris)
car::vif(fit)
##                   GVIF Df GVIF^(1/(2*Df))
## Sepal.Length  6.124653  1        2.474804
## Petal.Length 45.132550  1        6.718076
## Petal.Width  18.373804  1        4.286468
## Species      32.701564  2        2.391344
Approaches for dealing with multicollinearity
Options:
- Select a representative variable
- Average variables
- Principal Component Analysis or other dimension reduction (a sketch follows below)
- For prediction modeling, special methods like penalized regression,
Support Vector Machines, …
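A minimal sketch of the dimension-reduction option, applied to the correlated iris predictors (variable choices are illustrative):
# summarize the three correlated continuous predictors with principal components
pcs <- prcomp(iris[, c("Sepal.Length", "Petal.Length", "Petal.Width")], scale. = TRUE)
summary(pcs)              # PC1 captures most of the shared variation
iris$PC1 <- pcs$x[, 1]    # PC1 could replace the three correlated predictors
fit_pc <- lm(Sepal.Width ~ PC1 + Species, data = iris)
car::vif(fit_pc)          # recompute VIFs for the reduced model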