vignettes/session_lecture.Rmd
\[ E[y|x] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_p x_p \]
\(y_i = E[y|x] + \epsilon_i\)
\(y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \epsilon_i\)
Assumption: \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)\)
The model: \(y_i = E[y|x] + \epsilon_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \epsilon_i\)
Random component of \(y_i\) is normally distributed: \(\epsilon_i \stackrel{iid}{\sim} N(0, \sigma_\epsilon^2)\)
Systematic component (linear predictor): \(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi}\)
Link function here is the identity link: \(g(E(y | x)) = E(y | x)\). We are modeling the mean directly, no transformation.
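As a quick illustration (a sketch using simulated data, not the lecture dataset), a Gaussian GLM with the identity link reproduces the ordinary least squares fit exactly:

```r
set.seed(1)                          # hypothetical simulated example
n <- 100
x <- rnorm(n)
y <- 2 + 3 * x + rnorm(n)            # true beta0 = 2, beta1 = 3

fit_lm  <- lm(y ~ x)                              # ordinary least squares
fit_glm <- glm(y ~ x, family = gaussian("identity"))  # identity-link GLM

all.equal(coef(fit_lm), coef(fit_glm))  # same coefficients
```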
Random component: \(y_i\) follows a Binomial distribution (outcome is a binary variable)
Systematic component: linear predictor \[ \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \]
Link function: logit (log of the odds that the event occurs)
\[ g(P(x)) = \operatorname{logit}(P(x)) = \log\left( \frac{P(x)}{1-P(x)} \right) \]
\[ P(x) = g^{-1}\left( \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} \right) \]
invLogit <- function(x) 1 / (1 + exp(-x))  # inverse logit: maps linear predictor to probability
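A quick sanity check (a minimal sketch; the `logit()` helper is defined here for the check) that `invLogit()` inverts the logit:

```r
invLogit <- function(x) 1 / (1 + exp(-x))
logit    <- function(p) log(p / (1 - p))  # helper defined here for the round trip

p <- c(0.1, 0.5, 0.9)
invLogit(logit(p))  # recovers p
```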
| Age (years) | <25 (N=4) | 25-29 (N=4) | 30-39 (N=4) | 40-49 (N=4) | Overall (N=16) |
|---|---|---|---|---|---|
| education | | | | | |
| high | 2 (50.0%) | 2 (50.0%) | 2 (50.0%) | 2 (50.0%) | 8 (50.0%) |
| low | 2 (50.0%) | 2 (50.0%) | 2 (50.0%) | 2 (50.0%) | 8 (50.0%) |
| wantsMore | | | | | |
| no | 2 (50.0%) | 2 (50.0%) | 2 (50.0%) | 2 (50.0%) | 8 (50.0%) |
| yes | 2 (50.0%) | 2 (50.0%) | 2 (50.0%) | 2 (50.0%) | 8 (50.0%) |
| percentusing | | | | | |
| Mean (SD) | 18.8 (7.64) | 27.1 (6.53) | 38.8 (15.6) | 46.9 (23.8) | 32.9 (17.5) |
| Median [Min, Max] | 18.2 [10.2, 28.6] | 27.6 [18.9, 34.5] | 39.5 [22.8, 53.4] | 50.5 [14.6, 72.1] | 28.3 [10.2, 72.1] |
Source: http://data.princeton.edu/wws509/datasets/#cuse. Note: this table summarizes rows of the source data, not numbers of participants. See the lab for how to make a table that summarizes the participants.
fit1 <- glm(cbind(using, notUsing) ~ age + education + wantsMore,
data=cuse, family=binomial("logit"))
summary(fit1)
##
## Call:
## glm(formula = cbind(using, notUsing) ~ age + education + wantsMore,
## family = binomial("logit"), data = cuse)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8082 0.1590 -5.083 3.71e-07 ***
## age25-29 0.3894 0.1759 2.214 0.02681 *
## age30-39 0.9086 0.1646 5.519 3.40e-08 ***
## age40-49 1.1892 0.2144 5.546 2.92e-08 ***
## educationlow -0.3250 0.1240 -2.620 0.00879 **
## wantsMoreyes -0.8330 0.1175 -7.091 1.33e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 165.772 on 15 degrees of freedom
## Residual deviance: 29.917 on 10 degrees of freedom
## AIC: 113.43
##
## Number of Fisher Scoring iterations: 4
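To put the printed coefficients back on the probability scale, apply the inverse logit (a sketch using the estimates printed above; the baseline group here is age <25, high education, wantsMore = no):

```r
invLogit <- function(x) 1 / (1 + exp(-x))

# Fitted probability of contraceptive use for the baseline group
invLogit(-0.8082)  # about 0.31

# Odds ratio for wantsMore = yes vs. no, holding age and education fixed
exp(-0.8330)       # about 0.43: odds of use are roughly 57% lower
```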
\[ r_i = \frac{y_i - \hat y_i}{ \sqrt{ Var(\hat y_i) }} \]
Summing the squared Pearson residuals produces the Pearson Chi-squared statistic:
\[ \chi ^2 = \sum_i r_i^2 \]
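The formula above can be checked against R's built-in Pearson residuals (a sketch with simulated grouped binomial data, not the cuse dataset):

```r
set.seed(2)                                   # hypothetical simulated data
k <- 16
x <- rnorm(k)
size <- rep(50, k)                            # trials per group
p <- 1 / (1 + exp(-(0.5 + x)))
using <- rbinom(k, size, p)                   # successes per group

fit <- glm(cbind(using, size - using) ~ x, family = binomial)
phat <- fitted(fit)                           # fitted probabilities

# Pearson residuals by hand: (observed - expected) / sd
r_manual <- (using - size * phat) / sqrt(size * phat * (1 - phat))
r_pkg <- residuals(fit, type = "pearson")
all.equal(unname(r_manual), unname(r_pkg))    # agree

sum(r_pkg^2)  # Pearson chi-squared statistic
</imports>
```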
\[ d_i = s_i \sqrt{ -2 ( y_i \log \hat y_i + (1-y_i) \log (1 - \hat y_i) ) } \]
Where \(s_i = 1\) if \(y_i = 1\) and \(s_i = -1\) if \(y_i = 0\).
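This definition can be verified against `residuals(fit, type = "deviance")` for a binary-outcome fit (a sketch with simulated data):

```r
set.seed(3)                                   # hypothetical simulated data
x <- rnorm(200)
y <- rbinom(200, 1, 1 / (1 + exp(-x)))        # binary outcome

fit <- glm(y ~ x, family = binomial)
phat <- fitted(fit)

# Deviance residuals by hand: s_i = +1 if y_i = 1, -1 if y_i = 0
s <- ifelse(y == 1, 1, -1)
d_manual <- s * sqrt(-2 * (y * log(phat) + (1 - y) * log(1 - phat)))

all.equal(unname(d_manual), unname(residuals(fit, type = "deviance")))
```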
\(\Delta \textrm{D} = -2 \times \Delta (\textrm{log likelihood})\)
fit0 <- glm(cbind(using, notUsing) ~ -1, data=cuse,
family=binomial("logit"))
anova(fit0, fit1, test="LRT")
## Analysis of Deviance Table
##
## Model 1: cbind(using, notUsing) ~ -1
## Model 2: cbind(using, notUsing) ~ age + education + wantsMore
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 16 389.85
## 2 10 29.92 6 359.94 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
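The p-value in the table above can be reproduced directly from the printed deviances and degrees of freedom:

```r
delta_dev <- 389.85 - 29.92  # change in residual deviance between models
delta_df  <- 16 - 10         # change in residual degrees of freedom
pchisq(delta_dev, df = delta_df, lower.tail = FALSE)  # < 2.2e-16
```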
Note: Wald confidence intervals for coefficients can have poor coverage in some settings, even with relatively large samples; profile-likelihood intervals are often preferred.
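A sketch of the comparison (simulated data, not the cuse dataset): `confint.default()` gives Wald intervals, while `confint()` on a glm fit gives profile-likelihood intervals, and the two can differ noticeably in smaller samples:

```r
library(MASS)  # provides the profile-likelihood confint() method for glm fits

set.seed(4)                                   # hypothetical simulated data
n <- 40
x <- rnorm(n)
y <- rbinom(n, 1, 1 / (1 + exp(-(0.5 + x))))

fit <- glm(y ~ x, family = binomial)

confint.default(fit)  # Wald: estimate +/- 1.96 * SE
confint(fit)          # profile likelihood
```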