Study of nesting horseshoe crabs (J. Brockman, Ethology, 1996). Each female crab in the study had a male crab attached to her in her nest. The study investigated factors that affect whether the female crab had any other males, called satellites, residing nearby her. Explanatory variables thought possibly to affect this included the female crab’s color, spine condition, weight, and carapace width. The response outcome for each female crab is her number of satellites.
First, read in the data set and convert color and spine to factors:
crab <- read.table("http://www.math.montana.edu/shancock/data/horseshoe.txt",header=T)
crab$colorF <- factor(crab$color, levels=c(1,2,3,4), labels=c("LM","M","DM","D"))
crab$spineF <- factor(crab$spine, levels=c(1,2,3), labels=c("Good","Soso","Poor"))
Our goal is to determine, of the four predictors available, what best predicts the number of satellites.
anova
),
forward/backward selection
(drop1
,add1
,step
), or best
subsets (bestglm
).Recall: Overdispersion was a problem with these count data. We can use Negative Binomial regression to address this issue.
mod1 <- glm.nb(satell ~ width, data = crab)
summary(mod1)
##
## Call:
## glm.nb(formula = satell ~ width, data = crab, init.theta = 0.90456808,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7798 -1.4110 -0.2502 0.4770 2.0177
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.05251 1.17143 -3.459 0.000541 ***
## width 0.19207 0.04406 4.360 1.3e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.9046) family taken to be 1)
##
## Null deviance: 213.05 on 172 degrees of freedom
## Residual deviance: 195.81 on 171 degrees of freedom
## AIC: 757.29
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.905
## Std. Err.: 0.161
##
## 2 x log-likelihood: -751.291
Or quasi-Poisson:
mod1.qpois <- glm(satell ~ width, family = quasipoisson, data = crab)
summary(mod1.qpois)
##
## Call:
## glm(formula = satell ~ width, family = quasipoisson, data = crab)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8526 -1.9884 -0.4933 1.0970 4.9221
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.30476 0.96729 -3.417 0.000793 ***
## width 0.16405 0.03562 4.606 7.99e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 3.182205)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 567.88 on 171 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 6
Compare to Poisson regression:
mod1.pois <- glm(satell ~ width, family = poisson, data = crab)
summary(mod1.pois)
##
## Call:
## glm(formula = satell ~ width, family = poisson, data = crab)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8526 -1.9884 -0.4933 1.0970 4.9221
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.30476 0.54224 -6.095 1.1e-09 ***
## width 0.16405 0.01997 8.216 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 567.88 on 171 degrees of freedom
## AIC: 927.18
##
## Number of Fisher Scoring iterations: 6
Prior to model fitting -> exploratory data analysis.
One approach - start with all four in the model, and try taking one out:
mod.full <- glm.nb(satell ~ colorF + spineF + width + weight, data = crab)
summary(mod.full)
##
## Call:
## glm.nb(formula = satell ~ colorF + spineF + width + weight, data = crab,
## init.theta = 0.9521901429, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8535 -1.3633 -0.3033 0.4685 2.2238
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.58662 1.83022 -0.867 0.386
## colorFM -0.31434 0.37457 -0.839 0.401
## colorFDM -0.57764 0.41921 -1.378 0.168
## colorFD -0.53423 0.46877 -1.140 0.254
## spineFSoso -0.21439 0.39927 -0.537 0.591
## spineFPoor 0.02570 0.24903 0.103 0.918
## width 0.07734 0.09053 0.854 0.393
## weight 0.37950 0.31432 1.207 0.227
##
## (Dispersion parameter for Negative Binomial(0.9522) family taken to be 1)
##
## Null deviance: 219.08 on 172 degrees of freedom
## Residual deviance: 196.73 on 165 degrees of freedom
## AIC: 764.9
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.952
## Std. Err.: 0.173
##
## 2 x log-likelihood: -746.905
Why all large p-values?
Model with width alone –> p-value for width = tiny
Do I need width compared to nothing? –> Yes!
Model with everything –> p-value for width = 0.393
Do I need width in addition to color, spine, weight? –> No
Try dropping each:
drop1(mod.full)
## Single term deletions
##
## Model:
## satell ~ colorF + spineF + width + weight
## Df Deviance AIC
## <none> 196.73 762.90
## colorF 3 199.22 759.40
## spineF 2 197.15 759.33
## width 1 197.43 761.61
## weight 1 198.18 762.35
Take out spineF:
mod2 <- update(mod.full, . ~ . - spineF)
drop1(mod2)
## Single term deletions
##
## Model:
## satell ~ colorF + width + weight
## Df Deviance AIC
## <none> 196.74 759.33
## colorF 3 199.26 755.84
## width 1 197.76 758.34
## weight 1 198.02 758.61
Take out colorF:
mod3 <- update(mod2, . ~ . - colorF)
drop1(mod3)
## Single term deletions
##
## Model:
## satell ~ width + weight
## Df Deviance AIC
## <none> 196.34 755.82
## width 1 197.51 754.99
## weight 1 197.82 755.30
Step-wise regression:
step(mod.full, direction = "backward")
## Start: AIC=762.9
## satell ~ colorF + spineF + width + weight
##
## Df Deviance AIC
## - spineF 2 197.15 759.33
## - colorF 3 199.22 759.40
## - width 1 197.43 761.61
## - weight 1 198.18 762.35
## <none> 196.73 762.90
##
## Step: AIC=759.33
## satell ~ colorF + width + weight
##
## Df Deviance AIC
## - colorF 3 199.26 755.84
## - width 1 197.76 758.34
## - weight 1 198.02 758.61
## <none> 196.74 759.33
##
## Step: AIC=755.82
## satell ~ width + weight
##
## Df Deviance AIC
## - width 1 197.51 754.99
## - weight 1 197.82 755.30
## <none> 196.34 755.82
##
## Step: AIC=754.99
## satell ~ weight
##
## Df Deviance AIC
## <none> 196.74 754.99
## - weight 1 214.42 770.68
##
## Call: glm.nb(formula = satell ~ weight, data = crab, init.theta = 0.9152623199,
## link = log)
##
## Coefficients:
## (Intercept) weight
## -0.6142 0.6637
##
## Degrees of Freedom: 172 Total (i.e. Null); 171 Residual
## Null Deviance: 214.4
## Residual Deviance: 196.7 AIC: 757
Step-wise regression going forward:
mod.null <- glm.nb(satell ~ 1, data = crab)
step(mod.null, satell ~ colorF + spineF + width + weight, direction = "forward")
## Start: AIC=769.41
## satell ~ 1
##
## Df Deviance AIC
## + weight 1 177.64 756.12
## + width 1 177.82 756.30
## <none> 192.93 769.41
## + colorF 3 187.93 770.41
## + spineF 2 190.50 770.98
##
## Step: AIC=754.99
## satell ~ weight
##
## Df Deviance AIC
## <none> 196.74 754.99
## + width 1 195.56 755.82
## + spineF 2 196.08 758.34
## + colorF 3 194.11 758.37
##
## Call: glm.nb(formula = satell ~ weight, data = crab, init.theta = 0.9152622973,
## link = log)
##
## Coefficients:
## (Intercept) weight
## -0.6142 0.6637
##
## Degrees of Freedom: 172 Total (i.e. Null); 171 Residual
## Null Deviance: 214.4
## Residual Deviance: 196.7 AIC: 757
Step-wise regression in both directions:
step(mod.full, direction = "both")
## Start: AIC=762.9
## satell ~ colorF + spineF + width + weight
##
## Df Deviance AIC
## - spineF 2 197.15 759.33
## - colorF 3 199.22 759.40
## - width 1 197.43 761.61
## - weight 1 198.18 762.35
## <none> 196.73 762.90
##
## Step: AIC=759.33
## satell ~ colorF + width + weight
##
## Df Deviance AIC
## - colorF 3 199.26 755.84
## - width 1 197.76 758.34
## - weight 1 198.02 758.61
## <none> 196.74 759.33
## + spineF 2 196.32 762.91
##
## Step: AIC=755.82
## satell ~ width + weight
##
## Df Deviance AIC
## - width 1 197.51 754.99
## - weight 1 197.82 755.30
## <none> 196.34 755.82
## + colorF 3 193.87 759.35
## + spineF 2 195.89 759.37
##
## Step: AIC=754.99
## satell ~ weight
##
## Df Deviance AIC
## <none> 196.74 754.99
## + width 1 195.56 755.82
## + spineF 2 196.08 758.34
## + colorF 3 194.11 758.37
## - weight 1 214.42 770.68
##
## Call: glm.nb(formula = satell ~ weight, data = crab, init.theta = 0.9152623199,
## link = log)
##
## Coefficients:
## (Intercept) weight
## -0.6142 0.6637
##
## Degrees of Freedom: 172 Total (i.e. Null); 171 Residual
## Null Deviance: 214.4
## Residual Deviance: 196.7 AIC: 757
According to AIC, the “best” model includes just weight. “Best” model:
mod.best <- glm.nb(satell ~ weight, data = crab)
summary(mod.best)
##
## Call:
## glm.nb(formula = satell ~ weight, data = crab, init.theta = 0.9152622972,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8063 -1.4363 -0.3072 0.5161 2.0708
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.6142 0.3957 -1.552 0.121
## weight 0.6637 0.1549 4.286 1.82e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(0.9153) family taken to be 1)
##
## Null deviance: 214.42 on 172 degrees of freedom
## Residual deviance: 196.74 on 171 degrees of freedom
## AIC: 756.99
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 0.915
## Std. Err.: 0.165
##
## 2 x log-likelihood: -750.993
exp(confint(mod.best))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.2361634 1.217564
## weight 1.4170155 2.696467
plot(satell ~ weight, data = crab,
xlab = "Weight (kg)", ylab = "Number of Satellites",
main = "Number of Satellites vs Weight")
curve(exp(-0.6142 + 0.6637*x), col = "red", add=T)
Fitted model: (\(x\) = weight) \[ \log(\hat{\mu}) = -0.6142 + 0.6637x \] \[ \hat{\mu} = \exp(-0.6142 + 0.6637x) \] We are 95% confident that an increase in one kg of weight is associated with an increase of between 41.7% and 169.6% in mean number of satellites in this population.
dev <- mod.best$deviance
dev
## [1] 196.7368
df <- mod.best$df.residual
df
## [1] 171
# p-value
pchisq(dev, df, lower.tail = FALSE)
## [1] 0.08643382
Let’s convert our response variable to an indicator variable of whether the female crab had satellites or not.
crab$satell_ind <- as.numeric(crab$satell > 0)
Note: Overdispersion is not a problem with binary data – why?
We are restricted to the values {0,1}, whereas counts can go to
infinity.
The bestglm
function will fit every combination of
predictors (all subsets), and return the one with the minimum value of
the information criteria you select.
See the help file for the bestglm
function – it needs
the design matrix (the actual values of the variables that show up in
the model equation - not categories), and the response variable in the
last column of the data set.
mod.full <- glm(satell_ind ~ colorF + spineF + width + weight, family = binomial, data = crab)
X <- model.matrix(mod.full)
head(X)
## (Intercept) colorFM colorFDM colorFD spineFSoso spineFPoor width weight
## 1 1 1 0 0 0 1 28.3 3.05
## 2 1 0 1 0 0 1 22.5 1.55
## 3 1 0 0 0 0 0 26.0 2.30
## 4 1 0 1 0 0 1 24.8 2.10
## 5 1 0 1 0 0 1 26.0 2.60
## 6 1 1 0 0 0 1 23.8 2.10
# Remove column of 1's (not a variable):
X <- X[,-1]
Xy <- as.data.frame(cbind(X,y = crab$satell_ind))
head(Xy)
## colorFM colorFDM colorFD spineFSoso spineFPoor width weight y
## 1 1 0 0 0 1 28.3 3.05 1
## 2 0 1 0 0 1 22.5 1.55 0
## 3 0 0 0 0 0 26.0 2.30 1
## 4 0 1 0 0 1 24.8 2.10 0
## 5 0 1 0 0 1 26.0 2.60 1
## 6 1 0 0 0 1 23.8 2.10 0
best.AIC <- bestglm(Xy, family = binomial, IC = "AIC")
## Morgan-Tatar search since family is non-gaussian.
best.AIC
## AIC
## BICq equivalent for q in (0.338329717201477, 0.904725300676076)
## Best Model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.6790256 2.6925051 -4.337606 1.440432e-05
## colorFD -1.3005121 0.5258610 -2.473110 1.339430e-02
## width 0.4782223 0.1041468 4.591812 4.394143e-06
best.BIC <- bestglm(Xy, family = binomial, IC = "BIC")
## Morgan-Tatar search since family is non-gaussian.
best.BIC
## BIC
## BICq equivalent for q in (0.338329717201477, 0.904725300676076)
## Best Model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -11.6790256 2.6925051 -4.337606 1.440432e-05
## colorFD -1.3005121 0.5258610 -2.473110 1.339430e-02
## width 0.4782223 0.1041468 4.591812 4.394143e-06
WARNING: Best subsets does not follow the rule that if you include
one indicator variable for a factor, you must include the rest. Thus, if
using the models given by best.AIC
or
best.BIC
, make sure you include all indicators for
colorF
!
mod.wgt <- glm(satell_ind ~ weight, family = binomial, data = crab)
obs <- crab$satell_ind
expected <- fitted(mod.wgt)
hoslem.test(obs, expected, g = 10)
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: obs, expected
## X-squared = 7.3643, df = 8, p-value = 0.4979