Study of nesting horseshoe crabs (J. Brockman, Ethology, 1996). Each female crab in the study had a male crab attached to her in her nest. The study investigated factors that affect whether the female crab had any other males, called satellites, residing nearby her. Explanatory variables thought possibly to affect this included the female crab’s color, spine condition, weight, and carapace width. The response outcome for each female crab is her number of satellites.
Variable descriptions:
color
: 1 - light medium, 2 - medium, 3 - dark medium, 4
- darkspine
: 1 - both good, 2 - one worn or broken, 3 - both
worn or brokenwidth
: carapace width in cmsatell
: number of satellitesweight
: weight in kgRead in the data and recode factors.
crab <- read.table("http://www.math.montana.edu/shancock/data/horseshoe.txt",header=T)
crab$colorF <- factor(crab$color, levels=c(1,2,3,4), labels=c("LM","M","DM","D"))
crab$spineF <- factor(crab$spine, levels=c(1,2,3), labels=c("Good","Soso","Poor"))
str(crab)
## 'data.frame': 173 obs. of 7 variables:
## $ color : int 2 3 1 3 3 2 1 3 2 3 ...
## $ spine : int 3 3 1 3 3 3 1 2 1 3 ...
## $ width : num 28.3 22.5 26 24.8 26 23.8 26.5 24.7 23.7 25.6 ...
## $ satell: int 8 0 9 0 4 0 0 0 0 0 ...
## $ weight: num 3.05 1.55 2.3 2.1 2.6 2.1 2.35 1.9 1.95 2.15 ...
## $ colorF: Factor w/ 4 levels "LM","M","DM",..: 2 3 1 3 3 2 1 3 2 3 ...
## $ spineF: Factor w/ 3 levels "Good","Soso",..: 3 3 1 3 3 3 1 2 1 3 ...
Summary statistics for each variable:
summary(crab)
## color spine width satell
## Min. :1.000 Min. :1.000 Min. :21.0 Min. : 0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:24.9 1st Qu.: 0.000
## Median :2.000 Median :3.000 Median :26.1 Median : 2.000
## Mean :2.439 Mean :2.486 Mean :26.3 Mean : 2.919
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.:27.7 3rd Qu.: 5.000
## Max. :4.000 Max. :3.000 Max. :33.5 Max. :15.000
## weight colorF spineF
## Min. :0.200 LM:12 Good: 37
## 1st Qu.:2.000 M :95 Soso: 15
## Median :2.350 DM:44 Poor:121
## Mean :2.427 D :22
## 3rd Qu.:2.850
## Max. :5.200
Plot all variables against one another:
plot(crab)
Boxplots for categorical predictors:
boxplot(satell ~ colorF, data=crab, xlab="Color", ylab="No. of Satellites")
boxplot(satell ~ spineF, data=crab, xlab="Spine Condition", ylab="No. of Satellites")
Based on the plots alone, what variables seem to be the best predictors of number of satellites?
cor(crab[,1:5])
## color spine width satell weight
## color 1.0000000 0.37850163 -0.2643863 -0.19078455 -0.2748001
## spine 0.3785016 1.00000000 -0.1218946 -0.08993242 -0.1707781
## width -0.2643863 -0.12189458 1.0000000 0.33989033 0.8560429
## satell -0.1907846 -0.08993242 0.3398903 1.00000000 0.3542486
## weight -0.2748001 -0.17077808 0.8560429 0.35424861 1.0000000
Note that weight and width are highly correlated. What happens if we fit them both in the model?
summary(glm(satell ~ width, family = poisson, data = crab))
##
## Call:
## glm(formula = satell ~ width, family = poisson, data = crab)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8526 -1.9884 -0.4933 1.0970 4.9221
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.30476 0.54224 -6.095 1.1e-09 ***
## width 0.16405 0.01997 8.216 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 567.88 on 171 degrees of freedom
## AIC: 927.18
##
## Number of Fisher Scoring iterations: 6
summary(glm(satell ~ weight, family = poisson, data = crab))
##
## Call:
## glm(formula = satell ~ weight, family = poisson, data = crab)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9182 -2.0169 -0.5926 1.0290 4.9755
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.36997 0.17811 -2.077 0.0378 *
## weight 0.56837 0.06496 8.750 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 564.35 on 171 degrees of freedom
## AIC: 923.65
##
## Number of Fisher Scoring iterations: 5
summary(glm(satell ~ width + weight, family = poisson, data = crab))
##
## Call:
## glm(formula = satell ~ width + weight, family = poisson, data = crab)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9194 -1.9390 -0.5447 1.0694 4.9683
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.69254 0.86372 -1.960 0.0500 .
## width 0.06972 0.04435 1.572 0.1159
## weight 0.35659 0.14906 2.392 0.0167 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 561.93 on 170 degrees of freedom
## AIC: 923.23
##
## Number of Fisher Scoring iterations: 6
width
crab.glm <- glm(satell ~ width, family = poisson(link = "log"), data = crab)
summary(crab.glm)
##
## Call:
## glm(formula = satell ~ width, family = poisson(link = "log"),
## data = crab)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8526 -1.9884 -0.4933 1.0970 4.9221
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.30476 0.54224 -6.095 1.1e-09 ***
## width 0.16405 0.01997 8.216 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 567.88 on 171 degrees of freedom
## AIC: 927.18
##
## Number of Fisher Scoring iterations: 6
How would you interpret each coefficient?
Fitted model: \(\mu\) = mean number of satellites among the population of female crabs represented by this sample \[ \log(\hat{\mu}) = -3.3048 + 0.16405x, \] where \(x\) is the width of the crab in cm.
OR \[ \hat{\mu} = \exp(-3.3048 + 0.16405x), \]
To interpret, exponentiate to get to original scale.
exp(crab.glm$coefficients)
## (Intercept) width
## 0.03670812 1.17826744
A one cm increase in crab width is associated with a 17.8% increase in [predicted number]/[estimated mean number] of satellites.
Note that the log link is the default for poisson, so we could have
used family = poisson
rather than
family = poisson(link = "log")
.
How can we check if the model fits well?
Add fitted model to plot.
co <- crab.glm$coefficients
plot(satell ~ width, xlab="Width (cm)", ylab="No. of Satellites", data=crab)
curve(exp(co[1]+co[2]*x),add=T)
What is “overdisperson”?
Create the eight width categories used in book:
One option to create width categories is the cut
function. Note that the resulting factor is treated as nominal even
though it is ordinal.
crab$width.grp <- cut(crab$width,
breaks = c(0, seq(23.25, 29.25, by=1), 35),
labels=c(1:8))
is.factor(crab$width.grp)
## [1] TRUE
head(cbind(crab$width, crab$width.grp))
## [,1] [,2]
## [1,] 28.3 7
## [2,] 22.5 1
## [3,] 26.0 4
## [4,] 24.8 3
## [5,] 26.0 4
## [6,] 23.8 2
Explore association between number of satellites and width group.
xtabs(~ satell+width.grp, data=crab)
## width.grp
## satell 1 2 3 4 5 6 7 8
## 0 9 10 11 18 7 4 3 0
## 1 2 0 4 0 2 4 4 0
## 2 0 2 1 3 1 0 0 2
## 3 0 0 1 6 3 3 4 2
## 4 3 0 3 3 2 2 2 4
## 5 0 0 4 3 3 2 1 2
## 6 0 1 3 1 2 6 0 0
## 7 0 0 0 1 1 1 0 1
## 8 0 0 1 1 1 1 1 1
## 9 0 0 0 1 0 0 1 1
## 10 0 1 0 1 0 0 1 0
## 11 0 0 0 0 0 1 0 0
## 12 0 0 0 0 0 0 0 1
## 14 0 0 0 1 0 0 0 0
## 15 0 0 0 0 0 0 1 0
Since width.grp
is a factor, R defaults to side-by-side
boxplots when we use plot
.
plot(satell ~ width.grp, data=crab,
xlab = "Width Group", ylab = "Number of Satellites")
Examine our fitted model over the means for each group.
means <- tapply(crab$satell, crab$width.grp, mean)
plot(seq(22.75,29.75,by=1), means,
xlab="Width", ylab="Mean No. of Satellites")
curve(exp(co[1]+co[2]*x), add=T)
Check for overdispersion by comparing the means within each group to the sample variances.
vars <- tapply(crab$satell, crab$width.grp, var)
kable(data.frame(width.grp = 1:8, means = round(means, 3), variances = round(vars,3)))
width.grp | means | variances |
---|---|---|
1 | 1.000 | 2.769 |
2 | 1.429 | 8.879 |
3 | 2.393 | 6.544 |
4 | 2.692 | 11.377 |
5 | 2.864 | 6.885 |
6 | 3.875 | 8.810 |
7 | 3.944 | 16.879 |
8 | 5.143 | 8.286 |
The variances in each group are much larger than the means for each group, indicating overdispersion is present. This may indicate need to include other predictors to reduce unexplained variability.
colorF
crab.glmF <- glm(satell ~ colorF, family = poisson(link = "log"), data = crab)
summary(crab.glmF)
##
## Call:
## glm(formula = satell ~ colorF, family = poisson(link = "log"),
## data = crab)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8577 -2.1106 -0.1649 0.8721 4.7491
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.4069 0.1429 9.848 < 2e-16 ***
## colorFM -0.2146 0.1536 -1.397 0.162487
## colorFDM -0.6061 0.1750 -3.464 0.000532 ***
## colorFD -0.6913 0.2065 -3.348 0.000814 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 609.14 on 169 degrees of freedom
## AIC: 972.44
##
## Number of Fisher Scoring iterations: 6
exp(crab.glmF$coefficients)
## (Intercept) colorFM colorFDM colorFD
## 4.0833333 0.8068743 0.5454545 0.5009276
How would you interpret each coefficient?
If we treat color as quantitative?
crab.color <- glm(satell ~ color, family = poisson, data = crab)
summary(crab.color)
##
## Call:
## glm(formula = satell ~ color, family = poisson, data = crab)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9072 -2.2128 -0.2959 0.9189 4.9423
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.71420 0.14212 12.062 < 2e-16 ***
## color -0.27295 0.05932 -4.601 4.2e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 610.66 on 171 degrees of freedom
## AIC: 969.96
##
## Number of Fisher Scoring iterations: 6
exp(crab.color$coefficients[2]*2)-1
## color
## -0.4206765
The estimated mean number of satellites for dark medium crabs are about 42.0676479% less than for light medium crabs.
width
and
colorF
Model with interaction:
crab.glm2 <- glm(satell ~ width*colorF, data=crab, family=poisson)
summary(crab.glm2)
##
## Call:
## glm(formula = satell ~ width * colorF, family = poisson, data = crab)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.9830 -1.9293 -0.4705 0.9635 4.7216
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.57144 2.46299 1.450 0.14705
## width -0.08057 0.09187 -0.877 0.38046
## colorFM -6.13346 2.55488 -2.401 0.01636 *
## colorFDM -8.51524 2.83682 -3.002 0.00268 **
## colorFD -10.54353 3.31180 -3.184 0.00145 **
## width:colorFM 0.21942 0.09513 2.306 0.02108 *
## width:colorFDM 0.30017 0.10598 2.832 0.00462 **
## width:colorFD 0.37883 0.12457 3.041 0.00236 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 547.57 on 165 degrees of freedom
## AIC: 918.86
##
## Number of Fisher Scoring iterations: 6
How to interpret coefficient estimates?
Plot number of satellites vs width by color and add fitted model.
co <- crab.glm2$coef
plot(satell ~ width, data=crab, xlab="Width (cm)", ylab="No. of Satellites",
pch=color, col=(color+1))
legend("topleft", c("LM","M","DM","D"), pch=c(1,2,3,4), col=c(2,3,4,5))
# Add fitted models by color
# LM
curve(exp(co[1]+co[2]*x), col=2, add=TRUE)
# M
curve(exp(co[1]+co[3] + (co[2]+co[6])*x), col=3, add=TRUE)
# DM
curve(exp(co[1]+co[4] + (co[2]+co[7])*x), col=4, add=TRUE)
# D
curve(exp(co[1]+co[5] + (co[2]+co[8])*x), col=5, add=TRUE)
Or with the effects package:
library(effects)
## Loading required package: carData
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
plot(allEffects(crab.glm2), type="response", multiline=T, ci.style="bands")
plot(allEffects(crab.glm2), type="link", multiline=T, ci.style="bands")
plot(allEffects(crab.glm2, residuals=T), type="link")
Additive model (no interaction):
crab.glm2a <- glm(satell ~ width + colorF, data=crab, family=poisson)
co <- crab.glm2a$coef
plot(satell ~ width, data=crab, xlab="Width (cm)", ylab="No. of Satellites", pch=color, col=(color+1))
legend("topright", c("LM","M","DM","D"), pch=c(1,2,3,4), col=c(2,3,4,5))
# LM
curve(exp(co[1]+co[2]*x), col=2, add=TRUE, lty=2)
# M
curve(exp(co[1]+co[3] + (co[2])*x), col=3, add=TRUE, lty=2)
# DM
curve(exp(co[1]+co[4] + (co[2])*x), col=4, add=TRUE, lty=2)
# D
curve(exp(co[1]+co[5] + (co[2])*x), col=5, add=TRUE, lty=2)
What if we treated color as quantitative? (Is this valid?)
crab.glm3 <- glm(satell ~ width*color, data=crab, family=poisson)
summary(crab.glm3)
##
## Call:
## glm(formula = satell ~ width * color, family = poisson, data = crab)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8651 -1.9044 -0.5242 0.9894 4.6655
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.30392 2.09204 1.579 0.11427
## width -0.06880 0.07779 -0.884 0.37651
## color -2.76463 0.89760 -3.080 0.00207 **
## width:color 0.09777 0.03360 2.909 0.00362 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 551.41 on 169 degrees of freedom
## AIC: 914.71
##
## Number of Fisher Scoring iterations: 6
# Note significant interaction term.
plot(satell ~ width, data=crab, xlab="Width (cm)", ylab="No. of Satellites", pch=color, col=(color+1))
legend("topleft", c("LM","M","DM","D"), pch=c(1,2,3,4), col=c(2,3,4,5))
## Add fitted model where color is quantitative to plot:
co <- crab.glm3$coef
# LM
curve(exp(co[1] + co[2]*x + co[3]*1 + co[4]*x*1), col=2, add=TRUE, lty=3, lwd=2)
# M
curve(exp(co[1] + co[2]*x + co[3]*2 + co[4]*x*2), col=3, add=TRUE, lty=3, lwd=2)
# DM
curve(exp(co[1] + co[2]*x + co[3]*3 + co[4]*x*3), col=4, add=TRUE, lty=3, lwd=2)
# D
curve(exp(co[1] + co[2]*x + co[3]*4 + co[4]*x*4), col=5, add=TRUE, lty=3, lwd=2)
# Effect of a one cm increase in width for each color:
for(i in 1:4){
print(exp(crab.glm3$coef[2]+crab.glm3$coef[4]*i))
}
## width
## 1.029395
## width
## 1.135121
## width
## 1.251705
## width
## 1.380263