Poisson Regression: Horseshoe crab example

Data set from Agresti (2007)

Study of nesting horseshoe crabs (J. Brockman, Ethology, 1996). Each female crab in the study had a male crab attached to her in her nest. The study investigated factors that affect whether the female crab had any other males, called satellites, residing nearby her. Explanatory variables thought possibly to affect this included the female crab’s color, spine condition, weight, and carapace width. The response outcome for each female crab is her number of satellites.

Variable descriptions:

  • color: 1 - light medium, 2 - medium, 3 - dark medium, 4 - dark
  • spine: 1 - both good, 2 - one worn or broken, 3 - both worn or broken
  • width: carapace width in cm
  • satell: number of satellites
  • weight: weight in kg

Read in the data and recode factors.

crab <- read.table("http://www.math.montana.edu/shancock/data/horseshoe.txt",header=T)
crab$colorF <- factor(crab$color, levels=c(1,2,3,4), labels=c("LM","M","DM","D"))
crab$spineF <- factor(crab$spine, levels=c(1,2,3), labels=c("Good","Soso","Poor"))
str(crab)
## 'data.frame':    173 obs. of  7 variables:
##  $ color : int  2 3 1 3 3 2 1 3 2 3 ...
##  $ spine : int  3 3 1 3 3 3 1 2 1 3 ...
##  $ width : num  28.3 22.5 26 24.8 26 23.8 26.5 24.7 23.7 25.6 ...
##  $ satell: int  8 0 9 0 4 0 0 0 0 0 ...
##  $ weight: num  3.05 1.55 2.3 2.1 2.6 2.1 2.35 1.9 1.95 2.15 ...
##  $ colorF: Factor w/ 4 levels "LM","M","DM",..: 2 3 1 3 3 2 1 3 2 3 ...
##  $ spineF: Factor w/ 3 levels "Good","Soso",..: 3 3 1 3 3 3 1 2 1 3 ...

Exploratory data analysis

Summary statistics for each variable:

summary(crab)
##      color           spine           width          satell      
##  Min.   :1.000   Min.   :1.000   Min.   :21.0   Min.   : 0.000  
##  1st Qu.:2.000   1st Qu.:2.000   1st Qu.:24.9   1st Qu.: 0.000  
##  Median :2.000   Median :3.000   Median :26.1   Median : 2.000  
##  Mean   :2.439   Mean   :2.486   Mean   :26.3   Mean   : 2.919  
##  3rd Qu.:3.000   3rd Qu.:3.000   3rd Qu.:27.7   3rd Qu.: 5.000  
##  Max.   :4.000   Max.   :3.000   Max.   :33.5   Max.   :15.000  
##      weight      colorF   spineF   
##  Min.   :0.200   LM:12   Good: 37  
##  1st Qu.:2.000   M :95   Soso: 15  
##  Median :2.350   DM:44   Poor:121  
##  Mean   :2.427   D :22             
##  3rd Qu.:2.850                     
##  Max.   :5.200

Plot all variables against one another:

plot(crab)

Boxplots for categorical predictors:

boxplot(satell ~ colorF, data=crab, xlab="Color", ylab="No. of Satellites")

boxplot(satell ~ spineF, data=crab, xlab="Spine Condition", ylab="No. of Satellites")

Based on the plots alone, what variables seem to be the best predictors of number of satellites?

cor(crab[,1:5])
##             color       spine      width      satell     weight
## color   1.0000000  0.37850163 -0.2643863 -0.19078455 -0.2748001
## spine   0.3785016  1.00000000 -0.1218946 -0.08993242 -0.1707781
## width  -0.2643863 -0.12189458  1.0000000  0.33989033  0.8560429
## satell -0.1907846 -0.08993242  0.3398903  1.00000000  0.3542486
## weight -0.2748001 -0.17077808  0.8560429  0.35424861  1.0000000

Note that weight and width are highly correlated. What happens if we fit them both in the model?

summary(glm(satell ~ width, family = poisson, data = crab))
## 
## Call:
## glm(formula = satell ~ width, family = poisson, data = crab)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8526  -1.9884  -0.4933   1.0970   4.9221  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.30476    0.54224  -6.095  1.1e-09 ***
## width        0.16405    0.01997   8.216  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 567.88  on 171  degrees of freedom
## AIC: 927.18
## 
## Number of Fisher Scoring iterations: 6
summary(glm(satell ~ weight, family = poisson, data = crab))
## 
## Call:
## glm(formula = satell ~ weight, family = poisson, data = crab)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9182  -2.0169  -0.5926   1.0290   4.9755  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.36997    0.17811  -2.077   0.0378 *  
## weight       0.56837    0.06496   8.750   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 564.35  on 171  degrees of freedom
## AIC: 923.65
## 
## Number of Fisher Scoring iterations: 5
summary(glm(satell ~ width + weight, family = poisson, data = crab))
## 
## Call:
## glm(formula = satell ~ width + weight, family = poisson, data = crab)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9194  -1.9390  -0.5447   1.0694   4.9683  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -1.69254    0.86372  -1.960   0.0500 .
## width        0.06972    0.04435   1.572   0.1159  
## weight       0.35659    0.14906   2.392   0.0167 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 561.93  on 170  degrees of freedom
## AIC: 923.23
## 
## Number of Fisher Scoring iterations: 6

Poisson regression models

One quantitative predictor: width

crab.glm <- glm(satell ~ width, family = poisson(link = "log"), data = crab)
summary(crab.glm)
## 
## Call:
## glm(formula = satell ~ width, family = poisson(link = "log"), 
##     data = crab)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8526  -1.9884  -0.4933   1.0970   4.9221  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.30476    0.54224  -6.095  1.1e-09 ***
## width        0.16405    0.01997   8.216  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 567.88  on 171  degrees of freedom
## AIC: 927.18
## 
## Number of Fisher Scoring iterations: 6

How would you interpret each coefficient?

Fitted model: \(\mu\) = mean number of satellites among the population of female crabs represented by this sample \[ \log(\hat{\mu}) = -3.3048 + 0.16405x, \] where \(x\) is the width of the crab in cm.

OR \[ \hat{\mu} = \exp(-3.3048 + 0.16405x), \]

To interpret, exponentiate to get to original scale.

exp(crab.glm$coefficients)
## (Intercept)       width 
##  0.03670812  1.17826744

A one cm increase in crab width is associated with a 17.8% increase in [predicted number]/[estimated mean number] of satellites.

Note that the log link is the default for poisson, so we could have used family = poisson rather than family = poisson(link = "log").

Checking model fit

How can we check if the model fits well?

Add fitted model to plot.

co <- crab.glm$coefficients
plot(satell ~ width, xlab="Width (cm)", ylab="No. of Satellites", data=crab)
curve(exp(co[1]+co[2]*x),add=T)

Check for overdispersion (start here Thur)

What is “overdisperson”?

Create the eight width categories used in book:

  • 1 = <23.25
  • 2 = 23.25-24.25
  • 3 = 24.25-25.25
  • 4 = 25.25-26.25
  • 5 = 26.25-27.25
  • 6 = 27.25-28.25
  • 7 = 28.25-29.25
  • 8 = >29.25 - Note: Book error = 30.25

One option to create width categories is the cut function. Note that the resulting factor is treated as nominal even though it is ordinal.

crab$width.grp <- cut(crab$width, 
                      breaks = c(0, seq(23.25, 29.25, by=1), 35), 
                      labels=c(1:8))
is.factor(crab$width.grp)
## [1] TRUE
head(cbind(crab$width, crab$width.grp))
##      [,1] [,2]
## [1,] 28.3    7
## [2,] 22.5    1
## [3,] 26.0    4
## [4,] 24.8    3
## [5,] 26.0    4
## [6,] 23.8    2

Explore association between number of satellites and width group.

xtabs(~ satell+width.grp, data=crab)
##       width.grp
## satell  1  2  3  4  5  6  7  8
##     0   9 10 11 18  7  4  3  0
##     1   2  0  4  0  2  4  4  0
##     2   0  2  1  3  1  0  0  2
##     3   0  0  1  6  3  3  4  2
##     4   3  0  3  3  2  2  2  4
##     5   0  0  4  3  3  2  1  2
##     6   0  1  3  1  2  6  0  0
##     7   0  0  0  1  1  1  0  1
##     8   0  0  1  1  1  1  1  1
##     9   0  0  0  1  0  0  1  1
##     10  0  1  0  1  0  0  1  0
##     11  0  0  0  0  0  1  0  0
##     12  0  0  0  0  0  0  0  1
##     14  0  0  0  1  0  0  0  0
##     15  0  0  0  0  0  0  1  0

Since width.grp is a factor, R defaults to side-by-side boxplots when we use plot.

plot(satell ~ width.grp, data=crab,
     xlab = "Width Group", ylab = "Number of Satellites")

Examine our fitted model over the means for each group.

means <- tapply(crab$satell, crab$width.grp, mean)
plot(seq(22.75,29.75,by=1), means, 
     xlab="Width", ylab="Mean No. of Satellites")
curve(exp(co[1]+co[2]*x), add=T)

Check for overdispersion by comparing the means within each group to the sample variances.

vars <- tapply(crab$satell, crab$width.grp, var)
kable(data.frame(width.grp = 1:8, means = round(means, 3), variances = round(vars,3)))
width.grp means variances
1 1.000 2.769
2 1.429 8.879
3 2.393 6.544
4 2.692 11.377
5 2.864 6.885
6 3.875 8.810
7 3.944 16.879
8 5.143 8.286

The variances in each group are much larger than the means for each group, indicating overdispersion is present. This may indicate need to include other predictors to reduce unexplained variability.

One categorical predictor: colorF

crab.glmF <- glm(satell ~ colorF, family = poisson(link = "log"), data = crab)
summary(crab.glmF)
## 
## Call:
## glm(formula = satell ~ colorF, family = poisson(link = "log"), 
##     data = crab)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8577  -2.1106  -0.1649   0.8721   4.7491  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   1.4069     0.1429   9.848  < 2e-16 ***
## colorFM      -0.2146     0.1536  -1.397 0.162487    
## colorFDM     -0.6061     0.1750  -3.464 0.000532 ***
## colorFD      -0.6913     0.2065  -3.348 0.000814 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 609.14  on 169  degrees of freedom
## AIC: 972.44
## 
## Number of Fisher Scoring iterations: 6
exp(crab.glmF$coefficients)
## (Intercept)     colorFM    colorFDM     colorFD 
##   4.0833333   0.8068743   0.5454545   0.5009276

How would you interpret each coefficient?

If we treat color as quantitative?

crab.color <- glm(satell ~ color, family = poisson, data = crab)
summary(crab.color)
## 
## Call:
## glm(formula = satell ~ color, family = poisson, data = crab)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9072  -2.2128  -0.2959   0.9189   4.9423  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.71420    0.14212  12.062  < 2e-16 ***
## color       -0.27295    0.05932  -4.601  4.2e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 610.66  on 171  degrees of freedom
## AIC: 969.96
## 
## Number of Fisher Scoring iterations: 6
exp(crab.color$coefficients[2]*2)-1
##      color 
## -0.4206765

The estimated mean number of satellites for dark medium crabs are about 42.0676479% less than for light medium crabs.

Poisson regression with two predictors: width and colorF

Model with interaction:

crab.glm2 <- glm(satell ~ width*colorF, data=crab, family=poisson)
summary(crab.glm2)
## 
## Call:
## glm(formula = satell ~ width * colorF, family = poisson, data = crab)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.9830  -1.9293  -0.4705   0.9635   4.7216  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)   
## (Intercept)      3.57144    2.46299   1.450  0.14705   
## width           -0.08057    0.09187  -0.877  0.38046   
## colorFM         -6.13346    2.55488  -2.401  0.01636 * 
## colorFDM        -8.51524    2.83682  -3.002  0.00268 **
## colorFD        -10.54353    3.31180  -3.184  0.00145 **
## width:colorFM    0.21942    0.09513   2.306  0.02108 * 
## width:colorFDM   0.30017    0.10598   2.832  0.00462 **
## width:colorFD    0.37883    0.12457   3.041  0.00236 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 547.57  on 165  degrees of freedom
## AIC: 918.86
## 
## Number of Fisher Scoring iterations: 6

How to interpret coefficient estimates?

Plot number of satellites vs width by color and add fitted model.

co <- crab.glm2$coef
plot(satell ~ width, data=crab, xlab="Width (cm)", ylab="No. of Satellites", 
     pch=color, col=(color+1))
legend("topleft", c("LM","M","DM","D"), pch=c(1,2,3,4), col=c(2,3,4,5))
# Add fitted models by color
# LM
curve(exp(co[1]+co[2]*x), col=2, add=TRUE)
# M
curve(exp(co[1]+co[3] + (co[2]+co[6])*x), col=3, add=TRUE)
# DM
curve(exp(co[1]+co[4] + (co[2]+co[7])*x), col=4, add=TRUE)
# D
curve(exp(co[1]+co[5] + (co[2]+co[8])*x), col=5, add=TRUE)

Or with the effects package:

library(effects)
## Loading required package: carData
## lattice theme set by effectsTheme()
## See ?effectsTheme for details.
plot(allEffects(crab.glm2), type="response", multiline=T, ci.style="bands")

plot(allEffects(crab.glm2), type="link", multiline=T, ci.style="bands")

plot(allEffects(crab.glm2, residuals=T), type="link")

Additive model (no interaction):

crab.glm2a <- glm(satell ~ width + colorF, data=crab, family=poisson)
co <- crab.glm2a$coef
plot(satell ~ width, data=crab, xlab="Width (cm)", ylab="No. of Satellites", pch=color, col=(color+1))
legend("topright", c("LM","M","DM","D"), pch=c(1,2,3,4), col=c(2,3,4,5))
# LM
curve(exp(co[1]+co[2]*x), col=2, add=TRUE, lty=2)
# M
curve(exp(co[1]+co[3] + (co[2])*x), col=3, add=TRUE, lty=2)
# DM
curve(exp(co[1]+co[4] + (co[2])*x), col=4, add=TRUE, lty=2)
# D
curve(exp(co[1]+co[5] + (co[2])*x), col=5, add=TRUE, lty=2)

What if we treated color as quantitative? (Is this valid?)

crab.glm3 <- glm(satell ~ width*color, data=crab, family=poisson)
summary(crab.glm3)
## 
## Call:
## glm(formula = satell ~ width * color, family = poisson, data = crab)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8651  -1.9044  -0.5242   0.9894   4.6655  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)  3.30392    2.09204   1.579  0.11427   
## width       -0.06880    0.07779  -0.884  0.37651   
## color       -2.76463    0.89760  -3.080  0.00207 **
## width:color  0.09777    0.03360   2.909  0.00362 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 551.41  on 169  degrees of freedom
## AIC: 914.71
## 
## Number of Fisher Scoring iterations: 6
# Note significant interaction term.
plot(satell ~ width, data=crab, xlab="Width (cm)", ylab="No. of Satellites", pch=color, col=(color+1))
legend("topleft", c("LM","M","DM","D"), pch=c(1,2,3,4), col=c(2,3,4,5))
## Add fitted model where color is quantitative to plot:
co <- crab.glm3$coef
# LM
curve(exp(co[1] + co[2]*x + co[3]*1 + co[4]*x*1), col=2, add=TRUE, lty=3, lwd=2)
# M
curve(exp(co[1] + co[2]*x + co[3]*2 + co[4]*x*2), col=3, add=TRUE, lty=3, lwd=2)
# DM
curve(exp(co[1] + co[2]*x + co[3]*3 + co[4]*x*3), col=4, add=TRUE, lty=3, lwd=2)
# D
curve(exp(co[1] + co[2]*x + co[3]*4 + co[4]*x*4), col=5, add=TRUE, lty=3, lwd=2)

# Effect of a one cm increase in width for each color:
for(i in 1:4){
    print(exp(crab.glm3$coef[2]+crab.glm3$coef[4]*i))
}
##    width 
## 1.029395 
##    width 
## 1.135121 
##    width 
## 1.251705 
##    width 
## 1.380263