Data set from Agresti, 2007

Study of nesting horseshoe crabs (J. Brockman, Ethology, 1996). Each female crab in the study had a male crab attached to her in her nest. The study investigated factors that affect whether the female crab had any other males, called satellites, residing nearby her. Explanatory variables thought possibly to affect this included the female crab’s color, spine condition, weight, and carapace width. The response outcome for each female crab is her number of satellites.

Variable descriptions

  • color: 1 - light medium, 2 - medium, 3 - dark medium, 4 - dark
  • spine: 1 - both good, 2 - one worn or broken, 3 - both worn or broken
  • width: carapace width in cm
  • satell: number of satellites
  • weight: weight in kg

Data import

First, read in the data set and convert color and spine to factors:

crab <- read.table("http://www.math.montana.edu/shancock/data/horseshoe.txt",header=T)
crab$colorF <- factor(crab$color, levels=c(1,2,3,4), labels=c("LM","M","DM","D"))
crab$spineF <- factor(crab$spine, levels=c(1,2,3), labels=c("Good","Soso","Poor"))

Research Goal

Our goal is to determine, of the four predictors available, what best predicts the number of satellites.

Model Selection Plan

  1. Exploratory data analysis to examine relationships between number of satellites and each predictor individually.
  2. Model selection via analysis of deviance (anova), forward/backward selection (drop1,add1,step), or best subsets (bestglm).
  3. Check assumptions, diagnostics, and goodness-of-fit.
  4. Write paragraph summarizing what the model tells us about the relationships between predictors and response.

Part I: Model Selection with Poisson Response

Recall: Overdispersion was a problem with these count data. We can use Negative Binomial regression to address this issue.

mod1 <- glm.nb(satell ~ width, data = crab)
summary(mod1)
## 
## Call:
## glm.nb(formula = satell ~ width, data = crab, init.theta = 0.90456808, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7798  -1.4110  -0.2502   0.4770   2.0177  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.05251    1.17143  -3.459 0.000541 ***
## width        0.19207    0.04406   4.360  1.3e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.9046) family taken to be 1)
## 
##     Null deviance: 213.05  on 172  degrees of freedom
## Residual deviance: 195.81  on 171  degrees of freedom
## AIC: 757.29
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.905 
##           Std. Err.:  0.161 
## 
##  2 x log-likelihood:  -751.291

Or quasi-Poisson:

mod1.qpois <- glm(satell ~ width, family = quasipoisson, data = crab)
summary(mod1.qpois)
## 
## Call:
## glm(formula = satell ~ width, family = quasipoisson, data = crab)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8526  -1.9884  -0.4933   1.0970   4.9221  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -3.30476    0.96729  -3.417 0.000793 ***
## width        0.16405    0.03562   4.606 7.99e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 3.182205)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 567.88  on 171  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6

Compare to Poisson regression:

mod1.pois <- glm(satell ~ width, family = poisson, data = crab)
summary(mod1.pois)
## 
## Call:
## glm(formula = satell ~ width, family = poisson, data = crab)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.8526  -1.9884  -0.4933   1.0970   4.9221  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.30476    0.54224  -6.095  1.1e-09 ***
## width        0.16405    0.01997   8.216  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 567.88  on 171  degrees of freedom
## AIC: 927.18
## 
## Number of Fisher Scoring iterations: 6

Model Selection

Prior to model fitting -> exploratory data analysis.

One approach - start with all four in the model, and try taking one out:

mod.full <- glm.nb(satell ~ colorF + spineF + width + weight, data = crab)
summary(mod.full)
## 
## Call:
## glm.nb(formula = satell ~ colorF + spineF + width + weight, data = crab, 
##     init.theta = 0.9521901429, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8535  -1.3633  -0.3033   0.4685   2.2238  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.58662    1.83022  -0.867    0.386
## colorFM     -0.31434    0.37457  -0.839    0.401
## colorFDM    -0.57764    0.41921  -1.378    0.168
## colorFD     -0.53423    0.46877  -1.140    0.254
## spineFSoso  -0.21439    0.39927  -0.537    0.591
## spineFPoor   0.02570    0.24903   0.103    0.918
## width        0.07734    0.09053   0.854    0.393
## weight       0.37950    0.31432   1.207    0.227
## 
## (Dispersion parameter for Negative Binomial(0.9522) family taken to be 1)
## 
##     Null deviance: 219.08  on 172  degrees of freedom
## Residual deviance: 196.73  on 165  degrees of freedom
## AIC: 764.9
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.952 
##           Std. Err.:  0.173 
## 
##  2 x log-likelihood:  -746.905

Why all large p-values?

Model with width alone –> p-value for width = tiny
Do I need width compared to nothing? –> Yes!

Model with everything –> p-value for width = 0.393
Do I need width in addition to color, spine, weight? –> No

Try dropping each:

drop1(mod.full)
## Single term deletions
## 
## Model:
## satell ~ colorF + spineF + width + weight
##        Df Deviance    AIC
## <none>      196.73 762.90
## colorF  3   199.22 759.40
## spineF  2   197.15 759.33
## width   1   197.43 761.61
## weight  1   198.18 762.35

Take out spineF:

mod2 <- update(mod.full, . ~ . - spineF)
drop1(mod2)
## Single term deletions
## 
## Model:
## satell ~ colorF + width + weight
##        Df Deviance    AIC
## <none>      196.74 759.33
## colorF  3   199.26 755.84
## width   1   197.76 758.34
## weight  1   198.02 758.61

Take out colorF:

mod3 <- update(mod2, . ~ . - colorF)
drop1(mod3)
## Single term deletions
## 
## Model:
## satell ~ width + weight
##        Df Deviance    AIC
## <none>      196.34 755.82
## width   1   197.51 754.99
## weight  1   197.82 755.30

Step-wise regression:

step(mod.full, direction = "backward")
## Start:  AIC=762.9
## satell ~ colorF + spineF + width + weight
## 
##          Df Deviance    AIC
## - spineF  2   197.15 759.33
## - colorF  3   199.22 759.40
## - width   1   197.43 761.61
## - weight  1   198.18 762.35
## <none>        196.73 762.90
## 
## Step:  AIC=759.33
## satell ~ colorF + width + weight
## 
##          Df Deviance    AIC
## - colorF  3   199.26 755.84
## - width   1   197.76 758.34
## - weight  1   198.02 758.61
## <none>        196.74 759.33
## 
## Step:  AIC=755.82
## satell ~ width + weight
## 
##          Df Deviance    AIC
## - width   1   197.51 754.99
## - weight  1   197.82 755.30
## <none>        196.34 755.82
## 
## Step:  AIC=754.99
## satell ~ weight
## 
##          Df Deviance    AIC
## <none>        196.74 754.99
## - weight  1   214.42 770.68
## 
## Call:  glm.nb(formula = satell ~ weight, data = crab, init.theta = 0.9152623199, 
##     link = log)
## 
## Coefficients:
## (Intercept)       weight  
##     -0.6142       0.6637  
## 
## Degrees of Freedom: 172 Total (i.e. Null);  171 Residual
## Null Deviance:       214.4 
## Residual Deviance: 196.7     AIC: 757

Step-wise regression going forward:

mod.null <- glm.nb(satell ~ 1, data = crab)
step(mod.null, satell ~ colorF + spineF + width + weight, direction = "forward")
## Start:  AIC=769.41
## satell ~ 1
## 
##          Df Deviance    AIC
## + weight  1   177.64 756.12
## + width   1   177.82 756.30
## <none>        192.93 769.41
## + colorF  3   187.93 770.41
## + spineF  2   190.50 770.98
## 
## Step:  AIC=754.99
## satell ~ weight
## 
##          Df Deviance    AIC
## <none>        196.74 754.99
## + width   1   195.56 755.82
## + spineF  2   196.08 758.34
## + colorF  3   194.11 758.37
## 
## Call:  glm.nb(formula = satell ~ weight, data = crab, init.theta = 0.9152622973, 
##     link = log)
## 
## Coefficients:
## (Intercept)       weight  
##     -0.6142       0.6637  
## 
## Degrees of Freedom: 172 Total (i.e. Null);  171 Residual
## Null Deviance:       214.4 
## Residual Deviance: 196.7     AIC: 757

Step-wise regression in both directions:

step(mod.full, direction = "both")
## Start:  AIC=762.9
## satell ~ colorF + spineF + width + weight
## 
##          Df Deviance    AIC
## - spineF  2   197.15 759.33
## - colorF  3   199.22 759.40
## - width   1   197.43 761.61
## - weight  1   198.18 762.35
## <none>        196.73 762.90
## 
## Step:  AIC=759.33
## satell ~ colorF + width + weight
## 
##          Df Deviance    AIC
## - colorF  3   199.26 755.84
## - width   1   197.76 758.34
## - weight  1   198.02 758.61
## <none>        196.74 759.33
## + spineF  2   196.32 762.91
## 
## Step:  AIC=755.82
## satell ~ width + weight
## 
##          Df Deviance    AIC
## - width   1   197.51 754.99
## - weight  1   197.82 755.30
## <none>        196.34 755.82
## + colorF  3   193.87 759.35
## + spineF  2   195.89 759.37
## 
## Step:  AIC=754.99
## satell ~ weight
## 
##          Df Deviance    AIC
## <none>        196.74 754.99
## + width   1   195.56 755.82
## + spineF  2   196.08 758.34
## + colorF  3   194.11 758.37
## - weight  1   214.42 770.68
## 
## Call:  glm.nb(formula = satell ~ weight, data = crab, init.theta = 0.9152623199, 
##     link = log)
## 
## Coefficients:
## (Intercept)       weight  
##     -0.6142       0.6637  
## 
## Degrees of Freedom: 172 Total (i.e. Null);  171 Residual
## Null Deviance:       214.4 
## Residual Deviance: 196.7     AIC: 757

Summary

According to AIC, the “best” model includes just weight. “Best” model:

mod.best <- glm.nb(satell ~ weight, data = crab)
summary(mod.best)
## 
## Call:
## glm.nb(formula = satell ~ weight, data = crab, init.theta = 0.9152622972, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8063  -1.4363  -0.3072   0.5161   2.0708  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.6142     0.3957  -1.552    0.121    
## weight        0.6637     0.1549   4.286 1.82e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.9153) family taken to be 1)
## 
##     Null deviance: 214.42  on 172  degrees of freedom
## Residual deviance: 196.74  on 171  degrees of freedom
## AIC: 756.99
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.915 
##           Std. Err.:  0.165 
## 
##  2 x log-likelihood:  -750.993
exp(confint(mod.best))
## Waiting for profiling to be done...
##                 2.5 %   97.5 %
## (Intercept) 0.2361634 1.217564
## weight      1.4170155 2.696467
plot(satell ~ weight, data = crab,
     xlab = "Weight (kg)", ylab = "Number of Satellites",
     main = "Number of Satellites vs Weight")
curve(exp(-0.6142 + 0.6637*x), col = "red", add=T)

Fitted model: (\(x\) = weight) \[ \log(\hat{\mu}) = -0.6142 + 0.6637x \] \[ \hat{\mu} = \exp(-0.6142 + 0.6637x) \] We are 95% confident that an increase in one kg of weight is associated with an increase of between 41.7% and 169.6% in mean number of satellites in this population.

Diagnostics

  • functional form of mean - transform weight or width?
  • interactions?
  • outliers or influential observations - if are outliers, fit model with and without and see how much conclusions change
  • goodness of fit test (using deviance since Poisson with reasonably large counts)
dev <- mod.best$deviance
dev
## [1] 196.7368
df <- mod.best$df.residual
df
## [1] 171
# p-value
pchisq(dev, df, lower.tail = FALSE)
## [1] 0.08643382

Part I: Model Selection with Binomial Response

Let’s convert our response variable to an indicator variable of whether the female crab had satellites or not.

crab$satell_ind <- as.numeric(crab$satell > 0)

Note: Overdispersion is not a problem with binary data – why?
We are restricted to the values {0,1}, whereas counts can go to infinity.

Model Selection

The bestglm function will fit every combination of predictors (all subsets), and return the one with the minimum value of the information criteria you select.

See the help file for the bestglm function – it needs the design matrix (the actual values of the variables that show up in the model equation - not categories), and the response variable in the last column of the data set.

mod.full <- glm(satell_ind ~ colorF + spineF + width + weight, family = binomial, data = crab)
X <- model.matrix(mod.full)
head(X)
##   (Intercept) colorFM colorFDM colorFD spineFSoso spineFPoor width weight
## 1           1       1        0       0          0          1  28.3   3.05
## 2           1       0        1       0          0          1  22.5   1.55
## 3           1       0        0       0          0          0  26.0   2.30
## 4           1       0        1       0          0          1  24.8   2.10
## 5           1       0        1       0          0          1  26.0   2.60
## 6           1       1        0       0          0          1  23.8   2.10
# Remove column of 1's (not a variable):
X <- X[,-1]
Xy <- as.data.frame(cbind(X,y = crab$satell_ind))
head(Xy)
##   colorFM colorFDM colorFD spineFSoso spineFPoor width weight y
## 1       1        0       0          0          1  28.3   3.05 1
## 2       0        1       0          0          1  22.5   1.55 0
## 3       0        0       0          0          0  26.0   2.30 1
## 4       0        1       0          0          1  24.8   2.10 0
## 5       0        1       0          0          1  26.0   2.60 1
## 6       1        0       0          0          1  23.8   2.10 0
best.AIC <- bestglm(Xy, family = binomial, IC = "AIC")
## Morgan-Tatar search since family is non-gaussian.
best.AIC
## AIC
## BICq equivalent for q in (0.338329717201477, 0.904725300676076)
## Best Model:
##                Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -11.6790256  2.6925051 -4.337606 1.440432e-05
## colorFD      -1.3005121  0.5258610 -2.473110 1.339430e-02
## width         0.4782223  0.1041468  4.591812 4.394143e-06
best.BIC <- bestglm(Xy, family = binomial, IC = "BIC")
## Morgan-Tatar search since family is non-gaussian.
best.BIC
## BIC
## BICq equivalent for q in (0.338329717201477, 0.904725300676076)
## Best Model:
##                Estimate Std. Error   z value     Pr(>|z|)
## (Intercept) -11.6790256  2.6925051 -4.337606 1.440432e-05
## colorFD      -1.3005121  0.5258610 -2.473110 1.339430e-02
## width         0.4782223  0.1041468  4.591812 4.394143e-06

WARNING: Best subsets does not follow the rule that if you include one indicator variable for a factor, you must include the rest. Thus, if using the models given by best.AIC or best.BIC, make sure you include all indicators for colorF!

Goodness of Fit for binary data: Hosmer-Lemeshow Test

mod.wgt <- glm(satell_ind ~ weight, family = binomial, data = crab)
obs <- crab$satell_ind
expected <- fitted(mod.wgt)
hoslem.test(obs, expected, g = 10)
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  obs, expected
## X-squared = 7.3643, df = 8, p-value = 0.4979