Load required libraries:
library(tidyverse)
In this case study, we examine part of the analysis in Dalal et al. (1989) of O-ring data from the shuttle launches prior to the 1986 Challenger shuttle explosion. Variables in this data set are:
- Flight: flight designation
- Date: launch date
- Field: number of field-joint O-rings (out of 6 per launch) showing thermal distress
- Temp: temperature at launch (degrees F)
- Pres: leak-check pressure (psi)
Note that the Challenger launched at a temperature of 31 degrees F.
Orings <- structure(list(
Flight = structure(c(1L, 2L, 3L, 8L, 17L, 22L,
23L, 24L, 4L, 5L, 6L, 7L, 9L, 11L, 12L, 10L, 14L, 13L, 15L, 16L,
18L, 19L, 20L), .Label = c("1", "2", "3", "41-B", "41-C", "41-D",
"41-G", "5", "51-A", "51-B", "51-C", "51-D", "51-F", "51-G",
"51-I", "51-J", "6", "61-A", "61-B", "61-C", "61-I", "7", "8",
"9"), class = "factor"),
Date = structure(c(8L, 22L, 5L, 21L,
6L, 11L, 15L, 24L, 4L, 7L, 16L, 18L, 20L, 2L, 9L, 10L, 12L, 13L,
14L, 17L, 19L, 23L, 1L), .Label = c("01/12/86", "01/24/85", "01/28/86",
"02/03/84", "03/22/82", "04/04/83", "04/06/84", "04/12/81", "04/12/85",
"04/29/85", "06/16/83", "06/17/85", "07/29/85", "08/27/85", "08/30/83",
"08/30/84", "10/03/85", "10/05/84", "10/30/85", "11/08/84", "11/11/82",
"11/12/81", "11/26/85", "11/28/83"), class = "factor"),
Field = c(0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 3, 0, 0, 0, 0, 0, 0, 2, 0, 1),
Temp = c(66, 70, 69, 68, 67, 72, 73, 70, 57, 63, 70, 78,
67, 53, 67, 75, 70, 81, 76, 79, 75, 76, 58),
Pres = c(50, 50, 50, 50, 50, 50, 100, 100, 200, 200, 200,
200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200, 200)),
.Names = c("Flight", "Date", "Field", "Temp", "Pres"),
row.names = c(NA, 23L), class = "data.frame")
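Before modeling, a quick sanity check that the data frame came through intact (all values below follow from the vectors defined above):

```r
dim(Orings)           # 23 pre-Challenger launches, 5 variables
range(Orings$Temp)    # launch temperatures span 53 to 81 degrees F
table(Orings$Field)   # 16 flights with no distress; one flight (51-C, at 53 F) had three incidents
```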
Let’s replicate the statistical analysis carried out before the Challenger launch.
First, the analysts looked only at flights that experienced some thermal distress. Let’s subset the data:
Orings.fail <- Orings %>%
filter(Field != 0)
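Only 7 of the 23 flights had any distress, and it is worth seeing which ones survive the filter:

```r
nrow(Orings.fail)    # 7 flights
Orings.fail$Temp     # 70 57 63 70 53 75 58 -- largely the colder launches
```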
They examined a plot of these data, and then fit a simple linear regression.
Orings.fail %>% ggplot(aes(x = Temp, y = Field)) +
geom_point() +
ylim(0,6) +
labs(x = "Temperature (F)",
y = "Number of O-ring Failures",
title = "Launches with at least one O-ring failure") +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
mod1 <- lm(Field ~ Temp, data = Orings.fail)
summary(mod1)
##
## Call:
## lm(formula = Field ~ Temp, data = Orings.fail)
##
## Residuals:
## 1 2 3 4 5 6 7
## -0.2690 -0.5991 -0.4467 -0.2690 1.2994 0.8580 -0.5737
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.04649 2.66929 1.141 0.305
## Temp -0.02539 0.04160 -0.610 0.568
##
## Residual standard error: 0.8315 on 5 degrees of freedom
## Multiple R-squared: 0.06934, Adjusted R-squared: -0.1168
## F-statistic: 0.3726 on 1 and 5 DF, p-value: 0.5683
With a p-value of 0.568, it was concluded that temperature had no significant effect on the number of O-ring failures. What is wrong with this analysis?
Let’s look at a more appropriate analysis, one that includes all of the past launches (not just those that experienced O-ring failure!).
Orings %>% ggplot(aes(x = Temp, y = Field)) +
geom_point() +
ylim(0,6) +
labs(x = "Temperature (F)",
y = "Number of O-ring Failures",
title = "All past launches") +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 13 rows containing missing values (geom_smooth).
mod2 <- lm(Field ~ Temp, data = Orings)
summary(mod2)
##
## Call:
## lm(formula = Field ~ Temp, data = Orings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.6582 -0.4389 -0.1594 0.1551 1.9058
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.79365 1.40926 3.402 0.00269 **
## Temp -0.06266 0.02016 -3.108 0.00532 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6673 on 21 degrees of freedom
## Multiple R-squared: 0.3151, Adjusted R-squared: 0.2825
## F-statistic: 9.661 on 1 and 21 DF, p-value: 0.005321
Now, the p-value for the slope of a simple linear regression is 0.00532 – a very different conclusion!
But is simple linear regression appropriate? (Hint: Is the response variable normally distributed?)
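One symptom of the problem: a straight line is unbounded, so mod2 can predict impossible failure counts (this is also why geom_smooth() dropped 13 fitted points that fell below the ylim of 0). Using the estimates printed above:

```r
# extrapolating the linear fit to the cold and warm extremes
predict(mod2, newdata = data.frame(Temp = c(31, 81)))
# about 2.85 failures at 31 F, and about -0.28 failures at 81 F, which is impossible
```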
Each launch carries six O-rings, so Field is a binomial count out of 6 trials. For a binomial response variable, we use logistic regression.
mod3 <- glm(cbind(Field, 6) ~ Temp, family = binomial, data = Orings)
summary(mod3)
##
## Call:
## glm(formula = cbind(Field, 6) ~ Temp, family = binomial, data = Orings)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.9301 -0.7612 -0.5216 -0.1925 2.4573
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.20471 2.79211 1.864 0.06231 .
## Temp -0.11816 0.04368 -2.705 0.00683 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 24.038 on 22 degrees of freedom
## Residual deviance: 16.386 on 21 degrees of freedom
## AIC: 34.564
##
## Number of Fisher Scoring iterations: 5
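One caveat on the call above: glm()'s two-column binomial response is documented as (successes, failures), so with six O-rings per launch the second column should be 6 - Field rather than the constant 6 (the constant effectively treats each flight as having Field + 6 rings). A refit under the documented specification is sketched below; the estimates shift only slightly (Dalal et al. (1989) report 5.085 and -0.1156 for this model), and the predicted probability at 31 degrees F remains about 0.82.

```r
# two-column response: (number distressed, number not distressed) out of 6 rings
mod3b <- glm(cbind(Field, 6 - Field) ~ Temp, family = binomial, data = Orings)
coef(mod3b)
```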
Logistic regression models the log-odds (logit) of the probability of failure as a linear function of temperature, so inverting that transformation gives the fitted probability curve plotted below.
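Concretely, the fitted curve is the inverse logit of a linear predictor, and R's built-in plogis() computes exactly this transformation. A quick check at 70 degrees F, using the coefficient estimates printed above:

```r
eta <- 5.20471 - 0.11816 * 70   # linear predictor (log-odds) at 70 F
exp(eta) / (1 + exp(eta))       # about 0.045
plogis(eta)                     # identical; plogis() is the inverse logit
```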
Orings %>% ggplot(aes(x = Temp, y = Field/6)) +
geom_point() +
ylim(0,1) +
labs(x = "Temperature (F)",
y = "Probability of O-ring Failure",
title = "All past launches") +
stat_function(fun = function(x){
exp(mod3$coef[1] + mod3$coef[2]*x)/(1+exp(mod3$coef[1] + mod3$coef[2]*x))
})
So what is the predicted probability that a single O-ring fails at a launch temperature of 31 degrees F?
pred.prob <- predict(mod3, newdata = data.frame(Temp = 31), type = "response")
pred.prob
## 1
## 0.8237404
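The same number can be recovered by hand from the printed coefficients, a useful check on what type = "response" is doing:

```r
plogis(5.20471 - 0.11816 * 31)  # about 0.8237
```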
Thus, assuming the six O-rings fail independently, the probability of at least one O-ring failing is
sum(dbinom(1:6, 6, pred.prob))
## [1] 0.99997
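Since "at least one failure" is the complement of "no failures", the sum above can be checked directly:

```r
1 - (1 - pred.prob)^6            # about 0.99997
# equivalently: 1 - dbinom(0, 6, pred.prob)
```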