The Framingham heart study is a cohort longitudinal study on residents of a small Massachusetts town (Framingham) that has been studying risk factors for cardiovascular disease (CVD) for over 70 years, since 1948. The data set we’ll use here contains data on 4699 subjects (men and women between the ages of 30 and 62). The subjects were given biennial exams for blood pressure, serum cholesterol, and relative weight. These data are the 30-year follow-up data. Major endpoints include the occurrence of coronary heart disease (CHD) and deaths from CHD, cerebrovasular accident (CVA or stroke), cancer, and other causes.
Identify the common factors or characteristics that contribute to cardiovascular disease (CVD) by following its development over a long period of time in a large group of participants who had not yet developed overt symptoms of CVD or suffered a heart attack or stroke.
sex : Sex (1 = male; 2 = female)
sbp : Systolic Blood Pressure
dbp : Diastolic Blood Pressure
scl : Serum Cholesterol
chdfate : Coronary Heart Disease Indicator
followup : Follow-up in Days
age : Age in Years
bmi : Body Mass Index (wt (kg) / h^2 (m)
month : Study Month of Baseline Exam
id : Subject ID
Load in Framingham data:
fram <- read.table("http://www.math.montana.edu/shancock/data/Framingham.txt")
# Check data were read in correctly:
head(fram)
## sex sbp dbp scl chdfate followup age bmi month id
## 1 1 120 80 267 1 18 55 25.0 8 2642
## 2 1 130 78 192 1 35 53 28.4 12 4627
## 3 1 144 90 207 1 109 61 25.1 8 2568
## 4 1 92 66 231 1 147 48 26.2 11 4192
## 5 1 162 98 271 1 169 39 28.4 11 3977
## 6 2 212 118 182 1 199 61 33.3 2 659
tail(fram)
## sex sbp dbp scl chdfate followup age bmi month id
## 4694 2 136 92 197 0 11688 45 23.1 5 1976
## 4695 1 130 88 213 0 11688 47 28.4 9 3195
## 4696 2 112 68 252 0 11688 40 22.0 4 1674
## 4697 1 112 76 177 0 11688 40 23.5 12 4526
## 4698 2 125 75 244 0 11688 47 30.4 12 4536
## 4699 2 142 70 255 0 11688 44 23.3 4 1786
dim(fram)
## [1] 4699 10
names(fram)
## [1] "sex" "sbp" "dbp" "scl" "chdfate" "followup"
## [7] "age" "bmi" "month" "id"
We will only consider one potential predictor, systolic blood pressure, sbp
on the response variable chdfate
. We would like to investigate the research question: Does SBP influence probability of CHD?
Univariate distributions:
fram %>% ggplot(aes(x = chdfate)) +
geom_bar()
hist(fram$sbp)
Simple scatterplot and side-by-side boxplots (note response and explanatory variables reversed axes than typically used):
plot(chdfate~sbp, xlab="Systolic BP", ylab="Coronary Heart Disease Indicator", data=fram)
boxplot(sbp ~ chdfate, data=fram)
Since it is hard to visualize a response variable of 0’s and 1’s, let’s create subgroups for SBP and look at the proportion who experienced CHD in each group.
fram$sbpgrp <- with(fram,
cut(sbp, breaks = c(min(sbp),126,146,166,max(sbp)),
include.lowest=TRUE)
)
# Count number of CHD incidences within each group:
counts <- with(fram,
table(list(chdfate,sbpgrp))
)
counts
## .2
## .1 [80,126] (126,146] (146,166] (166,270]
## 0 1663 978 395 190
## 1 534 503 255 181
# Proportions in each group:
props <- counts[2,]/colSums(counts)
props
## [80,126] (126,146] (146,166] (166,270]
## 0.2430587 0.3396354 0.3923077 0.4878706
Now plot proportions and add to the original scatterplot.
# SBP Midpoints:
mids = c(103,136,156,218)
plot(chdfate~sbp, xlab="Systolic BP", ylab="Coronary Heart Disease Indicator", data=fram)
points(props~mids, xlab="Systolic BP", ylab="Proportion with CHD", col="tomato", pch = 16)
lines(lowess(fram$chdfate~fram$sbp), col="red")
Linear probability model:
mod1 <- glm(chdfate ~ sbp, family=binomial(link="identity"), data=fram)
summary(mod1)
##
## Call:
## glm(formula = chdfate ~ sbp, family = binomial(link = "identity"),
## data = fram)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8829 -0.8729 -0.7565 1.3610 1.9201
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.1889767 0.0386344 -4.891 1e-06 ***
## sbp 0.0037744 0.0002925 12.906 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5844.1 on 4698 degrees of freedom
## Residual deviance: 5689.0 on 4697 degrees of freedom
## AIC: 5693
##
## Number of Fisher Scoring iterations: 4
Logistic regression model: The default link for binomial family is logit.
mod2 <- glm(chdfate~sbp, family=binomial, data=fram)
summary(mod2)
##
## Call:
## glm(formula = chdfate ~ sbp, family = binomial, data = fram)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8320 -0.8668 -0.7634 1.3677 1.8368
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.008809 0.189815 -15.85 <2e-16 ***
## sbp 0.016593 0.001385 11.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5844.1 on 4698 degrees of freedom
## Residual deviance: 5695.7 on 4697 degrees of freedom
## AIC: 5699.7
##
## Number of Fisher Scoring iterations: 4
Probit regression model:
mod3 <- glm(chdfate~sbp, family=binomial(link="probit"), data=fram)
summary(mod3)
##
## Call:
## glm(formula = chdfate ~ sbp, family = binomial(link = "probit"),
## data = fram)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8426 -0.8680 -0.7618 1.3660 1.8506
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.8529512 0.1144066 -16.20 <2e-16 ***
## sbp 0.0102093 0.0008407 12.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5844.1 on 4698 degrees of freedom
## Residual deviance: 5694.3 on 4697 degrees of freedom
## AIC: 5698.3
##
## Number of Fisher Scoring iterations: 4
Add the three fitted models to the scatterplot:
plot(chdfate~sbp, xlab="Systolic BP", ylab="Coronary Heart Disease Indicator", data=fram)
points(props~mids, xlab="Systolic BP", ylab="Proportion with CHD", col="tomato", pch = 16)
# Linear:
abline(mod1,lwd=2,col="darkgreen")
# Logistic:
curve(exp(mod2$coef[1]+mod2$coef[2]*x)/(1+exp(mod2$coef[1]+mod2$coef[2]*x)),
add=T,lwd=2,lty=2,from=90, to=300,col="red")
# Probit:
curve(pnorm(mod3$coef[1]+mod3$coef[2]*x),
add=T,lwd=2,lty=3,col="blue")
legend("topleft",c("Linear","Logit","Probit"),lty=c(1,2,3),
col=c("darkgreen","red","blue"))
Let’s continue using mod2
, the logistic regression model.
Interpretations are always on the odds scale in logistic regression.
exp(mod2$coef)
## (Intercept) sbp
## 0.04935042 1.01673181
(Intercept)
The estimated odds of CHD for an individual with 0 mmHg systolic blood pressure are 0.0494 to 1.
sbp
A one mmHg increase in systolic blood pressure is associated with a 1.673% increase in estimated odds of CHD.
Confidence interval for slope - interpret on odds scale:
exp(confint(mod2))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 0.03394333 0.07144715
## sbp 1.01398776 1.01951091
A one mmHg increase in systolic blood pressure is associated with an increase of between 1.40% to 1.95% in the odds of CHD.
Estimated odds ratio of CHD for SBP = 140 compared to SBP = 120:
exp(mod2$coef[2] * (140-120))
## sbp
## 1.393568
# same as
exp(mod2$coef[2] * (110-90))
## sbp
## 1.393568
To calculate a 95% CI for this odds ratio, first calculate a CI for \(\beta\):
ztable <- summary(mod2)$coefficients
ztable
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.00880898 0.189815484 -15.85123 1.378551e-56
## sbp 0.01659338 0.001385338 11.97785 4.642017e-33
CIbeta <- ztable[2,1] + c(-1,1)*qnorm(.975)*ztable[2,2]
CIbeta
## [1] 0.01387816 0.01930859
Then multiply by 20 and transform the endpoints to get CI for \(e^{20\beta}\):
exp(CIbeta * 20)
## [1] 1.319910 1.471337
Estimated median effective level:
-mod2$coef[1]/mod2$coef[2]
## (Intercept)
## 181.3259
The estimated probability of CHD is 0.5 when systolic blood pressure is 181.33 mmHg.
Your rate of change in the odds of CHD with SBP is highest around an SBP of 181.33 mmHg.
Estimated risk of CHD and rate of change in risk for a one unit increase in SBP when SBP is around 140 (hypertension):
co <- mod2$coef
p140 <- exp(co[1]+co[2]*140)/(1+exp(co[1]+co[2]*140))
# Risk
p140
## (Intercept)
## 0.3349822
# Rate of change
co[2] * p140 * (1-p140)
## sbp
## 0.003696492
The predicted probability of CHD for an individual with systolic blood pressure of 140 mmMg is 0.335. That is, there is an estimated 33.5% chance of CHD at some point in the 30 years of the study.
For each additional mmMg for that individual, the predicted probability of CHD increases (additively) by 0.0037.
More efficiently, we can write a function to calculate estimated probabilities:
prob <- function(co,x){
# co = vector of coefficients (logistic model 1 predictor)
# x = x-value
as.numeric(exp(co[1]+co[2]*x)/(1+exp(co[1]+co[2]*x)))
}
# Predicted risk of CHD when SBP = 120:
prob(mod2$coef,120)
## [1] 0.2654944
Predictions using Logistic model, by hand:
co <- mod2$coef
# Odds for sbp = 110:
exp(mod2$coef[1]+mod2$coef[2]*110)
## (Intercept)
## 0.3061936
# Probability for sbp = 110:
exp(mod2$coef[1]+mod2$coef[2]*110)/(1+exp(mod2$coef[1]+mod2$coef[2]*110))
## (Intercept)
## 0.2344167
or by using the predict function:
?predict.glm
newdata <- data.frame(sbp = c(110,150))
# Predicted probabilities:
predict(mod2,newdata,type="response")
## 1 2
## 0.2344167 0.3728984
# Predicted log-odds:
predict(mod2,newdata,type="link")
## 1 2
## -1.1835377 -0.5198026
# Predicted odds:
exp(predict(mod2,newdata,type="link"))
## 1 2
## 0.3061936 0.5946379
Check the predictive power of our logistic regression model. First, create a classification table using a cutoff of the sample proportion of those that experienced CHD.
# Predictive probabilities
probs <- fitted(mod2)
# Proportion in sample with chdfate = 1
mean(fram$chdfate) # About 0.31
## [1] 0.313471
# Use 0.31 as cutoff to calculate sensitivity and specificity:
# pred prob > 0.31 -> 1; <= 0.31 -> 0
preds <- as.numeric(probs>0.31)
table(fram$chdfate,preds)
## preds
## 0 1
## 0 2036 1190
## 1 698 775
Sensitivity = P(y-hat = 1 | y = 1):
775/(698+775)
## [1] 0.5261371
Specificity = P(y-hat = 0 | y = 0):
2036/(2036+1190)
## [1] 0.6311221
The classification table is sensitive to our choice of cutoff. How do answers change if we use a cutoff of 0.5?
preds <- as.numeric(probs>0.5)
table(fram$chdfate,preds)
## preds
## 0 1
## 0 3134 92
## 1 1381 92
library(ROCR)
# arguments = fitted probabilities; observed 0/1
pred.obj <- prediction(probs, fram$chdfate)
# Plot ROC curve
plot(performance(pred.obj,"tpr","fpr"))
# Area under the ROC curve - not sure why this isn't working; need to investigate
performance(pred.obj,"auc")
## A performance instance
## 'Area under the ROC curve'
Our model is a very poor predictor of CHD.
# Recode sex to be an indicator of female
fram$sex <- fram$sex-1
Logistic regression with two predictors - no interaction between sbp and sex:
mod4 <- glm(chdfate ~ sbp + sex, family = "binomial", data = fram)
summary(mod4)
##
## Call:
## glm(formula = chdfate ~ sbp + sex, family = "binomial", data = fram)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7431 -0.8935 -0.6838 1.2765 1.9910
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.76479 0.19401 -14.25 <2e-16 ***
## sbp 0.01785 0.00142 12.57 <2e-16 ***
## sex -0.78261 0.06535 -11.97 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5844.1 on 4698 degrees of freedom
## Residual deviance: 5549.7 on 4696 degrees of freedom
## AIC: 5555.7
##
## Number of Fisher Scoring iterations: 4
exp(coef(mod4))
## (Intercept) sbp sex
## 0.06298909 1.01801052 0.45720929
Interpretations?
In-class: * sbp
: Within one sex, a one mmHG increase in systolic blood pressure is associated with a 1.8% estimated increase in odds of a heart attack. * sex
: We estimate that the odds of a heart attack are 54% lower for females compared to males, holding systolic blood pressure constant. OR The estimated odds ratio of a heart attack for females compared to males with the same systolic blood pressure is 0.4572.
spb
:
sex
:
Logistic regression with two predictors - interaction between sbp and sex:
mod5 <- glm(chdfate ~ sbp*sex, family = "binomial", data = fram)
mod5 <- glm(chdfate ~ 1 + sbp + sex + sbp:sex, family = "binomial", data = fram)
summary(mod5)
##
## Call:
## glm(formula = chdfate ~ 1 + sbp + sex + sbp:sex, family = "binomial",
## data = fram)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9162 -0.9263 -0.6713 1.2860 2.0446
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.088799 0.310533 -6.726 1.74e-11 ***
## sbp 0.012758 0.002315 5.512 3.55e-08 ***
## sex -1.866700 0.400881 -4.656 3.22e-06 ***
## sbp:sex 0.008048 0.002934 2.744 0.00608 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 5844.1 on 4698 degrees of freedom
## Residual deviance: 5542.2 on 4695 degrees of freedom
## AIC: 5550.2
##
## Number of Fisher Scoring iterations: 4
exp(coef(mod5))
## (Intercept) sbp sex sbp:sex
## 0.1238357 1.0128399 0.1546331 1.0080807
Look at the model within each sex:
Males (sex = 0): (\(x\) = sbp) \[ \log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) = -2.0888 + 0.0128x \]
The estimated odds of CHD for males increase by 1.28% for each additional mmMg in systolic blood pressure.
OR
A one mmMG increase in systolic blood pressure is associated with an estimated 1.28% increase in the odds of CHD for males.
Females (sex = 1): (\(x\) = sbp)
# Intercept for females
coef(mod5)[1]+coef(mod5)[3]
## (Intercept)
## -3.955499
# Slope for females
coef(mod5)[2]+coef(mod5)[4]
## sbp
## 0.02080645
\[ \begin{eqnarray*} \log\left(\frac{\hat{\pi}}{1-\hat{\pi}}\right) &= (-2.0888-1.8667) + (0.0128+0.008048)x \\ &= -3.9555 + 0.0208x \end{eqnarray*} \] For every one mmMg increase in SBP, the estimated increase of odds of CHD is 2.1% for females.
What does interaction between sex and sbp mean? * The effect of sbp on CHD (relationship between spb and CHD) is going to differ between men and women. * The effect of sex on CHD (relationship between sex and CHD) is going to depend on systolic blood pressure.
Interpretations In-class: * sex
: We estimate that the odds of a heart attack are 85% lower for females compared to males, among men and women with systolic blood pressure equal to zero. (no longer meaningful) OR The estimated odds ratio of a heart attack for females compared to males with the systolic blood pressure equal to zerois 0.4572. * sbp
: For males, a one mmHG increase in systolic blood pressure is associated with a 1.28% estimated increase in odds of a heart attack. –> For women, a one mmHG increase in systolic blood pressure is associated with a 2.1024415% estimated increase in odds of a heart attack. * sex:spb
: For each additional mmHg in systolic blood pressure, the estimated [odds ratio of CHD for females compared to males] increases by 0.80%. OR For each additional mmHg in systolic blood pressure, the estimated percent decrease in odds of CHD for females compared to males increases by 0.80%. OR The estimated odds ratio of CHD for a one mmHG increase in systolic blood pressure is 0.80% higher for women.
exp(coef(mod5))
## (Intercept) sbp sex sbp:sex
## 0.1238357 1.0128399 0.1546331 1.0080807
How the effect of SBP changes with sex:
The increase in estimated odds for a one mmHg increase in SBP is 0.8% higher for females than for males.
OR
How the effect of sex changes with SBP:
The change in estimated odds from males to females is 0.8% higher for each 1 mmHG increase in SBP.
OR
The estimated odds ratio of CHD for females compared to males increases by 0.8% for each additional mmHG in systolic blood pressure.