FoSSA: Fundamentals of Statistical Software & Analysis

C1.3 Univariable Logistic Regression

Learning Outcomes

By the end of this section, students will be able to:

  • Calculate and interpret a logistic regression for a binary exposure variable
  • Calculate and interpret a logistic regression for a categorical exposure variable
  • Calculate and interpret a logistic regression for a continuous exposure variable

You can download a copy of the slides here: Video C1.3a

Video C1.3a – Logistic Regression  (9 minutes)

You can download a copy of the slides here: Video C1.3b

Video C1.3b – Logistic Regression for Continuous & Categorical Variables  (8 minutes)

C1.3 PRACTICAL: Stata

Logistic regression with a binary exposure

To run a logistic regression in Stata, we use the ‘logistic’ command (the ‘logit’ command reports log odds, but in practice we are usually only interested in the odds ratio and its 95% CI). The setup of the command is:

logistic depvar indepvars

‘Depvar’ is your outcome variable. To use this command, you need to confirm that your outcome variable is coded as 0 (negative outcome) and 1 (positive outcome). ‘Indepvars’ is where you list your exposure (i.e. independent) variable.
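Logistic regression models the log odds of the outcome, so it is worth being comfortable with the probability → odds → log-odds mapping the command relies on. A minimal Python sketch with an illustrative prevalence (not taken from the course data):

```python
import math

def odds(p):
    # odds = p / (1 - p)
    return p / (1 - p)

p = 0.20                  # illustrative probability of a positive outcome
print(odds(p))            # odds of 0.25, i.e. 1 to 4
print(math.log(odds(p)))  # log odds, the scale logistic regression works on
```

An odds ratio reported by ‘logistic’ is simply the ratio of two such odds, one per exposure group.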

Question C1.3a: Dichotomise the variable ‘sbp’ into below 140 mmHg and greater than or equal to 140 mmHg. Call this variable “hyperten” to indicate “hypertensive”. Use logistic regression to examine the association of hypertension with the odds of having prior CVD (‘prior_cvd’). Interpret the output.

Answer

recode sbp (min/139=0) (140/max=1), gen(hyperten)

logistic prior_cvd hyperten

The output looks like this:

Logistic regression                                     Number of obs =  4,318
                                                        LR chi2(1)    =  36.97
                                                        Prob > chi2   = 0.0000
Log likelihood = -2406.9283                             Pseudo R2     = 0.0076

------------------------------------------------------------------------------
   prior_cvd | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    hyperten |   1.596948   .1215625     6.15   0.000     1.375612    1.853898
       _cons |    .289861   .0123634   -29.03   0.000     .2666144    .3151345
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

The odds of having prior CVD are 1.60 times greater for someone who is hypertensive compared with someone who is not (95% CI: 1.38-1.85). This effect is significant (p<0.001), so we can reject the null hypothesis that there is no association between hypertension and prior CVD (i.e. that the odds ratio is equal to 1).
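Since _cons estimates the baseline odds, the two numbers in the table are enough to recover the odds (and probabilities) in each exposure group. A quick check in Python, using the rounded figures from the output above (so results are approximate):

```python
baseline_odds = 0.2899   # _cons: odds of prior CVD when hyperten == 0
or_hyperten = 1.5969     # odds ratio for hyperten == 1 vs 0

odds_hyper = baseline_odds * or_hyperten       # odds of prior CVD when hypertensive
p_normo = baseline_odds / (1 + baseline_odds)  # probability, non-hypertensive group
p_hyper = odds_hyper / (1 + odds_hyper)        # probability, hypertensive group

print(round(odds_hyper, 3), round(p_normo, 3), round(p_hyper, 3))
```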

Logistic regression with a categorical exposure

We may use the logistic command to obtain odds ratios for a categorical variable such as BMI group. Putting ‘i.’ before a categorical exposure tells Stata to treat it as a factor variable, with the lowest category used as the reference category; if you want a different reference category, write ‘ib2.’ (or ‘ib3.’, etc.), with the number specifying whichever category you prefer. In this case, since the lowest category of BMI group (“underweight”) is quite small, we use the next category, “normal weight”, as the baseline (or reference) group:

logistic prior_cvd ib2.bmi_grp4

The output looks like this:

Logistic regression                                     Number of obs =  4,310
                                                        LR chi2(3)    =   3.64
                                                        Prob > chi2   = 0.3033
Log likelihood = -2420.1961                             Pseudo R2     = 0.0008

------------------------------------------------------------------------------
   prior_cvd | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    bmi_grp4 |
Underweight  |   .8995327   .3111124    -0.31   0.760     .4566888    1.771795
 Overweight  |   1.085572   .0811883     1.10   0.272     .9375593    1.256952
      Obese  |   1.252077   .1593964     1.77   0.077     .9755921    1.606918
             |
       _cons |   .3135531   .0173705   -20.94   0.000     .2812907    .3495158
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

Odds ratios in a logistic regression with a categorical variable are interpreted as the odds of the outcome in the specified level compared with a baseline level. The constant refers to the odds of the outcome in the baseline category. Here, the odds of having prior CVD are 25% higher if a participant has obesity compared to if they have a normal weight (OR: 1.25, 95% CI: 0.98-1.61). This association is not statistically significant (p=0.08) so we cannot reject the null hypothesis that the OR of obese vs normal weight is 1.

Logistic regression with an ordered categorical exposure

We can use the logistic command to obtain an odds ratio and a linear trend test for an ordered categorical variable such as BMI group. Note that by not putting ‘i.’ before the exposure variable ‘bmi_grp4’, we have told Stata to treat it as a continuous variable rather than as a categorical variable.

logistic prior_cvd bmi_grp4

The output looks like this:

Logistic regression                                     Number of obs =  4,310
                                                        LR chi2(1)    =   3.50
                                                        Prob > chi2   = 0.0614
Log likelihood = -2420.2658                             Pseudo R2     = 0.0007

------------------------------------------------------------------------------
   prior_cvd | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    bmi_grp4 |   1.105901   .0594653     1.87   0.061     .9952821    1.228813
       _cons |   .2545908   .0376265    -9.26   0.000     .1905645    .3401288
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

Note that the Wald test here is equivalent to the chi-square test for linear trend (Section C1.2b, ‘chi-square test for trend’).  The null hypothesis is that there is no association between BMI group and prior CVD (i.e. that the linear trend OR equals 1), and the alternative hypothesis is that there is a linear increasing or decreasing trend.  Here, the p-value (p=0.06) indicates that there is insufficient evidence against the null hypothesis (although this is a borderline p-value), and we conclude that there is not a statistically significant increasing trend in the log odds of prior CVD across groups of BMI.
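Stata reports the standard error on the odds-ratio scale, but the Wald z statistic is calculated on the log-odds scale. Using the delta-method approximation se(log OR) ≈ se(OR) / OR, the z statistic in the output above can be reproduced from the reported figures (a sketch; the approximation is standard but the inputs are rounded):

```python
import math

or_trend = 1.105901    # OR per one-step increase in BMI group (from the output)
se_or = 0.0594653      # standard error reported on the OR scale

se_log = se_or / or_trend          # delta method: SE on the log-odds scale
z = math.log(or_trend) / se_log    # Wald z statistic
print(round(z, 2))                 # matches the z = 1.87 in the output
```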

Logistic regression with a continuous exposure

To fit a logistic regression with a continuous exposure, we type the variable as it is (with no prefix) and we interpret the OR in terms of a 1 unit change in the exposure:

logistic prior_cvd hdlc

The output:

Logistic regression                                     Number of obs =  4,302
                                                        LR chi2(1)    =  54.04
                                                        Prob > chi2   = 0.0000
Log likelihood = -2391.5959                             Pseudo R2     = 0.0112

------------------------------------------------------------------------------
   prior_cvd | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        hdlc |   .4875979   .0488451    -7.17   0.000     .4006755    .5933772
       _cons |   .7168523   .0790327    -3.02   0.003     .5775438    .8897631
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

The odds of having prior CVD were 51% lower for each 1 unit increase in HDL-C (OR: 0.49, 95%CI: 0.40-0.59). This association was statistically significant.
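Because the model is linear on the log-odds scale, the OR for a k-unit change in a continuous exposure is the 1-unit OR raised to the power k. For example, with the rounded OR from the output above:

```python
or_per_unit = 0.4876   # OR per 1-unit increase in HDL-C (rounded from the output)
k = 2
or_per_2_units = or_per_unit ** k   # OR for a 2-unit increase
print(round(or_per_2_units, 3))
```

So a 2-unit increase in HDL-C is associated with roughly a quartering of the odds of prior CVD.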

Question C1.3bi.: Use logistic regression to obtain odds ratios for the association of BMI group with prior diabetes (‘prior_t2dm’). Try fitting this variable both as a categorical variable and as a linear trend.

Question C1.3b.ii: Use logistic regression to assess the association between ‘prior_t2dm’ and the continuous exposure variable ‘hdlc’. How would you interpret this output?

C1.3b Answers

Answer C1.3bi:

Fitting BMI group as a categorical variable in logistic regression gives the following results:

logistic prior_t2dm ib2.bmi_grp4

Logistic regression                                     Number of obs =  4,310
                                                        LR chi2(3)    =   7.81
                                                        Prob > chi2   = 0.0501
Log likelihood = -969.91672                             Pseudo R2     = 0.0040

------------------------------------------------------------------------------
  prior_t2dm | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    bmi_grp4 |
Underweight  |   1.626374   .8656765     0.91   0.361      .572991    4.616287
 Overweight  |   1.229727   .1732047     1.47   0.142     .9330798    1.620686
      Obese  |   1.799443   .3808491     2.78   0.006     1.188454    2.724541
             |
       _cons |   .0534665   .0057527   -27.22   0.000     .0433009    .0660186
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

The odds ratio of prior diabetes for underweight vs normal weight (as BMI group 2 is the reference group) is 1.63 (95% CI: 0.57-4.62).  The z-statistic for this odds ratio is 0.91 and p=0.361 so this odds ratio is not significantly different to 1.

The odds ratio for overweight vs normal weight is 1.23 (95% CI: 0.93-1.62).  The z-statistic for this odds ratio is 1.47 and p=0.142 so this odds ratio is not significantly different to 1.

The odds ratio for obese vs normal weight is 1.80 (95% CI: 1.19-2.72).  The z-statistic for this odds ratio is 2.78 and p=0.006 so this odds ratio is significantly different to 1.

Fitting ‘BMI group’ as a continuous variable in logistic regression gives the following results:

logistic prior_t2dm bmi_grp4

Logistic regression                                     Number of obs =  4,310
                                                        LR chi2(1)    =   5.75
                                                        Prob > chi2   = 0.0165
Log likelihood = -970.94533                             Pseudo R2     = 0.0030

------------------------------------------------------------------------------
  prior_t2dm | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    bmi_grp4 |   1.263954   .1229826     2.41   0.016     1.044503    1.529513
       _cons |   .0337493   .0092462   -12.37   0.000     .0197271    .0577384
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

The odds ratio associated with a unit increase in BMI group is 1.26 (95% CI: 1.04-1.53).  Note that the Wald test here is equivalent to the chi-square test for linear trend.  The null hypothesis is that there is no association between prior diabetes and BMI group (i.e. that the linear trend OR equals 1), and the alternative hypothesis is that there is a linear increasing or decreasing trend.  Here, p=0.016 indicates that there is a statistically significant increasing linear trend in the log odds of prior diabetes.

Answer C1.3b.ii:

The output is as follows:

logistic prior_t2dm hdlc

Logistic regression                                     Number of obs =  4,302
                                                        LR chi2(1)    =  17.93
                                                        Prob > chi2   = 0.0000
Log likelihood = -964.36248                             Pseudo R2     = 0.0092

------------------------------------------------------------------------------
  prior_t2dm | Odds ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
        hdlc |   .4634116   .0866931    -4.11   0.000     .3211646    .6686613
       _cons |   .1421395   .0282277    -9.82   0.000     .0963104    .2097762
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

For every 1 unit increase in HDL-C, the odds of having prior diabetes were 54% lower (OR: 0.46, 95% CI: 0.32-0.67). This association is statistically significant (p<0.001) so we can reject the null hypothesis that there is no association of HDL-C with prior diabetes.
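The “54% lower” interpretation comes straight from the odds ratio: the percentage change in odds is (OR − 1) × 100. A quick check with the rounded OR:

```python
or_hdlc = 0.4634                  # OR per 1-unit increase in HDL-C (rounded)
pct_change = (or_hdlc - 1) * 100  # negative values mean lower odds
print(round(pct_change, 1))       # about -54, i.e. odds roughly 54% lower
```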

C1.3 PRACTICAL: SPSS

Logistic regression with a binary exposure

Firstly, create a new variable to dichotomise the variable ‘sbp’ into below 140 mmHg and greater than or equal to 140 mmHg. Call this variable “hyperten” to indicate “hypertensive”, coding No = 0 and Yes = 1.

We are now going to use logistic regression to examine the association of hypertension with the odds of having prior CVD (‘prior_cvd’).

Select

Analyze>> Regression >> Binary Logistic

Place prior_cvd in the ‘Dependent Variable’ box and ‘hyperten’ in the ‘Covariates’ box.

Unlike with the ‘Crosstabs’ function for odds ratios and the chi-squared test, the Logistic Regression function allows you to define your reference category. To do this, click on ‘Categorical’ on the right-hand side of the pop-up box. Move the ‘hyperten’ variable into the ‘Categorical Covariates’ box and indicate that you want to use the first category as the Reference Category. This way the output will compare the yes-hypertension group with the no-hypertension group (rather than the other way around).

Then click on ‘Statistics and Plots’ and tick the box next to ‘CI for exp(B)’. This will show the details of the confidence interval for the odds ratio. Once you have ticked the box you will then be able to alter the size of the confidence interval you wish to display. Leave it as the default value of 95%. Click Continue and then OK to run the test.

Question C1.3a: Can you interpret the output from the above analysis?

Answer

SPSS gives a lot of output tables for a logistic regression, but the one you are really interested in is the last one, which looks like the below.

Exp(B) is the odds ratio. The odds of having prior CVD are 1.59 times greater for someone who is hypertensive compared with someone who is not (95% CI: 1.38-1.85). This effect is significant (p<0.001), so we can reject the null hypothesis that there is no association between hypertension and prior CVD (i.e. that the odds ratio is equal to 1).

Logistic regression with a categorical exposure

We can use logistic regression to obtain odds ratios for a categorical variable such as BMI group. We follow the same process, but put the categorical variable into the ‘Covariates’ box, and define the reference category in the ‘Categorical’ tab as before.

A logistic regression output between prior_cvd and bmi_grp4 would look like the below.

Odds ratios in a logistic regression with a categorical variable are interpreted as the odds of the outcome in the specified level compared with a baseline level. The constant refers to the odds of the outcome in the baseline category. Here, the odds of having prior CVD are 39% higher if a participant has obesity compared to if they are underweight (OR: 1.39, 95% CI: 0.69-2.82). This association is not statistically significant (p=0.36) so we cannot reject the null hypothesis that the OR of obese vs underweight is 1.

SPSS only allows the first or last categories to be defined as the reference category. If you wanted to compare each of the categories to the ‘normal weight’ category, you would first need to recode the variable so that ‘normal’ was associated with either the highest or lowest integer used for coding the groups.
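For instance, if the groups were coded 1 = underweight, 2 = normal, 3 = overweight, 4 = obese (hypothetical codes for illustration), making ‘normal’ the reference under SPSS's ‘first category’ option means remapping it to the lowest value. Sketched in Python rather than SPSS syntax:

```python
# hypothetical recode: swap the codes for 'normal' (2) and 'underweight' (1)
# so that 'normal' becomes the lowest category and hence the reference
recode = {2: 1, 1: 2, 3: 3, 4: 4}

original = [1, 2, 2, 3, 4]
recoded = [recode[v] for v in original]
print(recoded)   # [2, 1, 1, 3, 4]
```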

Logistic regression with an ordered categorical exposure

If we do not define the bmi_grp variable in the ‘Categorical’ tab, then we are telling SPSS that our covariate is to be treated as continuous rather than categorical. The output from this analysis looks like the below.

The null hypothesis is that there is no association between BMI group and prior CVD (i.e. that the linear trend OR equals 1), and the alternative hypothesis is that there is a linear increasing or decreasing trend.  Here, the p-value (p=0.06) indicates that there is insufficient evidence against the null hypothesis (although this is a borderline p-value), and we conclude that there is not a statistically significant increasing trend in the log odds of prior CVD across groups of BMI.

Logistic regression with a continuous exposure

To fit a logistic regression with a continuous exposure, we use the same process and we interpret the OR in terms of a 1 unit change in the exposure. Below is an output table for prior_cvd with hdlc as the covariate.

The odds of having prior CVD were 51% lower for each 1 unit increase in HDL-C (OR: 0.49, 95%CI: 0.40-0.59). This association was statistically significant.

Question C1.3bi.: Use logistic regression to obtain odds ratios for the association of BMI group with prior diabetes (‘prior_t2dm’). Try fitting this variable both as a categorical variable and as a linear trend.

Question C1.3b.ii: Use logistic regression to assess the association between ‘prior_t2dm’ and the continuous exposure variable ‘hdlc’. How would you interpret this output? 

Answer

Answer C1.3b.i:

Fitting BMI group as a categorical variable in logistic regression gives the following results. In this example I have recoded the bmi_grp4 variable to compare all groups to the normal weight group.

The odds ratio of prior diabetes for underweight vs normal weight is 1.63 (95% CI: 0.57-4.62).  p=0.361 so this odds ratio is not significantly different to 1.

The odds ratio for overweight vs normal weight is 1.23 (95% CI: 0.93-1.62). p=0.142 so this odds ratio is not significantly different to 1.

The odds ratio for obese vs normal weight is 1.80 (95% CI: 1.19-2.72). p=0.006 so this odds ratio is significantly different to 1.

Fitting ‘BMI group’ as a continuous variable in logistic regression gives the following results. This example uses the original bmi_grp4 variable.

The odds ratio associated with a unit increase in BMI group in the past year is 1.26 (95% CI: 1.04-1.53).  Note that the Wald test here is equivalent to the chi-square test for linear trend.  The null hypothesis is that there is no association between prior diabetes and BMI group (i.e. that the linear trend OR equals 1), and the alternative hypothesis is that there is a linear increasing or decreasing trend.  Here, p=0.016 indicates that there is a statistically significant increasing linear trend in log odds of prior diabetes.

Answer C1.3b.ii:

Including HDL-C as a continuous variable gives the following results.

For every 1 unit increase in HDL-C, the odds of having prior diabetes were 54% lower (OR: 0.46, 95% CI: 0.32-0.67). This association is statistically significant (p<0.001) so we can reject the null hypothesis that there is no association of HDL-C with prior diabetes.

C1.3 PRACTICAL: R

Logistic regression with a binary exposure

To run a logistic regression in R, we use the ‘glm()’ command. The setup of the command is:

glm(formula, family, data)

where:

  • formula has the format “outcome ~ exposure”,
  • family is ‘binomial(link = “logit”)’ for logistic regression, and
  • data corresponds to the dataframe including our data.

To use this command, you need to confirm that your outcome variable is coded as a factor variable. You can find more details on the glm() command in this help file.

Using the ‘glm()’ command above, the log OR (Estimate) and its standard error (Std. Error) are printed, and we estimate the OR and the 95% confidence intervals by exponentiating the results with this post-estimation command:

 exp(cbind(coef(model), confint(model)))
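confint() computes profile-likelihood intervals, but a Wald interval, exp(estimate ± 1.96 × SE), usually gives very similar limits and makes the arithmetic explicit. A Python sketch with illustrative coefficient values (not from any particular model fit):

```python
import math

beta, se = 0.4697, 0.0764   # illustrative log-odds estimate and standard error

or_est = math.exp(beta)           # odds ratio
lo = math.exp(beta - 1.96 * se)   # lower 95% Wald limit
hi = math.exp(beta + 1.96 * se)   # upper 95% Wald limit
print(round(or_est, 2), round(lo, 2), round(hi, 2))
```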

  • Question C1.3a: Dichotomise the variable ‘sbp’ into below 140 mmHg and greater than or equal to 140 mmHg. Call this variable “hyperten” to indicate “hypertensive”. Use logistic regression to examine the association of hypertension with the odds of having prior CVD (‘prior_cvd’). Interpret the output.
Answer

Answer C1.3a:

To dichotomise the “sbp” variable, use the following command:

df[df$sbp < 140, "hyperten"] <- 0

df[df$sbp >= 140, "hyperten"] <- 1

To run the logistic regression we use the glm() command:

model <- glm(prior_cvd ~ hyperten, family = binomial(link = "logit"), data = df)

summary(model)

The output of the glm() command is the following:

Call:
glm(formula = "prior_cvd ~ hyperten", family = binomial(link = "logit"), data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8753  -0.7157  -0.7157   1.5132   1.7248  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.23143    0.04284 -28.747  < 2e-16 ***
hyperten     0.46966    0.07644   6.144 8.03e-10 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4805.5  on 4265  degrees of freedom
Residual deviance: 4768.6  on 4264  degrees of freedom
AIC: 4772.6

Number of Fisher Scoring iterations: 4

The lnOR (Estimate) and its standard error (Std. Error) are printed, and we estimate the OR and the 95% confidence intervals by exponentiating:

>  exp(cbind(coef(model), confint(model)))
Waiting for profiling to be done…
                         2.5 %    97.5 %
(Intercept) 0.291874 0.2681921 0.3172374
hyperten    1.599446 1.3762597 1.8571969

The odds of having prior CVD are 1.60 times greater for someone who is hypertensive compared with someone who is not (95% CI: 1.38-1.86). This effect is significant (p<0.001), so we can reject the null hypothesis that there is no association between hypertension and prior CVD (i.e. that the odds ratio is equal to 1).

Logistic regression with a categorical exposure

We may use the glm() command to obtain odds ratios for a categorical variable such as BMI group. By default, the category with the lowest numerical value is treated as the reference category. We fit the logistic regression model with the following command:

model2 <- glm(prior_cvd ~ bmi_grp4, family = binomial(link = "logit"), data = df)

summary(model2)

The output looks like this:

 Call:
glm(formula = "prior_cvd ~ bmi_grp4", family = binomial(link = "logit"), data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8138  -0.7674  -0.7408   1.5913   1.7285  

Coefficients:
                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)        -1.23969    0.34238  -3.621 0.000294 ***
bmi_grp4Normal      0.08689    0.34687   0.250 0.802210    
bmi_grp4Overweight  0.16795    0.34607   0.485 0.627460    
bmi_grp4Obese       0.30471    0.36106   0.844 0.398707    

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4805.5  on 4265  degrees of freedom
Residual deviance: 4802.1  on 4262  degrees of freedom
AIC: 4810.1

Number of Fisher Scoring iterations: 4

Odds ratios in a logistic regression with a categorical variable are interpreted as the odds of the outcome in the specified level compared with a baseline level. The constant refers to the odds of the outcome in the baseline category. We estimate the OR and the 95% confidence intervals by exponentiating:

 exp(cbind(coef(model2), confint(model2)))

Waiting for profiling to be done…
                                 2.5 %   97.5 %
(Intercept)        0.2894737 0.1409194 0.546895
bmi_grp4Normal     1.0907740 0.5717535 2.257744
bmi_grp4Overweight 1.1828794 0.6211096 2.445084
bmi_grp4Obese      1.3562290 0.6893090 2.875507

Here, the odds of having prior CVD are 36% higher if a participant has obesity compared to if they are underweight, the reference category in this model (OR: 1.36, 95% CI: 0.69-2.88). This association is not statistically significant (p=0.40) so we cannot reject the null hypothesis that the OR of obese vs underweight is 1.

Logistic regression with an ordered categorical exposure

We can use the glm() command to obtain an odds ratio and a linear trend test for an ordered categorical variable such as BMI group. To do this, we tell R that our exposure variable is to be treated as a continuous variable rather than as a categorical variable, using the ‘as.numeric()’ command in the formula.

The output looks like this:

> model3 <- glm(prior_cvd ~ as.numeric(bmi_grp4), family = binomial(link = "logit"), data = df)
> summary(model3)

Call:
glm(formula = "prior_cvd ~ as.numeric(bmi_grp4)", family = binomial(link = "logit"), 
    data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8040  -0.7710  -0.7389   1.6042   1.7359  

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -1.35370    0.14824  -9.132   <2e-16 ***
as.numeric(bmi_grp4)  0.09753    0.05391   1.809   0.0704 .  

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4805.5  on 4265  degrees of freedom
Residual deviance: 4802.3  on 4264  degrees of freedom
AIC: 4806.3

Number of Fisher Scoring iterations: 4

Note that the Wald test here is equivalent to the chi-square test for linear trend (Section C1.2b, ‘chi-square test for trend’).  The null hypothesis is that there is no association between BMI group and prior CVD (i.e. that the linear trend OR equals 1), and the alternative hypothesis is that there is a linear increasing or decreasing trend. Here, the p-value (p=0.07) indicates that there is insufficient evidence against the null hypothesis (although this is a borderline p-value), and we conclude that there is not a statistically significant increasing trend in the log odds of prior CVD across groups of BMI.
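The difference between the categorical fit and this linear-trend fit is simply how the exposure enters the design matrix: one 0/1 dummy per non-reference level, versus a single numeric column. A sketch in Python with hypothetical group codes 1-4:

```python
codes = [1, 2, 2, 3, 4]   # hypothetical BMI group codes for five observations

# categorical coding: one 0/1 dummy per non-reference level (reference = code 1)
dummies = [[int(c == level) for level in (2, 3, 4)] for c in codes]

# linear-trend coding: the code itself enters as one numeric column
numeric = [float(c) for c in codes]

print(dummies)   # [[0, 0, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(numeric)   # [1.0, 2.0, 2.0, 3.0, 4.0]
```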

Logistic regression with a continuous exposure

To fit a logistic regression with a continuous exposure, we type the variable as it is (with no prefix) and we interpret the OR in terms of a 1-unit change in the exposure:

> model4 <- glm(prior_cvd ~ hdlc, family = binomial(link = "logit"), data = df)
> summary(model4)

Call:
glm(formula = "prior_cvd ~ hdlc", family = binomial(link = "logit"), 
    data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.9867  -0.7962  -0.7180   1.3712   2.2951  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.3131     0.1106   -2.83  0.00466 ** 
hdlc         -0.7316     0.1005   -7.28 3.34e-13 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 4805.5  on 4265  degrees of freedom
Residual deviance: 4749.7  on 4264  degrees of freedom
AIC: 4753.7

Number of Fisher Scoring iterations: 4

We estimate the OR and the 95% confidence intervals by exponentiating:

> exp(cbind(coef(model4), confint(model4)))
Waiting for profiling to be done…
                          2.5 %    97.5 %
(Intercept) 0.7311849 0.5887604 0.9084966
hdlc        0.4811146 0.3945020 0.5850291

The odds of having prior CVD were 52% lower for each 1 unit increase in HDL-C (OR: 0.48, 95%CI: 0.39-0.59). This association was statistically significant.

  • Question C1.3bi.: Use logistic regression to obtain odds ratios for the association of BMI group with prior diabetes (‘prior_t2dm’). Try fitting this variable both as a categorical variable and as a linear trend.
  • Question C1.3b.ii: Use logistic regression to assess the association between ‘prior_t2dm’ and the continuous exposure variable ‘hdlc’. How would you interpret this output?  
Answer

Answer C1.3bi.:

We fit the logistic regression model with the following command:

model5 <- glm(prior_t2dm ~ bmi_grp4, family = binomial(link = "logit"), data = df)

summary(model5)

The output looks like this:

> model5 <- glm(prior_t2dm ~ bmi_grp4, family = binomial(link = "logit"), data = df)
> summary(model5)

Call:
glm(formula = prior_t2dm ~ bmi_grp4, family = binomial(link = "logit"), 
    data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.4286  -0.3544  -0.3544  -0.3211   2.4457  

Coefficients:
                     Estimate    Std. Error z value Pr(>|z|)    
(Intercept)          -2.9391     0.1088     -27.023 < 2e-16 ***
bmi_grp4Obese         0.5979     0.2122     2.817   0.00485 ** 
bmi_grp4Overweight    0.2029     0.1424     1.425   0.15426    
bmi_grp4Underweight   0.5187     0.5330     0.973   0.33040    

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1914.6  on 4265  degrees of freedom
Residual deviance: 1906.6  on 4262  degrees of freedom
AIC: 1914.6

Note that the reference group in the model is those with a normal weight. We can estimate the different ORs and their corresponding 95% confidence intervals from the model by taking the exponential:

> exp(cbind(coef(model5), confint(model5)))
Waiting for profiling to be done…
                                  2.5 %     97.5 %
(Intercept)         0.0529132 0.0424526 0.06505699
bmi_grp4Obese       1.8182592 1.1845276 2.72916539
bmi_grp4Overweight  1.2249272 0.9280857 1.62319656
bmi_grp4Underweight 1.6799001 0.4981225 4.25397420

The odds ratio of prior diabetes for underweight vs normal weight (normal weight being the reference group) is 1.68 (95% CI: 0.50-4.25). The z-statistic for this odds ratio is 0.97 and p=0.330 so this odds ratio is not significantly different to 1.

The odds ratio for overweight vs normal weight is 1.22 (95% CI: 0.93-1.62).  The z-statistic for this odds ratio is 1.43 and p=0.154 so this odds ratio is not significantly different to 1.

Finally, the odds ratio for obese vs normal weight is 1.82 (95% CI: 1.18-2.73).  The z-statistic for this odds ratio is 2.82 and p=0.005 so this odds ratio is significantly different to 1.

Fitting ‘BMI group’ as a continuous variable in logistic regression gives the following results:

> model6 <- glm(prior_t2dm ~ as.numeric(bmi_grp4), family = binomial(link = "logit"), data = df)
> summary(model6) 

Call:
glm(formula = prior_t2dm ~ as.numeric(bmi_grp4), family = binomial(link = "logit"), 
    data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.4048  -0.3613  -0.3613  -0.3223   2.5332  

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)    
(Intercept)          -3.40303    0.27652 -12.306   <2e-16 ***
as.numeric(bmi_grp4)  0.23561    0.09811   2.402   0.0163 *  

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1914.6  on 4265  degrees of freedom
Residual deviance: 1908.9  on 4264  degrees of freedom
AIC: 1912.9

Number of Fisher Scoring iterations: 5

> exp(cbind(coef(model6), confint(model6)))
Waiting for profiling to be done…
                                     2.5 %     97.5 %
(Intercept)          0.03327238 0.01923618 0.05689893
as.numeric(bmi_grp4) 1.26568480 1.04369721 1.53348499 

The odds ratio associated with a unit increase in BMI group is 1.27 (95% CI: 1.04-1.53).  Note that the Wald test here is equivalent to the chi-square test for linear trend.  The null hypothesis is that there is no association between prior diabetes and BMI group (i.e. that the linear trend OR equals 1), and the alternative hypothesis is that there is a linear increasing or decreasing trend.  Here, p=0.016 indicates that there is a statistically significant increasing linear trend in the log odds of prior diabetes.

Answer C1.3b.ii:

The output is as follows:

> model7 <- glm(prior_t2dm ~ hdlc, family = binomial(link = "logit"), data = df)
> summary(model7)

Call:
glm(formula = prior_t2dm ~ hdlc, family = binomial(link = "logit"), 
    data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.5043  -0.3757  -0.3448  -0.3069   2.7934  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.9596     0.2005  -9.775  < 2e-16 ***
hdlc         -0.7717     0.1887  -4.089 4.34e-05 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1914.6  on 4265  degrees of freedom
Residual deviance: 1896.9  on 4264  degrees of freedom
AIC: 1900.9

Number of Fisher Scoring iterations: 5

> exp(cbind(coef(model7), confint(model7)))
Waiting for profiling to be done…
                           2.5 %    97.5 %
(Intercept) 0.1409122 0.09493215 0.2083414
hdlc        0.4622490 0.31763986 0.6657489

For every 1 unit increase in HDL-C, the odds of having prior diabetes were 54% lower (OR: 0.46, 95% CI: 0.32-0.67). This association is statistically significant (p<0.001) so we can reject the null hypothesis that there is no association of HDL-C with prior diabetes.

 