FoSSA: Fundamentals of Statistical Software & Analysis

Learning Outcomes

By the end of this section, students will be able to:

  • Understand the theory of proportional hazards models
  • Calculate and interpret the results of a Cox proportional hazards regression model
  • Test the assumptions underlying a Cox regression model

Video C2.3a – Cox Model (8 minutes) 

C2.3a PRACTICAL: Stata

Unadjusted Cox regression

The ‘stcox’ command runs a Cox proportional hazards regression model. You must run the ‘stset’ command before using stcox; because the data have already been declared as survival data, you do not need to specify the outcome or time variable again. You only need to specify the explanatory variables that you are examining in your model.

stcox [varlist] [if] [in] [, options]

For example, for frailty, the command would be:

stcox i.frailty

The output looks like this:

Cox regression with Breslow method for ties

No. of subjects =       4,327                           Number of obs =  4,327
No. of failures =       1,526
Time at risk    = 29,476.9825

                                                        LR chi2(4)    = 577.22
Log likelihood = -12128.151                             Prob > chi2   = 0.0000

------------------------------------------------------------------------------
          _t | Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     frailty |
          2  |    1.45484   .1522779     3.58   0.000     1.185005    1.786119
          3  |   1.944776    .195232     6.63   0.000     1.597421    2.367664
          4  |   2.884121   .2643057    11.56   0.000     2.409949    3.451588
          5  |   5.740893   .5005441    20.04   0.000     4.839091    6.810752
------------------------------------------------------------------------------

Looking at the output, you can see that the hazard of dying is nearly 6 times greater for participants in frailty level 5 compared with frailty level 1 (HR 5.7, 95% CI:4.8-6.8). This is statistically significant.

  • Question C2.3a.i: What is the association of vitamin D (using the variable coded above and below the median) with the risk (hazard) of dying?
C2.3a.i. Answer

stcox i.vitd_med

      Failure _d: death
  Analysis time _t: fu_years
       ID variable: whl1_id

Iteration 0:   log likelihood = -12416.762
Iteration 1:   log likelihood = -12356.639
Iteration 2:   log likelihood = -12356.629
Refining estimates:
Iteration 0:   log likelihood = -12356.629

Cox regression with Breslow method for ties

No. of subjects =       4,327                           Number of obs =  4,327
No. of failures =       1,526
Time at risk    = 29,476.9825
LR chi2(1)    = 120.27
Log likelihood = -12356.629                             Prob > chi2   = 0.0000

------------------------------------------------------------------------------
          _t | Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
  2.vitd_med |    .566297   .0298226   -10.80   0.000     .5107612    .6278712
------------------------------------------------------------------------------

Participants who have vitamin D concentrations above the median have a 43% lower hazard (or risk) of dying, and this is statistically significant (HR 0.57, 95% CI: 0.51-0.63).
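The “43% lower hazard” in the answer comes directly from the hazard ratio. A minimal sketch of the arithmetic (the function name is ours, and the HR is taken from the output above):

```python
def percent_change_from_hr(hr):
    """Express a hazard ratio as a percent change in hazard.

    HR < 1 gives a negative value (a percent reduction);
    HR > 1 gives a positive value (a percent excess).
    """
    return (hr - 1.0) * 100.0

# HR for vitamin D above vs below the median, from the stcox output above
hr_vitd = 0.566297
print(f"{percent_change_from_hr(hr_vitd):.1f}%")  # -43.4%, i.e. a 43% lower hazard
```

The same arithmetic works for HRs above 1: an HR of 5.74 corresponds to a 474% higher hazard, which is why we usually describe it as “nearly 6 times greater” instead.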

Multivariable Cox Regression

We can adjust for confounders in a Cox regression as we do in any other regression model. For example, if we want to examine the independent association of vitamin D with the hazard of mortality, adjusted for age group, we type:

stcox vitd_med i.age_grp

The output should look like this:

        Failure _d: death
  Analysis time _t: fu_years
       ID variable: whl1_id

Iteration 0:   log likelihood = -12416.762
Iteration 1:   log likelihood = -12186.532
Iteration 2:   log likelihood = -12137.041
Iteration 3:   log likelihood = -12136.832
Iteration 4:   log likelihood = -12136.832
Refining estimates:
Iteration 0:   log likelihood = -12136.832

Cox regression with Breslow method for ties

No. of subjects =       4,327                           Number of obs =  4,327
No. of failures =       1,526
Time at risk    = 29,476.9825
                                                        LR chi2(4)    = 559.86
Log likelihood = -12136.832                             Prob > chi2   = 0.0000

------------------------------------------------------------------------------
          _t | Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
    vitd_med |   .6728851    .035984    -7.41   0.000      .605928    .7472412
             |
     age_grp |
      71-75  |   1.860847   .1792291     6.45   0.000     1.540729    2.247475
      76-80  |   3.128741   .3046026    11.72   0.000     2.585233    3.786513
      81-95  |   5.800825   .5781615    17.64   0.000     4.771463    7.052256
------------------------------------------------------------------------------

We can see that once the association of vitamin D has been adjusted for age group, the hazard ratio weakens from 0.57 (unadjusted HR, from section C2.3a.i.) to 0.67. There is still a statistically significant association (p<0.001), but it is weaker (i.e. closer to the null value of HR=1) because we controlled for the important confounder of age.

  • Question C2.3a.ii: What is the impact of frailty on the hazard of dying, once adjusted for age_grp?
C2.3a.ii. Answer

stcox i.frailty i.age_grp

        Failure _d: death
  Analysis time _t: fu_years
       ID variable: whl1_id

Iteration 0:   log likelihood = -12416.762
Iteration 1:   log likelihood = -12076.963
Iteration 2:   log likelihood =  -11986.28
Iteration 3:   log likelihood = -11985.881
Iteration 4:   log likelihood = -11985.881
Refining estimates:
Iteration 0:   log likelihood = -11985.881

Cox regression with Breslow method for ties

No. of subjects =       4,327                           Number of obs =  4,327
No. of failures =       1,526
Time at risk    = 29,476.9825
                                                        LR chi2(7)    = 861.76
Log likelihood = -11985.881                             Prob > chi2   = 0.0000

------------------------------------------------------------------------------
          _t | Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     frailty |
          2  |   1.321821   .1386221     2.66   0.008      1.07623    1.623453
          3  |   1.612825   .1631257     4.73   0.000       1.3228    1.966437
          4  |   2.226646   .2074741     8.59   0.000     1.854975    2.672787
          5  |   4.125627   .3706378    15.78   0.000     3.459551    4.919944
             |
     age_grp |
      71-75  |   1.758254   .1695919     5.85   0.000     1.455389    2.124144
      76-80  |   2.645494   .2594916     9.92   0.000       2.1828    3.206266
      81-95  |    4.31949   .4368414    14.47   0.000     3.542813    5.266435
------------------------------------------------------------------------------

Participants in frailty level 5 have a hazard of dying that is over 4 times greater than those in frailty level 1, once adjusted for age group (HR 4.1, 95% CI: 3.5-4.9, p<0.001). The association of frailty level 5 was made weaker by the adjustment for age group (if we compare the association here to that seen above in the unadjusted model).

Note: In these adjusted Cox regression model examples, adjusting for confounders weakened the associations. However, remember that confounding can make associations appear spuriously weaker or stronger. Therefore, sometimes when you adjust for additional variables, your exposure’s association may get stronger.

C2.3a PRACTICAL: R

Unadjusted Cox regression

The coxph() command is used to run a Cox proportional hazards regression model in R. The simplified syntax of the coxph() command is the following:

coxph(formula, data)

The formula used in the coxph() command needs a survival object as the dependent variable, which is created using the Surv() command. The syntax of the Surv() command is the following:

Surv(time, event)

where time is the follow-up time per individual and event is the disease/outcome status per individual.

For example, we can examine the effect of frailty on the risk of death. Frailty is recorded on a numeric scale, but it is better treated as a categorical variable, so we first convert it to a factor:

df$frailty <- as.factor(df$frailty)
model <- coxph(Surv(fu_years, death) ~ frailty, data = df)
summary(model)

The output looks like this:

> df$frailty <- as.factor(df$frailty)
> model <- coxph(formula = Surv(fu_years, death) ~ frailty, data = df)
> summary(model)
Call:
coxph(formula = Surv(fu_years, death) ~ frailty, data = df)

  n= 4327, number of events= 1526 

            coef exp(coef) se(coef)      z Pr(>|z|)    
frailty2 0.37490   1.45484  0.10467  3.582 0.000341 ***
frailty3 0.66515   1.94478  0.10039  6.626 3.45e-11 ***
frailty4 1.05922   2.88412  0.09164 11.558  < 2e-16 ***
frailty5 1.74762   5.74090  0.08719 20.044  < 2e-16 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

         exp(coef) exp(-coef) lower .95 upper .95
frailty2     1.455     0.6874     1.185     1.786
frailty3     1.945     0.5142     1.597     2.368
frailty4     2.884     0.3467     2.410     3.452
frailty5     5.741     0.1742     4.839     6.811

Concordance= 0.671  (se = 0.007 )
Likelihood ratio test= 577.2  on 4 df,   p=<2e-16
Wald test            = 576.4  on 4 df,   p=<2e-16
Score (logrank) test = 667.9  on 4 df,   p=<2e-16

Looking at the output, you can see that the hazard of dying is nearly 6 times greater for participants in frailty level 5 compared with frailty level 1 (HR 5.7, 95% CI:4.8-6.8). This is statistically significant.
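The two halves of the summary() output are linked by a standard relationship: the hazard ratio is exp(coef), and the 95% confidence interval is exp(coef ± 1.96·se). A quick sketch of that arithmetic, using the frailty level 5 row from the output above (the helper function name is ours):

```python
import math

def hr_with_ci(coef, se, z=1.96):
    """Convert a Cox log-hazard coefficient and its standard error
    into a hazard ratio with a 95% confidence interval."""
    return (math.exp(coef),
            math.exp(coef - z * se),
            math.exp(coef + z * se))

# frailty level 5 vs level 1: coef = 1.74762, se(coef) = 0.08719
hr, lo, hi = hr_with_ci(1.74762, 0.08719)
print(f"HR {hr:.2f}, 95% CI {lo:.2f}-{hi:.2f}")  # HR 5.74, 95% CI 4.84-6.81
```

This reproduces the exp(coef), lower .95 and upper .95 columns of the R output (and the Haz. ratio column in Stata).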

  • Question C2.3a.i: What is the association of vitamin D (using the variable coded above and below the median) with the risk (hazard) of dying?
Answer

> model <- coxph(formula = Surv(fu_years, death) ~ vitd_med, data = df)
> summary(model)
Call:
coxph(formula = Surv(fu_years, death) ~ vitd_med, data = df)

  n= 4327, number of events= 1526 

             coef exp(coef) se(coef)     z Pr(>|z|)    
vitd_med -0.56864   0.56630  0.05266 -10.8   <2e-16 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

         exp(coef) exp(-coef) lower .95 upper .95
vitd_med    0.5663      1.766    0.5108    0.6279

Concordance= 0.57  (se = 0.006 )
Likelihood ratio test= 120.3  on 1 df,   p=<2e-16
Wald test            = 116.6  on 1 df,   p=<2e-16
Score (logrank) test = 119.8  on 1 df,   p=<2e-16

Participants who have vitamin D concentrations above the median have a 43% lower hazard (or risk) of dying, and this is statistically significant (HR 0.57, 95% CI: 0.51-0.63).

Multivariable Cox Regression

We can adjust for confounders in a Cox regression as we do in any other regression model. For example, if we want to examine the independent association of vitamin D with the hazard of mortality, adjusted for age group, we type:

df$age_grp <- as.factor(df$age_grp)
model <- coxph(formula = Surv(fu_years, death) ~ vitd_med + age_grp, data = df)
summary(model)

Note: Age is recorded as a categorical variable. For this reason, we should define the variable age_grp as a factor (and not as numeric). We do that using the as.factor() command.

The output should look like this:

Call:
coxph(formula = Surv(fu_years, death) ~ vitd_med + age_grp, data = df)

  n= 4327, number of events= 1526 

             coef exp(coef) se(coef)      z Pr(>|z|)    
vitd_med -0.39618   0.67288  0.05348 -7.408 1.28e-13 ***
age_grp2  0.62103   1.86085  0.09632  6.448 1.13e-10 ***
age_grp3  1.14063   3.12874  0.09736 11.716  < 2e-16 ***
age_grp4  1.75800   5.80083  0.09967 17.638  < 2e-16 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

         exp(coef) exp(-coef) lower .95 upper .95
vitd_med    0.6729     1.4861    0.6059    0.7472
age_grp2    1.8608     0.5374    1.5407    2.2475
age_grp3    3.1287     0.3196    2.5852    3.7865
age_grp4    5.8008     0.1724    4.7715    7.0523

Concordance= 0.664  (se = 0.007 )
Likelihood ratio test= 559.9  on 4 df,   p=<2e-16
Wald test            = 560.2  on 4 df,   p=<2e-16
Score (logrank) test = 638.8  on 4 df,   p=<2e-16

We can see that once the association of vitamin D has been adjusted for age group, the hazard ratio weakens from 0.57 (unadjusted HR, from section C2.3a.i.) to 0.67. There is still a statistically significant association (p<0.001), but it is weaker (i.e. closer to the null value of HR=1) because we controlled for the important confounder of age.

  • Question C2.3a.ii: What is the impact of frailty on the hazard of dying, once adjusted for age_grp?
Answer

> model <- coxph(formula = Surv(fu_years, death) ~ frailty + age_grp, data = df)
> summary(model)
Call:
coxph(formula = Surv(fu_years, death) ~ frailty + age_grp, data = df)

  n= 4327, number of events= 1526 

             coef exp(coef) se(coef)      z Pr(>|z|)    
frailty2 0.27901   1.32182  0.10487  2.660   0.0078 ** 
frailty3 0.47799   1.61282  0.10114  4.726 2.29e-06 ***
frailty4 0.80050   2.22665  0.09318  8.591  < 2e-16 ***
frailty5 1.41722   4.12563  0.08984 15.775  < 2e-16 ***
age_grp2 0.56432   1.75825  0.09645  5.851 4.90e-09 ***
age_grp3 0.97286   2.64549  0.09809  9.918  < 2e-16 ***
age_grp4 1.46314   4.31950  0.10113 14.468  < 2e-16 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

         exp(coef) exp(-coef) lower .95 upper .95
frailty2     1.322     0.7565     1.076     1.623
frailty3     1.613     0.6200     1.323     1.966
frailty4     2.227     0.4491     1.855     2.673
frailty5     4.126     0.2424     3.460     4.920
age_grp2     1.758     0.5687     1.455     2.124
age_grp3     2.645     0.3780     2.183     3.206
age_grp4     4.319     0.2315     3.543     5.266

Concordance= 0.709  (se = 0.006 )
Likelihood ratio test= 861.8  on 7 df,   p=<2e-16
Wald test            = 862.2  on 7 df,   p=<2e-16
Score (logrank) test = 1018  on 7 df,    p=<2e-16

Participants in frailty level 5 have a hazard of dying that is over 4 times greater than those in frailty level 1, once adjusted for age group (HR 4.1, 95% CI: 3.5-4.9, p<0.001). The association of frailty level 5 was made weaker by the adjustment for age group (if we compare the association here to that seen above in the unadjusted model).

Note: In these adjusted Cox regression model examples, adjusting for confounders weakened the associations. However, remember that confounding can make associations appear spuriously weaker or stronger. Therefore, sometimes when you adjust for additional variables, your exposure’s association may get stronger.

C2.3a PRACTICAL: SPSS

Unadjusted Cox regression

To conduct the Cox regression in SPSS, select:

Analyze > Survival > Cox Regression

Place the time variable (fu_years) in the ‘Time’ box, and the death variable in the ‘Status’ box and define the event just as you did with the Kaplan-Meier.

In this example we are going to add frailty as an explanatory variable in the model. Add frailty to the Covariates box, then click on ‘Categorical’ to define the frailty variable as categorical, just as before.

Click on options and tick the box next to CI for exp(B) to show the confidence interval for the hazard ratios in the output. Keep the default value of 95%. Press continue and then OK to run the test.

The output table you are interested in will look like the one below.

Looking at the output, you can see that the hazard of dying is nearly 6 times greater for participants in frailty level 5 compared with frailty level 1 (HR 5.7, 95% CI:4.8-6.8). This is statistically significant (P<0.001).

Question C2.3a.i: What is the association of vitamin D (using the variable coded above and below the median) with the risk (hazard) of dying?

Answer

The output table will look like the one below.

Participants who have vitamin D concentrations above the median have a 44% lower hazard (or risk) of dying, and this is statistically significant (HR 0.56, 95% CI: 0.50-0.62).

Multivariable Cox Regression

We can adjust for confounders in a Cox regression as we do in any other regression model. For example, if we want to examine the independent association of vitamin D with the hazard of mortality, adjusted for age group, we can simply add age group into the analysis as another categorical covariate.

The output looks like the one below.

We can see that once the association of vitamin D has been adjusted for age group, the hazard ratio weakens from 0.56 (unadjusted HR, from section C2.3a.i.) to 0.67. There is still a statistically significant association (p<0.001), but it is weaker (i.e. closer to the null value of HR=1) because we controlled for the important confounder of age.

Question C2.3a.ii: What is the impact of frailty on the hazard of dying, once adjusted for age_grp?

Answer

Participants in frailty level 5 have a hazard of dying that is over 4 times greater than those in frailty level 1, once adjusted for age group (HR 4.1, 95% CI: 3.5-4.9, p<0.001). The association of frailty level 5 was made weaker by the adjustment for age group (if we compare the association here to that seen above in the unadjusted model).

Note: In these adjusted Cox regression model examples, adjusting for confounders weakened the associations. However, remember that confounding can make associations appear spuriously weaker or stronger. Therefore, sometimes when you adjust for additional variables, your exposure’s association may get stronger.

Video C2.3b – Proportional Hazards Assumption (8 minutes)

C2.3b PRACTICAL: Stata

Test the proportional hazards assumption

There are many similarities between Cox regression and linear and logistic regression. We have seen that you can adjust for additional variables in the same way. You can also fit interactions using the same commands. However, the one key difference between a Cox regression and other regression models is the proportional hazards assumption.

The proportional hazards assumption states that the hazard ratio (comparing levels of an exposure variable) is constant over time. This assumption needs to be tested. The lecture explained 3 ways to do this: fitting an interaction with time, examining the residuals with a log-log plot, and testing the residuals for a correlation with time. In practice, choosing just one method is generally sufficient.

Fitting an interaction with time

If the PH assumption is violated, it means that the hazard ratio is not constant over time, i.e. there is an interaction between a covariate and time. Thus, we can fit a Cox model with a time interaction in it and see if this is significant. We can also use a likelihood ratio (LR) test to see if a model with a time-varying covariate fits better than one without. The ‘tvc’ and ‘texp’ options of the stcox command are used for this.

For example, to test age group, the command would be:

stcox i.age_grp frailty, tvc(age_grp)

------------------------------------------------------------------------------
          _t | Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
main         |
     age_grp |
      71-75  |   1.534684   .1700356     3.87   0.000     1.235122      1.9069
      76-80  |   2.028394   .2977641     4.82   0.000      1.52124    2.704624
      81-95  |   3.009572   .5683724     5.83   0.000     2.078505     4.35771
             |
     frailty |   1.434477   .0290646    17.81   0.000     1.378627    1.492588
-------------+----------------------------------------------------------------
tvc          |
     age_grp |   1.026909    .012475     2.19   0.029     1.002747    1.051653
------------------------------------------------------------------------------
Note: Variables in tvc equation interacted with _t.

When we look at the p-value for the coefficient in the ‘tvc’ row, we see that it is significant (p=0.03). This suggests there is a significant interaction with time, which indicates the HR for age group is not constant over time.
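One way to read the ‘tvc’ row: by default Stata interacts the variable with analysis time _t, so the hazard ratio for age group is multiplied by about 1.027 for every additional year of follow-up. A hedged sketch of that arithmetic, using the 81-95 row of the output above (note this is a simplification: Stata fits one tvc coefficient across all levels of age_grp here):

```python
# Values from the stcox output above: main-effect HR for age group 81-95
# and the tvc multiplier per year of analysis time (Stata's default texp is _t).
hr_main_81_95 = 3.009572
hr_tvc_per_year = 1.026909

def hr_at_time(hr_main, hr_tvc, t):
    """Hazard ratio at analysis time t when the covariate interacts
    linearly with time: HR(t) = HR_main * HR_tvc**t."""
    return hr_main * hr_tvc ** t

print(round(hr_at_time(hr_main_81_95, hr_tvc_per_year, 0), 2))  # 3.01 at baseline
print(round(hr_at_time(hr_main_81_95, hr_tvc_per_year, 5), 2))  # about 3.44 after 5 years
```

So the estimated age-group effect drifts upward over follow-up, which is exactly what "not constant over time" means.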

Inspecting log-log plots

To compute a log-log plot, the command is ‘stphplot’; the plot can be adjusted for confounders with the adjust() option.

A log-log plot for age group is computed using the command below:

stphplot, by(age_grp) adjust(frailty)

When we inspect a log-log plot, we want to see that the curves in the main portion of the plot (i.e. not the tails of the plot, where there are few events) are generally parallel over time. If the curves touch or cross, this indicates there is a violation of the PH assumption.

Looking at the curves for each level of age group above, they appear to be generally parallel with similar slopes over time. This looks acceptable and does not indicate a meaningful violation of the PH assumption. This may contradict the results of the ‘tvc’ option above, as statistical tests will often flag small deviations when you have a lot of power (e.g. a large dataset). For this reason, we often prefer to visually inspect the data for obvious violations of the PH assumption.

  • Question C2.3b:  Interpret a log-log plot for frailty (adjusted for age group) and assess if it violates the PH assumption. 
Answer

stphplot, by(frailty) adjust(age_grp)

In this log-log plot, we see the curves for the different levels of frailty (levels 1-4) criss-cross and touch each other at several points over time. This is a clearer violation of the PH assumption.              

Fixing violations in the PH assumption

If the PH assumption has been violated for a covariate, the lecture mentioned 2 ways to deal with the situation. One way is to specify the model using the ‘tvc’ option for the offending variable, which models an interaction with time. The other way, which we are doing here, is to fit a stratified Cox model where we stratify by the time-varying covariate.

We do this by using the strata option of the stcox command.

stcox i.age_grp, strata(frailty)

You do not get HRs or p-values for each stratum. Instead, Stata assumes there is a separate baseline hazard for each stratum in order to estimate the proportional effects of the other variables across strata (which provides only one coefficient per variable, since the proportional effects are assumed to be constant across strata; this is the no-interaction assumption, i.e. no interaction between the other explanatory variables and the stratifying variable). Since you do not get HRs for the levels of a variable you stratify on, do not use this method if you have non-proportionality in your main exposure of interest.
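The stratified model described above can be written compactly: each stratum $s$ of the stratifying variable (here, frailty) gets its own baseline hazard, while the coefficients $\beta$ for the remaining covariates are shared across strata (the no-interaction assumption):

$$h_s(t \mid x) = h_{0s}(t)\,\exp(\beta x), \qquad s = 1, \dots, 5$$

Because the stratum-specific baseline hazards $h_{0s}(t)$ are left unestimated, no HR is reported for frailty itself.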

If you run the command above, you will get this output:

Stratified Cox regression with Breslow method for ties
Strata variable: frailty

No. of subjects =       4,327                           Number of obs =  4,327
No. of failures =       1,526
Time at risk    = 29,476.9825
                                                        LR chi2(3)    = 280.76
Log likelihood = -9708.127                              Prob > chi2   = 0.0000

------------------------------------------------------------------------------
          _t | Haz. ratio   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
     age_grp |
      71-75  |   1.756097   .1693938     5.84   0.000     1.453588    2.121562
      76-80  |   2.642535   .2592631     9.90   0.000     2.180259    3.202827
      81-95  |    4.28578   .4339019    14.37   0.000     3.514409    5.226457
------------------------------------------------------------------------------

Notice that you do not get any output regarding the variable ‘frailty’ that you stratified on, although Stata indicates at the top of the output that you have used this variable in the strata. You would still interpret the HR of age_grp as having been adjusted for frailty, though. So you would say that the hazard of dying was 4.3 times greater in the 81-95 age group compared to the 60-70 age group, once it was adjusted for frailty (as stratifying on a variable is another way to adjust for its association).

C2.3b PRACTICAL: R

Test the proportional hazards assumption

There are many similarities between Cox regression and linear and logistic regression. We have seen that you can adjust for additional variables in the same way. You can also fit interactions using the same commands. However, the one key difference between a Cox regression and other regression models is the proportional hazards assumption.

The proportional hazards assumption states that the hazard ratio (comparing levels of an exposure variable) is constant over time. This assumption needs to be tested. The lecture explained 3 ways to do this: fitting an interaction with time, examining the residuals with a log-log plot, and testing the residuals for a correlation with time. In practice, choosing just one method is generally sufficient.

Fitting an interaction with time

If the PH assumption is violated, it means that the hazard ratio is not constant over time, i.e. there is an interaction between a covariate and time. Thus, we can fit a Cox model with a time interaction in it and see if this is significant. We can also use a likelihood ratio (LR) test to see if a model with a time-varying covariate fits better than one without. In R, the interaction of a covariate with time can be fitted by using the time-transform functionality tt() in the coxph() command.

model <- coxph(formula = Surv(fu_years, death) ~ frailty + age_grp + tt(age_grp), data = df)

summary(model)

The output looks like this:

Call:
coxph(formula = Surv(fu_years, death) ~ frailty + age_grp + tt(age_grp), 
    data = df)

  n= 4327, number of events= 1526 

               coef exp(coef) se(coef)      z Pr(>|z|)    
frailty2    0.27705   1.31923  0.10488  2.642  0.00825 ** 
frailty3    0.47470   1.60754  0.10115  4.693 2.69e-06 ***
frailty4    0.79874   2.22273  0.09317  8.573  < 2e-16 ***
frailty5    1.41966   4.13570  0.08975 15.817  < 2e-16 ***
age_grp2    0.53277   1.70365  0.09704  5.490 4.01e-08 ***
age_grp3    0.82084   2.27241  0.11049  7.429 1.09e-13 ***
age_grp4    0.68145   1.97674  0.28382  2.401  0.01635 *  
tt(age_grp) 0.04261   1.04353  0.01434  2.971  0.00297 *

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

            exp(coef) exp(-coef) lower .95 upper .95
frailty2        1.319     0.7580     1.074     1.620
frailty3        1.608     0.6221     1.318     1.960
frailty4        2.223     0.4499     1.852     2.668
frailty5        4.136     0.2418     3.469     4.931
age_grp2        1.704     0.5870     1.409     2.061
age_grp3        2.272     0.4401     1.830     2.822
age_grp4        1.977     0.5059     1.133     3.448
tt(age_grp)     1.044     0.9583     1.015     1.073

Concordance= 0.71  (se = 0.007 )
Likelihood ratio test= 870.4  on 8 df,   p=<2e-16
Wald test            = 880.1  on 8 df,   p=<2e-16
Score (logrank) test = 1045  on 8 df,    p=<2e-16

When we look at the p-value for the coefficient in the tt(age_grp) row, we see that it is significant (p=0.003). This suggests there is a significant interaction with time, which indicates the HR for age group is not constant over time.

Inspecting log-log plots

To compute a log-log plot, we first compute the survival curves for our variable using the survfit() command, and then plot the log-log curves using the ggsurvplot() command (from the survminer package).

A log-log plot for age group is computed using the command below:

model <- coxph(Surv(fu_years, death) ~ frailty + strata(age_grp), data = df)

fit <- survfit(model)

ggsurvplot(fit, data = df, fun = "cloglog")

When we inspect a log-log plot, we want to see that the curves in the main portion of the plot (i.e. not the tails of the plot, where there are few events) are generally parallel over time. If the curves touch or cross, this indicates there is a violation of the PH assumption.

Looking at the curves for each level of age group above, they appear to be generally parallel with similar slopes over time. This looks acceptable and does not indicate a meaningful violation of the PH assumption. This may contradict the results of the tt() command above as statistical tests will often be significant for small deviations if you have a lot of power (e.g. if you have a large dataset). For this reason, we often prefer to visually inspect the data for obvious violations of the PH assumption.

Testing the residuals for a correlation with time

The PH assumption can also be assessed using a statistical test based on the scaled Schoenfeld residuals. In principle, Schoenfeld residuals are independent of time. If a plot shows that the Schoenfeld residuals follow a non-random pattern against time, this is evidence of a violation of the PH assumption.
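The core idea behind this test can be illustrated with a toy calculation: check whether the residuals are correlated with event time. The residual values below are made up for illustration only; the real cox.zph() test works on the fitted model's scaled Schoenfeld residuals and a transform of event time.

```python
import math

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scaled Schoenfeld residuals at successive event times.
# Under the PH assumption they should show no trend with time; this
# made-up drifting series is what a violation looks like.
event_times = [1, 2, 3, 4, 5, 6, 7, 8]
residuals   = [-0.9, -0.6, -0.5, -0.1, 0.2, 0.4, 0.7, 0.8]

r = pearson_r(event_times, residuals)
print(f"correlation with time: {r:.2f}")  # close to 1 here, so PH looks violated
```

A correlation near zero is consistent with proportional hazards; a clear trend, as here, is the pattern cox.zph() converts into a chi-squared statistic and p-value.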

We can test the proportional hazards assumption for each covariate in a Cox model using the cox.zph() command.

In a Cox model including age_grp and frailty as covariates, we can examine whether any of these variables violate the PH assumption by running the following command:

model <- coxph(Surv(fu_years, death) ~ frailty + age_grp, data = df)
cox.zph(model)

The output looks like this:
        chisq df       p
frailty  9.51  4 0.04945
age_grp 11.47  3 0.00943
GLOBAL  26.33  7 0.00044

The cox.zph() command has two outputs. The first is a chi-squared test with a p-value testing the violation of the PH assumption for each of the variables included in the Cox model. The second is a global chi-squared test with a p-value testing whether any variable in the model violates the PH assumption.

We observe that both the frailty and age_grp variables violate the PH assumption (p<0.05).

  • Question C2.3b:  Interpret a log-log plot for frailty (adjusted for age group) and assess if it violates the PH assumption. 
Answer

model <- coxph(Surv(fu_years, death) ~ strata(frailty) + age_grp, data = df)

fit <- survfit(model)

ggsurvplot(fit, data = df, fun = "cloglog")

In this log-log plot, we see the curves for the different levels of frailty (levels 1-4) criss-cross and touch each other at several points over time. This is a clearer violation of the PH assumption.

Fixing violations in the PH assumption

If the PH assumption has been violated for a covariate, we can fit a stratified Cox model where we stratify by the covariate that violates the PH assumption.

To stratify a Cox model by a specific variable, we use the strata() command from the survival R package.

model <- coxph(formula = Surv(fu_years, death) ~ strata(frailty) + age_grp, data = df)
summary(model)

You do not get HRs or p-values for each stratum. Instead, R assumes there is a separate baseline hazard for each stratum in order to estimate the proportional effects of the other variables across strata (which provides only one coefficient per variable, since the proportional effects are assumed to be constant across strata; this is the no-interaction assumption, i.e. no interaction between the other explanatory variables and the stratifying variable). Since you do not get HRs for the levels of a variable you stratify on, do not use this method if you have non-proportionality in your main exposure of interest.

If you run the command above, you will get this output:

Call:
coxph(formula = Surv(fu_years, death) ~ strata(frailty) + age_grp, 
    data = df)

  n= 4327, number of events= 1526 

            coef exp(coef) se(coef)      z Pr(>|z|)    
age_grp2 0.56309   1.75610  0.09646  5.838  5.3e-09 ***
age_grp3 0.97174   2.64253  0.09811  9.904  < 2e-16 ***
age_grp4 1.45530   4.28579  0.10124 14.374  < 2e-16 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Notice that you do not get any output for the variable ‘frailty’ that you stratified on, although R shows in the formula that it was used to define the strata. You would still interpret the HRs for age_grp as adjusted for frailty. So you would say that the hazard of dying was 4.3 times greater in the 81-95 age group than in the 60-70 age group, after adjusting for frailty (stratifying on a variable is another way of adjusting for it).

C2.3b PRACTICAL: SPSS

Test the proportional hazards assumption

There are many similarities between Cox regression and linear and logistic regression. We have seen that you can adjust for additional variables in the same way. You can also fit interactions using the same commands. However, the one key difference between a Cox regression and other regression models is the proportional hazards assumption.

The proportional hazards assumption states that the hazard ratio (comparing levels of an exposure variable) is constant over time. This assumption needs to be tested. The lecture explained three ways to do this: fitting an interaction with time, inspecting a log-log plot, and testing the residuals for a correlation with time. In practice, choosing just one method is generally sufficient.

Fitting an interaction with time

If the PH assumption is violated, it means that the hazard ratio is not constant over time, i.e. there is an interaction between a covariate and time. Thus, we can fit a Cox model with a time interaction in it and see if this is significant. We can also use a likelihood ratio (LR) test to see if a model with a time-varying covariate fits better than one without.

Select

Analyze >> Survival >> Cox w/ Time-Dep Cov

This opens a window for you to define your time-dependent covariate. The variable Time [T_] will be added to your list of potential variables. You need to use this, multiplied by your variable of interest (in this case age group), to define the time-dependent covariate. When you have defined the expression, press ‘Model’.

The normal Cox Regression box will then open, with your new time-dependent covariate indicated by T_COV in the left-hand menu. Set up your analysis exactly as before, but add T_COV as another covariate.

Your output will look like the below.

When we look at the p-value for the coefficient in the T_COV row, we see that it is significant (p=0.022). This suggests there is a significant interaction with time, which indicates the HR for age group is not constant over time.
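For comparison, the same time-interaction check can be sketched in R with the survival package's tt (time-transform) argument. This is an illustrative sketch only, not part of the SPSS practical: treating age group as a numeric trend via as.numeric() is an assumption made for the illustration.

```r
library(survival)

# Sketch of a time-by-covariate interaction in R, mirroring the SPSS
# T_COV approach. tt() multiplies the covariate by a function of time;
# a significant coefficient on the tt() term suggests the HR for that
# covariate is not constant over time.
model_tt <- coxph(Surv(fu_years, death) ~ age_grp + tt(as.numeric(age_grp)),
                  data = df,
                  tt   = function(x, t, ...) x * t)
summary(model_tt)
```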

Inspecting log-log plots

To compute a log-log plot in SPSS click on ‘Plots’ within the Cox Regression window. Tick the box next to ‘log minus log’ and then move the variable you are interested in into the ‘Separate Lines for:’ box.

Your output will look like the below.

When we inspect a log-log plot, we want to see that the curves in the main portion of the plot (i.e. not the tails of the plot, where there are few events) are generally parallel over time. If the curves touch or cross, this indicates there is a violation of the PH assumption.

Fixing violations in the PH assumption

If the PH assumption has been violated for a covariate, the lecture mentioned two ways to deal with the situation. One way is to specify the model using the T_COV option for the offending variable, which models an interaction with time. The other way, which we are doing here, is to fit a stratified Cox model where we stratify by the covariate that violates the PH assumption.

We do this by adding the variable (frailty in this example) to the ‘Strata’ box in the Cox Regression window rather than the ‘Covariate’ box.

You do not get HRs or p-values for each stratum. Instead, SPSS assumes a separate baseline hazard for each stratum and estimates the effects of the other variables across strata. This yields a single coefficient per covariate, because the effects are assumed to be the same in every stratum; this is the no-interaction assumption (i.e., no interaction between the other explanatory variables and the stratification variable). Since you do not get an HR for the levels of the variable you stratify on, do not use this method if the non-proportionality is in your main exposure of interest.

You would still interpret the HRs for age group as adjusted for frailty. So you would say that the hazard of dying was 4.3 times greater in the 81-95 age group than in the 60-70 age group, after adjusting for frailty (stratifying on a variable is another way of adjusting for it).
