Back to Course

FoSSA: Fundamentals of Statistical Software & Analysis

0% Complete
0/0 Steps
  1. Course Information

    Meet the Teaching Team
  2. Course Dataset 1
  3. Course Dataset 2
  4. MODULE A1: INTRODUCTION TO STATISTICS USING R, STATA, AND SPSS
    A1.1 What is Statistics?
  5. A1.2.1a Introduction to Stata
  6. A1.2.2b: Introduction to R
  7. A1.2.2c: Introduction to SPSS
  8. A1.3: Descriptive Statistics
  9. A1.4: Estimates and Confidence Intervals
  10. A1.5: Hypothesis Testing
  11. A1.6: Transforming Variables
  12. End of Module A1
    1 Quiz
  13. MODULE A2: POWER & SAMPLE SIZE CALCULATIONS
    A2.1 Key Concepts
  14. A2.2 Power calculations for a difference in means
  15. A2.3 Power Calculations for a difference in proportions
  16. A2.4 Sample Size Calculation for RCTs
  17. A2.5 Sample size calculations for cross-sectional studies (or surveys)
  18. A2.6 Sample size calculations for case-control studies
  19. End of Module A2
    1 Quiz
  20. MODULE B1: LINEAR REGRESSION
    B1.1 Correlation and Scatterplots
  21. B1.2 Differences Between Means (ANOVA 1)
  22. B1.3 Univariable Linear Regression
  23. B1.4 Multivariable Linear Regression
  24. B1.5 Model Selection and F-Tests
  25. B1.6 Regression Diagnostics
  26. End of Module B1
    1 Quiz
  27. MODULE B2: MULTIPLE COMPARISONS & REPEATED MEASURES
    B2.1 ANOVA Revisited - Post-Hoc Testing
  28. B2.2 Correcting For Multiple Comparisons
  29. B2.3 Two-way ANOVA
  30. B2.4 Repeated Measures and the Paired T-Test
  31. B2.5 Repeated Measures ANOVA
  32. End of Module B2
    1 Quiz
  33. MODULE B3: NON-PARAMETRIC MEASURES
    B3.1 The Parametric Assumptions
  34. B3.2 Mann-Whitney U Test
  35. B3.3 Kruskal-Wallis Test
  36. B3.4 Wilcoxon Signed Rank Test
  37. B3.5 Friedman Test
  38. B3.6 Spearman's Rank Order Correlation
  39. End of Module B3
    1 Quiz
  40. MODULE C1: BINARY OUTCOME DATA & LOGISTIC REGRESSION
    C1.1 Introduction to Prevalence, Risk, Odds and Rates
  41. C1.2 The Chi-Square Test and the Test For Trend
  42. C1.3 Univariable Logistic Regression
  43. C1.4 Multivariable Logistic Regression
  44. End of Module C1
    1 Quiz
  45. MODULE C2: SURVIVAL DATA
    C2.1 Introduction to Survival Data
  46. C2.2 Kaplan-Meier Survival Function & the Log Rank Test
  47. C2.3 Cox Proportional Hazards Regression
  48. C2.4 Poisson Regression
  49. End of Module C2
    1 Quiz
Topic 46 of 49
In Progress

C2.2 Kaplan-Meier Survival Function & the Log Rank Test

Learning Outcomes

By the end of this section, students will be able to:

  • Produce a Kaplan-Meier survival curve and interpret it
  • Test for differences in survival by using a log-rank test

Video C2.2a – Kaplan-Meier Survival Function (7 minutes)

Video C2.2b – The Log-Rank Test (4 minutes)

C2.2a PRACTICAL: Stata

Kaplan-Meier Survival Curve

To produce a K-M curve for all-cause mortality, the first step is to use the ‘stset’ command to tell Stata you are using survival-time data (which only needs to be done once for any analyses using the same follow-up time and event variable).

The ‘sts list’ command lists the estimated survivor (or failure, if you add this option) function. It can only be used after an ‘stset’ command, and you do not need to specify any variables after the command. If you type ‘sts list’ by itself, the estimated survival function will be calculated at the point of each event failure. To view the survival function in a more concise way, use the at() option with ‘sts list’ to specify at what times you want to view the survival function, for example, at(0.5(0.5)8.5)). That option specifies you want to view the survival function starting at 0.5 units in time, and then at each 0.5 unit increase in time until 8.5 units (or years in this case) have passed.

Produce a Kaplan-Meier curve for mortality

The command ‘sts graph’ produces a plot of the K-M survivor function. There are a variety of options that you could use that might make a graph more informative. Read through the help file for ‘sts graph’ to investigate some of the command options.

  • Question C2.2ai.: Use the ‘sts list’ command and the at option ‘at(0.5(0.5)8.5)’ to see what time since re-survey corresponds to 75% survival and check using the ‘stsum’ command. The ‘stsum‘ command can be typed after the ‘stset’ command and it is akin to the ‘summarise’ command, presenting summary statistics for the current survival data. Reference the help file for more information on the ‘stsum’ command.
  •  Question C2.2a.ii: Use ‘sts graph’ to plot the cumulative survival, and use the appropriate option to estimate how many participants were at risk going into the 8th year of follow-up.
Answer

Answer C2.2a.i.:

sts list, at(0.5(0.5)8.5)

         failure _d:  death
   analysis time _t:  nyearscf
                 id:  whl2_id

              Beg.                      Survivor      Std.
Time     Total     Fail             Function     Error     [95% Conf. Int.]
——————————————————————————-
.5      4286       41              0.9905    0.0015     0.9872    0.9930
1      4217       69              0.9746    0.0024     0.9694    0.9789
1.5      4139       78              0.9565    0.0031     0.9500    0.9622
2      4044       95              0.9346    0.0038     0.9268    0.9416
2.5      3956       88              0.9142    0.0043     0.9055    0.9222
3      3866       89              0.8937    0.0047     0.8841    0.9025
3.5      3766      100              0.8705    0.0051     0.8602    0.8802
4      3668       97              0.8481    0.0055     0.8371    0.8585
4.5      3572       96              0.8259    0.0058     0.8143    0.8369
5      3474       97              0.8035    0.0060     0.7913    0.8150
5.5      3373      100              0.7803    0.0063     0.7677    0.7924
6      3243      129              0.7505    0.0066     0.7373    0.7631
6.5      3137      106              0.7259    0.0068     0.7124    0.7390
7      3021      114              0.6995    0.0070     0.6856    0.7130
7.5      2891      125              0.6706    0.0071     0.6563    0.6844
8      1459       48              0.6555    0.0073     0.6410    0.6697
8.5      1404       54              0.6313    0.0078     0.6158    0.6463
——————————————————————————-
Note:  Survivor function is calculated over full data and evaluated at
indicated times; it is not calculated from aggregates shown at left.

75% survival occurs at around 6 years, and this agrees with report from stsum below.

stsum

         failure _d:  death
   analysis time _t:  nyearscf
                 id:  whl2_id

         |               incidence       no. of    |—— Survival time —–|
         | time at risk     rate        subjects        25%       50%       75%
———+———————————————————————
   total |  29476.98247   .0517692          4327    6.02445         .         .

Answer C2.2a.ii:

The correct command is below, and the graph shows that 1458 participants were at risk going into the 8th year of follow-up.

sts graph, risktable

The log-rank test

First, we will divide the variable ‘vitd’ at the median (nb: there are several commands you can use to code this correctly).

 xtile vitd_med=vitd, nq(2)

Then we will investigate the effect of levels of VitD on survival by plotting the survival functions for each of the two categories of VitD with the inclusion of confidence intervals (option ci) :

sts graph, by(vitd_med) ci ylab(0.5 0.6 to 1.0)

 The ‘sts test’ command by default compares the survival distributions of an exposure variable by performing a log-rank test. One of the options you can specify is “trend”, which will perform a test for trend across levels of an ordered categorical variable.

 sts test varlist [, options]

  • Question C2.2b: Using the sts test command, compare the survival distributions for the two categories of VitD. Is there a significant difference in the survival of participants below and above the median of vitamin D concentration?
Answer

sts test vitd_med

Failure _d: death

Analysis time _t: fu_years

ID variable: whl1_id

Equality of survivor functions
Log-rank test

        |  Observed       Expected
vitd_med |    events         events
———+————————-
       1 |       940         726.62
       2 |       586         799.38
———+————————-
   Total |      1526        1526.00

    chi2(1) = 119.76

    Pr>chi2 = 0.0000

The log-rank test indicates that we can reject the null hypothesis that the survival distributions for the two categories are the same, as the chi(1)=119.76, p<0.001. There is a statistically significant difference in survival for participants above and below the median of vitamin D concentrations. Looking at the KM plot of the survival functions, we can see that participants with vitamin D above the median tend to survive longer.

C2.2 PRACTICAL: SPSS

Kaplan-Meier Survival Curve

To produce a K-M curve in SPSS

Select

Analyze >> Survival >> Kaplan-Meier

Place the time variable (fu_years) into the ‘Time’ box, and then the survival variable ‘death’ into the ‘Status’ box.

Click on ‘Define event’ below the ‘Status’ box. This is where you define the value that indicates the event that we are interested in (i.e. death). As we have death coded as a binary variable, where No =0 and Yes = 1, here we simply indicate that we are looking for the single value of 1. Then press Continue.

In this example we want to see what time since re-survey corresponds with 75% survival. To do this we click on ‘Options’ and the tick the box next to ‘Quartiles’. In the same tab we can select if we want to see a survival plot of our data buy ticking the box next to ‘Survival’ under ‘Plots’. Click Continue and then OK to run the test.

Answer

If you ticked ‘survival table’ it will output a very long table of all participants in the study like the below. You can read the surviving fraction off the bottom of this table which is 0.631 or 63.1%

The time at which 75% survival was seen is easier, as this is read directly from the Percentiles table. This case it is approximately 6 years.

The survival plot shows the cumulative survival over the follow up period. Below is the default produced by SPSS, but you can double click on the graph in the output window to edit it to your needs.

Life tables

In SPSS you can also use the ‘Life tables’ function to produce a table which gives us information at set time points throughout the follow up period.

Select

Analyze >> Survival >> Life Tables

Add fu_years as the ‘Time’ variable and death as the Status variable and then define death as the single value of 1, just as before.

Then you can add the time intervals, so select 8.5 years as the end point, and 0.5 years as the reporting interval. Press OK to run the test.

Answer

The life table output looks like the below. This gives detailed data from each time point you have selected. If you look at the cumulative proportion surviving at 8.5 years this is 0.63, which agrees with the analysis conducted through the Kaplan-Meier function.

Log rank test

Here we are going to compare survival distributions for participants with high and low vitamin  D levels. Firstly, we need to create a dichotomous variable for vitamin d, with one group being up to the median value, and the other group above the median.

Use the ‘Recode in Different Variables’ function under the Transform menu to create a new variable called vitd_med with one group =< than the median value (55.7) and one group >55.7. Call these groups ‘low’ and ‘high’ respectively.

We will then conduct the log rank test. Open the Kaplan Meier test box exactly as before and enter fu_years as the Time variable, add death as the Status variable and define it. Now add the new variable vitd_med to the ‘Factor’ box.

Click on ‘Compare Factor’ and tick the box next to ‘Log rank’.

It is also useful to select ‘Mean and median survival’ under statistics in the ‘Options’ tab, as the long rank test does not automatically include these in the output, and you will need them to interpret the direction of the difference.

You may also want to select to output a survival plot as before, so that you can visualise any difference between the groups.

Once you have made all of your selections click Continue and then OK to run the test.

Answer

The test output will look like the below. The log-rank test indicates that we can reject the null hypothesis that the survival distributions for the two categories are the same, as the X2 value =130.193 and p<0.001. There is a statistically significant difference in survival for participants above and below the median of vitamin D concentrations. Looking at the mean values we can see that participants with vitamin D above the median tend to survive longer.

If you’ve chosen to include a survival plot in this analysis it will look like the below. You can clearly see that the high VitD group tend to survive longer than the low VitD group.

C2.2a PRACTICAL: R

Kaplan-Meier Survival Curve

To produce a Kaplan-Meier (K-M) Survival Curve for all-cause mortality, we will use the survminer R package which aims to plot time-to-event data using the ggplot R package.

Produce a Kaplan-Meier curve for mortality

The command ggsurvplot() from the survminer R package produces a plot of the K-M survivor function. There are a variety of options that you could use that might make a graph more informative. You can find more details about the survminer R package in this website.

To produce a Kaplan-Meier curve for the total population we run the following command:

model <- survfit(Surv(fu_years, death) ~ 1, data = df)

To get the risk table at prespecified times, we used the summary() command with the times option:

summary(model, times)

  • Question C2.2a.i: Use the survfit() and summary() commands and the times option seq(0.5, 9, 0.5) to see what time since re-survey corresponds to 75% survival.
  • Question C2.2a.ii: Use the ggsurvplot() command to plot the cumulative survival, and use the appropriate option to estimate how many participants were at risk going into the 8th year of follow-up.
Answer

Answer C2.2a.i:

> model <- survfit(Surv(fu_years, death) ~ 1, data = df)
> summary(model, times = seq(0.5, 9, 0.5))
Call: survfit(formula = Surv(fu_years, death) ~ 1, data = df)

 time n.risk n.event survival std.err lower 95% CI upper 95% CI
  0.5   4285      41    0.991 0.00147        0.988        0.993
  1.0   4216      69    0.975 0.00239        0.970        0.979
  1.5   4138      78    0.957 0.00310        0.950        0.963
  2.0   4043      95    0.935 0.00376        0.927        0.942
  2.5   3955      88    0.914 0.00426        0.906        0.923
  3.0   3865      89    0.894 0.00469        0.885        0.903
  3.5   3765     100    0.871 0.00510        0.861        0.881
  4.0   3667      97    0.848 0.00546        0.837        0.859
  4.5   3571      96    0.826 0.00577        0.815        0.837
  5.0   3473      97    0.803 0.00604        0.792        0.815
  5.5   3372     100    0.780 0.00630        0.768        0.793
  6.0   3242     129    0.750 0.00658        0.738        0.763
  6.5   3136     106    0.726 0.00678        0.713        0.739
  7.0   3020     114    0.700 0.00697        0.686        0.713
  7.5   2890     125    0.671 0.00715        0.657        0.685
  8.0   1458      48    0.656 0.00732        0.641        0.670
  8.5   1404      54    0.631 0.00776        0.616        0.647

75% survival occurs at around 6 years, and this agrees with report from stsum below.

Answer C2.2a.ii:

The correct command is below, and the graph shows that 1458 participants were at risk going into the 8th year of follow-up.

ggsurvplot(model, risk.table = T)

The log-rank test

First, we will divide the variable vitd at the median (nb: there are several methods you can use to code this correctly in R). We will show a recoding approach based on tidyverse R package. We can use a combination of the mutate() command and case_when() command as follows:

df <- df %>% mutate(vitd_med = case_when(vitd <= summary(df$vitd)[3] ~ 0, vitd > summary(df$vitd)[3] ~ 1))

Note: We use the summary() command to estimate the descriptive statistics (i.e., minimum value, 1st quartile, median, mean, 3rd quartile, and maximum value) of the vitd variable. The 3rd element in the output of the summary()command is the median which is used to dichotomise the vtid variable.

The survdiff() command compares the survival distrubutions of an exposure variable by performing a log-rank test. The syntax of the survdiff() command is the following:

survdiff(Surv(time, event) ~ variable, data)

The survdiff() command uses as a dependent variable a survival object which is produced by the Surv() command. To create a survival object, we should define the follow up time (time) and the outcome status (event) in the Surv() command.

  • Question C2.2b: Using the survdiff() command, compare the survival distributions for the two categories of VitD. Is there a significant difference in the survival of participants below and above the median of vitamin D concentration?
Answer

> survdiff(Surv(fu_years, death) ~ vitd_med, data = df)
Call:
survdiff(formula = Surv(fu_years, death) ~ vitd_med, data = df)

    N Observed Expected (O-E)^2/E (O-E)^2/V
vitd_med=0 2183      940      727      62.7       120
vitd_med=1 2144      586      799      57.0       120

 Chisq= 120  on 1 degrees of freedom, p= <2e-16

The log-rank test indicates that we can reject the null hypothesis that the survival distributions for the two categories are the same, as the chi(1)=120 , p<2e-16. There is a statistically significant difference in survival for participants above and below the median of vitamin D concentrations. Looking at the KM plot of the survival functions, we can see that participants with vitamin D above the median tend to survive longer.

👋 Before you go, please rate your satisfaction with this lesson

Ratings are completely anonymous

Average rating 0 / 5. Vote count: 0

No votes so far! Be the first to rate this post.

Please share any positive or negative feedback you may have.

Feedback is completely anonymous

Subscribe
Notify of
guest

1 Comment
Newest
Oldest Most Voted
Inline Feedbacks
View all comments
Sayed Jalal

No presentation. Please add downloadable presentation

1
0
Questions or comments?x