Learning Outcomes
By the end of this section, students will be able to:
- Produce a Kaplan-Meier survival curve and interpret it
- Test for differences in survival by using a log-rank test
Video C2.2a – Kaplan-Meier Survival Function (7 minutes)
Video C2.2b – The Log-Rank Test (4 minutes)
C2.2a PRACTICAL: Stata
Kaplan-Meier Survival Curve
To produce a K-M curve for all-cause mortality, the first step is to use the ‘stset’ command to tell Stata you are using survival-time data (which only needs to be done once for any analyses using the same follow-up time and event variable).
The ‘sts list’ command lists the estimated survivor (or failure, if you add this option) function. It can only be used after an ‘stset’ command, and you do not need to specify any variables after the command. If you type ‘sts list’ by itself, the estimated survival function will be calculated at the point of each event failure. To view the survival function in a more concise way, use the at() option with ‘sts list’ to specify at what times you want to view the survival function, for example, at(0.5(0.5)8.5)). That option specifies you want to view the survival function starting at 0.5 units in time, and then at each 0.5 unit increase in time until 8.5 units (or years in this case) have passed.
Produce a Kaplan-Meier curve for mortality
The command ‘sts graph’ produces a plot of the K-M survivor function. There are a variety of options that you could use that might make a graph more informative. Read through the help file for ‘sts graph’ to investigate some of the command options.
- Question C2.2ai.: Use the ‘sts list’ command and the at option ‘at(0.5(0.5)8.5)’ to see what time since re-survey corresponds to 75% survival and check using the ‘stsum’ command. The ‘stsum‘ command can be typed after the ‘stset’ command and it is akin to the ‘summarise’ command, presenting summary statistics for the current survival data. Reference the help file for more information on the ‘stsum’ command.
- Question C2.2a.ii: Use ‘sts graph’ to plot the cumulative survival, and use the appropriate option to estimate how many participants were at risk going into the 8th year of follow-up.
Answer
Answer C2.2a.i.:
sts list, at(0.5(0.5)8.5)
failure _d: death
analysis time _t: nyearscf
id: whl2_id
Beg. Survivor Std.
Time Total Fail Function Error [95% Conf. Int.]
——————————————————————————-
.5 4286 41 0.9905 0.0015 0.9872 0.9930
1 4217 69 0.9746 0.0024 0.9694 0.9789
1.5 4139 78 0.9565 0.0031 0.9500 0.9622
2 4044 95 0.9346 0.0038 0.9268 0.9416
2.5 3956 88 0.9142 0.0043 0.9055 0.9222
3 3866 89 0.8937 0.0047 0.8841 0.9025
3.5 3766 100 0.8705 0.0051 0.8602 0.8802
4 3668 97 0.8481 0.0055 0.8371 0.8585
4.5 3572 96 0.8259 0.0058 0.8143 0.8369
5 3474 97 0.8035 0.0060 0.7913 0.8150
5.5 3373 100 0.7803 0.0063 0.7677 0.7924
6 3243 129 0.7505 0.0066 0.7373 0.7631
6.5 3137 106 0.7259 0.0068 0.7124 0.7390
7 3021 114 0.6995 0.0070 0.6856 0.7130
7.5 2891 125 0.6706 0.0071 0.6563 0.6844
8 1459 48 0.6555 0.0073 0.6410 0.6697
8.5 1404 54 0.6313 0.0078 0.6158 0.6463
——————————————————————————-
Note: Survivor function is calculated over full data and evaluated at
indicated times; it is not calculated from aggregates shown at left.
75% survival occurs at around 6 years, and this agrees with report from stsum below.
stsum
failure _d: death
analysis time _t: nyearscf
id: whl2_id
| incidence no. of |—— Survival time —–|
| time at risk rate subjects 25% 50% 75%
———+———————————————————————
total | 29476.98247 .0517692 4327 6.02445 . .
Answer C2.2a.ii:
The correct command is below, and the graph shows that 1458 participants were at risk going into the 8th year of follow-up.
sts graph, risktable

The log-rank test
First, we will divide the variable ‘vitd’ at the median (nb: there are several commands you can use to code this correctly).
xtile vitd_med=vitd, nq(2)
Then we will investigate the effect of levels of VitD on survival by plotting the survival functions for each of the two categories of VitD with the inclusion of confidence intervals (option ci) :
sts graph, by(vitd_med) ci ylab(0.5 0.6 to 1.0)
The ‘sts test’ command by default compares the survival distributions of an exposure variable by performing a log-rank test. One of the options you can specify is “trend”, which will perform a test for trend across levels of an ordered categorical variable.
sts test varlist [, options]
- Question C2.2b: Using the sts test command, compare the survival distributions for the two categories of VitD. Is there a significant difference in the survival of participants below and above the median of vitamin D concentration?
Answer
sts test vitd_med
Failure _d: death
Analysis time _t: fu_years
ID variable: whl1_id
Equality of survivor functions
Log-rank test
| Observed Expected
vitd_med | events events
———+————————-
1 | 940 726.62
2 | 586 799.38
———+————————-
Total | 1526 1526.00
chi2(1) = 119.76
Pr>chi2 = 0.0000
The log-rank test indicates that we can reject the null hypothesis that the survival distributions for the two categories are the same, as the chi(1)=119.76, p<0.001. There is a statistically significant difference in survival for participants above and below the median of vitamin D concentrations. Looking at the KM plot of the survival functions, we can see that participants with vitamin D above the median tend to survive longer.
C2.2 PRACTICAL: SPSS
Kaplan-Meier Survival Curve
To produce a K-M curve in SPSS
Select
Analyze >> Survival >> Kaplan-Meier
Place the time variable (fu_years) into the ‘Time’ box, and then the survival variable ‘death’ into the ‘Status’ box.

Click on ‘Define event’ below the ‘Status’ box. This is where you define the value that indicates the event that we are interested in (i.e. death). As we have death coded as a binary variable, where No =0 and Yes = 1, here we simply indicate that we are looking for the single value of 1. Then press Continue.

In this example we want to see what time since re-survey corresponds with 75% survival. To do this we click on ‘Options’ and the tick the box next to ‘Quartiles’. In the same tab we can select if we want to see a survival plot of our data buy ticking the box next to ‘Survival’ under ‘Plots’. Click Continue and then OK to run the test.

Answer
If you ticked ‘survival table’ it will output a very long table of all participants in the study like the below. You can read the surviving fraction off the bottom of this table which is 0.631 or 63.1%

The time at which 75% survival was seen is easier, as this is read directly from the Percentiles table. This case it is approximately 6 years.

The survival plot shows the cumulative survival over the follow up period. Below is the default produced by SPSS, but you can double click on the graph in the output window to edit it to your needs.

Life tables
In SPSS you can also use the ‘Life tables’ function to produce a table which gives us information at set time points throughout the follow up period.
Select
Analyze >> Survival >> Life Tables
Add fu_years as the ‘Time’ variable and death as the Status variable and then define death as the single value of 1, just as before.
Then you can add the time intervals, so select 8.5 years as the end point, and 0.5 years as the reporting interval. Press OK to run the test.

Answer
The life table output looks like the below. This gives detailed data from each time point you have selected. If you look at the cumulative proportion surviving at 8.5 years this is 0.63, which agrees with the analysis conducted through the Kaplan-Meier function.

Log rank test
Here we are going to compare survival distributions for participants with high and low vitamin D levels. Firstly, we need to create a dichotomous variable for vitamin d, with one group being up to the median value, and the other group above the median.
Use the ‘Recode in Different Variables’ function under the Transform menu to create a new variable called vitd_med with one group =< than the median value (55.7) and one group >55.7. Call these groups ‘low’ and ‘high’ respectively.
We will then conduct the log rank test. Open the Kaplan Meier test box exactly as before and enter fu_years as the Time variable, add death as the Status variable and define it. Now add the new variable vitd_med to the ‘Factor’ box.

Click on ‘Compare Factor’ and tick the box next to ‘Log rank’.

It is also useful to select ‘Mean and median survival’ under statistics in the ‘Options’ tab, as the long rank test does not automatically include these in the output, and you will need them to interpret the direction of the difference.
You may also want to select to output a survival plot as before, so that you can visualise any difference between the groups.
Once you have made all of your selections click Continue and then OK to run the test.
Answer
The test output will look like the below. The log-rank test indicates that we can reject the null hypothesis that the survival distributions for the two categories are the same, as the X2 value =130.193 and p<0.001. There is a statistically significant difference in survival for participants above and below the median of vitamin D concentrations. Looking at the mean values we can see that participants with vitamin D above the median tend to survive longer.


If you’ve chosen to include a survival plot in this analysis it will look like the below. You can clearly see that the high VitD group tend to survive longer than the low VitD group.

C2.2a PRACTICAL: R
Kaplan-Meier Survival Curve
To produce a Kaplan-Meier (K-M) Survival Curve for all-cause mortality, we will use the survminer R package which aims to plot time-to-event data using the ggplot R package.
Produce a Kaplan-Meier curve for mortality
The command ggsurvplot() from the survminer R package produces a plot of the K-M survivor function. There are a variety of options that you could use that might make a graph more informative. You can find more details about the survminer R package in this website.
To produce a Kaplan-Meier curve for the total population we run the following command:
model <- survfit(Surv(fu_years, death) ~ 1, data = df)
To get the risk table at prespecified times, we used the summary() command with the times option:
summary(model, times)
- Question C2.2a.i: Use the survfit() and summary() commands and the times option seq(0.5, 9, 0.5) to see what time since re-survey corresponds to 75% survival.
- Question C2.2a.ii: Use the ggsurvplot() command to plot the cumulative survival, and use the appropriate option to estimate how many participants were at risk going into the 8th year of follow-up.
Answer
Answer C2.2a.i:
> model <- survfit(Surv(fu_years, death) ~ 1, data = df)
> summary(model, times = seq(0.5, 9, 0.5))
Call: survfit(formula = Surv(fu_years, death) ~ 1, data = df)
time n.risk n.event survival std.err lower 95% CI upper 95% CI
0.5 4285 41 0.991 0.00147 0.988 0.993
1.0 4216 69 0.975 0.00239 0.970 0.979
1.5 4138 78 0.957 0.00310 0.950 0.963
2.0 4043 95 0.935 0.00376 0.927 0.942
2.5 3955 88 0.914 0.00426 0.906 0.923
3.0 3865 89 0.894 0.00469 0.885 0.903
3.5 3765 100 0.871 0.00510 0.861 0.881
4.0 3667 97 0.848 0.00546 0.837 0.859
4.5 3571 96 0.826 0.00577 0.815 0.837
5.0 3473 97 0.803 0.00604 0.792 0.815
5.5 3372 100 0.780 0.00630 0.768 0.793
6.0 3242 129 0.750 0.00658 0.738 0.763
6.5 3136 106 0.726 0.00678 0.713 0.739
7.0 3020 114 0.700 0.00697 0.686 0.713
7.5 2890 125 0.671 0.00715 0.657 0.685
8.0 1458 48 0.656 0.00732 0.641 0.670
8.5 1404 54 0.631 0.00776 0.616 0.647
75% survival occurs at around 6 years, and this agrees with report from stsum below.
Answer C2.2a.ii:
The correct command is below, and the graph shows that 1458 participants were at risk going into the 8th year of follow-up.
ggsurvplot(model, risk.table = T)

The log-rank test
First, we will divide the variable vitd at the median (nb: there are several methods you can use to code this correctly in R). We will show a recoding approach based on tidyverse R package. We can use a combination of the mutate() command and case_when() command as follows:
df <- df %>% mutate(vitd_med = case_when(vitd <= summary(df$vitd)[3] ~ 0, vitd > summary(df$vitd)[3] ~ 1))
Note: We use the summary() command to estimate the descriptive statistics (i.e., minimum value, 1st quartile, median, mean, 3rd quartile, and maximum value) of the vitd variable. The 3rd element in the output of the summary()command is the median which is used to dichotomise the vtid variable.
The survdiff() command compares the survival distrubutions of an exposure variable by performing a log-rank test. The syntax of the survdiff() command is the following:
survdiff(Surv(time, event) ~ variable, data)
The survdiff() command uses as a dependent variable a survival object which is produced by the Surv() command. To create a survival object, we should define the follow up time (time) and the outcome status (event) in the Surv() command.
- Question C2.2b: Using the survdiff() command, compare the survival distributions for the two categories of VitD. Is there a significant difference in the survival of participants below and above the median of vitamin D concentration?
Answer
> survdiff(Surv(fu_years, death) ~ vitd_med, data = df)
Call:
survdiff(formula = Surv(fu_years, death) ~ vitd_med, data = df)
N Observed Expected (O-E)^2/E (O-E)^2/V
vitd_med=0 2183 940 727 62.7 120
vitd_med=1 2144 586 799 57.0 120
Chisq= 120 on 1 degrees of freedom, p= <2e-16
The log-rank test indicates that we can reject the null hypothesis that the survival distributions for the two categories are the same, as the chi(1)=120 , p<2e-16. There is a statistically significant difference in survival for participants above and below the median of vitamin D concentrations. Looking at the KM plot of the survival functions, we can see that participants with vitamin D above the median tend to survive longer.
No presentation. Please add downloadable presentation