FoSSA: Fundamentals of Statistical Software & Analysis

C1.1 Introduction to Prevalence, Risk, Odds and Rates

Learning Outcomes

By the end of this session, students will be able to:

  • Define and calculate prevalence, risk, odds and rates
  • Calculate and interpret risk ratios, odds ratios, rate ratios and their 95% confidence intervals

You can download a copy of the slides here: Video C1.1a

Video C1.1a – Definition of Risk, Odds and Rates (5 minutes)

C1.1a PRACTICAL: Stata

Risks

Calculate the overall risk of death by using the command:

tab death, m

Examine the pattern of risk of death in relation to increasing age. We generally list the outcome variable second when using the ‘tab’ command. For example, for age and death, this command would be:

  tab age_grp death, row

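Note: the ‘tab’ command is not suitable for continuous variables, because it produces a separate row for every distinct value. Use a grouped variable such as age_grp instead.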

Question C1.1a.i: What does your table suggest about the risk of death overall?

Question C1.1a.ii: What does your table suggest about the risk of death across levels of age?

C1.1a. Answer

Answer C1.1a.i:

For this cohort the overall risk of death is 1526/4327=0.3527 or 35.3% over the years of follow up.
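
The arithmetic behind this answer, and the corresponding odds, can be checked by hand. The sketch below uses Python purely as a calculator (it is not part of the Stata practical). The variable names are illustrative, and the counts are the ones quoted in the answer above.

```python
# Counts quoted in Answer C1.1a.i
deaths = 1526
total = 4327

# Risk: number of events divided by everyone at risk
risk = deaths / total

# Odds: number of events divided by the number of non-events
odds = deaths / (total - deaths)

print(f"risk = {risk:.4f}")
print(f"odds = {odds:.4f}")
```

The odds can equivalently be written as risk / (1 - risk), which is why odds and risk are close for rare outcomes but diverge for a common outcome such as this one.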

Answer C1.1a.ii:

The risk of death appears to increase as age increases.

The ‘cs’ command

A useful Stata command for exploring risk and odds ratios is ‘cs’. Type the following into Stata to read about this command:

 help cs

This command is used to produce epidemiological tables with point estimates and CI for risk differences, risk ratios, odds ratios and attributable/prevented fractions.

The syntax for the command is:

cs var_case var_exposed

which means that we place the outcome variable first, followed by the exposure variable. With this command, your exposure can only be binary. If you have a categorical exposure, you will have to recode it into a set of binary variables.

NOTE: Stata often requires variables to be coded such that the baseline category equals 0.  That is the case with the cs command.  If you try to use the command in future and find, for example, that your variable has been coded as 1 and 2, then you will need to recode it to 0 and 1. Other commands similarly prefer 0/1 coding, so as good practice you should always recode binary variables this way in Stata (other statistical programs may be different).

Using the ‘cs’ command to calculate a risk ratio

We want to obtain the risk ratio of mortality for individuals that are current smokers compared to non-smokers. We type the command:

 cs death currsmoker

Use the help file for the cs command to find out how to obtain an odds ratio.

Question C1.1b.i: What is the risk ratio and 95% CI of death for individuals that are current smokers compared with individuals that are non-smokers?

Question C1.1b.ii: What is the odds ratio for death in individuals that are current smokers vs non-smokers? State and interpret the 95% confidence interval.

C1.1b. Answers

Answer C1.1b.i: The risk ratio is 1.16, i.e. the risk of death in current smokers is 1.16 times that in non-smokers. The 95% confidence interval around the risk ratio is 1.04-1.30.

Answer C1.1b.ii:

The command is:  cs death currsmoker, or

(Here ‘or’ after the comma is the option that requests the odds ratio.)

The odds ratio for death in current smokers compared to non-smokers is 1.27 with a 95% confidence interval of 1.05-1.52. Or, you could say: “The odds of death in current smokers are 1.27 times those of non-smokers.”

The 95% CI means we can be 95% confident that the true population odds ratio lies between 1.05 and 1.52 (strictly, 95% of intervals constructed in this way would contain the true value). Since this interval does not include 1, it indicates that the odds of death are significantly higher for smokers than for non-smokers.

The ‘ir’ command for rates

Now let’s have a look at the rate ratio for death in smokers vs non-smokers.  This will require the ‘ir’ command.  Use Stata’s help file to explore this command:

 help ir

The basic set-up is the command ir followed by the outcome variable (called ‘var_case’ in the help file), then the exposure variable (‘var_exposed’), and finally the follow-up time variable (‘var_time’). This command only works with binary exposure variables. You will learn about other commands for examining rates in Module C2: Survival Data.

Type the following command to produce the rate and rate ratio:

ir death currsmoker fu_years

Question C1.1b.iii: What is the rate of death in current smokers compared to non-smokers (i.e. what is the rate ratio of ‘currsmoker’)? Is the rate of death in current smokers significantly different from that in non-smokers?

C1.1b.iii. Answer

The rate ratio of death in current vs non-smokers is 1.18. This means the rate of death in current smokers is 1.18 times that of non-smokers. Alternatively, you could say the rate of death in smokers was 18% higher than in non-smokers. The 95% CI is 1.02-1.37. This CI does not cross the null value so we know the rate of death in current smokers is significantly higher than that of non-smokers.

C1.1 PRACTICAL: SPSS

Open the Whitehall FoSSA data set in SPSS

When calculating risk ratios, SPSS automatically places the lowest-coded category first (in this case 0 or ‘No’), so it reports how likely people with characteristic X are NOT to have the outcome of interest (e.g. disease/death). There is no option to change the reference category. You can either run the analysis and take the reciprocal (1/answer) to obtain the ratio for how likely people with characteristic X are TO have the outcome of interest, or recode the variable so that Yes = 1 and No = 2. You learnt how to recode a variable in practical A1.3b. The rest of this practical shows output tables for the recoded variables, so for each variable mentioned you will need to recode it to Yes = 1 and No = 2 before conducting the analysis; otherwise your risk ratios will be the inverse of those shown here.

First, calculate the overall risk of death.

Select

Analyze >> Descriptive Statistics >> Frequencies

Move ‘death’ to the variables box, make sure that ‘Display frequency tables’ is ticked, then press OK.

Now we are going to examine the pattern of risk of death in relation to increasing age.

Select

Analyze >> Descriptive Statistics >> Crosstabs

Move ‘death’ and age group into the rows and columns boxes. It does not really matter which variable goes in which box: the analysis will be the same, and the choice only affects how your output table looks.

Click on ‘cells’ and select what you want to display in the table. At the very least you will want to show observed counts, and you can also choose to display percentages. Click continue, and then OK to run.

Question C1.1a.i: What does your table suggest about the risk of death overall?

Question C1.1a.ii: What does your table suggest about the risk of death across levels of age?

Answer

Answer C1.1a.i:

For this cohort the overall risk of death is 1526/4327=0.3527 or 35.3% over the years of follow up.

Answer C1.1a.ii:

The risk of death appears to increase as age increases.

Risk Ratio

We want to obtain the risk ratio and the odds ratio of mortality for individuals who are current smokers compared to non-smokers.

Select

Analyze >> Descriptive Statistics >> Crosstabs

Move the ‘current smoker’ and ‘death’ variables into the rows and columns boxes as before, then click on ‘statistics’ on the right-hand side and tick the box next to ‘risk’. (NB: SPSS calculates an odds ratio AND a risk ratio when you select ‘risk’.)

Question C1.1b.i: What is the risk ratio and 95% CI of death for individuals that are current smokers compared with individuals that are non-smokers?

Question C1.1b.ii: What is the odds ratio for death in individuals that are current smokers vs non-smokers? State and interpret the 95% confidence interval.

Answer

Your output tables will look like the below. You may have more percentage values in the cross tabs table if you have selected to display percentages for rows, columns and overall. Here I have selected just to display percentages for rows.

Answer C1.1b.i: 

You read the risk ratio from the second row down in the first column. This is the ratio of the risk of death (where death = Yes) for the group where current smoker = Yes compared to the group where current smoker = No.

The risk ratio is 1.16, i.e. the risk of death in current smokers is 1.16 times that in non-smokers. The 95% confidence interval around the risk ratio is 1.04-1.30. (Figures are rounded to 2dp, as is common for reporting.)

Answer C1.1b.ii:

The odds ratio for death in current smokers compared to non-smokers is 1.27 with a 95% confidence interval of 1.05-1.52. Or, you could say: “The odds of death in current smokers are 1.27 times those of non-smokers.”

The 95% CI means we can be 95% confident that the true population odds ratio lies between 1.05 and 1.52 (strictly, 95% of intervals constructed in this way would contain the true value). Since this interval does not include 1, it indicates that the odds of death are significantly higher for smokers than for non-smokers.

NB: SPSS does not offer an option to calculate rate ratios.

C1.1 PRACTICAL: R

Risks 

Calculate the overall risk of death by using a combination of ‘table()’ and ‘prop.table()’ commands:

table(df$death)

prop.table(table(df$death))

Examine the pattern of risk of death in relation to increasing age. The ‘margin’ argument of the ‘prop.table()’ command needs to be set to 1 because age group forms the rows of the table and we want the proportion who died within each age group (row proportions). For example, for age and death, this command would be:

table(df$age_grp, df$death)

prop.table(table(df$age_grp, df$death), margin = 1)

  • Question C1.1a.i: What does your table suggest about the risk of death overall?
  • Question C1.1a.ii: What does your table suggest about the risk of death across levels of age?
Answer

Answer C1.1a.i:

> table(df$death)
   0    1 
2761 1505 

> prop.table(table(df$death))
        0         1 
0.6472105 0.3527895 

For this cohort the overall risk of death is 1505/(2761+1505) = 0.3528 or 35.3% over the years of follow up.

Answer C1.1a.ii:

The risk of death appears to increase as age increases, from 16.2% in age group 1 to 68.9% in age group 4.

> table(df$age_grp, df$death)
       0    1
  1  698  135
  2 1277  498
  3  602  464
  4  184  408

> prop.table(table(df$age_grp, df$death), margin = 1)
   
            0         1
  1 0.8379352 0.1620648
  2 0.7194366 0.2805634
  3 0.5647280 0.4352720
  4 0.3108108 0.6891892

The epitools R package

A useful R package for basic epidemiological analysis is epitools, which includes commands for building two-way and multi-way contingency tables and for calculating epidemiologic measures such as risk ratios and odds ratios. You can find more details on the commands included in the epitools R package here.

Install and load the epitools R package with the following commands:

install.packages("epitools")

library(epitools)

The epitab() command

A useful command in R for exploring risk and odds ratios is epitab() from the epitools R package. It produces epidemiological tables with point estimates and CIs for risk ratios and odds ratios.

The syntax for the command is:

epitab(x, y, method)

where x is the vector of the exposure, y is the vector of the outcome and method can be either “oddsratio” to calculate an odds ratio or “riskratio” to calculate a risk ratio. This command works with either binary or categorical exposures. For categorical exposures, R uses by default the lowest value of the exposure as the reference category.

Using the epitab() command to calculate a risk ratio

We want to obtain the risk ratio of mortality for individuals that are current smokers compared to non-smokers. We type the command:

epitab(x = df$currsmoker, y = df$death, method = "riskratio")

Use the help file for the epitab() command to find out how to obtain an odds ratio.

    • Question C1.1b.i: What is the risk ratio and 95% CI of death for individuals that are current smokers compared with individuals that are non-smokers?
    • Question C1.1b.ii: What is the odds ratio for death in individuals that are current smokers vs non-smokers? State and interpret the 95% confidence interval.
Answer

Answer C1.1b.i: 

> epitab(x = df$currsmoker, y = df$death, method = "riskratio")

$tab
         Outcome
Predictor    0        p0    1        p1 riskratio    lower    upper    p.value
        0 2473 0.6549258 1303 0.3450742  1.000000       NA       NA         NA
        1  327 0.6000000  218 0.4000000  1.159171 1.036537 1.296314 0.01263365

$measure
[1] "wald"

$conf.level
[1] 0.95

$pvalue
[1] "fisher.exact"

The risk ratio is 1.16, i.e. the risk of death in current smokers is 1.16 times that in non-smokers. The 95% confidence interval around the risk ratio is 1.04-1.30.

Answer C1.1b.ii: 
The command is:

> epitab(x = df$currsmoker, y = df$death, method = "oddsratio")
$tab
         Outcome
Predictor    0        p0    1        p1 oddsratio    lower    upper    p.value
        0 2473 0.8832143 1303 0.8566732  1.000000       NA       NA         NA
        1  327 0.1167857  218 0.1433268  1.265285 1.052595 1.520953 0.01263365

$measure
[1] "wald"

$conf.level
[1] 0.95

$pvalue
[1] "fisher.exact"

The odds ratio for death in current smokers compared to non-smokers is 1.27 with a 95% confidence interval of 1.05-1.52. Or, you could say: “The odds of death in current smokers are 1.27 times those of non-smokers.”

The 95% CI means we can be 95% confident that the true population odds ratio lies between 1.05 and 1.52 (strictly, 95% of intervals constructed in this way would contain the true value). Since this interval does not include 1, it indicates that the odds of death are significantly higher for smokers than for non-smokers.
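
To see where these figures come from, the risk ratio, the odds ratio and their intervals can be recomputed by hand from the 2x2 counts in the epitab() output above. The sketch below uses Python purely as a calculator. The variable names and the helper wald_ci() are illustrative, and the formulas are the standard textbook Wald intervals on the log scale with z = 1.96. This is assumed to correspond to the "wald" label in the output; the package's internals are not checked here.

```python
import math

# 2x2 counts taken from the epitab() output above
unexp_alive, unexp_dead = 2473, 1303   # non-smokers
exp_alive, exp_dead = 327, 218         # current smokers

Z = 1.96  # approximate normal quantile for a 95% interval


def wald_ci(estimate, se_log):
    """Return (lower, upper) for a ratio, computed on the log scale."""
    log_est = math.log(estimate)
    return math.exp(log_est - Z * se_log), math.exp(log_est + Z * se_log)


# Risk ratio: risk in the exposed divided by risk in the unexposed
n_unexp = unexp_alive + unexp_dead
n_exp = exp_alive + exp_dead
risk_ratio = (exp_dead / n_exp) / (unexp_dead / n_unexp)
se_log_rr = math.sqrt(1 / exp_dead - 1 / n_exp + 1 / unexp_dead - 1 / n_unexp)
rr_lower, rr_upper = wald_ci(risk_ratio, se_log_rr)

# Odds ratio: odds in the exposed divided by odds in the unexposed
odds_ratio = (exp_dead / exp_alive) / (unexp_dead / unexp_alive)
se_log_or = math.sqrt(1 / exp_dead + 1 / exp_alive + 1 / unexp_dead + 1 / unexp_alive)
or_lower, or_upper = wald_ci(odds_ratio, se_log_or)

print(f"RR = {risk_ratio:.2f} ({rr_lower:.2f}-{rr_upper:.2f})")
print(f"OR = {odds_ratio:.2f} ({or_lower:.2f}-{or_upper:.2f})")
```

Rounded to 2dp, both estimates and both intervals agree with the values reported in the answers above. Note that the odds ratio is further from 1 than the risk ratio, as expected when the outcome is common.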

The ‘rateratio’ method for rates

Now let’s have a look at the rate ratio for death in smokers vs non-smokers. This requires the ‘rateratio’ method of the ‘epitab()’ command, which takes the number of events and the total follow-up time in each exposure group.

Type the following commands to produce the rate ratio:

> death_no_smokers <- sum(df$death[df$currsmoker=="0"])
> death_no_smokers 
[1] 1290 

> death_smokers <- sum(df$death[df$currsmoker=="1"])
> death_smokers 
[1] 215

> fup_no_smoker <- sum(df$fu_years[df$currsmoker=="0"])
> fup_no_smoker
[1] 25460.25

> fup_smoker <- sum(df$fu_years[df$currsmoker=="1"])
> fup_smoker
[1] 3593.209

> epitab(x = c(death_no_smokers, death_smokers), y = c(fup_no_smoker, fup_smoker),
method = c("rateratio"))

          Outcome
Predictor  c(death_no_smokers, death_smokers) c(fup_no_smoker, fup_smoker) 
  Exposed1                               1290                    25460.250  
  Exposed2                                215                     3593.209  
          
Predictor    rateratio     lower      upper       p.value
  Exposed1   1.000000         NA         NA            NA
  Exposed2   1.180943   1.022177   1.364369    0.02633476

$measure
[1] "wald"

$conf.level
[1] 0.95

$pvalue
[1] "midp.exact"

    • Question C1.1c: What is the rate of death in current smokers compared to non-smokers (i.e. what is the rate ratio of ‘currsmoker’)? Is the rate of death in current smokers significantly different from that in non-smokers?
C1.1c. Answer

Answer C1.1c:  The rate ratio of death in current vs non-smokers is 1.18. This means the rate of death in current smokers is 1.18 times that of non-smokers. Alternatively, you could say the rate of death in smokers was 18% higher than in non-smokers. The 95% CI is 1.02-1.36. This CI does not cross the null value so we know the rate of death in current smokers is significantly higher than that of non-smokers.
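
The rate ratio can also be rebuilt by hand from the deaths and person-years computed above, which makes the definition of a rate explicit (events divided by total follow-up time). The sketch below uses Python purely as a calculator. The variable names are illustrative, and the interval is the standard textbook Wald interval on the log scale with z = 1.96, assumed to correspond to the "wald" label in the output.

```python
import math

# Deaths and total follow-up (person-years) from the R output above
deaths_nonsmokers, pyears_nonsmokers = 1290, 25460.25
deaths_smokers, pyears_smokers = 215, 3593.209

# Rate: number of events divided by total person-time at risk
rate_nonsmokers = deaths_nonsmokers / pyears_nonsmokers
rate_smokers = deaths_smokers / pyears_smokers

# Rate ratio: smokers relative to non-smokers
rate_ratio = rate_smokers / rate_nonsmokers

# Wald interval on the log scale; the standard error depends only on the event counts
se_log_ratio = math.sqrt(1 / deaths_smokers + 1 / deaths_nonsmokers)
z = 1.96
lower = math.exp(math.log(rate_ratio) - z * se_log_ratio)
upper = math.exp(math.log(rate_ratio) + z * se_log_ratio)

print(f"rate (non-smokers) = {rate_nonsmokers * 1000:.1f} per 1000 person-years")
print(f"rate (smokers)     = {rate_smokers * 1000:.1f} per 1000 person-years")
print(f"rate ratio = {rate_ratio:.2f} ({lower:.2f}-{upper:.2f})")
```

The two rates are about 50.7 and 59.8 deaths per 1000 person-years, and their ratio and interval match the epitab() output to 2dp.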

 
