FoSSA: Fundamentals of Statistical Software & Analysis

Learning Outcomes

By the end of this section, students will be able to:

  • Open datasets in their chosen statistical software programme
  • Explore datasets and understand what data they have
  • Use basic commands to edit their data

 

Video C2.4 – Poisson Regression (4 minutes)

 

C2.4 PRACTICAL: R

Poisson regression can be used to model the log(count of events) or the log(rate), since a rate is equivalent to the ‘count of events’ divided by a period of follow-up time. Here, we show a Poisson regression modelling the log(rate of disease) as the outcome.
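As a sketch of why the two views are equivalent, the relationship can be written out for a single covariate. The symbols here are introduced only for illustration: D is the count of events, T the follow-up time, and x an exposure indicator such as current smoking.

```latex
\log\!\left(\frac{E[D]}{T}\right) = \beta_0 + \beta_1 x
\quad\Longleftrightarrow\quad
\log E[D] = \log T + \beta_0 + \beta_1 x
```

The term log T enters the right-hand side with its coefficient fixed at 1, and exp(β1) is the rate ratio comparing x = 1 with x = 0.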

In R, we can use the glm() command to fit a Poisson regression model as follows:

glm(formula, data, family = poisson(link = "log"))

The command glm() is the same command that was used to fit a logistic regression model in the previous module. The only difference is that in Poisson regression we use the family poisson(link = "log"), whereas in logistic regression we use the family binomial(link = "logit"). Note that we include offset(log(time)) in the formula of the Poisson regression to specify follow-up time; the follow-up time must be logged because the model is fitted on the log scale. If you do not include this offset, then you will be analysing a log(count) rather than a log(rate).
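The effect of the offset can be seen in a minimal sketch on simulated data. Everything here (the data frame `sim`, its columns, and the chosen rates) is hypothetical and is not the course dataset. Exposed people are given shorter follow-up, so leaving out the offset changes the answer noticeably.

```r
# Hypothetical simulated cohort (illustration only, not the course data)
set.seed(42)
n <- 5000
sim <- data.frame(exposed = rbinom(n, 1, 0.4))

# Exposed people are followed for less time on average
sim$time <- ifelse(sim$exposed == 1, runif(n, 1, 5), runif(n, 1, 10))

# Assumed true rate: 0.05 per year if unexposed, 1.5 times higher if exposed
sim$events <- rpois(n, 0.05 * 1.5^sim$exposed * sim$time)

# With the offset: models the log(rate)
rate_model <- glm(events ~ exposed + offset(log(time)),
                  data = sim, family = poisson(link = "log"))

# Without the offset: models the log(count), ignoring follow-up time
count_model <- glm(events ~ exposed,
                   data = sim, family = poisson(link = "log"))

exp(coef(rate_model)["exposed"])   # rate ratio (should be near the assumed 1.5)
exp(coef(count_model)["exposed"])  # ratio of mean counts, distorted by unequal follow-up
```

With a single binary covariate, the rate ratio from the offset model equals the crude rate ratio (events divided by person-years in each group), which is a useful check on the fitted model.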

To estimate the incidence rate ratio of death for current smokers compared to non-smokers, we run the following command:

> model <- glm(death ~ offset(log(fu_years)) + currsmoker, data = df, family = poisson(link = "log"))
> summary(model)

Call:
glm(formula = death ~ offset(log(fu_years)) + currsmoker, family = poisson(link = "log"), 
    data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.0083  -0.9267  -0.8854   0.9764   3.3032  

Coefficients:
            Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -2.98556    0.02770 -107.770   <2e-16 ***
currsmoker   0.16891    0.07318    2.308    0.021 *  

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 5138.5  on 4320  degrees of freedom
Residual deviance: 5133.3  on 4319  degrees of freedom
  (6 observations deleted due to missingness)
AIC: 8179.3

Number of Fisher Scoring iterations: 6

The estimate in the row for ‘currsmoker’ is the log rate ratio of death for current smokers compared to non-smokers; exponentiating it gives the rate ratio, exp(0.169) = 1.18. This output shows that the rate of death is 18% higher in current smokers compared to non-smokers (RR 1.18, 95% CI: 1.03-1.37). This association is statistically significant (p=0.02).
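The rate ratio and its confidence interval can be reproduced from the coefficient table. This is a minimal sketch that assumes a Wald (normal-approximation) interval and uses the estimate and standard error copied from the output above.

```r
# Values copied from the 'currsmoker' row of the summary(model) output
est <- 0.16891   # log rate ratio
se  <- 0.07318   # standard error on the log scale

z  <- qnorm(0.975)                    # about 1.96 for a 95% interval
rr <- exp(est)                        # rate ratio
ci <- exp(est + c(-1, 1) * z * se)    # Wald interval, back-transformed

round(c(RR = rr, lower = ci[1], upper = ci[2]), 2)
# RR 1.18, lower 1.03, upper 1.37

# With the fitted model object available, the same Wald interval comes from:
# exp(cbind(RR = coef(model), confint.default(model)))
```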

 

Assumptions of Poisson regression

Poisson regression requires that within an exposure group such as smokers or non-smokers, the rate of the event of interest (such as death) stays constant over time, a very strong assumption. A cohort study often involves follow-up over many years and it is unrealistic, because of changes in age, to assume that the rate stays unchanged over follow-up.
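One common way to relax this assumption is to split follow-up into time bands and let the rate differ between bands while staying constant within each band. The sketch below is one possible approach in R, using survSplit() from the survival package on simulated data. The data frame `cohort`, the 5-year cut point, and the derived column names are assumptions for illustration, not part of the course dataset.

```r
library(survival)  # provides Surv() and survSplit()

# Hypothetical simulated cohort (illustration only, not the course data)
set.seed(1)
n <- 500
cohort <- data.frame(
  currsmoker = rbinom(n, 1, 0.3),
  fu_years   = runif(n, 0.5, 10),
  death      = rbinom(n, 1, 0.3)
)

# Split each person's follow-up at 5 years; 'band' labels the time band (1 or 2)
split_cohort <- survSplit(Surv(fu_years, death) ~ ., data = cohort,
                          cut = 5, episode = "band")

# Person-years contributed within each band
split_cohort$band_years <- split_cohort$fu_years - split_cohort$tstart

# The rate may differ between bands but is assumed constant within a band
band_model <- glm(death ~ offset(log(band_years)) + currsmoker + factor(band),
                  data = split_cohort, family = poisson(link = "log"))
exp(coef(band_model))
```

Splitting does not change the total person-years or the total number of deaths; it only allocates them to bands, so each death is counted in the band in which it occurred.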

  • Question C2.4: Use a Poisson regression to assess whether current smoking is associated with the rate of death, once adjusted for age group and frailty.

 

Answer

The command and output are:

> model <- glm(death ~ offset(log(fu_years)) + currsmoker + age_grp + frailty, data = df, family = poisson(link = "log"))
> summary(model)

Call:
glm(formula = death ~ offset(log(fu_years)) + currsmoker + age_grp + 
    frailty, family = poisson(link = "log"), data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0348  -0.8082  -0.5892   0.7480   3.2916  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -4.35993    0.10691 -40.783  < 2e-16 ***
currsmoker   0.22259    0.07330   3.037  0.00239 ** 
age_grp2     0.54990    0.09650   5.698 1.21e-08 ***
age_grp3     0.94373    0.09820   9.610  < 2e-16 ***
age_grp4     1.38579    0.10113  13.703  < 2e-16 ***
frailty2     0.26967    0.10491   2.571  0.01015 *  
frailty3     0.46882    0.10116   4.634 3.58e-06 ***
frailty4     0.76742    0.09329   8.226  < 2e-16 ***
frailty5     1.33908    0.08988  14.899  < 2e-16 ***

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 5138.5  on 4320  degrees of freedom
Residual deviance: 4352.0  on 4312  degrees of freedom
  (6 observations deleted due to missingness)
AIC: 7412

Number of Fisher Scoring iterations: 6

The rate of death is 25% higher in current smokers compared to non-smokers, once adjusted for age group and frailty (RR 1.25, 95% CI: 1.09-1.43; the RR is exp(0.223) from the ‘currsmoker’ row). This association is statistically significant (p=0.002).

 

C2.4 PRACTICAL: Stata

There are several Stata commands which allow the analysis of data with follow-up time and rates.  These are broadly grouped into ‘stand-alone’ commands (e.g. ‘poisson’) and those which are used after the ‘stset’ command which tells Stata that you have ‘survival data’.  You can run a Poisson regression using the ‘stset’ command, followed by the ‘streg’ command, but we are not covering that here. For more information, see the recommended readings at the end of this module.

Poisson regression can be used to model the log(count of events) or the log(rate), since a rate is equivalent to the ‘count of events’ divided by a period of follow-up time.

Here, we show a Poisson regression modelling the log(rate of disease) as the outcome.  In Stata, we can use the ‘poisson’ command as follows:

poisson death currsmoker, e(fu_years) irr

The command poisson works similarly to other regression commands such as logistic and regress, but note we use the ‘e’ option to specify follow-up time (see the help file for more details on this). If you do not use the ‘e’ option to specify the period of follow-up time, then you will be analysing a log(count) [rather than rate].

If we run the ‘poisson’ command above, we get the following output:

poisson death currsmoker, e(fu_years) irr

Iteration 0:   log likelihood = -4087.6737
Iteration 1:   log likelihood = -4087.6737

Poisson regression                                      Number of obs =  4,321
                                                        LR chi2(1)    =   5.12
                                                        Prob > chi2   = 0.0237
Log likelihood = -4087.6737                             Pseudo R2     = 0.0006

------------------------------------------------------------------------------
       death |        IRR   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
  currsmoker |   1.184012   .0866404     2.31   0.021     1.025815    1.366605
       _cons |   .0505112   .0013993  -107.77   0.000     .0478417    .0533296
ln(fu_years) |          1  (exposure)
------------------------------------------------------------------------------
Note: _cons estimates baseline incidence rate.

The ‘IRR’ coefficient in the row for ‘currsmoker’ is the rate ratio of death for current smokers compared to non-smokers. This output shows that the rate of death is 18% higher in current smokers compared to non-smokers (RR 1.18, 95% CI: 1.03-1.37). This association is statistically significant (p=0.02).

The row ‘ln(fu_years)’ always shows a coefficient of 1, and this is correct (even though it looks odd on the output). The log of follow-up time enters the model as an offset: its coefficient is fixed at 1 rather than estimated, which is what turns a model for the log(count) into a model for the log(rate).

 

Assumptions of Poisson regression

Poisson regression requires that within an exposure group such as smokers or non-smokers, the rate of the event of interest (such as death) stays constant over time, a very strong assumption. A cohort study often involves follow-up over many years and it is unrealistic, because of changes in age, to assume that the rate stays unchanged over follow-up.

One way to deal with this assumption is to split follow-up time using the ‘stsplit’ command in Stata, and model the rate of death in separate time bands of follow-up. The rate of death for smokers vs non-smokers can vary between time bands (such as the first 5 years of follow-up, compared to follow-up over years 6-10), as long as the rate is approximately constant within each time band. This is beyond the scope of this introductory module on Poisson regression, so please see the recommended readings at the end of the module.

  •  Question C2.4: Use a Poisson regression to assess whether current smoking is associated with the rate of death, once adjusted for age group and frailty.

 

Answer

The command and output is:

poisson death currsmoker age_grp frailty, e(fu_years) irr

Iteration 0:   log likelihood = -3706.1288
Iteration 1:   log likelihood = -3706.1281
Iteration 2:   log likelihood = -3706.1281

Poisson regression                                      Number of obs =  4,321
                                                        LR chi2(3)    = 768.21
                                                        Prob > chi2   = 0.0000
Log likelihood = -3706.1281                             Pseudo R2     = 0.0939

------------------------------------------------------------------------------
       death |        IRR   Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
  currsmoker |   1.234536   .0903537     2.88   0.004     1.069562    1.424958
     age_grp |   1.550334   .0431322    15.76   0.000      1.46806    1.637218
     frailty |   1.403122   .0282948    16.80   0.000     1.348747    1.459689
       _cons |   .0056696   .0005372   -54.59   0.000     .0047086    .0068266
ln(fu_years) |          1  (exposure)
------------------------------------------------------------------------------
Note: _cons estimates baseline incidence rate.

The rate of death is 23% higher in current smokers compared to non-smokers, once adjusted for age group and frailty (RR 1.23, 95% CI: 1.07-1.42). This association is statistically significant (p=0.004).

 

C2.4 PRACTICAL: SPSS

SPSS does not have Poisson regression as an option within its survival analysis menu. A general Poisson regression can be conducted using the function Analyze >> Generalized Linear Models >> Generalized Linear Models and then ticking ‘Poisson Loglinear’ in the ‘Type of Model’ tab. This function does not allow the time component of the survival analysis to be entered, and will give very different results, so SPSS should not be used for Poisson regression on survival data.

 
