Learning Outcomes
By the end of this section, students will be able to:
- Open datasets in their chosen statistical software programme
- Explore datasets and understand what data they have
- Use basic commands to edit their data
Video C2.4 – Poisson Regression (4 minutes)
C2.4 PRACTICAL: R
Poisson regression can be used to model the log(count of events) or the log(rate), since a rate is equivalent to the ‘count of events’ divided by a period of follow-up time. Here, we show a Poisson regression modelling the log(rate of disease) as the outcome.
In R, we can use the glm() command to fit a Poisson regression model as follows:
glm(formula, data, family = poisson(link = "log"))
The command glm() is the same command that was used to fit a logistic regression model in the previous module. The only difference is that in Poisson regression we use the family poisson(link = "log"), whereas in logistic regression we use the family binomial(link = "logit"). Note that we include offset(log(time)) in the formula of the Poisson regression to specify follow-up time; because the model uses a log link, the offset must be the log of the follow-up time. If you do not include the offset, then you will be analysing a log(count) [rather than a log(rate)].
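To see what the offset does, the sketch below fits the same Poisson model with and without it. It uses simulated data, not the course dataset: the data frame sim, the variables events and exposed, and the chosen rates are all assumptions made up for illustration.

```r
# Simulated cohort: every name and value here is illustrative only.
set.seed(42)
n <- 2000
exposed  <- rbinom(n, 1, 0.3)               # hypothetical binary exposure
fu_years <- runif(n, min = 1, max = 10)     # hypothetical follow-up time in years
true_rate <- 0.05 * exp(0.4 * exposed)      # events per person-year
events <- rpois(n, lambda = true_rate * fu_years)
sim <- data.frame(events, exposed, fu_years)

# With the offset: the model describes log(rate), so exp(coef) are rates and rate ratios.
fit_rate <- glm(events ~ offset(log(fu_years)) + exposed,
                data = sim, family = poisson(link = "log"))

# Without the offset: the model describes log(count) and ignores follow-up time.
fit_count <- glm(events ~ exposed,
                 data = sim, family = poisson(link = "log"))

exp(coef(fit_rate))    # intercept: baseline rate per person-year
exp(coef(fit_count))   # intercept: mean number of events per person
```

In the first model the exponentiated intercept should land near the simulated baseline rate of 0.05 events per person-year. In the second it estimates the average count per person, which depends on how long people happened to be followed.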
To estimate the incidence rate ratio of death for current smokers compared to non-smokers, we run the following command:
> model <- glm(death ~ offset(log(fu_years)) + currsmoker, data = df, family = poisson(link = "log"))
> summary(model)

Call:
glm(formula = death ~ offset(log(fu_years)) + currsmoker, family = poisson(link = "log"),
    data = df)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.0083  -0.9267  -0.8854   0.9764   3.3032

Coefficients:
            Estimate Std. Error  z value Pr(>|z|)
(Intercept) -2.98556    0.02770 -107.770   <2e-16 ***
currsmoker   0.16891    0.07318    2.308    0.021 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 5138.5  on 4320  degrees of freedom
Residual deviance: 5133.3  on 4319  degrees of freedom
  (6 observations deleted due to missingness)
AIC: 8179.3

Number of Fisher Scoring iterations: 6
The estimate in the row for ‘currsmoker’ is the log rate ratio of death for current smokers compared to non-smokers; exponentiating it gives the rate ratio, exp(0.16891) = 1.18. This output shows that the rate of death is 18% higher in current smokers compared to non-smokers (RR 1.18, 95% CI: 1.03-1.37). This association is statistically significant (p=0.02).
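The rate ratio and its confidence interval can be reproduced by hand from the ‘currsmoker’ row of the summary() output. The sketch below assumes a Wald interval (estimate plus or minus 1.96 standard errors on the log scale); the object names est, se, rr, lower and upper are chosen here for illustration.

```r
# Values copied from the 'currsmoker' row of the summary() output.
est <- 0.16891   # coefficient on the log scale (log rate ratio)
se  <- 0.07318   # standard error on the log scale

z <- qnorm(0.975)            # about 1.96 for a 95% interval

rr    <- exp(est)            # rate ratio
lower <- exp(est - z * se)   # lower 95% limit
upper <- exp(est + z * se)   # upper 95% limit

round(c(RR = rr, lower = lower, upper = upper), 2)
# RR 1.18, lower 1.03, upper 1.37
```

With the fitted object available, exp(coef(model)) and exp(confint.default(model)) should give the same Wald-based figures for every term at once.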
Assumptions of Poisson regression
Poisson regression requires that within an exposure group such as smokers or non-smokers, the rate of the event of interest (such as death) stays constant over time, a very strong assumption. A cohort study often involves follow-up over many years and it is unrealistic, because of changes in age, to assume that the rate stays unchanged over follow-up.
- Question C2.4: Use a Poisson regression to assess whether current smoking is associated with the rate of death, once adjusted for age group and frailty.
Answer
The command and output are:
> model <- glm(death ~ offset(log(fu_years)) + currsmoker + age_grp + frailty, data = df, family = poisson(link = "log"))
> summary(model)

Call:
glm(formula = death ~ offset(log(fu_years)) + currsmoker + age_grp +
    frailty, family = poisson(link = "log"), data = df)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.0348  -0.8082  -0.5892   0.7480   3.2916

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.35993    0.10691 -40.783  < 2e-16 ***
currsmoker   0.22259    0.07330   3.037  0.00239 **
age_grp2     0.54990    0.09650   5.698 1.21e-08 ***
age_grp3     0.94373    0.09820   9.610  < 2e-16 ***
age_grp4     1.38579    0.10113  13.703  < 2e-16 ***
frailty2     0.26967    0.10491   2.571  0.01015 *
frailty3     0.46882    0.10116   4.634 3.58e-06 ***
frailty4     0.76742    0.09329   8.226  < 2e-16 ***
frailty5     1.33908    0.08988  14.899  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 5138.5  on 4320  degrees of freedom
Residual deviance: 4352.0  on 4312  degrees of freedom
  (6 observations deleted due to missingness)
AIC: 7412

Number of Fisher Scoring iterations: 6
The rate of death is 25% higher in current smokers compared to non-smokers, once adjusted for age group and frailty (RR 1.25, 95% CI: 1.09-1.43). This association is statistically significant (p=0.002).
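As a rough check on the interpretation, the coefficients in the adjusted output can be exponentiated in one step. This sketch copies the estimates from the Coefficients table. It assumes that age_grp and frailty were entered as categorical variables with the lowest level as the reference group, which the row names (age_grp2, frailty2 and so on) suggest. The object names log_rr, rate_ratio and pct_higher are illustrative.

```r
# Log rate ratios copied from the Coefficients table of the adjusted model.
log_rr <- c(currsmoker = 0.22259,
            age_grp2 = 0.54990, age_grp3 = 0.94373, age_grp4 = 1.38579,
            frailty2 = 0.26967, frailty3 = 0.46882,
            frailty4 = 0.76742, frailty5 = 1.33908)

# Exponentiate to move from the log scale to rate ratios.
rate_ratio <- exp(log_rr)

# Percentage difference in the rate compared with the reference group.
pct_higher <- 100 * (rate_ratio - 1)

round(cbind(rate_ratio, pct_higher), 2)
```

Each row compares one group with its reference group, holding the other variables in the model fixed. The ‘currsmoker’ row gives the adjusted rate ratio of about 1.25.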
C2.4 PRACTICAL: Stata
There are several Stata commands which allow the analysis of data with follow-up time and rates. These are broadly grouped into ‘stand-alone’ commands (e.g. ‘poisson’) and those which are used after the ‘stset’ command which tells Stata that you have ‘survival data’. You can run a Poisson regression using the ‘stset’ command, followed by the ‘streg’ command, but we are not covering that here. For more information, see the recommended readings at the end of this module.
Poisson regression can be used to model the log(count of events) or the log(rate), since a rate is equivalent to the ‘count of events’ divided by a period of follow-up time.
Here, we show a Poisson regression modelling the log(rate of disease) as the outcome. In Stata, we can use the ‘poisson’ command as follows:
poisson death currsmoker, e(fu_years) irr
The command poisson works similarly to other regression commands such as logistic and regress, but note we use the ‘e’ option (short for ‘exposure’) to specify follow-up time (see the help file for more details on this). The ‘irr’ option reports incidence rate ratios rather than coefficients on the log scale. If you do not use the ‘e’ option to specify the period of follow-up time, then you will be analysing a log(count) [rather than a log(rate)].
If we run the ‘poisson’ command above, we get the following output:
poisson death currsmoker, e(fu_years) irr

Iteration 0:  log likelihood = -4087.6737
Iteration 1:  log likelihood = -4087.6737

Poisson regression                                      Number of obs =  4,321
                                                        LR chi2(1)    =   5.12
                                                        Prob > chi2   = 0.0237
Log likelihood = -4087.6737                             Pseudo R2     = 0.0006

-------------------------------------------------------------------------------
       death |        IRR   Std. err.        z   P>|z|     [95% conf. interval]
-------------+-----------------------------------------------------------------
  currsmoker |   1.184012    .0866404     2.31   0.021     1.025815    1.366605
       _cons |   .0505112    .0013993  -107.77   0.000     .0478417    .0533296
ln(fu_years) |          1  (exposure)
-------------------------------------------------------------------------------
Note: _cons estimates baseline incidence rate.
The ‘IRR’ coefficient in the row for ‘currsmoker’ is the rate ratio of death for current smokers compared to non-smokers. This output shows that the rate of death is 18% higher in current smokers compared to non-smokers (RR 1.18, 95% CI: 1.03-1.37). This association is statistically significant (p=0.02).
The row ‘ln(fu_years)’ always shows a coefficient of 1, and this is correct (even though it looks odd on the output). It is the log of the follow-up time, entered as an offset: its coefficient is fixed at 1 rather than estimated, which is what makes the model describe the rate (events per unit of follow-up time) rather than the count.
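The role of the fixed coefficient can be seen by writing the model out. In the sketch below the symbols \beta_0 and \beta_1 are notation introduced here for the intercept and the smoking coefficient on the log scale; they do not appear in the Stata output, which reports their exponentiated values.

```latex
\begin{align*}
\log\big(\mathrm{E}[\text{death}_i]\big)
  &= \beta_0 + \beta_1\,\text{currsmoker}_i + 1 \cdot \log(\text{fu\_years}_i) \\
\log\!\left(\frac{\mathrm{E}[\text{death}_i]}{\text{fu\_years}_i}\right)
  &= \beta_0 + \beta_1\,\text{currsmoker}_i
\end{align*}
```

Moving the log of follow-up time to the left-hand side is only possible because its coefficient is exactly 1. Under this notation, exp(\beta_1) is the value in the ‘IRR’ column for ‘currsmoker’ and exp(\beta_0) is the baseline rate in the ‘_cons’ row.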
Assumptions of Poisson regression
Poisson regression requires that within an exposure group such as smokers or non-smokers, the rate of the event of interest (such as death) stays constant over time, a very strong assumption. A cohort study often involves follow-up over many years and it is unrealistic, because of changes in age, to assume that the rate stays unchanged over follow-up.
One way to deal with this assumption is to split follow-up time using the ‘stsplit’ command in Stata, and model the rate of death in separate time bands of follow-up. The rate of death for smokers vs non-smokers can vary between time bands (such as the first 5 years of follow-up, compared to follow-up over years 6-10), as long as the rate is approximately constant within each time band. This is outside the scope of this introductory module on Poisson regression, so please see the recommended readings at the end of the module.
- Question C2.4: Use a Poisson regression to assess whether current smoking is associated with the rate of death, once adjusted for age group and frailty.
Answer
The command and output are:
poisson death currsmoker age_grp frailty, e(fu_years) irr

Iteration 0:  log likelihood = -3706.1288
Iteration 1:  log likelihood = -3706.1281
Iteration 2:  log likelihood = -3706.1281

Poisson regression                                      Number of obs =  4,321
                                                        LR chi2(3)    = 768.21
                                                        Prob > chi2   = 0.0000
Log likelihood = -3706.1281                             Pseudo R2     = 0.0939

-------------------------------------------------------------------------------
       death |        IRR   Std. err.        z   P>|z|     [95% conf. interval]
-------------+-----------------------------------------------------------------
  currsmoker |   1.234536    .0903537     2.88   0.004     1.069562    1.424958
     age_grp |   1.550334    .0431322    15.76   0.000      1.46806    1.637218
     frailty |   1.403122    .0282948    16.80   0.000     1.348747    1.459689
       _cons |   .0056696    .0005372   -54.59   0.000     .0047086    .0068266
ln(fu_years) |          1  (exposure)
-------------------------------------------------------------------------------
Note: _cons estimates baseline incidence rate.
The rate of death is 23% higher in current smokers compared to non-smokers, once adjusted for age group and frailty (RR 1.23, 95% CI: 1.07-1.42). This association is statistically significant (p=0.004).
C2.4 PRACTICAL: SPSS
SPSS does not have Poisson regression as an option within its survival analysis menu. A general Poisson regression can be conducted using the function Analyze >> Generalized Linear Models >> Generalized Linear Models and then ticking ‘Poisson Loglinear’ in the ‘Type of Model’ tab. This function does not allow the time component of the survival analysis to be entered, and will give very different results, so SPSS should not be used for Poisson regression on survival data.