Learning Outcomes
By the end of this session, students will be able to:
- Understand the concepts of statistical inference and sampling distributions
- Calculate standard error and confidence intervals by hand
- Calculate standard error and confidence intervals using statistical software
- Interpret confidence intervals for a difference in means
You can download the slides here: 1.4: Estimates and Confidence Intervals
Video A1.4a – Introduction (2 minutes)
Video A1.4b – Sampling distributions (7 minutes)
Video A1.4c – Standard Error (5 minutes)
Video A1.4d – Confidence Intervals (12 minutes)
A1.4 PRACTICAL: R
Standard error (SE) can be defined as the standard deviation of the sampling distribution. The standard deviation (SD) and sample size (n) can be used to calculate SE as per the following formula:
SE = SD / √n
To calculate SE in R, we will need to first calculate both the SD and n and define these calculations as objects. We will then input these calculations into a formula in order to calculate SE. Let’s calculate the SE for SBP, as an example. Remember to name your objects properly and tell R to exclude missing values!
1) To calculate n, we will use the sum function.
SBP.n <- sum(!is.na(whitehall.data$sbp))
SBP.n
[1] 4318
2) To calculate SD, we again exclude NA values, and we round the result to 3 decimal places.
SBP.sd <- sd(whitehall.data$sbp, na.rm=T)
round(SBP.sd, digits=3)
[1] 17.566
3) To calculate SE, we use the formula above. Note that we have shortened the rounding instruction by leaving out the digits= argument name.
SBP.se <- SBP.sd/sqrt(SBP.n)
round (SBP.se, 3)
[1] 0.267
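The definition of SE as the standard deviation of the sampling distribution can be checked with a small simulation. The sketch below does not use the Whitehall data: it assumes a hypothetical, normally distributed population (mean 130, SD 17.5, chosen only for illustration) and repeatedly draws samples of size 50. All object names (pop.mean, pop.sd, n.reps, sample.means and so on) are illustrative.

```r
# Simulated illustration only: hypothetical population, not the Whitehall data
set.seed(1)                      # make the simulation reproducible
pop.mean <- 130                  # assumed population mean (illustrative)
pop.sd   <- 17.5                 # assumed population SD (illustrative)
n        <- 50                   # size of each sample
n.reps   <- 5000                 # number of repeated samples

# Draw n.reps samples of size n and record the mean of each one
sample.means <- replicate(n.reps, mean(rnorm(n, mean = pop.mean, sd = pop.sd)))

# SD of the simulated sampling distribution of the mean
empirical.se <- sd(sample.means)

# SE predicted by the formula SD / sqrt(n)
theoretical.se <- pop.sd / sqrt(n)

round(c(empirical = empirical.se, theoretical = theoretical.se), 3)
```

The two values should be close to each other, which is what the formula predicts. In practice we only have one sample, so we plug the sample SD into the formula instead of the (unknown) population SD.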
A 95% confidence interval can be calculated using the following formula:
95% CI = sample mean ± Z0.975 × SE
Here Z0.975 is the 97.5th percentile of the standard normal distribution (approximately 1.96 SD from the mean). We can determine this precisely in R using qnorm(0.975), as previously demonstrated.
Knowing this formula, and armed with our previous calculations and our knowledge of how to calculate the mean, we can calculate the 95% CI for SBP as follows.
Za <- qnorm (0.975)
SBP.mean <- mean(whitehall.data$sbp, na.rm=TRUE)
ci.Z_SBP <- c(SBP.mean - (SBP.sd/sqrt(SBP.n)*Za), SBP.mean + (SBP.sd/sqrt(SBP.n)*Za))
ci.Z_SBP
[1] 130.2276 131.2754
We can interpret this as: ‘we are 95% confident that the true mean SBP for this population lies between 130.2276 mmHg and 131.2754 mmHg’.
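What ‘95% confident’ means can be illustrated by simulation: if we repeatedly sample from a population whose mean we know, roughly 95% of the intervals built with this formula should contain that mean. The sketch below is an illustration under stated assumptions, not part of the Whitehall analysis: the population (normal, mean 130, SD 17.5), the sample size and the helper function covers() are all hypothetical.

```r
# Simulated illustration only: hypothetical population, not the Whitehall data
set.seed(2)                      # make the simulation reproducible
true.mean <- 130                 # assumed true population mean (illustrative)
true.sd   <- 17.5                # assumed population SD (illustrative)
n         <- 200                 # size of each simulated sample
n.reps    <- 2000                # number of simulated samples
Za        <- qnorm(0.975)

# For one simulated sample, check whether its 95% CI contains the true mean
covers <- function() {
  x     <- rnorm(n, mean = true.mean, sd = true.sd)
  se    <- sd(x) / sqrt(n)
  lower <- mean(x) - Za * se
  upper <- mean(x) + Za * se
  lower <= true.mean && true.mean <= upper
}

# Proportion of simulated intervals that contain the true mean
coverage <- mean(replicate(n.reps, covers()))
coverage
```

The proportion should come out close to 0.95. Any single interval either contains the true mean or it does not; the 95% describes how often the procedure succeeds over many repeated samples.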
Notice how SBP.sd/sqrt(SBP.n) is simply SD/√n, which is the SE. The implication of this is that the larger n is (i.e., the sample size), the smaller the SE and the narrower the CI will be (i.e., there will be less uncertainty around the estimated mean). We can also plot a histogram, using our knowledge from previous sections, in order to check whether this variable is approximately normally distributed.
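The relationship between n, the SE and the CI width can be tabulated directly. The sketch below takes the SD for SBP reported above (17.566) as given and applies it to a set of hypothetical sample sizes, alongside the actual n of 4318. The object names sd.sbp, n.values, se and ci.width are illustrative.

```r
# SD for SBP as reported above; the smaller sample sizes are hypothetical
sd.sbp   <- 17.566
n.values <- c(10, 20, 40, 80, 160, 4318)
Za       <- qnorm(0.975)

se       <- sd.sbp / sqrt(n.values)   # standard error for each sample size
ci.width <- 2 * Za * se               # full width of the 95% CI

data.frame(n = n.values, se = round(se, 3), ci.width = round(ci.width, 3))
```

Because n enters through its square root, the sample size has to be quadrupled to halve the SE (compare the rows for n = 10 and n = 40).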
Question A1.4.i: Calculate a 95% confidence interval for mean BMI. How would you interpret this interval?
Question A1.4.ii: Now calculate a 99% confidence interval for mean BMI. Compare the width of this confidence interval to the width of the 95% CI. Why is it different?
Question A1.4 Answers
Answer A1.4.i.
Following the same process as above, and adding calculations for n and SD:
Za <- qnorm (0.975)
BMI.n <- sum(!is.na(whitehall.data$bmi))
BMI.sd <- sd(whitehall.data$bmi, na.rm = T )
BMI.mean <- mean(whitehall.data$bmi, na.rm = T)
ci.Z_BMI <- c(BMI.mean - (BMI.sd/sqrt(BMI.n)*Za), BMI.mean + (BMI.sd/sqrt(BMI.n)*Za))
ci.Z_BMI
[1] 25.11954 25.31434
We are 95% confident that the true population mean BMI lies between 25.12 and 25.31 kg/m2.
Answer A1.4.ii:
This involves using the same code, but extending out to qnorm (0.995).
Za <- qnorm (0.995)
BMI.n <- sum(!is.na(whitehall.data$bmi))
BMI.sd <- sd(whitehall.data$bmi, na.rm = T )
BMI.mean <- mean(whitehall.data$bmi, na.rm = T)
ci.Z_BMI <- c(BMI.mean - (BMI.sd/sqrt(BMI.n)*Za), BMI.mean + (BMI.sd/sqrt(BMI.n)*Za))
ci.Z_BMI
[1] 25.08894 25.34494
The 99% CI is wider. To be 99% rather than 95% confident that the interval contains the true population mean, we must allow for a wider range of values: the multiplier applied to the SE increases from about 1.96 to about 2.58.
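The two answers above can be wrapped in a small helper that works from summary statistics at any confidence level. This is a sketch, not a built-in R function: ci.mean() and its arguments are hypothetical names, and the mean, SE and n are copied from the Stata ‘ci means bmi’ output shown later on this page. The optional df argument switches from normal to t quantiles, which appears to account for the small differences in the fifth decimal place between the R and Stata intervals.

```r
# Summary figures copied from the Stata 'ci means bmi' output on this page
bmi.mean <- 25.21694
bmi.se   <- 0.0496936
bmi.n    <- 4310

# Hypothetical helper: CI for a mean from summary statistics
ci.mean <- function(xbar, se, level = 0.95, df = NULL) {
  p    <- 1 - (1 - level) / 2                        # e.g. 0.975 for a 95% CI
  crit <- if (is.null(df)) qnorm(p) else qt(p, df)   # z if no df given, otherwise t
  c(lower = xbar - crit * se, upper = xbar + crit * se)
}

ci.z95 <- ci.mean(bmi.mean, bmi.se, level = 0.95)                   # normal-based 95% CI
ci.z99 <- ci.mean(bmi.mean, bmi.se, level = 0.99)                   # normal-based 99% CI
ci.t95 <- ci.mean(bmi.mean, bmi.se, level = 0.95, df = bmi.n - 1)   # t-based 95% CI

round(rbind(z95 = ci.z95, z99 = ci.z99, t95 = ci.t95), 5)
```

With a sample this large the t and normal quantiles are almost identical, so the choice makes no practical difference here; with small samples the t-based interval would be noticeably wider.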
A1.4 PRACTICAL: Stata
In A1.4, we will estimate the standard error and confidence intervals for systolic blood pressure (SBP). In research, it is usually not feasible to measure SBP in every member of our population of interest. Therefore, we must measure SBP in a sample of that population (sample population).
Calculating standard error of systolic blood pressure (SBP)
The standard error is a measure of how precise your estimate is likely to be, i.e. it indicates how far your sample mean is likely to be from the true population mean.
To calculate the standard error of the mean, we need to know the standard deviation of the variable and the sample size. Once we know these two pieces of information, we can calculate the standard error using the following formula:
SE = SD / √n
The standard deviation is a measure of how spread out the data are around the mean. For example, a low SD indicates that the data typically cluster around the mean, and a high SD indicates that the data are spread out over a wide range of values. Assume this dataset is the population for this question, i.e. the standard deviation calculated here is the population standard deviation.
First, we must find the sample size and the standard deviation:
describe sbp
summarize sbp, detail
Note: you can shorten summarize to sum, e.g. ‘sum sbp, detail’
Alternatively, we can find the sample size and standard deviation using the tabstat command too:
tabstat sbp, stat(n mean sd)
Next, we need to check that the data is normally distributed:
histogram sbp
Using the sample size and standard deviation, we can now enter this into a line of code so Stata can calculate the SE:
gen se=17.57/sqrt(4318)
You can also use the ‘display’ command like a calculator in Stata:
display 17.57/sqrt(4318)
Next, try changing the sample size:
1. A sample size of 10:
gen se_1=17.57/sqrt(10)
2. A sample size of 20:
gen se_2=17.57/sqrt(20)
3. A sample size of 40:
gen se_3=17.57/sqrt(40)
4. A sample size of 80:
gen se_4=17.57/sqrt(80)
5. A sample size of 160:
gen se_5=17.57/sqrt(160)
- Comment on how the sample size influences the standard error
Calculating confidence intervals for SBP
A confidence interval (CI) is a range of values within which a population parameter is likely to fall. It is common to calculate 95% CIs in research. This means that, if we repeated the study many times, about 95% of the intervals calculated in this way would contain the true population parameter. To calculate a 95% CI, we need the sample mean and standard error:
95% CI = sample mean ± 1.96 × SE
Note: In a normal distribution, 95% of the area under the curve will fall within 1.96 standard deviations from the mean. Therefore, if you were to generate 90% or 99% confidence intervals, this value would be different.
To get the sample mean:
tabstat sbp, stat(n mean sd)
This gives a mean of 130.75.
You previously calculated the SE using the following:
gen se=17.57/sqrt(4318)
This gave a SE of 0.27.
Therefore, to calculate 95% CI, you can use the following to obtain the upper limit:
gen ci95_u=130.75+1.96*0.27
tab ci95_u
To calculate the lower limit:
gen ci95_l=130.75-1.96*0.27
tab ci95_l
- Can you interpret this confidence interval?
Try calculating 90% and 99% CIs for SBP and see how this might impact your interpretation of the results.
Using the following formula, calculate 90% confidence intervals for SBP:
90% CI = sample mean ± 1.645 × SE
Using the following formula, calculate 99% confidence intervals for SBP:
99% CI = sample mean ± 2.576 × SE
Question A1.4.i. Use the ci means command to calculate a 95% confidence interval for mean BMI. (Hint: type help ci to see how to use this command). How would you interpret this interval?
Question A1.4.ii: Now use the ci means command to calculate a 99% confidence interval for mean BMI. Compare the width of this confidence interval to the width of the 95% CI. Why is it different?
Question A1.4 Answers
Answer A1.4.i:
ci means bmi
    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
         bmi |      4,310    25.21694    .0496936        25.11951    25.31436
We are 95% confident that the true population mean BMI lies between 25.12 and 25.31 kg/m2.
Answer A1.4.ii:
ci means bmi, level(99)
    Variable |        Obs        Mean    Std. err.       [99% conf. interval]
-------------+---------------------------------------------------------------
         bmi |      4,310    25.21694    .0496936        25.08888      25.345
We are 99% confident that the true population mean BMI lies between 25.09 and 25.35 kg/m2.
The width of the 99% CI is 0.26 kg/m2, compared to 0.19 kg/m2 for the 95% CI.
The 99% CI is wider. To be 99% rather than 95% confident that the interval contains the true population mean, we must allow for a wider range of values.
A1.4 PRACTICAL: SPSS
In A1.4, we will estimate the standard error and confidence intervals for systolic blood pressure (SBP). In research, it is usually not feasible to measure SBP in every member of our population of interest. Therefore, we must measure SBP in a sample of that population (sample population).
Standard error
The standard error is a measure of how precise your estimate is likely to be, i.e. it indicates how far your sample mean is likely to be from the true population mean.
To calculate the standard error of the mean, we need to know the standard deviation of the variable and the sample size. Once we know these two pieces of information, we can calculate the standard error using the following formula:
SE = SD / √n
The standard deviation is a measure of how spread out the data are around the mean. For example, a low SD indicates that the data typically cluster around the mean, and a high SD indicates that the data are spread out over a wide range of values. Assume this dataset is the population for this question, i.e. the standard deviation calculated here is the population standard deviation.
First, we must find the sample size and the standard deviation.
i. Find these by running the ‘Explore’ function from practical A1.3 on the SBP data.
Next, we need to check that the data are normally distributed. Create a histogram of the SBP data using the ‘Chart Builder’ function and interpret this.
Using the sample size and standard deviation, we can calculate the SE using the formula above. Check this against the SE provided by SPSS in the output from the ‘Explore’ function.
ii. Try altering the sample size in your calculation and see what effect this has on the standard error.
Confidence intervals
A confidence interval (CI) is a range of values within which a population parameter is likely to fall. It is common to calculate 95% CIs in research. This means that, if we repeated the study many times, about 95% of the intervals calculated in this way would contain the true population parameter. To calculate a 95% CI, we need the sample mean and standard error:
95% CI = sample mean ± 1.96 × SE
In a normal distribution, 95% of the area under the curve will fall within 1.96 standard deviations from the mean. Therefore, if you were to generate 90% or 99% confidence intervals, this value would be different.
You have previously calculated the sample mean and the standard error, so use these to calculate the upper and lower limits of the 95% CI for the mean of SBP.
Try calculating 90% and 99% CIs for SBP and see how this might impact your interpretation of the results.
Using the following formula, calculate 90% confidence intervals for SBP:
90% CI = sample mean ± 1.645 × SE
Using the following formula, calculate 99% confidence intervals for SBP:
99% CI = sample mean ± 2.576 × SE
You will have seen that the ‘Explore’ function already gives you the upper and lower bounds of the 95% CI as standard. If you want a different value you need to run a one sample t-test.
Select
Analyze >> Compare Means and Proportions >> One-Sample T Test
Move the variable of interest (in this case SBP) into the ‘Test Variables’ box, then click on ‘Options’ on the right hand side. In this window you can set the Confidence Interval percentage that you wish to show. Press continue and then OK to run the test. Do not change anything else from the default values.

iii. Use this method to calculate a 95% confidence interval for mean BMI. How would you interpret this interval?
iv. Now calculate a 99% confidence interval for mean BMI. Compare the width of this confidence interval to the width of the 95% CI. Why is it different?
Answer
i. The ‘Explore’ output for SBP should look like the below. You can use this to check your calculations of the SE and 95% CI for SBP.

ii. Increasing the sample size will decrease the standard error, and vice versa, provided all other parameters stay the same.
iii. We are 95% confident that the true population mean BMI lies between 25.12 and 25.31 kg/m2. You can ignore the one-sample t test statistics themselves; they are not meaningful here because the test compares our data to a mean of zero. Just focus on the CI values on the right hand side of the table.

iv. We are 99% confident that the true population mean BMI lies between 25.09 and 25.34 kg/m2. The width of this CI is 0.25 kg/m2, compared to 0.19 kg/m2 for the 95% CI. The 99% CI is wider. To be 99% rather than 95% confident that the interval contains the true population mean, we must allow for a wider range of values.

Good presentation
Insightful videos. It was very easy to follow along because of the clear explanation. Thank you!
Nice presentation.
However, this part of the code needs to be reviewed.
ci.Z_SBP <- c(whitehall.data$sbp.mu – (SBP.sd/sqrt(SBP.n)*Za), whitehall.data$sbp + )SBP.sd/sqrt(SBP.n)*Za))
to
ci.Z_SBP <- c(SBP.mean - (SBP.sd / sqrt(SBP.n)*Za),
SBP.mean + (SBP.sd / sqrt(SBP.n)*Za))
Thanks
Clear and concise explanations with simple, easy-to-understand graphics to represent complicated topics. Fantastic lectures.
In the example on how an increase in sample size results in a smaller standard error of the mean and a narrower confidence interval, the SE used for the calculation would be 2.6 and not 4.3.
And, for the formula of the SE, shouldn’t the denominator be the square root of the sample size?
Thank you for spotting this typo Juliet, you are correct. We will update the slide with the correct text shortly
Very good presentation. 👍🏾