FoSSA: Fundamentals of Statistical Software & Analysis

Learning Outcomes

By the end of this session, students will be able to:

  • Understand the concepts of statistical inference and sampling distributions
  • Calculate standard error and confidence intervals by hand
  • Calculate standard error and confidence intervals using statistical software
  • Interpret confidence intervals for a difference in means

You can download the slides here: 1.4: Estimates and Confidence Intervals

Video A1.4a – Introduction (2 minutes)

Video A1.4b – Sampling distributions (7 minutes)

Video A1.4c – Standard Error (5 minutes)

Video A1.4d – Confidence Intervals (12 minutes)

A1.4 PRACTICAL: R

Standard error (SE) can be defined as the standard deviation of the sampling distribution. Standard Deviation (SD) and sample size (n) can be used to calculate SE as per the following formula:

LaTeX: SE=\frac{SD}{\sqrt{n}}

To calculate SE in R, we will need to first calculate both the SD and n and define these calculations as objects. We will then input these calculations into a formula in order to calculate SE. Let’s calculate the SE for SBP, as an example. Remember to name your objects properly and tell R to exclude missing values!

1) To calculate n, we will use the sum function to count the non-missing values.

SBP.n <- sum(!is.na(whitehall.data$sbp))

SBP.n

[1] 4318

2) To calculate SD, noting again that we are rounding to 3 decimal places and excluding NA values.

SBP.sd <- sd(whitehall.data$sbp, na.rm=T)

round(SBP.sd, digits=3)

[1] 17.566

3) To calculate SE using the formula above. Note that we have shortened the rounding instruction by leaving out the digits= argument name.

SBP.se <- SBP.sd/sqrt(SBP.n)

round (SBP.se, 3)

[1] 0.267
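The three steps above can be collected into one small helper. This is only a sketch and not part of the course code: the function name standard_error and the simulated vector sbp_sim (standing in for whitehall.data$sbp) are assumptions made so that the example runs on its own.

```r
# Hypothetical helper: standard error of the mean, ignoring missing values.
standard_error <- function(x) {
  n <- sum(!is.na(x))            # number of non-missing observations
  sd(x, na.rm = TRUE) / sqrt(n)  # SE = SD / sqrt(n)
}

# Simulated stand-in for whitehall.data$sbp (an assumption, not the real data).
set.seed(1)
sbp_sim <- c(rnorm(200, mean = 130, sd = 17.5), NA, NA)

round(standard_error(sbp_sim), 3)
```

With the real dataset, standard_error(whitehall.data$sbp) should reproduce the value of SBP.se calculated above.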

A 95% confidence interval can be calculated using the following formula:

LaTeX: \left\lbrack\overline{x}-Z_{0.975}\times SE,\ \overline{x}+Z_{0.975}\times SE\right\rbrack

Here, x̄ is the sample mean and Z0.975 is the 97.5th percentile of the standard normal distribution (approximately 1.96 SD from the mean). We can determine this precisely in R using qnorm(0.975), as previously demonstrated.
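As a quick illustration of where the multiplier comes from, the sketch below asks qnorm for the quantiles matching three common confidence levels. The object names conf_levels and z are chosen here for illustration only.

```r
# Confidence levels of interest
conf_levels <- c(0.90, 0.95, 0.99)

# Two-sided interval: half of (1 - level) sits in each tail
z <- qnorm(1 - (1 - conf_levels) / 2)

round(z, 3)
# [1] 1.645 1.960 2.576
```

The higher the confidence level, the larger the multiplier, and so the wider the resulting interval.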

Knowing this formula, and armed with our previous calculations and our knowledge of how to calculate the mean, we can calculate the 95% CI for SBP as follows.

Za <- qnorm (0.975)

SBP.mean <- mean(whitehall.data$sbp, na.rm=TRUE)

ci.Z_SBP <- c(SBP.mean - (SBP.sd/sqrt(SBP.n)*Za), SBP.mean + (SBP.sd/sqrt(SBP.n)*Za))

ci.Z_SBP

[1] 130.2276 131.2754

We can interpret this as: ‘we are 95% confident that the true mean SBP for this population lies between 130.2276 mmHg and 131.2754 mmHg’.

Notice how SBP.sd/sqrt(SBP.n) is effectively SD/√n, which is the SE. The implication of this is that the larger n is (i.e., the sample size), the smaller the SE and the narrower the CI will be (i.e., there will be less uncertainty around the estimated mean). We can also plot a histogram, using our knowledge from previous sections, in order to check whether this variable is approximately normally distributed.
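To see the effect of sample size concretely, the sketch below holds the SD fixed at the value reported above (17.566) and varies n. The smaller sample sizes are hypothetical values chosen for illustration; only 4318 is the actual n for SBP.

```r
sd_sbp <- 17.566                      # SD of SBP reported above
n <- c(10, 20, 40, 80, 160, 4318)     # hypothetical sample sizes, plus the actual n

se <- sd_sbp / sqrt(n)                # SE = SD / sqrt(n)
ci_width <- 2 * qnorm(0.975) * se     # full width of the 95% CI

data.frame(n = n, se = round(se, 3), ci_width = round(ci_width, 3))
```

Each time n is multiplied by four, the SE (and therefore the CI width) is halved, because n enters the formula under a square root.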

Question A1.4.i: Calculate a 95% confidence interval for mean BMI. How would you interpret this interval?

Question A1.4.ii: Now calculate a 99% confidence interval for mean BMI. Compare the width of this confidence interval to the width of the 95% CI. Why is it different?

Question A1.4 Answers

Answer A1.4.i.

Following the same process as above, and adding calculations for n and SD:

Za <- qnorm (0.975)

BMI.n <- sum(!is.na(whitehall.data$bmi))

BMI.sd <- sd(whitehall.data$bmi, na.rm = T )

BMI.mean <- mean(whitehall.data$bmi, na.rm = T)

ci.Z_BMI <- c(BMI.mean - (BMI.sd/sqrt(BMI.n)*Za), BMI.mean + (BMI.sd/sqrt(BMI.n)*Za))

ci.Z_BMI

[1] 25.11954 25.31434 

We are 95% confident that the true population mean BMI lies between 25.12 and 25.31 kg/m2.

Answer A1.4.ii:

This involves using the same code, but extending out to qnorm (0.995).

Za <- qnorm (0.995)

BMI.n <- sum(!is.na(whitehall.data$bmi))

BMI.sd <- sd(whitehall.data$bmi, na.rm = T )

BMI.mean <- mean(whitehall.data$bmi, na.rm = T)

ci.Z_BMI <- c(BMI.mean - (BMI.sd/sqrt(BMI.n)*Za), BMI.mean + (BMI.sd/sqrt(BMI.n)*Za))

ci.Z_BMI

[1] 25.08894 25.34494

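The 99% CI is wider (a width of about 0.26 kg/m2 compared with 0.19 kg/m2 for the 95% CI). This is because a higher confidence level uses a larger multiplier (about 2.58 standard errors rather than 1.96). To be more certain that the interval contains the true population mean, the range of values has to increase.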
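The repeated code in the two answers could be wrapped in a function that takes the confidence level as an argument. The sketch below is one possible way to do this and is not part of the course code: the name ci_mean and the simulated vector bmi_sim (standing in for whitehall.data$bmi) are assumptions.

```r
# Hypothetical helper: normal-approximation CI for a mean.
ci_mean <- function(x, level = 0.95) {
  x <- x[!is.na(x)]                   # drop missing values
  se <- sd(x) / sqrt(length(x))       # standard error of the mean
  z <- qnorm(1 - (1 - level) / 2)     # about 1.96 for a 95% CI
  c(lower = mean(x) - z * se, upper = mean(x) + z * se)
}

# Simulated stand-in for whitehall.data$bmi (an assumption, not the real data).
set.seed(42)
bmi_sim <- rnorm(500, mean = 25, sd = 3)

ci95 <- ci_mean(bmi_sim, 0.95)
ci99 <- ci_mean(bmi_sim, 0.99)

# Check that the 99% interval is wider than the 95% interval
unname(diff(ci99) > diff(ci95))

# Base R's t.test() reports a t-based interval, which is very close for large n
t.test(bmi_sim, conf.level = 0.99)$conf.int
```

The t-based interval from t.test() uses quantiles of the t distribution rather than the normal distribution, so it will be marginally wider than the interval from ci_mean, with the difference shrinking as n grows.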

A1.4 PRACTICAL: Stata

In A1.4, we will estimate the standard error and confidence intervals for systolic blood pressure (SBP). In research, it is usually not feasible to measure SBP in every member of our population of interest. Therefore, we must measure SBP in a sample of that population (sample population).

Calculating standard error of systolic blood pressure (SBP)

The standard error is a measure of how precise your estimate is likely to be, i.e., it indicates how far your sample mean is likely to be from the true population mean.

To calculate the standard error of the mean, we need to know the standard deviation of the variable and the sample size. Once we know these two pieces of information we can calculate the standard error using the following formula:

LaTeX: Standard\,error=\frac{standard\,deviation}{\sqrt{number\,in\,sample}}

The standard deviation is a measure of how spread out the data are around the mean. For example, a low SD indicates that the data typically cluster around the mean, and a high SD indicates that the data are spread out over a wide range of values. Assume this dataset is the population for this question, i.e., the standard deviation calculated here is the population standard deviation.

First, we must find the sample size and the standard deviation:

describe sbp

summarize sbp, detail

Note: you can shorten summarize to sum, e.g. ‘sum sbp, detail’.

Alternatively, we can find the sample size and standard deviation using the tabstat command too:

tabstat sbp, stat(n mean sd)

Next, we need to check that the data are approximately normally distributed:

histogram sbp

Using the sample size and standard deviation, we can now enter this into a line of code so Stata can calculate the SE:

gen se=17.57/sqrt(4318)

You can also use the ‘display’ command like a calculator in Stata:

display 17.57/sqrt(4318)

Next, try changing the sample size:

1. A sample size of 10:

gen se_1=17.57/sqrt(10)

2. A sample size of 20:

gen se_2=17.57/sqrt(20)

3. A sample size of 40:

gen se_3=17.57/sqrt(40)

4. A sample size of 80:

gen se_4=17.57/sqrt(80)

5. A sample size of 160:

gen se_5=17.57/sqrt(160)

  • Comment on how the sample size influences the standard error
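One way to read the pattern in these results: because the sample size sits under a square root, quadrupling n halves the standard error. As a worked step using the same SD of 17.57 (the subscript notation is introduced here for illustration):

LaTeX: SE_{4n}=\frac{SD}{\sqrt{4n}}=\frac{1}{2}\times\frac{SD}{\sqrt{n}},\quad \frac{17.57}{\sqrt{10}}\approx5.56,\quad \frac{17.57}{\sqrt{40}}\approx2.78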

Calculating confidence intervals for SBP

A confidence interval (CI) is a range of values within which a population parameter is likely to fall. It is common to calculate 95% CIs in research. This means that if we repeated the sampling many times and calculated a CI each time, about 95% of those intervals would contain the true population parameter. To calculate a 95% CI, we need the sample mean and standard error:

LaTeX: 95\%\, CI = sample\, mean \pm 1.96\times SE

Note: In a normal distribution, 95% of the area under the curve will fall within 1.96 standard deviations from the mean. Therefore, if you were to generate 90% or 99% confidence intervals, this value would be different.  

To get the sample mean:

tabstat sbp, stat(n mean sd)

This gives a mean of 130.75.

You previously calculated the SE using the following:

gen se=17.57/sqrt(4318)

This gave a SE of 0.27.

Therefore, to calculate 95% CI, you can use the following to obtain the upper limit:

gen ci95_u=130.75+1.96*0.27

tab ci95_u

To calculate the lower limit:

gen ci95_l=130.75-1.96*0.27

tab ci95_l

  • Can you interpret this confidence interval?
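As a worked check of the two commands above, using the rounded mean of 130.75 and SE of 0.27 from this practical, the limits come out as:

LaTeX: 130.75\pm1.96\times0.27=130.75\pm0.5292\;\Rightarrow\;\left\lbrack130.22,\ 131.28\right\rbrack

These can differ slightly in the second decimal place from limits calculated with unrounded values, because the SE has been rounded to 0.27.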

Try calculating 90% and 99% CIs for SBP and see how this might impact your interpretation of the results.

Using the following formula, calculate 90% confidence intervals for SBP:

LaTeX: 90\%\, CI = sample\, mean \pm 1.65\times SE

Using the following formula, calculate 99% confidence intervals for SBP:

LaTeX: 99\%\, CI = sample\, mean \pm 2.58\times SE
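The multipliers 1.65, 1.96 and 2.58 are rounded quantiles of the standard normal distribution. A general form, writing α for 1 minus the confidence level (notation introduced here for illustration), is:

LaTeX: CI = sample\, mean \pm Z_{1-\alpha/2}\times SE,\quad Z_{0.95}\approx1.645,\quad Z_{0.975}\approx1.960,\quad Z_{0.995}\approx2.576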

Question A1.4.i. Use the ci means command to calculate a 95% confidence interval for mean BMI. (Hint: type help ci to see how to use this command). How would you interpret this interval?

Question A1.4.ii: Now use the ci means command to calculate a 99% confidence interval for mean BMI. Compare the width of this confidence interval to the width of the 95% CI. Why is it different?

Question A1.4 Answers

Answer A1.4.i:

ci means bmi

    Variable |        Obs        Mean    Std. err.       [95% conf. interval]
-------------+---------------------------------------------------------------
         bmi |      4,310    25.21694    .0496936        25.11951    25.31436

We are 95% confident that the true population mean BMI lies between 25.12 and 25.31 kg/m2.

Answer A1.4.ii:

ci means bmi, level(99)

    Variable |        Obs        Mean    Std. err.       [99% conf. interval]
-------------+---------------------------------------------------------------
         bmi |      4,310    25.21694    .0496936        25.08888      25.345

We are 99% confident that the true population mean BMI lies between 25.09 and 25.35 kg/m2.

The width is 0.26 kg/m2 compared to 0.19 kg/m2 for the 95% CI.

The 99% CI is wider. This is because a higher confidence level uses a larger multiplier for the standard error. To be more certain that the interval contains the true population mean, the range of values has to increase.

A1.4 PRACTICAL: SPSS

In A1.4, we will estimate the standard error and confidence intervals for systolic blood pressure (SBP). In research, it is usually not feasible to measure SBP in every member of our population of interest. Therefore, we must measure SBP in a sample of that population (sample population).

Standard error

The standard error is a measure of how precise your estimate is likely to be, i.e., it indicates how far your sample mean is likely to be from the true population mean.

To calculate the standard error of the mean, we need to know the standard deviation of the variable and the sample size. Once we know these two pieces of information we can calculate the standard error using the following formula:

LaTeX: Standard\,error=\frac{standard\,deviation}{\sqrt{number\,in\,sample}}

The standard deviation is a measure of how spread out the data are around the mean. For example, a low SD indicates that the data typically cluster around the mean, and a high SD indicates that the data are spread out over a wide range of values. Assume this dataset is the population for this question, i.e., the standard deviation calculated here is the population standard deviation.

First, we must find the sample size and the standard deviation.

i. Find these by running the ‘Explore’ function from practical A1.3 on the SBP data. 

Next, we need to check that the data are normally distributed. Create a histogram of the SBP data using the ‘Chart Builder’ function and interpret this.

Using the sample size and standard deviation, we can calculate the SE using the formula above. Check this against the SE provided by SPSS in the output from the ‘Explore’ function.
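As a worked example of this hand calculation, taking the SD of 17.57 and the sample size of 4318 for SBP reported in the other practicals on this page (your ‘Explore’ output should show the same values before rounding):

LaTeX: SE=\frac{17.57}{\sqrt{4318}}\approx\frac{17.57}{65.71}\approx0.267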

ii. Try altering the sample size in your calculation and see what effect this has on the standard error. 

Confidence intervals

A confidence interval (CI) is a range of values within which a population parameter is likely to fall. It is common to calculate 95% CIs in research. This means that if we repeated the sampling many times and calculated a CI each time, about 95% of those intervals would contain the true population parameter. To calculate a 95% CI, we need the sample mean and standard error:

LaTeX: 95\%\, CI = sample\, mean \pm 1.96\times SE

In a normal distribution, 95% of the area under the curve will fall within 1.96 standard deviations from the mean. Therefore, if you were to generate 90% or 99% confidence intervals, this value would be different.

You have previously calculated the sample mean and the standard error, so use these to calculate the upper and lower limits of the 95% CI for the mean of SBP. 

Try calculating 90% and 99% CIs for SBP and see how this might impact your interpretation of the results.

Using the following formula, calculate 90% confidence intervals for SBP:

LaTeX: 90\%\, CI = sample\, mean \pm 1.65\times SE

Using the following formula, calculate 99% confidence intervals for SBP:

LaTeX: 99\%\,CI=sample\,mean\pm2.58\times SE

You will have seen that the ‘Explore’ function already gives you the upper and lower bounds of the 95% CI as standard. If you want a different value you need to run a one sample t-test.

Select

Analyze >> Compare Means and Proportions >> One-Sample T Test

Move the variable of interest (in this case SBP) into the ‘Test Variables’ box, then click on ‘Options’ on the right hand side. In this window you can set the Confidence Interval percentage that you wish to show.  Press continue and then OK to run the test. Do not change anything else from the default values.

iii. Use this method to calculate a 95% confidence interval for mean BMI. How would you interpret this interval?

iv. Now calculate a 99% confidence interval for mean BMI. Compare the width of this confidence interval to the width of the 95% CI. Why is it different?

Answer

i. The ‘Explore’ output for SBP should look like the below. You can use this to check your calculations of the SE and 95% CI for SBP.

ii. Increasing the sample size will decrease the standard error, and vice versa, provided all other parameters stay the same.

iii. We are 95% confident that the true population mean BMI lies between 25.12 and 25.31 kg/m2. You can ignore the one-sample t-test statistics themselves; they are not meaningful here because the test compares our data to zero. Just focus on the CI values on the right-hand side of the table.

iv. We are 99% confident that the true population mean BMI lies between 25.09 and 25.34 kg/m2. The width of the CI is 0.25 kg/m2 compared to 0.19 kg/m2 for the 95% CI. The 99% CI is wider. This is because a higher confidence level uses a larger multiplier for the standard error. To be more certain that the interval contains the true population mean, the range of values has to increase.
