FoSSA: Fundamentals of Statistical Software & Analysis

Learning Outcomes

By the end of this session, students will be able to:

  • Interpret p-values
  • Describe type I and type II errors
  • Calculate and interpret t-tests for a difference in means

You can download a copy of the slides here: 1.5: Hypothesis Testing

Video A1.5a – Estimation and Hypothesis Testing (10 minutes)

Video A1.5b – Type I and Type II Errors (2 minutes)

Video A1.5c – Hypothesis Testing & Confidence Intervals (7 minutes)

Video A1.5d – T-tests (7 minutes)

A1.5 PRACTICAL: Stata

A t-test compares the mean of a sample against another value or sample mean. Before performing a t-test we should consider whether the data are continuous, have been randomly sampled from a population, have approximately equal variance across groups, and are approximately normally distributed.

To test whether the variability of the data in each group is similar:

robvar chol, by(prior_t2dm)

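robvar reports Levene's test statistic (W0) together with two more robust variants (W50 and W10). If the p-values are >0.05, there is no evidence against equal variances and we can continue with the standard t-test.

If the variances are not equal, we can still perform a t-test, but we use a slightly different version called the Welch t-test.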

To check the distribution of the data:

histogram chol

We can also check whether both groups follow a normal distribution:

histogram chol, by(prior_t2dm) normal

There are a number of different t-tests you can perform:

  • A one sample t-test
  • A two sample t-test
  • A paired t-test

One sample t-test

A one-sample t-test is used to compare the sample mean to a specific value (or hypothesised population mean).

For example, does the mean cholesterol level in this sample differ from the national average of 5.55?

ttest chol == 5.55

In the output from Stata, you should be able to see the sample size, the mean, standard error, standard deviation and 95% confidence interval for cholesterol. At the bottom of the output, in the centre, you should be able to see the two-sided p-value, labelled ‘Pr(|T| > |t|)’. This tests whether the mean cholesterol level in the population from which this sample was drawn is equal to the national mean of 5.55. The results indicate that the sample mean of 5.51 is significantly different from the national average (p=0.009).
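As a reminder of what this test computes, the one-sample t statistic can be written as below. The symbols are standard textbook notation rather than labels from the Stata output: \overline{x} is the sample mean, \mu_0 the hypothesised mean (5.55 in this example), s the sample standard deviation and n the sample size.

```latex
t = \frac{\overline{x} - \mu_0}{s / \sqrt{n}}, \qquad \text{df} = n - 1
```

The denominator s / \sqrt{n} is the standard error of the mean, so the statistic measures how many standard errors the sample mean lies from the hypothesised value.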

Two-sample t-test

A two-sample t-test is used to compare means between two groups. Check how your variables are categorised – you may need to create a new, binary variable in order to perform the t-test.

Example 1:

A systolic blood pressure of over 130 is defined as high. If we want to test whether cholesterol levels differ if an individual has high systolic blood pressure or not, we can use the following code:

  • Create a new binary variable for systolic blood pressure

gen h_sbp=1 if sbp>130 & sbp~=.

* set to 0 only where SBP is observed, so that missing SBP stays missing
replace h_sbp=0 if sbp<=130

label var h_sbp "high SBP"

label define h_sbp 1 "High SBP" 0 "Not high SBP"

label values h_sbp h_sbp

tab h_sbp, m

  • Perform the t-test using the new variable

ttest chol, by(h_sbp)

Example 2:

Does the mean total cholesterol level differ between people with and without type 2 diabetes?

ttest chol, by(prior_t2dm)

If you can’t assume equal variances, you can use either of the following commands:

ttest chol, by(prior_t2dm) welch

ttest chol, by(prior_t2dm) unequal

In the output from Stata, you should be able to see the sample size, the mean, standard error, standard deviation and 95% confidence interval for cholesterol level in each group (diabetes: yes or no). At the bottom of the output, in the centre, you should be able to see the two-tailed p-value for the t-test, labelled ‘Pr(|T| > |t|)’. If this p-value is >0.05, there is no evidence of a statistically significant difference between the means of the two groups. However, if the p-value is <0.05, then we infer that there is a statistically significant difference between the mean cholesterol levels in people with and without type 2 diabetes.
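For reference, the two versions of the two-sample t statistic can be sketched as follows. The notation is standard rather than taken from the Stata output: \overline{x}_1, \overline{x}_2 are the group means, s_1, s_2 the group standard deviations, and n_1, n_2 the group sizes.

```latex
% Equal variances assumed (pooled standard deviation s_p)
t = \frac{\overline{x}_1 - \overline{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2},
\qquad
\text{df} = n_1 + n_2 - 2

% Equal variances not assumed (Welch)
t = \frac{\overline{x}_1 - \overline{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
```

In the unequal-variance case the degrees of freedom are approximated from the data rather than fixed at n_1 + n_2 - 2, which is why they are usually not a whole number.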

  • See if you can perform a two-sample t-test to compare cholesterol levels between two other groups. 

To perform a t-test for each level of a categorical variable you can use the ‘by’ prefix. To do this, sort your data by the categorical variable and then perform the t-test:

sort death

by death: ttest chol, by(prior_t2dm)

The above command should perform two t-tests. One compares mean cholesterol in those with and without type 2 diabetes, but only in those who died. The other compares mean cholesterol in those with and without type 2 diabetes, but only in those who did not die.

Paired t-test

A paired t-test compares the means of two sets of measurements when the observations are paired. This will be covered in Module B2.4.

Question A1.5a: Recode BMI into a binary variable so that one group has a BMI below 25, and the other group has a BMI of 25 and above. Perform a t-test to compare the mean SBP in those with BMI<25 and those with BMI≥25. Answer the questions:

    1. What is the mean SBP where BMI <25 (LaTeX: \overline{x}_1)?
    2. What is the mean SBP where BMI ≥25 (LaTeX: \overline{x}_2)?
    3. What is the mean difference (LaTeX: \overline{x}_1-\overline{x}_2)?
    4. What is the test statistic t?
    5. What is the 95% CI for the mean difference?
    6. What is the p-value for this t-test and what does it mean?

Question A1.5b: If a clinician has decided that a difference of at least 5 mmHg is considered a clinically worthwhile difference in blood pressure with regard to morbidity associated with high blood pressure, do you consider the result in A1.5a to be clinically significant?

Answer

Answer A1.5a:

Code looks similar to:

gen bmi_2=1 if bmi<25

* code the second group only where BMI is observed, so nothing needs to be dropped
replace bmi_2=0 if bmi>=25 & bmi<.

ttest sbp, by(bmi_2)

    1. 129.49
    2. 131.65
    3. 2.16
    4. 4.01
    5. 1.11 – 3.22
    6. P<0.001 which means there is a significant difference in the mean SBP measures in people who are overweight and those who are not

Answer A1.5b: Whilst there is a statistically significant difference between the 2 mean SBP measures, with a mean difference of 2.16 mmHg, the result is not clinically significant.

A1.5 PRACTICAL: R

A t-test is an inferential procedure used to assess whether a difference between two means (or between a sample mean and a hypothesised value) is statistically significant. There are several assumptions that we take to be true when performing a t-test:

  • The data are continuous
  • The sample data have been randomly sampled from a population
  • There is homogeneity of variance (i.e. the variability of the data in each group is similar)
  • The distribution is approximately normal
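The last two assumptions can be checked informally before running the test. The sketch below uses simulated data (the `demo` data frame and its `x` and `group` columns are hypothetical, not the course dataset) and base R only. Note that `var.test()` is an F test for equal variances, used here as a base R stand-in; it is not the Levene-type test used in the Stata and SPSS practicals and is more sensitive to non-normal data.

```r
# Simulated data for illustration only (not the course dataset)
set.seed(42)
demo <- data.frame(
  x     = c(rnorm(100, mean = 5.5, sd = 1), rnorm(100, mean = 5.8, sd = 1)),
  group = rep(c(0, 1), each = 100)
)

# Distribution check: one histogram per group, side by side
par(mfrow = c(1, 2))
hist(demo$x[demo$group == 0], main = "Group 0", xlab = "x")
hist(demo$x[demo$group == 1], main = "Group 1", xlab = "x")
par(mfrow = c(1, 1))

# Variance check: F test for equal variances in the two groups
vt <- var.test(x ~ group, data = demo)
vt$p.value
```

A large p-value from the variance check gives no evidence against equal variances; a small one suggests using the unequal-variance (Welch) version of the t-test.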

A one-sample t-test (or Student’s t-test) is used on a continuous measurement to decide whether a population mean is equal to a specific value, e.g. is the mean systolic blood pressure of our population 120, or not. In R, the hypothesised value is passed through the mu argument (which defaults to 0), so the syntax for a one-sample t-test against a value of 120 is:

t.test(x, mu=120, conf.level=0.95 , …[options] )

For a two-sample t-test (independent samples t-test), the following code should be used:

t.test(x[group==1], x[group==2] , conf.level=0.95 , …[options] )

For a two-sample test, x specifies the variable on which to perform the t-test, while the group variable defines the groups to be compared. An alternative formulation for the two sample t.test is:

t.test(x~group, data=)

See ?t.test for details of the other options within the native R help files.
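As a minimal sketch of the forms described above, the following runs each of them on simulated data. The `demo` data frame, its column names and the group means are made up for illustration and are not the course dataset.

```r
# Simulated data for illustration only (not the course dataset)
set.seed(1)
demo <- data.frame(
  x     = c(rnorm(50, mean = 120, sd = 10), rnorm(50, mean = 125, sd = 10)),
  group = rep(c(1, 2), each = 50)
)

# One-sample test against a hypothesised mean of 120 (mu defaults to 0 if omitted)
one_sample <- t.test(demo$x, mu = 120, conf.level = 0.95)

# Two-sample test, subsetting form
by_subset <- t.test(demo$x[demo$group == 1], demo$x[demo$group == 2])

# Two-sample test, formula form (gives the same result as the subsetting form)
by_formula <- t.test(x ~ group, data = demo)

# t.test() runs the Welch version by default; var.equal = TRUE gives the pooled version
pooled <- t.test(x ~ group, data = demo, var.equal = TRUE)

# Individual results can be pulled out of the returned object
by_formula$statistic
by_formula$conf.int
by_formula$p.value
```

Because the default is the Welch test, the output header reads "Welch Two Sample t-test" unless var.equal = TRUE is supplied.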

Question A1.5a: Recode BMI into a binary variable so that one group has a BMI below 25, and the other group has a BMI of 25 and above. Perform a t-test to compare the mean SBP in those with BMI<25 and those with BMI≥25. Answer the questions:

  1. What is the mean SBP where BMI <25 (LaTeX: \overline{x}_1)?
  2. What is the mean SBP where BMI ≥25 (LaTeX: \overline{x}_2)?
  3. What is the mean difference (LaTeX: \overline{x}_1-\overline{x}_2)?
  4. What is the test statistic t?
  5. What is the 95% CI for the mean difference?
  6. What is the p-value for this t-test and what does it mean?

Question A1.5b: If a clinician has decided that a difference of at least 5 mmHg is considered a clinically worthwhile difference in blood pressure with regard to morbidity associated with high blood pressure, do you consider the result in A1.5a to be clinically significant?

Answer

whitehall.data$bmibinary <- NA
whitehall.data$bmibinary[whitehall.data$bmi < 25] <- 1
whitehall.data$bmibinary[whitehall.data$bmi >= 25] <- 2
table(whitehall.data$bmibinary)
t.test(sbp ~ bmibinary, data = whitehall.data)

	Welch Two Sample t-test

data:  sbp by bmibinary
t = -4.0068, df = 3959, p-value = 6.267e-05
alternative hypothesis: true difference in means between group 1 and group 2 is not equal to 0
95 percent confidence interval:
 -3.219796 -1.104094
sample estimates:
mean in group 1 mean in group 2 
       129.4905        131.6524

  1. 129.49 mmHg
  2. 131.65 mmHg
  3. Using basic arithmetic in R, 131.6524 - 129.4905 returns 2.1619, so the mean difference is 2.16 mmHg.
  4. t = -4.0068. The t-statistic is the ratio of the departure of the estimated value of a parameter from its hypothesised value to its standard error. It is used to judge whether the data provide evidence against the null hypothesis.
  5. 95 percent confidence interval: -3.22, -1.10. That is to say, we are 95% confident that the difference in the mean systolic blood pressure between those with BMI <25 and BMI ≥25 lies between 1.10 and 3.22 mmHg.
  6. The p-value is 6.267e-05 = 0.00006267. Remember that p-values provide us with the probability of getting the mean difference we saw here, or one more extreme, if the null hypothesis was actually true. In this case, that probability is extremely low. Here, our p-value tells us that there is very strong evidence against the null hypothesis (which would be that there is no difference in SBP between our two BMI groups). Therefore, we can say that there is a significant difference in the mean SBP measures in people who are overweight and those who are not.

Answer A1.5b: No. Note here the difference between clinical and statistical significance. Whilst there is a statistically significant difference between the 2 mean SBP measures, with a mean difference of 2.16 mmHg, the result is not clinically significant.
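To see how the numbers in the R output fit together, the confidence interval can be roughly reconstructed by hand from the mean difference, the t statistic and the degrees of freedom. This is a back-of-the-envelope check using rounded values; the critical value is the 97.5th percentile of the t distribution with 3959 degrees of freedom.

```latex
\text{SE} = \frac{\overline{x}_2 - \overline{x}_1}{|t|} = \frac{2.1619}{4.0068} \approx 0.5396

t_{0.975,\,3959} \approx 1.9606

2.1619 \pm 1.9606 \times 0.5396 \approx 2.1619 \pm 1.0579 = (1.10,\ 3.22)
```

The whole interval lies below the 5 mmHg threshold chosen by the clinician, which is consistent with the conclusion that the difference is statistically but not clinically significant.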

A1.5 PRACTICAL: SPSS
    A t-test compares the mean of a sample against another value or sample mean. Prior to performing a t-test we should think about whether the data are continuous, have been randomly sampled from a population, the variance is approximately equal across groups and the distribution is approximately normal.

    There are a number of different t-tests you can perform:
    • A one sample t-test
    • A two sample t-test
    • A paired t-test (covered in Module B2)

    One sample t-test

    A one-sample t-test is used to compare the sample mean to a specific value (or hypothesised population mean).

    For example, does the mean cholesterol level in this sample differ from the national average of 5.55?

    Run the One-Sample T Test we used in practical A1.4, but this time put cholesterol (chol) in the test Variable List and add 5.55 in the ‘Test Value’ box.

    Answer

    In the output window you will see one table which contains the sample size (N), the mean, standard error, and standard deviation for cholesterol.

    Then a second table which contains the t-test statistic(t), degrees of freedom (df) and then the significance (p). This refers to whether the mean cholesterol level in this sample is equal to the national mean of 5.55. The results indicate that the sample mean of 5.51 is significantly different from the national average (p=0.009).

    We always use the two-sided (two tailed) p value if SPSS offers both.

    Two-sample t-test

    A two-sample t-test is used to compare means between two groups. In SPSS this is referred to as an Independent Samples T Test.

    Example 1:

    A systolic blood pressure of over 130 is defined as high. We want to test whether cholesterol levels differ if an individual has high systolic blood pressure or not. In SPSS we can use a continuous variable to define groups within the t-test, if we know what our cut off value is.

    Select

    Analyze >> Compare Means and Proportions >> Independent Samples T Test.

    Move cholesterol (chol) into the Test Variable(s) list and the blood pressure (sbp) into the Grouping Variable box. You then need to click on ‘Define Groups’, select ‘Cut point’ and input the lower bound of the top group (i.e. 131 in this case) as the cut point. Press Continue and then OK to run the test.

    Answer

    The SPSS output will look like this

    One of the assumptions of the t-test is that the two groups have roughly equal variance. SPSS automatically runs Levene’s Test for Equality of Variances (more on this in Module B3). If the Sig. (p-value) for Levene’s test is <0.05, there is evidence that the two groups do not have equal variances; if p ≥ 0.05, there is no evidence against equal variances. SPSS automatically runs two versions of the t-test, one for equal variances, and one for unequal variances (equal variances not assumed). Use whichever row is appropriate for your data based on the outcome of Levene’s test. In this case we would use ‘equal variances assumed’ on the top row.

    Example 2:

    Does the mean total cholesterol level differ between people with and without type 2 diabetes?

    To answer this run the independent samples t-test again, but this time use the binary variable prior_t2dm as the grouping variable. When you get to the point of defining groups select the top option of ‘Use specified values’ and input the numerical codes which have been used to assign groups, in this case 0 and 1.

    Answer

    The SPSS output will look like this

    Example 3:

    Perform a t-test to compare the mean SBP in those with BMI<25 and those with BMI≥25.

    Answer the questions:

        1. What is the mean SBP where BMI <25?
        2. What is the mean SBP where BMI ≥25 ?
        3. What is the mean difference?
        4. What is the test statistic t ?
        5. What is the 95% CI for the mean difference?
        6. What is the p-value for this t-test and what does it mean?
        7. If a clinician has decided that a difference of at least 5 mmHg is considered a clinically worthwhile difference in blood pressure with regard to morbidity associated with high blood pressure, do you consider the result in this task to be clinically significant?
    Answer

    The SPSS output will look like this

    1. 129.49
    2. 131.65
    3. 2.16
    4. 4.01
    5. 1.11 – 3.22
    6. P<0.001 which means there is a significant difference in the mean SBP measures in people who are overweight and those who are not
    7. Whilst there is a statistically significant difference between the two mean SBP measures, with a mean difference of 2.16 mmHg, the result is not clinically significant.