FoSSA: Fundamentals of Statistical Software & Analysis
Course Dataset 1
The Whitehall FoSSA Study is a simulated cohort study similar to the original Whitehall Study of Civil Servants, which was set up in the 1960s and followed London-based male civil servants to investigate cardiovascular disease and mortality. Participants in the original cohort were flagged for mortality at the Office for National Statistics, which provided the date and cause of all deaths occurring up to the end of September 2005. The (simulated) Whitehall FoSSA Study was conducted in 1997 to assess risk factors for cardiac and all-cause mortality in a subset of the original cohort that was still being followed. It contains information on 4,327 individuals who were followed up from 1997 until 2005; the variables are summarised in the table below. See Clarke et al. (2007), Arch Intern Med, 167(13), for more details on the real data that inspired this dataset.
| Variable name | Description | Type of measure | Coding |
| whl1_id | Participant ID number | Continuous | |
| age_grp | Age group (years) | Categorical | 1=60-70; 2=71-75; 3=76-80; 4=81-95 |
| prior_cvd | Prior CVD | Binary | 0=No; 1=Yes |
| prior_t2dm | Prior Type 2 Diabetes | Binary | 0=No; 1=Yes |
| prior_cancer | Prior Cancer | Binary | 0=No; 1=Yes |
| sbp | Systolic Blood Pressure (mmHg) | Continuous | 86-230 mmHg |
| bmi | Body Mass Index (BMI; kg/m²) | Continuous | 15-44 kg/m² |
| bmi_grp4 | BMI, grouped | Categorical | 1=Underweight; 2=Normal; 3=Overweight; 4=Obese |
| hdlc | HDL cholesterol (mmol/L) | Continuous | 0.5-3.07 mmol/L |
| ldlc | LDL cholesterol (mmol/L) | Continuous | 1.05-6.81 mmol/L |
| chol | Total cholesterol (mmol/L) | Continuous | 2.24-10.77 mmol/L |
| currsmoker | Current smoker | Binary | 0=Not a current smoker; 1=Current smoker |
| frailty | Summary frailty score | Categorical | 1=Least frail quintile; 5=Most frail quintile |
| vitd | Vitamin D [25(OH)D] (nmol/L) | Continuous | 18.92-419.89 nmol/L |
| cvd_death | Fatal CVD | Binary | 0=No; 1=Yes |
| death | Death | Binary | 0=No; 1=Yes |
| fu_years | Years of follow-up | Continuous | 0.03-8.5 years |
As this is simulated data, some of the associations you may find between variables are not real, nor do they reflect what the current science suggests about risk factors for cardiovascular disease and mortality. Please do not use this dataset for anything beyond the training material in this course.
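The codes in the table can be turned into readable labels once the data are loaded. The sketch below is one way to do that in Python rather than in the course software; the labels are copied from the "Coding" column above, while the `CODEBOOK` name, the `decode` helper, and the example record are hypothetical and only for illustration.

```python
# Value labels copied from the "Coding" column of the variable table.
# CODEBOOK and decode are hypothetical names chosen for this sketch.
YES_NO = {0: "No", 1: "Yes"}

CODEBOOK = {
    "age_grp": {1: "60-70", 2: "71-75", 3: "76-80", 4: "81-95"},
    "prior_cvd": YES_NO,
    "prior_t2dm": YES_NO,
    "prior_cancer": YES_NO,
    "bmi_grp4": {1: "Underweight", 2: "Normal", 3: "Overweight", 4: "Obese"},
    "currsmoker": {0: "Not a current smoker", 1: "Current smoker"},
    "cvd_death": YES_NO,
    "death": YES_NO,
}


def decode(record):
    """Return a copy of record with coded values replaced by their labels."""
    decoded = {}
    for name, value in record.items():
        labels = CODEBOOK.get(name)
        # Variables without a codebook entry (continuous measures, the
        # frailty quintile) are passed through unchanged.
        decoded[name] = labels.get(value, value) if labels else value
    return decoded


# A made-up record, not taken from the dataset.
example = {"whl1_id": 1, "age_grp": 2, "bmi_grp4": 3, "currsmoker": 0,
           "sbp": 140, "frailty": 4, "death": 0}
print(decode(example))
```

The frailty score is left as a number because the table labels only quintiles 1 and 5.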
Stata Dataset
XLSX Dataset
Interesting data set
Hello,
I am working with the Whitehall dataset in CSV format and noticed that some continuous variables (HDLC, LDL, Cholesterol, VitD) are imported as string variables with nominal measure, likely because of non-numeric entries such as N/A.
I tried converting them to numeric using the syntax: COMPUTE HDLC_num = NUMBER(hdlc, F20.9). However, I keep getting warnings such as #1102: “An invalid numeric field has been found. The result has been set to system-missing value”.
I believe the errors are caused by non-numeric values (N/A) or hidden spaces, but I’m not sure how to clean or convert these variables properly so that I can set their measure to scale for analysis.
Could you advise on the best approach to this conversion?
Thank you!
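One way to approach the cleaning step described above, sketched in Python rather than SPSS syntax: strip surrounding whitespace, treat N/A and blank fields as missing, and convert everything else to a number. The set of missing markers, the `to_number` helper, and the sample rows are assumptions made for illustration; the real file may use other markers.

```python
import csv
import io

# Assumed markers for missing values; extend this set if the file uses others.
MISSING_TOKENS = {"", "N/A", "NA"}


def to_number(text):
    """Convert a raw CSV field to float, or None if it is blank or a missing marker."""
    cleaned = text.strip()  # removes hidden leading and trailing spaces
    if cleaned.upper() in MISSING_TOKENS:
        return None
    # Any other non-numeric text raises ValueError instead of silently
    # becoming missing, so unexpected entries are noticed.
    return float(cleaned)


# Made-up sample in the shape of the course CSV, not real rows from it.
sample = "whl1_id,hdlc,vitd\n1,1.20,55.4\n2,N/A, 60.1 \n3, ,N/A\n"

rows = []
for row in csv.DictReader(io.StringIO(sample)):
    rows.append({
        "whl1_id": row["whl1_id"],
        "hdlc": to_number(row["hdlc"]),
        "vitd": to_number(row["vitd"]),
    })

print(rows)
```

Here missing entries become `None`, which plays the role of a system-missing value, and the remaining values are ordinary numbers that can be treated as scale measures.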
It worked after adding the variables and the data manually myself.
Well designed
Insightful summary data set
Well-structured and insightful summary
Insightful
Well designed
Insightful
I am so expectant
Insightful
Hello. I’m thrilled to be here and I hope that at the end of this course, statistics will be demystified. How long does the entire course last and is there a deadline for completion?
Which app was used to open the file?
I was able to download the stats dataset
In the CSV dataset, I noticed that those with cardiovascular death have a BMI greater than 20, and this may be the result of obstructed blood flow due to their excess adipose tissue.
But in the CSV dataset, I’m a bit confused as to why smokers also have a BMI greater than 20. I thought they would be underweight, since they tend to become dry over time.
Data is fictitious and for training purposes only
I am unable to download the file
Uhoh