FoSSA: Fundamentals of Statistical Software & Analysis

A1.2.1a Introduction to Stata

Learning Outcomes

By the end of this session, students will be able to:

  • Open datasets in their chosen statistical software programme
  • Explore datasets and understand what data they have
  • Use basic commands to edit their data

Video A1.2.1a – Welcome to Stata I (4 minutes)

There is no need to reinvent the wheel, so please check out Stata’s official introductory video here. It introduces the different windows of the Stata interface that you see when you open the software.

Video A1.2.1b – Welcome to Stata II (11 minutes)

A1.2.1a PRACTICAL: Getting started in Stata

Find your data file

Download your dataset and save it to your laptop. Note the location of the dataset and the file pathway.

Open Stata

Stata can be opened either from the desktop icon or through the ‘Start’ menu, as with other Windows-based programs. When Stata opens, it should show four windows (similar to the screenshot in your lecture notes). Take some time to remind yourself of the purpose of each window in the Stata interface.

Open a do file

To create a do file, click on the New Do-file Editor icon, then select ‘Save As’ within the do-file editor to save the file.


Make sure that the do file is saved at the end of every session. Do files are essential because they create a permanent record of all the commands that were used for the statistical analysis. This is important because it allows you to quickly reproduce all your work by simply re-running your do file. If you do not use a do file, and instead type all of your commands directly into the Stata command window, then none of your commands will be saved and you will have to re-type everything into Stata if you want to repeat the same analysis.

The code in your .do file can be run by highlighting the command you want to run, and then clicking on the ‘execute (do)’ button.


Change the working directory

It is often easiest to begin a Stata session by defining a working directory. The working directory is the folder where Stata will look for input (i.e. your dataset) and where it will save your output. Defining a working directory also means that you will not have to type out the full file pathway every time you want to load a file.

Typing the following command will display the name of the current working directory:

pwd

If you would like to change the current working directory, you can type the following command (where “filepath” is the full pathway to the folder where your dataset is saved):

cd "filepath"

Alternatively, you can use the drop-down menu to change the working directory to the folder where your dataset is saved (File>Change working directory).

You can check the contents of the working directory by typing:

dir

However, it may be better to specify the full path name every time you open or save a file, especially if you are using files from different locations.

Open a log file

A log file is a record of your commands and output (aka the results of your analysis). You can open a log file by typing:

log using "filepath\filename.log"

For example, you can create a log file in your “Documents” folder by typing the following (replace “filepath\” and “filename.log” with the relevant file pathway and file name):

log using "filepath\Documents\filename.log"

Alternatively, go to the drop-down menu File>Log>Begin and then enter the filename. We recommend that you save it as an unformatted log file (.log), as you can open these in other programs such as Notepad or MS Word. (The default in Stata is to save the log file as a .smcl file, which only opens in Stata.)
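As a minimal sketch, a logging session inside a do file might look like the lines below. The file name fossa_session1.log is a hypothetical placeholder, and the log is written to the current working directory. The replace option overwrites an existing log of the same name, and text requests the unformatted .log format.

```stata
* Sketch only: the log file name is a hypothetical placeholder
capture log close                              // close any log that is still open
log using "fossa_session1.log", replace text   // start a plain-text log
display "Output from here on is recorded in the log"
log close                                      // stop logging and save the file
```

Run these lines from a do file rather than typing them into the Command window, because the // comments are only understood in do files.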

Now open a log file.

Open your data set

You can open your dataset by typing the following command in the command window:

use "your filename.dta"

For today this will be:

use "filepath/Whitehall_fossa.dta"

Alternatively, you can open a file from the drop down menu (File>Open) and then browse the directory and click on the appropriate file. This is not recommended because it does not keep a record of the files that are used in the analysis.

To open your dataset with the use command, the file must be in the .dta format. Data can also be imported from other programs as long as it is in a suitable format (usually tab-delimited or .csv). There is more information about importing data from other programs in your handout and in Stata help.
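The import route can be sketched as follows. This is an illustration only: it uses Stata's built-in auto example data rather than the course dataset, the file names auto_example.csv and auto_example.dta are hypothetical, and the import delimited and export delimited commands assume Stata 13 or later.

```stata
* Sketch only: built-in example data and hypothetical file names
sysuse auto, clear                                   // load Stata's built-in example data
export delimited using "auto_example.csv", replace   // write it out as a .csv file
import delimited using "auto_example.csv", clear     // read the .csv file back in
describe                                             // check the imported variables
save "auto_example.dta", replace                     // re-save in .dta format
```

Note that value labels do not survive a round trip through a .csv file, so labelled variables may need to be relabelled after importing.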

Take a moment to look at the variables in your dataset, for example by typing the following command:

describe

Question A1.2.1a: Browse the full dataset.

    • How many rows (individuals or observations) are there?
    • How many columns (variables)?
    • Notice the colour coding of the variables in the browser; the value labels of labelled categorical variables are shown in blue.
    • Look at the variable ‘currsmoker’. Is it clear what the values refer to?

A1.2.1a. Answers

You can browse the dataset by opening the Data Editor (Browse), or by typing browse into the command window.

You will then be able to see that there are 4,327 rows and 17 variables – in this case we have data on N=4327 people, and the data is arranged with one row per person. This information is also available by typing the ‘describe’ command.

You are also able to see how the variable called “bmi_grp4” is coded. The value labels “underweight” and “normal” (etc) are assigned to data coded 1 and 2, respectively. The variable ‘currsmoker’ does not have labels so it is not immediately clear what 0 and 1 refer to. Note: there are no variables coded in the colour red. If you see a red variable, this is a string variable, and when you double-click on individual cells, you notice that there are no underlying numeric codes for these data. It only contains characters.

You can find out some of this information using the command describe, though sometimes it can be helpful to look at the data in the data browser.

Video A1.2.1c – Exploring Data in Stata (5 minutes)

A1.2.1b PRACTICAL: Exploring and getting to know your data

General syntax 

There are a number of ways to ‘get to know’ your dataset. As you proceed through this section, note that the general syntax used for commands in Stata is as follows:

command {space} variable_name(s) {space}  [if expression], {space} options

Note that after the command and the variable name, there is a comma before any options are listed. Some commands can be abbreviated as well. Conditional statements involving the word “if” come before the comma. Keep this general syntax in mind as you work with the commands below.
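To make the pattern concrete, here is a short sketch using Stata's built-in auto example data (an assumption made so that the lines run without the course dataset). Each command follows the same order: command, variable name(s), an optional if condition, then a comma and the options.

```stata
* Sketch only: built-in example data, not the course dataset
sysuse auto, clear

* command   variable_name(s)   [if expression] , options
summarize   price mpg          if foreign == 1 , detail
tabulate    rep78 foreign      if price > 5000 , missing row
```

The extra spacing is only there to line the parts up; Stata ignores it.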

Troubleshooting – how to find help

The ‘help’ command in Stata can be very useful to understand more about the commands in Stata. Simply type help and the command you want more information on and you will open a help window. For example, if you want more information on the tabulate command, you can type:

help tabulate

You can also search for a command. For example, if you wanted to look for general help for histograms, you can type:

search histogram

Exploring your data

The command ‘describe’ gives you the variable names and their labels. You can look at the whole dataset or at specific variables. Try:

describe

describe bmi_grp4

The command ‘codebook’ provides a little more information about the variables in the dataset, such as the minimum and maximum values and information on missing data. Missing values in Stata are generally coded as . but a dataset may instead use a numeric code such as 99 or 0 for missing values, so you need to be clear about how missing values are coded before exploring your dataset. Try:

codebook bmi_grp4

You can also get a feel for the dataset by using the ‘list’ and ‘tabulate’ commands. Try looking at the variables ‘currsmoker’ and ‘frailty’:

tabulate currsmoker

tab currsmoker, missing

tab currsmoker frailty, row

tab currsmoker frailty, col

You can use the ‘if’ condition to view specific observations. Try browsing the data for current smokers aged 60-70 years:

browse if currsmoker==1 & age_grp==1

Now try to tabulate frailty level among the current smokers aged 60-70 using the if condition:

tab frailty if currsmoker==1 & age_grp==1

Notice that in Stata, if you want to specify that a variable is “equal to” some value, you need to type two equals signs, like this: “==”.
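The ‘list’ command and the other comparison operators work in the same way. The lines below are a sketch on Stata's built-in auto example data (an assumption, so that they run without the course dataset); the same patterns apply to variables such as ‘currsmoker’ and ‘age_grp’.

```stata
* Sketch only: built-in example data, not the course dataset
sysuse auto, clear
list make price mpg in 1/5                        // first five observations
list make price mpg if foreign == 1 & mpg >= 30   // & means both conditions must hold
count if foreign == 1 | mpg >= 30                 // | means either condition may hold
count if rep78 != 3 & !missing(rep78)             // != means "not equal to"
```

The last line adds !missing(rep78) because a missing value also counts as “not equal to 3”.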

Question A1.2.1b:  Look at the variable bmi_grp4.

    • What information is collected in this variable?
    • How are the responses coded?
    • Is there any missing data?

A1.2.1b. Answers

You can get information about a variable using several different methods. For instance, you can use the ‘codebook’ command:

codebook bmi_grp4

In the output you can see that the variable is numeric and is coded 1-4. Each code has a label assigned to it, so that 1 = Underweight and 2 = Normal (etc). Stata also tells you that there are 17 missing values, denoted by a dot (.).

Video A1.2.1d – Editing and Generating Variables in Stata (9 minutes)

A1.2.1c PRACTICAL: Editing, creating and amending variables

Labelling variables

If you look at the ‘Variables’ window, you will notice that some of the variables do not have labels, or that the name of the variable is not very clear. The data might be easier to work with if each variable has a short description (or label). You can label variables using the label command.

General syntax:

label variable variable_name "label"

When adding a label to a variable, the command is ‘label variable’; simply typing label is incorrect. For example:

label variable prior_cvd "Prior CVD"

If you look at the ‘Variables’ window, you should see that there is now a label next to ‘prior_cvd’. Now look at this variable using either the codebook or the tab command:

codebook prior_cvd

tab prior_cvd

Now consider ‘currsmoker’. It is a binary variable that takes the value 0 or 1. These codes are not very meaningful on their own; in fact, 0 = no and 1 = yes. We therefore need to label the numeric values within the variable as ‘yes’ or ‘no’. Labelling the values of a variable is a two-step process. First, you define the value label, and then you assign it to the variable. The general syntax is presented below:

label define label_name 0 "label1" 1 "label2" … [, add modify replace]

label values variable_name(s)  label_name

For the currsmoker variable, first we use the ‘label define’ command to create a value label called ‘smoke_lab’, which labels 0 as “no” and 1 as “yes”. For example:

label define smoke_lab 0 "no" 1 "yes"

Next, we need to apply the new label (smoke_lab) to the currsmoker variable. For example:

label values currsmoker smoke_lab

Now look at currsmoker (using either tab or codebook) and you should see that the values 0 and 1 are now labelled as “no” and “yes”. You can also look at the contents of a value label by typing:

label list label_name

For example, type:  label list smoke_lab

  • Can you follow the commands detailed here to label the variable ‘currsmoker’ in your own dataset and version of Stata?
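For practice without the course dataset, the same labelling steps can be sketched on Stata's built-in auto example data. The variable ‘expensive’, the cut-off of 6000 and the label name ‘yesno_lab’ are all hypothetical choices made for this illustration.

```stata
* Sketch only: built-in example data; variable and label names are hypothetical
sysuse auto, clear
generate byte expensive = price > 6000 if !missing(price)   // 1 if price is above 6000, otherwise 0
label variable expensive "Price above 6000"                 // step 0: describe the variable
label define yesno_lab 0 "no" 1 "yes"                       // step 1: define the value label
label values expensive yesno_lab                            // step 2: attach it to the variable
label list yesno_lab                                        // show the codes and their labels
tabulate expensive                                          // the table now shows no / yes
```

Because a value label is defined separately from the variables, the same label (here yesno_lab) can be attached to any number of 0/1 variables.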

Creating new variables – generate

The generate command allows you to create new variables. The general syntax is:

generate new_variable = expression [if expression]

Try the following:

generate var1 = 1

generate var2 = 5


You can also copy existing variables or generate new variables based on existing data. For example:

generate age4 = age_grp

You can also use mathematical operators and functions such as + (add), - (subtract), * (multiply), / (divide), ^ (to the power of), sqrt() (square root), ln() (natural log) and exp() (exponential). For example:

 generate var3 = var1+var2

 gen var4 = var3*var2

These calculations can be used to derive new variables. For example, you could compute the total number of prior diseases a participant has been diagnosed with:

gen prior_disease = prior_cvd + prior_t2dm + prior_cancer

tab prior_disease, m

There is also an extension to the generate command, called egen, which can be useful (see help egen for more information).
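One place where egen can help is when some of the variables being added up have missing values. The sketch below uses Stata's built-in auto example data (an assumption, chosen because its ‘rep78’ variable has some missing values); the new variable names are hypothetical.

```stata
* Sketch only: built-in example data; new variable names are hypothetical
sysuse auto, clear
generate total_plus = rep78 + foreign      // missing whenever rep78 is missing
egen total_row = rowtotal(rep78 foreign)   // row sum that treats missing values as 0
egen n_missing = rowmiss(rep78 foreign)    // number of missing values in each row
tabulate n_missing                         // how many rows have a missing component
list rep78 foreign total_plus total_row if n_missing > 0
```

Which behaviour is appropriate depends on the analysis: a sum that silently treats missing as 0 can be as misleading as one that is missing, so it is worth checking n_missing before choosing.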

Replace and recode 

You can edit variables using ‘replace’ and ‘recode’ commands. PLEASE BE AWARE THERE ARE MULTIPLE (CORRECT) WAYS TO RECODE AND GENERATE NEW VARIABLES. You will see some examples in this practical, and more examples in future sessions. Choosing the method to replace or recode variables is generally a matter of personal preference, and it will rarely matter which method you use when recoding variables as several different commands will produce the same result.

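Try changing the ‘bmi’ variable to create a binary variable that indicates who has obesity (a BMI of 30 or above) and who does not. Never recode the original variable (in case you change your mind)! Duplicate the variable first, and then recode the copy. For example:

gen bmi2=bmi

recode bmi2 (30/max=1) (min/30=0)

The two ranges meet at 30 so that values such as 29.5 are not left unrecoded. Because recode applies the first rule that matches, a BMI of exactly 30 is coded 1.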

Now compare ‘bmi’ and ‘bmi2’:

 browse bmi bmi2

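Here is another way to create a binary BMI variable, this time from the grouped variable (1 = underweight or normal, 0 = overweight or obese):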

gen bmi_bin = 1 if bmi_grp4<=2

replace bmi_bin = 0 if bmi_grp4>2

It is good practice to cross-tabulate your binary and categorical variables to check your coding:

tab bmi_bin bmi_grp4, miss

Stata treats missing values as larger than any number, so the condition bmi_grp4>2 is also true when bmi_grp4 is missing. Notice where the missing values went using this code.
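If you would rather keep missing values as missing, one option is to exclude them explicitly in the if condition. The lines below are a sketch that assumes the course dataset is already open; ‘bmi_bin2’ is a hypothetical name chosen so that it does not clash with ‘bmi_bin’.

```stata
* Sketch only: assumes the course dataset is open; bmi_bin2 is a hypothetical name
generate bmi_bin2 = 1 if bmi_grp4 <= 2
replace bmi_bin2 = 0 if bmi_grp4 > 2 & !missing(bmi_grp4)   // leave missing values as missing
tabulate bmi_bin2 bmi_grp4, missing                         // check where the missing values went
```

Comparing this table with the one for ‘bmi_bin’ shows the observations with missing ‘bmi_grp4’ moving from the 0 row to the missing row.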

Question A1.2.1c:

    • Can you generate a new variable that is the same as the ‘ldlc’ variable? (call it ldl2)
    • Can you recode your new variable to indicate those with LDL-C that is 4 or below, and those with LDL-C that is above 4 mmol/L? (hint: use ‘replace’ as we did above, with ‘<=’ signs)
    • Optional: look at the recode syntax (type help recode) to figure out how you can define the value labels within the recode command.

A1.2.1c. Answer

If you copy the way we did it above, you would use the commands below. I have also given the new variable value labels and tabulated it to check it:

gen ldl2=ldlc

replace ldl2=0 if ldlc<=4

replace ldl2=1 if ldlc>4 & ldlc<.

tab ldl2

label define ldl 0 "4 or below" 1 "Above 4"

label values ldl2 ldl

tab ldl2, m

The condition ldlc<. in the second replace stops missing values, which Stata treats as larger than any number, from being coded as 1.

Note that, as with many commands in Stata, there are alternative correct ways to generate our new variable. One way would be to use the ‘recode’ command.

If you use the ‘recode’ command as we did above, the recoded values are stored in the variable being recoded, i.e. the original information in that variable is overwritten. This is why we created a copy of the variable first. We can skip this step and make our code more efficient by using the “, gen()” option of the “recode” command, which stores the result in a new variable (called ldl2_a here so that it does not clash with the ldl2 variable created above):

recode ldlc (min/4=0) (4/max=1), gen(ldl2_a)

The two ranges meet at 4 so that no value falls in a gap between them. Because recode applies the first rule that matches, a value of exactly 4 is coded 0.

The code can be made even more efficient by defining the value labels within the “recode” command:

recode ldlc (min/4=0 "4 or below") (4/max=1 "above 4"), gen(ldl2_b)

Dropping variables 

You can drop variables from the dataset if you no longer want to use them. But once you do this, you cannot undo it, so be careful when using this command. A variable can be dropped from the dataset by typing the following command:

 drop var1
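If you want to try out ‘drop’ without committing to it, one possible safeguard is to wrap it in preserve and restore, which take a temporary copy of the data and then bring it back. The sketch below uses Stata's built-in auto example data (an assumption, so that it runs without the course dataset).

```stata
* Sketch only: built-in example data, not the course dataset
sysuse auto, clear
preserve                 // take a temporary copy of the data in memory
drop price mpg           // drop two variables
describe, short          // two fewer variables than before
restore                  // bring back the preserved copy
describe, short          // price and mpg are back
```

This only protects the data in memory during the session; the safest habit is still never to save over your original dataset file.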

Editing data – making changes in data editor

Stata has a data editor, where you can see the data in your dataset and make changes to it. To open it, you can either type edit or use the drop-down menu (Data>Data Editor>Data Editor (Edit)). You can then click on the appropriate cell in the Edit window and change its value. However, when data cleaning it is strongly recommended that you save the relevant commands in a .do file and then run that for each session. This keeps your original dataset intact in case you make a mistake while editing, or need to recall something you edited a long time ago, and it gives you a permanent record of the data cleaning process. “Data cleaning” is the process whereby you get all the variables you received in your raw dataset ready to be used in your analysis.

Comments

is the stata software available here or need to buy mine


Excellent presentations, very clear and very useful.


Hi, I am using Stata/SE 11.2, and whenever I try to open a dataset I keep on getting “file Whitehall_fossa.dta not Stata format”. I am getting the same response for the other dataset. Please help.


I expect this is because the datasets were made in Stata version 17 so they may not be compatible with v.11. I would try using the import function in Stata to import a csv file and then re-save it as a .dta.
