Learning Outcomes
By the end of this session, students will be able to:
- Open datasets in their chosen statistical software programme
- Explore datasets and understand what data they have
- Use basic commands to edit their data
Video A1.2.1a – Welcome to Stata I (4 minutes)
There is no need to reinvent the wheel, so please check out Stata’s official introductory video here. It introduces the different windows of the Stata interface that you see when you open the software.
Video A1.2.1b – Welcome to Stata II (11 minutes)
A1.2.1a PRACTICAL: Getting started in Stata
Find your data file
Download your dataset and save it to your laptop. Note the location of the dataset and the file pathway.
Open Stata
Stata can be opened either from the desktop icon or through the ‘Start’ menu, as with other Windows-based programs. When Stata opens it should show four windows (similar to the screenshot in your lecture notes). Take some time to remind yourself of the purpose of each window in the Stata interface.
Open a do file
To create a do file, click on the icon for the New Do-file Editor, and then select ‘Save as’ within the do-file editor to save the file.
Make sure that the do file is saved at the end of every session. Do files are essential because they create a permanent record of all the commands that were used for the statistical analysis. This is important because it allows you to quickly reproduce all your work by simply re-running your do file. If you do not use a do file, and instead type all of your commands directly into the Stata command window, then none of your commands will be saved and you will have to re-type everything into Stata if you want to repeat the same analysis.
The code in your .do file can be run by highlighting the command you want to run, and then clicking on the ‘execute (do)’ button.
Change the working directory
It is often easiest to begin a Stata session by defining a working directory. The working directory is the folder where Stata will look for input (i.e. your dataset) and where it will save your output. In addition, predefining a working directory means that you will not have to type out the full file pathway every time you want to load a file.
Either of the following commands will display the name of the current working directory (on Mac or Linux, use pwd, because cd on its own changes to your home directory):
pwd
cd
If you would like to change the current working directory, type the following command (where “filepath” is the full pathway to the folder where the “Whitehall_fossa.dta” dataset is saved):
cd "filepath"
Alternatively, you can use the dropdown menu to change the working directory to the folder where the dataset is saved (File>Change working directory).
You can check the contents of the working directory by typing:
dir
However, it may be better to specify the full path name every time you open or save a file, especially if you are using files from different locations.
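The working-directory commands above can be combined as in the short sketch below. The folder path is a hypothetical example, not a location from this course, so substitute your own.

```stata
* Show the current working directory
pwd

* Change it; this folder is a hypothetical example, so substitute your own
cd "C:/Users/yourname/Documents/stata_course"

* List only the Stata datasets in that folder
dir *.dta
```

Forward slashes in the path work on Windows as well as on Mac and Linux, which keeps a do file portable between machines.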
Open a log file
A log file is a record of your commands and output (i.e. the results of your analysis). You can open a log file by typing:
log using "filepath\filename.log"
For example, you can create a log file in your “Documents” folder by typing the following (replace “filepath\” and “filename.log” with the relevant file pathway and file name):
log using "filepath\Documents\filename.log"
Alternatively, go to the drop-down menu File>Log>Begin and then enter the filename. We recommend that you save it as an unformatted log file (.log), as you can open these in other programs, such as Notepad or MS Word. (The default in Stata is to save the log file as a .smcl file, which only opens in Stata.)
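A minimal sketch of how a log might be opened and closed inside a do file is shown here. The file name “session1.log” is a hypothetical example, and the log is written to the current working directory.

```stata
* Close any log that is still open; capture suppresses the error if there is none
capture log close

* Open a plain-text log, overwriting an earlier file of the same name
* ("session1.log" is a hypothetical file name)
log using "session1.log", text replace

* ... your analysis commands go here ...

* Close the log at the end of the session
log close
```

The replace option is convenient when you re-run the same do file several times; use append instead if you would rather add to the existing log.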
Open a log file.
Open your data set
You can open your dataset by typing the following command in the command window:
use "your filename.dta"
For today this will be:
use "filepath/Whitehall_fossa.dta"
Alternatively, you can open a file from the drop down menu (File>Open) and then browse the directory and click on the appropriate file. This is not recommended because it does not keep a record of the files that are used in the analysis.
To open your dataset in Stata, the file must be in the .dta format. Data can also be imported from other programs as long as it is in the correct format (usually tab-delimited or .csv). There is more information about importing data from other programs in your handout and in Stata help.
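As an illustration of importing data, the sketch below reads a comma-separated file and saves it in Stata format. It assumes Stata 13 or later (where import delimited is available), and “mydata.csv” is a hypothetical file name.

```stata
* Read a comma-separated file; "mydata.csv" is a hypothetical file name
import delimited "mydata.csv", clear

* Save it in Stata format, replacing any existing copy
save "mydata.dta", replace
```

The clear option discards whatever data are currently in memory, so save any changes you want to keep before running it.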
Take a moment to look at the variables in your dataset by typing the following command:
browse
Question A1.2.1a: Browse the full dataset.
- How many rows (individuals or observations) are there?
- How many columns (variables)?
- Notice the colour coding of the variables, and how the coding of the labelled categorical (blue) variables is visible in the browser.
- Look at the variable ‘currsmoker’. Is it clear what the values refer to?
A1.2.1a. Answers
You can browse the dataset by opening the Data Editor (Browse), or by typing browse into the command window.
You will then be able to see that there are 4,327 rows and 17 variables – in this case we have data on N=4327 people, and the data is arranged with one row per person. This information is also available by typing the ‘describe’ command.
You can also see how the variable called “bmi_grp4” is coded. The value labels “underweight” and “normal” (etc.) are assigned to data coded 1 and 2, respectively. The variable ‘currsmoker’ does not have labels, so it is not immediately clear what 0 and 1 refer to. Note: there are no variables shown in red in this dataset. A variable shown in red is a string variable: if you double-click on individual cells, you will notice that there are no underlying numeric codes for these data, only characters.
You can find out some of this information using the command describe, though sometimes it can be helpful to look at the data in the data browser.
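If you prefer to get the row and column counts from commands rather than from the browser, something like the following should work once the dataset is open. It is only a sketch of one possible approach.

```stata
* Number of observations (rows) in memory
count

* Header only: number of observations and number of variables
describe, short

* describe leaves the counts behind in r(N) and r(k)
display "Observations: " r(N) "  Variables: " r(k)
```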
Video A1.2.1c – Exploring Data in Stata (5 minutes)
A1.2.1b PRACTICAL: Exploring and getting to know your data
General syntax
There are a number of ways to ‘get to know’ your dataset. As you proceed through this section, note that the general syntax used for commands in Stata is as follows:
command {space} variable_name(s) {space} [if expression], {space} options
Note that after the command and the variable name, there is a comma before any options are listed. Some commands can be abbreviated as well. Conditional statements involving the word “if” come before the comma. Keep this general syntax in mind as you work with the commands below.
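To make the general syntax concrete, here is one command from later in this practical, laid out so that each part lines up with its role. It assumes the Whitehall dataset is open.

```stata
* command      variable_name(s)     if expression     , options
  tabulate     frailty currsmoker   if age_grp == 1   , row missing
```

The extra spaces are only there for alignment; Stata ignores them.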
Troubleshooting – how to find help
The ‘help’ command in Stata can be very useful to understand more about the commands in Stata. Simply type help and the command you want more information on and you will open a help window. For example, if you want more information on the tabulate command, you can type:
help tabulate
You can also search for a command. For example, if you wanted to look for general help for histograms, you can type:
search histogram
Exploring your data
The command ‘describe’ will give you the variable names and their labels. You can look at the whole dataset or specific variables. Try:
describe
describe bmi_grp4
The command ‘codebook’ provides a little more information about the variables in the dataset, such as the minimum and maximum values and information on missing data. Missing values in Stata are generally coded as . (a full stop), but some datasets code missing values as 99 or 0, so you need to be clear about how missing values are coded before exploring your dataset. Try:
codebook bmi_grp4
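For a quick overview of missing data across the whole dataset, one option is sketched below. The mvdecode line is commented out because it is purely illustrative: nothing in this practical says that 99 is used as a missing code here.

```stata
* Count missing values (.) for every numeric variable that has any
misstable summarize

* If a dataset used 99 as a missing code, you could convert it to Stata's . first
* (illustrative only; check your own codebook before recoding anything)
* mvdecode bmi_grp4, mv(99)
```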
You can also get a feel for the dataset by using the ‘list’ and ‘tabulate’ commands. Try looking at the variables ‘currsmoker’ and ‘frailty’:
tabulate currsmoker
tab currsmoker, missing
tab currsmoker frailty, row
tab currsmoker frailty, col
You can use the ‘if’ condition to view specific observations. Try browsing the data for current smokers aged 60-70 years:
browse if currsmoker==1 & age_grp==1
Now try to tabulate frailty level among the current smokers aged 60-70 using the if condition:
tab frailty if currsmoker==1 & age_grp==1
Notice that in Stata, if you want to specify that a variable is “equal to” some value, you need to type two = signs, like this: “==”.
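The ‘list’ command and the other logical operators follow the same pattern. The lines below are a sketch using variables from the Whitehall dataset, which is assumed to be open.

```stata
* First ten observations of two variables
list currsmoker frailty in 1/10

* & means "and", | means "or", != means "not equal to"
tab frailty if currsmoker == 1 | age_grp == 1
tab frailty if currsmoker != 1 & !missing(currsmoker)

* count also accepts an if condition
count if currsmoker == 1 & age_grp == 1
```

The !missing() check in the third command matters because a missing value is also “not equal to 1”, so without it people with no smoking data would be included.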
Question A1.2.1b: Look at the variable bmi_grp4.
- What information is collected in this variable?
- How are the responses coded?
- Is there any missing data?
A1.2.1b. Answers
You can get information about a variable using several different methods. For instance, if you use the ‘codebook’ command it will show you the following:
Here you can see that the variable is numeric and is coded 1-4. Each code has a label assigned to it, so that 1 = Underweight and 2 = Normal (etc.). Stata also tells you that there are 17 missing values, denoted by a full stop (.).
Video A1.2.1d – Editing and Generating Variables in Stata (9 minutes)
A1.2.1c PRACTICAL: Editing, creating and amending variables
Labelling variables
If you look at the ‘Variables’ window, you will notice that some of the variables do not have labels, or the given name of the variable is not very clear. The data might be easier to work with if you have a short description (or label) of the variables. You can label variables using the label command.
General syntax:
label variable variable_name "label"
When adding a label to a variable, the command is ‘label variable’; typing label on its own is not enough. For example:
label variable prior_cvd "Prior CVD"
If you look at the ‘Variables’ window, you should see that there now is a label next to ‘prior_cvd’. Now look at this variable using either the codebook or tab command:
codebook prior_cvd
tab prior_cvd
Now look at ‘currsmoker’ in the same way. It is a binary variable that takes the value 0 or 1. On its own this is not very meaningful; in fact, 0 = no and 1 = yes. Therefore, we need to label the numeric values within the variable as ‘yes’ or ‘no’. Labelling the values of a variable is a two-step process. First, you define the label, and then you assign the label to the variable. The general syntax is presented below:
label define label_name 0 "label1" 1 "label2" …[, add modify replace]
label values variable_name(s) label_name
For the currsmoker variable, we first use the ‘label define’ command to create a value label called ‘smoke_lab’, which labels 0 as “no” and 1 as “yes”. For example:
label define smoke_lab 0 "no" 1 "yes"
Next, we need to apply the new label (smoke_lab) to the currsmoker variable. For example:
label values currsmoker smoke_lab
Now look at currsmoker (either tab or codebook) and you should see that the values 0 and 1 are now labelled as “no” and “yes”. You can also look at the labels of a variable by typing:
label list label_name
For example, type: label list smoke_lab
- Can you follow the commands detailed here to label the variable ‘currsmoker’ in your own dataset and version of Stata?
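When the labelling commands live in a do file that you run more than once, a variation like the sketch below may be useful. It assumes currsmoker is stored as a numeric variable; value labels cannot be attached to string variables.

```stata
* replace lets you re-run the do file without a "label already defined" error
label define smoke_lab 0 "no" 1 "yes", replace
label values currsmoker smoke_lab

* Prefix each label with its numeric code, e.g. "0. no"
numlabel smoke_lab, add
tab currsmoker, missing
```

Showing the code next to the label makes it easier to write if conditions, which always use the underlying number rather than the label text.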
Creating new variables – generate
The generate command allows you to create new variables. The general syntax is:
generate new_variable=expression [if expression]
Try the following:
generate var1 = 1
generate var2 = 5
browse
You can also copy existing variables or generate new variables based on existing data. For example:
generate age4 = age_grp
You can also use mathematical operators and functions such as + (add), - (subtract), * (multiply), / (divide), ^ (to the power of), sqrt() (square root), ln() (natural log) and exp() (exponential). For example:
generate var3 = var1+var2
gen var4 = var3*var2
These calculations can be used to derive new variables. For example, you could compute the total number of prior diseases a participant has been diagnosed with:
gen prior_disease = prior_cvd + prior_t2dm + prior_cancer
tab prior_disease, m
There is also an extension to the generate command, called egen, which can be useful (see help egen for more information).
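As one illustration of egen, the sketch below builds a row total of the three prior-disease variables. Unlike adding the variables with +, which returns missing if any one of them is missing, rowtotal() treats a missing value as 0. The names prior_total and prior_nmiss are hypothetical, and the last line assumes you have already created prior_disease as shown above.

```stata
* Sum across variables, treating a missing value as 0
egen prior_total = rowtotal(prior_cvd prior_t2dm prior_cancer)

* Number of the three variables that are missing for each person
egen prior_nmiss = rowmiss(prior_cvd prior_t2dm prior_cancer)

* Compare with the version built using +
tab prior_disease prior_total, missing
```

Which behaviour is appropriate depends on your analysis, so it is worth checking prior_nmiss before deciding which total to use.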
Replace and recode
You can edit variables using ‘replace’ and ‘recode’ commands. PLEASE BE AWARE THERE ARE MULTIPLE (CORRECT) WAYS TO RECODE AND GENERATE NEW VARIABLES. You will see some examples in this practical, and more examples in future sessions. Choosing the method to replace or recode variables is generally a matter of personal preference, and it will rarely matter which method you use when recoding variables as several different commands will produce the same result.
Try using the ‘bmi’ variable to create a binary variable that indicates those who have obesity and those who do not. Never recode the original variable (in case you change your mind)! Duplicate the variable first, and then recode the copy. For example:
gen bmi2=bmi
* recode applies the first rule that matches, so 30 and above becomes 1 and everything below 30 becomes 0
recode bmi2 (30/max=1) (min/30=0)
Now compare ‘bmi’ and ‘bmi2’:
browse bmi bmi2
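Browsing is a useful first look, but a summary table is a quicker way to confirm that a recode did what you intended. The following is a sketch of one such check.

```stata
* Range of bmi within each value of bmi2:
* the maximum for group 0 should be below 30 and the minimum for group 1 should be 30 or more
tabstat bmi, by(bmi2) statistics(min max n)
```

If the table shows any group other than 0 and 1, some values of bmi were not matched by any recode rule and were left unchanged.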
Here is another way to recode BMI:
gen bmi_bin = 1 if bmi_grp4<=2
replace bmi_bin = 0 if bmi_grp4>2
It is good practice to cross-tabulate your binary and categorical variables to check your coding:
tab bmi_bin bmi_grp4, miss
Stata treats missing values as larger than any number, so notice where the missing values ended up after running this code.
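The small example below shows the problem and one way to guard against it. It uses toy data typed in with input rather than the Whitehall dataset, so run it in a fresh session: clear removes whatever data are in memory. The name bmi_bin2 is hypothetical.

```stata
* Toy data only: clear removes the data currently in memory
clear
input byte bmi_grp4
1
2
3
4
.
end

* Unguarded: the missing value satisfies bmi_grp4>2 and is coded 0
gen bmi_bin = 1 if bmi_grp4<=2
replace bmi_bin = 0 if bmi_grp4>2

* Guarded: the missing value stays missing
gen bmi_bin2 = 1 if bmi_grp4<=2
replace bmi_bin2 = 0 if bmi_grp4>2 & !missing(bmi_grp4)

list, clean
```

In the listing, the last row has bmi_bin equal to 0 but bmi_bin2 still missing, which is usually what you want.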
Question A1.2.1c:
- Can you generate a new variable that is the same as the ‘ldlc’ variable? (call it ldl2)
- Can you recode your new variable to indicate those with LDL-C that is 4 or below, and those with LDL-C that is above 4 mmol/L? (hint: use ‘replace’ as we did above, with ‘<=’ signs)
- Optional: look at the recode syntax (type help recode) to figure out how you can define the value labels within the recode command.
A1.2.1c. Answer
If you follow the approach we used above, your code would look like the example below. I have also given the new variable value labels and tabulated it to check it:
gen ldl2=ldlc
replace ldl2=0 if ldl2<=4
* !missing() keeps missing LDL-C values missing rather than coding them as 1
replace ldl2=1 if ldl2>4 & !missing(ldl2)
tab ldl2
label define ldl 0 "4 or below" 1 "Above 4"
label values ldl2 ldl
tab ldl2, m
Note, as with many commands in Stata, there are alternative correct ways to generate our new variable. One way would be to use the ‘recode’ command.
If you use the ‘recode’ command as above, the recoded information is stored in the variable being recoded, i.e. the original information in that variable is overwritten. This is why we created a copy of the variable first. We can skip that step, and make our code more efficient, by using the gen() option of recode, which stores the result in a new variable (here called ldl2_a) and leaves the original unchanged. Because recode applies the first rule that matches, a value of exactly 4 is coded 0 and anything above 4 is coded 1:
recode ldlc (min/4=0) (4/max=1), gen(ldl2_a)
The code can be made even more efficient by defining the value labels within the recode command:
recode ldlc (min/4=0 "4 or below") (4/max=1 "above 4"), gen(ldl2_b)
Dropping variables
You can drop variables from the dataset if you no longer want to use them. But once you do this, you cannot undo it, so be careful when using this command. A variable can be dropped from the dataset by typing the following command:
drop var1
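A few related patterns are sketched below. They assume that var2, var3 and var4 still exist from the generate examples earlier in this practical, and “Whitehall_working.dta” is a hypothetical file name.

```stata
* Drop several variables at once
drop var2 var3 var4

* Or keep only the listed variables and drop all the others
* keep currsmoker frailty age_grp bmi bmi_grp4

* Changes only affect the data in memory until you save, so save under a new
* name ("Whitehall_working.dta" is a hypothetical file name)
save "Whitehall_working.dta", replace
```

As long as you never overwrite the original .dta file, you can always recover a dropped variable by reloading the original and re-running your do file.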
Editing data – making changes in data editor
Stata has an editing browser, where you can see the data in your dataset and make changes to it. To access the edit window, you can either type edit or open it from the drop-down menu (Data>Data Editor>Data Editor (Edit)). You can then click on the appropriate cell in the edit window and change the values in the dataset. HOWEVER, when data cleaning it is strongly recommended that you save the relevant commands in a do file and then run that for each session. This ensures that your original dataset is kept intact in case you make a mistake while editing, or need to recall something you edited a long time ago, and it gives you a permanent record of the data cleaning process. “Data cleaning” is the process whereby you get all the variables you received in your raw dataset ready to be used in your analysis.
It is a very nice and clear presentation. Thank you!
Any free version of the software for practice??
Dear Temesgen. Unfortunately Stata is not free software. Perhaps you can try the R track, which uses the free R and RStudio software.
Could someone assist me? My code is not giving me the expected output.
This is how it looks:
label define smoke_lab 0 "No" 1 "Yes"
. label values currsmoker smoke_lab
may not label strings
This is the output of codebook of bmi_grp4
codebook bmi_grp4
------------------------------------------------
bmi_grp4 (unlabeled)
------------------------------------------------
Type: String (str2)
Unique values: 5
> Missing “”: 0/4,327
Tabulation: Freq. Value
50 “1”
1,793 “2”
2,091 “3”
376 “4”
17 “NA”
So it does not show labels or missing values. There is only “NA”.
Hello, I was trying some code but it did not give any output. Could someone help me?
tabulate frailty if currsmoker == 1 & age_grp == 1
I even tried this:
tabulate frailty if currsmoker==1 & age_grp==1
Good
Is the Stata software available here, or do I need to buy my own?
Excellent presentations, very clear and very useful.
Hi, I am using Stata/SE 11.2, and whenever I try to open a dataset I keep on getting “file Whitehall_fossa.dta not Stata format”. I am getting the same response for the other dataset. Please help.
I expect this is because the datasets were made in Stata version 17 so they may not be compatible with v.11. I would try using the import function in Stata to import a csv file and then re-save it as a .dta.