Learning Outcomes
By the end of this section, students will be able to:
- Open datasets in their chosen statistical software programme
- Explore datasets and understand what data they have
- Use basic commands to edit their data
Download R and R Studio Instructions
Video A1.2.2a – Getting Started with R and RStudio (2 minutes)
Video A1.2.2b – Data and Variables in R (5 minutes)
A1.2.2a PRACTICAL: R
Find your data file
Download your dataset and save it to your laptop. Note the location of the dataset and the file pathway.
Open R
RStudio can be opened from your desktop or taskbar. When RStudio is opened it should show three windows: the ‘console’, the ‘environment’ tab, and the ‘files’ tab. Create a new ‘project’ by clicking “File” and “New Project”. Save it in an accessible place in your file directory with an appropriate name that references your project and its contents.
Create a new R script
Create a new ‘R Script’ by clicking “File”, “New File” and “R Script”. R Script files are essential because they keep a permanent record of all the commands that were used in your analysis. This allows you to quickly reproduce all of your work by simply re-running your “R Script” the next time you open RStudio. Save this file along an easily accessible pathway. When naming files for use with R, it is best to keep the names simple, as this will simplify later coding. Ensure also that your files are saved in suitable locations, because moving files at a later date will affect how your code runs. You should now have four screens open.
It is good practice to annotate your script with descriptive labels to orientate yourself, using ‘#’ as a prefix. See below:
# Practical 1:
# Author:
# Date:
If there are unsaved changes in your ‘R Script’, the tab name will be displayed in red with an * adjacent to it. Click save regularly by clicking on the floppy disc icon.
Changing the working directory
It is often easiest to begin in R by defining a working directory. The working directory is the folder where RStudio will look for input (i.e. your dataset) and where it will save your output. In addition, predefining a working directory means that you will not have to type out the full file pathway every time you want to load a file.
Typing the following command will display the name of the current working directory:
getwd()
If you would like to change the current working directory, then you can type the following command (where “filepath” is the name of the full pathway to the folder where the “Whitehall_fossa.csv” dataset is saved):
For Windows: setwd("C:/filepath")
For Mac: setwd("~/filepath")
e.g. setwd("~/Desktop/modules")
Alternatively, you can use the dropdown menu to change the working directory to the folder where the “Whitehall_fossa.csv” dataset is saved (“Session” > “Set Working Directory” > “To Source File Location”).
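As a minimal sketch of the commands above, the following records the current working directory, switches to another folder, and then switches back. The temporary folder returned by tempdir() and the file name "demo_file.txt" are only stand-ins for your own data folder and files.

```r
# Remember where R is currently looking for files
old.dir <- getwd()

# tempdir() stands in for your own data folder (an assumption for this demo)
demo.dir <- tempdir()
setwd(demo.dir)

# Files can now be referred to by name alone, without the full pathway
writeLines("example", "demo_file.txt")
file.exists("demo_file.txt")

# Return to the original working directory
setwd(old.dir)
```

Saving the old directory first makes it easy to undo the change if you set the wrong folder.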
Open your data set
Importing your dataset can be performed using a variety of different commands. When importing your data, you will perform a command asking RStudio to read your ‘file’. You will then ask RStudio to import your data into a structure called an ‘object’, which will contain your imported dataset.
You will separately name the ‘object’, which exists in the R workspace. This name should be succinct, as you will be required to type it out repeatedly when conducting your analysis. When you have correctly typed in your code, click “Run” in the script box. You should see your code executed below in the console in blue text.
To open CSV files:
object.name <- read.csv("filename.csv", header=TRUE, na.strings=c(""))
whitehall.data <- read.csv("Whitehall_fossa.csv", header=TRUE, na.strings=c(""))
The additional arguments within “read.csv” provide R with information on how to read the dataset. Here, “header=TRUE” tells R that the first row contains the names of the variables, and “na.strings=c("")” dictates that any empty field (the value "") should be treated as NA by R. Note that code must be typed with straight quotation marks; curly quotation marks copied from a web page or word processor will cause an error.
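To see what these two arguments do, here is a small sketch that writes a made-up three-row CSV file to a temporary location and reads it back. The file contents and the object name demo.data are invented for illustration and are not part of the Whitehall dataset.

```r
# Write a small, made-up CSV file to a temporary location
csv.path <- tempfile(fileext = ".csv")
writeLines(c("id,sex,bmi",
             "1,male,24.5",
             "2,,31.2",
             "3,female,"),
           csv.path)

# header = TRUE: the first row holds the variable names
# na.strings = c(""): empty fields are read in as NA
demo.data <- read.csv(csv.path, header = TRUE, na.strings = c(""))

demo.data
is.na(demo.data$sex)   # TRUE only for the second row, where sex was left empty
```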
For general tables:
object.name <- read.table("filename", header=TRUE, na.strings=c(""))
For .dta (Stata) files:
install.packages("haven")
library(haven)
object.name <- read_dta("filename.dta")
You can also click “File” > “Import Dataset”. If you have imported your dataset correctly, you should be able to see it in the “Environment” screen, with a description of the number of observations and number of variables. You may click the table icon to see the data displayed.
To explore your dataset, you can use several commands:
To show the total number of rows:
nrow(object.name)
nrow(whitehall.data)
To show the total number of columns:
ncol(object.name)
ncol(whitehall.data)
Or, more simply, to show both rows and columns simultaneously:
dim(object.name)
dim(whitehall.data)
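The same three commands can be tried on a small made-up data frame, where the answers are easy to check by eye. The data frame demo.data below is invented for illustration.

```r
# A made-up data frame with 4 rows (observations) and 3 columns (variables)
demo.data <- data.frame(id = 1:4,
                        age = c(52, 61, 47, 58),
                        currsmoker = c(0, 1, 0, 0))

nrow(demo.data)  # number of rows: 4
ncol(demo.data)  # number of columns: 3
dim(demo.data)   # rows first, then columns: 4 3
```

Note that the object name is typed without quotation marks, since it refers to an object in the workspace rather than to a piece of text.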
Question A1.2.2a: Browse the full dataset.
- How many rows (individuals or observations) are there?
- How many columns (variables)?
- Look at the variable ‘currsmoker’. Is it clear which values identify current smokers and non-smokers?
Answer
- 4327 rows
- 17 columns
- It is not immediately clear what 0 and 1 refer to in currsmoker as they do not have labels.
Video A1.2.2c – Functions in R
We are experiencing a technical issue with the video for this section. Please watch this external video in the meantime:
A1.2.2b PRACTICAL: R
General syntax
R is an object-orientated language where ‘objects’ are anything (constants, data, structures, functions and graphs) that can be assigned to a “variable”. These objects may be either:
- Data objects: store values, logical values, or characters. These exist as outlined in the image below. Vectors are the base unit of data objects.
- Language objects: functions (objects, often user-created, that are generated to execute specific operations) and expressions.
Troubleshooting – how to find help
The ‘?’ command in RStudio can be very useful to understand more about the commands you use. Simply type ‘?command_name’ and you will see a description of the command. For example, if you want more information on the ‘ncol’ command, you can type:
?ncol
If you type ‘??command_name’, R will search the help pages of all your installed packages and list the matching topics, showing which package each command comes from.
Exploring your data
Several commands can be used to explore your data. The str() function will list the names of your variables, and how R is currently defining those variables (e.g. integer, numeric, factor or character).
str(object.name)
str(whitehall.data)
You can use the functions head() and tail() in the same way to display the first or last n rows of your dataset.
head(object.name,n=)
head(whitehall.data,n=5)
You may also use summary() in order to generate basic summary statistics about your whole data frame. To isolate a single variable, use the $ operator after your object.name and then type in the variable name. The same logic can also be applied to examining the structure of a variable.
summary(object.name$variable.name)
summary(whitehall.data$bmi_grp4)
You can also get a feel for variables within your dataset by using the ‘table’ command.
table(whitehall.data$currsmoker)
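The exploration commands above can be sketched on a small made-up data frame. The values below are invented, and the column names are borrowed from the practical only for familiarity. The useNA argument of table() is an optional extra that is not used elsewhere in this practical.

```r
# A made-up data frame containing some missing values (NA)
demo.data <- data.frame(
  id = 1:6,
  bmi_grp4 = c(1, 2, 2, NA, 4, 3),
  currsmoker = c(0, 1, 0, 0, NA, 1)
)

str(demo.data)             # variable names and how R stores each of them
head(demo.data, n = 3)     # first three rows
tail(demo.data, n = 2)     # last two rows

summary(demo.data$bmi_grp4)   # summary statistics, including a count of NA's

table(demo.data$currsmoker)                   # missing values are left out by default
table(demo.data$currsmoker, useNA = "ifany")  # adds a column for NA if any exist

sum(is.na(demo.data$bmi_grp4))   # another way to count missing values
```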
Question A1.2.2b: Look at the variable ‘bmi_grp4’.
- What information is collected in this variable?
- How are the responses coded?
- Is there any missing data?
Answer
- Integer information about which BMI group classification participants fit into. We can tell this using:
str(whitehall.data$bmi_grp4)
- We can find information about how this variable is coded using:
table(whitehall.data$bmi_grp4)
This shows us that the integer values range between 1 and 4.
- We can see that there are 17 missing values, using the following function:
summary(whitehall.data$bmi_grp4)
A1.2.2c PRACTICAL: R
Labelling variables
If you look at the ‘Environment’ window, you can see that ‘currsmoker’ is a binary variable that takes the value 0 or 1. On their own these values are not very meaningful; in fact, 0 = no and 1 = yes. Therefore, we need to relabel the numeric values within the variable as ‘No’ or ‘Yes’.
You can label your data, and tell R to treat it as a ‘factor’, using the following command:
whitehall.data$currsmoker <- factor(whitehall.data$currsmoker, labels=c("No", "Yes"))
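The sketch below shows the same labelling step on a made-up vector. It also states the levels explicitly, which is an optional addition to the command above that ties each label to a specific code. The vector smoker and its values are invented for illustration.

```r
# Made-up smoking codes: 0 = no, 1 = yes, NA = missing
smoker <- c(0, 1, 1, 0, NA, 1)

# levels lists the codes present in the data; labels gives their names in the same order
smoker.f <- factor(smoker, levels = c(0, 1), labels = c("No", "Yes"))

levels(smoker.f)   # "No" "Yes"
table(smoker.f)    # counts per label; the missing value is not counted
```

If labels is given without levels, R matches the labels to the sorted distinct values found in the variable, so the number of labels has to equal the number of distinct values. Should R report that the length of ‘labels’ is invalid, a reasonable first check is table() on the original variable to see which values it actually contains.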
Creating new variables – generate
To generate a new variable in R, the general syntax is:
variable.name <- operation
See the following, where we create new variables ‘a’ and ‘b’ and use them to perform a simple sum:
a <- 30
b <- 50
c <- a + b
What result did you find?
If you run ‘c’ you should find that it is equal to 80. You can also perform other mathematical operations, using operators such as + (add), - (subtract), * (multiply), / (divide) and ^ (to the power of), and functions such as sqrt() (square root), log() (natural log) and exp() (exponential).
You can also use these operations to calculate new variables. For example, you could calculate the total number of prior diseases participants have been diagnosed with, by creating a new variable ‘prior_disease’, which is the sum of ‘prior_cvd’, ‘prior_t2dm’ and ‘prior_cancer’, and then tabulating the result. Because these variables are stored inside the data frame, each one is referred to using the $ operator.
whitehall.data$prior_disease <- whitehall.data$prior_cvd + whitehall.data$prior_t2dm + whitehall.data$prior_cancer
table(whitehall.data$prior_disease)
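As a small illustration of summing variables row by row, the sketch below uses a made-up data frame with three binary (0/1) disease indicators. The values are invented, and the column names are borrowed from the practical only for familiarity.

```r
# Made-up binary indicators: 1 = previously diagnosed, 0 = not, NA = missing
demo.data <- data.frame(
  prior_cvd    = c(0, 1, 0, 1),
  prior_t2dm   = c(0, 0, 1, 1),
  prior_cancer = c(0, 0, NA, 1)
)

# Add the three columns together, one row at a time
demo.data$prior_disease <- demo.data$prior_cvd +
  demo.data$prior_t2dm +
  demo.data$prior_cancer

demo.data$prior_disease                           # 0 1 NA 3
table(demo.data$prior_disease, useNA = "ifany")   # tabulate, showing any NA
```

Note that a missing value in any one of the variables makes the sum for that row missing as well.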
Replace and recode
You can edit variables using ‘replace’ and ‘recode’ commands. PLEASE BE AWARE THERE ARE MULTIPLE (CORRECT) WAYS TO RECODE AND GENERATE NEW VARIABLES. You will see some examples in this practical, and more examples in future sessions. Choosing the method to replace or recode variables is generally a matter of personal preference, and it will rarely matter which method you use when recoding variables as several different commands will produce the same result.
Try changing the ‘bmi’ variable so that you create a binary variable, which indicates those who have obesity and those who do not. But never recode the original variable (in case you change your mind)! Duplicate a variable first, and then recode it.
To create a duplicate variable within the ‘whitehall.data’ data frame, named ‘bmi.2’, which is assigned the value of ‘bmi’:
whitehall.data$bmi.2 <- whitehall.data$bmi
To create a new binary variable, first define a new variable and, for the moment, assign the values as ‘NA’.
whitehall.data$bmi.2 <- NA
Now, we must assign values of either 0 or 1 to the empty rows in our new column, depending on whether bmi is < 30 or >= 30 in the original variable, bmi.
whitehall.data$bmi.2[whitehall.data$bmi<30] <- 0
whitehall.data$bmi.2[whitehall.data$bmi>=30] <- 1
Examine ‘bmi.2’ using table.
You can label your data, and tell R to treat it as a ‘factor’, using the following command:
whitehall.data$bmi.2 <- factor(whitehall.data$bmi.2, labels=c("Not Obese", "Obese"))
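Since there are several correct ways to recode, here is one alternative sketched on a made-up vector. It uses ifelse(), which is not the method shown above, and the vector bmi and its values are invented for illustration.

```r
# Made-up BMI values, including one missing value
bmi <- c(22.4, 30.0, 35.1, NA, 29.9)

# ifelse(test, value if TRUE, value if FALSE); a missing BMI stays missing
obese <- ifelse(bmi >= 30, 1, 0)
obese   # 0 1 1 NA 0

# Label the codes, as in the practical
obese.f <- factor(obese, levels = c(0, 1), labels = c("Not Obese", "Obese"))
table(obese.f, useNA = "ifany")
```

This gives the same result as creating an empty variable and filling it in two steps, but in a single line.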
Question A1.2.2c:
- Can you generate a new variable that is the same as the ‘ldlc’ variable? (call it ldl2)
- Can you recode your new variable to indicate those with LDL-C that is 4 mmol/L or below, and those with LDL-C that is above 4 mmol/L? (hint: use ‘replace’ as we did above, with ‘<=’ signs)
- Optional: look at the recode syntax (type ??recode) to figure out how you can define the value labels within the recode command.
Answer
To complete this exercise in R, we simply apply the same logic that we did when categorising BMI into the binary variables of <30 and >=30. Note that, extending the rationale employed here, we could have transformed this variable into integer values, or as many different categories as we desired for our analysis.
whitehall.data$ldl2 <- NA
whitehall.data$ldl2[whitehall.data$ldlc<=4] <- 0
whitehall.data$ldl2[whitehall.data$ldlc>4] <- 1
whitehall.data$ldl2 <- factor(whitehall.data$ldl2, labels = c("Under 4", "Over 4"))
table(whitehall.data$ldl2)
We could have alternatively combined the steps of defining the variable, assigning the values, and assigning the labels, into one piece of code, as follows, by using the ‘cut’ function and defining ‘breaks’ at a minimum value (0), our desired split (4) and a maximum value (100).
whitehall.data$ldl2 <- cut(whitehall.data$ldlc, breaks=c(0,4,100), labels=c("Under 4", "Over 4"))
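To check how ‘cut’ treats a value that sits exactly on a break, the sketch below applies the same breaks and labels to a made-up vector. The vector ldlc and its values are invented for illustration.

```r
# Made-up LDL-C values (mmol/L), including one exactly at 4 and one missing
ldlc <- c(2.8, 4.0, 4.1, 6.3, NA)

# By default each interval includes its upper break, so the groups are (0,4] and (4,100]
ldl2 <- cut(ldlc, breaks = c(0, 4, 100), labels = c("Under 4", "Over 4"))

ldl2                            # a value of exactly 4 falls in "Under 4"
table(ldl2, useNA = "ifany")    # counts per group, plus the missing value
```

Any value outside the range of the breaks (here, 0 or below, or above 100) would be returned as NA, so the minimum and maximum breaks need to cover the full range of the data.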
Video A1.2.2d – R Markdown (12 minutes)
When I attempt to use the following
whitehall.data$currsmoker <- factor(whitehall.data$currsmoker, labels=c("No", "Yes"))
which I have written as
whitehalldata$currsmoker <- factor(whitehalldata$currsmoker, labels=c("No", "Yes"))
I get a warning saying:
in factor(whitehalldata$currsmoker, labels = c(“No”, “Yes”)) :
invalid ‘labels’; length 2 should be 1 or 3
I am a beginner in R programming, and I don’t understand the file that appears in the global environment. Where does it come from?
summary(whitehall.data$bmi_grp4)
Length Class Mode
4327 character character
The answer says that this was supposed to be integers. What did I do wrong?
I have the exact same problem. It essentially means that R read the data as character data.
What does the DRAT/log transformation actually do to make the distribution normal? What is the basis for it?
Great presentation. However, it is not clear how you added the graph of horsepower vs cylinder number in the R Markdown. Could you kindly elaborate? Thank you.