FoSSA: Fundamentals of Statistical Software & Analysis

Learning Outcomes

By the end of this section, students will be able to:

  • Open datasets in their chosen statistical software programme
  • Explore datasets and understand what data they have
  • Use basic commands to edit their data

Download R and R Studio Instructions

You can download a copy of the slides here: A1.2.2 Introduction to R

Video A1.2.2a – Welcome to R (8 minutes)

Video A1.2.2b – R Variables (9 minutes)

Video A1.2.2c – R Scripts (8 minutes)

Video A1.2.2d – Opening Data (10 minutes)

A1.2.2a PRACTICAL: R

Find your data file

Download your dataset and save it to your laptop. Note the location of the dataset and the file pathway.

Open R

RStudio can be opened from your desktop or taskbar. When RStudio opens it should show three windows: the ‘console’, the ‘environment’ tab, and the ‘files’ tab. Create a new ‘project’ by clicking “File” and “New Project”. Save it in an accessible place in your file directory with an appropriate name that references your project and its contents.

Create a new R script

Create a new ‘R Script’ by clicking “File”, “New File” and “R Script”. R Script files are essential because they keep a permanent record of all the commands used in your analysis. This allows you to quickly reproduce all of your work by simply re-running your “R Script” the next time you open RStudio. Save this file along an easily accessible pathway. When naming files for use with R, it is best to keep the names simple, as this will simplify your later coding. Ensure also that your files are saved in suitable locations, because changing the save destinations of files at later dates will affect how your code runs. You should now have four screens open.

It is good practice to annotate your script with descriptive labels to orientate yourself, using ‘#’ as a prefix. See below:

# Practical 1:

# Author:

# Date:

If there are unsaved changes in your ‘R Script’, the tab name will be displayed in red with an * adjacent to it. Save regularly by clicking on the floppy disc icon.

Changing the working directory

It is often easiest to begin in R by defining a working directory. The working directory is the folder where RStudio will look for input (i.e. your dataset) and where it will save your output. In addition, predefining a working directory means that you will not have to type out the full file pathway every time you want to load a file.

Typing the following command will display the name of the current working directory:

getwd()

If you would like to change the current working directory, then you can type the following command (where “filepath” is the full pathway to the folder where the “Whitehall_fossa.csv” dataset is saved):

For Windows: setwd("C:/filepath")

For Mac: setwd("~/filepath")

e.g. setwd("~/Desktop/modules")

Alternatively, you can use the dropdown menu to change the working directory to the folder where the “Whitehall_fossa.csv” dataset is saved (“Session” > “Set Working Directory” > “To Source File Location”).
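As a minimal sketch of the commands above, the following switches the working directory to another folder and back again. The temporary folder returned by tempdir() is only a stand-in for your own data folder, and the object names old.wd and new.wd are chosen for illustration.

```r
# Remember where we started so that we can return afterwards
old.wd <- getwd()

# tempdir() stands in for the folder that holds your dataset
setwd(tempdir())
new.wd <- getwd()
print(new.wd)

# list.files() shows the files R can see in the current working directory
print(list.files())

# Return to the original working directory
setwd(old.wd)
```

Checking list.files() after setwd() is a quick way to confirm that your dataset is where R expects it to be before you try to load it.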

Open your data set

Importing your dataset can be performed using a variety of different commands. When importing your data, you will perform a command asking RStudio to read your ‘file’. You will then ask RStudio to import your data into a structure called an ‘object’, which will contain your imported dataset.

You will separately name the ‘object’, which exists in the R workspace. This name should be succinct, as you will be required to type it out repeatedly when conducting your analysis. When you have correctly typed in your instructional code, click “Run” in the script window. You should see your command executed below in the console in blue text.

To open CSV files:

object.name <- read.csv("filename.csv", header=TRUE, na.strings=c(""))

whitehall.data <- read.csv("Whitehall_fossa.csv", header=TRUE, na.strings=c(""))

The additional arguments within “read.csv” tell R how to read the dataset. Here, “header=TRUE” tells R that the first row contains the names of the variables, and “na.strings=c("")” tells R that empty fields should be treated as NA.
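To see what these arguments do, here is a small sketch that writes a made-up CSV file to a temporary location and reads it back. The data, the file and the object name toy.data are invented for illustration and are not part of the Whitehall dataset.

```r
# Write a tiny made-up CSV file to a temporary location
csv.path <- tempfile(fileext = ".csv")
writeLines(c("id,sex,bmi",
             "1,female,24.5",
             "2,,31.2",
             "3,male,"),
           csv.path)

# header = TRUE: the first row holds the variable names
# na.strings = c(""): empty fields are read in as NA
toy.data <- read.csv(csv.path, header = TRUE, na.strings = c(""))

print(toy.data)

# Count the missing values in each column
print(colSums(is.na(toy.data)))
```

The empty ‘sex’ field in row 2 and the empty ‘bmi’ field in row 3 are both read as NA, so they will be counted as missing in later summaries.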

For general tables:

object.name <- read.table("filename", header=TRUE, na.strings=c(""))

For .dta (Stata) files:

install.packages("haven")

library(haven)

object.name <- read_dta("filename.dta")

You can also click “File” > “Import Dataset”. If you have imported your dataset correctly, you should be able to see it in the “Environment” screen, with a description of the number of observations and number of variables. You may click the table icon to see the data displayed.

To explore your dataset, you can use several commands:

To show the total number of rows:

nrow(object.name)

nrow(whitehall.data)

To show the total number of columns:

ncol(object.name)

ncol(whitehall.data)

Or, more simply, to show both rows and columns simultaneously:

dim(object.name)

dim(whitehall.data)
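The three commands above can be tried on a small made-up data frame before running them on the real dataset. The object toy.data and its values are assumptions for illustration only.

```r
# A made-up data frame with 4 rows (observations) and 3 columns (variables)
toy.data <- data.frame(id = 1:4,
                       age = c(52, 61, 47, 58),
                       currsmoker = c(0, 1, 0, 0))

n.rows <- nrow(toy.data)
n.cols <- ncol(toy.data)
print(n.rows)
print(n.cols)

# dim() reports the number of rows first, then the number of columns
print(dim(toy.data))

# names() lists the column (variable) names
print(names(toy.data))
```

Note that the object name is typed without quotation marks: nrow(toy.data) asks about the data frame, whereas nrow("toy.data") would ask about a piece of text.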

Question A1.2.2a: Browse the full dataset.

    1. How many rows (individuals or observations) are there?
    2. How many columns (variables)?
    3. Look at the variable ‘currsmoker’. Is it clear which values identify current smokers and non-smokers?
Answer
  1. 4327 rows
  2. 17 columns
  3. It is not immediately clear what 0 and 1 refer to in currsmoker as they do not have labels.

Video A1.2.2e – Basic Programming in R  (19 minutes)

Video A1.2.2f – Data Structure  (15 minutes)

A1.2.2b PRACTICAL: R

General syntax 

R is an object-orientated language where ‘objects’ are anything (constants, data, structures, functions and graphs) that can be assigned to a “variable”.  These objects may be either:

  1. Data objects: store values, logical values, or characters. These exist as outlined in the image below. Vectors are the base unit of data objects.
  2. Language objects: functions (user-created objects that are generated to execute specific operations) and expressions.

A1.2.2b Figure 1.png

Troubleshooting – how to find help

The ‘?’ command in RStudio can be very useful to understand more about the commands you use. Simply type ‘?command_name’ and you will see a description of the command. For example, if you want more information on the ‘ncol’ command, you can type:

?ncol

If you type ‘??command_name’, R will search the help pages of all installed packages for that term and list the matching topics, together with the package each one comes from.

Exploring your data

Several commands can be used to explore your data. The str() function will list the names of your variables, and how R is currently defining those variables (e.g. integer, numeric, factor or character).

str(object.name)

str(whitehall.data)

You can use the functions head() and tail() in the same way to display the first or last n rows of your dataset.

head(object.name,n=)

head(whitehall.data,n=5)

You may also use summary() in order to generate basic summary statistics about your whole data frame. To isolate a single variable, use the $ operator after your object.name and then type in the variable name. The same logic can also be applied to examining the structure of a variable.

summary(object.name$variable.name)

summary(whitehall.data$bmi_grp4)

You can also get a feel for variables within your dataset by using the  ‘table’ command.

table(whitehall.data$currsmoker)
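The sketch below runs the exploration commands from this section on a small made-up data frame that contains some missing values. The object toy.data and its values are invented for illustration. It also uses the useNA argument of table(), which is not covered above, to include missing values in the count.

```r
# Made-up data with one missing value in each of two variables
toy.data <- data.frame(id = 1:6,
                       bmi = c(22.1, 27.4, NA, 31.0, 24.8, 29.3),
                       currsmoker = c(0, 1, 0, NA, 1, 0))

# Variable names and how R is storing each one
str(toy.data)

# First three rows only
print(head(toy.data, n = 3))

# Summary statistics for one variable; missing values are reported as NA's
print(summary(toy.data$bmi))

# By default table() leaves out NA; useNA = "ifany" adds them to the count
smoker.counts <- table(toy.data$currsmoker, useNA = "ifany")
print(smoker.counts)

# Another way to count missing values in a variable
n.missing <- sum(is.na(toy.data$currsmoker))
print(n.missing)
```

Because table() drops missing values unless asked otherwise, it is worth checking summary() or is.na() as well when you want to know whether a variable is complete.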

  • Question A1.2.2b: Look at the variable ‘bmi_grp4’.
    1. What information is collected in this variable?
    2. How are the responses coded?
    3. Is there any missing data?
Answer
  1. Integer information about which BMI group classification participants fit into. We can tell this using:

    str(whitehall.data$bmi_grp4)

  2. We can find information about how this variable is coded using:

    table(whitehall.data$bmi_grp4)

    This shows us that the integer values range between 1 and 4.

  3. We can see that there are 17 missing values, using the following function:

    summary(whitehall.data$bmi_grp4)

A1.2.2c PRACTICAL: R

Labelling variables

If you look at the ‘Environment’ window, you can see that ‘currsmoker’ is a binary variable that takes the value 0 or 1. On their own these numbers are not very meaningful; in fact, 0 = no and 1 = yes. Therefore, we need to relabel the numeric values within the variable as ‘No’ or ‘Yes’.

You can label your data, and tell R to treat it as a ‘factor’, using the following command:

whitehall.data$currsmoker <- factor(whitehall.data$currsmoker, labels=c("No", "Yes"))
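To illustrate how the labels are matched to the codes, here is a short sketch on a made-up vector. The explicit levels argument is an addition to the command above: it states which code each label belongs to, rather than relying on the sorted order of the values found in the data.

```r
# Made-up smoking codes: 0 = no, 1 = yes, NA = missing
currsmoker <- c(0, 1, 1, 0, NA, 0)

# levels lists the codes; labels gives their names in the same order
currsmoker.f <- factor(currsmoker, levels = c(0, 1), labels = c("No", "Yes"))

print(levels(currsmoker.f))
print(table(currsmoker.f))

# Missing values stay missing after labelling
print(is.na(currsmoker.f))
```

Labels are assigned in the order of the levels, so swapping them by accident would silently reverse the meaning of the variable. Tabulating before and after labelling is a simple check.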

Creating new variables – generate

To generate a new variable in R, the general syntax is:

variable.name <- operation

See the following, where we use the creation of new variables ‘a’ and ‘b’ to perform a simple sum:

a <- 30

b <- 50

c <- a + b

What result did you find?

If you run ‘c’ you should find that it is equal to 80. You can also perform other mathematical operations, such as + (add), - (subtract), * (multiply), / (divide), ^ (to the power of), sqrt() (square root), log() (natural log) and exp() (exponential).
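A short sketch of these operations, using the same ‘a’ and ‘b’ as above. Note that in R the natural log is written log(); there is no ln() function in base R.

```r
a <- 30
b <- 50
c <- a + b

print(c)        # addition
print(b - a)    # subtraction
print(a * b)    # multiplication
print(b / a)    # division
print(2^3)      # 2 to the power of 3

root <- sqrt(16)          # square root
natural.log <- log(exp(1))  # log() is the natural log, so this undoes exp()
print(root)
print(natural.log)
```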

You can also use these calculations to create new variables. For example, you could calculate the total number of prior diseases participants have been diagnosed with, by creating a new variable ‘prior_disease’, which is the sum of ‘prior_cvd’, ‘prior_t2dm’ and ‘prior_cancer’, and then tabulating the result. Because these variables are stored inside the ‘whitehall.data’ data frame, each one is referred to using the $ operator.

whitehall.data$prior_disease <- whitehall.data$prior_cvd + whitehall.data$prior_t2dm + whitehall.data$prior_cancer

table(whitehall.data$prior_disease)
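The same calculation can be tried on a made-up data frame, as sketched below. Columns that live inside a data frame are referred to with the $ operator. The values are invented, and rowSums() is shown only as one alternative way to add columns together.

```r
# Made-up disease indicators for four participants (0 = no, 1 = yes)
toy.data <- data.frame(prior_cvd    = c(0, 1, 0, 1),
                       prior_t2dm   = c(0, 0, 1, 1),
                       prior_cancer = c(0, 0, 0, 1))

# Add the three indicators together, row by row
toy.data$prior_disease <- toy.data$prior_cvd +
  toy.data$prior_t2dm +
  toy.data$prior_cancer

print(table(toy.data$prior_disease))

# Alternative: rowSums() adds across the named columns
alt.total <- rowSums(toy.data[, c("prior_cvd", "prior_t2dm", "prior_cancer")])
print(alt.total)
```

With either approach, a participant with a missing value in any one of the three variables will have NA for the total, unless you ask rowSums() to ignore missing values with na.rm = TRUE.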

Replace and recode 

You can edit variables using ‘replace’ and ‘recode’ commands. PLEASE BE AWARE THERE ARE MULTIPLE (CORRECT) WAYS TO RECODE AND GENERATE NEW VARIABLES. You will see some examples in this practical, and more examples in future sessions. Choosing the method to replace or recode variables is generally a matter of personal preference, and it will rarely matter which method you use when recoding variables as several different commands will produce the same result.

Try changing the ‘bmi’ variable so that you create a binary variable which indicates those who have obesity and those who do not. But never recode the original variable (in case you change your mind)! Duplicate a variable first, and then recode it.

To create a duplicate variable within the ‘whitehall.data’ data frame, named ‘bmi.2’, which is assigned the value of ‘bmi’:

whitehall.data$bmi.2 <- whitehall.data$bmi

To create a new binary variable, first define a new variable and, for the moment, assign the values as ‘NA’.

whitehall.data$bmi.2 <- NA

Now, we must assign values of either 0 or 1 to the empty rows in our new column, dependent upon whether bmi is < 30 or >= 30 in the original variable, bmi.

whitehall.data$bmi.2[whitehall.data$bmi<30] <- 0

whitehall.data$bmi.2[whitehall.data$bmi>=30] <- 1

Examine ‘bmi.2’ using table.

You can label your data, and tell R to treat it as a ‘factor’, using the following command:

whitehall.data$bmi.2 <- factor(whitehall.data$bmi.2, labels=c("Not Obese", "Obese"))
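As one example of the several correct ways to recode, the sketch below produces the same kind of binary obesity variable with ifelse() instead of the bracket assignments used above. The BMI values are made up, and ifelse() is an alternative technique, not the method used in this practical.

```r
# Made-up BMI values, including one missing value
bmi <- c(22.5, 30.0, 35.2, NA, 29.9)

# ifelse(test, value if TRUE, value if FALSE); a missing BMI stays NA
bmi.2 <- ifelse(bmi >= 30, 1, 0)
print(bmi.2)

# Label the codes as before
bmi.2 <- factor(bmi.2, levels = c(0, 1), labels = c("Not Obese", "Obese"))
print(table(bmi.2, useNA = "ifany"))
```

A BMI of exactly 30 falls in the ‘Obese’ group because the test uses >=, matching the cut-off in the bracket assignments.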
  • Question A1.2.2c:
    • Can you generate a new variable that is the same as the ‘ldlc’ variable? (call it ldl2)
    • Can you recode your new variable to indicate those with LDL-C that is 4 or below, and those with LDL-C that is above 4 mmol/L? (hint: replace the values as we did above, with ‘<=’ signs)
    • Optional: look at the help for recode functions (type ??recode) to figure out how you can define the value labels within a recode command.
Answer

To complete this exercise in R, we simply apply the same logic that we did when categorising BMI into the binary variables of <30 and >=30. Note that, extending the rationale employed here, we could have transformed this variable into integer values, or as many different categories as we desired for our analysis.

whitehall.data$ldl2 <- NA
whitehall.data$ldl2[whitehall.data$ldlc<=4] <- 0
whitehall.data$ldl2[whitehall.data$ldlc>4] <- 1
whitehall.data$ldl2 <- factor(whitehall.data$ldl2, labels = c("Under 4", "Over 4"))
table(whitehall.data$ldl2)

We could have alternatively combined the steps of defining the variable, assigning the values, and assigning the labels, into one piece of code, as follows, by using the ‘cut’ function and defining ‘breaks’ at a minimum value (0), our desired split (4) and a maximum value (100).

whitehall.data$ldl2 <- cut(whitehall.data$ldlc, breaks=c(0,4,100), labels=c("Under 4", "Over 4"))
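To check that the two approaches agree, the sketch below applies both to a made-up vector of LDL-C values and compares the results. The values and the object names manual and via.cut are invented for illustration.

```r
# Made-up LDL-C values (mmol/L), including the cut-off itself and a missing value
ldlc <- c(2.8, 4.0, 4.1, 6.3, NA)

# Approach 1: define, assign, then label
manual <- rep(NA, length(ldlc))
manual[ldlc <= 4] <- 0
manual[ldlc > 4] <- 1
manual <- factor(manual, levels = c(0, 1), labels = c("Under 4", "Over 4"))

# Approach 2: cut() with breaks at 0, 4 and 100
# By default each interval includes its upper break, so 4.0 falls in the first group
via.cut <- cut(ldlc, breaks = c(0, 4, 100), labels = c("Under 4", "Over 4"))

print(data.frame(ldlc, manual, via.cut))
```

The breaks must span the full range of the data: any value at or below the lowest break, or above the highest, is returned as NA by cut().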

Video A1.2.2g – R Markdown  (12 minutes)
