Topic 1, Lesson 1
In Progress

Direct lesson paste


Learning objectives

  • You can use dplyr::select() to keep or exclude columns.
  • You can use the c(), : and ! operators within select().
  • You can use the following dplyr helper verbs within select(): starts_with(), ends_with(), contains(), and everything().

The Yaounde COVID-19 dataset

In this lesson, we analyse results from a COVID-19 serological survey conducted in Yaounde, Cameroon in late 2020. The survey estimated how many people had been infected with COVID-19 in the region, by testing for IgG and IgM antibodies. The full dataset can be obtained from Zenodo, and the paper can be viewed here.

Spend some time browsing through this dataset. Each line corresponds to one patient surveyed. There are some demographic, socio-economic and COVID-related variables. The results of the IgG and IgM antibody tests are in the columns igg_result and igm_result.

yaounde <- read_csv(here::here('chapter_04_data_wrangling/data/yaounde_data.csv'))
yaounde  
## # A tibble: 5 × 37
##   id          date_surveyed   age age_category sex   highest_educati… occupation
##   <chr>       <date>        <dbl> <chr>        <chr> <chr>            <chr>     
## 1 BRIQUETERI… 2020-10-22       45 45 - 64      Fema… Secondary        Informal …
## 2 BRIQUETERI… 2020-10-24       55 45 - 64      Male  University       Salaried …
## 3 BRIQUETERI… 2020-10-24       23 15 - 29      Male  University       Student   
## 4 BRIQUETERI… 2020-10-22       20 15 - 29      Fema… Secondary        Student   
## 5 BRIQUETERI… 2020-10-22       55 45 - 64      Fema… Primary          Trader--F…
## # … with 30 more variables: weight_kg <dbl>, height_cm <dbl>, is_smoker <chr>,
## #   is_pregnant <chr>, is_medicated <chr>, …

_Left: the Yaounde survey team. Right: an antibody test being administered._

Figure 1: Left: the Yaounde survey team. Right: an antibody test being administered.

Introducing select()


_Fig: the `select()` function. (Drawing adapted from Allison Horst)_

Figure 2: Fig: the select() function. (Drawing adapted from Allison Horst)

dplyr::select() lets us pick which columns (variables) to keep or exclude.

We can select a column by name:

yaounde %>% select(age) 
## # A tibble: 5 × 1
##     age
##   <dbl>
## 1    45
## 2    55
## 3    23
## 4    20
## 5    55

Or we can select a column by position:

yaounde %>% select(3) ## `age` is the 3rd column
## # A tibble: 5 × 1
##     age
##   <dbl>
## 1    45
## 2    55
## 3    23
## 4    20
## 5    55

To select multiple variables, we separate them with commas:

yaounde %>% select(age, sex, igg_result)
## # A tibble: 5 × 3
##     age sex    igg_result
##   <dbl> <chr>  <chr>     
## 1    45 Female Negative  
## 2    55 Male   Positive  
## 3    23 Male   Negative  
## 4    20 Female Positive  
## 5    55 Female Positive
  • Select the weight and height variables in the yaounde data frame. Store the result in q1.
  • Select the 16th and 22nd columns in the yaounde data frame. Store the result in q2.

In the examples above we used the pipe symbol %>% to pass our data into select(). But we can also provide the data as the first argument of select(), without the pipe.

## select the column `age` from the `yaounde` data frame
select(yaounde, age)
## # A tibble: 5 × 1
##     age
##   <dbl>
## 1    45
## 2    55
## 3    23
## 4    20
## 5    55

We are prioritizing the pipe notation because it is easier to read, especially when creating a longer function chain:

yaounde %>% ## from the yaounde dataset
  select(age) %>% ## select the age variable
  arrange(-age) %>% ## then arrange it in ascending order
  slice(1:2) ## then take the two oldest respondents
## # A tibble: 2 × 1
##     age
##   <dbl>
## 1    79
## 2    79

We will see more such pipe chains in upcoming lessons.


For the next part of the tutorial, let’s create a smaller subset of the data, called yao.

yao <-
  yaounde %>% select(age,
                     sex,
                     highest_education,
                     occupation,
                     is_smoker,
                     is_pregnant,
                     igg_result,
                     igm_result)
yao
## # A tibble: 5 × 8
##     age sex    highest_education occupation     is_smoker is_pregnant igg_result
##   <dbl> <chr>  <chr>             <chr>          <chr>     <chr>       <chr>     
## 1    45 Female Secondary         Informal work… Non-smok… No          Negative  
## 2    55 Male   University        Salaried work… Ex-smoker <NA>        Positive  
## 3    23 Male   University        Student        Smoker    <NA>        Negative  
## 4    20 Female Secondary         Student        Non-smok… No          Positive  
## 5    55 Female Primary           Trader--Farmer Non-smok… No          Positive  
## # … with 1 more variable: igm_result <chr>

Selecting column ranges with :

The : operator selects a range of consecutive variables:

yao %>% select(age:occupation) ## select all columns from `age` to `occupation`
## # A tibble: 5 × 4
##     age sex    highest_education occupation     
##   <dbl> <chr>  <chr>             <chr>          
## 1    45 Female Secondary         Informal worker
## 2    55 Male   University        Salaried worker
## 3    23 Male   University        Student        
## 4    20 Female Secondary         Student        
## 5    55 Female Primary           Trader--Farmer

We can also specify a range with column numbers:

yao %>% select(1:4) ## select columns 1 to 4
## # A tibble: 5 × 4
##     age sex    highest_education occupation     
##   <dbl> <chr>  <chr>             <chr>          
## 1    45 Female Secondary         Informal worker
## 2    55 Male   University        Salaried worker
## 3    23 Male   University        Student        
## 4    20 Female Secondary         Student        
## 5    55 Female Primary           Trader--Farmer
  • With the yaounde data frame, select the columns between symptoms and sequelae, inclusive. (“Inclusive” means you should also include symptoms and sequelae in the selection.)
q3 <- "yaounde %>%" 
.check_q3()
.hint_q3()

Excluding columns with !

The exclamation point negates a selection:

yao %>% select(!age) ## select all columns except `age`
## # A tibble: 5 × 7
##   sex    highest_educati… occupation is_smoker is_pregnant igg_result igm_result
##   <chr>  <chr>            <chr>      <chr>     <chr>       <chr>      <chr>     
## 1 Female Secondary        Informal … Non-smok… No          Negative   Negative  
## 2 Male   University       Salaried … Ex-smoker <NA>        Positive   Negative  
## 3 Male   University       Student    Smoker    <NA>        Negative   Negative  
## 4 Female Secondary        Student    Non-smok… No          Positive   Negative  
## 5 Female Primary          Trader--F… Non-smok… No          Positive   Negative

To exclude a range of consecutive columns, we use, for example,!age:occupation:

yao %>% select(!age:occupation) ## exclude columns from `age` to `occupation`
## # A tibble: 5 × 4
##   is_smoker  is_pregnant igg_result igm_result
##   <chr>      <chr>       <chr>      <chr>     
## 1 Non-smoker No          Negative   Negative  
## 2 Ex-smoker  <NA>        Positive   Negative  
## 3 Smoker     <NA>        Negative   Negative  
## 4 Non-smoker No          Positive   Negative  
## 5 Non-smoker No          Positive   Negative

To exclude several non-consecutive columns, place them inside !c():

yao %>% select(!c(age, sex, igg_result))
## # A tibble: 5 × 5
##   highest_education occupation      is_smoker  is_pregnant igm_result
##   <chr>             <chr>           <chr>      <chr>       <chr>     
## 1 Secondary         Informal worker Non-smoker No          Negative  
## 2 University        Salaried worker Ex-smoker  <NA>        Negative  
## 3 University        Student         Smoker     <NA>        Negative  
## 4 Secondary         Student         Non-smoker No          Negative  
## 5 Primary           Trader--Farmer  Non-smoker No          Negative
  • From the yaounde data frame, remove all columns between highest_education and consultation, inclusive.
q4 <- "yaounde %>%" 
.check_q4()
.hint_q4()

Helper functions for select()

dplyr has a number of helper functions to make selecting easier. Let’s take a look at some of these.

starts_with() and ends_with()

These two helpers work exactly as their names suggest!

yao %>% select(starts_with("is")) ## columns that start with "is"
## # A tibble: 5 × 2
##   is_smoker  is_pregnant
##   <chr>      <chr>      
## 1 Non-smoker No         
## 2 Ex-smoker  <NA>       
## 3 Smoker     <NA>       
## 4 Non-smoker No         
## 5 Non-smoker No
yao %>% select(ends_with("result")) ## columns that end with "result"
## # A tibble: 5 × 2
##   igg_result igm_result
##   <chr>      <chr>     
## 1 Negative   Negative  
## 2 Positive   Negative  
## 3 Negative   Negative  
## 4 Positive   Negative  
## 5 Positive   Negative

contains()

Contains helps select columns that contain a certain string:

yao %>% select(contains("_")) ## columns that contain an underscore
## # A tibble: 5 × 5
##   highest_education is_smoker  is_pregnant igg_result igm_result
##   <chr>             <chr>      <chr>       <chr>      <chr>     
## 1 Secondary         Non-smoker No          Negative   Negative  
## 2 University        Ex-smoker  <NA>        Positive   Negative  
## 3 University        Smoker     <NA>        Negative   Negative  
## 4 Secondary         Non-smoker No          Positive   Negative  
## 5 Primary           Non-smoker No          Positive   Negative

everything()

Another helper function, everything(), matches all variables that have not yet been selected.

## first, `is_pregnant`, then every other column.
yao %>% select(is_pregnant, everything())
## # A tibble: 5 × 8
##   is_pregnant   age sex    highest_education occupation     is_smoker igg_result
##   <chr>       <dbl> <chr>  <chr>             <chr>          <chr>     <chr>     
## 1 No             45 Female Secondary         Informal work… Non-smok… Negative  
## 2 <NA>           55 Male   University        Salaried work… Ex-smoker Positive  
## 3 <NA>           23 Male   University        Student        Smoker    Negative  
## 4 No             20 Female Secondary         Student        Non-smok… Positive  
## 5 No             55 Female Primary           Trader--Farmer Non-smok… Positive  
## # … with 1 more variable: igm_result <chr>

It is often useful for rearranging columns.

Say we wanted to bring the is_pregnant column to the start of the yao data frame, we could type out all the column names manually:

yao %>% select(is_pregnant, age, sex, highest_education, occupation, is_smoker, is_pregnant, igg_result, igm_result)
## # A tibble: 5 × 8
##   is_pregnant   age sex    highest_education occupation     is_smoker igg_result
##   <chr>       <dbl> <chr>  <chr>             <chr>          <chr>     <chr>     
## 1 No             45 Female Secondary         Informal work… Non-smok… Negative  
## 2 <NA>           55 Male   University        Salaried work… Ex-smoker Positive  
## 3 <NA>           23 Male   University        Student        Smoker    Negative  
## 4 No             20 Female Secondary         Student        Non-smok… Positive  
## 5 No             55 Female Primary           Trader--Farmer Non-smok… Positive  
## # … with 1 more variable: igm_result <chr>

But this would be painful for larger data frames, such as our original yaounde data frame. In such a case, we can use everything():

## bring `is_pregnant` to the front of the data frame
yaounde %>% select(is_pregnant, everything())
## # A tibble: 5 × 37
##   is_pregnant id         date_surveyed   age age_category sex   highest_educati…
##   <chr>       <chr>      <date>        <dbl> <chr>        <chr> <chr>           
## 1 No          BRIQUETER… 2020-10-22       45 45 - 64      Fema… Secondary       
## 2 <NA>        BRIQUETER… 2020-10-24       55 45 - 64      Male  University      
## 3 <NA>        BRIQUETER… 2020-10-24       23 15 - 29      Male  University      
## 4 No          BRIQUETER… 2020-10-22       20 15 - 29      Fema… Secondary       
## 5 No          BRIQUETER… 2020-10-22       55 45 - 64      Fema… Primary         
## # … with 30 more variables: occupation <chr>, weight_kg <dbl>, height_cm <dbl>,
## #   is_smoker <chr>, is_medicated <chr>, …

This helper can be combined with many others.

## bring columns that start with "is" to the front of the data frame
yaounde %>% select(ends_with("result"), everything())
## # A tibble: 5 × 37
##   igg_result igm_result id                date_surveyed   age age_category sex  
##   <chr>      <chr>      <chr>             <date>        <dbl> <chr>        <chr>
## 1 Negative   Negative   BRIQUETERIE_000_… 2020-10-22       45 45 - 64      Fema…
## 2 Positive   Negative   BRIQUETERIE_000_… 2020-10-24       55 45 - 64      Male 
## 3 Negative   Negative   BRIQUETERIE_000_… 2020-10-24       23 15 - 29      Male 
## 4 Positive   Negative   BRIQUETERIE_002_… 2020-10-22       20 15 - 29      Fema…
## 5 Positive   Negative   BRIQUETERIE_002_… 2020-10-22       55 45 - 64      Fema…
## # … with 30 more variables: highest_education <chr>, occupation <chr>,
## #   weight_kg <dbl>, height_cm <dbl>, is_smoker <chr>, …
  • Select all columns in the yaounde data frame that start with “is_”.
q5 <- "yaounde %>%" 
.check_q5()
.hint_q5()
  • Move the columns that start with “is_” to the beginning of the yaounde data frame.
q6 <- "yaounde %>%" 
.check_q6()
.hint_q6()

.score_print()

Extra Treat: dplyr::rename()


_Fig: the `summarize()` function. (Drawing adapted from Allison Horst)_

Figure 3: Fig: the summarize() function. (Drawing adapted from Allison Horst)

Rename one by one

dplyr::rename() is your ally to change column names into names that make sense for you, and that represent your data.

## rename `age` and `sex` to `patient_age` and `patient_sex`
yaounde %>% 
  rename(patient_age = age, 
         patient_sex = sex)
## # A tibble: 5 × 37
##   id         date_surveyed patient_age age_category patient_sex highest_educati…
##   <chr>      <date>              <dbl> <chr>        <chr>       <chr>           
## 1 BRIQUETER… 2020-10-22             45 45 - 64      Female      Secondary       
## 2 BRIQUETER… 2020-10-24             55 45 - 64      Male        University      
## 3 BRIQUETER… 2020-10-24             23 15 - 29      Male        University      
## 4 BRIQUETER… 2020-10-22             20 15 - 29      Female      Secondary       
## 5 BRIQUETER… 2020-10-22             55 45 - 64      Female      Primary         
## # … with 31 more variables: occupation <chr>, weight_kg <dbl>, height_cm <dbl>,
## #   is_smoker <chr>, is_pregnant <chr>, …
  • The fact that the new name comes first in the function (rename(NEWNAME = OLDNAME)) is sometimes confusing. You should get used to this with time.

Renaming in an automatized manner using a function

dplyr::rename_with() is the way to go if you want to rename several variables in a similar manner. Taking the example above, we know we want to rename the age and sex columns similarly (i.e. adding “patient” as a prefix). Let’s see how to do this with dplyr::rename_with() !

  • paste0() is a function that allows to paste/glue/put together two strings.
  • .x is a reference to each of the variables, allowing each variable to “pass through” the function ~paste0("patients_", .x)
## rename `age` and `sex` to `patient_age` and `patient_sex`
yaounde %>% 
  rename_with(~paste0("patients_", .x), c(age, sex))
## # A tibble: 5 × 37
##   id       date_surveyed patients_age age_category patients_sex highest_educati…
##   <chr>    <date>               <dbl> <chr>        <chr>        <chr>           
## 1 BRIQUET… 2020-10-22              45 45 - 64      Female       Secondary       
## 2 BRIQUET… 2020-10-24              55 45 - 64      Male         University      
## 3 BRIQUET… 2020-10-24              23 15 - 29      Male         University      
## 4 BRIQUET… 2020-10-22              20 15 - 29      Female       Secondary       
## 5 BRIQUET… 2020-10-22              55 45 - 64      Female       Primary         
## # … with 31 more variables: occupation <chr>, weight_kg <dbl>, height_cm <dbl>,
## #   is_smoker <chr>, is_pregnant <chr>, …

Contributors

The following team members contributed to this lesson:

References

Some material in this lesson was adapted from the following sources:

Artwork was adapted from: