
Session 2 Intro to dplyr

Introduction to dplyr

In this class, we will explore the dplyr package for data manipulation in R. You will learn how to use its key functions such as select(), filter(), arrange(), and mutate(). We will also cover advanced topics like using across() for applying functions to multiple columns, grouping and summarizing data, and joining datasets.

Loading Libraries and Dataset

We begin by loading the tidyverse package (which includes dplyr) and using the built-in mtcars dataset.
If you haven’t installed tidyverse yet, run the following once:

install.packages("tidyverse")

Now, load the library and view the first few rows of the dataset:

library(tidyverse)
head(mtcars)

Part 1: Selecting and Renaming Columns

1.1 Select Specific Columns

The select() function extracts specific columns from a dataset. For example, to select only mpg, hp, and cyl:

mtcars_selected <- mtcars %>%
  select(mpg, hp, cyl)

head(mtcars_selected)
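
select() also accepts ranges of adjacent columns and negated names. The following is a minimal sketch, assuming only dplyr is loaded; the object names first_three and without_vs_am are illustrative and not part of the class material:

```r
library(dplyr)

# A range of adjacent columns: everything from mpg through disp
first_three <- mtcars %>%
  select(mpg:disp)

# Negation: keep every column except vs and am
without_vs_am <- mtcars %>%
  select(-vs, -am)

names(first_three)
names(without_vs_am)
```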

Task 1:
Exercise: Select columns wt, qsec, and gear from the mtcars dataset.

# Your code here

1.2 Renaming Columns

Use rename() to change column names without altering the data structure. For example, to rename mpg to Miles_Per_Gallon:

mtcars_renamed <- mtcars %>%
  rename(Miles_Per_Gallon = mpg)
head(mtcars_renamed)
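
rename() always takes the form new_name = old_name, and several columns can be renamed in one call. A short sketch, where the new names Weight and Cylinders are chosen here purely for illustration:

```r
library(dplyr)

# Rename two columns at once; all other columns are left unchanged
mtcars_renamed_multi <- mtcars %>%
  rename(Weight = wt, Cylinders = cyl)

names(mtcars_renamed_multi)
```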

Task 2:
Exercise: Rename the hp column to Horsepower.

# Your code here

Part 2: Filtering Rows

2.1 Filter Based on One Condition

Use filter() to keep only the rows that meet a condition. For example, to keep cars with more than 6 cylinders:

mtcars_filtered <- mtcars %>%
  filter(cyl > 6)
head(mtcars_filtered)

2.2 Filter Based on Multiple Conditions

Combine conditions using logical operators (& for AND, | for OR). For example, to keep cars with more than 6 cylinders and more than 100 horsepower:

mtcars_filtered_advanced <- mtcars %>%
  filter(cyl > 6 & hp > 100)
head(mtcars_filtered_advanced)
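
To see how the choice of operator changes the result, it can help to compare row counts. This is a small sketch; the object names both_conditions, either_condition, and comma_separated are made up for the comparison:

```r
library(dplyr)

# AND: a row must satisfy both conditions
both_conditions <- mtcars %>%
  filter(cyl > 6 & hp > 100)

# OR: a row needs to satisfy at least one condition
either_condition <- mtcars %>%
  filter(cyl > 6 | hp > 100)

# Separating conditions with a comma is equivalent to AND
comma_separated <- mtcars %>%
  filter(cyl > 6, hp > 100)

nrow(both_conditions)
nrow(either_condition)
nrow(comma_separated)
```

The AND version can never return more rows than the OR version, which makes it a quick sanity check when a filter returns an unexpected number of rows.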

Task 3:
Exercise: Filter cars that have either 4 or 8 cylinders.

# Your code here

Part 3: Creating New Variables with mutate()

3.1 Basic Mutate

The mutate() function creates or modifies columns. For example, to create a new column representing horsepower per unit of weight:

mtcars_new_var <- mtcars %>%
  mutate(hp_per_wt = hp / wt)

head(mtcars_new_var)

3.2 Creating Multiple New Columns

You can create multiple columns in one go. For example, add hp_per_wt and a scaled version of mpg:

mtcars_multi_mutate <- mtcars %>%
  mutate(hp_per_wt = hp / wt,
         scaled_mpg = scale(mpg)
         )
head(mtcars_multi_mutate)
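
One detail worth knowing: scale() returns a one-column matrix, so scaled_mpg above is stored as a matrix column rather than a plain numeric vector. If a plain vector is preferred, the standardization can be written out by hand. A minimal sketch, with the object name mtcars_scaled_mpg chosen for illustration:

```r
library(dplyr)

# Standardize mpg manually: subtract the mean, divide by the standard deviation
mtcars_scaled_mpg <- mtcars %>%
  mutate(scaled_mpg = (mpg - mean(mpg)) / sd(mpg))

is.numeric(mtcars_scaled_mpg$scaled_mpg)
is.matrix(mtcars_scaled_mpg$scaled_mpg)
```

Wrapping the call as as.numeric(scale(mpg)) should give the same values.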

3.3 Conditional Mutate

Create new columns based on conditions using if_else(). For example, classify cars as “High HP” or “Low HP”:

mtcars_classified <- mtcars %>%
  mutate(hp_class = if_else(hp > 150, "High HP", "Low HP")
         )

head(mtcars_classified)

Task 4:
Exercise: Create a new variable classifying cars as “Heavy” or “Light” based on their weight (wt).

# Your code here

3.4 Advanced: Using case_when()

Use case_when() for multiple conditions. For example, classify cars into weight categories:

mtcars_weight_class <- mtcars %>%
  mutate(weight_class = case_when(
    wt < 2.5 ~ "Light",
    wt >= 2.5 & wt < 3.5 ~ "Medium",
    wt >= 3.5 ~ "Heavy"
  ))
head(mtcars_weight_class)
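
case_when() checks its conditions in order and uses the first one that is TRUE, so later conditions do not need to repeat the earlier bounds. The sketch below is one possible shorter version of the same classification; the name mtcars_weight_class2 is illustrative. Note that a TRUE catch-all would also label missing weights as "Heavy"; mtcars has no missing values in wt, so that does not matter here.

```r
library(dplyr)

mtcars_weight_class2 <- mtcars %>%
  mutate(weight_class = case_when(
    wt < 2.5 ~ "Light",
    wt < 3.5 ~ "Medium",  # only reached when wt >= 2.5
    TRUE     ~ "Heavy"    # catch-all for every remaining row
  ))

# Count how many cars fall into each category
count(mtcars_weight_class2, weight_class)
```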

3.5 Using mutate() with across()

Apply the same transformation to multiple columns using across(). For example, standardize the mpg, hp, and qsec columns:

mtcars_scaled_vars <- mtcars %>%
  mutate(across(c(mpg, hp, qsec), scale))
head(mtcars_scaled_vars)
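
By default, across() overwrites the columns it transforms. To keep the originals, the .names argument of across() can build new column names instead. A sketch under the assumption that a "_z" suffix is an acceptable naming scheme (that suffix and the object name mtcars_z are choices made here, not part of the class material):

```r
library(dplyr)

mtcars_z <- mtcars %>%
  mutate(across(
    c(mpg, hp, qsec),
    ~ as.numeric(scale(.x)),  # standardize and drop the matrix wrapper
    .names = "{.col}_z"       # e.g. mpg becomes mpg_z; mpg itself is kept
  ))

head(mtcars_z)
```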

Part 4: Sorting and Arranging Data

Use the arrange() function to sort your data. For example, to sort by mpg in ascending order:

mtcars_sorted <- mtcars %>%
  arrange(mpg)
head(mtcars_sorted)

Task 5:
Exercise: Arrange the cars by horsepower (hp) in descending order.

mtcars_sorted_desc <- mtcars %>%
  arrange(desc(hp))
head(mtcars_sorted_desc)
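
arrange() also accepts several sort keys; later keys break ties in the earlier ones. A minimal sketch that sorts by cylinder count and then by mpg from highest to lowest within each cylinder group (the object name mtcars_sorted_multi is illustrative):

```r
library(dplyr)

# Sort by cyl ascending, then by mpg descending within each cyl value
mtcars_sorted_multi <- mtcars %>%
  arrange(cyl, desc(mpg))

head(mtcars_sorted_multi)
```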

Part 5: Joining Data

Sometimes you’ll need to combine data from two sources. dplyr offers several join functions, which differ in the rows they keep when joining a table x to a table y:

inner_join() returns matched x rows.
left_join() returns all x rows.
right_join() returns matched x rows, followed by unmatched y rows.
full_join() returns all x rows, followed by unmatched y rows.

For example, suppose we have another dataset:

# Create a simple data frame with car models and a new variable
car_info <- tibble(
  model = rownames(mtcars),
  origin = rep(c("USA", "Europe", "Japan"), length.out = nrow(mtcars))
)

# Convert row names to a "model" column so both tables share a key
mtcars_with_model <- mtcars %>%
  rownames_to_column(var = "model")

# Join the two tables on "model" and store the result
mtcars_joined <- car_info %>%
  right_join(mtcars_with_model, by = "model")

head(mtcars_joined)

Tip: Use left_join() when you want to keep all observations from your main dataset.
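
The difference between the join types is easiest to see on two tiny tables whose keys only partly overlap. The tables x and y below are made up for this illustration; only key values "b" and "c" appear in both:

```r
library(dplyr)

x <- data.frame(key = c("a", "b", "c"), x_val = 1:3)
y <- data.frame(key = c("b", "c", "d"), y_val = c(20, 30, 40))

inner <- inner_join(x, y, by = "key")  # keys b, c
left  <- left_join(x, y, by = "key")   # keys a, b, c (y_val is NA for a)
right <- right_join(x, y, by = "key")  # keys b, c, d (x_val is NA for d)
full  <- full_join(x, y, by = "key")   # keys a, b, c, d

nrow(inner)
nrow(left)
nrow(right)
nrow(full)
```

Columns that have no match in the other table are filled with NA, which is a useful signal when checking whether a join behaved as intended.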

Part 6: Exercises and Further Exploration

Now it’s your turn! Try writing your own dplyr code:

- Experiment with different filtering conditions.
- Create new variables based on your own criteria.
- Explore additional joins such as right_join() or full_join() with custom datasets.

# Your exercise code here
