In this class, we will explore the dplyr package for
data manipulation in R. You will learn how to use its key functions such
as select(), filter(), arrange(),
and mutate(). We will also cover advanced topics like using
across() for applying functions to multiple columns,
grouping and summarizing data, and joining datasets.
We begin by loading the tidyverse package (which
includes dplyr) and using the built-in mtcars
dataset.
If you haven’t installed tidyverse, run:
install.packages("tidyverse")
Now, load the library and view the first few rows of the dataset:
library(tidyverse)
head(mtcars)
The select() function extracts specific columns from a
dataset. For example, to select only mpg, hp,
and cyl:
mtcars_selected <- mtcars %>%
  select(mpg, hp, cyl)
head(mtcars_selected)
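select() also accepts column ranges, negation, and helper functions. The following is a minimal sketch of those forms, assuming the selection helpers that ship with dplyr; the object names are illustrative only.

```r
library(dplyr)

# A contiguous range of columns, from mpg through hp
mtcars_range <- mtcars %>%
  select(mpg:hp)

# Every column except cyl
mtcars_no_cyl <- mtcars %>%
  select(-cyl)

# Columns whose names start with "d" (disp and drat in mtcars)
mtcars_d <- mtcars %>%
  select(starts_with("d"))

names(mtcars_range)
names(mtcars_d)
```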
Task 1:
Exercise: Select columns wt, qsec, and
gear from the mtcars dataset.
# Your code here
Use rename() to change column names without altering the
data structure. For example, to rename mpg to
Miles_Per_Gallon:
mtcars_renamed <- mtcars %>%
  rename(Miles_Per_Gallon = mpg)
head(mtcars_renamed)
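rename() takes new_name = old_name pairs, so several columns can be renamed in one call. As a sketch (the new names here are illustrative), rename_with() instead applies a function to the selected column names:

```r
library(dplyr)

# Rename two columns at once: new name on the left, old name on the right
mtcars_renamed_two <- mtcars %>%
  rename(Miles_Per_Gallon = mpg, Weight = wt)

# Apply a function (here toupper) to selected column names
mtcars_upper <- mtcars %>%
  rename_with(toupper, .cols = c(mpg, wt))

names(mtcars_renamed_two)
names(mtcars_upper)
```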
Task 2:
Exercise: Rename the hp column to
Horsepower.
# Your code here
Use filter() to select rows meeting a condition. For
example, filtering cars with more than 6 cylinders:
mtcars_filtered <- mtcars %>%
  filter(cyl > 6)
head(mtcars_filtered)
Combine conditions using logical operators (& for
AND, | for OR). For example, filtering cars with more than
6 cylinders and more than 100 horsepower:
mtcars_filtered_advanced <- mtcars %>%
  filter(cyl > 6 & hp > 100)
head(mtcars_filtered_advanced)
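The choice between & and | changes which rows survive. Here is a small sketch comparing the two on the same pair of conditions (object names are illustrative). It also shows that conditions separated by commas inside filter() are combined with AND:

```r
library(dplyr)

# AND: both conditions must hold
both <- mtcars %>%
  filter(cyl > 6 & hp > 100)

# Comma-separated conditions are also combined with AND
both_comma <- mtcars %>%
  filter(cyl > 6, hp > 100)

# OR: at least one condition must hold, so this never returns fewer rows
either <- mtcars %>%
  filter(cyl > 6 | hp > 100)

nrow(both)
nrow(either)
```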
Task 3:
Exercise: Filter cars that have either 4 or 8 cylinders.
# Your code here
The mutate() function creates or modifies columns. For
example, to create a new column representing horsepower per weight:
mtcars_new_var <- mtcars %>%
  mutate(hp_per_wt = hp / wt)
head(mtcars_new_var)
You can create multiple columns in one go. For example, add
hp_per_wt and a scaled version of mpg:
mtcars_multi_mutate <- mtcars %>%
  mutate(
    hp_per_wt = hp / wt,
    scaled_mpg = scale(mpg)
  )
head(mtcars_multi_mutate)
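One detail worth knowing: scale() returns a one-column matrix rather than a plain vector. One way to handle that, sketched below, is to wrap the call in as.numeric(). The sketch also shows that a column created earlier in the same mutate() call can be used right away (hp_per_wt_rounded is an illustrative name):

```r
library(dplyr)

mtcars_scaled <- mtcars %>%
  mutate(
    hp_per_wt = hp / wt,
    # hp_per_wt already exists at this point within the same call
    hp_per_wt_rounded = round(hp_per_wt, 1),
    # as.numeric() drops the matrix structure that scale() returns
    scaled_mpg = as.numeric(scale(mpg))
  )

# A standardized column has mean 0 and standard deviation 1 (up to rounding)
mean(mtcars_scaled$scaled_mpg)
sd(mtcars_scaled$scaled_mpg)
```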
Create new columns based on conditions using if_else().
For example, classify cars as “High HP” or “Low HP”:
mtcars_classified <- mtcars %>%
  mutate(hp_class = if_else(hp > 150, "High HP", "Low HP"))
head(mtcars_classified)
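if_else() is stricter than base R's ifelse(): the true and false values must have compatible types, and missing values in the condition can be handled explicitly through the missing argument. A minimal sketch on a made-up vector (hp_values is hypothetical, not taken from mtcars):

```r
library(dplyr)

# Hypothetical horsepower values, including one missing entry
hp_values <- c(110, 245, NA, 66)

# The NA in the condition is mapped to "Unknown" instead of staying NA
hp_class <- if_else(hp_values > 150, "High HP", "Low HP", missing = "Unknown")
hp_class
```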
Task 4:
Exercise: Create a new variable classifying cars as “Heavy” or
“Light” based on their weight (wt).
# Your code here
Use case_when() for multiple conditions. For example,
classify cars into weight categories:
mtcars_weight_class <- mtcars %>%
  mutate(weight_class = case_when(
    wt < 2.5 ~ "Light",
    wt >= 2.5 & wt < 3.5 ~ "Medium",
    wt >= 3.5 ~ "Heavy"
  ))
head(mtcars_weight_class)
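case_when() checks its conditions in order and uses the first one that matches, so later conditions do not need to repeat the earlier bounds. The sketch below is an equivalent, shorter form of the example above, assuming wt has no missing values (true for mtcars). TRUE acts as a catch-all for rows that matched nothing earlier:

```r
library(dplyr)

mtcars_weight_class2 <- mtcars %>%
  mutate(weight_class = case_when(
    wt < 2.5 ~ "Light",
    wt < 3.5 ~ "Medium",  # only reached when wt >= 2.5
    TRUE     ~ "Heavy"    # everything left over, i.e. wt >= 3.5
  ))

table(mtcars_weight_class2$weight_class)
```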
Apply the same transformation to multiple columns using
across(). For example, standardize the mpg, hp, and
qsec columns:
mtcars_scaled_vars <- mtcars %>%
  mutate(across(c(mpg, hp, qsec), scale))
head(mtcars_scaled_vars)
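By default across() overwrites the selected columns. If the originals should be kept, the .names argument can write the results to new columns instead. A sketch, where the _z suffix is an arbitrary naming choice:

```r
library(dplyr)

mtcars_z <- mtcars %>%
  mutate(across(
    c(mpg, hp, qsec),
    ~ as.numeric(scale(.x)),   # standardize and drop scale()'s matrix structure
    .names = "{.col}_z"        # e.g. mpg -> mpg_z; original columns stay untouched
  ))

head(select(mtcars_z, mpg, mpg_z, hp, hp_z, qsec, qsec_z))
```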
Use the arrange() function to sort your data. For
example, to sort by mpg in ascending order:
mtcars_sorted <- mtcars %>%
  arrange(mpg)
head(mtcars_sorted)
Task 5:
Exercise: Arrange the cars by horsepower (hp) in
descending order.
mtcars_sorted_desc <- mtcars %>%
  arrange(desc(hp))
head(mtcars_sorted_desc)
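arrange() accepts several columns; later ones break ties in earlier ones. A minimal sketch (the object name is illustrative) that sorts by cylinder count and then, within each count, by mpg from highest to lowest:

```r
library(dplyr)

mtcars_sorted_multi <- mtcars %>%
  arrange(cyl, desc(mpg))

head(select(mtcars_sorted_multi, cyl, mpg))
```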
Sometimes you’ll need to combine data from two sources. dplyr offers
functions like left_join(), inner_join(), etc. In each call, x is the
first data frame and y is the second:
- inner_join() returns matched x rows.
- left_join() returns all x rows.
- right_join() returns matched x rows, followed by unmatched y rows.
- full_join() returns all x rows, followed by unmatched y rows.
For example, suppose we have another dataset:
# Create a simple data frame with car models and a new variable.
# The origin values are cycled placeholders, not the cars' real origins.
car_info <- tibble(
  model = rownames(mtcars),
  origin = rep(c("USA", "Europe", "Japan"), length.out = nrow(mtcars))
)
# Convert the row names to a "model" column so both tables share a key
mtcars_with_model <- mtcars %>%
  rownames_to_column(var = "model")
# Join the two tables by model, keeping every row of the main dataset
mtcars_joined <- mtcars_with_model %>%
  left_join(car_info, by = "model")
head(mtcars_joined)
Tip: Use left_join() when you want to keep all
observations from your main dataset.
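The four join types can be compared side by side on two small tables. The sketch below uses two hypothetical tibbles, x and y, that overlap on only some keys; the column names are made up for illustration:

```r
library(dplyr)

# Two hypothetical tables that share the keys "b" and "c"
x <- tibble(key = c("a", "b", "c"), x_val = 1:3)
y <- tibble(key = c("b", "c", "d"), y_val = c(10, 20, 30))

inner <- inner_join(x, y, by = "key")  # keys present in both tables
left  <- left_join(x, y, by = "key")   # every x row; y_val is NA for "a"
right <- right_join(x, y, by = "key")  # every y row; x_val is NA for "d"
full  <- full_join(x, y, by = "key")   # every key from either table

# Number of rows returned by each join
c(inner = nrow(inner), left = nrow(left), right = nrow(right), full = nrow(full))
```

Comparing row counts like this before and after a join is a quick way to spot keys that failed to match.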
Now it’s your turn! Try writing your own dplyr code:
- Experiment with different filtering conditions.
- Create new variables based on your own criteria.
- Explore additional joins such as right_join() or full_join() with custom
  datasets.
# Your exercise code here