The summarize() function from the dplyr
package is a powerful tool for creating summary statistics of your data.
It allows you to collapse a dataset to a single row or a summary for
each group of observations. In this tutorial, we’ll explore the basic
and advanced uses of summarize().
library(tidyverse)
gapminder<-haven::read_dta("gapminder.dta")
head(gapminder)
summarize()
The basic syntax of summarize() is straightforward. You
provide it with a dataset and specify the summary statistics you want to
compute.
gapminder %>%
summarize(global_avg_lifeExp = mean(lifeExp,na.rm = TRUE),
n= n())
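| Explanation of na.rm = TRUE |
|---|
| When working with data in R, it’s common to encounter missing values (NA). To handle this, many R functions include an argument called na.rm; setting na.rm = TRUE drops the missing values before the statistic is computed. In our case today, we know there is no NA in the data, so I omitted na.rm = TRUE in most of the later examples. |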
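To make the effect of na.rm concrete, here is a minimal sketch on a toy vector. The values are made up for illustration and do not come from the gapminder data.

```r
# Toy vector with one missing value (illustrative data, not from gapminder)
x <- c(70, NA, 66)

mean_default <- mean(x)                # the missing value propagates, so the result is NA
mean_no_na   <- mean(x, na.rm = TRUE)  # NA is dropped first, so the mean uses 70 and 66
n_missing    <- sum(is.na(x))          # quick count of missing values

print(mean_default)
print(mean_no_na)
print(n_missing)
```

Counting missing values with sum(is.na(x)) before summarizing is a cheap way to decide whether na.rm = TRUE is needed at all.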
group_by()
Often, you want to compute summaries for subgroups within your data.
This is where group_by() comes into play.
gapminder %>%
group_by(country) %>%
summarize(avg_lifeExp = mean(lifeExp),
n=n())
Calculate the total population growth for each country over the years (1952-2007).
# Example: Summarizing Population Growth
population_growth <- gapminder %>%
  arrange(year) %>%  # first() and last() depend on row order, so sort by year first
  group_by(country) %>%
  summarize(
    from = first(year),
    pop1952 = first(pop),
    to = last(year),
    pop2007 = last(pop),
    pop_growth = last(pop) - first(pop))
head(population_growth)
By summarizing longitudinal data, you can create new cross-sectional datasets for further analysis.
Create a cross-sectional dataset that includes the average life expectancy, the average GDP per capita, and the summed population for each continent.
cross_sectional_data <- gapminder %>%
  group_by(continent) %>%
  summarize(
    avg_lifeExp = mean(lifeExp),
    avg_gdpPercap = mean(gdpPercap),  # mean, to match the column name
    # pop is summed over every country-year row, so this is a person-year total,
    # not the population at a single point in time
    continent_pop = sum(pop)
  )
head(cross_sectional_data)
| Why Summarizing Longitudinal Data to Cross-Sectional Data Could be Useful |
|---|
Longitudinal data tracks the same subjects (e.g., countries, individuals) across multiple time points. While this is useful for analyzing trends over time, sometimes it’s necessary to condense the data into a cross-sectional format, where each observation is represented by a single row. Cross-sectional data represents the “snapshot” of each entity at a given moment or an aggregation over time, and it’s often used for comparative or overview analyses. Benefits of Summarizing Longitudinal Data:
|
You can summarize data using multiple grouping variables to get more granular insights.
# Example: Average life expectancy, etc., by continent and year
by_continent_year <- gapminder %>%
  group_by(continent, year) %>%
  summarize(
    avg_lifeExp = mean(lifeExp),
    avg_gdpPercap = mean(gdpPercap),
    continent_pop = sum(pop),
    .groups = "drop")  # return an ungrouped tibble
head(by_continent_year)
Counts and proportions of logical
values: sum(x > 10), mean(y == 0). When
used with numeric functions, TRUE is converted to 1
and FALSE to 0. This makes sum() and mean() very
useful: sum(x) gives the number of TRUEs
in x, and mean(x) gives the proportion.
gapminder %>%
group_by(continent,year) %>%
summarize(
prop_1000 = mean(gdpPercap<1000)*100
)
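The same idea can be checked by hand on a small vector. This is a sketch with made-up values; the names gdp and below_1000 are illustrative only.

```r
# Toy GDP-per-capita values (illustrative, not taken from gapminder)
gdp <- c(500, 800, 1500, 3000)

below_1000  <- gdp < 1000          # logical vector: TRUE TRUE FALSE FALSE
n_below     <- sum(below_1000)     # TRUE counts as 1, so this is the number of TRUEs
share_below <- mean(below_1000)    # proportion of TRUEs
pct_below   <- share_below * 100   # the same proportion expressed as a percentage

print(n_below)
print(share_below)
print(pct_below)
```

Note that prop_1000 in the grouped example is multiplied by 100, so it is a percentage rather than a proportion between 0 and 1.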
You can merge the summarized data back with the original dataset for comparative analysis.
# Example: Merging Average Life Expectancy with Original Data
gapminder_with_summary <- gapminder %>%
left_join(by_continent_year, by = c("continent","year"))
head(gapminder_with_summary)
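window Functions
# Lag the continent-level average within each country, in year order,
# so a country's first year gets NA instead of a value from the previous country
gapminder_with_summary <- gapminder_with_summary %>%
  group_by(country) %>%
  arrange(year, .by_group = TRUE) %>%
  mutate(lag_avg_GDPpc = lag(avg_gdpPercap)) %>%
  ungroup()
head(gapminder_with_summary)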
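Because lag() simply takes the value from the previous row, grouping matters in panel data. The sketch below uses a hypothetical two-country toy panel (the names toy, lag_ungrouped and lag_grouped are illustrative) to compare a lag computed over the whole table with one computed within each country.

```r
library(dplyr)

# Toy panel: two countries, two years each (illustrative values)
toy <- tibble(
  country = c("A", "A", "B", "B"),
  year    = c(2002, 2007, 2002, 2007),
  gdp     = c(10, 12, 50, 55)
)

toy_lagged <- toy %>%
  arrange(country, year) %>%
  mutate(lag_ungrouped = lag(gdp)) %>%  # B's first row picks up A's last value
  group_by(country) %>%
  mutate(lag_grouped = lag(gdp)) %>%    # NA at the first year of each country
  ungroup()

toy_lagged
```

In the ungrouped column, country B's 2002 row receives country A's 2007 value, which is rarely what is intended; the grouped column starts each country with NA.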
by_continent_year_wide <- by_continent_year %>%
pivot_wider(names_from = year, values_from = c(avg_lifeExp,avg_gdpPercap,continent_pop))
head(by_continent_year_wide)
across() for Summarizing Multiple Columns
You can apply the same summary function to several columns at once
using the across() helper.
# Example: Calculate the mean of multiple numeric columns
gapminder %>%
group_by(continent) %>%
summarize(across(c(lifeExp, gdpPercap), mean))
across()
Apply several functions to the same columns within a
single summarize() call.
# Example: Apply mean and median to the same columns
gapminder %>%
group_by(continent) %>%
summarize(
across(c(lifeExp,gdpPercap), mean, .names = "avg_{col}"),
across(c(lifeExp,gdpPercap), median, .names = "median_{col}")
)
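As an alternative to repeating across() once per function, a named list of functions can be passed in a single call. The sketch below runs on a small made-up tibble (toy and its values are illustrative, not the real gapminder figures) so that it is self-contained.

```r
library(dplyr)

# Toy data (illustrative values, not the real gapminder figures)
toy <- tibble(
  continent = c("Asia", "Asia", "Europe", "Europe"),
  lifeExp   = c(60, 70, 75, 79),
  gdpPercap = c(1000, 3000, 20000, 30000)
)

toy_summary <- toy %>%
  group_by(continent) %>%
  summarize(
    across(
      c(lifeExp, gdpPercap),
      list(avg = mean, median = median),  # the list names become {.fn}
      .names = "{.fn}_{.col}"             # e.g. avg_lifeExp, median_gdpPercap
    )
  )

toy_summary
```

The output columns are ordered by input column first and by function second, so both lifeExp summaries appear before the gdpPercap ones.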
Make sure you have the necessary packages installed:
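# Uncomment and run any of these lines for packages you have not installed yet
# install.packages("ggplot2")
# install.packages("rnaturalearth")
# install.packages("rnaturalearthdata")
# install.packages("sf")  # used by ne_countries(returnclass = "sf") and geom_sf()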
library(tidyverse)
library(rnaturalearth)
library(rnaturalearthdata)
library(ggplot2)
First, we will summarize the gapminder data
by country to calculate the average life expectancy for
each country.
# Summarizing data by country
cross_sectional_data <- gapminder %>%
group_by(country) %>%
summarize(
avg_lifeExp = mean(lifeExp, na.rm = TRUE)
)
Use the rnaturalearth package to get the world map data
for countries.
# Getting world map data
world_map <- ne_countries(scale = "medium", returnclass = "sf")
Next, we will merge cross_sectional_data (average life
expectancy by country) with the world_map dataset.
The world_map dataset stores country names in its name column, so we will
use left_join() to match name against the country column of
cross_sectional_data.
# Merging the country-level life expectancy with the world map
# why error? One likely cause: left_join() stops when the two key columns have
# incompatible types, so convert both keys to character before joining
cross_sectional_data$country <- as.character(cross_sectional_data$country)
world_map$name <- as.character(world_map$name)
world_map_data <- world_map %>%
  left_join(cross_sectional_data, by = c("name" = "country"))
Now we can create the map using ggplot2. We will
use geom_sf() to plot the map,
and scale_fill_viridis_c() to color the countries based on
life expectancy.
# Plotting the map
ggplot(data = world_map_data)+
geom_sf(aes(fill = avg_lifeExp)) +
scale_fill_viridis_c(option = "plasma", na.value = "gray50") +
labs(title = "Average Life Expectancy by Country",
fill = "Life Expectancy") +
theme_minimal()
# Why are some countries gray (NA)?
table(cross_sectional_data$country)
##
## 1 10 100 101 102 103 104 105 106 107 108 109 11 110 111 112 113 114 115 116 117 118 119 12 120
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 121 122 123 124 125 126 127 128 129 13 130 131 132 133 134 135 136 137 138 139 14 140 141 142 15
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 16 17 18 19 2 20 21 22 23 24 25 26 27 28 29 3 30 31 32 33 34 35 36 37 38
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 39 4 40 41 42 43 44 45 46 47 48 49 5 50 51 52 53 54 55 56 57 58 59 6 60
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 61 62 63 64 65 66 67 68 69 7 70 71 72 73 74 75 76 77 78 79 8 80 81 82 83
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 84 85 86 87 88 89 9 90 91 92 93 94 95 96 97 98 99
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
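The values printed above ("1", "10", "100", and so on) look like numeric codes rather than country names, which would leave nothing for the join to match. One possible explanation, stated here as an assumption about this particular gapminder.dta file, is that Stata stored the names as value labels and read_dta() imported the column as a labelled vector. The sketch below builds a hypothetical labelled vector and shows how haven::as_factor() recovers the labels, whereas dropping the labels leaves only the codes.

```r
library(haven)

# Hypothetical labelled column: numeric codes with country names as value labels
country <- labelled(c(1, 2, 1), labels = c(Afghanistan = 1, Albania = 2))

# Dropping the labels leaves only the underlying codes
codes_as_text <- as.character(zap_labels(country))

# Converting to a factor first keeps the country names
labels_as_text <- as.character(as_factor(country))

print(codes_as_text)
print(labels_as_text)
```

If the assumption holds for the tutorial's file, applying as_factor() to the country column before converting it to character would give the join real names to match on.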
table(world_map$name)
##
## Afghanistan Åland Albania
## 1 1 1
## Algeria American Samoa Andorra
## 1 1 1
## Angola Anguilla Antarctica
## 1 1 1
## Antigua and Barb. Argentina Armenia
## 1 1 1
## Aruba Ashmore and Cartier Is. Australia
## 1 1 1
## Austria Azerbaijan Bahamas
## 1 1 1
## Bahrain Bangladesh Barbados
## 1 1 1
## Belarus Belgium Belize
## 1 1 1
## Benin Bermuda Bhutan
## 1 1 1
## Bolivia Bosnia and Herz. Botswana
## 1 1 1
## Br. Indian Ocean Ter. Brazil British Virgin Is.
## 1 1 1
## Brunei Bulgaria Burkina Faso
## 1 1 1
## Burundi Cabo Verde Cambodia
## 1 1 1
## Cameroon Canada Cayman Is.
## 1 1 1
## Central African Rep. Chad Chile
## 1 1 1
## China Colombia Comoros
## 1 1 1
## Congo Cook Is. Costa Rica
## 1 1 1
## Côte d'Ivoire Croatia Cuba
## 1 1 1
## Curaçao Cyprus Czechia
## 1 1 1
## Dem. Rep. Congo Denmark Djibouti
## 1 1 1
## Dominica Dominican Rep. Ecuador
## 1 1 1
## Egypt El Salvador Eq. Guinea
## 1 1 1
## Eritrea Estonia eSwatini
## 1 1 1
## Ethiopia Faeroe Is. Falkland Is.
## 1 1 1
## Fiji Finland Fr. Polynesia
## 1 1 1
## Fr. S. Antarctic Lands France Gabon
## 1 1 1
## Gambia Georgia Germany
## 1 1 1
## Ghana Greece Greenland
## 1 1 1
## Grenada Guam Guatemala
## 1 1 1
## Guernsey Guinea Guinea-Bissau
## 1 1 1
## Guyana Haiti Heard I. and McDonald Is.
## 1 1 1
## Honduras Hong Kong Hungary
## 1 1 1
## Iceland India Indian Ocean Ter.
## 1 1 1
## Indonesia Iran Iraq
## 1 1 1
## Ireland Isle of Man Israel
## 1 1 1
## Italy Jamaica Japan
## 1 1 1
## Jersey Jordan Kazakhstan
## 1 1 1
## Kenya Kiribati Kosovo
## 1 1 1
## Kuwait Kyrgyzstan Laos
## 1 1 1
## Latvia Lebanon Lesotho
## 1 1 1
## Liberia Libya Liechtenstein
## 1 1 1
## Lithuania Luxembourg Macao
## 1 1 1
## Madagascar Malawi Malaysia
## 1 1 1
## Maldives Mali Malta
## 1 1 1
## Marshall Is. Mauritania Mauritius
## 1 1 1
## Mexico Micronesia Moldova
## 1 1 1
## Monaco Mongolia Montenegro
## 1 1 1
## Montserrat Morocco Mozambique
## 1 1 1
## Myanmar N. Cyprus N. Mariana Is.
## 1 1 1
## Namibia Nauru Nepal
## 1 1 1
## Netherlands New Caledonia New Zealand
## 1 1 1
## Nicaragua Niger Nigeria
## 1 1 1
## Niue Norfolk Island North Korea
## 1 1 1
## North Macedonia Norway Oman
## 1 1 1
## Pakistan Palau Palestine
## 1 1 1
## Panama Papua New Guinea Paraguay
## 1 1 1
## Peru Philippines Pitcairn Is.
## 1 1 1
## Poland Portugal Puerto Rico
## 1 1 1
## Qatar Romania Russia
## 1 1 1
## Rwanda S. Geo. and the Is. S. Sudan
## 1 1 1
## Saint Helena Saint Lucia Samoa
## 1 1 1
## San Marino São Tomé and Principe Saudi Arabia
## 1 1 1
## Senegal Serbia Seychelles
## 1 1 1
## Siachen Glacier Sierra Leone Singapore
## 1 1 1
## Sint Maarten Slovakia Slovenia
## 1 1 1
## Solomon Is. Somalia Somaliland
## 1 1 1
## South Africa South Korea Spain
## 1 1 1
## Sri Lanka St-Barthélemy St-Martin
## 1 1 1
## St. Kitts and Nevis St. Pierre and Miquelon St. Vin. and Gren.
## 1 1 1
## Sudan Suriname Sweden
## 1 1 1
## Switzerland Syria Taiwan
## 1 1 1
## Tajikistan Tanzania Thailand
## 1 1 1
## Timor-Leste Togo Tonga
## 1 1 1
## Trinidad and Tobago Tunisia Turkey
## 1 1 1
## Turkmenistan Turks and Caicos Is. Tuvalu
## 1 1 1
## U.S. Virgin Is. Uganda Ukraine
## 1 1 1
## United Arab Emirates United Kingdom United States of America
## 1 1 1
## Uruguay Uzbekistan Vanuatu
## 1 1 1
## Vatican Venezuela Vietnam
## 1 1 1
## W. Sahara Wallis and Futuna Is. Yemen
## 1 1 1
## Zambia Zimbabwe
## 1 1
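Scanning two long tables by eye is error-prone. A more direct check is dplyr's anti_join(), which returns the rows of one table that have no match in the other. The sketch below uses small hypothetical stand-ins (map_names and life_exp, with made-up values) for world_map and cross_sectional_data.

```r
library(dplyr)

# Toy stand-ins for the two tables (hypothetical rows, not the real data)
map_names <- tibble(name = c("France", "United States of America", "Vietnam"))
life_exp  <- tibble(
  country     = c("France", "United States", "Vietnam"),
  avg_lifeExp = c(75, 73, 58)
)

# Map rows with no partner in the life-expectancy table
unmatched_map <- anti_join(map_names, life_exp, by = c("name" = "country"))

# Life-expectancy rows with no partner in the map
unmatched_data <- anti_join(life_exp, map_names, by = c("country" = "name"))

print(unmatched_map)
print(unmatched_data)
```

Running the check in both directions shows which spelling each table uses, which is what a recoding step needs.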
world_map <- world_map %>%
  mutate(
    name = case_when(
      name == "United States of America" ~ "United States",
      TRUE ~ name  # leave every other name unchanged
    ))
table(world_map$name)
## Output unchanged from the previous table, except that "United States of America" is now listed as "United States".