Week 11 EDA Exercise 1

Getting Started

This analysis explores the General Social Survey (GSS) dataset to perform exploratory data analysis and inferential tests.

Load packages

library(tidyverse)

Load data

load("gss.Rdata")
dim(gss)
## [1] 57061   114

Part 1: Data

Background

The GSS dataset collects information on American societal trends, covering demographics, behaviors, attitudes, and issues like civil liberties, crime, and social mobility. The dataset contains 57,061 observations with 114 variables, representing diverse sociological factors.

Methodology

GSS surveys are conducted face-to-face with randomly selected U.S. adults. The sampling uses a mix of cluster and stratified random sampling, primarily targeting adults (18+).

Scope of Inference

Because respondents are randomly sampled, results generalize to the U.S. adult population. Because the GSS is an observational study with no random assignment, it does not support causal inference. Potential bias includes non-response, as participation is voluntary and time-intensive.

Part 2: Research Questions

  1. Is there a relationship between race and perception of premarital sex?

  2. Is there a relationship between age and perception of premarital sex?

  3. Is there a relationship between sex and perception of premarital sex?

These questions aim to understand how race, age, and sex are associated with attitudes toward premarital sex, an issue tied to public health concerns.

Data Preparation

gss_sub <- gss %>% 
  filter(year >= 2010, !is.na(age)&!is.na(race)&!is.na(premarsx)&!is.na(sex)) %>% 
  select(age, race, premarsx,sex)

dim(gss_sub)
## [1] 2654    4

Part 3: Exploratory Data Analysis

summary(gss_sub)
##       age           race                  premarsx        sex      
##  Min.   :18.00   White:2003   Always Wrong    : 576   Male  :1174  
##  1st Qu.:33.00   Black: 413   Almst Always Wrg: 187   Female:1480  
##  Median :46.00   Other: 238   Sometimes Wrong : 447                
##  Mean   :47.65                Not Wrong At All:1444                
##  3rd Qu.:61.00                Other           :   0                
##  Max.   :89.00
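The summary lists an `Other` level of `premarsx` with zero observations. A minimal sketch of how such an unused factor level could be dropped with base R's `droplevels()`, so that it no longer shows up in tables or plot legends. The toy factor below is made up for illustration and is not GSS data.

```r
# Toy factor with one unused level (hypothetical values, not GSS data)
views <- factor(
  c("Always Wrong", "Almst Always Wrg", "Sometimes Wrong",
    "Not Wrong At All", "Not Wrong At All"),
  levels = c("Always Wrong", "Almst Always Wrg", "Sometimes Wrong",
             "Not Wrong At All", "Other")
)

table(views)                  # "Other" is listed with a count of 0

# droplevels() keeps only the levels that actually occur
views_clean <- droplevels(views)
levels(views_clean)
```

`droplevels()` also accepts a whole data frame, so `gss_sub <- droplevels(gss_sub)` would be one way to apply the same idea to the analysis data.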

Race vs. Perception toward Premarital Sex

ggplot(gss_sub, aes(x = race, fill = premarsx)) + 
  geom_bar(position = "fill") + 
  labs(x = "Race", y = "Proportion", title = "Race vs. Premarital Sex Perception")
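The filled bar chart shows, for each race, the share of respondents giving each answer. A small sketch of how the same row-wise proportions could be tabulated with `table()` and `prop.table()`. The `toy` data frame and its column names are hypothetical and stand in for `gss_sub`.

```r
# Toy data (made-up values, not GSS data)
toy <- data.frame(
  group = c("A", "A", "A", "A", "B", "B", "B", "B", "B", "B"),
  view  = c("Wrong", "Not wrong", "Not wrong", "Not wrong",
            "Wrong", "Wrong", "Wrong", "Not wrong", "Not wrong", "Not wrong")
)

# Cross-tabulate counts: rows are groups, columns are answers
counts <- table(toy$group, toy$view)

# margin = 1 scales each row to sum to 1, matching position = "fill"
props <- prop.table(counts, margin = 1)
round(props, 2)
```

With the analysis data, the analogous call would be `prop.table(table(gss_sub$race, gss_sub$premarsx), margin = 1)`.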

Age vs. Perception toward Premarital Sex

ggplot(gss_sub, aes(x = age, y = premarsx)) + 
  geom_boxplot() + 
  labs(x = "Age (years)", y = "Premarital Sex Perception", title = "Age vs. Premarital Sex Perception")

ggplot(gss_sub, aes(x = premarsx, y = age)) + 
  geom_boxplot(fill = "lightblue") + 
  labs(x = "Premarital Sex Perception", y = "Age", title = "Age Distribution by Premarital Sex Perception") +
  theme_classic()

ggplot(gss_sub, aes(x = age, fill = premarsx)) + 
  geom_bar(position = "fill") + 
  labs(x = "Age", y = "Proportion", title = "Age vs. Premarital Sex Perception")

Sex vs. Perception toward Premarital Sex

ggplot(gss_sub, aes(x = sex, fill = premarsx)) + 
  geom_bar(position = "fill") + 
  labs(x = "Sex", y = "Proportion", title = "Sex vs. Premarital Sex Perception")

Part 4: Inference

In this section, we’ll use different statistical tests to see if race, age, and other factors relate to people’s perceptions of premarital sex. We’ll use Chi-squared tests to check if categories like race and perception are related, t-tests to compare group means when we have only two groups, ANOVA to see if age varies by perception groups, and linear regression to explore relationships and control for other factors.

Chi-Squared Test: Race and Perception of Premarital Sex

The Chi-squared test of independence checks whether two categorical variables, like race and perception of premarital sex, are associated. It compares the observed count in each cell of the contingency table with the count expected if the variables were independent, so it needs no averages or other numeric summaries.

  • When to Use: Chi-squared is useful for comparing two categorical variables, like race and perception, to see whether they are associated. It does not show that one impacts the other.

  • Null Hypothesis (H_0): Race and perceptions of premarital sex are independent.

  • Alternative Hypothesis (H_A): Race and perceptions of premarital sex are associated.

# Chi-squared test between race and perception on premarital sex
chisq_test <- chisq.test(gss_sub$race, gss_sub$premarsx)
chisq_test
## 
##  Pearson's Chi-squared test
## 
## data:  gss_sub$race and gss_sub$premarsx
## X-squared = 22.248, df = 6, p-value = 0.001092

A low p-value (typically below 0.05) is evidence against the null hypothesis. Here the p-value is about 0.001, so we reject the null hypothesis: race and perception of premarital sex appear to be associated.
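A significant Chi-squared result does not say which cells differ from what independence would predict. A minimal sketch of how the expected counts and standardized residuals stored in the object returned by `chisq.test()` could be inspected. The `counts` matrix is made up for illustration and is not GSS data.

```r
# Toy 2 x 3 contingency table (made-up counts, not GSS data)
counts <- matrix(c(30, 20, 50,
                   10, 25, 65),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(group = c("A", "B"),
                                 view  = c("Wrong", "Sometimes", "Not wrong")))

res <- chisq.test(counts)

res$expected   # counts expected if group and view were independent
res$stdres     # standardized residuals; large absolute values mark
               # the cells that contribute most to the statistic

# Common rule of thumb: the chi-squared approximation is reasonable
# when every expected count is at least 5
all(res$expected >= 5)
```

For the analysis above, the same components would be available as `chisq_test$expected` and `chisq_test$stdres`.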

t-Test: Comparing Two Groups

The t-test is used to compare the means of a numeric variable between two groups. For example, to see whether average approval of premarital sex differs between males and females, we’d use a t-test on a numeric version of the perception variable.

  • When to Use: Use the t-test to compare the mean of a numeric variable between two groups.

  • Null Hypothesis (H_0): There is no difference in the means between the two groups.

  • Alternative Hypothesis (H_A): There is a difference in the means between the two groups.

# Convert perception to a numeric score (1 = Always Wrong, 4 = Not Wrong At All)
gss_sub <- gss_sub %>%
  mutate(premarsx_num = as.numeric(as.factor(premarsx)))

table(gss_sub$premarsx,gss_sub$premarsx_num)
##                   
##                       1    2    3    4
##   Always Wrong      576    0    0    0
##   Almst Always Wrg    0  187    0    0
##   Sometimes Wrong     0    0  447    0
##   Not Wrong At All    0    0    0 1444
##   Other               0    0    0    0
# t-test for mean perception score between males and females
t_test <- t.test(premarsx_num ~ sex, data = gss_sub)
t_test
## 
##  Welch Two Sample t-test
## 
## data:  premarsx_num by sex
## t = 4.8299, df = 2586.5, p-value = 1.446e-06
## alternative hypothesis: true difference in means between group Male and group Female is not equal to 0
## 95 percent confidence interval:
##  0.1347867 0.3190325
## sample estimates:
##   mean in group Male mean in group Female 
##             3.166099             2.939189
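The degrees of freedom in the output (2586.5) are not a whole number because `t.test()` runs Welch's test by default, which does not assume equal variances in the two groups. A minimal sketch, on made-up scores rather than GSS data, of how the Welch statistic and its degrees of freedom can be reproduced by hand. The vectors `g1` and `g2` are hypothetical.

```r
# Toy scores on a 1-4 scale for two groups (made-up values, not GSS data)
g1 <- c(4, 4, 3, 4, 2, 4, 3, 1, 4, 4)
g2 <- c(3, 1, 4, 2, 1, 3, 4, 2, 1, 3)

# Squared standard error of each group mean
se2_1 <- var(g1) / length(g1)
se2_2 <- var(g2) / length(g2)

# Welch statistic: difference in means over its standard error
t_manual <- (mean(g1) - mean(g2)) / sqrt(se2_1 + se2_2)

# Welch-Satterthwaite degrees of freedom (usually not a whole number)
df_manual <- (se2_1 + se2_2)^2 /
  (se2_1^2 / (length(g1) - 1) + se2_2^2 / (length(g2) - 1))

fit <- t.test(g1, g2)   # var.equal = FALSE is the default
c(manual = t_manual,  t.test = unname(fit$statistic))
c(manual = df_manual, t.test = unname(fit$parameter))
```

In the GSS output, the means of 3.17 for males and 2.94 for females on the 1 to 4 scale, together with the very small p-value, indicate that males on average rate premarital sex as somewhat less wrong.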

ANOVA: Age and Perception of Premarital Sex

ANOVA (Analysis of Variance) tests if the average age is different across groups defined by another factor, in this case, people’s views on premarital sex. It is the usual choice for comparing averages across more than two groups.

  • When to Use: Use ANOVA to check for differences in averages across groups.

  • Null Hypothesis (H_0): There’s no difference in the average age of people with different views on premarital sex.

  • Alternative Hypothesis (H_A): The average age is different for people with different views on premarital sex.

# ANOVA test for age vs. premarital sex perception
aov_test <- aov(age ~ premarsx, data = gss_sub)
summary(aov_test)
##               Df Sum Sq Mean Sq F value Pr(>F)    
## premarsx       3  33298   11099   36.92 <2e-16 ***
## Residuals   2650 796574     301                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

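A low p-value is evidence that mean age differs between at least two of the perception groups. Here the p-value is below 2e-16, so we reject the null hypothesis. The F-test alone does not say which groups differ.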
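To find out which groups differ after an ANOVA, one common follow-up is Tukey's Honest Significant Difference test, available in base R as `TukeyHSD()`. This is a suggested extension rather than part of the original analysis. The sketch uses simulated data, and the `toy` data frame and its group labels are hypothetical.

```r
set.seed(1)  # reproducible toy data (simulated, not GSS data)
toy <- data.frame(
  view = factor(rep(c("A", "B", "C"), each = 30)),
  age  = c(rnorm(30, mean = 55, sd = 10),
           rnorm(30, mean = 50, sd = 10),
           rnorm(30, mean = 40, sd = 10))
)

fit <- aov(age ~ view, data = toy)
summary(fit)

# Tukey's HSD: every pairwise difference in mean age, with confidence
# intervals and p-values adjusted for multiple comparisons
tukey <- TukeyHSD(fit)
tukey$view   # columns: diff, lwr, upr, p adj
```

On the analysis data, `TukeyHSD(aov_test)` would give the corresponding pairwise comparisons between perception groups.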

Linear Regression: Age Predicting Premarital Sex Perception

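Linear regression lets us see if there’s a straight-line relationship between two variables. Here, we’re checking if age can predict how people feel about premarital sex, treating perception as a numeric score from 1 (Always Wrong) to 4 (Not Wrong At All). This treats the gaps between adjacent answers as equal, which is a simplification for an ordered scale.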

  • When to Use: Regression is good when you want to predict one variable based on another.

  • Regression Model: This model checks if there’s a straight-line relationship between age and perception of premarital sex.

  • Disturbance Term (Error): The “disturbance” or “error” term represents the variation the model can’t explain. We assume these errors average zero, have constant spread, and show no systematic pattern across the fitted values.

# Run linear regression model
lm_model <- lm(premarsx_num ~ age, data = gss_sub)
summary(lm_model)
## 
## Call:
## lm(formula = premarsx_num ~ age, data = gss_sub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4372 -0.9812  0.6299  0.9249  1.5151 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.678586   0.066619   55.22   <2e-16 ***
## age         -0.013412   0.001311  -10.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.194 on 2652 degrees of freedom
## Multiple R-squared:  0.03797,    Adjusted R-squared:  0.03761 
## F-statistic: 104.7 on 1 and 2652 DF,  p-value: < 2.2e-16
# Diagnostic plots: residuals vs. fitted, normal Q-Q, scale-location, residuals vs. leverage
plot(lm_model)
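To make the fitted line concrete, the coefficients in the output can be plugged into the regression equation. A small worked sketch. The helper `predict_score()` is a hypothetical name introduced here, and the two coefficients are copied from the summary output.

```r
# Coefficients copied from the regression output
intercept <- 3.678586
slope_age <- -0.013412

# Predicted perception score (1 = Always Wrong, 4 = Not Wrong At All)
predict_score <- function(age) intercept + slope_age * age

ages <- c(20, 50, 80)
data.frame(age = ages, predicted = predict_score(ages))

# Change in predicted score for each additional decade of age
10 * slope_age
```

The predicted score falls from about 3.41 at age 20 to about 2.61 at age 80, roughly 0.13 points per decade. The R-squared of 0.038 shows that age alone explains only about 4% of the variation in the score. With the fitted model object available, `predict(lm_model, newdata = data.frame(age = ages))` would return the same predictions.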

Multiple Regression: Controlling for Additional Factors

Multiple regression is like regular regression, but it includes more variables to control for their effects. For example, by adding race as a predictor, we can estimate how perception of premarital sex changes with age while holding race constant.

  • Why Control for Another Factor? By controlling for race, we can check whether the association between age and perception persists independently of race.

  • Key Assumptions: For this to work well, we assume no large outliers, normally distributed errors, and no strong overlap (multicollinearity) among predictors.

# Multiple regression with age and race
lm_model_mult <- lm(premarsx_num ~ age + race, data = gss_sub)
summary(lm_model_mult)
## 
## Call:
## lm(formula = premarsx_num ~ age + race, data = gss_sub)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5032 -0.9527  0.5819  0.9332  1.6844 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.772501   0.070436  53.559  < 2e-16 ***
## age         -0.014174   0.001324 -10.702  < 2e-16 ***
## raceBlack   -0.308789   0.064630  -4.778 1.87e-06 ***
## raceOther   -0.106146   0.082426  -1.288    0.198    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.189 on 2650 degrees of freedom
## Multiple R-squared:  0.04631,    Adjusted R-squared:  0.04523 
## F-statistic:  42.9 on 3 and 2650 DF,  p-value: < 2.2e-16
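One way to judge whether the added predictor improves the model is a partial F-test between the nested models, which `anova()` performs when given two `lm` fits. This is a suggested follow-up rather than part of the original analysis. The sketch runs on simulated data, and the `toy` data frame, its `group` labels, and the coefficients used to generate `score` are all hypothetical.

```r
set.seed(42)  # reproducible toy data (simulated, not GSS data)
n <- 200
toy <- data.frame(
  age   = round(runif(n, 18, 89)),
  group = factor(sample(c("G1", "G2", "G3"), n, replace = TRUE))
)
# Hypothetical outcome: declines with age and is lower in group G2
toy$score <- 3.7 - 0.014 * toy$age - 0.3 * (toy$group == "G2") +
  rnorm(n, sd = 1)

m_age  <- lm(score ~ age, data = toy)           # age only
m_both <- lm(score ~ age + group, data = toy)   # age plus group

# Partial F-test: does adding group improve on the age-only model?
cmp <- anova(m_age, m_both)
cmp

# Age slope before and after adjusting for group
c(age_only = unname(coef(m_age)["age"]),
  adjusted = unname(coef(m_both)["age"]))
```

In the GSS output, the age coefficient moves only from -0.0134 to -0.0142 once race is added, so the age association is essentially unchanged after accounting for race. The equivalent comparison on the analysis data would be `anova(lm_model, lm_model_mult)`.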

Week 12 EDA Exercise 1

Getting Started

This analysis explores the General Social Survey (GSS) dataset to perform exploratory data analysis and inferential tests.

Load packages

library(tidyverse)

Load data

load("gss.Rdata")
dim(gss)
## [1] 57061   114

Part 1: Data

Background

The GSS dataset collects information on American societal trends, covering demographics, behaviors, attitudes, and issues like civil liberties, crime, and social mobility. The dataset contains 57,061 observations with 114 variables, representing diverse sociological factors.

Methodology

GSS surveys are conducted face-to-face with randomly selected U.S. adults. The sampling uses a mix of cluster and stratified random sampling, primarily targeting adults (18+).

Scope of Inference

Because respondents are randomly sampled, results generalize to the U.S. adult population. Because the GSS is an observational study with no random assignment, it does not support causal inference. Potential bias includes non-response, as participation is voluntary and time-intensive.

Part 2: Research Questions

  1. Is there a relationship between race and perception of premarital sex?

  2. Is there a relationship between age and perception of premarital sex?

  3. Is there a relationship between sex and perception of premarital sex?

These questions aim to understand how race, age, and sex are associated with attitudes toward premarital sex, an issue tied to public health concerns.

Data Preparation

gss_sub <- gss %>% 
  filter(year >= 2010, !is.na(age)&!is.na(race)&!is.na(premarsx)&!is.na(sex)) %>% 
  select(age, race, premarsx,sex)

dim(gss_sub)
## [1] 2654    4

Part 3: Exploratory Data Analysis

summary(gss_sub)
##       age           race                  premarsx        sex      
##  Min.   :18.00   White:2003   Always Wrong    : 576   Male  :1174  
##  1st Qu.:33.00   Black: 413   Almst Always Wrg: 187   Female:1480  
##  Median :46.00   Other: 238   Sometimes Wrong : 447                
##  Mean   :47.65                Not Wrong At All:1444                
##  3rd Qu.:61.00                Other           :   0                
##  Max.   :89.00

Race vs. Perception toward Premarital Sex Plot

ggplot(gss_sub, aes(x = race, fill = premarsx)) + 
  geom_bar(position = "fill") + 
  labs(x = "Race", y = "Proportion", title = "Race vs. Premarital Sex Perception")

Age vs. Perception toward Premarital Sex Plot

Hint: How would you plot the relationship between a categorical variable and a numeric variable?

Sex vs. Perception toward Premarital Sex Plot

Hint: These are two categorical variables, just like race vs. perception.

Part 4: Inference

In this section, we’ll use different statistical tests to see if race, age, and other factors relate to people’s perceptions of premarital sex. We’ll use Chi-squared tests to check if categories like race and perception are related, t-tests to compare group means when we have only two groups, ANOVA to see if age varies by perception groups, and linear regression to explore relationships and control for other factors.

Chi-Squared Test: Race and Perception of Premarital Sex

The Chi-squared test of independence checks whether two categorical variables, like race and perception of premarital sex, are associated. It compares the observed count in each cell of the contingency table with the count expected if the variables were independent, so it needs no averages or other numeric summaries.

  • When to Use: Chi-squared is useful for comparing two categorical variables, like race and perception, to see whether they are associated. It does not show that one impacts the other.

  • Null Hypothesis (H_0): Race and perceptions of premarital sex are independent.

  • Alternative Hypothesis (H_A): Race and perceptions of premarital sex are associated.

# Chi-squared test between race and perception on premarital sex

If the p-value is low (typically below 0.05), we can say there’s likely a connection between race and perception of premarital sex.

t-Test: Comparing Two Groups

The t-test is used to compare the means of a numeric variable between two groups. For example, to see whether average approval of premarital sex differs between males and females, we’d use a t-test on a numeric version of the perception variable.

  • When to Use: Use the t-test to compare the mean of a numeric variable between two groups.

  • Null Hypothesis (H_0): There is no difference in the means between the two groups.

  • Alternative Hypothesis (H_A): There is a difference in the means between the two groups.

# Convert perception to a numeric score (1 = Always Wrong, 4 = Not Wrong At All)
gss_sub <- gss_sub %>%
  mutate(premarsx_num = as.numeric(as.factor(premarsx)))

table(gss_sub$premarsx,gss_sub$premarsx_num)
##                   
##                       1    2    3    4
##   Always Wrong      576    0    0    0
##   Almst Always Wrg    0  187    0    0
##   Sometimes Wrong     0    0  447    0
##   Not Wrong At All    0    0    0 1444
##   Other               0    0    0    0
# t-test for mean perception score between males and females

ANOVA: Age and Perception of Premarital Sex

ANOVA (Analysis of Variance) tests if the average age is different across groups defined by another factor, in this case, people’s views on premarital sex. It is the usual choice for comparing averages across more than two groups.

  • When to Use: Use ANOVA to check for differences in averages across groups.

  • Null Hypothesis (H_0): There’s no difference in the average age of people with different views on premarital sex.

  • Alternative Hypothesis (H_A): The average age is different for people with different views on premarital sex.

# ANOVA test for age vs. premarital sex perception

A low p-value means there’s probably a difference in age between groups with different views on premarital sex.

Linear Regression: Age Predicting Premarital Sex Perception

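Linear regression lets us see if there’s a straight-line relationship between two variables. Here, we’re checking if age can predict how people feel about premarital sex, treating perception as a numeric score from 1 (Always Wrong) to 4 (Not Wrong At All). This treats the gaps between adjacent answers as equal, which is a simplification for an ordered scale.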

  • When to Use: Regression is good when you want to predict one variable based on another.

  • Regression Model: This model checks if there’s a straight-line relationship between age and perception of premarital sex.

  • Disturbance Term (Error): The “disturbance” or “error” term represents the variation the model can’t explain. We assume these errors average zero, have constant spread, and show no systematic pattern across the fitted values.

# Run linear regression model

Multiple Regression: Controlling for Additional Factors

Multiple regression is like regular regression, but it includes more variables to control for their effects. For example, by adding race as a predictor, we can estimate how perception of premarital sex changes with age while holding race constant.

  • Why Control for Another Factor? By controlling for race, we can check whether the association between age and perception persists independently of race.

  • Key Assumptions: For this to work well, we assume no large outliers, normally distributed errors, and no strong overlap (multicollinearity) among predictors.

# Multiple regression with age and race