
Review 1

Dataset

We will use the NHANES dataset, which contains health and demographic information collected from a representative sample of the U.S. population. The dataset is available in R through the NHANES package.

Accessing the Dataset:

# install.packages("NHANES")  # Install if not already installed
library(NHANES)
data <- data.frame(NHANES)

TASK 1: Data Exploration

  • Load the NHANES dataset.

  • Use tidyverse functions to:

    • View the structure of the dataset.

    • Summarize key variables: Age, Gender, BMI, SmokeNow (whether the person currently smokes), and PhysActive (physical activity status).

    • For categorical variables, just check the levels.

library(tidyverse)

glimpse(data)
## Rows: 10,000
## Columns: 76
## $ ID               <int> 51624, 51624, 51624, 51625, 51630, 51638, 51646, 51647, 51647, 51647, 516…
## $ SurveyYr         <fct> 2009_10, 2009_10, 2009_10, 2009_10, 2009_10, 2009_10, 2009_10, 2009_10, 2…
## $ Gender           <fct> male, male, male, male, female, male, male, female, female, female, male,…
## $ Age              <int> 34, 34, 34, 4, 49, 9, 8, 45, 45, 45, 66, 58, 54, 10, 58, 50, 9, 33, 60, 1…
## $ AgeDecade        <fct>  30-39,  30-39,  30-39,  0-9,  40-49,  0-9,  0-9,  40-49,  40-49,  40-49,…
## $ AgeMonths        <int> 409, 409, 409, 49, 596, 115, 101, 541, 541, 541, 795, 707, 654, 123, 700,…
## $ Race1            <fct> White, White, White, Other, White, White, White, White, White, White, Whi…
## $ Race3            <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ Education        <fct> High School, High School, High School, NA, Some College, NA, NA, College …
## $ MaritalStatus    <fct> Married, Married, Married, NA, LivePartner, NA, NA, Married, Married, Mar…
## $ HHIncome         <fct> 25000-34999, 25000-34999, 25000-34999, 20000-24999, 35000-44999, 75000-99…
## $ HHIncomeMid      <int> 30000, 30000, 30000, 22500, 40000, 87500, 60000, 87500, 87500, 87500, 300…
## $ Poverty          <dbl> 1.36, 1.36, 1.36, 1.07, 1.91, 1.84, 2.33, 5.00, 5.00, 5.00, 2.20, 5.00, 2…
## $ HomeRooms        <int> 6, 6, 6, 9, 5, 6, 7, 6, 6, 6, 5, 10, 6, 10, 10, 4, 3, 11, 5, 7, 10, 10, 9…
## $ HomeOwn          <fct> Own, Own, Own, Own, Rent, Rent, Own, Own, Own, Own, Own, Rent, Rent, Own,…
## $ Work             <fct> NotWorking, NotWorking, NotWorking, NA, NotWorking, NA, NA, Working, Work…
## $ Weight           <dbl> 87.4, 87.4, 87.4, 17.0, 86.7, 29.8, 35.2, 75.7, 75.7, 75.7, 68.0, 78.4, 7…
## $ Length           <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ HeadCirc         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ Height           <dbl> 164.7, 164.7, 164.7, 105.4, 168.4, 133.1, 130.6, 166.7, 166.7, 166.7, 169…
## $ BMI              <dbl> 32.22, 32.22, 32.22, 15.30, 30.57, 16.82, 20.64, 27.24, 27.24, 27.24, 23.…
## $ BMICatUnder20yrs <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ BMI_WHO          <fct> 30.0_plus, 30.0_plus, 30.0_plus, 12.0_18.5, 30.0_plus, 12.0_18.5, 18.5_to…
## $ Pulse            <int> 70, 70, 70, NA, 86, 82, 72, 62, 62, 62, 60, 62, 76, 80, 94, 74, 92, 96, 8…
## $ BPSysAve         <int> 113, 113, 113, NA, 112, 86, 107, 118, 118, 118, 111, 104, 134, 104, 127, …
## $ BPDiaAve         <int> 85, 85, 85, NA, 75, 47, 37, 64, 64, 64, 63, 74, 85, 68, 83, 68, 63, 74, 1…
## $ BPSys1           <int> 114, 114, 114, NA, 118, 84, 114, 106, 106, 106, 124, 108, 136, 102, NA, 1…
## $ BPDia1           <int> 88, 88, 88, NA, 82, 50, 46, 62, 62, 62, 64, 76, 86, 66, NA, 66, 56, 80, 9…
## $ BPSys2           <int> 114, 114, 114, NA, 108, 84, 108, 118, 118, 118, 108, 104, 132, 102, 134, …
## $ BPDia2           <int> 88, 88, 88, NA, 74, 50, 36, 68, 68, 68, 62, 72, 88, 66, 82, 74, 64, 74, 9…
## $ BPSys3           <int> 112, 112, 112, NA, 116, 88, 106, 118, 118, 118, 114, 104, 136, 106, 120, …
## $ BPDia3           <int> 82, 82, 82, NA, 76, 44, 38, 60, 60, 60, 64, 76, 82, 70, 84, 62, 62, NA, 1…
## $ Testosterone     <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ DirectChol       <dbl> 1.29, 1.29, 1.29, NA, 1.16, 1.34, 1.55, 2.12, 2.12, 2.12, 0.67, 0.96, 1.1…
## $ TotChol          <dbl> 3.49, 3.49, 3.49, NA, 6.70, 4.86, 4.09, 5.82, 5.82, 5.82, 4.99, 4.24, 6.4…
## $ UrineVol1        <int> 352, 352, 352, NA, 77, 123, 238, 106, 106, 106, 113, 163, 215, 7, 29, 64,…
## $ UrineFlow1       <dbl> NA, NA, NA, NA, 0.094, 1.538, 1.322, 1.116, 1.116, 1.116, 0.489, NA, 0.90…
## $ UrineVol2        <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ UrineFlow2       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ Diabetes         <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, N…
## $ DiabetesAge      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ HealthGen        <fct> Good, Good, Good, NA, Good, NA, NA, Vgood, Vgood, Vgood, Vgood, Vgood, Fa…
## $ DaysPhysHlthBad  <int> 0, 0, 0, NA, 0, NA, NA, 0, 0, 0, 10, 0, 4, NA, NA, 0, NA, 3, 7, 0, 3, 3, …
## $ DaysMentHlthBad  <int> 15, 15, 15, NA, 10, NA, NA, 3, 3, 3, 0, 0, 0, NA, NA, 0, NA, 7, 0, 20, 0,…
## $ LittleInterest   <fct> Most, Most, Most, NA, Several, NA, NA, None, None, None, None, None, None…
## $ Depressed        <fct> Several, Several, Several, NA, Several, NA, NA, None, None, None, None, N…
## $ nPregnancies     <int> NA, NA, NA, NA, 2, NA, NA, 1, 1, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ nBabies          <int> NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ Age1stBaby       <int> NA, NA, NA, NA, 27, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ SleepHrsNight    <int> 4, 4, 4, NA, 8, NA, NA, 8, 8, 8, 7, 5, 4, NA, 5, 7, NA, 6, 6, 6, 7, 7, 8,…
## $ SleepTrouble     <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, No, No, Yes, NA, No, No, NA, …
## $ PhysActive       <fct> No, No, No, NA, No, NA, NA, Yes, Yes, Yes, Yes, Yes, Yes, NA, Yes, Yes, N…
## $ PhysActiveDays   <int> NA, NA, NA, NA, NA, NA, NA, 5, 5, 5, 7, 5, 1, NA, 2, 7, NA, NA, NA, 3, 7,…
## $ TVHrsDay         <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ CompHrsDay       <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ TVHrsDayChild    <int> NA, NA, NA, 4, NA, 5, 1, NA, NA, NA, NA, NA, NA, 4, NA, NA, 0, NA, NA, NA…
## $ CompHrsDayChild  <int> NA, NA, NA, 1, NA, 0, 6, NA, NA, NA, NA, NA, NA, 3, NA, NA, 1, NA, NA, NA…
## $ Alcohol12PlusYr  <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, Yes, Yes, Yes, NA, NA, No,…
## $ AlcoholDay       <int> NA, NA, NA, NA, 2, NA, NA, 3, 3, 3, 1, 2, 6, NA, NA, NA, NA, 3, 6, NA, 1,…
## $ AlcoholYear      <int> 0, 0, 0, NA, 20, NA, NA, 52, 52, 52, 100, 104, 364, NA, NA, 0, NA, 104, 3…
## $ SmokeNow         <fct> No, No, No, NA, Yes, NA, NA, NA, NA, NA, No, NA, NA, NA, Yes, NA, NA, No,…
## $ Smoke100         <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, Yes, No, No, NA, Yes, No, NA,…
## $ Smoke100n        <fct> Smoker, Smoker, Smoker, NA, Smoker, NA, NA, Non-Smoker, Non-Smoker, Non-S…
## $ SmokeAge         <int> 18, 18, 18, NA, 38, NA, NA, NA, NA, NA, 13, NA, NA, NA, 17, NA, NA, NA, 1…
## $ Marijuana        <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, NA, Yes, Yes, NA, NA, No, …
## $ AgeFirstMarij    <int> 17, 17, 17, NA, 18, NA, NA, 13, 13, 13, NA, 19, 15, NA, NA, NA, NA, NA, N…
## $ RegularMarij     <fct> No, No, No, NA, No, NA, NA, No, No, No, NA, Yes, Yes, NA, NA, No, NA, No,…
## $ AgeRegMarij      <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20, 15, NA, NA, NA, NA, NA, N…
## $ HardDrugs        <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, No, Yes, Yes, NA, NA, No, NA,…
## $ SexEver          <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, Yes, Yes, Yes, NA, NA, Yes…
## $ SexAge           <int> 16, 16, 16, NA, 12, NA, NA, 13, 13, 13, 17, 22, 12, NA, NA, NA, NA, 27, 2…
## $ SexNumPartnLife  <int> 8, 8, 8, NA, 10, NA, NA, 20, 20, 20, 15, 7, 100, NA, NA, 9, NA, 1, 1, NA,…
## $ SexNumPartYear   <int> 1, 1, 1, NA, 1, NA, NA, 0, 0, 0, NA, 1, 1, NA, NA, 1, NA, 1, NA, NA, 1, 1…
## $ SameSex          <fct> No, No, No, NA, Yes, NA, NA, Yes, Yes, Yes, No, No, No, NA, NA, No, NA, N…
## $ SexOrientation   <fct> Heterosexual, Heterosexual, Heterosexual, NA, Heterosexual, NA, NA, Bisex…
## $ PregnantNow      <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
# Summarize Age and BMI
summary(data %>% select(Gender,Age, BMI,SmokeNow,PhysActive))
##     Gender          Age             BMI        SmokeNow    PhysActive 
##  female:5020   Min.   : 0.00   Min.   :12.88   No  :1745   No  :3677  
##  male  :4980   1st Qu.:17.00   1st Qu.:21.58   Yes :1466   Yes :4649  
##                Median :36.00   Median :25.98   NA's:6789   NA's:1674  
##                Mean   :36.74   Mean   :26.66                          
##                3rd Qu.:54.00   3rd Qu.:30.89                          
##                Max.   :80.00   Max.   :81.25                          
##                                NA's   :366
# Frequency table for Gender
table(data$Gender)
## 
## female   male 
##   5020   4980
# Frequency table for SmokeNow
table(data$SmokeNow)
## 
##   No  Yes 
## 1745 1466
# two ways to check levels

print(head(data$Gender))
## [1] male   male   male   male   female male  
## Levels: female male
levels(data$Gender)
## [1] "female" "male"
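
Since the task asks for tidyverse functions, dplyr's count() is a pipe-friendly alternative to table() for the frequency tables above; note that, unlike table(), it also reports an NA row:

```r
# Tidyverse frequency tables (count() includes an NA row; table() drops NAs)
data %>% count(Gender)
data %>% count(SmokeNow)
```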

TASK 2: Data Cleaning 

  • Check for missing values in key variables.

  • Remove any rows with missing values.

  • Recode SmokeNow and PhysActive to have more interpretable categories if necessary.
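
The first bullet (checking for missing values) can be sketched with a per-variable NA count before any rows are dropped; a minimal tidyverse version:

```r
# Count missing values in each key variable before removing them
data %>%
  summarise(across(c(Age, BMI, Gender, SmokeNow, PhysActive), ~ sum(is.na(.))))
```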

data_clean <- data %>%
  drop_na(Age, BMI, Gender, SmokeNow, PhysActive)




# Option 1: Recode variables as 0/1 so that summary() reports means and proportions

data_clean1 <- data_clean %>%
  mutate(
    SmokeNow = recode(SmokeNow, "Yes" = 1, "No" = 0),
    PhysActive = recode(PhysActive, "Yes" = 1, "No" = 0),
    Gender = recode(Gender, "male" = 1, "female" = 0)
  )

summary(data_clean1 %>% select(Gender,Age, BMI,SmokeNow,PhysActive))
##      Gender            Age             BMI           SmokeNow        PhysActive    
##  Min.   :0.0000   Min.   :20.00   Min.   :15.02   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:35.00   1st Qu.:23.90   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :1.0000   Median :49.00   Median :27.54   Median :0.0000   Median :0.0000  
##  Mean   :0.5625   Mean   :48.91   Mean   :28.45   Mean   :0.4579   Mean   :0.4771  
##  3rd Qu.:1.0000   3rd Qu.:62.00   3rd Qu.:31.90   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :1.0000   Max.   :80.00   Max.   :67.83   Max.   :1.0000   Max.   :1.0000
# Option 2: Recode variables with descriptive labels
# so that the regression table later is easier to read

data_clean2 <- data_clean %>%
  mutate(
    SmokeNow = recode(SmokeNow, "Yes" = "Smoker", "No" = "Non-Smoker"),
    PhysActive = recode(PhysActive, "Yes" = "Active", "No" = "Inactive"))

summary(data_clean2 %>% select(Gender,Age, BMI,SmokeNow,PhysActive))
##     Gender          Age             BMI              SmokeNow       PhysActive  
##  female:1393   Min.   :20.00   Min.   :15.02   Non-Smoker:1726   Inactive:1665  
##  male  :1791   1st Qu.:35.00   1st Qu.:23.90   Smoker    :1458   Active  :1519  
##                Median :49.00   Median :27.54                                    
##                Mean   :48.91   Mean   :28.45                                    
##                3rd Qu.:62.00   3rd Qu.:31.90                                    
##                Max.   :80.00   Max.   :67.83

TASK 3: Data Visualization

  • Create a histogram of BMI.

  • Create a boxplot of BMI by Gender.

  • Add appropriate labels and titles.

## Base R graphics

hist(data_clean2$BMI, 
     main = "Histogram of BMI", 
     xlab = "BMI", 
     col = "lightblue",  # Optional: Adds color to the bars
     border = "black")   # Optional: Adds a border to the bars

boxplot(BMI ~ Gender, data = data_clean2,  # use data_clean2, where Gender is still a factor, so the x-axis stays categorical
        main = "Boxplot of BMI by Gender",
        xlab = "Gender", 
        ylab = "BMI",
        col = "lightblue")  # Optional: Adds color to the boxplot

# ggplot2 package


ggplot(data_clean1, aes(x = BMI)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black") +  # unlike hist(), we set the binwidth explicitly here
  labs(title = "Distribution of BMI", x = "BMI", y = "Frequency")

ggplot(data_clean2, aes(x = Gender, y = BMI)) +
  geom_boxplot(fill = "pink") +
  labs(title = "BMI by Gender", x = "Gender", y = "BMI")

TASK 4: Descriptive Statistics

  • Use stargazer to present sample statistics for our key variables, mutating the data if needed (for categorical variables).

  • Present the results in a neat table.

# Restrict the sample statistics to our key variables
key_variables <- data_clean1 %>%
  select(Age, BMI, Gender, SmokeNow, PhysActive)



library(stargazer)

# Key variables: Age, Gender, BMI, SmokeNow (whether the person currently smokes), and PhysActive (physical activity status)
stargazer(
  key_variables,
  type = "text", # comment out this line and uncomment the next two lines if you need a Word (.doc) file
#  type = "html",
#  out = "sample statistics.doc",
  title = "Descriptive Statistics",
  digits = 2, # keep two digits after the decimal point
  summary.stat = c("mean", "sd", "min", "max","n")
)
## 
## Descriptive Statistics
## ===========================================
## Statistic  Mean  St. Dev.  Min   Max    N  
## -------------------------------------------
## Age        48.91  16.79    20    80   3,184
## BMI        28.45   6.33   15.02 67.83 3,184
## Gender     0.56    0.50     0     1   3,184
## SmokeNow   0.46    0.50     0     1   3,184
## PhysActive 0.48    0.50     0     1   3,184
## -------------------------------------------

TASK 5: Regression Analysis

  • Run a linear regression with BMI as the dependent variable and Age, Gender, and PhysActive as independent variables.

  • Interpret the coefficients (to yourself or a group member).

model_basic <- lm(BMI ~ Age + Gender + PhysActive, data = data_clean2)
summary(model_basic)
## 
## Call:
## lm(formula = BMI ~ Age + Gender + PhysActive, data = data_clean2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.092  -4.434  -0.957   3.488  38.686 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      28.536543   0.399157  71.492  < 2e-16 ***
## Age               0.015578   0.006699   2.325   0.0201 *  
## Gendermale        0.009406   0.223630   0.042   0.9665    
## PhysActiveActive -1.785032   0.225212  -7.926  3.1e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.258 on 3180 degrees of freedom
## Multiple R-squared:  0.02348,    Adjusted R-squared:  0.02256 
## F-statistic: 25.49 on 3 and 3180 DF,  p-value: 2.716e-16
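
One way to make the coefficients concrete is to compare predicted BMI for two otherwise-identical profiles; the age and gender below are illustrative values, not part of the assignment:

```r
# Predicted BMI for a hypothetical 40-year-old male, inactive vs. active
profiles <- data.frame(
  Age = c(40, 40),
  Gender = c("male", "male"),
  PhysActive = c("Inactive", "Active")
)
predict(model_basic, newdata = profiles)
# The gap between the two predictions equals the PhysActiveActive coefficient
```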

TASK 6: Presenting Results

  • Use stargazer to create a regression table (both in the console and a Word doc).

  • Use coefplot to visualize the coefficients.

stargazer(model_basic, type = "text", title = "Regression Results")
## 
## Regression Results
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                                 BMI            
## -----------------------------------------------
## Age                           0.016**          
##                               (0.007)          
##                                                
## Gendermale                     0.009           
##                               (0.224)          
##                                                
## PhysActiveActive             -1.785***         
##                               (0.225)          
##                                                
## Constant                     28.537***         
##                               (0.399)          
##                                                
## -----------------------------------------------
## Observations                   3,184           
## R2                             0.023           
## Adjusted R2                    0.023           
## Residual Std. Error      6.258 (df = 3180)     
## F Statistic          25.492*** (df = 3; 3180)  
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
library(coefplot)
coefplot(model_basic, title = "Coefficient Plot", intercept = FALSE)

Review 2

Dataset

We will use the NHANES dataset, which contains health and demographic information collected from a representative sample of the U.S. population. The dataset is available in R through the NHANES package.

Accessing the Dataset:

# install.packages("NHANES")  # Install if not already installed
library(NHANES)
data <- data.frame(NHANES)

TASK 1: Data Exploration and Cleaning

  • Load the NHANES dataset.

  • Check for missing values and handle them appropriately.

  • Summarize key variables: Age, Gender, BMI, SmokeNow (whether the person currently smokes), PhysActive (physical activity status), and SleepHrsNight; recode them if necessary.

  • Identify and remove outliers in BMI (e.g., using the IQR method).

  • Transform BMI using a log transformation if you think it is appropriate.

  • Recode SmokeNow and PhysActive to have more interpretable categories if necessary.

library(tidyverse)
data_clean <- data %>%
  drop_na(Age, BMI, Gender, SmokeNow, PhysActive)
# Remove outliers
Q1 <- quantile(data_clean$BMI, 0.25, na.rm = TRUE)
Q3 <- quantile(data_clean$BMI, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1  # note: this masks base R's IQR() function for the rest of the session
data_no_outliers <- data_clean %>%
  filter(BMI >= (Q1 - 1.5 * IQR) & BMI <= (Q3 + 1.5 * IQR))

# Log transformation (if distribution is skewed)
data_no_outliers <- data_no_outliers %>%
  mutate(log_BMI = log(BMI))
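
Whether the log transform is worthwhile can be judged by comparing skewness before and after; a sketch using an inline helper (written here by hand, not taken from any package):

```r
# Sample skewness: third standardized moment
skewness <- function(x) {
  x <- x[!is.na(x)]
  mean((x - mean(x))^3) / sd(x)^3
}

skewness(data_no_outliers$BMI)      # positive if BMI is right-skewed
skewness(data_no_outliers$log_BMI)  # should be closer to 0 after the transform
```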

Dealing with Missing Data

When working with datasets like NHANES, missing data can be common. We need to handle missing data because missing values can distort analysis results, lead to biased conclusions, and reduce the dataset size. There are different ways to deal with missing data:

  • Remove Missing Data: The simplest method is to remove rows with missing values for important variables. This method works when the amount of missing data is small and doesn’t significantly affect the dataset’s size. However, if there are many missing values, removing them may lead to losing too much data, which can reduce the statistical power of the analysis.

  • Imputation (Advanced): This is when missing values are filled in with estimates, such as the mean, median, or more complex statistical methods. We’re not using this here but it’s useful when you have larger portions of missing data.

In our project, we chose to drop rows with missing data for important variables (Age, BMI, Gender, SmokeNow, PhysActive) because the dataset is large enough, and this ensures that our results won’t be biased by incomplete records.
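
For reference, a simple median imputation (which we are not doing here) could be sketched with tidyr's replace_na():

```r
# Illustration only: fill missing BMI with the median instead of dropping rows
data_imputed <- data %>%
  mutate(BMI = replace_na(BMI, median(BMI, na.rm = TRUE)))
```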

Dealing with Outliers

Outliers are data points that differ significantly from other observations. They can distort statistical results and lead to incorrect conclusions if not properly addressed. Here’s how we deal with outliers in this project:

  1. Why Outliers Matter: Outliers can affect the mean and standard deviation, inflate regression coefficients, and lead to misleading results. Therefore, identifying and managing outliers is important to ensure that the results of the analysis are robust.

  2. Identifying Outliers: One common way to detect outliers is by using the Interquartile Range (IQR) method:

    • The IQR is the range between the 1st quartile (Q1) and the 3rd quartile (Q3) of the data.

    • Outliers are defined as any values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR. In our case, we used this method to identify outliers in BMI.

  3. Handling Outliers: Once we’ve identified the outliers, there are a few ways to handle them:

    • Remove the Outliers: If the outliers are errors or extreme values that don’t represent the population we’re studying, we can remove them. This helps make the data more representative and reduces the chance of the outliers influencing the analysis disproportionately.

    • Transform the Data: In some cases, instead of removing outliers, we can transform the data to minimize their effect. For example, a log transformation can reduce the skewness caused by outliers, as we did for BMI if needed.

In this project, we chose to remove outliers from the BMI variable to ensure that our regression models and visualizations are not overly influenced by extreme BMI values that may not represent the general population. By doing this, we can obtain more reliable and meaningful results.
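
It is worth reporting how much data the 1.5 * IQR rule removes; a sketch using Q1, Q3, and IQR as computed in Task 1:

```r
# Count and proportion of BMI values outside the 1.5 * IQR fences
outlier_flag <- data_clean$BMI < (Q1 - 1.5 * IQR) |
                data_clean$BMI > (Q3 + 1.5 * IQR)
sum(outlier_flag)   # number of rows removed
mean(outlier_flag)  # share of the cleaned sample
```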

TASK 2: Data Visualization

  • Create a scatter plot of BMI vs. Age.

  • Color the points by SmokeNow.

  • Use smooth lines to show trends (e.g., using geom_smooth).

  • Customize the plot with themes and labels.

ggplot(data_no_outliers, aes(x = Age, y = BMI, color = SmokeNow)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "loess", se = FALSE) +
  theme_minimal() +
  labs(title = "BMI vs Age by Smoking Status",
       x = "Age",
       y = "BMI")

TASK 3: Descriptive Analysis

  • Calculate correlation coefficients between Age, BMI, and SleepHrsNight (average hours of sleep per night, if available).

  • Present the correlation matrix.

# Ensure 'SleepHrsNight' is available and has no missing values
data_no_outliers <- data_no_outliers %>%
  drop_na(SleepHrsNight)

correlation_matrix <- data_no_outliers %>%
  select(Age, BMI, SleepHrsNight) %>%
  cor()
print(correlation_matrix)
##                      Age         BMI SleepHrsNight
## Age           1.00000000  0.09484044    0.10211937
## BMI           0.09484044  1.00000000   -0.04470914
## SleepHrsNight 0.10211937 -0.04470914    1.00000000
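
A correlation's statistical significance can be checked pairwise with base R's cor.test(); for example:

```r
# Test whether the Age-BMI correlation differs from zero
cor.test(data_no_outliers$Age, data_no_outliers$BMI)
```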

TASK 4: Multiple Regression Analysis

Run multiple regression models:

  • Model 1: BMI ~ Age

  • Model 2: Add Gender as an independent variable.

  • Model 3: Include PhysActive , SmokeNow and SleepHrsNight as independent variables.

model1 <- lm(BMI ~ Age, data = data_no_outliers)
model2 <- lm(BMI ~ Age + Gender, data = data_no_outliers)
model3 <- lm(BMI ~ Age + Gender + PhysActive + SmokeNow +SleepHrsNight, data = data_no_outliers)
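
Since the three models are nested, base R's anova() provides incremental F tests showing whether each added block of predictors improves fit:

```r
# Sequential F tests comparing the nested models
anova(model1, model2, model3)
```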

TASK 5: Model Diagnostics

  • Check the assumptions of linear regression for model3:

    • Linearity

    • Independence

    • Homoscedasticity

    • Normality of residuals

# Residual plots
par(mfrow = c(2, 2))
plot(model3)

Four Assumptions of Linear Regression

Linear regression is a powerful tool, but for it to work well, four key assumptions must hold:

  1. Linearity: This assumes that the relationship between the independent variable(s) and the dependent variable (BMI, in this case) is linear. If this assumption is violated, the model won’t accurately capture the relationship. We check this by looking at scatter plots or residual plots, where the data points should align in a linear pattern.

  2. Independence: This means that the observations (rows of data) are independent of each other. In other words, one person’s data shouldn’t influence another’s. If independence is violated (for example, if the data includes repeated measures for the same individual), the results may be skewed.

  3. Homoscedasticity: This means that the variability of the residuals (differences between predicted and actual values) is consistent across all levels of the independent variables. If this assumption is violated (heteroscedasticity), the model may give too much importance to certain observations. We check this by looking at a plot of residuals versus predicted values.

  4. Normality of Residuals: This means that the residuals (errors) of the model should be normally distributed. This assumption can be checked by looking at a histogram or Q-Q plot of the residuals. If the residuals aren’t normal, the p-values from the regression model might be unreliable.

By checking and ensuring these assumptions hold, we can be more confident that the results from our regression models are reliable and valid for interpretation.
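
The plot-based checks can be complemented with formal tests; a sketch (shapiro.test() is base R, while bptest() assumes the lmtest package, which is not otherwise used in this document):

```r
# Formal complements to the diagnostic plots (a sketch)
library(lmtest)           # assumed installed; provides the Breusch-Pagan test

bptest(model3)            # heteroscedasticity: a small p-value flags non-constant variance

res <- residuals(model3)  # Shapiro-Wilk accepts at most 5,000 observations
shapiro.test(sample(res, min(length(res), 5000)))
```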

Checking Four Assumptions of Linear Regression

1. Linearity (Residuals vs Fitted Plot)

  • The first plot (Residuals vs Fitted) helps us assess the linearity assumption. Ideally, the red line in the plot should be horizontal and the residuals should be randomly scattered around the line. If there is a clear pattern (e.g., curved or funnel shape), the linearity assumption is likely violated.

  • What to look for:

    • A random scatter of residuals indicates that the linearity assumption holds.

    • If a clear trend or curve is visible, it suggests a non-linear relationship, and we may need to transform variables or try a non-linear model.

2. Independence (Residuals vs Fitted Plot)

  • The same Residuals vs Fitted plot also helps check for independence. If residuals are randomly scattered without showing patterns (like clustering or waves), the independence assumption is likely met.

  • What to look for:

    • No distinct patterns or grouping in the residuals suggests independence.

    • If there is clustering, autocorrelation, or waves in the residuals, the independence assumption might be violated. This could happen, for instance, if data points are not independent (e.g., repeated measures for the same individuals).

3. Homoscedasticity (Scale-Location Plot)

  • The second plot, Scale-Location (also called Spread-Location), helps assess homoscedasticity (constant variance). The red line should be roughly horizontal, and the points should be evenly spread out along the line.

  • What to look for:

    • If the residuals are evenly spread out and form a horizontal line, this indicates homoscedasticity (constant variance).

    • If the points spread out in a funnel shape (narrow at one end, wider at the other), this suggests heteroscedasticity (non-constant variance). In this case, we might need to transform variables or use robust standard errors.

4. Normality of Residuals (Normal Q-Q Plot)

  • The third plot is the Normal Q-Q plot, which helps assess if the residuals are normally distributed. The points should lie on or close to the diagonal line.

  • What to look for:

    • If the points fall on or very close to the straight diagonal line, it indicates that the residuals are normally distributed.

    • If the points deviate substantially from the line, especially at the tails, it suggests that the residuals are not normally distributed. If the residuals are not normally distributed, transformations or non-parametric methods may be necessary.

5. Residuals vs Leverage (for detecting influential points)

  • The fourth plot, Residuals vs Leverage, helps identify influential data points that have a strong effect on the model. The Cook’s distance lines help flag these points.

  • What to look for:

    • Points within Cook’s distance lines suggest no influential data points.

    • If points fall outside the Cook’s distance lines, they may be influential and disproportionately affect the model. In this case, we may consider removing or investigating these points further.
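
Cook's distances can also be inspected numerically; a sketch using the common 4/n rule of thumb (the threshold is a convention, not a fixed rule):

```r
# Flag observations whose Cook's distance exceeds 4/n
cooksd <- cooks.distance(model3)
influential <- which(cooksd > 4 / length(cooksd))
length(influential)  # number of points worth investigating further
```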

TASK 6: Presenting Results

  • Use stargazer to compare all three models in a single table with robust standard errors.

  • Use coefplot to visualize and compare coefficients across models.

library(stargazer)
library(sandwich)  # For robust standard errors
stargazer(model1, model2, model3, type = "text",
          title = "Regression Models Comparison",
          se = list(
            sqrt(diag(vcovHC(model1, type = "HC1"))),
            sqrt(diag(vcovHC(model2, type = "HC1"))),
            sqrt(diag(vcovHC(model3, type = "HC1")))
          ))
## 
## Regression Models Comparison
## ==============================================================================================
##                                                Dependent variable:                            
##                     --------------------------------------------------------------------------
##                                                        BMI                                    
##                               (1)                      (2)                      (3)           
## ----------------------------------------------------------------------------------------------
## Age                         0.032***                 0.032***                  0.009          
##                             (0.006)                  (0.006)                  (0.006)         
##                                                                                               
## Gendermale                                            0.405*                   0.397*         
##                                                      (0.208)                  (0.205)         
##                                                                                               
## PhysActiveYes                                                                -1.607***        
##                                                                               (0.206)         
##                                                                                               
## SmokeNowYes                                                                  -1.672***        
##                                                                               (0.218)         
##                                                                                               
## SleepHrsNight                                                                -0.220***        
##                                                                               (0.077)         
##                                                                                               
## Constant                   26.489***                26.246***                30.413***        
##                             (0.310)                  (0.342)                  (0.666)         
##                                                                                               
## ----------------------------------------------------------------------------------------------
## Observations                 3,113                    3,113                    3,113          
## R2                           0.009                    0.010                    0.044          
## Adjusted R2                  0.009                    0.010                    0.043          
## Residual Std. Error    5.608 (df = 3111)        5.606 (df = 3110)        5.511 (df = 3107)    
## F Statistic         28.237*** (df = 1; 3111) 16.121*** (df = 2; 3110) 28.721*** (df = 5; 3107)
## ==============================================================================================
## Note:                                                              *p<0.1; **p<0.05; ***p<0.01
library(coefplot)

coefplot(model3, legend = TRUE, title = "Coefficient Comparison", intercept = FALSE)