Interactive mode: click a code block or Show Plot button to reveal/hide its corresponding plot.
We will use the NHANES dataset, which contains
health and demographic information collected from a representative
sample of the U.S. population. The dataset is available in R through
the NHANES package.
Accessing the Dataset:
#if (FALSE) install.packages("NHANES") # Install if not already installed
library(NHANES)
data <- data.frame(NHANES)
Load the NHANES dataset.
Use tidyverse functions to:
View the structure of the dataset.
Summarize key
variables: Age, Gender, BMI, SmokeNow (whether
the person currently smokes), and PhysActive(physical
activity status).
Just check the levels if it is categorical variable.
library(tidyverse)
glimpse(data)
## Rows: 10,000
## Columns: 76
## $ ID <int> 51624, 51624, 51624, 51625, 51630, 51638, 51646, 51647, 51647, 51647, 516…
## $ SurveyYr <fct> 2009_10, 2009_10, 2009_10, 2009_10, 2009_10, 2009_10, 2009_10, 2009_10, 2…
## $ Gender <fct> male, male, male, male, female, male, male, female, female, female, male,…
## $ Age <int> 34, 34, 34, 4, 49, 9, 8, 45, 45, 45, 66, 58, 54, 10, 58, 50, 9, 33, 60, 1…
## $ AgeDecade <fct> 30-39, 30-39, 30-39, 0-9, 40-49, 0-9, 0-9, 40-49, 40-49, 40-49,…
## $ AgeMonths <int> 409, 409, 409, 49, 596, 115, 101, 541, 541, 541, 795, 707, 654, 123, 700,…
## $ Race1 <fct> White, White, White, Other, White, White, White, White, White, White, Whi…
## $ Race3 <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ Education <fct> High School, High School, High School, NA, Some College, NA, NA, College …
## $ MaritalStatus <fct> Married, Married, Married, NA, LivePartner, NA, NA, Married, Married, Mar…
## $ HHIncome <fct> 25000-34999, 25000-34999, 25000-34999, 20000-24999, 35000-44999, 75000-99…
## $ HHIncomeMid <int> 30000, 30000, 30000, 22500, 40000, 87500, 60000, 87500, 87500, 87500, 300…
## $ Poverty <dbl> 1.36, 1.36, 1.36, 1.07, 1.91, 1.84, 2.33, 5.00, 5.00, 5.00, 2.20, 5.00, 2…
## $ HomeRooms <int> 6, 6, 6, 9, 5, 6, 7, 6, 6, 6, 5, 10, 6, 10, 10, 4, 3, 11, 5, 7, 10, 10, 9…
## $ HomeOwn <fct> Own, Own, Own, Own, Rent, Rent, Own, Own, Own, Own, Own, Rent, Rent, Own,…
## $ Work <fct> NotWorking, NotWorking, NotWorking, NA, NotWorking, NA, NA, Working, Work…
## $ Weight <dbl> 87.4, 87.4, 87.4, 17.0, 86.7, 29.8, 35.2, 75.7, 75.7, 75.7, 68.0, 78.4, 7…
## $ Length <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ HeadCirc <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ Height <dbl> 164.7, 164.7, 164.7, 105.4, 168.4, 133.1, 130.6, 166.7, 166.7, 166.7, 169…
## $ BMI <dbl> 32.22, 32.22, 32.22, 15.30, 30.57, 16.82, 20.64, 27.24, 27.24, 27.24, 23.…
## $ BMICatUnder20yrs <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ BMI_WHO <fct> 30.0_plus, 30.0_plus, 30.0_plus, 12.0_18.5, 30.0_plus, 12.0_18.5, 18.5_to…
## $ Pulse <int> 70, 70, 70, NA, 86, 82, 72, 62, 62, 62, 60, 62, 76, 80, 94, 74, 92, 96, 8…
## $ BPSysAve <int> 113, 113, 113, NA, 112, 86, 107, 118, 118, 118, 111, 104, 134, 104, 127, …
## $ BPDiaAve <int> 85, 85, 85, NA, 75, 47, 37, 64, 64, 64, 63, 74, 85, 68, 83, 68, 63, 74, 1…
## $ BPSys1 <int> 114, 114, 114, NA, 118, 84, 114, 106, 106, 106, 124, 108, 136, 102, NA, 1…
## $ BPDia1 <int> 88, 88, 88, NA, 82, 50, 46, 62, 62, 62, 64, 76, 86, 66, NA, 66, 56, 80, 9…
## $ BPSys2 <int> 114, 114, 114, NA, 108, 84, 108, 118, 118, 118, 108, 104, 132, 102, 134, …
## $ BPDia2 <int> 88, 88, 88, NA, 74, 50, 36, 68, 68, 68, 62, 72, 88, 66, 82, 74, 64, 74, 9…
## $ BPSys3 <int> 112, 112, 112, NA, 116, 88, 106, 118, 118, 118, 114, 104, 136, 106, 120, …
## $ BPDia3 <int> 82, 82, 82, NA, 76, 44, 38, 60, 60, 60, 64, 76, 82, 70, 84, 62, 62, NA, 1…
## $ Testosterone <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ DirectChol <dbl> 1.29, 1.29, 1.29, NA, 1.16, 1.34, 1.55, 2.12, 2.12, 2.12, 0.67, 0.96, 1.1…
## $ TotChol <dbl> 3.49, 3.49, 3.49, NA, 6.70, 4.86, 4.09, 5.82, 5.82, 5.82, 4.99, 4.24, 6.4…
## $ UrineVol1 <int> 352, 352, 352, NA, 77, 123, 238, 106, 106, 106, 113, 163, 215, 7, 29, 64,…
## $ UrineFlow1 <dbl> NA, NA, NA, NA, 0.094, 1.538, 1.322, 1.116, 1.116, 1.116, 0.489, NA, 0.90…
## $ UrineVol2 <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ UrineFlow2 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ Diabetes <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, No, N…
## $ DiabetesAge <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ HealthGen <fct> Good, Good, Good, NA, Good, NA, NA, Vgood, Vgood, Vgood, Vgood, Vgood, Fa…
## $ DaysPhysHlthBad <int> 0, 0, 0, NA, 0, NA, NA, 0, 0, 0, 10, 0, 4, NA, NA, 0, NA, 3, 7, 0, 3, 3, …
## $ DaysMentHlthBad <int> 15, 15, 15, NA, 10, NA, NA, 3, 3, 3, 0, 0, 0, NA, NA, 0, NA, 7, 0, 20, 0,…
## $ LittleInterest <fct> Most, Most, Most, NA, Several, NA, NA, None, None, None, None, None, None…
## $ Depressed <fct> Several, Several, Several, NA, Several, NA, NA, None, None, None, None, N…
## $ nPregnancies <int> NA, NA, NA, NA, 2, NA, NA, 1, 1, 1, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ nBabies <int> NA, NA, NA, NA, 2, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ Age1stBaby <int> NA, NA, NA, NA, 27, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ SleepHrsNight <int> 4, 4, 4, NA, 8, NA, NA, 8, 8, 8, 7, 5, 4, NA, 5, 7, NA, 6, 6, 6, 7, 7, 8,…
## $ SleepTrouble <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, No, No, Yes, NA, No, No, NA, …
## $ PhysActive <fct> No, No, No, NA, No, NA, NA, Yes, Yes, Yes, Yes, Yes, Yes, NA, Yes, Yes, N…
## $ PhysActiveDays <int> NA, NA, NA, NA, NA, NA, NA, 5, 5, 5, 7, 5, 1, NA, 2, 7, NA, NA, NA, 3, 7,…
## $ TVHrsDay <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ CompHrsDay <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
## $ TVHrsDayChild <int> NA, NA, NA, 4, NA, 5, 1, NA, NA, NA, NA, NA, NA, 4, NA, NA, 0, NA, NA, NA…
## $ CompHrsDayChild <int> NA, NA, NA, 1, NA, 0, 6, NA, NA, NA, NA, NA, NA, 3, NA, NA, 1, NA, NA, NA…
## $ Alcohol12PlusYr <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, Yes, Yes, Yes, NA, NA, No,…
## $ AlcoholDay <int> NA, NA, NA, NA, 2, NA, NA, 3, 3, 3, 1, 2, 6, NA, NA, NA, NA, 3, 6, NA, 1,…
## $ AlcoholYear <int> 0, 0, 0, NA, 20, NA, NA, 52, 52, 52, 100, 104, 364, NA, NA, 0, NA, 104, 3…
## $ SmokeNow <fct> No, No, No, NA, Yes, NA, NA, NA, NA, NA, No, NA, NA, NA, Yes, NA, NA, No,…
## $ Smoke100 <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, Yes, No, No, NA, Yes, No, NA,…
## $ Smoke100n <fct> Smoker, Smoker, Smoker, NA, Smoker, NA, NA, Non-Smoker, Non-Smoker, Non-S…
## $ SmokeAge <int> 18, 18, 18, NA, 38, NA, NA, NA, NA, NA, 13, NA, NA, NA, 17, NA, NA, NA, 1…
## $ Marijuana <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, NA, Yes, Yes, NA, NA, No, …
## $ AgeFirstMarij <int> 17, 17, 17, NA, 18, NA, NA, 13, 13, 13, NA, 19, 15, NA, NA, NA, NA, NA, N…
## $ RegularMarij <fct> No, No, No, NA, No, NA, NA, No, No, No, NA, Yes, Yes, NA, NA, No, NA, No,…
## $ AgeRegMarij <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20, 15, NA, NA, NA, NA, NA, N…
## $ HardDrugs <fct> Yes, Yes, Yes, NA, Yes, NA, NA, No, No, No, No, Yes, Yes, NA, NA, No, NA,…
## $ SexEver <fct> Yes, Yes, Yes, NA, Yes, NA, NA, Yes, Yes, Yes, Yes, Yes, Yes, NA, NA, Yes…
## $ SexAge <int> 16, 16, 16, NA, 12, NA, NA, 13, 13, 13, 17, 22, 12, NA, NA, NA, NA, 27, 2…
## $ SexNumPartnLife <int> 8, 8, 8, NA, 10, NA, NA, 20, 20, 20, 15, 7, 100, NA, NA, 9, NA, 1, 1, NA,…
## $ SexNumPartYear <int> 1, 1, 1, NA, 1, NA, NA, 0, 0, 0, NA, 1, 1, NA, NA, 1, NA, 1, NA, NA, 1, 1…
## $ SameSex <fct> No, No, No, NA, Yes, NA, NA, Yes, Yes, Yes, No, No, No, NA, NA, No, NA, N…
## $ SexOrientation <fct> Heterosexual, Heterosexual, Heterosexual, NA, Heterosexual, NA, NA, Bisex…
## $ PregnantNow <fct> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
# Summarize Age and BMI
summary(data %>% select(Gender,Age, BMI,SmokeNow,PhysActive))
## Gender Age BMI SmokeNow PhysActive
## female:5020 Min. : 0.00 Min. :12.88 No :1745 No :3677
## male :4980 1st Qu.:17.00 1st Qu.:21.58 Yes :1466 Yes :4649
## Median :36.00 Median :25.98 NA's:6789 NA's:1674
## Mean :36.74 Mean :26.66
## 3rd Qu.:54.00 3rd Qu.:30.89
## Max. :80.00 Max. :81.25
## NA's :366
# Frequency table for Gender
table(data$Gender)
##
## female male
## 5020 4980
# Frequency table for SmokeNow
table(data$SmokeNow)
##
## No Yes
## 1745 1466
# two ways to check levels
print(head(data$Gender))
## [1] male male male male female male
## Levels: female male
levels(data$Gender)
## [1] "female" "male"
Check for missing values in key variables.
Remove any rows with missing values.
Recode SmokeNow and PhysActive to have
more interpretable categories if necessary.
data_clean <- data %>%
drop_na(Age, BMI, Gender, SmokeNow, PhysActive)
# (Optionally)
# Optional 1: Recode variables for more information when doing summary
data_clean1 <- data_clean %>%
mutate(
SmokeNow = recode(SmokeNow, "Yes" = 1, "No" = 0),
PhysActive = recode(PhysActive, "Yes" = 1, "No" = 0),
Gender = recode(Gender, "male" = 1, "female" = 0)
)
summary(data_clean1 %>% select(Gender,Age, BMI,SmokeNow,PhysActive))
## Gender Age BMI SmokeNow PhysActive
## Min. :0.0000 Min. :20.00 Min. :15.02 Min. :0.0000 Min. :0.0000
## 1st Qu.:0.0000 1st Qu.:35.00 1st Qu.:23.90 1st Qu.:0.0000 1st Qu.:0.0000
## Median :1.0000 Median :49.00 Median :27.54 Median :0.0000 Median :0.0000
## Mean :0.5625 Mean :48.91 Mean :28.45 Mean :0.4579 Mean :0.4771
## 3rd Qu.:1.0000 3rd Qu.:62.00 3rd Qu.:31.90 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :80.00 Max. :67.83 Max. :1.0000 Max. :1.0000
# Option 2: Recode variables for regression table later
# this way so that your regrestion table is more straitforward
data_clean2 <- data_clean %>%
mutate(
SmokeNow = recode(SmokeNow, "Yes" = "Smoker", "No" = "Non-Smoker"),
PhysActive = recode(PhysActive, "Yes" = "Active", "No" = "Inactive"))
summary(data_clean2 %>% select(Gender,Age, BMI,SmokeNow,PhysActive))
## Gender Age BMI SmokeNow PhysActive
## female:1393 Min. :20.00 Min. :15.02 Non-Smoker:1726 Inactive:1665
## male :1791 1st Qu.:35.00 1st Qu.:23.90 Smoker :1458 Active :1519
## Median :49.00 Median :27.54
## Mean :48.91 Mean :28.45
## 3rd Qu.:62.00 3rd Qu.:31.90
## Max. :80.00 Max. :67.83
Create a histogram of BMI.
Create
a boxplot of BMI by Gender.
Add appropriate labels and titles.
## basic R
hist(data_clean2$BMI,
main = "Histogram of BMI",
xlab = "BMI",
col = "lightblue", # Optional: Adds color to the bars
border = "black") # Optional: Adds a border to the bars
boxplot(BMI ~ Gender, data = data_clean2, # remember to use the dataset you didn't code Gender as numbers so that X aes wound't be a continues variable
main = "Boxplot of BMI by Gender",
xlab = "Gender",
ylab = "BMI",
col = "lightblue") # Optional: Adds color to the boxplot
# ggplot2 package
ggplot(data_clean1, aes(x = BMI)) +
geom_histogram(binwidth = 1, fill = "blue", color = "black") + # we are doing it a bit differently by changing the binwidth
labs(title = "Distribution of BMI", x = "BMI", y = "Frequency")
ggplot(data_clean2, aes(x = Gender, y = BMI)) +
geom_boxplot(fill = "pink") +
labs(title = "BMI by Gender", x = "Gender", y = "BMI")
Use stargazer Present sample statistics for our key variables, mutate the data if needed ( for categorical variables)
Present the results in a neat table.
# Only do sample statistic for our key variables
key_variables <- data_clean1 %>%
select(Age, BMI, Gender, SmokeNow, PhysActive)
library(stargazer)
# Key variables: Age,Gender,BMI,SmokeNow(whether the person currently smokes), and PhysActive(physical activity status).
stargazer(
key_variables,
type = "text", #comment out this line and de-comment the following two line if you need a doc file
# type = "html",
# out = "sample statistics.doc",
title = "Descriptive Statistics",
digits = 2, # keeping two digit after "."
summary.stat = c("mean", "sd", "min", "max","n")
)
##
## Descriptive Statistics
## ===========================================
## Statistic Mean St. Dev. Min Max N
## -------------------------------------------
## Age 48.91 16.79 20 80 3,184
## BMI 28.45 6.33 15.02 67.83 3,184
## Gender 0.56 0.50 0 1 3,184
## SmokeNow 0.46 0.50 0 1 3,184
## PhysActive 0.48 0.50 0 1 3,184
## -------------------------------------------
Run a linear regression with BMI as
the dependent
variable and Age, Gender,
and PhysActive as independent
variables.
Interpret the coefficients ( to yourself or group member).
model_basic <- lm(BMI ~ Age + Gender + PhysActive, data = data_clean2)
summary(model_basic)
##
## Call:
## lm(formula = BMI ~ Age + Gender + PhysActive, data = data_clean2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.092 -4.434 -0.957 3.488 38.686
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 28.536543 0.399157 71.492 < 2e-16 ***
## Age 0.015578 0.006699 2.325 0.0201 *
## Gendermale 0.009406 0.223630 0.042 0.9665
## PhysActiveActive -1.785032 0.225212 -7.926 3.1e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.258 on 3180 degrees of freedom
## Multiple R-squared: 0.02348, Adjusted R-squared: 0.02256
## F-statistic: 25.49 on 3 and 3180 DF, p-value: 2.716e-16
Use stargazer to create a regression table
(both in the console and a Word doc).
Use coefplot to visualize the
coefficients.
stargazer(model_basic, type = "text", title = "Regression Results")
##
## Regression Results
## ===============================================
## Dependent variable:
## ---------------------------
## BMI
## -----------------------------------------------
## Age 0.016**
## (0.007)
##
## Gendermale 0.009
## (0.224)
##
## PhysActiveActive -1.785***
## (0.225)
##
## Constant 28.537***
## (0.399)
##
## -----------------------------------------------
## Observations 3,184
## R2 0.023
## Adjusted R2 0.023
## Residual Std. Error 6.258 (df = 3180)
## F Statistic 25.492*** (df = 3; 3180)
## ===============================================
## Note: *p<0.1; **p<0.05; ***p<0.01
library(coefplot)
coefplot(model_basic, title = "Coefficient Plot",intercept = F)
We will use the NHANES dataset, which contains
health and demographic information collected from a representative
sample of the U.S. population. The dataset is available in R through
the NHANES package.
Accessing the Dataset:
#if (FALSE) install.packages("NHANES") # Install if not already installed
library(NHANES)
data <- data.frame(NHANES)
Load the NHANES dataset.
Check for missing values and handle them appropriately.
Summarize key
variables: Age, Gender, BMI, SmokeNow (whether
the person currently smokes), PhysActive(physical activity
status) and  SleepHrsNight, recode them if it is
necessary.
Identify and
remove outliers in BMI (e.g., using the
IQR method).
Transform BMI using log
transformation if you think it is appropriate.
Recode SmokeNow and PhysActive to have
more interpretable categories if necessary.
library(tidyverse)
data_clean <- data %>%
drop_na(Age, BMI, Gender, SmokeNow, PhysActive)
# Remove outliers
Q1 <- quantile(data_clean$BMI, 0.25, na.rm = TRUE)
Q3 <- quantile(data_clean$BMI, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
data_no_outliers <- data_clean %>%
filter(BMI >= (Q1 - 1.5 * IQR) & BMI <= (Q3 + 1.5 * IQR))
# Log transformation (if distribution is skewed)
data_no_outliers <- data_no_outliers %>%
mutate(log_BMI = log(BMI))
| Dealing with Missing Data |
|---|
When working with datasets like NHANES, missing data can be common. We need to handle missing data because missing values can distort analysis results, lead to biased conclusions, and reduce the dataset size. There are different ways to deal with missing data:
In our project, we chose to drop rows with missing data for important variables (Age, BMI, Gender, SmokeNow, PhysActive) because the dataset is large enough, and this ensures that our results won’t be biased by incomplete records. |
| Dealing with Outliers |
|---|
Outliers are data points that differ significantly from other observations. They can distort statistical results and lead to incorrect conclusions if not properly addressed. Here’s how we deal with outliers in this project:
In this project, we chose to remove outliers from the BMI variable to ensure that our regression models and visualizations are not overly influenced by extreme BMI values that may not represent the general population. By doing this, we can obtain more reliable and meaningful results. |
Create a scatter
plot of BMI vs. Age.
Color the points by SmokeNow.
Use smooth lines to show trends (e.g.,
using geom_smooth).
Customize the plot with themes and labels.
ggplot(data_no_outliers, aes(x = Age, y = BMI, color = SmokeNow)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", se = FALSE) +
theme_minimal() +
labs(title = "BMI vs Age by Smoking Status",
x = "Age",
y = "BMI")
Calculate correlation
coefficients between Age, BMI,
and SleepHrsNight (average hours of sleep per night, if
available).
Present the correlation matrix.
# Ensure 'SleepHrsNight' is available and has no missing values
data_no_outliers <- data_no_outliers %>%
drop_na(SleepHrsNight)
correlation_matrix <- data_no_outliers %>%
select(Age, BMI, SleepHrsNight) %>%
cor()
print(correlation_matrix)
## Age BMI SleepHrsNight
## Age 1.00000000 0.09484044 0.10211937
## BMI 0.09484044 1.00000000 -0.04470914
## SleepHrsNight 0.10211937 -0.04470914 1.00000000
Run multiple regression models:
Model 1:Â BMI ~ Age
Model 2: Add Gender as an
independent variable.
Model 3:
Include PhysActive , SmokeNow and
SleepHrsNight as independent variables.
model1 <- lm(BMI ~ Age, data = data_no_outliers)
model2 <- lm(BMI ~ Age + Gender, data = data_no_outliers)
model3 <- lm(BMI ~ Age + Gender + PhysActive + SmokeNow +SleepHrsNight, data = data_no_outliers)
Check the assumptions of linear regression
for model3:
Linearity
Independence
Homoscedasticity
Normality of residuals
# Residual plots
par(mfrow = c(2, 2))
plot(model3)
Four Assumptions of Linear Regression |
|---|
Linear regression is a powerful tool, but for it to work well, four key assumptions must hold:
By checking and ensuring these assumptions hold, we can be more confident that the results from our regression models are reliable and valid for interpretation. |
| Checking Four Assumptions of Linear Regression |
|---|
1. Linearity (Residuals vs Fitted Plot)
2. Independence (Residuals vs Fitted Plot)
3. Homoscedasticity (Scale-Location Plot)
4. Normality of Residuals (Normal Q-Q Plot)
|
| Checking Four Assumptions of Linear Regression |
|---|
5. Residuals vs Leverage (for detecting influential points)
|
Use stargazer to compare all four
models in a single table with robust standard
errors.
Use coefplot to visualize and
compare coefficients across models.
library(stargazer)
library(sandwich) # For robust standard errors
stargazer(model1, model2, model3, type = "text",
title = "Regression Models Comparison",
se = list(
sqrt(diag(vcovHC(model1, type = "HC1"))),
sqrt(diag(vcovHC(model2, type = "HC1"))),
sqrt(diag(vcovHC(model3, type = "HC1")))
))
##
## Regression Models Comparison
## ==============================================================================================
## Dependent variable:
## --------------------------------------------------------------------------
## BMI
## (1) (2) (3)
## ----------------------------------------------------------------------------------------------
## Age 0.032*** 0.032*** 0.009
## (0.006) (0.006) (0.006)
##
## Gendermale 0.405* 0.397*
## (0.208) (0.205)
##
## PhysActiveYes -1.607***
## (0.206)
##
## SmokeNowYes -1.672***
## (0.218)
##
## SleepHrsNight -0.220***
## (0.077)
##
## Constant 26.489*** 26.246*** 30.413***
## (0.310) (0.342) (0.666)
##
## ----------------------------------------------------------------------------------------------
## Observations 3,113 3,113 3,113
## R2 0.009 0.010 0.044
## Adjusted R2 0.009 0.010 0.043
## Residual Std. Error 5.608 (df = 3111) 5.606 (df = 3110) 5.511 (df = 3107)
## F Statistic 28.237*** (df = 1; 3111) 16.121*** (df = 2; 3110) 28.721*** (df = 5; 3107)
## ==============================================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
library(coefplot)
coefplot(model3, legend = TRUE, title = "Coefficient Comparison",intercept=F)