Session 11: Prediction

Interactive mode: click a code block or Show Plot button to reveal/hide its corresponding plot.

Week 12 Prediction

library(haven)
library(tidyverse)

df<-read_dta("CMPS_2016.dta")

Why do we need predicted values?

After fitting a model, it is useful generate model-based estimates (expected values, or adjusted predictions) of the response variable for different combinations of predictor values. Such estimates can be used to make inferences about relationships between variables - adjusted predictions tell you: what is the expected ouctome for certain values or levels of my predictors?

ggeffects is a handy package that aims at easily calculating adjusted predictions at meaningful values of covariates from statistical models.

#if (FALSE) install.packages("ggeffects")
library(ggeffects)

df$C246

## <labelled<double>[10144]>: Immigrants (How much discrimination is there in the United States today against
##    [1] 5 5 1 1 1 1 1 2 1 2 1 3 1 2 1 2 1 1 1 1 3 1 1 2 2 1 3 2 2 5 2 1 3 5 5 1 1 2 1 1 3 1 3 1 3 2 3
##   [48] 1 2 2 3 1 1 2 2 5 1 5 1 4 1 1 1 3 2 1 5 1 1 1 3 1 1 5 1 3 2 1 2 1 3 1 3 1 5 2 2 1 2 3 2 1 1 1
##   [95] 3 5 2 3 2 5 3 4 3 5 2 3 2 1 2 1 1 2 3 1 5 2 2 3 2 1 1 1 2 1 2 5 2 3 2 3 1 5 1 2 1 1 1 1 2 5 2
##  [142] 2 2 1 1 5 2 2 1 5 1 3 5 1 1 1 4 3 1 1 2 1 5 1 1 3 3 1 1 4 1 1 1 1 1 1 1 2 1 1 1 1 3 5 2 5 2 1
##  [189] 1 2 5 5 3 2 2 2 3 3 1 3 2 1 2 2 1 1 2 1 1 1 1 2 4 1 2 1 1 1 1 1 3 2 2 5 2 1 1 1 1 1 4 3 4 1 2
##  [236] 2 1 1 3 1 5 1 1 3 2 2 3 2 5 1 1 1 4 3 1 4 1 1 3 4 1 3 4 4 2 3 1 2 1 1 5 5 3 1 1 1 2 2 3 3 3 5
##  [283] 1 1 2 1 2 2 1 1 1 1 2 1 1 2 5 1 2 1 2 5 1 1 1 1 1 2 2 2 1 3 1 2 1 1 2 1 1 2 2 1 1 2 1 1 1 1 1
##  [330] 3 1 1 3 2 1 1 2 2 2 5 1 2 1 4 1 2 2 4 2 3 3 3 2 1 1 4 2 3 2 1 1 2 5 1 1 1 2 3 1 2 2 1 4 1 2 3
##  [377] 1 2 1 1 1 3 3 2 2 2 3 3 2 2 2 2 2 1 5 3 1 2 2 1 1 1 1 1 3 2 2 2 3 1 1 2 2 2 3 3 3 1 2 1 4 1 2
##  [424] 1 3 3 1 2 2 1 2 2 1 1 1 5 1 3 5 2 2 5 5 1 1 1 1 2 2 1 1 2 2 3 1 1 1 5 5 2 3 1 1 1 4 2 2 1 4 1
##  [471] 5 2 1 1 1 1 5 3 1 2 1 1 1 5 1 2 1 2 1 2 3 2 1 2 2 1 1 1 5 1 2 2 1 2 1 1 1 1 1 1 2 2 3 1 2 1 1
##  [518] 1 2 1 1 1 2 1 2 5 1 1 1 2 3 3 1 5 2 1 1 1 2 2 1 5 2 1 2 2 4 4 5 2 1 1 2 1 1 1 5 2 1 3 1 2 1 1
##  [565] 2 1 1 2 1 1 2 1 3 1 3 1 2 3 1 1 1 1 3 2 2 1 2 1 1 1 1 3 1 1 1 1 1 1 2 1 1 1 3 1 1 5 3 1 1 1 3
##  [612] 2 1 2 1 2 3 1 1 1 2 1 1 1 2 2 1 3 5 1 1 1 2 1 3 1 2 3 2 3 1 1 3 1 1 1 2 1 2 1 1 1 1 2 1 1 1 3
##  [659] 2 1 3 1 2 1 1 3 1 2 5 1 1 1 1 3 2 3 2 1 1 1 1 5 2 3 1 1 1 1 2 2 2 2 1 1 3 3 1 1 2 2 3 1 2 1 5
##  [706] 3 1 3 1 4 1 1 1 3 1 2 1 5 3 1 1 1 2 1 1 4 1 1 1 5 1 1 1 5 1 1 2 1 1 4 2 1 4 5 2 1 1 1 1 2 2 5
##  [753] 3 1 5 1 3 3 1 1 1 2 1 2 1 1 1 2 5 2 2 1 1 1 2 1 3 1 1 1 1 1 1 1 2 1 1 1 2 1 1 3 1 1 5 3 2 3 1
##  [800] 1 1 3 1 2 2 1 1 3 1 3 1 2 1 2 1 1 1 1 1 1 1 1 2 5 1 1 1 2 5 1 1 2 3 1 1 2 1 5 1 1 1 2 1 5 1 5
##  [847] 2 1 1 5 1 2 1 3 1 1 1 1 1 5 1 1 1 2 1 1 1 2 1 1 2 1 1 1 1 3 1 1 5 1 1 1 2 2 1 2 5 1 1 5 2 1 1
##  [894] 1 2 2 5 1 1 2 1 3 2 1 2 1 2 1 1 1 1 1 1 1 2 5 2 1 1 1 3 2 5 3 1 1 2 1 3 1 4 5 1 2 1 1 1 3 3 1
##  [941] 1 1 2 2 1 1 1 1 2 1 1 3 5 4 2 2 2 1 1 2 1 5 1 1 1 1 1 1 1 1 5 1 1 1 1 1 5 2 1 3 1 1 4 1 2 3 2
##  [988] 3 1 1 1 1 1 1 1 2 4 2 1 1 5 1 1 1 1 1 1 1 1 1 2 2 1 1 3 1 2 3 2 4 5 1 1 1 1 4 5 1 1 5 1 5 1 3
## [1035] 1 1 2 5 1 3 1 1 1 3 3 5 2 1 5 1 2 1 1 5 3 2 3 1 1 1 1 4 5 3 1 1 2 2 1 2 1 1 1 1 1 2 1 1 5 1 1
## [1082] 1 1 1 2 1 2 1 1 1 1 2 1 1 2 1 1 4 3 1 5 1 1 4 1 5 1 1 3 1 1 2 5 4 1 3 2 1 1 1 1 3 2 5 2 5 2 1
## [1129] 1 1 2 5 1 2 5 1 1 2 5 2 3 1 1 5 1 1 3 1 4 2 2 1 1 2 1 1 1 2 3 1 3 1 2 1 1 1 1 3 2 1 2 5 5 5 4
## [1176] 1 1 1 1 2 2 1 2 1 1 2 1 2 2 3 2 2 1 1 2 1 1 2 1 2
##  [ reached 'max' / getOption("max.print") -- omitted 8944 entries ]
## 
## Labels:
##  value       label
##      1       A lot
##      2        Some
##      3    A little
##      4 None at all
##      5  Don't know

data <- df %>%
  filter(C246<5)%>%
  mutate(
    discrimination = 4 - C246,
    sex = case_when(
      S3 == 1 ~ "Female",
      S3 == 2 ~ "Male",
      TRUE ~ NA_character_
    ),
    White = ifelse(S2_1==1,"White","Minority"),
    age = AGE
  )


# Convert 'race' to a factor and set levels
data$White <- factor(data$White,
                    levels =  c("White",
                                "Minority"))

data$sex <- factor(data$sex)

Predicted Values and Categorical Predictor

After fitting a model, it’s often helpful to estimate how the response variable changes for different levels of a categorical predictor. These predictions allow us to assess the adjusted relationship between a categorical variable and the outcome, holding other predictors constant. For example, we might want to know how discrimination scores differ by race or sex while controlling for other factors such as age. Using ggeffects, we can calculate and visualize these predictions efficiently.

model1<-glm(discrimination~sex+White+age,data=data)

summary(model1)

## 
## Call:
## glm(formula = discrimination ~ sex + White + age, data = data)
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.3919348  0.0352811  67.797  < 2e-16 ***
## sexMale       -0.2149141  0.0177212 -12.128  < 2e-16 ***
## WhiteMinority  0.1526756  0.0261176   5.846 5.21e-09 ***
## age           -0.0030184  0.0005487  -5.501 3.87e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.6633938)
## 
##     Null deviance: 6418.1  on 9434  degrees of freedom
## Residual deviance: 6256.5  on 9431  degrees of freedom
##   (23 observations deleted due to missingness)
## AIC: 22909
## 
## Number of Fisher Scoring iterations: 2

model1_pred<-ggpredict(model1,"White",ci_level = 0.95)

plot(model1_pred)

plot(model1_pred)+ylim(2,2.5)+theme_classic()

Predicted Values and Continuous Predictor

In this section, we explore how a continuous predictor influences the response variable. Adjusted predictions allow us to understand the expected outcome across a range of values for the continuous variable, holding other factors constant.

model1_pred2<-ggpredict(model1,"age",ci_level = 0.95)

plot(model1_pred2)+ylim(1.8,2.8)

Predicted Values and Interactions

Interactions between predictors can reveal more complex relationships, where the effect of one predictor depends on the level of another. Visualizing predicted values in the context of interactions helps us interpret these conditional effects effectively.

Interaction between Categorical Predictor and Continuous Predictor

Interactions between categorical and continuous predictors allow us to assess whether the relationship between the continuous variable and the outcome varies across groups. For example, we can examine whether the effect of age on discrimination differs by sex.

model2<-lm(discrimination~sex*age+White,data=data)
summary(model2)

## 
## Call:
## lm(formula = discrimination ~ sex * age + White, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5050 -0.4199  0.4988  0.6208  1.0230 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.4176445  0.0387774  62.347  < 2e-16 ***
## sexMale       -0.2898118  0.0501322  -5.781 7.66e-09 ***
## age           -0.0036970  0.0006939  -5.328 1.02e-07 ***
## WhiteMinority  0.1538525  0.0261259   5.889 4.02e-09 ***
## sexMale:age    0.0017881  0.0011196   1.597     0.11    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8144 on 9430 degrees of freedom
##   (23 observations deleted due to missingness)
## Multiple R-squared:  0.02544,    Adjusted R-squared:  0.02503 
## F-statistic: 61.55 on 4 and 9430 DF,  p-value: < 2.2e-16

model2_pred<-ggpredict(model2,c("age","sex"),ci_level = 0.95)

plot(model2_pred)+ylim(1.8,2.5)

Interaction between Categorical Predictors

When two categorical predictors interact, we can explore how their combined levels influence the response variable. This analysis helps to determine whether the relationship between one categorical variable and the outcome depends on the levels of another categorical variable. For example, we can estimate how discrimination scores vary by the interaction of race and sex.

model3<-lm(discrimination~sex*White+age,data=data)
summary(model3)

## 
## Call:
## lm(formula = discrimination ~ sex * White + age, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.4841 -0.4156  0.5189  0.6171  1.1296 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.4310918  0.0390272  62.292  < 2e-16 ***
## sexMale               -0.3254683  0.0503743  -6.461 1.09e-10 ***
## WhiteMinority          0.1065849  0.0326849   3.261  0.00111 ** 
## age                   -0.0029771  0.0005488  -5.425 5.95e-08 ***
## sexMale:WhiteMinority  0.1258074  0.0536620   2.344  0.01908 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8143 on 9430 degrees of freedom
##   (23 observations deleted due to missingness)
## Multiple R-squared:  0.02575,    Adjusted R-squared:  0.02534 
## F-statistic: 62.31 on 4 and 9430 DF,  p-value: < 2.2e-16

model3_pred<-ggpredict(model3,c("White","sex"),ci_level = 0.95)

plot(model3_pred)+ylim(1.5,2.5)