Nivi Research: Mister P helps us understand vaccine hesitancy

Eric Green

Previously I wrote about our efforts to monitor vaccine hesitancy through askNivi country polls. In our initial round in Kenya, we estimated that the probability is greater than 95% that at least 70% of people would be likely to get vaccinated if a vaccine were available for free today.

Skeptical readers might be thinking, “This is based on data from 486 people who responded to ads on Facebook. How is this representative of the country?” 🤔 Good question!

Sampling 101

The gold standard approach for estimating characteristics of some population has long been probability-based representative sampling where every person in the target population has a chance of being invited to participate. In the global health world, the best examples come from the Demographic and Health Surveys (DHS) Program. DHS surveys involve between 5,000 and 30,000 households in a country and are typically conducted every 5 years. They cover dozens of topics and produce nationally-representative estimates through multistage cluster sampling. But there’s a catch: DHS surveys take 18 to 20 months (and a lot of money) to design, conduct door-to-door, and analyze. Despite their many strengths, DHS surveys are not an option when you need answers quickly.

A popular alternative is to contact a random sample of phone numbers and conduct the survey via SMS or phone. For instance, researchers might partner with a telecommunications company to get a list of all phone subscribers and randomly select numbers from this list, also known as a sampling frame. Another approach is called random digit dialing, where researchers randomly generate a list of phone numbers and try each number on the list (some will work, some won’t). While low cost and quick, poor coverage and high non-response can make these methods no better than non-representative convenience samples.¹

Falling response rates in traditional public opinion polling, along with the growing ease of collecting data online, have led researchers to develop techniques for making convenience samples more representative (Wang et al. 2015). Before exploring one such approach, it will help to build some intuition about why just relying on convenience samples without any adjustment is usually a bad idea.

An Inconvenient Truth

Imagine a country with 1 million residents. Let’s call it Dansia. Since we’re imagining things, let’s pretend that the good people of Dansia conducted a census last week of all 1 million residents and the data are ready to share. As luck would have it, the census included a question about our key dependent variable—frequency of mask wearing in public among people ages 15 to 64. This means we can calculate the true frequency in the population. And since we know the “real” answer, we can compare how different sampling approaches recover the true value.

  library(tidyverse)
  library(ggtext)
  library(gtsummary)
  library(gt)

# get the population data and custom ggplot function
  load(url("https://www.dropbox.com/s/0u3lblqfv85hch3/pop_df.RData?dl=1"))

  pop_df %>%
    dplyr::select(-ageCat) %>%
    mutate(gender = factor(gender, 
                           levels=c("male", "female", "non-binary"),
                           labels=c("Male", "Female", "Non-binary"))) %>%
    tbl_summary(by=gender,
                label = c(age ~ "Mean age (SD)",
                          dv ~ "Wears mask"),
                statistic = list(all_continuous() ~ "{mean} ({sd})",
                                 all_categorical() ~ "{p}%")
                ) %>%
    add_overall() %>%
    bold_labels() %>%
    modify_footnote(update = everything() ~ NA) %>%
    as_gt() %>%
    tab_header(title="Population of Dansia, Ages 15 to 64") %>%
      gt::tab_style(           
      style = list(gt::cell_fill(color = "#4FBFA5"),
                   gt::cell_text(color = "white",
                                 weight="bold")),
      locations = gt::cells_body(columns = vars(stat_0),
                                 rows = 3:6)
    )

Population of Dansia, Ages 15 to 64

Characteristic	Overall, N = 681,732	Male, N = 323,970	Female, N = 323,758	Non-binary, N = 34,004
Mean age (SD)	39 (14)	39 (14)	39 (14)	39 (14)
Wears mask
never	26%	33%	18%	26%
rarely	27%	33%	21%	27%
sometimes	22%	19%	26%	22%
often	25%	16%	35%	25%

Remember, we only know this distribution of mask wearing in the population because we’re pretending we have timely census results. This would not happen in reality (and the census is not going to ask about mask wearing). In reality, we’d need to infer information about the population from a sample of people we contact. So let’s do that. First we’ll draw a random sample, followed by a convenience sample.

Random Sample

The gold standard approach to sampling from this population would be to draw a random sample from a sampling frame. Let’s assume we could do this from a phone book and contacted 500 randomly selected people. Let’s also assume that we achieved 100% coverage and 0% non-response!

  library(tidyverse)
  library(ggtext)

# get the population data and custom ggplot function
  load(url("https://www.dropbox.com/s/0u3lblqfv85hch3/pop_df.RData?dl=1"))

# draw random sample
  set.seed(12345)
  sample_rand <- pop_df %>%
    sample_n(500)

# calculate response frequency in the population
  overall_pop <- pop_df %>% 
  group_by(dv) %>% 
  count() %>% 
  ungroup() %>% 
  mutate(p = n/sum(n)) %>%
  mutate(source = "true population") %>%
  dplyr::select(source, dv, p)

# calculate response frequency in the random sample
  overall_sample_rand <- sample_rand %>% 
    group_by(dv) %>% 
    count() %>% 
    ungroup() %>% 
    mutate(p = n/sum(n)) %>%
    mutate(source = "random sample") %>%
    dplyr::select(source, dv, p)
  
# combine and plot (uses custom ggplot function that comes with data)
  overall_pop %>%
  bind_rows(overall_sample_rand) %>%
  mrp_plot(., plotTitle = "Comparing a <span style = 'color:#005BE7;'>random sample</span> to the <span style = 'color:#4FBFA5;'>true population</span>", 
           annLabel = "Random sample is\nclose to the true value", 
           ann_x = 3.3, ann_y = 0.10, annColor = "#005BE7", 
           arrow_x_start = 3.6, arrow_y_start = 0.10, 
           arrow_x_end = 4, arrow_y_end = 0.23, arrowColor = "#005BE7",
           fills = c("#005BE7", "#4FBFA5"), colors = c("#005BE7", "#4FBFA5"))

Figure 1: A random sample gets pretty close!

As you can see in Figure 1, our random sample gets pretty close to the true population values. Of course this is just one possible sample of 500 people from Dansia, but it illustrates the point that random samples are usually great (well, at least when coverage and response rates are high).

Convenience Sample

Now let’s pull together a convenience sample and see how it compares to the population. “Convenience” means that we take who we get. Suppose we posted an advertisement online for people ages 15 to 64 to complete our survey. Some people in this age range would see the ad, most would not (poor coverage). And as it goes with ads, some people who see it will click, but most will not (high non-response).

# draw a convenience sample that is mostly young men
# ----------------------------------------------------
# 50 older men
  sample_obs_m1 <- pop_df %>%
    filter(gender=="male") %>%
    filter(age>35) %>%
    sample_n(50)
  
# 400 younger men
  sample_obs_m2 <- pop_df %>%
    filter(gender=="male") %>%
    filter(age<=35) %>%
    sample_n(400)
  
# 45 women
  sample_obs_f <- pop_df %>%
    filter(gender=="female") %>%
    sample_n(45)
    
# 5 individuals who identified as non-binary
  sample_obs_n <- pop_df %>%
    filter(gender=="non-binary") %>%
    sample_n(5)
  
# combine to make a sample of 500
  sample_obs <- sample_obs_f %>%
    bind_rows(sample_obs_m1) %>%
    bind_rows(sample_obs_m2) %>%
    bind_rows(sample_obs_n)

 tbl_pop <- pop_df %>%
    dplyr::select(-ageCat, -dv) %>%
    mutate(gender = factor(gender, 
                           levels=c("male", "female", "non-binary"),
                           labels=c("Male", "Female", "Non-binary"))) %>%
    tbl_summary(label = c(age ~ "Mean age (SD)",
                          gender ~ "Gender"),
                statistic = list(all_continuous() ~ "{mean} ({sd})",
                                 all_categorical() ~ "{p}%")
                ) %>%
    bold_labels() %>%
    modify_footnote(update = everything() ~ NA)
 
 tbl_rand <- sample_rand %>%
    dplyr::select(-ageCat, -dv) %>%
    mutate(gender = factor(gender, 
                           levels=c("male", "female", "non-binary"),
                           labels=c("Male", "Female", "Non-binary"))) %>%
    tbl_summary(label = c(age ~ "Mean age (SD)",
                          gender ~ "Gender"),
                statistic = list(all_continuous() ~ "{mean} ({sd})",
                                 all_categorical() ~ "{p}%")
                ) %>%
    bold_labels() %>%
    modify_footnote(update = everything() ~ NA)
 
 tbl_obs <- sample_obs %>%
    dplyr::select(-ageCat, -dv) %>%
    mutate(gender = factor(gender, 
                           levels=c("male", "female", "non-binary"),
                           labels=c("Male", "Female", "Non-binary"))) %>%
    tbl_summary(label = c(age ~ "Mean age (SD)",
                          gender ~ "Gender"),
                statistic = list(all_continuous() ~ "{mean} ({sd})",
                                 all_categorical() ~ "{p}%")
                ) %>%
    bold_labels() %>%
    modify_footnote(update = everything() ~ NA)
 
 tbl_merge(list(tbl_pop, tbl_rand, tbl_obs),
           tab_spanner = c("**Population**", "**Random Sample**", 
                           "**Convenience Sample**"))

Characteristic	Population	Random Sample	Convenience Sample
Characteristic	N = 681,732	N = 500	N = 500
Gender
Male	48%	49%	90%
Female	47%	48%	9.0%
Non-binary	5.0%	3.4%	1.0%
Mean age (SD)	39 (14)	37 (14)	29 (11)

Our imaginary ad must have appealed more to younger, male audiences. Whereas males make up about half of the population and random sample, they are 90% of the convenience sample!

Now let’s estimate mask wearing in Dansia from this convenience sample that we know is younger and more male on average than the population.

# calculate response frequency in the random sample
  overall_sample_obs <- sample_obs %>% 
    group_by(dv) %>% 
    count() %>% 
    ungroup() %>% 
    mutate(p = n/sum(n)) %>%
    mutate(source = "biased sample, unadjusted") %>%
    dplyr::select(source, dv, p)
  
# combine and plot
  overall_pop %>%
  bind_rows(overall_sample_rand) %>%
  bind_rows(overall_sample_obs) %>%
  mrp_plot(., plotTitle = "Comparing a <span style = 'color:#F17DB1;'>convenience sample</span> to the <span style = 'color:#4FBFA5;'>true population</span>", 
           annLabel = "Convenience sample is\nfar from the true value", 
           ann_x = 3.3, ann_y = 0.05, annColor = "#F17DB1", 
           arrow_x_start = 3.6, arrow_y_start = 0.05, 
           arrow_x_end = 4, arrow_y_end = 0.11, arrowColor = "#F17DB1",
           fills = c("white", "#005BE7", "#4FBFA5"), 
           colors = c("#F17DB1", "#005BE7", "#4FBFA5"))

Figure 2: Womp womp convenience sample.

Figure 2 is a bit depressing for #teamConvenience. The clue to why is in the first table (remember our fake census?): men in the population—particularly young men—endorsed mask wearing less often than women. Since our convenience sample skews male and younger, these anti-mask views are over-represented in our dataset. If we stopped here, we’d come to the wrong conclusion about the frequency of mask wearing in the population.

This, friends, is why convenience samples are hard to interpret when the goal is to make some inference about a target population.

Multilevel Regression and Post-Stratification

But all is not lost. Take heart in this example from Wang et al. (2015) (pdf).

Andrew Gelman ² and colleagues ran daily voter polls on the Xbox gaming platform in the run up to the 2012 U.S. Presidential election. This was an opt-in poll. Every day, gamers were presented 3 to 5 questions, including the question, “If the election were held today, who would you vote for?”. The first time a person participated they were also asked to answer several demographic questions.

Xbox opt-in poll. Figure from @wang2015.

Figure 3: Xbox opt-in poll. Figure from Wang et al. (2015).

Like our fake example from Dansia, this real sample skewed male and young compared to U.S. voters. Men represented 93% of the Xbox sample, but only 47% of voters. Very non-representative.

But in a very different twist, the answer Wang et al. (2015) reached with this dataset lined up nicely with representative polls. How did they do it?³ Mister P.

What Is Mister P?

Monica Alexander gives a great overview of MRP in this post. I used her code get started, so consider popping over there to learn from a card carrying demographer.

The “P” in Poststratification

When you have non-representative samples like the ones from Dansia (our fictional country) or Wang et al. (2015)’s Xbox survey, one approach to making the data more representative is to reweight—that is, post-stratify—the results by the known population weights. All you need is a census or a representative survey that includes the variables you want to use for post-stratification.

The first step in post-stratification is to tabulate the count of the population (in the census or the representative survey) that falls into each “cell”. For instance, in the Dansia example there are 30 combinations of gender (3 levels: male, female, non-binary) and 5-year age categories between ages 15 and 64 (10 levels). Everyone in the country falls into one and only one cell.

Population of Dansia, Ages 15 to 64
Cell counts by gender and 5-year age category
Characteristic	Male, N = 323,970	Female, N = 323,758	Non-binary, N = 34,004
5-Year Age Category
15	34,457	34,142	3,556
20	34,372	34,525	3,680
25	34,245	34,474	3,624
30	34,432	34,660	3,636
35	34,524	34,373	3,455
40	30,843	30,546	3,173
45	30,897	30,750	3,244
50	31,046	30,971	3,269
55	30,953	31,039	3,269
60	28,201	28,278	3,098

The overall population of Dansians ages 15 to 64 is 681,732. We can divide each cell count by this figure to get the percentages (cell weights). For instance, men ages 15 to 19 make up 5.1% of the target population, but a whopping 20% of the convenience sample. This means that we need to down-weight responses from 15 to 19 year old men in the Dansia survey.

Population of Dansia, Ages 15 to 64
Cell percentages by gender and 5-year age category
Characteristic	Male, N = 323,970	Female, N = 323,758	Non-binary, N = 34,004
5-Year Age Category
15	5.05%	5.01%	0.52%
20	5.04%	5.06%	0.54%
25	5.02%	5.06%	0.53%
30	5.05%	5.08%	0.53%
35	5.06%	5.04%	0.51%
40	4.52%	4.48%	0.47%
45	4.53%	4.51%	0.48%
50	4.55%	4.54%	0.48%
55	4.54%	4.55%	0.48%
60	4.14%	4.15%	0.45%

In the next code chunk, I run the same calculation as above but format the overall_prop in a long tibble to use in a later step.

  pop_df %>%
    mutate(weight=1) %>%
    group_by(gender, ageCat) %>%
    summarise(n = sum(weight))%>%
    ungroup() %>%
    dplyr::select(gender, ageCat, n) %>%
    mutate(overall="overall") %>%
    group_by(overall) %>% 
    mutate(prop = n/sum(n))  %>% 
    ungroup() %>%
    {. ->> overall_prop} %>%
    arrange(ageCat) %>%
    slice(1:9)

# A tibble: 9 x 5
  gender     ageCat     n overall    prop
  <fct>      <chr>  <dbl> <chr>     <dbl>
1 male       15     34457 overall 0.0505 
2 female     15     34142 overall 0.0501 
3 non-binary 15      3556 overall 0.00522
4 male       20     34372 overall 0.0504 
5 female     20     34525 overall 0.0506 
6 non-binary 20      3680 overall 0.00540
7 male       25     34245 overall 0.0502 
8 female     25     34474 overall 0.0506 
9 non-binary 25      3624 overall 0.00532

Finally, before we move onto the MR in MRP, let’s think about why stratification alone might not be sufficient. The next table shows the cell counts from our non-representative Dansia survey of 500 people. Most of the gender X ageCat cells have fewer than 10 people. For instance, there are only 3 women in the 20 to 24 age category.

This is an example of a sparse cell. There are lots of sparse cells in this survey because the respondents were 90% men. This means that with post-stratification alone we’d have to put a lot of weight on a handful of individuals. These 3 women would be the stand-in for all women in this age range.

Dansia Convenience Sample, Ages 15 to 64
Counts by gender and 5-year age category
Characteristic	Male, N = 450	Female, N = 45	Non-binary, N = 5
5-Year Age Category
15	100	9	0
20	95	3	2
25	96	3	0
30	91	5	0
35	23	4	2
40	6	6	1
45	9	3	0
50	13	2	0
55	9	5	0
60	8	5	0

The “MR” (Mister) in Multilevel Regression

The MR before the P stands for multilevel regression. Just as Mister comes before P, multilevel regression can be used prior to post-stratification to avoid relying too heavily on information from sparse cells.

For instance, we learned in the previous section that there were only 3 women in the 20 to 24 age category in the 500-person convenience sample. 2 out of these 3 people said they “never” or “rarely” wear a mask in public. Rather than applying post-stratification weights to this result (66.7%) that is based on 3 people, we can use multilevel regression to pull the estimate toward the other groups. This partial pooling is a key feature of multilevel regression.

Fit a Model

Working with the survey data from our 500-person convenience sample, we can model the association between our outcome (frequency of wearing a mask) and our independent variables: gender and 5-year age category. Just like Dr. Alexander does in her post, we’ll also use the {brms} package in R to fit a Bayesian model. The difference is that we have an ordinal outcome (mask wearing frequency: never, rarely, sometimes, often), so instead of using family=bernoulli(), we’ll use family = cumulative("probit") to fit an ordinal regression model (Bürkner and Vuorre 2019).⁴

  library(brms)
  
  fit <- brm(dv ~ (1|gender) + (1|ageCat),
             data = sample_obs,
             family = cumulative("probit"),
             control = list(adapt_delta = 0.99),
             cores = 4, # depends on your machine
             backend = "cmdstanr") # or rstan, see footnote

brms code is similar to glmer, so if you can write a glmer model you are most of the way there.⁵ In the above chunk we fit a Bayesian ordinal regression model on the convenience sample data (sample_obs) and save the results to the fit object.⁶

I’m only using two grouping variables (gender and ageCat) but you can use as many variables as exist in both your sample and stratification datasets. It takes some wrangling to make sure your datasets share the same variable names, factor levels, etc.

Predict the DV for Each Subgroup

In the last steps, we’ll use the results of the regression to predict the mask wearing outcome for all of our subgroups.

  library(tidybayes)

  overall_sample_est <- fit %>%
    add_predicted_draws(newdata = overall_prop,
                        allow_new_levels=TRUE) %>%
    mutate(
      dv = .prediction,
      p = 1
    ) %>%
    mutate(.prediction_prop = p*prop) %>%
    group_by(.draw, dv) %>%
    summarise(p = sum(.prediction_prop)) %>%
    group_by(dv) %>%
    mean_qi(p) %>%
    mutate(source = "biased sample, adjusted") %>%
    dplyr::select(source, dv, p) %>% 
    mutate(dv = factor(dv, 
                       levels=c("never", "rarely", "sometimes", "often"), 
                       labels=c("never", "rarely", "sometimes", "often"), 
                       ordered = TRUE))

The add_predicted_draws() function takes as inputs our fit object and our overall_prop object, which is a tibble with cell proportions and subgroup labels.

# A tibble: 30 x 5
   gender ageCat     n overall   prop
   <fct>  <chr>  <dbl> <chr>    <dbl>
 1 male   15     34457 overall 0.0505
 2 male   20     34372 overall 0.0504
 3 male   25     34245 overall 0.0502
 4 male   30     34432 overall 0.0505
 5 male   35     34524 overall 0.0506
 6 male   40     30843 overall 0.0452
 7 male   45     30897 overall 0.0453
 8 male   50     31046 overall 0.0455
 9 male   55     30953 overall 0.0454
10 male   60     28201 overall 0.0414
# … with 20 more rows

add_predicted_draws() plugs the subgroup information like “male” and age category “15” into the model we fit in the previous step and predicts the value of the dependent variable 4,000 times for all 30 subgroup combinations. This results in a tibble with 120,000 rows.

  step1 <- fit %>%
    add_predicted_draws(newdata = overall_prop,
                        allow_new_levels=TRUE) %>%
    dplyr::select(gender, ageCat, n, prop, .draw, .prediction)
  step1

# A tibble: 120,000 x 8
# Groups:   gender, ageCat, n, overall, prop, .row [30]
   overall  .row gender ageCat     n   prop .draw .prediction
   <chr>   <int> <fct>  <chr>  <dbl>  <dbl> <int> <fct>      
 1 overall     1 male   15     34457 0.0505     1 rarely     
 2 overall     1 male   15     34457 0.0505     2 never      
 3 overall     1 male   15     34457 0.0505     3 never      
 4 overall     1 male   15     34457 0.0505     4 rarely     
 5 overall     1 male   15     34457 0.0505     5 rarely     
 6 overall     1 male   15     34457 0.0505     6 rarely     
 7 overall     1 male   15     34457 0.0505     7 never      
 8 overall     1 male   15     34457 0.0505     8 never      
 9 overall     1 male   15     34457 0.0505     9 rarely     
10 overall     1 male   15     34457 0.0505    10 never      
# … with 119,990 more rows

Post-Stratify

Next, we set the predicted value, p, to 1 for each row,⁷ multiply by the cell proportion that we obtained from the census (post-stratification), and sum these weights by draw (4,000) and level of the dependent variable (4). This resulted in 4,000 estimates of the frequency distribution of the dependent variable.

  step2 <- step1 %>%
    mutate(
      dv = .prediction,
      p = 1
    ) %>%
    mutate(.prediction_prop = p*prop) %>% 
    group_by(.draw, dv) %>%
    summarise(p = sum(.prediction_prop))
  step2

# A tibble: 15,988 x 3
# Groups:   .draw [4,000]
   .draw dv            p
   <int> <fct>     <dbl>
 1     1 never     0.212
 2     1 rarely    0.268
 3     1 sometimes 0.138
 4     1 often     0.383
 5     2 never     0.389
 6     2 rarely    0.247
 7     2 sometimes 0.147
 8     2 often     0.217
 9     3 never     0.257
10     3 rarely    0.251
# … with 15,978 more rows

Get Point Estimates and Credible Intervals

Lastly, the mean_qi() function creates point estimates and credible intervals for each level of the dependent variable.

  step3 <- step2 %>%
    group_by(dv) %>%
    mean_qi(p)
  step3

# A tibble: 4 x 7
  dv            p .lower .upper .width .point .interval
  <fct>     <dbl>  <dbl>  <dbl>  <dbl> <chr>  <chr>    
1 never     0.227 0.0558  0.405   0.95 mean   qi       
2 rarely    0.333 0.147   0.540   0.95 mean   qi       
3 sometimes 0.182 0.0461  0.355   0.95 mean   qi       
4 often     0.259 0.0911  0.448   0.95 mean   qi

Recap

Let’s recap this workflow:

We fit the model in {brms} using data from a convenience sample of 500 individuals (sample_obs).
Model predictors included two independent variables that are also present in the population stratification data [~ (1|gender) + (1|ageCat)]. This is important. If these variables do not exist in the population stratification data, you can’t do the next step.
We used the resulting model formula contained in fit to predict the dependent variable from aggregated data about our population [add_predicted_draws()]. We’re not predicting the DV for individuals. We’re predicting it for subgroup cells. For sparse cells like gender=="female" & ageCat=="15", the multilevel regression pulls the estimates toward the other cells.
As this is a Bayesian setup, we obtained thousands of draws from the posterior predictions of the model. So the last steps aggregate over these draws to give us a point estimate and a credible interval for each level of our mask wearing dependent variable.

Plot

And now let’s look at the results. Figure 4 plots these adjusted estimates against the results of the other approaches we reviewed in this post (unadjusted, random sample).

# combine and plot
  overall_pop %>%
  bind_rows(overall_sample_rand) %>%
  bind_rows(overall_sample_obs) %>%
  bind_rows(overall_sample_est) %>%
  mutate(source = factor(source,
                         levels=c("biased sample, unadjusted", 
                                  "biased sample, adjusted", 
                                  "random sample",
                                  "true population"))) %>%
  mrp_plot(., plotTitle = "Comparing different sample estimates to the <span style = 'color:#4FBFA5;'>true population</span>", 
           annLabel = "Unadjusted estimate is\nfar from the true value", 
           ann_x = 3.3, ann_y = 0.05, annColor = "#F17DB1", 
           arrow_x_start = 3.6, arrow_y_start = 0.05, 
           arrow_x_end = 4, arrow_y_end = 0.11, arrowColor = "#F17DB1",
           fills = c("white", "#F17DB1", "#005BE7", "#4FBFA5"), 
           colors = c("#F17DB1", "#F17DB1", "#005BE7", "#4FBFA5")) +
    annotate("text", x = 4.2, y = .40, 
             label = "Adjusted estimates are\nclose to the true value",
               size = rel(3), color="#F17DB1") + 
    geom_curve(aes(x = 3.9, y = .40, xend = 3.8, yend = .26), 
               color = "#F17DB1", arrow = arrow(type = "open",
                                                length = unit(0.15, "inches")),
               curvature = -.5, angle = -100, ncp =15)

Figure 4: All together now.

Compared to the unadjusted estimates straight from the convenience sample, it appears that adjusting the estimates reduces bias across all levels of the dependent variable.⁸

Summary

I opened this post with a reference to a recent opt-in poll we conducted in Kenya about COVID-19 vaccine hesitancy. We used data from a non-representative (convenience) sample of 486 people to estimate the percentage of people who would be likely to get vaccinated if a vaccine were available for free today. I imagined the reaction of a skeptical reader:

Skeptical readers might be thinking, “This is based on data from 486 people who responded to ads on Facebook. How is this representative of the country?

The short answer is, “It’s probably not”. Without adjustment, raw summaries from convenience samples like this are likely to be biased and misleading. But if suitable data about the target population exists (census, nationally representative survey), techniques like Mister P can make the data more representative.

To learn more, see posts by Monica Alexander, Andrew Gelman, and see Rohan Alexander’s reading list under “Next Steps”.

Data and Code

See this gist.

Bürkner, Paul-Christian, and Matti Vuorre. 2019. “Ordinal Regression Models in Psychology: A Tutorial.” Advances in Methods and Practices in Psychological Science 2 (1): 77–101.

Lau, Charles Q, Ansie Lombaard, Melissa Baker, Joe Eyerman, and Lisa Thalji. 2019. “How Representative Are Sms Surveys in Africa? Experimental Evidence from Four Countries.” International Journal of Public Opinion Research 31 (2): 309–30.

Wang, Wei, David Rothschild, Sharad Goel, and Andrew Gelman. 2015. “Forecasting Elections with Non-Representative Polls.” International Journal of Forecasting 31 (3): 980–91.

Evidence from several African countries comparing SMS surveys to face-to-face surveys suggests that results may not be representative of the general population—particularly with respect to education—even after poststratification adjustments (Lau et al. 2019).↩︎
I’m calling out the “senior” author Gelman specifically because I first learned about Mister P from his blog.↩︎
See this post by Rohan Alexander for a nice review of Wang et al. (2015).↩︎
Credit for the assist goes to to Matthew Kay, author of the {tidybayes} package, for contributing this helpful answer to my question on the Stan Forum on ordinal outcomes with MRP. And of course, we owe a big debt of gratitude to Paul Bürkner for his amazing package brms and for lots of handy tutorials like his one on ordinal regression (Bürkner and Vuorre 2019).↩︎
By default, brms models are fit with rstan. However, Andrew Heiss tweeted the other day that switching to cmdstanr will speed up the model runs. Boy was he right. To use this backend, you have to first install {cmdstanr} and then run the function install_cmdstan().↩︎
We could also model the interaction of age group and gender. (h/t Staffan Betner)↩︎
See Kay’s notes here.↩︎
I like how this figure packs in a lot of information, but it’s incomplete without a way to represent uncertainty in the estimates. I’m hopeful that readers might have some ideas for additional visualizations.↩︎

Mister P helps us understand vaccine hesitancy