Thursday, October 25, 2012

Likely Liars, or Why You Should Ignore Anything You Hear After "Poll of Likely Voters" If You've Ever Been To a Bar

Confused yet about what a "likely voter" is? Yeah, me too. I mean I know what a registered voter is, and I know what an actual voter is, but "likely"? That sounds like statistical mumbo-jumbo to me. Luckily, clearing that up is exactly why we're here, because it points to what might be a relatively significant bias in recent (i.e. this year and 2008) polling models. But bear with me, the statistics get a bit more tricky before they get any clearer, although the payoff is fairly substantial in my humble opinion.

Wanna skip the stats? "Likely" voter polling results are likely to be wrong, because the same people who are most likely to say they'll vote are also the same people who are most likely to lie about it. At best, this makes the forecast a lot more error-prone; at worst, it adds serious bias to the results.

First, let's just be clear about what (most) polling firms do. They pick a mode (or several) of contacting eligible voters (think calling random landline or cellphone numbers, sending mail or email, etc.), and from the respondents they form a (noisy) sample meant to represent the voting population. Often this involves overweighting or underweighting survey responses to properly match the known demographics of registered voters (or, more simply, the adult population) in a particular polling region. From this, they extrapolate support within a given confidence interval for a particular candidate if (assuming their weighting is right) all registered voters voted.
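That reweighting step is easier to see with numbers than with words. Here's a minimal sketch of demographic reweighting (sometimes called post-stratification); every figure below is invented for illustration, not taken from any real poll:

```python
# Known shares of registered voters by age group (say, from census data):
population_share = {"18-29": 0.20, "30-64": 0.60, "65+": 0.20}

# Shares actually reached in the survey (young people are harder to reach):
sample_share = {"18-29": 0.10, "30-64": 0.60, "65+": 0.30}

# Raw candidate support observed within each surveyed group:
support = {"18-29": 0.60, "30-64": 0.50, "65+": 0.40}

# Weight each group up or down so the sample matches the population:
weights = {g: population_share[g] / sample_share[g] for g in support}

raw = sum(sample_share[g] * support[g] for g in support)
weighted = sum(sample_share[g] * weights[g] * support[g] for g in support)

print(round(raw, 3), round(weighted, 3))
```

Because the survey under-reached the candidate's strongest (young) group, the raw estimate understates support; the weights correct for that.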

But only 90% of registered voters ACTUALLY vote (yeah, I know, commentary on state of participation etc. for another time, although this number is much higher than I thought it was, given the wide discrepancy between registered and "likely" results), so polling firms try to adjust their survey respondents (again through weighting or in some cases, trimming) to more accurately represent those who are going to actually put the time in on election day. Their methods for doing this are mostly proprietary, but Gallup has some notes on how they adjust here. To quickly summarize, you are only a likely voter if you said you did all of:
  • know where your polling place is
  • have voted there in a past election
  • voted in the last presidential election
  • say that you "always" vote
Gallup has some adjustments for very young voters (i.e. < 21) and other minor tweaks. Gallup simply drops responses that don't meet these constraints and generates a new "likely" voter result. Other polls may employ a more sophisticated weighting scheme based on core demographics of actual votes in the last few elections. Of course, it is well known that this biases the survey toward an even older, more stable population than the registered-voter result, so it is often assumed that the likely voter model tends to magnify Republican support. Which is exactly what you want, since traditionally these demographics are associated with actual voter turnout.
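Mechanically, a screen like the one above is just a filter: pass every question or get dropped. A toy version (hypothetical field names; the real Gallup screen has more questions plus the adjustments mentioned above):

```python
def is_likely_voter(r):
    """Keep a respondent only if they pass every screening question."""
    return (r["knows_polling_place"]
            and r["voted_there_before"]
            and r["voted_last_presidential"]
            and r["says_always_votes"])

respondents = [
    {"knows_polling_place": True, "voted_there_before": True,
     "voted_last_presidential": True, "says_always_votes": True},
    {"knows_polling_place": True, "voted_there_before": False,
     "voted_last_presidential": True, "says_always_votes": True},
]

likely = [r for r in respondents if is_likely_voter(r)]
print(len(likely))  # drops the second respondent
```

The key point for everything that follows: every one of those fields is self-reported.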

But what if people are lying?

More accurately, what if lying is correlated systematically with various partisan demographics?

What you have is a situation in which the likely voter model not only adds noise, but actually biases the results, such that you would be better served by not doing the adjustment at all. There's some analysis here and here which seems to indicate that in 2008 there was a breakdown in the accuracy of likely voter models versus registered voter results.

If people who vote look like people who don't vote (that is, the 1/3rd that doesn't vote looks exactly like the 2/3rds that do), then you might as well choose "likely" voters at random. The main effect here is that the result will be the same as the registered voter survey, but with a higher variance.
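Why higher variance? Randomly keeping only the likely voters shrinks the sample, and the margin of error grows as one over the square root of the sample size. A quick sketch, assuming a standard 95% margin of error for a proportion and a made-up poll of 1,000 registered voters:

```python
import math

def margin_of_error(p, n):
    """95% margin of error for a proportion p estimated from n respondents."""
    return 1.96 * math.sqrt(p * (1 - p) / n)

n_registered = 1000
n_likely = int(0.63 * n_registered)  # keep ~63%, the rough adult turnout rate

print(round(margin_of_error(0.5, n_registered), 4))  # registered-voter MOE
print(round(margin_of_error(0.5, n_likely), 4))      # larger "likely" MOE
```

Same point estimate, wider error bars: the screen costs you precision even in the best case where it adds no bias at all.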

What do we need to have that happen? People who report as likely voters need to be lying about voting in greater proportion than those who don't. Huh? How can we ever know that?

Enter the very smart Stephen Ansolabehere and Eitan Hersh with their recent and hopefully no longer overlooked paper: "Validation: What Big Data Reveal About Survey Misreporting and the Real Electorate." Main point:
We show that studies of representation and participation based on survey reports dramatically mis-estimate the differences between voters and non-voters.
What do they do to get there? They carefully compare survey reports of election behavior with actual voting outcomes, and then investigate the demographic determinants of mis-reporting, which I less charitably call lying. They find that people lie when it's important to them to lie; the most likely to lie about voting are politically engaged, well-educated, church-going, partisan (for both parties) males.

Now to be fair to Ansolabehere and Hersh, I'm going to play fast and loose with their study from here out, but that's why it's my blog and not theirs. I'll at least tell you what I'm doing, in as straightforward a manner as I can.

First, their study asks people after they voted (or didn't) whether they voted, not their ex ante intentions measured against their actual outcomes. To link this to current surveys, I need to assume that these are temporally stable; that is, the same people who lie about having voted are also the same people who lie about expecting to vote, or at least there isn't a massive degree of difference between them.

Formally, what Ansolabehere and Hersh report is the probability that you report you voted conditional on the fact that you didn't. What we need is the inverse: the probability that you don't vote conditional on the fact that you reported you would. But no worries; that's exactly what Bayes' rule, plus our temporal assumption above, is for. Given values for the probability of not voting and the probability of reporting that you will vote, we can generate what we need easily. The bottom line: for the values reported in their paper, there's about a 20% chance that someone who reports they'll vote actually won't. If this were random, it would just add significantly to the reported errors in the polling results. But it's not.
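The inversion itself is three lines of arithmetic. Here's the Bayes' rule calculation with illustrative inputs (my round numbers, not the paper's exact figures): roughly 63% of adults vote, voters report truthfully, and about half of nonvoters claim they voted anyway.

```python
def p_no_vote_given_report(p_not_vote, p_report_given_not_vote,
                           p_report_given_vote=1.0):
    """P(didn't vote | reported voting), by Bayes' rule."""
    p_vote = 1.0 - p_not_vote
    # Total probability of reporting a vote, over voters and nonvoters:
    p_report = (p_report_given_vote * p_vote
                + p_report_given_not_vote * p_not_vote)
    # Invert the conditional:
    return p_report_given_not_vote * p_not_vote / p_report

print(round(p_no_vote_given_report(0.37, 0.5), 2))  # ~0.23, i.e. roughly 20%
```

With these inputs, nearly a quarter of self-reported voters won't actually show up, which is the ballpark of the 20% figure above.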

Since political science believes in data replication, I can very easily regenerate these values for various dimensions directly from their data. For simplicity (it's a blog dammit), we'll look at men vs. women. Unsurprisingly, men lie about voting 25% of the time, while women lie only about 14% of the time; in other words, men lie about voting nearly 80% more often than women do. Yeah, I know, I've been to a bar or three in my time, so this isn't the interesting thing.

The bottom line? If "likely voter" numbers are based on self-reporting and men favor Republicans more than women do, then the likely voter results are biased versus what will actually occur. Why? Because men are more likely liars than women. And that's a different type of gender gap, one that potentially biases the "likely voter" polls that virtually all polling organizations will be reporting between now and November 6th.
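To see the mechanism in action, here's a toy simulation. The 25%/14% misreport rates come from the discussion above; everything else is invented: assume men split 55-45 Republican, women 55-45 Democratic, a 50/50 gender split among registered voters, and 63% true turnout in both groups.

```python
def likely_share(turnout, lie_rate):
    """Fraction of a group that *reports* being a likely voter:
    true voters plus nonvoters who misreport."""
    return turnout + (1 - turnout) * lie_rate

men = likely_share(0.63, 0.25)    # male "likely voter" pool
women = likely_share(0.63, 0.14)  # smaller female pool: fewer liars

# Republican support among self-reported "likely voters"...
rep_likely = (0.55 * men + 0.45 * women) / (men + women)
# ...versus among people who actually vote (equal turnout => 50-50 race):
rep_actual = 0.50

print(round(rep_likely, 4))  # slightly above 0.50: a systematic GOP tilt
```

The liars inflate the male share of the "likely" pool relative to the real electorate, so the likely voter estimate drifts Republican even though the true race is tied. The shift is small with these inputs, but it's a bias, not noise: it points the same direction in every poll that screens this way.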

My advice? Just look at the registered voter polling numbers; a lot of the people who report as "likely" are unlikely to actually show up.

And we're not even talking about the people who say they won't show up, but actually do.


  1. "But only 90% of registered voters ACTUALLY vote" -- Where in the world did you get that figure?? The truth is that fewer than 2/3 of registered voters actually vote.

  2. William--

    Click on the link...and look at Table One in the Census publication, Voting and Registration in the Election of 2008. I quote:

    "Historically, the likelihood that an individual will actually vote once registered has been high, and 2008 was no exception. Of all registered individuals, 90 percent reported voting, up slightly from 89 percent in the 2004 presidential election."

    What you might be thinking of is the percent of total population that votes...that's about 63.6%.