Since the internet is overflowing with politics these days, we should take a minute to talk about some of the mathematics of elections. We encourage everyone to vote and to take an interest in the issues, but we’ll steer clear of all of that here. Comments are welcome as always, but we’ll moderate away any that stray too far.
There are three kinds of lies: lies, damned lies, and statistics. — Benjamin Disraeli
In this post we’ll talk a little about the mathematics of polling. Of course, everybody knows about polls. Hundreds or thousands of people are contacted and asked to answer a set of questions. Based on a statistical analysis of their answers, the results are published saying things like:
“People like vanilla ice cream more than chocolate ice cream, 63% to 21%, with an error of plus/minus 3%”
What does this mean?
Let’s say you’ve selected 1000 people who accurately reflect the adults of the US and polled them with the question in this example. The above result means 630 of them said they like vanilla, 210 said chocolate, and the rest said something else or declined to answer the question. Now imagine you asked everyone in the US the same question and P% of them said they prefer vanilla. The above poll tells us that 63 - 3 ≤ P ≤ 63 + 3.
So somewhere between 60% and 66% of American adults prefer vanilla ice cream.
Actually we should be even more careful. A poll comes with some confidence level. Unless stated otherwise, a poll usually comes with a 95% confidence level. Let’s say our poll has a 95% confidence level.
What the poll really tells us is that there is a better than 95% chance that between 60% and 66% of American adults prefer vanilla ice cream.
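To get a feel for what that 95% means, here is a small simulation sketch. It assumes, purely for illustration, that the true value is P = 63% and that each of the 1000 respondents answers independently; the function name and its parameters are our own inventions, not anything a pollster actually runs. It repeats the imaginary poll many times and counts how often the result lands within 3 points of the truth.

```python
import random

def simulate_coverage(true_p=0.63, n=1000, margin=0.03, trials=2000, seed=0):
    """Fraction of simulated polls whose result is within `margin` of true_p."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        # Each respondent independently prefers vanilla with probability true_p.
        vanilla = sum(rng.random() < true_p for _ in range(n))
        estimate = vanilla / n
        if abs(estimate - true_p) <= margin:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    print(f"Share of polls within 3 points of the truth: {simulate_coverage():.1%}")
```

Under these assumptions the share should come out close to 95%, which is the sense in which the margin of error and the confidence level go together.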
There is always some chance that a poll has selected people who excessively like vanilla ice cream purely by chance. You can imagine that if the poll were of only 10 people, then the odds aren’t bad that you could pick a few too many vanilla lovers and skew the results. This becomes less and less likely as you poll more and more people. Alternatively, you can widen the margin of error. For example, if you change the error to plus/minus 5%, then the pollsters are now claiming that P is between 58% and 68%. And it’s more likely that the true value is in this wider window.
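The trade-off between sample size, margin of error, and confidence can be sketched with the standard normal approximation for a proportion. This formula is an assumption on our part (the post itself doesn't say how any particular pollster computes the margin), and the function names below are hypothetical. Using p = 0.5 gives the worst case, which is what pollsters usually quote.

```python
import math
from statistics import NormalDist

def margin_of_error(n, p=0.5, confidence=0.95):
    """Approximate margin of error for a simple random sample of size n."""
    # z is about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * math.sqrt(p * (1 - p) / n)

def confidence_for_margin(n, margin, p=0.5):
    """Approximate confidence level that goes with a given margin of error."""
    standard_error = math.sqrt(p * (1 - p) / n)
    return 2 * NormalDist().cdf(margin / standard_error) - 1

if __name__ == "__main__":
    # More people polled means a smaller margin at the same confidence.
    for n in (10, 100, 1000, 10000):
        print(f"n = {n:>5}: margin of about {100 * margin_of_error(n):.1f} points")
    # A wider margin means higher confidence for the same 1000 people.
    for m in (0.03, 0.05):
        print(f"margin {m:.0%}: confidence {confidence_for_margin(1000, m, p=0.63):.1%}")
```

For 1000 people the worst-case margin comes out at roughly 3.1 points, which is why polls of about that size are so often quoted as plus/minus 3%.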
On his blog, Terence Tao discusses the mathematics of how one can verify the confidence level and margin of error even when you poll only a tiny fraction, 1000/300,000,000 ≈ 0.0000033 (i.e. 0.00033%), of the population.
In any case, when you read a poll you should really say to yourself that the odds are very good that the true value is somewhere in the range given by the stated value plus/minus the error. For example, very recently the University of Cincinnati polled likely voters in Ohio.
- John McCain 48%
- Barack Obama 46%
Which presidential candidate will do the best job of improving our economy?
- Obama 47%
- McCain 44%
Survey of 876 likely voters was conducted October 4-8. The margin of error is +/- 3 percentage points.
Now that we know how to read a poll, we see that the pollsters have interviewed 876 likely voters and assert that there is better than a 95% chance that if the election were held on October 4-8, then McCain would receive somewhere between 45% and 51% of the vote and Obama would receive somewhere between 43% and 49% of the vote. In particular, notice that these intervals overlap by quite a bit! So, based on this poll, it is reasonably possible that Obama would beat McCain in the election! Of course, it’s significantly more likely that McCain would win, but it’s important to realize this poll is not saying that McCain would beat Obama by 2% in Ohio!
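As a rough check on this reading, the sketch below redoes the arithmetic for the Ohio numbers. It assumes the poll behaves like a simple random sample and uses the normal approximation, neither of which the pollsters claim; the helper names are hypothetical. The last function gives only a back-of-the-envelope figure for how likely it is that the candidate ahead in the poll is really ahead.

```python
import math
from statistics import NormalDist

N = 876                       # likely voters surveyed
MCCAIN, OBAMA = 0.48, 0.46    # reported shares
Z95 = NormalDist().inv_cdf(0.975)  # about 1.96 for 95% confidence

def margin(n, p=0.5):
    """Worst-case 95% margin of error for a simple random sample."""
    return Z95 * math.sqrt(p * (1 - p) / n)

def chance_first_leads(p1, p2, n):
    """Rough chance that the true lead p1 - p2 is positive.

    Uses the variance of a difference of two shares from the same sample:
    (p1 + p2 - (p1 - p2)**2) / n.
    """
    variance = (p1 + p2 - (p1 - p2) ** 2) / n
    return NormalDist().cdf((p1 - p2) / math.sqrt(variance))

if __name__ == "__main__":
    m = margin(N)
    print(f"Computed margin: about {100 * m:.1f} points")
    for name, share in (("McCain", MCCAIN), ("Obama", OBAMA)):
        print(f"{name}: {100 * (share - m):.1f}% to {100 * (share + m):.1f}%")
    print(f"Rough chance McCain is ahead: {chance_first_leads(MCCAIN, OBAMA, N):.0%}")
```

Under these assumptions the computed margin is about 3.3 points, close to the reported 3, and the chance that McCain is truly ahead comes out at roughly 73%: more likely than not, but far from certain.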
Besides the uncertainty that the math can warn you about through the margin of error and confidence level, there is plenty of room for human error. More on that below.
Well, somebody is doing the poll and somebody is paying for the poll. Often intentional or unintentional biases sneak in. What are some of the ways this happens?
First of all, how you select the people to poll is important. A poll is supposed to represent the opinions of some group of people. Most polls about the presidential election are designed to represent the opinion of people likely to vote in the election. So when people are selected for a presidential poll, people in France, children, etc. are excluded. For the presidential election it’s reasonable to say that we don’t want to include the opinions of French people because the poll is meant to help predict who might be elected and the French won’t be voting in the election. So first you need to know what population of people is meant to be represented by the poll.
Maybe in our ice cream example the poll is meant to be of adults living in the US. In which case a cross section of adults in the US should be represented in the people polled. So men and women, various regions of the country, older and younger people, etc. should all be represented in the group polled. This is very hard to do. For example, if you conduct your poll by phone then automatically you’re biasing your poll towards people who own home phones. People who don’t have a phone, or only use a cell phone, or only an internet-based phone, would all be excluded. It’s easy to skew the population of people you’re polling without even knowing it!
Second, how you phrase questions has a big effect. For example, the online gambling industry did a poll in 2006 on whether or not the government should regulate online gambling. Even before you look at the poll, you can bet that they have ideas about what outcome from the poll they want. An example question:
Q: Many gambling experts believe that Internet gambling will continue no matter what the government does to try to stop it. Do you agree or disagree that the federal government should allocate government resources and spend taxpayer money trying to stop adult Americans from gambling online?
When you phrase the question that way, it’s no surprise that you get that outcome! Another example of this is described here. By the way, both examples are by the Zogby International polling group. Makes you wonder about their integrity!
Third, another way to influence a poll is by the order of the questions. Here is an example, taken from here, in a 2007 poll by Fox News:
39. Who do you trust more to decide when U.S. troops should leave Iraq — U.S. military commanders or Members of Congress?
3% (Don’t know)
40. Last week the U.S. House voted to remove U.S. troops from Iraq by no later than September 2008 — would you describe this as a correct and good decision or a dangerous and bad decision?
44% Correct and good
45% Dangerous and bad
11% (Don’t know)
If you only looked at question 40 you might think this is a reasonable poll question (even that is questionable!), but it’s pretty hard to imagine that question 39 didn’t have at least an unconscious effect on people’s answers to question 40.
Of course, these sorts of problems show up in all sorts of polls. And even when the pollsters are being super careful, there are subtle influences affecting the poll.
Moral: No poll is completely unbiased!
At the very least when you look at a poll, as a math person you should check the sample size, margin of error, and confidence level and think about what that tells you. Then look at who did the poll and what questions they asked to see what influence that might have on the results. A summary of things to look for in a poll is given here.
One way to increase the accuracy of your information is to lump a bunch of polls together. It’s a little like doing one big poll. Websites such as realclearpolitics.com and electoral-vote.com do exactly this. Other websites like fivethirtyeight.com do thousands of computer simulations of elections based on the polling data and use that to make predictions. However, different polls ask different questions, use different methods to select people to poll, and may even be intentionally biased. So the websites have to make decisions about which polls to include or exclude. So even these websites can have biases!
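Here is a minimal sketch of the "one big poll" idea. The three polls below are made-up numbers for illustration only, and weighting each poll by its sample size is just one simple choice; we are not claiming this is how any of the websites above actually combine polls. It also quietly assumes the polls asked the same question of the same population, which is exactly the assumption the previous paragraph warns about.

```python
import math

# Hypothetical polls: (sample size, share for candidate A). Made-up numbers.
POLLS = [(876, 0.48), (600, 0.46), (1000, 0.49)]

def pool(polls, z=1.96):
    """Combine polls as if they were one large simple random sample.

    Returns (pooled share, approximate 95% margin of error, total sample size).
    """
    total = sum(n for n, _ in polls)
    # Weight each poll's share by the number of people it interviewed.
    share = sum(n * p for n, p in polls) / total
    margin = z * math.sqrt(share * (1 - share) / total)
    return share, margin, total

if __name__ == "__main__":
    share, margin, total = pool(POLLS)
    print(f"Pooled share: {share:.1%} from {total} respondents")
    print(f"Pooled margin of error: about {100 * margin:.1f} points")
```

With these made-up inputs the pooled margin is about 2 points, smaller than any single poll could claim, but only if the polls really are comparable.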