Do statisticians know the future?

Given the history of errors in predicting voting outcomes we have witnessed over the decades, you might think there is no way of really knowing what the future holds. How does the era of easily available data affect the methods and results of predicting future outcomes?

By Dominika Tkaczyk

Recently, the news of Hillary Clinton’s collapse at the 9/11 memorial, attributed to pneumonia, made waves in markets across the globe. Analysts claim Clinton’s ailment may decrease her chances of winning the November presidential election. Both the Polish złoty and the Warsaw Stock Exchange felt the tremors. On the Monday after the memorial, the main market index WIG20 fell by 1.3 percent, albeit amid other unfavorable market signals. With so much attention being paid to what’s happening across the ocean, who wouldn’t want a crystal ball that could tell the outcome of the upcoming November vote?

Predictive analytics has been used for decades in areas such as marketing, insurance, financial services and politics, in particular for guiding the decision-making process. How can we know the outcome of a future event, such as a political election, in advance?

If the election results are calculated directly from people’s votes, a simple idea might be to ask everyone who they will vote for. This approach, however, has major flaws: we cannot be entirely sure the respondent won’t change his or her mind before the election date, not to mention some people will invariably lie in a poll. The most obvious drawback of this approach is that interviewing every potential voter is far too expensive to be useful in practice.

Good old polls

Since a direct approach is impractical, scientists have long resorted to statistical forecasting techniques, which are traditionally based on the analysis of representative polls. The idea is that instead of asking everyone about their vote, it’s enough to interview a much smaller random sample of potential voters and project their answers onto the entire population of voters. It works like tasting soup while cooking: we do not have to eat all of the soup to make sure it is salty enough, instead we examine a spoonful and assume the rest tastes the same.

A poll will provide not only a single estimate of the percentage of people voting for specific candidates, but also a range of probable percentage values called a “confidence interval.” Confidence intervals capture the uncertainty that is inherent in drawing conclusions about a larger population from a sample. For example, a poll might conclude that 53 percent (+/-4) of the voters will vote for candidate X at a confidence level of 95 percent. Sounds very specific to a lay person, doesn’t it? What this means in fact is that we are 95 percent sure that the true percentage of all candidate X voters is between 49 percent and 57 percent. Not that impressive anymore, but at least a ballpark figure.
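To see where a figure like “+/-4” can come from, here is a minimal sketch using the textbook normal approximation for a proportion. The sample size of 600 is an assumption chosen for illustration (the article does not state one), and the function names are made up for this example.

```python
import math

def margin_of_error(p_hat, n, z=1.96):
    """Half-width of the normal-approximation interval for a proportion.

    z=1.96 corresponds to a 95 percent confidence level.
    """
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

def confidence_interval(p_hat, n, z=1.96):
    """Return (low, high) bounds around the polled share p_hat."""
    moe = margin_of_error(p_hat, n, z)
    return p_hat - moe, p_hat + moe

# Hypothetical poll: 600 respondents, 53 percent of them back candidate X.
low, high = confidence_interval(0.53, 600)
print(f"{low:.1%} to {high:.1%}")  # 49.0% to 57.0%
```

Under these assumptions, roughly 600 randomly chosen respondents are enough to produce the 49-to-57 range from the example above.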

Polling techniques have a long and rich history dating back to the 19th century. The first known political poll was a local US presidential election poll conducted in 1824 by the Harrisburg Pennsylvanian newspaper. In Poland polling dates back to 1958, when the Public Opinion Research Center (Ośrodek Badania Opinii Publicznej) was created, aimed at conducting sociological research related to people’s views on public matters. The first election polls in Poland were organized much later though – after 1989, once polling voters started to make sense. The first partly free elections in Poland were held in 1989.

Wrong methodology = wrong results

Traditional political forecasting techniques, which rest on strong theoretical foundations, have proved accurate in many cases. However, the design of the poll, as well as the reporting and interpretation of the results, must follow strict scientific rigor so that crucial statistical assumptions are not violated and the results can be trusted.

One of the key parameters of a poll is the size of the sample, as it determines how wide the confidence interval is. The larger the sample, the closer we get to the actual result. The smaller the sample, the less certain the estimate gets. For example, let’s suppose we wish to estimate the fraction of women in the entire human population based on a random sample (we should expect an estimate close to 50 percent). If we use a random sample of only two people, there is a 25 percent chance that our estimated fraction of women will be zero. If, however, our sample includes 100 random people, there is only a 1.8 percent chance we will get an estimate less than 40 percent. Similarly, polls based on smaller samples are in general less trustworthy.
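Both probabilities in this example can be checked with the binomial distribution. The short sketch below assumes that each sampled person is a woman with probability 0.5, independently of the others; the function name is invented for illustration.

```python
from math import comb

def prob_fewer_than(k, n, p=0.5):
    """P(X < k) for X ~ Binomial(n, p): fewer than k women in a sample of n."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))

# No women at all in a sample of two people.
print(prob_fewer_than(1, 2))                 # 0.25

# Fewer than 40 women (an estimate under 40 percent) in a sample of 100.
print(round(prob_fewer_than(40, 100), 3))    # about 0.018
```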

By far the most serious concern in polling is whether the sample is representative, that is, whether it quantitatively reflects various characteristics of the people in the population. If the sample is not representative, or “biased,” we cannot reliably generalize our findings to the entire population.

There are a few potential sources of bias that might make a sample unrepresentative. For example, bias might be built into the method of choosing people for the poll. If we interviewed only people we met at the gym, our conclusions could be generalized to people interested in staying active, rather than to the entire society. Another well-known source of bias is non-response, which happens when some people from the sample refuse to answer the questions, and the characteristics of those who agree to be interviewed are significantly different from those who decline (they might have stronger, more radical views since they are more willing to share them). Poll results may also be affected by response bias, when the answers given by respondents don’t reflect their true beliefs, for example because they consider their opinion unpopular.

A famous example of a badly designed poll was a presidential election poll conducted in 1936 by The Literary Digest, an American general interest weekly magazine that was highly influential at the time. Even though the sample size was huge (10 million individuals were polled and about 2.4 million responded), the results were terribly inaccurate: according to the poll, the Republican candidate Alfred Landon was set for a landslide victory. In November, however, Landon carried only Vermont and Maine and the presidency was won by Franklin D. Roosevelt. It turned out that the sampling techniques (the magazine surveyed the following groups: its own readers, registered automobile owners and telephone users; all groups wealthier than the average at the time) as well as the non-response bias resulted in an extremely nonrepresentative sample. The infamous poll eventually led to major refinements of polling techniques, and also to the demise of the magazine, which never managed to repair the damage to its reputation.
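The lesson that a huge sample cannot rescue a biased one is easy to reproduce in a toy simulation. All numbers below are invented for illustration and are not the 1936 figures: the sketch simply assumes a wealthier minority that leans toward one candidate, then compares a very large poll of that group alone with a small random poll.

```python
import random

rng = random.Random(1936)  # fixed seed so the run is repeatable

# Invented numbers: 30 percent of voters are "wealthy"; candidate A has
# 70 percent support among them and 30 percent among everyone else.
WEALTHY_SHARE, SUPPORT_WEALTHY, SUPPORT_OTHER = 0.30, 0.70, 0.30
true_support = WEALTHY_SHARE * SUPPORT_WEALTHY + (1 - WEALTHY_SHARE) * SUPPORT_OTHER

def draw_voter(wealthy_only=False):
    """Return True if a randomly drawn voter supports candidate A."""
    wealthy = wealthy_only or rng.random() < WEALTHY_SHARE
    return rng.random() < (SUPPORT_WEALTHY if wealthy else SUPPORT_OTHER)

def poll(n, wealthy_only=False):
    """Share of candidate A supporters among n sampled voters."""
    return sum(draw_voter(wealthy_only) for _ in range(n)) / n

biased_estimate = poll(100_000, wealthy_only=True)  # huge but unrepresentative
random_estimate = poll(1_000)                       # small but representative
print(f"true {true_support:.3f}, biased {biased_estimate:.3f}, "
      f"random {random_estimate:.3f}")
```

In this setup the small random poll lands near the true value, while the large biased poll confidently reports the opinion of the wealthy group only.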

Problems with polling-based election forecasting could also be observed in the context of Polish elections. During the 2005 presidential election campaign, most polls predicted a big win for Donald Tusk over Lech Kaczyński (for example OBOP’s poll eight days before the second voting round showed Tusk leading by 14 percentage points), but the final race was won by Kaczyński with 54 percent of the votes. Similarly, in 2015 most polls forecast Bronisław Komorowski’s win in the presidential election, with his lead over Andrzej Duda oscillating between 8 and 13 percentage points. As it turned out, Duda became the president after receiving over 51 percent of the votes.

Data-driven approaches

The era of data brought new, data-driven approaches to political forecasting. They combine historical data with demographic and economic variables in order to create statistical models that provide more reliable estimates of future outcomes. We are also witnessing a growing interest in exploiting social media content (such as Facebook or Twitter posts) for modeling and predicting current and future opinions and actions.

Nevertheless, it would be far too radical to discard classical polling techniques entirely. After all, they provide valuable information about people’s opinions collected in the most direct way possible. One very promising direction is ensemble techniques that use all available poll-related data.
The idea is simple: instead of trusting one particular poll, which might have been biased by methodological errors, we combine the results of many different polls coming from different sources. This approach turns out to be much more reliable than individual forecasts. Indeed, even if some predictions are wrong, we can expect others to be right (or wrong in the opposite direction), which results in smaller average errors.
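A small simulation illustrates why averaging helps. It is a sketch under a strong assumption: every poll misses the truth by an independent random error with a standard deviation of 3 percentage points. The true share, the number of polls and the error size are all made-up values.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

TRUE_SHARE = 0.50   # hypothetical true support for a candidate
N_POLLS = 10        # polls combined into one ensemble
N_TRIALS = 2_000    # how many times the experiment is repeated

single_errors, ensemble_errors = [], []
for _ in range(N_TRIALS):
    # Assumption: each poll has its own independent error (std. dev. 3 points).
    polls = [TRUE_SHARE + rng.gauss(0, 0.03) for _ in range(N_POLLS)]
    single_errors.append(abs(polls[0] - TRUE_SHARE))
    ensemble_errors.append(abs(statistics.mean(polls) - TRUE_SHARE))

print("average error of one poll:    ", statistics.mean(single_errors))
print("average error of the ensemble:", statistics.mean(ensemble_errors))
```

The gain depends on the errors pointing in different directions. If all polls shared the same bias, averaging them would reproduce that bias rather than cancel it.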

Science or crystal ball?

The ensemble approach is the basic idea behind the work of Nate Silver, an American statistician famous for his extremely accurate US presidential election forecasts. His method uses a stream of election poll results combined with demographic and economic information to estimate the allocation of the 538 electoral votes, and as a result the odds of every candidate winning the election. Current estimates, as well as detailed information about the approach, can be found on Silver’s blog.

Silver’s method uses both state-level and national polls. Poll results are weighted based on pollster ratings, historical poll accuracy, methodology used, sample size and time of the poll. As a consequence, a poll published by a company with a poor track record, or conducted using a small sample, will contribute less to the overall prediction. Older polls also have less impact, to correct for opinion drift. Poll results are additionally adjusted to account for: the differences between “registered” and “likely” voters, large fluctuations in polls typically occurring after party conventions, omitting third-party candidates in some polls and also pollsters’ own bias (the “house effect”).
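The weighting idea can be sketched in a few lines. This is not Silver’s actual formula, which the article does not give. The weight below (pollster rating times the square root of the sample size times an exponential recency decay) and the 14-day half-life are assumptions made purely for illustration, as are the two example polls.

```python
from dataclasses import dataclass

@dataclass
class Poll:
    share: float       # candidate's share in the poll, e.g. 0.48
    sample_size: int   # number of respondents
    days_old: int      # age of the poll in days
    rating: float      # hypothetical pollster quality score in (0, 1]

def poll_weight(poll, half_life_days=14.0):
    """Illustrative weight: larger, fresher, better-rated polls count more."""
    recency = 0.5 ** (poll.days_old / half_life_days)
    return poll.rating * (poll.sample_size ** 0.5) * recency

def weighted_average(polls):
    """Weighted mean of the polled shares."""
    weights = [poll_weight(p) for p in polls]
    return sum(w * p.share for w, p in zip(weights, polls)) / sum(weights)

polls = [
    Poll(share=0.48, sample_size=1600, days_old=0, rating=1.0),   # weight 40
    Poll(share=0.40, sample_size=400, days_old=14, rating=0.5),   # weight 5
]
print(round(weighted_average(polls), 3))  # 0.471
```

The plain average of the two polls would be 0.44. The weighted result sits much closer to the large, recent, well-rated poll.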

Predictions of what will happen in the Electoral College are calculated by statistical models using weighted and adjusted poll results combined with demographic and economic data, related mostly to race, religion, jobs (nonfarm payrolls), manufacturing (industrial production), income (real personal income), spending (personal consumption expenditures), inflation (the consumer price index) and the stock market (S&P 500 index). The final prediction scores come from thousands of Monte Carlo simulations of the actual election, which account for the uncertainty in the forecast and also for correlated errors between the states (“If Trump beats his polls significantly in Ohio, he’ll probably do so in Pennsylvania too”).
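The Monte Carlo step, including correlated errors, can be illustrated on a toy map. Everything here is hypothetical: the four states, their electoral votes, the polled leads and the error sizes are invented, and the model is far simpler than Silver’s. The point is only that one shared “national” error per simulated election makes the states miss in the same direction.

```python
import random

rng = random.Random(538)  # fixed seed so the run is repeatable

# Made-up toy map: state -> (electoral votes, polled lead of candidate A in points).
STATES = {"North": (10, 3.0), "South": (8, -2.0), "East": (6, 1.0), "West": (5, -4.0)}
TOTAL_VOTES = sum(votes for votes, _ in STATES.values())

def simulate_once(national_sd=2.0, state_sd=3.0):
    """Run one simulated election; return electoral votes won by candidate A."""
    # One error shared by all states: this is what correlates the misses.
    national_error = rng.gauss(0, national_sd)
    won = 0
    for votes, lead in STATES.values():
        # Each state also gets its own independent error.
        if lead + national_error + rng.gauss(0, state_sd) > 0:
            won += votes
    return won

def win_probability(n_sims=20_000):
    """Share of simulations in which candidate A wins a majority of votes."""
    wins = sum(simulate_once() > TOTAL_VOTES / 2 for _ in range(n_sims))
    return wins / n_sims

print(f"Candidate A wins in {win_probability():.1%} of simulations")
```

Repeating the simulated election many times turns a set of uncertain state polls into a single winning probability, which is the kind of figure quoted for each candidate.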

Triumph of data science

Silver’s approach has proved to be extremely accurate in forecasting election outcomes. In 2008, the method predicted Barack Obama’s win over John McCain with 349 to 189 electoral votes (the real results were 365 to 173), while the results of the national polls were much more balanced. Moreover, Silver was able to correctly predict the outcome in all states but one (he incorrectly anticipated a slight McCain win in Indiana).

In 2012 the overall election result was once again predicted correctly (Obama’s win over Mitt Romney with 313 to 225 electoral votes; the real results were 332 to 206), while most media once again forecast a smaller difference between the candidates. This time round, all of Silver’s state-level predictions were correct, which gave him worldwide recognition.

So what will happen this fall? According to Silver’s method, the odds of winning the presidency are the following (as of September 12): Hillary Clinton 69.7 percent (307.6 electoral votes), Donald Trump 31.3 percent (229.8), Gary Johnson less than 0.1 percent (0.5). The method predicts Clinton’s victories in 28 states, with the tightest races in Florida, Ohio, North Carolina, and Iowa. The national polls on average put Clinton at 42.0 percent and Trump at 38.9 percent. Will we witness data science’s triumph over traditional polling methods once again? We will find out in November.

