Polling 101

The Polling Process

Usually a candidate, PAC, or media outlet decides they want a poll on something, so they hire a pollster. Most of the time the buyer isn’t conducting the interviews directly or doing the statistical analysis. Using a polling agency helps cut down on bias, and pollsters often have expertise that the buyer of the poll lacks.

The foundation of most statistical inference is randomness, so pollsters take random samples. This is usually done by calling random phone numbers (the main method) or by sampling individuals from internet panels. Without randomness, a poll likely isn’t representative, and most statistical tools will not apply well.

Why Only You Can Prevent Bad Political Polls

Since not everyone selected actually answers a call or checks their email, polls have nonresponse bias. Nonresponse bias arises when the people who don’t respond have different opinions than those who do, and that difference introduces error into the poll. You can help fight nonresponse by joining panels and answering survey calls.

When you look at the details of a poll broken down by group (the crosstabs), pollsters sometimes don’t report data on certain groups because they didn’t get enough respondents for the estimates to be reliable. This does not mean they got no responses, even if the table says 0% or NA. Every reputable pollster I know of is trying to reach those groups, but they are struggling. Young people aged 18-34 are one of the hardest groups to poll, but they are a highly important demographic in the 2020 Democratic nomination process.

Thankfully, nonresponse can be addressed by a technique called reweighting. Thanks to data sources like the US Census, we can take the responses from our sample and adjust them to be representative of the population. This doesn’t completely fix nonresponse bias, but it helps. Nonresponse is preventable if more people participate in the polling process: if everyone did just one or two polls in their adult lives, we would have much better data.
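As a toy illustration of reweighting, suppose young voters make up 30% of the population but only 15% of a poll’s respondents. All of the group shares and support numbers below are made up for illustration:

```python
# A minimal sketch of reweighting (post-stratification).
# Every number here is hypothetical.

# Share of each age group in the population (e.g., from Census data)
population_share = {"18-34": 0.30, "35-64": 0.45, "65+": 0.25}

# Share of each age group among poll respondents (young people underrepresented)
sample_share = {"18-34": 0.15, "35-64": 0.50, "65+": 0.35}

# Candidate support within each group in the raw sample
support_in_group = {"18-34": 0.60, "35-64": 0.50, "65+": 0.40}

# Raw (unweighted) estimate just averages respondents as they came in
raw = sum(sample_share[g] * support_in_group[g] for g in sample_share)

# Reweighted estimate: weight each group's responses by its population share
weighted = sum(population_share[g] * support_in_group[g] for g in population_share)

print(raw, weighted)
```

Because the group that favors the candidate is underrepresented in the sample, the raw average understates their support; weighting each group by its true population share corrects for that, at least for the groups we can measure.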

Margin of Error and Why Polling Doesn’t Always Get it Right

Then the pollster computes the margin of error. The margin of error is a statistical quantity, computed from the data at a chosen confidence level (usually 95%), with this interpretation: if we took a large number of polls, about 95% of the intervals formed by each poll result plus or minus its margin of error would contain the true value. But that 95% guarantee only holds over a large number of surveys, and typically there aren’t enough state-level polls for it to hold well in practice. If the difference between two options is smaller than the margin of error, you can call that a statistical tie. A statistical tie means it is reasonable to believe either candidate is actually in the lead. So in most cases it is not surprising when a candidate a few polls say will win ends up losing on election day.
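For a simple random sample, the textbook 95% margin of error for a proportion is 1.96 × √(p(1−p)/n). A quick sketch, using a hypothetical 1,000-person poll showing a candidate at 50%:

```python
import math

def margin_of_error(p, n, z=1.96):
    """95% margin of error for a proportion from a simple random sample."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical poll: 1,000 respondents, candidate at 50%
moe = margin_of_error(0.50, 1000)
print(f"+/- {moe * 100:.1f} points")
```

This works out to roughly ±3.1 points, which is why polls of about a thousand people so often report a margin of error around 3.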

We also know that in practice polls are off by about two times the calculated margin of error. This extra error comes from a combination of nonresponse bias, day-to-day changes in people’s opinions, and the occasional mistakes people make when filling out a survey. None of those factors are fully accounted for in a basic margin of error calculation. You can’t avoid uncertainty in polling.

However, there are statistical models that combine polls to make better predictions than any single poll. These models are also uncertain, but they tend to do well enough to predict state-level elections to within about two points on average. Polls in the last few weeks before an election are typically about 3.5 points off from the result.

Polling will sometimes be wrong at predicting the winner. The only way to get an exact answer to a question is to literally ask everyone and have everyone answer, and that is very difficult. But polling remains one of the best ways to learn about public opinion, and polling can’t do its job without help from the public. Next time you get a call marked “Survey” on your phone, please answer.

Election Modelling isn’t Inherently Political

One of the sad trends I have noticed is the desire to attack political journalists, pollsters, or election modelers and dismiss their work because it supposedly fits their political views. An example of this is someone saying Nate Silver only predicted the Democrats would flip the House because he is a Democrat, and not because there was actual evidence for it (and, as you know, the flip actually happened). There might be journalists, pollsters, and modelers who cannot separate their politics from their work. But I assume most of them, like me, try to put their politics aside and follow the facts. I think it’s important to draw attention to this issue, and also to share that, as a conservative-leaning independent, I do trust people on the other side of the spectrum to do a good job.

What has surprised me as a newcomer to polling analysis is how some people view the polls and models as something used to promote an agenda or attack the president. I’ve struggled to convince some of my own friends and family that the polls can be trusted even though they didn’t predict that Trump would win the presidential election. In some people’s minds, polls aren’t worth dealing with, so you should just let the phone ring when “Survey” comes up on the caller ID.

This disconnect between members of the public and the polling and election modeling community is a problem. Combine it with a mediocre public understanding of probability, and you get mistrust of the models because they are “flawed.” I will acknowledge that all models are imperfect and can always be improved, but we shouldn’t attack experts because their political opinions differ from ours.

We should try to improve public support of the polls and models, because if public trust is low, response rates may go down and, with them, the errors may go up, resulting in a self-fulfilling prophecy.

In particular, I think the polling community needs to reach out to conservatives to try to build trust. If there is a polling trust gap between conservatives and liberals, it could affect how the polls perform.

But I trust the polls and the models. I know they have flaws, but that is the nature of all statistical modeling. And the power of political polling goes beyond election prediction: it helps us understand how the electorate feels about politicians and policies. The challenging nature of this field and its potential for statistical education of the public are why I do what I do.


For me, this has never been about my politics, and I trust the models of those whose political opinions and demographics are different than mine.  In this era of tribalism and polarization, we need to acknowledge that the field of political polling analysis isn’t inherently political.

If Beto O’Rourke Wins the Senate Tonight, Here’s (Probably) Why

One of the criticisms of election predictors in 2016 was that some people felt the risks and potential for errors were not explained. I don’t think Beto will win the Senate race; his odds right now are somewhere in the range of rolling a die and getting a 6, to rolling a die and getting either a 5 or 6. In other words, weird things can happen, but they probably won’t. I am writing this post so that no one can say I misrepresented a Cruz win in Texas as a sure event or was unclear about the possibility of a model or poll error, not because I am trying to hedge my bets.

But there are several factors at play that could cause the polls and my model to be wrong in Texas and other Senate races. I trust my model and the polls, but I know from experience that there are a few cases where the polls and my model have issues predicting a winner, and I wanted to share those scenarios. I don’t want to give the impression that my model is perfect and always right, because it isn’t. No statistical model is always right, and something as incredibly complicated as an election means that nothing is ever certain. But I do know that my model on average predicts the outcome to within 2.5 points in presidential elections and calls the winner in over 90% of the races. I have never predicted Senate races, but I have no reason to believe this will change significantly. Some people may wonder why I bother to predict something that will eventually happen anyway, knowing that I am going to be wrong sometimes. But I do this because it’s fun, and it makes election night more exciting to have some skin in the game.

Scenario 1: Systemic Polling Error (Beto wins by 2 or more points)

Under this scenario, the polls failed to capture the enthusiasm of young and minority voters and incorrectly estimated who would turn out. There are a lot of telephone polls, and they are probably more apt to miss Beto’s base than internet polls. One great example of this is a New York Times / Siena College poll: they struggled to reach young and minority voters, and those categories were reweighted. However, when you don’t get an accurate sample, you introduce error. It’s not the pollsters’ fault, since they have to sample randomly and they can’t make you answer. I’ve talked about the importance of poll participation before. Sometimes polls are wrong because they aren’t conducted properly, but the vast majority of the time it’s because the right mix of people didn’t answer. Under this scenario, Beto would win by at least 2 points, because that’s the minimum error you would need to see for the polls to be considered abnormally wrong.

Scenario 2:  Republicans Stay Home

In this scenario, Republicans don’t turn out like they did in past elections. The rough indicator of this is the exit poll, but it may not be detailed enough to conclude this happened. Another proxy is relative turnout in the strongly Republican counties versus the more urban and liberal counties. If turnout is unexpectedly weak among Republicans, this would also hurt the polls, which were probably designed with Cruz having a turnout advantage.

Scenario 3: The Polls aren’t “Wrong” and Beto still wins by less than a point

This seems like a contradiction, but it’s normal for Senate polls to be off by about 5 points on average, and Cruz has slightly less than a five-point lead in the polls. So Beto could win by less than a point, and the polls would still perform like they usually do. Competitive races are really hard to poll and predict because much of the time there will be a statistical tie.

 

2018 Prediction

This Saturday, my grandmother died. With a heavy heart, I have decided to continue to predict this election. This project has been two years and many hours in the making, and I believe my grandma would have wanted me to continue. But given that this is a very emotional time, I will rerun the model later in case I made a mistake.

Map with Tossups


Click the map to create your own at 270toWin.com

Map with Tossups Decided


Click the map to create your own at 270toWin.com

Overall, I predict that Republicans will hold the Senate. The polls are very close, and there might be a few surprises. Part of me is afraid we will see the same under-capturing of Trump voters’ support as in 2016. I do think a lot of pollsters have put a lot of work into building better likely voter models and weighting, and they should be better, but the possibility of a repeat of the 2016 error makes me a little nervous about the polls in the states Trump won that have Democratic incumbents. A lot of these competitive states are hard to poll.

I also want to represent the uncertainty in my model based on my error in the presidential model, because that’s the best estimate of my accuracy I have. I measure my success both in how close my predicted outcome is to the actual outcome and in which races I call correctly. Since there are six toss-up states, I could be wrong about some winners but still do a very good job of predicting the outcome. This election will come down to turnout and who is more enthusiastic about the election.

Here is the scale of uncertainty:

Safe:  Unlikely (but possible)  for the model to be wrong in predicting the winner (darkest color)

Probably Safe: It is more likely than not that the predicted winner will win. (Medium color)

Too close to call: within 2.5 points, i.e., within one average error of the presidential model, meaning a near statistical tie at about 68% confidence. (Light color)

I have no idea: The margin is within or almost within the credible interval in my model, which suggests the model is incapable of distinguishing a winner, though the leader still gets the seat in my final count. (Beige in the first map, light color in the second)

Competitive Race Highlights

Here are the competitive states and the predicted margins from the pooled and iterative models. The expected error based on the presidential model data is about 2.5 points. This doesn’t mean I will be off by 2.5 points in all of these races; I usually get some states spot-on with very small error and then a few outlier states. Numbers may not add to 100% due to rounding. R and D represent the party, and I represents an incumbent.

Missouri- Hawley (R) 50.6, McCaskill (D, I) 49.4,  Margin: 1.2

Verdict:  I honestly have no idea.

The polls are really close. FiveThirtyEight says the fundamentals and pollster bias adjustments give McCaskill an advantage, and my model doesn’t include that. Honestly, my goal here is not to predict the winner but just to hope my prediction is close.

Nevada- Rosen (D) 51.2 ,  Heller (R,I) 48.8 , Margin 2.4

Verdict: Too close to call.

In 2012, Heller won by about a point, and Clinton did carry Nevada. I think Rosen has a slight advantage here, but turnout will determine the winner. Democrats and independents are turning out in early voting, but you could see an election day surge among Republicans, and we only know the voters’ party registrations, not their actual votes.

Florida- Nelson 50.2 (D,I),  Scott 49.9 (R) Margin 0.3

Verdict:  I have no idea who will win.

All I know about this race is that it is incredibly close, and Nelson might benefit from the excitement over the Democratic governor candidate Gillum.

Arizona- Sinema (D) 51.5,  McSally (R,I) 48.5, margin 2

Verdict: Too close to call.

This is another one of these races where it comes down to turnout.

Texas: Cruz (R,I) 52.8, O’Rourke (D) 47.2, Margin: 5.6

Verdict: Probably safe for Cruz

In my home state of Texas, I predict a Cruz win with a margin of 5.6 points. Based on my historical presidential error, this would mean Cruz has about a 95% chance of winning, but my gut suggests the polls may not have captured the enthusiasm among first-time and young voters, so maybe it’s closer to a 66% chance for Cruz.
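One simple way to turn a predicted margin and an assumed error scale into a win probability is a normal approximation. This is only an illustration of the idea, not my model’s actual calculation, and both error scales below are assumptions:

```python
import math

def win_probability(margin, sigma):
    """Normal-approximation chance the leader actually wins, given a
    predicted margin (in points) and an assumed error scale sigma (in points).
    This is the standard normal CDF evaluated at margin / sigma."""
    return 0.5 * (1 + math.erf(margin / (sigma * math.sqrt(2))))

# A 5.6-point lead looks much safer if the typical error is 2.5 points
# than if it is closer to 5 points (both sigmas are hypothetical).
print(win_probability(5.6, 2.5))
print(win_probability(5.6, 5.0))
```

With a 2.5-point error scale the lead is over two standard deviations, so the win probability is very high; with a 5-point error scale the same lead is far less safe. The sensitivity to that assumed error scale is exactly why gut adjustments like the one above can be reasonable.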

Tennessee: Blackburn: 51.1 (R),   Bredesen (D) 48.9  Margin: 2.2

Verdict: Too close to call with more than 68% certainty

The model thought this was more of a toss-up than I did,  but it wouldn’t be surprising for either candidate to win.  Turnout is probably key here.

North Dakota: Cramer (R) 54.4, Heitkamp (D, I) 45.6, Margin 8.8

Verdict: Relatively safe for Cramer

The North Dakota polling is a little sparse and Heitkamp could surprise us, but I doubt it.

Montana: Tester (D, I) 52.3, Rosendale (R) 47.7, Margin: 4.6

Verdict: Probably Safe

I would not be surprised if polling overly favors Democrats in the heavily red states, because Trump still trashes the polls and the media, so I wouldn’t completely reject the possibility of a repeat of the 2016 surprise in Michigan, Pennsylvania, and Wisconsin. Ultimately, though, I think Tester should win.

Indiana: Donnelly (D, I) 51, Braun (R) 49, Margin 2

Verdict: Too Close to Call

The model thought this race was closer than I expected it to be. There has been a lot of last-minute polling in October, in which Braun began to edge closer. I wasn’t expecting this race to be as competitive as it is until this week. If Braun wins, it would not surprise me.

West Virginia: Manchin (D,I) 54.4,  Morrisey 45.6, Margin 8.8

Verdict:  Probably Safe for Manchin

This race was a lot less competitive than I expected, but I guess West Virginians like Manchin. My model always struggled with West Virginia in presidential elections, so if Morrisey won it wouldn’t be that surprising.

Details

This election I have five different groups. To be considered competitive, a race must have two polls where the margin is smaller than the margin of error. The red group contains both Mississippi races, Utah, Wyoming, and Nebraska. Wyoming has no polls, so I will use Utah’s polls. The blue West group contains Washington, California, New Mexico, Wisconsin, Michigan, Hawaii, and both Minnesota races. The blue East group contains Maine, Vermont, New York, New Jersey, Ohio, Virginia, Delaware, Maryland, Massachusetts, Connecticut, and Pennsylvania.

I split out the races with two or more polls where the leader was ahead by less than the margin of error. I then group those states into red-leaning, blue-leaning, and toss-up based on how I view the race. The competitive red-leaners are Texas, Tennessee, and North Dakota. The competitive blue-leaners are Montana, Indiana, and West Virginia. The toss-ups are Missouri, Nevada, Arizona, and Florida.

And lastly, the special cases. Hawaii and Wyoming have no polls, so my prediction there is just the prior average. California has two Democrats, so I simply averaged the polls. In Maine and Vermont, I treat the independent senators as Democrats in my model, since they caucus with the Democrats and there isn’t a viable Democratic candidate in those states.

What My Model Does and Doesn’t Do and Why

I want to explain what my model does and doesn’t do. This model came from undergraduate research I did at Texas Tech, financially supported by the Undergraduate Research Scholars program. I built the 2016 model in about two months during my second year; post-election, I spent time analyzing it, writing the draft of the first paper on the model, and starting a project on voter behavior that was abandoned in the fall of my third and final year. I then decided to revamp the model by altering its structure and comparing different methods.

This whole project has grown as I have grown as a statistician. But since it takes me a lot of time to build a model, it has always lagged behind my abilities. I’ll admit that some of its assumptions are not ideal and that the current model is not the best way to do this. It can be better. But I have always carefully considered the effects of the non-ideal assumptions in my model. I may not have communicated this well in 2016, but I did know that my model could be wrong.

I will technically be running about 12 models for research purposes, but my two main models are one that pools the polls together and one that iteratively updates based on new polls. Both calculate what is essentially a fancy weighted average between the polls from similar states and the polls from the state itself. The iterative model converges much more quickly to the latest poll, and it is the one I tend to favor; the pooled model takes the mean and variance of the polls and does the weighted average once.
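The two approaches can be sketched as precision-weighted averages. The poll numbers below are hypothetical, and the real model’s structure (with its cross-state sharing) is more involved than this:

```python
# Toy sketch of pooled vs. iterative poll averaging.
# Each poll is a (margin in points, variance) pair; all numbers hypothetical.

def pooled(polls):
    """Pool all polls at once: average weighted by inverse variance."""
    weights = [1 / var for _, var in polls]
    return sum(w * m for (m, _), w in zip(polls, weights)) / sum(weights)

def iterative(prior_mean, prior_var, polls):
    """Update a normal prior one poll at a time (normal-normal update).
    Later polls pull the estimate more, since the posterior variance shrinks."""
    mean, var = prior_mean, prior_var
    for poll_mean, poll_var in polls:
        w = var / (var + poll_var)            # weight on the new poll
        mean = mean + w * (poll_mean - mean)
        var = var * poll_var / (var + poll_var)
    return mean

polls = [(3.0, 4.0), (5.0, 9.0), (2.0, 4.0)]
print(pooled(polls))
print(iterative(0.0, 25.0, polls))
```

The pooled version treats all polls symmetrically, while the iterative version processes them in order, which is what lets it track the most recent polling more responsively.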

My model does not adjust polls for bias, or weight polls based on quality and when they were conducted. These changes will be implemented in 2020, but I haven’t had the time to do it for this election. I’ve never predicted Senate elections, but my track record on presidential elections is very similar to the major models’, and this model has gone through peer review. I can’t say for sure that my model will work, but I’m hopeful it will hold its own on Tuesday compared to other models.

Election Night Guide

I wanted to give some advice on following the election results on Tuesday.

There are two things to keep in mind:  poll closing times and how results come out.

Different states close their polls at different times. The “standard” closing time is 7 pm local time, but some states span two time zones or have extended polling hours. So control of the House and Senate will likely not be decided until one to two hours after the competitive states close, which means about 8 pm PST or 11 pm EST.

For the Texas Senate race, I would watch the smallest 200 counties, which make up about 20% of the vote, mainly because we don’t know about turnout in these places and a lot of these counties should vote strongly for Cruz. We are seeing strong turnout in the more urban and liberal districts, but if turnout is good in the more conservative areas (and it is in the larger conservative counties), Cruz will probably win. Obviously, not everyone in Austin will vote for O’Rourke, and not everyone in Lubbock will vote for Cruz, but we should see a partisan map on Tuesday similar to past elections. I do agree there are a lot of young and first-time voters in this election, which is a good sign for O’Rourke, but there are also a lot of conservative young people in Texas, so this is not necessarily a sign that O’Rourke will win.

 

The Polls Might be Wrong on Tuesday, but Here’s Why That’s OK

I’m going to preface this by saying I am writing this on the Friday before the election. I don’t know if the polls are going to be wrong on Tuesday, but I want to be proactive. After 2016, I learned that there were people who didn’t understand the uncertainty in polling and election models. I also watched the attacks on many of the leaders of my field for alleged partisan bias that caused them to underestimate Trump. I can’t speak to other people’s political motivations, but the models and polls are built using sound statistical methodology.

The fact is that polls have uncertainty. They can be wrong, and sometimes will be, for a few reasons. Polls have huge nonresponse rates; for example, in the New York Times live polls, you can see that usually only 2-4% of the people called answer. Since those who don’t answer can differ from those who do, the polls can be biased by nonresponse. Nonresponse could be greatly reduced if more people answered political calls or completed the online surveys they are chosen to participate in.

Secondly, the structure of polls relies on assuming that individual people favor the candidates at similar rates and that one voter is not affected by other voters. This assumption is made for convenience: without it, it is practically impossible to estimate a margin of error. So polls usually make this assumption, which means the interpretation that 95% of all polls contain the real result within the margin of error overstates the certainty.

A heuristic I like to use is doubling the margin of error, because that roughly represents the true error of polls. One thing you will notice in this election is that a lot of the polls are close. This means we cannot be sure who wins in quite a lot of races. In the Senate, about four races (ND, NV, MO, FL) are too close for the polls to predict the winner with a high degree of certainty.
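The doubling heuristic can be sketched as a quick statistical-tie check. The lead and sample size below are hypothetical:

```python
import math

def is_statistical_tie(lead, p, n, inflation=2.0):
    """Call a race a statistical tie if the leader's edge is within the
    inflated margin of error. inflation=2.0 encodes the rough rule of
    thumb that real-world poll errors run about twice the textbook
    margin of error."""
    moe = 1.96 * math.sqrt(p * (1 - p) / n) * inflation
    return abs(lead) < moe

# Hypothetical: a 4-point lead in an 800-person poll is still a tie
# under the doubled margin of error, while a 10-point lead is not.
print(is_statistical_tie(0.04, 0.50, 800))
print(is_statistical_tie(0.10, 0.50, 800))
```

A poll of 800 has a textbook margin of error around 3.5 points, so doubling it means leads of up to roughly 7 points should still be treated cautiously.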

I expect that my model will have an average error of about 3-4 points. Some of the error is going to come from bad estimates in noncompetitive races with limited polling, but in the competitive races I should (hopefully) be off by only 2-3 points. That means it would not be surprising for me to incorrectly call two or three races; on the other hand, I could be completely right, or miss four races, and neither would be surprising.

Election prediction is an inexact science, and while we try our best, elections have uncertainty, so we will be wrong sometimes. But for me at least, I predict because I love the challenge of trying to make sense of a complicated event. I will be wrong sometimes, but when I’m right, it’s a great feeling to have defied the uncertainty that makes my job difficult.

Model Details 2018

First, I want to explain again how to read a poll. Polls are estimates of the future vote, and like all statistical estimates, polls have a margin of error. In general, we expect the actual election results to be close to the polls, but they may be off by a few points. It depends on the race and the time, but doubling the margin of error is a reasonable heuristic for judging the likely accuracy.

Model Details

For the Senate model, I am basically going to use the revised model outlined here. The main change is that past-vote normalization will now use the past vote of the state to normalize the prior. The Beta, noninformative, and Gaussian people models will not be used; the four models are Polls Only, Prior Only, Gaussian Iterative, and Gaussian Pooled Polls. I am not going to use a generic-ballot prior, because I don’t think it would be a good estimator of Senate races: the generic ballot is national, not all states are voting, and I don’t know whether the generic ballot represents Senate voting behavior well. The competitive group will be separated into a group of states currently held by a Republican and a group of states currently held by a Democrat. Noncompetitive states will be in either the safe Republican group, the safe Democrat West group (for western states), or the safe Democrat East group (for eastern states). A race is called competitive if there is at least one poll after the primary within the margin of error. Currently, the competitive races (and which party holds the seat) are: Arizona (R), Nevada (R), Texas (R), Tennessee (R), Florida (D), Missouri (D), North Dakota (D), and Indiana (D). I originally thought West Virginia and Ohio might be competitive, but the polls haven’t been close.

Florida and Missouri Race Profile and Polling Updates

This is my last planned post for the 2018 midterms. I would like to blog closer to the election, but I am starting a Ph.D. program and may not have the time. Hopefully, I will be able to make a prediction in November. I am going to cover Florida and Missouri, and I will try to profile Arizona after the primary, since there is a small chance that the presumed nominee McSally will not win on 8/28.

Florida

2016 Presidential Election result:  Trump: 49.02%  Clinton 47.82%, Margin: Trump +2.80%

2012 Senate Election result: Nelson (D) 55.2%,  Mack (R) 42.2%,  margin +13 D

Democratic Candidate: Bill Nelson (incumbent)

Nelson is the incumbent senator. He is a moderate, first elected in 2000, and he is branding himself as an independent to appeal to Florida’s moderate and independent voters.

Republican Candidate: Rick Scott

Scott is the current governor of Florida, elected to that position in 2010. He has an interesting plan to improve the government’s efficiency and fairness by writing bills that would introduce term limits, require members of Congress to work 40 hours a week, add a line-item veto for the president, require a supermajority for any tax increase, and stop paying members of Congress while there is no budget. Even if he does write those bills, I doubt any of those things will actually become law.

My Thoughts on the Race: The Real Clear Politics average is currently Scott +1.2, and I think this race is very close. The latest poll showed Scott with a 3-point lead, but it is from a Republican-leaning polling agency using a likely voter model that might not be accurate. The polls disagree with each other, but I think that is a combination of sampling error, the mix of likely-voter and registered-voter polls, and variation between agencies. There are still some undecided voters, and as their numbers decrease closer to election day, hopefully the picture will become clearer. This is definitely a race to watch on election night. To make it even more interesting, there is currently a controversy over whether Russia has hacked Florida’s voter registration system, and both campaigns are blaming each other for the algae bloom crisis in attack ads. The winner of this race is probably going to be determined by turnout.

Missouri

2016 Presidential Election result:  Trump: 56.77%  Clinton 38.14%, Margin: Trump +22.63%

2012 Senate Election result: McCaskill (D) 54.8%,  Akin (R) 39.1%, Dine (Libertarian) 6.1%  margin +15.7% D

Democratic Candidate: Claire McCaskill (incumbent)

McCaskill was elected in 2006. Her Trump Score is 45.3%, but her predicted score is 83.2%. This difference means there is plenty of material for attack ads on her past votes: she voted no on the tax plan, voted for immigration reform the president didn’t support, and opposed the Republican budgets. As an incumbent she has an advantage, but she might struggle to get Trump voters to vote for her.

Republican Candidate: Josh Hawley

Hawley is the current Attorney general and states on his website that he is “one of the nation’s leading constitutional lawyers”, and supports religious freedom and fighting against the opioid epidemic and human trafficking.

My Thoughts on the Race: The polling data has been pretty spread out, and while the polling agencies are legitimate, they are not as good as the big-name agencies. Hawley has a chance of winning, but at this time McCaskill might have a slight lead. Hopefully, more polls will be conducted so we can get a better picture.