Election Modelling isn’t Inherently Political

One of the sad trends I have noticed is the desire to attack political journalists, pollsters, or election modelers and dismiss their work because it supposedly fits their political views.  An example is someone claiming Nate Silver only predicted that the Democrats would flip the House because he is a Democrat, and not because there was actual evidence for it (and, as you know, it actually happened).  There may be journalists, pollsters, and modelers who cannot separate their politics from their work.  But I assume most of them, like me, try to put their politics aside and follow the facts.   I think it’s important to draw attention to this issue, and also to share that, as a conservative-leaning independent, I trust people on the other side of the spectrum to do a good job.

What has surprised me as a newcomer to polling analysis is how some people view the polls and models as tools used to promote an agenda or attack the president. I’ve struggled to convince some of my own friends and family that the polls can be trusted even though they didn’t predict that Trump would win the presidential election. In some people’s minds polls aren’t worth dealing with, so you should just let the phone ring when “Survey” shows up on the caller ID.

This disconnect between members of the public and the polling and election modeling community is a problem. Combine it with a mediocre public understanding of probability and you get mistrust of the models because they are “flawed.” I will acknowledge that all models are imperfect and can always be improved, but we shouldn’t attack experts because their political opinions differ from ours.

We should try to improve public support for the polls and models, because if public trust is low, response rates may go down, and with them the errors may go up, creating a never-ending self-fulfilling prophecy.  In particular, I think the polling community needs to reach out to conservatives to try to earn their trust.  If there is a polling trust gap between conservatives and liberals, it could affect how the polls perform.

But I trust the polls and the models. I know they have flaws, but I also know that is the nature of all statistical modeling. And the power of political polling goes beyond election prediction: it helps us understand how the electorate feels about politicians and policies. The challenging nature of this field and its potential for statistical education of the public are why I do what I do.


For me, this has never been about my politics, and I trust the models of those whose political opinions and demographics are different from mine.  In this era of tribalism and polarization, we need to acknowledge that the field of political polling analysis isn’t inherently political.

The Polls Might be Wrong on Tuesday, but Here’s Why That’s Ok.

I’m going to preface this by saying I am writing this on the Friday before the election.  I don’t know if the polls are going to be wrong on Tuesday, but I want to be proactive.  After 2016, I learned that there were people who didn’t understand the uncertainty in polling and election models.  I also watched the attacks on many of the leaders of my field for an alleged partisan bias that caused them to underestimate Trump.  I can’t speak to other people’s political motivations, but the models and polls are built using sound statistical methodology.

The fact is that polls have uncertainty.  They can be wrong, and sometimes will be wrong, for a few reasons.  Polls have huge nonresponse rates; for example, in the New York Times live polls you can see that usually only 2-4% of the people called answer.  And since those who don’t answer can be different from those who do, the polls can be biased by nonresponse.  Nonresponse could be greatly reduced if more people answered political calls or completed the online surveys they are chosen to participate in.

Second, the structure of polls relies on assuming that individuals favor the candidates at similar rates and that one respondent’s answer is not affected by other voters (roughly, that responses are independent and identically distributed).  This assumption is made for convenience, because without it, it is practically impossible to estimate a margin of error.  So polls usually make the assumption, which means that interpreting the margin of error as “95% of all polls contain the real result” overstates the certainty.

A heuristic I like to use is doubling the margin of error, because that roughly represents the true error of polls.  One thing you will notice in this election is that a lot of the polls are close.  This means that we cannot be sure who wins in quite a lot of races.  In the Senate, about four races (ND, NV, MO, FL) are too close for the polls to predict the winner with a high degree of certainty.
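To make the doubling heuristic concrete, here is a minimal sketch with a made-up poll (the numbers are hypothetical, not from any race above): the textbook 95% margin of error for a proportion, and what doubling it does.

```python
# Hypothetical poll: candidate at 48% among 800 respondents.
import math

p, n = 0.48, 800
moe = 1.96 * math.sqrt(p * (1 - p) / n)      # sampling-error-only 95% margin of error
print(f"reported MOE: +/-{100 * moe:.1f} pts")       # about +/-3.5 points
print(f"doubled heuristic: +/-{200 * moe:.1f} pts")  # about +/-6.9 points
```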

I expect that my model will have an average error of about 3-4 points.   Some of that error is going to come from bad estimates in noncompetitive races with limited polling, but in the competitive races I should be off (hopefully) by only 2-3 points.  That means it would not be surprising for me to incorrectly call 2 to 3 races, but on the other hand I could be completely right, or miss four races, and not be surprised.

Election prediction is an inexact science, and while we pollsters and modelers try our best, elections have uncertainty, so we will be wrong sometimes.  But for me at least, I predict because I love the challenge of trying to make sense of a complicated event.  I will be wrong sometimes, but when I’m right it’s a great feeling to have defied the uncertainty that makes my job difficult.

What a Pulmonary Embolism Taught Me About Statistics

On May 3rd, 2017, I was released from the hospital following an overnight stay for the treatment of a pulmonary embolism.  I am now almost fully recovered.  I think this experience is a great opportunity to teach statistics through a real-life example. I learned three things from this experience.

Vastly different fields can have the same underlying statistical processes

So far I have worked almost exclusively with political science data. My research is about how to estimate a proportion from a sample and how to compare it to other proportions.  When my doctor told me I might have a pulmonary embolism, I wanted to see the data for myself.  So I read the journal articles, FDA case reports, and any data I could find to try to estimate the chance that I had a pulmonary embolism.  What I quickly realized is that the data on adverse drug reactions had a lot in common with the political science data I was used to working with.   The data had issues with nonresponse bias and limitations due to less-than-ideal sample sizes.  Although political science and pharmacology are very different fields, they share similar kinds of statistical problems.

Bayesian statistics is a powerful tool in many fields

Through this process, I saw how Bayesian statistics could help solve a difficult and important problem.  My doctor came by and saw me during my brief hospital stay.  She talked about how, while she knew it was unlikely that any random woman in her twenties would have a pulmonary embolism, the details of my case suggested that the probability I had one was significant.  In short, the Bayesian mindset is about incorporating your prior beliefs and adapting them in the presence of additional information.  I don’t think my doctor used Bayes’ theorem (the formal formula for estimating a probability given prior information), but she used Bayesian reasoning.  She had initial beliefs about the cause of my symptoms, and she updated those beliefs when she got new information (like lab results).  This is probably normal reasoning for a doctor trying to diagnose a patient, but it showed me how Bayesian statistics could be applied to other fields.   A more formal use of Bayesian statistics would provide even better estimates of these probabilities.  I always knew Bayesian statistics could be useful in areas besides politics, but this experience showed me a new one I am interested in researching.
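For anyone curious what that reasoning looks like as a formula, here is a toy Bayes’ theorem calculation. Every number is made up for illustration; these are not real clinical probabilities, and certainly not the ones my doctor had in mind.

```python
# Toy Bayes' theorem update (all probabilities are hypothetical).
prior = 0.01        # P(PE) for a random young woman, before any findings
sens = 0.95         # P(suggestive finding | PE)
false_pos = 0.10    # P(suggestive finding | no PE)

# Bayes' theorem: P(PE | finding) = P(finding | PE) * P(PE) / P(finding)
p_finding = sens * prior + false_pos * (1 - prior)
posterior = sens * prior / p_finding
print(round(posterior, 3))   # about 0.088 -- still small, but ~9x larger than the prior
```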

I am interested in applying statistics to fields besides politics

I wish I could have discovered my interest in biostatistics without a life-threatening medical event, but I am glad I was exposed to a problem that is important and that uses some of the same techniques as my work in political science.  While I still love political science statistics, I feel like I have now answered the question of what I can research in years when there is no major election.  I enjoyed reading clinical trials and studies and analyzing their statistics.  Maybe someday I can even study how to improve the statistical methods used to prevent and diagnose pulmonary embolisms like mine.

Six months after returning home from the hospital,  I am grateful that God has found a way to use my PE for good.

 

What My Undergraduate Research Experience in Statistics Was Like

I am entering my third and final year of my undergraduate degree.  I have been doing research since almost day 1, and I wanted to share what my experience was like. As a statistician, I feel like I have to mention this is from a sample size of 1 and may not reflect all undergraduate research experiences.

First, I want to give a little background.  The summer before my senior year of high school, I was chosen to participate in an NSF (National Science Foundation) funded REU (Research Experience for Undergraduates) at Texas Tech.  There I was exposed to what research is like.  We had a series of workshops, each led by a different researcher, over a two-week period. I loved the Texas Tech math department and decided to attend Texas Tech for my undergraduate degree. I met my current research advisor, Dr. Ellingson, at the REU.

Right after classes started during my freshman year, I decided to email Dr. Ellingson and see if I could do research with him.  I started work on image analysis (Dr. Ellingson’s specialty).  I was also following the GOP nomination because it interested me.  I had an idea to predict the nomination using Bayesian statistics, similar to how Five Thirty Eight predicts elections.  I had talked with Dr. Ellingson before about political science statistics and the need for a statistically sound, open-source academic model.  He agreed to help guide me through the process of building a model to predict the GOP nomination.

At the time of the GOP nomination my math background was pretty limited, so I decided to just use Bayes’ theorem, with the normal distribution to estimate the likelihood.  I did all the calculations in Excel, using csv files of poll data downloaded from Huffington Post Pollster.  I used previous voting results from similar states as the prior in my model.  More info about my model can be found here. What I found most challenging was making the many decisions about how I was going to predict the election.  I also struggled with decisions about delegate assignments, which often involved breaking the results down by congressional district even when the poll data was statewide.  After the first Super Tuesday (March 1st) I began to realize how difficult it is to find a good prior state and to reassign the support of candidates who dropped out of the race.  The nomination process taught me that failure is inevitable in research, especially in statistics, where everything is at least slightly uncertain.
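To give a flavor of what those spreadsheet calculations were doing, here is a minimal grid-style sketch of Bayes’ theorem with a normal likelihood. The numbers are invented for illustration; they are not the actual priors or polls from the nomination model.

```python
# Toy Bayes' theorem calculation: prior from a "similar" state's past result,
# likelihood from a normal density around the observed polling average.
import numpy as np
from scipy.stats import norm

p_grid = np.linspace(0.20, 0.80, 601)                 # possible true support for a candidate
prior = norm.pdf(p_grid, loc=0.40, scale=0.06)        # similar state gave the candidate ~40%
likelihood = norm.pdf(0.35, loc=p_grid, scale=0.04)   # polls average 35%, assumed 4-point error

posterior = prior * likelihood
posterior /= posterior.sum()                          # normalize to a proper distribution

print(f"posterior mean support: {np.sum(p_grid * posterior):.3f}")
```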

In the summer of 2016, I started gearing up for the general election. I decided to use SciPy (a Python package for science and statistics) to make my predictions.  Writing the programs was incredibly difficult.  I had over a dozen variations to match different combinations of poll data.  I had the programs up and running by early October, but I discovered a couple of bugs that invalidated my early test predictions.  The original plan was to run the model on the swing states two or three times before the real election. In the middle of October I discovered a bug in one of my programs, and I then had to fix it in every program.  I finally did some manual calculations to confirm the programs worked.  It was difficult to admit that my early predictions were totally off, but I am glad I found the bug before the election.  Research isn’t like a homework assignment with answers in a solution manual.  You don’t know exactly what is going to happen, and it is easy to make mistakes.

I ended up writing a paper on my 2016 general election model.  Writing a paper on your own research is very different from writing a paper on other people’s research.  My paper was 14 pages (and over 6,500 words) long, and only about one or two pages were about other people’s research on the topic.  It took a very long time to write, and I went through 17 drafts.  I hated writing the paper at first, but when I finished it felt amazing. It was definitely worth the effort.

Undergraduate research is difficult, but I loved the entire process.  I got to work with real data to solve a real problem.  I learned how to read a research paper, and eventually I got to write my own.  I got to give presentations both to general audiences and to mathematicians and statisticians.  I got to use my research to inform others about statistics. If you are thinking about doing undergraduate research, you definitely should.

 

Data Sharing

Last semester I took a research ethics class.  I wrote a paper on preregistration and data sharing in academic research. I decided to modify the paper into two blog posts. Here is the first part on data sharing.

Statistics is the study of uncertainty.   Any research study that does not involve the entire population of a group cannot provide a definite conclusion with 100% certainty.   Conclusions can be made with a high degree of certainty (95-99%), but false positives and false negatives are inevitable in any large body of statistical analysis.  This means that studies can fail to make the right call, and after multiple replications the original conclusion may be overturned.

One way to improve the statistical integrity of research is to have a database of the data from non-published studies.  Ideally, this database would be accessible to all academic researchers.   A researcher would then be able to see the data from other similar studies and compare it with their own.   At a significance level of .05, a study of an effect that does not actually exist has roughly a 1 in 20 chance of reporting a statistically significant false positive.    That rate applies to theoretically perfect studies that meet all of the statistical assumptions used; any modeling error increases it.   With each external replication of a study, the probability of a false positive or a false negative greatly decreases.   Grants from the National Science Foundation [1] and the National Institutes of Health [2] currently require that data from funded studies be made available to the public after the study is completed.  But not all grants and funding sources require this disclosure.    Without a universal requirement for data disclosure, it can be difficult to confirm that a study and its results are legitimate.

Advocates of open data say that data sharing saves time and reduces false positives and false negatives.  A researcher can look at previously conducted studies and try to replicate the results, and the results can be recalculated by another researcher to confirm their accuracy.   In a large study with lots of data it is very easy to make a few mistakes, and those mistakes could cause the results to be misinterpreted.   Open data can even help discover fraudulent studies.  There are methods to estimate the probability that data is fraudulent by looking at the relative frequency of the digits, which should be roughly uniform.  In 2009, Strategic Vision (a polling company) came under fire for potentially falsifying polls after a Five Thirty Eight analysis [3] found that its digits didn’t look quite right.  This isn’t an academic example, but open-access data could prevent fraudulent studies from being accepted as fact, as happened with the infamous vaccines-cause-autism study.  A statistical analysis of digit randomness isn’t definitive, but it can raise questions that prompt further investigation of the data.   Open data makes replication easier, false positives and false negatives can cause real harm, and easier replication helps confirm findings more quickly.
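As an illustration of the digit-frequency idea (not the method Five Thirty Eight actually used), here is a sketch of a chi-square check on whether the trailing digits of some made-up reported numbers look uniform. A small p-value is a reason to look closer, not proof of fraud.

```python
# Check whether trailing digits of reported numbers look roughly uniform.
import numpy as np
from scipy.stats import chisquare

reported = np.array([52, 41, 48, 53, 47, 44, 51, 42, 49, 46, 55, 43])  # hypothetical toplines
trailing = reported % 10                          # last digit of each number
counts = np.bincount(trailing, minlength=10)      # how often each digit 0-9 appears
stat, p_value = chisquare(counts)                 # null hypothesis: digits are uniform
print(f"chi-square = {stat:.2f}, p = {p_value:.3f}")
```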

 

Works Cited

[1] Public Access To the Results of NSF-Funded Research. (n.d.). Retrieved April 28, 2017, from https://www.nsf.gov/news/special_reports/public_access/

[2] NIH’s Commitment to Public Accountability. (n.d.). Retrieved April 28, 2017, from https://grants.nih.gov/grants/public_accountability/

 

[3] Silver, N. (2014, May 07). Strategic Vision Polls Exhibit Unusual Patterns, Possibly Indicating Fraud. Retrieved April 28, 2017, from https://fivethirtyeight.com/features/strategic-vision-polls-exhibit-unusual/

We Don’t Live in Statsland

Statsland is a magical world that exists only in (certain) statistics textbooks. In Statsland, statistics is easy.  We can invoke the Central Limit Theorem and use the normal distribution whenever n is larger than 30.   In Statsland we either know or can easily determine the correct distribution.  In Statsland, 95% confidence intervals have a 95% chance of containing the real value.  But we don’t live in Statsland.

The point of doing statistics is that it would be too difficult (or impossible) to find the true value for a population.  You aren’t likely to find the exact value, but you can get pretty close.   In a statistics textbook problem, you probably have enough information to do a good job of estimating the desired value, but in applied statistics you may not have as much information.  (If you already know the mean and standard deviation of a population, you do not need to do much statistics at all.)  Any time you have to estimate or substitute information, your model will not perform as well as a theoretically perfect one.

Statistics never was and never will be an exact science.   In most cases, your model will be wrong.  There are no perfect answers.  Your confidence intervals will rarely perform exactly as they theoretically should.  The sample size required to invoke the Central Limit Theorem is not clear cut.  Your approach should vary with the individual problem; there is no universal formula for examining data.   Applied statistics should be flexible instead of rigid.   The world is not a statistics textbook problem, and should never be treated as one.
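Here is a quick simulation of that gap, using a deliberately non-Statsland scenario I made up: the textbook normal-approximation interval for a proportion, with a small sample and a true value near the boundary.

```python
# Empirical coverage of the textbook 95% interval for a proportion
# when n is small and the true proportion is near 0.
import numpy as np

rng = np.random.default_rng(0)
true_p, n, reps = 0.05, 30, 100_000

covered = 0
for _ in range(reps):
    p_hat = rng.binomial(n, true_p) / n
    moe = 1.96 * np.sqrt(p_hat * (1 - p_hat) / n)
    covered += (p_hat - moe <= true_p <= p_hat + moe)

print(f"empirical coverage: {covered / reps:.3f}")   # well below the nominal 0.95
```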

 

A Non-Technical Overview of My Research

Recently I have been writing up a draft of a research article on my general election model to submit for academic publication.  But that paper is technical and requires some exposure to statistical research to understand, so I wanted to explain my research here without going into all the technical details.

Introduction

The President of the United States is elected every four years.  The Electoral College decides the winner, through the votes of electors chosen by their home states.  Usually the electors are chosen based on the winner of their state, and they vote for that winner. Nate Silver correctly predicted the winner of the 2008 election with Bayesian statistics, getting 49 out of 50 states correct.   Silver certainly wasn’t the first person to predict an election, but he received a lot of attention for his model.   Silver runs Five Thirty Eight, which covers statistics and current events.  Bayesian statistics is a branch of statistics that uses information you already know (called a prior) and adjusts the model as more information comes in.  My model, like Nate Silver’s, used Bayesian statistics; we do not know the details of Silver’s model beyond the fact that it is Bayesian.  To the best of my knowledge, my method is the first publicly available model that used poll data from other states as the prior.  A prediction was made for 2016, where I correctly called 45 of the 51 contests (missing 6).  The model was then applied to 2008 and 2012, where my predictions of the state winners matched the predictions of Five Thirty Eight.

Methodology

I took poll data from Pollster, which provided csv files for the 2016 and 2012 elections.  For 2008 I had to create the csv files by hand.  I wrote a series of computer programs in Python (a common programming language) to analyze the data.  My model used the normal distribution.  My approach divided the 50 states into 5 regional categories: swing states, southern red states, midwestern red states, northern blue states, and western blue states.  The poll data sources used as the priors were national, Texas, Nebraska, New York, and California polls, respectively.  This approach is believed to be unique, but since several models are proprietary it is unknown whether it has been used before.  I only used polls if they were added to Pollster before the Saturday before the election, which for the 2016 analysis meant November 5th.  I posted my predictions on November 5th.
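To show the flavor of the approach, here is a minimal sketch of a normal-prior, normal-likelihood update of the kind the model uses. The specific numbers, error sizes, and example state are hypothetical; the paper has the real details.

```python
# Conjugate normal update: prior on a state's margin from the "prior state's" polls,
# combined with the state's own polls treated as normal observations.
import numpy as np

def posterior_margin(prior_mean, prior_sd, state_polls, poll_sd=4.0):
    polls = np.asarray(state_polls, dtype=float)
    prior_prec = 1.0 / prior_sd ** 2
    data_prec = len(polls) / poll_sd ** 2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * polls.mean())
    return post_mean, np.sqrt(post_var)

# Hypothetical midwestern red state, using (made-up) Nebraska polling as the prior.
mean, sd = posterior_margin(prior_mean=-20.0, prior_sd=8.0, state_polls=[-15.0, -18.0, -12.0])
print(f"posterior margin: {mean:.1f} +/- {sd:.1f} points")
```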

I outline more of my method here.

Results and Discussion

My model worked pretty well compared to other models.  Below is a table of several models and their success rates at predicting the winning candidate in all 50 states (plus Washington D.C.).

Race                 | Real Clear Politics | Princeton Election Consortium | Five Thirty Eight (Polls Plus) | PredictWise (Fundamental) | Sabato’s Crystal Ball | My Model
2008 Winner Accuracy | 0.96078             | 0.98039                       | 0.98039                        | N/A                       | 1                     | 0.98039
2012 Winner Accuracy | 0.98039             | 0.98039                       | 1                              | 0.98039                   | 0.96078               | 1
2016 Winner Accuracy | 0.92157             | 0.90196                       | 0.90196                        | 0.90196                   | 0.90196               | 0.88235
Average Accuracy     | 0.95425             | 0.95425                       | 0.96078                        | 0.94118                   | 0.95425               | 0.95425
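Each accuracy in the table is simply the number of contests called correctly divided by 51 (the 50 states plus D.C.). For example, my model’s numbers correspond to:

```python
# Correct calls for my model, read off the table above (out of 51 contests).
correct = {"2008": 50, "2012": 51, "2016": 45}
for year, right in correct.items():
    print(year, round(right / 51, 5))   # 0.98039, 1.0, 0.88235
```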

As you can see, all the models do a similar job of picking the winner in each state, which is what determines the Electoral College.  There are other ways to compare accuracy, but I don’t want to discuss them here since it gets a little technical.   No one was right for every state in every election.  It would probably be impossible to create a model that consistently predicts the winner in every state, because of the variability of political opinion.   Election prediction is not an exact science.  But there is the potential to apply polling analysis to estimate public opinion on particular issues and politicians.  Right now the errors in polls are too large to determine public opinion on close issues, but further research could find ways to reduce the error in polling analysis.

Only You can Prevent Bad Political Polls

My research relies heavily on polls, so I understand why it is important to answer them.   If I see a poll and can tell it is well written, I take it.  But I think this attitude is rare, because people don’t know how important polls are, so I want to explain why I think they matter.   Pre-election polls are commonly used to predict elections, and favorability polls are often used to judge a politician’s popularity. Polls are an important part of American politics.

I get that polls are annoying.  I know they take time and you are probably busy (like me).  But answering even one political poll a year can greatly improve the accuracy of polls.   You don’t have to answer every poll, but increased participation improves accuracy.   Now, there are a lot of bad polls, and it’s difficult to tell whether a phone poll is good based on the phone number.  Some “polls” are really marketing calls, so I understand if you are hesitant to do phone polls.  But internet polling provides a good alternative, and I think the future of polling is quality internet polls.  When you take a good internet poll you know more about its quality than you do from a phone call. Internet polls from scientific polling agencies, however, require a large base of people to create accurate samples.  You can randomly call 1,000 phones, but you really can’t send a poll to 1,000 random internet users. To combat this problem, polling agencies maintain databases of users and send surveys to certain users to create a good sample. Joining a survey panel that includes political polls is a way to get your voice heard.

My view on participating in political polls is that you can’t complain if you don’t participate.  Polls need a diverse sample to be accurate.  If you feel your political views are not reflected in the polls, then you should answer more polls, not fewer.  We need all kinds of people for good polls.  Not everyone has internet access, but enough voters do to create a good sample.  What you can do is join a poll panel.  My two recommendations are https://today.yougov.com/ and https://www.i-say.com/.  They also do non-political polls and market research, which are also important (I might do a post on this later). I recommend them because they are user friendly and statistically sound.  I am not receiving anything for recommending these agencies; I just think they are good.

If you want polls to be more accurate, the best (and easiest) thing to do is participate in polls.  As a statistician, I value good data.  But for data to be good it needs a representative sample.  Regardless of your politics, you should participate in political polls.

 

Coincidences: A Lesson in Expected Value

As I followed the election I noticed frequent mentions of counties (or cities) that are known to “predict” the presidential election winner. The idea is that the winner of a certain county has matched the winner of the election for multiple elections in a row. Let’s look at a hypothetical county A. To simplify things, assume the odds of a county matching the winner of a presidential election are 50-50. Then the probability of getting 8 elections right in a row is 1 in 256, which means it is unlikely that county A would predict the election by chance. But what about the rest of the counties in America? There are over 3,000 counties in America (according to an Economist article found here: http://www.economist.com/blogs/economist-explains/2016/11/economist-explains), so we can expect, on average, about 12 of them to have “predicted” the winner of the presidential election for eight straight elections.
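Here is that back-of-the-envelope expected-value calculation written out:

```python
# Expected number of "bellwether" counties under pure 50-50 chance.
p_streak = 0.5 ** 8        # one county matching the winner 8 elections in a row
n_counties = 3000          # roughly the number of US counties
print(p_streak)            # 1/256, about 0.0039
print(n_counties * p_streak)   # about 11.7 counties expected by chance alone
```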

Rare events happen all the time; rare is not impossible. Let’s say there is a (hypothetical) free sweepstakes with a 1 in 100 chance of winning $100. It may not be likely that you specifically win, but if all your Facebook friends enter the contest, someone you know is probably going to win. If you have at least 99 Facebook friends, it is likely that you or someone you know will win the sweepstakes. You may think it’s a coincidence or luck, but it is really math. Expected value can’t tell you who is going to win, but it can tell you that someone you know is likely to win. Now, expected value is not a magic bullet. You may have 0 friends win or 2 friends win, but the most likely outcome is that someone wins. Unfortunately, (legit) sweepstakes like this don’t exist, but it is a good example of how your perception of probability may not match reality. Similarly, it is probably going to rain on about 1 in 10 of the days when the probability of rain is 10%, but it is easy to pretend it never rains when the forecast says 10%.
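The sweepstakes claim is a one-line calculation (using the hypothetical 1-in-100 odds from above):

```python
# Chance that at least one of 100 people (you plus 99 friends) wins a 1-in-100 sweepstakes.
p_none_win = 0.99 ** 100
print(round(1 - p_none_win, 2))   # about 0.63 -- more likely than not
```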

You may wonder why expected value matters, but it’s actually quite important when looking at everyday events. It is easy to underestimate the chance that something odd or rare will happen. You may think it’s odd that it rains when the meteorologist says the chance of rain is only 10%. Or that it only takes 23 people for there to be a 50% chance that two of them share a birthday (details here). It is easy to forget that once-in-a-lifetime events do happen once in a lifetime. How you think about probability is important. So before you yell at the TV meteorologist who said there was a 10% chance of rain when it rained, try to remember that unlikely does not equal impossible.
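And the birthday figure mentioned above checks out the same way:

```python
# Probability that at least two of 23 people share a birthday (ignoring leap years).
p_all_distinct = 1.0
for k in range(23):
    p_all_distinct *= (365 - k) / 365
print(round(1 - p_all_distinct, 3))   # about 0.507 -- just over 50%
```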

Models May Fail but Statistics Matters Anyway

The 2016 presidential election brought attention to the limitations of statistics.  Most models predicted a Clinton win, but Trump will most likely be the president (the results are currently unofficial and recounts are in progress, but most experts believe Trump will be officially elected president). However, no model is 100% certain, and the goal of statistics is to find the most likely event.  I have spent the last few weeks reflecting on the results and what they mean for the field of political science statistics.

Recently I read a book by David Salsburg called The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century.  It’s a history of sorts of how the field was developed and then applied to science.  While an exact date for the beginning of statistics is hard to pinpoint, the first journals and departments were founded in the early twentieth century.  Statistics is a young field and is constantly growing and evolving as more data and situations are studied.  In the beginning some of the problems may have been trivial, but it is important to try to understand the world around us. Collecting data from an entire population is incredibly difficult and sometimes impossible, so methods of estimation were created.

You may wonder why prediction is necessary or helpful.  After all, eventually the election happens and the president is chosen, so why do we care about knowing this in advance?  Why does prediction matter?  Statistical models and research are not just about what is being studied but about creating better ways to understand the world around us.   We can begin to better understand things like the opinions of the people, the development of diseases, and the economy.  Statistics can create better government, better medicine, better education, and a better world.  If we can understand how polls measure the voting habits of the American people, then we may be able to get a better picture of citizens’ views on multiple issues and candidates.  If we can understand how diseases like cancer behave, then we can create better, more individualized medicine.  If we can understand how individual students learn and what they know, then we can create a better educational system.

Statistics isn’t perfect.  Statisticians can disagree and both still have valid models and reasoning.  The data may be imperfect and incomplete.  The model may be wrong.  The experiment may seem trivial and unimportant. But there is so much potential for the field of statistics to change our world.  Just because prominent statisticians like Nate Silver may not have seen a Trump presidency as the most likely event doesn’t mean the field should be discounted.