My New Project

Update 09/23/17: I am switching to two-proportion z-tests. I am setting the population proportion to 0.5 to prevent an underestimation of variance.
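The conservative variance choice in the update can be sketched as follows. This is a minimal sketch, assuming the usual two-proportion z statistic with the variance evaluated at 0.5; the function name and sample numbers are mine, not from the analysis itself:

```python
from math import sqrt

def two_prop_z(p1, n1, p2, n2, p0=0.5):
    """Two-proportion z statistic with the variance evaluated at p0.

    Evaluating p * (1 - p) at p0 = 0.5, its maximum, means the standard
    error can never be underestimated, making the test conservative.
    """
    se = sqrt(p0 * (1 - p0) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Hypothetical example: 52% vs. 48% in two samples of 1,000 voters each.
z = two_prop_z(0.52, 1000, 0.48, 1000)
```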


This is a bit of a technical post; I will have a better explanation later.

Post-election, I have been working on a paper and thinking about what to do next. I am really interested in breaking down voter behavior in the swing states. I have collected exit poll data from the 11 swing states. I want to test whether voter behavior across the swing states was consistent with the national vote or with the swing-state average.

For phase 1 of this experiment, I will run a Chi-Square Test of Homogeneity comparing each swing state to the average of the other swing states and to the national vote. I will look at each category four different ways: Trump vs. not Trump, Clinton vs. not Clinton, Other vs. Clinton and Trump, and overall. This will probably be around 1,500 tests. I will use an initial alpha level of 0.05. I will then run two-proportion z-tests on the tests where the p-value was less than 0.05, with the direction of each z-test matching the direction in the data.
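For a 2x2 table (e.g. Trump vs. not Trump in one swing state versus the national vote), the Chi-Square Test of Homogeneity has a simple closed form. This is a sketch with made-up counts, not the actual exit poll data:

```python
def chi2_homogeneity_2x2(a, b, c, d):
    """Chi-square statistic for a 2x2 homogeneity test.

    Row 1: a in-category, b out-of-category (e.g. one swing state)
    Row 2: c in-category, d out-of-category (e.g. national vote)
    Closed form: n * (ad - bc)**2 / ((a+b)(c+d)(a+c)(b+d)).
    """
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Made-up counts: 520/1000 Trump voters in a state vs. 480/1000 nationally.
stat = chi2_homogeneity_2x2(520, 480, 480, 520)
significant = stat > 3.841  # chi-square critical value, df = 1, alpha = 0.05
```

Only the tests where the statistic exceeds the critical value would move on to the follow-up two-proportion z-test.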

For phase 2, I will collect data from 2008 and 2012 in states where a statistically significant portion of the phase 1 tests were significant. Then I will compare voting behavior with the Chi-Square Test of Homogeneity on three pairings: 2008 vs. 2012, 2008 vs. 2016, and 2012 vs. 2016. Significant results will then be tested using a two-proportion z-test.

I am going with the Chi-Square test first for two reasons: it is not subject to errors in the direction of an effect, and it is less sensitive than a two-proportion z-test. I have to be very careful in my interpretation of the results, since an analysis this large carries a big potential for false positives and false negatives. This analysis will probably take me most of next year. I’ll give an update on my progress in December.
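The multiple-testing concern can be made concrete with a little arithmetic. The Bonferroni threshold below is a standard correction I include only for illustration, not a step the analysis commits to:

```python
# With ~1,500 tests at alpha = 0.05, roughly 75 "significant" results
# are expected even if every null hypothesis is true.
n_tests, alpha = 1500, 0.05
expected_false_positives = n_tests * alpha  # 75.0

# For comparison, a Bonferroni-style per-test threshold that keeps the
# overall false-positive rate near 0.05 would be far stricter.
bonferroni_alpha = alpha / n_tests
```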

A Non-Technical Overview of My Research

Recently I have been drafting a research article on my general election model to submit for academic publication. But that paper is technical and requires some exposure to statistical research to understand. Here I want to explain my research without going into all the technical details.

Introduction

The President of the United States is elected every four years. The Electoral College decides the winner through the votes of electors chosen by their home states. Usually the electors are chosen based on the winner of their state, and they vote for that state’s winner. Nate Silver correctly predicted the winner of the 2008 election with Bayesian statistics, getting 49 out of 50 states correct. Silver certainly wasn’t the first person to predict the election, but he received a lot of attention for his model. Silver runs Five Thirty Eight, which covers statistics and current events. Bayesian statistics is a branch of statistics that starts from information you already have (called a prior) and adjusts the model as more information comes in. My model, like Nate Silver’s, used Bayesian statistics. We do not know the details of the Silver model beyond the fact that it used Bayesian statistics. To the best of my knowledge, my method is the first publicly available model that used poll data from other states as the prior. I made a prediction for 2016, where I missed 6 states. Then I applied the model to 2008 and 2012, where my predictions of the state winners matched the predictions of Five Thirty Eight.
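As a toy illustration of a Bayesian update, here is the conjugate normal-normal case, where combining a prior with a new poll is a precision-weighted average. The numbers and this particular form are mine for illustration; neither Silver’s model nor mine is specified in that detail here:

```python
def normal_update(prior_mean, prior_var, poll_mean, poll_var):
    """Conjugate normal-normal Bayesian update.

    The posterior mean is a precision-weighted average of the prior and
    the new poll; the posterior variance shrinks as more data arrives.
    """
    post_var = 1 / (1 / prior_var + 1 / poll_var)
    post_mean = post_var * (prior_mean / prior_var + poll_mean / poll_var)
    return post_mean, post_var

# Toy numbers: a prior of 48% (variance 9) updated by a poll at 52% (variance 4).
mean, var = normal_update(48.0, 9.0, 52.0, 4.0)
```

Note that the posterior mean lands between the prior and the poll, pulled toward the poll because the poll has the smaller variance.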

Methodology

I took poll data from Pollster, which provided csv files for the 2016 and 2012 elections. For 2008 I had to create the csvs by hand. I wrote a series of computer programs in Python (a common programming language) to analyze the data. My model used the normal distribution. My approach divided the 50 states into 5 regional categories: swing states, southern red states, midwestern red states, northern blue states, and western blue states. The poll data sources used as the priors were national polls, Texas, Nebraska, New York, and California, respectively. This approach is currently believed to be unique, but since multiple models are proprietary it is unknown whether it has been used before. I only used polls that were added to Pollster before the Saturday before Election Day. For the 2016 election analysis this meant November 5th, and I posted my predictions on November 5th.
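The post doesn’t spell out exactly how the normal distribution enters the model, but one standard use, turning an estimated vote margin into a win probability, can be sketched like this (the numbers are hypothetical):

```python
from math import erf, sqrt

def win_probability(margin_mean, margin_sd):
    """P(candidate's margin > 0) when the margin is modeled as
    Normal(margin_mean, margin_sd**2): the standard normal CDF
    evaluated at margin_mean / margin_sd.
    """
    return 0.5 * (1 + erf(margin_mean / (margin_sd * sqrt(2))))

# Hypothetical state: candidate leads by 3 points, with an sd of 2 points.
p_win = win_probability(3.0, 2.0)
```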

I outline more of my method here.

Results and Discussion

My model worked pretty well compared to other models. Below is a table of other models and their success rates at predicting the winning candidate in all 50 states and Washington D.C.

Race                   Real Clear Politics   Princeton Election Consortium   Five Thirty Eight (Polls Plus)   PredictWise (Fundamental)   Sabato’s Crystal Ball   My Model
2008 Winner Accuracy   0.96078               0.98039                         0.98039                          N/A                         1.00000                 0.98039
2012 Winner Accuracy   0.98039               0.98039                         1.00000                          0.98039                     0.96078                 1.00000
2016 Winner Accuracy   0.92157               0.90196                         0.90196                          0.90196                     0.90196                 0.88235
Average Accuracy       0.95425               0.95425                         0.96078                          0.94118                     0.95425                 0.95425

As you can see, all the models do a similar job of picking the winner in each state, which is what decides the Electoral College. There are other ways to compare accuracy, but I don’t want to discuss them here since it gets a little technical. No model was right for every state in every election. It would probably be impossible to create a model that consistently predicts the winner in all states, because of the variability of political opinions. Election prediction is not an exact science. But there is potential to apply polling analysis to estimate public opinion on certain issues and politicians. Right now the errors in polls are too large to determine public opinion on close issues, but further research could find ways to reduce the error in polling analysis.