Days before the 2020 US elections, I stumbled on a data mining competition to predict state-by-state results, and immediately jumped on board. Mostly because I was eager to guesstimate, for my own sake, who was going to win…

After finding FiveThirtyEight’s (FTE) spreadsheet of polling results, I thought it might be useful to scrape polling results, and maybe sentiment, from other websites. But after spending a whole evening learning to use Selenium and BeautifulSoup to successfully scrape data from a single website, I realized I didn’t have time to do this a bunch more times, integrate/clean all the data, AND build a prediction model before the deadline.

Well, FTE data it was. (Given that the prize pool was $0, I was fine using just one source.)

Below, I show some exploratory data analysis, my prediction model, and the end result.

  1. Exploratory data analysis using the 2016 election
  2. Modeling with the 2020 polling data
  3. Post-mortem: how did I do?  
  4. Time spent/tools  
     

Exploratory data analysis using the 2016 election

First, using the 2016 polls and results, I wanted to get a sense of which factors affect how closely polling results track the eventual election outcome.

One obvious question is whether there are systematic biases according to state, e.g. whether polls consistently under-estimate/over-estimate election results in a particular state for a particular party. Based on the figure below, it seems that this is definitely the case. For example, in CA/CO/CT, the Democrats consistently polled lower than their actual performance in the election.
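Roughly, this bias check boils down to something like the sketch below. The data frame and column names (`polls_2016`, `poll_pct`, `result_pct`) are placeholders for illustration, not the actual FTE field names.

```r
# Sketch only: assumes a tidy data frame `polls_2016` with one row per
# poll x party and hypothetical columns `state`, `party`, `poll_pct`
# (the poll's share for that party) and `result_pct` (the 2016 result).
library(dplyr)

state_bias <- polls_2016 %>%
  mutate(error = poll_pct - result_pct) %>%   # positive = poll over-estimated
  group_by(state, party) %>%
  summarise(mean_bias = mean(error, na.rm = TRUE),
            n_polls   = n(),
            .groups   = "drop")
```

A negative mean_bias for, say, California Democrats would match the pattern in the figure: Democrats polling below their eventual result in that state.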

Different pollsters have different FTE grades, based on historical accuracy and polling methodology. I was curious whether the better rated ones turned out to be more accurate. I removed +/- (“A+” becomes “A”) and converted any in-between grades to the higher grade (“B/C” becomes “B”), to avoid sparsity. It looked like there was definitely some association – but not the one I was expecting. Pollsters rated “B” seemed to be the furthest off the mark, for both Democrats and Republicans, across most of the states.
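The grade simplification is just string cleanup; something like the following sketch, where `fte_grade` is a placeholder column name:

```r
# Collapse FTE grades: strip +/- and keep the higher letter of "B/C"-style grades
library(dplyr)
library(stringr)

simplify_grade <- function(grade) {
  grade %>%
    str_remove_all("[+-]") %>%                      # "A+" -> "A", "C-" -> "C"
    str_split("/") %>%                              # "B/C" -> c("B", "C")
    vapply(function(x) sort(x)[1], character(1))    # keep the higher grade ("B")
}

polls_2016 <- polls_2016 %>%
  mutate(grade_simple = simplify_grade(fte_grade))
```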

There was also the question of who is being polled, and whether the polling population affects how well a particular poll’s results line up with election results. I expected “All Adults” to be more biased than “Registered voters”, and the latter to be more biased than “Likely voters”. I wasn’t sure about “Voters”, as that seems to rely on self-report of whether people intend to vote, and I don’t know enough to speculate whether that’s more accurate than “Likely voters”.

As it turned out, there were no real differences based on the eyeball test. It doesn’t seem very popular to poll “All adults” or “Voters”, so we’re really left with comparing “Likely voters” and “Registered voters”. You can squint and maybe conclude that “Likely voters” looks like the more accurate of the two, but there’s no consistent or obvious outperformance.

Lastly, I wanted to know if polls run closer to the date of the election tended to be more accurate. Looking at the 30 days leading up to the 2016 election, it seemed like the answer was “yes, kind of”. I also found it interesting to see so much variation in terms of whether polling results were directionally consistent, whether gaps in polling averages meant gaps in election performance, etc.
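One way to look at this, as a sketch building on the earlier snippet (here `end_date` is a placeholder for the poll's end date):

```r
library(dplyr)

election_day_2016 <- as.Date("2016-11-08")

error_by_day <- polls_2016 %>%
  mutate(days_to_election = as.numeric(election_day_2016 - as.Date(end_date)),
         error            = poll_pct - result_pct) %>%
  filter(days_to_election >= 0, days_to_election <= 30) %>%
  group_by(days_to_election) %>%
  summarise(mean_abs_error = mean(abs(error), na.rm = TRUE), .groups = "drop")
```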

(Note: This is not the version of the figure I created when actually analyzing the data. My quick-and-dirty version did not suggest any real relationship, but this version does, and had I seen it I would have done some kind of filtering/weighting based on days to election.)

Modeling with the 2020 polling data

Given what I’d seen in EDA, I thought I could do pretty well by taking an adjusted/weighted average of polling results. I would take a weighted average (on a state-by-state basis) by FTE grade, as well as by sample size (as a matter of good practice, since more people polled = more representative results). Then, I would adjust these numbers by some estimate of the state bias we saw earlier, and finally make sure everything summed to 100%.
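A sketch of that pipeline, using the same placeholder column names as above; the grade weights here are illustrative, not the values I actually used:

```r
library(dplyr)

# Illustrative grade weights (placeholders, not tuned values)
grade_weights <- c(A = 3, B = 2, C = 1, D = 0.5)

predictions_2020 <- polls_2020 %>%
  mutate(w = grade_weights[grade_simple] * sample_size) %>%
  group_by(state, party) %>%
  summarise(poll_avg = weighted.mean(pct, w, na.rm = TRUE), .groups = "drop") %>%
  # adjust by the 2016 state/party bias estimated earlier
  left_join(state_bias, by = c("state", "party")) %>%
  mutate(adjusted = poll_avg - coalesce(mean_bias, 0)) %>%
  # renormalise so each state's shares sum to 100%
  group_by(state) %>%
  mutate(pred_pct = 100 * adjusted / sum(adjusted)) %>%
  ungroup()
```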

Post-mortem: how did I do?

Competition-wise, I placed 3rd with an RMSE of 2.6% (first place was 1.94% – much better!). But now that a few months have gone by, I want to take a closer look at how the election results line up with my predictions.
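For reference, the competition metric is just root-mean-square error over the state-level predictions; in R this is a one-liner (column names are placeholders):

```r
# RMSE over all state x party predictions
rmse <- with(results_2020, sqrt(mean((pred_pct - actual_pct)^2, na.rm = TRUE)))
```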

In general, it looks like my estimates (small dots) were pretty close to the election results (large dots), except for a couple of states with huge differences. I also seemed to under-estimate the Republican % and over-estimate the Democrat % in quite a few states. I haven’t looked into what on my end might have contributed to the latter… perhaps a project for another day.

Time spent/tools

Approximate time breakdown:

  • Initial data exploration — 0.5 hours
  • Web scraping — 3 hours
  • Data cleaning and exploratory data analysis — 3 hours
  • Modeling and submission — 3 hours
  • Post-mortem, making pretty figures, write-up — 11 hours

Languages/tools used: R, Rmarkdown, RStudio