Days before the 2020 US elections, I stumbled on a data mining competition to predict state-by-state results, and immediately jumped on board. Mostly because I was eager to guesstimate, for my own sake, who was going to win…

After finding FiveThirtyEight's (FTE) spreadsheet of polling results, I thought it might be useful to also scrape polling results, and maybe sentiment, from other websites. But after spending a whole evening learning to use Selenium and BeautifulSoup to successfully scrape data from a single website, I realized I didn't have time to do this a bunch more times, integrate and clean all the data, AND build a prediction model before the deadline.
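For reference, that single-site scrape boiled down to pulling candidate/percentage pairs out of an HTML table. The sketch below shows the parsing half of that idea; to keep it self-contained it uses the stdlib `html.parser` instead of BeautifulSoup, and the markup is a made-up placeholder, not any real polling site's structure:

```python
from html.parser import HTMLParser

# Hypothetical markup standing in for a polling-results table on some site.
HTML = """
<table>
  <tr><td>Biden</td><td>51</td></tr>
  <tr><td>Trump</td><td>47</td></tr>
</table>
"""

class PollTableParser(HTMLParser):
    """Collect the text of every <td> cell; pairs of cells form (candidate, pct) rows."""
    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td and data.strip():
            self.cells.append(data.strip())

parser = PollTableParser()
parser.feed(HTML)
rows = list(zip(parser.cells[::2], parser.cells[1::2]))
print(rows)  # [('Biden', '51'), ('Trump', '47')]
```

With BeautifulSoup the same extraction is a one-liner over `soup.find_all("td")`; Selenium only enters the picture when the table is rendered by JavaScript and plain HTTP requests come back empty.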

Well, FTE data it was. (Given that the prize pool was $0, I was fine using just one source.)

Below, I show some exploratory data analysis, my prediction model, and the end result.

  1. Exploratory data analysis using the 2016 election
  2. Modeling with the 2020 polling data
  3. Post-mortem: how did I do?  
  4. Time spent/tools  

Exploratory data analysis using the 2016 election

First, I wanted to get a sense of what factors affect how well polling results go on to explain election results, using the 2016 polls and results.

One obvious question is whether there are systematic biases by state, i.e. whether polls consistently under-estimate or over-estimate election results in a particular state for a particular party. The figure below suggests this is clearly the case. For example, in CA/CO/CT, the Democrats consistently polled lower than their actual performance in the election.
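The bias check above amounts to comparing each state's final poll margin with its actual result. A minimal sketch of that computation, using made-up numbers rather than the real 2016 FTE data:

```python
# Per-state polling bias: poll margin minus actual margin (Dem minus Rep, pct points).
# Negative bias => polls under-estimated the Democratic margin; positive => over-estimated.
# These numbers are invented for illustration, not real 2016 data.
polls   = {"CA": 23.0, "CO": 5.0, "WI": 6.5}    # final pre-election poll margins
results = {"CA": 30.1, "CO": 4.9, "WI": -0.8}   # actual election margins

bias = {state: polls[state] - results[state] for state in polls}

for state, b in sorted(bias.items(), key=lambda kv: kv[1]):
    direction = "under-estimated" if b < 0 else "over-estimated"
    print(f"{state}: polls {direction} the Democratic margin by {abs(b):.1f} pts")
```

Aggregating this quantity over many 2016 polls per state (e.g. the mean bias) is what reveals the consistent state-level offsets the figure shows.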