@tomjs

More Words.

2020 Election Prediction

Electoral Vote Distribution

Each time the model is run it simulates the election several thousand times. This histogram represents the outcomes of every simulation. The Y-axis represents the number of times Biden achieved a specific outcome when the model was run. In order to win, Biden must get at least 270 electoral votes and that dividing point is marked on the histogram.

The States

AK | AL | AZ | AZ | CA | CO | FL | GA | HI | IA | ID | IN | KS | LA | MA | MD | ME | MI | MN | MO | MS | MT | NC | ND | NH | NJ | NM | NV | OH | OK | OR | PA | SC | TN | TX | UT | VA | WA | WI

Select a state above or search for one below:

On each graph the blue line represents Biden’s estimated latent vote share up until the current day. The dotted line represents the model’s prediction of how the race will change from the current day until election day (November 3rd). The shaded light blue area represents a 95% confidence interval of Biden’s vote share. In other words, the model is 95% confident that Biden’s eventual vote share will end up within the shaded area. Finally, each green dot represents a poll that has been released.

As you may have noticed, the more polling data available the more confident the model is of the latent vote intention. However, that confidence decreases as the model attempts to predict the vote share on election day.

The Methodology

The Origins

This is a direct implementation of Drew Linzer’s dynamic Bayesian forecasting model. This makes it similar at its core to The Economist’s presidential forecast as well as the 2016 election model by Pierre-Antoine Kremp for Slate.

The model uses the Fair model, developed by Yale economist Ray Fair, as its informative Bayesian prior, \(h_i\) for states \(i=1 … 50\). The Fair model takes into account incumbency, the state of the economy and whether or not the incumbent party has been in office for two or more consecutive terms in order to predict the vote share for both parties. I write in more details about the impetus for designing this model as well as the issues with it in a blog post here.

How polls are interpreted

For each poll, the number of respondents declaring that they will vote for Joe Biden (\(y_k\)) is drawn from a binomial distribution where \(i[k]\) and \(j[k]\) indicate the state, \(i\), and day, \(j\), of poll \(k\) and \(n_k\) denotes the total number of respondents who indicate a preference for one of the two major parties:

\[\small y_k \sim Binomial(\pi_{i[k]j[k]}, n_k)\]

As this is a direct implementation of Linzer’s model it only takes into account state polls of “Likely Voters” and ignores national polls as well as polls of “Registered Voters”. It also does not take into account house effects or other sources of survey error like other, more advanced, models do.

How latent voter intention is estimated

The daily trends in all states latent voter intention \(\pi_{ij}\) are estimated simultaneously as a function of two components: a state-level component, \(\beta_{ij}\), and a national-level component, \(\delta_j\), that detect systematic departures from \(\beta_{ij}\) due to events that affect all states, such as convention bounces. Both components are placed on the logit scale:

\[\small \pi_{ij} = logit^{-1}(\beta_{ij} + \delta_j) \]

How the forecast is made

On election day (\(\delta_j\)) is fixed at 0 and (\(\beta_{ij}\)) is defined as a Normal distribution around the structural forecast \(h_i\). The covariance matrix (a slight departure from Linzer’s initial paper) is set at variance of 0.18 and covariance of 0.162. This ensures that state scores are correlated with each other, since it is highly unlikely that Biden will perform better in Texas than Virginia, something an independent prior would not take into account.

\[\small \beta_{iJ} \sim Multivariate Normal(logit(h_i), \Sigma_{end})\]

When predicting the election result weeks or months in advance, there will inevitably be a gap in the polling data. In order to bridge this gap, as well as smoothing voter intention on days no polls are conducted, both the state and national level component use a Bayesian reverse random-walk prior which begins on election day.

Each day’s estimation of \(\beta_{ij}\) is given prior distribution:

\[\small \beta_{ij}|\beta_{ij+1} \sim MultivariateNormal(\beta_{ij+1}, \Sigma_{walk})\]

Similarly, \(\delta_j\) is given a prior distribution:

\[\small \delta_j|\delta_{j+1} \sim MultivariateNormal(\delta_{j+1}, \Sigma_{walk})\]

The variance is also give as \(\Sigma_{walk}\) with variance = \(0.017^2\) and covariance = 0.0002601.

What the forecast is doing

Overall, the election day forecast in each state is a compromise between the structural model, \(h_i\), and the most recent poll result. The uncertainty depends on the number and size of the polls available as well as the proximity to election day. As older state polls are replaced by newer ones they have less influence on the forecast but they leave behind historical trends in the latent voter intention.

Finally, the electoral vote distribution for Joe Biden is then determined by the probability that Biden’s vote share will be greater than 50% in any given state.

Where to Find the Source Code

The code for this model is all available on GitHub and is separated into three main files:

Polls.stan: The stan code used to compute the MCMC draws.
Runmodel.R: The code to wrangle the poll data so that it can be inserted into the model. It then analyses the model output to develop the visualisation data.
Main.js: The d3.js code used to produce the visualisations for the model.

I also highly recommend that you check out the source-code for The Economist’s model, Pierre-Antoine Kremp's model from 2016 and Cory Mccartan’s model. Their code was highly instrumental in enabling me to develop my own forecast.