Binomial Elections

Last updated 2015-09-12 00:05:39 SGT

Consider the following quote:

The Straits Times understands that the sample counts were found to have a confidence level of 95 per cent, plus or minus 4 percentage points

This statement is somewhat misshapen. A better rendering could perhaps state this as

The sample counts are accurate to a confidence interval of plus or minus 4 percentage points, to a confidence level of 95 percent

That is to say, we are 95% confident that the actual results will lie within 4 percentage points of those returned by the sample counts.

Now, is this statement necessarily accurate? We are assuming that the number of electors in each constituency is very large compared to our sample; in that case each sampled ballot can be treated as an independent trial that comes up in favour of a particular party with constant probability [p]. The total number of sampled votes for our favourite party then follows the binomial distribution.

Consider the following properties of the binomial distribution with [n] trials at a probability [p] of success:

  1. [E[X] = n p]
  2. [\text{Var}[X] = n p (1 - p)]
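
As a quick sanity check on these two properties, here is a minimal sketch that sums over the binomial probability mass function directly. The function name `binomial_moments` and the values n = 100, p = 0.3 are made up for illustration.

```python
from math import comb

def binomial_moments(n, p):
    """Return (mean, variance) of Binomial(n, p) by summing over its pmf."""
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    mean = sum(k * w for k, w in enumerate(pmf))
    var = sum((k - mean) ** 2 * w for k, w in enumerate(pmf))
    return mean, var

mean, var = binomial_moments(100, 0.3)
# Compare against the closed forms n*p = 30 and n*p*(1-p) = 21.
print(mean, var)
```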

Suppose we were given the result (that we had [X] successes) of such a set of trials, with known [n] and unknown [p]. We could construct an unbiased estimator for [p] as [\hat{p} = {X \over n}]. The corresponding unbiased estimator for the variance of [\hat{p}] would then be [\hat{\sigma}^2 = {1 \over n-1} \hat{p} (1 - \hat{p})]. Unfortunately, this is manifestly dependent on [\hat{p}], unlike the flat value cited in the ST.

To make matters worse, consider the actual value of [\hat{\sigma}^2] at various values of [\hat{p}]. In the extreme case, for [n = 100, \hat{p} = {1 \over 2}], we obtain [n\sqrt{\hat{\sigma}^2} \approx 5], that is, a standard deviation of about 5 votes out of 100, or 5 percentage points. Even if we make the (perhaps justified) approximation of a normal distribution, this gives a 95% confidence interval of roughly plus or minus 10 percentage points, far wider than the 4 quoted. On the other extreme, setting [\hat{p} = 0] (perhaps by freak accident) might, with less careful assumptions, lead us to claim that there is literally no doubt as to the actual result, which is obviously not true.
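
To see how strongly the estimated spread depends on [\hat{p}], here is a rough sketch that evaluates the estimators above for a sample of 100. The helper `estimate`, the constant `Z_95` and the choice of example counts are all made up for illustration, and the 95% half-width assumes the normal approximation holds.

```python
from math import sqrt

Z_95 = 1.96  # two-sided 95% quantile of the standard normal (assumed approximation)

def estimate(x, n):
    """Return (p_hat, sigma_hat) for x successes in n trials.

    sigma_hat is the estimated standard deviation of p_hat, using the
    1/(n-1) variance estimator from the text.
    """
    p_hat = x / n
    sigma_hat = sqrt(p_hat * (1 - p_hat) / (n - 1))
    return p_hat, sigma_hat

for x in (50, 30, 0):
    p_hat, sigma_hat = estimate(x, 100)
    sd_points = 100 * sigma_hat        # standard deviation in percentage points
    half_width = Z_95 * sd_points      # 95% half-width in percentage points
    print(f"p_hat={p_hat:.2f}  sd={sd_points:.2f}  half-width={half_width:.2f}")
```

The spread is largest at [\hat{p} = {1 \over 2}] and collapses to exactly zero at [\hat{p} = 0], which is the degenerate case described above.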

Instead, suppose we were to try averaging this across all possible values of [\hat{p}]. This gives us [\delta = {n \over \sqrt{n-1}} \int_0^1 \sqrt{x(1-x)}dx = {\pi \over 8}{n \over \sqrt{n-1}}] (the integral is just the area of a semicircle of radius [{1 \over 2}], but that is not important). For [n = 100] this is indeed roughly 4 percentage points, but this corresponds to 1 standard deviation, not a 95% confidence interval, which would then be about plus or minus 8 percentage points.
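
For anyone who would rather not do the integral by hand, here is a small numerical sketch using a plain midpoint rule. The function names and the step count are arbitrary choices for illustration.

```python
from math import pi, sqrt

def average_root_variance(steps=100_000):
    """Midpoint-rule approximation of the integral of sqrt(x(1-x)) over [0, 1]."""
    h = 1.0 / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h  # midpoint of the i-th subinterval, strictly inside (0, 1)
        total += sqrt(x * (1 - x))
    return h * total

def delta(n):
    """Averaged standard deviation of the vote count, per the formula in the text."""
    return pi / 8 * n / sqrt(n - 1)

print(average_root_variance(), pi / 8)  # the two values should agree closely
print(delta(100))                       # one standard deviation, for n = 100
print(1.96 * delta(100))                # 95% half-width under a normal approximation
```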

Now, as a fraction of [n] this is obviously going to scale as [{1 \over \sqrt{n}}], so the width of the interval depends on the sample size, which in turn varies with the size of the constituency. The 4 percentage point confidence interval that the ST reports suggests (working backwards from the worst case [\hat{p} = {1 \over 2}]) a value of [n] in the ballpark of 600 to 650.
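
The working-backwards step can be sketched as follows, assuming the normal approximation and the worst case [p = {1 \over 2}]. The function name `required_sample_size` is made up for illustration, and the result depends on whether we take the 95% multiplier to be 1.96 or round it up to 2.

```python
def required_sample_size(half_width, p=0.5, z=1.96):
    """Sample size n at which z * sqrt(p(1-p)/n) equals half_width.

    half_width is a fraction (0.04 means 4 percentage points).
    The result is returned unrounded.
    """
    return z**2 * p * (1 - p) / half_width**2

print(required_sample_size(0.04))          # about 600 with z = 1.96
print(required_sample_size(0.04, z=2.0))   # about 625 with z = 2
```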

However, for the purposes of this election, [n] is 100 times the number of polling stations in each constituency; this means that [n] probably goes asymptotically as [n \propto N] (not sure how polling stations are actually allocated, to be honest). Prima facie, this means that the sample estimators become more accurate for larger constituencies. We have about 2.4 million voters and 852 polling stations, so that adds up to roughly [n \sim {1 \over 28} N] for each constituency on average. For example, this yields a 95% confidence interval of plus or minus about 1.4 percentage points for Aljunied GRC, which makes the result quite boring.
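
The arithmetic here can be sketched as below. The voter and polling station totals come from the figures above; the electorate of 148,000 passed in at the end is an assumed, illustrative number for a large GRC and is not taken from the text. The half-width again assumes the worst case [p = {1 \over 2}] and a normal approximation.

```python
from math import sqrt

VOTERS = 2_400_000         # approximate total electorate, from the text
POLLING_STATIONS = 852     # from the text
SAMPLE_PER_STATION = 100   # ballots sampled per polling station, from the text

def sampling_ratio():
    """Average number of electors per sampled ballot (N / n)."""
    return VOTERS / (POLLING_STATIONS * SAMPLE_PER_STATION)

def half_width_points(electors, p=0.5, z=1.96):
    """95% half-width in percentage points for a constituency of this size."""
    n = electors / sampling_ratio()  # assumes n is proportional to N
    return 100 * z * sqrt(p * (1 - p) / n)

print(sampling_ratio())
print(half_width_points(148_000))  # 148,000 is an assumed electorate size
```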

However, this also means that the binomial distribution assumption may be faulty (since the sample size is large enough to be a nontrivial fraction of the population), and a more involved analysis might be required. The ST sure isn't helping, though.
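
One standard adjustment for sampling without replacement, which is not something the analysis above uses, is the finite population correction: the count then follows a hypergeometric rather than a binomial distribution, and its variance picks up a factor of [{N - n \over N - 1}]. A minimal sketch, with made-up function names, and addressing only the without-replacement effect:

```python
from math import sqrt

def fpc(n, N):
    """Finite population correction factor for drawing n of N without replacement."""
    return sqrt((N - n) / (N - 1))

def corrected_half_width_points(n, N, p=0.5, z=1.96):
    """95% half-width in percentage points, with the correction applied."""
    return 100 * z * sqrt(p * (1 - p) / n) * fpc(n, N)

# Illustrative sizes only: a sample of 1,000 from 28,000 electors (ratio 1 in 28).
print(fpc(1_000, 28_000))
print(corrected_half_width_points(1_000, 28_000))
```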

Moral: try harder next time, ST.
