# Percentages for sceptics: part III

I wanted to do some self-criticism of my previous two posts in this series:

1. You can calculate the minimum number of respondents from a single percentage by hand (no need for computer programmes or look-up tables).
2. I’ve made a very rough model to estimate how many people the programme typically returns when fed six percentages (as I did several times here).

In between, I’ve collected some links to demonstrate how great continued fractions can be.

## Handy calculations

There are many ways of writing real numbers (fractions and irrationals) apart from in decimal notation. You can represent them in binary, for instance $\pi = 11.001001000011\ldots$, or in other bases. These have their uses: there is a formula to calculate the $n$th binary digit of $\pi$ without calculating all the preceding digits.

For our purposes we will use continued fractions. We write $\pi$ as $[3; 7, 15, 1, 292, 1, 1, 1, 2, 1, \ldots]$. This notation means that 3 is a good first approximation to $\pi$; the well-known $3+\frac{1}{7}=\frac{22}{7}$ is the closest you can get with any fraction $\frac{p}{q}$ with $q \leq 7$; then $3+\frac{1}{ 7 + \frac{1}{15}}=\frac{333}{106}$ is the best with $q\leq 106$; and the fourth term

$3+\frac{1}{7+\frac{1}{15+\frac{1}{1}}}=\frac{355}{113}$

is a very good approximation, as the next number in the square brackets, 292, is very large (I’ll motivate this observation at the end of the section). The golden ratio is sometimes called the ‘most irrational’ number because it has a continued fraction expansion with all ones, so the sequence converges slowly.

The algorithm for finding the continued fraction for $x_0$ is:

1. Let $a_n= \lfloor x_n \rfloor$.
2. Subtract the integer part of the number: $y_n = x_n - a_n$
3. Stop if the answer $y_n$ is zero.
4. Invert what remains: $x_{n+1} = \frac{1}{y_n}$.
5. Go to 1 to compute $a_{n+1}$.

This gives $[ a_0; a_1, a_2, \ldots]= a_0 + \frac{1}{a_1+\frac{1}{a_2+\frac{1}{a_3 + \cdots}}}$.
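The steps above can be sketched in a few lines of Python (the function name and the cap on the number of terms are my own; feeding in exact `Fraction`s avoids floating-point noise in the later terms):

```python
from fractions import Fraction
from math import floor

def continued_fraction(x, max_terms=12):
    """Terms [a0, a1, a2, ...] found by repeatedly flooring and inverting."""
    terms = []
    for _ in range(max_terms):
        a = floor(x)           # step 1: the integer part a_n
        terms.append(a)
        y = x - a              # step 2: subtract it
        if y == 0:             # step 3: stop if nothing remains
            break
        x = 1 / y              # step 4: invert and go round again
    return terms
```

For instance, `continued_fraction(Fraction(355, 113))` gives `[3, 7, 16]`, the collapsed form of $[3; 7, 15, 1]$.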

If you look back at the first post, to find the minimum number of respondents to get $N$%, we are aiming to find the least whole number $q$ such that:

$\frac{N-0.5}{100} \leq \frac{p}{q} < \frac{N + 0.5}{100}$

This is easy for 50%: clearly $q=2$ is the optimum. For 1%, we need $\frac{5}{1000} \leq \frac{p}{q} < \frac{15}{1000}$. Simplifying the fractions and inverting, we get:

$200 \geq \frac{q}{p} > \frac{200}{3} = 66 \frac{2}{3}$.

The least integer that satisfies this inequality is 67, so take $q=67$, and $p=1$.
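For any whole-number percentage, this least $q$ can also be found by direct search: for each candidate $q$, the smallest permissible $p$ is $\lceil q(N-0.5)/100 \rceil$, and we stop at the first $q$ whose $\frac{p}{q}$ is still below the upper bound. A brute-force sketch (function name mine):

```python
from fractions import Fraction
from math import ceil

def min_respondents(N):
    """Least sample size q for which some p/q rounds to N% (1 <= N <= 99)."""
    lo = Fraction(2 * N - 1, 200)   # (N - 0.5) / 100
    hi = Fraction(2 * N + 1, 200)   # (N + 0.5) / 100
    q = 1
    while True:
        p = ceil(lo * q)            # smallest p with p/q >= lo
        if Fraction(p, q) < hi:
            return q
        q += 1
```

`min_respondents(1)` returns 67 and `min_respondents(50)` returns 2, matching the examples above.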

In general, suppose $\frac{p}{q}$ is the best approximation we are after; we can work out what its continued fraction must be. Let’s take 34%:

$\frac{33.5}{100} \leq \frac{p}{q} < \frac{34.5}{100}$

Inverting,

$2 \frac{66}{67} = \frac{200}{67} \geq \frac{q}{p} > \frac{200}{69} = 2\frac{62}{69}$

$\frac{66}{67} \geq \frac{q}{p} - 2 > \frac{62}{69}$

Inverting again,

$1 \frac{1}{66}=\frac{67}{66} \leq \frac{1}{\frac{q}{p}-2} < \frac{69}{62} = 1 \frac{7}{62}$.

$\frac{1}{66} \leq \frac{1}{\frac{q}{p}-2} -1< \frac{7}{62}$

Inverting yet again,

$66 \geq x = (\frac{1}{\frac{q}{p}-2}-1)^{-1} > \frac{62}{7} = 8\frac{6}{7}$.

We’ve now extracted as many continued fraction terms as the two bounds are guaranteed to share. The best option is to take $x$ to be the least integer that satisfies the inequalities.

That is $x = (\frac{1}{\frac{q}{p}-2}-1)^{-1} = 9$. Undoing this, we get

$\frac{p}{q} = \frac{1}{2+\frac{1}{1+\frac{1}{9} } }=\frac{1}{2+\frac{9}{10}}=\frac{10}{29}$.

So $q=29$ gives the minimal sample size for 34%.

This algorithm can be turned into a proof, the key fact being that a continued-fraction convergent is the best approximation among all fractions with smaller denominators. If you ever face a zero on one side of the inequality, so that no integer satisfies it, you have to invert one final time, ignoring the infinite $\frac{1}{0}$ side.

If you write out the continued fractions:

$0.335=[0; 2, 1, 66]$ and

$0.345=[0; 2, 1, 8, 1, 6]$

we can read off $\frac{p}{q}=[0; 2, 1, 9]=\frac{10}{29}$.
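A small evaluator for the bracket notation makes the read-off easy to check (function name mine):

```python
from fractions import Fraction

def from_cf(terms):
    """Evaluate [a0; a1, ..., an] back into a fraction, from the inside out."""
    x = Fraction(terms[-1])
    for a in reversed(terms[:-1]):
        x = a + 1 / x
    return x
```

`from_cf([0, 2, 1, 9])` returns `Fraction(10, 29)`, while the two endpoints come back as `from_cf([0, 2, 1, 66]) == Fraction(67, 200)` and `from_cf([0, 2, 1, 8, 1, 6]) == Fraction(69, 200)`.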

Care needs to be taken with examples such as

$0.365 = [0;2,1,2,1,5,3]$ and

$0.375 = [0;2,1,2]$,

as the shorter expansion is a prefix of the longer one. Here we need to take one more than the next term of the longer continued fraction (this is the case warned about above, where we took the inverse of zero). Essentially, we can consider $0.375 = [0;2,1,2, \infty]$, and so take the approximation $[0;2,1,2,2]$.
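The whole procedure can be sketched as a recursion that runs the floor-and-invert steps on both endpoints at once; the excluded-endpoint case just described shows up when the lower endpoint becomes a whole number, and is handled as an ‘infinite’ next term. The function name and the open/closed endpoint flags are my own devices:

```python
from fractions import Fraction
from math import ceil, floor

def simplest_in(lo, hi, lo_open=False, hi_open=True):
    """Smallest-denominator fraction x in the interval from lo to hi
    (by default lo <= x < hi, matching the rounding inequality)."""
    # first try the smallest integer the interval allows
    n = ceil(lo)
    if lo_open and n == lo:
        n += 1
    if n < hi or (n == hi and not hi_open):
        return Fraction(n)
    a = floor(lo)
    if lo == a:
        # excluded integer endpoint, e.g. 0.375 = [0; 2, 1, 2, "infinity"]:
        # y = 1/(x - a) runs from 1/(hi - a) up to infinity, so the best
        # y is simply the smallest integer that lower end allows
        y_lo = 1 / (hi - a)
        y = ceil(y_lo)
        if hi_open and y == y_lo:
            y += 1
        return a + Fraction(1, y)
    # otherwise recurse: under x -> 1/(x - a) the endpoints swap,
    # and each endpoint's openness travels with it
    return a + 1 / simplest_in(1 / (hi - a), 1 / (lo - a), hi_open, lo_open)
```

`simplest_in(Fraction(67, 200), Fraction(69, 200))` returns `Fraction(10, 29)`, and `simplest_in(Fraction(73, 200), Fraction(75, 200))` returns `Fraction(7, 19)`, handling the $0.375$ endpoint correctly.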

@christianp mentioned that his lecturer would use continued fractions to undo the percentages at the end of levels in the game Doom. This is what prompted me to search for this method. The lecturer’s method was simply to compute the continued fraction of the percentage given, and take longer and longer truncations of it until you get something which rounds to the right percentage. Doing that on $\frac{37}{100} = [0;2,1,2,2,1,3]$ gives $\frac{7}{19}=[0;2,1,2,2]$ as the best guess. While it’s a good method, it fails to give the lowest answer for 1% to 5% (and similarly 95% to 99%) or 35% (and 65%), and you have to keep track of the errors to know when to stop.
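Here is a sketch of that trick (my reconstruction; halves round upward, as in the inequality earlier):

```python
from fractions import Fraction
from math import floor

def lecturer_guess(N):
    """Truncate the continued fraction of N/100 ever later until the
    truncation rounds back to N percent (the Doom trick)."""
    # continued fraction of N/100 by floor-and-invert
    terms, x = [], Fraction(N, 100)
    while True:
        a = floor(x)
        terms.append(a)
        if x == a:
            break
        x = 1 / (x - a)
    # try successively longer truncations
    for k in range(1, len(terms) + 1):
        c = Fraction(terms[k - 1])
        for t in reversed(terms[:k - 1]):
            c = t + 1 / c                      # evaluate the truncation
        if floor(100 * c + Fraction(1, 2)) == N:
            return c
```

`lecturer_guess(37)` returns `Fraction(7, 19)`, but `lecturer_guess(35)` returns `Fraction(7, 20)`, missing the smaller $\frac{6}{17}$, and `lecturer_guess(1)` can do no better than $\frac{1}{100}$.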

[Edit: I’ve since discovered that the method I used was already sufficiently well-known to make it onto Wikipedia: the best rational approximation within an interval. You must also account for the fact that a rational number has two continued fractions, just like 1=0.9999… has two decimal expansions.]

*There are three secrets in the first level, so these percentages are rounded down.*

To see why stopping just before a large term is a good idea, take $x = a_0 + \frac{1}{a_1 + \frac{1}{\theta}}$, where $\theta$ is a large (possibly irrational) remainder (which will give us the large next term $a_2$). Then

$\left|x-[a_0; a_1]\right| = \frac{\left(a_1 + \frac{1}{\theta}\right)-a_1}{a_1 \left(a_1 + \frac{1}{\theta}\right)} = \frac{1}{\theta\, a_1 \left(a_1 + \frac{1}{\theta}\right)} \approx \frac{1}{\theta a_1^2},$

which is small.

Likewise, the error

$[a_0;a_1, \ldots, a_n, \theta] - [a_0;a_1, \ldots, a_n] \approx \frac{1}{a_1^2}\left([a_1;a_2, \ldots, a_n, \theta] - [a_1;a_2, \ldots, a_n]\right) \approx \frac{1}{\theta a_1^2 a_2^2 \cdots a_n^2}$

is small if $\theta$ is large. The next convergent will be even closer, but its denominator will be much, much larger.
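As a numerical sanity check, the convergent $\frac{355}{113}$ of $\pi$ stops just before the large term 292, and the standard denominator-squared version of this estimate predicts an error of about $\frac{1}{292 \cdot 113^2}$:

```python
from math import pi

actual = abs(pi - 355 / 113)       # pi = [3; 7, 15, 1, 292, ...]
predicted = 1 / (292 * 113 ** 2)   # error estimate from the huge next term
# both are about 2.7e-7
```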

## Continued fractions continued

- I think the best place to learn about continued fractions is The Princeton Companion to Mathematics. You can also construct continued fractions for functions, such as $\tan(x)$, which was apparently important in the proof that $\pi$ is irrational.
- There’s a Tricki article on finding a rational with a low denominator near a given real, noting that the method is important in Shor’s quantum factoring algorithm.
- Course notes on continued fractions. Example 3 gives the same approach as the lecturer above.
- You can play around with this continued fraction calculator. If you type in your percentage, say 0.35, and click on “show all best rational approximations”, you can hunt for the one with error less than 0.005, giving the percentages-for-sceptics answer of $6/17 \approx 35\%$.
- A faster rational approximation can be achieved by changing the algorithm slightly: use subtraction when the fractional part $x_n - \lfloor x_n \rfloor$ exceeds $0.5$.
- To see how continued fractions can fit into a number theory course, see these three lecture sketches.
- The pigeonhole principle is useful for showing how well real numbers can be approximated by fractions, one of Dirichlet’s theorems.
- There’s an excellent geometric interpretation of how well continued fractions approximate real numbers in this blog post about Roth’s theorem. See more about Ford circles on Wikipedia.
- You can also interpret the continued fraction form of rational numbers visually, by cutting rectangles into squares with side lengths given by the numerator and denominator. There’s also an interactive version.
- The leap day, February 29th, has been declared continued fractions day, because it exists in our calendar to rationally approximate the irrational orbital period of the Earth around the Sun. This is clearly the best maths day, independent of whether you write day/month or month/day (unlike pi day). On the downside, it only happens about once every four years.
- Plus maths has a good article about applications of continued fractions inside and outside of maths, including to gear ratios.
*Ford circles show the best rational approximations: see which circles a vertical line would intersect.*

## Probabilistic percentages

In the second post, on three occasions I entered six percentages into a simple programme, and the minimal number of respondents it returned happened each time to be 66 or 67 (two of the examples gave percentages to 0 decimal places, one to 1). I wanted some sort of very rough null hypothesis, to decide what numbers to expect when entering six distinct percentages.

If we pick a percentage from 1% to 99% uniformly at random, how many possible sample sizes does it eliminate? Unless we pick 50%, we eliminate a sample size of 2. At the other end, there is only one way of eliminating a sample size of 99: picking 50%. In general, there are $100 - q$ ways of eliminating $q$ as a possible number of respondents. From the 2 and 99 example, you can see elimination is not independent: we can’t have them both as possibilities. But, for simplicity, let’s suppose elimination is independent.
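This counting is easy to verify directly, rounding halves upward as in the defining inequality (names mine):

```python
from fractions import Fraction
from math import floor

def reported(p, q):
    """Whole percentage reported for p answers out of q (half rounds up)."""
    return floor(Fraction(100 * p, q) + Fraction(1, 2))

def survivors(q):
    """The percentages from 1 to 99 that do NOT eliminate sample size q."""
    return {reported(p, q) for p in range(1, q)}
```

Every $q$ from 2 to 99 has exactly $q - 1$ survivors, so $100 - q$ of the 99 possible picks eliminate it; `survivors(2)` is just `{50}`, and 50 is the one percentage missing from `survivors(99)`.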

After picking six distinct percentages, the probability that $n$ is a possibility is:

$P(n)=\frac{n-1}{99} \cdot \frac{n-2}{98} \cdots \frac{n-6}{94}$.

We see $q=2,3,4,5,6$ are certainly no longer possibilities (for these, one of the factors of $P(n)$ is zero).

The chance that $q=7$ survives is therefore $\frac{6!\,93!}{99!}$, and since everything smaller is certainly eliminated, this is also the chance that 7 is the minimum. Let’s say $m(n)$ is the chance that $n$ is the minimal number of respondents. Then $m(n)= P(n) \prod_{j=2}^{n-1} (1-P(j))$.

Let’s look for the median, the lowest value $k$ such that $\sum_{j=2}^k m(j)>0.5$. Doing the calculations, this turns out to be $k=66$. So I probably shouldn’t have come to any conclusions from seeing 66 and 67 there. In fact, I was completely wrong.

However, repeating the calculation with the necessary adjustments for 1 decimal place (99 becomes 999) gives a median of $k = 469$. The probability that the minimum is less than 67 in this “Humble pie” case is about 6 in ten million. So that result seems a lot more significant.
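Both medians drop out of a short calculation under the independence model (function names mine; plain floating point is accurate enough here):

```python
def survival_prob(n, top=99, picks=6):
    """P(n): chance sample size n survives `picks` distinct percentages
    drawn uniformly from `top` choices, assuming eliminations independent."""
    prob = 1.0
    for i in range(1, picks + 1):
        prob *= (n - i) / (top - i + 1)
    return prob

def median_minimum(top=99):
    """Lowest k with P(minimum surviving size <= k) > 1/2."""
    cdf, none_yet = 0.0, 1.0          # none_yet = prod_{j<n} (1 - P(j))
    for n in range(7, top + 2):       # P(n) = 0 for n < 7
        p = survival_prob(n, top)
        cdf += none_yet * p           # m(n) = P(n) * prod (1 - P(j))
        if cdf > 0.5:
            return n
        none_yet *= 1.0 - p

def prob_min_below(k, top=999):
    """Chance that the minimum surviving size is less than k."""
    none_yet = 1.0
    for n in range(7, k):
        none_yet *= 1.0 - survival_prob(n, top)
    return 1.0 - none_yet
```

`median_minimum(99)` returns 66, `median_minimum(999)` returns 469, and `prob_min_below(67)` comes out around $6 \times 10^{-7}$, as quoted.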

This graph shows that even with six distinct percentages as an input we should probably ignore anything less than 350, to give some sort of nine out of ten confidence level.

The second assumption I made was that the percentages are chosen uniformly. This isn’t as bad an assumption as it might at first seem, as it is almost equivalent to choosing percentages uniformly between 2% and 50% (by symmetry). However, it becomes less accurate the more numbers we choose. Again, it’s hard to do any simple analysis without this assumption, and without lots of real data, or knowing, say, how many options were available as answers, it’s difficult to justify any other distribution. The whole exercise is statistically dubious anyway.