Generating a random variable by coin tossing

1. Coin tossing

Suppose we are given a sequence of independent fair bits {X_1,X_2,X_3,\ldots} (meaning {X_i} take {0} and {1} with probabily {1/2}) we want to produce with them a discrete random variable {Y} that takes the values {\{1,\ldots,k\}} with probabilities {p_1,\ldots,p_k > 0}. The objective, of course, is to do this using the least possible number of bits.

Let us note that if we consider {X = 0.X_1X_2X_3\ldots} in binary, this variable is uniformly distributed in {[0,1]}. The usual way to produce {Y} given a number {X} uniform in {[0,1]} is to consider the intervals {I_1 = [0,p_1), I_2 = [p_1,p_1+p_2),\ldots,I_n = [p_1+\ldots+p_{n-1},1)} and declare {Y = j} when {X\in I_j}. Of course, knowing whether {X\in I_j} or not, takes a few bits from {X_1,X_2,\ldots}. Indeed, having seen {k} digits, the only thing we know is that {X\in [X_1\ldots X_k,X_1\ldots X_k + 2^{-k}]}. We denote the number of bits needed by T.

1.1. Uniform distribution {U_N\sim (1/N,\ldots,1/N)}

The case of the uniform distribution is particularly interesting; it is indeed one of the most common distributions one might come across, but we claim that its study shows some interesting patterns too.

Experimental redundancy for $latex {j=9}&fg=000000$. Algorithm 1 is the ``wasteful algorithm'', while $latex {R(x)}&fg=000000$ is the limit period for the redundancy.

Experimental redundancy for {j=9}. Algorithm 1 is the “wasteful algorithm”, while {R(x)} is the limit period for the redundancy.

Algorithm 1. We begin with a simplified version that is slightly wasteful.

The idea is that we stop only when {[0.x_1\ldots x_k,0.x_1\ldots x_k+2^{-k}]} is completely contained in one of the intervals {[0,1/N),[1/N,2/N),\ldots ,[1-1/N,1]}.

Why is this wasteful? Well, the probability of the resulting {x=0.x_1x_2\ldots} being a dyadic number {\frac{A}{2^B}} is 0, so we could directly stop when

\displaystyle (0.x_1\ldots x_k,0.x_1\ldots x_k+2^{-k} ) \subset I_m\,,

thus neglecting the borders. This obviously improves the performance in cases like {N=2^k}, in which we end up stopping after exactly {k} steps. The analysis of the algorithm also gets more interesting as we shall see afterwards.

See the corresponding code for our first (and wasteful) algorithm. Here {I[m] = m/N = p_1+\ldots+p_m}.

x = 0
l = 0
while (true) :
  x += reveal_bit() / 2^k
  while (I[l+1] <= x) :
   l += 1
  if (x + 1/2^k < I[l+1] or I[l+1] == 1) :
    return k
  k += 1

To start let us recall that

\displaystyle \mathbb{E}[T] = \sum_{k\geq 0} \mathbb{P}(T>k)\,,

which simplifies the computation of {\mathbb{E}[T]}, because telling whether we have to continue is easy: if there is a number of the form {m/N} in {[0.X_1\ldots X_k,0.X_1\ldots X_k +1/2^k]}, then we must conclude that {T>k}. This is so because {m/N} constitutes the border between two intervals {I_{m-1} = \Big(\frac{m-1}{N},\frac{m}{N}\Big]} and {I_m = \Big(\frac{m}{N},\frac{m+1}{N}\Big]} whenever {m\in \{1,\ldots,N-1\}}.

For {2^{-k} > 1/N}, it is obviously true that {[0.x_1\ldots x_k,0.x_1\ldots x_k +1/2^k]} contains a number of the form {m/N} with {m<N}, that is not {0.x_1\ldots x_k}. Indeed, notice that {0.x_1\ldots x_k < 1 - 1/N}. If {N=2^{k}} then we observe that we may only stop if {[0.x_1\ldots x_k,0.x_1\ldots x_k +1/2^k]} is {[1-2^{-k},1]}, hence in this case {\mathbb{P}(T>k) = \frac{N-1}{2^k}}.

Assume now that {2^{-k} < 1/N}, then there is exactly one number of the form {0.x_1\ldots x_k} in the interval {\Big[\frac{m}{N}-2^{-k},\frac{m}{N}\Big)}, which is given by

\displaystyle 0.x_1\ldots x_k = \frac{1}{2^k}\left(\left\lceil \frac{2^k m}{N}\right\rceil-1\right)\,.

The set of numbers in {[0,1]} starting with those exact first {k} digits is

\displaystyle \Bigg[\frac{1}{2^k}\left(\left\lceil \frac{2^k m}{N}\right\rceil-1\right),\frac{1}{2^k}\left\lceil \frac{2^k m}{N}\right\rceil\Bigg)\,,

therefore we conclude that

\displaystyle \{x \in [0,1] : T(x)>k\} = \bigcup_{m=1}^{N-1}\Bigg[\frac{1}{2^k}\left(\left\lceil \frac{2^k m}{N}\right\rceil-1\right),\frac{1}{2^k}\left\lceil \frac{2^k m}{N}\right\rceil\Bigg]\,.

Since {2^{-k} < 1/N} each term of the union is disjoint and so

\displaystyle \mathbb{P}\left(T>k\right) = \frac{N-1}{2^k}\,,

where {\{\cdot\}} denotes the fractional part.


\displaystyle \mathbb{E}[T] = 1+\lfloor \log_2(N-1) \rfloor + \sum_{k : N \leq 2^k} \left(\frac{N-1}{2^k}\right) = 1+\lfloor \log_2(N-1) \rfloor + \frac{N-1}{2^{\lfloor \log_2 (N-1)\rfloor}} \,.

In all we deduce that the redundancy {\mathbb{E}[T]-H(Y)} satisfies

\displaystyle R(x) = 2^{x} - x + 1 - \log_2\left(1+\tfrac{1}{N-1}\right)\,,

where {x=\{\log_2 (N-1)\}}, where {\{\cdot\}} denotes the fractional part.

Algorithm 2. In our second algorithm we stop as soon as {(0.x_1\ldots x_k , 0.x_1\ldots x_k+2^{-k})\subset I_j} for some {j=1,\ldots,N}.

Experimental redundancy for Algorithm 2 with $latex {j=9}&fg=000000$.

Experimental redundancy for Algorithm 2 with {j=9}.

In this case we observe that the difference occurs when {0.x_1\ldots x_k+2^{-k} = m/N} for some {1\leq m\leq n-1}. Of course {0.x_1\ldots x_k+2^{-k}=A/2^B} for some integers {A,B>0}, with {A} odd.

This never happens when {N} is odd, therefore the expected value remains the same when {N} is odd. Now assume {N=2^\nu d} where {d} is odd. Then {m = 2^{\nu-B} dA} and since {A} and {d} are odd we must have {B \leq \nu} and {A <2^B}. Now given {0<A/2^B<1}, what is the smallest {k} for which {0.x_1\ldots x_k+2^{-k}=A/2^B} occurs? Let us remark that, since {0.x_1\ldots x_k+2^{-k}<1}, all of the digits {x_1,\ldots,x_k} cannot equal {1}, therefore {0.x_1\ldots x_k+2^{-k}} can be reduced to a number of the form {0.y_1\ldots y_k} with {k} digits. It follows then that {k = B} and that {m/N=A/2^B} occurs as a right border for {k=B-1}. Observe that {2^{B-1}<2^\nu\leq N}, hence, in fact, this happens during the phase in which {1/N<2^{-k}} and the intervals {(0.x_1\ldots x_k,0.x_1\ldots x_k+2^{-k})} are longer than the intervals {[m/N,(m+1)/N]}, therefore once we get to the phase {1/N\geq 2^{-k}}, all of the possible dyadics are already available.

Fixed {1\leq B \leq \nu} there are { 2^B - 2^{B-1} } choices for our odd {A} (if we allow for the possibility of {m/N=1}, whihc we have to discount anyway). Hence in all we can produce { 2^\nu } dyadics as a right-border.


\displaystyle \mathbb{E}[T] = 1+\lfloor \log_2(N-1) \rfloor + \sum_{k : N \leq 2^k} \left(\frac{N-2^\nu}{2^k}\right) = 1+\lfloor \log_2(N-1) \rfloor + \frac{N-2^\nu}{2^{\lfloor \log_2 (N-1)\rfloor}} \,.

In all we deduce that the redundancy {\mathbb{E}[T]-H(Y)} satisfies

\displaystyle R(x) = 2^{x} - x + 1 - \frac{2^ {\nu(N)} - 1}{N-1} \,2^x - \log_2\left(1+\tfrac{1}{N-1}\right)\,,

where {x=\{\log_2 (N-1)\}}, where {\{\cdot\}} denotes the fractional part and {\nu(N)} is the greatest {t} such that {2^t} divides {N}.

1.2. Generic distribution

Algorithm 1 for a generic {Y}. In this case we have to work with each individual right-border {P(j):=p_1+\ldots+p_j}. As long as {p_j \leq 2^{-k}}, all of the points in {x\in [P(j-1),P(j))} will produce a truncation {0.x_1\ldots x_k} that satisfies {[0.x_1\ldots x_k,0.x_1\ldots x_k+2^{-k}] \not\subset [P(j-1),P(j))} , thus we will not have stopped. This accounts for a term {p_j \left(1 + \left\lfloor \log_2\left( 1/p_j\right)\right\rfloor\right)} in {\sum_{k\geq 0}\mathbb{P}(T>k)}, for each {j\leq N} (even for {j=N}).

In particular this means that

\displaystyle H(Y) = \sum_{j=1}^N p_j \log_2\left( 1/p_j\right) \leq \sum_{j=1}^N p_j \left(1 + \left\lfloor \log_2\left( 1/p_j\right)\right\rfloor\right) \leq \mathbb{E}[T]\,.

The other cases are counted in (not necessarily disjoint from the previous count!)

\displaystyle U_k({\bf{p}}) := \bigcup_{m=1,\atop 2^k>1/p_m}^{N-1}\Bigg[\frac{1}{2^k}\left(\left\lceil 2^k P(m)\right\rceil-1\right),\frac{1}{2^k}\left\lceil 2^k P(m)\right\rceil\Bigg]\,,

which has measure

\displaystyle \mu_k({\bf{p}}) := \frac{\#\{m\in\{1,\ldots,N-1\} : 2^k > 1/p_m\}}{2^k}\,.

Strictly speaking we should count the difference

\displaystyle \left| U_k({\bf{p}}) \setminus V_k({\bf{p}}) \right|\,,\qquad V_k({\bf{p}}) := \bigcup_{m=1,\atop 2^k\leq 1/p_m}^{N}\Big[P(m-1),P(m)\Big]\,.

We prove the bound

\displaystyle \sum_k \mu_k({\bf{p}}) \leq 2\,.

Indeed, observe that

\displaystyle \sum_k \mu_k({\bf{p}}) = \sum_k \sum_{m < N : p_m > \frac{1}{2^k}}\frac{1}{2^k}

and reverse the order of summation to get

\displaystyle \sum_k \mu_k({\bf{p}}) = \sum_{m<N} \sum_{k : p_m > \frac{1}{2^k} } \frac{1}{2^k}= \sum_{m<N} \frac{1}{2^{\lfloor \log_2 (1/p_m)\rfloor}} \leq 2 \sum_m p_m = 2\,.

In general this means that

\displaystyle \mathbb{E}[T] \leq 1 + \sum_{j=1}^N p_j \lfloor \log_2(1/p_j)\rfloor + \sum_{1\leq m<N} \frac{1}{2^{\lfloor \log_2 (1/p_m)\rfloor}} \,, \ \ \ \ \ (1)

and in particular

\displaystyle \mathbb{E}[T] \leq H(Y)+ 3\,.

Lemma 1 Fix {p\in (0,1)}, then

\displaystyle p \lfloor \log_2(1/p)\rfloor + \frac{1}{2^{\lfloor \log_2(1/p) \rfloor}} \leq p\log_2(1/p) + p

Proof: Write

\displaystyle \log_2(1/p) = \lfloor \log_2(1/p) \rfloor + \epsilon\,,

where of course {\epsilon\in [0,1)}. Then

\displaystyle p \lfloor \log_2(1/p)\rfloor + \frac{1}{2^{\lfloor \log_2(1/p) \rfloor}} = p (\log_2(1/p) - \epsilon) + p 2^\epsilon

and then the inequality {2^\epsilon \leq 1 + \epsilon}, valid for {\epsilon \in [0,1]}, proves the result. \Box

Comment. To prove that {2^\epsilon \leq 1 + \epsilon} for {\epsilon \in [0,1]}, observe that the function {f(x) = 1+x-2^x} is concave and {f(0)=f(1)=0}.

The function $latex {f(\epsilon)=1+\epsilon-2^\epsilon}&fg=000000$.

The function {f(\epsilon)=1+\epsilon-2^\epsilon}.

As a corollary we get the following Theorem

Theorem 2 Let {T} be the number of bits it takes to decide to which interval {I_j} the number {X} belongs. Then we have

\displaystyle H(Y) \leq \mathbb{E}[T] \leq H(Y) + 2\,, \ \ \ \ \ (2)

where {H(Y) = \sum_{j=1}^k p_j \,\log_2(1/p_j)} is called the entropy of {Y}.

Furthermore, the {+2} in (2) is tight by our example with the uniform distribution.

Concluding remarks. I got the inspiration for this post from reading [1], chapter 5, where the optimal algorithm for producing a random variable from fair coin tosses is described. I wondered, usually when working with probabilities one uses the procedure that I explained in the post, which more or less corresponds to the so called Inversion method, so how far is it from the optimum? In this post I have proved that it is not that far from being optimal actually. In [1] it is proved that the optimal procedure satisfies the same bounds for the expected value {H(Y) \leq \cdot \leq H(Y)+2}, but it is not shown that the inequality on the RHS is tight for the optimal procedure.


1. Thomas M. Cover, Joy A. Thomas, Elements of Information Theory. Wiley Series in Telecommunications and Signal Processing, Second Edition, 2006.

Posted in Algorithms, Number Theory, Probability | Leave a comment

Grading a class

Imagine we had a class with {N} students and each week they hand in a sheet of homework. Now, you choose a set of {M} of them at random and correct only those {M}. At the end, the grade assigned to a student is the average among all of his homework sheets that you corrected. How many weeks does it take for every student to have a grade?

Comment: I came up with this question because I have implemented such a system and I wanted to choose {M} wisely as to have a grade for each student every {4} weeks.

Coupons collector problem.

It is easy to see that this problem is related to the classical coupon collector problem. The problem reads as follows: you have {N} coupons and each time you buy one (picked uniformly at random from the {N} possible ones) and the question is how long it takes you on average to get the full collection. The answer here is

\displaystyle  {\mathbb{E}}[T] = N H_N\,,

where {T} is the random variable corresponding to the number of coupons it takes you to get the full collection, and {H_N = 1+\frac{1}{2}+\frac{1}{3}+\ldots + \frac{1}{N}}.

The proof follows easily upon noticing that, once you have {i} different coupons, you have to wait on average {\left(\frac{N-i}{N}\right)^{-1}} to get a new one.

It is also not difficult to prove that, even though what we have above is the expected value, the actual value of {T} is usually not much bigger than {N\log N} (remember that we may estimate {H_N \approx \int_1^N\frac{dx}{x}=\log N}). Indeed,

\displaystyle  \Pr\left(T > k\right) \leq N \, \left(1 - \frac{1}{N}\right)^k\leq N \,\exp(-k/N)\,,

because {\left(1 - \frac{1}{N}\right)^k} is the probability of never getting (up to time {k}) a given coupon. The event {T > k} is the union of the events “not getting a given coupon”, thus we have bounded the probability of the union with the sum {\Pr(\cup_i A_i)\leq \sum_i \Pr(A_i)}.

Thus, if {k = N \log N + c\,N} we get

\displaystyle  \Pr\left(T > N \log N + c\,N \right) \leq e^{-c}\,.

Teacher’s problem. Back to our problem, if {T} represents the number of weeks it takes us to have a grade for every single student, intuitively we should have ({M>1})

\displaystyle  \mathbb{E}[T] ``\leq" \frac{N}{M} H_N \,.

If we look at the sequence of sets of {M} elements we get, and we write them all in a line in some order, we get a possible outcome for the coupons collector problem… but it is a bit better than in the coupons collector’s case, because we are guaranteed that the first {M} coupons are distinct, the second group of {M} coupons are distinct, and so on. What is better, these cases have higher probability than before, so one expects to have such an inequality (I am not claiming this constitutes a proof!). This establishes, though, the parallel between the two problems, now we expect to have nearly the same results, but with a factor of {1/M} involved.

We note that the probability of never getting a given student {i} in {k} iterations is

\displaystyle  \left(\frac{\binom{N-1}{M}}{\binom{N}{M}} \right)^k = \Big( 1 - \frac{M}{N} \Big)^k \leq \exp\left(-k\,\frac{M}{N}\right)\,.

Thus we also get

\displaystyle  \Pr(T > k) \leq N \exp\left(-k\,\frac{M}{N}\right)\,.

Considering {k = \frac{N}{M} \log N + \frac{N}{M} c} we get

Proposition 1 We have the bound

\displaystyle  \Pr\left(T > \frac{N}{M} \log N + \frac{N}{M} c\right) \leq e^{-c}\,.

This means that we never wait much more than {\frac{N}{M} \log N} weeks. However, is {\frac{N}{M} H_N} the expected value as in the case of the coupons collector problem? The answer is negative, but I claim that asymptotically this is not far from the truth.

Proposition 2 We have

\displaystyle  \lim_{N\rightarrow\infty} \Pr\left(T > \frac{N}{M} \log N + \frac{N}{M} c\right) = 1 - e^{-e^{-c}}\,.

Proof: Indeed, we note that bounding the probability of the union by the sum of the probabilities is just the first term of the inclusion-exclusion process, if we continue we get the so-called Bonferroni’s inequalities

\displaystyle   \sum_{j=1}^{2h} (-1)^{j+1}\binom{N}{j}\,\left(\frac{\binom{N-j}{M}}{\binom{N}{M}} \right)^k \leq \Pr\left(T>k\right) \leq \sum_{j=1}^{2h+1} (-1)^{j+1}\binom{N}{j}\,\left(\frac{\binom{N-j}{M}}{\binom{N}{M}} \right)^k\,, \ \ \ \ \ (1)

for all {h\geq 0}. Here we note that

\displaystyle   \frac{\binom{N-j}{M}}{\binom{N}{M}} = \left(1-\frac{M}{N}\right)\,\left(1-\frac{M}{N-1}\right)\ldots \left(1-\frac{M}{N-j+1}\right) \ \ \ \ \ (2)

whenever {j < N} (else it is {0}). So fixing any {h\geq 0} and {M}, letting {N\rightarrow \infty}

\displaystyle  \binom{N}{j} \,\left(\frac{\binom{N-j}{M}}{\binom{N}{M}} \right)^k \sim \frac{N^j}{j!} \,\exp\left(-k\,j\,\frac{M}{N}\right)\,,

for {j\leq 2h+1}, which already suggests the claimed result. To actually get it, we must make a better estimate of the remainder term. Note that, in (2) we have

\displaystyle  \left(\frac{\binom{N-j}{M}}{\binom{N}{M}} \right)^k = \left(\left(1-\frac{M}{N}\right)\,\left(1-\frac{M}{N-1}\right)\ldots \left(1-\frac{M}{N-j}\right)\right)^k

which equals, taking logs and using {\log(1-x)=-x-x^2/2-x^3/3\ldots}

\displaystyle  \exp\left(-k \left(\sum_{i=0}^j \sum_{n\geq 1} \frac{(M/(N-i))^n}{n}\right)\right)

Here we may rewrite this as

\displaystyle = \exp\left(-k \sum_{i=0}^j M/(N-i) - k \sum_{n\geq 2} \sum_{i=0}^j \frac{(M/(N-i))^n}{n}\right)

and the second term is quadratic

\displaystyle  \exp\left(-k\,j\,\frac{M}{N} + O(k/N^2)\right)\,.

Hence if {k=(N/M)\log N + \frac{N}{M}\,c + o(N)} we get

\displaystyle  \binom{N}{j} \,\left(\frac{\binom{N-j}{M}}{\binom{N}{M}} \right)^k = \frac{N^j}{j!}\left(1+O(1/N)\right) \, \exp(-k\,j\, M/N) \left(1 + O(\log N / N)\right)

so that

\displaystyle  \binom{N}{j} \,\left(\frac{\binom{N-j}{M}}{\binom{N}{M}} \right)^k= \frac{(e^{-c})^j}{j!} + O(\log N / N)

because {\exp(-k\,j\, M/N) = e^{-c\,j}/N^j = O(N^{-j})}. Thus, from (1) we get

\displaystyle  \sum_{j=1}^{2h} (-1)^{j+1} \frac{(e^{-c})^j}{j!} \leq \liminf_{N\rightarrow\infty} \Pr\left(T>k\right) \leq \limsup_{N\rightarrow\infty} 	 \Pr\left(T>k\right) \leq \sum_{j=1}^{2h+1} (-1)^{j+1}\frac{(e^{-c})^j}{j!}\,, \ \ \ \ \ (3)

but the two extremes converge to {1-e^{e^{-c}}} as {h\rightarrow\infty}, thus the result. \Box

Applying this to my class. In my case I usually have {16} students and I would like {T} to be {4} so I will want to have

\displaystyle  \frac{16}{M} \log 16 + \frac{16}{M} c = 4\,,\qquad e^{-c}=1/10\,,

in order to apply our inequality and have at most {10}% of probability leaving out some student… thus {c = \log(10) \approx 2.3} and {M \approx 20.3} (which is, of course, nonsense… because then {M>N}). The problem here is that we are making a gross estimation for such a small {N}, thus, in fact, looking more directly at

\displaystyle  \Pr(T>4) \leq N \left(1 - \frac{M}{N}\right)^4

and setting {N \left(1 - \frac{M}{N}\right)^4 = 1/10}, we get {M\approx 11.5}, i.e. {M\geq 12}. The actual minimal value of {M} is indeed {12}… by inspection, bad news for me (I was using {M=6}). In fact, to have probability roughly 1/2-1/2, we must pick {M=9}.

This proof is essentially that of [1] for the Coupon’s Collector problem.


  1. Erdős, Paul; Rényi, Alfréd (1961), On a classical problem of probability theory (PDF), Magyar Tudományos Akadémia Matematikai Kutató Intézetének Közleményei 6: 215–220.

Posted in Asymptotics, Combinatorics, Computation, Probability | Leave a comment


The purpose of this post is to discuss the Riemann-Lebesgue Lemma, why it makes sense, and some interesting applications among which we have {\int_{-\infty}^\infty \frac{\sin(x)}{x}dx = \pi}, the proof of which, in my opinion, explains a lot more clearly why it should equal {\pi}.

Recently I have grown quite fond of the Riemann-Lebesgue Lemma, the Fourier Transform and Fourier Series, through studying Tauberian Theorems in Analytic Number Theory, and Partial Differential Equations.

The Riemann-Lebesgue lemma states that

\displaystyle \lim_{n\rightarrow\infty}\int_A^B f(x)\,\sin(n\,x)dx = 0\,. \ \ \ \ \ (1)


for quite a general set of functions {f}.

In the case of continuous functions the interpretation is that {\sin(n x)} has periods of length proportional to {1/n}, in which, being {f} continuous, the variation in {f} is really small, while the integral of {\sin(nx)} in this periods is {0}. Thus we will be adding roughly {\epsilon/n} for each cycle, being {\epsilon} the change in {f} during the cycle. Adding these up leaves us with something of the order of {\epsilon}, which can be made arbitrarily small as {n\rightarrow\infty}.

Lemma 1 Let {f\colon [A,B]\rightarrow \mathbb{R}} be continuous, then we have

\displaystyle \lim_{n\rightarrow\infty}\int_A^B f(x)\,\sin(n\,x)dx = 0\,. \ \ \ \ \ (2)

Proof: Observe that {f} is uniformly continuous on the compact set {[A,B]}. Thus, given an arbitrary {\epsilon > 0} there is {\delta > 0} such that {|f(x)-f(y)|\leq \epsilon} whenever {|x-y|\leq \delta}.

Next consider {N \geq \frac{2\pi}{\delta}} so that {\frac{2\pi}{N} \leq \delta}, we partition the integration intervals into pieces of size less than {\delta}

\displaystyle \int_A^B f(x)\,\sin(N\,x)dx = \int_A^{\frac{2\pi}{N}\,\lceil \frac{N\,A}{2\pi} \rceil}\clubsuit + \sum_{k=\lceil \frac{N\,A}{2\pi} \rceil}^{\lfloor \frac{N\,B}{2\pi} \rfloor - 1} \int_{\frac{2\pi\,k}{N}}^{\frac{2\pi\,k}{N}+2\pi/N}{\clubsuit} + \int_{\frac{2\pi}{N}\,\lfloor \frac{N\,B}{2\pi} \rfloor}^B\clubsuit\,,

where the {\clubsuit} are a short-hand for {f(x)\,\sin(N\,x)dx}.

Here it is clear that the first and the last integrals tend to {0} as {N\rightarrow\infty} because

\displaystyle \left|\int_A^{\frac{2\pi}{N}\,\lceil \frac{N\,A}{2\pi} \rceil}f(x)\,\sin(N\,x)dx\right|\leq \int_A^{\frac{2\pi}{N}\,\lceil \frac{N\,A}{2\pi} \rceil} |f(x)|dx \leq \left(\frac{2\pi}{N}\,\left\lceil \frac{N\,A}{2\pi} \right\rceil-A\right)\|f\|_\infty\,,

which \rightarrow 0, and similarly for the other one. Thus, for a certain {N_\epsilon\in\mathbb{Z}_{\geq 1}} with {N_\epsilon\geq\frac{2\pi}{\delta}} we may assume that {N\geq N_\epsilon} implies that these two integrals have absolute value less than {\epsilon}.

As for the case of { \int_{\frac{2\pi\,k}{N}}^{\frac{2\pi\,k}{N}+2\pi/N}f(x)\,\sin(N\,x)dx\,, } with {k} integer, note that {\int_{\frac{2\pi\,k}{N}}^{\frac{2\pi\,k}{N}+2\pi/N}\sin(N\,x)dx = 0} (it goes through a whole cycle!) implies

\begin{aligned} \left|\int_{\frac{2\pi\,k}{N}}^{\frac{2\pi\,k}{N}+2\pi/N}f(x)\,\sin(N\,x)dx\right| &= \left|\int_{\frac{2\pi\,k}{N}}^{\frac{2\pi\,k}{N}+2\pi/N}\Big(f(x)-f\left(\frac{2\pi\,k}{N}\right)\Big)\,\sin(N\,x)dx\right| \\ &\leq \int_{\frac{2\pi\,k}{N}}^{\frac{2\pi\,k}{N}+2\pi/N}\Big|f(x)-f\left(\frac{2\pi\,k}{N}\right)\Big|dx \\ &\leq \frac{2\pi}{N}\,\epsilon\,, \end{aligned}

where the last inequality follows from the fact that the distance to the left-most point of the interval is less than {\delta} and so {\Big|f(x)-f\left(\frac{2\pi\,k}{N}\right)\Big|\leq \epsilon}.

Hence, collecting terms, we get

\displaystyle \left|\int_A^B f(x)\,\sin(N\,x)dx \right| \leq \epsilon + \left(\frac{N\,B}{2\pi} - \frac{N\,A}{2\pi}\right)\,\frac{2\pi}{N}\,\epsilon + \epsilon = (B-A+2)\,\epsilon\,,

for {N\geq N_\epsilon}. \Box

Example 1. We will prove that (assuming its existence, which may be shown by the alternating series test):

{\displaystyle {\int_{0}^\infty \frac{\sin(x)}{x}dx = \frac{\pi}{2}\,.}} \ \ \ \ \ (3)


Let us denote this integral by {I}. Consider then

\displaystyle I_N=\int_0^{2\pi\,N} \frac{\sin(x)}{x}dx\,, \qquad J_N = \int_0^{\pi\,N} \frac{\sin(x)}{x}dx\,, \ \ \ \ \ (4)

which clearly satisfy {I_N\rightarrow I} and {J_N\rightarrow I}. Setting {x = N\,u} we get

\displaystyle I_N=\int_{0}^{2\pi} \frac{\sin(N\,u)}{u}du\,,\qquad J_N = \int_0^{\pi} \frac{\sin(N\,u)}{u}du

Note that {I_N} almost fits into the Riemann-Lebesgue, but {1/u} is not continuous on {[0,2\pi]}, it tends to infinity close to {u=0}. To remedy this problem (somewhat surprisingly) we look first at

\displaystyle \frac{1}{u} - \frac{1}{\sin(u)}\,,

here we run into trouble again, but at the points {u=\pi} and {u=2\pi} instead of {u=0}.

Why insist with {\frac{1}{\sin(u)}} ? Well, in fact, it is well-known that

\displaystyle \int_{0}^{2\pi} \frac{\sin(N\,u)}{\sin(u)}du = \begin{cases} 2\pi &\text{if }N\text{ is odd}\\ 0 &\text{otherwise}\,, \end{cases}

for a proof look at {(*)} below.

To finally eliminate the discontinuities, we note that

\displaystyle f(u)=\frac{1}{\sin(u)} - \frac{1}{u} + \frac{1}{u-\pi} - \frac{1}{u-2\pi} \ \ \ \ \ (5)

can be extended continuously (by considering the limits at {0,\pi,2\pi}) to {u\in [0,2\pi]}.

The choice should not be surprising as

\displaystyle \frac{1}{\sin(u)} = \frac{1}{\sin(u)-\sin(\pi\,k)} \sim \frac{1}{(u-\pi\,k)\,\sin^\prime(\pi\,k)} = \frac{(-1)^k}{u-\pi\,k}\,,

when {u\rightarrow\pi\,k}. In fact we are eliminating the contribution from the simple poles!

We are very lucky because

\displaystyle \int_0^{2\pi}\frac{\sin(N\,u)}{u-\pi}du = \int_{-\pi}^{\pi} \frac{\sin(N\,u+N\,\pi)}{u}du = (-1)^N\int_{-\pi}^{\pi} \frac{\sin(N\,u)}{u}du = (-1)^N\,2J_N\,,


\displaystyle \int_0^{2\pi}\frac{\sin(N\,u)}{u-2\pi}du = \int_{-2\pi}^{0} \frac{\sin(N\,u+2N\,\pi)}{u}du = \int_{-2\pi}^{0} \frac{\sin(N\,u)}{u}du = I_N\,.


\displaystyle \int_0^{2\pi} f(u)\,\sin(N\,u)du = \int_{0}^{2\pi} \frac{\sin(N\,u)}{\sin(u)}du - I_N +(-1)^N\,2J_N-I_N\,,

and considering {N} odd we have

\displaystyle \int_0^{2\pi} f(u)\,\sin(N\,u)du = 2\pi - 2I_N - 2J_N \,, \ \ \ \ \ (6)

which shows {I = \pi/2} by the Riemann-Lebesgue lemma. {\square}

The Riemann-Lebesgue lemma is more general actually.

Lemma 2 Consider {g\in L^1([A,B],\mathcal{M},m)} where {\mathcal{M}} is the {\sigma}-algebra of the Lebesgue-measurable sets and {m} is the Lebesgue measure. Then

\displaystyle \lim_{n\rightarrow\infty}\int_A^B g(x)\,\sin(n\,x)dx = 0\,\, \ \ \ \ \ (7)

where {\int_A^B h(x) dx} is to be interpreted as {\int_{[A,B]} h dm}, the Lebesgue integral.

Proof: The continuous functions {C^1[A,B]} are dense in {L^1([A,B])} under the Lebesgue measure {m}.Thus it follows that, given a measurable function {g\in L^1([A,B])} we have a sequence of continuous functions {(f_n)} such that {\displaystyle\int_A^B |g(x)-f_n(x)|dx \rightarrow 0} so that, decomposing {g(x)\,\sin(N\,x)} as {(g(x)-f_n(x))\,\sin(N\,x) + f_n(x)\,\sin(N\,x)} we have

\displaystyle \left|\int_A^B g(x)\,\sin(N\,x) dx\right| \leq \int_A^B |g(x)-f_n(x)| dx + \left|\int_A^B f_n(x)\sin\left(N\,x\right)dx\,\right|\,,

from where the result follows upon making {n} and {N} sufficiently large in order. \Box

Of course, the proofs read verbatim when we change {\sin} for any bounded periodic-function {S}, since the principle of the proof stays the same.

Lemma 3 Consider {g\in L^1([A,B],\mathcal{M},m)} where {\mathcal{M}} is the {\sigma}-algebra of the Lebesgue-measurable sets and {m} is the Lebesgue measure. Let {S(x)} be a bounded periodic measurable function on {\mathbb{R}_{\geq 0}}, with period {\tau}, such that {\int_0^\tau S(x)dx=0}. Then

\displaystyle \lim_{n\rightarrow\infty}\int_A^B g(x)\,S(n\,x)dx = 0\,\, \ \ \ \ \ (8)

where {\int_A^B h(x) dx} is to be interpreted as {\int_{[A,B]} h dm}, the Lebesgue integral.

Another possible extension is the case in which we have a continuously differentiable function {g}, in such a case we may say much more about the rate of decay.

Lemma 4 Consider {g\in C^1([A,B])}. Then

\displaystyle \int_A^B g(x)\,\sin(n\,x)dx = O(1/n)\,\, \ \ \ \ \ (9)

as {n\rightarrow\infty}

Proof: Applying integration by parts

\begin{aligned} \int_A^B g(x)\,\sin(nx)dx &= \int_A^B g(x)\,\left(-\frac{\cos(nx)}{n}\right)^\prime dx \\ &= \frac{g(A)\,\cos(nA) - g(B)\,\cos(nB)}{n} + \frac{1}{n}\,\int_A^B g^\prime(x)\,\cos(nx) dx\,, \end{aligned}

and therefore

\displaystyle \left|\int_A^B g(x)\,\sin(nx)dx\right| \leq \frac{|g(A)|+|g(B)|+\|g^\prime\|_{L^1}}{n}\,, \ \ \ \ \ (10)

thus proving the result. \Box

Example 2. We will prove that

\displaystyle { \sum_{k=1}^\infty \frac{\sin(k\,x)}{k} = \frac{\pi-x}{2}\,,} \ \ \ \ \ (11)


for {x\in (0,2\pi)}, and that this convergence is uniform on {[\epsilon,2\pi-\epsilon]} for any {\epsilon>0}.

  1. First show that {\sum_{k=1}^{N-1} \cos(k\,x) = - \frac{1}{2} - \frac{\cos\left(N x\right)}{2} + \frac{\sin\left(N x\right) \cos\left(x/2\right)}{2 \,\sin(x/2)} }.
  2. Integrate it in {x} from {0} to {x} to get

    \displaystyle \sum_{k=1}^{N-1} \frac{\sin(k\,x)}{k} = -\frac{x}{2} - \frac{\sin(N\,x)}{2\,N} + \frac{1}{2}\,\int_0^x \sin\left(N y\right)\,\frac{\cos\left(y/2\right)}{\sin\left(y/2\right)}dy\,.

  3. To make the integrand of the form {\sin(N y)\,g(y)} with {g(y)} differentiable when {y \leq x < \pi}, we must remove the pole of {{\cos\left(y/2\right)}/{\sin\left(y/2\right)}} at {y=0}. Consider

    \displaystyle g(y) = \frac{\cos\left(y/2\right)}{\sin\left(y/2\right)} - \frac{2}{y}\,,

    then {g(z)} is holomorphic on {|z|<2\pi} because we removed the pole, in particular it is continuously differentiable on {[0,2\pi-\epsilon]} for any {\epsilon > 0}.

  4. Thus

    \displaystyle \int_0^x \sin(N y) g(y) dy = \int_0^x \sin\left(N y\right)\,\frac{\cos\left(y/2\right)}{\sin\left(y/2\right)}dy - 2\,\int_0^x \frac{\sin(N y)}{y}dy

    is {O(1/N)} when {x\in [0,2\pi-\epsilon]}.

  5. Observe that, by what we know from Example 1, we have

    \displaystyle \int_0^x \frac{\sin(N y)}{y}dy = \int_0^{N\,x} \frac{\sin(y)}{y}dy = \frac{\pi}{2} - \int_{N\,x}^\infty \frac{\sin(y)}{y}dy\,,

    and here {\int_{N\,x}^\infty \frac{\sin(y)}{y}dy = O(1/N)} when {x \geq \epsilon > 0}. Indeed

    \begin{aligned} \int_{N\,x}^\infty \frac{\sin(y)}{y}dy &= \int_{x}^\infty \frac{\sin(N y)}{y}dy \\ &= \int_{x}^\infty \frac{\left(-\cos(N y)/N\right)^\prime}{y}dy \,, \end{aligned}

    thus, integrating by parts, we get {\left|\displaystyle\int_{N\,x}^\infty \frac{\sin(y)}{y}dy\right| \leq \frac{1+1/\epsilon}{N} }.

Proof of {(*)}. Somewhat surprisingly, the computation of

\displaystyle I_{n,m} = \int_{0}^{2\pi} \left(\frac{\sin(nx)}{\sin(x)}\right)^mdx\,,

boils down to a matter of evaluating a particular coefficient of a polynomial!

Let us write {\sin(t) = \frac{e^{i\,t} - e^{-i\,t} }{ 2\,i}}, so that we get

\displaystyle I_{n,m} = \int_{0}^{2\pi} \left(\frac{e^{n\,ix} - e^{-n\,ix} }{e^{ix} - e^{-ix}}\right)^mdx\,.

Here observe that { \displaystyle\frac{a^n-b^n}{a-b} = a^{n-1} + a^{n-2}\,b + \ldots + a\,b^{n-2} + b^{n-1}\,,} so that

\displaystyle I_{n,m} = \int_{0}^{2\pi} \left(\sum_{j=0}^{n-1} \left(e^{ ix}\right)^j \left(e^{-ix}\right)^{n-1-j} \right)^mdx= \int_{0}^{2\pi} \left(\sum_{j=0}^{n-1} e^{(2j-(n-1))\,ix} \right)^mdx\,.

Note that we have

\displaystyle \int_0^{2\pi} e^{n\,i t } dt = \int_{0}^{2\pi} e^{n\,i t } dt = \begin{cases} 2\pi, & \text{if } n = 0\,, \\ 0, &\text{otherwise } \,, \end{cases}

and therefore, by factoring {e^{-(n-1)\,ix}} from each of the {m} factors, we get

\displaystyle I_{n,m} = 2\pi\,[z^{m\,(n-1)}]\Bigg\{ \Big(\sum_{j=0}^{n-1} z^{2j}\Big)^m\Bigg\}\,,

where the notation {[z^k]\{p(z)\}} denotes the coefficient of {z^k} in {p(z)}.

For instance, for {m=1}, we observe

\displaystyle I_{n,1} = \begin{cases} 2\pi, & \text{if }n\text{ is odd}\,, \\ 0, &\text{otherwise } \,, \end{cases}

as the term {z^{n-1}} appears in the sum {\sum_{j=0}^{n-1} z^{2j}} if and only if {(n-1)} is even.

Remark. The integral {I_{n,1}} may also be computed by induction and the use of trigonometric identities. However, I feel that the proof above is much more illuminating.

Exercise. Use the result from Example 2. to prove that {\displaystyle\sum_{k=1}^\infty \frac{1}{k^2} = \frac{\pi^2}{6}}.

Posted in Computation | Leave a comment

Higher order Newton-Raphson

In this post we shall see one way to look at the Newton–Raphson formula, and the extensions this produces. Actually, I came up with this idea a few years back, while I was studying for a Numerical Analysis exam.

Introduction. We are given a function {f\colon {\mathbb R} \rightarrow {\mathbb R}}, and we are interested in finding a root {\theta} such that {f(\theta) = 0}. Here we shall first assume that our function if at least {C^2}, that is, it has continuous second derivatives.

To generate a ‘fixed point method’, what we do is we find a convenient function {g(x)} such that {g(\theta) = \theta} exactly for the zeroes (or the zero) of {f(x)}. Then, we start at some point {x_0}, and consider the sequence {x_{n+1} := g(x_n)}. If this sequence converges {x_n\rightarrow \omega}, then {\omega} is a fixed point of {g(x)} and so a zero of {f(x)}.

The most trivial example of such a method follows from choosing a constant {\lambda \neq 0} and considering

\displaystyle  g(x) = x - \lambda\, f(x)\,,

so that {g(\theta) = \theta} if and only if {f(\theta) = 0}.

Let us assume that a sequence defined by {x_{n+1} := g(x_n)} converges to {\theta\in {\mathbb R}}, then by looking at the Taylor–expansion of {g(x)} at {\theta}, we get

\displaystyle  g(x) = \theta + (1- \lambda f^\prime (\theta) )\,(x-\theta) + O((x-\theta)^2)\,.

To accelerate the convergence of {g(x_n)} to {\theta}, we would like to cancel the linear term {(1- \lambda f^\prime (\theta) )\,(x_n-\theta)} so that the error roughly squares in each iteration. However, this seems impossible if we do not know {\theta} and {f^\prime (\theta)}.

The way to fix this is to consider {\lambda} not a constant, but a function too! A function such that {\lambda(\theta) = \displaystyle\frac{1}{f^\prime(\theta)}} is enough, because we note that if {g(x) = x - \lambda(x)\, f(x)}, then by expanding Taylor of order {2} for {f(x)} and then Taylor of order one for {\lambda(x)} :

\displaystyle  g(x) = \theta + (1- \lambda(\theta) f^\prime (\theta) )\,(x-\theta) + O((x-\theta)^2)\,.

Thus, the most obvious choice is {\lambda(x) = \frac{1}{f^\prime(x)}}, which yields

\displaystyle  g(x) = x - \frac{f(x)}{f^\prime(x)}\,,

which gives the Newton–Raphson iteration.

What is the trick here?. The trick was to choose {\lambda(x)} so that {g^\prime(\theta) = 0}. We note that, since {f(\theta) = 0}, we need not care about the derivative of {\lambda(x)} at {\theta} :

\displaystyle  g^\prime(x) = 1 - \lambda^\prime(x) f(x) - \lambda(x) f^\prime(x)\,,

so that

\displaystyle  g^\prime(\theta) = 1 - \lambda(\theta) f^\prime(\theta)\,.

What if we now want to cancel {g^\prime(\theta)} without affecting the first derivative? The real trick is to notice that, when you have {h_n(x)\,f^n (x)} and differentiate it less than {n} times, you get {0} at {x=\theta}. Now, when you differentiate {h_n(x)\,f^n (x)} exactly {n} times and evaluate the result at {x=\theta}, then all the terms vanish, except for the one in which we choose to differentiate {f^n(x)} all the {n} times. Hence

\displaystyle  D^k \left(h_n(x)\,f^n(x)\right) \Big|_{x=\theta} = \begin{cases} 0 & \text{ if }k<n \\ n!\,\left(f^\prime(\theta)\right)^n\, h_n(\theta) & \text{ if }k=n\,, \end{cases}

Thus, by considering

\displaystyle  g_n(x) = \sum_{k=0}^{n-1} h_k(x) \left(f(x)\right)^k\,,

and fixing the coefficients {h_k(x)} recursively, we can cancel all of the first {n} derivatives at {\theta}.

Indeed, we have

\displaystyle  g_1(x) = x\,,\qquad g_{n+1}(x) = g_n(x) - \frac{g_{n}^{(n)}(x)}{n!\,(f^\prime(x))^{n}} \,\left(f(x)\right)^n\,,

and here the zeroes of {f(x)} become fixed points of order {n} in {g_n(x)}. Of course, this procedure doesn’t ensure that no new zeroes have been introduced (it does introduce new zeroes sometimes).

Example. A method of order {3} to get {\sqrt{k}} follows from computing the positive root of {f(x)=x^2-k} by considering {x_{n+1}:=g(x_n)} with

\displaystyle  g(x) = x-\frac{x}{2}+\frac{3\,k}{8\,x}+\frac{{k}^{2}}{4\,{x}^{3}}-\frac{{k}^{3}}{8\,{x}^{5}}\,.

This method, however, has a few fixed points other than {\pm\sqrt{k}} :

\displaystyle \pm \rho := \pm \frac{\sqrt{\sqrt{17}-1}\,\sqrt{k}}{{2}^{\frac{3}{2}}}\,,\qquad \pm \frac{\sqrt{\sqrt{17}+1}\,\sqrt{k}}{{2}^{\frac{3}{2}}}\,i\,,

of course, the latter two are purely imaginary.

Since, when {x} is very large we have {g(x) = \frac{x}{2} + O(1/x)}, we get that the sequence {x_n} defined by iteration must be bounded, and hence it must have a subsequence that converges to one of our fixed points.

To normalize in terms of {\sqrt{k}}, we note that all fixed points are of the form {\theta\,\sqrt{k}} which suggests considering

\displaystyle  h(\theta)=\frac{g(\theta\,\sqrt{k})}{\sqrt{k}} = \frac{\theta}{2}+\frac{3}{8\,\theta}+\frac{1}{4\,{\theta}^{3}}-\frac{1}{8\,{\theta}^{5}}\,.

instead. Here what we want to see is that {h(h(\ldots)) \approx 1} as we keep on composing {h}.

What we first note is that, when {\theta \rightarrow 0^+} we have that {h(\theta)\rightarrow -\infty}, hence it is possible to get negative starting with a positive number. Let us avoid this problem by noticing that {h^\prime(\theta) > 0} for all {\theta} where it makes sense (its real zeros are {\pm 1}, both with multiplicity {2}), hence, if we let {\vartheta := \rho / \sqrt{k} =  \frac{\sqrt{\sqrt{17}-1}}{{2}^{\frac{3}{2}}}} we have {h(x) > h(\vartheta) = \vartheta} whenever {x > \vartheta}. As a matter of fact, we have that {h^\prime(\vartheta) > 5} and so {h(x) - \vartheta > 5(x-\vartheta)} when {x \approx \vartheta^+}. When {x\in (1,\infty)}, we note that {h(x) = \frac{x}{2} + O(1/x)}, and so the convexity of {h(\theta)} along with {h(1)=1}, imply

\displaystyle  h( \alpha\,1 + \beta\,x) \leq \alpha\,h(1)+\beta\,h(x) < \alpha\,1 +\beta x\,,

when {\alpha,\beta>0} satisfy {\alpha+\beta=1}, and {x} is very large. Therefore {x < h(x) } for {x > 1}, while having {x > 1} implies {h(x) > h(1)=1} (derivative is positive), and therefore if {x_0 > \sqrt{k}}, we have {x_n} decreasing and always {x_n > \sqrt{k}}. Hence {x_n \searrow \sqrt{k}}.

The function {h(x)} is concave for {x \in (-1,1)}, hence this will mean that more generally {h(x) > x} for {x\in (\vartheta,1)}, and therefore we can only have {x_n \rightarrow \sqrt{k}} when {\sqrt{k} > x_0 > \vartheta\,\sqrt{k}} (if it at some point {h} exceeds one, we are done by the previous part). Indeed, {h(\vartheta)=\vartheta} and {h(1)=1}, hence by concavity

\displaystyle  h( \alpha\,\vartheta + \beta\,1) \geq \alpha\,h(\vartheta)+\beta\,h(1) = \alpha\,\vartheta+\beta\,1\,,

whenever {\alpha,\beta>0} satisfy {\alpha+\beta=1}, and equality occurs if and only if either {\alpha} or {\beta} is {1}.

In conclusion, since we can write {\vartheta < \frac{2}{3}}, we have that

\displaystyle  \boxed{x_0 > \frac{2}{3}\,\sqrt{k}\,,\text{ and } x_{n+1}:=g(x_n) \Longrightarrow x_n\rightarrow \sqrt{k} \text{ with order }3\,,}


\displaystyle  g(x) = x-\frac{x}{2}+\frac{3\,k}{8\,x}+\frac{{k}^{2}}{4\,{x}^{3}}-\frac{{k}^{3}}{8\,{x}^{5}}\,.

Without knowing {\sqrt{k}}, we may still note that {k + 1 \geq 2\,\sqrt{k}} and hence taking {x_0 > \frac{k+1}{3}} is certainly enough. Even better, when {k} is an integer, we may write {k = 2^{s_1} + \ldots + 2^{s_n}} with {s_1 < \ldots < s_n}, then picking {x_0 = 2^{\lceil s_n/2 \rceil}} does the trick. Indeed, because

\displaystyle \left( 2^{\lceil s_n/2 \rceil}\right)^2 \geq 2^{s_n} > \frac{2^{s_n} + 2^{s_n}-1}{2} = \frac{2^{s_n}+2^{s_n-1}+\ldots +1}{2} \geq \frac{k}{2}

and so {2^{\lceil s_n/2 \rceil} \geq \displaystyle\frac{\sqrt{2}}{2}\,\sqrt{k}}, where {\displaystyle\frac{\sqrt{2}}{2} \doteq 0.707\ldots > \frac{2}{3}}.

The limits of the exponential decay can be seen in terms of the darkness of the pixels.

Complex plot of the result of applying h above exactly seven times, showing a more complicated situation. The red area corresponds to the convergence to 1.

Gamma border

Complex plot of the result of applying Newton-Raphson with k=1 exactly eleven times. In this case the only fixed points are -1 and 1.

Final comments.

Although the idea is amusing, it is fair to say that in terms of complexity we do not gain that much with respect to the usual Newton-Raphson. As a matter of fact, if we are given a method of order 3, and doing two iterations of Neton-Raphson were much less costly than doing one of our method of order 3, we are being wasteful (the error is “raised” to the power of 4 every two iterations of Newton-Raphson).

Posted in Algorithms, Computation | Tagged , | Leave a comment

The fast decay of the gamma function

The purpose of this post is to discuss the fast decay of the gamma function {\Gamma(z)} when {|z|\rightarrow \infty} along the imaginary axis. This is an interesting topic, as it is not usually discussed, we are generally more interested in the increase of the gamma function on the positive real axis, because there we have {\Gamma(n+1) = n!}, and we thus get the asymptotic behaviour of the factorials.

As it is well-known (see [1]), we have a generalised Stirling’s formula

\displaystyle   \log\Gamma(z) = \left(z - \tfrac{1}{2}\right)\,\log(z) - z + \tfrac{1}{2}\,\log\left(2\,\pi\right) + o(1)\,, \ \ \ \ \ (1)

as {|z|\rightarrow \infty} over a slit-domain with {|\textsf{arg}(z)|<\pi-\delta} for some {\delta > 0}. Here {\log(z)} denotes the principal part of the logarithm of {z}, while {\log \Gamma(z)} need not be the principal part of the logarithm of {\Gamma(z)}. It must coincide with the principal part of the logarithm, however, up to an integer multiple of {2\pi i}, and so, in particular, we have equality for the real parts of both sides of (1).

The behaviour of the gamma function {\Gamma(z)} is illustrated in the following complex plot. For each point, the intensity of the light indicates the absolute value of the resulting value {\Gamma(z)} (thus, black means {0} while white means {\infty}), and the colour indicates the phase angle {\textsf{arg}(\Gamma(z))}. What is striking about the picture is that the gamma function seems to decay very fast to almost being {0} when the imaginary part is larger that the real one.

Complex plot of the gamma function

Complex plot of {\Gamma(z)} on {[-10.5,10.5]\times [-10.5,10.5]}.

Next we will be interested in {\log |\Gamma(z)|}, and therefore we will only be interested in the real part of the RHS of (1), because for all {w\in {\mathbb{C}}} we have {\Re(\log(w)) = \log |w|}. Thus, we will compute the real part of the RHS of (1).

Let us write {z = s + i\, t} with {s > 0}, since the case {s < 0} follows then simply from {\Gamma(z+1) = z\,\Gamma(z)} applied repeatedly. By definition we have {\log(z) = \log |z| + i\, \textsf{arg}(z)} and here, since {s > 0}, we have { \textsf{arg}(z) = \arctan\left(\frac{t}{s}\right) }, which of course should tend to {\frac{\pi}{2}\, \textsf{sgn}(t)} as {|t|\rightarrow \infty} with {s = o(t)}. We will need a little more precision here to get the asymptotic estimates for {\Gamma(z)}, and so we will need the following lemma.

Lemma 1 Let {t\in \mathbb{R}\setminus\{0\}}, we have

\displaystyle  \arctan(t) + \arctan(1/t) = \frac{\pi}{2}\,\textsf{sgn}(t)\,, \ \ \ \ \ (2)

where {\textsf{sgn}(t)} denotes the sign of {t}.

Proof: Since {\arctan} is an odd function, it is enough to prove the result for {t > 0}. Let us differentiate the LHS of (1), we get

\displaystyle  \frac{1}{1+t^2} - \frac{1}{t^2} \,\frac{1}{1 + (1/t)^2} = 0\,,

and so the function {\arctan(t) + \arctan(1/t)} is constant in {\mathbb{R}_{>0}} and the value of this constant follows by taking {t\rightarrow \infty}. \Box

The consequence of this is, since {s > 0}, that

\displaystyle  \arctan\left(\frac{t}{s}\right) = \frac{\pi}{2}\,\textsf{sgn}\left(t\right) - \arctan\left(\frac{s}{t}\right)\,,

and here when {s = o(t)} we may use the Taylor expansion of {\arctan(x) = x - \frac{x^3}{3} \pm \ldots} around {x = 0} to get

\displaystyle  \arctan\left(\frac{t}{s}\right) = \frac{\pi}{2}\,\textsf{sgn}\left(t\right) - \frac{s}{t} + O\left(\left(\frac{s}{t}\right)^3\right)\,,

as {|t|\rightarrow \infty} with {s= o(t)} .

Now, after a little bit of algebra, we get

\displaystyle  \left(z - \tfrac{1}{2}\right)\,\log(z) = \left(s - \tfrac{1}{2}\right)\,\log |z| - \frac{\pi}{2}\, t\,\textsf{sgn}\left(t\right) + s - O(s^3 / t^2) + i\, \left(\textsf{imaginary part}\right)\,,

when {s = o(t)}. For {\log|z|}, we observe that

\displaystyle  \log |z| = \log\,|t| + \log\left(\sqrt{1 + (s/t)^2}\right) = \log\,|t| + O((s/t)^2)\,,

and so, if {s = o(t^{2/3})}, we get

\displaystyle  \left(z - \tfrac{1}{2}\right)\,\log(z) = \left(s - \tfrac{1}{2}\right)\,\log |t| - \frac{\pi}{2}\, t\,\textsf{sgn}\left(t\right) + s + o(1) + i\, \left(\textsf{imaginary part}\right)\,.

And, subtracting {z} and adding {\frac{1}{2}\,\log(2\pi)} (and another {o(1)} term) we finally get

\displaystyle  \log |\Gamma(z)| = - \frac{\pi}{2}\, t\,\textsf{sgn}\left(t\right) + \left(s - \tfrac{1}{2}\right)\,\log |t| + \frac{1}{2}\,\log(2\pi) + o(1)\, .

As a little simplification, note that {t\,\textsf{sgn}\left(t\right) = |t|} and so we have

\displaystyle   \boxed{\log |\Gamma(z)| = - \frac{\pi}{2}\, |t| + \left(s - \tfrac{1}{2}\right)\,\log |t| + \frac{1}{2}\,\log(2\pi) + o(1)\,\,,} \ \ \ \ \ (3)

whenever {|z|\rightarrow \infty} with {z = s+i\,t}, {s = o(t^{2/3})} and {s > 0}.

Remark 1 If we redo the algebra more carefully, we can check that

\displaystyle  \log |\Gamma(s+i\,t)| = - \frac{\pi}{2}\, |t| + \left(s - \tfrac{1}{2}\right)\,\log |t| + s\,g(s/t) - \frac{1}{4}\,\log\left(1+(s/t)^2\right) + \frac{1}{2}\,\log(2\pi) + o(1)\,\,, \ \ \ \ \ (4)

as {|t|\rightarrow \infty} with {0\leq s\leq |t|}, here we define {g(x) := \sum_{j=1}^{\infty} \frac{(-1)^{j+1}}{2j(2j+1)} x^{2j}}. Note that the threshold for fast decay is {s = o(t/\log(t))}.

Remark 2 A bit more precisely, the threshold for fast decay is approximated by the equation {s\,\log |t| = \frac{\pi}{2}\,|t|}, if we disregard the linear terms.

Indeed, on the one hand, when {s\,\log |t| = \frac{\pi}{2}\,|t|} we get

\displaystyle  \log |\Gamma(s+i\,t)| = \frac{\pi\,|t|}{2\,\log |t|} g\left(\frac{\pi}{2\,\log |t|}\right) + O(\log|t|) \sim \frac{\pi^3}{48} \times \frac{|t|}{\left(\log |t|\right)^3}\,,

as {|t|\rightarrow\infty}, which tends to {\infty} still. And, on the other hand, if {s\,\log |t| = (\frac{\pi}{2} - \epsilon)\,|t|} with {\epsilon > 0}, we get

\displaystyle  \log |\Gamma(s+i\,t)| \sim -\epsilon \times |t|\,,

which tends to {-\infty}, and so {\Gamma(s+i\,t)\rightarrow 0}.

The limits of the exponential decay can be seen in terms of the darkness of the pixels.

The limits of the exponential decay can be seen in terms of the darkness of the pixels.

Gamma border

The orange line corresponds to the solutions of the equation {x\,\log(y)=\frac{\pi}{2}\,y}.


  1. N.G. de Bruijn, Asymptotic methods in analysis. Dover Publications, United States, 3rd edition, 1981.
  2. Sage was used to produce the pictures. See here and here.

Posted in Asymptotics, Computation | Leave a comment

The Fibonacci Word

Let us define {\sigma \colon \{ 0,1 \} \rightarrow \{ 0,1 \}^*} by {\sigma(0)=01} and {\sigma(1) = 0}, and extend it to a homomorphism of {\{ 0,1 \}^*} by {\sigma(x_1\ldots x_n) = \sigma(x_1)\ldots \sigma(x_n)}. Let us note that

\sigma^0(0) = 0
\sigma^1(0) = 01
\sigma^2(0) = 010
\sigma^3(0) = 01001
\sigma^4(0) = 01001010

Something very interesting is going on, each {\sigma^k(0)} seems to be a prefix of {\sigma^{k+1}(0)}. We will see that this is indeed the case.

Let us write {\omega_i = \sigma^i(0)}. It is clear that {\omega_2 = \omega_1\omega_0}, and so {\omega_3 = \sigma(\omega_1)\sigma(\omega_0) = \omega_2\omega_1} and so on {\ldots} {\omega_{n+1} = \omega_n\omega_{n-1}} for {n\geq 1}. Then we have

\omega_{n+1} = \omega_n\omega_{n-1}= \omega_{n-1}\omega_{n-2}\omega_{n-1}= \omega_{n-2}\omega_{n-3} \omega_{n-2}\omega_{n-1} = \ldots= \omega_1 \omega_0 \omega_1\omega_2\ldots \omega_{n-1}

This implies that the digits converge pointwise to those of

\displaystyle \omega_\infty = \omega_1\omega_0\omega_1\omega_2\omega_3\ldots\,,

and we shall call {\omega_\infty} the Fibonacci word.

It should be noted that the origin of the name is motivated by the fact that {\omega_{n+1} = \omega_n\omega_{n-1}}, which mimicks the definition of the Fibonacci numbers {f_{n+1} = f_n + f_{n-1}}.

The Fibonacci word is quite special. For starters, each subword (a finite and consecutive subsequence of characters) of {\omega_\infty} appears infinitely many times. A sequence satisfying this property is said to be recurrent.

It is clear that {\omega_\infty} is recurrent, indeed because if {w} is a finite subword, then {w} is a subword of {\omega_n} for some {n} and then, since {\omega_n} is a subword of {\omega_m} for each {m\geq n}.

The Fibonacci word is also what we call a balanced word.

Definition 1 (Balanced word) A binary word {w} is said to be balanced if for every {n\geq 0} and for pair of subwords (a consecutive sublock of {w}) {U} and {V} of length {n}, we have that the number of ones in {U} and {V} differs in at most {1}.

If we denote by {|u|_1} the number of ones in {u}, then the latter part of the definition can be rewritten as {\left| |U|_1-|V|_1\right| \leq 1}.

Lemma 2 If a binary sequence {u} is not balanced (be it finite or not), then it contains two subwords {0W0} and {1W1} for some {W\in \{ 0,1 \}^*}.

Proof: Let {U} and {V} satisfy {|U|_1-|V|_1\geq 2} and have length as short as possible. It is clear then that {U} and {V} must differ in their borders (else we can eliminate the part that is equal and still get the same difference {|U|_1-|V|_1}). Clearly then {U = 1W1} and {V=0{\tilde{W}}0}, since the cross cases {U=1W0}, {V=0{\tilde{W}}1} and {U=0W1},{V=1{\tilde{W}}0} are not possible (because {|U|_1-|V|_1=|W|_1-|\tilde{W}|_1} and the latter are shorter), while having {U=0W0}, {V=1\tilde{W}1} implies {|W|_1-|\tilde{W}|_1 = 2 + |U|_1 - |V|_1\geq 2}. Now we claim {W = \tilde{W}}. Assume otherwise, then there is a first point where they differ, that is {W = x \alpha y}, {\tilde{W} = x \tilde{\alpha} y^\prime} where {\alpha\in \{ 0,1 \}}, {x,y,y^\prime\in \{ 0,1 \}^*}. Note then that {\alpha} can’t be {1} because otherwise we have {1x1} and {0x0} as subwords, {|1x1|_1-|0x0|_1=2} and these are shorter. Hence {\alpha = 0} and {W = x 0 y}, {\tilde{W} = x 1 y^\prime}. But then we have {|U|_1-|V|_1 = |y1|_1-|y^\prime 0|_1} which again leads to an absurd. \Box

We prove {\omega_\infty} is balanced. Suppose otherwise, then there would be subwords {0W0} and {1W1} within {\omega_m} for {m} large enough. Since {\omega_\infty} is recurrent, we may further assume that the two subwords {0W0} and {1W1} we will be working with are not at the beginning of {\omega_\infty}. Clearly {W} is not empty as {11} can’t appear in {\omega_\infty} by definition of {\sigma}. The word {W} can only start with a {0} and end with a {0}, since there can’t be two consecutive {1}s (else {1W1} would not be a subword). Thus {W = 0W^\prime 0} and so {00W^\prime00} and {10W^\prime 01} are subwords of {\omega_m}. Now {\sigma} is injective from {\{ 0,1 \}^*} to {\{ 0,1 \}^*} and we observe that

  • A {1} within the sequence implies that the element right to its left is a {0}, and that we may extract the preimage of this pair to be {0}.
  • A {00} indicates that the first {0} must come from {\sigma(1)} as otherwise it would be followed by a {1}.

In conclusion, we know that the character to the left of {10W^\prime 01} is a {0} and then {\omega_m} contains {010W^\prime01} which has {0\sigma^{-1}(0W^\prime)0} as its preimage and this is a subword of {\sigma^{-1}(\omega_m)=\omega_{m-1}}. Similarly, we note that having {00 W^\prime 00} in {\omega_m} implies having {1\sigma^{-1}(0W^\prime)1} in {\omega_{m-1}}. So both {0\sigma^{-1}(0W^\prime)0} and {1\sigma^{-1}(0W^\prime)1} appear as subwords of {\omega_{m-1}} and {\sigma_{m-1}} is not balanced too. This idea follows inductively until we get {m} small enough for it to be obvious that {\omega_m} is balanced and we have reached an absurd.

Remark 1 (The Fibonacci word is not eventually periodic) Observe that the number of ones in {\omega_m} is {f_m} while its length is {l(\omega_m) = f_{m+2}}. This follows by induction and the fact that {\omega_{m+1} = \omega_m\omega_{m-1}} for {m\geq 1}.

Since f_m \sim \frac{1}{\sqrt{5}}\,\phi^m where \phi = \frac{1+\sqrt{5}}{2} is the Golden Ratio, it follows that the density of ones in {\omega_m} tends to {\phi^{-2}} as {m\rightarrow \infty}. Since {\phi^{-2} = \frac{1}{1+\phi}} is not rational, this proves that {\omega_\infty} is not eventually periodic (otherwise the limit would be {|u|_1 / l(u)} being {u} the period, and thus rational).

Remark 2 (The Fibonacci word is Sturmian) Since {\omega_\infty} is both balanced and not–eventually periodic, it follows that it is Sturmian (see [1]). In particular, this means that the number of subwords of length {n} in {\omega_\infty} is exactly {n+1} for each {n\geq 0}.

Posted in Combinatorics | Leave a comment

Partitions and divisors

The number of partitions of a non-negative integer {n} is, roughly speaking, the number of ways in which we can write {n} as a sum of positive integers disregarding the order of the terms, and we denote this number by {p(n)}.

Formally speaking then, the fact that we do not care about the order of the terms translates into the fact that we may choose them to be ordered increasingly as follows { P_n := \Big\{(a_1,\ldots,a_k) : k\in {\mathbb{Z}}_{\geq 0}\text{ and }a_i\in {\mathbb{Z}}_{>0}\text{ such that } a_1\leq \ldots \leq a_k\text{ and } \sum_i a_i = n\Big\}\,,}
this is the set of partitions of {n}, and we then define {p(n)} to be its cardinal (number of elements).

A few examples

{n} {p(n)} {n} {p(n)}
{1} {1} {6} {11}
{2} {2} {7} {15}
{3} {3} {8} {22}
{4} {5} {9} {30}
{5} {7} {10} {42}

And here we show the partitions of n=5, the heights of the columns correspond to the sizes of the integers involved in the partition ( listed in decreasing order):




The objective of this post is to have a combinatorial proof of the following identity.

Identity 1. Given {n\in {\mathbb{Z}}_{>0}} we have

\displaystyle  n\,p(n) = \sum_{j=1}^n \sigma(j)\,p(n-j)\,,

where {\sigma(n) = \sum_{d|n}d} is the sum of the positive divisors of {n}.

Note. Before going into the proof, we note that it is not immediately obvious why p(n) should be related to a number-theoretic quantity such as \sigma(j), which corresponds to a multiplicative property (divisibility) rather than an additive one.

Proof: To prove it, we observe that the left-hand-side is the cardinal of {[n]\times P_n} where {[n]:= \{1,\ldots,n\}}, and we will now find a way to count the same thing, but that produces the right-hand-side in the process.

We start the other way around, given a partition of {n}, for which we write {(a_1,\ldots,a_k)}, we are going to exploit the fact that {a_1+\ldots+a_k=n} to index the integer in {[n]}.

The idea is, if we want to index {i\in [n]}, there is exactly one {s\in [k]} such that

\displaystyle a_1+\ldots+a_{s-1} < i \leq a_1+\ldots+a_s\,,

where the empty sum is interpreted as being {0}.

Assume {a_s = d}, then to fully determine {i}, in addition to {s}, we have to indicate a number {j} from {\{1,\ldots,d\}}, so that

\displaystyle  n = \left(a_1+\ldots+a_{s-1}\right) + j\,.

Now, the way in which we indicate {s} and {j} is as follows. Assume we had

\displaystyle  a_{s-t}<a_{s-t+1}=\ldots=a_s=\ldots=a_{s+r}<a_{s+r+1}\,.

First we record first the size {d} of the summand, and then which one of the indexes {\{s-t+1,\ldots,s+r\}} we are referring to. To indicate {s}, this is done by considering {d\times t} where {t} is the index of {a_s} within {a_{s-t+1},\ldots,a_s,\ldots,a_{s+r}}.

Thus, once we indicate {d}, a multiple {d\times t} of {d} and a number {j\in[d]}, we have actually fully determined {i\in [n]}. In how many partitions {(a_1,\ldots,a_k)}, with {k\geq 0}, can we indicate {i} in this exact same manner? Clearly {p(n-d\cdot t)}, because, once we have fixed at least {t} terms equal to {d}, the rest can be filled up as we please.

It follows then that

\displaystyle  \big|[n]\times P_n\big| = \sum_{d=1}^n \sum_{t=1}^{\lfloor n/d \rfloor} \sum_{j=1}^d p(n-d\,t) = \sum_{d=1}^n d\, \sum_{t=1}^{\lfloor n/d \rfloor} p(n-d\,t)\,.

Now, this can be rewritten as

\displaystyle  \sum_{d=1}^n d\, \sum_{t=1}^{\lfloor n/d \rfloor} p(n-d\,t) = \sum_{d=1}^n \sum_{d\,t\leq n}d\,p(n-d\,t) \underbrace{=}_{k=d\,t} \sum_{k=1}^n \left(\sum_{d|k} d\right)\,p(n-k)\,,

which is simply {\sum_{k=1}^n \sigma(k)\,p(n-k)}. \Box

This idea doesn’t stop here, try the following exercise.

Exercise 1 A rooted tree is an unlabeled tree with a distinguished vertex known as `the root’. If {t(n)} denotes the number of rooted trees with {n} vertices, prove that we have

\displaystyle  n\,t(n+1) = \sum_{j=1}^{n} \psi(j)\,t(n-j+1)\,,

where {\psi(j) := \sum_{d|j} d\cdot t(d)}.

Remark. Finally, it is fair to remark that these identities can be proved by means of generating functions. Indeed, we have the OGF

\displaystyle\sum_{k=0}^{\infty}{p(k)\, x^k} = \displaystyle\prod_{k = 1}^{\infty}{\left(\displaystyle\frac{1}{1-x^k}\right)}

and then rest is similar to the proof by generating functions in this post.

Posted in Combinatorics, Number Theory | Tagged , | Leave a comment