
Tuesday, April 24, 2018

A characterization of the Shannon entropy

Lemma. Let f : \mathbb{Z}^+ \to \mathbb{R} satisfy the following two conditions:

  • f(mn) = f(m) + f(n) for all m, n \in \mathbb{Z}^+,
  • \lim_{n\to\infty} \big( f(n+1) - f(n) \big) = 0.
Then there exists c \in \mathbb{R} such that f(n) = c\log n for all n \in \mathbb{Z}^+.

Proof. Define \delta_i = \max\{ |f(n+1) - f(n)| : 2^i \leq n < 2^{i+1} \}. By the second assumption, \delta_i \to 0 as i\to\infty. Now let n \geq 2, and for each k \geq 1 write the binary expansion n^k = \sum_{j=0}^{L_k} a_j 2^j, \qquad a_j \in \{0, 1\} \quad \text{and} \quad a_{L_k} = 1. Then, using the fact that f(a_{L_k}2^{L_k}) = L_k f(2) and telescoping, \begin{align*} \left| \frac{f(n^k)}{L_k} - f(2) \right| &\leq \frac{1}{L_k} \sum_{l=0}^{L_k-1} \left| f\left( \sum_{j=l}^{L_k} a_j 2^j \right) - f\left( \sum_{j=l+1}^{L_k} a_j 2^j \right) \right| \\ &= \frac{1}{L_k} \sum_{l=0}^{L_k-1} \left| f\left( \sum_{j=l}^{L_k} a_j 2^{j-l} \right) - f\left( \sum_{j=l+1}^{L_k} a_j 2^{j-l} \right) \right| \\ &\leq \frac{1}{L_k} \sum_{l=0}^{L_k-1} \delta_{L_k - l} = \frac{1}{L_k} \sum_{i=1}^{L_k} \delta_{i}. \end{align*} Here the middle equality follows by factoring out 2^l and using f(2^l m) = l f(2) + f(m); the l-th term vanishes unless a_l = 1, in which case the two arguments are consecutive integers whose smaller member lies in [2^{L_k - l}, 2^{L_k - l + 1}), which gives the bound \delta_{L_k - l}. Since L_k = \lfloor k \log_2 n \rfloor tends to \infty as k\to\infty, this bound converges to 0 by the Stolz–Cesàro theorem (the Cesàro averages of a null sequence are null). So f(2) = \lim_{k\to\infty} \frac{f(n^k)}{L_k} = \lim_{k\to\infty} \frac{k f(n)}{\lfloor k \log_2 n \rfloor} = \frac{f(n)}{\log_2 n}, and hence f(n) = c \log n with c = f(2)/\log 2. When n = 1, we have f(1) = f(1^2) = 2f(1) and hence f(1) = 0 = c \log 1, completing the proof. ////
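As a quick numeric illustration (my own sketch, not part of the proof), the snippet below takes the canonical solution f(n) = \log_2 n and compares |f(n^k)/L_k - f(2)| with the averaged bound \frac{1}{L_k}\sum_{i=1}^{L_k}\delta_i appearing above; the base n = 10 and the values of k are arbitrary choices.

```python
import math

# Numeric illustration of the telescoping bound from the lemma's proof,
# for the canonical solution f(n) = log2(n).  (Sketch only; n = 10 and the
# values of k below are arbitrary choices.)

def delta(i):
    # delta_i = max over 2^i <= m < 2^(i+1) of |log2(m+1) - log2(m)|,
    # attained at m = 2^i, so delta_i = log2(1 + 2^(-i)).
    return math.log2(1 + 2.0 ** (-i))

f = math.log2
n = 10
for k in (1, 5, 25, 125):
    L = (n ** k).bit_length() - 1            # L_k = floor(k * log2 n)
    lhs = abs(f(n ** k) / L - f(2))          # |f(n^k)/L_k - f(2)|
    bound = sum(delta(i) for i in range(1, L + 1)) / L
    print(k, L, round(lhs, 6), round(bound, 6))   # both columns tend to 0
```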

This lemma is the key step in proving Faddeev's theorem. To state it, let \Delta^n = \{ (p_1, \cdots, p_n) \in [0, \infty)^n : p_1 + \cdots + p_n = 1 \} and let \mathtt{Prob} = \bigcup_{n=1}^{\infty} \Delta^n denote the set of all probability vectors.

Theorem. (Faddeev) [1] Let H : \mathtt{Prob} \to [0, \infty) satisfy the following properties:

  1. H is a symmetric function on each \Delta^n.
  2. H is continuous on \Delta^2.
  3. H(\frac{1}{2}, \frac{1}{2}) = 1.
  4. H(t p_1, (1-t)p_1, p_2, \cdots, p_n) = H(p_1, \cdots, p_n) + p_1 H(t, 1-t) for all t \in [0, 1].
Then H is the Shannon entropy: H(p_1, \cdots, p_n) = - \sum_{i=1}^{n} p_i \log_2 p_i.
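Before turning to the proof, here is a small numeric sanity check (my own illustration, not part of the post) that the Shannon entropy does satisfy conditions 3 and 4; the helper H, the sampling scheme, and the tolerances are arbitrary choices of mine.

```python
import math
import random

# Check that the Shannon entropy satisfies Faddeev's conditions 3 and 4
# on random samples.  (Sketch only; not a proof.)

def H(*p):
    # Shannon entropy in bits, with the convention 0 * log2(0) = 0
    return -sum(x * math.log2(x) for x in p if x > 0)

assert abs(H(0.5, 0.5) - 1.0) < 1e-12              # condition 3

random.seed(1)
for _ in range(1000):
    n = random.randint(2, 6)
    w = [random.random() for _ in range(n)]
    prob = [x / sum(w) for x in w]                 # a random point of Delta^n
    t = random.random()
    lhs = H(t * prob[0], (1 - t) * prob[0], *prob[1:])
    rhs = H(*prob) + prob[0] * H(t, 1 - t)
    assert abs(lhs - rhs) < 1e-9                   # condition 4
print("conditions 3 and 4 hold on all sampled points")
```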

Proof. We extend H onto the cone \{ \lambda p : p \in \mathtt{Prob}, \lambda > 0 \} by setting H(\lambda p) = \lambda H(p), \qquad \forall p \in \mathtt{Prob}, \lambda > 0. Then condition 4 can be rephrased as H(p_0, p_1, p_2, \cdots, p_n) = H(p_0+p_1, p_2, \cdots, p_n) + H(p_0, p_1). \tag{1} By applying \text{(1)} repeatedly, we find that \begin{align*} H(p_1, \cdots, p_n) &= \sum_{k=2}^{n} H(p_1 + \cdots + p_{k-1}, p_k). \end{align*} This shows that H is continuous on each \Delta^n as well. By the same argument, we also find that \begin{align*} &H(p_1, \cdots, p_n, q_1, \cdots, q_m) \\ &= H(p_1+\cdots+p_n, q_1, \cdots, q_m) + \sum_{k=2}^{n} H(p_1 + \cdots + p_{k-1}, p_k) \\ &= H(p_1+\cdots+p_n, q_1, \cdots, q_m) + H(p_1, \cdots, p_n), \tag{2} \end{align*} which generalizes the property \text{(1)}. Now define f(n) = H(\frac{1}{n}, \cdots, \frac{1}{n}). By applying \text{(2)}, we obtain f(mn) = f(m) + f(n). Next we show that f(n+1) - f(n) \to 0 as n\to\infty. In view of the Stolz–Cesàro theorem together with f(2^n)/2^n = n/2^n \to 0, we know that the only candidate for the limit is zero. Also, by \text{(2)} we easily find that (n+1)f(n+1) = H(n,1) + nf(n). So nf(n) is increasing and 0 \leq nf(n) \leq 2^{\lceil \log_2 n\rceil } f(2^{\lceil \log_2 n\rceil }) = 2^{\lceil \log_2 n\rceil } \lceil \log_2 n\rceil. This shows that f(n) = \mathcal{O}(\log n), and hence f(n+1) - f(n) = H\left(\frac{n}{n+1}, \frac{1}{n+1}\right) - \frac{1}{n+1}f(n) converges as n\to\infty by the continuity of H on \Delta^2 together with the bound f(n) = \mathcal{O}(\log n). So this limit must be zero, and by the lemma, f(n) = \log_2 n. Now if (\frac{a_1}{N}, \cdots, \frac{a_n}{N}) \in \Delta^n with positive integers a_k, then by \text{(2)}, \begin{align*} H\left( \frac{a_1}{N}, \cdots, \frac{a_n}{N} \right) &= f(N) - \sum_{k=1}^{n} \frac{a_k}{N} f(a_k) \\ &= - \sum_{k=1}^{n} \frac{a_k}{N} \log_2 \frac{a_k}{N} \end{align*} and hence H equals the Shannon entropy for probability vectors with rational coordinates. The general case then follows by continuity, and the proof is complete. ////
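The decomposition H(p_1, \cdots, p_n) = \sum_{k=2}^{n} H(p_1+\cdots+p_{k-1}, p_k), combined with the homogeneous extension, says that H is determined by its restriction to \Delta^2. The sketch below (my own illustration, assuming the restriction to \Delta^2 is the binary entropy) rebuilds H from that data and checks that it agrees with the Shannon entropy; the names h, H2, H_from_binary, and shannon are mine.

```python
import math
import random

# Rebuild the entropy from Delta^2 data via the telescoping decomposition
# H(p_1,...,p_n) = sum_{k=2}^{n} H(p_1+...+p_{k-1}, p_k) and the homogeneous
# extension H(lambda p) = lambda H(p).  (Illustration only.)

def h(t):
    # binary entropy h(t) = H(t, 1 - t), with 0 * log2(0) = 0
    return -sum(x * math.log2(x) for x in (t, 1 - t) if x > 0)

def H2(a, b):
    # homogeneous extension of H restricted to Delta^2 to the cone
    return (a + b) * h(a / (a + b)) if a + b > 0 else 0.0

def H_from_binary(prob):
    # telescoping sum of H(p_1+...+p_{k-1}, p_k) over k = 2, ..., n
    partial, total = prob[0], 0.0
    for pk in prob[1:]:
        total += H2(partial, pk)
        partial += pk
    return total

def shannon(prob):
    return -sum(x * math.log2(x) for x in prob if x > 0)

random.seed(2)
for _ in range(1000):
    n = random.randint(2, 8)
    w = [random.random() for _ in range(n)]
    prob = [x / sum(w) for x in w]
    assert abs(H_from_binary(prob) - shannon(prob)) < 1e-9
print("the decomposition recovers the Shannon entropy on all samples")
```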

References.

  • [1] Rényi, Alfréd. "On measures of entropy and information." In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1, 547–561. Berkeley: University of California Press, 1961.

Thursday, February 15, 2018

A proof of Burnside's lemma

Theorem. Let G be a finite group acting on a finite set X. Write \operatorname{Fix} g = \{ x \in X : g \cdot x = x\} for g \in G and \operatorname{Orb} x = \{ g \cdot x : g \in G\} for x \in X. (The latter is called the orbit of x.) If X/G denotes the set of orbits, then \begin{align*} |X/G| = \frac{1}{|G|} \sum_{g \in G} \left| \operatorname{Fix} g \right|. \end{align*}

The proof essentially counts the set G\times X in two ways. Let \gamma be a random element of G, uniformly distributed under the law \mathsf{P}. A probabilistic analogue of Lagrange's theorem is as follows:

Lemma. \gamma \cdot x has the uniform distribution over the orbit \operatorname{Orb} x for each x\in X.

Proof of Lemma. \gamma \cdot x takes values only in \operatorname{Orb} x, and for each y \in \operatorname{Orb} x there exists h \in G for which x = h \cdot y. Since each map g \mapsto hg is a bijection on G, we know that h\gamma is also uniform over G, thus has the same law as \gamma. Therefore \mathsf{P}[\gamma \cdot x = y] = \mathsf{P}[h\gamma \cdot x = h \cdot y] = \mathsf{P}[\gamma \cdot x = x], which proves that \gamma \cdot x is uniform over \operatorname{Orb} x. ////

Returning to the original proof, notice that X/G is a partition of X. So \begin{align*} \frac{1}{|G|} \sum_{g \in G} \left|\operatorname{Fix} g\right| &= \mathsf{E}[\left|\operatorname{Fix} \gamma\right|] = \mathsf{E}\left[ \sum_{x\in X} \mathbf{1}_{\{\gamma \cdot x = x\}} \right] = \sum_{x\in X} \mathsf{P}[\gamma \cdot x = x] \\ &= \sum_{O \in X/G} \sum_{x \in O} \mathsf{P}[\gamma \cdot x = x] = \sum_{O \in X/G} \underbrace{\sum_{x \in O} \frac{1}{|O|}}_{=1} = |X/G|. \end{align*}
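As a concrete illustration (my own example, not part of the original post), the sketch below checks the formula for the cyclic group of rotations acting on colorings of n beads with k colors, comparing the fixed-point average with a direct orbit count; the parameters n = 6, k = 3 and the helper act are arbitrary choices.

```python
from itertools import product

# Burnside's lemma for the cyclic group C_n acting on colorings of n beads
# with k colors by rotation.  (Illustrative sanity check only.)

n, k = 6, 3
X = list(product(range(k), repeat=n))            # all k^n colorings
G = list(range(n))                               # rotation by g positions

def act(g, x):
    # rotate the coloring tuple by g positions
    return x[g:] + x[:g]

# Left-hand side: (1/|G|) * sum over g of |Fix g|.
fix_count = sum(sum(1 for x in X if act(g, x) == x) for g in G)
lhs = fix_count // len(G)

# Right-hand side: count the orbits directly.
seen, orbits = set(), 0
for x in X:
    if x not in seen:
        orbits += 1
        seen.update(act(g, x) for g in G)

assert lhs == orbits
print(lhs, orbits)   # both equal 130 for n = 6, k = 3
```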

Saturday, February 3, 2018

Weighted Right Convex Function (WRCF) Theorem

Theorem. Let f : I \to \mathbb{R} be a function on an interval I, and suppose that there exists s \in I such that f is convex on I \cap [s,\infty). Then for any p_1,\cdots,p_n \in (0, 1] satisfying \sum_{i=1}^{n} p_i = 1, with p = \min\{p_1,\cdots,p_n\}, the following are equivalent:

  1. For any x_1, \cdots, x_n \in I such that \sum_{i=1}^{n} x_i p_i \geq s, we have f\left(\sum_{i=1}^{n} x_i p_i \right) \leq \sum_{i=1}^{n} f(x_i) p_i.
  2. For any x, y \in I satisfying x \leq s \leq y and p x + (1-p)y = s, we have p f(x) + (1-p)f(y) \geq f(s).

The following proof is an adaptation of the argument of [CVB11] in probabilistic language.

Proof. The direction 1 \Rightarrow 2 is straightforward. So we prove the converse. We need the following easy lemma:

Lemma. Let f : [a, b] \to \mathbb{R} be convex. Suppose that \mu and \nu are finite measures on [a, b] such that \mu([a, b]) = \nu([a, b]), \int_{[a,b]} x \, \mu(dx) = \int_{[a,b]} x \, \nu(dx) and \operatorname{supp}(\nu)\subseteq \{a, b\}. Then \int_{[a,b]} f(x) \, \mu(dx) \leq \int_{[a,b]} f(x) \, \nu(dx).

Indeed, the convexity of f implies that the line \ell joining (a, f(a)) and (b, f(b)) satisfies f(x) \leq \ell(x) for all x \in [a, b]. Then \int_{[a,b]} f(x) \, \mu(dx) \leq \int_{[a,b]} \ell(x) \, \mu(dx) = \int_{[a,b]} \ell(x) \, \nu(dx) = \int_{[a,b]} f(x) \, \nu(dx). The first equality follows from the first two assumptions (since \ell is affine), and the last equality follows since \nu is supported on \{a, b\} and \ell \equiv f on \{a,b\}. This completes the proof of the lemma.

Let \phi and \tilde{f} be functions on I defined by \phi(x) = \frac{s-px}{1-p}, \qquad \tilde{f}(x) = \begin{cases} f(x), & x \geq s \\ \frac{f(s) - (1-p)f(\phi(x))}{p}, & x < s \end{cases} Notice that item 2 of the theorem is equivalent to the inequality f(x) \geq \tilde{f}(x) for all x \in I. Now consider a random variable X taking values in \{x_1,\cdots,x_n\} \subseteq I with P[X = x_i] = p_i such that E[X] \geq s. We want to prove that E[f(X)] \geq f(EX). Write E[f(X)] \geq E[\tilde{f}(X)] = E\left[ \frac{f(s) - (1-p)f(\phi(X))}{p} ; X < s\right] + E[f(X); X \geq s] and notice that E[f(X) \mid X \geq s] \geq f(E[X \mid X \geq s]) by Jensen's inequality (the conditioning makes sense since E[X] \geq s forces P[X \geq s] > 0). Combining these, if we let m_+ = E[X \mid X \geq s], then it suffices to prove that \frac{P[X < s]}{p}f(s) + P[X \geq s] f(m_+) \geq \frac{1-p}{p} E[ f(\phi(X)) ; X < s] + f(E X) \tag{*} The following claim is useful for our purpose:

Claim. If j is any index such that x_j < s, then \phi(x_j) \in (s, m_+].

Indeed, \phi(x_j) > s is easy to check. To prove the upper bound, let \tilde{X} be a random variable such that P[\tilde{X} = x_i] = \begin{cases} p_i/(1-p), & i \neq j \\ (p_i - p)/(1-p), & i = j \end{cases} Then we easily check that E[\tilde{X} \mid \tilde{X} \geq s] = m_+ and hence \phi(x_j) \leq \frac{(EX) - x_j p}{1-p} = E[\tilde{X}] \leq E[\tilde{X} \mid \tilde{X} \geq s] = m_+. This completes the proof of the Claim. Therefore, if we write \mu = \frac{1-p}{p} P[\phi(X) \in \cdot \, ; X < s] + \delta_{E X}, \qquad \nu = \frac{P[X < s]}{p} \delta_s + P[X \geq s]\delta_{m_+} then by the claim above both \mu and \nu are finite measures on [s, m_+] and \text{(*)} is equivalent to the inequality \int_{[s,m_+]} f(x) \, \mu(dx) \leq \int_{[s,m_+]} f(x) \, \nu(dx). Finally, this inequality follows from the lemma above since \mu([s,m_+]) = \frac{P[X < s]}{p} + P[X \geq s] = \nu([s,m_+]) and \int_{[s,m_+]} x \, \mu(dx) = \frac{P[X < s]}{p}s + E[X ; X \geq s] = \int_{[s,m_+]} x \, \nu(dx) and \nu is supported on \{s, m_+\}. Therefore the desired claim E [f(X)] \geq f(EX) follows. ////
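As a numeric sanity check of the theorem (my own sketch, not part of [CVB11] or the proof above), the snippet below takes s = 0, weights (0.2, 0.3, 0.5), and a function that is convex on [0,\infty) but not on all of \mathbb{R}; it verifies condition 2 on a grid and spot-checks condition 1 on random points. The function, weights, ranges, and tolerances are all arbitrary choices.

```python
import random

# Numeric check of the WRCF equivalence for one right-convex example:
# f(x) = x^2 on [0, inf), f(x) = sqrt(|x|) on (-inf, 0), with s = 0.

s = 0.0
p_vec = [0.2, 0.3, 0.5]            # fixed weights p_1, ..., p_n
p = min(p_vec)                     # p = 0.2

def f(x):
    # convex branch on [s, inf), concave (non-convex) branch on (-inf, s)
    return x * x if x >= s else abs(x) ** 0.5

# Condition 2: p f(x) + (1 - p) f(y) >= f(s) whenever x <= s <= y and
# p x + (1 - p) y = s.  Check on a grid of x <= s.
for i in range(1, 1001):
    x = -10.0 * i / 1000
    y = (s - p * x) / (1 - p)
    assert p * f(x) + (1 - p) * f(y) >= f(s) - 1e-12

# Condition 1: the Jensen-type inequality whenever the weighted mean is >= s.
random.seed(0)
for _ in range(10000):
    xs = [random.uniform(-10, 10) for _ in p_vec]
    mean = sum(x * w for x, w in zip(xs, p_vec))
    if mean >= s:
        assert f(mean) <= sum(f(x) * w for x, w in zip(xs, p_vec)) + 1e-9

print("conditions 1 and 2 hold on all sampled points")
```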

References.

  • [CVB11] Cirtoaje, Vasile, and Alina Baiesu. "An extension of Jensen's discrete inequality to half convex functions." Journal of Inequalities and Applications 2011, no. 1 (2011): 101.