
CSE5313 Coding and information theory for data science (Lecture 17)

Shannon’s Coding Theorem

Shannon’s coding theorem: For a discrete memoryless channel with capacity $C$, every rate $R < C = \max_{p(x)} I(X;Y)$ is achievable, where the maximum is taken over input distributions $p(x)$.

Computing Channel Capacity

$X$: channel input (per channel use), $Y$: channel output (per channel use).

Let the rate of the code be $\frac{\log_F |C|}{n}$ (or $\frac{k}{n}$ if the code is linear).

The Binary Erasure Channel (BEC): the analog of the BSC, but bits are erased (lost) rather than corrupted.

Let $\alpha$ be the erasure probability, i.e., the probability that a transmitted bit is erased.

Corollary: The capacity of the BEC is $C = 1 - \alpha$.

Proof

$$
\begin{aligned}
C &= \max_{p(x)} I(X;Y)\\
&= \max_{p(x)} \big(H(Y) - H(Y\mid X)\big)\\
&= \max_{p(x)} H(Y) - H(\alpha)
\end{aligned}
$$

since $H(Y\mid X) = H(\alpha)$ regardless of the input distribution (given $X$, the output is $X$ with probability $1-\alpha$ and $*$ with probability $\alpha$).

Denote $\Pr(X=1) \coloneqq p$.

$$\Pr(Y=0) = \Pr(X=0)\Pr(\text{no erasure}) = (1-p)(1-\alpha)$$

$$\Pr(Y=1) = \Pr(X=1)\Pr(\text{no erasure}) = p(1-\alpha)$$

$$\Pr(Y=*) = \alpha$$

So,

$$
\begin{aligned}
H(Y) &= H\big((1-p)(1-\alpha),\, p(1-\alpha),\, \alpha\big)\\
&= -(1-p)(1-\alpha)\log_2\big((1-p)(1-\alpha)\big) - p(1-\alpha)\log_2\big(p(1-\alpha)\big) - \alpha\log_2\alpha\\
&= H(\alpha) + (1-\alpha)H(p)
\end{aligned}
$$

So $I(X;Y) = H(Y) - H(Y\mid X) = H(\alpha) + (1-\alpha)H(p) - H(\alpha) = (1-\alpha)H(p)$.

So $C = \max_{p(x)} I(X;Y) = \max_{p\in[0,1]} (1-\alpha)H(p) = 1-\alpha$, attained at $p = \tfrac{1}{2}$.

So the capacity of the BEC is $C = 1 - \alpha$.
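
As a sanity check (not part of the lecture), here is a minimal Python sketch, assuming NumPy, that computes $I(X;Y) = H(Y) - H(Y\mid X)$ for the BEC directly from the output distribution and confirms that the maximum over $p$ is $1-\alpha$, attained at $p = 1/2$. The choice $\alpha = 0.3$ is arbitrary.

```python
import numpy as np

def entropy(dist):
    """Shannon entropy in bits; ignores zero-probability outcomes."""
    dist = np.asarray(dist, dtype=float)
    dist = dist[dist > 0]
    return float(-np.sum(dist * np.log2(dist)))

def bec_mutual_information(p, alpha):
    """I(X;Y) = H(Y) - H(Y|X) for input Pr(X=1)=p and erasure probability alpha.
    Output alphabet: 0, 1, * (erasure)."""
    p_y = [(1 - p) * (1 - alpha), p * (1 - alpha), alpha]   # Pr(Y=0), Pr(Y=1), Pr(Y=*)
    # Given X=x, Y equals x with prob. 1-alpha and * with prob. alpha, so H(Y|X) = H(alpha).
    h_y_given_x = entropy([1 - alpha, alpha])
    return entropy(p_y) - h_y_given_x

alpha = 0.3
ps = np.linspace(0, 1, 1001)
mi = [bec_mutual_information(p, alpha) for p in ps]
print("max_p I(X;Y):", max(mi))                  # ~0.7 = 1 - alpha
print("maximizing p:", ps[int(np.argmax(mi))])   # ~0.5
```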

General interpretation of capacity

Recall $I(X;Y) = H(Y) - H(Y\mid X) = H(X) - H(X\mid Y)$.

Edge cases:

  • If $H(X\mid Y) = 0$, then the output $Y$ reveals all information about the input $X$.
    • A rate of $R = I(X;Y) = H(X)$ is possible (the same rate as in lossless compression).
  • If $H(X\mid Y) = H(X)$, then $Y$ reveals no information about $X$.
    • $R = I(X;Y) = 0$: no information is transferred.
Note

Compression is transmission without noise.

Side notes for Cryptography

Goal: Quantify the amount of information that is leaked to the eavesdropper.

  • Let:
    • $M$ be the message distribution.
    • $Z$ be the ciphertext distribution.
  • How much information is leaked about $m$ to the eavesdropper (who sees $\operatorname{Enc}(m)$)?
  • Idea: the one-time pad.

One-time pad

Let the message space be $\mathcal{M} = \{0,1\}^n$.

Suppose the sender and receiver agree on a shared key $k \sim U = \operatorname{Uniform}\{0,1\}^n$.

Let $\operatorname{Enc}(m) = m \oplus k$.

Measure the information leaked to the eavesdropper (who sees $\operatorname{Enc}(m)$).

That is, compute $I(M;Z) = H(Z) - H(Z\mid M)$.

Proof

Recall that $Z = M \oplus U$.

So

$$
\begin{aligned}
\Pr(Z=z) &= \Pr(M\oplus U = z)\\
&= \sum_{m\in\mathcal{M}} \Pr(M=m)\Pr(M\oplus U = z \mid M=m) && \text{(law of total probability)}\\
&= \sum_{m\in\mathcal{M}} \Pr(M=m)\Pr(U = m\oplus z)\\
&= \frac{1}{2^n}\sum_{m\in\mathcal{M}} \Pr(M=m) && \text{(the key $U$ is uniformly distributed)}\\
&= \frac{1}{2^n}
\end{aligned}
$$

So $Z$ is uniformly distributed over $\{0,1\}^n$.

So $H(Z) = \log_2 2^n = n$.

$H(Z\mid M) = H(U) = n$ because $U$ is uniformly distributed over $\{0,1\}^n$.

  • Notice $m \oplus \{0,1\}^n = \{0,1\}^n$ for every $m\in\mathcal{M}$ (since $(\{0,1\}^n, \oplus)$ is a group).
  • For every $z\in\{0,1\}^n$, $\Pr(m\oplus U = z) = \Pr(U = z\oplus m) = 2^{-n}$.

So $I(M;Z) = H(Z) - H(Z\mid M) = n - n = 0$.
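
A small numerical check of this argument (not from the lecture), assuming NumPy: for $n = 3$ and an arbitrarily chosen, highly skewed message distribution, $Z = M \oplus U$ is exactly uniform, so $H(Z) = n$ and $I(M;Z) = 0$.

```python
import itertools
import numpy as np

n = 3
messages = list(itertools.product([0, 1], repeat=n))

# An arbitrary, highly non-uniform message distribution (an assumption for the demo).
pm = np.array([0.5, 0.2, 0.1, 0.05, 0.05, 0.04, 0.03, 0.03])

def entropy(dist):
    """Shannon entropy in bits, ignoring zero-probability outcomes."""
    dist = np.asarray(dist, dtype=float)
    dist = dist[dist > 0]
    return float(-np.sum(dist * np.log2(dist)))

# Joint distribution of (M, Z), where Z = M xor U and the key U is uniform.
joint = np.zeros((2 ** n, 2 ** n))
for i, m in enumerate(messages):
    for u in messages:
        z = tuple(a ^ b for a, b in zip(m, u))
        joint[i, messages.index(z)] += pm[i] / 2 ** n

pz = joint.sum(axis=0)                 # marginal distribution of the ciphertext
print("Pr(Z=z):", pz)                  # all entries equal 1/8
print("H(Z)   =", entropy(pz))         # 3.0 bits
print("I(M;Z) =", entropy(pm) + entropy(pz) - entropy(joint.flatten()))  # ~0.0
```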

Discussion of information-theoretic privacy

What does $I(Z;M) = 0$ mean?

  • No information is leaked to the eavesdropper.
  • Regardless of the values of $M$ and $U$.
  • Regardless of any computational power.

Information Theoretic privacy:

  • Guarantees given in terms of mutual information.
  • Remains private in the face of any computational power.
  • Remains private forever.

This is a very strong form of privacy.

Note

Mutual information is an average metric for privacy; it gives no guarantee for any individual message.

The one-time pad is, so far, the only known perfectly private scheme.

The asymptotic equipartition property (AEP) and data compression

Note

This section will help us understand the limits of data compression.

The asymptotic equipartition property

Idea: consider the space of all possible sequences produced by a random source, and focus on the “typical” ones.

  • Asymptotic in the sense that many of the results focus on the regime of large source sequences.
  • Fundamental to the concept of typical sets used in data compression.
  • The analog of the law of large numbers (LLN) in information theory.
    • Direct consequence of the weak law.

The law of large numbers

The average of outcomes obtained from a large number of trials is close to the expected value of the random variable.

For independent and identically distributed (i.i.d.) random variables $X_1, X_2, \cdots, X_n$ with expected value $\mathbb{E}[X_i] = \mu$, the sample average

$$\overline{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$$

converges to the expected value μ\mu as nn goes to infinity.

The weak law of large numbers

The sample average converges in probability to the expected value $\mu$:

$$\overline{X}_n \xrightarrow{p} \mu \text{ as } n\to\infty$$

That is,

$$\lim_{n\to\infty} \Pr\left(\left|\overline{X}_n - \mu\right| < \epsilon\right) = 1$$

for any positive $\epsilon$.

Intuitively, for any nonzero margin $\epsilon$, no matter how small, with a sufficiently large sample there is a very high probability that the average of the observations is close to the expected value (within the margin).
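
A tiny simulation of this statement (an illustration, not from the lecture, assuming NumPy; $\mu = 0.3$ is an arbitrary choice): sample averages of Bernoulli trials concentrate around $\mu$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 0.3
for n in [10, 100, 10_000, 1_000_000]:
    # Sample average of n i.i.d. Bernoulli(mu) trials.
    x_bar = rng.binomial(1, mu, size=n).mean()
    print(f"n = {n:>9}: sample average = {x_bar:.4f}  (mu = {mu})")
```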

The AEP

Let $X$ be an i.i.d. source that takes values in the alphabet $\mathcal{X}$ and produces a sequence of i.i.d. random variables $X_1, \cdots, X_n$.

Let $p(X_1, \cdots, X_n)$ be the probability of observing the sequence $X_1, \cdots, X_n$.

Theorem (AEP):

$$-\frac{1}{n} \log p(X_1, \cdots, X_n) \xrightarrow{p} H(X) \text{ as } n\to\infty$$

Proof

$$-\frac{1}{n} \log p(X_1, \cdots, X_n) = -\frac{1}{n} \sum_{i=1}^n \log p(X_i)$$

since the $X_i$ are independent, so $p(X_1, \cdots, X_n) = \prod_{i=1}^n p(X_i)$.

The $\log p(X_i)$ terms are also i.i.d. random variables.

So by the weak law of large numbers,

$$-\frac{1}{n} \sum_{i=1}^n \log p(X_i) \xrightarrow{p} -\mathbb{E}[\log p(X_1)] = \mathbb{E}\left[\log \frac{1}{p(X_1)}\right] = H(X)$$
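
An empirical illustration of the AEP (a sketch assuming NumPy and a Bernoulli(0.9) source, which is not specified in the lecture): for growing $n$, the per-symbol log-probability $-\frac{1}{n}\log_2 p(X_1,\cdots,X_n)$ of a randomly drawn i.i.d. sequence concentrates around $H(X) \approx 0.469$ bits.

```python
import numpy as np

p = 0.9
h = -p * np.log2(p) - (1 - p) * np.log2(1 - p)   # H(X) of the Bernoulli(0.9) source
rng = np.random.default_rng(1)

for n in [10, 100, 10_000, 1_000_000]:
    x = rng.binomial(1, p, size=n)
    # log2 p(x_1,...,x_n) = sum_i log2 p(x_i), since the X_i are independent.
    log_prob = np.sum(np.where(x == 1, np.log2(p), np.log2(1 - p)))
    print(f"n = {n:>9}: -(1/n) log2 p = {-log_prob / n:.4f}  (H(X) = {h:.4f})")
```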

Typical sets

Definition of typical set

The typical set (denoted by $A_\epsilon^{(n)}$) with respect to $p(x)$ is the set of sequences $(x_1, \cdots, x_n)\in \mathcal{X}^n$ that satisfy

$$2^{-n(H(X)+\epsilon)} \leq p(x_1, \cdots, x_n) \leq 2^{-n(H(X)-\epsilon)}$$

In other words, the typical set contains all $n$-length sequences with probability close to $2^{-nH(X)}$.

  • This notion of typicality only concerns the probability of a sequence and not the actual sequence itself.
  • It has great use in compression theory.

Properties of the typical set

  • If $(x_1, \cdots, x_n)\in A_\epsilon^{(n)}$, then $H(X)-\epsilon \leq -\frac{1}{n}\log p(x_1, \cdots, x_n) \leq H(X)+\epsilon$.
  • $\Pr(A_\epsilon^{(n)}) \geq 1-\epsilon$ for sufficiently large $n$.

Sketch of proof

Use the AEP to show that $\Pr(A_\epsilon^{(n)}) \geq 1-\epsilon$ for sufficiently large $n$: for any $\delta>0$, there exists $n_0$ such that for all $n\geq n_0$,

$$\Pr\left(A_\epsilon^{(n)}\right) \geq 1-\delta$$

The size bounds follow by summing the probability bounds $2^{-n(H(X)\pm\epsilon)}$ over the sequences in $A_\epsilon^{(n)}$:

  • $|A_\epsilon^{(n)}| \leq 2^{n(H(X)+\epsilon)}$.
  • $|A_\epsilon^{(n)}| \geq (1-\epsilon)2^{n(H(X)-\epsilon)}$ for sufficiently large $n$.
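
A brute-force sketch (not from the lecture; the parameters $p = 0.7$, $\epsilon = 0.2$, $n = 20$ are arbitrary illustrative choices) that enumerates $\mathcal{X}^n$ for a Bernoulli source, builds $A_\epsilon^{(n)}$, and checks the probability and size bounds above. Note that the probability bound and the lower size bound are only guaranteed for sufficiently large $n$.

```python
import itertools
import math

p, eps, n = 0.7, 0.2, 20                              # illustrative parameters
h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)    # H(X) of the Bernoulli source

def seq_prob(x):
    """Probability of an i.i.d. Bernoulli(p) sequence x."""
    k = sum(x)                                        # number of ones
    return p ** k * (1 - p) ** (len(x) - k)

typical_prob, typical_size = 0.0, 0
for x in itertools.product([0, 1], repeat=n):
    q = seq_prob(x)
    if 2 ** (-n * (h + eps)) <= q <= 2 ** (-n * (h - eps)):
        typical_prob += q
        typical_size += 1

print(f"Pr(A_eps^(n)) = {typical_prob:.4f}  (theorem: >= {1 - eps} for large n)")
print(f"|A_eps^(n)|   = {typical_size}")
print(f"upper bound   = {2 ** (n * (h + eps)):.0f}")
print(f"lower bound   = {(1 - eps) * 2 ** (n * (h - eps)):.0f}  (for large n)")
```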

Smallest probable set

The typical set $A_\epsilon^{(n)}$ is a fairly small set that contains most of the probability.

Q: Is it the smallest such set?

A: Not quite, but pretty close.

Notation: Let $X_1, \cdots, X_n$ be i.i.d. random variables drawn from $p(x)$. For some $\delta < \frac{1}{2}$, we denote by $B_\delta^{(n)} \subset \mathcal{X}^n$ the smallest set such that $\Pr(B_\delta^{(n)}) \geq 1-\delta$.

Notation: We write $a_n \doteq b_n$ if

$$\lim_{n\to\infty} \frac{1}{n}\log\frac{a_n}{b_n} = 0$$

i.e., $a_n$ and $b_n$ are equal to first order in the exponent.

Theorem: $|B_\delta^{(n)}| \doteq |A_\epsilon^{(n)}| \doteq 2^{nH(X)}$.

See the book for the detailed proof.

What is the difference between $B_\delta^{(n)}$ and $A_\epsilon^{(n)}$?

Consider a Bernoulli sequence $X_1, \cdots, X_n$ with $p = 0.9$.

The typical sequences are those in which the proportion of 1’s is close to 0.9.

  • However, they do not include the most likely sequence, i.e., the sequence of all 1’s!
  • $H(X) = H(0.9) \approx 0.469$.
  • $-\frac{1}{n}\log_2 p(1, \cdots, 1) = -\frac{1}{n}\log_2 0.9^n = -\log_2 0.9 \approx 0.152$. Its per-symbol log-probability cannot come close to the entropy of $X$, no matter how large $n$ is.
  • The set $B_\delta^{(n)}$ contains the most probable sequences, and in particular includes the sequence of all 1’s.
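
A quick check of the numbers in this example (not from the lecture):

```python
import math

p = 0.9
# Entropy of a Bernoulli(0.9) source.
h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
print(f"H(X)                    = {h:.3f}")            # ~0.469
# Per-symbol surprisal of the all-ones sequence; far from H(X) for any n.
print(f"-log2 p(1,...,1) per symbol = {-math.log2(p):.3f}")  # ~0.152
```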

Consequences of the AEP: data compression schemes

Let $X_1, \cdots, X_n$ be i.i.d. random variables drawn from $p(x)$.

We want to find the shortest description of the sequence $X_1, \cdots, X_n$.

First, divide all sequences in $\mathcal{X}^n$ into two sets, namely the typical set $A_\epsilon^{(n)}$ and its complement $A_\epsilon^{(n)c}$.

  • $A_\epsilon^{(n)}$ contains most of the probability, while $A_\epsilon^{(n)c}$ contains most of the elements.
  • The typical set has probability close to 1 and contains approximately $2^{nH(X)}$ elements.

Order the sequences in each set in lexicographic order.

  • This means we can represent each sequence of $A_\epsilon^{(n)}$ by giving its index in the set.
  • There are at most $2^{n(H(X)+\epsilon)}$ sequences in $A_\epsilon^{(n)}$.
  • Indexing requires no more than $n(H(X)+\epsilon)+1$ bits.
  • Prefix all of these by a “0” bit ⇒ at most $n(H(X)+\epsilon)+2$ bits to represent each.
  • Similarly, index each sequence in $A_\epsilon^{(n)c}$ with no more than $n\log|\mathcal{X}|+1$ bits.
  • Prefix it by “1” ⇒ at most $n\log|\mathcal{X}|+2$ bits to represent each.
  • Voilà! We have a code for all sequences in $\mathcal{X}^n$.
  • The typical sequences have short descriptions of length $\approx nH(X)$ bits.

Algorithm for data compression

  1. Divide all $n$-length sequences in $\mathcal{X}^n$ into the typical set $A_\epsilon^{(n)}$ and its complement $A_\epsilon^{(n)c}$.
  2. Order the sequences in each set in lexicographic order.
  3. Index each sequence in $A_\epsilon^{(n)}$ using $\leq n(H(X)+\epsilon)+1$ bits and each sequence in $A_\epsilon^{(n)c}$ using $\leq n\log|\mathcal{X}|+1$ bits.
  4. Prefix the index by “0” if the sequence is in $A_\epsilon^{(n)}$ and by “1” if it is in $A_\epsilon^{(n)c}$. (A toy implementation is sketched after the notes below.)

Notes:

  • The code is one-to-one and can be decoded easily. The initial bit acts as a “flag”.
  • The number of elements in $A_\epsilon^{(n)c}$ is smaller than $|\mathcal{X}^n|$, so indexing it with $\approx n\log|\mathcal{X}|$ bits is wasteful, but it turns out this does not matter.
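
A toy Python implementation of this scheme for a Bernoulli source with a small block length (the parameters $p = 0.9$, $n = 12$, $\epsilon = 0.15$ are illustrative assumptions). It enumerates the typical set explicitly, which is only feasible for small $n$; it is a sketch of the idea, not an efficient compressor.

```python
import itertools
import math

def make_code(p, n, eps):
    """Build the typical-set code: lexicographically ordered index tables for
    A_eps^(n) and its complement, plus fixed index widths for each table."""
    h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    typical, atypical = [], []
    for x in itertools.product([0, 1], repeat=n):
        k = sum(x)
        q = p ** k * (1 - p) ** (n - k)               # probability of the sequence
        if 2 ** (-n * (h + eps)) <= q <= 2 ** (-n * (h - eps)):
            typical.append(x)
        else:
            atypical.append(x)
    typical.sort()                                    # lexicographic order
    atypical.sort()
    w_typ = max(1, math.ceil(math.log2(len(typical))))    # <= n(H+eps)+1 bits
    w_atyp = max(1, math.ceil(math.log2(len(atypical))))  # <= n log2|X|+1 bits
    return typical, atypical, w_typ, w_atyp

def encode(x, code):
    typical, atypical, w_typ, w_atyp = code
    if x in typical:                                  # flag "0" + index in A_eps^(n)
        return "0" + format(typical.index(x), f"0{w_typ}b")
    return "1" + format(atypical.index(x), f"0{w_atyp}b")

def decode(bits, code):
    typical, atypical, w_typ, w_atyp = code
    table = typical if bits[0] == "0" else atypical   # the flag bit selects the table
    return table[int(bits[1:], 2)]

code = make_code(p=0.9, n=12, eps=0.15)
x = (1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1)              # a typical sequence (eleven 1's)
bits = encode(x, code)
print(bits, "->", decode(bits, code), "length:", len(bits))  # 5 bits for 12 symbols
```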

Expected length of codewords

Use $x_n$ to denote a sequence $x_1, \cdots, x_n$ and $\ell_{x_n}$ to denote the length of the corresponding codeword.

Suppose nn is sufficiently large.

This means $\Pr(A_\epsilon^{(n)}) \geq 1-\epsilon$.

Lemma: The expected codeword length $\mathbb{E}[\ell_{x_n}]$ is upper bounded by $n(H(X)+\epsilon')$, where $\epsilon' = \epsilon + \epsilon\log|\mathcal{X}| + \frac{2}{n}$.

$\epsilon'$ can be made arbitrarily small by choosing $\epsilon$ and $n$ appropriately.
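
For completeness, a short derivation of the lemma (a sketch, using the codeword-length bounds above and $\Pr(A_\epsilon^{(n)c}) \leq \epsilon$ for sufficiently large $n$):

$$
\begin{aligned}
\mathbb{E}[\ell_{x_n}]
&= \sum_{x_n\in A_\epsilon^{(n)}} p(x_n)\,\ell_{x_n} + \sum_{x_n\in A_\epsilon^{(n)c}} p(x_n)\,\ell_{x_n}\\
&\leq \Pr\left(A_\epsilon^{(n)}\right)\left(n(H(X)+\epsilon)+2\right) + \Pr\left(A_\epsilon^{(n)c}\right)\left(n\log|\mathcal{X}|+2\right)\\
&\leq n(H(X)+\epsilon) + \epsilon\, n\log|\mathcal{X}| + 2\\
&= n\left(H(X) + \epsilon + \epsilon\log|\mathcal{X}| + \tfrac{2}{n}\right) = n(H(X)+\epsilon')
\end{aligned}
$$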

Efficient data compression guarantee

The expected codeword length $\mathbb{E}[\ell_{x_n}]$ is upper bounded by $n(H(X)+\epsilon')$, where $\epsilon' = \epsilon + \epsilon\log|\mathcal{X}| + \frac{2}{n}$.

Theorem: Let $X_n \triangleq X_1, \cdots, X_n$ be i.i.d. with $p(x)$, and let $\epsilon > 0$. Then there exists a code that maps sequences $x_n$ to binary strings (codewords) such that the mapping is one-to-one and $\mathbb{E}[\ell_{x_n}] \leq n(H(X)+\epsilon)$ for $n$ sufficiently large.

  • Thus, we can represent sequences $X_n$ using $\approx nH(X)$ bits on average.

Shannon’s source coding theorem

There exists an algorithm that can compress $n$ i.i.d. random variables $X_1, \cdots, X_n$, each with entropy $H(X)$, into slightly more than $nH(X)$ bits with negligible risk of information loss. Conversely, if they are compressed into fewer than $nH(X)$ bits, the risk of information loss is very high.

  • We have essentially proved the first half!

Proof of converse: show that any set of size smaller than that of $A_\epsilon^{(n)}$ covers a set of probability bounded away from 1.
