
CSE5313 Coding and information theory for data science (Lecture 15)

Information theory

Transmission, processing, extraction, and utilization of information.

  • Information: “Resolution” of uncertainty.
  • Question in the 1940’s: How to quantify the complexity of information?
    • Questions:
      • How many bits are required to describe an information source?
      • How many bits are required to transmit an information source?
      • How much information does one source reveal about another?
    • Applications:
      • Data Compression.
      • Channel Coding.
      • Privacy.
  • Claude Shannon 1948: Information Entropy.

Entropy

The “information value” of a message depends on how surprising it is.

  • The more unlikely the event, the more informative the message.

The Shannon information of an event $E$:

$$I(E)=\log_2 \frac{1}{Pr(E)}=-\log_2 Pr(E)$$

Entropy is the expected amount of information in a random trial.

  • Rolling a die has more entropy than flipping a coin (6 outcomes vs. 2 outcomes).

Information entropy

Let $X$ be a random variable with values in some finite set $\mathcal{X}$.

The entropy $H(X)$ of $X$ is defined as:

$$H(X)=\sum_{x\in \mathcal{X}} Pr(X=x)\log_2 \frac{1}{Pr(X=x)}=-\mathbb{E}_{x\sim X}[\log_2 Pr(X=x)]$$

Idea: How many bits are required, on average, to describe $X$?

  • The more unlikely the event, the more informative the message.
  • Use “few” bits for common $x$ (i.e., $Pr(x)$ large).
  • Use “many” bits for rare $x$ (i.e., $Pr(x)$ small).

Notes:

  • $H(X)=\mathbb{E}_{x\sim X}[I(X=x)]$
  • $H(X)\geq 0$
  • $H(X)=0$ if and only if $Pr(X=x)=1$ for some $x\in \mathcal{X}$.
  • Does not depend on $\mathcal{X}$, only on the probability distribution $Pr(X=x)$.
  • Maximum entropy is achieved when the distribution is uniform: $H(Uniform(n))=\log_2 n$, where $n$ is the number of outcomes ($a$ bits are required to describe a uniform source with $2^a$ outcomes).

Example

For the uniform distribution $X\sim Uniform\{0,1\}$, we have $H(X)=\log_2 2=1$.


For the Bernoulli distribution $X\sim Bernoulli(p)$, we have $H(X)=-p\log_2 p-(1-p)\log_2 (1-p)=H(p)$, the binary entropy function.
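
As a quick illustration (a minimal Python sketch, not part of the lecture; the helper name `entropy` is mine), the definition can be evaluated directly from a list of probabilities, reproducing the examples above:

```python
from math import log2

def entropy(probs):
    """H(X) = -sum p(x) log2 p(x), with the convention 0 * log2(0) = 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Uniform over {0, 1}: exactly one bit of entropy.
print(entropy([1/2, 1/2]))        # 1.0

# Bernoulli(p): the binary entropy function H(p).
p = 0.1
print(entropy([p, 1 - p]))        # ~0.469

# Uniform over n outcomes: log2(n) bits (a die has more entropy than a coin).
print(entropy([1/6] * 6))         # ~2.585
```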

Motivation

Optimal compression:

Consider a random variable $X$ with the distribution given below:

| Value | Probability | Encoding |
|---|---|---|
| 1 | 1/8 | 000 |
| 2 | 1/8 | 001 |
| 3 | 1/4 | 01 |
| 4 | 1/2 | 1 |

The average length of the encoding is $1/8\times 3+1/8\times 3+1/4\times 2+1/2\times 1=7/4$.

And the entropy of $X$ is $H(X)=1/8\times \log_2 8+1/8\times \log_2 8+1/4\times \log_2 4+1/2\times \log_2 2=7/4$.

So the average length of the encoding is equal to the entropy of $X$.

This is the optimal compression.

This encoding is in fact a Huffman code.
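
Below is a sketch of the standard Huffman construction using Python's `heapq` (my own illustration, not code from the lecture). For the distribution above it recovers the same codeword lengths (3, 3, 2, 1 for values 1–4) and an average length of 7/4 bits; the exact bit patterns may differ from the table, but the lengths match.

```python
import heapq
from math import log2

def huffman_code(dist):
    """Build a binary Huffman code for a {symbol: probability} mapping."""
    # Each heap entry is (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(dist.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # two least likely subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}         # prefix '0' on one side
        merged.update({s: "1" + w for s, w in c2.items()})   # prefix '1' on the other
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

dist = {1: 1/8, 2: 1/8, 3: 1/4, 4: 1/2}
code = huffman_code(dist)
avg_len = sum(dist[s] * len(w) for s, w in code.items())
H = -sum(p * log2(p) for p in dist.values())
print(code)              # same codeword lengths as the table (exact bits may differ)
print(avg_len, H)        # 1.75 1.75
```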

A few extra theorems that will not be proved in this course

  • Theorem: The avg. # of bits in any prefix-free compression of $X$ is ≥ $H(X)$.
  • Theorem: Huffman coding is optimal.
    • I.e., no prefix-free code achieves a smaller avg. # of bits; when all probabilities are powers of 2 (as above), the avg. # of bits equals $H(X)$.
  • Disadvantage of Huffman coding: the distribution must be known.
    • Generally not the case.
  • [Lempel-Ziv 1978]: Universal lossless data compression.
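
For a flavor of universal compression, here is a minimal LZ78-style parsing sketch (my own illustration of the usual description of LZ78, not code from the 1978 paper): the input is split into phrases encoded as (dictionary index, next symbol) pairs, with no prior knowledge of the source distribution.

```python
def lz78_encode(s):
    """LZ78 parsing: encode s as a list of (dictionary index, next symbol) pairs."""
    dictionary = {"": 0}            # phrase -> index; index 0 is the empty phrase
    output, phrase = [], ""
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch            # keep extending the current phrase
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                      # flush a trailing phrase already in the dictionary
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output

print(lz78_encode("ababababab"))
# [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, 'a'), (0, 'b')]
```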

Conditional and joint entropy

How do the entropies of different random variables interact?

  • As a function of their dependence.
  • Needed in order to relate:
    • A variable and its compression.
    • A variable and its transmission over a noisy channel.
    • A variable and its encryption.

Definition for joint entropy

Let $X$ and $Y$ be discrete random variables taking values in finite sets $\mathcal{X}$ and $\mathcal{Y}$, respectively.

The joint entropy $H(X,Y)$ of $X$ and $Y$ is defined as:

$$H(X,Y)=-\sum_{x\in \mathcal{X},\, y\in \mathcal{Y}} Pr(X=x, Y=y) \log_2 Pr(X=x, Y=y)$$

Notes:

  • $H(X,Y)\geq 0$
  • $H(X,Y)=H(Y,X)$
  • $H(X,Y)\geq \max\{H(X),H(Y)\}$
  • $H(X,Y)\leq H(X)+H(Y)$, with equality if and only if $X$ and $Y$ are independent.

Conditional entropy

Note

Recall that the conditional probability $P(Y|X)$ is defined as $P(Y|X)=\frac{P(Y,X)}{P(X)}$.

What is the average amount of information revealed by $Y$ once $X$ is known?

For each given $x\in \mathcal{X}$, the conditional entropy $H(Y|X=x)$ is defined as:

$$\begin{aligned} H(Y|X=x)&=\sum_{y\in \mathcal{Y}} Pr(Y=y|X=x)\log_2 \frac{1}{Pr(Y=y|X=x)} \\ &=-\sum_{y\in \mathcal{Y}} Pr(Y=y|X=x) \log_2 Pr(Y=y|X=x) \end{aligned}$$

The conditional entropy $H(Y|X)$ is defined as:

$$\begin{aligned} H(Y|X)&=\mathbb{E}_{x\sim X}[H(Y|X=x)] \\ &=\sum_{x\in \mathcal{X}} Pr(X=x)\,H(Y|X=x) \\ &=-\sum_{x\in \mathcal{X}} Pr(X=x)\sum_{y\in \mathcal{Y}} Pr(Y=y|X=x) \log_2 Pr(Y=y|X=x) \\ &=-\sum_{x\in \mathcal{X},\, y\in \mathcal{Y}} Pr(X=x, Y=y) \log_2 Pr(Y=y|X=x) \end{aligned}$$

Notes:

  • $H(X|X)=0$
  • $H(Y|X)=0$ when $Y$ is a function of $X$.
  • Conditional entropy is not necessarily symmetric: in general $H(X|Y)\neq H(Y|X)$.
  • Conditioning does not increase entropy: $H(X|Y)\leq H(X)$.

Chain rules of entropy

Joint entropy: $H(X,Y)=\mathbb{E}_{(x,y)\sim (X,Y)}\left[\log_2 \frac{1}{Pr(X=x, Y=y)}\right]$.

Conditional entropy: $H(Y|X)=\mathbb{E}_{(x,y)\sim (X,Y)}\left[\log_2 \frac{1}{Pr(Y=y|X=x)}\right]$.

Chain rule: $H(X,Y)=H(X)+H(Y|X)$.

Proof

Since $Pr(x,y)=Pr(x)Pr(y|x)$, we have $\log_2 \frac{1}{Pr(x,y)}=\log_2 \frac{1}{Pr(x)}+\log_2 \frac{1}{Pr(y|x)}$. Taking the expectation over the joint distribution of $(X,Y)$ gives $H(X,Y)=H(X)+H(Y|X)$. Applying the symmetry of the joint distribution, $H(X,Y)=H(Y)+H(X|Y)$ as well.

In particular, for statistically independent $X$ and $Y$ we have $Pr(x,y)=Pr(x)Pr(y)$, so $Pr(y|x)=\frac{Pr(x,y)}{Pr(x)}=\frac{Pr(x)Pr(y)}{Pr(x)}=Pr(y)$, hence $H(Y|X)=H(Y)$ and $H(X,Y)=H(X)+H(Y)$.

Example of computing the conditional entropy

For the given distribution:

| Y\X | 1 | 2 | 3 | 4 | Y marginal |
|---|---|---|---|---|---|
| 1 | 1/8 | 1/16 | 1/32 | 1/32 | 1/4 |
| 2 | 1/16 | 1/8 | 1/32 | 1/32 | 1/4 |
| 3 | 1/16 | 1/16 | 1/16 | 1/16 | 1/4 |
| 4 | 1/4 | 0 | 0 | 0 | 1/4 |
| X marginal | 1/2 | 1/4 | 1/8 | 1/8 | 1 |

Here $H(X)=H((1/2,1/4,1/8,1/8))=-(1/2\log_2 1/2+1/4\log_2 1/4+1/8\log_2 1/8+1/8\log_2 1/8)=7/4$.

$H(Y)=H((1/4,1/4,1/4,1/4))=-(1/4\log_2 1/4+1/4\log_2 1/4+1/4\log_2 1/4+1/4\log_2 1/4)=2$.

$H(Y|X=1)=H((1/4,1/8,1/8,1/2))=-(1/4\log_2 1/4+1/8\log_2 1/8+1/8\log_2 1/8+1/2\log_2 1/2)=7/4$.

$H(Y|X=2)=H((1/4,1/2,1/4))=-(1/4\log_2 1/4+1/2\log_2 1/2+1/4\log_2 1/4)=1.5$.

$H(Y|X=3)=H((1/4,1/4,1/2))=-(1/4\log_2 1/4+1/4\log_2 1/4+1/2\log_2 1/2)=1.5$.

$H(Y|X=4)=H((1/4,1/4,1/2))=-(1/4\log_2 1/4+1/4\log_2 1/4+1/2\log_2 1/2)=1.5$.

So $H(Y|X)=\sum_{x\in \mathcal{X}} Pr(x)H(Y|X=x)=1/2\times 7/4+1/4\times 1.5+1/8\times 1.5+1/8\times 1.5=13/8$.
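
A short Python check of this computation (assuming the joint table above; the helper names are mine). It recovers $H(X)=7/4$, $H(Y)=2$, and $H(Y|X)=13/8$ via the chain rule $H(Y|X)=H(X,Y)-H(X)$:

```python
from math import log2

# Joint distribution Pr(X=x, Y=y) from the table above; keys are (x, y) pairs.
joint = {
    (1, 1): 1/8,  (2, 1): 1/16, (3, 1): 1/32, (4, 1): 1/32,
    (1, 2): 1/16, (2, 2): 1/8,  (3, 2): 1/32, (4, 2): 1/32,
    (1, 3): 1/16, (2, 3): 1/16, (3, 3): 1/16, (4, 3): 1/16,
    (1, 4): 1/4,  (2, 4): 0,    (3, 4): 0,    (4, 4): 0,
}

def H(probs):
    """Entropy of a distribution given as an iterable of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Marginal distributions of X and Y.
px = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in range(1, 5)}
py = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in range(1, 5)}

H_X, H_Y, H_XY = H(px.values()), H(py.values()), H(joint.values())
H_Y_given_X = H_XY - H_X       # chain rule: H(X,Y) = H(X) + H(Y|X)

print(H_X, H_Y)                # 1.75 2.0
print(H_Y_given_X)             # 1.625 = 13/8
```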

Mutual information

Definition for mutual information

The mutual information $I(X;Y)$ of $X$ and $Y$ is defined as:

$$I(X;Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)$$

Properties of mutual information

  • $I(X;Y)\geq 0$
    • Proved via Jensen’s inequality.
    • Equivalently: conditioning does not increase entropy.
  • $I(X;Y)=I(Y;X)$ (symmetry).
  • $I(X;X)=H(X)-H(X|X)=H(X)$
  • $I(X;Y)=H(X)+H(Y)-H(X,Y)$

Example of computing the mutual information

For the given distribution:

| Y\X | 1 | 2 | 3 | 4 | Y marginal |
|---|---|---|---|---|---|
| 1 | 1/8 | 1/16 | 1/32 | 1/32 | 1/4 |
| 2 | 1/16 | 1/8 | 1/32 | 1/32 | 1/4 |
| 3 | 1/16 | 1/16 | 1/16 | 1/16 | 1/4 |
| 4 | 1/4 | 0 | 0 | 0 | 1/4 |
| X marginal | 1/2 | 1/4 | 1/8 | 1/8 | 1 |

Recall from the previous example:

$H(Y|X)=\frac{13}{8}$ and $H(Y)=2$,

then the mutual information is:

$$I(X;Y)=H(Y)-H(Y|X)=2-\frac{13}{8}=\frac{3}{8}$$

So the entropy of $Y$ is reduced by $\frac{3}{8}$ bits on average when $X$ is known.
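
As a sanity check in Python (same joint table and helper as above, repeated so the snippet is self-contained), the three expressions for the mutual information agree:

```python
from math import log2

joint = {
    (1, 1): 1/8,  (2, 1): 1/16, (3, 1): 1/32, (4, 1): 1/32,
    (1, 2): 1/16, (2, 2): 1/8,  (3, 2): 1/32, (4, 2): 1/32,
    (1, 3): 1/16, (2, 3): 1/16, (3, 3): 1/16, (4, 3): 1/16,
    (1, 4): 1/4,  (2, 4): 0,    (3, 4): 0,    (4, 4): 0,
}

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

px = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in range(1, 5)}
py = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in range(1, 5)}
H_X, H_Y, H_XY = H(px.values()), H(py.values()), H(joint.values())

# Three equivalent expressions for I(X;Y); all give 3/8.
print(H_Y - (H_XY - H_X))      # H(Y) - H(Y|X) = 0.375
print(H_X - (H_XY - H_Y))      # H(X) - H(X|Y) = 0.375
print(H_X + H_Y - H_XY)        # H(X) + H(Y) - H(X,Y) = 0.375
```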

Applications

Channel capacity

Recall that a channel is a tuple $(F,\Phi,\operatorname{Pr})$.

Definition of discrete channel

A discrete channel is a system consisting of a discrete input alphabet $\mathcal{X}$, a discrete output alphabet $\mathcal{Y}$, and a transition probability $\operatorname{Pr}(y|x)$.

The channel is memoryless if the output at any time depends only on the input at that time and not on the inputs and outputs at other times.

Definition of channel capacity

The channel capacity $C$ of a channel is defined as:

$$C=\max_{p(x)} I(X;Y)$$

where $I(X;Y)$ is the mutual information between the input $X$ and the output $Y$, and the maximum is taken over all input distributions $p(x)$.

Shannon’s noisy coding theorem

Recall that the rate of a code is $R=\frac{\log_{|F|}|\mathcal{C}|}{n}$.

For a discrete memoryless channel with capacity $C$, every rate $R<C$ is achievable.

  • There exists a sequence of codes of length $n$ and size $|F|^{nR}$ with vanishing probability of error as $n\to \infty$.
  • In a single channel use, it is impossible to beat the noise.
  • Over $n$ channel uses, as $n\to \infty$, it is possible to beat all the noise.

Corollary for the binary symmetric channel

The capacity of the binary symmetric channel with crossover probability $p$ is $1-H(p)$.

Proof

$$\begin{aligned} H(Y|X=x)&=-\operatorname{Pr}(y=0|x)\log_2 \operatorname{Pr}(y=0|x)-\operatorname{Pr}(y=1|x)\log_2 \operatorname{Pr}(y=1|x) \\ &=-p\log_2 p-(1-p)\log_2 (1-p)\\ &=H(p) \end{aligned}$$

So,

$$\begin{aligned} H(Y|X)&=\sum_{x\in \mathcal{X}} \operatorname{Pr}(x)H(Y|X=x)\\ &=H(p)\sum_{x\in \mathcal{X}} \operatorname{Pr}(x) \\ &=H(p) \end{aligned}$$

So,

$$\begin{aligned} I(X;Y)&=H(Y)-H(Y|X)\\ &=H(Y)-H(p)\\ &\leq 1-H(p) \end{aligned}$$

The inequality holds since $Y$ is binary, so $H(Y)\leq 1$; equality is attained when the input (and hence the output) is uniform. Therefore $C=1-H(p)$.

When $p=0$, the capacity is $1$.

When $p=\frac{1}{2}$, the capacity is $0$ (completely noisy channel).
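
As a numerical sanity check (an illustrative Python sketch, not part of the lecture), one can grid-search the input distribution $\operatorname{Pr}(X=1)=q$ of a BSC and confirm that the mutual information is maximized at the uniform input, where it equals $1-H(p)$:

```python
from math import log2

def h2(p):
    """Binary entropy function H(p)."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_mutual_info(q, p):
    """I(X;Y) for a BSC with crossover probability p and input Pr(X=1) = q."""
    py1 = q * (1 - p) + (1 - q) * p    # Pr(Y = 1)
    return h2(py1) - h2(p)             # H(Y) - H(Y|X), since H(Y|X) = H(p)

p = 0.1
qs = [i / 1000 for i in range(1001)]
best_q = max(qs, key=lambda q: bsc_mutual_info(q, p))
print(best_q)                          # 0.5: the uniform input maximizes I(X;Y)
print(bsc_mutual_info(best_q, p))      # ~0.531
print(1 - h2(p))                       # ~0.531 = 1 - H(0.1)
```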
