
CSE5313 Coding and information theory for data science (Lecture 15)

Information theory

Transmission, processing, extraction, and utilization of information.

  • Information: “Resolution” of uncertainty.
  • Question in the 1940’s: How to quantify the complexity of information?
    • Questions:
      • How many bits are required to describe an information source?
      • How many bits are required to transmit an information source?
      • How much information does one source reveal about another?
    • Applications:
      • Data Compression.
      • Channel Coding.
      • Privacy.
  • Claude Shannon 1948: Information Entropy.

Entropy

The “information value” of a message depends on how surprising it is.

  • The more unlikely the event, the more informative the message.

The Shannon information of an event $E$:

$$I(E)=\log_2 \frac{1}{Pr(E)}=-\log_2 Pr(E)$$

Entropy is the expected amount of information in a random trial.

  • Rolling a die has more entropy than flipping a coin (6 outcomes vs. 2 outcomes).

Information entropy

Let $X$ be a random variable with values in some finite set $\mathcal{X}$.

The entropy $H(X)$ of $X$ is defined as:

$$H(X)=\sum_{x\in \mathcal{X}} Pr(X=x)\log_2 \frac{1}{Pr(X=x)}=-\mathbb{E}_{x\sim X}[\log_2 Pr(X=x)]$$

Idea: How many bits are required, on average, to describe $X$?

  • The more unlikely the event, the more informative the message.
  • Use “few” bits for common $x$ (i.e., $Pr(x)$ large).
  • Use “many” bits for rare $x$ (i.e., $Pr(x)$ small).

Notes:

  • $H(X)=\mathbb{E}_{x\sim X}[I(X=x)]$
  • $H(X)\geq 0$
  • $H(X)=0$ if and only if $Pr(X=x)=1$ for some $x\in \mathcal{X}$.
  • Does not depend on $\mathcal{X}$, only on the probability distribution $Pr(X=x)$.
  • Maximum entropy is achieved when the distribution is uniform: $H(Uniform(n))=\log_2 n$, where $n$ is the number of outcomes ($a$ bits are required to describe a uniform source with $2^a$ outcomes).

Example

For the uniform distribution $X\sim Uniform\{0,1\}$, we have $H(X)=\log_2 2=1$.


For the Bernoulli distribution $X\sim Bernoulli(p)$, we have $H(X)=-p\log_2 p-(1-p)\log_2 (1-p)=H(p)$, the binary entropy function.
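
As a quick illustration (a minimal Python sketch, not part of the lecture; the helper name `entropy` is mine), the definition can be evaluated directly from a list of probabilities, reproducing the examples above:

```python
from math import log2

def entropy(probs):
    """H(X) = -sum p(x) log2 p(x), with the convention 0 * log2(0) = 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Uniform over {0, 1}: exactly one bit of entropy.
print(entropy([1/2, 1/2]))        # 1.0

# Bernoulli(p): the binary entropy function H(p).
p = 0.1
print(entropy([p, 1 - p]))        # ~0.469

# Uniform over n outcomes: log2(n) bits (a die has more entropy than a coin).
print(entropy([1/6] * 6))         # ~2.585
```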

Motivation

Optimal compression:

Consider a random variable $X$ with the distribution given below:

| Value | Probability | Encoding |
|---|---|---|
| 1 | 1/8 | 000 |
| 2 | 1/8 | 001 |
| 3 | 1/4 | 01 |
| 4 | 1/2 | 1 |

The average length of the encoding is $1/8\times 3+1/8\times 3+1/4\times 2+1/2\times 1=7/4$.

And the entropy of $X$ is $H(X)=1/8\times \log_2 8+1/8\times \log_2 8+1/4\times \log_2 4+1/2\times \log_2 2=7/4$.

So the average length of the encoding is equal to the entropy of $X$.

This is the optimal compression.

This encoding is in fact a Huffman code.
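
Below is a sketch of the standard Huffman construction using Python's `heapq` (my own illustration, not code from the lecture). For the distribution above it recovers the same codeword lengths (3, 3, 2, 1 for values 1–4) and an average length of 7/4 bits; the exact bit patterns may differ from the table, but the lengths match.

```python
import heapq
from math import log2

def huffman_code(dist):
    """Build a binary Huffman code for a {symbol: probability} mapping."""
    # Each heap entry is (probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(dist.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # two least likely subtrees
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}         # prefix '0' on one side
        merged.update({s: "1" + w for s, w in c2.items()})   # prefix '1' on the other
        heapq.heappush(heap, (p1 + p2, counter, merged))
        counter += 1
    return heap[0][2]

dist = {1: 1/8, 2: 1/8, 3: 1/4, 4: 1/2}
code = huffman_code(dist)
avg_len = sum(dist[s] * len(w) for s, w in code.items())
H = -sum(p * log2(p) for p in dist.values())
print(code)              # same codeword lengths as the table (exact bits may differ)
print(avg_len, H)        # 1.75 1.75
```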

A few extra theorems that will not be proved in this course

  • Theorem: The avg. # of bits in any prefix-free compression of $X$ is ≥ $H(X)$.
  • Theorem: Huffman coding is optimal.
    • I.e., no prefix-free code achieves a smaller avg. # of bits; when all probabilities are powers of 2 (as above), the avg. # of bits equals $H(X)$.
  • Disadvantage of Huffman coding: the distribution must be known.
    • Generally not the case.
  • [Lempel-Ziv 1978]: Universal lossless data compression.
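
For a flavor of universal compression, here is a minimal LZ78-style parsing sketch (my own illustration of the usual description of LZ78, not code from the 1978 paper): the input is split into phrases encoded as (dictionary index, next symbol) pairs, with no prior knowledge of the source distribution.

```python
def lz78_encode(s):
    """LZ78 parsing: encode s as a list of (dictionary index, next symbol) pairs."""
    dictionary = {"": 0}            # phrase -> index; index 0 is the empty phrase
    output, phrase = [], ""
    for ch in s:
        if phrase + ch in dictionary:
            phrase += ch            # keep extending the current phrase
        else:
            output.append((dictionary[phrase], ch))
            dictionary[phrase + ch] = len(dictionary)
            phrase = ""
    if phrase:                      # flush a trailing phrase already in the dictionary
        output.append((dictionary[phrase[:-1]], phrase[-1]))
    return output

print(lz78_encode("ababababab"))
# [(0, 'a'), (0, 'b'), (1, 'b'), (3, 'a'), (2, 'a'), (0, 'b')]
```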

Conditional and joint entropy

How do the entropies of different random variables interact?

  • As a function of their dependence.
  • Needed in order to relate:
    • A variable and its compression.
    • A variable and its transmission over a noisy channel.
    • A variable and its encryption.

Definition for joint entropy

Let $X$ and $Y$ be discrete random variables taking values in finite sets $\mathcal{X}$ and $\mathcal{Y}$, respectively.

The joint entropy $H(X,Y)$ of $X$ and $Y$ is defined as:

$$H(X,Y)=-\sum_{x\in \mathcal{X},\, y\in \mathcal{Y}} Pr(X=x, Y=y) \log_2 Pr(X=x, Y=y)$$

Notes:

  • $H(X,Y)\geq 0$
  • $H(X,Y)=H(Y,X)$
  • $H(X,Y)\geq \max\{H(X),H(Y)\}$
  • $H(X,Y)\leq H(X)+H(Y)$, with equality if and only if $X$ and $Y$ are independent.

Conditional entropy

Note

Recall that the conditional probability $P(Y|X)$ is defined as $P(Y|X)=\frac{P(Y,X)}{P(X)}$.

What is the average amount of information revealed by $Y$ once $X$ is known?

For each given $x\in \mathcal{X}$, the conditional entropy $H(Y|X=x)$ is defined as:

$$\begin{aligned} H(Y|X=x)&=\sum_{y\in \mathcal{Y}} Pr(Y=y|X=x)\log_2 \frac{1}{Pr(Y=y|X=x)} \\ &=-\sum_{y\in \mathcal{Y}} Pr(Y=y|X=x) \log_2 Pr(Y=y|X=x) \end{aligned}$$

The conditional entropy $H(Y|X)$ is defined as:

$$\begin{aligned} H(Y|X)&=\mathbb{E}_{x\sim X}[H(Y|X=x)] \\ &=\sum_{x\in \mathcal{X}} Pr(X=x)\,H(Y|X=x) \\ &=-\sum_{x\in \mathcal{X}} Pr(X=x)\sum_{y\in \mathcal{Y}} Pr(Y=y|X=x) \log_2 Pr(Y=y|X=x) \\ &=-\sum_{x\in \mathcal{X},\, y\in \mathcal{Y}} Pr(X=x, Y=y) \log_2 Pr(Y=y|X=x) \end{aligned}$$

Notes:

  • $H(X|X)=0$
  • $H(Y|X)=0$ when $Y$ is a function of $X$.
  • Conditional entropy is not necessarily symmetric: in general $H(X|Y)\neq H(Y|X)$.
  • Conditioning does not increase entropy: $H(X|Y)\leq H(X)$.

Chain rules of entropy

Joint entropy: $H(X,Y)=\mathbb{E}_{(x,y)\sim (X,Y)}\left[\log_2 \frac{1}{Pr(X=x, Y=y)}\right]$.

Conditional entropy: $H(Y|X)=\mathbb{E}_{(x,y)\sim (X,Y)}\left[\log_2 \frac{1}{Pr(Y=y|X=x)}\right]$.

Chain rule: $H(X,Y)=H(X)+H(Y|X)$.

Proof

Since $Pr(x,y)=Pr(x)Pr(y|x)$, we have $\log_2 \frac{1}{Pr(x,y)}=\log_2 \frac{1}{Pr(x)}+\log_2 \frac{1}{Pr(y|x)}$. Taking the expectation over the joint distribution of $(X,Y)$ gives $H(X,Y)=H(X)+H(Y|X)$. Applying the symmetry of the joint distribution, $H(X,Y)=H(Y)+H(X|Y)$ as well.

In particular, for statistically independent $X$ and $Y$ we have $Pr(x,y)=Pr(x)Pr(y)$, so $Pr(y|x)=\frac{Pr(x,y)}{Pr(x)}=\frac{Pr(x)Pr(y)}{Pr(x)}=Pr(y)$, hence $H(Y|X)=H(Y)$ and $H(X,Y)=H(X)+H(Y)$.

Example of computing the conditional entropy

For the given distribution:

| Y\X | 1 | 2 | 3 | 4 | Y marginal |
|---|---|---|---|---|---|
| 1 | 1/8 | 1/16 | 1/32 | 1/32 | 1/4 |
| 2 | 1/16 | 1/8 | 1/32 | 1/32 | 1/4 |
| 3 | 1/16 | 1/16 | 1/16 | 1/16 | 1/4 |
| 4 | 1/4 | 0 | 0 | 0 | 1/4 |
| X marginal | 1/2 | 1/4 | 1/8 | 1/8 | 1 |

Here $H(X)=H((1/2,1/4,1/8,1/8))=-(1/2\log_2 1/2+1/4\log_2 1/4+1/8\log_2 1/8+1/8\log_2 1/8)=7/4$.

$H(Y)=H((1/4,1/4,1/4,1/4))=-(1/4\log_2 1/4+1/4\log_2 1/4+1/4\log_2 1/4+1/4\log_2 1/4)=2$.

$H(Y|X=1)=H((1/4,1/8,1/8,1/2))=-(1/4\log_2 1/4+1/8\log_2 1/8+1/8\log_2 1/8+1/2\log_2 1/2)=7/4$.

$H(Y|X=2)=H((1/4,1/2,1/4))=-(1/4\log_2 1/4+1/2\log_2 1/2+1/4\log_2 1/4)=1.5$.

$H(Y|X=3)=H((1/4,1/4,1/2))=-(1/4\log_2 1/4+1/4\log_2 1/4+1/2\log_2 1/2)=1.5$.

$H(Y|X=4)=H((1/4,1/4,1/2))=-(1/4\log_2 1/4+1/4\log_2 1/4+1/2\log_2 1/2)=1.5$.

So $H(Y|X)=\sum_{x\in \mathcal{X}} Pr(x)H(Y|X=x)=1/2\times 7/4+1/4\times 1.5+1/8\times 1.5+1/8\times 1.5=13/8$.
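
A short Python check of this computation (assuming the joint table above; the helper names are mine). It recovers $H(X)=7/4$, $H(Y)=2$, and $H(Y|X)=13/8$ via the chain rule $H(Y|X)=H(X,Y)-H(X)$:

```python
from math import log2

# Joint distribution Pr(X=x, Y=y) from the table above; keys are (x, y) pairs.
joint = {
    (1, 1): 1/8,  (2, 1): 1/16, (3, 1): 1/32, (4, 1): 1/32,
    (1, 2): 1/16, (2, 2): 1/8,  (3, 2): 1/32, (4, 2): 1/32,
    (1, 3): 1/16, (2, 3): 1/16, (3, 3): 1/16, (4, 3): 1/16,
    (1, 4): 1/4,  (2, 4): 0,    (3, 4): 0,    (4, 4): 0,
}

def H(probs):
    """Entropy of a distribution given as an iterable of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

# Marginal distributions of X and Y.
px = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in range(1, 5)}
py = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in range(1, 5)}

H_X, H_Y, H_XY = H(px.values()), H(py.values()), H(joint.values())
H_Y_given_X = H_XY - H_X       # chain rule: H(X,Y) = H(X) + H(Y|X)

print(H_X, H_Y)                # 1.75 2.0
print(H_Y_given_X)             # 1.625 = 13/8
```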

Mutual information

Definition for mutual information

The mutual information $I(X;Y)$ of $X$ and $Y$ is defined as:

$$I(X;Y)=H(X)-H(X|Y)=H(Y)-H(Y|X)$$

Properties of mutual information

  • $I(X;Y)\geq 0$
    • Proved via Jensen’s inequality.
    • Equivalently: conditioning does not increase entropy.
  • $I(X;Y)=I(Y;X)$ (symmetry).
  • $I(X;X)=H(X)-H(X|X)=H(X)$
  • $I(X;Y)=H(X)+H(Y)-H(X,Y)$

Example of computing the mutual information

For the given distribution:

| Y\X | 1 | 2 | 3 | 4 | Y marginal |
|---|---|---|---|---|---|
| 1 | 1/8 | 1/16 | 1/32 | 1/32 | 1/4 |
| 2 | 1/16 | 1/8 | 1/32 | 1/32 | 1/4 |
| 3 | 1/16 | 1/16 | 1/16 | 1/16 | 1/4 |
| 4 | 1/4 | 0 | 0 | 0 | 1/4 |
| X marginal | 1/2 | 1/4 | 1/8 | 1/8 | 1 |

Recall from the previous example:

$H(Y|X)=\frac{13}{8}$ and $H(Y)=2$,

then the mutual information is:

$$I(X;Y)=H(Y)-H(Y|X)=2-\frac{13}{8}=\frac{3}{8}$$

So the entropy of $Y$ is reduced by $\frac{3}{8}$ bits on average when $X$ is known.
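
As a sanity check in Python (same joint table and helper as above, repeated so the snippet is self-contained), the three expressions for the mutual information agree:

```python
from math import log2

joint = {
    (1, 1): 1/8,  (2, 1): 1/16, (3, 1): 1/32, (4, 1): 1/32,
    (1, 2): 1/16, (2, 2): 1/8,  (3, 2): 1/32, (4, 2): 1/32,
    (1, 3): 1/16, (2, 3): 1/16, (3, 3): 1/16, (4, 3): 1/16,
    (1, 4): 1/4,  (2, 4): 0,    (3, 4): 0,    (4, 4): 0,
}

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

px = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in range(1, 5)}
py = {y: sum(p for (_, yy), p in joint.items() if yy == y) for y in range(1, 5)}
H_X, H_Y, H_XY = H(px.values()), H(py.values()), H(joint.values())

# Three equivalent expressions for I(X;Y); all give 3/8.
print(H_Y - (H_XY - H_X))      # H(Y) - H(Y|X) = 0.375
print(H_X - (H_XY - H_Y))      # H(X) - H(X|Y) = 0.375
print(H_X + H_Y - H_XY)        # H(X) + H(Y) - H(X,Y) = 0.375
```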

Applications

Channel capacity

Recall that a channel is a tuple $(F,\Phi,\operatorname{Pr})$.

Definition of discrete channel

A discrete channel is a system consisting of a discrete input alphabet $\mathcal{X}$, a discrete output alphabet $\mathcal{Y}$, and a transition probability $\operatorname{Pr}(y|x)$.

The channel is memoryless if the output at any time depends only on the input at that time and not on the inputs and outputs at other times.

Definition of channel capacity

The channel capacity $C$ of a channel is defined as:

$$C=\max_{p(x)} I(X;Y)$$

where $I(X;Y)$ is the mutual information between the input $X$ and the output $Y$, and the maximum is taken over all input distributions $p(x)$.

Shannon’s noisy coding theorem

Recall that the rate of a code is $R=\frac{\log_{|F|}|\mathcal{C}|}{n}$.

For a discrete memoryless channel with capacity $C$, every rate $R<C$ is achievable.

  • There exists a sequence of codes of length $n$ and size $|F|^{nR}$ with vanishing probability of error as $n\to \infty$.
  • In a single channel use, it is impossible to beat the noise.
  • Over $n$ channel uses, as $n\to \infty$, it is possible to beat all the noise.

Corollary for the binary symmetric channel

The capacity of the binary symmetric channel with crossover probability $p$ is $1-H(p)$.

Proof

$$\begin{aligned} H(Y|X=x)&=-\operatorname{Pr}(y=0|x)\log_2 \operatorname{Pr}(y=0|x)-\operatorname{Pr}(y=1|x)\log_2 \operatorname{Pr}(y=1|x) \\ &=-p\log_2 p-(1-p)\log_2 (1-p)\\ &=H(p) \end{aligned}$$

So,

$$\begin{aligned} H(Y|X)&=\sum_{x\in \mathcal{X}} \operatorname{Pr}(x)H(Y|X=x)\\ &=H(p)\sum_{x\in \mathcal{X}} \operatorname{Pr}(x) \\ &=H(p) \end{aligned}$$

So,

$$\begin{aligned} I(X;Y)&=H(Y)-H(Y|X)\\ &=H(Y)-H(p)\\ &\leq 1-H(p) \end{aligned}$$

The inequality holds since $Y$ is binary, so $H(Y)\leq 1$; equality is attained when the input (and hence the output) is uniform. Therefore $C=1-H(p)$.

When $p=0$, the capacity is $1$.

When $p=\frac{1}{2}$, the capacity is $0$ (completely noisy channel).
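
As a numerical sanity check (an illustrative Python sketch, not part of the lecture), one can grid-search the input distribution $\operatorname{Pr}(X=1)=q$ of a BSC and confirm that the mutual information is maximized at the uniform input, where it equals $1-H(p)$:

```python
from math import log2

def h2(p):
    """Binary entropy function H(p)."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_mutual_info(q, p):
    """I(X;Y) for a BSC with crossover probability p and input Pr(X=1) = q."""
    py1 = q * (1 - p) + (1 - q) * p    # Pr(Y = 1)
    return h2(py1) - h2(p)             # H(Y) - H(Y|X), since H(Y|X) = H(p)

p = 0.1
qs = [i / 1000 for i in range(1001)]
best_q = max(qs, key=lambda q: bsc_mutual_info(q, p))
print(best_q)                          # 0.5: the uniform input maximizes I(X;Y)
print(bsc_mutual_info(best_q, p))      # ~0.531
print(1 - h2(p))                       # ~0.531 = 1 - H(0.1)
```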
