CSE5313 Coding and information theory for data science (Lecture 15)
Information theory
Transmission, processing, extraction, and utilization of information.
- Information: “Resolution” of uncertainty.
- Question in the 1940’s: How to quantify the complexity of information?
- Questions:
- How many bits are required to describe an information source?
- How many bits are required to transmit an information source?
- How much information does one source reveal about another?
- Applications:
- Data Compression.
- Channel Coding.
- Privacy.
- Answer:
- Claude Shannon, 1948: Information entropy.
Entropy
The “information value” of a message depends on how surprising it is.
- The more unlikely the event, the more informative the message.
The Shannon information of an event $x$ with probability $\operatorname{Pr}(x)$ is $I(x) = -\log_2 \operatorname{Pr}(x)$.
Entropy is the expected amount of information in a random trial.
- Rolling a die has more entropy than flipping a coin (6 states vs. 2 states).
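Concretely, for a fair die and a fair coin:
$$H(\text{die}) = \log_2 6 \approx 2.58 \text{ bits} \;>\; H(\text{coin}) = \log_2 2 = 1 \text{ bit}.$$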
Information entropy
Let $X$ be a random variable with values in some finite set $\mathcal{X}$.
The entropy of $X$ is defined as:
$$H(X) = -\sum_{x \in \mathcal{X}} \operatorname{Pr}(x) \log_2 \operatorname{Pr}(x).$$
Idea: How many bits are required, on average, to describe $X$?
- The more unlikely the event, the more informative the message.
- Use “few” bits for common $x$ (i.e., $\operatorname{Pr}(x)$ large).
- Use “many” bits for rare $x$ (i.e., $\operatorname{Pr}(x)$ small).
Notes:
- $H(X) = 0$ if and only if $\operatorname{Pr}(x) = 1$ for some $x \in \mathcal{X}$.
- $H(X)$ does not depend on the values of $X$ but only on the probability distribution $\operatorname{Pr}(x)$.
- Maximum entropy is achieved when the distribution is uniform: $H(X) = \log_2 |\mathcal{X}|$, where $|\mathcal{X}|$ is the number of events ($\log_2 |\mathcal{X}|$ bits are required to describe each event of a set of size $|\mathcal{X}|$).
Example
For the uniform distribution over a set of size $n$, we have $H(X) = \log_2 n$.
For the Bernoulli distribution with parameter $p$, we have $H(X) = -p \log_2 p - (1-p) \log_2 (1-p)$, denoted $H(p)$.
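A minimal sketch of these two cases in Python (the function name `entropy` is illustrative, not from the lecture):

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform distribution over n = 6 outcomes: H(X) = log2(6).
print(entropy([1/6] * 6), math.log2(6))   # both ≈ 2.585 bits

# Bernoulli(p): H(p) = -p*log2(p) - (1-p)*log2(1-p), maximized (1 bit) at p = 1/2.
p = 0.25
print(entropy([p, 1 - p]))                # ≈ 0.811 bits
```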
Motivation
Optimal compression:
Consider the random variable $X$ with the distribution given below:
| Value | Probability | Encoding |
|---|---|---|
| 1 | 1/8 | 000 |
| 2 | 1/8 | 001 |
| 3 | 1/4 | 01 |
| 4 | 1/2 | 1 |
The average length of the encoding is $\frac{1}{2} \cdot 1 + \frac{1}{4} \cdot 2 + \frac{1}{8} \cdot 3 + \frac{1}{8} \cdot 3 = 1.75$ bits.
And the entropy of $X$ is $H(X) = \frac{1}{2} \cdot 1 + \frac{1}{4} \cdot 2 + \frac{1}{8} \cdot 3 + \frac{1}{8} \cdot 3 = 1.75$ bits.
So the average length of the encoding is equal to the entropy of $X$.
This is the optimal compression.
This encoding is an instance of Huffman coding.
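A short sketch of building such a code with Python's `heapq`; the helper `huffman_code` and its details are illustrative, not taken from the lecture:

```python
import heapq
from math import log2

def huffman_code(pmf):
    """Build a binary Huffman code for a dict {symbol: probability}."""
    # Each heap entry: (subtree probability, tie-breaker, {symbol: partial codeword}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(pmf.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        p0, _, c0 = heapq.heappop(heap)   # merge the two least likely subtrees
        p1, _, c1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c0.items()}
        merged.update({s: "1" + w for s, w in c1.items()})
        heapq.heappush(heap, (p0 + p1, count, merged))
        count += 1
    return heap[0][2]

pmf = {1: 1/8, 2: 1/8, 3: 1/4, 4: 1/2}
code = huffman_code(pmf)
avg_len = sum(pmf[s] * len(code[s]) for s in pmf)
ent = -sum(p * log2(p) for p in pmf.values())
print(code)          # codeword lengths 3, 3, 2, 1 (bit labels may differ from the table)
print(avg_len, ent)  # 1.75 1.75
```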
A few extra theorems that will not be proved in this course
- Theorem: Avg. # of bits in any prefix-free compression of $X$ is ≥ $H(X)$.
- Theorem: Huffman coding is optimal.
- I.e., its avg. # of bits is the minimum possible among prefix-free codes.
- Disadvantage of Huffman coding: the distribution must be known.
- Generally not the case.
- [Lempel-Ziv 1978]: Universal lossless data compression.
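As a small illustration (not from the lecture), Python's built-in `zlib` uses DEFLATE, which combines an LZ77-style Lempel-Ziv dictionary coder with Huffman coding and needs no prior knowledge of the source distribution:

```python
import os
import zlib

redundant = b"abcd" * 10_000        # highly redundant source: 40,000 bytes
random_bytes = os.urandom(40_000)   # near-incompressible source: uniformly random bytes

print(len(zlib.compress(redundant)))     # a small fraction of 40,000
print(len(zlib.compress(random_bytes)))  # roughly 40,000 (may even be slightly larger)
```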
Conditional and joint entropy
How do the entropies of different random variables interact?
- As a function of their dependence.
- Needed in order to relate:
- A variable and its compression.
- A variable and its transmission over a noisy channel.
- A variable and its encryption.
Definition for joint entropy
Let $X$ and $Y$ be discrete random variables with values in $\mathcal{X}$ and $\mathcal{Y}$, respectively.
The joint entropy of $X$ and $Y$ is defined as:
$$H(X, Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \operatorname{Pr}(x, y) \log_2 \operatorname{Pr}(x, y).$$
Notes:
- $H(X, Y) \leq H(X) + H(Y)$, with equality if and only if $X$ and $Y$ are independent.
Conditional entropy
Recall that the conditional probability is defined as $\operatorname{Pr}(y|x) = \frac{\operatorname{Pr}(x, y)}{\operatorname{Pr}(x)}$.
What is the average amount of information revealed by $Y$ when $X$ is already known?
For each given $x \in \mathcal{X}$, the conditional entropy of $Y$ given $X = x$ is defined as:
$$H(Y|X=x) = -\sum_{y \in \mathcal{Y}} \operatorname{Pr}(y|x) \log_2 \operatorname{Pr}(y|x).$$
The conditional entropy of $Y$ given $X$ is defined as:
$$H(Y|X) = \sum_{x \in \mathcal{X}} \operatorname{Pr}(x) H(Y|X=x) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \operatorname{Pr}(x, y) \log_2 \operatorname{Pr}(y|x).$$
Notes:
- $H(Y|X) = 0$ when $Y$ is a function of $X$.
- Conditional entropy is not necessarily symmetric: in general $H(Y|X) \neq H(X|Y)$.
- Conditioning will not increase entropy: $H(Y|X) \leq H(Y)$.
Chain rules of entropy
Joint entropy: $H(X, Y) = H(X) + H(Y|X)$.
Conditional entropy: $H(Y|X) = H(X, Y) - H(X)$.
Chain rule: $H(X_1, \ldots, X_n) = \sum_{i=1}^{n} H(X_i \mid X_1, \ldots, X_{i-1})$.
Proof
For statistically independent $X$ and $Y$, we have $\operatorname{Pr}(x, y) = \operatorname{Pr}(x)\operatorname{Pr}(y)$, and hence $H(X, Y) = H(X) + H(Y)$.
In general, writing $\operatorname{Pr}(x, y) = \operatorname{Pr}(x)\operatorname{Pr}(y|x)$:
$$\begin{aligned} H(X, Y) &= -\sum_{x, y} \operatorname{Pr}(x, y) \log_2 \operatorname{Pr}(x, y) \\ &= -\sum_{x, y} \operatorname{Pr}(x, y) \log_2 \operatorname{Pr}(x) - \sum_{x, y} \operatorname{Pr}(x, y) \log_2 \operatorname{Pr}(y|x) \\ &= H(X) + H(Y|X). \end{aligned}$$
Applying the symmetry of the joint distribution gives $H(X, Y) = H(Y) + H(X|Y)$ as well.
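A quick numerical sanity check of the chain rule on a small made-up joint distribution (the numbers are arbitrary, chosen only for illustration):

```python
from math import log2

# Joint pmf Pr(X=x, Y=y) for a toy example.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.3}

def H(pmf):
    return -sum(p * log2(p) for p in pmf.values() if p > 0)

px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0) + p
    py[y] = py.get(y, 0) + p

H_XY, H_X, H_Y = H(joint), H(px), H(py)
# Direct computation of H(Y|X) = -sum_{x,y} Pr(x,y) log2 Pr(y|x):
H_Y_given_X = -sum(p * log2(p / px[x]) for (x, y), p in joint.items())

print(H_XY, H_X + H_Y_given_X)   # equal (≈ 1.846), as the chain rule states
print(H_XY - H_Y)                # H(X|Y) by the chain rule, ≈ 0.875
```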
Example of computing the conditional entropy
For the given distribution:
| Y\X | 1 | 2 | 3 | 4 | Y marginal |
|---|---|---|---|---|---|
| 1 | 1/8 | 1/16 | 1/32 | 1/32 | 1/4 |
| 2 | 1/16 | 1/8 | 1/32 | 1/32 | 1/4 |
| 3 | 1/16 | 1/16 | 1/16 | 1/16 | 1/4 |
| 4 | 1/4 | 0 | 0 | 0 | 1/4 |
| X marginal | 1/2 | 1/4 | 1/8 | 1/8 | 1.0 |
Here $\operatorname{Pr}(x|y) = \operatorname{Pr}(x, y) / \operatorname{Pr}(y)$, so dividing each row by the $Y$ marginal gives the conditional distribution of $X$ given $Y = y$.
$H(X|Y=1) = H\left(\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{8}\right) = \frac{7}{4}$.
$H(X|Y=2) = H\left(\frac{1}{4}, \frac{1}{2}, \frac{1}{8}, \frac{1}{8}\right) = \frac{7}{4}$.
$H(X|Y=3) = H\left(\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4}\right) = 2$.
$H(X|Y=4) = H(1, 0, 0, 0) = 0$.
$H(X|Y) = \sum_{y} \operatorname{Pr}(Y=y) H(X|Y=y) = \frac{1}{4}\left(\frac{7}{4} + \frac{7}{4} + 2 + 0\right)$.
So $H(X|Y) = \frac{11}{8}$ bits; from the marginals, $H(X) = H\left(\frac{1}{2}, \frac{1}{4}, \frac{1}{8}, \frac{1}{8}\right) = \frac{7}{4}$ bits and $H(Y) = 2$ bits.
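The same computation in Python, directly from the joint table (a sketch; the variable names are illustrative):

```python
from math import log2
import numpy as np

# Joint pmf Pr(X=x, Y=y): rows indexed by Y = 1..4, columns by X = 1..4.
P = np.array([
    [1/8,  1/16, 1/32, 1/32],
    [1/16, 1/8,  1/32, 1/32],
    [1/16, 1/16, 1/16, 1/16],
    [1/4,  0,    0,    0   ],
])

pY = P.sum(axis=1)   # Y marginal: [1/4, 1/4, 1/4, 1/4]
pX = P.sum(axis=0)   # X marginal: [1/2, 1/4, 1/8, 1/8]

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# H(X|Y) = sum_y Pr(Y=y) * H(X | Y=y); each row of P divided by pY[y] is Pr(x|y).
H_X_given_Y = sum(pY[y] * H(P[y] / pY[y]) for y in range(4))
print(H_X_given_Y)    # 1.375 = 11/8 bits
print(H(pX), H(pY))   # 1.75 = 7/4 bits and 2.0 bits
```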
Mutual information
Definition for mutual information
The mutual information of $X$ and $Y$ is defined as:
$$I(X; Y) = H(X) - H(X|Y) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} \operatorname{Pr}(x, y) \log_2 \frac{\operatorname{Pr}(x, y)}{\operatorname{Pr}(x) \operatorname{Pr}(y)}.$$
Properties of mutual information
- $I(X; Y) \geq 0$, with equality if and only if $X$ and $Y$ are independent.
- Proved via Jensen's inequality.
- Conditioning will not increase entropy: $I(X; Y) = H(X) - H(X|Y) \geq 0$, so $H(X|Y) \leq H(X)$.
- Symmetry: $I(X; Y) = I(Y; X)$.
Example of computing the mutual information
For the given distribution:
| Y\X | 1 | 2 | 3 | 4 | Y marginal |
|---|---|---|---|---|---|
| 1 | 1/8 | 1/16 | 1/32 | 1/32 | 1/4 |
| 2 | 1/16 | 1/8 | 1/32 | 1/32 | 1/4 |
| 3 | 1/16 | 1/16 | 1/16 | 1/16 | 1/4 |
| 4 | 1/4 | 0 | 0 | 0 | 1/4 |
| X marginal | 1/2 | 1/4 | 1/8 | 1/8 | 1.0 |
Recall from the previous example: $H(X) = \frac{7}{4}$ bits and $H(X|Y) = \frac{11}{8}$ bits;
then the mutual information is:
$$I(X; Y) = H(X) - H(X|Y) = \frac{7}{4} - \frac{11}{8} = \frac{3}{8} \text{ bits}.$$
So the entropy of $X$ is reduced by $\frac{3}{8}$ bits on average when $Y$ is known.
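The same value can be checked directly from the joint table via $I(X;Y) = \sum_{x,y} \operatorname{Pr}(x,y) \log_2 \frac{\operatorname{Pr}(x,y)}{\operatorname{Pr}(x)\operatorname{Pr}(y)}$ (a sketch; the matrix is restated so the snippet is self-contained):

```python
from math import log2
import numpy as np

# Joint pmf Pr(X=x, Y=y): rows indexed by Y = 1..4, columns by X = 1..4.
P = np.array([
    [1/8,  1/16, 1/32, 1/32],
    [1/16, 1/8,  1/32, 1/32],
    [1/16, 1/16, 1/16, 1/16],
    [1/4,  0,    0,    0   ],
])
pY = P.sum(axis=1)
pX = P.sum(axis=0)

I = sum(P[y, x] * log2(P[y, x] / (pX[x] * pY[y]))
        for y in range(4) for x in range(4) if P[y, x] > 0)
print(I)   # 0.375 = 3/8 bits, matching H(X) - H(X|Y) = 7/4 - 11/8
```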
Applications
Channel capacity
Recall that a channel is a tuple $(\mathcal{X}, \mathcal{Y}, \operatorname{Pr}(y|x))$.
Definition of discrete channel
A discrete channel is a system consisting of a discrete input alphabet $\mathcal{X}$, a discrete output alphabet $\mathcal{Y}$, and a probability transition matrix $\operatorname{Pr}(y|x)$ giving the probability of observing output $y$ when input $x$ is sent.
The channel is memoryless if the output at any time depends only on the input at that time and not on the inputs and outputs at other times.
Definition of channel capacity
The channel capacity of a discrete memoryless channel is defined as:
$$C = \max_{\operatorname{Pr}(x)} I(X; Y),$$
where $I(X; Y)$ is the mutual information between the channel input $X$ and the channel output $Y$, and the maximum is taken over all input distributions $\operatorname{Pr}(x)$.
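A sketch of estimating this maximum numerically for a binary-input channel, assuming a transition matrix `W[x, y] = Pr(y|x)` and a grid search over the input distribution $(1-q, q)$; the function names and the grid-search approach are illustrative, not part of the lecture:

```python
from math import log2
import numpy as np

def mutual_information(p_x, W):
    """I(X;Y) in bits for input distribution p_x and channel matrix W[x, y] = Pr(y|x)."""
    joint = p_x[:, None] * W       # Pr(x, y) = Pr(x) * Pr(y|x)
    p_y = joint.sum(axis=0)        # output marginal
    return sum(joint[x, y] * log2(joint[x, y] / (p_x[x] * p_y[y]))
               for x in range(W.shape[0]) for y in range(W.shape[1])
               if joint[x, y] > 0)

def capacity_binary_input(W, grid=10_001):
    """Grid-search estimate of C = max over Pr(X=1)=q of I(X;Y), for a binary input."""
    return max(mutual_information(np.array([1 - q, q]), W) for q in np.linspace(0, 1, grid))

# Example: a noiseless binary channel (identity transition matrix) has capacity 1 bit.
print(capacity_binary_input(np.eye(2)))   # ≈ 1.0
```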
Shannon’s noisy coding theorem
Recall that the rate of a code of length $n$ and size $M$ is $R = \frac{\log_2 M}{n}$.
For a discrete memoryless channel with capacity $C$, every rate $R < C$ is achievable.
- There exists a sequence of codes of length $n$ and size $2^{nR}$ with probability of error vanishing as $n \to \infty$.
- In one channel use, it is impossible to beat the noise.
- In $n$ channel uses, as $n \to \infty$, it is possible to beat the noise.
Corollary for the binary symmetric channel
The capacity of the binary symmetric channel with crossover probability $p$ is $C = 1 - H(p)$.
Proof
For every input $x$, the output $Y$ differs from $x$ with probability $p$, so $H(Y|X=x) = H(p)$. So,
$$\begin{aligned} H(Y|X) &= \sum_{x \in \mathcal{X}} \operatorname{Pr}(x) H(Y|X=x) \\ &= H(p) \sum_{x \in \mathcal{X}} \operatorname{Pr}(x) \\ &= H(p). \end{aligned}$$
So,
$$\begin{aligned} I(X;Y) &= H(Y) - H(Y|X) \\ &= H(Y) - H(p) \\ &\leq 1 - H(p), \end{aligned}$$
since $Y$ is binary and hence $H(Y) \leq 1$, with equality when the input (and therefore the output) is uniform. Thus $C = 1 - H(p)$.
When $p = 0$, the capacity is $1$ (noiseless channel).
When $p = \frac{1}{2}$, the capacity is $0$ (completely noisy channel).
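A self-contained numerical check of this formula, maximizing $I(X;Y) = H(Y) - H(p)$ over the input distribution $\operatorname{Pr}(X=1) = q$ by grid search (a sketch; the helper names are illustrative):

```python
from math import log2
import numpy as np

def Hb(p):
    """Binary entropy function in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * log2(p) - (1 - p) * log2(1 - p)

def bsc_mutual_information(q, p):
    """I(X;Y) for a BSC with crossover probability p and input distribution Pr(X=1) = q."""
    r = q * (1 - p) + (1 - q) * p   # Pr(Y=1)
    return Hb(r) - Hb(p)            # H(Y) - H(Y|X)

for p in (0.0, 0.1, 0.25, 0.5):
    est = max(bsc_mutual_information(q, p) for q in np.linspace(0, 1, 1001))
    print(p, est, 1 - Hb(p))   # numerical maximum ≈ 1 - H(p), achieved at q = 1/2
```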