Math 401, Notes 3

Coding and Information Theory Crash Course

Encoding

Let $A,B$ be two finite sets of sizes $a,b$ respectively.

Let $S(A)=\bigcup_{r=1}^{\infty}A^r$ be the word semigroup generated by $A$.

A one-to-one mapping $f:A\to S(B)$ is called a code with message alphabet $A$ and encoded alphabet $B$.

Example:

  • $A=$ the RGB color space
  • $B=\{0,1,\dots,255\}$
  • $f:A\to B^n$ is a code

For example, $f(\text{white})=(255,255,255)$ and $f(\text{green})=(0,255,0)$.
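As a minimal sketch, such a code can be written down directly as a lookup table; the color names and values below are only illustrative.

```python
# A code f : A -> B^3 for the RGB example, written as a lookup table.
# The symbols and values are illustrative, not an official palette.
B = range(256)  # encoded alphabet B = {0, 1, ..., 255}

f = {
    "white": (255, 255, 255),
    "green": (0, 255, 0),
    "black": (0, 0, 0),
}

# f is a code because it is one-to-one: distinct symbols get distinct codewords.
assert len(set(f.values())) == len(f)
```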

Uniquely decipherable codes

A code $f:A\to S(B)$ is called uniquely decipherable if the extension code

$$\tilde{f}:S(A)\to S(B),\qquad \tilde{f}(a_1a_2\cdots a_n)=f(a_1)f(a_2)\cdots f(a_n)$$

is one-to-one.

Example:

  • $A=\{a,b,c,d\}$
  • $B=\{0,1\}$
  • $f(a)=00$, $f(b)=01$, $f(c)=10$, $f(d)=11$

is uniquely decipherable.

  • $f(a)=0$, $f(b)=1$, $f(c)=10$, $f(d)=11$

is not uniquely decipherable.

Indeed, $\tilde{f}(ba)=10=\tilde{f}(c)$.
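A quick sketch of this collision, encoding words by concatenating codewords (the helper name `encode` is just for illustration):

```python
# Encode a word over A by concatenating the codewords of its letters.
def encode(f, word):
    return "".join(f[x] for x in word)

good = {"a": "00", "b": "01", "c": "10", "d": "11"}
bad = {"a": "0", "b": "1", "c": "10", "d": "11"}

# With the second code, two different messages encode to the same string:
assert encode(bad, "ba") == encode(bad, "c") == "10"
# With the first (fixed-length) code they do not:
assert encode(good, "ba") != encode(good, "c")
```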

Irreducible codes

A code $f:A\to S(B)$ is called irreducible if no codeword is a prefix of another: for any $x,y\in A$ with $x\neq y$, there is no $w\in S(B)$ such that $f(y)=f(x)w$.

This condition ensures that every irreducible code is uniquely decipherable: an encoded string can be read off one codeword at a time, so there is no ambiguity in the decoding process.
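Checking irreducibility is a simple pairwise prefix test; a minimal sketch (the function name `is_irreducible` is just for illustration):

```python
# Test whether a set of codewords is prefix-free (irreducible).
def is_irreducible(codewords):
    codewords = list(codewords)
    for u in codewords:
        for v in codewords:
            if u != v and v.startswith(u):
                return False  # u is a proper prefix of v
    return True

assert is_irreducible(["00", "01", "10", "11"])      # fixed-length code
assert not is_irreducible(["0", "1", "10", "11"])    # 1 is a prefix of 10
```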

Theorem 1.1.1 Sardinas and Patterson Theorem

Let $A,B$ be alphabet sets of size $a,b$ respectively.

Let $f:A\to S(B)$ be a code that is uniquely decipherable.

Then

$$\sum_{x\in A}b^{-l(f(x))}\leq 1$$

where $l(f(x))$ is the length of the codeword $f(x)$ (the number of letters of $B$ used in the codeword for $x$).
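For example, for the two binary codes above ($b=2$): the fixed-length code $\{00,01,10,11\}$ gives $4\cdot 2^{-2}=1\leq 1$, while the code $\{0,1,10,11\}$ gives $2^{-1}+2^{-1}+2^{-2}+2^{-2}=\tfrac{3}{2}>1$, consistent with its failure to be uniquely decipherable.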

Proof:

Let $L$ denote the maximum length of a codeword,

$$L=\max\{l(f(x))\mid x\in A\},$$

and let $c_r$ be the number of codewords of length $r$.

Then

$$\begin{aligned} \sum_{x\in A}b^{-l(f(x))}&=\sum_{r=1}^{L}\sum_{l(f(x))=r}b^{-l(f(x))}\\ &=\sum_{r=1}^{L}c_rb^{-r} \end{aligned}$$

Since $f$ is one-to-one and there are only $b^r$ strings of length $r$ over $B$, we have $c_r\leq b^r$. This alone only bounds the sum by $L$, so we exploit unique decipherability: for any $n\geq 1$,

$$\left(\sum_{r=1}^{L}c_rb^{-r}\right)^n=\sum_{s=n}^{nL}\left(\sum_{r_1+\cdots+r_n=s}c_{r_1}\cdots c_{r_n}\right)b^{-s},$$

where the inner sum counts the ways to concatenate $n$ codewords into a string of length $s$. Since $\tilde{f}$ is one-to-one, distinct concatenations give distinct strings in $B^s$, so the inner sum is at most $b^s$. Hence

$$\left(\sum_{x\in A}b^{-l(f(x))}\right)^n\leq\sum_{s=n}^{nL}b^s\,b^{-s}\leq nL,$$

and taking $n$-th roots gives $\sum_{x\in A}b^{-l(f(x))}\leq(nL)^{1/n}\to 1$ as $n\to\infty$, which proves the inequality. $\blacksquare$

Sardinas and Patterson Algorithm

Let $A=\{a_1,a_2,\cdots,a_n\}$ be the message alphabet and $B=\{b_1,b_2,\cdots,b_m\}$ be the encoded alphabet.

We test whether a code $f:A\to S(B)$ is uniquely decipherable.

```python
def is_uniquely_decipherable(code):
    """Sardinas-Patterson test.

    code: dict mapping each symbol x in A to its codeword f(x),
          given as a string over the encoded alphabet B.
    Returns True iff the code is uniquely decipherable.
    """
    codewords = set(code.values())
    # A code must be one-to-one; repeated codewords already give two parsings.
    if len(codewords) != len(code):
        return False

    # Initial dangling suffixes: whenever one codeword is a proper prefix
    # of another, the leftover piece is a dangling suffix.
    suffixes = set()
    for u in codewords:
        for v in codewords:
            if u != v and v.startswith(u):
                suffixes.add(v[len(u):])

    seen = set()
    while suffixes:
        new_suffixes = set()
        for s in suffixes:
            if s in codewords:
                # A dangling suffix that is itself a codeword produces a
                # string with two different decodings, so the code fails.
                return False
            for w in codewords:
                if w.startswith(s):
                    new_suffixes.add(w[len(s):])
                elif s.startswith(w):
                    new_suffixes.add(s[len(w):])
        seen |= suffixes
        suffixes = new_suffixes - seen  # only follow suffixes not seen before

    return True
```
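A quick check on the two example codes from the previous section:

```python
# The fixed-length code is uniquely decipherable; the second code is not.
assert is_uniquely_decipherable({"a": "00", "b": "01", "c": "10", "d": "11"})
assert not is_uniquely_decipherable({"a": "0", "b": "1", "c": "10", "d": "11"})
```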

Shannon’s source coding theorem

Definition 1.1.4

An elementary information source is a pair $(A,\mu)$ where $A$ is an alphabet and $\mu$ is a probability distribution on $A$; that is, $\mu$ is a function $\mu:A\to[0,1]$ such that $\sum_{a\in A}\mu(a)=1$.

The mean code word length of an information source $(A,\mu)$ given a code $f:A\to S(B)$ is defined as

$$\overline{l}(\mu,f)=\sum_{a\in A}\mu(a)l(f(a))$$

The minimum $L(\mu)$ of the mean code word length is defined as

$$L(\mu)=\min\{\overline{l}(\mu,f)\mid f:A\to S(B)\text{ is uniquely decipherable}\}$$
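A minimal sketch of the mean code word length for a given code and distribution (the symbols and values below are illustrative):

```python
# Mean code word length: sum over a of mu(a) * (length of f(a)).
def mean_codeword_length(mu, f):
    return sum(p * len(f[a]) for a, p in mu.items())

mu = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
f = {"a": "0", "b": "10", "c": "110", "d": "111"}   # an irreducible code
print(mean_codeword_length(mu, f))                  # 1.75
```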

Lemma: Jensen's inequality

Let $f$ be a convex function on the interval $(a,b)$. Then for any $x_1,x_2,\cdots,x_n\in (a,b)$ and $\lambda_1,\lambda_2,\cdots,\lambda_n\in [0,1]$ such that $\sum_{i=1}^{n}\lambda_i=1$, we have

$$f\left(\sum_{i=1}^{n}\lambda_ix_i\right)\leq \sum_{i=1}^{n}\lambda_if(x_i)$$
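For example, with $f(x)=x^2$, $n=2$ and $\lambda_1=\lambda_2=\tfrac12$, the inequality reads $\left(\frac{x_1+x_2}{2}\right)^2\leq\frac{x_1^2+x_2^2}{2}$, which is equivalent to $(x_1-x_2)^2\geq 0$.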

Proof:

If $f$ is a convex function, three properties are useful for the proof:

  1. $f''(x)\geq 0$ for all $x\in (a,b)$ (when $f$ is twice differentiable)
  2. For any $x,y\in (a,b)$, $f(x)\geq f(y)+(x-y)f'(y)$ (the graph lies above its tangent line at $y$)
  3. For any $x,y\in (a,b)$ and $0<\lambda<1$, $f(\lambda x+(1-\lambda)y)\leq \lambda f(x)+(1-\lambda)f(y)$ (the graph lies below the chord connecting $(x,f(x))$ and $(y,f(y))$)

Using property 2, $f(x)\geq f(y)+(x-y)f'(y)$, with $y=\sum_{i=1}^{n}\lambda_ix_i$ and $x=x_j$, we have

$$f(x_j)\geq f\left(\sum_{i=1}^{n}\lambda_ix_i\right)+\left(x_j-\sum_{i=1}^{n}\lambda_ix_i\right)f'\left(\sum_{i=1}^{n}\lambda_ix_i\right)$$

Multiplying the $j$-th inequality by $\lambda_j$ and summing over $j$, we have

$$\begin{aligned} \sum_{j=1}^{n}\lambda_j f(x_j)&\geq \sum_{j=1}^{n}\lambda_jf\left(\sum_{i=1}^{n}\lambda_ix_i\right)+\sum_{j=1}^{n}\lambda_j\left(x_j-\sum_{i=1}^{n}\lambda_ix_i\right)f'\left(\sum_{i=1}^{n}\lambda_ix_i\right)\\ &=\sum_{j=1}^{n}\lambda_jf\left(\sum_{i=1}^{n}\lambda_ix_i\right)+0\\ &=f\left(\sum_{i=1}^{n}\lambda_ix_i\right) \end{aligned}$$

since $\sum_{j=1}^{n}\lambda_j\left(x_j-\sum_{i=1}^{n}\lambda_ix_i\right)=0$ and $\sum_{j=1}^{n}\lambda_j=1$. $\blacksquare$

Theorem 1.1.5

Shannon’s source coding theorem

Let $(A,\mu)$ be an elementary information source and let $b$ be the size of the encoded alphabet $B$.

Then

$$\frac{-\sum_{x\in A}\mu(x)\log\mu(x)}{\log b}\leq L(\mu)<\frac{-\sum_{x\in A}\mu(x)\log\mu(x)}{\log b}+1$$

where $L(\mu)$ is the minimum mean code word length over all uniquely decipherable codes for $(A,\mu)$.
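For example, take $A=\{a_1,a_2,a_3,a_4\}$ with $\mu=\left(\tfrac12,\tfrac14,\tfrac18,\tfrac18\right)$ and $b=2$. The entropy is $-\sum_{x\in A}\mu(x)\log_2\mu(x)=\tfrac12\cdot1+\tfrac14\cdot2+2\cdot\tfrac18\cdot3=1.75$ bits, so $1.75\leq L(\mu)<2.75$; the irreducible code $\{0,10,110,111\}$ has mean length exactly $1.75$, so here $L(\mu)=1.75$.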

Proof

First, we show that for any map $m:A\to \mathbb{Z}_+$ satisfying

$$\sum_{a\in A}b^{-m(a)}\leq 1,$$

we have

$$\sum_{a\in A}m(a)\mu(a)\geq -(\log b)^{-1}\sum_{a\in A}\mu(a)\log \mu(a).$$

By Theorem 1.1.1, the lengths $m(a)=l(f(a))$ of any uniquely decipherable code $f$ satisfy this hypothesis, so this gives the lower bound on $L(\mu)$.

We define the probability distribution $v$ on $A$ as

$$v(x)=\frac{b^{-m(x)}}{\sum_{a\in A}b^{-m(a)}}=\frac{b^{-m(x)}}{T},$$

where $T=\sum_{a\in A}b^{-m(a)}$, so that $T\leq 1$.

Since $b^{-m(a)}=T\cdot v(a)$, taking logarithms gives

$$-m(a)\log b=\log T+\log v(a),\qquad m(a)=-\frac{\log T}{\log b}-\frac{\log v(a)}{\log b}.$$

We have

$$\begin{aligned} \sum_{a\in A}m(a)\mu(a)&=\sum_{a\in A}\left(-\frac{\log T}{\log b}-\frac{\log v(a)}{\log b}\right)\mu(a)\\ &=-\frac{\log T}{\log b}\sum_{a\in A}\mu(a)-\frac{1}{\log b}\sum_{a\in A}\mu(a)\log v(a)\\ &=(\log b)^{-1}\left\{-\log T - \sum_{a\in A}\mu(a)\log v(a)\right\} \end{aligned}$$

Without loss of generality, we assume that $\mu(a)>0$ for all $a\in A$.

$$\begin{aligned} -\sum_{a\in A}\mu(a)\log v(a)&=-\sum_{a\in A}\mu(a)\left(\log \mu(a)+\log v(a)-\log \mu(a)\right)\\ &=-\sum_{a\in A}\mu(a)\log \mu(a)-\sum_{a\in A}\mu(a)\log \frac{v(a)}{\mu(a)}\\ &=-\sum_{a\in A}\mu(a)\log \mu(a)-\log \prod_{a\in A}\left(\frac{v(a)}{\mu(a)}\right)^{\mu(a)} \end{aligned}$$

Note that $\log \prod_{a\in A}\left(\frac{v(a)}{\mu(a)}\right)^{\mu(a)}=\sum_{a\in A}\mu(a)\log \frac{v(a)}{\mu(a)}$; its negative, $\sum_{a\in A}\mu(a)\log \frac{\mu(a)}{v(a)}$, is the Kullback–Leibler divergence (relative entropy) of $\mu$ from $v$.

Since $\log$ is a concave function, Jensen's inequality holds with the inequality reversed, $\sum_{i=1}^{n}\lambda_i\log x_i\leq \log\left(\sum_{i=1}^{n}\lambda_ix_i\right)$, so we have

$$\begin{aligned} \sum_{a\in A}\mu(a)\log \frac{v(a)}{\mu(a)}&\leq\log\left(\sum_{a\in A}\mu(a) \frac{v(a)}{\mu(a)}\right)\\ \log\left(\prod_{a\in A}\left(\frac{v(a)}{\mu(a)}\right)^{\mu(a)}\right)&\leq \log\left(\sum_{a\in A}\mu(a) \frac{v(a)}{\mu(a)}\right)\\ \prod_{a\in A}\left(\frac{v(a)}{\mu(a)}\right)^{\mu(a)}&\leq \sum_{a\in A}\mu(a)\frac{v(a)}{\mu(a)}=\sum_{a\in A}v(a)=1 \end{aligned}$$

So

$$-\sum_{a\in A}\mu(a)\log v(a)\geq -\sum_{a\in A}\mu(a)\log \mu(a)$$

(This is also known as Gibbs' inequality: in words, the entropy of a distribution is less than or equal to its cross entropy with any other distribution.)

Since $T\leq 1$, we have $-\log T\geq 0$, so

$$\sum_{a\in A}m(a)\mu(a)=(\log b)^{-1}\left\{-\log T-\sum_{a\in A}\mu(a)\log v(a)\right\}\geq -(\log b)^{-1}\sum_{a\in A}\mu(a)\log \mu(a).$$

In particular, every uniquely decipherable code $f$ satisfies $\overline{l}(\mu,f)\geq -(\log b)^{-1}\sum_{a\in A}\mu(a)\log\mu(a)$, which is the left-hand inequality of the theorem.

Second, we show that there is an irreducible (hence uniquely decipherable) code $f$ with

$$\overline{l}(\mu,f)< -(\log b)^{-1}\sum_{a\in A}\mu(a)\log \mu(a)+1,$$

which gives the right-hand inequality. Define $m(a)=\left\lceil-\frac{\log\mu(a)}{\log b}\right\rceil$. Then $b^{-m(a)}\leq\mu(a)$, so $\sum_{a\in A}b^{-m(a)}\leq 1$, and therefore there exists an irreducible code $f:A\to S(B)$ with $l(f(a))=m(a)$ for all $a\in A$ (the converse direction of the inequality in Theorem 1.1.1, which holds for irreducible codes). Moreover $m(a)<1-\frac{\log\mu(a)}{\log b}$.

So

$$\begin{aligned} \overline{l}(\mu,f)&=\sum_{a\in A}\mu(a)m(a)\\ &<\sum_{a\in A}\mu(a)\left(1-\frac{\log\mu(a)}{\log b}\right)\\ &=1-(\log b)^{-1}\sum_{a\in A}\mu(a)\log\mu(a), \end{aligned}$$

and therefore $L(\mu)\leq\overline{l}(\mu,f)< -(\log b)^{-1}\sum_{a\in A}\mu(a)\log \mu(a)+1$. $\blacksquare$
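A small sketch of this second half of the argument, computing the lengths $m(a)=\lceil-\log_b\mu(a)\rceil$ for a sample distribution and checking the Kraft sum and the mean-length bound (the distribution is just an example):

```python
import math

def shannon_lengths(mu, b=2):
    # m(a) = ceil(-log_b mu(a)); these lengths satisfy the Kraft inequality.
    return {a: math.ceil(-math.log(p, b)) for a, p in mu.items()}

mu = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
m = shannon_lengths(mu)

kraft_sum = sum(2 ** -m[a] for a in mu)
entropy = -sum(p * math.log2(p) for p in mu.values())
mean_len = sum(mu[a] * m[a] for a in mu)

assert kraft_sum <= 1                      # so an irreducible code with these lengths exists
assert entropy <= mean_len < entropy + 1   # the bounds from the proof
```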

Entropy

Let $A$ be an alphabet and let $\mathcal{P}(A)$ be the set of all probability distributions on $A$. Any element $\mu\in \mathcal{P}(A)$ is thus a map from $A$ to $[0,1]$ such that $\sum_{a\in A}\mu(a)=1$.

Thus $\mathcal{P}(A)$ is a compact convex subset of the real linear space $\mathbb{R}^A$.

We define the map $H:\mathcal{P}(A)\to\mathbb{R}$ as

$$H(\mu)=-\sum_{a\in A}\mu(a)\log_2\mu(a)$$

Basic properties of $H$

  1. $H(\mu)\geq 0$
  2. $H(\mu)=0$ if and only if $\mu$ is a point mass.
  3. $H(\mu)\leq\log_2|A|$, with equality if and only if $\mu$ is uniform.
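A minimal sketch of $H$ and a numerical check of these properties, using the convention $0\log_2 0=0$ (the helper name and distributions are illustrative):

```python
import math

def H(mu):
    # Entropy in bits, with the convention 0 * log2(0) = 0.
    return -sum(p * math.log2(p) for p in mu.values() if p > 0)

point_mass = {"a": 1.0, "b": 0.0, "c": 0.0, "d": 0.0}
uniform = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}

print(H(point_mass))  # 0.0
print(H(uniform))     # 2.0 = log2(4)
```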

Huffman’s coding algorithm

Huffman’s coding algorithm is a greedy algorithm that constructs an optimal prefix code for a given probability distribution.

The algorithm is as follows:

  1. Sort the symbols by their probabilities.
  2. Merge the two symbols with the smallest probabilities into a new symbol whose probability is their sum.
  3. Repeat step 2 until only one symbol remains; then assign codewords by walking back through the merges, appending 0 to one branch and 1 to the other at each merge.
```python
import heapq

def huffman(frequencies):
    class Node:
        def __init__(self, symbol=None, freq=0, left=None, right=None):
            self.symbol = symbol
            self.freq = freq
            self.left = left
            self.right = right

        def __lt__(self, other):  # For priority queue
            return self.freq < other.freq

        def is_leaf(self):
            return self.symbol is not None

    # Build the Huffman tree
    heap = [Node(sym, freq) for sym, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        left = heapq.heappop(heap)
        right = heapq.heappop(heap)
        merged = Node(freq=left.freq + right.freq, left=left, right=right)
        heapq.heappush(heap, merged)
    root = heap[0]

    # Assign codes by traversing the tree
    codebook = {}

    def assign_codes(node, code=""):
        if node.is_leaf():
            codebook[node.symbol] = code
        else:
            assign_codes(node.left, code + "0")
            assign_codes(node.right, code + "1")

    assign_codes(root)

    # Helper to pretty print the binary tree
    def tree_repr(node, prefix=""):
        if node.is_leaf():
            return f"{prefix}{repr(node.symbol)} ({node.freq})\n"
        else:
            s = f"{prefix}• ({node.freq})\n"
            s += tree_repr(node.left, prefix + " 0 ")
            s += tree_repr(node.right, prefix + " 1 ")
            return s

    print("Huffman Codebook:")
    for sym in sorted(codebook, key=lambda s: frequencies[s], reverse=True):
        print(f"{repr(sym)}: {codebook[sym]}")
    print("\nCoding Tree:")
    print(tree_repr(root))
    return codebook
```
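For example, running it on the distribution used earlier (the frequencies are just illustrative weights):

```python
freqs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
codebook = huffman(freqs)
# The codeword lengths are (1, 2, 3, 3), e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
# up to swapping 0/1 branches; the mean code word length is 1.75 bits, matching the entropy.
```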

Definition 1.2.1

Let $A=\{a_1,a_2,\cdots,a_n\}$ be an alphabet and let $\mu=(\mu(a_1),\mu(a_2),\cdots,\mu(a_n))$ be a probability distribution on $A$. A code $f:A\to S(B)$ is called optimal for $(A,\mu)$ if $\overline{l}(\mu,f)=L(\mu)$.

Claim: Huffman's algorithm produces an optimal prefix code for $(A,\mu)$ over $B=\{0,1\}$.

Proof:

We use mathematical induction on the number of symbols $n$.

Base Case: $n=2$

With only two symbols, assign one the codeword $0$ and the other $1$. This is clearly optimal, since it is the only possible choice up to relabeling.

Inductive Step:

Assume that Huffman's algorithm produces an optimal code for $n-1$ symbols.

Now, given $n$ symbols, Huffman merges the two least probable symbols $a,b$ into a new symbol $c$ with $\mu(c)=\mu(a)+\mu(b)$, and recursively constructs a code for the reduced alphabet $A'=(A\setminus\{a,b\})\cup\{c\}$ of $n-1$ symbols.

By the inductive hypothesis, the code on $A'$ is optimal.

By Step 2 above, assigning the merged symbols $a$ and $b$ the codewords $w0$ and $w1$, where $w$ is the codeword of $c$, results in an optimal code for $A$: the mean code word length (cf. Definition 1.1.4) increases by exactly $\mu(a)+\mu(b)$, and an exchange argument shows that some optimal code for $A$ also has its two least probable symbols as sibling leaves of maximal length, so no code for $A$ can do better.

Therefore, by induction, Huffman's algorithm gives an optimal prefix code for any $n$.
