
CSE5313 Coding and information theory for data science (Lecture 2)

Review of channel coding

Let $F$ be the input alphabet and $\Phi$ be the output alphabet.

e.g. $F=\{0,1\}$, $\Phi=\mathbb{R}$.

The channel introduces noise: $\operatorname{Pr}(c'\text{ received}\mid c\text{ transmitted})$.

We use $u$ to denote the information word to be transmitted,

$c$ to denote the codeword,

$c'$ to denote the received word, which is given to the decoder,

and $u'$ to denote the decoded information word.

An error occurs if $u' \neq u$.

Example:

Binary symmetric channel (BSC)

$F=\Phi=\{0,1\}$

Every bit of $c$ is flipped independently with probability $p$.

Binary erasure channel (BEC)

$F=\{0,1\}$, $\Phi=\{0,1,*\}$; very common in practice when we are unsure whether a bit was received.

$c$ is transmitted, $c'$ is received.

Each entry of $c'$ equals the corresponding entry of $c$ with probability $1-p$, and is erased (replaced by $*$) with probability $p$.
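Below is a minimal simulation sketch of these two channels, assuming independent per-symbol noise; the codeword and the value of $p$ are only illustrative.

```python
import random

def bsc(c, p):
    """Binary symmetric channel: flip each bit of c independently with probability p."""
    return [bit ^ 1 if random.random() < p else bit for bit in c]

def bec(c, p):
    """Binary erasure channel: erase each bit of c (replace it by '*') with probability p."""
    return ['*' if random.random() < p else bit for bit in c]

c = [0, 1, 1, 0, 1]
print(bsc(c, 0.1))  # e.g. [0, 1, 0, 0, 1]
print(bec(c, 0.1))  # e.g. [0, '*', 1, 0, 1]
```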

Encoding

An encoder $E$ is a function from $F^k$ to $F^n$,

where $E(u)=c$ is the codeword.

Assume $n\geq k$; we do not compress the information.

A code $\mathcal{C}$ is a subset of $F^n$.

Encoding is a one-to-one mapping from $F^k$ to $\mathcal{C}$.

In practice, we usually choose $\mathcal{C}\subseteq F^n$ of size $|F|^k$.

Decoding

A decoder $D$ is a function from $\Phi^n$ to $\mathcal{C}$.

$D(c')=\hat{c}$

The decoder then outputs the unique $u'$ such that $E(u')=\hat{c}$.

Our aim is to have $u'=u$.

Decoding error probability: $P_{err}=\max_{c\in \mathcal{C}}P_{err}(c)$,

where $P_{err}(c)=\sum_{y\,:\,D(y)\neq c}\operatorname{Pr}(y\text{ received}\mid c\text{ transmitted})$.

Our goal is to construct a decoder $D$ such that $P_{err}$ is as small as possible.

Example:

Repetition code over the binary symmetric channel:

Let $F=\Phi=\{0,1\}$. Every bit of $c$ is flipped with probability $p$.

Say $k=1$, $n=3$, and let $\mathcal{C}=\{000,111\}$.

Let the encoder be $E(u)=uuu$.

The decoder is $D(000)=D(100)=D(010)=D(001)=0$ and $D(110)=D(101)=D(011)=D(111)=1$.

Exercise: Compute the error probability of the repetition code over the binary symmetric channel.

Solution

Recall that $P_{err}(c)=\sum_{y\,:\,D(y)\neq c}\operatorname{Pr}(y\text{ received}\mid c\text{ transmitted})$.

Using a binomial random variable:

$$\begin{aligned} P_{err}(000)&=\sum_{y\,:\,D(y)\neq 000}\operatorname{Pr}(y\text{ received}\mid 000\text{ transmitted})\\ &=\operatorname{Pr}(2\text{ flips or more})\\ &=\binom{n}{2}p^2(1-p)+\binom{n}{3}p^3\\ &=3p^2(1-p)+p^3 \end{aligned}$$

The computation is identical for $111$.

$P_{err}=\max\{P_{err}(000),P_{err}(111)\}=P_{err}(000)=3p^2(1-p)+p^3$.
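As a sanity check, here is a small sketch that recomputes this probability by enumerating all received words; majority-vote decoding and $p=0.1$ are assumptions used only for illustration.

```python
from itertools import product

def p_err_repetition(p):
    """Exact decoding error probability of the (n=3, k=1) repetition code
    over a BSC(p) under majority-vote decoding."""
    def pr_received(y, c):
        # Probability of receiving word y given that codeword c was transmitted.
        prob = 1.0
        for yi, ci in zip(y, c):
            prob *= p if yi != ci else 1 - p
        return prob

    def decode(y):
        return 0 if y.count(0) >= 2 else 1  # majority vote

    c = (0, 0, 0)  # by symmetry, P_err(000) = P_err(111)
    return sum(pr_received(y, c) for y in product([0, 1], repeat=3) if decode(y) != 0)

p = 0.1
print(p_err_repetition(p))        # ~0.028
print(3 * p**2 * (1 - p) + p**3)  # ~0.028, matches the closed form
```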

Maximum likelihood principle

For $p\leq 1/2$, the decoder in the last example is a maximum likelihood decoder.

Notice that $\operatorname{Pr}(c'=000\mid c=000)=(1-p)^3$ and $\operatorname{Pr}(c'=000\mid c=111)=p^3$.

  • If $p\leq 1/2$, then $(1-p)^3\geq p^3$, so given $c'=000$, $c=000$ is more likely to have been transmitted than $c=111$.

Similarly, $\operatorname{Pr}(c'=001\mid c=000)=(1-p)^2p$ and $\operatorname{Pr}(c'=001\mid c=111)=p^2(1-p)$.

  • If $p\leq 1/2$, then $(1-p)^2p\geq p^2(1-p)$, so given $c'=001$, $c=000$ is more likely to have been transmitted than $c=111$.

For $p>1/2$, the inequalities (and hence the decisions) are reversed.

In general, the maximum likelihood decoder is $D(c')=\arg\max_{c\in \mathcal{C}}\operatorname{Pr}(c'\text{ received}\mid c\text{ transmitted})$.
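A minimal sketch of this rule for the BSC follows; the code and the value of $p$ are illustrative, and for $p\leq 1/2$ this coincides with nearest-neighbor decoding.

```python
def ml_decode_bsc(received, code, p):
    """Maximum likelihood decoding over a BSC(p):
    return the codeword c maximizing Pr(received | c transmitted)."""
    def likelihood(c):
        flips = sum(ri != ci for ri, ci in zip(received, c))
        return (p ** flips) * ((1 - p) ** (len(c) - flips))
    return max(code, key=likelihood)

code = [(0, 0, 0), (1, 1, 1)]
print(ml_decode_bsc((0, 0, 1), code, 0.1))  # (0, 0, 0)
```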

Defining a “good” code

Two metrics:

  • How many redundant bits are needed?
    • e.g. repetition code: $k=1$, $n=3$ sends $2$ redundant bits.
  • What is the resulting error probability?
    • Depends on the decoding function.
    • Normally, maximum likelihood decoding is assumed.
    • Should go to zero as $n$ grows.

Definition: the rate of a code is $\frac{k}{n}$.

More generally, the rate is $\frac{\log_{|F|}|\mathcal{C}|}{n}$.

Definition of information entropy

Let $X$ be a random variable over a discrete set $\mathcal{X}$.

  • That is, every $x\in \mathcal{X}$ has a probability $\operatorname{Pr}(X=x)$.

The entropy $H(X)$ of a discrete random variable $X$ is defined as:

$$H(X)=\mathbb{E}_{x\sim X}\left[\log \frac{1}{\operatorname{Pr}(x)}\right]=-\sum_{x\in \mathcal{X}}\operatorname{Pr}(x)\log \operatorname{Pr}(x)$$

When $X=\operatorname{Bernoulli}(p)$, we denote $H(X)=H(p)=-p\log p-(1-p)\log (1-p)$.

A deeper explanation will be given later in the course.
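A small sketch of the binary entropy function, measured in bits (a base-2 logarithm is assumed here, matching the BSC discussion below):

```python
import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))   # 1.0  (a fair coin carries one bit)
print(binary_entropy(0.11))  # ~0.5
```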

Which rates are possible?

Claude Shannon ’48: coding theorem for the BSC (binary symmetric channel).

Recall $r=\frac{k}{n}$.

Let $H(\cdot)$ be the binary entropy function.

For every $0\leq r<1-H(p)$,

  • there exists a sequence of codes $\mathcal{C}_1, \mathcal{C}_2,\ldots$ of rates $r_1,r_2,\ldots$ and lengths $n_1,n_2,\ldots$ with $r_i\geq r$,
  • that with maximum likelihood decoding satisfies $P_{err}\to 0$ as $i\to \infty$.

For any $R\geq 1-H(p)$,

  • any sequence $\mathcal{C}_1, \mathcal{C}_2,\ldots$ of rates $r_1,r_2,\ldots$ and lengths $n_1,n_2,\ldots$ with $r_i\geq R$,
  • under any decoding algorithm, satisfies $P_{err}\to 1$ as $i\to \infty$.

$1-H(p)$ is the capacity of the BSC.

  • Informally, the capacity is the best possible rate of a code (asymptotically).
  • This is a special case of a broader theorem (Shannon’s coding theorem).
  • We will see this later in the course.

Polar codes give an explicit construction of codes with rate arbitrarily close to capacity.

BSC capacity - Intuition

The capacity of the binary symmetric channel with crossover probability $p$ is $1-H(p)$.

A correct decoder $c'\to c$ essentially identifies two objects:

  • the codeword $c$,
  • the error word $e=c'-c$ (subtraction $\bmod 2$),
  • $c$ and $e$ are independent of each other.

A typical $e$ has $\approx np$ ones (law of large numbers), say $n(p\pm \delta)$.

Exercise:

$\operatorname{Pr}(e)=p^{n(p\pm \delta)}(1-p)^{n(1-p\mp \delta)}=2^{-n(H(p)+\epsilon)}$ for some $\epsilon$ that goes to zero as $\delta\to 0$.

Intuition

There exist $\approx 2^{nH(p)}$ typical error words.

To index those typical error words, we need $\log_2 (2^{nH(p)})=nH(p)+O(1)$ bits to identify the error word $e$.

To encode the message, we need $\log_2 |\mathcal{C}|=k$ bits.

Since we send only $n$ bits, we need $k+nH(p)+O(1)\leq n$, so $\frac{k}{n}\leq 1-H(p)$ asymptotically.

So the rate cannot exceed $1-H(p)$.
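A back-of-the-envelope instance of this counting argument; $n=1000$ and $p=0.1$ are assumptions chosen only for illustration.

```python
import math

n, p = 1000, 0.1
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # binary entropy H(p)

bits_for_error = n * H                 # ~469 bits to index a typical error word
bits_for_message = n - bits_for_error  # what remains of the n transmitted bits
print(bits_for_error, bits_for_message)  # ~469.0, ~531.0
print(bits_for_message / n, 1 - H)       # rate bound ~0.531 = 1 - H(p)
```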

Formal proof

$$\begin{aligned} \operatorname{Pr}(e)&=p^{n(p\pm \delta)}(1-p)^{n(1-p\mp \delta)}\\ &=p^{np}(1-p)^{n(1-p)}\,p^{\pm n\delta}(1-p)^{\mp n\delta} \end{aligned}$$

And

$$\begin{aligned} 2^{-n(H(p)+\epsilon)}&=2^{-n(-p\log p-(1-p)\log (1-p)+\epsilon)}\\ &=2^{np\log p}\,2^{n(1-p)\log (1-p)}\,2^{-n\epsilon}\\ &=p^{np}(1-p)^{n(1-p)}\,2^{-n\epsilon} \end{aligned}$$

So we need to check that the remaining factor $p^{\pm n\delta}(1-p)^{\mp n\delta}$ can be written as $2^{-n\epsilon}$ with $\epsilon$ going to zero as $\delta\to 0$.

Solving for $\epsilon$ (taking the $+\delta$ branch):

$$\begin{aligned} p^{n\delta}(1-p)^{-n\delta}&=2^{-n\epsilon}\\ -n\epsilon&=\delta n\log p-\delta n\log (1-p)\\ \epsilon&=\delta (\log (1-p)-\log p) \end{aligned}$$
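A quick numeric check of this identity (again with illustrative $n$ and $p$): the per-symbol log-probability of a typical error word matches $-H(p)$.

```python
import math

n, p = 1000, 0.1
H = -p * math.log2(p) - (1 - p) * math.log2(1 - p)  # binary entropy H(p)

ones = round(n * p)  # a typical error word has about np ones
log2_pr_e = ones * math.log2(p) + (n - ones) * math.log2(1 - p)
print(log2_pr_e / n)  # ~ -0.469
print(-H)             # ~ -0.469, i.e. Pr(e) ~ 2^{-n H(p)}
```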

Hamming distance

How to quantify the noise in the channel?

  • Number of flipped bits.

Definition of Hamming distance:

  • Denote $c=(c_1,c_2,\ldots,c_n)$ and $c'=(c'_1,c'_2,\ldots,c'_n)$.
  • $d_H(c,c')=\sum_{i=1}^n [c_i\neq c'_i]$, the number of positions in which $c$ and $c'$ differ.

Minimum Hamming distance:

  • Let $\mathcal{C}$ be a code.
  • $d_H(\mathcal{C})=\min_{c_1,c_2\in \mathcal{C},\,c_1\neq c_2}d_H(c_1,c_2)$.

Hamming distance is a metric.

  • $d_H(x,y)\geq 0$, with equality iff $x=y$.
  • $d_H(x,y)=d_H(y,x)$.
  • Triangle inequality: $d_H(x,y)\leq d_H(x,z)+d_H(z,y)$.
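A short sketch of these two definitions; the repetition code below is used only as an example.

```python
from itertools import combinations

def hamming_distance(x, y):
    """Number of positions in which x and y differ."""
    return sum(xi != yi for xi, yi in zip(x, y))

def minimum_distance(code):
    """d_H(C): minimum Hamming distance over all pairs of distinct codewords."""
    return min(hamming_distance(c1, c2) for c1, c2 in combinations(code, 2))

code = [(0, 0, 0), (1, 1, 1)]
print(hamming_distance((0, 0, 0), (0, 1, 1)))  # 2
print(minimum_distance(code))                  # 3
```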

Levels of error handling

  • error detection
  • erasure correction
  • error correction

Erasure: replacement of an entry by $*\not\in F$.

Error: substitution of one entry by a different one.

In the examples below, suppose $d_H(\mathcal{C})=d$.

Error detection

Theorem: If $d_H(\mathcal{C})=d$, then there exists $f:F^n\to \mathcal{C}\cup \{\text{"error detected"}\}$ that detects every pattern of at most $d-1$ errors.

  • That is, we can tell whether the channel introduced errors, provided it introduced at most $d-1$ of them.
  • No decoding is needed.

Idea:

Since $d_H(\mathcal{C})=d$, at least $d$ errors are needed to turn one codeword into another, i.e., to cause “confusion”.

Proof

The function

$$f(y)=\begin{cases} y & \text{if }y\in \mathcal{C}\\ \text{"error detected"} & \text{otherwise} \end{cases}$$

fails only if there are $\geq d$ errors.
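A direct sketch of this function $f$; the repetition code is an assumed example with $d=3$, so up to two errors are detected.

```python
def detect_errors(y, code):
    """Return y if it is a codeword, otherwise report that errors were detected.
    Correct whenever at most d-1 errors occurred."""
    return tuple(y) if tuple(y) in code else "error detected"

code = {(0, 0, 0), (1, 1, 1)}
print(detect_errors((1, 1, 1), code))  # (1, 1, 1)
print(detect_errors((1, 0, 1), code))  # error detected
```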

Erasure correction

Theorem: If $d_H(\mathcal{C})=d$, then there exists $f:(F\cup \{*\})^n\to \mathcal{C}\cup \{\text{"failed"}\}$ that recovers every pattern of at most $d-1$ erasures.

Idea:

Suppose $d=4$.

If $4$ erasures occurred, there might be two possible codewords $c,c'\in \mathcal{C}$ that agree with the received word on all non-erased positions.

If $\leq 3$ erasures occurred, there is only one such codeword $c\in \mathcal{C}$.
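A minimal sketch of such a recovery map; positions marked '*' are erased, and the repetition code ($d=3$, so up to two erasures) is illustrative.

```python
def correct_erasures(y, code):
    """Return the unique codeword agreeing with y on all non-erased positions,
    or "failed" if that codeword is not unique."""
    candidates = [c for c in code
                  if all(yi == ci for yi, ci in zip(y, c) if yi != '*')]
    return candidates[0] if len(candidates) == 1 else "failed"

code = [(0, 0, 0), (1, 1, 1)]
print(correct_erasures((0, '*', '*'), code))    # (0, 0, 0)
print(correct_erasures(('*', '*', '*'), code))  # failed
```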

Error correction

Define the Hamming ball of radius $r$ centered at $c$ as:

$$B_H(c,r)=\{y\in F^n:d_H(c,y)\leq r\}$$

Theorem: If $d_H(\mathcal{C})\geq d$, then there exists $f:F^n\to \mathcal{C}$ that corrects every pattern of at most $\lfloor \frac{d-1}{2}\rfloor$ errors.

Idea:

The balls $\{B_H(c,\lfloor \frac{d-1}{2}\rfloor)\mid c\in \mathcal{C}\}$ are disjoint.

Use nearest-neighbor decoding; the disjointness follows from the triangle inequality.
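A sketch of nearest-neighbor (minimum-distance) decoding, again with the repetition code as an assumed example; it corrects any single error since $d=3$.

```python
def nearest_neighbor_decode(y, code):
    """Return the codeword closest to y in Hamming distance.
    Corrects every pattern of at most floor((d-1)/2) errors."""
    def hamming_distance(x, z):
        return sum(xi != zi for xi, zi in zip(x, z))
    return min(code, key=lambda c: hamming_distance(y, c))

code = [(0, 0, 0), (1, 1, 1)]
print(nearest_neighbor_decode((0, 1, 0), code))  # (0, 0, 0)
```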

Intro to linear codes

Summary: a code of minimum Hamming distance $d$ can

  • detect $\leq d-1$ errors,
  • correct $\leq d-1$ erasures,
  • correct $\leq \lfloor \frac{d-1}{2}\rfloor$ errors.

Problems:

  • How to construct good codes, with $k/n$ and $d$ both large?
  • How good can these codes possibly be?
  • How to encode?
  • How to decode over a noisy channel?

Tools

  • Linear algebra over finite fields.

Linear codes

Consider $F^n$ as a vector space, and let $\mathcal{C}\subseteq F^n$ be a subspace.

Since $F,\Phi$ are finite, we use finite fields (algebraic objects over which $F^n$ “imitates” $\mathbb{R}^n$, $\mathbb{C}^n$).

Formally, they satisfy the field axioms.

Next lectures:

  • Field axioms
  • Prime fields ($\mathbb{F}_p$)
  • Field extensions (e.g. $\mathbb{F}_{p^t}$)