CSE5313 Coding and information theory for data science (Lecture 9)

Explicit optimal codes

Explicit optimal codes?

Singleton, Sphere-packing provide restrictions.
Gilbert-Varshamov provides existence.

Are there explicit optimal codes? That is,

Easily (polynomial time) encodable, decodable.

Yes! This lecture:

- Gustave Solomon [1930-1996] (Reed-Solomon code).
- Irving S. Reed [1923-2012].
- David E. Muller [1924-2008] (Reed-Muller code).

Using Polynomials over $\mathbb{F}_q$

Reed-Solomon code

Note

A finite-field analogue of the fundamental theorem of algebra:

A nonzero polynomial of degree $k$ over a field has at most $k$ roots.

We have two equivalent definitions of a Reed-Solomon code:

As polynomial evaluations.
As linear codes (from generator matrix)

Efficient encoding (as linear codes)

Efficient decoding (use Euclidean algorithm)

Definition of Reed-Solomon code from polynomial evaluations

Caution

We assume $q\geq n$ .

Every codeword corresponds to a polynomial of degree at most $k-1$ .

Let $f(x)=\sum_{i=0}^{k-1}f_ix^i\in \mathbb{F}_q^{k-1}[x]$ ( $f_i\in \mathbb{F}_q$ for all $i$ , $\deg(f)\leq k-1$ ).

Fix distinct $a_1,a_2,\ldots,a_n\in \mathbb{F}_q$ .

Definition of Reed-Solomon code

A Reed-Solomon code is $\{(f(a_1),f(a_2),\ldots,f(a_n)) \mid f(x)\in \mathbb{F}_q^{k-1}[x]\}$ .

In words, the set of all evaluations at $a_1,a_2,\ldots,a_n$ of polynomials of degree at most $k-1$ .

Example of Reed-Solomon code

Let $n=5$ , $\mathbb{F}_q=\mathbb{Z}_5$ , $k=3$ .

| | $a_1=0$ | $a_2=1$ | $a_3=2$ | $a_4=3$ | $a_5=4$ |
|---|---|---|---|---|---|
| $f(x)=1$ | $1$ | $1$ | $1$ | $1$ | $1$ |
| $f(x)=x+2$ | $2$ | $3$ | $4$ | $0$ | $1$ |
| $f(x)=x^2+x$ | $0$ | $2$ | $1$ | $2$ | $0$ |

Here $d=n-k+1=3$ .
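The rows of the example can be reproduced with a few lines of Python. This is a minimal sketch, assuming a prime field $\mathbb{Z}_p$ so that arithmetic is plain integer arithmetic mod $p$; the names `rs_encode`, `P`, and `POINTS` are illustrative and do not come from the lecture.

```python
# Reed-Solomon encoding by polynomial evaluation over Z_p (p prime).
# rs_encode, P and POINTS are hypothetical names chosen for this sketch.

P = 5                      # field size q = 5, arithmetic is mod 5
POINTS = [0, 1, 2, 3, 4]   # distinct evaluation points, n = 5


def rs_encode(coeffs, points=POINTS, p=P):
    """Evaluate f(x) = sum_i coeffs[i] * x^i at every point, mod p."""
    codeword = []
    for a in points:
        value = 0
        # Horner's rule: process coefficients from the highest degree down.
        for c in reversed(coeffs):
            value = (value * a + c) % p
        codeword.append(value)
    return codeword


# Messages are coefficient vectors (f_0, f_1, f_2), so k = 3.
print(rs_encode([1, 0, 0]))  # f(x) = 1        -> [1, 1, 1, 1, 1]
print(rs_encode([2, 1, 0]))  # f(x) = x + 2    -> [2, 3, 4, 0, 1]
print(rs_encode([0, 1, 1]))  # f(x) = x^2 + x  -> [0, 2, 1, 2, 0]
```

The message is the coefficient vector and the codeword is the list of evaluations, so encoding costs $O(nk)$ field operations with Horner's rule.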

Proposition: Reed-Solomon code is a linear code

The Reed-Solomon code $\{(f(a_1),f(a_2),\ldots,f(a_n)) \mid f(x)\in \mathbb{F}_q^{k-1}[x]\}$ is a linear code.

Proof

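First, the code is closed under addition.

Let $f(x),g(x)\in \mathbb{F}_q^{k-1}[x]$ . Then $f(x)+g(x)\in \mathbb{F}_q^{k-1}[x]$ , since

$$
f(x)+g(x)=\sum_{i=0}^{k-1}(f_i+g_i)x^i,
$$

and the sum of the codewords of $f$ and $g$ is the codeword of $f+g$ , because $(f+g)(a_j)=f(a_j)+g(a_j)$ for every $j$ .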

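Second, the code is closed under scalar multiplication.

Let $f(x)\in \mathbb{F}_q^{k-1}[x]$ and $c\in \mathbb{F}_q$ . Then $cf(x)\in \mathbb{F}_q^{k-1}[x]$ , since

$$
cf(x)=\sum_{i=0}^{k-1}(cf_i)x^i,
$$

and $c$ times the codeword of $f$ is the codeword of $cf$ , because $(cf)(a_j)=c\,f(a_j)$ for every $j$ .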

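The dimension of the code is $k$ : assuming $k\leq n$ , a nonzero polynomial of degree at most $k-1$ has at most $k-1<n$ roots, so it cannot vanish at all of $a_1,\ldots,a_n$ . Hence distinct polynomials give distinct codewords, and the code has $q^k$ codewords.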

Corollary: The Reed-Solomon code attains the Singleton bound with equality

The Reed-Solomon code has minimum distance $n-k+1$ .

Proof

Let $c_f=(f(a_1),f(a_2),\ldots,f(a_n))$ and $c_g=(g(a_1),g(a_2),\ldots,g(a_n))$ be two distinct codewords (so $f\neq g$ ) whose distance $d(c_f,c_g)$ equals the minimum distance $d$ of the code.

Let $c_{f-g}=(f(a_1)-g(a_1),f(a_2)-g(a_2),\ldots,f(a_n)-g(a_n))$ .

By the lemma for minimum distance, we have $d(c_f,c_g)=w_H(c_{f-g})=w_H((f-g)(a_1),(f-g)(a_2),\ldots,(f-g)(a_n))$ where $f-g\in \mathbb{F}_q^{k-1}[x]$ .

So $n-w_H(c_{f-g})$ is the number of roots of the polynomial $f-g$ among $a_1,a_2,\ldots,a_n$ .

If $f-g$ had more than $k-1$ roots, it would be the zero polynomial, i.e. $f=g$ . Since $f\neq g$ , it has at most $k-1$ roots.

So $n-d\leq k-1$ , i.e. $d\geq n-k+1$ .

The Singleton bound gives $d\leq n-k+1$ , so $d=n-k+1$ .
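For small parameters the minimum distance can also be checked by brute force. The sketch below assumes the running example ( $\mathbb{Z}_5$ , $n=5$ , $k=3$ ) and uses the fact that, for a linear code, the minimum distance equals the minimum Hamming weight of a nonzero codeword. The function names are illustrative.

```python
from itertools import product

P, K = 5, 3
POINTS = [0, 1, 2, 3, 4]   # n = 5 distinct points of Z_5


def evaluate(coeffs, a, p=P):
    """f(a) mod p for f(x) = sum_i coeffs[i] * x^i (pow(0, 0, p) is 1)."""
    return sum(c * pow(a, i, p) for i, c in enumerate(coeffs)) % p


def min_distance(points=POINTS, k=K, p=P):
    """Minimum Hamming weight over all nonzero codewords.

    Exhaustive over all p^k messages, so only suitable for tiny codes.
    """
    best = len(points)
    for coeffs in product(range(p), repeat=k):
        if not any(coeffs):
            continue  # skip the zero polynomial
        weight = sum(1 for a in points if evaluate(coeffs, a, p) != 0)
        best = min(best, weight)
    return best


print(min_distance())          # 3
print(len(POINTS) - K + 1)     # n - k + 1 = 3
```

A minimum-weight codeword comes from a polynomial with exactly $k-1$ roots among the evaluation points, for example $f(x)=x(x-1)$ here.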

Definition of Reed-Solomon code from generator matrix

Every Reed-Solomon codeword is of the form $(f(a_1),f(a_2),\ldots,f(a_n))$ for some $f(x)=\sum_{i=0}^{k-1}f_ix^i\in \mathbb{F}_q^{k-1}[x]$ .

Observe that the evaluation map is a linear map.

$$
\begin{aligned}
f(a_1)&=f_0+f_1a_1+f_2a_1^2+\cdots+f_{k-1}a_1^{k-1}\\
f(a_2)&=f_0+f_1a_2+f_2a_2^2+\cdots+f_{k-1}a_2^{k-1}\\
&\ \ \vdots\\
f(a_n)&=f_0+f_1a_n+f_2a_n^2+\cdots+f_{k-1}a_n^{k-1}
\end{aligned}
$$

So, every code word can be constructed by

$$
(f(a_1),f(a_2),\ldots,f(a_n))=(f_0,f_1,f_2,\ldots,f_{k-1})\begin{pmatrix} 1 & 1 & \cdots & 1\\ a_1 & a_2 & \cdots & a_n\\ a_1^2 & a_2^2 & \cdots & a_n^2\\ \vdots & \vdots & \ddots & \vdots\\ a_1^{k-1} & a_2^{k-1} & \cdots & a_n^{k-1} \end{pmatrix}
$$

The generator matrix for Reed-Solomon code is a Vandermonde matrix $V(a_1,a_2,\ldots,a_n)$ .
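The matrix view can be sketched in Python as follows, again assuming the $\mathbb{Z}_5$ example with $n=5$ and $k=3$ . The helper names `vandermonde` and `encode_with_matrix` are illustrative. The result agrees with evaluating $f(x)=x^2+x$ directly.

```python
P = 5
POINTS = [0, 1, 2, 3, 4]
K = 3


def vandermonde(points, k, p):
    """k x n matrix whose row i holds a_j^i mod p (row 0 is all ones)."""
    return [[pow(a, i, p) for a in points] for i in range(k)]


def encode_with_matrix(message, G, p):
    """Row vector times matrix: c_j = sum_i message[i] * G[i][j] mod p."""
    n = len(G[0])
    return [sum(m * row[j] for m, row in zip(message, G)) % p
            for j in range(n)]


G = vandermonde(POINTS, K, P)
for row in G:
    print(row)
# [1, 1, 1, 1, 1]
# [0, 1, 2, 3, 4]
# [0, 1, 4, 4, 1]

# Message (f_0, f_1, f_2) = (0, 1, 1), i.e. f(x) = x^2 + x.
print(encode_with_matrix([0, 1, 1], G, P))  # [0, 2, 1, 2, 0]
```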

Fact: The square Vandermonde matrix $V(a_1,a_2,\ldots,a_n)$ (the case $k=n$ ) is invertible if and only if $a_1,a_2,\ldots,a_n$ are distinct. (That is how we choose $a_1,a_2,\ldots,a_n$ .)

The parity check matrix for Reed-Solomon code is also a Vandermonde matrix $V(a_1,a_2,\ldots,a_n)^\top$ with scalar multiples of the columns.

Some technical lemmas:

Let $G$ and $H$ be the generator and parity-check matrices of (any) linear code $C = [n, k, d]_{\mathbb{F}_q}$ . Then:

I. $H G^\top = 0$ .

II. Any matrix $M \in \mathbb{F}_q^{(n-k) \times n}$ such that $\operatorname{rank}(M) = n - k$ and $M G^\top = 0$ is a parity-check matrix for $C$ (i.e. $C = \ker M$ ).
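Lemma II can be illustrated numerically on the $[5,3]$ Reed-Solomon code over $\mathbb{Z}_5$ . The candidate matrix `M` below is an assumption made for this sketch: it is the $(n-k)\times n$ Vandermonde matrix on the same points, which works in this special case because the evaluation points are all of $\mathbb{Z}_5$ . In general the columns must be scaled, as noted above. The rank routine is a plain Gaussian elimination mod $p$ and needs Python 3.8+ for `pow(x, -1, p)`.

```python
P = 5
POINTS = [0, 1, 2, 3, 4]   # all of Z_5 (this matters for the choice of M)
N, K = 5, 3


def vandermonde(points, rows, p):
    """rows x n matrix whose row i holds a_j^i mod p."""
    return [[pow(a, i, p) for a in points] for i in range(rows)]


def times_transpose(A, B, p):
    """Compute A * B^T mod p."""
    return [[sum(x * y for x, y in zip(ra, rb)) % p for rb in B] for ra in A]


def rank_mod_p(M, p):
    """Rank of M over Z_p by Gauss-Jordan elimination (p prime)."""
    M = [row[:] for row in M]
    rank = 0
    for col in range(len(M[0])):
        pivot = next((r for r in range(rank, len(M)) if M[r][col] % p), None)
        if pivot is None:
            continue
        M[rank], M[pivot] = M[pivot], M[rank]
        inv = pow(M[rank][col], -1, p)          # modular inverse
        M[rank] = [(x * inv) % p for x in M[rank]]
        for r in range(len(M)):
            if r != rank and M[r][col] % p:
                factor = M[r][col]
                M[r] = [(x - factor * y) % p for x, y in zip(M[r], M[rank])]
        rank += 1
    return rank


G = vandermonde(POINTS, K, P)        # generator matrix, 3 x 5
M = vandermonde(POINTS, N - K, P)    # assumed candidate, 2 x 5

print(times_transpose(M, G, P))      # [[0, 0, 0], [0, 0, 0]]
print(rank_mod_p(M, P))              # 2 = n - k
```

Since `M` has rank $n-k$ and $MG^\top=0$ , Lemma II says it is a parity-check matrix for this code.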

Reed-Muller code

Reed-Solomon codes: Evaluations of univariate polynomials of $\deg\leq k-1$ .

Reed-Muller codes: Evaluations of multivariate polynomials of $\deg\leq k-1$ .

Example:

$$
f(x_1,x_2,x_3)=x_1x_2^2+x_1x_3+x_2+x_2x_3^3
$$

This is a degree 4 polynomial.

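Usually we use $q=2$ for binary codes.

Then $x^2=x$ for every $x\in\mathbb{F}_2$ , so every polynomial can be reduced to a multilinear one (each variable appears with exponent at most 1).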

Definition of Reed-Muller code (binary case)

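$$
RM(r,m)=\left\{(f(\alpha_1),\ldots,f(\alpha_{2^m})) \mid f\in\mathbb{F}_2[x_1,\ldots,x_m],\ \deg f\leq r\right\},
$$

where $\alpha_1,\ldots,\alpha_{2^m}$ are all the points of $\mathbb{F}_2^m$ .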

Facts:

Length $n = 2^m$ .
Minimum distance $2^{m-r}$ (not shown).
Dimension = # of free coefficients in a multilinear polynomial of degree at most $r$ .
Dimension = # of subsets of $\{1, 2, \ldots, m\}$ of size at most $r$
Dimension = $\sum_{i=0}^{r}\binom{m}{i}$
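These facts can be checked on a small case. The sketch below builds one generator row per multilinear monomial (one per subset of variables of size at most $r$ ), evaluates it on all of $\mathbb{F}_2^m$ , and brute-forces the minimum weight. The function names are illustrative, and the exhaustive search is only feasible for tiny $r$ and $m$ .

```python
from itertools import combinations, product
from math import comb


def rm_generator_rows(r, m):
    """One row per multilinear monomial of degree <= r, evaluated on F_2^m."""
    points = list(product([0, 1], repeat=m))
    rows = []
    for size in range(r + 1):
        for subset in combinations(range(m), size):
            # prod_{i in subset} x_i is 1 exactly when all chosen bits are 1;
            # the empty subset gives the constant monomial 1.
            rows.append([int(all(pt[i] for i in subset)) for pt in points])
    return rows


def rm_min_distance(r, m):
    """Brute-force minimum weight of a nonzero codeword of RM(r, m)."""
    rows = rm_generator_rows(r, m)
    n = 2 ** m
    best = n
    for coeffs in product([0, 1], repeat=len(rows)):
        if not any(coeffs):
            continue
        word = [sum(c * row[j] for c, row in zip(coeffs, rows)) % 2
                for j in range(n)]
        best = min(best, sum(word))
    return best


r, m = 1, 3
rows = rm_generator_rows(r, m)
print(len(rows), sum(comb(m, i) for i in range(r + 1)))  # 4 4
print(rm_min_distance(r, m), 2 ** (m - r))               # 4 4
```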

Exercises: Show that

$C_1 = RM(m-1,m) =$ Parity code.
$C_2 = RM(m-2,m) =$ Extended Hamming code.
$C_3 = RM(1,m) =$ Augmented Hadamard.

Coding for storage

Requirements/Challenges in Storage Systems

Challenge 1: Reconstruction.
- The data collector must be able to reconstruct the file, even if some servers are nonresponsive.
  - Minimize reconstruction bandwidth.
Challenge 2: Repair.
- The system must maintain data consistency.
- Failed servers must be repaired:
  - By contacting few other servers (locality, due to geographical constraints).
  - By minimizing bandwidth.
Challenge 3: Storage overhead.
- Minimize space consumption.
  - Minimize redundancy.

Naive solution: Replication

Fragment the file $X = (X_1, \ldots, X_k)$ .

Size of $X_i$ = Whatever fits in a storage server.

Hold $r$ copies of each $X_i$ .

I.e., $n = rk$ servers in the system.

Storage overhead?

$\frac{n}{k} = r$ .

Repair?

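If a server holding $X_i$ fails, it is repaired from another copy of $X_i$ .
With $\geq r$ failures, data may be lost (all $r$ copies of some $X_i$ can fail).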

Reconstruction?

Possible whenever at most $r-1$ servers fail.
Impossible for some patterns of $\geq r$ failures.

Use codes to improve storage efficiency

Reconstruction?

Lecture 1: If $d_H(\mathcal{C})\geq d$ , every pattern of at most $d-1$ erasures is recoverable.
Idea: Treat unavailable servers as erasures.
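With a Reed-Solomon code, erasure recovery amounts to polynomial interpolation: any $k$ surviving evaluations determine $f$ . The sketch below assumes the $[5,3]$ code over $\mathbb{Z}_5$ from the earlier example and uses Lagrange interpolation; the function names are illustrative, and `pow(x, -1, p)` needs Python 3.8+.

```python
P = 5


def poly_mul_linear(poly, root, p):
    """Multiply poly (coefficients low to high) by (x - root) mod p."""
    out = [0] * (len(poly) + 1)
    for i, c in enumerate(poly):
        out[i] = (out[i] - root * c) % p
        out[i + 1] = (out[i + 1] + c) % p
    return out


def recover_message(received, k, p=P):
    """Lagrange-interpolate f from k surviving (point, value) pairs.

    Returns the coefficient vector (f_0, ..., f_{k-1}).
    """
    pairs = received[:k]          # any k survivors are enough
    coeffs = [0] * k
    for j, (aj, yj) in enumerate(pairs):
        basis, denom = [1], 1
        for i, (ai, _) in enumerate(pairs):
            if i != j:
                basis = poly_mul_linear(basis, ai, p)
                denom = (denom * (aj - ai)) % p
        scale = (yj * pow(denom, -1, p)) % p
        coeffs = [(c + scale * b) % p for c, b in zip(coeffs, basis)]
    return coeffs


# Codeword of f(x) = x^2 + x on points 0..4 is (0, 2, 1, 2, 0).
# Servers at points 1 and 3 are unavailable (d - 1 = 2 erasures).
survivors = [(0, 0), (2, 1), (4, 0)]
print(recover_message(survivors, k=3))  # [0, 1, 1]
```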

Is this better/worse than replication?

Say we wish to reconstruct despite any $\approx \frac{n}{10}$ server failures, i.e. from any $\approx \frac{9}{10}n$ servers.
What would be the redundancy in replication vs. coding?

Coding:

With an MDS code such as Reed-Solomon ( $k=n-d+1$ ), we can reconstruct the file from any $n-d+1\approx \frac{9}{10}n$ servers.
Resulting overhead $\frac{n}{k}=\frac{n}{n-d+1}\approx \frac{10}{9}$ (constant!).

Replication:

To reconstruct from any $\frac{9}{10}n$ servers, need $r-1\approx \frac{1}{10}n$ , so the overhead $\frac{n}{k}=r\approx \frac{n}{10}+1$ grows linearly with $n$ .
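The comparison can be made concrete with a short calculation. This is a rough sketch under simplifying assumptions: the coded system uses an MDS code with $d-1$ equal to the number of tolerated failures, the replicated system uses $r$ equal to that number plus one, and divisibility constraints such as $n=rk$ are ignored. The function names are illustrative.

```python
def replication_overhead(failures):
    """r copies survive any r - 1 failures, so r = failures + 1.

    The storage overhead n / k of replication is r.
    """
    return failures + 1


def mds_overhead(n, failures):
    """An MDS code with d - 1 = failures has k = n - failures.

    The storage overhead is n / k.
    """
    return n / (n - failures)


for n in (10, 100, 1000):
    failures = n // 10   # tolerate about n / 10 unavailable servers
    print(n, replication_overhead(failures), round(mds_overhead(n, failures), 3))
# 10 2 1.111
# 100 11 1.111
# 1000 101 1.111
```

The coded overhead stays near $\frac{10}{9}$ while the replication overhead grows with $n$ .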

Repair?

Need low locality (repair by contacting few other servers).
Need low bandwidth (repair by downloading as few bits as possible).

Repair in a replicated system:

$X_i$ fails $\Rightarrow$ reconstruct from a different copy.
Locality 1.
Optimal bandwidth.

Repair in a coded system:

Repairing one $Y_i$ $\approx$ reconstructing the entire file.
- Locality $n-d+1$ , high bandwidth.
Much worse than replication.

New coding challenges: Minimize locality and bandwidth