Skip to Content
CSE5313CSE5313 Coding and information theory for data science (Lecture 9)

CSE5313 Coding and information theory for data science (Lecture 9)

Explicit optimal codes

Explicit optimal codes?

  • Singleton, Sphere-packing provide restrictions.
  • Gilbert-Varshamov provides existence.

Are there explicit optimal codes? That is,

  • Easily (polynomial time) encodable, decodable.

Yes! This lecture:

– Gustave Solomon [1930-1996] (Reed-Solomon code) – Irving S. Reed [1923-2012]. – David E. Muller [1924-2008] (Reed-Muller code)

Using Polynomials over Fq\mathbb{F}_q

Reed-Solomon code

Note

The fundamental theorem of algebra:

A polynomial of degree kk has at most kk roots.

We have two equivalent definitions of a Reed-Solomon code:

  • As polynomial evaluations.
  • As linear codes (from generator matrix)

Efficient encoding (as linear codes)

Efficient decoding (use Euclidean algorithm)

Definition of Reed-Solomon code from polynomial evaluations

Caution

We assume qnq\geq n.

Every codeword corresponds to a polynomial of degree at most k1k-1.

Let f(x)=i=0k1fixiFqk1[x]f(x)=\sum_{i=0}^{k-1}f_ix^i\in \mathbb{F}_q^{k-1}[x] (fiFqf_i\in \mathbb{F}_q for all ii, deg(f)k1\deg(f)\leq k-1).

Fix distinct a1,a2,,anFqa_1,a_2,\ldots,a_n\in \mathbb{F}_q.

Definition of Reed-Solomon code

A Reed-Solomon code is {f(a1),f(a2),,f(an)f(x)Fqk1[x]}\{f(a_1),f(a_2),\ldots,f(a_n)|f(x)\in \mathbb{F}_q^{k-1}[x]\}.

  • In words, the set of all evaluations at a1,a2,,ana_1,a_2,\ldots,a_n of polynomials of degree at most k1k-1.

Example of Reed-Solomon code

Let n=5n=5, Fq=Z5\mathbb{F}_q=\mathbb{Z}_5, k=3k=3.

a0=0a_0=0a1=1a_1=1a2=2a_2=2a3=3a_3=3a4=4a_4=4
f(x)=1f(x)=11111111111
f(x)=x+2f(x)=x+22233440011
f(x)=x2+xf(x)=x^2+x0022112200

Here d=nk+1=3d=n-k+1=3.

Proposition: Reed-Solomon code is a linear code

A Reed-Solomon code is {f(a1),f(a2),,f(an)f(x)Fqk1[x]}\{f(a_1),f(a_2),\ldots,f(a_n)|f(x)\in \mathbb{F}_q^{k-1}[x]\} is a linear code.

Proof

First the code is closed under addition.

Let f(x),g(x)Fqk1[x]f(x),g(x)\in \mathbb{F}_q^{k-1}[x], then f(x)+g(x)Fqk1[x]f(x)+g(x)\in \mathbb{F}_q^{k-1}[x].

f(x)+g(x)=i=0k1(fi+gi)xif(x)+g(x)=\sum_{i=0}^{k-1}(f_i+g_i)x^i

Then the code is closed under scalar multiplication.

Let f(x)Fqk1[x]f(x)\in \mathbb{F}_q^{k-1}[x], cFqc\in \mathbb{F}_q, then cf(x)Fqk1[x]cf(x)\in \mathbb{F}_q^{k-1}[x].

cf(x)=i=0k1(cfi)xicf(x)=\sum_{i=0}^{k-1}(cf_i)x^i

The dimension of the code is kk.

Corollary: The Reed-Solomon code attains the Singleton bound with equality

The Reed-Solomon code has minimum distance nk+1n-k+1.

Proof

Let cf=(f(a1),f(a2),,f(an))c_f=(f(a_1),f(a_2),\ldots,f(a_n)) and cg=(g(a1),g(a2),,g(an))c_g=(g(a_1),g(a_2),\ldots,g(a_n)).

Since fgf\neq g, and d(cf,cg)d(c_f,c_g) is the minimum distance of the code

Let cfg=(f(a1)g(a1),f(a2)g(a2),,f(an)g(an))c_{f-g}=(f(a_1)-g(a_1),f(a_2)-g(a_2),\ldots,f(a_n)-g(a_n)).

By the lemma for minimum distance, we have d(cf,cg)=wH(cfg)=wH((fg)(a1),(fg)(a2),,(fg)(an))d(c_f,c_g)=w_H(c_{f-g})=w_H((f-g)(a_1),(f-g)(a_2),\ldots,(f-g)(a_n)) where fgFqk1[x]f-g\in \mathbb{F}_q^{k-1}[x].

So nwH(cfg)n-w_H(c_{f-g}) is the number of zeros (root) of the polynomial fgf-g.

So if fgf-g has more than k1k-1 roots, then f=gf=g.

So ndk1n-d\leq k-1, dnk+1d\geq n-k+1.

Which is the Singleton bound.

Definition of Reed-Solomon code from generator matrix

Every Reed-Solomon code is of the form (f(a1),f(a2),,f(an))(f(a_1),f(a_2),\ldots,f(a_n)) for some f(x)=i=0k1fixiFqk1[x]f(x)=\sum_{i=0}^{k-1}f_ix^i\in \mathbb{F}_q^{k-1}[x].

Observer that the evaluation map is a linear map.

f(a1)=f0+f1a1+f2a12++fk1a1k1f(a_1)=f_0+f_1a_1+f_2a_1^2+\cdots+f_{k-1}a_1^{k-1} f(a2)=f0+f1a2+f2a22++fk1a2k1f(a_2)=f_0+f_1a_2+f_2a_2^2+\cdots+f_{k-1}a_2^{k-1} \vdots f(an)=f0+f1an+f2an2++fk1ank1f(a_n)=f_0+f_1a_n+f_2a_n^2+\cdots+f_{k-1}a_n^{k-1}

So, every code word can be constructed by

(f(a1),f(a2),,f(an))=(f0,f1,f2,,fk1)(111a1a2ana12a22an2a1k1a2k1ank1)(f(a_1),f(a_2),\ldots,f(a_n))=(f_0,f_1,f_2,\ldots,f_{k-1})\begin{pmatrix} 1 & 1 & \cdots & 1\\ a_1 & a_2 & \cdots & a_n\\ a_1^2 & a_2^2 & \cdots & a_n^2\\ \vdots & \vdots & \cdots & \vdots\\ a_1^{k-1} & a_2^{k-1} & \cdots & a_n^{k-1} \end{pmatrix}

The generator matrix for Reed-Solomon code is a Vandermonde matrix V(a1,a2,,an)V(a_1,a_2,\ldots,a_n).

Fact: V(a1,a2,,an)V(a_1,a_2,\ldots,a_n) is invertible if and only if a1,a2,,ana_1,a_2,\ldots,a_n are distinct. (that’s how we choose a1,a2,,ana_1,a_2,\ldots,a_n)

The parity check matrix for Reed-Solomon code is also a Vandermonde matrix V(a1,a2,,an)V(a_1,a_2,\ldots,a_n)^\top with scalar multiples of the columns.

Some technical lemmas:

Let GG and HH be the generator and parity-check matrices of (any) linear code C=[n,k,d]FqC = [n, k, d]_{\mathbb{F}_q}. Then:

I. Then HG=0H G^\top = 0. II. Any matrix MFqnk×kM \in \mathbb{F}_q^{n-k \times k} such that \rank(M)=nk\rank(M) = n - k and MG=0M G^\top = 0 is a parity-check matrix for CC (i.e. C=kerMC = \ker M).

Reed-Muller code

Reed-Solomon codes: Evaluations of univariate polynomials of deg ≤ k1k-1.

Reed-Muller codes: Evaluations of multivariate polynomials of deg k1\leq k-1

Example:

f(x1,x2,x3)=x1x22+x1x3+x2+x2x33f(x_1,x_2,x_3)=x_1x_2^2+x_1x_3+x_2+x_2x_3^3

This is a degree 4 polynomial.

Usually we use q=2q=2 for binary codes.

So x2=xx^2=x

Definition of Reed-Muller code (binary case)

RM(r,m)={(f(α1),,f(α2m))αiF2m,degfr}RM(r,m)=\left\{(f(\alpha_1),\ldots,f(\alpha_2^m))|\alpha_i\in \mathbb{F}_2^m,\deg f\leq r\right\}

Facts:

  • Length n=2mn = 2^m.
  • Minimum distance 2mr2^{m-r} (not shown).
  • Dimension = # of free coefficients in a multilinear polynomial of degree at most rr.
  • Dimension = # of subsets of {1,2,,m}\{1, 2, \ldots, m\} of size at most rr
  • Dimension = i=0r(mi)\sum_{i=0}^{r}\binom{m}{i}

Exercises: Show that

  1. C1=RM(m1,m)=C_1 = RM(m-1,m) = Parity code.
  2. C2=RM(m2,m)=C_2 = RM(m-2,m) = Extended Hamming code.
  3. C3=RM(1,m)=C_3 = RM(1,m) = Augmented Hadamard.

Coding for storage

Requirements/Challenges in Storage Systems

  1. Challenge 1: Reconstruction.
    • The data collector must be able to reconstruct the file, even if some are nonresponsive.
      • Minimize reconstruction bandwidth.
  2. Challenge 2: Repair.
    • The system must maintain data consistency.
    • Failed servers must be repaired:
      • By contacting few other servers (locality, due to geographical constraints).
      • By minimizing bandwidth.
  3. Challenge 3: Storage overhead.
    • Minimize space consumption.
      • Minimize redundancy.

Naive solution: Replication

Fragment the file X=(X1,,Xk)X = (X_1, \ldots, X_k).

  • Size of XiX_i = Whatever fits in a storage server.

Hold rr copies of each XiX_i.

  • I.e., n=rkn = rk servers in the system.

Storage overhead?

  • nk=r\frac{n}{k} = r.

Repair?

  • XiX_i fails
  • r\geq r failures is lost data.

Reconstruction?

  • Possible if any r1r-1 servers fail.
  • Impossible for some r\geq r failures.

Use codes to improve storage efficiency

Reconstruction?

  • Lecture 1: If dH(C)dd_H(\mathcal{C})\geq d, every pattern of at most d1d-1 erasures is recoverable.

  • Idea: Treat unavailable servers as erasures.

Is this better/worse than replication?

  • Say we wish to reconstruct from any n10\approx \frac{n}{10} servers.
  • What would be the redundancy in replication vs. coding?

Coding:

  • Can reconstruct file from any nd+1910nn-d+1\approx \frac{9}{10}n servers.
  • Resulting overhead nk=nnd+1109\frac{n}{k}=\frac{n}{n-d+1}\approx \frac{10}{9} (constant!).

Replication:

To reconstruct from any 910n\frac{9}{10}n servers, need r1110nr-1\approx \frac{1}{10}n

Repair?

  • Need low locality (repair by contacting few other servers).
  • Need low bandwidth (repair by downloading as few bits as possible).

Repair in a replicated system:

  • XiX_i fails \Rightarrow reconstruct from a different copy.
  • Locality 1.
  • Optimal bandwidth.

Repair in a coded system:

  • repair one YiY_i \approx Reconstruct the entire file.
    • Locality nd+1n-d+1, high bandwidth.
  • Much worse than replication.

New coding challenges: Minimize locality and bandwidth

Last updated on