CSE5313 Coding and information theory for data science (Lecture 12)


Challenge 1: Reconstruction

  • Minimize reconstruction bandwidth.

Challenge 2: Repair

  • Maintaining data consistency.
  • Failed servers must be repaired:
    • By contacting few other servers (locality, due to geographical constraints).
    • By minimizing bandwidth.

Challenge 3: Storage overhead

  • Minimize space consumption.
  • Minimize redundancy.

Code for storage systems

Naive solution: Replication

Locality is 1 (by copying from another server).

This gives the optimal reconstruction bandwidth.

Use codes to improve storage efficiency

Locality is $n-d+1$; high bandwidth.

Parity codes

Let $X_1,X_2,\ldots,X_n\in \mathbb{F}_2^t$ be the data blocks, and add an extra server to store the parity $X_1\oplus X_2\oplus\cdots\oplus X_n$.

Reconstruction:

Optimal for reconstruction bandwidth. Only $k$ servers are needed to reconstruct the file.

Overhead:

Only need one additional server

Repair:

If any server fails, reconstruct it from the other $n-d+1=n-2+1=n-1$ servers.
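The encode/repair cycle of the parity code can be sketched as follows (a minimal sketch; the representation of blocks as byte strings and the function names are illustrative assumptions, not from the notes):

```python
import secrets

def encode_parity(blocks):
    """Append a parity block: the bitwise XOR of all data blocks."""
    parity = bytes(len(blocks[0]))
    for b in blocks:
        parity = bytes(x ^ y for x, y in zip(parity, b))
    return blocks + [parity]

def repair(stored, failed):
    """Rebuild the block of one failed server by XORing the other n-1 blocks."""
    result = bytes(len(stored[0]))
    for i, b in enumerate(stored):
        if i != failed:
            result = bytes(x ^ y for x, y in zip(result, b))
    return result

# n = 4 data servers, each holding a t-bit (here 8-byte) block
data = [secrets.token_bytes(8) for _ in range(4)]
stored = encode_parity(data)
assert all(repair(stored, f) == stored[f] for f in range(5))  # any one failure
```

Note that repairing the parity server itself is the same operation: XOR the surviving blocks.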

Reed-Solomon codes

Fragment the file $X = (X_1, \ldots, X_k)$.

Need a field of size $2^t\geq n$ to store the file on $n$ servers.

Reconstruction:

Any $k$ servers can reconstruct the file.

Overhead:

Need a field of size $2^t\geq n$ to store the file on $n$ servers.

Repair:

Worse: to repair even one failed server, we must reconstruct the entire file by contacting $k$ other servers.
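A hedged sketch of Reed-Solomon storage via polynomial evaluation (assumptions: arithmetic over the prime field $\mathbb{F}_{257}$ for readability, where the notes use $\mathbb{F}_{2^t}$; fragments sit at evaluation points $0,\ldots,k-1$ so the code is systematic; names are illustrative):

```python
P = 257  # a prime >= n; the lecture works over F_{2^t} instead

def interpolate(points, x):
    """Evaluate at x the unique degree < k polynomial through the k points."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def rs_encode(fragments, n):
    # fragments X_1..X_k sit at positions 0..k-1; server i stores f(i)
    pts = list(enumerate(fragments))
    return [interpolate(pts, i) for i in range(n)]

def rs_reconstruct(shares, k):
    # shares: any k pairs (position, symbol); recover X_1..X_k
    return [interpolate(shares, x) for x in range(k)]

file = [10, 20, 30]                 # k = 3 fragments
servers = rs_encode(file, 7)        # n = 7 servers; any 3 suffice
assert rs_reconstruct([(4, servers[4]), (6, servers[6]), (1, servers[1])], 3) == file
```

The repair problem is visible here: rebuilding even one lost evaluation requires gathering $k$ shares and re-interpolating.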

New codes for storage systems

EVENODD code

  • One of the first storage specific codes.

Can an XOR-only code be built that enables reconstruction when two disks are missing?

The locality/bandwidth problem is left for the next lecture.

For a prime $m$, partition $X=(X_0,\ldots,X_{m-1})$, each $X_i$ with $m-1$ bits.

Store $Y_i=X_i$ on disks $0,1,\ldots,m-1$.

Add two redundant disks $Y_m,Y_{m+1}$.

  • $(Y_m)_i$ is the parity of row $i$ (writing $a_{i,j}$ for bit $i$ of disk $j$).
  • $(Y_{m+1})_i$: first define $S=\bigoplus_{j=1}^{m-1} a_{m-1-j,\,j}$ (for $m=5$: $S=a_{0,4}\oplus a_{1,3}\oplus a_{2,2}\oplus a_{3,1}$), then $(Y_{m+1})_i=S\oplus \bigoplus_{j=0}^{m-1}a_{\langle i-j\rangle_m,\,j}$, where row $m-1$ is an all-zero ghost row.
| $Y_0$ | $Y_1$ | $Y_2$ | $Y_3$ | $Y_4$ | $Y_5$ | $Y_6$ |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 1 | 1 | 0 | 1 | 0 |
| 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 0 | 1 | 0 | 1 | 1 | 1 | 0 |
Note that the $S$ diagonal can be extracted from $Y_m$ and $Y_{m+1}$:

$\bigoplus_{j=0}^{m-2}(Y_m)_j\oplus \bigoplus_{j=0}^{m-2}(Y_{m+1})_j=\underbrace{S\oplus S\oplus\cdots\oplus S}_{m\ \text{times}}=S$, since $m$ is odd.
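The encoding rules above can be sketched directly; the array below is the $m=5$ example from the table, and the function name is illustrative:

```python
m = 5  # a prime; m-1 rows, m data disks Y_0..Y_{m-1}

def evenodd_encode(a):
    """a[i][j]: bit i of data disk j (rows 0..m-2). Returns Y_m, Y_{m+1}."""
    rows = [r[:] for r in a] + [[0] * m]      # append the all-zero ghost row
    S = 0
    for j in range(1, m):                     # S-diagonal: a_{m-1-j, j}
        S ^= rows[m - 1 - j][j]
    Ym, Ym1 = [0] * (m - 1), [0] * (m - 1)
    for i in range(m - 1):
        for j in range(m):
            Ym[i] ^= rows[i][j]               # row parity
            Ym1[i] ^= rows[(i - j) % m][j]    # diagonal parity
        Ym1[i] ^= S
    return Ym, Ym1

# data disks Y_0..Y_4 from the table above
data = [[1, 0, 1, 1, 0],
        [0, 1, 1, 0, 0],
        [1, 1, 0, 0, 0],
        [0, 1, 0, 1, 1]]
Ym, Ym1 = evenodd_encode(data)
assert Ym == [1, 0, 0, 1]      # matches Y_5 in the table
assert Ym1 == [0, 0, 1, 0]     # matches Y_6 in the table
S = 0
for b in Ym + Ym1:             # extract S by XORing all parity bits
    S ^= b
assert S == 1
```

The final three lines check the identity above: XORing every entry of $Y_m$ and $Y_{m+1}$ yields $S$.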

Goal: Reconstruct if any two disks are missing.

  • If $Y_m, Y_{m+1}$ are missing, nothing to do.
  • If $Y_i, Y_{m+1}$ are missing for $i < m$, decode like a parity code.
  • If $Y_i, Y_m$ are missing for $i < m$, similar, using the diagonal parities.

The interesting case: $Y_i, Y_j$ are missing for $i,j < m$.

Using the same skill with which you solve Sudoku puzzles, we can find the missing values.

First we recover the $S$ diagonal from $Y_m$ and $Y_{m+1}$.

Then we alternate: solve for a cell of one missing column using its diagonal parity ($Y_{m+1}$), then for a cell of the other using its row parity ($Y_m$), and repeat.
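The zig-zag can be sketched as follows, reusing the $m=5$ example and its parities $Y_5$, $Y_6$ from the table; the chosen erased pair and the function name are illustrative:

```python
m = 5  # the running example: 4x5 data array

data = [[1, 0, 1, 1, 0],
        [0, 1, 1, 0, 0],
        [1, 1, 0, 0, 0],
        [0, 1, 0, 1, 1]]
Ym, Ym1 = [1, 0, 0, 1], [0, 0, 1, 0]   # Y_5 (row) and Y_6 (diagonal) parities

def decode_two(a, ei, ej, Ym, Ym1):
    """Zig-zag recovery of two erased data columns ei < ej < m."""
    S = 0
    for b in Ym + Ym1:                             # S = XOR of all parity bits
        S ^= b
    full = [row[:] for row in a] + [[0] * m]       # ghost all-zero row
    for row in full[:-1]:
        row[ei] = row[ej] = 0                      # erase the two columns
    D = [Ym1[s] ^ S for s in range(m - 1)] + [S]   # diagonal parities
    r = (ej - 1 - ei) % m        # the diagonal through (r, ei) misses only it
    while r != m - 1:
        s = (r + ei) % m
        full[r][ei] = D[s]                         # diagonal step
        for t in range(m):
            if t != ei:
                full[r][ei] ^= full[(s - t) % m][t]
        full[r][ej] = Ym[r]                        # row step
        for t in range(m):
            if t != ej:
                full[r][ej] ^= full[r][t]
        r = (r + ej - ei) % m    # offset ej - ei is a generator of Z_m
    return full[:-1]

assert decode_two(data, 1, 3, Ym, Ym1) == data   # disks Y_1 and Y_3 restored
```

The loop update makes the termination argument concrete: each round advances the working row by the fixed non-zero offset $j-i$ modulo the prime $m$, so every row is visited before the ghost row ends the process.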

Proof for why it always works

There are $m-1$ rows; $m$ including a ghost row of all $0$s.

$\mathbb{Z}_m$ is cyclic of prime order, so every non-zero element is a generator.

When moving from a diagonal step to a horizontal step, we advance by the fixed non-zero offset $j-i$ in $\mathbb{Z}_m$; since any non-zero offset is a generator, the zig-zag visits every row exactly once before reaching the ghost row, so it never gets stuck.

This is an example of array code:

The message $(X_0,X_1,\ldots,X_{m-1})$ is a matrix in $\mathbb{F}_2^{(m-1)\times m}$.

The codeword $(Y_0,Y_1,\ldots,Y_{m+1})$ is a matrix in $\mathbb{F}_2^{(m-1)\times (m+2)}$.

Encoding is done over $\mathbb{F}_q$ (here $q=2$: XOR only).

Locally Recoverable Codes

Locality: when a node $j$ fails,

  • A newcomer node joins the system.
  • The newcomer contacts a “small” number of helper nodes with the message “repairing $j$”.
  • Each of the helper nodes sends something to the newcomer.
  • The newcomer aggregates the responses to find $Y_j$.

Notes:

  • No adversarial behavior.
  • No privacy issues.
  • No concern about bandwidth (for now).

Research question:

  • How small can the “small number of nodes” be?
  • How does that affect the rate/minimum distance of the code?
  • How to build codes with this capability?

Definition of locally recoverable code

An $[n, k]_q$ code is called $r$-locally recoverable if

  • every codeword symbol $y_j$ has a recovering set $R_j \subseteq [n] \setminus \{j\}$ (where $[n]=\{1,2,\ldots,n\}$),
  • such that $y_j$ is computable from $\{y_i : i \in R_j\}$,
  • $|R_j| \leq r$ for every $j \in [n]$.

Notes:

  • From any $n-d+1$ nodes we can reconstruct the entire file; always assume $k\leq n-d+1$.
  • We want $r\ll n-d+1$.
  • $R_j$ does not depend on $y_j$, nor on the codeword $y$, only on $j$. (We need to repair without knowing $y$ or $y_j$.)
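To make the definition concrete, here is a minimal sketch (an assumption of this note, not a construction from the lecture): split the data into groups of $r$ symbols over $\mathbb{F}_2$ and add one XOR parity per group, so every symbol has a recovering set of size $r$ inside its own group.

```python
r = 3  # locality parameter

def lrc_encode(data):
    """len(data) divisible by r; append one XOR parity per group of r."""
    out = []
    for g in range(0, len(data), r):
        group = data[g:g + r]
        parity = 0
        for x in group:
            parity ^= x
        out.extend(group + [parity])
    return out

def recovering_set(j):
    """Indices R_j: the other r symbols in j's group (|R_j| = r)."""
    g = (j // (r + 1)) * (r + 1)
    return [i for i in range(g, g + r + 1) if i != j]

def repair(codeword, j):
    v = 0
    for i in recovering_set(j):
        v ^= codeword[i]
    return v

c = lrc_encode([5, 1, 7, 2, 4, 6])   # k = 6, n = 8
assert all(repair(c, j) == c[j] for j in range(len(c)))
```

Note $R_j$ here depends only on $j$, as the definition requires, and the rate is $k/n = r/(r+1)$, meeting Bound 1 below with equality.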

Bounds for Locally Recoverable Codes

Let $\mathcal{C}$ be an $r$-locally recoverable $[n, k]_q$ code with minimum distance $d$.

Bound 1: $\frac{k}{n}\leq \frac{r}{r+1}$.

Bound 2: $d\leq n-k-\lceil\frac{k}{r}\rceil +2$.

Notes:

For $r=k$, bound 2 becomes $d\leq n-k+1$.

  • The natural extension of the Singleton bound.

For $r=1$, bound 1 becomes $\frac{k}{n}\leq \frac{1}{2}$.

  • The duplication code is a trivial code meeting this bound.

For $r=1$, bound 2 becomes $d\leq n-2k+2$.

  • The duplication code is a trivial code meeting this bound.
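Both duplication-code remarks can be checked numerically; a small sketch under the standard duplication-code parameters $n=2k$, $d=2$, $r=1$ (each symbol's copy is its recovering set):

```python
from math import ceil

# Verify that the duplication code [n, k] = [2k, k] with d = 2, r = 1
# satisfies both bounds, and in fact meets each with equality.
for k in range(1, 20):
    n, d, r = 2 * k, 2, 1
    assert k / n <= r / (r + 1)              # Bound 1: rate <= 1/2
    assert d <= n - k - ceil(k / r) + 2      # Bound 2: d <= n - 2k + 2
    assert k / n == r / (r + 1) and d == n - 2 * k + 2   # equality in both
```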

Bound 1

Turan’s Lemma

Let $G$ be a directed graph with $n$ vertices. Then there exists an induced acyclic subgraph (DAG) of $G$ on at least $\frac{n}{1+\operatorname{avg}_i(d^{\mathrm{out}}_i)}$ vertices, where $d^{\mathrm{out}}_i$ is the out-degree of vertex $i$.

Directed graphs have large acyclic subgraphs.

Proof via the probabilistic method

Useful for showing the existence of a large acyclic subgraph, but not for finding it.

Tip

Show that $\mathbb{E}[X]\geq c$ for some quantity $c$; then, by an averaging (pigeonhole) argument, there exists a permutation $\pi$ with $|U_\pi|\geq c$.

For a permutation $\pi$ of $[n]$, define $U_\pi = \{i\in[n] : \pi(j)>\pi(i) \text{ for every out-neighbor } j \text{ of } i\}$.

That is, $i\in U_\pi$ exactly when each of the $d_i^{\mathrm{out}}$ outgoing edges from $i$ leads to a node $j$ with $\pi(j)>\pi(i)$.

In other words, we select the subset of nodes all of whose outgoing edges point to nodes placed later by $\pi$: all edges leaving $U_\pi$ go to the right.

The subgraph induced on $U_\pi$ is clearly acyclic.

Choose $\pi$ uniformly at random and let $X=|U_\pi|$ be a random variable.

Let $X_i$ be the indicator random variable for the event $i\in U_\pi$.

So $X=\sum_{i=1}^{n} X_i$.

Using linearity of expectation, we have

$\mathbb{E}[X]=\sum_{i=1}^{n} \mathbb{E}[X_i]$

$\mathbb{E}[X_i]$ is the probability that $\pi$ places $i$ before all of its out-neighbors.

Among $i$ and its out-neighbors, all $(d_i^{\mathrm{out}}+1)!$ relative orderings under $\pi$ are equally likely.

Of these, $d_i^{\mathrm{out}}!$ place $i$ first (the out-neighbors may then appear in any order).

So $\mathbb{E}[X_i]=\frac{d_i^{\mathrm{out}}!}{(d_i^{\mathrm{out}}+1)!}=\frac{1}{d_i^{\mathrm{out}}+1}$, hence $\mathbb{E}[X]=\sum_{i=1}^n \frac{1}{d_i^{\mathrm{out}}+1}\geq \frac{n}{1+\operatorname{avg}_i(d_i^{\mathrm{out}})}$ by convexity of $x\mapsto\frac{1}{1+x}$.
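The probabilistic argument can also be checked empirically; a small sketch (assumption: a random digraph with out-degree 3 at every node, not an example from the lecture):

```python
import random
random.seed(0)

n = 12
# every vertex gets 3 distinct out-neighbors, so avg out-degree is 3
edges = {i: random.sample([j for j in range(n) if j != i], 3) for i in range(n)}
avg_out = sum(len(v) for v in edges.values()) / n

def u_size(pi):
    """|U_pi|: nodes all of whose out-neighbors come later in pi."""
    pos = {v: t for t, v in enumerate(pi)}
    return sum(all(pos[j] > pos[i] for j in edges[i]) for i in range(n))

samples = []
for _ in range(2000):
    pi = list(range(n))
    random.shuffle(pi)
    samples.append(u_size(pi))

# E[X] = sum_i 1/(d_i + 1) = n/4 here; some sampled permutation attains it
assert max(samples) >= n / (1 + avg_out)
```

The empirical mean of `samples` hovers around $n/(1+\operatorname{avg}_i(d_i^{\mathrm{out}})) = 3$, and the best sampled permutation witnesses the lemma's acyclic subgraph.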

Continue next time.
