CSE5313 Coding and information theory for data science (Lecture 12)


Challenge 1: Reconstruction

  • Minimize reconstruction bandwidth.

Challenge 2: Repair

  • Maintaining data consistency.
  • Failed servers must be repaired:
    • By contacting few other servers (locality, due to geographical constraints).
    • By minimizing bandwidth.

Challenge 3: Storage overhead

  • Minimize space consumption.
  • Minimize redundancy.

Code for storage systems

Naive solution: Replication

Locality is 1 (by copying from another server).

This gives the optimal reconstruction bandwidth.

Use codes to improve storage efficiency

Locality is $n-d+1$; high bandwidth.

Parity codes

Let $X_1,X_2,\ldots,X_n\in \mathbb{F}_2^t$ be the data blocks, and add an extra server to store the parity $X_1\oplus X_2\oplus\cdots\oplus X_n$.

Reconstruction:

Optimal for reconstruction bandwidth. Only $k$ servers are needed to reconstruct the file.

Overhead:

Only need one additional server

Repair:

If any server fails, reconstruct it from the other $n-d+1=n-2+1=n-1$ servers.
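The encode/repair cycle of the parity code can be sketched as follows (a minimal sketch; the representation of blocks as byte strings and the function names are illustrative assumptions, not from the notes):

```python
import secrets

def encode_parity(blocks):
    """Append a parity block: the bitwise XOR of all data blocks."""
    parity = bytes(len(blocks[0]))
    for b in blocks:
        parity = bytes(x ^ y for x, y in zip(parity, b))
    return blocks + [parity]

def repair(stored, failed):
    """Rebuild the block of one failed server by XORing the other n-1 blocks."""
    result = bytes(len(stored[0]))
    for i, b in enumerate(stored):
        if i != failed:
            result = bytes(x ^ y for x, y in zip(result, b))
    return result

# n = 4 data servers, each holding a t-bit (here 8-byte) block
data = [secrets.token_bytes(8) for _ in range(4)]
stored = encode_parity(data)
assert all(repair(stored, f) == stored[f] for f in range(5))  # any one failure
```

Note that repairing the parity server itself is the same operation: XOR the surviving blocks.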

Reed-Solomon codes

Fragment the file $X = (X_1, \ldots, X_k)$.

Need a field of size $2^t\geq n$ to store the file on $n$ servers.

Reconstruction:

Any $k$ servers can reconstruct the file.

Overhead:

Need a field of size $2^t\geq n$ to store the file on $n$ servers.

Repair:

Worse: to repair even one failed server, we must reconstruct the entire file by contacting $k$ other servers.
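A hedged sketch of Reed-Solomon storage via polynomial evaluation (assumptions: arithmetic over the prime field $\mathbb{F}_{257}$ for readability, where the notes use $\mathbb{F}_{2^t}$; fragments sit at evaluation points $0,\ldots,k-1$ so the code is systematic; names are illustrative):

```python
P = 257  # a prime >= n; the lecture works over F_{2^t} instead

def interpolate(points, x):
    """Evaluate at x the unique degree < k polynomial through the k points."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        total = (total + yi * num * pow(den, -1, P)) % P
    return total

def rs_encode(fragments, n):
    # fragments X_1..X_k sit at positions 0..k-1; server i stores f(i)
    pts = list(enumerate(fragments))
    return [interpolate(pts, i) for i in range(n)]

def rs_reconstruct(shares, k):
    # shares: any k pairs (position, symbol); recover X_1..X_k
    return [interpolate(shares, x) for x in range(k)]

file = [10, 20, 30]                 # k = 3 fragments
servers = rs_encode(file, 7)        # n = 7 servers; any 3 suffice
assert rs_reconstruct([(4, servers[4]), (6, servers[6]), (1, servers[1])], 3) == file
```

The repair problem is visible here: rebuilding even one lost evaluation requires gathering $k$ shares and re-interpolating.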

New codes for storage systems

EVENODD code

  • One of the first storage specific codes.

Can an XOR-only code be built that enables reconstruction when two disks are missing?

The locality/bandwidth problem is left for the next lecture.

For a prime $m$, partition $X=(X_0,\ldots,X_{m-1})$, each $X_i$ with $m-1$ bits.

Store $Y_i=X_i$ on disks $0,1,\ldots,m-1$.

Add two redundant disks $Y_m,Y_{m+1}$.

  • $(Y_m)_i$ is the parity of row $i$ (writing $a_{i,j}$ for bit $i$ of disk $j$).
  • $(Y_{m+1})_i$: first define $S=\bigoplus_{j=1}^{m-1} a_{m-1-j,\,j}$ (for $m=5$: $S=a_{0,4}\oplus a_{1,3}\oplus a_{2,2}\oplus a_{3,1}$), then $(Y_{m+1})_i=S\oplus \bigoplus_{j=0}^{m-1}a_{\langle i-j\rangle_m,\,j}$, where row $m-1$ is an all-zero ghost row.
| $Y_0$ | $Y_1$ | $Y_2$ | $Y_3$ | $Y_4$ | $Y_5$ | $Y_6$ |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 1 | 1 | 0 | 1 | 0 |
| 0 | 1 | 1 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 1 |
| 0 | 1 | 0 | 1 | 1 | 1 | 0 |
Note that the $S$ diagonal can be extracted from $Y_m$ and $Y_{m+1}$:

$\bigoplus_{j=0}^{m-2}(Y_m)_j\oplus \bigoplus_{j=0}^{m-2}(Y_{m+1})_j=\underbrace{S\oplus S\oplus\cdots\oplus S}_{m\ \text{times}}=S$, since $m$ is odd.
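The encoding rules above can be sketched directly; the array below is the $m=5$ example from the table, and the function name is illustrative:

```python
m = 5  # a prime; m-1 rows, m data disks Y_0..Y_{m-1}

def evenodd_encode(a):
    """a[i][j]: bit i of data disk j (rows 0..m-2). Returns Y_m, Y_{m+1}."""
    rows = [r[:] for r in a] + [[0] * m]      # append the all-zero ghost row
    S = 0
    for j in range(1, m):                     # S-diagonal: a_{m-1-j, j}
        S ^= rows[m - 1 - j][j]
    Ym, Ym1 = [0] * (m - 1), [0] * (m - 1)
    for i in range(m - 1):
        for j in range(m):
            Ym[i] ^= rows[i][j]               # row parity
            Ym1[i] ^= rows[(i - j) % m][j]    # diagonal parity
        Ym1[i] ^= S
    return Ym, Ym1

# data disks Y_0..Y_4 from the table above
data = [[1, 0, 1, 1, 0],
        [0, 1, 1, 0, 0],
        [1, 1, 0, 0, 0],
        [0, 1, 0, 1, 1]]
Ym, Ym1 = evenodd_encode(data)
assert Ym == [1, 0, 0, 1]      # matches Y_5 in the table
assert Ym1 == [0, 0, 1, 0]     # matches Y_6 in the table
S = 0
for b in Ym + Ym1:             # extract S by XORing all parity bits
    S ^= b
assert S == 1
```

The final three lines check the identity above: XORing every entry of $Y_m$ and $Y_{m+1}$ yields $S$.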

Goal: Reconstruct if any two disks are missing.

  • If $Y_m, Y_{m+1}$ are missing, nothing to do.
  • If $Y_i, Y_{m+1}$ are missing for $i < m$, decode like a parity code.
  • If $Y_i, Y_m$ are missing for $i < m$, similar, using the diagonal parities.

The interesting case: $Y_i, Y_j$ are missing for $i,j < m$.

Using the same skill with which you solve Sudoku puzzles, we can find the missing values.

First we recover the $S$ diagonal from $Y_m$ and $Y_{m+1}$.

Then we alternate: solve for a cell of one missing column using its diagonal parity ($Y_{m+1}$), then for a cell of the other using its row parity ($Y_m$), and repeat.
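The zig-zag can be sketched as follows, reusing the $m=5$ example and its parities $Y_5$, $Y_6$ from the table; the chosen erased pair and the function name are illustrative:

```python
m = 5  # the running example: 4x5 data array

data = [[1, 0, 1, 1, 0],
        [0, 1, 1, 0, 0],
        [1, 1, 0, 0, 0],
        [0, 1, 0, 1, 1]]
Ym, Ym1 = [1, 0, 0, 1], [0, 0, 1, 0]   # Y_5 (row) and Y_6 (diagonal) parities

def decode_two(a, ei, ej, Ym, Ym1):
    """Zig-zag recovery of two erased data columns ei < ej < m."""
    S = 0
    for b in Ym + Ym1:                             # S = XOR of all parity bits
        S ^= b
    full = [row[:] for row in a] + [[0] * m]       # ghost all-zero row
    for row in full[:-1]:
        row[ei] = row[ej] = 0                      # erase the two columns
    D = [Ym1[s] ^ S for s in range(m - 1)] + [S]   # diagonal parities
    r = (ej - 1 - ei) % m        # the diagonal through (r, ei) misses only it
    while r != m - 1:
        s = (r + ei) % m
        full[r][ei] = D[s]                         # diagonal step
        for t in range(m):
            if t != ei:
                full[r][ei] ^= full[(s - t) % m][t]
        full[r][ej] = Ym[r]                        # row step
        for t in range(m):
            if t != ej:
                full[r][ej] ^= full[r][t]
        r = (r + ej - ei) % m    # offset ej - ei is a generator of Z_m
    return full[:-1]

assert decode_two(data, 1, 3, Ym, Ym1) == data   # disks Y_1 and Y_3 restored
```

The loop update makes the termination argument concrete: each round advances the working row by the fixed non-zero offset $j-i$ modulo the prime $m$, so every row is visited before the ghost row ends the process.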

Proof for why it always works

There are $m-1$ rows; $m$ including a ghost row of all $0$s.

$\mathbb{Z}_m$ is cyclic of prime order, so every non-zero element is a generator.

When moving from a diagonal step to a horizontal step, we advance by the fixed non-zero offset $j-i$ in $\mathbb{Z}_m$; since any non-zero offset is a generator, the zig-zag visits every row exactly once before reaching the ghost row, so it never gets stuck.

This is an example of array code:

The message $(X_0,X_1,\ldots,X_{m-1})$ is a matrix in $\mathbb{F}_2^{(m-1)\times m}$.

The codeword $(Y_0,Y_1,\ldots,Y_{m+1})$ is a matrix in $\mathbb{F}_2^{(m-1)\times (m+2)}$.

Encoding is done over $\mathbb{F}_q$ (here $q=2$: XOR only).

Locally Recoverable Codes

Locality: when a node $j$ fails,

  • A newcomer node joins the system.
  • The newcomer contacts a “small” number of helper nodes with the message “repairing $j$”.
  • Each of the helper nodes sends something to the newcomer.
  • The newcomer aggregates the responses to find $Y_j$.

Notes:

  • No adversarial behavior.
  • No privacy issues.
  • No concern about bandwidth (for now).

Research question:

  • How small can the “small number of nodes” be?
  • How does that affect the rate/minimum distance of the code?
  • How to build codes with this capability?

Definition of locally recoverable code

An $[n, k]_q$ code is called $r$-locally recoverable if

  • every codeword symbol $y_j$ has a recovering set $R_j \subseteq [n] \setminus \{j\}$ (where $[n]=\{1,2,\ldots,n\}$),
  • such that $y_j$ is computable from $\{y_i : i \in R_j\}$,
  • $|R_j| \leq r$ for every $j \in [n]$.

Notes:

  • From any $n-d+1$ nodes we can reconstruct the entire file; always assume $k\leq n-d+1$.
  • We want $r\ll n-d+1$.
  • $R_j$ does not depend on $y_j$, nor on the codeword $y$, only on $j$. (We need to repair without knowing $y$ or $y_j$.)
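To make the definition concrete, here is a minimal sketch (an assumption of this note, not a construction from the lecture): split the data into groups of $r$ symbols over $\mathbb{F}_2$ and add one XOR parity per group, so every symbol has a recovering set of size $r$ inside its own group.

```python
r = 3  # locality parameter

def lrc_encode(data):
    """len(data) divisible by r; append one XOR parity per group of r."""
    out = []
    for g in range(0, len(data), r):
        group = data[g:g + r]
        parity = 0
        for x in group:
            parity ^= x
        out.extend(group + [parity])
    return out

def recovering_set(j):
    """Indices R_j: the other r symbols in j's group (|R_j| = r)."""
    g = (j // (r + 1)) * (r + 1)
    return [i for i in range(g, g + r + 1) if i != j]

def repair(codeword, j):
    v = 0
    for i in recovering_set(j):
        v ^= codeword[i]
    return v

c = lrc_encode([5, 1, 7, 2, 4, 6])   # k = 6, n = 8
assert all(repair(c, j) == c[j] for j in range(len(c)))
```

Note $R_j$ here depends only on $j$, as the definition requires, and the rate is $k/n = r/(r+1)$, meeting Bound 1 below with equality.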

Bounds for Locally Recoverable Codes

Let $\mathcal{C}$ be an $r$-locally recoverable $[n, k]_q$ code with minimum distance $d$.

Bound 1: $\frac{k}{n}\leq \frac{r}{r+1}$.

Bound 2: $d\leq n-k-\lceil\frac{k}{r}\rceil +2$.

Notes:

For $r=k$, bound 2 becomes $d\leq n-k+1$.

  • The natural extension of the Singleton bound.

For $r=1$, bound 1 becomes $\frac{k}{n}\leq \frac{1}{2}$.

  • The duplication code is a trivial code meeting this bound.

For $r=1$, bound 2 becomes $d\leq n-2k+2$.

  • The duplication code is a trivial code meeting this bound.
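Both duplication-code remarks can be checked numerically; a small sketch under the standard duplication-code parameters $n=2k$, $d=2$, $r=1$ (each symbol's copy is its recovering set):

```python
from math import ceil

# Verify that the duplication code [n, k] = [2k, k] with d = 2, r = 1
# satisfies both bounds, and in fact meets each with equality.
for k in range(1, 20):
    n, d, r = 2 * k, 2, 1
    assert k / n <= r / (r + 1)              # Bound 1: rate <= 1/2
    assert d <= n - k - ceil(k / r) + 2      # Bound 2: d <= n - 2k + 2
    assert k / n == r / (r + 1) and d == n - 2 * k + 2   # equality in both
```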

Bound 1

Turan’s Lemma

Let $G$ be a directed graph with $n$ vertices. Then there exists an induced acyclic subgraph (DAG) of $G$ on at least $\frac{n}{1+\operatorname{avg}_i(d^{\mathrm{out}}_i)}$ vertices, where $d^{\mathrm{out}}_i$ is the out-degree of vertex $i$.

Directed graphs have large acyclic subgraphs.

Proof via the probabilistic method

Useful for showing the existence of a large acyclic subgraph, but not for finding it.

Tip

Show that $\mathbb{E}[X]\geq c$ for some quantity $c$; then, by an averaging (pigeonhole) argument, there exists a permutation $\pi$ with $|U_\pi|\geq c$.

For a permutation $\pi$ of $[n]$, define $U_\pi = \{i\in[n] : \pi(j)>\pi(i) \text{ for every out-neighbor } j \text{ of } i\}$.

That is, $i\in U_\pi$ exactly when each of the $d_i^{\mathrm{out}}$ outgoing edges from $i$ leads to a node $j$ with $\pi(j)>\pi(i)$.

In other words, we select the subset of nodes all of whose outgoing edges point to nodes placed later by $\pi$: all edges leaving $U_\pi$ go to the right.

The subgraph induced on $U_\pi$ is clearly acyclic.

Choose $\pi$ uniformly at random and let $X=|U_\pi|$ be a random variable.

Let $X_i$ be the indicator random variable for the event $i\in U_\pi$.

So $X=\sum_{i=1}^{n} X_i$.

Using linearity of expectation, we have

$\mathbb{E}[X]=\sum_{i=1}^{n} \mathbb{E}[X_i]$

$\mathbb{E}[X_i]$ is the probability that $\pi$ places $i$ before all of its out-neighbors.

Among $i$ and its out-neighbors, all $(d_i^{\mathrm{out}}+1)!$ relative orderings under $\pi$ are equally likely.

Of these, $d_i^{\mathrm{out}}!$ place $i$ first (the out-neighbors may then appear in any order).

So $\mathbb{E}[X_i]=\frac{d_i^{\mathrm{out}}!}{(d_i^{\mathrm{out}}+1)!}=\frac{1}{d_i^{\mathrm{out}}+1}$, hence $\mathbb{E}[X]=\sum_{i=1}^n \frac{1}{d_i^{\mathrm{out}}+1}\geq \frac{n}{1+\operatorname{avg}_i(d_i^{\mathrm{out}})}$ by convexity of $x\mapsto\frac{1}{1+x}$.
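The probabilistic argument can also be checked empirically; a small sketch (assumption: a random digraph with out-degree 3 at every node, not an example from the lecture):

```python
import random
random.seed(0)

n = 12
# every vertex gets 3 distinct out-neighbors, so avg out-degree is 3
edges = {i: random.sample([j for j in range(n) if j != i], 3) for i in range(n)}
avg_out = sum(len(v) for v in edges.values()) / n

def u_size(pi):
    """|U_pi|: nodes all of whose out-neighbors come later in pi."""
    pos = {v: t for t, v in enumerate(pi)}
    return sum(all(pos[j] > pos[i] for j in edges[i]) for i in range(n))

samples = []
for _ in range(2000):
    pi = list(range(n))
    random.shuffle(pi)
    samples.append(u_size(pi))

# E[X] = sum_i 1/(d_i + 1) = n/4 here; some sampled permutation attains it
assert max(samples) >= n / (1 + avg_out)
```

The empirical mean of `samples` hovers around $n/(1+\operatorname{avg}_i(d_i^{\mathrm{out}})) = 3$, and the best sampled permutation witnesses the lemma's acyclic subgraph.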

Continue next time.
