
CSE5313 Coding and information theory for data science (Lecture 24)

Continue on coded computing

Coded computing scheme

Matrix-vector multiplication: $y=Ax$, where $A\in \mathbb{F}^{M\times N}$, $x\in \mathbb{F}^N$.

  • MDS codes.
    • Recovery threshold $K=M$.
  • Short-dot codes.
    • Recovery threshold $K\geq M$.
    • Every node receives at most $s=\frac{P-K+M}{P}\cdot N$ elements of $x$.

Matrix-matrix multiplication

Problem Formulation:

  • $A=[A_0, A_1, \ldots, A_{M-1}]\in \mathbb{F}^{L\times L}$, $B=[B_0, B_1, \ldots, B_{M-1}]\in \mathbb{F}^{L\times L}$.
  • $A_m, B_m$ are column-block submatrices of $A, B$, each of size $L\times \frac{L}{M}$.
  • We want to compute $C=A^\top B$.

Trivial solution:

  • Index each worker node by $(m,n)$ with $m,n\in [0,M-1]$.
  • Worker node $(m,n)$ performs the matrix multiplication $A_m^\top B_n$.
  • Needs $P=M^2$ nodes.
  • No erasure tolerance.
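The trivial scheme can be sketched in a few lines of numpy; this is a toy illustration over the reals (the lecture works over a finite field $\mathbb{F}$), with all names illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
L, M = 4, 2                          # A, B are L x L, split into M column blocks
A = rng.integers(0, 5, (L, L)).astype(float)
B = rng.integers(0, 5, (L, L)).astype(float)

A_blocks = np.hsplit(A, M)           # each block is L x (L/M)
B_blocks = np.hsplit(B, M)

# Worker (m, n) computes the (m, n) block of C = A^T B.
# P = M^2 workers, and every single one must respond: no redundancy.
C = np.block([[A_blocks[m].T @ B_blocks[n] for n in range(M)]
              for m in range(M)])

assert np.allclose(C, A.T @ B)
```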

Can we do better?

1-D MDS Method

Create $[\tilde{A}_0,\tilde{A}_1,\ldots,\tilde{A}_{S-1}]$ by encoding $[A_0,A_1,\ldots,A_{M-1}]$ with some $(S,M)$ MDS code.

Need $P=SM$ worker nodes; index each one by $(s,n)$ with $s\in [0,S-1]$, $n\in [0,M-1]$.

Worker node $(s,n)$ performs the matrix multiplication $\tilde{A}_s^\top B_n$.

For example, with $M=2$ and $S=3$:

$$\begin{bmatrix} A_0^\top\\ A_1^\top\\ A_0^\top+A_1^\top \end{bmatrix} \begin{bmatrix} B_0 & B_1 \end{bmatrix}$$

Any $M$ responses from each column suffice, i.e., each column tolerates $S-M$ erasures.

In the worst case all erasures fall in one column, so the recovery threshold is $K=P-S+M$ nodes.

The example above is a single parity-check code ($S=M+1$), which tolerates one erasure per column.
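A minimal numpy sketch of the $M=2$, $S=3$ single-parity example, again over the reals rather than a finite field; the decoding line is specific to this parity code:

```python
import numpy as np

rng = np.random.default_rng(1)
L, M, S = 4, 2, 3
A = rng.integers(0, 5, (L, L)).astype(float)
B = rng.integers(0, 5, (L, L)).astype(float)
A0, A1 = np.hsplit(A, M)
B_blocks = np.hsplit(B, M)

# (3, 2) single-parity MDS encoding of the A blocks: [A0, A1, A0 + A1]
A_enc = [A0, A1, A0 + A1]

# Worker (s, n) computes A_enc[s]^T @ B_n; suppose workers (0, n) straggle
results = {(s, n): A_enc[s].T @ B_blocks[n]
           for s in range(S) for n in range(M) if s != 0}

# Decode column n: A0^T B_n = (A0 + A1)^T B_n - A1^T B_n
C = np.block([
    [results[(2, 0)] - results[(1, 0)], results[(2, 1)] - results[(1, 1)]],
    [results[(1, 0)], results[(1, 1)]],
])
assert np.allclose(C, A.T @ B)
```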

2-D MDS Method

Encode $[A_0,A_1,\ldots,A_{M-1}]$ with some $(S,M)$ MDS code.

Encode $[B_0,B_1,\ldots,B_{M-1}]$ with some $(S,M)$ MDS code.

Need $P=S^2$ nodes.

For example, with $M=2$ and $S=3$:

$$\begin{bmatrix} A_0^\top\\ A_1^\top\\ A_0^\top+A_1^\top \end{bmatrix} \begin{bmatrix} B_0 & B_1 & B_0+B_1 \end{bmatrix}$$

Decodability depends on the erasure pattern.

  • Consider an $S\times S$ bipartite graph (rows on the left, columns on the right).
  • Draw an $(i,j)$ edge if $\tilde{A}_i^\top\tilde{B}_j$ is missing.
  • Row $i$ is decodable if and only if the degree of the $i$'th left node is $\leq S-M$.
  • Column $j$ is decodable if and only if the degree of the $j$'th right node is $\leq S-M$.

Peeling algorithm:

  • Traverse the graph.
  • If there exists a node $v$ with $\deg v\leq S-M$, decode it and remove its edges.
  • Repeat until no such node exists.
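The peeling algorithm above can be sketched directly; this is a hypothetical helper, not code from the lecture:

```python
def decodable(missing, S, M):
    """Peeling decoder on the S x S bipartite erasure graph.

    `missing` is a set of (i, j) pairs meaning A~_i^T B~_j was not received.
    A row/column node can be peeled (decoded) once its degree is <= S - M.
    Returns True iff all edges can eventually be removed.
    """
    edges = set(missing)
    changed = True
    while changed and edges:
        changed = False
        for side in (0, 1):                   # side 0: row nodes, 1: column nodes
            for v in range(S):
                inc = [e for e in edges if e[side] == v]
                if 0 < len(inc) <= S - M:
                    edges -= set(inc)         # decode node v, remove its edges
                    changed = True
    return not edges

S, M = 3, 2
assert decodable({(0, 0), (1, 1)}, S, M)                      # scattered: OK
assert not decodable({(0, 0), (0, 1), (1, 0), (1, 1)}, S, M)  # (S-M+1)^2 grid
```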

Corollary:

  • A pattern is decodable if and only if the above graph does not contain a (nonempty) subgraph in which every node has degree larger than $S-M$.

Note

  1. $K_{1D\text{-}MDS}=P-S+M=\Theta(P)$ (scales linearly with $P$).
  2. $K_{2D\text{-}MDS}=P-(S-M+1)^2+1$.
    • Consider the $S\times S$ bipartite graph containing an $(S-M+1)\times (S-M+1)$ complete bipartite subgraph.
    • There exists a subgraph with all degrees larger than $S-M$ $\implies$ not decodable.
    • On the other hand: fewer than $(S-M+1)^2$ edges cannot form a subgraph with all degrees $>S-M$.
    • $K$ scales sub-linearly with $P$.
  3. $K_{product}<P-M^2=S^2-M^2=\Theta(\sqrt{P})$.

Our goal is a recovery threshold that does not depend on $P$.

Polynomial codes

Polynomial representation

Coefficient representation of a polynomial:

  • $f(x)=f_dx^d+f_{d-1}x^{d-1}+\cdots+f_1x+f_0$
  • Uniquely defined by its coefficients $[f_d,f_{d-1},\ldots,f_0]$.

Value representation of a polynomial:

  • Theorem: A polynomial of degree $d$ is uniquely determined by $d+1$ points.
  • Proof outline: First construct a polynomial of degree $d$ through the $d+1$ points using Lagrange interpolation, then show such a polynomial is unique.
  • Uniquely defined by the evaluations $[(\alpha_1,f(\alpha_1)),\ldots,(\alpha_{d+1},f(\alpha_{d+1}))]$.

Why should we want the value representation?

  • With the coefficient representation, multiplying two degree-$d$ polynomials takes $O(d^2)$ multiplications.
  • With the value representation, it takes only $2d+1$ pointwise multiplications (one per evaluation point, since the product has degree $2d$).
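A tiny numpy sketch of multiplying in the value representation, using $2d+1=3$ points for two degree-1 polynomials:

```python
import numpy as np

# Multiply f(x) = 1 + 2x and g(x) = 3 + x via the value representation.
d = 1                                  # both factors have degree d
xs = np.arange(2 * d + 1)              # 2d + 1 = 3 evaluation points
f_vals = 1 + 2 * xs
g_vals = 3 + xs
h_vals = f_vals * g_vals               # one multiplication per point

# Interpolate back to coefficients (listed lowest degree first)
h = np.polyfit(xs, h_vals, 2 * d)[::-1]
assert np.allclose(h, [3, 7, 2])       # (1 + 2x)(3 + x) = 3 + 7x + 2x^2
```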

Definition of a polynomial code

link to paper 

Problem formulation:

$A=[A_0,A_1,\ldots,A_{M-1}]\in \mathbb{F}^{L\times L}$, $B=[B_0,B_1,\ldots,B_{M-1}]\in \mathbb{F}^{L\times L}$.

We want to compute $C=A^\top B$.

Define matrix polynomials:

$p_A(x)=\sum_{i=0}^{M-1} A_i x^i$, of degree $M-1$,

$p_B(x)=\sum_{j=0}^{M-1} B_j x^{jM}$, of degree $M(M-1)$,

where each $A_i, B_j$ is a matrix.

We have

$$h(x)=p_A(x)^\top p_B(x)=\sum_{i=0}^{M-1}\sum_{j=0}^{M-1} A_i^\top B_j x^{i+jM}$$

$$\deg h(x)\leq M(M-1)+M-1=M^2-1$$

Observe that

$$x^{i_1+j_1M}=x^{i_2+j_2M}$$

if and only if $i_1=i_2$ and $j_1=j_2$ (for $i,j\in[0,M-1]$, since $i+jM$ is a base-$M$ representation).

Hence the coefficient of $x^{i+jM}$ is exactly $A_i^\top B_j$.

Computing $C=A^\top B$ is therefore equivalent to finding the coefficient representation of $h(x)$.

Encoding of polynomial codes

The master chooses distinct $\omega_0,\omega_1,\ldots,\omega_{P-1}\in \mathbb{F}$.

  • Note that this requires $|\mathbb{F}|\geq P$.

For every node $i\in [0,P-1]$, the master computes $\tilde{A}_i=p_A(\omega_i)$.

  • Equivalent to multiplying $[A_0,A_1,\ldots,A_{M-1}]$ by a Vandermonde matrix over $\omega_0,\omega_1,\ldots,\omega_{P-1}$.
  • Can be sped up using the FFT.

Similarly, the master computes $\tilde{B}_i=p_B(\omega_i)$ for every node $i\in [0,P-1]$.

Every node $i\in [0,P-1]$ computes and returns $c_i=p_A(\omega_i)^\top p_B(\omega_i)$ to the master.

$c_i$ is the evaluation of the polynomial $h(x)=p_A(x)^\top p_B(x)$ at $\omega_i$.

Recall that $h(x)=\sum_{i=0}^{M-1}\sum_{j=0}^{M-1} A_i^\top B_j x^{i+jM}$.

  • Computing $C=A^\top B$ is equivalent to finding the coefficient representation of $h(x)$.

Recall that a polynomial of degree $d$ is uniquely determined by $d+1$ points.

  • With $M^2$ evaluations of $h(x)$, we can recover the coefficient representation of $h(x)$.

The recovery threshold is $K=M^2$, independent of $P$, the number of worker nodes.

Done.
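The whole polynomial-code pipeline fits in a short numpy sketch; this is a toy model over the reals (floating point stands in for the finite field $\mathbb{F}$), and all variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
L, M, P = 4, 2, 6
K = M * M                                   # recovery threshold K = M^2
A = rng.integers(0, 3, (L, L)).astype(float)
B = rng.integers(0, 3, (L, L)).astype(float)
A_blk = np.hsplit(A, M)                     # column blocks A_0, A_1
B_blk = np.hsplit(B, M)

omegas = np.arange(1, P + 1, dtype=float)   # distinct evaluation points
pA = lambda x: sum(A_blk[i] * x**i for i in range(M))        # degree M-1
pB = lambda x: sum(B_blk[j] * x**(j * M) for j in range(M))  # degree M(M-1)

# Worker i returns c_i = pA(w_i)^T pB(w_i) = h(w_i)
c = [pA(w).T @ pB(w) for w in omegas]

# Master interpolates h from any K responses (here: the first K workers)
V = np.vander(omegas[:K], K, increasing=True)        # V[k, t] = w_k^t
coeffs = np.linalg.solve(V, np.stack(c[:K]).reshape(K, -1))
coeffs = coeffs.reshape(K, L // M, L // M)           # coeff of x^(i+jM) is A_i^T B_j
C = np.block([[coeffs[i + j * M] for j in range(M)] for i in range(M)])
assert np.allclose(C, A.T @ B)
```

Note that only $K=M^2=4$ of the $P=6$ responses are used: the other two workers may straggle.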

MatDot Codes

link to paper 

Problem formulation:

  • We want to compute $C=A^\top B$.

  • Unlike polynomial codes, we partition into row blocks: $A=\begin{bmatrix} A_0\\ A_1\\ \vdots\\ A_{M-1} \end{bmatrix}$ and $B=\begin{bmatrix} B_0\\ B_1\\ \vdots\\ B_{M-1} \end{bmatrix}$, where $A,B\in \mathbb{F}^{L\times L}$.

  • In polynomial codes, the partition was into column blocks: $A=[A_0, A_1, \ldots, A_{M-1}]$ and $B=[B_0, B_1, \ldots, B_{M-1}]$.

Key observation:

$A_m^\top$ is an $L\times \frac{L}{M}$ matrix, and $B_m$ is an $\frac{L}{M}\times L$ matrix. Hence, $A_m^\top B_m$ is an $L\times L$ matrix.

Then $C=A^\top B=\sum_{m=0}^{M-1} A_m^\top B_m$.

Let $p_A(x)=\sum_{m=0}^{M-1} A_m x^m$, of degree $M-1$.

Let $p_B(x)=\sum_{m=0}^{M-1} B_m x^{M-1-m}$, of degree $M-1$ (note the reversed order of the exponents).

And let $h(x)=p_A(x)^\top p_B(x)$.

$$\deg h(x)\leq (M-1)+(M-1)=2M-2$$

Key observation:

  • The coefficient of the term $x^{M-1}$ in $h(x)$ is $\sum_{m=0}^{M-1} A_m^\top B_m$: the term $A_m^\top B_{m'} x^{m+(M-1-m')}$ contributes to $x^{M-1}$ exactly when $m=m'$.

Recall that $C=A^\top B=\sum_{m=0}^{M-1} A_m^\top B_m$.

Finding this coefficient is equivalent to finding the result of $A^\top B$.

Here we trade network bandwidth for computational power: each response is a full $L\times L$ matrix.

General Scheme for MatDot Codes

The master chooses distinct $\omega_0,\omega_1,\ldots,\omega_{P-1}\in \mathbb{F}$.

  • Note that this requires $|\mathbb{F}|\geq P$.

For every node $i\in [0,P-1]$, the master computes $\tilde{A}_i=p_A(\omega_i)$ and $\tilde{B}_i=p_B(\omega_i)$.

  • $p_A(x)=\sum_{m=0}^{M-1} A_m x^m$, of degree $M-1$.
  • $p_B(x)=\sum_{m=0}^{M-1} B_m x^{M-1-m}$, of degree $M-1$.

The master sends $\tilde{A}_i,\tilde{B}_i$ to node $i$.

Every node $i\in [0,P-1]$ computes and returns $c_i=p_A(\omega_i)^\top p_B(\omega_i)$ to the master.

The master needs $\deg h(x)+1=2M-1$ evaluations to obtain $h(x)$.

  • The recovery threshold is $K=2M-1$.
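The MatDot scheme in a short numpy sketch, mirroring the polynomial-code toy above (reals standing in for $\mathbb{F}$, names illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
L, M, P = 4, 2, 5
A = rng.integers(0, 3, (L, L)).astype(float)
B = rng.integers(0, 3, (L, L)).astype(float)
A_blk = np.vsplit(A, M)                      # row blocks, each (L/M) x L
B_blk = np.vsplit(B, M)

omegas = np.arange(1, P + 1, dtype=float)
pA = lambda x: sum(A_blk[m] * x**m for m in range(M))            # degree M-1
pB = lambda x: sum(B_blk[m] * x**(M - 1 - m) for m in range(M))  # degree M-1

c = [pA(w).T @ pB(w) for w in omegas]        # each an L x L evaluation of h

# Any K = 2M - 1 responses determine h; its x^(M-1) coefficient is A^T B.
K = 2 * M - 1
V = np.vander(omegas[:K], K, increasing=True)
coeffs = np.linalg.solve(V, np.stack(c[:K]).reshape(K, -1))
C = coeffs[M - 1].reshape(L, L)
assert np.allclose(C, A.T @ B)
```

Note the bandwidth cost: every $c_i$ is $L\times L$, while in polynomial codes each response is only $\frac{L}{M}\times\frac{L}{M}$.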

Recap on Matrix-Matrix multiplication

$A,B\in \mathbb{F}^{L\times L}$; we want to compute $C=A^\top B$ with $P$ nodes.

Every node receives $\frac{1}{M}$ of $A$ and $\frac{1}{M}$ of $B$.

| Code | Recovery threshold $K$ |
| --- | --- |
| 1D-MDS | $\Theta(P)$ |
| 2D-MDS | $\leq \Theta(\sqrt{P})$ |
| Polynomial codes | $\Theta(M^2)$ |
| MatDot codes | $\Theta(M)$ |

Polynomial Evaluation

Problem formulation:

  • We have $K$ datasets $X_1,X_2,\ldots,X_K$.
  • Want to compute some polynomial function $f$ of degree $d$ on each dataset.
    • That is, we want $f(X_1),f(X_2),\ldots,f(X_K)$.
  • Examples:
    • $X_1,X_2,\ldots,X_K$ are points in $\mathbb{F}^{M\times M}$, and $f(X)=X^8+3X^2+1$.
    • $X_k=(X_k^{(1)},X_k^{(2)})$, both in $\mathbb{F}^{M\times M}$, and $f(X_k)=X_k^{(1)}X_k^{(2)}$.
    • Gradient computation.

$P$ worker nodes:

  • Some are stragglers, i.e., not responsive.
  • Some are adversaries, i.e., return erroneous results.
  • Privacy: we do not want to expose the datasets to the worker nodes.

Replication code

Suppose $P=(r+1)\cdot K$.

  • Partition the $P$ nodes into $K$ groups of size $r+1$ each.
  • Every node in group $i$ computes and returns $f(X_i)$ to the master.
  • Replication tolerates $r$ stragglers, or $\lfloor \frac{r}{2} \rfloor$ adversaries (by majority vote within each group).
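The majority-vote decoder for replication can be sketched as follows; `decode_replication` is a hypothetical helper, not from the lecture:

```python
from collections import Counter

def decode_replication(responses, K, r):
    """responses[i][t] = value returned by the t-th node of group i
    (None for a straggler). With r + 1 copies per group, majority vote
    survives r stragglers, or floor(r/2) adversaries: a wrong value would
    need more corrupt copies than correct ones to win the vote."""
    out = []
    for group in responses:
        votes = Counter(v for v in group if v is not None)
        out.append(votes.most_common(1)[0][0])
    return out

# K = 2 groups with r = 2 extra copies each:
# group 0 has a straggler, group 1 has an adversarial value (99)
resp = [[7, None, 7], [4, 4, 99]]
assert decode_replication(resp, K=2, r=2) == [7, 4]
```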

Linear codes

Recall previous linear computations (matrix-vector):

  • $[\tilde{A}_1,\tilde{A}_2,\tilde{A}_3]=[A_1,A_2,A_1+A_2]$ is the codeword corresponding to $[A_1,A_2]$.
  • Every worker node $i$ computes $f(\tilde{A}_i)=\tilde{A}_i x$.
  • $[\tilde{A}_1x, \tilde{A}_2x, \tilde{A}_3x]=[A_1x,A_2x,A_1x+A_2x]$ is the codeword corresponding to $[A_1x,A_2x]$.
    • This enables decoding $[A_1x,A_2x]$ from $[\tilde{A}_1x,\tilde{A}_2x,\tilde{A}_3 x]$.

However, $f$ is a polynomial of degree $d$, not a linear transformation unless $d=1$.

  • $f(cX)\neq cf(X)$ in general, where $c$ is a constant.
  • $f(X_1+X_2)\neq f(X_1)+f(X_2)$ in general.

Caution

$[f(\tilde{X}_1),f(\tilde{X}_2),\ldots,f(\tilde{X}_P)]$ is not the codeword corresponding to $[f(X_1),f(X_2),\ldots,f(X_K)]$ in any linear code.

Our goal is to create an encoder/decoder such that:

  • Linear encoding: $[\tilde{X}_1,\tilde{X}_2,\ldots,\tilde{X}_P]$ is the codeword of $[X_1,X_2,\ldots,X_K]$ under some linear code.
    • I.e., $[\tilde{X}_1,\tilde{X}_2,\ldots,\tilde{X}_P]=[X_1,X_2,\ldots,X_K]G$ for some generator matrix $G$.
    • Every $\tilde{X}_i$ is some linear combination of $X_1,\ldots,X_K$.
  • The $f(X_i)$ are decodable from some subset of the $f(\tilde{X}_i)$'s.
    • Even when some of the coded results are missing or erroneous.
  • The $X_i$'s are kept private.

Lagrange Coded Computing

Let $\ell(z)$ be a polynomial whose evaluations at $\omega_1,\ldots,\omega_{K}$ are $X_1,\ldots,X_K$.

  • That is, $\ell(\omega_i)=X_i$ for every $\omega_i\in \mathbb{F}$, $i\in [K]$.

Construction by Lagrange interpolation:

Given $X_1,\ldots,X_K$ with corresponding $\omega_1,\ldots,\omega_K$,

  • $\ell(z)=\sum_{i=1}^K X_iL_i(z)$, where $L_i(z)=\prod_{j\in[K],j\neq i} \frac{z-\omega_j}{\omega_i-\omega_j}$ satisfies $L_i(\omega_j)=\begin{cases} 0 & \text{if } j\neq i \\ 1 & \text{if } j=i \end{cases}$.

Then every $f(X_i)=f(\ell(\omega_i))$ is an evaluation of the polynomial $f\circ \ell$ at $\omega_i$.

If the master obtains the composition $h=f\circ \ell$, it can obtain every $f(X_i)=h(\omega_i)$.

Goal: the master wishes to obtain the polynomial $h(z)=f(\ell(z))$.

Intuition:

  • Encoding is performed by evaluating $\ell(z)$ at $\alpha_1,\ldots,\alpha_P\in \mathbb{F}$, with $P>K$ for redundancy.
  • Nodes apply $f$ to an evaluation of $\ell$ and thereby obtain an evaluation of $h$.
  • The master receives some potentially noisy subset of the evaluations and recovers $h$.
  • The master evaluates $h$ at $\omega_1,\ldots,\omega_K$ to obtain $f(X_1),\ldots,f(X_K)$.

Encoding for Lagrange coded computing

Need a polynomial $\ell(z)$ such that:

  • $X_k=\ell(\omega_k)$ for every $k\in [K]$.

Having obtained such an $\ell$, we let $\tilde{X}_i=\ell(\alpha_i)$ for every $i\in [P]$.

Tool: Lagrange interpolation.

  • $\ell_k(z)=\prod_{j\neq k} \frac{z-\omega_j}{\omega_k-\omega_j}$.
  • $\ell_k(\omega_k)=1$ and $\ell_k(\omega_j)=0$ for every $j\neq k$.
  • $\deg \ell_k(z)=K-1$.

Let $\ell(z)=\sum_{k=1}^K X_k\ell_k(z)$.

  • $\deg \ell\leq K-1$.
  • $\ell(\omega_k)=X_k$ for every $k\in [K]$.

Let $\tilde{X}_i=\ell(\alpha_i)=\sum_{k=1}^K X_k\ell_k(\alpha_i)$.

Every $\tilde{X}_i$ is a linear combination of $X_1,\ldots,X_K$:

$$(\tilde{X}_1,\tilde{X}_2,\ldots,\tilde{X}_P)=(X_1,\ldots,X_K)\cdot G=(X_1,\ldots,X_K)\begin{bmatrix} \ell_1(\alpha_1) & \ell_1(\alpha_2) & \cdots & \ell_1(\alpha_P) \\ \ell_2(\alpha_1) & \ell_2(\alpha_2) & \cdots & \ell_2(\alpha_P) \\ \vdots & \vdots & \ddots & \vdots \\ \ell_K(\alpha_1) & \ell_K(\alpha_2) & \cdots & \ell_K(\alpha_P) \end{bmatrix}$$

This $G$ is called a Lagrange matrix with respect to

  • $\omega_1,\ldots,\omega_K$ (interpolation points), and
  • $\alpha_1,\ldots,\alpha_P$ (evaluation points).
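The Lagrange coded computing pipeline can be sketched end to end in numpy. This toy uses scalar datasets over the reals (the scheme works over any large enough field, and on matrix-valued $X_k$ as well); all names and the choice of points are illustrative:

```python
import numpy as np

K, d = 3, 2                                   # K datasets, f of degree d
X = np.array([2.0, 5.0, 7.0])                 # scalar datasets, for simplicity
f = lambda x: x**2 + 1.0                      # the polynomial to compute

omega = np.arange(1.0, K + 1)                 # interpolation points w_1..w_K
P = d * (K - 1) + 2                           # workers: deg(h)+1, plus one spare
alpha = np.arange(4.0, 4 + P)                 # evaluation points, distinct from omega

def ell_k(k, z, pts):
    """Lagrange basis polynomial: 1 at pts[k], 0 at the other points."""
    out = 1.0
    for j, w in enumerate(pts):
        if j != k:
            out *= (z - w) / (pts[k] - w)
    return out

# Linear encoding via the Lagrange matrix G: X~ = X G, G[k, i] = ell_k(alpha_i)
G = np.array([[ell_k(k, a, omega) for a in alpha] for k in range(K)])
X_enc = X @ G

# Each worker applies f; f(X~_i) is an evaluation of h = f∘ell, deg(h) <= d(K-1)
h_vals = f(X_enc)

# Master interpolates h from deg(h)+1 responses, then reads f(X_k) = h(omega_k)
need = d * (K - 1) + 1
h = np.polyfit(alpha[:need], h_vals[:need], need - 1)
assert np.allclose(np.polyval(h, omega), f(X))
```

With $P$ workers and $\deg h = d(K-1)$, up to $P - d(K-1) - 1$ stragglers can be tolerated (one, in this toy).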

Continued in the next lecture.
