CSE5313 Coding and information theory for data science (Lecture 23)
Coded Computing
Motivation
Some facts:
- Moore’s law is saturating.
- Improving CPU performance is hard.
- Modern datasets are growing remarkably large.
- E.g., TikTok, YouTube.
- Learning tasks are computationally heavy.
- E.g., training neural networks.
Solution: Distributed Computing for Scalability
- Offloading computation tasks to multiple computation nodes.
- Gather and accumulate computation results.
- E.g., Apache Hadoop, Apache Spark, MapReduce.
General Framework
- The system involves one master node and $N$ worker nodes.
- The master has a dataset $X$ and wants $f(X)$, where $f$ is some function.
- The master partitions $X$ into $X_1, \ldots, X_N$, and sends $X_i$ to node $i$.
- Every node $i$ computes $g(X_i)$, where $g$ is some function.
- Finally, the master collects $g(X_1), \ldots, g(X_N)$ and computes $f(X) = h(g(X_1), \ldots, g(X_N))$, where $h$ is some function (a minimal sketch follows).
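A minimal sketch of this framework, assuming $f(X)$ is a sum of squares that decomposes as $h(g(X_1), \ldots, g(X_N))$; the choice of $f$, $g$, $h$ and all sizes are illustrative, not part of the lecture:

```python
# Minimal master/worker framework: f(X) = h(g(X_1), ..., g(X_N)).
# Illustrative choice: f = sum of squares, g = per-block sum of
# squares, h = sum of the partial results.
import numpy as np

def g(block):                         # computed at worker i on its part X_i
    return np.sum(block ** 2)

def h(partials):                      # computed at the master
    return sum(partials)

X = np.arange(12, dtype=float)        # the master's dataset
blocks = np.array_split(X, 4)         # partition X and send X_i to node i
partials = [g(b) for b in blocks]     # every node i returns g(X_i)
assert h(partials) == np.sum(X ** 2)  # the master recovers f(X)
```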
Challenges
Stragglers
- Nodes that are significantly slower than the others.
Adversaries
- Nodes that return erroneous results.
- Computation/communication errors.
- Adversarial attacks.
Privacy
- Nodes may be curious about the dataset.
Resemblance to a communication channel
Suppose $f$ is the identity function, and let $x = (x_1, \ldots, x_N) \in \mathbb{F}^N$ be a message.
- Each $x_i$ is a field element.
- $\mathbb{F}$ could be $\mathbb{R}$ or a finite field $\mathbb{F}_q$.
Observation: This is a distributed storage system.
- An erasure corresponds to a node that does not respond.
- An error corresponds to a node that returns erroneous results.
Solution:
- Add redundancy to the message
- Error-correcting codes.
Coded Distributed Computing
- The master partitions $X$ and encodes it before sending to the workers.
- Workers perform computations on the coded data and generate coded results.
- The master decodes the coded results and obtains $f(X)$.
Outline
Matrix-Vector Multiplication
- MDS codes.
- Short-Dot codes.
Matrix-Matrix Multiplication
- Polynomial codes.
- MatDot codes.
Polynomial Evaluation
- Lagrange codes.
- Application to blockchain.
Trivial solution - replication
Consider computing $Ax$, with $A$ partitioned into $K$ blocks $A_1, \ldots, A_K$.
Why no straggler tolerance?
- An individual worker node is employed to compute each $A_i x$, so a single straggler stalls the entire computation.
Replicate the computation?
- Let $s + 1$ nodes compute every $A_i x$.
We need $K(s + 1)$ worker nodes to tolerate $s$ erasures and $K(2e + 1)$ worker nodes to tolerate $e$ adversaries (e.g., $K = 10$ and $s = 2$ already require $30$ nodes).
Use of MDS codes
Let $A \in \mathbb{F}^{m \times \ell}$ and $x \in \mathbb{F}^{\ell}$.
Let $A_1, A_2$ be submatrices of $A$ such that $A = \begin{pmatrix} A_1 \\ A_2 \end{pmatrix}$.
- Worker node 1 computes $A_1 x$.
- Worker node 2 computes $A_2 x$.
- Worker node 3 computes $(A_1 + A_2) x$.
Observation: $Ax$ can be obtained from the responses of any two worker nodes (a numeric check follows).
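A quick numeric check of this observation, with illustrative sizes:

```python
# Three workers, any two of whose responses determine A x.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 5))              # A stacked as [A1; A2]
x = rng.standard_normal(5)
A1, A2 = A[:2], A[2:]

r1, r2, r3 = A1 @ x, A2 @ x, (A1 + A2) @ x   # the three workers' responses

# Workers 1 and 3 respond: recover A2 x = r3 - r1.
assert np.allclose(np.concatenate([r1, r3 - r1]), A @ x)
# Workers 2 and 3 respond: recover A1 x = r3 - r2.
assert np.allclose(np.concatenate([r3 - r2, r2]), A @ x)
```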
Let $B \in \mathbb{F}^{N \times K}$ be (the transpose of) a generator matrix of an $[N, K]$ MDS code.
The master node computes $\tilde{A} = BA$ block-wise: $A$ is partitioned into $K$ blocks $A_1, \ldots, A_K$, and $\tilde{A}_i = \sum_{j=1}^{K} B_{i,j} A_j$.
Every worker node $i$ computes $\tilde{A}_i x$, where $\tilde{A}_i = b_i A$ (block-wise).
- $b_i$ is the $i$-th row of $B$.
Notice that $(\tilde{A}_1 x, \ldots, \tilde{A}_N x)$ is the codeword of $(A_1 x, \ldots, A_K x)$.
Node $i$ computes an entry in this codeword.
Response = entry of the codeword.
The master does not need all workers to respond to obtain $Ax$.
- The MDS property allows decoding from any $K$ of the $\tilde{A}_i x$'s.
- This scheme tolerates $N - K$ erasures, and the recovery threshold is $K$.
- We need $N = K + s$ worker nodes to tolerate $s$ stragglers, or $N = K + 2e$ to tolerate $e$ adversaries.
- With replication, we need $K(s + 1)$ worker nodes (a simulation sketch follows this list).
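The following sketch simulates the scheme over the reals, assuming a Vandermonde choice of $B$ (distinct evaluation points make every $K \times K$ submatrix invertible); the sizes and the set of responding workers are illustrative:

```python
# MDS-coded matrix-vector multiplication: K = 2 blocks, N = 4 workers,
# so up to N - K = 2 stragglers are tolerated.
import numpy as np

K, N = 2, 4
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 5))     # 6 rows, split into K = 2 blocks
x = rng.standard_normal(5)

# N x K real Vandermonde: every K x K submatrix is invertible.
B = np.vander(np.arange(1, N + 1), K, increasing=True).astype(float)
blocks = np.split(A, K)             # A_1, ..., A_K
A_tilde = [sum(B[i, j] * blocks[j] for j in range(K)) for i in range(N)]

responses = {i: A_tilde[i] @ x for i in (0, 3)}  # only workers 0 and 3 answer
idx = sorted(responses)
Y = np.stack([responses[i] for i in idx])        # received codeword entries
blocks_of_Ax = np.linalg.solve(B[idx], Y)        # invert the K x K submatrix
assert np.allclose(np.concatenate(blocks_of_Ax), A @ x)
```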
Potential improvements for MDS codes
- The matrix $A$ is usually a (trained) model, and $x$ is the data (feature vector).
- $x$ is transmitted frequently, while the rows of $B$ (or $\tilde{A}$) are communicated in advance.
- Every worker needs to receive the entire $x$ and compute the dot product.
- Communication-heavy
- Can we design a scheme that allows every node to receive only a part of $x$?
Short-Dot codes
We want to create a matrix $F \in \mathbb{F}^{N \times \ell}$ from $A \in \mathbb{F}^{M \times \ell}$ such that:
- Every node $i$ computes $f_i \cdot x$, where $f_i$ is the $i$-th row of $F$.
- Every $K$ rows of $F$ linearly span the row space of $A$.
- Each row of $F$ contains at most $s$ non-zero entries.
In the MDS method, $F = BA$ for a generator matrix $B \in \mathbb{F}^{N \times M}$ of an $[N, M]$ MDS code.
- The recovery threshold is $K = M$.
- Every worker node needs to receive $\ell$ symbols (the entire $x$).
No free lunch
Can we trade the recovery threshold for a smaller $s$?
- Every worker node receives fewer than $\ell$ symbols.
- The master will need more than $M$ responses to recover the computation result.
Construction of Short-Dot codes
Choose a super-regular matrix $B \in \mathbb{F}^{N \times K}$, where $N$ is the number of worker nodes.
- A matrix is super-regular if every square submatrix of it is invertible.
- Lagrange/Cauchy matrices are super-regular (next lecture; a small numeric check follows this list).
Create a matrix $\tilde{A} \in \mathbb{F}^{K \times \ell}$ by stacking some $Z \in \mathbb{F}^{(K - M) \times \ell}$ below the matrix $A \in \mathbb{F}^{M \times \ell}$.
Let $F = B\tilde{A}$.
Short-Dot: create the matrix $Z$ such that:
- Every $K$ rows of $F$ linearly span the row space of $A$.
- Each row of $F$ contains at most $s = \lceil \ell(N - K + M)/N \rceil$ non-zero entries (sparse).
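As a sanity check of the super-regularity claim, here is a brute-force verification for one small real Cauchy matrix (a numeric spot check, not a proof; the parameters are illustrative):

```python
# Verify that every square submatrix of a small Cauchy matrix is
# invertible, i.e., has a non-zero determinant.
import numpy as np
from itertools import combinations

a = np.array([0.0, 1.0, 2.0, 3.0])      # row parameters
b = np.array([10.0, 11.0, 12.0])        # column parameters, disjoint from a
B = 1.0 / (a[:, None] - b[None, :])     # a 4 x 3 Cauchy matrix

for k in range(1, 4):
    for rows in combinations(range(4), k):
        for cols in combinations(range(3), k):
            sub = B[np.ix_(rows, cols)]
            assert abs(np.linalg.det(sub)) > 1e-12  # every minor is non-zero
```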
Recovery of Short-Dot codes
Claim: Every $K$ rows of $F$ linearly span the row space of $A$.
Proof
Since $B$ is super-regular, it is also MDS, i.e., every $K \times K$ submatrix of $B$ is invertible.
Hence, every row of $\tilde{A}$ can be represented as a linear combination of any $K$ rows of $F$.
That is, for every $\mathcal{K} \subseteq [N]$ with $|\mathcal{K}| = K$, we have $\tilde{A} = (B^{\mathcal{K}})^{-1} F^{\mathcal{K}}$, where $B^{\mathcal{K}}$ and $F^{\mathcal{K}}$ contain the rows indexed by $\mathcal{K}$; in particular, the rows of $A$ (the first $M$ rows of $\tilde{A}$) are spanned.
What about the sparsity of $F$?
- Want each row of $F$ to be sparse.
Sparsity of Short-Dot codes
Build a square $N \times N$ zero pattern in which each row and each column contains exactly $N - (K - M)$ non-zero entries (e.g., by cyclically shifting a first row that has $K - M$ zeros).
Concatenate $\lceil \ell / N \rceil$ such matrices horizontally to obtain the zero pattern of $F$; each row of $F$ then contains at most $\lceil \ell(N - K + M)/N \rceil$ non-zero entries.
[Missing slide 18]
We now investigate what $Z$ should look like to construct such a matrix $F$.
- Recall that each column of $F$ must contain $K - M$ zeros.
  - They are indexed by a set $\mathcal{U} \subseteq [N]$ with $|\mathcal{U}| = K - M$ (one such set per column).
  - Let $B^{\mathcal{U}} \in \mathbb{F}^{(K - M) \times K}$ be the submatrix of $B$ containing the rows indexed by $\mathcal{U}$.
- Since $F = B\tilde{A}$, it follows that $F_j = B\tilde{A}_j$, where $F_j$ and $\tilde{A}_j$ are the $j$-th columns of $F$ and $\tilde{A}$.
- Next, we have $B^{\mathcal{U}} \tilde{A}_j = 0_{(K - M) \times 1}$.
- Split $B^{\mathcal{U}} = [B^{\mathcal{U}}_{[1, M]}, B^{\mathcal{U}}_{[M + 1, K]}]$ and $\tilde{A}_j = [A_j^\top, Z_j^\top]^\top$.
- $B^{\mathcal{U}} \tilde{A}_j = B^{\mathcal{U}}_{[1, M]} A_j + B^{\mathcal{U}}_{[M + 1, K]} Z_j = 0_{(K - M) \times 1}$.
- $Z_j = -(B^{\mathcal{U}}_{[M + 1, K]})^{-1} B^{\mathcal{U}}_{[1, M]} A_j$.
- Note that $B^{\mathcal{U}}_{[M + 1, K]} \in \mathbb{F}^{(K - M) \times (K - M)}$ is invertible, since $B$ is super-regular (a code sketch follows).
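Putting the pieces together, here is a sketch of the full Short-Dot construction, assuming a real Cauchy matrix for $B$ and cyclically shifted zero sets $\mathcal{U}_j$; all sizes are illustrative:

```python
# Short-Dot over the reals: N = 5 workers, K = 4 recovery threshold,
# A has M = 2 rows and ell = 10 columns.
import numpy as np

N, K, M, ell = 5, 4, 2, 10
rng = np.random.default_rng(2)
A = rng.standard_normal((M, ell))
x = rng.standard_normal(ell)

a = np.arange(N, dtype=float)                # 0, ..., 4
b = np.arange(N, N + K, dtype=float)         # 5, ..., 8 (disjoint from a)
B = 1.0 / (a[:, None] - b[None, :])          # N x K Cauchy, super-regular

# Build Z column by column: Z_j = -(B^U_{[M+1,K]})^{-1} B^U_{[1,M]} A_j.
Z = np.zeros((K - M, ell))
for j in range(ell):
    U = [(j + t) % N for t in range(K - M)]  # rows of F zeroed in column j
    B_U = B[U]                               # (K-M) x K submatrix of B
    Z[:, j] = -np.linalg.solve(B_U[:, M:], B_U[:, :M] @ A[:, j])

A_tilde = np.vstack([A, Z])                  # stack Z below A
F = B @ A_tilde                              # N x ell

# Each row of F has at most ell(N-K+M)/N = 6 of 10 non-zero entries.
assert (np.abs(F) > 1e-9).sum(axis=1).max() <= ell * (N - K + M) // N

# Recovery: any K = 4 responses determine A_tilde x, hence A x.
idx = [0, 2, 3, 4]                           # responding workers
responses = F[idx] @ x                       # worker i returns f_i . x
decoded = np.linalg.solve(B[idx], responses)
assert np.allclose(decoded[:M], A @ x)
```

Worker $i$ then only needs the coordinates of $x$ on which $f_i$ is non-zero, i.e., at most $\ell(N - K + M)/N = 6$ of the $10$ symbols in this toy example.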