CSE5313 Coding and information theory for data science (Lecture 21)

Gradient coding

Intro to Statistical Machine Learning

The learning problem is given by:

Unknown target function $f: X \to Y$.

Training data $\mathcal{D}=\{(x_i,y_i)\}_{i=1}^n$.

Learning algorithm $\mathbb{A}$ and hypothesis set $\mathcal{H}$.

Goal: find $g\approx f$.

Common hypothesis sets:

  • Linear classifiers:
$f(x)=\mathrm{sign}(wx^\top)$
  • Linear regressors:
$f(x)=wx^\top$
  • Neural networks:
    • Concatenated linear classifiers (or a differentiable approximation thereof).

The dataset $\mathcal{D}$ is drawn from an unknown distribution $D$.

Common approach – Empirical Risk Minimization

  • Q: How do we quantify $f\approx g$?
  • A: Use a loss function.
    • A function which measures the deviation between $g$'s output and $f$'s output.
  • Using the chosen loss function, define two measures (see the sketch below):
    • True risk: $\mathbb{E}_{D}[\mathbb{L}(f,g)]$
    • Empirical risk: $\frac{1}{n}\sum_{i=1}^n \ell(g(x_i),y_i)$

Machine learning is over $\mathbb{R}$.
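As a small illustration (not from the lecture), this is how the empirical risk of a linear regressor under the squared loss could be computed; the function and variable names are placeholders.

```python
import numpy as np

# Illustrative sketch: empirical risk of a linear regressor g(x) = w x^T under the squared loss.
def empirical_risk(w, X, y):
    """Average squared loss (1/n) * sum_i (g(x_i) - y_i)^2 over the training set."""
    predictions = X @ w
    return np.mean((predictions - y) ** 2)

# Synthetic example: n = 100 points in R^5, labels from a noisy linear model.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=100)
print(empirical_risk(w_true, X, y))   # small, since w_true almost fits the data
```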

Gradient Descent (Motivation)

Parameterizing $g$ by some real vector $\vec{w}$, we want to minimize $ER(\vec{w})$.

Algorithm:

  • Initialize $\vec{w}_0$.
  • For $t=1,2,\cdots,T$:
    • Compute $\nabla_{\vec{w}} ER(\vec{w})$.
    • $\vec{w}\gets \vec{w}-\eta\nabla_{\vec{w}} ER(\vec{w})$
    • Terminate if some stopping condition is met.
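A minimal gradient-descent sketch for the squared loss (illustrative only; the step size, stopping rule, and names are assumptions, not the lecture's):

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, T=1000, tol=1e-8):
    """Minimize ER(w) = (1/n) * sum_i (x_i @ w - y_i)^2 by gradient descent."""
    n, p = X.shape
    w = np.zeros(p)                               # initialize w_0
    for _ in range(T):
        grad = (2.0 / n) * X.T @ (X @ w - y)      # gradient of the empirical risk
        w = w - eta * grad                        # w <- w - eta * grad
        if np.linalg.norm(grad) < tol:            # stopping condition
            break
    return w
```

Note that each iteration touches every data point, which is exactly the bottleneck discussed next.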

Bottleneck: computing $\nabla_{\vec{w}} ER(\vec{w})=\frac{1}{n}\sum_{i=1}^n \nabla_{\vec{w}} \ell(g(x_i),y_i)$.

Potentially $O(PN)$, where $N$ is the number of data points and $P$ is the dimension of the feature space.

Solution: Parallelize.

Distributed Gradient Descent

Idea: use a distributed system with a master and workers.

Problem: Stragglers (slow servers, often 5-6 times slower than average).

Potential Solutions:

  • Wait for all servers:
    • Accurate, but slow
  • Sum results without the slowest ones
    • Less accurate, but faster
  • Introduce redundancy
    • Send each $\mathcal{D}_i$ to more than one server.
    • Each server receives more than one $\mathcal{D}_i$.
    • Each server sends a linear combination of the partial gradients of its $\mathcal{D}_i$'s.
    • The master decodes the sum of the partial gradients from these linear combinations (see the small example below).
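As a concrete instance (a standard small example with $n=3$, $d=2$, $s=1$; the specific coefficients are one possible choice, not necessarily the lecture's): worker 1 holds $\mathcal{D}_1,\mathcal{D}_2$ and sends $\tfrac{1}{2}v_1+v_2$, worker 2 holds $\mathcal{D}_2,\mathcal{D}_3$ and sends $v_2-v_3$, worker 3 holds $\mathcal{D}_1,\mathcal{D}_3$ and sends $\tfrac{1}{2}v_1+v_3$. Any two responses suffice to recover $v_1+v_2+v_3$:

```python
import numpy as np

# Numerical check of the 3-worker example above: any 2 of the 3 responses recover v1 + v2 + v3.
rng = np.random.default_rng(1)
v1, v2, v3 = rng.normal(size=(3, 4))       # three partial gradients (dimension 4 here)
u1, u2, u3 = 0.5 * v1 + v2, v2 - v3, 0.5 * v1 + v3
total = v1 + v2 + v3
print(np.allclose(2 * u1 - u2, total))     # workers {1, 2} responded
print(np.allclose(u2 + 2 * u3, total))     # workers {2, 3} responded
print(np.allclose(u1 + u3, total))         # workers {1, 3} responded
```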

Problem setup

System setup: $1$ master $M$, $n$ workers $W_1,\cdots,W_n$.

A dataset $\mathcal{D}=\{(x_i,y_i)\}_{i=1}^n$.

Each server $j$:

  • Receives some $d$ data points $\mathcal{D}_j=\{(x_{j,i},y_{j,i})\}_{i=1}^d$, where $d$ is the replication factor.
  • Computes a certain vector $v_{j,i}$ from each $(x_{j,i},y_{j,i})$.
  • Returns a linear combination $u_j=\sum_{i=1}^{d} \alpha_{j,i} v_{j,i}$ to the master.

The master:

  • Waits for the first $n-s$ of the $u_j$'s to arrive ($s$ is the straggler tolerance factor).
  • Linearly combines the received $u_j$'s to get the final gradient.

Goal: Retrieve $\sum_i v_i$ regardless of which $n-s$ workers responded.

  • Computation of the full gradient that tolerates any $s$ stragglers.

The $\alpha_{i,j}$'s are fixed (they do not depend on the data). They form a gradient coding matrix $B\in\mathbb{C}^{n\times n}$.

Row $i$ has $d$ nonzero entries $\alpha_{i,j}$ at some positions.

The $\lambda_i$'s (the master's combination coefficients):

  • Might depend on the identity of the $n-s$ responses.
  • Nevertheless, they must exist in every case.

Recall:

  • The master must be able to recover $\sum_i v_i$ from any $n-s$ responses.

Let $\mathcal{K}$ be the set of indices of the responses.

Let $B_\mathcal{K}$ be the submatrix of $B$ whose rows are indexed by $\mathcal{K}$.

Must have:

  • For every $\mathcal{K}$ of size $n-s$ there exist coefficients $\lambda_1,\cdots,\lambda_{n-s}$ such that:
$$(\lambda_1,\cdots,\lambda_{n-s})\,B_\mathcal{K}=(1,1,\cdots,1)=\mathbb{I} \quad \text{(the all-ones vector)}$$

Then if the workers $\mathcal{K}=\{i_1,\cdots,i_{n-s}\}$ responded,

$$(\lambda_1,\cdots,\lambda_{n-s})\begin{pmatrix} u_{i_1}\\ u_{i_2}\\ \vdots\\ u_{i_{n-s}} \end{pmatrix}=\sum_i v_i,$$

since each $u_{i_t}$ is row $i_t$ of $B$ applied to $(v_1,\cdots,v_n)^\top$, so the coefficient of each $v_j$ in this combination is $((\lambda_1,\cdots,\lambda_{n-s})B_\mathcal{K})_j=1$.

Definition of gradient coding matrix.

For replication factor $d$ and straggler tolerance factor $s$: $B\in\mathbb{C}^{n\times n}$ is a gradient coding matrix if:

  • $\mathbb{I}$ is in the span of any $n-s$ rows.
  • Every row of $B$ contains at most $d$ nonzero elements.

A gradient coding matrix implies a gradient coding algorithm:

  • The master sends to worker $i$ the partial datasets $\mathcal{D}_{i,1},\cdots,\mathcal{D}_{i,d}$ indexed by the nonzero positions of row $i$ of $B$.
  • Worker $i$:
    • Computes $\mathcal{D}_{i,\ell}\to v_{i,\ell}$ for $\ell=1,\cdots,d$.
    • Sends $u_i=\sum_{j=1}^d \alpha_{i,j}v_{i,j}$ to the master.
  • Let $\mathcal{K}=\{i_1,\cdots,i_{n-s}\}$ be the indices of the first $n-s$ responses.
  • Since $\mathbb{I}$ is in the span of any $n-s$ rows of $B$, there exist $\lambda_1,\cdots,\lambda_{n-s}$ such that $(\lambda_1,\cdots,\lambda_{n-s})(u_{i_1},\cdots,u_{i_{n-s}})^\top=\sum_i v_i$ (see the decoding sketch below).
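A sketch of the master's decoding step (illustrative; the function name and the use of least squares to find the $\lambda$'s are my assumptions, any solver for $\lambda B_\mathcal{K}=\mathbb{I}$ works):

```python
import numpy as np

def decode_gradient_sum(B, responded, U):
    """Recover sum_i v_i from the responses of the workers in `responded`.

    B         : (n, n) gradient coding matrix; row i holds worker i's coefficients alpha_{i,j}.
    responded : list of the n-s worker indices that answered first.
    U         : (n-s, p) array whose t-th row is the response u of worker responded[t].
    """
    B_K = B[responded, :]                             # rows of B for the responders
    ones = np.ones(B.shape[1], dtype=B.dtype)
    # Find lambda with lambda @ B_K = (1, ..., 1); it exists by the gradient coding property.
    lam, *_ = np.linalg.lstsq(B_K.T, ones, rcond=None)
    g = lam @ U                                       # = sum of all partial gradients
    return g.real if np.isrealobj(U) else g           # B may be complex; the true sum is real
```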

Construction of Gradient Coding Matrices

Goal:

  • For a given straggler tolerance parameter $s$, we wish to construct a gradient coding matrix $B$ with the smallest possible $d$.

  • Tools:

I. Cyclic Reed-Solomon codes over the complex numbers.
II. The dual code $\mathcal{C}^\perp$ and the reversed code $\mathcal{C}^R$ of a code $\mathcal{C}$.
III. A simple lemma about MDS codes.

Recall: An $[n,k]$ Reed-Solomon code over a field $\mathbb{F}$ is defined as follows.

  • Fix distinct $\alpha_1,\cdots,\alpha_n\in\mathbb{F}$.
  • $\mathcal{C}=\{(f(\alpha_1),f(\alpha_2),\cdots,f(\alpha_n))\mid f\in\mathbb{F}[x],\ \deg f<k\}$.
  • Dimension $k$ and minimum distance $n-k+1$ follow from $\mathbb{F}$ being a field.
  • This also works for $\mathbb{F}=\mathbb{C}$.

I. Cyclic Reed-Solomon codes over the complex numbers.

The following Reed-Solomon code over the complex numbers is called a cyclic code.

  • Let $i=\sqrt{-1}$.
  • For $j\in \{0,\cdots,n-1\}$, choose $a_j=e^{2\pi i j/n}$. The $a_j$'s are the roots of unity of order $n$.
  • Use these $a_j$'s to define a Reed-Solomon code as usual.

This code is cyclic:

  • Let $c=(f_c(a_0),f_c(a_1),\cdots,f_c(a_{n-1}))$ for some $f_c(x)\in \mathbb{C}[x]$ with $\deg f_c<k$.
  • Need to show that the cyclic shift $c'=(f_c(a_1),f_c(a_2),\cdots,f_c(a_{n-1}),f_c(a_0))$ is also a codeword.
  • Define $f_{c'}(x)=f_c(a_1 x)$. Since $a_{j+1}=a_1 a_j$ (indices mod $n$) and $\deg f_{c'}=\deg f_c<k$,
$$c'=(f_c(a_1),f_c(a_2),\cdots,f_c(a_{n-1}),f_c(a_0))=(f_{c'}(a_0),f_{c'}(a_1),\cdots,f_{c'}(a_{n-1}))\in\mathcal{C}.$$
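A quick numerical sanity check of the cyclicity argument (a sketch; the parameters $n$, $k$ and the monomial-basis generator matrix are my choices):

```python
import numpy as np

n, k = 7, 4
a = np.exp(2j * np.pi * np.arange(n) / n)         # a_j = e^{2*pi*i*j/n}, roots of unity of order n
G = np.vstack([a**t for t in range(k)])           # generator matrix: row t evaluates x^t at a_0..a_{n-1}

f = np.random.default_rng(2).normal(size=k)       # coefficients of a polynomial f_c with deg < k
c = f @ G                                         # codeword (f_c(a_0), ..., f_c(a_{n-1}))
c_shift = np.roll(c, -1)                          # cyclic shift (f_c(a_1), ..., f_c(a_{n-1}), f_c(a_0))

f_shift = f * a[1] ** np.arange(k)                # coefficients of f_{c'}(x) = f_c(a_1 * x)
print(np.allclose(f_shift @ G, c_shift))          # True: the shift is again a codeword
```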

II. Dual and Reversed Codes

  • Let $\mathcal{C}=[n,k,d]_{\mathbb{F}}$ be an MDS code.

Definition for the dual code of $\mathcal{C}$

The dual code of $\mathcal{C}$ is

$$\mathcal{C}^\perp=\{c'\in \mathbb{F}^n \mid c'c^\top=0\text{ for all }c\in \mathcal{C}\}$$

Claim: $\mathcal{C}^\perp$ is an $[n,n-k,k+1]_{\mathbb{F}}$ code.

Definition for the reversed code of $\mathcal{C}$

The reversed code of $\mathcal{C}$ is

$$\mathcal{C}^R=\{(c_{n-1},\cdots,c_0)\mid(c_0,\cdots,c_{n-1})\in \mathcal{C}\}$$

We claim that if C\mathcal{C} is cyclic, then CR\mathcal{C}^R is cyclic.

III. Lemma about MDS codes

Let $\mathcal{C}=[n,k,n-k+1]_{\mathbb{F}}$ be an MDS code.

Lemma

For any subset $\mathcal{K}\subset \{0,\cdots,n-1\}$ of size $n-k+1$, there exists $c\in \mathcal{C}$ whose support (set of nonzero indices) is exactly $\mathcal{K}$.

Proof

Let $G\in \mathbb{F}^{k\times n}$ be a generator matrix, and let $G_{\mathcal{K}^c}\in \mathbb{F}^{k\times (k-1)}$ be its restriction to the columns not indexed by $\mathcal{K}$.

$G_{\mathcal{K}^c}$ has more rows than columns, so there exists a nonzero $v\in \mathbb{F}^{k}$ such that $vG_{\mathcal{K}^c}=0$.

So $c=vG$ has at least $|\mathcal{K}^c|=k-1$ zeros, in the entries indexed by $\mathcal{K}^c$.

Since $v\neq 0$ and $G$ has full rank, $c\neq 0$, so $c$ has Hamming weight at least $d=n-k+1$; hence the remaining $n-(k-1)=d$ entries of $c$, indexed by $\mathcal{K}$, must all be nonzero.

Thus the support of $c$ is $\mathcal{K}$.
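A numerical illustration of the lemma for a Reed-Solomon code over $\mathbb{C}$ (the parameters and the subset $\mathcal{K}$ are arbitrary choices of mine):

```python
import numpy as np

n, k = 7, 4
a = np.exp(2j * np.pi * np.arange(n) / n)
G = np.vstack([a**t for t in range(k)])            # generator of an [n, k] RS (hence MDS) code

K = np.array([0, 2, 3, 5])                         # any subset of size n - k + 1 = 4
Kc = np.setdiff1d(np.arange(n), K)                 # its complement, of size k - 1

# v G_{K^c} = 0 has a nonzero solution: take a right null vector of G[:, Kc]^T via the SVD.
_, _, Vh = np.linalg.svd(G[:, Kc].T)
v = Vh[-1].conj()
c = v @ G                                          # codeword that vanishes on K^c
print(np.allclose(c[Kc], 0))                       # True: zeros outside K
print(np.all(np.abs(c[K]) > 1e-8))                 # True: support is exactly K (MDS property)
```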

Constructing the gradient coding matrix

Consider any $n$ workers and $s$ stragglers.

Let $d=s+1$.

Let $\mathcal{C}=[n,n-s]_{\mathbb{C}}$ be the cyclic RS code built in I.

Then by III, there exists $c\in \mathcal{C}$ whose support is the first $n-(n-s)+1=s+1$ entries.

Denote $c=(\beta_1,\cdots,\beta_{s+1},0,0,\cdots,0)$ for some nonzero $\beta_1,\cdots,\beta_{s+1}$.

Build:

$B\in \mathbb{C}^{n\times n}$ whose columns are all the cyclic shifts of $c$.

We claim that BB is a gradient coding matrix.

$$B=\begin{pmatrix}
\beta_1 & 0 & \cdots & 0 & \beta_{s+1} & \beta_s & \cdots & \beta_2 \\
\beta_2 & \beta_1 & \cdots & 0 & 0 & \beta_{s+1} & \cdots & \beta_3 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
\beta_{s+1} & \beta_s & \cdots & 0 & 0 & 0 & \cdots & 0 \\
0 & \beta_{s+1} & \cdots & 0 & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \beta_{s+1} & \beta_s & \beta_{s-1} & \cdots & \beta_1
\end{pmatrix}$$

Proof

Every row of $B$ is a codeword in $\mathcal{C}^R$.

  • Specifically, a cyclic shift of $(0,\cdots,0,\beta_{s+1},\cdots,\beta_1)$.
  • In particular, every row contains at most $d=s+1$ nonzeros.

$\mathbb{I}$ is in the span of any $n-s$ rows of $B$:

  • Observe that $\mathbb{I}\in \mathcal{C}$ (evaluate the polynomial $f(x)=1$ at $a_0,\cdots,a_{n-1}$).
  • Then $\mathbb{I}\in \mathcal{C}^R$ as well.
  • Therefore, it suffices to show that any $n-s$ rows of $B$ span $\mathcal{C}^R$.
  • Since $\dim \mathcal{C}=\dim \mathcal{C}^R=n-s$, it suffices to show that any $n-s$ rows are linearly independent.

Observe: the leftmost $n-s$ columns of $B$ are linearly independent codewords of $\mathcal{C}$ (since $\mathcal{C}$ is cyclic, every cyclic shift of $c$ is again a codeword), and therefore the columns of $B$ span $\mathcal{C}$.

Assume for contradiction that there exist $n-s$ linearly dependent rows.

Then there exists a nonzero $v\in \mathbb{C}^{n}$, supported on those $n-s$ rows, such that $vB=0$.

$v$ is orthogonal to every column of $B$, and hence to a spanning set of $\mathcal{C}$.

So $v\in \mathcal{C}^\perp$.

From II, $\mathcal{C}^\perp$ is an $[n,s]$ MDS code, and hence every nonzero $v\in \mathcal{C}^\perp$ has Hamming weight $\geq n-s+1$.

This contradicts $v$ being nonzero with Hamming weight at most $n-s$.
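Putting the pieces together, here is a sketch of the whole construction in code (the helper name and the brute-force verification over all $(n-s)$-subsets of rows are mine, for illustration):

```python
import numpy as np
from itertools import combinations

def gradient_coding_matrix(n, s):
    """Construct B in C^{n x n} with d = s + 1 from the [n, n-s] cyclic RS code over C."""
    k = n - s
    a = np.exp(2j * np.pi * np.arange(n) / n)          # n-th roots of unity (step I)
    G = np.vstack([a**t for t in range(k)])            # generator of C = [n, n-s] cyclic RS code

    # Step III: a codeword c with support {0, ..., s}, i.e. zero on the last k - 1 coordinates.
    Kc = np.arange(s + 1, n)
    _, _, Vh = np.linalg.svd(G[:, Kc].T)
    c = Vh[-1].conj() @ G

    # Columns of B are all cyclic shifts of c: B[i, j] = c[(i - j) mod n].
    B = np.empty((n, n), dtype=complex)
    for j in range(n):
        B[:, j] = np.roll(c, j)
    return B

# Verify the defining property: the all-ones vector lies in the span of any n - s rows.
n, s = 7, 2
B = gradient_coding_matrix(n, s)
ones = np.ones(n, dtype=complex)
for rows in combinations(range(n), n - s):
    lam, *_ = np.linalg.lstsq(B[list(rows)].T, ones, rcond=None)
    assert np.allclose(lam @ B[list(rows)], ones)
print("every row has", np.count_nonzero(np.abs(B[0]) > 1e-10), "nonzeros; all subsets decode")
```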

Bound for gradient coding

We want $s$ to be large and $d$ to be small.

How small can $d$ be with respect to $s$?

  • A: Build a bipartite graph.
    • Left side: $n$ workers $W_1,\cdots,W_n$.
    • Right side: $n$ partial datasets $\mathcal{D}_1,\cdots,\mathcal{D}_n$.
    • Connect $W_i$ to $\mathcal{D}_j$ if worker $i$ holds $\mathcal{D}_j$.
      • Equivalently, if $B_{i,j}\neq 0$.
    • $\deg (W_i) = d$ by definition.
    • $\deg (\mathcal{D}_j)\geq s+1$, since otherwise all workers holding $\mathcal{D}_j$ could straggle.
    • The sum of degrees on the left is $nd$, and on the right it is $\geq n(s+1)$.
    • So $d\geq s+1$.

This lower bound can be circumvented by settling for an approximate gradient computation.
