CSE5313 Coding and information theory for data science (Lecture 21)

Gradient coding

Intro to Statistical Machine Learning

The learning problem is given by:

Unknown target function $f: X \to Y$.

Training data $\mathcal{D}=\{(x_i,y_i)\}_{i=1}^n$.

Learning algorithm $\mathbb{A}$ and hypothesis set $\mathcal{H}$.

Goal: find $g\approx f$.

Common hypothesis sets:

  • Linear classifiers:
$f(x)=\mathrm{sign}(wx^\top)$
  • Linear regressors:
$f(x)=wx^\top$
  • Neural networks:
    • Concatenated linear classifiers (or a differentiable approximation thereof).

The dataset $\mathcal{D}$ is drawn from an unknown distribution $D$.

Common approach – Empirical Risk Minimization

  • Q: How do we quantify $f\approx g$?
  • A: Use a loss function.
    • A function which measures the deviation between $g$'s output and $f$'s output.
  • Using the chosen loss function, define two measures (see the sketch below):
    • True risk: $\mathbb{E}_{D}[\mathbb{L}(f,g)]$
    • Empirical risk: $\frac{1}{n}\sum_{i=1}^n \ell(g(x_i),y_i)$

Machine learning is over $\mathbb{R}$.
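As a small illustration (not from the lecture), this is how the empirical risk of a linear regressor under the squared loss could be computed; the function and variable names are placeholders.

```python
import numpy as np

# Illustrative sketch: empirical risk of a linear regressor g(x) = w x^T under the squared loss.
def empirical_risk(w, X, y):
    """Average squared loss (1/n) * sum_i (g(x_i) - y_i)^2 over the training set."""
    predictions = X @ w
    return np.mean((predictions - y) ** 2)

# Synthetic example: n = 100 points in R^5, labels from a noisy linear model.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=100)
print(empirical_risk(w_true, X, y))   # small, since w_true almost fits the data
```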

Gradient Descent (Motivation)

Parameterizing $g$ by some real vector $\vec{w}$, we want to minimize $ER(\vec{w})$.

Algorithm:

  • Initialize $\vec{w}_0$.
  • For $t=1,2,\cdots,T$:
    • Compute $\nabla_{\vec{w}} ER(\vec{w})$.
    • $\vec{w}\gets \vec{w}-\eta\nabla_{\vec{w}} ER(\vec{w})$
    • Terminate if some stopping condition is met.
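A minimal gradient-descent sketch for the squared loss (illustrative only; the step size, stopping rule, and names are assumptions, not the lecture's):

```python
import numpy as np

def gradient_descent(X, y, eta=0.1, T=1000, tol=1e-8):
    """Minimize ER(w) = (1/n) * sum_i (x_i @ w - y_i)^2 by gradient descent."""
    n, p = X.shape
    w = np.zeros(p)                               # initialize w_0
    for _ in range(T):
        grad = (2.0 / n) * X.T @ (X @ w - y)      # gradient of the empirical risk
        w = w - eta * grad                        # w <- w - eta * grad
        if np.linalg.norm(grad) < tol:            # stopping condition
            break
    return w
```

Note that each iteration touches every data point, which is exactly the bottleneck discussed next.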

Bottleneck: computing $\nabla_{\vec{w}} ER(\vec{w})=\frac{1}{n}\sum_{i=1}^n \nabla_{\vec{w}} \ell(g(x_i),y_i)$.

Potentially $O(PN)$, where $N$ is the number of data points and $P$ is the dimension of the feature space.

Solution: Parallelize.

Distributed Gradient Descent

Idea: use a distributed system with a master and workers.

Problem: Stragglers (slow servers, often 5-6 times slower than average).

Potential Solutions:

  • Wait for all servers:
    • Accurate, but slow
  • Sum results without the slowest ones
    • Less accurate, but faster
  • Introduce redundancy
    • Send each $\mathcal{D}_i$ to more than one server.
    • Each server receives more than one $\mathcal{D}_i$.
    • Each server sends a linear combination of the partial gradients of its $\mathcal{D}_i$'s.
    • The master decodes the sum of the partial gradients from these linear combinations (see the small example below).
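As a concrete instance (a standard small example with $n=3$, $d=2$, $s=1$; the specific coefficients are one possible choice, not necessarily the lecture's): worker 1 holds $\mathcal{D}_1,\mathcal{D}_2$ and sends $\tfrac{1}{2}v_1+v_2$, worker 2 holds $\mathcal{D}_2,\mathcal{D}_3$ and sends $v_2-v_3$, worker 3 holds $\mathcal{D}_1,\mathcal{D}_3$ and sends $\tfrac{1}{2}v_1+v_3$. Any two responses suffice to recover $v_1+v_2+v_3$:

```python
import numpy as np

# Numerical check of the 3-worker example above: any 2 of the 3 responses recover v1 + v2 + v3.
rng = np.random.default_rng(1)
v1, v2, v3 = rng.normal(size=(3, 4))       # three partial gradients (dimension 4 here)
u1, u2, u3 = 0.5 * v1 + v2, v2 - v3, 0.5 * v1 + v3
total = v1 + v2 + v3
print(np.allclose(2 * u1 - u2, total))     # workers {1, 2} responded
print(np.allclose(u2 + 2 * u3, total))     # workers {2, 3} responded
print(np.allclose(u1 + u3, total))         # workers {1, 3} responded
```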

Problem setup

System setup: $1$ master $M$, $n$ workers $W_1,\cdots,W_n$.

A dataset $\mathcal{D}=\{(x_i,y_i)\}_{i=1}^n$.

Each server $j$:

  • Receives some $d$ data points $\mathcal{D}_j=\{(x_{j,i},y_{j,i})\}_{i=1}^d$, where $d$ is the replication factor.
  • Computes a certain vector $v_{j,i}$ from each $(x_{j,i},y_{j,i})$.
  • Returns a linear combination $u_j=\sum_{i=1}^{d} \alpha_{j,i} v_{j,i}$ to the master.

The master:

  • Waits for the first $n-s$ of the $u_j$'s to arrive ($s$ is the straggler tolerance factor).
  • Linearly combines the received $u_j$'s to get the final gradient.

Goal: Retrieve $\sum_i v_i$ regardless of which $n-s$ workers responded.

  • Computation of the full gradient that tolerates any $s$ stragglers.

The $\alpha_{i,j}$'s are fixed (they do not depend on the data). They form a gradient coding matrix $B\in\mathbb{C}^{n\times n}$.

Row $i$ has $d$ nonzero entries $\alpha_{i,j}$ at some positions.

The $\lambda_i$'s (the master's combination coefficients):

  • Might depend on the identity of the $n-s$ responses.
  • Nevertheless, they must exist in every case.

Recall:

  • The master must be able to recover $\sum_i v_i$ from any $n-s$ responses.

Let $\mathcal{K}$ be the set of indices of the responses.

Let $B_\mathcal{K}$ be the submatrix of $B$ whose rows are indexed by $\mathcal{K}$.

Must have:

  • For every $\mathcal{K}$ of size $n-s$ there exist coefficients $\lambda_1,\cdots,\lambda_{n-s}$ such that:
$$(\lambda_1,\cdots,\lambda_{n-s})\,B_\mathcal{K}=(1,1,\cdots,1)=\mathbb{I} \quad \text{(the all-ones vector)}$$

Then if the workers $\mathcal{K}=\{i_1,\cdots,i_{n-s}\}$ responded,

$$(\lambda_1,\cdots,\lambda_{n-s})\begin{pmatrix} u_{i_1}\\ u_{i_2}\\ \vdots\\ u_{i_{n-s}} \end{pmatrix}=\sum_i v_i,$$

since each $u_{i_t}$ is row $i_t$ of $B$ applied to $(v_1,\cdots,v_n)^\top$, so the coefficient of each $v_j$ in this combination is $((\lambda_1,\cdots,\lambda_{n-s})B_\mathcal{K})_j=1$.

Definition of gradient coding matrix.

For replication factor $d$ and straggler tolerance factor $s$: $B\in\mathbb{C}^{n\times n}$ is a gradient coding matrix if:

  • $\mathbb{I}$ is in the span of any $n-s$ rows.
  • Every row of $B$ contains at most $d$ nonzero elements.

A gradient coding matrix implies a gradient coding algorithm:

  • The master sends to worker $i$ the partial datasets $\mathcal{D}_{i,1},\cdots,\mathcal{D}_{i,d}$ indexed by the nonzero positions of row $i$ of $B$.
  • Worker $i$:
    • Computes $\mathcal{D}_{i,\ell}\to v_{i,\ell}$ for $\ell=1,\cdots,d$.
    • Sends $u_i=\sum_{j=1}^d \alpha_{i,j}v_{i,j}$ to the master.
  • Let $\mathcal{K}=\{i_1,\cdots,i_{n-s}\}$ be the indices of the first $n-s$ responses.
  • Since $\mathbb{I}$ is in the span of any $n-s$ rows of $B$, there exist $\lambda_1,\cdots,\lambda_{n-s}$ such that $(\lambda_1,\cdots,\lambda_{n-s})(u_{i_1},\cdots,u_{i_{n-s}})^\top=\sum_i v_i$ (see the decoding sketch below).
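A sketch of the master's decoding step (illustrative; the function name and the use of least squares to find the $\lambda$'s are my assumptions, any solver for $\lambda B_\mathcal{K}=\mathbb{I}$ works):

```python
import numpy as np

def decode_gradient_sum(B, responded, U):
    """Recover sum_i v_i from the responses of the workers in `responded`.

    B         : (n, n) gradient coding matrix; row i holds worker i's coefficients alpha_{i,j}.
    responded : list of the n-s worker indices that answered first.
    U         : (n-s, p) array whose t-th row is the response u of worker responded[t].
    """
    B_K = B[responded, :]                             # rows of B for the responders
    ones = np.ones(B.shape[1], dtype=B.dtype)
    # Find lambda with lambda @ B_K = (1, ..., 1); it exists by the gradient coding property.
    lam, *_ = np.linalg.lstsq(B_K.T, ones, rcond=None)
    g = lam @ U                                       # = sum of all partial gradients
    return g.real if np.isrealobj(U) else g           # B may be complex; the true sum is real
```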

Construction of Gradient Coding Matrices

Goal:

  • For a given straggler tolerance parameter $s$, we wish to construct a gradient coding matrix $B$ with the smallest possible $d$.

  • Tools:

I. Cyclic Reed-Solomon codes over the complex numbers.
II. The dual code $\mathcal{C}^\perp$ and the reversed code $\mathcal{C}^R$ of a code $\mathcal{C}$.
III. A simple lemma about MDS codes.

Recall: An $[n,k]$ Reed-Solomon code over a field $\mathbb{F}$ is defined as follows.

  • Fix distinct $\alpha_1,\cdots,\alpha_n\in\mathbb{F}$.
  • $\mathcal{C}=\{(f(\alpha_1),f(\alpha_2),\cdots,f(\alpha_n))\mid f\in\mathbb{F}[x],\ \deg f<k\}$.
  • Dimension $k$ and minimum distance $n-k+1$ follow from $\mathbb{F}$ being a field.
  • This also works for $\mathbb{F}=\mathbb{C}$.

I. Cyclic Reed-Solomon codes over the complex numbers.

The following Reed-Solomon code over the complex numbers is called a cyclic code.

  • Let $i=\sqrt{-1}$.
  • For $j\in \{0,\cdots,n-1\}$, choose $a_j=e^{2\pi i j/n}$. The $a_j$'s are the roots of unity of order $n$.
  • Use these $a_j$'s to define a Reed-Solomon code as usual.

This code is cyclic:

  • Let $c=(f_c(a_0),f_c(a_1),\cdots,f_c(a_{n-1}))$ for some $f_c(x)\in \mathbb{C}[x]$ with $\deg f_c<k$.
  • Need to show that the cyclic shift $c'=(f_c(a_1),f_c(a_2),\cdots,f_c(a_{n-1}),f_c(a_0))$ is also a codeword.
  • Define $f_{c'}(x)=f_c(a_1 x)$. Since $a_{j+1}=a_1 a_j$ (indices mod $n$) and $\deg f_{c'}=\deg f_c<k$,
$$c'=(f_c(a_1),f_c(a_2),\cdots,f_c(a_{n-1}),f_c(a_0))=(f_{c'}(a_0),f_{c'}(a_1),\cdots,f_{c'}(a_{n-1}))\in\mathcal{C}.$$
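A quick numerical sanity check of the cyclicity argument (a sketch; the parameters $n$, $k$ and the monomial-basis generator matrix are my choices):

```python
import numpy as np

n, k = 7, 4
a = np.exp(2j * np.pi * np.arange(n) / n)         # a_j = e^{2*pi*i*j/n}, roots of unity of order n
G = np.vstack([a**t for t in range(k)])           # generator matrix: row t evaluates x^t at a_0..a_{n-1}

f = np.random.default_rng(2).normal(size=k)       # coefficients of a polynomial f_c with deg < k
c = f @ G                                         # codeword (f_c(a_0), ..., f_c(a_{n-1}))
c_shift = np.roll(c, -1)                          # cyclic shift (f_c(a_1), ..., f_c(a_{n-1}), f_c(a_0))

f_shift = f * a[1] ** np.arange(k)                # coefficients of f_{c'}(x) = f_c(a_1 * x)
print(np.allclose(f_shift @ G, c_shift))          # True: the shift is again a codeword
```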

II. Dual and Reversed Codes

  • Let $\mathcal{C}=[n,k,d]_{\mathbb{F}}$ be an MDS code.

Definition for the dual code of $\mathcal{C}$

The dual code of $\mathcal{C}$ is

$$\mathcal{C}^\perp=\{c'\in \mathbb{F}^n \mid c'c^\top=0\text{ for all }c\in \mathcal{C}\}$$

Claim: $\mathcal{C}^\perp$ is an $[n,n-k,k+1]_{\mathbb{F}}$ code.

Definition for the reversed code of $\mathcal{C}$

The reversed code of $\mathcal{C}$ is

$$\mathcal{C}^R=\{(c_{n-1},\cdots,c_0)\mid(c_0,\cdots,c_{n-1})\in \mathcal{C}\}$$

We claim that if C\mathcal{C} is cyclic, then CR\mathcal{C}^R is cyclic.

III. Lemma about MDS codes

Let $\mathcal{C}=[n,k,n-k+1]_{\mathbb{F}}$ be an MDS code.

Lemma

For any subset $\mathcal{K}\subset \{0,\cdots,n-1\}$ of size $n-k+1$, there exists $c\in \mathcal{C}$ whose support (set of nonzero indices) is exactly $\mathcal{K}$.

Proof

Let $G\in \mathbb{F}^{k\times n}$ be a generator matrix, and let $G_{\mathcal{K}^c}\in \mathbb{F}^{k\times (k-1)}$ be its restriction to the columns not indexed by $\mathcal{K}$.

$G_{\mathcal{K}^c}$ has more rows than columns, so there exists a nonzero $v\in \mathbb{F}^{k}$ such that $vG_{\mathcal{K}^c}=0$.

So $c=vG$ has at least $|\mathcal{K}^c|=k-1$ zeros, in the entries indexed by $\mathcal{K}^c$.

Since $v\neq 0$ and $G$ has full rank, $c\neq 0$, so $c$ has Hamming weight at least $d=n-k+1$; hence the remaining $n-(k-1)=d$ entries of $c$, indexed by $\mathcal{K}$, must all be nonzero.

Thus the support of $c$ is $\mathcal{K}$.
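A numerical illustration of the lemma for a Reed-Solomon code over $\mathbb{C}$ (the parameters and the subset $\mathcal{K}$ are arbitrary choices of mine):

```python
import numpy as np

n, k = 7, 4
a = np.exp(2j * np.pi * np.arange(n) / n)
G = np.vstack([a**t for t in range(k)])            # generator of an [n, k] RS (hence MDS) code

K = np.array([0, 2, 3, 5])                         # any subset of size n - k + 1 = 4
Kc = np.setdiff1d(np.arange(n), K)                 # its complement, of size k - 1

# v G_{K^c} = 0 has a nonzero solution: take a right null vector of G[:, Kc]^T via the SVD.
_, _, Vh = np.linalg.svd(G[:, Kc].T)
v = Vh[-1].conj()
c = v @ G                                          # codeword that vanishes on K^c
print(np.allclose(c[Kc], 0))                       # True: zeros outside K
print(np.all(np.abs(c[K]) > 1e-8))                 # True: support is exactly K (MDS property)
```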

Constructing the gradient coding matrix

Consider any $n$ workers and $s$ stragglers.

Let $d=s+1$.

Let $\mathcal{C}=[n,n-s]_{\mathbb{C}}$ be the cyclic RS code built in I.

Then by III, there exists $c\in \mathcal{C}$ whose support is the first $n-(n-s)+1=s+1$ entries.

Denote $c=(\beta_1,\cdots,\beta_{s+1},0,0,\cdots,0)$ for some nonzero $\beta_1,\cdots,\beta_{s+1}$.

Build:

$B\in \mathbb{C}^{n\times n}$ whose columns are all the cyclic shifts of $c$.

We claim that BB is a gradient coding matrix.

$$B=\begin{pmatrix}
\beta_1 & 0 & \cdots & 0 & \beta_{s+1} & \beta_s & \cdots & \beta_2 \\
\beta_2 & \beta_1 & \cdots & 0 & 0 & \beta_{s+1} & \cdots & \beta_3 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
\beta_{s+1} & \beta_s & \cdots & 0 & 0 & 0 & \cdots & 0 \\
0 & \beta_{s+1} & \cdots & 0 & 0 & 0 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \beta_{s+1} & \beta_s & \beta_{s-1} & \cdots & \beta_1
\end{pmatrix}$$

Proof

Every row of $B$ is a codeword in $\mathcal{C}^R$.

  • Specifically, a cyclic shift of $(0,\cdots,0,\beta_{s+1},\cdots,\beta_1)$.
  • In particular, every row contains at most $d=s+1$ nonzeros.

$\mathbb{I}$ is in the span of any $n-s$ rows of $B$:

  • Observe that $\mathbb{I}\in \mathcal{C}$ (evaluate the polynomial $f(x)=1$ at $a_0,\cdots,a_{n-1}$).
  • Then $\mathbb{I}\in \mathcal{C}^R$ as well.
  • Therefore, it suffices to show that any $n-s$ rows of $B$ span $\mathcal{C}^R$.
  • Since $\dim \mathcal{C}=\dim \mathcal{C}^R=n-s$, it suffices to show that any $n-s$ rows are linearly independent.

Observe: the leftmost $n-s$ columns of $B$ are linearly independent codewords of $\mathcal{C}$ (since $\mathcal{C}$ is cyclic, every cyclic shift of $c$ is again a codeword), and therefore the columns of $B$ span $\mathcal{C}$.

Assume for contradiction that there exist $n-s$ linearly dependent rows.

Then there exists a nonzero $v\in \mathbb{C}^{n}$, supported on those $n-s$ rows, such that $vB=0$.

$v$ is orthogonal to every column of $B$, and hence to a spanning set of $\mathcal{C}$.

So $v\in \mathcal{C}^\perp$.

From II, $\mathcal{C}^\perp$ is an $[n,s]$ MDS code, and hence every nonzero $v\in \mathcal{C}^\perp$ has Hamming weight $\geq n-s+1$.

This contradicts $v$ being nonzero with Hamming weight at most $n-s$.
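Putting the pieces together, here is a sketch of the whole construction in code (the helper name and the brute-force verification over all $(n-s)$-subsets of rows are mine, for illustration):

```python
import numpy as np
from itertools import combinations

def gradient_coding_matrix(n, s):
    """Construct B in C^{n x n} with d = s + 1 from the [n, n-s] cyclic RS code over C."""
    k = n - s
    a = np.exp(2j * np.pi * np.arange(n) / n)          # n-th roots of unity (step I)
    G = np.vstack([a**t for t in range(k)])            # generator of C = [n, n-s] cyclic RS code

    # Step III: a codeword c with support {0, ..., s}, i.e. zero on the last k - 1 coordinates.
    Kc = np.arange(s + 1, n)
    _, _, Vh = np.linalg.svd(G[:, Kc].T)
    c = Vh[-1].conj() @ G

    # Columns of B are all cyclic shifts of c: B[i, j] = c[(i - j) mod n].
    B = np.empty((n, n), dtype=complex)
    for j in range(n):
        B[:, j] = np.roll(c, j)
    return B

# Verify the defining property: the all-ones vector lies in the span of any n - s rows.
n, s = 7, 2
B = gradient_coding_matrix(n, s)
ones = np.ones(n, dtype=complex)
for rows in combinations(range(n), n - s):
    lam, *_ = np.linalg.lstsq(B[list(rows)].T, ones, rcond=None)
    assert np.allclose(lam @ B[list(rows)], ones)
print("every row has", np.count_nonzero(np.abs(B[0]) > 1e-10), "nonzeros; all subsets decode")
```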

Bound for gradient coding

We want $s$ to be large and $d$ to be small.

How small can $d$ be with respect to $s$?

  • A: Build a bipartite graph.
    • Left side: $n$ workers $W_1,\cdots,W_n$.
    • Right side: $n$ partial datasets $\mathcal{D}_1,\cdots,\mathcal{D}_n$.
    • Connect $W_i$ to $\mathcal{D}_j$ if worker $i$ holds $\mathcal{D}_j$.
      • Equivalently, if $B_{i,j}\neq 0$.
    • $\deg (W_i) = d$ by definition.
    • $\deg (\mathcal{D}_j)\geq s+1$, since otherwise all workers holding $\mathcal{D}_j$ could straggle.
    • The sum of degrees on the left is $nd$, and on the right it is $\geq n(s+1)$.
    • So $d\geq s+1$.

This lower bound can be circumvented by settling for an approximate gradient computation.
