CSE5313 Coding and information theory for data science (Lecture 22)

Approximate Gradient Coding

Exact gradient computation and approximate gradient computation

In the previous formulation, the gradient $\sum_i v_i$ is computed exactly.

  • Accurate
  • Requires $d \geq s + 1$ (high replication factor).
  • Need to know $s$ in advance!

However:

  • Approximate gradient computations are very common!
    • E.g., stochastic gradient descent.
  • Machine learning is inherently inaccurate.
    • Relies on biased data, unverified assumptions about model, etc.

Idea: If we relax the exact computation requirement, can we have $d < s + 1$?

  • No fixed $s$ anymore.

Approximate computation:

  • Exact computation: $\nabla \triangleq v = \sum_i v_i = (1, \cdots, 1)(v_1, \cdots, v_n)^\top$.
  • Approximate computation: $\nabla \triangleq v = \sum_i v_i \approx u (v_1, \cdots, v_n)^\top$,
    • where $d_2(u, \mathbb{I})$ is “small” ($d_2(u, v) = \sqrt{\sum_i (u_i - v_i)^2}$).
    • Why?
  • Lemma: Let $v_u = u (v_1, \cdots, v_n)^\top$. If $d_2(u, \mathbb{I}) \leq \epsilon$ then $d_2(v, v_u) \leq \epsilon \cdot \ell_{spec}(V)$ (see the numerical sketch below).
    • $V$ is the matrix whose rows are the $v_i$’s.
    • $\ell_{spec}$ is the spectral norm (the positive square root of the maximum eigenvalue of $V^\top V$).
  • Idea: Distribute $S_1, \cdots, S_n$ as before, and
    • as the master gets more and more responses,
    • it can reconstruct $u (v_1, \cdots, v_n)^\top$,
    • such that $d_2(u, \mathbb{I})$ gets smaller and smaller.

[!NOTE] The requirement $d \geq s + 1$ no longer holds. $s$ is no longer a parameter of the system, but $s = n - \#\text{responses}$ at any given time.
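
The lemma can be checked numerically. Below is a minimal sketch (not part of the lecture; the random data and variable names are illustrative) comparing $d_2(v, v_u)$ against $\epsilon \cdot \ell_{spec}(V)$ with NumPy.

```python
import numpy as np

# Minimal numerical check of the lemma (illustrative data, not the lecture's).
rng = np.random.default_rng(0)
n, dim = 10, 5
V = rng.standard_normal((n, dim))        # rows are the partial gradients v_1, ..., v_n
ones = np.ones(n)
v = ones @ V                             # exact gradient: sum of the rows of V

u = ones + 0.05 * rng.standard_normal(n) # a decoding vector close to the all-ones vector
v_u = u @ V                              # approximate gradient

eps = np.linalg.norm(u - ones)           # d_2(u, 1)
spec = np.linalg.norm(V, 2)              # spectral norm of V (largest singular value)
print(np.linalg.norm(v - v_u) <= eps * spec)   # True: d_2(v, v_u) <= eps * l_spec(V)
```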

Trivial Scheme

Off the bat, the “do nothing” approach:

  • Send $S_i$ to worker $i$, i.e., $d = 1$.
  • Worker $i$ replies with the $i$’th partial gradient $v_i$.
  • The master averages up all the responses.

How good is that?

  • For $u = \frac{n}{n-s} \cdot \mathbb{I}$, the factor $\frac{n}{n-s}$ corrects the $\frac{1}{n}$ in $v_i = \frac{1}{n} \cdot \nabla \text{ on } S_i$.
  • Is this $\approx \sum_i v_i$? In other words, what is $d_2(\frac{n}{n-s} \cdot \mathbb{I}, \mathbb{I})$?

Trivial scheme: $\frac{n}{n-s} \cdot \mathbb{I}$ approximation; the effective decoding vector equals $\frac{n}{n-s}$ on the $n - s$ responders and $0$ on the $s$ stragglers, so its distance from $\mathbb{I}$ is $\sqrt{\frac{ns}{n-s}}$.

Must do better than that!
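
For concreteness, here is a small sketch of the “do nothing” scheme (the values $n = 20$, $s = 4$ and the assumption that the first $n - s$ workers respond are illustrative only); it confirms that the distance to $\mathbb{I}$ is $\sqrt{\frac{ns}{n-s}}$.

```python
import numpy as np

# "Do nothing" scheme: rescale the n - s surviving partial gradients by n/(n-s).
n, s = 20, 4
u = np.zeros(n)
u[: n - s] = n / (n - s)                 # assume (for illustration) the first n - s workers respond

dist = np.linalg.norm(u - np.ones(n))    # d_2(u, 1)
print(dist, np.sqrt(n * s / (n - s)))    # both equal sqrt(ns/(n-s)) ~ 2.236
```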

Roadmap

  • Quick reminder from linear algebra.
    • Eigenvectors and orthogonality.
  • Quick reminder from graph theory.
    • Adjacency matrix of a graph.
  • Graph theoretic concept: expander graphs.
    • “Well connected” graphs.
    • Extensively studied.
  • An approximate gradient coding scheme from expander graphs.

Linear algebra - Reminder

  • Let $A \in \mathbb{R}^{n \times n}$.
  • If $A v = \lambda v$ then $\lambda$ is an eigenvalue and $v$ is an eigenvector.
  • $v_1, \cdots, v_n \in \mathbb{R}^n$ are orthonormal if:
    • $\|v_i\|_2 = 1$ for all $i$.
    • $v_i v_j^\top = 0$ for all $i \neq j$.
  • Nice property: $\|\alpha_1 v_1 + \cdots + \alpha_n v_n\|_2 = \sqrt{\sum_i \alpha_i^2}$.
  • $A$ is called symmetric if $A = A^\top$.
  • Theorem: A real and symmetric matrix has an orthonormal basis of eigenvectors.
    • That is, there exists an orthonormal basis $v_1, \cdots, v_n$ such that $A v_i = \lambda_i v_i$ for some $\lambda_i$’s.
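
As a quick sanity check of the theorem, the following sketch (random data, purely illustrative) verifies orthonormality of the eigenvectors returned for a symmetric matrix.

```python
import numpy as np

# A real symmetric matrix has an orthonormal basis of eigenvectors.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                        # symmetrize
eigvals, W = np.linalg.eigh(A)           # columns of W are the eigenvectors
print(np.allclose(W.T @ W, np.eye(5)))   # True: the eigenvectors are orthonormal
print(np.allclose(A @ W, W * eigvals))   # True: A v_i = lambda_i v_i
```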

Graph theory - Reminder

  • Undirected graph $G = (V, E)$.
  • $V$ is a vertex set, usually $V = \{1, 2, \cdots, n\}$.
  • $E \subseteq \binom{V}{2}$ is an edge set (i.e., $E$ is a collection of subsets of $V$ of size two).
  • Each edge $e \in E$ is of the form $e = \{a, b\}$ for some distinct $a, b \in V$.
  • Spectral graph theory:
    • Analyze properties of graphs (combinatorial object) using matrices (algebraic object).
    • Specifically, for a graph $G$ let $A_G \in \{0,1\}^{n \times n}$ be the adjacency matrix of $G$.
    • $(A_G)_{i,j} = 1$ if and only if $\{i,j\} \in E$ (otherwise 0).
    • $A_G$ is real and symmetric.
    • Therefore, it has an orthonormal basis of eigenvectors.

Some nice properties of adjacency matrices

  • Let $G = (V, E)$ be $d$-regular, with adjacency matrix $A_G$ whose (real) eigenvalues are $\lambda_1 \geq \cdots \geq \lambda_n$.
  • Some theorems:
    • $\lambda_1 = d$.
    • $\lambda_n \geq -d$, with equality if and only if $G$ is bipartite.
    • $A_G \mathbb{I}^\top = \lambda_1 \mathbb{I}^\top = d \mathbb{I}^\top$ (easy to show!).
      • Does that ring a bell? ;)
    • If $\lambda_1 = \lambda_2$ then $G$ is not connected.
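
These properties are easy to verify on a small $d$-regular graph; the sketch below uses a cycle ($d = 2$), chosen only for illustration.

```python
import numpy as np

# Adjacency matrix of the cycle on n nodes (2-regular).
n, d = 8, 2
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1

eigvals = np.sort(np.linalg.eigvalsh(A))[::-1]    # real eigenvalues, descending
print(eigvals[0])                                 # lambda_1 = d = 2
print(eigvals[-1])                                # lambda_n = -d = -2: the even cycle is bipartite
print(A @ np.ones(n))                             # A * 1 = d * 1
```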

Expander graphs - Intuition.

  • An important family of graphs.
  • Multiple applications in:
    • Algorithms, complexity theory, error correcting codes, etc.
  • Intuition: A graph is called an expander if there are no “lonely small sets” of nodes.
  • Every set of at most $n/2$ nodes is “well connected” to the remaining nodes in the graph.
  • A bit more formally:
    • An infinite family of graphs $(G_n)_{n=1}^\infty$ (where $G_n$ has $n$ nodes) is called an expander family if the “minimal connectedness” of small sets in $G_n$ does not go to zero as $n$ grows.

Expander graphs - Definitions.

  • All graphs in this lecture are $d$-regular, i.e., all nodes have the same degree $d$.
  • For sets of nodes $S, T \subseteq V$, let $E(S, T)$ be the set of edges between $S$ and $T$, i.e., $E(S, T) = \{(i,j) \in E \mid i \in S \text{ and } j \in T\}$.
  • For a set of nodes $S$ let:
    • $S^c = V \setminus S$ be its complement.
    • Let $\partial S = E(S, S^c)$ be the boundary of $S$.
    • I.e., the set of edges between $S$ and its complement $S^c$.
  • The expansion parameter $h_G$ of $G$ is:
    • I.e., how many edges leave $S$, relative to its size.
    • How “well connected” $S$ is to the remaining nodes.

[!NOTE] $h_G = \min_{S \subseteq V, |S| \leq n/2} \frac{|\partial S|}{|S|}$.

  • An infinite family of $d$-regular graphs $(G_n)_{n=1}^\infty$ (where $G_n$ has $n$ nodes) is called an expander family if $h(G_n) \geq \epsilon$ for all $n$.
    • Same $d$ and same $\epsilon > 0$ for all $n$.
  • Expander families with large $\epsilon$ are hard to build explicitly.
  • Example: (Lubotsky, Philips and Sarnak ‘88)
    • $V = \mathbb{Z}_p$ ($p$ prime).
    • Connect $x$ to $x + 1$, $x - 1$, and $x^{-1}$.
    • $d = 3$, very small $\epsilon$ (a brute-force sketch follows this list).
  • However, random graphs are expanders with high probability.
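
As an illustration, $h_G$ can be computed by brute force for tiny graphs. The helper below is hypothetical (written for this note, exponential in $n$) and builds the $\mathbb{Z}_p$ example above; note that in this toy version the vertices $0$, $1$ and $p-1$ end up with degree 2 because their “inverse” neighbor would be a self-loop.

```python
import numpy as np
from itertools import combinations

# Brute-force expansion parameter h_G (exponential in n; only for tiny graphs).
def expansion(A):
    n = A.shape[0]
    h = np.inf
    for k in range(1, n // 2 + 1):
        for S in combinations(range(n), k):
            Sc = [v for v in range(n) if v not in S]
            boundary = A[np.ix_(list(S), Sc)].sum()   # |E(S, S^c)| for a 0/1 adjacency matrix
            h = min(h, boundary / len(S))
    return h

# The example above: vertices Z_p, connect x to x+1, x-1 and x^{-1} (p prime).
p = 11
A = np.zeros((p, p))
for x in range(p):
    for y in {(x + 1) % p, (x - 1) % p, pow(x, p - 2, p)}:
        if y != x:                                    # skip self-loops (x = 0, 1, p-1)
            A[x, y] = A[y, x] = 1
print(expansion(A))                                   # expansion parameter h_G of this small graph
```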

Expander graphs - Eigenvalues

  • There is a strong connection between the expansion parameter of a graph and the eigenvalues $\lambda_1 \geq \cdots \geq \lambda_n$ of its adjacency matrix.
  • Some theorems (no proof):
    • $\frac{d-\lambda_2}{2} \leq h_G \leq \sqrt{2d (d - \lambda_2)}$ (illustrated numerically after this list).
    • $d - \lambda_2$ is called the spectral gap of $G$.
      • If the spectral gap is large, $G$ is a good expander.
      • How large can it be?
    • Let $\lambda = \max \{|\lambda_2|, |\lambda_n|\}$. Then $\lambda \geq 2\sqrt{d-1} - o_n(1)$ (Alon-Boppana Theorem).
    • Graphs which achieve the Alon-Boppana bound (i.e., $\lambda \leq 2\sqrt{d-1}$) are called Ramanujan graphs.
      • The “best” expanders.
      • Some constructions are known.
      • Efficient construction of Ramanujan graphs for all parameters is very recent (2016).
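
For contrast, a family with a vanishing spectral gap is a poor expander. The sketch below (illustrative only) shows this for cycles: $d - \lambda_2 \rightarrow 0$ as $n$ grows, so by the upper bound above $h_G \rightarrow 0$ as well.

```python
import numpy as np

# The cycle (d = 2) has a vanishing spectral gap, hence vanishing expansion.
for n in (8, 32, 128):
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
    lam = np.sort(np.linalg.eigvalsh(A))[::-1]
    gap = 2 - lam[1]                       # d - lambda_2
    print(n, gap, np.sqrt(2 * 2 * gap))    # upper bound on h_G: sqrt(2d(d - lambda_2))
```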

Approximate GC from Expander Graphs

Back to approximate gradient coding.

  • Let $d$ be any replication parameter.
  • Let $G$ be an expander graph (i.e., taken from an infinite expander family $(G_n)_{n=1}^\infty$),
    • with eigenvalues $\lambda_1 \geq \cdots \geq \lambda_n$ and respective eigenvectors $w_1, \cdots, w_n$.
    • Assume $\|w_1\|_2 = \|w_2\|_2 = \cdots = \|w_n\|_2 = 1$, and $w_i w_j^\top = 0$ for all $i \neq j$.
  • Let the gradient coding matrix be $B = \frac{1}{d} A_G$.
    • The eigenvalues of $B$ are $\mu_1 = 1 \geq \mu_2 \geq \cdots \geq \mu_n$, where $\mu_i = \frac{\lambda_i}{d}$.
    • Let $\mu = \max \{|\mu_2|, |\mu_n|\}$.
    • $d$ nonzero entries in each row $\Rightarrow$ replication factor $d$.
  • Claim: For any number of stragglers $s$, we can get close to $\mathbb{I}$.
    • Much better than the trivial scheme.
    • Proximity is a function of $d$ and $\lambda$.
  • For every $s$ and any set $\mathcal{K}$ of $n - s$ responses, we build a “decoding vector”.
    • A function of $s$ and of the identities of the responding workers.
    • Will be used to linearly combine the $n - s$ responses to get the approximate gradient.
  • Let $w_{\mathcal{K}} \in \mathbb{R}^n$ be such that $(w_{\mathcal{K}})_i = \begin{cases} -1 & \text{if } i \notin \mathcal{K} \\ \frac{s}{n-s} & \text{if } i \in \mathcal{K} \end{cases}$.

Lemma 1: $w_{\mathcal{K}}$ lies in the span of $w_2, \cdots, w_n$, the $n - 1$ last eigenvectors of $A_G$.

Proof

$w_2, \cdots, w_n$ are independent, and all orthogonal to $w_1$, which is proportional to $\mathbb{I}$.

$\Rightarrow$ The span of $w_2, \cdots, w_n$ is exactly the set of all vectors whose sum of entries is zero.

The sum of entries of $w_{\mathcal{K}}$ is zero ($s \cdot (-1) + (n-s) \cdot \frac{s}{n-s} = 0$) $\Rightarrow$ $w_{\mathcal{K}}$ is in their span.

Corollary: $w_{\mathcal{K}} = \alpha_2 w_2 + \cdots + \alpha_n w_n$ for some $\alpha_i$’s in $\mathbb{R}$.

Lemma 2: From direct computation, the squared norm of $w_{\mathcal{K}}$ is $\|w_{\mathcal{K}}\|_2^2 = \frac{ns}{n-s}$.
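
Concretely, using the definition of $w_{\mathcal{K}}$: $\|w_{\mathcal{K}}\|_2^2 = s \cdot (-1)^2 + (n - s) \cdot \left(\frac{s}{n-s}\right)^2 = s + \frac{s^2}{n-s} = \frac{ns}{n-s}$.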

Corollary: $\|w_{\mathcal{K}}\|_2^2 = \sum_{i=2}^n \alpha_i^2 = \frac{ns}{n-s}$ (from Lemma 2 + orthonormality of $w_2, \cdots, w_n$).

The scheme:

  • If the set of responses is $\mathcal{K}$, the decoding vector is $w_{\mathcal{K}} + \mathbb{I}$.
  • Notice that $\operatorname{supp}(w_{\mathcal{K}} + \mathbb{I}) = \mathcal{K}$.
  • The responses the master receives are the rows of $B (v_1, \cdots, v_n)^\top$ indexed by $\mathcal{K}$.
  • $\Rightarrow$ The master can compute $(w_{\mathcal{K}} + \mathbb{I}) B (v_1, \cdots, v_n)^\top$.

Left to show: How close is $(w_{\mathcal{K}} + \mathbb{I}) B$ to $\mathbb{I}$?

Proof

Recall that:

  1. $w_{\mathcal{K}} = \alpha_2 w_2 + \cdots + \alpha_n w_n$.
  2. The $w_i$’s are eigenvectors of $A_G$ (with eigenvalues $\lambda_i$) and of $B = \frac{1}{d} A_G$ (with eigenvalues $\mu_i = \frac{\lambda_i}{d}$).

$d_2 ((w_{\mathcal{K}} + \mathbb{I}) B, \mathbb{I}) = d_2 ((\alpha_2 w_2 + \cdots + \alpha_n w_n + \mathbb{I}) B, \mathbb{I})$ (from 1.)

$= d_2 (\mathbb{I} + \alpha_2 \mu_2 w_2 + \cdots + \alpha_n \mu_n w_n, \mathbb{I})$ (eigenvalues of $B$, and $\mu_1 = 1$, so $\mathbb{I} B = \mathbb{I}$)

$= \|\alpha_2 \mu_2 w_2 + \cdots + \alpha_n \mu_n w_n\|_2$ (by definition of $d_2$)

$= \sqrt{\sum_{i=2}^n \alpha_i^2 \mu_i^2}$ ($w_i$’s are orthonormal)

$\leq \mu \sqrt{\sum_{i=2}^n \alpha_i^2} = \mu \sqrt{\frac{ns}{n-s}}$ (definition of $\mu$, and the corollary of Lemma 2).
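
To make the scheme concrete, here is a minimal end-to-end numerical sketch. The $4$-regular circulant graph stands in for an expander, and the straggler pattern and all names are illustrative assumptions rather than the lecture's construction.

```python
import numpy as np

# End-to-end sketch: a 4-regular circulant graph stands in for an expander.
rng = np.random.default_rng(1)
n, d = 12, 4
A = np.zeros((n, n))
for i in range(n):
    for o in (1, 2):                               # connect i to i±1 and i±2
        A[i, (i + o) % n] = A[(i + o) % n, i] = 1

B = A / d                                          # gradient coding matrix (row i = worker i)
mu = np.sort(np.abs(np.linalg.eigvalsh(B)))[-2]    # mu = max(|mu_2|, |mu_n|); the top value is mu_1 = 1

V = rng.standard_normal((n, 3))                    # partial gradients v_1, ..., v_n as rows
s = 3
K = np.arange(n - s)                               # assume the last s workers straggle

w_K = np.full(n, -1.0)
w_K[K] = s / (n - s)
decoder = w_K + np.ones(n)                         # decoding vector, supported on K only

exact = np.ones(n) @ V                             # the true sum of partial gradients
approx = decoder @ (B @ V)                         # uses only the rows of BV indexed by K
err = np.linalg.norm(decoder @ B - np.ones(n))     # d_2((w_K + 1)B, 1)

print(err <= mu * np.sqrt(n * s / (n - s)) + 1e-9)                           # True: the bound above
print(np.linalg.norm(approx - exact) <= err * np.linalg.norm(V, 2) + 1e-9)   # True: the earlier lemma
```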

Improvement factor

Corollary: If $B = \frac{1}{d} A_G$ for a ($d$-regular) Ramanujan graph $G$,

  • $\Rightarrow$ improvement factor $\approx \frac{2}{\sqrt{d}}$.
  • Some explicit constructions of Ramanujan graphs exist (Lubotsky, Philips and Sarnak ‘88)
    • with $\frac{2}{\sqrt{d}} \approx 0.5$!
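
This follows since a Ramanujan graph has $\lambda \leq 2\sqrt{d-1}$, so the improvement factor over the trivial scheme is $\frac{\mu \sqrt{ns/(n-s)}}{\sqrt{ns/(n-s)}} = \mu = \frac{\lambda}{d} \leq \frac{2\sqrt{d-1}}{d} \approx \frac{2}{\sqrt{d}}$.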

Recap

  • Expander graph: A $d$-regular graph with no lonely small subsets of nodes.
    • Every subset $S$ with $|S| \leq n/2$ has a large ratio $|\partial S| / |S|$ (not $\rightarrow 0$ with $n$).
    • Many constructions exist; a random graph is an expander w.h.p.
    • The expansion parameter is determined by the spectral gap $d - \lambda_2$,
    • where $\lambda = \max \{|\lambda_2|, |\lambda_n|\}$, and $\lambda_1 = d \geq \lambda_2 \geq \cdots \geq \lambda_n$ are the eigenvalues of $A_G$.
    • “Best” expander = Ramanujan graph = has $\lambda \leq 2\sqrt{d-1}$.
  • “Do nothing” approach: approximation $\sqrt{\frac{ns}{n-s}}$.
  • Approximate gradient coding:
    • Send $d$ subsets $S_{j_1}, \cdots, S_{j_d}$ to each node $i$, which returns a linear combination according to a coefficient matrix $B$.
    • Let $B = \frac{1}{d} A_G$ for $G$ a Ramanujan graph: approximation $\frac{\lambda}{d} \sqrt{\frac{ns}{n-s}}$.
    • Up to 50% closer than “do nothing”, at the price of a higher computation load.

[!NOTE] Faster = more computation load.
