CSE5313 Coding and information theory for data science (Lecture 22)

Approximate Gradient Coding

Exact gradient computation and approximate gradient computation

In the previous formulation, the gradient $\sum_i v_i$ is computed exactly.

  • Accurate
  • Requires $d \geq s + 1$ (high replication factor).
  • Need to know $s$ in advance!

However:

  • Approximate gradient computations are very common!
    • E.g., stochastic gradient descent.
  • Machine learning is inherently inaccurate.
    • Relies on biased data, unverified assumptions about model, etc.

Idea: If we relax the exact computation requirement, can we have $d < s + 1$?

  • No fixed $s$ anymore.

Approximate computation:

  • Exact computation: $\nabla \triangleq v = \sum_i v_i = (1, \cdots, 1)(v_1, \cdots, v_n)^\top$.
  • Approximate computation: $\nabla \triangleq v = \sum_i v_i \approx u (v_1, \cdots, v_n)^\top$,
    • where $d_2(u, \mathbb{I})$ is “small” ($d_2(u, v) = \sqrt{\sum_i (u_i - v_i)^2}$).
    • Why?
  • Lemma: Let $v_u = u (v_1, \cdots, v_n)^\top$. If $d_2(u, \mathbb{I}) \leq \epsilon$ then $d_2(v, v_u) \leq \epsilon \cdot \ell_{spec}(V)$ (see the numerical sketch below).
    • $V$ is the matrix whose rows are the $v_i$’s.
    • $\ell_{spec}$ is the spectral norm (the positive square root of the maximum eigenvalue of $V^\top V$).
  • Idea: Distribute $S_1, \cdots, S_n$ as before, and
    • as the master gets more and more responses,
    • it can reconstruct $u (v_1, \cdots, v_n)^\top$,
    • such that $d_2(u, \mathbb{I})$ gets smaller and smaller.

[!NOTE] The requirement $d \geq s + 1$ no longer holds. $s$ is no longer a parameter of the system, but $s = n - \#\text{responses}$ at any given time.
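
The lemma can be checked numerically. Below is a minimal sketch (not part of the lecture; the random data and variable names are illustrative) comparing $d_2(v, v_u)$ against $\epsilon \cdot \ell_{spec}(V)$ with NumPy.

```python
import numpy as np

# Minimal numerical check of the lemma (illustrative data, not the lecture's).
rng = np.random.default_rng(0)
n, dim = 10, 5
V = rng.standard_normal((n, dim))        # rows are the partial gradients v_1, ..., v_n
ones = np.ones(n)
v = ones @ V                             # exact gradient: sum of the rows of V

u = ones + 0.05 * rng.standard_normal(n) # a decoding vector close to the all-ones vector
v_u = u @ V                              # approximate gradient

eps = np.linalg.norm(u - ones)           # d_2(u, 1)
spec = np.linalg.norm(V, 2)              # spectral norm of V (largest singular value)
print(np.linalg.norm(v - v_u) <= eps * spec)   # True: d_2(v, v_u) <= eps * l_spec(V)
```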

Trivial Scheme

Off the bat, the “do nothing” approach:

  • Send $S_i$ to worker $i$, i.e., $d = 1$.
  • Worker $i$ replies with the $i$’th partial gradient $v_i$.
  • The master averages up all the responses.

How good is that?

  • For $u = \frac{n}{n-s} \cdot \mathbb{I}$, the factor $\frac{n}{n-s}$ corrects the $\frac{1}{n}$ in $v_i = \frac{1}{n} \cdot \nabla \text{ on } S_i$.
  • Is this $\approx \sum_i v_i$? In other words, what is $d_2(\frac{n}{n-s} \cdot \mathbb{I}, \mathbb{I})$?

Trivial scheme: $\frac{n}{n-s} \cdot \mathbb{I}$ approximation; the effective decoding vector equals $\frac{n}{n-s}$ on the $n - s$ responders and $0$ on the $s$ stragglers, so its distance from $\mathbb{I}$ is $\sqrt{\frac{ns}{n-s}}$.

Must do better than that!
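
For concreteness, here is a small sketch of the “do nothing” scheme (the values $n = 20$, $s = 4$ and the assumption that the first $n - s$ workers respond are illustrative only); it confirms that the distance to $\mathbb{I}$ is $\sqrt{\frac{ns}{n-s}}$.

```python
import numpy as np

# "Do nothing" scheme: rescale the n - s surviving partial gradients by n/(n-s).
n, s = 20, 4
u = np.zeros(n)
u[: n - s] = n / (n - s)                 # assume (for illustration) the first n - s workers respond

dist = np.linalg.norm(u - np.ones(n))    # d_2(u, 1)
print(dist, np.sqrt(n * s / (n - s)))    # both equal sqrt(ns/(n-s)) ~ 2.236
```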

Roadmap

  • Quick reminder from linear algebra.
    • Eigenvectors and orthogonality.
  • Quick reminder from graph theory.
    • Adjacency matrix of a graph.
  • Graph theoretic concept: expander graphs.
    • “Well connected” graphs.
    • Extensively studied.
  • An approximate gradient coding scheme from expander graphs.

Linear algebra - Reminder

  • Let $A \in \mathbb{R}^{n \times n}$.
  • If $A v = \lambda v$ then $\lambda$ is an eigenvalue and $v$ is an eigenvector.
  • $v_1, \cdots, v_n \in \mathbb{R}^n$ are orthonormal if:
    • $\|v_i\|_2 = 1$ for all $i$.
    • $v_i v_j^\top = 0$ for all $i \neq j$.
  • Nice property: $\|\alpha_1 v_1 + \cdots + \alpha_n v_n\|_2 = \sqrt{\sum_i \alpha_i^2}$.
  • $A$ is called symmetric if $A = A^\top$.
  • Theorem: A real and symmetric matrix has an orthonormal basis of eigenvectors.
    • That is, there exists an orthonormal basis $v_1, \cdots, v_n$ such that $A v_i = \lambda_i v_i$ for some $\lambda_i$’s.
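
As a quick sanity check of the theorem, the following sketch (random data, purely illustrative) verifies orthonormality of the eigenvectors returned for a symmetric matrix.

```python
import numpy as np

# A real symmetric matrix has an orthonormal basis of eigenvectors.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = (M + M.T) / 2                        # symmetrize
eigvals, W = np.linalg.eigh(A)           # columns of W are the eigenvectors
print(np.allclose(W.T @ W, np.eye(5)))   # True: the eigenvectors are orthonormal
print(np.allclose(A @ W, W * eigvals))   # True: A v_i = lambda_i v_i
```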

Graph theory - Reminder

  • Undirected graph $G = (V, E)$.
  • $V$ is a vertex set, usually $V = \{1, 2, \cdots, n\}$.
  • $E \subseteq \binom{V}{2}$ is an edge set (i.e., $E$ is a collection of subsets of $V$ of size two).
  • Each edge $e \in E$ is of the form $e = \{a, b\}$ for some distinct $a, b \in V$.
  • Spectral graph theory:
    • Analyze properties of graphs (combinatorial object) using matrices (algebraic object).
    • Specifically, for a graph $G$ let $A_G \in \{0,1\}^{n \times n}$ be the adjacency matrix of $G$.
    • $(A_G)_{i,j} = 1$ if and only if $\{i,j\} \in E$ (otherwise 0).
    • $A_G$ is real and symmetric.
    • Therefore, it has an orthonormal basis of eigenvectors.

Some nice properties of adjacency matrices

  • Let $G = (V, E)$ be $d$-regular, with adjacency matrix $A_G$ whose (real) eigenvalues are $\lambda_1 \geq \cdots \geq \lambda_n$.
  • Some theorems:
    • $\lambda_1 = d$.
    • $\lambda_n \geq -d$, with equality if and only if $G$ is bipartite.
    • $A_G \mathbb{I}^\top = \lambda_1 \mathbb{I}^\top = d \mathbb{I}^\top$ (easy to show!).
      • Does that ring a bell? ;)
    • If $\lambda_1 = \lambda_2$ then $G$ is not connected.
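
These properties are easy to verify on a small $d$-regular graph; the sketch below uses a cycle ($d = 2$), chosen only for illustration.

```python
import numpy as np

# Adjacency matrix of the cycle on n nodes (2-regular).
n, d = 8, 2
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1

eigvals = np.sort(np.linalg.eigvalsh(A))[::-1]    # real eigenvalues, descending
print(eigvals[0])                                 # lambda_1 = d = 2
print(eigvals[-1])                                # lambda_n = -d = -2: the even cycle is bipartite
print(A @ np.ones(n))                             # A * 1 = d * 1
```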

Expander graphs - Intuition.

  • An important family of graphs.
  • Multiple applications in:
    • Algorithms, complexity theory, error correcting codes, etc.
  • Intuition: A graph is called an expander if there are no “lonely small sets” of nodes.
  • Every set of at most $n/2$ nodes is “well connected” to the remaining nodes in the graph.
  • A bit more formally:
    • An infinite family of graphs $(G_n)_{n=1}^\infty$ (where $G_n$ has $n$ nodes) is called an expander family if the “minimal connectedness” of small sets in $G_n$ does not go to zero as $n$ grows.

Expander graphs - Definitions.

  • All graphs in this lecture are $d$-regular, i.e., all nodes have the same degree $d$.
  • For sets of nodes $S, T \subseteq V$, let $E(S, T)$ be the set of edges between $S$ and $T$, i.e., $E(S, T) = \{(i,j) \in E \mid i \in S \text{ and } j \in T\}$.
  • For a set of nodes $S$ let:
    • $S^c = V \setminus S$ be its complement.
    • Let $\partial S = E(S, S^c)$ be the boundary of $S$.
    • I.e., the set of edges between $S$ and its complement $S^c$.
  • The expansion parameter $h_G$ of $G$ is:
    • I.e., how many edges leave $S$, relative to its size.
    • How “well connected” $S$ is to the remaining nodes.

[!NOTE] $h_G = \min_{S \subseteq V, |S| \leq n/2} \frac{|\partial S|}{|S|}$.

  • An infinite family of $d$-regular graphs $(G_n)_{n=1}^\infty$ (where $G_n$ has $n$ nodes) is called an expander family if $h(G_n) \geq \epsilon$ for all $n$.
    • Same $d$ and same $\epsilon > 0$ for all $n$.
  • Expander families with large $\epsilon$ are hard to build explicitly.
  • Example: (Lubotsky, Philips and Sarnak ‘88)
    • $V = \mathbb{Z}_p$ ($p$ prime).
    • Connect $x$ to $x + 1$, $x - 1$, and $x^{-1}$.
    • $d = 3$, very small $\epsilon$ (a brute-force sketch follows this list).
  • However, random graphs are expanders with high probability.
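
As an illustration, $h_G$ can be computed by brute force for tiny graphs. The helper below is hypothetical (written for this note, exponential in $n$) and builds the $\mathbb{Z}_p$ example above; note that in this toy version the vertices $0$, $1$ and $p-1$ end up with degree 2 because their “inverse” neighbor would be a self-loop.

```python
import numpy as np
from itertools import combinations

# Brute-force expansion parameter h_G (exponential in n; only for tiny graphs).
def expansion(A):
    n = A.shape[0]
    h = np.inf
    for k in range(1, n // 2 + 1):
        for S in combinations(range(n), k):
            Sc = [v for v in range(n) if v not in S]
            boundary = A[np.ix_(list(S), Sc)].sum()   # |E(S, S^c)| for a 0/1 adjacency matrix
            h = min(h, boundary / len(S))
    return h

# The example above: vertices Z_p, connect x to x+1, x-1 and x^{-1} (p prime).
p = 11
A = np.zeros((p, p))
for x in range(p):
    for y in {(x + 1) % p, (x - 1) % p, pow(x, p - 2, p)}:
        if y != x:                                    # skip self-loops (x = 0, 1, p-1)
            A[x, y] = A[y, x] = 1
print(expansion(A))                                   # expansion parameter h_G of this small graph
```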

Expander graphs - Eigenvalues

  • There is a strong connection between the expansion parameter of a graph and the eigenvalues $\lambda_1 \geq \cdots \geq \lambda_n$ of its adjacency matrix.
  • Some theorems (no proof):
    • $\frac{d-\lambda_2}{2} \leq h_G \leq \sqrt{2d (d - \lambda_2)}$ (illustrated numerically after this list).
    • $d - \lambda_2$ is called the spectral gap of $G$.
      • If the spectral gap is large, $G$ is a good expander.
      • How large can it be?
    • Let $\lambda = \max \{|\lambda_2|, |\lambda_n|\}$. Then $\lambda \geq 2\sqrt{d-1} - o_n(1)$ (Alon-Boppana Theorem).
    • Graphs which achieve the Alon-Boppana bound (i.e., $\lambda \leq 2\sqrt{d-1}$) are called Ramanujan graphs.
      • The “best” expanders.
      • Some constructions are known.
      • Efficient construction of Ramanujan graphs for all parameters is very recent (2016).
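
For contrast, a family with a vanishing spectral gap is a poor expander. The sketch below (illustrative only) shows this for cycles: $d - \lambda_2 \rightarrow 0$ as $n$ grows, so by the upper bound above $h_G \rightarrow 0$ as well.

```python
import numpy as np

# The cycle (d = 2) has a vanishing spectral gap, hence vanishing expansion.
for n in (8, 32, 128):
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1
    lam = np.sort(np.linalg.eigvalsh(A))[::-1]
    gap = 2 - lam[1]                       # d - lambda_2
    print(n, gap, np.sqrt(2 * 2 * gap))    # upper bound on h_G: sqrt(2d(d - lambda_2))
```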

Approximate GC from Expander Graphs

Back to approximate gradient coding.

  • Let $d$ be any replication parameter.
  • Let $G$ be an expander graph (i.e., taken from an infinite expander family $(G_n)_{n=1}^\infty$),
    • with eigenvalues $\lambda_1 \geq \cdots \geq \lambda_n$ and respective eigenvectors $w_1, \cdots, w_n$.
    • Assume $\|w_1\|_2 = \|w_2\|_2 = \cdots = \|w_n\|_2 = 1$, and $w_i w_j^\top = 0$ for all $i \neq j$.
  • Let the gradient coding matrix be $B = \frac{1}{d} A_G$.
    • The eigenvalues of $B$ are $\mu_1 = 1 \geq \mu_2 \geq \cdots \geq \mu_n$, where $\mu_i = \frac{\lambda_i}{d}$.
    • Let $\mu = \max \{|\mu_2|, |\mu_n|\}$.
    • $d$ nonzero entries in each row $\Rightarrow$ replication factor $d$.
  • Claim: For any number of stragglers $s$, we can get close to $\mathbb{I}$.
    • Much better than the trivial scheme.
    • Proximity is a function of $d$ and $\lambda$.
  • For every $s$ and any set $\mathcal{K}$ of $n - s$ responses, we build a “decoding vector”.
    • A function of $s$ and of the identities of the responding workers.
    • Will be used to linearly combine the $n - s$ responses to get the approximate gradient.
  • Let $w_{\mathcal{K}} \in \mathbb{R}^n$ be such that $(w_{\mathcal{K}})_i = \begin{cases} -1 & \text{if } i \notin \mathcal{K} \\ \frac{s}{n-s} & \text{if } i \in \mathcal{K} \end{cases}$.

Lemma 1: $w_{\mathcal{K}}$ lies in the span of $w_2, \cdots, w_n$, the $n - 1$ last eigenvectors of $A_G$.

Proof

$w_2, \cdots, w_n$ are independent, and all orthogonal to $w_1$, which is proportional to $\mathbb{I}$.

$\Rightarrow$ The span of $w_2, \cdots, w_n$ is exactly the set of all vectors whose sum of entries is zero.

The sum of entries of $w_{\mathcal{K}}$ is zero ($s \cdot (-1) + (n-s) \cdot \frac{s}{n-s} = 0$) $\Rightarrow$ $w_{\mathcal{K}}$ is in their span.

Corollary: $w_{\mathcal{K}} = \alpha_2 w_2 + \cdots + \alpha_n w_n$ for some $\alpha_i$’s in $\mathbb{R}$.

Lemma 2: From direct computation, the squared norm of $w_{\mathcal{K}}$ is $\|w_{\mathcal{K}}\|_2^2 = \frac{ns}{n-s}$.
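
Concretely, using the definition of $w_{\mathcal{K}}$: $\|w_{\mathcal{K}}\|_2^2 = s \cdot (-1)^2 + (n - s) \cdot \left(\frac{s}{n-s}\right)^2 = s + \frac{s^2}{n-s} = \frac{ns}{n-s}$.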

Corollary: $\|w_{\mathcal{K}}\|_2^2 = \sum_{i=2}^n \alpha_i^2 = \frac{ns}{n-s}$ (from Lemma 2 + orthonormality of $w_2, \cdots, w_n$).

The scheme:

  • If the set of responses is $\mathcal{K}$, the decoding vector is $w_{\mathcal{K}} + \mathbb{I}$.
  • Notice that $\operatorname{supp}(w_{\mathcal{K}} + \mathbb{I}) = \mathcal{K}$.
  • The responses the master receives are the rows of $B (v_1, \cdots, v_n)^\top$ indexed by $\mathcal{K}$.
  • $\Rightarrow$ The master can compute $(w_{\mathcal{K}} + \mathbb{I}) B (v_1, \cdots, v_n)^\top$.

Left to show: How close is $(w_{\mathcal{K}} + \mathbb{I}) B$ to $\mathbb{I}$?

Proof

Recall that:

  1. $w_{\mathcal{K}} = \alpha_2 w_2 + \cdots + \alpha_n w_n$.
  2. The $w_i$’s are eigenvectors of $A_G$ (with eigenvalues $\lambda_i$) and of $B = \frac{1}{d} A_G$ (with eigenvalues $\mu_i = \frac{\lambda_i}{d}$).

$d_2 ((w_{\mathcal{K}} + \mathbb{I}) B, \mathbb{I}) = d_2 ((\alpha_2 w_2 + \cdots + \alpha_n w_n + \mathbb{I}) B, \mathbb{I})$ (from 1.)

$= d_2 (\mathbb{I} + \alpha_2 \mu_2 w_2 + \cdots + \alpha_n \mu_n w_n, \mathbb{I})$ (eigenvalues of $B$, and $\mu_1 = 1$, so $\mathbb{I} B = \mathbb{I}$)

$= \|\alpha_2 \mu_2 w_2 + \cdots + \alpha_n \mu_n w_n\|_2$ (by definition of $d_2$)

$= \sqrt{\sum_{i=2}^n \alpha_i^2 \mu_i^2}$ ($w_i$’s are orthonormal)

$\leq \mu \sqrt{\sum_{i=2}^n \alpha_i^2} = \mu \sqrt{\frac{ns}{n-s}}$ (definition of $\mu$, and the corollary of Lemma 2).
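
To make the scheme concrete, here is a minimal end-to-end numerical sketch. The $4$-regular circulant graph stands in for an expander, and the straggler pattern and all names are illustrative assumptions rather than the lecture's construction.

```python
import numpy as np

# End-to-end sketch: a 4-regular circulant graph stands in for an expander.
rng = np.random.default_rng(1)
n, d = 12, 4
A = np.zeros((n, n))
for i in range(n):
    for o in (1, 2):                               # connect i to i±1 and i±2
        A[i, (i + o) % n] = A[(i + o) % n, i] = 1

B = A / d                                          # gradient coding matrix (row i = worker i)
mu = np.sort(np.abs(np.linalg.eigvalsh(B)))[-2]    # mu = max(|mu_2|, |mu_n|); the top value is mu_1 = 1

V = rng.standard_normal((n, 3))                    # partial gradients v_1, ..., v_n as rows
s = 3
K = np.arange(n - s)                               # assume the last s workers straggle

w_K = np.full(n, -1.0)
w_K[K] = s / (n - s)
decoder = w_K + np.ones(n)                         # decoding vector, supported on K only

exact = np.ones(n) @ V                             # the true sum of partial gradients
approx = decoder @ (B @ V)                         # uses only the rows of BV indexed by K
err = np.linalg.norm(decoder @ B - np.ones(n))     # d_2((w_K + 1)B, 1)

print(err <= mu * np.sqrt(n * s / (n - s)) + 1e-9)                           # True: the bound above
print(np.linalg.norm(approx - exact) <= err * np.linalg.norm(V, 2) + 1e-9)   # True: the earlier lemma
```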

Improvement factor

Corollary: If $B = \frac{1}{d} A_G$ for a ($d$-regular) Ramanujan graph $G$,

  • $\Rightarrow$ improvement factor $\approx \frac{2}{\sqrt{d}}$.
  • Some explicit constructions of Ramanujan graphs exist (Lubotsky, Philips and Sarnak ‘88)
    • with $\frac{2}{\sqrt{d}} \approx 0.5$!
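
This follows since a Ramanujan graph has $\lambda \leq 2\sqrt{d-1}$, so the improvement factor over the trivial scheme is $\frac{\mu \sqrt{ns/(n-s)}}{\sqrt{ns/(n-s)}} = \mu = \frac{\lambda}{d} \leq \frac{2\sqrt{d-1}}{d} \approx \frac{2}{\sqrt{d}}$.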

Recap

  • Expander graph: A $d$-regular graph with no lonely small subsets of nodes.
    • Every subset $S$ with $|S| \leq n/2$ has a large ratio $|\partial S| / |S|$ (not $\rightarrow 0$ with $n$).
    • Many constructions exist; a random graph is an expander w.h.p.
    • The expansion parameter is determined by the spectral gap $d - \lambda_2$,
    • where $\lambda = \max \{|\lambda_2|, |\lambda_n|\}$, and $\lambda_1 = d \geq \lambda_2 \geq \cdots \geq \lambda_n$ are the eigenvalues of $A_G$.
    • “Best” expander = Ramanujan graph = has $\lambda \leq 2\sqrt{d-1}$.
  • “Do nothing” approach: approximation $\sqrt{\frac{ns}{n-s}}$.
  • Approximate gradient coding:
    • Send $d$ subsets $S_{j_1}, \cdots, S_{j_d}$ to each node $i$, which returns a linear combination according to a coefficient matrix $B$.
    • Let $B = \frac{1}{d} A_G$ for $G$ a Ramanujan graph: approximation $\frac{\lambda}{d} \sqrt{\frac{ns}{n-s}}$.
    • Up to 50% closer than “do nothing”, at the price of a higher computation load.

[!NOTE] Faster = more computation load.
