CSE5313 Coding and information theory for data science (Lecture 22)
Approximate Gradient Coding
Exact gradient computation and approximate gradient computation
In the previous formulation, the gradient is computed exactly.
- Accurate.
- Requires $d = s + 1$ (high replication factor).
- Need to know $s$ in advance!
However:
- Approximate gradient computations are very common!
- E.g., stochastic gradient descent.
- Machine learning is inherently inaccurate.
- Relies on biased data, unverified assumptions about model, etc.
Idea: If we relax the exact computation requirement, can we have $d < s + 1$?
- No fixed $s$ anymore.
Approximate computation:
- Exact computation: $v^\top B = \mathbb{1}^\top$, so that $v^\top B \bar{g} = \sum_{i=1}^{n} g_i$ exactly.
- Approximate computation: $\Vert v^\top B - \mathbb{1}^\top \Vert \le \epsilon$,
- Where $\epsilon$ is “small” (quantified below).
- Why?
- Lemma: Let $\bar{g}$ be the matrix whose rows are the $g_i$’s. If $\Vert v^\top B - \mathbb{1}^\top \Vert \le \epsilon$, then $\Vert v^\top B \bar{g} - \sum_{i=1}^{n} g_i \Vert \le \epsilon \cdot \Vert \bar{g} \Vert_{\mathrm{spec}}$ (proof sketch below).
- $\Vert \bar{g} \Vert_{\mathrm{spec}}$ is the spectral norm (the positive square root of the maximum eigenvalue of $\bar{g}^\top \bar{g}$).
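The lemma follows in one line from the definition of the spectral norm as an operator norm, using $\mathbb{1}^\top \bar{g} = \sum_{i=1}^{n} g_i$:

$$\Big\Vert v^\top B \bar{g} - \sum_{i=1}^{n} g_i \Big\Vert = \big\Vert (v^\top B - \mathbb{1}^\top)\, \bar{g} \big\Vert \le \Vert v^\top B - \mathbb{1}^\top \Vert \cdot \Vert \bar{g} \Vert_{\mathrm{spec}} \le \epsilon \cdot \Vert \bar{g} \Vert_{\mathrm{spec}}.$$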
- Idea: Distribute as before, and
- as the master gets more and more responses,
- it can reconstruct a decoding vector $v$,
- such that $\Vert v^\top B - \mathbb{1}^\top \Vert$ gets smaller and smaller.
[!NOTE] $d = s + 1$ no longer holds. $s$ is no longer a parameter of the system, but simply the number of stragglers at any given time.
Trivial Scheme
Off the bat, the “do nothing” approach:
- Send only the $i$’th data part to worker $i$, i.e., $B = I$.
- Worker $i$ replies with the $i$’th partial gradient $g_i$.
- The master averages up all the responses, and rescales.
How good is that?
- For $K$ the set of $n - s$ responding workers, the decoding vector is $v$ with $v_i = \frac{n}{n-s}$ for $i \in K$ and $v_i = 0$ otherwise; the factor $\frac{n}{n-s}$ corrects for the $s$ missing terms in the sum.
- Is this good? In other words, what is $\epsilon = \Vert v^\top B - \mathbb{1}^\top \Vert$? (See the computation below.)
Trivial scheme: $\epsilon = \sqrt{\frac{ns}{n-s}}$ approximation.
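The value of $\epsilon$ follows by direct computation: since $B = I$, the entries of $v^\top B - \mathbb{1}^\top$ are $\frac{n}{n-s} - 1 = \frac{s}{n-s}$ on the $n - s$ responding workers and $-1$ on the $s$ stragglers, so

$$\epsilon^2 = (n - s) \cdot \frac{s^2}{(n-s)^2} + s \cdot 1 = \frac{s^2 + s(n-s)}{n - s} = \frac{ns}{n-s}.$$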
Must do better than that!
Roadmap
- Quick reminder from linear algebra.
- Eigenvectors and orthogonality.
- Quick reminder from graph theory.
- Adjacency matrix of a graph.
- Graph theoretic concept: expander graphs.
- “Well connected” graphs.
- Extensively studied.
- An approximate gradient coding scheme from expander graphs.
Linear algebra - Reminder
- Let $A \in \mathbb{R}^{n \times n}$.
- If $Av = \lambda v$ for some $v \ne 0$, then $\lambda$ is an eigenvalue and $v$ is an eigenvector.
- $v_1, \ldots, v_n$ are orthonormal if:
- $\Vert v_i \Vert = 1$ for all $i$.
- $v_i^\top v_j = 0$ for all $i \ne j$.
- Nice property: if $u = \sum_{i} \alpha_i v_i$ then $\Vert u \Vert^2 = \sum_{i} \alpha_i^2$.
- $A$ is called symmetric if $A = A^\top$.
- Theorem: A real and symmetric matrix has an orthonormal basis of eigenvectors.
- That is, there exists an orthonormal basis $v_1, \ldots, v_n$ such that $A v_i = \lambda_i v_i$ for some $\lambda_i$’s.
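As a quick sanity check of the theorem, here is a minimal numpy sketch (the matrix is an arbitrary example of ours):

```python
import numpy as np

# An arbitrary real symmetric matrix (M + M^T is always symmetric).
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M + M.T

# eigh is numpy's eigensolver for symmetric matrices; it returns
# eigenvalues in ascending order and orthonormal eigenvectors as columns.
eigvals, eigvecs = np.linalg.eigh(A)

# The eigenvector basis is orthonormal: V^T V = I.
assert np.allclose(eigvecs.T @ eigvecs, np.eye(5))

# Each column v_i satisfies A v_i = lambda_i v_i.
assert np.allclose(A @ eigvecs, eigvecs * eigvals)

# "Nice property": for u = sum_i alpha_i v_i, ||u||^2 = sum_i alpha_i^2.
alpha = rng.standard_normal(5)
u = eigvecs @ alpha
assert np.allclose(np.linalg.norm(u) ** 2, np.sum(alpha ** 2))
```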
Graph theory - Reminder
- Undirected graph $G = (V, E)$.
- $V$ is a vertex set, usually $V = [n] = \{1, \ldots, n\}$.
- $E$ is an edge set (i.e., $E$ is a collection of subsets of $V$ of size two).
- Each edge is of the form $\{u, v\}$ for some distinct $u, v \in V$.
- Spectral graph theory:
- Analyze properties of graphs (combinatorial objects) using matrices (algebraic objects).
- Specifically, for a graph $G$ let $A_G \in \{0, 1\}^{n \times n}$ be the adjacency matrix of $G$.
- $(A_G)_{i,j} = 1$ if and only if $\{i, j\} \in E$ (otherwise 0).
- $A_G$ is real and symmetric.
- Therefore, $A_G$ has an orthonormal basis of eigenvectors.
Some nice properties of adjacency matrices
- Let $G$ be $d$-regular, with adjacency matrix $A_G$ whose (real) eigenvalues are $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$.
- Some theorems:
- $\lambda_1 = d$.
- $\lambda_n \ge -d$, with equality if and only if $G$ is bipartite.
- $\mathbb{1}$ is an eigenvector of $A_G$ with eigenvalue $d$ (easy to show: every row of $A_G$ sums to $d$).
- Does that ring a bell? ;)
- If $\lambda_2 = d$ then $G$ is not connected.
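These facts are easy to check numerically. Below is a minimal sketch on a small $d$-regular example, a circulant graph (chosen arbitrarily for illustration):

```python
import numpy as np

# Adjacency matrix of a 4-regular circulant graph on n nodes:
# node i is connected to i±1 and i±2 (mod n).
n, offsets = 12, [1, 2]
A = np.zeros((n, n))
for i in range(n):
    for k in offsets:
        A[i, (i + k) % n] = A[(i + k) % n, i] = 1
d = int(A[0].sum())  # regular: every row sums to d = 4

eigvals = np.linalg.eigvalsh(A)          # ascending order
print("lambda_1 =", eigvals[-1], "= d =", d)   # largest eigenvalue is d
print("lambda_n =", eigvals[0], ">= -d")       # smallest is >= -d

# The all-ones vector is an eigenvector with eigenvalue d.
ones = np.ones(n)
assert np.allclose(A @ ones, d * ones)
```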
Expander graphs - Intuition.
- An important family of graphs.
- Multiple applications in:
- Algorithms, complexity theory, error correcting codes, etc.
- Intuition: A graph is called an expander if there are no “lonely small sets” of nodes.
- Every set of at most $n/2$ nodes is “well connected” to the remaining nodes in the graph.
- A bit more formally:
- An infinite family of graphs $\{G_i\}_{i=1}^{\infty}$ (where $G_i$ has $n_i$ nodes and $n_i \to \infty$) is called an expander family if the “minimal connectedness” of small sets in $G_i$ does not go to zero with $i$.
Expander graphs - Definitions.
- All graphs in this lecture are $d$-regular, i.e., all nodes have the same degree $d$.
- For sets of nodes $S, T \subseteq V$, let $E(S, T)$ be the set of edges between $S$ and $T$. I.e., $E(S, T) = \{ \{u, v\} \in E \mid u \in S,\ v \in T \}$.
- For a set of nodes $S \subseteq V$ let:
- $S^c = V \setminus S$ be its complement.
- $\partial S = E(S, S^c)$ be the boundary of $S$.
- I.e., the set of edges between $S$ and its complement $S^c$.
- The expansion parameter of $G$ is: $h(G) = \min_{S \subseteq V,\ 0 < |S| \le n/2} \frac{|\partial S|}{|S|}$.
- I.e., how many edges leave $S$, relative to its size.
- How “well connected” $S$ is to the remaining nodes.
[!NOTE] $0 \le h(G) \le d$, since each of the $|S|$ nodes in $S$ contributes at most $d$ boundary edges.
- An infinite family of $d$-regular graphs $\{G_i\}_{i=1}^{\infty}$ (where $G_i$ has $n_i$ nodes and $n_i \to \infty$) is called an expander family if there exists $\epsilon > 0$ such that $h(G_i) \ge \epsilon$ for all $i$.
- Same $d$ and same $\epsilon$ for all $i$.
- Expander families with large $\epsilon$ are hard to build explicitly.
- Example (Lubotsky, Philips and Sarnak ’88):
- $V = \mathbb{Z}_p$ ($p$ prime).
- Connect every $x$ to $x + 1$, $x - 1$, and $x^{-1}$.
- $d = 3$, very small $\epsilon$.
- However, random graphs are expanders with high probability.
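To make the definition concrete, here is a brute-force computation of $h(G)$ (exponential in $n$, so only for tiny graphs; the graph is an arbitrary small example):

```python
import numpy as np
from itertools import combinations

def expansion(A):
    """Brute-force h(G) = min over 0 < |S| <= n/2 of |boundary(S)| / |S|."""
    n = A.shape[0]
    best = float("inf")
    for size in range(1, n // 2 + 1):
        for S in combinations(range(n), size):
            Sc = [v for v in range(n) if v not in S]
            boundary = A[np.ix_(S, Sc)].sum()  # number of edges leaving S
            best = min(best, boundary / size)
    return best

# 4-regular circulant graph on 10 nodes (same construction as above).
n, offsets = 10, [1, 2]
A = np.zeros((n, n))
for i in range(n):
    for k in offsets:
        A[i, (i + k) % n] = A[(i + k) % n, i] = 1

print("h(G) =", expansion(A))  # positive, and at most d = 4
```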
Expander graphs - Eigenvalues
- There is a strong connection between the expansion parameter of a graph and the eigenvalues of its adjacency matrix.
- Some theorems (no proof):
- $\frac{d - \lambda_2}{2} \le h(G) \le \sqrt{2d(d - \lambda_2)}$.
- $d - \lambda_2$ is called the spectral gap of $G$.
- If the spectral gap is large, $G$ is a good expander.
- How large can it be?
- Let $\lambda = \max(|\lambda_2|, |\lambda_n|)$. Then $\lambda \ge 2\sqrt{d - 1} - o(1)$ (Alon-Boppana Theorem).
- Graphs which achieve the Alon-Boppana bound (i.e., with $\lambda \le 2\sqrt{d - 1}$) are called Ramanujan graphs.
- The “best” expanders.
- Some constructions are known.
- Efficient construction of Ramanujan graphs for all parameters is very recent (2016).
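A small numeric experiment (a sketch; it assumes networkx is available for sampling random regular graphs) illustrating that random $d$-regular graphs tend to have $\lambda$ close to the Alon-Boppana limit $2\sqrt{d-1}$:

```python
import numpy as np
import networkx as nx

d, n = 4, 200
G = nx.random_regular_graph(d, n, seed=1)
A = nx.to_numpy_array(G)

eigvals = np.sort(np.linalg.eigvalsh(A))   # ascending
lam2, lam_n = eigvals[-2], eigvals[0]
lam = max(abs(lam2), abs(lam_n))

print("spectral gap d - lambda_2 =", d - lam2)
print("lambda =", lam, "vs Alon-Boppana limit 2*sqrt(d-1) =", 2 * np.sqrt(d - 1))
```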
Approximate GC from Expander Graphs
Back to approximate gradient coding.
- Let $d$ be any replication parameter.
- Let $G$ be a $d$-regular expander graph (i.e., taken from an infinite expander family $\{G_i\}_{i=1}^{\infty}$)
- With eigenvalues $d = \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$, and respective eigenvectors $w_1, \ldots, w_n$.
- Assume $w_1 = \frac{1}{\sqrt{n}} \mathbb{1}$, and $\Vert w_i \Vert = 1$ for all $i$.
- Let the gradient coding matrix be $B = \frac{1}{d} A_G$.
- The eigenvalues of $B$ are $\mu_i = \frac{\lambda_i}{d}$, where $\mu_1 = 1$.
- Let $\lambda = \max(|\lambda_2|, |\lambda_n|)$, so that $\max_{i \ge 2} |\mu_i| = \frac{\lambda}{d}$.
- $d$ nonzero entries in each row $\Rightarrow$ replication factor $d$.
- Claim: For any number of stragglers $s$, we can get $\frac{\lambda}{d} \sqrt{\frac{ns}{n-s}}$-close to $\mathbb{1}^\top$.
- Much better than the trivial scheme.
- Proximity is a function of $d$ and $\lambda$.
- For every $s$ and any set of responses $K$, we build a “decoding vector” $v_K$.
- A function of $s$ and of the identities of the responding workers.
- Will be used to linearly combine the responses to get the approximate gradient.
- Let $u \in \mathbb{R}^n$ be such that $u_i = \frac{s}{n-s}$ for $i \in K$, and $u_i = -1$ otherwise.
Lemma 1: $u$ is spanned by $w_2, \ldots, w_n$, the last $n - 1$ eigenvectors of $A_G$.
Proof
$w_2, \ldots, w_n$ are independent, and all orthogonal to $w_1 = \frac{1}{\sqrt{n}} \mathbb{1}$.
The span of $w_2, \ldots, w_n$ is exactly the set of all vectors whose sum of entries is zero.
The sum of entries of $u$ is $(n - s) \cdot \frac{s}{n-s} - s \cdot 1 = 0$ $\Rightarrow$ $u$ is in their span.
Corollary: $u = \sum_{i=2}^{n} \alpha_i w_i$ for some $\alpha_i$’s in $\mathbb{R}$.
Lemma 2: From direct computation (the same computation as for the trivial scheme), the norm of $u$ is $\Vert u \Vert = \sqrt{\frac{ns}{n-s}}$.
Corollary: $\sum_{i=2}^{n} \alpha_i^2 = \frac{ns}{n-s}$ (from Lemma 2 + orthonormality of the $w_i$’s).
The scheme:
- If the set of responses is $K \subseteq [n]$ with $|K| = n - s$, the decoding vector is $v_K = u + \mathbb{1}$.
- Notice that $(v_K)_i = \frac{n}{n-s}$ for $i \in K$, and $(v_K)_i = 0$ otherwise; in particular, $v_K$ is supported on $K$.
- The responses the master receives are the rows of $B \bar{g}$ indexed by $K$.
- The master can therefore compute $v_K^\top B \bar{g}$.
Left to show: How close is $v_K^\top B$ to $\mathbb{1}^\top$?
Proof
Recall that:
1. $v_K = u + \mathbb{1} = \sum_{i=2}^{n} \alpha_i w_i + \sqrt{n} \, w_1$.
2. The $w_i$’s are eigenvectors of $A_G$ (with eigenvalues $\lambda_i$) and of $B = \frac{1}{d} A_G$ (with eigenvalues $\mu_i = \frac{\lambda_i}{d}$).
Then:
$\Vert v_K^\top B - \mathbb{1}^\top \Vert = \big\Vert \big( \sum_{i=2}^{n} \alpha_i w_i + \sqrt{n} \, w_1 \big)^\top B - \mathbb{1}^\top \big\Vert$ (from 1.)
$= \big\Vert \sum_{i=2}^{n} \alpha_i \mu_i w_i^\top + \sqrt{n} \, w_1^\top - \mathbb{1}^\top \big\Vert$ (eigenvalues of $B$, and $\mu_1 = 1$)
$= \big\Vert \sum_{i=2}^{n} \alpha_i \mu_i w_i^\top \big\Vert$ (by def., $\sqrt{n} \, w_1 = \mathbb{1}$)
$= \sqrt{\sum_{i=2}^{n} \alpha_i^2 \mu_i^2}$ ($w_i$’s are orthonormal)
$\le \frac{\lambda}{d} \sqrt{\sum_{i=2}^{n} \alpha_i^2} = \frac{\lambda}{d} \sqrt{\frac{ns}{n-s}}$ ($|\mu_i| \le \frac{\lambda}{d}$ for $i \ge 2$, plus the corollary above).
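The following sketch simulates the whole scheme end to end and checks the bound numerically (it assumes networkx, and uses a random regular graph as a stand-in for an expander; variable names are ours, not from the lecture):

```python
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
n, d, s = 100, 10, 20                      # workers, degree, stragglers

# B = A_G / d for a random d-regular graph (an expander w.h.p.).
A = nx.to_numpy_array(nx.random_regular_graph(d, n, seed=0))
B = A / d

eigvals = np.sort(np.linalg.eigvalsh(A))
lam = max(abs(eigvals[-2]), abs(eigvals[0]))

# Random non-straggler set K of size n - s.
K = rng.choice(n, size=n - s, replace=False)

# Decoding vector v_K: n/(n-s) on K, zero elsewhere.
v = np.zeros(n)
v[K] = n / (n - s)

err_expander = np.linalg.norm(v @ B - np.ones(n))   # this scheme
err_trivial = np.linalg.norm(v - np.ones(n))        # "do nothing" (B = I)
bound = (lam / d) * np.sqrt(n * s / (n - s))

print(f"expander: {err_expander:.3f} <= bound {bound:.3f} < trivial {err_trivial:.3f}")
```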
Improvement factor
Corollary: If $G$ is a ($d$-regular) Ramanujan graph, then $\frac{\lambda}{d} \le \frac{2\sqrt{d-1}}{d}$,
- improvement factor $\approx \frac{2}{\sqrt{d}}$ over the trivial scheme.
- Some explicit constructions of Ramanujan graphs exist (Lubotsky, Philips and Sarnak ’88)
- with $\frac{\lambda}{d} \approx 2 d^{-0.5}$!
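For concreteness, a worked example (the value $d = 16$ is an arbitrary illustrative choice):

$$d = 16: \quad \frac{\lambda}{d} \le \frac{2\sqrt{15}}{16} \approx 0.48,$$

i.e., the approximation error is roughly half that of the trivial scheme, at the price of replication factor $16$.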
Recap
- Expander graph: A $d$-regular graph with no “lonely small subsets” of nodes.
- Every subset $S$ with $|S| \le n/2$ has a large ratio $\frac{|\partial S|}{|S|}$ (one that does not vanish as $n$ grows).
- Many constructions exist; a random graph is an expander w.h.p.
- The expansion factor is determined by the spectral gap $d - \lambda_2$,
- Where $d = \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$ are the eigenvalues of $A_G$.
- “Best” expander = Ramanujan graph = one with $\lambda = \max(|\lambda_2|, |\lambda_n|) \le 2\sqrt{d - 1}$.
- “Do nothing” approach: $\epsilon = \sqrt{\frac{ns}{n-s}}$ approximation.
- Approximate gradient coding:
- Send subsets $S_i$ to each node $i$, which returns a linear combination according to a coefficient matrix $B$.
- Let $B = \frac{1}{d} A_G$; for a Ramanujan graph: $\epsilon = \frac{2\sqrt{d-1}}{d} \sqrt{\frac{ns}{n-s}}$ approximation.
- Up to 50% closer than “do nothing”, at the price of a higher computation load.
[!NOTE] Faster = more computation load.