CSE5313 Coding and information theory for data science (Lecture 21)
Gradient coding
Intro to Statistical Machine Learning
The learning problem is given by:
Unknown target function $f\colon \mathcal{X} \to \mathcal{Y}$.
Training data $S = \{(x_i, y_i)\}_{i=1}^{n} \subseteq \mathcal{X} \times \mathcal{Y}$.
Learning algorithm $\mathcal{A}$ and hypothesis set $\mathcal{H}$.
Goal is to find $h \in \mathcal{H}$ such that $h \approx f$.
Common hypothesis sets:
- Linear classifiers: $h(x) = \operatorname{sign}(w^\top x + b)$.
- Linear regressors: $h(x) = w^\top x + b$.
- Neural networks: concatenated linear classifiers (or differentiable approximations thereof).
The dataset is taken from an unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$.
Common approach – Empirical Risk Minimization
- Q: How to quantify how close $h$ is to $f$?
- A: Use a loss function.
- A function $\ell$ which measures the deviation between $h$'s output and $f$'s output.
- Using the chosen loss function, define two measures:
- True risk: $L_{\mathcal{D}}(h) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\ell(h(x), y)\big]$.
- Empirical risk: $L_S(h) = \frac{1}{n}\sum_{i=1}^{n} \ell(h(x_i), y_i)$.
Machine learning then amounts to minimizing $L_S(h)$ over $h \in \mathcal{H}$.
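For concreteness, here is a minimal numpy sketch of the empirical risk for a linear regressor under squared loss; the toy data, dimensions, and the name `empirical_risk` are made up for this illustration.

```python
import numpy as np

# Toy data: n points in d dimensions (made up for this illustration).
rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def empirical_risk(w, X, y):
    """L_S(w) = (1/n) * sum_i loss(h_w(x_i), y_i), with squared loss."""
    residuals = X @ w - y
    return np.mean(residuals ** 2)

print(empirical_risk(np.zeros(d), X, y))  # risk of the all-zeros regressor
```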
Gradient Descent (Motivation)
Parameterize $h$ using some real vector $w \in \mathbb{R}^d$; we want to minimize $L_S(w)$.
Algorithm:
- Initialize $w_0$.
- For $t = 0, 1, 2, \ldots$:
- Compute $w_{t+1} = w_t - \eta\,\nabla L_S(w_t)$.
- Terminate when some stopping condition is met.
Bottleneck: calculating $\nabla L_S(w_t) = \frac{1}{n}\sum_{i=1}^{n}\nabla \ell(h_{w_t}(x_i), y_i)$.
Potentially $O(nd)$ work per step, where $n$ is the number of points and $d$ is the dimension of the feature space.
Solution: Parallelize.
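Before parallelizing, here is a plain (single-machine) sketch of the loop above, reusing the toy data and `empirical_risk` from the previous snippet; the step size and stopping rule are illustrative choices, not the lecture's.

```python
def gradient(w, X, y):
    """Gradient of the squared-loss empirical risk: (2/n) * X^T (Xw - y)."""
    return (2.0 / X.shape[0]) * X.T @ (X @ w - y)

def gradient_descent(X, y, eta=0.1, steps=500, tol=1e-8):
    w = np.zeros(X.shape[1])            # initialize w_0
    for _ in range(steps):              # for t = 0, 1, 2, ...
        g = gradient(w, X, y)           # the O(nd) bottleneck step
        if np.linalg.norm(g) < tol:     # stopping condition
            break
        w = w - eta * g                 # w_{t+1} = w_t - eta * grad
    return w

w_hat = gradient_descent(X, y)
print(empirical_risk(w_hat, X, y))      # should be near the noise floor
```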
Distributed Gradient Descent
Idea: use a distributed system with master and workers.
Problem: Stragglers (slow servers, often $5$–$6\times$ slower than the average response time).
Potential Solutions:
- Wait for all servers:
- Accurate, but slow
- Sum results without the slowest ones
- Less accurate, but faster
- Introduce redundancy
- Send each partial dataset to more than one server.
- Each server receives more than one partial dataset.
- Each server sends a linear combination of its partial gradients to the master.
- The master decodes the sum of all partial gradients from the received linear combinations.
Problem setup
System setup: one master $M$, $n$ workers $W_1, \ldots, W_n$.
A dataset $S$, partitioned into $n$ parts $S_1, \ldots, S_n$.
Each server $W_i$:
- Receives $d$ partial datasets $S_{j_1}, \ldots, S_{j_d}$, where $d$ is the replication factor.
- Computes a certain vector $g_{j_\ell}$ (a partial gradient) from each $S_{j_\ell}$.
- Returns a linear combination $\sum_{\ell=1}^{d} b_{i,j_\ell}\, g_{j_\ell}$ to the master.
The master:
- Waits for the first $n-s$ responses to arrive ($s$ is the straggler tolerance factor).
- Linearly combines the responses to get the final gradient.
Goal: Retrieve $\sum_{j=1}^{n} g_j$ regardless of which $n-s$ workers responded.
- Computation of the full gradient that tolerates any $s$ stragglers.
The $b_{i,j}$'s are fixed (they do not depend on the data). These form a gradient coding matrix $B \in \mathbb{C}^{n \times n}$.
Row $i$ has nonzero entries $b_{i,j_1}, \ldots, b_{i,j_d}$ at some $d$ positions.
The decoding coefficients $a_k$:
- Might depend on the identity of the responding workers.
- Must nevertheless exist in every case.
Recall:
- The master must be able to recover $\sum_{j=1}^{n} g_j$ from any $n-s$ responses.
Let $K \subseteq [n]$, $|K| = n-s$, be the indices of the responses.
Let $B_K$ be the submatrix of $B$ whose rows are indexed by $K$.
Must have:
- For every $K$ of size $n-s$ there exist coefficients $a_K = (a_k)_{k \in K}$ such that $a_K B_K = \mathbb{1}$ (the all-ones vector).
Then if the workers indexed by $K$ responded, $\sum_{k \in K} a_k \sum_{j} b_{k,j}\, g_j = \sum_{j=1}^{n} \big(\sum_{k \in K} a_k b_{k,j}\big) g_j = \sum_{j=1}^{n} g_j$.
Definition of gradient coding matrix.
For replication factor $d$ and straggler tolerance factor $s$: $B \in \mathbb{C}^{n \times n}$ is a gradient coding matrix if:
- $\mathbb{1}$ is in the span of any $n-s$ rows.
- Every row of $B$ contains at most $d$ nonzero elements.
A gradient coding matrix implies a gradient coding algorithm (see the sketch below):
- The master sends $S_{j_1}, \ldots, S_{j_d}$ to worker $i$, where $j_1, \ldots, j_d$ are the nonzero indices of row $i$.
- Worker $i$:
- Computes $g_{j_\ell}$ for every $\ell \in \{1, \ldots, d\}$.
- Sends $\sum_{\ell=1}^{d} b_{i,j_\ell}\, g_{j_\ell}$ to the master.
- Let $K$ be the indices of the first $n-s$ responses.
- Since $\mathbb{1}$ is in the span of the rows of $B$ indexed by $K$, there exist $(a_k)_{k \in K}$ such that $\sum_{k \in K} a_k b_k = \mathbb{1}$, and the master outputs the full gradient $\sum_{j=1}^{n} g_j$.
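As an end-to-end illustration of this protocol, here is a small simulation with $n = 3$ workers and $s = 1$. The specific matrix `B` below is a hand-checked example with $d = 2$ (not the lecture's construction, which comes next), and the gradient values are made up.

```python
import numpy as np
from itertools import combinations

# A hand-checked gradient coding matrix for n = 3, s = 1, d = 2:
# every row has 2 nonzeros, and the all-ones vector is in the span of any 2 rows.
B = np.array([[0.5, 1.0,  0.0],
              [0.0, 1.0, -1.0],
              [0.5, 0.0,  1.0]])
n, s = 3, 1

# Partial gradients g_1, g_2, g_3 as the rows of G (made-up numbers).
G = np.arange(12, dtype=float).reshape(3, 4)
full_gradient = G.sum(axis=0)

responses = B @ G  # row i is what worker i sends: sum_j b_{ij} * g_j

# Whichever n - s workers respond first, the master can decode.
for K in combinations(range(n), n - s):
    B_K = B[list(K)]
    # Find decoding coefficients a_K with a_K @ B_K = all-ones.
    a_K, *_ = np.linalg.lstsq(B_K.T, np.ones(n), rcond=None)
    assert np.allclose(a_K @ responses[list(K)], full_gradient)
print("decoded the full gradient from every set of", n - s, "responders")
```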
Construction of Gradient Coding Matrices
Goal:
- For a given straggler tolerance parameter $s$, we wish to construct a gradient coding matrix with the smallest possible $d$.
Tools:
I. Cyclic Reed-Solomon codes over the complex numbers.
II. Definitions of $C^\perp$ (the dual of $C$) and $C^R$ (the reverse of $C$).
III. A simple lemma about MDS codes.
Recall: An $[n, k]$ Reed-Solomon code over a field $\mathbb{F}$ is defined as follows.
- Fix distinct $\alpha_1, \ldots, \alpha_n \in \mathbb{F}$.
- $C = \{ (f(\alpha_1), \ldots, f(\alpha_n)) : f \in \mathbb{F}[x],\ \deg f < k \}$.
- Dimension $k$ and minimum distance $n-k+1$ follow from $\mathbb{F}$ being a field.
- Also works for $\mathbb{F} = \mathbb{C}$.
I. Cyclic Reed-Solomon codes over the complex numbers.
The following Reed-Solomon code over the complex numbers is called a cyclic code.
- Let $\omega = e^{2\pi i/n}$.
- For $j \in [n]$, choose $\alpha_j = \omega^{j-1}$. The $\alpha_j$'s are roots of unity of order $n$.
- Use these $\alpha_j$'s to define a Reed-Solomon code as usual.
This code is cyclic:
- Let $c = (f(\alpha_1), \ldots, f(\alpha_n)) \in C$ for some $f$ with $\deg f < k$.
- Need to show that $(c_n, c_1, \ldots, c_{n-1}) \in C$ for all $c \in C$; indeed, it is the evaluation vector of $g(x) = f(\omega^{-1}x)$, whose degree is also less than $k$.
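A quick numerical sanity check of this cyclicity argument; the parameters and polynomial below are chosen only for illustration.

```python
import numpy as np

n, k = 6, 3
omega = np.exp(2j * np.pi / n)
alphas = omega ** np.arange(n)            # roots of unity of order n

f = np.array([1.0, 2.0, -1.0])            # coefficients of f (deg f < k), low degree first

def evaluate(coeffs, pts):
    """Evaluate the polynomial with the given (low-degree-first) coefficients."""
    return np.polyval(coeffs[::-1], pts)

c = evaluate(f, alphas)                   # codeword (f(alpha_1), ..., f(alpha_n))
shifted = np.roll(c, 1)                   # (c_n, c_1, ..., c_{n-1})

# The shift is the evaluation vector of g(x) = f(x / omega),
# whose i-th coefficient is f_i * omega^{-i}.
g = f * omega ** (-np.arange(k))
assert np.allclose(shifted, evaluate(g, alphas))
```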
II. Dual and Reversed Codes
- Let $C$ be an $[n, k]$ MDS code.
Definition (dual code of $C$)
The dual code of $C$ is $C^\perp = \{ x \in \mathbb{F}^n : \sum_{i=1}^{n} x_i c_i = 0 \text{ for all } c \in C \}$.
Claim: $C^\perp$ is an $[n, n-k]$ MDS code.
Definition (reversed code of $C$)
The reversed code of $C$ is $C^R = \{ (c_n, c_{n-1}, \ldots, c_1) : (c_1, \ldots, c_n) \in C \}$.
We claim that if $C$ is cyclic, then $C^R$ is cyclic (and it is an $[n, k]$ MDS code as well).
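These claims can be checked numerically on a small cyclic RS code; a minimal sketch, assuming the Vandermonde generator from before (scipy's `null_space` yields a basis of the dual).

```python
import numpy as np
from scipy.linalg import null_space

n, k = 6, 3
omega = np.exp(2j * np.pi / n)
alphas = omega ** np.arange(n)
G = np.vander(alphas, k, increasing=True).T   # k x n generator of the cyclic RS code

# Dual code: all x with G @ x = 0 (bilinear form, no conjugation).
H = null_space(G)
assert H.shape[1] == n - k                    # dim of the dual is n - k

# C^R is cyclic: take a codeword c of C, shift its reversal, and check that
# the result, re-reversed, is again in the row space of G (i.e., in C).
c = G.T @ np.array([1.0, -2.0, 0.5])          # some codeword of C
shifted_rev = np.roll(c[::-1], 1)             # a cyclic shift of a word in C^R
x = shifted_rev[::-1]
coeffs, *_ = np.linalg.lstsq(G.T, x, rcond=None)
assert np.allclose(G.T @ coeffs, x)           # so C^R is closed under shifts
```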
III. Lemma about MDS codes
Let $C$ be an $[n, k]$ MDS code.
Lemma
For any subset $Z \subseteq [n]$ of size $n-k+1$ there exists $c \in C$ whose support (set of nonzero indices) is $Z$.
Proof
Let $G \in \mathbb{F}^{k \times n}$ be a generator matrix, and let $G'$ be its restriction to the $k-1$ columns not indexed by $Z$.
$G'$ has more rows than columns, so there exists $v \in \mathbb{F}^k$, $v \neq 0$, such that $vG' = 0$.
So $c = vG$ has at least $k-1$ zeros, in the entries indexed by $[n] \setminus Z$.
The remaining entries of $c$, indexed by $Z$, must be nonzero: $c$ is a nonzero codeword of an MDS code, so its Hamming weight is at least $n-k+1 = |Z|$.
Thus the support of $c$ is exactly $Z$.
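A numerical instance of the lemma for a small RS code; the support set `Z` is chosen arbitrarily for the illustration.

```python
import numpy as np
from scipy.linalg import null_space

n, k = 6, 3
omega = np.exp(2j * np.pi / n)
alphas = omega ** np.arange(n)
G = np.vander(alphas, k, increasing=True).T   # generator of an [n, k] MDS (RS) code

Z = [0, 2, 3, 5]                              # target support, |Z| = n - k + 1
not_Z = [j for j in range(n) if j not in Z]

# v @ G' = 0 for the k x (k-1) submatrix G' of columns outside Z.
v = null_space(G[:, not_Z].T)[:, 0]
c = v @ G                                     # codeword vanishing outside Z

assert np.allclose(c[not_Z], 0)
assert np.all(np.abs(c[Z]) > 1e-9)            # support of c is exactly Z
```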
Constructing the gradient coding matrix
Consider any $n$ workers and $s$ stragglers.
Let $k = n - s$.
Let $C$ be the $[n, n-s]$ cyclic RS code built by I.
Then by III, there exists $c \in C$ whose support is the first $s+1$ entries.
Denote $c = (c_1, \ldots, c_{s+1}, 0, \ldots, 0)$ for some nonzero $c_1, \ldots, c_{s+1}$.
Build:
$B \in \mathbb{C}^{n \times n}$, whose rows are all $n$ cyclic shifts of $c$.
We claim that $B$ is a gradient coding matrix with $d = s+1$.
Proof
Every row of $B$ is a codeword in $C$.
- Specifically, a cyclic shift of $c$, and $C$ is cyclic.
- Then every row contains exactly $s+1$ nonzeros.
$\mathbb{1}$ is in the span of any $n-s$ rows of $B$:
- Observe that $\mathbb{1} = (f(\alpha_1), \ldots, f(\alpha_n))$ for the constant polynomial $f \equiv 1$ (evaluate the polynomial at each $\alpha_j$).
- Then $\mathbb{1} \in C$.
- Therefore, it suffices to show that any $n-s$ rows of $B$ span $C$.
- Since $\dim C = n-s$, it suffices to show that any $n-s$ rows are linearly independent.
Observe: the columns of $B$ are cyclic shifts of the reversal of $c$, hence codewords of $C^R$; the leftmost $n-s$ columns are linearly independent, and therefore the columns of $B$ span $C^R$.
Assume for contradiction that there exist $n-s$ linearly dependent rows.
Then there exists $t \in \mathbb{C}^n$ of Hamming weight at most $n-s$, $t \neq 0$, such that $tB = 0$.
$t$ is orthogonal to every column of $B$, and in particular to a basis of $C^R$.
So $t \in (C^R)^\perp$.
From II, $(C^R)^\perp$ is an $[n, s]$ MDS code, and hence every nonzero codeword in it is of Hamming weight at least $n-s+1$.
This is a contradiction.
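The whole construction, together with both properties in the definition, can be verified numerically. A minimal sketch with illustrative parameters $n = 6$, $s = 2$ (variable names are mine):

```python
import numpy as np
from itertools import combinations
from scipy.linalg import null_space

n, s = 6, 2
k = n - s
omega = np.exp(2j * np.pi / n)
alphas = omega ** np.arange(n)
G = np.vander(alphas, k, increasing=True).T      # generator of the [n, n-s] cyclic RS code

# Lemma III: a codeword c supported on the first s + 1 entries.
outside = list(range(s + 1, n))
v = null_space(G[:, outside].T)[:, 0]
c = v @ G
assert np.allclose(c[outside], 0)

# B: all n cyclic shifts of c as rows.
B = np.array([np.roll(c, i) for i in range(n)])

# Property 1: every row has exactly s + 1 = d nonzeros.
assert all(np.count_nonzero(np.abs(row) > 1e-9) == s + 1 for row in B)

# Property 2: the all-ones vector is in the span of any n - s rows.
for K in combinations(range(n), n - s):
    B_K = B[list(K)]
    a_K, *_ = np.linalg.lstsq(B_K.T, np.ones(n), rcond=None)
    assert np.allclose(B_K.T @ a_K, np.ones(n))
print("B is a gradient coding matrix with d = s + 1 =", s + 1)
```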
A lower bound for gradient coding
We want $s$ to be large and $d$ to be small.
Q: How small can $d$ be with respect to $s$?
- A: Build a bipartite graph.
- Left side: workers $W_1, \ldots, W_n$.
- Right side: partial datasets $S_1, \ldots, S_n$.
- Connect $W_i$ to $S_j$ if worker $i$ holds $S_j$.
- Equivalently, if $b_{i,j} \neq 0$.
- $\deg(W_i) \leq d$ by definition.
- $\deg(S_j) \geq s+1$: otherwise, some $s$ stragglers could erase every copy of $g_j$, and the sum could not be recovered.
- Summing degrees on the left and on the right: $nd \geq \sum_{i} \deg(W_i) = \sum_{j} \deg(S_j) \geq n(s+1)$.
- So $d \geq s+1$; the construction above attains this bound with equality (see the numeric instance below).
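As a quick numeric instance (the numbers are chosen for illustration): with $n = 12$ workers and $s = 3$, every $S_j$ must reach at least $4$ workers, so the two degree sums give $12d \geq 12 \cdot 4$, i.e., $d \geq 4$; the cyclic-RS construction achieves exactly $d = s + 1 = 4$.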
We can break the lower bound using approximate computation.