CSE5313 Coding and information theory for data science (Lecture 1)
Example problem
Get the sum of the original server data when any one of the servers is down.
-> Gradient coding for distributed machine learning.
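A minimal numeric sketch of this idea, assuming the 3-worker, 1-straggler gradient code of Tandon et al.; the coefficients below are one standard choice and not necessarily the scheme used in this course. Each worker sends a single coded combination of its local partial gradients, and the full gradient sum is decodable from any two responses.

```python
import numpy as np

# Partial gradients held by the three workers (illustrative 2-D values).
g1, g2, g3 = np.array([1., 2.]), np.array([3., 4.]), np.array([5., 6.])
full = g1 + g2 + g3

# Each worker transmits one coded combination of the gradients it holds.
sent = {
    1: 0.5 * g1 + g2,
    2: g2 - g3,
    3: 0.5 * g1 + g3,
}

# Decoding coefficients, indexed by which worker straggles (is down).
decode = {
    3: {1: 2.0, 2: -1.0},   # only workers 1 and 2 respond
    2: {1: 1.0, 3: 1.0},    # only workers 1 and 3 respond
    1: {2: 1.0, 3: 2.0},    # only workers 2 and 3 respond
}

for straggler, coeffs in decode.items():
    recovered = sum(c * sent[w] for w, c in coeffs.items())
    assert np.allclose(recovered, full)  # sum recovered without the straggler
```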
Two shares give everything and one share gives nothing.
-> Scheme: Let $f(x) = s + rx$, where $r$ is a random number. Share 1 is $f(1)$ and share 2 is $f(2)$, where $s = f(0)$ is the final key. (The Shamir secret sharing scheme.)
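A minimal sketch of this 2-out-of-2 scheme, assuming arithmetic over a prime field GF(p); the prime p, the function names, and the sample key below are illustrative choices, not part of the lecture.

```python
import secrets

P = 2**31 - 1  # an illustrative prime; all arithmetic is over GF(P)

def share(key):
    """Split `key` into two shares f(1), f(2) of f(x) = key + r*x mod P."""
    r = secrets.randbelow(P)                    # uniformly random coefficient
    return (key + r) % P, (key + 2 * r) % P     # share 1 = f(1), share 2 = f(2)

def reconstruct(s1, s2):
    """Recover the key f(0) by interpolating the line through (1, s1), (2, s2)."""
    return (2 * s1 - s2) % P                    # f(0) = 2*f(1) - f(2)

key = 1234567
s1, s2 = share(key)
assert reconstruct(s1, s2) == key
# A single share, e.g. s1 = key + r mod P with r uniform, is itself uniform
# over GF(P), so one share reveals nothing about the key.
```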
Montage of topics
Storage systems in the wild
- Huge amount of data worldwide.
- Grows exponentially.
- Must be stored reliably
- Must be accessed easily
Challenge 1: reconstructing the data from a subset of servers (see the parity sketch below).
Challenge 2: repair and maintain the data.
Challenge 3: data access efficiency.
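A toy illustration of Challenge 1 (an assumed example, not a scheme from the lecture): store two data blocks and their XOR parity on three servers, so that any two surviving servers can reconstruct all the data.

```python
# Toy erasure code for storage: servers hold (a, b, a XOR b).
a, b = 0b1011, 0b0110        # two data blocks (illustrative 4-bit values)
servers = [a, b, a ^ b]      # server 0, server 1, parity server

# Suppose server 0 fails: its block is the XOR of the two survivors.
recovered_a = servers[1] ^ servers[2]
assert recovered_a == a
# Similarly, if server 1 fails, b = servers[0] ^ servers[2], and if the
# parity server fails, both data blocks are still stored in the clear.
```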
Private information retrieval
Retrieving data from a database without revealing which data is retrieved.
Silly solution: download everything.
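A minimal sketch of the classic two-server scheme (an assumed example for intuition; the constructions covered later may differ): send a uniformly random index set to one server and the same set with the desired index flipped to the other, then XOR the two answers. Each server alone sees a uniformly random set, so neither learns which entry was requested.

```python
import secrets
from functools import reduce

def xor_subset(db, subset):
    """Server-side answer: XOR of the database bits indexed by `subset`."""
    return reduce(lambda acc, j: acc ^ db[j], subset, 0)

def private_retrieve(db, i):
    """Two-server PIR for a database of bits: recover db[i]."""
    n = len(db)
    q1 = {j for j in range(n) if secrets.randbelow(2)}   # uniformly random set
    q2 = q1 ^ {i}                                        # same set, index i flipped
    a1 = xor_subset(db, q1)    # answer from server 1
    a2 = xor_subset(db, q2)    # answer from server 2
    return a1 ^ a2             # the two answers differ exactly in db[i]

db = [1, 0, 1, 1, 0, 0, 1, 0]
assert all(private_retrieve(db, i) == db[i] for i in range(len(db)))
```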
Coding for distributed computation
Problems in distributed computation:
- Stragglers: 5% of the servers are 5x-8x slower than the rest (see the coded-computation sketch after this list).
- Privacy of the data
- Hardware failure
- Adversaries: a single adversary can cause the computation to diverge.
- Privacy of computation
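A minimal sketch of how coding mitigates stragglers in distributed computation (an assumed toy scheme, not necessarily the one taught later): split a matrix-vector product between two workers and give a third worker the parity task, so the result can be assembled from any two responses.

```python
import numpy as np

# Coded matrix-vector multiplication tolerating one straggler.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))
x = rng.standard_normal(3)

A1, A2 = A[:2], A[2:]                  # split the rows between two tasks
tasks = {1: A1, 2: A2, 3: A1 + A2}     # the third worker computes the parity task

# Suppose worker 2 straggles: assemble A @ x from workers 1 and 3.
r1 = tasks[1] @ x
r3 = tasks[3] @ x
Ax = np.concatenate([r1, r3 - r1])     # [A1 @ x, A2 @ x] stacked = A @ x
assert np.allclose(Ax, A @ x)
```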
Network coding
Passing information over a network of nodes.
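A tiny illustration of why coding inside the network helps, assuming the classic butterfly network example: two source bits must reach two sinks, but a single bottleneck edge can carry only one bit per use. Forwarding alone cannot serve both sinks; forwarding the XOR can.

```python
# Butterfly network in miniature: source bits a and b, one bottleneck edge.
a, b = 1, 0
bottleneck = a ^ b            # the single coded bit crossing the middle edge

sink1 = (a, a ^ bottleneck)   # sink 1 gets a directly and decodes b
sink2 = (b ^ bottleneck, b)   # sink 2 gets b directly and decodes a
assert sink1 == (a, b) and sink2 == (a, b)
```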
DNA storage
Alphabet: G, T, A, C (the four nucleotides).
- Super-dense
- Super-durable
- Non-volatile (no extra energy required to maintain the data)
- Future-proof (will remain of interest to future generations)
Challenges:
- DNA synthesis is hard; only short strings (1000 nucleotides) can be synthesized.
- Very noisy.
Topics covered in this course
Part 1: Foundations
Foundations:
- Finite fields, linear codes, information theory
Distributed storage:
- Regenerating codes, locally recoverable codes
Privacy:
- secret sharing, multi-party computation, private information retrieval
Midterm (short and easy, I guess)
Part 2: contemporary topics
Computation:
- coded computation, gradient coding, private computation
Emerging/advanced topics:
- DNA storage
Administration
Grading:
- 3-5 HW assignments (50%)
- Midterm (25%)
- Final assignment (25%): read a recent paper, summarize it, critique it, and propose future work.
Textbooks:
- Introduction to Coding Theory, R. M. Roth.
- Elements of Information Theory, T. M. Cover and J. A. Thomas.
The slides for every lecture will be made available online.
Clarification about Gen AI
The use of generative artificial intelligence tools (GenAI) is permitted. However, students are not permitted to ask GenAI for complete solutions, and must limit their interaction with GenAI to light document editing (grammar, typos, etc.), understanding background information, and to seeking alternative explanations for course material. In any case, submission of AI generated text is prohibited, and up to light editing, students must write all submitted text themselves.
IMPORTANT:
- In all submitted assignments and projects, students must include a “Use Of GenAI” paragraph, which briefly summarizes their use of GenAI in preparing this assignment/project.
- Failure to include this paragraph, or including untruthful statements, will be considered a violation of academic integrity.
- The course staff reserves the right to summon any student for an oral exam regarding the content of any submitted assignment or project, and adjust their grade accordingly. The exam will focus on explaining the reasoning behind the work, and no memorization will be required. Students will be summoned at random, and not necessarily on the basis of any accusation or suspicion.
Channel coding
Input alphabet $F$, output alphabet $\Phi$.
e.g. $F = \Phi = \{0, 1\}$.
Introduce noise: the received symbol may differ from the transmitted one.
We use $u$ to denote the information word to be transmitted
and $x = E(u)$ to be the codeword.
$y$ is the received word, which is given to the decoder.
$\hat{u} = D(y)$ is the decoded information word.
Error if $\hat{u} \neq u$.
A channel is defined by the three-tuple $(F, \Phi, \mathrm{Prob})$: input alphabet, output alphabet, and the conditional probability of each output given each input.
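A toy end-to-end sketch of this chain, assuming a binary symmetric channel and a 3-fold repetition code; both are illustrative choices, not the codes studied in this course.

```python
import random

# u -> x = E(u) -> channel -> y -> u_hat = D(y)

def encode(u):                 # E: repeat every information bit three times
    return [b for b in u for _ in range(3)]

def channel(x, p=0.1):         # binary symmetric channel: flip each bit w.p. p
    return [b ^ (random.random() < p) for b in x]

def decode(y):                 # D: majority vote on each block of three bits
    return [int(sum(y[i:i + 3]) >= 2) for i in range(0, len(y), 3)]

u = [1, 0, 1, 1, 0]            # information word
x = encode(u)                  # codeword
y = channel(x)                 # received word, handed to the decoder
u_hat = decode(y)              # decoded information word
print("decoding error" if u_hat != u else "decoded correctly")
```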