CSE5313 Coding and information theory for data science (Lecture 26)
Sliced and Broken Information with applications in DNA storage and 3D printing
Basic info
Deoxyribonucleic acid (DNA).
A double-helix shaped molecule.
Each helix is a string of
- Cytosine,
- Guanine,
- Adenine, and
- Thymine.
Contained inside every living cell.
- Inside the nucleus.
Used to encode proteins.
mRNA carries info to Ribosome as codons of length 3 over GUCA.
- Each codon produces an amino acid.
- $4^3 = 64$ codons, but only about 20 amino acids: redundancy in nature!
1st Chargaff rule:
- The two strands are complements (A-T and G-C).
- #A = #T and #G = #C in both strands.
2nd Chargaff rule:
- #A ≈ #T and #G ≈ #C in each strand.
- Can be explained via tandem duplications.
- A tandem duplication copies a substring and inserts the copy right next to the original (e.g., ACGT → ACGCGT).
- Occur naturally during cell mitosis.
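As a concrete illustration, a tandem duplication copies a substring and pastes the copy immediately after the original; a minimal sketch in Python (function name illustrative):

```python
def tandem_duplicate(s: str, start: int, length: int) -> str:
    """Insert a copy of s[start:start+length] right after itself."""
    block = s[start:start + length]
    return s[:start + length] + block + s[start + length:]

# Duplicating "CG" inside "ACGT" yields "ACGCGT".
print(tandem_duplicate("ACGT", 1, 2))  # ACGCGT
```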
DNA storage
DNA synthesis:
- Artificial creation of DNA from G’s, T’s, A’s, and C’s.
Can be used to store information!
Advantages:
- Density.
- 5.5 PB per mm³.
- Stability.
- Half-life 521 years (compared to ≈ 20 years on hard drives).
- Future proof.
- DNA reading and writing will remain relevant “forever.”
DNA storage prototypes
Some recent attempts:
- 2012, 659 kB.
- Church, Gao, Kosuri, “Next-generation digital information storage in DNA,” Science.
- 2018, 200MB.
- Organick et al., “Random access in large-scale DNA data storage,” Nature Biotechnology.
- CatalogDNA (startup):
- 2019, 16GB.
- 2021, 18 Mbps.
Companies:
- Microsoft, Illumina, Western Digital, many startups.
Challenges:
- Expensive and slow.
- Traditional storage media still sufficient and affordable.
DNA Storage models
In vivo:
- Implant the synthetic DNA inside a living organism.
- Need evolution-correcting codes!
- E.g., coding against tandem-duplications.
In vitro:
- Place the synthetic DNA in test tubes.
- Challenge: Can only synthesize short sequences (a few hundred nucleotides).
- 1 test tube contains millions to billions of short sequences.
How to encode information?
How to achieve noise robustness?
DNA coding in vitro environment
Traditional data communication: $\text{Enc}: \{0,1\}^k \to \{0,1\}^n$.
DNA storage: $\text{Enc}: \{0,1\}^k \to \binom{\{0,1\}^L}{M}$,
where $\binom{\{0,1\}^L}{M}$ is the collection of all $M$-subsets of $\{0,1\}^L$ ($M \le 2^L$).
A codeword is a set of $M$ binary strings, each of length $L$.
“Sliced channel”:
- The message is encoded into $M$ equal parts.
- Parts may be noisy (substitutions, deletions, etc.).
- Also useful in network packet transmission ($M$ packets of length $L$).
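A minimal sketch of the model above, assuming the trivial encoder that just splits the message into a set of parts (no redundancy added yet; names illustrative):

```python
def slice_message(bits: str, M: int):
    """Split a bit string into a set of M equal-length parts.
    NOTE: a set is unordered, so without extra coding the parts
    cannot be reassembled in order; that is the challenge."""
    assert len(bits) % M == 0
    L = len(bits) // M
    return {bits[i * L:(i + 1) * L] for i in range(M)}

parts = slice_message("000111101100", M=4)  # four parts of length L = 3
print(parts)
```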
Sliced channel: Figures of merit
How to quantify the merit of a given code $\mathcal{C} \subseteq \binom{\{0,1\}^L}{M}$?
- Want resilience to substitutions in any of the parts.
Redundancy:
- Recall: for a code $\mathcal{C} \subseteq \{0,1\}^n$ (e.g., a linear code),
- $r(\mathcal{C}) = n - \log_2 |\mathcal{C}|$.
- In the sliced channel:
- $r(\mathcal{C}) = \log_2 \binom{2^L}{M} - \log_2 |\mathcal{C}|$.
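The sliced-channel redundancy is easy to evaluate numerically; a sketch with toy, illustrative parameters:

```python
from math import comb, log2

def sliced_redundancy(L: int, M: int, code_size: int) -> float:
    """Redundancy = log2 #(all M-subsets of {0,1}^L) minus log2 |C|."""
    return log2(comb(2 ** L, M)) - log2(code_size)

# Toy example: L = 4, M = 3; a code using half of all 3-subsets.
total = comb(2 ** 4, 3)                  # 560 possible codewords
r = sliced_redundancy(4, 3, total // 2)
print(round(r, 9))                       # 1.0 bit of redundancy
```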
Research questions:
- Bounds on redundancy?
- Code construction?
- Is more redundancy needed in sliced channel?
Sliced channel: Lower bound
Idea: Sphere packing
Given $M$ and $L$, how many codewords must we exclude?
Example
- Let $M = 2$, $L = 3$, and let $X = \{000, 001\}$. The ball of radius 1 around $X$ has size 5.
- All sets within distance 1 from $X$ are:
- Distance 0: $\{000, 001\}$.
- Distance 1: $\{100, 001\}$, $\{010, 001\}$, $\{001, 001\}$, $\{000, 101\}$, $\{000, 011\}$, $\{000, 000\}$.
- 7 options, and 2 of them (those with a repeated string) are not valid $M$-subsets.
- So the effective size of the ball is 5.
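The counting in this example can be verified by brute force; a sketch assuming the set $X = \{000, 001\}$ with $M = 2$, $L = 3$ (flips that collapse the set below $M$ distinct strings are discarded):

```python
def ball_radius_1(X):
    """All valid M-subsets reachable from X by at most one bit flip."""
    M = len(X)
    ball = {frozenset(X)}                      # distance 0
    for s in X:
        for i in range(len(s)):
            flipped = s[:i] + ('1' if s[i] == '0' else '0') + s[i + 1:]
            Y = (set(X) - {s}) | {flipped}
            if len(Y) == M:                    # discard collapsed sets
                ball.add(frozenset(Y))
    return ball

print(len(ball_radius_1({"000", "001"})))      # 5, matching the example
```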
Tool: (Fourier analysis of) Boolean functions
Introducing the hypercube graph $Q_n = (V, E)$:
- $V = \{0,1\}^n$.
- $\{u, v\} \in E$ if and only if $d_H(u, v) = 1$.
- What is the size of $E$?
Consider $\mathcal{C}$ as a characteristic function: $f = \mathbb{1}_{\mathcal{C}}: \{0,1\}^n \to \{0,1\}$.
Let $\partial f$ be its boundary.
- All hypercube edges $\{u, v\}$ such that $f(u) \ne f(v)$.
Lemma (boundary)
The total size of the 1-balls satisfies $\sum_{X \in \mathcal{C}} |B_1(X)| \ge |\partial f|$.
Proof
Every edge on the boundary is incident to exactly one codeword, and represents a unique way of flipping one bit in one string of that codeword $X$, i.e., a unique element of $B_1(X)$.
Need to bound $|\partial f|$ from below.
Tool: Total influence.
Definition of total influence
The total influence of $f$ is defined as:
$I(f) = \sum_{i=1}^{n} \Pr_x[f(x) \ne f(x^{\oplus i})]$,
where $x^{\oplus i}$ equals $x$ with its $i$th bit flipped.
Theorem (edge-isoperimetric inequality, no proof)
$I(f) \ge 2\alpha \log_2(1/\alpha)$,
where $\alpha = \Pr_x[f(x) = 1]$.
Notice: Let $E_i$ be the direction-$i$ edges in $Q_n$ (endpoints differ in bit $i$). Then $|\partial f \cap E_i| = 2^{n-1} \Pr_x[f(x) \ne f(x^{\oplus i})]$, and hence $|\partial f| = 2^{n-1} I(f)$.
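For small $n$ these quantities can be computed exhaustively. A sketch that checks the identity $|\partial f| = 2^{n-1} I(f)$ and the edge-isoperimetric inequality on a subcube indicator (the extremal case, where the inequality is tight):

```python
import itertools, math

n = 4
points = list(itertools.product([0, 1], repeat=n))
# Indicator of the subcube {x : x1 = x2 = 0}; extremal for the inequality.
f = {x: int(x[0] == 0 and x[1] == 0) for x in points}

def flip(x, i):
    return x[:i] + (1 - x[i],) + x[i + 1:]

# Total influence: sum over directions i of Pr_x[f(x) != f(x^{(i)})].
I = sum(sum(f[x] != f[flip(x, i)] for x in points) / len(points)
        for i in range(n))

# Boundary edges {x, x^{(i)}} with f(x) != f(x^{(i)}), each counted once.
boundary = sum(f[x] != f[flip(x, i)] for x in points for i in range(n)) // 2

alpha = sum(f.values()) / len(points)                  # alpha = E[f] = 1/4
assert boundary == 2 ** (n - 1) * I                    # |∂f| = 2^{n-1} I(f)
assert I >= 2 * alpha * math.log2(1 / alpha) - 1e-9    # tight for subcubes
```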
Corollary: Let $f = \mathbb{1}_{\mathcal{C}}$, let $n$ be as above, and let $\alpha = |\mathcal{C}| / 2^n$. Then,
$|\partial f| = 2^{n-1} I(f) \ge 2^n \alpha \log_2(1/\alpha) = |\mathcal{C}| \log_2(2^n / |\mathcal{C}|)$.
By the boundary lemma, the average size of a ball in the sliced channel is at least $\log_2(2^n / |\mathcal{C}|)$; by sphere packing, this implies a bound on $|\mathcal{C}|$.
Corollary:
- Redundancy in the sliced channel with the above parameters is bounded from below (exact bound not shown).
- A simple generalization (not shown) extends the bound.
Robust indexing
Idea: Start each of the $M$ strings with $\lceil \log_2 M \rceil$ bits for indexing.
Problem 1: Indices subject to noise.
Problem 2: Indexing bits do not carry information, hence higher redundancy.
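The naive index-based scheme can be sketched as follows (index width $\lceil \log_2 M \rceil$; function names illustrative):

```python
from math import ceil, log2

def encode_with_index(parts):
    """Prepend each part with its position, so the unordered set
    can be reassembled; the index bits carry no message information."""
    w = ceil(log2(len(parts)))
    return {format(i, f"0{w}b") + p for i, p in enumerate(parts)}

def decode_with_index(strings, M):
    w = ceil(log2(M))
    return "".join(p[w:] for p in sorted(strings, key=lambda s: int(s[:w], 2)))

coded = decode_with_index(encode_with_index(["110", "001", "010", "111"]), 4)
print(coded)  # 110001010111
```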
Idea: Robust indexing
Instead of using the numbers $1, \ldots, M$ for indexing, use strings $s_1, \ldots, s_M$ such that
- $s_1, \ldots, s_M$ are of some minimum pairwise Hamming distance (solves Problem 1).
- $s_1, \ldots, s_M$ contain information (solves Problem 2).
- $s_1, \ldots, s_M$ depend on the message.
- Consider the message $m$.
- Find an encoding function from messages to such sets $\{s_1, \ldots, s_M\}$ (coding over codes).
- Assume $E$ is such a function (not shown).
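One brute-force way to visualize "coding over codes" with toy parameters (an illustrative sketch, not the lecture's construction): enumerate all $M$-subsets of $\{0,1\}^L$ whose members are pairwise at distance at least $d$, and let the message index into that list.

```python
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

L, M, d = 3, 2, 2
strings = [format(v, f"0{L}b") for v in range(2 ** L)]
# All M-subsets whose members are pairwise far apart (robust index sets).
good_sets = [S for S in combinations(strings, M)
             if all(hamming(a, b) >= d for a, b in combinations(S, 2))]

def encode(message: int):
    """Map a message to a set of far-apart index strings."""
    return good_sets[message]

print(len(good_sets))   # number of messages this toy code can carry
first = encode(0)
```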
…
Additional reading:
Jin Sima, Netanel Raviv, Moshe Schwartz, and Jehoshua Bruck. “Error Correction for DNA Storage.” arXiv:2310.01729 (2023).
- Magazine article.
- Introductory.
- Broad perspective.
Information Embedding in 3D printing
Motivations:
Threats to public safety.
- Ghost guns, forging fingerprints, forging keys, fooling facial recognition.
Solution – Information Embedding.
- Printer ID, user ID, time/location stamp
Existing Information Embedding Techniques
Many techniques exist.
Information embedding using variations in layers.
- Width, rotation, etc.
- Magnetic properties.
- Radiative materials.
Combating Adversarial Noise
Most techniques are rather accurate.
- I.e., low bit error rate.
Challenge: Adversarial damage after use.
- Scraping.
- Deformation.
- Breaking.
A t-break code
Let $\mathcal{C} \subseteq \{0,1\}^n$ be a code, and let $x \in \mathcal{C}$ be the embedded codeword.
The adversary breaks $x$ at most $t$ times ($t$ is a security parameter).
The decoder receives a multiset of at most $t + 1$ fragments. Assume the following:
- Oriented
- Unordered
- Any length
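The break channel is easy to simulate (a sketch; names illustrative):

```python
from collections import Counter

def break_codeword(x: str, cuts):
    """Break x at the given positions; the decoder sees only the
    multiset of fragments (oriented, unordered, any length)."""
    pts = sorted(cuts)
    pieces = [x[i:j] for i, j in zip([0] + pts, pts + [len(x)])]
    return Counter(pieces)

# t = 2 breaks of an 8-bit codeword:
frags = break_codeword("01101001", [3, 5])
print(sorted(frags.elements()))   # ['001', '01', '011']
```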
Lower bound for t-break code
Claim: A t-break code must have redundancy.
Lemma: Let $\mathcal{C}$ be a $t$-break code of length $n$, and for $w \in \{0, 1, \ldots, n\}$ let $\mathcal{C}_w$ be the subset of $\mathcal{C}$ containing all codewords of Hamming weight $w$. Then $|\mathcal{C}_w|$ is bounded from above (exact bound not shown).
Proof of Lemma
Let $x, y \in \mathcal{C}_w$ for some $w$, with $x \ne y$.
Write ($\circ$ denotes the concatenation operation):
$x = x_1 \circ x_2 \circ \cdots \circ x_{t+1}$ and $y = y_1 \circ y_2 \circ \cdots \circ y_{t+1}$,
where the parts $x_i, y_i$ are chosen appropriately for all $i$ (details not shown).
Break $x$ and $y$ at most $t$ times each to produce the multisets $\{x_1, \ldots, x_{t+1}\}$ and $\{y_1, \ldots, y_{t+1}\}$; if these multisets coincide, the decoder cannot distinguish $x$ from $y$, and therefore $x$ and $y$ cannot both be codewords.
Proof of Claim
Let $w$ be such that $|\mathcal{C}_w|$ is the largest among $w \in \{0, 1, \ldots, n\}$; by pigeonhole, $|\mathcal{C}| \le (n + 1)|\mathcal{C}_w|$.
By the Lemma and the ordinary sphere-packing bound, $|\mathcal{C}_w|$, and hence $|\mathcal{C}|$, is bounded from above (computation not shown).
This implies the claimed lower bound on the redundancy.
Corollary: In the relevant regime of $t$ and $n$, this yields a lower bound on the redundancy (exact expression not shown).
t-break codes: Main ideas.
Encoding:
- Need multiple markers across the codeword.
- Construct an adjacency matrix $A$ of markers to record their order.
- Append $A$ to the codeword (as in the sliced channel).
Decoding (from fragments):
- Locate all surviving markers, and locate $A$.
- Build an approximate adjacency matrix $\tilde{A}$ from the surviving markers.
- Correct $\tilde{A}$ using $A$.
- Order the fragments correctly using $A$.
Tools:
- Random encoding (to have many markers).
- Mutually uncorrelated codes (so that markers will not overlap).
Tool: Mutually uncorrelated codes.
- Want: Markers not to overlap.
- Solution: Take markers from a mutually uncorrelated code (an existing notion).
- A code $\mathcal{M}$ is called mutually uncorrelated if no proper suffix of any $u \in \mathcal{M}$ is a prefix of another $v \in \mathcal{M}$ (including $v = u$).
- Many constructions exist.
Theorem: For any integer $\ell$ there exists a mutually uncorrelated code of length $\ell$ and size $\Theta(2^{\ell} / \ell)$.
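The definition can be checked by brute force; a sketch with a tiny code of the classical form $0^k 1 s$ (here $k = 2$, length 5), which is mutually uncorrelated by construction:

```python
def overlaps(a: str, b: str) -> bool:
    """True if some proper, nonempty suffix of a equals a prefix of b."""
    return any(a[-k:] == b[:k] for k in range(1, len(a)))

def mutually_uncorrelated(code) -> bool:
    """No proper suffix of any codeword is a prefix of another (incl. itself)."""
    return not any(overlaps(a, b) for a in code for b in code)

# Words 0^k 1 s with no 0^k inside s, ending in 1 (k = 2): never overlap.
code = ["00101", "00111"]
print(mutually_uncorrelated(code))                 # True
print(mutually_uncorrelated(["0101", "0110"]))     # False: "01" is a bifix
```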
Tool: Random encoding.
- Want: Codewords with many markers from $\mathcal{M}$ that are not too far apart.
- Problem: Hard to achieve explicitly.
- Workaround: Show that a uniformly random string has this property.
Random encoding:
- Choose the message at random.
- Suitable for embedding, say, printer ID.
- Not suitable for dynamic information.
Let $\ell$ be a length parameter.
Fix a mutually uncorrelated code $\mathcal{M}$ of length $\ell$.
Fix $s_1, \ldots, s_k$ from $\mathcal{M}$ as “special” markers.
Claim: With probability $1 - o(1)$, a uniformly random string $x \in \{0,1\}^n$ satisfies (for suitable length parameters, not shown):
- Every window of $W$ bits contains a marker from $\mathcal{M}$.
- Every two non-overlapping substrings of length $\ell'$ are distinct.
- $x$ does not contain any of the special markers $s_1, \ldots, s_k$.
Proof idea:
- Short substrings are abundant.
- Long substrings are rare.
Sketch of encoding for t-break codes.
Repeatedly sample $x$ uniformly at random until it is “good” (satisfies the Claim).
Find all markers in it.
Build a matrix $A$ which records order and distances:
- $A_{i,j} = 0$ if markers $i$ and $j$ are not adjacent.
- Otherwise, $A_{i,j}$ is the distance between them (in bits).
Append $A$ at the end, and mark it using the special markers $s_1, \ldots, s_k$.
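A sketch of the marker bookkeeping (toy markers and parameters; the error protection of $A$ is omitted):

```python
def marker_matrix(x: str, markers):
    """A[i][j] = distance (in bits) from marker i's start to marker j's
    start when j is the next marker found after i; 0 means not adjacent."""
    occ = []                                   # (position, marker index)
    i = 0
    while i < len(x):
        hit = next((j for j, m in enumerate(markers)
                    if x.startswith(m, i)), None)
        if hit is None:
            i += 1
        else:
            occ.append((i, hit))
            i += len(markers[hit])             # markers never overlap (MU code)
    k = len(markers)
    A = [[0] * k for _ in range(k)]
    for (p, a), (q, b) in zip(occ, occ[1:]):
        A[a][b] = q - p
    return A

A = marker_matrix("0010111" + "00111" + "0", ["00101", "00111"])
print(A)    # [[0, 7], [0, 0]]
```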
Sketch of decoding for t-break codes.
Construct a partial adjacency matrix $\tilde{A}$ from the fragments.