CSE5313 Coding and information theory for data science (Lecture 26)

Sliced and Broken Information with applications in DNA storage and 3D printing

Basic info

Deoxyribo-Nucleic Acid.

A double-helix shaped molecule.

Each helix is a string of

  • Cytosine,
  • Guanine,
  • Adenine, and
  • Thymine.

Contained inside every living cell.

  • Inside the nucleus.

Used to encode proteins.

mRNA carries info to Ribosome as codons of length 3 over GUCA.

  • Each codon produces an amino acid.
  • 4^3 > 20, redundancy in nature!

1st Chargaff rule:

  • The two strands are complements (A-T and G-C).
  • #A = #T and #G = #C in both strands.

2nd Chargaff rule:

  • #A \approx #T and #G \approx #C in each strand.
  • Can be explained via tandem duplications.
    • GCAGCATT \implies GCAGCAGCATT.
    • Occur naturally during cell mitosis.
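The duplication above can be sketched in code (an illustrative helper, not from the lecture):

```python
# A minimal sketch of a tandem duplication: a substring is copied and
# inserted right next to itself, e.g. GCAGCATT -> GCAGCAGCATT by
# duplicating the block GCA. The function name is ours.
def tandem_duplicate(s: str, start: int, length: int) -> str:
    """Insert a copy of s[start:start+length] immediately after it."""
    end = start + length
    return s[:end] + s[start:end] + s[end:]

# Duplications change length but barely change relative base
# frequencies, which is one proposed explanation for the 2nd Chargaff rule.
result = tandem_duplicate("GCAGCATT", 0, 3)   # -> "GCAGCAGCATT"
```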

DNA storage

DNA synthesis:

  • Artificial creation of DNA from G’s, T’s, A’s, and C’s.

Can be used to store information!

Advantages:

  • Density.
    • 5.5 PB per mm³.
  • Stability.
    • Half-life 521 years (compared to ≈20 years for hard drives).
  • Future proof.
    • DNA reading and writing will remain relevant “forever.”

DNA storage prototypes

Some recent attempts:

  • 2011, 659 kB.
    • Church, Gao, Kosuri, “Next-generation digital information storage in DNA,” Science.
  • 2018, 200 MB.
    • Organick et al., “Random access in large-scale DNA data storage,” Nature Biotechnology.
  • CatalogDNA (startup):
    • 2019, 16GB.
    • 2021, 18 Mbps.

Companies:

  • Microsoft, Illumina, Western Digital, many startups.

Challenges:

  • Expensive and slow.
  • Traditional storage media are still sufficient and affordable.

DNA Storage models

In vivo:

  • Implant the synthetic DNA inside a living organism.
  • Need evolution-correcting codes!
  • E.g., coding against tandem-duplications.

In vitro:

  • Place the synthetic DNA in test tubes.
  • Challenge: can only synthesize short sequences (≈1000 bp).
  • 1 test tube contains millions to billions of short sequences.

How to encode information?

How to achieve noise robustness?

DNA coding in the in vitro environment

Traditional data communication:

m \in \{0,1\}^k \mapsto c \in \{0,1\}^n

DNA storage:

m \in \{0,1\}^k \mapsto c \in \binom{\{0,1\}^L}{M}

where \binom{\{0,1\}^L}{M} is the collection of all M-subsets of \{0,1\}^L (0 \leq M \leq 2^L).

A codeword is a set of M binary strings, each of length L.

“Sliced channel”:

  • The message m is encoded to c \in \{0,1\}^{ML}, and then sliced into M equal parts.
  • Parts may be noisy (substitutions, deletions, etc.).
  • Also useful in network packet transmission (M packets of length L).
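The slicing step can be sketched as follows (a minimal illustration; the function name is ours):

```python
# Minimal sketch of the slicing step in the sliced channel: an ML-bit
# codeword is cut into M parts of length L, and the channel delivers
# them as an unordered set.
def slice_codeword(c: str, M: int, L: int) -> frozenset:
    """Cut an ML-bit string into M length-L parts; order is lost."""
    assert len(c) == M * L
    return frozenset(c[i * L:(i + 1) * L] for i in range(M))

received = slice_codeword("001011", M=2, L=3)   # {"001", "011"}
```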

Sliced channel: Figures of merit

How to quantify the merit of a given code \mathcal{C}?

  • Want resilience to any K substitutions in the M parts.

Redundancy:

  • Recall that in linear codes,
    • redundancy = length - dimension = \log(\text{size of space}) - \log(\text{size of code}).
  • In the sliced channel:
    • redundancy = \log(\text{size of space}) - \log(\text{size of code}) = \log \binom{2^L}{M} - \log |\mathcal{C}|.
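As a quick illustration of the redundancy formula (parameters are made up):

```python
import math

# Illustrative computation of the redundancy formula above: log of the
# space size minus log of the code size, in bits.
def sliced_redundancy(L: int, M: int, code_size: int) -> float:
    """Redundancy (in bits) of a sliced-channel code of given size."""
    return math.log2(math.comb(2 ** L, M)) - math.log2(code_size)

# With L = 3, M = 2 the space has C(8, 2) = 28 codewords; a code using
# half of them has redundancy exactly 1 bit.
r = sliced_redundancy(3, 2, 14)
```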

Research questions:

  • Bounds on redundancy?
  • Code construction?
  • Is more redundancy needed in the sliced channel?

Sliced channel: Lower bound

Idea: Sphere packing

Given c \in \binom{\{0,1\}^L}{M} and K = 1, how many codewords must we exclude?

Example

  • L = 3, M = 2, and let c = \{001, 011\}. The ball of radius 1 has size 5.

The sets at distance at most 1 from c are:

Distance 0:

\{001, 011\}

Distance 1:

\{101, 011\} \quad \{011\} \quad \{000, 011\} \\ \{001, 111\} \quad \{001\} \quad \{001, 010\}

Together with c itself, these are 7 options, and 2 of them (\{011\} and \{001\}, which have size 1 < M) are not valid codewords.

So the effective size of the ball is 5.
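The example can be verified by brute force (illustrative code, not from the lecture):

```python
# Brute-force check of the example above: for L = 3, M = 2,
# c = {001, 011}, the radius-1 ball in the sliced channel has exactly
# 5 elements (flips that merge two strings do not yield M-subsets).
def ball_radius_1(c: frozenset, L: int) -> set:
    M = len(c)
    ball = {c}
    for s in c:
        for i in range(L):
            flipped = s[:i] + ("1" if s[i] == "0" else "0") + s[i + 1:]
            cand = frozenset((c - {s}) | {flipped})
            if len(cand) == M:     # discard collapsed (size < M) sets
                ball.add(cand)
    return ball

ball = ball_radius_1(frozenset({"001", "011"}), 3)   # 5 sets
```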

Tool: (Fourier analysis of) Boolean functions

Introducing hypercube graph:

  • V = \{0,1\}^L.
  • \{x, y\} \in E if and only if d_H(x, y) = 1.
  • What is the size of E?

Consider c \in \binom{\{0,1\}^L}{M} as a characteristic function f_c : \{0,1\}^L \to \{0,1\}.

Let \partial f_c be its boundary:

  • All hypercube edges \{x, y\} such that f_c(x) \neq f_c(y).

Lemma of boundary

Size of the 1-ball \geq |\partial f_c| + 1.

Proof

Every edge on the boundary represents a unique way of flipping one bit in one string in c.

We need to bound |\partial f_c| from below.

Tool: Total influence.

Definition of total influence.

The total influence I(f) of f : \{0,1\}^L \to \{0,1\} is defined as:

I(f) = \sum_{i=1}^L \operatorname{Pr}_{x \in \{0,1\}^L}\left(f(x) \neq f(x^{\oplus i})\right)

where x^{\oplus i} equals x with its i-th bit flipped.

Theorem (edge-isoperimetric inequality, no proof):

I(f) \geq 2\alpha \log \frac{1}{\alpha},

where \alpha = \min\{\text{fraction of 1's}, \text{fraction of 0's}\}.

Notice: Let \partial_i f be the set of i-dimensional edges in \partial f. Then

\begin{aligned} I(f) &= \sum_{i=1}^L \operatorname{Pr}_{x \in \{0,1\}^L}\left(f(x) \neq f(x^{\oplus i})\right) \\ &= \sum_{i=1}^L \frac{|\partial_i f|}{2^{L-1}} \\ &= \frac{|\partial f|}{2^{L-1}}. \end{aligned}
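Both the identity and the isoperimetric inequality can be checked numerically on a small example (illustrative; the chosen subset is ours):

```python
import math
from itertools import product

# Numerical check of the identity I(f) = |∂f| / 2^(L-1) and of the
# edge-isoperimetric inequality I(f) >= 2*alpha*log2(1/alpha),
# for the characteristic function of a small M=2 subset of {0,1}^4.
def total_influence(f, L: int) -> float:
    """I(f) = sum_i Pr_x[f(x) != f(x with bit i flipped)]."""
    flips = 0  # ordered pairs (x, i) with f(x) != f(x^{\oplus i})
    for x in product((0, 1), repeat=L):
        for i in range(L):
            xi = x[:i] + (1 - x[i],) + x[i + 1:]
            if f(x) != f(xi):
                flips += 1
    return flips / 2 ** L       # each probability has denominator 2^L

L = 4
ones = {(0, 0, 0, 1), (0, 0, 1, 1)}           # f_c of an M=2 subset
f = lambda x: int(x in ones)
I = total_influence(f, L)                     # = 0.75
boundary = I * 2 ** (L - 1)                   # |∂f| = 6 edges
alpha = len(ones) / 2 ** L                    # = 1/8
# For this subset the inequality is tight: 2*(1/8)*log2(8) = 0.75.
assert I >= 2 * alpha * math.log2(1 / alpha) - 1e-9
```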

Corollary: Let \epsilon > 0, L \geq \frac{1}{\epsilon}, and M \leq 2^{(1-\epsilon)L}, and let c \in \binom{\{0,1\}^L}{M}. Then,

|\partial f_c| \geq 2 \cdot 2^{L-1} \cdot \frac{M}{2^L} \log \frac{2^L}{M} = M \log \frac{2^L}{M} \geq ML\epsilon

The size of the 1-ball in the sliced channel is therefore \geq ML\epsilon.

This implies that |\mathcal{C}| \leq \frac{\binom{2^L}{M}}{\epsilon ML}.

Corollary:

  • The redundancy in the sliced channel with K = 1 and the above parameters is \geq \log ML - O(1).
  • A simple generalization (not shown) gives O(K \log ML).

Robust indexing

Idea: Start each length-L string with \log M bits for indexing.

Problem 1: Indices subject to noise.

Problem 2: Indexing bits do not carry information \implies higher redundancy.


Idea: Robust indexing

Instead of using 1, \ldots, M for indexing, use x_1, \ldots, x_M such that

  • \{x_1, \ldots, x_M\} has minimum pairwise distance 2K+1 (solves problem 1), with |x_i| = O(\log M).
  • \{x_1, \ldots, x_M\} contains information (solves problem 2).

\{x_1, \ldots, x_M\} depends on the message.

  • Consider the message m = (m_1, m_2).
  • Find an encoding function m_1 \mapsto \{\{x_i\}_{i=1}^M \mid d_H(x_i, x_j) \geq 2K+1\} (coding over codes).
  • Assume e is such a function (not shown).
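The distance requirement on the indices can be illustrated with a greedy sketch (the encoding function e that maps m_1 to such a set is not shown in the lecture, and is not implemented here):

```python
from itertools import product

# Illustrative greedy sketch of robust indexing: pick M index strings
# with pairwise Hamming distance >= 2K+1, so an index survives any K
# substitutions. The actual construction also encodes part of the
# message in the choice of the set ("coding over codes").
def greedy_indices(length: int, M: int, K: int) -> list:
    d_min = 2 * K + 1
    chosen = []
    for bits in product("01", repeat=length):
        cand = "".join(bits)
        if all(sum(a != b for a, b in zip(cand, x)) >= d_min
               for x in chosen):
            chosen.append(cand)
            if len(chosen) == M:
                break
    return chosen

idx = greedy_indices(length=5, M=4, K=1)   # pairwise distance >= 3
```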

Additional reading:

Jin Sima, Netanel Raviv, Moshe Schwartz, and Jehoshua Bruck. “Error Correction for DNA Storage.” arXiv:2310.01729 (2023).

  • Magazine article.
  • Introductory.
  • Broad perspective.

Information Embedding in 3D printing

Motivations:

Threats to public safety.

  • Ghost guns, forging fingerprints, forging keys, fooling facial recognition.

Solution – Information Embedding.

  • Printer ID, user ID, time/location stamp

Existing Information Embedding Techniques

Many techniques exist.

Information embedding using variations in layers.

  • Width, rotation, etc.
  • Magnetic properties.
  • Radiative materials.

Combating Adversarial Noise

Most techniques are rather accurate.

  • I.e., low bit error rate.

Challenge: Adversarial damage after use.

  • Scraping.
  • Deformation.
  • Breaking.

A t-break code

Let m \in \{0,1\}^k \mapsto c \in \{0,1\}^n.

The adversary breaks c at most t times (t is a security parameter).

The decoder receives a multi-set of at most t+1 fragments. Assume the fragments are:

  • Oriented,
  • Unordered,
  • Of any length.
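A minimal model of this channel (illustrative names):

```python
from collections import Counter

# Minimal model of the t-break channel: the adversary cuts the
# codeword at up to t positions; the decoder receives the unordered
# multi-set of oriented fragments.
def break_codeword(c: str, cuts: list) -> Counter:
    """Break c at the given positions; return the fragment multi-set."""
    pieces, prev = [], 0
    for pos in sorted(cuts):
        pieces.append(c[prev:pos])
        prev = pos
    pieces.append(c[prev:])
    return Counter(pieces)      # multi-set: order of fragments is lost

frags = break_codeword("1011001", [2, 5])   # t = 2 breaks, 3 fragments
```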

Lower bound for t-break code

Claim: A t-break code must have \Omega(t \log(n/t)) redundancy.

Lemma: Let \mathcal{C} be a t-break code of length n, and for i \in \{0, 1, \ldots, n\} let \mathcal{C}_i \subseteq \mathcal{C} be the subset of \mathcal{C} containing all codewords of Hamming weight i. Then d_H(\mathcal{C}_i) \geq \lceil \frac{t+1}{2} \rceil.

Proof of Lemma

Let x, y \in \mathcal{C}_i for some i, and let \ell = d_H(x, y).

Write (\circ denotes the concatenation operation):

x = c_1 \circ x_{i_1} \circ c_2 \circ x_{i_2} \circ \ldots \circ c_\ell \circ x_{i_\ell} \circ c_{\ell+1}

y = c_1 \circ y_{i_1} \circ c_2 \circ y_{i_2} \circ \ldots \circ c_\ell \circ y_{i_\ell} \circ c_{\ell+1}

where x_{i_j} \neq y_{i_j} (single differing bits) for all j \in [\ell], and c_1, \ldots, c_{\ell+1} are the common segments.

Break x and y 2\ell times each to produce the multi-sets:

\mathcal{X} = \{c_1, c_2, \ldots, c_{\ell+1}, x_{i_1}, x_{i_2}, \ldots, x_{i_\ell}\}

\mathcal{Y} = \{c_1, c_2, \ldots, c_{\ell+1}, y_{i_1}, y_{i_2}, \ldots, y_{i_\ell}\}

Since w_H(x) = w_H(y) = i, the bits \{x_{i_j}\}_{j=1}^\ell contain the same number of 1's as \{y_{i_j}\}_{j=1}^\ell, and therefore \mathcal{X} = \mathcal{Y}. Hence, if 2\ell \leq t, the adversary can make x and y indistinguishable, so \ell \geq \lceil \frac{t+1}{2} \rceil.
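The construction in the proof can be checked on a small example (illustrative):

```python
from collections import Counter

# Small check of the construction above: two strings of equal weight
# at Hamming distance l, broken around every differing position
# (at most 2l cuts), yield identical multi-sets of fragments.
def fragments_around_diffs(x: str, y: str) -> Counter:
    """Break x at both sides of every position where x and y differ."""
    diffs = [i for i, (a, b) in enumerate(zip(x, y)) if a != b]
    cuts = sorted({i for d in diffs for i in (d, d + 1)})
    pieces, prev = [], 0
    for pos in cuts:
        pieces.append(x[prev:pos])
        prev = pos
    pieces.append(x[prev:])
    return Counter(p for p in pieces if p)   # drop empty fragments

x, y = "01100", "01010"   # both weight 2, d_H(x, y) = 2
same = fragments_around_diffs(x, y) == fragments_around_diffs(y, x)
```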

Proof of Claim

Let j \in \{0, 1, \ldots, n\} be such that \mathcal{C}_j is the largest among \mathcal{C}_0, \mathcal{C}_1, \ldots, \mathcal{C}_n.

\log |\mathcal{C}| = \log\left(\sum_{i=0}^n |\mathcal{C}_i|\right) \leq \log\left((n+1)|\mathcal{C}_j|\right) = \log(n+1) + \log |\mathcal{C}_j|

By the Lemma and the ordinary sphere-packing bound, for t' = \left\lfloor \frac{\lceil (t+1)/2 \rceil - 1}{2} \right\rfloor \approx \frac{t}{4},

|\mathcal{C}_j| \leq \frac{2^n}{\sum_{i=0}^{t'} \binom{n}{i}}

This implies that n - \log |\mathcal{C}| \geq n - \log(n+1) - \log |\mathcal{C}_j| \geq \ldots \geq \Omega(t \log(n/t)).

Corollary: In the relevant regime t = O(n^{1-\epsilon}), the redundancy is \Omega(t \log n).
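The parameter t' from the proof can be computed directly (a small arithmetic check):

```python
import math

# The radius t' = floor((ceil((t+1)/2) - 1)/2) used in the
# sphere-packing step above, which works out to roughly t/4.
def t_prime(t: int) -> int:
    return (math.ceil((t + 1) / 2) - 1) // 2

# e.g. t = 8 gives ceil(9/2) = 5 and t' = (5-1)//2 = 2.
```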

t-break codes: Main ideas.

Encoding:

  • Need multiple markers across the codeword.
  • Construct an adjacency matrix A of the markers to record their order.
  • Append RS_{2t}(A) to the codeword (as in the sliced channel).

Decoding (from t+1 fragments):

  • Locate all surviving markers, and locate the received RS_{2t}(A)'.
  • Build an approximate adjacency matrix A' from the surviving markers (d_H(A, A') \leq 2t).
  • Correct (A', RS_{2t}(A)') \mapsto (A, RS_{2t}(A)).
  • Order the fragments correctly using A.

Tools:

  • Random encoding (to have many markers).
  • Mutually uncorrelated codes (so that markers will not overlap).

Tool: Mutually uncorrelated codes.

  • Want: Markers that do not overlap.
  • Solution: Take markers from a mutually uncorrelated code (an existing notion).
    • A code \mathcal{M} is called mutually uncorrelated if no proper suffix of any m_i \in \mathcal{M} is a prefix of another m_j \in \mathcal{M} (including i = j).
  • Many constructions exist.

Theorem: For any integer \ell there exists a mutually uncorrelated code \mathcal{C}_{MU} of length \ell and size |\mathcal{C}_{MU}| \geq \frac{2^\ell}{32\ell}.
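The defining property can be checked mechanically (an illustrative checker for equal-length codes; the example codes are ours):

```python
# Illustrative checker for mutual uncorrelatedness of an equal-length
# code: no proper prefix of one codeword may equal a suffix of
# another, including a codeword checked against itself.
def is_mutually_uncorrelated(code) -> bool:
    for u in code:
        for v in code:
            for k in range(1, len(u)):   # proper prefixes only
                if u[:k] == v[-k:]:
                    return False
    return True

# {11000, 11010} is mutually uncorrelated; {0110, 1001} is not,
# since the prefix 01 of 0110 is a suffix of 1001.
ok = is_mutually_uncorrelated({"11000", "11010"})
```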

Tool: Random encoding.

  • Want: Codewords with many markers from \mathcal{C}_{MU}, that are not too far apart.
  • Problem: Hard to achieve explicitly.
  • Workaround: Show that a uniformly random string has this property.

Random encoding:

  • Choose the message at random.
  • Suitable for embedding, say, printer ID.
  • Not suitable for dynamic information.

Let m > 0 be a parameter.

Fix a mutually uncorrelated code \mathcal{C}_{MU} of length \Theta(\log m).

Fix m_1, \ldots, m_t from \mathcal{C}_{MU} as “special” markers.

Claim: With probability 1 - \frac{1}{\mathrm{poly}(m)}, a uniformly random string z \in \{0,1\}^m satisfies:

  • Every O(\log^2 m) bits contain a marker from \mathcal{C}_{MU}.
  • Every two non-overlapping substrings of length c \log m are distinct.
  • z does not contain any of the special markers m_1, \ldots, m_t.

Proof idea:

  • Short substrings are abundant.
  • Long substrings are rare.
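A quick empirical illustration of “long substrings are rare” (made-up parameters, fixed seed):

```python
import random

# Empirical illustration: in a uniformly random m-bit string, all
# windows of length ~4*log2(m) bits are pairwise distinct with high
# probability (union bound: about m^2 * 2^-w collision probability).
random.seed(0)                       # fixed seed, illustrative run
m, w = 1 << 10, 40                   # m = 1024 bits, window length 40
z = "".join(random.choice("01") for _ in range(m))
windows = [z[i:i + w] for i in range(m - w + 1)]
all_distinct = len(set(windows)) == len(windows)
```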

Sketch of encoding for t-break codes.

Repeatedly sample z \in \{0,1\}^m until it is “good.”

Find all markers m_{i_1}, \ldots, m_{i_r} in it.

Build a |\mathcal{C}_{MU}| \times |\mathcal{C}_{MU}| matrix A which records order and distances:

  • A_{i,j} = 0 if m_i, m_j are not adjacent.
  • Otherwise, A_{i,j} is the distance between them (in bits).

Append RS_{2t}(A) at the end, and use the special markers m_1, \ldots, m_t.
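The adjacency-matrix step can be sketched as follows (illustrative; marker positions are made up):

```python
# Illustrative sketch of building the matrix A from marker occurrences
# found in z. Each occurrence is (marker_index, bit_position); entry
# A[i][j] stores the gap (in bits) from marker i to the marker j that
# directly follows it, and 0 means "not adjacent".
def marker_adjacency(occurrences, num_markers: int):
    A = [[0] * num_markers for _ in range(num_markers)]
    occ = sorted(occurrences, key=lambda p: p[1])
    for (i, p), (j, q) in zip(occ, occ[1:]):
        A[i][j] = q - p          # distance in bits to the next marker
    return A

# Markers 0, 2, 1 found at bit positions 4, 19, 40 (made-up numbers).
A = marker_adjacency([(0, 4), (2, 19), (1, 40)], num_markers=3)
```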

Sketch of decoding for t-break codes.

Construct a partial adjacency matrix A' from the fragments.
