CSE5313 Coding and information theory for data science (Lecture 26)

Sliced and Broken Information with applications in DNA storage and 3D printing

Basic info

Deoxyribo-Nucleic Acid.

A double-helix shaped molecule.

Each helix is a string of

  • Cytosine,
  • Guanine,
  • Adenine, and
  • Thymine.

Contained inside every living cell.

  • Inside the nucleus.

Used to encode proteins.

mRNA carries info to Ribosome as codons of length 3 over GUCA.

  • Each codon produces an amino acid.
  • 4^3 > 20, redundancy in nature!

1st Chargaff rule:

  • The two strands are complements (A-T and G-C).
  • #A = #T and #G = #C in both strands.

2nd Chargaff rule:

  • #A \approx #T and #G \approx #C in each strand.
  • Can be explained via tandem duplications.
    • GCAGCATT \implies GCAGCAGCATT.
    • Occur naturally during cell mitosis.
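The duplication above can be sketched in code (an illustrative helper, not from the lecture):

```python
# A minimal sketch of a tandem duplication: a substring is copied and
# inserted right next to itself, e.g. GCAGCATT -> GCAGCAGCATT by
# duplicating the block GCA. The function name is ours.
def tandem_duplicate(s: str, start: int, length: int) -> str:
    """Insert a copy of s[start:start+length] immediately after it."""
    end = start + length
    return s[:end] + s[start:end] + s[end:]

# Duplications change length but barely change relative base
# frequencies, which is one proposed explanation for the 2nd Chargaff rule.
result = tandem_duplicate("GCAGCATT", 0, 3)   # -> "GCAGCAGCATT"
```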

DNA storage

DNA synthesis:

  • Artificial creation of DNA from G’s, T’s, A’s, and C’s.

Can be used to store information!

Advantages:

  • Density.
    • 5.5 PB per mm³.
  • Stability.
    • Half-life 521 years (compared to ≈20 years for hard drives).
  • Future proof.
    • DNA reading and writing will remain relevant “forever.”

DNA storage prototypes

Some recent attempts:

  • 2011, 659 kB.
    • Church, Gao, Kosuri, “Next-generation digital information storage in DNA,” Science.
  • 2018, 200 MB.
    • Organick et al., “Random access in large-scale DNA data storage,” Nature Biotechnology.
  • CatalogDNA (startup):
    • 2019, 16GB.
    • 2021, 18 Mbps.

Companies:

  • Microsoft, Illumina, Western Digital, many startups.

Challenges:

  • Expensive and slow.
  • Traditional storage media are still sufficient and affordable.

DNA Storage models

In vivo:

  • Implant the synthetic DNA inside a living organism.
  • Need evolution-correcting codes!
  • E.g., coding against tandem-duplications.

In vitro:

  • Place the synthetic DNA in test tubes.
  • Challenge: can only synthesize short sequences (≈1000 bp).
  • 1 test tube contains millions to billions of short sequences.

How to encode information?

How to achieve noise robustness?

DNA coding in the in vitro environment

Traditional data communication:

m \in \{0,1\}^k \mapsto c \in \{0,1\}^n

DNA storage:

m \in \{0,1\}^k \mapsto c \in \binom{\{0,1\}^L}{M}

where \binom{\{0,1\}^L}{M} is the collection of all M-subsets of \{0,1\}^L (0 \leq M \leq 2^L).

A codeword is a set of M binary strings, each of length L.

“Sliced channel”:

  • The message m is encoded to c \in \{0,1\}^{ML}, and then sliced into M equal parts.
  • Parts may be noisy (substitutions, deletions, etc.).
  • Also useful in network packet transmission (M packets of length L).
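The slicing step can be sketched as follows (a minimal illustration; the function name is ours):

```python
# Minimal sketch of the slicing step in the sliced channel: an ML-bit
# codeword is cut into M parts of length L, and the channel delivers
# them as an unordered set.
def slice_codeword(c: str, M: int, L: int) -> frozenset:
    """Cut an ML-bit string into M length-L parts; order is lost."""
    assert len(c) == M * L
    return frozenset(c[i * L:(i + 1) * L] for i in range(M))

received = slice_codeword("001011", M=2, L=3)   # {"001", "011"}
```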

Sliced channel: Figures of merit

How to quantify the merit of a given code \mathcal{C}?

  • Want resilience to any K substitutions in the M parts.

Redundancy:

  • Recall that in linear codes,
    • redundancy = length - dimension = \log(\text{size of space}) - \log(\text{size of code}).
  • In the sliced channel:
    • redundancy = \log(\text{size of space}) - \log(\text{size of code}) = \log \binom{2^L}{M} - \log |\mathcal{C}|.
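As a quick illustration of the redundancy formula (parameters are made up):

```python
import math

# Illustrative computation of the redundancy formula above: log of the
# space size minus log of the code size, in bits.
def sliced_redundancy(L: int, M: int, code_size: int) -> float:
    """Redundancy (in bits) of a sliced-channel code of given size."""
    return math.log2(math.comb(2 ** L, M)) - math.log2(code_size)

# With L = 3, M = 2 the space has C(8, 2) = 28 codewords; a code using
# half of them has redundancy exactly 1 bit.
r = sliced_redundancy(3, 2, 14)
```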

Research questions:

  • Bounds on redundancy?
  • Code construction?
  • Is more redundancy needed in the sliced channel?

Sliced channel: Lower bound

Idea: Sphere packing

Given c \in \binom{\{0,1\}^L}{M} and K = 1, how many codewords must we exclude?

Example

  • L = 3, M = 2, and let c = \{001, 011\}. The ball of radius 1 has size 5.

The sets at distance at most 1 from c are:

Distance 0:

\{001, 011\}

Distance 1:

\{101, 011\} \quad \{011\} \quad \{000, 011\} \\ \{001, 111\} \quad \{001\} \quad \{001, 010\}

Together with c itself, these are 7 options, and 2 of them (\{011\} and \{001\}, which have size 1 < M) are not valid codewords.

So the effective size of the ball is 5.
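The example can be verified by brute force (illustrative code, not from the lecture):

```python
# Brute-force check of the example above: for L = 3, M = 2,
# c = {001, 011}, the radius-1 ball in the sliced channel has exactly
# 5 elements (flips that merge two strings do not yield M-subsets).
def ball_radius_1(c: frozenset, L: int) -> set:
    M = len(c)
    ball = {c}
    for s in c:
        for i in range(L):
            flipped = s[:i] + ("1" if s[i] == "0" else "0") + s[i + 1:]
            cand = frozenset((c - {s}) | {flipped})
            if len(cand) == M:     # discard collapsed (size < M) sets
                ball.add(cand)
    return ball

ball = ball_radius_1(frozenset({"001", "011"}), 3)   # 5 sets
```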

Tool: (Fourier analysis of) Boolean functions

Introducing hypercube graph:

  • V = \{0,1\}^L.
  • \{x, y\} \in E if and only if d_H(x, y) = 1.
  • What is the size of E?

Consider c \in \binom{\{0,1\}^L}{M} as a characteristic function f_c : \{0,1\}^L \to \{0,1\}.

Let \partial f_c be its boundary:

  • All hypercube edges \{x, y\} such that f_c(x) \neq f_c(y).

Lemma of boundary

Size of the 1-ball \geq |\partial f_c| + 1.

Proof

Every edge on the boundary represents a unique way of flipping one bit in one string in c.

We need to bound |\partial f_c| from below.

Tool: Total influence.

Definition of total influence.

The total influence I(f) of f : \{0,1\}^L \to \{0,1\} is defined as:

I(f) = \sum_{i=1}^L \operatorname{Pr}_{x \in \{0,1\}^L}\left(f(x) \neq f(x^{\oplus i})\right)

where x^{\oplus i} equals x with its i-th bit flipped.

Theorem (edge-isoperimetric inequality, no proof):

I(f) \geq 2\alpha \log \frac{1}{\alpha},

where \alpha = \min\{\text{fraction of 1's}, \text{fraction of 0's}\}.

Notice: Let \partial_i f be the set of i-dimensional edges in \partial f. Then

\begin{aligned} I(f) &= \sum_{i=1}^L \operatorname{Pr}_{x \in \{0,1\}^L}\left(f(x) \neq f(x^{\oplus i})\right) \\ &= \sum_{i=1}^L \frac{|\partial_i f|}{2^{L-1}} \\ &= \frac{|\partial f|}{2^{L-1}}. \end{aligned}
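Both the identity and the isoperimetric inequality can be checked numerically on a small example (illustrative; the chosen subset is ours):

```python
import math
from itertools import product

# Numerical check of the identity I(f) = |∂f| / 2^(L-1) and of the
# edge-isoperimetric inequality I(f) >= 2*alpha*log2(1/alpha),
# for the characteristic function of a small M=2 subset of {0,1}^4.
def total_influence(f, L: int) -> float:
    """I(f) = sum_i Pr_x[f(x) != f(x with bit i flipped)]."""
    flips = 0  # ordered pairs (x, i) with f(x) != f(x^{\oplus i})
    for x in product((0, 1), repeat=L):
        for i in range(L):
            xi = x[:i] + (1 - x[i],) + x[i + 1:]
            if f(x) != f(xi):
                flips += 1
    return flips / 2 ** L       # each probability has denominator 2^L

L = 4
ones = {(0, 0, 0, 1), (0, 0, 1, 1)}           # f_c of an M=2 subset
f = lambda x: int(x in ones)
I = total_influence(f, L)                     # = 0.75
boundary = I * 2 ** (L - 1)                   # |∂f| = 6 edges
alpha = len(ones) / 2 ** L                    # = 1/8
# For this subset the inequality is tight: 2*(1/8)*log2(8) = 0.75.
assert I >= 2 * alpha * math.log2(1 / alpha) - 1e-9
```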

Corollary: Let \epsilon > 0, L \geq \frac{1}{\epsilon}, and M \leq 2^{(1-\epsilon)L}, and let c \in \binom{\{0,1\}^L}{M}. Then,

|\partial f_c| \geq 2 \cdot 2^{L-1} \cdot \frac{M}{2^L} \log \frac{2^L}{M} = M \log \frac{2^L}{M} \geq ML\epsilon

The size of the 1-ball in the sliced channel is therefore \geq ML\epsilon.

This implies that |\mathcal{C}| \leq \frac{\binom{2^L}{M}}{\epsilon ML}.

Corollary:

  • The redundancy in the sliced channel with K = 1 and the above parameters is \geq \log ML - O(1).
  • A simple generalization (not shown) gives O(K \log ML).

Robust indexing

Idea: Start each length-L string with \log M bits for indexing.

Problem 1: Indices subject to noise.

Problem 2: Indexing bits do not carry information \implies higher redundancy.


Idea: Robust indexing

Instead of using 1, \ldots, M for indexing, use x_1, \ldots, x_M such that

  • \{x_1, \ldots, x_M\} has minimum pairwise distance 2K+1 (solves problem 1), with |x_i| = O(\log M).
  • \{x_1, \ldots, x_M\} contains information (solves problem 2).

\{x_1, \ldots, x_M\} depends on the message.

  • Consider the message m = (m_1, m_2).
  • Find an encoding function m_1 \mapsto \{\{x_i\}_{i=1}^M \mid d_H(x_i, x_j) \geq 2K+1\} (coding over codes).
  • Assume e is such a function (not shown).
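The distance requirement on the indices can be illustrated with a greedy sketch (the encoding function e that maps m_1 to such a set is not shown in the lecture, and is not implemented here):

```python
from itertools import product

# Illustrative greedy sketch of robust indexing: pick M index strings
# with pairwise Hamming distance >= 2K+1, so an index survives any K
# substitutions. The actual construction also encodes part of the
# message in the choice of the set ("coding over codes").
def greedy_indices(length: int, M: int, K: int) -> list:
    d_min = 2 * K + 1
    chosen = []
    for bits in product("01", repeat=length):
        cand = "".join(bits)
        if all(sum(a != b for a, b in zip(cand, x)) >= d_min
               for x in chosen):
            chosen.append(cand)
            if len(chosen) == M:
                break
    return chosen

idx = greedy_indices(length=5, M=4, K=1)   # pairwise distance >= 3
```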

Additional reading:

Jin Sima, Netanel Raviv, Moshe Schwartz, and Jehoshua Bruck. “Error Correction for DNA Storage.” arXiv:2310.01729 (2023).

  • Magazine article.
  • Introductory.
  • Broad perspective.

Information Embedding in 3D printing

Motivations:

Threats to public safety.

  • Ghost guns, forging fingerprints, forging keys, fooling facial recognition.

Solution – Information Embedding.

  • Printer ID, user ID, time/location stamp

Existing Information Embedding Techniques

Many techniques exist.

Information embedding using variations in layers.

  • Width, rotation, etc.
  • Magnetic properties.
  • Radiative materials.

Combating Adversarial Noise

Most techniques are rather accurate.

  • I.e., low bit error rate.

Challenge: Adversarial damage after use.

  • Scraping.
  • Deformation.
  • Breaking.

A t-break code

Let m \in \{0,1\}^k \mapsto c \in \{0,1\}^n.

The adversary breaks c at most t times (t is a security parameter).

The decoder receives a multi-set of at most t+1 fragments. Assume the fragments are:

  • Oriented,
  • Unordered,
  • Of any length.
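A minimal model of this channel (illustrative names):

```python
from collections import Counter

# Minimal model of the t-break channel: the adversary cuts the
# codeword at up to t positions; the decoder receives the unordered
# multi-set of oriented fragments.
def break_codeword(c: str, cuts: list) -> Counter:
    """Break c at the given positions; return the fragment multi-set."""
    pieces, prev = [], 0
    for pos in sorted(cuts):
        pieces.append(c[prev:pos])
        prev = pos
    pieces.append(c[prev:])
    return Counter(pieces)      # multi-set: order of fragments is lost

frags = break_codeword("1011001", [2, 5])   # t = 2 breaks, 3 fragments
```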

Lower bound for t-break code

Claim: A t-break code must have \Omega(t \log(n/t)) redundancy.

Lemma: Let \mathcal{C} be a t-break code of length n, and for i \in \{0, 1, \ldots, n\} let \mathcal{C}_i \subseteq \mathcal{C} be the subset of \mathcal{C} containing all codewords of Hamming weight i. Then d_H(\mathcal{C}_i) \geq \lceil \frac{t+1}{2} \rceil.

Proof of Lemma

Let x, y \in \mathcal{C}_i for some i, and let \ell = d_H(x, y).

Write (\circ denotes the concatenation operation):

x = c_1 \circ x_{i_1} \circ c_2 \circ x_{i_2} \circ \ldots \circ c_\ell \circ x_{i_\ell} \circ c_{\ell+1}

y = c_1 \circ y_{i_1} \circ c_2 \circ y_{i_2} \circ \ldots \circ c_\ell \circ y_{i_\ell} \circ c_{\ell+1}

where x_{i_j} \neq y_{i_j} (single differing bits) for all j \in [\ell], and c_1, \ldots, c_{\ell+1} are the common segments.

Break x and y 2\ell times each to produce the multi-sets:

\mathcal{X} = \{c_1, c_2, \ldots, c_{\ell+1}, x_{i_1}, x_{i_2}, \ldots, x_{i_\ell}\}

\mathcal{Y} = \{c_1, c_2, \ldots, c_{\ell+1}, y_{i_1}, y_{i_2}, \ldots, y_{i_\ell}\}

Since w_H(x) = w_H(y) = i, the bits \{x_{i_j}\}_{j=1}^\ell contain the same number of 1's as \{y_{i_j}\}_{j=1}^\ell, and therefore \mathcal{X} = \mathcal{Y}. Hence, if 2\ell \leq t, the adversary can make x and y indistinguishable, so \ell \geq \lceil \frac{t+1}{2} \rceil.
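The construction in the proof can be checked on a small example (illustrative):

```python
from collections import Counter

# Small check of the construction above: two strings of equal weight
# at Hamming distance l, broken around every differing position
# (at most 2l cuts), yield identical multi-sets of fragments.
def fragments_around_diffs(x: str, y: str) -> Counter:
    """Break x at both sides of every position where x and y differ."""
    diffs = [i for i, (a, b) in enumerate(zip(x, y)) if a != b]
    cuts = sorted({i for d in diffs for i in (d, d + 1)})
    pieces, prev = [], 0
    for pos in cuts:
        pieces.append(x[prev:pos])
        prev = pos
    pieces.append(x[prev:])
    return Counter(p for p in pieces if p)   # drop empty fragments

x, y = "01100", "01010"   # both weight 2, d_H(x, y) = 2
same = fragments_around_diffs(x, y) == fragments_around_diffs(y, x)
```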

Proof of Claim

Let j \in \{0, 1, \ldots, n\} be such that \mathcal{C}_j is the largest among \mathcal{C}_0, \mathcal{C}_1, \ldots, \mathcal{C}_n.

\log |\mathcal{C}| = \log\left(\sum_{i=0}^n |\mathcal{C}_i|\right) \leq \log\left((n+1)|\mathcal{C}_j|\right) = \log(n+1) + \log |\mathcal{C}_j|

By the Lemma and the ordinary sphere-packing bound, for t' = \left\lfloor \frac{\lceil (t+1)/2 \rceil - 1}{2} \right\rfloor \approx \frac{t}{4},

|\mathcal{C}_j| \leq \frac{2^n}{\sum_{i=0}^{t'} \binom{n}{i}}

This implies that n - \log |\mathcal{C}| \geq n - \log(n+1) - \log |\mathcal{C}_j| \geq \ldots \geq \Omega(t \log(n/t)).

Corollary: In the relevant regime t = O(n^{1-\epsilon}), the redundancy is \Omega(t \log n).
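The parameter t' from the proof can be computed directly (a small arithmetic check):

```python
import math

# The radius t' = floor((ceil((t+1)/2) - 1)/2) used in the
# sphere-packing step above, which works out to roughly t/4.
def t_prime(t: int) -> int:
    return (math.ceil((t + 1) / 2) - 1) // 2

# e.g. t = 8 gives ceil(9/2) = 5 and t' = (5-1)//2 = 2.
```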

t-break codes: Main ideas.

Encoding:

  • Need multiple markers across the codeword.
  • Construct an adjacency matrix A of the markers to record their order.
  • Append RS_{2t}(A) to the codeword (as in the sliced channel).

Decoding (from t+1 fragments):

  • Locate all surviving markers, and locate the received RS_{2t}(A)'.
  • Build an approximate adjacency matrix A' from the surviving markers (d_H(A, A') \leq 2t).
  • Correct (A', RS_{2t}(A)') \mapsto (A, RS_{2t}(A)).
  • Order the fragments correctly using A.

Tools:

  • Random encoding (to have many markers).
  • Mutually uncorrelated codes (so that markers will not overlap).

Tool: Mutually uncorrelated codes.

  • Want: Markers that do not overlap.
  • Solution: Take markers from a mutually uncorrelated code (an existing notion).
    • A code \mathcal{M} is called mutually uncorrelated if no proper suffix of any m_i \in \mathcal{M} is a prefix of another m_j \in \mathcal{M} (including i = j).
  • Many constructions exist.

Theorem: For any integer \ell there exists a mutually uncorrelated code \mathcal{C}_{MU} of length \ell and size |\mathcal{C}_{MU}| \geq \frac{2^\ell}{32\ell}.
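The defining property can be checked mechanically (an illustrative checker for equal-length codes; the example codes are ours):

```python
# Illustrative checker for mutual uncorrelatedness of an equal-length
# code: no proper prefix of one codeword may equal a suffix of
# another, including a codeword checked against itself.
def is_mutually_uncorrelated(code) -> bool:
    for u in code:
        for v in code:
            for k in range(1, len(u)):   # proper prefixes only
                if u[:k] == v[-k:]:
                    return False
    return True

# {11000, 11010} is mutually uncorrelated; {0110, 1001} is not,
# since the prefix 01 of 0110 is a suffix of 1001.
ok = is_mutually_uncorrelated({"11000", "11010"})
```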

Tool: Random encoding.

  • Want: Codewords with many markers from \mathcal{C}_{MU}, that are not too far apart.
  • Problem: Hard to achieve explicitly.
  • Workaround: Show that a uniformly random string has this property.

Random encoding:

  • Choose the message at random.
  • Suitable for embedding, say, printer ID.
  • Not suitable for dynamic information.

Let m > 0 be a parameter.

Fix a mutually uncorrelated code \mathcal{C}_{MU} of length \Theta(\log m).

Fix m_1, \ldots, m_t from \mathcal{C}_{MU} as “special” markers.

Claim: With probability 1 - \frac{1}{\mathrm{poly}(m)}, a uniformly random string z \in \{0,1\}^m satisfies:

  • Every O(\log^2 m) bits contain a marker from \mathcal{C}_{MU}.
  • Every two non-overlapping substrings of length c \log m are distinct.
  • z does not contain any of the special markers m_1, \ldots, m_t.

Proof idea:

  • Short substrings are abundant.
  • Long substrings are rare.
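A quick empirical illustration of “long substrings are rare” (made-up parameters, fixed seed):

```python
import random

# Empirical illustration: in a uniformly random m-bit string, all
# windows of length ~4*log2(m) bits are pairwise distinct with high
# probability (union bound: about m^2 * 2^-w collision probability).
random.seed(0)                       # fixed seed, illustrative run
m, w = 1 << 10, 40                   # m = 1024 bits, window length 40
z = "".join(random.choice("01") for _ in range(m))
windows = [z[i:i + w] for i in range(m - w + 1)]
all_distinct = len(set(windows)) == len(windows)
```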

Sketch of encoding for t-break codes.

Repeatedly sample z \in \{0,1\}^m until it is “good.”

Find all markers m_{i_1}, \ldots, m_{i_r} in it.

Build a |\mathcal{C}_{MU}| \times |\mathcal{C}_{MU}| matrix A which records order and distances:

  • A_{i,j} = 0 if m_i, m_j are not adjacent.
  • Otherwise, A_{i,j} is the distance between them (in bits).

Append RS_{2t}(A) at the end, and use the special markers m_1, \ldots, m_t.
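The adjacency-matrix step can be sketched as follows (illustrative; marker positions are made up):

```python
# Illustrative sketch of building the matrix A from marker occurrences
# found in z. Each occurrence is (marker_index, bit_position); entry
# A[i][j] stores the gap (in bits) from marker i to the marker j that
# directly follows it, and 0 means "not adjacent".
def marker_adjacency(occurrences, num_markers: int):
    A = [[0] * num_markers for _ in range(num_markers)]
    occ = sorted(occurrences, key=lambda p: p[1])
    for (i, p), (j, q) in zip(occ, occ[1:]):
        A[i][j] = q - p          # distance in bits to the next marker
    return A

# Markers 0, 2, 1 found at bit positions 4, 19, 40 (made-up numbers).
A = marker_adjacency([(0, 4), (2, 19), (1, 40)], num_markers=3)
```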

Sketch of decoding for t-break codes.

Construct a partial adjacency matrix A' from the fragments.
