CSE5313 Coding and information theory for data science (Lecture 20)

Review for Private Information Retrieval

PIR from replicated databases

For 2 replicated databases, we have the following protocol:

User has $i \sim U_{[m]}$ .
User chooses $r_1, r_2 \sim U_{\mathbb{F}_2^m}$ .
Two queries to each server:
- $q_{1, 1} = r_1 + e_i$ , $q_{1, 2} = r_2$ .
- $q_{2, 1} = r_1$ , $q_{2, 2} = r_2 + e_i$ .
Server $j$ responds with $q_{j, 1} c_j^\top$ and $q_{j, 2} c_j^\top$ .
Decoding?
- $q_{1, 1} c_1^\top + q_{2, 1} c_2^\top = r_1 (c_1 + c_2)^\top + e_i c_1^\top = r_1 \cdot 0^\top + x_{i, 1} = x_{i, 1}$ .
- $q_{1, 2} c_1^\top + q_{2, 2} c_2^\top = r_2 (c_1 + c_2)^\top + e_i c_2^\top = r_2 \cdot 0^\top + x_{i, 2} = x_{i, 2}$ .

PIR-rate is $\frac{k}{2k} = \frac{1}{2}$ .
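
The replicated scheme can be sketched in a few lines of Python. This is a simplified illustration, not the exact two-query protocol above: it assumes each file is a row of $k$ bits, both servers hold the same $m \times k$ matrix, and a single random vector $r$ is used. The names `retrieve` and `server_answer` and the sample database are hypothetical. The user downloads $2k$ bits to learn $k$ bits, which matches the rate $\frac{1}{2}$.

```python
import secrets

def unit_vector(m, i):
    """Return e_i in F_2^m (0-indexed)."""
    return [1 if t == i else 0 for t in range(m)]

def add_f2(u, v):
    """Add two vectors over F_2 (bitwise XOR)."""
    return [a ^ b for a, b in zip(u, v)]

def server_answer(query, database):
    """Compute query * database over F_2: XOR of the rows selected by the query."""
    answer = [0] * len(database[0])
    for bit, row in zip(query, database):
        if bit:
            answer = add_f2(answer, row)
    return answer

def retrieve(database, i):
    """Privately fetch file i from two servers that both hold `database`."""
    m = len(database)
    r = [secrets.randbelow(2) for _ in range(m)]   # uniform r in F_2^m
    q1 = add_f2(r, unit_vector(m, i))              # query to server 1: r + e_i
    q2 = r                                         # query to server 2: r
    a1 = server_answer(q1, database)
    a2 = server_answer(q2, database)
    return add_f2(a1, a2)                          # (r + e_i) X + r X = e_i X = x_i

# Hypothetical database: m = 4 files, k = 3 bits each.
database = [[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]]
print(retrieve(database, 2))
```

Each server sees a uniformly random vector on its own, so a single server learns nothing about $i$.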

PIR from coded parity-check databases

For 3 coded parity-check databases, we have the following protocol:

User has $i \sim U_{[m]}$ .
User chooses $r_1, r_2, r_3 \sim U_{\mathbb{F}_2^m}$ .
Three queries to each server:
- $q_{1, 1} = r_1 + e_i$ , $q_{1, 2} = r_2$ , $q_{1, 3} = r_3$ .
- $q_{2, 1} = r_1$ , $q_{2, 2} = r_2 + e_i$ , $q_{2, 3} = r_3$ .
- $q_{3, 1} = r_1$ , $q_{3, 2} = r_2$ , $q_{3, 3} = r_3 + e_i$ .
Server $j$ responds with $q_{j, 1} c_j^\top, q_{j, 2} c_j^\top, q_{j, 3} c_j^\top$ .
Decoding?
- $q_{1, 1} c_1^\top + q_{2, 1} c_2^\top + q_{3, 1} c_3^\top = r_1 (c_1 + c_2 + c_3)^\top + e_i c_1^\top = r_1 \cdot 0^\top + x_{i, 1} = x_{i, 1}$ .
- $q_{1, 2} c_1^\top + q_{2, 2} c_2^\top + q_{3, 2} c_3^\top = r_2 (c_1 + c_2 + c_3)^\top + e_i c_2^\top = x_{i, 2}$ .
- $q_{1, 3} c_1^\top + q_{2, 3} c_2^\top + q_{3, 3} c_3^\top = r_3 (c_1 + c_2 + c_3)^\top + e_i c_3^\top = x_{i, 3}$ .

PIR-rate is $\frac{k}{3k} = \frac{1}{3}$ .
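
A small simulation of the three-server protocol may help. It assumes (this is not stated explicitly above) that each file has two bits $x_{j,1}, x_{j,2}$, that servers 1 and 2 store the first and second bits, and that server 3 stores the parity $c_3 = c_1 + c_2$, so that $c_1 + c_2 + c_3 = 0$ over $\mathbb{F}_2$. Under that assumption the third decoded symbol is the parity symbol of file $i$. The function name `parity_check_pir` and the sample files are hypothetical.

```python
import secrets

def dot_f2(u, v):
    """Inner product of two bit vectors over F_2."""
    return sum(a & b for a, b in zip(u, v)) % 2

def parity_check_pir(files, i):
    """Run the 3-server protocol for file i; return the three decoded symbols."""
    m = len(files)
    c1 = [f[0] for f in files]                    # server 1: first bit of every file
    c2 = [f[1] for f in files]                    # server 2: second bit of every file
    c3 = [a ^ b for a, b in zip(c1, c2)]          # server 3: parity (assumed layout)
    servers = [c1, c2, c3]
    r = [[secrets.randbelow(2) for _ in range(m)] for _ in range(3)]  # r_1, r_2, r_3
    decoded = []
    for t in range(3):                            # t-th query round
        total = 0
        for j, c in enumerate(servers):
            q = list(r[t])
            if j == t:
                q[i] ^= 1                         # q_{j,t} = r_t + e_i when j = t
            total ^= dot_f2(q, c)                 # add server j's response
        decoded.append(total)                     # r_t (c1+c2+c3)^T + e_i c_t^T
    return decoded

# Hypothetical database of m = 4 two-bit files.
files = [(1, 0), (0, 1), (1, 1), (0, 0)]
print(parity_check_pir(files, 0))
```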

Beyond z=1

Star-product scheme

Given $x=(x_1, \ldots, x_n)$ and $y=(y_1, \ldots, y_n)$ over $\mathbb{F}_q$ , the star-product is defined as:

x \star y = (x_1 y_1, \ldots, x_n y_n)

Given two linear codes, $C,D\subseteq \mathbb{F}_q^n$ , the star-product code is defined as:

C \star D = \operatorname{span}_{\mathbb{F}_q} \{x \star y \mid x \in C, y \in D\}

Singleton bound for star-product:

d_{C \star D} \leq n-\dim C-\dim D+2
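
As a sanity check of these definitions, the sketch below builds two small Reed-Solomon codes over a prime field, spans their star-product code, and compares its minimum distance with the bound. The field size $7$, the evaluation points, and all function names are illustrative assumptions. The distance is computed as $n - \dim + 1$, which is valid here only because the star product of two Reed-Solomon codes on the same evaluation points is again a Reed-Solomon (hence MDS) code.

```python
P = 7  # assumed prime field size, so arithmetic is plain integers mod P

def rs_generator(dim, points, p=P):
    """Generator matrix of a Reed-Solomon code: row e holds a^e for each point a."""
    return [[pow(a, e, p) for a in points] for e in range(dim)]

def star(x, y, p=P):
    """Coordinate-wise (star) product of two vectors over F_p."""
    return [(a * b) % p for a, b in zip(x, y)]

def rank_mod_p(rows, p=P):
    """Rank of a matrix over F_p via Gaussian elimination."""
    rows = [list(r) for r in rows]
    rank = 0
    for col in range(len(rows[0])):
        pivot = next((r for r in range(rank, len(rows)) if rows[r][col] % p), None)
        if pivot is None:
            continue
        rows[rank], rows[pivot] = rows[pivot], rows[rank]
        inv = pow(rows[rank][col], -1, p)          # modular inverse (Python 3.8+)
        rows[rank] = [(v * inv) % p for v in rows[rank]]
        for r in range(len(rows)):
            if r != rank and rows[r][col] % p:
                f = rows[r][col]
                rows[r] = [(a - f * b) % p for a, b in zip(rows[r], rows[rank])]
        rank += 1
    return rank

points = [1, 2, 3, 4, 5, 6]                        # n = 6 distinct evaluation points
n, k, z = len(points), 2, 2
G_C, G_D = rs_generator(k, points), rs_generator(z, points)
# Products of basis vectors span C * D.
spanning = [star(x, y) for x in G_C for y in G_D]
dim_star = rank_mod_p(spanning)
d_star = n - dim_star + 1                          # MDS distance of the RS code C * D
print(dim_star, d_star, n - k - z + 2)
```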

PIR from a database coded with any MDS code and z>1

To generalize the previous scheme to $z > 1$ , we need to encode multiple $r$'s together, as in the ramp scheme.

Recall from the ramp scheme, we use $r_1, \ldots, r_z \sim U_{\mathbb{F}_q^k}$ as our key vector to protect against collusion of the servers.

In the star-product scheme:

Files are coded with an MDS code $C$ .
The multiple $r$'s are coded with an MDS code $D$ .
The scheme is based on the minimum distance of $C \star D$ .

To code the data:

Let $C \subseteq \mathbb{F}_q^n$ be an MDS code of dimension $k$ .
For all $j \in [m]$ , encode file $x_j = (x_{j, 1}, \ldots, x_{j, k})$ using $G_C$ :

\begin{pmatrix} x_{1, 1} & x_{1, 2} & \cdots & x_{1, k}\\ x_{2, 1} & x_{2, 2} & \cdots & x_{2, k}\\ \vdots & \vdots & \ddots & \vdots\\ x_{m, 1} & x_{m, 2} & \cdots & x_{m, k} \end{pmatrix} \cdot G_C = \begin{pmatrix} c_{1, 1} & c_{1, 2} & \cdots & c_{1, n}\\ c_{2, 1} & c_{2, 2} & \cdots & c_{2, n}\\ \vdots & \vdots & \ddots & \vdots\\ c_{m, 1} & c_{m, 2} & \cdots & c_{m, n} \end{pmatrix}

For all $j \in [n]$ , store $c_j = (c_{1, j}, c_{2, j}, \ldots, c_{m, j})$ (a column of the above matrix) in server $j$ .

Let $r_1, \ldots, r_z \sim U_{\mathbb{F}_q^m}$ .

To code the queries:

Let $D \subseteq \mathbb{F}_q^n$ be an MDS code of dimension $z$ .
Encode the $r_j$'s using $G_D=[g_1^\top, \ldots, g_n^\top]$ .

(r_1^\top, \ldots, r_z^\top) \cdot G_D = \begin{pmatrix} r_{1, 1} & r_{2, 1} & \cdots & r_{z, 1}\\ r_{1, 2} & r_{2, 2} & \cdots & r_{z, 2}\\ \vdots & \vdots & \ddots & \vdots\\ r_{1, m} & r_{2, m} & \cdots & r_{z, m} \end{pmatrix} \cdot G_D=\left((r_1^\top,\ldots, r_z^\top)g_1^\top,\ldots, (r_1^\top,\ldots, r_z^\top)g_n^\top \right)

To introduce the “errors in known locations” to the encoded $r_j$'s:

Let $W \in \{0, 1\}^{m \times n}$ with some $d_{C \star D} - 1$ entries in its $i$ -th row equal to 1.
These are the entries we will retrieve.

For every server $j \in [n]$ send $q_j = (r_1^\top, \ldots, r_z^\top) g_j^\top + w_j$ , where $w_j$ is the $j$ -th column of $W$ .

This is similar to the ramp scheme, where $w_j$ is the “message”.
Privacy against collusion of $z$ servers.

Response from server: $a_j = q_j c_j^\top$ .

Decoding? Let $Q \in \mathbb{F}_q^{m \times n}$ be a matrix whose columns are the $q_j$'s.

Q = \begin{pmatrix} r_1^\top & \cdots & r_z^\top \end{pmatrix} \cdot G_D + W

The user has

\begin{aligned} (q_1 c_1^\top, \ldots, q_n c_n^\top) &= \left(\sum_{j \in [m]} q_{1, j} c_{j, 1}, \ldots, \sum_{j \in [m]} q_{n, j} c_{j, n}\right) \\ &=\sum_{j \in [m]} (q_{1,j}c_{j, 1}, \ldots, q_{n,j}c_{j, n}) \\ &=\sum_{j \in [m]} q^j \star c^j \end{aligned}

where $q^j$ is a row of $Q$ and $c^j$ is a codeword in $C$ (an $[n, k]_q$ MDS code).

We have:

$Q=(r_1^\top, \ldots, r_z^\top) \cdot G_D + W$
$W\in \{0, 1\}^{m \times n}$ with some $d_{C \star D} - 1$ entries in its $i$ -th row equal to 1.
The user has $\sum_{j \in [m]} q^j \star c^j$ .
Each $q^j$ is a row of $Q$ :
- For $j \neq i$ , $q^j = d^j$ is a codeword in $D$ .
- $q^i = d^i + w^i$ , where $d^i \in D$ and $w^i$ is the $i$ -th row of $W$ .
Therefore:

\begin{aligned} \sum_{j \in [m]} q^j \star c^j &= \sum_{j \neq i} (d^j \star c^j) + ((d^i + w^i) \star c^i) \\ &= \sum_{j \in [m]} (d^j \star c^j) + w^i \star c^i \\ &= (\text{codeword in } C \star D )+( \text{noise of Hamming weight } \leq d_{C \star D} - 1) \end{aligned}

Multiply by $H_{C \star D}$ and get $d_{C \star D} - 1$ elements of $c^i$ .

Recall that $c^i = x_i \cdot G_C$
Repeat $\frac{k}{d_{C \star D} - 1}$ times to obtain $k$ elements of $c^i$ .
- Suffices to obtain $x_i$ , since $C$ is an $[n, k]_q$ MDS code.
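
One round of the scheme can be simulated end to end. The sketch below is an illustration under several assumptions that are not part of the notes: a prime field $\mathbb{F}_{11}$, Reed-Solomon codes for both $C$ and $D$ on the evaluation points $1, \ldots, 8$, and hypothetical names such as `star_product_round`. Instead of multiplying by $H_{C \star D}$, it removes the codeword of $C \star D$ by Lagrange interpolation from the positions where $W$ is zero, which is an equivalent way to correct errors in known locations for a Reed-Solomon code.

```python
import random

P = 11                                   # assumed prime field size
POINTS = [1, 2, 3, 4, 5, 6, 7, 8]        # n = 8 evaluation points (one per server)

def rs_generator(dim):
    """Reed-Solomon generator matrix: row e holds a^e for each evaluation point a."""
    return [[pow(a, e, P) for a in POINTS] for e in range(dim)]

def mat_mul(A, B):
    """Matrix product over F_P."""
    return [[sum(a * b for a, b in zip(row, col)) % P for col in zip(*B)] for row in A]

def interpolate_at(xs, ys, target):
    """Evaluate at `target` the lowest-degree polynomial through (xs, ys) mod P."""
    total = 0
    for s, (x_s, y_s) in enumerate(zip(xs, ys)):
        num, den = 1, 1
        for t, x_t in enumerate(xs):
            if t != s:
                num = num * (target - x_t) % P
                den = den * (x_s - x_t) % P
        total = (total + y_s * num * pow(den, -1, P)) % P
    return total

def star_product_round(files, i, z, retrieve, rng):
    """Return ({j: c_{i,j} for j in retrieve}, C) using one query per server."""
    m, k, n = len(files), len(files[0]), len(POINTS)
    C = mat_mul(files, rs_generator(k))            # m x n; column j is stored on server j
    R = [[rng.randrange(P) for _ in range(z)] for _ in range(m)]  # columns r_1..r_z
    Q = mat_mul(R, rs_generator(z))                # every row is a codeword of D
    for j in retrieve:
        Q[i][j] = (Q[i][j] + 1) % P                # add W: ones in row i at `retrieve`
    answers = [sum(Q[l][j] * C[l][j] for l in range(m)) % P for j in range(n)]
    clean = [j for j in range(n) if j not in retrieve]
    assert len(clean) >= k + z - 1                 # enough positions to fix the codeword
    xs, ys = [POINTS[j] for j in clean], [answers[j] for j in clean]
    # answers = (codeword of C * D) + w^i * c^i, so subtract the interpolated codeword.
    return {j: (answers[j] - interpolate_at(xs, ys, POINTS[j])) % P for j in retrieve}, C

rng = random.Random(0)
files = [[rng.randrange(P) for _ in range(3)] for _ in range(4)]   # m = 4, k = 3
retrieve = [0, 1, 2, 3]                            # d_{C*D} - 1 = 8 - 3 - 2 + 1 = 4 positions
recovered, C = star_product_round(files, i=2, z=2, retrieve=retrieve, rng=rng)
print(recovered)
```

With $n = 8$, $k = 3$, $z = 2$ the round returns $4$ coded symbols of file $i$ from $8$ downloaded symbols.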

PIR-rate:

$\frac{k}{\# \text{ downloaded elements}} = \frac{k}{\frac{k}{d_{C \star D} - 1} \cdot n} = \frac{d_{C \star D} - 1}{n}$ .
Singleton bound for star-product: $d_{C \star D} \leq n - \dim C - \dim D + 2$ .
Achieved with equality if $C$ and $D$ are Reed-Solomon codes.
PIR-rate = $\frac{n - \dim C - \dim D + 1}{n} = \frac{n - k - z + 1}{n}$ .
Intuition:
- “paying” $k$ for “reconstruction from any $k$ ”.
- “paying” $z$ for “protection against colluding sets of size $z$ ”.
Capacity unknown! (as of 2022).
- Known for special cases, e.g., $k = 1, z = 1$ , certain types of schemes, etc.
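
The rate formula is easy to tabulate. The helper below is only a convenience sketch with hypothetical names; it assumes $C$ and $D$ are Reed-Solomon codes, so that the Singleton bound for the star-product is met with equality.

```python
from fractions import Fraction

def star_product_distance(n, k, z):
    """d_{C*D} when the star-product Singleton bound holds with equality."""
    return n - k - z + 2

def pir_rate(n, k, z):
    """PIR rate (d_{C*D} - 1) / n = (n - k - z + 1) / n."""
    return Fraction(star_product_distance(n, k, z) - 1, n)

# Each extra unit of k or z costs exactly 1/n of rate.
for k, z in [(1, 1), (3, 2), (4, 3)]:
    print(k, z, pir_rate(8, k, z))
```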

PIR over graphs

Graph-based replication:

Every file is replicated twice on two separate servers.
Every two servers have at most one file in common.
“file” = “granularity” of data, i.e., the smallest information unit shared by any two servers.

A server that stores $(x_{i, j})_{j=1}^d$ receives $(q_{i, j})_{j=1}^d$ , and replies with $\sum_{j=1}^d q_{i, j} \cdot x_{i, j}$ .

The idea:

Consider a 2-server replicated PIR and “split” the queries between the servers.
Sum the responses, unwanted files “cancel out”, while $x_i$ does not.

Problem: Collusion.

Solution: Add per server randomness.

Works for any graph, and any $q \geq 3$ (for simplicity assume $2 \mid q$ ).

The protocol:

Choose random $\gamma \in \mathbb{F}_q^n$ , $\nu \in \mathbb{F}_q^m$ , and $h \in \mathbb{F}_q \setminus \{0, 1\}$ .
Queries:
- If node $j$ is incident with edge $\ell$ , send $q_{j, \ell} = \gamma_j \cdot \nu_\ell$ to node $j$ .
- I.e., if server $j$ stores file $\ell$ .
Except one node $j_0$ that stores $x_i$ , which gets $q_{j_0, i} = h \cdot \gamma_{j_0} \cdot \nu_i$ .
Server $j$ responds with $a_j = \sum_{\ell = 1}^{d} q_{j, \ell} \cdot x_{j, \ell}$ .
- Where $x_{j, 1}, \ldots, x_{j, d}$ are the files adjacent to it.

Example

Consider the following graph.
$n = 5$ , $m = 7$ , and $i = 3$ .
$q_3 = \gamma_3 \cdot (v_2, v_3, v_6)$ and $a_3 = x_2 \cdot \gamma_3 v_2 + x_3 \cdot \gamma_3 v_3 + x_6 \cdot \gamma_3 v_6$ .
$q_2 = \gamma_2 \cdot (v_1, h v_3, v_4)$ and $a_2 = x_1 \cdot \gamma_2 v_1 + x_3 \cdot h \gamma_2 v_3 + x_4 \cdot \gamma_2 v_4$ .

Example of PIR over graphs

Correctness:

$\sum_{j=1}^5 \gamma_j^{-1} a_j =( h + 1 )v_3 x_3$
$h \neq 1$ (so $h + 1 \neq 0$ in characteristic 2) and $v_3 \neq 0 \implies$ the user can solve for $x_3$ .
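
The protocol and its correctness check can be run on a concrete instance. Everything specific in the sketch below is an assumption for illustration: the field is $\mathbb{F}_4$ (the smallest field with $2 \mid q$ and $q \geq 3$ ), each file is a single field symbol, $\gamma$ and $v$ are drawn with nonzero entries so that the inverses exist, and the graph `FILE_SERVERS` is a hypothetical 5-node, 7-edge graph chosen to agree with the two servers described in the example (server 3 stores files 2, 3, 6 and server 2 stores files 1, 3, 4).

```python
import random

# GF(4) = {0, 1, 2, 3}: addition is XOR, multiplication is given by this table.
MUL = [[0, 0, 0, 0], [0, 1, 2, 3], [0, 2, 3, 1], [0, 3, 1, 2]]
INV = {1: 1, 2: 3, 3: 2}

# Hypothetical graph: file (edge) -> the two servers (nodes) that store it.
FILE_SERVERS = {1: (1, 2), 2: (1, 3), 3: (2, 3), 4: (2, 5), 5: (4, 5), 6: (3, 4), 7: (1, 5)}

def graph_pir(files, i, rng):
    """Retrieve files[i]; `files` maps file index -> symbol in GF(4)."""
    gamma = {j: rng.randrange(1, 4) for j in range(1, 6)}   # nonzero per-server mask
    v = {l: rng.randrange(1, 4) for l in FILE_SERVERS}      # nonzero per-file mask
    h = rng.choice([2, 3])                                  # h not in {0, 1}
    j0 = FILE_SERVERS[i][0]                                 # the node that gets the h factor
    answers = {j: 0 for j in gamma}
    for l, servers in FILE_SERVERS.items():
        for j in servers:
            q = MUL[gamma[j]][v[l]]                         # q_{j,l} = gamma_j * v_l
            if l == i and j == j0:
                q = MUL[h][q]                               # q_{j0,i} = h * gamma_j0 * v_i
            answers[j] ^= MUL[q][files[l]]                  # server j accumulates a_j
    total = 0
    for j, a in answers.items():
        total ^= MUL[INV[gamma[j]]][a]                      # sum_j gamma_j^{-1} a_j
    scale = MUL[h ^ 1][v[i]]                                # (h + 1) * v_i, nonzero
    return MUL[INV[scale]][total]

files = {1: 3, 2: 0, 3: 2, 4: 1, 5: 1, 6: 3, 7: 2}          # hypothetical file symbols
print(graph_pir(files, 3, random.Random(1)))
```

Every file other than $x_i$ appears twice in the sum with the same coefficient $v_\ell$ and cancels in characteristic 2, which leaves $(h + 1) v_i x_i$ .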

Parameters:

Storage overhead 2 (for any graph).
Download $n \cdot k$ .
PIR rate 1/n.

Collusion resistance:

1-privacy: Each node sees an entirely random vector.

2-privacy:

If no edge – as for 1-privacy.
If edge exists – E.g.,
- $\gamma_3 v_6$ and $\gamma_4 v_6$ are independent.
- $\gamma_3 v_3$ and $h \cdot \gamma_2 v_3$ are independent.

S-privacy:

Let $S \subseteq [n]$ (e.g., $S = \{2,3,5\}$ ), and consider the query matrix of their mutual files:

Q_S = \operatorname{diag}(\gamma_3, \gamma_2, \gamma_5) \begin{pmatrix} 1 &\\ h & 1 \\ & 1\end{pmatrix} \operatorname{diag}(v_3, v_4)

It can be shown that $\Pr(Q_S)=\frac{1}{(q-1)^4}$ , regardless of $i \implies$ perfect privacy.
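
For a small field the claim can be checked by brute force. The sketch below enumerates every choice of the randomness seen by $S = \{2, 3, 5\}$ over $\mathbb{F}_4$ and counts how often each joint view occurs. It assumes nonzero $\gamma$ and $v$ entries, and that server 2 is the node receiving the factor $h$ when $i = 3$ or $i = 4$ (the latter is a hypothetical choice); `query_view` and `view_distribution` are illustrative names.

```python
from collections import Counter
from itertools import product

# GF(4) multiplication table (addition is XOR, not needed here).
MUL = [[0, 0, 0, 0], [0, 1, 2, 3], [0, 2, 3, 1], [0, 3, 1, 2]]

def query_view(i, g2, g3, g5, v3, v4, h):
    """Nonzero entries of Q_S for S = {2, 3, 5} on mutual files 3 and 4."""
    a = MUL[g3][v3]            # server 3, file 3
    b = MUL[g2][v3]            # server 2, file 3
    c = MUL[g2][v4]            # server 2, file 4
    d = MUL[g5][v4]            # server 5, file 4
    if i == 3:
        b = MUL[h][b]          # desired file 3: server 2 gets the extra factor h
    if i == 4:
        c = MUL[h][c]          # desired file 4: assume server 2 gets the factor h
    return (a, b, c, d)

def view_distribution(i):
    """Count each joint view over all nonzero gamma, v entries and h in {2, 3}."""
    nonzero = [1, 2, 3]
    return Counter(
        query_view(i, *vals, h)
        for vals in product(nonzero, repeat=5)
        for h in (2, 3)
    )

for i in (1, 3, 4):
    dist = view_distribution(i)
    print(i, len(dist), set(dist.values()))
```

Under these assumptions each of the $(q - 1)^4 = 81$ possible views occurs equally often for every tested $i$ , so the colluding set cannot distinguish the desired file.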