CSE5313 Coding and information theory for data science (Lecture 19)

Private information retrieval

Problem setup

Premise:

Database $X = \{x_1, \ldots, x_m\}$ , each $x_i \in \mathbb{F}_q^k$ is a “file” (e.g., medical record).
$X$ is coded $X \mapsto \{y_1, \ldots, y_n\}$ , $y_j$ stored at server $j$ .
The user (physician) wants $x_i$ .
The user sends a query $q_j \sim Q_j$ to server $j$ .
Server $j$ responds with $a_j \sim A_j$ .

Decodability:

The user can retrieve the file: $H(X_i | A_1, \ldots, A_n) = 0$ .

Privacy:

$i$ is seen as $i \sim U = U_{m}$ , reflecting server’s lack of knowledge.
$i$ must be kept private: $I(Q_j; U) = 0$ for all $j \in [n]$ .

In short, we want to retrieve $x_i$ from the servers without revealing $i$ to the servers.

Private information retrieval from Replicated Databases

Simple case, one server

Say $n = 1, y_1 = X$ .

All data is stored in one server.
Simple solution:
$q_1 =$ “send everything”.
$a_1 = y_1 = X$ .

Theorem: Information Theoretic PIR with $n = 1$ can only be achieved by downloading the entire database.

Can we do better if $n > 1$ ?

Collusion parameter

Key question for $n > 1$ : Can servers collude?

I.e., does server $j$ see any $Q_\ell$ , $\ell \neq j$ ?
Key assumption:
- Privacy parameter $z$ .
- At most $z$ servers can collude.
- $z = 1\implies$ No collusion.
Requirement for $z = 1$ : $I(Q_j; U) = 0$ for all $j \in [n]$ .
Requirement for a general $z$ :
- $I(Q_\mathcal{T}; U) = 0$ for all $\mathcal{T} \subseteq [n]$ , $|\mathcal{T}| \leq z$ , where $Q_\mathcal{T} = \{Q_\ell\}_{\ell \in \mathcal{T}}$ .
Motivation:
- Interception of communication links.
- Data breaches.

Other assumptions:

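Computational private information retrieval: privacy holds even if all the servers are hacked, provided they are computationally bounded (learning $i$ would require solving a computationally hard problem).
- The mutual information between the queries and $U$ may be non-zero.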

Private information retrieval from 2-replicated databases

First PIR protocol: Chor et al. FOCS ‘95.

The data $X = \{x_1, \ldots, x_m\}$ is replicated on two servers.
- $z = 1$ , i.e., no collusion.
Protocol: User has $i \sim U_{m}$ .
- User generates $r \sim U_{\mathbb{F}_q^m}$ .
- $q_1 = r, q_2 = r + e_i$ ( $e_i \in \mathbb{F}_q^m$ is the $i$ -th unit vector; $q_2$ is a one-time pad encryption of $e_i$ with key $r$ ).
- $a_j = q_j X^\top = \sum_{\ell \in [m]} q_{j, \ell} x_\ell$
- Linear combination of the files according to the query vector $q_j$ .
Decoding?
- $a_2 - a_1 = (q_2 - q_1) X^\top = e_i X^\top = x_i$ .
Download?
- $|a_j| =$ size of file $\implies$ downloading twice the size of the file.
Privacy?
- Since $z = 1$ , need to show $I(U; Q_j) = 0$ for $j \in \{1, 2\}$ .
  - $I(U; Q_1) = I(e_U; R) = 0$ since $U$ and $R$ are independent ( $R$ is the random variable of $r$ ).
  - $I(U; Q_2) = I(e_U; R + e_U) = 0$ since this is a one-time pad!
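
The two-server protocol above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes a prime field size (here $q = 7$, so integers mod $q$ form a field), uses 0-based file indices, and the function names `make_queries`, `answer`, and `decode` are hypothetical.

```python
import secrets

Q = 7  # assumed prime field size, so arithmetic mod Q is field arithmetic

def make_queries(i, m, q=Q):
    """User side: q_1 = r and q_2 = r + e_i for a uniform r in F_q^m (i is 0-based)."""
    r = [secrets.randbelow(q) for _ in range(m)]
    q2 = [(r[l] + (1 if l == i else 0)) % q for l in range(m)]
    return r, q2

def answer(query, X, q=Q):
    """Server side: a_j = q_j X^T, a linear combination of the files."""
    k = len(X[0])
    return [sum(query[l] * X[l][s] for l in range(len(X))) % q for s in range(k)]

def decode(a1, a2, q=Q):
    """User side: a_2 - a_1 = e_i X^T = x_i."""
    return [(b - a) % q for a, b in zip(a1, a2)]

# m = 4 files with k = 3 symbols each (illustrative data)
X = [[1, 2, 3], [4, 5, 6], [0, 6, 2], [3, 3, 1]]
q1, q2 = make_queries(2, len(X))
recovered = decode(answer(q1, X), answer(q2, X))  # equals X[2]
```

Each server sees a single uniformly distributed vector, and the user downloads two answers of file size each.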

Parameters and notations in PIR

Parameters of the system:

$n =$ # servers (as in storage).
$m =$ # files.
$k =$ size of each file (as in storage).
$z =$ max. collusion (as in secret sharing).
$t =$ # of answers required to obtain $x_i$ (as in secret sharing).
- $n - t$ servers are “stragglers”, i.e., might not respond.

Figures of merit:

PIR-rate = $\#$ desired symbols / $\#$ downloaded symbols
PIR-capacity = largest possible rate.

Notational conventions:

The dataset $X = \{x_j\}_{j \in [m]} = \{x_{j, \ell}\}_{(j, \ell) \in [m] \times [k]}$ is seen as a vector in $\mathbb{F}_q^{mk}$ .

Index $\mathbb{F}_q^{mk}$ using $[m] \times [k]$ , i.e., $x_{j, \ell}$ is the $\ell$ -th symbol of the $j$ -th file.

Private information retrieval from 4-replicated databases

Consider $n = 4$ replicated servers, file size $k = 2$ , collusion $z = 1$ .

Protocol: User has $i \sim U_{m}$ .

Fix distinct nonzero $\alpha_1, \ldots, \alpha_4 \in \mathbb{F}_q$ .
Choose $r \sim U_{\mathbb{F}_q^{2m}}$ .
User sends $q_j = e_{i, 1} + \alpha_j e_{i, 2} + \alpha_j^2 r$ to each server $j$ .
Server $j$ responds with $a_j = q_j X^\top = e_{i, 1} X^\top + \alpha_j e_{i, 2} X^\top + \alpha_j^2 r X^\top$
- This is an evaluation at $\alpha_j$ of the polynomial $f_i(w) = x_{i, 1} + x_{i, 2} \cdot w + (r X^\top) \cdot w^2$ .
- Here $r X^\top$ is a random combination of the entries of $X$ .
Decoding?
- Any 3 responses suffice to interpolate $f_i$ and obtain $x_i = (x_{i, 1}, x_{i, 2})$ .
- $\implies t = 3$ (one straggler is allowed).
Privacy?
- Does $q_j = e_{i, 1} + \alpha_j e_{i, 2} + \alpha_j^2 r$ look familiar?
- This is a share in a ramp scheme with vector messages $m_1 = e_{i, 1}, m_2 = e_{i, 2}$ , where $m_1, m_2 \in \mathbb{F}_q^{2m}$ .
- This is equivalent to $2m$ “parallel” ramp schemes over $\mathbb{F}_q$ .
- Each one reveals nothing to any $z = 1$ shareholders $\implies$ Private!
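
The 4-server protocol can be sketched as follows. The sketch makes several assumptions that are not in the notes: a prime field with $q = 7$, evaluation points $\alpha_j = 1, 2, 3, 4$, 0-based file indices, $X$ flattened so that $x_{j, \ell}$ sits at position $2j + \ell$, and hypothetical helper names. Interpolation is done with plain Lagrange formulas.

```python
import secrets

P = 7                  # assumed prime, so integers mod P form the field F_q
ALPHAS = [1, 2, 3, 4]  # assumed distinct nonzero evaluation points

def make_queries(i, m, p=P):
    """q_j = e_{i,1} + alpha_j e_{i,2} + alpha_j^2 r, vectors in F_q^{2m} (i is 0-based)."""
    r = [secrets.randbelow(p) for _ in range(2 * m)]
    queries = []
    for a in ALPHAS:
        qj = [(a * a * x) % p for x in r]
        qj[2 * i] = (qj[2 * i] + 1) % p          # adds e_{i,1}
        qj[2 * i + 1] = (qj[2 * i + 1] + a) % p  # adds alpha_j * e_{i,2}
        queries.append(qj)
    return queries

def answer(qj, X, p=P):
    """Server side: a_j = q_j X^T = f_i(alpha_j), a single field element."""
    return sum(a * b for a, b in zip(qj, X)) % p

def interpolate(points, p=P):
    """Coefficients (lowest degree first) of the unique polynomial through the points."""
    coeffs = [0] * len(points)
    for s, (xs, ys) in enumerate(points):
        basis, denom = [1], 1
        for u, (xu, _) in enumerate(points):
            if u == s:
                continue
            # multiply the basis polynomial by (w - xu)
            basis = [(shifted - xu * orig) % p
                     for shifted, orig in zip([0] + basis, basis + [0])]
            denom = denom * (xs - xu) % p
        scale = ys * pow(denom, -1, p) % p  # modular inverse, Python 3.8+
        coeffs = [(c + scale * b) % p for c, b in zip(coeffs, basis)]
    return coeffs

X = [3, 1, 4, 1, 5, 2]  # m = 3 files of k = 2 symbols: (3,1), (4,1), (5,2)
i = 1
answers = [answer(qj, X) for qj in make_queries(i, m=3)]
# Server 3 (index 2) is treated as a straggler; any 3 answers suffice.
points = [(ALPHAS[j], answers[j]) for j in (0, 1, 3)]
x_i1, x_i2, _ = interpolate(points)  # constant and linear coefficients of f_i
```

The third coefficient returned by `interpolate` is the random mask $r X^\top$ and is discarded.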

Private information retrieval from general replicated databases

$n$ servers, $m$ files, file size $k$ , $X \in \mathbb{F}_q^{mk}$ .

User decodes $x_i$ from any $t$ responses.

Any $\leq z$ servers might collude to infer $i$ ( $z < t$ ).

Protocol: User has $i \sim U_{m}$ .

Fix distinct nonzero $\alpha_1, \ldots, \alpha_n \in \mathbb{F}_q$ .
User chooses $r_1, \ldots, r_z \sim U_{\mathbb{F}_q^{mk}}$ .
User sends $q_j = \sum_{\ell=1}^k e_{i, \ell} \alpha_j^{\ell-1} + \sum_{\ell=1}^z r_\ell \alpha_j^{k+\ell-1}$ to each server $j$ .
Server $j$ responds with $a_j = q_j X^\top = f_i(\alpha_j)$ .
- $f_i(w) = \sum_{\ell=1}^k e_{i, \ell} X^\top w^{\ell-1} + \sum_{\ell=1}^z r_\ell X^\top w^{k+\ell-1}$ (random combinations of $X$ ).
- Caveat: must have $t = k + z$ .
- $\implies \deg f_i = k + z - 1 = t - 1$ .
Decoding?
- Interpolation from any $t$ evaluations of $f_i$ .
Privacy?
- Against any $z = t - k$ colluding servers, immediate from the proof of the ramp scheme.

PIR-rate?

Each $a_j$ is a single field element.
Download $t = k + z$ elements in $\mathbb{F}_q$ in order to obtain $x_i \in \mathbb{F}_q^k$ .
$\implies$ PIR-rate = $\frac{k}{k+z} = \frac{k}{t}$ .

Theorem: PIR-capacity for general replicated databases

The PIR-capacity for $n$ replicated databases with $z$ colluding servers, $n - t$ unresponsive servers, and $m$ files is $C = \frac{1-\frac{z}{t}}{1-(\frac{z}{t})^m}$ .

When $m \to \infty$ , $C \to 1 - \frac{z}{t} = \frac{t-z}{t} = \frac{k}{t}$ .
The above scheme achieves PIR-capacity as $m \to \infty$ .
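
To get a feel for how quickly the capacity approaches the rate $\frac{k}{t}$ of the scheme, the formula can be evaluated exactly with rational arithmetic. The function names below are hypothetical and the parameters ($k = 2$, $z = 1$, $t = 3$) are chosen only for illustration.

```python
from fractions import Fraction

def pir_capacity(z, t, m):
    """C = (1 - z/t) / (1 - (z/t)^m) for m files."""
    rho = Fraction(z, t)
    return (1 - rho) / (1 - rho ** m)

def scheme_rate(k, z):
    """Rate k / (k + z) of the polynomial scheme, where t = k + z."""
    return Fraction(k, k + z)

k, z = 2, 1
t = k + z
gaps = {m: pir_capacity(z, t, m) - scheme_rate(k, z) for m in (1, 2, 5, 50)}
```

For these parameters the gap between capacity and rate is $\frac{1}{3}$ at $m = 1$ and $\frac{1}{12}$ at $m = 2$, and it shrinks geometrically in $m$.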

Private information retrieval from coded databases

Problem setup:

Example:

$n = 3$ servers, $m$ files $x_j$ , $x_j = (x_{j, 1}, x_{j, 2})$ , $k = 2$ , and $q = 2$ .
Code each file with a parity code: $(x_{j, 1}, x_{j, 2}) \mapsto (x_{j, 1}, x_{j, 2}, x_{j, 1} + x_{j, 2})$ .
Server $j \in [3]$ stores the $j$ -th symbol of every coded file.

Queries, answers, decoding, and privacy must be tailored for the code at hand.

With respect to a code $C$ and parameters $n, k, t, z$ , such a scheme is called coded-PIR.

The content for server $j$ is denoted by $c_j = (c_{j, 1}, \ldots, c_{j, m})$ .
$C$ is usually an MDS code.

Private information retrieval from parity-check codes

Example:

Say $z = 1$ (no collusion).

Protocol: User has $i \sim U_{m}$ .
User chooses $r_1, r_2 \sim U_{\mathbb{F}_2^m}$ .
Two queries to each server:
- $q_{1, 1} = r_1 + e_i$ , $q_{1, 2} = r_2$ .
- $q_{2, 1} = r_1$ , $q_{2, 2} = r_2 + e_i$ .
- $q_{3, 1} = r_1$ , $q_{3, 2} = r_2$ .
Server $j$ responds with $q_{j, 1} c_j^\top$ and $q_{j, 2} c_j^\top$ .
Decoding?
- $q_{1, 1} c_1^\top + q_{2, 1} c_2^\top + q_{3, 1} c_3^\top = r_1 (c_1 + c_2 + c_3)^\top + e_i c_1^\top = r_1 \cdot 0^\top + x_{i, 1} = x_{i, 1}$ .
- $q_{1, 2} c_1^\top + q_{2, 2} c_2^\top + q_{3, 2} c_3^\top = r_2 (c_1 + c_2 + c_3)^\top + e_i c_2^\top = r_2 \cdot 0^\top + x_{i, 2} = x_{i, 2}$ .
Privacy?
- Every server sees two uniformly random vectors in $\mathbb{F}_2^m$ .
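
The parity-code example above can be sketched in Python over $\mathbb{F}_2$, where addition is XOR. The helper names and the 0-based file index are assumptions made for illustration.

```python
import secrets

def encode(files):
    """Parity code (x1, x2) -> (x1, x2, x1 + x2); returns the contents c_1, c_2, c_3."""
    c1 = [x1 for x1, _ in files]
    c2 = [x2 for _, x2 in files]
    c3 = [x1 ^ x2 for x1, x2 in files]
    return [c1, c2, c3]

def make_queries(i, m):
    """Two queries per server; e_i is added to the first query of server 1
    and to the second query of server 2 (i is 0-based)."""
    r1 = [secrets.randbelow(2) for _ in range(m)]
    r2 = [secrets.randbelow(2) for _ in range(m)]

    def plus_e_i(r):
        return [bit ^ (1 if l == i else 0) for l, bit in enumerate(r)]

    return [(plus_e_i(r1), r2), (r1, plus_e_i(r2)), (r1, r2)]

def answer(query_pair, c_j):
    """Server j returns the inner products q_{j,1} c_j^T and q_{j,2} c_j^T over F_2."""
    def dot(q):
        return sum(a & b for a, b in zip(q, c_j)) % 2
    return dot(query_pair[0]), dot(query_pair[1])

def decode(answers):
    """Summing the first (second) answers over the servers gives x_{i,1} (x_{i,2})."""
    x_i1 = answers[0][0] ^ answers[1][0] ^ answers[2][0]
    x_i2 = answers[0][1] ^ answers[1][1] ^ answers[2][1]
    return x_i1, x_i2

files = [(0, 1), (1, 1), (1, 0), (0, 0)]  # m = 4 files, k = 2 bits each
stored = encode(files)
i = 2
answers = [answer(qp, c) for qp, c in zip(make_queries(i, len(files)), stored)]
recovered = decode(answers)  # equals files[2]
```

The user downloads 6 bits to obtain 2, so this particular sketch has PIR-rate $\frac{1}{3}$.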

Proof from coding-theoretic interpretation

Let $G = (g_1^\top, g_2^\top, g_3^\top)$ be the generator matrix, with columns $g_j^\top$ .

For every file $x_j = (x_{j, 1}, x_{j, 2})$ we encode $x_j G = (x_j g_1^\top, x_j g_2^\top, x_j g_3^\top) = (c_{1, j}, c_{2, j}, c_{3, j})$ .
Server $j$ stores $X g_j^\top = (x_1^\top, \ldots, x_m^\top)^\top g_j^\top = (c_{j, 1}, \ldots, c_{j, m})^\top$ .
By multiplying by $r_1$ , the servers together store a codeword in $C$ :
- $(r_1 X g_1^\top, r_1 X g_2^\top, r_1 X g_3^\top) = r_1 X G$ .
By replacing one of the $r_1$ ’s by $r_1 + e_i$ , we introduce an error in that entry:
- $\left((r_1 + e_i) X g_1^\top, r_1 X g_2^\top, r_1 X g_3^\top\right) = r_1 X G + (e_i X g_1^\top, 0,0)$ .
Download this “erroneous” word from the servers and multiply it by $H^\top$ , where $H = (h_1^\top, h_2^\top, h_3^\top)$ is the parity-check matrix:

$$
\begin{aligned}
\left((r_1 + e_i) X g_1^\top, r_1 X g_2^\top, r_1 X g_3^\top\right) H^\top &= \left(r_1 X G + (e_i X g_1^\top, 0,0)\right) H^\top \\
&= r_1 X G H^\top + (e_i X g_1^\top, 0,0) H^\top \\
&= 0 + x_i g_1^\top \\
&= x_{i, 1}.
\end{aligned}
$$
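
For the parity code of the running example, the matrices can be written out explicitly. The notes do not spell them out; the choice below follows from the encoding $(x_{j, 1}, x_{j, 2}) \mapsto (x_{j, 1}, x_{j, 2}, x_{j, 1} + x_{j, 2})$ over $\mathbb{F}_2$:

$$
G = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}, \qquad
H = \begin{pmatrix} 1 & 1 & 1 \end{pmatrix}, \qquad
G H^\top = \begin{pmatrix} 1 + 0 + 1 \\ 0 + 1 + 1 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix}.
$$

With $g_1 = (1, 0)$ and $h_1 = 1$ , the error term contributes $(e_i X g_1^\top, 0, 0) H^\top = x_i g_1^\top \cdot 1 = x_{i, 1}$ . Placing the error at server 2 instead gives $x_i g_2^\top = x_{i, 2}$ , since $g_2 = (0, 1)$ .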

In homework we will show that this works with any MDS code ( $z=1$ ).

Say we obtained $x_i g_1^\top, \ldots, x_i g_k^\top$ ( $d - 1$ at a time; how?).
$(x_i g_1^\top, \ldots, x_i g_k^\top) = x_i B$ , where $B$ is a $k \times k$ submatrix of $G$ .
$C$ is MDS $\implies$ every $k \times k$ submatrix of $G$ is invertible $\implies$ obtain $x_i = (x_i B) B^{-1}$ .

Tip

error + known location $\implies$ erasure. $d = 2 \implies$ 1 erasure is correctable.