Skip to Content
CSE5313CSE5313 Coding and information theory for data science (Lecture 20)

CSE5313 Coding and information theory for data science (Lecture 20)

Review for Private Information Retrieval

PIR from replicated databases

For 2 replicated databases, we have the following protocol:

  • User has iUmi \sim U_{m}.
  • User chooses r1,r2UF2mr_1, r_2 \sim U_{\mathbb{F}_2^m}.
  • Two queries to each server:
    • q1,1=r1+eiq_{1, 1} = r_1 + e_i, q1,2=r2q_{1, 2} = r_2.
    • q2,1=r1q_{2, 1} = r_1, q2,2=r2+eiq_{2, 2} = r_2 + e_i.
  • Server jj responds with qj,1cjq_{j, 1} c_j^\top and qj,2cjq_{j, 2} c_j^\top.
  • Decoding?
    • q1,1c1+q2,1c2=r1c1+c2+eic1=r10+xi,1=xi,1q_{1, 1} c_1^\top + q_{2, 1} c_2^\top = r_1 c_1 + c_2 + e_i c_1^\top = r_1 \cdot 0^\top + x_{i, 1} = x_{i, 1}.
    • q1,2c1+q2,2c2=r2c1+c2+eic2=xi,2q_{1, 2} c_1^\top + q_{2, 2} c_2^\top = r_2 c_1 + c_2 + e_i c_2^\top = x_{i, 2}.

PIR-rate is k2k=12\frac{k}{2k} = \frac{1}{2}.

PIR from coded parity-check databases

For 3 coded parity-check databases, we have the following protocol:

  • User has iUmi \sim U_{m}.
  • User chooses r1,r2,r3UF2mr_1, r_2, r_3 \sim U_{\mathbb{F}_2^m}.
  • Three queries to each server:
    • q1,1=r1+eiq_{1, 1} = r_1 + e_i, q1,2=r2q_{1, 2} = r_2, q1,3=r3q_{1, 3} = r_3.
    • q2,1=r1q_{2, 1} = r_1, q2,2=r2+eiq_{2, 2} = r_2 + e_i, q2,3=r3q_{2, 3} = r_3.
    • q3,1=r1q_{3, 1} = r_1, q3,2=r2q_{3, 2} = r_2, q3,3=r3+eiq_{3, 3} = r_3 + e_i.
  • Server jj responds with qj,1cj,qj,2cj,qj,3cjq_{j, 1} c_j^\top, q_{j, 2} c_j^\top, q_{j, 3} c_j^\top.
  • Decoding?
    • q1,1c1+q2,1c2+q3,1c3=r1c1+c2+c3+eic1=r10+xi,1=xi,1q_{1, 1} c_1^\top + q_{2, 1} c_2^\top + q_{3, 1} c_3^\top = r_1 c_1 + c_2 + c_3 + e_i c_1^\top = r_1 \cdot 0^\top + x_{i, 1} = x_{i, 1}.
    • q1,2c1+q2,2c2+q3,2c3=r2c1+c2+c3+eic2=xi,2q_{1, 2} c_1^\top + q_{2, 2} c_2^\top + q_{3, 2} c_3^\top = r_2 c_1 + c_2 + c_3 + e_i c_2^\top = x_{i, 2}.
    • q1,3c1+q2,3c2+q3,3c3=r3c1+c2+c3+eic3=xi,3q_{1, 3} c_1^\top + q_{2, 3} c_2^\top + q_{3, 3} c_3^\top = r_3 c_1 + c_2 + c_3 + e_i c_3^\top = x_{i, 3}.

PIR-rate is k3k=13\frac{k}{3k} = \frac{1}{3}.

Beyond z=1

Star-product theme

Given x=(x1,,xj)j[n],y=(y1,,yj)j[n]x=(x_1, \ldots, x_j)_{j\in [n]}, y=(y_1, \ldots, y_j)_{j\in [n]}, over Fq\mathbb{F}_q, the star-product is defined as:

xy=(x1y1,,xnyn)x \star y = (x_1 y_1, \ldots, x_n y_n)

Given two linear codes, C,DFqnC,D\subseteq \mathbb{F}_q^n, the star-product code is defined as:

CD=spanFq{xyxC,yD}C \star D = span_{\mathbb{F}_q} \{x \star y | x \in C, y \in D\}

Singleton bound for star-product:

dCDndimCdimD+2d_{C \star D} \leq n-\dim C-\dim D+2

PIR form a database coded with any MDS code and z>1

To generalize the previous scheme to z>1z > 1 need to encode multiple rr‘s together.

  • As in the ramp scheme.

Recall from the ramp scheme, we use r1,,rzUFqkr_1, \ldots, r_z \sim U_{\mathbb{F}_q^k} as our key vector to avoid occlusion of the servers.

In the star-product scheme:

  • Files are coded with an MDS code CC.
  • The multiple rr‘s are coded with an MDS code DD.
  • The scheme is based on the minimum distance of CDC \star D.

To code the data:

  • Let CFqnC \subseteq \mathbb{F}_q^n be an MDS code of dimension kk.
  • For all jmj \in m, encode file xj=xj,1,,xj,kx_j = x_{j, 1}, \ldots, x_{j, k} using GCG_C:
(x1,1x1,2x1,kx2,1x2,2x2,kxm,1xm,2xm,k)GC=(c1,1c1,2c1,nc2,1c2,2c2,ncm,1cm,2cm,n)\begin{pmatrix} x_{1, 1} & x_{1, 2} & \cdots & x_{1, k}\\ x_{2, 1} & x_{2, 2} & \cdots & x_{2, k}\\ \vdots & \vdots & \ddots & \vdots\\ x_{m, 1} & x_{m, 2} & \cdots & x_{m, k} \end{pmatrix} \cdot G_C = \begin{pmatrix} c_{1, 1} & c_{1, 2} & \cdots & c_{1, n}\\ c_{2, 1} & c_{2, 2} & \cdots & c_{2, n}\\ \vdots & \vdots & \ddots & \vdots\\ c_{m, 1} & c_{m, 2} & \cdots & c_{m, n} \end{pmatrix}
  • For all jnj \in n, store cj=c1,j,c2,j,,cm,jc_j = c_{1, j}, c_{2, j}, \ldots, c_{m, j} (a column of the above matrix) in server jj.

Let r1,,rzUFqkr_1, \ldots, r_z \sim U_{\mathbb{F}_q^k}.

To code the queries:

  • Let DFqkD \subseteq \mathbb{F}_q^k be an MDS code of dimension zz.
  • Encode the rjr_j‘s using GD=[g1,,gz]G_D=[g_1^\top, \ldots, g_z^\top].
(r1,,rz)GD=(r1,1r2,1rz,1r1,2r2,2rz,2r1,mr2,mrz,m)GD=((r1,,rz)g1,,(r1,,rz)gn)(r_1^\top, \ldots, r_z^\top) \cdot G_D = \begin{pmatrix} r_{1, 1} & r_{2, 1} & \cdots & r_{z, 1}\\ r_{1, 2} & r_{2, 2} & \cdots & r_{z, 2}\\ \vdots & \vdots & \ddots & \vdots\\ r_{1, m} & r_{2, m} & \cdots & r_{z, m} \end{pmatrix} \cdot G_D=\left((r_1^\top,\ldots, r_z^\top)g_1^\top,\ldots, (r_1^\top,\ldots, r_z^\top)g_n^\top \right)

To introduce the “errors in known locations” to the encoded rjr_j‘s:

  • Let W{0,1}m×nW \in \{0, 1\}^{m \times n} with some dCD1d_{C \star D} - 1 entries in its ii-th row equal to 1.
  • These are the entries we will retrieve.

For every server j[n]j \in [n] send qj=r1,,rzgj+wjq_j = r_1^\top, \ldots, r_z^\top g_j^\top + w_j, where wjw_j is the ii-th column of WW.

  • This is similar to ramp scheme, where wjw_j is the “message”.
  • Privacy against collusion of zz servers.

Response from server: aj=qjcja_j = q_j c_j^\top.

Decoding? Let QFqm×nQ \in \mathbb{F}_q^{m \times n} be a matrix whose columns are the qjq_j‘s.

Q=(r1rz)GD+WQ = \begin{pmatrix} r_1^\top & \cdots & r_z^\top \end{pmatrix} \cdot G_D + W
  • The user has
\begin{aligned} q_1 c_1^\top, \ldots, q_n c_n^\top &= \left(\sum_{j \in m} q_{1, j} c_{j, 1}, \ldots, \sum_{j \in m} q_{n, j} c_{j, n}\right) \\ &=\sum_{j \in m} (q_{1,j}c_{j, 1}, \ldots, q_{n,j}c_{j, n}) \\ &=\sum_{j \in m} q^j \star c^j

where qjq^j is a row of QQ and cjc^j is a codeword in CC (an n,kn, k qq MDS code).

We have:

  • Q=(r1,,rz)GD+WQ=(r_1^\top, \ldots, r_z^\top) \cdot G_D + W
  • W{0,1}m×nW\in \{0, 1\}^{m \times n} with some dCD1d_{C \star D} - 1 entries in its ii-th row equal to 1.
  • (qjcj)=sumjmqjcj(q^j \star c^j)=sum_{j \in m} q^j \star c^j
  • Each qjq^j is a row of QQ
    • For jij \neq i, qjq^j is a codeword in DD
    • qi=di+wiq^i = d^i + w^i
  • Therefore:
j[m]qjcj=ji(djcj)+((di+wi)ci)=ji(djcj)+wici=(codeword in CD)+(noise of Hamming weight dCD1)\begin{aligned} \sum_{j \in [m]} q^j \star c^j &= \sum_{j \neq i} (d^j \star c^j) + ((d^i + w^i) \star c^i) \\ &= \sum_{j \neq i} (d^j \star c^j) + w^i \star c^i &= (\text{codeword in } C \star D )+( \text{noise of Hamming weight } \leq d_{C \star D} - 1) \end{aligned}

Multiply by HCDH_{C \star D} and get dCD1d_{C \star D} - 1 elements of cic^i.

  • Recall that ci=xiGCc^i = x_i \cdot G_C
  • Repeat kdCD1k^{d_{C \star D} - 1} times to obtain kk elements of cic^i.
    • Suffices to obtain xix_i, since CC is n,kn, k qq MDS code.

PIR-rate:

  • = \frac{k}{# \text{ downloaded elements}} = \frac{k}{\frac{k}{d_{C \star D} - 1} \cdot n} = \frac{d_{C \star D} - 1}{n}
  • Singleton bound for star-product: dCDndimCdimD+2d_{C \star D} \leq n - \dim C - \dim D + 2.
  • Achieved with equality if CC and DD are Reed-Solomon codes.
  • PIR-rate = ndimCdimD+1n=nkz+1n\frac{n - \dim C - \dim D + 1}{n} = \frac{n - k - z + 1}{n}.
  • Intuition:
    • “paying” kk for “reconstruction from any kk”.
    • “paying” zz for “protection against colluding sets of size zz”.
  • Capacity unknown! (as of 2022).
    • Known for special cases, e.g., k=1,z=1k = 1, z = 1, certain types of schemes, etc.

PIR over graphs

Graph-based replication:

  • Every file is replicated twice on two separate servers.
  • Every two servers have at most one file in common.
  • “file” = “granularity” of data, i.e., the smallest information unit shared by any two servers.

A server that stores (xi,j)j=1d(x_{i, j})_{j=1}^d receives (qi,j)j=1d(q_{i, j})_{j=1}^d, and replies with j=1dqi,jxi,j\sum_{j=1}^d q_{i, j} \cdot x_{i, j}.

The idea:

  • Consider a 2-server replicated PIR and “split” the queries between the servers.
  • Sum the responses, unwanted files “cancel out”, while xix_i does not.

Problem: Collusion.

Solution: Add per server randomness.

Good for any graph, and any q3q \geq 3 (for simplicity assume 2q2 | q).

The protocol:

  • Choose random γFqn\gamma \in \mathbb{F}_q^n, νFqm\nu \in \mathbb{F}_q^m, and hF{0,1}h \in \mathbb{F} \setminus \{0, 1\}.
  • Queries:
    • If node jj is incident with edge \ell, send qj,=γjνq_{j, \ell} = \gamma_j \cdot \nu_\ell to node jj.
    • I.e., if server jj stores file \ell.
  • Except one node j0j_0 that stores xix_i, which gets qj0,i=hγj0νiq_{j_0, i} = h \cdot \gamma_{j_0} \cdot \nu_i.
  • Server jj responds with aj=j=1dqj,xi,a_j = \sum_{j=1}^d q_{j, \ell} \cdot x_{i, \ell}.
    • Where xi,1,,x_{i, 1}, \ldots, x_{i, d}$ are the files adjacent with it.

Example

  • Consider the following graph.
  • n=5,m=7,andi=3n = 5, m = 7, and i = 3.
  • q3=γ3v2,v3,v6q_3 = \gamma_3 \cdot v_2, v_3, v_6 and a3=x2γ3v2+x3γ3v3+x6γ3v6a_3 = x_2 \cdot \gamma_3 v_2 + x_3 \cdot \gamma_3 v_3 + x_6 \cdot \gamma_3 v_6.
  • q2=γ2v1,hv3,v4q_2 = \gamma_2 \cdot v_1, h v_3, v_4 and a2=x1γ2v1+x3hγ2v3+x4γ2v4a_2 = x_1 \cdot \gamma_2 v_1 + x_3 \cdot h \gamma_2 v_3 + x_4 \cdot \gamma_2 v_4.

Example of PIR over graphs

Correctness:

  • j=15γj1aj=(h+1)v3x3\sum_{j=1}^5 \gamma_j^{-1} a_j =( h + 1 )v_3 x_3
  • h1,v30    h \neq 1, v_3 \neq 0 \implies find x3x_3.

Parameters:

  • Storage overhead 2 (for any graph).
  • Download nkn \cdot k.
  • PIR rate 1/n.

Collusion resistance:

1-privacy: Each node sees an entirely random vector.

2-privacy:

  • If no edge – as for 1-privacy.
  • If edge exists – E.g.,
    • γ3v6\gamma_3 v_6 and γ4v6\gamma_4 v_6 are independent.
    • γ3v3\gamma_3 v_3 and hγ2v3h \cdot \gamma_2 v_3 are independent.

S-privacy:

  • Let SnS \subseteq n (e.g., S=2,3,5S = 2,3,5), and consider the query matrix of their mutual files:
QS=diag(γ3,γ2,γ5)(1h11)diag(v3,v4)Q_S = diag(\gamma_3, \gamma_2, \gamma_5) \begin{pmatrix} 1 &\\ h & 1 \\ & 1\end{pmatrix} diag(v_3, v_4)
  • It can be shown that Pr(QS)=1(q1)4Pr(Q_S)=\frac{1}{(q-1)^4}, regardless of i    i \implies perfect privacy.
Last updated on