Skip to Content
CSE5313CSE5313 Coding and information theory for data science (Lecture 19)

CSE5313 Coding and information theory for data science (Lecture 19)

Private information retrieval

Problem setup

Premise:

  • Database X={x1,,xm}X = \{x_1, \ldots, x_m\}, each xiFqkx_i \in \mathbb{F}_q^k is a “file” (e.g., medical record).
  • XX is coded X{y1,,yn}X \mapsto \{y_1, \ldots, y_n\}, yjy_j stored at server jj.
  • The user (physician) wants xix_i.
  • The user sends a query qjQjq_j \sim Q_j to server jj.
  • Server jj responds with ajAja_j \sim A_j.

Decodability:

  • The user can retrieve the file: H(XiA1,,An)=0H(X_i | A_1, \ldots, A_n) = 0.

Privacy:

  • ii is seen as iU=Umi \sim U = U_{m}, reflecting server’s lack of knowledge.
  • ii must be kept private: I(Qj;U)=0I(Q_j; U) = 0 for all jnj \in n.

In short, we want to retrieve xix_i from the servers without revealing ii to the servers.

Private information retrieval from Replicated Databases

Simple case, one server

Say n=1,y1=Xn = 1, y_1 = X.

  • All data is stored in one server.
  • Simple solution:
  • q1=q_1 = “send everything”.
  • a1=y1=Xa_1 = y_1 = X.

Theorem: Information Theoretic PIR with n=1n = 1 can only be achieved by downloading the entire database.

  • Can we do better if n>1n > 1?

Collusion parameter

Key question for n>1n > 1: Can servers collude?

  • I.e., does server jj see any QQ_\ell, j\ell \neq j?
  • Key assumption:
    • Privacy parameter zz.
    • At most zz servers can collude.
    • z=1    z = 1\implies No collusion.
  • Requirement for z=1z = 1: I(Qj;U)=0I(Q_j; U) = 0 for all jnj \in n.
  • Requirement for a general zz:
    • I(QT;U)=0I(Q_\mathcal{T}; U) = 0 for all Tn\mathcal{T} \in n, Tz|\mathcal{T}| \leq z, where QT=QQ_\mathcal{T} = Q_\ell for all T\ell \in \mathcal{T}.
  • Motivation:
    • Interception of communication links.
    • Data breaches.

Other assumptions:

  • Computational Private information retrieval (even all the servers are hacked, still cannot get the information -> solve np-hard problem):
  • Non-zero MI

Private information retrieval from 2-replicated databases

First PIR protocol: Chor et al. FOCS ‘95.

  • The data X={x1,,xm}X = \{x_1, \ldots, x_m\} is replicated on two servers.
    • z=1z = 1, i.e., no collusion.
  • Protocol: User has iUmi \sim U_{m}.
    • User generates rUFqmr \sim U_{\mathbb{F}_q^m}.
    • q1=r,q2=r+eiq_1 = r, q_2 = r + e_i (eiFqme_i \in \mathbb{F}_q^m is the ii-th unit vector, q2q_2 is equivalent to one-time pad encryption of xix_i with key rr).
    • aj=qjX=mqj,xa_j = q_j X^\top = \sum_{\ell \in m} q_j, \ell x_\ell
    • Linear combination of the files according to the query vector qjq_j.
  • Decoding?
    • a2a1=q2q1X=eiX=xia_2 - a_1 = q_2 - q_1 X^\top = e_i X^\top = x_i.
  • Download?
    • aj=a_j = size of file     \implies downloading twice the size of the file.
  • Privacy?
    • Since z=1z = 1, need to show I(U;Qi)=0I(U; Q_i) = 0.
      • I(U;Q1)=I(eU;F)=0I(U; Q_1) = I(e_U; F) = 0 since UU and FF are independent.
      • I(U;Q2)=I(eU;F+eU)=0I(U; Q_2) = I(e_U; F + e_U) = 0 since this is one-time pad!
Parameters and notations in PIR

Parameters of the system:

  • n=n = # servers (as in storage).
  • m=m = # files.
  • k=k = size of each file (as in storage).
  • z=z = max. collusion (as in secret sharing).
  • t=t = # of answers required to obtain xix_i (as in secret sharing).
    • ntn - t servers are “stragglers”, i.e., might not respond.

Figures of merit:

  • PIR-rate = #\# desired symbols / #\# downloaded symbols
  • PIR-capacity = largest possible rate.

Notaional conventions:

-The dataset X={xj}jm={xj,}(j,)[m]×[k]X = \{x_j\}_{j \in m} = \{x_{j, \ell}\}_{(j, \ell) \in [m] \times [k]} is seen as a vector in Fqmk\mathbb{F}_q^{mk}.

  • Index Fqmk\mathbb{F}_q^{mk} using [m]×[k][m] \times [k], i.e., xj,x_{j, \ell} is the \ell-th symbol of the jj-th file.

Private information retrieval from 4-replicated databases

Consider n=4n = 4 replicated servers, file size k=2k = 2, collusion z=1z = 1.

Protocol: User has iUmi \sim U_{m}.

  • Fix distinct nonzero α1,,α4Fq\alpha_1, \ldots, \alpha_4 \in \mathbb{F}_q.
  • Choose rUFq2mr \sim U_{\mathbb{F}_q^{2m}}.
  • User sends qj=ei,1+αjei,2+αj2rq_j = e_{i, 1} + \alpha_j e_{i, 2} + \alpha_j^2 r to each server jj.
  • Server jj responds with aj=qjX=ei,1X+αjei,2X+αj2rXa_j = q_j X^\top = e_{i, 1} X^\top + \alpha_j e_{i, 2} X^\top + \alpha_j^2 r X^\top
    • This is an evaluation at αj\alpha_j of the polynomial fi(w)=xi,1+xi,2w+rw2f_i(w) = x_{i, 1} + x_{i, 2} \cdot w + r \cdot w^2.
    • Where rr is some random combination of the entries of XX.
  • Decoding?
    • Any 3 responses suffice to interpolate fif_i and obtain xi=xi,1,xi,2x_i = x_{i, 1}, x_{i, 2}.
    •     t=3\implies t = 3, (one straggler is allowed)
  • Privacy?
    • Does qj=ei,1+αjei,2+αj2rq_j = e_{i, 1} + \alpha_j e_{i, 2} + \alpha_j^2 r look familiar?
    • This is a share in ramp scheme with vector messages m1=ei,1,m2=ei,2,miFq2mm_1 = e_{i, 1}, m_2 = e_{i, 2}, m_i \in \mathbb{F}_q^{2m}.
    • This is equivalent to 2m2m “parallel” ramp scheme over Fq\mathbb{F}_q.
    • Each one reveals nothing to any z=1z = 1 shareholders     \implies Private!

Private information retrieval from general replicated databases

nn servers, mm files, file size kk, XFqmkX \in \mathbb{F}_q^{mk}.

Server decodes xix_i from any tt responses.

Any z\leq z servers might collude to infer ii (z<tz < t).

Protocol: User has iUmi \sim U_{m}.

  • User chooses r1,,rzUFqmkr_1, \ldots, r_z \sim U_{\mathbb{F}_q^{mk}}.
  • User sends qj==1kei,αj1+=1zrαjk+1q_j = \sum_{\ell=1}^k e_{i, \ell} \alpha_j^{\ell-1} + \sum_{\ell=1}^z r_\ell \alpha_j^{k+\ell-1} to each server jj.
  • Server jj responds with aj=qjX=fi(αj)a_j = q_j X^\top = f_i(\alpha_j).
    • fi(w)==1kei,Xw1+=1zrXwk+1f_i(w) = \sum_{\ell=1}^k e_{i, \ell} X^\top w^{\ell-1} + \sum_{\ell=1}^z r_\ell X^\top w^{k+\ell-1} (random combinations of XX).
    • Caveat: must have t=k+zt = k + z.
    •     degfi=k+z1=t1\implies \deg f_i = k + z - 1 = t - 1.
  • Decoding?
    • Interpolation from any tt evaluations of fif_i.
  • Privacy?
    • Against any z=tkz = t - k colluding servers, immediate from the proof of the ramp scheme.

PIR-rate?

  • Each aja_j is a single field element.
  • Download t=k+zt = k + z elements in Fq\mathbb{F}_q in order to obtain xiFqkx_i \in \mathbb{F}_q^k.
  •     \implies PIR-rate = kk+z=kt\frac{k}{k+z} = \frac{k}{t}.

Theorem: PIR-capacity for general replicated databases

The PIR-capacity for nn replicated databases with zz colluding servers, ntn - t unresponsive servers, and mm files is C=1zt1(zt)mC = \frac{1-\frac{z}{t}}{1-(\frac{z}{t})^m}.

  • When mm \to \infty, C1zt=tzt=ktC \to 1 - \frac{z}{t} = \frac{t-z}{t} = \frac{k}{t}.
  • The above scheme achieves PIR-capacity as mm \to \infty

Private information retrieval from coded databases

Problem setup:

Example:

  • n=3n = 3 servers, mm files xjx_j, xj=xj,1,xj,2x_j = x_{j, 1}, x_{j, 2}, k=2k = 2, and q=2q = 2.
  • Code each file with a parity code: xj,1,xj,2xj,1,xj,2,xj,1+xj,2x_{j, 1}, x_{j, 2} \mapsto x_{j, 1}, x_{j, 2}, x_{j, 1} + x_{j, 2}.
  • Server j3j \in 3 stores all jj-th symbols of all coded files.

Queries, answers, decoding, and privacy must be tailored for the code at hand.

With respect to a code CC and parameters n,k,t,zn, k, t, z, such scheme is called coded-PIR.

  • The content for server jj is denoted by cj=cj,1,,cj,mc_j = c_{j, 1}, \ldots, c_{j, m}.
  • CC is usually an MDS code.

Private information retrieval from parity-check codes

Example:

Say z=1z = 1 (no collusion).

  • Protocol: User has iUmi \sim U_{m}.
  • User chooses r1,r2UF2mr_1, r_2 \sim U_{\mathbb{F}_2^m}.
  • Two queries to each server:
    • q1,1=r1+eiq_{1, 1} = r_1 + e_i, q1,2=r2q_{1, 2} = r_2.
    • q2,1=r1q_{2, 1} = r_1, q2,2=r2+eiq_{2, 2} = r_2 + e_i.
    • q3,1=r1q_{3, 1} = r_1, q3,2=r2q_{3, 2} = r_2.
  • Server jj responds with qj,1cjq_{j, 1} c_j^\top and qj,2cjq_{j, 2} c_j^\top.
  • Decoding?
    • q1,1c1+q2,1c2+q3,1c3=r1c1+c2+c3+eic1=r10+xi,1=xi,1q_{1, 1} c_1^\top + q_{2, 1} c_2^\top + q_{3, 1} c_3^\top = r_1 c_1 + c_2 + c_3 + e_i c_1^\top = r_1 \cdot 0^\top + x_{i, 1} = x_{i, 1}.
    • q1,1c1+q2,1c2+q3,1c3=r10+xi,1=xi,1q_{1, 1} c_1^\top + q_{2, 1} c_2^\top + q_{3, 1} c_3^\top = r_1 \cdot 0^\top + x_{i, 1} = x_{i, 1}.
    • q1,2c1+q2,2c2+q3,2c3=r2c1+c2+c3+eic2=xi,2q_{1, 2} c_1^\top + q_{2, 2} c_2^\top + q_{3, 2} c_3^\top = r_2 c_1 + c_2 + c_3^\top + e_i c_2^\top = x_{i, 2}.
  • Privacy?
    • Every server sees two uniformly random vectors in F2m\mathbb{F}_2^m.

Proof from coding-theoretic interpretation

Let G=g1,g2,g3G = g_1^\top, g_2^\top, g_3^\top be the generator matrix.

  • For every file xj=xj,1,xj,2x_j = x_{j, 1}, x_{j, 2} we encode xjG=(xj,1g1,xj,2g2,xj,1g3)=(cj,1,cj,2,cj,3)x_j G = (x_{j, 1} g_1^\top, x_{j, 2} g_2^\top, x_{j, 1} g_3^\top) = (c_{j, 1}, c_{j, 2}, c_{j, 3}).

  • Server jj stores Xgj=(x1,,xm)gj=(cj,1,,cj,m)X g_j^\top = (x_1^\top, \ldots, x_m^\top)^\top g_j^\top = (c_{j, 1}, \ldots, c_{j, m})^\top.

  • By multiplying by r1r_1, the servers together store a codeword in CC:

    • r1Xg1,r1Xg2,r1Xg3=r1XGr_1 X g_1^\top, r_1 X g_2^\top, r_1 X g_3^\top = r_1 X G.
  • By replacing one of the r1r_1’s by r1+eir_1 + e_i, we introduce an error in that entry:

    • ((r1+ei)Xg1,r1Xg2,r1Xg3)=r1XG+(eiXg1,0,0)\left((r_1 + e_i) X g_1^\top, r_1 X g_2^\top, r_1 X g_3^\top\right) = r_1 X G + (e_i X g_1^\top, 0,0).
  • Downloading this “erroneous” word from the servers and multiply by H=h1,h2,h3H = h_1^\top, h_2^\top, h_3^\top be the parity-check matrix.

((r1+ei)Xg1,r1Xg2,r1Xg3)H=(r1XG+(eiXg1,0,0))H=r1XGH+(eiXg1,0,0)H=0+xi,1g1=xi,1.\begin{aligned} \left((r_1 + e_i) X g_1^\top, r_1 X g_2^\top, r_1 X g_3^\top\right) H^\top &= \left(r_1 X G + (e_i X g_1^\top, 0,0)\right) H^\top \\ &= r_1 X G H^\top + (e_i X g_1^\top, 0,0) H^\top \\ &= 0 + x_{i, 1} g_1^\top \\ &= x_{i, 1}. \end{aligned}

In homework we will show tha this work with any MDS code (z=1z=1).

  • Say we obtained xi,1g1,,xi,kgkx_{i, 1} g_1^\top, \ldots, x_{i, k} g_k^\top (𝑑 − 1 at a time, how?).
  • xi,1g1,,xi,kgk=xi,Bx_{i, 1} g_1^\top, \ldots, x_{i, k} g_k^\top = x_{i, B}, where BB is a k×kk \times k submatrix of GG.
  • BB is a k×kk \times k submatrix of GG     \implies invertible!     \implies Obtain xix_{i}.
Tip

error + known location     \implies erasure. d=2    d = 2 \implies 1 erasure is correctable.

Last updated on