CSE5313 Coding and information theory for data science (Lecture 19)
Private information retrieval
Problem setup
Premise:
- Database , each is a “file” (e.g., medical record).
- is coded , stored at server .
- The user (physician) wants .
- The user sends a query to server .
- Server responds with .
Decodability:
- The user can retrieve the file: .
Privacy:
- is seen as , reflecting server’s lack of knowledge.
- must be kept private: for all .
In short, we want to retrieve from the servers without revealing to the servers.
Private information retrieval from Replicated Databases
Simple case, one server
Say .
- All data is stored in one server.
- Simple solution:
- “send everything”.
- .
Theorem: Information Theoretic PIR with can only be achieved by downloading the entire database.
- Can we do better if ?
Collusion parameter
Key question for : Can servers collude?
- I.e., does server see any , ?
- Key assumption:
- Privacy parameter .
- At most servers can collude.
- No collusion.
- Requirement for : for all .
- Requirement for a general :
- for all , , where for all .
- Motivation:
- Interception of communication links.
- Data breaches.
Other assumptions:
- Computational Private information retrieval (even all the servers are hacked, still cannot get the information -> solve np-hard problem):
- Non-zero MI
Private information retrieval from 2-replicated databases
First PIR protocol: Chor et al. FOCS ‘95.
- The data is replicated on two servers.
- , i.e., no collusion.
- Protocol: User has .
- User generates .
- ( is the -th unit vector, is equivalent to one-time pad encryption of with key ).
- Linear combination of the files according to the query vector .
- Decoding?
- .
- Download?
- size of file downloading twice the size of the file.
- Privacy?
- Since , need to show .
- since and are independent.
- since this is one-time pad!
- Since , need to show .
Parameters and notations in PIR
Parameters of the system:
- # servers (as in storage).
- # files.
- size of each file (as in storage).
- max. collusion (as in secret sharing).
- # of answers required to obtain (as in secret sharing).
- servers are “stragglers”, i.e., might not respond.
Figures of merit:
- PIR-rate = desired symbols / downloaded symbols
- PIR-capacity = largest possible rate.
Notaional conventions:
-The dataset is seen as a vector in .
- Index using , i.e., is the -th symbol of the -th file.
Private information retrieval from 4-replicated databases
Consider replicated servers, file size , collusion .
Protocol: User has .
- Fix distinct nonzero .
- Choose .
- User sends to each server .
- Server responds with
- This is an evaluation at of the polynomial .
- Where is some random combination of the entries of .
- Decoding?
- Any 3 responses suffice to interpolate and obtain .
- , (one straggler is allowed)
- Privacy?
- Does look familiar?
- This is a share in ramp scheme with vector messages .
- This is equivalent to “parallel” ramp scheme over .
- Each one reveals nothing to any shareholders Private!
Private information retrieval from general replicated databases
servers, files, file size , .
Server decodes from any responses.
Any servers might collude to infer ().
Protocol: User has .
- User chooses .
- User sends to each server .
- Server responds with .
- (random combinations of ).
- Caveat: must have .
- .
- Decoding?
- Interpolation from any evaluations of .
- Privacy?
- Against any colluding servers, immediate from the proof of the ramp scheme.
PIR-rate?
- Each is a single field element.
- Download elements in in order to obtain .
- PIR-rate = .
Theorem: PIR-capacity for general replicated databases
The PIR-capacity for replicated databases with colluding servers, unresponsive servers, and files is .
- When , .
- The above scheme achieves PIR-capacity as
Private information retrieval from coded databases
Problem setup:
Example:
- servers, files , , , and .
- Code each file with a parity code: .
- Server stores all -th symbols of all coded files.
Queries, answers, decoding, and privacy must be tailored for the code at hand.
With respect to a code and parameters , such scheme is called coded-PIR.
- The content for server is denoted by .
- is usually an MDS code.
Private information retrieval from parity-check codes
Example:
Say (no collusion).
- Protocol: User has .
- User chooses .
- Two queries to each server:
- , .
- , .
- , .
- Server responds with and .
- Decoding?
- .
- .
- .
- Privacy?
- Every server sees two uniformly random vectors in .
Proof from coding-theoretic interpretation
Let be the generator matrix.
-
For every file we encode .
-
Server stores .
-
By multiplying by , the servers together store a codeword in :
- .
-
By replacing one of the ’s by , we introduce an error in that entry:
- .
-
Downloading this “erroneous” word from the servers and multiply by be the parity-check matrix.
In homework we will show tha this work with any MDS code ().
- Say we obtained (𝑑 − 1 at a time, how?).
- , where is a submatrix of .
- is a submatrix of invertible! Obtain .
error + known location erasure. 1 erasure is correctable.