The General Accounting Office identified 12 terrorist watch lists in nine federal agencies [Federal Computer Week, 2003]. Here is a sample:
There is a strong desire to have one large master federal watchlist, and a new center, the Terrorist Screening Center, will house the master database [Federal Computer Week, 2003]. The center results from Homeland Security Presidential Directive-6, signed by President Bush in September 2003 [Computerworld, 2003].
The number of suspicious characters to be included on the watchlist is estimated at 13 million [New York Daily News, 2003]. Because the list contains "potential terrorists," its size concerns civil libertarians, who argue that innocent people may appear on the list and be subject to constant tracking without their knowledge or any legal review.
Another difficulty is identity theft. Actual terrorists who belong on the watchlist are apt to steal identities and create fraudulent travel documents to avoid detection, so the utility of a watchlist based on explicit identification appears somewhat limited.
What are the key characteristics of a solution to the watchlist problem?
How useful are solutions apt to be?
Proposal #2. The CIA recently purchased a solution developed by Jeff Jonas for approximately $35 million [Wall Street Journal, 2004; Newsweek, 2004]. Below is a summary of the Newsweek story:
Jonas' system works as follows:
A working example of the basic principle of the system is available in a dynamic spreadsheet and a static spreadsheet. Load these up and see how they work. There is a sample of 500 names drawn from the Cambridge Voter list. Each has a unique hashed ID number. Five percent of the sample, selected randomly, is identified as being on the watchlist. There are 12 locations reporting patrons by hashed ID number; the patrons appearing at each location are determined randomly, based on a probability of being selected (reported at the top of the column). For example, a location with a probability of 80% gives any particular person an 80% chance of visiting in this model. Pressing the delete key in cell A1 (the upper left corner) advances one time period, in which different customers are selected for each location. The static spreadsheet has a frozen instance.
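For readers without the spreadsheets at hand, the same principle can be sketched in a few lines of Python. This is only an approximation of the model described above, not the spreadsheet itself: the placeholder names, the use of SHA-256 for the hashed ID, the visit probabilities, and the function names hashed_id and simulate_period are all assumptions made for illustration.

```python
import hashlib
import random


def hashed_id(name):
    """Stand-in for the spreadsheet's hashed ID; SHA-256 is an assumption."""
    return hashlib.sha256(name.encode("utf-8")).hexdigest()


def simulate_period(people, watchlist, visit_probs, rng):
    """For each location, return the hashed IDs of patrons who are on the watchlist."""
    hits = []
    for p in visit_probs:
        # Each person visits this location independently with probability p.
        patrons = {pid for pid in people if rng.random() < p}
        hits.append(patrons & watchlist)
    return hits


rng = random.Random(2004)  # fixed seed, comparable to the frozen "static" spreadsheet
people = [hashed_id(f"voter-{i}") for i in range(500)]  # placeholder names, not real voters
watchlist = set(rng.sample(people, 25))  # 5% of the 500-person sample
# Hypothetical per-location visit probabilities (12 locations).
visit_probs = [0.80, 0.60, 0.50, 0.40, 0.30, 0.25, 0.20, 0.15, 0.10, 0.05, 0.02, 0.01]

hits = simulate_period(people, watchlist, visit_probs, rng)
for location, matched in enumerate(hits, start=1):
    print(f"location {location:2d}: {len(matched)} watchlist matches")
```

Calling simulate_period again with the same rng plays the role of pressing delete in cell A1: a new time period with a different set of patrons at each location.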
What are some benefits of this approach? What are some concerns?
In the assignment associated with this lab, you will do an assessment of Jonas' system -- i.e., you will identify ways his system may not provide the anonymity claimed.
Model 1: Ship the Explicitly Identified Data to the Government for Analysis
In this model, explicitly identified data is shipped directly to the government for processing. The hashing is done internally by the government, and the resulting hashes are compared to the hashed values of the watchlist.
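A minimal sketch of the Model 1 data flow follows. The record fields, the made-up names, and the choice of SHA-256 over a normalized name and date of birth are assumptions for illustration only; the actual system's hash inputs are not described here.

```python
import hashlib


def hash_identity(name, dob):
    """Hypothetical hash over a normalized name and date of birth."""
    key = f"{name.strip().lower()}|{dob}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


def government_process(received_records, watchlist):
    """Hash the identified records internally and compare with the hashed watchlist."""
    watch_hashes = {hash_identity(name, dob) for name, dob in watchlist}
    return [r for r in received_records
            if hash_identity(r["name"], r["dob"]) in watch_hashes]


# Made-up records as a data holder might ship them: explicitly identified.
records = [
    {"name": "Alice Example", "dob": "1970-01-01", "location": "hotel-7"},
    {"name": "Bob Sample", "dob": "1982-05-17", "location": "hotel-7"},
    {"name": "Carol Placeholder", "dob": "1965-11-30", "location": "casino-2"},
]
watchlist = [("bob sample", "1982-05-17")]

matches = government_process(records, watchlist)
print("matches:", [r["name"] for r in matches])
print("identified records held by the government:", len(records))
```

Note that in this sketch the hashing happens only after every identified record has already arrived, so the government holds all three records even though only one matches.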
In this scenario, what are the privacy concerns? What are the risks to individuals not on the watchlist? What are the risks to the data holders? What are the legal issues? Is there any protection against using the data for other purposes?
What additional information about those not on the watchlist can be gained by the government?
Recall our older measure of risk -- the product of the number of people identified by the data and the amount of information provided. How significant is the risk in this scenario? Who is at risk?
Are there any technical remedies that could be applied to this model? Are there any legal remedies that could assist?
Model 2a: Data holders compute matches locally
In this model, the hash function and the hashed values of the watchlist (w) are provided to data holders, who in turn compute the hashed values for their customers and report any matches (m) to the government.
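The exchange of w and m can be sketched as follows. The function names government_prepare and data_holder_report, the made-up names, and the use of SHA-256 over a normalized name as the shared hash function are assumptions for illustration.

```python
import hashlib


def hash_fn(identity):
    """The shared hash function; SHA-256 over a normalized name is an assumption."""
    return hashlib.sha256(identity.strip().lower().encode("utf-8")).hexdigest()


def government_prepare(watchlist_names):
    """Government side: hash the watchlist to produce w."""
    return {hash_fn(name) for name in watchlist_names}


def data_holder_report(customers, w):
    """Data holder side: hash local customers and return only the matching hashes m."""
    customer_hashes = {hash_fn(name) for name in customers}
    return customer_hashes & w


w = government_prepare(["Bob Sample", "Dana Hypothetical"])
m = data_holder_report(["Alice Example", "Bob Sample", "Carol Placeholder"], w)
print(len(m), "match(es) reported to the government")
```

In this sketch only m leaves the data holder, but the data holder ends up in possession of both hash_fn and w.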
In this scenario, what are the privacy concerns? What are the risks to individuals not on the watchlist? What are the risks to the data holders? What are the legal issues?
What are problems the government may face by relying on the data holders? What are risks associated with providing the hash function and the hashed watchlist to the data holders?
Recall our older measure of risk -- the product of the number of people identified by the data and the amount of information provided. How does risk in this model compare with that in model 1?
Are there any technical remedies that could be applied to this model? Are there any legal remedies that could assist?
Model 2b: Data holders compute matches locally using a secure "appliance"
In this model, the hash function and the hashed values of the watchlist (w) are provided to data holders, who in turn compute the hashed values for their customers and report any matches (m) to the government, as was done in Model 2a above. But in this case, the computation is performed, and the contents of the watchlist are held, by an "appliance": a secure, stand-alone, tamper-resistant device with no visible outside controls and no connections other than a power cable and a network connection jack.
Consider the questions above related to Model 2a. What changes, if anything, by having the computation performed by a secure appliance?
Model 3: Use of a Trusted Third Party
In this model, the concept of a "trusted" third party is added. This third party may be Jonas' company or a government entity. The explicitly identified information is provided by the data holders to the third party. The explicitly identified watchlist is also provided to the third party. The hashing and comparison are done by the third party, and any matches are reported to government agencies.
In the news accounts of the system, a search warrant is needed by the government agency to get the identity of the matched person, but is that really needed? Is that actually enforced by the system?
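One way to picture the third-party flow, including the warrant step mentioned in the news accounts, is the sketch below. Everything here is hypothetical: the ThirdParty class, its method names, the has_warrant flag, the assumption that matches are reported as hashed IDs, and SHA-256 as the hash. It is not a description of how the purchased system is actually built.

```python
import hashlib


def hash_fn(identity):
    """Assumed hash function: SHA-256 over a normalized name."""
    return hashlib.sha256(identity.strip().lower().encode("utf-8")).hexdigest()


class ThirdParty:
    """Hypothetical trusted third party that hashes, compares, and reports."""

    def __init__(self, watchlist_names):
        self._watch_hashes = {hash_fn(n) for n in watchlist_names}
        self._identities = {}  # hashed ID -> explicit identity, held by the third party

    def receive(self, customer_names):
        """Accept explicitly identified data from a data holder."""
        for name in customer_names:
            self._identities[hash_fn(name)] = name

    def report_matches(self):
        """Report matches to the government as hashed IDs (an assumption)."""
        return sorted(self._watch_hashes & set(self._identities))

    def reveal(self, hashed_id, has_warrant):
        """Release an identity; the warrant check is policy code, nothing more."""
        if not has_warrant:
            raise PermissionError("warrant required")
        return self._identities[hashed_id]


tp = ThirdParty(["Bob Sample"])
tp.receive(["Alice Example", "Bob Sample"])
matches = tp.report_matches()
print(len(matches), "match(es) reported")
print(tp.reveal(matches[0], has_warrant=True))
```

In this sketch the warrant requirement is a single conditional inside the third party's own code, and the third party holds every customer's identity regardless of whether a match occurs.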
In this scenario, what are the privacy concerns? What are the risks to individuals not on the watchlist? What are the risks to the data holders? What are the legal issues? Is there any protection against using the data for other purposes?
What additional information about those not on the watchlist can be gained by the government?
Recall our older measure of risk -- the product of the number of people identified by the data and the amount of information provided. How significant is the risk in this scenario? Who is at risk?
Are there any technical remedies that could be applied to this model? Are there any legal remedies that could assist?
In this lab, you have examined a system that is being purchased by the U.S. government to perform privacy-preserving surveillance. Class discussion centered on the analysis of various models of the system, as shown above. For this assignment, identify a privacy problem found in this system and write a summary of your assessment of the privacy problem. You may elect to use any of the models described above, or you may pose your own as the basis for demonstrating the system's weakness in providing sufficient privacy protection. Provide a technical basis for your conclusions about the system's inappropriateness. Show how re-identifications (the ability to re-identify sufficient information about the subjects of the information as to enable contact) are possible. Is there an easy remedy? Explain why there is or is not an easy remedy to the problem found with the system.
Ground your discussion by analyzing the system's ability to solve the watchlist problem in such a way that the identities of U.S. citizens cannot be determined without legal review. The Central Intelligence Agency (CIA) is the group within the government that is purchasing the system. Privacy problems emerge in the CIA's use because, in general, the CIA cannot perform surveillance on identified U.S. citizens without a legal process granting narrowly specific permission. The goal of the proposed system is to provide a way of tracking people, including U.S. citizens, but to do so in such a way that, during typical operation, the identities of the citizens cannot be determined by the CIA. If suspicious behavior (the appearance of a known person on the watchlist) is found, the CIA should then be able to present the evidence for legal review in order to receive permission for the identities of those involved in the suspicious behavior to be known, and possibly for further information to be obtained from the appropriate data holders. In this assignment, you will assess to what extent the proposed model(s) fail to achieve these goals.
Your write-up should be a one-page abstract in the traditional format.
Email solutions to padlab@privacy.cs.cmu.edu.
Submit either a Word document or a PDF file. Solutions (as PDF files) will be posted on-line.
Assignment (Due Friday 4/2/2004 9am)
Spring 2004
Privacy and Anonymity in Data
Professor: Latanya Sweeney, Ph.D.
[latanya@privacy.cs.cmu.edu]