Carnegie Mellon University

Data Privacy Center

Data Privacy Course


Lab 11: The Identifiability of Anonymized Data




Objective

The objective of this lab is to give you experience working with anonymized data and to show how attempts at anonymization affect both the quality of the resulting data and the protection provided to individuals.


Overview of activities

The Cambridge Voter list has been loaded into Dataville and made available for your use.


Part I. Base Identification

A sample of 12 records of medical information has been provided. Each patient is a hypothetical subject whose demographics match a person in the Cambridge Voter list. The following website shows what happens to the data under various release strategies.

First, identify in the table each pair of records whose demographics match. Then write down how many possible people match each person in the original data by completing the following table.

Row    Date of birth    Gender    ZIP    Matching record #
1    
2    
3    
4    
5    
6    
7    
8    
9    
10    
11    
12    

Send your answers as an email message to padlab@privacy.cs.cmu.edu. In the subject, write: "Lab#11 pairs". The body of the message should contain your answers.


Part II. Protection promised

For each of the results provided for the claims data, determine the naive identifiability of each person. As you may recall, to determine the naive identifiability of the information, you use the pigeonhole principle to estimate the number of possible candidates that could match those demographics. In this case, you will want to run queries against the Dataville database to determine the overall gross totals.

Row    Date of birth    Gender    ZIP    Naive identifiability
1    
2    
3    
4    
5    
6    
7    
8    
9    
10    
11    
12    
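
The pigeonhole estimate described in Part II can be sketched in Python as follows. This is only an illustration under stated assumptions: the handout does not give the Dataville schema, so the `voters` table and its `dob`, `gender`, and `zip` columns are hypothetical stand-ins, the sample rows are made up, and the count of distinct value combinations (365 days x 100 birth years x 2 genders) is an assumed value space rather than a figure from the course. The estimate is taken here to be the gross total for a ZIP divided by the number of possible value combinations.

```python
import sqlite3

# Hypothetical stand-in for the Dataville voter list; the real table and
# column names may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE voters (dob TEXT, gender TEXT, zip TEXT)")
conn.executemany(
    "INSERT INTO voters VALUES (?, ?, ?)",
    [
        ("1965-03-14", "F", "02138"),
        ("1965-03-14", "F", "02138"),
        ("1971-07-02", "M", "02138"),
        ("1980-11-30", "F", "02139"),
    ],
)


def gross_total(conn, zip_code):
    """Count everyone in the voter list who lives in the given ZIP."""
    row = conn.execute(
        "SELECT COUNT(*) FROM voters WHERE zip = ?", (zip_code,)
    ).fetchone()
    return row[0]


def naive_candidates(population, distinct_values):
    """Pigeonhole estimate: average number of people per value combination."""
    return population / distinct_values


# Assumed value space for a full date of birth plus gender.
BINS_FULL_DOB_GENDER = 365 * 100 * 2

population = gross_total(conn, "02138")
estimate = naive_candidates(population, BINS_FULL_DOB_GENDER)
print(f"ZIP 02138: {population} people, about {estimate:.5f} candidates per bin")
```

An estimate below 1 suggests there are more possible value combinations than people, so a given combination is likely to point to at most one person. When a release generalizes a field (for example, year of birth only), the number of bins shrinks and the estimate grows.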

Send your answers as an email message to padlab@privacy.cs.cmu.edu. In the subject, write: "Lab#11 identifiability". The body of the message should contain your answers.


Part III. Actual Protection

Now, for each record in the claims data, determine the number of people who actually match the information based on the Cambridge Voter list. In this case, you will want to run queries against the Dataville database to determine the number of possible matches based on the given demographics.

Row    Date of birth    Gender    ZIP    Number of matching people
1    
2    
3    
4    
5    
6    
7    
8    
9    
10    
11    
12    
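
Counting the actual matches for Part III amounts to one counting query per record. The sketch below shows one possible shape for that query in Python. The `voters` table, its column names, and the sample rows are hypothetical, since the handout does not describe the Dataville schema. A field that a release strategy has suppressed is passed as `None` and simply left out of the condition.

```python
import sqlite3

# Hypothetical stand-in for the Dataville voter list; the real table and
# column names may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE voters (dob TEXT, gender TEXT, zip TEXT)")
conn.executemany(
    "INSERT INTO voters VALUES (?, ?, ?)",
    [
        ("1965-03-14", "F", "02138"),
        ("1965-03-14", "F", "02138"),
        ("1971-07-02", "M", "02138"),
        ("1980-11-30", "F", "02139"),
    ],
)


def count_matches(conn, dob=None, gender=None, zip_code=None):
    """Count voters matching every demographic value that was released."""
    clauses, params = [], []
    for column, value in (("dob", dob), ("gender", gender), ("zip", zip_code)):
        if value is not None:  # a suppressed field places no constraint
            clauses.append(f"{column} = ?")
            params.append(value)
    where = " AND ".join(clauses) if clauses else "1 = 1"
    query = f"SELECT COUNT(*) FROM voters WHERE {where}"
    return conn.execute(query, params).fetchone()[0]


# Full demographics versus a release that suppresses the date of birth.
print(count_matches(conn, dob="1965-03-14", gender="F", zip_code="02138"))
print(count_matches(conn, gender="F", zip_code="02138"))
```

Comparing these counts with the naive estimates from Part II shows how far the protection promised differs from the protection actually provided.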

Send your answers as an email message to padlab@privacy.cs.cmu.edu. In the subject, write: "Lab#11 matches". The body of the message should contain your answers.


Part IV. Usefulness

You have been asked to statistically answer the following questions based on the data in the problem lists.

"Are the problems associated with heart disease more prevalent in one race than another, in one ZIP code more than another, or within one gender more than the other?"

To answer these questions, look at the original data and contrast its results with those you get from each of the anonymized tables for the problem list.
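
One way to carry out this comparison is to compute the prevalence of the condition within each group, once for the original data and once for each anonymized table. The Python sketch below illustrates the idea. The records, the field names (`gender`, `zip`, `problem`), and the ZIP generalization shown are all hypothetical and are not taken from the lab data.

```python
from collections import defaultdict


def prevalence_by(records, field, condition="heart disease"):
    """Return the fraction of records with the condition, per value of field."""
    totals = defaultdict(int)
    cases = defaultdict(int)
    for record in records:
        group = record[field]
        totals[group] += 1
        if record["problem"] == condition:
            cases[group] += 1
    return {group: cases[group] / totals[group] for group in totals}


# Made-up records standing in for the original problem list.
original = [
    {"gender": "F", "zip": "02138", "problem": "heart disease"},
    {"gender": "F", "zip": "02139", "problem": "asthma"},
    {"gender": "M", "zip": "02138", "problem": "heart disease"},
    {"gender": "M", "zip": "02139", "problem": "heart disease"},
]

# Assumed release strategy: generalize the ZIP to its first four digits.
anonymized = [dict(r, zip=r["zip"][:4] + "*") for r in original]

print(prevalence_by(original, "zip"))
print(prevalence_by(anonymized, "zip"))
print(prevalence_by(anonymized, "gender"))
```

In this made-up example the generalized ZIP merges both areas into a single group, so the difference between ZIP codes visible in the original data can no longer be measured, while the comparison by gender is unaffected.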

Send your answers as an email message to padlab@privacy.cs.cmu.edu. In the subject, write: "Lab#11 usefulness". The body of the message should contain your answers.


Fall 2003 Privacy and Anonymity in Data
Professor: Latanya Sweeney, Ph.D. [latanya@privacy.cs.cmu.edu]