Carnegie Mellon University

Data Privacy Center

Data Privacy Course


Lab #8a: Linking Files




Objective

The objective of this lab is for you to get some experience linking files and working with disparate pieces of data.

Considerations that should come to mind:


Overview of activities

The in-class activities for this lab involve linking separate sources of information to learn more about the people described in them.


Part I. Brandon's medical record (15 minutes)

For the first part of this lab, you will locate, from publicly (and semi-publicly) available information, part of the medical record of a real child, whose name is Brandon Steele.

Suppose one of our friends, Alice, from XYZ University was reading the following news article about four children who had cancer. Alice is very curious about one of the children, Brandon Steele, and wants to know more specifics about his medical condition. Here is the article Alice read.

news article

In the article, Alice learned that "Brandon Steele of Taylorville, Illinois was diagnosed with neuroblastoma in August 1991 and later died." Alice wants to know more about Brandon's medical history.

After the first labs in this course, some data sets come to mind immediately. Links are provided below. Take a few minutes and investigate Brandon using the links provided. Find "something" about his medical history that is not contained in the article. That should be easy, because not much more detail is provided about Brandon in the article.

When you find a fact or two about Brandon's medical record, send an email message to padlab@privacy.cs.cmu.edu. In the subject, write: "Lab#8a Brandon". The body of the message should contain the fact or two you learned. Keep a copy of the information so we can share it in class.

Discussion thought: Should Alice be able to find out this kind of information so easily?

Later in this course we will study ways to obtain this kind of information automatically.


Part II. Probabilistic Linkage -- why people died

In this activity you will learn how people died. That is, you will determine the cause of death of named individuals.

Consider the Social Security death indices that are on-line. Note: the on-line indices do not cover exactly the same people. Nevertheless, using the on-line death indices and the hospital information below, tell me the names of some of these people and how they died.

Activity 2-1.

Start by making a list of the Social Security death indices you will use. Find the URL for each death index. The ones used in earlier labs are listed here, but you may find others to be more useful: http://www.ancestry.com/search/rectype/vital/ssdi/main.htm and http://ssdi.genealogy.rootsweb.com/.


Activity 2-2.

Below is a sample drawn from some hospital discharge data. Each of these records describes a person who died in the hospital. For some of these records, report the person's name, a description of the disease from which they died, and any other available information.

Send your answers as an email message to padlab@privacy.cs.cmu.edu. In the subject, write: "Lab#8a deaths described". The body of the message should contain your answers.


Part III. Assignment (Due Monday 11/22/2004 9am)

For each of the records in Part II above, you will compute how many people match the criteria of that record and report your findings. Specifically, you will record how many people in the death records match the criteria appearing in the health data and generate a plot of identifiability. You may elect to perform these matches manually or by some semi-automated or automated means of your own design. Below are some steps to get you started.

  1. The "criteria for matching" is the list of fields on which you base your decision as to whether there is a match. Which fields in the health data are you using for matching? Which fields in the death data are you using for matching?

  2. Using reasonably broad criteria for matching, look up possible hits for the health records assigned to you below. Search the death data for matches to each of your assigned health records. Provide an Excel spreadsheet containing the records you found. The fields of the Excel spreadsheet must be: Name, Date of Birth, Date of Death, ZIP, SSN. The fields must appear exactly in this order.

    If there are multiple matches in death data, each match will be a row in the spreadsheet. If there are no matches in the death data, there will be no rows appearing for that search in the resulting spreadsheet.

    Submit your search results in the described spreadsheet by 9am Friday, 11/19/2004. Send your spreadsheet as an attachment to padlab@privacy.cs.cmu.edu. We will combine your results with those from your classmates to make a master death data list that you can use for the remainder of this assignment. Below are the records in the health data for which you must provide matches. Redundancy has been purposefully built in: more than one student will provide an answer for each group of records. The record rows correspond to rows in the Excel sheet at http://privacy.cs.cmu.edu/courses/pad1/assign/lab8a/part2/deaths.xls.

    Health record rows    Student
    2-21                  Born
    2-21                  Chang
    22-41                 Forges
    22-41                 Gaustad
    42-61                 Goodman
    42-61                 Gwynn
    62-81                 Hum
    62-81                 Imrhan
    82-101                Johnson
    82-101                Kannan
    102-121               Kim
    102-121               Lemmon
    122-141               Lim
    122-141               Liu
    142-161               Lynn
    142-161               Mirochnik
    162-181               Nussdorfer
    162-181               Pawson
    182-201               Pennock
    182-201               Pickett

  3. The master death data is the information that results from compiling all student solutions from the previous step (removing any duplicates). This file is available in Excel, tab-delimited text, and HTML formats. You can use this for the remainder of this assignment.

  4. Compute the number of matches you get for each of the records in the health data, when matched against the master death data. For example, {} may match only 1 person, so you would record a match of 1 for that record. The number of people that match a record is called the record's "bin size." Record the number of matches (bin size) for each record in the health data in an Excel spreadsheet.

  5. Plot your results. The x-axis is the "bin size," which is the number of people matching the criteria. This is the number of people to whom a specific health data record may refer.

    The y-axis is the number of records in the health data that have the same bin size. As the bin size increases, the number of records having that bin size is expected to decrease.
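If you choose an automated approach, steps 3 through 5 can be sketched roughly as follows. This is only an illustration under stated assumptions, not a required method. The death-data field names come from step 2. The health-data field names (`birth_date`, `death_date`, `zip`), the exact-equality matching rule, and all sample rows are hypothetical and invented for this sketch; you should substitute the criteria you chose in step 1.

```python
from collections import Counter

# Column order required by the assignment for death-data rows (step 2).
DEATH_FIELDS = ("Name", "Date of Birth", "Date of Death", "ZIP", "SSN")


def dedupe(rows):
    """Step 3: drop exact duplicate death rows, keeping first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in DEATH_FIELDS)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def matches(health, death):
    """Hypothetical matching criteria: equal birth date, death date, and ZIP.

    The health-record keys used here are assumptions, not the real schema.
    """
    return (
        health["birth_date"] == death["Date of Birth"]
        and health["death_date"] == death["Date of Death"]
        and health["zip"] == death["ZIP"]
    )


def bin_sizes(health_records, death_rows):
    """Step 4: for each health record, count the matching death rows."""
    return [
        sum(1 for death in death_rows if matches(health, death))
        for health in health_records
    ]


def bin_size_distribution(sizes):
    """Step 5: (bin size, number of health records with that bin size) pairs."""
    return sorted(Counter(sizes).items())


# Invented sample data for illustration only; these are not real people.
death_rows = [
    {"Name": "Person A", "Date of Birth": "1920-01-01",
     "Date of Death": "2000-05-05", "ZIP": "15213", "SSN": "000-00-0001"},
    {"Name": "Person A", "Date of Birth": "1920-01-01",
     "Date of Death": "2000-05-05", "ZIP": "15213", "SSN": "000-00-0001"},
    {"Name": "Person B", "Date of Birth": "1920-01-01",
     "Date of Death": "2000-05-05", "ZIP": "15213", "SSN": "000-00-0002"},
    {"Name": "Person C", "Date of Birth": "1930-02-02",
     "Date of Death": "2001-06-06", "ZIP": "15217", "SSN": "000-00-0003"},
]
health_records = [
    {"birth_date": "1920-01-01", "death_date": "2000-05-05", "zip": "15213"},
    {"birth_date": "1930-02-02", "death_date": "2001-06-06", "zip": "15217"},
    {"birth_date": "1940-03-03", "death_date": "2002-07-07", "zip": "15232"},
]

master = dedupe(death_rows)
sizes = bin_sizes(health_records, master)
distribution = bin_size_distribution(sizes)

for size, count in distribution:
    print(f"bin size {size}: {count} health record(s)")
```

The pairs in `distribution` are the points to plot: bin size on the x-axis, number of health records on the y-axis. Broader criteria (for example, matching on year of birth instead of the full date) would go in `matches` and would generally produce larger bins.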

Consider your findings. Write a 3-page report on the experiment you just conducted. Your write-up should include the traditional sections: Abstract, Introduction (explain why this experiment is important), Background (describe sharing practices), Methods (describe your experiment, be precise), Results (include your diagram and report), and Discussion (explain what was important about what you demonstrated).

Submit your report by email to padlab@privacy.cs.cmu.edu. A copy will be placed on-line for review. Also include an Excel spreadsheet showing the bin sizes you found.

Student Solutions


Fall 2004 Privacy and Anonymity in Data
Professor: Latanya Sweeney, Ph.D. [latanya@privacy.cs.cmu.edu]