Trails Learning Project

Betrayed by My Shadow: Learning Data Identity via Trail Matching

by Bradley Malin

Abstract

The term “re-identification” refers to correctly linking seemingly anonymous data to explicitly identifying information, such as the name or address of the people who are the subjects of the data. Historically, methods for re-identification have been based on data released from a single data holder. This paper extends the concept to trail re-identification, in which a person is related to seemingly anonymous data left behind at multiple locations; together, these locations form the data’s trail. The main premise behind these methods is that some locations capture, in addition to seemingly anonymous data, an individual’s explicitly identifying information and subsequently provide separate data releases of the unidentified and identified data. A single location’s releases appear unrelated; however, when multiple locations make such releases, common patterns in the trails of the two types of data can be used to discover relationships between them. The algorithms presented herein differ in the degree of completeness and multiplicity they assume in the data. We report experiments and successful re-identifications of IP addresses to online users and households. This work provides a foundation for several new research directions, including methods for learning identity and additional information across disparate datasets, as well as methods that enable data holders to share information with guarantees of anonymity.
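The trail idea described above can be illustrated with a minimal sketch. It is not one of the paper's algorithms: it assumes complete data (every location releases both an identified and a de-identified record for each visitor) and links an item to a name only when both have the same trail and no other item or name shares it. The function names, location labels, user names, and IP addresses are all hypothetical.

```python
from collections import defaultdict


def build_trails(releases):
    """Map each released item to the set of locations that released it."""
    trails = defaultdict(set)
    for location, items in releases.items():
        for item in items:
            trails[item].add(location)
    return {item: frozenset(locs) for item, locs in trails.items()}


def group_by_trail(trails):
    """Invert item -> trail into trail -> list of items sharing that trail."""
    groups = defaultdict(list)
    for item, trail in trails.items():
        groups[trail].append(item)
    return groups


def match_unique_trails(identified, deidentified):
    """Link a de-identified item to a name when their trails are equal
    and that trail is unique on both sides (complete-data assumption)."""
    names_by_trail = group_by_trail(build_trails(identified))
    items_by_trail = group_by_trail(build_trails(deidentified))
    matches = {}
    for trail, items in items_by_trail.items():
        names = names_by_trail.get(trail, [])
        if len(items) == 1 and len(names) == 1:
            matches[items[0]] = names[0]
    return matches


# Hypothetical releases: each location publishes two unlinked lists.
identified = {
    "site_a": {"alice", "bob"},
    "site_b": {"alice", "carol"},
    "site_c": {"bob", "carol", "dave", "eve"},
}
deidentified = {
    "site_a": {"192.0.2.1", "192.0.2.2"},
    "site_b": {"192.0.2.1", "192.0.2.3"},
    "site_c": {"192.0.2.2", "192.0.2.3", "192.0.2.4", "192.0.2.5"},
}

matches = match_unique_trails(identified, deidentified)
```

In this toy data, three addresses are linked to alice, bob, and carol because each of their trails is unique. The two addresses seen only at site_c stay unlinked, since dave and eve share the same trail and cannot be told apart.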

Keywords: Privacy, Anonymity, Data Mining, Data Sharing, Distributed Databases, Online Privacy

Citation:
B. Malin. Betrayed by My Shadow: Learning Data Identity via Trail Matching. Journal of Privacy Technology. 2005; 20050609001.

Fall 2004 Data Privacy Laboratory [LIDAP@privacy.cs.cmu.edu]