Social Networks Project

Unsupervised Name Disambiguation via Social Network Similarity

by Bradley Malin

Abstract

Though names reference actual entities, it is nontrivial to resolve which entity a particular name observation represents. Even when names are devoid of typographical error, the resolution process is confounded both by ambiguity, where the same name correctly references multiple entities, and by variation, where an entity is correctly referenced by multiple names. Thus, before link analysis for surveillance or intelligence-gathering purposes can proceed, it is necessary to ensure that the vertices and edges of the network are correct. In this paper, we concentrate on ambiguity and investigate unsupervised methods that simultaneously learn 1) the number of entities represented by a particular name and 2) which observations correspond to the same entity. The disambiguation methods leverage the fact that an entity’s name can be listed in multiple sources, each with a number of related entities’ names, which permits the construction of name-based relational networks. The methods studied in this paper differ in the type of network similarity exploited for disambiguation. The first method relies upon exact name similarity and employs hierarchical clustering of sources, where each source is considered a local network. In contrast, the second method employs a less strict requirement, community similarity, by using random walks between ambiguous observations on a global social network constructed from all sources. While both methods outperform simple baselines on a subset of the Internet Movie Database, the findings suggest that methods which measure similarity based on community, rather than exact, similarity provide more robust disambiguation capability.
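To make the first method concrete, the following is a minimal illustrative sketch of clustering sources by exact name overlap. It is not the paper's implementation: the function names, the Jaccard similarity measure, the merge rule (union of co-occurring names), the stopping threshold, and the toy cast lists are all assumptions chosen for illustration.

```python
from itertools import combinations


def jaccard(a, b):
    """Exact-name overlap between two sets of co-occurring names."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_sources(sources, ambiguous_name, threshold=0.2):
    """Agglomeratively merge the sources that mention `ambiguous_name`.

    sources: dict mapping a source id to the set of names listed in it.
    Returns a list of clusters (sets of source ids). Each cluster is read
    as one hypothesized entity, so the number of clusters is the estimated
    number of entities sharing the name.
    The threshold of 0.2 is an arbitrary, assumed stopping point.
    """
    # Local network of a source: the names listed alongside the ambiguous one.
    clusters = [
        ({sid}, names - {ambiguous_name})
        for sid, names in sources.items()
        if ambiguous_name in names
    ]
    while len(clusters) > 1:
        # Pick the most similar pair of clusters.
        (i, j), best = max(
            (((i, j), jaccard(clusters[i][1], clusters[j][1]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break  # remaining clusters are too dissimilar to merge
        merged = (clusters[i][0] | clusters[j][0],
                  clusters[i][1] | clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return [ids for ids, _ in clusters]


# Hypothetical toy data: four cast lists that all contain "John Smith".
toy_sources = {
    "m1": {"John Smith", "Alice Ray", "Bob Kane"},
    "m2": {"John Smith", "Alice Ray", "Carl Dee"},
    "m3": {"John Smith", "Dana Fox", "Eve Hill"},
    "m4": {"John Smith", "Dana Fox", "Finn Gale"},
    "m5": {"Alice Ray", "Bob Kane"},  # does not mention the ambiguous name
}
result = cluster_sources(toy_sources, "John Smith")
print(sorted(sorted(c) for c in result))
```

On this toy input the sources split into two groups, suggesting two distinct entities named "John Smith". Because the similarity here depends on exactly matching names, two sources for the same entity that share no co-occurring names would never merge, which is the limitation the community-based method is meant to relax.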

Keywords: Disambiguation, Social Networks, Random Walks, Multiclass Clustering

Citation:
Bradley Malin. Unsupervised Name Disambiguation via Social Network Similarity. Workshop on Link Analysis, Counterterrorism, and Security, at the 2005 SIAM International Conference on Data Mining, Newport Beach, CA, 2005.

Related Links


Summer 2004 Data Privacy Lab