Unsupervised Name Disambiguation via Social Network Similarity
by Bradley Malin
Abstract
Though names reference actual entities, it is nontrivial to resolve which entity a particular name observation represents. Even when names are devoid of typographical error, the resolution process is confounded both by ambiguity, where the same name correctly references multiple entities, and by variation, where an entity is correctly referenced by multiple names. Thus, before link analysis for surveillance or intelligence-gathering purposes can proceed, it is necessary to ensure that the vertices and edges of the network are correct. In this paper, we concentrate on ambiguity and investigate unsupervised methods that simultaneously learn 1) the number of entities represented by a particular name and 2) which observations correspond to the same entity. The disambiguation methods leverage the fact that an entity’s name can be listed in multiple sources, each with a number of related entities’ names, which permits the construction of name-based relational networks. The methods studied in this paper differ in the type of network similarity exploited for disambiguation. The first method relies on exact name similarity and employs hierarchical clustering of sources, where each source is considered a local network. In contrast, the second method employs a less strict similarity requirement, community similarity, by using random walks between ambiguous observations on a global social network constructed from all sources. While both methods outperform simple baselines on a subset of the Internet Movie Database, the findings suggest that methods which measure similarity based on community, rather than exact, similarity provide more robust disambiguation capability.
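To make the two similarity notions concrete, the following is a minimal illustrative sketch, not the paper's implementation. Everything specific in it is an assumption made for illustration: the Jaccard overlap measure, average linkage, the merge threshold, the walk length, and the names `cluster_sources`, `build_graph`, and `hit_probability` are all hypothetical, as is the toy data. The first function clusters the sources that mention an ambiguous name by exact overlap of co-occurring names. The other two functions score a pair of ambiguous observations by the probability that a short uniform random walk on the global co-occurrence graph travels from one to the other.

```python
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Exact-name overlap between two sets of co-occurring names (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_sources(sources, name, threshold=0.2):
    """Agglomerative clustering of the sources that mention `name`.

    Each source is a local network; clusters are merged (average linkage,
    an assumption) until the best similarity falls below `threshold`.
    The number of clusters returned is the estimated number of entities.
    """
    contexts = {sid: set(names) - {name}
                for sid, names in sources.items() if name in names}
    clusters = [[sid] for sid in sorted(contexts)]

    def linkage(c1, c2):
        sims = [jaccard(contexts[a], contexts[b]) for a in c1 for b in c2]
        return sum(sims) / len(sims)

    while len(clusters) > 1:
        (i, j), best = max(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda pair: pair[1])
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


def build_graph(sources, name):
    """Global co-occurrence graph; each observation of `name` is its own node."""
    graph = defaultdict(set)
    for sid, names in sources.items():
        nodes = [(n, sid) if n == name else n for n in names]
        for u, v in combinations(nodes, 2):
            graph[u].add(v)
            graph[v].add(u)
    return graph


def hit_probability(graph, start, target, steps=3):
    """Probability that a uniform random walk from `start` reaches `target`
    within `steps` moves (the target is treated as absorbing)."""
    mass = {start: 1.0}
    hit = 0.0
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, p in mass.items():
            for nb in graph[node]:
                nxt[nb] += p / len(graph[node])
        hit += nxt.pop(target, 0.0)  # absorb walkers that arrived at the target
        mass = nxt
    return hit


# Hypothetical toy data: four sources that all list the ambiguous name.
sources = {
    "movie1": {"John Smith", "Alice", "Bob"},
    "movie2": {"John Smith", "Alice", "Carol"},
    "movie3": {"John Smith", "Dave", "Erin"},
    "movie4": {"John Smith", "Erin", "Frank"},
}
clusters = cluster_sources(sources, "John Smith")
graph = build_graph(sources, "John Smith")
```

On this toy data, `cluster_sources` groups `movie1` with `movie2` and `movie3` with `movie4`, suggesting two distinct entities. The walk from the `movie1` observation reaches the `movie2` observation with nonzero probability but never reaches the `movie3` observation. Unlike the exact overlap score, the walk can also connect observations that share no co-occurring name directly but are linked through intermediate names, which is the intuition behind community similarity.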
Keywords: Disambiguation, Social Networks, Random Walks, Multiclass Clustering
Citation:
Bradley Malin.
Unsupervised Name Disambiguation via Social Network Similarity.
Workshop on Link Analysis, Counterterrorism, and Security, at the 2005 SIAM International Conference on Data Mining,
Newport Beach, CA, 2005.