Abstract—Similarity join is a useful primitive operation underlying many applications, such as near duplicate Web page detection, data integration, and pattern recognition. Traditional similarity joins require a user to specify a similarity threshold. In this paper, we study a variant of the similarity join, termed top-k set similarity join. It returns the top-k pairs of records ranked by their similarities, thus eliminating the guess work users have to perform when the similarity threshold is unknown before hand.
An algorithm, topk-join, is proposed to answer top-k similarity join efficiently. It is based on the prefix filtering principle and employs tight upper bounding of similarity values of unseen pairs. Experimental results demonstrate the efficiency of the proposed algorithm on large-scale real datasets.
相似度连接运算是一个有效的基础性操作在很多应用中,比如说重复网页页面消除、数据集成、模式识别等。简单的相似度连接运算的运行需要用户指定简单阈值。本文中我们研究了各种相似度连接运算,检测Top-k集相似连接运算。通过他们的相似程度得到了了Top-K对的记录等级,从而消除了访问者在简单阈值未知情况下必需完成的工作。
Topk-join算法的目的是为了研究Top-k的相关连接运算效果。它是基于前缀过滤原理并且利用了对未知对的相似值的严格上界约束。实验的结果在大规模实时数据集中证明了算法的设计效率。