Top-k　Set　Similarity　Joins　摘要翻译--梦飞翔的地方(梦翔天空)

载入中。。。 'S bLog

载入中。。。

Top-k Set Similarity Joins 摘要翻译

[ 2009/7/24 11:44:00 | By: 梦翔儿 ]

Abstract—Similarity join is a useful primitive operation underlying many applications, such as near duplicate Web page detection, data integration, and pattern recognition. Traditional similarity joins require a user to specify a similarity threshold. In this paper, we study a variant of the similarity join, termed top-k set similarity join. It returns the top-k pairs of records ranked by their similarities, thus eliminating the guess work users have to perform when the similarity threshold is unknown before hand.
An algorithm, topk-join, is proposed to answer top-k similarity join efficiently. It is based on the prefix filtering principle and employs tight upper bounding of similarity values of unseen pairs. Experimental results demonstrate the efficiency of the proposed algorithm on large-scale real datasets.

相似度连接运算是一个有效的基础性操作在很多应用中,比如说重复网页页面消除、数据集成、模式识别等。简单的相似度连接运算的运行需要用户指定简单阈值。本文中我们研究了各种相似度连接运算，检测Top-k集相似连接运算。通过他们的相似程度得到了了Top-K对的记录等级，从而消除了访问者在简单阈值未知情况下必需完成的工作。

Topk-join算法的目的是为了研究Top-k的相关连接运算效果。它是基于前缀过滤原理并且利用了对未知对的相似值的严格上界约束。实验的结果在大规模实时数据集中证明了算法的设计效率。

阅读全文 | 回复(0) | 引用通告 | 编辑

标签：Top－k Set Similarity Joins

上一篇：winxp vista/win7 有线网卡，无线网卡共享上网的方案
下一篇：数据库技术领域的发展趋势（转）

发表评论：

梦翔儿网站梦飞翔的地方 http://www.dreamflier.net
中华人民共和国信息产业部TCP/IP系统备案序号：辽ICP备09000550号