Top-k typicality queries and efficient query answering methods on large databases 摘要粗译
Abstract Finding typical instances is an effective approach to understand and analyze large data sets. In this paper, we apply the idea of typicality analysis from psychology and
cognitive science to database query answering, and study the novel problem of answering top-k typicality queries.We model typicality in large data sets systematically. Three types of top-k typicality queries are formulated. To answer questions like “Who are the top-k most typical NBA players?”,the measure of simple typicality is developed. To answer questions like “Who are the top-k most typical guards distinguishing guards from other players?”, the notion of discriminative typicality is proposed. Moreover, to answer questions like “Who are the best k typical guards in whole representing different types of guards?”, the notion of representative typicality is used. Computing the exact answer to a top-k typicality query requires quadratic time which is often too costly for online query answering on large databases.We develop a series of approximation methods for various situations: (1) the randomized tournament algorithm has linear complexity though it does not provide a theoretical guarantee on the quality of the answers; (2) the direct local typicality approximation using VP-trees provides an approximation quality guarantee; (3) a local typicality tree data structure can be exploited to index a large set of objects. Then,typicality queries can be answered efficiently with quality guarantees by a tournament method based on a Local Typicality Tree. An extensive performance study using two real data sets and a series of synthetic data sets clearly shows that top-k typicality queries are meaningful and our methods are practical.
摘要 查询代表性的实例是一个理解和分析大型数据集的有效办法。本文我们实现了针对物理和理科数据查询回问题进行代表性分析,并且研究了一种新的top-k应答典型查询问题。我们系统地建立了大型数据集的模型,形式化定义描述了三种典型的top-k查询方法。为了回答“谁是最典型的NBA球员”问题,开发了简单典型查询方法;为了回答类似于“谁是和其他球员相比表现最显著的后卫?”的问题,提了典型差别的概念‘此外,为了回答“在所有类型的后卫中,谁是最典型的代表性的后卫?”提出了典型代表概念。计算top-k典型查询的精确解需要二次的时间复杂度,这常常会导致大型数据库的在线查询代价非常的昂贵。我们开发了系列的多种形式的近似的方法:(1)随机竞赛算法,它有着线性复杂度,但是在理论上并不能提供精确解。(2)直接局部典型结构可以被用来索引大型对象集,而通过基于局部典型树的竞赛方法典型查询可以有效的高质量确定性的求解。我们在两个真实数据集和一系列的综合数据集中做了广泛的性能研究,结果表明top-k典型查询是有深远意义的,而且我们的方法是可行的。