On Subspace Clustering
[ 2011/12/2 18:37:00 | By: 梦翔儿 ]
 
Cluster analysis is one of the key techniques in data mining, and clustering high-dimensional data is both its hardest and its most important problem. Subspace clustering is an effective way to cluster high-dimensional data sets: it extends traditional clustering algorithms to high-dimensional data spaces, the core idea being to localize the search to the relevant dimensions.
  By algorithmic approach, traditional clustering algorithms fall into five categories (a minimal sketch of the first category follows this list):
  • Partitioning methods: randomly partition the data set into k subsets, then use iterative relocation, moving data objects from one cluster to another, to steadily improve clustering quality.
  • Hierarchical methods: decompose the given set of data objects into a hierarchy; depending on how the hierarchy is formed, they split into agglomerative and divisive approaches.
  • Density-based methods: grow clusters according to the density of neighborhood objects or some density function, requiring each cluster to contain at least a minimum number of points within a region of given extent.
  • Grid-based methods: quantize the object space into a finite number of cells that form a grid structure on which all clustering operations are performed, which speeds clustering up considerably.
  • Model-based methods: hypothesize a model for each cluster and look for the best fit of the data to the given model.
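As a concrete instance of the partitioning category, the following minimal sketch runs k-means via scikit-learn (an assumed dependency; the blob data is made up for illustration) and prints the iterative-relocation objective it optimizes:

```python
# Minimal sketch of a partitioning method: k-means seeds k centers,
# then iteratively relocates points between clusters to reduce the
# within-cluster sum of squares.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three synthetic Gaussian blobs in 2-D (illustrative data).
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ((0, 0), (5, 5), (0, 5))])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_[:10])   # cluster index of the first 10 points
print(km.inertia_)       # the quality criterion being improved
```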
  Current research in cluster analysis concentrates on the scalability of clustering methods, the effectiveness of clustering data of complex shapes and types, techniques for high-dimensional cluster analysis, and methods for clustering mixed data. Among these, clustering high-dimensional data is the hard problem of the field, and it decides whether clustering algorithms are applicable in many domains. Traditional clustering algorithms struggle in high-dimensional data spaces; to address this, R. Agrawal first proposed the concept of subspace clustering as a way to solve the high-dimensional clustering problem.
  Traditional clustering methods run into two main problems on high-dimensional data sets (the short experiment after this list illustrates the second):
  • High-dimensional data sets contain large numbers of irrelevant attributes, which makes the probability that clusters exist across all dimensions nearly zero.
  • Data in high-dimensional space are distributed far more sparsely than in low-dimensional space, and near-equal distances between points are the norm there; since traditional clustering methods build clusters from distance, they cannot do so in high-dimensional space.
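A minimal numpy experiment (with made-up uniform data) makes the second problem visible: the relative gap between the nearest and the farthest neighbor collapses as the dimensionality grows.

```python
# As d grows, (max distance - min distance) / min distance shrinks,
# so "nearest" and "farthest" lose their meaning.
import numpy as np

rng = np.random.default_rng(42)
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))            # 500 uniform points in [0, 1]^d
    q = rng.random(d)                   # one query point
    dist = np.linalg.norm(X - q, axis=1)
    contrast = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:4d}  relative contrast = {contrast:.3f}")
```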
  Two approaches are generally used to address these problems: (1) feature transformation and (2) feature selection / subspace clustering.
  Feature selection performs the mining task only on the relevant subspaces, so it reduces dimensionality more effectively than feature transformation. It typically explores different feature subspaces with greedy strategies or other search methods and then evaluates them against some criterion to find the desired clusters.
  Subspace clustering algorithms extend the task of feature selection, attempting to discover clusters in different subspaces of the same data set. Like feature selection, subspace clustering needs a search strategy and an evaluation criterion to pick out the clusters, but because different clusters live in different subspaces, some restrictions must be placed on the evaluation criterion.
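To make the search-plus-criterion idea concrete, here is a hedged sketch of a greedy forward search over dimensions; the choice of k-means as the clusterer and the silhouette coefficient as the evaluation criterion are illustrative assumptions, not a specific published algorithm.

```python
# Greedy forward search over feature subspaces: grow the subspace one
# dimension at a time, keeping the dimension whose addition yields the
# best-scoring clustering, and stop when no dimension improves the score.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def greedy_subspace(X, k, max_dims):
    """Return column indices of a subspace in which k clusters score well."""
    selected, remaining, best = [], list(range(X.shape[1])), -1.0
    while remaining and len(selected) < max_dims:
        scored = []
        for j in remaining:
            dims = selected + [j]
            labels = KMeans(n_clusters=k, n_init=5,
                            random_state=0).fit_predict(X[:, dims])
            scored.append((silhouette_score(X[:, dims], labels), j))
        score, j = max(scored)
        if score <= best:            # no remaining dimension helps
            break
        best = score
        selected.append(j)
        remaining.remove(j)
    return selected
```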

  The chosen search strategy strongly influences the clustering result. By search direction, subspace clustering methods fall into two broad classes: top-down search strategies and bottom-up search strategies.

http://baike.baidu.com/view/2315673.htm

Clustering high-dimensional data
From Wikipedia, the free encyclopedia

Clustering high-dimensional data is the cluster analysis of data with anywhere from a few dozen to many thousands of dimensions. Such high-dimensional data spaces are often encountered in areas such as medicine, where DNA microarray technology can produce a large number of measurements at once, and the clustering of text documents, where, if a word-frequency vector is used, the number of dimensions equals the size of the dictionary.

 Problems

According to Kriegel, Kröger & Zimek (2009), four problems need to be overcome for clustering in high-dimensional data:

  • Multiple dimensions are hard to think in, impossible to visualize, and, due to the exponential growth of the number of possible values with each dimension, impossible to enumerate. This problem is known as the curse of dimensionality.
  • The concept of distance becomes less precise as the number of dimensions grows, since the distance between any two points in a given dataset converges. In particular, the discrimination between the nearest and the farthest point becomes meaningless: $\lim_{d \to \infty} \frac{\text{dist}_{\max} - \text{dist}_{\min}}{\text{dist}_{\min}} \to 0$.
  • A cluster is intended to group objects that are related, based on observations of their attributes' values. However, given a large number of attributes, some of the attributes will usually not be meaningful for a given cluster. For example, in newborn screening a cluster of samples might identify newborns that share similar blood values, which might lead to insights about the relevance of certain blood values for a disease. But for different diseases, different blood values might form a cluster, and other values might be uncorrelated. This is known as the local feature relevance problem: different clusters might be found in different subspaces, so a global filtering of attributes is not sufficient (see the small synthetic example after this list).
  • Given a large number of attributes, it is likely that some attributes are correlated. Hence, clusters might exist in arbitrarily oriented affine subspaces.
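The local feature relevance problem from the list above is easy to reproduce with made-up data: each group of points is tight in its own pair of attributes and pure noise in the others, so no global attribute filter isolates both clusters.

```python
# Two clusters living in different subspaces of a 4-D data set:
# group A is dense in attributes {0, 1}, group B in attributes {2, 3}.
import numpy as np

rng = np.random.default_rng(1)
n = 100
X = rng.random((2 * n, 4)) * 10                    # background noise
X[:n, 0:2] = rng.normal((1, 1), 0.2, (n, 2))       # group A: subspace {0, 1}
X[n:, 2:4] = rng.normal((8, 8), 0.2, (n, 2))       # group B: subspace {2, 3}

# Per-group, per-attribute standard deviation: low values mark the
# dimensions that are relevant for that group only.
print(X[:n].std(axis=0).round(2))   # small in dims 0, 1 only
print(X[n:].std(axis=0).round(2))   # small in dims 2, 3 only
```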

Recent research by Houle et al. (2010) indicates that the discrimination problems only occur when there is a high number of irrelevant dimensions, and that shared-nearest-neighbor approaches can improve results.

 Approaches

Approaches towards clustering in axis-parallel or arbitrarily oriented affine subspaces differ in how they interpret the overall goal, which is finding clusters in data with high dimensionality. This distinction is proposed in Kriegel, Kröger & Zimek (2009). A fundamentally different approach is to find clusters based on patterns in the data matrix, often referred to as biclustering, which is a technique frequently utilized in bioinformatics.

Subspace Clustering

[Figure: example 2D space with subspace clusters]

Subspace clustering is the task of detecting all clusters in all subspaces. This means that a point might be a member of multiple clusters, each existing in a different subspace. Subspaces can either be axis-parallel or affine. The term is often used synonymously with general clustering in high-dimensional data.

The figure illustrates a mere two-dimensional space where a number of clusters can be identified. In the one-dimensional subspaces, the clusters ca (in subspace {x}) and cb, cc, cd (in subspace {y}) can be found. cc cannot be considered a cluster in a two-dimensional (sub-)space, since it is too sparsely distributed in the x axis. In two dimensions, the two clusters cab and cad can be identified.

The problem of subspace clustering is given by the fact that there are 2^d different subspaces of a space with d dimensions. If the subspaces are not axis-parallel, an infinite number of subspaces is possible. Hence, subspace clustering algorithms utilize some kind of heuristic to remain computationally feasible, at the risk of producing inferior results. For example, the downward-closure property (cf. association rules) can be used to build higher-dimensional subspaces only by combining lower-dimensional ones, as any subspace T containing a cluster implies that every lower-dimensional projection S ⊆ T also contains that cluster. This approach is taken by most of the traditional algorithms, such as CLIQUE (Agrawal et al. 2005) and SUBCLU (Kailing, Kriegel & Kröger 2004); a sketch of this apriori-style enumeration follows.
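The sketch below shows the enumeration under a stated assumption: `has_cluster` is a stand-in for whatever density test the concrete algorithm applies to a candidate subspace (CLIQUE's grid density, SUBCLU's DBSCAN run).

```python
# Bottom-up subspace enumeration with downward-closure pruning: a
# (k+1)-dimensional subspace is only tested if every k-dimensional
# projection of it already contains a cluster.
from itertools import combinations

def bottom_up_subspaces(dims, has_cluster):
    """dims: iterable of dimension indices.
    has_cluster(frozenset) -> bool: algorithm-specific density test."""
    level = {frozenset([d]) for d in dims if has_cluster(frozenset([d]))}
    found = set(level)
    while level:
        k = len(next(iter(level)))
        # Join step: merge k-dim subspaces that differ in one dimension.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == k + 1}
        # Prune step: all k-dim projections must already hold a cluster.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k))}
        level = {c for c in candidates if has_cluster(c)}
        found |= level
    return found

# Toy run: pretend clusters live exactly in {0}, {1}, {2}, and {0, 1}.
clustered = {frozenset(s) for s in ([0], [1], [2], [0, 1])}
print(bottom_up_subspaces(range(3), lambda s: s in clustered))
```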

Projected Clustering

Projected clustering seeks to assign each point to a unique cluster, but clusters may exist in different subspaces. The general approach is to use a special distance function together with a regular clustering algorithm.

For example, the PreDeCon algorithm checks which attributes seem to support a clustering for each point, and adjusts the distance function such that dimensions with low variance are amplified in the distance function (Böhm et al. 2004). In the figure above, the cluster cc might be found using DBSCAN with a distance function that places less emphasis on the x-axis and thus exaggerates the low difference in the y-axis sufficiently to group the points into a cluster.
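A rough sketch of such a preference-weighted distance is below; the neighborhood radius `eps`, variance threshold `delta`, and amplification factor `kappa` are assumed knobs, and the published PreDeCon definition differs in detail.

```python
# Subspace-preference-weighted distance in the spirit of PreDeCon:
# dimensions along which point i's eps-neighborhood has low variance
# are amplified, so small differences there dominate the distance.
import numpy as np

def preference_weighted_dist(X, i, eps=1.0, delta=0.05, kappa=100.0):
    """Distances from point i to every point, weighting low-variance dims."""
    p = X[i]
    neigh = X[np.linalg.norm(X - p, axis=1) <= eps]  # eps-neighborhood of p
    var = neigh.var(axis=0)                          # variance per dimension
    w = np.where(var <= delta, kappa, 1.0)           # amplify preferred dims
    return np.sqrt((((X - p) ** 2) * w).sum(axis=1))
```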

PROCLUS uses a similar approach with a k-medoid clustering (Aggarwal et al. 1999). Initial medoids are guessed, and for each medoid the subspace spanned by attributes with low variance is determined. Points are assigned to the closest medoid, considering only the subspace of that medoid in determining the distance. The algorithm then proceeds as the regular PAM algorithm.
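A minimal sketch of that assignment step follows; the per-medoid dimension sets are taken as given inputs, and a Manhattan-style per-dimension average distance stands in for PROCLUS's full machinery.

```python
# PROCLUS-style assignment: each medoid m carries its own relevant
# dimensions dims[m]; a point joins the medoid that is closest when
# distance is measured only in that medoid's subspace.
import numpy as np

def assign_to_medoids(X, medoids, dims):
    """medoids: row indices of medoid points; dims: one index list per
    medoid, the low-variance subspace chosen for it."""
    d = np.column_stack([
        # Mean absolute difference over the medoid's own dimensions.
        np.abs(X[:, dims_m] - X[m, dims_m]).mean(axis=1)
        for m, dims_m in zip(medoids, dims)
    ])
    return d.argmin(axis=1)   # winning medoid index for every point
```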

If the distance function weights attributes differently, but never with 0 (and hence never drops irrelevant attributes), the algorithm is called a "soft"-projected clustering algorithm.

Hybrid Approaches

Not all algorithms try to either find a unique cluster assignment for each point or all clusters in all subspaces; many settle for a result in between, finding a number of possibly overlapping, but not necessarily exhaustive, sets of clusters. An example is FIRES, which is fundamentally a subspace clustering algorithm, but uses a heuristic too aggressive to credibly produce all subspace clusters (Kriegel et al. 2005).

Correlation Clustering

Another type of subspace is considered in correlation clustering.

 References

  • Aggarwal, Charu C.; Wolf, Joel L.; Yu, Philip S.; Procopiuc, Cecilia; Park, Jong Soo (1999), "Fast algorithms for projected clustering", ACM SIGMOD Record (New York, NY: ACM) 28 (2): 61–72, doi:10.1145/304181.304188 
  • Agrawal, Rakesh; Gehrke, Johannes; Gunopulos, Dimitrios; Raghavan, Prabhakar (2005), "Automatic Subspace Clustering of High Dimensional Data", Data Mining and Knowledge Discovery (Springer Netherlands) 11 (1): 5–33, doi:10.1007/s10618-005-1396-1 
  • Böhm, Christian; Kailing, Karin; Kriegel, Hans-Peter; Kröger, Peer (2004), "Density Connected Clustering with Local Subspace Preferences", Data Mining, IEEE International Conference on (Los Alamitos, CA, USA: IEEE Computer Society): 24–34, doi:10.1109/ICDM.2004.10087, ISBN 0-7695-2142-8 
  • Kailing, Karin; Kriegel, Hans-Peter; Kröger, Peer (2004), "Density-Connected Subspace Clustering for High-Dimensional Data", Proceedings of the Fourth SIAM International Conference on Data Mining (SIAM): 246–257, http://www.dbs.informatik.uni-muenchen.de/Publikationen/Papers/sdm04-subclu.pdf, retrieved 2009-05-25 
  • Kriegel, Hans-Peter; Kröger, Peer; Renz, Matthias; Wurst, Sebastian (2005), "A Generic Framework for Efficient Subspace Clustering of High-Dimensional Data", Proceedings of the Fifth IEEE International Conference on Data Mining (ICDM) (Washington, DC: IEEE Computer Society): 205–257, doi:10.1109/ICDM.2005.5, ISBN 0-7695-2278-5 
  • Kriegel, Hans-Peter; Kröger, Peer; Zimek, Arthur (2009), "Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering", ACM Transactions on Knowledge Discovery from Data (New York, NY: ACM) 3 (1): 1–58, doi:10.1145/1497577.1497578 
  • Houle, Michael E.; Kriegel, Hans-Peter; Kröger, Peer; Schubert, Erich; Zimek, Arthur (2010), "Can Shared-Neighbor Distances Defeat the Curse of Dimensionality?", Proceedings of the 21st International Conference on Scientific and Statistical Database Management (SSDBM) (Heidelberg, Germany: Springer)

http://en.wikipedia.org/wiki/Subspace_clustering#Subspace_Clustering

 
 
  • Tags: subspace clustering