Introduction to Model-Based Clustering
Another class of methods for the clustering problem is model-based: a specific model is assumed for the clusters, and we try to optimize the fit between the actual data and that model.
In fact, each cluster can be represented mathematically by a parametric distribution, such as a Gaussian (continuous) or a Poisson (discrete). The entire data set is therefore modeled as a mixture of these distributions; an individual distribution used to model one specific cluster is usually called a component distribution.
A good mixture model would ideally have the following characteristics:
- the component distributions have "high peaks" (the data within a single cluster are tightly grouped);
- the mixture model "covers" the data well (the component distributions capture the dominant patterns of the data).
The main advantages of model-based clustering are:
- well-studied statistical inference techniques are available;
- flexibility in choosing the component distributions;
- an estimated density is obtained for each cluster;
- a "soft" classification is available.
Gaussian Mixtures
Basic idea: assume the whole data set follows a mixture-of-Gaussians distribution and treat the points to be clustered as samples drawn from it; the parameters of the Gaussians are then estimated from these samples by a maximum-likelihood-style method. Once the parameters are found, the membership function of each data point with respect to each cluster follows directly.
Among methods of this kind, the Gaussian mixture model is the most widely used: we can in fact regard each cluster as a Gaussian distribution centered on its centroid, as in the figure below, where the gray circles indicate the main body of each distribution.

[Figure: clusters modeled as Gaussian distributions centered on the cluster centroids; gray circles mark the main body of each distribution]
The algorithm generates data as follows:
- with probability $P(\omega_i)$, select one component (the Gaussian) $\omega_i$ at random;
- sample a point from that component's distribution $\mathcal{N}(\mu_i, \sigma^2 I)$.
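As an illustration, here is a minimal MATLAB sketch of this generative process in one dimension (the values of mu, sigma, p and N are assumed for the example, not taken from the text):

mu = [0 5 10]; sigma = [1 1 1]; p = [0.2 0.5 0.3];  % assumed component parameters
N = 1000;
x = zeros(N, 1);
for j = 1:N
    i = find(rand <= cumsum(p), 1);   % pick component i with probability P(w_i)
    x(j) = mu(i) + sigma(i)*randn;    % sample a point from N(mu_i, sigma_i^2)
end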
Let's suppose we have a sample generated this way:
- $x_1, x_2, \dots, x_N$

We can obtain the likelihood of the sample: $P(x \mid \omega_i, \mu_1, \dots, \mu_k)$. What we really want to maximise is $P(x \mid \mu_1, \dots, \mu_k)$ (the probability of a datum given the centres of the Gaussians). This quantity,

$P(x \mid \mu_1, \dots, \mu_k) = \sum_{i} P(x \mid \omega_i, \mu_1, \dots, \mu_k)\, P(\omega_i)$

is the base from which to write the likelihood function:

$L = \prod_{j=1}^{N} P(x_j \mid \mu_1, \dots, \mu_k) = \prod_{j=1}^{N} \sum_{i} P(x_j \mid \omega_i, \mu_1, \dots, \mu_k)\, P(\omega_i)$

Now we should maximise the likelihood function by solving $\partial \log L / \partial \mu_i = 0$, but this would be too difficult. That's why we use a simplified algorithm called EM (Expectation-Maximization).
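To see why a direct maximisation is hard, consider the log-likelihood (a sketch, writing $p_i = P(\omega_i)$ for the mixing weights):

$\log L = \sum_{j=1}^{N} \log \sum_{i=1}^{k} p_i\, P(x_j \mid \omega_i, \mu_i)$

The sum over components sits inside the logarithm, so the condition $\partial \log L / \partial \mu_i = 0$ couples all the means at once and has no closed-form solution.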
The EM Algorithm
The algorithm which is used in practice to find the mixture of Gaussians that can model the data set is called EM (Expectation-Maximization) (Dempster, Laird and Rubin, 1977). Let's see how it works with an example.
Suppose $x_k$ are the marks obtained by the students of a class, with these probabilities ($\mu$ is the unknown parameter):
x1 = 30, with $P(x_1) = \tfrac12$
x2 = 18, with $P(x_2) = \mu$
x3 = 0, with $P(x_3) = 2\mu$
x4 = 23, with $P(x_4) = \tfrac12 - 3\mu$
First case: we observe that the marks are distributed among the students as follows:
x1 : a students
x2 : b students
x3 : c students
x4 : d students

The likelihood of this observation is
$P(a, b, c, d \mid \mu) = K \left(\tfrac12\right)^{a} \mu^{b}\, (2\mu)^{c} \left(\tfrac12 - 3\mu\right)^{d}$
where $K$ is a constant (the multinomial coefficient).
We should maximise this function by solving $\partial P / \partial \mu = 0$. Let's instead calculate the logarithm of the function and maximise that:

$\log P = \log K + a \log \tfrac12 + b \log \mu + c \log 2\mu + d \log\!\left(\tfrac12 - 3\mu\right)$

$\frac{\partial \log P}{\partial \mu} = \frac{b}{\mu} + \frac{c}{\mu} - \frac{3d}{\tfrac12 - 3\mu} = 0 \;\Rightarrow\; \mu = \frac{b + c}{6\,(b + c + d)}$

Supposing a = 14, b = 6, c = 9 and d = 10, we can calculate that $\mu = \frac{6 + 9}{6\,(6 + 9 + 10)} = \frac{1}{10}$.
Second case: we observe that the marks are distributed among the students as follows:
x1 + x2 : h students
x3 : c students
x4 : d students
Now we only know the total number of students who got x1 or x2, not a and b separately. If we knew $\mu$ we could compute the expected values of a and b; if we knew a and b we could estimate $\mu$. We have thus obtained a circularity, which is divided into two steps:
- expectation: $a = \dfrac{1/2}{1/2 + \mu}\, h, \qquad b = \dfrac{\mu}{1/2 + \mu}\, h$
- maximization: $\mu = \dfrac{b + c}{6\,(b + c + d)}$
This circularity can be solved in an iterative way.
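As an illustration, a minimal MATLAB sketch of this iteration for the marks example (h, c, d and the initial guess for mu are assumed values):

h = 20; c = 9; d = 10;                 % observed counts (assumed)
mu = 0.05;                             % initial guess, 0 < mu < 1/6
for it = 1:100
    b  = h * mu / (1/2 + mu);          % expectation: expected students with x2
    mu = (b + c) / (6*(b + c + d));    % maximization: ML estimate of mu
end
mu                                     % converged fixed point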
Let's now see how the EM algorithm works for a mixture of Gaussians (parameters estimated at the p-th iteration are marked by a superscript (p)):
- Initialize parameters: $\lambda^{0} = \{\mu_1^{0}, \dots, \mu_k^{0},\; p_1^{0}, \dots, p_k^{0}\}$
- E-step: $P(\omega_i \mid x_r, \lambda^{p}) = \dfrac{P(x_r \mid \omega_i, \lambda^{p})\; p_i^{p}}{P(x_r \mid \lambda^{p})}$
- M-step: $\mu_i^{p+1} = \dfrac{\sum_{r} P(\omega_i \mid x_r, \lambda^{p})\, x_r}{\sum_{r} P(\omega_i \mid x_r, \lambda^{p})}, \qquad p_i^{p+1} = \dfrac{1}{R} \sum_{r} P(\omega_i \mid x_r, \lambda^{p})$
where R is the number of records.
The exposition above gets hard to follow from here on, so the rest of this section is adapted from a CSDN blog post on the Gaussian-mixture clustering algorithm, which explains it rather clearly.
Assume there are n Gaussian distributions from which m samples $y_1, \dots, y_m$ are generated; the distribution each $y_j$ belongs to is denoted $z_j$, with $z_j$ taking values from 1 to n. For any y, the probability that it comes from the i-th Gaussian is

$p(y, z = i \mid \theta) = \pi_i\; \mathcal{N}(y \mid \mu_i, \Sigma_i)$
Our task is to estimate the unknown parameters $\theta = \{\mu_i, \Sigma_i, \pi_i\}_{i=1}^{n}$, i.e., the mean, covariance, and prior probability of each Gaussian.
First we must set the initial values.
¶ÔÓÚÎ޼ලµÄ¾ÛÀ࣬ÎÒÃÇÊ×ÏȱØÐëΪÿһ¸öÑù±¾Ö¸¶¨Ò»¸öÀà±ð£¬ÕâÒ»²½¿ÉÒÔͨ¹ýÆäËûµÄ¾ÛÀà·½·¨ÊµÏÖ£¬ÈçK-means·½·¨£¬Çó³ö¸÷¸öÀà±ðµÄÖÐÐĺÍÿһ¸öÑù±¾µÄÀà±ð¡£È»ºóÇó³ö¸÷¸öÀà±ðÖÐÑù±¾µÄз½²îÕ󣬿ÉÒÔÓÃÿ¸öÀà±ðÖÐÑù±¾µÄ¸öÊýÀ´±íʾ¸ÃÀà±ðµÄÈ¨ÖØ£¨ÏÈÑé¸ÅÂÊ£©¡£
E-step
Using the parameters θ obtained in the previous M-step, compute the conditional distribution of the unknown z (i.e., of each class) given the observed data: under the current parameters $\theta^{t}$, compute for every sample the posterior probability of each class. In the expression below, the first factor in the numerator after the third equals sign is the probability density of the sample under class i, and the second factor is the prior probability of class i. The denominator normalizes, so that for each sample the posterior probabilities over all classes sum to 1:

$E_{ji} = p(z_j = i \mid y_j, \theta^{t}) = \dfrac{p(y_j, z_j = i \mid \theta^{t})}{p(y_j \mid \theta^{t})} = \dfrac{\mathcal{N}(y_j \mid \mu_i^{t}, \Sigma_i^{t})\; \pi_i^{t}}{\sum_{l=1}^{n} \mathcal{N}(y_j \mid \mu_l^{t}, \Sigma_l^{t})\; \pi_l^{t}}$
for ti = 1:Ncluster
    % density of every sample under component ti: N(y_j | mu_ti, Sigma_ti)
    PyatCi = mvnpdf(featurespace, Mu(ti,:), Sigma(:,:,ti));
    E(:,ti) = PyatCi * Weight(ti);    % numerator: density times prior
end
Esum = sum(E, 2);                     % per-sample normalizing constant
for ti = 1:Ncluster
    E(:,ti) = E(:,ti) ./ Esum;        % posteriors now sum to 1 for each sample
end
M-step
In the M-step we maximise the expectation of the log-likelihood of the joint distribution:

$Q(\theta) = \sum_{j=1}^{m} \sum_{i=1}^{n} E_{ji} \log p(y_j, z_j = i \mid \theta)$

The joint distribution in the last term can be written as the product of a conditional and a marginal distribution, giving

$Q(\theta) = \sum_{j=1}^{m} \sum_{i=1}^{n} E_{ji} \left[ \log \mathcal{N}(y_j \mid \mu_i, \Sigma_i) + \log \pi_i \right]$

subject to the normalization requirement on the priors:

$\sum_{i=1}^{n} \pi_i = 1$
To find the position of the maximum of $Q(\theta)$, introduce a Lagrange multiplier $\lambda$ and substitute the Gaussian probability density function:

$\tilde{Q}(\theta) = \sum_{j=1}^{m} \sum_{i=1}^{n} E_{ji} \left[ \log \mathcal{N}(y_j \mid \mu_i, \Sigma_i) + \log \pi_i \right] + \lambda \left( \sum_{i=1}^{n} \pi_i - 1 \right)$

To obtain the new $\theta^{t+1}$ we need the $\theta$ at which this expression attains its extremum, i.e., where its first derivative vanishes: $\partial \tilde{Q} / \partial \theta = 0$.
We first derive the mean that satisfies this condition:

$\frac{\partial \tilde{Q}}{\partial \mu_i} = \sum_{j=1}^{m} E_{ji}\, \Sigma_i^{-1} (y_j - \mu_i) = 0 \;\Rightarrow\; \mu_i^{t+1} = \frac{\sum_{j=1}^{m} E_{ji}\, y_j}{\sum_{j=1}^{m} E_{ji}}$
for ti = 1:Ncluster
    muti = E(:,ti)' * featurespace;   % responsibility-weighted sum of samples
    muti = muti / sum(E(:,ti));       % divide by total responsibility of class ti
    Mu(ti,:) = muti;                  % new mean of component ti
end
²úÉúеÄз½²î¾ØÕó
еĸ÷¸ö·ÖÀàµÄÈ¨ÖØ
½øÐйéÒ»»¯´¦Àí
Inserting $\lambda = -m$ into our estimate:

$\pi_i^{t+1} = \frac{1}{m} \sum_{j=1}^{m} E_{ji}$
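A minimal sketch of these two updates in MATLAB, matching the style of the mean-update code above (E, featurespace, Mu, Ncluster as defined earlier; the implicit expansion used here needs MATLAB R2016b or later):

for ti = 1:Ncluster
    d = featurespace - Mu(ti,:);                           % center samples on mu_ti
    Sigma(:,:,ti) = (d' * (d .* E(:,ti))) / sum(E(:,ti));  % weighted covariance
    Weight(ti) = sum(E(:,ti)) / size(featurespace, 1);     % normalized prior
end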
½«ÐµIJÎÊý£¬¾ùÖµ¡¢·½²î¡¢·ÖÀàµÄÏÈÑé¸ÅÂÊ£¬´øÈëE²½£¬ÖØÐ¼ÆË㣬ֱµ½Ñù±¾¼¯ºÏ¶Ô¸÷¸ö·ÖÀàµÄËÆÈ»º¯Êý²»ÔÙÓÐÃ÷ÏԵı仯Ϊֹ¡£
Bibliography