Author's note: Matlab already ships with code for the k-means method, which can be called directly.
The k-means clustering algorithm was developed by J. MacQueen (1967) and later by J. A. Hartigan
and M. A. Wong around 1975.
Simply speaking, k-means clustering is an algorithm to classify or group your objects, based
on attributes/features, into K groups, where K is a positive integer. The grouping is done
by minimizing the sum of squared distances between the data points and the corresponding cluster centroids.
Thus the purpose of k-means clustering is to classify the data.
The basic steps of k-means clustering are simple. In the beginning we determine the number of clusters
K and assume the centroids or centers of these clusters. We can take any random objects as the
initial centroids, or the first K objects in sequence can also serve as the initial centroids.
Then the k-means algorithm repeats the three steps below until convergence.
Iterate until stable (= no object moves group):
1. Determine the centroid coordinates;
2. Determine the distance of each object to the centroids;
3. Group each object based on minimum distance.
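The iteration above can be sketched in a few lines of Python (a minimal illustration using numpy, not the Matlab implementation given below; the function name `kmeans` is our own):

```python
import numpy as np

def kmeans(m, k):
    """Minimal k-means sketch: rows of m are objects, columns are attributes."""
    m = np.asarray(m, dtype=float)
    c = m[:k].copy()                      # sequential initialization: first k objects
    g = None
    while True:
        # step 2: distance of every object to every centroid
        d = np.linalg.norm(m[:, None, :] - c[None, :, :], axis=2)
        # step 3: assign each object to its nearest centroid
        new_g = d.argmin(axis=1)
        if g is not None and np.array_equal(new_g, g):
            break                         # stable: no object moved group
        g = new_g
        for i in range(k):                # step 1: recompute centroid coordinates
            if np.any(g == i):
                c[i] = m[g == i].mean(axis=0)
    return np.column_stack([m, g])        # data plus an extra group column
```

For example, `kmeans([[1, 1], [2, 1], [4, 3], [5, 4]], 2)` puts the first two points in one group and the last two in the other (group labels here are 0-based).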
For those of you who like to use Matlab, the Matlab Statistics Toolbox contains a function named kmeans.
If you do not have the Statistics Toolbox, you may use my generic code below. The kMeansCluster
and DistMatrix functions can be downloaded as text files. Alternatively, you may simply copy and paste the
code below.
The Matlab code is below:
function y=kMeansCluster(m,k,isRand)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% kMeansCluster - Simple k means clustering algorithm
% Author: Kardi Teknomo, Ph.D.
%
% Purpose: classify the objects in data matrix based on the attributes
% Criteria: minimize Euclidean distance between centroids and object points
% For more explanation of the algorithm, see http://people.revoledu.com/kardi/tutorial/kMean/index.html
% Output: matrix data plus an additional column representing the group of each object
%
% Example: m = [ 1 1; 2 1; 4 3; 5 4] or in a nice form
% m = [ 1 1;
% 2 1;
% 4 3;
% 5 4]
% k = 2
% kMeansCluster(m,k) produces m = [ 1 1 1;
% 2 1 1;
% 4 3 2;
% 5 4 2]
% Input:
% m - required, matrix data: objects in rows and attributes in columns
% k - optional, number of groups (default = 1)
% isRand - optional, if isRand=1 use random initialization; otherwise (default)
% the first k data points are assigned as the initial centroids
%
% Local Variables
% f - row number of data that belong to group i
% c - centroid coordinate size (1:k, 1:maxCol)
% g - current iteration group matrix size (1:maxRow)
% i - scalar iterator
% maxCol - scalar, number of columns in the data matrix m = number of attributes
% maxRow - scalar, number of rows in the data matrix m = number of objects
% temp - previous iteration group matrix size (1:maxRow)
% z - minimum value (not needed)
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
if nargin<3, isRand=0; end
if nargin<2, k=1; end
[maxRow, maxCol]=size(m);
if maxRow<=k,
    y=[m, (1:maxRow)']; % fewer objects than groups: each object forms its own group
else
    % initial value of centroid
    if isRand,
        p=randperm(size(m,1)); % random initialization
        for i=1:k
            c(i,:)=m(p(i),:);
        end
    else
        for i=1:k
            c(i,:)=m(i,:); % sequential initialization
        end
    end
    temp=zeros(maxRow,1); % initialize as zero vector
    while 1,
        d=DistMatrix(m,c); % calculate object-centroid distances
        [z,g]=min(d,[],2); % find group matrix g
        if isequal(g,temp),
            break; % stop the iteration: no object moved group
        else
            temp=g; % copy group matrix to temporary variable
        end
        for i=1:k
            f=find(g==i);
            if ~isempty(f) % only compute the centroid if group i is not empty
                c(i,:)=mean(m(f,:),1);
            end
        end
    end
    y=[m,g];
end
The Matlab function kMeansCluster above calls the function DistMatrix, shown in the code below.
It works for multi-dimensional Euclidean distance.
function d=DistMatrix(A,B)
%% DISTMATRIX return distance matrix between points in A=[x1 y1 ... w1] and in B=[x2 y2 ... w2]
%% Copyright (c) 2005 by Kardi Teknomo, http://people.revoledu.com/kardi/
%
% Numbers of rows (represent points) in A and B are not necessarily the same.
% It can be used for distance-in-a-slice (Spacing) or distance-between-slice (Headway).
%
% A and B must contain the same number of columns (represent variables of n dimensions),
% first column is the X coordinates, second column is the Y coordinates, and so on.
% The distance matrix is distance between points in A as rows
% and points in B as columns.
% example: Spacing = DistMatrix(A,A)
% Headway = DistMatrix(A,B), with hA ~= hB or hA = hB
% A=[1 2 3; 4 5 6; 2 4 6; 1 2 3]; B=[4 5 1; 6 2 0]
% DistMatrix(A,B) = [ 4.69 5.83;
%                     5.00 7.00;
%                     5.48 7.48;
%                     4.69 5.83]
%
% DistMatrix(B,A) = [ 4.69 5.00 5.48 4.69;
%                     5.83 7.00 7.48 5.83]
%%%%%%%%%%%%%%%%%%%%%%%%%%%
[hA,wA]=size(A);
[hB,wB]=size(B);
if wA ~= wB, error(' second dimension of A and B must be the same'); end
for k=1:wA
    C{k}= repmat(A(:,k),1,hB);
    D{k}= repmat(B(:,k),1,hA);
end
S=zeros(hA,hB);
for k=1:wA
    S=S+(C{k}-D{k}').^2;
end
d=sqrt(S);
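The same pairwise-distance construction can be sketched in Python, where numpy broadcasting plays the role of the repmat trick above (an illustration only; `dist_matrix` is our own name):

```python
import numpy as np

def dist_matrix(A, B):
    """Euclidean distance between every row of A and every row of B."""
    A, B = np.asarray(A, dtype=float), np.asarray(B, dtype=float)
    if A.shape[1] != B.shape[1]:
        raise ValueError('second dimension of A and B must be the same')
    # broadcasting (hA,1,w) - (1,hB,w) replaces the two repmat calls
    return np.sqrt(((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2))

A = [[1, 2, 3], [4, 5, 6], [2, 4, 6], [1, 2, 3]]
B = [[4, 5, 1], [6, 2, 0]]
print(np.round(dist_matrix(A, B), 2))  # reproduces the 4x2 example matrix above
```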
Like other algorithms, k-means clustering has many weaknesses:
- When the number of data points is small, the initial grouping determines the clusters significantly.
- The number of clusters, K, must be determined beforehand.
- We never know the real clusters: with the same data, input in a different order may produce different clusters when the number of data points is small.
- It is sensitive to the initial condition. Different initial conditions may produce different clustering results, and the algorithm may be trapped in a local optimum.
- We never know which attribute contributes more to the grouping process, since we assume that each attribute has the same weight.
- The arithmetic mean is not robust to outliers: data very far from a centroid may pull the centroid away from its real position.
- The resulting clusters are circular in shape, because the grouping is based on distance.
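The local-optimum weakness can be demonstrated with a small Python sketch on hypothetical data: the four corners of an elongated rectangle. Both runs below reach a stable grouping, but the bad initialization gets trapped in a split with a much larger sum of squared errors:

```python
import numpy as np

def lloyd(m, c, iters=20):
    """A few Lloyd iterations from initial centroids c; returns groups and SSE."""
    m, c = np.asarray(m, dtype=float), np.asarray(c, dtype=float)
    for _ in range(iters):
        d = np.linalg.norm(m[:, None, :] - c[None, :, :], axis=2)
        g = d.argmin(axis=1)              # nearest-centroid assignment
        for i in range(len(c)):
            if np.any(g == i):
                c[i] = m[g == i].mean(axis=0)
    return g, ((m - c[g]) ** 2).sum()     # final groups, sum of squared errors

pts = [[0, 0], [0, 1], [4, 0], [4, 1]]    # corners of a 4x1 rectangle
g_good, sse_good = lloyd(pts, [[0, 0.5], [4, 0.5]])  # left/right start
g_bad, sse_bad = lloyd(pts, [[2, 0], [2, 1]])        # top/bottom start
print(g_good, sse_good)   # left vs right columns, SSE = 1.0
print(g_bad, sse_bad)     # top vs bottom rows, SSE = 16.0: a worse stable optimum
```

Both final configurations are fixed points of the iteration, so the algorithm alone cannot tell that the second one is sixteen times worse.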
Software/Codes
- SPAETH2 is a collection of FORTRAN90 routines for analyzing data by grouping them into clusters, which includes KMEANS.
- WEKA is another data mining software package, under the GNU public license, written in Java, which includes k-means clustering. You may download the source code from Waikato University.
- The Biopython project provides tools for computational molecular biology in the Python language. The tutorial is here, which includes a k-means clustering module. Other k-means code in Python can be found here.
- Tapas Kanungo et al. provide C++ code (for Unix, under the GPL) and documentation for k-means clustering based on a combination of local search and Lloyd's algorithm (also known as the k-means algorithm).
- David J.C. MacKay gives a demonstration of using k-means clustering in Octave, free software similar to Matlab.
- The Matlab Statistics Toolbox contains a function named kmeans. My own Matlab code is on this page.
- Clustan is specialized commercial software for clustering.
- XLMiner and XLStat are both commercial software packages that support k-means clustering in MS Excel.
- The commercial modeling software DTREG also provides k-means.
- Michael Eisen developed Cluster, open-source clustering software, available here, that implements the most commonly used clustering methods for gene expression data analysis. It is available for Mac, Windows, and Unix.
- kmeans is also a function in R, free software for statistical computing. More information about k-means in R is also available here.
- If you are using SPSS, k-means is also called quick clustering; more information is available here.
- Fuzzy c-means clustering in Matlab can be downloaded from the Matlab Central File Exchange.
- Mathematica has a kmeans function in the image processing package. You need to load the package before using it. Click here for an example and explanation.
- The Institute for Signal and Information Processing provides a Java applet (with source code) for k-means, PCA, LDA, SVM, etc. Similarly, Jens Spehr and Simon Winkelbach also developed a Java applet for k-means that is useful for interactive teaching.
- Kenichi Kurihara developed Bayesian k-means, similar to the EM algorithm (code in Matlab).
- Cell-kmeans is an open-source implementation of the k-means algorithm on the Cell Broadband Engine, written in C.
- Analytics1305 is a useful public-cloud machine learning service for extremely large and complex datasets. It contains k-means and other clustering techniques.