On Clustering and MapReduce
[ 2013/4/25 15:48:00 | By: 梦翔儿 ]
(Adapted from online material, with modifications)
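I. Concepts
K-means is a hard clustering algorithm and the typical representative of prototype-based objective-function clustering methods. It takes some distance from the data points to the prototypes as the objective function to optimize, and derives the update rule of the iteration by finding the extremum of that function. K-means uses Euclidean distance as its similarity measure: for a given vector of initial cluster centers V it seeks the optimal partition, the one that minimizes the evaluation index J. The clustering criterion is the sum-of-squared-errors function.
K-means is a typical distance-based clustering algorithm. It uses distance as the measure of similarity, so the closer two objects are, the more similar they are considered to be. The algorithm treats a cluster as a set of objects that lie close together, and its final goal is therefore a set of compact, well-separated clusters.
The choice of the k initial cluster centers has a large influence on the result, because the first step of the algorithm picks k arbitrary objects at random as the initial centers, each initially representing one cluster. In every iteration the algorithm reassigns each remaining object in the data set to the nearest cluster, based on its distance to each cluster center. Once all objects have been examined, one iteration is complete and new cluster centers are computed. If the value of J does not change from one iteration to the next, the algorithm has converged.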
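II. Basic idea
1. Mathematical description
Given n points $(x_1, x_2, \dots, x_n)$, each a d-dimensional real vector (called simply "points" from here on, for brevity), K-Means partitions them into k clusters (k ≤ n) for a parameter k fixed in advance. The partition is chosen to minimize the sum of squared distances between each point and the centroid (mean) of its cluster. Writing the clusters as $S = \{S_1, S_2, \dots, S_k\}$, the objective is:
$$\underset{S}{\arg\min} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2$$
where $\mu_i$ is the "centroid" of the i-th cluster (the mean of all points in that cluster).
The clustering result looks like the figure below:
[Figure: example of a clustering result]
For details see: http://en.wikipedia.org/wiki/K-means_clustering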
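2. The K-means algorithm
It is an iterative algorithm:
(1) Build an initial partition from the given k to obtain k clusters. For example, pick k points at random as the k centroids, or use the clusters produced by Canopy Clustering as the initial centroids (in that case k is determined by the Canopy Clustering result);
(2) Compute the distance from each point to every cluster centroid and add the point to the nearest cluster;
(3) Recompute the centroid of each cluster;
(4) Repeat steps 2 and 3 until no centroid moves by more than a given tolerance, or the maximum number of iterations is reached.
Simple as it is, many more complex algorithms may not do better in practice. It also has good locality and parallelizes easily, which matters for large data sets. Its time complexity is O(nkt), where n is the number of points, k the number of clusters, and t the number of iterations.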
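The four steps can be sketched in plain Java. This is a minimal illustration, not Mahout code: the class name KMeansSketch and its methods are made up for this example, the initial centroids are passed in rather than drawn at random, and the distance is squared Euclidean.

```java
import java.util.Arrays;

/** Plain-Java sketch of the four K-means steps; hypothetical names, not Mahout code. */
final class KMeansSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Step 2: index of the centroid nearest to the point. */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (squaredDistance(point, centroids[c]) < squaredDistance(point, centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Runs steps 2 to 4, starting from the given initial centroids (step 1). */
    static double[][] cluster(double[][] points, double[][] initial, int maxIter, double tolerance) {
        int k = initial.length;
        int dim = points[0].length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initial[c].clone();
        }
        for (int iter = 0; iter < maxIter; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {                  // step 2: assign to nearest cluster
                int c = nearest(p, centroids);
                counts[c]++;
                for (int d = 0; d < dim; d++) {
                    sums[c][d] += p[d];
                }
            }
            boolean converged = true;
            for (int c = 0; c < k; c++) {                // step 3: recompute centroids
                if (counts[c] == 0) {
                    continue;                            // an empty cluster keeps its old centroid
                }
                double[] updated = new double[dim];
                for (int d = 0; d < dim; d++) {
                    updated[d] = sums[c][d] / counts[c];
                }
                if (squaredDistance(updated, centroids[c]) > tolerance * tolerance) {
                    converged = false;
                }
                centroids[c] = updated;
            }
            if (converged) {
                break;                                   // step 4: centroids stopped moving
            }
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {2, 1}, {1, 2}, {2, 2}, {8, 8}, {9, 8}, {8, 9}, {9, 9}};
        double[][] result = cluster(points, new double[][] {{1, 1}, {9, 9}}, 10, 1e-6);
        System.out.println(Arrays.deepToString(result));
    }
}
```

Each pass over the points costs O(nk) distance computations, which is where the O(nkt) total comes from.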
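III. Parallelizing K-means
The good locality of K-Means makes it easy to parallelize. In the first phase, building the clusters runs in parallel: each slave reads the part of the data set stored locally and produces a set of clusters with the algorithm above. These per-node cluster sets are then merged into the global cluster set of that iteration, and the process repeats until the stopping condition is met. In the second phase, the clusters obtained this way are used to cluster the points.
In map-reduce terms: in the map phase each datanode reads its local data and emits every point together with the cluster it belongs to; the combiner performs a local reduce over the points that fall in the same cluster and emits the result; the reduce step produces the global cluster set and writes it to HDFS.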
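The reason a combiner works here is that a centroid only needs a point count and a per-dimension sum, and both can be merged in any order. The sketch below models one iteration in plain Java without Hadoop; the class KMeansMapReduceSketch, the Partial type and the method names are hypothetical and chosen only for illustration.

```java
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java sketch of one K-means iteration phrased as map, combine and reduce. */
final class KMeansMapReduceSketch {

    /** Partial statistics for one cluster: point count and per-dimension sum. */
    static final class Partial {
        long count;
        final double[] sum;

        Partial(int dim) {
            sum = new double[dim];
        }

        void add(double[] point) {
            count++;
            for (int d = 0; d < sum.length; d++) {
                sum[d] += point[d];
            }
        }

        void merge(Partial other) {
            count += other.count;
            for (int d = 0; d < sum.length; d++) {
                sum[d] += other.sum[d];
            }
        }
    }

    /** Index of the centroid nearest to the point (squared Euclidean distance). */
    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dist = 0.0;
            for (int d = 0; d < p.length; d++) {
                double diff = p[d] - centroids[c][d];
                dist += diff * diff;
            }
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    /** Map plus combine on one node: local points grouped by nearest cluster id. */
    static Map<Integer, Partial> mapAndCombine(double[][] localPoints, double[][] centroids) {
        Map<Integer, Partial> out = new TreeMap<>();
        for (double[] p : localPoints) {
            int id = nearest(p, centroids);
            out.computeIfAbsent(id, key -> new Partial(p.length)).add(p);
        }
        return out;
    }

    /** Reduce: merge the partials of all nodes and emit the new centroid per cluster id. */
    @SafeVarargs
    static Map<Integer, double[]> reduce(Map<Integer, Partial>... perNode) {
        Map<Integer, Partial> merged = new TreeMap<>();
        for (Map<Integer, Partial> node : perNode) {
            for (Map.Entry<Integer, Partial> e : node.entrySet()) {
                merged.computeIfAbsent(e.getKey(), key -> new Partial(e.getValue().sum.length))
                      .merge(e.getValue());
            }
        }
        Map<Integer, double[]> centroids = new TreeMap<>();
        for (Map.Entry<Integer, Partial> e : merged.entrySet()) {
            Partial p = e.getValue();
            double[] c = new double[p.sum.length];
            for (int d = 0; d < c.length; d++) {
                c[d] = p.sum[d] / p.count;
            }
            centroids.put(e.getKey(), c);
        }
        return centroids;
    }
}
```

Only one Partial per cluster leaves each node, so the data sent to the reducer grows with k rather than with the number of points.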
IV. K-means in Mahout
Mahout implements standard K-Means Clustering along the lines described above. In total it uses 2 map operations, 1 combine operation and 1 reduce operation: every iteration uses 1 map, 1 combine and 1 reduce to compute and save the global cluster set, and after the iterations finish one more map performs the actual clustering.
1. Data structure model
Mahout's clustering algorithms represent objects as Vectors, with support for both dense and sparse vectors. There are three representations (all share the base class AbstractVector, which implements many of the Vector operations):
(1) DenseVector
It stores the vector in a double array (private double[] values) and is suited to dense data;
(2) RandomAccessSparseVector
It represents a sparse vector with random access. Only non-zero elements are stored, in a hash map: OpenIntDoubleHashMap;
OpenIntDoubleHashMap has int keys and double values and resolves collisions by double hashing.
(3) SequentialAccessSparseVector
It represents a sparse vector meant for sequential access. Again only non-zero elements are stored, in an ordered mapping: OrderedIntDoubleMapping;
OrderedIntDoubleMapping has int keys and double values. Its layout is reminiscent of the LibSVM data format, index-of-non-zero-element:value. An int array holds the indices and a double array holds the non-zero values. Reading or writing an element requires finding its offset in indices, and because indices is kept sorted the lookup is a binary search.
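The sorted parallel-array layout can be sketched as follows. This is an illustration of the idea only: the class OrderedSparseSketch and its methods are hypothetical and do not reproduce the API of OrderedIntDoubleMapping.

```java
import java.util.Arrays;

/** Sketch of a sparse vector stored as sorted parallel arrays; hypothetical names. */
final class OrderedSparseSketch {
    private int[] indices = new int[4];
    private double[] values = new double[4];
    private int size = 0;

    /** Returns the stored value, or 0.0 when the index holds no entry. */
    double get(int index) {
        int offset = Arrays.binarySearch(indices, 0, size, index);
        return offset >= 0 ? values[offset] : 0.0;
    }

    /** Sets a value while keeping indices sorted (zeros are stored as given, for simplicity). */
    void set(int index, double value) {
        int offset = Arrays.binarySearch(indices, 0, size, index);
        if (offset >= 0) {
            values[offset] = value;          // index already present: overwrite
            return;
        }
        int insertAt = -offset - 1;          // binarySearch encodes the insertion point
        if (size == indices.length) {        // grow both arrays together
            indices = Arrays.copyOf(indices, size * 2);
            values = Arrays.copyOf(values, size * 2);
        }
        System.arraycopy(indices, insertAt, indices, insertAt + 1, size - insertAt);
        System.arraycopy(values, insertAt, values, insertAt + 1, size - insertAt);
        indices[insertAt] = index;
        values[insertAt] = value;
        size++;
    }

    /** Number of entries actually stored. */
    int numStored() {
        return size;
    }
}
```

Lookups cost O(log m) for m stored entries and iteration in index order is a plain array scan, while an insert in the middle has to shift the tail. That trade-off is why this layout favors sequential access.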
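2. Meaning of the K-means variables
These can be read from Cluster.java and its parent class. Mahout wraps clusters in the abstract class AbstractCluster; the previous article has the details, so only a brief summary follows:
(1) private int id; # id of each cluster produced by K-Means
(2) private long numPoints; # number of points in the cluster; every point is a Vector
(3) private Vector center; # centroid of the cluster, i.e. the mean, computed from s0 and s1
(4) private Vector radius; # radius of the cluster: the standard deviation of its points, which reflects how dispersed the members are; computed from s0, s1 and s2
(5) private double s0; # sum of the weights of the points in the cluster
(6) private Vector s1; # weighted sum of the points in the cluster
(7) private Vector s2; # weighted sum of the squares of the points in the cluster
(8) public void computeParameters(); # computes numPoints, center and radius from s0, s1 and s2 (vector operations are element-wise):
$$\text{numPoints} = \lfloor s_0 \rfloor, \quad \text{center} = \frac{s_1}{s_0}, \quad \text{radius} = \frac{\sqrt{s_0 s_2 - s_1 \circ s_1}}{s_0}$$
$$s_0 \leftarrow 0, \quad s_1 \leftarrow \text{null}, \quad s_2 \leftarrow \text{null}$$
These operations matter, and the last three steps are essential; the reason is explained later.
(9) public void observe(VectorWritable x, double weight); # whenever a new point joins the cluster, s0, s1 and s2 must be updated
(10) public ClusterObservations getObservations(); # used frequently in the combine step; returns a ClusterObservations object initialized from s0, s1 and s2, representing all points observed so far in the cluster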
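The bookkeeping with s0, s1 and s2 can be sketched in plain Java. The class RunningClusterStats is a hypothetical stand-in that follows the field names in the list above; it uses double arrays instead of Mahout Vectors and is not the AbstractCluster source.

```java
/** Sketch of the s0/s1/s2 bookkeeping; names follow the text, not the Mahout source. */
final class RunningClusterStats {
    private double s0;      // sum of weights
    private double[] s1;    // weighted sum of the points
    private double[] s2;    // weighted sum of the squared points
    long numPoints;
    double[] center;
    double[] radius;

    /** Adds one weighted point to the running sums. */
    void observe(double[] x, double weight) {
        if (s1 == null) {
            s1 = new double[x.length];
            s2 = new double[x.length];
        }
        s0 += weight;
        for (int d = 0; d < x.length; d++) {
            s1[d] += weight * x[d];
            s2[d] += weight * x[d] * x[d];
        }
    }

    /** Derives numPoints, center and radius, then clears the sums for the next iteration. */
    void computeParameters() {
        if (s0 == 0) {
            return;                              // nothing was observed in this iteration
        }
        numPoints = (long) s0;
        center = new double[s1.length];
        radius = new double[s1.length];
        for (int d = 0; d < s1.length; d++) {
            center[d] = s1[d] / s0;
            // sqrt(s0*s2 - s1^2) / s0 is the per-dimension standard deviation;
            // max(0, ...) guards against tiny negative values from rounding
            radius[d] = Math.sqrt(Math.max(0.0, s0 * s2[d] - s1[d] * s1[d])) / s0;
        }
        s0 = 0;
        s1 = null;
        s2 = null;
    }
}
```

Because the sums are cleared at the end, a second round of observe calls describes only the points of the next iteration, and the centroid is truly recomputed rather than averaged with stale data.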
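3. Map-Reduce implementation of K-means
The K-Means Clustering implementation likewise comes in a single-machine version and an MR version. The single-machine version is not covered here. The MR version uses two map operations, one combine operation and one reduce operation, triggered by two different jobs and organized by a Driver. The map and reduce phases run in the following order:
[Figure: execution order of the map and reduce phases]
(1) Building the initial k clusters
K-Means needs an initial partition of the data points. Mahout offers two methods (using the first 3 features of the Iris dataset as an example):
A. Using the RandomSeedGenerator class
It generates k initial clusters in the given clusters directory and stores them as a SequenceFile. Its selection method tries to avoid picking outliers as cluster centroids. The rough flow is:
[Figure 2]
B. Using Canopy Clustering
Canopy Clustering is often used to make a rough first partition of the data, and its result can help a more expensive clustering that follows. It may be more useful as a preprocessing step than as a clustering method in its own right, for example to supply the value of k to K-Means, and it also copes well with outliers. The price is that the manually chosen parameter changes from k to the pair T1 and T2, both of which are indispensable. T1 determines how many points each cluster contains, which directly affects the cluster's "centroid" and "radius", while T2 determines the number of clusters: too large a T2 yields a single cluster, too small a T2 yields too many. In experiments the choice of T1 and T2 strongly affects the quality of the result. How to pick them is an open question, though AIC, BIC or cross-validation look like possible approaches.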
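The roles of T1 and T2 are easier to see in code. The following is a minimal sketch of the usual canopy procedure, assuming T1 > T2 and Euclidean distance; the class CanopySketch is hypothetical, and it takes the first remaining point as the next canopy center where implementations often pick one at random.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Sketch of canopy generation with a loose threshold t1 and a tight threshold t2 (t1 > t2). */
final class CanopySketch {

    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Returns one member list per canopy; the first member of each list is the canopy center. */
    static List<List<double[]>> canopies(List<double[]> points, double t1, double t2) {
        List<double[]> remaining = new ArrayList<>(points);
        List<List<double[]>> result = new ArrayList<>();
        while (!remaining.isEmpty()) {
            double[] center = remaining.remove(0);   // deterministic pick, for reproducibility
            List<double[]> members = new ArrayList<>();
            members.add(center);
            Iterator<double[]> it = remaining.iterator();
            while (it.hasNext()) {
                double[] p = it.next();
                double dist = distance(center, p);
                if (dist < t1) {
                    members.add(p);                  // loosely close: joins this canopy
                }
                if (dist < t2) {
                    it.remove();                     // tightly close: cannot seed another canopy
                }
            }
            result.add(members);
        }
        return result;
    }
}
```

A very large t2 removes every point after the first center, leaving a single canopy, while a very small t2 removes almost nothing and every point ends up seeding its own canopy.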
(2) Configuring the cluster information
In the MR implementation of K-Means, the first iteration takes as its input directory the output of the random method or of Canopy Clustering, and every later iteration takes the output directory of the previous iteration. Each kmeans map and kmeans reduce therefore has to load the cluster information from that directory before it runs. This is done by KMeansUtil.configureWithClusterInfo, which reads the canopy clusters or the previous iteration's clusters from the given HDFS directory into a Collection. Every subsequent map and reduce operation needs this method.
(3) KMeansMapper
public class KMeansMapper extends Mapper<WritableComparable<?>, VectorWritable, Text, ClusterObservations> {
  private KMeansClusterer clusterer;
  private final Collection<Cluster> clusters = new ArrayList<Cluster>();

  @Override
  protected void map(WritableComparable<?> key, VectorWritable point, Context context)
    throws IOException, InterruptedException {
    this.clusterer.emitPointToNearestCluster(point.get(), this.clusters, context);
  }

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    Configuration conf = context.getConfiguration();
    try {
      ClassLoader ccl = Thread.currentThread().getContextClassLoader();
      DistanceMeasure measure = ccl.loadClass(conf.get(KMeansConfigKeys.DISTANCE_MEASURE_KEY))
          .asSubclass(DistanceMeasure.class).newInstance();
      measure.configure(conf);
      this.clusterer = new KMeansClusterer(measure);
      String clusterPath = conf.get(KMeansConfigKeys.CLUSTER_PATH_KEY);
      if (clusterPath != null && clusterPath.length() > 0) {
        KMeansUtil.configureWithClusterInfo(conf, new Path(clusterPath), clusters);
        if (clusters.isEmpty()) {
          throw new IllegalStateException("No clusters found. Check your -c path.");
        }
      }
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    } catch (IllegalAccessException e) {
      throw new IllegalStateException(e);
    } catch (InstantiationException e) {
      throw new IllegalStateException(e);
    }
  }

  void setup(Collection<Cluster> clusters, DistanceMeasure measure) {
    this.clusters.clear();
    this.clusters.addAll(clusters);
    this.clusterer = new KMeansClusterer(measure);
  }
}

A. KMeansMapper receives (WritableComparable<?>, VectorWritable) pairs. The setup method uses KMeansUtil.configureWithClusterInfo to load the clustering result of the previous iteration, which the map operation needs in order to cluster.
B. Each slave machine processes the data stored on its own disk in a distributed fashion. Using the cluster information loaded earlier, emitPointToNearestCluster assigns every point to its nearest cluster and emits a pair (ID of the cluster nearest to the point, ClusterObservations wrapping the point). Note that the Mapper only assigns the point to the nearest cluster and labels it as a (key, value) pair for the combiner and reducer to collect; it does not update the centroid or any other cluster parameter.
(4) KMeansCombiner
public class KMeansCombiner extends Reducer<Text, ClusterObservations, Text, ClusterObservations> {
  @Override
  protected void reduce(Text key, Iterable<ClusterObservations> values, Context context)
    throws IOException, InterruptedException {
    Cluster cluster = new Cluster();
    for (ClusterObservations value : values) {
      cluster.observe(value);
    }
    context.write(key, cluster.getObservations());
  }
}
The combiner is a local reduce operation that runs after map and before reduce.
(5) KMeansReducer
public class KMeansReducer extends Reducer<Text, ClusterObservations, Text, Cluster> {
  private Map<String, Cluster> clusterMap;
  private double convergenceDelta;
  private KMeansClusterer clusterer;

  @Override
  protected void reduce(Text key, Iterable<ClusterObservations> values, Context context)
    throws IOException, InterruptedException {
    Cluster cluster = clusterMap.get(key.toString());
    for (ClusterObservations delta : values) {
      cluster.observe(delta);
    }
    // force convergence calculation
    boolean converged = clusterer.computeConvergence(cluster, convergenceDelta);
    if (converged) {
      context.getCounter("Clustering", "Converged Clusters").increment(1);
    }
    cluster.computeParameters();
    context.write(new Text(cluster.getIdentifier()), cluster);
  }

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    Configuration conf = context.getConfiguration();
    try {
      ClassLoader ccl = Thread.currentThread().getContextClassLoader();
      DistanceMeasure measure = ccl.loadClass(conf.get(KMeansConfigKeys.DISTANCE_MEASURE_KEY))
          .asSubclass(DistanceMeasure.class).newInstance();
      measure.configure(conf);
      this.convergenceDelta = Double.parseDouble(conf.get(KMeansConfigKeys.CLUSTER_CONVERGENCE_KEY));
      this.clusterer = new KMeansClusterer(measure);
      this.clusterMap = new HashMap<String, Cluster>();
      String path = conf.get(KMeansConfigKeys.CLUSTER_PATH_KEY);
      if (path.length() > 0) {
        Collection<Cluster> clusters = new ArrayList<Cluster>();
        KMeansUtil.configureWithClusterInfo(conf, new Path(path), clusters);
        setClusterMap(clusters);
        if (clusterMap.isEmpty()) {
          throw new IllegalStateException("Cluster is empty!");
        }
      }
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    } catch (IllegalAccessException e) {
      throw new IllegalStateException(e);
    } catch (InstantiationException e) {
      throw new IllegalStateException(e);
    }
  }

  private void setClusterMap(Collection<Cluster> clusters) {
    clusterMap = new HashMap<String, Cluster>();
    for (Cluster cluster : clusters) {
      clusterMap.put(cluster.getIdentifier(), cluster);
    }
    clusters.clear();
  }

  public void setup(Collection<Cluster> clusters, DistanceMeasure measure) {
    setClusterMap(clusters);
    this.clusterer = new KMeansClusterer(measure);
  }
}

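The operations are straightforward; only setup is slightly more involved.
A. setup reads the initial partition or the result of the previous iteration and builds the cluster information. It also builds a Map<cluster ID, Cluster> so that a cluster can be looked up by its ID.
B. reduce is very direct: it aggregates the <cluster ID, ClusterObservations> pairs coming from the combiner;
computeConvergence decides whether the current cluster has converged, i.e. whether the distance between the new "centroid" and the old one satisfies the precision passed in earlier;
Note the call to cluster.computeParameters(). It is very important: it guarantees that the result of this iteration does not leak into the next one, which is what makes the step "recompute the centroid of each cluster" possible.
$$\text{numPoints} = \lfloor s_0 \rfloor, \quad \text{center} = \frac{s_1}{s_0}, \quad \text{radius} = \frac{\sqrt{s_0 s_2 - s_1 \circ s_1}}{s_0}$$
The first three operations produce the new cluster information;
$$s_0 \leftarrow 0, \quad s_1 \leftarrow \text{null}, \quad s_2 \leftarrow \text{null}$$
The last three steps clear s0, s1 and s2, so that the cluster information used by the next iteration starts "clean".
Afterwards, reduce writes the (cluster ID, Cluster) pairs to an HDFS folder named "clusters-<iteration number>" for use by later iterations.
The reduce operation collects the output of the combiner and once more updates the cluster centroid and related information.
(6) KMeansClusterMapper
The MR operations so far build the cluster information; KMeansClusterMapper then uses the finished cluster information to cluster the points.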
public class KMeansClusterMapper
    extends Mapper<WritableComparable<?>,VectorWritable,IntWritable,WeightedVectorWritable> {
  private final Collection<Cluster> clusters = new ArrayList<Cluster>();
  private KMeansClusterer clusterer;

  @Override
  protected void map(WritableComparable<?> key, VectorWritable point, Context context)
    throws IOException, InterruptedException {
    clusterer.outputPointWithClusterInfo(point.get(), clusters, context);
  }

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    Configuration conf = context.getConfiguration();
    try {
      ClassLoader ccl = Thread.currentThread().getContextClassLoader();
      DistanceMeasure measure = ccl.loadClass(conf.get(KMeansConfigKeys.DISTANCE_MEASURE_KEY))
          .asSubclass(DistanceMeasure.class).newInstance();
      measure.configure(conf);
      String clusterPath = conf.get(KMeansConfigKeys.CLUSTER_PATH_KEY);
      if (clusterPath != null && clusterPath.length() > 0) {
        KMeansUtil.configureWithClusterInfo(conf, new Path(clusterPath), clusters);
        if (clusters.isEmpty()) {
          throw new IllegalStateException("No clusters found. Check your -c path.");
        }
      }
      this.clusterer = new KMeansClusterer(measure);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    } catch (IllegalAccessException e) {
      throw new IllegalStateException(e);
    } catch (InstantiationException e) {
      throw new IllegalStateException(e);
    }
  }
}

A. setup again reads the cluster information from the given directory and builds it;
B. map performs the clustering by computing the distance from each point to every cluster "centroid". When map finishes, every point has been placed in the single cluster nearest to it, so no reduce step is needed afterwards.
(7) KMeansDriver
The part worth noting here is the iteration in buildClustersMR; runIteration sets the parameters of the job that runs KMeansMapper, KMeansCombiner and KMeansReducer.
The code of buildClustersMR:
private static Path buildClustersMR(Configuration conf,
                                    Path input,
                                    Path clustersIn,
                                    Path output,
                                    DistanceMeasure measure,
                                    int maxIterations,
                                    String delta) throws IOException, InterruptedException, ClassNotFoundException {
  boolean converged = false;
  int iteration = 1;
  while (!converged && iteration <= maxIterations) {
    log.info("K-Means Iteration {}", iteration);
    // point the output to a new directory per iteration
    Path clustersOut = new Path(output, AbstractCluster.CLUSTERS_DIR + iteration);
    converged = runIteration(conf, input, clustersIn, clustersOut, measure.getClass().getName(), delta);
    // now point the input to the old output directory
    clustersIn = clustersOut;
    iteration++;
  }
  return clustersIn;
}
If KMeansMapper, KMeansCombiner, KMeansReducer and KMeansClusterMapper are the bricks, KMeansDriver is the builder: it organizes the whole kmeans flow (both the single-machine and the MR version). The diagram is as follows:
[Figure 4]
http://www.cnblogs.com/biyeymyhjob/archive/2012/07/20/2599544.html
======== Mahout clustering analysis
What is clustering analysis?
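Clustering groups data objects into multiple classes or clusters, with the goal that objects in the same cluster are highly similar while objects in different clusters differ considerably. In many applications the objects of one cluster can therefore be treated as a whole, which reduces the amount of computation or improves its quality.
Clustering is in fact a common everyday behavior. As the saying goes, "things gather by kind, people divide into groups", and the core idea is clustering. People keep refining their subconscious clustering patterns to learn how to tell things and people apart. Cluster analysis is also widely used in many applications, including pattern recognition, data analysis, image processing and market research. Through clustering, people can identify dense and sparse regions and discover global distribution patterns and interesting relationships among data attributes.
Clustering also plays an increasingly important role in Web applications. The most widespread use is to classify documents on the Web, organize the publication of information, and give users an effectively categorized content browsing system (a portal site). Adding the time dimension makes it possible to follow how the information in each category develops, which themes and topics have drawn attention recently, or what kind of content people were interested in over a period of time; all these applications build on clustering. As a data mining function, cluster analysis can serve as a standalone tool to learn how data is distributed, observe the characteristics of each cluster, and focus further analysis on particular clusters. It can also serve as a preprocessing step for other algorithms, reducing computation and improving analysis efficiency, which is why cluster analysis is introduced here.
Different clustering problems
To choose the most suitable and efficient algorithm for a clustering problem, the problem itself has to be analyzed. Below we look at the requirements of a clustering problem from several angles.
Exclusive or overlapping clustering results
To make this concrete, consider an example. Suppose your clustering problem should yield two clusters: "users who like James Cameron's films" and "users who do not like James Cameron's films". This is an exclusive clustering problem: a user belongs either to the "like" cluster or to the "dislike" cluster. But if the clusters are "users who like James Cameron's films" and "users who like Leonardo's films", the problem is overlapping: one user can like both James Cameron and Leonardo.
The core question is whether an element may belong to more than one cluster in the result. If yes, it is an overlapping clustering problem; if no, an exclusive one.
Hierarchical or partitional
Most people think of clustering as a "partitioning" problem: take a set of objects and split them into groups according to some principle. This is the typical partitional clustering problem. Besides partitional clustering there is another type that is also common in everyday life, hierarchical clustering. Its result arranges the objects in levels: at the top the objects are grouped coarsely, then each group is subdivided further, possibly until every path ends at a single instance. This is the "top-down" approach to hierarchical clustering; correspondingly there is a "bottom-up" approach. Put simply, "top-down" refines groups step by step, while "bottom-up" merges groups step by step.
Fixed or unlimited number of clusters
This property is easy to understand: either the number of clusters the result should contain is fixed before the clustering algorithm runs, or the algorithm chooses a suitable number of clusters based on the characteristics of the data.
Distance-based or based on a probability distribution model
The second article of this series, on collaborative filtering, introduced similarity and distance in detail. Distance-based clustering should be easy to understand: objects that are close, i.e. similar, are grouped together. Clustering based on a probability distribution model may be harder to grasp, so here is a simple example.
A probability distribution model can be understood as the distribution of a set of points in N-dimensional space, where the distribution tends to follow certain characteristics, such as forming a particular shape. Model-based clustering looks, within a set of objects, for the subsets of points that fit a particular distribution model. These points are not necessarily the closest or the most similar; rather, they best exhibit the model that the probability distribution describes.
Figure 1 below gives an example: the same set of points, clustered with different strategies, yields completely different results. The result on the left is distance-based, the core principle being to group nearby points together. The result on the right is based on a probability distribution model, here an ellipse of a certain curvature. Two red points are specially marked in the figure. They are very close to each other, so distance-based clustering puts them in one cluster, whereas model-based clustering puts them in different clusters purely to satisfy the particular distribution model (this is, admittedly, a deliberately extreme example). So in model-based clustering the definition of the model is central, and different models can lead to completely different results.
Figure 1. Distance-based and probability-distribution-model-based clustering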
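The clustering framework in Apache Mahout
Apache Mahout is an open-source project of the Apache Software Foundation (ASF) that provides scalable implementations of classic machine learning algorithms, with the aim of helping developers build intelligent applications more easily and quickly. Recent versions of Mahout have also added support for Apache Hadoop, so that these algorithms can run more efficiently in cloud computing environments.
For installing and configuring Apache Mahout, see "基于 Apache Mahout 构建社会化推荐引擎" (Building a social recommendation engine with Apache Mahout), a developerWorks article the author published in 2009 on implementing a recommendation engine with Mahout; it describes the installation steps in detail.
Mahout provides a number of commonly used clustering algorithms, covering concrete implementations of the various types just discussed. Below we look more closely at the principles, strengths and weaknesses, and use cases of several typical clustering algorithms, and at how to run them efficiently with Mahout.
Clustering algorithms in depth
Before going into the algorithms, here is a brief introduction to the data model Mahout uses for clustering problems.
Data model
Mahout's clustering algorithms represent objects with a simple data model: the vector (Vector). On top of the vector description it is easy to compute the similarity of two objects. Vectors and vector similarity were covered in detail in the previous article of this series, on collaborative filtering; see "“探索推荐引擎内部的秘密”系列 - Part 2: 深入推荐引擎相关算法 -- 协同过滤" ("Exploring the secrets inside recommendation engines", Part 2: recommendation engine algorithms in depth, collaborative filtering).
A Mahout Vector is a composite object in which every field is a floating-point number (double); the most obvious implementation is an array of doubles. In practice, however, vectors differ in content: some are dense, with a value in every field, while others are sparse, with values in only a few fields. Mahout therefore provides several implementations:
- DenseVector is implemented as an array of doubles and stores every field of the vector; it suits dense vectors.
- RandomAccessSparseVector is implemented on a HashMap with int keys and double values; it stores only the non-empty values of the vector and provides random access.
- SequentialAccessSparseVector is implemented as parallel arrays of int and double; it also stores only the non-empty values, but is designed for sequential access.
Choose the vector implementation that fits the needs of your algorithm: if it performs many random accesses, use DenseVector or RandomAccessSparseVector; if most accesses are sequential, SequentialAccessSparseVector should perform better. For K-Means, SequentialAccessSparseVector performs better.
With the vector implementations covered, let us look at how to model existing data as vectors, or in technical terms "how to vectorize data", so that Mahout's efficient clustering algorithms can be applied.
- Simple integer or floating-point data
This is the simplest case: just store the different fields in the vector. A point in n-dimensional space, for instance, can itself be described as a vector.
- Enumerated data
This kind of data describes an object, but takes values from a limited range. Suppose you have a data set about apples, where each record includes size, weight, color and so on. Take color: say the colors are red, yellow and green. When modeling the data we can use numbers for the colors, red = 1, yellow = 2, green = 3, so an apple with a diameter of 8 cm, a weight of 0.15 kg and red color is modeled as the vector <8, 0.15, 1>.
Listing 1 below gives examples of vectorizing these two kinds of data.
Listing 1. Creating simple vectors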
import java.util.ArrayList;
import java.util.List;
import org.apache.mahout.math.DenseVector;
import org.apache.mahout.math.NamedVector;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

// Create the vectors for a set of two-dimensional points
public static final double[][] points = { { 1, 1 }, { 2, 1 }, { 1, 2 },
    { 2, 2 }, { 3, 3 }, { 8, 8 }, { 9, 8 }, { 8, 9 }, { 9, 9 }, { 5, 5 },
    { 5, 6 }, { 6, 6 }};

public static List<Vector> getPointVectors(double[][] raw) {
  List<Vector> points = new ArrayList<Vector>();
  for (int i = 0; i < raw.length; i++) {
    double[] fr = raw[i];
    // A RandomAccessSparseVector is created here
    Vector vec = new RandomAccessSparseVector(fr.length);
    // Store the data in the newly created Vector
    vec.assign(fr);
    points.add(vec);
  }
  return points;
}

// Create the vectors for the apple data
public static List<Vector> generateAppleData() {
  List<Vector> apples = new ArrayList<Vector>();
  // A NamedVector is created here: it wraps one of the Vector types above
  // and gives each Vector a readable name
  NamedVector apple = new NamedVector(new DenseVector(
      new double[] {0.11, 510, 1}),
      "Small round green apple");
  apples.add(apple);
  apple = new NamedVector(new DenseVector(new double[] {0.2, 650, 3}),
      "Large oval red apple");
  apples.add(apple);
  apple = new NamedVector(new DenseVector(new double[] {0.09, 630, 1}),
      "Small elongated red apple");
  apples.add(apple);
  apple = new NamedVector(new DenseVector(new double[] {0.25, 590, 3}),
      "Large round yellow apple");
  apples.add(apple);
  apple = new NamedVector(new DenseVector(new double[] {0.18, 520, 2}),
      "Medium oval green apple");
  apples.add(apple);
  return apples;
}
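- Text information
Text classification is a major application of clustering algorithms, so modeling text is a common problem too. Information retrieval research already offers a good way to model it: the vector space model (Vector Space Model, VSM), the most widely used model in that field. Since the vector space model is not the focus of this article, only a brief introduction is given here; interested readers can consult the documents listed in the references.
The vector space model represents a text as a vector in which each field is the weight of a word that occurs in the text. There are many ways to compute the weight:
- The simplest is a direct count: the number of times the word occurs in the text. This is simple but describes the text content imprecisely.
- Term frequency (Term Frequency, TF): the frequency with which the word occurs in the text is used as its weight. This merely normalizes the direct count, so that texts of different lengths share one value range and their similarity is easier to compare. Neither the direct count nor TF, however, solves the problem that frequent but meaningless words receive large weights: in English text, words such as "a" and "the" are frequent but carry no real meaning, and they are not filtered out, so a text model built this way gives inaccurate text similarities.
- Term frequency - inverse document frequency (Term Frequency - Inverse Document Frequency, TF-IDF): an enhancement of TF. The importance of a word grows in proportion to the number of times it occurs in a document, but falls in inverse proportion to how frequently it occurs across all texts. Frequent meaningless words, for example, occur in nearly every text, so their weights are heavily discounted, which makes the text model describe the characteristics of a text more precisely. In information retrieval, TF-IDF is the most commonly used way to model text.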
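The weighting schemes above can be compared with a small plain-Java sketch. It uses one common textbook variant, tf = count / document length and idf = ln(N / df), which is an assumption made for illustration and not necessarily the exact formula Mahout or Lucene applies; the class TfIdfSketch is hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of TF and TF-IDF weighting over tokenized documents; hypothetical names. */
final class TfIdfSketch {

    /** Term frequency: occurrences of the term divided by the document length. */
    static double tf(String term, List<String> doc) {
        if (doc.isEmpty()) {
            return 0.0;
        }
        long count = doc.stream().filter(term::equals).count();
        return (double) count / doc.size();
    }

    /** Inverse document frequency: ln(N / df), or 0 when the term occurs in no document. */
    static double idf(String term, List<List<String>> corpus) {
        long df = corpus.stream().filter(d -> d.contains(term)).count();
        return df == 0 ? 0.0 : Math.log((double) corpus.size() / df);
    }

    /** Maps every term of the document to its TF-IDF weight within the corpus. */
    static Map<String, Double> vectorize(List<String> doc, List<List<String>> corpus) {
        Map<String, Double> weights = new HashMap<>();
        for (String term : doc) {
            weights.put(term, tf(term, doc) * idf(term, corpus));
        }
        return weights;
    }
}
```

A word such as "the" that occurs in every document gets idf = ln(1) = 0, so its weight vanishes no matter how often it appears, which is exactly the discount described above.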
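For vectorizing text, Mahout already provides utility classes that analyze the text with Lucene and then build the text vectors. Listing 2 below gives an example. The text analyzed is the news data set provided by Reuters; the download address is given in the references. After downloading the data set, put it in the "clustering/reuters" directory.
Listing 2. Creating vectors from text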
public static void documentVectorize(String[] args) throws Exception {
  // 1. Extract the Reuters data; Mahout provides a dedicated method for this
  DocumentClustering.extractReuters();
  // 2. Store the data as a SequenceFile. These utility classes are built on Hadoop,
  //    so the data first has to be written as a SequenceFile to be read and processed
  DocumentClustering.transformToSequenceFile();
  // 3. Vectorize the data in the SequenceFile with the Lucene-based tools
  DocumentClustering.transformToVector();
}

public static void extractReuters() {
  // ExtractReuters is implemented on Hadoop, so it needs the input and output directories.
  // Here they are simply mapped to local folders; the extracted data is written
  // to the output directory
  File inputFolder = new File("clustering/reuters");
  File outputFolder = new File("clustering/reuters-extracted");
  ExtractReuters extractor = new ExtractReuters(inputFolder, outputFolder);
  extractor.extract();
}

public static void transformToSequenceFile() {
  // SequenceFilesFromDirectory writes all files under a directory into SequenceFiles.
  // It is a utility class that can be run from the command line;
  // here its main method is called directly
  String[] args = {"-c", "UTF-8", "-i", "clustering/reuters-extracted/", "-o",
      "clustering/reuters-seqfiles"};
  // Meaning of the parameters:
  // -c: character encoding of the files, here "UTF-8"
  // -i: input directory, here the directory the files were just extracted to
  // -o: output directory
  try {
    SequenceFilesFromDirectory.main(args);
  } catch (Exception e) {
    e.printStackTrace();
  }
}

public static void transformToVector() {
  // SparseVectorsFromSequenceFiles vectorizes the data in the SequenceFiles.
  // It is a utility class that can be run from the command line;
  // here its main method is called directly
  String[] args = {"-i", "clustering/reuters-seqfiles/", "-o",
      "clustering/reuters-vectors-bigram", "-a",
      "org.apache.lucene.analysis.WhitespaceAnalyzer",
      "-chunk", "200", "-wt", "tfidf", "-s", "5",
      "-md", "3", "-x", "90", "-ng", "2", "-ml", "50", "-seq"};
  // Meaning of the parameters:
  // -i: input directory, here the directory of the SequenceFiles just generated
  // -o: output directory
  // -a: Analyzer to use, here Lucene's whitespace-tokenizing Analyzer
  // -chunk: chunk size in MB. A large collection cannot be loaded all at once,
  //         so the data has to be split into chunks
  // -wt: weighting scheme used in the analysis, here tfidf
  // -s: minimum frequency of a term in the whole collection; terms below it are dropped
  // -md: minimum number of distinct documents a term must occur in; terms below it are dropped
  // -x: upper bound on the occurrence frequency of high-frequency and meaningless
  //     words (such as is, a, the); terms above the bound are dropped
  // -ng: maximum n-gram size considered after tokenizing. With 1-grams, coca and cola
  //      are two terms; with 2-grams, coca cola is one term. 2-grams can give
  //      more accurate analysis in some cases.
  // -ml: similarity threshold for deciding whether adjacent words form one term; only
  //      relevant for n-gram sizes > 1. What is computed is the threshold of the
  //      Minimum Log Likelihood Ratio
  // -seq: produce SequentialAccessSparseVectors; when not set,
  //       RandomAccessSparseVectors are produced by default
  try {
    SparseVectorsFromSequenceFiles.main(args);
  } catch (Exception e) {
    e.printStackTrace();
  }
}
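One more note: the directory structure of the generated vector files looks like this:
Figure 2. Vectorizing text
- df-count directory: holds the document frequency information
- tf-vectors directory: holds the text vectors weighted by TF
- tfidf-vectors directory: holds the text vectors weighted by TF-IDF
- tokenized-documents directory: holds the tokenized text
- wordcount directory: holds the global occurrence count of each term
- dictionary.file-0: holds the vocabulary of these texts
- frequency.file-0: holds the frequency information for the vocabulary.
With vectorization covered, we now analyze the individual clustering algorithms in depth, starting with the most classic one, K-means.
The K-means clustering algorithm
K-means is a typical distance-based, exclusive, partitional method: given a data set of n objects, it builds k partitions of the data, each partition being one cluster, with k <= n, and satisfying two requirements:
- every group contains at least one object
- every object belongs to exactly one group.
The basic principle of K-means is as follows. Given k, the number of partitions to build,
- First create an initial partition: choose k objects at random, each initially representing a cluster center. Every other object is assigned to the nearest cluster according to its distance to the cluster centers.
- Then apply an iterative relocation technique that tries to improve the partition by moving objects between partitions. Relocation means that whenever a new object joins a cluster or an existing object leaves it, the mean of the cluster is recomputed and the objects are reassigned. This process repeats until the objects in the clusters no longer change.
K-means works well when the resulting clusters are dense and clearly separated from one another. For large data sets the algorithm is relatively scalable and efficient; its complexity is O(nkt), where n is the number of objects, k the number of clusters and t the number of iterations, and usually k << n and t << n. The algorithm often terminates at a local optimum.
The biggest problem of K-means is that the user has to supply k in advance. The choice of k is generally based on experience and repeated experiments, and a value that works for one data set does not carry over to another. K-means is also sensitive to "noise" and outliers: a small amount of such data can have a very large effect on the mean.
After all this theory, let us implement a simple K-means example with Mahout. As described earlier, Mahout provides a basic in-memory implementation and a Map/Reduce implementation on Hadoop, KMeansClusterer and KMeansDriver respectively. A simple example follows, based on the two-dimensional point set defined in Listing 1.
Listing 3. K-means clustering example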
// In-memory K-means clustering
public static void kMeansClusterInMemoryKMeans() {
  // Number of clusters, here 2
  int k = 2;
  // Maximum number of K-means iterations
  int maxIter = 3;
  // Convergence distance threshold for K-means
  double distanceThreshold = 0.01;
  // Distance measure, here Euclidean distance
  DistanceMeasure measure = new EuclideanDistanceMeasure();
  // Build the vector set from the two-dimensional points of Listing 1
  List<Vector> pointVectors = SimpleDataSet.getPointVectors(SimpleDataSet.points);
  // Randomly choose k of the point vectors as cluster centers
  List<Vector> randomPoints = RandomSeedGenerator.chooseRandomPoints(pointVectors, k);
  // Build the clusters from the chosen centers
  List<Cluster> clusters = new ArrayList<Cluster>();
  int clusterId = 0;
  for (Vector v : randomPoints) {
    clusters.add(new Cluster(v, clusterId++, measure));
  }
  // Run K-means by calling KMeansClusterer.clusterPoints
  List<List<Cluster>> finalClusters = KMeansClusterer.clusterPoints(pointVectors,
      clusters, measure, maxIter, distanceThreshold);
  // Print the final clustering result
  for (Cluster cluster : finalClusters.get(finalClusters.size() - 1)) {
    System.out.println("Cluster id: " + cluster.getId() +
        " center: " + cluster.getCenter().asFormatString());
    System.out.println(" Points: " + cluster.getNumPoints());
  }
}
// Hadoop-based K-means clustering
public static void kMeansClusterUsingMapReduce() throws Exception {
  // Distance measure, here Euclidean distance
  DistanceMeasure measure = new EuclideanDistanceMeasure();
  // Input and output paths. As described earlier, the Hadoop-based implementation
  // locates its data source through input and output file paths.
  Path testpoints = new Path("testpoints");
  Path output = new Path("output");
  // Clear any data under the input and output paths
  HadoopUtil.overwriteOutput(testpoints);
  HadoopUtil.overwriteOutput(output);
  RandomUtils.useTestSeed();
  // Generate the point set under the input path. Unlike the in-memory version,
  // all vectors have to be written to a file; the code is given below
  SimpleDataSet.writePointsToFile(testpoints);
  // Number of clusters, here 2
  int k = 2;
  // Maximum number of K-means iterations
  int maxIter = 3;
  // Convergence distance threshold for K-means
  double distanceThreshold = 0.01;
  // Randomly choose k points as cluster centers
  Path clusters = RandomSeedGenerator.buildRandom(testpoints,
      new Path(output, "clusters-0"), k, measure);
  // Run K-means by calling KMeansDriver.runJob
  KMeansDriver.runJob(testpoints, clusters, output, measure,
      distanceThreshold, maxIter, 1, true, true);
  // Print the clustering result with ClusterDumper.printClusters.
  // The parentheses matter: "clusters-" + maxIter - 1 would not compile.
  ClusterDumper clusterDumper = new ClusterDumper(new Path(output,
      "clusters-" + (maxIter - 1)), new Path(output, "clusteredPoints"));
  clusterDumper.printClusters(null);
}
// writePointsToFile of SimpleDataSet writes the test point set to a file.
// First the test points are wrapped as VectorWritable so they can be written out
public static List<VectorWritable> getPoints(double[][] raw) {
  List<VectorWritable> points = new ArrayList<VectorWritable>();
  for (int i = 0; i < raw.length; i++) {
    double[] fr = raw[i];
    Vector vec = new RandomAccessSparseVector(fr.length);
    vec.assign(fr);
    // The only difference: the RandomAccessSparseVector is wrapped in a
    // VectorWritable before it is added to the point set
    points.add(new VectorWritable(vec));
  }
  return points;
}

// Write the VectorWritable point set to a file. This involves some basic Hadoop
// programming elements; see the related material in the references for details
public static void writePointsToFile(Path output) throws IOException {
  // Generate the point set with the method above
  List<VectorWritable> pointVectors = getPoints(points);
  // Basic Hadoop configuration
  Configuration conf = new Configuration();
  // Obtain the Hadoop FileSystem object
  FileSystem fs = FileSystem.get(output.toUri(), conf);
  // Create a SequenceFile.Writer, which writes the Vectors to the file
  SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, output,
      Text.class, VectorWritable.class);
  // Write each vector to the file with an empty Text key
  try {
    for (VectorWritable vw : pointVectors) {
      writer.append(new Text(), vw);
    }
  } finally {
    writer.close();
  }
}
Execution results
KMeans Clustering In Memory Result
Cluster id: 0
center:{"class":"org.apache.mahout.math.RandomAccessSparseVector",
"vector":"{\"values\":{\"table\":[0,1,0],\"values\":[1.8,1.8,0.0],\"state\":[1,1,0],
\"freeEntries\":1,\"distinct\":2,\"lowWaterMark\":0,\"highWaterMark\":1,
\"minLoadFactor\":0.2,\"maxLoadFactor\":0.5},\"size\":2,\"lengthSquared\":-1.0}"}
Points: 5
Cluster id: 1
center:{"class":"org.apache.mahout.math.RandomAccessSparseVector",
"vector":"{\"values\":{\"table\":[0,1,0],
\"values\":[7.142857142857143,7.285714285714286,0.0],\"state\":[1,1,0],
\"freeEntries\":1,\"distinct\":2,\"lowWaterMark\":0,\"highWaterMark\":1,
\"minLoadFactor\":0.2,\"maxLoadFactor\":0.5},\"size\":2,\"lengthSquared\":-1.0}"}
Points: 7
KMeans Clustering Using Map/Reduce Result
Weight: Point:
1.0: [1.000, 1.000]
1.0: [2.000, 1.000]
1.0: [1.000, 2.000]
1.0: [2.000, 2.000]
1.0: [3.000, 3.000]
Weight: Point:
1.0: [8.000, 8.000]
1.0: [9.000, 8.000]
1.0: [8.000, 9.000]
1.0: [9.000, 9.000]
1.0: [5.000, 5.000]
1.0: [5.000, 6.000]
1.0: [6.000, 6.000]
1) Path input: path of all the data points to be clustered. Required.
2) Path clusters: path where the center of each cluster is stored. Required.
3) Path output: path where the clustering result is stored. Required. If the number of clusters is specified, the files under this path may be empty.
4) DistanceMeasure measure: method used to compute the distance between data points. Optional; the default is SquaredEuclidean.
Available values: ChebyshevDistanceMeasure Chebyshev distance
CosineDistanceMeasure cosine distance
EuclideanDistanceMeasure Euclidean distance
MahalanobisDistanceMeasure Mahalanobis distance
ManhattanDistanceMeasure Manhattan distance
MinkowskiDistanceMeasure Minkowski distance
SquaredEuclideanDistanceMeasure Euclidean distance (without taking the square root)
TanimotoDistanceMeasure Tanimoto coefficient distance
There are also some weight-based distance measures:
WeightedDistanceMeasure
WeightedEuclideanDistanceMeasure, WeightedManhattanDistanceMeasure
5) Double convergenceDelta: convergence threshold. If the distance between a new cluster center and the previous one exceeds convergenceDelta, iteration continues; otherwise it stops. Optional; the default is 0.5.
6) int maxIterations: maximum number of iterations. Iteration continues while the iteration count is less than maxIterations and stops otherwise. Iteration ends as soon as either this condition or the convergenceDelta condition in 5) is met. Required.
7) boolean runClustering: if true, each data point is assigned to a cluster after the cluster centers have been computed; otherwise the run ends once the centers are computed. Optional; the default is true.
8) clusteringOption: whether to compute on a single machine or with Map/Reduce. Optional; the default is mapreduce.
9) int numClustersOption: number of clusters. Optional.
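How parameters 4) to 6) interact can be sketched as a small single-machine loop in plain Java. This is only an illustration, not Mahout's KMeansDriver: the class name KMeansStoppingSketch, the Distance interface, and the fixed initial centers (1, 1) and (8, 8) are assumptions made for the example. The data is the 12-point sample set that appears in the result listings.

```java
import java.util.Arrays;

final class KMeansStoppingSketch {

    /** Hypothetical stand-in for a pluggable distance measure (parameter 4). */
    interface Distance {
        double between(double[] a, double[] b);
    }

    static final Distance EUCLIDEAN = (a, b) -> {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    };

    /** Updates centers in place and returns the number of iterations that were run. */
    static int run(double[][] points, double[][] centers, Distance measure,
                   double convergenceDelta, int maxIterations) {
        int dim = points[0].length;
        int iterations = 0;
        boolean converged = false;
        // Stop as soon as either condition holds: converged, or maxIterations reached.
        while (!converged && iterations < maxIterations) {
            double[][] sums = new double[centers.length][dim];
            int[] counts = new int[centers.length];
            for (double[] p : points) {
                int nearest = 0;
                for (int c = 1; c < centers.length; c++) {
                    if (measure.between(p, centers[c]) < measure.between(p, centers[nearest])) {
                        nearest = c;
                    }
                }
                counts[nearest]++;
                for (int d = 0; d < dim; d++) {
                    sums[nearest][d] += p[d];
                }
            }
            converged = true;
            for (int c = 0; c < centers.length; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                double[] updated = new double[dim];
                for (int d = 0; d < dim; d++) {
                    updated[d] = sums[c][d] / counts[c];
                }
                // A center that moved farther than convergenceDelta forces another pass.
                if (measure.between(updated, centers[c]) > convergenceDelta) {
                    converged = false;
                }
                centers[c] = updated;
            }
            iterations++;
        }
        return iterations;
    }

    public static void main(String[] args) {
        double[][] points = {{1, 1}, {2, 1}, {1, 2}, {2, 2}, {3, 3}, {8, 8},
                             {9, 8}, {8, 9}, {9, 9}, {5, 5}, {5, 6}, {6, 6}};
        double[][] centers = {{1, 1}, {8, 8}}; // assumed, not randomly chosen
        int iterations = run(points, centers, EUCLIDEAN, 0.01, 10);
        System.out.println("iterations: " + iterations);
        System.out.println("centers: " + Arrays.deepToString(centers));
    }
}
```

With these assumed initial centers the loop ends after the second pass, because no center moves between the first and the second update. The resulting centers agree with the in-memory K-means result listed above, (1.8, 1.8) with 5 points and roughly (7.143, 7.286) with 7 points.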
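Having introduced the K-means clustering algorithm, we can see that its greatest strengths are a simple principle, a relatively simple implementation, and fairly good execution efficiency and scalability to large data volumes. Its weaknesses are equally clear. First, the user has to fix the number of clusters before clustering starts, which in most problems cannot be known in advance; usually the best K has to be found through repeated trials. Second, because the algorithm picks its initial cluster centers at random, it tolerates noise and outliers poorly. Noise means erroneous data among the objects to be clustered; an outlier is a data point that lies far from the rest of the data and has low similarity to it. For K-means, once an outlier or a noise point is picked as an initial cluster center, the whole subsequent clustering process suffers. Is there a way to quickly work out how many clusters to choose and, at the same time, find the cluster centers, so that K-means can run much more efficiently? The next section introduces another clustering method for this purpose: the Canopy clustering algorithm.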
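Canopy clustering algorithm
The basic idea of Canopy clustering is as follows. First, a cheap, approximate distance measure is used to split the data efficiently into several groups, each of which is called a Canopy; Canopies may overlap. Then a strict distance measure is applied to the points within the same Canopy to assign them accurately to the most suitable cluster. Canopy clustering is often used as a preprocessing step for K-means, to find a suitable value of k and the cluster centers.
The process of creating Canopies is as follows. Initially, suppose we have a point set S and two preset distance thresholds T1 and T2 (T1 > T2). Pick a point and compute its distance to the other points in S (using a very cheap distance measure). Put the points within distance T1 into one Canopy, and remove from S the points that are within distance T2 of the chosen point (this guarantees that points within T2 of a center cannot become the center of another Canopy). Repeat the whole process until S is empty.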
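The creation process just described can be sketched in plain Java without Mahout. This is an illustrative sketch only: the class name CanopySketch and its helper methods are hypothetical, Euclidean distance stands in for the cheap measure, and the point that is picked is simply the first one remaining in S.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class CanopySketch {

    /** Builds the canopies; each canopy is returned as the list of its member points. */
    static List<List<double[]>> createCanopies(List<double[]> points, double t1, double t2) {
        List<double[]> s = new ArrayList<>(points);      // working copy of the set S
        List<List<double[]>> canopies = new ArrayList<>();
        while (!s.isEmpty()) {
            double[] center = s.remove(0);               // pick a point and take it out of S
            List<double[]> canopy = new ArrayList<>();
            canopy.add(center);
            for (Iterator<double[]> it = s.iterator(); it.hasNext();) {
                double[] p = it.next();
                double d = distance(center, p);
                if (d < t1) {
                    canopy.add(p);                       // within T1: member of this canopy
                }
                if (d < t2) {
                    it.remove();                         // within T2: may no longer start a canopy
                }
            }
            canopies.add(canopy);
        }
        return canopies;
    }

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    /** Mean of the member points, used here as the canopy center for printing. */
    static double[] mean(List<double[]> pts) {
        double[] m = new double[pts.get(0).length];
        for (double[] p : pts) {
            for (int i = 0; i < m.length; i++) {
                m[i] += p[i];
            }
        }
        for (int i = 0; i < m.length; i++) {
            m[i] /= pts.size();
        }
        return m;
    }

    public static void main(String[] args) {
        double[][] raw = {{1, 1}, {2, 1}, {1, 2}, {2, 2}, {3, 3}, {8, 8},
                          {9, 8}, {8, 9}, {9, 9}, {5, 5}, {5, 6}, {6, 6}};
        List<List<double[]>> canopies = createCanopies(Arrays.asList(raw), 4.0, 3.0);
        for (int i = 0; i < canopies.size(); i++) {
            System.out.println("Canopy " + i + " center: " + Arrays.toString(mean(canopies.get(i)))
                    + " points: " + canopies.get(i).size());
        }
    }
}
```

On the 12 sample points with T1 = 4.0 and T2 = 3.0 this produces three canopies of 5, 6 and 2 points. The point (5, 6) lies within T1 but not within T2 of (8, 8), so it joins that canopy and still remains in S, which is how canopies come to overlap.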
As with K-means, Mahout provides two implementations of Canopy clustering. Let's look at concrete code examples.
Listing 4. Canopy clustering example
// In-memory implementation of Canopy clustering
public static void canopyClusterInMemory() {
    // Distance thresholds T1 and T2
    double T1 = 4.0;
    double T2 = 3.0;
    // Call CanopyClusterer.createCanopies to create the Canopies. The parameters are:
    // 1. the point set to cluster
    // 2. the distance measure
    // 3. the distance thresholds T1 and T2
    List<Canopy> canopies = CanopyClusterer.createCanopies(
            SimpleDataSet.getPointVectors(SimpleDataSet.points),
            new EuclideanDistanceMeasure(), T1, T2);
    // Print the Canopies. The clustering problem is simple, so the next, exact clustering step is not performed here.
    // When needed, the Canopy result can be used as the input to K-means to solve the clustering problem more accurately and efficiently.
    for (Canopy canopy : canopies) {
        System.out.println("Cluster id: " + canopy.getId() +
                " center: " + canopy.getCenter().asFormatString());
        System.out.println(" Points: " + canopy.getNumPoints());
    }
}
// Hadoop implementation of Canopy clustering
public static void canopyClusterUsingMapReduce() throws Exception {
    // Distance thresholds T1 and T2
    double T1 = 4.0;
    double T2 = 3.0;
    // Distance measure
    DistanceMeasure measure = new EuclideanDistanceMeasure();
    // Input and output paths
    Path testpoints = new Path("testpoints");
    Path output = new Path("output");
    // Clear any data under the input and output paths
    HadoopUtil.overwriteOutput(testpoints);
    HadoopUtil.overwriteOutput(output);
    // Write the test point set into the input directory
    SimpleDataSet.writePointsToFile(testpoints);
    // Call CanopyDriver.buildClusters to run Canopy clustering. The parameters are:
    // 1. the input path and the output path
    // 2. the distance measure
    // 3. the distance thresholds T1 and T2
    new CanopyDriver().buildClusters(testpoints, output, measure, T1, T2, true);
    // Print the Canopy clustering result
    List<List<Cluster>> clustersM = DisplayClustering.loadClusters(output);
    List<Cluster> clusters = clustersM.get(clustersM.size() - 1);
    if (clusters != null) {
        for (Cluster canopy : clusters) {
            System.out.println("Cluster id: " + canopy.getId() +
                    " center: " + canopy.getCenter().asFormatString());
            System.out.println(" Points: " + canopy.getNumPoints());
        }
    }
}
Execution result
Canopy Clustering In Memory Result
Cluster id: 0
center:{"class":"org.apache.mahout.math.RandomAccessSparseVector",
"vector":"{\"values\":{\"table\":[0,1,0],\"values\":[1.8,1.8,0.0],
\"state\":[1,1,0],\"freeEntries\":1,\"distinct\":2,\"lowWaterMark\":0,
\"highWaterMark\":1,\"minLoadFactor\":0.2,\"maxLoadFactor\":0.5},
\"size\":2,\"lengthSquared\":-1.0}"}
Points: 5
Cluster id: 1
center:{"class":"org.apache.mahout.math.RandomAccessSparseVector",
"vector":"{\"values\":{\"table\":[0,1,0],\"values\":[7.5,7.666666666666667,0.0],
\"state\":[1,1,0],\"freeEntries\":1,\"distinct\":2,\"lowWaterMark\":0,
\"highWaterMark\":1,\"minLoadFactor\":0.2,\"maxLoadFactor\":0.5},\"size\":2,
\"lengthSquared\":-1.0}"}
Points: 6
Cluster id: 2
center:{"class":"org.apache.mahout.math.RandomAccessSparseVector",
"vector":"{\"values\":{\"table\":[0,1,0],\"values\":[5.0,5.5,0.0],
\"state\":[1,1,0],\"freeEntries\":1,\"distinct\":2,\"lowWaterMark\":0,
\"highWaterMark\":1,\"minLoadFactor\":0.2,\"maxLoadFactor\":0.5},\"size\":2,
\"lengthSquared\":-1.0}"}
Points: 2
Canopy Clustering Using Map/Reduce Result
Cluster id: 0
center:{"class":"org.apache.mahout.math.RandomAccessSparseVector",
"vector":"{\"values\":{\"table\":[0,1,0],\"values\":[1.8,1.8,0.0],
\"state\":[1,1,0],\"freeEntries\":1,\"distinct\":2,\"lowWaterMark\":0,
\"highWaterMark\":1,\"minLoadFactor\":0.2,\"maxLoadFactor\":0.5},
\"size\":2,\"lengthSquared\":-1.0}"}
Points: 5
Cluster id: 1
center:{"class":"org.apache.mahout.math.RandomAccessSparseVector",
"vector":"{\"values\":{\"table\":[0,1,0],\"values\":[7.5,7.666666666666667,0.0],
\"state\":[1,1,0],\"freeEntries\":1,\"distinct\":2,\"lowWaterMark\":0,
\"highWaterMark\":1,\"minLoadFactor\":0.2,\"maxLoadFactor\":0.5},\"size\":2,
\"lengthSquared\":-1.0}"}
Points: 6
Cluster id: 2
center:{"class":"org.apache.mahout.math.RandomAccessSparseVector",
"vector":"{\"values\":{\"table\":[0,1,0],
\"values\":[5.333333333333333,5.666666666666667,0.0],\"state\":[1,1,0],
\"freeEntries\":1,\"distinct\":2,\"lowWaterMark\":0,\"highWaterMark\":1,
\"minLoadFactor\":0.2,\"maxLoadFactor\":0.5},\"size\":2,\"lengthSquared\":-1.0}"}
Points: 3
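Fuzzy K-means clustering algorithm
Fuzzy K-means is an extension of K-means clustering. Its basic principle is the same as that of K-means, except that its result allows an object to belong to more than one cluster; in other words, it is one of the overlapping clustering algorithms introduced earlier. To understand the difference between fuzzy K-means and K-means, we need to spend some time on one concept: the fuzziness factor.
Like K-means, fuzzy K-means loops over the set of vectors to be clustered, but instead of assigning each vector to the nearest cluster it computes the association between the vector and every cluster. Suppose there is a vector v and k clusters, and the distances from v to the k cluster centers are d1, d2, ..., dk. Then the association u1 of v with the first cluster can be computed with the following formula: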
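The formula itself is missing at this point in the text. As an assumed reconstruction, the standard fuzzy K-means membership, written with the distances d1, ..., dk and the fuzziness factor m used in the surrounding description, is:

```latex
u_1 = \frac{1}{\sum_{j=1}^{k} \left( \dfrac{d_1}{d_j} \right)^{\frac{2}{m-1}}}
```

The term for j = 1 is always 1, so u1 is at most 1, and the associations u1, ..., uk of one vector add up to 1.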
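To compute the association of v with any other cluster, simply replace d1 with the corresponding distance.
For the standard fuzzy K-means formula, m must be greater than 1. As m approaches 1, the association with the nearest cluster approaches 1 and the result approaches that of plain K-means; the larger m is, the more evenly the associations are spread and the fuzzier the clustering. A value of 2 is a common choice. This m is the fuzziness factor mentioned above.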
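The effect of m can be checked with a few lines of plain Java. The sketch below assumes the standard fuzzy K-means membership, u_i = 1 / sum over j of (d_i / d_j)^(2 / (m - 1)); the class name FuzzyMembershipSketch and the sample distances 1.0 and 3.0 are made up for illustration, and all distances are assumed to be greater than zero.

```java
import java.util.Arrays;

final class FuzzyMembershipSketch {

    /** Association of one vector with each cluster, given its distances to the cluster centers. */
    static double[] memberships(double[] distances, double m) {
        if (m <= 1.0) {
            throw new IllegalArgumentException("the fuzziness factor m must be greater than 1");
        }
        double exponent = 2.0 / (m - 1.0);
        double[] u = new double[distances.length];
        for (int i = 0; i < distances.length; i++) {
            double sum = 0;
            for (int j = 0; j < distances.length; j++) {
                sum += Math.pow(distances[i] / distances[j], exponent);
            }
            u[i] = 1.0 / sum;
        }
        return u;
    }

    public static void main(String[] args) {
        double[] distances = {1.0, 3.0}; // v is three times closer to cluster 0 than to cluster 1
        for (double m : new double[] {1.1, 2.0, 10.0}) {
            System.out.println("m = " + m + " -> " + Arrays.toString(memberships(distances, m)));
        }
    }
}
```

For m = 2 the two associations are 0.9 and 0.1. For m = 1.1 the nearer cluster receives almost all of the weight, and for m = 10 the split moves toward an even one, which is the growing fuzziness described above.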
With the theory covered, let's see how to implement fuzzy K-means clustering with Mahout. As with the previous methods, Mahout provides two implementations, one in memory and one based on Hadoop Map/Reduce: FuzzyKMeansClusterer and FuzzyKMeansDriver. Listing 5 gives an example of each.
Listing 5. Fuzzy K-means clustering example
public static void fuzzyKMeansClusterInMemory() {
    // Number of clusters
    int k = 2;
    // Maximum number of iterations
    int maxIter = 3;
    // Maximum distance threshold
    double distanceThreshold = 0.01;
    // Fuzziness factor of the fuzzy K-means algorithm
    float fuzzificationFactor = 10;
    // Distance measure; Euclidean distance is used here
    DistanceMeasure measure = new EuclideanDistanceMeasure();
    // Build the vector set from the 2D point set of Listing 1
    List<Vector> pointVectors = SimpleDataSet.getPointVectors(SimpleDataSet.points);
    // Randomly choose k of the point vectors as cluster centers
    List<Vector> randomPoints = RandomSeedGenerator.chooseRandomPoints(pointVectors, k);
    // Build the initial clusters. Unlike K-means, SoftCluster is used here to indicate that clusters may overlap
    List<SoftCluster> clusters = new ArrayList<SoftCluster>();
    int clusterId = 0;
    for (Vector v : randomPoints) {
        clusters.add(new SoftCluster(v, clusterId++, measure));
    }
    // Call FuzzyKMeansClusterer's clusterPoints method to run fuzzy K-means clustering
    List<List<SoftCluster>> finalClusters =
            FuzzyKMeansClusterer.clusterPoints(pointVectors,
                    clusters, measure, distanceThreshold, maxIter, fuzzificationFactor);
    // Print the clustering result
    for (SoftCluster cluster : finalClusters.get(finalClusters.size() - 1)) {
        System.out.println("Fuzzy Cluster id: " + cluster.getId() +
                " center: " + cluster.getCenter().asFormatString());
    }
}
public static void fuzzyKMeansClusterUsingMapReduce() throws Exception {
    // Fuzziness factor of the fuzzy K-means algorithm
    float fuzzificationFactor = 2.0f;
    // Number of clusters; 2 is chosen here
    int k = 2;
    // Maximum number of iterations
    int maxIter = 3;
    // Maximum distance threshold
    double distanceThreshold = 0.01;
    // Distance measure; Euclidean distance is used here
    DistanceMeasure measure = new EuclideanDistanceMeasure();
    // Input and output paths
    Path testpoints = new Path("testpoints");
    Path output = new Path("output");
    // Clear any data under the input and output paths
    HadoopUtil.overwriteOutput(testpoints);
    HadoopUtil.overwriteOutput(output);
    // Write the test point set into the input directory
    SimpleDataSet.writePointsToFile(testpoints);
    // Randomly choose k points as the cluster centers
    Path clusters = RandomSeedGenerator.buildRandom(testpoints,
            new Path(output, "clusters-0"), k, measure);
    FuzzyKMeansDriver.runJob(testpoints, clusters, output, measure, 0.5, maxIter, 1,
            fuzzificationFactor, true, true, distanceThreshold, true);
    // Print the fuzzy K-means clustering result
    ClusterDumper clusterDumper = new ClusterDumper(new Path(output, "clusters-" +
            maxIter), new Path(output, "clusteredPoints"));
    clusterDumper.printClusters(null);
}
Execution result
Fuzzy KMeans Clustering In Memory Result
Fuzzy Cluster id: 0
center:{"class":"org.apache.mahout.math.RandomAccessSparseVector",
"vector":"{\"values\":{\"table\":[0,1,0],
\"values\":[1.9750483367699223,1.993870669568863,0.0],\"state\":[1,1,0],
\"freeEntries\":1,\"distinct\":2,\"lowWaterMark\":0,\"highWaterMark\":1,
\"minLoadFactor\":0.2,\"maxLoadFactor\":0.5},\"size\":2,\"lengthSquared\":-1.0}"}
Fuzzy Cluster id: 1
center:{"class":"org.apache.mahout.math.RandomAccessSparseVector",
"vector":"{\"values\":{\"table\":[0,1,0],
\"values\":[7.924827516566109,7.982356511917616,0.0],\"state\":[1,1,0],
\"freeEntries\":1, \"distinct\":2,\"lowWaterMark\":0,\"highWaterMark\":1,
\"minLoadFactor\":0.2,\"maxLoadFactor\":0.5},\"size\":2,\"lengthSquared\":-1.0}"}
Fuzzy KMeans Clustering Using Map Reduce Result
Weight: Point:
0.9999249428064162: [8.000, 8.000]
0.9855340718746096: [9.000, 8.000]
0.9869963781734195: [8.000, 9.000]
0.9765978701133124: [9.000, 9.000]
0.6280999013864511: [5.000, 6.000]
0.7826097471578298: [6.000, 6.000]
Weight: Point:
0.9672607354172386: [1.000, 1.000]
0.9794914088151625: [2.000, 1.000]
0.9803932521191389: [1.000, 2.000]
0.9977806183197744: [2.000, 2.000]
0.9793701109946826: [3.000, 3.000]
0.5422929338028506: [5.000, 5.000]
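Dirichlet clustering algorithm
The three clustering algorithms introduced so far are all partition-based. Next we briefly introduce a clustering algorithm based on a probability distribution model: Dirichlet Processes Clustering.
First, a brief look at how clustering based on a probability distribution model (model-based clustering for short) works. A distribution model has to be defined first: simple ones such as a circle or a triangle, or more complex ones such as the normal distribution or the Poisson distribution. The data is then classified according to the model; as objects are added to a model, the model grows or shrinks. After each round the parameters of the model are recomputed and the probability that each object belongs to the model is estimated. The core of model-based clustering is therefore the definition of the model: for a given clustering problem, the quality of the model definition directly affects the clustering result. Here is a simple example. Suppose the task is to split some 2D points into three groups, shown in different colors in the figure. Figure A is the clustering result with a circular model and Figure B is the result with a triangular model. The circular model is clearly the right choice, whereas the triangular model both misses points and misassigns points and is the wrong choice.
Figure 3. Clustering results with different models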
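Mahout's Dirichlet clustering implementation works as follows. We start with a set of objects to be clustered and a distribution model; in Mahout, ModelDistribution is used to generate the various models. In the initial state we have an empty model. We then try to add objects to the model and compute, step by step, the probability that each object belongs to each model. Listing 6 gives the in-memory Dirichlet clustering implementation.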
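Before the Mahout listing, the step "compute the probability that each object belongs to each model" can be illustrated in isolation. The plain-Java sketch below is not how DirichletClusterer is implemented; it only normalizes weight times density over a fixed set of models. The class name ModelProbabilitySketch, the isotropic normal models, their means, the standard deviation of 1.5, and the mixture weights are all assumptions chosen for the example.

```java
import java.util.Arrays;

final class ModelProbabilitySketch {

    /** Density of an isotropic normal distribution with the given mean and standard deviation. */
    static double normalDensity(double[] x, double[] mean, double sd) {
        double squared = 0;
        for (int i = 0; i < x.length; i++) {
            squared += (x[i] - mean[i]) * (x[i] - mean[i]);
        }
        double normalizer = Math.pow(2 * Math.PI * sd * sd, x.length / 2.0);
        return Math.exp(-squared / (2 * sd * sd)) / normalizer;
    }

    /** Probability that x belongs to each model: weight * density, divided by the total. */
    static double[] membership(double[] x, double[][] means, double[] sds, double[] weights) {
        double[] p = new double[means.length];
        double total = 0;
        for (int k = 0; k < means.length; k++) {
            p[k] = weights[k] * normalDensity(x, means[k], sds[k]);
            total += p[k];
        }
        for (int k = 0; k < p.length; k++) {
            p[k] /= total;
        }
        return p;
    }

    public static void main(String[] args) {
        // Two assumed models, placed near the two groups of the sample point set.
        double[][] means = {{1.8, 1.8}, {7.14, 7.29}};
        double[] sds = {1.5, 1.5};
        double[] weights = {5.0 / 12, 7.0 / 12};
        for (double[] x : new double[][] {{2, 2}, {5, 5}, {9, 9}}) {
            System.out.println(Arrays.toString(x) + " -> "
                    + Arrays.toString(membership(x, means, sds, weights)));
        }
    }
}
```

A point close to one model's mean receives nearly all of its probability from that model, while a point between the two means, such as (5, 5), is shared between them. A full model-based clusterer would go on to re-estimate the model parameters from these probabilities in each round.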
Listing 6. Dirichlet clustering example
public static void DirichletProcessesClusterInMemory() {
    // alpha parameter of the Dirichlet algorithm: a transition parameter that lets objects move smoothly between models
    double alphaValue = 1.0;
    // Number of cluster models
    int numModels = 3;
    // thin and burn interval parameters, used to reduce memory usage during clustering
    int thinIntervals = 2;
    int burnIntervals = 2;
    // Maximum number of iterations
    int maxIter = 3;
    List<VectorWritable> pointVectors =
            SimpleDataSet.getPoints(SimpleDataSet.points);
    // Create an empty distribution model for the initial stage; NormalModelDistribution is used here
    ModelDistribution<VectorWritable> model =
            new NormalModelDistribution(new VectorWritable(new DenseVector(2)));
    // Run the clustering
    DirichletClusterer dc = new DirichletClusterer(pointVectors, model, alphaValue,
            numModels, thinIntervals, burnIntervals);
    List<Cluster[]> result = dc.cluster(maxIter);
    // Print the clustering result
    for (Cluster cluster : result.get(result.size() - 1)) {
        System.out.println("Cluster id: " + cluster.getId() + " center: " +
                cluster.getCenter().asFormatString());
        System.out.println(" Points: " + cluster.getNumPoints());
    }
}
Execution result
Dirichlet Processes Clustering In Memory Result
Cluster id: 0
center:{"class":"org.apache.mahout.math.DenseVector",
"vector":"{\"values\":[5.2727272727272725,5.2727272727272725],
\"size\":2,\"lengthSquared\":-1.0}"}
Points: 11
Cluster id: 1
center:{"class":"org.apache.mahout.math.DenseVector",
"vector":"{\"values\":[1.0,2.0],\"size\":2,\"lengthSquared\":-1.0}"}
Points: 1
Cluster id: 2
center:{"class":"org.apache.mahout.math.DenseVector",
"vector":"{\"values\":[9.0,8.0],\"size\":2,\"lengthSquared\":-1.0}"}
Points: 0
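Mahout provides implementations of several probability distribution models, all of which inherit from ModelDistribution, as shown in Figure 4. Users can choose a suitable model according to the characteristics of their own data set; see the official Mahout documentation for details.
Figure 4. Hierarchy of probability distribution models in Mahout
Summary of Mahout clustering algorithms
The four clustering algorithms provided by Mahout have been described in detail above. Here is a brief summary that compares their characteristics. Beyond these four, Mahout also provides some more complex clustering algorithms, which are not covered one by one here; see the detailed introduction to clustering algorithms on the Mahout Wiki.
Table 1. Summary of Mahout clustering algorithms
| Algorithm | In-memory implementation | Map/Reduce implementation | Fixed number of clusters | Overlapping clusters allowed |
|---|---|---|---|---|
| K-means | KMeansClusterer | KMeansDriver | Y | N |
| Canopy | CanopyClusterer | CanopyDriver | N | N |
| Fuzzy K-means | FuzzyKMeansClusterer | FuzzyKMeansDriver | Y | Y |
| Dirichlet | DirichletClusterer | DirichletDriver | N | Y |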
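Summary
Clustering algorithms are widely used in intelligent information processing systems. This article first outlined the concept of clustering and the ideas behind clustering algorithms, giving the reader an overall understanding of this important technique. Then, from the perspective of building real applications, it introduced in depth the clustering framework of the open-source software Apache Mahout, including its mathematical model, the various clustering algorithms, and their implementations on different infrastructures. Through the code examples, readers can see how to vectorize data for their own specific data problem and how to choose among the different clustering algorithms.
The next article in this series continues with another family of algorithms related to recommendation engines: classification. Like clustering, classification is a classic data mining problem. It is mainly used to extract models that describe important classes of data; predictions can then be made from such a model, and recommendation is one form of prediction. Clustering and classification also often complement each other, and both support efficient recommendation over massive data. The next article will therefore describe the various classification algorithms, their principles, strengths, weaknesses, and applicable scenarios, and give efficient implementations of classification algorithms based on Apache Mahout.
Research status
Traditional clustering has solved the problem of clustering low-dimensional data fairly successfully. However, because of the complexity of real-world data, existing algorithms often fail on many problems, especially with high-dimensional data and large data sets. Traditional clustering methods run into two main problems on high-dimensional data sets. ① A high-dimensional data set contains many irrelevant attributes, so the probability that clusters exist across all dimensions is almost zero. ② Data in a high-dimensional space is distributed more sparsely than in a low-dimensional space, and nearly equal distances between data points are the norm; since traditional clustering methods are distance-based, clusters cannot be built from distances in a high-dimensional space.
High-dimensional cluster analysis has become an important research direction in cluster analysis, and clustering high-dimensional data is also a difficult point of clustering technology. Technical progress has made data collection ever easier, so databases keep growing in size and complexity, for example trade transaction data of all kinds, Web documents, and gene expression data, whose dimensionality (number of attributes) commonly reaches hundreds or thousands, or even more. However, under the "curse of dimensionality", many clustering methods that perform well in low-dimensional data spaces often fail to produce good clusters in high-dimensional spaces. High-dimensional cluster analysis is a very active field within cluster analysis and also a challenging one. At present it is widely applied in market analysis, information security, finance, entertainment, counter-terrorism, and other areas.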
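The observation that distances between data points become nearly equal in high-dimensional space can be reproduced with a small experiment. The plain-Java sketch below is an illustration under assumptions: points are drawn uniformly from the unit cube, and the class name DistanceConcentrationSketch, the sample size, and the seed are arbitrary choices. It reports the relative contrast (max - min) / min of the distances from one random query point to a set of random points.

```java
import java.util.Random;

final class DistanceConcentrationSketch {

    /** (max - min) / min of the Euclidean distances from a random query to n random points in [0,1]^dim. */
    static double relativeContrast(int dim, int n, long seed) {
        Random rnd = new Random(seed);
        double[] query = randomPoint(rnd, dim);
        double min = Double.MAX_VALUE;
        double max = 0;
        for (int i = 0; i < n; i++) {
            double[] p = randomPoint(rnd, dim);
            double sum = 0;
            for (int d = 0; d < dim; d++) {
                sum += (p[d] - query[d]) * (p[d] - query[d]);
            }
            double dist = Math.sqrt(sum);
            min = Math.min(min, dist);
            max = Math.max(max, dist);
        }
        return (max - min) / min;
    }

    static double[] randomPoint(Random rnd, int dim) {
        double[] p = new double[dim];
        for (int d = 0; d < dim; d++) {
            p[d] = rnd.nextDouble();
        }
        return p;
    }

    public static void main(String[] args) {
        for (int dim : new int[] {2, 10, 100, 1000}) {
            System.out.println("dim = " + dim + " relative contrast = "
                    + relativeContrast(dim, 500, 42L));
        }
    }
}
```

The contrast shrinks as the dimension grows: in 2 dimensions the farthest point is many times farther away than the nearest one, while in 1000 dimensions the nearest and farthest points are at almost the same distance, which is why purely distance-based cluster construction loses its footing there.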
http://www.cnblogs.com/shipengzhi/articles/2489389.html
=========
Minhash based clustering
https://issues.apache.org/jira/browse/MAHOUT-344
========
How to improve clustering?
http://comments.gmane.org/gmane.comp.apache.mahout.user/16296
========
Running the k-means example in Mahout