ÔØÈëÖС£¡£¡£ 'S bLog
 
ÔØÈëÖС£¡£¡£
 
ÔØÈëÖС£¡£¡£
ÔØÈëÖС£¡£¡£
ÔØÈëÖС£¡£¡£
ÔØÈëÖС£¡£¡£
ÔØÈëÖС£¡£¡£
 
ÌîдÄúµÄÓʼþµØÖ·£¬¶©ÔÄÎÒÃǵľ«²ÊÄÚÈÝ£º


 
On Clustering and MapReduce
[ 2013/4/25 15:48:00 | By: Mengxiang'er ]

K-Means Clustering in Mahout

(Adapted from online material, with modifications)

 

I. Concepts

      The K-means algorithm is a hard clustering algorithm, a classic representative of prototype-based objective-function clustering: some distance from each data point to a prototype serves as the objective function to optimize, and the iterative update rules fall out of minimizing that function. K-means uses Euclidean distance as its similarity measure; starting from some initial set of cluster-center vectors, it seeks the assignment that minimizes the evaluation criterion J, taking the sum-of-squared-errors criterion as its clustering criterion function.

      K-means is a very typical distance-based clustering algorithm: distance is the similarity measure, meaning the closer two objects are, the more similar they are considered. The algorithm assumes a cluster consists of objects that lie close together, so its final goal is to obtain compact, well-separated clusters.

      The choice of the k initial cluster centers has a large influence on the result, because the first step of the algorithm randomly picks k arbitrary objects as the initial cluster centers, each initially representing one cluster. In each iteration, every remaining object in the dataset is reassigned to the nearest cluster according to its distance to each cluster center. Once all data objects have been examined, one iteration is complete and the new cluster centers are computed. If the value of the criterion J does not change between two consecutive iterations, the algorithm has converged.

 

II. Basic Idea

1. Mathematical description

Given n points in d-dimensional real space, x_1, x_2, ..., x_n (we will just call these real vectors "points" from here on, it is shorter!), K-Means partitions the points into k Clusters (k ≤ n) according to the preset parameter k. The partitioning criterion is to minimize the sum of squared distances between each point and the centroid (mean) of its Cluster. Writing the Clusters as C = {C_1, C_2, ..., C_k}, the mathematical description is:

                        \arg\min_{C} \sum_{i=1}^{k} \sum_{x_j \in C_i} \|x_j - \mu_i\|^2

where \mu_i is the "centroid" of the i-th Cluster (the mean of all points in the Cluster).

      The effect of clustering resembles the figure below.

For details see: http://en.wikipedia.org/wiki/K-means_clustering

2. The K-means algorithm

 It is an iterative algorithm:

      (1) Build an initial partition from the given k, yielding k Clusters. For example, randomly choose k points as the centroids of the k Clusters, or use the Clusters produced by Canopy Clustering as the initial centroids (in which case the value of k is determined by the Canopy Clustering result);

      (2) Compute the distance from each point to each Cluster centroid and add the point to the nearest Cluster;

      (3) Recompute the centroid of each Cluster;

      (4) Repeat steps 2 and 3 until the Cluster centroids stop changing within some tolerance, or the maximum number of iterations is reached.

      Do not be fooled by the algorithm's simplicity: many more sophisticated algorithms perform no better in practice, and K-means has good locality, is easy to parallelize, and is very useful for large datasets. Its time complexity is O(nkt), where n is the number of points, k is the number of Clusters, and t is the number of iterations.
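The four steps above can be sketched as a minimal single-machine K-means (a toy sketch, not Mahout's implementation; for determinism, step (1) here seeds the centroids with the first k points instead of a random pick):

```java
import java.util.Arrays;

// Toy single-machine K-means following steps (1)-(4) above.
public class SimpleKMeans {

    // Returns the final centroids. Step (1) uses the first k points as
    // initial centroids rather than a random choice, so runs are repeatable.
    public static double[][] cluster(double[][] points, int k, int maxIter, double eps) {
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int i = 0; i < k; i++) centers[i] = points[i].clone();
        int[] assign = new int[points.length];
        for (int iter = 0; iter < maxIter; iter++) {
            // (2) assign each point to its nearest centroid
            for (int p = 0; p < points.length; p++) {
                double best = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double d = dist2(points[p], centers[c]);
                    if (d < best) { best = d; assign[p] = c; }
                }
            }
            // (3) recompute each centroid as the mean of its points
            double[][] next = new double[k][dim];
            int[] count = new int[k];
            for (int p = 0; p < points.length; p++) {
                count[assign[p]]++;
                for (int d = 0; d < dim; d++) next[assign[p]][d] += points[p][d];
            }
            double moved = 0;
            for (int c = 0; c < k; c++) {
                if (count[c] == 0) { next[c] = centers[c]; continue; } // empty cluster: keep old centroid
                for (int d = 0; d < dim; d++) next[c][d] /= count[c];
                moved = Math.max(moved, Math.sqrt(dist2(next[c], centers[c])));
            }
            centers = next;
            // (4) stop once no centroid moved more than eps
            if (moved < eps) break;
        }
        return centers;
    }

    private static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }

    public static void main(String[] args) {
        double[][] pts = { {1, 1}, {1, 2}, {2, 1}, {8, 8}, {8, 9}, {9, 8} };
        // two well-separated blobs: the centroids settle near (4/3, 4/3) and (25/3, 25/3)
        System.out.println(Arrays.deepToString(cluster(pts, 2, 20, 1e-9)));
    }
}
```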

 

Èý¡¢²¢Ðл¯K-means

      K-Means½ÏºÃµØ¾Ö²¿ÐÔʹËüÄܺܺõı»²¢Ðл¯¡£µÚÒ»½×¶Î£¬Éú³ÉClusterµÄ¹ý³Ì¿ÉÒÔ²¢Ðл¯£¬¸÷¸öSlaves¶ÁÈ¡´æÔÚ±¾µØµÄÊý¾Ý¼¯£¬ÓÃÉÏÊöËã·¨Éú³ÉCluster¼¯ºÏ£¬×îºóÓÃÈô¸ÉCluster¼¯ºÏÉú³ÉµÚÒ»´Îµü´úµÄÈ«¾ÖCluster¼¯ºÏ£¬È»ºóÖØ¸´Õâ¸ö¹ý³ÌÖ±µ½Âú×ã½áÊøÌõ¼þ£¬µÚ¶þ½×¶Î£¬ÓÃ֮ǰµÃµ½µÄCluster½øÐоÛÀà²Ù×÷¡£

      ÓÃmap-reduceÃèÊöÊÇ£ºdatanodeÔÚmap½×¶Î¶Á³öλÓÚ±¾µØµÄÊý¾Ý¼¯£¬Êä³öÿ¸öµã¼°Æä¶ÔÓ¦µÄCluster£»combiner²Ù×÷¶ÔλÓÚ±¾µØ°üº¬ÔÚÏàͬClusterÖÐµÄµã½øÐÐreduce²Ù×÷²¢Êä³ö£¬reduce²Ù×÷µÃµ½È«¾ÖCluster¼¯ºÏ²¢Ð´ÈëHDFS¡£

 

IV. K-means in Mahout

       mahout implements standard K-Means Clustering along the lines above, using two map operations, one combine operation and one reduce operation in total: each iteration runs one map, one combine and one reduce to compute and save the global Cluster set, and after the iterations finish, one final map performs the actual cluster assignment.

1. Data model

      Mahout's clustering algorithms represent objects as Vectors, supporting both dense vectors and sparse vectors. There are three representations in total (they share the base class AbstractVector, which implements many of the Vector operations):

      (1) DenseVector

      Its implementation backs the Vector with a double array (private double[] values); it is the choice for dense data;

      (2) RandomAccessSparseVector

     It represents a randomly accessible sparse vector, storing only the non-zero elements in a hash mapping: OpenIntDoubleHashMap;

      OpenIntDoubleHashMap has int keys and double values, and resolves collisions by double hashing;

      (3) SequentialAccessSparseVector

      It represents a sequentially accessible sparse vector, again storing only the non-zero elements, in a sequential mapping: OrderedIntDoubleMapping;

      OrderedIntDoubleMapping also has int keys and double values. Its storage is reminiscent of the Libsvm data format, index:value pairs of non-zero elements: an int array stores the indices and a double array stores the non-zero values. To read or write an element, its offset must first be looked up in indices; since indices is kept sorted, the lookup uses binary search.
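The index:value layout just described can be sketched with two parallel arrays and a binary search over the sorted index array (a hypothetical minimal class for illustration, not Mahout's OrderedIntDoubleMapping):

```java
import java.util.Arrays;

// Sketch of parallel-array sparse storage: indices is kept sorted, so get()
// can locate an element's offset with binary search in O(log nnz).
public class SortedSparseVector {
    private final int[] indices;   // sorted indices of the non-zero entries
    private final double[] values; // values[i] is the entry at indices[i]

    public SortedSparseVector(int[] indices, double[] values) {
        this.indices = indices;
        this.values = values;
    }

    public double get(int index) {
        int offset = Arrays.binarySearch(indices, index);
        return offset >= 0 ? values[offset] : 0.0; // absent index means zero
    }
}
```

For example, the Libsvm-style entry list "2:1.5 5:-2.0 9:3.0" becomes indices {2, 5, 9} and values {1.5, -2.0, 3.0}, and get(3) returns 0.0.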

2. Meaning of the K-means fields

     These can be read from Cluster.java and its parent class. For Cluster, mahout provides an abstract class, AbstractCluster, that encapsulates it; see the previous post for details, here is a short summary:

      (1) private int id;  # id of each Cluster produced by the K-Means algorithm

      (2) private long numPoints;  # number of points in the Cluster; the points here are all Vectors

      (3) private Vector center;  # the Cluster centroid, i.e. the mean, computed from s0 and s1.

      (4) private Vector radius;  # the Cluster radius; this radius is the per-dimension standard deviation of the points, reflecting the spread within the group, computed from s0, s1 and s2.

      (5) private double s0;  # the sum of the weights of the points in the Cluster, s_0 = \sum_{i=0}^{n} w_i

      (6) private Vector s1;  # the weighted sum of the points in the Cluster, s_1 = \sum_{i=0}^{n} x_i w_i

      (7) private Vector s2;  # the weighted sum of the squared points in the Cluster, s_2 = \sum_{i=0}^{n} x_i^2 w_i

      (8) public void computeParameters();  # computes numPoints, center and radius from s0, s1, s2:

             numPoints = (int) s0

            center = s1 / s0

            radius = \frac{\sqrt{s_2 s_0 - s_1 s_1}}{s_0}

            s0 = 0             s1 = null             s2 = null

            These operations are important, and the last three steps are essential; they are explained further below.

(9) public void observe(VectorWritable x, double weight);  # updates s0, s1 and s2 whenever a new point joins the current Cluster

      (10) public ClusterObservations getObservations();  # used heavily by the combine operation; it returns a ClusterObservations object initialized from s0, s1 and s2, representing all the points observed so far by the current Cluster
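The role of s0, s1, s2 and computeParameters() can be sketched with plain arrays (a hypothetical class mirroring the description above, not Mahout's AbstractCluster):

```java
// observe() accumulates the running sums; computeParameters() derives
// numPoints, center and radius from them, then resets the sums so the next
// iteration starts "clean".
public class ClusterStats {
    double s0;           // sum of the weights
    double[] s1, s2;     // weighted sum of points / of squared points
    double[] center, radius;
    int numPoints;

    ClusterStats(int dim) { s1 = new double[dim]; s2 = new double[dim]; }

    void observe(double[] x, double w) {
        s0 += w;
        for (int i = 0; i < x.length; i++) {
            s1[i] += w * x[i];
            s2[i] += w * x[i] * x[i];
        }
    }

    void computeParameters() {
        numPoints = (int) s0;
        center = new double[s1.length];
        radius = new double[s1.length];
        for (int i = 0; i < s1.length; i++) {
            center[i] = s1[i] / s0;
            // per-dimension standard deviation: sqrt(s2*s0 - s1*s1) / s0
            radius[i] = Math.sqrt(s2[i] * s0 - s1[i] * s1[i]) / s0;
        }
        // the essential reset: clear the sums for the next iteration
        s0 = 0;
        s1 = new double[s1.length];
        s2 = new double[s2.length];
    }
}
```

Observing the points 0, 2 and 4 with weight 1 gives s0 = 3, s1 = 6, s2 = 20, hence center 2 and radius sqrt(20*3 - 36)/3 = sqrt(8/3), exactly the population standard deviation.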

3. The Map-Reduce implementation of K-means

      K-Means Clustering likewise ships in a single-machine version and an MR version. Skipping the single-machine one: the MR version uses two map operations, one combine operation and one reduce operation, triggered by two different jobs and organized by a Driver. The map and reduce phases run in this order:

(1) Forming the initial k Clusters

K-Means needs an initial partition of the data points. mahout offers two methods (using the first 3 features of the Iris dataset as an example):

      A. The RandomSeedGenerator class

      It generates k initial Clusters in the given clusters directory and stores them as a Sequence File; its selection method tries, as far as possible, to avoid picking outliers as Cluster centroids. The rough flow is:

      Figure 2

      B. Canopy Clustering

      Canopy Clustering is often used to make a rough initial partition of the data, and its result can bootstrap a more expensive clustering afterwards. It is arguably more useful as a preprocessing step than as a clustering algorithm in its own right, for instance by supplying the k for K-Means; it also handles outliers well. The price is that the parameters to set by hand become T1 and T2 instead of k, and the two play indispensable, complementary roles: T1 determines how many points each Cluster contains, which directly affects the Cluster's "centroid" and "radius", while T2 determines the number of Clusters; too large a T2 yields a single Cluster, too small a T2 yields too many. Experiments show the choice of T1 and T2 strongly affects the quality of the result; to determine them, AIC, BIC or cross-validation look like plausible approaches...
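The role of T2 can be sketched in a toy single-machine canopy pass (an illustrative sketch, not Mahout's implementation; the loosely bound membership within T1 is omitted here and only the canopy centers are returned):

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Toy canopy formation: promote the first remaining candidate to a canopy
// center, then delete every point within T2 of it. Points within T1 would
// additionally join the canopy as loosely bound members (omitted here);
// T2 alone already controls how many canopies, and hence which k, you get.
public class SimpleCanopy {

    public static List<double[]> centers(List<double[]> points, double t1, double t2) {
        LinkedList<double[]> candidates = new LinkedList<>(points);
        List<double[]> centers = new ArrayList<>();
        while (!candidates.isEmpty()) {
            double[] center = candidates.removeFirst();
            centers.add(center);
            // strongly bound points never start a new canopy
            candidates.removeIf(p -> dist(p, center) < t2);
        }
        return centers;
    }

    private static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(s);
    }
}
```

With two tight pairs of points and a moderate T2 this yields two canopies, while an oversized T2 collapses everything into one, which is exactly the failure mode described above.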

 

(2) Loading the Cluster information

      In the MR implementation of K-Means, the first iteration takes the output directory of the random method or of Canopy Clustering as its input directory, and every subsequent iteration takes the previous iteration's output directory as its input. So before every kmeans map and kmeans reduce operation, the Cluster information must be obtained from that directory. This is done by KMeansUtil.configureWithClusterInfo, which reads the Canopy Clusters or the previous iteration's Clusters from the given HDFS directory into a Collection; every subsequent map and reduce operation needs this method.

 

(3) KMeansMapper

public class KMeansMapper extends Mapper<WritableComparable<?>, VectorWritable, Text, ClusterObservations> {

  private KMeansClusterer clusterer;

  private final Collection<Cluster> clusters = new ArrayList<Cluster>();

  @Override
  protected void map(WritableComparable<?> key, VectorWritable point, Context context)
    throws IOException, InterruptedException {
    this.clusterer.emitPointToNearestCluster(point.get(), this.clusters, context);
  }

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    Configuration conf = context.getConfiguration();
    try {
      ClassLoader ccl = Thread.currentThread().getContextClassLoader();
      DistanceMeasure measure = ccl.loadClass(conf.get(KMeansConfigKeys.DISTANCE_MEASURE_KEY))
          .asSubclass(DistanceMeasure.class).newInstance();
      measure.configure(conf);

      this.clusterer = new KMeansClusterer(measure);

      String clusterPath = conf.get(KMeansConfigKeys.CLUSTER_PATH_KEY);
      if (clusterPath != null && clusterPath.length() > 0) {
        KMeansUtil.configureWithClusterInfo(conf, new Path(clusterPath), clusters);
        if (clusters.isEmpty()) {
          throw new IllegalStateException("No clusters found. Check your -c path.");
        }
      }
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    } catch (IllegalAccessException e) {
      throw new IllegalStateException(e);
    } catch (InstantiationException e) {
      throw new IllegalStateException(e);
    }
  }

  void setup(Collection<Cluster> clusters, DistanceMeasure measure) {
    this.clusters.clear();
    this.clusters.addAll(clusters);
    this.clusterer = new KMeansClusterer(measure);
  }
}

A. KMeansMapper receives (WritableComparable<?>, VectorWritable) pairs. Its setup method uses KMeansUtil.configureWithClusterInfo to load the previous iteration's clustering result, against which the map operation clusters.

B. Each slave machine processes the data on its disks in a distributed fashion. Using the Cluster information obtained earlier, the emitPointToNearestCluster method adds each point to the Cluster nearest to it, emitting (ID of the Cluster nearest to the current point, ClusterObservations wrapping the current point) pairs. Note that the Mapper only attaches each point to its nearest Cluster, marking that nearest cluster in the (key, value) pair for the combiner and reducer to collect; it does not update the Cluster centroid or any other parameter.

 

 (4) KMeansCombiner

public class KMeansCombiner extends Reducer<Text, ClusterObservations, Text, ClusterObservations> {

  @Override
  protected void reduce(Text key, Iterable<ClusterObservations> values, Context context)
    throws IOException, InterruptedException {
    Cluster cluster = new Cluster();
    for (ClusterObservations value : values) {
      cluster.observe(value);
    }
    context.write(key, cluster.getObservations());
  }

}

              The combiner is a local reduce operation that runs after map and before reduce:

(5) KMeansReducer

public class KMeansReducer extends Reducer<Text, ClusterObservations, Text, Cluster> {

  private Map<String, Cluster> clusterMap;
  private double convergenceDelta;
  private KMeansClusterer clusterer;

  @Override
  protected void reduce(Text key, Iterable<ClusterObservations> values, Context context)
    throws IOException, InterruptedException {
    Cluster cluster = clusterMap.get(key.toString());
    for (ClusterObservations delta : values) {
      cluster.observe(delta);
    }
    // force convergence calculation
    boolean converged = clusterer.computeConvergence(cluster, convergenceDelta);
    if (converged) {
      context.getCounter("Clustering", "Converged Clusters").increment(1);
    }
    cluster.computeParameters();
    context.write(new Text(cluster.getIdentifier()), cluster);
  }

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    Configuration conf = context.getConfiguration();
    try {
      ClassLoader ccl = Thread.currentThread().getContextClassLoader();
      DistanceMeasure measure = ccl.loadClass(conf.get(KMeansConfigKeys.DISTANCE_MEASURE_KEY))
          .asSubclass(DistanceMeasure.class).newInstance();
      measure.configure(conf);

      this.convergenceDelta = Double.parseDouble(conf.get(KMeansConfigKeys.CLUSTER_CONVERGENCE_KEY));
      this.clusterer = new KMeansClusterer(measure);
      this.clusterMap = new HashMap<String, Cluster>();

      String path = conf.get(KMeansConfigKeys.CLUSTER_PATH_KEY);
      if (path.length() > 0) {
        Collection<Cluster> clusters = new ArrayList<Cluster>();
        KMeansUtil.configureWithClusterInfo(conf, new Path(path), clusters);
        setClusterMap(clusters);
        if (clusterMap.isEmpty()) {
          throw new IllegalStateException("Cluster is empty!");
        }
      }
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    } catch (IllegalAccessException e) {
      throw new IllegalStateException(e);
    } catch (InstantiationException e) {
      throw new IllegalStateException(e);
    }
  }

  private void setClusterMap(Collection<Cluster> clusters) {
    clusterMap = new HashMap<String, Cluster>();
    for (Cluster cluster : clusters) {
      clusterMap.put(cluster.getIdentifier(), cluster);
    }
    clusters.clear();
  }

  public void setup(Collection<Cluster> clusters, DistanceMeasure measure) {
    setClusterMap(clusters);
    this.clusterer = new KMeansClusterer(measure);
  }

}

  Quite straightforward operations; only setup is slightly involved.

 A. The purpose of setup is to read the initial partition or the previous iteration's result and build the Cluster information, also building a Map<Cluster ID, Cluster> so that a Cluster can easily be found by its ID.

 B. The reduce operation is very direct: it aggregates the <Cluster ID, ClusterObservations> pairs coming from the combiner;

        computeConvergence checks whether the current Cluster has converged, i.e. whether the distance between the new "centroid" and the old one satisfies the tolerance passed in earlier;

        Note the cluster.computeParameters() call. This operation is very important: it guarantees that this iteration's result cannot affect the next iteration, i.e. it is what makes the "recompute the centroid of each Cluster" step possible.

                               numPoints = (int) s0

                              center = s1 / s0

                              radius = \frac{\sqrt{s_2 s_0 - s_1 s_1}}{s_0}

      The first three assignments produce the new Cluster information;

                               s0 = 0

                              s1 = null

                             s2 = null

      the last three steps clear s0, s1 and s2, guaranteeing that the Cluster information needed by the next iteration starts "clean".

      Afterwards, reduce writes the (Cluster ID, Cluster) pairs into HDFS, into a folder named "clusters-<iteration number>", for later iterations to use.

In short, the reduce operation collects the output of the preceding combiners and updates the centroids and related Cluster information once more.

 (6) KMeansClusterMapper

 The MR operations so far build the Cluster information; KMeansClusterMapper then uses the finished Clusters to cluster the points.

public class KMeansClusterMapper
    extends Mapper<WritableComparable<?>,VectorWritable,IntWritable,WeightedVectorWritable> {
  
  private final Collection<Cluster> clusters = new ArrayList<Cluster>();
  private KMeansClusterer clusterer;

  @Override
  protected void map(WritableComparable<?> key, VectorWritable point, Context context)
    throws IOException, InterruptedException {
    clusterer.outputPointWithClusterInfo(point.get(), clusters, context);
  }

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    super.setup(context);
    Configuration conf = context.getConfiguration();
    try {
      ClassLoader ccl = Thread.currentThread().getContextClassLoader();
      DistanceMeasure measure = ccl.loadClass(conf.get(KMeansConfigKeys.DISTANCE_MEASURE_KEY))
          .asSubclass(DistanceMeasure.class).newInstance();
      measure.configure(conf);
      
      String clusterPath = conf.get(KMeansConfigKeys.CLUSTER_PATH_KEY);
      if (clusterPath != null && clusterPath.length() > 0) {
        KMeansUtil.configureWithClusterInfo(conf, new Path(clusterPath), clusters);
        if (clusters.isEmpty()) {
          throw new IllegalStateException("No clusters found. Check your -c path.");
        }
      }  
      this.clusterer = new KMeansClusterer(measure);
    } catch (ClassNotFoundException e) {
      throw new IllegalStateException(e);
    } catch (IllegalAccessException e) {
      throw new IllegalStateException(e);
    } catch (InstantiationException e) {
      throw new IllegalStateException(e);
    }
  }
}


      A. setup again reads and builds the Cluster information from the given directory;

      B. map performs the clustering by computing the distance from each point to each Cluster "centroid". Once map finishes, every point has been placed into the single Cluster nearest to it, so no reduce operation is needed afterwards.

(7) KMeansDriver

      Worth noting here is the iteration loop in buildClusters; runIteration sets the job parameters for the KMeansMapper, KMeansCombiner and KMeansReducer described above.

The buildClusters code:

private static Path buildClustersMR(Configuration conf,
                                      Path input,
                                      Path clustersIn,
                                      Path output,
                                      DistanceMeasure measure,
                                      int maxIterations,
                                      String delta) throws IOException, InterruptedException, ClassNotFoundException {

    boolean converged = false;
    int iteration = 1;
    while (!converged && iteration <= maxIterations) {
      log.info("K-Means Iteration {}", iteration);
      // point the output to a new directory per iteration
      Path clustersOut = new Path(output, AbstractCluster.CLUSTERS_DIR + iteration);
      converged = runIteration(conf, input, clustersIn, clustersOut, measure.getClass().getName(), delta);
      // now point the input to the old output directory
      clustersIn = clustersOut;
      iteration++;
    }
    return clustersIn;
  }


      If the KMeansMapper, KMeansCombiner, KMeansReducer and KMeansClusterMapper above are the bricks, KMeansDriver is the bricklayer: it organizes the whole kmeans flow (both the single-machine and the MR version). Schematically:

Figure 4

http://www.cnblogs.com/biyeymyhjob/archive/2012/07/20/2599544.html

========

Mahout Clustering Analysis

What is cluster analysis?

Clustering means grouping data objects into multiple classes or clusters, with the goal that objects within the same cluster are highly similar to one another while objects in different clusters differ substantially. In many applications, the data objects of one cluster can then be treated as a whole, reducing the amount of computation or improving its quality.

Clustering is in fact a common everyday human activity, the old idea of "birds of a feather flock together", whose core is precisely clustering: people continually refine the clustering schemes in their subconscious to learn how to tell things and people apart. Cluster analysis is widely used in many applications, including pattern recognition, data analysis, image processing and market research. Through clustering, people can identify dense and sparse regions, discover global distribution patterns, and find interesting correlations among data attributes.

Clustering also plays an ever more important role in Web applications. The most widespread use is classifying documents on the Web to organize the publication of information and give users an effectively categorized content browser (a portal site); adding a time dimension then reveals how the content of each category evolves, which topics have recently drawn everyone's attention, or what kinds of content people found interesting over some period. All these interesting applications are built on clustering. As a data-mining function, cluster analysis can serve as a standalone tool to survey the distribution of the data, observe the characteristics of each cluster, and focus further analysis on particular clusters. Beyond that, cluster analysis can act as a preprocessing step for other algorithms, simplifying computation and improving efficiency, which is also our purpose in introducing it here.

Different clustering problems

To pick the most suitable and most efficient algorithm for a clustering problem, the problem to be solved must itself be analyzed. Below we examine the requirements of clustering problems from several angles.

Is the clustering result exclusive or overlapping?

An example makes this easier to understand. Suppose your clustering problem needs two clusters, "users who like James Cameron films" and "users who do not like James Cameron": this is really an exclusive clustering problem, since any user belongs either to the "like" cluster or to the "dislike" cluster. But if your clustering problem is "users who like James Cameron films" and "users who like Leonardo DiCaprio films", then it is an overlapping problem: a user may like both James Cameron and Leonardo.

So the heart of the question is: may one element belong to several clusters in the result? If yes, it is an overlapping clustering problem; if no, it is an exclusive one.

Hierarchical or partitional?

Most people's mental image of clustering is actually the "partitioning" problem: take a set of objects and split them into different groups by some rule, the classic partitional clustering problem. But besides partition-based clustering, there is another type that is also very common in daily life: hierarchical clustering, whose result ranks the objects in levels, grouping them coarsely at the top and then subdividing each group further, with every path perhaps ultimately reaching a single instance. That is a "top-down" hierarchical clustering approach; correspondingly, there is also "bottom-up". Put simply, "top-down" refines the groups step by step, while "bottom-up" merges them step by step.

A fixed or an open number of clusters?

This property is easy to grasp: is the number of clusters in the result already fixed before the clustering algorithm runs, or does the clustering algorithm choose a suitable number of clusters from the characteristics of the data itself?

»ùÓÚ¾àÀ뻹ÊÇ»ùÓÚ¸ÅÂÊ·Ö²¼Ä£ÐÍ

ÔÚ±¾ÏµÁеĵڶþƪ½éÉÜЭͬ¹ýÂ˵ÄÎÄÕÂÖУ¬ÎÒÃÇÒѾ­Ïêϸ½éÉÜÁËÏàËÆÐԺ;àÀëµÄ¸ÅÄî¡£»ùÓÚ¾àÀëµÄ¾ÛÀàÎÊÌâÓ¦¸ÃºÜºÃÀí½â£¬¾ÍÊǽ«¾àÀë½üµÄÏàËÆµÄ¶ÔÏó¾ÛÔÚÒ»Æð¡£Ïà±ÈÆðÀ´£¬»ùÓÚ¸ÅÂÊ·Ö²¼Ä£Ð͵ģ¬¿ÉÄܲ»Ì«ºÃÀí½â£¬ÄÇôÏÂÃæ¸ø¸ö¼òµ¥µÄÀý×Ó¡£

Ò»¸ö¸ÅÂÊ·Ö²¼Ä£ÐÍ¿ÉÒÔÀí½âÊÇÔÚ N ά¿Õ¼äµÄÒ»×éµãµÄ·Ö²¼£¬¶øËüÃǵķֲ¼ÍùÍù·ûºÏÒ»¶¨µÄÌØÕ÷£¬±ÈÈç×é³ÉÒ»¸öÌØ¶¨µÄÐÎ×´¡£»ùÓÚ¸ÅÂÊ·Ö²¼Ä£Ð͵ľÛÀàÎÊÌ⣬¾ÍÊÇÔÚÒ»×é¶ÔÏóÖУ¬ÕÒµ½ÄÜ·ûºÏÌØ¶¨·Ö²¼Ä£Ð͵ĵãµÄ¼¯ºÏ£¬ËûÃDz»Ò»¶¨ÊǾàÀë×î½üµÄ»òÕß×îÏàËÆµÄ£¬¶øÊÇÄÜÍêÃÀµÄ³ÊÏÖ³ö¸ÅÂÊ·Ö²¼Ä£ÐÍËùÃèÊöµÄÄ£ÐÍ¡£

ÏÂÃæÍ¼ 1 ¸ø³öÁËÒ»¸öÀý×Ó£¬¶ÔͬÑùÒ»×éµã¼¯£¬Ó¦Óò»Í¬µÄ¾ÛÀà²ßÂÔ£¬µÃµ½ÍêÈ«²»Í¬µÄ¾ÛÀà½á¹û¡£×ó²à¸ø³öµÄ½á¹ûÊÇ»ùÓÚ¾àÀëµÄ£¬ºËÐĵÄÔ­Ôò¾ÍÊǽ«¾àÀë½üµÄµã¾ÛÔÚÒ»Æð£¬ÓÒ²à¸ø³öµÄ»ùÓÚ¸ÅÂÊ·Ö²¼Ä£Ð͵ľÛÀà½á¹û£¬ÕâÀï²ÉÓõĸÅÂÊ·Ö²¼Ä£ÐÍÊÇÒ»¶¨»¡¶ÈµÄÍÖÔ²¡£Í¼ÖÐרÃűê³öÁËÁ½¸öºìÉ«µÄµã£¬ÕâÁ½µãµÄ¾àÀëºÜ½ü£¬ÔÚ»ùÓÚ¾àÀëµÄ¾ÛÀàÖУ¬½«ËûÃǾÛÔÚÒ»¸öÀàÖУ¬µ«»ùÓÚ¸ÅÂÊ·Ö²¼Ä£Ð͵ľÛÀàÔò½«ËüÃÇ·ÖÔÚ²»Í¬µÄÀàÖУ¬Ö»ÊÇΪÁËÂú×ãÌØ¶¨µÄ¸ÅÂÊ·Ö²¼Ä£ÐÍ£¨µ±È»ÕâÀïÎÒÌØÒâ¾ÙÁËÒ»¸ö±È½Ï¼«¶ËµÄÀý×Ó£©¡£ËùÒÔÎÒÃÇ¿ÉÒÔ¿´³ö£¬ÔÚ»ùÓÚ¸ÅÂÊ·Ö²¼Ä£Ð͵ľÛÀà·½·¨ÀºËÐÄÊÇÄ£Ð͵͍Ò壬²»Í¬µÄÄ£ÐÍ¿ÉÄܵ¼ÖÂÍêÈ«²»Í¬µÄ¾ÛÀà½á¹û¡£


ͼ 1 »ùÓÚ¾àÀëºÍ»ùÓÚ¸ÅÂÊ·Ö²¼Ä£Ð͵ľÛÀàÎÊÌâ
 

 

The clustering framework in Apache Mahout

Apache Mahout is an open-source project under the Apache Software Foundation (ASF) that provides scalable implementations of classic machine-learning algorithms, aiming to help developers create intelligent applications more easily and quickly. Moreover, recent Mahout releases have added support for Apache Hadoop, allowing these algorithms to run more efficiently in cloud-computing environments.

For installing and configuring Apache Mahout, please refer to "Building a Social Recommendation Engine Based on Apache Mahout", a developerWorks article the author published in 2009 on implementing a recommendation engine with Mahout, which describes the installation steps in detail.

Mahout provides implementations of many commonly used clustering algorithms, covering the various problem types we have just discussed. Below we go deeper into several representative clustering algorithms: their principles, strengths and weaknesses, practical scenarios, and how to implement them efficiently with Mahout.

 

Into the clustering algorithms

Before introducing the clustering algorithms in depth, here is a brief look at the data models Mahout uses for the various clustering problems.

Data model

Mahout's clustering algorithms represent objects in a simple data model: the vector (Vector). On top of a vector description of the data, we can easily compute the similarity of two objects. Vectors and vector similarity were covered in detail in the previous article of this series introducing collaborative filtering; see "'Exploring the Secrets Inside the Recommendation Engine' series, Part 2: Deep into the recommendation algorithms - collaborative filtering".

A Mahout Vector is a composite object whose every field is a floating-point number (double); the most obvious implementation is simply an array of doubles. But in practice the data content of vectors differs: some vectors are dense, with a value in every field, while others are sparse, with perhaps only a few fields set. So Mahout provides multiple implementations:

  1. DenseVector is implemented as an array of doubles that stores every field of the vector; it suits storing dense vectors.
  2. RandomAccessSparseVector is implemented as a HashMap of doubles, with int keys and double values; it stores only the non-empty values of the vector and provides random access.
  3. SequentialAccessVector is implemented as parallel arrays of int and double; it likewise stores only the non-empty values of the vector, but provides only sequential access.

Users can choose the vector implementation that suits their algorithm's needs: if the algorithm does a lot of random access, choose DenseVector or RandomAccessSparseVector; if access is mostly sequential, SequentialAccessVector performs better. With the K-Means algorithm, SequentialAccessVector is the better fit.

Having introduced the vector implementations, let us see how to model existing data as vectors, in other words "how to vectorize the data", so that Mahout's various efficient clustering algorithms can be applied.

  1. Plain integer or floating-point data

    This kind of data is the simplest: just store the different fields in a vector. A point in n-dimensional space, for instance, can itself be described as a vector.

  2. Enumerated data

    This kind of data describes an object, but its values come from a finite range. For example, suppose you have a dataset of apple records, each containing size, weight, color and so on. Taking color as the example, let the apple colors be red, yellow and green. When modeling the data, we can encode the colors as numbers: red = 1, yellow = 2, green = 3. Then an apple of diameter 8 cm, weight 0.15 kg and red color is modeled as the vector <8, 0.15, 1>.

    Listing 1 below gives an example of vectorizing both kinds of data.



    Listing 1. Creating simple vectors
    				 
     // create a vector group for a set of 2-D points
     public static final double[][] points = { { 1, 1 }, { 2, 1 }, { 1, 2 }, 
     { 2, 2 }, { 3, 3 },  { 8, 8 }, { 9, 8 }, { 8, 9 }, { 9, 9 }, { 5, 5 }, 
     { 5, 6 }, { 6, 6 }}; 
     public static List<Vector> getPointVectors(double[][] raw) { 
    	 List<Vector> points = new ArrayList<Vector>(); 
    	 for (int i = 0; i < raw.length; i++) { 
    		 double[] fr = raw[i]; 
    		 // here we choose to create a RandomAccessSparseVector
    		 Vector vec = new RandomAccessSparseVector(fr.length); 
    		 // store the data in the created Vector
    		 vec.assign(fr); 
    		 points.add(vec); 
    	 } 
    	 return points; 
     } 
    
     // create a vector group for the apple dataset
     public static List<Vector> generateAppleData() { 
     List<Vector> apples = new ArrayList<Vector>(); 
     // here we create NamedVectors, which simply wrap the Vector types above
     // and give each Vector a readable name
    	 NamedVector apple = new NamedVector(new DenseVector(
    	 new double[] {0.11, 510, 1}), 
    		"Small round green apple"); 
    	 apples.add(apple); 
     apple = new NamedVector(new DenseVector(new double[] {0.2, 650, 3}), 
    		"Large oval red apple"); 
    	 apples.add(apple); 
    	 apple = new NamedVector(new DenseVector(new double[] {0.09, 630, 1}), 
    		"Small elongated red apple"); 
    	 apples.add(apple); 
    	 apple = new NamedVector(new DenseVector(new double[] {0.25, 590, 3}), 
    		"Large round yellow apple"); 
    	 apples.add(apple); 
    	 apple = new NamedVector(new DenseVector(new double[] {0.18, 520, 2}), 
    		"Medium oval green apple"); 
    	 apples.add(apple); 
    	 return apples; 
     } 
    

  3. Text

    Text classification is a major application scenario of clustering algorithms, so modeling text is also a common problem. Information retrieval research has long had a good modeling approach for it: the Vector Space Model (VSM), the most commonly used model in that field. Since the vector space model is not the focus of this article, only a brief introduction is given here; interested readers can consult the relevant documents listed in the references.

    The vector space model represents a text as a vector in which each field is the weight of one word occurring in the text. There are many ways to compute the weights:

    • The simplest is a raw count: the number of times the word occurs in the text. This method is simple, but describes the text's content imprecisely.
    • Term Frequency (TF): the word's frequency within the text is used as its weight. This only normalizes the raw count, so that texts of different lengths share a unified value space and their similarity can be compared. Still, neither raw counts nor TF solves the problem of "high-frequency, meaningless words getting large weights": in English text, high-frequency but meaningless words such as "a" and "the" are not filtered out, making similarity computed over such a text model quite inaccurate.
    • Term Frequency - Inverse Document Frequency (TF-IDF): a strengthening of the TF method, in which a word's importance increases in proportion to the number of times it appears in a document, but decreases with the frequency at which it appears across all documents. For example, the "high-frequency, meaningless words" mostly appear in all documents, so their weights are heavily discounted, which makes the text model describe a text's characteristics much more precisely. In information retrieval, TF-IDF is the most common method for modeling text.

    For vectorizing text, Mahout already provides tool classes: based on Lucene, they analyze the text and then create the text vectors. Listing 2 below gives an example. The text data analyzed is the Reuters news dataset; the references give the download location. After downloading the dataset, place it under the "clustering/reuters" directory.



    Listing 2. Creating vectors from text
    				 
     public static void documentVectorize(String[] args) throws Exception{ 
    	 //1. unpack the Reuters data; Mahout provides a dedicated method
     DocumentClustering.extractReuters(); 
     //2. store the data as a SequenceFile; these tool classes are built on
     //   Hadoop, so we first need to write the data as a SequenceFile for
     //   reading and computation
    	 DocumentClustering.transformToSequenceFile(); 
     //3. vectorize the data in the SequenceFile with the Lucene-based tool
    	 DocumentClustering.transformToVector(); 	
     } 
    
     public static void extractReuters(){ 
     //ExtractReuters is a Hadoop-based implementation, so it takes input and
     // output directories; here we can map them directly to local folders,
     // and the unpacked data is written to the output directory
    	 File inputFolder = new File("clustering/reuters"); 
    	 File outputFolder = new File("clustering/reuters-extracted"); 
    	 ExtractReuters extractor = new ExtractReuters(inputFolder, outputFolder); 
     extractor.extract(); 
     } 
    	
     public static void transformToSequenceFile(){ 
     //SequenceFilesFromDirectory writes all files under a directory into a set
     // of SequenceFiles. It is itself a tool class callable from the command
     // line; here we call its main method directly
    	 String[] args = {"-c", "UTF-8", "-i", "clustering/reuters-extracted/", "-o",
    	 "clustering/reuters-seqfiles"}; 
              // the parameters mean:
     // 	 -c: the file encoding, here "UTF-8"
     // 	 -i: the input directory, here the folder we just extracted the files to
     // 	 -o: the output directory
    
    	 try { 
    		 SequenceFilesFromDirectory.main(args); 
    	 } catch (Exception e) { 
    		 e.printStackTrace(); 
    	 } 
     } 
    	
     public static void transformToVector(){ 
     //SparseVectorsFromSequenceFiles vectorizes the data in the SequenceFiles.
     // It too is a tool class callable from the command line; here we call its
     // main method directly
     String[] args = {"-i", "clustering/reuters-seqfiles/", "-o", 
     "clustering/reuters-vectors-bigram", "-a", 
     "org.apache.lucene.analysis.WhitespaceAnalyzer"
    , "-chunk", "200", "-wt", "tfidf", "-s", "5", 
    "-md", "3", "-x", "90", "-ng", "2", "-ml", "50", "-seq"}; 
     // the parameters mean:
     // 	 -i: the input directory, here the SequenceFiles we just generated
     // 	 -o: the output directory
     // 	 -a: the Analyzer to use, here Lucene's whitespace-tokenizing Analyzer
     // 	 -chunk: the chunk size in MB; for large file sets we cannot load all
     // 		files at once, so the data must be chunked
     // 	 -wt: the weighting scheme used in the analysis, here tfidf
     // 	 -s:  the minimum frequency of a term across the whole corpus; terms
     // 	      below it are dropped
     // 	 -md: the minimum number of distinct documents a term must appear in;
     // 	      terms below it are dropped
     // 	 -x:  the maximum occurrence frequency for high-frequency and
     // 	      meaningless terms (e.g. is, a, the); terms above it are dropped
     // 	 -ng: the maximum n-gram length to consider after tokenizing; e.g. at
     // 	      1-gram, coca and cola are two terms, while at 2-gram, coca cola
     // 	      is one term; 2-grams can be more precise than 1-grams in some cases
     // 	 -ml: the similarity threshold for deciding whether adjacent terms form
     // 	      one term, only useful when >1-gram is chosen; what is actually
     // 	      computed is a Minimum Log Likelihood Ratio threshold
     // 	 -seq: generate SequentialAccessSparseVectors; when unset, the default
     //       is still RandomAccessSparseVectors
    
    	 try { 
    		 SparseVectorsFromSequenceFiles.main(args); 
    	 } catch (Exception e) { 
    		 e.printStackTrace(); 
    	 } 
     } 
    


    One more note: the directory layout of the generated vectorization output is as follows:



    Figure 2. Text vectorization
     

    • df-count directory: holds the document-frequency information
    • tf-vectors directory: holds the text vectors weighted by TF
    • tfidf-vectors directory: holds the text vectors weighted by TFIDF
    • tokenized-documents directory: holds the tokenized text
    • wordcount directory: holds the global occurrence counts of the terms
    • dictionary.file-0 directory: holds the vocabulary of these texts
    • frequency.file-0 directory: holds the frequency information for the vocabulary

With vectorization covered, we now analyze the individual clustering algorithms in depth, starting with the most classic: the K-Means algorithm.

The K-Means clustering algorithm

K-Means is the classic distance-based, exclusive partitioning method: given a dataset of n objects, it builds k partitions of the data, each partition being one cluster, with k <= n, subject to two requirements:

  • each group contains at least one object;
  • each object belongs to exactly one group.

The basic principle of K-Means, given k, the number of partitions to build, is:

  1. First create an initial partition: randomly choose k objects, each initially representing one cluster center. Every other object is assigned to the nearest cluster according to its distance to each cluster center.
  2. Then use an iterative relocation technique that tries to improve the partition by moving objects between partitions. Relocation means that whenever a new object joins a cluster or an existing object leaves one, the cluster's mean is recomputed and the objects are reassigned. This process repeats until no object changes cluster.

K-Means works well when the resulting clusters are dense and clearly separated from one another. For processing large datasets the algorithm is relatively scalable and efficient: its complexity is O(nkt), where n is the number of objects, k the number of clusters and t the number of iterations; normally k << n and t << n, and the algorithm usually terminates at a local optimum.

The biggest problem with K-Means is that the user must supply the number k in advance. The choice of k generally rests on empirical values and repeated experiments, and a k that works for one dataset does not transfer to another. Moreover, K-Means is sensitive to "noise" and outliers: even a small amount of such data can pull a mean a very long way.

After all this theory, let us implement a simple K-Means example based on Mahout. As introduced earlier, Mahout provides a basic in-memory implementation and a Hadoop Map/Reduce implementation, KMeansClusterer and KMeansDriver respectively. The example below runs on the 2-D point set defined in Listing 1.


Çåµ¥ 3. K ¾ùÖµ¾ÛÀàË㷨ʾÀý

				 
 // In-memory K-means clustering implementation
 public static void kMeansClusterInMemoryKMeans() {
     // Number of clusters to produce; here we choose 2
     int k = 2;
     // Maximum number of iterations for the K-means algorithm
     int maxIter = 3;
     // Convergence threshold on the distance moved by cluster centers
     double distanceThreshold = 0.01;
     // Declare a distance measure; here we choose Euclidean distance
     DistanceMeasure measure = new EuclideanDistanceMeasure();
     // Build the vector set, using the two-dimensional point set from Listing 1
     List<Vector> pointVectors = SimpleDataSet.getPointVectors(SimpleDataSet.points);
     // Randomly choose k points from the vector set as cluster centers
     List<Vector> randomPoints = RandomSeedGenerator.chooseRandomPoints(pointVectors, k);
     // Build clusters around the chosen centers
     List<Cluster> clusters = new ArrayList<Cluster>();
     int clusterId = 0;
     for (Vector v : randomPoints) {
         clusters.add(new Cluster(v, clusterId++, measure));
     }
     // Call KMeansClusterer.clusterPoints to run K-means clustering
     List<List<Cluster>> finalClusters = KMeansClusterer.clusterPoints(pointVectors,
         clusters, measure, maxIter, distanceThreshold);

     // Print the final clustering result
     for (Cluster cluster : finalClusters.get(finalClusters.size() - 1)) {
         System.out.println("Cluster id: " + cluster.getId() +
             " center: " + cluster.getCenter().asFormatString());
         System.out.println("       Points: " + cluster.getNumPoints());
     }
 }

 // Hadoop-based (Map/Reduce) K-means clustering implementation
 public static void kMeansClusterUsingMapReduce() throws Exception {
     // Declare a distance measure; here we choose Euclidean distance
     DistanceMeasure measure = new EuclideanDistanceMeasure();
     // Specify the input path; as introduced earlier, the Hadoop-based
     // implementation takes its data source from input/output file paths.
     Path testpoints = new Path("testpoints");
     Path output = new Path("output");
     // Clear any data under the input and output paths
     HadoopUtil.overwriteOutput(testpoints);
     HadoopUtil.overwriteOutput(output);
     RandomUtils.useTestSeed();
     // Generate the point set under the input path. Unlike the in-memory
     // method, all vectors must be written to files; a concrete example follows.
     SimpleDataSet.writePointsToFile(testpoints);
     // Number of clusters to produce; here we choose 2
     int k = 2;
     // Maximum number of iterations for the K-means algorithm
     int maxIter = 3;
     // Convergence threshold on the distance moved by cluster centers
     double distanceThreshold = 0.01;
     // Randomly choose k points as cluster centers
     Path clusters = RandomSeedGenerator.buildRandom(testpoints,
         new Path(output, "clusters-0"), k, measure);
     // Call KMeansDriver.runJob to run the K-means clustering job
     KMeansDriver.runJob(testpoints, clusters, output, measure,
         distanceThreshold, maxIter, 1, true, true);
     // Use ClusterDumper's printClusters method to print the clustering result
     ClusterDumper clusterDumper = new ClusterDumper(new Path(output,
         "clusters-" + (maxIter - 1)), new Path(output, "clusteredPoints"));
     clusterDumper.printClusters(null);
 }

 // SimpleDataSet's writePointsToFile method writes the test point set to a file.
 // First we wrap the test points as VectorWritable so they can be written out.
 public static List<VectorWritable> getPoints(double[][] raw) {
     List<VectorWritable> points = new ArrayList<VectorWritable>();
     for (int i = 0; i < raw.length; i++) {
         double[] fr = raw[i];
         Vector vec = new RandomAccessSparseVector(fr.length);
         vec.assign(fr);
         // Before adding to the point set, wrap the RandomAccessSparseVector
         // in a VectorWritable
         points.add(new VectorWritable(vec));
     }
     return points;
 }

 // Write the VectorWritable point set to a file. This involves some basic
 // Hadoop programming elements; see the related material in Resources.
 public static void writePointsToFile(Path output) throws IOException {
     // Generate the point set using the method above
     List<VectorWritable> pointVectors = getPoints(points);
     // Set up the basic Hadoop configuration
     Configuration conf = new Configuration();
     // Obtain the Hadoop FileSystem object
     FileSystem fs = FileSystem.get(output.toUri(), conf);
     // Create a SequenceFile.Writer, which writes the vectors to the file
     SequenceFile.Writer writer = new SequenceFile.Writer(fs, conf, output,
         Text.class, VectorWritable.class);
     // Append each vector to the sequence file
     try {
         for (VectorWritable vw : pointVectors) {
             writer.append(new Text(), vw);
         }
     } finally {
         writer.close();
     }
 }

Execution result
 KMeans Clustering In Memory Result 
 Cluster id: 0 
 center:{"class":"org.apache.mahout.math.RandomAccessSparseVector",
"vector":"{\"values\":{\"table\":[0,1,0],\"values\":[1.8,1.8,0.0],\"state\":[1,1,0],
\"freeEntries\":1,\"distinct\":2,\"lowWaterMark\":0,\"highWaterMark\":1,
\"minLoadFactor\":0.2,\"maxLoadFactor\":0.5},\"size\":2,\"lengthSquared\":-1.0}"} 
       Points: 5 
 Cluster id: 1 
 center:{"class":"org.apache.mahout.math.RandomAccessSparseVector",
 "vector":"{\"values\":{\"table\":[0,1,0],
 \"values\":[7.142857142857143,7.285714285714286,0.0],\"state\":[1,1,0],
 \"freeEntries\":1,\"distinct\":2,\"lowWaterMark\":0,\"highWaterMark\":1,
 \"minLoadFactor\":0.2,\"maxLoadFactor\":0.5},\"size\":2,\"lengthSquared\":-1.0}"} 
       Points: 7 

 KMeans Clustering Using Map/Reduce Result 
	 Weight:  Point: 
	 1.0: [1.000, 1.000] 
	 1.0: [2.000, 1.000] 
	 1.0: [1.000, 2.000] 
	 1.0: [2.000, 2.000] 
	 1.0: [3.000, 3.000] 
	 Weight:  Point: 
	 1.0: [8.000, 8.000] 
	 1.0: [9.000, 8.000] 
	 1.0: [8.000, 9.000] 
	 1.0: [9.000, 9.000] 
	 1.0: [5.000, 5.000] 
	 1.0: [5.000, 6.000] 
	 1.0: [6.000, 6.000] 

 

1)    Path input: path to all data points to be clustered. Required.

2)    Path clusters: path where the cluster centers are stored. Required.

3)    Path output: path where the clustering result is stored. Required; if the number of clusters is specified, the files under this path may be empty.

4)    DistanceMeasure measure: the method used to compute distances between data points. Optional; the default is SquaredEuclideanDistanceMeasure.

      Available values:  ChebyshevDistanceMeasure (Chebyshev distance)

                         CosineDistanceMeasure (cosine distance)

                         EuclideanDistanceMeasure (Euclidean distance)

                         MahalanobisDistanceMeasure (Mahalanobis distance)

                         ManhattanDistanceMeasure (Manhattan distance)

                         MinkowskiDistanceMeasure (Minkowski distance)

                         SquaredEuclideanDistanceMeasure (Euclidean distance without taking the square root)

                         TanimotoDistanceMeasure (Tanimoto coefficient distance)

                         There are also some weight-based distance measures:

                         WeightedDistanceMeasure

                         WeightedEuclideanDistanceMeasure, WeightedManhattanDistanceMeasure

5)  Double convergenceDelta: convergence coefficient. If a new cluster center is farther than convergenceDelta from the previous center, iteration continues; otherwise it stops. Optional; the default is 0.5.

6)  int maxIterations: maximum number of iterations. Iteration continues while the iteration count is below maxIterations; it stops as soon as either this condition or the convergenceDelta condition in 5) is met. Required.

7)  boolean runClustering: if true, after computing the cluster centers, assign each data point to a cluster; otherwise stop after computing the centers. Optional; the default is true.

8)  clusteringOption: run the computation standalone or via Map/Reduce. Optional; the default is mapreduce.

9)  int numClustersOption: the number of clusters. Optional.

Having introduced the K-means algorithm, we can see that its greatest strengths are a simple principle, a relatively simple implementation, and good efficiency and scalability on large data sets. Its weaknesses are just as clear. First, it requires the number of clusters to be fixed before clustering starts, something users rarely know in advance; in practice an optimal K is usually found through repeated experiments. Second, because the initial cluster centers are chosen at random, the algorithm tolerates noise and outliers poorly. Noise here means erroneous data among the objects to be clustered, while outliers are points far from, and only weakly similar to, all other data. In K-means, once an outlier or a noisy point is picked as an initial center, it causes serious problems for the entire clustering process. Is there a way to quickly determine how many clusters to use and where their centers should be, and thereby greatly improve the efficiency of K-means? The next section introduces a clustering method that does exactly that: the Canopy clustering algorithm.

Canopy clustering algorithm

The basic principle of Canopy clustering is: first, use a cheap, approximate distance measure to efficiently divide the data into a number of groups, each called a Canopy; canopies are allowed to overlap. Then, use a strict distance measure to accurately compare the points within the same canopy and assign them to the most suitable cluster. Canopy clustering is often used as a preprocessing step for K-means, to find a suitable k and the cluster centers.

Here is the canopy-creation process in detail. Initially we have a point set S and two preset distance thresholds T1 and T2 (T1 > T2). We pick a point and compute its distance to every other point in S (using the cheap measure); every point within T1 is put into a canopy, and every point within T2 is removed from S (this guarantees that points within T2 of a center can never become the center of another canopy). The whole process is repeated until S is empty.
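The canopy-creation loop just described can be sketched directly in plain Java. This is a 1-D illustration using `|a - b|` as the cheap distance; the class is invented for this example and is not Mahout's `CanopyClusterer`.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the canopy-creation loop: pick a point, gather everything
// within T1 into a canopy, and remove everything within T2 from S so it
// can no longer seed a new canopy. Points between T2 and T1 may end up
// in more than one canopy, which is how canopies overlap.
public class CanopySketch {
    static List<List<Double>> createCanopies(List<Double> s, double t1, double t2) {
        List<List<Double>> canopies = new ArrayList<>();
        List<Double> remaining = new ArrayList<>(s);
        while (!remaining.isEmpty()) {
            double center = remaining.get(0);          // pick a point as canopy center
            List<Double> canopy = new ArrayList<>();
            List<Double> next = new ArrayList<>();
            for (double p : remaining) {
                double d = Math.abs(p - center);       // cheap approximate distance
                if (d < t1) canopy.add(p);             // within T1: joins this canopy
                if (d >= t2) next.add(p);              // outside T2: stays in S
            }
            canopies.add(canopy);
            remaining = next;
        }
        return canopies;
    }

    public static void main(String[] args) {
        List<Double> s = List.of(1.0, 2.0, 3.0, 8.0, 9.0);
        // T1 = 4, T2 = 3, as in the Mahout listings below
        System.out.println(createCanopies(s, 4.0, 3.0)); // prints [[1.0, 2.0, 3.0], [8.0, 9.0]]
    }
}
```

The canopy centers (or means) found this way can then seed K-means, which fixes both k and the initial centers in one cheap pass.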

As with K-means, Mahout provides two implementations of Canopy clustering; the following code examples show both.


Listing 4. Canopy clustering algorithm example

				 
 // In-memory implementation of the Canopy clustering algorithm
 public static void canopyClusterInMemory() {
     // Set the distance thresholds T1 and T2
     double T1 = 4.0;
     double T2 = 3.0;
     // Call CanopyClusterer.createCanopies to create the canopies;
     // the parameters are:
     //   1. the point set to cluster
     //   2. the distance measure
     //   3. the distance thresholds T1 and T2
     List<Canopy> canopies = CanopyClusterer.createCanopies(
         SimpleDataSet.getPointVectors(SimpleDataSet.points),
         new EuclideanDistanceMeasure(), T1, T2);
     // Print the created canopies. Because this clustering problem is very
     // simple, no further, more precise clustering step is run here. When
     // necessary, the canopies can be fed to K-means as its input, solving
     // the clustering problem more accurately and efficiently.
     for (Canopy canopy : canopies) {
         System.out.println("Cluster id: " + canopy.getId() +
             " center: " + canopy.getCenter().asFormatString());
         System.out.println("       Points: " + canopy.getNumPoints());
     }
 }

 // Hadoop implementation of the Canopy clustering algorithm
 public static void canopyClusterUsingMapReduce() throws Exception {
     // Set the distance thresholds T1 and T2
     double T1 = 4.0;
     double T2 = 3.0;
     // Declare the distance measure
     DistanceMeasure measure = new EuclideanDistanceMeasure();
     // Set the input and output file paths
     Path testpoints = new Path("testpoints");
     Path output = new Path("output");
     // Clear any data under the input and output paths
     HadoopUtil.overwriteOutput(testpoints);
     HadoopUtil.overwriteOutput(output);
     // Write the test point set to the input directory
     SimpleDataSet.writePointsToFile(testpoints);

     // Call CanopyDriver.buildClusters to run Canopy clustering;
     // the parameters are:
     //   1. the input and output paths
     //   2. the distance measure
     //   3. the distance thresholds T1 and T2
     new CanopyDriver().buildClusters(testpoints, output, measure, T1, T2, true);
     // Print the Canopy clustering result
     List<List<Cluster>> clustersM = DisplayClustering.loadClusters(output);
     List<Cluster> clusters = clustersM.get(clustersM.size() - 1);
     if (clusters != null) {
         for (Cluster canopy : clusters) {
             System.out.println("Cluster id: " + canopy.getId() +
                 " center: " + canopy.getCenter().asFormatString());
             System.out.println("       Points: " + canopy.getNumPoints());
         }
     }
 }

Execution result
 Canopy Clustering In Memory Result 
 Cluster id: 0 
 center:{"class":"org.apache.mahout.math.RandomAccessSparseVector",
 "vector":"{\"values\":{\"table\":[0,1,0],\"values\":[1.8,1.8,0.0],
 \"state\":[1,1,0],\"freeEntries\":1,\"distinct\":2,\"lowWaterMark\":0,
 \"highWaterMark\":1,\"minLoadFactor\":0.2,\"maxLoadFactor\":0.5},
 \"size\":2,\"lengthSquared\":-1.0}"} 
       Points: 5 
 Cluster id: 1 
 center:{"class":"org.apache.mahout.math.RandomAccessSparseVector",
 "vector":"{\"values\":{\"table\":[0,1,0],\"values\":[7.5,7.666666666666667,0.0],
 \"state\":[1,1,0],\"freeEntries\":1,\"distinct\":2,\"lowWaterMark\":0,
 \"highWaterMark\":1,\"minLoadFactor\":0.2,\"maxLoadFactor\":0.5},\"size\":2,
 \"lengthSquared\":-1.0}"} 
       Points: 6 
 Cluster id: 2 
 center:{"class":"org.apache.mahout.math.RandomAccessSparseVector",
 "vector":"{\"values\":{\"table\":[0,1,0],\"values\":[5.0,5.5,0.0],
 \"state\":[1,1,0],\"freeEntries\":1,\"distinct\":2,\"lowWaterMark\":0,
 \"highWaterMark\":1,\"minLoadFactor\":0.2,\"maxLoadFactor\":0.5},\"size\":2,
 \"lengthSquared\":-1.0}"} 
       Points: 2 

 Canopy Clustering Using Map/Reduce Result 
 Cluster id: 0 
 center:{"class":"org.apache.mahout.math.RandomAccessSparseVector", 
 "vector":"{\"values\":{\"table\":[0,1,0],\"values\":[1.8,1.8,0.0],
 \"state\":[1,1,0],\"freeEntries\":1,\"distinct\":2,\"lowWaterMark\":0,
 \"highWaterMark\":1,\"minLoadFactor\":0.2,\"maxLoadFactor\":0.5},
 \"size\":2,\"lengthSquared\":-1.0}"} 
       Points: 5 
 Cluster id: 1 
 center:{"class":"org.apache.mahout.math.RandomAccessSparseVector",
 "vector":"{\"values\":{\"table\":[0,1,0],\"values\":[7.5,7.666666666666667,0.0],
 \"state\":[1,1,0],\"freeEntries\":1,\"distinct\":2,\"lowWaterMark\":0,
 \"highWaterMark\":1,\"minLoadFactor\":0.2,\"maxLoadFactor\":0.5},\"size\":2,
 \"lengthSquared\":-1.0}"} 
       Points: 6 
 Cluster id: 2 
 center:{"class":"org.apache.mahout.math.RandomAccessSparseVector", 
 "vector":"{\"values\":{\"table\":[0,1,0], 
 \"values\":[5.333333333333333,5.666666666666667,0.0],\"state\":[1,1,0],
 \"freeEntries\":1,\"distinct\":2,\"lowWaterMark\":0,\"highWaterMark\":1,
 \"minLoadFactor\":0.2,\"maxLoadFactor\":0.5},\"size\":2,\"lengthSquared\":-1.0}"} 
       Points: 3 

 

Fuzzy K-means clustering algorithm

Fuzzy K-means clustering is an extension of K-means. Its basic principle is the same, except that in its results an object is allowed to belong to several clusters; in other words, it is one of the overlapping clustering algorithms introduced earlier. To really understand the difference between fuzzy K-means and K-means, we need to spend some time on one concept: the fuzziness factor.

Like K-means, fuzzy K-means also loops over the set of vectors to be clustered; but instead of assigning each vector to its single nearest cluster, it computes the vector's degree of association (membership) with every cluster. Suppose we have a vector v and k clusters, and the distances from v to the k cluster centers are d1, d2 … dk. Then the association u1 between v and the first cluster is given by the standard fuzzy k-means membership formula:

                u1 = 1 / ( Σ_{j=1..k} (d1 / dj)^(2/(m-1)) )

The association of v with any other cluster is computed the same way, replacing d1 with the corresponding distance.

From this formula we can see that m controls the degree of fuzziness: as m approaches 1, the memberships harden, with almost all of a vector's association concentrating on its nearest cluster; the larger m is, the more evenly the associations spread across the clusters, i.e. the fuzzier the result. m must be greater than 1, and it is exactly the fuzziness factor we just mentioned.
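The membership computation above is easy to verify in plain Java. This sketch (the class name is invented; it is not Mahout code, and it assumes all distances are strictly positive) computes a vector's association with each cluster from its distances and the fuzziness factor m.

```java
// Sketch of the fuzzy membership computation: given a vector's distances
// d[0..k-1] to the k cluster centers and fuzziness factor m > 1, return
// its association with each cluster. Assumes every distance is > 0.
public class FuzzyMembership {
    static double[] memberships(double[] d, double m) {
        double exp = 2.0 / (m - 1.0);          // exponent from the formula
        double[] u = new double[d.length];
        for (int i = 0; i < d.length; i++) {
            double sum = 0.0;
            for (int j = 0; j < d.length; j++) {
                sum += Math.pow(d[i] / d[j], exp);
            }
            u[i] = 1.0 / sum;                  // u_i = 1 / sum_j (d_i/d_j)^(2/(m-1))
        }
        return u;
    }

    public static void main(String[] args) {
        // a point twice as far from cluster 1 as from cluster 0, with m = 2
        double[] u = memberships(new double[]{1.0, 2.0}, 2.0);
        System.out.printf("u0=%.2f u1=%.2f%n", u[0], u[1]); // prints u0=0.80 u1=0.20
    }
}
```

Note that the memberships always sum to 1, so each vector's "weight" is shared among the clusters rather than given wholly to one, which is exactly what makes the clusters overlap.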

Having covered the theory, let us see how to implement fuzzy K-means clustering with Mahout. As before, Mahout provides both an in-memory implementation and a Hadoop Map/Reduce implementation, FuzzyKMeansClusterer and FuzzyKMeansDriver; Listing 5 gives an example of each.


Listing 5. Fuzzy K-means clustering algorithm example

				 
 public static void fuzzyKMeansClusterInMemory() {
     // Number of clusters to produce
     int k = 2;
     // Maximum number of iterations
     int maxIter = 3;
     // Convergence threshold on the distance moved by cluster centers
     double distanceThreshold = 0.01;
     // Fuzziness factor for the fuzzy K-means algorithm
     float fuzzificationFactor = 10;
     // Declare a distance measure; here we choose Euclidean distance
     DistanceMeasure measure = new EuclideanDistanceMeasure();
     // Build the vector set, using the two-dimensional point set from Listing 1
     List<Vector> pointVectors = SimpleDataSet.getPointVectors(SimpleDataSet.points);
     // Randomly choose k points from the vector set as cluster centers
     List<Vector> randomPoints = RandomSeedGenerator.chooseRandomPoints(pointVectors, k);
     // Build the initial clusters. Unlike K-means, SoftCluster is used here,
     // indicating that the clusters may overlap
     List<SoftCluster> clusters = new ArrayList<SoftCluster>();
     int clusterId = 0;
     for (Vector v : randomPoints) {
         clusters.add(new SoftCluster(v, clusterId++, measure));
     }
     // Call FuzzyKMeansClusterer.clusterPoints to run fuzzy K-means clustering
     List<List<SoftCluster>> finalClusters =
         FuzzyKMeansClusterer.clusterPoints(pointVectors,
             clusters, measure, distanceThreshold, maxIter, fuzzificationFactor);
     // Print the clustering result
     for (SoftCluster cluster : finalClusters.get(finalClusters.size() - 1)) {
         System.out.println("Fuzzy Cluster id: " + cluster.getId() +
             " center: " + cluster.getCenter().asFormatString());
     }
 }

 public static void fuzzyKMeansClusterUsingMapReduce() throws Exception {
     // Fuzziness factor for the fuzzy K-means algorithm
     float fuzzificationFactor = 2.0f;
     // Number of clusters to produce; here we choose 2
     int k = 2;
     // Maximum number of iterations
     int maxIter = 3;
     // Convergence threshold on the distance moved by cluster centers
     double distanceThreshold = 0.01;
     // Declare a distance measure; here we choose Euclidean distance
     DistanceMeasure measure = new EuclideanDistanceMeasure();
     // Set the input and output file paths
     Path testpoints = new Path("testpoints");
     Path output = new Path("output");
     // Clear any data under the input and output paths
     HadoopUtil.overwriteOutput(testpoints);
     HadoopUtil.overwriteOutput(output);
     // Write the test point set to the input directory
     SimpleDataSet.writePointsToFile(testpoints);
     // Randomly choose k points as cluster centers
     Path clusters = RandomSeedGenerator.buildRandom(testpoints,
         new Path(output, "clusters-0"), k, measure);
     FuzzyKMeansDriver.runJob(testpoints, clusters, output, measure, 0.5, maxIter, 1,
         fuzzificationFactor, true, true, distanceThreshold, true);
     // Print the fuzzy K-means clustering result
     ClusterDumper clusterDumper = new ClusterDumper(new Path(output, "clusters-" +
         maxIter), new Path(output, "clusteredPoints"));
     clusterDumper.printClusters(null);
 }

Execution result
 Fuzzy KMeans Clustering In Memory Result 
 Fuzzy Cluster id: 0 
 center:{"class":"org.apache.mahout.math.RandomAccessSparseVector",
 "vector":"{\"values\":{\"table\":[0,1,0],
 \"values\":[1.9750483367699223,1.993870669568863,0.0],\"state\":[1,1,0],
 \"freeEntries\":1,\"distinct\":2,\"lowWaterMark\":0,\"highWaterMark\":1,
 \"minLoadFactor\":0.2,\"maxLoadFactor\":0.5},\"size\":2,\"lengthSquared\":-1.0}"} 
 Fuzzy Cluster id: 1 
 center:{"class":"org.apache.mahout.math.RandomAccessSparseVector",
 "vector":"{\"values\":{\"table\":[0,1,0], 
 \"values\":[7.924827516566109,7.982356511917616,0.0],\"state\":[1,1,0],
 \"freeEntries\":1, \"distinct\":2,\"lowWaterMark\":0,\"highWaterMark\":1,
 \"minLoadFactor\":0.2,\"maxLoadFactor\":0.5},\"size\":2,\"lengthSquared\":-1.0}"} 

 Fuzzy KMeans Clustering Using Map/Reduce Result 
 Weight:  Point: 
	 0.9999249428064162: [8.000, 8.000] 
	 0.9855340718746096: [9.000, 8.000] 
	 0.9869963781734195: [8.000, 9.000] 
	 0.9765978701133124: [9.000, 9.000] 
	 0.6280999013864511: [5.000, 6.000] 
	 0.7826097471578298: [6.000, 6.000] 
	 Weight:  Point: 
	 0.9672607354172386: [1.000, 1.000] 
	 0.9794914088151625: [2.000, 1.000] 
	 0.9803932521191389: [1.000, 2.000] 
	 0.9977806183197744: [2.000, 2.000] 
	 0.9793701109946826: [3.000, 3.000] 
	 0.5422929338028506: [5.000, 5.000] 

 

Dirichlet clustering algorithm

The three clustering algorithms introduced so far are all partition-based. This section briefly introduces a clustering algorithm based on probabilistic distribution models: Dirichlet clustering (Dirichlet Process Clustering).

First, a brief look at how clustering algorithms based on probability distribution models (model-based clustering algorithms for short) work. We start by defining a distribution model, simple ones such as circles or triangles, or complex ones such as the normal or Poisson distribution. The data is then classified against the models: objects are added to a model, and the model grows or shrinks accordingly. After each round, the parameters of each model are recomputed, and the probability that each object belongs to the model is estimated. The core of a model-based algorithm is therefore the model definition: for a given clustering problem, the quality of the model directly determines the quality of the result. As a simple example, suppose the task is to split some two-dimensional points into three groups, shown in different colors in the figure; Figure A is the result using a circle model, and Figure B the result using a triangle model. The circle model is the right choice, while the triangle model both misses points and misclassifies others, making it a wrong choice.


Figure 3. Clustering results with different models

Mahout ʵÏֵĵÒÀû¿ËÀ×¾ÛÀàËã·¨Êǰ´ÕÕÈçϹý³Ì¹¤×÷µÄ£ºÊ×ÏÈ£¬ÎÒÃÇÓÐÒ»×é´ý¾ÛÀàµÄ¶ÔÏóºÍÒ»¸ö·Ö²¼Ä£ÐÍ¡£ÔÚ Mahout ÖÐʹÓà ModelDistribution Éú³É¸÷ÖÖÄ£ÐÍ¡£³õʼ״̬£¬ÎÒÃÇÓÐÒ»¸ö¿ÕµÄÄ£ÐÍ£¬È»ºó³¢ÊÔ½«¶ÔÏó¼ÓÈëÄ£ÐÍÖУ¬È»ºóÒ»²½Ò»²½¼ÆËã¸÷¸ö¶ÔÏóÊôÓÚ¸÷¸öÄ£Ð͵ĸÅÂÊ¡£ÏÂÃæÇåµ¥¸ø³öÁË»ùÓÚÄÚ´æÊµÏֵĵÒÀû¿ËÀ×¾ÛÀàËã·¨¡£


Listing 6. Dirichlet clustering algorithm example

				 
 public static void DirichletProcessesClusterInMemory() {
     // The alpha parameter of the Dirichlet algorithm; it is a transition
     // parameter that lets objects move smoothly between models
     double alphaValue = 1.0;
     // Number of cluster models
     int numModels = 3;
     // The thin and burn intervals, used to reduce memory consumption
     // during the clustering process
     int thinIntervals = 2;
     int burnIntervals = 2;
     // Maximum number of iterations
     int maxIter = 3;
     List<VectorWritable> pointVectors =
         SimpleDataSet.getPoints(SimpleDataSet.points);
     // Generate an empty distribution model for the initial state; here we
     // use NormalModelDistribution
     ModelDistribution<VectorWritable> model =
         new NormalModelDistribution(new VectorWritable(new DenseVector(2)));
     // Run the clustering
     DirichletClusterer dc = new DirichletClusterer(pointVectors, model, alphaValue,
         numModels, thinIntervals, burnIntervals);
     List<Cluster[]> result = dc.cluster(maxIter);
     // Print the clustering result
     for (Cluster cluster : result.get(result.size() - 1)) {
         System.out.println("Cluster id: " + cluster.getId() + " center: " +
             cluster.getCenter().asFormatString());
         System.out.println("       Points: " + cluster.getNumPoints());
     }
 }

Execution result
 Dirichlet Processes Clustering In Memory Result 
 Cluster id: 0 
 center:{"class":"org.apache.mahout.math.DenseVector",
 "vector":"{\"values\":[5.2727272727272725,5.2727272727272725],
 \"size\":2,\"lengthSquared\":-1.0}"} 
       Points: 11 
 Cluster id: 1 
 center:{"class":"org.apache.mahout.math.DenseVector",
 "vector":"{\"values\":[1.0,2.0],\"size\":2,\"lengthSquared\":-1.0}"} 
       Points: 1 
 Cluster id: 2 
 center:{"class":"org.apache.mahout.math.DenseVector",
 "vector":"{\"values\":[9.0,8.0],\"size\":2,\"lengthSquared\":-1.0}"} 
       Points: 0 

 

Mahout provides several implementations of probability distribution models, all of which extend ModelDistribution, as shown in Figure 4. Users can choose a suitable model according to the characteristics of their own data set; see the official Mahout documentation for details.


Figure 4. Hierarchy of probability distribution models in Mahout

Summary of Mahout's clustering algorithms

The preceding sections introduced four clustering algorithms provided by Mahout in detail; here is a brief summary of each algorithm's strengths and weaknesses. Beyond these four, Mahout also provides some more complex clustering algorithms that are not covered one by one here; see the detailed descriptions on the Mahout wiki.


Table 1. Summary of Mahout clustering algorithms

Algorithm       In-memory implementation   Map/Reduce implementation   Cluster count fixed   Clusters may overlap
K-means         KMeansClusterer            KMeansDriver                Y                     N
Canopy          CanopyClusterer            CanopyDriver                N                     N
Fuzzy K-means   FuzzyKMeansClusterer       FuzzyKMeansDriver           Y                     Y
Dirichlet       DirichletClusterer         DirichletDriver             N                     Y


 

Summary

Clustering algorithms are widely used in intelligent information-processing systems. This article first outlined the concept of clustering and the ideas behind clustering algorithms, giving the reader an overall picture of this important technique. It then took the perspective of building real applications and looked in depth at the clustering framework in the open-source project Apache Mahout, including its mathematical models, the individual clustering algorithms, and their implementations on different infrastructures. Through the code examples, readers can see how to vectorize their own data and how to choose among the different clustering algorithms for their particular problem.

The next article in this series digs into another family of algorithms relevant to recommendation engines: classification. Like clustering, classification is a classic data-mining problem; it is mainly used to extract models that describe important data classes, from which predictions can then be made, and recommendation is a form of prediction. Clustering and classification also complement each other: both help make recommendation efficient on massive data sets. The next article will therefore describe the various classification algorithms, their principles, strengths, weaknesses, and application scenarios, and give efficient implementations based on Apache Mahout.

Research status

Traditional clustering has solved the low-dimensional clustering problem fairly successfully. But because real-world data is complex, existing algorithms often fail, especially on high-dimensional and large data sets. Traditional clustering methods run into two main problems when clustering high-dimensional data: (1) high-dimensional data sets contain large numbers of irrelevant attributes, making the probability that clusters exist across all dimensions essentially zero; (2) data in high-dimensional space is much sparser than in low-dimensional space, and it is common for the distances between points to be nearly equal, whereas traditional clustering is distance-based and therefore cannot build clusters from distances in high-dimensional space.
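The "distances become nearly equal" phenomenon is easy to demonstrate empirically. The sketch below (invented names, a toy experiment with a fixed random seed) measures the relative contrast (maxDist - minDist) / minDist between random points in 2 dimensions versus 1,000 dimensions; in high dimensions the contrast collapses, which is exactly why distance-based clustering loses its signal there.

```java
import java.util.Random;

// Demonstrates distance concentration: as the dimension grows, pairwise
// Euclidean distances between uniformly random points become nearly equal.
public class DistanceConcentration {
    // relative contrast (maxDist - minDist) / minDist over n random points
    static double contrast(int n, int dim, long seed) {
        Random rnd = new Random(seed);
        double[][] pts = new double[n][dim];
        for (double[] p : pts)
            for (int d = 0; d < dim; d++) p[d] = rnd.nextDouble();
        double min = Double.MAX_VALUE, max = 0;
        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++) {
                double s = 0;
                for (int d = 0; d < dim; d++) {
                    double diff = pts[i][d] - pts[j][d];
                    s += diff * diff;
                }
                double dist = Math.sqrt(s);
                min = Math.min(min, dist);
                max = Math.max(max, dist);
            }
        return (max - min) / min;
    }

    public static void main(String[] args) {
        System.out.printf("dim=2:    contrast=%.2f%n", contrast(100, 2, 42));
        System.out.printf("dim=1000: contrast=%.2f%n", contrast(100, 1000, 42));
        // the high-dimensional contrast is far smaller: distances bunch together
    }
}
```

With the contrast near zero, "nearest" and "farthest" neighbors are barely distinguishable, so subspace and other specialized high-dimensional methods are needed.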

High-dimensional cluster analysis has become an important research direction within cluster analysis, and it is also one of the hard problems of clustering technology. Advances in technology have made data collection ever easier, leading to ever larger and more complex databases, such as trade transaction data of all kinds, Web documents, and gene-expression data, whose dimensionality (number of attributes) commonly reaches hundreds or thousands, or even more. Under the influence of the "curse of dimensionality", however, many clustering methods that perform well in low-dimensional spaces fail to achieve good results in high-dimensional spaces. High-dimensional clustering is a very active and challenging area of cluster analysis, with broad applications in market analysis, information security, finance, entertainment, counter-terrorism, and more.

http://www.cnblogs.com/shipengzhi/articles/2489389.html

=========

Minhash based clustering

https://issues.apache.org/jira/browse/MAHOUT-344

========

How to improve clustering?

http://comments.gmane.org/gmane.comp.apache.mahout.user/16296

========

Running the k-means example in Mahout

http://blog.163.com/jiayouweijiewj@126/blog/static/171232177201011475716354/

 
 
  • Tags: clustering, MapReduce
    Mengxiang'er's website, "where dreams fly": http://www.dreamflier.net
