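"Big data" is data that has become large enough that it can no longer be processed using traditional methods. In the past it was the creators of web search engines who were the first to face this problem. Today social networks, mobile applications, sensors of all kinds, and scientific fields create petabytes of data every day. To meet the challenge of processing data at this scale, Google created MapReduce. Google's work, together with Hadoop, created at Yahoo, has hatched a complete ecosystem of big data processing tools.
As MapReduce has grown in popularity, a stack model for big data processing has emerged, made up of a storage layer, MapReduce, and query (SMAQ for short). SMAQ systems are typically open source, distributed, and run on commodity hardware.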

Just as LAMP, made up of Linux, Apache, MySQL, and PHP, changed the landscape of web application development, SMAQ will open up much wider territory for big data processing. Just as LAMP was a key enabler of Web 2.0, SMAQ systems will underpin a new era of innovative, data-driven products and services.
Although Hadoop-based architectures dominate, the SMAQ model also covers a large number of other systems, including the mainstream NoSQL databases. This article describes the SMAQ stack and the big data processing tools that fit into it today.
MapReduce
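MapReduce was created at Google to build its index of web pages. The MapReduce framework has become the powerhouse behind most of today's big data processing. The key idea of MapReduce is to take a query over a data set, divide it, and run it in parallel on multiple nodes. This distributed approach solves the problem of data that is too large to fit on a single machine.

To understand how MapReduce works, look first at the two phases its name suggests. In the map phase, the input data is processed item by item and transformed into an intermediate result set. In the reduce phase, these intermediate results are reduced to the summarized result we are after.
The example usually given for MapReduce is counting the occurrences of each distinct word in a document. In the map phase each word is picked out and given a count of 1; in the reduce phase the counts for identical words are added together.
If this looks like a very complicated way of doing a simple job, that is MapReduce. For MapReduce to do its work, the map and reduce phases must obey certain constraints that allow the work to be parallelized. Translating a query into one or more MapReduce steps is not an intuitive process; to address this, higher-level abstractions have been developed, which are discussed in the section on querying below.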
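The word count described above can be sketched in plain Java, without Hadoop, to show why the constraints matter: the map step looks at one line at a time and the reduce step looks at one word at a time, so either could be spread across machines. This is only an in-memory illustration; the class and method names (`WordCountSketch`, `mapPhase`, `reducePhase`, `run`) are made up for this example and are not part of any MapReduce API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory illustration of the map, group, and reduce steps.
final class WordCountSketch {
    // Map phase: one line in, a list of (word, 1) pairs out. No shared state.
    static List<Map.Entry<String, Integer>> mapPhase(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new SimpleEntry<String, Integer>(token, 1));
            }
        }
        return pairs;
    }

    // Reduce phase: all counts for a single word are summed on their own.
    static int reducePhase(List<Integer> counts) {
        int sum = 0;
        for (int c : counts) {
            sum += c;
        }
        return sum;
    }

    // Driver: map every line, group the pairs by word, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : mapPhase(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                       .add(pair.getValue());
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            result.put(e.getKey(), reducePhase(e.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints {dog=1, end=1, fox=1, lazy=1, quick=1, the=3}
        System.out.println(run(Arrays.asList("the quick fox", "the lazy dog the end")));
    }
}
```

In a real framework the grouping step in `run` is the part the system performs between the two phases; the programmer supplies only the map and reduce functions.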
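Solving a problem with MapReduce usually requires three distinct operations:
Loading the data: in data warehousing terms this operation is more properly called extract, transform, load (ETL). To be processed with MapReduce, data must be extracted from its source, structured as necessary, and loaded into a storage layer that MapReduce can access.
MapReduce: this phase reads data from the storage layer, processes it, and returns the results to the storage layer.
Extracting the results: once processing is complete, the results must be queryable from the storage layer and presentable, so that they are usable by people.
Many SMAQ systems have features of their own designed mainly to simplify these three operations.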
Hadoop MapReduce
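Hadoop is the dominant open source MapReduce implementation. Funded by Yahoo, it was created by Doug Cutting in 2006 and reached web-scale data processing capacity in 2008.
The Hadoop project is now hosted by Apache. Through continued effort it has grown, together with a number of subprojects, into a complete SMAQ stack.
Because it is implemented in Java, Hadoop's MapReduce implementation is accessible from the Java programming language. Creating a MapReduce job typically involves writing functions that carry out the computations of the map and reduce phases. The data to be processed must be loaded into the Hadoop Distributed File System.
Taking word count as the example, the map function looks like this (taken from the Hadoop MapReduce documentation and showing only the key steps):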
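// Imports needed by this excerpt; the class itself is nested inside the job's driver class.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class Map
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // Called once per input line; emits (word, 1) for every token in the line.
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);
    }
  }
}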
The corresponding reduce function is:
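// Imports needed by this excerpt; the class itself is nested inside the job's driver class.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public static class Reduce
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  // Called once per distinct word with all of its counts; emits (word, total).
  public void reduce(Text key, Iterable<IntWritable> values,
      Context context) throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));
  }
}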
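Running a MapReduce job with Hadoop involves the following steps:
1. Define the MapReduce phases in a Java program
2. Load the data into the file system
3. Submit the job for execution
4. Retrieve the results from the file system
Written directly against the Java API, Hadoop MapReduce jobs can be complex and demand a great deal of programmer involvement. A large ecosystem has formed around Hadoop to make loading and processing data more straightforward.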
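Other implementations
MapReduce has been implemented in many other programming languages and systems; a detailed list can be found in Wikipedia's entry for MapReduce. Notably, several NoSQL databases have integrated MapReduce, as described later in this article.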
Storage
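From fetching the data to storing the results, MapReduce needs to work with storage. Unlike traditional databases, the input data for MapReduce is not relational. Input data is stored in chunks that can be distributed to different nodes and is then supplied to the map phase in key-value form. The data does not require a schema and may be unstructured. It must, however, be distributable, so that it can be served to the different processing nodes.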
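As a rough illustration of the paragraph above, the sketch below cuts schema-less input into chunks and presents each chunk to a map phase as key-value records. It is a deliberate simplification: the names `ChunkSketch`, `split`, and `records` are hypothetical, chunks here are measured in lines rather than bytes, and the key is simply the record's position within its chunk, which is an assumption made for this example rather than how any particular storage layer does it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: chunking unstructured lines and exposing them as key-value records.
final class ChunkSketch {
    // Cuts the input into chunks of at most linesPerChunk lines.
    // Each chunk could be handed to a different processing node.
    static List<List<String>> split(List<String> lines, int linesPerChunk) {
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += linesPerChunk) {
            int end = Math.min(i + linesPerChunk, lines.size());
            chunks.add(new ArrayList<>(lines.subList(i, end)));
        }
        return chunks;
    }

    // Presents one chunk as key-value records.
    // Key: position of the line inside the chunk. Value: the raw, schema-less line.
    static Map<Integer, String> records(List<String> chunk) {
        Map<Integer, String> out = new LinkedHashMap<>();
        for (int i = 0; i < chunk.size(); i++) {
            out.put(i, chunk.get(i));
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("line one", "line two", "line three",
                                           "line four", "line five");
        for (List<String> chunk : split(input, 2)) {
            System.out.println(records(chunk));
        }
    }
}
```

Nothing in the sketch inspects the content of a line, which is the point: the storage layer only has to make the data divisible, and any structure is imposed later by the map function.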

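The design and features of the storage layer matter not only because of its interface with MapReduce, but also because they directly determine how easily data can be loaded and how easily results can be queried and presented.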
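Hadoop Distributed File System
The standard storage mechanism used by Hadoop is HDFS. A core part of Hadoop, HDFS has the following features, described in detail in the HDFS design document:
Fault tolerance: assuming that failure is the norm allows HDFS to run on commodity hardware
Streaming data access: HDFS is written with batch processing in mind, so it emphasizes high throughput rather than random access to data
Extreme scalability: HDFS scales to petabytes of data; Facebook, for example, has a production deployment at that scale
Portability: Hadoop is portable across operating systems
Write once: by assuming that a file will not change after it is written, HDFS simplifies replication and increases data throughput
Locality of computation: given the volume of data, it is usually faster to move the program close to the data, and HDFS has features to support this
HDFS provides an interface similar to that of a standard file system. Unlike a traditional database, HDFS can only store and retrieve data; it cannot index it, so simple random access to the data is not possible. Higher-level abstractions have been created to provide finer-grained functionality on top of Hadoop, HBase being one example.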
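HBase, the Hadoop Database
One approach to making HDFS more usable is HBase. Modeled after Google's BigTable database, HBase is a column-oriented database designed to store massive amounts of data. It belongs to the NoSQL family of databases and is similar to Cassandra and Hypertable.

HBase uses HDFS as its underlying storage system, so it too can store large volumes of data across many fault-tolerant, distributed nodes. Like other column-store databases, HBase offers REST and Thrift based access APIs.
Because it builds an index, HBase can give simple queries fast, random access to content. For complex operations, HBase acts as both a data source and a storage target for Hadoop MapReduce. HBase therefore lets systems interact with MapReduce as they would with a database, rather than through the lower-level HDFS.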
Hive
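Data warehousing, or storing data in a way that makes reporting and analysis easier, is an important application area for SMAQ systems. Hive, originally developed at Facebook, is a data warehouse framework built on top of Hadoop. Like HBase, Hive provides a table-based abstraction over HDFS and simplifies the loading of structured data. In contrast to HBase, Hive can only run MapReduce jobs and is suited to batch data analysis. As described in the query section below, Hive provides a SQL-like query language for executing MapReduce jobs.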
Cassandra and Hypertable
Cassandra and Hypertable are both column-store databases in the BigTable mold, similar to HBase.

Cassandra, now an Apache project, originated at Facebook. It is in use at many large-scale web sites, including Twitter, Facebook, Reddit, and Digg. Hypertable originated at Zvents and is now also an open source project.
Both databases offer an interface to Hadoop MapReduce that allows them to act as a data source and target for Hadoop MapReduce jobs. At a higher level, Cassandra offers integration with the Pig query language (see the query section), and Hypertable has been integrated with Hive.
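NoSQL database implementations of MapReduce
The storage solutions mentioned so far all rely on Hadoop for MapReduce. Some NoSQL databases have built-in MapReduce support of their own, used to parallelize computation over the data they store. In contrast to the multi-component SMAQ architecture of Hadoop-based systems, they offer a self-contained system in which storage, MapReduce, and query are combined in one.
Whereas Hadoop-based systems are usually oriented toward batch analysis, NoSQL stores are usually oriented toward real-time applications. In these databases MapReduce tends to be an add-on feature that complements the other query mechanisms. In Riak, for example, a MapReduce job typically has a timeout of 60 seconds, whereas Hadoop generally expects a job to run for minutes or hours.
The following NoSQL databases all provide MapReduce functionality:
CouchDB: a distributed database that provides semi-structured document storage. Its main features are strong replication support and the ability to make distributed updates. In CouchDB, queries are implemented by using JavaScript to define the map and reduce phases of a MapReduce process.
MongoDB: very similar in nature to CouchDB, but with a stronger emphasis on performance and weaker support for distributed updates, replication, and versioning. MapReduce operations are also specified in JavaScript.
Riak: also very similar to the previous two databases, but with a stronger focus on high availability. MapReduce operations can be specified in JavaScript or Erlang.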
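Integration with relational databases
In many applications the primary source data is stored in a relational database such as MySQL or Oracle. MapReduce typically uses this data in one of two ways:
Using the relational database as a source (for example, a list of friends in a social network)
Re-injecting the results of MapReduce into the relational database (for example, a list of product recommendations based on friends' interests)
It is important to understand how MapReduce interacts with relational databases. At the simplest level, delimited text files can serve as the import and export format between a traditional relational database and a Hadoop system, using a combination of SQL export commands and HDFS operations. Beyond that, more sophisticated tools exist.
Sqoop is a tool designed to import data from relational databases into Hadoop. It was developed by Cloudera, a Hadoop platform distributor focused on the enterprise. Sqoop is database-agnostic because it uses the Java JDBC database API. Entire tables can be imported, or a query can be used to restrict the data to be imported.
Sqoop can also export the results of MapReduce from HDFS back to a relational database. Because HDFS is a file system, Sqoop takes delimited text as its input and must convert it into the corresponding SQL commands in order to insert the data into the database.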
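To make the conversion step concrete, the sketch below turns one line of delimited text into a parameterized INSERT statement plus its field values. It is not Sqoop's implementation and uses none of Sqoop's code; `ExportSketch`, `insertSql`, `fields`, and the `recommendations` table with its columns are all invented for illustration. A real exporter would then bind the values through JDBC rather than stop at strings.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.regex.Pattern;

// Hypothetical sketch of mapping delimited text onto a SQL INSERT.
final class ExportSketch {
    // Builds "INSERT INTO <table> (<c1>, <c2>, ...) VALUES (?, ?, ...)".
    static String insertSql(String table, String[] columns) {
        String columnList = String.join(", ", columns);
        String placeholders = String.join(", ", Collections.nCopies(columns.length, "?"));
        return "INSERT INTO " + table + " (" + columnList + ") VALUES (" + placeholders + ")";
    }

    // Splits one delimited line into its field values.
    // The limit of -1 keeps trailing empty fields instead of dropping them.
    static String[] fields(String line, char delimiter) {
        return line.split(Pattern.quote(String.valueOf(delimiter)), -1);
    }

    public static void main(String[] args) {
        String[] columns = {"user_id", "product"};   // assumed column names
        String line = "42\twidget";                  // one tab-delimited result line
        System.out.println(insertSql("recommendations", columns));
        System.out.println(Arrays.toString(fields(line, '\t')));
    }
}
```

Using placeholders rather than pasting the values into the SQL text keeps quoting and escaping out of the conversion, which matters because delimited output can contain arbitrary characters.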
For Hadoop systems, similar functionality is also available through cascading.jdbc and cascading-dbmigrate in the Cascading API.
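Integration with streaming data sources
Alongside relational databases, streaming data sources such as web server logs and sensor output are the most common sources of data for a big data system. Cloudera's Flume project aims to provide convenient integration between streaming data sources and Hadoop. Flume collects data from the machines in a cluster and feeds it continuously into HDFS. Facebook's Scribe server offers similar functionality.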
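Commercial SMAQ solutions
Some MPP (massively parallel processing) databases have built-in MapReduce support. MPP databases have a distributed architecture of independent nodes that run in parallel. Their main use is data warehousing and analytics, and they are accessed with SQL.
Greenplum: based on the open source PostgreSQL DBMS, it runs on clusters of distributed hardware. Adding MapReduce alongside SQL allows faster, larger-scale analysis of data in Greenplum, reducing query times by several orders of magnitude. Greenplum MapReduce permits a mix of data held in the database and external data sources. MapReduce operations can be expressed as functions in Perl or Python.
Aster Data's nCluster data warehouse system also offers MapReduce support. MapReduce operations are invoked through Aster Data's SQL-MapReduce technology, which combines SQL queries with MapReduce jobs defined in source code in a variety of languages (C#, C++, Java, R, or Python).
Other data warehousing solutions have chosen to provide connectors to Hadoop rather than integrating MapReduce functionality internally.
Vertica: a column-oriented database that provides a connector for Hadoop.
Netezza: recently acquired by IBM, it has worked with Cloudera to improve its interoperability with Hadoop. Although it solves similar problems, it actually falls outside our definition of the SMAQ model, because it is neither open source nor runs on commodity hardware.
Although a Hadoop-based system can be built entirely from open source software, integrating such a system still takes some effort. Cloudera aims to make Hadoop better suited to enterprise use and already provides a unified Hadoop distribution in its Cloudera Distribution for Hadoop (CDH).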
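Query
As the Java code above shows, defining the map and reduce steps of a MapReduce job in a programming language is neither intuitive nor convenient. To address this, SMAQ systems introduce a higher-level query layer that simplifies both MapReduce operations and the retrieval of results.

Many organizations that use Hadoop have already wrapped the Hadoop API internally to make their operations more convenient. Some of these wrappers have become open source projects or commercial products.
Query layers typically do more than provide features for describing the computation: they also support storing and retrieving data and simplify the flow of execution on the MapReduce cluster.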
Pig
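Developed by Yahoo and now part of the Hadoop project, Pig provides a high-level query language called Pig Latin for describing and running MapReduce jobs. It is intended to make Hadoop more accessible to developers familiar with SQL, and in addition to a Java API it provides an interactive interface. Pig integration is available for the Cassandra and HBase databases. Below is the word count example from above written in Pig, including the loading and storing of the data ($0 refers to the first field in a record).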
input = LOAD 'input/sentences.txt' USING TextLoader();
words = FOREACH input GENERATE FLATTEN(TOKENIZE($0));
grouped = GROUP words BY $0;
counts = FOREACH grouped GENERATE group, COUNT(words);
ordered = ORDER counts BY $0;
STORE ordered INTO 'output/wordCount' USING PigStorage();
Pig is very expressive: it lets developers write custom functionality through UDFs (User Defined Functions), which are written in Java. Although it is much easier to understand and use than the MapReduce API, it does require users to learn a new language. It resembles SQL in some ways, but it differs from SQL enough that people who know SQL will find it hard to reuse their knowledge here.
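For readers who know Java better than Pig Latin, the word count script above can be read as a dataflow in which each statement transforms the relation produced by the one before it. The sketch below mirrors that flow with Java streams purely as an analogy. It runs in memory on a single machine, it is not what Pig generates, the name `PigDataflowSketch` is made up, and splitting on whitespace is a simplification of Pig's TOKENIZE.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical single-machine analogy for the Pig Latin word count script.
final class PigDataflowSketch {
    static Map<String, Long> wordCount(List<String> sentences) {
        return sentences.stream()
            // words = FOREACH input GENERATE FLATTEN(TOKENIZE($0));
            .flatMap(s -> Arrays.stream(s.trim().split("\\s+")))
            .filter(w -> !w.isEmpty())
            // grouped = GROUP words BY $0;
            // counts  = FOREACH grouped GENERATE group, COUNT(words);
            // ordered = ORDER counts BY $0;  (a TreeMap keeps its keys sorted)
            .collect(Collectors.groupingBy(w -> w, TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Stands in for LOAD and STORE, which read and write files in the Pig script.
        List<String> input = Arrays.asList("to be or not to be", "that is the question");
        System.out.println(wordCount(input));
    }
}
```

The analogy also shows where Pig differs from SQL: the script names each intermediate relation and builds the result step by step, rather than declaring the whole result in one statement.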
Hive
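As described earlier, Hive is an open source data warehouse built on top of Hadoop. Created by Facebook, it provides a query language very similar to SQL, as well as a web interface that supports simple query building. This makes it well suited to non-developer users who are familiar with SQL.
One of Hive's strengths, compared with Pig and Cascading, which require compilation, is that it provides ad hoc queries. For established business intelligence systems Hive is a more natural starting point, because it offers an interface that is friendlier to non-technical users. Cloudera's Hadoop distribution integrates Hive and, through the HUE project, provides a higher-level user interface that lets users submit queries and monitor the execution of MapReduce jobs.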
Cascading, the API Approach
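Cascading provides a wrapper around Hadoop's MapReduce API that makes it easier to use from Java applications. It is intentionally a thin layer whose purpose is to make integrating MapReduce into a larger system simpler. Cascading's features include:
A data processing API that aims to simplify the definition of MapReduce jobs
An API that controls the execution of MapReduce jobs on a Hadoop cluster
Access from JVM-based scripting languages such as Jython, Groovy, or JRuby
Integration with data sources other than HDFS, including Amazon S3 and web servers
Validation mechanisms that allow MapReduce processes to be tested
Cascading's key feature is that it lets developers assemble MapReduce jobs as a flow, by connecting together a selection of pipes. This makes it well suited to integrating Hadoop into a larger system. Cascading itself does not provide a high-level query language; that job is done by Cascalog, an open source project derived from it. Cascalog is implemented in the JVM language Clojure and provides a query language similar to Datalog. Although powerful, Cascalog is likely to remain a niche language, because it neither offers a SQL-like language as Hive does nor is procedural as Pig is. Below is the word count example written in Cascalog: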
(defmapcatop split [sentence]
(seq (.split sentence "\\s+")))
(?<- (stdout) [?word ?count]
(sentence ?s) (split ?s :> ?word)
(c/count ?count))
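Search with Solr
An important component of a large-scale data system is the querying and summarizing of data. Database layers such as HBase provide simple access to the data but lack sophisticated search capabilities. To solve the search problem, the open source search and indexing platform Solr is often used together with a NoSQL database. Solr uses Lucene search technology to provide a self-contained search server product. Consider, for example, a social network database: MapReduce could compute the influence of each person according to some suitable metric, and that value would be written back to the database. Indexing the data with Solr afterward then allows operations on the social network such as finding the most influential people.
Originally developed at CNET and now an Apache project, Solr has evolved from a plain text search engine into one that supports faceted navigation and results clustering. In addition, Solr can manage large volumes of data stored across distributed servers. This makes it an ideal solution for searching big data, and an important component for building business intelligence systems.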
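Conclusion
MapReduce, and the Hadoop implementation in particular, offers a powerful way of distributing computation across commodity servers. Combined with distributed storage and user-friendly query mechanisms, it forms the SMAQ architecture, which brings big data processing within reach of small teams and even individual developers.
It has now become inexpensive to analyze data in depth or to create data products that depend on complex computation. The results have already had a far-reaching effect on the landscape of data analysis and data warehousing, lowering the barrier to entry and fostering a new generation of products, services, and organizational approaches. This trend is explored in more depth in Mike Loukides' "What is Data Science?" report.
The arrival of Linux gave power to innovative developers with nothing more than a Linux server sitting on a desk. SMAQ has just as much potential to improve the efficiency of data centers, to foster innovation at the edges of organizations, and to open a new era of inexpensively created data-driven businesses.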
This article is a translation of The SMAQ stack for big data
English original: http://radar.oreilly.com/2010/09/the-smaq-stack-for-big-data.html
SMAQ stands for storage, MapReduce, and query.
When reposting, please credit the translator: phylips@bmy
Source: http://duanple.blog.163.com/blog/static/709717672011016103028473/