Writing a Hadoop MapReduce Program in PHP
[ 2011/6/25 16:30:00 | By: 梦翔儿 ]
 

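I have recently been working on a marketing-effectiveness analysis that requires processing large volumes of web access logs (upwards of ten million page views), which is exactly the kind of job Hadoop is good at. PHP is a simple, practical language that I like a lot; apart from Lisp, it is the programming language that has influenced me most. So I decided to try PHP on Hadoop on my D630 laptop.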

 

  • First, download VMware Player:

 

The virtual machine image is designed to be used with the free VMware Player.

VMware Player 2.5.1

Latest Version: 2.5.1 | 2008/11/21 | Build: 126130

http://www.vmware.com/download/player/

 

  • Download the hadoopvm image:

 

Download the VMware image here:

http://code.google.com/intl/zh-CN/edu/parallel/tools/hadoopvm/index.html

 

Start hadoopvm.

 

  • Remove the existing package sources and add the ones below; otherwise packages on feisty cannot be upgraded:

vi /etc/apt/sources.list

 

deb http://ubuntu.cn99.com/ubuntu/ feisty main restricted universe multiverse

deb http://ubuntu.cn99.com/ubuntu/ feisty-security main restricted universe multiverse

deb http://ubuntu.cn99.com/ubuntu/ feisty-updates main restricted universe multiverse

deb http://ubuntu.cn99.com/ubuntu/ feisty-proposed main restricted universe multiverse

deb http://ubuntu.cn99.com/ubuntu/ feisty-backports main restricted universe multiverse

deb-src http://ubuntu.cn99.com/ubuntu/ feisty main restricted universe multiverse

deb-src http://ubuntu.cn99.com/ubuntu/ feisty-security main restricted universe multiverse

deb-src http://ubuntu.cn99.com/ubuntu/ feisty-updates main restricted universe multiverse

deb-src http://ubuntu.cn99.com/ubuntu/ feisty-proposed main restricted universe multiverse

deb-src http://ubuntu.cn99.com/ubuntu/ feisty-backports main restricted universe multiverse

 

  • Run apt-get update to refresh the package lists.
  • Make sure the current user is root:

 

Log in as user 'root' (password 'root').

 

  • Run 'apt-get install php5-cli' to install PHP 5.
  • Now switch to user 'guest' (password 'guest').

 

 

 

  • Run php -v to check that PHP is installed correctly.

 

(Figure 1)

 

 

  • Create two programs, mapper.php and reducer.php:

 

 

mapper.php

#!/usr/bin/php
<?php

$word2count = array();

// input comes from STDIN (standard input)
while (($line = fgets(STDIN)) !== false) {
    // remove leading and trailing whitespace and lowercase
    $line = strtolower(trim($line));
    // split the line into words, dropping any empty strings
    $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
    // increase counters; initialize each word on first sight
    // so that PHP does not emit an undefined-index notice
    foreach ($words as $word) {
        if (!isset($word2count[$word])) {
            $word2count[$word] = 0;
        }
        $word2count[$word] += 1;
    }
}

// write the results to STDOUT (standard output)
// what we output here will be the input for the
// Reduce step, i.e. the input for reducer.php
foreach ($word2count as $word => $count) {
    // tab-delimited
    echo $word, chr(9), $count, PHP_EOL;
}

?>

reducer.php

 

#!/usr/bin/php
<?php

$word2count = array();

// input comes from STDIN
while (($line = fgets(STDIN)) !== false) {
    // remove leading and trailing whitespace
    $line = trim($line);
    // parse the input we got from mapper.php; skip lines without a tab
    $parts = explode(chr(9), $line);
    if (count($parts) < 2) {
        continue;
    }
    list($word, $count) = $parts;
    // convert count (currently a string) to int
    $count = intval($count);
    // sum counts, initializing each word on first sight
    if ($count > 0) {
        if (!isset($word2count[$word])) {
            $word2count[$word] = 0;
        }
        $word2count[$word] += $count;
    }
}

// sort the words lexicographically
//
// this step is NOT required, we just do it so that our
// final output will look more like the official Hadoop
// word count examples
ksort($word2count);

// write the results to STDOUT (standard output)
foreach ($word2count as $word => $count) {
    echo $word, chr(9), $count, PHP_EOL;
}

?>
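Before submitting anything to the cluster, it can help to approximate the job locally. The sketch below is only an illustration: streaming_dry_run is a made-up helper name, not part of Hadoop, and it assumes a streaming job behaves roughly like a map | sort | reduce pipeline in which mapper output is sorted before the reducer sees it. It ignores the splitting of work across several tasks, so treat it as a smoke test rather than a faithful simulation.

```shell
# Hypothetical helper, not part of Hadoop: approximate a streaming job locally.
# Usage: streaming_dry_run 'MAPPER COMMAND' 'REDUCER COMMAND' < input.txt
streaming_dry_run() {
    mapper_cmd=$1
    reducer_cmd=$2
    # map:    feed stdin to the mapper command
    # sort:   stand-in for the shuffle phase, which groups equal keys together
    # reduce: feed the sorted lines to the reducer command
    sh -c "$mapper_cmd" | LC_ALL=C sort | sh -c "$reducer_cmd"
}
```

With PHP installed, something like streaming_dry_run 'php /home/guest/mapper.php' 'php /home/guest/reducer.php' < some-local-file.txt (the input file name is a placeholder) should print tab-separated word counts if both scripts work.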

 

 

 

  • Make both scripts executable: chmod +x /home/guest/mapper.php /home/guest/reducer.php
  • Create a directory for the text files that the program will use as input:

 

mkdir /tmp/countfile

 

  • Download two remote text files into that directory to use as input:

 

cd /tmp/countfile

wget http://pge.rastko.net/dirs/2/0/4/1/20417/20417-8.txt

wget http://www.pg-news.org/nl_archives/2009/pgmonthly_2009_01_21.txt

 

  • Copy the files into Hadoop's DFS:

 

cd /home/guest/hadoop/

bin/hadoop dfs -copyFromLocal /tmp/countfile countfile

 

  • Run the PHP programs over these text files:

 

bin/hadoop jar contrib/hadoop-streaming.jar -mapper /home/guest/mapper.php -reducer /home/guest/reducer.php -input countfile/* -output countfile-output


 

  • Inspect the output:

 

bin/hadoop dfs -ls countfile-output

bin/hadoop dfs -cat countfile-output/part-00000
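
One way to sanity-check the job's output is to compute the same counts on the local copies of the input with standard Unix tools. The function below is a sketch under assumptions: wordcount_reference is a made-up name, and it only mirrors the PHP scripts for ASCII text (lowercasing, then splitting on anything that is not a letter, digit or underscore). Non-ASCII bytes may be counted differently.

```shell
# Hypothetical reference word count built from standard tools.
# Reads text on stdin and prints "word<TAB>count" lines sorted by word.
wordcount_reference() {
    LC_ALL=C tr -cs 'A-Za-z0-9_' '\n' |   # one word per line; every other byte is a separator
        LC_ALL=C tr 'A-Z' 'a-z' |          # lowercase, like strtolower() in mapper.php
        sed '/^$/d' |                      # drop the empty line a leading separator leaves behind
        LC_ALL=C sort |                    # group identical words together
        uniq -c |                          # prefix each distinct word with its count
        awk '{ print $2 "\t" $1 }'         # reorder to word<TAB>count
}
```

For example, cat /tmp/countfile/*.txt | wordcount_reference produces lines that can be compared with part-00000. The ordering may differ from what ksort() yields, so it is safer to compare the counts of individual words than to diff the two outputs line by line.
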

References:

http://www.lunchpauze.com/2007/10/writing-hadoop-mapreduce-program-in-php.html

http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v1.0

from:http://www.alisdn.com/wordpress/?tag=hadoop

 
 