writing　hadoop　mapreduce　program　in　php--梦飞翔的地方(梦翔天空)

载入中。。。 'S bLog

载入中。。。

writing hadoop mapreduce program in php

[ 2011/6/25 16:30:00 | By: 梦翔儿 ]

最近在作营销效果分析的需求，需要处理大量网页访问日志（千万级PV以上），HADOOP在这方面正好有用武之地。PHP是一个简单实用的语言，我很喜欢。影响我最大的计算机语言，除了LISP，就是PHP了。于是我准备在我的D630本本上玩玩php in hadoop.

先下载VMWARE PLAYER：

The virtual machine image is designed to be used with the free VMware Player.

VMware Player 2.5.1

Latest Version: 2.5.1 | 2008/11/21 | Build: 126130

http://www.vmware.com/download/player/

下载hadoopvm

Download the VMWare image here:

http://code.google.com/intl/zh-CN/edu/parallel/tools/hadoopvm/index.html

启动hadoopvm

去掉原来的源；然后加入源,否则feisty不能升级包；：

vi /etc/apt/sources.list

deb http://ubuntu.cn99.com/ubuntu/ feisty main restricted universe multiverse

deb http://ubuntu.cn99.com/ubuntu/ feisty-security main restricted universe multiverse

deb http://ubuntu.cn99.com/ubuntu/ feisty-updates main restricted universe multiverse

deb http://ubuntu.cn99.com/ubuntu/ feisty-proposed main restricted universe multiverse

deb http://ubuntu.cn99.com/ubuntu/ feisty-backports main restricted universe multiverse

deb-src http://ubuntu.cn99.com/ubuntu/ feisty main restricted universe multiverse

deb-src http://ubuntu.cn99.com/ubuntu/ feisty-security main restricted universe multiverse

deb-src http://ubuntu.cn99.com/ubuntu/ feisty-updates main restricted universe multiverse

deb-src http://ubuntu.cn99.com/ubuntu/ feisty-proposed main restricted universe multiverse

deb-src http://ubuntu.cn99.com/ubuntu/ feisty-backports main restricted universe multiverse

执行apt-get update 更新包；
确认当前用户是ROOT

Login as user ‘root’ (password ‘root’),

run ‘apt-get install php5-cli’ to install PHP5.
Now switch to user ‘guest’ (password ‘guest’).

php -v查看当前PHP是否安装好

(图1)

生成两个程序：mapper.php和reducer.php

#!/usr/bin/php

<?

$word2count = array();

// input comes from STDIN (standard input)

while (($line = fgets(STDIN)) !== false) {

   // remove leading and trailing whitespace and lowercase

   $line = strtolower(trim($line));

   // split the line into words while removing any empty string

   $words = preg_split(’/\W/’, $line, 0, PREG_SPLIT_NO_EMPTY);

   // increase counters

   foreach ($words as $word) {

   $word2count[$word] += 1;

   }

}

// write the results to STDOUT (standard output)

// what we output here will be the input for the

// Reduce step, i.e. the input for reducer.py

foreach ($word2count as $word => $count) {

   // tab-delimited

   echo $word, chr(9), $count, PHP_EOL;

}

?>

reducer.php

#!/usr/bin/php

<?

$word2count = array();

// input comes from STDIN

while (($line = fgets(STDIN)) !== false) {

   // remove leading and trailing whitespace

   $line = trim($line);

   // parse the input we got from mapper.php

   list($word, $count) = explode(chr(9), $line);

   // convert count (currently a string) to int

   $count = intval($count);

   // sum counts

   if ($count > 0) $word2count[$word] += $count;

}

// sort the words lexigraphically

//

// this set is NOT required, we just do it so that our

// final output will look more like the official Hadoop

// word count examples

ksort($word2count);

// write the results to STDOUT (standard output)

foreach ($word2count as $word => $count) {

   echo $word, chr(9), $count, PHP_EOL;

}

?>

chmod +x /home/guest/mapper.php /home/guest/reducer.php
准备程序所需要的作为输入的文本文件的目录，

mkdir /tmp/countfile

下载远程的两个TXT文本作为输入

wget http://pge.rastko.net/dirs/2/0/4/1/20417/20417-8.txt

wget http://www.pg-news.org/nl_archives/2009/pgmonthly_2009_01_21.txt

把文件放入HADOOP 的DFS中

cd /home/guest/hadoop/

bin/hadoop dfs -copyFromLocal /tmp/countfile countfile

执行php程序处理这些文本

bin/hadoop jar contrib/hadoop-streaming.jar -mapper /home/guest/mapper.php -reducer /home/guest/reducer.php -input countfile/* -output countfile-output

查看输出的结果

bin/hadoop dfs -ls countfile-output

bin/hadoop dfs -cat countfile-output/part-00000

参考：

http://www.lunchpauze.com/2007/10/writing-hadoop-mapreduce-program-in-php.html

http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v1.0

from:http://www.alisdn.com/wordpress/?tag=hadoop

阅读全文 | 回复(0) | 引用通告 | 编辑

标签：hadoop php

上一篇：自工作以来,所指导和合作指导的本科毕业论文题目列表
下一篇：linux中查看系统资源占用情况的命令

发表评论：

梦翔儿网站梦飞翔的地方 http://www.dreamflier.net
中华人民共和国信息产业部TCP/IP系统备案序号：辽ICP备09000550号