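I have recently been working on a marketing-effectiveness analysis that has to process a large volume of web access logs (tens of millions of page views and up), which is exactly the kind of job Hadoop is good at. PHP is a simple, practical language that I like a lot; apart from LISP, it is the programming language that has influenced me most. So I decided to try PHP in Hadoop on my D630 laptop.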
The virtual machine image is designed to be used with the free VMware Player.
VMware Player 2.5.1
Latest Version: 2.5.1 | 2008/11/21 | Build: 126130
http://www.vmware.com/download/player/
Download the VMware image here:
http://code.google.com/intl/zh-CN/edu/parallel/tools/hadoopvm/index.html
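Start the hadoopvm virtual machine.
- Remove the original package sources and add the ones below; otherwise packages cannot be upgraded on feisty: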
vi /etc/apt/sources.list
deb http://ubuntu.cn99.com/ubuntu/ feisty main restricted universe multiverse
deb http://ubuntu.cn99.com/ubuntu/ feisty-security main restricted universe multiverse
deb http://ubuntu.cn99.com/ubuntu/ feisty-updates main restricted universe multiverse
deb http://ubuntu.cn99.com/ubuntu/ feisty-proposed main restricted universe multiverse
deb http://ubuntu.cn99.com/ubuntu/ feisty-backports main restricted universe multiverse
deb-src http://ubuntu.cn99.com/ubuntu/ feisty main restricted universe multiverse
deb-src http://ubuntu.cn99.com/ubuntu/ feisty-security main restricted universe multiverse
deb-src http://ubuntu.cn99.com/ubuntu/ feisty-updates main restricted universe multiverse
deb-src http://ubuntu.cn99.com/ubuntu/ feisty-proposed main restricted universe multiverse
deb-src http://ubuntu.cn99.com/ubuntu/ feisty-backports main restricted universe multiverse
- Run apt-get update to refresh the package lists.
- Make sure the current user is root: log in as user ‘root’ (password ‘root’).
- Run ‘apt-get install php5-cli’ to install PHP5.
- Now switch to user ‘guest’ (password ‘guest’).
(Figure 1)

- Create two programs: mapper.php and reducer.php
mapper.php
#!/usr/bin/php
<?php
$word2count = array();
// input comes from STDIN (standard input)
while (($line = fgets(STDIN)) !== false) {
    // remove leading and trailing whitespace and lowercase
    $line = strtolower(trim($line));
    // split the line into words while removing any empty string
    $words = preg_split('/\W/', $line, 0, PREG_SPLIT_NO_EMPTY);
    // increase counters, initializing each word the first time it is seen
    foreach ($words as $word) {
        if (!isset($word2count[$word])) {
            $word2count[$word] = 0;
        }
        $word2count[$word] += 1;
    }
}
// write the results to STDOUT (standard output)
// what we output here will be the input for the
// Reduce step, i.e. the input for reducer.php
foreach ($word2count as $word => $count) {
    // tab-delimited
    echo $word, chr(9), $count, PHP_EOL;
}
?>
reducer.php
#!/usr/bin/php
<?php
$word2count = array();
// input comes from STDIN
while (($line = fgets(STDIN)) !== false) {
    // remove leading and trailing whitespace
    $line = trim($line);
    // parse the input we got from mapper.php; skip lines without a tab
    $parts = explode(chr(9), $line);
    if (count($parts) < 2) {
        continue;
    }
    list($word, $count) = $parts;
    // convert count (currently a string) to int
    $count = intval($count);
    // sum counts, initializing each word the first time it is seen
    if ($count > 0) {
        if (!isset($word2count[$word])) {
            $word2count[$word] = 0;
        }
        $word2count[$word] += $count;
    }
}
// sort the words lexicographically
//
// this step is NOT required, we just do it so that our
// final output will look more like the official Hadoop
// word count examples
ksort($word2count);
// write the results to STDOUT (standard output)
foreach ($word2count as $word => $count) {
    echo $word, chr(9), $count, PHP_EOL;
}
?>
- chmod +x /home/guest/mapper.php /home/guest/reducer.php
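Hadoop streaming feeds input lines to the mapper on standard input, sorts the mapper's output by key, and feeds the sorted lines to the reducer. That flow can be imitated locally with an ordinary pipeline before a job is submitted. The sketch below is only an illustration under stated assumptions: map_words and reduce_counts are hypothetical shell stand-ins for mapper.php and reducer.php (not part of the original scripts), and map_words emits one "word<TAB>1" line per occurrence instead of pre-aggregating counts the way mapper.php does.

```shell
#!/bin/sh
# Hypothetical stand-in for mapper.php: lowercase the input, split it on
# non-word characters, and emit "word<TAB>1" for every word found.
map_words() {
    LC_ALL=C tr '[:upper:]' '[:lower:]' \
        | LC_ALL=C tr -cs 'a-z0-9_' '\n' \
        | awk 'NF { print $0 "\t1" }'
}

# Hypothetical stand-in for reducer.php: sum the counts per word and
# print the totals sorted by word.
reduce_counts() {
    awk -F '\t' '$2 > 0 { sum[$1] += $2 }
                 END { for (w in sum) print w "\t" sum[w] }' \
        | LC_ALL=C sort
}

# map | sort | reduce, the same shape as a streaming job
printf 'Foo bar foo\nbar baz\n' | map_words | LC_ALL=C sort | reduce_counts
```

With php5-cli installed, the same smoke test should work against the real scripts by replacing the two functions in the pipeline with /home/guest/mapper.php and /home/guest/reducer.php.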
- Prepare a directory of text files to use as the program's input:
mkdir /tmp/countfile
cd /tmp/countfile
wget http://pge.rastko.net/dirs/2/0/4/1/20417/20417-8.txt
wget http://www.pg-news.org/nl_archives/2009/pgmonthly_2009_01_21.txt
- Copy the input into HDFS, run the streaming job, and inspect the output:
cd /home/guest/hadoop/
bin/hadoop dfs -copyFromLocal /tmp/countfile countfile
bin/hadoop jar contrib/hadoop-streaming.jar -mapper /home/guest/mapper.php -reducer /home/guest/reducer.php -input countfile/* -output countfile-output
bin/hadoop dfs -ls countfile-output
bin/hadoop dfs -cat countfile-output/part-00000
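Because the job output is one tab-delimited "word<TAB>count" pair per line, the most frequent words can be pulled out with standard tools. A minimal sketch, assuming that output format; top_words is a hypothetical helper name, not something shipped with Hadoop:

```shell
#!/bin/sh
# Hypothetical helper: read "word<TAB>count" lines on stdin and print the
# N most frequent words (default 10), highest count first, ties by word.
top_words() {
    n=${1:-10}
    LC_ALL=C sort -t "$(printf '\t')" -k2,2nr -k1,1 | head -n "$n"
}

# Small demonstration on inline data
printf 'a\t1\nb\t5\nc\t3\n' | top_words 2
```

On the VM it could be used as: bin/hadoop dfs -cat countfile-output/part-00000 | top_words 10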

References:
http://www.lunchpauze.com/2007/10/writing-hadoop-mapreduce-program-in-php.html
http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#Example%3A+WordCount+v1.0
from:http://www.alisdn.com/wordpress/?tag=hadoop