Hadoop Study Notes 1: hadoop command options, bundled examples, and manually compiling and running wordcount -- 梦飞翔的地方 (梦翔天空)

[ 2012/1/29 15:25:00 | By: 梦翔儿 ]

Test environment: Cloudera Hadoop 0.20.2 (CDH3), pseudo-distributed mode

1. hadoop command options

hadoop

Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
namenode -format     format the DFS filesystem
secondarynamenode    run the DFS secondary namenode
namenode             run the DFS namenode
datanode             run a DFS datanode
dfsadmin             run a DFS admin client
mradmin              run a Map-Reduce admin client
fsck                 run a DFS filesystem checking utility
fs                   run a generic filesystem user client
balancer             run a cluster balancing utility
fetchdt              fetch a delegation token from the NameNode
jobtracker           run the MapReduce job Tracker node
pipes                run a Pipes job
tasktracker          run a MapReduce task Tracker node
job                  manipulate MapReduce jobs
queue                get information regarding JobQueues
version              print the version
jar <jar>            run a jar file
distcp <srcurl> <desturl> copy file or directories recursively
archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
oiv                  apply the offline fsimage viewer to an fsimage
classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
daemonlog            get/set the log level for each daemon
or
CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

=========

2. Listing the examples bundled with hadoop

cd /usr/lib/hadoop

hadoop jar hadoop-examples-0.20.2-cdh3u0.jar

An example program must be given as the first argument.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
dbcount: An example job that count the pageview counts from a database.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using monte-carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sleep: A job that sleeps at each map and reduce task.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.

======

3. Running the wordcount example

hadoop jar hadoop-examples-0.20.2-cdh3u0.jar wordcount
Usage: wordcount <in> <out>

hadoop fs -mkdir input

sudo mkdir input

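The URL has to be quoted; otherwise the shell treats the & as a background operator and cuts the URL short. The -O option saves the download directly as input/test.txt.

sudo wget -O input/test.txt "http://frwebgate.access.gpo.gov/cgi-bin/getdoc.cgi?dbname=2011_record&docid=cr25ja11-87"

Upload it to HDFS: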

hadoop fs -copyFromLocal input/test.txt input

Run wordcount:

hadoop jar hadoop-examples-0.20.2-cdh3u0.jar wordcount input/test.txt output

View the output:

hadoop fs -ls output

hadoop fs -cat output/part-r-00000
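
For orientation, here is a minimal plain-Java sketch of what the wordcount job computes, with no Hadoop involved. The class name LocalWordCount and the sample sentence are made up for illustration. The sketch assumes only that the stock example splits on whitespace with StringTokenizer and that each line of part-r-00000 holds a word, a tab, and its count, sorted by word.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical stand-alone class; it only mimics the job's result, not Hadoop itself.
class LocalWordCount {

    // Counts whitespace-separated tokens. The TreeMap keeps the words sorted,
    // roughly like the key-sorted reducer output.
    static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        StringTokenizer itr = new StringTokenizer(text); // default delimiters: " \t\n\r\f"
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Print one "word<TAB>count" line per distinct token.
        for (Map.Entry<String, Integer> e : count("the cat and The cat.").entrySet()) {
            System.out.println(e.getKey() + "\t" + e.getValue());
        }
    }
}
```

In this sketch "The" and "the", and likewise "cat" and "cat.", are counted as different words. That is the behavior section 5 changes.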

========

4. Compiling wordcount manually

The example sources are in this directory:

/usr/lib/hadoop/src/examples/org/apache/hadoop/examples

Copy the source:

cd /usr/lib/hadoop

sudo mkdir playground
sudo mkdir playground/src
sudo mkdir playground/classes
sudo cp src/examples/org/apache/hadoop/examples/WordCount.java playground/src

Compile:

sudo javac -classpath hadoop-0.20.2-cdh3u0-core.jar:lib/commons-cli-1.2.jar -d playground/classes playground/src/WordCount.java

sudo jar -cvf playground/WordCount.jar -C playground/classes/ .

Be sure to spell out the commons-cli jar on the classpath and the classes directory; otherwise you get the following error:

class file for org.apache.commons.cli.Options not found
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

First delete the output directory on HDFS, then run the compiled wordcount program:

hadoop dfs -rmr output

hadoop jar playground/WordCount.jar org.apache.hadoop.examples.WordCount input output

View the output:

hadoop fs -ls output

hadoop fs -cat output/part-r-00000

5. Modifying the WordCount source:

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
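      StringTokenizer itr = new StringTokenizer(value.toString(),"\t\n\r\f,.: #%&;?![]'");  // the default splits on whitespace only; here these punctuation characters are delimiters too
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken().toLowerCase()); // convert to lowercase before counting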
        context.write(word, one);
      }
    }
}

public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      if (sum > 4) { // only emit words that occur more than 4 times
        result.set(sum);
        context.write(key, result);
      }
    }
}

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // No combiner here: IntSumReducer filters on the total count, and as a
    // combiner it would discard partial sums of 4 or less before the final reduce.
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
}
}

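The code above makes three improvements over the original program, which split tokens only on whitespace rather than on punctuation, was case-sensitive, and also listed words that occur only once or twice. The changes are marked by the comments in the code.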
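
The effect of the first two changes can be tried outside Hadoop. The following is a minimal sketch that assumes only the JDK. The class name TokenizeDemo and the sample sentence are invented for illustration, while the delimiter string is copied from the mapper above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper class for trying the tokenizer settings outside Hadoop.
class TokenizeDemo {

    // Same delimiter set as the modified mapper: whitespace plus punctuation.
    static final String DELIMS = "\t\n\r\f,.: #%&;?![]'";

    // Splits a line on DELIMS and lowercases every token.
    static List<String> tokens(String line) {
        List<String> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line, DELIMS);
        while (itr.hasMoreTokens()) {
            out.add(itr.nextToken().toLowerCase());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("Hello, World! Isn't it [great]?"));
    }
}
```

Running it prints [hello, world, isn, t, it, great]. The apostrophe in the delimiter set splits "Isn't" into "isn" and "t", and characters that are not listed, such as double quotes, parentheses, and hyphens, stay attached to their tokens.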

After recompiling and rerunning as described above, view the output:

hadoop fs -ls output

hadoop fs -cat output/part-r-00000
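
If the filtered output seems to be missing words, one thing to check is whether a threshold such as sum > 4 is also being applied by a combiner. A combiner sees only the partial counts of one map task, and Hadoop does not guarantee how often it runs, so filtering there can discard counts that would have passed the threshold once added up. The sketch below simulates this in plain Java. The class CombinerThresholdDemo and its data are hypothetical, and it models the data flow only, not Hadoop's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical simulation of "combine per map task, then reduce"; no Hadoop classes involved.
class CombinerThresholdDemo {

    static final int THRESHOLD = 4; // same cutoff as "sum > 4" in the reducer

    static int sum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    // Returns the count that reaches the output, or -1 if the word is dropped.
    static int run(List<List<Integer>> mapOutputs, boolean filterInCombiner) {
        List<Integer> partials = new ArrayList<>();
        for (List<Integer> values : mapOutputs) {
            int partial = sum(values); // combiner step for one map task
            if (!filterInCombiner || partial > THRESHOLD) {
                partials.add(partial);
            }
        }
        int total = sum(partials); // final reduce step
        return total > THRESHOLD ? total : -1;
    }

    public static void main(String[] args) {
        // One word, emitted three times by each of two map tasks (6 occurrences).
        List<List<Integer>> mapOutputs = Arrays.asList(
                Arrays.asList(1, 1, 1),
                Arrays.asList(1, 1, 1));
        System.out.println("filter in combiner: " + run(mapOutputs, true));
        System.out.println("sum-only combiner:  " + run(mapOutputs, false));
    }
}
```

With the filter in the combiner the word is lost (-1), although it occurs 6 times. With a combiner that only sums, the final reduce sees 6 and keeps it.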

梦翔儿; a hands-on exercise based on Hadoop in Action. First published on 海云在线: http://cloud.dlmu.edu.cn/cloudsite


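Re: Hadoop Study Notes 1: hadoop command options, bundled examples, and manually compiling and running wordcount

[ 2013/4/9 16:42:28 | By: yps (guest) ]

After compiling RandomWriter with this method, I get the error described at https://issues.apache.org/jira/browse/HADOOP-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
How can I fix it? I'm a complete beginner...

Reply from 梦翔儿:
Are you running in pseudo-distributed mode? Could you be running out of disk space?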
