Hadoop Study Notes 1: hadoop Command Parameters, Built-in Examples, and Manually Compiling and Running wordcount
[ 2012/1/29 15:25:00 | By: 梦翔儿 ]
 

Test environment: Cloudera Hadoop 0.20.2 (CDH3), pseudo-distributed mode

1. hadoop command parameters

hadoop

Usage: hadoop [--config confdir] COMMAND
where COMMAND is one of:
  namenode -format     format the DFS filesystem
  secondarynamenode    run the DFS secondary namenode
  namenode             run the DFS namenode
  datanode             run a DFS datanode
  dfsadmin             run a DFS admin client
  mradmin              run a Map-Reduce admin client
  fsck                 run a DFS filesystem checking utility
  fs                   run a generic filesystem user client
  balancer             run a cluster balancing utility
  fetchdt              fetch a delegation token from the NameNode
  jobtracker           run the MapReduce job Tracker node
  pipes                run a Pipes job
  tasktracker          run a MapReduce task Tracker node
  job                  manipulate MapReduce jobs
  queue                get information regarding JobQueues
  version              print the version
  jar <jar>            run a jar file
  distcp <srcurl> <desturl> copy file or directories recursively
  archive -archiveName NAME -p <parent path> <src>* <dest> create a hadoop archive
  oiv                  apply the offline fsimage viewer to an fsimage
  classpath            prints the class path needed to get the
                       Hadoop jar and the required libraries
  daemonlog            get/set the log level for each daemon
 or
  CLASSNAME            run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
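
A few of these subcommands in action, as a quick sanity check. This is a minimal illustration assuming the HDFS and MapReduce daemons are already running; the output will differ per installation:

hadoop version                 # print the installed Hadoop version
hadoop fs -ls /                # list the HDFS root directory
hadoop dfsadmin -report        # show HDFS capacity and datanode status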

=========

2. Viewing the examples bundled with Hadoop

cd /usr/lib/hadoop

hadoop jar hadoop-examples-0.20.2-cdh3u0.jar

An example program must be given as the first argument.
Valid program names are:
  aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
  aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
  dbcount: An example job that count the pageview counts from a database.
  grep: A map/reduce program that counts the matches of a regex in the input.
  join: A job that effects a join over sorted, equally partitioned datasets
  multifilewc: A job that counts words from several files.
  pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
  pi: A map/reduce program that estimates Pi using monte-carlo method.
  randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
  randomwriter: A map/reduce program that writes 10GB of random data per node.
  secondarysort: An example defining a secondary sort to the reduce.
  sleep: A job that sleeps at each map and reduce task.
  sort: A map/reduce program that sorts the data written by the random writer.
  sudoku: A sudoku solver.
  teragen: Generate data for the terasort
  terasort: Run the terasort
  teravalidate: Checking results of terasort
  wordcount: A map/reduce program that counts the words in the input files.
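
Each example runs the same way: pass the program name followed by its arguments. As an illustration (not part of the original walkthrough), the pi estimator takes the number of map tasks and the number of samples per map:

hadoop jar hadoop-examples-0.20.2-cdh3u0.jar pi 4 1000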

======

3. Running the wordcount example

hadoop jar hadoop-examples-0.20.2-cdh3u0.jar wordcount
Usage: wordcount <in> <out>

Create an input directory on HDFS, and a local staging directory of the same name:

hadoop fs -mkdir input

sudo mkdir input

Download a sample text file. The URL must be quoted, otherwise the shell treats the & as a background operator and drops the docid parameter:

sudo wget -c 'http://frwebgate.access.gpo.gov/cgi-bin/getdoc.cgi?dbname=2011_record&docid=cr25ja11-87'

sudo mv 'getdoc.cgi?dbname=2011_record&docid=cr25ja11-87' input/test.txt

Upload the file to HDFS:

hadoop fs -copyFromLocal input/test.txt input
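
As a quick check, confirm the file actually landed on HDFS before launching the job:

hadoop fs -ls input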

Run wordcount:

hadoop jar hadoop-examples-0.20.2-cdh3u0.jar wordcount input/test.txt output

View the output:

hadoop fs -ls output

hadoop fs -cat output/part-r-00000
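
Each line of part-r-00000 holds a word, a tab, and its count. For a large output it can help to truncate what you print, for example:

hadoop fs -cat output/part-r-00000 | head -n 20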

========

4. Manually compiling wordcount

The example source directory is here:

/usr/lib/hadoop/src/examples/org/apache/hadoop/examples

Copy the source:

cd /usr/lib/hadoop

sudo mkdir playground
sudo mkdir playground/src
sudo mkdir playground/classes
sudo cp src/examples/org/apache/hadoop/examples/WordCount.java playground/src

Compile and package:

sudo javac -classpath hadoop-0.20.2-cdh3u0-core.jar:lib/commons-cli-1.2.jar -d playground/classes playground/src/WordCount.java

sudo jar -cvf playground/WordCount.jar -C playground/classes/ .
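
To double-check what went into the jar (including the org/apache/hadoop/examples package path that the fully qualified class name below relies on), list its contents:

jar -tf playground/WordCount.jar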

Be explicit about both the commons-cli jar on the classpath and the classes output directory, otherwise compilation fails with errors such as:

class file for org.apache.commons.cli.Options not found
String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
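
If compilation still fails, verify that both jars really exist at the paths given; the file names here match CDH3u0, so adjust them to your version:

cd /usr/lib/hadoop
ls hadoop-0.20.2-cdh3u0-core.jar lib/commons-cli-1.2.jar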

First delete the output directory on HDFS (a job fails if its output path already exists), then run the compiled wordcount. Because WordCount.java declares package org.apache.hadoop.examples, the fully qualified class name must be given:

hadoop dfs -rmr output

hadoop jar playground/WordCount.jar org.apache.hadoop.examples.WordCount input output

View the output:

hadoop fs -ls output

hadoop fs -cat output/part-r-00000

5. Modifying the WordCount source:

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class WordCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{
   
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
     
    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString(), "\t\n\r\f,.: #%&;?![]'");  // default is whitespace-only splitting; also split on this punctuation
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken().toLowerCase());  // fold to lowercase so counting is case-insensitive
        context.write(word, one);
      }
    }
  }
 
  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      if (sum > 4) {  // only emit words that appear more than 4 times
        result.set(sum);
        context.write(key, result);
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    if (otherArgs.length != 2) {
      System.err.println("Usage: wordcount <in> <out>");
      System.exit(2);
    }
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // Note: IntSumReducer is not reusable as a combiner here, because its
    // "> 4" filter would drop partial counts during the combine phase.
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The code above makes three improvements, marked by the inline comments: the original split tokens only on whitespace rather than punctuation, it counted upper- and lower-case forms separately, and it printed every word, even ones that appeared only once or twice.

Recompile and run it following the same steps as before.
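
For convenience, the whole recompile-and-rerun cycle from section 4, condensed (run from /usr/lib/hadoop; the HDFS output directory must be removed first):

sudo javac -classpath hadoop-0.20.2-cdh3u0-core.jar:lib/commons-cli-1.2.jar -d playground/classes playground/src/WordCount.java
sudo jar -cvf playground/WordCount.jar -C playground/classes/ .
hadoop fs -rmr output
hadoop jar playground/WordCount.jar org.apache.hadoop.examples.WordCount input output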

View the output:

hadoop fs -ls output

hadoop fs -cat output/part-r-00000

梦翔儿, notes from hands-on study of Hadoop in Action. First published on 海云在线: http://cloud.dlmu.edu.cn/cloudsite

 
 
  • Tags: hadoop wordcount compile
  •
    Re: Hadoop Study Notes 1: hadoop Command Parameters, Built-in Examples, and Manually Compiling and Running wordcount
    [ 2013/4/9 16:42:28 | By: yps (guest) ]

    yps (guest): Following this method to compile RandomWriter, I ran into the error described at https://issues.apache.org/jira/browse/HADOOP-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
    How can I fix it? I'm still a newbie...

    Reply from 梦翔儿:
    In pseudo-distributed mode? Could you be running out of space?