fileinputformat: What is the Combiner for in Hadoop?

fileinputformat  Date: 2021-06-08

How to use Hadoop's Partitioner

Hadoop's MapReduce programming model is very flexible: most stages of the pipeline expose an API that can be overridden to fit special requirements.

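The partition function, Partitioner, which I want to cover today, is one of those stages. Its job is to decide which reduce task receives each map output record. The default implementation hashes the map output key so that the data is spread roughly evenly across the reduce tasks for further processing, which avoids hot spots.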
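To make the hashing step concrete, here is a minimal plain-Java sketch with no Hadoop dependency. The class name HashPartitionSketch and the helper partitionFor are hypothetical names chosen for this illustration; the formula is the usual "clear the sign bit, then take the remainder" rule, and the sample keys are made up.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not part of Hadoop.
class HashPartitionSketch {

    // Clear the sign bit so the remainder always lies in [0, numReduceTasks).
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 3;
        // Integer.hashCode() returns the int value itself, so -7 hashes to -7.
        List<Object> keys = Arrays.<Object>asList("apple", "banana", Integer.valueOf(-7));
        for (Object key : keys) {
            System.out.println(key + " -> partition " + partitionFor(key, numReduceTasks));
        }
        // Without the mask, a negative hash gives a negative remainder in Java,
        // which would not be a valid partition index.
        System.out.println("unmasked: " + (Integer.valueOf(-7).hashCode() % numReduceTasks));
    }
}
```

The point of the mask is only to keep the index valid; how evenly the keys spread still depends on the quality of each key's hashCode().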

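Hadoop's default partition function is HashPartitioner. Its source is as follows:

    /** Partition keys by their {@link Object#hashCode()}. */
    public class HashPartitioner<K, V> extends Partitioner<K, V> {

      /** Use {@link Object#hashCode()} to partition. */
      public int getPartition(K key, V value, int numReduceTasks) {
        // AND the key's hash with Integer.MAX_VALUE to clear the sign bit,
        // so the result of the modulo is never negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }

    }

In most cases the default partition function is all we need, but sometimes a special requirement calls for a custom Partitioner. Here is an example: partition the following data by the length of the string before the semicolon, so that strings of length 1 go to one partition, strings of length 2 to another, and strings of length 3 to a third.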

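    河南省;1
    河南;2
    中国;3
    中国人;4
    大;1
    小;3
    中;11

The default partition function cannot do this, so we have to write our own Partitioner. First, we need 3 output partitions, so the number of reduce tasks must be set to 3. Second, the partitioner has to choose the partition from the length of the key, not from the hash code of the string.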
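The routing rule can be checked in isolation before any Hadoop code is involved. The following plain-Java sketch is only an illustration: the class LengthPartitionSketch and its methods are hypothetical names, and it assumes the length is mapped to a partition with length % numReduceTasks, so that with 3 reduce tasks a key of length 3 wraps around to partition 0.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not part of Hadoop.
class LengthPartitionSketch {

    // The partition depends only on the key length.
    // Keys whose length is not 1, 2 or 3 fall back to partition 0.
    static int partitionFor(String key, int numReduceTasks) {
        int len = key.length();
        if (len >= 1 && len <= 3) {
            return len % numReduceTasks;
        }
        return 0;
    }

    // Group the keys of "key;value" records by the partition they would be sent to.
    static Map<Integer, List<String>> route(String[] records, int numReduceTasks) {
        Map<Integer, List<String>> byPartition = new TreeMap<>();
        for (String record : records) {
            String key = record.split(";")[0];
            int partition = partitionFor(key, numReduceTasks);
            byPartition.computeIfAbsent(partition, k -> new ArrayList<>()).add(key);
        }
        return byPartition;
    }

    public static void main(String[] args) {
        String[] records = {"河南省;1", "河南;2", "中国;3", "中国人;4", "大;1", "小;3", "中;11"};
        route(records, 3).forEach((partition, keys) ->
                System.out.println("partition " + partition + ": " + keys));
    }
}
```

Under this assumption and with 3 reduce tasks, keys of length 1 land in partition 1, keys of length 2 in partition 2, and keys of length 3 in partition 0. With fewer than 3 reduce tasks the remainders collide, which is why the reduce count has to match the number of partitions.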

The core code is as follows:

    /**
     * Partitioner
     */
    public static class PPartition extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text arg0, Text arg1, int arg2) {
            /**
             * Custom partitioning: strings of different lengths are sent
             * to different reduce tasks.
             *
             * There are only 3 distinct lengths here, so the number of
             * reduce tasks can be set to 3. Use as many reduce tasks as
             * there are partitions.
             */
            String key = arg0.toString();
            if (key.length() == 1) {
                return 1 % arg2;
            } else if (key.length() == 2) {
                return 2 % arg2;
            } else if (key.length() == 3) {
                // With 3 reduce tasks, 3 % 3 == 0, so length-3 keys go to partition 0
                return 3 % arg2;
            }
            // Any other length also falls back to partition 0
            return 0;
        }
    }

The full code is as follows:

    // The package declaration is truncated in the source; only ".partition.test;" survives.

    import java.io.IOException;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    /**
     * @author qindongliang
     *
     * Big data discussion group: 376932160
     */
    public class MyTestPartition {

        /**
         * Map task
         */
        public static class PMapper extends Mapper<LongWritable, Text, Text, Text> {
            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Each input line has the form "string;number"
                String[] ss = value.toString().split(";");
                context.write(new Text(ss[0]), new Text(ss[1]));
            }
        }

        /**
         * Partitioner
         */
        public static class PPartition extends Partitioner<Text, Text> {
            @Override
            public int getPartition(Text arg0, Text arg1, int arg2) {
                // Custom partitioning: strings of different lengths are sent
                // to different reduce tasks. There are only 3 distinct lengths,
                // so the number of reduce tasks is set to 3.
                String key = arg0.toString();
                if (key.length() == 1) {
                    return 1 % arg2;
                } else if (key.length() == 2) {
                    return 2 % arg2;
                } else if (key.length() == 3) {
                    return 3 % arg2;
                }
                return 0;
            }
        }

        /**
         * Reduce task
         */
        public static class PReduce extends Reducer<Text, Text, Text, Text> {
            @Override
            protected void reduce(Text arg0, Iterable<Text> arg1, Context arg2)
                    throws IOException, InterruptedException {
                String key = arg0.toString().split(",")[0];
                System.out.println("key==> " + key);
                // Write every value through unchanged
                for (Text t : arg1) {
                    arg2.write(arg0, t);
                }
            }
        }

        public static void main(String[] args) throws Exception {
            // Use this class rather than the unrelated ReadMapDB class of the original
            JobConf conf = new JobConf(MyTestPartition.class);
            conf.set("mapred.job.tracker", "192.168.75.130:9001");
            // Note: this line must come first, for initialization;
            // otherwise an error is reported
            conf.setJar("tt.jar");

            /** Job setup **/
            Job job = new Job(conf, "testpartion");
            job.setJarByClass(MyTestPartition.class);
            System.out.println("Mode: " + conf.get("mapred.job.tracker"));
            job.setPartitionerClass(PPartition.class);
            job.setNumReduceTasks(3); // set to 3, one reduce task per partition
            job.setMapperClass(PMapper.class);
            job.setReducerClass(PReduce.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            String path = "hdfs://192.168.75.130:9000/root/outputdb";
            FileSystem fs = FileSystem.get(conf);
            Path p = new Path(path);
            if (fs.exists(p)) {
                fs.delete(p, true);
                System.out.println("The output path already existed and has been deleted!");
            }
            FileInputFormat.setInputPaths(job, "hdfs://192.168.75.130:9000/root/input");
            FileOutputFormat.setOutputPath(job, p);
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

How to debug Hadoop jobs with Eclipse

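Import the relevant JARs from the Hadoop development package into your project. To debug, look at the Hadoop counter values that are reported back to Eclipse. One caveat: if what you are debugging is a MapReduce job, it may not run quickly.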

What is the Combiner for in Hadoop?

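The Combiner is often described as a local Reduce: it runs on the map side, and when one is configured, its output is what the Reduce stage finally receives as input.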

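A Combiner is defined as a Reducer. In most cases the Combiner and the reduce step apply the same logic, so the reducer class you already wrote can be passed directly to job.setCombinerClass(). Hadoop may run the Combiner zero, one, or several times on a map task's output, so this is only safe when the reduce logic gives the same result on partial groups, for example sums or maxima.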

You can of course also define a separate Combiner that differs from the reduce step: extend Reducer and write it in essentially the same way as a reducer.
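The "local Reduce" idea can be sketched without Hadoop. The following plain-Java example is a hypothetical illustration, not Hadoop's implementation: CombinerSketch, pair, sumByKey and the sample word counts are all made up for this sketch. It sums counts per key once directly and once with a combining pass on each simulated map output, and compares how many records would be shuffled.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration; not part of Hadoop.
class CombinerSketch {

    static Map.Entry<String, Integer> pair(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Sum the values per key. The same function serves as combiner and as reducer.
    static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return sums;
    }

    // Collect the records sent to the reducer, optionally combining each map output first.
    static List<Map.Entry<String, Integer>> shuffle(
            List<List<Map.Entry<String, Integer>>> mapOutputs, boolean useCombiner) {
        List<Map.Entry<String, Integer>> shuffled = new ArrayList<>();
        for (List<Map.Entry<String, Integer>> output : mapOutputs) {
            if (useCombiner) {
                shuffled.addAll(sumByKey(output).entrySet()); // local reduce
            } else {
                shuffled.addAll(output);
            }
        }
        return shuffled;
    }

    public static void main(String[] args) {
        List<List<Map.Entry<String, Integer>>> mapOutputs = Arrays.asList(
                Arrays.asList(pair("a", 1), pair("b", 1), pair("a", 1)),
                Arrays.asList(pair("a", 1), pair("c", 1)));
        List<Map.Entry<String, Integer>> plain = shuffle(mapOutputs, false);
        List<Map.Entry<String, Integer>> combined = shuffle(mapOutputs, true);
        System.out.println("records shuffled without combiner: " + plain.size());
        System.out.println("records shuffled with combiner: " + combined.size());
        System.out.println("same final result: " + sumByKey(plain).equals(sumByKey(combined)));
    }
}
```

Reusing one function for both roles works here because addition is commutative and associative. A reduce step that computes, say, an average could not be reused as a combiner in this way.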
