淘先锋技术网


Replica Placement

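The most direct way to see the policy is to read the Hadoop source code. This article uses hadoop-2.9.2 as the example.
File location:
hadoop-2.9.2-src\hadoop-hdfs-project\hadoop-hdfs\src\main\java\org\apache\hadoop\hdfs\server\blockmanagement
Implementation class:
BlockPlacementPolicyDefault.java
The class-level Javadoc of this implementation describes the HDFS replica placement policy in detail: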

/**
 * The class is responsible for choosing the desired number of targets
 * for placing block replicas.
 * The replica placement strategy is that if the writer is on a datanode,
 * the 1st replica is placed on the local machine, 
 * otherwise a random datanode. The 2nd replica is placed on a datanode
 * that is on a different rack. The 3rd replica is placed on a datanode
 * which is on a different node of the rack as the second replica.
 */

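From this we can conclude (taking a replication factor of 3 as the example):

  • The first replica is placed on the DataNode from which the file is being written. If the write is submitted from outside the cluster, a DataNode is chosen at random.
  • The second replica is placed on a node in a different rack from the first replica's DataNode.
  • The third replica is placed on a different node in the same rack as the second replica's DataNode.
  • Additional replicas: placed on randomly chosen nodes, while (per the official documentation) keeping the number of replicas per rack below an upper limit.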
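
The four rules above can be sketched in a few lines of Java. This is a minimal illustration, not the code in BlockPlacementPolicyDefault.java: the class ReplicaPlacementSketch, the Node type, and the chooseTargets and pick methods are all hypothetical names invented here, and the sketch assumes every node is healthy and has free space. When a rule cannot be satisfied (for example, in a single-rack cluster), it simply falls back to a random unused node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

/** Simplified, hypothetical model of the 3-replica placement rule (not Hadoop's code). */
final class ReplicaPlacementSketch {

    /** A DataNode identified by a name and the rack it sits on (illustrative type). */
    static final class Node {
        final String name;
        final String rack;
        Node(String name, String rack) { this.name = name; this.rack = rack; }
        @Override public String toString() { return name + "@" + rack; }
    }

    /**
     * Chooses up to {@code replicas} distinct target nodes.
     * {@code writer} is null when the client runs outside the cluster.
     */
    static List<Node> chooseTargets(List<Node> cluster, Node writer, int replicas, Random rnd) {
        List<Node> chosen = new ArrayList<>();
        if (replicas <= 0 || cluster.isEmpty()) {
            return chosen;
        }
        // 1st replica: the writer's own node if it is a DataNode, otherwise a random node.
        Node first = (writer != null && cluster.contains(writer))
                ? writer
                : cluster.get(rnd.nextInt(cluster.size()));
        chosen.add(first);

        // 2nd replica: any node on a different rack from the first replica.
        if (replicas > 1) {
            Node second = pick(cluster, chosen, n -> !n.rack.equals(first.rack), rnd);
            if (second != null) {
                chosen.add(second);
            }
        }
        // 3rd replica: a different node on the same rack as the second replica.
        if (replicas > 2 && chosen.size() == 2) {
            Node second = chosen.get(1);
            Node third = pick(cluster, chosen, n -> n.rack.equals(second.rack), rnd);
            if (third != null) {
                chosen.add(third);
            }
        }
        // Remaining replicas (and any rule that could not be met): random unused nodes.
        while (chosen.size() < replicas) {
            Node extra = pick(cluster, chosen, n -> true, rnd);
            if (extra == null) {
                break; // fewer nodes than requested replicas
            }
            chosen.add(extra);
        }
        return chosen;
    }

    /** Returns a random unused node that satisfies the predicate, or null if none exists. */
    private static Node pick(List<Node> cluster, List<Node> used, Predicate<Node> ok, Random rnd) {
        List<Node> candidates = new ArrayList<>();
        for (Node n : cluster) {
            if (!used.contains(n) && ok.test(n)) {
                candidates.add(n);
            }
        }
        return candidates.isEmpty() ? null : candidates.get(rnd.nextInt(candidates.size()));
    }

    public static void main(String[] args) {
        List<Node> cluster = new ArrayList<>();
        cluster.add(new Node("dn1", "rackA"));
        cluster.add(new Node("dn2", "rackA"));
        cluster.add(new Node("dn3", "rackB"));
        cluster.add(new Node("dn4", "rackB"));
        // The writer runs on dn1, so dn1 receives the first replica.
        System.out.println(chooseTargets(cluster, cluster.get(0), 3, new Random()));
    }
}
```

With the writer on dn1, the first target is always dn1 and the other two land on rackB. The sketch leaves out the node-suitability checks and the per-rack limit that a production policy has to apply.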

Replica Selection

From the official documentation:
To minimize global bandwidth consumption and read latency, HDFS tries to satisfy a read request from a replica that is closest to the reader. If there exists a replica on the same rack as the reader node, then that replica is preferred to satisfy the read request. If HDFS cluster spans multiple data centers, then a replica that is resident in the local data center is preferred over any remote replica.

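In plain terms:
HDFS serves each read from the replica closest to the reader, which keeps both cluster-wide bandwidth use and read latency low. A replica on the reader's own rack is preferred. When the cluster spans multiple data centers, a replica in the local data center is preferred over any remote one.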
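
The preference order described above (same rack first, then local data center, then remote) can be sketched as a "pick the minimum distance" rule. This is a hedged illustration only: ReplicaSelectionSketch, Location, distance, and closest are hypothetical names, and the numeric ranks 0 to 3 are arbitrary values chosen here to express the ordering, not the metric Hadoop computes internally.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

/** Simplified, hypothetical model of "read from the closest replica" (not Hadoop's code). */
final class ReplicaSelectionSketch {

    /** Where a reader or a replica lives (illustrative type). */
    static final class Location {
        final String dataCenter;
        final String rack;
        final String host;
        Location(String dataCenter, String rack, String host) {
            this.dataCenter = dataCenter;
            this.rack = rack;
            this.host = host;
        }
        @Override public String toString() { return dataCenter + "/" + rack + "/" + host; }
    }

    /** Smaller means closer. The ranks only encode an ordering and are assumptions. */
    static int distance(Location reader, Location replica) {
        if (!reader.dataCenter.equals(replica.dataCenter)) {
            return 3; // remote data center: least preferred
        }
        if (!reader.rack.equals(replica.rack)) {
            return 2; // local data center, different rack
        }
        if (!reader.host.equals(replica.host)) {
            return 1; // same rack, different host
        }
        return 0;     // replica is on the reader's own host
    }

    /** Returns the replica with the smallest distance to the reader, if any exist. */
    static Optional<Location> closest(Location reader, List<Location> replicas) {
        return replicas.stream()
                .min(Comparator.comparingInt((Location r) -> distance(reader, r)));
    }

    public static void main(String[] args) {
        Location reader = new Location("dc1", "r1", "h1");
        List<Location> replicas = new ArrayList<>();
        replicas.add(new Location("dc2", "r9", "h9")); // remote data center
        replicas.add(new Location("dc1", "r2", "h5")); // local data center, other rack
        replicas.add(new Location("dc1", "r1", "h2")); // same rack as the reader
        System.out.println(closest(reader, replicas).orElse(null));
    }
}
```

In the usage example the same-rack replica on h2 wins. If it were unavailable, the replica in the local data center (h5) would be chosen before the remote one (h9).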

References

Official Hadoop documentation: http://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html#Replica_Placement:_The_First_Baby_Steps