
This section walks through installing and configuring Hadoop 3.2.2, enabling LZO compression, and finishes with a few simple Hadoop tests.
Code download link

1. Cluster Plan

         node01               node02                         node03
HDFS     NameNode, DataNode   DataNode, Secondary NameNode   DataNode

2. Download Hadoop 3.2.2 (node01/jack)

1. Download Hadoop 3.2.2

[jack@node01 u02]$ wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz

2. Extract the archive into /u01 (this creates /u01/hadoop-3.2.2)

[jack@node01 u02]$ tar -zxf hadoop-3.2.2.tar.gz -C /u01

3. Create the log and data directories under /u01/hadoop-3.2.2

[jack@node01 u01]$ cd hadoop-3.2.2
[jack@node01 hadoop-3.2.2]$ mkdir logs hadoop_data hadoop_data/tmp hadoop_data/namenode hadoop_data/datanode secret
[jack@node01 hadoop-3.2.2]$ cd /u01/hadoop-3.2.2/secret
[jack@node01 secret]$ vi hadoop-http-auth-signature-secret
Enter the access user name, e.g. jack

This setup uses the simple (pseudo-secure) authentication mode, which requires an access user to be configured; see core-site.xml for the details. If you need stronger authentication, use Kerberos instead.
Append ?user.name=jack to the Hadoop web URLs,
for example: http://node01:8088/cluster?user.name=jack
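
Equivalently, the secret file can be created non-interactively; its content is arbitrary, since it is only used to sign the HTTP authentication cookie (see the description in core-site.xml below):

[jack@node01 secret]$ echo "jack" > hadoop-http-auth-signature-secret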

3. Configure Environment Variables (all/jack)

Note: configure this on all three nodes.

[jack@node01 hadoop-3.2.2]$ sudo vi /etc/profile
export HADOOP_HOME=/u01/hadoop-3.2.2
export LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
[jack@node01 hadoop-3.2.2]$ source /etc/profile

LD_LIBRARY_PATH points to the native LZO library used by Hadoop's LZO compression; it was installed in Part 1 of this series.
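
A quick sanity check of the environment (assuming /etc/profile has been sourced and the native LZO libraries from Part 1 are in place at that path):

[jack@node01 ~]$ hadoop version
[jack@node01 ~]$ echo $HADOOP_HOME
[jack@node01 ~]$ ls /usr/local/hadoop/lzo/lib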

4. Upload the JAR Files (node01/jack)

  1. Copy the hadoop-lzo.jar built in the previous part (see the sketch after this list)
    Target directory: $HADOOP_HOME/share/hadoop/common
    Download: from the attachment (hadoop-3.2.2\share\hadoop\common\hadoop-lzo-0.4.21.jar)
    Why: adds LZO compression support for HDFS files
  2. Upload junit.jar
    Target directory: $HADOOP_HOME/share/hadoop/common/lib
    Version: 4.8.2
    Download: from the attachment (hadoop-3.2.2\share\hadoop\common\lib\junit-4.8.2.jar)
    Why: some hadoop jar jobs need JUnit on the classpath at runtime; without it they fail with java.lang.NoClassDefFoundError: junit/framework/TestCase
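
A minimal sketch of the copy step, assuming both jars were downloaded to /u02 (the source paths are hypothetical; adjust them to wherever you built or saved the files):

[jack@node01 u02]$ cp /u02/hadoop-lzo-0.4.21.jar $HADOOP_HOME/share/hadoop/common/
[jack@node01 u02]$ cp /u02/junit-4.8.2.jar $HADOOP_HOME/share/hadoop/common/lib/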

5. Configure the env Files

  1. Add JAVA_HOME at the top of httpfs-env.sh, mapred-env.sh and yarn-env.sh (see the sketch after this list)
    Directory: $HADOOP_HOME/etc/hadoop
export JAVA_HOME=/usr/java/jdk1.8.0_101
  2. Add JAVA_HOME, HADOOP_HOME and LD_LIBRARY_PATH in hadoop-env.sh
    Directory: $HADOOP_HOME/etc/hadoop
export JAVA_HOME=/usr/java/jdk1.8.0_101
export HADOOP_HOME=/u01/hadoop-3.2.2
export LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib
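
One way to apply these edits on node01 (a sketch; appending the exports works just as well as placing them at the top, because the env files are sourced as ordinary shell scripts):

[jack@node01 hadoop-3.2.2]$ cd $HADOOP_HOME/etc/hadoop
[jack@node01 hadoop]$ for f in httpfs-env.sh mapred-env.sh yarn-env.sh; do echo 'export JAVA_HOME=/usr/java/jdk1.8.0_101' >> "$f"; done
[jack@node01 hadoop]$ echo 'export JAVA_HOME=/usr/java/jdk1.8.0_101' >> hadoop-env.sh
[jack@node01 hadoop]$ echo 'export HADOOP_HOME=/u01/hadoop-3.2.2' >> hadoop-env.sh
[jack@node01 hadoop]$ echo 'export LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib' >> hadoop-env.sh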

6. Configure core-site.xml

Add the storage directories, the HTTP security settings and the LZO compression settings as shown below. This requires the secret directory and the hadoop-http-auth-signature-secret file created in Section 2.
Directory: $HADOOP_HOME/etc/hadoop

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://node01:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/u01/hadoop-3.2.2/hadoop_data/tmp</value>
  </property>
  
  <property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
  </property>
  
  <property>
    <name>io.compression.codec.lzo.class</name>
    <value>com.hadoop.compression.lzo.LzoCodec</value>
  </property>
  
  <property>
    <name>hadoop.http.filter.initializers</name>
    <value>org.apache.hadoop.security.AuthenticationFilterInitializer</value>
    <description>Port 8088 is exposed to the public network, so enable HTTP authentication</description>
  </property>
  
  <property>
    <name>hadoop.http.authentication.type</name>
    <value>simple</value>
  </property>
  <property>
    <name>hadoop.http.authentication.signature.secret.file</name>
    <value>/u01/hadoop-3.2.2/secret/hadoop-http-auth-signature-secret</value>
    <description>The content of the secret file can be any value</description>
  </property>
  <property>
    <name>hadoop.http.authentication.simple.anonymous.allowed</name>
    <value>false</value>
    <description>Whether anonymous requests are allowed; the default is true</description>
  </property>
  
  
  <property>
    <name>dfs.permissions.enabled</name>
    <value>false</value>
  </property>

  <property>
    <name>hadoop.proxyuser.jack.hosts</name>
    <value>*</value>
  </property>

  <property>
    <name>hadoop.proxyuser.jack.groups</name>
    <value>*</value>
  </property>
  
  <!-- Deleted files are first moved to the .Trash directory -->
  <property>
    <name>fs.trash.interval</name>
    <value>1440</value>
    <description>In minutes; 1440 / 60 = 24 hours, so deleted files are kept for one day</description>
  </property>
  <property>
    <name>fs.trash.checkpoint.interval</name>
    <value>1440</value>
  </property>
</configuration>
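
With the environment variables from Section 3 in place, hdfs getconf can confirm that this file is picked up; it only reads the configuration, so it works before the cluster is formatted or started (the output shown is simply the values configured above):

[jack@node01 hadoop-3.2.2]$ hdfs getconf -confKey fs.defaultFS
hdfs://node01:9000
[jack@node01 hadoop-3.2.2]$ hdfs getconf -confKey fs.trash.interval
1440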

7. Configure hdfs-site.xml

Set the NameNode directory, the DataNode directory, the HDFS replication factor and the Secondary NameNode address.
Directory: $HADOOP_HOME/etc/hadoop

<configuration>
   <property>
      <name>dfs.namenode.name.dir</name>
      <value>/u01/hadoop-3.2.2/hadoop_data/namenode</value>
      <description>Metadata storage directory</description>
   </property>

   <property>
      <name>dfs.datanode.data.dir</name>
      <value>/u01/hadoop-3.2.2/hadoop_data/datanode</value>
      <description>DataNode data storage directory</description>
   </property>

   <property>
      <name>dfs.replication</name>
      <value>3</value>
      <description>Number of replicas for each HDFS block</description>
   </property>
   
   <property>
      <name>dfs.secondary.http.address</name>
      <value>node02:9001</value>
      <description>Secondary NameNode address; preferably on a different node from the NameNode. (dfs.secondary.http.address is the deprecated spelling of dfs.namenode.secondary.http-address, which Hadoop 3 still accepts.)</description>
   </property>
  
   <property>
      <name>dfs.webhdfs.enabled</name>
      <value>true</value>
   </property>
</configuration>
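
As with core-site.xml, the effective values can be checked before starting anything; the second command resolves the Secondary NameNode host from the address configured above (the exact output formatting may vary between Hadoop versions):

[jack@node01 hadoop-3.2.2]$ hdfs getconf -confKey dfs.replication
3
[jack@node01 hadoop-3.2.2]$ hdfs getconf -secondaryNameNodes
node02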

8. Configure mapred-site.xml

Set the framework MapReduce runs on (YARN), the MapReduce environment variables, compression and JVM options.
Directory: $HADOOP_HOME/etc/hadoop

<configuration>
   <property>
      <name>mapreduce.framework.name</name>
      <value>yarn</value>
   </property>
   
   <property>
     <name>yarn.app.mapreduce.am.env</name>
     <value>HADOOP_MAPRED_HOME=/u01/hadoop-3.2.2/etc/hadoop:/u01/hadoop-3.2.2/share/hadoop/common/lib/*:/u01/hadoop-3.2.2/share/hadoop/common/*:/u01/hadoop-3.2.2/share/hadoop/hdfs:/u01/hadoop-3.2.2/share/hadoop/hdfs/lib/*:/u01/hadoop-3.2.2/share/hadoop/hdfs/*:/u01/hadoop-3.2.2/share/hadoop/mapreduce/*:/u01/hadoop-3.2.2/share/hadoop/yarn:/u01/hadoop-3.2.2/share/hadoop/yarn/lib/*:/u01/hadoop-3.2.2/share/hadoop/yarn/*</value>
   </property>
   <property>
     <name>mapreduce.map.env</name>
     <value>HADOOP_MAPRED_HOME=/u01/hadoop-3.2.2/etc/hadoop:/u01/hadoop-3.2.2/share/hadoop/common/lib/*:/u01/hadoop-3.2.2/share/hadoop/common/*:/u01/hadoop-3.2.2/share/hadoop/hdfs:/u01/hadoop-3.2.2/share/hadoop/hdfs/lib/*:/u01/hadoop-3.2.2/share/hadoop/hdfs/*:/u01/hadoop-3.2.2/share/hadoop/mapreduce/*:/u01/hadoop-3.2.2/share/hadoop/yarn:/u01/hadoop-3.2.2/share/hadoop/yarn/lib/*:/u01/hadoop-3.2.2/share/hadoop/yarn/*</value>
   </property>
   <property>
     <name>mapreduce.reduce.env</name>
     <value>HADOOP_MAPRED_HOME=/u01/hadoop-3.2.2/etc/hadoop:/u01/hadoop-3.2.2/share/hadoop/common/lib/*:/u01/hadoop-3.2.2/share/hadoop/common/*:/u01/hadoop-3.2.2/share/hadoop/hdfs:/u01/hadoop-3.2.2/share/hadoop/hdfs/lib/*:/u01/hadoop-3.2.2/share/hadoop/hdfs/*:/u01/hadoop-3.2.2/share/hadoop/mapreduce/*:/u01/hadoop-3.2.2/share/hadoop/yarn:/u01/hadoop-3.2.2/share/hadoop/yarn/lib/*:/u01/hadoop-3.2.2/share/hadoop/yarn/*</value>
   </property>

   <property>  
     <name>mapred.map.output.compression.codec</name>  
     <value>com.hadoop.compression.lzo.LzoCodec</value>  
   </property>  

   <property>  
     <name>mapred.child.env</name>  
     <value>LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib</value>  
   </property>
   
   <property>  
     <name>mapred.child.java.opts</name>  
     <value>-Xmx1048m</value>  
   </property> 
   
   <property>  
     <name>mapreduce.map.java.opts</name>  
     <value>-Xmx1310m</value>  
   </property> 
   
   <property>  
     <name>mapreduce.reduce.java.opts</name>  
     <value>-Xmx2620m</value>  
   </property> 
   
   <property>
     <name>mapreduce.job.counters.limit</name>
     <value>20000</value>
     <description>Limit on the number of counters allowed per job. The default value is 200.</description>
   </property>
</configuration>
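
Note that mapred.map.output.compression.codec only selects the codec; compression of intermediate map output is off by default. If you also want map output actually compressed with LZO, you would additionally set the following property (a hedged addition, not part of the original configuration):

   <property>
     <name>mapreduce.map.output.compress</name>
     <value>true</value>
   </property>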

9. Configure yarn-site.xml

Memory settings, the YARN ResourceManager address, and so on.
Directory: $HADOOP_HOME/etc/hadoop

<configuration>
   <!-- Site specific YARN configuration properties -->
   <!-- How reducers fetch map output -->
   <property>
      <name>yarn.nodemanager.aux-services</name>
      <value>mapreduce_shuffle</value>
   </property>
   
   <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>node01</value>
   </property>
   
   <property>
      <description>Amount of physical memory, in MB, that can be allocated for containers.</description>
      <name>yarn.nodemanager.resource.memory-mb</name>
      <value>7192</value>
   </property>
   
   <property>
      <description>The minimum allocation for every container request at the RM, in MB.
      Memory requests lower than this won't take effect, and the specified value will get allocated at minimum.</description>
      <name>yarn.scheduler.minimum-allocation-mb</name>
      <value>1024</value>
   </property>

   <property>
      <description>The maximum allocation for every container request at the RM, in MB.
      Memory requests higher than this won't take effect, and will get capped to this value.</description>
      <name>yarn.scheduler.maximum-allocation-mb</name>
      <value>7192</value>
   </property>

   <property>
      <name>yarn.nodemanager.vmem-check-enabled</name>
      <value>false</value>
   </property>
   
   <property>
      <name>yarn.app.mapreduce.am.command-opts</name>
      <value>-Xmx2457m</value>
   </property>
</configuration>

10. Configure workers

List the DataNode hosts here. In earlier Hadoop versions this file was called slaves; since Hadoop 3 it is named workers.
Directory: $HADOOP_HOME/etc/hadoop

node01
node02
node03

11. Sync Hadoop to the Other Nodes

Distribute the installation with the xsync script described in Part 1 of this series, or use scp instead (see the example below).

[jack@node01 bin]$ cd /u01/bin
[jack@node01 bin]$ xsync /u01/hadoop-3.2.2
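
If the xsync script is not available, plain scp works as well (assuming the jack user can SSH to node02 and node03):

[jack@node01 u01]$ scp -r /u01/hadoop-3.2.2 jack@node02:/u01/
[jack@node01 u01]$ scp -r /u01/hadoop-3.2.2 jack@node03:/u01/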

12. Format the NameNode

[jack@node01 hadoop-3.2.2]$ hdfs namenode -format
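
If the format succeeds, the NameNode metadata directory configured in hdfs-site.xml is initialized; a quick check (the clusterID recorded inside will differ on your machine):

[jack@node01 hadoop-3.2.2]$ cat hadoop_data/namenode/current/VERSION

Avoid re-running the format on a cluster that already holds data: it generates a new clusterID, and DataNodes carrying the old ID will refuse to register. If you must reformat, clear the namenode, datanode and tmp directories on all nodes first.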

13. Start or Stop Hadoop

  1. Start the cluster
[jack@node01 hadoop-3.2.2]$ start-all.sh
WARNING: Attempting to start all Apache Hadoop daemons as jack in 10 seconds.
WARNING: This is not a recommended production deployment configuration.
WARNING: Use CTRL-C to abort.
Starting namenodes on [node01]
Starting datanodes
Starting secondary namenodes [node02]
Starting resourcemanager
Starting nodemanagers
  2. Check the running daemons
[jack@node01 hadoop-3.2.2]$ jps
31825 Jps
31351 ResourceManager
30845 NameNode
31486 NodeManager
31007 DataNode
  3. Stop the cluster
[jack@node01 hadoop-3.2.2]$ stop-all.sh
WARNING: Stopping all Apache Hadoop daemons as jack in 10 seconds.
WARNING: Use CTRL-C to abort.
Stopping namenodes on [node01]
Stopping datanodes
Stopping secondary namenodes [node02]
Stopping nodemanagers
Stopping resourcemanager

14. Web UI

  1. Node information:
    first access (because the simple authentication policy is enabled): http://node01:9870?user.name=jack
  2. Jobs: http://node01:8088/cluster?user.name=jack (see the quick check below)
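
Because hadoop.http.authentication.simple.anonymous.allowed is false, requests without user.name should be rejected. A quick command-line check (a sketch; the exact status codes and redirects can vary slightly between Hadoop versions):

[jack@node01 ~]$ curl -s -o /dev/null -w "%{http_code}\n" http://node01:9870/
401
[jack@node01 ~]$ curl -s -o /dev/null -w "%{http_code}\n" "http://node01:9870/?user.name=jack"
200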

15. Hadoop Tests

  1. Estimate Pi
[jack@node01 hadoop-3.2.2]$ cd /u01/hadoop-3.2.2/share/hadoop/mapreduce
[jack@node01 mapreduce]$ hadoop jar hadoop-mapreduce-examples-3.2.2.jar pi 100 100
  2. LZO compression test
    For this example, app.log (available in the attachment) is uploaded to the u02 directory. Compress it with lzop, create an /input directory in HDFS, upload the compressed file, then build the LZO index; afterwards an extra .index file appears in the directory (see the follow-up check after the commands).
[jack@node01 u02]$ lzop app.log
[jack@node01 u02]$ ls -l app.log*
-rw-rw-r-- 1 jack jack 10306945 Feb 27 22:32 app.log
-rw-rw-r-- 1 jack jack  3291867 Feb 27 22:32 app.log.lzo
[jack@node01 u02]$ hdfs dfs -mkdir -p /input
[jack@node01 u02]$ hdfs dfs -put app.log.lzo /input
[jack@node01 u02]$ hdfs dfs -ls /input
Found 1 items
-rw-r--r--   3 jack supergroup    3291867 2021-02-27 23:06 /input/app.log.lzo
[jack@node01 u02]$ cd $HADOOP_HOME/share/hadoop/common
[jack@node01 common]$ hadoop jar hadoop-lzo-0.4.21.jar com.hadoop.compression.lzo.DistributedLzoIndexer /input/app.log.lzo
[jack@node01 common]$ hdfs dfs -ls /input
Found 2 items
-rw-r--r--   3 jack supergroup    3291867 2021-02-27 23:06 /input/app.log.lzo
-rw-r--r--   3 jack supergroup        320 2021-02-27 23:09 /input/app.log.lzo.index
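
To confirm the index is actually used, a common follow-up (a sketch, not part of the original steps) is to run a MapReduce job with the hadoop-lzo input format; with the .index file present, large .lzo inputs can be split across multiple map tasks instead of being read by a single mapper:

[jack@node01 common]$ cd $HADOOP_HOME/share/hadoop/mapreduce
[jack@node01 mapreduce]$ hadoop jar hadoop-mapreduce-examples-3.2.2.jar wordcount -Dmapreduce.job.inputformat.class=com.hadoop.mapreduce.LzoTextInputFormat /input /output-lzo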
  3. HDFS write benchmark
    What it does: writes ten 100 MB files to the HDFS cluster
[jack@node01 common]$ cd $HADOOP_HOME/share/hadoop/mapreduce
[jack@node01 mapreduce]$ hadoop jar hadoop-mapreduce-client-jobclient-3.2.2-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 100MB
  4. HDFS read benchmark
    What it does: reads the ten 100 MB files back from the HDFS cluster (see the note after the commands)
[jack@node01 mapreduce]$ hadoop jar hadoop-mapreduce-client-jobclient-3.2.2-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 100MB
[jack@node01 mapreduce]$ hadoop jar hadoop-mapreduce-client-jobclient-3.2.2-tests.jar TestDFSIO -clean
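
TestDFSIO prints throughput statistics when each run finishes and, by default, also appends them to a local file named TestDFSIO_results.log in the current directory; the -clean run above only removes the benchmark data from HDFS, not this local log:

[jack@node01 mapreduce]$ cat TestDFSIO_results.log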

E-commerce Data Warehouse Project (1): System Planning and Configuration
E-commerce Data Warehouse Project (2): Maven Installation and hadoop-lzo Compilation
E-commerce Data Warehouse Project (3): Hadoop 3.2.2 Installation and Configuration
E-commerce Data Warehouse Project (4): Simulated E-commerce Log Data Generation