E-commerce Data Warehouse Project (3): Hadoop 3.2.2 Installation and Configuration
This section covers installing and configuring Hadoop 3.2.2, enabling LZO compression, and finishing with some simple tests of the cluster.
Code download
1. Cluster Planning

|      | node01             | node02                      | node03   |
| ---- | ------------------ | --------------------------- | -------- |
| HDFS | NameNode, DataNode | DataNode, SecondaryNameNode | DataNode |
2. Downloading Hadoop 3.2.2 (node01/jack)
1. Download hadoop 3.2.2
[jack@node01 u02]$ wget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz
2. Extract the archive to /u01, which creates /u01/hadoop-3.2.2
[jack@node01 u02]$ tar -zxf hadoop-3.2.2.tar.gz -C /u01
3. Under /u01/hadoop-3.2.2, create the log directory, the data directories, and the HTTP-auth secret directory
[jack@node01 u01]$ cd hadoop-3.2.2
[jack@node01 hadoop-3.2.2]$ mkdir logs hadoop_data hadoop_data/tmp hadoop_data/namenode hadoop_data/datanode secret
[jack@node01 hadoop-3.2.2]$ cd /u01/hadoop-3.2.2/secret
[jack@node01 secret]$ vi hadoop-http-auth-signature-secret
Enter a secret string, e.g. jack (per core-site.xml below, this file is only a signing secret, so any string works).
We use simple (pseudo-secure) authentication, which requires the access user to be supplied with each request; see core-site.xml for the settings. If you need stronger authentication, use Kerberos.
Append ?user.name=jack to the Hadoop web UI URLs,
e.g. http://node01:8088/cluster?user.name=jack
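Since every web UI URL must carry the user.name query parameter under simple authentication, a tiny helper can assemble the authenticated URLs. This is just an illustrative sketch; `authed_url` is a hypothetical function, not a Hadoop command:

```shell
# Build a Hadoop web UI URL carrying the simple-auth user.
# authed_url is a hypothetical helper for illustration only.
authed_url() {
  local host="$1" port="$2" path="$3" user="$4"
  printf 'http://%s:%s%s?user.name=%s\n' "$host" "$port" "$path" "$user"
}

authed_url node01 8088 /cluster jack   # YARN ResourceManager UI
```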
3. Configuring Environment Variables (all/jack)
Note: configure this on all three nodes.
[jack@node01 hadoop-3.2.2]$ sudo vi /etc/profile
export HADOOP_HOME=/u01/hadoop-3.2.2
export LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
[jack@node01 hadoop-3.2.2]$ source /etc/profile
LD_LIBRARY_PATH points to the native LZO libraries used for Hadoop's LZO compression, installed in Part (1) of this series.
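Before continuing, the profile changes can be sanity-checked on each node. The `check_env` function below is a hypothetical helper (not part of Hadoop), written against the exact paths used in this article:

```shell
# Hypothetical sanity check: verify the /etc/profile variables are
# set the way this section expects (paths are from this article).
check_env() {
  [ "$HADOOP_HOME" = "/u01/hadoop-3.2.2" ] || { echo "HADOOP_HOME wrong"; return 1; }
  case ":$PATH:" in
    *":$HADOOP_HOME/bin:"*) ;;   # PATH must include $HADOOP_HOME/bin
    *) echo "PATH missing \$HADOOP_HOME/bin"; return 1 ;;
  esac
  echo "env ok"
}
```

Run it after `source /etc/profile` on every node; it prints "env ok" when the variables match.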
4. Uploading jar Files (node01/jack)
- Copy the hadoop-lzo.jar built in the previous section
  Directory: $HADOOP_HOME/share/hadoop/common
  Download: from the attachment (hadoop-3.2.2\share\hadoop\common\hadoop-lzo-0.4.21.jar)
  Why: enables LZO compression for HDFS files
- Upload junit.jar
  Directory: $HADOOP_HOME/share/hadoop/common/lib
  Version: 4.8.2
  Download: from the attachment (hadoop-3.2.2\share\hadoop\common\lib\junit-4.8.2.jar)
  Why: hadoop jar jobs need junit at runtime; without it you get java.lang.NoClassDefFoundError: junit/framework/TestCase
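A quick way to confirm both jars landed in the right directories is a small check script. `check_jars` is an illustrative helper, assuming the jar names and layout described above:

```shell
# Hypothetical check that the two extra jars from this step are in
# place under the given Hadoop home (defaults to $HADOOP_HOME).
check_jars() {
  local home="${1:-$HADOOP_HOME}" missing=0
  [ -f "$home/share/hadoop/common/hadoop-lzo-0.4.21.jar" ] \
    || { echo "missing hadoop-lzo-0.4.21.jar"; missing=1; }
  [ -f "$home/share/hadoop/common/lib/junit-4.8.2.jar" ] \
    || { echo "missing junit-4.8.2.jar"; missing=1; }
  [ "$missing" -eq 0 ] && echo "jars ok"
  return "$missing"
}
```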
5. Configuring the env Files
- Add the JAVA_HOME variable at the top of each of httpfs-env.sh, mapred-env.sh, and yarn-env.sh
  Directory: $HADOOP_HOME/etc/hadoop
export JAVA_HOME=/usr/java/jdk1.8.0_101
- Add JAVA_HOME, HADOOP_HOME, and LD_LIBRARY_PATH to hadoop-env.sh
  Directory: $HADOOP_HOME/etc/hadoop
export JAVA_HOME=/usr/java/jdk1.8.0_101
export HADOOP_HOME=/u01/hadoop-3.2.2
export LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib
6. Configuring core-site.xml
Set the default file system, data directories, HTTP security, LZO compression, and trash settings, as shown below. The secret directory and the hadoop-http-auth-signature-secret file must exist first; see step two.
Directory: $HADOOP_HOME/etc/hadoop
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://node01:9000</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/u01/hadoop-3.2.2/hadoop_data/tmp</value>
</property>
<property>
<name>io.compression.codecs</name>
<value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
</property>
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<property>
<name>hadoop.http.filter.initializers</name>
<value>org.apache.hadoop.security.AuthenticationFilterInitializer</value>
<description>Port 8088 is exposed to the public network, so add HTTP authentication</description>
</property>
<property>
<name>hadoop.http.authentication.type</name>
<value>simple</value>
</property>
<property>
<name>hadoop.http.authentication.signature.secret.file</name>
<value>/u01/hadoop-3.2.2/secret/hadoop-http-auth-signature-secret</value>
<description>The secret file may contain any string</description>
</property>
<property>
<name>hadoop.http.authentication.simple.anonymous.allowed</name>
<value>false</value>
<description>Whether anonymous requests are allowed; the default is true</description>
</property>
<property>
<name>dfs.permissions.enabled</name>
<value>false</value>
<description>Disables HDFS permission checking; as an HDFS property this normally belongs in hdfs-site.xml</description>
</property>
<property>
<name>hadoop.proxyuser.jack.hosts</name>
<value>*</value>
</property>
<property>
<name>hadoop.proxyuser.jack.groups</name>
<value>*</value>
</property>
<!-- Deleted files are moved to the .Trash directory first -->
<property>
<name>fs.trash.interval</name>
<value>1440</value>
<description>In minutes; 1440 / 60 = 24 hours, so deleted files are kept for one day</description>
</property>
<property>
<name>fs.trash.checkpoint.interval</name>
<value>1440</value>
</property>
</configuration>
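To double-check a value in these XML files without starting the cluster (once the daemons are configured, `hdfs getconf -confKey` serves the same purpose), a small awk helper can pull a property out. `get_prop` is an illustrative sketch that assumes the one-tag-per-line layout used above, not a general XML parser:

```shell
# get_prop <file> <property-name>: print the <value> that follows the
# matching <name> tag. Illustrative helper; assumes one tag per line.
get_prop() {
  awk -v tag="<name>$2</name>" '
    index($0, tag) { found = 1; next }
    found && /<value>/ {
      sub(/.*<value>/, ""); sub(/<\/value>.*/, ""); print; exit
    }' "$1"
}

# Example: get_prop core-site.xml fs.defaultFS
```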
7. Configuring hdfs-site.xml
Set the NameNode directory, the DataNode directory, the HDFS replication factor, and the Secondary NameNode address.
Directory: $HADOOP_HOME/etc/hadoop
<configuration>
<property>
<name>dfs.namenode.name.dir</name>
<value>/u01/hadoop-3.2.2/hadoop_data/namenode</value>
<description>NameNode metadata directory</description>
</property>
<property>
<name>dfs.datanode.data.dir</name>
<value>/u01/hadoop-3.2.2/hadoop_data/datanode</value>
<description>DataNode data directory</description>
</property>
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Number of replicas per HDFS block</description>
</property>
<property>
<name>dfs.namenode.secondary.http-address</name>
<value>node02:9001</value>
<description>Secondary NameNode address, preferably on a different node from the NameNode (dfs.secondary.http.address is the deprecated name for this key)</description>
</property>
<property>
<name>dfs.webhdfs.enabled</name>
<value>true</value>
</property>
</configuration>
8. Configuring mapred-site.xml
Set YARN as the MapReduce resource manager, the MapReduce environment variables, intermediate compression, and JVM sizes.
Directory: $HADOOP_HOME/etc/hadoop
<configuration>
<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>
<property>
<name>yarn.app.mapreduce.am.env</name>
<value>HADOOP_MAPRED_HOME=/u01/hadoop-3.2.2/etc/hadoop:/u01/hadoop-3.2.2/share/hadoop/common/lib/*:/u01/hadoop-3.2.2/share/hadoop/common/*:/u01/hadoop-3.2.2/share/hadoop/hdfs:/u01/hadoop-3.2.2/share/hadoop/hdfs/lib/*:/u01/hadoop-3.2.2/share/hadoop/hdfs/*:/u01/hadoop-3.2.2/share/hadoop/mapreduce/*:/u01/hadoop-3.2.2/share/hadoop/yarn:/u01/hadoop-3.2.2/share/hadoop/yarn/lib/*:/u01/hadoop-3.2.2/share/hadoop/yarn/*</value>
</property>
<property>
<name>mapreduce.map.env</name>
<value>HADOOP_MAPRED_HOME=/u01/hadoop-3.2.2/etc/hadoop:/u01/hadoop-3.2.2/share/hadoop/common/lib/*:/u01/hadoop-3.2.2/share/hadoop/common/*:/u01/hadoop-3.2.2/share/hadoop/hdfs:/u01/hadoop-3.2.2/share/hadoop/hdfs/lib/*:/u01/hadoop-3.2.2/share/hadoop/hdfs/*:/u01/hadoop-3.2.2/share/hadoop/mapreduce/*:/u01/hadoop-3.2.2/share/hadoop/yarn:/u01/hadoop-3.2.2/share/hadoop/yarn/lib/*:/u01/hadoop-3.2.2/share/hadoop/yarn/*</value>
</property>
<property>
<name>mapreduce.reduce.env</name>
<value>HADOOP_MAPRED_HOME=/u01/hadoop-3.2.2/etc/hadoop:/u01/hadoop-3.2.2/share/hadoop/common/lib/*:/u01/hadoop-3.2.2/share/hadoop/common/*:/u01/hadoop-3.2.2/share/hadoop/hdfs:/u01/hadoop-3.2.2/share/hadoop/hdfs/lib/*:/u01/hadoop-3.2.2/share/hadoop/hdfs/*:/u01/hadoop-3.2.2/share/hadoop/mapreduce/*:/u01/hadoop-3.2.2/share/hadoop/yarn:/u01/hadoop-3.2.2/share/hadoop/yarn/lib/*:/u01/hadoop-3.2.2/share/hadoop/yarn/*</value>
</property>
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
<description>Enable compression of intermediate map output; without this the codec below is not used</description>
</property>
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
<description>mapred.map.output.compression.codec is the deprecated name for this key</description>
</property>
<property>
<name>mapred.child.env</name>
<value>LD_LIBRARY_PATH=/usr/local/hadoop/lzo/lib</value>
</property>
<property>
<name>mapred.child.java.opts</name>
<value>-Xmx1048m</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1310m</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx2620m</value>
</property>
<property>
<name>mapreduce.job.counters.limit</name>
<value>20000</value>
<description>Limit on the number of counters allowed per job. The default value is 200.</description>
</property>
</configuration>
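The -Xmx values above are consistent with a common rule of thumb (not something Hadoop mandates): give the JVM roughly 80% of its container's memory and leave the rest for off-heap use. The container sizes below are assumptions for illustration only, since this article does not set mapreduce.map.memory.mb or mapreduce.reduce.memory.mb explicitly:

```shell
# Rule-of-thumb sketch: heap = 80% of the container size in MB.
# Container sizes here are assumed values, chosen to show how the
# -Xmx numbers in mapred-site.xml could have been derived.
heap_for() { awk -v mb="$1" 'BEGIN { printf "%d\n", mb * 0.8 }'; }

heap_for 1638   # -> 1310, matches mapreduce.map.java.opts
heap_for 3276   # -> 2620, matches mapreduce.reduce.java.opts
```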
9. Configuring yarn-site.xml
Memory settings, the YARN ResourceManager address, the shuffle service, etc.
目录:$HADOOP_HOME/etc/hadoop
<configuration>
<!-- Site specific YARN configuration properties -->
<!-- How reducers fetch map output (the shuffle service) -->
<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>
<property>
<name>yarn.resourcemanager.hostname</name>
<value>node01</value>
</property>
<property>
<description>Amount of physical memory, in MB, that can be allocated for containers.</description>
<name>yarn.nodemanager.resource.memory-mb</name>
<value>7192</value>
</property>
<property>
<description>The minimum allocation for every container request at the RM,in MBs.
Memory requests lower than this won't take effect,and the specified value will get allocated at minimum.</description>
<name>yarn.scheduler.minimum-allocation-mb</name>
<value>1024</value>
</property>
<property>
<description>The maximum allocation for every container request at the RM,in MBs.
Memory requests higher than this won't take effect, and will get capped to this value.</description>
<name>yarn.scheduler.maximum-allocation-mb</name>
<value>7192</value>
</property>
<property>
<name>yarn.nodemanager.vmem-check-enabled</name>
<value>false</value>
</property>
<property>
<name>yarn.app.mapreduce.am.command-opts</name>
<value>-Xmx2457m</value>
</property>
</configuration>
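One consequence of the memory settings above: with 7192 MB per NodeManager and a 1024 MB minimum allocation, each node can host at most seven minimum-size containers (integer division). A quick arithmetic check:

```shell
# Max number of minimum-size containers per NodeManager,
# using the values from yarn-site.xml above.
node_mb=7192   # yarn.nodemanager.resource.memory-mb
min_mb=1024    # yarn.scheduler.minimum-allocation-mb
echo $(( node_mb / min_mb ))   # -> 7
```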
10. Configuring workers
List the DataNode hosts. Before Hadoop 3 this file was named slaves; it is now called workers.
目录:$HADOOP_HOME/etc/hadoop
node01
node02
node03
11. Distributing Hadoop
Distribute the installation with the sync script introduced in Part (1), or with scp.
[jack@node01 bin]$ cd /u01/bin
[jack@node01 bin]$ xsync /u01/hadoop-3.2.2
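If xsync is not available, the same distribution can be sketched with an scp loop. `sync_hadoop` is a hypothetical helper; for safety it only prints the commands (drop the echo to actually copy), and the host names are the ones used throughout this article:

```shell
# Sketch of an scp-based alternative to xsync. Dry-run style:
# echo prints each command instead of executing it.
sync_hadoop() {
  local src="$1"; shift
  for host in "$@"; do
    echo scp -r "$src" "jack@$host:$(dirname "$src")"
  done
}

sync_hadoop /u01/hadoop-3.2.2 node02 node03
```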
12. Formatting HDFS
[jack@node01 hadoop-3.2.2]$ hdfs namenode -format
13. Starting and Stopping Hadoop
- Start the cluster
[jack@node01 hadoop-3.2.2]$ start-all.sh
WARNING: Attempting to start all Apache Hadoop daemons as jack in 10 seconds.
WARNING: This is not a recommended production deployment configuration.
WARNING: Use CTRL-C to abort.
Starting namenodes on [node01]
Starting datanodes
Starting secondary namenodes [node02]
Starting resourcemanager
Starting nodemanagers
- Check the running daemons
[jack@node01 hadoop-3.2.2]$ jps
31825 Jps
31351 ResourceManager
30845 NameNode
31486 NodeManager
31007 DataNode
- Stop the cluster
[jack@node01 hadoop-3.2.2]$ stop-all.sh
WARNING: Stopping all Apache Hadoop daemons as jack in 10 seconds.
WARNING: Use CTRL-C to abort.
Stopping namenodes on [node01]
Stopping datanodes
Stopping secondary namenodes [node02]
Stopping nodemanagers
Stopping resourcemanager
14. Checking the Web UIs
- Node information (the simple-auth user is required on first access): http://node01:9870?user.name=jack
- Jobs: http://node01:8088/cluster?user.name=jack
15. Testing Hadoop
- PI test
[jack@node01 hadoop-3.2.2]$ cd /u01/hadoop-3.2.2/share/hadoop/mapreduce
[jack@node01 mapreduce]$ hadoop jar hadoop-mapreduce-examples-3.2.2.jar pi 100 100
- LZO compression test
Upload app.log (available in the attachment) to the u02 directory. Compress it with lzop, create an /input directory in HDFS, upload the compressed file, and then build an LZO index; afterwards a .index file appears next to the .lzo file.
[jack@node01 u02]$ lzop app.log
[jack@node01 u02]$ ls -l app.log*
-rw-rw-r-- 1 jack jack 10306945 Feb 27 22:32 app.log
-rw-rw-r-- 1 jack jack 3291867 Feb 27 22:32 app.log.lzo
[jack@node01 u02]$ hdfs dfs -mkdir -p /input
[jack@node01 u02]$ hdfs dfs -put app.log.lzo /input
[jack@node01 u02]$ hdfs dfs -ls /input
Found 1 items
-rw-r--r-- 3 jack supergroup 3291867 2021-02-27 23:06 /input/app.log.lzo
[jack@node01 u02]$ cd $HADOOP_HOME/share/hadoop/common
[jack@node01 common]$ hadoop jar hadoop-lzo-0.4.21.jar com.hadoop.compression.lzo.DistributedLzoIndexer /input/app.log.lzo
[jack@node01 common]$ hdfs dfs -ls /input
Found 2 items
-rw-r--r-- 3 jack supergroup 3291867 2021-02-27 23:06 /input/app.log.lzo
-rw-r--r-- 3 jack supergroup 320 2021-02-27 23:09 /input/app.log.lzo.index
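From the sizes in the listings above, the .lzo file is roughly a third of the original. A quick ratio check with the byte counts shown:

```shell
# Compression ratio from the file sizes listed above (bytes).
orig=10306945   # app.log
lzo=3291867     # app.log.lzo
awk -v a="$lzo" -v b="$orig" 'BEGIN { printf "%.1f%%\n", a * 100 / b }'
# -> 31.9%
```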
- HDFS write benchmark
Test: write ten 100 MB files to the HDFS cluster.
[jack@node01 common]$ cd $HADOOP_HOME/share/hadoop/mapreduce
[jack@node01 mapreduce]$ hadoop jar hadoop-mapreduce-client-jobclient-3.2.2-tests.jar TestDFSIO -write -nrFiles 10 -fileSize 100MB
- HDFS read benchmark
Test: read ten 100 MB files from the HDFS cluster.
[jack@node01 mapreduce]$ hadoop jar hadoop-mapreduce-client-jobclient-3.2.2-tests.jar TestDFSIO -read -nrFiles 10 -fileSize 100MB
[jack@node01 mapreduce]$ hadoop jar hadoop-mapreduce-client-jobclient-3.2.2-tests.jar TestDFSIO -clean
E-commerce Data Warehouse Project (1): System Planning and Configuration
E-commerce Data Warehouse Project (2): Maven Installation and hadoop-lzo Compilation
E-commerce Data Warehouse Project (3): Hadoop 3.2.2 Installation and Configuration
E-commerce Data Warehouse Project (4): Simulating E-commerce Log Data