Hadoop Quick Start

Hadoop implements a distributed file system, the Hadoop Distributed File System (HDFS). The core of the Hadoop framework is the pairing of HDFS and MapReduce: HDFS provides storage for massive amounts of data, while MapReduce provides computation over it.

Download

wget https://mirrors.tuna.tsinghua.edu.cn/apache/hadoop/core/hadoop-2.8.1/hadoop-2.8.1.tar.gz

Github

Version: 2.8.1
hadoop-getstarted

Env

export HADOOP_HOME=/opt/apache/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
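To sanity-check the environment, unpack the tarball to the path the Env section assumes (the /opt/apache/hadoop location is this guide's choice, not a requirement) and confirm the binaries resolve:

# unpack the downloaded tarball to the assumed install path
tar -xzf hadoop-2.8.1.tar.gz && mv hadoop-2.8.1 /opt/apache/hadoop
# should print "Hadoop 2.8.1" plus build details
hadoop version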

Config

Standalone (single node, pseudo-distributed)

hadoop/etc/hadoop/hadoop-env.sh

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
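The java-8-oracle path is specific to Oracle JDK installs on Ubuntu; on other systems the path differs. A way to find it (the OpenJDK path below is an assumption, verify on your machine):

# locate the actual JDK home
readlink -f $(which java)
# e.g. OpenJDK 8 on Ubuntu would be (hypothetical path):
# export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64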

hadoop/etc/hadoop/mapred-site.xml

<configuration>
</configuration>

hadoop/etc/hadoop/core-site.xml

<configuration>
        <property>
                <name>fs.default.name</name>
                <value>hdfs://127.0.0.1:9000</value>
        </property>
</configuration>
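fs.default.name still works in 2.8.x but is deprecated; the current property name is fs.defaultFS. An equivalent, more forward-compatible form:

<configuration>
        <property>
                <name>fs.defaultFS</name>
                <value>hdfs://127.0.0.1:9000</value>
        </property>
</configuration>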

hadoop/etc/hadoop/hdfs-site.xml

<configuration>
</configuration>
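hdfs-site.xml can stay empty, but on a single node the default replication factor of 3 leaves every block under-replicated; the official single-node guide sets it to 1. A minimal sketch:

<configuration>
        <property>
                <name>dfs.replication</name>
                <value>1</value>
        </property>
</configuration>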

Start

$HADOOP_HOME/bin/hdfs namenode -format
$HADOOP_HOME/bin/hdfs getconf -namenodes
$HADOOP_HOME/sbin/start-all.sh
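start-all.sh still works in 2.x but is deprecated; it simply delegates to the two scripts below, which can also be run separately:

$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh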

Check status

jps
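On a healthy single-node setup, jps should list roughly the following daemons (PIDs are illustrative):

1001 NameNode
1102 DataNode
1203 SecondaryNameNode
1304 ResourceManager
1405 NodeManager
1506 Jps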

Example

# Run from $HADOOP_HOME (/opt/apache/hadoop)
## Usage
bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar wordcount

## Put file for processing (on a fresh HDFS, create the home directory first)
hadoop fs -mkdir -p /user/$USER
hadoop fs -put LICENSE.txt

## Schedule the job
bin/hadoop jar ./share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar wordcount LICENSE.txt LICENSE.wc
hadoop fs -get LICENSE.wc
cat LICENSE.wc/part-r-00000
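MapReduce refuses to start if the output directory already exists, so remove it (and the local copy) before re-running the job:

hadoop fs -rm -r LICENSE.wc
rm -rf LICENSE.wc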

Web Client

# ResourceManager Web UI (YARN cluster overview)
http://desert:8088/cluster/cluster

# NameNode Web UI (HDFS overview)
http://desert:50070/dfshealth.html#tab-overview

# Job history server
# http://www.cnblogs.com/luogankun/p/4019303.html
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh --config $HADOOP_CONF_DIR start historyserver
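Once the history server is up, its web UI listens on port 19888 by default (hostname here follows the examples above):

# Job history Web UI (default port 19888)
http://desert:19888/jobhistory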

Workflow

[Workflow diagram: https://static-public.chatopera.com/backlog/chatbot/images/2017/07/hadoop2.png]


Streaming

Hadoop Streaming lets us process line-oriented data with any executable script: input is read from Unix standard input (STDIN) and results are written to standard output (STDOUT).
https://hadoop.apache.org/docs/r2.7.3/hadoop-streaming/HadoopStreaming.html

Example

http://www.cnblogs.com/dandingyy/archive/2013/03/01/2938442.html

Download data

wget http://www.nber.org/patents/Cite75_99.zip -O data/Cite75_99.zip
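The archive has to be unpacked and the text file uploaded to HDFS before it can serve as job input (the paths below mirror the -input argument used in the submit command later):

unzip data/Cite75_99.zip -d data
hadoop fs -mkdir -p data
hadoop fs -put data/cite75_99.txt data/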

Python Streaming, RandomSample.py

#!/usr/bin/env python
# RandomSample.py: keep each input line with probability argv[1] percent
import sys, random

# argv[1] is the sampling percentage, e.g. 10 for a ~10% sample
for line in sys.stdin:
    if random.randint(1, 100) <= int(sys.argv[1]):
        print(line.strip())
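Because a streaming mapper only reads STDIN and writes STDOUT, the script can be debugged locally with a plain pipe before submitting; make it executable first:

chmod +x RandomSample.py
head -1000 data/cite75_99.txt | ./RandomSample.py 10 | wc -l
# roughly 100 lines should come back for a 10% sample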

Submit Job

# Generic options such as -D must come before the streaming-specific options
bin/hadoop jar share/hadoop/tools/lib/hadoop-streaming-2.8.1.jar \
        -D mapred.reduce.tasks=1 \
        -input data/cite75_99.txt \
        -output cite75_99_sample \
        -mapper 'RandomSample.py 10' \
        -file RandomSample.py

By default the IdentityReducer is used; after the job finishes, use hadoop fs -getmerge to fetch the final result.
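For example (the local filename is illustrative):

hadoop fs -getmerge cite75_99_sample sample_local.txt
head sample_local.txt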

Breaking changes

TaskTracker and JobTracker have been replaced.

In Hadoop 2.0, the JobTracker and TaskTracker no longer exist and have been replaced by three components:

ResourceManager: a scheduler that allocates available resources in the cluster amongst the competing applications.

NodeManager: runs on each node in the cluster and takes direction from the ResourceManager. It is responsible for managing resources available on a single node.

ApplicationMaster: an instance of a framework-specific library, an ApplicationMaster runs a specific YARN job and is responsible for negotiating resources from the ResourceManager and also working with the NodeManager to execute and monitor Containers.

So as long as you see the ResourceManager (on the NameNode host) and NodeManager (on the DataNode hosts) processes, you are good to go.
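A quick way to confirm the NodeManagers have registered with the ResourceManager:

$HADOOP_HOME/bin/yarn node -list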