Hadoop to explore data


Big data, by definition, denotes datasets that are so large or complex that traditional data processing frameworks and software are inadequate to deal with them. Hadoop is an answer to the inadequacies of these traditional tools and technologies: a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

Hadoop is designed to scale up from single servers to thousands of machines, each offering local computation and storage. Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.

This post (the first in a chain of posts) is meant to help developers who want to get started with Hadoop. We will begin by setting up your Hadoop environment and, along the way, get to know the basic Hadoop components. In this post we will set up a single-node Hadoop cluster.

The Hadoop software library has been demonstrated on GNU/Linux clusters with 2000 nodes. Here we will configure a single-node Hadoop installation capable of simple operations such as storing data in the Hadoop file system and running MapReduce on top of the stored data. There are a few prerequisites we need to satisfy to build our own Hadoop ecosystem:

  1. GNU/Linux OS
  2. Java 7/8 installed
  3. SSH installed and sshd running on the node (required by the Hadoop scripts that start and stop the daemons)

Here we will be using the Ubuntu 16.04 Linux distribution for the Hadoop node. Hadoop is a Java library, so Java 8 is needed for the Hadoop installation. Please check your Java installation and make a note of its location; we will need to tell Hadoop the Java home during the setup. One way to check is shown below.
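To check the installed Java version and find a candidate Java home (the exact path will vary with your distribution and Java package, so treat the path below as an example, not a given):

$ java -version                # prints the installed Java version
$ readlink -f $(which java)    # resolves the java symlink, e.g. /usr/lib/jvm/java-8-oracle/jre/bin/java; the Java home is the directory above jre/bin (or bin)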

Create a hadoop user and group

This is an optional step, but I recommend doing it.


This will help us keep the Hadoop installation separate and authorize access based on users and groups. Add a new group “hadoop” and a new user “hduser” in that group.


$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser
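You can verify that the user and group were created; the numeric ids below are just an example and will differ on your machine:

$ id hduser    # e.g. uid=1001(hduser) gid=1001(hadoop) groups=1001(hadoop)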

SSH setup

Follow the steps below to enable passwordless SSH.

First check whether you are able to ssh to localhost, then generate a key pair and authorize it:
$ ssh localhost
$ ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
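With the key authorized, ssh to localhost should no longer prompt for a password. If it still does, file permissions are the usual culprit; the chmod below is a defensive extra step, not part of the original instructions:

$ chmod 0600 ~/.ssh/authorized_keys    # sshd ignores authorized_keys files that are group or world writable
$ ssh localhost                        # should now log in without a password prompt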

Install hadoop

Download a stable release Hadoop binary from the Apache Hadoop download page. We will be using Hadoop 2.8.1. Feel free to use a newer Hadoop version if Java 8 is available on your node.

If you need to install Java 8:

$ sudo add-apt-repository ppa:webupd8team/java
$ sudo apt-get update
$ sudo apt-get install oracle-java8-installer
            

Download and untar the hadoop binaries

$ cd /usr/local
$ wget http://apache.mirror.anlx.net/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz
$ tar -xvf hadoop-2.8.1.tar.gz
$ mv hadoop-2.8.1 hadoop
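If you downloaded and extracted the tarball with sudo or as root, hand the installation over to the hadoop user so the daemons can write their logs and pid files. This assumes the hduser/hadoop user and group created earlier; adjust if you chose different names:

$ sudo chown -R hduser:hadoop /usr/local/hadoop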

There are a few Hadoop components that we need to know about before we can go ahead.

HDFS : The Hadoop Distributed File System is a distributed file system designed to run on commodity hardware. HDFS has a master/slave architecture: the master is the NameNode and the slaves are the DataNodes.

NameNode : An HDFS cluster consists of a NameNode, a master server that manages the file system namespace and regulates access to files by clients. The NameNode executes file system namespace operations like opening, closing, and renaming files and directories, and it maintains all the metadata for the data stored in HDFS. If we lose the NameNode metadata, we effectively lose the data stored in the cluster, because the information needed to locate and access the stored blocks is gone. We can reduce the risk of this single point of failure by running a Secondary NameNode, which periodically checkpoints the namespace metadata (it is not a failover node), or ensure high availability by running multiple NameNodes with a hot standby.

DataNode : An HDFS cluster consists of DataNodes, usually one per node in the cluster, which manage the storage attached to the nodes they run on. Internally, a file is split into one or more blocks and these blocks are stored in a set of DataNodes. A DataNode has no knowledge of HDFS files; it stores each block of HDFS data as a separate file on its local file system. Blocks are also replicated across DataNodes to provide fault tolerance. When a DataNode starts up, it scans through its local file system, generates a list of all the HDFS data blocks that correspond to these local files and sends this report to the NameNode. This is the Blockreport.
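As a side note, once HDFS is up and running (we start it later in this post), you can see each live DataNode together with its capacity and block usage via the dfsadmin report:

$ hdfs dfsadmin -report    # lists live DataNodes with configured capacity, DFS used and remaining space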

HDFS setup

Let’s now set up HDFS for our single-node cluster. We will specify HADOOP_HOME for the hduser we created. Add the lines below to the user's .bashrc (the .bashrc file is at $HOME/.bashrc).

export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=    # set this to your Java home, e.g. /usr/lib/jvm/java-8-oracle if you used the Oracle installer above
export PATH=$PATH:$HADOOP_HOME/bin
            

If you are using multiple users to run Hadoop, repeat the above steps for all of them.
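After editing .bashrc, reload it and confirm the environment is picked up; the version string below assumes the 2.8.1 tarball used in this post:

$ source ~/.bashrc
$ echo $JAVA_HOME    # should print the Java home you configured
$ hadoop version     # should report Hadoop 2.8.1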

We will now make a few necessary configuration changes to the setup. The configuration files for Hadoop can be found under the following directory:

$HADOOP_HOME/etc/hadoop
            

We will start by specifying the address of our NameNode. Edit core-site.xml as below:

<configuration>
    <property>
        <name>fs.defaultFS</name>
        <value>hdfs://localhost:9000</value>
    </property>
</configuration>
            

Remember, this serves two purposes. First, it allows dfs commands to skip the full hdfs:// URI when accessing HDFS locations. Second, DataNodes use this address to send their heartbeats to the NameNode on the port specified in the configuration.

Without this configuration, to list files in the HDFS root we would have to spell out the full URI:

$ hdfs dfs -ls hdfs://localhost:9000/

After adding this configuration, you can simply use the command below to list the files:

$ hdfs dfs -ls /

We now need to tell Hadoop where the NameNode and DataNode directories are. Remember, these directories act as the working storage for the NameNode and DataNode components: they determine where on the local filesystem the DFS NameNode stores the name table (fsimage) and where the DFS DataNode stores its blocks.
Replication is an important aspect of HDFS. The replication factor in hdfs-site.xml determines how many copies of each block of a file are kept across DataNodes. The default replication factor is 3; on a single-node cluster we set it to 1, since there is only one DataNode to hold the copies.

Edit the hdfs-site.xml as below:

<configuration>
    <property>
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/data/nn/</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/data/dn/</value>
    </property>
</configuration>
        

Make sure /data/nn and /data/dn exist in the filesystem and are writable by the hadoop user; one way to create them is shown below. I would encourage you to refer to the documentation for the default values of these configuration properties. We are now ready to start the Hadoop cluster.
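One minimal way to create these directories, assuming the hduser/hadoop user and group from earlier own them (adjust the ownership to whichever user will run the daemons):

$ sudo mkdir -p /data/nn /data/dn
$ sudo chown -R hduser:hadoop /data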
Follow the steps below to start DFS.

  1. Format the filesystem:
    $ hdfs namenode -format
                
  2. Start the dfs:
    $ /usr/local/hadoop/sbin/start-dfs.sh
                
  3. Navigate to http://localhost:50070/ to check out the web interface of the NameNode (50070 is the default NameNode web UI port in Hadoop 2.x).
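You can also confirm that the HDFS daemons came up by listing the running Java processes; jps ships with the JDK, and the process ids will differ on your machine:

$ jps    # expect to see NameNode, DataNode and SecondaryNameNode among the listed processes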

Let us now try to run a MapReduce job on the single-node cluster we just set up. We are going to run a job that counts the words in a file.

1. Create the user directory in hdfs.
$ hdfs dfs -mkdir -p /user/hduser

2. Ship the word file from the local directory to hdfs.
$ hdfs dfs -copyFromLocal /tmp/wordfile.txt /user/hduser

3. Run the MapReduce job from the examples jar provided with the Hadoop installation (we will inspect its output after these steps).
$ /usr/local/hadoop/bin/hadoop jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar wordcount /user/hduser/wordfile.txt /user/hduser/output

4. You can choose to stop hdfs with the below command.
$ /usr/local/hadoop/sbin/stop-dfs.sh
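To check the word counts produced in step 3 before shutting HDFS down, list and print the job output; by default a single reducer writes its results to part-r-00000:

$ hdfs dfs -ls /user/hduser/output
$ hdfs dfs -cat /user/hduser/output/part-r-00000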
            

Please do leave your comment if you found this article useful or you have any additional query about the installation. Cheers!

Important links for further study:

  1. http://hadoop.apache.org/
  2. https://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
