Tuesday, 4 November 2014

Hadoop Single Node Setup & Map-Reduce

This time I would like to share something big, and here I come with "Big Data". Though I won't talk much about Big Data itself, I will explain the single node setup for Hadoop and a MapReduce Java program running on it.

Following are the pre-requisites of running a Single node hadoop:
  • Java 6 or higher.
  • Apache Hadoop 2.5.1 (Download Link)
  • SSH Server.
  • Passwordless SSH login to localhost.
  • Make/Edit ~/.bashrc file.
The first two steps are pretty simple and can be accomplished with ease. At this point I assume that you have downloaded/installed Java and Hadoop.

Note: Hadoop installation is merely a copy-and-paste task, so after downloading the Hadoop tar file, unpack it and put its contents into the following directory: /usr/local/hadoop
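If /usr/local/hadoop does not exist yet, create it first so the copy has a target:
Command: sudo mkdir -p /usr/local/hadoop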
Command: sudo cp -rf ~/Downloads/hadoop-2.5.1/* /usr/local/hadoop
Make your user the owner of the Hadoop directory, i.e. /usr/local/hadoop, as below:
Command: sudo chown <username> /usr/local/hadoop
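Once JAVA_HOME is set (see the .bashrc section below), you can quickly check that everything is in place by printing the Hadoop version:
Command: /usr/local/hadoop/bin/hadoop version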

SSH Server:
Install the SSH Server using the below command:
Command: sudo apt-get install openssh-server openssh-client

Passwordless SSH login to localhost:
To avoid being prompted for a password when logging in to localhost, we have to execute the following commands:

1. Delete the existing SSH directory (warning: this removes any existing SSH keys and configuration):
rm -rf ~/.ssh
2. Generate a new SSH key:
ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
3. Register the generated key as authorized:
cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
4. Log in to localhost:
ssh localhost (should not ask for a password)

Make/Edit ~/.bashrc file:

This is one of the important parts of this setup. Insert the entries below into your .bashrc file if you already have one; otherwise create the .bashrc file yourself and add the entries:

JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
PATH=$PATH:$JAVA_HOME/bin
JRE_HOME=/usr/lib/jvm/java-7-openjdk-amd64
PATH=$PATH:$JRE_HOME/bin
HADOOP_INSTALL=/usr/local/hadoop/
PATH=$PATH:$HADOOP_INSTALL/bin
PATH=$PATH:$HADOOP_INSTALL/sbin
export HADOOP_CLASSPATH=$JAVA_HOME/lib/tools.jar
export JAVA_HOME
export JRE_HOME
export PATH

The entries above set the important environment variables. Once you are done creating/modifying the .bashrc file, close your terminal window and start a new one, as the changes may not be reflected in the current terminal window.
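Alternatively, you can reload the file in the same terminal:
Command: source ~/.bashrc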

Note: As you can see in the entries above, there is an environment variable named HADOOP_INSTALL. Don't be confused by it; it serves the same purpose as HADOOP_HOME, which is deprecated now, so we use HADOOP_INSTALL in place of HADOOP_HOME.


Okay!!! Now we are done with the prerequisites, so let's make the configuration changes necessary for the Hadoop single node setup to work:

1. Move to the directory that contains all the configuration files in the Hadoop installation directory, i.e. /usr/local/hadoop/etc/hadoop

2. Set the JAVA_HOME variable inside the hadoop-env.sh file:

export JAVA_HOME=${JAVA_HOME}
OR
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

3. Add the following lines to core-site.xml (inside the <configuration> element):
<property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
</property>
<property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value>
</property>
As you can see, the value of hadoop.tmp.dir is /usr/local/hadoop/tmp, so we have to create this tmp directory before starting Hadoop.
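Assuming your user already owns /usr/local/hadoop from the earlier chown step (otherwise prefix the command with sudo and chown the directory afterwards):
mkdir -p /usr/local/hadoop/tmp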

4. Add the following lines to hdfs-site.xml:
<property>
    <name>dfs.replication</name>
    <value>1</value>
</property>
<property>
    <name>dfs.namenode.name.dir</name>
    <value>/var/hadoop/namenode</value>
</property>
<property>
    <name>dfs.datanode.data.dir</name>
    <value>/var/hadoop/datanode</value>
</property>
Here we are defining the namenode and datanode directories; these should also be created before we start Hadoop. We also have to make our user the owner of these directories. Below are the commands for the same:

sudo mkdir -p /var/hadoop/namenode
sudo mkdir -p /var/hadoop/datanode
sudo chown -R <username> /var/hadoop

5. There will be a mapred-site.xml.template file inside /usr/local/hadoop/etc/hadoop, so we have to rename (move) it to mapred-site.xml (run this from /usr/local/hadoop):
sudo mv etc/hadoop/mapred-site.xml.template etc/hadoop/mapred-site.xml

Then add the below entries to it:
<property>
    <name>mapred.job.tracker</name>
    <value>localhost:9001</value>
</property>
Now we are done with the configuration files.


HDFS Steps:


Now we have to format the namenode (we are in /usr/local/hadoop):
hadoop namenode -format

To check whether the namenode formatting was successful, look for a message like the one below:
/var/hadoop/namenode has been successfully formatted. (in our case)

After this, start all the necessary services using the command below:

sbin/start-all.sh
<username>@mysystem-desktop /usr/local/hadoop $ sbin/start-all.sh
This script is Deprecated. Instead use start-dfs.sh and start-yarn.sh
Starting namenodes on [localhost]
localhost: starting namenode, logging to /usr/local/hadoop/logs/hadoop-<username>-namenode-xxx-desktop.out
localhost: starting datanode, logging to /usr/local/hadoop/logs/hadoop-<username>-datanode-xxx-desktop.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /usr/local/hadoop/logs/hadoop-<username>-secondarynamenode-xxx-desktop.out
starting yarn daemons
starting resourcemanager, logging to /usr/local/hadoop/logs/yarn-<username>-resourcemanager-xxx-desktop.out
localhost: starting nodemanager, logging to /usr/local/hadoop/logs/yarn-<username>-nodemanager-xxx-desktop.out

You can check the running processes using the jps command as below:

<username>@mysystem-desktop /usr/local/hadoop $ jps
3316 SecondaryNameNode
3506 ResourceManager
3160 DataNode
3932 Jps
3607 NodeManager
3060 NameNode

This tells us about the running services and components. All the components listed above should be running; if not, there is some issue.

Once all the processes have started, you can view the namenode web UI at the URL below: NameNode - http://localhost:50070/. Here you can get all the information about the directories, the datanode, etc.


Setup for Map-Reduce:

To run the Map-Reduce example, we first have to create the input directory in HDFS where we will store the input files on which the map-reduce program will run.

1. Make the HDFS directories required to execute MapReduce jobs:
$ bin/hdfs dfs -mkdir -p /user/<username>/input
2. Copy some input files into the distributed filesystem (these are not the dictionary input files yet, just the Hadoop configuration files used as sample input):
$ bin/hdfs dfs -put etc/hadoop input
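As a quick check, you can list what was copied:
$ bin/hdfs dfs -ls input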
3. Map-Reduce Dictionary example: we are using the example from the Map-Reduce Example link mentioned before, where the whole procedure is divided into the steps below:

1. Go to the links below and download the dictionary files for Italian, French and Spanish:
2. Merge all three files into one file with the commands below:
cat French.txt >> fulldictionary.txt
cat Italian.txt >> fulldictionary.txt
cat Spanish.txt >> fulldictionary.txt
3. Copy this file, i.e. fulldictionary.txt, to the Hadoop file system input directory with the command below (assuming you are in /usr/local/hadoop):
bin/hdfs dfs -copyFromLocal ~/<your_path>/fulldictionary.txt input
4. In the Map-Reduce Example there is one Dictionary.java file containing the map-reduce logic, which you have to compile, package into a jar and run using the commands below:
Compilation:
javac -classpath $HADOOP_INSTALL/share/hadoop/common/hadoop-common-2.5.1.jar:$HADOOP_INSTALL/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.5.1.jar:$HADOOP_INSTALL/share/hadoop/common/lib/commons-cli-1.2.jar Dictionary.java
JAR Creation:
jar -cvf dc.jar Dict*.class
JAR Execution:
hadoop jar dc.jar Dictionary input output
Note: Before compiling the Dictionary.java file, make the changes below in it:
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
Instead of hard-coding the input and output paths, take them from the command line arguments as shown above.
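For reference, below is a minimal sketch of what the driver part of Dictionary.java might look like after this change. The Mapper/Reducer classes and the Text/Text output types are assumptions based on the linked example; keep whatever the original file defines.

// Hypothetical driver sketch -- the actual Mapper and Reducer come from
// the linked Map-Reduce Example and are omitted here.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Dictionary {

    // Mapper and Reducer classes from the original example go here.

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "dictionary");
        job.setJarByClass(Dictionary.class);

        // job.setMapperClass(...) and job.setReducerClass(...) as in the example.
        // Text/Text output types are an assumption; use whatever the example declares.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Take the input and output paths from the command line
        // instead of hard-coding them.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}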
Once you are done with the JAR execution, check the output directory in HDFS for the result. Please note: do not create the output directory yourself; it will be created by the map-reduce job itself, otherwise you will get an error saying the output directory already exists.
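To have a quick look at the result from the command line, list the output directory and print the reducer output file (usually named part-r-00000, but check the listing first):
bin/hdfs dfs -ls output
bin/hdfs dfs -cat output/part-r-00000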

Voila!!! The End. I hope this tutorial will be helpful for beginners.
