Set up Hadoop Cluster on VirtualBox machines running CentOS 7
In this tutorial I’m going to show you how to create your own Hadoop cluster with VirtualBox, using three CentOS virtual machines.
1. Setting up
First, you’ll have to create a virtual machine on VirtualBox and install CentOS 7 on it. On our new virtual machine, the first thing we want to do is set up the network. For that, we’re going to use the command:
nmtui
This opens a text-based network configuration tool. Let’s choose “Edit a connection” and select our interface. It should have a name like “enp0s3”, at least that’s what I have on my VM.
You should set the IPv4 and IPv6 configurations to Automatic and check “Automatically connect” (toggle with Space). Then just press “OK”. If everything went fine, you should now be able to ping “1.1.1.1”, Cloudflare’s DNS resolver. When you’re satisfied, just press Ctrl + C to exit.
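For example:
ping 1.1.1.1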
Now we’re going to install Java. For that you simply have to type:
yum install java-1.8.0-openjdk-devel -y
2. Installing Hadoop
It’s recommended that we create a dedicated user to run Hadoop, with sudo privileges through the wheel group, so we’ll do that with the following commands:
adduser hadoop
passwd hadoop
usermod -aG wheel hadoop
Next, we’re going to create an SSH key and add it to the authorized keys to enable passwordless SSH connections to our machine, which Hadoop’s startup scripts rely on. First, switch to the hadoop user:
su - hadoop
And then just using the following commands:
ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
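You can check that the key works by connecting to the machine itself (accept the host fingerprint when prompted, then type “exit” to return):
ssh localhost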
Now, to download Hadoop, we need to get the URL from the releases page: https://hadoop.apache.org/releases.html. Once there, select the binary for the latest release and the next page will present you with the download URL. Using the curl and tar commands, we download Hadoop from that URL and extract it:
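In my case the version was 3.1.2, so the commands looked roughly like this (the mirror URL you get may differ; the archive URL below is just one known location):
curl -O https://archive.apache.org/dist/hadoop/common/hadoop-3.1.2/hadoop-3.1.2.tar.gz
tar -xzf hadoop-3.1.2.tar.gz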
I’m going to rename my hadoop folder to simply “hadoop” (purely a matter of convenience):
mv hadoop-3.1.2 hadoop
Moving on, we need to set some environment variables in our .bashrc file. That can be done by opening it with vi:
vi ~/.bashrc
Inside vi, you just press “a” (the append shortcut) on your keyboard and add the following to the bottom of the file:
export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin
Now, note that in my case the folder is just called “hadoop”, but that may not be the case if you didn’t change it!
Then just press Esc and write “:wq” (for write and quit). That should save the file. However, to apply the changes to the current shell session, we have to use the command:
source ~/.bashrc
To make sure everything is set right we can use:
ls $HADOOP_HOME
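You should see the contents of the Hadoop distribution, something like:
LICENSE.txt  NOTICE.txt  README.txt  bin  etc  include  lib  libexec  sbin  share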
If you don’t get a similar listing, you should go back and review your ~/.bashrc file.
3. Setting up Hadoop
Now that we have Hadoop on our machine, we should configure it. This is where we point it at the Java we installed at the beginning of this tutorial.
First things first, we should navigate to the folder where all the configuration files should be by using:
cd $HADOOP_HOME/etc/hadoop
The first file to edit is called “hadoop-env.sh”, and we’re going to use vi again:
vi hadoop-env.sh
We only need to add the line:
export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk
It should be irrelevant where you put it, as long as it’s on its own line. Then just write and quit vi. Finally, we open and write the main configuration files in this same directory: core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml.
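Here’s a minimal sketch of what each file can contain for a cluster like the one we’re building, with a node named “master” (see the multi-node section below). The port, data directories and replication factor are assumptions, so adjust them to your setup. Each snippet goes between the <configuration> tags of its file.
In core-site.xml (the address of the HDFS NameNode):
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://master:9000</value>
</property>
In hdfs-site.xml (where HDFS stores its data, and how many copies of each block to keep):
<property>
  <name>dfs.namenode.name.dir</name>
  <value>/home/hadoop/hadoop/data/namenode</value>
</property>
<property>
  <name>dfs.datanode.data.dir</name>
  <value>/home/hadoop/hadoop/data/datanode</value>
</property>
<property>
  <name>dfs.replication</name>
  <value>2</value>
</property>
In mapred-site.xml (run MapReduce jobs on YARN):
<property>
  <name>mapreduce.framework.name</name>
  <value>yarn</value>
</property>
In yarn-site.xml (where the ResourceManager lives, plus the shuffle service MapReduce needs):
<property>
  <name>yarn.resourcemanager.hostname</name>
  <value>master</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>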
With that done, we return to our home directory with the command “cd”.
4. Setting Up Multi-node
To use a Multi-node setup we must define the nodes in our hosts file, which can be achieved using vi once more:
vi /etc/hosts
All we need to add are the nodes’ names and respective IP addresses. In my case I’m using three nodes, one “master” and two “workers”:
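These addresses are hypothetical, sitting on VirtualBox’s default host-only network (192.168.56.0/24); pick whatever fits your setup:
192.168.56.10 master
192.168.56.11 slave1
192.168.56.12 slave2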
Now we’re going to create those machines. To begin, we have to clone the “master” machine twice (in VirtualBox, right-click the VM and choose “Clone…”).
You should now have three virtual machines. For them to be able to communicate with each other, we’re going to open VirtualBox’s Network Settings:
On “Network”, we select “Adapter 2”, enable it and attach it to a Host-Only Adapter. Let’s set “Promiscuous Mode” to “Allow All”, and be careful that the MAC addresses are different from each other.
Each virtual machine should now be given its own hostname, one of the names we defined in the hosts file. Log in with the hadoop account directly for this part.
The rest is simpler than what we’ve already done with vi:
vi /etc/hostname
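On the “master” machine, for example, the file should contain a single line:
master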
To give each machine its static IP address on the new adapter, we have to use nmtui like at the start of this tutorial:
nmtui
In the new connection, “Wired Connection 1” in my case, we’ll simply set IPv4 to “Manual”, press “Show” and input the IP address for the respective machine followed by “/24”.
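On my “master” VM, with the hypothetical addresses from the hosts file above, that would be:
Addresses: 192.168.56.10/24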
Simply press “OK” and reboot. Rinse and repeat for each VM. We should now have a network between our machines:
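You can confirm it from the “master” machine by pinging a worker (these hostnames assume the hosts file shown earlier):
ping slave1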
Now let’s inform Hadoop about the existing nodes. On our “master” machine we should use the “hadoop” user to edit the workers file:
vi $HADOOP_HOME/etc/hadoop/workers
In this file we need to write the machines that will do the work:
master
slave1
slave2
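Before the very first start, the NameNode has to be formatted on the “master” machine (careful: this erases any existing HDFS metadata):
hdfs namenode -format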
Finally, with all the machines on, we start our cluster by using the following commands on the “master” machine:
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh
But don’t take my word for it; open one of your “worker” machines and use the jps command:
jps
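On a worker, the output should look something like this (the process IDs are hypothetical):
2354 DataNode
2487 NodeManager
2690 Jps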
If you see “NodeManager” and “DataNode” there, your cluster is up and running 🙂