Set up Hadoop Cluster on VirtualBox machines running CentOS 7

Mar 20, 2019

In this tutorial I’m going to show you how to create your own Hadoop cluster with VirtualBox, using three CentOS virtual machines.

1. Setting up

First, you’re going to have to create a virtual machine in VirtualBox and install CentOS 7 on it. On our new virtual machine, the first thing we want to do is set up the network. For that, we’re going to use the command:

nmtui

It should open something like this:

Let’s select “Edit a connection” and pick the connection to edit. It should have a name like “enp0s3”, at least that’s what I have on my VM.

You should set the IPv4 and IPv6 configurations to Automatic and select “Automatically connect” (select with Space). Then just press “OK”. If all went well you should now be able to ping “1.1.1.1”, Cloudflare’s DNS server. When you’re satisfied just press Ctrl + C to stop the ping.

Now we’re going to install Java. For that you simply have to type:

yum install java-1.8.0-openjdk-devel -y
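To confirm the install worked, and to see which directory the JDK landed in (this becomes useful later, when Hadoop needs JAVA_HOME), a small helper like the one below can help. It's only a sketch: the function name `java_home_from_javac` is my own invention, and all it does is strip the trailing `/bin/javac` from whatever path it is given.

```shell
# Hypothetical helper (not part of Hadoop or the JDK): given the full path
# to the javac binary, print the directory that would serve as JAVA_HOME.
java_home_from_javac() {
    local javac_path="$1"
    # Drop the trailing "/bin/javac" to get the JDK root directory.
    printf '%s\n' "${javac_path%/bin/javac}"
}

# Usage on the VM (resolve symlinks first, then strip the suffix):
#   java -version
#   java_home_from_javac "$(readlink -f "$(command -v javac)")"
```

The path it prints may be a longer, versioned folder name than the one used later in this tutorial; both should point at the same JDK as long as one is a symlink to the other.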

2. Installing Hadoop

It’s recommended that we create a dedicated user with sudo rights to run Hadoop, so we’ll do that by using the following commands:

adduser hadoop
passwd hadoop
usermod -aG wheel hadoop

Next we’re going to create a secure shell key and add it to “authorized keys” to enable an ssh connection to our machine. We do that as the hadoop user, so switch to it first:

su - hadoop

And then just using the following commands:

ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
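If ssh still asks for a password afterwards, file permissions are a common culprit. The sketch below checks for the conventional modes (700 on the `.ssh` folder, 600 on `authorized_keys`). The function name `check_ssh_perms` is hypothetical, and it relies on GNU `stat`, which is what CentOS ships.

```shell
# Hypothetical helper: report whether an SSH directory has the conventional
# permissions (700 for the directory, 600 for authorized_keys).
check_ssh_perms() {
    local ssh_dir="${1:-$HOME/.ssh}"
    local dir_mode file_mode
    # stat -c '%a' prints the octal permission bits (GNU coreutils).
    dir_mode=$(stat -c '%a' "$ssh_dir") || return 1
    file_mode=$(stat -c '%a' "$ssh_dir/authorized_keys") || return 1
    if [ "$dir_mode" = "700" ] && [ "$file_mode" = "600" ]; then
        echo "ok"
    else
        echo "fix permissions: $ssh_dir is $dir_mode, authorized_keys is $file_mode"
        return 1
    fi
}
```

Once `check_ssh_perms` prints "ok", running `ssh localhost` should log you in without asking for the hadoop password (the first time it will ask you to confirm the host key).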

Now, to download the latest version of Hadoop, we need to get the URL from the website: https://hadoop.apache.org/releases.html. Once there, select the latest binary and the following page will present you with the URL. In my case it looks like this:

Using the curl and tar commands we download hadoop from that URL and extract it:

I’m going to rename my hadoop folder to simply “hadoop” (it’s a matter of preference):

mv hadoop-3.1.2 hadoop

Moving on we need to set some constants in our .bashrc file. That can be done by opening it with vi:

vi ~/.bashrc

Inside vi, you just press “a” (the append shortcut) on your keyboard and add the following to the bottom of the file:

export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Now, note that in my case the folder is just called “hadoop”, but that may not be the case if you didn’t change it!

Then just press Esc and write “:wq” (for write and quit). That should save the file. However, to change the current environment, we have to use the command:

source ~/.bashrc

To make sure everything is set right we can use:

ls $HADOOP_HOME

In case you didn’t get the same results as I did you should go back and review your ~/.bashrc file.
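Since the listing can be hard to judge by eye, here is a rough check you could paste into the shell instead. It's a sketch under my own assumptions: `check_hadoop_home` is a made-up name, and it only looks for the sub-folders this tutorial actually relies on (`bin`, `sbin` and `etc/hadoop`).

```shell
# Hypothetical helper: confirm that a directory looks like an unpacked
# Hadoop distribution by checking for the sub-folders this tutorial uses.
check_hadoop_home() {
    local home_dir="${1:-$HADOOP_HOME}"
    local missing=0 sub
    if [ -z "$home_dir" ]; then
        echo "HADOOP_HOME is not set"
        return 1
    fi
    for sub in bin sbin etc/hadoop; do
        if [ ! -d "$home_dir/$sub" ]; then
            echo "missing: $home_dir/$sub"
            missing=1
        fi
    done
    [ "$missing" -eq 0 ] && echo "looks good: $home_dir"
    return "$missing"
}
```

Calling `check_hadoop_home` with no argument uses the current value of `$HADOOP_HOME`, so it also catches the case where `source ~/.bashrc` was skipped.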

3. Setting up Hadoop

Now that we have hadoop on our machine we should configure it. This is where we use the Java that we installed at the beginning of this tutorial.

First things first, we should navigate to the folder where all the configuration files should be by using:

cd $HADOOP_HOME/etc/hadoop

The first file to edit is called “hadoop-env.sh” and we’re going to use vi again:

vi hadoop-env.sh

We only need to add the line:

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk

It should be irrelevant where you put it, as long as it’s on a new line. Then just write and quit vi. Finally, open and edit the following files:

With that done we return to our default directory with the command “cd”:

4. Setting Up Multi-node

To use a Multi-node setup we must define the nodes in our hosts file, which can be achieved using vi once more:

sudo vi /etc/hosts

All we need to add are the nodes’ names and respective IP addresses. In my case I’m using three nodes, one “master” and two “workers”:
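As an illustration of what those entries can look like, here is a small sketch that appends them from the shell. Everything specific in it is an assumption on my part: the helper name `add_host_entry` is hypothetical, the node names are borrowed from the workers file used later in this tutorial, and the IP addresses are placeholders you should replace with your own.

```shell
# Hypothetical helper: append "IP name" to a hosts file unless that name
# is already listed, so running it twice does not create duplicates.
add_host_entry() {
    local hosts_file="$1" ip="$2" name="$3"
    # Match the name as a whole field: preceded by whitespace and followed
    # by whitespace or the end of the line.
    if grep -qE "[[:space:]]${name}([[:space:]]|\$)" "$hosts_file" 2>/dev/null; then
        echo "already present: $name"
    else
        printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
        echo "added: $ip $name"
    fi
}

# Example with placeholder addresses (assumptions, replace with your own).
# Run as root, since /etc/hosts is not writable by regular users:
#   add_host_entry /etc/hosts 192.168.56.101 master
#   add_host_entry /etc/hosts 192.168.56.102 slave1
#   add_host_entry /etc/hosts 192.168.56.103 slave2
```

Each line in the hosts file is simply an IP address followed by the name it should resolve to, which is all the helper writes.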

Now we’re going to create those machines. To begin we have to clone the “master” machine twice:

Using the Right-Click, we select the “Clone” option

After choosing the name and folder for the new machine, VirtualBox will start cloning

You should now have three virtual machines. For them to be able to communicate with each other, we’re going to open VirtualBox’s Network Settings:

Using Right-Click again, we select “Settings”

On “Network”, we select “Adapter 2”, enable it and select Host-Only Adapter. Let’s set “Promiscuous Mode” to “Allow All” and be careful that the MAC Addresses are different from each other:

Each virtual machine needs its own hostname, matching one of the names we defined in the hosts file. Log in with the hadoop account directly for this part.

Setting it is simple and, once again, done with vi (with sudo, since the hadoop user can’t write to this file otherwise):

sudo vi /etc/hostname

To further configure that, we have to use nmtui like at the start of this tutorial:

nmtui

In the new connection, “Wired Connection 1” in my case, we’ll simply set IPv4 to “Manual”, press “Show” and input the IP Address for the respective machine followed by “/24”. This is on my “master” VM:

Simply press “OK” and reboot. Rinse and repeat for each VM. We should now have a network between our machines:

Now let’s inform Hadoop about the existing nodes. On our “master” machine we should use the “hadoop” user to edit the workers file:

vi $HADOOP_HOME/etc/hadoop/workers

In this file we need to list the machines that will do the work, one per line:

master
slave1
slave2
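The start scripts reach each of these machines over ssh, so it's worth checking that every name in the file can be contacted before going further. Below is a sketch of how that could look. `list_workers` is a name I made up, and it simply filters out blank lines and lines starting with "#".

```shell
# Hypothetical helper: print the host names from a workers file,
# skipping blank lines and lines that start with "#".
list_workers() {
    grep -vE '^[[:space:]]*(#|$)' "$1"
}

# Usage on the master, as the hadoop user. BatchMode makes ssh fail
# instead of prompting, so a missing key shows up as an error:
#   for host in $(list_workers "$HADOOP_HOME/etc/hadoop/workers"); do
#       ssh -o BatchMode=yes "$host" hostname || echo "cannot reach $host"
#   done
```

An error here can also mean the host key was never accepted; connecting once by hand with a plain `ssh slave1` takes care of that.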

Finally, with all the machines on, we start our cluster by using the following commands on the “master” machine:

$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh

But don’t take my word for it, open one of your “worker” machines and use the jps command:

jps

If you see “NodeManager” and “DataNode” there, your cluster is up and running 🙂
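If you'd rather not read the jps output by eye on every machine, the check can be scripted. This is a minimal sketch, with `has_daemons` being a hypothetical helper of my own: it reads jps output from standard input and looks for each daemon name passed as an argument.

```shell
# Hypothetical helper: read jps output on stdin and check that every
# daemon name given as an argument appears in it as a whole word.
has_daemons() {
    local output name status=0
    output=$(cat)
    for name in "$@"; do
        if printf '%s\n' "$output" | grep -qw "$name"; then
            echo "running: $name"
        else
            echo "not found: $name"
            status=1
        fi
    done
    return "$status"
}

# Usage on a worker:
#   jps | has_daemons DataNode NodeManager
```

On the “master” machine you would typically also look for NameNode and ResourceManager, although which daemons run where depends on your configuration files.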

Written by Tiago Lucas