Set up Hadoop Cluster on VirtualBox machines running CentOS 7

Tiago Lucas
6 min read · Mar 20, 2019

In this tutorial I’m going to show you how to create your own Hadoop cluster with VirtualBox, using three CentOS virtual machines.

1. Setting up

First, you’ll have to create a virtual machine in VirtualBox and install CentOS 7 on it. On our new virtual machine, the first thing we want to do is set up the network. For that, we’re going to use the command:

nmtui

It should open something like this:

Network Manager start screen

Let’s select “Edit a connection” and select one. It should have a name like “enp0s3”, at least that’s what I have on my VM.

You should set the IPv4 and IPv6 configurations to Automatic and select “Automatically connect” (select with Space). Then just press “OK”. If all went well, you should now be able to ping “1.1.1.1”, Cloudflare’s DNS server. When you’re satisfied, just press Ctrl + C to stop the ping.

Now we’re going to install Java. For that you simply have to type:

yum install java-1.8.0-openjdk-devel -y

2. Installing Hadoop

It’s recommended that we create a dedicated “super user” (a user with sudo rights) to run Hadoop, so we’ll do that by using the following commands:

adduser hadoop
passwd hadoop
usermod -aG wheel hadoop
Creating the Super User “hadoop”

Next, we’re going to create an SSH key and add it to the “authorized keys”, so that SSH connections to our machine work without a password. That has to be done as the hadoop user, so switch to it first:

su - hadoop

And then just using the following commands:

ssh-keygen -t rsa
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
Generating and authorizing an ssh key

Now, to download the latest version of Hadoop, we need to get the URL from the website: https://hadoop.apache.org/releases.html. Once there, select the latest binary and the following page will present you the URL. In my case it pointed to Hadoop 3.1.2.

Using the curl and tar commands we download hadoop from that URL and extract it:
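Here’s a minimal sketch of that step. It assumes Hadoop 3.1.2 (the version in the folder name used right after this) and the Apache archive URL layout. The names HADOOP_VERSION, HADOOP_TARBALL, HADOOP_URL and the helper fetch_hadoop are placeholders of mine, so swap in the URL the releases page gave you:

```shell
# Assumption: version 3.1.2 and the Apache archive URL layout.
# Replace HADOOP_URL with the link shown on the releases page.
HADOOP_VERSION="3.1.2"
HADOOP_TARBALL="hadoop-${HADOOP_VERSION}.tar.gz"
HADOOP_URL="https://archive.apache.org/dist/hadoop/common/hadoop-${HADOOP_VERSION}/${HADOOP_TARBALL}"

# Hypothetical helper: download the tarball into the current directory and unpack it.
fetch_hadoop() {
  # -f: fail on HTTP errors, -L: follow redirects, -O: keep the remote file name
  curl -fLO "$HADOOP_URL" || return 1
  # -x: extract, -z: gunzip, -f: read from the given file
  tar -xzf "$HADOOP_TARBALL"
}
```

Calling fetch_hadoop from the hadoop user’s home directory should leave a “hadoop-3.1.2” folder there.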

I’m going to rename my Hadoop folder to simply “hadoop” (it’s a matter of preference):

mv hadoop-3.1.2 hadoop

Moving on, we need to set some environment variables in our .bashrc file. That can be done by opening it with vi:

vi ~/.bashrc
Vi text editor

Inside vi, you just press “a” (the append shortcut) on your keyboard and add the following to the bottom of the file:

export HADOOP_HOME=/home/hadoop/hadoop
export HADOOP_INSTALL=$HADOOP_HOME
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export PATH=$PATH:$HADOOP_HOME/sbin:$HADOOP_HOME/bin

Now, note that in my case the folder is just called “hadoop”, but that may not be the case if you didn’t change it!

Then just press Esc and write “:wq” (for write and quit). That should save the file. However, to apply the changes to the current shell session, we have to use the command:

source ~/.bashrc

To make sure everything is set right we can use:

ls $HADOOP_HOME

You should see Hadoop’s folders listed, such as bin, etc, lib, sbin and share. If you don’t, go back and review your ~/.bashrc file.

3. Setting up Hadoop

Now that we have hadoop on our machine we should configure it. This is where we use the Java that we installed in the beginning of this tutorial.

First things first, we should navigate to the folder where all the configuration files should be by using:

cd $HADOOP_HOME/etc/hadoop

The first file to edit is called “hadoop-env.sh” and we’re going to use vi again:

vi hadoop-env.sh

We only need to add the line:

export JAVA_HOME=/usr/lib/jvm/java-1.8.0-openjdk

It doesn’t matter where you put it, as long as it’s on a new line. Then just write and quit vi. Finally, open and edit the following files in the same way:

core-site.xml
hdfs-site.xml
mapred-site.xml
yarn-site.xml
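What goes into those files depends on your cluster. As a rough sketch only, the snippet below writes a minimal set of properties. It assumes the NameNode and ResourceManager both live on the host called “master”, that HDFS listens on port 9000, and that each block is replicated twice. The helpers write_site_file and write_cluster_config and all the values are my assumptions; the property names are standard Hadoop settings.

```shell
# Hypothetical helper: write a Hadoop *-site.xml file from name/value pairs.
# Usage: write_site_file <file> <name> <value> [<name> <value> ...]
write_site_file() {
  file="$1"; shift
  {
    echo '<?xml version="1.0"?>'
    echo '<configuration>'
    while [ "$#" -ge 2 ]; do
      printf '  <property>\n    <name>%s</name>\n    <value>%s</value>\n  </property>\n' "$1" "$2"
      shift 2
    done
    echo '</configuration>'
  } > "$file"
}

# Hypothetical helper: write all four files into the given config folder.
# Every value here is an assumption; adjust them to your own cluster.
write_cluster_config() {
  conf_dir="$1"
  # Where clients find HDFS (NameNode on "master", port 9000 assumed)
  write_site_file "$conf_dir/core-site.xml" fs.defaultFS hdfs://master:9000
  # Number of copies kept of each HDFS block
  write_site_file "$conf_dir/hdfs-site.xml" dfs.replication 2
  # Run MapReduce jobs on YARN
  write_site_file "$conf_dir/mapred-site.xml" mapreduce.framework.name yarn
  # ResourceManager host and the shuffle service MapReduce needs
  write_site_file "$conf_dir/yarn-site.xml" \
    yarn.resourcemanager.hostname master \
    yarn.nodemanager.aux-services mapreduce_shuffle
}
```

Calling write_cluster_config "$HADOOP_HOME/etc/hadoop" overwrites the four files in place, so keep a copy of the originals if you want to compare.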

With that done we return to our default directory with the command “cd”:

“pwd” means “print working directory”

4. Setting Up Multi-node

To use a Multi-node setup we must define the nodes in our hosts file, which can be achieved using vi once more:

sudo vi /etc/hosts

All we need to add are the nodes’ names and respective IP addresses. In my case I’m using three nodes, one “master” and two “workers”:

hosts file
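For illustration, the entries could look like the sketch below. The 192.168.56.x addresses are an assumption (VirtualBox’s usual default host-only range) and add_cluster_hosts is a hypothetical helper, so use the addresses you plan to give your own machines:

```shell
# Hypothetical helper: append the three cluster nodes to a hosts file.
# The IP addresses are assumptions; pick the ones you will assign to each VM.
add_cluster_hosts() {
  hosts_file="$1"
  cat >> "$hosts_file" <<'EOF'
192.168.56.101 master
192.168.56.102 slave1
192.168.56.103 slave2
EOF
}
```

From a root shell you would run add_cluster_hosts /etc/hosts. The existing localhost lines stay untouched because the helper only appends.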

Now we’re going to create those machines. To begin we have to clone the “master” machine twice:

Using the Right-Click, we select the “Clone” option
After choosing the name and folder to the new machine, VirtualBox will start cloning

You should now have three virtual machines. For them to be able to communicate with each other, we’re going to open VirtualBox’s Network Settings:

Using Right-Click again, we select “Settings”

On “Network”, we select “Adapter 2”, enable it and select “Host-Only Adapter”. Let’s set “Promiscuous Mode” to “Allow All” and make sure that the MAC addresses are different from each other:

Each virtual machine needs its own hostname, taken from the ones we defined in the hosts file. Log in with the hadoop account directly for this part.

The rest is similar to what we’ve already done with vi:

sudo vi /etc/hostname

To further configure that, we have to use nmtui like at the start of this tutorial:

nmtui

In the new connection, “Wired Connection 1” in my case, we’ll simply set IPv4 to “Manual”, press “Show” and input the IP Address for the respective machine followed by “/24”. This is on my “master” VM:

Simply press “OK” and reboot. Rinse and repeat for each VM. We should now have a network between our machines:
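A quick way to confirm that is to ping every node by name from each machine. The sketch below does that with a hypothetical helper, check_nodes, and assumes the hostnames master, slave1 and slave2 are already in /etc/hosts:

```shell
# Hypothetical helper: ping each host once and report which ones answer.
# Returns 0 only if every host replied.
check_nodes() {
  failed=0
  for host in "$@"; do
    # -c 1: send a single packet instead of pinging forever
    if ping -c 1 "$host" >/dev/null 2>&1; then
      echo "$host: reachable"
    else
      echo "$host: NOT reachable"
      failed=1
    fi
  done
  return "$failed"
}
```

Run it as check_nodes master slave1 slave2. If a node shows up as not reachable, recheck its IP address in nmtui and its line in the hosts file.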

Now let’s inform Hadoop about the existing nodes. On our “master” machine we should use the “hadoop” user to edit the workers file:

vi $HADOOP_HOME/etc/hadoop/workers

In this file we need to list the hostnames of the machines that will do the work, one per line:

master
slave1
slave2

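Finally, with all the machines on, we start our cluster from the “master” machine. On a brand-new cluster the NameNode has to be formatted once beforehand, with “hdfs namenode -format”. After that, the start scripts in Hadoop’s sbin folder bring everything up:

$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh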

But don’t just take my word for it: open one of your “worker” machines and use the jps command:

jps

If you see “NodeManager” and “DataNode” there, your cluster is up and running 🙂
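If you’d rather script that check than read the list by eye, here is a small sketch. The helper name check_worker_daemons is mine, and it only assumes that jps prints one “pid name” pair per line:

```shell
# Hypothetical helper: succeed only if jps lists both worker daemons.
check_worker_daemons() {
  running=$(jps) || return 1
  for daemon in DataNode NodeManager; do
    # jps prints "<pid> <name>" per line, so compare against the second column
    if ! printf '%s\n' "$running" | awk '{print $2}' | grep -qx "$daemon"; then
      echo "$daemon is not running"
      return 1
    fi
  done
  echo "DataNode and NodeManager are running"
}
```

Run check_worker_daemons on each worker. A non-zero exit status means one of the two daemons is missing from the jps output.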

Written by Tiago Lucas

Enthusiastically curious and raised with entrepreneurship values, I often wander outside my comfort zone to explore other topics.
