Compile and build specific Hadoop source code branch using Azure VM

Sometimes you may want to test a Hadoop feature that is available in a specific branch that is not available as a binary release. For example, in my case, I want to try accessing Azure Data Lake Store (ADLS) via its WebHDFS endpoint. Access to ADLS requires OAuth2, support for which was added in Hadoop 2.8 (HDFS-8155) but is not available in the current Hadoop 2.7.x releases.

Hadoop source code is available in this mirrored GitHub repo https://github.com/apache/hadoop. Version 2.8 specific code is available in the branch appropriately called “branch-2.8”

Image for post
Image for post

Deploy Azure VM with Ubuntu 14.04-LTS

As is described in the Building instructions for Hadoop, “the easiest way to get an environment with all the appropriate tools is by means of the provided Docker config” (for Linux or Mac). Since my primary laptop is running Windows 10, I will deploy a Ubuntu 14.04 LTS virtual machine in my Azure subscription, use it to build Hadoop 2.8 binary tar.gz file, download the resultant file, and delete the VM once I am done.

I am using Standard_DS2 VM size created from Canonical Ubuntu 14.04 LTS Azure gallery image https://portal.azure.com/#create/Canonical.UbuntuServer1404LTS-ARM

Image for post
Image for post

Install Docker on Ubuntu 14.04

After the VM is deployed, I SSH into it using its public IP and quickly install Docker following the instructions for Ubuntu 14.04 from https://docs.docker.com/engine/installation/linux/ubuntulinux/

By default, I am not able to run “docker run hello-world” using my user account (i.e. azureuser) without using sudo. When I try it, I get back this message “docker: Cannot connect to the Docker daemon. Is the docker daemon running on this host?” This happens because by default docker daemon’s Unix socket is owned by the user root and other users can access it only with sudo.

To enable azureuser to run docker without sudo, I follow the instructions from Docker to create group called “docker”, add my user to that group, logout, log back in, and try docker run again.

After logging back in, I now can run “docker run hello-world” without problems.

Clone Hadoop 2.8 Branch

Since I want to compile specifically the branch called “branch-2.8”, I use Git to clone only that specific branch to my home directory (/home/azureuser/hadoop-2.8) using this command:

Image for post
Image for post

Start Docker Container with Hadoop Build Environment

Following instructions from https://github.com/apache/hadoop/blob/trunk/BUILDING.txt, I start the Hadoop build environment using the provided script:

Image for post
Image for post

This process will take some time (~5–10 min) since it installs all of the required build environment tools (JDK, Maven, etc.) in the container.

Building Hadoop within the Docker Container

After the creation process is finished, I see my Hadoop Dev docker container running.

Image for post
Image for post

I try to start the Maven binary distribution build without native code, without running the tests, and without documentation.

Resolving Permissions Error

However, I get a permissions error regarding the /home/azureuser/.m2 directory (used by Maven).

Image for post
Image for post

To fix this problem, I exit the docker container, and set the ownership of the /home/azureuser/.m2 directory to azureuser:azureuser.

Image for post
Image for post

Restarting Container and Starting Maven Build

After the permission problem is resolved, I restart the docker container:

Once within the container, I again try to start the Maven build and package:

Image for post
Image for post

This process will take some time to complete. For me, on the Standard_DS2 Azure VM, it took about 9 minutes.

Image for post
Image for post

Download Binary Distribution File

After the build process is complete, the resultant files are found in the hadoop-dist/target directory.

Image for post
Image for post

I download the hadoop-dist-2.8.0-SNAPSHOT.tar.gz (200MB) file to my local machine from the Ubuntu Azure VM (e.g. using WinSCP, MobaXterm SFTP, etc.).

I also store this file as a block blob in a Azure Storage container so that I can quickly download it from there without rebuilding (https://avdatarepo1.blob.core.windows.net:443/hadoop/hadoop-2.8.0-SNAPSHOT.tar.gz)

Once I have the binary distribution file ready, I can go ahead and delete my Azure VM.

Conclusion

It is very convenient and quick to be able to use an Azure VM running Ubuntu 14.04-LTS and Docker to setup the temporary Hadoop build environment. Although in this case I specifically built the “branch-2.8” branch, the same process can be used to build other Hadoop branches (or trunk) from source.

I’m looking forward to your feedback and questions via Twitter https://twitter.com/ArsenVlad

Originally published at blogs.msdn.microsoft.com on August 2, 2016.

Written by

Principal Engineer / Architect, FastTrack for Azure at Microsoft

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store