Accessing Azure Data Lake Store using WebHDFS with OAuth2 from Spark 2.0 that is running locally

Update from March 2017: Since posting this article in August 2016, the Azure Data Lake product team has published three new and highly recommended blog posts:

  1. Connecting your own Hadoop or Spark to Azure Data Lake Store
  2. Making Azure Data Lake Store the default file system in Hadoop
  3. Wiring your older Hadoop clusters to access Azure Data Lake Store

I highly recommend reviewing the official guidance above instead of my experimental and dated swebhdfs approach below.

In an earlier blog post, I described how to access Azure Storage Blobs from Spark running locally. This time, I want to try accessing Azure Data Lake Store (in preview as of August 2016) using its WebHDFS-compatible REST API from local Hadoop client tools and a local Spark installation.

Although Azure Data Lake Store (ADLS) can be accessed via a WebHDFS endpoint (i.e. swebhdfs://), the most performant and recommended way to access it is via the new ADL file system implementation that is being built into Hadoop (HADOOP-12666 and HADOOP-13037). This approach extends the Hadoop file system class with an AdlFileSystem implementation for accessing ADLS using the scheme adl://<accountname>.azuredatalakestore.net/path/to/file.

However, in this article, I specifically want to try accessing ADLS using the more generic swebhdfs:// scheme since my use case is very basic and this is really just a proof-of-concept.

Azure Data Lake Store (ADLS) uses Azure Active Directory (Azure AD) for authentication and access control lists. Therefore, each WebHDFS REST API request to ADLS must include an Authorization header with a Bearer access token issued by Azure AD via OAuth2. To obtain this access token from Azure AD, the Hadoop client and Spark need support for WebHDFS with OAuth2. Luckily, the Hadoop 2.8 branch includes this feature: “Support OAuth2 in WebHDFS” (HDFS-8155).
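For illustration, this is roughly what such a request looks like when issued by hand against the account used later in this post (a sketch only; YOUR_ACCESS_TOKEN is a placeholder for a token obtained from Azure AD):

curl -i "https://avdatalake2.azuredatalakestore.net/webhdfs/v1/?op=LISTSTATUS" -H "Authorization: Bearer YOUR_ACCESS_TOKEN"

The Hadoop WebHDFS client builds and sends requests like this for us; the rest of this post is about configuring it to obtain and attach that token automatically.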

Get Hadoop 2.8 Binary

I wrote another article in which I describe how to compile and build a specific Hadoop source code branch using an Azure VM. If you don’t want to try that simple build process yourself (though I do recommend you try it), at the bottom of that post I link to a hadoop-2.8.0-SNAPSHOT.tar.gz file that you can download directly.

Download Spark without Hadoop

From the Apache Spark downloads page, download the spark-2.0.0-bin-without-hadoop.tgz binary package (the “without Hadoop” build).

Extract Downloaded Hadoop 2.8 and Spark 2.0

Using 7Zip, I uncompress the downloaded hadoop-2.8.0-SNAPSHOT.tar.gz and spark-2.0.0-bin-without-hadoop.tgz files into two folders in my C:\Experiment directory.

Update hadoop-env.cmd

In etc\hadoop\hadoop-env.cmd, we set JAVA_HOME to the location of the local JDK:

@rem The java implementation to use. Required.
set JAVA_HOME=C:\PROGRA~1\Java\jdk1.8.0_77

We can now use Command Prompt to execute bin\hadoop classpath to see the full class path that we will later need for configuring Spark so that it can see all of the Hadoop jars and the required libraries.
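For example (the Hadoop directory name is illustrative and assumes the snapshot was extracted under C:\Experiment):

cd C:\Experiment\hadoop-2.8.0-SNAPSHOT
bin\hadoop classpath

The command prints a long semicolon-separated list of directories and jar wildcards; we will paste that string into SPARK_DIST_CLASSPATH later.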

Copy winutils.exe into Hadoop bin folder

As described in this article https://blogs.msdn.microsoft.com/arsen/2016/02/09/resolving-spark-1-6-0-java-lang-nullpointerexception-not-found-value-sqlcontext-error-when-running-spark-shell-on-windows-10-64-bit/, download the winutils.exe and copy it into the Hadoop bin folder.

Now we are able to run the bin\hadoop fs -ls /Experiment command and see the folder listing:

Try Listing Files in Azure Data Lake Store

bin\hadoop fs -ls swebhdfs://avdatalake2.azuredatalakestore.net:443/

We do not expect it to work, since we have not yet configured anything in Hadoop to tell it how to authenticate access to that account.

As expected, we see a response telling us that the Hadoop client is not able to access the storage location because the request is Unauthorized (i.e. we did not provide any credentials for it to use).

Add WebHDFS and OAuth2 Properties to core-site.xml

We set the following relevant settings in the Hadoop etc/hadoop/core-site.xml. For now, we keep some of the most important values empty since we will fill them in later.

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>dfs.webhdfs.oauth2.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>dfs.webhdfs.oauth2.access.token.provider</name>
    <value>org.apache.hadoop.hdfs.web.oauth2.ConfRefreshTokenBasedAccessTokenProvider</value>
  </property>
  <property>
    <name>dfs.webhdfs.oauth2.refresh.url</name>
    <value>WILL FILL IN LATER</value>
  </property>
  <property>
    <name>dfs.webhdfs.oauth2.client.id</name>
    <value>WILL FILL IN LATER</value>
  </property>
  <property>
    <name>dfs.webhdfs.oauth2.refresh.token.expires.ms.since.epoch</name>
    <value>0</value>
  </property>
  <property>
    <name>dfs.webhdfs.oauth2.refresh.token</name>
    <value>WILL FILL IN LATER</value>
  </property>
</configuration>

Obtaining values from Azure Active Directory for core-site.xml

dfs.webhdfs.oauth2.access.token.provider

Ideally, we would like to use the ConfCredentialBasedAccessTokenProvider, which would allow us to obtain an access token via the Client Credentials Grant workflow (i.e. using only a client_id and client_secret). Unfortunately, the current branch-2.8 generic WebHDFS OAuth2 implementation does not provide a way to specify a “resource” parameter when obtaining the access token. The “resource” parameter must be set to “https://management.core.windows.net/” for Azure Active Directory to issue an access token that works properly with Azure Data Lake Store.

The AdlFileSystem (i.e. adl:// scheme) method of accessing Azure Data Lake Store will provide a more convenient way to obtain the access token using the Azure Data Lake-specific org.apache.hadoop.hdfs.web.oauth2.AzureADClientCredentialBasedAccesTokenProvider token provider. However, this class is not available in the generic WebHDFS OAuth2 implementation within Hadoop 2.8.

Therefore, to use the generic WebHDFS with OAuth2, we must use the ConfRefreshTokenBasedAccessTokenProvider, which obtains the access token via a provided refresh token using the “second half” of the Authorization Code Grant workflow.
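Under the hood, this provider exchanges the configured refresh token for a fresh access token at the dfs.webhdfs.oauth2.refresh.url endpoint. The exchange is roughly equivalent to the following request (a sketch of the standard OAuth2 refresh flow with placeholder values in caps, not the provider’s exact code):

curl -X POST https://login.microsoftonline.com/YOUR_TENANT_ID/oauth2/token ^
  -d "grant_type=refresh_token" ^
  -d "client_id=YOUR_CLIENT_ID" ^
  -d "refresh_token=YOUR_REFRESH_TOKEN"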

As we will see in the dfs.webhdfs.oauth2.refresh.token section below, obtaining the refresh token without writing code requires a bit of a hack.

dfs.webhdfs.oauth2.refresh.url

Copy the value of the “OAuth 2.0 token endpoint” into core-site.xml as the dfs.webhdfs.oauth2.refresh.url value.

This URL contains a GUID representing your Azure Active Directory tenant id:

https://login.microsoftonline.com/dd74924a-88ce-421a-ac87-00fc9dbxxxxx/oauth2/token

dfs.webhdfs.oauth2.client.id

To get the client_id, click on the application and its “Configure” tab:

We must also give this application permission to access the Azure Service Management APIs.

Add application, select Windows Azure Service Management API, check Delegated Permissions as shown below, and save.

Copy the client_id into core-site.xml as the dfs.webhdfs.oauth2.client.id value.

dfs.webhdfs.oauth2.refresh.token

Use Notepad to construct a URL to your Azure Active Directory OAuth2 authorize endpoint, containing your client_id and requesting an authorization code:

https://login.microsoftonline.com/YOUR_TENANT_ID/oauth2/authorize?client_id=YOUR_CLIENT_ID&response_type=code&response_mode=query&resource=https%3A%2F%2Fmanagement.core.windows.net%2F

Navigate to the properly constructed URL in your browser and you will be asked to log in.

Enter the credentials of the Azure user who has access to your Azure Data Lake Store account.

After you click the Accept button, you will be redirected to a non-existent page on your localhost (depending on what you have running on your workstation’s port 80, you may see either a “page not found” or a connection failure error). This happens because we don’t really have a real application requesting the refresh token; we are doing it completely manually. The main thing we want to do is copy the full URL of this non-existent page, including the “code=” parameter.

The value of the “code=” query string parameter is what we will use in the next step to obtain the refresh token.

Copy the long value of the “code=” query string parameter from the URL (making sure not to include extra parameters like &session_state) and use curl or Postman to exchange it for a refresh token by sending an HTTP POST request to the Azure Active Directory oauth2/token endpoint with the following form parameters (a sample curl command follows the list):

  • grant_type = authorization_code
  • client_id = client_id of the application we created above
  • code = the long string we copied from the not-working localhost page to which we were redirected after login
  • resource = https://management.core.windows.net/ (important: make sure to include the trailing slash “/” in the resource URL)
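For example, with curl (a sketch only; YOUR_TENANT_ID, YOUR_CLIENT_ID, and YOUR_AUTHORIZATION_CODE are placeholders for the values gathered above, and depending on how the application was registered, Azure AD may also require a client_secret):

curl -X POST https://login.microsoftonline.com/YOUR_TENANT_ID/oauth2/token ^
  -d "grant_type=authorization_code" ^
  -d "client_id=YOUR_CLIENT_ID" ^
  -d "code=YOUR_AUTHORIZATION_CODE" ^
  --data-urlencode "resource=https://management.core.windows.net/"

The JSON response includes both an access_token and the refresh_token we need.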

Copy the full string (it should not contain any line breaks or spaces) from the “refresh_token” property of the JSON response into core-site.xml as the dfs.webhdfs.oauth2.refresh.token value.

Review Final core-site.xml

Use “hadoop fs” to Access Azure Data Lake Store Files

hadoop fs -ls swebhdfs://avdatalake2.azuredatalakestore.net:443/

We can also confirm that we can copy local files (e.g. C:\Temp\files\4.txt) to Azure Data Lake Store using fs -copyFromLocal:

hadoop fs -copyFromLocal C:\Temp\files\4.txt swebhdfs://avdatalake2.azuredatalakestore.net:443/

We can also download files from Azure Data Lake Store to the local filesystem using fs -copyToLocal:

hadoop fs -copyToLocal swebhdfs://avdatalake2.azuredatalakestore.net:443/4.txt C:\Temp\files\4-downloaded.txt

Using Local Spark 2.0 to access data in Azure Data Lake Store

Create spark-env.cmd

Next, we create a spark-env.cmd file in the C:\Experiment\spark-2.0.0-bin-without-hadoop\conf folder to set the following environment variables (a minimal sketch of the file follows the list):

  • HADOOP_HOME = set to the full path of the Hadoop 2.8 that we configured above
  • SPARK_HOME = set to the directory that contains the Spark binaries that we downloaded and extracted
  • SPARK_DIST_CLASSPATH = set to the string returned by “bin\hadoop classpath” command above
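A minimal sketch of spark-env.cmd, assuming the directory layout used in this walkthrough (the Hadoop directory name is illustrative, and the classpath value must be your own bin\hadoop classpath output):

@rem conf\spark-env.cmd
set HADOOP_HOME=C:\Experiment\hadoop-2.8.0-SNAPSHOT
set SPARK_HOME=C:\Experiment\spark-2.0.0-bin-without-hadoop
set SPARK_DIST_CLASSPATH=<paste the output of "%HADOOP_HOME%\bin\hadoop classpath" here>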

Once the spark-env.cmd file is created, we execute it from the command line.

Spark 2.0 “without Hadoop” spark-shell error on Windows

java.lang.NoClassDefFoundError: Could not initialize class scala.tools.fusesource_embedded.jansi.internal.Kernel32
at scala.tools.fusesource_embedded.jansi.internal.WindowsSupport.getConsoleMode(WindowsSupport.java:50)

The error refers to the JLine Java library that is used to handle console input and provide autocomplete in the REPL shell. Although we could ignore this error for our purposes, since spark-shell “mostly” works even when it is shown, I spent some time troubleshooting and looking for a way to resolve it. At first, I thought the issue was related to some kind of incompatibility between Scala 2.11 and JLine on Windows, as discussed in SPARK-13710, but it turned out to be something else.

To narrow down the problem, I downloaded spark-2.0.0-bin-hadoop2.7 and confirmed that this error does not happen with that binary distribution. Knowing this, I searched through all of the JAR files available in spark-2.0.0-bin-hadoop2.7 using the following command (which I found in this StackOverflow answer):

C:\Experiment\spark-2.0.0-bin-hadoop2.7>for /R %G in (*.jar) do @jar -tvf "%G" | find "jline" > NUL && echo %G

I confirmed that the jline-2.12.jar file is not present in the spark-2.0.0-bin-without-hadoop\jars directory. Therefore, I copied jline-2.12.jar into the jars directory of my Spark “without Hadoop” installation, restarted the spark-shell, and confirmed that the error disappeared. I am not sure why jline-2.12.jar is not part of the “without Hadoop” binary distribution. For convenience (i.e. not having to download spark-2.0.0-bin-hadoop2.7), I uploaded this file here: https://avdatarepo1.blob.core.windows.net:443/hadoop/jline-2.12.jar
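The copy itself is a one-liner (paths assume both distributions were extracted under C:\Experiment):

copy C:\Experiment\spark-2.0.0-bin-hadoop2.7\jars\jline-2.12.jar C:\Experiment\spark-2.0.0-bin-without-hadoop\jars\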

Create RDD from file in Azure Data Lake Store

val rdd = sc.textFile("swebhdfs://avdatalake2.azuredatalakestore.net:443/user.csv")
rdd.count()
rdd.take(10)

Save RDD as text files in Azure Data Lake Store

val numbers = sc.parallelize(10000000 to 20000000)
numbers.saveAsTextFile("swebhdfs://avdatalake2.azuredatalakestore.net:443/numbers")
val rdd = sc.textFile("swebhdfs://avdatalake2.azuredatalakestore.net:443/numbers")
rdd.count()
rdd.take(10)

We can see the generated data in the Azure Data Lake Store data explorer view at https://portal.azure.com/

Read CSV file using Spark 2.0 DataFrame API

Let’s try reading data from a CSV file that is stored in the Azure Data Lake Store.

val userDataFrame = spark.read.csv("swebhdfs://avdatalake2.azuredatalakestore.net:443/user.csv")

Unfortunately, when trying this on my Windows 10 laptop with the Spark 2.0.0 binaries, I get the following error:

java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/Experiment/spark-2.0.0-bin-without-hadoop/spark-warehouse
at org.apache.hadoop.fs.Path.initialize(Path.java:254)
at org.apache.hadoop.fs.Path.<init>(Path.java:212)

After a bit of research, it turns out that this error is caused by an issue (SPARK-15899) in the latest Spark 2.0.0 build that does not properly construct file:/// scheme URIs on Windows. As you can see in the error message above, instead of file:///C:/Experiment/ the URI is constructed as file:C:/Experiment, which is not parsed properly.

I mentioned a temporary workaround in SPARK-15899 and am including it below:

  • Set the spark.sql.warehouse.dir=file:///C:/temp config setting in your conf/spark-defaults.conf file
  • Or pass it when starting your spark-shell: bin\spark-shell --conf spark.sql.warehouse.dir=file:///C:/temp
  • Also, instead of C:/temp, you can point the setting to the directory where your Spark binaries are located (e.g. spark.sql.warehouse.dir=file:///C:/Experiment/spark-2.0.0-bin-without-hadoop/spark-warehouse)

Now, we are able to read the CSV file using the SparkSession:

val userDataFrame = spark.read.option("header", true).csv("swebhdfs://avdatalake2.azuredatalakestore.net:443/user.csv")
userDataFrame.show()

Conclusion

It is important to note that you are unlikely to use this approach to access a lot of data in Azure Data Lake Store from a large production Spark cluster running in a colocation facility or your own data center, due to (1) the latency of transferring data from the remote location to the Spark executor nodes and (2) the egress data transfer charges that usually apply when moving data out of cloud services. If you are running Spark on Azure IaaS virtual machines in the same region as your Azure Data Lake Store account, the latency would be reasonable and there would usually be no data egress charges. However, in that case, you might want to use the managed Apache Spark for Azure HDInsight, which has all of this pre-configured for you.

Thank you for reading!

I’m looking forward to your feedback and questions via Twitter https://twitter.com/ArsenVlad

Originally published at blogs.msdn.microsoft.com on August 5, 2016.

Principal Engineer / Architect, FastTrack for Azure at Microsoft
