Accessing Azure Data Lake Store using WebHDFS with OAuth2 from Spark 2.0 that is running locally
Update from March 2017: Since posting this article in August 2016, Azure Data Lake Product Team published three new and highly recommended blog posts:
- Connecting your own Hadoop or Spark to Azure Data Lake Store
- Making Azure Data Lake Store the default file system in Hadoop
- Wiring your older Hadoop clusters to access Azure Data Lake Store
I highly recommend to review the official guidance above instead of my experimental and dated swebhdfs approach below.
In an earlier blog post, I described how to access Azure Storage Blobs from Spark that is running locally. This time, I want to try accessing Azure Data Lake Store (as of August 2016 in preview) using its WebHDFS-compatible REST API from a local Hadoop client tools and local Spark.
Although, Azure Data Lake Store (ADLS) can be accessed via a WebHDFS endpoint (i.e. swebhdfs://), the most performant and recommended way to access it is via the new ADL file system implementation that is being build into Hadoop (HADOOP-12666 and HADOOP-13037). This approach extends the Hadoop file system class with AdlFileSystem implementation for accessing ADLS using schema: adl://<accountname>.azuredatalake.net/path/to/file.
However, in this article, I specifically want to try accessing ADLS using the more generic swebhdfs:// scheme since my use case is very basic and this is really just a proof-of-concept.
Azure Data Lake Store (ADLS) uses Azure Active Directory (Azure AD) for authentication and access control lists. Therefore, each WebHDFS REST API request to ADLS must include an Authorization header with a Bearer access token that was issued by Azure AD via OAuth2. To obtain this access token from Azure AD, the Hadoop client and Spark will need to have support for WebHDFS with OAuth2. Luckily, Hadoop 2.8 branch includes this feature: “Support OAuth2 in WebHDFS” (HDFS-8155).
Get Hadoop 2.8 Binary
I could not find a way to download Hadoop 2.8 binary distribution and also see that Spark binaries are only available for latest Hadoop 2.7.x. Therefore, we need to build our own Hadoop 2.8 binary and use Spark that is “Pre-build with user-provided Hadoop”.
I wrote another article in which I describe how to compile and build specific Hadoop source code branch using an Azure VM. If you don’t want to try that simple build process yourself (I do recommend you try it), on the bottom of that post, I link to a hadoop-2.8.0-SNAPSHOT.tar.gz file that you can download directly.
Download Spark without Hadoop
To download Spark without Hadoop, visit http://spark.apache.org/downloads.html and select the following options.
Extract Downloaded Hadoop 2.8 and Spark 2.0
I am doing this experiment on my Windows 10 laptop with Java Development Kit (JDK) 1.8 installed.
Using 7Zip, I uncompress the downloaded hadoop-2.8.0-SNAPSHOT.tar.gz and spark-2.0.0-bin-without-hadoop.tgz files into two folders in my C:\Experiment directory.
To be able to run bin/hadoop, we must set the proper JAVA_HOME reference in the /etc/hadoop/hadoop-env.cmd file. On my Windows 10 laptop, JDK 1.8 is installed in C:\Program Files\Java\jdk1.8.0_77. However, since this path contains a space, it does not work properly in the hadoop-env.cmd file. To find the short name for the “C:\Program Files”, we can run “dir /X C:\” to get back PROGRA~1 and update hadoop-env.cmd file to point to proper JAVA_HOME:
@rem The java implementation to use. Required. set JAVA_HOME=C:\PROGRA~1\Java\jdk1.8.0_77
We can now use Command Prompt to execute bin\hadoop classpath to see the full class path that we will later need for configuring Spark so that it can see all of the Hadoop jars and the required libraries.
Copy winutils.exe into Hadoop bin folder
On Windows, if we try to run Hadoop commands now, we will get the following error:
As described in this article https://blogs.msdn.microsoft.com/arsen/2016/02/09/resolving-spark-1-6-0-java-lang-nullpointerexception-not-found-value-sqlcontext-error-when-running-spark-shell-on-windows-10-64-bit/, download the winutils.exe and copy it into the Hadoop bin folder.
Now we are able to run bin/hadoop fs -ls /Experiment command and see the folder listing:
Try Listing Files in Azure Data Lake Store
I already have an Azure Data Lake Store account in my Azure subscription. Let’s see what happens when we now try to list the files in the Azure Data Lake Store account called “avdatalake2” using swebhdfs:// scheme.
bin\hadoop fs -ls swebhdfs://avdatalake2.azuredatalakestore.net:443/
We do not expect it to work since we didn’t yet configure anything in Hadoop to tell it how to authenticate access to that account.
As expected, we see a response telling us that Hadoop client is not able to access the storage location since it is Unauthorized (i.e. we didn’t provide any credentials for it to use).
Add WebHDFS and OAuth2 Properties to core-site.xml
The WebHDFS documentation in Hadoop 2.8 branch, contains a section that describes various core-site.xml settings (e.g. dfs.webhdfs.oauth2.enabled) that are relevant for client access to WebHDFS using OAuth2 authentication: https://github.com/apache/hadoop/blob/branch-2.8/hadoop-hdfs-project/hadoop-hdfs/src/site/markdown/WebHDFS.md#authentication
We set the following relevant settings in the Hadoop etc/hadoop/core-site.xml. For now, we keep some of the most important values empty since we will fill them in later.
<?xml version="1.0" encoding="UTF-8"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <configuration> <property><name>dfs.webhdfs.oauth2.enabled</name><value>true</value></property><property><name>dfs.webhdfs.oauth2.access.token.provider</name><value>org.apache.hadoop.hdfs.web.oauth2.ConfRefreshTokenBasedAccessTokenProvider</value></property><property><name>dfs.webhdfs.oauth2.refresh.url</name><value>WILL FILL IN LATER</value></property><property><name>dfs.webhdfs.oauth2.client.id</name><value>WILL FILL IN LATER</value></property><property><name>dfs.webhdfs.oauth2.refresh.token.expires.ms.since.epoch</name><value>0</value></property><property><name>dfs.webhdfs.oauth2.refresh.token</name><value>WILL FILL IN LATER</value></property> </configuration>
Obtaining values from Azure Active Directory for core-site.xml
As we see from the core-site.xml listing above, we need to fill in information for a few of the settings.
This value should be set to the name of an implementation of org.apache.hadoop.hdfs.web.oauth2.AccessTokenProvider abstract class. Hadoop 2.8 includes two providers whose code we can review here https://github.com/apache/hadoop/tree/branch-2.8/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/web/oauth2
Ideally, we would like to use the ConfCredentialBasedAccessTokenProvider which would allows us to obtain an access token via the Client Credential Grant workflow (i.e. using only client_id and client_secret). However, unfortunately, the current branch-2.8 generic WebHDFS OAuth2 implementation does not provide a way to specify a “resource” parameter at the time of obtaining the access token. The “resource” parameter must be set to “https://management.core.windows.net" for Azure Active Directory to issue an access token that would work properly with Azure Data Lake Store.
The AdlFileSystem (i.e. adl:// scheme) method of accessing Azure Data Lake Store will provide a more convenient way to obtain the access token using org.apache.hadoop.hdfs.web.oauth2.AzureADClientCredentialBasedAccesTokenProvider Azure Data Lake specific token provider. However, this class is not available in the generic WebHDFS OAuth2 implementation within Hadoop 2.8.
Therefore, to use the generic WebHDFS with OAuth2, we must use the ConfRefreshTokenBasedAccessTokenProvider which will obtain the access token via the provided refresh token using the “second half” of the Authorization Code Grant workflow.
As we will see in the dfs.webhdfs.oauth2.refresh.token section below, obtaining the refresh token without writing code, requires a bit of a hack.
Using the classic Azure management portal https://manage.windowsazure.com/, we navigate to the Active Directory in left-sidebar, click on the Applications sub-tab, and click the View Endpoints button.
Copy the value of the “OAuth 2.0 token endpoint” into the core-site.xml for the dfs.webhdfs.oauth2.refresh.url value.
This URL contains a GUID representing your Azure Active Directory tenant id:
We will create a “native application” with redirect URL set to http://localhost. The reason we use “native application” instead of “web application” is because we need to be able to obtain the access token without having to provide client_secret — and “native application” type provides us that ability.
To get the client_id, click on the application and its “Configure” tab:
We also must provide this application permissions to access Azure Service Management APIs.
Add application, select Windows Azure Service Management API, check Delegated Permissions as shown below, and save.
Copy the client_id to the core-site.xml for the dfs.webhdfs.oauth2.client.id value.
To generate the refresh token without writing code, we will do something a little bit hacky by manually simulating an OAuth2 authorization code grant flow.
Use notepad to construct a URL to your Azure Active Directory OAuth2 authorize endpoint containing your client_id and request for the authorization code:
Navigate to the properly constructed URL in your browser and you will asked to login.
Enter the credentials of the Azure user who has access to your Azure Data Lake Store account.
After you click the Accept button, you will be redirected to a “not existing page” on your localhost (depending on what you have running on your workstation’s port 80 you may either see a page not found or a connection failure error). This happens because we don’t really have a real application here requesting the refresh token; rather, we are doing it completely manually. The main thing we want to do is copy the full URL of this “not existing page” including the “code=” parameter.
The value of the “code=” query string parameter is what we will use in the next step to obtain the refresh token.
Copy the long value of the “code=” query string parameter (make sure to not include the extra parameters like &session_state) from the URL and use curl or Postman to obtain the refresh token using the following HTTP POST request to Azure Active Directory oauth2/token endpoint:
- grant_type = authorization_code
- client_id = client_id of the application we created above
- code = the long string we copied from the not-working localhost page to which we were redirected after login
- resource = https://management.core.windows.net/ (important: make sure to include the trailing slash “/” in the resource URL)
Copy the full string (it should not include any line breaks or spaces) from the “refresh_token” property of the JSON response to the core-site.xml for the dfs.webhdfs.oauth2.refresh.token value.
Review Final core-site.xml
After making all of the required modifications to the Hadoop’s etc\hadoop\core-site.xml file, its contents should look approximately like this.
Use “hadoop fs” to Access Azure Data Lake Store Files
If all of the core-site.xml settings were set properly, and Azure Data Lake Store is configured to allow access to your user, we should now be able list the files in the Azure Data Lake Store account using the standard Hadoop fs -ls command.
hadoop fs -ls swebhdfs://avdatalake2.azuredatalakestore.net:443/
We can also confirm that we can copy local files (e.g. C:\Temp\4.txt) to Azure Data Lake Store using fs -copyFromLocal
hadoop fs -copyFromLocal C:\Temp\files\4.txt swebhdfs://avdatalake2.azuredatalakestore.net:443/
We can also download files from Azure Data Lake Store to local filesystem using fs -copyToLocal
hadoop fs -copyToLocal swebhdfs://avdatalake2.azuredatalakestore.net:443/4.txt C:\Temp\files\4-downloaded.txt
Using Local Spark 2.0 to access data in Azure Data Lake Store
Now we will configure Spark 2.0 (which we downloaded without Hadoop) to use our local install of Hadoop 2.8
First, we need to execute hadoop classpath to get the classpath for configuring the SPARK_DIST_CLASSPATH environment variable:
Next, we create a spark-env.cmd file in the C:\Experiment\spark-2.0.0-bin-without-hadoop\conf folder to set the following environment variables:
- HADOOP_HOME = set to the full path of the Hadoop 2.8 that we configured above
- SPARK_HOME = set to the directory that contains the Spark binaries that we downloaded and extracted
- SPARK_DIST_CLASSPATH = set to the string returned by “bin\hadoop classpath” command above
Once the spark-env.cmd file is created, we execute it from the command line.
Spark 2.0 “without Hadoop” spark-shell error on Windows
When starting Spark 2.0 bin\spark-shell on Windows, I encountered this error:
java.lang.NoClassDefFoundError: Could not initialize class scala.tools.fusesource_embedded.jansi.internal.Kernel32
The error refers to the JLine Java library that is used to handle console input and provide autocomplete in the REPL shell. Although for our purposes, we could ignore this error since spark-shell “mostly” works even though this error is shown, I spent some time troubleshooting and looking for a way to resolve this error. At first, I thought that this issue was related to some kind of an incompatibility between Scala 2.11 and JLine on Windows as discussed in SPARK-13710, but it turned out to be something else.
To narrow down the problem, I downloaded spark-2.0.0-bin-with-hadoop-2.7 and confirmed that this error does not happen when using that binary distribution. Knowing this, I searched through all of the JAR files available in the spark-2.0-bin-with-hadoop-2.7 using the following command (which I found in this StackOverflow answer):
C:\Experiment\spark-2.0.0-bin-hadoop2.7>for /R %G in (*.jar) do @jar -tvf "%G" | find "jline" > NUL && echo %G
I confirmed that jline-2.12.jar file is not present in the spark-2.0.0-bin-without-hadoop\jars directory. Therefore, I copied the jline-2.12.jar file into the jars directory of my Spark “without Hadoop”, restarted the spark-shell, and confirmed that the error disappeared. I am not sure why jline-2.12.jar is not part of the “without Hadoop” binary distribution. For convenience (i.e. not having to download the spark-2.0.0-bin-hadoop2.7), I uploaded this file here https://avdatarepo1.blob.core.windows.net:443/hadoop/jline-2.12.jar
Create RDD from file in Azure Data Lake Store
In spark-shell, we can now try to access the Azure Data Lake Store account that we configured in Hadoop’s etc\hadoop\core-site.xml. Since our Spark 2.0 local instance is using the same core-site.xml from the local Hadoop 2.8 installation, it should be able to access the same ADLS account using the same WebHDFS OAuth2 credentials.
val rdd = sc.textFile("swebhdfs://avdatalake2.azuredatalakestore.net:443/user.csv")rdd.count()rdd.take(10)
Save RDD as text files in Azure Data Lake Store
Finally, we can try generating an RDD that contains 1M numbers and saving it as text files in Azure Data Lake Store.
val numbers = sc.parallelize(10000000 to 20000000) numbers.saveAsTextFile("swebhdfs://avdatalake2.azuredatalakestore.net:443/numbers")val rdd = sc.textFile("swebhdfs://avdatalake2.azuredatalakestore.net:443/numbers") rdd.count()rdd.take(10)
We can see the generated data in Azure Data Lake Store data explorer view in the https://portal.azure.com/
Read CSV file using Spark 2.0 DataFrame API
Spark 2.0 provides a new class called SparkSession which is a new entry point for programming Spark with the Dataset and DataFrame API. SparkSession is available as ‘spark’ within the spark-shell REPL. In addition, Spark 2.0 provides built-in support for reading CSV files via the DataFrameReader csv() method.
Let’s try reading data from a CSV file that is stored in the Azure Data Lake Store.
val userDataFrame = spark.read.csv("swebhdfs://avdatalake2.azuredatalakestore.net:443/user.csv")
Unfortunately, when trying this on my Windows 10 laptop with the Spark 2.0.0 binaries, I get the following error:
java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:C:/Experiment/spark-2.0.0-bin-without-hadoop/spark-warehouse
After a bit of research, turns out that this error is caused an issue (SPARK-15899) in the latest Spark 2.0.0 build that does not properly construct file:/// scheme URIs on Windows. As you see in the error message above, instead of using file:///C:/Experiment/ the URI is constructed as file:C:/Experiment and this is not getting parsed properly.
I mentioned a temporary workaround in SPARK-15899 and including it below:
- Set the spark.sql.warehouse.dir=file:///C:/temp config setting your in the conf/spark-defaults.conf file
- Or pass it when starting your spark-shell: bin\spark-shell — conf spark.sql.warehouse.dir=file:///C:/temp
- Also, instead of C:/temp, you can point the setting to the directory where your Spark binaries are located (i.e. spark.sql.warehouse.dir=file:///C:/Experiment/spark-2.0.0-bin-without-hadoop/spark-warehouse)
Now, we are able to read the CSV file using the SparkSession:
val userDataFrame = spark.read.option("header",true).csv("swebhdfs://avdatalake2.azuredatalakestore.net:443/user.csv")userDataFrame.show()
In this article, we looked at how to configure a local Hadoop 2.8 and Spark 2.0 to use generic WebHDFS with OAuth2 to access data in Azure Data Lake Store service. As mentioned in the beginning of the article, the recommended and most efficient way to access Azure Data Lake Store from Hadoop will be via the AdlFileSystem using adl:// scheme. My intent in this article was to show a proof-of-concept of how we can access Azure Data Lake Store specifically using the generic WebHDFS and OAuth2 support that is already part of Hadoop 2.8 branch.
Also, it is important to note that you are unlikely to use this approach to access a lot of data in Azure Data Lake Store from a large production Spark cluster that is running at a colocation facility or your own data center due to: (1) the latency in transferring data from the remote location to the Spark executor nodes and (2) because transferring data out of cloud services usually involves nominal egress data transfer charges. If you are running Spark on Azure IaaS virtual machines within the same region as your Azure Data Lake Store account, the latency would be reasonable and there would usually be no data egress charges. However, in this case, you might want to use the managed Apache Spark for Azure HDInsight which has all of this pre-configured for you.
Thank you for reading!
I’m looking forward to your feedback and questions via Twitter https://twitter.com/ArsenVlad
Originally published at blogs.msdn.microsoft.com on August 5, 2016.