It seems impossible to write to Azure Data Lake Gen2 using Spark, unless you're using Databricks.
I'm using Jupyter with Almond to run Spark in a notebook locally.
I have imported the Hadoop dependencies:
import $ivy.`org.apache.hadoop:hadoop-azure:2.7.7`
import $ivy.`com.microsoft.azure:azure-storage:8.4.0`
which allow me to use the wasbs:// protocol when trying to write my DataFrame to Azure:
spark.conf.set(
"fs.azure.sas.[container].prodeumipsadatadump.blob.core.windows.net",
"?sv=2018-03-28&ss=b&srt=sco&sp=rwdlac&se=2019-09-09T23:33:45Z&st=2019-09-09T15:33:45Z&spr=https&sig=[truncated]")
This is where the error occurs:
val data = spark.read.json(spark.createDataset(
"""{"name":"Yin", "age": 25.35,"address":{"city":"Columbus","state":"Ohio"}}""" :: Nil))
data
.write
.orc("wasbs://[filesystem]#[datalakegen2storageaccount].blob.core.windows.net/lalalalala")
We are now greeted with the "Blob API is not yet supported for hierarchical namespace accounts" error:
org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Blob API is not yet supported for hierarchical namespace accounts.
So is this indeed impossible? Should I just abandon Data Lake Gen2 and use regular Blob storage? Microsoft really dropped the ball in creating a "Data Lake" product while providing no documentation for a Spark connector.
Working with ADLS Gen2 in Spark is straightforward, and Microsoft haven't "dropped the ball" so much as "the Hadoop binaries shipped with ASF Spark don't include the ABFS client". Those in HDInsight, Cloudera CDH 6.x etc. do. To get it working:
1. Consistently upgrade the hadoop-* JARs to Hadoop 3.2.1. That means all of them, not dropping in a later hadoop-azure-3.2.1 JAR and expecting things to work.
2. Use abfs:// URLs.
3. Configure the client as per the docs (a sketch follows below).
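As a minimal sketch of what that can look like, assuming the Hadoop 3.2.x JARs (hadoop-common, hadoop-azure and friends) are consistently on the classpath; the account name, container name and key below are placeholders, and this uses simple shared-key auth rather than OAuth:
// Shared-key auth against the Gen2 (dfs) endpoint; replace the placeholders with your own values.
spark.conf.set(
  "fs.azure.account.key.myaccount.dfs.core.windows.net",
  "<storage-account-access-key>")

// Write through the ABFS driver instead of wasbs://.
data
  .write
  .orc("abfss://mycontainer@myaccount.dfs.core.windows.net/lalalalala")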
ADLS Gen2 is the best object store Microsoft have deployed - with hierarchical namespaces you get O(1) directory operations, which for Spark means high-performance task and job commits. Security and permissions are great too.
Yes, it is unfortunate that it doesn't work out of the box with the Spark distribution you have - but Microsoft are not in a position to retrofit a new connector to a set of artifacts released in 2017. You're going to have to upgrade your dependencies.
I think you have to enable the preview feature to use the Blob API with Azure Data Lake Gen2: Data Lake Gen2 Multi-Protocol-Access.
Another thing that I found: the endpoint format needs to be updated by changing "blob" to "dfs". See here. But I am not sure if that helps with your problem.
On the other hand, you could use the ABFS driver to access the data. This is not officially supported, but you could start from a Hadoop-free Spark build and install a newer Hadoop version containing the driver. I think this might be an option depending on your scenario: Adding hadoop ABFS driver to spark distribution
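For illustration only, with placeholder container and account names, the two endpoint forms look like this (the ABFS driver talks to the dfs endpoint rather than the blob endpoint):
// Blob endpoint, used by the wasbs:// connector:
//   wasbs://mycontainer@myaccount.blob.core.windows.net/path
// DFS endpoint, used by the abfs(s):// connector for Data Lake Gen2:
//   abfss://mycontainer@myaccount.dfs.core.windows.net/path
val df = spark.read.json("abfss://mycontainer@myaccount.dfs.core.windows.net/path/to/data")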
Related
I am new to Spark Structured Streaming and its concepts. I was reading through the documentation for an Azure HDInsight cluster here, and it is mentioned that Structured Streaming applications run on the HDInsight cluster and connect to streaming data from Azure Storage or Azure Data Lake Storage. I was looking at how to get started with streaming that listens for new-file-created events from the storage or ADLS. The Spark documentation does provide an example, but I am looking for how to tie up streaming with the blob/file creation event, so that I can store the file content in a queue from my Spark job. It would be great if anyone could help me out on this.
Happy to help you on this, but can you be more precise with the requirement? Yes, you can run Spark Structured Streaming jobs on Azure HDInsight. Basically, mount the Azure Blob storage to the cluster and then you can directly read the data available in the blob:
val df = spark.read.option("multiLine", true).json("PATH OF BLOB")
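If the requirement is to pick up new files as they arrive, a rough sketch using Spark's built-in file source is below. The schema and paths are placeholders; note that this source discovers new files by listing the input directory on each trigger, not by subscribing to blob-creation events, so wiring Event Grid or a queue into the job would be a separate piece of work.
import org.apache.spark.sql.types._

// Placeholder schema - the streaming file source requires an explicit schema.
val schema = new StructType()
  .add("name", StringType)
  .add("age", DoubleType)

// Placeholder path; each new file that appears under it is picked up in a micro-batch.
val stream = spark.readStream
  .schema(schema)
  .json("wasbs://mycontainer@myaccount.blob.core.windows.net/incoming/")

val query = stream.writeStream
  .format("console")
  .option("checkpointLocation", "wasbs://mycontainer@myaccount.blob.core.windows.net/checkpoints/")
  .start()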
Azure Data Lake Gen2 (ADL2) has been released for Hadoop 3.2 only. Open source Spark 2.4.x supports Hadoop 2.7 and, if you compile it yourself, Hadoop 3.1. Spark 3 will support Hadoop 3.2, but it's not released yet (only a preview release).
Databricks offers support for ADL2 natively.
My solution to tackle this problem was to manually patch and compile Spark 2.4.4 with Hadoop 3.2 to be able to use the ADL2 libs from Microsoft.
Recently, Databricks launched Databricks Connect that
allows you to write jobs using Spark native APIs and have them execute remotely on an Azure Databricks cluster instead of in the local Spark session.
It works fine except when I try to access files in Azure Data Lake Storage Gen2. When I execute this:
spark.read.json("abfss://...").count()
I get this error:
java.lang.RuntimeException: java.lang.ClassNotFoundException: Class shaded.databricks.v20180920_b33d810.org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem not found at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:2195)
Does anybody know how to fix this?
Further information:
databricks-connect version: 5.3.1
If you mount the storage rather than use a service principal, you should find this works: https://docs.databricks.com/spark/latest/data-sources/azure/azure-datalake-gen2.html
I posted some instructions around the limitations of Databricks Connect here: https://datathirst.net/blog/2019/3/7/databricks-connect-limitations
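For completeness, a rough sketch of the mount approach from those docs; every ID, secret and name below is a placeholder. Creating the mount is normally done once from a notebook on the cluster, after which jobs submitted through Databricks Connect can read via the mount point because they execute remotely.
// OAuth configuration for the ABFS driver, using an Azure AD service principal (placeholder values).
val configs = Map(
  "fs.azure.account.auth.type" -> "OAuth",
  "fs.azure.account.oauth.provider.type" -> "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
  "fs.azure.account.oauth2.client.id" -> "<application-id>",
  "fs.azure.account.oauth2.client.secret" -> "<client-secret>",
  "fs.azure.account.oauth2.client.endpoint" -> "https://login.microsoftonline.com/<tenant-id>/oauth2/token")

// Mount the Gen2 filesystem under DBFS, then access it via the mount point.
dbutils.fs.mount(
  source = "abfss://mycontainer@myaccount.dfs.core.windows.net/",
  mountPoint = "/mnt/mydata",
  extraConfigs = configs)

spark.read.json("/mnt/mydata/some-file.json").count()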
Likely too late, but for completeness' sake, there's one issue to look out for on this one. If you have this Spark conf set, you'll see that exact error (which is pretty hard to unpack):
fs.abfss.impl org.apache.hadoop.fs.azurebfs.SecureAzureBlobFileSystem
So double-check the Spark configs and make sure you have the permissions to directly access ADLS Gen2 using the storage account access key.
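For example, something along these lines, assuming the override lives in the session configuration (if it is set at cluster level it has to be removed there instead) and using a placeholder account name and key:
// Print the current value of the suspect setting, if any.
println(spark.conf.getOption("fs.abfss.impl"))

// For direct access with the storage account access key:
spark.conf.set(
  "fs.azure.account.key.myaccount.dfs.core.windows.net",
  "<storage-account-access-key>")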
I have installed Apache Hive on my local system and I need to connect to Azure Data Lake to query data from it. How do I configure this?
Details on how you can connect Hadoop to Azure Data Lake are available here - https://hadoop.apache.org/docs/current/hadoop-azure-datalake/index.html.
You will need to have a recent version of Hadoop running in order to have the modules natively available.
There are blogs which talk about enabling this connectivity e.g. - https://medium.com/azure-data-lake/connecting-your-own-hadoop-or-spark-to-azure-data-lake-store-93d426d6a5f4.
But unless you are running Hadoop in an Azure Region where the Azure Data Lake Store (ADLS) account is located, your solution will be non-optimal. You will incur latency in data read/writes, as well as costs since you will be egressing data out of an Azure region during reads. Trust you have factored these into your planning.
Thanks,
Sachin Sheth,
Program Manager, Azure Data Lake.
How do I get a list of files from Azure Blob storage in Spark and Scala?
I have no idea how to approach this.
I don't know whether the Spark you use is on Azure or local, so there are two cases, but they are similar.
For Spark running locally, there is an official blog which introduces how to access Azure Blob Storage from Spark. The key is that you need to configure the Azure Storage account as HDFS-compatible storage in the core-site.xml file and add the two JARs hadoop-azure & azure-storage to your classpath, so you can access the storage via the wasb[s] protocol. You can refer to the official tutorial on HDFS-compatible storage with wasb, and the blog about configuration for HDInsight, for more details.
For Spark running on Azure, the only difference is that you access the storage with wasb; the other preparations have been done by Azure when the HDInsight cluster with Spark was created.
For listing files, you can use the Hadoop FileSystem API (listFiles/listStatus) or read everything with SparkContext.wholeTextFiles, as sketched below.
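A small sketch of the FileSystem-based listing, assuming the wasb[s] configuration described above is in place and with placeholder container and account names:
import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Placeholder path inside the Blob storage container.
val root = "wasbs://mycontainer@myaccount.blob.core.windows.net/data"

// Resolve a FileSystem for that URI using the Hadoop configuration Spark already carries.
val fs = FileSystem.get(new URI(root), spark.sparkContext.hadoopConfiguration)

// List the entries directly under the path and print their full paths.
fs.listStatus(new Path(root)).map(_.getPath.toString).foreach(println)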
Hope it helps.
If you are using databricks, try the below
dbutils.fs.ls("blob_storage_location")
I have some basic questions about Azure HDInsight.
The following article gives some basic input on using hdinsight.
https://azure.microsoft.com/en-in/documentation/articles/hdinsight-hadoop-emulator-get-started/.
It says that HDInsight internally uses Azure Blob storage.
Having this in mind, my question is as follows:
I have an HDInsight cluster hd1 which uses storage account stg1.
If I just want to upload and download files to stg1 using Azure Storage Explorer, then what's the use of having hd1? I can do that without even creating an HDInsight cluster, which costs heavily.
So, is Hadoop/HDInsight only used for processing data stored in stg1 to produce results like a word count? Is that the only reason why we use HDInsight?
If you want to understand HDInsight and Blob storage better, you need to read https://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-use-blob-storage/.
HDInsight is Microsoft's implementation of Hadoop. So far there are 4 different base types, which include Hadoop, HBase, Storm, and Spark. You can always install additional components on top of the base types.
Your question is really about why to use Hadoop. Hadoop shines when you need to process a lot of data - big data.
One of the differences between HDInsight and other Hadoop implementations is the separation of storage (Blob storage) from compute (HDInsight clusters). You would still need to copy the data (or store the data directly in Azure Blob storage). When you are ready to process it, you create an HDInsight cluster, submit a job, and then delete the cluster. You delete the cluster so you don't need to pay for it anymore. Even after the cluster is deleted, your data stored in Blob storage is retained.
HDInsight is a family of products, including Hadoop, Spark, HBase, and Storm. They all do different things, and storage is but only one aspect.