Multiple S3 credentials in a Spark Structured Streaming application - apache-spark

I want to migrate our Delta lake from S3 to Parquet files in our own on-prem Ceph storage, both accessible through the S3-compliant s3a API in Spark. Is it possible to provide different credentials for readStream and writeStream to achieve this?

The S3A connector supports per-bucket configuration, so you can declare a different set of secrets, endpoint, etc. for your internal buckets than for your external ones.
Consult the Hadoop S3A documentation for the normative details and examples.
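For illustration, a minimal sketch of what that could look like, assuming a hypothetical external bucket delta-on-aws on AWS S3 and a hypothetical internal bucket parquet-on-ceph served by a Ceph RADOS Gateway, with the Delta Lake library on the classpath:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  // credentials for the external AWS bucket
  .config("spark.hadoop.fs.s3a.bucket.delta-on-aws.access.key", "<aws-access-key>")
  .config("spark.hadoop.fs.s3a.bucket.delta-on-aws.secret.key", "<aws-secret-key>")
  // credentials and endpoint for the internal Ceph bucket
  .config("spark.hadoop.fs.s3a.bucket.parquet-on-ceph.endpoint", "https://ceph-rgw.internal:7480")
  .config("spark.hadoop.fs.s3a.bucket.parquet-on-ceph.access.key", "<ceph-access-key>")
  .config("spark.hadoop.fs.s3a.bucket.parquet-on-ceph.secret.key", "<ceph-secret-key>")
  .config("spark.hadoop.fs.s3a.bucket.parquet-on-ceph.path.style.access", "true")
  .getOrCreate()

// The per-bucket settings are resolved from the bucket name in each s3a:// URI,
// so readStream and writeStream each pick up their own credentials.
val source = spark.readStream.format("delta").load("s3a://delta-on-aws/events")
source.writeStream
  .format("parquet")
  .option("checkpointLocation", "s3a://parquet-on-ceph/checkpoints/events")
  .start("s3a://parquet-on-ceph/events")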

Related

How to connect Spark Structured Streaming to blob/file creation events from Azure Data Lake Storage Gen2 or Blob Storage

I am new to Spark Structured Streaming and its concepts. I was reading through the documentation for the Azure HDInsight cluster here, which mentions that Structured Streaming applications run on an HDInsight cluster and connect to streaming data from Azure Storage or Azure Data Lake Storage. I was looking at how to get started with a stream that listens for new file creation events in the storage account or ADLS. The Spark documentation does provide an example, but I am looking for how to tie the stream to the blob/file creation events, so that I can store the file content in a queue from my Spark job. It would be great if anyone could help me out on this.
Happy to help you on this, but can you be more precise about the requirement? Yes, you can run Spark Structured Streaming jobs on Azure HDInsight. Basically, mount the Azure Blob storage to the cluster and then you can directly read the data available in the blob:
val df = spark.read.option("multiLine", true).json("PATH OF BLOB")
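If the goal is to react to newly created files, Spark's file streaming source can also monitor a directory on the blob store and process each new file as it appears. A rough sketch, with a placeholder wasbs:// path, a made-up schema, and the console sink standing in for whatever queue you push to:

import org.apache.spark.sql.types._

// the file streaming source requires an explicit schema
val schema = new StructType()
  .add("id", StringType)
  .add("payload", StringType)

val incoming = spark.readStream
  .schema(schema)
  .json("wasbs://container@account.blob.core.windows.net/incoming/")

incoming.writeStream
  .format("console")                                   // replace with your queue sink
  .option("checkpointLocation", "/tmp/checkpoints/incoming")
  .start()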
Azure Data Lake Gen2 (ADL2) has been released for Hadoop 3.2 only. Open-source Spark 2.4.x supports Hadoop 2.7 and, if you compile it yourself, Hadoop 3.1. Spark 3 will support Hadoop 3.2, but it is not released yet (only a preview release).
Databricks offers support for ADL2 natively.
My solution to this problem was to manually patch and compile Spark 2.4.4 with Hadoop 3.2 to be able to use the ADL2 libraries from Microsoft.

How to configure Confluent Kafka with Azure SQL using a CDC approach?

The main thing is that I want to connect Azure SQL to Confluent Kafka using a CDC approach and then take that data into S3.
There are various ways of getting data out of a database into Kafka. You'll need to check what Azure SQL supports, but this talk (slides) goes into the options and examples, usually built using Kafka Connect.
To stream data from Kafka to S3, use Kafka Connect (which is part of Apache Kafka) with the S3 sink connector, which is detailed in this article.
To see an example of database-to-S3 pipelines with transformations included, have a look at this blog post.
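As a rough illustration (topic, bucket, and region are placeholders; check the connector documentation for the full option list), a standalone Kafka Connect worker could run the S3 sink with a properties file along these lines:

# s3-sink.properties - hypothetical example
name=s3-sink
connector.class=io.confluent.connect.s3.S3SinkConnector
tasks.max=1
topics=azure-sql-cdc-topic
s3.bucket.name=my-target-bucket
s3.region=us-east-1
storage.class=io.confluent.connect.s3.storage.S3Storage
format.class=io.confluent.connect.s3.format.json.JsonFormat
flush.size=1000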

Can Spark write to Azure Datalake Gen2?

It seems impossible to write to Azure Data Lake Gen2 using Spark, unless you're using Databricks.
I'm using Jupyter with Almond to run Spark in a notebook locally.
I have imported the hadoop dependencies:
import $ivy.`org.apache.hadoop:hadoop-azure:2.7.7`
import $ivy.`com.microsoft.azure:azure-storage:8.4.0`
which allows me to use the wasbs:// protocol when trying to write my DataFrame to Azure:
spark.conf.set(
"fs.azure.sas.[container].prodeumipsadatadump.blob.core.windows.net",
"?sv=2018-03-28&ss=b&srt=sco&sp=rwdlac&se=2019-09-09T23:33:45Z&st=2019-09-09T15:33:45Z&spr=https&sig=[truncated]")
This is where the error occurs:
val data = spark.read.json(spark.createDataset(
"""{"name":"Yin", "age": 25.35,"address":{"city":"Columbus","state":"Ohio"}}""" :: Nil))
data
.write
.orc("wasbs://[filesystem]#[datalakegen2storageaccount].blob.core.windows.net/lalalalala")
We are now greeted with the "Blob API is not yet supported for hierarchical namespace accounts" error:
org.apache.hadoop.fs.azure.AzureException: com.microsoft.azure.storage.StorageException: Blob API is not yet supported for hierarchical namespace accounts.
So is this indeed impossible? Should I just abandon Data Lake Gen2 and use regular Blob storage? Microsoft really dropped the ball by creating a "Data Lake" product but providing no documentation for a Spark connector.
Working with ADLS Gen2 in Spark is straightforward, and Microsoft haven't "dropped the ball" so much as "the Hadoop binaries shipped with ASF Spark don't include the ABFS client". The binaries shipped with HDInsight, Cloudera CDH 6.x, etc. do.
Consistently upgrade the hadoop-* JARs to Hadoop 3.2.1. That means all of them, not just dropping in a later hadoop-azure-3.2.1 JAR and expecting things to work.
Use abfs:// URLs.
Configure the client as per the docs.
ADLS Gen2 is the best object store Microsoft have deployed: with hierarchical namespaces you get O(1) directory operations, which for Spark means high-performance task and job commits. Security and permissions are great too.
Yes, it is unfortunate that it doesn't work out of the box with the Spark distribution you have, but Microsoft are not in a position to retrofit a new connector onto a set of artifacts released in 2017. You're going to have to upgrade your dependencies.
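As a rough sketch of what the upgraded setup could look like (the storage account, container, and key are placeholders; it assumes matching Hadoop 3.2.x JARs, including hadoop-azure, on the classpath):

spark.conf.set(
  "fs.azure.account.key.mystorageaccount.dfs.core.windows.net",
  "<storage-account-access-key>")

data.write
  .orc("abfss://mycontainer@mystorageaccount.dfs.core.windows.net/lalalalala")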
I think you have to enable the preview feature to use the Blob API with Azure Data Lake Gen2: Data Lake Gen2 Multi-Protocol Access.
Another thing that I found: the endpoint format needs to be updated by exchanging "blob" for "dfs". See here. But I am not sure whether that helps with your problem.
On the other hand, you could use the ABFS driver to access the data. This is not officially supported, but you could start from a Hadoop-free Spark build and install a newer Hadoop version containing the driver. I think this might be an option depending on your scenario: Adding hadoop ABFS driver to spark distribution.

Specify Azure key in Spark 2.x version

I'm trying to access a wasb (Azure Blob storage) file in Spark and need to specify the account key.
How do I specify the account key in the spark-env.sh file?
fs.azure.account.key.test.blob.core.windows.net
EC5sNg3qGN20qqyyr2W1xUo5qApbi/zxkmHMo5JjoMBmuNTxGNz+/sF9zPOuYA==
When I try this, it throws the following error:
fs.azure.account.key.test.blob.core.windows.net: command not found
From your description, it is not clear whether the Spark you are using is on Azure or running locally.
For Spark running locally, refer to this blog post, which introduces how to access Azure Blob Storage from Spark. The key is that you need to configure the Azure Storage account as HDFS-compatible storage in the core-site.xml file and add the two JARs hadoop-azure & azure-storage to your classpath, so that HDFS can be accessed via the wasb[s] protocol.
For Spark running on Azure, the only difference is that you access HDFS with wasb; all configuration has already been done by Azure when creating the HDInsight cluster with Spark.
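As an aside, the "command not found" message appears because spark-env.sh is a shell script, so a bare Hadoop property line there is executed as a command. The property belongs in core-site.xml or in the Hadoop configuration Spark uses. A minimal sketch, reusing the account name from the question (the key value is a placeholder):

// set the key programmatically on the Hadoop configuration
spark.sparkContext.hadoopConfiguration.set(
  "fs.azure.account.key.test.blob.core.windows.net",
  "<your-storage-account-key>")

val df = spark.read.text("wasbs://mycontainer@test.blob.core.windows.net/path/to/file.txt")

Alternatively, it can go into spark-defaults.conf with the spark.hadoop. prefix:

spark.hadoop.fs.azure.account.key.test.blob.core.windows.net <your-storage-account-key>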

How to get a list of files from an Azure blob using Spark/Scala?

How do I get a list of files from Azure Blob storage in Spark and Scala?
I have no idea how to approach this.
I don't know whether the Spark you use is on Azure or running locally, so there are two cases, but they are similar.
For Spark running locally, there is an official blog which introduces how to access Azure Blob Storage from Spark. The key is that you need to configure the Azure Storage account as HDFS-compatible storage in the core-site.xml file and add the two JARs hadoop-azure & azure-storage to your classpath, so that HDFS can be accessed via the wasb[s] protocol. You can refer to the official tutorial on HDFS-compatible storage with wasb, and to the blog about configuration for HDInsight, for more details.
For Spark running on Azure, the only difference is that you access HDFS with wasb; the other preparations have been done by Azure when creating the HDInsight cluster with Spark.
The method for listing files is listFiles or wholeTextFiles of SparkContext.
Hope it helps.
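If you need the file names themselves rather than their contents, here is a sketch using the Hadoop FileSystem API that Spark already ships with (the wasbs path is a placeholder, and the account key is assumed to be configured as described above):

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

val container = new URI("wasbs://mycontainer@myaccount.blob.core.windows.net/")
val fs = FileSystem.get(container, spark.sparkContext.hadoopConfiguration)

// non-recursive listing of one directory; use fs.listFiles(path, true) to recurse
val files = fs.listStatus(new Path("/data/"))
  .filter(_.isFile)
  .map(_.getPath.toString)

files.foreach(println)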
If you are using Databricks, try the below:
dbutils.fs.ls("blob_storage_location")
