How to ingest data from Event Hub to ADLS using a Databricks cluster (Scala) - apache-spark

I want to ingest streaming data from Event Hub to ADLS Gen2 in a specified format.
I have already done batch data ingestion, from a DB to ADLS and from container to container, but now I want to try streaming data ingestion.
Can you please guide me on where to start and how to proceed? I have already created an Event Hub, a Databricks instance, and a Storage Account in Azure.

You just need to follow the documentation (for Scala, for Python) for the EventHubs Spark connector. In the simplest case the code looks as follows (for Python):
import pyspark.sql.functions as F

readConnectionString = "..."
ehConf = {}
# encrypting the connection string is required for connector versions 2.3.15+
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(readConnectionString)

# read the stream from Event Hubs
df = spark.readStream \
    .format("eventhubs") \
    .options(**ehConf) \
    .load()

# cast the binary payload to string (this really depends on the
# data format inside the topic)
cdf = df.withColumn("body", F.col("body").cast("string"))

# write data to storage as Delta
stream = cdf.writeStream.format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/path/to/checkpoint/directory") \
    .start("ADLS location")
You may need to add additional options, such as starting positions, but everything is described well in the documentation.
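For example, a starting position can be supplied through ehConf as a JSON document (a minimal sketch based on the connector documentation; the values below start from the beginning of the stream and should be adjusted to your needs):
import json

# Sketch: start reading from the beginning of the Event Hub stream.
startingEventPosition = {
    "offset": "-1",        # -1 means the start of the stream
    "seqNo": -1,           # not in use when offset is set
    "enqueuedTime": None,  # not in use when offset is set
    "isInclusive": True
}
ehConf["eventhubs.startingPosition"] = json.dumps(startingEventPosition)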

Related

Converting Kinesis Streams to Pyspark Dataframe and reading dataframe using .show()

I have a Kinesis stream resource connected to a DynamoDB stream. Whenever any sort of operation happens in DynamoDB, it is reflected in my Kinesis stream. My Kinesis stream is provisioned with only one shard. I am able to see the JSON arrive in my Kinesis stream.
I'm trying to write a script in a Glue notebook so that when I run the notebook cell, I can access all the JSON from the DynamoDB streams at that particular point in time, convert it into a DataFrame, and view it as a PySpark DataFrame. Is this even possible?
I'm trying to do something like this:
kinesisDF = spark \
    .readStream \
    .format("kinesis") \
    .option("streamName", 'sample-kinesis') \
    .option("streamARN", 'arn:aws:kinesis:us-east-1:589840918737:stream/sample-kinesis') \
    .option("region", "us-east-1") \
    .option("initialPosition", "TRIM_HORIZON") \
    .option("format", "json") \
    .option("inferSchema", "true") \
    .load()

kinesisDF.show()
Ended up with this error:
AnalysisException: 'Queries with streaming sources must be executed with writeStream.start();;\nkinesis'
Is there any other way to do this and call .show() on the DataFrame so as to display all the JSON that arrived up until I ran the cell in the Glue notebook?
I also created a Glue Data Catalog table with my Kinesis stream as the source and tried using
data_frame = glueContext.create_data_frame.from_catalog(database = "my_db", table_name = "sample-kinesis-tbl", transformation_ctx = "DataSource0", additional_options = {"startingPosition": "TRIM_HORIZON", "inferSchema": "true"})
but I ended up with the same error, even when this was run as part of a Glue job:
AnalysisException: Queries with streaming sources must be executed with writeStream.start();
Is there any other way to read the Kinesis stream (up until that moment in time), convert it into a DataFrame, and view it? Kindly help.
EDIT 1: I also tried adding a Kinesis Firehose delivery stream attached to the Kinesis stream that reads the DynamoDB streams, set its destination to S3, and read from S3. I was able to read that as a DataFrame, but I don't know how efficient this method actually is. Any suggestions?
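One hedged workaround (not from the original post): since .show() cannot be called on a streaming DataFrame, a short-lived streaming query can be run into Spark's in-memory sink and the resulting table queried afterwards. The query name below is illustrative, and the memory sink is intended for small debugging volumes only:
# Sketch: materialize whatever is currently available in the stream into an
# in-memory table, then display it with a regular batch query.
query = (kinesisDF.writeStream
    .format("memory")               # in-memory sink, debugging only
    .queryName("kinesis_snapshot")  # name of the in-memory table
    .trigger(once=True)             # process available data once, then stop
    .start())

query.awaitTermination()
spark.sql("SELECT * FROM kinesis_snapshot").show(truncate=False)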

Azure Databricks Auto Loader with Structured Streaming

We are facing an issue with reading and writing streaming data into the target location.
We are working with some JSON telemetry data for tracking steps. New data files land in our data lake every 5 seconds, and we need a way to automatically ingest them into Delta Lake.
Hope this helps
query = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", <schemaLocation>)
    .load(<dataset_source>)
    .writeStream
    .format("delta")
    .option("checkpointLocation", <checkpoint_path>)
    .trigger(processingTime="<Provide the time>")
    .outputMode("append")  # you can use complete if needed
    .table("table_name"))
For more info, refer to: https://docs.databricks.com/ingestion/auto-loader/index.html
If you want to read a particular subfolder, for example when your files are located under /mnt/2023/01/13 and you want to read the data inside 2023/01, then load the data like this: load('/mnt/<folder>/<sub_folder>'), or use a wildcard such as /mnt/2023/*.
query = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", <Location>)
    .load('/mnt/<folder>/<sub_folder>')
    # ... followed by the same writeStream chain as in the previous example
    )

Write data to specific partitions in Azure Dedicated SQL pool

At the moment, we are using the steps in the article below to do a full load of the data from one of our Spark data sources (a Delta Lake table) and write it to a table in the SQL DW.
https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/azure/synapse-analytics
Specifically, the write is carried out using:
df.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "<your-table-name>") \
    .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
    .option("maxStrLength", 4000) \
    .mode("overwrite") \
    .save()
Now, our source data, by virtue of being a Delta Lake table, is partitioned by countryid. We would like to load/refresh only certain partitions in the SQL DWH instead of the full drop-table-and-load (because we specify "overwrite") that is happening now. I tried adding an additional option (partitionBy, countryid) to the above script, but that doesn't seem to work.
Also the above article doesn't mention partitioning.
How do I work around this?
There might be better ways to do this, but this is how I achieved it. If the target Synapse table is partitioned, we can leverage the "preActions" option provided by the Synapse connector to delete the existing data for that partition, and then append the new data pertaining to that partition (read as a DataFrame from the source) instead of overwriting the whole table.
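A minimal sketch of that approach, reusing the placeholders from the write snippet in the question; the DELETE statement, the countryid column, and the df_partition DataFrame are illustrative:
# Sketch: refresh a single countryid partition in the Synapse table.
# df_partition is assumed to hold only the rows for that countryid.
country_id = 100  # illustrative value

df_partition.write \
    .format("com.databricks.spark.sqldw") \
    .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
    .option("forwardSparkAzureStorageCredentials", "true") \
    .option("dbTable", "<your-table-name>") \
    .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
    .option("preActions", "DELETE FROM <your-table-name> WHERE countryid = {}".format(country_id)) \
    .mode("append") \
    .save()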

Azure Event Hubs to Databricks, what happens to the dataframes in use

I've been developing a proof of concept on Azure Event Hubs, streaming JSON data to an Azure Databricks notebook using PySpark. Following the examples I've seen, I've created my rough code as follows, taking the data from the Event Hub to the Delta table I'll be using as a destination:
from pyspark.sql.functions import col, to_date

connectionString = "My End Point"
ehConf = {'eventhubs.connectionString': connectionString}

df = spark \
    .readStream \
    .format("eventhubs") \
    .options(**ehConf) \
    .load()

readEventStream = df \
    .withColumn("body", df["body"].cast("string")) \
    .withColumn("date_only", to_date(col("enqueuedTime")))

readEventStream.writeStream.format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/delta/testSink/streamprocess") \
    .table("testSink")
After reading around and googling, what happens to the df and readEventStream DataFrames? Will they just get bigger as they retain the data, or will they be emptied during normal processing? Or are they just a temporary store before the data is dumped to the Delta table? Is there a way of setting an X amount of items streamed before writing out to the Delta table?
Thanks
I carefully reviewed the description of the APIs you used against the official PySpark documentation for the pyspark.sql module. I think the growing memory usage is caused by the function table(tableName), which is for a DataFrame, not for a streaming DataFrame.
So the table function creates the data structure to hold the streaming data in memory.
I recommend you use start(path=None, format=None, outputMode=None, partitionBy=None, queryName=None, **options) to complete the stream write operation first, and then get a table from the Delta lake again. And there does not seem to be a way to set an X amount of items streamed using PySpark before writing out to the Delta table.
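A minimal sketch of the start() suggestion above, reusing the checkpoint path from the question; the output path is a placeholder:
# Sketch: write the stream to a Delta path with start() instead of table(),
# then read the Delta location back as a regular (batch) DataFrame.
query = readEventStream.writeStream.format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/delta/testSink/streamprocess") \
    .start("/delta/testSink/data")  # illustrative output path

# later, query the written data from the Delta location
sinkDF = spark.read.format("delta").load("/delta/testSink/data")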

Azure Data Factory: Output Copied File and Folder Information from Copy Activity

I'm using the Self-Hosted Integration Runtime in Azure Data Factory to copy data from an On-Premises source (normal file system) to an Azure Blob Storage destination. After being transferred, I want to process the files automatically by attaching a Notebook running on a Databricks cluster. The pipeline works fine, but my question concerns the output of the Copy Activity.
Is there a way to get information on the transferred files and folders for each run? I would pass this information as parameters to the notebook.
Looking at the documentation, it seems only aggregated information is available:
https://learn.microsoft.com/en-us/azure/data-factory/copy-activity-overview
Which kind of makes sense if you transfer huge amounts of files. If it is not possible, I guess an alternative approach would be to just leave the copy process to itself and create another pipeline based on storage account events? Or maybe store the new file and folder information for each run in a fixed text file, transfer it as well, and read it in the notebook?
If you want to get information about the files or directories being read by Data Factory, this can be done using the Get Metadata activity; see the following answer for an example.
Another approach to detect new files in your notebook would be to use Structured Streaming with file sources. This works pretty well, and you just call the notebook activity after the copy activity.
For this you define a streaming input DataFrame:
streamingInputDF = (
    spark
        .readStream
        .schema(pqtSchema)
        .parquet(inputPath)
)
with inputPath pointing to the input directory in Blob Storage. Supported file formats are text, csv, json, orc, and parquet, so whether this will work for you depends on your concrete scenario.
It is important that on the target you use the trigger-once option, so the notebook does not need to run permanently, e.g.:
streamingOutputDF \
    .repartition(1) \
    .writeStream \
    .format("parquet") \
    .partitionBy('Id') \
    .option("checkpointLocation", adlpath + "spark/checkpointlocation/data/trusted/sensorreadingsdelta") \
    .option("path", targetPath + "delta") \
    .trigger(once=True) \
    .start()
Another approach could be using Azure Queue Storage (AQS), see the following documentation.
The solution was actually quite simple in this case. I just created another pipeline in Azure Data Factory, triggered by a Blob Created event, with the folder and file name passed as parameters to my notebook. It seems to work well, and a minimal amount of configuration or code is required. Basic filtering can be done with the event, and the rest is up to the notebook.
For anyone else stumbling across this scenario, details below:
https://learn.microsoft.com/en-us/azure/data-factory/how-to-create-event-trigger
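For completeness, a hedged sketch of the notebook side: the folder and file name arrive as notebook parameters (widgets), whose names below are illustrative and must match the base parameters configured on the Databricks Notebook activity in the pipeline:
# Sketch: read the pipeline parameters inside the Databricks notebook.
folder_path = dbutils.widgets.get("folderPath")  # illustrative widget name
file_name = dbutils.widgets.get("fileName")      # illustrative widget name

input_path = "{}/{}".format(folder_path, file_name)
df = spark.read.json(input_path)  # or whichever format the copied files use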
