Azure Event Hubs to Databricks, what happens to the dataframes in use

I've been developing a proof of concept that streams JSON data from Azure Event Hubs to an Azure Databricks notebook using PySpark. Based on the examples I've seen, I've put together the rough code below, which takes the data from the Event Hub to the Delta table I'll be using as a destination:
from pyspark.sql.functions import col, to_date

connectionString = "My End Point"
ehConf = {'eventhubs.connectionString': connectionString}

df = spark \
    .readStream \
    .format("eventhubs") \
    .options(**ehConf) \
    .load()

readEventStream = df \
    .withColumn("body", df["body"].cast("string")) \
    .withColumn("date_only", to_date(col("enqueuedTime")))

readEventStream.writeStream.format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/delta/testSink/streamprocess") \
    .table("testSink")
After reading around and googling, I'm still unsure what happens to the df and readEventStream DataFrames. Will they just keep growing as they retain data, or will they be emptied as part of the normal process? Or are they just a temporary store before the data is dumped to the Delta table? Is there a way of setting X amount of items streamed before writing out to the Delta table?
Thanks

I carefully reviewed the descriptions of the APIs you used in your code in the official PySpark documentation for the pyspark.sql module, and I think the ever-growing memory usage is caused by the table(tableName) function, which is documented for a DataFrame, not for a streaming DataFrame.
So the table function creates a data structure in memory that the streaming data keeps filling.
I recommend using start(path=None, format=None, outputMode=None, partitionBy=None, queryName=None, **options) to complete the stream write operation first, and then reading the table back from Delta Lake. And there does not seem to be a way in PySpark to set an X amount of items streamed before writing out to the Delta table.
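A minimal sketch of that suggestion, assuming a hypothetical /delta/testSink path as the Delta output location (adjust paths and names to your environment):

# write the stream out with start() instead of table()
streamQuery = readEventStream.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/delta/testSink/streamprocess") \
    .start("/delta/testSink")

# later, read the sink back as a regular (batch) DataFrame
testSinkDf = spark.read.format("delta").load("/delta/testSink")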

Related

Converting Kinesis Streams to Pyspark Dataframe and reading dataframe using .show()

I have a Kinesis stream connected to a DynamoDB stream. Whenever any sort of operation happens in DynamoDB, it is reflected in my Kinesis stream. My Kinesis stream is provisioned with only one shard. I am able to see the JSON arrive in my Kinesis stream.
I'm trying to write a script in a Glue notebook so that when I run the notebook cell, I can access all the JSON from the DynamoDB streams at that particular point in time, convert it into a DataFrame, and view it as a PySpark DataFrame. Is this even possible?
I'm trying to do something like this:
kinesisDF = spark \
.readStream \
.format("kinesis") \
.option("streamName",'sample-kinesis') \
.option("streamARN", 'arn:aws:kinesis:us-east-1:589840918737:stream/sample-kinesis') \
.option("region", "us-east-1") \
.option("initialPosition", "TRIM_HORIZON") \
.option("format", "json") \
.option("inferSchema", "true") \
.load()
kinesisDF.show()
Ended up with this error:
AnalysisException: 'Queries with streaming sources must be executed with writeStream.start();;\nkinesis'
Is there any other way to do this and call .show() on the DataFrame, so as to display all the JSON that arrived up until I ran the cell in the Glue notebook?
I also created a Glue Data Catalog table with my Kinesis stream as the source and tried using:
data_frame = glueContext.create_data_frame.from_catalog(database = "my_db", table_name = "sample-kinesis-tbl", transformation_ctx = "DataSource0", additional_options = {"startingPosition": "TRIM_HORIZON", "inferSchema": "true"})
but I ended up with the same error, even when this was run as part of a Glue job:
AnalysisException: Queries with streaming sources must be executed with writeStream.start();
Is there any other way to read the Kinesis stream (up until that moment in time), convert it into a DataFrame, and view it?
Kindly help.
EDIT 1: I also tried adding a Kinesis Data Firehose delivery stream that reads from the Kinesis stream fed by the DynamoDB streams, setting its destination to S3, and reading from S3. I was able to read it as a DataFrame, but I don't know how efficient this method actually is. Any suggestions?
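For reference, a minimal sketch of the Firehose-to-S3 workaround described in the edit above, with a hypothetical bucket name and prefix standing in for wherever Firehose delivers the records:

# batch read of the JSON files Firehose delivered to S3
# (bucket and prefix below are placeholders)
s3Df = spark.read \
    .format("json") \
    .load("s3://my-firehose-bucket/sample-kinesis/")

s3Df.show()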

Azure Databricks Auto Loader with Structured Streaming

We are facing an issue with reading and writing streaming data to the target location.
We are working with some JSON telemetry data for tracking steps. New data files land in our data lake every 5 seconds, and we need a way to automatically ingest them into Delta Lake.
Hope this helps
query = (spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", <schemaLocation>)
.load(<dataset_source>)
.writeStream
.format("delta")
.option("checkpointLocation", <checkpoint_path>)
.trigger(processingTime="<Provide the time>")
.outputMode("append") # you can use complete if needed
.table("table_name"))
For more info, refer to: https://docs.databricks.com/ingestion/auto-loader/index.html
If you want to read a particular subfolder: for example, my file location is /mnt/2023/01/13 and I want to read the data inside 2023/01, so I load the data like this: load('/mnt/<folder>/<sub_folder>'), or use a wildcard such as /mnt/2023/*.
query = (spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", <Location>)
.load('/mnt/<folder>/<sub_folder>')
# ... followed by the same writeStream chain as above
)
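And the wildcard variant mentioned above, assuming the same hypothetical /mnt/2023/... folder layout:

# read everything under 2023 via a wildcard path
query = (spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", <Location>)
.load('/mnt/2023/*')
# ... followed by the same writeStream chain as above
)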

How to ingest data from Event Hubs to ADLS using a Databricks cluster (Scala)

I want to ingest streaming data from Event Hubs into ADLS Gen2 in a specified format.
I have done batch data ingestion, from a DB to ADLS and from container to container, but now I want to try streaming data ingestion.
Can you please guide me on where to start and how to proceed? I have already created an Event Hub, a Databricks instance, and a storage account in Azure.
You just need to follow the documentation (for Scala, for Python) for the EventHubs Spark connector. In the simplest case the code looks like the following (for Python):
from pyspark.sql import functions as F

readConnectionString = "..."
ehConf = {}
# this is required for versions 2.3.15+
ehConf['eventhubs.connectionString'] = sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(readConnectionString)

df = spark.readStream \
  .format("eventhubs") \
  .options(**ehConf) \
  .load()

# casting binary payload to String (but it really depends on the
# data format inside the topic)
cdf = df.withColumn("body", F.col("body").cast("string"))

# write data to storage
stream = cdf.writeStream.format("delta") \
  .outputMode("append") \
  .option("checkpointLocation", "/path/to/checkpoint/directory") \
  .start("ADLS location")
You may need to add additional options, like starting positions, etc., but everything is described well in the documentation.
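As an illustration of such an option, here is a sketch of setting a starting position in ehConf, following the pattern from the connector's PySpark docs; double-check the keys against the connector version you use:

import json

# start reading from the beginning of the stream
startingEventPosition = {
  "offset": "-1",         # start of stream
  "seqNo": -1,            # not in use
  "enqueuedTime": None,   # not in use
  "isInclusive": True
}
ehConf["eventhubs.startingPosition"] = json.dumps(startingEventPosition)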

Write data to specific partitions in Azure Dedicated SQL pool

At the moment, we are using the steps in the article below to do a full load of the data from one of our Spark data sources (a Delta Lake table) and write it to a table in SQL DW.
https://learn.microsoft.com/en-us/azure/databricks/data/data-sources/azure/synapse-analytics
Specifically, the write is carried out using:
df.write \
  .format("com.databricks.spark.sqldw") \
  .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
  .option("forwardSparkAzureStorageCredentials", "true") \
  .option("dbTable", "<your-table-name>") \
  .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
  .option("maxStrLength", 4000) \
  .mode("overwrite") \
  .save()
Now, our source data, by virtue of being a Delta Lake table, is partitioned by countryid. We would like to load/refresh only certain partitions to the SQL DWH, instead of the full drop-and-load (because we specify "overwrite") that happens now. I tried adding an additional option (partitionBy, countryid) to the above script, but that doesn't seem to work.
Also, the above article doesn't mention partitioning.
How do I work around this?
There might be better ways to do this, but this is how I achieved it. If the target Synapse table is partitioned, we can leverage the "preActions" option provided by the Synapse connector to delete the existing data for that partition. Then we append the new data pertaining to that partition (read as a DataFrame from the source), instead of overwriting the whole table.
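A minimal sketch of that approach, assuming a hypothetical target table my_table partitioned by countryid and a countryDf already filtered to a single country's data:

# delete the target partition's rows first, then append the fresh data
preQuery = "DELETE FROM my_table WHERE countryid = 826"  # hypothetical table and value

countryDf.write \
  .format("com.databricks.spark.sqldw") \
  .option("url", "jdbc:sqlserver://<the-rest-of-the-connection-string>") \
  .option("forwardSparkAzureStorageCredentials", "true") \
  .option("dbTable", "my_table") \
  .option("tempDir", "wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<your-directory-name>") \
  .option("preActions", preQuery) \
  .mode("append") \
  .save()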

delta lake in databricks - a consistent "view" of just the last half hour of a stream

I have a consistently updated table from Spark Structured Streaming (Kafka source),
written like this (in foreachBatch):
parsedDf \
.select("somefield", "anotherField",'partition', 'offset') \
.write \
.format("delta") \
.mode("append") \
.option("mergeSchema", "true") \
.save(f"/mnt/defaultDatalake/{append_table_name}")
I need to keep a fast view on this table for "items inserted in the last half hour"
How can this be achieved?
I can have a readStream from this table, but what I'm missing is how to keep just the "tail" of the stream there.
Databricks 7.5, Spark 3.
Given that Delta Lake does not have materialized views, and that Delta Lake time travel is not relevant because you want the most current data:
You can load the data and include a key that does not need to be looked up while inserting.
Pre-populate a time dimension for joining with your data. See it as a dimension with a grain of one minute.
Join the data with this dimension, relying on dynamic file pruning. Thus you need to query on a rolling 30-minute window at minute grain and set those values in the query (a rough sketch follows below the link).
See https://databricks.com/blog/2020/04/30/faster-sql-queries-on-delta-lake-with-dynamic-file-pruning.html#:~:text=Dynamic%20File%20Pruning%20(DFP)%2C%20a%20new%20feature%20now%20enabled,queries%20on%20non%2Dpartitioned%20tables.
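A rough sketch of that idea, with hypothetical names (a time_dim_minute Delta table for the dimension, a minute_ts column on it, and an event_minute key on the appended table); treat it as an outline rather than a tested recipe:

from pyspark.sql import functions as F

# pre-populated minute-grain time dimension stored as Delta (hypothetical path)
time_dim = spark.read.format("delta").load("/mnt/defaultDatalake/time_dim_minute")

# restrict the dimension to the rolling last-30-minutes window
recent_minutes = time_dim.where(
    F.col("minute_ts") >= F.current_timestamp() - F.expr("INTERVAL 30 MINUTES")
)

# join the appended table on the minute key; the small, selective dimension side
# is what lets dynamic file pruning skip files outside the window
# (append_table_name as in the write shown in the question)
appended = spark.read.format("delta").load(f"/mnt/defaultDatalake/{append_table_name}")
last_half_hour = appended.join(
    recent_minutes, appended["event_minute"] == recent_minutes["minute_ts"]
)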
