azure databricks autoloader with structured streaming - azure

We are facing an issue with reading and writing streaming data into the target location.
We are working with some JSON telemetry data for tracking steps. New data files land every 5 seconds, and we need a way to automatically ingest them into our Delta lake.

Hope this helps:
query = (spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", <schemaLocation>)
.load(<dataset_source>)
.writeStream
.format("delta")
.option("checkpointLocation", <checkpoint_path>)
.trigger(processingTime="<Provide the time>")
.outputMode("append") # you can use complete if needed
.table("table_name"))
For more info refer: https://docs.databricks.com/ingestion/auto-loader/index.html
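For the scenario in the question (new files arriving roughly every 5 seconds), a concrete sketch might look like the following. The paths, table name, and the 5-second trigger are hypothetical placeholders rather than values from the question, and spark is the session Databricks provides:

# Auto Loader from a landing folder into a Delta table, triggered every 5 seconds.
# All paths and the table name below are hypothetical -- adjust to your environment.
query = (spark.readStream
         .format("cloudFiles")
         .option("cloudFiles.format", "json")
         .option("cloudFiles.schemaLocation", "/mnt/schemas/telemetry")   # hypothetical path
         .load("/mnt/landing/telemetry")                                  # hypothetical path
         .writeStream
         .format("delta")
         .option("checkpointLocation", "/mnt/checkpoints/telemetry")      # hypothetical path
         .trigger(processingTime="5 seconds")   # matches the 5-second arrival rate
         .outputMode("append")
         .table("telemetry_bronze"))            # hypothetical table name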

If you want to read a particular sub-folder: for example, my file location is /mnt/2023/01/13 and I want to read the data inside 2023/01. Then load the data like this: load('/mnt/<folder>/<sub_folder>') or use /mnt/2023/*
query = (spark.readStream
.format("cloudFiles")
.option("cloudFiles.format", "json")
.option("cloudFiles.schemaLocation", <Location>)
.load('/mnt/<folder>/<sub_folder>'))
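A glob pattern also works; this is only a sketch assuming a hypothetical mount layout and schema path:

# read every month under 2023 with a glob instead of a single sub-folder
query = (spark.readStream
         .format("cloudFiles")
         .option("cloudFiles.format", "json")
         .option("cloudFiles.schemaLocation", "/mnt/schemas/telemetry")  # hypothetical path
         .load("/mnt/2023/*"))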

Related

Streaming from a Delta Live Table in Databricks to a Kafka instance

I have the following live table, and I'm looking to write that into a stream to be written back into my Kafka source.
I've seen in the Apache Spark docs that I can use writeStream (I've already used readStream to get the data out of my Kafka stream). But how do I transform the table into the form it needs so that I can use this?
I'm fairly new to both Kafka and the data world, so any further explanation is welcome here.
writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "updates")
.start()
Thanks in Advance,
Ben
As of right now, Delta Live Tables can only write data as a Delta table - it's not possible to write in other formats. You can implement a workaround by creating a Databricks workflow that consists of two tasks (with or without dependencies, depending on whether the pipeline is triggered or continuous):
A DLT pipeline that does the actual data processing
A task (easiest to do with a notebook) that reads the table generated by DLT as a stream and writes its content into Kafka, with something like this:
df = spark.readStream.format("delta").table("database.table_name")
(df.writeStream.format("kafka")
.option("kafka....", "")
.trigger(availableNow=True) # if it's not continuous
.start()
)
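As a fuller, hedged sketch of that second task: the Kafka sink expects a string or binary value column, so the rows can be serialized to JSON first. The broker address, topic, table name, and checkpoint path below are placeholders, and availableNow=True requires a recent Spark/DBR version:

from pyspark.sql import functions as F

# read the DLT-produced table as a stream (hypothetical table name)
df = spark.readStream.format("delta").table("database.table_name")

# Kafka requires a "value" column, so serialize each row to JSON
kafka_df = df.select(F.to_json(F.struct(*df.columns)).alias("value"))

(kafka_df.writeStream
 .format("kafka")
 .option("kafka.bootstrap.servers", "host1:port1,host2:port2")  # placeholder brokers
 .option("topic", "updates")                                    # placeholder topic
 .option("checkpointLocation", "/path/to/kafka/checkpoint")     # placeholder path
 .trigger(availableNow=True)  # one-shot run; drop this for a continuous stream
 .start())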
P.S. If you have a solution architect or customer success engineer attached to your Databricks account, you can communicate this requirement to them for product prioritization.
The transformation is done after the read stream is started:
read_df = spark.readStream.format('kafka') ... # other options
processed_df = read_df.withColumn('some column', some_calculation)
processed_df.writeStream.format('parquet') ... # other options
.start()
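A more complete sketch of the same pattern, with hypothetical broker, topic, column, and path names (the exact options depend on your setup):

from pyspark.sql import functions as F

# read from Kafka (hypothetical broker and topic)
read_df = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", "host1:port1")
           .option("subscribe", "telemetry")
           .load())

# transformations are declared on the streaming DataFrame and run
# incrementally once the query is started
processed_df = (read_df
                .withColumn("body", F.col("value").cast("string"))
                .withColumn("ingest_date", F.to_date(F.col("timestamp"))))

# write to parquet (hypothetical output and checkpoint paths)
query = (processed_df.writeStream
         .format("parquet")
         .option("path", "/mnt/output/telemetry")
         .option("checkpointLocation", "/mnt/checkpoints/telemetry-parquet")
         .start())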
The Spark documentation is helpful and detailed, but some articles are not written for beginners. You can look on YouTube or read articles like this one to help you get started.

How to ingest data from Event Hubs to ADLS using a Databricks cluster (Scala)

I want to ingest streaming data from Event Hubs into ADLS Gen2 in a specified format.
I have done batch data ingestion, from a database to ADLS and from container to container, but now I want to try streaming data ingestion.
Can you please guide me on where to start and how to proceed? I have already created an Event Hub, a Databricks instance, and a storage account in Azure.
You just need to follow the documentation (for Scala, for Python) for the Event Hubs Spark connector. In the simplest case the code looks like the following (for Python):
from pyspark.sql import functions as F

readConnectionString = "..."
ehConf = {}
# this is required for versions 2.3.15+
ehConf['eventhubs.connectionString']=sc._jvm.org.apache.spark.eventhubs.EventHubsUtils.encrypt(readConnectionString)
df = spark.readStream \
.format("eventhubs") \
.options(**ehConf) \
.load()
# casting the binary payload to String (but it really depends on the
# data format inside the topic)
cdf = df.withColumn("body", F.col("body").cast("string"))
# write data to storage
stream = cdf.writeStream.format("delta")\
.outputMode("append")\
.option("checkpointLocation", "/path/to/checkpoint/directory")\
.start("ADLS location")
You may need to add additional options, like starting positions, etc., but everything is described well in the documentation.
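For example, a starting position can be passed through the same ehConf dictionary. This sketch follows the connector's PySpark documentation, but double-check the exact keys against the connector version you use:

import json

# add this before the spark.readStream call above;
# offset "-1" means the start of the stream, "@latest" means the end
startingEventPosition = {
    "offset": "-1",
    "seqNo": -1,            # not used when offset is set
    "enqueuedTime": None,   # not used when offset is set
    "isInclusive": True
}
ehConf["eventhubs.startingPosition"] = json.dumps(startingEventPosition)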

Spark Structured Streaming custom partition directory name

I'm porting a streaming job (Kafka topic -> AWS S3 parquet files) from Kafka Connect to a Spark Structured Streaming job.
I partition my data by year/month/day.
The code is very simple:
df.withColumn("year", functions.date_format(col("createdAt"), "yyyy"))
.withColumn("month", functions.date_format(col("createdAt"), "MM"))
.withColumn("day", functions.date_format(col("createdAt"), "dd"))
.writeStream()
.trigger(processingTime='15 seconds')
.outputMode(OutputMode.Append())
.format("parquet")
.option("checkpointLocation", "/some/checkpoint/directory/")
.option("path", "/some/directory/")
.option("truncate", "false")
.partitionBy("year", "month", "day")
.start()
.awaitTermination();
The output files are in the following directory (as expected):
/s3-bucket/some/directory/year=2021/month=01/day=02/
Question:
Is there a way to customize the output directory name? I need it to be
/s3-bucket/some/directory/2021/01/02/
For backward compatibility reasons.
No, there is no way to customize the output directory names into the format you have mentioned within your Spark Structured Streaming application.
Partitions are based on the values of particular columns, and without the column names in the directory path it would be ambiguous which column each value belongs to. You would need to write a separate application that transforms those directories into the desired format.
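If you go that route, here is a minimal sketch of such a post-processing step, assuming a hypothetical bucket and prefix and using boto3 (S3 has no real rename, so each object is copied to the new key and the old one deleted):

import re
import boto3

s3 = boto3.client("s3")
bucket = "s3-bucket"          # hypothetical bucket name
prefix = "some/directory/"    # hypothetical prefix

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        old_key = obj["Key"]
        # year=2021/month=01/day=02/... -> 2021/01/02/...
        new_key = re.sub(r"year=(\d{4})/month=(\d{2})/day=(\d{2})/",
                         r"\1/\2/\3/", old_key)
        if new_key != old_key:
            s3.copy_object(Bucket=bucket, Key=new_key,
                           CopySource={"Bucket": bucket, "Key": old_key})
            s3.delete_object(Bucket=bucket, Key=old_key)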

Azure Event Hubs to Databricks, what happens to the dataframes in use

I've been developing a proof of concept with Azure Event Hubs, streaming JSON data to an Azure Databricks notebook using PySpark. Following the examples I've seen, I've created my rough code as follows, taking the data from the Event Hub to the Delta table I'll be using as a destination:
from pyspark.sql.functions import col, to_date

connectionString = "My End Point"
ehConf = {'eventhubs.connectionString' : connectionString}
df = spark \
.readStream \
.format("eventhubs") \
.options(**ehConf) \
.load()
readEventStream = df.withColumn("body", \
df["body"].cast("string")). \
withColumn("date_only", to_date(col("enqueuedTime")))
readEventStream.writeStream.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "/delta/testSink/streamprocess") \
.table("testSink")
After reading around and googling: what happens to the df and readEventStream DataFrames? Will they just keep getting bigger as they retain the data, or will they empty out during normal processing? Or are they just a temporary store before the data is dumped to the Delta table? Is there a way of setting X amount of items streamed before writing out to the Delta table?
Thanks
I carefully reviewed the description of the APIs you used in the code against the official PySpark documentation for the pyspark.sql module, and I think the growing memory usage was caused by the function table(tableName), which the documentation describes for a DataFrame, not for a streaming DataFrame.
So the table function creates a data structure to hold the streaming data in memory.
I recommend using start(path=None, format=None, outputMode=None, partitionBy=None, queryName=None, **options) to complete the stream write operation first, and then reading the table from the Delta lake again. There also does not seem to be a way in PySpark to set X amount of items streamed before writing out to the Delta table.
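A minimal sketch of that suggestion, assuming a hypothetical output path: write the stream to a Delta location with start(), then read the data back separately:

# write the stream to a Delta location with start() instead of table()
stream = (readEventStream.writeStream
          .format("delta")
          .outputMode("append")
          .option("checkpointLocation", "/delta/testSink/streamprocess")
          .start("/delta/testSink/data"))   # hypothetical output path

# later, read the written data back as a (batch) DataFrame
testSinkDf = spark.read.format("delta").load("/delta/testSink/data")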

Spark Structured Streaming writing to parquet creates so many files

I used Structured Streaming to load messages from Kafka, do some aggregation, and then write to parquet files. The problem is that so many parquet files are created (800 files) for only 100 messages from Kafka.
The aggregation part is:
return model
.withColumn("timeStamp", col("timeStamp").cast("timestamp"))
.withWatermark("timeStamp", "30 seconds")
.groupBy(window(col("timeStamp"), "5 minutes"))
.agg(
count("*").alias("total"));
The query:
StreamingQuery query = result //.orderBy("window")
.writeStream()
.outputMode(OutputMode.Append())
.format("parquet")
.option("checkpointLocation", "c:\\bigdata\\checkpoints")
.start("c:\\bigdata\\parquet");
When loading one of the parquet files using Spark, it shows up empty:
+------+-----+
|window|total|
+------+-----+
+------+-----+
How can I save the dataset to only one parquet file?
Thanks
My idea was to use Spark Structured Streaming to consume events from Azure Event Hubs and then store them in storage in parquet format.
I finally figured out how to deal with the many small files being created.
Spark version 2.4.0.
This is how my query looks:
dfInput
.repartition(1, col('column_name'))
.select("*")
.writeStream
.format("parquet")
.option("path", "adl://storage_name.azuredatalakestore.net/streaming")
.option("checkpointLocation", "adl://storage_name.azuredatalakestore.net/streaming_checkpoint")
.trigger(processingTime='480 seconds')
.start()
As a result, I have one file created in the storage location every 480 seconds.
To find the balance between file size and number of files and avoid OOM errors, just play with two parameters: the number of partitions and processingTime, i.e. the batch interval.
I hope you can adjust the solution to your use case.
