Azure Databricks: Switching from batch to streaming mode - azure

Hello spark community,
Currently we have a batch pipeline in azure databricks that reads from a delta table. Job is run every night once per day. We extract its data, save it in our own location as delta table again and then we write to azure sql database table. Everything is partioned on date like this: Date=2021-01-01, etc.
Things are about to change now since our source delta table is about to get refreshed every 2-3 minutes and the requirement is to change the ETL from nightly batch to streaming mode but still using the same source and target tables as well as the same sql database table.
Right now this streaming challenge is imposing several questions:
Our delta source table is quite huge (30B+ rows), so far in our nightly batch we extracted only the changed date keys and used MERGE INTO to write the updates/inserts, however once we switch to streaming mode is it possible to tell spark to start streaming from a specific point in time since we do not want to load the whole table again this time with streaming mode?
How do you write a stream to a target delta table with MERGE INTO having to check huge amount of keys on both sides. I suppose we can get use of a foreachBatch and do that in a micro batch manner, but still how each new micro batch to check for existing/non-existing keys in our target delta table in a time-efficient manner (30B+ rows in the current target table)?
Not sure about that but is it possible for a streaming spark job to write directly to a sql database (azure) and will this become a bottleneck situation hence not advisable to be done?
I am really looking forward for some good advice and design decisions since this would be quite a big issue with the current data size. Appreciate every opinion here!

Related

Optimizing Merge in Delta Lake (Databricks Open Source )

I am trying to implement merge using delta lake oss and my history data is around 7 billions records and delta is around 5 millions.
The merge is based on the composite key(5 columns).
I am spinning up a 10 node cluster r5d.12xlarge(~3TB MEMORY / ~480 CORES).
The job took 35 Minutes for first time and the subsequent runs are taking more time.
Tried using optimization techniques , but nothing worked and i started to get heap memory issues after 3 runs , i see lot spill on disk while data shuffles, tried with re writing the history using order by on merge keys ,got performance improvement and merge completed in 20 minutes and the spill was around 2TB ,however the problem is that the data written as part of merge process was not in same order as I have no control on order of writing data ,so subsequent runs are taking longer .
I was not able to use Zorder in delta lake oss as it only comes with subscription .I tried compaction but that did not help either .
Please let me know if there is a better way to optimize the merge process .
Here's a recommendation, it seems you are running your databricks notebook on AWS.
The way to optimize it to use Hive metastore or any catalog service alongside. Now how this will help?
While saving the data you can use bucketing to order you data according to the merge keys and this metadata information needs to be stored in the metastore which will require hive.
If you use bucketing the data will be in order and will not result in excessive shuffling of data which will inevitably improve the performance of your job.
I am not very sure about databricks but if you use EMR you gets the options to use glue catalog as metastore or you can have your own metastore in EMR also.
20 min sounds pretty good in my experience;) What is your partition scheme? Merges are slowed in the same way that SELECTS are, so if you can eliminate lake scans by way of partition filters, that should help tremendously.
Also take a look at the shuffle partitions settings in spark, as I have found these to have a huge impact on performance.
Lastly, compacting you data will have a huge impact on merge performance.
I am having the same issues, same data size as well. I'm going to get rid of the merge statement and break it into two pieces.
Join where match, so UPSATE statement.
Join where no match, INSERT.
If you really want optimize it through code , you can Launching parallel tasks. this is sample code which we have used to parallelized for S3 writing . You can use same logic for adls location as well .
with futures.ThreadPoolExecutor(max_workers=total_days+1) as e:
print(f"{raw_bucket}/{db}/{table}/")
for single_date in daterange(start_date, end_date):
curr_date = single_date.strftime("%Y-%m-%d")
jobs.append(e.submit(writeS3, curr_date))
for job in futures.as_completed(jobs):
result_done = job.result()
print(f"Job Completed - {result_done}")
print("Task complete")
ref : https://docs.python.org/3/library/concurrent.futures.html

Spark Structured Streaming - Streaming data joined with static data which will be refreshed every 5 mins

For spark structured streaming job one input is coming from a kafka topic while second input is a file (which will be refreshed every 5 mins by a python API). I need to join these 2 inputs and write to a kafka topic.
The issue I am facing is when second input file is being refreshed and spark streaming job is reading the file at the same time I get the error below:
File file:/home/hduser/code/new/collect_ip1/part-00163-55e17a3c-f524-4dac-89a4-b9e12f1a79df-c000.csv does not exist
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by recreating the Dataset/DataFrame involved.
Any help will be appreciated.
Use HBase as your store for static. It is more work for sure but allows for concurrent updating.
Where I work, all Spark Streaming uses HBase for lookup of data. Far faster. What if you have a 100M customers for a microbatch of 10k records? I know it was a lot of work initially.
See https://medium.com/#anchitsharma1994/hbase-lookup-in-spark-streaming-acafe28cb0dc
If you have a small static ref table, then static join is fine, but you also have updating, causing issues.

Spark Streaming to Hive, too many small files per partition

I have a spark streaming job with a batch interval of 2 mins(configurable).
This job reads from a Kafka topic and creates a Dataset and applies a schema on top of it and inserts these records into the Hive table.
The Spark Job creates one file per batch interval in the Hive partition like below:
dataset.coalesce(1).write().mode(SaveMode.Append).insertInto(targetEntityName);
Now the data that comes in is not that big, and if I increase the batch duration to maybe 10mins or so, then even I might end up getting only 2-3mb of data, which is way less than the block size.
This is the expected behaviour in Spark Streaming.
I am looking for efficient ways to do a post processing to merge all these small files and create one big file.
If anyone's done it before, please share your ideas.
I would encourage you to not use Spark to stream data from Kafka to HDFS.
Kafka Connect HDFS Plugin by Confluent (or Apache Gobblin by LinkedIn) exist for this very purpose. Both offer Hive integration.
Find my comments about compaction of small files in this Github issue
If you need to write Spark code to process Kafka data into a schema, then you can still do that, and write into another topic in (preferably) Avro format, which Hive can easily read without a predefined table schema
I personally have written a "compaction" process that actually grabs a bunch of hourly Avro data partitions from a Hive table, then converts into daily Parquet partitioned table for analytics. It's been working great so far.
If you want to batch the records before they land on HDFS, that's where Kafka Connect or Apache Nifi (mentioned in the link) can help, given that you have enough memory to store records before they are flushed to HDFS
I have exactly the same situation as you. I solved it by:
Lets assume that your new coming data are stored in a dataset: dataset1
1- Partition the table with a good partition key, in my case I have found that I can partition using a combination of keys to have around 100MB per partition.
2- Save using spark core not using spark sql:
a- load the whole partition in you memory (inside a dataset: dataset2) when you want to save
b- Then apply dataset union function: dataset3 = dataset1.union(dataset2)
c- make sure that the resulted dataset is partitioned as you wish e.g: dataset3.repartition(1)
d - save the resulting dataset in "OverWrite" mode to replace the existing file
If you need more details about any step please reach out.

Keeping only small window of output data in Spark Structured Streaming

I am interested in using Spark Structured Streaming for real-time data processing using data from for example last ~24hours of running, however I am not able to find correct solution for this problem.
Some useful information about entire situation:
Data is constantly flowing all the time as an input for Spark so stream is active 24/7
Spark does some actions and then writes some data to files(eg. parquet)
Watermarking is used to reduce state size
Someone wants to work on only most recent data returned from Spark Structured Streaming(for example all data from last 24 hours) to have a quick view on what happened in last time and for further very specific analysis.
From what I understand, watermarking helps managing state size so Spark does not hold entire data about the state. This is a good thing and solves one problem with 24/7 running.
The other problem is output data. Currently Spark appends the data and nothing else, this makes it grow bigger and bigger. Using memory sink for testing it creates memory problems. I didn't try it with file sink because it creates one file for each record(ugh) so there's a risk of using all available inodes in system extremely quickly. I can create one file per window with file sink.
So my question is:
Is it possible to force Spark Structured Streaming to delete output data after some amount of time when it is no longer needed? I want to keep output data only from for example last 24 hours. Is there any build-in solution or do I need to do it on my own? If I needed to do it on my own, wouldn't checkpoint data and spark metadata get corrupted?
Using watermarking will allow us to keep only the last 24 hours
df.withWatermark("timestamp", "24 hours") //when timestamp is your event time field
From Spark Spark doc
in Spark 2.1, we have introduced watermarking, which lets the engine
automatically track the current event time in the data and attempt to
clean up old state accordingly.

Read from Hbase + Convert to DF + Run SQLs

Edit
My use case is a Spark streaming app (spark 2.1.1 + Kafka 0.10.2.1), wherein I read from Kafka and for each message/trigger need to pull data from HBase. post the pull, I need to run some SQL statements on the data (so received from HBase)
Naturally, I intend to push the processing (read from HBase & SQL execution) to the worker nodes to achieve parallelism.
So far, my attempts to convert the data from HBase to a data frame (so that i can launch SQK statements) are failing. Another gent mentioned that it's not "allowed " since that part is running on executors. However, this is my conscious choice to run those pieces on worker nodes.
Is that sound thinking? If not, why not?
What's the recommendation on that? or on the overall idea?
For every streamed rec, reading from hbase and sql seems to be "too much happening in streaming app".
Anyways, you can create connection for every partition to hbase and get records and then compare. Not sure about sql. If its just another reading for every streaming record, again handle at partition level in spark.
But the above approach will be time consuming - just make sure you finish all stuff before the next batch starts.
You also mentioned converting "hbase to dataframe" and "parallel". Both seemed to be in opposite direction. Because you start with dataframe(may be reading from hbase once and then you parallelize. Hope I cleared some of your doubts

Resources