Optimizing Merge in Delta Lake (Databricks Open Source) - apache-spark

I am trying to implement a merge using Delta Lake OSS. My history data is around 7 billion records and the delta is around 5 million.
The merge is based on a composite key (5 columns).
I am spinning up a 10-node cluster of r5d.12xlarge instances (~3 TB memory / ~480 cores).
The job took 35 minutes the first time, and subsequent runs are taking more time.
I tried various optimization techniques, but nothing worked, and I started to get heap memory issues after 3 runs; I see a lot of spill to disk while the data shuffles. I tried rewriting the history using an ORDER BY on the merge keys and got a performance improvement: the merge completed in 20 minutes, with around 2 TB of spill. However, the data written as part of the merge process is not in the same order (I have no control over the order in which the data is written), so subsequent runs are taking longer again.
I was not able to use Z-ordering in Delta Lake OSS, as it only comes with the subscription. I tried compaction, but that did not help either.
Please let me know if there is a better way to optimize the merge process.

Here's a recommendation; it seems you are running your Databricks notebook on AWS.
The way to optimize this is to use a Hive metastore (or any catalog service) alongside. How will this help?
While saving the data you can use bucketing to order your data according to the merge keys, and this metadata needs to be stored in the metastore, which is why Hive is required.
If you use bucketing, the data will be in order and will not result in excessive shuffling of data, which will inevitably improve the performance of your job (see the sketch below).
I am not very sure about Databricks, but if you use EMR you get the option to use the Glue Data Catalog as the metastore, or you can run your own metastore on EMR as well.
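A minimal sketch of the bucketed write described above, assuming a Hive-enabled Spark session, a placeholder table name history_bucketed, and placeholder key columns (note that bucketBy requires saveAsTable and a non-Delta format such as Parquet):

# Bucket and sort by the merge keys so rows with the same keys land in the
# same bucket files; 512 buckets is an illustrative number only.
(history_df.write
    .format("parquet")
    .bucketBy(512, "key1", "key2", "key3", "key4", "key5")
    .sortBy("key1", "key2", "key3", "key4", "key5")
    .mode("overwrite")
    .saveAsTable("history_bucketed"))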

20 minutes sounds pretty good in my experience ;) What is your partition scheme? Merges are slowed down in the same way that SELECTs are, so if you can eliminate lake scans by way of partition filters, that should help tremendously.
Also take a look at the shuffle partitions setting in Spark, as I have found it to have a huge impact on performance.
Lastly, compacting your data will have a huge impact on merge performance. (A sketch of the first two points is below.)
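A minimal sketch of a merge that pushes a partition filter into the match condition and tunes shuffle partitions, assuming the history table lives at a placeholder path, is partitioned by an event_date column, and uses placeholder key columns:

from delta.tables import DeltaTable

# Shuffle partition count has a big effect on spill; tune for your cluster.
spark.conf.set("spark.sql.shuffle.partitions", "800")

history = DeltaTable.forPath(spark, "/path/to/history")

(history.alias("t")
    .merge(
        updates_df.alias("s"),
        # Partition column first so only the affected partitions are scanned,
        # then the composite key columns.
        "t.event_date = s.event_date AND "
        "t.key1 = s.key1 AND t.key2 = s.key2 AND t.key3 = s.key3 AND "
        "t.key4 = s.key4 AND t.key5 = s.key5")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())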

I am having the same issues, with the same data size as well. I'm going to get rid of the merge statement and break it into two pieces (see the rough sketch below):
Join where matched, so an UPDATE.
Join where not matched, an INSERT.
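A rough sketch of that split, assuming the target is a Delta table at a placeholder path, the source dataframe is updates_df, and the key columns are placeholders; the update half still goes through a matched-only merge in Delta OSS:

from delta.tables import DeltaTable

target = spark.read.format("delta").load("/path/to/history")
keys = ["key1", "key2", "key3", "key4", "key5"]

# Rows whose keys do not exist in the target: plain append, no merge needed.
inserts = updates_df.join(target.select(keys), keys, "left_anti")
inserts.write.format("delta").mode("append").save("/path/to/history")

# Rows whose keys already exist: merge with only a matched clause.
updates = updates_df.join(target.select(keys), keys, "left_semi")
cond = " AND ".join(f"t.{k} = s.{k}" for k in keys)
(DeltaTable.forPath(spark, "/path/to/history").alias("t")
    .merge(updates.alias("s"), cond)
    .whenMatchedUpdateAll()
    .execute())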

If you really want to optimize it through code, you can launch parallel tasks. This is sample code we used to parallelize writes to S3; you can use the same logic for an ADLS location as well.
from concurrent import futures

jobs = []
# writeS3(date) and daterange(start, end) are assumed to be defined elsewhere.
with futures.ThreadPoolExecutor(max_workers=total_days + 1) as e:
    print(f"{raw_bucket}/{db}/{table}/")
    # Submit one write task per day in the range.
    for single_date in daterange(start_date, end_date):
        curr_date = single_date.strftime("%Y-%m-%d")
        jobs.append(e.submit(writeS3, curr_date))
    # Collect results as the tasks finish.
    for job in futures.as_completed(jobs):
        result_done = job.result()
        print(f"Job Completed - {result_done}")
print("Task complete")
ref : https://docs.python.org/3/library/concurrent.futures.html

Related

Spark SQL output multiple small files

We have multiple joins involving a large table (about 500 GB in size). The output of the joins is stored in multiple small files, each 800 KB-1.5 MB in size. Because of this the job is split into many tasks and takes a long time to complete.
We have tried Spark tuning configurations like broadcast joins, changing the partition size, changing max records per file, etc., but there is no performance improvement with these methods and the issue is still not fixed. Using coalesce makes the job get stuck at that stage with no progress.
Please see this link for a Spark UI metrics screenshot: https://i.stack.imgur.com/FfyYy.png
The Spark UI confirms your report of too many small files. You will get a file for every Spark partition, and you have 33,479 in your final stage where you're writing the output. 33k partitions was probably the right number of partitions for your join, but not the right number for your write.
You need to add another stage in your job that comes after your join. That second stage needs to reduce the number of Spark partitions to a reasonable number (one that outputs 32 MB to ~128 MB files).
Something like a coalesce or repartition. Maybe even a sort :(
You want to target ~350 partitions.
This diagram shows what you want to do, manually or automatically (with Spark on Databricks).
If you're using Databricks, then it's easy: with Delta Lake you can turn on Auto Optimize.
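A minimal sketch of that extra stage, assuming the joined result is a dataframe called joined_df and using the ~350-partition target from above (the output path is a placeholder):

# Repartition forces a shuffle but gives evenly sized output files;
# coalesce avoids the shuffle but can leave skewed partitions.
(joined_df
    .repartition(350)
    .write
    .mode("overwrite")
    .parquet("s3://bucket/path/output/"))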

Azure Databricks: Switching from batch to streaming mode

Hello Spark community,
Currently we have a batch pipeline in Azure Databricks that reads from a Delta table. The job runs once per night. We extract its data, save it in our own location as a Delta table again, and then we write to an Azure SQL Database table. Everything is partitioned on date like this: Date=2021-01-01, etc.
Things are about to change, since our source Delta table is about to be refreshed every 2-3 minutes, and the requirement is to change the ETL from a nightly batch to streaming mode while still using the same source and target tables as well as the same SQL database table.
Right now this streaming challenge raises several questions:
Our Delta source table is quite huge (30B+ rows). So far in our nightly batch we extracted only the changed date keys and used MERGE INTO to write the updates/inserts. Once we switch to streaming mode, is it possible to tell Spark to start streaming from a specific point in time, since we do not want to load the whole table again?
How do you write a stream to a target Delta table with MERGE INTO when a huge number of keys has to be checked on both sides? I suppose we can make use of foreachBatch and do the merge per micro batch, but how can each new micro batch check for existing/non-existing keys in our target Delta table in a time-efficient manner (30B+ rows in the current target table)?
Not sure about this one, but is it possible for a streaming Spark job to write directly to an (Azure) SQL database, and would that become a bottleneck and hence not be advisable?
I am really looking forward to some good advice and design decisions, since this would be quite a big issue with the current data size. Appreciate every opinion here!
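For reference, a minimal sketch of the pattern the question describes: reading the Delta source as a stream from a specific point in time (the startingTimestamp option) and merging each micro batch into the target with foreachBatch. Paths, key columns, and the timestamp are placeholders, and this is only an illustration, not a vetted answer:

from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # Merge one micro batch into the target Delta table on the date + key.
    (DeltaTable.forPath(spark, "/mnt/target/table").alias("t")
        .merge(batch_df.alias("s"), "t.Date = s.Date AND t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream
    .format("delta")
    .option("startingTimestamp", "2021-06-01")  # stream only changes after this point
    .load("/mnt/source/table")
    .writeStream
    .foreachBatch(upsert_batch)
    .option("checkpointLocation", "/mnt/checkpoints/target_table")
    .start())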

Issues with long lineages (DAG) in Spark

We usually use Spark as the processing engine for data stored on S3 or HDFS. We use the Databricks and EMR platforms.
One of the issues I frequently face is that when the task size grows, job performance degrades severely. For example, let's say I read data from five tables with different levels of transformation (filtering, exploding, joins, etc.), union a subset of the data from these transformations, then do further processing (e.g. remove some rows based on criteria that require window functions), then some other processing stages, and finally save the final output to a destination S3 path. If we run this job without staging, it takes a very long time. However, if we save (stage) temporary intermediate dataframes to S3 and use the saved dataframes for the next steps, the job finishes faster. Does anyone have similar experience? Is there a better way to handle these long task lineages other than checkpointing?
What is even stranger is that for longer lineages Spark throws an unexpected error like "column not found", while the same code works if intermediate results are temporarily staged.
Writing the intermediate data by saving the dataframe, or using a checkpoint, is the only way to fix it. You're probably running into an issue where the optimizer is taking a really long time to generate the plan. The quickest/most efficient way to fix this is to use localCheckpoint, which materializes the checkpoint locally:
val df = df.localCheckpoint()

What is the difference between overwrite and append to Parquet

What is the difference between append and overwrite when writing Parquet in Spark?
I'm processing a huge amount of data, say 10 days' worth. At present I'm processing daily logs into Parquet files using the "append" mode and partitioning the data based on date. But the problem I'm facing is that the daily data is also very huge and takes a lot of time, with high CPU usage as well, while processing on an EMR cluster. This is making my job very slow and expensive. So I'm looking for a way to split the data further and then merge it back by day.
Please see the Spark SaveMode docs:
https://spark.apache.org/docs/latest/api/java/index.html
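A minimal sketch of the two save modes, assuming a daily log dataframe called daily_df partitioned by a date column (paths and column names are placeholders):

# Append adds new files alongside whatever is already at the path.
daily_df.write.mode("append").partitionBy("date").parquet("s3://bucket/logs/")

# Overwrite, by default, replaces everything already written at the path.
daily_df.write.mode("overwrite").partitionBy("date").parquet("s3://bucket/logs/")

# With dynamic partition overwrite, only the date partitions present in
# daily_df are replaced rather than the whole dataset.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
daily_df.write.mode("overwrite").partitionBy("date").parquet("s3://bucket/logs/")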

spark datastax cassandra connector slow to read from heavy cassandra table

I am new to Spark and the Spark Cassandra Connector. We are trying Spark for the first time in our team, and we are using the Spark Cassandra Connector to connect to our Cassandra database.
I wrote a query which uses a heavy table of the database, and I saw that the Spark tasks didn't start until the query had fetched all the records from the table.
It is taking more than 3 hours just to fetch all the records from the database.
To get the data from the DB we use:
CassandraJavaUtil.javaFunctions(sparkContextManager.getJavaSparkContext(SOURCE).sc())
    .cassandraTable(keyspaceName, tableName);
Is there a way to tell Spark to start working even before all the data has finished downloading?
Is there an option to tell the spark-cassandra-connector to use more threads for the fetch?
thanks,
kokou.
If you look at the Spark UI, how many partitions is your table scan creating? I just did something like this and found that Spark was creating too many partitions for the scan, and it was taking much longer as a result. The way I decreased the time on my job was by setting the configuration parameter spark.cassandra.input.split.size_in_mb to a value higher than the default. In my case it took a 20-minute job down to about four minutes. There are also a couple more Cassandra read-specific Spark settings you can tune, found here.
These Stack Overflow questions are what I referenced originally; I hope they help you out as well.
Iterate large Cassandra table in small chunks
Set number of tasks on Cassandra table scan
EDIT:
After doing some performance testing around these Spark configuration parameters, I found that Spark was creating far too many table partitions when I wasn't giving the Spark executors enough memory. In my case, upping the memory by a gigabyte was enough to render the input split size parameter unnecessary. If you can't give the executors more memory, you may still need to set spark.cassandra.input.split.size_in_mb higher as a workaround.
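A minimal sketch of setting that parameter, shown here via a PySpark session and the connector's DataFrame source as an assumption (the question itself uses the Java RDD API; the split size, host, keyspace, and table names are placeholders):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
    .appName("cassandra-read")
    # A larger split size means fewer, bigger Spark partitions per table scan.
    .config("spark.cassandra.input.split.size_in_mb", "256")
    .config("spark.cassandra.connection.host", "cassandra-host")
    .getOrCreate())

df = (spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(keyspace="my_keyspace", table="my_table")
    .load())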
