Can you stream and batch from the same delta table? - apache-spark

I tried to stream and batch from the same Delta table but ran into a small-files problem on the batch side. But if I optimize the Delta table, the streaming side will lose track of the files it reads because of the compaction performed by the optimization.

When the OPTIMIZE command removes small files and adds back in compacted ones, these operations are flagged with the dataChange flag set to false. This flag tells streams that are following the transaction log that it is safe to ignore this transaction to avoid processing duplicate data.
I'll also note that DBR 5.3 contains a private preview feature called Auto-Optimize that can perform this compaction before small files even make it into the table. This feature will be GA-ed in the next release of DBR.
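For reference, here's a minimal PySpark sketch of the pattern (the paths, checkpoint location, and sink are illustrative, not from the question): the same Delta table is read both as a batch DataFrame and as a stream, and OPTIMIZE can run against it in the background because its commits carry dataChange=false.

path = "s3://bucket/events"  # illustrative

# Batch view of the table
batch_df = spark.read.format("delta").load(path)

# Streaming view of the same table; compaction-only commits are skipped by the stream
stream = (spark.readStream.format("delta").load(path)
    .writeStream.format("delta")
    .option("checkpointLocation", "s3://bucket/_checkpoints/events")  # illustrative
    .start("s3://bucket/events_mirror"))                              # illustrative sink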

Related

What happens during VACUUM with Delta tables?

When we run the VACUUM command, does it go through each parquet file and remove older versions of each record, or does it retain a parquet file even if it has just one record with the latest version? What about compaction? Is that any different?
Vacuum and Compaction go through the _delta_log/ folder in your Delta Lake Table and identify the files that are still being referenced.
Vacuum deletes all unreferenced files.
Compaction reads in the referenced files and writes your new partitions back to the table, unreferencing the existing files.
Think of a single version of a Delta Lake table as a set of parquet data files. Every version adds an entry (about files added and removed) to the transaction log (under _delta_log directory).
VACUUM
VACUUM lets you define how many hours of history to retain (using the RETAIN number HOURS clause). That tells Delta Lake which versions to delete (everything older than number HOURS). These versions are "translated" into a series of parquet files (remember that a single parquet file is added in one version and stays referenced until it is removed, which may take a couple of versions).
This translation gives the files to be deleted.
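For example, a minimal sketch in PySpark (the table path is illustrative): keep 7 days (168 hours) of history, so files referenced only by older versions become candidates for deletion.

spark.sql("VACUUM delta.`s3://bucket/events` RETAIN 168 HOURS")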
Compaction
Compaction is basically an optimization (usually triggered by the OPTIMIZE command, or by a combination of repartition, dataChange disabled, and overwrite).
It is nothing other than another version of the Delta table (but this time the data is not changed, so other concurrent transactions can happily be committed).
The explanation about VACUUM above applies here.
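A minimal sketch of both variants (the table path is illustrative; OPTIMIZE assumes Databricks, where the command is available):

# Databricks: let OPTIMIZE perform the compaction
spark.sql("OPTIMIZE delta.`s3://bucket/events`")
# Open-source Delta: the repartition + dataChange=false + overwrite pattern
# (shown in full in the vacuum question further below) achieves the same effect.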

use of df.coalesce(1) in csv vs delta table

When saving to a Delta table we avoid df.coalesce(1), but when saving to CSV or Parquet we (my team) add df.coalesce(1). Is it a common practice? Why? Is it mandatory?
In most cases where I have seen df.coalesce(1), it was done to generate only one file, for example to import a CSV file into Excel or a Parquet file into a Pandas-based program. But if you do .coalesce(1), the write happens in a single task, and that task becomes the performance bottleneck because it has to pull data from the other executors and write everything itself.
If you're consuming the data from Spark or another distributed system, having multiple files is beneficial for performance because you can write and read them in parallel. By default, Spark writes N files into the directory, where N is the number of partitions. As #pltc noticed, this may generate a large number of files, which is often undesirable because you'll pay a performance overhead for accessing them. So you need a balance between the number of files and their size - for Parquet and Delta (which is based on Parquet), bigger files bring several performance advantages: you read fewer files, you get better compression for the data inside each file, etc.
For Delta specifically, .coalesce(1) has the same problem as for other file formats - you're writing via one task. Relying on the default Spark behaviour and writing multiple files is beneficial from a performance point of view - each node writes its data in parallel - but you can get too many small files (so you may use .coalesce(N) to write bigger files). For Databricks Delta, as was correctly pointed out by #Kafels, there are optimizations that let you remove that .coalesce(N) and do automatic tuning to achieve the best throughput (so-called "Optimized Writes"), and create bigger files ("Auto Compaction") - but they should be used carefully.
Overall, the optimal file size for Delta is an interesting topic - if you have big files (1GB is the default used by the OPTIMIZE command), you get better read throughput, but if you're rewriting them with MERGE/UPDATE/DELETE, big files are bad from a performance standpoint, and it's better to have smaller files (16-64-128MB) so you rewrite less data.
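As a rough illustration of the trade-off described above (the paths and the target file count are illustrative assumptions):

df.write.format("delta").save("s3://bucket/out_delta")                   # one file per partition, written in parallel
df.coalesce(1).write.format("csv").save("s3://bucket/out_csv")           # single file, single writer task
df.coalesce(16).write.format("parquet").save("s3://bucket/out_parquet")  # fewer, bigger files, still parallel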
TL;DR: it's not mandatory, it depends on the size of your dataframe.
Long answer:
If your dataframe is 10MB and you have 1000 partitions, for example, each file would be about 10KB. Having so many small files reduces Spark performance dramatically, not to mention that when you have too many files you'll eventually hit the OS limit on the number of files. In any case, when your dataset is small enough, you should merge it into a couple of files with coalesce.
However, if your dataframe is 100GB, technically you could still use coalesce(1) and save to a single file, but later on you will have to deal with less parallelism when reading from it.
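A back-of-the-envelope sketch of that reasoning (all sizes, the target file size, and the output path are illustrative assumptions):

target_file_bytes = 128 * 1024 * 1024                       # aim for ~128MB files
estimated_bytes   = 10 * 1024 * 1024                        # e.g. a 10MB dataframe
num_files = max(1, estimated_bytes // target_file_bytes)    # -> 1 here; roughly 800 for 100GB

df.coalesce(num_files).write.format("parquet").save("s3://bucket/output")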

Spark: writing data to the place that is being read from without losing data

Please help me understand how I can write data to the same place that is being read from, without any issues, using EMR and S3.
So I need to read partitioned data, find the old data, delete it, and write the new data back. I'm thinking about two ways here:
1. Read all the data, apply a filter, and write the data back with save option SaveMode.Overwrite. I see one major issue here - before writing, it will delete the files in S3, so if the EMR cluster goes down for some reason after the deletion but before the writing, all the data will be lost. I could use dynamic partition overwrite, but in such a situation I would still lose the data from one partition.
2. Same as above, but write to a temp directory, then delete the original and move everything from temp to the original location. But since this is S3 storage there is no move operation, so all the files would be copied, which can be a bit pricey (I'm going to work with 200GB of data).
Is there any other way, or am I wrong about how Spark works?
You are not wrong. The process of deleting a record from a table on EMR/Hadoop is painful in the ways you describe and more. It gets messier with failed jobs, small files, partition swapping, slow metadata operations...
There are several formats and file protocols that add transactional capability on top of a table stored in S3. The open Delta Lake format (https://delta.io/) supports transactional deletes, updates, and merge/upsert, and does so very well. You can read & delete (say, for GDPR purposes) just as you're describing, and you'll have a transaction log to track what you've done.
On point 2, as long as you have a reasonable number of files, your costs should be modest, with storage charges at ~$23/TB/month. However, if you end up with too many small files, the API costs of listing and fetching files can add up quickly. Managed Delta (from Databricks) will help speed up many of the operations on your tables through compaction, data caching, data skipping, and Z-ordering.
Disclaimer: I work for Databricks.
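For illustration, a minimal sketch of the Delta approach (the path, the filter condition, and new_df are assumptions, not from the question): the delete and the subsequent append are separate transactions recorded in the log, so a failure between them does not lose the existing data.

from delta.tables import DeltaTable

dt = DeltaTable.forPath(spark, "s3://bucket/events")   # illustrative path
dt.delete("event_date < '2019-01-01'")                 # transactional delete of the old data

# New data is then appended as a separate, atomic transaction
new_df.write.format("delta").mode("append").save("s3://bucket/events")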

Delta Lake (OSS) Table on EMR and S3 - Vacuum takes a long time with no jobs

I'm writing a lot of data into Databricks Delta lake using the open source version, running on AWS EMR with S3 as storage layer. I'm using EMRFS.
For performance improvements, I'm compacting and vacuuming the table every so often like so:
from delta.tables import DeltaTable

# Compact: dataChange=false marks this as a compaction-only commit
(spark.read.format("delta").load(s3path)
    .repartition(num_files)
    .write.option("dataChange", "false").format("delta").mode("overwrite").save(s3path))

# Delete files no longer referenced by versions newer than 24 hours
t = DeltaTable.forPath(spark, s3path)
t.vacuum(24)
It then deletes hundreds of thousands of files from S3. However, the vacuum step takes an extremely long time. During this time the job appears idle, but every ~5-10 minutes a small task indicates that the job is alive and doing something.
I've read through this post, Spark: long delay between jobs, which seems to suggest it may be related to Parquet, but I don't see any options on the Delta side for tuning any parameters.
I've also observed that the Delta vacuum command is quite slow. The open source developers are probably limited from making AWS-specific optimizations in the repo because the library is cross-platform (it needs to work on all clouds).
I've noticed that vacuum is even slow locally. You can clone the Delta repo, run the test suite on your local machine, and see for yourself.
Deleting hundreds of thousands of files stored in S3 is slow, even if you're using the AWS CLI. You should see if you can refactor your compaction operation to create fewer files that need to be vacuumed.
Suppose your goal is to create 1GB files. Perhaps you have 15,000 one-gig files and 20,000 small files. Right now, your compaction operation is rewriting all of the data (so all 35,000 original files need to be vacuumed post-compaction). Try to refactor your code to only compact the 20,000 small files (so the vacuum operation only needs to delete 20,000 files).
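A hedged sketch of that kind of selective compaction, assuming the table is partitioned by a date column and only one partition is full of small files (the column name, value, and target file count are illustrative):

(spark.read.format("delta").load(s3path)
    .where("date = '2020-01-01'")                      # only the partition full of small files
    .repartition(16)                                   # target number of files for that partition
    .write
    .option("dataChange", "false")                     # compaction-only commit
    .option("replaceWhere", "date = '2020-01-01'")     # rewrite just this partition
    .format("delta")
    .mode("overwrite")
    .save(s3path))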
The real solution is to build a vacuum command that's optimized for AWS. Delta Lake needs to work with all the popular clouds and the local filesystem. It should be pretty easy to make an open source library that reads the transaction log, figures out what files need to be deleted, makes a performant file deletion API call, and then writes out an entry to the transaction log that's Delta compliant. Maybe I'll make that repo ;)
Here's more info on the vacuum command. As a sidenote, you may want to use coalesce instead of repartition when compacting, as described here.
EDIT:
Delta issue: https://github.com/delta-io/delta/issues/395
and PR: https://github.com/delta-io/delta/pull/416
There was an issue filed for this in Delta Lake.
Problem Statement:
Delta Lake vacuum jobs take too long to finish because the underlying file-deletion logic was sequential. This is a known issue in Delta Lake (v0.6.1). Ref: https://github.com/delta-io/delta/issues/395
Solution:
The Delta Lake team has resolved this issue, but a stable version containing the fix has not yet been released. Pull request: https://github.com/delta-io/delta/pull/522
For v0.6.x
Many organizations are using 0.6.x in production and want this fix to be part of 0.6.x. The following article gives quick steps to generate a Delta 0.6.1 jar with this patch:
https://swapnil-chougule.medium.com/delta-with-improved-vacuum-patch-381378e79d1d
With this change, files are deleted in parallel during the vacuum job, which speeds up the process and reduces execution time.
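If you're running a build that contains the patch, a minimal sketch of enabling it (the configuration key is the one discussed in the linked issue/PR; the path is illustrative):

from delta.tables import DeltaTable

spark.conf.set("spark.databricks.delta.vacuum.parallelDelete.enabled", "true")
DeltaTable.forPath(spark, "s3://bucket/events").vacuum(24)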

Apache Spark bucketed table not scalable in time partitioned table

I am trying to utilize Spark bucketing on a key-value table that is frequently joined on the key column by batch applications. The table is partitioned by a timestamp column, and new data arrives periodically and is added in a new timestamp partition. Nothing new here.
I thought it was an ideal use case for Spark bucketing, but some limitations seem to be fatal when the table is incremental:
An incremental table forces multiple files per bucket, which forces Spark to sort every bucket upon join even though every file is sorted locally. Some JIRAs suggest that this is a conscious design choice that is not going to change any time soon. This is understandable, too, as there could be thousands of locally sorted files in each bucket, and iterating concurrently over so many files does not seem like a good idea either.
Bottom line is, sorting cannot be avoided.
Upon map-side join, every bucket is handled by a single task. When the table is incremented, every such task consumes more and more data as more partitions (increments) are included in the join. Empirically, this ultimately failed with OOM regardless of the configured memory settings. To my understanding, even if the failures can be avoided, this design does not scale at all. It imposes an impossible trade-off when deciding on the number of buckets - aiming for a long-term table results in lots of small files with every increment.
This gives the immediate impression that bucketing should not be used with incremental tables. I wonder if anyone has a better opinion on that, or maybe I am missing some basics here.
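For context, a minimal sketch of the setup being described (all names and the bucket count are illustrative): a table partitioned by the load timestamp and bucketed/sorted on the join key, where each increment appends one more locally sorted file per bucket.

(df.write
    .partitionBy("ts")            # one new partition per periodic increment
    .bucketBy(256, "key")         # the frequently joined key column
    .sortBy("key")
    .mode("append")
    .saveAsTable("kv_bucketed"))  # bucketBy requires saveAsTable, not save(path)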
