VACUUM/OPTIMIZE Effect on Autoloader Checkpoints

I'm using Databricks Autoloader to incrementally stream from a Delta Lake table into a SQL database. If an OPTIMIZE or VACUUM statement is run against the Delta table, files are added and removed.
My question is: will the Autoloader checkpoint disregard these rewritten files on the next stream? Or will my entire Delta table be streamed into SQL again because Autoloader doesn't recognize that it has already processed the data?

As long as you specify the format of the readStream correctly, the checkpoint will disregard the compacted files created by the OPTIMIZE command, since OPTIMIZE only rewrites data that has already been processed without changing it. In this case, the stream should be started as follows: df.readStream.format('delta')
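For a fuller picture, here is a minimal sketch of such a stream; the source path, checkpoint location, and the write_to_sql sink function are all hypothetical:

df = (spark.readStream
      .format('delta')                                    # Delta source: OPTIMIZE's rewrites are skipped
      .load('/mnt/delta/source_table'))                   # hypothetical source path

(df.writeStream
   .option('checkpointLocation', '/mnt/checkpoints/src')  # tracks which table versions were processed
   .foreachBatch(write_to_sql)                            # hypothetical function upserting into SQL
   .start())

VACUUM is also harmless to the stream in the normal case: it only deletes files already marked as removed, as long as the retention period is longer than any lag the stream might accumulate.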

Related

spark streaming and delta tables: java.lang.UnsupportedOperationException: Detected a data update

The setup:
Azure Event Hub -> raw delta table -> agg1 delta table -> agg2 delta table
The data is processed by spark structured streaming.
Updates on target delta tables are done via foreachBatch using merge.
In the result I'm getting error:
java.lang.UnsupportedOperationException: Detected a data update (for
example
partKey=ap-2/part-00000-2ddcc5bf-a475-4606-82fc-e37019793b5a.c000.snappy.parquet)
in the source table at version 2217. This is currently not supported.
If you'd like to ignore updates, set the option 'ignoreChanges' to
'true'. If you would like the data update to be reflected, please
restart this query with a fresh checkpoint directory.
Basically I'm not able to read the agg1 delta table via any kind of streaming. If I switch the last stream from a delta sink to a memory sink, I get the same error message. The first stream runs without any problems.
Notes:
Between aggregations I'm changing granularity: the agg1 delta table truncates dates to minutes, the agg2 delta table truncates dates to days.
If I turn off all other streams, the last one still doesn't work.
The agg2 delta table is a fresh new table with no data.
How the streaming works on the source table:
It reads the files that belong to the source table. It cannot handle changes to those files (updates, deletes); if anything like that happens, you get the error above. In other words, DML operations rewrite the underlying files. The only exception is INSERTs: new data arrives in new files unless configured otherwise.
To work around that, you need to set the option ignoreChanges to true.
This option causes you to receive all the records from a rewritten file, so you get the same records as before again, plus the modified ones.
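A minimal sketch of reading with that option, assuming a hypothetical table path:

df = (spark.readStream
      .format('delta')
      .option('ignoreChanges', 'true')   # re-emit whole rewritten files instead of failing
      .load('/mnt/delta/agg1'))          # hypothetical path to the agg1 table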
The problem: we have aggregations, and the aggregated values are stored in the checkpoint state. If we receive the same (unmodified) record again, it is treated as new input and the aggregate for its grouping key is incremented, double-counting the record.
Solution: we can't stream from the agg table to build further aggregations; we need to read the raw table instead.
reference: https://docs.databricks.com/structured-streaming/delta-lake.html#ignore-updates-and-deletes
Note: I'm working on Databricks Runtime 10.4, so low shuffle merge is used by default.

How to do an "overwrite" output mode using spark structured streaming without deleting all the data and the checkpoint

I have this Delta lake in ADLS that I sink data into through Spark Structured Streaming. We usually append new data from our source to the Delta lake, but there are cases when we find errors in the data and need to reprocess everything. So what we do is delete all the data and the checkpoints and re-run the pipeline, ending up with the correct data in our ADLS.
The problem with this approach is that the end users are left without data to analyze for a whole day (because we need to delete it to re-run). So, to fix that, I would like to know if there is a way to do an "overwrite" output with Structured Streaming, so that we can overwrite the data as a new Delta version and end users can still query the data using the current version.
I don't know if it's possible with streaming, but I would like to know if anyone has had a similar problem and how you solved it :)
Thanks!
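For context on the versioning behavior the question relies on: a batch overwrite of a Delta table creates a new table version rather than destroying the old data, and earlier versions remain queryable through time travel. A minimal sketch, assuming a hypothetical table path:

reprocessed_df.write.format('delta').mode('overwrite').save('/mnt/delta/events')   # new version
old = (spark.read.format('delta')
       .option('versionAsOf', 0)          # time travel: read an earlier table version
       .load('/mnt/delta/events'))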

Pyspark dataframe parquet vs delta: different number of rows

I have data written in Delta format on HDFS. From what I understand, Delta stores the data as Parquet and just adds an extra transaction layer on top of it with advanced features.
But when reading the data with PySpark, I get different results depending on whether the dataframe is read with spark.read.parquet() or spark.read.format('delta').load():
df = spark.read.format('delta').load("my_data")
df.count()
> 184511389
df = spark.read.parquet("my_data")
df.count()
> 369022778
As you can see the difference is quite big.
Is there something I misunderstood about delta vs parquet?
Pyspark version is 2.4.
The most probable explanation is that you wrote into the Delta table twice using the overwrite option. Delta is a versioned data format: when you use overwrite, it doesn't delete the previous data, it just writes new files, and it doesn't remove the old files immediately; they are only marked as removed in the transaction log that Delta maintains. When you read through Delta, it knows which files are removed and reads only the current data. Actual deletion of the data files happens only when you run VACUUM on the Delta table.
But when you read with the plain Parquet reader, it has no information about removed files, so it reads everything in the directory, and you get twice as many rows.
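A minimal sketch reproducing the effect, with a hypothetical path:

df.write.format('delta').mode('overwrite').save('/tmp/demo')
df.write.format('delta').mode('overwrite').save('/tmp/demo')   # second overwrite keeps old files on disk

spark.read.format('delta').load('/tmp/demo').count()   # N rows: only the live version
spark.read.parquet('/tmp/demo').count()                # 2*N rows: includes files marked as removed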

Optimize command not helping on Delta lake table being written by structured streaming job

I have a Structured Streaming job which reads from Event Hub and writes to a Delta Lake table at /mytablepath, stored on Azure Blob Storage. Over the last two months running in production it has created ~1000 small files in storage, each holding only 2-3 rows.
I tried running the OPTIMIZE command on my Delta Lake table (path), but even after that the number of files in blob storage has not decreased, and when I run any query on the table in a notebook it continues to show the warning "query is on a delta table with many small files, run optimize to improve performance".
Thanks
You need to run VACUUM after you run OPTIMIZE to clean up the small files. OPTIMIZE compacts the data into larger files, but it keeps the old small files around for time travel; they are only physically deleted when VACUUM runs.
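A short sketch of the two commands, using the path from the question (the retention shown is the 7-day default):

spark.sql("OPTIMIZE delta.`/mytablepath`")                  # compact small files into larger ones
spark.sql("VACUUM delta.`/mytablepath` RETAIN 168 HOURS")   # physically delete the old files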

Spark Streaming to Hive, too many small files per partition

I have a Spark Streaming job with a batch interval of 2 minutes (configurable).
This job reads from a Kafka topic, creates a Dataset, applies a schema on top of it, and inserts these records into the Hive table.
The Spark job creates one file per batch interval in the Hive partition, like below:
dataset.coalesce(1).write().mode(SaveMode.Append).insertInto(targetEntityName);
Now the data that comes in is not that big, and even if I increase the batch duration to 10 minutes or so, I might still end up with only 2-3 MB of data, which is well below the block size.
This is the expected behaviour in Spark Streaming.
I am looking for efficient ways to do post-processing that merges all these small files into one big file.
If anyone's done it before, please share your ideas.
I would encourage you to not use Spark to stream data from Kafka to HDFS.
The Kafka Connect HDFS connector from Confluent (or Apache Gobblin from LinkedIn) exists for this very purpose. Both offer Hive integration.
Find my comments about compaction of small files in this GitHub issue.
If you need to write Spark code to process Kafka data into a schema, you can still do that, and write into another topic, preferably in Avro format, which Hive can easily read without a predefined table schema.
I personally have written a "compaction" process that grabs a bunch of hourly Avro data partitions from a Hive table and converts them into a daily-partitioned Parquet table for analytics. It's been working great so far.
If you want to batch the records before they land on HDFS, that's where Kafka Connect or Apache NiFi (mentioned in the link) can help, given that you have enough memory to buffer records before they are flushed to HDFS.
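As a rough illustration of the Kafka Connect route, here is a hedged sketch of registering Confluent's HDFS sink connector with Hive integration through the Connect REST API; the connector name, topic, and host addresses are all hypothetical:

import requests

connector = {
    "name": "hdfs-sink-events",                           # hypothetical connector name
    "config": {
        "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
        "topics": "events",                               # hypothetical topic
        "hdfs.url": "hdfs://namenode:8020",               # hypothetical namenode address
        "flush.size": "100000",                           # batch many records into each file
        "hive.integration": "true",                       # create/update the Hive table
        "hive.metastore.uris": "thrift://metastore:9083"  # hypothetical metastore address
    },
}
requests.post("http://connect-host:8083/connectors", json=connector)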
I had exactly the same situation as you. I solved it like this:
Let's assume that your newly arriving data is stored in a dataset: dataset1.
1. Partition the table with a good partition key; in my case I found that I can partition using a combination of keys so that each partition holds around 100 MB.
2. Save using the Spark Core API rather than Spark SQL (a sketch follows below):
a. Load the whole partition into memory (into a dataset: dataset2) when you want to save.
b. Then apply the Dataset union function: dataset3 = dataset1.union(dataset2)
c. Make sure the resulting dataset is partitioned as you wish, e.g. dataset3.repartition(1)
d. Save the resulting dataset in "overwrite" mode to replace the existing file.
If you need more details about any step, please reach out.
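A hedged PySpark sketch of steps a-d; the table path and partition directory are hypothetical. Note that Spark refuses to overwrite a path it is reading from within the same plan, so the compacted data is written to a side location first and then swapped in:

part_path = '/warehouse/events/event_date=2020-01-01'               # one Hive partition (hypothetical)
dataset2 = spark.read.parquet(part_path)                            # a: load the whole partition
dataset3 = dataset1.union(dataset2)                                 # b: union with the new data
dataset3 = dataset3.repartition(1)                                  # c: one output file per partition
dataset3.write.mode('overwrite').parquet(part_path + '_compacted')  # d: write the merged data
# finally, move '_compacted' over the original directory with your
# filesystem tooling (e.g. hdfs dfs -mv) to replace the small files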
