Say I have a table called data and it's some time-series. It's stored like this:
/data
/date=2022-11-30
/region=usa
part-000001.parquet
part-000002.parquet
Where I have two partition keys and two parquet files for that partition. I can easily list the files for the partition keys with:
dbutils.fs.ls('/data/date=2022-11-30/region=usa')
But, if I now make an update to the table, it regenerates the parquet files and now I have 4 files in that directory.
How can I retrieve the latest version of the parquet files? Do I really have to loop through all the _delta_log state files and rebuild the state? Or do I have to run VACUUM to clean up the old versions so I can get the most recent files?
There has to be a magic function.
Delta Lake itself tracks all of this information in its transaction log. When you query a Delta table with an engine or API that supports Delta Lake, underneath the covers it is reading this transaction log to determine what files make up that version of the table.
For your example, say the four files are:
/data
/date=2022-11-30
/region=usa
part-000001.parquet
part-000002.parquet
part-000003.parquet
part-000004.parquet
The Delta transaction log itself contains the path of the files for each table version, e.g.:
# V0 | first version of the table
/data
/date=2022-11-30
/region=usa
part-000001.parquet
part-000002.parquet
# V1 | second version of the table
/data
/date=2022-11-30
/region=usa
part-000003.parquet
part-000004.parquet
You can use Delta Standalone if you want to get the list of files from Scala/JVM, or delta-rs (Delta Rust) if you prefer its Rust or Python bindings.
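For example, with the delta-rs Python bindings (the deltalake package) you can read the file list for the current version straight from the transaction log - a minimal sketch, assuming the table root is /data:

from deltalake import DeltaTable

dt = DeltaTable("/data")
print(dt.version())   # latest version number
print(dt.files())     # relative paths of the parquet files that make up this version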
If you would like to do it in Spark SQL and/or dive into the details, please check out Diving into Delta Lake: Unpacking the Transaction Log, which includes a video, blog, and notebook on this topic. There is also a follow-up video called Under the sediments v2.
Related
When we run the VACUUM command, does it go through each parquet file and remove older versions of each record, or does it retain the whole parquet file even if only one record in it has the latest version? What about compaction? Is that any different?
Vacuum and Compaction go through the _delta_log/ folder in your Delta Lake Table and identify the files that are still being referenced.
Vacuum deletes all unreferenced files.
Compaction reads in the referenced files and writes your new partitions back to the table, unreferencing the existing files.
Think of a single version of a Delta Lake table as a set of parquet data files. Every version adds an entry (about files added and removed) to the transaction log (under the _delta_log directory).
VACUUM
VACUUM lets you define how many hours of history to retain (using the RETAIN number HOURS clause). That tells Delta Lake which table versions are old enough to be cleaned up (everything beyond that number of HOURS). These versions are "translated" into a series of parquet files (remember that a single parquet file stays referenced from the version that added it until a later version removes it, which may span several versions).
This translation gives the files to be deleted.
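A minimal sketch from PySpark, assuming the table lives at /data and you want to keep one week of history (DRY RUN just lists what would be removed):

spark.sql("VACUUM delta.`/data` RETAIN 168 HOURS DRY RUN")   # list the files that would be deleted
spark.sql("VACUUM delta.`/data` RETAIN 168 HOURS")           # delete files older than the retention window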
Compaction
Compaction is basically an optimization (and is usually triggered by the OPTIMIZE command, or by a combination of repartition, dataChange disabled, and overwrite).
It is nothing other than another version of a Delta table (but this time the data is not changed, so other concurrent transactions can still commit happily).
The explanation about VACUUM above applies here as well.
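A sketch of the manual repartition + dataChange + overwrite variant mentioned above (the path and partition count are placeholders; OPTIMIZE does the equivalent for you):

path = "/data"
(spark.read.format("delta").load(path)
  .repartition(16)                        # fewer, larger files
  .write
  .format("delta")
  .option("dataChange", "false")          # mark the commit as a rearrangement, not new data
  .mode("overwrite")
  .save(path))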
I want to read the delta data after a certain timestamp/version. The logic here suggests to read the entire data and read the specific version, and then find the delta. As my data is huge, I would prefer not to read the entire data and if somehow be able to read only the data after certain timestamp/version.
Any suggestions?
If you need data that has a timestamp after some specific date, then you still need to sift through all of the data. But Spark & Delta Lake may help here if you organize your data correctly:
You can have time-based partitions, for example store data by day/week/month, so when Spark reads the data it may read only specific partitions (so-called partition pruning / predicate pushdown), for example df = spark.read.format("delta").load(...).filter("day > '2021-12-29'") - this works not only for Delta but for other formats as well. Delta Lake may additionally help here because it supports so-called generated columns, where you don't need to populate a partition column explicitly but let Spark generate it for you based on other columns (see the sketch after this list)
On top of partitioning, formats like Parquet (and Delta, which is based on Parquet) allow skipping parts of the data because they maintain min/max statistics inside the files. But you still need to open those files to read the statistics
On Databricks, Delta Lake has more capabilities for selective reads of the data - for example, the min/max statistics that Parquet keeps inside the file can be saved into the transaction log, so Delta doesn't need to open a file to check whether the timestamp is in the given range - this technique is called data skipping. Additional performance can come from Z-Ordering the data, which collocates related data closer together - that's especially useful when you need to filter by multiple columns
Update 14.04.2022: Data Skipping is also available in OSS Delta, starting with version 1.2.0
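To illustrate the partitioning and generated-columns points above, here is a rough sketch; the path, schema, and column names are made up for the example:

from delta.tables import DeltaTable

# Table partitioned by a 'day' column that Delta generates from 'eventTime'
(DeltaTable.createIfNotExists(spark)
  .location("/data/events")
  .addColumn("eventTime", "TIMESTAMP")
  .addColumn("value", "DOUBLE")
  .addColumn("day", "DATE", generatedAlwaysAs="CAST(eventTime AS DATE)")
  .partitionedBy("day")
  .execute())

# Only the matching 'day' partitions are scanned instead of the whole table
df = spark.read.format("delta").load("/data/events").filter("day > '2021-12-29'")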
To be clear about the format, this is how the DataFrame is saved in Databricks:
folderpath = "abfss://container@storage.dfs.core.windows.net/folder/path"
df.write.format("delta").mode("overwrite").save(folderpath)
This produces a set of Parquet files (often in 2-4 chunks) in the main folder, along with a _delta_log folder that contains the files describing the data upload.
The _delta_log folder dictates which set of Parquet files in the folder should be read.
In Databricks, I would read the latest dataset, for example, by doing the following:
df = spark.read.format("delta").load(folderpath)
How would I do this in Azure Data Factory?
I have chosen Azure Data Lake Storage Gen2, then the Parquet format, however this doesn't seem to work, as I get the entire set of parquet files read (i.e. all data sets) and not just the latest.
How can I set this up properly?
With a Data Factory pipeline, it seems hard to achieve that, but I have some ideas for you:
Use a Lookup activity to get the content of the _delta_log files. If there are many files, use a Get Metadata activity to get each file's metadata (last modified date).
Use an If Condition activity or a Switch activity to filter for the latest data.
Once the data is filtered, pass the Lookup output to set the Copy activity source (set it as a parameter).
The hardest thing is that you need to figure out how to identify the latest dataset from the _delta_log. You could try it this way; the whole workflow should look like the above, but I can't tell you if it really works - I couldn't test it for you without the same environment.
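Outside of ADF, that log-parsing step would look roughly like this - a naive sketch that replays the JSON commits and ignores checkpoint files; the mount path is hypothetical:

import json, os

log_dir = "/dbfs/folder/path/_delta_log"   # hypothetical mount of the table's _delta_log
active = set()
# Each commit is a newline-delimited JSON file of actions; replay add/remove in order
for name in sorted(f for f in os.listdir(log_dir) if f.endswith(".json")):
    with open(os.path.join(log_dir, name)) as fh:
        for line in fh:
            action = json.loads(line)
            if "add" in action:
                active.add(action["add"]["path"])
            elif "remove" in action:
                active.discard(action["remove"]["path"])
print(active)   # relative paths of the parquet files in the latest version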
HTH.
I have an Impala table backed by parquet files which is used by another team.
Every day I run a batch Spark job that overwrites the existing parquet files (creating a new data set; the existing files are deleted and new files are created).
Our Spark code looks like this:
dataset.write.format("parquet").mode("overwrite").save(path)
During this update (overwriting the parquet data files and then running REFRESH on the Impala table), if someone accesses the table they end up with an error saying the underlying data files are not there.
Is there any solution or workaround available for this issue? I do not want other teams to see the error at any point in time when they access the table.
Maybe I can write the new data files to a different location and then make the Impala table point to that location?
The behaviour you are seeing is because of the way Impala is designed to work. Impala fetches the metadata of the table, such as the table structure, partition details, and HDFS file paths, from the HMS, and the block details of the corresponding HDFS file paths from the NameNode. All these details are fetched by the Catalog and distributed across the Impala daemons for their execution.
When the table's underlying files are removed and new files are written outside Impala, it is necessary to perform a REFRESH so that the new file details (such as the files and the corresponding block details) are fetched and distributed across the daemons. This way Impala becomes aware of the newly written files.
Since you're overwriting the files, Impala queries fail to find the files they are aware of, because those files have already been removed and the new files are still being written. This is expected behaviour.
As a solution, you can do one of the following,
Append the new files to the same HDFS path of the table instead of overwriting. This way, Impala queries run against the table still return results; they contain only the older data (because Impala is not yet aware of the new files), but the error you mentioned is avoided while the overwrite is happening. Once the new files are created in the table's directories, you can perform an HDFS operation to remove the old files, followed by an Impala REFRESH statement for this table.
OR
As you said, you can write the new parquet files to a different HDFS path and, once the write is complete, either [remove the old files, move the new files into the actual HDFS path of the table, and then issue a REFRESH] OR [issue an ALTER statement against the table to point the table's data location to the new directory] - see the sketch below. If it's a daily process, you might have to implement this through a script that runs upon successful completion of the Spark write, passing the directories (new and old) as arguments.
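A rough sketch of the second option, with made-up paths and table names (the ALTER/REFRESH statements would be run via impala-shell or a client such as impyla):

from datetime import datetime

new_path = f"/warehouse/mytable_data/{datetime.now():%Y%m%d}"   # hypothetical versioned directory

# Write today's batch to a fresh directory instead of overwriting the current one
dataset.write.format("parquet").mode("overwrite").save(new_path)

# Then, against Impala:
#   ALTER TABLE mydb.mytable SET LOCATION 'hdfs:///warehouse/mytable_data/20221130';
#   REFRESH mydb.mytable;
# Remove the previous directory only after the table has been repointed.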
Hope this helps!
I'm storing a stream of data coming via Firehose to S3 and have created tables in Athena to query this data. The data in S3 is partitioned on fields like clientID and date. A Spark job processes this incoming data, which arrives at a regular interval. At each run, the Spark job takes the new data (the delta for that interval), merges it with the base data already available in that partition in S3 (deduplicating by last modified time in case there are duplicate records), and overwrites the partition. When the S3A committer writes these files, it deletes the existing files and copies the newly created ones.
Is there a possibility that, when querying data from the Athena tables, no data is returned because the old files have been deleted and the new files are not completely written yet? If yes, how do you handle that?
Yes, if the underlying S3 objects are deleted, the Athena query will return zero rows. The S3A committer will delete objects prior to uploading, so there will always be some period of time where the table's backing data is missing or incomplete.
To make Athena queries highly available on data that is being updated, write the query data in batches to a versioned path in S3 (like s3://my-data/2020-02-07) at the appropriate frequency. When a batch has completed, issue ALTER TABLE SET LOCATION DDL against the Athena table, pointing it to the newest versioned path. Then, clean up old paths (newest version - n) in line with your retention policy.
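A minimal sketch of that repointing step using boto3; the database, table, bucket names, and region are placeholders:

import boto3

athena = boto3.client("athena", region_name="us-east-1")

batch_path = "s3://my-data/2020-02-07/"                        # newest versioned path
ddl = f"ALTER TABLE my_db.my_table SET LOCATION '{batch_path}'"

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "my_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results/"},
)
# Once the DDL has succeeded, delete versioned paths older than the retention window.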