I haven't used Delta Lake's Change Data Feed (CDF) yet, and I want to understand whether it could be relevant for us.
We have the following set-up:
Raw Data (update events from DynamoDB) ends up in a Staging Area ->
We clean the new data and append it to a Bronze Table ->
We merge the appended data into a Silver Table, to represent the latest state ->
We run SQL queries on top of the Silver Tables, joining and aggregating them, thus creating our Gold Tables
Currently we track new data by using Streaming Checkpoints. This is very efficient in the Bronze -> Silver stage, since it is append-only.
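For reference, the Bronze -> Silver job is roughly the following (a simplified sketch rather than our exact code; the paths, the merge key id and the existing spark session are placeholders):

from delta.tables import DeltaTable

def upsert_to_silver(microbatch_df, batch_id):
    # Merge each new Bronze micro-batch into Silver so it always holds the latest state
    silver = DeltaTable.forPath(spark, "/lake/silver/table")
    (silver.alias("s")
        .merge(microbatch_df.alias("b"), "s.id = b.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())

(spark.readStream.format("delta")
    .load("/lake/bronze/table")
    .writeStream
    .foreachBatch(upsert_to_silver)
    .option("checkpointLocation", "/lake/_checkpoints/bronze_to_silver")
    .start())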
To my understanding, CDF could improve the performance of our Silver -> Gold jobs: with Streaming Checkpoints you still have to read the whole Parquet file if one line changed, whereas with CDF you only read the table changes. Is that correct?
Also, is there a reason to use CDF instead of streaming checkpoints in the Bronze -> Silver jobs?
To my understanding, CDF could improve the performance of our Silver -> Gold jobs: with Streaming Checkpoints you still have to read the whole Parquet file if one line changed, whereas with CDF you only read the table changes. Is that correct?
Yes, this is correct in principle. The nuance here is that you mentioned your Gold tables are joins+aggregates, so you may actually need all that unchanged data anyway depending on the type of aggregates you have, and whether or not there is some referential integrity you need to maintain.
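For illustration, reading just the changes from a CDF-enabled Silver table could look something like this (a minimal sketch; the path and starting version are placeholders):

# Read only the row-level changes recorded since a given table version
changes = (spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 42)   # placeholder version
    .load("/lake/silver/table"))

# Each row carries _change_type, _commit_version and _commit_timestamp columns
changes.filter("_change_type != 'update_preimage'").show()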
Also, is there a reason to use CDF instead of streaming checkpoints in the Bronze -> Silver jobs?
So long as this stage is append-only, no. In fact, if you were to enable CDF on this table, you wouldn't write any independent CDF files; you would just be reading the same data files you are reading now, but with some additional metadata (change operation, version, timestamp) attached, which might be useful.
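And if that metadata would be useful downstream, enabling CDF is just a table property (a sketch; the table name is a placeholder):

spark.sql("""
    ALTER TABLE bronze_table
    SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
""")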
Related
I have a pipeline like this:
kafka -> bronze -> silver
The bronze and silver tables are Delta tables. I'm streaming from bronze to silver using regular Spark Structured Streaming.
I changed the silver schema, so I want to reload from the bronze into silver using the new schema. Unfortunately, the reload is taking forever, and I'm wondering if I can load the data more quickly using a batch job, and then turn the stream back on.
I am concerned that the checkpoint will tell the stream from bronze->silver to pick up where it left off and it will write a bunch of duplicates that I will then need to remove. Is there a way I can advance the checkpoint with the batch load, or play other tricks?
Will that be faster than just letting the stream run? I get the feeling that it is spending a lot of resources writing microbatch transactions.
Any suggestions greatly appreciated!!!
Delta Lake tables follow the "WRITE ONCE READ MANY" (WORM) concept, which means the underlying data files are immutable. This makes sense and is the approach most other data warehouse products also take. However, it leads to write amplification: every time I update an existing record, the entire file containing that record is copied and then rewritten with the update. So inserting or updating one record at a time is definitely not a good option.
So, my question is: what is the recommended batch size for loading Delta Lake tables?
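To make the batching idea concrete, what I have in mind is collecting many changed records and applying them in a single MERGE per load, instead of merging row by row (a sketch; the path, the key column and updates_df are placeholders):

from delta.tables import DeltaTable

# updates_df holds a whole batch of changed records, not a single row
target = DeltaTable.forPath(spark, "/lake/my_table")
(target.alias("t")
    .merge(updates_df.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())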
I'm trying to convert large Parquet files to Delta format for performance optimization and faster job runs.
I'm researching best practices for migrating huge Parquet files to Delta format on Databricks.
There are two general approaches to that (a sketch of both follows the list), but it really depends on your requirements:
Do an in-place upgrade using the CONVERT TO DELTA SQL command or the corresponding Python/Scala/Java APIs (doc). You need to take the following into account: if you have a huge table, the default CONVERT TO DELTA command may take too long, as it needs to collect statistics for your data. You can avoid this by adding NO STATISTICS to the command, and then it will run faster. Without statistics you won't get the benefits of data skipping and other optimizations, but those statistics can be collected later when executing the OPTIMIZE command.
Create a copy of your original table by reading the original Parquet data and writing it as a Delta table. After you check that everything is correct, you may remove the original table. This approach has the following benefits:
You can change the partitioning schema if you have too many levels of partitioning in your original table
You can change the order of columns in the table to take advantage of data skipping for numeric and date/time data types; this should improve query performance.
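A rough sketch of both options (the paths and partition column are placeholders, assuming a running Spark session with Delta available):

# Option 1: in-place conversion, skipping statistics collection for speed
spark.sql("CONVERT TO DELTA parquet.`/data/big_parquet_table` NO STATISTICS")
# Statistics can be collected later, e.g. while running OPTIMIZE

# Option 2: copy into a new Delta table, optionally changing the partitioning
(spark.read.parquet("/data/big_parquet_table")
    .write.format("delta")
    .partitionBy("event_date")   # placeholder partition column
    .save("/data/big_delta_table"))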
I have a Spark Job that reads data from S3. I apply some transformations and write 2 datasets back to S3. Each write action is treated as a separate job.
Question: Does Spark guarantee that I read the data in the same order each time? For example, if I apply the function:
.withColumn('id', f.monotonically_increasing_id())
Will the id column have the same values for the same records each time?
You state very little, but the following is easily testable and should serve as a guideline:
If you re-read the same files with the same content, you will get the same blocks / partitions again and therefore the same ids from f.monotonically_increasing_id().
If the total number of rows differs on the successive read(s), with different partitioning applied before this function, then you will typically get different ids.
If you have more data the second time round and apply coalesce(1), then the prior entries will still have the same ids and the newer rows will have other ids; a less-than-realistic scenario, of course.
Blocks for files at rest remain static (in general) on HDFS, so partitions 0..N will be the same upon reading from rest. Otherwise zipWithIndex would not be usable either.
I would never rely on the same data being in the same place when read twice unless there were no updates (you could cache as well).
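A quick way to test this yourself (a sketch; the input path is a placeholder):

from pyspark.sql import functions as f

def read_with_id(path):
    # Re-read the same files and attach ids derived from the partition layout
    return spark.read.parquet(path).withColumn("id", f.monotonically_increasing_id())

df1 = read_with_id("/data/input")
df2 = read_with_id("/data/input")

# If the blocks / partitions came back in the same order, the (row, id) pairs match
print(df1.exceptAll(df2).count() == 0)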
Objective
Suppose you're building a Data Lake and Star Schema with the help of ETL. The storage format is Delta Lake. One of the ETL responsibilities is to build Slowly Changing Dimension (SCD) tables (cumulative state). This means that every day, for every SCD table, the ETL reads the full table state, applies the updates and saves it back (full overwrite).
Question
One of the questions we argued about within my team: should we add a time partition to the SCD (full overwrite) tables? That is, should I save the latest (full) table state to SOME_DIMENSION/ or to SOME_DIMENSION/YEAR=2020/MONTH=12/DAY=04/?
Considerations
On one hand, Delta Lake has all the required features: time travel & ACID. When you overwrite the whole table, a logical deletion happens, and you're still able to query old versions and roll back to them. So Delta Lake almost manages the time partition for you, and the code gets simpler.
On the other hand, I said "almost" because IMHO time travel & ACID don't cover 100% of the use cases: they have no notion of arrival time. For example:
Example (when you need a time partition)
The BA team reported that the SOME_FACT/YEAR=2019/MONTH=07/DAY=15 data are broken (facts must be stored with a time partition in any case, because data are processed by arrival time). In order to reproduce the issue on a DEV/TEST environment, you need the raw inputs of 1 fact table and 10 SCD tables.
With facts everything is simple, because you have the raw inputs in the Data Lake. But with incremental state (SCD tables) things get complex: how do you get the state of the 10 SCD tables for the point in time when SOME_FACT/YEAR=2019/MONTH=07/DAY=15 was processed? How do you do this automatically?
To complicate things even more, your environment may go through a bunch of bugfixes and history re-processings, so the 2019-07 data may be reprocessed somewhere in 2020. Delta Lake only lets you roll back by processing time or version number, so you don't actually know which version to use.
On the other hand, with date partitioning you are always sure that SOME_FACT/YEAR=2019/MONTH=07/DAY=15 was calculated over SOME_DIMENSION/YEAR=2019/MONTH=07/DAY=15.
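For illustration, writing the daily SCD snapshot under an arrival-date partition could look like this (a sketch; the path, the date values and scd_df, the freshly computed full SCD state, are placeholders):

from pyspark.sql import functions as f

year, month, day = 2019, 7, 15   # placeholder arrival date

(scd_df
    .withColumn("YEAR", f.lit(year))
    .withColumn("MONTH", f.lit(month))
    .withColumn("DAY", f.lit(day))
    .write.format("delta")
    .mode("overwrite")
    .option("replaceWhere", "YEAR = {} AND MONTH = {} AND DAY = {}".format(year, month, day))
    .partitionBy("YEAR", "MONTH", "DAY")
    .save("/lake/SOME_DIMENSION"))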
It depends, and I think it's a bit more complicated.
Some context first: Delta gives you time travel only within the retained commit history, which is 30 days by default. If you are vacuuming old files as part of your optimizations, that window might be significantly shorter (the default file retention is 7 days).
Also, you actually can query Delta tables as of a specific time, not only a specific version, but due to the above limitations (unless you are willing to pay the performance and financial cost of storing a really long commit history), it's not useful from a long-term perspective.
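For reference, an "as of" read looks like this (a sketch; the path, timestamp and version are placeholders):

# Time travel by timestamp (only works within the retained commit history)
old_by_time = (spark.read.format("delta")
    .option("timestampAsOf", "2019-07-15")
    .load("/lake/SOME_DIMENSION"))

# ... or by version number
old_by_version = (spark.read.format("delta")
    .option("versionAsOf", 123)
    .load("/lake/SOME_DIMENSION"))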
Because of these limitations, a very common data lake architecture right now is the medallion approach (Bronze -> Silver -> Gold). Ideally, I'd store the raw inputs in the Bronze layer, keep the whole historical perspective in the Silver layer (already clean and validated, the best source of truth, but with the whole history as needed), and consume the current version directly from the Gold tables.
This avoids increasing the complexity of querying the SCDs with additional partitions, while still giving you the option to "go back" to the Silver layer if the need arises. But it's always a tradeoff decision; in any case, don't rely on Delta for long-term versioning.