Delta Lake tables follow the "write once, read many" (WORM) concept, which means the underlying data files are immutable. This makes sense, and it is the approach most other data warehouse products take as well. However, it causes write amplification: every time I update an existing record, the files holding that record (potentially an entire partition) are copied and then rewritten. So inserting or updating one record at a time is definitely not a good option.
So my question is: what is the recommended batch size for loading Delta Lake tables?
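To illustrate the point, here is a minimal, hypothetical sketch of batching changes: stage a batch of records and apply them with a single MERGE, so each data file is rewritten at most once per batch rather than once per record (the names events, updates_batch, eventId and data are made up for illustration):
-- Apply a whole batch of changes in one MERGE instead of one statement per record
MERGE INTO events AS t
USING updates_batch AS s   -- staging table/view holding the current batch
ON t.eventId = s.eventId
WHEN MATCHED THEN UPDATE SET t.data = s.data
WHEN NOT MATCHED THEN INSERT (eventId, data) VALUES (s.eventId, s.data);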
Related
I'm trying to convert large Parquet files to Delta format for performance optimization and faster job runs.
I'm trying to research the best practices for migrating huge Parquet files to Delta format on Databricks.
There are two general approaches to that, but it really depends on your requirements:
Do an in-place upgrade using the CONVERT TO DELTA SQL command or the corresponding Python/Scala/Java APIs (doc). You need to take the following into account: if you have a huge table, the default CONVERT TO DELTA command may take too long, as it needs to collect statistics for your data. You can avoid this by adding NO STATISTICS to the command, and then it will run faster. Without statistics you won't get the benefits of data skipping and other optimizations, but they can be collected later when executing the OPTIMIZE command.
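A minimal sketch of that in Spark SQL, assuming a Parquet table stored at /mnt/lake/events and partitioned by a date column (the path and column are illustrative):
-- Fast in-place conversion; skip statistics collection for speed
CONVERT TO DELTA parquet.`/mnt/lake/events` NO STATISTICS PARTITIONED BY (date DATE);
-- Per the note above, statistics can be collected later when OPTIMIZE runs
OPTIMIZE delta.`/mnt/lake/events`;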
Create a copy of your original table by reading the original Parquet data and writing it as a Delta table. After you check that everything is correct, you may remove the original table. This approach has the following benefits (a minimal sketch follows the list):
You can change the partitioning scheme if you have too many levels of partitioning in your original table.
You can change the order of columns in the table to take advantage of data skipping for numeric and date/time data types, which should improve query performance.
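As referenced above, a minimal sketch of the copy approach in Spark SQL; the path, table name, columns, and new partitioning are all illustrative:
-- Rewrite the Parquet data as a Delta table with a coarser partitioning scheme,
-- placing the numeric/date columns used for filtering first to help data skipping
CREATE TABLE events_delta
USING DELTA
PARTITIONED BY (year, month)
AS SELECT event_time, event_id, data, year, month
FROM parquet.`/mnt/lake/events_parquet`;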
Objective
Suppose you're building a Data Lake and a Star Schema with the help of ETL. The storage format is Delta Lake. One of the ETL responsibilities is to build Slowly Changing Dimension (SCD) tables (cumulative state). This means that every day, for every SCD table, the ETL reads the full table state, applies updates, and saves it back (full overwrite).
Question
One of the questions we argued about within my team: should we add a time partition to SCD (full overwrite) tables? That is, should I save the latest (full) table state to SOME_DIMENSION/ or to SOME_DIMENSION/YEAR=2020/MONTH=12/DAY=04/?
Considerations
On the one hand, Delta Lake has all the required features: time travel and ACID. When it overwrites the whole table, the old data is only logically deleted, and you're still able to query old versions and roll back to them. So Delta Lake is almost managing the time partition for you, and the code gets simpler.
On the other hand, I said "almost" because IMHO time travel and ACID don't cover 100% of the use cases: Delta has no notion of arrival time. For example:
Example (when you need a time partition)
The BA team reported that the SOME_FACT/YEAR=2019/MONTH=07/DAY=15 data is broken (facts must be stored with a time partition in any case, because the data is processed by arrival time). In order to reproduce the issue in the DEV/TEST environment, you need the raw inputs of 1 fact table and 10 SCD tables.
With the facts everything is simple, because you have the raw inputs in the Data Lake. But with incremental state (SCD tables) things get complex: how do you get the state of the 10 SCD tables as of the point in time when SOME_FACT/YEAR=2019/MONTH=07/DAY=15 was processed? And how do you do this automatically?
To complicate things even more, your environment may go through a bunch of bugfixes and history re-processings, meaning the 2019-07 data may be reprocessed sometime in 2020. And Delta Lake only allows you to roll back based on processing time or version number, so you don't actually know which version you should use.
On the other hand, with date partitioning, you are always sure that SOME_FACT/YEAR=2019/MONTH=07/DAY=15 was calculated over SOME_DIMENSION/YEAR=2019/MONTH=07/DAY=15.
It depends, and I think it's a bit more complicated.
Some context first: Delta gives you time travel only within the retained commit history, which is 30 days by default. If you are optimizing and vacuuming the table, the usable window might be significantly shorter (the default retention for removed data files is 7 days).
Also, you can actually query Delta tables as of a specific timestamp, not only a version, but due to the above limitations (unless you are willing to pay the performance and financial cost of keeping a really long commit history), it's not useful from a long-term perspective.
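For reference, a hedged sketch in Delta SQL of both points; the table name and values are illustrative, and the two table properties shown (with their defaults) are what bound how far back you can travel:
-- Query the table as of a timestamp or as of a version
SELECT * FROM some_dimension TIMESTAMP AS OF '2020-12-04';
SELECT * FROM some_dimension VERSION AS OF 42;
-- Retention settings that bound the usable time-travel window (defaults shown)
ALTER TABLE some_dimension SET TBLPROPERTIES (
  'delta.logRetentionDuration' = 'interval 30 days',
  'delta.deletedFileRetentionDuration' = 'interval 7 days'
);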
This is why a very common data lake architecture right now is the medallion approach (Bronze -> Silver -> Gold). Ideally, I'd store the raw inputs in the Bronze layer, keep the whole historical perspective in the Silver layer (already clean and validated, the best source of truth, with full history as needed), and consume the current version directly from the Gold tables.
This avoids increasing the complexity of querying the SCDs with additional partitions, while giving you the option to "go back" to the Silver layer if the need arises. But it's always a tradeoff decision; in any case, don't rely on Delta for long-term versioning.
I am trying to use Spark bucketing on a key-value table that is frequently joined on the key column by batch applications. The table is partitioned by a timestamp column, and new data arrives periodically and is added as a new timestamp partition. Nothing new here.
I thought it was an ideal use case for Spark bucketing (a minimal setup sketch follows below), but some limitations seem to be fatal when the table is incremental:
An incremental table forces multiple files per bucket, which forces Spark to sort every bucket upon join even though every file is sorted locally. Some JIRAs suggest that this is a conscious design choice that is not going to change any time soon. This is understandable, too, as there could be thousands of locally sorted files in each bucket, and iterating over so many files concurrently does not seem like a good idea either.
The bottom line is that the sorting cannot be avoided.
In a map-side join, every bucket is handled by a single task. When the table is incremented, every such task consumes more and more data as more partitions (increments) are included in the join. Empirically, this ultimately failed with OOM regardless of the configured memory settings. To my understanding, even if the failures can be avoided, this design does not scale at all. It imposes an impossible trade-off when deciding on the number of buckets: aiming for a long-lived table results in lots of small files with every increment.
This gives the immediate impression that bucketing should not be used with incremental tables. I wonder if anyone has a better opinion on that, or maybe I am missing some basics here.
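For context, a minimal sketch of the bucketed setup being described, in Spark SQL; the table names, key, and bucket count are illustrative:
-- Key-value table bucketed and sorted by the join key, partitioned by arrival time
CREATE TABLE kv_table (key BIGINT, value STRING, ts DATE)
USING PARQUET
PARTITIONED BY (ts)
CLUSTERED BY (key) SORTED BY (key) INTO 512 BUCKETS;
-- When both sides are bucketed the same way on the join key, the shuffle is avoided,
-- but (as described above) the per-bucket sort is not, once each bucket spans many files
SELECT t.key, t.value, o.value
FROM kv_table t JOIN other_bucketed_table o ON t.key = o.key;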
I am building a data lake to integrate multiple data sources for advanced analytics.
In the beginning, I selected HDFS as the data lake storage. But the data sources receive updates and deletes, which I have to synchronise with the data lake.
Given the immutable nature of the data lake, I will use the LastModifiedDate from the data source to detect that a record has been updated, and insert that record into the data lake with the current date. The idea is then to select the record with max(date).
However, I am not able to understand how I will detect deleted records in the sources, and what I should then do with them in the data lake.
Should I use another data store like Cassandra and execute a delete command? I am afraid that would lose the immutability property.
Can you please suggest a good practice for this situation?
1. Question - Detecting deleted records from data sources
Detecting deleted records from data sources requires that your data sources support this. It is best if deletion is only done logically, e.g. with a change flag. Some databases also make it possible to track deleted rows (see, for example, SQL Server). Some ETL solutions like Informatica also offer CDC (Change Data Capture) capabilities.
2. Question - Changed data handling in a big data solution
There are different approaches. Of course you can use a key-value store, which adds some complexity to the overall solution. First you have to clarify whether it is also of interest to track changes and deletes. You could consider loading all data (new/changed/deleted) into daily partitions and then building the current image (the data as it is in your data source). Also consider solutions like Databricks Delta, which address these topics without the need for an additional store. For example, you are able to do an upsert on Parquet files with Delta as follows:
MERGE INTO events
USING updates
ON events.eventId = updates.eventId
WHEN MATCHED THEN
  UPDATE SET events.data = updates.data
WHEN NOT MATCHED THEN
  INSERT (date, eventId, data) VALUES (date, eventId, data)
If your solution also requires low-latency access via a key (e.g. to support an API), then a key-value store like HBase, Cassandra, etc. would be helpful.
This is a common constraint when creating a data lake in Hadoop: one can't simply update or delete records in it. One approach you can try is the following:
When you are adding lastModifiedDate, you can also add one more column named status. If a record is deleted, mark its status as Deleted. Then, the next time you want to query the latest active records, you will be able to filter the deleted ones out.
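A hedged sketch of that query pattern in Spark SQL; the table and column names (customer_raw, customer_id, lastModifiedDate, status) are made up for illustration:
-- Keep the most recent version of each key, then drop the logically deleted ones
SELECT *
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY customer_id
                            ORDER BY lastModifiedDate DESC) AS rn
  FROM customer_raw
) latest
WHERE rn = 1 AND status <> 'Deleted';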
You can also use Cassandra or HBase (or any NoSQL database) if you are performing ACID operations on a daily basis. If not, the first approach would be the ideal choice for creating a data lake in Hadoop.
I have a Cassandra cluster with multiple data centres. I want to archive data monthly and then purge it. There are numerous articles on backing up and restoring, but none that mention how to archive data in a Cassandra cluster.
Can someone please let me know how I can archive my data in the Cassandra cluster monthly and then purge it?
I think there is no ready-made tool that can be used to archive Cassandra. You have to write either Spark jobs or MapReduce jobs that use CqlInputFormat to archive the data. You can follow the links below to understand how people are archiving data in Cassandra:
[1] - http://docs.wso2.org/display/BAM240/Archive+Cassandra+Data
[2] - http://docs.wso2.org/pages/viewpage.action?pageId=32345660
[3] - http://accelconf.web.cern.ch/AccelConf/ICALEPCS2013/papers/tuppc004.pdf
There is also a way to turn on incremental backups in Cassandra, which can be used like CDC.
It is best practice to use the time-window compaction strategy (TWCS) on your tables, with a monthly window and a one-month TTL, so that data older than a month is purged automatically.
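A hedged sketch in CQL of what such a table definition could look like; the table and column names are illustrative, and 2592000 seconds corresponds to the 30-day TTL:
-- Monthly time windows plus a default TTL, so expired data is dropped by compaction
CREATE TABLE metrics_by_month (
    sensor_id  uuid,
    event_time timestamp,
    value      double,
    PRIMARY KEY (sensor_id, event_time)
) WITH compaction = {
      'class': 'TimeWindowCompactionStrategy',
      'compaction_window_unit': 'DAYS',
      'compaction_window_size': 30
  }
  AND default_time_to_live = 2592000;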
If you instead write a purge job that does this deletion (on tables which do not have the correct compaction strategy applied), it can impact cluster performance, because searching for data on a date/month basis will overwhelm the cluster.
I have experienced this, where we ultimately had to go back, change the structure of the tables, and alter the compaction strategy. That is why getting the table design right in the first place is very important. We need to think, right from the beginning, not only about how data will be inserted and read but also how it will be deleted, and then define the keys, compaction, TTL, etc. accordingly.
For archiving, just write a few lines of code to read the data from Cassandra and put it in your archival location.
Let me know if this helps you get the end result you want, or if you have further questions I can help with.