How to dynamically pass save_args to kedro catalog? - databricks

I'm trying to write Delta tables in Kedro. Changing the file format to delta makes the write produce Delta tables with mode set to overwrite.
Previously, a node in the raw layer (meta_reload) created a dataset that determines the start date of the incremental load for each dataset. Each node uses that raw dataset to filter its working dataset, apply the transformation logic, and write partitioned parquet tables incrementally.
But now, with only the file type changed to delta and mode set to overwrite, the current incremental data overwrites all of the past data instead of just the affected partitions. So I need to use the replaceWhere option in save_args in the catalog.
How would I determine the start date for replaceWhere in the catalog when I need to read the meta_reload raw dataset to determine that date?
Is there a way to dynamically pass the save_args from inside the node?
my_dataset:
  type: my_project.io.pyspark.SparkDataSet
  filepath: "s3://${bucket_de_pipeline}/${data_environment_project}/${data_environment_intermediate}/my_dataset/"
  file_format: delta
  layer: intermediate
  save_args:
    mode: "overwrite"
    replaceWhere: "DATE_ID > xyz" ## what I want to implement dynamically
    partitionBy: [ "DATE_ID" ]

I've answered this on the GH discussion. In short, you would need to subclass SparkDataSet and define your own dataset. We avoid changing the underlying API of the datasets at the Kedro level, but you're encouraged to alter and remix them for your own purposes.
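A minimal sketch of such a subclass, assuming the node returns a (DataFrame, replace_where) tuple and that the base class builds its writer from self._save_args, as recent SparkDataSet implementations do; the class name is hypothetical, and the import path depends on your Kedro version (newer releases ship it as kedro_datasets.spark.SparkDataset):

from kedro.extras.datasets.spark import SparkDataSet


class DynamicSaveArgsSparkDataSet(SparkDataSet):
    # Accepts either a DataFrame or a (DataFrame, replace_where_string) tuple,
    # so the node can decide the replaceWhere predicate at run time.
    def _save(self, data) -> None:
        if isinstance(data, tuple):
            df, replace_where = data
            # Copy so the catalog-level save_args are not mutated between runs.
            self._save_args = {**(self._save_args or {}), "replaceWhere": replace_where}
        else:
            df = data
        super()._save(df)

The catalog entry would then point type at this class, and the node computing the incremental slice would return something like (df, f"DATE_ID > '{start_date}'") with the date read from the meta_reload dataset.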

Related

What are best practices to migrate parquet table to Delta?

I'm trying to convert large parquet files to Delta format for performance optimization and faster job runs.
I'm researching the best practices for migrating huge parquet files to Delta format on Databricks.
There are two general approaches (both sketched in code below), but it really depends on your requirements:
Do an in-place upgrade using the CONVERT TO DELTA SQL command or the corresponding Python/Scala/Java APIs (doc). You need to take the following consideration into account: if you have a huge table, the default CONVERT TO DELTA command may take too long, as it needs to collect statistics for your data. You can avoid this by adding NO STATISTICS to the command, and then it will run faster. With it, you won't get the benefits of data skipping and other optimizations, but those statistics can be collected later when executing the OPTIMIZE command.
Create a copy of your original table by reading the original Parquet data and writing it as a Delta table. After you check that everything is correct, you may remove the original table. This approach has the following benefits:
You can change the partitioning schema if you have too many levels of partitioning in your original table.
You can change the order of columns in the table to take advantage of data skipping for numeric and date/time data types; this should improve query performance.
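A rough sketch of both approaches in PySpark, assuming the delta-spark package is available; the table path, partition column, and target location are placeholders:

from delta.tables import DeltaTable

# Approach 1: in-place conversion of an existing partitioned Parquet path.
# For a partitioned table the partition schema must be declared explicitly.
DeltaTable.convertToDelta(spark, "parquet.`/mnt/lake/events`", "DATE_ID date")

# SQL equivalent (an alternative to the API call above, not both),
# skipping statistics collection to speed up a huge table:
spark.sql("CONVERT TO DELTA parquet.`/mnt/lake/events` NO STATISTICS PARTITIONED BY (DATE_ID date)")

# Approach 2: copy into a new Delta table, optionally with new partitioning.
(
    spark.read.parquet("/mnt/lake/events")
    .write.format("delta")
    .partitionBy("DATE_ID")
    .save("/mnt/lake/events_delta")
)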

What Happens When a Delta Table is Created in Delta Lake?

With the Databricks Lakehouse platform, it is possible to create 'tables', or to be more specific, Delta tables, using a statement such as the following:
DROP TABLE IF EXISTS People10M;
CREATE TABLE People10M
USING parquet
OPTIONS (
path "/mnt/training/dataframes/people-10m.parquet",
header "true"
);
What I would like to know is: what exactly happens behind the scenes when you create one of these tables? What exactly is a table in this context? Because the data is actually contained in files in the data lake (the data storage location) that Delta Lake runs on top of... right? Are tables some kind of abstraction that allows us to access the data stored in these files using something like SQL?
What does the USING parquet portion of this statement do? Are parquet tables different from CSV tables in some way? Or does this just depend on the format of the source data?
Any links to material that explains this idea would be appreciated. I want to understand this in depth from a technical point of view.
There are a few aspects here. Your table definition is not Delta Lake; it's Spark SQL (or Hive) syntax to define a table. It's just metadata that allows users to easily use the table without knowing where it's located, what the data format is, etc. You can read more about databases & tables in the Databricks documentation.
The actual format for data storage is specified by the USING directive. In your case it's parquet, so when people or code read or write data, the underlying engine will first read the table metadata, figure out the location of the data and the file format, and then use the corresponding code.
Delta is another file format (really a storage layer) that is built on top of Parquet as the data format, but adds additional capabilities such as ACID transactions, time travel, etc. (see doc). If you want to use Delta instead of Parquet, then you either need to use CONVERT TO DELTA to convert existing Parquet data into Delta, or specify USING delta when creating a completely new table.
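To see that the table really is just metadata pointing at files, you can inspect and recreate it; a small sketch, reusing the People10M table from the question (the People10M_delta name is hypothetical):

# Shows the resolved storage location and data source (provider) of the table.
spark.sql("DESCRIBE EXTENDED People10M").show(truncate=False)

# Creating an equivalent table as Delta instead of Parquet:
spark.sql("""
    CREATE TABLE People10M_delta
    USING delta
    AS SELECT * FROM parquet.`/mnt/training/dataframes/people-10m.parquet`
""")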

Reading version specific files delta-lake

I want to read the delta data after a certain timestamp/version. The logic here suggests reading the entire data and reading the specific version, and then finding the delta. As my data is huge, I would prefer not to read the entire data and, if somehow possible, read only the data after a certain timestamp/version.
Any suggestions?
If you need data that has a timestamp after some specific date, then you still need to sift through all of the data. But Spark & Delta Lake may help here if you organize your data correctly:
You can have time-based partitions, for example storing data by day/week/month, so when Spark reads the data it may read only specific partitions (performing so-called predicate pushdown), for example df = spark.read.format("delta").load(...).filter("day > '2021-12-29'") - this works not only for Delta, but for other formats as well. Delta Lake may additionally help here because it supports so-called generated columns, where you don't need to create a partition column explicitly but allow Spark to generate it for you based on other columns.
On top of partitioning, formats like Parquet (and Delta, which is based on Parquet) allow skipping the reading of all the data because they maintain min/max statistics inside the files. But you will still need to read these files.
On Databricks, Delta Lake has more capabilities for selective reads of the data - for example, the min/max statistics that Parquet keeps inside each file can be saved into the transaction log, so Delta won't need to open a file to check whether the timestamp is in the given range - this technique is called data skipping. Additional performance can come from Z-Ordering of the data, which collocates related data closer together - that's especially useful when you need to filter by multiple columns.
Update 14.04.2022: data skipping is also available in OSS Delta, starting with version 1.2.0.
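A small sketch of the partition-filter approach described above, plus a time-travel read for comparison; the path and the day column are placeholders:

# Reads only the partitions/files whose statistics match the predicate.
df_recent = (
    spark.read.format("delta")
    .load("s3://bucket/events")
    .filter("day > '2021-12-29'")
)

# Time travel: read the table as of a specific version
# (note this materializes that full snapshot, not just the changes).
df_v5 = (
    spark.read.format("delta")
    .option("versionAsOf", 5)
    .load("s3://bucket/events")
)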

Update existing records of parquet file in Azure

I am converting my table into parquet file format using Azure Data Factory and performing queries on the parquet files using Databricks for reporting. I want to update only the existing records which were updated in the original SQL Server table. Since I am performing this daily on a very big table, I don't want to truncate and reload the entire table, as that would be costly.
Is there any way I can update those parquet files without performing a truncate-and-reload operation?
Parquet is by design immutable, so the only way to rewrite the data is to rewrite the table. But updating is possible if you switch to the Delta file format, which supports updating/deleting entries and also supports the MERGE operation.
You can still use the Parquet format for producing the data, but then you need to use that data to update the Delta table.
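A minimal sketch of that MERGE, assuming the delta-spark package; the storage paths and the id key column are placeholders:

from delta.tables import DeltaTable

# Incoming changes still produced as Parquet (e.g. by Azure Data Factory).
updates = spark.read.parquet("abfss://landing@myaccount.dfs.core.windows.net/people_updates/")

# Existing Delta table that serves the reporting queries.
target = DeltaTable.forPath(spark, "abfss://lake@myaccount.dfs.core.windows.net/people_delta/")

(
    target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)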
I have found a workaround to this problem.
Read the parquet file into a data frame using any tool or Python scripts.
Create a temporary table or view from the data frame.
Run SQL queries to modify, update, and delete the records.
Convert the table back into a data frame.
Overwrite the existing parquet files with the new data.
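A rough sketch of this workaround in PySpark; the paths, view name, and SQL are only illustrative, and the result is written to a fresh location because Spark cannot overwrite a path it is still reading from:

df = spark.read.parquet("/mnt/data/people/")
df.createOrReplaceTempView("people_tmp")

# Apply the updates/deletes with plain SQL on the temp view.
updated = spark.sql("""
    SELECT id,
           CASE WHEN id = 42 THEN 'corrected name' ELSE name END AS name
    FROM people_tmp
    WHERE is_deleted = false   -- hypothetical soft-delete flag
""")

# Write the corrected data, then swap it in for the old files.
updated.write.mode("overwrite").parquet("/mnt/data/people_new/")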
Always go for a soft delete while working in NoSQL; a hard delete is very costly.
Also, with soft deletes, downstream pipelines can consume the update and act upon it.

Incremental batch processing in pyspark

In our Spark application, we run multiple batch processes every day. The sources for these batch processes differ: Oracle, MongoDB, files. We store a different value for incremental processing depending on the source - the latest timestamp for some Oracle tables, an ID for other Oracle tables, a list for some file systems - and use those values for the next incremental run.
Currently the calculation of these offset values depends on the source type, and we need to customize the code to store this value every time we add a new source type.
Is there any generic way to resolve this issue, like checkpointing in streaming?
I always like to look into the destination for the last written partition, or get some max(primary_key), and then based on that value select the data from the source database to write during the current run.
There would be no need to store anything; you would just need to supply the table name, source type, and primary key/timestamp column to your batch processing algorithm. The algorithm would then find the latest value you already have.
It really depends on your load philosophy and how your storage is divided, e.g. whether you have raw/source/prepared layers. It is a good idea to load data in a raw format which can easily be compared to the original source in order to do what I described above.
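A small sketch of that destination-driven approach for a JDBC source; the paths, table name, connection details, and watermark column are placeholders:

# Latest value already present in the destination (raw layer).
last_loaded = (
    spark.read.format("delta")
    .load("s3://lake/raw/orders")
    .agg({"updated_at": "max"})
    .collect()[0][0]
)

# Pull only newer rows from the source (handle the first run,
# when last_loaded is None, separately).
incremental = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/ORCL")
    .option("query", f"SELECT * FROM orders WHERE updated_at > TIMESTAMP '{last_loaded}'")
    .option("user", "etl_user")
    .option("password", "replace-with-secret")  # prefer a secret scope in practice
    .load()
)

incremental.write.format("delta").mode("append").save("s3://lake/raw/orders")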
Alternatives include:
Writing a file which contains that primary column and the latest value; your batch job would read this file to determine what to read next.
Updating the job execution configuration with an argument corresponding to the latest value, so on the next run the latest value is passed to your algorithm.
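And a sketch of the first alternative, persisting the latest value to a small JSON file; the path and the Databricks dbutils calls are assumptions:

import json

# After a successful run, record what was loaded.
state = {"table": "orders", "column": "updated_at", "value": "2022-04-14 10:00:00"}
dbutils.fs.put("s3://lake/_watermarks/orders.json", json.dumps(state), overwrite=True)

# At the start of the next run, read it back to know where to resume.
previous = json.loads(dbutils.fs.head("s3://lake/_watermarks/orders.json"))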
