Azure Data Lake incremental load with file partition - azure

I'm designing Data Factory piplelines to load data from Azure SQL DB to Azure Data Factory.
My initial load/POC was a small subset of data and was able to load from SQL tables to Azure DL.
Now, there are huge volume of tables (that has even billion +) that I want to load from SQL DB using DF to Azure DL.
MS docs mentioned two options, i.e. watermark columns and change tracking.
Let's say I have a "cust_transaction" table that has millions of rows and if I load to DL then it loads as "cust_transaction.txt".
Questions.
1) What would an optimal design to incrementally load the source data from SQL DB into that file in the data lake?
2) How do I split or partition the files into smaller files?
3) How should I merge and load the deltas from source data into the files?
Thanks.

You will want multiple files. Typically, my data lakes have multiple zones. The first zone is Raw. It contains a copy of the source data organized into entity/year/month/day folders where entity is a table in your SQL DB. Typically, those files are incremental loads. Each incremental load for an entity has a file name similar to Entity_YYYYMMDDHHMMSS.txt (and maybe even more info than that) rather than just Entity.txt. And the timestamp in the file name is the end of the incremental slice (max possible insert or update time in the data) rather than just current time wherever possible (sometimes they are relatively the same and it doesn't matter, but I tend to get a consistent incremental slice end time for all tables in my batch). You can achieve the date folders and timestamp in the file name by parameterizing the folder and file in the dataset.
Melissa Coates has two good articles on Azure Data Lake: Zones in a Data Lake and Data Lake Use Cases and Planning. Her naming conventions are a bit different than mine, but both of us would tell you to just be consistent. I would land the incremental load file in Raw first. It should reflect the incremental data as it was loaded from the source. If you need to have a merged version, that can be done with Data Factory or U-SQL (or your tool of choice) and landed in the Standardized Raw zone. There are some performance issues with small files in a data lake, so consolidation could be good, but it all depends on what you plan to do with the data after you land it there. Most users would not access data in the RAW zone, instead using data from Standardized Raw or Curated Zones. Also, I want Raw to be an immutable archive from which I could regenerate data in other zones, so I tend to leave it in the files as it landed. But if you found you needed to consolidate there, that would be fine.
Change tracking is a reliable way to get changes, but I don't like their naming conventions/file organization in their example. I would make sure your file name has the entity name and a timestamp on it. They have Incremental - [PipelineRunID]. I would prefer [Entity]_[YYYYMMDDHHMMSS]_[TriggerID].txt (or leave the run ID off) because it is more informative to others. I also tend to use the Trigger ID rather than the pipeline RunID. The Trigger ID is across all the packages executed in that trigger instance (batch) whereas the pipeline RunID is specific to that pipeline.
If you can't do the change tracking, the watermark is fine. I usually can't add change tracking to my sources and have to go with watermark. The issue is that you are trusting that the application's modified date is accurate. Are there ever times when a row is updated and the modified date is not changed? When a row is inserted, is the modified date also updated or would you have to check two columns to get all new and changed rows? These are the things we have to consider when we can't use change tracking.
To summarize:
Load incrementally and name your incremental files intelligently
If you need a current version of the table in the data lake, that is a separate file in your Standardized Raw or Curated Zone.

Related

What's the most efficient way to delete rows in target table that are missing in source table? (Azure Databricks)

I am working with Azure Databricks and we are moving hundreds of gigabytes of data with Spark. We stream them with Databricks' autoloader function from a source storage on Azure Datalake Gen2, process them with Databricks notebooks, then load them into another storage. The idea is that the end result is a replica, a copy-paste of the source, but with some transformations involved.
This means if a record is deleted at the source, we also have to delete it. If a record is updated or added, then we do that too. For the latter autoloader with a file level listener, combined with a MERGE INTO and with .forEachBatch() is an efficient solution But what about deletions? For technical reasons (dynamics365 azure synapse link export being extremely limited in configuration) we are not getting delta files, we have no data on whether a certain record got updated, added or deleted. We only have the full data dump every time.
To simply put: I want to delete records in a target dataset if the record's primary key is no longer found in a source dataset. In T-SQL MERGE could check both ways, whether there is a match by the target or the source, however in Databricks this is not possible, MERGE INTO only checks for the target dataset.
Best idea so far:
DELETE FROM a WHERE NOT EXISTS (SELECT id FROM b WHERE a.id = b.id)
Occasionally a deletion job might delete millions of rows, which we have to replicate, so performance is important. What would you suggest? Any best practices to this?

AWS Glue for handling incremental data loading of different sources

I am planning to leverage AWG Glue for incremental data processing. Based on hourly schedule a trigger will invoke Glue Crawler and Glue ETL Job which loads incremental data to catalog and processed the incremental files through ETL. And looks pretty straight forward as well. With this I ran into couple of issues.
Let's say we have data getting streamed for various tables and for various data bases to S3 locations, and we want to create data bases and tables based on landing data.
eg: s3://landingbucket/database1/table1/YYYYMMDDHH/some_incremental_files.json
s3://landingbucket/database1/table2/YYYYMMDDHH/some_incremental_files.json
s3://landingbucket/database1/somedata/tablex/YYYYMMDDHH/some_incremental_files.json
s3://landingbucket/database2/table1/YYYYMMDDHH/some_incremental_files.json
s3://landingbucket/datasource_external/data/table1/YYYYMMDDHH/some_incremental_files.json
With the data getting landed in above s3 structure, we want to create glue catalog for these data bases and tables with limited Crawlers. Here we have number of databases as number of crawlers.
Note: We have a crawler for database1, its creating tables under database1, which is good and as expected, but we have an exceptional guy "somedata" in database1, whose structure is not in standard with other tables, with this it created table somedata and with partitions "partitions_0=tablex and partition_1=YYYYMMDDHH". Is there a better way to handle these with less number of crawlers than one crawler per data base.
Glue ETL, we have similar challenge, we want to format the incoming data to standard parquet format, and have one bucket per database and tables will be sitting under that, as the data is huge we don't want one table with partitions as data_base and data. So that we will not getting into s3 slowdown issues for the incoming load. As many teams will be querying the data from this, so we don't want to have s3 slowdown issue coming for their analytics jobs.
Instead of having one ETL job per table, per data base, is there a way we can handle this with limited jobs. As and when new tables are coming, there should be a way the ETL job should transform this json data to formatted zone. So input data and output path both can be handled dynamically, instead of hardcoding.
Open for any better idea!
Thanks,
Krish!

How do I store run-time data in Azure Data Factory between pipeline executions?

I have been following Microsoft's tutorial to incrementally/delta load data from an SQL Server database.
It uses a watermark (timestamp) to keep track of changed rows since last time. The tutorial stores the watermark to an Azure SQL database using the "Stored Procedure" activity in the pipeline so it can be reused in the next execution.
It seems overkill to have an Azure SQL database just to store that tiny bit of meta information (my source database is read-only btw). I'd rather just store that somewhere else in Azure. Maybe in the blob storage or whatever.
In short: Is there an easy way of keeping track of this type of data or are we limited to using stored procs (or Azure Functions et al) for this?
I had come across a very similar scenario, and from what I found you can't store any watermark information in ADF - at least not in a way that you can easily access.
In the end I just created a basic tier Azure SQL database to store my watermark / config information on a SQL server that I was already using in my pipelines.
The nice thing about this is when my solution scaled out to multiple business units, all with different databases, I could still maintain watermark information for each of them by simply adding a column that tracks which BU that specific watermark info was for.
Blob storage is indeed a cheaper option but I've found it to require a little more effort than just using an additional database / table in an existing database.
I agree it would be really useful to be able to maintain a small dataset in ADF itself for small config items - probably a good suggestion to make to Microsoft!
There is a way to achieve this by using Copy activity, but it is complicated to get latest watermark in 'LookupOldWaterMarkActivity', just for reference.
Dataset setting:
Copy activity setting:
Source and sink dataset is the same one. Change the expression in additional columns to #{activity('LookupNewWaterMarkActivity').output.firstRow.NewWatermarkvalue}
Through this, you can save watermark as column in .txt file. But it is difficult to get the latest watermark with Lookup activity. Because your output of 'LookupOldWaterMarkActivity' will be like this:
{
"count": 1,
"value": [
{
"Prop_0": "11/24/2020 02:39:14",
"Prop_1": "11/24/2020 08:31:42"
}
]
}
The name of key is generated by ADF. If you want to get "11/24/2020 08:31:42", you need to get column count and then use expression like this: #activity('LookupOldWaterMarkActivity').output.value[0][Prop_(column count - 1)]
How to get latest watermark:
use GetMetadata activity to get columnCount
use this expression:#activity('LookupOldWaterMarkActivity').output.value[0][concat('Prop_',string(sub(activity('Get Metadata1').output.columnCount,1)))]

Synchronize data lake with the deleted record

I am building data lake to integrate multiple data sources for advanced analytics.
In the begining, I select HDFS as data lake storage. But I have a requirement for updates and deletes in data sources which I have to synchronise with data lake.
To understand the immutable nature of Data Lake I will consider LastModifiedDate from Data source to detect that this record is updated and insert this record in Data Lake with a current date. The idea is to select the record with max(date).
However, I am not able to understand how
I will detect deleted records from sources and what I will do with Data Lake?
Should I use other data storage like Cassandra and execute a delete command? I am afraid it will lose the immutable property.
can you please suggest me good practice for this situation?
1. Question - Detecting deleted records from datasources
Detecting deleted records from data sources, requires that your data sources supports this. Best is that deletion is only done logically, e. g. with a change flag. For some databases it is possible to track also deleted rows (see for example for SQL-Server). Also some ETL solutions like Informatica offer CDC (Changed Data Capture) capabilities.
2. Question - Changed data handling in a big data solution
There are different approaches. Of cause you can use a key value store adding some kind of complexity to the overall solution. First you have to clarify, if it is also of interest to track changes and deletes. You could consider loading all data (new/changed/deleted) into daily partitions and finally build an actual image (data as it is in your data source). Also consider solutions like Databricks Delta addressing this topics, without the need of an additional store. For example you are able to do an upsert on parquet files with delta as follows:
MERGE INTO events
USING updates
ON events.eventId = updates.eventId
WHEN MATCHED THEN
UPDATE SET
events.data = updates.data
WHEN NOT MATCHED
THEN INSERT (date, eventId, data) VALUES (date, eventId, data)
If your solution also requires low latency access via a key (e. g. to support an API) then a key-values store like HBase, Cassandra, etc. would be helpfull.
Usually this is always a constraint while creating datalake in Hadoop, one can't just update or delete records in it. There is one approach that you can try is
When you are adding lastModifiedDate, you can also add one more column naming status. If a record is deleted, mark the status as Deleted. So the next time, when you want to query the latest active records, you will be able to filter it out.
You can also use cassandra or Hbase (any nosql database), if you are performing ACID operations on a daily basis. If not, first approach would be your ideal choice for creating datalake in Hadoop

Cassandra get backup of expired data

I have a typically very huge amount of data on one of my Cassandra table. I want to keep only last two months of data in my table for that i used TTL of two month on every data. But now i want to keep expired data as a backup for later use case. Please suggest me what should i do to take backup?
Data in Cassandra is stored as files on the disk. You could just copy those files off your production machines onto whatever storage medium you would like to restore them later. You can follow the link below to see how you would do this:
https://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_backup_restore_c.html

Resources