We have a requirement where we receive CSV files in a blob storage container, and we have logic that matches the CSV files based on file name and joins the records within them (similar to a SQL join operation). These files are direct dumps from DB tables. For instance, for an Employee entity we receive two files: one containing the Employee information and another containing related Employee details. In the DB this corresponds to two tables, of which we receive direct dumps.
In addition, we need to compare the current batch (again joining the files based on file name and the contained records) with the previous batch to calculate the deltas, i.e. which records have been Added/Updated/Deleted between batches.
We then store the outcome (delta records) in a separate storage account for further processing.
As it stands, we perform this logic in a Function App, but we are considering moving the delta processing to Azure Data Factory, i.e. using ADF to match the CSV files, join the records, and perform the batch comparison that produces the delta records.
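To make the delta logic concrete, below is a minimal pandas sketch of the kind of join-and-diff we need. The file paths, the employee_id key and the column handling are illustrative assumptions only, not our actual Function App implementation.

```python
import pandas as pd

# Hypothetical file names and key column, for illustration only.
# Join the two related CSV dumps for the current batch (like a SQL join).
emp = pd.read_csv("current/employee.csv")
details = pd.read_csv("current/employee_details.csv")
current = emp.merge(details, on="employee_id", how="left")

# Do the same for the previously received batch.
prev_emp = pd.read_csv("previous/employee.csv")
prev_details = pd.read_csv("previous/employee_details.csv")
previous = prev_emp.merge(prev_details, on="employee_id", how="left")

# Full outer join on the business key, then classify each row.
merged = current.merge(previous, on="employee_id", how="outer",
                       suffixes=("_cur", "_prev"), indicator=True)

added = merged[merged["_merge"] == "left_only"]      # only in the current batch
deleted = merged[merged["_merge"] == "right_only"]   # only in the previous batch
both = merged[merged["_merge"] == "both"]

# "Updated" = present in both batches but with at least one differing column.
cur_cols = [c for c in merged.columns if c.endswith("_cur")]
prev_cols = [c.replace("_cur", "_prev") for c in cur_cols]
changed = (both[cur_cols].fillna("").values != both[prev_cols].fillna("").values).any(axis=1)
updated = both[changed]
```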
We don’t have any control over how the source system sends us the data.
I’m looking for recommendations on the viability of using ADF (or alternatives).
Appreciate any pointers, thoughts and recommendations.
Cheers.
I am using Data Factory to copy a collection from Mongo Atlas to ADLS Gen2.
By default, Data Factory creates one JSON file per collection, which leaves me with one huge JSON file.
I checked data flows and transformations, but they work on files that are already present in ADLS. Is there a way to split the data as it comes into ADLS, rather than first getting a huge file and then post-processing it into smaller files?
If the collection size is 5 GB, is it possible for Data Factory to split it into chunks of 100 MB as the copy runs?
I would suggest using the Partition option on the sink's Optimize tab.
Refer - https://learn.microsoft.com/en-us/azure/data-factory/concepts-data-flow-performance#optimize-tab
I followed the example below, and all is going well.
https://learn.microsoft.com/en-gb/azure/data-factory/tutorial-data-flow
The tutorial says the following about the output files and rows:
"If you followed this tutorial correctly, you should have written 83 rows and 2 columns into your sink folder."
My run produced the same number of rows and columns, so that part is correct.
However, the output in the sink folder is 77 files in total, not 83 and not 1.
Question: Is it correct to have so many CSV files (77 items)?
Question: How to combine all files into one file without slowing down the process?
I can create one file by following the link below, but it warns that doing so slows down the process.
How to remove extra files when sinking CSV files to Azure Data Lake Gen2 with Azure Data Factory data flow?
The number of files generated from the process is dependent upon a number of factors. If you've set the default partitioning in the optimize tab on your sink, that will tell ADF to use Spark's current partitioning mode, which will be based on the number of cores available on the worker nodes. So the number of files will vary based upon how your data is distributed across the workers. You can manually set the number of partitions in the sink's optimize tab. Or, if you wish to name a single output file, you can do that, but it will result in Spark coalescing to a single partition, which is why you see that warning. You may find it takes a little longer to write that file because Spark has to coalesce existing partitions. But that is the nature of a big data distributed processing cluster.
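If it helps to see the same behaviour in code: ADF data flows execute on Spark, and the partitioning choices in the Optimize tab correspond roughly to the PySpark sketch below (paths and partition counts are placeholders).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Placeholder input: in ADF this is whatever your source transformation reads.
df = spark.read.csv("input/data.csv", header=True)

# "Use current partitioning": one output file per Spark partition, so the
# file count depends on how the data happens to be spread across workers.
df.write.mode("overwrite").csv("output/current_partitioning", header=True)

# Setting a fixed number of partitions gives a predictable file count.
df.repartition(10).write.mode("overwrite").csv("output/ten_files", header=True)

# Single output file: Spark has to coalesce everything onto one partition
# before writing, which is why ADF warns this can be slower.
df.coalesce(1).write.mode("overwrite").csv("output/single_file", header=True)
```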
I am planning to leverage AWS Glue for incremental data processing. On an hourly schedule, a trigger will invoke a Glue Crawler and a Glue ETL job, which load the incremental data into the catalog and process the incremental files through ETL. That looks pretty straightforward, but with this I ran into a couple of issues.
Let's say we have data being streamed for various tables and various databases into S3 locations, and we want to create databases and tables based on the landing data.
eg: s3://landingbucket/database1/table1/YYYYMMDDHH/some_incremental_files.json
s3://landingbucket/database1/table2/YYYYMMDDHH/some_incremental_files.json
s3://landingbucket/database1/somedata/tablex/YYYYMMDDHH/some_incremental_files.json
s3://landingbucket/database2/table1/YYYYMMDDHH/some_incremental_files.json
s3://landingbucket/datasource_external/data/table1/YYYYMMDDHH/some_incremental_files.json
With the data landing in the above S3 structure, we want to create a Glue catalog for these databases and tables with a limited number of crawlers. Right now we end up with as many crawlers as databases.
Note: we have a crawler for database1 and it creates tables under database1, which is good and as expected. But there is an exception, "somedata", in database1 whose structure does not follow the standard layout of the other tables; for it the crawler created a table called somedata with partitions "partition_0=tablex" and "partition_1=YYYYMMDDHH". Is there a better way to handle this with fewer crawlers than one per database?
Glue ETL presents a similar challenge. We want to convert the incoming data to a standard Parquet format, with one bucket per database and the tables sitting under it. Because the data volume is huge, we don't want one table with the database and data as partitions, so that we don't run into S3 slowdown issues on the incoming load. Many teams will be querying this data, so we also don't want S3 slowdown hitting their analytics jobs.
Instead of having one ETL job per table per database, is there a way to handle this with a limited number of jobs? As new tables arrive, the ETL job should transform their JSON data into the formatted zone, with both the input data and the output path handled dynamically rather than hardcoded. A rough sketch of what I have in mind is below.
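Something along these lines, where the bucket layout, parameter names and partition column are all just assumptions for illustration:

```python
import sys

from awsglue.context import GlueContext
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import lit

# Hypothetical job parameters passed in by the hourly trigger:
# --database, --table and --hour identify the incremental slice to process.
args = getResolvedOptions(sys.argv, ["JOB_NAME", "database", "table", "hour"])

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session

# Input and output paths are derived from the parameters instead of being
# hardcoded, so one job definition can serve any database/table that lands.
input_path = f"s3://landingbucket/{args['database']}/{args['table']}/{args['hour']}/"
output_path = f"s3://{args['database']}-formatted/{args['table']}/"

# Read the raw JSON slice and rewrite it as Parquet, partitioned by landing hour.
df = spark.read.json(input_path)
(df.withColumn("landing_hour", lit(args["hour"]))
   .write.mode("append")
   .partitionBy("landing_hour")
   .parquet(output_path))
```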
Open for any better idea!
Thanks,
Krish!
I have a requirement to regularly update an existing set of 30+ CSV files with new data (append to the end). There is also a requirement to possibly remove the first X rows as Y rows are added to the end.
Am I using the correct services for this and in the correct manner?
Azure Blob Storage to store the Existing and Update files.
Azure Data Factory with Data Flows. A pipeline and data flow per CSV I want to transform, which merges the datasets (existing + update) and produces a sink fileset that drops the new combined CSV back into Blob Storage.
A trigger on the Blob Storage Updates directory to trigger the pipeline when a new update file is uploaded.
Questions:
Is this the best approach for this problem? I need a solution with minimal input from users (I'll take care of the Azure ops, so long as all they have to do is upload a file and download the new one).
Do I need a pipeline and data flow per CSV file? Or could I have one per transformation type (i.e. one for just appending, another for appending and removing the first X rows)?
I was going to create a directory in blob storage for each of the CSVs (30+ directories) and create a dataset for each directory's existing and update files.
Then create a dataset for each output file in some new/ directory.
Depending on the size of your CSVs, you can either perform the append right inside the data flow by taking both the new data and the existing CSV file as sources and then using a Union to combine the two files into a new one.
Or, with larger files, use the Copy activity's "merge files" setting to merge the two files together.
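Not ADF syntax, but just to pin down what the transformation itself has to produce, here is a small pandas sketch of the append-and-trim; the file names and the number of rows to drop are hypothetical.

```python
import pandas as pd

# Hypothetical paths: one dataset's existing CSV and its update file.
existing = pd.read_csv("existing/dataset01.csv")
update = pd.read_csv("updates/dataset01.csv")

# Append the update to the end of the existing data (the Union step in a data flow).
combined = pd.concat([existing, update], ignore_index=True)

# Optionally remove the first X rows as Y rows are appended.
# X is a made-up retention rule here; the question leaves it open.
rows_to_drop = 100
combined = combined.iloc[rows_to_drop:].reset_index(drop=True)

# Write the new combined file back out (in ADF, the sink dataset).
combined.to_csv("output/dataset01.csv", index=False)
```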
I'm designing Data Factory pipelines to load data from Azure SQL DB into Azure Data Lake.
My initial load/POC used a small subset of data, and I was able to load it from the SQL tables to Azure Data Lake.
Now there is a huge volume of tables (some with a billion+ rows) that I want to load from SQL DB to Azure Data Lake using Data Factory.
MS docs mentioned two options, i.e. watermark columns and change tracking.
Let's say I have a "cust_transaction" table with millions of rows; if I load it to the data lake, it lands as "cust_transaction.txt".
Questions.
1) What would an optimal design to incrementally load the source data from SQL DB into that file in the data lake?
2) How do I split or partition the files into smaller files?
3) How should I merge and load the deltas from source data into the files?
Thanks.
You will want multiple files. Typically, my data lakes have multiple zones. The first zone is Raw. It contains a copy of the source data organized into entity/year/month/day folders where entity is a table in your SQL DB. Typically, those files are incremental loads. Each incremental load for an entity has a file name similar to Entity_YYYYMMDDHHMMSS.txt (and maybe even more info than that) rather than just Entity.txt. And the timestamp in the file name is the end of the incremental slice (max possible insert or update time in the data) rather than just current time wherever possible (sometimes they are relatively the same and it doesn't matter, but I tend to get a consistent incremental slice end time for all tables in my batch). You can achieve the date folders and timestamp in the file name by parameterizing the folder and file in the dataset.
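Just to illustrate the convention (in ADF you would express this with dataset parameters and expressions rather than code), here is a small Python sketch of the Raw-zone path for one incremental load; the folder and file names are examples, not a standard.

```python
from datetime import datetime

def raw_zone_path(entity: str, slice_end: datetime) -> str:
    """Build the Raw-zone path for one incremental load of one entity.

    slice_end is the end of the incremental slice (max possible insert or
    update time in the data), not simply the current time.
    """
    return (
        f"raw/{entity}/"
        f"{slice_end:%Y}/{slice_end:%m}/{slice_end:%d}/"
        f"{entity}_{slice_end:%Y%m%d%H%M%S}.txt"
    )

# Example: an incremental slice of cust_transaction ending 2023-05-01 04:00:00.
print(raw_zone_path("cust_transaction", datetime(2023, 5, 1, 4, 0, 0)))
# raw/cust_transaction/2023/05/01/cust_transaction_20230501040000.txt
```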
Melissa Coates has two good articles on Azure Data Lake: Zones in a Data Lake and Data Lake Use Cases and Planning. Her naming conventions are a bit different than mine, but both of us would tell you to just be consistent. I would land the incremental load file in Raw first. It should reflect the incremental data as it was loaded from the source. If you need to have a merged version, that can be done with Data Factory or U-SQL (or your tool of choice) and landed in the Standardized Raw zone. There are some performance issues with small files in a data lake, so consolidation could be good, but it all depends on what you plan to do with the data after you land it there. Most users would not access data in the RAW zone, instead using data from Standardized Raw or Curated Zones. Also, I want Raw to be an immutable archive from which I could regenerate data in other zones, so I tend to leave it in the files as it landed. But if you found you needed to consolidate there, that would be fine.
Change tracking is a reliable way to get changes, but I don't like their naming conventions/file organization in their example. I would make sure your file name has the entity name and a timestamp on it. They have Incremental - [PipelineRunID]. I would prefer [Entity]_[YYYYMMDDHHMMSS]_[TriggerID].txt (or leave the run ID off) because it is more informative to others. I also tend to use the Trigger ID rather than the pipeline RunID. The Trigger ID is across all the packages executed in that trigger instance (batch) whereas the pipeline RunID is specific to that pipeline.
If you can't do the change tracking, the watermark is fine. I usually can't add change tracking to my sources and have to go with watermark. The issue is that you are trusting that the application's modified date is accurate. Are there ever times when a row is updated and the modified date is not changed? When a row is inserted, is the modified date also updated or would you have to check two columns to get all new and changed rows? These are the things we have to consider when we can't use change tracking.
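As a concrete example of the "check two columns" point, a watermark-based extraction often ends up looking like the sketch below. The table, column names and timestamps are illustrative, and in ADF this would typically be the parameterized source query of a Copy activity.

```python
# Illustrative watermark extraction; columns and boundaries are assumptions.
# The two boundary values would normally come from a stored watermark table.
last_watermark = "2023-05-01 04:00:00"  # end of the previous incremental slice
new_watermark = "2023-05-01 05:00:00"   # end of the current incremental slice

# Check both created and modified dates so new rows are not missed when the
# application does not set modified_date on insert.
query = f"""
SELECT *
FROM dbo.cust_transaction
WHERE (modified_date > '{last_watermark}' AND modified_date <= '{new_watermark}')
   OR (created_date  > '{last_watermark}' AND created_date  <= '{new_watermark}')
"""
print(query)
```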
To summarize:
Load incrementally and name your incremental files intelligently
If you need a current version of the table in the data lake, that is a separate file in your Standardized Raw or Curated Zone.