How to handle Incremental & Full Upload in an Azure Data Factory - azure

We have an Azure Storage Account with 2 blob stores: a Full and an Inc.
In the Full we place the full upload CSV files whenever a Full Upload is needed; in the Inc we just place small day-by-day incremental CSV files.
We load all our data first into a staging area, then to the ODS and finally to an EDW (Enterprise DW).
A full upload is only needed when there are structural changes to the tables.
Basically the only difference between the two uploads is that the full also clears all data in the ODS and the EDW, but runs the same stored procedures in the pipelines, ...
Does anybody have tips on how to handle such a situation in an Azure Data Factory?
I would prefer not to double the data factories, but due to the different availability/frequency of the output datasets I can't use the same logical staging table (in the data factory) as output dataset ....
So any hint(s) are appreciated ...

First of all, to be clear, ADF is just there to invoke other Azure services; it doesn't do any of the work itself. So the question really is: what services in Azure could you call from ADF to do this work and manage this situation?
To answer that...
Option 1: I would suggest you look at Azure Data Lake. I've written simple procedures to do what you've described above in USQL, where parameters can be passed to the USQL procedures from ADF for different types of behaviour.
The code you create can live in an Azure Data Lake Analytics database, similar to TSQL objects. Then maybe start using Azure Data Lake Storage as well, instead of normal blobs.
Option 2: Break out the C# and create yourself an Azure Data Factory custom activity and create a set of classes to do exactly what you require. Again with params passed by ADF or include logic in the methods to check the 'full' table contents. This will however involve a lot more development work and require an Azure Batch Service for the compute.
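
To illustrate the "one set of logic, behaviour chosen by a parameter" idea from both options, here is a minimal sketch in Python. It is not USQL or an ADF custom activity; it uses an in-memory SQLite database as a stand-in, and the table names (ods_sales, edw_sales), the columns, and the run_load function are all hypothetical. The only difference between the two modes is the initial clear of the ODS and EDW, as described in the question.

```python
import sqlite3

# Hypothetical table names; the real ODS/EDW objects are not given in the question.
ODS_TABLE = "ods_sales"
EDW_TABLE = "edw_sales"


def create_schema(conn):
    """Create the two stand-in tables."""
    conn.execute(f"CREATE TABLE {ODS_TABLE} (id INTEGER PRIMARY KEY, amount REAL)")
    conn.execute(f"CREATE TABLE {EDW_TABLE} (id INTEGER PRIMARY KEY, amount REAL)")


def run_load(conn, staged_rows, mode):
    """Run one load. mode is 'full' or 'inc' and would be the parameter passed by ADF.

    Returns the number of rows in the EDW table after the load.
    """
    if mode not in ("full", "inc"):
        raise ValueError(f"unknown load mode: {mode!r}")
    cur = conn.cursor()
    if mode == "full":
        # The only extra step of a full upload: clear the ODS and the EDW first.
        cur.execute(f"DELETE FROM {ODS_TABLE}")
        cur.execute(f"DELETE FROM {EDW_TABLE}")
    # From here on both modes share the same logic; these two statements
    # stand in for the stored procedures (staging -> ODS -> EDW).
    cur.executemany(
        f"INSERT OR REPLACE INTO {ODS_TABLE} (id, amount) VALUES (?, ?)",
        staged_rows,
    )
    cur.execute(
        f"INSERT OR REPLACE INTO {EDW_TABLE} (id, amount) "
        f"SELECT id, amount FROM {ODS_TABLE}"
    )
    conn.commit()
    return cur.execute(f"SELECT COUNT(*) FROM {EDW_TABLE}").fetchone()[0]
```

With this shape, the full and incremental pipelines could differ only in the value of the mode parameter and in which blob store feeds the staging rows.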

Related

Triggered copy data from blob to ADLS extracting path from filename

I am trying to centralize our data in an ADLS Gen2 data lake. One of our datasets is 'dumped' in a blob storage and I want to have a triggered copy.
The files that are stored in the blob storage are in JSON format and have a date as a filename (it can be an arbitrary date). What I want is that new files are (binary) copied to a folder on the data lake, with a path built from pieces of the date that is present in the filename.
2020-01-01.json → raw/blob/2020/01/raw_reports_blob_2020-01-01.json
First I tried a data copy job and a Pipeline in Azure Synapse but I am not sure how to set the sink path with details from source filename. It seems that the copy-data-tool cannot be triggered by new blob files. The pipeline method looks pretty powerful and I guess it is possible. What I want is not that difficult on Linux so I guess it must be possible in Azure as well.
Second, I tried to create an Azure Function as I am pretty comfortable with Python; however, here I have a similar problem as I need to define in/out bindings. The out bindings are defined at design time and do not give me the freedom to build the kind of path I want based on the source filename. Also, it feels somewhat overkill for a simple binary copy action. I can have the function triggered by new files in the blob storage, and reading them is no problem.
I am relatively new to Azure and any help towards a solution is more than welcome.
See this answer as well: https://stackoverflow.com/a/66393471/496289
There is no concept of "copy" per se in ADLS. You read/download from the source and write/upload to the target.
As someone mentioned Data Factory can do this.
You can also use:
azcopy from a PowerShell Azure Function. azcopy cp "https://[srcaccount].blob.core.windows.net/[container]/[path/to/blob]?[SAS]" "https://[destaccount].blob.core.windows.net/[container]/[path/to/blob]?[SAS]"
Python/Java/... Azure Function. You'll have to download the file (in chunks if it's big) and upload it (in chunks if big).
Databricks. This would be a similar misuse of a tool to using Azure Synapse Analytics to copy data between storage accounts.
Azure Logic Apps. See this and this. I have never used them, but I believe they need less code than an Azure Function and have some programming capabilities, if that helps you create the destination path programmatically.
Things to remember:
Data Factory can be expensive, especially compared to Azure Functions on the consumption plan.
Azure Functions on the consumption plan have a 10-minute maximum before they time out, so you can't use them if you have files in GBs/TBs.
You'll be paying egress costs if applicable.
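
For the Python Azure Function route, the destination path does not have to come from a design-time output binding; it can be computed from the source filename at run time and then used with a storage client for the upload. A minimal sketch of just the path logic follows. The function name destination_path and its root and prefix parameters are assumptions for illustration; the mapping itself follows the example in the question.

```python
from datetime import datetime
from pathlib import PurePosixPath


def destination_path(source_name, root="raw/blob", prefix="raw_reports_blob_"):
    """Map '2020-01-01.json' to 'raw/blob/2020/01/raw_reports_blob_2020-01-01.json'."""
    # Drop any container or folder part of the incoming blob name.
    name = PurePosixPath(source_name).name
    stem = PurePosixPath(name).stem
    # strptime raises ValueError if the stem does not parse as a year-month-day date.
    day = datetime.strptime(stem, "%Y-%m-%d")
    return f"{root}/{day:%Y}/{day:%m}/{prefix}{name}"
```

Inside a blob-triggered function, the returned string would be the target path passed to whichever upload call is used; files whose names are not dates raise a ValueError and can be skipped or logged.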

Is there a simple way to ETL from Azure Blob Storage to Snowflake EDW?

I have the following ETL requirements for Snowflake on Azure and would like to implement the simplest possible solution because of timeline and technology constraints.
Requirements :
Load CSV data (only a few MBs) from Azure Blob Storage into Snowflake Warehouse daily into a staging table.
Transform the loaded data above within Snowflake itself where transformation is limited to just a few joins and aggregations to obtain a few measures. And finally, park this data into our final tables in a Datamart within the same Snowflake DB.
Lastly, automate the above pipeline using a schedule OR using an event based trigger (i.e. steps to kick in as soon as file lands in Blob Store).
Constraints :
We cannot use Azure Data Factory to achieve this simplest design.
We cannot use Azure Functions to deploy Python Transformation scripts and schedule them either.
Also, I found that transformation using Snowflake SQL is a limited feature: it only allows certain things as part of the COPY INTO command and does not support JOINs and GROUP BY. Furthermore, although the following thread suggests that scheduling SQL is possible, that doesn't address my transformation requirement.
Regards,
Roy
Attaching the following Idea diagram for more clarity.
https://community.snowflake.com/s/question/0D50Z00009Z3O7hSAF/how-to-schedule-jobs-from-azure-cloud-for-loading-data-from-blobscheduling-snowflake-scripts-since-dont-have-cost-for-etl-tool-purchase-for-scheduling
https://docs.snowflake.com/en/user-guide/data-load-transform.html#:~:text=Snowflake%20supports%20transforming%20data%20while,columns%20during%20a%20data%20load.
You can create a Snowpipe on Azure Blob Storage. Once the Snowpipe is created on top of your Azure Blob Storage, it will monitor the container, and each file will be loaded into your staging table as soon as it arrives. After the data has been copied into the staging table, you can schedule the transformation SQL using a Snowflake task.
You can refer to the Snowpipe creation steps for Azure Blob Storage at the link below:
Snowpipe on Microsoft Azure Blob Storage
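
As a rough sketch of the two pieces described above, the Python snippet below only assembles the Snowflake SQL statements as strings; it does not connect to Snowflake. Every object name in it (staging_pipe, staging_table, datamart_table, customers, the columns, the stage, the notification integration, the warehouse, and the cron schedule) is a hypothetical placeholder. The point it illustrates is that the JOIN and GROUP BY live in the task's SQL, not in the COPY INTO of the pipe.

```python
def snowflake_statements(stage="my_azure_stage",
                         integration="MY_NOTIFICATION_INT",
                         warehouse="my_wh"):
    """Return the SQL statements for a pipe plus a scheduled transform task.

    All object names are placeholders, not taken from the question.
    """
    # Snowpipe: loads new files from the Azure stage into the staging table.
    create_pipe = f"""
CREATE PIPE staging_pipe
  AUTO_INGEST = TRUE
  INTEGRATION = '{integration}'
  AS COPY INTO staging_table
     FROM @{stage}
     FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
""".strip()
    # Task: runs ordinary SQL on a schedule, so joins and aggregations are allowed.
    create_task = f"""
CREATE TASK transform_task
  WAREHOUSE = {warehouse}
  SCHEDULE = 'USING CRON 0 6 * * * UTC'
  AS INSERT INTO datamart_table (customer_id, total_amount)
     SELECT s.customer_id, SUM(s.amount)
     FROM staging_table s
     JOIN customers c ON c.customer_id = s.customer_id
     GROUP BY s.customer_id
""".strip()
    # Tasks are created suspended and have to be resumed before they run.
    resume_task = "ALTER TASK transform_task RESUME"
    return [create_pipe, create_task, resume_task]
```

The statements could be run once from a Snowflake worksheet or any SQL client; after that, loading and transformation are scheduled inside Snowflake itself, which fits the constraint of not using Data Factory or Azure Functions.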

Staging or landing on Azure

I am performing ETL in Azure Data Factory and I just wanted to confirm my understanding of it before going further. Please find the image attached below.
I am collecting data from multiple sources and storing it in Azure Blob Storage, then performing the transformation and loading. What I am confused about is whether Azure Blob Storage is a landing or staging area here in my case. Some people use these terms interchangeably, and I couldn't understand the fine line between the two terms.
Also, can anyone explain to me which part is Extract, which is Transform and which is Load? In my understanding, collecting the data from multiple sources and storing it in Azure Blob Storage is Extracting, Azure Data Factory is Transformation, and copying the transformed data into Azure Database is Loading. Am I correct, or is there something I am misunderstanding here?
"What I am confused about is that whether Azure Blob Storage is a landing or staging area here in my case."
In your case, Azure Blob Storage is both the landing area and the staging area. A landing area is an area that collects data from different places. A staging area holds data only for a short time; staging data should be deleted during the ETL process.
"Also, can anyone explain me which part is Extract, Transform and Load is."
Copy Activity is a typical ETL-style technology. Looking only at the Copy Activity in Azure Data Factory: reading from the copy source you specify is the 'Extract', writing the data to the specified sink according to your settings is the 'Load', and the details of the copy behavior in between are the 'Transform'. Looking at your entire process, collecting data into Blob Storage is also 'Extract'.

azure data factory dataset cleanup

Can anyone tell me how Azure Data Factory datasets are cleaned up (removed, deleted, etc.)? Is there any policy or setting to control it?
From what I can tell, all the time-series slices of datasets are left behind intact.
Say I want to develop an activity which overwrites data daily in the destination folder in Azure Blob or Data Lake storage (for example, one which is mapped to an external table in Azure Data Warehouse and is a full data extract). How can I achieve this with just a copy activity? Should I add a custom .NET activity to clean up the no-longer-needed datasets myself?
Yes, you would need a custom activity to delete pipeline output.
You could have pipeline activities that overwrite the same output, but you must be careful to understand how the ADF slicing model and activity dependencies work, so that anything using the output gets a clear and repeatable set.

Azure Data Sync - Copy Each SQL Row to Blob

I'm trying to understand the best way to migrate a large set of data - ~ 6M text rows from (an Azure Hosted) SQL Server to Blob storage.
For the most part, these records are archived records, and are rarely accessed - blob storage made sense as a place to hold these.
I have had a look at Azure Data Factory and it seems to be the right option, but I am unsure whether it fulfils the requirements.
Simply put, the scenario is: for each row in the table, I want to create a blob with the contents of 1 column from that row.
I see the tutorial (i.e. https://learn.microsoft.com/en-us/azure/data-factory/data-factory-copy-activity-tutorial-using-azure-portal) is good at explaining a bulk-to-bulk data pipeline, but I would like to migrate from a bulk dataset to many separate items.
Hope that makes sense and someone can help?
As of now, Azure Data Factory does not have anything built in like the For Each loop in SSIS. You could use a custom .NET activity to do this, but it would require a lot of custom code.
I would ask, if you were transferring this to another database, would you create 6 million tables all with the same structure? What is to be gained by having the separate items?
Another alternative might be converting it to JSON which would be easy using Data Factory. Here is an example I did recently moving data into DocumentDB.
Copy From OnPrem SQL server to DocumentDB using custom activity in ADF Pipeline
A further option is SSIS 2016 with the Azure Feature Pack, which provides Azure tasks such as the Azure Blob Upload Task and the Azure Blob Destination. You might be better off using this; an OLE DB command, or the For Each loop with an Azure Blob destination, could be another way to do it.
Good luck!
Azure Data Factory has a ForEach activity, which can be placed after a Lookup or Get Metadata activity to write each row from SQL to a blob.
ForEach

Resources