ADF Delta Copy Activity fails, delete partly copied file from data lake - azure

I have an hourly delta-load pipeline in which a Copy activity inside a ForEach copies data from SQL Server to the data lake in Parquet format, using the folder structure YY/MM/DD/table_nameHH.
After the Copy activity, a success/failure stored procedure updates records in a control table.
I need to ensure that if any Copy activity fails partway through, the partially copied file is removed from the data lake. How do I add that condition to pick up and delete that dynamic file in my pipeline?
Thanks in advance.
P.S. I am pretty new to this tool and learning daily.

Please refer to the steps below:
You could add an If Condition activity after the Copy_DimCustomer_AC activity.
In the If Condition expression, check whether the Copy_DimCustomer_AC output's executionDetails status equals "Succeeded"; if true, the copy activity succeeded:
@equals(activity('Copy_DimCustomer_AC').output.executionDetails[0].status,'Succeeded')
True activities:
False activities: add a Delete activity to remove the partly copied file, then run Log_failure_status_AC:
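For the Delete activity's dataset, you can point it at the same dynamic path the copy sink writes to. A minimal sketch of the dynamic content, assuming the ForEach item exposes a table_name property, the sink follows the YY/MM/DD/table_nameHH pattern from the question, and the files carry a .parquet extension (in practice, capture the timestamp in a pipeline variable before the Copy activity so the delete targets exactly the same file even if the run crosses an hour boundary):

Folder path: @formatDateTime(utcnow(), 'yy/MM/dd')
File name: @concat(item().table_name, formatDateTime(utcnow(), 'HH'), '.parquet')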
HTH.

Related

Reading folders with Date format (YYYY-MM) using Azure Data Factory

I have a few folders inside the data lake (for example, in a Test1 container) that are created every month in the format YYYY-MM (for example, 2022-11), and inside each folder there is a set of data files. I want to copy these data files to different folders in the data lake.
In the next month a new folder is created in the same container (2022-12, then 2023-01, and so on), and I want to copy the files inside these folders to a different data lake folder every month.
How can I achieve this?
The solution is described in this thread: Create a folder based on date (YYYY-MM) using Data Factory?
Follow the Sink Dataset section and Copy Sink section, remove the sinkfilename parameter from the dataset, and use this dataset as the source in the copy activity.
It worked for me.
Alternative approach for reading folders with the date format YYYY-MM:
I reproduced the same scenario in my environment with a copy activity.
Open the sink dataset and create a parameter named Folder.
Go to Connection and add this dynamic content: @dataset().Folder
For the parameter value you can add this dynamic content:
@formatDateTime(utcnow(), 'yyyy/MM')
or
@concat(formatDateTime(utcnow(), 'yyyy'), '/', formatDateTime(utcnow(), 'MM'))
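As a minimal sketch, the dataset JSON behind these steps would look roughly like the following; the container name container1, the Parquet format, and the linked service name are assumptions, and the key point is that folderPath references the Folder parameter:

{
    "name": "SinkDataset",
    "properties": {
        "type": "Parquet",
        "linkedServiceName": {
            "referenceName": "AzureDataLakeStorage1",
            "type": "LinkedServiceReference"
        },
        "parameters": { "Folder": { "type": "string" } },
        "typeProperties": {
            "location": {
                "type": "AzureBlobFSLocation",
                "fileSystem": "container1",
                "folderPath": { "value": "@dataset().Folder", "type": "Expression" }
            }
        }
    }
}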
The pipeline executed successfully and produced the expected output:

Create a folder based on date (YYYY-MM) using Data Factory?

I have a set of monthly files dropping into my data lake folder, and I want to copy them to a different folder in the data lake. While copying the data to the target folder, I want to create a folder in the format YYYY-MM (for example, 2022-11) and copy the files into it.
In the next month I will get a new set of data and I want to copy it to the 2022-12 folder, and so on.
I want to run the pipeline every month because we get a monthly load of data.
As you want to copy only the new files using ADF every month, this can be done in two ways.
The first is to use a storage event trigger.
Demo:
I have created pipeline parameters like the ones below for the new file names.
Next, create a storage event trigger and pass @triggerBody().fileName to the pipeline parameters.
Parameters:
Here I have used two parameters for clarity; if you want, you can do it with a single pipeline parameter as well.
Source dataset with a dataset parameter for the file name:
Sink dataset with dataset parameters for the folder name and file name:
Copy source:
Copy sink:
Expression for the folder name: @formatDateTime(utcnow(),'yyyy-MM')
The file was copied to the required folder successfully when I uploaded it to the source folder.
So every time a new file is uploaded to your folder, it gets copied to the required folder. If you don't want the source file to remain after the copy, use a Delete activity to remove it after the copy activity.
NOTE: Make sure you publish all the changes before triggering the pipeline.
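As a rough sketch, the copy sink's dataset parameters in this first method could be wired up like this (the parameter names FolderName and FileName and the pipeline parameter fileName are assumptions):

FolderName: @formatDateTime(utcnow(),'yyyy-MM')
FileName: @pipeline().parameters.fileName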
The second method uses a Get Metadata activity and a ForEach with a copy activity inside it.
Use a schedule trigger that runs every month for this.
First, use Get Metadata (with another source dataset whose path goes only up to the folder) to get the child items, and in the activity's Filter by last modified setting give your month's starting date in UTC (use the dynamic content utcnow() and formatDateTime() to build the correct format).
You will now get the array of child items whose last modified date is in this month. Pass this array to the ForEach and use a copy activity inside it.
In the copy activity source, use a dataset parameter for the file name (same as above) and give @item().name.
In the copy activity sink, use two dataset parameters, one for the folder name and another for the file name.
For the folder name give the same yyyy-MM dynamic content as above, and for the file name give @item().name; the expressions are sketched below.
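A rough sketch of the expressions for this second method, built only from utcnow(), formatDateTime(), and concat() as described above (the activity name Get Metadata1 is an assumption):

Get Metadata, Filter by last modified (start time, UTC): @concat(formatDateTime(utcnow(),'yyyy-MM'),'-01T00:00:00Z')
ForEach items: @activity('Get Metadata1').output.childItems
Copy source file name: @item().name
Copy sink folder name: @formatDateTime(utcnow(),'yyyy-MM')
Copy sink file name: @item().name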

Skip null rows while reading Azure Data Factory

I am new to Azure Data Factory, and I currently have the following setup for a pipeline.
Azure Data Factory Pipeline
Inside the ForEach:
The pipeline does the following:
Reads files from a directory every day
Filters the children in the directory based on file type (only selects TSV files)
Iterates over each file and copies the data to Azure Data Explorer if it has the correct schema, which I have defined in the mapping for the copy activity.
The copied files are then moved to a different directory and deleted from the original directory so that they aren't copied again.
[Question]: I want to delete or skip the rows which have a null value in any one of the attributes.
I was looking into using a data flow, but I am not sure how to use data flows to read multiple TSV files and validate their schema before applying transformations to delete the null records.
Please let me know if there is a solution where I can skip the null values in the ForEach loop, or if I can use a data flow to do the same.
If I can use a data flow, how do I read multiple files and validate their column names (schema) before applying row transformations?
Any suggestions that would help me delete or skip those null values would be hugely helpful.
Thanks!
OK, inside the ForEach activity you only need to add a Data flow activity.
The main idea is to apply the filter/assert step and then write to multiple sinks.
ADF data flow:
Source:
Add your TSV files as the source, and make sure to select After completion -> Delete source files; this saves you from adding a separate Delete activity.
Filter transformation:
Now it depends on your use case: do you want to filter out rows with null values, or do you want to validate that you don't have any null values?
If you want to filter, just add a Filter transformation and, under Filter settings -> Filter on, add your condition.
If you need to validate rows and make the data flow fail, use the Assert transformation.
Filter condition: !isNull(columnName)
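If the check has to cover several attributes, the same data flow expression extends with logical operators; the column names here are placeholders, and the same expression can also be used as an Expect true condition in an Assert transformation:

!isNull(col1) && !isNull(col2) && !isNull(col3)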
Sink:
I added two sinks, one for Azure Data Explorer and one for the new directory.
You can read more about it here:
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-assert
https://learn.microsoft.com/en-us/azure/data-factory/data-flow-filter
https://microsoft-bitools.blogspot.com/2019/05/azure-incremental-load-using-adf-data.html
Please also consider the incremental load and change the data flow accordingly.

How to parse each row of an excel using Azure Data Factory

Here is my requirement:
I have an Excel file with a few columns and a few rows of data in it.
I have uploaded this Excel file to Azure Blob storage.
Using ADF, I need to read this Excel file, parse its records one by one, and create dynamic folders in Azure Blob storage.
This needs to be done for each and every record present in the Excel file.
Each record in the Excel file has some information that will help me create the folders dynamically.
Could someone help me choose the right set of activities or data flow in ADF to do this work?
Thanks in advance!
This is my Excel file as the source.
I have created folders in Blob storage based on the Country column.
I have selected a Data flow activity.
As shown in the screenshot below, go to the Optimize tab of the sink configuration.
Select the Partition option Set partitioning.
Set the Partition type to Key.
Set Unique value per partition to the Country column.
Now run the pipeline.
Expected output:
Inside these folders you will get files with the corresponding data.

Copy Blob Data To Sql Database in Azure Data Factory with Conditions

I am running a trigger-based pipeline to copy data from Blob storage to a SQL database. Each blob file contains a bunch of JSON documents, of which I need to copy only a few, and I can differentiate them by a key-value pair present in every JSON document.
So how do I filter the JSON documents containing that value for a common key?
One blob file looks like this. While the copy activity is happening, it should filter the data according to the Event-Name: "..." value.
Data Factory in general only moves data, it doesn't modify it. What you are trying to do might be done using a staging table in the sink SQL database.
You should first load the JSON values as-is from Blob storage into the staging table, then copy them from the staging table to the real table where you need them, applying your filter logic in the SQL command used to extract them.
Remember that SQL databases have built-in functions to handle JSON values: https://learn.microsoft.com/en-us/sql/relational-databases/json/json-data-sql-server?view=sql-server-2017
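As a hedged sketch of that extraction step, assuming a staging table dbo.StagingEvents with the raw JSON in an nvarchar(max) column JsonDoc, a target table dbo.Events, and a filter on an Event-Name property (all table, column, and value names are placeholders):

INSERT INTO dbo.Events (EventName, Payload)
SELECT JSON_VALUE(JsonDoc, '$."Event-Name"') AS EventName,
       JsonDoc AS Payload
FROM dbo.StagingEvents
WHERE JSON_VALUE(JsonDoc, '$."Event-Name"') = 'YourEvent';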
Hope this helped!
At this time we do not have an option for the copy activity to filter the content (with the exception of a SQL source).
In your scenario it looks like you already know which values need to be omitted. One way to go is to add a Stored Procedure activity after the copy activity that simply deletes the values you don't want from the table; this should be easy to implement, but depending on the volume of data it may lead to performance issues. The other option is to have the JSON file cleaned on the storage side before it is ingested.
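A hedged sketch of what that cleanup step could run, assuming the copied rows land in a table dbo.Events with the raw JSON in a Payload column (the names and the kept event value are placeholders):

DELETE FROM dbo.Events
WHERE JSON_VALUE(Payload, '$."Event-Name"') <> 'YourEvent';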
