How to log errors from parallel sources in an Azure Data Factory data flow

I have to do some data engineering: read manifest.cdm.json files from a data lake,
add a pipeline run ID column, and push the result to a SQL database.
I have one JSON list file that holds the parameters required to read each CDM JSON file in the data flow source.
Previous approach: I used a ForEach activity and passed the parameters to a data flow inside it (one Data Flow activity), then captured the errors. But using a data flow inside ForEach costs too much.
Current approach: I manually created one data flow with all the CDM files as sources. But here I'm not able to capture errors: if any source fails, the whole Data Flow activity fails, and if I select "skip error" on the Data Flow activity I don't get any error details.
So what should the approach be to capture errors with the current setup?

You can capture the error using a Set Variable activity in Azure Data Factory.
Use the expression below in the Set Variable activity to capture the error message:
@activity('Data Flow1').Error.message
Later you can store the error message in blob storage for future reference using a Copy activity. In the example below, the error message is saved to a .csv file using a DelimitedText dataset.
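As a rough sketch of that Set Variable step (it assumes a String variable named errorMessage is declared on the pipeline and that the Data Flow activity is named Data Flow1; adjust to your own names), the activity is attached to the data flow's Failed dependency so it only runs when that data flow errors out:

```json
{
  "name": "Capture Data Flow1 error",
  "type": "SetVariable",
  "dependsOn": [
    {
      "activity": "Data Flow1",
      "dependencyConditions": [ "Failed" ]
    }
  ],
  "typeProperties": {
    "variableName": "errorMessage",
    "value": {
      "value": "@activity('Data Flow1').Error.message",
      "type": "Expression"
    }
  }
}
```

If you run several Data Flow activities in parallel, one such Set Variable (or an Append Variable writing into an array variable) on the Failed path of each activity lets every failing branch record its message while the rest of the pipeline keeps running, and the collected values can then be written out by the Copy activity.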

Related

Azure Data Factory: how to pass parameters from a trigger/pipeline into a data source

I need help. I've created a pipeline for data processing, which imports a CSV and copies the data to a DB. I've also configured a blob storage trigger, which triggers the pipeline with the data flow when a specific file is uploaded into a container. At the moment the trigger monitors one container, but I would like to make it more universal: monitor all containers in the desired storage account, so that if someone sends files, the pipeline is triggered. For that I need to pass the container name to the pipeline, to be used in the data source file path. So far I've created something like this:
In the pipeline, I've added this parameter @pipeline().parameters.sourceFolder:
Next, in the trigger, I've set this:
Now what should I set here to pass this folder path?
You need to use dataset parameters for this.
Like the folderpath parameter in the pipeline, create another pipeline parameter for the file name as well, and assign @triggerBody().folderPath and @triggerBody().fileName to them when creating the trigger.
Pipeline parameters:
Make sure you select all containers in the storage event trigger while creating the trigger.
Assigning trigger parameters to pipeline parameters:
Now, create two dataset parameters for the folder and file name like below.
Source dataset parameters:
Use these in the file path of the dataset's dynamic content.
If you use a Copy activity with this dataset, assign the pipeline parameter values (which come from the trigger parameters) to the dataset parameters like below.
If you use a data flow with the dataset, you can assign these in the Data Flow activity itself, after selecting the dataset as the source in the data flow.
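Put together, a hedged sketch of the trigger definition (names like ProcessCsvPipeline, sourceFolder, and sourceFile are illustrative, not from the original post) that maps the trigger outputs to the pipeline parameters could look like this:

```json
{
  "name": "BlobCreatedTrigger",
  "properties": {
    "type": "BlobEventsTrigger",
    "typeProperties": {
      "blobPathEndsWith": ".csv",
      "ignoreEmptyBlobs": true,
      "events": [ "Microsoft.Storage.BlobCreated" ],
      "scope": "/subscriptions/<subId>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "ProcessCsvPipeline",
          "type": "PipelineReference"
        },
        "parameters": {
          "sourceFolder": "@triggerBody().folderPath",
          "sourceFile": "@triggerBody().fileName"
        }
      }
    ]
  }
}
```

In the source dataset, the file path dynamic content would then reference @dataset().sourceFolder and @dataset().sourceFile, and the Copy or Data Flow activity assigns @pipeline().parameters.sourceFolder and @pipeline().parameters.sourceFile to those dataset parameters.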
Thank you Rakesh.
I need to process a few specific files from a package that will be sent to the container. Each time the user/application sends the same set of files, so in the trigger I check whether a new drive.xml file was sent to any container. This file defines the type of data that was sent, so when it arrives I know that new data files have been sent as well and will be present in a lower-level folder.
E.g. drive.xml was found in /container/data/somefolder/2022-01-22/drive.xml, and then I know that in /container/data/somefolder/2022-01-22/datafiles/ there are 3 files I need to process.
Therefore, in the parameters I need to pass only the folder path; the file names will always be the same.
The dataset configuration looks like this:
and the event trigger like this:
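One way this could be wired (a sketch under the assumption that the trigger filters on drive.xml and the pipeline receives a folderPath parameter) is to restrict the event trigger with blobPathEndsWith and pass only the folder:

```json
{
  "type": "BlobEventsTrigger",
  "typeProperties": {
    "blobPathEndsWith": "/drive.xml",
    "events": [ "Microsoft.Storage.BlobCreated" ],
    "scope": "/subscriptions/<subId>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>"
  }
}
```

with folderPath set to @triggerBody().folderPath in the trigger's parameter assignment. Since the data files always sit in a fixed subfolder with fixed names, the dataset's directory dynamic content can then be built with something like @concat(dataset().folderPath, '/datafiles'); exactly where the concat goes depends on how the container and directory are split in your dataset.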

How do I pull the last modified file with a data flow in Azure Data Factory?

I have files that are uploaded into an on-prem folder daily. From there, a pipeline pulls them into a blob storage container (input); from there, another pipeline runs from blob (input) to blob (output), and that is where the data flow sits, between those two blobs. Finally, the output is linked to SQL. However, I want the blob-to-blob pipeline to pull only the file that was uploaded that day and run it through the data flow. The way I have it set up, every time the pipeline runs, it doubles my files. I've attached images below.
[![Blob to Blob Pipeline][1]][1]
Please let me know if there is anything else that would make this more clear
[1]: https://i.stack.imgur.com/24Uky.png
I want the blob to blob pipeline to pull only the file that was uploaded that day and run through the dataflow.
To achieve the above scenario, you can filter by last modified date by passing the dynamic content below:
@startOfDay(utcnow()): takes the start of the day for the current timestamp.
@utcnow(): takes the current timestamp.
Input and output of the Get Metadata activity (it is filtering files for that day only):
If there are multiple files for a particular day, then you have to use a ForEach activity and pass the output of the Get Metadata activity to it as:
@activity('Get Metadata1').output.childItems
Then add a Data Flow activity inside the ForEach and create a source dataset with a filename parameter.
Use that filename parameter as the dynamic value for the file name in the dataset,
and then pass the source parameter filename as @item().name.
It will run the data flow for each file that Get Metadata returns.
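As a sketch under assumptions (the activity, dataset, and data flow names Get Metadata1, InputBlobFolder, and BlobToBlobFlow are placeholders, and the storeSettings type assumes an Azure Blob source), the Get Metadata filter and the ForEach wiring could look roughly like this:

```json
[
  {
    "name": "Get Metadata1",
    "type": "GetMetadata",
    "typeProperties": {
      "dataset": { "referenceName": "InputBlobFolder", "type": "DatasetReference" },
      "fieldList": [ "childItems" ],
      "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "modifiedDatetimeStart": { "value": "@startOfDay(utcnow())", "type": "Expression" },
        "modifiedDatetimeEnd": { "value": "@utcnow()", "type": "Expression" }
      }
    }
  },
  {
    "name": "ForEach daily file",
    "type": "ForEach",
    "dependsOn": [ { "activity": "Get Metadata1", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
      "items": { "value": "@activity('Get Metadata1').output.childItems", "type": "Expression" },
      "activities": [
        {
          "name": "Data flow1",
          "type": "ExecuteDataFlow",
          "typeProperties": {
            "dataFlow": { "referenceName": "BlobToBlobFlow", "type": "DataFlowReference" }
          }
        }
      ]
    }
  }
]
```

Inside the ForEach, the Data Flow activity's source dataset filename parameter is then set to @item().name so each iteration processes one of the returned files.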
I was able to solve this by selecting "Delete source files" in the data flow. This way the first pipeline pulls the new daily report into the input container, and when the second pipeline (with the data flow) pulls the file from input to output, it deletes the file in input, hence not allowing it to duplicate.

Failed to parse data from SAP table via Azure Data Factory

I am trying to extract data from SAP using SAP CDC Connector in ADF. The source data looks something like this.
START_DT|PROD_NAME|END_DT
20201230165830.0|BBEESABX|20180710143703.0
When we preview the data on the source, we get data just like the above. But while performing the copy via the Copy activity, the failure below is observed:
Failure happened on 'Source' side. ErrorCode=SapParsingDataFailure,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Failed when parsing data, parsing value: 'ESABX 201807', expected data type 'Microsoft.DataTransfer.Common.Shared.ClrTypeCode'.Please check your origin data in SAP side,Source=Microsoft.DataTransfer.Runtime.SapRfcHelper,''Type=System.FormatException,Message=Input string was not in a correct format.,Source=mscorlib,'
I have tried several combinations and changes on the sink side, such as changing Parquet to CSV and changing the copy behavior to all available options... but nothing seems to work.
Do you perhaps have hidden fields in the SAP extractor (RSA6)? Try this workaround: select all fields in the SAP CDC connector and run it again.

Azure Data Factory v2 - How to get Copy Data tool output

I'm trying to create an archiving pipeline which essentially does the following:
1. Call a stored procedure (GET) on a SQL Azure DB that returns a result set
2. Archive the result from #1 onto a storage account (e.g. JSON files)
3. Extract the ID column from #1 into an array of int
4. Use the result from #3 as a parameter to call a stored procedure (DELETE) in the same SQL Azure DB
So far, I've tried the Copy Data activity/tool, which satisfies steps 1 and 2. However, I'm not sure how to get the outputs from that step and can't find any documentation from Microsoft.
Is this not the correct usage? Do I have to do it manually instead?
Also, I'd like to do some validation in between steps (i.e. no result? don't proceed).
I've managed to try the bare/general Stored Procedure activity but also can't find where to retrieve its output for use in the next step. I'm pretty new to Data Factory and don't really work with data engineering/pipelines, so please bear with me.
Using a Copy Data activity, you can copy the stored procedure output to storage: connect the source to the SQL database and use the stored procedure as the query option, and connect the sink to a folder in a storage account.
Once the data is copied to the storage account, use a Lookup activity to read the data from the file generated in step 1.
Extract the output of the Lookup activity into an array variable using a Set Variable activity.
You can use an If Condition activity to validate the result and run the remaining activities only if it is true.
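A rough sketch of the last three steps (the names Lookup1, ids, and dbo.DeleteArchivedRows, and the count check, are assumptions rather than part of the original answer): read the archived file with a Lookup ("First row only" disabled), store its rows in an array variable, and gate the DELETE call on whether anything came back:

```json
[
  {
    "name": "Set ids",
    "type": "SetVariable",
    "dependsOn": [ { "activity": "Lookup1", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
      "variableName": "ids",
      "value": { "value": "@activity('Lookup1').output.value", "type": "Expression" }
    }
  },
  {
    "name": "If rows found",
    "type": "IfCondition",
    "dependsOn": [ { "activity": "Set ids", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
      "expression": { "value": "@greater(activity('Lookup1').output.count, 0)", "type": "Expression" },
      "ifTrueActivities": [
        {
          "name": "Delete archived rows",
          "type": "SqlServerStoredProcedure",
          "linkedServiceName": { "referenceName": "AzureSqlDb", "type": "LinkedServiceReference" },
          "typeProperties": { "storedProcedureName": "dbo.DeleteArchivedRows" }
        }
      ]
    }
  }
]
```

Note that @activity('Lookup1').output.value is an array of row objects rather than plain ints; you can either let the stored procedure parse it (for example by passing @string(variables('ids')) as a parameter) or reshape it first with a ForEach plus Append Variable.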

How to set and get variable value in Azure Synapse or Data Factory pipeline

I have created a pipeline with a Copy activity, say activity1, in an Azure Synapse Analytics workspace that loads the following JSON to Azure Data Lake Storage Gen2 (ADLS Gen2), using a REST API as the source and ADLS Gen2 as the sink (destination).
MyJsonFile.json (stored in ADLSGen2)
{"file_url":"https://files.testwebsite.com/Downloads/TimeStampFileName.zip"}
In the same pipeline, I need to add an activity2 that reads the URL from the above JSON, and an activity3 that loads the zip file (mentioned in that URL) to the same Gen2 storage.
Question: How can we add an activity2 to the existing pipeline that will get the URL from the above JSON and then pass it to activity3? Or are there any better suggestions/solutions to achieve this task?
Remarks: I have tried the Set Variable activity (shown below) by first declaring a variable in the pipeline and then using that variable, say myURLVar, in this activity, but I am not sure how to dynamically set the value of myURLVar to the value of the URL from the above JSON. Please NOTE that the JSON file name (MyJsonFile.json) is a constant, but the zip file name in the URL is dynamic (based on a timestamp), hence we cannot just hard-code the above URL.
As @Steve Zhao mentioned in the comments, use a Lookup activity to get the data from the JSON file and extract the required URL from the lookup output value using a Set Variable activity.
Connect the Lookup activity to the sink dataset of the previous Copy Data activity.
Output of the Lookup activity:
I have used the substring function in the Set Variable activity to extract the URL from the lookup output:
@replace(substring(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),indexof(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),'http'),sub(length(string(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''))),indexof(replace(replace(replace(string(activity('Lookup1').output.value),'"',''),'}',''),'{',''),'http'))),']','')
Check the output of the Set Variable activity:
Set Variable output value:
There is a way to do this without needing complex string manipulation to parse the JSON. The caveat is that the JSON file needs to be formatted such that there are no line breaks (or that each line break represents a new record).
First, set up a Lookup activity that loads the JSON file in the same way as @NiharikaMoola-MT's answer shows.
Then, for the Set Variable activity's Value setting, use the following dynamic expression: @activity('<YourLookupActivityNameHere>').output.firstRow.file_url
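A minimal sketch of that Set Variable step (assuming the Lookup is named Lookup1 and the pipeline variable is myURLVar; both names are placeholders):

```json
{
  "name": "Set myURLVar",
  "type": "SetVariable",
  "dependsOn": [ { "activity": "Lookup1", "dependencyConditions": [ "Succeeded" ] } ],
  "typeProperties": {
    "variableName": "myURLVar",
    "value": { "value": "@activity('Lookup1').output.firstRow.file_url", "type": "Expression" }
  }
}
```

Downstream, activity3 can reference the value with @variables('myURLVar'), for example as the relative URL of a parameterized HTTP source dataset in a Copy activity that writes the zip to ADLS Gen2 as binary.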
