Orchestrating Pipelines in Azure Data Factory - azure

I need help with ADF (not DevOps) orchestration. The process flow is below; the numbers denote the ADF activity that runs each step:
SAP tables ---(1)---> Raw Zone ---(2)---> Prepared Zone ---(3)---> Trusted Zone ---(4)---> sFTP
1. Kafka ingestion (run by ADF)
2. Databricks JAR (run by ADF)
3. Databricks JAR (run by ADF)
4. ADF Copy activity
The following tasks need to be done:
After files are generated in the trusted zone, a synchronous process should copy the files to the sFTP location.
To copy files to sFTP, it should get all .ctl files (trigger/control files), compare them with what has been flagged as processed in the JOB_CONTROL table, and copy only the new files that were not processed/copied before.
The copy program should poll for .ctl files and perform the following steps:
a. Copy the .csv file with the same name as the .ctl file.
b. Copy the .ctl file.
c. Insert/update a record in JOB_CONTROL (keyed by file type) marking the file as processed successfully. A successfully processed file is not considered in the next run.
d. In the event of an error, mark the record with the corresponding status flag so that the same file is picked up again in the next run.
Please help me to achieve this.
Regards,
SK

This is my understanding of the ask: you are logging the copied files in a table, and the intent is to initiate the copy of the files which have not yet been copied or which failed.
I think you can use a Lookup activity to read the file(s) which failed and then pass that to a ForEach (FE) loop. Inside the FE loop you can add the Copy activity (you will have to parameterize the dataset); a rough sketch follows.
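A minimal sketch of that pattern, assuming JOB_CONTROL lives in an Azure SQL database and that the trusted zone and sFTP are exposed as Binary datasets parameterized on file name (all dataset, column, and activity names below are placeholders, not your actual objects):

```json
{
    "name": "CopyTrustedToSftp",
    "properties": {
        "activities": [
            {
                "name": "LookupUnprocessedFiles",
                "description": "Read file names not yet flagged as processed in JOB_CONTROL",
                "type": "Lookup",
                "typeProperties": {
                    "source": {
                        "type": "AzureSqlSource",
                        "sqlReaderQuery": "SELECT file_name FROM JOB_CONTROL WHERE status <> 'PROCESSED'"
                    },
                    "dataset": { "referenceName": "JobControlTable", "type": "DatasetReference" },
                    "firstRowOnly": false
                }
            },
            {
                "name": "ForEachFile",
                "type": "ForEach",
                "dependsOn": [ { "activity": "LookupUnprocessedFiles", "dependencyConditions": [ "Succeeded" ] } ],
                "typeProperties": {
                    "items": { "value": "@activity('LookupUnprocessedFiles').output.value", "type": "Expression" },
                    "activities": [
                        {
                            "name": "CopyCsvToSftp",
                            "description": "Copy the .csv; add a second Copy for the .ctl and a Stored Procedure activity to insert/update JOB_CONTROL",
                            "type": "Copy",
                            "inputs": [
                                {
                                    "referenceName": "TrustedZoneBinary",
                                    "type": "DatasetReference",
                                    "parameters": { "fileName": { "value": "@item().file_name", "type": "Expression" } }
                                }
                            ],
                            "outputs": [
                                {
                                    "referenceName": "SftpBinary",
                                    "type": "DatasetReference",
                                    "parameters": { "fileName": { "value": "@item().file_name", "type": "Expression" } }
                                }
                            ],
                            "typeProperties": {
                                "source": { "type": "BinarySource" },
                                "sink": { "type": "BinarySink" }
                            }
                        }
                    ]
                }
            }
        ]
    }
}
```

For step (d), you can chain another Stored Procedure (or Script) activity onto the Copy activity's Failed dependency condition to update the same JOB_CONTROL row with an error flag, so the file is retried on the next run.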
HTH

Related

How do I pull the last modified file with data flow in azure data factory?

I have files that are uploaded into an on-prem folder daily; from there I have a pipeline pulling them to a blob storage container (input), and from there I have another pipeline from blob (input) to blob (output). That is where the data flow is, between those two blobs. Finally, I have the output linked to SQL. However, I want the blob-to-blob pipeline to pull only the file that was uploaded that day and run it through the data flow. The way I have it set up, every time the pipeline runs, it doubles my files. I've attached images below
[![Blob to Blob Pipeline][1]][1]
Please let me know if there is anything else that would make this more clear
[1]: https://i.stack.imgur.com/24Uky.png
I want the blob to blob pipeline to pull only the file that was uploaded that day and run through the dataflow.
To achieve the above scenario, you can use the Filter by last modified setting of the Get Metadata activity and pass dynamic content as below:
@startOfDay(utcnow()) : the start of the day for the current timestamp (start of the window).
@utcnow() : the current timestamp (end of the window).
Input and output of the Get Metadata activity (it filters files for that day only):
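For reference, a sketch of the Get Metadata activity JSON with that filter applied, assuming a blob source dataset (the dataset and activity names are placeholders):

```json
{
    "name": "Get Metadata1",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": { "referenceName": "input_files", "type": "DatasetReference" },
        "fieldList": [ "childItems" ],
        "storeSettings": {
            "type": "AzureBlobStorageReadSettings",
            "recursive": true,
            "modifiedDatetimeStart": { "value": "@startOfDay(utcnow())", "type": "Expression" },
            "modifiedDatetimeEnd": { "value": "@utcnow()", "type": "Expression" }
        }
    }
}
```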
If there are multiple files for a particular day, use a ForEach activity and pass the output of the Get Metadata activity to it as
@activity('Get Metadata1').output.childItems
Then add a Data flow activity inside the ForEach and create the source dataset with a filename parameter.
Use that filename parameter as dynamic content for the file name in the dataset.
Then pass the source parameter filename as @item().name
The data flow will run for each file the Get Metadata activity returns; a sketch of the parameterized dataset is below.
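A sketch of such a parameterized source dataset, assuming a delimited-text file in the input container (the dataset, linked service, and container names are placeholders); the Data flow activity then supplies @item().name for the filename parameter:

```json
{
    "name": "source_csv",
    "properties": {
        "type": "DelimitedText",
        "linkedServiceName": { "referenceName": "AzureBlobStorage1", "type": "LinkedServiceReference" },
        "parameters": { "filename": { "type": "string" } },
        "typeProperties": {
            "location": {
                "type": "AzureBlobStorageLocation",
                "container": "input",
                "fileName": { "value": "@dataset().filename", "type": "Expression" }
            },
            "columnDelimiter": ",",
            "firstRowAsHeader": true
        }
    }
}
```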
I was able to solve this by selecting "Delete source files" in the data flow. This way the first pipeline pulls the new daily report into the input, and when the second pipeline (with the data flow) pulls the file from input to output, it deletes the file in input, hence not allowing it to duplicate.

Why are files not being filtered through my data factory pipeline?

I'm copying data from on-prem files to Azure blob storage. The first part of the pipeline is exactly that: pulling the last modified file from on-prem to blob storage (blob_input). After that I'm having problems. One file out of 5 doesn't have actual data and only shows "no data", which makes it impossible to load into SQL, and that's the ultimate goal. So I connected my Copy data activity to the filtering pipeline that I created. However, my pipeline doesn't filter out the file. Could someone tell me what my pipeline is missing? I have added pictures to make things clearer, but Stack Overflow wouldn't let me post more, so I have created a Google Doc with all the info: Files and Path, forEach1, forEach2, forEach3, lookup1, ifcondition1, copydata1, copydata2.
I reproduced your scenario. The problem with filtering out the file that contains only "no data" is that you are using the first row as the header, so the Lookup count for that file is 0, which is not equal to 1, and because of that your condition is not filtering out that file.
To resolve this, you can try this condition:
@greater(activity('Lookup1').output.count, 0)
It will let a file through only if the Lookup count is greater than 0.
Output
The Copy activity does not copy data from the daily4 file because it has the same "no data" sample input as yours, and it failed the If condition.
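In the If Condition's JSON this looks roughly like the following sketch (activity names are placeholders, and the Copy activity's source/sink settings are elided):

```json
{
    "name": "If Condition1",
    "type": "IfCondition",
    "dependsOn": [ { "activity": "Lookup1", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
        "expression": {
            "value": "@greater(activity('Lookup1').output.count, 0)",
            "type": "Expression"
        },
        "ifTrueActivities": [
            {
                "name": "Copy data1",
                "description": "runs only when the Lookup found at least one data row",
                "type": "Copy"
            }
        ]
    }
}
```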

how to segregate files in a blob storage using ADF copy activity

I have a copy data activity in ADF and I want to segregate the files into a different container based on the file type.
ex.
Container A - .jpeg, .png
Container B - .csv, .xml and .doc
My initial idea was to use 'if condition' and 'or' statement but looks like my approach won't work.
I'd appreciate it if you could give some inputs.
First, get the list of files from the source container, loop over each file in a ForEach activity, check the extension using an If Condition, and copy the files to their respective containers based on that condition.
I have the below files in my source container.
In ADF:
Using the Get metadata activity, get the list of all files from the source container.
Output of Get Metadata activity:
Pass the output list to a ForEach activity:
@activity('Get Metadata1').output.childItems
Inside the ForEach activity, add an If Condition activity to separate the files based on extension.
@or(contains(item().name, '.xml'), contains(item().name, '.csv'))
If the condition is true, copy the current file to container1.
If the condition returns false, copy the current file to container2 in False activity.
Files in the container after running the pipeline.
Container1:
Container2:
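Pulling those pieces together, the ForEach body looks roughly like this (activity and container names are placeholders; the Copy activities' dataset settings are elided, with the sink file name set to @item().name in each):

```json
{
    "name": "ForEach1",
    "type": "ForEach",
    "dependsOn": [ { "activity": "Get Metadata1", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
        "items": { "value": "@activity('Get Metadata1').output.childItems", "type": "Expression" },
        "activities": [
            {
                "name": "If Condition1",
                "type": "IfCondition",
                "typeProperties": {
                    "expression": {
                        "value": "@or(contains(item().name, '.xml'), contains(item().name, '.csv'))",
                        "type": "Expression"
                    },
                    "ifTrueActivities": [
                        { "name": "Copy to container1", "type": "Copy", "description": "sink dataset points at container1" }
                    ],
                    "ifFalseActivities": [
                        { "name": "Copy to container2", "type": "Copy", "description": "sink dataset points at container2" }
                    ]
                }
            }
        ]
    }
}
```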
You should use the Get Metadata activity to first get all the files of the types that need to be passed to a Copy Data activity which copies to Container A, and then add a second Get Metadata activity to get the file types for the next Copy Data activity that copies to Container B.
So your ADF pipeline may look like GetMetadata1 -> CopyData1 -> GetMetadata2 -> CopyData2. Refer to how to use the Get Metadata activity in this article, and the documentation.
Copy activity itself allows more than one wildcard path, so you could use that in the Data source, see this.

Azure Data Factory "flatten hierarchy"

I was hoping someone went through the same process and can help me see if the following scenario is possible.
I currently built out a pipeline that copies from an S3 bucket. That bucket contains a large number of folders. Does Azure Data Factory have a way, when copying data from the S3 bucket, to then disregard the folders and just copy the files themselves? I have read that the Copy activity has "flatten hierarchy", but the big limitation that I see is that all the files are renamed, and I am never sure whether those are all of the files contained in those folders, since it mentions that it only does it "in the first level of target folder".
The other problem is that the S3 bucket has nested folders (ex: "domain/yyyy/mm/dd/file") and some folders contain data and some do not. The only advantage is that all of those files contain the same schema.
The end result of this pipeline would be the following:
1) COPY files from S3 bucket without copying the folder structure
2) Load the files into an Azure Database
If anyone has done something similar with Azure Data Factory or with another tool I would greatly appreciate your insight.
vlado101, firstly, I have to say that the "flatten hierarchy" behavior which you mentioned in your question applies to the sink, not the source.
Since your destination is SQL DB, I think this copy behavior is not relevant to your requirements. Based on my test (blob storage, not AWS S3, sorry for that because I don't have AWS services):
2 JSON files reside in the subfolder:
I configured the source dataset:
Please make sure recursive is set to true (it indicates whether the data is read recursively from the subfolders or only from the specified folder; note that when recursive is set to true and the sink is a file-based store, an empty folder or subfolder isn't copied or created at the sink) and preview the source data as below.
Executing the copy activity, all the data in the subfolder files is transferred into the destination SQL DB table:
Admittedly, this test is based on blob storage, not an S3 bucket. I believe they are similar; you could test it. Any concerns, please let me know.
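For reference, a sketch of the Copy activity's source side with recursive reads enabled, assuming JSON source files and the newer storeSettings model (for an S3 source the store settings type would be AmazonS3ReadSettings instead of AzureBlobStorageReadSettings):

```json
"source": {
    "type": "JsonSource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "recursive": true,
        "wildcardFileName": "*.json"
    },
    "formatSettings": {
        "type": "JsonReadSettings"
    }
}
```

With recursive set to true, files in nested folders such as domain/yyyy/mm/dd/ are read without recreating the folder structure when the sink is a database table.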

azure data factory v2 copy data activity recursive

I am new to Azure Data Factory v2.
I have a folder containing 2 files, F1.csv and F2.csv, in blob storage.
I have created a Copy Data activity to load the data from a file into a table in Azure DWH, with 3 parameters and "copy recursively" set to false.
Parameter1: container
Parameter2: directory
Parameter3: F1.csv
The activity executed successfully when the above parameters were used for the Copy Data activity.
But the data was loaded from both files, even though only one file was given as a parameter for the activity.
Can you please check the "wildcard file name" setting? (Select the pipeline and look under the Source tab.)
If it is set to a wildcard such as *.*, please remove that and it should work.
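For illustration, this is roughly what the problematic source side looks like with a wildcard in place (a sketch assuming a blob source; names are placeholders). The wildcardFileName overrides the file name supplied through the dataset parameter, so every matching file is copied:

```json
"source": {
    "type": "DelimitedTextSource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "recursive": false,
        "wildcardFileName": "*.*"
    }
}
```

Removing the wildcardFileName entry makes the copy honor the dataset's file name parameter (F1.csv in this case), so only that one file is loaded.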
