How do I pull the last modified file with data flow in azure data factory?

I have files that are uploaded into an on-prem folder daily. From there, one pipeline pulls them into a blob storage container (input), and another pipeline moves them from blob (input) to blob (output); the dataflow sits between those two blobs. Finally, the output is linked to SQL. However, I want the blob-to-blob pipeline to pull only the file that was uploaded that day and run it through the dataflow. The way I have it set up, every time the pipeline runs, it doubles my files. I've attached images below.
[![Blob to Blob Pipeline][1]][1]
Please let me know if there is anything else that would make this more clear
[1]: https://i.stack.imgur.com/24Uky.png

I want the blob to blob pipeline to pull only the file that was uploaded that day and run through the dataflow.
To achieve the above scenario, you can use Filter by last modified in the Get Metadata activity and pass dynamic content as below:
@startOfDay(utcnow()) : Returns the start of the day for the current timestamp.
@utcnow() : Returns the current timestamp.
Input and output of the Get Metadata activity (it filters files for that day only):
If there are multiple files for a particular day, use a ForEach activity and pass the output of the Get Metadata activity to it as
@activity('Get Metadata1').output.childItems
Then add a Data flow activity inside the ForEach and create a source dataset with a filename parameter.
Use that filename parameter as the dynamic value for the file name in the dataset.
Then pass @item().name as the source parameter filename.
The dataflow will then run once for each file that Get Metadata returns.
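
The date-window filter described above can be sketched outside ADF. This is a minimal Python illustration of the selection logic only, not ADF code: `files_modified_today`, `start_of_day`, and the sample file names are hypothetical, and the half-open window (start inclusive, end exclusive) is an assumption about how the filter treats its bounds.

```python
from datetime import datetime, timezone


def start_of_day(ts: datetime) -> datetime:
    """Truncate a timestamp to 00:00:00 of the same day (same idea as startOfDay)."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def files_modified_today(files, now=None):
    """Return names of files last modified in [start of day, now)."""
    now = now or datetime.now(timezone.utc)
    window_start = start_of_day(now)
    return [f["name"] for f in files if window_start <= f["last_modified"] < now]


# Hypothetical listing: one file from yesterday, one from today.
now = datetime(2023, 5, 2, 9, 0, tzinfo=timezone.utc)
listing = [
    {"name": "report_0501.csv",
     "last_modified": datetime(2023, 5, 1, 23, 50, tzinfo=timezone.utc)},
    {"name": "report_0502.csv",
     "last_modified": datetime(2023, 5, 2, 6, 15, tzinfo=timezone.utc)},
]
print(files_modified_today(listing, now))
```

Only the file modified after midnight UTC is selected, which is why the previous day's report is not processed a second time.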

I was able to solve this by selecting "Delete source files" in the dataflow. This way, the first pipeline pulls the new daily report into the input, and when the second pipeline (with the dataflow) pulls the file from input to output, it deletes the file in input, so it cannot be duplicated.

Related

Azure Data Factory, how to pass parameters from trigger/pipeline in to data source

I need help. I've created a pipeline for data processing that imports a CSV and copies the data to a DB. I've also configured a Blob storage trigger, which triggers the pipeline with the dataflow when a specific file is uploaded into a container. For the moment, this trigger is set to monitor one container; however, I would like to make it more universal: monitor all containers in the desired Storage Account and trigger the pipeline if someone sends files to any of them. For that, I need to pass the container name to the pipeline so it can be used in the data source file path. For now I've created something like this:
In the pipeline, I've added this parameter, @pipeline().parameters.sourceFolder:
Next, in the trigger, I've set this:
Now, what should I set here to pass this folder path?

You need to use dataset parameters for this.
As with the folder path parameter in the pipeline, create another pipeline parameter for the file name, and assign @triggerBody().folderPath and @triggerBody().fileName to them when creating the trigger.
Pipeline parameters:
Make sure you include all containers in the storage event trigger when creating it.
Assigning trigger parameters to pipeline parameters:
Now, create two dataset parameters for the folder and file name like below.
Source dataset parameters:
Use these in the file path of the dataset dynamic content.
If you use a copy activity for this dataset, assign the pipeline parameter values (which come from the trigger parameters) to the dataset parameters like below.
If you use dataflows for the dataset, you can assign these in the dataflow activity itself, like below, after setting the dataset as the source in the dataflow.

Thank you Rakesh
I need to process a few specific files from a package that will be sent to the container. Each time, the user/application sends the same set of files, so in the trigger I check whether a new drive.xml file was sent to any container. This file defines the type of data that was sent, so if it arrives, I know that new data files have been sent as well and that they will be present in a lower folder.
E.g., drive.xml was found in /container/data/somefolder/2022-01-22/drive.xml, so I know that the 3 files I need to process are located in /container/data/somefolder/2022-01-22/datafiles/.
Therefore, in the parameters I need to pass only the file path; the file names will always be the same.
The dataset configuration looks like this:
and the event trigger like this:
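
The path handling described in this follow-up can be sketched as below. This is a plain Python illustration, not ADF configuration: `split_trigger_folder_path` is a hypothetical helper, the `datafiles` subfolder name is taken from the example path above, and it assumes the folder path handed over by the trigger begins with the container name.

```python
from typing import Tuple


def split_trigger_folder_path(folder_path: str,
                              data_subfolder: str = "datafiles") -> Tuple[str, str]:
    """Split 'container/dir/...' into (container, directory holding the data files)."""
    parts = folder_path.strip("/").split("/")
    container, directory = parts[0], "/".join(parts[1:])
    # The data files sit one level below the folder where drive.xml arrived.
    data_dir = f"{directory}/{data_subfolder}" if directory else data_subfolder
    return container, data_dir


container, data_dir = split_trigger_folder_path("container/data/somefolder/2022-01-22")
print(container, data_dir)
```

The two returned values correspond to what would be passed on as separate container and folder parameters, while the file names stay fixed.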

how to segregate files in a blob storage using ADF copy activity

I have a copy data activity in ADF and I want to segregate the files into a different container based on the file type.
ex.
Container A - .jpeg, .png
Container B - .csv, .xml and .doc
My initial idea was to use 'if condition' and 'or' statement but looks like my approach won't work.
I'd appreciate it if you could give some inputs.

First, get the list of files from the source container, then loop over each file in a ForEach activity, check the extension with an If Condition, and copy each file to its respective container based on the result.
I have the below files in my source container.
In ADF:
Using the Get metadata activity, get the list of all files from the source container.
Output of Get Metadata activity:
Pass the output list to the ForEach activity.
@activity('Get Metadata1').output.childitems
Inside the ForEach activity, add an If Condition activity to separate the files based on extension.
@or(contains(item().name, '.xml'),contains(item().name, '.csv'))
If the condition is true, copy the current file to container1.
If the condition is false, copy the current file to container2 in the False activities.
Files in the container after running the pipeline.
Container1:
Container2:
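
The extension check used in the If Condition above can be sketched in Python. This is only an illustration of the routing logic, not ADF code: the `ROUTES` table, `destination_for`, and the sample file names are hypothetical.

```python
import os

# Hypothetical routing table: extension -> destination container.
ROUTES = {".csv": "container1", ".xml": "container1"}
DEFAULT_CONTAINER = "container2"


def destination_for(file_name: str) -> str:
    """Pick the destination container from the file's extension (case-insensitive)."""
    _, ext = os.path.splitext(file_name)
    return ROUTES.get(ext.lower(), DEFAULT_CONTAINER)


files = ["sales.csv", "config.XML", "photo.jpeg", "notes.xml.bak"]
for name in files:
    print(name, "->", destination_for(name))
```

Note that this sketch matches on the final extension, so `notes.xml.bak` goes to the default container, whereas a `contains` check on '.xml' would match it. In an ADF expression, `endswith` gives the stricter behaviour if that is what is wanted.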

You should use a Get Metadata activity to first get the file types that need to be passed to a Copy data activity that copies to Container A, and then add another Get Metadata activity to get the file types for the next Copy data activity, which copies to Container B.
So your ADF pipeline may look like GetMetaData1 -> Copydata1 -> GetMetaData2 -> Copydata2. For how to use the GetMetaData activity, refer to this article and the documentation.
The Copy activity itself allows more than one wildcard path, so you could use that in the data source; see this.

Azure data factory file creation

I have a basic requirement where I want to append a timestamp to a file extracted from a SQL DB and put it in a blob. I use utcnow(), and it creates a timestamp with a T and other characters that I don't need.
Is there any format expression to get just the date and time?
I'm new to JavaScript expressions, as I come from an SSIS background.
Help appreciated.

The only way you can do that is to copy the blob and create a new one whose name is concatenated with the timestamp.
Data Factory doesn't support renaming a blob.
I only succeeded with one file.
You can follow my steps:
Using lookup activity to get the timestamp from SQL database.
Using Get metadata to get the blob name from Storage.
Using Copy data activity to copy and create new file name blob.
Pipeline preview:
Lookup preview:
Get metadata and Source Dataset:
Copy data activity Source setting:
Copy data activity Sink setting:
Add a parameter to set the new file name in the source dataset:
Use an expression to create the new file from the file name and timestamp:
@concat(split(activity('Get Metadata1').output.itemName,'.')[0],activity('Lookup1').output.firstRow.tt)
Then check the output file in the Blob Storage:
Hope this helps.

You can use an expression in the destination file name, in the sink.
toTimestamp(utcnow(), 'yyyyMMdd_HHmm_ss')
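
To show what a compact pattern such as 'yyyyMMdd_HHmm_ss' produces, here is a minimal Python sketch of the same idea. It is an illustration outside ADF: `stamped_name` and the sample file name are hypothetical, and the `strftime` pattern is a hand translation of the format string above.

```python
from datetime import datetime, timezone


def stamped_name(file_name: str, ts: datetime) -> str:
    """Insert a compact timestamp between the base name and the extension."""
    base, dot, ext = file_name.rpartition(".")
    stamp = ts.strftime("%Y%m%d_%H%M_%S")  # no 'T', no separators inside date or time
    if not dot:  # file name has no extension
        return f"{file_name}_{stamp}"
    return f"{base}_{stamp}.{ext}"


ts = datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
print(stamped_name("orders.csv", ts))
```

The result keeps the original extension and avoids the ISO 'T' separator that the unformatted utcnow() value contains.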

How to include blob metadata in copy data mapping

I'm working on an ADF v2 pipeline that copies data from a CSV blob to an Azure SQL database table. For each load, I would like to collect source metadata, such as the source blob name, and save it to a target table as part of a data lineage framework.
My blob source has the following schema:
StoreName,
StoreLocation,
StoreTaxId.
My destination table has the following schema:
StoreName,
StoreLocation,
DwhProcessDate,
DwhSourceName.
I do not know how to properly include the name of the source in the mapping section of the Copy Data activity.
For the moment I have:
defined a [Get Metadata1] activity to get references to all blobs that are available from Azure Blob Storage
defined a [ForEach1] activity, iterating through the output of the expression @activity('Get Metadata1').output.childitems
inside the [ForEach1] activity, placed a [Copy Data1] activity, where I have the source and sink sections defined.
What I'm looking for is a way to add an extra line to the mapping section that will somehow bind @item().name to the destination column [DwhSourceName].
Thanks for any suggestions on how to achieve this.

Actually, based on my test, you can specify dynamic content for the column key, but you can't set blob metadata as the value of a column in the copy data mapping at pipeline run time. Please see the rules mentioned in this document.
You still need to add the FileName column to your source data before the copy activity. Maybe you could use an Azure Function with a Blob trigger to get the blob file name, so that you can add the FileName column whenever data streams into the blob. (Please refer to this case: How Do I get the Name of The inputBlob That Triggered My Azure Function With Python)
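
The idea of adding the source name to the data before the copy can be sketched as follows. This is a minimal Python illustration, not the Azure Function from the linked case: `add_lineage_columns` and the sample row are hypothetical, the column names come from the question's destination schema, and the CSV is assumed to have a header row.

```python
import csv
import io
from datetime import date


def add_lineage_columns(csv_text: str, source_name: str, process_date: date) -> str:
    """Append DwhProcessDate and DwhSourceName to every row of a CSV with a header."""
    reader = csv.DictReader(io.StringIO(csv_text))
    fieldnames = list(reader.fieldnames or []) + ["DwhProcessDate", "DwhSourceName"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        row["DwhProcessDate"] = process_date.isoformat()
        row["DwhSourceName"] = source_name  # the blob name the rows came from
        writer.writerow(row)
    return out.getvalue()


sample = "StoreName,StoreLocation,StoreTaxId\nAcme,Oslo,123\n"
print(add_lineage_columns(sample, "stores_2019.csv", date(2019, 3, 1)))
```

Once the file carries these columns, the Copy Data mapping can treat them like any other source column.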

Stream Analytics job reference data join creating duplicates

I am using Stream Analytics to join streaming data (via IoT Hub) and reference data (via blob storage). The reference data blob file is generated every minute with latest data and is in a format "filename-{date} {time}.csv". The reference blob file data is used in the Azure Machine Learning function as parameters in SA job. The output of stream analytics job (into Azure SQL or Power BI) seems to be generating multiple rows instead of one for Azure Machine Learning function's output, one each for parameter values from previous blob files. My understanding is that it should only use the latest blob file content but looks like it is using all the blob files and generating multiple rows from AML output. Here is the query I am using:
SELECT
AMLFunction(Ref.Input1, Ref.Input2), *
FROM IoTInput Stream
LEFT JOIN RefBlobInput Ref ON Stream.DeviceId = Ref.[DeviceID]
Please can you advise whether the query or the file path needs changing to avoid duplicate records? Thanks

To use only the latest file, you need to store your files in a particular folder structure.
As you may have noticed, when you select a reference data file as an input, the input dialog asks for a folder structure along with a date and time format.
Stream Analytics always looks for the reference file in the latest {date}/{time} folder, i.e. you need to store your file like this:
2018-01-25/07-30/filename.json (YYYY-MM-DD/HH-mm/filename.json)
NOTE: The time folder needs to be unique for each minute, and likewise the date folder needs to be unique for each date. Whenever you create a new file, create it under a new timestamp folder inside the current date folder.
You can use any datetime format that the stream input supports.
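
The folder layout described above can be generated by whatever process writes the reference file each minute. This is a minimal Python sketch, assuming the YYYY-MM-DD date format and HH-mm time format; `reference_blob_path` and the default file name are hypothetical.

```python
from datetime import datetime, timezone


def reference_blob_path(ts: datetime, file_name: str = "filename.csv") -> str:
    """Build '{date}/{time}/file' with one date folder per day and one time folder per minute."""
    return f"{ts:%Y-%m-%d}/{ts:%H-%M}/{file_name}"


print(reference_blob_path(datetime(2018, 1, 25, 7, 30, tzinfo=timezone.utc)))
```

Because the date and time live in the folder names rather than in the file name, each minute's file lands in its own folder and the job can pick up the latest one.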
