Manage Azure Blob Storage file append with ADF

I have an Azure Data Factory pipeline that stores data, after some transformation, by calling an Azure Data Flow.
The file name in Blob Storage should be the pipeline run ID.
The pipeline Copy activity has a 'Copy behavior' option, but I cannot find a related option on the sink transformation in a Data Flow.
Now I have a situation where I call the same Data Flow more than once within the same pipeline execution, and because of that my file gets overwritten in the blob. I want to append the new data to the same file if it already exists.
For example, if the pipeline run ID is '9500d37b-70cc-4dfb-a351-3a0fa2475e32' and the Data Flow is called twice from that pipeline execution, then 9500d37b-70cc-4dfb-a351-3a0fa2475e32.csv ends up containing only the data from the second Data Flow run.

Data Flow doesn't support copyBehavior, which means it cannot merge or append files.
Every time you call the Data Flow, it creates a new '9500d37b-70cc-4dfb-a351-3a0fa2475e32.csv' and replaces the existing '9500d37b-70cc-4dfb-a351-3a0fa2475e32.csv'.
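If you really need append semantics, one possible workaround outside of Data Flow is to land each run's output in a staging location and append it yourself with the Storage SDK, for example into an append blob named after the run ID. A minimal Python sketch, where the connection string, container, and file names are all assumptions:

```python
# Hypothetical sketch: append each Data Flow output into one blob named after
# the pipeline run id, using an append blob (names/paths below are assumptions).
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
container = service.get_container_client("output")

run_id = "9500d37b-70cc-4dfb-a351-3a0fa2475e32"
append_client = container.get_blob_client(f"{run_id}.csv")

# Create the append blob once, then append every subsequent Data Flow result.
if not append_client.exists():
    append_client.create_append_blob()

with open("dataflow_output_part.csv", "rb") as part:
    append_client.append_block(part.read())
```

This only works if the blob is created as an append blob, so the per-run parts would have to be written to a staging path first rather than directly to the run-id file.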
Hope this helps.

Related

Prevent triggering next activity if no files were copied in previous activity in Azure Data Factory

I am using Azure Data Factory to copy data from one Blob Storage account to a Data Lake Storage Gen2 account.
I have created a pipeline with a Copy activity inside it. I trigger this pipeline from a timer-triggered Azure Function using the C# SDK.
I am copying only incremental data by making use of the 'Filter by last modified' feature, passing a UTC StartTime and EndTime.
Now, the question is - I don't want to trigger the second activity if no files are found within this range. How can I do that?
You can use an If Condition activity and check whether any files were written with this expression: @greater(activity('Copy data1').output.filesWritten, 0). Then put the second activity inside the True case.
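Since you already trigger the pipeline from an Azure Function via the SDK, the same check could also be made from code by inspecting the Copy activity's output through the monitoring API. A hedged Python sketch using azure-mgmt-datafactory; all resource names and the activity name are placeholders, and the shape of the output object is assumed to be a plain dict:

```python
# Sketch (assumed resource names): check the Copy activity's filesWritten output
# for a given pipeline run via the ADF monitoring API.
from datetime import datetime, timedelta
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import RunFilterParameters

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

filters = RunFilterParameters(
    last_updated_after=datetime.utcnow() - timedelta(hours=1),
    last_updated_before=datetime.utcnow(),
)
runs = client.activity_runs.query_by_pipeline_run(
    "<resource-group>", "<factory-name>", "<pipeline-run-id>", filters
)

copy_run = next(r for r in runs.value if r.activity_name == "Copy data1")
files_written = (copy_run.output or {}).get("filesWritten", 0)
print("Trigger next step" if files_written > 0 else "Nothing copied, skip")
```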

Use of Azure Grid Events to trigger ADF Pipe to move On-premises CSV files to Azure database

We have a series of CSV files landing every day (the daily delta), and these need to be loaded into an Azure database using Azure Data Factory (ADF). We have created an ADF pipeline which moves data straight from an on-premises folder to an Azure DB table, and it is working.
Now we need this pipeline to be executed based on an event rather than a schedule: specifically, on the creation of a particular file in the same local folder. This file is created when the daily delta files have finished landing. Let's call it SRManifest.csv.
The question is: how do we create a trigger to start the pipeline when SRManifest.csv is created? I have looked into Azure Event Grid, but it seems it doesn't work with on-premises folders.
You're right that you cannot configure an Event Grid trigger to watch local files, since you're not writing to Azure Storage. You'd need to generate your own signal after writing your local file content.
Aside from timer-based triggers, event-based triggers are tied to Azure Storage, so the only way to use them would be to drop some type of "signal" file in a well-known storage location, after your files are written locally, to trigger your ADF pipeline to run.
Alternatively, you can trigger an ADF pipeline programmatically (.NET and Python SDKs support this; maybe other ones do as well, plus there's a REST API). Again, you'd have to build this, and run your trigger program after your local content has been created. If you don't want to write a program, you can use PowerShell (via Invoke-AzDataFactoryV2Pipeline).
There are other tools/services that integrate with Data Factory as well; I wasn't attempting to provide an exhaustive list.
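For illustration of that programmatic route, here is a minimal Python sketch using the azure-mgmt-datafactory package; the resource names, pipeline name, and parameter are placeholders:

```python
# Sketch: trigger an ADF pipeline run from your own code once the local
# SRManifest.csv has been written (resource names below are assumptions).
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run = client.pipelines.create_run(
    resource_group_name="<resource-group>",
    factory_name="<factory-name>",
    pipeline_name="LoadDailyDelta",       # hypothetical pipeline name
    parameters={"manifestFile": "SRManifest.csv"},
)
print(f"Started pipeline run {run.run_id}")
```

The calling program (or a PowerShell equivalent via Invoke-AzDataFactoryV2Pipeline) would run right after SRManifest.csv appears in the local folder.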
Have a look at the Azure Logic Apps File System connector triggers as well.

Data from HTTP endpoint to be loaded into Azure Data Lake using Azure Data Factory

I am trying to build a so-called "modern data warehouse" using Azure services.
The first step is to gather all the data in its native raw format into Azure Data Lake Store. For some of the data sources we have no choice but to consume the data through an API. There's not much information to be found when searching, so I am asking here.
Is it possible to define two Web activities in my pipeline that handle the scenario below?
The Web1 activity calls an API URL generated by a C# Azure Function. It returns data in JSON format and saves it to Web1.Output. This is working fine.
The Web2 activity consumes Web1.Output and saves it into Azure Data Lake as a plain text file (PUT or POST). This is what I need.
The above scenario is achievable with a Copy activity, but then I am not able to pass the dynamic URL generated by the Azure Function. How do I save the JSON output to ADL? Is there any other way?
Thanks!
Since you are using Blob Storage as an intermediary and want to consume the blob upon creation, you could take advantage of event triggers. You can set up an event trigger to run a pipeline containing the Web2 activity, which kicks off when the Web1 activity (in a separate pipeline) completes.
By separating the two activities into separate pipelines, the workflow becomes asynchronous. This means you will not need to wait for both activities to complete before moving on to the next URL. There are many other benefits as well.
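As a rough illustration of that setup, a blob-created event trigger can also be defined through the Python SDK; every name, path, and scope below is a placeholder:

```python
# Sketch (assumed names): define a blob-created event trigger that starts the
# pipeline containing the Web2 activity whenever Web1's output blob lands.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    BlobEventsTrigger, PipelineReference, TriggerPipelineReference, TriggerResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

trigger = BlobEventsTrigger(
    scope="/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>",
    events=["Microsoft.Storage.BlobCreated"],
    blob_path_begins_with="/staging/blobs/",   # hypothetical container/folder
    blob_path_ends_with=".json",
    pipelines=[TriggerPipelineReference(
        pipeline_reference=PipelineReference(reference_name="Web2Pipeline"),
    )],
)
client.triggers.create_or_update(
    "<resource-group>", "<factory-name>", "OnWeb1Output", TriggerResource(properties=trigger)
)
# Note: the trigger still has to be started before it fires.
```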

Filter blob data in Copy Activity

I have a Copy activity that copies data from Blob Storage to Azure Data Lake. The blob is populated by an Azure Function with an Event Hub trigger. The blob file names are suffixed with a UNIX timestamp, which is the event enqueued time in the Event Hub. Azure Data Factory is triggered every hour to merge the files and move them over to the Data Lake.
The source dataset offers a filter by last-modified date (in UTC) out of the box. I can use this, but it limits me to the blob's last-modified date. I want to use my own date filters and decide where to apply them. Is this possible in Data Factory? If so, can you please point me in the right direction?
For ADF, the only idea that comes to mind is a combination of the Lookup, ForEach and Filter activities, though it may be a bit complex:
1. Use a Lookup activity to retrieve the data from the blob file.
2. Use a ForEach activity to loop over the result and set your own date/time filters.
3. Inside the ForEach activity, do the copy task.
Please refer to this blog to get some clues.
Looking at the tasks you describe, I would also suggest taking a look at Azure Stream Analytics. Whether the data source is Event Hub or Azure Blob Storage, ASA supports it as an input, and it supports ADL as an output.
You could create a job to configure the input and output, then use its SQL-like query language to filter your data however you want, for example with the WHERE operator or the date and time functions.
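For illustration only, the custom filtering logic itself (whichever activity or ASA query ends up hosting it) might look like this in Python, assuming the blob names end with the UNIX enqueued-time timestamp, e.g. events_1596240000.csv; the container name and naming pattern are assumptions:

```python
# Illustration only (not ADF itself): filter blobs by a custom time window
# parsed from the UNIX timestamp suffix in the blob name.
from datetime import datetime, timezone
from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    "<storage-connection-string>", "events"   # hypothetical container name
)

window_start = datetime(2020, 8, 1, 0, 0, tzinfo=timezone.utc)
window_end = datetime(2020, 8, 1, 1, 0, tzinfo=timezone.utc)

def enqueued_time(blob_name: str) -> datetime:
    # Assumes "<prefix>_<unix-timestamp>.csv" naming, per the question.
    stamp = int(blob_name.rsplit("_", 1)[1].split(".")[0])
    return datetime.fromtimestamp(stamp, tz=timezone.utc)

to_merge = [
    b.name for b in container.list_blobs()
    if window_start <= enqueued_time(b.name) < window_end
]
print(to_merge)
```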

How to archive my on premise files once they are processed into azure data lake?

I have a pipeline activity which processes my on-premises files from a file share into Azure Data Lake. I then want to automate the next step: whenever the data has been processed into the Data Lake, I want to archive my source files.
STEP 1: I have built a Logic App which can automatically copy any new data from the IN folder to the OUT directory and delete the old files in the IN directory.
STEP 2: I have built my pipeline which processes the data in my Data Lake.
Now, how can I trigger my Logic App from inside my pipeline through some activity, so that my on-premises files are automatically deleted by calling the Logic App?
Please suggest.
The ADF Web activity can be used to call a custom REST endpoint from a Data Factory pipeline.
And as far as I know, a Logic App can listen for HTTP requests, so maybe you could chain a Web activity after your data processing activity?
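To make that concrete, the Web activity would simply POST to the Logic App's "When a HTTP request is received" callback URL. A minimal Python sketch of the equivalent call, with the URL and body as placeholders:

```python
# Sketch: the HTTP call an ADF Web activity would issue to a Logic App with a
# "When a HTTP request is received" trigger (URL and body are assumptions).
import requests

logic_app_url = "https://prod-00.westeurope.logic.azure.com/workflows/<id>/triggers/manual/paths/invoke?api-version=2016-10-01&sp=...&sig=..."

payload = {"processedFolder": "IN", "archiveFolder": "OUT"}  # hypothetical parameters
response = requests.post(logic_app_url, json=payload, timeout=30)
response.raise_for_status()
print("Logic App accepted the archive request:", response.status_code)
```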
An ADF Custom activity runs your own code on an Azure Batch pool of virtual machines, so you could also put your trigger logic into a Custom activity.
Another option I can think of: maybe you could invoke your Logic App on a schedule, such as daily or weekly?
