How to create a trigger (to Data Factory, an Azure Function, or Databricks) when 500 new files land in an Azure Storage blob container

I have an Azure Storage container that receives many files on a daily basis. My requirement is to get a trigger in Azure Data Factory or Databricks each time 500 new files have arrived, so I can process them.
In Data Factory we have the event trigger, which fires for each new file (with its filename and path), but is it possible to get multiple new files and their details at the same time?
Which Azure services can I use for this scenario: Event Hubs, Azure Functions, queues?

One of the characteristics of a serverless architecture is that something executes whenever a new event occurs. Because of that, you can't use those services alone to react only once every 500 files.
Here's what I would do:
#1 An Azure Function with a Blob Trigger, executed whenever a new file arrives. It would not start processing the file, but just increment a count of arrived files stored in Cosmos DB.
#2 Azure Cosmos DB also offers a Change Feed, which works like event sourcing: it emits an event whenever something changes in a collection. Since the document created/modified in #1 holds the count, you can use another Azure Function (#3) to consume the change feed.
#3 This function just contains an if statement that monitors the current count and, once it reaches the threshold, starts the processing. See the sketch below.
After that, you just need to update the document and reset the count.
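A minimal sketch of that pattern in Python, assuming the Azure Functions v2 programming model and the azure-cosmos SDK; the database/container names, connection settings, counter document, and the Cosmos DB trigger decorator parameters (which assume the v4 Cosmos DB extension) are illustrative assumptions rather than a finished implementation:

    # Hypothetical sketch: count new blobs in Cosmos DB and react via the change feed.
    import logging
    import os

    import azure.functions as func
    from azure.cosmos import CosmosClient, exceptions

    app = func.FunctionApp()
    THRESHOLD = 500  # start processing once this many new files have arrived


    def _counters():
        # Small Cosmos DB container holding the counter document (assumed partitioned by /id).
        client = CosmosClient.from_connection_string(os.environ["COSMOS_CONNECTION"])
        return client.get_database_client("filetracking").get_container_client("counters")


    # #1: increment the counter whenever a new blob lands in the container.
    @app.blob_trigger(arg_name="newblob", path="incoming/{name}",
                      connection="AzureWebJobsStorage")
    def count_new_file(newblob: func.InputStream):
        counters = _counters()
        try:
            doc = counters.read_item(item="filecount", partition_key="filecount")
        except exceptions.CosmosResourceNotFoundError:
            doc = {"id": "filecount", "count": 0, "files": []}
        doc["count"] += 1
        doc["files"].append(newblob.name)
        # A real implementation would need a concurrency-safe increment (e.g. etag checks).
        counters.upsert_item(doc)


    # #2/#3: consume the change feed of the counter container and check the threshold.
    @app.cosmos_db_trigger(arg_name="changes", connection="COSMOS_CONNECTION",
                           database_name="filetracking", container_name="counters",
                           lease_container_name="leases",
                           create_lease_container_if_not_exists=True)
    def check_threshold(changes: func.DocumentList):
        for doc in changes:
            if doc.get("count", 0) >= THRESHOLD:
                logging.info("Threshold reached with %s files, starting batch processing",
                             doc["count"])
                # Kick off the ADF pipeline / Databricks job here, then reset the
                # counter document (count = 0, files = []) for the next batch.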

Related

How to execute a trigger based on Blob created in Azure Data Factory?

I have a pipeline that executes with a trigger every time a blob is created. Sometimes the process needs to handle many files at once, so I added a 'For Each' activity to my pipeline in order to load the data when multiple blobs are created.
That part of the pipeline uploads the data of every blob in the container to a SQL database, and here is the problem: when I execute it manually everything is fine, but when the trigger fires, the pipeline runs as many times as there are blobs in the container and loads the data multiple times no matter what.
What am I doing wrong? Is there any way to have the trigger execute the pipeline just once when a blob is created, no matter how many files are in the container?
Thanks by the way, best regards.
Your solution triggers on a storage event, so that part is working.
When triggered, it retrieves and processes every blob in the container, which is not what you intend.
I think you have a few options here. You may want to follow this MSFT tutorial, where they use a single copy activity to a sink. Step 11 shows that you have to pass @triggerBody().folderPath and @triggerBody().fileName to the copy activity.
The other option is to aggregate all blob storage events and use a batch process to do the operation.
I would try the simple one-file-per-run option first; a sketch of such a trigger definition follows.
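For reference, a hedged sketch of what that trigger-to-pipeline parameter mapping could look like if defined with the Python SDK (azure-mgmt-datafactory); the resource names, pipeline name, pipeline parameter names, and storage account scope are assumptions:

    # Hypothetical sketch: a blob event trigger that passes only the triggering
    # blob's folder and file name into the pipeline parameters.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient
    from azure.mgmt.datafactory.models import (
        BlobEventsTrigger,
        PipelineReference,
        TriggerPipelineReference,
        TriggerResource,
    )

    SUBSCRIPTION_ID = "<subscription-id>"   # assumption: fill in your own IDs/names
    RESOURCE_GROUP = "my-rg"
    FACTORY_NAME = "my-adf"
    STORAGE_ACCOUNT_ID = (
        "/subscriptions/<subscription-id>/resourceGroups/my-rg"
        "/providers/Microsoft.Storage/storageAccounts/mystorage"
    )

    adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

    trigger = BlobEventsTrigger(
        events=["Microsoft.Storage.BlobCreated"],
        blob_path_begins_with="/input-container/blobs/",
        scope=STORAGE_ACCOUNT_ID,
        pipelines=[
            TriggerPipelineReference(
                pipeline_reference=PipelineReference(
                    reference_name="CopySingleBlob", type="PipelineReference"
                ),
                # Each trigger run only sees the blob that fired it.
                parameters={
                    "sourceFolder": "@triggerBody().folderPath",
                    "sourceFile": "@triggerBody().fileName",
                },
            )
        ],
    )

    adf.triggers.create_or_update(
        RESOURCE_GROUP, FACTORY_NAME, "BlobCreatedTrigger",
        TriggerResource(properties=trigger),
    )
    # The trigger still has to be started (triggers.begin_start in recent SDK versions)
    # before it fires.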

In Data Factory, does an Azure event trigger wait until the full file is copied?

Let us say that I am copying a 10 GB file to an ADLS location and this location is being monitored by an Azure event trigger. Will the event trigger wait for the full 10 GB file to be copied before firing, or will it trigger the pipeline as soon as the file starts copying? If the pipeline gets kicked off as soon as the file starts to copy, how can we delay it so that the pipeline waits until the full file is copied?
Based on my knowledge, ADF is triggered once the entire file is uploaded when using an event trigger.
ADF trigger:
Based on the documentation: https://learn.microsoft.com/en-us/azure/data-factory/how-to-create-event-trigger
Once the file is created (i.e. fully uploaded), the trigger fires.
According to the docs:
It depends on which API was used to copy the file.
If it's the Blob REST APIs:
"In that case, the Microsoft.Storage.BlobCreated event is triggered when the CopyBlob operation is initiated and not when the Block Blob is completely committed."
If it's the Azure Data Lake Storage Gen2 REST APIs, the event is triggered only "when clients use the CreateFile and FlushWithClose operations that are available in the Azure Data Lake Storage Gen2 REST API."

Add Message to Azure Queue when Azure Table Storage is updated

Currently, I have an Azure Function App which runs every hour (timer trigger), pulls data from Azure Table storage, and updates an NSG. I only did it this way because Function Apps currently DON'T support Azure Table triggers; however, Function Apps DO support Azure Queue triggers.
With that said, I'd like a message to be sent to the queue every time my Azure Table is updated. That way, the updates can be processed immediately instead of once an hour. I haven't figured out how to send messages to an Azure Queue from Azure Tables, though.
Any help?
There is no change feed, update trigger, etc. on Azure Table storage. You could achieve this by switching to the Table API on Cosmos DB, which does have a Change Feed.
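For the queue-triggered side the question mentions, a minimal hedged sketch in the Azure Functions Python v2 model; the queue name, message shape, and the NSG update itself are placeholders, and the messages would be produced by whatever consumes that Change Feed (or by the code that writes the table):

    # Hypothetical sketch: react to a queue message whenever something signals a
    # table change, instead of polling on an hourly timer.
    import json
    import logging

    import azure.functions as func

    app = func.FunctionApp()


    @app.queue_trigger(arg_name="msg", queue_name="table-updates",
                       connection="AzureWebJobsStorage")
    def on_table_update(msg: func.QueueMessage):
        # Assumed message format: a small JSON payload describing the changed entity.
        change = json.loads(msg.get_body().decode("utf-8"))
        logging.info("Table entity changed: %s", change)
        # Update the NSG rules here instead of waiting for the hourly timer.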

How to create a trigger in Azure Data Factory which fires once a file is available in ADLS

I have a web app where some Python code runs, generates CSV files, and stores them in ADLS. I want an ADF pipeline that triggers when files arrive in ADLS and loads the data into a DB.
I want to know whether there is any automated triggering facility available in ADF, as my files depend on user input from a front-end tool and we have no idea when a user will generate them; it can be very random. I went through the event-based triggering option, but it says that only 500 triggers are allowed per storage account, and in our case there might be more than 500 files in a single day. Is there any way to achieve this trigger, or am I misunderstanding the 500-trigger limit? Any suggestions?
Azure Data Factory supports a maximum of 500 event triggers per storage account; that limit is on how many trigger definitions you can create, not on how often they fire.
Once you have created an event trigger for the pipeline, it fires once per created file, and the number of trigger runs is not limited.
I ran a test that showed one pipeline being triggered more than 500 times in a single day.

Use of Azure Event Grid to trigger an ADF pipeline to move on-premises CSV files to an Azure database

We have a series of CSV files landing every day (a daily delta) that need to be loaded into an Azure database using Azure Data Factory (ADF). We have created an ADF pipeline that moves data straight from an on-premises folder to an Azure DB table, and it is working.
Now we need this pipeline to execute based on an event rather than a scheduled time: specifically, the creation of a specific file in the same local folder. This file is created when the daily delta file landing is complete. Let's call it SRManifest.csv.
The question is: how do I create a trigger to start the pipeline when SRManifest.csv is created? I have looked into Azure Event Grid, but it seems it doesn't work with on-premises folders.
You're right that you cannot configure an Event Grid trigger to watch local files, since you're not writing to Azure Storage. You'd need to generate your own signal after writing your local file content.
Aside from timer-based triggers, event-based triggers are tied to Azure Storage, so the only way to use them is to drop some type of "signal" file in a well-known storage location, after your files are written locally, to trigger your ADF pipeline to run.
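A minimal sketch of dropping such a signal file with the azure-storage-blob SDK; the connection setting, container, blob name, and local path are assumptions:

    # Hypothetical sketch: after the local delta files are written, upload the
    # SRManifest.csv "signal" blob that the ADF event trigger is watching.
    import os

    from azure.storage.blob import BlobClient

    blob = BlobClient.from_connection_string(
        os.environ["STORAGE_CONNECTION"],   # assumed setting holding the connection string
        container_name="signals",           # assumed well-known container
        blob_name="SRManifest.csv",
    )

    with open(r"C:\daily-delta\SRManifest.csv", "rb") as f:   # assumed local path
        blob.upload_blob(f, overwrite=True)   # BlobCreated event fires, ADF trigger runs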
Alternatively, you can trigger an ADF pipeline programmatically (.NET and Python SDKs support this; maybe other ones do as well, plus there's a REST API). Again, you'd have to build this, and run your trigger program after your local content has been created. If you don't want to write a program, you can use PowerShell (via Invoke-AzDataFactoryV2Pipeline).
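For the Python SDK option, a hedged sketch with azure-mgmt-datafactory; the subscription, resource group, factory, pipeline name, and parameters are assumptions:

    # Hypothetical sketch: start the ADF pipeline directly once the local files are ready.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.datafactory import DataFactoryManagementClient

    SUBSCRIPTION_ID = "<subscription-id>"   # assumption
    RESOURCE_GROUP = "my-rg"                # assumption
    FACTORY_NAME = "my-adf"                 # assumption

    adf = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

    run = adf.pipelines.create_run(
        RESOURCE_GROUP, FACTORY_NAME, "LoadDailyDelta",    # assumed pipeline name
        parameters={"deltaDate": "2024-01-01"},            # optional pipeline parameters
    )
    print("Started pipeline run:", run.run_id)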
There are other tools/services that integrate with Data Factory as well; I wasn't attempting to provide an exhaustive list.
As another option, have a look at the triggers of the Azure Logic Apps File System connector.
