Stop Azure blob trigger function from being triggered on existing blobs when function is published to the cloud - azure

I have an Azure Function which is initiated on a blob trigger. Interestingly, if I publish an updated version of this Azure Function to the cloud and if there are blobs already existing, then the Azure Function will be triggered on each of those already-existing blobs.
This is not the functionality I would like. Instead, I would like a newly published Azure Function to only be triggered on newly uploaded blobs, not on blobs that already exist. How can I disable triggering on existing blobs?

How can I disable triggering on existing blobs?
There is currently no way to do this, and it is not recommended.
Internally we track which blobs we have processed by storing receipts in our control container azure-webjobs-hosts. Any blob that doesn't have a receipt, or that has an old receipt (based on the blob's ETag), will be processed (or reprocessed). That's why your existing blobs are being processed: they don't have receipts.
BlobTrigger is currently designed to ensure that all blobs in a container matching the path pattern are eventually processed, and reprocessed any time they are updated.
So once all of the existing blobs have receipts, the function will only be triggered by newly uploaded blobs.
For more details, you could refer to this article.
The blob trigger fires when files are uploaded to or updated in Azure Blob storage. If triggering on existing blobs were disabled, an updated blob would never be reprocessed and the function would not see its latest content, which is why this is not recommended.
Workaround:
If you still only want to act on newly uploaded blobs, you can add a check when the function is invoked.
List all of the blobs that already exist in the container, and when the trigger fires, check whether the blob name is in that list; only if it is not, go ahead with the processing logic.
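A minimal sketch of that check, assuming a Python blob trigger and a snapshot file named existing-blobs.txt that is generated before publishing and deployed alongside the function (the file name and layout are assumptions for illustration, not part of the original answer):

import os
import azure.functions as func

# Names of blobs that existed before the function was (re)published.
# The snapshot file holds one blob name per line; it could also be built
# once at startup with ContainerClient.list_blobs() from azure-storage-blob.
_snapshot = os.path.join(os.path.dirname(__file__), "existing-blobs.txt")
with open(_snapshot) as f:
    EXISTING_BLOBS = {line.strip() for line in f if line.strip()}

def main(myblob: func.InputStream):
    # myblob.name has the form "<container>/<blob name>"
    blob_name = myblob.name.split("/", 1)[1]

    if blob_name in EXISTING_BLOBS:
        # Pre-existing blob being replayed by the receipt mechanism; skip it.
        return

    # ... process only blobs uploaded after the new version was published ...

The same check can be written in C# or any other Functions language; the only requirement is that the list of pre-existing blob names is available to the function at run time.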

Related

Azure blob trigger fired once for multiple files upload in azure blob

I need help with the following scenario.
I have set up a blob trigger on one of my Azure Blob storage containers.
Now suppose I upload 3 files to that blob container from Media Shuttle. What I want is to run some logic only when the blob trigger function is called for the last file (the 3rd one); the trigger function should still fire for the 1st and 2nd files, but it should not execute the logic intended for the last file.
So basically, somewhere I need to maintain a count of the total uploads and a count of how many times the blob trigger function has been called, compare the two, and run the final logic when the condition is satisfied, but I am not sure how to do that.
I'm using .NET Core to write the blob trigger function.
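A rough sketch of that counting idea in Python (the asker uses .NET Core, but the approach translates directly); the container name, EXPECTED_TOTAL and run_final_logic are assumptions made up for illustration:

import os
import azure.functions as func
from azure.storage.blob import ContainerClient

EXPECTED_TOTAL = 3          # assumed: number of files in the upload batch
CONTAINER_NAME = "uploads"  # assumed: container the blob trigger watches

def main(myblob: func.InputStream):
    # ... per-file work that should run for every uploaded blob ...

    container = ContainerClient.from_connection_string(
        os.environ["AzureWebJobsStorage"], CONTAINER_NAME
    )
    # Count how many blobs of the batch have arrived so far.
    arrived = sum(1 for _ in container.list_blobs())

    if arrived >= EXPECTED_TOTAL:
        # All files are present; run the logic meant for the last file.
        run_final_logic()

def run_final_logic():
    # placeholder for the batch-completion logic
    pass

Note that this naive count can run the final logic more than once if two invocations see the complete batch at the same time; a durable counter with an atomic update (for example in Table storage) would be safer.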

Using a new azure webjobs storage for the azure function

We have a set of blob trigger functions, and we are planning to use a new AzureWebJobsStorage account for these functions. My question is: since the new storage account doesn't have any record of the already-processed files, will the blobs be reprocessed? If yes, can we avoid this reprocessing, and if so, how?
I think you're talking about the blob receipts feature.
When you use a new AzureWebJobsStorage account for the function, it will definitely reprocess the already-processed files. This is by design.
The only way I can think of is that, when switching to the new AzureWebJobsStorage account, you keep a list of all the already-processed files that your function code can check, and when the code detects that a file has already been processed, it does nothing with it.
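A minimal sketch of that idea, assuming a Python blob trigger and a tracking blob processed-blobs.txt kept in a separate container (the container and blob names are assumptions for illustration):

import os
import azure.functions as func
from azure.core.exceptions import ResourceNotFoundError
from azure.storage.blob import BlobClient

# Assumed location of the tracking list: one processed blob name per line.
TRACKING_CONTAINER = "processing-state"
TRACKING_BLOB = "processed-blobs.txt"

def main(myblob: func.InputStream):
    tracker = BlobClient.from_connection_string(
        os.environ["AzureWebJobsStorage"], TRACKING_CONTAINER, TRACKING_BLOB
    )
    try:
        processed = set(tracker.download_blob().readall().decode().splitlines())
    except ResourceNotFoundError:
        processed = set()

    blob_name = myblob.name.split("/", 1)[1]
    if blob_name in processed:
        return  # already handled under the old AzureWebJobsStorage account

    # ... real processing of the new blob ...

    # Not concurrency-safe; shown only to illustrate the idea.
    processed.add(blob_name)
    tracker.upload_blob("\n".join(sorted(processed)), overwrite=True)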

Duplicate Blob Created Events When Writing to Azure Blob Storage from Azure Databricks

We are using an Azure Storage Account (Blob, StorageV2) with a single container in it. We are also using Azure Data Factory to trigger data copy pipelines from blobs (.tar.gz) created in the container. The trigger works fine when creating the blobs from an Azure App Service or by manually uploading via the Azure Storage Explorer. But when creating the blob from a Notebook on Azure Databricks, we get two (2) events for every blob created (same parameters for both events). The code for creating the blob from the notebook resembles:
dbutils.fs.cp(
"/mnt/data/tmp/file.tar.gz",
"/mnt/data/out/file.tar.gz"
)
The tmp folder is just used to assemble the package, and the event trigger is attached to the out folder. We also tried with dbutils.fs.mv, but same result. The trigger rules in Azure Data Factory are:
Blob path begins with: out/
Blob path ends with: .tar.gz
The container name is data.
We did find some similar posts relating to zero-length files, but we can't see any such files anywhere (in case they were some kind of by-product of dbutils).
As mentioned, just manually uploading file.tar.gz works fine - a single event is triggered.
We had to revert to uploading the files from Databricks to the Blob Storage using the azure-storage-blob library. Kind of a bummer, but it works now as expected. Just in case anyone else runs into this.
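A minimal sketch of an upload along those lines with azure-storage-blob (v12); the connection string, container and paths below are placeholders rather than our real values:

from azure.storage.blob import BlobServiceClient

connection_string = "<storage account connection string>"  # placeholder
service = BlobServiceClient.from_connection_string(connection_string)
blob_client = service.get_blob_client(container="data", blob="out/file.tar.gz")

# Upload the assembled package directly; going through the SDK produced a
# single BlobCreated event per file instead of the duplicates from dbutils.fs.cp.
with open("/dbfs/mnt/data/tmp/file.tar.gz", "rb") as f:
    blob_client.upload_blob(f, overwrite=True)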
More information:
https://learn.microsoft.com/en-gb/azure/storage/blobs/storage-quickstart-blobs-python

Azure Storage Webhook being triggered by historical events

I have an Azure Function which uses the webhook bindings to be triggered by each upload or modification of a blob in an Azure Storage container.
This seems to work fine on an empty test container, i.e. when uploading the first blob or modifying one of two or three blobs in the test container.
However, when I point it towards a container with approximately a million blobs it receives a continuous stream of historic blob events.
I've read that
If the blob container being monitored contains more than 10,000 blobs, the Functions runtime scans log files to watch for new or changed blobs.
[source]
Is there any way I can ignore these historical events and consider only current events?

Azure - Check if a new blob is uploaded to a container

Are there ways to check whether a container in Azure has a new blob (it doesn't matter which blob it is)? LastModifiedUtc does not seem to change when a blob is dropped into the container.
You should use a BlobTrigger function in an App Service resource.
Documentation
Windows Azure Blob Storage does not provide this functionality out of the box. You would need to handle this on your end. A few things come to my mind (just thinking out loud):
If the blobs are uploaded using your application (and not through 3rd-party tools), then after a blob is uploaded you could just update the container properties (maybe add/update a metadata entry with information about the last blob uploaded). You could also make an entry in Azure Table Storage and keep updating it with information about the last blob uploaded. As I said above, this method will only work if all blobs are uploaded through your application.
You could periodically iterate through the blobs in the container and sort them by last modified date. This works fine for a container with a smaller number of blobs; if there are more (say, tens of thousands), you end up fetching a long list, because blob storage only sorts blobs by name.
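A minimal sketch of that second approach using the current Python SDK (azure-storage-blob v12); the connection string and container name are placeholders. Since the listing comes back ordered by name, the sort has to happen on the client:

from azure.storage.blob import ContainerClient

container = ContainerClient.from_connection_string(
    "<storage connection string>",  # placeholder
    "mycontainer",                  # placeholder container name
)

# Fetch blob properties and sort by last-modified time to find the newest upload.
blobs = sorted(container.list_blobs(), key=lambda b: b.last_modified, reverse=True)

if blobs:
    newest = blobs[0]
    print(f"Most recently modified blob: {newest.name} ({newest.last_modified})")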
