I have an Azure Function which uses the webhook bindings to be triggered by each upload or modification of a blob in an Azure Storage container.
This seems to work fine on an empty test container, i.e. when uploading the first blob or modifying one of two or three blobs in the test container.
However, when I point it towards a container with approximately a million blobs it receives a continuous stream of historic blob events.
I've read that
If the blob container being monitored contains more than 10,000 blobs, the Functions runtime scans log files to watch for new or changed blobs.
[source]
Is there any way I can ignore these historical events and consider only current events?
Related
We are using an Azure Storage Account (Blob, StorageV2) with a single container in it. We are also using Azure Data Factory to trigger data copy pipelines from blobs (.tar.gz) created in the container. The trigger works fine when creating the blobs from an Azure App Service or by manually uploading via the Azure Storage Explorer. But when creating the blob from a Notebook on Azure Databricks, we get two (2) events for every blob created (same parameters for both events). The code for creating the blob from the notebook resembles:
dbutils.fs.cp(
    "/mnt/data/tmp/file.tar.gz",
    "/mnt/data/out/file.tar.gz"
)
The tmp folder is just used to assemble the package, and the event trigger is attached to the out folder. We also tried with dbutils.fs.mv, but same result. The trigger rules in Azure Data Factory are:
Blob path begins with: out/
Blob path ends with: .tar.gz
The container name is data.
We did find some similar posts relating to zero-length files, but we can't see any such files anywhere (in case they are some kind of by-product of dbutils).
As mentioned, just manually uploading file.tar.gz works fine - a single event is triggered.
We ended up uploading the files from Databricks to Blob Storage with the azure-storage-blob library instead (see the sketch below). Kind of a bummer, but it now works as expected. Posting this in case anyone else runs into the same issue.
More information:
https://learn.microsoft.com/en-gb/azure/storage/blobs/storage-quickstart-blobs-python
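For reference, the switch looked roughly like the sketch below, using the azure-storage-blob (v12) Python SDK from the notebook. The secret scope/key, container name and paths are placeholders for our setup, so adjust them to yours.

# Minimal sketch of uploading the assembled package with azure-storage-blob (v12).
# The secret scope/key, container name and blob paths below are placeholders.
from azure.storage.blob import BlobServiceClient

connection_string = dbutils.secrets.get(scope="storage", key="connection-string")  # hypothetical secret
service = BlobServiceClient.from_connection_string(connection_string)
container = service.get_container_client("data")

# Read the assembled package from the local DBFS mount path and upload it directly,
# which gave us a single BlobCreated event per file as expected.
with open("/dbfs/mnt/data/tmp/file.tar.gz", "rb") as f:
    container.upload_blob(name="out/file.tar.gz", data=f, overwrite=True)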
I have an Azure Function which is initiated on a blob trigger. Interestingly, if I publish an updated version of this Azure Function to the cloud and if there are blobs already existing, then the Azure Function will be triggered on each of those already-existing blobs.
This is not the functionality I would like. Instead, I would like a newly published Azure Function to only be triggered on newly uploaded blobs, not on blobs that already exist. How can I disable triggering on existing blobs?
How can I disable triggering on existing blobs?
There is currently no way to do this, and it is not recommended.
Internally we track which blobs we have processed by storing receipts in our control container azure-webjobs-hosts. Any blob not having a receipt, or an old receipt (based on blob ETag) will be processed (or reprocessed). That's why your existing blobs are being processed, they don't have receipts.
BlobTrigger is currently designed to ensure that all blobs in a container matching the path pattern are eventually processed, and reprocessed any time they are updated.
So once all the existing blobs have receipts, only newly uploaded (or updated) blobs will trigger the function.
For more details, you could refer to this article.
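If you want to see what the runtime has recorded, you can list the control container yourself. Below is a minimal sketch using the azure-storage-blob Python SDK; the connection string is a placeholder, and the exact layout of receipts under azure-webjobs-hosts is an internal implementation detail that may differ between runtime versions.

# List the receipts the Functions runtime keeps in its control container.
# The "blobreceipts/" prefix reflects the current internal layout and may change.
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<AzureWebJobsStorage connection string>")
hosts = service.get_container_client("azure-webjobs-hosts")

for receipt in hosts.list_blobs(name_starts_with="blobreceipts/"):
    print(receipt.name)  # one receipt per blob/ETag that has already been processed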
The blob trigger function is triggered when files are uploaded to or updated in Azure Blob storage. If triggering on existing blobs were disabled, the function would not pick up the latest content when a blob is updated, which is why this is not recommended.
Workaround:
If you still want to act only on newly uploaded blobs, you could add a check when the function is invoked.
List all of the existing blobs in the container beforehand, and when a blob triggers an invocation, check whether the blob name is in that list; if it is not, go ahead with the processing logic.
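A minimal sketch of that check, assuming the Python programming model and a pre-built snapshot of the existing blob names (the snapshot file name and the process() helper are hypothetical):

# Skip blobs that existed before the function was published.
# "existing_blobs.txt" is a hypothetical snapshot (one blob name per line) taken beforehand.
import azure.functions as func

with open("existing_blobs.txt") as f:
    EXISTING = {line.strip() for line in f if line.strip()}

def main(myblob: func.InputStream):
    # For blob-triggered functions, myblob.name is "container/blob-name".
    name = myblob.name.split("/", 1)[-1]
    if name in EXISTING:
        return  # historical blob: ignore it
    process(name, myblob)  # hypothetical processing logic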
I've inherited a solution that uses Stream Analytics with blobs as the input and then writes to an Azure SQL database.
Initially, the solution worked fine, but after adding several million blobs to a container (and not deleting old blobs), Stream Analytics is slow in processing new blobs. Also, it appears that some blobs are being missed/skipped.
Question: How does Stream Analytics know there are new blobs in a container?
Prior to Event Grid, Blob storage did not have a push notification mechanism to let Stream Analytics know that a new blob needs to be processed. So I'm assuming that Stream Analytics polls the container to get the list of blobs (with something like CloudBlobContainer.ListBlobs()) and saves that list internally, so that on the next poll it can compare the new list with the old one and determine which blobs are new and need to be processed.
The documentation states:
Stream Analytics will view each file only once
However, besides that note, I have not seen any other documentation to explain how Stream Analytics knows which blobs to process.
ASA uses the List Blobs operation to discover blobs.
If you can partition the blob path with a date/time pattern, it would be better: ASA then only has to list a specific path to discover new blobs. Without a date pattern, all blobs have to be listed, which is probably why it gets slower with a huge number of blobs.
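As an illustration, if the producer writes blobs under a date/time prefix, the ASA blob input's path pattern can use the {date} and {time} tokens so that only the current folder needs to be listed. A rough sketch with the azure-storage-blob Python SDK (container name, formats and payload are just examples):

# Write blobs under a date/time-partitioned prefix, e.g. "input/2019/07/01/13/events-4512.json",
# and configure the ASA blob input path pattern as "input/{date}/{time}/"
# with date format YYYY/MM/DD and time format HH.
from datetime import datetime, timezone
from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection string>")
container = service.get_container_client("streaming-input")  # example container

now = datetime.now(timezone.utc)
blob_name = f"input/{now:%Y/%m/%d}/{now:%H}/events-{now:%M%S}.json"
container.upload_blob(name=blob_name, data=b'{"sensor": 1, "value": 42}')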
Currently, I have a large set of text files which contain (historical) raw data from various sensors. New files are received and processed every day. I'd like to move this off of an on-premises solution to the cloud.
Would Azure's Blob storage be an appropriate mechanism for this volume of small(ish) private files? Or is there another Azure solution that I should be pursuing?
Relevant Data (no pun intended) & Requirements:
The data set contains millions of mostly small files, totalling nearly 400 GB. The average file size is around 50 KB, but some files can exceed 40 MB.
I need to maintain the existing data set for posterity's sake.
New files would be uploaded daily, and then processed once. Processing would be handled by Background Workers reading files off a queue.
Certain files would be downloaded / reviewed / reprocessed after the initial processing.
Let me elaborate more on David's comments.
As David mentioned, there's no limit on the number of objects (files) that you can store in Azure Blob Storage. The limit is on the size of the storage account, which is currently 500 TB. As long as you stay within this limit you will be good. Further, you can have 100 storage accounts in an Azure subscription, so essentially the amount of data you can store is practically limitless.
I do want to mention one more thing though. It seems that the files uploaded to blob storage are processed once and then essentially archived. For this I suggest you take a look at Azure Cool Blob Storage. It is meant for exactly this purpose: storing objects that are not frequently accessed, yet are available almost immediately when you do need them. The advantage of Cool Blob Storage is that writes and storage are cheaper compared to Hot Blob Storage accounts; however, reads are more expensive (which makes sense considering the intended use case).
So a possible solution would be to save the files in your Hot Blob Storage account. Once the files are processed, they are moved to Cool Blob Storage (see the sketch below). This Cool Blob Storage account can be in the same or a different Azure subscription.
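A rough sketch of that hand-off, using the azure-storage-blob Python SDK (account connection strings, container and blob names are placeholders); alternatively, on a GPv2 account you could change the blob's access tier in place instead of copying it to a separate account:

# After processing, copy a blob from the hot account to a cool account, then delete the original.
import time
from azure.storage.blob import BlobServiceClient

hot = BlobServiceClient.from_connection_string("<hot account connection string>")
cool = BlobServiceClient.from_connection_string("<cool account connection string>")

source = hot.get_blob_client(container="incoming", blob="2016/10/sensor-123.txt")
target = cool.get_blob_client(container="archive", blob="2016/10/sensor-123.txt")

# Server-side copy; the source must be readable by the target account (e.g. via a SAS token).
target.start_copy_from_url(source.url)
while target.get_blob_properties().copy.status == "pending":
    time.sleep(1)
source.delete_blob()  # remove the hot copy once the server-side copy has finished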
I'm guessing it CAN be used as a file system, but is it the right (best) tool for the job?
Yes, Azure Blob Storage can be used as a cloud file system.
The data set contains millions of mostly small files, totalling nearly 400 GB. The average file size is around 50 KB, but some files can exceed 40 MB.
As David and Gaurav Mantri mentioned, Azure Blob Storage could meet this requirement.
I need to maintain the existing data set for posterity's sake.
Data in Azure Blob Storage is durable. You can refer to the Service Level Agreement (SLA) for Storage.
New files would be uploaded daily, and then processed once. Processing would be handled by Background Workers reading files off a queue.
You can use Azure Functions to do the file processing work. Since it runs once a day, you could add a TimerTrigger function.
// This function will be executed once a day at midnight (CRON expression "0 0 0 * * *")
[FunctionName("TimerJob")]
public static void TimerJob([TimerTrigger("0 0 0 * * *")] TimerInfo timerInfo)
{
    // write the processing job here
}
Certain files would be downloaded / reviewed / reprocessed after the initial processing.
Blobs can be downloaded or updated at any time.
In addition, if your data processing job is very complicated, you also could store your data in Azure Data Lake Store and do the data processing job using Hadoop analytic frameworks such as MapReduce or Hive. Microsoft Azure HDInsight clusters can be provisioned and configured to directly access data stored in Data Lake Store.
Here are the differences between Azure Data Lake Store and Azure Blob Storage.
Comparing Azure Data Lake Store and Azure Blob Storage
Are there ways to check whether a container in Azure has a new blob (it doesn't matter which blob it is)? The container's LastModifiedUtc does not seem to change when a blob is dropped into the container.
You should use a BlobTrigger function in an App Service resource.
Documentation
Windows Azure Blob Storage does not provide this functionality out of the box. You would need to handle this on your end. A few things come to my mind (just thinking out loud):
If the blobs are uploaded through your application (and not through 3rd-party tools), then after a blob is uploaded you could simply update the container properties (for example, add/update a metadata entry with information about the last blob uploaded). You could also make an entry in Azure Table Storage and keep updating it with information about the last blob uploaded. As I said above, this approach only works if all blobs are uploaded through your application.
You could periodically iterate through the blobs in the container and sort them by last modified date. This approach works fine for a container with a smaller number of blobs. If the number of blobs is larger (say in the tens of thousands), you end up fetching a long list, because blob storage only sorts blobs by name.
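Here is a minimal sketch of both ideas with the azure-storage-blob Python SDK (container name, blob name and metadata key are placeholders):

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string("<connection string>")
container = service.get_container_client("uploads")  # example container

# Option 1: after your application uploads a blob, record it in the container metadata.
container.upload_blob(name="report.csv", data=b"...", overwrite=True)
container.set_container_metadata(metadata={"lastuploadedblob": "report.csv"})

# Option 2: periodically list the blobs and pick the most recently modified one.
# Listing returns blobs sorted by name, so every blob has to be fetched and compared.
latest = max(container.list_blobs(), key=lambda b: b.last_modified)
print(latest.name, latest.last_modified)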