We have a set of blob trigger functions and we are planning to use a new AzureWebJobsStorage account for these Azure Functions. My question is: since the new storage account doesn't have any record of the already processed files, will the blobs be reprocessed? If yes, can we avoid this reprocessing, and if so, how?
I think you're talking about the blob receipts feature.
When you use a new AzureWebJobsStorage account for the Azure Function, it will definitely reprocess the already processed files. This is by design.
The only way I can think of is that, when switching to a new AzureWebJobsStorage account, you keep a list of all the processed files yourself and, in your function code, skip any blob that is already on that list.
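A minimal sketch of that idea in Python (binding configuration omitted); the container control, the file processed-files.txt, and the app setting ControlStorageConnection are all hypothetical names:

```python
import logging
import os

import azure.functions as func
from azure.storage.blob import BlobClient  # pip install azure-storage-blob

# Hypothetical control blob that stores one processed blob name per line.
CONTROL_CONN = os.environ["ControlStorageConnection"]

def main(myblob: func.InputStream):
    list_client = BlobClient.from_connection_string(
        CONTROL_CONN, container_name="control", blob_name="processed-files.txt")

    processed = set()
    if list_client.exists():
        processed = set(
            list_client.download_blob().readall().decode().splitlines())

    if myblob.name in processed:
        logging.info("Blob %s already processed, skipping.", myblob.name)
        return

    # ... actual processing of myblob goes here ...

    processed.add(myblob.name)
    list_client.upload_blob("\n".join(processed), overwrite=True)
```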
What are possible ways to implement such a scenario?
I can think of an Azure Function that periodically checks the share for new files. Are there any other possibilities?
I have also been thinking about duplicating the files to Blob storage and generating the notifications from there.
A storage content trigger is available out of the box only for blobs. If you are open to migrating to Blob storage, you can use a BlobTrigger Azure Function. For triggering on files in a File Share, below are my suggestions as requested:
A TimerTrigger Azure Function that polls for files added since the previous run (a sketch follows below).
A Recurrence trigger in a Logic App to poll and check for new content.
A continuous WebJob that keeps polling the File Share for new content.
In my opinion, duplicating the files to Blob storage just to generate the notifications is not a great option, because the duplication itself requires yet another polling mechanism, which can be achieved with options like the ones above, so it adds an unnecessary step.
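A minimal sketch of the TimerTrigger option in Python (timer binding configuration omitted); the share name, the app setting FileShareConnection, and the in-memory "seen" set are all assumptions, and a real implementation should persist the seen names, e.g. in Table storage:

```python
import logging
import os

import azure.functions as func
from azure.storage.fileshare import ShareClient  # pip install azure-storage-file-share

SHARE_NAME = "incoming-files"   # hypothetical share name
_seen = set()                   # in-memory only; persist this in real use

def main(mytimer: func.TimerRequest) -> None:
    share = ShareClient.from_connection_string(
        os.environ["FileShareConnection"], share_name=SHARE_NAME)

    for item in share.list_directories_and_files():
        if not item["is_directory"] and item["name"] not in _seen:
            _seen.add(item["name"])
            logging.info("New file detected: %s", item["name"])
            # ... kick off your processing for this file here ...
```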
I have some e-mail attachments being saved to Azure Blob.
I am now trying to write an Azure Functions app that would connect to that blob storage, run some scripts and re-save the file.
However, when selecting a storage account for the function, I couldn't select my blob storage account.
I went on the website and it said this:
When creating a function app, you must create or link to a general-purpose Azure Storage account that supports Blob, Queue, and Table storage. Some storage accounts don't support queues and tables. These accounts include blob-only storage accounts and Azure Premium Storage.
I'm wondering, is there any workaround for this? And if not, perhaps any other suggestions? I'm getting a little lost in all the options and which one to actually choose.
Thanks!
EDIT: Might I add, I am writing the function in Python.
I think you are overlooking the fact that you can have multiple storage accounts. In order for an Azure Function to work you need a storage account. That storage account is used to store runtime information of the Azure Function for internal purposes like state management. This storage account is subject to restrictions as you already found out. There is no workaround for that.
However, if the function you are writing needs to access another storage account, it is free to do so. You just have to provide the details to connect to that specific storage account. That way you also have a clear separation between the storage account that the Azure Function uses for its internal operations and the storage account your application connects to, which you have total control over, without having to worry that you break things by deleting internally used blobs/tables/queues.
You can have a blob-triggered function that fires when changes occur in your specific blob storage account. That doesn't need to be the storage account the Azure Function uses internally, which is created/selected when you create the function.
Here is a sample that shows how to add a blob-triggered Azure Function in Python; MyStorageAccountAppSetting refers to an app setting that holds the connection string of the storage account that you use for your data.
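A minimal sketch of such a blob-triggered function in Python, assuming the v2 programming model; the container name samples-workitems is only an example:

```python
import logging
import azure.functions as func

app = func.FunctionApp()

# "connection" names the app setting holding the connection string of the
# storage account that actually contains your data (not AzureWebJobsStorage).
@app.blob_trigger(arg_name="myblob",
                  path="samples-workitems/{name}",
                  connection="MyStorageAccountAppSetting")
def process_attachment(myblob: func.InputStream):
    logging.info("Processing blob: %s (%d bytes)", myblob.name, myblob.length)
    # ... run your scripts against myblob here and re-save the result ...
```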
The snippet from the website you are quoting is for storing the function app code itself and any related modules. It does not pertain to what your function can access when the code of your function executes.
When your function executes it will need to use the Azure Blob Storage SDK/modules to connect to your blob storage account and read the email attachments. Here's a quickstart guide for using Azure Storage with Python: Quickstart with Azure Storage Blobs SDK for Python
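A small sketch of that pattern with the azure-storage-blob package; the app setting AttachmentsStorage, the container attachments, and the blob names are assumptions:

```python
import os

from azure.storage.blob import BlobServiceClient  # pip install azure-storage-blob

# Connection string of the (blob-only) account that holds the attachments,
# stored in an app setting; "AttachmentsStorage" is just an example name.
service = BlobServiceClient.from_connection_string(os.environ["AttachmentsStorage"])
container = service.get_container_client("attachments")

data = container.get_blob_client("invoice.pdf").download_blob().readall()
processed = data  # ... run your scripts on the attachment here ...
container.upload_blob("processed/invoice.pdf", processed, overwrite=True)
```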
General-purpose v2 storage accounts support the latest Azure Storage features and incorporate all of the functionality of general-purpose v1 and Blob storage accounts.
There are more integration options with GPv2 accounts including Azure Function Triggers. See: Azure Blob storage bindings for Azure Functions
For more details, refer to: Types of storage accounts
If you use a Blob storage account, you can choose an access tier based on how frequently the data (e-mail attachments) is accessed; see Access tiers for Azure Blob Storage - hot, cool, and archive. If you use a general-purpose storage account, it uses the standard performance tier.
I have an Azure Function which is initiated on a blob trigger. Interestingly, if I publish an updated version of this Azure Function to the cloud and if there are blobs already existing, then the Azure Function will be triggered on each of those already-existing blobs.
This is not the functionality I would like. Instead, I would like a newly published Azure Function to only be triggered on newly uploaded blobs, not on blobs that already exist. How can I disable triggering on existing blobs?
How can I disable triggering on existing blobs?
There is currently no way to do this, and it is not recommended.
Internally we track which blobs we have processed by storing receipts in our control container azure-webjobs-hosts. Any blob without a receipt, or with an outdated receipt (based on the blob's ETag), will be processed (or reprocessed). That's why your existing blobs are being processed: they don't have receipts.
BlobTrigger is currently designed to ensure that all blobs in a container matching the path pattern are eventually processed, and reprocessed any time they are updated.
So once all the existing blobs have a receipt, the function will only be triggered by newly uploaded blobs.
For more details, you could refer to this article.
The blob trigger fires when files are uploaded to or updated in Azure Blob storage. If you disable triggering on existing blobs, your function would not pick up the latest content when an existing blob is updated, which is why it is not recommended.
Workaround:
If you still want to react only to newly uploaded blobs, you could add a check when the function is invoked.
List all of the existing blobs in the container before publishing, and when the trigger fires, check whether the blob name is in that list; only if it is not do you run the actual processing.
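A rough sketch of that check in Python (binding configuration omitted), assuming you save the pre-existing blob names, one per line, into a file existing-blobs.txt deployed next to the function code; the file name is hypothetical:

```python
import logging
import pathlib

import azure.functions as func

# Names of the blobs that existed before this version was published,
# one per line, in a file deployed alongside the function (hypothetical).
EXISTING = set(
    pathlib.Path(__file__).with_name("existing-blobs.txt").read_text().splitlines())

def main(myblob: func.InputStream):
    if myblob.name in EXISTING:
        logging.info("Skipping pre-existing blob: %s", myblob.name)
        return

    logging.info("Processing new blob: %s", myblob.name)
    # ... actual processing goes here ...
```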
I've got some image processing code that I need to run in Azure. It's perfect for an Azure Function, but unfortunately requires a component with a complex installation procedure and therefore will need to run in a VM.
However, I'd like to make it behave much like an Azure Function, and trigger whenever new items arrive in blob storage.
My question is: Does Azure provide me with any handy way of doing this, or do I have to write code that polls the blob storage looking for new items?
Have a look at the Azure WebJobs SDK. It shares its API model with Functions, but you can host it in any .NET application, for example on your VM; see Blob Trigger.
I have a bunch of files in Azure Blob storage and it's constantly getting new ones. I was wondering if there is a way for me to first take all the data I have in Blob and move it over to BigQuery and then keep a script or some job running so that all new data in there gets sent over to BigQuery?
BigQuery supports querying data directly from these external data sources: Google Cloud Bigtable, Google Cloud Storage, and Google Drive. Azure Blob storage is not included. As Adam Lydick mentioned, as a workaround you could copy the data/files from Azure Blob storage to Google Cloud Storage (or another BigQuery-supported external data source).
To copy data from Azure Blob storage to Google Cloud Storage, you can run WebJobs (or Azure Functions). A blob-triggered WebJob invokes a function when a blob is created or updated, and in that function you can read the blob content and write/upload it to Google Cloud Storage.
Note: you can install the Google.Cloud.Storage library to perform common operations in client code, and this blog explains how to use the Google.Cloud.Storage SDK in Azure Functions.
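The note above refers to the .NET Google.Cloud.Storage package; a rough Python equivalent of the same copy step, using a blob-triggered function and the google-cloud-storage client, might look like the sketch below. The connection setting, container, and bucket names are assumptions, and GOOGLE_APPLICATION_CREDENTIALS is expected to point at a service-account key:

```python
import azure.functions as func
from google.cloud import storage  # pip install google-cloud-storage

app = func.FunctionApp()

@app.blob_trigger(arg_name="myblob",
                  path="incoming/{name}",              # example container
                  connection="AzureBlobConnection")    # example app setting
def copy_to_gcs(myblob: func.InputStream):
    gcs = storage.Client()                             # uses GOOGLE_APPLICATION_CREDENTIALS
    bucket = gcs.bucket("my-bigquery-staging")         # hypothetical GCS bucket
    # Mirror the Azure blob path in GCS and upload the raw bytes.
    bucket.blob(myblob.name).upload_from_string(myblob.read())
```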
I'm not aware of anything out-of-the-box (on Google's infrastructure) that can accomplish this.
I'd probably set up a tiny VM to:
Scan your Azure blob storage looking for new content.
Copy new content into GCS (or local disk).
Kick off a LOAD job periodically to add the new data to BigQuery (a sketch follows below).
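A minimal sketch of step 3 with the google-cloud-bigquery client; the bucket, dataset, and table names are placeholders, and the files are assumed to be newline-delimited JSON:

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

client = bigquery.Client()

# Placeholders: the GCS prefix the VM copies into and the target table.
uri = "gs://my-bigquery-staging/incoming/*.json"
table_id = "my-project.my_dataset.azure_blobs"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,                    # let BigQuery infer the schema
    write_disposition="WRITE_APPEND",   # append new data to the table
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish
print(f"Loaded data; table now has {client.get_table(table_id).num_rows} rows.")
```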
If you used GCS instead of Azure Blob Storage, you could eliminate the VM and just have a Cloud Function that is triggered on new items being added to your GCS bucket (assuming your blob is in a form that BigQuery knows how to read). I presume this is part of an existing solution that you'd prefer not to modify though.