Parallel Blob processing with Azure WebJobs - azure

I have multiple instances of a WebSite, all running the same WebJob, connected to the same AzureWebJobsStorage and configured to trigger on incoming blobs.
Unfortunetely, the result is different from what I intended - when a blob is saved to the storage, only one of the WebJobs instances picks it up. I would rather have all of them process this blob. My guess is that it works like this beacuse of how the blob receipt name is constructed - it contains the WebJob funcion name which is the same for all my WebJobs.
Is there any way to have the blob processed by all WebJob instances in this situation?

I just found an answer to this question by myself. It turns out that one can configure the HostId which gets included in the blob's receipt name. For me, the following piece of code does the job:
var applicationName = Environment.GetEnvironmentVariable("WEBSITE_SITE_NAME");
var host = new JobHost(new JobHostConfiguration
{
HostId = applicationName ?? "localhost"
});

Related

Re-play/Repeat/Re-Fire Azure BlobStorage Function Triggers for existing files

I've just uploaded several 10s of GBs of files to Azure CloudStorage.
Each file should get picked up and processed by a FunctionApp, in response to a BlobTrigger:
[FunctionName(nameof(ImportDataFile))]
public async Task ImportDataFile(
// Raw JSON Text file containing data updates in expected schema
[BlobTrigger("%AzureStorage:DataFileBlobContainer%/{fileName}", Connection = "AzureStorage:ConnectionString")]
Stream blobStream,
string fileName)
{
//...
}
This works in general, but foolishly, I did not do a final test of that Function prior to uploading all the files to our UAT system ... and there was a problem with the uploads :(
The upload took a few days (running over my Domestic internet uplink due to CoViD-19) so I really don't want to have to re-do that.
Is there some way to "replay" the BlobUpload Triggers? so that the function triggers again as if I'd just re-uploaded the files ... without having to transfer any data again!
As per this link
Azure Functions stores blob receipts in a container named
azure-webjobs-hosts in the Azure storage account for your function app
(defined by the app setting AzureWebJobsStorage).
To force reprocessing of a blob, delete the blob receipt for that blob
from the azure-webjobs-hosts container manually. While reprocessing
might not occur immediately, it's guaranteed to occur at a later point
in time. To reprocess immediately, the scaninfo blob in
azure-webjobs-hosts/blobscaninfo can be updated. Any blobs with a last
modified timestamp after the LatestScan property will be scanned
again.
I found a hacky-AF work around, that re-processes the existing file:
If you add Metadata to a blob, that appears to re-trigger the BlobStorage Function Trigger.
Accessed in Azure Storage Explorer, but Right-clicking on a Blob > Properties > Add Metadata.
I was settings Key: "ForceRefresh", Value "test".
I had a problem with the processing of blobs in my code which meant that there were a bunch of messages in the webjobs-blobtrigger-poison queue. I had to move them back to azure-webjobs-blobtrigger-name-of-function-app. Removing the blob receipts and adjusting the scaninfo blob did not work without the above step.
Fortunately Azure Storage Explorer has a menu option to move the messages from one queue to another:
I found a workaround, if you aren't invested in the file name:
Azure Storage Explorer, has a "Clone with new name" button in the top bar, which will add a new file (and trigger the Function) without transferring the data via your local machine.
Note that "Copy" followed by "Paste" also re-triggers the blob, but appears to transfer the data down to your machine and then back up again ... incredibly slowly!

Azure Storage Queue message to Azure Blob Storage

I have access to a Azure Storage Queue using a connection string which was provided to me (not my created queue). The messages are sent once every minute. I want to take all the messages and place them in Azure Blob Storage.
My issue is that I haven't been succesful in getting the message from the attached Storage Queue. What is the "easiest" way of doing this data storage?
I've tried accessing the external queue using Logic Apps and then tried to place it in my own queue before moving it to Blob Storage, however without luck.
If you want to access and external storage in the logic app, you will need the name of the storage account and the Key.
You have to choose the trigger for an azure queues and then click in the "Manually enter connection information".
And in the next step you will be able to choose the queue you want to listen for.
I recomend you to use and azure function, something like in this article:
https://learn.microsoft.com/en-us/azure/azure-functions/functions-bindings-storage-blob-output?tabs=csharp
Firts you can try only reading the messages, and then add the output that create your blob:
[FunctionName("GetMessagesFromQueue")]
public IActionResult GetMessagesFromQueue(
[QueueTrigger("%ExternalStorage.QueueName%", Connection = "ExternalStorage.StorageConnection")ModelMessage modelmessage,
[Blob("%YourStorage.ContainerName%/{id}", FileAccess.Write, Connection = "YourStorage.StorageConnection")] Stream myBlob)
{
//put the modelmessage into the stream
}
You can bind to a lot of types not only Stream. In the link you have all the information.
I hope I've helped

Do I need an Azure Storage Account to run a WebJob?

So I'm fairly new to working with Azure and there are some things I can't quite wrap my head around. One of them being the Azure Storage Account.
My web jobs keeps stopping with the following error "Unhandled Exception: System.InvalidOperationException: The account credentials for '[account_name]' are incorrect." Understanding the error however is not the problem, at least that's what I think. The problem lies in understanding why I need an Azure Storage Account to overcome it.
Please read on as I try to take you through the steps taken thus far. Hopefuly the real question will become more clear to you.
In my efforts to deploy a WebJob on Azure we have created the following resources so far:
App Service Plan
App Service
SQL server
SQL database
I'm using the following code snippet to prevent my web job from exiting:
JobHostConfiguration config = new JobHostConfiguration();
config.DashboardConnectionString = null;
new JobHost(config).RunAndBlock();
To my understanding from other sources the Dashboard connection string is optional but the AzureWebJobsStorage connection string is required.
I tried setting the required connection string in portal using the configuration found here.
DefaultEndpointsProtocol=[http|https];AccountName=myAccountName;AccountKey=myAccountKey
Looking further I found this answer that clearly states where I would get the values needed, namely an/my missing Azure Storage Account.
So now for the actualy question: Why do I need an Azure Storage Account when I seemingly have all the resources I need place for the WebJob to run? What does it do? Is it a billing thing, cause I thought we had that defined in the App Service Plan. I've tried reading up on Azure Storage Accounts over here but I need a bit more help understanding how it relates to everything.
From the docs:
An Azure storage account provides resources for storing queue and blob data in the cloud.
It's also used by the WebJobs SDK to store logging data for the dashboard.
Refer to the getting started guide and documentation for further information
The answer to your question is "No", it is not mandatory to use Azure Storage when you are trying to setup and run a Azure web job.
If you are using JobHost or JobHostConfiguration then there is indeed a dependency for Storage accounts.
Sample code snippet is give below.
class Program
{
static void Main()
{
Functions.ExecuteTask();
}
}
public class Functions
{
[NoAutomaticTrigger]
public static void ExecuteTask()
{
// Execute your task here
}
}
The answer is no, you don't. You can have a WebJob run without being tied to an Azure Storage Account. Like Murray mentioned, your WebJob dashboard does use a storage account to log data but that's completely independent.

What is the clean up mechanism for the blobs that WebJobs SDK creates in the AzureWebJobsDashboard connection?

Azure WebJob SDK uses the storage connection string defined in the AzureWebJobsStorage and AzureWebJobsDashboard app settings for its logging and dashboard.
WebJob SDK creates the following blob container in AzureWebJobsStorage:
azure-webjobs-hosts
WebJob SDK creates the following blob containers in AzureWebJobsDashboard
azure-jobs-host-output
azure-webjobs-hosts
Many blobs are created in the above blob containers as the WebJob runs. The containers can be bloated or saturated if there is no clean-up mechanism.
What is the cleanup mechanism for the above blob containers?
Update
The answer below is a workaround. At this point, there is no built-in mechanism to clean up the WebJobs logs. The logs can pile up quite large as the Job runs in a long term. Developers must create the cleanup mechanism on their own. Azure Functions is a good way of implementing such cleanup process. An example is provided in the below answer.
What is the clean up mechanism for the blobs that WebJobs SDK creates in the AzureWebJobsDashboard connection?
I haven’t found a way to do it. There is an open issue on GitHub which related to this topic, but haven’t been closed.
No way to set webjob logging retention policy
In a similar issue on GitHub we found that Azure WebJob SDK have changed a way of saving logs to multi tables of Azure Table Storage. We can easily delete the table per month. For logs writen in Azure Blob Storage haven’t been grouped by month until now.
WebJobs.Logging needs to support log purge / retention policies
To delete the older WebJob log, I suggest you create a time triggered WebJob to delete the logs which you wanted.
Is there any AzureFunction code sample shows how to do the blob cleanup?
Code below is for your reference.
// Parse the connection string and return a reference to the storage account.
CloudStorageAccount storageAccount = CloudStorageAccount.Parse(storageConnectionString);
// Create the table client.
CloudBlobClient blobClient = storageAccount.CreateCloudBlobClient();
// Retrieve a reference to a container.
var container = blobClient.GetContainerReference("azure-webjobs-hosts");
// Query out all the blobs which created after 30 days
var blobs = container.GetDirectoryReference("output-logs").ListBlobs().OfType<CloudBlob>()
.Where(b => b.Properties.LastModified < new DateTimeOffset(DateTime.Now.AddDays(-30)));
// Delete these blobs
foreach (var item in blobs)
{
item.DeleteIfExists();
}

What happened when using same Storage account for multiple Azure WebJobs (dev/live)?

In my small job, I just use same Storage Account for AzureWebJobsDashboard and AzureWebJobsStorage configs.
But what happened when we use same connection string for local Debugging and Published job equally? Are they treated in isolated manner? Or do they have any conflict issue?
I looked into blobs of published job and found azure-webjobs-dashboad/functions/instances or azure-webjobs-dashboad/functions/recent/by-job-run/{jobname}, or azure-webjobs-hosts/output-logs directories; they have no discriminator specified among jobs while some other directories have GUID with job name.
Note that my job will be run in continuous.
Or do they have any conflict issue?
No, there is no conflict issue. Base on my experience, it is not recommended to local debugging while Published job is running in the azure with the same connection string. Take Azure Storage queue for example, we can't control which queue should be executed in the local or in the azure. If we try to use debug it locally, please have a try to stop the continue WebJob from Azure Portal.
If we try to know WebJob is executed from which instance we could log the instance info in the code with the environment variable WEBSITE_INSTANCE_ID.The following is the code sample:
public static void ProcessQueueMessage([QueueTrigger("queue")] string message, TextWriter log)
{
string instance = Environment.GetEnvironmentVariable("WEBSITE_INSTANCE_ID");
string newMsg = $"WEBSITE_INSTANCE_ID:{instance}, timestamp:{DateTime.Now}, Message:{message}";
log.WriteLine(newMsg);
Console.WriteLine(newMsg);
}
More info please refer to how to use azure queue storage with the WebJob SDK. The following is snipped from the document.
If your web app runs on multiple instances, a continuous WebJob runs on each machine, and each machine will wait for triggers and attempt to run functions. The WebJobs SDK queue trigger automatically prevents a function from processing a queue message multiple times; functions do not have to be written to be idempotent
Update :
About timer trigger we could find more explanation of timer trigger in the WebJob from the GitHub document.So if you want to debug locally, please have a try to stop the WebJob from the azure portal.
only a single instance of a particular timer function will be running across all instances (you don't want multiple instances to process the same timer event).

Resources