Limit number of blobs processed at one time by Azure Webjobs - azure

I have an Azure Webjob that copies large CSVs (500 MB to 10+ GB) into a SQL Azure table. I get a new CSV every day and I only retain records for 1 month, because it's expensive to keep them in SQL, so they are pretty volatile.
To get them started, I bulk uploaded last month's data (~200 GBs) and I'm seeing all 30 CSV files getting processed at the same time. This causes a pretty crazy backup in the uploads, as shown by this picture:
I have about 5 pages that look like this counting all of the retries.
If I upload them 2 at a time, everything works great! But as you can see from the running times, some can take over 14 hours to complete.
What I want to do is bulk upload 30 CSVs and have the Webjob only process 3 of the files at a time, then once one completes, start the next one. Is this possible with the current SDK?

Yes, absolutely possible.
Assuming the pattern you are using here is to use Scheduled or On-Demand WebJobs that pop a message on a queue which is then picked up by a constantly running WebJob that processes messages from the queue and then does the work you can use the JobHost.Queues.BatchSize property to limit the number of queue messages that can be processed at one time. H
static void Main()
{
JobHostConfiguration config = new JobHostConfiguration();
//AzCopy cannot be invoked multiple times in the same host
//process, so read and process one message at a time
config.Queues.BatchSize = 1;
var host = new JobHost(config);
host.RunAndBlock();
}
If you would like to see what this looks like in action feel free to clone this GitHub repo I published recently on how to use WebJobs and AzCopy to create your own Blob Backup service. I had the same problem you're facing which is I could not run too many jobs at once.
https://github.com/markjbrown/AzCopyBackup
Hope that is helpful.
Edit, I almost forgot. While you can change the BatchSize property above you can also take advantage of having multiple VM's host and process these jobs too which basically allows you to scale this into multiple, independent, parallel processes. You may find that you can scale up the number of VM's and process the data very quickly instead of having to throttle it using BatchSize.

Related

How do you scale Azure function app(background job) based on the number items pending in a database?

So suppose that you have an application that lets user request a job. For example (hypothetical): user uploads a video. There is an entry made in RDBMs with the URL to video on blob and the status is set to "Pending".
There is a recurring time triggered functionapp that is executed every 10 seconds or so which gets 10 pending jobs from RDBMS and performs some compression etc.
The problem here is that as long as the number of requests stay 10-30 videos per 10 seconds we should be fine. But if the number of requests increase all of a sudden .. say 200 requests per 10 seconds this would mean that there will be a lot of job pending and the user would have to wait 10 times longer than usual to see status change. How do you scale out function app automatically in such scenario? Does it have to be manual?
There's an easier way to get fan out and parallel processing through multiple concurrently running Azure Functions.
Add an Azure Service Bus Queue to your solution.
For each video that needs to be processed, enqueue a service bus message with the appropriate data you'll need to retrieve and process the video (like the BlobId).
Have your Azure Function triggered by an ServiceBusTrigger.
Azure will spin up additional instances of your Azure Function as the queue depth increases. It'll also scale in idle instances after there's no more data to process.

Parallelize Azure Logic App executions when copying a file from SFTP to Blob Storage

I have an Azure Logic App which gets triggered when a new file is added or modified in an SFTP server. When that happens the file is copied to Azure Blob Storage and then gets deleted from the SFTP server. This operation takes approximately 2 seconds per file.
The only problem I have is that these files (on average 500kb) are processed one by one. Given that I'm looking to transfer around 30,000 files daily this approach becomes very slow (something around 18 hours).
Is there a way to scale out/parallelize these executions?
I am not sure that there is a scale out/parallelize execution on Azure Logic App. But based on my experience, if the timeliness requirements are not very high, we could use Foreach to do that, ForEach parallelism limit is 50 and the default is 20.
In your case, my suggestion is that we could do loop to trigger when a new file is added or modified in an SFTP then we could insert a queue message with file path as content to azure storage queue, then according to time or queue length to end the loop. We could get the queue message collection. Finally, fetch the queue message and fetch the files from the SFTP to create blob in the foreach action.
If you're C# use Parallel.ForEach like Tom Sun said. If you use this one I also recommend to use async/await pattern for IO operation (save to blob). It will free up the executing thread when file is being saved to serve some other request.

Azure Storage Performance Queue vs Table

I've got a nice logging system I've set up that writes to Azure Table Storage and it has worked well for a long time. However, there are certain places in my code where I need to now write a lot of messages to the log (50-60 msgs) instead of just a couple. It is also important enough that I can't start a new thread to finish writing to the log and return the MVC action before I know the log is successful because theoretically that thread could die. I have to write to the log before I return data to the web user.
According to the Azure dashboard, Table Storage transactions take ~37ms to commit, end to end (E2E), while queues only take ~6ms E2E to commit.
I'm now considering not logging directly to table storage, and instead log to an Azure Queue, then have a batch job run that reads off the queue and then puts them in their proper place in table storage. That way I can still index them properly via their partition and row keys. I can also write just a single queue message with all of the log entries. So it should only take 6ms instead of (37 * 50) ms.
I know that there are Table Storage batch operations. However, each of the log entries typically goes to different partition, and batch ops need to stay within a single partition.
I know that queue messages only live for 7 days, so I'll make sure I store queue messages in a new mechanism if they're older than a day (if it doesn't work the first 50 times, it just isn't going to work).
My question, then is: what am I not thinking about? How could this completely kick me in the balls in 4 months down the road?

Can Azure WebJobs poll queues on demand?

I have a WebJob which gets triggered when a user uploads a file to the blob storage - it is triggered by a queue storage message which is created once the upload is complete.
Depending on the purpose of the file, it will post messages to other queues to trigger processing jobs.
Some of these jobs are time critical, and run relatively quickly. In one case the processing takes about three seconds, and the user is waiting for the result.
However, because the minimum queue polling interval is 2 seconds, the time the user must wait for the two WebJobs to be invoked is generally doubling their wait time.
I tried combining the two WebJobs into one, hoping that when the first handler posts a queue message the corresponding processing handler would be immediately triggered, but in fact it consistently waits two seconds before picking up the message.
My question is, is there a way for me to tell my WebJob to check the queue triggers immediately from within the same WebJob if I know there is a message waiting? Or even better configure it to immediately check the queue triggers if I post to a queue from inside the WebJob?
Or would switching to a service bus queue improve the responsiveness to new messages?
Update
In the docs about using blob triggers, it says:
There is an exception for blobs that you create by using the Blob attribute. When the WebJobs SDK creates a new blob, it passes the new blob immediately to any matching BlobTrigger functions. Therefore if you have a chain of blob inputs and outputs, the SDK can process them efficiently. But if you want low latency running your blob processing functions for blobs that are created or updated by other means, we recommend using QueueTrigger rather than BlobTrigger.
http://azure.microsoft.com/en-gb/documentation/articles/websites-dotnet-webjobs-sdk-storage-blobs-how-to/
However there is no mention of anything similar for queues. Meaning if you need really low latency in this scenario then blobs are the better than queues, which seems wrong.
Update 2
I ended up working around this by pulling the orchestrating code out of the first WebJob and into the service layer of the application and removing the WebJob.. it was fast running anyway so perhaps separating it into its own WebJob was an overkill. This means only the processing WebJob has to be triggered after the file upload.
Currently 2 sec is the minimum time it will take for the SDK to poll for the new message. The SDK does an exponential back off polling so you can configure the MaxPollingInterval to be 2 sec always.
config.Queues.MaxPollingInterval = TimeSpan.FromSeconds(15);
For more details please see http://azure.microsoft.com/en-us/documentation/articles/websites-dotnet-webjobs-sdk-storage-queues-how-to/#config

What assumptions can I make about global time on Azure?

I want my Azure role to reprocess data in case of sudden failures. I consider the following option.
For every block of data to process I have a database table row and I could add a column meaning "time of last ping from a processing node". So when a node grabs a data block for processing it sets "processing" state and that time to "current time" and then it's the node responsibility to update that time say every one minute. Then periodically some node will ask for "all blocks that have processing state and ping time larger than ten minutes" and consider those blocks as abandoned and somehow queue them for reprocessing.
I have one very serious concern. The above approach requires that nodes have more or less the same time. Can I rely on all Azure nodes having the same time with some reasonable precision (say several seconds)?
For processing times under 2 hrs, you can usually rely on queue semantics (visibility timeout). If you have the data stored in blob storage, you can have a worker pop a queue message containing the name of the blob to work on and set a reasonable visibility timeout on the message (up to 2 hrs today). Once it completes the work, it can delete the queue message. If it fails, the delete is never called and after the visibility timeout, it will reappear on the queue for reprocessing. This is why you want your work to be idempotent, btw.
For processing that lasts longer than two hours, I generally recommend a leasing strategy where the worker leases the underlying blob data (if possible or a dummy blob otherwise) using the intrisic lease functionality in Windows Azure blob storage. When a worker goes to retrieve a file, it tries to lease it. A file that is already leased is indicative of a worker role currently processing it. If failure occurs, the lease will be broken and it will become leasable by another instance. Leases must be renewed every min or so, but they can be held indefinitely.
Of course, you are keeping the data to be processed in blob storage, right? :)
As already indicated, you should not rely on synchronized times between VM nodes. If you store datetimes for any reason - use UTC or you will be sorry later.
The answer here isn't to use time based synchronization (if you would however, make sure you use UTCNow), but there is still no guarantee anywhere that the clocks are synced. Nor should there be.
For the problem you are describing a queue based system is the answer. I've been referencing a lot to it, and will do it again, but I've explained some benefits of queue based systems in my blog post.
The idea is the following:
You put a work item to the queue
Your worker role (one or many of them) peeks & locks the message
You try to process the message, if you succeed, you remove the message from the queue,
if not, you let it stay where it is
With your approach I would use AppFabric Queues because you can also have topics & subscriptions, which allows you to monitor the data items. The example in my blog post coveres this exact scenario, with the only difference being that instead of having a worker role I poll the queue from my web application. But the concept is the same.
I would try this a different way using queue storage. If you pop your block of data on a queue with a timeout then have your processing nodes (worker roles?) pull this data off the queue.
After the data is popped off the queue if the processing node does not delete the entry from the queue it will reappear on the queue for processing after the timeout period.
Remote desktop into a role instance and check (a) the time zone (UTC, I think), and (b) that Internet Time is enabled in Date and Time settings. If so then you can rely on them being no more than a few ms apart. (This is not to say that the suggestions to use a message queue instead won't work, but perhaps they do not suit your needs.)

Resources