Azure Storage Performance: Queue vs Table

I've got a logging system set up that writes to Azure Table Storage, and it has worked well for a long time. However, there are now certain places in my code where I need to write a lot of messages to the log (50-60 msgs) instead of just a couple. The logging is also important enough that I can't spin up a new thread to finish the writes and return the MVC action before I know the log writes succeeded, because theoretically that thread could die. I have to write to the log before I return data to the web user.
According to the Azure dashboard, Table Storage transactions take ~37ms to commit, end to end (E2E), while queues only take ~6ms E2E to commit.
I'm now considering not logging directly to Table Storage, and instead logging to an Azure Queue and having a batch job read off the queue and put the entries in their proper place in Table Storage. That way I can still index them properly via their partition and row keys. I can also write just a single queue message containing all of the log entries, so it should only take ~6 ms instead of (37 * 50) ms.
I know that there are Table Storage batch operations. However, each of the log entries typically goes to a different partition, and batch operations must stay within a single partition.
I know that queue messages only live for 7 days, so I'll make sure to move queue messages into another store if they're older than a day (if it doesn't work the first 50 times, it just isn't going to work).
My question, then, is: what am I not thinking about? How could this completely kick me in the balls 4 months down the road?
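For reference, here is roughly what I have in mind for the queue side, as a sketch only; the Azure.Storage.Queues client and the LogEntry class are assumptions, not final code:

using System.Collections.Generic;
using System.Text.Json;
using System.Threading.Tasks;
using Azure.Storage.Queues;

public class LogEntry
{
    public string PartitionKey { get; set; }
    public string RowKey { get; set; }
    public string Message { get; set; }
}

public static class LogQueueWriter
{
    // One ~6 ms queue write instead of 50-60 table writes; a background job
    // later drains the queue and writes each entry to its own table partition.
    public static async Task EnqueueBatchAsync(QueueClient queue, IReadOnlyList<LogEntry> entries)
    {
        // Queue messages are capped at 64 KB, so the batch has to stay small
        // enough to fit once serialized.
        string payload = JsonSerializer.Serialize(entries);
        await queue.SendMessageAsync(payload);
    }
}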

Related

How do you scale an Azure Function App (background job) based on the number of items pending in a database?

Suppose you have an application that lets a user request a job. For example (hypothetical): a user uploads a video. An entry is made in an RDBMS with the URL of the video in blob storage, and the status is set to "Pending".
There is a recurring, timer-triggered Function App that executes every 10 seconds or so, gets 10 pending jobs from the RDBMS, and performs some compression, etc.
The problem here is that as long as the number of requests stays at 10-30 videos per 10 seconds, we should be fine. But if the number of requests suddenly increases, say to 200 requests per 10 seconds, there will be a lot of jobs pending and users will have to wait 10 times longer than usual to see the status change. How do you scale out the Function App automatically in such a scenario? Does it have to be manual?
There's an easier way to get fan out and parallel processing through multiple concurrently running Azure Functions.
Add an Azure Service Bus Queue to your solution.
For each video that needs to be processed, enqueue a service bus message with the appropriate data you'll need to retrieve and process the video (like the BlobId).
Have your Azure Function triggered by a ServiceBusTrigger.
Azure will spin up additional instances of your Azure Function as the queue depth increases. It'll also scale in idle instances after there's no more data to process.
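A minimal sketch of such a function, assuming the in-process C# model; the queue name "video-jobs", the connection setting "ServiceBusConnection", and the VideoJob class are placeholder names, not fixed ones:

using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public class VideoJob
{
    public string BlobId { get; set; }
}

public static class ProcessVideoJob
{
    [FunctionName("ProcessVideoJob")]
    public static async Task Run(
        [ServiceBusTrigger("video-jobs", Connection = "ServiceBusConnection")] string message,
        ILogger log)
    {
        // The message body carries whatever was enqueued, e.g. the BlobId.
        VideoJob job = JsonSerializer.Deserialize<VideoJob>(message);
        log.LogInformation("Compressing blob {BlobId}", job.BlobId);

        // ... download the blob, compress it, update the job status in the RDBMS ...
        await Task.CompletedTask;
    }
}

On the sending side you would serialize the same VideoJob shape and send it to the "video-jobs" queue when the user's upload completes.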

Parallelize Azure Logic App executions when copying a file from SFTP to Blob Storage

I have an Azure Logic App which gets triggered when a new file is added or modified in an SFTP server. When that happens the file is copied to Azure Blob Storage and then gets deleted from the SFTP server. This operation takes approximately 2 seconds per file.
The only problem I have is that these files (on average 500 KB) are processed one by one. Given that I'm looking to transfer around 30,000 files daily, this approach becomes very slow (something around 18 hours).
Is there a way to scale out/parallelize these executions?
I am not sure that Azure Logic Apps support scale-out/parallel execution directly. But based on my experience, if the timeliness requirements are not very high, you could use a ForEach action; the ForEach parallelism limit is 50, and the default is 20.
In your case, my suggestion is to loop on the trigger for a new or modified file in the SFTP server and insert a queue message, with the file path as its content, into an Azure Storage queue, ending the loop based on elapsed time or queue length. You can then get the queue message collection and, finally, in the ForEach action, fetch each queue message and the corresponding file from the SFTP server to create the blob.
If you're using C#, use Parallel.ForEach as Tom Sun said. If you go that route, I also recommend using the async/await pattern for the IO operation (saving to blob). It frees up the executing thread to serve other requests while the file is being saved.
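As a rough sketch of the async side (note: Parallel.ForEach doesn't await async lambdas, so this version throttles with a SemaphoreSlim and Task.WhenAll instead; the Azure.Storage.Blobs client, the local file list, and the cap of 20 concurrent uploads are all assumptions):

using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;
using Azure.Storage.Blobs;

public static class BlobUploader
{
    public static async Task UploadAllAsync(IEnumerable<string> localFiles, BlobContainerClient container)
    {
        var throttle = new SemaphoreSlim(20); // cap on concurrent uploads (assumed value)

        IEnumerable<Task> uploads = localFiles.Select(async path =>
        {
            await throttle.WaitAsync();
            try
            {
                // The executing thread is released while the upload is in flight.
                BlobClient blob = container.GetBlobClient(Path.GetFileName(path));
                await blob.UploadAsync(path, overwrite: true);
            }
            finally
            {
                throttle.Release();
            }
        });

        await Task.WhenAll(uploads);
    }
}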

Can a date and time be specified when sending data to Azure event hub?

Here's the scenario. I'm not working with real-time data. Instead, I get data from my electric company for the past day's electric usage. Specifically, each day I can get the number of kWh used in each hour of the previous day.
So I'd like to load this past information into Event Hub each following day. Is this doable? Does Event Hub support loading past information, or is it only and forever about real-time streaming data, with no ability to load past data in?
I'm afraid the latter is the case, as I've not seen any way to specify a date in what limited API documentation I could find. I'd like to confirm, though...
Thanks,
John
An Azure Event Hub is really meant for short-term retention. You may retain data for at most 7 days, after which it is deleted based on an enqueue timestamp recorded when the message first entered the Event Hub. It is therefore not practical to use an Azure Event Hub for data that's older than 7 days.
An Azure Event Hub is meant for message/event management, not long-term storage. A possible solution would be to write the Event Hub data to an Azure SQL database or to blob storage for long-term storage, then use Azure Stream Analytics (an event processor) to join the active stream with the legacy data that has accumulated in SQL. Also note that you can reference the appended enqueue attribute; it's called "EventEnqueuedUtcTime". Keep in mind that it reflects server time, whose clock may differ from the date/time of the actual measurement.
As for attaching a date/time: if you are sending the payload as JSON, simply add it as a key/value pair in the message. Example message with a time: { "Time": "My UTC Time here" }
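A minimal sketch of sending one hourly reading with its own timestamp in the payload, using the Azure.Messaging.EventHubs client; the hub name "usage" and the payload shape are assumptions:

using System;
using System.Text.Json;
using System.Threading.Tasks;
using Azure.Messaging.EventHubs;
using Azure.Messaging.EventHubs.Producer;

public static class UsageSender
{
    public static async Task SendReadingAsync(string connectionString, double kwh, DateTime hourUtc)
    {
        await using var producer = new EventHubProducerClient(connectionString, "usage");

        // The timestamp travels inside the message body; Event Hubs itself ignores it
        // and only records its own enqueue time.
        byte[] payload = JsonSerializer.SerializeToUtf8Bytes(new { Time = hourUtc, Kwh = kwh });

        using EventDataBatch batch = await producer.CreateBatchAsync();
        batch.TryAdd(new EventData(payload));
        await producer.SendAsync(batch);
    }
}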
A streaming system of this type doesn't care about times a particular application may wish to apply to the items. There simply isn't any processing that happens based on a time field unless your code does it.
Each message sent is an EventData, which carries an arbitrary set of bytes. You can easily include a date/time in that serialized payload, but Event Hubs won't care about it. There is no sorting or fixed ordering other than insertion order within a partition, which is defined by the sequence number. While the enqueued time is available, it's mostly useful for monitoring how far behind in processing you are.
As to the rest of your problem, I'd agree with the comment that Event Hubs may not really be the best choice. You can certainly load data into it once per day, but if it's really only 24 data points per day, it's not the appropriate technology choice unless it's a prototype/tech demo for a system that's eventually supposed to have a whole load of smart meters reporting to it with fair frequency. (Note also that Event Hubs cost $11/month minimum, a Service Bus queue $10/month minimum, and AWS SQS has a $0 minimum.)

Limit number of blobs processed at one time by Azure Webjobs

I have an Azure Webjob that copies large CSVs (500 MB to 10+ GB) into a SQL Azure table. I get a new CSV every day and I only retain records for 1 month, because it's expensive to keep them in SQL, so they are pretty volatile.
To get things started, I bulk uploaded last month's data (~200 GB), and I'm seeing all 30 CSV files getting processed at the same time. This causes a pretty crazy backup in the uploads, as shown by this picture:
I have about 5 pages that look like this, counting all of the retries.
If I upload them 2 at a time, everything works great! But as you can see from the running times, some can take over 14 hours to complete.
What I want to do is bulk upload 30 CSVs and have the Webjob only process 3 of the files at a time, then once one completes, start the next one. Is this possible with the current SDK?
Yes, absolutely possible.
Assuming the pattern you are using is a scheduled or on-demand WebJob that drops a message on a queue, which is then picked up by a continuously running WebJob that processes messages from the queue and does the work, you can use the JobHostConfiguration.Queues.BatchSize property to limit the number of queue messages processed at one time. Here's an example:
static void Main()
{
    JobHostConfiguration config = new JobHostConfiguration();

    // AzCopy cannot be invoked multiple times in the same host
    // process, so read and process one message at a time.
    config.Queues.BatchSize = 1;

    var host = new JobHost(config);
    host.RunAndBlock();
}
If you would like to see what this looks like in action, feel free to clone this GitHub repo I published recently on how to use WebJobs and AzCopy to create your own blob backup service. I had the same problem you're facing: I could not run too many jobs at once.
https://github.com/markjbrown/AzCopyBackup
Hope that is helpful.
Edit: I almost forgot. Besides changing the BatchSize property above, you can also have multiple VMs host and process these jobs, which basically lets you scale this out into multiple, independent, parallel processes. You may find that you can scale up the number of VMs and process the data very quickly instead of having to throttle it with BatchSize.

What assumptions can I make about global time on Azure?

I want my Azure role to reprocess data in case of sudden failures. I'm considering the following option.
For every block of data to process I have a database table row, and I could add a column meaning "time of last ping from a processing node". When a node grabs a data block for processing, it sets the "processing" state and that time to the current time, and it is then the node's responsibility to update that time, say, every minute. Periodically some node asks for "all blocks in the processing state whose ping time is more than ten minutes old", considers those blocks abandoned, and somehow queues them for reprocessing.
I have one very serious concern. The above approach requires that nodes have more or less the same time. Can I rely on all Azure nodes having the same time with some reasonable precision (say several seconds)?
For processing times under 2 hrs, you can usually rely on queue semantics (visibility timeout). If you have the data stored in blob storage, you can have a worker pop a queue message containing the name of the blob to work on and set a reasonable visibility timeout on the message (up to 2 hrs today). Once it completes the work, it can delete the queue message. If it fails, the delete is never called and after the visibility timeout, it will reappear on the queue for reprocessing. This is why you want your work to be idempotent, btw.
For processing that lasts longer than two hours, I generally recommend a leasing strategy where the worker leases the underlying blob data (if possible, or a dummy blob otherwise) using the intrinsic lease functionality in Windows Azure blob storage. When a worker goes to retrieve a file, it tries to lease it. A file that is already leased indicates a worker role is currently processing it. If a failure occurs, the lease is broken and the blob becomes leasable by another instance. Leases must be renewed every minute or so, but they can be held indefinitely.
Of course, you are keeping the data to be processed in blob storage, right? :)
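A rough sketch of that lease pattern with the current Azure.Storage.Blobs client (the 60-second lease duration and the names here are assumptions, not the original answer's code):

using System;
using System.Threading.Tasks;
using Azure;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Specialized;

public static class BlobLeaseWorker
{
    public static async Task<bool> TryProcessAsync(BlobClient blob)
    {
        BlobLeaseClient lease = blob.GetBlobLeaseClient();
        try
        {
            // A lease that cannot be acquired means another worker owns this blob.
            await lease.AcquireAsync(TimeSpan.FromSeconds(60));
        }
        catch (RequestFailedException)
        {
            return false; // already leased elsewhere
        }

        try
        {
            // ... process the blob, calling lease.RenewAsync() periodically
            //     so the lease is not lost mid-way ...
            return true;
        }
        finally
        {
            await lease.ReleaseAsync();
        }
    }
}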
As already indicated, you should not rely on synchronized times between VM nodes. If you store datetimes for any reason, use UTC or you will be sorry later.
The answer here isn't time-based synchronization (if you do go that route, make sure you use UtcNow); there is no guarantee anywhere that the clocks are synced, nor should there be.
For the problem you are describing, a queue-based system is the answer. I've referenced it a lot and will do so again: I've explained some of the benefits of queue-based systems in my blog post.
The idea is the following:
You put a work item on the queue
Your worker role (one or many of them) peeks & locks the message
You try to process the message; if you succeed, you remove the message from the queue; if not, you let it stay where it is
With your approach I would use AppFabric Queues, because you can also have topics & subscriptions, which let you monitor the data items. The example in my blog post covers this exact scenario, the only difference being that instead of a worker role I poll the queue from my web application. But the concept is the same.
I would try this a different way, using queue storage. Put your block of data on a queue with a timeout, then have your processing nodes (worker roles?) pull the data off the queue.
After the data is popped off the queue, if the processing node does not delete the entry, it will reappear on the queue for processing once the timeout period expires.
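A minimal sketch of that receive/process/delete cycle with the Azure.Storage.Queues client (the 30-minute visibility timeout is an assumed value):

using System;
using System.Threading.Tasks;
using Azure.Storage.Queues;
using Azure.Storage.Queues.Models;

public static class QueueWorker
{
    public static async Task ProcessOneAsync(QueueClient queue)
    {
        // Receiving hides the message from other workers for the visibility timeout.
        QueueMessage[] messages = await queue.ReceiveMessagesAsync(
            maxMessages: 1, visibilityTimeout: TimeSpan.FromMinutes(30));
        if (messages.Length == 0) return;

        QueueMessage message = messages[0];
        try
        {
            // ... process the block of data named in message.Body (work must be idempotent) ...

            // Delete only on success; otherwise the message reappears after the timeout.
            await queue.DeleteMessageAsync(message.MessageId, message.PopReceipt);
        }
        catch
        {
            // Let the visibility timeout return the message to the queue for a retry.
        }
    }
}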
Remote desktop into a role instance and check (a) the time zone (UTC, I think), and (b) that Internet Time is enabled in Date and Time settings. If so then you can rely on them being no more than a few ms apart. (This is not to say that the suggestions to use a message queue instead won't work, but perhaps they do not suit your needs.)
