I have a pipeline with a few copy activities. Some of those activities copy large amounts of data (a few TB) from a storage account back into the same storage account, but compressed.
After running the pipeline for a few hours I noticed that some activities show "Queue" time on the monitoring blade, and I was wondering what the reason for that "Queue" time could be. More importantly, am I being billed for that time as well? From what I understand, my ADF isn't doing anything during it.
Can someone shed some light? :)
(Posting this as an answer because of the comment chars limit)
After a long discussion with Azure Support and reaching out to someone at the ADF product team I got some answers:
1 - The queue time is not being billed.
2 - Initially, the ADF orchestration system puts the job in a queue, and it accrues "queue time" until the infrastructure picks it up and starts processing.
3 - In my case, the queue time kept increasing after the job started because of a bug in the underlying backend executor (it uses Azure Batch). Apparently the executors were crashing, and my job was suffering "re-pickup" time, which increased the queue time. This explained why, after some time, I started to see the execution time and the amount of transferred data decrease. The ETA for this bug fix is the end of the month. Additionally, the job I was executing timed out (after 7 days), and after checking the billing I confirmed that I wasn't charged a dime for it.
Based on the chart in the ADF Monitor, you can find the same metrics in the example. In fact, these metrics are reported in the executionDetails property: Queue Time + Transfer Time = Duration.
More details on the stages the copy activity goes through, and the corresponding steps, duration, used configurations, etc. It's not recommended to parse this section as it may change.
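As a quick sanity check of the Queue Time + Transfer Time = Duration relationship, you can pull the durations out of the copy activity's output. The payload below is a trimmed-down, illustrative sample; the field names follow the executionDetails shape, but treat them as an assumption rather than the full documented schema:

```python
import json

# Illustrative sample of a copy activity's output; the real payload from
# ADF monitoring has the same shape but many more fields.
activity_output = json.loads("""
{
  "executionDetails": [
    {
      "status": "Succeeded",
      "duration": 125,
      "detailedDurations": {
        "queuingDuration": 35,
        "transferDuration": 90
      }
    }
  ]
}
""")

detail = activity_output["executionDetails"][0]
queue_s = detail["detailedDurations"]["queuingDuration"]
transfer_s = detail["detailedDurations"]["transferDuration"]

# Queue time plus transfer time should account for the whole duration.
assert queue_s + transfer_s == detail["duration"]
print(f"queued {queue_s}s, transferring {transfer_s}s, total {detail['duration']}s")
```

This kind of check is only useful for ad-hoc monitoring; as the quote above warns, the section isn't meant to be parsed programmatically since its format may change.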
Please refer to Parallel Copy: the copy activity creates parallel tasks to transfer data internally. Activities remain in an active state during both queue time and transfer time; they never stop during queue time, so you are billed for the whole duration. I think this is an inevitable cost of the data transfer process that ADF absorbs internally. You could try adjusting the parallelCopies parameter to see if anything changes.
If you are concerned about the cost, you could submit feedback here to ask for a statement from the Azure team.
Related
I am trying to copy a backup file provided by an HTTP source. The download URL is only valid for 60 seconds, and the Copy step times out before it can complete (timeout set to 1 min 1 second). It completes on occasion but is very inconsistent. When it completes, the step queues for around 40 seconds; other times it is queued for over a minute, and the link has expired by the time it eventually gets to downloading the file. It is a single zipped JSON file, less than 100 KB.
Both Source and Sink datasets are using a Managed VNet IR we have created (must be used on Sink due to company policy), using the AutoResolve IR on the Source and it takes even longer queueing.
I've tried every variation of 'Max concurrent connections', 'DIU' and 'Degree of copy parallelism' in the Copy activity that I can think of, and none seems to have any effect. It appears to be random whether the queue time is short enough for the download to succeed.
Is there any way to speed up the queue process to try and get more consistent successful downloads?
Both Source and Sink datasets are using a Managed VNet IR we have created (must be used on Sink due to company policy), using the AutoResolve IR on the Source and it takes even longer queueing.
This is somewhat confusing. If your Source and Sink are in a Managed VNet, your IR should also be in the same Managed VNet for better security and performance.
As per this official document:
By design, a Managed VNet IR takes longer queue time than an Azure IR because we do not reserve one compute node per service instance, so there is a warm-up for each copy activity to start, and it occurs primarily on the VNet join rather than on the Azure IR.
Since there is no reserved node, there is no way to speed up the queue process itself.
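If the queue time can't be shortened, one workaround is to request a fresh download URL immediately before each attempt, instead of reusing a URL that may have expired while the activity sat in the queue. Below is a minimal sketch of that pattern; `get_fresh_url` and `fetch` are hypothetical callables standing in for the real HTTP calls (e.g. a Web Activity that re-requests the backup URL, and the actual download):

```python
import time

def download_with_fresh_url(get_fresh_url, fetch, attempts=3, backoff_s=1.0):
    """Request a new short-lived download URL immediately before each
    attempt, so queue/warm-up delays can't expire the link in between."""
    last_error = None
    for attempt in range(attempts):
        url = get_fresh_url()      # hypothetical: re-request the 60-second URL
        try:
            return fetch(url)      # hypothetical: the actual HTTP GET
        except Exception as exc:   # link expired or transient failure
            last_error = exc
            time.sleep(backoff_s * (attempt + 1))
    raise last_error

# Demo with fakes: the first URL has "expired" by the time we fetch it.
urls = iter(["https://example.com/dl?tok=old", "https://example.com/dl?tok=new"])

def fake_fetch(url):
    if "tok=old" in url:
        raise TimeoutError("link expired")
    return b"zipped-json"

print(download_with_fresh_url(lambda: next(urls), fake_fetch, backoff_s=0))
# -> b'zipped-json'
```

In ADF terms this would mean putting the URL-request step and the Copy activity inside an Until or retry loop, so each retry gets a link that hasn't already burned most of its 60 seconds in the queue.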
We have about 190 hourly usage files that need to arrive in the data lake within a 24-hour period before we can kick off our pipeline, which starts with an analytics activity. We have been running this pipeline on a schedule based on an estimate of when we expect all the files to have arrived, but that doesn't always happen, so we need to re-run the slices for the missing files.
Is there a more efficient way to handle this, so that the pipeline is not on a schedule but is instead triggered by the event that all files have arrived in the data lake?
TIA for input!
You can add an Event Trigger that fires when a new blob is created (or deleted). We do this in production with a Logic App, but Data Factory V2 appears to support it now as well. The benefit is that you don't have to estimate the proper frequency; you can just execute when necessary.
NOTE: there is a limit to the number of concurrent pipelines you can have executing, so if you dropped all 190 files into blob storage at once, you may run into resource availability issues.
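One way to combine the event trigger with the "run only when everything has arrived" requirement is to have each blob-created event run a cheap completeness check, and only kick off the main pipeline when the full expected set is present. A sketch of that check, with hypothetical file names (24 shown instead of ~190 for brevity):

```python
def ready_to_run(expected_names, arrived_names):
    """Return (ready, missing): ready only when every expected file is present."""
    missing = sorted(set(expected_names) - set(arrived_names))
    return (not missing, missing)

# Hypothetical naming convention for the hourly usage files.
expected = [f"usage_2023-01-01_{h:02d}.csv" for h in range(24)]
arrived = expected[:-2]  # two files have not landed yet

ready, missing = ready_to_run(expected, arrived)
print(ready, missing)
# -> False ['usage_2023-01-01_22.csv', 'usage_2023-01-01_23.csv']
```

In practice the arrived list would come from listing the landing folder (e.g. a Get Metadata activity or a blob listing in the Logic App), and the event that delivers the final missing file is the one that flips `ready` to True and triggers the analytics pipeline. This also sidesteps the concurrency limit mentioned above, since only one of the 190 events actually starts the pipeline.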
I am developing an application using Azure Cloud Services and Web API. I would like to allow users who create a consultation session to change the price of that session, but give all users 30 days to leave the session before the new price affects everyone currently signed up. My first thought was to use Queue Storage and set the visibility timeout to the 30-day limit, but this could grow the queue really fast over time, especially since the messages shouldn't run for 30 days, not to mention the ordering issues. I have also looked at the task scheduler, but session price changes are not recurring; they happen at random. Is the queue idea a good approach, or is there a better and more efficient way to accomplish this?
What you are trying to do should be done with a relational database. You can use timestamps to record when session prices changed. I wouldn't use a queue for this at all: a queue is for passing messages in a distributed system, while your problem is just about tracking which prices changed on which sessions and when. That data should be modeled in a database.
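To illustrate the timestamp idea: store each price change as an effective-dated row, and look up the latest price whose effective date has already passed. Anyone still signed up after the 30-day window automatically sees the new price; no delayed message is needed. A minimal sketch using SQLite (the table and column names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE session_price (
        session_id   INTEGER NOT NULL,
        price        REAL    NOT NULL,
        effective_at TEXT    NOT NULL   -- the date this price starts to apply
    )
""")
# A price change recorded today but effective 30 days later.
conn.executemany(
    "INSERT INTO session_price VALUES (?, ?, ?)",
    [(1, 100.0, "2024-01-01"), (1, 120.0, "2024-02-15")],
)

def price_on(session_id, date):
    """Latest price whose effective date is on or before the given date."""
    row = conn.execute(
        """SELECT price FROM session_price
           WHERE session_id = ? AND effective_at <= ?
           ORDER BY effective_at DESC LIMIT 1""",
        (session_id, date),
    ).fetchone()
    return row[0] if row else None

print(price_on(1, "2024-02-01"))  # old price still applies -> 100.0
print(price_on(1, "2024-03-01"))  # new price after the 30-day window -> 120.0
```

The nice property of this design is that the price change needs no scheduled job at all: the 30-day grace period is just data, queried at read time.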
I think this scenario is better suited to Azure Scheduler. Programmatically create a job with a one-time recurrence set to run once, 30 days later. When the scheduler triggers the job automatically, have its action call back to one of your APIs/services to apply the price change and any other required updates, and also remove the job from the scheduler as part of that action to keep the jobs list clean. The premium plan of an Azure Scheduler job collection gives you an unlimited number of jobs to run.
Hope this is exactly what you were looking for...
I would consider using Azure WebJobs. A WebJob basically gives you the ability to run a .NET console application within the context of an Azure Web App. It can be run on demand, continuously, or on a recurring schedule. If your processing requirements are low enough, a WebJob can also run in the same process that your Web App runs in, which saves you $$$ since it's free that way.
You could schedule the WebJob to run once or twice per day and examine the situation and react as is appropriate. Since it's really just a .NET worker role you have ultimate flexibility.
My application relies heavily on a queue in Windows Azure Storage (not Service Bus). Until two days ago it worked like a charm, but all of a sudden my worker role is no longer able to process all the items in the queue. I've added several counters, and from that data I deduced that deleting items from the queue is the bottleneck. For example, deleting a single item from the queue can take up to 1 second!
In the SO post How to achive more 10 inserts per second with azure storage tables and on the MSDN blog
http://blogs.msdn.com/b/jnak/archive/2010/01/22/windows-azure-instances-storage-limits.aspx I found some information on how to speed up communication with the queue, but those posts only look at the insertion of new items. So far, I haven't been able to find anything on why the deletion of queue items would be slow. So the questions are:
(1) Does anyone have a general idea why deletion suddenly may be slow?
(2) On Azure's status pages (https://azure.microsoft.com/en-us/status/#history) there is no mention of any service disruption in West Europe (which is where my stuff is located); can I rely on the service pages?
(3) In the same storage, I have a lot of data in blobs and tables. Could that amount of data interfere with the ability to delete items from the queue? Also, does anyone know what happens if you're pushing the data limit of 2TB?
1) Sorry, no. Not a general one.
2) Can you rely on the service pages? They will certainly give you information, but there is always a lag between the time an issue occurs and when it shows up on the status board. They are getting better at automating the updates, and in the management portal you are starting to see notifications when your particular deployments might be affected. That said, it is not unheard of for small issues to crop up from time to time that never appear on the board, because they don't break the SLA or are resolved extremely quickly. It's good you checked this, though; it's usually a good first step.
3) In general, no: the amount of data you have within a storage account should NOT affect your throughput; however, there is a limit to the throughput you'll get on a storage account (regardless of the amount of data stored). You can read about the Storage Scalability and Performance targets, but the throughput target is up to 20,000 entities or messages per second for all access to a storage account. If you have a LOT of applications or systems accessing data in that same storage account, you might see throttling or failures as you approach that limit. Note that, as you saw in the posts on improving insert throughput, these are targets; how your code is written and which configurations you use have a drastic effect on what you actually achieve. The data limit for a storage account (everything in it) is 500 TB, not 2 TB. I believe that once you hit the actual storage limit, all writes will simply fail until more space is available (I've never come close to it, so I'm not 100% sure of that).
Throughput is also limited at the partition level; for a queue the target is up to 2,000 messages per second, which you clearly aren't reaching. Since you have only a single worker role, I'll guess that you don't have that many message producers either, at least not enough to get near 2,000 messages per second.
I'd turn on storage analytics to see whether you are being throttled, and check the AverageE2ELatency and AverageServerLatency values (as Thomas also suggested in his answer) recorded in the $MetricsMinutePrimaryTransactionQueue table that analytics populates. This will give you an idea of the trends over time and may help determine whether it is a latency issue between the worker role and the storage system.
The reason I asked about the size of the worker role's VM is that each VM gets an (unpublished) amount of throughput based on its size. An XS VM gets a much smaller share of the NIC's total throughput than larger sizes. You can sometimes get more than you expect across the NIC, but only if the other deployments on the physical machine aren't using their portion of that bandwidth at the time. This can often lead to varying performance in network-bound work during testing. I'd still expect much better throughput than what you are seeing, though.
There is a network between you and Azure Storage, which can add latency.
Sudden peaks (e.g. from 20ms to 2s) can happen often, so you need to deal with this in your code.
To pinpoint the problem further (e.g. client issues, network errors, etc.), you can turn on storage analytics to see where the problem lies. There you can also see whether the end-to-end latency is too high or the server latency is the limiting factor. The former usually points to network issues, the latter to something being wrong with the queue itself.
Usually those latency issues are transient (just temporary), and there is no need to announce them as a service disruption, because they aren't one. If performance is consistently bad, you should open a support ticket.
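The end-to-end vs. server latency distinction can be turned into a rough triage rule: a large gap between the two points at the network or client, while high server latency points at the service itself. A sketch of that rule (the thresholds are arbitrary examples, not documented values):

```python
def classify_latency(avg_e2e_ms, avg_server_ms, gap_ratio=2.0, slow_ms=100):
    """Rough triage of queue latency from storage analytics metrics:
    a big gap between end-to-end and server latency implicates the
    network/client; high server latency implicates the service itself."""
    if avg_server_ms >= slow_ms:
        return "server-side latency (storage service is slow)"
    if avg_e2e_ms >= gap_ratio * avg_server_ms:
        return "client/network latency (retransmits, small VM NIC, etc.)"
    return "latency looks normal"

# Example values as they might appear in $MetricsMinutePrimaryTransactionQueue:
# the server answered in 8 ms but the caller saw nearly a second end to end.
print(classify_latency(avg_e2e_ms=950, avg_server_ms=8))
```

Here the verdict would be client/network latency, which matches the earlier point about per-VM NIC throughput: a 1-second delete with single-digit server latency is almost certainly not the queue service's fault.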
We would like to let our customers schedule recurring tasks on a daily, weekly, and monthly basis. Linear scalability is really important to us, which is why we use Windows Azure Table Storage instead of SQL Azure. The current design is the following:
- Scheduling information is stored in a Table Storage table. For example: Task A, daily; Task B, weekly; ...
- There are worker processes, which run hourly and query this table. Each then decides whether it has to run a given task.
But what if, multiple worker roles start to run the same task?
Some other requirements:
- The worker processes can be in different time zones.
Windows Azure Queue Storage could solve all the concurrency problems mentioned above, but it also introduces some new issues:
- How many queue items should we generate?
- What if the customer changes the recurrence rate or revokes the scheduling?
So, my question is: how to design a recurring task scheduler with multiple asynchronous workers using Windows Azure Storage?
Perhaps the new Azure Scheduler service could help?
http://www.windowsazure.com/en-us/services/scheduler/
Some thoughts:
But what if, multiple worker roles start to run the same task?
This could very well happen. To avoid it, you could have one worker role instance (any instance from the pool) read from the table and push messages into a queue, while all the other instances wait. To decide which instance does this work, you can use the blob lease functionality.
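The blob lease election can be sketched as follows. A real implementation would acquire a lease on an actual blob through the storage API; here a local lock stands in for the lease (FakeBlobLease is a hypothetical stand-in) so the shape of the pattern is visible:

```python
import threading

class FakeBlobLease:
    """Local stand-in for a blob lease: at most one holder at a time."""
    def __init__(self):
        self._lock = threading.Lock()

    def try_acquire(self):
        # Non-blocking, like trying to acquire a lease that may be held.
        return self._lock.acquire(blocking=False)

    def release(self):
        self._lock.release()

lease = FakeBlobLease()
results = []

def scheduling_cycle(name):
    # Every instance races for the lease; only the winner reads the
    # schedule table and enqueues messages, the others skip this cycle.
    if lease.try_acquire():
        results.append(f"{name} reads the table and enqueues tasks")

workers = [threading.Thread(target=scheduling_cycle, args=(f"w{i}",))
           for i in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(len(results))  # exactly one instance did the scheduling work
lease.release()      # in the real pattern the lease expires or is released
```

With a real blob lease you also get automatic expiry (leases can be time-limited), so if the elected instance crashes, another instance wins the next election instead of the work being stuck forever.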
Some other requirements: - The worker processes can be in different time zones.
Not sure about this. Assuming you're talking about Cloud Services worker roles, they could be in different data centers, but all of them will run in the UTC time zone.
How many queue items should we generate?
It really depends on how much work needs to be done. You could put all the messages in a queue, but a client can dequeue a maximum of 32 messages at a time. So if you have, say, 100 tasks and thus 100 messages, each instance can only read up to 32 of them from the queue in a single call to the queue service.
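The 32-message limit means draining a backlog takes ceil(n / 32) calls per instance. A small sketch of that arithmetic:

```python
def drain_in_batches(total_messages, batch_size=32):
    """Number of GetMessages calls needed to drain a backlog when each
    call returns at most `batch_size` messages (32 is the service limit)."""
    calls = 0
    remaining = total_messages
    while remaining > 0:
        taken = min(batch_size, remaining)
        remaining -= taken
        calls += 1
    return calls

print(drain_in_batches(100))  # 100 tasks -> 4 calls (32 + 32 + 32 + 4)
```

In practice this matters mostly for latency: with several instances polling, a 100-message backlog is gone in a handful of round trips, so the batch limit is rarely the bottleneck for a schedule of this size.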
What if the customer changes the recurrence rate or revokes the scheduling?
That should be OK, because once a task is completed you must remove its message from the queue. The next time the task is invoked, you can read from the table again, and it will give you the latest information about the task.
I would continue using Azure Table Storage, but mark the task as "in progress" before a worker starts working on it. Since ATS supports optimistic concurrency controlled by ETags, you can be sure that two processes won't be able to start the same task.
I would, however, think about retry logic for jobs that fail unexpectedly, and have a process that restarts jobs that appear to have been orphaned.
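The ETag check works like optimistic locking: both workers read the entity, the first conditional update succeeds and bumps the ETag, and the second fails with a 412 Precondition Failed. A local simulation of that behavior (EtagStore is a hypothetical stand-in, not the ATS client):

```python
class EtagStore:
    """In-memory stand-in for a Table Storage entity with ETag checks."""
    def __init__(self, entity):
        self.entity = dict(entity)
        self.etag = 1

    def read(self):
        return dict(self.entity), self.etag

    def replace(self, entity, if_match):
        # The update succeeds only if the caller's ETag is still current,
        # mirroring the 412 Precondition Failed you get from ATS.
        if if_match != self.etag:
            raise RuntimeError("412 Precondition Failed: entity was modified")
        self.entity = dict(entity)
        self.etag += 1

job = EtagStore({"Status": "Pending"})

# Two workers read the same job...
entity1, tag1 = job.read()
entity2, tag2 = job.read()

# ...the first claims it successfully...
entity1["Status"] = "InProgress"
job.replace(entity1, if_match=tag1)

# ...and the second worker's claim is rejected, so the task runs only once.
try:
    entity2["Status"] = "InProgress"
    job.replace(entity2, if_match=tag2)
except RuntimeError as err:
    print(err)
```

The losing worker simply moves on to the next task, and the orphan-recovery process mentioned above can reset a stale "InProgress" marker whose worker died.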