I'm using the Azure Autoscale feature to process hundreds of files. The system scales up correctly to 8 instances, and each instance processes one file at a time.
The problem is with scaling in. Because the scale-in rules seem to be based on ALL instances, if I tell it to reduce the instance count back to 1 after an average CPU load of < 25%, it will arbitrarily kill instances that are still processing data.
Is there a way to prevent it from shutting down individual instances that are still in use?
Scale-down removes the highest instance numbers first. For example, if you have WorkerRole_IN_0, WorkerRole_IN_1, ..., WorkerRole_IN_8, and then you scale down by 1, Azure will remove WorkerRole_IN_8 first. Azure has no idea what your code is doing (i.e. whether it is still processing a file or is finished and ready to shut down).
You have a few options:
If the file processing is quick, you can delay the shutdown for up to 5 minutes in the OnStop event, giving your instance enough time to finish processing the file. This is the easiest solution to implement, but not the most reliable.
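Here is a minimal sketch of that first option, assuming a worker role that tracks its own state in a flag (called _isProcessing here, an illustrative name your processing loop would maintain):

using System;
using System.Threading;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    // Set to true by the processing loop while a file is being worked on.
    private volatile bool _isProcessing;

    public override void OnStop()
    {
        // Azure allows roughly 5 minutes after OnStop is called before the instance is forcibly shut down.
        var deadline = DateTime.UtcNow.AddMinutes(5);

        // Block the shutdown until the current file is finished or the deadline passes.
        while (_isProcessing && DateTime.UtcNow < deadline)
        {
            Thread.Sleep(TimeSpan.FromSeconds(5));
        }

        base.OnStop();
    }
}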
If processing the file can be broken up into shorter chunks of work then you can have the instances process chunks until the file is complete. This way it doesn't really matter if an arbitrary instance is shut down since you don't lose any significant amount of work and another instance will pick up where it left off. See https://learn.microsoft.com/en-us/azure/architecture/patterns/pipes-and-filters for a pattern. This is the ideal solution as it is an optimized architecture for distributed workloads, but some workloads (ie. image/video processing) may not be able to break up easily.
You can implement your own autoscale algorithm and manually shut down individual instances that you choose. To do this you would call the Delete Role Instance API (https://msdn.microsoft.com/en-us/library/azure/dn469418.aspx). This requires some external process to be monitoring your workload and executing management operations so may not be a good solution depending on your infrastructure.
So I'm trying to understand Service Bus timings, especially how the locks work. One can choose to manually call CompleteAsync, which is what we're doing. It could also be the case that the processing takes some time. In these cases we want to make sure we don't get an unnecessary MessageLockLostException.
Seems there are a couple of numbers to relate to:
Lock duration (found in the Azure portal on the bus, currently set to 1 minute, which I think is the default)
AutoRenewTimeout (property on OnMessageOptions, currently set to 1 minute)
AutoComplete (property on OnMessageOptions, currently set to false)
Assume the processing runs for around 2 minutes and then either succeeds or crashes (it doesn't matter which for now). Let's say this is the normal scenario, so processing takes roughly 2 minutes for each message.
Also, it's indeed a queue and not a topic, and we only have one consumer that asynchronously processes the messages with MaxConcurrentCalls set to 100. We're using OnMessageAsync with ReceiveMode.PeekLock.
What should my settings now be as a single consumer to robustly process all messages?
I'm thinking that leaving the lock duration at 1 minute would be fine, as that's the default, and setting my AutoRenewTimeout to 5 minutes for safety, because as I've understood it this value should be the maximum time it takes to process a message (at least according to this answer). Performance is not critical for this system, so my reasoning is that leaving a message locked for an unnecessary 1, 2 or 3 minutes is not evil, as long as we don't get lock-lost exceptions, because those give no real value.
This thread and this thread give great examples of how to manually renew the locks, but I thought there was a way to renew them automatically.
What should my settings now be as a single consumer to robustly process all messages?
Aside from LockDuration, MaxConcurrentCalls, AutoRenewTimeout, and AutoComplete, there are some configurations of the Azure Service Bus client you might want to look into. For example, rather than creating a single client with MaxConcurrentCalls set to 100, create a few clients with the total concurrency level distributed among them. Note that you'd want to use different MessagingFactory instances to create those clients to ensure you have more than a single "pipe" to receive messages. And even with that, it would be far better to scale out and have competing consumers rather than a single consumer handling all the load.
Now back to the settings. If your normal processing time is 2 minutes, it's better to set MaxLockDuration on the entities to this time and not 1 minute. This will remove unnecessary lock extension calls to the broker and eliminate MessageLockLostException.
Also, keep in mind that AutoRenewTimeout is a client based operation, not broker, and therefore not guaranteed. You will run into cases where lock will be lost even though the AutoRenewTimeout time has not elapsed yet.
AutoRenewTimeout should always be set longer than MaxLockDuration, as it would be counterproductive to have them equal. Make it somewhat larger than MaxLockDuration: it is the client's "insurance" that the message lock won't be lost when processing takes longer than MaxLockDuration. Having the two equal, in essence, disables this fallback.
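Putting this together, here is a minimal sketch using the classic Microsoft.ServiceBus.Messaging SDK; the connection string, queue name, and the 2 x 50 split of concurrent calls are illustrative placeholders, not recommendations (it assumes the lock duration on the queue has been raised to around 2 minutes as suggested above):

using System;
using System.Threading.Tasks;
using Microsoft.ServiceBus.Messaging;

class Receiver
{
    static void Main()
    {
        const string connectionString = "<service-bus-connection-string>"; // placeholder
        const string queueName = "myqueue";                                // placeholder

        // Two factories => two separate connections ("pipes") to the broker.
        for (int i = 0; i < 2; i++)
        {
            var factory = MessagingFactory.CreateFromConnectionString(connectionString);
            var client = factory.CreateQueueClient(queueName, ReceiveMode.PeekLock);

            var options = new OnMessageOptions
            {
                AutoComplete = false,                      // we complete manually
                MaxConcurrentCalls = 50,                   // 2 clients x 50 = 100 total
                AutoRenewTimeout = TimeSpan.FromMinutes(5) // longer than the lock duration; client-side, not guaranteed
            };

            client.OnMessageAsync(async message =>
            {
                await ProcessAsync(message);   // the ~2 minute processing
                await message.CompleteAsync(); // manual completion
            }, options);
        }

        Console.ReadLine();
    }

    static Task ProcessAsync(BrokeredMessage message)
    {
        // Placeholder for the actual work.
        return Task.Delay(TimeSpan.FromMinutes(2));
    }
}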
I have written a Sybase stored procedure to move data from certain tables (~50) in the primary DB to the archive DB for a given id. Since archiving is taking a very long time, I am thinking of executing the same stored procedure in parallel, with a unique input id for each call.
I manually ran the stored proc twice at the same time with different inputs and it seems to work. Now I want to use Perl threads (maximum 4 threads), with each thread executing the same procedure with a different input.
Please advise whether this is a recommended way, or whether there is any other efficient way to achieve this. If the experts' choice is threads, any pointers or examples would be helpful.
What you do in Perl does not really matter here: what matters is what happens on the side of the Sybase server. Assuming each client task creates its own connection to the database, it's all fine, and how the client achieves this makes no difference to the Sybase server. But do not use a model where the different client tasks try to share the same client-server connection, as that will never run in parallel.
No 'answer' per se, but some questions/comments:
Can you quantify "taking a very long time to archive"? Assuming your archive process consists of a mix of insert/select and delete operations, do query plans and MDA data show fast, efficient operations? If you're seeing table scans, sort merges, deferred inserts/deletes, etc., then it may be worth the effort to address those performance issues.
Can you expand on the comment that running two stored proc invocations at the same time seems to work? Again, any sign of performance issues for the individual proc calls? Any sign of contention (eg, blocking) between the two proc calls? If the archival proc isn't designed properly for parallel/concurrent operations (eg, eliminate blocking), then you may not be gaining much by running multiple procs in parallel.
How many engines does your dataserver have, and are you planning on running your archive process during a period of moderate-to-heavy user activity? If the current archive process runs at/near 100% cpu utilization on a single dataserver engine, then spawning 4 copies of the same process could see your archive process tying up 4 dataserver engines with heavy cpu utilization ... and if your dataserver doesn't have many engines ... combined with moderate-to-heavy user activity at the same time ... you could end up invoking the wrath of your DBA(s) and users. The net result is that you may need to make sure your archive process doesn't hog the dataserver.
One other item to consider, and this may require input from the DBAs ... if you're replicating out of either database (source or archive), increasing the volume of transactions per a given time period could have a negative effect on replication throughput (ie, an increase in replication latency); if replication latency needs to be kept at a minimum, then you may want to rethink your entire archive process from the point of view of spreading out transactional activity enough so as to not have an effect on replication latency (eg, single-threaded archive process that does a few insert/select/delete operations, sleeps a bit, then does another batch, then sleeps, ...).
It's been my experience that archive processes are not considered high-priority operations (assuming they're run on a regular basis, and before the source db fills up); this in turn means the archive process is usually designed so that it's efficient while at the same time putting a (relatively) light load on the dataserver (think: running as a trickle in the background) ... ymmv ...
I wanted to process records from a database concurrently and within minimum time. So I thought of using a Parallel.ForEach() loop to process the records, with MaxDegreeOfParallelism set to Environment.ProcessorCount.
ParallelOptions po = new ParallelOptions
{
    MaxDegreeOfParallelism = Environment.ProcessorCount
};

Parallel.ForEach(listUsers, po, user =>
{
    // Process each user in parallel
    ProcessEachUser(user);
});
But to my surprise, the CPU utilization was not even close to 20%. When I dug into the issue and read the MSDN article on this (http://msdn.microsoft.com/en-us/library/system.threading.tasks.paralleloptions.maxdegreeofparallelism(v=vs.110).aspx), I tried setting MaxDegreeOfParallelism to -1. As the article says, this value removes the limit on the number of concurrently running operations, and the performance of my program improved considerably.
But that still did not meet my requirement for the maximum time allowed to process all the records. So I analyzed further and found that the thread pool has two settings, MinThreads and MaxThreads. By default MinThreads and MaxThreads are 10 and 1000 respectively: only 10 threads are created at the start, and the count keeps increasing, up to a maximum of 1000, with every new user unless a previous thread has finished its execution.
So I set the initial value of MinThreads to 900 in place of 10 using
System.Threading.ThreadPool.SetMinThreads(900, 900);
so that a minimum of 900 threads would be created right from the start, thinking this would improve the performance significantly. It did create 900 threads, but it also greatly increased the number of failures while processing each user, so I did not achieve much with this logic. I then changed MinThreads to 100 and found that the performance was much better.
But I wanted to improve further, as my time constraint was still not met: the process was still exceeding the time limit to process all the records. You may think I was already using all the best possible options to get maximum performance from parallel processing; I thought the same.
But to meet the time limit I decided to take a shot in the dark. I created two different executable files (Slaves) in place of only one and assigned each of them half of the users from the DB. Both executables did the same thing and ran concurrently. I created another Master program to start these two Slaves at the same time.
To my surprise, it reduced the time taken to process all the records nearly to the half.
Now my question is simple: I do not understand why the Master/Slave arrangement gives better performance than a single EXE, when the logic is the same in both the Slaves and the original EXE. I would highly appreciate it if someone could explain this in detail.
But to my surprise, the CPU utilization was not even close to 20%.
…
It uses HTTP requests to some Web APIs hosted in other networks.
This means that CPU utilization is entirely the wrong thing to look at. When using the network, it's your network connection that's going to be the limiting factor, or possibly some network-related limit, certainly not CPU.
Now I created two different executable files … To my surprise, it reduced the time taken to process all the records nearly to the half.
This points to an artificial, per-process limit, most likely ServicePointManager.DefaultConnectionLimit. Try setting it to a larger value than the default at the start of your program and see if it helps.
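A minimal sketch (the value 100 is just an illustrative choice, not a tuned recommendation):

using System.Net;

class Program
{
    static void Main()
    {
        // Raise the per-process limit on concurrent connections to a single host.
        // The .NET Framework default for client apps is only 2, which can throttle
        // parallel HTTP calls long before the CPU gets busy.
        ServicePointManager.DefaultConnectionLimit = 100;

        // ... start the Parallel.ForEach processing after this point ...
    }
}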
I'd like to run a single Azure instance on a predetermined schedule (e.g. 9-5 PM EST, Mon-Fri) to reduce billing and am wondering what the best way to go about it is.
Two parts to the question:
Can the Service Management API [1] be used to set the InstanceCount to 0 on a predetermined schedule?
If so, are you still billed for this service, as is the case with suspended deployments?
[1] http://blogs.msdn.com/b/gonzalorc/archive/2010/02/07/auto-scaling-in-azure.aspx
You can't set the instance count to zero, but you can suspend and then delete the deployment, and later redeploy, all programmatically.
Microsoft shipped the Autoscaling Application Block (Wasabi) which will guard your budget by changing instance counts based on timetables. It offers many other features, including an optimizing stabilizer that will take care of the hourly boundaries (concretely, it will limit scaling up operations to the beginning of the hour and scaling down operations to the end of the hour).
See my detailed answer with supported scenarios on this thread.
Steve covered your first bullet point.
For the second: if you suspend your deployment, you are still billed for it. You have to delete the deployment to stop the accrual of compute-hours.
Alternatively, you could use Lokad.CQRS or Lokad.Cloud to combine the tasks that don't need to run all the time on a single compute instance.
Of course, this approach is not universally applicable and depending on the specifics of your application it may not be suitable for your case.
Is there any inherent advantage to using multiple workers that each process a piece of the procedural code versus workers that each process the entire load?
In other words, if my workflow looks like this:
1. Get work from queue0 and do A
2. Store result from A in queue1
3. Get result from queue1 and do B
4. Store result from B in queue2
5. Get result from queue2 and do C
Is there an inherent advantage to using 3 workers that each do the entire process themselves versus 3 workers that each do a part of the work (worker 1 does steps 1 & 2, worker 2 does steps 3 & 4, worker 3 does step 5)?
If we only care about work being done (finishing step 5), it would seem to scale the same way (once you're using at least 3 workers). Maybe the big job is better because workers in that setup have fewer bottleneck issues?
In general, the smaller the jobs are, the less work you lose when some process crashes. Also, the smaller the jobs are, the more evenly you'll be able to distribute the work. (Instead of at one point having a single worker instance doing a long job and all the others idle, you'd have all the worker instances doing small pieces of work.)
Setting aside how to break up the work into smaller pieces, there's a question of whether there should be multiple worker roles, each of which can only do one kind of work, or a single worker role (but many instances) that can do everything. I would default to the latter (code that can do everything and just checks all the queues to see what needs to be done), but there are reasons to go with the former. If you need more RAM for one kind of work, for example, you might use a bigger VM size for that worker. Another example is if you wanted to scale the different kinds of work independently.
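As an illustration of the "one worker role that can do everything" approach, here is a minimal sketch assuming Azure Storage queues and the classic Microsoft.WindowsAzure.Storage SDK; the connection string, queue names, and A/B/C handlers are placeholders:

using System;
using System.Threading;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Queue;

class MultiPurposeWorker
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse("<storage-connection-string>"); // placeholder
        var client = account.CreateCloudQueueClient();
        var queue0 = client.GetQueueReference("queue0");
        var queue1 = client.GetQueueReference("queue1");
        var queue2 = client.GetQueueReference("queue2");

        while (true)
        {
            // Check every queue on each pass and do whatever work is available.
            // Using | instead of || so all three queues are checked each iteration.
            bool didWork =
                TryProcess(queue0, msg => DoA(msg, queue1)) |
                TryProcess(queue1, msg => DoB(msg, queue2)) |
                TryProcess(queue2, msg => DoC(msg));

            if (!didWork)
                Thread.Sleep(TimeSpan.FromSeconds(5)); // back off when all queues are empty
        }
    }

    static bool TryProcess(CloudQueue queue, Action<CloudQueueMessage> handler)
    {
        var message = queue.GetMessage();
        if (message == null) return false;
        handler(message);
        queue.DeleteMessage(message);
        return true;
    }

    // Placeholders for the actual A/B/C steps.
    static void DoA(CloudQueueMessage m, CloudQueue next) { next.AddMessage(new CloudQueueMessage("result of A")); }
    static void DoB(CloudQueueMessage m, CloudQueue next) { next.AddMessage(new CloudQueueMessage("result of B")); }
    static void DoC(CloudQueueMessage m) { /* final step */ }
}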
Adding to what @smarx says:
The model of a "multipurpose" worker is of course more general. So even if you require specialized types (like the extra RAM example used above) you would simply have a single task in that particular role.
There's also the perspective of cost. You will have an economic incentive to increase the "task density" (as in tasks per instance). If you have M types of work and you assign each one to a different worker role, then you will pay for M instances, even if some of those only do work every once in a while.
I blogged about this some time ago and it is one topic of our guide (chapter "06 week3.docx")
Many frameworks and samples (including ours) use this approach.