Why does it take so long to create an Azure Container Instance?

I'm using the docker compose command with the ACI Docker context to spin up two containers in Azure Container Instances.
Sometimes it takes only a short while (under a minute) to get the containers up and running, but often it takes much longer (up to 5 minutes, I would say). Does anybody know why creating the ACI container group and starting the containers can be this slow? Can it be improved, for example by running the containers in a resource group that belongs to a different Azure "location"?
Thank you very much for any ideas.

Nobody here will be able to tell you exactly why there is a difference. It could be anything, from the time it takes to find a slot in an underlying compute cluster to the time it takes to pull the image from your container registry. As far as I know, there is no SLA on the startup time.
So yes, you could try different Azure regions and maybe get lucky in finding one that is less busy on the ACI side, but this might or might not help. (The resource group itself has nothing to do with it, as it is just a logical container.)

Related

Recommended Azure service to replace Azure functions

We have a service running as an Azure Function (Event and Service Bus triggers) that we feel would be better served by a different model. It takes a few minutes to run and loads a lot of objects into memory, and it seems to load them every time it is called instead of keeping them in memory, which would perform better.
What is the best Azure service to move to, with the following goals in mind?
It should be easy to move to and should not need too many code changes.
We have a long-term goal of being able to run this on-prem (Kubernetes might help us here).
I appreciate your help.

To achieve the first goal:
Move your Azure Function code into a continuously running WebJob. It has no maximum execution time, and it can run continuously while caching objects in its context.
To achieve the second goal (on-premises):
You need to explain this better, but a WebJob can be run as a console program on-premises, and you can also wrap it in a Docker container to move it from on-premises to any cloud. However, if you need to consume messages from an Azure Service Bus, you will need a hybrid on-premises/Azure approach that connects your local server to the cloud with a VPN or ExpressRoute.
Regards.

There are a couple of ways to solve this issue, each requiring a slightly larger amount of change from where you are now.
If you are just trying to separate out the heavy initial load, then you can do it once into a Redis Cache instance and then reference it from there.
If you are concerned about how long your worker can run, then WebJobs (as explained above) can work. However, that is something I'd suggest avoiding, since it's not where Microsoft is putting its resources. Rather, look at durable functions. Here an orchestrator function can drive a worker function. (Even here be careful: because durable functions retain history, the history tables might get too large after running for a very long time, so you may want to program in something like restarting the orchestrator after, say, 50,000 runs; obviously the number will vary based on your case.) Also see this.
If you want to add the constraint of portability to this, then you can run this function in a Docker image that can be run in an AKS cluster in Azure. This might not work well for durable functions (try it out, who knows :) ), but it will surely work for the worker functions (which would cost you the most compute anyway).
If you want to bring the workloads completely on-prem then Azure functions might not be a good choice. You can create an HTTP server using the platform of your choice (Node, Python, C#...) and have that invoke the worker routine. Then you can run this whole setup inside an image on an AKS cluster on prem and to the user it looks just like a load balanced web-server :) - You can decide if you want to keep the data on Azure or bring it down on prem as well, but beware of egress costs if you decide to move it out once you've moved it up.

It appears that the functions are affected by cold starts:
Serverless cold starts within Azure
Upgrading to the Premium plan would move your functions to pre-warmed instances, which should counter the problem you are experiencing:
Pre-warmed instances for Azure Functions
However, if you potentially want to deploy your function/triggers to on-prem, you should spin them out as microservices and deploy them with containers.
Currently, the fastest way would probably be to deploy the containerized triggers via Azure Container Instances if you don't already have a Kubernetes Cluster running. With some tweaking, you can deploy them on-prem later on.

There are a few options:
Move your function app to the Premium plan. But it will not help you much under heavy load and scale-out.
Issue: In that case you will start facing cold start issues again, and the problem will persist under heavy load.
Redis Cache. It will resolve most of your issues, as the main concern is the heavy loading.
Issue: If your system is multitenant, the cache will grow heavy over time.
Create small, micro durable functions. This does not quite answer your question, as you don't want lots of changes, but it will resolve most of your issues.

Why is a Managed Instance taking so long to create?

It has been almost two days since the managed instance creation started, and it still shows the deployment as in progress. This is the first time I'm creating an MI. Does anyone know how long it will take to create?
Very basic specification: Gen4, 8 cores, 256 memory; location: South Central US
I don't see any error yet.

Creating the first instance within a subnet takes a few hours, as Managed Instance is a customer VNet-injected service and it takes time to provision the whole dedicated cluster - it's much more work than taking a random pre-provisioned VM and spinning up a few processes on it.
That said, anything that takes more than 6 hours (at the moment, subject to improvement) indicates some sort of issue.
I'd suggest opening a support ticket, or you can contact me via private message with more details for the specific instance.

Two days is excessive, and likely indicates a bug in Azure's back-end scripts. When provisioning large amounts of objects in your Azure subscription, you can run out of resources within your 'Ring', which is a pre-allocated set of resources you have to work with. In the event you've exhausted these resources, it does take some time for Azure to provision more before the managed instance can be created.
I've deployed 10 or so Managed Instances and have seen deployment time take anywhere from 1 to 8 hours for a healthy deploy, and over a day for a deploy that ran into a bug.

How cloud services are provisioned (and billed) once a new deployment is requested to Azure REST API?

I'm using the Azure REST API to create, deploy and start a Cloud Service (classic) (cspkg hosted in Azure Storage) with hundreds of instances. I'm noticing that the time Azure takes to provision and start the requested instances varies widely. The first instances might start in 6-7 minutes, but the last ones might take up to 15-20 minutes, about 10 minutes longer than the first ones. So my questions are:
Is this the expected behaviour? If so, what's the logic behind it? Could I do anything to speed things up?
How is Azure billing this? Does it charge for the total number of instances from the moment the Cloud Service is deployed, or does it take into account when each individual instance actually starts?
UPDATE: I've been testing more scenarios and I've found a puzzling surprise. If I replace all the processes that my Cloud Service instances should run by a simple wait for some minutes (run .bat file with timeout command) then all the instances start almost at the same time (about 15 seconds between fastest and slowest instance). It was not just luck and random behaviour, I've proved that this behavior is repeatable and I can't even try to explain the root reason.

I also checked this a few weeks ago. The startup time depends on the size of the machine: a larger one has more resources, so it boots faster. Also, if there is any error or exception on startup, the VM will recycle until it can start successfully. I googled it but did not find any way to speed this up, so I don't think it is possible to do anything about the startup time. In the background, every time you deploy something, Azure creates a Windows Server, boots it up, deploys your package on it and puts your web roles behind a load balancer. This is why it takes so long: a lot of things are happening.
The billing part is also not the best for classic cloud services. You have to pay even during startup and recycling, and even when the service is turned off. So once you are done with your update, you should delete the VMs from your staging slot or scale it down, because you will pay for them even if they are turned off.

What is the optimal way to run a Node API in Docker on Amazon ECS?

With the advent of Docker and scheduling and orchestration services like Amazon's ECS, I'm trying to determine the optimal way to deploy my Node API. Docker and ECS aside, I've wanted to take advantage of the Node cluster library to gracefully handle crashing the Node app in the event of an asynchronous error, as suggested in the documentation, by creating a master process and multiple worker processes.
One of the benefits of the cluster approach, besides gracefully handling errors, is creating a worker process for each available CPU. But does this make sense in the Docker world? Would it make sense to have multiple Node processes running in a single Docker container that was going to be scaled into a cluster of EC2 instances on ECS?
Without the Node cluster approach, I'd lose the ability to gracefully handle errors, so I think that at a minimum I should run a master and one worker process per Docker container. I'm still confused as to how many CPUs to define in the Task Definition for ECS. The ECS documentation says something about each container instance having 1024 units per CPU, but that isn't the same thing as EC2 compute units, is it? And with that said, I'd need to pick EC2 instance types with the appropriate number of vCPUs to achieve this, right?
I understand that achieving the most optimal configuration may require some level of benchmarking my specific Node API application, but it would be awesome to have a better idea of where to start. Maybe there is some studying/research I need to do? Any pointers to guide me on the path or recommendations would be most appreciated!
Edit: To recap my specific questions:
Does it make sense to run a master/worker cluster as described here inside a docker container to achieve graceful crashing?
Would it make sense to use nearly identical code as described in the Cluster docs, to 'scale' to available CPUs via require('os').cpus().length?
What does Amazon mean in the documentation for ECS Task Definitions, where it says for the cpus setting that a container instance has 1024 units per CPU? And what would be a good starting point for this setting?
What would be a good starting point for the instance type to use for an ECS cluster aimed at serving a Node API based on the above? And how do the available vCPUs affect the previous questions?

All these technologies are new and best practices are still being established, so consider these to be tips from my experience only.
One-process-per-container is more of a suggestion than a hard and fast rule. It's fine to run multiple processes in a container when you have a use for it, especially in this case where a master process forks workers. Just use a single container and allow it to fork one process per core, as you've suggested in the question.
On EC2, instance types have a number of vCPUs, each of which appears as a core to the OS. For the ECS cluster, use an EC2 instance type such as the c3.xlarge with four vCPUs. In ECS this translates to 4096 CPU units. If you want the app to make use of all 4 vCPUs, create a task definition that requires 4096 CPU units.
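
The arithmetic behind those numbers is simple enough to sketch. The Python snippet below uses only the 1024-units-per-vCPU ratio mentioned in this thread; the helper names cpu_units and tasks_per_instance are made up for illustration, and the calculation ignores memory and any CPU reserved by the host itself.

```python
# ECS expresses CPU in units, with 1024 units per vCPU (as stated in the thread).
CPU_UNITS_PER_VCPU = 1024


def cpu_units(vcpus: float) -> int:
    """Convert a vCPU count (fractions allowed) to ECS CPU units."""
    return int(vcpus * CPU_UNITS_PER_VCPU)


def tasks_per_instance(instance_vcpus: int, task_cpu_units: int) -> int:
    """How many tasks with this CPU reservation fit on one instance (CPU only)."""
    return cpu_units(instance_vcpus) // task_cpu_units


if __name__ == "__main__":
    print(cpu_units(4))                 # four vCPUs, as on a c3.xlarge
    print(tasks_per_instance(4, 1024))  # one-vCPU tasks on that instance
```

So a task that reserves all 4096 units gets the whole four-vCPU instance to itself, while tasks reserving 1024 units each could share it four ways.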
But if you're doing all this only to stop the app from crashing you could also just use a restart policy to restart the container if it crashes. It appears that restart policies are not yet supported by ECS though.

That seems like a really good pattern. It's similar to what is done with Erlang/OTP, and I don't think anyone would dispute that it's one of the most robust systems on the planet. Now the question is how to implement it.
I would leverage patterns from Heroku or other similar PaaS systems that have a little bit more maturity. I'm not saying that amazon is the wrong place to do this, but simply that a lot of work has been done with this in other areas that you can translate. For instance, this article has a recipe in it:
https://devcenter.heroku.com/articles/node-cluster
As for the relationship between vCPUs and CPU units, it looks like it's just a straight ratio of 1/1024. It is a move toward microcharges based on CPU utilization. They are taking this even further with the Lambda work, where they charge you based on the fractions of a second that you use.

In the Docker world you would run one Node.js process per Docker container, but you would run many such containers on each of your EC2 instances. If you use something like fig, you can use fig scale <n> to run many redundant containers on an instance. This way you don't have to define your Node.js process count ahead of time, and each of your Node.js processes is isolated from the others.

How to terminate a particular Azure worker role instance

Background
I am trying to work out the best structure for an Azure application. Each of my worker roles will spin up multiple long-running jobs. Over time I can transfer jobs from one instance to another by switching them to a readonly mode on the source instance, spinning them up on the target instance, and then spinning the original down on the source instance.
If I have too many jobs then I can tell Azure to spin up extra role instances, and use them for new jobs. Conversely, if my load drops (e.g. during the night) then I can consolidate outstanding jobs onto a few machines and tell Azure to give me fewer instances.
The trouble is that (as I understand it) Azure provides no mechanism to allow me to decide which instance to stop. Thus I cannot know which servers to consolidate onto, and some of my jobs will die when their instance stops, causing delays for users while I restart those jobs on surviving instances.
Idea 1: I decide which instance to stop, and return from its Run(). I then tell Azure to reduce my instance count by one, and hope it concludes that the broken instance is a good candidate. Has anyone tried anything like this?
Idea 2: I predefine a whole bunch of different worker roles, with identical contents. I can individually stop and start them by switching their instance count from zero to one, and back again. I think this idea would work, but I don't like it because it seems to go against the natural Azure way of doing things, and because it involves me in a lot of extra bookkeeping to manage the extra worker roles.
Idea 3: Live with it.
Any better ideas?

In response to your ideas
Idea 1: I haven't tried doing exactly what you're describing, but in my experience your first instance has a name that ends with _0, the next _1 and I'm sure you can guess the rest. When you decrease the instance count it drops off the instance with the highest number suffix. I would be surprised if it took into account the state of any particular instance.
Idea 2: As I think you hint at, this will create management problems. You can only have 5 different workers per hosted service, so you'll need a service for each group of 5 roles that you want to be able to scale to. Also when you deploy updates you'll have to upload X times more services where X is the maximum number of instances you currently support.
Idea 3: Technically the easiest. Pending some clarification, this is probably what I'd be doing for now. To reduce the downsides of this option it may pay to investigate ways of loading the data faster. There is usually a Goldilocks level (not too much, not too little) of parallelism that helps with this.
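
If the observation under Idea 1 holds (scale-down removes the instances with the highest numeric suffix), you can at least predict which instances to consolidate onto. The Python sketch below assumes exactly that behaviour, which is reported here from experience and not guaranteed; the instance names in the usage note and the function names are hypothetical.

```python
import re


def instance_index(name: str) -> int:
    """Extract the numeric suffix from an instance name ending in _<n>."""
    match = re.search(r"_(\d+)$", name)
    if match is None:
        raise ValueError(f"no numeric suffix in {name!r}")
    return int(match.group(1))


def plan_scale_down(instances, target_count: int):
    """Split instances into (keep, drop), assuming the highest suffixes go first.

    This encodes the behaviour observed in the answer above; it is an
    assumption, not documented platform behaviour.
    """
    if not 0 <= target_count <= len(instances):
        raise ValueError("target_count must be between 0 and the instance count")
    ordered = sorted(instances, key=instance_index)  # numeric, not lexical, order
    return ordered[:target_count], ordered[target_count:]
```

For example, plan_scale_down(["Worker_IN_2", "Worker_IN_0", "Worker_IN_10", "Worker_IN_1"], 2) keeps Worker_IN_0 and Worker_IN_1, so jobs would be moved onto those two before lowering the instance count.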

You're right - you cannot choose which instance to stop. In general, you'd run the same jobs on each worker role instance, where each instance watches the same queue (or maybe multiple threads or jobs watching multiple queues).
If you really need to run a job on one instance (such as a scheduler), consider using blob leases as the way to constrain this. Create a blob as a mutex. Then, as each instance spins up, the scheduler job attempts to obtain a write lease on that blob. If it succeeds, it runs. If it fails, it simply sleeps (maybe for a minute) and tries again. At some point in the future, as you scale down in instance count, let's say the instance running the scheduler is killed. A minute later (or whatever time span you choose), another instance tries to acquire the lease, succeeds, and now runs the scheduler code.
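
The lease-as-mutex flow described above can be sketched without any Azure dependency. The Python class below is an in-memory stand-in for a blob lease, not the Azure Storage SDK: InMemoryLease, try_acquire and the worker names are all hypothetical, and the explicit now argument exists only to make the timeline easy to follow.

```python
import threading
import time
from typing import Optional


class InMemoryLease:
    """Hypothetical stand-in for a blob lease: one holder at a time, with expiry."""

    def __init__(self, duration_s: float) -> None:
        self._duration_s = duration_s
        self._holder: Optional[str] = None
        self._expires_at = 0.0
        self._lock = threading.Lock()

    def try_acquire(self, instance_id: str, now: Optional[float] = None) -> bool:
        """Acquire or renew the lease; return False if another holder is still live."""
        now = time.monotonic() if now is None else now
        with self._lock:
            free = self._holder is None or now >= self._expires_at
            if free or self._holder == instance_id:
                self._holder = instance_id
                self._expires_at = now + self._duration_s
                return True
            return False


if __name__ == "__main__":
    lease = InMemoryLease(duration_s=60)
    print(lease.try_acquire("worker_0", now=0.0))   # first instance runs the scheduler
    print(lease.try_acquire("worker_1", now=10.0))  # second instance sleeps and retries
    # worker_0 is killed and stops renewing; its lease runs out at t=60
    print(lease.try_acquire("worker_1", now=61.0))  # worker_1 takes over the scheduler
```

In a real deployment the lease would live on the blob itself, so the takeover works across machines; the in-memory version only shows the acquire, fail, retry and take-over sequence.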
