I have a container image prefetched on an Azure Batch account pool, but I notice that the pool doesn't auto-refresh the image when a new version is pushed to the container registry, even though I have asked Batch to cache the latest tag (:latest).
Is there a good way to do this (other than deleting and recreating the pool)? The pool machines are managed by Batch; I don't have direct access to them. Will refreshing the pool details prefetch the new container image?
Container images are loaded onto the machine at the time of the request: if specified at the pool level, they are pulled when the node starts up; if specified as part of the task, the image is pulled as part of the implicit docker run when the task executes. Subsequent task executions behave like a local docker run and do not re-download the image if it is already present on the node, just as you would expect when running locally.
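For reference, a minimal sketch of a pool-level container configuration (assuming the azure-batch Python SDK; the image name, registry credentials, and VM/node-agent SKUs are placeholders, and model names can vary slightly between SDK versions):

from azure.batch import models as batchmodels

# Images listed here are prefetched when each node starts up, not afterwards.
container_conf = batchmodels.ContainerConfiguration(
    type="dockerCompatible",
    container_image_names=["myregistry.azurecr.io/myapp:latest"],
    container_registries=[
        batchmodels.ContainerRegistry(
            registry_server="myregistry.azurecr.io",
            user_name="<registry-user>",
            password="<registry-password>",
        )
    ],
)

pool = batchmodels.PoolAddParameter(
    id="container-pool",
    vm_size="STANDARD_D2S_V3",
    target_dedicated_nodes=2,
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher="microsoft-azure-batch",
            offer="ubuntu-server-container",
            sku="20-04-lts",
            version="latest",
        ),
        node_agent_sku_id="batch.node.ubuntu 20.04",
        container_configuration=container_conf,
    ),
)

# batch_client is assumed to be an already-authenticated BatchServiceClient.
batch_client.pool.add(pool)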
There are a few workarounds:
Delete the pool, or resize it down to zero nodes and scale back up, to pick up new images.
Use a rolling-pool strategy: spin up a new pool with the same image reference, autoscale or drain down the old pool, and redirect work to the new pool. As an aside, this is generally a good strategy regardless of whether you use containers, since it also picks up the latest Azure Batch Node Agent versions.
Create a job preparation task that explicitly issues a docker pull of the image, so the image is re-pulled the first time the job runs on a node (see the sketch after this list). Note that you'd have to delete and recreate the job every time you update the container image.
Create a script that remotes into the machines and executes an image refresh command across all nodes.
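A rough sketch of the job-preparation-task approach from the list above (again assuming the azure-batch Python SDK; the job/pool IDs and image name are placeholders, and a private registry would also need a docker login in the command line):

from azure.batch import models as batchmodels

job = batchmodels.JobAddParameter(
    id="my-container-job",
    pool_info=batchmodels.PoolInformation(pool_id="container-pool"),
    # Runs once per node before any task of this job executes there,
    # forcing a fresh pull of the :latest tag on that node.
    job_preparation_task=batchmodels.JobPreparationTask(
        command_line="/bin/bash -c 'docker pull myregistry.azurecr.io/myapp:latest'",
        user_identity=batchmodels.UserIdentity(
            auto_user=batchmodels.AutoUserSpecification(
                scope=batchmodels.AutoUserScope.pool,
                elevation_level=batchmodels.ElevationLevel.admin,
            )
        ),
        wait_for_success=True,
    ),
)
batch_client.job.add(job)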
I ran a Docker container locally and it stores data in a file (currently no volume is mounted). I stored some data using the API, then crashed the container with process.exit(1) and started it again. The previously stored data in the container survives (as expected). But when I do the same thing in Kubernetes (minikube), the data is lost.
Posting this as a community wiki for better visibility; feel free to edit and expand it.
As described in the comments, Kubernetes replaces failed containers with new (identical) ones, which explains why the container's filesystem is clean after a restart.
Also, as noted, containers should be stateless. There are different options for running different kinds of applications and taking care of their data:
Run a stateless application using a Deployment
Run a stateful application either as a single instance or as a replicated set
Run automated tasks with a CronJob
Useful links:
Kubernetes workloads
Pod lifecycle
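As an illustration of running a stateful workload with storage that outlives any individual container, here is a minimal sketch using the official kubernetes Python client (names, image, and sizes are placeholders):

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config()
core = client.CoreV1Api()
apps = client.AppsV1Api()

# PersistentVolumeClaim: the data stored here survives container restarts
# and pod replacement.
pvc = client.V1PersistentVolumeClaim(
    metadata=client.V1ObjectMeta(name="app-data"),
    spec=client.V1PersistentVolumeClaimSpec(
        access_modes=["ReadWriteOnce"],
        resources=client.V1ResourceRequirements(requests={"storage": "1Gi"}),
    ),
)
core.create_namespaced_persistent_volume_claim(namespace="default", body=pvc)

# Single-instance Deployment whose pod mounts the claim at /data.
deployment = client.V1Deployment(
    metadata=client.V1ObjectMeta(name="my-app"),
    spec=client.V1DeploymentSpec(
        replicas=1,
        selector=client.V1LabelSelector(match_labels={"app": "my-app"}),
        template=client.V1PodTemplateSpec(
            metadata=client.V1ObjectMeta(labels={"app": "my-app"}),
            spec=client.V1PodSpec(
                containers=[
                    client.V1Container(
                        name="my-app",
                        image="my-app:latest",
                        volume_mounts=[
                            client.V1VolumeMount(name="data", mount_path="/data")
                        ],
                    )
                ],
                volumes=[
                    client.V1Volume(
                        name="data",
                        persistent_volume_claim=client.V1PersistentVolumeClaimVolumeSource(
                            claim_name="app-data"
                        ),
                    )
                ],
            ),
        ),
    ),
)
apps.create_namespaced_deployment(namespace="default", body=deployment)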
I am trying to create a pool of GPU-enabled, container-supported VMs. I have a valid ContainerConfiguration and start task, and the VM size is Standard_NC6. But whenever I create a pool, it always goes to the unusable state. If I remove the ContainerConfiguration setting, the nodes are in the idle state, but I don't think the problem is with the ContainerConfiguration settings, because if I choose the VM size Standard_F2s_v2 (non-GPU) and keep the same ContainerConfiguration settings, it works fine and installs all images on the machine. I think it has to do with some NVIDIA library installation while setting up the nodes.
I want to know when it is safe to remove a machine (node) from a cluster.
My assumption is that it is safe to remove a machine if it is not running any containers and does not store any useful data.
Using the APIs at https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/ResourceManagerRest.html, we can issue
GET http://<rm http address:port>/ws/v1/cluster/nodes
to get information about each node, like:
<node>
  <rack>/default-rack</rack>
  <state>RUNNING</state>
  <id>host1.domain.com:54158</id>
  <nodeHostName>host1.domain.com</nodeHostName>
  <nodeHTTPAddress>host1.domain.com:8042</nodeHTTPAddress>
  <lastHealthUpdate>1476995346399</lastHealthUpdate>
  <version>3.0.0-SNAPSHOT</version>
  <healthReport></healthReport>
  <numContainers>0</numContainers>
  <usedMemoryMB>0</usedMemoryMB>
  <availMemoryMB>8192</availMemoryMB>
  <usedVirtualCores>0</usedVirtualCores>
  <availableVirtualCores>8</availableVirtualCores>
  <resourceUtilization>
    <nodePhysicalMemoryMB>1027</nodePhysicalMemoryMB>
    <nodeVirtualMemoryMB>1027</nodeVirtualMemoryMB>
    <nodeCPUUsage>0.006664445623755455</nodeCPUUsage>
    <aggregatedContainersPhysicalMemoryMB>0</aggregatedContainersPhysicalMemoryMB>
    <aggregatedContainersVirtualMemoryMB>0</aggregatedContainersVirtualMemoryMB>
    <containersCPUUsage>0.0</containersCPUUsage>
  </resourceUtilization>
</node>
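For example, the same endpoint can be queried in its JSON form and filtered on numContainers (a sketch assuming the Python requests library; the ResourceManager address is a placeholder):

import requests

# Hypothetical ResourceManager web address; adjust host/port for your cluster.
RM = "http://resourcemanager.example.com:8088"

resp = requests.get(f"{RM}/ws/v1/cluster/nodes", headers={"Accept": "application/json"})
resp.raise_for_status()
nodes = resp.json()["nodes"]["node"]

# Nodes that are healthy but currently run no containers.
idle_nodes = [
    n["id"] for n in nodes
    if n["state"] == "RUNNING" and n["numContainers"] == 0
]
print(idle_nodes)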
If numContainers is 0, I assume the node is not running any containers. However, can it still store data on disk that downstream tasks will read?
I could not work out whether Spark exposes this. I assume that if a machine still stores data useful for the running job, it maintains a heartbeat with the Spark driver or some central controller? Can we check this by scanning TCP or UDP connections?
Is there any other way to check whether a machine in a Spark cluster participates in a job?
I am not sure whether you just want to know if a node is running any task (if that's what you mean by 'participate') or whether you want to know if it is safe to remove a node from the Spark cluster.
I will try to explain the latter point.
Spark has the ability to recover from failures, and this also applies to a node being removed from the cluster.
The removed node can be hosting an executor or the application master.
If the application master is removed, the entire job fails. But if you are using YARN as the resource manager, the job is retried and YARN provides a new application master. The number of retries is configured by:
yarn.resourcemanager.am.max-attempts
By default, this value is 2.
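For a single Spark application, the corresponding submit-side setting is spark.yarn.maxAppAttempts (a sketch, assuming PySpark on YARN; in cluster mode this is normally passed via spark-submit --conf rather than set in code):

from pyspark.sql import SparkSession

# Per-application cap on application master attempts; the effective value is
# bounded by the cluster-wide yarn.resourcemanager.am.max-attempts setting.
spark = (
    SparkSession.builder
    .appName("am-retry-example")
    .config("spark.yarn.maxAppAttempts", "2")
    .getOrCreate()
)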
If a node on which a task is running is removed, the resource manager (YARN) stops receiving heartbeats from that node. The application master will know it needs to reschedule the failed task, since it no longer receives progress updates from that node. It then requests resources from the resource manager and reschedules the task.
As far as data on these nodes is concerned, you need to understand how tasks and their output are handled. Every node has its own local storage for the output of the tasks running on it. After a task completes successfully, the OutputCommitter moves its output from local storage to the job's shared storage (e.g., HDFS), from where the data is picked up for the next stage of the job.
When a task fails (perhaps because the node running it failed or was removed), the task is rerun on another available node.
In fact, the application master will also rerun the tasks that had completed successfully on that node, since their output, stored on the node's local storage, is no longer available.
I have a Kubernetes cluster set up with 3 worker nodes and a master node. I am using Docker images to create pods and altering the running containers. I have been working on one image and have altered it heavily to make the servers inside it work. The problem I am facing is:
There is no space left on the device.
On further investigation, I found that the Docker container is set to a 10G size limit, which I now want to change. How can I change it without losing all my changes in the container and without having to store the changes as a separate image altogether?
Changing the limit without a restart is impossible.
To prevent data loss during a container restart, you can use Volumes and store the data there instead of in the root image.
P.S. It is not possible to mount a Volume dynamically into a container without a restart.
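As an illustration of the Volume approach (a minimal sketch using the official kubernetes Python client; the image, names, and mount path are placeholders):

from kubernetes import client, config

config.load_kube_config()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="my-app"),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="app",
                image="my-app:latest",
                # Large or frequently changing data lives on the mounted volume
                # instead of the container's writable layer, so it is not subject
                # to the storage-driver size limit.
                volume_mounts=[client.V1VolumeMount(name="scratch", mount_path="/data")],
            )
        ],
        volumes=[
            # An emptyDir survives container restarts within the pod; use a
            # PersistentVolumeClaim instead if the data must outlive the pod.
            client.V1Volume(name="scratch", empty_dir=client.V1EmptyDirVolumeSource())
        ],
    ),
)
client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)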
I am trying to figure out a way to start/run an external process on all workers before starting the jobs/tasks.
Specific use case: my job hits a service running on the node (localhost). The service itself runs in a Docker container. I want to start the container before starting the tasks on a worker and then stop it after all the jobs are done.
One approach could be to do rdd.mapPartitions, but that operates at the executor level and I cannot cleanly stop the container, as another partition might be executing on the same node (see the sketch below). Any suggestions?
As a workaround, I currently start the Docker containers while starting up the cluster itself, but that does not let me work with the multiple different containers that may be required for different jobs (in that case all containers would be running all the time, taking up node resources).
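For context, the mapPartitions approach mentioned above looks roughly like this (a sketch only; the image name and port are placeholders, and it deliberately stops short of stopping the container because of the overlap problem described above):

import subprocess
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("local-service-example").getOrCreate()

def process_partition(rows):
    # Best-effort start of the sidecar container on whichever node runs this
    # partition; check=False tolerates a container with that name already running.
    subprocess.run(
        ["docker", "run", "-d", "--name", "local-svc", "-p", "8080:8080", "svc-image"],
        check=False,
    )
    for row in rows:
        # ...call the service on localhost:8080 for each record...
        yield row
    # No clean place to stop the container here: another partition scheduled
    # on the same node may still be using it.

result = spark.sparkContext.parallelize(range(100), 8).mapPartitions(process_partition).collect()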