Kubernetes Jobs or Pods for run-to-completion tasks with autoscaling - Linux

I have CPU-intensive jobs/tasks that need to run in Kubernetes. The process for each job/task is:
We receive a request via a queue or an API call
A pod should be created to process the task (some jobs may run for minutes, others for hours)
The pod is deleted once the task completes
This should happen at scale: if there are more jobs in the queue, create more pods (up to a maximum we define, e.g. 10, 20, or 30).
I used KEDA; a pod gets created, but after the job completes it goes into CrashLoopBackOff. This is the default behaviour in the pod lifecycle: Kubernetes tries to recreate the container because the restart policy is set to Always. There are other options, OnFailure and Never, but I have read that Kubernetes Jobs are more suitable.
Which is the better option for the above task, Kubernetes Pods or Jobs? We should consider scaling the pods, and we also need to scale the Kubernetes nodes (cloud vendors support this) based on usage and the number of tasks in the queue.

KEDA ScaledJobs are best for such scenarios and can be triggered through a queue, storage, etc. (the currently available scalers can be found here)
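For illustration, a minimal ScaledJob sketch; the names, the image, and the azure-queue trigger below are assumptions, so swap in whichever scaler matches your queue:

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: task-worker
spec:
  jobTargetRef:
    template:
      spec:
        restartPolicy: Never          # Jobs require Never or OnFailure, which avoids the CrashLoopBackOff issue
        containers:
          - name: worker
            image: myregistry/task-worker:latest   # hypothetical worker image
  maxReplicaCount: 10                 # the "max 10, 20, 30" cap from the question
  pollingInterval: 30                 # seconds between queue-length checks
  triggers:
    - type: azure-queue               # example scaler; pick one from the KEDA scaler list
      metadata:
        queueName: tasks                        # hypothetical queue name
        connectionFromEnv: STORAGE_CONNECTION   # hypothetical env var holding the connection string

Each queue message results in a Job whose pod runs to completion and is then cleaned up, which matches the create-process-delete flow described in the question.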

Related

Scale up the Spark worker nodes using code

I want to scale up the Spark cluster so that all the worker nodes are up and running before I start my processing. The issue is that the autoscaling of worker nodes does not happen immediately on load, which leads to worker node crashes. The cluster has 32 nodes but is overloading only 4 nodes and crashing, so what I am trying to do is write some lines of code at the start of the Python notebook that kick-start the remaining nodes, so that 24 nodes are up and running before the actual data processing. Is this possible using code? Please advise.
In general, autoscale is for interactive workloads. I've rarely seen it provide benefits in jobs, though marketing does a good job of selling it as a cost-saving feature.
You can use Databricks Jobs to create an automated cluster: the job runs on a new automated cluster, and the cluster is terminated when the job is complete.
If you know better than autoscale when scaling up should happen, you can use the resize API: https://docs.databricks.com/dev-tools/api/latest/clusters.html#resize
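As a sketch, the resize call is a POST to /api/2.0/clusters/resize; the cluster ID below is a placeholder:

{
  "cluster_id": "0123-456789-abcdef12",
  "num_workers": 24
}

You could issue this request at the top of the notebook and poll the cluster state until the workers are up before starting the heavy processing.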

Kubernetes workload scaling on multi-threaded code

Getting started with Kubernetes, so I have the following question:
Say a microservice has the following C# code snippet:
var tasks = _componentBuilders.Select(b =>
{
    return Task.Factory.StartNew(() => b.SetReference(context, typedModel));
});
Task.WaitAll(tasks.ToArray());
On my box, I understand that each thread will be executed on a vCPU. So if I have 4 cores with hyperthreading enabled, I will be able to execute 8 tasks concurrently. Therefore, if I have about 50,000 tasks, it will take roughly
(50,000/8) * approximate time per task
to complete this work. This ignores context switch, etc.
Now, moving to the cloud, and assuming this code is in a Docker container managed by a Kubernetes Deployment, and we have a single container per VM to keep this simple: how does the above code scale horizontally across the VMs in the deployment? I cannot find very clear guidance on this, so if anyone has any reference material, that would be helpful.
You'll typically use a Kubernetes Deployment object to deploy application code. That has a replicas: setting, which launches some number of identical disposable Pods. Each Pod has a container, and each pod will independently run the code block you quoted above.
The challenge here is distributing work across the Pods. If each Pod generates its own 50,000 work items, they'll all do the same work and things won't happen any faster. Just running your application in Kubernetes doesn't give you any prebuilt way to share thread pools or task queues between Pods.
A typical approach here is to use a job queue system; RabbitMQ is a popular open-source option. One part of the system generates the tasks and writes them into RabbitMQ. One or more workers read jobs from the queue and run them. You can set this up and demonstrate it to yourself without using container technology, then repackage it in Docker or Kubernetes, changing only the RabbitMQ broker address at deploy time.
In this setup I'd probably have the worker run jobs serially, one at a time, with no threading. That will simplify the implementation of the worker. If you want to run more jobs in parallel, run more workers; in Kubernetes, increase the Deployment's replicas: count.
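A sketch of such a worker Deployment; the image name and the broker address are assumptions:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-worker
spec:
  replicas: 8                        # 8 jobs run in parallel; scale this instead of adding threads
  selector:
    matchLabels:
      app: task-worker
  template:
    metadata:
      labels:
        app: task-worker
    spec:
      containers:
        - name: worker
          image: myregistry/worker:latest      # hypothetical worker image
          env:
            - name: RABBITMQ_HOST              # broker address injected at deploy time
              value: rabbitmq.default.svc.cluster.local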
In Kubernetes, when we deploy containers as Pods we can include the resources.limits.cpu and resources.requests.cpu fields for each container in the Pod's manifest:
resources:
  requests:
    cpu: "1000m"
  limits:
    cpu: "2000m"
In the example above we have a request for 1 CPU and a limit of a maximum of 2 CPUs. This means the Pod will be scheduled onto a worker node which can satisfy the above resource requirements.
One cpu, in Kubernetes, is equivalent to 1 vCPU/Core for cloud providers and 1 hyperthread on bare-metal Intel processors.
We can vertically scale by increasing / decreasing the values for the requests and limits fields. Or we can horizontally scale by increasing / decreasing the number of replicas of the pod.
For more details about resource units in Kubernetes, see here.
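For illustration, the horizontal path can also be automated with a HorizontalPodAutoscaler; a minimal sketch, assuming a Deployment named task-worker with the CPU request defined above:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: task-worker-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: task-worker
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80   # percent of the CPU request defined above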

Kubernetes: shut down node between job runs

I run a Kubernetes CronJob every week, on a Kubernetes cluster that has a single node in it. It runs on Google Compute Engine.
I would like to shut down the node completely between two job runs, for billing purposes (we pay the price as if the machine was used for the whole week, but it is actually used for only a few hours).
Is it possible to boot the node, run the job, then shut down the node?
The cluster autoscaler can help with this:
https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler
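With the autoscaler enabled and a node pool that is allowed to scale to zero, a pending CronJob pod triggers a scale-up, and the node is removed again once the job finishes. A sketch of the weekly CronJob, where the schedule, image, and request value are placeholders:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: weekly-job
spec:
  schedule: "0 3 * * 1"              # hypothetical: every Monday at 03:00
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: job
              image: myregistry/weekly-job:latest   # hypothetical image
              resources:
                requests:
                  cpu: "1000m"       # an explicit request so the autoscaler provisions a node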

Does Kubernetes reschedule pods to other minions if the minions join later?

I was running a test with, initially, one minion node and one master node. I created 5 pods on the cluster, and later on 2 more minion nodes joined the cluster.
The problem I faced was that all the pods stayed scheduled on the original master and minion nodes. They were not rescheduled to the new nodes to spread out the load, so my new minion nodes were just sitting idle and didn't do any processing.
Is there anything special to be run to make this happen?
Not really. The scheduler is called whenever something needs to be scheduled, so unless you deploy new replicas of the pod, the scheduler won't be bothered again.
Whenever you want to schedule something, like creating a Deployment or a Pod, the scheduler looks at the available resources to place the Pods where it thinks best. The next time you schedule something, it will take into account the new minions added to the cluster. Or, if your pods are created via a Deployment object, you could try deleting one Pod, so the underlying ReplicaSet will create a new Pod and the scheduler may choose one of the new minions.
The documentation also recommends creating a Service before creating a Deployment, so the scheduler will spread the pods better among the minions.
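For illustration, a minimal Service sketch; the name and labels are assumptions, and the selector must match the Deployment's pod labels so the scheduler's spreading priority can take the Service into account:

apiVersion: v1
kind: Service
metadata:
  name: my-app
spec:
  selector:
    app: my-app          # must match the pod labels in the Deployment
  ports:
    - port: 80
      targetPort: 8080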

Sudden surge in number of YARN apps on HDInsight cluster

For some reason the cluster sometimes seems to misbehave, because I suddenly see a surge in the number of YARN jobs. We are using an HDInsight Linux-based Hadoop cluster. We run Azure Data Factory jobs that basically execute some Hive scripts pointing to this cluster. Generally, the average number of YARN apps at any given time is around 50 running and 40-50 pending. No one uses this cluster for ad-hoc query execution.
But once every few days we notice something weird. Suddenly the number of YARN apps starts increasing, both running and pending, but especially pending ones. So this number goes over 100 for running YARN apps, and the pending count goes over 400, sometimes even 500+. We have a script that kills all YARN apps one by one, but it takes a long time, and that is not really a solution either. From our experience we found that the only solution, when this happens, is to delete and recreate the cluster.
It may be possible that for some time the cluster's response time is delayed (the Hive component especially), but in that case, if ADF keeps retrying a failing slice several times, is it possible that the cluster is storing all the supposedly failed slice execution requests (according to ADF) in a pool and trying to run them when it can? That's probably the only explanation for why this could be happening. Has anyone faced this issue?
Check if all the running jobs in the default queue are Templeton jobs. If so, then your queue is deadlocked.
Azure Data Factory uses WebHCat (Templeton) to submit jobs to HDInsight. WebHCat spins up a parent Templeton job, which then submits a child job, which is the actual Hive script you are trying to run. The YARN queue can get deadlocked if there are so many parent jobs at one time filling up the cluster capacity that no child job (the actual work) is able to spin up an Application Master, so no work is actually being done. Note that if you kill the Templeton job, this will result in Data Factory marking the time slice as completed even though obviously it was not.
If you are already in a deadlock, you can try adjusting the Maximum AM Resource percentage from the default 33% to something higher and/or scaling up your cluster. The goal is to allow some of the pending child jobs to run and slowly drain the queue.
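As a sketch, the setting in question is a capacity-scheduler property; the value below is an assumption to tune for your cluster:

yarn.scheduler.capacity.maximum-am-resource-percent=0.5   # raised from the 0.33 default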
As the correct long-term fix, you need to configure WebHCat so that the parent Templeton job is submitted to a separate YARN queue. You can do this by (1) creating a separate YARN queue and (2) setting templeton.hadoop.queue.name to the newly created queue.
To create the queue, you can use Ambari > YARN Queue Manager.
To update the WebHCat config via Ambari, go to the Hive tab > Advanced > Advanced WebHCat-site, and update the config value there.
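The property to change looks like this; the queue name is a placeholder for whatever you created in step (1):

templeton.hadoop.queue.name=templeton-jobs   # hypothetical queue name from step (1)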
More info on WebHCat config:
https://cwiki.apache.org/confluence/display/Hive/WebHCat+Configure
