I am running Spark 3.1.1 on Kubernetes 1.19. Once a job finishes, the executor pods get cleaned up, but the driver pod remains in Completed state. How do I clean up the driver pod once it has completed? Is there any configuration option to set?
NAME READY STATUS RESTARTS AGE
my-job-0e85ea790d5c9f8d-driver 0/1 Completed 0 2d20h
my-job-8c1d4f79128ccb50-driver 0/1 Completed 0 43h
my-job-c87bfb7912969cc5-driver 0/1 Completed 0 43h
Concerning the initial question "Spark on Kubernetes driver pod cleanup": it seems that there is no way to pass, at spark-submit time, a TTL parameter to Kubernetes so that driver pods in Completed status do not linger forever.
From the Spark documentation:
https://spark.apache.org/docs/latest/running-on-kubernetes.html
When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.
It is not very clear who or what actually does this "eventual garbage collection".
spark.kubernetes.driver.service.deleteOnTermination was added to Spark in 3.2.0. This should solve the issue. Source: https://spark.apache.org/docs/latest/core-migration-guide.html
Update: this only deletes the service associated with the driver pod, not the pod itself.
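Since neither option removes the pod itself, the usual workaround is to sweep completed drivers with kubectl, manually or on a schedule. A minimal sketch, assuming the driver pods carry Spark's spark-role=driver label and live in a namespace called spark-jobs (the namespace name is made up, adjust to your setup):

kubectl delete pods -n spark-jobs \
  -l spark-role=driver \
  --field-selector=status.phase=Succeeded

Drivers that failed end up in status.phase=Failed and can be swept the same way.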
According to the official Kubernetes documentation, since Kubernetes 1.12:
Another way to clean up finished Jobs (either Complete or Failed) automatically is to use a TTL mechanism provided by a TTL controller for finished resources, by specifying the .spec.ttlSecondsAfterFinished field of the Job.
When the TTL controller cleans up the Job, it will delete the Job cascadingly, i.e. delete its dependent objects, such as Pods, together with the Job. Note that when the Job is deleted, its lifecycle guarantees, such as finalizers, will be honored.
Example:
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-with-ttl
spec:
  ttlSecondsAfterFinished: 100
  template:
    spec:
      ...
The Job pi-with-ttl will be eligible to be automatically deleted, 100 seconds after it finishes.
If the field is set to 0, the Job will be eligible to be automatically deleted immediately after it finishes.
If customisation of the Job resource is not possible, you may use an external tool to clean up completed jobs; for example, see https://github.com/dtan4/k8s-job-cleaner
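Note that ttlSecondsAfterFinished applies to Job objects; spark-submit in cluster mode creates the driver as a bare pod rather than a Job, so the TTL only helps if you wrap the submission in a Job yourself. As a lighter alternative to an external tool, a plain CronJob running kubectl can sweep up completed driver pods. A minimal sketch, assuming a service account (pod-sweeper is a made-up name) that may list and delete pods, and an image that ships kubectl:

apiVersion: batch/v1beta1          # batch/v1 on Kubernetes >= 1.21
kind: CronJob
metadata:
  name: driver-pod-sweeper
spec:
  schedule: "0 * * * *"            # hourly
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: pod-sweeper
          restartPolicy: Never
          containers:
            - name: sweeper
              image: bitnami/kubectl:latest
              command:
                - /bin/sh
                - -c
                - kubectl delete pods -l spark-role=driver --field-selector=status.phase=Succeeded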
I'm running YARN on an EMR cluster.
mapred queue -list returns:
Queue Name : default
Queue State : running
Scheduling Info : Capacity: 100.0, MaximumCapacity: 100.0, CurrentCapacity: 0.0
How do I clear this queue or add a new one? I've been looking for a while now and can't find CLI commands to do so; I only have access to the CLI. Any Spark applications I submit hang in the ACCEPTED state, and I've killed all submitted applications via yarn app --kill [app_id].
CurrentCapacity: 0.0 means that the queue is fully unused.
Your jobs, if that's your concern, are NOT hung due to unavailability of resources.
I am not sure whether EMR allows YARN CLI commands such as schedulerconf:
https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/YarnCommands.html#queue:~:text=ResourceManager%20admin%20client-,schedulerconf,-Usage%3A%20yarn
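If the immediate goal is to work out why applications sit in ACCEPTED, the read-only YARN CLI commands are normally available on EMR even where the admin ones are not; for example:

yarn queue -status default                  # configured vs. current capacity of the queue
yarn application -list -appStates ACCEPTED  # applications still waiting for resources
yarn node -list                             # NodeManagers and their state

Adding or reshaping queues is generally done through capacity-scheduler.xml (on EMR, via the capacity-scheduler configuration classification) rather than through a CLI command.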
I have two Vert.x microservices running in a cluster that communicate with each other using a headless service (link) in an on-premise cloud. Whenever I do a rolling deployment I face connectivity issues between the services. When I analysed the logs I could see that the old node/pod is removed from the cluster member list, but the event bus does not remove it and keeps using it on a round-robin basis.
Below is the member group information before deployment
Member [192.168.4.54]:5701 - ace32cef-8cb2-4a3b-b15a-2728db068b80 //pod 1
Member [192.168.4.54]:5705 - f0c39a6d-4834-4b1d-a179-1f0d74cabbce this
Member [192.168.101.79]:5701 - ac0dcea9-898a-4818-b7e2-e9f8aaefb447 //pod 2
When the deployment starts, pod 2 gets removed from the member list:
[192.168.4.54]:5701 [dev] [4.0.2] Could not connect to: /192.168.101.79:5701. Reason: SocketException[Connection refused to address /192.168.101.79:5701]
Removing connection to endpoint [192.168.101.79]:5701 Cause => java.net.SocketException {Connection refused to address /192.168.101.79:5701}, Error-Count: 5
Removing Member [192.168.101.79]:5701 - ac0dcea9-898a-4818-b7e2-e9f8aaefb447
And a new member is added:
Member [192.168.4.54]:5701 - ace32cef-8cb2-4a3b-b15a-2728db068b80
Member [192.168.4.54]:5705 - f0c39a6d-4834-4b1d-a179-1f0d74cabbce this
Member [192.168.94.85]:5701 - 1347e755-1b55-45a3-bb9c-70e07a29d55b //new pod
All migration tasks have been completed. (repartitionTime=Mon May 10 08:54:19 MST 2021, plannedMigrations=358, completedMigrations=358, remainingMigrations=0, totalCompletedMigrations=3348, elapsedMigrationTime=1948ms, totalElapsedMigrationTime=27796ms)
But when a request is made to the deployed service, even though the old pod has been removed from the member group, the event bus still uses the old pod/service reference (ac0dcea9-898a-4818-b7e2-e9f8aaefb447):
[vert.x-eventloop-thread-1] DEBUG io.vertx.core.eventbus.impl.clustered.ConnectionHolder - tx.id=f9f5cfc9-8ad8-4eb1-b12c-322feb0d1acd Not connected to server ac0dcea9-898a-4818-b7e2-e9f8aaefb447 - starting queuing
I checked the official documentation for rolling deployments, and my deployment seems to follow the two key rules mentioned there (see the strategy sketch after this list):
never start more than one new pod at once
forbid more than one unavailable pod during the process
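For reference, those two rules correspond to the rolling-update strategy fields in the Deployment spec; a minimal sketch (the values are what I would use to satisfy the rules, not taken from the poster's manifest):

spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # at most one new pod is started at a time
      maxUnavailable: 0  # no existing pod is taken down before its replacement is ready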
I am using Vert.x 4.0.3 and hazelcast-kubernetes 1.2.2. My verticle class extends AbstractVerticle and is deployed using:
Vertx.clusteredVertx(options, vertx -> {
  vertx.result().deployVerticle(verticleName, deploymentOptions);
});
Sorry for the long post, any help is highly appreciated.
One possible reason is a race condition between Kubernetes removing the pod and updating the endpoints in kube-proxy, as detailed in this extensive article. This race condition leads to Kubernetes continuing to send traffic to the pod being removed after it has terminated.
One TL;DR solution is to add a delay when terminating a pod by either:
Having the service delay when it receives a SIGTERM (e.g. for 15 seconds) so that it keeps responding to requests as normal during that delay period.
Using the Kubernetes preStop hook to execute a sleep 15 command on the container (sketched below). This allows the service to continue responding to requests during that 15-second window while Kubernetes updates its endpoints. Kubernetes sends SIGTERM when the preStop hook completes.
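A minimal sketch of the preStop variant, showing only the relevant part of the pod spec (assuming the container image ships a shell with a sleep command; the container name and 15-second value are illustrative):

spec:
  terminationGracePeriodSeconds: 45   # must cover the preStop sleep plus normal shutdown
  containers:
    - name: my-service
      image: my-image
      lifecycle:
        preStop:
          exec:
            command: ["sh", "-c", "sleep 15"]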
Both solutions give Kubernetes some time to propagate changes to its internal components so that traffic stops being routed to the pod being removed.
A caveat to this answer is that I'm not familiar with Hazelcast clustering and how your specific discovery mode is set up.
I am currently seeing a strange issue where I have a Pod that is constantly being Evicted by Kubernetes.
My Cluster / App Information:
Node size: 7.5GB RAM / 2vCPU
Application Language: nodejs
Use Case: puppeteer website extraction (I have code that loads a website, then extracts an element and repeats this a couple of times per hour)
Running on Azure Kubernetes Service (AKS)
What I tried:
Checked that Puppeteer is closed correctly and that I am removing any Chrome instances. After adding a forced kill, it seems to be doing this.
Checked kubectl get events where it is showing the lines:
8m17s Normal NodeHasSufficientMemory node/node-1 Node node-1 status is now: NodeHasSufficientMemory
2m28s Warning EvictionThresholdMet node/node-1 Attempting to reclaim memory
71m Warning FailedScheduling pod/my-deployment 0/4 nodes are available: 1 node(s) had taint {node.kubernetes.io/memory-pressure: }, that the pod didn't tolerate, 3 node(s) didn't match node selector
Checked kubectl top pods where it shows it was only utilizing ~30% of the node's memory
Added resource limits in my Kubernetes YAML:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-d
spec:
  replicas: 1
  template:
    spec:
      containers:
        - name: main
          image: my-image
          imagePullPolicy: Always
          resources:
            limits:
              memory: "2Gi"
Current way of thinking:
A node has X total memory, of which only Y is actually allocatable due to reserved space. However, when running os.totalmem() in Node.js I can still see that Node reports the full X memory.
What I am thinking is that Node.js allocates up to X because its garbage collection only kicks in near X, when it should actually kick in at Y. However, with my limit set I expected it to see that limit instead of the K8s node's memory.
Question
Are there any other things I should try to resolve this? Has anyone run into this before?
Your Node.js app is not aware that it runs in a container. It only sees the amount of memory that the Linux kernel reports (which is always the total node memory). You should make your app aware of the cgroup limits; see https://medium.com/the-node-js-collection/node-js-memory-management-in-container-environments-7eb8409a74e8
With regard to evictions: when you set the memory limits, did that solve your eviction problems?
And don't trust kubectl top pods too much. It always shows data with some delay.
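One way to act on both points, building on the Deployment from the question: cap V8's old space below the container limit via NODE_OPTIONS, and add a memory request next to the limit so the scheduler reserves that memory for the pod. A sketch of the container section; the 1536 MB figure is an assumption (leave headroom under the 2Gi limit), and it assumes a Node.js version that accepts --max-old-space-size through NODE_OPTIONS:

containers:
  - name: main
    image: my-image
    imagePullPolicy: Always
    env:
      - name: NODE_OPTIONS
        value: "--max-old-space-size=1536"   # in MB; keep below the container limit
    resources:
      requests:
        memory: "2Gi"    # requesting what you limit keeps usage within the reservation
      limits:
        memory: "2Gi"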
My goal:
Implement a cron job that runs once per week. I intend to implement this topology on Knative to save computing resources:
PingSource -> knative service
The PingSource will emit a dummy event to a Knative service once per week just to bring up one Knative service pod. The Knative service pod will then fetch a huge amount of data and process it.
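For reference, a minimal sketch of such a weekly trigger (names, schedule and payload are placeholders):

apiVersion: sources.knative.dev/v1
kind: PingSource
metadata:
  name: weekly-trigger
spec:
  schedule: "0 2 * * 1"              # every Monday at 02:00
  data: '{"trigger": "weekly-run"}'
  sink:
    ref:
      apiVersion: serving.knative.dev/v1
      kind: Service
      name: data-processor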
My concern:
If I set enable-scale-to-zero to true, the Knative pod autoscaler will probably shut down the Knative service pod even when the pod has not finished its work.
So far, I explored:
The scale-to-zero-grace-period, which can be configured to tell the autoscaler how long to wait after the last traffic ends before shutting down the pod. But I don't think this approach is subtle enough. I would prefer something similar to a readinessProbe or livenessProbe: the autoscaler should probe whether the pod is still processing something before sending the kill signal.
In addition, according to Knative's docs, there are two types of event sink: callable and addressable. Both return a response or acknowledgement. Would the Knative autoscaler consider the pod to be handling the request until the pod returns the response/acknowledgement? In other words, as long as the pod has not responded, it won't be removed by the autoscaler?
The Knative autoscaler relies on the pod strictly working in a request/response fashion. As long as the "huge amount of data" is processed as part of an HTTP request (or Websocket session, or gRPC session etc.) the pod will not even be considered for deletion.
What will not work is sending the request, returning immediately and then munging the data in the background. The autoscaler will think that there is no activity at all and thus shut the pod down. There is a sandbox project that tries to implement such asynchronous semantics, though.
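For completeness, the scale-to-zero-grace-period explored in the question is set cluster-wide in the config-autoscaler ConfigMap in the knative-serving namespace; a sketch with an illustrative value:

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-autoscaler
  namespace: knative-serving
data:
  enable-scale-to-zero: "true"
  scale-to-zero-grace-period: "1m"

Note that this only delays the shutdown after traffic ends; it does not make the autoscaler aware of background work, which is why keeping the processing inside the request matters.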
We are trying to set up HA on the Spark standalone master using ZooKeeper.
We have two ZooKeeper hosts, which we are using for Spark HA as well.
We configured the following in spark-env.sh:
SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk_server1:2181,zk_server2:2181"
Started both the masters.
Started the shell; the status of the job is RUNNING.
master1 is in ALIVE status and master2 is in STANDBY status.
Killed master1; master2 took over and all the workers appeared alive under master2.
The shell that was already running was moved to the new master. However, the application is in WAITING status and the executors are in LOADING status.
There are no errors in the worker log or executor log, except a notification that they connected to the new master.
I can see that the worker re-registered, but the executor does not seem to have started. Is there anything I am missing?
My Spark version is 1.5.0.