Spark on Kubernetes executor cleanup - apache-spark

I'm running some jobs using Spark on K8S and sometimes my executors will die mid-job. Whenever that happens the driver immediately deletes the failed pod and spawns a new one.
Is there a way to stop Spark from deleting terminated executor pods? It would make debugging the failure a lot easier.
Right now I'm already shipping the logs of all pods to external storage so I can read them later, but it's quite a hassle to search through the logs for every pod, and I can't see the K8s metadata for the deleted pods.

This setting was added in SPARK-25515. Unfortunately it isn't available in the currently released version, but it should be available in Spark 3.0.0.
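For reference, here is a rough sketch of how that might look once it ships, assuming the property added by SPARK-25515 is spark.kubernetes.executor.deleteOnTermination (set to false to keep terminated executor pods around); the image name, API server address, class and jar path are placeholders:

# keep failed executor pods for post-mortem inspection (expected in Spark 3.0.0+)
spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  --conf spark.kubernetes.executor.deleteOnTermination=false \
  --class com.example.YourJob \
  local:///opt/spark/jars/<your-job>.jar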

You can use job.spec.ttlSecondsAfterFinished to control how long the pods will exist after the Job has completed or failed.
For example:
apiVersion: batch/v1
kind: Job
metadata:
  name: pi-with-ttl
spec:
  ttlSecondsAfterFinished: 100
  template:
    spec:
      containers:
        - name: pi
          image: perl
          command: ["perl", "-Mbignum=bpi", "-wle", "print bpi(2000)"]
      restartPolicy: Never
The Job pi-with-ttl will be eligible to be automatically deleted 100 seconds after it finishes.
If the field is set to 0, the Job will be eligible to be automatically deleted immediately after it finishes. If the field is unset, this Job won’t be cleaned up by the TTL controller after it finishes.
Note that this TTL mechanism is alpha, behind the TTLAfterFinished feature gate. For more information, see the documentation for the TTL controller for finished resources.

Related

Strimzi can't resize PV

I followed the Strimzi blog to resize a PV.
I use OpenShift v3.11 deployed on Azure VMs, with the PV backed by an Azure managed disk.
My Kafka cluster storage config:
..
config:
  offsets.topic.replication.factor: 2
  transaction.state.log.replication.factor: 1
  transaction.state.log.min.isr: 1
  log.message.format.version: "2.6"
storage:
  type: persistent-claim
  size: 256Gi
  deleteClaim: false
...
I directly edited the PVC and changed the resource request to 257Gi. I waited a few minutes and then checked the status of the PVC, like below:
oc get pvc data-0-xx-dev-kafka-0 -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
...
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 257Gi
  storageClassName: generic
  volumeName: pvc-xx-2849-xx-913f-xx
status:
  accessModes:
    - ReadWriteOnce
  capacity:
    storage: 256Gi
  conditions:
    - lastProbeTime: null
      lastTransitionTime: 2021-10-08T15:54:06Z
      status: "True"
      type: Resizing
  phase: Bound
In the description of the PVC, I see the below:
Warning VolumeResizeFailed 3s (x2 over 1m) volume_expand Error expanding volume "kafka/data-0-sirius-dev-kafka-0" of plugin kubernetes.io/azure-disk : compute.DisksClient#CreateOrUpdate:
Failure sending request: StatusCode=409 -- Original Error: failed request: autorest/azure:
Service returned an error. Status=<nil> Code="OperationNotAllowed" Message="Cannot resize disk kubernetes-dynamic-pvc-xx-2849-xx-913f-xx while it is
attached to running VM /subscriptions/xxxx/resourceGroups/xxx-dev-openshift/providers/Microsoft.Compute/virtualMachines/ocp-node-dev-1. Resizing a disk of an Azure Virtual Machine requires the virtual machine to be deallocated.
Please stop your VM and retry the operation."
I have also tried redeploying Kafka with JBOD with a single disk, resizing, and doing a rolling update. Same result as above.
openshift v3.11.0+cbab8ee-94(K8s v1.11.0+d4cacc0)
Kafka Version: 2.6.0
Operators : 0.20.1
Note that PV resizing is supported in my cluster (previously I resized the PV of another app successfully by scaling its replicas down to zero).
UPDATE
$ oc describe storageclass generic
Name: generic
IsDefaultClass: Yes
Annotations: storageclass.beta.kubernetes.io/is-default-class=true
Provisioner: kubernetes.io/azure-disk
Parameters: kind=managed,location=${location},storageaccounttype=Premium_LRS
AllowVolumeExpansion: True
MountOptions:
discard
ReclaimPolicy: Delete
VolumeBindingMode: Immediate
Events: <none>
I tried to scale the replicas down to 0 with oc scale --replicas=0 sts/XX-dev, but the cluster operator does not allow it because of the replication factor.
Normally Azure Disk supports resizing of Persistent Volumes, but currently there are some issues in terms of compliance and implementation; check here.
So before going anywhere else, just check the following points one more time (a short sketch of the checks follows the list):
Check the StorageClass definition referenced by "storageClassName: generic" to see whether allowVolumeExpansion is set to true (maybe you tried resizing with a different storage class before). You can only expand a PVC if its storage class's allowVolumeExpansion field is set to true.
It looks like you need to deallocate everything that is pointing to that volume first (check the warning at https://learn.microsoft.com/en-us/azure/aks/azure-disk-csi), then resize, and lastly scale the Pods back up.
Try to increase in the supported size increments; check the sizes here: https://learn.microsoft.com/en-us/azure/virtual-machines/disks-types
You may be hitting the volume limit per VM (the default is 16 for Azure); for the time being, check how many volumes you currently have on that node.
The other issue is that if the underlying capacity is not enough for resizing, you have to deallocate and relocate the VM to a new resource group. Yes, the cloud is not unlimited.
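A quick sketch of those checks with oc, assuming the storage class and node names from the question (generic, ocp-node-dev-1):

# is volume expansion enabled on the storage class?
oc get storageclass generic -o jsonpath='{.allowVolumeExpansion}{"\n"}'

# how many pods (and hence attached Azure disks) sit on the node backing that broker?
oc get pods --all-namespaces -o wide --field-selector spec.nodeName=ocp-node-dev-1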

Running Spark on Kubernetes with Dynamic Allocation

I am trying to use Spark on a Kubernetes cluster (existing setup: EMR + YARN). Our use case is to handle a large number of jobs, including short-lived ones (a few seconds to 15 minutes). We also have peak hours when many workers need to run to handle hundreds of jobs running concurrently.
What I want to achieve is to run the master and a fixed few workers (say 5) at all times, and to increase the workers to 40-50 at peak time. I would also prefer to use dynamic allocation.
I am setting it up as below.
Master image (spark-master:X)
FROM <BASE spark 3.1 Image build using dev/make-distribution.sh -Pkubernetes in spark>
ENTRYPOINT ["/opt/spark/sbin/start-master.sh", "-p", "8081", "<A long running server command that can accept get traffic on 8080 to submit jobs>"]
Worker image (spark-worker:X)
FROM <BASE spark 3.1 Image build using dev/make-distribution.sh -Pkubernetes in spark>
ENTRYPOINT ["/opt/spark/sbin/start-worker.sh", "spark//spark-master:8081" ,"-p", "8081", "<A long running server command to keep up the worker>"]
Deployments
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-master-server
spec:
  replicas: 1
  selector:
    matchLabels:
      component: spark-master-server
  template:
    metadata:
      labels:
        component: spark-master-server
    spec:
      containers:
        - name: spark-master-server
          image: spark-master:X
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8081
---
apiVersion: v1
kind: Service
metadata:
  name: spark-master
spec:
  type: ClusterIP
  ports:
    - port: 8081
      targetPort: 8081
  selector:
    component: spark-master-server
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: spark-worker-instance
spec:
  replicas: 3
  selector:
    matchLabels:
      component: spark-worker-instance
  template:
    metadata:
      labels:
        component: spark-worker-instance
    spec:
      containers:
        - name: spark-worker-server
          image: spark-worker:X
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8081
Questions
Is this setup recommended?
How can we submit a new job from within the Kubernetes cluster in the absence of YARN?
The reason we are trying not to create the master and driver dynamically per job (as in the example at http://spark.apache.org/docs/latest/running-on-kubernetes.html) is that it may be an overhead for a large number of small jobs.
Is this setup recommended?
Don't think so.
Dynamic Resource Allocation is a property of a single Spark application "to dynamically adjust the resources your application occupies based on the workload."
Dynamic Resource Allocation expresses an application's resource requirements regardless of the number of available nodes in a cluster. As long as resources are available and the cluster manager can assign them to a Spark application, those resources are free to be used.
What you seem to be trying to set up is how to scale the cluster itself up and down. In your case it's Spark Standalone, and although technically it's possible with ReplicaSets (just a guess), I've never heard of any earlier attempts at it. You're on your own, as Spark Standalone does not support it out of the box.
That, I think, is overkill, since you're building a multi-layer cluster environment: using a cluster manager (Kubernetes) to host another cluster manager (Spark Standalone) for Spark applications. Given that Spark on Kubernetes supports Dynamic Allocation out of the box, your only worry should simply be how to "throw in" more CPU and memory on demand while resizing the Kubernetes cluster. You should rely on the capabilities of Kubernetes to resize itself up and down rather than on Spark Standalone on Kubernetes.
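To make that concrete, here is a minimal spark-submit sketch of the native Spark-on-Kubernetes route with Dynamic Allocation; the image, namespace, API server address, class and jar path are placeholders, and shuffle tracking is enabled because there is no external shuffle service on Kubernetes:

spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --name short-job \
  --class com.example.ShortJob \
  --conf spark.kubernetes.namespace=spark-jobs \
  --conf spark.kubernetes.container.image=<your-spark-3.1-image> \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.shuffleTracking.enabled=true \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=50 \
  local:///opt/spark/jars/<your-app>.jar

With this, each job brings up its own driver and executors and releases them when done; the 40-50 worker peak is then handled by resizing the Kubernetes cluster rather than by a standing Spark Standalone cluster.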
The spark on k8s operator may provide at least one mechanism to dynamically provision the resources you need to do safe scaling of resources based on demand.
https://github.com/GoogleCloudPlatform/spark-on-k8s-operator
My thinking is that instead of performing a direct spark-submit to a master in a static Spark cluster, you could make a call to the K8s API to provision the required instance, or otherwise define these on a cron schedule.
In the dynamic provisioning scenario, one thing to consider is how your workloads get distributed across the cluster; we can't use simple HPA rules for this, as it may not be safe to tear down a worker based on CPU/memory levels. Spawning a separate cluster for each on-demand workload bypasses this but may not be optimal. I would be interested to hear how you get on.
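For illustration only, a rough SparkApplication manifest for that operator; the names, image and file paths here are hypothetical, and the exact schema should be checked against the operator version you install:

apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: short-job
  namespace: spark-jobs
spec:
  type: Scala
  mode: cluster
  image: <your-spark-3.1-image>
  mainClass: com.example.ShortJob
  mainApplicationFile: local:///opt/spark/jars/<your-app>.jar
  sparkVersion: "3.1.1"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: 1g
    serviceAccount: spark
  executor:
    instances: 2
    cores: 1
    memory: 2g

Submitting a job from inside the cluster then becomes creating one of these objects via kubectl or the Kubernetes API, which also addresses the "how do we submit without YARN" question.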

How to set pod to use all available CPU Cores

I tried to use K8s to set up a Spark cluster (I use standalone deployment mode, and I cannot use the k8s deployment mode for some reason).
I didn't set any CPU-related arguments.
For Spark, that means:
Total CPU cores to allow Spark applications to use on the machine (default: all available); only on worker
http://spark.apache.org/docs/latest/spark-standalone.html
For K8s pods, that means:
If you do not specify a CPU limit for a Container, then one of these situations applies:
The Container has no upper bound on the CPU resources it can use. The Container could use all of the CPU resources available on the Node where it is running.
The Container is running in a namespace that has a default CPU limit, and the Container is automatically assigned the default limit. Cluster administrators can use a LimitRange to specify a default value for the CPU limit.
https://kubernetes.io/docs/tasks/configure-pod-container/assign-cpu-resource/
...
Addresses:
  InternalIP: 172.16.197.133
  Hostname: ubuntu
Capacity:
  cpu: 4
  memory: 3922Mi
  pods: 110
Allocatable:
  cpu: 4
  memory: 3822Mi
  pods: 110
...
But my Spark worker only uses 1 core (I have 4 cores on the worker node and the namespace has no resource limits).
That means the Spark worker pod only used 1 core of the node (it should be 4).
How can I write the YAML file so that the pod uses all available CPU cores?
Here is my YAML file:
---
apiVersion: v1
kind: Namespace
metadata:
  name: spark-standalone
---
kind: DaemonSet
apiVersion: apps/v1
metadata:
  name: spark-slave
  namespace: spark-standalone
  labels:
    k8s-app: spark-slave
spec:
  selector:
    matchLabels:
      k8s-app: spark-slave
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  template:
    metadata:
      name: spark-slave
      namespace: spark-standalone
      labels:
        k8s-app: spark-slave
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-role.kubernetes.io/edge
                    operator: Exists
      hostNetwork: true
      containers:
        - name: spark-slave
          image: spark:2.4.3
          command: ["/bin/sh","-c"]
          args:
            - "
              ${SPARK_HOME}/sbin/start-slave.sh
              spark://$(SPARK_MASTER_IP):$(SPARK_MASTER_PORT)
              --webui-port $(SPARK_SLAVE_WEBUI_PORT)
              &&
              tail -f ${SPARK_HOME}/logs/*
              "
          env:
            - name: SPARK_MASTER_IP
              value: "10.4.20.34"
            - name: SPARK_MASTER_PORT
              value: "7077"
            - name: SPARK_SLAVE_WEBUI_PORT
              value: "8081"
---
Kubernetes - No upper bound
The Container has no upper bound on the CPU resources it can use. The Container could use all of the CPU resources available on the Node where it is running
Unless you configure a CPU limit for your pod, it can use all available CPU resources on the node.
Consider dedicated nodes
If you are running other workloads on the same node, they also consume CPU resources, and they may be guaranteed CPU resources if they have a CPU request configured. Consider using a dedicated node for your workload with a NodeSelector and Taints and Tolerations.
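A minimal sketch of that, assuming a hypothetical spark-only label and taint that you first apply to the dedicated node:

# label and taint the dedicated node first (hypothetical names):
#   kubectl label node <node-name> spark-only=true
#   kubectl taint node <node-name> spark-only=true:NoSchedule
# then, in the pod template spec of the DaemonSet:
spec:
  nodeSelector:
    spark-only: "true"
  tolerations:
    - key: "spark-only"
      operator: "Equal"
      value: "true"
      effect: "NoSchedule"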
Spark - No upper bound
You configure the slave with parameters to start-slave.sh, e.g. --cores X, to limit CPU core usage.
Total CPU cores to allow Spark applications to use on the machine (default: all available); only on worker
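For example, the start-slave.sh call in the DaemonSet above could pin the worker explicitly; --cores and --memory are standard standalone worker options, and the figures below are just examples for the 4-core node shown earlier:

${SPARK_HOME}/sbin/start-slave.sh \
  spark://$(SPARK_MASTER_IP):$(SPARK_MASTER_PORT) \
  --webui-port $(SPARK_SLAVE_WEBUI_PORT) \
  --cores 4 \
  --memory 3g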
Multithreaded workload
In the end, whether the pod can use multiple CPU cores depends on how your application uses threads. Some things only use a single thread, so the application must be designed for multithreading and must have something parallelizable to do.
I hit exactly the same issue with a Spark worker. Knowing that Java sometimes fails to detect the CPU count correctly, I tried specifying the CPU request or limit in the Pod spec, and the worker automatically understood what environment it was in and assigned the needed cores to the Spark worker executor.
Also, I saw this behavior in K8s only; in Docker Swarm, all available CPU cores were taken by the worker.
What's more, the default templates mention 1 core for the Spark worker's 'cores' parameter. I believe that value is used when the CPU core calculation goes wrong.
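Applied to the DaemonSet from the question, that amounts to a container fragment along these lines (a sketch; the figures assume the 4-core node shown above):

containers:
  - name: spark-slave
    image: spark:2.4.3
    resources:
      requests:
        cpu: "3"      # leave a little headroom for system pods on the node
      limits:
        cpu: "4"      # lets the worker JVM see all 4 cores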

Cassandra replicas on single docker-swarm node

I've got a single docker-swarm manager node (18.09.6) running and I'm playing with spinning up a Cassandra cluster. I'm using the following definition, and it works in that the seed/master and slave spin up and communicate/replicate their data/schema changes fine:
services:
  cassandra-masters:
    image: cassandra:2.2
    environment:
      - MAX_HEAP_SIZE=128m
      - HEAP_NEWSIZE=32m
      - CASSANDRA_BROADCAST_ADDRESS=cassandra-masters
    deploy:
      mode: replicated
      replicas: 1
  cassandra-slaves:
    image: cassandra:2.2
    environment:
      - MAX_HEAP_SIZE=128m
      - HEAP_NEWSIZE=32m
      - CASSANDRA_SEEDS=cassandra-masters
      - CASSANDRA_BROADCAST_ADDRESS=cassandra-slaves
    deploy:
      mode: replicated
      replicas: 1
    depends_on:
      - cassandra-masters
When I change the replica count from 1 to 2, either on deployment of the stack or via a post-deploy scale, the second task for the Cassandra slave is created, but it constantly fails with an error indicating it cannot gossip with the seed node:
INFO 10:51:03 Loading persisted ring state
INFO 10:51:03 Starting Messaging Service on /10.10.0.200:7000 (eth0
INFO 10:51:03 Handshaking version with cassandra-masters/10.10.0.142
Exception (java.lang.RuntimeException) encountered during startup: Unable to gossip with any seeds
java.lang.RuntimeException: Unable to gossip with any seeds
at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1360)
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:521)
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:756)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:676)
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:562)
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:310)
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:548)
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:657)
ERROR 10:51:34 Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any seeds
at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1360) ~[apache-cassandra-2.2.14.jar:2.2.14]
at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:521) ~[apache-cassandra-2.2.14.jar:2.2.14]
at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:756) ~[apache-cassandra-2.2.14.jar:2.2.14]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:676) ~[apache-cassandra-2.2.14.jar:2.2.14]
at org.apache.cassandra.service.StorageService.initServer(StorageService.java:562) ~[apache-cassandra-2.2.14.jar:2.2.14]
at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:310) [apache-cassandra-2.2.14.jar:2.2.14]
at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:548) [apache-cassandra-2.2.14.jar:2.2.14]
at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:657) [apache-cassandra-2.2.14.jar:2.2.14]
I'd like to understand what is causing the issue and whether there is a way to work around it. I'm just investigating what the roadblocks are to getting to production, where we'd obviously be spinning the Cassandra tasks/replicas up on different nodes rather than on one node.
EDIT: I've spun the same stack up on a two-node swarm and I'm seeing the same behaviour, i.e. when I scale to a second "slave" task, it fails with the same error, so it's not an issue particular to trying to run two tasks on the same node.
I never got to the bottom of why the gossiping fails, but ultimately we agreed on a production deployment strategy in which we don't require auto-scaling and instead do capacity planning based on the system's behaviour and expected traffic. This answer also points out the additional strain that auto-scaling can add to an already stretched system: AWS and auto scaling cassandra

Spark 2.0 Standalone mode Dynamic Resource Allocation Worker Launch Error

I'm running Spark 2.0 in standalone mode, successfully configured it to launch on a server, and was also able to configure the IPython PySpark kernel as an option in Jupyter Notebook. Everything works fine, but I'm facing the problem that for each notebook I launch, all 4 of my workers are assigned to that application. So if another person from my team tries to launch another notebook with the PySpark kernel, it simply does not work until I stop the first notebook and release all the workers.
To solve this problem, I'm trying to follow the instructions from the Spark 2.0 documentation.
So, on my $SPARK_HOME/conf/spark-defaults.conf I have the following lines:
spark.dynamicAllocation.enabled true
spark.shuffle.service.enabled true
spark.dynamicAllocation.executorIdleTimeout 10
Also, on $SPARK_HOME/conf/spark-env.sh I have:
export SPARK_WORKER_MEMORY=1g
export SPARK_EXECUTOR_MEMORY=512m
export SPARK_WORKER_INSTANCES=4
export SPARK_WORKER_CORES=1
But when I try to launch the workers using $SPARK_HOME/sbin/start-slaves.sh, only the first worker is successfully launched. The log from the first worker ends up like this:
16/11/24 13:32:06 INFO Worker: Successfully registered with master
spark://cerberus:7077
But the logs from workers 2-4 show me this error:
INFO ExternalShuffleService: Starting shuffle service on port 7337
with useSasl = false 16/11/24 13:32:08 ERROR Inbox: Ignoring error
java.net.BindException: Address already in use
It seems (to me) that the first worker successfully starts the shuffle service on port 7337, but workers 2-4 "do not know" about this and try to launch another shuffle service on the same port.
The problem also occurs for all workers (1-4) if I first launch a shuffle service (using $SPARK_HOME/sbin/start-shuffle-service.sh) and then try to launch all the workers ($SPARK_HOME/sbin/start-slaves.sh).
Is there any option to get around this? Some way for all workers to verify whether a shuffle service is already running and connect to it instead of trying to create a new one?
I had the same issue and seemed to get it working by removing the spark.shuffle.service.enabled item from the config file (in fact I don't have any dynamicAllocation-related items in there) and instead putting this in the SparkConf when I request a SparkContext:
import pyspark

sconf = pyspark.SparkConf() \
    .setAppName("sc1") \
    .set("spark.dynamicAllocation.enabled", "true") \
    .set("spark.shuffle.service.enabled", "true")
sc1 = pyspark.SparkContext(conf=sconf)
I start the master & slaves as normal:
$SPARK_HOME/sbin/start-all.sh
And I have to start one instance of the shuffle service:
$SPARK_HOME/sbin/start-shuffle-service.sh
Then I started two notebooks with this context and got them both to do a small job. The first notebook's application does the job and is in the RUNNING state, the second notebook's application is in the WAITING state. After a minute (default idle timeout), the resources get reallocated and the second context gets to do its job (and both are in RUNNING state).
Hope this helps,
John
