Spark Pod restarting every hour in Kubernetes - apache-spark

I have deployed a Spark application in cluster mode on Kubernetes. The Spark application pod is getting restarted almost every hour.
The driver log has this message before restart:
20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 1 on x.x.x.x: The executor with id 1 was deleted by a user or the framework.
20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 2 on y.y.y.y: The executor with id 2 was deleted by a user or the framework.
20/07/11 13:34:02 INFO DAGScheduler: Executor lost: 1 (epoch 0)
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, x.x.x.x, 44879, None)
20/07/11 13:34:02 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
20/07/11 13:34:02 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 0)
20/07/11 13:34:02 INFO DAGScheduler: Executor lost: 2 (epoch 1)
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, y.y.y.y, 46191, None)
20/07/11 13:34:02 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
20/07/11 13:34:02 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch 1)
20/07/11 13:34:02 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.
20/07/11 13:34:16 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.
And the Executor log has:
20/07/11 15:55:01 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
20/07/11 15:55:01 INFO MemoryStore: MemoryStore cleared
20/07/11 15:55:01 INFO BlockManager: BlockManager stopped
20/07/11 15:55:01 INFO ShutdownHookManager: Shutdown hook called
How can I find out what is causing the executors to be deleted?
Deployment:
Replicas: 1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType: RollingUpdate
MinReadySeconds: 0
RollingUpdateStrategy: 1 max unavailable, 0 max surge
Pod Template:
Labels: app=test
chart=test-2.0.0
heritage=Tiller
product=testp
release=test
service=test-spark
Containers:
test-spark:
Image: test-spark:2df66df06c
Port: <none>
Host Port: <none>
Command:
/spark/bin/start-spark.sh
Args:
while true; do sleep 30; done;
Limits:
memory: 4Gi
Requests:
memory: 4Gi
Liveness: exec [/spark/bin/liveness-probe.sh] delay=300s timeout=1s period=30s #success=1 #failure=10
Environment:
JVM_ARGS: -Xms256m -Xmx1g
KUBERNETES_MASTER: https://kubernetes.default.svc
KUBERNETES_NAMESPACE: test-spark
IMAGE_PULL_POLICY: Always
DRIVER_CPU: 1
DRIVER_MEMORY: 2048m
EXECUTOR_CPU: 1
EXECUTOR_MEMORY: 2048m
EXECUTOR_INSTANCES: 2
KAFKA_ADVERTISED_HOST_NAME: kafka.default:9092
ENRICH_KAFKA_ENRICHED_EVENTS_TOPICS: test-events
Conditions:
Type Status Reason
---- ------ ------
Available True MinimumReplicasAvailable
Progressing True NewReplicaSetAvailable
OldReplicaSets: <none>
NewReplicaSet: test-spark-5c5997b459 (1/1 replicas created)
Events: <none>

I did some quick research on running Spark on Kubernetes, and it seems that Spark by design terminates the executor pods once they have finished running the Spark application. Quoting the official Spark documentation:
When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.
Therefore, I believe there is nothing to worry about regarding the restarts, as long as your Spark instance still manages to start executor pods as and when required.
Reference: https://spark.apache.org/docs/2.4.5/running-on-kubernetes.html#how-it-works
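To confirm whether the executors finished normally or were killed by Kubernetes (OOM, eviction, node pressure), the cluster events and the pod's last state are the usual places to look. A sketch with standard kubectl commands, assuming the test-spark namespace from the deployment description above; <executor-pod> is a placeholder for the actual pod name:
# Recent events in the namespace, oldest first; look for Killing/Evicted/OOMKilled entries
kubectl get events -n test-spark --sort-by=.metadata.creationTimestamp
# Exit reason and last state of a specific executor pod, if it still exists
kubectl describe pod <executor-pod> -n test-spark
# Spark labels its executor pods, so you can list just those and check their status
kubectl get pods -n test-spark -l spark-role=executor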

I don't know how you configured your application pod, but you can stop the pod from restarting by including the following in your deployment YAML file. With this in place the pod will never restart, and you can debug it from there onwards.
restartPolicy: Never
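Note that a Deployment's pod template only accepts restartPolicy: Always, so this setting belongs on a bare Pod (or a Job). One quick way to get a non-restarting copy of the pod for debugging is kubectl run; this is only a sketch that reuses the image and command from the deployment description above, and the pod name test-spark-debug is made up here:
kubectl run test-spark-debug \
  --namespace test-spark \
  --image=test-spark:2df66df06c \
  --restart=Never \
  --command -- /spark/bin/start-spark.sh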

Related

Optimization of spark cluster configuration (low-configuration virtual machine cluster)

When I use Spark's standalone mode to process a large number of datasets, the log says:
ERROR TaskSchedulerImpl:70 - Lost executor 1 on : Executor heartbeat timed out after 381181 ms
I searched the internet, and people say I should set the parameter with spark-submit:
[hadoop@Master spark2.4.0]$ bin/spark-submit --master spark://master:7077 --conf spark.worker.timeout 10000000 --py-files id.py id.py --name id
Error message in log:
Error: Invalid argument to --conf: spark.worker.timeout
Questions:
How do I set the timeout parameter?
Thanks to meniluca's answer, I realized I had dropped the symbol from the instructions.
After adjusting the timeout, the log displays:
2019-12-05 19:42:27 WARN Utils:87 - Suppressing exception in finally: broken pipe (Write failed)
java.net.SocketException: broken pipe (Write failed)
2019-12-05 21:13:09 INFO SparkContext:54 - Invoking stop() from shutdown hook
Exception in thread "serve-DataFrame" java.net.SocketException: Connection reset
Suppressed: java.net.SocketException: broken pipe (Write failed)
Then I changed the SSH settings, adding ServerAliveInterval 60 to ~/.ssh/config:
ServerAliveInterval 60
The error still exists. Then I tried increasing the driver memory; the error still exists and shows that the connection is disconnected:
[hadoop@Master spark2.4.0]$ bin/spark-submit --master spark://master:7077 --conf spark.worker.timeout=10000000 --driver-memory 1g --py-files id.py id.py --name id
2019-12-06 10:38:49 INFO ContextCleaner:54 - Cleaned accumulator 374
Exception in thread "serve-DataFrame" java.net.SocketException: broken pipe (Write failed)
at java.net.SocketOutputStream.socketWrite0(Native Method)
at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:111)
at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at java.io.DataOutputStream.write(DataOutputStream.java:107)
at java.io.FilterOutputStream.write(FilterOutputStream.java:97)
at org.apache.spark.api.python.PythonRDD$.org$apache$spark$api$python$PythonRDD$$write$1(PythonRDD.scala:212)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRDD$$anonfun$writeIteratorToStream$1.apply(PythonRDD.scala:224)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:148)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.api.python.PythonRDD$$anonfun$serveIterator$1.apply(PythonRDD.scala:413)
at org.apache.spark.api.python.PythonRDD$$anonfun$serveIterator$1.apply(PythonRDD.scala:412)
at org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply$mcV$sp(PythonRDD.scala:435)
at org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
at org.apache.spark.api.python.PythonRDD$$anonfun$6$$anonfun$apply$1.apply(PythonRDD.scala:435)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:436)
at org.apache.spark.api.python.PythonRDD$$anonfun$6.apply(PythonRDD.scala:432)
at org.apache.spark.api.python.PythonServer$$anon$1.run(PythonRDD.scala:862)
2019-12-06 11:06:12 WARN HeartbeatReceiver:66 - Removing executor 1 with no recent heartbeats: 149103 ms exceeds timeout 120000 ms
2019-12-06 11:06:12 ERROR TaskSchedulerImpl:70 - Lost executor 1 on 219.226.109.129: Executor heartbeat timed out after 149103 ms
2019-12-06 11:06:13 INFO SparkContext:54 - Invoking stop() from shutdown hook
2019-12-06 11:06:13 INFO DAGScheduler:54 - Executor lost: 1 (epoch 6)
2019-12-06 11:06:13 WARN HeartbeatReceiver:66 - Removing executor 0 with no recent heartbeats: 155761 ms exceeds timeout 120000 ms
2019-12-06 11:06:13 ERROR TaskSchedulerImpl:70 - Lost executor 0 on 219.226.109.131: Executor heartbeat timed out after 155761 ms
2019-12-06 11:06:13 INFO StandaloneSchedulerBackend:54 - Requesting to kill executor(s) 1
2019-12-06 11:06:13 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 1 from BlockManagerMaster.
2019-12-06 11:06:13 INFO BlockManagerMasterEndpoint:54 - Removing block manager BlockManagerId(1, 219.226.109.129, 42501, None)
2019-12-06 11:06:13 INFO BlockManagerMaster:54 - Removed 1 successfully in removeExecutor
2019-12-06 11:06:13 INFO DAGScheduler:54 - Shuffle files lost for executor: 1 (epoch 6)
2019-12-06 11:06:13 INFO StandaloneSchedulerBackend:54 - Actual list of executor(s) to be killed is 1
2019-12-06 11:06:13 INFO DAGScheduler:54 - Host added was in lost list earlier: 219.226.109.129
2019-12-06 11:06:13 INFO DAGScheduler:54 - Executor lost: 0 (epoch 7)
2019-12-06 11:06:13 INFO AbstractConnector:318 - Stopped Spark@490228e{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2019-12-06 11:06:13 INFO BlockManagerMasterEndpoint:54 - Trying to remove executor 0 from BlockManagerMaster.
2019-12-06 11:06:13 INFO BlockManagerMasterEndpoint:54 - Removing block manager BlockManagerId(0, 219.226.109.131, 42164, None)
2019-12-06 11:06:13 INFO BlockManagerMaster:54 - Removed 0 successfully in removeExecutor
2019-12-06 11:06:13 INFO DAGScheduler:54 - Shuffle files lost for executor: 0 (epoch 7)
2019-12-06 11:06:13 INFO DAGScheduler:54 - Host added was in lost list earlier: 219.226.109.131
2019-12-06 11:06:13 INFO SparkUI:54 - Stopped Spark web UI at http://Master:4040
2019-12-06 11:06:13 INFO BlockManagerMasterEndpoint:54 - Registering block manager 219.226.109.129:42501 with 413.9 MB RAM, BlockManagerId(1, 219.226.109.129, 42501, None)
2019-12-06 11:06:13 INFO BlockManagerMasterEndpoint:54 - Registering block manager 219.226.109.131:42164 with 413.9 MB RAM, BlockManagerId(0, 219.226.109.131, 42164, None)
2019-12-06 11:06:14 INFO StandaloneSchedulerBackend:54 - Shutting down all executors
2019-12-06 11:06:14 INFO CoarseGrainedSchedulerBackend$DriverEndpoint:54 - Asking each executor to shut down
2019-12-06 11:06:14 INFO BlockManagerInfo:54 - Added broadcast_15_piece0 in memory on 219.226.109.129:42501 (size: 21.1 KB, free: 413.9 MB)
2019-12-06 11:06:15 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2019-12-06 11:06:15 INFO BlockManagerInfo:54 - Added broadcast_15_piece0 in memory on 219.226.109.131:42164 (size: 21.1 KB, free: 413.9 MB)
2019-12-06 11:06:16 INFO MemoryStore:54 - MemoryStore cleared
2019-12-06 11:06:16 INFO BlockManager:54 - BlockManager stopped
2019-12-06 11:06:16 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2019-12-06 11:06:17 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2019-12-06 11:06:17 ERROR TransportResponseHandler:144 - Still have 1 requests outstanding when connection from Master/219.226.109.130:7077 is closed
2019-12-06 11:06:17 INFO SparkContext:54 - Successfully stopped SparkContext
2019-12-06 11:06:17 INFO ShutdownHookManager:54 - Shutdown hook called
2019-12-06 11:06:17 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-e2a29bac-7277-4476-ad23-315a27e9ccf0
2019-12-06 11:06:17 INFO ShutdownHookManager:54 - Deleting directory /tmp/localPyFiles-dd95954c-2e77-41ca-969d-a201269f5b5b
2019-12-06 11:06:17 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-bcd56b4a-fb32-4b58-a1d5-71abc5218d32
2019-12-06 11:06:17 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-e2a29bac-7277-4476-ad23-315a27e9ccf0/pyspark-d04b799f-a116-44d5-b6a5-811cc8c03743
Questions:
Is SSH related to the broken pipe?
Is increasing the driver memory helpful for this problem?
I have seen configuration posts on the Internet, but they are for highly provisioned clusters. Since I built the cluster on virtual machines on my own computer, the master has two cores and the slave has one core. How should I adjust the configuration?
Please try with
--conf spark.worker.timeout=10000000
You are missing the equals sign between the configuration name and its value.
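For reference, here is the corrected command from the question. The commented-out settings are not part of this answer; they are the standard Spark configs behind the "exceeds timeout 120000 ms" heartbeat messages in the log, and raising them is a common mitigation:
# corrected: '=' between the configuration name and its value
bin/spark-submit --master spark://master:7077 \
  --conf spark.worker.timeout=10000000 \
  --py-files id.py id.py --name id
# heartbeat-related settings (defaults: spark.network.timeout=120s, spark.executor.heartbeatInterval=10s):
#   --conf spark.network.timeout=600s
#   --conf spark.executor.heartbeatInterval=60s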
java.net.SocketException: broken pipe (Write failed) occurs when something is wrong with the access port. I suggest you change the master's port, which is at 8080. The port can be changed either in the configuration file or via command-line options (see the sketch below).
sbin/start-master.sh
The same can be tried with the worker node as well if the above does not fix the issue.
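For Spark 2.4's standalone scripts, the ports can also be passed on the command line. A sketch; the port numbers here are arbitrary examples:
# master RPC port (default 7077) and master web UI port (default 8080)
sbin/start-master.sh --port 7078 --webui-port 8081
# point the worker at the new master port (in Spark 2.4 the worker script is start-slave.sh)
sbin/start-slave.sh spark://master:7078 --port 7079 --webui-port 8082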
To see which ports are being used, you can run:
sudo netstat -ltup

How to deploy spark on Kubeedge?

I tried to use k8s deployment mode to deploy spark-2.4.3 on Kubeedge 1.1.0 but failed (Docker version 19.03.4, k8s version 1.16.1).
SPARK_DRIVER_BIND_ADDRESS=10.4.20.34
SPARK_IMAGE=spark:2.4.3
SPARK_MASTER="k8s://http://127.0.0.1:8080"
CMD=(
"$SPARK_HOME/bin/spark-submit"
--conf "spark.driver.bindAddress=$SPARK_DRIVER_BIND_ADDRESS"
--conf "spark.kubernetes.container.image=${SPARK_IMAGE}"
--conf "spark.executor.instances=1"
--conf "spark.kubernetes.executor.limit.cores=1"
--deploy-mode client
--master ${SPARK_MASTER}
--name spark-pi
--class org.apache.spark.examples.SparkPi
--driver-memory 1G
--executor-memory 1G
--num-executors 1
--executor-cores 1
file://${PWD}/spark-examples_2.11-2.4.3.jar
)
${CMD[@]}
Node status is normal.
kubectl get nodes
NAME STATUS ROLES AGE VERSION
edge-node-001 Ready edge 6d1h v1.15.3-kubeedge-v1.1.0-beta.0.178+c6a5aa738261e7-dirty
ubuntu-ms-7b89 Ready master 6d4h v1.16.1
But I got some errors
19/11/17 21:45:12 INFO k8s.ExecutorPodsAllocator: Going to request 1 executors from Kubernetes.
19/11/17 21:45:12 INFO util.Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 46571.
19/11/17 21:45:12 INFO netty.NettyBlockTransferService: Server created on 10.4.20.34:46571
19/11/17 21:45:12 INFO storage.BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
19/11/17 21:45:12 INFO storage.BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 10.4.20.34, 46571, None)
19/11/17 21:45:12 INFO storage.BlockManagerMasterEndpoint: Registering block manager 10.4.20.34:46571 with 366.3 MB RAM, BlockManagerId(driver, 10.4.20.34, 46571, None)
19/11/17 21:45:12 INFO storage.BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 10.4.20.34, 46571, None)
19/11/17 21:45:12 INFO storage.BlockManager: Initialized BlockManager: BlockManagerId(driver, 10.4.20.34, 46571, None)
19/11/17 21:45:12 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@451882b2{/metrics/json,null,AVAILABLE,@Spark}
19/11/17 21:45:42 INFO k8s.KubernetesClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after waiting maxRegisteredResourcesWaitingTime: 30000(ms)
19/11/17 21:45:42 INFO spark.SparkContext: Starting job: reduce at SparkPi.scala:38
19/11/17 21:45:42 INFO scheduler.DAGScheduler: Got job 0 (reduce at SparkPi.scala:38) with 2 output partitions
19/11/17 21:45:42 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (reduce at SparkPi.scala:38)
19/11/17 21:45:42 INFO scheduler.DAGScheduler: Parents of final stage: List()
19/11/17 21:45:42 INFO scheduler.DAGScheduler: Missing parents: List()
19/11/17 21:45:42 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34), which has no missing parents
19/11/17 21:45:42 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 1936.0 B, free 366.3 MB)
19/11/17 21:45:42 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 1256.0 B, free 366.3 MB)
19/11/17 21:45:42 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 10.4.20.34:46571 (size: 1256.0 B, free: 366.3 MB)
19/11/17 21:45:42 INFO spark.SparkContext: Created broadcast 0 from broadcast at DAGScheduler.scala:1161
19/11/17 21:45:42 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from ResultStage 0 (MapPartitionsRDD[1] at map at SparkPi.scala:34) (first 15 tasks are for partitions Vector(0, 1))
19/11/17 21:45:42 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
19/11/17 21:45:57 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/11/17 21:46:12 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/11/17 21:46:27 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/11/17 21:46:42 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/11/17 21:46:57 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
19/11/17 21:47:12 WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
Is it possible to deploy Spark on Kubeedge in Kubernetes deployment mode, or should I try standalone deployment mode?
I'm so confused.

Spark - Master: got disassociated, removing it

I am deploying a Spark cluster with 1 Master node and 3 Worker nodes. Within moments of deploying the Master and Worker nodes, the master starts spamming the logs with the following messages:
19/07/17 12:56:51 INFO Master: I have been elected leader! New state: ALIVE
19/07/17 12:56:56 INFO Master: Registering worker 172.26.140.209:35803 with 1 cores, 2.0 GB RAM
19/07/17 12:56:57 INFO Master: 172.26.140.163:59146 got disassociated, removing it.
19/07/17 12:56:58 INFO Master: 172.26.140.132:56252 got disassociated, removing it.
19/07/17 12:56:58 INFO Master: 172.26.140.194:62135 got disassociated, removing it.
19/07/17 12:57:02 INFO Master: Registering worker 172.26.140.169:44249 with 1 cores, 2.0 GB RAM
19/07/17 12:57:02 INFO Master: 172.26.140.163:59202 got disassociated, removing it.
19/07/17 12:57:03 INFO Master: 172.26.140.132:56355 got disassociated, removing it.
19/07/17 12:57:03 INFO Master: 172.26.140.194:62157 got disassociated, removing it.
19/07/17 12:57:07 INFO Master: 172.26.140.163:59266 got disassociated, removing it.
19/07/17 12:57:08 INFO Master: 172.26.140.132:56376 got disassociated, removing it.
19/07/17 12:57:08 INFO Master: Registering worker 172.26.140.204:43921 with 1 cores, 2.0 GB RAM
19/07/17 12:57:08 INFO Master: 172.26.140.194:62203 got disassociated, removing it.
19/07/17 12:57:12 INFO Master: 172.26.140.163:59342 got disassociated, removing it.
19/07/17 12:57:13 INFO Master: 172.26.140.132:56392 got disassociated, removing it.
19/07/17 12:57:13 INFO Master: 172.26.140.194:62268 got disassociated, removing it.
19/07/17 12:57:17 INFO Master: 172.26.140.163:59417 got disassociated, removing it.
19/07/17 12:57:18 INFO Master: 172.26.140.132:56415 got disassociated, removing it.
19/07/17 12:57:18 INFO Master: 172.26.140.194:62296 got disassociated, removing it.
19/07/17 12:57:22 INFO Master: 172.26.140.163:59472 got disassociated, removing it.
19/07/17 12:57:23 INFO Master: 172.26.140.132:56483 got disassociated, removing it.
19/07/17 12:57:23 INFO Master: 172.26.140.194:62323 got disassociated, removing it.
The Worker Nodes seem to be connected to the Master correctly and are logging the following:
19/07/17 12:56:56 INFO Utils: Successfully started service 'sparkWorker' on port 35803.
19/07/17 12:56:56 INFO Worker: Starting Spark worker 172.26.140.209:35803 with 1 cores, 2.0 GB RAM
19/07/17 12:56:56 INFO Worker: Running Spark version 2.4.3
19/07/17 12:56:56 INFO Worker: Spark home: /opt/spark
19/07/17 12:56:56 INFO Utils: Successfully started service 'WorkerUI' on port 8081.
19/07/17 12:56:56 INFO WorkerWebUI: Bound WorkerWebUI to 0.0.0.0, and started at http://spark-worker-0.spark-worker-service.default.svc.cluster.local:8081
19/07/17 12:56:56 INFO Worker: Connecting to master spark-master-service.default.svc.cluster.local:7077...
19/07/17 12:56:56 INFO TransportClientFactory: Successfully created connection to spark-master-service.default.svc.cluster.local/10.0.179.236:7077 after 49 ms (0 ms spent in bootstraps)
19/07/17 12:56:56 INFO Worker: Successfully registered with master spark://172.26.140.196:7077
But the Master still logs the disassociation error for three separate nodes every 5 seconds.
What is strange is that the IP addresses listed in the Master's logs all belong to the kube-proxy pods:
kube-system kube-proxy-5vp9r 1/1 Running 0 39h 172.26.140.163 aks-agentpool-31454219-2 <none> <none>
kube-system kube-proxy-kl695 1/1 Running 0 39h 172.26.140.132 aks-agentpool-31454219-1 <none> <none>
kube-system kube-proxy-xgjws 1/1 Running 0 39h 172.26.140.194 aks-agentpool-31454219-0 <none> <none>
My questions are two-fold:
1) Why are the kube-proxy pods connecting to the Master? Or why does the Master node think that the kube-proxy pods are taking part in this cluster?
2) What setting do I need to change in order to clear this message from my log files?
Here are the contents of my spark-defaults.conf file:
spark.master=spark://spark-master-service:7077
spark.submit.deploy-mode=cluster
spark.executor.cores=1
spark.driver.memory=500m
spark.executor.memory=500m
spark.eventLog.enabled=true
spark.eventLog.dir=/mnt/eventLog
I cannot find any meaningful reason why this is occurring and any assistance would be greatly appreciated.
I had the same problem with my Spark cluster in Kubernetes; I tested Spark 2.4.3 and Spark 2.4.4, and also Kubernetes 1.16 and 1.13.
This is the solution.
This is how I originally created my Spark session:
spark = SparkSession.builder.appName('Kubernetes-Spark-app').getOrCreate()
The issue was resolved by using the cluster IP of the Spark master:
spark = SparkSession.builder.master('spark://10.0.106.83:7077').appName('Kubernetes-Spark-app').getOrCreate()
This works with the following chart:
helm install microsoft/spark --generate-name
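To find the cluster IP to plug into .master(...), you can ask Kubernetes for the master service's address. The service name below is taken from the spark-defaults.conf in the question; adjust it to whatever the Helm chart actually created:
kubectl get svc spark-master-service -o jsonpath='{.spec.clusterIP}'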

Spark executor lost when increasing the number of executor instances

My Hadoop cluster currently has 4 nodes and 45 cores, running PySpark 2.4 through YARN. When I run spark-submit with one executor everything works fine, but if I change the number of executor instances to 3 or 4, the executors are killed by the driver and only one task keeps working.
I have changed the below settings in Cloudera Manager:
yarn.nodemanager.resource.memory-mb: 64 GB
yarn.nodemanager.resource.cpu-vcores: 45
And below is the log that I get:
19/03/21 11:28:48 INFO cluster.YarnScheduler: Adding task set 0.0 with 1 tasks
19/03/21 11:28:48 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, datanode1, executor 2, partition 0, PROCESS_LOCAL, 7701 bytes)
19/03/21 11:28:48 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on datanode1:42432 (size: 71.0 KB, free: 366.2 MB)
19/03/21 11:29:43 INFO spark.ExecutorAllocationManager: Request to remove executorIds: 1, 3
19/03/21 11:29:43 INFO cluster.YarnClientSchedulerBackend: Requesting to kill executor(s) 1, 3
19/03/21 11:29:43 INFO cluster.YarnClientSchedulerBackend: Actual list of executor(s) to be killed is 1, 3
19/03/21 11:29:43 INFO spark.ExecutorAllocationManager: Removing executor 1 because it has been idle for 60 seconds (new desired total will be 2)
19/03/21 11:29:43 INFO spark.ExecutorAllocationManager: Removing executor 3 because it has been idle for 60 seconds (new desired total will be 1)
19/03/21 11:29:45 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 3.
19/03/21 11:29:45 INFO scheduler.DAGScheduler: Executor lost: 3 (epoch 0)
19/03/21 11:29:45 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
19/03/21 11:29:45 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(3, datanode2, 32853, None)
19/03/21 11:29:45 INFO storage.BlockManagerMaster: Removed 3 successfully in removeExecutor
19/03/21 11:29:45 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 1.
19/03/21 11:29:45 INFO scheduler.DAGScheduler: Executor lost: 1 (epoch 0)
19/03/21 11:29:45 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
19/03/21 11:29:45 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, datanode3, 39466, None)
19/03/21 11:29:45 INFO storage.BlockManagerMaster: Removed 1 successfully in removeExecutor
19/03/21 11:29:45 INFO cluster.YarnScheduler: Executor 3 on datanode2 killed by driver.
19/03/21 11:29:45 INFO cluster.YarnScheduler: Executor 1 on datanode3 killed by driver.
19/03/21 11:29:45 INFO spark.ExecutorAllocationManager: Existing executor 3 has been removed (new total is 2)
19/03/21 11:29:45 INFO spark.ExecutorAllocationManager: Existing executor 1 has been removed (new total is 1)

Collect failed in ... s due to Stage cancelled because SparkContext was shut down

I want to display the number of elements in each partition, so I write the following:
def count_in_a_partition(iterator):
    yield sum(1 for _ in iterator)
If I use it like this
print("number of element in each partitions: {}".format(
my_rdd.mapPartitions(count_in_a_partition).collect()
))
I get the following:
19/02/18 21:41:15 INFO DAGScheduler: Job 3 failed: collect at /project/6008168/tamouze/testSparkCedar.py:435, took 30.859710 s
19/02/18 21:41:15 INFO DAGScheduler: ResultStage 3 (collect at /project/6008168/tamouze/testSparkCedar.py:435) failed in 30.848 s due to Stage cancelled because SparkContext was shut down
19/02/18 21:41:15 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
19/02/18 21:41:16 INFO MemoryStore: MemoryStore cleared
19/02/18 21:41:16 INFO BlockManager: BlockManager stopped
19/02/18 21:41:16 INFO BlockManagerMaster: BlockManagerMaster stopped
19/02/18 21:41:16 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
19/02/18 21:41:16 WARN BlockManager: Putting block rdd_3_14 failed due to exception java.net.SocketException: Connection reset.
19/02/18 21:41:16 WARN BlockManager: Block rdd_3_14 could not be removed as it was not found on disk or in memory
19/02/18 21:41:16 WARN BlockManager: Putting block rdd_3_3 failed due to exception java.net.SocketException: Connection reset.
19/02/18 21:41:16 WARN BlockManager: Block rdd_3_3 could not be removed as it was not found on disk or in memory
19/02/18 21:41:16 INFO SparkContext: Successfully stopped SparkContext
....
Note that my_rdd.take(1) returns:
[(u'id', u'text', array([-0.31921682, ...,0.890875]))]
How can I solve this issue?
You have to use the glom() function for that. Let's take an example.
Let's create a DataFrame first:
rdd=sc.parallelize([('a',22),('b',1),('c',4),('b',1),('d',2),('e',0),('d',3),('a',1),('c',4),('b',7),('a',2),('f',1)] )
df=rdd.toDF(['key','value'])
df=df.repartition(5,"key") # Make 5 Partitions
The number of partitions -
print("Number of partitions: {}".format(df.rdd.getNumPartitions()))
Number of partitions: 5
Number of rows/elements on each partition. This can give you an idea of skew -
print('Partitioning distribution: '+ str(df.rdd.glom().map(len).collect()))
Partitioning distribution: [3, 3, 2, 2, 2]
See how the rows are actually distributed across the partitions. Beware that if the dataset is big, your system could crash because of an out-of-memory (OOM) issue.
print("Partitions structure: {}".format(df.rdd.glom().collect()))
Partitions structure: [
  # Partition 1
  [Row(key='a', value=22), Row(key='a', value=1), Row(key='a', value=2)],
  # Partition 2
  [Row(key='b', value=1), Row(key='b', value=1), Row(key='b', value=7)],
  # Partition 3
  [Row(key='c', value=4), Row(key='c', value=4)],
  # Partition 4
  [Row(key='e', value=0), Row(key='f', value=1)],
  # Partition 5
  [Row(key='d', value=2), Row(key='d', value=3)]
]
