Livy session cannot get SparkContext, stuck on starting - apache-spark

I am trying to use PySpark from JupyterHub (running on Kubernetes) for interactive programming against a remote Spark cluster that also runs on Kubernetes, so I use sparkmagic and Livy (also on Kubernetes).
When I try to get the sparkContext and sparkSession in a notebook, the Livy session stays stuck in the 'starting' state until it is marked dead.
My Spark driver pod is running, and I can see this in its log:
53469 [pool-8-thread-1] INFO org.apache.livy.rsc.driver.SparkEntries - Spark context finished initialization in 34532ms
53625 [pool-8-thread-1] INFO org.apache.livy.rsc.driver.SparkEntries - Created Spark session.
128775 [dispatcher-CoarseGrainedScheduler] INFO org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint - Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.83.128.194:35040) with ID 1, ResourceProfileId 0
128927 [dispatcher-BlockManagerMaster] INFO org.apache.spark.storage.BlockManagerMasterEndpoint - Registering block manager 10.83.128.194:42385 with 4.6 GiB RAM, BlockManagerId(1, 10.83.128.194, 42385, None)
131902 [dispatcher-CoarseGrainedScheduler] INFO org.apache.spark.scheduler.cluster.k8s.KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint - Registered executor NettyRpcEndpointRef(spark-client://Executor) (10.83.128.130:58232) with ID 2, ResourceProfileId 0
132041 [dispatcher-BlockManagerMaster] INFO org.apache.spark.storage.BlockManagerMasterEndpoint - Registering block manager 10.83.128.130:37991 with 4.6 GiB RAM, BlockManagerId(2, 10.83.128.130, 37991, None)
My spark-executor-pod is also running.
This is my livy-server's log:
2022-05-19 08:36:54,959 DEBUG LivySession Session 0 in state starting. Sleeping 2 seconds.
2022-05-19 08:36:56,969 DEBUG LivySession Session 0 in state starting. Sleeping 2 seconds.
2022-05-19 08:36:58,979 DEBUG LivySession Session 0 in state starting. Sleeping 2 seconds.
2022-05-19 08:37:01,002 DEBUG LivySession Session 0 in state starting. Sleeping 2 seconds.
2022-05-19 08:37:03,015 ERROR LivySession Session 0 did not reach idle status in time. Current status is starting.
2022-05-19 08:37:03,016 INFO EventsHandler InstanceId: 0139a7a9-a0b5-439e-84f5-a9ca6c896360,EventName: notebookSessionCreationEnd,Timestamp: 2022-05-19 08:37:03.016038,SessionGuid: 14da96d9-8b24-4beb-a5ad-a32009c9f772,LivyKind: pyspark,SessionId: 0,Status: starting,Success: False,ExceptionType: LivyClientTimeoutException,ExceptionMessage: Session 0 did not start up in 600 seconds.
2022-05-19 08:37:03,016 INFO EventsHandler InstanceId: 0139a7a9-a0b5-439e-84f5-a9ca6c896360,EventName: notebookSessionDeletionStart,Timestamp: 2022-05-19 08:37:03.016288,SessionGuid: 14da96d9-8b24-4beb-a5ad-a32009c9f772,LivyKind: pyspark,SessionId: 0,Status: starting
2022-05-19 08:37:03,016 DEBUG LivySession Deleting session '0'
2022-05-19 08:37:03,037 INFO EventsHandler InstanceId: 0139a7a9-a0b5-439e-84f5-a9ca6c896360,EventName: notebookSessionDeletionEnd,Timestamp: 2022-05-19 08:37:03.036919,SessionGuid: 14da96d9-8b24-4beb-a5ad-a32009c9f772,LivyKind: pyspark,SessionId: 0,Status: dead,Success: True,ExceptionType: ,ExceptionMessage:
2022-05-19 08:37:03,037 ERROR SparkMagics Error creating session: Session 0 did not start up in 600 seconds.
How can I solve this problem? Thanks!
Spark version: 3.2.1
Livy version: 0.8.0
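One thing the two logs above already show is how long the driver took compared with the 600 second startup timeout. A minimal sketch of that comparison follows, assuming the leading number on each driver log line is milliseconds since the driver JVM started (that is an assumption about the log pattern, not something the post states). The helper name `offsets_seconds` is made up for illustration.

```python
import re

# Two lines copied from the driver log above.
DRIVER_LOG = """\
53625 [pool-8-thread-1] INFO org.apache.livy.rsc.driver.SparkEntries - Created Spark session.
132041 [dispatcher-BlockManagerMaster] INFO org.apache.spark.storage.BlockManagerMasterEndpoint - Registering block manager 10.83.128.130:37991 with 4.6 GiB RAM, BlockManagerId(2, 10.83.128.130, 37991, None)
"""

# Taken from "Session 0 did not start up in 600 seconds."
STARTUP_TIMEOUT_S = 600


def offsets_seconds(log_text):
    """Return (seconds, message) pairs for lines that start with a numeric offset.

    Assumption: the leading number is milliseconds since JVM start.
    """
    rows = []
    for line in log_text.splitlines():
        m = re.match(r"(\d+) \[[^\]]+\] \w+ \S+ - (.*)", line)
        if m:
            rows.append((int(m.group(1)) / 1000.0, m.group(2)))
    return rows


rows = offsets_seconds(DRIVER_LOG)
for secs, msg in rows:
    print(f"{secs:8.3f}s  {msg[:60]}")

last = max(secs for secs, _ in rows)
print(f"last driver event at ~{last:.0f}s, startup timeout is {STARTUP_TIMEOUT_S}s")
```

If that reading of the offsets is right, the Spark session and both executors were up well inside the timeout, which suggests looking at how the driver reports back to the Livy server rather than at Spark startup itself.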

Related

Spark Pod restarting every hour in Kubernetes

I have deployed spark applications in cluster-mode in kubernetes. The spark application pod is getting restarted almost every hour.
The driver log has this message before restart:
20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 1 on x.x.x.x: The executor with id 1 was deleted by a user or the framework.
20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 2 on y.y.y.y: The executor with id 2 was deleted by a user or the framework.
20/07/11 13:34:02 INFO DAGScheduler: Executor lost: 1 (epoch 0)
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, x.x.x.x, 44879, None)
20/07/11 13:34:02 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
20/07/11 13:34:02 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 0)
20/07/11 13:34:02 INFO DAGScheduler: Executor lost: 2 (epoch 1)
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, y.y.y.y, 46191, None)
20/07/11 13:34:02 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
20/07/11 13:34:02 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch 1)
20/07/11 13:34:02 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.
20/07/11 13:34:16 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.
And the Executor log has:
20/07/11 15:55:01 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
20/07/11 15:55:01 INFO MemoryStore: MemoryStore cleared
20/07/11 15:55:01 INFO BlockManager: BlockManager stopped
20/07/11 15:55:01 INFO ShutdownHookManager: Shutdown hook called
How can I find out what is causing the executors to be deleted?
Deployment:
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  1 max unavailable, 0 max surge
Pod Template:
  Labels:  app=test
           chart=test-2.0.0
           heritage=Tiller
           product=testp
           release=test
           service=test-spark
  Containers:
   test-spark:
    Image:      test-spark:2df66df06c
    Port:       <none>
    Host Port:  <none>
    Command:
      /spark/bin/start-spark.sh
    Args:
      while true; do sleep 30; done;
    Limits:
      memory:  4Gi
    Requests:
      memory:  4Gi
    Liveness:  exec [/spark/bin/liveness-probe.sh] delay=300s timeout=1s period=30s #success=1 #failure=10
    Environment:
      JVM_ARGS:                             -Xms256m -Xmx1g
      KUBERNETES_MASTER:                    https://kubernetes.default.svc
      KUBERNETES_NAMESPACE:                 test-spark
      IMAGE_PULL_POLICY:                    Always
      DRIVER_CPU:                           1
      DRIVER_MEMORY:                        2048m
      EXECUTOR_CPU:                         1
      EXECUTOR_MEMORY:                      2048m
      EXECUTOR_INSTANCES:                   2
      KAFKA_ADVERTISED_HOST_NAME:           kafka.default:9092
      ENRICH_KAFKA_ENRICHED_EVENTS_TOPICS:  test-events
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   test-spark-5c5997b459 (1/1 replicas created)
Events:          <none>
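As a first step towards finding what deletes the executors, the loss reasons that the driver itself reports can be pulled out of its log and grouped. The sketch below does that for the lines quoted in the question. The function name `lost_executors` and the normalisation of the executor id are my own choices for illustration, and the regular expression is written only against the log format shown above.

```python
import re
from collections import Counter

# Lines copied from the driver log in the question.
DRIVER_LOG = """\
20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 1 on x.x.x.x: The executor with id 1 was deleted by a user or the framework.
20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 2 on y.y.y.y: The executor with id 2 was deleted by a user or the framework.
20/07/11 13:34:02 INFO DAGScheduler: Executor lost: 1 (epoch 0)
"""

LOST = re.compile(
    r"^(\S+ \S+) ERROR TaskSchedulerImpl: Lost executor (\d+) on (\S+): (.*)$"
)


def lost_executors(log_text):
    """Extract one record per 'Lost executor' line."""
    events = []
    for line in log_text.splitlines():
        m = LOST.match(line)
        if m:
            ts, exec_id, host, reason = m.groups()
            events.append({
                "time": ts,
                "executor": int(exec_id),
                "host": host,
                # Mask the id inside the reason so identical causes group together.
                "reason": re.sub(r"\bid \d+\b", "id N", reason),
            })
    return events


events = lost_executors(DRIVER_LOG)
by_reason = Counter(e["reason"] for e in events)
for reason, count in by_reason.items():
    print(count, reason)
```

The timestamps collected this way can then be compared with what Kubernetes recorded for the same moment (for example the output of `kubectl get events` in the namespace) to see whether something outside Spark removed the pods.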
I did some quick research on running Spark on Kubernetes, and it seems that Spark by design terminates executor pods when they finish running Spark applications. Quoting the official Spark documentation:
When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.
Therefore, I believe the restarts are nothing to worry about, as long as your Spark instance still manages to start executor pods as and when required.
Reference: https://spark.apache.org/docs/2.4.5/running-on-kubernetes.html#how-it-works
I don't know how you configured your application pod, but you can stop the pod from restarting by including the following in your deployment YAML file. The pod will then never restart, and you can debug it from there.
restartPolicy: Never

Livy session stuck on starting after successful spark context creation

I've been trying to create a new Spark session with a Livy 0.7 server that runs on Ubuntu 18.04.
On that same machine I have a running Spark cluster with 2 workers, and I'm able to create a normal Spark session.
My problem is that after sending the following request to the Livy server, the session stays stuck in the starting state:
import json, pprint, requests, textwrap
host = 'http://localhost:8998'
data = {'kind': 'spark'}
headers = {'Content-Type': 'application/json'}
r = requests.post(host + '/sessions', data=json.dumps(data), headers=headers)
r.json()
From the session log I can see that the session started and created the Spark session:
20/06/03 13:52:31 INFO SparkEntries: Spark context finished initialization in 5197ms
20/06/03 13:52:31 INFO SparkEntries: Created Spark session.
20/06/03 13:52:46 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (xxx.xx.xx.xxx:1828) with ID 0
20/06/03 13:52:47 INFO BlockManagerMasterEndpoint: Registering block manager xxx.xx.xx.xxx:1830 with 434.4 MB RAM, BlockManagerId(0, xxx.xx.xx.xxx, 1830, None)
and also from the spark master UI:
and after the livy.rsc.server.idle-timeout is reached the session log then outputs:
20/06/03 14:28:04 WARN RSCDriver: Shutting down RSC due to idle timeout (10m).
20/06/03 14:28:04 INFO SparkUI: Stopped Spark web UI at http://172.17.52.209:4040
20/06/03 14:28:04 INFO StandaloneSchedulerBackend: Shutting down all executors
20/06/03 14:28:04 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
20/06/03 14:28:04 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/06/03 14:28:04 INFO MemoryStore: MemoryStore cleared
20/06/03 14:28:04 INFO BlockManager: BlockManager stopped
20/06/03 14:28:04 INFO BlockManagerMaster: BlockManagerMaster stopped
20/06/03 14:28:04 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/06/03 14:28:04 INFO SparkContext: Successfully stopped SparkContext
20/06/03 14:28:04 INFO SparkContext: SparkContext already stopped.
and after that the session dies :(
I already tried increasing the driver timeout with no luck, and didn't find any known issues like this.
My guess is that it has something to do with the Spark driver's connectivity to the RSC, but I have no idea where to configure that.
Does anyone know the reason or a solution for this?
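To watch what Livy itself reports while the session is stuck, the state can be polled instead of read once. Below is a minimal sketch of such a polling loop. `wait_for_session` and its parameters are hypothetical names, and the state getter is passed in as a function so the loop can be tried without a server. The commented line shows how it might be wired to Livy's `GET /sessions/{id}/state` endpoint, reusing `host` and `headers` from the request snippet in the question.

```python
import time

# Session states after which there is no point in waiting any longer.
STOP_STATES = {"idle", "busy", "shutting_down", "error", "dead", "killed", "success"}


def wait_for_session(get_state, timeout_s=120.0, poll_s=2.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll get_state() until the session leaves 'starting' or the timeout passes.

    Returns the list of distinct states observed, in order.
    """
    seen = []
    deadline = clock() + timeout_s
    while True:
        state = get_state()
        if not seen or seen[-1] != state:
            seen.append(state)
        if state in STOP_STATES or clock() >= deadline:
            return seen
        sleep(poll_s)


# Against a real server (not run here), get_state could be:
#   lambda: requests.get(f"{host}/sessions/{session_id}/state",
#                        headers=headers).json()["state"]

# Dry run with a canned sequence of states and no real sleeping.
fake_states = iter(["starting", "starting", "idle"])
print(wait_for_session(lambda: next(fake_states), sleep=lambda s: None))
```

If the returned list never gets past `starting` even though the session log shows "Created Spark session.", that points at the driver not reporting back to the Livy server rather than at Spark failing to start.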
We faced a similar problem in one of our environments. The only difference between the working and non-working environments was the Spark master setting in the livy.conf file.
I removed the setting livy.spark.master=yarn from livy.conf and set this value from the code itself.
// pass master as yarn
public static JavaSparkContext getSparkContext(final String master, final String appName) {
    LOGGER.info("Creating spark context");
    SparkConf conf = new SparkConf().setAppName(appName);
    if (Strings.isNullOrEmpty(master)) {
        LOGGER.warn("No spark master found setting local!!");
        conf.setMaster("local");
    } else {
        conf.setMaster(master);
    }
    conf.set("spark.submit.deployMode", "client");
    return new JavaSparkContext(conf);
}
This worked for me.
It would help if anyone could explain why this worked.

Livy create session dead

I added a package to my Spark config (in spark-default.conf), but when I create a new session with Livy it causes a problem (see the error below) and the session dies.
PS: when I remove this package, everything works fine.
20/05/04 00:17:35 WARN RSCClient: Error stopping RPC.
io.netty.util.concurrent.BlockingOperationException: DefaultChannelPromise#6d493840(uncancellable)
at io.netty.util.concurrent.DefaultPromise.checkDeadLock(DefaultPromise.java:394)
at io.netty.channel.DefaultChannelPromise.checkDeadLock(DefaultChannelPromise.java:157)
...........
Exception in thread "Thread-32" java.io.IOException: Stream closed
at java.io.BufferedInputStream.getBufIfOpen(BufferedInputStream.java:170)
at java.io.BufferedInputStream.read1(BufferedInputStream.java:283)
........
at org.apache.livy.utils.LineBufferedStream$$anon$1.run(LineBufferedStream.scala:46)
20/05/04 00:17:36 WARN ContextLauncher: Child process exited with code 143.
20/05/04 00:17:36 ERROR SparkProcApp: job was killed by user
20/05/04 00:17:36 INFO InteractiveSession: Stopped InteractiveSession 0.
20/05/04 00:28:17 INFO InteractiveSessionManager: Deleting InteractiveSession 0 because it was inactive for more than 3600000.0 ms.
20/05/04 00:28:17 INFO InteractiveSessionManager: Deleting session 0
20/05/04 00:28:17 INFO InteractiveSession: Stopping InteractiveSession 0...
20/05/04 00:28:17 INFO InteractiveSession: Stopped InteractiveSession 0.
20/05/04 00:28:17 INFO InteractiveSessionManager: Deleted session 0
I use:
cloudera hdp2.6.5:
spark 2.3
livy 0.7.0
Hadoop 2.7
lib unsupervised (https://github.com/unsupervise/spark-tss)
Steps:
livy conf => livy.spark.master yarn-cluster
spark-default conf => spark.jars.repositories https://dl.bintray.com/unsupervise/maven/
spark-default conf => spark.jars.packages com.github.unsupervise:spark-tss:0.1.1
Please also try adding this to the Spark defaults:
spark.jars.repositories https://dl.bintray.com/unsupervise/maven/
Alternatively (just to be sure that you have no issues with the spark-default.conf file), try including these configs in the Livy request body when submitting the Spark job (refer to the Livy API).
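A rough sketch of what such a request body could look like is below. It relies on the `conf` field that Livy's `POST /sessions` accepts for Spark properties. The helper `build_session_request` is a made-up name, and whether a given Livy deployment lets these particular keys through is an assumption to verify, since Livy can be configured to reject some properties.

```python
import json


def build_session_request(kind, packages, repositories, extra_conf=None):
    """Build a Livy session request body that carries the package settings in `conf`."""
    conf = {
        # Both properties take comma-separated lists.
        "spark.jars.packages": ",".join(packages),
        "spark.jars.repositories": ",".join(repositories),
    }
    conf.update(extra_conf or {})
    return {"kind": kind, "conf": conf}


payload = build_session_request(
    kind="spark",
    packages=["com.github.unsupervise:spark-tss:0.1.1"],
    repositories=["https://dl.bintray.com/unsupervise/maven/"],
)
print(json.dumps(payload, indent=2))

# To submit (not run here; the host and port are placeholders for your Livy server):
#   requests.post("http://<livy-host>:8998/sessions", data=json.dumps(payload),
#                 headers={"Content-Type": "application/json"})
```

If the session starts with the settings passed this way but not with spark-default.conf, the file is the likely problem. If it fails both ways, the package resolution itself is the more likely cause.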

Why am I getting "Removing worker because we got no heartbeat in 60 seconds" on Spark master

I think I might have stumbled across a bug and wanted to get other people's input. I am running a PySpark application using Spark 2.2.0 in standalone mode. I am doing a somewhat heavy transformation in Python inside a flatMap, and the driver keeps killing the workers.
Here is what I am seeing:
After 60 seconds without any heartbeat message from the workers, the master prints this message to the log:
Removing worker [worker name] because we got no heartbeat in 60 seconds
Removing worker [worker name] on [IP]:[port]
Telling app of lost executor: [executor number]
I then see in the driver log the following message:
Lost executor [executor number] on [executor IP]: worker lost
The worker then terminates and I see this message in its log:
Driver commanded a shutdown
I have looked at the Spark source code and from what I can tell, as long as the executor is alive it should send a heartbeat message back as it is using a ThreadUtils.newDaemonSingleThreadScheduledExecutor to do this.
One other thing I noticed while running top on one of the workers is that the executor JVM seems to be suspended throughout this process. There are as many Python processes as I specified in the SPARK_WORKER_CORES env variable, and each is consuming close to 100% of the CPU.
Anyone have any thoughts on this?
I was facing the same issue; increasing the intervals worked.
Excerpt from the start-all.sh logs:
INFO Utils: Successfully started service 'sparkMaster' on port 7077.
INFO Master: Starting Spark master at spark://master:7077
INFO Master: Running Spark version 3.0.1
INFO Utils: Successfully started service 'MasterUI' on port 8080.
INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://master:8080
INFO Master: I have been elected leader! New state: ALIVE
INFO Master: Registering worker slave01:41191 with 16 cores, 15.7 GiB RAM
INFO Master: Registering worker slave02:37853 with 16 cores, 15.7 GiB RAM
WARN Master: Removing worker-20210618205117-slave01-41191 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20210618205117-slave01-41191 on slave01:41191
INFO Master: Telling app of lost worker: worker-20210618205117-slave01-41191
WARN Master: Removing worker-20210618204723-slave02-37853 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20210618204723-slave02-37853 on slave02:37853
INFO Master: Telling app of lost worker: worker-20210618204723-slave02-37853
WARN Master: Got heartbeat from unregistered worker worker-20210618205117-slave01-41191. This worker was never registered, so ignoring the heartbeat.
WARN Master: Got heartbeat from unregistered worker worker-20210618204723-slave02-37853. This worker was never registered, so ignoring the heartbeat.
Solution: add the following configs to $SPARK_HOME/conf/spark-defaults.conf:
spark.network.timeout 50000
spark.executor.heartbeatInterval 5000
spark.worker.timeout 5000
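One detail worth checking with settings like these: the Spark configuration docs say `spark.executor.heartbeatInterval` should be significantly less than `spark.network.timeout`. The values above are bare numbers, so the unit each one ends up in depends on the property. A small sketch of a sanity check that only accepts values with an explicit unit follows. The helper names are made up, and the units used in the example calls are my assumption, not something the answer states.

```python
import re

# Milliseconds per unit suffix; only these suffixes are handled in this sketch.
_UNIT_MS = {"ms": 1, "s": 1000, "m": 60_000, "min": 60_000,
            "h": 3_600_000, "d": 86_400_000}


def to_ms(value):
    """Convert a duration such as '5000ms' or '120s' to milliseconds."""
    m = re.fullmatch(r"(\d+)(ms|s|min|m|h|d)", value.strip())
    if not m:
        raise ValueError(f"expected a number with an explicit unit, got {value!r}")
    return int(m.group(1)) * _UNIT_MS[m.group(2)]


def heartbeat_is_sane(heartbeat_interval, network_timeout):
    """True if the heartbeat interval is shorter than the network timeout."""
    return to_ms(heartbeat_interval) < to_ms(network_timeout)


# Assumed units for illustration only.
print(heartbeat_is_sane("5000ms", "50000s"))
print(heartbeat_is_sane("120s", "60s"))
```

Writing the unit into spark-defaults.conf as well (for example `5000ms` rather than `5000`) avoids having to remember which default unit applies to which property.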

Spark Jobserver fail just by receiving a job request

I am running Jobserver 0.7.0. It has 4 GB of RAM available, plus 10 GB for the context, and the system has 3 GB more free. The context had been running for a while when, on receiving a request, it failed without any error. The request was the same as others that had been processed while the context was up; it was not a special one. The following is the jobserver log. As you can see, the last successful job finished at 03:08:23,341, and when the next request was received the driver commanded a shutdown.
[2017-05-16 03:08:23,340] INFO output.FileOutputCommitter [] [] - Saved output of task 'attempt_201705160308_0321_m_000199_0' to file:/value_iq/spark-warehouse/spark_cube_users_v/tenant_id=7/_temporary/0/task_201705160308_0321_m_000199
[2017-05-16 03:08:23,340] INFO pred.SparkHadoopMapRedUtil [] [] - attempt_201705160308_0321_m_000199_0: Committed
[2017-05-16 03:08:23,341] INFO he.spark.executor.Executor [] [] - Finished task 199.0 in stage 321.0 (TID 49474). 2738 bytes result sent to driver
[2017-05-16 03:39:02,195] INFO arseGrainedExecutorBackend [] [] - Driver commanded a shutdown
[2017-05-16 03:39:02,239] INFO storage.memory.MemoryStore [] [] - MemoryStore cleared
[2017-05-16 03:39:02,254] INFO spark.storage.BlockManager [] [] - BlockManager stopped
[2017-05-16 03:39:02,363] ERROR arseGrainedExecutorBackend [] [] - RECEIVED SIGNAL TERM
[2017-05-16 03:39:02,404] INFO k.util.ShutdownHookManager [] [] - Shutdown hook called
[2017-05-16 03:39:02,412] INFO k.util.ShutdownHookManager [] [] - Deleting directory /tmp/spark-556033e2-c456-49d6-a43c-ef2cd3494b71/executor-b3ceaf84-e66a-45ed-acfe-1052ab1de2f8/spark-87671e4f-54da-47d7-a077-eb5f75d07e39
The Spark Worker server just logs the following:
17/05/15 19:25:54 INFO ExternalShuffleBlockResolver: Registered executor AppExecId{appId=app-20170515192550-0004, execId=0} with ExecutorShuffleInfo{localDirs=[/tmp/spark-556033e2-c456-49d6-a43c-ef2cd3494b71/executor-b3ceaf84-e66a-45ed-acfe-1052ab1de2f8/blockmgr-eca888c0-4e63-421c-9e61-d959ee45f8e9], subDirsPerLocalDir=64, shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
17/05/16 03:39:02 INFO Worker: Asked to kill executor app-20170515192550-0004/0
17/05/16 03:39:02 INFO ExecutorRunner: Runner thread for executor app-20170515192550-0004/0 interrupted
17/05/16 03:39:02 INFO ExecutorRunner: Killing process!
17/05/16 03:39:02 INFO Worker: Executor app-20170515192550-0004/0 finished with state KILLED exitStatus 0
17/05/16 03:39:02 INFO Worker: Cleaning up local directories for application app-20170515192550-0004
17/05/16 03:39:07 INFO ExternalShuffleBlockResolver: Application app-20170515192550-0004 removed, cleanupLocalDirs = true
17/05/16 03:39:07 INFO ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=app-20170515192550-0004, execId=0}'s 1 local dirs
And the Master log:
17/05/16 03:39:02 INFO Master: Received unregister request from application app-20170515192550-0004
17/05/16 03:39:02 INFO Master: Removing app app-20170515192550-0004
17/05/16 03:39:02 INFO Master: 157.97.107.150:33928 got disassociated, removing it.
17/05/16 03:39:02 INFO Master: 157.97.107.150:55444 got disassociated, removing it.
17/05/16 03:39:02 WARN Master: Got status update for unknown executor app-20170515192550-0004/0
Before receiving this request, Spark wasn't executing any other job; the context was using 5,3G/10G and the driver 1,3G/4G.
What does "Driver commanded a shutdown" mean?
Is there any log property that can be changed to see more details in the logs?
How can a simple request break the context?
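For reference when correlating the three logs, the idle gap between the last finished task and the shutdown can be computed directly from the two jobserver timestamps quoted above. This is only a small sketch of the arithmetic; it does not explain the shutdown by itself.

```python
from datetime import datetime

# Timestamp layout used in the jobserver log, e.g. "2017-05-16 03:08:23,341".
FMT = "%Y-%m-%d %H:%M:%S,%f"

last_task = datetime.strptime("2017-05-16 03:08:23,341", FMT)   # "Finished task 199.0 ..."
shutdown = datetime.strptime("2017-05-16 03:39:02,195", FMT)    # "Driver commanded a shutdown"

gap = shutdown - last_task
print(f"idle for {gap.total_seconds():.3f} seconds before the shutdown")
```

The gap comes out at a little over half an hour, and the master log shows the application unregistering at the same second as the shutdown (03:39:02), so the driver side is where to look for whatever happened when the request arrived.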
