Spark Jobserver fails just by receiving a job request - apache-spark

Jobserver 0.7.0 has 4 GB of RAM available and 10 GB for the context; the system has 3 GB more free. The context had been running for a while when, on receiving a request, it failed without any error. The request was the same as others that had been processed while the context was up; it was not a special one. The following is the jobserver log. As you can see, the last successful job finished at 03:08:23,341, and when the next request was received the driver commanded a shutdown.
[2017-05-16 03:08:23,340] INFO output.FileOutputCommitter [] [] - Saved output of task 'attempt_201705160308_0321_m_000199_0' to file:/value_iq/spark-warehouse/spark_cube_users_v/tenant_id=7/_temporary/0/task_201705160308_0321_m_000199
[2017-05-16 03:08:23,340] INFO pred.SparkHadoopMapRedUtil [] [] - attempt_201705160308_0321_m_000199_0: Committed
[2017-05-16 03:08:23,341] INFO he.spark.executor.Executor [] [] - Finished task 199.0 in stage 321.0 (TID 49474). 2738 bytes result sent to driver
[2017-05-16 03:39:02,195] INFO arseGrainedExecutorBackend [] [] - Driver commanded a shutdown
[2017-05-16 03:39:02,239] INFO storage.memory.MemoryStore [] [] - MemoryStore cleared
[2017-05-16 03:39:02,254] INFO spark.storage.BlockManager [] [] - BlockManager stopped
[2017-05-16 03:39:02,363] ERROR arseGrainedExecutorBackend [] [] - RECEIVED SIGNAL TERM
[2017-05-16 03:39:02,404] INFO k.util.ShutdownHookManager [] [] - Shutdown hook called
[2017-05-16 03:39:02,412] INFO k.util.ShutdownHookManager [] [] - Deleting directory /tmp/spark-556033e2-c456-49d6-a43c-ef2cd3494b71/executor-b3ceaf84-e66a-45ed-acfe-1052ab1de2f8/spark-87671e4f-54da-47d7-a077-eb5f75d07e39
The Spark Worker just logs the following:
17/05/15 19:25:54 INFO ExternalShuffleBlockResolver: Registered executor AppExecId{appId=app-20170515192550-0004, execId=0} with ExecutorShuffleInfo{localDirs=[/tmp/spark-556033e2-c456-49d6-a43c-ef2cd3494b71/executor-b3ceaf84-e66a-45ed-acfe-1052ab1de2f8/blockmgr-eca888c0-4e63-421c-9e61-d959ee45f8e9], subDirsPerLocalDir=64, shuffleManager=org.apache.spark.shuffle.sort.SortShuffleManager}
17/05/16 03:39:02 INFO Worker: Asked to kill executor app-20170515192550-0004/0
17/05/16 03:39:02 INFO ExecutorRunner: Runner thread for executor app-20170515192550-0004/0 interrupted
17/05/16 03:39:02 INFO ExecutorRunner: Killing process!
17/05/16 03:39:02 INFO Worker: Executor app-20170515192550-0004/0 finished with state KILLED exitStatus 0
17/05/16 03:39:02 INFO Worker: Cleaning up local directories for application app-20170515192550-0004
17/05/16 03:39:07 INFO ExternalShuffleBlockResolver: Application app-20170515192550-0004 removed, cleanupLocalDirs = true
17/05/16 03:39:07 INFO ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=app-20170515192550-0004, execId=0}'s 1 local dirs
And the Master log:
17/05/16 03:39:02 INFO Master: Received unregister request from application app-20170515192550-0004
17/05/16 03:39:02 INFO Master: Removing app app-20170515192550-0004
17/05/16 03:39:02 INFO Master: 157.97.107.150:33928 got disassociated, removing it.
17/05/16 03:39:02 INFO Master: 157.97.107.150:55444 got disassociated, removing it.
17/05/16 03:39:02 WARN Master: Got status update for unknown executor app-20170515192550-0004/0
Before receiving this request, Spark wasn't executing any other job; the context was using 5.3G/10G and the driver 1.3G/4G.
What does "Driver commanded a shutdown" mean?
Is there any log property that can be changed to see more detail in the logs?
How can a simple request break the context?
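One way to pin down when the context went quiet is to scan the jobserver log for the longest silence between consecutive entries. The following is only a minimal sketch, not part of Jobserver or Spark: the function name largest_gap and the regular expression are assumptions for illustration, based on the timestamp format shown in the log above.

```python
import re
from datetime import datetime

# Matches the jobserver timestamp prefix, e.g. "[2017-05-16 03:08:23,341]"
STAMP = re.compile(r"^\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\]")


def largest_gap(lines):
    """Return (seconds, earlier_line, later_line) for the widest silence
    between two consecutive timestamped lines."""
    stamped = []
    for line in lines:
        match = STAMP.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S,%f")
            stamped.append((when, line))
    best = (0.0, None, None)
    # Compare each stamped line with the one that follows it
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if gap > best[0]:
            best = (gap, l0, l1)
    return best
```

Fed the log above, this points at the roughly 30-minute gap between the last "Finished task" entry and "Driver commanded a shutdown", which is the window worth checking in the driver-side logs.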

Related

Spark Pod restarting every hour in Kubernetes

I have deployed Spark applications in cluster mode on Kubernetes. The Spark application pod is restarted almost every hour.
The driver log has these messages before the restart:
20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 1 on x.x.x.x: The executor with id 1 was deleted by a user or the framework.
20/07/11 13:34:02 ERROR TaskSchedulerImpl: Lost executor 2 on y.y.y.y: The executor with id 2 was deleted by a user or the framework.
20/07/11 13:34:02 INFO DAGScheduler: Executor lost: 1 (epoch 0)
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(1, x.x.x.x, 44879, None)
20/07/11 13:34:02 INFO BlockManagerMaster: Removed 1 successfully in removeExecutor
20/07/11 13:34:02 INFO DAGScheduler: Shuffle files lost for executor: 1 (epoch 0)
20/07/11 13:34:02 INFO DAGScheduler: Executor lost: 2 (epoch 1)
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
20/07/11 13:34:02 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, y.y.y.y, 46191, None)
20/07/11 13:34:02 INFO BlockManagerMaster: Removed 2 successfully in removeExecutor
20/07/11 13:34:02 INFO DAGScheduler: Shuffle files lost for executor: 2 (epoch 1)
20/07/11 13:34:02 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.
20/07/11 13:34:16 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes.
And the Executor log has:
20/07/11 15:55:01 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
20/07/11 15:55:01 INFO MemoryStore: MemoryStore cleared
20/07/11 15:55:01 INFO BlockManager: BlockManager stopped
20/07/11 15:55:01 INFO ShutdownHookManager: Shutdown hook called
How can I find out what is causing the executors to be deleted?
Deployment:
Replicas:               1 desired | 1 updated | 1 total | 1 available | 0 unavailable
StrategyType:           RollingUpdate
MinReadySeconds:        0
RollingUpdateStrategy:  1 max unavailable, 0 max surge
Pod Template:
  Labels:  app=test
           chart=test-2.0.0
           heritage=Tiller
           product=testp
           release=test
           service=test-spark
  Containers:
   test-spark:
    Image:      test-spark:2df66df06c
    Port:       <none>
    Host Port:  <none>
    Command:
      /spark/bin/start-spark.sh
    Args:
      while true; do sleep 30; done;
    Limits:
      memory:  4Gi
    Requests:
      memory:  4Gi
    Liveness:  exec [/spark/bin/liveness-probe.sh] delay=300s timeout=1s period=30s #success=1 #failure=10
    Environment:
      JVM_ARGS:                             -Xms256m -Xmx1g
      KUBERNETES_MASTER:                    https://kubernetes.default.svc
      KUBERNETES_NAMESPACE:                 test-spark
      IMAGE_PULL_POLICY:                    Always
      DRIVER_CPU:                           1
      DRIVER_MEMORY:                        2048m
      EXECUTOR_CPU:                         1
      EXECUTOR_MEMORY:                      2048m
      EXECUTOR_INSTANCES:                   2
      KAFKA_ADVERTISED_HOST_NAME:           kafka.default:9092
      ENRICH_KAFKA_ENRICHED_EVENTS_TOPICS:  test-events
Conditions:
  Type         Status  Reason
  ----         ------  ------
  Available    True    MinimumReplicasAvailable
  Progressing  True    NewReplicaSetAvailable
OldReplicaSets:  <none>
NewReplicaSet:   test-spark-5c5997b459 (1/1 replicas created)
Events:          <none>
I did some quick research on running Spark on Kubernetes, and it seems that Spark by design terminates executor pods when the Spark application finishes running. Quoted from the official Spark website:
When the application completes, the executor pods terminate and are cleaned up, but the driver pod persists logs and remains in “completed” state in the Kubernetes API until it’s eventually garbage collected or manually cleaned up.
Therefore, I believe there is nothing to worry about with the restarts as long as your Spark instance still manages to start executor pods as and when required.
Reference: https://spark.apache.org/docs/2.4.5/running-on-kubernetes.html#how-it-works
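I don't know how you configured your application pod, but to stop the pod from restarting you can include the following in its YAML spec, so that the pod never restarts and you can debug it from there. Note that a Deployment's pod template only accepts restartPolicy: Always, so this setting applies to a bare Pod or a Job rather than to a Deployment.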
restartPolicy: Never
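To see why a container was restarted, the pod's status records the previous termination of each container under status.containerStatuses[].lastState.terminated. Below is a minimal sketch that pulls those fields out of the JSON printed by kubectl get pod <pod-name> -o json. The function name last_termination and the sample pod are hypothetical and for illustration only; the sample values are made up and are not taken from the question.

```python
def last_termination(pod):
    """Collect the previous termination record of each container in a pod object."""
    found = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        terminated = status.get("lastState", {}).get("terminated")
        if terminated:
            found.append({
                "container": status.get("name"),
                "restarts": status.get("restartCount"),
                "reason": terminated.get("reason"),
                "exitCode": terminated.get("exitCode"),
                "finishedAt": terminated.get("finishedAt"),
            })
    return found


# Hypothetical, heavily trimmed pod object; real output has many more fields.
SAMPLE_POD = {
    "status": {
        "containerStatuses": [
            {
                "name": "test-spark",
                "restartCount": 3,
                "lastState": {
                    "terminated": {
                        "reason": "Error",
                        "exitCode": 1,
                        "finishedAt": "2020-07-11T13:34:02Z",
                    }
                },
            }
        ]
    }
}

print(last_termination(SAMPLE_POD))
```

In practice the pod object would come from json.load on the kubectl output. The reason and exitCode fields, together with the pod's events, usually indicate whether the restart came from a failed liveness probe, a memory limit, or the process exiting on its own.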

Livy session stuck on starting after successful spark context creation

I've been trying to create a new Spark session with a Livy 0.7 server that runs on Ubuntu 18.04.
On that same machine I have a running Spark cluster with 2 workers, and I'm able to create a normal Spark session.
My problem is that after sending the following request to the Livy server, the session stays stuck in the starting state:
import json
import requests

host = 'http://localhost:8998'
data = {'kind': 'spark'}  # ask for an interactive Scala Spark session
headers = {'Content-Type': 'application/json'}
# POST /sessions asks Livy to start a new interactive session
r = requests.post(host + '/sessions', data=json.dumps(data), headers=headers)
r.json()
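To watch whether the session ever leaves the starting state, the state can be polled through Livy's GET /sessions/{id}/state endpoint. The following is a rough sketch using only the standard library. The helper names fetch_state and wait_until_idle, the timeout values, and the choice to treat error, dead and killed as terminal are assumptions for illustration.

```python
import json
import time
import urllib.request


def fetch_state(host, session_id):
    """GET /sessions/{id}/state from Livy and return the 'state' string."""
    with urllib.request.urlopen(f"{host}/sessions/{session_id}/state") as resp:
        return json.load(resp)["state"]


def wait_until_idle(get_state, timeout_s=120.0, interval_s=2.0, sleep=time.sleep):
    """Poll get_state() until it returns 'idle'; return the list of states seen.

    Raises RuntimeError on a terminal state and TimeoutError if 'idle'
    is not reached within roughly timeout_s seconds of waiting.
    """
    seen = []
    waited = 0.0
    while True:
        state = get_state()
        seen.append(state)
        if state == "idle":
            return seen
        if state in ("error", "dead", "killed"):
            raise RuntimeError(f"session ended in state {state!r}")
        if waited >= timeout_s:
            raise TimeoutError(f"still {state!r} after {timeout_s} s")
        sleep(interval_s)
        waited += interval_s
```

With the request above, a call such as wait_until_idle(lambda: fetch_state(host, r.json()['id'])) would return once the session is usable, or raise if it stays in starting.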
From the session log I can see that the session starts and creates the Spark session:
20/06/03 13:52:31 INFO SparkEntries: Spark context finished initialization in 5197ms
20/06/03 13:52:31 INFO SparkEntries: Created Spark session.
20/06/03 13:52:46 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Registered executor NettyRpcEndpointRef(spark-client://Executor) (xxx.xx.xx.xxx:1828) with ID 0
20/06/03 13:52:47 INFO BlockManagerMasterEndpoint: Registering block manager xxx.xx.xx.xxx:1830 with 434.4 MB RAM, BlockManagerId(0, xxx.xx.xx.xxx, 1830, None)
and I can also see it in the Spark master UI.
After the livy.rsc.server.idle-timeout is reached, the session log outputs:
20/06/03 14:28:04 WARN RSCDriver: Shutting down RSC due to idle timeout (10m).
20/06/03 14:28:04 INFO SparkUI: Stopped Spark web UI at http://172.17.52.209:4040
20/06/03 14:28:04 INFO StandaloneSchedulerBackend: Shutting down all executors
20/06/03 14:28:04 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
20/06/03 14:28:04 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/06/03 14:28:04 INFO MemoryStore: MemoryStore cleared
20/06/03 14:28:04 INFO BlockManager: BlockManager stopped
20/06/03 14:28:04 INFO BlockManagerMaster: BlockManagerMaster stopped
20/06/03 14:28:04 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/06/03 14:28:04 INFO SparkContext: Successfully stopped SparkContext
20/06/03 14:28:04 INFO SparkContext: SparkContext already stopped.
and after that the session dies :(
I already tried increasing the driver timeout with no luck, and didn't find any known issues like this.
My guess is that it has something to do with the Spark driver's connectivity to the RSC, but I have no idea where to configure that.
Does anyone know the reason or a solution?
We faced a similar problem in one of our environments. The only difference between the working and non-working environments was the Spark master setting in the livy.conf file.
I removed the setting livy.spark.master=yarn from livy.conf and set this value from the code itself:
// pass master as yarn
public static JavaSparkContext getSparkContext(final String master, final String appName) {
    LOGGER.info("Creating spark context");
    SparkConf conf = new SparkConf().setAppName(appName);
    if (Strings.isNullOrEmpty(master)) {
        LOGGER.warn("No spark master found setting local!!");
        conf.setMaster("local");
    } else {
        conf.setMaster(master);
    }
    conf.set("spark.submit.deployMode", "client");
    return new JavaSparkContext(conf);
}
This worked for me.
It would help if anyone could point out why it worked.

All executors finish with state KILLED and exitStatus 1

I am trying to set up a local Spark cluster. I am using Spark 2.4.4 on a Windows 10 machine.
To start the master and one worker I run
spark-class org.apache.spark.deploy.master.Master
spark-class org.apache.spark.deploy.worker.Worker 172.17.1.230:7077
After submitting an application to the cluster, it finishes successfully but in the Spark web admin UI it says that the application is KILLED. It's also what I get from worker logs. I have tried running my own examples and examples included in the Spark installation. They all get killed with exitStatus 1.
To start the JavaSparkPi example from the Spark installation folder, I run
Spark> spark-submit --master spark://172.17.1.230:7077 --class org.apache.spark.examples.JavaSparkPi .\examples\jars\spark-examples_2.11-2.4.4.jar
After the calculation finishes, part of the log reads
20/01/19 18:55:11 INFO DAGScheduler: Job 0 finished: reduce at JavaSparkPi.java:54, took 4.183853 s
Pi is roughly 3.13814
20/01/19 18:55:11 INFO SparkUI: Stopped Spark web UI at http://Nikola-PC:4040
20/01/19 18:55:11 INFO StandaloneSchedulerBackend: Shutting down all executors
20/01/19 18:55:11 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asking each executor to shut down
20/01/19 18:55:11 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/01/19 18:55:11 WARN TransportChannelHandler: Exception in connection from /172.17.1.230:58560
java.io.IOException: An existing connection was forcibly closed by the remote host
The stderr log of the completed application ends with
20/01/19 18:55:11 INFO Executor: Finished task 1.0 in stage 0.0 (TID 1). 910 bytes result sent to driver
20/01/19 18:55:11 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 910 bytes result sent to driver
20/01/19 18:55:11 INFO CoarseGrainedExecutorBackend: Driver commanded a shutdown
The worker log outputs
20/01/19 18:55:06 INFO ExecutorRunner: Launch command: "C:\Program Files\Java\jdk1.8.0_231\bin\java" "-cp" "C:\Users\nikol\Spark\bin\..\conf\;C:\Users\nikol\Spark\jars\*" "-Xmx1024M" "-Dspark.driver.port=58484" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler#Nikola-PC:58484" "--executor-id" "0" "--hostname" "172.17.1.230" "--cores" "12" "--app-id" "app-20200119185506-0001" "--worker-url" "spark://Worker#172.17.1.230:58069"
20/01/19 18:55:11 INFO Worker: Asked to kill executor app-20200119185506-0001/0
20/01/19 18:55:11 INFO ExecutorRunner: Runner thread for executor app-20200119185506-0001/0 interrupted
20/01/19 18:55:11 INFO ExecutorRunner: Killing process!
20/01/19 18:55:11 INFO Worker: Executor app-20200119185506-0001/0 finished with state KILLED exitStatus 1
I have tried Spark 2.4.4 built for Hadoop 2.6 and for Hadoop 2.7. The problem remains in both cases.
This problem is the same as this one.

Why am I getting "Removing worker because we got no heartbeat in 60 seconds" on Spark master

I think I might have stumbled across a bug and wanted to get other people's input. I am running a pyspark application using Spark 2.2.0 in standalone mode. I am doing a somewhat heavy transformation in Python inside a flatMap, and the driver keeps killing the workers.
Here is what I am seeing:
After 60 s of not seeing any heartbeat message from the workers, the master prints this to its log:
Removing worker [worker name] because we got no heartbeat in 60 seconds
Removing worker [worker name] on [IP]:[port]
Telling app of lost executor: [executor number]
I then see in the driver log the following message:
Lost executor [executor number] on [executor IP]: worker lost
The worker then terminates and I see this message in its log:
Driver commanded a shutdown
I have looked at the Spark source code and from what I can tell, as long as the executor is alive it should send a heartbeat message back as it is using a ThreadUtils.newDaemonSingleThreadScheduledExecutor to do this.
One other thing I noticed while running top on one of the workers is that the executor JVM seems to be suspended throughout this process. There are as many python processes as I specified in the SPARK_WORKER_CORES env variable, and each is consuming close to 100% of the CPU.
Anyone have any thoughts on this?
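The removal described above can be pictured with a toy model: the master remembers when it last heard from each worker and drops any worker that has been silent longer than the timeout, whether or not the worker process is still alive. This is only a simplified sketch for illustration and not Spark's actual implementation; the class and method names are hypothetical.

```python
class HeartbeatMonitor:
    """Toy model of a master that drops workers whose heartbeats stop arriving."""

    def __init__(self, timeout_s=60.0):
        self.timeout_s = timeout_s
        self.last_seen = {}  # worker id -> time of the most recent heartbeat

    def heartbeat(self, worker_id, now):
        """Record that a heartbeat from worker_id arrived at time now."""
        self.last_seen[worker_id] = now

    def expire(self, now):
        """Remove and return the workers silent for longer than timeout_s."""
        lost = sorted(w for w, t in self.last_seen.items() if now - t > self.timeout_s)
        for worker_id in lost:
            del self.last_seen[worker_id]
        return lost


monitor = HeartbeatMonitor(timeout_s=60.0)
monitor.heartbeat("worker-a", now=0.0)
monitor.heartbeat("worker-b", now=0.0)
monitor.heartbeat("worker-a", now=50.0)  # worker-b stays silent
print(monitor.expire(now=61.0))  # worker-b is dropped, worker-a survives
```

In this model a worker whose sending thread is starved of CPU looks the same to the master as one that has crashed, which matches the observation that the JVM appears suspended while the Python processes saturate the cores.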
I was facing the same issue; increasing the intervals worked.
Excerpt from the start-all.sh logs:
INFO Utils: Successfully started service 'sparkMaster' on port 7077.
INFO Master: Starting Spark master at spark://master:7077
INFO Master: Running Spark version 3.0.1
INFO Utils: Successfully started service 'MasterUI' on port 8080.
INFO MasterWebUI: Bound MasterWebUI to 0.0.0.0, and started at http://master:8080
INFO Master: I have been elected leader! New state: ALIVE
INFO Master: Registering worker slave01:41191 with 16 cores, 15.7 GiB RAM
INFO Master: Registering worker slave02:37853 with 16 cores, 15.7 GiB RAM
WARN Master: Removing worker-20210618205117-slave01-41191 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20210618205117-slave01-41191 on slave01:41191
INFO Master: Telling app of lost worker: worker-20210618205117-slave01-41191
WARN Master: Removing worker-20210618204723-slave02-37853 because we got no heartbeat in 60 seconds
INFO Master: Removing worker worker-20210618204723-slave02-37853 on slave02:37853
INFO Master: Telling app of lost worker: worker-20210618204723-slave02-37853
WARN Master: Got heartbeat from unregistered worker worker-20210618205117-slave01-41191. This worker was never registered, so ignoring the heartbeat.
WARN Master: Got heartbeat from unregistered worker worker-20210618204723-slave02-37853. This worker was never registered, so ignoring the heartbeat.
Solution: add the following settings to $SPARK_HOME/conf/spark-defaults.conf:
spark.network.timeout 50000
spark.executor.heartbeatInterval 5000
spark.worker.timeout 5000

Spark UI's kill is not killing Driver

I am trying to kill my Spark-Kafka streaming job from the Spark UI. It kills the application, but the driver keeps running.
Can anyone help me with this? My other streaming jobs are fine; only one of them gives this problem every time.
I can't kill the driver through a command or the Spark UI. The Spark Master is alive.
The output I collected from the logs is:
16/10/25 03:14:25 INFO BlockManagerMaster: Removed 0 successfully in removeExecutor
16/10/25 03:14:25 INFO SparkUI: Stopped Spark web UI at http://***:4040
16/10/25 03:14:25 INFO SparkDeploySchedulerBackend: Shutting down all executors
16/10/25 03:14:25 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
16/10/25 03:14:35 INFO AppClient: Stop request to Master timed out; it may already be shut down.
16/10/25 03:14:35 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/10/25 03:14:35 INFO MemoryStore: MemoryStore cleared
16/10/25 03:14:35 INFO BlockManager: BlockManager stopped
16/10/25 03:14:35 INFO BlockManagerMaster: BlockManagerMaster stopped
16/10/25 03:14:35 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/10/25 03:14:35 INFO SparkContext: Successfully stopped SparkContext
16/10/25 03:14:35 ERROR Inbox: Ignoring error
org.apache.spark.SparkException: Exiting due to error from cluster scheduler: Master removed our application: KILLED
at org.apache.spark.scheduler.TaskSchedulerImpl.error(TaskSchedulerImpl.scala:438)
at org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.dead(SparkDeploySchedulerBackend.scala:124)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint.markDead(AppClient.scala:264)
at org.apache.spark.deploy.client.AppClient$ClientEndpoint$$anonfun$receive$1.applyOrElse(AppClient.scala:172)
at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:116)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:204)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:215)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/10/25 03:14:35 WARN NettyRpcEnv: Ignored message: true
16/10/25 03:14:35 WARN AppClient$ClientEndpoint: Connection to master:7077 failed; waiting for master to reconnect...
16/10/25 03:14:35 WARN AppClient$ClientEndpoint: Connection to master:7077 failed; waiting for master to reconnect...
Get the running driverId from the Spark UI and send a POST request to the Spark master's REST port (for example 6066) to kill the pipeline. I have tested this with Spark 1.6.1.
curl -X POST http://localhost:6066/v1/submissions/kill/driverId
Hope it helps...
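The same call can be made from a script. This is a minimal sketch using only the standard library; the function names are hypothetical, and the host and port are the ones from the curl command above, so adjust them to your master.

```python
import urllib.request


def build_kill_request(driver_id, master="http://localhost:6066"):
    """Build (but do not send) the POST request for the REST kill endpoint."""
    url = f"{master}/v1/submissions/kill/{driver_id}"
    return urllib.request.Request(url, method="POST")


def kill_driver(driver_id, master="http://localhost:6066"):
    """Send the kill request and return the response body as text."""
    with urllib.request.urlopen(build_kill_request(driver_id, master)) as resp:
        return resp.read().decode("utf-8")
```

Calling kill_driver with the driverId shown in the Spark UI sends the same request as the curl command; building the request separately makes it easy to inspect the URL before anything is sent.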
