Apache Spark job failed to deploy on YARN - apache-spark

I'm trying to deploy a Spark job to my YARN cluster, but I'm getting an exception that I don't really understand.
Here is the stack trace:
15/07/29 14:07:13 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors
Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139)
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:257)
... 6 more
15/07/29 14:07:13 INFO cluster.YarnClientSchedulerBackend: Asking each executor to shut down
Here is my config:
SparkConf sparkConf = new SparkConf(true).setAppName("SparkQueryApp")
.setMaster("yarn-client")// "yarn-cluster" or "yarn-client"
.set("es.nodes", "10.0.0.207")
.set("es.nodes.discovery", "false")
.set("es.cluster", "wp-es-reporting-prod")
.set("es.scroll.size", "5000")
.setJars(JavaSparkContext.jarOfClass(Demo.class))
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.logConf", "true");
Any idea why?

Related

Spark executor driver unknown host issue

I currently have an issue when using Spark. Here are the details.
I set up a Spark cluster on VMs with three nodes: one is the master and the other two are workers.
Spark version is 3.0.0.
I created an application and deployed it into K8S pod.
Then I triggered the application in my K8S pod.
The code where I create my Spark context is like this.
Code:
val conf = SparkConf().setMaster(sparkMaster).setAppName(appName)
.set("spark.cassandra.connection.host", hosts)
.set("spark.cassandra.connection.port", port.toString())
.set("spark.cores.max", sparkCore)
.set("spark.executor.memory", executorMemory)
.set("spark.network.timeout", "50000")
Then I got an error. The executor command was:
Spark Executor Command:
"/usr/lib/jvm/java-11-openjdk-11.0.7.10-4.el7_8.x86_64/bin/java" "-cp"
"/opt/spark3/conf/:/opt/spark3/jars/*:/etc/hadoop/conf" "-Xmx1024M"
"-Dspark.network.timeout=50000" "-Dspark.driver.port=34471"
"-Dspark.cassandra.connection.port=9042"
"org.apache.spark.executor.CoarseGrainedExecutorBackend"
"--driver-url"
"spark://CoarseGrainedScheduler#my-spark-application-gjwd7:34471"
"--executor-id" "281" "--hostname" "100.84.162.9" "--cores" "1"
"--app-id" "app-20200722115716-0000" "--worker-url"
"spark://Worker#100.84.162.9:33670"
The error is like this:
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
Caused by: java.io.IOException: Failed to connect to my-spark-application-gjwd7:34471
Caused by: java.net.UnknownHostException: my-spark-application-gjwd7
I think this issue is caused by the worker node not being able to call back to the driver, since my driver is inside the K8S pod.
Do you know how I can resolve this issue? I have two concerns here:
1. How can I set the driver host and port in my code for the Spark context (see the sketch after this list)? The port is random, according to Spark.
2. How can the Spark executor call back to a K8S pod where the driver is running?
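As a partial pointer for question 1: Spark exposes spark.driver.host, spark.driver.port and spark.driver.bindAddress, so the driver can advertise an address the executors can reach and listen on a fixed port. Below is a minimal sketch in Scala (the properties are the same regardless of the language used in the question); the service name and port are assumptions, and setting them does not by itself guarantee the executors can route back into the pod:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object DriverEndpointSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("spark://spark-master:7077")  // assumed standalone master URL
      .setAppName("driver-endpoint-sketch")
      // Address the executors should use to reach the driver, e.g. a hypothetical K8S Service name
      .set("spark.driver.host", "my-spark-driver-svc.default.svc.cluster.local")
      // Pin the driver port instead of letting Spark pick a random one
      .set("spark.driver.port", "34471")
      // Address the driver actually binds to inside the pod
      .set("spark.driver.bindAddress", "0.0.0.0")

    val spark = SparkSession.builder().config(conf).getOrCreate()
    try {
      spark.sparkContext.parallelize(1 to 10).count()  // trivial job to exercise the executors
    } finally {
      spark.stop()
    }
  }
}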

Why does Spark on YARN in cluster mode fail with "Exception in thread "Driver" java.lang.NullPointerException"?

I'm using emr-5.4.0 with Spark 2.1.0. I understand what a NullPointerException is; this question is about why it was thrown in this particular case.
I cannot really figure out why I got a NullPointerException in the driver thread.
I have this weird job failing with the following error:
18/03/29 20:07:52 INFO ApplicationMaster: Starting the user application in a separate Thread
18/03/29 20:07:52 INFO ApplicationMaster: Waiting for spark context initialization...
Exception in thread "Driver" java.lang.NullPointerException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:637)
18/03/29 20:07:52 ERROR ApplicationMaster: Uncaught exception:
java.lang.IllegalStateException: SparkContext is null but app is still running!
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:415)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:254)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:766)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:67)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:66)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:764)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
18/03/29 20:07:52 INFO ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.lang.IllegalStateException: SparkContext is null but app is still running!)
18/03/29 20:07:52 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: Uncaught exception: java.lang.IllegalStateException: SparkContext is null but app is still running!)
18/03/29 20:07:52 INFO ApplicationMaster: Deleting staging directory hdfs://<ip-address>.ec2.internal:8020/user/hadoop/.sparkStaging/application_1522348295743_0010
18/03/29 20:07:52 INFO ShutdownHookManager: Shutdown hook called
End of LogType:stderr
I submitted this job like this:
spark-submit --deploy-mode cluster --master yarn --num-executors 40 --executor-cores 16 --executor-memory 100g --driver-cores 8 --driver-memory 100g --class <package.class_name> --jars <s3://s3_path/some_lib.jar> <s3://s3_path/class.jar>
And my class looks like this:
class MyClass {
  def main(args: Array[String]): Unit = {
    val c = new MyClass()
    c.process()
  }

  def process(): Unit = {
    val sparkConf = new SparkConf().setAppName("my-test")
    val sparkSession: SparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
    import sparkSession.implicits._
    ....
  }
  ...
}
Change class MyClass to object MyClass and you're done.
While we're at it, I'd also change class MyClass to object MyClass extends App and remove def main(args: Array[String]): Unit (as given by extends App).
I've reported an improvement for Spark 2.3.0 - [SPARK-23830] Spark on YARN in cluster deploy mode fail with NullPointerException when a Spark application is a Scala class not object - to have it reported nicely to an end user.
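For concreteness, here is a minimal sketch of the object-based entry point described above, based on the code in the question (imports added; the elided body stays elided):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object MyClass {
  def main(args: Array[String]): Unit = {
    process()
  }

  def process(): Unit = {
    val sparkConf = new SparkConf().setAppName("my-test")
    val sparkSession: SparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
    import sparkSession.implicits._
    // ... rest of the original process() body ...
  }
}

With the extends App variant, you drop main entirely and put the body of process() directly inside object MyClass extends App { ... }.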
Digging deeper into how Spark on YARN works, the following message appears when the ApplicationMaster of a Spark application starts the driver (you used --deploy-mode cluster --master yarn with spark-submit).
ApplicationMaster: Starting the user application in a separate Thread
Right after the INFO message you should see another:
ApplicationMaster: Waiting for spark context initialization...
This is part of the driver initialization when the ApplicationMaster runs.
The reason for the exception Exception in thread "Driver" java.lang.NullPointerException is due to the following code:
val mainMethod = userClassLoader.loadClass(args.userClass)
.getMethod("main", classOf[Array[String]])
My understanding is that mainMethod is null at this point, so the following line "triggers" the NullPointerException:
mainMethod.invoke(null, userArgs.toArray)
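One way to see the difference outside Spark is this hypothetical standalone sketch (class names are mine, not Spark's): the same reflective call with a null receiver only succeeds when main resolves to a static method, which the Scala compiler generates for a top-level object but not for a class.

import scala.util.Try

class UserAppAsClass {                 // main here compiles to an instance method
  def main(args: Array[String]): Unit = println("class main")
}

object UserAppAsObject {               // main here gets a static forwarder
  def main(args: Array[String]): Unit = println("object main")
}

object ReflectMainDemo {
  private def invokeMain(className: String): Unit = {
    val mainMethod = Class.forName(className).getMethod("main", classOf[Array[String]])
    mainMethod.invoke(null, Array.empty[String])   // null receiver, same shape as Spark's call
  }

  def main(args: Array[String]): Unit = {
    println(Try(invokeMain("UserAppAsObject")))    // Success(()) - the static forwarder is invoked
    println(Try(invokeMain("UserAppAsClass")))     // Failure(java.lang.NullPointerException)
  }
}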
The thread is indeed called Driver (as in Exception in thread "Driver" java.lang.NullPointerException) as set in this line:
userThread.setContextClassLoader(userClassLoader)
userThread.setName("Driver")
userThread.start()
The line numbers differ since I used Spark 2.3.0 to reference the lines while you use emr-5.4.0 with Spark 2.1.0.

ERROR yarn.ApplicationMaster: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after 100000 milliseconds [duplicate]

This question already has answers here:
Why does join fail with "java.util.concurrent.TimeoutException: Futures timed out after [300 seconds]"?
(4 answers)
Closed 4 years ago.
I have this problem in my Spark application. I am using Spark 1.6 with Scala 2.10:
17/10/23 14:32:15 ERROR yarn.ApplicationMaster: Uncaught exception:
java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:342)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:197)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:680)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:69)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:68)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1917)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:68)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:678)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
17/10/23 14:32:15 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 10, (reason: Uncaught exception: java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds])
17/10/23 14:32:15 INFO spark.SparkContext: Invoking stop() from shutdown hook
17/10/23 14:32:15 INFO ui.SparkUI: Stopped Spark web UI at http://180.21.232.30:43576
17/10/23 14:32:15 INFO scheduler.DAGScheduler: ShuffleMapStage 27 (show at Linkage.scala:282) failed in 24.519 s due to Stage cancelled because SparkContext was shut down
17/10/23 14:32:15 SparkListenerJobEnd(18,1508761935656,JobFailed(org.apache.spark.SparkException: Job 18 cancelled because SparkContext was shut down))
17/10/23 14:32:15 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/10/23 14:32:15 INFO storage.MemoryStore: MemoryStore cleared
17/10/23 14:32:15 INFO storage.BlockManager: BlockManager stopped
17/10/23 14:32:15 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
17/10/23 14:32:15 INFO remote.RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
17/10/23 14:32:15 INFO util.ShutdownHookManager: Shutdown hook called
I read articles about this problem and tried to modify the following parameters, without result:
--conf spark.yarn.am.waitTime=6000s
--conf spark.sql.broadcastTimeout=6000
--conf spark.network.timeout=600
Best regards
Please remove the setMaster("local") call from the code, because Spark uses the YARN cluster manager by default on EMR.
If you are trying to run your Spark job on YARN in client or cluster mode, don't forget to remove the master configuration .master("local[n]") from your code.
To submit a Spark job on YARN, you need to pass --master yarn --deploy-mode cluster (or client).
Having the master set to local was causing the repeated timeout exception.
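To illustrate the point, a minimal sketch of a driver that leaves the master unset in code so that the --master yarn passed to spark-submit takes effect (the app name and the trivial action are assumptions):

import org.apache.spark.{SparkConf, SparkContext}

object YarnSubmittedApp {
  def main(args: Array[String]): Unit = {
    // No setMaster(...) here; the master comes from spark-submit (--master yarn --deploy-mode cluster).
    val conf = new SparkConf().setAppName("yarn-submitted-app")
    val sc = new SparkContext(conf)

    sc.parallelize(1 to 100).count()  // trivial action just to exercise the cluster
    sc.stop()
  }
}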

Yarn-Cluster mode - ApplicationMaster: SparkContext did not initialize after waiting for 100000 ms

In my PySpark program I have:
from pyspark import SparkConf, SparkContext, SQLContext
conf=SparkConf()
conf.setAppName("spark_name")
conf.set("spark.dynamicAllocation.enabled", "true")
conf.set("spark.shuffle.service.enabled", "true")
sc = SparkContext(conf=conf)
And I run my PySpark program as
./spark-submit --master yarn-cluster PySparkProgramPath
Then the job failed with
Exception in thread "main" org.apache.spark.SparkException: Application application_1482268372614_0318 finished with failed status
After running yarn logs -applicationId application_1482268372614_0318
it shows these errors:
ERROR ApplicationMaster: User class threw exception: java.io.IOException: Cannot run program
ERROR ApplicationMaster: SparkContext did not initialize after waiting for 100000 ms. Please check earlier log output for errors.
[EDIT] I am not facing this error when I submit this in yarn-client mode.

Spark streaming job fails after getting stopped by Driver

I have a Spark Streaming job which reads data from Kafka and does some operations on it. I am running the job on a YARN cluster (Spark 1.4.1) which has two nodes with 16 GB RAM and 16 cores each.
These are the conf options I pass to the spark-submit job:
--master yarn-cluster --num-executors 3 --driver-memory 4g --executor-memory 2g --executor-cores 3
The job returns this error and finishes after running for a short while:
INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 11,
(reason: Max number of executor failures reached)
.....
ERROR scheduler.ReceiverTracker: Deregistered receiver for stream 0:
Stopped by driver
Updated:
These logs were found too:
INFO yarn.YarnAllocator: Received 3 containers from YARN, launching executors on 3 of them.....
INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down.
....
INFO yarn.YarnAllocator: Received 2 containers from YARN, launching executors on 2 of them.
INFO yarn.ExecutorRunnable: Starting Executor Container.....
INFO yarn.ApplicationMaster$AMEndpoint: Driver terminated or disconnected! Shutting down...
INFO yarn.YarnAllocator: Completed container container_e10_1453801197604_0104_01_000006 (state: COMPLETE, exit status: 1)
INFO yarn.YarnAllocator: Container marked as failed: container_e10_1453801197604_0104_01_000006. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_e10_1453801197604_0104_01_000006
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:576)
at org.apache.hadoop.util.Shell.run(Shell.java:487)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:753)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:211)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
What might be the reasons for this? I'd appreciate some help.
Thanks
Can you please show your Scala/Java code that is reading from Kafka? I suspect you are probably not creating your SparkConf correctly.
Try something like
SparkConf sparkConf = new SparkConf().setAppName("ApplicationName");
Also try running the application in yarn-client mode and share the output.
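To make that suggestion concrete, here is a minimal Scala sketch of a streaming driver that leaves the master to spark-submit; the app name, batch interval, and the placeholder socket source are assumptions, since the actual Kafka input and processing are not shown in the question:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingDriverSketch {
  def main(args: Array[String]): Unit = {
    // No setMaster(...): the --master yarn-cluster passed to spark-submit supplies it.
    val sparkConf = new SparkConf().setAppName("ApplicationName")
    val ssc = new StreamingContext(sparkConf, Seconds(10))  // assumed 10-second batch interval

    // Placeholder input and output; in the real job this would be the Kafka DStream and its processing.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}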
I got the same issue, and I found one solution to fix it: remove sparkContext.stop() at the end of the main function and leave the stop action to GC.
The Spark team has resolved the issue in Spark core; however, the fix has only been merged to the master branch so far. We need to wait until it is included in a new release.
https://issues.apache.org/jira/browse/SPARK-12009
