Spark executor driver unknown host issue - apache-spark

I have run into an issue with Spark. Here are the details.
I set up a Spark cluster on VMs with three nodes: one master and two workers.
The Spark version is 3.0.0.
I created an application and deployed it to a K8S pod.
Then I triggered the application from that pod.
The code that creates my Spark context looks like this.
Code:
val conf = SparkConf().setMaster(sparkMaster).setAppName(appName)
.set("spark.cassandra.connection.host", hosts)
.set("spark.cassandra.connection.port", port.toString())
.set("spark.cores.max", sparkCore)
.set("spark.executor.memory", executorMemory)
.set("spark.network.timeout", "50000")
The executor log then showed the following:
Spark Executor Command:
"/usr/lib/jvm/java-11-openjdk-11.0.7.10-4.el7_8.x86_64/bin/java" "-cp"
"/opt/spark3/conf/:/opt/spark3/jars/*:/etc/hadoop/conf" "-Xmx1024M"
"-Dspark.network.timeout=50000" "-Dspark.driver.port=34471"
"-Dspark.cassandra.connection.port=9042"
"org.apache.spark.executor.CoarseGrainedExecutorBackend"
"--driver-url"
"spark://CoarseGrainedScheduler#my-spark-application-gjwd7:34471"
"--executor-id" "281" "--hostname" "100.84.162.9" "--cores" "1"
"--app-id" "app-20200722115716-0000" "--worker-url"
"spark://Worker#100.84.162.9:33670"
The error itself is:
Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
Caused by: java.io.IOException: Failed to connect to my-spark-application-gjwd7:34471
Caused by: java.net.UnknownHostException: my-spark-application-gjwd7
I think the cause is that the worker nodes cannot call back to the driver, because the driver runs inside the K8S pod and its hostname cannot be resolved from the VMs.
How can I resolve this? I have two questions:
1. How can I set the driver host and port in my code for the Spark context? Spark picks the port at random.
2. How can a Spark executor call back to a K8S pod where the driver is running?
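
For reference, Spark exposes the properties spark.driver.host, spark.driver.bindAddress, spark.driver.port and spark.blockManager.port, which control the address the driver advertises and the ports it listens on. The following is only a minimal sketch in Python of the key/value pairs one might set, assuming the pod can be reached from the worker VMs at some routable address. The helper name, the address 10.0.0.50 and the port numbers are hypothetical placeholders, not values from the setup above.

```python
def driver_network_conf(advertised_host, driver_port, block_manager_port):
    """Return Spark properties that pin the driver's advertised address and ports.

    advertised_host must be an address the worker VMs can resolve and reach;
    the driver itself binds to all interfaces inside the pod.
    """
    return {
        "spark.driver.host": advertised_host,    # address executors connect back to
        "spark.driver.bindAddress": "0.0.0.0",   # address the driver listens on locally
        "spark.driver.port": str(driver_port),   # fixed instead of random
        "spark.blockManager.port": str(block_manager_port),
    }


# Hypothetical values, for illustration only.
conf = driver_network_conf("10.0.0.50", 34471, 34472)
for key, value in sorted(conf.items()):
    print(f"{key}={value}")
```

Each pair could be applied with .set(key, value) on the configuration shown in the question. Whether this works depends on the fixed ports actually being exposed from the pod to the VMs, which is an assumption about the deployment rather than something Spark handles.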

Related

When running "local-cluster" mode in Apache Spark, how to prevent executor from dissociating prematurely?

I have a Spark application that should be tested in both local mode and local-cluster mode, using scalatest.
The local-cluster mode is submitted using this method:
How to scala-test a Spark program under "local-cluster" mode?
The tests run successfully, but when the test terminates I get the following errors in the log:
22/05/16 17:45:25 ERROR TaskSchedulerImpl: Lost executor 0 on 172.16.224.18: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
22/05/16 17:45:25 ERROR Worker: Failed to launch executor app-20220516174449-0000/2 for Test.
java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
at org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:195)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:153)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:142)
at org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:77)
at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:547)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:215)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
22/05/16 17:45:25 ERROR Worker: Failed to launch executor app-20220516174449-0000/3 for Test.
java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
at org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:195)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:153)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:142)
at org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:77)
at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:547)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:215)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
22/05/16 17:45:25 ERROR Worker: Failed to launch executor app-20220516174449-0000/4 for Test.
java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
at org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:195)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:153)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:142)
at org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:77)
at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:547)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:215)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
22/05/16 17:45:25 ERROR Worker: Failed to launch executor app-20220516174449-0000/5 for Test.
java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
at org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:195)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:153)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:142)
at org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:77)
at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:547)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:215)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dis
...
It turns out executor 0 was dropped before the SparkContext was stopped. This triggered a violent self-healing reaction from the Spark master, which repeatedly tries to launch new executors to compensate for the loss. How do I prevent this from happening?
Spark attempts to recover from failed tasks by running them again. To avoid this, set the following properties to 1:
spark.task.maxFailures (default is 4)
spark.stage.maxConsecutiveAttempts (default is 4)
These properties can be set in $SPARK_HOME/conf/spark-defaults.conf or given as options to spark-submit:
spark-submit --conf spark.task.maxFailures=1 --conf spark.stage.maxConsecutiveAttempts=1
or in the Spark context/session configuration before starting the session.
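
As a rough illustration, the two properties can be kept in one mapping and rendered as --conf arguments. This is a small Python sketch; the helper name to_submit_args is made up for this example.

```python
def to_submit_args(conf):
    """Render a dict of Spark properties as spark-submit --conf arguments."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


# Disable task and stage retries, as suggested above.
no_retry_conf = {
    "spark.task.maxFailures": "1",
    "spark.stage.maxConsecutiveAttempts": "1",
}

print(" ".join(["spark-submit"] + to_submit_args(no_retry_conf)))
```

With PySpark, the same pairs could instead be passed to SparkSession.builder.config(key, value) before calling getOrCreate().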
EDIT:
It looks like your executors are lost due to insufficient memory. You could try to increase:
spark.executor.memory
spark.executor.memoryOverhead
spark.memory.offHeap.size (with spark.memory.offHeap.enabled=true)
(see Spark configuration)
The maximum memory size of the container running an executor is determined by the sum of spark.executor.memoryOverhead, spark.executor.memory, spark.memory.offHeap.size and spark.executor.pyspark.memory.
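
To make the sum concrete, here is a small Python sketch of the arithmetic. It assumes the documented default for spark.executor.memoryOverhead on JVM jobs (10% of executor memory, with a minimum of 384 MiB) when no explicit overhead is given. The function name and the sample sizes are illustrative only.

```python
def executor_container_mib(executor_mib, overhead_mib=None, offheap_mib=0, pyspark_mib=0):
    """Sum the four memory settings that make up an executor container, in MiB."""
    if overhead_mib is None:
        # Assumed default: 10% of executor memory, but at least 384 MiB.
        overhead_mib = max(384, int(executor_mib * 0.10))
    return executor_mib + overhead_mib + offheap_mib + pyspark_mib


print(executor_container_mib(1024))                    # 1024 + 384
print(executor_container_mib(8192, offheap_mib=2048))  # 8192 + 819 + 2048
```

Raising any one of the four settings raises the container size the cluster manager must grant, so the total has to fit within what a worker can actually provide.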

How to run spark yarn jobs on a Hadoop cluster that is external to the K8S cluster

I am running an on-prem k8s cluster with JupyterHub on it.
I can successfully submit the job to a YARN queue. However, the job fails because the user's notebook pod IP is not resolvable, so YARN can’t talk back to the Spark driver running on that pod. I get an error like:
Caused by: java.io.IOException: Failed to connect to podIP:33630
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:287)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:218)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:230)
at org.apache.spark.rpc.netty.NettyRpcEnv.createClient(NettyRpcEnv.scala:204)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:202)
at org.apache.spark.rpc.netty.Outbox$$anon$1.call(Outbox.scala:198)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.UnknownHostException: pod-ip
I believe I’m missing something in my setup that would allow YARN to talk back to the spawned notebook pods on the Kubernetes cluster.
Any help or hints are greatly appreciated.
For now I am passing the Spark driver the internal Kubernetes pod IP of JupyterHub by setting:
"spark.driver.host" to str(socket.gethostbyname(socket.gethostname())). Could I change this to something else in the notebook I am running? I am not sure what to change it to.
Thanks!
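
One possible direction is to advertise a routable address instead of the pod IP. This assumes the notebook pod can be exposed at an address the YARN nodes can route to, for example through a NodePort service or host networking. The Python sketch below prefers an explicitly configured address and falls back to the current behaviour. The environment variable name SPARK_DRIVER_HOST_OVERRIDE and the address 192.0.2.10 are hypothetical and are not part of Spark or JupyterHub.

```python
import os
import socket


def pick_driver_host(env=os.environ):
    """Choose the address to advertise as spark.driver.host.

    Prefers an explicitly configured routable address; falls back to the
    pod's own IP, which is what the notebook currently uses.
    """
    override = env.get("SPARK_DRIVER_HOST_OVERRIDE")  # hypothetical variable name
    if override:
        return override
    return str(socket.gethostbyname(socket.gethostname()))


# Example with an explicit, made-up routable address:
print(pick_driver_host({"SPARK_DRIVER_HOST_OVERRIDE": "192.0.2.10"}))
```

The returned value would be used for "spark.driver.host". The driver and block manager ports would also need to be reachable at that address for the callback to succeed.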

Spark Failed- Futures timed out

I'm using Apache Spark 2.2.1 running on an Amazon EMR cluster. Sometimes jobs fail with 'Futures timed out':
java.util.concurrent.TimeoutException: Futures timed out after [100000 milliseconds]
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:201)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:401)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:254)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:764)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:762)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
I changed 2 params in spark-defaults.conf:
spark.sql.broadcastTimeout 1000
spark.network.timeout 10000000
but it didn't help.
Do you have any suggestions on how to handle this timeout?
Have you tried setting spark.yarn.am.waitTime?
Only used in cluster mode. Time for the YARN Application Master to
wait for the SparkContext to be initialized.
The quote above is from the Spark "Running on YARN" documentation.
A bit more context on my situation:
I am using spark-submit to execute a Java Spark job. I deploy the client to the cluster, and the client performs a very long-running operation, which was causing the timeout.
I got around it by:
spark-submit --master yarn --deploy-mode cluster --conf "spark.yarn.am.waitTime=600000"

Apache spark job failed to deploy on Yarn

I'm trying to deploy a Spark job to my YARN cluster, but I'm getting an exception that I don't understand.
Here is the stack trace:
15/07/29 14:07:13 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors
Exception in thread "Yarn application state monitor" org.apache.spark.SparkException: Error asking standalone scheduler to shut down executors
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:261)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stop(CoarseGrainedSchedulerBackend.scala:266)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:158)
at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:416)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1411)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1644)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$$anon$1.run(YarnClientSchedulerBackend.scala:139)
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1326)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:208)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:107)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:102)
at org.apache.spark.rpc.RpcEndpointRef.askWithRetry(RpcEndpointRef.scala:78)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.stopExecutors(CoarseGrainedSchedulerBackend.scala:257)
... 6 more
15/07/29 14:07:13 INFO cluster.YarnClientSchedulerBackend: Asking each executor to shut down
Here is my config:
SparkConf sparkConf = new SparkConf(true).setAppName("SparkQueryApp")
.setMaster("yarn-client")// "yarn-cluster" or "yarn-client"
.set("es.nodes", "10.0.0.207")
.set("es.nodes.discovery", "false")
.set("es.cluster", "wp-es-reporting-prod")
.set("es.scroll.size", "5000")
.setJars(JavaSparkContext.jarOfClass(Demo.class))
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.logConf", "true");
Any idea why?

java.net.ConnectException (on port 9000) while submitting a spark job

On running this command:
~/spark/bin/spark-submit --class [class-name] --master [spark-master-url]:7077 [jar-path]
I am getting
java.lang.RuntimeException: java.net.ConnectException: Call to ec2-[ip].compute-1.amazonaws.com/[internal-ip]:9000 failed on connection exception: java.net.ConnectException: Connection refused
Using spark version 1.3.0.
How do I resolve it?
When Spark runs in cluster mode, input files are expected to come from HDFS (otherwise the workers could not read files that are local to the master). In this case Hadoop wasn't running, so the call to port 9000 was refused and Spark threw this exception.
Starting HDFS resolved this.

Resources