Livy is failing when a job is submitted in yarn cluster mode - apache-spark

I have a Livy server running locally, and it connects to a remote YARN cluster to run Spark jobs.
I am getting the error below when I upload my jar through the Livy programmatic Job API.
It looks like the correct Netty channel handler is not being mapped for the request message, and because of this the Spark context is never created.
livy.conf settings:
livy.spark.master = yarn
livy.spark.deploy-mode = cluster
I am able to POST to the Livy REST /batches API, and it submits and completes the job on the YARN cluster successfully.
With the same config, when I try to upload the jar from my Java client, it fails with the error below.
Livy version: 0.4.0-incubating.
HTTP Client:
<dependency>
  <groupId>org.apache.livy</groupId>
  <artifactId>livy-client-http</artifactId>
  <version>0.4.0-incubating</version>
</dependency>
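For reference, the client-side flow that triggers this is roughly as follows. This is a minimal sketch based on the Livy programmatic-API examples, not my actual code; the Livy URL, jar path, and job class are placeholders:

import java.io.File;
import java.net.URI;
import java.util.Arrays;
import org.apache.livy.Job;
import org.apache.livy.JobContext;
import org.apache.livy.LivyClient;
import org.apache.livy.LivyClientBuilder;

public class LivyJobExample {
    // Trivial job that just proves the remote SparkContext is usable.
    public static class CountJob implements Job<Long> {
        @Override
        public Long call(JobContext ctx) throws Exception {
            return ctx.sc().parallelize(Arrays.asList(1, 2, 3)).count();
        }
    }

    public static void main(String[] args) throws Exception {
        LivyClient client = new LivyClientBuilder()
                .setURI(new URI("http://localhost:8998"))   // placeholder Livy URL
                .build();
        try {
            // Uploading the jar with the job classes is the step that
            // kicks off the session/driver startup shown in the logs below.
            client.uploadJar(new File("/path/to/my-job.jar")).get();
            System.out.println("count = " + client.submit(new CountJob()).get());
        } finally {
            client.stop(true);
        }
    }
}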
Error:
17/09/22 22:47:50 DEBUG RSCDriver: Registered new connection from [id: 0x46d83447, L:/127.0.0.1:10000 - R:/127.0.0.1:9273].
17/09/22 22:47:50 DEBUG RpcDispatcher: [ReplDriver] Registered outstanding rpc 0 (org.apache.livy.rsc.BaseProtocol$RemoteDriverAddress).
17/09/22 22:47:50 DEBUG KryoMessageCodec: Encoded message of type org.apache.livy.rsc.rpc.Rpc$MessageHeader (5 bytes)
17/09/22 22:47:50 DEBUG KryoMessageCodec: Encoded message of type org.apache.livy.rsc.BaseProtocol$RemoteDriverAddress (90 bytes)
17/09/22 22:47:50 DEBUG KryoMessageCodec: Decoded message of type org.apache.livy.rsc.rpc.Rpc$MessageHeader (5 bytes)
17/09/22 22:47:50 DEBUG KryoMessageCodec: Decoded message of type org.apache.livy.rsc.BaseProtocol$RemoteDriverAddress (90 bytes)
17/09/22 22:47:50 DEBUG RpcDispatcher: [ReplDriver] Received RPC message: type=CALL id=0 payload=org.apache.livy.rsc.BaseProtocol$RemoteDriverAddress
17/09/22 22:47:50 WARN RpcDispatcher: [ReplDriver] Failed to find handler for msg 'org.apache.livy.rsc.BaseProtocol$RemoteDriverAddress'.
17/09/22 22:47:50 DEBUG RpcDispatcher: [ReplDriver] Caught exception in channel pipeline.
java.lang.NullPointerException
at org.apache.livy.rsc.Utils.stackTraceAsString(Utils.java:95)
at org.apache.livy.rsc.rpc.RpcDispatcher.handleCall(RpcDispatcher.java:121)
at org.apache.livy.rsc.rpc.RpcDispatcher.channelRead0(RpcDispatcher.java:77)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:293)
at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:267)
at io.netty.handler.codec.ByteToMessageCodec.channelRead(ByteToMessageCodec.java:103)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:336)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:357)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:343)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at java.lang.Thread.run(Thread.java:745)
17/09/22 22:47:50 WARN RpcDispatcher: [ReplDriver] Closing RPC channel with 1 outstanding RPCs.
17/09/22 22:47:50 WARN RpcDispatcher: [ReplDriver] Closing RPC channel with 1 outstanding RPCs.
17/09/22 22:47:50 ERROR ApplicationMaster: User class threw exception: java.util.concurrent.CancellationException
java.util.concurrent.CancellationException
at io.netty.util.concurrent.DefaultPromise.cancel(...)(Unknown Source)
17/09/22 22:47:50 INFO ApplicationMaster: Final app status: FAILED, exitCode: 15, (reason: User class threw exception: java.util.concurrent.CancellationException)
17/09/22 22:47:50 ERROR ApplicationMaster: Uncaught exception:
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at org.apache.spark.deploy.yarn.ApplicationMaster.runDriver(ApplicationMaster.scala:401)
at org.apache.spark.deploy.yarn.ApplicationMaster.run(ApplicationMaster.scala:254)
at org.apache.spark.deploy.yarn.ApplicationMaster$$anonfun$main$1.apply$mcV$sp(ApplicationMaster.scala:764)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
at org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
at org.apache.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:762)
at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
Caused by: java.util.concurrent.CancellationException
at io.netty.util.concurrent.DefaultPromise.cancel(...)(Unknown Source)
17/09/22 22:47:50 INFO ApplicationMaster: Unregistering ApplicationMaster with FAILED (diag message: User class threw exception: java.util.concurrent.CancellationException)
17/09/22 22:47:50 INFO ShutdownHookManager: Shutdown hook called.

Related

spark-sql/spark-submit with Delta Lake results in a NullPointerException (at org.apache.spark.storage.BlockManagerMasterEndpoint)

I'm using Delta Lake with PySpark by submitting the command below:
spark-sql --packages io.delta:delta-core_2.12:0.8.0 --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
System Specs:
Spark - 3.0.3
scala - 2.12.10
java - 1.8.0
hadoop - 2.7
I'm following the quick-start guide at https://docs.delta.io/latest/quick-start.html and this blog: https://www.confessionsofadataguy.com/introduction-to-delta-lake-on-apache-spark-for-data-engineers/
When I use Spark without the Delta config, Spark works fine.
Error (truncated stack trace):
22/12/29 20:38:10 INFO Executor: Starting executor ID driver on host 192.168.0.100
22/12/29 20:38:10 INFO Executor: Fetching spark://192.168.0.100:50920/jars/org.antlr_antlr4-runtime-4.7.jar with timestamp 1672326488255
22/12/29 20:38:23 WARN ProcfsMetricsGetter: Exception when trying to compute pagesize, as a result reporting of ProcessTree metrics is stopped
22/12/29 20:38:23 INFO Executor: Told to re-register on heartbeat
22/12/29 20:38:23 INFO BlockManager: BlockManager null re-registering with master
22/12/29 20:38:23 INFO BlockManagerMaster: Registering BlockManager null
22/12/29 20:38:23 ERROR Inbox: Ignoring error
java.lang.NullPointerException
at org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$register(BlockManagerMasterEndpoint.scala:404)
at org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:97)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
22/12/29 20:38:23 WARN Executor: Issue communicating with driver in heartbeater
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:302)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:103)
at org.apache.spark.rpc.RpcEndpointRef.askSync(RpcEndpointRef.scala:87)
at org.apache.spark.storage.BlockManagerMaster.registerBlockManager(BlockManagerMaster.scala:66)
at org.apache.spark.storage.BlockManager.reregister(BlockManager.scala:567)
at org.apache.spark.executor.Executor.reportHeartBeat(Executor.scala:934)
at org.apache.spark.executor.Executor.$anonfun$heartbeater$1(Executor.scala:200)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1934)
at org.apache.spark.Heartbeater$$anon$1.run(Heartbeater.scala:46)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
at org.apache.spark.storage.BlockManagerMasterEndpoint.org$apache$spark$storage$BlockManagerMasterEndpoint$$register(BlockManagerMasterEndpoint.scala:404)
at org.apache.spark.storage.BlockManagerMasterEndpoint$$anonfun$receiveAndReply$1.applyOrElse(BlockManagerMasterEndpoint.scala:97)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:103)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
... 3 more
22/12/29 20:38:31 ERROR Utils: Aborting task
java.io.IOException: Failed to connect to /192.168.0.100:50920
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
at org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:392)
at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$openChannel$4(NettyRpcEnv.scala:360)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:931)
at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:49)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.<init>(SparkSQLCLIDriver.scala:321)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: no further information: /192.168.0.100:50920
Caused by: java.net.ConnectException: Connection timed out: no further information
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
at java.lang.Thread.run(Thread.java:745)
22/12/29 20:38:31 ERROR SparkContext: Error initializing SparkContext.
java.io.IOException: Failed to connect to /192.168.0.100:50920
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
at org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:392)
at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$openChannel$4(NettyRpcEnv.scala:360)
at java.lang.reflect.Method.invoke(Method.java:498)
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:745)
22/12/29 20:38:31 INFO SparkUI: Stopped Spark web UI at http://192.168.0.100:4041
22/12/29 20:38:31 ERROR Utils: Uncaught exception in thread main
java.lang.NullPointerException
at org.apache.spark.scheduler.local.LocalSchedulerBackend.org$apache$spark$scheduler$local$LocalSchedulerBackend$$stop(LocalSchedulerBackend.scala:168)
at org.apache.spark.scheduler.local.LocalSchedulerBackend.stop(LocalSchedulerBackend.scala:144)
at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:734)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2171)
java.lang.NullPointerException
at org.apache.spark.executor.Executor.$anonfun$stop$3(Executor.scala:304)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.apache.spark.util.Utils$.withContextClassLoader(Utils.scala:221)
at org.apache.spark.executor.Executor.stop(Executor.scala:304)
at org.apache.spark.executor.Executor.$anonfun$stopHookReference$1(Executor.scala:74)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:214)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
22/12/29 20:38:31 INFO ShutdownHookManager: Shutdown hook called
What am I missing?
Try it like this:
!pip3 install findspark --user

import findspark
findspark.init()

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages io.delta:delta-core_2.12:2.2.0 --driver-memory 2g pyspark-shell'

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("your application") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:2.2.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()
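One caveat worth noting with this approach (my observation, not part of the original answer): spark.jars.packages on the builder only takes effect if the builder actually creates a new session, so the PYSPARK_SUBMIT_ARGS route is the more reliable of the two when a SparkContext may already be running in the notebook.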

My PySpark Jobs Run Fine in Local Mode, But Fail in Cluster Mode - SOLVED

I have a four-node Hadoop/Spark cluster running in AWS. I can submit and run jobs perfectly in local mode:
spark-submit --master local[*] myscript.py
But when I attempt to run the script in cluster mode, it fails. I'm just trying the cluster equivalent of "hello world":
spark-submit spark-yarn.py
The script is one that was recommended to me:
from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf()
conf.setMaster('yarn')
conf.setAppName('spark-yarn')
sc = SparkContext(conf=conf)

def mod(x):
    import numpy as np
    return (x, np.mod(x, 2))

rdd = sc.parallelize(range(1000)).map(mod).take(10)
print(rdd)
I've spent days looking at every log I can find and reading everything I can online, but nothing has helped me get to the root of why it's not working. Before I tear down all the servers and start over, I'm hoping someone can point me in the right direction to get this working.
Here's the output in the terminal:
20/02/25 12:59:51 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/02/25 13:00:11 ERROR YarnClientSchedulerBackend: YARN application has exited unexpectedly with state FAILED! Check the YARN application logs for more details.
20/02/25 13:00:11 ERROR YarnClientSchedulerBackend: Diagnostics message: Application application_1582603840719_0002 failed 2 times due to AM Container for appattempt_1582603840719_0002_000002 exited with exitCode: -103
Failing this attempt.Diagnostics: [2020-02-25 13:00:11.601]Container [pid=3124,containerID=container_1582603840719_0002_02_000001] is running beyond virtual memory limits. Current usage: 328.7 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1582603840719_0002_02_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 3128 3124 3124 3124 (java) 504 34 2359349248 83396 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -server -Xmx512m -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1582603840719_0002/container_1582603840719_0002_02_000001/tmp -Dspark.yarn.app.container.log.dir=/home/ubuntu/server/hadoop-2.9.2/logs/userlogs/application_1582603840719_0002/container_1582603840719_0002_02_000001 org.apache.spark.deploy.yarn.ExecutorLauncher --arg ip-172-31-7-96.ec2.internal:43275 --properties-file /tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1582603840719_0002/container_1582603840719_0002_02_000001/__spark_conf__/__spark_conf__.properties
|- 3124 3122 3124 3124 (bash) 0 0 13635584 760 /bin/bash -c /usr/lib/jvm/java-8-openjdk-amd64/bin/java -server -Xmx512m -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1582603840719_0002/container_1582603840719_0002_02_000001/tmp -Dspark.yarn.app.container.log.dir=/home/ubuntu/server/hadoop-2.9.2/logs/userlogs/application_1582603840719_0002/container_1582603840719_0002_02_000001 org.apache.spark.deploy.yarn.ExecutorLauncher --arg 'ip-172-31-7-96.ec2.internal:43275' --properties-file /tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1582603840719_0002/container_1582603840719_0002_02_000001/__spark_conf__/__spark_conf__.properties 1> /home/ubuntu/server/hadoop-2.9.2/logs/userlogs/application_1582603840719_0002/container_1582603840719_0002_02_000001/stdout 2> /home/ubuntu/server/hadoop-2.9.2/logs/userlogs/application_1582603840719_0002/container_1582603840719_0002_02_000001/stderr
[2020-02-25 13:00:11.651]Container killed on request. Exit code is 143
[2020-02-25 13:00:11.658]Container exited with a non-zero exit code 143.
For more detailed output, check the application tracking page: http://ec2-34-200-223-235.compute-1.amazonaws.com:8088/cluster/app/application_1582603840719_0002 Then click on links to logs of each attempt.
. Failing the application.
20/02/25 13:00:11 ERROR TransportClient: Failed to send RPC RPC 6867152665638655473 to /172.31.9.94:57526: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
20/02/25 13:00:11 ERROR YarnSchedulerBackend$YarnSchedulerEndpoint: Sending RequestExecutors(0,0,Map(),Set()) to AM was unsuccessful
java.io.IOException: Failed to send RPC RPC 6867152665638655473 to /172.31.9.94:57526: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:362)
at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:339)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:987)
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:869)
at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1316)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
20/02/25 13:00:11 ERROR Utils: Uncaught exception in thread YARN application state monitor
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:574)
at org.apache.spark.scheduler.cluster.YarnSchedulerBackend.stop(YarnSchedulerBackend.scala:98)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:164)
at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:653)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2042)
at org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1949)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1948)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:121)
Caused by: java.io.IOException: Failed to send RPC RPC 6867152665638655473 to /172.31.9.94:57526: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:362)
at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:339)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:420)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:122)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:987)
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:869)
at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1316)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:738)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:730)
at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:38)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1081)
at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1128)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1070)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
20/02/25 13:00:12 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalStateException: Spark context stopped while waiting for backend
at org.apache.spark.scheduler.TaskSchedulerImpl.waitBackendReady(TaskSchedulerImpl.scala:818)
at org.apache.spark.scheduler.TaskSchedulerImpl.postStartHook(TaskSchedulerImpl.scala:196)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:560)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Traceback (most recent call last):
File "/home/ubuntu/server/spark-yarn.py", line 7, in <module>
sc = SparkContext(conf=conf)
File "/home/ubuntu/server/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 136, in __init__
File "/home/ubuntu/server/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 198, in _do_init
File "/home/ubuntu/server/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 306, in _initialize_context
File "/home/ubuntu/server/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1525, in __call__
File "/home/ubuntu/server/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalStateException: Spark context stopped while waiting for backend
at org.apache.spark.scheduler.TaskSchedulerImpl.waitBackendReady(TaskSchedulerImpl.scala:818)
at org.apache.spark.scheduler.TaskSchedulerImpl.postStartHook(TaskSchedulerImpl.scala:196)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:560)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
If I change the master from 'yarn' to 'yarn-client', it gives a slightly different error:
20/02/25 13:07:31 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/02/25 13:07:46 ERROR TransportClient: Failed to send RPC RPC 5381013595535555066 to /172.31.5.228:39748: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
20/02/25 13:07:46 ERROR YarnScheduler: Lost executor 1 on ip-172-31-5-228.ec2.internal: Slave lost
20/02/25 13:07:51 ERROR YarnClientSchedulerBackend: YARN application has exited unexpectedly with state FAILED! Check the YARN application logs for more details.
20/02/25 13:07:51 ERROR YarnClientSchedulerBackend: Diagnostics message: Application application_1582603840719_0003 failed 2 times due to AM Container for appattempt_1582603840719_0003_000002 exited with exitCode: -103
Failing this attempt.Diagnostics: [2020-02-25 13:07:51.067]Container [pid=3223,containerID=container_1582603840719_0003_02_000001] is running beyond virtual memory limits. Current usage: 320.8 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1582603840719_0003_02_000001 :
|- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
|- 3227 3223 3223 3223 (java) 489 32 2355855360 81352 /usr/lib/jvm/java-8-openjdk-amd64/bin/java -server -Xmx512m -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1582603840719_0003/container_1582603840719_0003_02_000001/tmp -Dspark.yarn.app.container.log.dir=/home/ubuntu/server/hadoop-2.9.2/logs/userlogs/application_1582603840719_0003/container_1582603840719_0003_02_000001 org.apache.spark.deploy.yarn.ExecutorLauncher --arg ip-172-31-7-96.ec2.internal:40963 --properties-file /tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1582603840719_0003/container_1582603840719_0003_02_000001/__spark_conf__/__spark_conf__.properties
|- 3223 3221 3223 3223 (bash) 0 0 13635584 767 /bin/bash -c /usr/lib/jvm/java-8-openjdk-amd64/bin/java -server -Xmx512m -Djava.io.tmpdir=/tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1582603840719_0003/container_1582603840719_0003_02_000001/tmp -Dspark.yarn.app.container.log.dir=/home/ubuntu/server/hadoop-2.9.2/logs/userlogs/application_1582603840719_0003/container_1582603840719_0003_02_000001 org.apache.spark.deploy.yarn.ExecutorLauncher --arg 'ip-172-31-7-96.ec2.internal:40963' --properties-file /tmp/hadoop-ubuntu/nm-local-dir/usercache/ubuntu/appcache/application_1582603840719_0003/container_1582603840719_0003_02_000001/__spark_conf__/__spark_conf__.properties 1> /home/ubuntu/server/hadoop-2.9.2/logs/userlogs/application_1582603840719_0003/container_1582603840719_0003_02_000001/stdout 2> /home/ubuntu/server/hadoop-2.9.2/logs/userlogs/application_1582603840719_0003/container_1582603840719_0003_02_000001/stderr
[2020-02-25 13:07:51.089]Container killed on request. Exit code is 143
[2020-02-25 13:07:51.090]Container exited with a non-zero exit code 143.
For more detailed output, check the application tracking page: http://ec2-34-200-223-235.compute-1.amazonaws.com:8088/cluster/app/application_1582603840719_0003 Then click on links to logs of each attempt.
. Failing the application.
20/02/25 13:07:51 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalStateException: Spark context stopped while waiting for backend
at org.apache.spark.scheduler.TaskSchedulerImpl.waitBackendReady(TaskSchedulerImpl.scala:818)
at org.apache.spark.scheduler.TaskSchedulerImpl.postStartHook(TaskSchedulerImpl.scala:196)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:560)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
20/02/25 13:07:52 ERROR TransportClient: Failed to send RPC RPC 8397804982944513692 to /172.31.0.102:39468: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
20/02/25 13:07:52 ERROR YarnSchedulerBackend$YarnSchedulerEndpoint: Sending RequestExecutors(0,0,Map(),Set()) to AM was unsuccessful
java.io.IOException: Failed to send RPC RPC 8397804982944513692 to /172.31.0.102:39468: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:362)
at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:339)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.access$000(DefaultPromise.java:34)
at io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:431)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
20/02/25 13:07:52 ERROR Utils: Uncaught exception in thread YARN application state monitor
org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:574)
at org.apache.spark.scheduler.cluster.YarnSchedulerBackend.stop(YarnSchedulerBackend.scala:98)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:164)
at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:653)
at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2042)
at org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1949)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1948)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend$MonitorThread.run(YarnClientSchedulerBackend.scala:121)
Caused by: java.io.IOException: Failed to send RPC RPC 8397804982944513692 to /172.31.0.102:39468: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:362)
at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:339)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:507)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:481)
at io.netty.util.concurrent.DefaultPromise.access$000(DefaultPromise.java:34)
at io.netty.util.concurrent.DefaultPromise$1.run(DefaultPromise.java:431)
at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:403)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
Traceback (most recent call last):
File "/home/ubuntu/server/spark-yarn.py", line 7, in <module>
sc = SparkContext(conf=conf)
File "/home/ubuntu/server/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 136, in __init__
File "/home/ubuntu/server/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 198, in _do_init
File "/home/ubuntu/server/spark-2.4.4-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/context.py", line 306, in _initialize_context
File "/home/ubuntu/server/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1525, in __call__
File "/home/ubuntu/server/spark-2.4.4-bin-hadoop2.7/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.lang.IllegalStateException: Spark context stopped while waiting for backend
at org.apache.spark.scheduler.TaskSchedulerImpl.waitBackendReady(TaskSchedulerImpl.scala:818)
at org.apache.spark.scheduler.TaskSchedulerImpl.postStartHook(TaskSchedulerImpl.scala:196)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:560)
at org.apache.spark.api.java.JavaSparkContext.<init>(JavaSparkContext.scala:58)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:247)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:238)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
It mentions checking the logs at:
http://ec2-34-200-223-235.compute-1.amazonaws.com:8088/cluster/app/application_1582603840719_0003
But clicking on any of the log links on that page give an error:
Firefox can’t establish a connection to the server at ip-172-31-0-102.ec2.internal:8042.
(That's probably unrelated.)
Grepping for warnings, I see the following:
2020-02-25 13:07:38,904 WARN org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: The specific max attempts: 0 for application: 3 is invalid, because it is out of the range [1, 2]. Use the global max attempts instead.
2020-02-25 13:07:51,241 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=ubuntu OPERATION=Application Finished - Failed TARGET=RMAppManager RESULT=FAILURE DESCRIPTION=App failed with state: FAILED PERMISSIONS=Application application_1582603840719_0003 failed 2 times due to AM Container for appattempt_1582603840719_0003_000002 exited with exitCode: -103
2020-02-25 13:07:40,367 WARN org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode: Checkpoint done. New Image Size: 12086
No errors were generated.
Trying yarn logs from the command line will show me the jobs:
ubuntu@ip-172-31-7-96:~/server$ yarn application -list -appStates ALL
20/02/25 13:31:44 INFO client.RMProxy: Connecting to ResourceManager at ec2-34-200-223-235.compute-1.amazonaws.com/172.31.7.96:8032
Total number of applications (application-types: [], states: [NEW, NEW_SAVING, SUBMITTED, ACCEPTED, RUNNING, FINISHED, FAILED, KILLED] and tags: []):3
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1582603840719_0001 spark-yarn SPARK ubuntu default FINISHED UNDEFINED 100% N/A
But asking for the logs fails:
ubuntu@ip-172-31-7-96:~/server$ yarn logs -applicationId application_1582603840719_0001
20/02/25 13:32:48 INFO client.RMProxy: Connecting to ResourceManager at ec2-34-200-223-235.compute-1.amazonaws.com/172.31.7.96:8032
fs.AbstractFileSystem.ec2-34-200-223-235.compute-1.amazonaws.com.impl=null: No AbstractFileSystem configured for scheme: ec2-34-200-223-235.compute-1.amazonaws.com
Can not find any log file matching the pattern: [ALL] for the application: application_1582603840719_0001
Can not find the logs for the application: application_1582603840719_0001 with the appOwner: ubuntu
Again, if someone could direct me to the next troubleshooting steps, I'd really appreciate it. I've spent days on this and don't seem to be making progress.
Two changes ended up solving this issue:
First, I added the following properties to the yarn-site.xml file on all nodes:
<property>
  <name>yarn.nodemanager.pmem-check-enabled</name>
  <value>false</value>
</property>
<property>
  <name>yarn.nodemanager.vmem-check-enabled</name>
  <value>false</value>
</property>
Next, I changed my spark-submit command to include the following options, giving the driver and executors more memory:
spark-submit --master yarn \
--deploy-mode client \
--driver-memory 6g \
--executor-memory 6g \
--executor-cores 2 \
--num-executors 10 \
my_app.py
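A gentler alternative (worth mentioning, though it is not what I tested here) is to leave the checks enabled and raise the virtual-to-physical memory ratio instead, since the diagnostics show the container exceeding the default 2.1 ratio:
<property>
  <name>yarn.nodemanager.vmem-pmem-ratio</name>
  <value>4</value>
</property>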

Spark - TransportClient: Failed to send RPC

I'm getting the error below, which I guess is about memory limits. Do you know how I can change the settings to avoid this error?
I'm running a cluster of 10 nodes with 61 GB of memory each. This is the command I use to run the jobs:
spark-submit --deploy-mode client --executor-cores 5 --executor-memory 25g --driver-memory 40g adsquare-grid-join-postcode_lookup4-allfolder.py
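(From what I've read so far, the setting most often implicated in this error is the YARN memory overhead; this is a guess based on the symptoms rather than something I've confirmed, but on Spark 2.x it would look like:)
spark-submit --deploy-mode client \
  --executor-cores 5 \
  --executor-memory 25g \
  --driver-memory 40g \
  --conf spark.yarn.executor.memoryOverhead=4096 \
  adsquare-grid-join-postcode_lookup4-allfolder.py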
The error in the log:
ERROR TransportClient: Failed to send RPC 5779186781068965319 to /172.31.29.229:47384: java.nio.channels.ClosedChannelException
java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
17/04/30 11:52:33 WARN BlockManagerMaster: Failed to remove broadcast 15 with removeFromMaster = true - Failed to send RPC 5779186781068965319 to /172.31.29.229:47384: java.nio.channels.ClosedChannelException
java.io.IOException: Failed to send RPC 5779186781068965319 to /172.31.29.229:47384: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:249)
at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:233)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:514)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:488)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:427)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:129)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:852)
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:738)
at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1251)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:735)
at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:36)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1072)
at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1126)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1061)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:408)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:455)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)
17/04/30 11:52:33 ERROR ContextCleaner: Error cleaning broadcast 15
org.apache.spark.SparkException: Exception thrown in awaitResult
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:77)
at org.apache.spark.rpc.RpcTimeout$$anonfun$1.applyOrElse(RpcTimeout.scala:75)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:36)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:59)
at scala.PartialFunction$OrElse.apply(PartialFunction.scala:167)
at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:83)
at org.apache.spark.storage.BlockManagerMaster.removeBroadcast(BlockManagerMaster.scala:151)
at org.apache.spark.broadcast.TorrentBroadcast$.unpersist(TorrentBroadcast.scala:299)
at org.apache.spark.broadcast.TorrentBroadcastFactory.unbroadcast(TorrentBroadcastFactory.scala:45)
at org.apache.spark.broadcast.BroadcastManager.unbroadcast(BroadcastManager.scala:60)
at org.apache.spark.ContextCleaner.doCleanupBroadcast(ContextCleaner.scala:232)
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:188)
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1$$anonfun$apply$mcV$sp$2.apply(ContextCleaner.scala:179)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.ContextCleaner$$anonfun$org$apache$spark$ContextCleaner$$keepCleaning$1.apply$mcV$sp(ContextCleaner.scala:179)
at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1245)
at org.apache.spark.ContextCleaner.org$apache$spark$ContextCleaner$$keepCleaning(ContextCleaner.scala:172)
at org.apache.spark.ContextCleaner$$anon$1.run(ContextCleaner.scala:67)
Caused by: java.io.IOException: Failed to send RPC 5779186781068965319 to /172.31.29.229:47384: java.nio.channels.ClosedChannelException
at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:249)
at org.apache.spark.network.client.TransportClient$3.operationComplete(TransportClient.java:233)
at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:514)
at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:488)
at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:427)
at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:129)
at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:852)
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:738)
at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1251)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:743)
at io.netty.channel.AbstractChannelHandlerContext.invokeWrite(AbstractChannelHandlerContext.java:735)
at io.netty.channel.AbstractChannelHandlerContext.access$1900(AbstractChannelHandlerContext.java:36)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.write(AbstractChannelHandlerContext.java:1072)
at io.netty.channel.AbstractChannelHandlerContext$WriteAndFlushTask.write(AbstractChannelHandlerContext.java:1126)
at io.netty.channel.AbstractChannelHandlerContext$AbstractWriteTask.run(AbstractChannelHandlerContext.java:1061)
at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:408)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:455)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.nio.channels.ClosedChannelException
at io.netty.channel.AbstractChannel$AbstractUnsafe.write(...)(Unknown Source)

Spark Streaming: java.io.InvalidClassException: scala.concurrent.duration.Duration; local class incompatible: stream classdesc serialVersionUID

I am running a streaming job on a CDH cluster and getting an error. The CDH Spark version is 1.2.0-cdh5.3.8, but I need Spark 2.1.0, so I downloaded Apache Spark and built it myself (Spark version: 2.1.0-cdh5.3.8, Hadoop version: 2.5.0-cdh5.3.8).
The error message is below:
17/04/14 18:12:34 ERROR server.TransportRequestHandler: Error while invoking RpcHandler#receive() on RPC id 4724089633860239943
java.io.InvalidClassException: scala.concurrent.duration.Duration; local class incompatible: stream classdesc serialVersionUID = -7521802526148376080, local class serialVersionUID = -2941674837829752814
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:617)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1622)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:108)
at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1$$anonfun$apply$1.apply(NettyRpcEnv.scala:259)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:308)
at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$deserialize$1.apply(NettyRpcEnv.scala:258)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:58)
at org.apache.spark.rpc.netty.NettyRpcEnv.deserialize(NettyRpcEnv.scala:257)
at org.apache.spark.rpc.netty.NettyRpcHandler.internalReceive(NettyRpcEnv.scala:582)
at org.apache.spark.rpc.netty.NettyRpcHandler.receive(NettyRpcEnv.scala:567)
at org.apache.spark.network.server.TransportRequestHandler.processRpcRequest(TransportRequestHandler.java:159)
at org.apache.spark.network.server.TransportRequestHandler.handle(TransportRequestHandler.java:107)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:119)
at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:51)
at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
at io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:266)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
at org.apache.spark.network.util.TransportFrameDecoder.channelRead(TransportFrameDecoder.java:85)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:346)
at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1294)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:367)
at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:353)
at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:911)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:131)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:652)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:575)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:489)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:451)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:140)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:745)
17/04/14 18:12:35 INFO impl.AMRMClientImpl: Received new token for : ztdm006:8041
17/04/14 18:12:35 INFO yarn.YarnAllocator: Received 1 containers from YARN, launching executors on 0 of them.
17/04/14 18:12:35 INFO yarn.YarnAllocator: Completed container container_1488960736410_229415_01_000004 on host: ztdm009 (state: COMPLETE, exit status: 1)
17/04/14 18:12:35 WARN yarn.YarnAllocator: Container marked as failed: container_1488960736410_229415_01_000004 on host: ztdm009. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1488960736410_229415_01_000004
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:543)
at org.apache.hadoop.util.Shell.run(Shell.java:460)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:707)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:197)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
17/04/14 18:12:35 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1488960736410_229415_01_000004 on host: ztdm009. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1488960736410_229415_01_000004
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:543)
at org.apache.hadoop.util.Shell.run(Shell.java:460)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:707)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:197)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
17/04/14 18:12:35 INFO yarn.YarnAllocator: Completed container container_1488960736410_229415_01_000005 on host: ztdm010 (state: COMPLETE, exit status: 1)
17/04/14 18:12:35 WARN yarn.YarnAllocator: Container marked as failed: container_1488960736410_229415_01_000005 on host: ztdm010. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1488960736410_229415_01_000005
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:543)
at org.apache.hadoop.util.Shell.run(Shell.java:460)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:707)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:197)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
17/04/14 18:12:35 INFO storage.BlockManagerMaster: Removal of executor 3 requested
17/04/14 18:12:35 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1488960736410_229415_01_000005 on host: ztdm010. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1488960736410_229415_01_000005
Exit code: 1
Stack trace: ExitCodeException exitCode=1:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:543)
at org.apache.hadoop.util.Shell.run(Shell.java:460)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:707)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:197)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:299)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 1
17/04/14 18:12:35 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 3 from BlockManagerMaster.
17/04/14 18:12:35 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asked to remove non-existent executor 3
17/04/14 18:12:35 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 4 from BlockManagerMaster.
17/04/14 18:12:35 INFO storage.BlockManagerMaster: Removal of executor 4 requested
17/04/14 18:12:35 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asked to remove non-existent executor 4
17/04/14 18:12:38 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Max number of executor failures (3) reached)
17/04/14 18:12:38 INFO storage.DiskBlockManager: Shutdown hook called
17/04/14 18:12:38 INFO util.ShutdownHookManager: Shutdown hook called
17/04/14 18:12:38 INFO util.ShutdownHookManager: Deleting directory /mnt/disk2/yarn/nm/usercache/efinance/appcache/application_1488960736410_229415/spark-1f1b6198-961b-418d-9274-5f35f8e67829
17/04/14 18:12:38 INFO util.ShutdownHookManager: Deleting directory /mnt/disk6/yarn/nm/usercache/efinance/appcache/application_1488960736410_229415/spark-fdce09f8-8677-45b2-9ce4-ac7134ab63b0
17/04/14 18:12:38 INFO util.ShutdownHookManager: Deleting directory /mnt/disk4/yarn/nm/usercache/efinance/appcache/application_1488960736410_229415/spark-10c52d9a-a76b-465f-82d5-42eba9c89c86
17/04/14 18:12:38 INFO util.ShutdownHookManager: Deleting directory /mnt/disk5/yarn/nm/usercache/efinance/appcache/application_1488960736410_229415/spark-4c492882-813a-4c2b-a041-ae69aba7ce00
17/04/14 18:12:38 INFO util.ShutdownHookManager: Deleting directory /mnt/disk3/yarn/nm/usercache/efinance/appcache/application_1488960736410_229415/spark-1d9e6c60-fc33-45c3-8552-55cbe4266931
17/04/14 18:12:38 INFO util.ShutdownHookManager: Deleting directory /mnt/disk1/yarn/nm/usercache/efinance/appcache/application_1488960736410_229415/spark-b6aa5cf9-042a-4804-9472-9bcddde2814e/userFiles-b259c206-d618-4d54-8630-824d955d0be4
17/04/14 18:12:38 INFO util.ShutdownHookManager: Deleting directory /mnt/disk1/yarn/nm/usercache/efinance/appcache/application_1488960736410_229415/spark-b6aa5cf9-042a-4804-9472-9bcddde2814e
The reason might be that the Scala jar bundled into my application conflicted with the Scala jar on the cluster. When I deleted the scala directory from the jar produced by my Maven build, it fixed the issue.
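A cleaner way to get the same result is to keep the Scala runtime out of the fat jar at build time instead of deleting classes from the built artifact. This is only a sketch, assuming the jar is built with maven-shade-plugin; the excluded coordinates are illustrative and should match whatever actually pulls Scala into your dependency tree:
<!-- pom.xml (sketch): keep the Scala runtime out of the shaded jar so the cluster's own Scala version is used -->
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.1.0</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<artifactSet>
<excludes>
<!-- illustrative coordinates; check mvn dependency:tree for the real ones -->
<exclude>org.scala-lang:scala-library</exclude>
<exclude>org.scala-lang:scala-reflect</exclude>
</excludes>
</artifactSet>
</configuration>
</execution>
</executions>
</plugin>
Alternatively, marking the Scala dependency with <scope>provided</scope> keeps it available at compile time without packaging it into the jar.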

How to understand YARN appattempt log and diagnostics?

I am running a Spark application on a YARN cluster (on AWS EMR). The application seems to have been killed and I want to find the cause. I am trying to understand the YARN info given in the following screen.
The diagnostics line in the screen seems to show that YARN killed the app because of the memory limit:
Diagnostics: Container [pid=1540,containerID=container_1488651686158_0012_02_000001] is running beyond physical memory limits. Current usage: 1.6 GB of 1.4 GB physical memory used; 3.6 GB of 6.9 GB virtual memory used. Killing container.
However, the appattempt log shows a completely different exception, something related to IO/network. My question is: should I trust the diagnostics in the screen or the appattempt log? Did the IO exception cause the kill, or did the out-of-memory condition cause the IO exception in the appattempt log? Is there another log/diagnostic I should look at? Thanks.
17/03/04 21:59:02 ERROR Utils: Uncaught exception in thread task-result-getter-0
java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190)
at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:104)
at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:579)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Exception in thread "task-result-getter-0" java.lang.Error: java.lang.InterruptedException
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1148)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.InterruptedException
at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:998)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
at scala.concurrent.impl.Promise$DefaultPromise.tryAwait(Promise.scala:202)
at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:218)
at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
at scala.concurrent.Await$.result(package.scala:190)
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:190)
at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:104)
at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:579)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply$mcV$sp(TaskResultGetter.scala:82)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3$$anonfun$run$1.apply(TaskResultGetter.scala:63)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:62)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
... 2 more
17/03/04 21:59:02 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
17/03/04 21:59:02 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from ip-172-31-9-207.ec2.internal/172.31.9.207:38437 is closed
17/03/04 21:59:02 INFO RetryingBlockFetcher: Retrying fetch (1/3) for 1 outstanding blocks after 5000 ms
17/03/04 21:59:02 ERROR DiskBlockManager: Exception while deleting local spark dir: /mnt/yarn/usercache/hadoop/appcache/application_1488651686158_0012/blockmgr-941a13d8-1b31-4347-bdec-180125b6f4ca
java.io.IOException: Failed to delete: /mnt/yarn/usercache/hadoop/appcache/application_1488651686158_0012/blockmgr-941a13d8-1b31-4347-bdec-180125b6f4ca
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:1010)
at org.apache.spark.storage.DiskBlockManager$$anonfun$org$apache$spark$storage$DiskBlockManager$$doStop$1.apply(DiskBlockManager.scala:169)
at org.apache.spark.storage.DiskBlockManager$$anonfun$org$apache$spark$storage$DiskBlockManager$$doStop$1.apply(DiskBlockManager.scala:165)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
at org.apache.spark.storage.DiskBlockManager.org$apache$spark$storage$DiskBlockManager$$doStop(DiskBlockManager.scala:165)
at org.apache.spark.storage.DiskBlockManager.stop(DiskBlockManager.scala:160)
at org.apache.spark.storage.BlockManager.stop(BlockManager.scala:1361)
at org.apache.spark.SparkEnv.stop(SparkEnv.scala:89)
at org.apache.spark.SparkContext$$anonfun$stop$11.apply$mcV$sp(SparkContext.scala:1842)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1841)
at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:581)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
17/03/04 21:59:02 INFO MemoryStore: MemoryStore cleared
17/03/04 21:59:02 INFO BlockManager: BlockManager stopped
17/03/04 21:59:02 INFO BlockManagerMaster: BlockManagerMaster stopped
17/03/04 21:59:02 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
17/03/04 21:59:02 ERROR Utils: Uncaught exception in thread Thread-3
java.lang.NoClassDefFoundError: Could not initialize class java.nio.file.FileSystems$DefaultFileSystemHolder
at java.nio.file.FileSystems.getDefault(FileSystems.java:176)
at java.nio.file.Paths.get(Paths.java:138)
at org.apache.spark.util.Utils$.isSymlink(Utils.scala:1021)
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:991)
at org.apache.spark.SparkEnv.stop(SparkEnv.scala:102)
at org.apache.spark.SparkContext$$anonfun$stop$11.apply$mcV$sp(SparkContext.scala:1842)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1841)
at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:581)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
17/03/04 21:59:02 WARN ShutdownHookManager: ShutdownHook '$anon$2' failed, java.lang.NoClassDefFoundError: Could not initialize class java.nio.file.FileSystems$DefaultFileSystemHolder
java.lang.NoClassDefFoundError: Could not initialize class java.nio.file.FileSystems$DefaultFileSystemHolder
at java.nio.file.FileSystems.getDefault(FileSystems.java:176)
at java.nio.file.Paths.get(Paths.java:138)
at org.apache.spark.util.Utils$.isSymlink(Utils.scala:1021)
at org.apache.spark.util.Utils$.deleteRecursively(Utils.scala:991)
at org.apache.spark.SparkEnv.stop(SparkEnv.scala:102)
at org.apache.spark.SparkContext$$anonfun$stop$11.apply$mcV$sp(SparkContext.scala:1842)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1283)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1841)
at org.apache.spark.SparkContext$$anonfun$2.apply$mcV$sp(SparkContext.scala:581)
at org.apache.spark.util.SparkShutdownHook.run(ShutdownHookManager.scala:216)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1$$anonfun$apply$mcV$sp$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1951)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply$mcV$sp(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anonfun$runAll$1.apply(ShutdownHookManager.scala:188)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.util.SparkShutdownHookManager.runAll(ShutdownHookManager.scala:188)
at org.apache.spark.util.SparkShutdownHookManager$$anon$2.run(ShutdownHookManager.scala:178)
at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:54)
The information in your screenshot is the most relevant. Your ApplicationMaster container ran out of memory. You need to increase yarn.app.mapreduce.am.resource.mb, which is set in mapred-site.xml. I recommend a value of 2000, since that will usually accommodate running Spark and MapReduce applications at scale.
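For reference, that property goes into mapred-site.xml like this (the 2000 MB value is the recommendation above, not a universal setting):
<!-- mapred-site.xml: raise the ApplicationMaster container size -->
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>2000</value>
</property>
New jobs submitted after the change should pick up the larger AM size.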
The container was killed because it exceeded its physical memory limit, so any attempt to reach that container afterwards fails.
YARN is fine for an overall view of the process, but you should prefer the Spark history server to analyse your job in more detail (check for unbalanced memory usage in the Spark history UI).
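To see the full aggregated logs for a finished attempt, yarn logs -applicationId application_1488651686158_0012 dumps every container's output in one place, which makes it easier to separate the original failure from the shutdown noise that follows it. And if the history server does show Spark itself running out of memory, the Spark-side knobs live in spark-defaults.conf (or can be passed per job with --conf). A minimal sketch with placeholder values, not recommendations:
# spark-defaults.conf (sketch): sizes are placeholders, tune from what the history server shows
spark.driver.memory = 2g
spark.executor.memory = 4g
# off-heap headroom in MB (this property name applies to Spark 2.2 and earlier)
spark.yarn.executor.memoryOverhead = 512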
