spark application completes with SUCCESS status when an exception is thrown - apache-spark

I am running a spark application on yarn, which my goal is do some ETL from jdbc to elasticsearch.
However, when I check the log ,there is some errors like,this error is due to network problem :
17/12/01 00:35:19 WARN scheduler.TaskSetManager: Lost task 1317.0 in stage 0.0 (TID 1381, worker50.hadoop, executor 1): org.apache.spark.util.TaskCompletionListenerException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[192.168.200.154:8201, 192.168.200.156:9200, 192.168.200.155:8201]]
at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:138)
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:116)
at org.apache.spark.scheduler.Task.run(Task.scala:124)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This means that connection failed and lost some data in this process.The job finalStatus should be failed, but spark returned me with {"state":"FINISHED","finalStatus":"SUCCEEDED"}
WHY? My spark version is 2.2.0

Related

When running "local-cluster" model in Apache Spark, how to prevent executor from dissociating prematurely?

I have a Spark application that should be tested in both local mode & local-cluster mode, using scalatest.
The local-cluster mode is submitted using this method:
How to scala-test a Spark program under "local-cluster" mode?
The test run successfully, but when terminating the test I got the following error in the log:
22/05/16 17:45:25 ERROR TaskSchedulerImpl: Lost executor 0 on 172.16.224.18: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
22/05/16 17:45:25 ERROR Worker: Failed to launch executor app-20220516174449-0000/2 for Test.
java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
at org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:195)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:153)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:142)
at org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:77)
at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:547)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:215)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
22/05/16 17:45:25 ERROR Worker: Failed to launch executor app-20220516174449-0000/3 for Test.
java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
at org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:195)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:153)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:142)
at org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:77)
at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:547)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:215)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
22/05/16 17:45:25 ERROR Worker: Failed to launch executor app-20220516174449-0000/4 for Test.
java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
at org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:195)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:153)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:142)
at org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:77)
at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:547)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:215)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:221)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
22/05/16 17:45:25 ERROR Worker: Failed to launch executor app-20220516174449-0000/5 for Test.
java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
at org.apache.spark.util.SparkShutdownHookManager.add(ShutdownHookManager.scala:195)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:153)
at org.apache.spark.util.ShutdownHookManager$.addShutdownHook(ShutdownHookManager.scala:142)
at org.apache.spark.deploy.worker.ExecutorRunner.start(ExecutorRunner.scala:77)
at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:547)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:117)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:215)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:102)
at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dis
...
It turns out executor 0 was dropped before the SparkContext is stopped, this triggered a violent self-healing reaction from Spark master that tries to repeatedly launch new executors to compensate for the loss. How do I prevent this from happening?
Spark attempts to recover from failed tasks by attempting to run them again. What you can do to avoid this is to set some properties to 1 in
spark.task.maxFailures (default is 4)
spark.stage.maxConsecutiveAttempts (default is 4)
These properties can be set in $SPARK_HOME/conf/spark-defaults.conf or given as options to spark-submit:
spark-submit --conf spark.task.maxFailures=1 --conf spark.stage.maxConsecutiveAttempts=1
or in the Spark context/session configuration before starting the session.
EDIT:
It looks like your executors are lost due to insufficient memory. You could try to increase:
spark.executor.memory
spark.executor.memoryOverhead
spark.memory.offHeap.size with (spark.memory.offHeap.enabled=true)
(see Spark configuration)
The maximum memory size of container to running executor is determined by the sum of spark.executor.memoryOverhead, spark.executor.memory, spark.memory.offHeap.size and spark.executor.pyspark.memory.

"Task X in stage X failed X times" - Apache Spark EMR

I have an error happening in spark running on Amazon EMR.
It doesn't always happen (most of the time the step finishes successfully) and when I work with more data and more nodes it happens more often.
I get a file already exists exception, but the question is what does it mean "failed 20 times"? What exactly failed? And why after 20 times it makes the whole application fail? Where is the configuration for that?
This is from the stderr of the step: (I blanked private info with *)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 14139 in stage 151.0 failed 20 times, most recent failure: Lost task 14139.19 in stage 151.0 (TID 186***, ip-*-*-*-*.ec2.internal, executor 107): org.apache.hadoop.fs.FileAlreadyExistsException: File already exists:s3://**path**
at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.checkExistenceIfNotOverwriting(RegularUploadPlanner.java:36)
at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.plan(RegularUploadPlanner.java:30)
at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.UploadPlannerChain.plan(UploadPlannerChain.java:37)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:281)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1125)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1105)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:994)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:213)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:81)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92)
at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CsvOutputWriter.scala:38)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:84)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:111)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:264)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
By the way, the file does not exist in this path before running the step.

ERROR TransportClientFactory: Exception while bootstrapping client

I have a PySpark job which works successfully with a small cluster, but starts to get a lot of the following errors in the first few minutes when it starts up. Any idea on how I can solve it? This is with PySpark 2.2.0 and mesos.
17/09/29 18:54:26 INFO Executor: Running task 5717.0 in stage 0.0 (TID 5717)
17/09/29 18:54:26 INFO CoarseGrainedExecutorBackend: Got assigned task 5813
17/09/29 18:54:26 INFO Executor: Running task 5813.0 in stage 0.0 (TID 5813)
17/09/29 18:54:26 INFO CoarseGrainedExecutorBackend: Got assigned task 5909
17/09/29 18:54:26 INFO Executor: Running task 5909.0 in stage 0.0 (TID 5909)
17/09/29 18:54:56 ERROR TransportClientFactory: Exception while bootstrapping client after 30001 ms
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.
at org.spark_project.guava.base.Throwables.propagate(Throwables.java:160)
at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:275)
at org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:70)
at org.apache.spark.network.crypto.AuthClientBootstrap.doSaslAuth(AuthClientBootstrap.java:117)
at org.apache.spark.network.crypto.AuthClientBootstrap.doBootstrap(AuthClientBootstrap.java:76)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:244)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
at org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:366)
at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:332)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:654)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:467)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:684)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:681)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:681)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:308)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276)
at org.spark_project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:271)
... 23 more
17/09/29 18:54:56 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /10.0.1.25:37314 is closed
17/09/29 18:54:56 INFO Executor: Fetching spark://10.0.1.25:37314/files/djinn.spark.zip with timestamp 1506711239350

Executor shows up on the spark UI even on killing the worker and stages keep on failing with java.io.IOException

I am running a spark streaming application with spark version 1.4.0
If I kill the worker (using kill -9) when my job is running, then the worker and executor both on that node dies,but it still shows up in the executors tab of the UI. The number of active tasks sometimes shows as negative on those executors.
Because of this the jobs keep on failing with the following exception
16/04/01 23:54:20 WARN TaskSetManager: Lost task 141.0 in stage 19859.0 (TID 190333, 192.168.33.96): java.io.IOException: Failed to connect to /192.168.33.97:63276
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: /192.168.33.97:63276
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
... 1 more
On relaunching the worker a new executor is allocated but the old (dead) executor's entry is still there and the stages fail with "java.io.IOException: Failed to connect to " error.

Spark concurrently jobs fail

If I run a single job with spark on yarn-client everything works fine, but on multiple (>1) concurrently jobs I get the following exception on the container nodes. I'm Using Spark 1.2 with CDH5.3 and Spark-Jobserver
java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_3_piece0 of broadcast_3
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_3_piece0 of broadcast_3
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008)
... 11 more
15/02/02 19:20:17 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1
15/02/02 19:20:17 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
15/02/02 19:20:17 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
15/02/02 19:20:17 ERROR executor.Executor: Exception in task 1.0 in stage 0.0 (TID 1)
Check SparkConf.set("spark.cleaner.ttl", "10000") in SparkConf. It may be due value in spark.cleaner.ttl your program running time exceeds the corresponding value, this may happens. Just increase the value. its given in seconds.
For more details look at configuration.html
it shouldn't be the reason spark.cleaner.ttl, since it was deprecated since Spark1.4

Resources