Spark concurrently jobs fail - apache-spark

If I run a single job with spark on yarn-client everything works fine, but on multiple (>1) concurrently jobs I get the following exception on the container nodes. I'm Using Spark 1.2 with CDH5.3 and Spark-Jobserver
java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_3_piece0 of broadcast_3
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1011)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:164)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:87)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:58)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_3_piece0 of broadcast_3
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:137)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:136)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:119)
at scala.collection.immutable.List.foreach(List.scala:318)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:119)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:174)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1008)
... 11 more
15/02/02 19:20:17 INFO executor.CoarseGrainedExecutorBackend: Got assigned task 1
15/02/02 19:20:17 INFO executor.Executor: Running task 1.0 in stage 0.0 (TID 1)
15/02/02 19:20:17 INFO broadcast.TorrentBroadcast: Started reading broadcast variable 3
15/02/02 19:20:17 ERROR executor.Executor: Exception in task 1.0 in stage 0.0 (TID 1)

Check SparkConf.set("spark.cleaner.ttl", "10000") in SparkConf. It may be due value in spark.cleaner.ttl your program running time exceeds the corresponding value, this may happens. Just increase the value. its given in seconds.
For more details look at configuration.html

it shouldn't be the reason spark.cleaner.ttl, since it was deprecated since Spark1.4

Related

"Task X in stage X failed X times" - Apache Spark EMR

I have an error happening in spark running on Amazon EMR.
It doesn't always happen (most of the time the step finishes successfully) and when I work with more data and more nodes it happens more often.
I get a file already exists exception, but the question is what does it mean "failed 20 times"? What exactly failed? And why after 20 times it makes the whole application fail? Where is the configuration for that?
This is from the stderr of the step: (I blanked private info with *)
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 14139 in stage 151.0 failed 20 times, most recent failure: Lost task 14139.19 in stage 151.0 (TID 186***, ip-*-*-*-*.ec2.internal, executor 107): org.apache.hadoop.fs.FileAlreadyExistsException: File already exists:s3://**path**
at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.checkExistenceIfNotOverwriting(RegularUploadPlanner.java:36)
at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.RegularUploadPlanner.plan(RegularUploadPlanner.java:30)
at com.amazon.ws.emr.hadoop.fs.s3.upload.plan.UploadPlannerChain.plan(UploadPlannerChain.java:37)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:281)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1125)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:1105)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:994)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:213)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:81)
at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92)
at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CsvOutputWriter.scala:38)
at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:84)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:126)
at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:111)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:264)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:127)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:444)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:447)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
By the way, the file does not exist in this path before running the step.

Lost executor Spark

I have a long running job on Spark, which after running for hours failed with the following errors.
18/10/09 03:22:15 ERROR YarnScheduler: Lost executor 547 on ip: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 750.0 in stage 19.0 (TID 1565492, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 752.0 in stage 19.0 (TID 1565494, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 751.0 in stage 19.0 (TID 1565493, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 754.0 in stage 19.0 (TID 1565496, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 WARN TaskSetManager: Lost task 753.0 in stage 19.0 (TID 1565495, ip, executor 547): ExecutorLostFailure (executor 547 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 ERROR YarnScheduler: Lost executor 572 on ip: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
18/10/09 03:22:15 INFO DAGScheduler: Executor lost: 547 (epoch 45)
18/10/09 03:22:15 WARN TaskSetManager: Lost task 756.0 in stage 19.0 (TID 1565498, ip, executor 572): ExecutorLostFailure (executor 572 exited caused by one of the running tasks) Reason: Unable to create executor due to Unable to register with external shuffle server due to : java.util.concurrent.TimeoutException: Timeout waiting for task.
...
The strange thing is, I can't even see the lost executors on the Executor list for the log.
It would be great if someone can help fix the problem.
There are many factors for this to happen but the summary is the following:
Your master node is unable to reply to a specific executor and therefore gives the error
Unable to register with external shuffle server due to
Why your master node is unable to reply can be of different reasons. Depends on how your code is structured, the size of your instance if you are using EMR.
To solve it
Increase your master node. For example, if you are using i3.4xlarge, instead use i3.8xlarge or even i3.16xlarge.
Increase the network timeout from 2 minutes to 5 minutes. This is done with the following spark configuration: spark.network.timeout=300
Increase both the memory and number of cores of your master node. To increase the number of cores of your master node, set the following configuration. spark.yarn.am.cores=3
Hope this solves the issue.

spark application completes with SUCCESS status when an exception is thrown

I am running a spark application on yarn, which my goal is do some ETL from jdbc to elasticsearch.
However, when I check the log ,there is some errors like,this error is due to network problem :
17/12/01 00:35:19 WARN scheduler.TaskSetManager: Lost task 1317.0 in stage 0.0 (TID 1381, worker50.hadoop, executor 1): org.apache.spark.util.TaskCompletionListenerException: Connection error (check network and/or proxy settings)- all nodes failed; tried [[192.168.200.154:8201, 192.168.200.156:9200, 192.168.200.155:8201]]
at org.apache.spark.TaskContextImpl.invokeListeners(TaskContextImpl.scala:138)
at org.apache.spark.TaskContextImpl.markTaskCompleted(TaskContextImpl.scala:116)
at org.apache.spark.scheduler.Task.run(Task.scala:124)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
This means that connection failed and lost some data in this process.The job finalStatus should be failed, but spark returned me with {"state":"FINISHED","finalStatus":"SUCCEEDED"}
WHY? My spark version is 2.2.0

ERROR TransportClientFactory: Exception while bootstrapping client

I have a PySpark job which works successfully with a small cluster, but starts to get a lot of the following errors in the first few minutes when it starts up. Any idea on how I can solve it? This is with PySpark 2.2.0 and mesos.
17/09/29 18:54:26 INFO Executor: Running task 5717.0 in stage 0.0 (TID 5717)
17/09/29 18:54:26 INFO CoarseGrainedExecutorBackend: Got assigned task 5813
17/09/29 18:54:26 INFO Executor: Running task 5813.0 in stage 0.0 (TID 5813)
17/09/29 18:54:26 INFO CoarseGrainedExecutorBackend: Got assigned task 5909
17/09/29 18:54:26 INFO Executor: Running task 5909.0 in stage 0.0 (TID 5909)
17/09/29 18:54:56 ERROR TransportClientFactory: Exception while bootstrapping client after 30001 ms
java.lang.RuntimeException: java.util.concurrent.TimeoutException: Timeout waiting for task.
at org.spark_project.guava.base.Throwables.propagate(Throwables.java:160)
at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:275)
at org.apache.spark.network.sasl.SaslClientBootstrap.doBootstrap(SaslClientBootstrap.java:70)
at org.apache.spark.network.crypto.AuthClientBootstrap.doSaslAuth(AuthClientBootstrap.java:117)
at org.apache.spark.network.crypto.AuthClientBootstrap.doBootstrap(AuthClientBootstrap.java:76)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:244)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:182)
at org.apache.spark.rpc.netty.NettyRpcEnv.downloadClient(NettyRpcEnv.scala:366)
at org.apache.spark.rpc.netty.NettyRpcEnv.openChannel(NettyRpcEnv.scala:332)
at org.apache.spark.util.Utils$.doFetchFile(Utils.scala:654)
at org.apache.spark.util.Utils$.fetchFile(Utils.scala:467)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:684)
at org.apache.spark.executor.Executor$$anonfun$org$apache$spark$executor$Executor$$updateDependencies$3.apply(Executor.scala:681)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:99)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:99)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.executor.Executor.org$apache$spark$executor$Executor$$updateDependencies(Executor.scala:681)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:308)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.util.concurrent.TimeoutException: Timeout waiting for task.
at org.spark_project.guava.util.concurrent.AbstractFuture$Sync.get(AbstractFuture.java:276)
at org.spark_project.guava.util.concurrent.AbstractFuture.get(AbstractFuture.java:96)
at org.apache.spark.network.client.TransportClient.sendRpcSync(TransportClient.java:271)
... 23 more
17/09/29 18:54:56 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /10.0.1.25:37314 is closed
17/09/29 18:54:56 INFO Executor: Fetching spark://10.0.1.25:37314/files/djinn.spark.zip with timestamp 1506711239350

Executor shows up on the spark UI even on killing the worker and stages keep on failing with java.io.IOException

I am running a spark streaming application with spark version 1.4.0
If I kill the worker (using kill -9) when my job is running, then the worker and executor both on that node dies,but it still shows up in the executors tab of the UI. The number of active tasks sometimes shows as negative on those executors.
Because of this the jobs keep on failing with the following exception
16/04/01 23:54:20 WARN TaskSetManager: Lost task 141.0 in stage 19859.0 (TID 190333, 192.168.33.96): java.io.IOException: Failed to connect to /192.168.33.97:63276
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
at org.apache.spark.network.netty.NettyBlockTransferService$$anon$1.createAndStart(NettyBlockTransferService.scala:88)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused: /192.168.33.97:63276
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:208)
at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:287)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
... 1 more
On relaunching the worker a new executor is allocated but the old (dead) executor's entry is still there and the stages fail with "java.io.IOException: Failed to connect to " error.

Resources