Making Spark app work on m4 instances - apache-spark

I have a pyspark application running on EMR (5.7.0) which processes about 140M json records. Nothing terribly fancy outside of the size of the dataset -- main operations are map, filter, count, repartitionAndSort, and mapPartition.
This app runs on a cluster of 40 m3.2xlarge instances, but I wanted to try it on the m4 family (since I get access to beefier machines that way).
However, the executors fall over and the cluster is rendered inoperable (if I rerun my application, it fails far earlier with no available executors).
Here's the stack trace I get on failure.
17/08/26 15:51:15 WARN TaskSetManager: Lost task 1118.0 in stage 145.0 (TID 72871, ip-172-22-247-134.ec2.internal, executor 91): java.lang.IllegalArgumentException: Self-suppression not permitted
at java.lang.Throwable.addSuppressed(Throwable.java:1043)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1316)
at org.apache.spark.rdd.ReliableCheckpointRDD$.writePartitionToCheckpointFile(ReliableCheckpointRDD.scala:182)
at org.apache.spark.rdd.ReliableCheckpointRDD$$anonfun$writeRDDToCheckpointDirectory$1.apply(ReliableCheckpointRDD.scala:137)
at org.apache.spark.rdd.ReliableCheckpointRDD$$anonfun$writeRDDToCheckpointDirectory$1.apply(ReliableCheckpointRDD.scala:137)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /checkpoints/48b56f21-7f74-429c-934c-0aea983c0175/rdd-359/.part-01118-attempt-0 could only be replicated to 0 nodes instead of minReplication (=1). There are 36 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1580)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3107)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3031)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:725)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:492)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
Any help? Seems very weird that this can't run on m4-class machines.
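For context, here is a minimal PySpark sketch of the kind of checkpointed pipeline the question describes; every name and path below is a placeholder, not the asker's actual code:
import json
from pyspark import SparkContext

sc = SparkContext(appName="json-pipeline-sketch")  # hypothetical app name

# The ReliableCheckpointRDD frames in the stack trace come from RDD checkpointing,
# which writes every partition of the checkpointed RDD into this directory on HDFS.
sc.setCheckpointDir("hdfs:///checkpoints")  # placeholder path

records = sc.textFile("s3://my-bucket/input/")  # placeholder input, ~140M JSON lines
parsed = records.map(json.loads).filter(lambda r: "key" in r)

# repartitionAndSortWithinPartitions operates on (key, value) pairs
keyed = parsed.map(lambda r: (r["key"], r))
sorted_parts = keyed.repartitionAndSortWithinPartitions(numPartitions=2000)

# Marks the RDD for checkpointing; partitions are written to HDFS when an action runs,
# so the cluster's datanodes need enough free space to hold the whole dataset.
sorted_parts.checkpoint()

result = sorted_parts.mapPartitions(lambda it: [sum(1 for _ in it)])
print(result.count())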

Related

How to disable disk writes for shuffle files?

Hi, we have a Spark cluster. During Spark job execution I am getting a Spark out-of-memory error when writing intermediate data to the spark.local.dir location, even though I can see that more than double the executor memory is sitting unused. So instead of writing to that directory, can we store the data in memory itself?
Below are the exception details:
Job aborted due to stage failure: Task 134555 in stage 32.0 failed 4 times, most recent failure: Lost task 134555.3 in stage 32.0 (TID 151065, <<some worker node IP>>, executor 318): java.io.FileNotFoundException: /opt/spark/tmp/spark-98331af4-b923-4342-ae3e-93e764b02d4a/executor-a5874092-943d-4b57-b1d0-eab05a3d36c5/blockmgr-17e989d8-4657-4a4e-bc93-ea075cb45f61/0f/temp_shuffle_1f574b0e-617b-46db-a558-9937a911c90a (No space left on device)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:249)
at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:211)
at org.apache.spark.shuffle.sort.ShuffleExternalSorter.closeAndGetSpills(ShuffleExternalSorter.java:419)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:230)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:190)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
(The original question included a screenshot of the failing stage here.)
My understanding is that there's no way to disable disk writes.
ShuffleMapTasks have to write intermediate shuffle map output files (shuffle blocks) to the spark.local.dir directory so that the reducers of this shuffle stage can do their job (no pun intended).
You could give the External Shuffle Service a try to reduce disk usage, so that rather than every executor managing its own shuffle-related directories, they can offload that to the service. You can read about it a bit in Dynamic Resource Allocation.
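If you want to try the external shuffle service route mentioned above, the relevant properties look roughly like this. This is a minimal PySpark sketch with placeholder values, and it assumes the shuffle service itself is already running on each worker (on YARN, the spark_shuffle auxiliary service):
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("shuffle-service-sketch")  # hypothetical app name
    # Executors hand their shuffle files over to an external shuffle service
    # running on each worker instead of serving the files themselves.
    .config("spark.shuffle.service.enabled", "true")
    # Dynamic allocation (the Dynamic Resource Allocation docs referenced above)
    # requires the external shuffle service so executors can be released safely.
    .config("spark.dynamicAllocation.enabled", "true")
    # Shuffle and spill files still land on local disk either way; pointing
    # spark.local.dir at a larger volume is what usually cures
    # "No space left on device" errors like the one above.
    .config("spark.local.dir", "/mnt/bigdisk/spark-tmp")  # placeholder path
    .getOrCreate()
)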

Spark context stopped while waiting for backend

I have a long-running EMR step that executes spark-submit in EMR client mode. Between job executions, I manually restart the Spark context if any configuration changes, such as --executor-memory.
I'm running into the following exception when I try to restart the context with the new configuration using:
currentSparkSession.close();
return SparkSession.builder().config(newConfig).getOrCreate();
19/05/23 15:52:35 ERROR SparkContext: Error initializing SparkContext.
java.lang.IllegalStateException: Spark context stopped while waiting for backend
at org.apache.spark.scheduler.TaskSchedulerImpl.waitBackendReady(TaskSchedulerImpl.scala:689)
at org.apache.spark.scheduler.TaskSchedulerImpl.postStartHook(TaskSchedulerImpl.scala:186)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:567)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2516)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:923)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:915)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:915)
...
19/05/23 15:52:35 INFO SparkContext: SparkContext already stopped.
19/05/23 15:52:35 WARN TransportChannelHandler: Exception in connection from /172.31.0.165:42556
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:221)
at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:899)
at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:275)
at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:643)
at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:566)
at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:480)
at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:442)
at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
at java.lang.Thread.run(Thread.java:748)
I tried making the thread sleep a little in case some time is needed between the stop and the start:
currentSparkSession.close();
Thread.sleep(5000); // Sleep 5 seconds
return SparkSession.builder().config(newConfig).getOrCreate();
but that doesn't work either. I looked at the Spark source code and it looks like currentSparkSession.close() won't return until it's actually stopped anyway, so sleeping the thread doesn't do anything.
I also see this in the container logs:
Error occurred during initialization of VM
Initial heap size set to a larger value than the maximum heap size
End of LogType:stdout
which confuses me because the only configuration I changed between executions was --executor-memory, and I actually DECREASED it rather than increasing it.
I've found similar questions on this site, like Apache Spark running spark-shell on YARN error, but the suggestions there essentially just turn off some resource manager validation checks, which doesn't look very safe to me. Any suggestions?
This is because I tried sending a request with a lower --executor-memory (which happens to set Xmx, the max heap size) than the Xms (initial heap size) that was configured on the initial spark-submit. The exception was thrown because the max heap size can never be smaller than the initial heap size.
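In other words, the new executor heap must not drop below any initial heap that was baked into the original submit. Here is a hedged PySpark sketch of the same restart pattern; the memory values and extraJavaOptions shown are illustrative, not taken from the original job:
from pyspark.sql import SparkSession

def restart_session(current_session, new_executor_memory):
    # Stop the current session, then build a new one with a different executor size.
    current_session.stop()
    return (
        SparkSession.builder
        # --executor-memory / spark.executor.memory becomes the executor -Xmx.
        .config("spark.executor.memory", new_executor_memory)  # e.g. "8g"
        # If the first submit also pinned an initial heap via -Xms, either drop it
        # or keep it at or below the new spark.executor.memory, otherwise the
        # executor JVMs fail with "Initial heap size set to a larger value than
        # the maximum heap size". Illustrative value:
        .config("spark.executor.extraJavaOptions", "-Xms4g")
        .getOrCreate()
    )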

Unable to save RDD to HDFS in Apache Spark

I am getting the following error while trying to save the RDD to HDFS
17/09/13 17:06:42 WARN TaskSetManager: Lost task 7340.0 in stage 16.0 (TID 100118, XXXXXX.com, executor 2358): java.io.IOException: Failing write. Tried pipeline recovery 5 times without success.
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:865)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:401)
Suppressed: java.lang.IllegalArgumentException: Self-suppression not permitted
at java.lang.Throwable.addSuppressed(Throwable.java:1043)
at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:108)
at org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:102)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$8.apply$mcV$sp(PairRDDFunctions.scala:1218)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1359)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1218)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1197)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
[CIRCULAR REFERENCE:java.io.IOException: Failing write. Tried pipeline recovery 5 times without success.]
The final task in the stage is .saveAsTextFile(). In the Spark UI I can see that the tasks prior to .saveAsTextFile() finish successfully. I am using Spark 2.0.0 in YARN mode.
EDIT:
I have already seen the answer on Spark: Self-suppression not permitted when writing big file to HDFS and made sure that the issues mentioned in that answer are not the case here.
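For reference, the failing step is the standard RDD text sink; a minimal sketch of that final write follows (the paths and the preceding transformation are placeholders, not the asker's code):
from pyspark import SparkContext

sc = SparkContext(appName="save-rdd-sketch")  # hypothetical app name

rdd = sc.textFile("hdfs:///input/data")        # placeholder input
result = rdd.map(lambda line: line.strip())    # stand-in for the real transformations

# saveAsTextFile writes one part file per partition through the HDFS client;
# the "Failing write. Tried pipeline recovery 5 times" in the trace above is
# raised by that client when it cannot keep a datanode write pipeline alive.
result.saveAsTextFile("hdfs:///output/result") # placeholder output path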

Kinesis Spark Streaming longevity issues

I'm having issues with the longevity of a Spark-Kinesis Streaming application running on the Spark standalone cluster manager. The program runs for around 50 hours and then stops receiving data from Kinesis without giving any valid error for why it stopped. But if I restart the application, it works for another day and a half or so.
I'm seeing a whole lot of errors during program execution. I'm not sure they are related to the unexpected stoppage, because these errors are in the logs even when the Spark application is working fine.
There is no error specific to the stoppage in the driver or executors. I also checked for out-of-memory errors but was not able to spot any in the logs. Could you please help me understand what these error messages mean? Do they have anything to do with the longevity issue? Where do you think I should start debugging to understand what's happening here?
2016-04-15 13:32:19 INFO KinesisRecordProcessor:58 - Shutdown: Shutting down workerId ip-10-205-1-150.us-west-2.compute.internal:6394789f-acb9-4702-8ea2-c2a3637d925a with reason ZOMBIE
2016-04-15 13:32:19 ERROR ShutdownTask:123 - Application exception.
java.lang.NullPointerException
at java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333)
at java.util.concurrent.ConcurrentHashMap.remove(ConcurrentHashMap.java:1175)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.removeCheckpointer(KinesisCheckpointer.scala:66)
at org.apache.spark.streaming.kinesis.KinesisReceiver.removeCheckpointer(KinesisReceiver.scala:249)
at org.apache.spark.streaming.kinesis.KinesisRecordProcessor.shutdown(KinesisRecordProcessor.scala:124)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.V1ToV2RecordProcessorAdapter.shutdown(V1ToV2RecordProcessorAdapter.java:48)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:94)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:48)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:23)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2016-04-15 13:32:19 ERROR ShutdownTask:123 - Application exception.
java.lang.NullPointerException
at java.util.concurrent.ConcurrentHashMap.hash(ConcurrentHashMap.java:333)
at java.util.concurrent.ConcurrentHashMap.remove(ConcurrentHashMap.java:1175)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.removeCheckpointer(KinesisCheckpointer.scala:66)
at org.apache.spark.streaming.kinesis.KinesisReceiver.removeCheckpointer(KinesisReceiver.scala:249)
at org.apache.spark.streaming.kinesis.KinesisRecordProcessor.shutdown(KinesisRecordProcessor.scala:124)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.V1ToV2RecordProcessorAdapter.shutdown(V1ToV2RecordProcessorAdapter.java:48)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.ShutdownTask.call(ShutdownTask.java:94)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:48)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.MetricsCollectingTaskDecorator.call(MetricsCollectingTaskDecorator.java:23)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2016-04-15 13:33:12 INFO LeaseRenewer:235 - Worker ip-10-205-1-151.us-west-2.compute.internal:188bd3f5-095b-405f-ac9f-b7eff11e16d1 lost lease with key shardId-000000000046 - discovered during update
2016-04-15 13:33:12 WARN MetricsHelper:67 - No metrics scope set in thread RecurringTimer - Kinesis Checkpointer - Worker ip-10-205-1-151.us-west-2.compute.internal:188bd3f5-095b-405f-ac9f-b7eff11e16d1, getMetricsScope returning NullMetricsScope.
2016-04-15 13:33:12 ERROR KinesisRecordProcessor:95 - ShutdownException: Caught shutdown exception, skipping checkpoint.
com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: Can't update checkpoint - instance doesn't hold the lease for this shard
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.setCheckpoint(KinesisClientLibLeaseCoordinator.java:120)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.advancePosition(RecordProcessorCheckpointer.java:216)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:137)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:103)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply$mcV$sp(KinesisCheckpointer.scala:81)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.kinesis.KinesisRecordProcessor$.retryRandom(KinesisRecordProcessor.scala:145)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:81)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:75)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.checkpoint(KinesisCheckpointer.scala:75)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.org$apache$spark$streaming$kinesis$KinesisCheckpointer$$checkpointAll(KinesisCheckpointer.scala:103)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$1.apply$mcVJ$sp(KinesisCheckpointer.scala:117)
at org.apache.spark.streaming.util.RecurringTimer.triggerActionForNextInterval(RecurringTimer.scala:94)
at org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:106)
at org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:29)
2016-04-15 13:33:12 WARN KinesisCheckpointer:91 - Failed to checkpoint shardId shardId-000000000046 to DynamoDB.
com.amazonaws.services.kinesis.clientlibrary.exceptions.ShutdownException: Can't update checkpoint - instance doesn't hold the lease for this shard
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.KinesisClientLibLeaseCoordinator.setCheckpoint(KinesisClientLibLeaseCoordinator.java:120)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.advancePosition(RecordProcessorCheckpointer.java:216)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:137)
at com.amazonaws.services.kinesis.clientlibrary.lib.worker.RecordProcessorCheckpointer.checkpoint(RecordProcessorCheckpointer.java:103)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply$mcV$sp(KinesisCheckpointer.scala:81)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1$$anonfun$apply$1.apply(KinesisCheckpointer.scala:81)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.kinesis.KinesisRecordProcessor$.retryRandom(KinesisRecordProcessor.scala:145)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:81)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$checkpoint$1.apply(KinesisCheckpointer.scala:75)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.checkpoint(KinesisCheckpointer.scala:75)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer.org$apache$spark$streaming$kinesis$KinesisCheckpointer$$checkpointAll(KinesisCheckpointer.scala:103)
at org.apache.spark.streaming.kinesis.KinesisCheckpointer$$anonfun$1.apply$mcVJ$sp(KinesisCheckpointer.scala:117)
at org.apache.spark.streaming.util.RecurringTimer.triggerActionForNextInterval(RecurringTimer.scala:94)
at org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:106)
at org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:29)
2016-04-15 13:33:12 INFO LeaseRenewer:116 - Worker ip-10-205-1-151.us-west-2.compute.internal:188bd3f5-095b-405f-ac9f-b7eff11e16d1 lost lease with key shardId-000000000046
2016-04-15 13:33:12 INFO MemoryStore:58 - Block input-2-1460727103359 stored as values in memory (estimated size 120.3 KB, free 171.8 MB)
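For context, a receiver-based Kinesis DStream is typically wired up roughly as below; the checkpointInterval is what drives the KinesisCheckpointer writes to DynamoDB that appear in the errors above. All names, regions, and intervals here are placeholders, not the asker's configuration:
from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext(appName="kinesis-longevity-sketch")  # hypothetical app name
ssc = StreamingContext(sc, batchDuration=10)           # placeholder batch interval (seconds)

stream = KinesisUtils.createStream(
    ssc,
    kinesisAppName="my-kcl-app",     # placeholder; also names the DynamoDB lease table
    streamName="my-stream",          # placeholder stream
    endpointUrl="https://kinesis.us-west-2.amazonaws.com",
    regionName="us-west-2",
    initialPositionInStream=InitialPositionInStream.LATEST,
    checkpointInterval=10,           # seconds; the KinesisCheckpointer in the logs
                                     # writes shard positions to DynamoDB at this cadence
    storageLevel=StorageLevel.MEMORY_AND_DISK_2,
)

stream.count().pprint()
ssc.start()
ssc.awaitTermination()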

Spark: Self-suppression not permitted when writing big file to HDFS

I'm writing a large file to HDFS using Spark. Basically what I was doing was to join 3 big files, convert the resulting dataframe to JSON using toJSON(), and then use saveAsTextFile to save it to HDFS. The final file to write is approximately 4 TB. The application ran pretty slowly (as I should have expected?) and after 6 hours it threw java.lang.IllegalArgumentException: Self-suppression not permitted. A rough sketch of the pipeline and the detailed failure reason (copied from the monitoring page) are below:
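The sketch first: a hedged PySpark rendering of the pipeline just described, where the input paths and join key are invented for illustration and the output path is the one that appears in the error message.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("big-json-write-sketch").getOrCreate()  # hypothetical app name

# Three large inputs joined together (paths and join key are placeholders)
a = spark.read.parquet("hdfs:///data/a")
b = spark.read.parquet("hdfs:///data/b")
c = spark.read.parquet("hdfs:///data/c")

joined = a.join(b, "id").join(c, "id")

# toJSON() turns each row into a JSON string; saveAsTextFile then writes one
# part file per partition through the HDFS client, which is where the
# "could only be replicated to 0 nodes" error below is raised.
joined.toJSON().saveAsTextFile("hdfs:///user/dawei/upid_json_all")  # path taken from the error message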
Job aborted due to stage failure: Task 37 in stage 6.0 failed 4 times, most recent failure: Lost task 37.3 in stage 6.0 (TID 361, 192.168.10.149): java.lang.IllegalArgumentException: Self-suppression not permitted
at java.lang.Throwable.addSuppressed(Throwable.java:1043)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1219)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/dawei/upid_json_all/_temporary/0/_temporary/attempt_201512210857_0006_m_000037_361/part-00037 could only be replicated to 0 nodes instead of minReplication (=1). There are 5 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1562)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3245)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:663)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:482)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2036)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2034)
at org.apache.hadoop.ipc.Client.call(Client.java:1468)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
at sun.reflect.GeneratedMethodAccessor119.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy15.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1532)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1349)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:588)
Driver stacktrace:
Can anyone tell me what causes this problem and how I could solve it?
From this error:
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/user/dawei/upid_json_all/_temporary/0/_temporary/attempt_201512210857_0006_m_000037_361/
part-00037 could only be replicated to 0 nodes instead of minReplication (=1).
There are 5 datanode(s) running and no node(s) are excluded in this operation.
It seems that replication is not happening. If you fix this error, things may fall into place.
It may be due to one of the issues below:
Inconsistency in your datanodes: restart your Hadoop cluster and see if this solves the problem.
Communication between datanodes and namenode: network connectivity issues, or permission/firewall issues related to port accessibility.
Disk space may be full on a datanode.
A datanode may be busy or unresponsive.
Invalid configuration, such as a negative block size configuration.
Have a look at the related SE questions on this topic too.
HDFS error: could only be replicated to 0 nodes, instead of 1
The actual error could be hidden behind this weird 'self-suppression' error.
If you don't see any clue in the YARN logs, check the Spark UI; you will find some clues about the stage failures there.
It is more likely some memory spill or something similar.
