How to disable disk writes for shuffle files? - apache-spark

Hi We have spark cluster , during spark job execution , am getting sparkoutofmemory when writing intermediate data to spark.local.dir location , but when am seeing their is more than double memory for executor unused , so instead of writing to that dir , can we store the data into memory itself ?
Below the exception details
Job aborted due to stage failure: Task 134555 in stage 32.0 failed 4 times, most recent failure: Lost task 134555.3 in stage 32.0 (TID 151065, <<some worker node IP>>, executor 318): java.io.FileNotFoundException: /opt/spark/tmp/spark-98331af4-b923-4342-ae3e-93e764b02d4a/executor-a5874092-943d-4b57-b1d0-eab05a3d36c5/blockmgr-17e989d8-4657-4a4e-bc93-ea075cb45f61/0f/temp_shuffle_1f574b0e-617b-46db-a558-9937a911c90a (No space left on device)
at java.io.FileOutputStream.open0(Native Method)
at java.io.FileOutputStream.open(FileOutputStream.java:270)
at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
at org.apache.spark.storage.DiskBlockObjectWriter.initialize(DiskBlockObjectWriter.scala:103)
at org.apache.spark.storage.DiskBlockObjectWriter.open(DiskBlockObjectWriter.scala:116)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:249)
at org.apache.spark.shuffle.sort.ShuffleExternalSorter.writeSortedFile(ShuffleExternalSorter.java:211)
at org.apache.spark.shuffle.sort.ShuffleExternalSorter.closeAndGetSpills(ShuffleExternalSorter.java:419)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.closeAndWriteOutput(UnsafeShuffleWriter.java:230)
at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:190)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Driver stacktrace:
Below the screen shot of the failing stage

My understanding is that there's no way to disable disk writes.
ShuffleMapTasks have to write intermediate shuffle map output files (shuffle blocks) to spark.local.dir directory so reducers (of this shuffle stage) can do their job (no pun intended).
You could give External Shuffle Service a try to reduce disk usage (so rather than every executor would manage their own shuffle-related directories they could offload it). You can read about it a bit in Dynamic Resource Allocation.

Related

java.io.FileNotFoundException: File does not exist even if I cache

I have a long pyspark script.
At the start of the script, I read a user_table, and need to use it many times all along.
Sometimes the underlying files in the relevant partitions get updated (by an outside script of data team), and the script will fail with java.io.FileNotFoundException.
I would totally understand if I didn't cache: Spark always goes back to the source and gets it from there. But I explicitly cache, and show 1, to initiate the caching.
a.user_profile_df = user_profile_df.cache()
a.user_profile_df.show(1)
So wondering, how could the update of underlying files cause this error, if the data is already cached? It would mean it wants to read the data from the source files, but then what is the point of caching?
# Set the default spark-shell log level to WARN. When running the spark-shell, the21/07/24 12:57:31 ERROR TaskSetManager: Task 9 in stage 98.0 failed 4 times; aborting job
# Set the default spark-shell log level to WARN. When running the spark-shell, the21/07/24 12:57:31 WARN TaskSetManager: Lost task 16.1 in stage 98.0 (TID 10131, ip-12-345-67-8.id-element.io, executor 47): java.io.FileNotFoundException: File does not exist: hdfs://R2/projects/data_usermart/hive/user_tables/user_table_micro/region=AB/date=2021-07-23/part-00005-5ed5074f-3199-487d-9d3f-20dce24f4a59.c000
It is possible the underlying files have been updated. You can explicitly invalidate the cache in Spark by running 'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:127)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:177)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:101)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.scan_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1124)
at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1130)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:224)
at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.writeIteratorToStream(PythonUDFRunner.scala:50)
at org.apache.spark.api.python.BasePythonRunner$WriterThread$$anonfun$run$1.apply(PythonRunner.scala:346)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1945)
at org.apache.spark.api.python.BasePythonRunner$WriterThread.run(PythonRunner.scala:195)
My solution now is to
save down the relevant data I need from that table as CSV on the HDFS and
read it in and then refer on that all along the script,
but there has to be a better way.
#Chris, I am not clear or convinced about caching. In my case, like you mentinoed as a temporary solution, after I copied the file from local linux file system to hdfs, the java.io.FileNotFoundException was gone and it started working. This is what I did -
hdfs dfs -cp file:///<path_to_local_file> /<hdfs_file_dir>

AWS Glue - Running out of disk space on worker threads while processing data

I'm running a glue job on a s3 dataset of around 6 million files totaling to about 80 GB, on which I am performing a window function and writing to a different s3 location. My glue job is using 50 G2.X workers and default Spark partitioning. When I run this, I receive an error listed below. Any suggestions on how to keep from running out of storage on an executor?
scheduler.TaskSetManager (Logging.scala:logWarning(66)): Lost task 401.0 in stage 4.0 (TID 505643, 172.35.178.124, executor 13): org.apache.spark.memory.SparkOutOfMemoryError: error while calling spill() on org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter#16007c9 : No space left on device
at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:219)
at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:285)
at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:117)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:383)
at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:407)
at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:135)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.sort_addToSorter_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13$$anon$1.hasNext(WholeStageCodegenExec.scala:636)
at org.apache.spark.sql.execution.window.WindowExec$$anonfun$11$$anon$1.fetchNextRow(WindowExec.scala:314)
at org.apache.spark.sql.execution.window.WindowExec$$anonfun$11$$anon$1.(WindowExec.scala:323)
at org.apache.spark.sql.execution.window.WindowExec$$anonfun$11.apply(WindowExec.scala:303)
at org.apache.spark.sql.execution.window.WindowExec$$anonfun$11.apply(WindowExec.scala:302)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)

Unable to save RDD to HDFS in Apache Spark

I am getting the following error while trying to save the RDD to HDFS
17/09/13 17:06:42 WARN TaskSetManager: Lost task 7340.0 in stage 16.0 (TID 100118, XXXXXX.com, executor 2358): java.io.IOException: Failing write. Tried pipeline recovery 5 times without success.
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.processDatanodeError(DFSOutputStream.java:865)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:401)
Suppressed: java.lang.IllegalArgumentException: Self-suppression not permitted
at java.lang.Throwable.addSuppressed(Throwable.java:1043)
at java.io.FilterOutputStream.close(FilterOutputStream.java:159)
at org.apache.hadoop.mapred.TextOutputFormat$LineRecordWriter.close(TextOutputFormat.java:108)
at org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:102)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$8.apply$mcV$sp(PairRDDFunctions.scala:1218)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1359)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1218)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1197)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
[CIRCULAR REFERENCE:java.io.IOException: Failing write. Tried pipeline recovery 5 times without success.]
the final task in the stage is .saveAsTextFile(), In the Spark UI i am able to see that other tasks prior to .saveAsTextFile() finishes successfully. Using Spark 2.0.0 in YARN mode.
EDIT:
I have already seen the answer on Spark: Self-suppression not permitted when writing big file to HDFS and i made sure that issues mentioned in that answer were not the case here.

Making Spark app work on m4 instances

I have a pyspark application running on EMR (5.7.0) which processes about 140M json records. Nothing terribly fancy outside of the size of the dataset -- main operations are map, filter, count, repartitionAndSort, and mapPartition.
This app runs on a cluster of 40 m3.2xlarge instances, but I wanted to try it on the m4 family (since I get access to beefier machines that way).
However, the executors fall over the the cluster is rendered inoperable (If I rerun my application, it fails far earlier with no available executors).
Here's the stack trace I get on failure.
17/08/26 15:51:15 WARN TaskSetManager: Lost task 1118.0 in stage 145.0 (TID 72871, ip-172-22-247-134.ec2.internal, executor 91): java.lang.IllegalArgumentException: Self-suppression not permitted
at java.lang.Throwable.addSuppressed(Throwable.java:1043)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1316)
at org.apache.spark.rdd.ReliableCheckpointRDD$.writePartitionToCheckpointFile(ReliableCheckpointRDD.scala:182)
at org.apache.spark.rdd.ReliableCheckpointRDD$$anonfun$writeRDDToCheckpointDirectory$1.apply(ReliableCheckpointRDD.scala:137)
at org.apache.spark.rdd.ReliableCheckpointRDD$$anonfun$writeRDDToCheckpointDirectory$1.apply(ReliableCheckpointRDD.scala:137)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /checkpoints/48b56f21-7f74-429c-934c-0aea983c0175/rdd-359/.part-01118-attempt-0 could only be replicated to 0 nodes instead of minReplication (=1). There are 36 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1580)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getNewBlockTargets(FSNamesystem.java:3107)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3031)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:725)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:492)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:616)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
Any help? Seems very weird that this can't run on m4-class machines.

Spark: Self-suppression not permitted when writing big file to HDFS

I'm writing a large file to HDFS using spark. Basically what I was doing was to join 3 big files and then convert the result dataframe to json using toJSON() and then use saveAsTextFile to save it to HDFS. The final file to write is approximately 4TB. The application run pretty slow(as I should expected?) and after 6 hours it throwed an exception java.lang.IllegalArgumentException: Self-suppression not permitted. The detailed failure reason are copied from the monitoring page to below:
Job aborted due to stage failure: Task 37 in stage 6.0 failed 4 times, most recent failure: Lost task 37.3 in stage 6.0 (TID 361, 192.168.10.149): java.lang.IllegalArgumentException: Self-suppression not permitted
at java.lang.Throwable.addSuppressed(Throwable.java:1043)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1219)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /user/dawei/upid_json_all/_temporary/0/_temporary/attempt_201512210857_0006_m_000037_361/part-00037 could only be replicated to 0 nodes instead of minReplication (=1). There are 5 datanode(s) running and no node(s) are excluded in this operation.
at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1562)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3245)
at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:663)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:482)
at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:619)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:962)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2040)
at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2036)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1656)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2034)
at org.apache.hadoop.ipc.Client.call(Client.java:1468)
at org.apache.hadoop.ipc.Client.call(Client.java:1399)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
at com.sun.proxy.$Proxy14.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
at sun.reflect.GeneratedMethodAccessor119.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy15.addBlock(Unknown Source)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1532)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1349)
at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:588)
Driver stacktrace:
can anyone tell me what causes this problem and how could I solve it?
From this error:
Caused by: org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/user/dawei/upid_json_all/_temporary/0/_temporary/attempt_201512210857_0006_m_000037_361/
part-00037 could only be replicated to 0 nodes instead of minReplication (=1).
There are 5 datanode(s) running and no node(s) are excluded in this operation.
It seems that replication is not happening. If you fix this error, things may fall in right place.
It may be due to below issues:
Inconsistency in your datanodes: Restart your Hadoop cluster and see if this solves your problem
Communication between datanodes and namenode: Network connectivity Issues and permission/firewall access issues related to port accessibility.
Disk space may be full on datanode
Datanode may be busy or unresponsive
Invalid configuration like Negative block size configuration
Have a look at related SE questions too on this topic.
HDFS error: could only be replicated to 0 nodes, instead of 1
The actual error could be hidden behind this weird 'self-supression' error.
When you don't see any clue in the yarn logs, check the Spark UI once. You will have some clue on the stage failures there.
It would more likely be some memory spill or something similar.

Resources