java.lang.OutOfMemoryError for simple rdd.count() operation - apache-spark

I'm having a lot of trouble getting a simple count operation working on about 55 files on hdfs and a total of 1B records. Both spark-shell and PySpark fail with OOM errors. I'm using yarn, MapR, Spark 1.3.1, and hdfs 2.4.1. (It fails in local mode as well.) I've tried following the tuning and configuration advice, throwing more and more memory at the executor. My configuration is
conf = (SparkConf()
.setMaster("yarn-client")
.setAppName("pyspark-testing")
.set("spark.executor.memory", "6g")
.set("spark.driver.memory", "6g")
.set("spark.executor.instances", 20)
.set("spark.yarn.executor.memoryOverhead", "1024")
.set("spark.yarn.driver.memoryOverhead", "1024")
.set("spark.yarn.am.memoryOverhead", "1024")
)
sc = SparkContext(conf=conf)
sc.textFile('/data/on/hdfs/*.csv').count() # fails every time
The job gets split into 893 tasks and after about 50 tasks are successfully completed, many start failing. I see ExecutorLostFailure in the stderr of the application. When digging through the executor logs, I see errors like the following:
15/06/24 16:54:07 ERROR util.Utils: Uncaught exception in thread stdout writer for /work/analytics2/analytics/python/envs/santon/bin/python
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)
at java.nio.CharBuffer.allocate(CharBuffer.java:331)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:792)
at org.apache.hadoop.io.Text.decode(Text.java:406)
at org.apache.hadoop.io.Text.decode(Text.java:383)
at org.apache.hadoop.io.Text.toString(Text.java:281)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:558)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:558)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:379)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1550)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
15/06/24 16:54:07 ERROR util.SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[stdout writer for /work/analytics2/analytics/python/envs/santon/bin/python,5,main]
java.lang.OutOfMemoryError: Java heap space
at java.nio.HeapCharBuffer.<init>(HeapCharBuffer.java:57)
at java.nio.CharBuffer.allocate(CharBuffer.java:331)
at java.nio.charset.CharsetDecoder.decode(CharsetDecoder.java:792)
at org.apache.hadoop.io.Text.decode(Text.java:406)
at org.apache.hadoop.io.Text.decode(Text.java:383)
at org.apache.hadoop.io.Text.toString(Text.java:281)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:558)
at org.apache.spark.SparkContext$$anonfun$textFile$1.apply(SparkContext.scala:558)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:379)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:242)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:204)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1550)
at org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:203)
15/06/24 16:54:07 ERROR executor.CoarseGrainedExecutorBackend: RECEIVED SIGNAL 15: SIGTERM
In the stdout:
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill %p"
# Executing /bin/sh -c "kill 16490"...
In general, I think I understand the OOM errors and troubleshooting, but I'm stuck conceptually here. This is just a simple count. I don't understand how the Java heap could possibly be overflowing when the executors have ~3G heaps. Has anyone run into this before or have any pointers? Is there something going on under the hood that would shed light on the issue?
Update:
I've also noticed that by specifying the parallelism (for example sc.textFile(..., 1000)) to the same number of tasks (893), then the created job has 920 tasks, all but the last of which complete without error. Then the very last task hangs indefinitely. This seems exceedingly strange!

It turns out that the issue I was having was actually related to a single file that was corrupted. Running a simple cat or wc -l on the file would cause the terminal to hang.

Try to increase JAVA heap size as following on your console
export JAVA_OPTS="-Xms512m -Xmx5g"
You can change the values according to your data and memory size, -Xms Means minimum memory and -Xmx means max size. Hopefully it will help you.

Related

Prediction.io - pio train fails with OutOfMemoryError

We are getting the following error after running "pio train". It works about 20 minutes and fails on Stage 26.
[ERROR] [Executor] Exception in task 0.0 in stage 1.0 (TID 3)
[ERROR] [SparkUncaughtExceptionHandler] Uncaught exception in thread Thread[Executor task launch worker-0,5,main]
[ERROR] [SparkUncaughtExceptionHandler] Uncaught exception in thread Thread[Executor task launch worker-4,5,main]
[WARN] [TaskSetManager] Lost task 2.0 in stage 1.0 (TID 5, localhost): java.lang.OutOfMemoryError: Java heap space
at com.esotericsoftware.kryo.io.Output.<init>(Output.java:35)
at org.apache.spark.serializer.KryoSerializer.newKryoOutput(KryoSerializer.scala:80)
at org.apache.spark.serializer.KryoSerializerInstance.output$lzycompute(KryoSerializer.scala:289)
at org.apache.spark.serializer.KryoSerializerInstance.output(KryoSerializer.scala:289)
at org.apache.spark.serializer.KryoSerializerInstance.serialize(KryoSerializer.scala:293)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:239)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Our server has about 30gb memory, but about 10gb is taken by hbase+elasticsearch.
We are trying to process about 20 millions of records created by Universal Recommender.
I've tried the following command to increase executor/driver memory, but it didn't help:
pio train -- --driver-memory 6g --executor-memory 8g
What options could we try to fix the issue? Is it possible to process that amount of events on server with that amount of memory?
Vertical scaling can take you only so far but you could try increasing the memory available if it's AWS by stopping and restarting with a larger instance.
CF looks at a lot of data, Since Spark gets it's speed by doing in-memory calculations (by default) you will need enough memory to hold all of your data spread over all Spark workers and in your case you have only 1.
Another thing that comes to mind is that this is a Kryo error so you might try increasing the Kryo buffer size a little, which is configured in engine.json
Also there is a Google Group for community support here: https://groups.google.com/forum/#!forum/actionml-user

spark on yarn, Container exited with a non-zero exit code 143

I am using HDP 2.5, running spark-submit as yarn cluster mode.
I have tried to generate data using dataframe cross join.
i.e
val generatedData = df1.join(df2).join(df3).join(df4)
generatedData.saveAsTable(...)....
df1 storage level is MEMORY_AND_DISK
df2,df3,df4 storage level is MEMORY_ONLY
df1 has much more records i.e 5 million while df2 to df4 has at most 100 records.
doing so my explain plain would result with better performance using BroadcastNestedLoopJoin explain plan.
for some reason it always fail. I don't know how can I debug it and where the memory explode.
Error log output:
16/12/06 19:44:08 WARN YarnAllocator: Container marked as failed: container_e33_1480922439133_0845_02_000002 on host: hdp4. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
16/12/06 19:44:08 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_e33_1480922439133_0845_02_000002 on host: hdp4. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
16/12/06 19:44:08 ERROR YarnClusterScheduler: Lost executor 1 on hdp4: Container marked as failed: container_e33_1480922439133_0845_02_000002 on host: hdp4. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
16/12/06 19:44:08 WARN TaskSetManager: Lost task 1.0 in stage 12.0 (TID 19, hdp4): ExecutorLostFailure (executor 1 exited caused by one of the running tasks) Reason: Container marked as failed: container_e33_1480922439133_0845_02_000002 on host: hdp4. Exit status: 143. Diagnostics: Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Killed by external signal
I didn't see any WARN or ERROR logs before this error.
What is the problem? where should I look for the memory consumption?
I cannot see anything on the Storage tab of SparkUI.
the log was taken from yarn resource manager UI on HDP 2.5
EDIT
looking at the container log, it seems like it's a java.lang.OutOfMemoryError: GC overhead limit exceeded
I know how to increase the memory, but I don't have any memory anymore.
How can I do a cartesian / product join with 4 Dataframes without getting this error.
I also meet this problem and try to solve it by refering some blog.
1. Run spark add conf bellow:
--conf 'spark.driver.extraJavaOptions=-XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps' \
--conf 'spark.executor.extraJavaOptions=-XX:+UseCompressedOops -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintHeapAtGC ' \
When jvm GC ,you will get follow message:
Heap after GC invocations=157 (full 98):
PSYoungGen total 940544K, used 853456K [0x0000000781800000, 0x00000007c0000000, 0x00000007c0000000)
eden space 860160K, 99% used [0x0000000781800000,0x00000007b5974118,0x00000007b6000000)
from space 80384K, 0% used [0x00000007b6000000,0x00000007b6000000,0x00000007bae80000)
to space 77824K, 0% used [0x00000007bb400000,0x00000007bb400000,0x00000007c0000000)
ParOldGen total 2048000K, used 2047964K [0x0000000704800000, 0x0000000781800000, 0x0000000781800000)
object space 2048000K, 99% used [0x0000000704800000,0x00000007817f7148,0x0000000781800000)
Metaspace used 43044K, capacity 43310K, committed 44288K, reserved 1087488K
class space used 6618K, capacity 6701K, committed 6912K, reserved 1048576K
}
Both PSYoungGen and ParOldGen are 99% ,then you will get java.lang.OutOfMemoryError: GC overhead limit exceeded
if more object was created .
Try to add more memory for your executor or your driver when more memory resources are avaliable:
--executor-memory 10000m \
--driver-memory 10000m \
For my case : memory for PSYoungGen are smaller then ParOldGen which causes many young object enter into ParOldGen memory area and finaly
ParOldGen are not avaliable.So java.lang.OutOfMemoryError: Java heap space error appear.
Adding conf for executor:
'spark.executor.extraJavaOptions=-XX:NewRatio=1 -XX:+UseCompressedOops
-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps '
-XX:NewRatio=rate
rate = ParOldGen/PSYoungGen
It dependends.You can try GC strategy like
-XX:+UseSerialGC :Serial Collector
-XX:+UseParallelGC :Parallel Collector
-XX:+UseParallelOldGC :Parallel Old collector
-XX:+UseConcMarkSweepGC :Concurrent Mark Sweep
Java Concurrent and Parallel GC
If both step 4 and step 6 are done but still get error, you should consider change you code. For example, reduce iterator times in ML model.
Log file of all containers and am are available on,
yarn logs -applicationId application_1480922439133_0845_02
If you just want AM logs,
yarn logs -am -applicationId application_1480922439133_0845_02
If you want to find containers ran for this job,
yarn logs -applicationId application_1480922439133_0845_02|grep container_e33_1480922439133_0845_02
If you want just a single container log,
yarn logs -containerId container_e33_1480922439133_0845_02_000002
And for these commands to work, log aggregation must have been set to true, or you will have to get logs from individual server directories.
Update
There is nothing you can do apart from try with swapping, but that will degrade performance alot.
The GC overhead limit means, GC has been running non-stop in quick succession but it was not able to recover much memory. Only reason for that is, either code has been poorly written and have alot of back reference(which is doubtful, as you are doing simple join), or memory capacity has reached.
REASON 1
By default the shuffle count is 200. Having too many shuffle will increase the complexity and chances of getting program crashed. Try controlling the number of shuffles in the spark session. I changed the count to 5 using the below code.
implicit val sparkSession = org.apache.spark.sql.SparkSession.builder().enableHiveSupport().getOrCreate()
sparkSession.sql("set spark.sql.shuffle.partitions=5")
Additionally if you are using dataframes and if you are not re-partitioning the dataframe, then the execution will be done in a single executor. If only 1 executor is running for some time then the yarn will make other executors to shut down. Later if more memory is required, though yarn tries to re-call the other executors sometimes the executors won't come up, hence the process might fail with memory overflow issue. To overcome this situation, try re-partitioning the dataframe before an action is called.
val df = df_temp.repartition(5)
Note that you might need to change the shuffle and partition count and according to your requirement. In my case the above combination worked.
REASON 2
It can occur due to memory is not getting cleared on time. For example, if you are running a spark command using Scala and that you are executing bunch of sql statements and exporting to csv. The data in some hive tables will be very huge and you have to manage the memory in your code.
Example, consider the below code where the lst_Sqls is a list that contains a set of sql commands
lst_Sqls.foreach(sqlCmd => spark.sql(sqlCmd).coalesce(1).write.format("com.databricks.spark.csv").option("delimiter","|").save("s3 path..."))
When you run this command sometimes you will end up seeing the same error. This is because although spark clears the memory, it does this in a lazy way, ie, your loop will be continuing but spark might be clearing the memory at some later point.
In such cases, you need to manage the memory in your code, ie, clear the memory after each command is executed. For this let us change our code little. I have commented what each line do in the below code.
lst_Sqls.foreach(sqlCmd =>
{
val df = spark.sql(sqlCmd)
// Store the result in in-memory. If in-memory is full, then it stored to HDD
df.persist(StorageLevel.MEMORY_AND_DISK)
// Export to csv from Dataframe
df.coalesce(1).write.format("com.databricks.spark.csv").save("s3 path")
// Clear the memory. Only after clearing memory, it will jump to next loop
df.unpersist(blocking = true)
})

Spark: out of memory when broadcasting objects

I tried to broadcast a not-so-large map (~ 70 MB when saved to HDFS as text file), and I got out of memory errors. I tried to increase the driver memory to 11G and executor memory to 11G, and still got the same error. The memory.fraction is set to 0.3, and there's not many data (less than 1G) cached either.
When the map is only around 2 MB, there's no problem. I wonder if there is a size limit when broadcasting objects. How can I solve this problem using the bigger map? Thank you!
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.IdentityHashMap.resize(IdentityHashMap.java:469)
at java.util.IdentityHashMap.put(IdentityHashMap.java:445)
at org.apache.spark.util.SizeEstimator$SearchState.enqueue(SizeEstimator.scala:159)
at org.apache.spark.util.SizeEstimator$.visitArray(SizeEstimator.scala:229)
at org.apache.spark.util.SizeEstimator$.visitSingleObject(SizeEstimator.scala:194)
at org.apache.spark.util.SizeEstimator$.org$apache$spark$util$SizeEstimator$$estimate(SizeEstimator.scala:186)
at org.apache.spark.util.SizeEstimator$.estimate(SizeEstimator.scala:54)
at org.apache.spark.util.collection.SizeTracker$class.takeSample(SizeTracker.scala:78)
at org.apache.spark.util.collection.SizeTracker$class.afterUpdate(SizeTracker.scala:70)
at org.apache.spark.util.collection.SizeTrackingVector.$plus$eq(SizeTrackingVector.scala:31)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:278)
at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:165)
at org.apache.spark.storage.MemoryStore.putIterator(MemoryStore.scala:143)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:801)
at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:648)
at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1006)
at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:99)
at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:85)
at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:63)
at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1327)
Edit:
Add more information according to the comments:
I use spark-submit to submit the compiled jar file in client mode. Spark 1.5.0
spark.yarn.executor.memoryOverhead 600
set("spark.kryoserializer.buffer.max", "256m")
set("spark.speculation", "true")
set("spark.storage.memoryFraction", "0.3")
set("spark.driver.memory", "15G")
set("spark.executor.memory", "11G")
I tried set("spar.sql.tungsten.enabled", "false") and it doesn't help.
The master machine has 60G memory. Around 30G is used for Spark/Yarn. I'm not sure how much heap size is for my job, but there's not much other process going on at the same time. Especially the map is only around 70MB.
Some code related to the broadcasting:
val mappingAllLocal: Map[String, Int] = mappingAll.rdd.map(r => (r.getAs[String](0), r.getAs[Int](1))).collectAsMap().toMap
// I can use the above mappingAll to HDFS, and it's around 70MB
val mappingAllBrd = sc.broadcast(mappingAllLocal) // <-- this is where the out of memory happens
Using set("spark.driver.memory", "15G") has no effect on client mode. You have to use the command line parameter --conf="spark.driver.memory=15G" when submitting the application to increase the driver's heap size.
You can try increase the JVM heap size :
-Xmx2g : max size of 2Go
-Xms2g : initial size of 2Go (default size is 256mo)

spark groupby driver throws OutOfMemory

I have a RDD[((Long,Long),Float)] about 150G (shown in web ui storage).
When I groupby this RDD, driver program throws following error
15/07/16 04:37:08 ERROR actor.ActorSystemImpl: Uncaught fatal error from thread [sparkDriver-akka.remote.default-remote-dispatcher-39] shutting down ActorSystem [sparkDriver]
java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply$mcV$sp(Serializer.scala:129)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply(Serializer.scala:129)
at akka.serialization.JavaSerializer$$anonfun$toBinary$1.apply(Serializer.scala:129)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at akka.serialization.JavaSerializer.toBinary(Serializer.scala:129)
at akka.remote.MessageSerializer$.serialize(MessageSerializer.scala:36)
at akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:845)
at akka.remote.EndpointWriter$$anonfun$serializeMessage$1.apply(Endpoint.scala:845)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at akka.remote.EndpointWriter.serializeMessage(Endpoint.scala:844)
at akka.remote.EndpointWriter.writeSend(Endpoint.scala:747)
The executors didn't even start the stage.
This RDD has 120000 partitions. Could this be the cause of the error?
The size of a at least one of the partitions is more that the memory you have allocated to the executor (you can do that with the --executor-memory flag on the command line running the spark job
After grouping by (Long, Long), at least one of your groups are big to fit in memory. Spark expects each record after grouping ((Long,long), Iterator[Float]) to fit in memory. and this is not the case for your data. see this https://spark.apache.org/docs/1.2.0/tuning.html look for Memory Usage of Reduce Tasks
I suggest to have a work around by increasing your data parallelism. Add a mapping step before the group by and break down your data.
ds.Map(x=>(x._1._1,x._1._2,x._1._1%2),float))
Then group by the new key (you might do something more sophisticated than this x._1._1%2).

Apache Spark: Master removed our application: Failed when using saveAsTextFile on large RDD

I'm loading a file of 20 GB in Spark Standalone mode on a machine with 4GB RAM and 2 cores, do some processing and then try to save the result (for testing purposes) to a text file using saveAsTextFile.
If I manually extract only a few thousand lines from the original input file and run the code on that, it works like a charm, resulting in the expected part-xxxxx files.
However, if I provide the whole 20GB file as input, it will start off fine, then hang somewhere along the process and when let run overnight it will have failed in the morning with the following message:
Py4JJavaError: An error occurred while calling o219.saveAsTextFile.
: org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1204)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1193)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1192)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1192)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:693)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:693)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1393)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1354)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
Does anyone have an idea why this might be?
There might be a few issues here, but the most commons would be :
Too many file opens killed the process
The serializer you're using (by default the Java serializer) might create a GC Overhead - try switching to Kryo Serializer (c.f. Spark's documentation : Tuning page)
finally you're running out of disk.
Now without any idea of what your computation is or the logs from the actual server, it's difficult to tell what happened.

Resources