Kryo out of memory - apache-spark

No matter what I try, I get this OOME using Spark 1.3.1 with the Kryo serializer (I don't have any issues if I use the default Java one):
15/06/25 20:16:37 WARN TaskSetManager: Lost task 47.0 in stage 1.0 (TID 59, ip-172-31-28-175.ec2.internal): java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at com.ning.compress.lzf.ChunkEncoder.encodeAndWriteChunk(ChunkEncoder.java:264)
at com.ning.compress.lzf.LZFOutputStream.writeCompressedBlock(LZFOutputStream.java:266)
at com.ning.compress.lzf.LZFOutputStream.write(LZFOutputStream.java:124)
at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:31)
at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:27)
at scala.collection.immutable.List.foreach(List.scala:318)
at com.twitter.chill.TraversableSerializer.write(Traversable.scala:27)
at com.twitter.chill.TraversableSerializer.write(Traversable.scala:21)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:29)
at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:27)
at scala.collection.immutable.List.foreach(List.scala:318)
at com.twitter.chill.TraversableSerializer.write(Traversable.scala:27)
at com.twitter.chill.TraversableSerializer.write(Traversable.scala:21)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:124)
at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:110)
I have 40GB of RAM available on both the driver and the executors. I tried playing with Kryo buffer size / max size (increased from the default all the way to ridiculous values), but to no avail.
Am I doing something wrong? Is this a known issue? Is there a solution?
EDIT: tried with 1.4.0, same issue.
P.S. This only happens on the cluster; locally I'm getting it to work successfully.
In summary:
Locally it works with or without the Kryo serializer (with a smaller data set, though), on both 1.3.1 and 1.4.0.
On the cluster it works with the default Java serializer, but fails on both 1.3.1 and 1.4.0 with the Kryo serializer, always with the same error message.

This error is a bit of a misnomer. It does not mean your machine has run out of memory; it means that Kryo was trying to create a byte array with more than (2^31 - 1) elements.
This is probably happening because your Spark tasks are quite large. If you have written your Spark job in Scala, you may be accidentally pulling extra objects into scope and bloating your job size. What do I mean by that?
When you create a lambda function to be used by Spark, Spark looks at all the objects referred to by that lambda function (its closure).
Spark then serializes them and bundles them with the task definition. So if you are accessing large data structures, outer classes, or global variables from your lambda functions, you can quickly bloat your task definitions to the point where they exceed (2^31 - 1) bytes.
Kryo then tries to allocate a byte array larger than (2^31 - 1) bytes, and the OutOfMemoryError you've seen is thrown.
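Here is a minimal Scala sketch of that pitfall (the class, field names, and sizes are all hypothetical): referencing a field of an enclosing object drags the entire object, including an unrelated large structure, into the task closure; copying the needed value into a local val avoids it.

    import org.apache.spark.rdd.RDD

    // Hypothetical job class, for illustration only.
    class ScoringJob(val threshold: Double) extends Serializable {

      // Large and unrelated to the filter, but serialized with every task below.
      val hugeLookup: Map[Long, Array[Double]] =
        (0L until 100000L).map(i => i -> Array.fill(64)(scala.util.Random.nextDouble())).toMap

      // BAD: `threshold` is really `this.threshold`, so the whole object
      // (including hugeLookup) ends up inside each task's serialized closure.
      def scoreBloated(rdd: RDD[Double]): RDD[Double] =
        rdd.filter(x => x > threshold)

      // BETTER: copy the value into a local val first; only a Double is
      // captured and the serialized task stays tiny.
      def scoreLean(rdd: RDD[Double]): RDD[Double] = {
        val localThreshold = threshold
        rdd.filter(x => x > localThreshold)
      }
    }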

Related

Is kryo unable to serialize objects larger than 2 gigabytes?

I've read that the max size of the Kryo buffer in Spark is 2048 MB, and that it should be larger than the largest object my program will serialize (source: https://spark.apache.org/docs/latest/tuning.html). But what should I do if my largest object is larger than 2 GB? Do I have to use the Java serializer in that case? Or does the Java serializer also have this 2 GB limitation?
The main reason Kryo cannot handle anything larger than 2 GB is that it builds its buffer on Java primitives, specifically a Java byte array, and a Java array cannot hold more than 2^31 - 1 elements (roughly 2 GB). The check Spark performs on the configured buffer size exists so that the limit is reported up front, instead of the error surfacing in the middle of execution, where it would be a much harder problem for you to debug and handle.
For more details please take a look here.
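For context, a minimal configuration sketch, assuming the property names used in recent Spark versions (values are illustrative): Spark rejects spark.kryoserializer.buffer.max settings of 2048m or more precisely because the buffer is a single Java byte array, which is the check described above.

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    val conf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Anything >= 2048m is rejected by Spark, because the Kryo output
      // buffer is backed by a single Java byte array.
      .set("spark.kryoserializer.buffer.max", "2047m")
      .set("spark.kryoserializer.buffer", "64m") // initial buffer size, illustrative

    val spark = SparkSession.builder()
      .appName("kryo-buffer-limit")
      .config(conf)
      .getOrCreate()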

PySpark: Job aborts due to stage failure, but resetting max size isn't recognized

I'm attempting to display a DataFrame in PySpark after reading the files in using a function/subroutine. Reading the files in works fine, but it's the display that's failing. Actually, due to lazy evaluation, that may not be where the real problem is.
I get this error
SparkException: Job aborted due to stage failure: Total size of serialized results of 29381 tasks (4.0 GB) is bigger than spark.driver.maxResultSize (4.0 GB)
so I did what was suggested at https://forums.databricks.com/questions/66/how-do-i-work-around-this-error-when-using-rddcoll.html:
sqlContext.setConf("spark.driver.maxResultSize", "8g")
sqlContext.getConf("spark.driver.maxResultSize")
However, the bizarre part is that I get the same error back when I re-run the display(df) command.
It's like Spark is just ignoring my commands.
I've tried increasing the number of workers and making both the worker type and driver type larger, but neither of these fixed anything.
How can I get this to work? Or is this a bug in Databricks/Spark?
It all depends on your code and how the data is partitioned relative to the cluster size. Increasing spark.driver.maxResultSize is the first option for getting past the error, but eventually you should look for a permanent solution by modifying the code or design. Above all, avoid collecting large amounts of data to the driver node.
OR
You need to change this parameter in the cluster configuration. Go into the cluster settings, and under Advanced select Spark and paste spark.driver.maxResultSize 0 (for unlimited) or whatever value suits you. Using 0 is not recommended; you should instead optimize the job by repartitioning.
For more details, refer to "Spark Configuration - Application Properties".
Hope this helps. Do let us know if you have any further queries.
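To make that concrete, a hedged Scala sketch (the 8g value is illustrative; the same property can be passed from PySpark or spark-submit): spark.driver.maxResultSize is read from the driver's SparkConf, so it generally has to be in place before the SparkContext is created, which is why setting it with setConf on a session that is already running appears to be ignored.

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Put the setting in place before the SparkContext exists, e.g. here,
    // via `spark-submit --conf spark.driver.maxResultSize=8g`, or in the
    // cluster's Spark config on Databricks.
    val conf = new SparkConf()
      .setAppName("large-result-job")
      .set("spark.driver.maxResultSize", "8g") // illustrative value

    val spark = SparkSession.builder().config(conf).getOrCreate()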

Memory difference between pyspark and spark?

I have been trying to get a PySpark job to work which creates an RDD from a bunch of binary files and then uses a flatMap operation to process the binary data into rows. This has led to a bunch of out-of-memory errors, and after playing around with memory settings for a while I decided to get the simplest possible thing working, which is just counting the number of files in the RDD.
This also fails with an OOM error. So I opened up both spark-shell and PySpark and ran the commands in the REPL/shell with default settings; the only additional parameter was --master yarn. The spark-shell version works, while the PySpark version shows the same OOM error.
Is there that much overhead to running PySpark? Or is this a problem with binaryFiles being new? I am using Spark version 2.2.0.2.6.4.0-91.
The difference:
Scala will load records as PortableDataStream - this means the process is lazy, and unless you call toArray on the values, no data will be loaded at all.
Python will call the Java backend, but loads the data as a byte array. This part is eager-ish and can therefore fail on either side.
Additionally, PySpark will use at least twice as much memory - one copy on the Java side and one on the Python side.
Finally, binaryFiles (like wholeTextFiles) is very inefficient and doesn't perform well if individual input files are large. In cases like this it is better to implement a format-specific Hadoop InputFormat.
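A minimal Scala sketch of the lazy behaviour described above (the path is a placeholder): counting the records leaves the PortableDataStream values untouched, so no file contents are read, whereas calling toArray on a value pulls an entire file into memory, roughly what PySpark does eagerly for every record.

    import org.apache.spark.input.PortableDataStream
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("binary-files-lazy").getOrCreate()
    val sc = spark.sparkContext

    // RDD[(path, PortableDataStream)] - the values are handles, not the data.
    val files = sc.binaryFiles("hdfs:///data/binary") // placeholder path

    // Lazy: this only lists and tracks the files; no file contents are read.
    println(s"file count = ${files.count()}")

    // Eager: toArray reads the whole file for each record into a byte array,
    // which is roughly what PySpark does for every record up front.
    val sizes = files.mapValues((stream: PortableDataStream) => stream.toArray.length)
    println(sizes.take(5).mkString(", "))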
Since you are reading multiple binary files with binaryFiles(), and since, starting with Spark 2.1, the minPartitions argument of binaryFiles() is ignored:
1. Try to repartition the input after reading, along the following lines (a fuller sketch follows after point 2):
rdd = sc.binaryFiles(<path to the binary files>, minPartitions=<n>).repartition(<numPartitions>)
2. You may try reducing the partition size to 64 MB or less, depending on the size of your data, using the configs below:
spark.files.maxPartitionBytes, default 128 MB
spark.files.openCostInBytes, default 4 MB
spark.default.parallelism
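A short Scala sketch of point 1, assuming illustrative paths and partition counts; a reasonable count is roughly the total input size divided by the 64 MB target mentioned above.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("repartition-binary-files").getOrCreate()
    val sc = spark.sparkContext

    // minPartitions is ignored by binaryFiles() on Spark 2.1+, so repartition
    // explicitly after reading. 200 is an illustrative count, roughly a 64 MB
    // per-partition target for ~12.5 GB of input.
    val rdd = sc.binaryFiles("hdfs:///data/binary").repartition(200)

    println(s"partitions = ${rdd.getNumPartitions}")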

How spark creates stages and divides them into small tasks for spark data stream?

When I create a data stream in Spark for incoming data from Kafka, I get the following warning:
WARN TaskSetManager: Stage 1 contains a task of very large size (1057 KB). The maximum recommended task size is 100 KB.
So I think I need to increase the task size. Can we resolve this issue by increasing the number of partitions for an RDD? And how is a stage divided into small tasks, and how can we configure the size of these tasks?
Thanks in advance.
So can we resolve this issue by increasing the number of partitions for an RDD?
Not at all. Task size is the amount of data that is sent to the executor. This includes the function definition and the serialized closure. Modifying splits won't help you here.
In general this warning is not critical and I wouldn't worry too much, but it is a hint that you should take another look at your code:
Do you reference large objects in actions / transformations? If so, consider using broadcast variables (see the sketch after this list).
Are you sure you are sending only the things you expect to send, and not the enclosing scope (with large objects)? If the problem is here, work on the structure of your code.
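As a hedged Scala sketch of the broadcast-variable suggestion (the lookup table and its size are hypothetical): broadcasting ships the large object to each executor once, instead of serializing it into the closure of every task.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("broadcast-example").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical large lookup table the transformation needs.
    val lookup: Map[Int, String] = (1 to 500000).map(i => i -> s"label-$i").toMap

    // BAD: referencing `lookup` directly puts it into the task's serialized
    // closure, which is exactly what the "large task size" warning measures.
    // val labeled = sc.parallelize(1 to 1000000).map(i => lookup.getOrElse(i, "unknown"))

    // BETTER: broadcast it once per executor; tasks carry only a small handle.
    val lookupBc = sc.broadcast(lookup)
    val labeled = sc.parallelize(1 to 1000000)
      .map(i => lookupBc.value.getOrElse(i, "unknown"))

    println(labeled.take(3).mkString(", "))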

How to solve "job aborted due to stage failure" from "spark.akka.framesize"?

I have a Spark program which does a bunch of column operations and then calls .collect() to pull the results into memory.
I get this error when running the code:
An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Serialized task 302987:27 was 139041896 bytes, which exceeds max allowed: spark.akka.frameSize (134217728 bytes) - reserved (204800 bytes). Consider increasing spark.akka.frameSize or using broadcast variables for large values.
A fuller stack trace can be seen here: https://pastebin.com/tuP2cPPe
Now I'm wondering what I need to change in my code and/or configuration to solve this. I have a few ideas:
Increase the spark.akka.frameSize, as suggested. I am a bit reluctant to do this because I do not know this parameter very well, and for other jobs I might prefer the default. Is there a way to specify this within an application? And can it be changed dynamically on the fly within the code similar to number of partitions?
Decrease the number of partitions before calling collect() on the table. I have a feeling that calling collect() when there are too many partitions is causing this to fail. It is putting too much stress on the driver when pulling all of these pieces into memory.
I do not understand the suggestion Consider...using broadcast variables for large values. How will this help? I still need to pull the results back to the driver whether I have a copy of the data on each executor or not.
Are there other ideas that I am missing? Thx.
I think that error is a little misleading. The error occurs because the result you are trying to download back to your driver is larger than what Akka (the underlying networking library used by Spark) can fit in a message. Broadcast variables are used to efficiently SEND data to the worker nodes, which is the opposite direction from what you are trying to do.
Usually you don't want to do a collect when it is going to pull back a lot of data because you will lose any parallelism for the job by trying to download that result to one node. If you have too much data this could either take forever or potentially cause your job to fail. You can try increasing the Akka frame size until it is large enough that your job doesn't fail, but that will probably just break again in the future when your data grows.
A better solution would be to save the result to some distributed filesystem (HDFS, S3) using the RDD write APIs. Then you could either perform more distributed operations on it in follow-on jobs, using Spark to read it back in, or download the result directly from the distributed file system and do whatever you want with it.
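A hedged Scala sketch of that approach (paths and output format are illustrative; the DataFrame writer API looks the same from PySpark): write the result out in parallel instead of collecting it to the driver, then read it back or download the files later.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("save-instead-of-collect").getOrCreate()

    val result = spark.range(0, 1000000).toDF("value")

    // Instead of result.collect(), write the result out in parallel.
    result.write.mode("overwrite").parquet("s3a://my-bucket/results/run-001") // placeholder path

    // Later: read it back for follow-on distributed work ...
    val reloaded = spark.read.parquet("s3a://my-bucket/results/run-001")
    println(reloaded.count())

    // ... or pull the files straight from S3/HDFS outside of Spark.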
