I've read that the maximum size of the Kryo buffer in Spark is 2048 MB, and that it should be larger than the largest object my program will serialize (source: https://spark.apache.org/docs/latest/tuning.html). But what should I do if my largest object is larger than 2 GB? Do I have to use the Java serializer in that case? Or does the Java serializer have the same 2 GB limitation?
The main reason Kryo cannot handle anything larger than 2 GB is that it builds its buffer on top of a Java byte array, and a Java array can hold at most 2^31 - 1 elements, i.e. roughly 2 GB. Spark enforces this limit up front so that the problem surfaces as a clear configuration check instead of a harder-to-debug failure at execution time.
For more details please take a look here.
Related
I encountered a Kryo buffer overflow exception, but I really don't understand what data could require more than the current buffer size. I already have spark.kryoserializer.buffer.max set to 256 MB, and even a toString applied to the dataset items, which should be much bigger than what Kryo requires, takes less than that (per item).
I know I can increase this parameter, and I will for now, but I don't think it is good practice to simply increase resources when hitting a limit without investigating what is happening (just as I wouldn't react to an OOM by simply increasing the RAM allocation without checking what is using the extra RAM).
=> So, is there a way to investigate what is put into the buffer during the Spark DAG execution?
I couldn't find anything in the Spark UI.
Note that "How Kryo serializer allocates buffer in Spark" is not the same question. It asks how it works (and actually no one answers it), whereas I ask how to investigate. In that question, all answers discuss the parameters to use; I know which parameters to use, and I do manage to avoid the exception by increasing them. However, I already consume too much RAM and need to optimize it, Kryo buffer included.
All data that is sent over the network, written to disk, or persisted in memory is serialized, along with the Spark DAG itself. Hence, the Kryo serialization buffer must be larger than any object you attempt to serialize, and it must be less than 2048m.
https://spark.apache.org/docs/latest/tuning.html#data-serialization
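For context, here is a minimal sketch of how those settings are usually applied through SparkConf; the specific values are illustrative assumptions, not recommendations:

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer", "64k")        // initial size of Kryo's serialization buffer
  .set("spark.kryoserializer.buffer.max", "512m")   // must cover your largest object; the hard cap is 2047m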
I am doing some benchmarking in a cluster using Spark. Among other things, I want to get a good approximation of the average size reduction achieved by serialization and compression. I am running in client deploy-mode with the local master, and I tried the shells of both Spark 1.6 and 2.2.
I want to do that by calculating the in-memory size and then the size on disk, so the ratio should be my answer. I obviously have no problem getting the on-disk size, but I am really struggling with the in-memory one.
Since my RDD is made of doubles, and they occupy 8 bytes each in memory, I tried counting the number of elements in the RDD and multiplying by 8, but that leaves out a lot of things.
The second approach was using SizeEstimator (https://spark.apache.org/docs/2.3.0/api/scala/index.html#org.apache.spark.util.SizeEstimator$), but this gives me crazy results! In Spark 1.6 it is either 30, 130 or 230 randomly (47 MB on disk); in Spark 2.2 it starts at 30 and every time I execute it, it increases by 0 or by 1. I know it says it is not super accurate, but I can't even find a bit of consistency! I even tried setting the persistence level to memory only:
rdd.persist(StorageLevel.MEMORY_ONLY)
but still, nothing changed.
Is there any other way I can get the in-memory size of the RDD? Or should I try another approach? I am writing to disk with rdd.saveAsTextFile, and generating the RDD via RandomRDDs.uniformRDD.
EDIT
sample code:
write
val rdd = RandomRDDs.uniformRDD(sc, nBlocks, nThreads)   // nBlocks uniform doubles spread over nThreads partitions
rdd.persist(StorageLevel.MEMORY_ONLY_SER)                // keep the cached copy in serialized form
println("RDD count: " + rdd.count)                       // action that forces materialization
rdd.saveAsObjectFile("file:///path/to/folder")
read
val rdd = sc.wholeTextFiles(name, nThreads)
rdd.count() //action so I'm sure the file is actually read
webUI
Try caching the RDD as you mentioned and check the Storage tab of the Spark UI.
By default an RDD is stored in memory in deserialized form. If you want to store it serialized, explicitly use persist with the MEMORY_ONLY_SER option; the memory consumption will be lower. On disk, data is always stored in serialized form. Check the Spark UI to compare the two.
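As a rough sketch following the question's own setup (the sizes and partition counts are arbitrary), one way to compare is to persist the RDD in serialized form and read the cached size back from the driver, rather than relying on SizeEstimator, which only walks the driver-side object graph of the RDD handle, not the cached partitions on the executors:

import org.apache.spark.mllib.random.RandomRDDs
import org.apache.spark.storage.StorageLevel
import org.apache.spark.util.SizeEstimator

val rdd = RandomRDDs.uniformRDD(sc, 10000000L, 4)   // illustrative size and partition count
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
rdd.count()                                          // action, so the RDD actually gets cached

// SizeEstimator measures the local RDD object on the driver, not the distributed data
println("driver-side estimate: " + SizeEstimator.estimate(rdd))

// The cached, serialized size is what the Storage tab shows; it is also available as:
sc.getRDDStorageInfo.foreach(info => println(info.name + " uses " + info.memSize + " bytes in memory"))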
While working with Datasets in Spark, we need to specify Encoders for serializing and deserializing objects. We have the option of using Encoders.bean(Class<T>) or Encoders.kryo(Class<T>).
How are these different and what are the performance implications of using one vs another?
It is generally advisable to prefer Kryo serialization over Java serialization, for several reasons. Some of them are below.
Kryo Serialization is faster than Java Serialization.
Kryo serialization has a smaller memory footprint, especially in cases where you need to cache() or persist() data. This is very helpful during phases like shuffling.
Though Kryo is supported for caching and shuffling, it is not supported when persisting to disk.
The saveAsObjectFile method on RDD and the objectFile method on SparkContext support only Java serialization.
The more custom data types you handle in your datasets, the more complex they are to serialize. Therefore, it is usually best practice to use a uniform serialization approach like Kryo.
Java's serialization framework is notoriously inefficient, consuming too much CPU, RAM and space to be suitable as a large-scale serialization format.
Java serialization needs to store the fully qualified class name when serializing each object. Kryo lets you avoid this by registering the classes via sparkConf.registerKryoClasses(Array(classOf[A], classOf[B], ...)) or sparkConf.set("spark.kryo.registrator", "MyKryoRegistrator"), which saves a lot of space and avoids unnecessary metadata; a minimal sketch follows below.
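For illustration, a minimal sketch of such registration (the classes A and B here are hypothetical placeholders for your own types):

import org.apache.spark.SparkConf

class A(val x: Int) extends Serializable       // placeholder types; register your own instead
class B(val s: String) extends Serializable

val sparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[A], classOf[B]))   // Kryo then writes a small id instead of the full class name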
The difference between bean() and javaSerialization() is that javaSerialization serializes objects of type T using generic Java serialization, mapping T into a single byte-array (binary) field, whereas bean creates an encoder for a Java Bean of type T, mapping its bean properties to columns. The practical difference is how they represent the objects as bytes.
Quoting from the documentation
JavaSerialization is extremely inefficient and should only be used as the last resort.
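To make the contrast concrete, here is a small sketch (the Person bean is a made-up example) showing how the two encoders shape the resulting Dataset: bean() exposes the fields as columns, while kryo() collapses the object into a single binary column.

import org.apache.spark.sql.{Encoders, SparkSession}

// made-up Java-bean-style class: no-arg constructor plus getters/setters
class Person extends Serializable {
  @scala.beans.BeanProperty var name: String = _
  @scala.beans.BeanProperty var age: Int = _
}

val spark = SparkSession.builder().master("local[*]").appName("encoders-sketch").getOrCreate()
val p = new Person(); p.setName("alice"); p.setAge(30)

val beanDs = spark.createDataset(Seq(p))(Encoders.bean(classOf[Person]))
beanDs.printSchema()   // struct schema with name (string) and age (int) columns, queryable by Catalyst

val kryoDs = spark.createDataset(Seq(p))(Encoders.kryo(classOf[Person]))
kryoDs.printSchema()   // a single binary column named "value", opaque to Catalyst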
I'm trying to measure the maximum size of a variable I can broadcast using Spark broadcast.
I didn't find any explanation regarding this issue.
Did someone measure it? Does Spark have a configuration for the broadcast size?
The limit for broadcasting has now been increased to 8 GB. You can find the details here.
It's currently ~2 GB. Anything you broadcast is converted into a Java byte array during serialization, and since Java arrays have a maximum size of Integer.MAX_VALUE elements, you get this limit. There may currently be some effort to increase this limit: SPARK-6235.
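If you want to probe it empirically, a rough sketch (the payload size here is deliberately small and only illustrative) is to broadcast increasingly large arrays and watch where serialization fails:

val payload = Array.fill(1 << 20)(1.0)   // roughly 8 MB of doubles; grow this to probe the limit
val bc = sc.broadcast(payload)           // serialized into byte arrays behind the scenes

// executors read the broadcast through .value
val touched = sc.parallelize(1 to 4).map(_ => bc.value.length).reduce(_ + _)
println(touched)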
No matter what I try, I get this OOME using Spark 1.3.1 with the Kryo serializer (I don't have any issues if I use the default Java one):
15/06/25 20:16:37 WARN TaskSetManager: Lost task 47.0 in stage 1.0 (TID 59, ip-172-31-28-175.ec2.internal): java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:2271)
at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
at com.ning.compress.lzf.ChunkEncoder.encodeAndWriteChunk(ChunkEncoder.java:264)
at com.ning.compress.lzf.LZFOutputStream.writeCompressedBlock(LZFOutputStream.java:266)
at com.ning.compress.lzf.LZFOutputStream.write(LZFOutputStream.java:124)
at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:31)
at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:27)
at scala.collection.immutable.List.foreach(List.scala:318)
at com.twitter.chill.TraversableSerializer.write(Traversable.scala:27)
at com.twitter.chill.TraversableSerializer.write(Traversable.scala:21)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:29)
at com.twitter.chill.TraversableSerializer$$anonfun$write$1.apply(Traversable.scala:27)
at scala.collection.immutable.List.foreach(List.scala:318)
at com.twitter.chill.TraversableSerializer.write(Traversable.scala:27)
at com.twitter.chill.TraversableSerializer.write(Traversable.scala:21)
at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:124)
at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:110)
I have 40GB of RAM available on both the driver and the executors. I tried playing with Kryo buffer size / max size (increased from the default all the way to ridiculous values), but to no avail.
Am I doing something wrong? Is that a known issue? Is there a solution?
EDIT: tried with 1.4.0, same issue.
p.s. this only happens on the cluster. Locally I'm getting it to work successfully.
In summary
Locally it works with or without the Kryo serializer (with a smaller data set, though), on both 1.3.1 and 1.4.0.
On the cluster it works with the default Java serializer, but fails on 1.3.1 and 1.4.0 using the Kryo serializer with the same error message.
This error is a bit of a misnomer. It does not mean your machine does not have enough memory; it means that Kryo was trying to create a byte array with more than 2^31 - 1 elements.
This is probably happening because your Spark tasks are quite large. If you have written your Spark job in Scala, you may be accidentally pulling extra objects into scope and bloating your task size. What do I mean by that?
When you create a lambda function to be used by Spark, Spark looks at all the objects referred to by that lambda function, serializes them, and bundles them with the task definition. So if you are accessing large data structures, outer classes, or global variables from your lambda functions, you can quickly bloat your task definitions to the point where they are larger than 2^31 - 1 bytes.
Kryo then tries to allocate a byte array larger than 2^31 - 1 bytes, and the OutOfMemoryError you've seen is thrown.
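As an illustration of the closure-capture problem and one common way around it (the lookup table below is a hypothetical stand-in for whatever large structure your job references), compare:

// hypothetical large driver-side structure
val hugeLookup: Map[Int, String] = (1 to 1000000).map(i => i -> i.toString).toMap

// Problematic: hugeLookup is captured by the closure and serialized into every task definition
val bloated = sc.parallelize(1 to 100).map(i => hugeLookup.getOrElse(i, "missing"))

// Better: ship it once per executor as a broadcast variable; the task only carries a small handle
val lookupBc = sc.broadcast(hugeLookup)
val lean = sc.parallelize(1 to 100).map(i => lookupBc.value.getOrElse(i, "missing"))

Broadcasting keeps each task definition small, which avoids the oversized byte-array allocation Kryo was attempting when the whole structure was bundled with every task.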