While working with Datasets in Spark, we need to specify encoders for serializing and deserializing objects. We have the option of using Encoders.bean(Class<T>) or Encoders.kryo(Class<T>).
How are these different and what are the performance implications of using one vs another?
It is always advisable to use Kryo serialization over Java serialization, for many reasons. Some of them are below.
Kryo Serialization is faster than Java Serialization.
Kryo serialization has a smaller memory footprint, especially in cases where you need to cache() and persist(). This is very helpful during phases like shuffling.
Though Kryo is supported for caching and shuffling, it is not supported for persistence to disk.
saveAsObjectFile on RDD and the objectFile method on SparkContext support only Java serialization.
The more custom data types you handle in your datasets, the more complex they become to serialize. Therefore, it is usually best practice to use a uniform serialization mechanism like Kryo.
Java's serialization framework is notoriously inefficient, consuming too much CPU, memory, and output size to be suitable as a large-scale serialization format.
Java serialization needs to store the fully qualified class name while serializing each object. Kryo lets you avoid this by registering your classes up front, either with sparkConf.registerKryoClasses(Array(classOf[A], classOf[B], ...)) or with sparkConf.set("spark.kryo.registrator", "MyKryoRegistrator"), which saves a lot of space and avoids unnecessary metadata.
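A minimal sketch of that registration pattern, assuming two hypothetical application classes A and B:

```scala
import org.apache.spark.SparkConf

// Hypothetical application classes standing in for A and B above.
case class A(id: Long, name: String)
case class B(values: Array[Double])

val conf = new SparkConf()
  .setAppName("kryo-registration-sketch")
  // Switch Spark's serializer from Java serialization to Kryo.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Registering classes lets Kryo write a small numeric ID instead of
  // the fully qualified class name with every serialized object.
  .registerKryoClasses(Array(classOf[A], classOf[B]))
```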
The difference between bean() and javaSerialization() is this: javaSerialization() serializes objects of type T using generic Java serialization, and the encoder maps T into a single byte array (binary) field. bean(), on the other hand, creates an encoder for a Java Bean of type T and maps each bean property to its own typed column, so Spark keeps a structured view of the data rather than an opaque blob.
Quoting from the documentation
JavaSerialization is extremely inefficient and should only be used as the last resort.
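For comparison, here is a minimal sketch of bean() versus kryo(); the Person class is a hypothetical Java-bean-style type, and the schema comments show the expected shape of the output:

```scala
import org.apache.spark.sql.{Encoders, SparkSession}
import scala.beans.BeanProperty

// Hypothetical bean used only for illustration (no-arg constructor, getters/setters).
class Person extends Serializable {
  @BeanProperty var name: String = _
  @BeanProperty var age: Integer = _
}

val spark = SparkSession.builder().appName("encoders-sketch").master("local[*]").getOrCreate()

// bean(): every bean property becomes a typed column Catalyst can reason about.
spark.createDataset(Seq(new Person, new Person))(Encoders.bean(classOf[Person])).printSchema()
// root
//  |-- age: integer (nullable = true)
//  |-- name: string (nullable = true)

// kryo(): the whole object is serialized into a single opaque binary column.
spark.createDataset(Seq(new Person, new Person))(Encoders.kryo(classOf[Person])).printSchema()
// root
//  |-- value: binary (nullable = true)
```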
Related
I've read that the max size of the Kryo buffer in Spark can be 2048 MB, and that it should be larger than the largest object my program will serialize (source: https://spark.apache.org/docs/latest/tuning.html). But what should I do if my largest object is larger than 2 GB? Do I have to use the Java serializer in that case? Or does the Java serializer also have this 2 GB limitation?
The main reason Kryo cannot handle objects larger than 2 GB is that it builds its buffer on Java primitives, using a Java byte array, and Java byte arrays are limited to 2 GB. That is where Kryo's limitation comes from. The check Spark performs exists so the error surfaces up front rather than during execution, which would create an even larger issue for you to debug.
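As a related sketch, the cap Spark enforces is the spark.kryoserializer.buffer.max setting, which can be raised only up to that 2 GB byte-array limit:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Kryo writes into a single Java byte array, so this cap must stay below 2048m;
  // objects bigger than that need to be restructured (e.g. split into smaller records).
  .set("spark.kryoserializer.buffer.max", "1024m")
```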
For more details please take a look here.
Is there any industry guideline on whether to write a Spark project with RDDs or Datasets?
So far what's obvious to me:
RDD, more type safety, less optimization (in the sense of Spark SQL)
Dataset, less type safety, more optimization
Which one is recommended in production code? There doesn't seem to be a topic on this on Stack Overflow so far, even though Spark has been prevalent for the past few years.
I can already foresee that the majority of the community sides with Dataset :), so let me first quote a dissenting view on it from this answer (and please do share opinions against it):
Personally, I find statically typed Dataset to be the least useful:
They don't provide the same range of optimizations as Dataset[Row] (although they share the storage format and some execution plan optimizations, they don't fully benefit from code generation or off-heap storage), nor access to all the analytical capabilities of the DataFrame.
They are not as flexible as RDDs, with only a small subset of types supported natively.
"Type safety" with Encoders is disputable when a Dataset is converted using the as method. Because the data shape is not encoded in the signature, the compiler can only verify the existence of an Encoder.
Here is an excerpt from "Spark: The Definitive Guide" to answer this:
When to Use the Low-Level APIs?
You should generally use the lower-level APIs in three situations:
You need some functionality that you cannot find in the higher-level APIs; for example, if you need very tight control over physical data placement across the cluster.
You need to maintain some legacy codebase written using RDDs.
You need to do some custom shared variable manipulation.
https://www.oreilly.com/library/view/spark-the-definitive/9781491912201/ch12.html
In other words: if you don't come across the situations above, you are in general better off using the higher-level APIs (Datasets/DataFrames).
RDD Limitations:
i. No optimization engine for input
There is no provision in RDD for automatic optimization. It cannot make use of Spark's advanced optimizers such as the Catalyst optimizer and the Tungsten execution engine; each RDD has to be optimized manually.
This limitation is overcome with Dataset and DataFrame: both make use of Catalyst to generate optimized logical and physical query plans. The same optimizer serves the R, Java, Scala, and Python DataFrame/Dataset APIs, and it provides both space and speed efficiency.
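A small sketch of what Catalyst adds (the data and app name are illustrative): explain(true) prints the parsed, analyzed, optimized, and physical plans that a hand-written RDD pipeline never gets.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("catalyst-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "label")

// Catalyst rewrites this query into optimized logical and physical plans.
df.filter($"id" > 1).select($"label").explain(true)
```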
ii. Runtime type safety
RDD provides no static schema typing or run-time type safety for structured data; errors in how the data is accessed only show up when the job runs.
Dataset provides compile-time type safety for building complex data workflows: if you try to add an element of the wrong type, or refer to a field that does not exist, you get a compile-time error. This helps detect errors early and makes your code safer.
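A minimal sketch of that compile-time safety, using a hypothetical Click case class:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record type used only for illustration.
case class Click(userId: Long, url: String)

val spark = SparkSession.builder().appName("type-safety-sketch").master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(Click(1L, "/home"), Click(2L, "/about")).toDS()

// Field access is checked at compile time: a typo such as ds.map(_.ur)
// is rejected by the compiler instead of failing when the job runs.
val urls = ds.map(_.url)

// The untyped equivalent only fails at runtime with an AnalysisException:
// ds.toDF().select("ur")
urls.show()
```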
iii. Degrades when there is not enough memory
An RDD degrades when there is not enough memory to hold it in memory or on disk. When memory runs short, partitions that overflow from RAM have to be spilled to disk (or recomputed), which hurts performance; increasing the available RAM and disk can mitigate the issue.
iv. Performance limitation & Overhead of serialization & garbage collection
Since RDDs are in-memory JVM objects, they carry the overhead of garbage collection and Java serialization, which becomes expensive as the data grows.
Because the cost of garbage collection is proportional to the number of Java objects, using data structures with fewer objects lowers that cost; alternatively, we can persist objects in serialized form.
v. Handling structured data
RDD does not provide a schema view of the data; it has no provision for handling structured data.
Dataset and DataFrame do provide a schema view: each is a distributed collection of data organized into named columns.
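A quick sketch of the difference (the sample data is made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("schema-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// An RDD is just a distributed collection of objects; it carries no schema.
val rdd = spark.sparkContext.parallelize(Seq((1, "alice"), (2, "bob")))

// The same data as a DataFrame gains named, typed columns.
val df = rdd.toDF("id", "name")
df.printSchema()
// root
//  |-- id: integer (nullable = false)
//  |-- name: string (nullable = true)
```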
These limitations of RDDs in Apache Spark are the reason DataFrame and Dataset were introduced.
When to use Spark DataFrame/Dataset API and when to use plain RDD?
https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
https://dzone.com/articles/apache-spark-3-reasons-why-you-should-not-use-rdds
https://data-flair.training/blogs/apache-spark-rdd-limitations/
Apache Spark computes closures of functions applied to RDDs to send them to executor nodes.
This serialization has a cost, so I would like to ensure that the closures Spark generates are as small as they can be. For instance, it is possible that functions needlessly refer to a large serializable object which would get serialized in the closure, without actually being required for the computation.
Are there any tools to inspect the contents of the closures sent to executors? Or any other technique to optimize them?
I'm not sure of a tool to inspect closures, but one technique to optimize serialization costs is to use broadcast variables (https://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables), which serialize and send one copy of the object to each executor. This is useful for static, read-only objects (e.g., a lookup table/dictionary) and can save on serialization costs. For example, if we have 100 partitions and 10 executor nodes (10 partitions per executor), rather than serializing and sending the object with each task (100x), it will only be serialized and sent to each executor (10x); once the object reaches an executor for one partition, the other partitions will refer to the in-memory copy.
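A minimal sketch of the broadcast approach (the lookup map and data are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("broadcast-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

// Hypothetical static lookup table we don't want captured in every task closure.
val countryNames = Map("DE" -> "Germany", "FR" -> "France", "IT" -> "Italy")

// Broadcast once: each executor receives a single read-only copy instead of
// the map being serialized into the closure of every task.
val countriesBc = sc.broadcast(countryNames)

val codes = sc.parallelize(Seq("DE", "FR", "IT", "DE"))
val resolved = codes.map(code => countriesBc.value.getOrElse(code, "unknown"))
resolved.collect().foreach(println)
```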
Hope this helps!
I'm very new to PySpark. I was building a TF-IDF and want to store it on disk as an intermediate result. The IDF scoring gives me a SparseVector representation.
However, when I try to save it as Parquet, I get an OOM. I'm not sure if it internally converts the SparseVector to dense, since in that case it would lead to some 25k columns, and according to this thread, saving data that wide in a columnar format can lead to OOM.
So, any idea what the cause could be? My executor memory is 8g and I'm operating on a 2 GB CSV file.
Should I try increasing the memory or save it in CSV instead of Parquet? Any help is appreciated. Thanks in advance.
Update 1
As pointed out, since Spark performs lazy evaluation, the error can come from an upstream stage, so I tried a show and a collect before the write. They seemed to run fine without throwing errors. So, is this still an issue related to Parquet, or do I need to debug something else?
Parquet doesn't provide native support for Spark ML / MLlib Vectors, and neither are these first-class citizens in Spark SQL.
Instead, Spark represents Vectors using a struct with four fields:
type - ByteType
size - IntegerType (optional, only for SparseVectors)
indices - ArrayType(IntegerType) (optional, only for SparseVectors)
values - ArrayType(DoubleType)
and uses metadata to distinguish these from plain structs, plus UDT wrappers to map back to the external types. No conversion between the sparse and dense representations is needed. Nonetheless, depending on the data, such a representation might require memory comparable to the full dense array.
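A small sketch of that representation (written in Scala rather than PySpark; the vector, column names, and output path are made up):

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("vector-parquet-sketch").master("local[*]").getOrCreate()
import spark.implicits._

// A tiny sparse vector: length 5, non-zero entries at indices 1 and 3.
val df = Seq((0L, Vectors.sparse(5, Array(1, 3), Array(0.5, 1.5)))).toDF("id", "features")

df.printSchema()
// root
//  |-- id: long (nullable = false)
//  |-- features: vector (nullable = true)

// On disk the vector is stored as the (type, size, indices, values) struct
// described above; no conversion to a dense representation takes place.
df.write.mode("overwrite").parquet("/tmp/tfidf_sketch")  // hypothetical path
```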
Please note that an OOM on write is not necessarily related to the writing process itself. Since Spark is lazy in general, the exception can be caused by any of the upstream stages.
I am learning Apache Spark and trying to clear out the concepts related to caching and persistence of RDDs in Spark.
So according to the section on persistence in the book "Learning Spark":
To avoid computing an RDD multiple times, we can ask Spark to persist the data.
When we ask Spark to persist an RDD, the nodes that compute the RDD store their partitions.
Spark has many levels of persistence to choose from based on what our goals are.
In Scala and Java, the default persist() will store the data in the JVM heap as unserialized objects. In Python, we always serialize the data that persist stores, so the default is instead stored in the JVM heap as pickled objects. When we write data out to disk or off-heap storage, that data is also always serialized.
But why does the default persist() store the data in the JVM heap as unserialized objects?
Because there is no serialization or deserialization overhead, it is a low-cost operation, and the cached data can be accessed without allocating additional memory for deserialized copies. SerDe is expensive and significantly increases the overall cost, and keeping both serialized and deserialized copies of the objects (particularly with standard Java serialization) can double memory usage in the worst-case scenario.
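A minimal sketch contrasting the default deserialized level with a serialized one (data and app name are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

val spark = SparkSession.builder().appName("persist-sketch").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val numbers = sc.parallelize(1 to 1000000)

// Default persist()/cache(): MEMORY_ONLY, i.e. deserialized JVM objects.
// Reuse is cheap because nothing has to be deserialized on access,
// at the price of a larger in-memory footprint.
val plainCached = numbers.map(_ * 2).persist(StorageLevel.MEMORY_ONLY)

// Serialized alternative: more compact in memory, but every access
// pays the deserialization (SerDe) cost.
val serCached = numbers.map(_ * 3).persist(StorageLevel.MEMORY_ONLY_SER)

plainCached.count()
serCached.count()
```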