Apache Spark computes closures of functions applied to RDDs to send them to executor nodes.
This serialization has a cost, so I would like to ensure that the closures Spark generates are as small as they can be. For instance, it is possible that functions needlessly refer to a large serializable object which would get serialized in the closure, without actually being required for the computation.
Are there any tools to inspect the contents of the closures sent to executors? Or any other technique to optimize them?
I'm not sure of a tool to inspect closures, but one technique to optimize serialization costs is to use broadcast variables (https://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables), which serialize and send a single copy of the object to each executor. This is useful for static, read-only objects (e.g., a lookup table or dictionary), and it can save on serialization costs. For example, if we have 100 partitions and 10 executor nodes (10 partitions per executor), rather than serializing and sending the object to each partition (100x), it will only be serialized and sent to each executor (10x); once the object has been sent to an executor for one partition, the other partitions will refer to the in-memory copy.
Hope this helps!
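To make this concrete, here is a minimal sketch, assuming a SparkContext sc, an RDD[String] called rdd, and a made-up loadLookupTable() helper, of replacing a closure-captured object with a broadcast variable:

val lookupTable: Map[String, Int] = loadLookupTable()   // hypothetical helper that builds the large read-only map

// ship the table once per executor instead of capturing it in every task's closure
val lookupBc = sc.broadcast(lookupTable)

val enriched = rdd.map { key =>
  // each task reads the executor-local copy via .value
  (key, lookupBc.value.getOrElse(key, -1))
}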
Related
I think this is a good question to ask. I might be able to find the answer in the spark-kafka-streaming source code; I will do that if no one can answer this.
Imagine a scenario like this:
val dstream = ...
dstream.foreachRDD { rdd =>
  rdd.count()
  rdd.collect()
}
In the example code above, we are getting micro-batches from the DStream, and for each batch we are triggering two actions:
count() how many rows
collect() all the rows
According to Spark's lazy evaluation behaviour, both actions will trace back to the origin of the data source (the Kafka topic), and since we don't have any persist() calls or wide transformations, there is nothing in our code logic that would make Spark cache the data it has read from Kafka.
So here is the question: will Spark read from Kafka twice, or just once? This is very performance-relevant, since reading from Kafka involves network I/O and potentially puts more pressure on the Kafka brokers. So if the spark-kafka-streaming library won't cache the data, we should definitely cache()/persist() it before running multiple actions.
Any discussion is welcome. Thanks.
EDIT:
I just found some docs on the official Spark website; it looks like executor receivers cache the data. But I don't know whether this applies only to separate receivers, because I have read that the Spark Kafka streaming library doesn't use separate receivers: it receives and processes the data on the same core.
http://spark.apache.org/docs/latest/streaming-programming-guide.html#data-serialization
Input data: By default, the input data received through Receivers is stored in the executors’ memory with StorageLevel.MEMORY_AND_DISK_SER_2. That is, the data is serialized into bytes to reduce GC overheads, and replicated for tolerating executor failures. Also, the data is kept first in memory, and spilled over to disk only if the memory is insufficient to hold all of the input data necessary for the streaming computation. This serialization obviously has overheads – the receiver must deserialize the received data and re-serialize it using Spark’s serialization format.
There is no implicit caching when working with DStreams, so unless you cache explicitly, every evaluation will hit the Kafka brokers.
If you evaluate multiple times, and the brokers are not co-located with the Spark nodes, you should definitely consider caching.
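As a minimal sketch of explicit caching inside foreachRDD (variable names are just for illustration), so the second action does not re-read from Kafka:

dstream.foreachRDD { rdd =>
  rdd.cache()               // keep the micro-batch after the first action pulls it from Kafka
  val n = rdd.count()       // first action: reads from Kafka and populates the cache
  val rows = rdd.collect()  // second action: served from the cached blocks
  rdd.unpersist()           // release the cached blocks once both actions are done
}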
Suppose we have an RDD which is used multiple times. To avoid recomputing it again and again, we persisted this RDD using the rdd.persist() method.
When we persist this RDD, the nodes computing it store their partitions.
Now suppose the node containing a persisted partition of the RDD fails: what will happen? How will Spark recover the lost data? Is there a replication mechanism, or some other mechanism?
When you call rdd.persist, the RDD doesn't materialize its content. It does so when you perform an action on the RDD; it follows the same lazy evaluation principle.
Now, an RDD knows the partitions on which it should operate and the DAG associated with it. With the DAG it is perfectly capable of recreating a materialized partition.
So, when a node fails, the driver spawns another executor on some other node and provides it, in a closure, with the data partition on which it was supposed to work and the DAG associated with it. With this information it can recompute the data and materialize it.
In the meantime, the cached RDD won't have all of its data in memory; the data from the lost node has to be recovered (recomputed or fetched from disk), so it will take a little more time.
On replication: yes, Spark supports in-memory replication. You need to set StorageLevel.MEMORY_AND_DISK_2 when you persist.
rdd.persist(StorageLevel.MEMORY_AND_DISK_2)
This ensures each partition is stored on two nodes.
I think the best way I was able to understand how Spark is resilient was when someone told me that I should not think of RDDs as big, distributed arrays of data.
Instead, I should picture them as containers holding instructions on what steps to take to convert data from the data source, taking one step at a time until a result is produced.
Now if you really care about losing data when persisting, then you can specify that you want to replicate your cached data.
For this, you need to select a storage level. So instead of normally using one of these:
MEMORY_ONLY - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, some partitions will not be cached and will be recomputed on the fly each time they're needed. This is the default level.
MEMORY_AND_DISK - Store RDD as deserialized Java objects in the JVM. If the RDD does not fit in memory, store the partitions that don't fit on disk, and read them from there when they're needed.
You can specify that you want your persisted data replicated:
MEMORY_ONLY_2, MEMORY_AND_DISK_2, etc. - Same as the levels above, but replicate each partition on two cluster nodes.
So if the node fails, you will not have to recompute the data.
Check storage levels here: http://spark.apache.org/docs/latest/rdd-programming-guide.html#rdd-persistence
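For reference, a minimal sketch of selecting a replicated storage level (assuming an existing rdd):

import org.apache.spark.storage.StorageLevel

// keep each partition in memory on two different nodes
rdd.persist(StorageLevel.MEMORY_ONLY_2)

// or allow spilling to disk when memory is tight, still with two replicas
// rdd.persist(StorageLevel.MEMORY_AND_DISK_2)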
I have a Spark application in which I need to get data from the executors to the driver, and I am using collect(). However, I also came across toLocalIterator(). As far as I have read about toLocalIterator() on the Internet, it returns an iterator rather than sending the whole RDD at once, so it has better memory performance, but what about speed? How does the performance of collect() compare with toLocalIterator() in terms of execution/computation time?
The answer to this question depends on what you would do after calling df.collect() or df.rdd.toLocalIterator(). For example, suppose you are processing a considerably big file, about 7M rows, and for each of the records, after doing all the required transformations, you need to iterate over the records in the DataFrame and make service calls in batches of 100.
In the case of df.collect(), it will dump the entire set of records onto the driver, so the driver will need an enormous amount of memory. Whereas in the case of toLocalIterator(), it only returns an iterator over one partition of the total records at a time, hence the driver does not need an enormous amount of memory. So if you are going to load such big files in parallel workflows inside the same cluster, df.collect() will cost you a lot, whereas toLocalIterator() will not, and it will be faster and more reliable as well.
On the other hand, if you plan on doing some transformations after df.collect() or df.rdd.toLocalIterator(), then df.collect() will be faster.
Also, if your file is so small that Spark's default partitioning logic does not break it down into multiple partitions at all, then df.collect() will be faster.
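As a rough sketch of the batching scenario above (callService is a hypothetical function; the batch size of 100 matches the example):

// stream partitions to the driver one at a time and call the service in batches of 100
df.rdd.toLocalIterator
  .grouped(100)
  .foreach(batch => callService(batch))   // callService is assumed to exist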
To quote from the documentation on toLocalIterator():
This results in multiple Spark jobs, and if the input RDD is the result of a wide transformation (e.g. join with different partitioners), to avoid recomputing the input RDD should be cached first.
It means that in the worst-case scenario (no caching at all) it can be n-partitions times more expensive than collect. Even if the data is cached, the overhead of starting multiple Spark jobs can be significant on large datasets. However, the lower memory footprint can partially compensate for that, depending on the particular configuration.
Overall, both methods are inefficient and should be avoided on large datasets.
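Because toLocalIterator launches multiple jobs (roughly one per partition), caching the input first avoids recomputing it each time; a minimal sketch, where handleRow is a made-up function:

val rdd = df.rdd.cache()
rdd.count()                              // one pass to materialize the cache
rdd.toLocalIterator.foreach(handleRow)   // each per-partition job now reads from the cache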
As for toLocalIterator, it is used to collect the data from the RDD scattered around your cluster onto a single node, the one on which the driver program runs, and do something with all the data on that node. It is similar to the collect method, but instead of returning a List it returns an Iterator.
So, after applying a function to an RDD (for example with map), you can call toLocalIterator to get an iterator over all the contents of the RDD and process them. However, bear in mind that if your RDD is very big, you may still have memory issues. If you want to turn the results into an RDD again after doing the operations you need, use the SparkContext to parallelize them.
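And a small sketch of turning the locally processed results back into an RDD (processLocally is a made-up function):

val processed = rdd.toLocalIterator.map(processLocally).toList
val newRdd = sc.parallelize(processed)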
I've read the spark doc and other related Q&As in SO, but I am still unclear about some details on Spark Broadcast variables, especially, the statement in bold:
Spark actions are executed through a set of stages, separated by distributed “shuffle” operations. Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
what is "common data"?
if the variable is only used in 1 stage, does it mean broadcasting it is not useful, regardless of its memory footprint?
Since broadcasting effectively "references" the variable on each executor instead of copying it multiple times, in what scenario is broadcasting a BAD idea? I mean, why isn't this broadcasting behavior the default Spark behavior?
Thank you!
Your question has almost all the answers you need.
what is "common data"?
The data which is referred to or read by multiple executors, for example a dictionary used for lookups. Assume you have 100 executors running tasks that need some huge dictionary for lookups. Without broadcast variables, this data would be shipped with every task; with broadcast variables, it is loaded just once per executor and all the tasks on that executor refer to the same dictionary. Hence you save a lot of space.
For more detail: https://blog.knoldus.com/2016/04/30/broadcast-variables-in-spark-how-and-when-to-use-them/
if the variable is only used in 1 stage, does it mean broadcasting it is not useful, regardless of its memory footprint?
No and yes. No, if your single stage has hundreds to thousands of executors! Yes, if your stage has very few executors.
Since broadcast effectively "reference" the variable on each executor instead of copying it multiple times, in what scenario broadcasting is a BAD idea? I mean why this broadcasting behavior is not the default spark behavior?
The data broadcast this way is cached in serialized form and deserialized before running each task. So, if the data being broadcast is very large, serialization and deserialization become costly operations, and in such cases you should avoid using broadcast variables.
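As a hedged sketch of managing a large broadcast explicitly (hugeDict and loadDictionary() are made-up names), including releasing it once it is no longer needed:

val hugeDict: Map[String, Int] = loadDictionary()   // hypothetical helper
val hugeDictBc = sc.broadcast(hugeDict)

val counts = rdd.map(k => hugeDictBc.value.getOrElse(k, 0)).collect()

hugeDictBc.unpersist()   // drop the executor-side copies; the data is re-sent if the variable is used again
// hugeDictBc.destroy()  // or remove it completely if it will never be used again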
My Spark processing logic depends upon long-lived, expensive-to-instantiate utility objects to perform data-persistence operations. Not only are these objects probably not Serializable, but it is probably impractical to distribute their state in any case, as said state likely includes stateful network connections.
What I would like to do instead is instantiate these objects locally within each executor, or locally within threads spawned by each executor. (Either alternative is acceptable, as long as the instantiation does not take place on each tuple in the RDD.)
Is there a way to write my Spark driver program such that it directs executors to invoke a function to instantiate an object locally (and cache it in the executor's local JVM memory space), rather than instantiating it within the driver program then attempting to serialize and distribute it to the executors?
It is possible to share objects at partition level:
I've tried this: How to make Apache Spark mapPartition work correctly?
Then repartition so that numPartitions is a multiple of the number of executors.
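Building on that, here is a rough sketch (ExpensiveClient and its process method are hypothetical) of creating the utility object once per partition inside mapPartitions, so it is never serialized from the driver:

val results = rdd.mapPartitions { iter =>
  // constructed on the executor, once per partition, never shipped from the driver
  val client = new ExpensiveClient()
  iter.map(record => client.process(record))
}

If you need one instance per executor JVM rather than per partition, a common alternative is to hold the client in a lazy val inside a Scala object, which is then initialized on demand on each executor.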