According to Spark Tuning Tips, broadcast functionality can be used on large objects to reduce the size of each serialized task.
This makes sense to me, but my question is: for small objects like Integer or Boolean objects, is it still worth the object-creation overhead to broadcast them? My hunch is that it is discouraged, but I couldn't find any convincing explanation on this topic online, so please help out if you have done some benchmarking or study of it.
Here is the code to define the variables:
final Broadcast<String> someFolderBroadcast = javaSparkContext.broadcast(someFolder);
final Broadcast<Boolean> someModeBroadcast = javaSparkContext.broadcast(isSomeMode);
someFolderBroadcast.value() and someModeBroadcast.value() are then used to retrieve the values stored in the broadcast variables.
Spark prints the serialized size of each task on the master, so you can look at that to decide whether your tasks are too large. In general, tasks larger than about 20 KB are probably worth optimizing.
So if your variables (or tasks) are larger than 20 KB, broadcast them!
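For a sense of scale, here is a minimal Scala sketch (app and variable names are mine, not from the question) contrasting the two options for a small value. Simply referencing the value in the closure ships it inside every serialized task, but for a Boolean or a short String that is only a few bytes:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SmallBroadcastSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("small-broadcast-sketch").setMaster("local[*]"))
    val data = sc.parallelize(1 to 1000)

    // Option 1: capture the small value directly in the closure.
    // It is serialized into every task, but it only adds a few bytes.
    val isSomeMode = true
    val countA = data.filter(_ => isSomeMode).count()

    // Option 2: broadcast it. Each executor gets one copy, at the cost of a
    // bit of extra code and a broadcast object managed by the driver.
    val isSomeModeBc = sc.broadcast(isSomeMode)
    val countB = data.filter(_ => isSomeModeBc.value).count()

    println(s"$countA $countB")
    sc.stop()
  }
}
```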
This question is similar to the one asked here, but the answer does not help me clearly understand what user memory in Spark actually is.
Can you help me understand with an example? For instance, an example for execution and storage memory would be: in c = a.join(b, a.id==b.id); c.persist(), the join operation (shuffle etc.) uses execution memory, while the persist uses storage memory to keep c cached. Similarly, can you please give me an example of user memory?
From the official documentation, one thing I understand is that it stores UDFs. Storing UDFs does not warrant even a few MBs of space, let alone the default of 25% that is actually used in Spark. What kind of heavy objects might get stored in user memory that one should be careful of, and take into consideration when deciding how to set the parameter (spark.memory.fraction) that sets the bounds of user memory?
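To make my prose example concrete, here is roughly what I mean, sketched in Scala with placeholder DataFrames (names and values are just illustrative):

```scala
import org.apache.spark.sql.SparkSession

object MemoryRegionsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("memory-regions-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    val a = Seq((1, "x"), (2, "y")).toDF("id", "va")
    val b = Seq((1, "p"), (2, "q")).toDF("id", "vb")

    // The shuffle/sort buffers behind this join live in *execution* memory.
    val c = a.join(b, "id")

    // Caching c keeps its blocks in *storage* memory.
    c.persist()
    c.count()

    spark.stop()
  }
}
```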
That's a really great question, to which I won't be able to give a fully detailed answer (I'll be following this question to see if better answers pop up), but I've been snooping around the docs and found out a few things.
I wasn't sure whether I should post this as an answer, because it ends with a few questions of my own but since it does answer your question to some degree I decided to post this as an answer. If this is not appropriate I'm happy to move this somewhere else.
Spark configuration docs
From the configuration docs, you can see the following about spark.memory.fraction:
Fraction of (heap space - 300MB) used for execution and storage. The lower this is, the more frequently spills and cached data eviction occur. The purpose of this config is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records. Leaving this at the default value is recommended. For more detail, including important information about correctly tuning JVM garbage collection when increasing this value, see this description.
So we learn it contains:
Internal metadata
User data structures
Imprecise size estimation in case of sparse, unusually large records
Spark tuning docs: memory management
Following the link in the docs, we get to the Spark tuning page. In there, we find a bunch of interesting info about the storage vs execution memory, but that is not what we're after in this question. There is another bit of text:
spark.memory.fraction expresses the size of M as a fraction of the (JVM heap space - 300MiB) (default 0.6). The rest of the space (40%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.
and also
The value of spark.memory.fraction should be set in order to fit this amount of heap space comfortably within the JVM’s old or “tenured” generation. See the discussion of advanced GC tuning below for details.
So, this is a similar explanation and also a reference to garbage collection.
Spark tuning docs: garbage collection
When we go to the garbage collection page, we see a bunch of information about classical GC in Java. But there is a section that discusses spark.memory.fraction:
In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of memory used for caching by lowering spark.memory.fraction; it is better to cache fewer objects than to slow down task execution. Alternatively, consider decreasing the size of the Young generation. This means lowering -Xmn if you’ve set it as above. If not, try changing the value of the JVM’s NewRatio parameter. Many JVMs default this to 2, meaning that the Old generation occupies 2/3 of the heap. It should be large enough such that this fraction exceeds spark.memory.fraction.
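As a hedged illustration of the knobs the docs are talking about (the values below are arbitrary examples, not recommendations), lowering spark.memory.fraction and steering the generation sizes might look something like this:

```scala
import org.apache.spark.SparkConf

object MemoryFractionSketch {
  // Illustrative only: lowering spark.memory.fraction leaves more room outside
  // the execution/storage region, and -XX:NewRatio=2 keeps the old generation
  // at 2/3 of the heap, which should stay above spark.memory.fraction as the
  // docs suggest.
  val conf: SparkConf = new SparkConf()
    .set("spark.memory.fraction", "0.5")
    .set("spark.executor.extraJavaOptions", "-XX:NewRatio=2")
}
```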
What do I gather from this
As you have already said, the default spark.memory.fraction is 0.6, so 40% is reserved for this "user memory". That is quite large. Which objects end up in there?
This is where I'm not sure, but I would guess the following:
Internal metadata
I don't expect this to be huge?
User data structures
This might be large (just intuition speaking here, not sure at all), and I would hope that someone with more knowledge about this would be able to give some good examples here.
If you make intermediate structures during a map operation on a dataset, do they end up in user memory or in execution memory?
Imprecise size estimation in the case of sparse, unusually large records
Seems like this is only triggered in special cases, would be interesting to know where/how this gets decided.
Elsewhere in the docs it says "safeguarding against OOM errors in the case of sparse and unusually large records". So it might be that this is more of a safety buffer than anything else?
As far as I know, there are external-sorting solutions in Hadoop MapReduce that allow a constant amount of memory, and no more, to be used when sorting/grouping data by key before piping it through an aggregation function for each key.
Assume that the reduce state is constant in size as well, as with addition.
Is this constant-memory grouping/sorting available in Apache Spark or Flink as well, and if so, is there any specific configuration or programmatic way to request this constant-memory mode of processing for reduceByKey or aggregateByKey?
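To pin down what I mean by constant reduce state, here is a minimal Scala sketch (assuming the plain RDD API) where the per-key state is a single Long, so addition is the "constant memory per key" case I have in mind:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object ConstantStateAggregation {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("constant-state-agg").setMaster("local[*]"))

    val events = sc.parallelize(Seq(("a", 1L), ("b", 2L), ("a", 3L)))

    // reduceByKey with addition: the per-key state is one Long, regardless of
    // how many values a key has.
    val sums = events.reduceByKey(_ + _)

    // aggregateByKey with an explicit zero value behaves the same way here.
    val sums2 = events.aggregateByKey(0L)(_ + _, _ + _)

    sums.collect().foreach(println)
    sums2.collect().foreach(println)
    sc.stop()
  }
}
```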
Both systems need to implicitly perform such an operation anyway, since the Java processes only get a fixed amount of main memory. Note that when the data to be sorted gets much larger than the available memory, data needs to be spilled to disk. In the case of sorting, and depending on your query, it may mean that the complete dataset needs to be materialized across main memory and disk.
If you are asking whether you could limit the memory consumption of a specific operator, then things look much more complicated. You could limit your application to that one specific operation and use the global memory settings to limit its consumption, but that would result in a complicated setup.
Do you have a specific use case in mind, where you would need to limit the memory of a specific operation?
Btw, you can consider Spark and Flink to supersede Hadoop MapReduce; there are just a couple of edge cases where MapReduce may be able to beat the next-generation systems.
I have been using Spark as a tool for my own feature-generation project. For this specific project, I have two data-sources which I load into RDDs as follows:
Datasource 1: RDD1 = [(key, (time, quantity, user-id, ...))], where the value tuple also carries a bunch of other attributes such as transaction-id, etc.
Datasource 2: RDD2 = [(key, (t1, t2))]
In RDD1, time denotes the timestamp at which the event happened and, in RDD2, (t1, t2) denotes the acceptable time interval for each feature. The feature key is "key". I have two types of features, as follows:
associative features: e.g. the number of items
non-associative features: e.g. the number of unique users
For each feature-key, I need to see which events fall in the interval (t1,t2) and then aggregate those things. So, I have a join followed by a reduce operation as follows:
`RDD1.join(RDD2).map((key,(v1,v2))=>(key,featureObj)).reduceByKey(...)`
The initial value for my feature would be featureObj = (0, set([])), where the first element keeps the number of items and the second is a set that stores the unique user ids. I also partition the input data to make sure that RDD1 and RDD2 use the same partitioner.
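For clarity, here is a rough Scala sketch of the shape of the pipeline (types and field names are simplified, not my exact code); the associative count folds into a single Long per key, while the non-associative unique-users feature has to carry a growing set through the shuffle:

```scala
import org.apache.spark.rdd.RDD

object FeatureSketch {
  // Illustrative shapes only.
  // rdd1: (key, (time, quantity, userId)); rdd2: (key, (t1, t2))
  def features(rdd1: RDD[(String, (Long, Double, String))],
               rdd2: RDD[(String, (Long, Long))]): RDD[(String, (Long, Set[String]))] =
    rdd1.join(rdd2)
      .filter { case (_, ((time, _, _), (t1, t2))) => time >= t1 && time <= t2 }
      .map { case (key, ((_, _, userId), _)) => (key, (1L, Set(userId))) }
      .reduceByKey { case ((c1, users1), (c2, users2)) =>
        // The count is constant-size per key; the user-id set grows with the
        // data, which is what makes the non-associative feature expensive.
        (c1 + c2, users1 ++ users2)
      }
}
```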
Now, when I run the job to calculate just the associative feature, it runs very fast on a cluster of 16 m2.xlarge nodes, in only 3 minutes. The minute I add the second type, the computation time jumps to 5 minutes. I tried to add a couple of other non-associative features and, every time, the run-time increases quickly. Right now, my job runs in 15 minutes for 15 features, 10 of them non-associative. I also tried to use the KryoSerializer and persist the RDDs in serialized form, but nothing special happened. Since I will be moving on to implement more features, this issue seems likely to become a bottleneck.
PS. I tried to do the same task on a single big host (128 GB of RAM and 16 cores). With 145 features, the whole job was done in 10 minutes. I am under the impression that the main Spark bottleneck is the JOIN. I checked my RDDs and noticed that both are co-partitioned in the same way. As a single job is calling these two RDDs, I presume they are co-located too? However, the Spark web console still shows 2.6 GB of shuffle read and 15.6 GB of shuffle write.
Could someone please advise me if I am doing something really crazy here? Am I using Spark for the wrong kind of application? Thanks in advance for your comments.
With best regards,
Ali
I noticed poor performance with shuffle operations, too. It turned out that the shuffle ran very fast when data was shuffled from one core to another within the same executor (locality PROCESS_LOCAL), but much slower than expected in all other situations; even NODE_LOCAL was very slow. This can be seen in the Spark UI.
Further investigation with CPU and garbage collection monitoring found that at some point garbage collection made one of the nodes in my cluster unresponsive, and this would block the other nodes shuffling data from or to this node, too.
There are a lot of options that you can tweak to improve garbage collection performance. One important thing is to enable early reclamation of humongous objects for the G1 garbage collector, which requires Java 8u45 or higher.
In my case the biggest problem was memory allocation in Netty. When I turned direct buffer memory off by setting spark.shuffle.io.preferDirectBufs = false, my jobs ran much more stably.
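For reference, a hedged sketch of the kind of configuration this amounts to (your values and JVM flags will differ; I am deliberately leaving out the exact humongous-object flag, since its name depends on the JVM version):

```scala
import org.apache.spark.SparkConf

object ShuffleGcSketch {
  // Illustrative only: enable G1 on the executors and disable Netty's direct
  // buffers for shuffle traffic.
  val conf: SparkConf = new SparkConf()
    .set("spark.executor.extraJavaOptions", "-XX:+UseG1GC")
    .set("spark.shuffle.io.preferDirectBufs", "false")
}
```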
I've come across a situation where I'd like to do a "lookup" within a Spark and/or Spark Streaming pipeline (in Java). The lookup is somewhat complex, but fortunately, I have some existing Spark pipelines (potentially DataFrames) that I could reuse.
For every incoming record, I'd like to potentially launch a spark job from the task to get the necessary information to decorate it with.
Considering the performance implications, would this ever be a good idea?
Not considering the performance implications, is this even possible?
Is it possible to get and use a JavaSparkContext from within a task?
No. The SparkContext is only valid on the driver, and Spark will prevent it from being serialized. Therefore it is not possible to use the SparkContext from within a task.
For every incoming record, I'd like to potentially launch a spark job from the task to get the necessary information to decorate it with.
Considering the performance implications, would this ever be a good idea?
Without more details, my umbrella answer would be: Probably not a good idea.
Not considering the performance implications, is this even possible?
Yes, probably by bringing the base collection to the driver (collect) and iterating over it. If that collection doesn't fit in the driver's memory, please see the previous point.
If we need to process every record, consider performing some form of join with the 'decorating' dataset - that will be only one large job instead of tons of small ones.
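As a sketch of these two suggestions (all names and types below are invented for illustration), the lookup becomes either a broadcast of a driver-side collect or a single join, instead of a job per record:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object LookupSketch {
  // Hypothetical shapes: records keyed by id, and a lookup table used to decorate them.
  def decorate(sc: SparkContext,
               records: RDD[(String, String)],
               lookup: RDD[(String, String)]): RDD[(String, (String, Option[String]))] = {
    // Option A: if the lookup fits in driver memory, collect it once and broadcast it.
    val lookupMap = sc.broadcast(lookup.collectAsMap())
    val viaBroadcast = records.map { case (id, value) =>
      (id, (value, lookupMap.value.get(id)))
    }

    // Option B: otherwise, do one join instead of launching a job per record.
    val viaJoin = records.leftOuterJoin(lookup)

    viaJoin // or viaBroadcast, depending on the size of the lookup dataset
  }
}
```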
I have started working on a Scala Spark codebase where everything that can be broadcast seems to be, even small objects (a handful of small String attributes).
For example, I see this a lot:
val csvParser: CSVParser = new CSVParser(someComputedValue())
val csvParserBc = sc.broadcast(csvParser)
someFunction(..., csvParserBc)
My question is twofold:
Is broadcasting useful when a small object is reused in several closures?
Is broadcasting useful when a small object is used one single time?
I'm under the impression that in those cases broadcasting is not useful, and could even be wasteful, but I'd like a more enlightened opinion.
When you broadcast something it is copied to each executor once. If you don't broadcast it, it's copied along with each task. So broadcasting is useful if you have a large object and/or many more tasks than executors.
In my experience this is very rarely the case. Broadcast complicates the code. So I would always start off without a broadcast and only add a broadcast if I find that this is required for good performance.
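To illustrate, here is a minimal sketch of the two variants (the CSVParser below is a stand-in I wrote for this example, not the code from the question). The plain version keeps the signatures simpler; the broadcast can be introduced later if task sizes turn out to warrant it:

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Stand-in for the small object in the question.
class CSVParser(delimiter: String) extends Serializable {
  def parse(line: String): Array[String] = line.split(delimiter)
}

object ParserSketch {
  // Plain version: the parser is captured by the closure and serialized with
  // each task, which is cheap for a small object.
  def parseLines(lines: RDD[String], parser: CSVParser): RDD[Array[String]] =
    lines.map(line => parser.parse(line))

  // Broadcast version: one copy per executor, but the Broadcast wrapper now has
  // to be threaded through every function that needs the parser.
  def parseLinesBc(lines: RDD[String], parserBc: Broadcast[CSVParser]): RDD[Array[String]] =
    lines.map(line => parserBc.value.parse(line))
}
```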