This is a very simple question: in Spark, broadcast variables can be used to send data to executors efficiently. How does this work?
More precisely:
When are values sent: as soon as I call broadcast, or when the values are used?
Where exactly is the data sent: to all executors, or only to the ones that will need it?
Where is the data stored? In memory, or on disk?
Is there a difference in how simple variables and broadcast variables are accessed? What happens under the hood when I call the .value method?
Short answer
Values are sent the first time they are needed in an executor. Nothing is sent when sc.broadcast(variable) is called.
The data is sent only to the nodes that contain an executor that needs it.
The data is stored in memory. If not enough memory is available, the disk is used.
Yes, there is a big difference between accessing a local variable and a broadcast variable. Broadcast variables have to be downloaded the first time they are accessed.
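To make the short answer concrete, here is a minimal sketch (assuming an existing SparkContext called sc; the map contents and RDD are purely illustrative) of the lifecycle described above:

// Nothing is shipped to the executors at this point: only lightweight
// broadcast metadata is created on the driver.
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

// The first task on each executor that touches lookup.value triggers the
// download of the broadcast blocks; later tasks on the same executor reuse
// the locally cached copy (memory first, spilling to disk if needed).
val result = sc.parallelize(Seq("a", "b", "a"))
  .map(key => lookup.value.getOrElse(key, 0))
  .collect()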
Long answer
The answer is in Spark's source, in TorrentBroadcast.scala.
When sc.broadcast is called, a new TorrentBroadcast object is instantiated from BroadcastFactory.scala. The following happens in writeBlocks(), which is called when the TorrentBroadcast object is initialized:
The object is cached locally, unserialized, using the MEMORY_AND_DISK storage level.
It is serialized.
The serialized version is split into 4 MB blocks, which are compressed[0] and saved locally[1].
When new executors are created, they only have the lightweight TorrentBroadcast object, which contains only the broadcast object's identifier and its number of blocks.
The TorrentBroadcast object has a lazy[2] property that contains its value. When the value method is called, this lazy property is returned. So the first time this value function is called on a task, the following happens:
In a random order, blocks are fetched from the local block manager and uncompressed.
If they are not present in the local block manager, getRemoteBytes is called on the block manager to fetch them. Network traffic happens only at that time.
If the block wasn't present locally, it is cached using MEMORY_AND_DISK_SER.
[0] Compressed with lz4 by default. This can be tuned.
[1] The blocks are stored in the local block manager using MEMORY_AND_DISK_SER, which means that blocks that don't fit in memory are spilled to disk. Each block has a unique identifier, computed from the identifier of the broadcast variable and its offset. The block size can be configured; it is 4 MB by default.
[2] A lazy val in Scala is a variable whose value is evaluated the first time it is accessed, and then cached. See the documentation.
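Footnote [2] can be illustrated with plain Scala, independently of Spark; this tiny sketch just shows that the body of a lazy val runs once, on first access:

object LazyValDemo extends App {
  lazy val expensive: Int = {
    println("computing...")   // printed exactly once
    42
  }

  println("before first access")
  println(expensive)          // evaluates the body: prints "computing..." then 42
  println(expensive)          // already cached: prints 42 only
}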
As soon as it is broadcast.
It is sent to all executors using a torrent-like protocol, but loaded only when needed.
Once loaded, the variables are stored deserialized in memory.
It:
validates that the broadcast hasn't been destroyed
lazily loads the variable from the BlockManager
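As a hedged sketch of that destroyed-broadcast check (assuming an existing SparkContext sc): unpersist() only drops the cached copies on the executors, so the value can be fetched again, while destroy() invalidates the variable and a later .value call fails.

val b = sc.broadcast(Seq(1, 2, 3))

b.unpersist()        // removes the cached blocks on the executors only;
                     // the value is re-fetched transparently on next use
println(b.value)     // still works

b.destroy()          // removes all data and metadata, on the driver too
// println(b.value)  // would now throw a SparkException, because the
                     // validation step above detects the destroyed broadcast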
Related
I am using a broadcast variable for a join operation in Spark, but I am running into an issue with the time it takes to load the broadcast from the driver to the executors. I just want to load it once and reuse it for multiple jobs (over the whole application lifecycle).
My reference: https://github.com/apache/spark/blob/branch-2.2/core/src/test/scala/org/apache/spark/broadcast/BroadcastSuite.scala
Broadcast variables are not related to a job but to a session/context. If you reuse the same SparkSession it's likely that the broadcast variable will be reused. If I recall correctly, under certain types of memory pressure the workers may clear the broadcast variable but, if it is referenced, it would be automatically re-broadcast to satisfy the reference.
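A small sketch of that reuse, assuming an existing SparkContext sc (the map contents are made up): the same broadcast value serves several actions, i.e. several jobs, within one session.

val dim = sc.broadcast(Map(1 -> "one", 2 -> "two"))
val rdd = sc.parallelize(1 to 10)

// Job 1: executors fetch the broadcast blocks the first time a task reads dim.value.
val labelled = rdd.map(i => (i, dim.value.getOrElse(i, "other"))).collect()

// Job 2, same SparkContext: the value is not re-shipped from the driver;
// the cached blocks are reused (and transparently re-fetched if they were evicted).
val hits = rdd.filter(i => dim.value.contains(i)).count()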
Broadcast variables can be used to cache a value in memory on all nodes. They allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.
EdhBroadcast broadcast = new EdhBroadcast(JavaSparkContext);
It's not possible. Broadcast variables are used to send some immutable state once to each worker; you use them when you want a local copy of a variable.
Alternatively, you can create an RDD, cache it, and reuse it.
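A sketch of that alternative, assuming an existing SparkContext sc and a placeholder file path; the cached RDD is materialized by the first action and then reused by later ones.

val lookupRdd = sc.textFile("lookup.csv")          // placeholder path
  .map(_.split(","))
  .map(parts => (parts(0), parts(1)))
  .cache()                                         // kept in memory after the first action

val firstUse  = lookupRdd.count()                  // triggers the load and caches the partitions
val secondUse = lookupRdd.lookup("someKey")        // reuses the cached partitions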
I have a set of large variables that I broadcast. These variables are loaded from a clustered database. Is it possible to distribute the load from the database across worker nodes and then have each one broadcast their specific variables to all nodes for subsequent map operations?
Thanks!
Broadcast variables are generally passed to workers, but I can tell you what I did in a similar case in Python.
If you know the total number of rows, you can try to create an RDD of that length and then run a map operation on it (which will be distributed to workers). In the map, the workers are running a function to get some piece of data (not sure how you are going to make them all get different data).
Each worker would retrieve required data through making the calls. You could then do a collectAsMap() to get a dictionary and broadcast that to all workers.
However, keep in mind that each worker needs all the software dependencies for making those client requests. You also need to keep socket usage in mind. I just did something similar by querying an API and did not see a rise in sockets, although I was making regular HTTP requests. Not sure...
Ok, so the answer it seems is no.
Calling sc.broadcast(someRDD) results in an error. You have to collect() it back to the driver first.
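In Scala the workflow described in this answer would look roughly like the sketch below; fetchRows is a hypothetical stand-in for whatever client call retrieves a slice of data from the clustered database, and sc is an existing SparkContext.

// Hypothetical per-partition fetch; a real version would open a DB client here.
def fetchRows(keys: Iterator[Long]): Iterator[(Long, String)] =
  keys.map(k => (k, s"value-for-$k"))

val totalRows = 1000L
val keyRdd = sc.parallelize(0L until totalRows, numSlices = 64)

// The fetching work is distributed across the workers...
val fetched = keyRdd.mapPartitions(fetchRows)

// ...but the result has to be collected back to the driver before it can be broadcast.
val lookup = sc.broadcast(fetched.collectAsMap())

val someRdd = sc.parallelize(Seq(1L, 5L, 2000L))
val joined  = someRdd.map(k => (k, lookup.value.get(k)))   // Option[String]: None for missing keys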
Spark has broadcast variables, which are read only, and accumulator variables, which can be updated by the nodes but not read. Is there a way - or a workaround - to define a variable which is both updatable and can be read?
One requirement for such a read/write global variable would be to implement a cache. As files are loaded and processed as RDDs, calculations are performed. The results of these calculations, happening in several nodes running in parallel, need to be placed into a map which has as its key some of the attributes of the entity being processed. As subsequent entities within the RDDs are processed, the cache is queried.
Scala does have ScalaCache, which is a facade for cache implementations such as Google Guava. But how would such a cache be included and accessed within a Spark application?
The cache could be defined as a variable in the driver application which creates the SparkContext. But then there would be two issues:
Performance would presumably be bad because of the network overhead between the nodes and the driver application.
To my understanding, each RDD will be passed a copy of the variable (the cache in this case) when the variable is first accessed by the function passed to the RDD. Each RDD would have its own copy, not access to a shared global variable.
What is the best way to implement and store such a cache?
Thanks
Well, the best way of doing this is not doing it at all. In general, Spark's processing model doesn't provide any guarantees* regarding
where,
when,
in what order (excluding of course the order of transformations defined by the lineage / DAG)
and how many times
a given piece of code is executed. Moreover, any updates which depend directly on the Spark architecture are not granular.
These are the properties which make Spark scalable and resilient, but at the same time they are what makes shared mutable state very hard to implement and most of the time completely useless.
If all you want is a simple cache then you have multiple options:
use one of the methods described by Tzach Zohar in Caching in Spark
use local caching (per JVM or executor thread) combined with application-specific partitioning to keep things local (see the sketch after this list)
for communication with external systems use node local cache independent of Spark (for example Nginx proxy for http requests)
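For the local-caching option above, one common pattern (a sketch under my own assumptions, not code from the original answer) is a lazily initialized singleton per executor JVM; a plain ConcurrentHashMap stands in here for a real cache such as Guava or Caffeine.

import java.util.concurrent.ConcurrentHashMap

// Scala objects are initialized lazily, once per JVM, so every task running in
// the same executor shares this map, while different executors each have their own.
object LocalCache {
  private val cache = new ConcurrentHashMap[String, String]()

  def getOrCompute(key: String)(compute: => String): String =
    cache.computeIfAbsent(key, _ => compute)
}

// Usage inside a transformation, e.g. rdd.map(k => LocalCache.getOrCompute(k)(expensiveLookup(k))),
// where expensiveLookup is whatever external call you want to avoid repeating.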
If the application requires much more complex communication, you may try different message-passing tools to keep state synchronized, but in general this requires complex and potentially fragile code.
* This partially changed in Spark 2.4, with introduction of the barrier execution mode (SPARK-24795, SPARK-24822).
Currently, inside a transformation, I am reading one file and creating a HashMap, which is a static field for reuse.
For each and every record I need to check whether the HashMap contains the corresponding key; if it does, I get the value from the HashMap.
What is the best way to do this?
Should I broadcast this HashMap and use it inside the transformation? [HashMap or ConcurrentHashMap]
Will broadcasting make sure the HashMap always contains the values?
Is there any scenario where the HashMap becomes empty and we need to handle that check as well? [If it's empty, load it again.]
Update:
Basically, I need to use the HashMap as a lookup inside a transformation. What is the best way to do this: broadcast or a static variable?
When I use a static variable, for a few records I am not getting the correct value from the HashMap. The HashMap contains only 100 elements, but I am comparing it against 25 million records.
First of all, a broadcast variable can be used only for reading, not as a global variable that can be modified as in classic programming (one thread, one computer, procedural programming, etc.). Indeed, you can use a global variable in your code and it can be utilized in any part of it (even inside maps), but never modified.
As you can see here, Advantages of broadcast variables, they boost performance because having a cached copy of the data on all nodes allows you to avoid repeatedly transporting the same object to every node.
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
For example.
rdd = sc.parallelize(range(1000))
broadcast = sc.broadcast({"number":1, "value": 4})
rdd = rdd.map(lambda x: x + broadcast.value["value"])
rdd.collect()
As you can see I access the value inside the dictionary in every iteration of the transformation.
You should broadcast the variable.
Making the variable static will cause the class to be serialized and distributed and you might not want that.
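A hedged sketch of the recommended pattern, assuming an existing SparkContext sc (the map contents and records are made up; in the question the map would be built from a file on the driver):

case class Record(key: String, payload: String)

val lookupMap: Map[String, Int] = Map("id-1" -> 10, "id-2" -> 20)   // built once on the driver
val lookupBc  = sc.broadcast(lookupMap)                             // read-only copy per executor

val records = sc.parallelize(Seq(Record("id-1", "a"), Record("id-3", "b")))

// A plain immutable Map is enough: the broadcast is read-only, so no ConcurrentHashMap is needed.
val enriched = records.map { r =>
  (r, lookupBc.value.getOrElse(r.key, -1))   // handle missing keys explicitly
}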
I've got a software application that can have over a million objects in memory, using all the cores of the machine. Each object has a unique ID and its own internal StateObject that needs to be persisted temporarily somewhere; any change to the StateObject results in overwriting the previous StateObject with new, updated data.
I was wondering if I should be reading & writing the state to a database or should I just create text files locally on the machine, each named with uniqueId of the object and each object will read and write a json String of StateObject to the file.
Which option will yield better performance: a database or just writing to the local file system? Should I write to multiple files named with the uniqueId, or to one file with multiple rows where the first column is the unique ID? After doing some research I found that parallel reads and writes are slower on an HDD but fast on an SSD, so I guess I have to use an SSD.
Update
The reason to write to disk is that there are too many objects (> 1 million), and keeping every object's StateObject in memory is going to be expensive, so I would rather persist an object's internal state (StateObject) to disk when it is not being used. The guarantee of the writes is very important for processing the next request by that object: if the write fails for some reason, the StateObject will be rebuilt from remote APIs before processing the next request, which is more time consuming.