Spark Broadcast Variable with only one task using it - apache-spark

To my understanding, Spark works like this:
For standard variables, the Driver sends them to the executors together with the lambda (or rather, the closure) for each task that uses them.
For broadcast variables, the Driver sends them to the executors only once, the first time they are used.
Is there any advantage to using a broadcast variable instead of a standard variable when we know it will be used only once, so there would be only one transfer even with a standard variable?
Example (Java):
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;

public class SparkDriver {
    public static void main(String[] args) {
        String inputPath = args[0];
        String outputPath = args[1];

        Map<String, String> dictionary = new HashMap<>();
        dictionary.put("J", "Java");
        dictionary.put("S", "Spark");

        SparkConf conf = new SparkConf()
                .setAppName("Try BV")
                .setMaster("local");

        try (JavaSparkContext context = new JavaSparkContext(conf)) {
            final Broadcast<Map<String, String>> dictionaryBroadcast = context.broadcast(dictionary);

            context.textFile(inputPath)
                   .map(line -> { // just one transformation using the broadcast variable
                       Map<String, String> d = dictionaryBroadcast.value();
                       String[] words = line.split(" ");
                       StringBuilder sb = new StringBuilder();
                       for (String w : words)
                           sb.append(d.get(w)).append(" ");
                       return sb.toString();
                   })
                   .saveAsTextFile(outputPath); // just one action!
        }
    }
}

There are several advantages to using a broadcast variable, even if you use it only once:
You avoid several serialization problems. When you serialize an anonymous inner class that uses a field of its enclosing class, the whole enclosing class gets serialized too. Spark's ClosureCleaner partially mitigates this, but it does not always do the trick. You can work around the resulting NotSerializableExceptions with a few tricks, e.g. copying a class member variable into a local variable before the closure, or turning the anonymous inner class into a named class that takes only the required fields in its constructor.
With a broadcast variable you don't even have to think about that: the broadcast mechanism serializes only the required variable. I suggest reading the not-serializable-exception question and its first answer to get a better grasp of the concept.
The serialization performance of the closure is, most of the time, worse than that of a specialized serializer. As the official Spark documentation (data-serialization) says:
Kryo is significantly faster and more compact than Java serialization (often as much as 10x).
Browsing the Spark classes in the official repo, I saw that the closure is serialized through the variable SparkEnv.get.closureSerializer. The only assignment of that variable (around line 306 of the SparkEnv class) uses the standard, and inefficient, JavaSerializer.
So if you serialize a big object, you lose some performance to serialization cost and network bandwidth. This may also explain why the official docs suggest switching to a broadcast variable for tasks larger than about 20 KiB.
There is only one copy per machine, which is an advantage when several executors run on the same physical machine.
> Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks
The distribution algorithm is probably a lot more efficient. Given the immutability of a broadcast variable, it is not hard to imagine distributing it with a peer-to-peer algorithm instead of a centralized one: once the driver has finished sending the broadcast variable to the first executor it can send it to the second, while in parallel the first executor forwards the data to a third, and so on (the BitTorrent Wikipedia page has a nice diagram of this).
I have not dug into Spark's implementation, but as the documentation of broadcast variables says:
Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
Surely a more efficient algorithm than a trivial centralized one can be designed by exploiting the immutability of the broadcast variable.
Long story short: using a closure and using a broadcast variable are not the same thing. If the object you are sending is big, use a broadcast variable.
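To make the difference concrete, here is a minimal sketch (Java; the RecordMapper class and its field and method names are hypothetical, not from the question) contrasting a field captured by a lambda, the local-copy workaround, and an explicit broadcast:
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
public class RecordMapper {
    // Referencing this field from a lambda captures "this", so the whole
    // RecordMapper instance must be serializable and is shipped with every task.
    private final Map<String, String> dictionary = new HashMap<>();

    public JavaRDD<String> withLocalCopy(JavaRDD<String> lines) {
        // Workaround without a broadcast: copy the field into a local variable,
        // so only the map itself is serialized into each task's closure.
        final Map<String, String> localCopy = dictionary;
        return lines.map(line -> localCopy.getOrDefault(line, "?"));
    }

    public JavaRDD<String> withBroadcast(JavaSparkContext context, JavaRDD<String> lines) {
        // With a broadcast: the map is shipped once per executor and reused by all its tasks.
        final Broadcast<Map<String, String>> dictionaryBc = context.broadcast(dictionary);
        return lines.map(line -> dictionaryBc.value().getOrDefault(line, "?"));
    }
}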

Please refer to this excellent article: https://www.mikulskibartosz.name/broadcast-variables-and-broadcast-joins-in-apache-spark/ I could re-write it but it serves the purpose well and answers your question.
In summary:
A broadcast variable is an Apache Spark feature that lets us send a read-only copy of a variable to every worker node in the Spark cluster.
The broadcast variable is useful only when we want to:
Reuse the same variable across multiple stages of the Spark job
Speed up joins via a small table that is broadcast to all worker nodes, not all Executors.
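For the join case, Spark SQL exposes a broadcast hint; here is a minimal sketch in Java (the file paths and the join column "key" are made up for the example):
import static org.apache.spark.sql.functions.broadcast;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
public class BroadcastJoinSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("Broadcast join sketch").getOrCreate();
        Dataset<Row> largeFacts = spark.read().parquet("/data/facts");    // hypothetical large table
        Dataset<Row> smallDim = spark.read().parquet("/data/dimension");  // hypothetical small table
        // The broadcast() hint asks Spark to ship the small table to every executor
        // and use a broadcast hash join instead of shuffling both sides.
        Dataset<Row> joined = largeFacts.join(broadcast(smallDim), "key");
        joined.show();
    }
}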

Related

How do I share a private variable among tasks in Flink (Scala)?

I want to share a private variable in my Flink job (Scala) across the parallel tasks of Flink. My code is something like this:
object myJob extends flinkJob {
  private val myVariable = someValue

  def run(params): Unit = {
    // Stream processing
    // myVariable is used here in the ProcessFunction
  }
}
When I run this job with some parallelization, will there be a single copy of "myVariable" shared across all of the Flink tasks? If not, how can I ensure that only one single copy of the variable is used and maintained across all of the parallel tasks?
Since parallelized copies of your operator run as tasks in separate JVMs, there can be no "sharing" of a variable. What you can do is use a BroadcastStream to share the same data with multiple tasks. If you need to be able to update the variable, then you'll need to look at using iterations, or store the variable in an external system that you can regularly query.
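For reference, here is a minimal sketch of the BroadcastStream approach in Java (the stream contents and names are invented for illustration): the broadcast side writes into a map state on every parallel instance, and the main side reads it.
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;
public class BroadcastVariableJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Hypothetical streams: the main data and the "variable" to share.
        DataStream<String> events = env.fromElements("a", "b", "c");
        DataStream<String> config = env.fromElements("someValue");
        MapStateDescriptor<String, String> descriptor =
            new MapStateDescriptor<>("sharedVariable", Types.STRING, Types.STRING);
        // Every parallel instance of the downstream operator receives the broadcast elements.
        BroadcastStream<String> broadcastConfig = config.broadcast(descriptor);
        events.connect(broadcastConfig)
            .process(new BroadcastProcessFunction<String, String, String>() {
                @Override
                public void processElement(String value, ReadOnlyContext ctx, Collector<String> out) throws Exception {
                    // Read the shared value (read-only on this side).
                    String shared = ctx.getBroadcastState(descriptor).get("value");
                    out.collect(value + "/" + shared);
                }
                @Override
                public void processBroadcastElement(String value, Context ctx, Collector<String> out) throws Exception {
                    // Update the broadcast state; this runs on every parallel instance.
                    ctx.getBroadcastState(descriptor).put("value", value);
                }
            })
            .print();
        env.execute("Broadcast variable sketch");
    }
}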

from single machine to parallel processing?

I am new to Spark Streaming and Big Data in general. I am trying to understand the structure of a project in Spark. I want to create a main class, let's say "Driver", with M machines, where each machine keeps an array of counters and their values. On a single machine, outside Spark, I would create a class for the machines and a class for the counters and do the computations I want. But I am wondering whether the same applies in Spark. Would the same project, but in Spark, have the structure I am quoting below?
class Driver {
  var num: Int = 100
  var machines: Array[Machine] = new Array[Machine](num)
  // split incoming DStream and fill machines' queues
}

class Machine {
  var counters = new Queue[(Int, Int)]() // counter with id 1 and value 25
  def fillCounters: Unit = { ... } // function to fill the queue of counters
}
In general, you can think of a Spark application as having a driver part, which runs all coordination tasks and constructs the graph of operations (you will find mentions of the directed acyclic graph, or DAG, in the theoretical parts of tutorials on Spark and distributed computation) to apply to your data, and an executor part, which results in many copies of the code being sent to each node of the cluster to run over the data.
The main idea is that the driver extracts the part of your application's code that needs to run locally with the data on the nodes, serializes it, sends it over the network to each executor, launches and manages it, and collects the results.
The Spark framework hides these details for simplicity of usage, so the application you develop looks like a single-threaded application.
A developer can separate the contexts that run on the driver and on the executors, but this is not very common in tutorials (again, for simplicity).
So, to answer the actual question above:
you do not need to design your application in the way you demonstrated above, unless you really want to.
Just follow the official Spark tutorial to get a viable solution, and split it afterwards by execution context.
There is a good post summarizing a lot of Spark tutorials, videos and talks; you can find it here on SO.
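As a minimal illustration of that split (Java; the input values are placeholders), everything outside the lambda below runs on the driver, while the lambda body is serialized and executed by the executors on partitions of the data:
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
public class DriverExecutorSketch {
    public static void main(String[] args) {
        // Driver side: builds the configuration and the operation graph (DAG).
        SparkConf conf = new SparkConf().setAppName("Driver vs executor sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            long count = sc.parallelize(Arrays.asList("a b", "c", "d e f"))
                // Executor side: this lambda is shipped to the executors
                // and applied to the elements of each partition in parallel.
                .flatMap(line -> Arrays.asList(line.split(" ")).iterator())
                .count();
            // Driver side again: the result of the action comes back here.
            System.out.println("words = " + count);
        }
    }
}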

Spark - Broadcasting HashMap and use it inside Transformations

Currently, inside a transformation, I am reading one file and creating a HashMap, kept in a static field for reuse.
For each record I need to check whether the HashMap contains the corresponding key or not. If it matches the record key, I get the value from the HashMap.
What is the best way to do this?
Should I broadcast this HashMap and use it inside the transformation? [HashMap or ConcurrentHashMap]
Will broadcasting make sure the HashMap always contains the values?
Is there any scenario where the HashMap becomes empty and we need to handle that check as well? [if it's empty, load it again]
Update:
Basically I need to use the HashMap as a lookup inside a transformation. What is the best way to do that: a broadcast or a static variable?
When I use a static variable, for a few records I do not get the correct value from the HashMap. The HashMap contains only 100 elements, but I am comparing it against 25 million records.
First of all, a broadcast variable can be used only for reading; it is not a global variable that can be modified as in classic programming (one thread, one computer, procedural programming, etc.). You can indeed use a global variable in your code and reference it anywhere (even inside maps), but it must never be modified.
As you can see here, Advantages of broadcast variables, they boost performance because keeping a cached copy of the data on every node lets you avoid repeatedly transporting the same object to each node.
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
For example.
rdd = sc.parallelize(range(1000))
broadcast = sc.broadcast({"number":1, "value": 4})
rdd = rdd.map(lambda x: x + broadcast.value["value"])
rdd.collect()
As you can see I access the value inside the dictionary in every iteration of the transformation.
You should broadcast the variable.
Making the variable static will cause the class to be serialized and distributed and you might not want that.
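A minimal sketch of that broadcast-lookup pattern in Java (the method name, the RDD and the map contents are hypothetical); because the broadcast value is read-only and cached per executor, it cannot "become empty" between records:
import java.util.HashMap;
import java.util.Map;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.broadcast.Broadcast;
public class LookupWithBroadcast {
    public static JavaRDD<String> enrich(JavaSparkContext sc, JavaRDD<String> records) {
        // Built once on the driver (e.g. from a small reference file).
        Map<String, String> lookup = new HashMap<>();
        lookup.put("A", "alpha");
        lookup.put("B", "beta");
        // Shipped once per executor, then reused for all 25 million record lookups.
        Broadcast<Map<String, String>> lookupBc = sc.broadcast(lookup);
        return records.map(key ->
            // containsKey/get combined: fall back to the key itself if there is no match.
            lookupBc.value().getOrDefault(key, key));
    }
}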

When is a broadcast overkill?

I have started working on a Scala Spark codebase where everything that can be broadcast seems to be, even small objects (a handful of small String attributes).
For example, I see this a lot:
val csvParser: CSVParser = new CSVParser(someComputedValue())
val csvParserBc = sc.broadcast(csvParser)
someFunction(..., csvParserBc)
My question is twofold:
Is broadcasting useful when a small object is reused in several closures?
Is broadcasting useful when a small object is used one single time?
I'm under the impression that in those cases broadcasting is not useful, and could even be wasteful, but I'd like a more enlightened opinion.
When you broadcast something it is copied to each executor once. If you don't broadcast it, it's copied along with each task. So broadcasting is useful if you have a large object and/or many more tasks than executors.
In my experience this is very rarely the case. Broadcast complicates the code. So I would always start off without a broadcast and only add a broadcast if I find that this is required for good performance.

Spark Cassandra Connector proper usage

I'm looking to use Spark for some ETL, which will mostly consist of "update" statements (a column is a set that will be appended to, so a simple insert is likely not going to work). As such, issuing CQL queries to import the data seems like the best option. Using the Spark Cassandra Connector, I see I can do this:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/1_connecting.md#connecting-manually-to-cassandra
Now, I don't want to open a session and close it for every row in the source (am I right in not wanting this? Usually I have one session for the entire process and keep using that in "normal" apps). However, it says that the connector is serializable but the session is obviously not, so wrapping the whole import inside a single withSessionDo seems like it will cause problems. I was thinking of using something like this:
class CassandraStorage(conf: SparkConf) {
  val session = CassandraConnector(conf).openSession()

  def store(t: Thingy): Unit = {
    // session.execute cql goes here
  }
}
Is this a good approach? Do I need to worry about closing the session? Where / how best would I do that? Any pointers are appreciated.
You actually do want to use withSessionDo because it won't actually open and close a session on every access. Under the hood, withSessionDo accesses a JVM level session. This means you will only have one session object PER cluster configuration PER node.
This means code like
val connector = CassandraConnector(sc.getConf)
sc.parallelize(1L to 10000000L).map(x => connector.withSessionDo(session => stuff))
Will only ever make 1 cluster and session object on each executor JVM regardless of how many cores each machine has.
For efficiency, I would still recommend using mapPartitions to minimize cache checks.
sc.parallelize(1L to 10000000L)
  .mapPartitions(it => connector.withSessionDo(session =>
    it.map(row => /* do stuff here */ row)))
In addition, the session object also uses a prepared-statement cache, which lets you cache a prepared statement in your serialized code; it will only ever be prepared once per JVM (all other calls will return the cached reference).
