pyspark to broadcast or not to broadcast [duplicate] - apache-spark

I am going through Spark Programming guide that says:
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
Considering the above, what are the use cases of broadcast variables? What problems do broadcast variables solve?
When we create a broadcast variable like the one below, is the variable reference (here broadcastVar) available on all the nodes in the cluster?
val broadcastVar = sc.broadcast(Array(1, 2, 3))
How long do these variables stay available in the memory of the nodes?

If you have a huge array that is accessed from Spark closures, for example some reference data, this array will be shipped to each Spark node with the closure. For example, if you have a 10-node cluster with 100 partitions (10 partitions per node), this array will be shipped at least 100 times (at least 10 times to each node).
If you use broadcast, it will be distributed once per node using an efficient p2p protocol.
val array: Array[Int] = ??? // some huge array
val broadcasted = sc.broadcast(array)
And some RDD
val rdd: RDD[Int] = ???
In this case, the array will be shipped with the closure each time:
rdd.map(i => array.contains(i))
and with broadcast, you'll get a huge performance benefit
rdd.map(i => broadcasted.value.contains(i))
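For reference, here is a minimal self-contained version of the comparison above; the array, the RDD and all names are small stand-ins chosen for illustration:
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("broadcast-vs-closure").master("local[*]").getOrCreate()
val sc = spark.sparkContext

val lookup: Array[Int] = (1 to 1000).toArray             // stands in for the huge reference array
val rdd = sc.parallelize(1 to 1000000, numSlices = 100)

// Closure version: `lookup` is serialized into every task that runs on the 100 partitions
val closureHits = rdd.filter(i => lookup.contains(i)).count()

// Broadcast version: `lookup` is shipped once per executor and cached there
val bLookup = sc.broadcast(lookup)
val broadcastHits = rdd.filter(i => bLookup.value.contains(i)).count()

println(s"closure=$closureHits broadcast=$broadcastHits")  // both count the same 1000 matches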

Broadcast variables are used to send shared data (for example application configuration) across all nodes/executors.
The broadcast value will be cached in all the executors.
Sample Scala code creating a broadcast variable on the driver:
val broadcastedConfig: Broadcast[Option[Config]] = sparkSession.sparkContext.broadcast(objectToBroadcast)
Sample Scala code reading the broadcast variable on the executor side:
val config = broadcastedConfig.value
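To make the two fragments above concrete, here is a hedged end-to-end sketch; Config, the contents of objectToBroadcast and the sample data are illustrative assumptions, and sparkSession is the same session as above:
// Driver side: broadcast a (hypothetical) application config
case class Config(minLength: Int, allowedPrefix: String)
val objectToBroadcast: Option[Config] = Some(Config(minLength = 3, allowedPrefix = "app_"))
val broadcastedConfig = sparkSession.sparkContext.broadcast(objectToBroadcast)

// Executor side: the cached value is read inside each task
val filtered = sparkSession.sparkContext
  .parallelize(Seq("app_users", "tmp", "app_events"))
  .filter { name =>
    val config = broadcastedConfig.value   // already cached on the executor
    config.exists(c => name.startsWith(c.allowedPrefix) && name.length >= c.minLength)
  }
filtered.collect()   // Array(app_users, app_events)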

Related

How does Spark broadcast the data in a broadcast join?

How does Spark broadcast the data when we use a broadcast join with a hint? As far as I can see, when we use the broadcast hint it calls this function:
def broadcast[T](df: Dataset[T]): Dataset[T] = {
  Dataset[T](df.sparkSession,
    ResolvedHint(df.logicalPlan, HintInfo(strategy = Some(BROADCAST))))(df.exprEnc)
}
which internally calls the apply method of Dataset and sets the logicalPlan using ResolvedHint:
val dataset = new Dataset(sparkSession, logicalPlan, implicitly[Encoder[T]])
But what happens after this? How does it actually work, and where is the code for that?
1. What if the small dataset we are going to broadcast has multiple partitions; does Spark combine all partitions and then broadcast?
2. Does it broadcast to the driver first and then go to the executors?
3. What is BitTorrent?
Regarding 1 & 2: during a broadcast join the data is collected on the driver; what happens later depends on the join algorithm:
For BroadcastHashJoin (BHJ), the driver builds a hash table and then this table is distributed to the executors.
For BroadcastNestedLoopJoin, the broadcast dataset is distributed as an array to the executors.
So as you can see, the initial structure is not kept, and the whole broadcast dataset needs to fit into the driver's memory (otherwise the job will fail with an OOM error on the driver).
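For reference, a minimal sketch of how the hint is typically applied from user code (the DataFrame names and the join key are made up for illustration):
import org.apache.spark.sql.functions.broadcast

// `small` is collected on the driver, turned into a hash table (for BHJ) and shipped to the executors
val joined = large.join(broadcast(small), Seq("id"))
joined.explain()   // the physical plan should show BroadcastExchange / BroadcastHashJoin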
Regarding 3: what exactly do you want to know?
In Spark there is TorrentBroadcast, which is a BitTorrent-like implementation of broadcast. I don't know much about it (I never had to dig that deep), but if you want to know more, I think you can start here (a small configuration sketch follows the links below):
TorrentBroadcast docu
TorrentBroadcast source code
HttpBroadcast docu - another broadcast implementation
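As a small pointer, TorrentBroadcast is the default broadcast mechanism, and a couple of its knobs are exposed through ordinary Spark configuration; a hedged sketch:
import org.apache.spark.sql.SparkSession

// A couple of settings that influence the BitTorrent-like (TorrentBroadcast) mechanism
val spark = SparkSession.builder()
  .appName("broadcast-config")
  .master("local[*]")
  .config("spark.broadcast.blockSize", "4m")    // size of the blocks the broadcast data is chunked into
  .config("spark.broadcast.compress", "true")   // compress broadcast blocks before sending them
  .getOrCreate()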

What is the difference between passing a local variable vs a broadcast variable to a Spark pipeline?

consider the code below:
val rdd: RDD[String] = domainsRDD()
val blacklistDomains: Set[String] = readDomainsBlacklist()
rdd.filter(domain => !blacklistDomains.contains(domain))
vs the code where the blacklisted domains are broadcast:
val rdd: RDD[String] = domainsRDD()
val bBlacklistDomains: Broadcast[Set[String]] = sc.broadcast(readDomainsBlacklist())
rdd.filter(domain => !bBlacklistDomains.value.contains(domain))
Apart from the fact that a broadcast variable can be removed from the executors (via bBlacklistDomains.destroy()), are there any other reasons to use it (performance?)?
(Please note that in the first code example the blacklist is a local variable, so no serialization issue appears.)
There is none: local variables used in stages are automatically broadcast.
Spark automatically broadcasts the common data needed by tasks within each stage.
The data broadcasted this way is cached in serialized form and deserialized before running each task.
This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
From the docs: https://spark.apache.org/docs/latest/rdd-programming-guide.html#broadcast-variables
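For completeness, a hedged sketch of the case the quoted docs describe, where an explicit broadcast pays off because the same blacklist is reused by tasks in more than one stage/job; otherRdd and the counts are assumptions added for illustration:
// Shipped once and cached on each executor; reused by every job below
val bBlacklistDomains = sc.broadcast(readDomainsBlacklist())

val kept    = rdd.filter(domain => !bBlacklistDomains.value.contains(domain)).count()     // first job
val dropped = otherRdd.filter(domain => bBlacklistDomains.value.contains(domain)).count() // otherRdd: another RDD[String], assumed for illustration

bBlacklistDomains.unpersist()   // drop the cached copies on the executors; it can be re-broadcast if used again
// bBlacklistDomains.destroy()  // or release it completely once it will never be used again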

How to make a huge array visible to all worker nodes in Spark

I am using the Spark Java API to implement the A-Priori algorithm described in MMD, chapter 6, and the algorithm needs a huge int array like this:
frequent_item[i] = x, // i is a big integer, x is some integer
Now, how do I make this array visible to all the worker nodes in the cluster? More specifically,
can sc.broadcast(frequent_item) be used for this purpose?
does this mean this huge array will have a copy in the memory of each worker node?
what would be the best practice guideline for things like this?
Thanks, as always!
Broadcast is the right approach.
val y = sc.broadcast(frequent_item) will broadcast frequent_item; y becomes a Broadcast[Array[Int]] and its value can be accessed with y.value.
To access the i-th element:
val element = y.value(i) // Scala notation
Does this mean this huge array will have a copy in the memory of each worker node? Yes, there will be a copy of the data on each node.
Best practice:
a.) Estimate the size of the broadcast variable and size the executor and driver memory with this in mind.
b.) Broadcast only when needed.
c.) Unpersist the broadcast variable once it is no longer used (see the sketch below).
For more information read Spark Broadcast.
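Putting the pieces above together, a hedged sketch of the full lifecycle; the array contents, sizes and the map over it are illustrative only:
val frequent_item: Array[Int] = Array.fill(1 << 20)(0)   // illustrative size; the real array comes from the algorithm
val y = sc.broadcast(frequent_item)                      // one copy cached per executor, not one per task
val n = frequent_item.length                             // capture only the length in the closure, not the array itself

val total = sc.parallelize(0 until 1000000, numSlices = 100)
  .map(i => y.value(i % n))                              // read the i-th element on the executors
  .sum()

y.unpersist()   // free executor memory once the variable is no longer needed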

How to execute computation per executor in Spark

In my computation, I
first broadcast some data, say bc,
then compute some big data shared by all executors/partitions: val shared = f(bc)
then run the distributed computing, using shared data
To avoid computing the shared data on all RDD items, I can use .mapPartitions, but I have many more partitions than executors, so it runs the computation of the shared data more times than necessary.
I found a simple method to compute the shared data only once per executor (which, as I understand it, is the JVM actually running the Spark tasks): using a lazy val on the broadcast data.
// class to be broadcast
case class BC(input: WhatEver) {
  lazy val shared = f(input)
}

// in the Spark code
val sc = ...                      // init SparkContext
val bc = sc.broadcast(BC(...))
val initRdd = sc.parallelize(1 to 10000, numSlices = 10000)
initRdd.map { i =>
  val shared = bc.value.shared    // computed lazily; intended to run once per executor JVM
  ...                             // run computation using the shared data
}
I think this does what I want, but:
I am not sure; can someone guarantee it?
I am not sure a lazy val is the best way to manage concurrent access, especially with respect to Spark's internal distribution system. Is there a better way?
If computing shared fails, I think it will be recomputed for all RDD items, with possible retries, instead of simply stopping the whole job with a single error.
So, is there a better way?

How to broadcast RDD in PySpark?

Is it possible to broadcast an RDD in Python?
I am following the book "Advanced Analytics with Spark: Patterns for Learning from Data at Scale", and in chapter 3 an RDD needs to be broadcast. I'm trying to follow the examples using Python instead of Scala.
Anyway, even with this simple example I have an error:
my_list = ["a", "d", "c", "b"]
my_list_rdd = sc.parallelize(my_list)
sc.broadcast(my_list_rdd)
The error being:
Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an action or transformation. RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
I don't really understand what "action or transformation" the error is referring to.
I am using spark-2.1.1-hadoop2.7.
Important edit: the book is correct. I just failed to notice that it wasn't the RDD itself being broadcast, but a map version of it obtained with collectAsMap().
Thanks!
Is it possible to broadcast an RDD in Python?
TL;DR No.
When you think about what an RDD really is, you'll find it's simply not possible. There is nothing in an RDD you could broadcast; it's too fragile (so to speak).
An RDD is a data structure that describes a distributed computation over some dataset. Through the features of an RDD you can describe what to compute and how. It is an abstract entity.
Quoting the scaladoc of RDD:
Represents an immutable, partitioned collection of elements that can be operated on in parallel
Internally, each RDD is characterized by five main properties:
A list of partitions
A function for computing each split
A list of dependencies on other RDDs
Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
There's not much you could broadcast as (quoting SparkContext.broadcast method's scaladoc):
broadcast[T](value: T)(implicit arg0: ClassTag[T]): Broadcast[T] Broadcast a read-only variable to the cluster, returning a org.apache.spark.broadcast.Broadcast object for reading it in distributed functions. The variable will be sent to each cluster only once.
You can only broadcast a real value, but an RDD is just a container of values that are only available when executors process its data.
From Broadcast Variables:
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.
And later in the same document:
This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
You could however collect the dataset an RDD holds and broadcast it as follows:
my_list = ["a", "d", "c", "b"]
my_list_rdd = sc.parallelize(my_list)
sc.broadcast(my_list_rdd.collect())  # <-- collect the dataset
At "collect the dataset" step, the dataset leaves an RDD space and becomes a locally-available collection, a Python value, that can be then broadcast.
You cannot broadcast an RDD. You broadcast values to all your executor nodes so they can be used multiple times while processing your RDD. So in your code you should collect your RDD before broadcasting it. collect converts an RDD into a local Python object, which can be broadcast without issues.
sc.broadcast(my_list_rdd.collect())
When you broadcast a value, the value is serialized and sent over the network to all the executor nodes. Your my_list_rdd is just a reference to an RDD whose data is distributed across multiple nodes; serializing and broadcasting that reference wouldn't mean anything on a worker node. So you should collect the values of your RDD and broadcast those values instead.
More information on Spark broadcast can be found here.
Note: if your RDD is too large, the application might run into an OutOfMemory error; collect pulls all the data into the driver's memory, which usually isn't large enough.
