Is it possible to broadcast an RDD in Python?
I am following the book "Advanced Analytics with Spark: Patterns for Learning from Data at Scale", and in chapter 3 an RDD needs to be broadcast. I'm trying to follow the examples using Python instead of Scala.
Even with this simple example I get an error:
my_list = ["a", "d", "c", "b"]
my_list_rdd = sc.parallelize(my_list)
sc.broadcast(my_list_rdd)
The error being:
"It appears that you are attempting to broadcast an RDD or reference an RDD from an "
Exception: It appears that you are attempting to broadcast an RDD or reference an RDD from an
action or transformation. RDD transformations and actions can only be invoked by the driver, n
ot inside of other transformations; for example, rdd1.map(lambda x: rdd2.values.count() * x) i
s invalid because the values transformation and count action cannot be performed inside of the
rdd1.map transformation. For more information, see SPARK-5063.
I don't really understand what "action or transformation" the error is referring to.
I am using spark-2.1.1-hadoop2.7.
Important Edit: the book is correct. I just failed to notice that it wasn't an RDD that was being broadcast but a map version of it obtained with collectAsMap().
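For reference, a minimal sketch of what that approach amounts to (the pair RDD and names here are illustrative, not the book's actual dataset), assuming sc from the PySpark shell:
pairs_rdd = sc.parallelize([("a", 1), ("b", 2), ("c", 3)])
lookup = pairs_rdd.collectAsMap()     # plain Python dict on the driver
lookup_bc = sc.broadcast(lookup)      # broadcasting a local value works fine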
Thanks!
Is it possible to broadcast an RDD in Python?
TL;DR No.
When you think about what an RDD really is, you'll find it's simply not possible. There is nothing in an RDD you could broadcast. It is too fragile (so to speak).
An RDD is a data structure that describes a distributed computation over some dataset. Through the features of an RDD you describe what to compute and how. It is an abstract entity.
Quoting the scaladoc of RDD:
Represents an immutable, partitioned collection of elements that can be operated on in parallel
Internally, each RDD is characterized by five main properties:
A list of partitions
A function for computing each split
A list of dependencies on other RDDs
Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)
There's not much you could broadcast, because (quoting the scaladoc of the SparkContext.broadcast method):
broadcast[T](value: T)(implicit arg0: ClassTag[T]): Broadcast[T]
Broadcast a read-only variable to the cluster, returning a org.apache.spark.broadcast.Broadcast object for reading it in distributed functions. The variable will be sent to each cluster only once.
You can only broadcast a real value, but an RDD is just a container of values that are only available when executors process its data.
From Broadcast Variables:
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner.
And later in the same document:
This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
You could, however, collect the dataset an RDD holds and broadcast that as follows:
my_list = ["a", "d", "c", "b"]
my_list_rdd = sc.parallelize(my_list)
sc.broadcast(my_list_rdd.collect())  # <-- collect the dataset
At "collect the dataset" step, the dataset leaves an RDD space and becomes a locally-available collection, a Python value, that can be then broadcast.
You cannot broadcast an RDD. You broadcast values to all your executor nodes so they can be used multiple times while processing your RDD. So in your code you should collect your RDD before broadcasting it: collect() converts an RDD into a local Python object, which can be broadcast without issues.
sc.broadcast(my_list_rdd.collect())
When you broadcast a value, the value is serialized and sent over the network to all the executor nodes. Your my_list_rdd is just a reference to an RDD that is distributed across multiple nodes. Serializing that reference and broadcasting it to all the worker nodes would mean nothing on a worker node, so you should collect the values of your RDD and broadcast the resulting value instead.
More information on Spark broadcast can be found here.
Note: if your RDD is too large, the application might run into an OutOfMemory error. The collect method pulls all the data into the driver's memory, which usually isn't large enough.
Related
It was my assumption that Spark DataFrames were built from RDDs. However, I recently learned that this is not the case, and Difference between DataFrame, Dataset, and RDD in Spark does a good job of explaining that they are not.
So what is the overhead of converting an RDD to a DataFrame, and back again? Is it negligible or significant?
In my application, I create a DataFrame by reading a text file into an RDD and then custom-encoding every line with a map function that returns a Row() object. Should I not be doing this? Is there a more efficient way?
RDDs play a double role in Spark. First, they are the internal data structure for tracking changes between stages in order to manage failures; second, until Spark 1.3 they were the main interface for interacting with users. After Spark 1.3, DataFrames constitute the main interface, offering much richer functionality than RDDs.
There is no significant overhead when converting a DataFrame to an RDD with df.rdd, since DataFrames already keep an instance of their underlying RDD initialized, so returning a reference to that RDD has no additional cost. Going the other way, generating a DataFrame from an RDD, requires some extra effort. There are two ways to convert an RDD to a DataFrame: calling rdd.toDF() or spark.createDataFrame(rdd, schema). Both methods evaluate lazily, although there is extra overhead for schema validation and building the execution plan (you can check the toDF() code here for more details). Of course, that overhead is identical to what you get just by initializing your data with spark.read.text(...), but with one step less, the conversion from RDD to DataFrame.
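A small sketch of the two directions, assuming the usual spark and sc from a PySpark shell (names and data are illustrative):
from pyspark.sql import Row
from pyspark.sql.types import StringType, StructField, StructType
# DataFrame -> RDD: essentially free, returns the DataFrame's internal RDD of Rows.
df = spark.createDataFrame([Row(word="a"), Row(word="b")])
rdd_of_rows = df.rdd
# RDD -> DataFrame: two routes; both are lazy but carry schema-handling overhead.
rdd = sc.parallelize([("a",), ("b",)])
df1 = rdd.toDF(["word"])
df2 = spark.createDataFrame(rdd, StructType([StructField("word", StringType(), True)]))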
This is the first reason I would go directly with DataFrames instead of working with two different Spark interfaces.
The second reason is that when using the RDD interface you are missing some significant performance features that DataFrames and Datasets offer, related to the Spark optimizer (Catalyst) and memory management (Tungsten).
Finally, I would use the RDD interface only if I need features that are missing in DataFrames, such as key-value pairs, the zipWithIndex function, etc. But even then you can access those via df.rdd, which is costless, as already mentioned. As for your case, I believe it would be faster to work with a DataFrame directly and transform it there, so that Spark leverages Tungsten and ensures efficient memory management.
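As a hedged illustration of that last suggestion (the file name, delimiter and column names are assumptions, not from the question): rather than textFile -> map -> Row() -> createDataFrame, read the file as a DataFrame and derive columns with built-in functions so Catalyst and Tungsten stay in play:
from pyspark.sql import functions as F
df = spark.read.text("data.txt")             # single 'value' column, one row per line
parts = F.split(F.col("value"), ",")
parsed = df.select(parts.getItem(0).alias("id"),
                   parts.getItem(1).alias("label"))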
So I have a need to broadcast some related content from an RDD to all the worker nodes, and I am trying to do it as efficiently as possible.
More specifically, an RDD is created dynamically in the middle of the execution. To broadcast some of its content to all the worker nodes, an obvious solution would be to traverse its elements one by one, build a list/vector/hashmap holding the needed content while traversing, and then broadcast this data structure to the cluster.
This does not seem to be a good solution at all, since the RDD can be huge and it is already distributed; traversing it and building an array/list from the traversal result will be very slow.
So what would be a better solution, or best practice, for this case? Would it be a good idea to run a SQL query on the RDD (after converting it to a DataFrame) to get the needed content, and then broadcast the query result to all the worker nodes?
Thank you for your help in advance!
The following is added after reading Varslavans' answer:
An RDD is created dynamically and it has the following content:
[(1,1), (2,5), (3,5), (4,7), (5,1), (6,3), (7,2), (8,2), (9,3), (10,3), (11,3), ...... ]
So this RDD contains key-value pairs. What we want is to collect all the pairs whose value is > 3, so the pairs (2,5), (3,5), (4,7), ... will be collected. Once we have collected all these pairs, we would like to broadcast them so that all the worker nodes have them.
Sounds like we should use collect() on the RDD and then broadcast... at least this is the best solution at this point.
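A minimal sketch of that plan, assuming sc from the PySpark shell and the example pairs above:
pairs_rdd = sc.parallelize([(1, 1), (2, 5), (3, 5), (4, 7), (5, 1), (6, 3)])
# Keep only pairs whose value is > 3, pull them to the driver, then broadcast.
selected = pairs_rdd.filter(lambda kv: kv[1] > 3).collect()
selected_bc = sc.broadcast(dict(selected))   # workers read selected_bc.value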
Thanks again!
First of all - you don't need to traverse the RDD to get all its data. There is an API for that: collect().
Second: Broadcast is not the same as distributed.
In broadcast - you have all the data on each node
In Distributed - you have different parts of a whole on each node
An RDD is distributed by its nature.
Third: To get the needed content you can either use the RDD API or convert it to a DataFrame and use SQL queries. It depends on the data you have. Either way, the result will be an RDD or a DataFrame and it will also be distributed, so if you need the data locally you collect() it.
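For the DataFrame/SQL route mentioned above, a hedged sketch (the column names are invented; the > 3 filter comes from the question's example):
pairs_df = sc.parallelize([(1, 1), (2, 5), (3, 5), (4, 7)]).toDF(["k", "v"])
pairs_df.createOrReplaceTempView("pairs")
selected = spark.sql("SELECT k, v FROM pairs WHERE v > 3").collect()  # local Rows
selected_bc = sc.broadcast({row.k: row.v for row in selected})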
Btw, from your question it's not really possible to tell what exactly you want to do, and it looks like you need to read up on Spark basics. That will give you many answers :)
I want to cache some data (ndarrays) locally on the worker nodes to do some comparison with the ndarrays arriving in incoming RDDs from Spark Streaming. What is the best way to do this?
Since I want to compare the ndarrays stored in my files with each single ndarray passed in from Spark Streaming, it doesn't seem like I can load that data into an RDD, because I cannot go through one RDD inside the map function of another RDD. I tried loading them into a list on the master node and broadcasting them to the worker nodes, but I got an error that the broadcast variable is not iterable when I tried to go through them and compare with the incoming data.
The issue here was that you need to use the value attribute to read the actual contents of the broadcast variable. Following the example in the comment by @user9613318:
bd_array = sc.broadcast(np.arange(100))
This will create a numpy array for that range and broadcast it to all workers. If you try to use the variable as just bd_array you get a Broadcast object, which has other methods such as unpersist, destroy, etc., and is not iterable. If you read it with bd_array.value you get back the broadcast numpy array, which can be iterated over (some docs here).
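A short sketch of that difference in practice (the comparison RDD is made up for illustration):
import numpy as np
bd_array = sc.broadcast(np.arange(100))
incoming = sc.parallelize([3, 150, 42])
# bd_array is a Broadcast object; bd_array.value is the iterable numpy array.
print(incoming.filter(lambda x: x in bd_array.value).collect())  # [3, 42]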
I am running two different ways of generating RDDs on a local machine. The first way is:
rdd = sc.range(0, 100).sortBy(lambda x: x, numPartitions=10)
rdd.collect()
The second way is:
rdd = sc.parallelize(xrange(100), 10)
rdd.collect()
But my Spark UI showed different data locality, and I don't know why. Below is the result from the first way; it shows the Locality Level (the 5th column) is ANY.
And the result from the second way shows the Locality Level is PROCESS_LOCAL:
I read at https://spark.apache.org/docs/latest/tuning.html that the PROCESS_LOCAL level is usually faster to process than the ANY level.
Is this because the sortBy operation gives rise to a shuffle, which then influences the data locality? Can someone give me a clearer explanation?
You are correct.
In the first snippet you first create a parallelized collection, meaning your driver tells each worker to create some part of the collection. Then, for the sorting, each worker node needs access to data on other nodes; the data needs to be shuffled around, and data locality is lost.
The second code snippet is effectively not even a distributed job.
As Spark uses lazy evaluation, nothing is done until you call something to materialize the results, in this case the collect method. The steps in your second computation are effectively:
Distribute the object of type list from the driver to the worker nodes
Do nothing on each worker node
Collect the distributed objects from the workers to create an object of type list on the driver
Spark is smart enough to realize that there is no reason to distribute the list even though parallelize is called. Since the data resides and the computation is done on the same single node, data locality is obviously preserved.
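One way to check this yourself is the RDD lineage (a hedged sketch; in this Spark version PySpark returns the debug string as bytes, hence the decode):
sorted_rdd = sc.range(0, 100).sortBy(lambda x: x, numPartitions=10)
print(sorted_rdd.toDebugString().decode("utf-8"))   # lineage includes a shuffle boundary
plain_rdd = sc.parallelize(range(100), 10)
print(plain_rdd.toDebugString().decode("utf-8"))    # just a ParallelCollectionRDD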
EDIT:
Some additional info on how Spark does sort.
Spark operates on the underlying MapReduce model (the programming model, not the Hadoop implementation) and sort is implemented as a single map and a reduce. Conceptually, on each node in the map phase, the part of the collection that a particular node operates on is sorted and written to memory. The reducers then pull relevant data from the mappers, merge the results and create iterators.
So, for your example, let's say you have a mapper that wrote numbers 21-34 to memory in sorted order. Let's say the same node has a reducer that is responsible for numbers 31-40. The reducer gets information from the driver about where the relevant data is. The numbers 31-34 are pulled from the same node, so the data only has to travel between threads. The other numbers, however, can be on arbitrary nodes in the cluster and need to be transferred over the network. Once the reducer has pulled all the relevant data from the nodes, the shuffle phase is over. The reducer then merges the results (like in mergesort) and creates an iterator over the sorted part of the collection.
http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/
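To make the map/reduce-style sort above a bit more concrete, a small, hedged illustration: after sortBy, each partition ends up holding a roughly contiguous, sorted range of keys (the exact boundaries come from sampling, so they vary):
rdd = sc.parallelize(range(100), 4).sortBy(lambda x: x, numPartitions=4)
# glom() groups each partition's elements into a list so you can inspect them.
for part in rdd.glom().collect():
    if part:
        print(part[0], "...", part[-1])   # e.g. 0 ... 24, 25 ... 49, ...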
Can anyone explain what the result of RDD transformations is? Is it a new set of data (a copy of the data), or is it only a new set of pointers to filtered blocks of the old data?
RDD transformations allow you to create dependencies between RDDs. Dependencies are only steps for producing results (a program). Each RDD in the lineage chain (the string of dependencies) has a function for calculating its data and a pointer (dependency) to its parent RDD. Spark will divide RDD dependencies into stages and tasks and send those to workers for execution.
So if you do this:
val lines = sc.textFile("...")
val words = lines.flatMap(line => line.split(" "))
val localwords = words.collect()
words will be an RDD containing a reference to the lines RDD. When the program is executed, first lines' function will be executed (load the data from a text file), then words' function will be executed on the resulting data (split the lines into words). Spark is lazy, so nothing gets executed unless you call an action that triggers job creation and execution (collect in this example).
So, an RDD (a transformed RDD, too) is not 'a set of data', but a step in a program (which might be the only step) telling Spark how to get the data and what to do with it.
Transformations create a new RDD based on an existing RDD. Basically, RDDs are immutable.
All transformations in Spark are lazy. Data in RDDs is not processed until an action is performed.
Examples of RDD transformations:
map, filter, flatMap, groupByKey, reduceByKey
As others have mentioned, an RDD maintains a list of all the transformations which have been programmatically applied to it. These are lazily evaluated, so although (in the REPL, for example) you may get a result back of a different parameter type (after applying a map, for example), the 'new' RDD doesn't yet contain anything, because nothing has forced the original RDD to evaluate the transformations / filters in its lineage. Methods such as count and the various reduction methods will cause the transformations to be applied. Checkpointing also forces the transformations to be applied (on the next action), saving the result so the RDD no longer carries its lineage (this can be a performance advantage, especially with iterative applications).
All answers are perfectly valid. I just want to add a quick picture :-)
Transformations are the kind of operations that transform your RDD data from one form to another. When you apply such an operation to an RDD, you get a new RDD with the transformed data (RDDs in Spark are immutable, remember?). Operations like map, filter and flatMap are transformations.
Now, a point to note here: when you apply a transformation to an RDD, it does not perform the operation immediately. It creates a DAG (Directed Acyclic Graph) from the applied operation, the source RDD and the function used for the transformation, and it keeps building this graph using the references until you apply an action on the last lined-up RDD. That is why transformations in Spark are lazy.
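A tiny sketch of that laziness, assuming sc from the PySpark shell:
rdd = sc.parallelize(range(5))
mapped = rdd.map(lambda x: x * 10)          # no job runs here, only the DAG grows
filtered = mapped.filter(lambda x: x > 20)  # still no job
print(filtered.collect())                   # the action runs the graph -> [30, 40]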
The other answers already give a good explanation. Here are my two cents:
To know what's inside that returned RDD, it's best to check what's inside the RDD abstract class (quoted from the source code):
Internally, each RDD is characterized by five main properties:
A list of partitions
A function for computing each split
A list of dependencies on other RDDs
Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)