How to broadcast the content of an RDD efficiently - apache-spark

So I need to broadcast some content from an RDD to all worker nodes, and I am trying to do it as efficiently as possible.
More specifically, an RDD is created dynamically in the middle of the execution. To broadcast some of its content to all the worker nodes, an obvious solution would be to traverse its elements one by one, build a list/vector/hashmap holding the needed content during the traversal, and then broadcast that data structure to the cluster.
This does not seem like a good solution at all, since the RDD can be huge and is already distributed; traversing it and building an array/list from the traversal result would be very slow.
So what would be a better solution, or best practice, for this case? Would it be a good idea to run a SQL query on the RDD (after converting it to a DataFrame) to get the needed content, and then broadcast the query result to all the worker nodes?
Thank you in advance for your help!
The following was added after reading Varslavans' answer:
An RDD is created dynamically and has the following content:
[(1,1), (2,5), (3,5), (4,7), (5,1), (6,3), (7,2), (8,2), (9,3), (10,3), (11,3), ...... ]
So this RDD contains key-value pairs. What we want is to collect all the pairs whose value is > 3, so pairs (2,5), (3,5), (4,7), ... will be collected. Once we have collected all these pairs, we would like to broadcast them so that all the worker nodes have them.
Sounds like we should use collect() on the RDD and then broadcast... at least that seems to be the best solution at this point.
Thanks again!

First of all - you don't need to traverse the RDD to get all its data. There is an API for that - collect().
Second: broadcast is not the same as distributed.
With broadcast, you have all the data on each node.
With distributed, each node holds a different part of the whole.
An RDD is distributed by its nature.
Third: to get the needed content you can either use the RDD API or convert it to a DataFrame and use SQL queries; it depends on the data you have. Either way, the result will be an RDD or a DataFrame, and it will also be distributed. So if you need the data locally - you collect() it.
Btw, from your question it's not possible to tell exactly what you want to do, and it looks like you need to read up on Spark basics. That will answer a lot of questions :)

Related

Is it possible to outperform the Catalyst optimizer on highly skewed data using only RDDs

I am reading High Performance Spark and the author introduces a technique that can be used to perform joins on highly skewed data by selectively filtering the data to build a HashMap with the data containing the most common keys. This HashMap is then sent to all the partitions to perform a broadcast join. The resulting data are concatenated with a union operation at the very end.
Apologies in advance, but the text does not give an example of this technique using code, so I cannot share a code snippet to illustrate it.
Text follows.
Sometimes not all of our smaller RDD will fit into memory, but some keys are so overrepresented in the large dataset that you want to broadcast just the most common keys. This is especially useful if one key is so large that it can't fit on a single partition. In this case you can use countByKeyApprox on the large RDD to get an approximate idea of which keys would most benefit from a broadcast. You then filter the smaller RDD for only these keys, collecting the result locally in a HashMap. Using sc.broadcast you can broadcast the HashMap so that each worker only has one copy and manually perform the join against the HashMap. Using the same HashMap you can then filter your large RDD down to not include the large number of duplicate keys and perform your standard join, uniting it with the result of your manual join. This approach is quite convoluted but may allow you to handle highly skewed data you couldn't otherwise process.
For those who don't know, a broadcast join is a technique where the user can avoid a shuffle incurred when joining two chunks of data by sending the smaller chunk to every single executor. Each executor then performs the join on its own. The idea is that the shuffle is so expensive that having each executor perform the join and then discard the data it doesn't need is sometimes the best way to go.
The text describes a situation where part of a chunk of data can be extracted and joined using a broadcast join. The result of the join is then unioned with the rest of the data.
The reason why this might be necessary is that excessive shuffling can usually be avoided by making sure that records with the same key from the two chunks end up in the same partition, so that the same executor handles them. However, there are situations where a single key is too large to fit in a single partition. In that case, the author suggests that separating the overrepresented key into a HashMap and performing a broadcast join on just that key may be a good idea.
Is this a good idea? Moreover, a technique like this seems very situational, so Catalyst probably does not use it. Is that correct? If so, does that mean that on highly skewed data this RDD-based technique can beat Catalyst operating on DataFrames or Datasets?

Spark Broadcasting Alternatives

Our application uses a long-running Spark context (much like the Spark REPL) to enable users to perform tasks online. We use Spark broadcasts heavily to process dimensional data. As is common practice, we broadcast the dimension tables and use the DataFrame API to join the fact table with the other dimension tables. One of the dimension tables is quite big: it has about 100k records and is about 15 MB in memory (Kryo-serialized it is just a few MB smaller).
We see that every Spark job on the denormalized DataFrame causes all the dimensions to be broadcast over and over again. The bigger table takes ~7 seconds every time it is broadcast. We are trying to find a way to have the dimension tables broadcast only once per context lifespan. We tried both sqlContext and sparkContext broadcasting.
Are there any other alternatives to Spark broadcasting? Or is there a way to reduce the memory footprint of the DataFrame (compression/serialization, etc. - post-Kryo it is still 15 MB :( )?
Possible Alternative
We use the Ignite Spark integration to load a large amount of data at the start of the job and keep mutating it as needed.
In embedded mode you can start Ignite when the Spark context boots and kill it at the end.
You can read more about it here:
https://ignite.apache.org/features/igniterdd.html
Finally we were able to find a stopgap solution until Spark supports pinning of RDDs in a later version. This is apparently not addressed even in v2.1.0.
The solution relies on RDD mapPartitions; below is a brief summary of the approach:
Collect the dimension table records as a map of key-value pairs and broadcast it using the Spark context. You can possibly use RDD.keyBy.
Map the fact rows using the RDD mapPartitions method.
For each fact row, inside mapPartitions:
- collect the dimension IDs in the fact row and look up the dimension records
- yield a new fact row by denormalizing the dimension IDs in the fact table

Why these two Spark RDDs generation ways have different data localities?

I am running two different ways of RDD generation on a local machine; the first way is:
rdd = sc.range(0, 100).sortBy(lambda x: x, numPartitions=10)
rdd.collect()
The second way is:
rdd = sc.parallelize(xrange(100), 10)
rdd.collect()
But in my Spark UI, they show different data localities, and I don't know why. Below is the result from the first way; it shows the Locality Level (the 5th column) is ANY.
And the result from the second way shows the Locality Level is PROCESS_LOCAL:
And I read at https://spark.apache.org/docs/latest/tuning.html that the PROCESS_LOCAL level is usually faster than the ANY level for processing.
Is this because the sortBy operation gives rise to a shuffle, which then influences the data locality? Can someone give me a clearer explanation?
You are correct.
In the first snippet you first create a parallelized collection, meaning your driver tells each worker to create some part of the collection. Then, for the sort, each worker node needs access to data on other nodes; the data needs to be shuffled around, and data locality is lost.
The second code snippet is effectively not even a distributed job.
As Spark uses lazy evaluation, nothing is done until you make a call to materialize the results, in this case using the collect method. The steps in your second computation are effectively:
Distribute the object of type list from driver to worker nodes
Do nothing on each worker node
Collect the distributed objects from the workers to create an object of type list on the driver.
Spark is smart enough to realize that there is no reason to distribute the list even though parallelize is called. Since the data resides and the computation is done on the same single node, data locality is obviously preserved.
EDIT:
Some additional info on how Spark does sort.
Spark operates on the underlying MapReduce model (the programming model, not the Hadoop implementation) and sort is implemented as a single map and a reduce. Conceptually, on each node in the map phase, the part of the collection that a particular node operates on is sorted and written to memory. The reducers then pull relevant data from the mappers, merge the results and create iterators.
So, for your example, let's say you have a mapper that wrote numbers 21-34 to memory in sorted order. Let's say the same node has a reducer that is responsible for numbers 31-40. The reducer gets information from driver where the relevant data is. The numbers 31-34 are pulled from the same node and data only has to travel between threads. The other numbers however can be on arbitrary nodes in the cluster and need to be transferred over the network. Once the reducer has pulled all the relevant data from the nodes, the shuffle phase is over. The reducer now merges the results (like in mergesort) and creates an iterator over the sorted part of the collection.
http://blog.cloudera.com/blog/2015/01/improving-sort-performance-in-apache-spark-its-a-double/

In spark streaming can I create RDD on worker

I want to know how I can create an RDD on a worker, say containing a Map. This Map/RDD will be small, and I want the RDD to reside completely on one machine/executor (I guess repartition(1) can achieve this). Further, I want to be able to cache this Map/RDD on the local executor and use it in tasks running on that executor for lookups.
How can I do this?
No, you cannot create an RDD on a worker node. Only the driver can create RDDs.
A broadcast variable seems to be the solution in your situation. It will send the data to all workers, but if your map is small, that shouldn't be an issue.
You cannot control which partition your RDD will be placed on, so you cannot just do repartition(1) - you don't know whether the RDD will be placed on the same node ;) A broadcast variable will be on every node, so lookups will be very fast.
You can create an RDD in your driver program using sc.parallelize(data). For storing a Map, it can be split into two parts - keys and values - and then stored in an RDD/DataFrame as two separate columns.

How to update an RDD?

We are developing a Spark framework wherein we move historical data into RDDs.
Basically, an RDD is an immutable, read-only dataset on which we perform operations.
Based on that, we have moved historical data into RDDs, and we do computations like filtering, mapping, etc. on such RDDs.
Now there is a use case where a subset of the data in the RDD gets updated and we have to recompute the values.
HistoricalData is in the form of an RDD.
I create another RDD based on the request scope and save the reference to that RDD in a ScopeCollection.
So far I have been able to think of below approaches -
Approach 1: broadcast the change
For each change request, my server fetches the scope-specific RDD and spawns a job.
In the job, apply a map phase on that RDD -
2.a. for each element in the RDD, do a lookup in the broadcast variable and create a new, updated value, thereby creating a new RDD
2.b. now do all the computations again on this new RDD from step 2.a, like multiplication, reduction, etc.
2.c. save this RDD's reference back in my ScopeCollection
Approach 2: create an RDD for the updates
For each change request, my server fetches the scope-specific RDD and spawns a job.
On each RDD, do a join with a new RDD containing the changes.
Now do all the computations again on this joined RDD from step 2, like multiplication, reduction, etc.
Approach 3:
I had thought of creating a streaming RDD where I keep updating the same RDD and doing re-computation. But as far as I understand, it can take streams from Flume or Kafka, whereas in my case the values are generated within the application itself based on user interaction.
Hence I cannot see any integration points for streaming RDDs in my context.
Any suggestion on which approach is better, or any other approach suitable for this scenario?
TIA!
The use case presented here is a good match for Spark Streaming. The two other options raise the question: "How do you submit a re-computation of the RDD?"
Spark Streaming offers a framework to continuously submit work to Spark based on some stream of incoming data and preserve that data in RDD form. Kafka and Flume are only two possible Stream sources.
You could use Socket communication with the SocketInputDStream, reading files in a directory using FileInputDStream or even using shared Queue with the QueueInputDStream. If none of those options fit your application, you could write your own InputDStream.
In this usecase, using Spark Streaming, you will read your base RDD and use the incoming dstream to incrementally transform the existing data and maintain an evolving in-memory state. dstream.transform will allow you to combine the base RDD with the data collected during a given batch interval, while the updateStateByKey operation could help you build an in-memory state addressed by keys. See the documentation for further information.
Without more details on the application it is hard to go to the code level on what's possible using Spark Streaming. I'd suggest you explore this path and ask new questions about any specific topics.
I suggest taking a look at the IndexedRDD implementation, which provides an updatable RDD of key-value pairs. That might give you some insights.
The idea is based on knowledge of the keys, which allows you to zip your updated chunk of data with the same keys in the already-created RDD. During an update it is possible to filter out the previous version of the data.
Having historical data, I'd say you need some sort of identity for an event.
Regarding streaming and consumption, it's possible to use a TCP port. This way the driver might open a TCP connection that Spark expects to read from, and send updates there.
