I am just getting the hang of Spark, and I have a function that needs to be mapped to an RDD but uses a global dictionary:
from pyspark import SparkContext

sc = SparkContext('local[*]', 'pyspark')

my_dict = {"a": 1, "b": 2, "c": 3, "d": 4}  # at no point will be modified
my_list = ["a", "d", "c", "b"]

def my_func(letter):
    return my_dict[letter]

my_list_rdd = sc.parallelize(my_list)
result = my_list_rdd.map(lambda x: my_func(x)).collect()
print(result)
The above gives the expected result; however, I am really not sure about my use of the global variable my_dict. It seems that a copy of the dictionary is made for every partition, and that just does not feel right.
It looked like broadcast was what I was looking for. However, when I tried to use it:
my_dict_bc = sc.broadcast(my_dict)

def my_func(letter):
    return my_dict_bc[letter]
I get the following error:
TypeError: 'Broadcast' object has no attribute '__getitem__'
This seems to imply that I cannot broadcast a dictionary.
My question: if I have a function that uses a global dictionary and needs to be mapped to an RDD, what is the proper way to do it?
My example is very simple, but in reality my_dict and my_list are much larger, and my_func is more complicated.
You forgot something important about Broadcast objects: they have a property called value where the data is stored.
Therefore you need to modify my_func to something like this:
my_dict_bc = sc.broadcast(my_dict)

def my_func(letter):
    return my_dict_bc.value[letter]
The proper way to do it depends on how the read-only shared variables (the dictionary in your case) will be accessed in the rest of the program. In the case you described, you don't need to use a broadcast variable. From the Spark programming guide section on broadcast variables:
Spark automatically broadcasts the common data needed by tasks within each stage. The data broadcasted this way is cached in serialized form and deserialized before running each task. This means that explicitly creating broadcast variables is only useful when tasks across multiple stages need the same data or when caching the data in deserialized form is important.
In your case, if the data is only needed in the single map stage, there is no need to explicitly broadcast the variable (it is not "useful"). However, if the same dictionary were to be used later in another stage, then you might wish to use broadcast to avoid serializing and deserializing the dictionary before each stage.
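For illustration, here is a rough sketch (not your code, just the same names) of a two-stage job where the same dictionary is reused after a shuffle, so the explicit broadcast pays off:
my_dict_bc = sc.broadcast(my_dict)

# Stage 1: look each letter up and key the result by the letter.
pairs = my_list_rdd.map(lambda letter: (letter, my_dict_bc.value[letter]))

# groupByKey introduces a shuffle, i.e. a new stage.
grouped = pairs.groupByKey()

# Stage 2: the same broadcast dictionary is used again; because it was
# broadcast explicitly, it is not re-shipped with the tasks of this stage.
result = grouped.map(lambda kv: (kv[0], sum(kv[1]) + my_dict_bc.value[kv[0]])).collect()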
Related
Let's say I have a large global dask array called dataset that has two dimensions.
I also have a function that I want to submit to my dask workers, which simply takes the sum of a slice of the large dask array:
def process_data(idx):
    return sum(dataset[idx, :].compute())
The above method only passes in a parameter called idx (via something like idx_bag.map(process_data)) to slice the global dask array, which is accessible within the function. I thought this was good because dask arrays are lazy, more like references to the data than the data itself, so passing in the global dask array is like passing in a memory address, and .compute() is only called on a slice of the large dask array.
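For concreteness, the first workflow is submitted roughly like this (a sketch; the index range and npartitions are illustrative):
import dask.bag as db

# One item per row index; each task slices the global dask array and
# computes only that slice on a worker.
idx_bag = db.from_sequence(range(dataset.shape[0]), npartitions=8)
results = idx_bag.map(process_data).compute()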
Alternatively, we can call .compute() or .persist() on each partition of the large dask array on the client first, obtaining concrete numpy arrays for those partitions, and then pass these into
def process_data2(data_part):
    return sum(data_part)
which avoids calling .compute() inside each worker.
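A sketch of this second workflow with the futures API, assuming a dask.distributed Client is available (the block-level details are illustrative):
from dask.distributed import Client

client = Client()  # or connect to an existing scheduler

# Materialize each chunk of the array as a concrete numpy block (one future
# per chunk), then map the plain-numpy function over those futures.
block_futures = client.compute(list(dataset.to_delayed().ravel()))
result_futures = client.map(process_data2, block_futures)
results = client.gather(result_futures)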
However, I think these submitted functions are really intended to take only parameters as arguments, not actual data, as described here under Using the Futures API.
I am having some memory trouble (see this post) with the first workflow, so I am wondering what is the best way to access a slice of a large dask array within a worker function.
Thanks!
I have an RDD of key/value pairs, and for each key I need to call a function that accepts an RDD. So I tried RDD.map and, inside the map, created an RDD using the sc.parallelize(value) method and sent this RDD to my function, but since Spark does not support creating an RDD within an RDD, this is not working.
Can you please suggest a solution for this situation?
I am looking for a solution as suggested in the thread below, but the problem I am having is that my keys are not fixed and I can have any number of keys.
How to create RDD from within Task?
Thanks
That doesn't sound quite right. If the function needs to process the key/value pair, it should receive the pair as the parameter, not an RDD.
But if you really want to send an RDD as a parameter, then instead of doing it inside a chained operation, create a reference to the RDD after preprocessing on the driver and pass that reference to the method.
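A minimal PySpark sketch of the first suggestion above, with illustrative names (kv_rdd is your pair RDD, and process_values stands in for your function, rewritten to take one key and its values instead of an RDD):
def process_values(key, values):
    # plain Python over the values of a single key
    return key, sum(values)

result = (kv_rdd.groupByKey()  # RDD[(key, Iterable[value])]
                .map(lambda kv: process_values(kv[0], list(kv[1])))
                .collect())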
No, you shouldn't create an RDD inside an RDD.
Depending on the size of your data, there are two solutions:
1) If there are many keys and each key does not have too many values: turn the function that accepts an RDD into a function that accepts an Iterable. Then you can do something like
// rdd: RDD[(keyType, valueType)]
rdd.groupByKey()
  .map { case (key, values) =>
    func(values)
  }
2) If there are few keys and each key has many values: then you should not group, as that would collect all the values for a key onto a single executor, which may cause an OutOfMemoryError. Instead, run a job for each key, like
rdd.keys.distinct().collect()
  .foreach { key =>
    func(rdd.filter(_._1 == key))
  }
I have a large set of data that I want to perform clustering on. The catch is, I don't want one clustering for the whole set, but a clustering for each user. Essentially I would do a groupby userid first, then run KMeans.
The problem is, once you do a groupby, any mapping happens outside the Spark controller (driver) context, so any attempt to create RDDs there would fail. Spark's KMeans implementation in MLlib requires an RDD (so it can parallelize).
I see two workarounds, but I was hoping there was a better solution.
1) Manually loop through all the thousands of users in the controller (maybe millions when things get big), and run kmeans for each of them.
2) Do the groupby in the controller, then in a map run a non-parallel KMeans provided by an external library (roughly as sketched below).
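For concreteness, workaround 2) would look roughly like this in PySpark, assuming scikit-learn is available on the executors (user_points_rdd, k, and the feature layout are illustrative):
import numpy as np
from sklearn.cluster import KMeans

k = 3  # illustrative number of clusters per user

# user_points_rdd: RDD[(user_id, feature_vector)]
per_user_centers = (user_points_rdd
                    .groupByKey()  # RDD[(user_id, Iterable[feature_vector])]
                    .mapValues(lambda vecs: KMeans(n_clusters=k)
                               .fit(np.array(list(vecs)))
                               .cluster_centers_)
                    .collectAsMap())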
Please tell me there is another way; I'd rather have everything as parallel as possible.
Edit: I didn't know this was PySpark at the time I wrote the response. However, I will leave it here as an idea that may be adapted.
I had a similar problem and I was able to improve the performance, but it was still not the ideal solution for me. Maybe it could work for you.
The idea is to break the RDD into many smaller RDDs (a new one for each user ID), save them in an array, and then call the processing function (clustering in your case) on each "sub-RDD". The suggested code is given below (explanation in the comments):
// A case class just to use as example
case class MyClass(userId: Long, value: Long, ...)

// A Scala local array with the user IDs (could be another iterator, such as List or Array):
val userList: Seq[Long] = rdd.map { _.userId }.distinct.collect.toSeq // Just a suggestion!

// Now we can create the new RDDs:
val rddsList: Seq[RDD[MyClass]] = userList.map {
  userId => rdd.filter({ item: MyClass => item.userId == userId })
}.toSeq

// Finally, we call the function we want for each RDD, saving the results in a new list.
// Note the ".par" call, which is used to start the expensive execution for multiple RDDs at the same time
val results = rddsList.par.map {
  r => myFunction(r)
}
I know this is roughly the same as your first option, but by using the .par call, I was able to improve the performance.
This call transforms the rddsList object to a ParSeq object. This new Scala object allows parallel computation, so, ideally, the map function will call myFunction(r) for multiple RDDs at once, which can improve the performance.
For more details about parallel collections, please check the Scala Documentation.
Currently, inside a transformation, I am reading one file and creating a HashMap, and it is a static field for reuse purposes.
For each and every record I need to check whether the HashMap contains the corresponding key or not. If it matches the record's key, then I get the value from the HashMap.
What is the best way to do this?
Should I broadcast this HashMap and use it inside the transformation? [HashMap or ConcurrentHashMap]
Will broadcast make sure the HashMap always contains the values?
Is there any scenario where the HashMap becomes empty and we need to handle that check as well? [if it's empty, load it again]
Update:
Basically, I need to use the HashMap as a lookup inside a transformation. What is the best way to do this: broadcast or a static variable?
When I use a static variable, for a few records I do not get the correct value from the HashMap. The HashMap contains only 100 elements, but I am comparing it against 25 million records.
First of all, a broadcast variable can be used only for reading, not as a global variable that can be modified as in classic programming (one thread, one computer, procedural programming, etc.). Indeed, you can use a global variable in your code and it can be used in any part of it (even inside maps), but never modified.
As you can see here, Advantages of broadcast variables, they boost performance because having a cached copy of the data on all nodes allows you to avoid repeatedly transporting the same object to every node.
Broadcast variables allow the programmer to keep a read-only variable
cached on each machine rather than shipping a copy of it with tasks.
For example:
rdd = sc.parallelize(range(1000))
broadcast = sc.broadcast({"number":1, "value": 4})
rdd = rdd.map(lambda x: x + broadcast.value["value"])
rdd.collect()
As you can see I access the value inside the dictionary in every iteration of the transformation.
You should broadcast the variable.
Making the variable static will cause the class to be serialized and distributed and you might not want that.
A function is defined to transform an RDD. Therefore, the function is called once for each element in the RDD.
The function needs to call an external web service to look up reference data, passing as a parameter data from the current element in the RDD.
Two questions:
Is there an issue with issuing a web service call within Spark?
The data from the web service needs to be cached. What is the best way to hold (and subsequently reference) the cached data? The simple way would be to hold the cache in a collection within the Scala class which contains the function being passed to the RDD. Would this be efficient, or is there a better approach for caching in Spark?
Thanks
There isn't really any mechanism for "caching" (in the sense that you mean). Seems like the best approach would be to split this task into two phases:
Get the distinct "keys" by which you must access the external lookup, and perform the lookup once for each key
Use this mapping to perform the lookup for each record in the RDD
I'm assuming there would potentially be many records accessing the same lookup key (otherwise "caching" won't be of any value anyway), so performing the external calls for the distinct keys is substantially faster.
How should you implement this?
If you know this set of distinct keys is small enough to fit into your driver machine's memory:
map your data into the distinct keys by which you'd want to cache these fetched values, and collect it, e.g. : val keys = inputRdd.map(/* get key */).distinct().collect()
perform the fetching on driver-side (not using Spark)
use the resulting Map[Key, FetchedValues] in any transformation on your original RDD - it will be serialized and sent to each worker where you can perform the lookup. For example, assuming the input has records for which the foreignId field is the lookup key:
val keys = inputRdd.map(record => record.foreignId).distinct().collect()
val lookupTable = keys.map(k => (k, fetchValue(k))).toMap
val withValues = inputRdd.map(record => (record, lookupTable(record.foreignId)))
Alternatively, if this map is large (but can still fit in driver memory), you can broadcast it before you use it in the RDD transformation; see Broadcast Variables in Spark's Programming Guide.
Otherwise (if this map might be too large), you'll need to use a join if you want to keep the data in the cluster, but still refrain from fetching the same element twice:
val byKeyRdd = inputRdd.keyBy(record => record.foreignId)

val lookupTableRdd = byKeyRdd
  .keys
  .distinct()
  .map(k => (k, fetchValue(k))) // this time fetchValue is done in the cluster - concurrently for different values

val withValues = byKeyRdd.join(lookupTableRdd)