How to create RDD inside map function - apache-spark

I have an RDD of key/value pairs, and for each key I need to call a function that accepts an RDD. I tried RDD.map and, inside the map, created an RDD with sc.parallelize(value) and passed it to my function, but this does not work because Spark does not support creating an RDD within an RDD.
Can you please suggest a solution for this situation?
I am looking for a solution like the one suggested in the thread below, but my problem is that my keys are not fixed and I can have any number of keys.
How to create RDD from within Task?
Thanks

It doesn't sound quite right. If the function needs to process the key/value pair, it should receive the pair as its parameter, not an RDD.
But if you really want to pass an RDD as a parameter, then instead of doing it inside the chained operation, you can create a reference to the RDD after preprocessing and pass that reference to the method.

No, you shouldn't create an RDD inside an RDD.
Depending on the size of your data, there are two possible solutions:
1) If there are many keys and each key does not have too many values, turn the function that accepts an RDD into a function that accepts an Iterable. Then you can do something like
// rdd: RDD[(keyType, valueType)]
rdd.groupByKey()
  .map { case (key, values) =>
    func(values)
  }
2) If there are few keys and each key has many values, you should not do a group, as it would collect all values for a key onto a single executor, which may cause an OutOfMemory error. Instead, run a job for each key, like
rdd.keys.distinct().collect()
  .foreach { key =>
    func(rdd.filter(_._1 == key))
  }

Related

Apache Spark - map and filter and take(1)

I know how the map and filter transformations are used, but I want to clarify something. map changes the content of every element of an RDD one by one. If I use myrdd.map().filter().take(1), does map() stop when the first element passes the filter function? Or does the whole map() execute first, and only then does the filter take effect?
I'm trying to transform every RDD element so that, as soon as an element satisfies a condition, map() stops and that element is returned.
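The documentation suggests that take() does not necessarily run the map and filter over the entire RDD: it scans one partition first and only moves on to further partitions if that was not enough to satisfy the limit. From the take() documentation: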
Take the first num elements of the RDD.
It works by first scanning one partition, and use the results from
that partition to estimate the number of additional partitions needed
to satisfy the limit.
Translated from the Scala implementation in RDD#take().
Note this method should only be used if the resulting array is
expected to be small, as all the data is loaded into the driver’s
memory.
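The "scanning one partition" behaviour in the quoted documentation can be pictured with plain Python iterators. This is only an analogy, not Spark itself: it assumes that, within a partition, elements are pulled through map and filter one at a time, and take_from_partition and its call counter are hypothetical names made up for the illustration.

```python
from itertools import islice


def take_from_partition(partition, transform, predicate, num):
    """Pull elements lazily through map -> filter and stop after `num` matches.

    Returns the taken elements and how many times `transform` was called.
    """
    calls = {"transform": 0}

    def counted(x):
        calls["transform"] += 1
        return transform(x)

    mapped = map(counted, partition)      # lazy: nothing is computed yet
    filtered = filter(predicate, mapped)  # lazy as well
    taken = list(islice(filtered, num))   # stops pulling once `num` items are found
    return taken, calls["transform"]


# One "partition" with a million elements (hypothetical data).
partition = range(1, 1_000_001)
taken, n_calls = take_from_partition(
    partition,
    transform=lambda x: x * 3,
    predicate=lambda y: y % 5 == 0,
    num=1,
)
print(taken, n_calls)  # [15] 5
```

In this sketch the transform runs only five times, because the fifth element is the first one whose result passes the filter. How many elements a real job touches still depends on how the data is partitioned and on which partition the first match is in.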

Grouping related keys of an RDD together

I have a generated RDD with a set of key/value pairs. Assume that the keys are [10, 20, 25, 30, 40, 50]. The real keys are nearby geographic bins of size X by X meters that need to be aggregated into bins of size 2X by 2X.
So in this RDD I need to aggregate keys that are related to each other, for example a key that is twice the current key, say 10 and 20. These are added together to give 30, and their values are added together as well. Similarly, the result set would be [30, 25, 70, 50].
I am assuming that since map and reduce work on the current key of an element in an RDD, there is no way to do this using map, groupByKey, or aggregateByKey, as the grouping I want needs the state of the previous key.
I was thinking the only way to do this is to iterate through the elements of the RDD using foreach and, for each element, also pass in the entire RDD.
def group_rdds_together(rdd,rdd_list):
    key,val = rdd
    xbin,ybin = key
    rdd_list.foreach(group_similar_keys,xbin,ybin)
bin_rdd.map(lambda x : group_rdds_together(rdd,bin_rdd))
For that I have to pass the RDD to the map lambda, as well as custom parameters to the foreach function.
What I am doing is horribly wrong; I just wanted to illustrate where I am going with this. There should be a simpler and better way than this.
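One possible way to avoid needing the "previous key" is to compute each record's coarse bin from its own key and then sum by that new key. The sketch below is plain Python and rests on an assumption that the question does not confirm: that keys are integer (xbin, ybin) index pairs, so that a 2X by 2X bin is found by integer-dividing each index by 2. The names coarsen and aggregate_bins are hypothetical.

```python
from collections import defaultdict


def coarsen(key, factor=2):
    """Map a fine (xbin, ybin) index pair to the coarse bin that contains it."""
    xbin, ybin = key
    return (xbin // factor, ybin // factor)


def aggregate_bins(pairs, factor=2):
    """Sum the values of all fine bins that fall into the same coarse bin."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[coarsen(key, factor)] += value
    return dict(totals)


# Hypothetical fine-grained bins: ((xbin, ybin), value)
fine = [((0, 0), 1), ((1, 0), 2), ((0, 1), 3), ((2, 0), 4), ((3, 1), 5)]
coarse = aggregate_bins(fine)
print(coarse)  # {(0, 0): 6, (1, 0): 9}
```

Because the new key depends only on the record itself, the same idea should carry over to an RDD as a map followed by reduceByKey, for example bin_rdd.map(lambda kv: (coarsen(kv[0]), kv[1])).reduceByKey(lambda a, b: a + b), without passing one RDD into another.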

How do I run RDD operations after a groupby in Spark?

I have a large set of data that I want to perform clustering on. The catch is, I don't want one clustering for the whole set, but a clustering for each user. Essentially I would do a groupby on userid first, then run KMeans.
The problem is that once you do a groupby, any mapping runs outside the Spark controller context, so any attempt to create RDDs there fails. Spark's KMeans lib in mllib requires an RDD (so it can parallelize).
I see two workarounds, but I was hoping there was a better solution.
1) Manually loop through all the thousands of users in the controller (maybe millions when things get big), and run KMeans for each of them.
2) Do the groupby in the controller, then in map run a non-parallel KMeans provided by an external library.
Please tell me there is another way; I'd rather have everything as parallel as possible.
Edit: I didn't know it was pyspark at the time of this response. However, I will leave it here as an idea that may be adapted.
I had a similar problem and I was able to improve the performance, but it was still not the ideal solution for me. Maybe it could work for you.
The idea was to break the RDD into many smaller RDDs (a new one for each user id), save them to an array, and then call the processing function (clustering in your case) for each "sub-RDD". The suggested code is given below (explanation in the comments):
// A case class just to use as an example
case class MyClass(userId: Long, value: Long, ...)
// A local Scala sequence with the user IDs (could be another collection, such as List or Array):
val userList: Seq[Long] = rdd.map { _.userId }.distinct().collect().toSeq // Just a suggestion!
// Now we can create the new RDDs (filter is lazy, so nothing is computed yet):
val rddsList: Seq[RDD[MyClass]] = userList.map { userId =>
  rdd.filter({ item: MyClass => item.userId == userId })
}.toSeq
// Finally, we call the function we want for each RDD, saving the results in a new list.
// Note the ".par" call, which is used to start the expensive execution for multiple RDDs at the same time.
// (On Scala 2.13+, ".par" needs the scala-parallel-collections module and
// import scala.collection.parallel.CollectionConverters._)
val results = rddsList.par.map { r =>
  myFunction(r)
}
I know this is roughly the same as your first option, but by using the .par call, I was able to improve the performance.
This call transforms the rddsList object to a ParSeq object. This new Scala object allows parallel computation, so, ideally, the map function will call myFunction(r) for multiple RDDs at once, which can improve the performance.
For more details about parallel collections, please check the Scala Documentation.

How to access Map modified in RDD, in the driver program of Apache Spark?

Need help.
I am working with Apache Spark 1.2.0 and I am stuck on an issue.
It goes like this:
I am running a map function on an RDD, in which I create some object instances and store them in a ConcurrentMap against some key. After the map function has finished, I need the data that was stored in the ConcurrentMap in the driver program, but as of now the map is empty outside the map function.
Is this at all possible? Can I access it by any means?
Thanks
Nitin
I think you are misusing Spark or misunderstanding the concept. The function you pass to map is serialized and run on the executors, so each task updates its own copy of the ConcurrentMap and the driver's copy is never touched. What you want can be achieved with the mapPartitions function. It gives you an iterator over all the rows in an input RDD partition, so you know when processing has finished and can either save the ConcurrentMap you have built to persistent storage or return its iterator as the function result.
If you elaborate on your use case or attach the code, I will be able to recommend the right solution for you.
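As a rough picture of the "return its iterator as the function result" option, here is a plain-Python sketch. It assumes the per-partition function receives an iterator of (key, payload) rows. The name build_partition_map, the sample data, and the upper-casing step that stands in for "creating an object instance" are all hypothetical.

```python
def build_partition_map(rows):
    """Runs once per partition: fill a local dict, then hand its entries back."""
    local = {}
    for key, payload in rows:
        local[key] = payload.upper()  # stand-in for creating an object instance
    return iter(local.items())        # the entries become the function's result


# Two hypothetical partitions of (key, payload) rows.
partitions = [[("a", "x"), ("b", "y")], [("c", "z")]]

# "Driver side": merge what each partition returned instead of sharing one map.
driver_map = {}
for part in partitions:
    driver_map.update(build_partition_map(iter(part)))

print(driver_map)  # {'a': 'X', 'b': 'Y', 'c': 'Z'}
```

With a real RDD, a function of this shape would be passed to mapPartitions and the results brought back with collect(). That is only reasonable when the combined map is small enough for the driver's memory.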

Caching in Spark

A function is defined to transform an RDD. Therefore, the function is called once for each element in the RDD.
The function needs to call an external web service to look up reference data, passing as a parameter data from the current element in the RDD.
Two questions:
1) Is there an issue with issuing a web service call within Spark?
2) The data from the web service needs to be cached. What is the best way to hold (and subsequently reference) the cached data? The simple way would be to hold the cache in a collection within the Scala class that contains the function being passed to the RDD. Would this be efficient, or is there a better approach for caching in Spark?
Thanks
There isn't really any mechanism for "caching" (in the sense that you mean). It seems the best approach would be to split this task into two phases:
1) Get the distinct "keys" by which you must access the external lookup, and perform the lookup once for each key.
2) Use this mapping to perform the lookup for each record in the RDD.
I'm assuming there would potentially be many records accessing the same lookup key (otherwise "caching" won't be of any value anyway), so performing the external calls only for the distinct keys is substantially faster.
How should you implement this?
If you know this set of distinct keys is small enough to fit into your driver machine's memory:
1) Map your data into the distinct keys by which you'd want to cache these fetched values, and collect them, e.g.: val keys = inputRdd.map(/* get key */).distinct().collect()
2) Perform the fetching on the driver side (not using Spark).
3) Use the resulting Map[Key, FetchedValues] in any transformation on your original RDD. It will be serialized and sent to each worker, where you can perform the lookup. For example, assuming the input has records for which the foreignId field is the lookup key:
val keys = inputRdd.map(record => record.foreignId).distinct().collect()
val lookupTable = keys.map(k => (k, fetchValue(k))).toMap // fetched on the driver, once per distinct key
val withValues = inputRdd.map(record => (record, lookupTable(record.foreignId)))
Alternatively, if this map is large (but can still fit in driver memory), you can broadcast it before you use it in an RDD transformation; see Broadcast Variables in Spark's Programming Guide.
Otherwise (if this map might be too large), you'll need to use join if you want to keep the data in the cluster but still refrain from fetching the same element twice:
val byKeyRdd = inputRdd.keyBy(record => record.foreignId)
val lookupTableRdd = byKeyRdd
  .keys
  .distinct()
  .map(k => (k, fetchValue(k))) // this time fetchValue is done in the cluster - concurrently for different values
val withValues = byKeyRdd.join(lookupTableRdd)
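To make the benefit of the two-phase approach concrete, the following plain-Python sketch counts how many external calls are made. It does not use Spark. fetch_value is a hypothetical stand-in for the web service call, and the record layout (dicts with a foreignId field) is assumed for illustration.

```python
def fetch_value(key, call_log):
    """Hypothetical stand-in for the external web service call."""
    call_log.append(key)  # record each remote call so we can count them
    return f"reference-data-for-{key}"


def enrich(records):
    """Two-phase lookup: fetch once per distinct key, then attach to every record."""
    call_log = []
    # Phase 1: one lookup per distinct key.
    distinct_keys = sorted({r["foreignId"] for r in records})
    lookup_table = {k: fetch_value(k, call_log) for k in distinct_keys}
    # Phase 2: attach the fetched value to every record.
    with_values = [(r, lookup_table[r["foreignId"]]) for r in records]
    return with_values, call_log


records = [
    {"id": 1, "foreignId": "A"},
    {"id": 2, "foreignId": "B"},
    {"id": 3, "foreignId": "A"},
    {"id": 4, "foreignId": "A"},
]
with_values, call_log = enrich(records)
print(len(records), len(call_log))  # 4 2
```

Four records lead to only two external calls here. The saving grows with the number of records that share a lookup key, which is the assumption the answer above relies on.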
