Differences: Object instantiation within mapPartitions vs outside - apache-spark

I'm a beginner with Apache Spark. Spark's RDD API offers transformation functions like map and mapPartitions. I understand that map works on each element of the RDD while mapPartitions works on each partition, and many people have mentioned that mapPartitions is ideally used where we want to do object creation/instantiation, with examples like:
val rddData = sc.textFile("sample.txt")
val res = rddData.mapPartitions(iterator => {
  // Do object instantiation here
  // Use that instantiated object in applying the business logic
})
My question is: can't we do the same thing with the map function itself, by doing the object instantiation outside the map function, like:
val rddData = sc.textFile("sample.txt")
val obj = InstantiatingSomeObject
val res = rddData.map(element =>
  // Use the instantiated object 'obj' and do something with data
)
I could be wrong in my fundamental understanding of map and mapPartitions; if the question is based on a wrong assumption, please correct me.

All objects that you create outside of your lambdas are created on the driver. They are captured in the lambda's closure, so they must be serializable, and they are shipped over the network to the executors along with the tasks that run the lambda.
If you create the object inside the lambda instead, a map lambda is executed once per data element, so the object is constructed once per element; with mapPartitions, the construction happens only once per partition. However, even when an object could be created on the driver and shipped, it is usually better to create it inside your mapPartitions lambda: in many cases the object is not serializable at all (a database connection, for example), and then you have no choice but to create it inside the lambda.
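A minimal sketch of that per-partition pattern, assuming a hypothetical non-serializable DbConnection class (the names and the JDBC URL are made up for illustration):
// Hypothetical, non-serializable resource, used only for illustration
class DbConnection(url: String) {
  def lookup(key: String): String = ???  // placeholder
}

val rddData = sc.textFile("sample.txt")
val res = rddData.mapPartitions { iterator =>
  // Constructed once per partition, on the executor, so it never has to be serialized
  val conn = new DbConnection("jdbc:some-url")
  iterator.map(line => conn.lookup(line))
}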

Related

Spark forEach vs Map functions

I was going through a review and it stated:
forEach forces all of the data to be sent to a single process (the Driver)
which will cause issues (such as OutOfMemory issues) at scale. Instead the
map() function serves the same purpose and distributes processing across
different Worker nodes in the cluster.
Is this correct? I could not find any document which says forEach is not distributed while map is distributed.
I believe you are talking about applying these functions to an RDD or Dataset.
Nothing comes to the driver in either case; all of this code is executed on the executors. foreach is an action that returns nothing, whereas map() is a transformation from one value to another.
def foreach(f: (T) ⇒ Unit): Unit
Applies a function f to all elements of this RDD.
foreach(func) Run a function func on each element of the dataset. This is usually done for side effects such as updating an Accumulator or interacting with external storage systems.
Note: modifying variables other than Accumulators outside of the foreach() may result in undefined behavior. See Understanding closures for more details.
def map[U](f: (T) ⇒ U)(implicit arg0: ClassTag[U]): RDD[U]
Return a new RDD by applying a function to all elements of this RDD.
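To make the contrast concrete, here is a small sketch on an RDD of numbers; the accumulator is just one typical side effect, and the names are invented for the example:
val numbers = sc.parallelize(1 to 10)

// map is a transformation: it returns a new, distributed RDD
val doubled = numbers.map(_ * 2)

// foreach is an action with no return value: it runs on the executors,
// typically for side effects such as updating an accumulator
val sum = sc.longAccumulator("sum")
numbers.foreach(n => sum.add(n))
println(sum.value)  // read on the driver once the action has completed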

How to create RDD inside map function

I have an RDD of key/value pairs, and for each key I need to call a function that accepts an RDD. So I tried RDD.map, and inside the map I created an RDD using sc.parallelize(value) and passed that RDD to my function, but since Spark does not support creating an RDD within an RDD, this does not work.
Can you please suggest a solution for this situation?
I am looking for a solution like the one suggested in the thread below, but the problem I have is that my keys are not fixed and I can have any number of keys.
How to create RDD from within Task?
Thanks
That doesn't sound quite right. If the function needs to process the key/value pair, it should receive the pair as its parameter, not an RDD.
But if you really want to pass an RDD as a parameter, do it outside the chained operations: create a reference to the RDD after preprocessing and pass that reference to the method.
No, you shouldn't create an RDD inside an RDD.
Depending on the size of your data, there are two solutions:
1) If there are many keys and each key does not have too many values, turn the function that accepts an RDD into a function that accepts an Iterable (a small sketch of that signature change follows after the second option). Then you can do something like
// rdd: RDD[(keyType, valueType)]
rdd.groupByKey()
  .map { case (key, values) =>
    func(values)
  }
2) If there are few keys and each key has many values, you should not group, because that would collect all the values for a key onto a single executor, which may cause an OutOfMemoryError. Instead, run a job for each key, like
rdd.keys.distinct().collect()
  .foreach { key =>
    func(rdd.filter(_._1 == key))
  }
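As promised under the first option, here is a minimal sketch of the signature change; funcOnRdd, func and the element type are hypothetical:
import org.apache.spark.rdd.RDD

// Before: the function expected a distributed collection
def funcOnRdd(values: RDD[Int]): Double = ???  // hypothetical original signature

// After: it works on the values collected for one key by groupByKey
def func(values: Iterable[Int]): Double =
  values.sum.toDouble / values.size  // placeholder computation (the mean)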

How do I run RDD operations after a groupby in Spark?

I have a large set of data that I want to perform clustering on. The catch is, I don't want one clustering for the whole set, but a clustering for each user. Essentially I would do a groupby userid first, then run KMeans.
The problem is, once you do a groupby, any mapping would be outside the spark controller context, so any attempt to create RDDs would fail. Spark's KMeans lib in mllib requires an RDD (so it can parallelize).
I see two workarounds, but I was hoping there was a better solution.
1) Manually loop through all the thousands of users in the controller (maybe millions when things get big), and run kmeans for each of them.
2) Do the groupby in the controller, then in map run a non-parallel kmeans provided by an external library.
Please tell me there is another way; I'd rather have everything as parallel as possible.
Edit: I didn't know it was PySpark at the time of this response. However, I will leave it here as an idea that may be adapted.
I had a similar problem and I was able to improve the performance, but it was still not the ideal solution for me. Maybe it could work for you.
The idea was to break the RDD in many smaller RDDs (a new one for each user id), saving them to an array, then calling the processing function (clustering in your case) for each "sub-RDD". The suggested code is given below (explanation in the comments):
// A case class just to use as an example
case class MyClass(userId: Long, value: Long, ...)

// A Scala local collection with the user IDs (could also be a List or an Array):
val userList: Seq[Long] = rdd.map { _.userId }.distinct.collect.toSeq // Just a suggestion!

// Now we can create the new RDDs:
val rddsList: Seq[RDD[MyClass]] = userList.map { userId =>
  rdd.filter { item: MyClass => item.userId == userId }
}.toSeq

// Finally, we call the function we want for each RDD, saving the results in a new list.
// Note the ".par" call, which is used to start the expensive execution for multiple RDDs at the same time
val results = rddsList.par.map { r =>
  myFunction(r)
}
I know this is roughly the same as your first option, but by using the .par call, I was able to improve the performance.
This call turns the rddsList object into a ParSeq. This Scala collection allows parallel computation, so, ideally, the map call will invoke myFunction(r) for multiple RDDs at once, which can improve performance.
For more details about parallel collections, please check the Scala Documentation.
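One practical note that may save a compile error: on Scala 2.13 and later, the parallel collections live in a separate module, so the .par call needs the scala-parallel-collections dependency and an extra import (a sketch, independent of the Spark/Scala versions used in the original answer):
// build.sbt:
// libraryDependencies += "org.scala-lang.modules" %% "scala-parallel-collections" % "1.0.4"
import scala.collection.parallel.CollectionConverters._

val results = rddsList.par.map(r => myFunction(r))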

Accessing client side objects and code

A Spark application needs to validate each element in an RDD.
Given a driver/client-side Scala object called Validator, which of the following two solutions is better:
rdd.filter { x => if (Validator.isValid(x.somefield)) true else false }
or something like
// get the list of field values to validate against (collected to the driver)
val list = rdd.map(x => x.somefield).collect()
// Use the Validator to check which ones are invalid
val invalidElements = Validator.getValidElements().diff(list)
// remove invalid elements from the RDD
rdd.filter(x => !invalidElements.contains(x.somefield))
The second solution avoids referencing the driver-side object from within the function passed to the RDD. The invalid elements are determined on the client, and that list is then passed back to the RDD.
Or is neither recommended?
Thanks
If I understand you correctly (i.e. you have an object Validator), that's not driver-only code, because your job's jar will also be distributed to the workers. So a Scala object you define will also be instantiated in the executor JVM. (That's also why you don't receive a serialization exception, in contrast to using methods defined in the job, e.g. in Spark Streaming with checkpointing.)
The first version should perform better because you filter first. Mapping over all of the data and then filtering it will be slower.
The second version is also problematic because if you are creating a list of valid elements on the driver, you now have to ship it back to the workers.
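A minimal sketch of the first version, just to make the point about the singleton concrete; Validator, Record and somefield are stand-ins for whatever the real object and field are:
// A stand-in validator; as a Scala object it is initialized lazily
// in whichever JVM first touches it - driver or executor
object Validator {
  private val validValues = Set("a", "b", "c")  // hypothetical reference data
  def isValid(field: String): Boolean = validValues.contains(field)
}

case class Record(somefield: String)

// rdd: RDD[Record], assumed to exist
// The filter runs on the executors; each executor JVM gets its own
// copy of the Validator singleton from the application jar
val valid = rdd.filter(x => Validator.isValid(x.somefield))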

Caching in Spark

A function is defined to transform an RDD. Therefore, the function is called once for each element in the RDD.
The function needs to call an external web service to look up reference data, passing as a parameter data from the current element in the RDD.
Two questions:
Is there an issue with issuing a web service call within Spark?
The data from the web service needs to be cached. What is the best way to hold (and subsequently reference) the cached data? The simple way would be to hold the cache in a collection inside the Scala class that contains the function being passed to the RDD. Would this be efficient, or is there a better approach to caching in Spark?
Thanks
There isn't really any mechanism for "caching" (in the sense that you mean). It seems like the best approach would be to split this task into two phases:
Get the distinct "keys" by which you must access the external lookup, and perform the lookup once for each key
Use this mapping to perform the lookup for each record in the RDD
I'm assuming there would potentially be many records accessing the same lookup key (otherwise "caching" won't be of any value anyway), so performing the external calls for the distinct keys is substantially faster.
How should you implement this?
If you know this set of distinct keys is small enough to fit into your driver machine's memory:
map your data into the distinct keys by which you'd want to cache these fetched values, and collect it, e.g.: val keys = inputRdd.map(/* get key */).distinct().collect()
perform the fetching on driver-side (not using Spark)
use the resulting Map[Key, FetchedValues] in any transformation on your original RDD - it will be serialized and sent to each worker where you can perform the lookup. For example, assuming the input has records for which the foreignId field is the lookup key:
val keys = inputRdd.map(record => record.foreignId).distinct().collect()
val lookupTable = keys.map(k => (k, fetchValue(k))).toMap
val withValues = inputRdd.map(record => (record, lookupTable(record.foreignId)))
Alternatively - if this map is large (but still can fit in driver memory), you can broadcast it before you use it in RDD transformation - see Broadcast Variables in Spark's Programming Guide
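A minimal sketch of that broadcast variant, reusing the hypothetical fetchValue and foreignId from above:
val keys = inputRdd.map(record => record.foreignId).distinct().collect()
val lookupTable = keys.map(k => (k, fetchValue(k))).toMap

// A broadcast variable is shipped once per executor and cached there,
// instead of being serialized into every task's closure
val lookupBroadcast = sc.broadcast(lookupTable)

val withValues = inputRdd.map(record =>
  (record, lookupBroadcast.value(record.foreignId)))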
Otherwise (if this map might be too large), you'll need to use a join if you want to keep the data in the cluster, while still refraining from fetching the same key twice:
val byKeyRdd = inputRdd.keyBy(record => record.foreignId)
val lookupTableRdd = byKeyRdd
  .keys
  .distinct()
  .map(k => (k, fetchValue(k))) // this time fetchValue runs in the cluster - concurrently for different keys
val withValues = byKeyRdd.join(lookupTableRdd)
