Apache Spark: comparison of map vs flatMap vs mapPartitions vs mapPartitionsWithIndex - apache-spark

Apache Spark: comparison of map vs flatMap vs mapPartitions vs mapPartitionsWithIndex
Suggestions are welcome to improve our knowledge.

map(func)
What does it do? Pass each element of the RDD through the supplied function; i.e. func
flatMap(func)
“Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).”
Compare flatMap to map in the following
mapPartitions(func)
Consider mapPartitions a tool for performance optimization. It won’t do much for you when running examples on your local machine compared to running across a cluster. It’s the same as map, but works with Spark RDD partitions. Remember the first D in RDD is “Distributed” – Resilient Distributed Datasets. Or, put another way, you could say it is distributed over partitions.
mapPartitionsWithIndex(func)
Similar to mapPartitions, but also provides a function with an Int value to indicate the index position of the partition.
If we change the above example to use a parallelize’d list with 3 slices, our output changes significantly:

Related

What's the overhead of converting an RDD to a DataFrame and back again?

It was my assumption that Spark Data Frames were built from RDDs. However, I recently learned that this is not the case, and Difference between DataFrame, Dataset, and RDD in Spark does a good job explaining that they are not.
So what is the overhead of converting an RDD to a DataFrame, and back again? Is it negligible or significant?
In my application, I create a DataFrame by reading a text file into an RDD and then custom-encoding every line with a map function that returns a Row() object. Should I not be doing this? Is there a more efficient way?
RDDs have a double role in Spark. First of all is the internal data structure for tracking changes between stages in order to manage failures and secondly until Spark 1.3 was the main interface for interaction with users. Therefore after after Spark 1.3 Dataframes constitute the main interface offering much richer functionality than RDDs.
There is no significant overhead when converting one Dataframe to RDD with df.rdd since the dataframes they already keep an instance of their RDDs initialized therefore returning a reference to this RDD should not have any additional cost. On the other side, generating a dataframe from an RDD requires some extra effort. There are two ways to convert an RDD to dataframe 1st by calling rdd.toDF() and 2nd with spark.createDataFrame(rdd, schema). Both methods will evaluate lazily although there will be an extra overhead regarding the schema validation and execution plan (you can check the toDF() code here for more details). Of course that would be identical to the overhead that you have just by initializing your data with spark.read.text(...) but with one less step, the conversion from RDD to dataframe.
This the first reason that I would go directly with Dataframes instead of working with two different Spark interfaces.
The second reason is that when using the RDD interface you are missing some significant performance features that dataframes and datasets offer related to Spark optimizer (catalyst) and memory management (tungsten).
Finally I would use the RDDs interface only if I need some features that are missing in dataframes such as key-value pairs, zipWithIndex function etc. But even then you can access those via df.rdd which is costless as already mentioned. As for your case , I believe that would be faster to use directly a dataframe and use the map function of that dataframe to ensure that Spark leverages the usage of tungsten ensuring efficient memory management.

Apache Spark - Iterators and Memory consumption

I am a newbie to the spark and have question regarding spark memory usage with iterators.
When using Foreach() or MapPartitions() of Datasets (or even a direct call to iterator() function of RDD), does spark needs to load the entire partition to RAM first (assuming partition is in disk) or can data be lazy loaded as we continue to iterate (meaning that spark can load only part of the partition data execute task and save to disk the intermediate result)
The first difference of those two is that forEach() is an action when mapPartition() is a transformation. It would be more meaningful to compare forEach with forEachPartition since they are both actions and they both work on the final-accumulated data on the driver. Refer here for a detailed discussions over those two. As for the memory consumption it really depends on how much data you return to the driver. As a rule of thumb remember to return the results on the driver using methods like limit(), take(), first() etc and avoid using collect() unless you are sure that the data can fit on driver's memory.
The mapPartition can be compared with the map or flatMap functions and they will modify the dataset's data by applying some transformation. mapPartition is more efficient since it will execute the given func fewer times when map will do the same of each item in the dataset. Refer here for more details about these two functions.

Spark: Shuffle with Parallelism = 1

I am running Spark on only one node with Parallelism = 1 in order to compare its performance with a single-threaded application. I'm wondering if Spark is still using a Shuffle although it does not run in parallel. So if e.g. the following command is executed:
val counts = text_file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_+_)
I get the following output from the Spark Interactive Scala Shell:
counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[10]
at reduceByKey at <console>:41
So should I assume, that a Shuffle was used before reduceByKey? And does [10] actually has any meaning?
I'm wondering if Spark is still using a Shuffle although it does not run in parallel.
Yes it does. It is worth noting that even with a single core number of partitions may be much larger than one. For example if RDD is created using SparkContext.textFile number of partitions depends on a size of the file system block.
So should I assume, that a Shuffle was used before reduceByKey
No, shuffle is a fundamental part of the reduceByKey logic so it was used during reduceByKey, not before. Simplifying things a little bit shuffle is an equivalent of creating a hash table. Assuming only a single partition it doesn't perform any useful task but is still present.
And does [10] actually has any meaning?
It is a unique (in the current SparkContext) ID for a given RDD. For example if RDD is persisted then the number you see should be a key in SparkContext.getPersistentRDDs.

Does UpdateStateByKey in Spark shuffles the data across

I'm a newbie in Spark and i would like to understand whether i need to aggregate the DStream data by key before calling updateStateByKey?
My application basically counts the number of words in every second using Spark Streaming where i perform couple of map operations before doing a state-full update as follows,
val words = inputDstream.flatMap(x => x.split(" "))
val wordDstream = words.map(x => (x, 1))
val stateDstream = wordDstream.updateStateByKey(UpdateFunc _)
stateDstream.print()
Say after the second Map operation, same keys (words) might present across worker nodes due to various partitions, So i assume that the updateStateByKey method internally shuffles and aggregates the key values as Seq[Int] and calls the updateFunc. Is my assumption correct?
correct: as you can see in the method signature it takes an optional partitionNum/Partitioner argument, which denotes the number of reducers i.e. state updaters. This leads to a shuffle.
Also, I suggest to explicitly put a number there otherwise Spark may significantly decrease your job's parallelism trying to run tasks locally with respect to the location of the blocks of the HDFS checkpoint files
updateStateByKey() does not shuffle the state , rather the new data is brought to the nodes containing the state for the same key.
Link to Tathagat's answer to a similar question : https://www.mail-archive.com/user#spark.apache.org/msg43512.html

What is the result of RDD transformation in Spark?

Can anyone explain, what is the result of RDD transformations? Is it the new set of data (copy of data) or it is only new set of pointers, to filtered blocks of old data?
RDD transformations allow you to create dependencies between RDDs. Dependencies are only steps for producing results (a program). Each RDD in lineage chain (string of dependencies) has a function for calculating its data and has a pointer (dependency) to its parent RDD. Spark will divide RDD dependencies into stages and tasks and send those to workers for execution.
So if you do this:
val lines = sc.textFile("...")
val words = lines.flatMap(line => line.split(" "))
val localwords = words.collect()
words will be an RDD containing a reference to lines RDD. When the program is executed, first lines' function will be executed (load the data from a text file), then words' function will be executed on the resulting data (split lines into words). Spark is lazy, so nothing will get executed unless you call some transformation or action that will trigger job creation and execution (collect in this example).
So, an RDD (transformed RDD, too) is not 'a set of data', but a step in a program (might be the only step) telling Spark how to get the data and what to do with it.
Transformations create new RDD based on the existing RDD. Basically, RDD's are immutable.
All transformations in Spark are lazy.Data in RDD's is not processed until an acton is performed.
Example of RDD transformations:
map,filter,flatMap,groupByKey,reduceByKey
As others have mentioned, an RDD maintains a list of all the transformations which have been programmatically applied to it. These are lazily evaluated, so though (in the REPL, for example), you may get a result back of a different parameter type (after applying a map, for example), the 'new' RDD doensn't yet contain anything, because nothing has forced the original RDD to evaluate the transformations / filters which are in its lineage. Methods such as count, the various reduction methods, etc will cause the transportations to be applied. The checkpoint method applies all RDD actions as well, returning an RDD which is the result of the transportations but has no lineage (this can be a performance advantage, especially with iterative applications).
All answers are perfectly valid. I just want to add a quick picture :-)
Transformations are kind of operations which will transform your RDD data from one form to another. And when you apply this operation on any RDD, you will get a new RDD with transformed data (RDDs in Spark are immutable, Remember????). Operations like map, filter, flatMap are transformations.
Now there is a point to be noted here and that is when you apply the transformation on any RDD it will not perform the operation immediately. It will create a DAG(Directed Acyclic Graph) using the applied operation, source RDD and function used for transformation. And it will keep on building this graph using the references till you apply any action operation on the last lined up RDD. That is why the transformation in Spark are lazy.
The other answers give a good explanation already. Here are my some cents:
To know well what's inside that returned RDD, it'd better to check what's inside the RDD abstract class (quoted from source code):
Internally, each RDD is characterized by five main properties:
A list of partitions
A function for computing each split
A list of dependencies on other RDDs
Optionally, a Partitioner for key-value RDDs (e.g. to say that the RDD is hash-partitioned)
Optionally, a list of preferred locations to compute each split on (e.g. block locations for an HDFS file)

Resources