Spark: Shuffle with Parallelism = 1 - apache-spark

I am running Spark on only one node with Parallelism = 1 in order to compare its performance with a single-threaded application. I'm wondering if Spark is still using a Shuffle although it does not run in parallel. So if e.g. the following command is executed:
val counts = text_file.flatMap(line => line.split(" "))
.map(word => (word, 1))
.reduceByKey(_+_)
I get the following output from the Spark Interactive Scala Shell:
counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[10]
at reduceByKey at <console>:41
So should I assume, that a Shuffle was used before reduceByKey? And does [10] actually has any meaning?

I'm wondering if Spark is still using a Shuffle although it does not run in parallel.
Yes it does. It is worth noting that even with a single core number of partitions may be much larger than one. For example if RDD is created using SparkContext.textFile number of partitions depends on a size of the file system block.
So should I assume, that a Shuffle was used before reduceByKey
No, shuffle is a fundamental part of the reduceByKey logic so it was used during reduceByKey, not before. Simplifying things a little bit shuffle is an equivalent of creating a hash table. Assuming only a single partition it doesn't perform any useful task but is still present.
And does [10] actually has any meaning?
It is a unique (in the current SparkContext) ID for a given RDD. For example if RDD is persisted then the number you see should be a key in SparkContext.getPersistentRDDs.

Related

Apache Spark: comparison of map vs flatMap vs mapPartitions vs mapPartitionsWithIndex

Apache Spark: comparison of map vs flatMap vs mapPartitions vs mapPartitionsWithIndex
Suggestions are welcome to improve our knowledge.
map(func)
What does it do? Pass each element of the RDD through the supplied function; i.e. func
flatMap(func)
“Similar to map, but each input item can be mapped to 0 or more output items (so func should return a Seq rather than a single item).”
Compare flatMap to map in the following
mapPartitions(func)
Consider mapPartitions a tool for performance optimization. It won’t do much for you when running examples on your local machine compared to running across a cluster. It’s the same as map, but works with Spark RDD partitions. Remember the first D in RDD is “Distributed” – Resilient Distributed Datasets. Or, put another way, you could say it is distributed over partitions.
mapPartitionsWithIndex(func)
Similar to mapPartitions, but also provides a function with an Int value to indicate the index position of the partition.
If we change the above example to use a parallelize’d list with 3 slices, our output changes significantly:

Managing Spark partitions after DataFrame unions

I have a Spark application that will need to make heavy use of unions whereby I'll be unioning lots of DataFrames together at different times, under different circumstances. I'm trying to make this run as efficiently as I can. I'm still pretty much brand-spanking-new to Spark, and something occurred to me:
If I have DataFrame 'A' (dfA) that has X number of partitions (numAPartitions), and I union that to DataFrame 'B' (dfB) which has Y number of partitions (numBPartitions), then what will the resultant unioned DataFrame (unionedDF) look like, with result to partitions?
// How many partitions will unionedDF have?
// X * Y ?
// Something else?
val unionedDF : DataFrame = dfA.unionAll(dfB)
To me, this seems like its very important to understand, seeing that Spark performance seems to rely heavily on the partitioning strategy employed by DataFrames. So if I'm unioning DataFrames left and right, I need to make sure I'm constantly managing the partitions of the resultant unioned DataFrames.
The only thing I can think of (so as to properly manage partitions of unioned DataFrames) would be to repartition them and then subsequently persist the DataFrames to memory/disk as soon as I union them:
val unionedDF : DataFrame = dfA.unionAll(dfB)
unionedDF.repartition(optimalNumberOfPartitions).persist(StorageLevel.MEMORY_AND_DISK)
This way, as soon as they are unioned, we repartition them so as to spread them over the available workers/executors properly, and then the persist(...) call tells to Spark to not evict the DataFrame from memory, so we can continue working on it.
The problem is, repartitioning sounds expensive, but it may not be as expensive as the alternative (not managing partitions at all). Are there generally-accepted guidelines about how to efficiently manage unions in Spark-land?
Yes, Partitions are important for spark.
I am wondering if you could find that out yourself by calling:
yourResultedRDD.getNumPartitions()
Do I have to persist, post union?
In general, you have to persist/cache an RDD (no matter if it is the result of a union, or a potato :) ), if you are going to use it multiple times. Doing so will prevent spark from fetching it again in memory and can increase the performance of your application by 15%, in some cases!
For example if you are going to use the resulted RDD just once, it would be safe not to do persist it.
Do I have to repartition?
Since you don't care about finding the number of partitions, you can read in my memoryOverhead issue in Spark
about how the number of partitions affects your application.
In general, the more partitions you have, the smaller the chunk of data every executor will process.
Recall that a worker can host multiple executors, you can think of it like the worker to be the machine/node of your cluster and the executor to be a process (executing in a core) that runs on that worker.
Isn't the Dataframe always in memory?
Not really. And that's something really lovely with spark, since when you handle bigdata you don't want unnecessary things to lie in the memory, since this will threaten the safety of your application.
A DataFrame can be stored in temporary files that spark creates for you, and is loaded in the memory of your application only when needed.
For more read: Should I always cache my RDD's and DataFrames?
Union just add up the number of partitions in dataframe 1 and dataframe 2. Both dataframe have same number of columns and same order to perform union operation. So no worries, if partition columns different in both the dataframes, there will be max m + n partitions.
You doesn't need to repartition your dataframe after join, my suggestion is to use coalesce in place of repartition, coalesce combine common partitions or merge some small partitions and avoid/reduce shuffling data within partitions.
If you cache/persist dataframe after each union, you will reduce performance and lineage is not break by cache/persist, in that case, garbage collection will clean cache/memory in case of some heavy memory intensive operation and recomputing will increase computation time for the same, may be this time partial computation is required for clear/removed data.
As spark transformation are lazy, i.e; unionAll is lazy operation and coalesce/repartition is also lazy operation and come in action at the time of first action, so try to coalesce unionall result after an interval like counter of 8 and reduce partition in resulting dataframe. Use checkpoints to break lineage and store data, if there is lots of memory intensive operation in your solution.

How to reduce Spark task counts & avoid group by

All, I am using PySpark & need to join two RDD's but to join them both I need to group all elements of each RDD by the joining key and later perform a join function. This causes additional overheads and I am not sure what a work around can be. Also this is creating a high number of tasks that is in turn increasing the number of files to write to HDFS and slowing overall process by a lot here is a example:
RDD1 = [join_col,{All_Elements of RDD1}] #derived by using groupby join_col)
RDD2 = [join_col,{All_Elements of RDD2}] #derived by using groupby join_col)
RDD3 = RDD1.join(RDD2)
If desired output is grouped and both RDDs are to large to be broadcasted there is not much you can do at the code level. It could be cleaner to simply apply cogroup:
rdd1.cogroup(rdd2)
but there should be no significant difference performance-wise. If you suspect there can be a large data / hash skew you can try different partitioning, for example by using sortByKey but it is unlikely to help you in a general case.

Does UpdateStateByKey in Spark shuffles the data across

I'm a newbie in Spark and i would like to understand whether i need to aggregate the DStream data by key before calling updateStateByKey?
My application basically counts the number of words in every second using Spark Streaming where i perform couple of map operations before doing a state-full update as follows,
val words = inputDstream.flatMap(x => x.split(" "))
val wordDstream = words.map(x => (x, 1))
val stateDstream = wordDstream.updateStateByKey(UpdateFunc _)
stateDstream.print()
Say after the second Map operation, same keys (words) might present across worker nodes due to various partitions, So i assume that the updateStateByKey method internally shuffles and aggregates the key values as Seq[Int] and calls the updateFunc. Is my assumption correct?
correct: as you can see in the method signature it takes an optional partitionNum/Partitioner argument, which denotes the number of reducers i.e. state updaters. This leads to a shuffle.
Also, I suggest to explicitly put a number there otherwise Spark may significantly decrease your job's parallelism trying to run tasks locally with respect to the location of the blocks of the HDFS checkpoint files
updateStateByKey() does not shuffle the state , rather the new data is brought to the nodes containing the state for the same key.
Link to Tathagat's answer to a similar question : https://www.mail-archive.com/user#spark.apache.org/msg43512.html

Is there a way to check if a variable in Spark is parallelizable?

So I am using groupByKey function in spark, but its not being parallelized, as I can see that during its execution, only 1 core is being used. It seems that the data I'm working with doesn't allow parallelization. Is there a way in spark to know if the input data is amicable to parallelization or if it's not a proper RDD?
The unit of parallelization in Spark is the 'partition'. That is, RDDs are split in partitions and transformations are applied to each partition in parallel. How RDD data is distributed across partitions is determined by the Partitioner. By default, the HashPartitioner is used which should work fine for most purposes.
You can check how many partitions your RDD is split into using:
rdd.partitions // Array of partitions

Resources