Why does mapPartitionsWithIndex cause a shuffle in Spark? - apache-spark

I'm new to Spark. I'm investigating shuffling issues in a test application, and I don't understand why the mapPartitionsWithIndex method causes a shuffle in my program. As you can see in the picture, my initial RDD has two 16 MB partitions, yet the shuffle write is about 49.8 MB.
I know that map, mapPartitions, and mapPartitionsWithIndex are not shuffling transformations like groupByKey, but I see that they still cause a shuffle in Spark. Why?

I think you are performing some join/group operation after mapPartitionsWithIndex, and that is what causes the shuffle.
You can verify this by modifying your code.
Current code:
val rdd = inputRDD1.mapPartitionsWithIndex(....)
val outRDD = rdd.join(inputRDD2)
Modified code:
val rdd = inputRDD1.mapPartitionsWithIndex(....)
println(rdd.count) // an action with no join after it; if the shuffle disappears, the join was the cause
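You can also confirm where the shuffle boundary is by inspecting the RDD lineage instead of the UI. A minimal sketch, reusing the inputRDD1/inputRDD2 names from above and assuming both end up as compatible pair RDDs:
// mapPartitionsWithIndex is a narrow transformation, so it adds no shuffle;
// a shuffle boundary shows up in the lineage only where the join happens.
val mapped = inputRDD1.mapPartitionsWithIndex { (idx, iter) =>
  iter.map(record => (idx, record))
}
println(mapped.toDebugString)                 // no shuffle boundary yet
println(mapped.join(inputRDD2).toDebugString) // the join adds the shuffle boundary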

Related

Partition data when reading file in Spark

I am new to Spark. Consider the following code:
val rdd = sc
.objectFile[(Int, Int)]("path")
.partitionBy(new HashPartitioner(sc.defaultParallelism))
.persist()
rdd.count()
Is each tuple read from the file sent directly to its partition as specified by the hash partitioner? Or is the whole file first read into memory without considering the partitioner and only then distributed according to it? To me, the former seems more efficient, since the data is shuffled once, while the latter needs two shuffles.
Please see the comments in the code below:
val rdd = sc
.objectFile[(Int, Int)]("path") // Loads the whole file with the default minimum partitions and default partitioner
.partitionBy(new HashPartitioner(sc.defaultParallelism)) // Re-partitions the RDD using a HashPartitioner, which shuffles the data
.persist() // Caches the re-partitioned result so the shuffle is not repeated
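To see that partitionBy adds exactly one shuffle on top of the file read, you can check the resulting RDD's partitioner and lineage; a minimal sketch, assuming the rdd built above:
println(rdd.partitioner)   // Some(HashPartitioner) once partitionBy has been applied
println(rdd.toDebugString) // a single shuffle boundary above the RDD that reads the file
rdd.count()                // the first action pays the read plus one shuffle; later actions reuse the cache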

SPARK rdd performance pipelining

If we have, say:
val rdd1 = rdd0.map( ...
followed by
val rdd2 = rdd1.filter( ...
Then, when the job actually runs due to an action, can rdd2 start computing from the rdd1 results that are already available, or must it wait until all of the rdd1 work is complete? It is not apparent to me from reading the Spark documentation. Informatica pipelining does do this, so I assume it probably happens in Spark as well.
Spark transformations are lazy, so neither call does anything beyond building the dependency DAG. Your code doesn't even touch the data.
For anything to be computed you have to execute an action on rdd2 or one of its descendants.
By default they are also forgetful, so unless you cache rdd1, it will be evaluated all over again every time rdd2 is evaluated.
Finally, because of that lazy evaluation, multiple narrow transformations are combined into a single stage, and your code will interleave calls to the map and filter functions.
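A minimal sketch of that pipelining, with a hypothetical rdd0 built from a range:
val rdd0 = sc.parallelize(1 to 1000000)
val rdd1 = rdd0.map(_ * 2)         // narrow transformation
val rdd2 = rdd1.filter(_ % 3 == 0) // narrow transformation
// Both transformations run in the same stage: each element is mapped and then
// filtered in one pass, without materializing all of rdd1 first.
println(rdd2.toDebugString) // no shuffle boundary between the map and the filter
rdd2.count()                // the action that actually triggers the computation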

Spark SQL createDataFrame() raising OutOfMemory exception

Does it create the whole dataFrame in Memory?
How do I create a large dataFrame (> 1 million Rows) and persist it for later queries?
To persist it for later queries:
import org.apache.spark.SparkContext
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.storage.StorageLevel
val sc: SparkContext = ...
val hc = new HiveContext(sc)
val df: DataFrame = myCreateDataFrameCode()
.coalesce(8)
.persist(StorageLevel.MEMORY_ONLY_SER)
df.show()
This will coalesce the DataFrame to 8 partitions before persisting it with serialization. I'm not sure I can say what number of partitions is best; perhaps even "1". Check the StorageLevel docs for other persistence options, such as MEMORY_AND_DISK_SER, which persists to both memory and disk.
In answer to the first question, yes, I think Spark will need to create the whole DataFrame in memory before persisting it. If you're getting 'OutOfMemory', that's probably the key roadblock. You don't say how you're creating it. Perhaps there's a workaround, such as creating and persisting it in smaller pieces, persisting to memory and disk with serialization, and then combining the pieces.
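If your source data can be split, one rough sketch of that workaround (Spark 1.x API; createChunk is a hypothetical function that builds one slice of the final DataFrame) is:
import org.apache.spark.sql.DataFrame
import org.apache.spark.storage.StorageLevel
// Hypothetical: createChunk(i) builds one slice of the data as a DataFrame.
val pieces: Seq[DataFrame] = (0 until 10).map(i => createChunk(i))
val df = pieces
.reduce((a, b) => a.unionAll(b))           // combine the pieces (same schema assumed)
.persist(StorageLevel.MEMORY_AND_DISK_SER) // spill to disk when memory runs out
df.count()                                 // materialize once, then reuse for later queries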

repartition() is not affecting RDD partition size

I am trying to change the number of partitions of an RDD using the repartition() method. The method call on the RDD succeeds, but when I explicitly check the partition count using the rdd.partitions.size property, I get back the same number of partitions it originally had:
scala> rdd.partitions.size
res56: Int = 50
scala> rdd.repartition(10)
res57: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[19] at repartition at <console>:27
At this stage I perform an action such as rdd.take(1) just to force evaluation, in case that matters. Then I check the partition count again:
scala> rdd.partitions.size
res58: Int = 50
As one can see, it's not changing. Can someone answer why?
First, it does matter that you run an action, as repartition is indeed lazy. Second, repartition returns a new RDD with the partitioning changed, so you must use the returned RDD; otherwise you are still working off of the old partitioning. Finally, when shrinking the number of partitions you should use coalesce, because it does not reshuffle all of the data. Instead it keeps data on the nodes where it already resides and merges the remaining partitions into them.
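A minimal sketch of the corrected usage, reusing the rdd from the question:
val repartitioned = rdd.repartition(10) // returns a new RDD; the original rdd keeps its 50 partitions
println(repartitioned.partitions.size)  // 10
// When only reducing the number of partitions, coalesce avoids a full shuffle:
val coalesced = rdd.coalesce(10)
println(coalesced.partitions.size)      // 10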

How to duplicate RDD into multiple RDDs?

Is it possible to duplicate an RDD into two or several RDDs?
I want to use the cassandra-spark driver to save an RDD to a Cassandra table and, in addition, keep going with more calculations (and eventually save the result to Cassandra as well).
RDDs are immutable and transformations on RDDs create new RDDs. Therefore, it's not necessary to create copies of an RDD to apply different operations.
You could save the base RDD to secondary storage and further apply operations to it.
This is perfectly OK:
val rdd = ???
val base = rdd.keyBy(...)
base.saveToCassandra(ks,table)
val processed = base.map(...).reduceByKey(...)
processed.saveToCassandra(ks,processedTable)
val analyzed = base.map(...).join(suspectsRDD).reduceByKey(...)
analyzed.saveAsTextFile("./path/to/save")
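Since base feeds more than one downstream computation here, caching it avoids recomputing its lineage for every action; a minimal, self-contained sketch with placeholder data (the Cassandra calls are left out):
val rdd = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val base = rdd.cache()                                   // reused by both pipelines below
val processed = base.mapValues(_ + 1).reduceByKey(_ + _)
val analyzed = base.filter { case (_, v) => v > 1 }
println(processed.count())                               // first action populates the cache
println(analyzed.count())                                // second action reuses it instead of recomputing rdd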
