Spark 2 performance: even-numbered runs are faster than odd-numbered runs - apache-spark

I am using Spark 2.4.3 and want to test its performance. I found an interesting fact: with the same code (below) and the same environment, run in the Spark shell, the even-numbered runs (2, 4, 6, ...) are always faster than the odd-numbered ones; for example, run 2 is faster than run 1, run 4 is faster than run 3, and so on. Does anyone know why?
This code generates random integers, distributes them across two partitions, and computes their total.
val r = scala.util.Random
val input1 = for (i <- 1 to 10000000) yield r.nextInt
val input = sc.parallelize(input1, 2)
val start = System.currentTimeMillis()
input.reduce((x,y) => x+y)
println((System.currentTimeMillis()-start)+"")
Thanks.

Related

Something strange about groupBy on spark

I'm learning about the groupBy function in Spark. I create a list with 2 partitions, then use groupBy to separate the odd and even numbers. I found that if I define
val rdd = sc.makeRDD(List(1, 2, 3, 4),2)
val result = rdd.groupBy(_ % 2 )
each group goes to its own partition. But if I define
val result = rdd.groupBy(_ % 2 ==0)
both groups end up in one partition. Could anybody explain why?
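It's just the hashing that the groupBy shuffle applies: each key is sent to the partition given by its hash code modulo the number of partitions. The Boolean keys true and false have hash codes 1231 and 1237, which are both odd, so with 2 partitions they land in the same partition, whereas the Int keys 0 and 1 land in different ones.
val rdd = sc.makeRDD(List(1, 2, 3, 4), 5)
With 5 partitions you see 2 partitions used and 3 empty. It is just a consequence of the hashing algorithm.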
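A minimal sketch of that hashing, in plain Scala with no Spark dependency, may make the effect easier to see. The helper partitionFor is a hypothetical name introduced here; it only mimics the "non-negative hash code modulo partition count" rule and is not Spark's own code.

```scala
// Hypothetical helper: mimics how a hash partitioner picks a partition,
// i.e. the non-negative remainder of the key's hashCode by the partition count.
def partitionFor(key: Any, numPartitions: Int): Int = {
  val raw = key.hashCode % numPartitions
  if (raw < 0) raw + numPartitions else raw
}

val numbers = List(1, 2, 3, 4)

// Int keys 0 and 1 hash to themselves, so they go to different partitions.
val intKeyPartitions =
  numbers.map(_ % 2).distinct.map(k => k -> partitionFor(k, 2)).toMap

// Boolean keys hash to 1231 (true) and 1237 (false); both are odd,
// so with 2 partitions they share a partition.
val boolKeyPartitions =
  numbers.map(_ % 2 == 0).distinct.map(k => k -> partitionFor(k, 2)).toMap

println(intKeyPartitions)
println(boolKeyPartitions)
// With 5 partitions the two Boolean keys separate again.
println(Seq(true, false).map(k => k -> partitionFor(k, 5)))
```

Under these assumptions, the Boolean keys only collide because the partition count is 2; any partition count that does not map 1231 and 1237 to the same remainder keeps the groups apart.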

Why is Spark explode function much slower than a flat map function to split array?

I'm new to Spark and Spark SQL. I have a dataset of 2 columns, "col1" and "col2", and "col2" originally is a Seq of longs. I want to explode "col2" into multiple rows so that each row only has one long.
I tried the explode function versus using flatMap with my own mapper function. There seems to be a significant performance difference between them. With everything else unchanged, the "explode" function seems to be much slower than flatMap (by how many orders of magnitude depends on the data size). Why?
Option 1: Using "explode"
val exploded = data.withColumn("col2", explode(col("col2")))
Option 2: Using manual flatMap
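case class MyPair(col1: Long, col2: Long)
// `val` is a reserved word in Scala, so the loop variable needs another name.
// The kept column goes first to match MyPair's field order (col1, col2).
def longAndLongArrayMapper(colToKeep: Long, colToExplode: Seq[Long]) = {
  for (v <- colToExplode) yield MyPair(colToKeep, v)
}
val exploded = data.flatMap { (x: Row) =>
  longAndLongArrayMapper(x.getAs[Long]("col1"), x.getAs[Seq[Long]]("col2")) }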

Referencing the next entry in RDD within a map function

I have a stream of <id, action, timestamp, data>s to process.
For example, (let us assume there's only 1 id for simplicity)
id event timestamp
-------------------------------
1 A 1
1 B 2
1 C 4
1 D 7
1 E 15
1 F 16
Let's say TIMEOUT = 5. Because more than 5 seconds passed after D happened without any further event, I want to map this to a JavaPairDStream with two key : value pairs.
id1_1:
A 1
B 2
C 4
D 7
and
id1_2:
E 15
F 16
However, in my anonymous function object, PairFunction that I pass to mapToPair() method,
incomingMessages.mapToPair(new PairFunction<String, String, RequestData>() {
private static final long serialVersionUID = 1L;
@Override
public Tuple2<String, RequestData> call(String s) {
I cannot reference the data in the next entry. In other words, when I am processing the entry with event D, I cannot look at the data at E.
If this were not Spark, I could simply have created an array timeDifferences, stored the differences between adjacent timestamps in it, and split the events into parts wherever a difference in timeDifferences is larger than TIMEOUT. (Actually, there is no need to create the array explicitly.)
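For reference, the non-Spark approach described above might look roughly like the following plain Scala sketch. The names TimedEvent and splitByTimeout are hypothetical, introduced only for illustration, and the sample data is the table from the question.

```scala
// Hypothetical record type for illustration; fields mirror the question's table.
final case class TimedEvent(id: Int, name: String, timestamp: Long)

// Walks the events in timestamp order and starts a new window whenever
// the gap to the previous event is larger than the timeout.
def splitByTimeout(events: Seq[TimedEvent], timeout: Long): Vector[Vector[TimedEvent]] =
  events.sortBy(_.timestamp).foldLeft(Vector.empty[Vector[TimedEvent]]) { (windows, event) =>
    windows.lastOption match {
      case Some(current) if event.timestamp - current.last.timestamp <= timeout =>
        windows.init :+ (current :+ event) // still inside the current window
      case _ =>
        windows :+ Vector(event)           // first event, or gap exceeded the timeout
    }
  }

val events = Seq(
  TimedEvent(1, "A", 1L), TimedEvent(1, "B", 2L), TimedEvent(1, "C", 4L),
  TimedEvent(1, "D", 7L), TimedEvent(1, "E", 15L), TimedEvent(1, "F", 16L)
)

val windows = splitByTimeout(events, timeout = 5L)
windows.foreach(w => println(w.map(e => s"${e.name} ${e.timestamp}").mkString(", ")))
```

As the question notes, no explicit array of differences is needed: each gap is only compared once, against the last event of the current window.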
How can I do this in Spark?
I'm still struggling to understand your question a bit, but based on what you've written, I think you can do it this way:
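// Pair each entry with its index: (index, (id, event, timestamp)).
val A = sc.parallelize(List((1,"A",1.0),(1,"B",2.0),(1,"C",15.0))).zipWithIndex.map(x => (x._2, x._1))
// Shift indices down by one, so index i in B holds the entry that follows index i in A.
val B = A.map(x => (x._1 - 1, x._2))
// Gap between each entry and the next one; the last entry has no successor, so its gap is 0.
val C = A.leftOuterJoin(B).map { case (_, (current, next)) =>
  (current, next.map(_._3 - current._3).getOrElse(0.0))
}
val group1 = C.filter(x => x._2 <= 5)
val group2 = C.filter(x => x._2 > 5)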
So the concept is: you zip with index to create val A (which assigns a sequential Long index to each entry of your RDD), then duplicate the RDD with each index shifted by one to create val B (by subtracting 1 from the index), so that a join on the index pairs every entry with the one that follows it and lets you compute the gap between consecutive entries to compare against TIMEOUT. Then use filter. This method uses RDDs. An easier way is to collect the entries to the driver and use map or a zipped mapping, but that would be plain Scala rather than Spark, I guess.
I believe this does what you need:
import org.apache.spark.rdd.RDD
def splitToTimeWindows(input: RDD[Event], timeoutBetweenWindows: Long): RDD[Iterable[Event]] = {
val withIndex: RDD[(Long, Event)] = input.sortBy(_.timestamp).zipWithIndex().map(_.swap).cache()
val withIndexDrop1: RDD[(Long, Event)] = withIndex.map({ case (i, e) => (i-1, e)})
// joining the two to attach a "followingGap" to each event
val extendedEvents: RDD[ExtendedEvent] = withIndex.leftOuterJoin(withIndexDrop1).map({
case (i, (current, Some(next))) => ExtendedEvent(current, next.timestamp - current.timestamp)
case (i, (current, None)) => ExtendedEvent(current, 0) // last event has no following gap
})
// collecting (to driver memory!) cutoff points - timestamps of events that are *last* in their window
// if this collection is very large, another join might be needed
val cutoffPoints = extendedEvents.collect({ case e: ExtendedEvent if e.followingGap > timeoutBetweenWindows => e.event.timestamp }).distinct().collect()
// going back to the original input, grouping by each event's nearest preceding cutoff point (i.e. the beginning of this event's window)
input.groupBy(e => cutoffPoints.filter(_ < e.timestamp).sortWith(_ > _).headOption.getOrElse(0L)).values
}
case class Event(timestamp: Long, data: String)
case class ExtendedEvent(event: Event, followingGap: Long)
The first part builds on GameOfThrows's answer: it joins the input with itself at an offset of 1 to calculate the 'followingGap' for each record. Then we collect the "breaks" or "cutoff points" between the windows, and perform another transformation on the input using these points to group it by window.
NOTE: there might be more efficient ways to perform some of these transformations, depending on the characteristics of the input, for example: if you have lots of "sessions", this code might be slow or run out of memory.

Find out the partition no/id

Is there a way (a method) in Spark to find out the partition ID/number?
Take this example here
val input1 = sc.parallelize(List(8, 9, 10), 3)
val res = input1.reduce { (x, y) =>
  println("Inside partition " + ???)
  x + y
}
I would like to put some code in ??? to print the Partition ID / No
You can also use
TaskContext.getPartitionId()
e.g., in lieu of the presently missing foreachPartitionWithIndex()
https://github.com/apache/spark/pull/5927#issuecomment-99697229
Indeed, mapPartitionsWithIndex will give you an iterator and the partition index. (This isn't the same as reduce, of course, but you could combine the result of that with aggregate.)
Posting the answer here using mapPartitionsWithIndex, based on the suggestion by @Holden.
I have created an RDD (input) with 3 partitions. The elements of input are tagged with the partition index (index) in the call to mapPartitionsWithIndex.
scala> val input = sc.parallelize(11 to 17, 3)
input: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:21
scala> input.mapPartitionsWithIndex{ (index, itr) => itr.toList.map(x => x + "#" + index).iterator }.collect()
res8: Array[String] = Array(11#0, 12#0, 13#1, 14#1, 15#2, 16#2, 17#2)
I ran across this old question while looking for the spark_partition_id sql function for DataFrame.
val input = spark.sparkContext.parallelize(11 to 17, 3)
input.toDF.withColumn("id",spark_partition_id).rdd.collect
res7: Array[org.apache.spark.sql.Row] = Array([11,0], [12,0], [13,1], [14,1], [15,2], [16,2], [17,2])

How do I select a range of elements in Spark RDD?

I'd like to select a range of elements in a Spark RDD. For example, I have an RDD with a hundred elements, and I need to select elements from 60 to 80. How do I do that?
I see that RDD has a take(i: int) method, which returns the first i elements. But there is no corresponding method to take the last i elements, or i elements from the middle starting at a certain index.
I don't think there is an efficient method to do this yet. But the easy way is to use filter(). Let's say you have an RDD, pairs, of key-value pairs and you only want the elements with keys from 60 to 80 inclusive; just do:
// An identifier cannot start with a digit, so the result is named from60to80.
val from60to80 = pairs.filter {
  case (k, _) => k >= 60 && k <= 80
}
I think it's possible that this could be done more efficiently in the future, by using sortByKey and saving information about the range of values mapped to each partition. Keep in mind this approach would only save anything if you were planning to query the range multiple times because the sort is obviously expensive.
From looking at the spark source it would definitely be possible to do efficient range queries using RangePartitioner:
// An array of upper bounds for the first (partitions - 1) partitions
private val rangeBounds: Array[K] = {
This is a private member of RangePartitioner that knows the upper bounds of all the partitions, so it would be easy to query only the necessary partitions. It looks like this is something Spark users may see in the future: SPARK-911
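To illustrate the idea of using the upper bounds to pick only the relevant partitions, here is a small plain Scala sketch. The bounds array and the helpers partitionOf and partitionsForRange are hypothetical stand-ins, not the RangePartitioner API; they assume 4 partitions over integer keys 1 to 100.

```scala
// Hypothetical upper bounds for the first (partitions - 1) partitions,
// assuming 4 partitions over the keys 1 to 100.
val rangeBounds = Array(25, 50, 75)

// Partition that would hold `key`: the first bound that is >= key,
// or the last partition when the key is larger than every bound.
def partitionOf(key: Int): Int = {
  val idx = rangeBounds.indexWhere(key <= _)
  if (idx == -1) rangeBounds.length else idx
}

// Only these partitions can contain keys in [lower, upper].
def partitionsForRange(lower: Int, upper: Int): Range =
  partitionOf(lower) to partitionOf(upper)

println(partitionsForRange(60, 80))
println(partitionsForRange(10, 20))
```

With these assumed bounds, a query for keys 60 to 80 touches only two of the four partitions, and a query for keys 10 to 20 touches only one, which is where the saving over a full filter would come from.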
UPDATE: Way better answer, based on pull request I'm writing for SPARK-911. It will run efficiently if the RDD is sorted and you query it multiple times.
import org.apache.spark.RangePartitioner
val sorted = sc.parallelize((1 to 100).map(x => (x, x))).sortByKey().cache()
val p: RangePartitioner[Int, Int] = sorted.partitioner.get.asInstanceOf[RangePartitioner[Int, Int]]
val (lower, upper) = (10, 20)
val range = p.getPartition(lower) to p.getPartition(upper)
println(range)
val rangeFilter = (i: Int, iter: Iterator[(Int, Int)]) => {
if (range.contains(i))
for ((k, v) <- iter if k >= lower && k <= upper) yield (k, v)
else
Iterator.empty
}
for((k,v) <- sorted.mapPartitionsWithIndex(rangeFilter, preservesPartitioning = true).collect()) println(s"$k, $v")
If having the whole partition in memory is acceptable you could even do something like this.
val glommedAndCached = sorted.glom().cache()
glommedAndCached.map(a => a.slice(a.search(lower),a.search(upper)+1)).collect()
Note that search is not a built-in member, by the way; I wrote an implicit class that provides a binary search function, which is not shown here.
How big is your data set? You might be able to do what you need with:
data.take(80).drop(59)
This seems inefficient, but for small to medium-sized data, should work.
Is it possible to solve this in another way? What's the case for picking exactly a certain range out of the middle of your data? Would takeSample serve you better?
The following should get the range. Note that the cache will save you some overhead, because internally zipWithIndex needs to scan the RDD partitions to get the number of elements in each partition.
scala> val r1 = sc.parallelize(List("a", "b", "c", "d", "e", "f", "g"), 3).cache
scala> val r2 = r1.zipWithIndex
scala> val r3 = r2.filter(x => x._2 > 2 && x._2 < 4).map(x => x._1)
scala> r3.foreach(println)
d
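The reason zipWithIndex needs that extra scan can be sketched in plain Scala by modelling partitions as local sequences. This is only an illustration of the two-pass idea (count per partition, then offset each partition), not Spark's implementation, and the split of the seven elements into three partitions is assumed.

```scala
// An RDD's partitions modelled as local sequences (assumed split of 7 elements).
val partitions = Vector(Vector("a", "b"), Vector("c", "d"), Vector("e", "f", "g"))

// First pass: count the elements in each partition and turn the counts
// into the starting index of every partition.
val offsets = partitions.map(_.size).scanLeft(0L)(_ + _)

// Second pass: global index = partition start offset + position inside the partition.
val indexed = partitions.zip(offsets).flatMap { case (part, offset) =>
  part.zipWithIndex.map { case (elem, i) => (elem, offset + i) }
}

// Same range filter as in the shell session: indices strictly between 2 and 4.
val selected = indexed.collect { case (elem, idx) if idx > 2 && idx < 4 => elem }
println(selected)
```

Because the first pass has to read every partition once before the indices can be assigned, caching the RDD beforehand avoids computing it twice.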
For those who stumble on this question looking for a Spark 2.x-compatible answer: you can use filterByRange on a key-value RDD.
