I'm learning about the groupBy function in Spark. I created a list with 2 partitions, then used groupBy to separate the odd and even numbers. I found that if I define
val rdd = sc.makeRDD(List(1, 2, 3, 4),2)
val result = rdd.groupBy(_ % 2 )
each group ends up in its own partition. But if I define
val result = rdd.groupBy(_ % 2 ==0)
both groups end up in one partition. Could anybody explain why?
It's just the hashing applied during the groupBy shuffle. Try
val rdd = sc.makeRDD(List(1, 2, 3, 4), 5)
With 5 partitions you will see 2 partitions used and 3 empty. It's just how the hash partitioning algorithm distributes the keys.
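You can check where each key lands by querying a HashPartitioner directly (groupBy without an explicit partitioner uses one over the keys your function returns). A minimal sketch in spark-shell, assuming 2 partitions and the standard JVM Boolean hash codes:
import org.apache.spark.HashPartitioner
val p = new HashPartitioner(2)
p.getPartition(0)     // 0 -> the "even" group from _ % 2
p.getPartition(1)     // 1 -> the "odd" group from _ % 2
p.getPartition(true)  // 1 (Boolean hashCode 1231 is odd)
p.getPartition(false) // 1 (Boolean hashCode 1237 is odd)
With Boolean keys both groups happen to hash to the same partition, so the other partition stays empty.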
I am using Spark 2.4.3 and want to test its performance. I found an interesting fact: with the same code as below and the same environment, run in the spark shell, the even-numbered runs (2, 4, 6, ...) are always faster than the odd-numbered ones; for example, run No. 2 is faster than the first one, run No. 3 is faster than the second run, and so on. Does anyone know why?
This code generates random integers, assigns them to two partitions, and computes the total.
val r = scala.util.Random
val input1 = for (i <- 1 to 10000000) yield r.nextInt
val input = sc.parallelize(input1, 2)
val start = System.currentTimeMillis()
input.reduce((x,y) => x+y)
println((System.currentTimeMillis()-start)+"")
Thanks.
I'm reading the Learning Spark book and can't understand the following pair RDD transformation:
rdd.flatMapValues(x => (x to 5))
It is applied to the RDD {(1,2),(3,4),(3,6)} and the output of the transformation is {(1,2),(1,3),(1,4),(1,5),(3,4),(3,5)}.
Can someone please explain this?
The flatMapValues method is a combination of flatMap and mapValues.
Let's start with the given rdd.
val sampleRDD = sc.parallelize(Array((1,2),(3,4),(3,6)))
mapValues maps the values while keeping the keys.
For example, sampleRDD.mapValues(x => x to 5) returns
Array((1,Range(2, 3, 4, 5)), (3,Range(4, 5)), (3,Range()))
Notice that for the key-value pair (3, 6), it produces (3,Range()) since 6 to 5 yields an empty collection of values.
flatMap "breaks down" collections into their elements. You can find more detailed descriptions of flatMap online.
For example,
given val rdd2 = sampleRDD.mapValues(x => x to 5),
if we flatten each pair into individual (key, value) pairs with rdd2.flatMap { case (key, values) => values.map(value => (key, value)) }, you will get
Array((1,2),(1,3),(1,4),(1,5),(3,4),(3,5)).
That is, for every element in the collection associated with each key, we create a (key, element) pair.
Also notice that (3, Range()) does not produce any (key, element) pairs, since its sequence is empty.
Now, combining flatMap and mapValues, you get flatMapValues.
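As a quick check in spark-shell, calling flatMapValues directly on sampleRDD produces the same result in one step:
sampleRDD.flatMapValues(x => x to 5).collect()
// Array((1,2), (1,3), (1,4), (1,5), (3,4), (3,5))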
flatMapValues works on each value associated with a key. In the above case, x to 5 means each value is expanded into the range from that value up to 5.
Taking the first pair (1,2): the key is 1 and the value is 2, so after applying the transformation it becomes {(1,2),(1,3),(1,4),(1,5)}.
Hope this helps.
I am new to Spark. I have a problem but don't know how to solve it. My data in an RDD is as follows:
(1,{A,B,C,D})
(2,{E,F,G})
......
I know RDDs are immutable, but I want to transform my RDD into a new RDD that looks like this:
11 A,B
12 B,C
13 C,D
21 E,F
22 F,G
......
How can I generate a new key and extract adjacent elements?
Assuming your collection is something similar to a List, you could do something like:
val rdd2 = rdd1.flatMap { case (key, values) =>
  for (value <- values.sliding(2).zipWithIndex)
    yield (key.toString + value._2, value._1)
}
What we are doing here is iterating through the values in your list, applying a sliding window of size 2 over the elements, zipping each window with an integer index, and finally emitting tuples keyed by the original key with the window index appended (whose values are the sliding windows). We also use flatMap here to flatten the results into their own records.
When run in spark-shell, I see the following output for your example:
scala> val rdd1 = sc.parallelize(Array((1,List("A","B","C","D")), (2,List("E","F","G"))))
rdd1: org.apache.spark.rdd.RDD[(Int, List[String])] = ParallelCollectionRDD[0] at parallelize at <console>:21
scala> val rdd2 = rdd1.flatMap { case (key, values) => for (value <- values.sliding(2).zipWithIndex) yield (key.toString + value._2, value._1) }
rdd2: org.apache.spark.rdd.RDD[(String, Seq[String])] = MapPartitionsRDD[1] at flatMap at <console>:23
scala> rdd2.foreach(println)
...
(10,List(A, B))
(11,List(B, C))
(12,List(C, D))
(20,List(E, F))
(21,List(F, G))
The one caveat is that the output key (e.g. 10, 11) will have 3 digits once you have 11 or more windows: for input key 1, the 11th window gets output key 110. Not sure if that fits your use case, but it seemed like a reasonable extension of your request. Given your output key scheme, I would actually suggest something different, such as adding a hyphen between the key and the window index. This prevents collisions later, as you would see 2-10 and 21-0 instead of 210 for both keys; a sketch of that variant follows.
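A minimal sketch of that variant, also switching to the 1-based window index from your expected output (the hyphen separator and the + 1 offset are illustrative choices, not part of the original code):
val rdd3 = rdd1.flatMap { case (key, values) =>
  values.sliding(2).zipWithIndex.map { case (window, idx) =>
    // 1-based window index, separated from the original key by a hyphen
    (s"$key-${idx + 1}", window)
  }
}
// rdd3.collect():
// Array((1-1,List(A, B)), (1-2,List(B, C)), (1-3,List(C, D)), (2-1,List(E, F)), (2-2,List(F, G)))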
Is there a way (a method) in Spark to find out the partition ID/number?
Take this example:
val input1 = sc.parallelize(List(8, 9, 10), 3)
val res = input1.reduce { (x, y) =>
  println("Inside partition " + ???)
  x + y
}
I would like to put some code in place of ??? to print the partition ID/number.
You can also use
TaskContext.getPartitionId()
e.g., in lieu of the presently missing foreachPartitionWithIndex()
https://github.com/apache/spark/pull/5927#issuecomment-99697229
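For the reduce example from the question, that would look roughly like this (a sketch; note that reduce also merges the per-partition results on the driver, where getPartitionId() simply returns 0 because no task context is active there):
import org.apache.spark.TaskContext

val input1 = sc.parallelize(List(8, 9, 10), 3)
val res = input1.reduce { (x, y) =>
  // prints the partition of the task currently executing this closure
  println("Inside partition " + TaskContext.getPartitionId())
  x + y
}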
Indeed, mapPartitionsWithIndex will give you an iterator and the partition index. (This isn't the same as reduce of course, but you could combine the result of that with aggregate.)
Posting the answer here using mapPartitionsWithIndex based on the suggestion by @Holden.
I have created an RDD (input) with 3 partitions. The elements in input are tagged with the partition index (index) in the call to mapPartitionsWithIndex:
scala> val input = sc.parallelize(11 to 17, 3)
input: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:21
scala> input.mapPartitionsWithIndex{ (index, itr) => itr.toList.map(x => x + "#" + index).iterator }.collect()
res8: Array[String] = Array(11#0, 12#0, 13#1, 14#1, 15#2, 16#2, 17#2)
I ran across this old question while looking for the spark_partition_id SQL function for DataFrames.
import org.apache.spark.sql.functions.spark_partition_id
val input = spark.sparkContext.parallelize(11 to 17, 3)
input.toDF.withColumn("id", spark_partition_id).rdd.collect
res7: Array[org.apache.spark.sql.Row] = Array([11,0], [12,0], [13,1], [14,1], [15,2], [16,2], [17,2])
As the title describes, say I have two RDDs
rdd1 = sc.parallelize([1,2,3])
rdd2 = sc.parallelize([1,0,0])
or
rdd3 = sc.parallelize([("Id", 1),("Id", 2),("Id",3)])
rdd4 = sc.parallelize([("Result", 1),("Result", 0),("Result", 0)])
How can I create the following DataFrame?
Id Result
1 1
2 0
3 0
If I could create the paired RDD [(1,1),(2,0),(3,0)], then sqlCtx.createDataFrame would give me what I want, but I don't know how to create it.
I'd appreciate any comment or help!
So first off, there is an RDD operation called RDD.zipWithIndex. If you called rdd2.zipWithIndex you would get:
scala> rdd2.zipWithIndex.collect().foreach(println)
(1,0)
(0,1)
(0,2)
If you wanted to make it look like yours, just do this:
scala> rdd2.zipWithIndex.map(t => (t._2 + 1, t._1)).collect().foreach(println)
(1,1)
(2,0)
(3,0)
If you really need to zip the two RDDs, then just use RDD.zip
scala> rdd1.zip(rdd2).collect().foreach(println)
(1,1)
(2,0)
(3,0)
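Once you have the zipped pair RDD, turning it into the DataFrame from the question is one more step. A minimal sketch, assuming a Spark 2.x spark-shell where the SparkSession spark and its implicits are in scope and rdd1/rdd2 are the Scala equivalents of your RDDs:
import spark.implicits._
// zip the two RDDs and name the columns as in the question
val df = rdd1.zip(rdd2).toDF("Id", "Result")
df.show()
// +---+------+
// | Id|Result|
// +---+------+
// |  1|     1|
// |  2|     0|
// |  3|     0|
// +---+------+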
Provided that the two RDDs have the same number of partitions and the same number of elements per partition, you can use the zip function, e.g.
case class Elem(id: Int, result: Int)
val df = sqlCtx.createDataFrame(rdd1.zip(rdd2).map(x => Elem(x._1, x._2)))