Is there a way(A method) in Spark to find out the Parition ID/No
Take this example here
val input1 = sc.parallelize(List(8, 9, 10), 3)
val res = input1.reduce{ (x, y) => println("Inside partiton " + ???)
x + y)}
I would like to put some code in ??? to print the Partition ID / No
You can also use
TaskContext.getPartitionId()
e.g., in lieu of the presently missing foreachPartitionWithIndex()
https://github.com/apache/spark/pull/5927#issuecomment-99697229
Indeed, the mapParitionsWithIndex will give you an iterator & the partition index. (This isn't the same as reduce of course, but you could combine the result of that with aggregate).
Posting the answer here using mapParitionsWithIndex based on suggestion by #Holden.
I have created an RDD(Input) with 3 Partitions. The elements in input is tagged with the Partition Index(index) in the call to mapPartitionsWithIndex
scala> val input = sc.parallelize(11 to 17, 3)
input: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:21
scala> input.mapPartitionsWithIndex{ (index, itr) => itr.toList.map(x => x + "#" + index).iterator }.collect()
res8: Array[String] = Array(11#0, 12#0, 13#1, 14#1, 15#2, 16#2, 17#2)
I ran across this old question while looking for the spark_partition_id sql function for DataFrame.
val input = spark.sparkContext.parallelize(11 to 17, 3)
input.toDF.withColumn("id",spark_partition_id).rdd.collect
res7: Array[org.apache.spark.sql.Row] = Array([11,0], [12,0], [13,1], [14,1], [15,2], [16,2], [17,2])
Related
I'm learning about groupBy function on spark,I create a list with 2 partitions,then use groupBy to get every odd and even numbers.I found if I define
val rdd = sc.makeRDD(List(1, 2, 3, 4),2)
val result = rdd.groupBy(_ % 2 )
the result with goes to their own partition. But if I define
val result = rdd.groupBy(_ % 2 ==0)
the result turns to in one partition.could anybody explain why?
It's just the hashing applied to groupBy Shuffle.
val rdd = sc.makeRDD(List(1, 2, 3, 4), 5)
With 5 partitions you see 2 partitions used and 3 empty. Just an algorithm.
i am trying to group elements of an RDD that i have created. one simple but expensive way is to use GroupByKey(). but recently i learned that CombineByKey() can do this work more efficiently. my RDD is very simple. it looks like this:
(1,5)
(1,8)
(1,40)
(2,9)
(2,20)
(2,6)
val grouped_elements=first_RDD.groupByKey()..mapValues(x => x.toList)
the result is:
(1,List(5,8,40))
(2,List(9,20,6))
i want to group them based on the first element (key).
can any one help me to do it with CombineByKey() function? i am really confused by CombineByKey()
To begin with take a look at API Refer docs
combineByKey[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C): RDD[(K, C)]
So it accepts three functions which I have defined below
scala> val createCombiner = (v:Int) => List(v)
createCombiner: Int => List[Int] = <function1>
scala> val mergeValue = (a:List[Int], b:Int) => a.::(b)
mergeValue: (List[Int], Int) => List[Int] = <function2>
scala> val mergeCombiners = (a:List[Int],b:List[Int]) => a.++(b)
mergeCombiners: (List[Int], List[Int]) => List[Int] = <function2>
Once you define these then you can use it in your combineByKey call as below
scala> val list = List((1,5),(1,8),(1,40),(2,9),(2,20),(2,6))
list: List[(Int, Int)] = List((1,5), (1,8), (1,40), (2,9), (2,20), (2,6))
scala> val temp = sc.parallelize(list)
temp: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[41] at parallelize at <console>:30
scala> temp.combineByKey(createCombiner,mergeValue, mergeCombiners).collect
res27: Array[(Int, List[Int])] = Array((1,List(8, 40, 5)), (2,List(20, 9, 6)))
Please note that I tried this out in Spark Shell and hence you can see the outputs below the commands executed. They will help build you your understanding.
I have two different RDDs and apply a foreach on both of them and note a difference that I cannot resolve.
First one:
val data = Array(("CORN",6), ("WHEAT",3),("CORN",4),("SOYA",4),("CORN",1),("PALM",2),("BEANS",9),("MAIZE",8),("WHEAT",2),("PALM",10))
val rdd = sc.parallelize(data,3) // NOT sorted
rdd.foreach{ x => {
println (x)
}}
rdd: org.apache.spark.rdd.RDD[(String, Int)] = ParallelCollectionRDD[103] at parallelize at command-325897530726166:8
Works fine in this sense.
Second one:
rddX.foreach{ x => {
val prod = x(0)
val vol = x(1)
val prt = counter
val cnt = counter * 100
println(prt,cnt,prod,vol)
}}
rddX: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[128] at rdd at command-686855653277634:51
Works fine.
Question: why can I not do val prod = x(0) as in the second case on the first example? And how could I do that with the foreach? Or would we need to use map for the first case always? Due to Row internals on the second example?
As you can see the difference in datatypes
First one is RDD[(String, Int)]
This is an RDD of Tuple2 which contains (String, Int) so you can access this as val prod = x._1 for first value as String and x._2 for second Integer value.
Since it is a Tuple you can't access as val prod = x(0)
and second one is RDD[org.apache.spark.sql.Row] which can be access a
val prod = x.getString(0) or val prod = x(0)
I hope this helped!
I have an RDD of tuples with the form (key, count) however some keys are equivalent, i.e.
(a,3)
(b,4)
(c,5)
should reduce down to... as a and c are equivalent (for example)
(a,8)
(b,4)
is there a way to perform this operation in Spark?
I'm thinking some sort of conditional within the reduce() function?
i don't think there is a way to do this within the reduce operation, but you can achieve it using a pre-processing step. One option is to create a Map[K,K] that links your keys.
val in = sc.parallelize(List(("a",3),("b",4),("c",5)))
val keyMap: Map[String,String] = Map[String,String]("a"->"a", "b"->"b", "c"->"a")
val out = in.map{case (k,v) => (keyMap.getOrElse(k,k),v)}.reduceByKey(_+_)
out.take(3).foreach(println)
Edit:
If the Map can't fit on the driver, you can also distribute the lookup:
val in = sc.parallelize(List(("a",3),("b",4),("c",5)))
val keyMap = sc.parallelize(List(("a","a"),("b","b"),("c"->"a")))
val out = in.join(keyMap).map{case (oldKey, (v, newKey)) => (newKey, v)}.reduceByKey(_+_)
out.take(3).foreach(println)
reduceByKey() does the trick here as your data is already paired one.
val baseRDD = sc.parallelize(Seq(("a", 3), ("b", 4), ("a", 5)))
baseRDD.reduceByKey((accum, current) => accum + current).foreach(println)
As the title describes, say I have two RDDs
rdd1 = sc.parallelize([1,2,3])
rdd2 = sc.parallelize([1,0,0])
or
rdd3 = sc.parallelize([("Id", 1),("Id", 2),("Id",3)])
rdd4 = sc.parallelize([("Result", 1),("Result", 0),("Result", 0)])
How can I create the following DataFrame?
Id Result
1 1
2 0
3 0
If I could create the paired RDD [(1,1),(2,0),(3,0)] then sqlCtx.createDataFrame would give me what I want, but I don't know how?
I'd appreciate any comment or help!
So first off, there is an RDD operation called RDD.zipWithIndex. If you called rdd2.zipWithIndex you would get:
scala> rdd2.zipWithIndex collect() foreach println
(1,0)
(0,1)
(0,2)
If you wanted to make it look like yours, just do this:
scala> rdd2.zipWithIndex map(t => (t._2 + 1,t._1)) collect() foreach println
(1,1)
(2,0)
(3,0)
If you really need to zip the two RDDs, then just use RDD.zip
scala> rdd1.zip(rdd2) collect() foreach println
(1,1)
(2,0)
(3,0)
Provided that they have the same partitioner and the same number of elements per partition, you can use the zip function, e.g.
case class Elem(id: Int, result: Int)
val df = sqlCtx.createDataFrame(rdd1.zip(rdd2).map(x => Elem(x._1, x._2)))