Join two (non)paired RDDs to make a DataFrame - apache-spark

As the title describes, say I have two RDDs
rdd1 = sc.parallelize([1,2,3])
rdd2 = sc.parallelize([1,0,0])
or
rdd3 = sc.parallelize([("Id", 1),("Id", 2),("Id",3)])
rdd4 = sc.parallelize([("Result", 1),("Result", 0),("Result", 0)])
How can I create the following DataFrame?
Id Result
1 1
2 0
3 0
If I could create the paired RDD [(1,1),(2,0),(3,0)], then sqlCtx.createDataFrame would give me what I want, but I don't know how to create it.
I'd appreciate any comments or help!

So first off, there is an RDD operation called RDD.zipWithIndex. If you called rdd2.zipWithIndex you would get:
scala> rdd2.zipWithIndex.collect().foreach(println)
(1,0)
(0,1)
(0,2)
If you wanted to make it look like yours, just do this:
scala> rdd2.zipWithIndex.map(t => (t._2 + 1, t._1)).collect().foreach(println)
(1,1)
(2,0)
(3,0)
If you really need to zip the two RDDs, then just use RDD.zip
scala> rdd1.zip(rdd2).collect().foreach(println)
(1,1)
(2,0)
(3,0)

Provided that they have the same number of partitions and the same number of elements per partition, you can use the zip function, e.g.
case class Elem(id: Int, result: Int)
val df = sqlCtx.createDataFrame(rdd1.zip(rdd2).map(x => Elem(x._1, x._2)))
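For completeness, a minimal spark-shell sketch that goes straight from the zipped RDD to a named DataFrame with toDF (this assumes sqlCtx is the SQLContext from the question and that its implicits are imported):
import sqlCtx.implicits._  // assumption: sqlCtx is the question's SQLContext

val df = rdd1.zip(rdd2).toDF("Id", "Result")
df.show()
// +---+------+
// | Id|Result|
// +---+------+
// |  1|     1|
// |  2|     0|
// |  3|     0|
// +---+------+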

Related

Something strange about groupBy on spark

I'm learning about the groupBy function in Spark. I create a list with 2 partitions, then use groupBy to separate the odd and even numbers. I found that if I define
val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)
val result = rdd.groupBy(_ % 2)
each group ends up in its own partition. But if I define
val result = rdd.groupBy(_ % 2 == 0)
both groups end up in one partition. Could anybody explain why?
It's just the hash partitioning applied during the groupBy shuffle. With _ % 2 the keys are the Ints 0 and 1, which hash to different partitions; with _ % 2 == 0 the keys are the Booleans false and true, whose hash codes (1237 and 1231) are both odd, so with only two partitions both groups land in the same one.
val rdd = sc.makeRDD(List(1, 2, 3, 4), 5)
With 5 partitions you still see only 2 partitions used and 3 empty, because there are only two keys. It's just how the hashing works out.
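A quick spark-shell sketch to see where the groups actually land, using mapPartitionsWithIndex (nothing here beyond the question's own RDD):
val rdd = sc.makeRDD(List(1, 2, 3, 4), 2)

// Int keys 0 and 1 hash to different partitions
rdd.groupBy(_ % 2)
  .mapPartitionsWithIndex((idx, it) => it.map(kv => s"partition $idx -> $kv"))
  .collect().foreach(println)

// Boolean keys false/true both have odd hash codes, so with two
// partitions both groups end up in the same partition
rdd.groupBy(_ % 2 == 0)
  .mapPartitionsWithIndex((idx, it) => it.map(kv => s"partition $idx -> $kv"))
  .collect().foreach(println)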

Map each element of a list in Spark

I'm working with an RDD whose pairs are structured this way: [Int, List[Int]]. My goal is to pair each item of a list with the key of its pair. So for example I'd need to do this:
RDD1:[Int, List[Int]]
<1><[2, 3]>
<2><[3, 5, 8]>
RDD2:[Int, Int]
<1><2>
<1><3>
<2><3>
<2><5>
<2><8>
I can't understand what kind of transformation would be needed in order to get to RDD2. The list of transformations can be found here. Any idea? Is this the wrong approach?
You can use flatMap:
val rdd1 = sc.parallelize(Seq((1, List(2, 3)), (2, List(3, 5, 8))))
val rdd2 = rdd1.flatMap(x => x._2.map(y => (x._1, y)))
// or:
val rdd2 = rdd1.flatMap{case (key, list) => list.map(nr => (key, nr))}
// print result:
rdd2.collect().foreach(println)
Gives result:
(1,2)
(1,3)
(2,3)
(2,5)
(2,8)
flatMap creates several output objects from one input object.
In your case, the inner map inside flatMap maps each tuple (Int, List[Int]) to a List[(Int, Int)]: the key stays the same as in the input tuple, and one output tuple is created for each element of the input list. flatMap then flattens that list, so each of its elements becomes its own row in the RDD.
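As a side note, since this is a pair RDD the same result can be obtained a bit more tersely with flatMapValues, which keeps the key and flattens the value list (just a sketch of an alternative, not required; rdd2b is an illustrative name):
val rdd2b = rdd1.flatMapValues(identity)
rdd2b.collect().foreach(println)  // same (key, element) pairs as above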

How to generate a new RDD from another RDD according to specific logic

I am new to Spark. I have a problem, but I don't know how to solve it. My data in the RDD is as follows:
(1,{A,B,C,D})
(2,{E,F,G})
......
I know RDDs are immutable, but I want to transform my RDD into a new RDD that looks like this:
11 A,B
12 B,C
13 C,D
21 E,F
22 F,G
......
How can I generate a new key and extract adjacent elements?
Assuming your collection is something similar to a List, you could do something like:
val rdd2 = rdd1.flatMap { case (key, values) =>
  for (value <- values.sliding(2).zipWithIndex)
    yield (key.toString + value._2, value._1)
}
What we are doing here is iterating through the values in each list, applying a sliding window of size 2 over the elements, zipping each window with an integer index, and finally emitting tuples keyed by the original key with the window index appended, whose values are the windowed elements. We use flatMap so that the results are flattened into their own records.
When run in spark-shell, I'm seeing the following output on your example:
scala> val rdd1 = sc.parallelize(Array((1,List("A","B","C","D")), (2,List("E","F","G"))))
rdd1: org.apache.spark.rdd.RDD[(Int, List[String])] = ParallelCollectionRDD[0] at parallelize at <console>:21
scala> val rdd2 = rdd1.flatMap { case (key, values) => for (value <- values.sliding(2).zipWithIndex) yield (key.toString + value._2, value._1) }
rdd2: org.apache.spark.rdd.RDD[(String, Seq[String])] = MapPartitionsRDD[1] at flatMap at <console>:23
scala> rdd2.foreach(println)
...
(10,List(A, B))
(11,List(B, C))
(12,List(C, D))
(20,List(E, F))
(21,List(F, G))
One note: the output key (e.g. 10, 11) will have 3 digits once a list has 11 or more elements; for example, for input key 1 the 11th window gets output key 110. Not sure if that fits your use case, but it seemed like a reasonable extension of your request. Based on your output key scheme, I would actually suggest something slightly different, such as adding a hyphen between the key and the index. That prevents collisions later, as you would see 2-10 and 21-0 instead of 210 for both keys; a sketch of that variant follows.
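A sketch of that hyphenated variant, also starting the window index at 1 to match the numbering in the question (purely illustrative):
val rdd3 = rdd1.flatMap { case (key, values) =>
  values.sliding(2).zipWithIndex.map { case (pair, i) =>
    (s"$key-${i + 1}", pair)  // e.g. "1-1", "1-2", ... instead of "11", "12"
  }
}
rdd3.collect().foreach(println)
// (1-1,List(A, B)), (1-2,List(B, C)), (1-3,List(C, D)), (2-1,List(E, F)), (2-2,List(F, G))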

Spark DStream sort and take N elements

I am using Spark Streaming to read data from a Kafka cluster. I want to sort a pair DStream and get only the top N. So far I have sorted using
val result = ds.reduceByKeyAndWindow((x: Double, y: Double) => x + y,
Seconds(windowInterval), Seconds(batchInterval))
result.transform(rdd => rdd.sortBy(_._2, false))
result.print
My questions are:
How do I get only the top N elements from the DStream?
The transform operation is applied RDD by RDD, so will the result be sorted across the elements of all RDDs? If not, how can I achieve that?
You can use the transform method on the DStream, sort the input RDD, take its top n elements into a list, and then filter the original RDD down to the elements contained in that list. Keep in mind that transform works on each batch RDD separately, so the sort and the top N are per batch, not across the whole stream.
Note: Both RDD and DStream are immutable, so any transformation returns a new RDD or DStream and does not change the original.
val n = 10
val topN = result.transform(rdd => {
  val list = rdd.sortBy(_._2, false).take(n)
  rdd.filter(list.contains)
})
topN.print
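An alternative sketch (assuming the pairs are (String, Double), as suggested by the reduceByKeyAndWindow above): RDD.top with an Ordering on the value keeps only n elements per partition instead of sorting the whole RDD:
val topN = result.transform { rdd =>
  // top(n) keeps at most n elements per partition and merges them on the driver,
  // which is cheaper than a full sortBy followed by take(n)
  val keep = rdd.top(n)(Ordering.by[(String, Double), Double](_._2)).toSet
  rdd.filter(keep.contains)
}
topN.print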

Find out the partition no/id

Is there a way (a method) in Spark to find out the partition ID/number?
Take this example here
val input1 = sc.parallelize(List(8, 9, 10), 3)
val res = input1.reduce { (x, y) =>
  println("Inside partition " + ???)
  x + y
}
I would like to put some code in place of ??? to print the partition ID/number.
You can also use
TaskContext.getPartitionId()
e.g., in lieu of the presently missing foreachPartitionWithIndex()
https://github.com/apache/spark/pull/5927#issuecomment-99697229
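A sketch of how the ??? placeholder from the question could be filled in with it (illustrative only; note that reduce also merges the per-partition results on the driver, where getPartitionId() may simply return 0):
import org.apache.spark.TaskContext

val res = input1.reduce { (x, y) =>
  println("Inside partition " + TaskContext.getPartitionId())
  x + y
}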
Indeed, mapPartitionsWithIndex will give you an iterator and the partition index. (This isn't the same as reduce, of course, but you could combine its result with aggregate.)
Posting the answer here using mapPartitionsWithIndex, based on the suggestion by @Holden.
I created an RDD (input) with 3 partitions. Each element of input is tagged with its partition index (index) in the call to mapPartitionsWithIndex:
scala> val input = sc.parallelize(11 to 17, 3)
input: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[9] at parallelize at <console>:21
scala> input.mapPartitionsWithIndex{ (index, itr) => itr.toList.map(x => x + "#" + index).iterator }.collect()
res8: Array[String] = Array(11#0, 12#0, 13#1, 14#1, 15#2, 16#2, 17#2)
I ran across this old question while looking for the spark_partition_id SQL function for DataFrames.
val input = spark.sparkContext.parallelize(11 to 17, 3)
input.toDF.withColumn("id",spark_partition_id).rdd.collect
res7: Array[org.apache.spark.sql.Row] = Array([11,0], [12,0], [13,1], [14,1], [15,2], [16,2], [17,2])
