Spark DStream sort and take N elements - apache-spark

I am using Spark Streaming to read data from a Kafka cluster. I want to sort a pair DStream and keep only the top N elements. So far I have sorted using:
val result = ds.reduceByKeyAndWindow((x: Double, y: Double) => x + y,
  Seconds(windowInterval), Seconds(batchInterval))
result.transform(rdd => rdd.sortBy(_._2, false))
result.print
My questions are:
How do I get only the top N elements from the DStream?
The transform operation is applied RDD by RDD, so will the result be sorted across the elements of all RDDs? If not, how can I achieve that?

You can use the transform method on the DStream, sort the input RDD, take its first n elements as a list, and then filter the original RDD down to the elements contained in that list.
Note: both RDD and DStream are immutable, so any transformation returns a new RDD or DStream and does not change the original one.
val n = 10
val topN = result.transform(rdd => {
  val list = rdd.sortBy(_._2, false).take(n)
  rdd.filter(list.contains)
})
topN.print
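Note that the filtered RDD keeps the original (unsorted) order, so print will not necessarily show the top N in descending order. If you also need them sorted, a minimal alternative sketch (an assumption, not part of the original answer) is to re-parallelize the sorted take, reusing result and n from above:
val topNSorted = result.transform { rdd =>
  // Sort descending by value, bring the first n pairs to the driver,
  // then turn that small array back into an RDD so it stays in order.
  val top = rdd.sortBy(_._2, ascending = false).take(n)
  rdd.sparkContext.parallelize(top)
}
topNSorted.print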

Related

Why is Spark explode function much slower than a flat map function to split array?

I'm new to Spark and Spark SQL. I have a dataset with two columns, "col1" and "col2", where "col2" is originally a Seq of longs. I want to explode "col2" into multiple rows so that each row contains only one long.
I tried the explode function versus flatMap with my own mapper function, and they seem to have a significant performance difference. With everything else the same, explode appears to be much slower than flatMap (by an order of magnitude, depending on data size). Why?
Option 1: Using "explode"
val exploded = data.withColumn("col2", explode(col("col2")))
Option 2: Using manual flatMap
case class MyPair(col1: Long, col2: Long)
def longAndLongArrayMapper(colToKeep: Long, colToExplode: Seq[Long]) =
  for (value <- colToExplode) yield MyPair(colToKeep, value)

val exploded = data.flatMap { (x: Row) =>
  longAndLongArrayMapper(x.getAs[Long]("col1"), x.getAs[Seq[Long]]("col2")) }

Strict partition an RDD into multiple RDDs in Spark

I have an RDD with n partitions and I would like to split it into k RDDs in such a way that
rdd = rdd_1.union(rdd_2).union(rdd_3)...union(rdd_k)
So for example if n=10 and k=2 I would like to end up with 2 rdds where rdd1 is composed of 5 partitions and rdd2 is composed of the other 5 partitions.
What is the most efficient way to do this in Spark?
You can try something like this:
val rdd: RDD[T] = ???
val k: Int = ???
val n = rdd.partitions.size

val rdds = (0 until n)                      // create a Seq of partition indices
  .grouped(n / k)                           // group it into fixed-size buckets
  .map(idxs => (idxs.head, idxs.last))      // take the first and the last index of each bucket
  .map { case (min, max) =>
    rdd.mapPartitionsWithIndex(
      // if the partition index is in the [min, max] range keep its iterator,
      // otherwise return an empty one
      (i, iter) => if (i >= min && i <= max) iter else Iterator()
    )
  }
If the input RDD has complex dependencies, you should cache it before applying this.
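As a quick sanity check, you can union the pieces back together and compare element counts with the original; a small sketch, assuming the rdd and rdds values defined above:
// Materialize the iterator so the pieces can be inspected more than once
val parts = rdds.toList

// Every piece still reports the original number of partitions
// (the non-selected ones are simply empty), and together the
// pieces should cover all elements of the input RDD
parts.foreach(p => println(s"${p.partitions.size} partitions, ${p.count()} elements"))
assert(parts.map(_.count()).sum == rdd.count())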

Collect RDD to one node in sorted order

I have a large RDD which needs to be written to a single file on disk, one line per element, with the lines sorted in some defined order. So I was thinking of sorting the RDD, collecting one partition at a time in the driver, and appending to the output file.
A couple of questions:
After rdd.sortBy(), do I have the guarantee that partition 0 will contain the first elements of the sorted RDD, partition 1 will contain the next elements, and so on? (I'm using the default partitioner.)
e.g.
val rdd = ???
val sortedRdd = rdd.sortBy(???)
for (p <- sortedRdd.partitions) {
  val index = p.index
  val partitionRdd = sortedRdd.mapPartitionsWithIndex {
    case (i, values) => if (i == index) values else Iterator()
  }
  val partition = partitionRdd.collect()
  partition.foreach { e =>
    // Append element e to file
  }
}
I understand that rdd.toLocalIterator is a more efficient way of fetching all partitions, one at a time. So same question: do I get the elements in the order given by .sortBy()?
val rdd = ???
val sortedRdd = rdd.sortBy(???)
for (e <- sortedRdd.toLocalIterator) {
  // Append element e to file
}
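For completeness, a minimal sketch of the "append element e to file" step, assuming a local output path (out.txt is a hypothetical name) and plain java.io on the driver:
import java.io.PrintWriter

val writer = new PrintWriter("out.txt")   // hypothetical local path
try {
  for (e <- sortedRdd.toLocalIterator) {
    writer.println(e)                     // one line per element, in iterator order
  }
} finally {
  writer.close()
}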

How to generate a new RDD from another RDD according to specific logic

I am new to Spark. I have a problem, but I don't know how to solve it. My data in an RDD looks like this:
(1,{A,B,C,D})
(2,{E,F,G})
......
I know RDDs are immutable, but, I want to transform my RDD into a new RDD that looks like this:
11 A,B
12 B,C
13 C,D
21 E,F
22 F,G
......
How can I generate a new key and extract adjacent elements?
Assuming your collection is something similar to a List, you could do something like:
val rdd2 = rdd1.flatMap { case (key, values) =>
  for (value <- values.sliding(2).zipWithIndex)
    yield (key.toString + value._2, value._1)
}
What we are doing here is iterating through the values in your list, applying a sliding window of size 2 over the elements, zipping each window with an integer index, and finally emitting tuples whose keys are the original key with the window index appended and whose values are the windowed elements. We use flatMap so that each tuple becomes its own record.
When run in spark-shell, I'm seeing the following output on your example:
scala> val rdd1 = sc.parallelize(Array((1,List("A","B","C","D")), (2,List("E","F","G"))))
rdd1: org.apache.spark.rdd.RDD[(Int, List[String])] = ParallelCollectionRDD[0] at parallelize at <console>:21
scala> val rdd2 = rdd1.flatMap { case (key, values) => for (value <- values.sliding(2).zipWithIndex) yield (key.toString + value._2, value._1) }
rdd2: org.apache.spark.rdd.RDD[(String, Seq[String])] = MapPartitionsRDD[1] at flatMap at <console>:23
scala> rdd2.foreach(println)
...
(10,List(A, B))
(11,List(B, C))
(12,List(C, D))
(20,List(E, F))
(21,List(F, G))
One note: the output key (e.g. 10, 11) will grow to three digits once an input key produces eleven or more windows. For example, for input key 1 the eleventh window would get the output key 110. Not sure if that fits your use case, but it seemed like a reasonable extension of your request. Given your output key scheme, I would actually suggest something slightly different, such as adding a hyphen between the key and the window index. This prevents collisions later, since you would see 2-10 and 21-0 instead of 210 for both keys.
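A minimal sketch of that hyphenated variant (the hyphen separator is the suggestion above, not part of the original request):
val rdd2 = rdd1.flatMap { case (key, values) =>
  for ((window, idx) <- values.sliding(2).zipWithIndex)
    // produces e.g. ("1-0", List(A, B)), ("2-1", List(F, G))
    yield (s"$key-$idx", window)
}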

Join two (non)paired RDDs to make a DataFrame

As the title describes, say I have two RDDs
rdd1 = sc.parallelize([1,2,3])
rdd2 = sc.parallelize([1,0,0])
or
rdd3 = sc.parallelize([("Id", 1),("Id", 2),("Id",3)])
rdd4 = sc.parallelize([("Result", 1),("Result", 0),("Result", 0)])
How can I create the following DataFrame?
Id Result
1 1
2 0
3 0
If I could create the paired RDD [(1,1),(2,0),(3,0)], then sqlCtx.createDataFrame would give me what I want, but I don't know how to create it.
I'd appreciate any comment or help!
So first off, there is an RDD operation called RDD.zipWithIndex. If you called rdd2.zipWithIndex you would get:
scala> rdd2.zipWithIndex collect() foreach println
(1,0)
(0,1)
(0,2)
If you wanted to make it look like yours, just do this:
scala> rdd2.zipWithIndex map(t => (t._2 + 1,t._1)) collect() foreach println
(1,1)
(2,0)
(3,0)
If you really need to zip the two RDDs, then just use RDD.zip
scala> rdd1.zip(rdd2) collect() foreach println
(1,1)
(2,0)
(3,0)
Provided that the two RDDs have the same number of partitions and the same number of elements per partition, you can use the zip function, e.g.
case class Elem(id: Int, result: Int)
val df = sqlCtx.createDataFrame(rdd1.zip(rdd2).map(x => Elem(x._1, x._2)))
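If you need the exact column headers from the question (Id, Result), a small follow-up sketch, assuming Spark 1.3+ where DataFrame.toDF(columnNames) is available:
// Rename the case-class columns (id, result) to the requested headers
val named = df.toDF("Id", "Result")
named.show()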
