How do I select a range of elements in Spark RDD?


I'd like to select a range of elements in a Spark RDD. For example, I have an RDD with a hundred elements, and I need to select elements from 60 to 80. How do I do that?
I see that RDD has a take(i: int) method, which returns the first i elements. But there is no corresponding method to take the last i elements, or i elements from the middle starting at a certain index.

I don't think there is an efficient built-in method for this yet. The easy way is filter(). Say you have an RDD named pairs that holds key-value pairs, and you only want the elements with keys from 60 to 80 inclusive:
// a Scala identifier cannot start with a digit, hence from60to80
val from60to80 = pairs.filter {
  case (k, _) => k >= 60 && k <= 80
}
I think this could be done more efficiently in the future by using sortByKey and saving information about the range of values mapped to each partition. Keep in mind that this approach only pays off if you plan to query the range multiple times, because the sort is expensive.
From looking at the Spark source, it would definitely be possible to do efficient range queries using RangePartitioner:
// An array of upper bounds for the first (partitions - 1) partitions
private val rangeBounds: Array[K] = {
This is a private member of RangePartitioner that knows the upper bounds of all the partitions, so it would be easy to query only the necessary partitions. It looks like this is something Spark users may see in the future: SPARK-911
UPDATE: Way better answer, based on pull request I'm writing for SPARK-911. It will run efficiently if the RDD is sorted and you query it multiple times.
import org.apache.spark.RangePartitioner
val sorted = sc.parallelize((1 to 100).map(x => (x, x))).sortByKey().cache()
// sortByKey() leaves a RangePartitioner on the resulting RDD
val p: RangePartitioner[Int, Int] = sorted.partitioner.get.asInstanceOf[RangePartitioner[Int, Int]]
val (lower, upper) = (10, 20)
// indices of the partitions that can hold keys in [lower, upper]
val range = p.getPartition(lower) to p.getPartition(upper)
println(range)
val rangeFilter = (i: Int, iter: Iterator[(Int, Int)]) => {
  if (range.contains(i))
    for ((k, v) <- iter if k >= lower && k <= upper) yield (k, v)
  else
    Iterator.empty
}
for ((k, v) <- sorted.mapPartitionsWithIndex(rangeFilter, preservesPartitioning = true).collect()) println(s"$k, $v")
If having the whole partition in memory is acceptable, you could even do something like this:
val glommedAndCached = sorted.glom().cache()
glommedAndCached.map(a => a.slice(a.search(lower), a.search(upper) + 1)).collect()
Note that search is not a built-in member of the array here. I wrote an implicit class that adds a binary search function, not shown.
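The binary search helper mentioned above is not shown in the answer. Here is a minimal sketch of what it could look like, in plain Scala with no Spark dependency. The names SortedSlice, partitionPoint and sliceByKey are hypothetical, and the sketch assumes each glommed partition is an Array[(Int, Int)] already sorted by key.

```scala
object SortedSlice {
  // Binary search for the first index whose key does NOT satisfy `before`.
  // Assumes `before` holds for a prefix of the sorted array and fails for the rest.
  private def partitionPoint(a: Array[(Int, Int)])(before: Int => Boolean): Int = {
    var lo = 0
    var hi = a.length
    while (lo < hi) {
      val mid = lo + (hi - lo) / 2
      if (before(a(mid)._1)) lo = mid + 1 else hi = mid
    }
    lo
  }

  // All pairs with lower <= key <= upper, located with two binary searches.
  def sliceByKey(a: Array[(Int, Int)], lower: Int, upper: Int): Array[(Int, Int)] = {
    val from = partitionPoint(a)(_ < lower)   // index of the first key >= lower
    val until = partitionPoint(a)(_ <= upper) // index of the first key > upper
    a.slice(from, until)
  }
}
```

Under those assumptions the map step could be written as glommedAndCached.map(a => SortedSlice.sliceByKey(a, lower, upper)). Searching for both boundaries, rather than using search(upper) + 1, also gives a sensible result when the bounds themselves are not present in the array.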

How big is your data set? You might be able to do what you need with:
data.take(80).drop(59)
This seems inefficient, but for small to medium-sized data, should work.
Is it possible to solve this in another way? What's the case for picking exactly a certain range out of the middle of your data? Would takeSample serve you better?

The following should get the range. Note that the cache saves some overhead, because zipWithIndex internally needs to scan the RDD partitions to count the elements in each one.
scala> val r1 = sc.parallelize(List("a", "b", "c", "d", "e", "f", "g"), 3).cache
scala> val r2 = r1.zipWithIndex
scala> val r3 = r2.filter(x => x._2 > 2 && x._2 < 4).map(x => x._1)
scala> r3.foreach(println)
d

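For those who stumble on this question looking for a Spark 2.x-compatible answer: you can use filterByRange. It is available on key-value RDDs with ordered keys and keeps the elements whose keys fall in an inclusive range, for example pairs.filterByRange(60, 80). If the RDD is partitioned by a RangePartitioner (as it is after sortByKey), only the partitions that can contain matching keys are scanned.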

Related

Why is Spark explode function much slower than a flat map function to split array?

I'm new to Spark and Spark SQL. I have a dataset of 2 columns, "col1" and "col2", and "col2" originally is a Seq of longs. I want to explode "col2" into multiple rows so that each row only has one long.
I tried the explode function versus using flatMap and my own mapper function. They seemed to have significant performance difference. Everything else remained the same, "explode" function seems to be much slower than flatMap (order of magnitude depends on data size). Why?
Option 1: Using "explode"
val exploded = data.withColumn("col2", explode(col("col2")))
Option 2: Using manual flatMap
case class MyPair(col1: Long, col2: Long)
def longAndLongArrayMapper(colToKeep: Long, colToExplode: Seq[Long]) = {
  // `val` is a reserved word, so the loop variable is named v;
  // col1 keeps the original value and col2 receives one exploded element
  for (v <- colToExplode) yield MyPair(colToKeep, v)
}
val exploded = data.flatMap { (x: Row) =>
  longAndLongArrayMapper(x.getAs[Long]("col1"), x.getAs[Seq[Long]]("col2"))
}

Spark reduce with comparison

I have an RDD of tuples of the form (key, count); however, some keys are equivalent, i.e.
(a,3)
(b,4)
(c,5)
should reduce down to... as a and c are equivalent (for example)
(a,8)
(b,4)
Is there a way to perform this operation in Spark?
I'm thinking some sort of conditional within the reduce() function?

I don't think there is a way to do this within the reduce operation itself, but you can achieve it with a pre-processing step. One option is to create a Map[K,K] that maps each key to its canonical equivalent.
val in = sc.parallelize(List(("a",3),("b",4),("c",5)))
val keyMap: Map[String,String] = Map[String,String]("a"->"a", "b"->"b", "c"->"a")
val out = in.map{case (k,v) => (keyMap.getOrElse(k,k),v)}.reduceByKey(_+_)
out.take(3).foreach(println)
Edit:
If the Map can't fit on the driver, you can also distribute the lookup:
val in = sc.parallelize(List(("a",3),("b",4),("c",5)))
val keyMap = sc.parallelize(List(("a","a"),("b","b"),("c","a")))
val out = in.join(keyMap).map{case (oldKey, (v, newKey)) => (newKey, v)}.reduceByKey(_+_)
out.take(3).foreach(println)

reduceByKey() does the trick here, since your data is already a pair RDD.
val baseRDD = sc.parallelize(Seq(("a", 3), ("b", 4), ("a", 5)))
baseRDD.reduceByKey((accum, current) => accum + current).foreach(println)

How to check if all records for a given key are in the same partition already?

I'd like to avoid repartitioning data set by key as much as possible and know if all records for a given key are in the same partition already.
Is there a built-in function in Spark that would give me the answer?

There is nothing built in, but if you assume a specific partitioner it is easy enough to implement your own function:
import org.apache.spark.rdd.RDD
import org.apache.spark.Partitioner
import scala.reflect.ClassTag
def checkDistribution[K : ClassTag, V : ClassTag](
    rdd: RDD[(K, V)], partitioner: Partitioner) =
  // If partitioner is set we compare partitioners
  rdd.partitioner.map(_ == partitioner).getOrElse {
    // Otherwise check if correct number of partitions
    rdd.partitions.size == partitioner.numPartitions &&
    // And check if distribution matches partitioner
    rdd.keys.mapPartitionsWithIndex((i, iter) =>
      Iterator(iter.forall(x => partitioner.getPartition(x) == i))
    ).fold(true)(_ && _)
  }
A few tests:
import org.apache.spark.HashPartitioner
val rdd = sc.range(0, 20, 5).map((_, None))
Not partitioned, invalid distribution:
checkDistribution(rdd, new HashPartitioner(10))
Boolean = false
Partitioned, invalid partitioner:
checkDistribution(
rdd.partitionBy(new HashPartitioner(5)),
new HashPartitioner(10)
)
Boolean = false
Partitioned, valid partitioner:
checkDistribution(
rdd.partitionBy(new HashPartitioner(10)),
new HashPartitioner(10)
)
Boolean = true
Not partitioned, valid distribution:
checkDistribution(
rdd.partitionBy(new HashPartitioner(10)).map(identity),
new HashPartitioner(10)
)
Boolean = true
Without assuming a particular partitioner, the only option that comes to mind requires a shuffle, so it is unlikely to be an improvement.
def checkDistribution[K : ClassTag, V : ClassTag](rdd: RDD[(K, V)]) =
  rdd.keys.mapPartitionsWithIndex((i, iter) => iter.map((_, i)))
    .combineByKey(
      x => Seq(x),
      (x: Seq[Int], y: Int) => x,
      (x: Seq[Int], y: Seq[Int]) => x ++ y) // Should be more or less OK
    .values
    .mapPartitions(iter => Iterator(iter.forall(_.size == 1)))
    .fold(true)(_ && _)
One possible improvement is that you can use the same logic to automatically define Partitioner for the data. If you collectAsMap before values and check that all Seqs are of size 1 you have a valid partitioner which guarantees no network traffic.
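To make that idea concrete, here is a small local model in plain Scala, not Spark code. It treats the partitions as a Seq of key sequences, collects the set of partition indices seen for each key, and returns a key-to-partition lookup only when every key lives in exactly one partition. The names ColocationCheck, partitionsPerKey and lookupIfColocated are hypothetical, and wrapping the resulting map in an actual org.apache.spark.Partitioner is left out.

```scala
object ColocationCheck {
  // `partitions` is a local stand-in for an RDD: one inner Seq of keys per partition.
  // Maps every key to the set of partition indices in which it occurs.
  def partitionsPerKey[K](partitions: Seq[Seq[K]]): Map[K, Set[Int]] =
    partitions.zipWithIndex
      .flatMap { case (keys, idx) => keys.map(k => (k, idx)) }
      .groupBy(_._1)
      .map { case (k, pairs) => k -> pairs.map(_._2).toSet }

  // Some(key -> partition index) when every key sits in exactly one partition,
  // None as soon as any key is spread over several partitions.
  def lookupIfColocated[K](partitions: Seq[Seq[K]]): Option[Map[K, Int]] = {
    val perKey = partitionsPerKey(partitions)
    if (perKey.values.forall(_.size == 1))
      Some(perKey.map { case (k, ps) => k -> ps.head })
    else
      None
  }
}
```

The returned map plays the role of the collectAsMap result described above: a custom partitioner's getPartition could look keys up in it, assuming the map fits on the driver.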

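This is not 100% what you asked for, but you can check this with spark_partition_id. First add the partition id as a column:
val withPid = df.withColumn("pid", spark_partition_id())
and then aggregate per key:
withPid.groupBy(what you want to check).agg(max($"pid").as("pidmax"),min($"pid").as("pidmin")).filter($"pidmax"===$"pidmin").count()
With this filter, the count is the number of keys whose records all sit in a single partition. To count the keys that are spread over several partitions instead, invert the comparison (=!= in Spark 2.x).
Note that this is relatively cheap, since it is a simple aggregation.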
I don't believe there is a generic way because if we read from a generic source (e.g. file), we don't necessarily know how the source was originally partitioned.
It would be nice if there was something like "get current partitioner" which would get explicit partitioners (e.g. if we had an explicit repartition command or reading something from parquet which was written using PartitionBy) as an approximation though.

Referencing the next entry in RDD within a map function

I have a stream of <id, action, timestamp, data>s to process.
For example, (let us assume there's only 1 id for simplicity)
id event timestamp
-------------------------------
1 A 1
1 B 2
1 C 4
1 D 7
1 E 15
1 F 16
Let's say TIMEOUT = 5. Because more than 5 seconds passed after D happened without any further event, I want to map this to a JavaPairDStream with two key : value pairs.
id1_1:
A 1
B 2
C 4
D 7
and
id1_2:
E 15
F 16
However, in my anonymous function object, PairFunction that I pass to mapToPair() method,
incomingMessages.mapToPair(new PairFunction<String, String, RequestData>() {
private static final long serialVersionUID = 1L;
@Override
public Tuple2<String, RequestData> call(String s) {
I cannot reference the data in the next entry. In other words, when I am processing the entry with event D, I cannot look at the data at E.
If this was not Spark, I could have simply created an array timeDifferences, store the differences in two adjacent timestamps, and split the array into parts whenever I see a time difference in timeDifferences that is larger than TIMEOUT. (Although, actually there's no need to explicitly create an array)
How can I do this in Spark?

I'm still struggling to understand your question a bit, but based on what you've written, I think you can do it this way:
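val A = sc.parallelize(List((1,"A",1.0),(1,"B",2.0),(1,"C",15.0))).zipWithIndex.map(x => (x._2, x._1))
// key i in B holds the entry that sits at index i + 1 in A
val B = A.map(x => (x._1 - 1, x._2))
// pair each entry with the gap to the entry that follows it (0.0 for the last entry)
val C = A.leftOuterJoin(B).map { case (_, (cur, next)) =>
  (cur, next.map(_._3 - cur._3).getOrElse(0.0))
}
val group1 = C.filter(x => x._2 <= 5) // the next entry arrives within the timeout
val group2 = C.filter(x => x._2 > 5)  // followed by a gap longer than the timeout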
So the concept is: zipWithIndex creates A (assigning a sequential Long index to each entry of your RDD); B is a copy of A with every index lowered by 1, so that each entry lines up with its successor; a join then works out the time difference between consecutive entries, and filter splits on TIMEOUT. This method stays within RDDs. An easier way is to collect the entries to the driver and use map or zip on the local collection, but that would be plain Scala rather than Spark.
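For comparison, the local (non-Spark) variant could be sketched as follows. This is only an illustration in plain Scala: the names TimeoutSplit, Entry and splitByGap are hypothetical, and it assumes the entries for one id are already sorted by timestamp and fit in memory.

```scala
object TimeoutSplit {
  final case class Entry(event: String, timestamp: Long)

  // Splits entries (assumed sorted by timestamp) into runs. A new run starts
  // whenever the gap to the previous entry is larger than `timeout`.
  def splitByGap(sorted: List[Entry], timeout: Long): List[List[Entry]] = {
    // Runs are accumulated newest-first, and each run is itself built in reverse.
    val reversedRuns = sorted.foldLeft(List.empty[List[Entry]]) {
      case (current :: done, e) if e.timestamp - current.head.timestamp <= timeout =>
        (e :: current) :: done // gap small enough: extend the current run
      case (acc, e) =>
        List(e) :: acc         // first entry, or gap too large: open a new run
    }
    reversedRuns.map(_.reverse).reverse
  }
}
```

With the events from the question and a timeout of 5, this yields the runs A, B, C, D and E, F. In a Spark job a function like this could be applied per id, for example after grouping by id, as long as one id's events fit on a single executor.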

I believe this does what you need:
import org.apache.spark.rdd.RDD
def splitToTimeWindows(input: RDD[Event], timeoutBetweenWindows: Long): RDD[Iterable[Event]] = {
  val withIndex: RDD[(Long, Event)] = input.sortBy(_.timestamp).zipWithIndex().map(_.swap).cache()
  val withIndexDrop1: RDD[(Long, Event)] = withIndex.map({ case (i, e) => (i-1, e)})
  // joining the two to attach a "followingGap" to each event
  val extendedEvents: RDD[ExtendedEvent] = withIndex.leftOuterJoin(withIndexDrop1).map({
    case (i, (current, Some(next))) => ExtendedEvent(current, next.timestamp - current.timestamp)
    case (i, (current, None)) => ExtendedEvent(current, 0) // last event has no following gap
  })
  // collecting (to driver memory!) cutoff points - timestamps of events that are *last* in their window
  // if this collection is very large, another join might be needed
  val cutoffPoints = extendedEvents.collect({ case e: ExtendedEvent if e.followingGap > timeoutBetweenWindows => e.event.timestamp }).distinct().collect()
  // going back to the original input, grouping by each event's nearest preceding cutoff point (which marks the beginning of this event's window)
  input.groupBy(e => cutoffPoints.filter(_ < e.timestamp).sortWith(_ > _).headOption.getOrElse(0L)).values
}
case class Event(timestamp: Long, data: String)
case class ExtendedEvent(event: Event, followingGap: Long)
The first part builds on GameOfThrows's answer - joining the input with itself with 1's offset to calculate the 'followingGap' for each record. Then we collect the "breaks" or "cutoff points" between the windows, and perform another transformation on the input using these points to group it by window.
NOTE: there might be more efficient ways to perform some of these transformations, depending on the characteristics of the input, for example: if you have lots of "sessions", this code might be slow or run out of memory.

How to generate a new RDD from another RDD according to specific logic

I am new to Spark. I have a problem that I don't know how to solve. My data in the RDD looks like this:
(1,{A,B,C,D})
(2,{E,F,G})
......
I know RDDs are immutable, but, I want to transform my RDD into a new RDD that looks like this:
11 A,B
12 B,C
13 C,D
21 E,F
22 F,G
......
How can I generate a new key and extract adjacent elements?

Assuming your collection is something similar to a List, you could do something like:
val rdd2 = rdd1.flatMap { case (key, values) =>
  for (value <- values.sliding(2).zipWithIndex)
    yield (key.toString + value._2, value._1)
}
What we are doing here is iterating through the values in your list, applying a sliding window of size 2 on the elements, zipping the elements with an integer index, and finally outputting a list of tuples keyed by the original index appended with the list indices (whose values are the slid elements). We also use a flatMap here in order to flatten the results into their own records.
When run in spark-shell, I'm seeing the following output on your example:
scala> val rdd1 = sc.parallelize(Array((1,List("A","B","C","D")), (2,List("E","F","G"))))
rdd1: org.apache.spark.rdd.RDD[(Int, List[String])] = ParallelCollectionRDD[0] at parallelize at <console>:21
scala> val rdd2 = rdd1.flatMap { case (key, values) => for (value <- values.sliding(2).zipWithIndex) yield (key.toString + value._2, value._1) }
rdd2: org.apache.spark.rdd.RDD[(String, Seq[String])] = MapPartitionsRDD[1] at flatMap at <console>:23
scala> rdd2.foreach(println)
...
(10,List(A, B))
(11,List(B, C))
(12,List(C, D))
(20,List(E, F))
(21,List(F, G))
One note: the output key (e.g. 10, 11) will have 3 digits once a list produces 11 or more windows, i.e. has 12 or more elements. For example, for the input key 1 you will get the output key 110 for the 11th window. I'm not sure whether that fits your use case, but it seemed like a reasonable extension of your request. Given your output key scheme, I would actually suggest something different, such as adding a hyphen between the key and the window index. That prevents collisions later, as you'll see 2-10 and 21-0 instead of 210 for both keys.
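A minimal sketch of that hyphenated variant, written against plain Scala collections so it can be tried without a cluster. The names AdjacentPairs and adjacentPairs are hypothetical, and the window index is 0-based as in the output above.

```scala
object AdjacentPairs {
  // Emits one record per adjacent pair of values, keyed "<key>-<window index>".
  // Lists with fewer than two elements produce no records.
  def adjacentPairs(key: Int, values: List[String]): List[(String, List[String])] = {
    val windows = values.sliding(2).zipWithIndex.collect {
      case (window, i) if window.size == 2 => (s"$key-$i", window)
    }
    windows.toList
  }
}
```

In the RDD version this would be used as rdd1.flatMap { case (key, values) => AdjacentPairs.adjacentPairs(key, values) }. Using i + 1 in the key would give the 1-based numbering from the question instead.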
