Referencing the next entry in RDD within a map function - apache-spark

I have a stream of <id, action, timestamp, data>s to process.
For example (let's assume there's only one id, for simplicity):
id  event  timestamp
--------------------
1   A      1
1   B      2
1   C      4
1   D      7
1   E      15
1   F      16
Let's say TIMEOUT = 5. Because more than 5 seconds pass after D without any further event, I want to map this to a JavaPairDStream with two key-value pairs.
id1_1:
A 1
B 2
C 4
D 7
and
id1_2:
E 15
F 16
However, in the anonymous PairFunction object that I pass to the mapToPair() method,
incomingMessages.mapToPair(new PairFunction<String, String, RequestData>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Tuple2<String, RequestData> call(String s) {
        // ...
    }
});
I cannot reference the data in the next entry. In other words, when I am processing the entry with event D, I cannot look at the data at E.
If this were not Spark, I could simply create an array timeDifferences, store the differences between adjacent timestamps, and split the data into parts wherever timeDifferences holds a value larger than TIMEOUT. (Although, actually, there's no need to explicitly create the array.)
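For reference, a minimal plain-Scala sketch of that non-Spark approach (the Event case class and the assumption that events arrive sorted by timestamp are mine, purely for illustration):
case class Event(id: Int, event: String, timestamp: Long)

// Split a timestamp-sorted sequence of events into sessions whenever the gap
// between consecutive timestamps exceeds the timeout.
def splitLocally(events: Seq[Event], timeout: Long): Seq[Seq[Event]] =
  events.foldLeft(Vector.empty[Vector[Event]]) { (acc, e) =>
    acc.lastOption match {
      case Some(current) if e.timestamp - current.last.timestamp <= timeout =>
        acc.init :+ (current :+ e)   // small gap: extend the current session
      case _ =>
        acc :+ Vector(e)             // large gap (or first event): start a new session
    }
  }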
How can I do this in Spark?

I'm still struggling to understand your question a bit, but based on what you've written, I think you can do it this way:
val A = sc.parallelize(List((1, "A", 1.0), (1, "B", 2.0), (1, "C", 15.0)))
  .zipWithIndex.map(x => (x._2, x._1))
val B = A.map(x => (x._1 - 1, x._2))
val C = A.leftOuterJoin(B).map(x => (x._2._1, x._2._1._3 - (x._2._2 match {
  case Some(a) => a._3
  case _ => 0
})))
val group1 = C.filter(x => x._2 <= 5)
val group2 = C.filter(x => x._2 > 5)
So the concept is: you zip with index to create val A (which assigns a serial Long number to each entry of your RDD), and duplicate the RDD but with the index of the consecutive entry to create val B (by subtracting 1 from the index); then use a join to work out the gap between consecutive entries and compare it against the TIMEOUT with filter. This method stays in RDDs. An easier way would be to collect the data onto the driver and use map or a zipped mapping, but that would be plain Scala rather than Spark, I guess.

I believe this does what you need:
def splitToTimeWindows(input: RDD[Event], timeoutBetweenWindows: Long): RDD[Iterable[Event]] = {
  val withIndex: RDD[(Long, Event)] = input.sortBy(_.timestamp).zipWithIndex().map(_.swap).cache()
  val withIndexDrop1: RDD[(Long, Event)] = withIndex.map({ case (i, e) => (i - 1, e) })

  // joining the two to attach a "followingGap" to each event
  val extendedEvents: RDD[ExtendedEvent] = withIndex.leftOuterJoin(withIndexDrop1).map({
    case (i, (current, Some(next))) => ExtendedEvent(current, next.timestamp - current.timestamp)
    case (i, (current, None)) => ExtendedEvent(current, 0) // last event has no following gap
  })

  // collecting (to driver memory!) the cutoff points - timestamps of events that are *last* in their window
  // if this collection is very large, another join might be needed
  val cutoffPoints = extendedEvents.collect({
    case e: ExtendedEvent if e.followingGap > timeoutBetweenWindows => e.event.timestamp
  }).distinct().collect()

  // going back to the original input, grouping by each event's nearest preceding cutoff point
  // (i.e. the beginning of this event's window)
  input.groupBy(e => cutoffPoints.filter(_ < e.timestamp).sortWith(_ > _).headOption.getOrElse(0L)).values
}

case class Event(timestamp: Long, data: String)
case class ExtendedEvent(event: Event, followingGap: Long)
The first part builds on GameOfThrows's answer: joining the input with itself at an offset of 1 to calculate the 'followingGap' for each record. Then we collect the "breaks" or "cutoff points" between the windows, and perform another transformation on the input using these points to group it by window.
NOTE: there might be more efficient ways to perform some of these transformations, depending on the characteristics of the input; for example, if you have lots of "sessions", this code might be slow or run out of memory.
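As a quick, hedged sanity check (assuming a live SparkContext named sc), the sample events from the question split into the two expected windows:
val events = sc.parallelize(Seq(
  Event(1, "A"), Event(2, "B"), Event(4, "C"),
  Event(7, "D"), Event(15, "E"), Event(16, "F")
))

// With a timeout of 5 this should yield two windows: [A, B, C, D] and [E, F]
splitToTimeWindows(events, timeoutBetweenWindows = 5)
  .collect()
  .foreach(window => println(window.map(_.data).mkString(", ")))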

Related

Why is this multi-threaded bubble sort taking so long?

I'm trying to implement a more efficient bubble sort for a homework problem which requires us to create a list of 3,000,000 random doubles and use multi-threading to divide the list into quarters and simultaneously bubble sort each quarter. This is supposed to take ~7 minutes for each quarter (750,000 numbers), but on my computer, even sorting 10,000 doubles (2,500 per quarter) takes about 6 minutes and sorting a list of 1,000,000 integers isn't finished after over an hour. I am using a 2012 Macbook Pro. Does anyone have any ideas as to why this program is taking so long?
import scala.util.Random
import scala.collection.mutable.ListBuffer

object bubbleSortTest extends App {
  val size: Int = 10000
  val numDivs: Int = 4
  val list = ListBuffer.fill(size)(Random.nextDouble * 100) // create list of 10000 random doubles

  // divide the list into quarters and put the quarters into a new list, divs
  val divs = ListBuffer[ListBuffer[Double]]()
  for (i <- 0 until numDivs)
    divs += list.grouped(list.length / numDivs).toList(i).to[ListBuffer]

  val s0 = new Sorter("s0", list) // create a sorter for the entire list
  val sorters = divs.zipWithIndex.map(x => new Sorter(s"s${x._2 + 1}", x._1)) // create a sorter for each quarter

  s0.start()
  for (s <- sorters)
    s.start()
}

class Sorter(name: String, list: ListBuffer[Double]) extends Thread {
  override def run() {
    val t0 = System.nanoTime()
    sort(list)
    val t1 = System.nanoTime()
    println(s"$name done: ${(t1 - t0) / 1e9}s")
    //println(list)
  }

  def sort(list: ListBuffer[Double]): List[Double] = {
    var didSwap = false
    for (i <- 0 until list.length - 1) {
      //println(s"$getName: $i")
      if (list(i) > list(i + 1)) {
        didSwap = true
        val temp = list(i)
        list(i) = list(i + 1)
        list(i + 1) = temp
      }
    }
    if (didSwap)
      sort(list)
    else
      list.toList
  }
}
It could be because you are using ListBuffer. A linked list has the problem that every individual step during a traversal can entail a cache miss. Try one of the other mutable data structures, such as Array or ArrayBuffer. Because contiguous slots in an Array or ArrayBuffer sit side by side in memory, you will have fewer cache misses, which means your sort should be faster.
The problem is that ListBuffer is a linked list, not an array: each element of the list points to the next one rather than sitting in a single block of memory. In particular, getting element i from the ListBuffer means stepping through i elements of the list.
Indexing a ListBuffer is therefore O(n), and in the worst case bubble sort performs O(n^2) index accesses, so your algorithm is effectively O(n^3), which gets very slow for large n.
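As a hedged illustration of that suggestion (a sketch, not the original assignment code), the same in-place bubble sort over an Array makes indexing and swapping O(1):
import scala.util.Random

// Sketch: bubble sort over an Array so that element access and swaps are O(1)
// and the data sits contiguously in memory.
def bubbleSort(a: Array[Double]): Unit = {
  var swapped = true
  while (swapped) {
    swapped = false
    var i = 0
    while (i < a.length - 1) {
      if (a(i) > a(i + 1)) {
        val tmp = a(i)
        a(i) = a(i + 1)
        a(i + 1) = tmp
        swapped = true
      }
      i += 1
    }
  }
}

val data = Array.fill(10000)(Random.nextDouble * 100)
bubbleSort(data) // should finish far faster than the ListBuffer version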

Reduce Spark RDD to return multiple values

I have the following RDD containing sets of items which I would like to group by item similarity (items in the same set are considered similar; similarity is transitive, so all the items in sets that share at least one common item are also considered similar).
Input RDD:
Set(w1, w2)
Set(w1, w2, w3, w4)
Set(w5, w2, w6)
Set(w7, w8, w9)
Set(w10, w5, w8) --> the first 5 sets are all similar, since each shares at least one common item with another of them
Set(w11, w12, w13)
I would like the above RDD to be reduced to
Set(w1, w2, w3, w4, w5, w6, w7, w8, w9, w10)
Set(w11, w12, w13)
Any suggestions on how I could do this? I am unable to do something like the following, where I could skip merging two sets if they don't contain any common elements:
data.reduce((a,b) => if (a.intersect(b).size > 0) a ++ b ***else (a,b)***)
Thanks.
Your reduce algorithm is actually incorrect: what if one set cannot be merged with the next set, but can still be merged with a different set in the collection? For example, Set(w1) and Set(w2) share nothing, so a pairwise reduce would keep them apart, even though a third set Set(w1, w2) links them.
There are probably better ways, but I came up with a solution by converting this into a graph problem and using GraphX.
import org.apache.spark.graphx._

val data = Array(Set("w1", "w2", "w3"), Set("w5", "w6"), Set("w7"), Set("w2", "w3", "w4"))
val setRdd = sc.parallelize(data).cache

// Generate a unique id for each item, to use as the vertex id in the graph
val itemToId = setRdd.flatMap(_.toSeq).distinct.zipWithUniqueId.cache
val idToItem = itemToId.map { case (item, itemId) => (itemId, item) }

// Convert to an RDD of sets of itemIds
val newSetRdd = setRdd.zipWithUniqueId
  .flatMap { case (sets, setId) =>
    sets.map { item => (item, setId) }
  }.join(itemToId).values.groupByKey().values

// Create an RDD containing the edges of the graph
val edgeRdd = newSetRdd.flatMap { set =>
  val seq = set.toSeq
  val head = seq.head
  // Add an edge from the first item to each item in the set, including itself
  seq.map { item => Edge[Long](head, item) }
}

val graph = Graph.fromEdges(edgeRdd, Nil)

// Run the connected-components algorithm to check which items are similar:
// items in the same component are similar
val verticesRDD = graph.connectedComponents().vertices
verticesRDD.join(idToItem).values.groupByKey.values.collect.foreach(println)

How to figure out if DStream is empty

I have 2 inputs, where first input is stream (say input1) and the second one is batch (say input2).
I want to figure out whether the keys in the first input match a single row or more than one row in the second input.
The further transformations/logic depend on the number of matching rows, i.e. whether a single row matches or multiple rows match (for at least one key in the first input):
if(single row matches){
// do something
}else{
// do something
}
Code that I have tried so far:
val input1Pair = streamData.map(x => (x._1, x))
val input2Pair = input2.map(x => (x._1, x))
val joinData = input1Pair.transform { x => input2Pair.leftOuterJoin(x) }
val result = joinData.mapValues {
  case (v, Some(a)) => 1L
  case (v, None) => 0L
}.reduceByKey(_ + _).filter(_._2 > 1)
With the above code, when I do result.print, it prints nothing if all the keys match only one row in input2.
Given that a DStream may consist of multiple RDDs, I am not sure how to figure out whether the DStream is empty or not. If that is possible, I can add an if check.
There's no function to determine whether a DStream is empty, because a DStream represents a collection over time. Conceptually, an empty DStream would be a stream that never has data, which would not be very useful.
What can be done is to check whether a given microbatch has data or not:
dstream.foreachRDD{ rdd => if (rdd.isEmpty) {...} }
Please note that at any given point in time, there's only one RDD.
I think that the actual question is how to check the number of matches between the reference RDD and the data in the DStream. Probably the easiest way would be to intersect both collections and check the intersection size:
val intersectionDStream = streamData.transform { rdd => rdd.intersection(input2) }
intersectionDStream.foreachRDD { rdd =>
  if (rdd.count > 1) {
    // ...do stuff with the matches
  } else {
    // ...do otherwise
  }
}
We could also place the RDD-centric transformations within the foreachRDD operation:
streamData.foreachRDD { rdd =>
  val matches = rdd.intersection(input2)
  if (matches.count > 1) {
    // ...do stuff with the matches
  } else {
    // ...do otherwise
  }
}
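If the match should be based on keys rather than whole records (as the leftOuterJoin in the question suggests), a hedged variant of the same foreachRDD pattern could count the input2 rows per key explicitly; this is only a sketch and reuses input1Pair and input2Pair from the question:
input1Pair.foreachRDD { rdd =>
  // For each key seen in this microbatch, count how many rows of input2 carry that key.
  val streamKeys = rdd.keys.distinct()
  val rowsPerKey = streamKeys.map(k => (k, ()))
    .join(input2Pair)      // one record per matching input2 row
    .mapValues(_ => 1L)
    .reduceByKey(_ + _)
  val multiRowKeys = rowsPerKey.filter { case (_, n) => n > 1 }
  if (multiRowKeys.isEmpty()) {
    // every key in this batch matched at most one row of input2
  } else {
    // at least one key matched more than one row of input2
  }
}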

Scala String Similarity

I have Scala code that computes the similarity between a set of strings and gives back all the unique strings.
val filtered = z.reverse.foldLeft((List.empty[String], z.reverse)) {
  case ((acc, zt), zz) =>
    (if (zt.tail.exists(tt => similarity(tt, zz) < threshold)) acc
     else zz :: acc,
     zt.tail)
}._1
I'll try to explain what is going on here:
This uses a fold over the reversed input data, starting from an empty list (to accumulate results) and the (reversed) remaining input data (to compare against; I labeled it zt for "z-tail").
The fold then cycles through the data, checking each entry against the tail of the remaining data (so it doesn't get compared to itself or any earlier entry)
If there is a match, just the existing accumulator (labelled acc) will be allowed through, otherwise, add the current entry (zz) to the accumulator. This updated accumulator is paired with the tail of the "remaining" Strings (zt.tail), to ensure a reducing set to compare against.
Finally, we end up with a pair of lists: the required remaining Strings, and an empty list (no Strings left to compare against), so we take the first of these as our result.
The problem is that if, for example, the 1st, 4th and 8th strings are similar, I get only the 1st string. Instead, I should get a set of (1st, 4th, 8th); and if the 2nd, 5th, 14th and 21st strings are similar, I should get a set of (2nd, 5th, 14th, 21st).
If I understand you correctly - you want the result to be of type List[List[String]] and not the List[String] you are getting now - where each item is a list of similar Strings (right?).
If so, I can't see a trivial change to your implementation that would achieve this, because the similar values are lost: when you take the branch that just returns acc, you skip an item and never "see" it again.
Two possible solutions I can think of:
Based on your idea, but using a 3-tuple of the form (acc, zt, scanned) as the foldLeft result type, where the added scanned is the list of already-scanned items. This way we can refer back to them when we find an element that has no similar preceding elements:
val filtered = z.reverse.foldLeft((List.empty[List[String]], z.reverse, List.empty[String])) {
  case ((acc, zt, scanned), zz) =>
    val hasSimilarPreceding = zt.tail.exists { tt => similarity(tt, zz) < threshold }
    val similarFollowing = scanned.collect { case tt if similarity(tt, zz) < threshold => tt }
    (if (hasSimilarPreceding) acc else (zz :: similarFollowing) :: acc, zt.tail, zz :: scanned)
}._1
A probably-slower but much simpler solution would be to just groupBy the group of similar strings:
val alternative = z.groupBy(s => z.collect {
  case other if similarity(s, other) < threshold => other
}.toSet).values.toList
All of this assumes that the relation
f(a: String, b: String): Boolean = similarity(a, b) < threshold
is symmetric and transitive, i.e.:
f(a, b) && f(a, c) implies f(b, c)
f(a, b) if and only if f(b, a)
To test both implementations I used:
// strings are similar if they start with the same character
def similarity(s1: String, s2: String) = if (s1.head == s2.head) 0 else 100
val threshold = 1
val z = List("aa", "ab", "c", "a", "e", "fa", "fb")
And both options produce the same results:
List(List(aa, ab, a), List(c), List(e), List(fa, fb))

How do I select a range of elements in Spark RDD?

I'd like to select a range of elements in a Spark RDD. For example, I have an RDD with a hundred elements, and I need to select elements from 60 to 80. How do I do that?
I see that RDD has a take(n: Int) method, which returns the first n elements. But there is no corresponding method to take the last n elements, or n elements from the middle starting at a certain index.
I don't think there is an efficient method to do this yet. But the easy way is to use filter(). Say you have an RDD called pairs, with key-value pairs, and you only want elements with keys from 60 to 80 inclusive; just do:
val range60to80 = pairs.filter {
  _ match {
    case (k, v) => k >= 60 && k <= 80
    case _ => false // in case of invalid input
  }
}
I think it's possible that this could be done more efficiently in the future, by using sortByKey and saving information about the range of values mapped to each partition. Keep in mind this approach would only save anything if you were planning to query the range multiple times because the sort is obviously expensive.
From looking at the Spark source, it would definitely be possible to do efficient range queries using RangePartitioner:
// An array of upper bounds for the first (partitions - 1) partitions
private val rangeBounds: Array[K] = {
This is a private member of RangePartitioner, with knowledge of all the upper bounds of the partitions; using it, it would be easy to query only the necessary partitions. It looks like this is something Spark users may see in the future: SPARK-911.
UPDATE: Way better answer, based on the pull request I'm writing for SPARK-911. It will run efficiently if the RDD is sorted and you query it multiple times.
val sorted = sc.parallelize((1 to 100).map(x => (x, x))).sortByKey().cache()
val p: RangePartitioner[Int, Int] = sorted.partitioner.get.asInstanceOf[RangePartitioner[Int, Int]]
val (lower, upper) = (10, 20)
val range = p.getPartition(lower) to p.getPartition(upper)
println(range)

val rangeFilter = (i: Int, iter: Iterator[(Int, Int)]) => {
  if (range.contains(i))
    for ((k, v) <- iter if k >= lower && k <= upper) yield (k, v)
  else
    Iterator.empty
}

for ((k, v) <- sorted.mapPartitionsWithIndex(rangeFilter, preservesPartitioning = true).collect())
  println(s"$k, $v")
If having the whole partition in memory is acceptable, you could even do something like this:
val glommedAndCached = sorted.glom().cache()
glommedAndCached.map(a => a.slice(a.search(lower), a.search(upper) + 1)).collect()
(search is not a member of Array, by the way; I just made an implicit class that adds a binary search function, not shown here.)
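That implicit class is not shown in the answer; purely as a hypothetical sketch, it could be a simple binary search over the keys of each glommed (sorted) partition:
// Hypothetical helper (not part of Spark): adds a binary search over an array of
// (key, value) pairs sorted by key, returning the index of the first element
// whose key is >= target.
implicit class KeySearchable(a: Array[(Int, Int)]) {
  def search(target: Int): Int = {
    var lo = 0
    var hi = a.length
    while (lo < hi) {
      val mid = (lo + hi) >>> 1
      if (a(mid)._1 < target) lo = mid + 1 else hi = mid
    }
    lo
  }
}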
How big is your data set? You might be able to do what you need with:
data.take(80).drop(59)
This seems inefficient, but for small to medium-sized data, should work.
Is it possible to solve this in another way? What's the case for picking exactly a certain range out of the middle of your data? Would takeSample serve you better?
The following should be able to get the range. Note that the cache saves you some overhead, because internally zipWithIndex needs to scan the RDD partitions to get the number of elements in each of them.
scala> val r1 = sc.parallelize(List("a", "b", "c", "d", "e", "f", "g"), 3).cache
scala> val r2 = r1.zipWithIndex
scala> val r3 = r2.filter(x => x._2 > 2 && x._2 < 4).map(x => x._1)
scala> r3.foreach(println)
d
For those who stumble on this question looking for a Spark 2.x-compatible answer: you can use filterByRange.
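A hedged sketch of that approach (filterByRange comes from OrderedRDDFunctions and needs a key-value RDD with ordered keys; when the RDD is range-partitioned, e.g. after sortByKey, it can skip irrelevant partitions entirely):
val pairs = sc.parallelize((1 to 100).map(x => (x, x))).sortByKey()

// Keep only the elements whose keys fall in [60, 80], inclusive.
val slice = pairs.filterByRange(60, 80)
slice.values.collect() // Array(60, 61, ..., 80)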
