Reduce Spark RDD to return multiple values - apache-spark

I have the following RDD containing sets of items which I would like to group by item similarity (items in the same set are considered similar; similarity is transitive, so all items in sets that share at least one common item are also considered similar).
Input RDD:
Set(w1, w2)
Set(w1, w2, w3, w4)
Set(w5, w2, w6)
Set(w7, w8, w9)
Set(w10, w5, w8) --> the first 5 sets above all end up in one group, since each shares at least one item with another set in the group
Set(w11, w12, w13)
I would like the above RDD to be reduced to
Set(w1, w2, w3, w4, w5, w6, w7, w8, w9, w10)
Set(w11, w12, w13)
Any suggestions on how I could do this? I am unable to do something like the below, where I would need to skip merging two sets if they don't share any common elements:
data.reduce((a, b) => if (a.intersect(b).size > 0) a ++ b else (a, b)) // the "else (a, b)" part is what I cannot express
Thanks.

Your reduce algorithm is actually incorrect: for example, what if one set cannot be merged with the next set, but can still be merged with a different set in the collection?
There are probably better ways, but I came up with a solution by converting this to a graph problem and using GraphX.
import org.apache.spark.graphx.{Edge, Graph}
val data = Array(Set("w1", "w2", "w3"), Set("w5", "w6"), Set("w7"), Set("w2", "w3", "w4"))
val setRdd = sc.parallelize(data).cache
// Generate a unique id for each item to use as the vertex id in the graph
val itemToId = setRdd.flatMap(_.toSeq).distinct.zipWithUniqueId.cache
val idToItem = itemToId.map { case (item, itemId) => (itemId, item) }
// Convert to an RDD of sets of item ids
val newSetRdd = setRdd.zipWithUniqueId
  .flatMap { case (sets, setId) =>
    sets.map { item => (item, setId) }
  }.join(itemToId).values.groupByKey().values
// Create an RDD containing the edges of the graph
val edgeRdd = newSetRdd.flatMap { set =>
  val seq = set.toSeq
  val head = seq.head
  // Add an edge from the first item to each item in the set, including itself
  seq.map { item => Edge[Long](head, item) }
}
val graph = Graph.fromEdges(edgeRdd, Nil)
// Run the connected components algorithm to check which items are similar.
// Items in the same component are similar.
val verticesRDD = graph.connectedComponents().vertices
verticesRDD.join(idToItem).values.groupByKey.values.collect.foreach(println)
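For the sample data in this answer, the connected components should resolve to three groups, so the final line should print something along these lines (exact collection type and ordering may differ):
// CompactBuffer(w1, w2, w3, w4) <- the first and last sets are linked through w2/w3
// CompactBuffer(w5, w6)
// CompactBuffer(w7)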

Related

adding a new column using withColumn from a lookup table dynamically

I am using spark-sql-2.4.1v with Java 8. I have a scenario where I need to dynamically add a column from a lookup table.
I have a data frame with columns
A, B, C, ..., X, Y, Z
When some of the original columns (e.g. A, B, C) have null values, I need to take the values of the substitute columns (e.g. X, Y, Z) instead; otherwise I take the original column values.
I will get this mapping information as part of the business logic.
In that case I would write something like the hard-coded code below:
Dataset<Row> substitutedDs = ds
    .withColumn("A",
        when(col("A").isNull(), col("X").cast(DataTypes.StringType))
            .otherwise(col("A").cast(DataTypes.StringType)))
    .withColumn("C",
        when(col("C").isNull(), col("Z").cast(DataTypes.StringType))
            .otherwise(col("C").cast(DataTypes.StringType)));
This works fine, but I need to make it dynamic/configurable to avoid hard-coding.
I will get a lookup table with "Code" and "Code_Substitute" columns, as below:
-------------------------
| Code | Code_Substitute |
-------------------------
A X
B Y
C Z
-------------------------
I need to construct the above "substitutedDs" dynamically. How can this be done?
With Java 8, you can use the Stream.reduce() overload that takes an identity, an accumulator, and a combiner:
final Dataset<Row> dataframe = ...;
final Map<String, String> codeSubstitutes = ...;
final Dataset<Row> afterSubstitutions = codeSubstitutes.entrySet().stream()
    .reduce(dataframe,
        (df, entry) -> df.withColumn(entry.getKey(), when(/* replace with col(entry.getValue()) when null */)),
        (left, right) -> {
            throw new IllegalStateException("Can't merge two dataframes. This stream should not be a parallel one!");
        });
The combiner (last argument) is supposed to merge two dataframes processed in parallel (if the stream was a parallel() stream), but we'll simply not allow that, as we're only invoking this logic on a sequential() stream.
A more readable/maintainable version involves an extra step of extracting the above logic into dedicated methods, such as:
// ...
    Dataset<Row> nullSafeDf = codeSubstitutes.entrySet().stream()
        .reduce(dataframe, this::replaceIfNull, this::throwingCombiner);
    // ...
}
private Dataset<Row> replaceIfNull(Dataset<Row> df, Map.Entry<String, String> substitution) {
    final String original = substitution.getKey();
    final String replacement = substitution.getValue();
    return df.withColumn(original,
        when(col(original).isNull(), col(replacement)).otherwise(col(original)));
}
private <X> X throwingCombiner(X left, X right) {
    throw new IllegalStateException("Combining not allowed");
}
In Scala, I would do it like this:
val substituteMapping: Map[String, String] = ??? // your substitute map; it is small, as it only contains the columns and their null substitutes
val df = ??? // your main dataframe
val substitutedDf = substituteMapping.keys.foldLeft(df)((df, k) => {
  df.withColumn(k, when(col(k).isNull, col(substituteMapping(k))).otherwise(col(k)))
  // do the appropriate casting here, as you did in your post
})
I don't think foldLeft exists in Java 8; you can emulate it by repeatedly reassigning a variable while iterating over substituteMapping.
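For example, a minimal sketch of wiring this up with the mapping from the question (df is the main dataframe from above; the cast to string mirrors the hard-coded version in the question):
import org.apache.spark.sql.functions.{col, when}
val substituteMapping = Map("A" -> "X", "B" -> "Y", "C" -> "Z") // taken from the question's lookup table
val substitutedDf = substituteMapping.keys.foldLeft(df) { (acc, k) =>
  acc.withColumn(k,
    when(col(k).isNull, col(substituteMapping(k)).cast("string"))
      .otherwise(col(k).cast("string")))
}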

Why filter does not preserve partitioning?

This is a quote from jaceklaskowski.gitbooks.io.
Some operations, e.g. map, flatMap, filter, don’t preserve partitioning.
map, flatMap, filter operations apply a function to every partition.
I don't understand why filter does not preserve partitioning. It just takes a subset of each partition that satisfies a condition, so I think the partitioning could be preserved. Why isn't that the case?
You are of course right. The quote is just incorrect. filter does preserve partitioning (for the reason you've already described), and it is trivial to confirm that
val rdd = sc.range(0, 10).map(x => (x % 3, None)).partitionBy(
  new org.apache.spark.HashPartitioner(11)
)
rdd.partitioner
// Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@b)
val filteredRDD = rdd.filter(_._1 == 3)
filteredRDD.partitioner
// Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@b)
rdd.partitioner == filteredRDD.partitioner
// Boolean = true
This is in contrast to operations like map, which do not preserve partitioning (the Partitioner is dropped):
rdd.map(identity _).partitioner
// Option[org.apache.spark.Partitioner] = None
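As a side note (not part of the original answer): mapValues does keep the partitioner, because it cannot change the keys, so the partition assignment stays valid. A quick check:
rdd.mapValues(identity).partitioner
// Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@...) - same partitioner as the original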
Datasets are a bit more subtle, as filters are normally pushed-down, but overall the behavior is similar.
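If you want to see the pushdown on the Dataset side, a quick sketch (the parquet path here is hypothetical, and the exact plan output varies with the Spark version):
import org.apache.spark.sql.functions.col
val ds = spark.read.parquet("/path/to/data.parquet") // hypothetical input with a "value" column
ds.filter(col("value") > 3).explain()
// The scan node of the physical plan should list the condition under "PushedFilters".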
filter does preserve partitioning; at least, that is what the source code of filter suggests (preservesPartitioning = true):
/**
 * Return a new RDD containing only the elements that satisfy a predicate.
 */
def filter(f: T => Boolean): RDD[T] = withScope {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[T, T](
    this,
    (context, pid, iter) => iter.filter(cleanF),
    preservesPartitioning = true)
}

How to figure out if DStream is empty

I have 2 inputs, where first input is stream (say input1) and the second one is batch (say input2).
I want to figure out if the keys in first input matches single row or more than one row in the second input.
The further transformations/logic depend on the number of matching rows: whether a single row matches or multiple rows match (for at least one key in the first input).
if(single row matches){
// do something
}else{
// do something
}
Code that I have tried so far:
val input1Pair = streamData.map(x => (x._1, x))
val input2Pair = input2.map(x => (x._1, x))
val joinData = input1Pair.transform { x => input2Pair.leftOuterJoin(x) }
val result = joinData.mapValues {
  case (v, Some(a)) => 1L
  case (v, None) => 0L
}.reduceByKey(_ + _).filter(_._2 > 1)
With the above code, when I do result.print, it prints nothing if all the keys match only one row in input2.
Given that the DStream may contain multiple RDDs, I'm not sure how to figure out whether the DStream is empty or not. If that is possible, I can do an if check.
There's no function to determine if a DStream is empty, as a DStream represents a collection over time. From a conceptual perspective, an empty DStream would be a stream that never has data and that would not be very useful.
What can be done is to check whether a given microbatch has data or not:
dstream.foreachRDD{ rdd => if (rdd.isEmpty) {...} }
Please note that at any given point in time, there's only one RDD.
I think that the actual question is how to check the number of matches between the reference RDD and the data in the DStream. Probably the easiest way would be to intersect both collections and check the intersection size:
val intersectionDStream = streamData.transform { rdd => rdd.intersection(input2) }
intersectionDStream.foreachRDD { rdd =>
  if (rdd.count > 1) {
    // ...do stuff with the matches
  } else {
    // ...do otherwise
  }
}
We could also place the RDD-centric transformations within the foreachRDD operation:
streamData.foreachRDD { rdd =>
  val matches = rdd.intersection(input2)
  if (matches.count > 1) {
    // ...do stuff with the matches
  } else {
    // ...do otherwise
  }
}

Collect RDD to one node in sorted order

I have a large RDD which needs to be written to a single file on disk, one line for each element, with the lines sorted in some defined order. So I was thinking of sorting the RDD, collecting one partition at a time in the driver, and appending to the output file.
A couple of questions:
After rdd.sortBy(), do I have the guarantee that partition 0 will contain the first elements of the sorted RDD, partition 1 will contain the next elements, and so on? (I'm using the default partitioner.)
e.g.
val rdd = ???
val sortedRdd = rdd.sortBy(???)
for (p <- sortedRdd.partitions) {
  val index = p.index
  val partitionRdd = sortedRdd.mapPartitionsWithIndex { case (i, values) =>
    if (i == index) values else Iterator()
  }
  val partition = partitionRdd.collect()
  partition.foreach { e =>
    // Append element e to file
  }
}
I understand that rdd.toLocalIterator is a more efficient way of fetching all partitions, one at a time. So same question: do I get the elements in the order given by .sortBy()?
val rdd = ???
val sortedRdd = rdd.sortBy(???)
for (e <- sortedRdd.toLocalIterator) {
  // Append element e to file
}
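For what it's worth, a minimal sketch of the file-appending part (the output path and the use of toString for each element are my assumptions, not from the question):
import java.io.PrintWriter
val writer = new PrintWriter("sorted-output.txt") // hypothetical local output path
try {
  for (e <- sortedRdd.toLocalIterator) {
    writer.println(e) // relies on a sensible toString for the element type
  }
} finally {
  writer.close()
}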

Referencing the next entry in RDD within a map function

I have a stream of <id, action, timestamp, data>s to process.
For example, (let us assume there's only 1 id for simplicity)
id event timestamp
-------------------------------
1 A 1
1 B 2
1 C 4
1 D 7
1 E 15
1 F 16
Let's say TIMEOUT = 5. Because more than 5 seconds passed after D happened without any further event, I want to map this to a JavaPairDStream with two key-value pairs.
id1_1:
A 1
B 2
C 4
D 7
and
id1_2:
E 15
F 16
However, in the anonymous PairFunction object that I pass to the mapToPair() method,
incomingMessages.mapToPair(new PairFunction<String, String, RequestData>() {
  private static final long serialVersionUID = 1L;
  @Override
  public Tuple2<String, RequestData> call(String s) {
I cannot reference the data in the next entry. In other words, when I am processing the entry with event D, I cannot look at the data at E.
If this was not Spark, I could have simply created an array timeDifferences, store the differences in two adjacent timestamps, and split the array into parts whenever I see a time difference in timeDifferences that is larger than TIMEOUT. (Although, actually there's no need to explicitly create an array)
How can I do this in Spark?
I'm still struggling to understand your question a bit, but based on what you've written, I think you can do it this way:
val A = sc.parallelize(List((1, "A", 1.0), (1, "B", 2.0), (1, "C", 15.0)))
  .zipWithIndex.map(x => (x._2, x._1))
val B = A.map(x => (x._1 - 1, x._2))
val C = A.leftOuterJoin(B).map(x => (x._2._1, x._2._1._3 - (x._2._2 match {
  case Some(a) => a._3
  case _ => 0
})))
val group1 = C.filter(x => x._2 <= 5)
val group2 = C.filter(x => x._2 > 5)
So the concept is: you zip with index to create val A (which assigns a sequential Long index to each entry of your RDD), duplicate the RDD but shift the index by one to create val B (by subtracting 1 from the index), and then use a join to work out the gap between consecutive entries. Then use filter. This method stays within RDDs. An easier way would be to collect everything to the master and use a map or zipped mapping, but that would be plain Scala rather than Spark, I guess.
I believe this does what you need:
def splitToTimeWindows(input: RDD[Event], timeoutBetweenWindows: Long): RDD[Iterable[Event]] = {
  val withIndex: RDD[(Long, Event)] = input.sortBy(_.timestamp).zipWithIndex().map(_.swap).cache()
  val withIndexDrop1: RDD[(Long, Event)] = withIndex.map({ case (i, e) => (i - 1, e) })
  // joining the two to attach a "followingGap" to each event
  val extendedEvents: RDD[ExtendedEvent] = withIndex.leftOuterJoin(withIndexDrop1).map({
    case (i, (current, Some(next))) => ExtendedEvent(current, next.timestamp - current.timestamp)
    case (i, (current, None)) => ExtendedEvent(current, 0) // last event has no following gap
  })
  // collecting (to driver memory!) the cutoff points - timestamps of events that are *last* in their window;
  // if this collection is very large, another join might be needed
  val cutoffPoints = extendedEvents.collect({ case e: ExtendedEvent if e.followingGap > timeoutBetweenWindows => e.event.timestamp }).distinct().collect()
  // going back to the original input, grouping by each event's nearest cutoff point (i.e. the beginning of this event's window)
  input.groupBy(e => cutoffPoints.filter(_ < e.timestamp).sortWith(_ > _).headOption.getOrElse(0)).values
}
case class Event(timestamp: Long, data: String)
case class ExtendedEvent(event: Event, followingGap: Long)
The first part builds on GameOfThrows's answer - joining the input with itself at an offset of 1 to calculate the 'followingGap' for each record. Then we collect the "breaks" or "cutoff points" between the windows, and perform another transformation on the input using these points to group it by window.
NOTE: there might be more efficient ways to perform some of these transformations, depending on the characteristics of the input, for example: if you have lots of "sessions", this code might be slow or run out of memory.
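For reference, a hypothetical invocation with the sample events from the question (mapping each event letter to the data field of the Event case class above):
val events = sc.parallelize(Seq(
  Event(1, "A"), Event(2, "B"), Event(4, "C"), Event(7, "D"),
  Event(15, "E"), Event(16, "F")
))
val windows = splitToTimeWindows(events, 5)
windows.collect().foreach(println)
// With a timeout of 5, this should yield one window containing A..D and another containing E..F.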
