adding a new column using withColumn from a lookup table dynamically - apache-spark

I am using spark-sql-2.4.1v with Java 8. I have a scenario where I need to dynamically add a column from a lookup table.
I have a data frame with columns
A, B, C, ..., X, Y, Z
When some of the original columns (e.g. A, B, C) have null values, I need to substitute the values of the mapped columns (e.g. X, Y, Z); otherwise I take the original column values.
I will get this mapping information as part of the business logic.
In that case, I would write something like the following hard-coded code:
Dataset<Row> substitutedDs = ds
    .withColumn("A",
        when(col("A").isNull(), col("X").cast(DataTypes.StringType))
            .otherwise(col("A").cast(DataTypes.StringType))
    )
    .withColumn("C",
        when(col("C").isNull(), col("Z").cast(DataTypes.StringType))
            .otherwise(col("C").cast(DataTypes.StringType))
    );
This works fine, but I need to make it dynamic/configurable to avoid hard-coding.
I will get a lookup table with the columns "Code" and "Code_Substitute", as below:
--------------------------
| Code | Code_Substitute |
--------------------------
|  A   |        X        |
|  B   |        Y        |
|  C   |        Z        |
--------------------------
I need to construct the above "substitutedDs" dynamically. How can this be done?

With Java 8, you can use the three-argument Stream.reduce() overload:
final Dataset<Row> dataframe = ...;
final Map<String, String> codeSubstitutes = ...;
final Dataset<Row> afterSubstitutions = codeSubstitutes.entrySet().stream()
    .reduce(dataframe,
        (df, entry) -> df.withColumn(entry.getKey(),
            when(col(entry.getKey()).isNull(), col(entry.getValue()))
                .otherwise(col(entry.getKey()))),
        (left, right) -> {
            throw new IllegalStateException("Can't merge two dataframes. This stream should not be a parallel one!");
        }
    );
The combiner (last argument) is supposed to merge two dataframes processed in parallel (if the stream was a parallel() stream), but we'll simply not allow that, as we're only invoking this logic on a sequential() stream.
A more readable/maintainable version involves an extra step of extracting the above logic into dedicated methods, such as:
// ...
Dataset<Row> nullSafeDf = codeSubstitutes.entrySet().stream()
.reduce(dataframe, this::replaceIfNull, this::throwingCombiner);
// ...
}
private Dataset<Row> replaceIfNull(Dataset<Row> df, Map.Entry<String, String> substitution) {
    final String original = substitution.getKey();
    final String replacement = substitution.getValue();
    return df.withColumn(original, when(col(original).isNull(), col(replacement))
            .otherwise(col(original)));
}

private <X> X throwingCombiner(X left, X right) {
    throw new IllegalStateException("Combining not allowed");
}
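If the substitutions come in as the lookup table from the question rather than a ready-made Map, here is a small sketch of how codeSubstitutes could be built (assuming the lookup Dataset is called lookupDs and has the "Code" and "Code_Substitute" columns shown above):
import java.util.Map;
import java.util.stream.Collectors;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

// Collect the (small) lookup table to the driver and turn it into a column -> substitute map.
Map<String, String> codeSubstitutes = lookupDs
        .select("Code", "Code_Substitute")
        .collectAsList()
        .stream()
        .collect(Collectors.toMap(
                row -> row.getString(0),   // original column name
                row -> row.getString(1))); // substitute column name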

In Scala, I would do it like this:
val substituteMapping: Map[String, String] = ??? // your substitute map; it is small as it contains the columns and their null substitutes
val df = ??? // your main dataframe
val substitutedDf = substituteMapping.keys.foldLeft(df)((df, k) => {
  df.withColumn(k, when(col(k).isNull, col(substituteMapping(k))).otherwise(col(k)))
  // do the appropriate casting here, as you did in your post
})
Java 8 streams do not have a foldLeft, but you can emulate it by iterating over substituteMapping and reassigning a variable, as in the sketch below.
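A minimal Java 8 sketch of that emulation (assuming df and substituteMapping are already in scope, and leaving out the casts from the question):
import java.util.Map;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.when;

// Emulating foldLeft: reassign the dataframe variable while iterating over the map.
Dataset<Row> result = df;
for (Map.Entry<String, String> entry : substituteMapping.entrySet()) {
    String original = entry.getKey();
    String substitute = entry.getValue();
    result = result.withColumn(original,
            when(col(original).isNull(), col(substitute))
                    .otherwise(col(original)));
}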

Related

Merge two strings in kotlin

I have two strings
val a = "abc"
val b = "xyz"
I want to merge them and need output like below:
axbycz
I added both strings to a list and then flatMapped it:
val c = listOf(a, b)
val d = c.flatMap {
it.toList()
}
but I am not getting the desired result.
Use the zip function. It creates a list of pairs with "adjacent" letters. You can then use joinToString with a transformer to create your final result.
a.zip(b) // Returns the list [(a, x), (b, y), (c, z)]
.joinToString("") { (a, b) -> "$a$b" } // Joins the list back to a string with no separator
You can always use a simple loop, assuming both strings have the same length. That way you only allocate a StringBuilder and a counter variable, without any lists, arrays, or pairs:
val a = "abc"
val b = "xyz"
val sb = StringBuilder()
for(i in 0 until a.length){
sb.append(a[i]).append(b[i])
}
val d = sb.toString()
marstran's answer is really concise and Pawel's answer is really fast. Using buildString you can have the best of both worlds:
buildString {
a.zip(b).forEach { (a, b) ->
append(a).append(b)
}
}
buildString creates a StringBuilder and offers it as receiver in the lambda. It returns the built string.
Try it out here: Kotlin Playground. Thanks to Pawel for creating the original benchmark.

Reduce Spark RDD to return multiple values

I have the following RDD containing sets of items which I would like to group by item similarity. (Items in the same set are considered similar. Similarity is transitive, and all items in sets which share at least one common item are also considered similar.)
Input RDD:
Set(w1, w2)
Set(w1, w2, w3, w4)
Set(w5, w2, w6)
Set(w7, w8, w9)
Set(w10, w5, w8) --> The first 5 sets are all similar, as each of them shares at least one common item with another set
Set(w11, w12, w13)
I would like the above RDD to be reduced to
Set(w1, w2, w3, w4, w5, w6, w7, w8, w9, w10)
Set(w11, w12, w13)
Any suggestions on how I could do this? I am unable to do something like the following, where I would skip merging two sets if they don't contain any common elements:
data.reduce((a,b) => if (a.intersect(b).size > 0) a ++ b ***else (a,b)***)
Thanks.
Your reduce algorithm is actually incorrect. For example, what if one set cannot be merged with the next set, but can still be merged with a different set in the collection?
There are probably better ways, but I came up with a solution by converting this into a graph problem and using GraphX.
val data = Array(Set("w1", "w2", "w3"), Set("w5", "w6"), Set("w7"), Set("w2", "w3", "w4"))
val setRdd = sc.parallelize(data).cache
// Generate a unique id for each item to use as the vertex id in the graph
val itemToId = setRdd.flatMap(_.toSeq).distinct.zipWithUniqueId.cache
val idToItem = itemToId.map { case (item, itemId) => (itemId, item) }
// Convert to a RDD of set of itemId
val newSetRdd = setRdd.zipWithUniqueId
.flatMap { case (sets, setId) =>
sets.map { item => (item, setId) }
}.join(itemToId).values.groupByKey().values
// Create an RDD containing edges of the graph
val edgeRdd = newSetRdd.flatMap { set =>
val seq = set.toSeq
val head = seq.head
// Add an edge from the first item to each item in a set,
// including itself
seq.map { item => Edge[Long](head, item)}
}
val graph = Graph.fromEdges(edgeRdd, Nil)
// Run connected component algorithm to check which items are similar.
// Items in the same component are similar
val verticesRDD = graph.connectedComponents().vertices
verticesRDD.join(idToItem).values.groupByKey.values.collect.foreach(println)

Collect RDD to one node in sorted order

I have a large RDD which needs to be written to a single file on disk, one line for each element, with the lines sorted in some defined order. So I was thinking of sorting the RDD, collecting one partition at a time in the driver, and appending to the output file.
A couple of questions:
After rdd.sortBy(), do I have the guarantee that partition 0 will contain the first elements of the sorted RDD, partition 1 will contain the next elements of the sorted RDD, and so on? (I'm using the default partitioner.)
e.g.
val rdd = ???
val sortedRdd = rdd.sortBy(???)
for (p <- sortedRdd.partitions) {
  val index = p.index
  val partitionRdd = sortedRdd mapPartitionsWithIndex { case (i, values) => if (i == index) values else Iterator() }
  val partition = partitionRdd.collect()
  partition foreach { e =>
    // Append element e to file
  }
}
I understand that rdd.toLocalIterator is a more efficient way of fetching all partitions, one at a time. So same question: do I get the elements in the order given by .sortBy()?
val rdd = ???
val sortedRdd = rdd.sortBy(???)
for (e <- sortedRdd.toLocalIterator) {
// Append element e to file
}

Referencing the next entry in RDD within a map function

I have a stream of <id, action, timestamp, data>s to process.
For example (let us assume there is only one id, for simplicity):
id event timestamp
-------------------------------
1 A 1
1 B 2
1 C 4
1 D 7
1 E 15
1 F 16
Let's say TIMEOUT = 5. Because more than 5 seconds passed after D happened without any further event, I want to map this to a JavaPairDStream with two key : value pairs.
id1_1:
A 1
B 2
C 4
D 7
and
id1_2:
E 15
F 16
However, in the anonymous PairFunction object that I pass to the mapToPair() method,
incomingMessages.mapToPair(new PairFunction<String, String, RequestData>() {
private static final long serialVersionUID = 1L;
@Override
public Tuple2<String, RequestData> call(String s) {
I cannot reference the data in the next entry. In other words, when I am processing the entry with event D, I cannot look at the data at E.
If this were not Spark, I could simply have created an array timeDifferences, stored the differences between adjacent timestamps, and split the array into parts whenever I saw a time difference in timeDifferences larger than TIMEOUT. (Although, actually, there is no need to explicitly create an array.)
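(Outside Spark, that splitting would be roughly the following plain-Java sketch, where Event is just a placeholder class with a long timestamp field and the input list is already sorted by timestamp:)
import java.util.ArrayList;
import java.util.List;

// Split a timestamp-sorted list into groups wherever the gap to the previous event exceeds the timeout.
static List<List<Event>> splitByTimeout(List<Event> events, long timeout) {
    List<List<Event>> groups = new ArrayList<>();
    List<Event> current = new ArrayList<>();
    Event previous = null;
    for (Event e : events) {
        if (previous != null && e.timestamp - previous.timestamp > timeout) {
            groups.add(current);
            current = new ArrayList<>();
        }
        current.add(e);
        previous = e;
    }
    if (!current.isEmpty()) {
        groups.add(current);
    }
    return groups;
}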
How can I do this in Spark?
I'm still struggling to understand your question a bit, but based on what you've written, I think you can do it this way:
val A = sc.parallelize(List((1,"A",1.0),(1,"B",2.0),(1,"C",15.0))).zipWithIndex.map(x=>(x._2,x._1))
val B = A.map(x=>(x._1-1,x._2))
val C = A.leftOuterJoin(B).map(x=>(x._2._1,x._2._1._3 - (x._2._2 match{
case Some(a) => a._3
case _ => 0
})))
val group1 = C.filter(x=>(x._2 <= 5))
val group2 = C.filter(x=>(x._2 > 5))
So the concept is: you zip with index to create val A (which assigns a sequential Long to each entry of your RDD), and duplicate the RDD with the index shifted to the consecutive entry to create val B (by subtracting 1 from the index), then use a join to work out the gap between consecutive entries and compare it against TIMEOUT. Then use filter. This method uses RDDs. An easier way is to collect the entries onto the master and use a plain map or zipped mapping, but that would be Scala rather than Spark, I guess.
I believe this does what you need:
def splitToTimeWindows(input: RDD[Event], timeoutBetweenWindows: Long): RDD[Iterable[Event]] = {
val withIndex: RDD[(Long, Event)] = input.sortBy(_.timestamp).zipWithIndex().map(_.swap).cache()
val withIndexDrop1: RDD[(Long, Event)] = withIndex.map({ case (i, e) => (i-1, e)})
// joining the two to attach a "followingGap" to each event
val extendedEvents: RDD[ExtendedEvent] = withIndex.leftOuterJoin(withIndexDrop1).map({
case (i, (current, Some(next))) => ExtendedEvent(current, next.timestamp - current.timestamp)
case (i, (current, None)) => ExtendedEvent(current, 0) // last event has no following gap
})
// collecting (to driver memory!) cutoff points - timestamp of events that are *last* in their window
// if this collection is very large, another join might be needed
val cutoffPoints = extendedEvents.collect({ case e: ExtendedEvent if e.followingGap > timeoutBetweenWindows => e.event.timestamp }).distinct().collect()
// going back to the original input, grouping by each event's nearest cutoff point (i.e. the beginning of this event's window)
input.groupBy(e => cutoffPoints.filter(_ < e.timestamp).sortWith(_ > _).headOption.getOrElse(0L)).values
}
case class Event(timestamp: Long, data: String)
case class ExtendedEvent(event: Event, followingGap: Long)
The first part builds on GameOfThrows's answer - joining the input with itself at an offset of 1 to calculate the 'followingGap' for each record. Then we collect the "breaks" or "cutoff points" between the windows, and perform another transformation on the input using these points to group it by window.
NOTE: there might be more efficient ways to perform some of these transformations, depending on the characteristics of the input, for example: if you have lots of "sessions", this code might be slow or run out of memory.

Spark key value pairs to string

I was wondering how I could turn key-value pairs into strings for the output. Currently my code is like this:
object Task2 {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Task2"))
    // read the file
    val file = sc.textFile("hdfs://localhost:8020" + args(0))
    val split = file
      .map(line => (line.split("\t")(1), line.split("\t")(2)))
      .reduceByKey(_ + _)
    // store output on the given HDFS path
    split.saveAsTextFile("hdfs://localhost:8020" + args(1))
  }
}
The split output gives me key-value pairs like (x, y), but I would like them to be x y pairs separated by a tab. I've tried using map and mkString to no avail. What should I do?
This should work:
split.map(x=>x._1+"\t"+x._2).saveAsTextFile(....)
