How to replace contents of RDD with another while preserving order? - apache-spark

I have two RDDs: one is (a, b, a, c, b, c, a) and the other is a pair RDD ((a, 0), (b, 1), (c, 2)).
I want to replace the a's, b's and c's in the first RDD with 0, 1, 2 (the values of the keys a, b, c in the second RDD), while preserving the order of the elements in the first RDD.
How can I achieve this in Spark?

For example, like this:
val rdd1 = sc.parallelize(Seq("a", "b", "a", "c", "b", "c", "a"))
val rdd2 = sc.parallelize(Seq(("a", 0), ("b", 1), ("c", 2)))
rdd1
  .map((_, 1))                   // Map to a pair RDD with dummy values
  .join(rdd2)
  .map { case (_, (_, x)) => x } // Drop keys and dummy values
If the mapping RDD is small, it can be faster to broadcast it and map:
val bd = sc.broadcast(rdd2.collectAsMap)
// This assumes every element has a mapping. If not, use get / getOrElse
// or a Map withDefault
rdd1.map(bd.value)
It will also preserve the order of the elements.
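For the missing-key case noted in the comment, a minimal getOrElse sketch (the -1 default is purely a hypothetical placeholder):
rdd1.map(x => bd.value.getOrElse(x, -1)) // Fall back to -1 when an element has no mapping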
In the case of a join you can add increasing identifiers (zipWithIndex / zipWithUniqueId) before joining to be able to restore the initial ordering, but it is substantially more expensive; a rough sketch follows.
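A rough sketch of that variant, assuming the rdd1 and rdd2 defined above (the position is carried through the join and used to restore the order at the end):
rdd1
  .zipWithIndex()                                   // (value, originalPosition)
  .join(rdd2)                                       // (value, (originalPosition, mapped))
  .map { case (_, (pos, mapped)) => (pos, mapped) }
  .sortByKey()                                      // restore the original order
  .values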

You can do this by using join.
First, to simulate your RDDs:
val rdd = sc.parallelize(List("a","b","a","c","b","c","a"))
val mapping = sc.parallelize(List(("a",0),("b",1),("c",2)))
You can only join pair RDDs, so map the original rdd to a pair RDD and then join it with mapping:
rdd.map(s => (s, None)).join(mapping).map{case(_, (_, intValue)) => intValue}

Related

Can anyone implement combineByKey() instead of groupByKey() in Spark in order to group elements?

I am trying to group the elements of an RDD that I have created. One simple but expensive way is to use groupByKey(), but recently I learned that combineByKey() can do this work more efficiently. My RDD is very simple; it looks like this:
(1,5)
(1,8)
(1,40)
(2,9)
(2,20)
(2,6)
val grouped_elements = first_RDD.groupByKey().mapValues(x => x.toList)
The result is:
(1,List(5,8,40))
(2,List(9,20,6))
I want to group them based on the first element (the key).
Can anyone help me do it with the combineByKey() function? I am really confused by combineByKey().
To begin with, take a look at the API (refer to the docs):
combineByKey[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C): RDD[(K, C)]
So it accepts three functions, which I have defined below:
scala> val createCombiner = (v:Int) => List(v)
createCombiner: Int => List[Int] = <function1>
scala> val mergeValue = (a:List[Int], b:Int) => a.::(b)
mergeValue: (List[Int], Int) => List[Int] = <function2>
scala> val mergeCombiners = (a:List[Int],b:List[Int]) => a.++(b)
mergeCombiners: (List[Int], List[Int]) => List[Int] = <function2>
Once you define these, you can use them in your combineByKey call as below:
scala> val list = List((1,5),(1,8),(1,40),(2,9),(2,20),(2,6))
list: List[(Int, Int)] = List((1,5), (1,8), (1,40), (2,9), (2,20), (2,6))
scala> val temp = sc.parallelize(list)
temp: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[41] at parallelize at <console>:30
scala> temp.combineByKey(createCombiner,mergeValue, mergeCombiners).collect
res27: Array[(Int, List[Int])] = Array((1,List(8, 40, 5)), (2,List(20, 9, 6)))
Please note that I tried this out in the Spark shell, hence you can see the outputs below the commands executed. They will help build your understanding.

How to find all pairs of sets and elements in a collection using MapReduce in Spark?

I have a collection of sets, and each set contains many items. I want to retrieve all pairs of sets and elements using Spark, where each pair after the reduce step will contain two items and two sets.
for example:
If I have this list of sets
Set A = {1, 2, 3, 4}
Set B = {1, 2, 4, 5}
Set C = {2, 3, 5, 6}
The map process will be:
(A,1)
(A,2)
(A,3)
(B,1)
(B,2)
(B,4)
(B,5)
(C,2)
(C,3)
(C,5)
(C,6)
The target result after reduce is:
(A B, 1 2) // since 1 2 exist in both A and B
(A B, 1 4)
(A B, 2 4)
(A C,2 3)
(B C,2 5)
Here (A B, 1 3) is not in the result because 1 and 3 do not both exist in B.
Could you help me solve this problem in Spark with one map and one reduce function, in any language (Python, Scala, or Java)?
Let's break this problem into multiple parts. I consider the transformation from the input lists to the map output trivial, so let us start from there: you have an RDD of (String, Int) looking like
("A", 1)
("A", 2)
....
Let's set aside for now that you need 2 integer elements in the result, and first solve for getting the intersection set between any 2 keys from the mapped output.
The result for your input would look like:
(AB, Set(1,2,4))
(BC, Set(2,5))
(AC, Set(2,3))
To do this, first extract all keys from your mapped output (mappedOutput, which is an RDD of (String, Int)), convert them to a set, and get all combinations of 2 elements (I am using a brute-force method here; a way to do this that scales better would be a combination generator, sketched below):
val combinations = mappedOutput.map(x => x._1).collect.toSet
.subsets.filter(x => x.size == 2).toList
.map(x => x.mkString(""))
The output would be List(AB, AC, BC); these combination codes will serve as keys to be joined.
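If you prefer the combination-generator route mentioned above, a hedged drop-in replacement for combinations using Scala's combinations on a sorted key list:
val combinations = mappedOutput
  .keys
  .distinct()
  .collect()
  .toList
  .sorted
  .combinations(2)      // all 2-element key combinations
  .map(_.mkString(""))  // e.g. "AB", "AC", "BC"
  .toList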
Convert the mapped output into pairs of set key (A, B, C) => set of elements:
val step1 = mappedOutput.groupByKey().map(x => (x._1, x._2.toSet))
Attach the combination codes as keys to step1:
val step2 = step1.map(x => combinations.filter(y => y.contains(x._1)).map(y => (y, x))).flatMap(x => x)
The output would be (AB, (A, set of elements in A)), (AC, (A, set of elements in A)), etc. Because of the filter, we will not attach the combination code BC to set A.
Now obtain the result we want using a reduce:
val result = step2.reduceByKey((a, b) => ("", a._2.intersect(b._2))).map(x => (x._1, x._2._2)) // intersect the element sets, then drop the key part
So we now have the output mentioned at the start. What is left is to transform this result into what you need, which is very simple to do:
val transformed = result.map(x => x._2.subsets.filter(s => s.size == 2).map(y => (x._1, y.mkString(" ")))).flatMap(x => x)
end :)

Map each element of a list in Spark

I'm working with an RDD whose pairs are structured this way: [Int, List[Int]]. My goal is to pair each item of the list with its key. So, for example, I'd need to do this:
RDD1:[Int, List[Int]]
<1><[2, 3]>
<2><[3, 5, 8]>
RDD2:[Int, Int]
<1><2>
<1><3>
<2><3>
<2><5>
<2><8>
I can't understand what kind of transformation would be needed in order to get to RDD2. The list of transformations can be found here. Any idea? Is this the wrong approach?
You can use flatMap:
val rdd1 = sc.parallelize(Seq((1, List(2, 3)), (2, List(3, 5, 8))))
val rdd2 = rdd1.flatMap(x => x._2.map(y => (x._1, y)))
// or:
val rdd2 = rdd1.flatMap{case (key, list) => list.map(nr => (key, nr))}
// print result:
rdd2.collect().foreach(println)
This gives the result:
(1,2)
(1,3)
(2,3)
(2,5)
(2,8)
flatMap creates several output objects from one input object.
In your case, the inner map in flatMap maps the tuple (Int, List[Int]) to a List[(Int, Int)]: the key is the same as in the input tuple, but for each element of the input list it creates one output tuple. flatMap then turns each element of this List into its own row in the RDD.

Spark reduce with comparison

I have an RDD of tuples of the form (key, count); however, some keys are equivalent, i.e.
(a,3)
(b,4)
(c,5)
should reduce down to the following, as a and c are equivalent (for example):
(a,8)
(b,4)
Is there a way to perform this operation in Spark?
I'm thinking of some sort of conditional within the reduce() function.
I don't think there is a way to do this within the reduce operation, but you can achieve it with a pre-processing step. One option is to create a Map[K,K] that links your keys:
val in = sc.parallelize(List(("a",3),("b",4),("c",5)))
val keyMap: Map[String,String] = Map[String,String]("a"->"a", "b"->"b", "c"->"a")
val out = in.map{case (k,v) => (keyMap.getOrElse(k,k),v)}.reduceByKey(_+_)
out.take(3).foreach(println)
Edit:
If the Map can't fit on the driver, you can also distribute the lookup:
val in = sc.parallelize(List(("a",3),("b",4),("c",5)))
val keyMap = sc.parallelize(List(("a","a"),("b","b"),("c","a")))
val out = in.join(keyMap).map{case (oldKey, (v, newKey)) => (newKey, v)}.reduceByKey(_+_)
out.take(3).foreach(println)
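If the key map is small, another option (in the spirit of the broadcast approach from the first question above) is to broadcast it and skip the shuffle that the join introduces; a minimal sketch under that assumption:
val keyMapBc = sc.broadcast(Map("a" -> "a", "b" -> "b", "c" -> "a"))
val outBc = in.map{case (k, v) => (keyMapBc.value.getOrElse(k, k), v)}.reduceByKey(_+_)
outBc.take(3).foreach(println)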
reduceByKey() does the trick here, as your data is already a pair RDD (note that the example below assumes the equivalent key c has already been renamed to a):
val baseRDD = sc.parallelize(Seq(("a", 3), ("b", 4), ("a", 5)))
baseRDD.reduceByKey((accum, current) => accum + current).foreach(println)

How to generate a new RDD from another RDD according to specific logic

I am new to Spark. I have a problem, but I don't know how to solve it. My data in an RDD is as follows:
(1,{A,B,C,D})
(2,{E,F,G})
......
I know RDDs are immutable, but I want to transform my RDD into a new RDD that looks like this:
11 A,B
12 B,C
13 C,D
21 E,F
22 F,G
......
How can I generate a new key and extract adjacent elements?
Assuming your collection is something similar to a List, you could do something like:
val rdd2 = rdd1.flatMap { case (key, values) =>
  for (value <- values.sliding(2).zipWithIndex)
    yield (key.toString + value._2, value._1)
}
What we are doing here is iterating through the values in your list, applying a sliding window of size 2 to the elements, zipping each window with an integer index, and finally outputting tuples keyed by the original key appended with the window index (whose values are the windowed elements). We also use flatMap here in order to flatten the results into their own records.
When run in spark-shell, I'm seeing the following output on your example:
scala> val rdd1 = sc.parallelize(Array((1,List("A","B","C","D")), (2,List("E","F","G"))))
rdd1: org.apache.spark.rdd.RDD[(Int, List[String])] = ParallelCollectionRDD[0] at parallelize at <console>:21
scala> val rdd2 = rdd1.flatMap { case (key, values) => for (value <- values.sliding(2).zipWithIndex) yield (key.toString + value._2, value._1) }
rdd2: org.apache.spark.rdd.RDD[(String, Seq[String])] = MapPartitionsRDD[1] at flatMap at <console>:23
scala> rdd2.foreach(println)
...
(10,List(A, B))
(11,List(B, C))
(12,List(C, D))
(20,List(E, F))
(21,List(F, G))
One note with this is that the output key (e.g. 10, 11) will grow to 3 digits once the window index reaches 10; for example, for the input key 1, the 11th window will produce the output key 110. Not sure if that fits your use case, but it seemed like a reasonable extension of your request. Based on your output key scheme, I would actually suggest something different, like adding a hyphen between the key and the window index. This prevents collisions later, as you'll see 2-10 and 21-0 instead of 210 for both keys; a sketch of that variant follows.
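A minimal sketch of that hyphenated-key variant, reusing the same sliding / zipWithIndex approach (the name rdd2Hyphenated is just for illustration):
val rdd2Hyphenated = rdd1.flatMap { case (key, values) =>
  for ((window, idx) <- values.sliding(2).zipWithIndex)
    yield (s"$key-$idx", window)   // e.g. (1-0, List(A, B)), (2-1, List(F, G))
}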
