Map each element of a list in Spark

I'm working with an RDD whose pairs are structured as [Int, List[Int]]. My goal is to pair each item of the list with its key. So, for example, I'd need to do this:
RDD1:[Int, List[Int]]
<1><[2, 3]>
<2><[3, 5, 8]>
RDD2:[Int, Int]
<1><2>
<1><3>
<2><3>
<2><5>
<2><8>
I've looked through the list of transformations, but I can't work out which one would be needed to get to RDD2. Any ideas? Is this the wrong approach?

You can use flatMap:
val rdd1 = sc.parallelize(Seq((1, List(2, 3)), (2, List(3, 5, 8))))
val rdd2 = rdd1.flatMap(x => x._2.map(y => (x._1, y)))
// or:
val rdd2 = rdd1.flatMap{case (key, list) => list.map(nr => (key, nr))}
// print result:
rdd2.collect().foreach(println)
This gives:
(1,2)
(1,3)
(2,3)
(2,5)
(2,8)
flatMap can create several output objects from one input object.
In your case, the inner map inside the flatMap maps each tuple (Int, List[Int]) to a List[(Int, Int)]: the key is the same as in the input tuple, but for each element of the input list it creates one output tuple. flatMap then turns each element of this List into its own row in the RDD.
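Since the values here are already collections, the same result can also be written more compactly with flatMapValues; a minimal equivalent sketch:
// flatMapValues flattens each value's list, pairing every element with its key.
val rdd2 = rdd1.flatMapValues(identity)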

Related

Can anyone implement combineByKey() instead of groupByKey() in Spark in order to group elements?

I am trying to group the elements of an RDD I have created. One simple but expensive way is to use groupByKey(), but I recently learned that combineByKey() can do the same work more efficiently. My RDD is very simple and looks like this:
(1,5)
(1,8)
(1,40)
(2,9)
(2,20)
(2,6)
val grouped_elements = first_RDD.groupByKey().mapValues(x => x.toList)
The result is:
(1,List(5,8,40))
(2,List(9,20,6))
I want to group them based on the first element (the key).
Can anyone help me do it with the combineByKey() function? I am really confused by combineByKey().
To begin with, take a look at the API docs:
combineByKey[C](createCombiner: (V) ⇒ C, mergeValue: (C, V) ⇒ C, mergeCombiners: (C, C) ⇒ C): RDD[(K, C)]
It accepts three functions, which I have defined below.
scala> val createCombiner = (v:Int) => List(v)
createCombiner: Int => List[Int] = <function1>
scala> val mergeValue = (a:List[Int], b:Int) => a.::(b)
mergeValue: (List[Int], Int) => List[Int] = <function2>
scala> val mergeCombiners = (a:List[Int],b:List[Int]) => a.++(b)
mergeCombiners: (List[Int], List[Int]) => List[Int] = <function2>
Once you have defined these, you can use them in your combineByKey call as below:
scala> val list = List((1,5),(1,8),(1,40),(2,9),(2,20),(2,6))
list: List[(Int, Int)] = List((1,5), (1,8), (1,40), (2,9), (2,20), (2,6))
scala> val temp = sc.parallelize(list)
temp: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[41] at parallelize at <console>:30
scala> temp.combineByKey(createCombiner,mergeValue, mergeCombiners).collect
res27: Array[(Int, List[Int])] = Array((1,List(8, 40, 5)), (2,List(20, 9, 6)))
Please note that I tried this out in the Spark shell, so you can see the output below each executed command. It should help build your understanding.
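For reference, the same call can be written inline. A hedged variant that appends with :+ instead of prepending, which keeps per-partition insertion order (though not a global ordering guarantee across partitions):
val grouped = temp.combineByKey(
  (v: Int) => List(v),                   // createCombiner: start a new list for a key
  (acc: List[Int], v: Int) => acc :+ v,  // mergeValue: fold a value into the partition-local list
  (a: List[Int], b: List[Int]) => a ++ b // mergeCombiners: merge lists from different partitions
)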

Spark: FlatMapValues query

I'm reading the Learning Spark book and couldn't understand the following pair RDD transformation:
rdd.flatMapValues(x => (x to 5))
It is applied to the RDD {(1,2),(3,4),(3,6)}, and the output of the transformation is {(1,2),(1,3),(1,4),(1,5),(3,4),(3,5)}.
Can someone please explain this?
The flatMapValues method is a combination of flatMap and mapValues.
Let's start with the given rdd.
val sampleRDD = sc.parallelize(Array((1,2),(3,4),(3,6)))
mapValues maps the values while keeping the keys.
For example, sampleRDD.mapValues(x => x to 5) returns
Array((1,Range(2, 3, 4, 5)), (3,Range(4, 5)), (3,Range()))
Notice that for the key-value pair (3, 6), it produces (3,Range()), since 6 to 5 produces an empty collection of values.
flatMap "breaks down" collections into the elements of the collection. You can search for more accurate description of flatMap online like here and here.
For example,
given val rdd2 = sampleRDD.mapValues(x => x to 5),
if we do rdd2.flatMap { case (k, vs) => vs.map(v => (k, v)) }, you will get
Array((1,2),(1,3),(1,4),(1,5),(3,4),(3,5)).
That is, for every element in the collection under each key, we create a (key, element) pair.
Also notice that (3, Range()) does not produce any key-element pair, since its sequence is empty.
Combining flatMap and mapValues in this way gives you flatMapValues.
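Putting it together, a quick check in the shell (using the sampleRDD defined above):
sampleRDD.flatMapValues(x => x to 5).collect()
// Array((1,2), (1,3), (1,4), (1,5), (3,4), (3,5))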
flatMapValues works on each value associated with a key. In the above case, x to 5 expands each value into the range from that value up to 5.
Take the first pair, (1,2): the key is 1 and the value is 2, so after applying the transformation it becomes {(1,2),(1,3),(1,4),(1,5)}.
Hope this helps.

How to replace contents of RDD with another while preserving order?

I have two RDDs: one is (a, b, a, c, b, c, a) and the other is a paired RDD, ((a, 0), (b, 1), (c, 2)).
I want to replace the as, bs, and cs in the first RDD with 0, 1, and 2 (the values of the keys a, b, and c in the second RDD), while preserving the order of the elements in the first RDD.
How to achieve it in Spark?
For example like this:
val rdd1 = sc.parallelize(Seq("a", "b", "a", "c", "b", "c", "a"))
val rdd2 = sc.parallelize(Seq(("a", 0), ("b", 1), ("c", 2)))
rdd1
.map((_, 1)) // Map first to PairwiseRDD with dummy values
.join(rdd2)
.map { case (_, (_, x)) => x } // Drop keys and dummy values
If the mapping RDD is small, it can be faster to broadcast it and map:
val bd = sc.broadcast(rdd2.collectAsMap)
// This assumes all values are present. If not use get / getOrElse
// or map withDefault
rdd1.map(bd.value)
It will also preserve the order of the elements.
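If some values might indeed be missing, a minimal hedged variant with getOrElse (the -1 default is purely illustrative):
// Fall back to a sentinel value for keys absent from the broadcast map.
rdd1.map(s => bd.value.getOrElse(s, -1))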
In the case of join, you can add increasing identifiers (zipWithIndex / zipWithUniqueId) to be able to restore the initial ordering, but this is substantially more expensive.
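A rough sketch of that approach, using the rdd1 and rdd2 defined above:
rdd1
  .zipWithIndex                         // (value, originalPosition)
  .join(rdd2)                           // (value, (originalPosition, mappedValue))
  .map { case (_, (i, x)) => (i, x) }
  .sortByKey()                          // reorder by the original position
  .values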
You can do this by using join.
First, to simulate your RDDs:
val rdd = sc.parallelize(List("a","b","a","c","b","c","a"))
val mapping = sc.parallelize(List(("a",0),("b",1),("c",2)))
You can only join pair RDDs, so map the original rdd to a pair RDD and then join it with mapping:
rdd.map(s => (s, None)).join(mapping).map{case(_, (_, intValue)) => intValue}

How to generate a new RDD from another RDD according to specific logic

I am new to Spark. I have a problem, but I don't know how to solve it. My data in the RDD is as follows:
(1,{A,B,C,D})
(2,{E,F,G})
......
I know RDDs are immutable, but I want to transform my RDD into a new RDD that looks like this:
11 A,B
12 B,C
13 C,D
21 E,F
22 F,G
......
How can I generate a new key and extract adjacent elements?
Assuming your collection is something similar to a List, you could do something like:
val rdd2 = rdd1.flatMap { case (key, values) =>
for (value <- values.sliding(2).zipWithIndex)
yield (key.toString + value._2, value._1)
}
What we are doing here is iterating through the values in your list, applying a sliding window of size 2 over the elements, zipping each window with an integer index, and finally yielding tuples keyed by the original key appended with the window index (with the slid window as the value). We use flatMap here to flatten the results into their own records. (If you want the indices to start at 1, as in your example, use value._2 + 1 when building the key.)
When run in the spark-shell, I see the following output for your example:
scala> val rdd1 = sc.parallelize(Array((1,List("A","B","C","D")), (2,List("E","F","G"))))
rdd1: org.apache.spark.rdd.RDD[(Int, List[String])] = ParallelCollectionRDD[0] at parallelize at <console>:21
scala> val rdd2 = rdd1.flatMap { case (key, values) => for (value <- values.sliding(2).zipWithIndex) yield (key.toString + value._2, value._1) }
rdd2: org.apache.spark.rdd.RDD[(String, Seq[String])] = MapPartitionsRDD[1] at flatMap at <console>:23
scala> rdd2.foreach(println)
...
(10,List(A, B))
(11,List(B, C))
(12,List(C, D))
(20,List(E, F))
(21,List(F, G))
One note with this: the output key (e.g. 10, 11) will have 3 digits once you have 11 or more windows. For example, for the input key 1, you would get the output key 110 on the 11th window. Not sure if that fits your use case, but it seemed like a reasonable extension of your request. Based on your output key scheme, I would actually suggest something different (like adding a hyphen between the key and the index). This prevents collisions later, since you will see 2-10 and 21-0 instead of 210 for both keys.
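A hedged sketch of that hyphenated variant:
// Composite "<key>-<index>" string keys avoid collisions such as
// "210" meaning both (2, 10) and (21, 0).
val rdd3 = rdd1.flatMap { case (key, values) =>
  values.sliding(2).zipWithIndex.map { case (pair, i) => (s"$key-$i", pair) }
}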

Join two (non)paired RDDs to make a DataFrame

As the title describes, say I have two RDDs
rdd1 = sc.parallelize([1,2,3])
rdd2 = sc.parallelize([1,0,0])
or
rdd3 = sc.parallelize([("Id", 1),("Id", 2),("Id",3)])
rdd4 = sc.parallelize([("Result", 1),("Result", 0),("Result", 0)])
How can I create the following DataFrame?
Id Result
1 1
2 0
3 0
If I could create the paired RDD [(1,1),(2,0),(3,0)], then sqlCtx.createDataFrame would give me what I want, but I don't know how to create it.
I'd appreciate any comment or help!
So first off, there is an RDD operation called RDD.zipWithIndex. If you called rdd2.zipWithIndex you would get:
scala> rdd2.zipWithIndex collect() foreach println
(1,0)
(0,1)
(0,2)
If you wanted to make it look like yours, just do this:
scala> rdd2.zipWithIndex map(t => (t._2 + 1,t._1)) collect() foreach println
(1,1)
(2,0)
(3,0)
If you really need to zip the two RDDs, then just use RDD.zip
scala> rdd1.zip(rdd2) collect() foreach println
(1,1)
(2,0)
(3,0)
Provided that the two RDDs have the same number of partitions and the same number of elements per partition, you can use the zip function, e.g.
case class Elem(id: Int, result: Int)
val df = sqlCtx.createDataFrame(rdd1.zip(rdd2).map(x => Elem(x._1, x._2)))
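In Scala you could also name the columns directly via toDF; a small sketch, assuming Scala versions of rdd1 and rdd2 and that the SQL implicits are imported:
import sqlCtx.implicits._  // assumes sqlCtx is the SQLContext used above
val df = rdd1.zip(rdd2).toDF("Id", "Result")
df.show()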