Join a DStream[Document] and an RDD by key in Spark Scala - apache-spark

Here is my code:
val ssc = new StreamingContext(sparkContext, Seconds(time))
val spark = SparkSession.builder.config(properties).getOrCreate()
val Dstream1: ReceiverInputDStream[Document] = ssc.receiverStream(properties) // Dstream1 has Id1 and other fields
val Rdd2 = spark.sql("select Id1,key from hdfs.table").rdd // RDD[Row]
Is there a way I can join these two?

You'll first want to transform your DStream and RDD into pair RDDs.
Something like this should do.
val DstreamTuple = Dstream1.map(x => (x.Id1, x))
val Rdd2Tuple = Rdd2.map(x => (x.Id1, x))
Once you do that, you can simply do a transformation on the dstream and join it on the RDD.
val joinedStream = DstreamTuple.transform(rdd =>
  rdd.leftOuterJoin(Rdd2Tuple)
)
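If it helps, here is a minimal sketch of consuming the joined stream; each element is (Id1, (document, Option[row])) because of the left outer join, and the println body is just illustrative:
// Minimal sketch, assuming the join above: process each joined micro-batch.
joinedStream.foreachRDD { rdd =>
  rdd.foreach { case (id, (doc, maybeRow)) =>
    println(s"$id matched: ${maybeRow.isDefined}")
  }
}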
Hope this helps :)

Related

How to JOIN 3 RDDs using Spark Scala

I want to join 3 tables using Spark RDDs. I achieved my objective using Spark SQL, but when I tried the join using RDDs I did not get the desired results. Below is my query using Spark SQL and its output:
scala> actorDF.as("df1").join(movieCastDF.as("df2"),$"df1.act_id"===$"df2.act_id").join(movieDF.as("df3"),$"df2.mov_id"===$"df3.mov_id").
filter(col("df3.mov_title")==="Annie Hall").select($"df1.act_fname",$"df1.act_lname",$"df2.role").show(false)
+---------+---------+-----------+
|act_fname|act_lname|role |
+---------+---------+-----------+
|Woody |Allen |Alvy Singer|
+---------+---------+-----------+
Now I created the paired RDDs for the three datasets as below:
scala> val actPairedRdd=actRdd.map(_.split("\t",-1)).map(p=>(p(0),(p(1),p(2),p(3))))
scala> actPairedRdd.take(5).foreach(println)
(101,(James,Stewart,M))
(102,(Deborah,Kerr,F))
(103,(Peter,OToole,M))
(104,(Robert,De Niro,M))
(105,(F. Murray,Abraham,M))
scala> val movieCastPairedRdd=movieCastRdd.map(_.split("\t",-1)).map(p=>(p(0),(p(1),p(2))))
movieCastPairedRdd: org.apache.spark.rdd.RDD[(String, (String, String))] = MapPartitionsRDD[318] at map at <console>:29
scala> movieCastPairedRdd.foreach(println)
(101,(901,John Scottie Ferguson))
(102,(902,Miss Giddens))
(103,(903,T.E. Lawrence))
(104,(904,Michael))
(105,(905,Antonio Salieri))
(106,(906,Rick Deckard))
scala> val moviePairedRdd=movieRdd.map(_.split("\t",-1)).map(p=>(p(0),(p(1),p(2),p(3),p(4),p(5),p(6))))
moviePairedRdd: org.apache.spark.rdd.RDD[(String, (String, String, String, String, String, String))] = MapPartitionsRDD[322] at map at <console>:29
scala> moviePairedRdd.take(2).foreach(println)
(901,(Vertigo,1958,128,English,1958-08-24,UK))
(902,(The Innocents,1961,100,English,1962-02-19,SW))
Here actPairedRdd and movieCastPairedRdd are linked with each other, and movieCastPairedRdd and moviePairedRdd are linked, since they share common columns.
Now when I join all three datasets I am not getting any data:
scala> actPairedRdd.join(movieCastPairedRdd).join(moviePairedRdd).take(2).foreach(println)
I am getting blank records, so where am I going wrong? Thanks in advance.
JOINs like this with RDDs are painful; that's another reason why DataFrames are nicer.
You get no data because the pair RDDs (K, V) have no common data in the K part for the last RDD. The keys 101, 102 will join, but there is no commonality with 901, 902. You need to shift the keys around, as in this more limited example:
val rdd1 = sc.parallelize(Seq(
  (101, ("James", "Stewart", "M")),
  (102, ("Deborah", "Kerr", "F")),
  (103, ("Peter", "OToole", "M")),
  (104, ("Robert", "De Niro", "M"))
))
val rdd2 = sc.parallelize(Seq(
  (101, (901, "John Scottie Ferguson")),
  (102, (902, "Miss Giddens")),
  (103, (903, "T.E. Lawrence")),
  (104, (904, "Michael"))
))
val rdd3 = sc.parallelize(Seq(
  (901, ("Vertigo", 1958)),
  (902, ("The Innocents", 1961))
))
val rdd4 = rdd1.join(rdd2)
val new_rdd4 = rdd4.keyBy(x => x._2._2._1) // Redefine Key for join with rdd3
val rdd5 = rdd3.join(new_rdd4)
rdd5.collect
returns:
res14: Array[(Int, ((String, Int), (Int, ((String, String, String), (Int, String)))))] = Array((901,((Vertigo,1958),(101,((James,Stewart,M),(901,John Scottie Ferguson))))), (902,((The Innocents,1961),(102,((Deborah,Kerr,F),(902,Miss Giddens))))))
You will need to strip out the data via a map; I leave that to you. It is an INNER join by default.
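For example, a rough sketch of such a map (the names in the pattern are just placeholders matching the printed structure of rdd5 above):
// Sketch: flatten the nested join result into (act_fname, act_lname, role).
val flattened = rdd5.map { case (movId, ((title, year), (actId, ((fname, lname, gender), (_, role))))) =>
  (fname, lname, role)
}
flattened.collect.foreach(println)
// e.g. (James,Stewart,John Scottie Ferguson)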

Perform join in Spark only on one coordinate of a pair key?

I have 3 RDDs:
1st one is of form ((a,b),c).
2nd one is of form (b,d).
3rd one is of form (a,e).
How can I perform a join in Scala over these RDDs such that my final output is of the form ((a,b),c,d,e)?
You can do something like this:
val rdd1: RDD[((A,B),C)]
val rdd2: RDD[(B,D)]
val rdd3: RDD[(A,E)]
val tmp1 = rdd1.map {case((a,b),c) => (a, (b,c))}
val tmp2 = tmp1.join(rdd3).map{case(a, ((b,c), e)) => (b, (a,c,e))}
val res = tmp2.join(rdd2).map{case(b, ((a,c,e), d)) => ((a,b), c,d,e)}
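As a quick sanity check with tiny sample data (the same rows used in the DataFrame answer below), the chain yields the expected shape:
// Sketch: the same re-keying approach on concrete sample data.
val rdd1 = sc.parallelize(Seq((("a", "b"), "c")))
val rdd2 = sc.parallelize(Seq(("b", "d")))
val rdd3 = sc.parallelize(Seq(("a", "e")))
val tmp1 = rdd1.map { case ((a, b), c) => (a, (b, c)) }
val tmp2 = tmp1.join(rdd3).map { case (a, ((b, c), e)) => (b, (a, c, e)) }
val res  = tmp2.join(rdd2).map { case (b, ((a, c, e), d)) => ((a, b), c, d, e) }
res.collect.foreach(println) // ((a,b),c,d,e)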
With the current implementations of the join APIs for paired RDDs, it's not possible to use conditions, and you would need conditions when joining to get the desired result.
But you can use DataFrames/Datasets for the joins, where conditions are supported. If you want the result of the join as a DataFrame you can proceed with that; in case you want your results as RDDs, .rdd can be used to convert the DataFrame/Dataset to RDD[Row].
Below is sample code for how it can be done in Scala:
//creating three rdds
val first = sc.parallelize(Seq((("a", "b"), "c")))
val second = sc.parallelize(Seq(("b", "d")))
val third = sc.parallelize(Seq(("a", "e")))
//converting RDDs to DataFrames (toDF needs the implicits in scope)
import spark.implicits._
val firstdf = first.toDF("key1", "value1")
val seconddf = second.toDF("key2", "value2")
val thirddf = third.toDF("key3", "value3")
//udf function for the join condition
import org.apache.spark.sql.functions._
def joinCondition = udf((strct: Row, key: String) => strct.toSeq.contains(key))
//joins with conditions
firstdf
.join(seconddf, joinCondition(firstdf("key1"), seconddf("key2"))) //joining first with second
.join(thirddf, joinCondition(firstdf("key1"), thirddf("key3"))) //joining first with third
.drop("key2", "key3") //dropping unnecessary columns
.rdd //converting dataframe to rdd
You should have output as
[[a,b],c,d,e]

Spark filtering broadcast vars from RDD

I am learning broadcast variables and I am trying to filter the broadcast words out of an RDD, but it is not working for me.
Here is my sample data
content.txt
Hello this is Rogers.com
This is Bell.com
Apache Spark Training
This is Spark Learning Session
Spark is faster than MapReduce
remove.txt
Hello, is, this, the
script
scala> val content = sc.textFile("FilterCount/Content.txt")
scala> val contentRDD = content.flatMap(x => x.split(","))
scala> val remove = sc.textFile("FilterCount/Remove.txt")
scala> val removeRDD = remove.flatMap(x => x.split(",")).map(w => w.trim)
scala> val bRemove = sc.broadcast(removeRDD.collect().toList)
scala> val filtered = contentRDD.filter{case (word) => !bRemove.value.contains(word)}
scala> filtered.foreach(print)
Hello this is Rogers.com This is Bell.comApache Spark TrainingThis is
Spark Learning SessionSpark is faster than MapReduce
As you can see above, the filtered list still contains the broadcast words. How can I remove these?
This is because you are splitting the file on ",", but your file is delimited by spaces " ".
scala> val content = sc.textFile("FilterCount/Content.txt")
scala> val contentRDD = content.flatMap(x => x.split(","))
Replace this with
scala> val content = sc.textFile("FilterCount/Content.txt")
scala> val contentRDD = content.flatMap(x => x.split(" "))
Use this to ignore case
val filtered = contentRDD.filter { word =>
  !bRemove.value.map(_.toLowerCase).contains(word.toLowerCase)
}
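With the space split and the case-insensitive check, the filtered output should roughly contain only the non-stop words (a sketch derived from the sample files above; ordering may differ):
scala> filtered.collect.foreach(println)
Rogers.com
Bell.com
Apache
Spark
Training
Spark
Learning
Session
Spark
faster
than
MapReduce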
Hope this works!

How to iterate over groups from cogroup to print key and its values (per group)?

I am learning spark and have the following code:
val rdd1 = sc.parallelize(Seq(("key1", 1), ("key1", 3), ("key2", 2), ("key3", 1)))
val rdd2 = sc.parallelize(Seq(("key1", 5), ("key2", 4), ("key4", 1)))
val grouped = rdd1.cogroup(rdd2)
grouped.collect()
Output:
Array[(String, (Iterable[Int], Iterable[Int]))] = Array(
(key3,(CompactBuffer(1),CompactBuffer())),
(key1,(CompactBuffer(1, 3),CompactBuffer(5))),
(key4,(CompactBuffer(),CompactBuffer(1))),
(key2,(CompactBuffer(2),CompactBuffer(4))))
How to iterate the values in a way that I get the output as follows:
key1,1,3,5
key2,2,4
key4,1
key3,1
Below is the code I have tried:
val z=grouped.map{x=>
val key=x._1
val value=x._2
val source1=value._1
val final_value=source1.map{if(value._1>=1) value._1}
(key,final_value)
}
I recommend that you replace cogroup with join, which would give you a sequence of pairs with a key and its matched values, as follows:
val rdd1 = sc.parallelize(Seq(("key1", 1), ("key1", 3), ("key2", 2), ("key3", 1)))
val rdd2 = sc.parallelize(Seq(("key1", 5),("key2", 4),("key4", 1)))
val joined = rdd1.join(rdd2)
scala> joined.foreach(println)
(key2,(2,4))
(key1,(1,5))
(key1,(3,5))
// or using Spark SQL's Dataset API
scala> joined.toDF("key", "values").show
+----+------+
| key|values|
+----+------+
|key1| [1,5]|
|key1| [3,5]|
|key2| [2,4]|
+----+------+
If, however, you want to stay with cogroup to learn Spark's RDD API, you'd print grouped.collect as follows:
// I assume grouped is the value after cogroup+collect
// just because it's easier to demo the solution
val grouped = rdd1.cogroup(rdd2).collect
scala> grouped.foreach(println)
(key1,(CompactBuffer(1, 3),CompactBuffer(5)))
(key2,(CompactBuffer(2),CompactBuffer(4)))
(key3,(CompactBuffer(1),CompactBuffer()))
(key4,(CompactBuffer(),CompactBuffer(1)))
// the solution
grouped.
map { case (k, (g1, g2)) => (k, g1 ++ g2) }.
map { case (k, vs) => s"$k,${vs.mkString(",")}" }.
foreach(println)
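That should print roughly the output asked for (ordering may differ):
key1,1,3,5
key2,2,4
key3,1
key4,1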
I think the easiest way is to convert to a DataFrame, group by key, and collect the values as a list.
import org.apache.spark.sql.functions._
import spark.implicits._
val df = spark.sparkContext.parallelize(Seq(("key1", 3), ("key1", 5), ("key2", 4), ("key4", 1))).toDF("K", "V")
df.groupBy("K").agg(collect_list($"V")).show
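The output should look roughly like this (row order may vary):
+----+---------------+
|   K|collect_list(V)|
+----+---------------+
|key1|         [3, 5]|
|key2|            [4]|
|key4|            [1]|
+----+---------------+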
Hope this helps

How to get a result RDD from two RDDs using a function

I am a beginner with Apache Spark. I want to combine two RDDs into a result RDD with the code below:
def runSpark(stList: List[SubStTime], icList: List[IcTemp]): Unit = {
  val conf = new SparkConf().setAppName("OD").setMaster("local[*]")
  val sc = new SparkContext(conf)
  val st = sc.parallelize(stList).map(st => ((st.productId, st.routeNo), st)).groupByKey()
  val ic = sc.parallelize(icList).map(ic => ((ic.productId, ic.routeNo), ic)).groupByKey()
  //TODO
  //val result = st.join(ic).mapValues( )
  sc.stop()
}
Here is what I want to do:
List[ST] ->map ->Map(Key,st) ->groupByKey ->Map(Key,List[st])
List[IC] ->map ->Map(Key,ic) ->groupByKey ->Map(Key,List[ic])
STRDD join ICRDD get Map(Key,(List[st],List[ic]))
I have a function that compares listST and listIC and returns a List[result], where result contains both SubStTime and IcTemp information:
def calcIcSt(st:List[SubStTime],ic:List[IcTemp]): List[result]
I don't know how to use mapValues or some other way to get my result.
Thanks
// groupByKey gives Iterables, so convert them to Lists for calcIcSt
val result = st.join(ic).mapValues { case (sts, ics) => calcIcSt(sts.toList, ics.toList) }
