Spark filtering broadcast vars from RDD - apache-spark

I am learning broadcast vars and trying to filter those from the RDD. This is not happening for me.
Here is my sample data
content.txt
Hello this is Rogers.com
This is Bell.com
Apache Spark Training
This is Spark Learning Session
Spark is faster than MapReduce
remove.txt
Hello, is, this, the
script
scala> val content = sc.textFile("FilterCount/Content.txt")
scala> val contentRDD = content.flatMap(x => x.split(","))
scala> val remove = sc.textFile("FilterCount/Remove.txt")
scala> val removeRDD = remove.flatMap(x => x.split(",")).map(w => w.trim)
scala> val bRemove = sc.broadcast(removeRDD.collect().toList)
scala> val filtered = contentRDD.filter{case (word) => !bRemove.value.contains(word)}
scala> filtered.foreach(print)
Hello this is Rogers.com This is Bell.comApache Spark TrainingThis is
Spark Learning SessionSpark is faster than MapReduce
As you can see above, filtered list still contains the broadcast vars. How can i remove these?

This is because you are splitting a file with ",", But your file is delimited with space " ".
scala> val content = sc.textFile("FilterCount/Content.txt")
scala> val contentRDD = content.flatMap(x => x.split(","))
Replace this with
scala> val content = sc.textFile("FilterCount/Content.txt")
scala> val contentRDD = content.flatMap(x => x.split(" "))
Use this to ignore case
val filtered = contentRDD.filter{case (word) =>
!bRemove.value.map(_.toLowerCase).contains(word.toLowerCase()
)}
Hopr this should work!

Related

Why different behavior when mixed case are used, vs same case are used in spark 3.2

I am running a simple query in spark 3.2
val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols_same_case = List("id","col2","col3","col4", "col5", "id")
val df2 = df1.select(op_cols_same_case.head, op_cols_same_case.tail: _*)
df2.select("id").show()
The above query return the result, but when I mix the casing it gives exception
val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4", "col5")
val op_cols_diff_case = List("id","col2","col3","col4", "col5", "ID")
val df2 = df1.select(op_cols_diff_case.head, op_cols_diff_case.tail: _*)
df2.select("id").show()
In my test caseSensitive was default (false).
I expect both queries to return the result. Or both queries to fail.
Why is it failing for one and not for the other one?

How to JOIN 3 RDD's using Spark Scala

I want to join 3 tables using spark rdd. I achieved my objective using spark sql but when I tried to join it using Rdd I am not getting the desired results. Below is my query using spark SQL and the output:
scala> actorDF.as("df1").join(movieCastDF.as("df2"),$"df1.act_id"===$"df2.act_id").join(movieDF.as("df3"),$"df2.mov_id"===$"df3.mov_id").
filter(col("df3.mov_title")==="Annie Hall").select($"df1.act_fname",$"df1.act_lname",$"df2.role").show(false)
+---------+---------+-----------+
|act_fname|act_lname|role |
+---------+---------+-----------+
|Woody |Allen |Alvy Singer|
+---------+---------+-----------+
Now I created the pairedRDDs for three datasets and it is as below :
scala> val actPairedRdd=actRdd.map(_.split("\t",-1)).map(p=>(p(0),(p(1),p(2),p(3))))
scala> actPairedRdd.take(5).foreach(println)
(101,(James,Stewart,M))
(102,(Deborah,Kerr,F))
(103,(Peter,OToole,M))
(104,(Robert,De Niro,M))
(105,(F. Murray,Abraham,M))
scala> val movieCastPairedRdd=movieCastRdd.map(_.split("\t",-1)).map(p=>(p(0),(p(1),p(2))))
movieCastPairedRdd: org.apache.spark.rdd.RDD[(String, (String, String))] = MapPartitionsRDD[318] at map at <console>:29
scala> movieCastPairedRdd.foreach(println)
(101,(901,John Scottie Ferguson))
(102,(902,Miss Giddens))
(103,(903,T.E. Lawrence))
(104,(904,Michael))
(105,(905,Antonio Salieri))
(106,(906,Rick Deckard))
scala> val moviePairedRdd=movieRdd.map(_.split("\t",-1)).map(p=>(p(0),(p(1),p(2),p(3),p(4),p(5),p(6))))
moviePairedRdd: org.apache.spark.rdd.RDD[(String, (String, String, String, String, String, String))] = MapPartitionsRDD[322] at map at <console>:29
scala> moviePairedRdd.take(2).foreach(println)
(901,(Vertigo,1958,128,English,1958-08-24,UK))
(902,(The Innocents,1961,100,English,1962-02-19,SW))
Here actPairedRdd and movieCastPairedRdd is linked with each other and movieCastPairedRdd and moviePairedRdd is linked since they have common column.
Now when I join all the three datasets I am not getting any data
scala> actPairedRdd.join(movieCastPairedRdd).join(moviePairedRdd).take(2).foreach(println)
I am getting blank records. So where am I going wrong ?? Thanks in advance
JOINs like this with RDDs are painful, that's another reason why DFs are nicer.
You get no data as the pair RDD = K, V has no common data for the K part of the last RDD. The K's with 101, 102 will join, but there is no commonality with the 901, 902. You need to shift things around, like this, my more limited example:
val rdd1 = sc.parallelize(Seq(
(101,("James","Stewart","M")),
(102,("Deborah","Kerr","F")),
(103,("Peter","OToole","M")),
(104,("Robert","De Niro","M"))
))
val rdd2 = sc.parallelize(Seq(
(101,(901,"John Scottie Ferguson")),
(102,(902,"Miss Giddens")),
(103,(903,"T.E. Lawrence")),
(104,(904,"Michael"))
))
val rdd3 = sc.parallelize(Seq(
(901,("Vertigo",1958 )),
(902,("The Innocents",1961))
))
val rdd4 = rdd1.join(rdd2)
val new_rdd4 = rdd4.keyBy(x => x._2._2._1) // Redefine Key for join with rdd3
val rdd5 = rdd3.join(new_rdd4)
rdd5.collect
returns:
res14: Array[(Int, ((String, Int), (Int, ((String, String, String), (Int, String)))))] = Array((901,((Vertigo,1958),(101,((James,Stewart,M),(901,John Scottie Ferguson))))), (902,((The Innocents,1961),(102,((Deborah,Kerr,F),(902,Miss Giddens))))))
You will need to strip out the data via a map, I leave that to you. INNER join per default.

Join Dstream[Document] and Rdd by key Spark Scala

Here is my code:
ssc =streamingcontext(sparkcontext,Seconds(time))
spark = sparksession.builder.config(properties).getorcreate()
val Dstream1: ReceiverInputDstream[Document] = ssc.receiverStream(properties) // Dstream1 has Id1 and other fields
val Rdd2 = spark.sql("select Id1,key from hdfs.table").rdd // RDD[Row]
Is there a way I can join this two?
You'll first want to transform your Dstream & Rdd to use pairRDD's.
Something like this should do.
val DstreamTuple = Dstream1.map(x => (x. Id1, x))
val Rdd2Tuple = Rdd2.map(x => (x. Id1, x))
Once you do that, you can simply do a transformation on the dstream and join it on the RDD.
val joinedStream = DstreamTuple.transform(rdd =>
rdd.leftOuterJoin(Rdd2Tuple)
)
Hope this helps :)

How two RDD according to funcation get Result RDD

I am a beginner of Apache Spark. I want to filter two RDD into result RDD with the below code
def runSpark(stList:List[SubStTime],icList:List[IcTemp]): Unit ={
val conf = new SparkConf().setAppName("OD").setMaster("local[*]")
val sc = new SparkContext(conf)
val st = sc.parallelize(stList).map(st => ((st.productId,st.routeNo),st)).groupByKey()
val ic = sc.parallelize(icList).map(ic => ((ic.productId,ic.routeNo),ic)).groupByKey()
//TODO
//val result = st.join(ic).mapValues( )
sc.stop()
}
here is what i want to do
List[ST] ->map ->Map(Key,st) ->groupByKey ->Map(Key,List[st])
List[IC] ->map ->Map(Key,ic) ->groupByKey ->Map(Key,List[ic])
STRDD join ICRDD get Map(Key,(List[st],List[ic]))
I have a function compare listST and listIC get the List[result] result contains both SubStTime and IcTemp information
def calcIcSt(st:List[SubStTime],ic:List[IcTemp]): List[result]
I don't know how to use mapvalues or other some way to get my result
Thanks
val result = st.join(ic).mapValues( x => calcIcSt(x._1,x._2) )

How can I check whether my RDD or dataframe is cached or not?

I have created a dataframe say df1. I cached this by using df1.cache(). How can I check whether this has been cached or not?
Also is there a way so that I am able to see all my cached RDD's or dataframes.
You can call getStorageLevel.useMemory on the Dataframe and the RDD to find out if the dataset is in memory.
For the Dataframe do this:
scala> val df = Seq(1, 2).toDF()
df: org.apache.spark.sql.DataFrame = [value: int]
scala> df.storageLevel.useMemory
res1: Boolean = false
scala> df.cache()
res0: df.type = [value: int]
scala> df.storageLevel.useMemory
res1: Boolean = true
For the RDD do this:
scala> val rdd = sc.parallelize(Seq(1,2))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:21
scala> rdd.getStorageLevel.useMemory
res9: Boolean = false
scala> rdd.cache()
res10: rdd.type = ParallelCollectionRDD[1] at parallelize at <console>:21
scala> rdd.getStorageLevel.useMemory
res11: Boolean = true
#Arnab,
Did you find the function in Python?
Here is an example for DataFrame DF:
DF.cache()
print DF.is_cached
Hope this helps.
Ram
Starting since Spark (Scala) 2.1.0, this can be checked for a dataframe as follows:
dataframe.storageLevel.useMemory
You can retrieve the storage level of a RDD since Spark 1.4 and since Spark 2.1 for DataFrame.
val storageLevel = rdd.getStorageLevel
val storageLevel = dataframe.storageLevel
Then you can check where it's stored as follows:
val isCached: Boolean = storageLevel.useMemory || storageLevel.useDisk || storageLevel.useOffHeap
In Java and Scala, following method could used to find all the persisted RDDs:
sparkContext.getPersistentRDDs()
Here is link to documentation.`
Looks like this method is not available in python yet:
https://issues.apache.org/jira/browse/SPARK-2141
But one could use this short-term hack:
sparkContext._jsc.getPersistentRDDs().items()

Resources