cache() does not work with collect(), but works with take(...) - apache-spark

I am trying to cache a DataFrame in memory. When I call collect() after cache(), the DataFrame never gets cached. But when I call take(N), the DataFrame does get cached in memory.
I would like to know why the DataFrame is not cached when I call collect(). Below is an example using both collect() and take().
cache() behavior with collect()
scala> val hc = new org.apache.spark.sql.hive.HiveContext(sc)
scala> hc.sql("select * from sparkdb.firsttable").cache()
res0: org.apache.spark.sql.DataFrame = [name: int]
scala> res0.collect()
res1: Array[org.apache.spark.sql.Row] = Array([12], [13], [14], [15])
As you can see, the DataFrame is not cached in this case. I was expecting the DataFrame to be cached when executing the action collect()
cache() behavior with take(N)
scala> res0.take(30)
res2: Array[org.apache.spark.sql.Row] = Array([12], [13], [14], [15])
As you can see, calling take(N) does cache the DataFrame in Memory.
PS: Some additional information is given below.
Storage Info after collect()
scala> sc.getRDDStorageInfo
res2: Array[org.apache.spark.storage.RDDInfo] = Array()
Storage Info after take()
scala> sc.getRDDStorageInfo
res3: Array[org.apache.spark.storage.RDDInfo] =
Array(RDD "HiveTableScan [name#0], (MetastoreRelation sparkdb,
firsttable, None), None
" (3) StorageLevel: StorageLevel(false, true, false, true, 1);
CachedPartitions: 1; TotalPartitions: 1; MemorySize: 256.0 B;
ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B)

Related

How to JOIN 3 RDD's using Spark Scala

I want to join 3 tables using Spark RDDs. I achieved my objective using Spark SQL, but when I tried the same join with RDDs I did not get the desired results. Below is my query using Spark SQL and its output:
scala> actorDF.as("df1").join(movieCastDF.as("df2"),$"df1.act_id"===$"df2.act_id").join(movieDF.as("df3"),$"df2.mov_id"===$"df3.mov_id").
filter(col("df3.mov_title")==="Annie Hall").select($"df1.act_fname",$"df1.act_lname",$"df2.role").show(false)
+---------+---------+-----------+
|act_fname|act_lname|role |
+---------+---------+-----------+
|Woody |Allen |Alvy Singer|
+---------+---------+-----------+
Now I created the pairedRDDs for three datasets and it is as below :
scala> val actPairedRdd=actRdd.map(_.split("\t",-1)).map(p=>(p(0),(p(1),p(2),p(3))))
scala> actPairedRdd.take(5).foreach(println)
(101,(James,Stewart,M))
(102,(Deborah,Kerr,F))
(103,(Peter,OToole,M))
(104,(Robert,De Niro,M))
(105,(F. Murray,Abraham,M))
scala> val movieCastPairedRdd=movieCastRdd.map(_.split("\t",-1)).map(p=>(p(0),(p(1),p(2))))
movieCastPairedRdd: org.apache.spark.rdd.RDD[(String, (String, String))] = MapPartitionsRDD[318] at map at <console>:29
scala> movieCastPairedRdd.foreach(println)
(101,(901,John Scottie Ferguson))
(102,(902,Miss Giddens))
(103,(903,T.E. Lawrence))
(104,(904,Michael))
(105,(905,Antonio Salieri))
(106,(906,Rick Deckard))
scala> val moviePairedRdd=movieRdd.map(_.split("\t",-1)).map(p=>(p(0),(p(1),p(2),p(3),p(4),p(5),p(6))))
moviePairedRdd: org.apache.spark.rdd.RDD[(String, (String, String, String, String, String, String))] = MapPartitionsRDD[322] at map at <console>:29
scala> moviePairedRdd.take(2).foreach(println)
(901,(Vertigo,1958,128,English,1958-08-24,UK))
(902,(The Innocents,1961,100,English,1962-02-19,SW))
Here actPairedRdd and movieCastPairedRdd are linked with each other, and movieCastPairedRdd and moviePairedRdd are linked, since each pair has a common column.
Now when I join all three datasets I am not getting any data:
scala> actPairedRdd.join(movieCastPairedRdd).join(moviePairedRdd).take(2).foreach(println)
I am getting no records back. Where am I going wrong? Thanks in advance.
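JOINs like this with RDDs are painful, which is another reason why DataFrames are nicer.
You get no data because a pair RDD join matches on the key K of each (K, V) pair, and the keys of the last RDD have nothing in common with the keys of the first join's result. The keys 101, 102, ... will join between the first two RDDs, but they never match the 901, 902, ... keys of the movie RDD. You need to re-key the intermediate result before the second join, as in this more limited example: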
val rdd1 = sc.parallelize(Seq(
(101,("James","Stewart","M")),
(102,("Deborah","Kerr","F")),
(103,("Peter","OToole","M")),
(104,("Robert","De Niro","M"))
))
val rdd2 = sc.parallelize(Seq(
(101,(901,"John Scottie Ferguson")),
(102,(902,"Miss Giddens")),
(103,(903,"T.E. Lawrence")),
(104,(904,"Michael"))
))
val rdd3 = sc.parallelize(Seq(
(901,("Vertigo",1958 )),
(902,("The Innocents",1961))
))
val rdd4 = rdd1.join(rdd2)
val new_rdd4 = rdd4.keyBy(x => x._2._2._1) // Redefine Key for join with rdd3
val rdd5 = rdd3.join(new_rdd4)
rdd5.collect
returns:
res14: Array[(Int, ((String, Int), (Int, ((String, String, String), (Int, String)))))] = Array((901,((Vertigo,1958),(101,((James,Stewart,M),(901,John Scottie Ferguson))))), (902,((The Innocents,1961),(102,((Deborah,Kerr,F),(902,Miss Giddens))))))
You will need to strip the result down to the fields you want with a map; I leave that to you. Note that join is an INNER join by default.
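As a minimal sketch of that final map step, the nested tuple returned above can be unpacked with a pattern match. The type alias `Joined`, the helper `flatten`, and the local `sample` sequence are hypothetical names introduced here for illustration; the sample simply mirrors the two rows shown in the result above so that the snippet runs without a cluster.

```scala
// Shape of one rdd5 element, copied from the REPL result type above:
// (movieId, ((title, year), (actorId, ((first, last, gender), (movieId, role)))))
type Joined = (Int, ((String, Int), (Int, ((String, String, String), (Int, String)))))

// Hypothetical helper: keep title, actor name and role, drop everything else.
def flatten(row: Joined): (String, String, String, String) = row match {
  case (_, ((title, _), (_, ((first, last, _), (_, role))))) =>
    (title, first, last, role)
}

// Local stand-in for the collected rdd5 rows shown above.
val sample: Seq[Joined] = Seq(
  (901, (("Vertigo", 1958), (101, (("James", "Stewart", "M"), (901, "John Scottie Ferguson"))))),
  (902, (("The Innocents", 1961), (102, (("Deborah", "Kerr", "F"), (902, "Miss Giddens")))))
)

val flattened = sample.map(flatten)
flattened.foreach(println)
```

On the real data the same function should be usable directly as `rdd5.map(flatten)`, assuming rdd5 has the element type shown in the result above.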

Shuffle intermediate files shared between jobs?

Referring to https://spark.apache.org/docs/1.6.2/programming-guide.html#performance-impact
Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files are preserved until the corresponding RDDs are no longer used and are garbage collected. This is done so the shuffle files don’t need to be re-created if the lineage is re-computed
I understand why these files are retained. However, I can't seem to figure out whether these intermediate files are shared between jobs.
My experiments show that these shuffle files are NOT shared between jobs. Can anyone confirm?
The scenario I am talking about:
```
val rdd1 = sc.text...
val rdd2 = sc.text...
val rdd3 = rdd1.join(rdd2)
// at this point shuffle takes place
//Now, if I do this again:
val rdd4 = rdd1.join(rdd2)
// will the shuffle files be reused? I think I've got the answer, which is no, since the RDDs do not share the lineage
```
Between jobs: yes. That is the whole purpose of preserving shuffle files (see the question "What does "Stage Skipped" mean in Apache Spark web UI?"). Consider the following session transcript:
scala> val rdd1 = sc.parallelize(Seq((1, None), (2, None)), 4)
rdd1: org.apache.spark.rdd.RDD[(Int, None.type)] = ParallelCollectionRDD[0] at parallelize at <console>:24
scala> val rdd2 = sc.parallelize(Seq((1, None), (2, None)), 4)
rdd2: org.apache.spark.rdd.RDD[(Int, None.type)] = ParallelCollectionRDD[1] at parallelize at <console>:24
scala> val rdd3 = rdd1.join(rdd2)
rdd3: org.apache.spark.rdd.RDD[(Int, (None.type, None.type))] = MapPartitionsRDD[4] at join at <console>:27
scala> rdd3.count // First job
res0: Long = 2
scala> rdd3.foreach(_ => ()) // Second job
In the Spark UI, the second job then shows its shuffle map stages as skipped, because the shuffle files written by the first job are reused.
Between applications: no. Shuffle files are discarded when the SparkContext is closed.
The shuffle files are meant for the stages within a job. Other jobs won't be able to use them, so, as far as I know, no: shuffle files cannot be shared between jobs.

Temp table caching with spark-sql

Is a table registered with registerTempTable (createOrReplaceTempView with spark 2.+) cached?
Using Zeppelin, I register a DataFrame in my scala code, after heavy computation, and then within %pyspark I want to access it, and further filter it.
Will it use a memory-cached version of the table? Or will it be rebuilt each time?
Registered tables are not cached in memory.
The createOrReplaceTempView method (registerTempTable before Spark 2.0) just creates or replaces a view of the given DataFrame with a given query plan.
Only when a permanent view is created does Spark convert the query plan to a canonicalized SQL string and store it as view text in the metastore.
You'll need to cache your DataFrame explicitly, for example:
df.createOrReplaceTempView("my_table") # df.registerTempTable("my_table") for Spark < 2.0
spark.catalog.cacheTable("my_table") # sqlContext.cacheTable("my_table") for Spark < 2.0
EDIT:
Let's illustrate this with an example :
Using cacheTable :
scala> val df = Seq(("1",2),("b",3)).toDF
// df: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
scala> sc.getPersistentRDDs
// res0: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map()
scala> df.createOrReplaceTempView("my_table")
scala> sc.getPersistentRDDs
// res2: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map()
scala> spark.catalog.cacheTable("my_table") // sqlContext.cacheTable("...") before Spark 2.0
scala> sc.getPersistentRDDs
// res4: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] =
// Map(2 -> In-memory table my_table MapPartitionsRDD[2] at
// cacheTable at <console>:26)
Now the same example using cache.createOrReplaceTempView (cache.registerTempTable before Spark 2.0):
scala> sc.getPersistentRDDs
// res2: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map()
scala> val df = Seq(("1",2),("b",3)).toDF
// df: org.apache.spark.sql.DataFrame = [_1: string, _2: int]
scala> df.createOrReplaceTempView("my_table")
scala> sc.getPersistentRDDs
// res4: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] = Map()
scala> df.cache.createOrReplaceTempView("my_table")
scala> sc.getPersistentRDDs
// res6: scala.collection.Map[Int,org.apache.spark.rdd.RDD[_]] =
// Map(2 -> ConvertToUnsafe
// +- LocalTableScan [_1#0,_2#1], [[1,2],[b,3]]
// MapPartitionsRDD[2] at cache at <console>:28)
It is not. You should cache explicitly:
sqlContext.cacheTable("someTable")

Not able to convert an Array RDD into a List RDD in Spark

How do I convert an Array[String] RDD into a List[String] RDD?
scala> val linesRDD = sc.textFile("/user/inputfiles/records.txt")
linesRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at textFile at <console>:21
scala> linesRDD.collect
res17: Array[String] = Array(100,surender,CTS,CHN, 101,ajay,CTS,BNG, 102,kumar,TCS,BNG, 103,Ankit,CTS,CHN, 104,Sukanya,TCS,BNG)
scala> linesRDD.toList
<console>:24: error: value toList is not a member of org.apache.spark.rdd.RDD[String]
linesRDD.toList
As you can see above, this throws an error.
But as you can see below, if I first apply the take action and then call toList, it works:
scala> linesRDD.take(2).toList
res19: List[String] = List(100,surender,CTS,CHN, 101,ajay,CTS,BNG)
How do I convert an Array[String] RDD into a List[String] RDD?
The error is pretty clear: you are trying to call a method that doesn't exist on the RDD class.
error: value toList is not a member of
org.apache.spark.rdd.RDD[String]
linesRDD.toList
To solve this, you can collect first and then call toList. Keep in mind that collect moves all of the data to the driver, and if it doesn't fit in the driver's memory you will get an out-of-memory error.
linesRDD.collect.toList
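If the goal is instead to keep the data distributed and turn each comma-separated record into a List[String], one possible approach is a per-record map rather than a collect. The sketch below assumes the records look like the ones in the question; `toFields` and `sampleLines` are hypothetical names used only for illustration, and a local Seq stands in for the RDD so the snippet runs without Spark.

```scala
// Hypothetical helper: split one comma-separated record into a List of fields.
// The -1 limit keeps trailing empty fields instead of dropping them.
def toFields(line: String): List[String] = line.split(",", -1).toList

// Local stand-in for the first two lines of the input file from the question.
val sampleLines = Seq("100,surender,CTS,CHN", "101,ajay,CTS,BNG")

val parsed: Seq[List[String]] = sampleLines.map(toFields)
parsed.foreach(println)
```

Applied to the RDD from the question, `linesRDD.map(toFields)` should yield an RDD[List[String]] without pulling anything to the driver.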

How can I check whether my RDD or dataframe is cached or not?

I have created a dataframe say df1. I cached this by using df1.cache(). How can I check whether this has been cached or not?
Also is there a way so that I am able to see all my cached RDD's or dataframes.
You can call storageLevel.useMemory on a DataFrame and getStorageLevel.useMemory on an RDD to find out whether the dataset is marked to be stored in memory.
For the DataFrame, do this:
scala> val df = Seq(1, 2).toDF()
df: org.apache.spark.sql.DataFrame = [value: int]
scala> df.storageLevel.useMemory
res1: Boolean = false
scala> df.cache()
res0: df.type = [value: int]
scala> df.storageLevel.useMemory
res1: Boolean = true
For the RDD, do this:
scala> val rdd = sc.parallelize(Seq(1,2))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:21
scala> rdd.getStorageLevel.useMemory
res9: Boolean = false
scala> rdd.cache()
res10: rdd.type = ParallelCollectionRDD[1] at parallelize at <console>:21
scala> rdd.getStorageLevel.useMemory
res11: Boolean = true
@Arnab,
Did you find the function in Python?
Here is an example for a DataFrame DF:
DF.cache()
print(DF.is_cached)
Hope this helps.
Ram
Since Spark 2.1.0 (Scala), this can be checked for a DataFrame as follows:
dataframe.storageLevel.useMemory
You can retrieve the storage level of an RDD since Spark 1.4, and of a DataFrame since Spark 2.1.
val storageLevel = rdd.getStorageLevel
val storageLevel = dataframe.storageLevel
Then you can check where it's stored as follows:
val isCached: Boolean = storageLevel.useMemory || storageLevel.useDisk || storageLevel.useOffHeap
In Java and Scala, the following method can be used to find all the persisted RDDs:
sparkContext.getPersistentRDDs()
Here is a link to the documentation.
It looks like this method is not available in Python yet:
https://issues.apache.org/jira/browse/SPARK-2141
But one could use this short-term hack:
sparkContext._jsc.getPersistentRDDs().items()

Resources