Null Pointer Exception accessing Dataframe - apache-spark

val opt2: Option[DataFrame] = None
val result: Option[DataFrame] = opt2.getOrElse(None)
if (!result.isEmpty) {
  result.show()
}
If I don't use getOrElse and use get instead, I get a NullPointerException.
If I use getOrElse, I get a type mismatch on result. How do I fix this? Also, is there a method to create an empty DataFrame?

Spark provides a pre-defined method to create an empty DataFrame.
Please note that in the code below, spark is a SparkSession.
scala> import org.apache.spark.sql.SparkSession
scala> val df = spark.emptyDataFrame
df: org.apache.spark.sql.DataFrame = []
scala> df.show()
++
||
++
++
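Building on that, a minimal sketch (assuming an active SparkSession named spark, as in spark-shell) of how the Option handling in the question could be rewritten so that both the NullPointerException and the type mismatch go away:
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark: SparkSession = SparkSession.builder().getOrCreate()

// Mirroring the question: an Option that may or may not hold a DataFrame.
val opt2: Option[DataFrame] = None

// Falling back to spark.emptyDataFrame keeps the result typed as DataFrame,
// avoiding the NullPointerException from .get and the mismatch from getOrElse(None).
val result: DataFrame = opt2.getOrElse(spark.emptyDataFrame)

// take(1) checks for emptiness without scanning the whole dataset.
if (result.take(1).nonEmpty) {
  result.show()
}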

Related

Syntax for using toEpochDate with a Dataframe with Spark Scala - elegantly

The following is nice and easy with an RDD in terms of epochDate derivation:
val rdd2 = rdd.map(x => (x._1, x._2, x._3,
LocalDate.parse(x._2.toString).toEpochDay, LocalDate.parse(x._3.toString).toEpochDay))
The RDD fields are all of String type. The desired result is produced; for example:
...(Mike,2018-09-25,2018-09-30,17799,17804), ...
Doing the same when the value is a String in the DataFrame seems too tricky for me, and I would like to see something elegant if possible. Something like this, and variations of it, does not work:
val df2 = df.withColumn("s", $"start".LocalDate.parse.toString.toEpochDay)
Get:
notebook:50: error: value LocalDate is not a member of org.apache.spark.sql.ColumnName
I understand the error, but what is an elegant way of doing the conversion?
You can define to_epoch_day as datediff since the beginning of the epoch:
import org.apache.spark.sql.functions.{datediff, lit, to_date}
import org.apache.spark.sql.Column
def to_epoch_day(c: Column) = datediff(c, to_date(lit("1970-01-01")))
and apply it directly on a Column:
df.withColumn("s", to_epoch_day(to_date($"start")))
As long as the string format complies with ISO 8601, you could even skip the explicit conversion (it will be done implicitly by datediff):
df.withColumn("s", to_epoch_day($"start"))
$"start" is of type ColumnName not String.
You will need to define a UDF
Example below:
scala> import java.time._
import java.time._
scala> def toEpochDay(s: String) = LocalDate.parse(s).toEpochDay
toEpochDay: (s: String)Long
scala> val toEpochDayUdf = udf(toEpochDay(_: String))
toEpochDayUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,LongType,Some(List(StringType)))
scala> val df = List("2018-10-28").toDF
df: org.apache.spark.sql.DataFrame = [value: string]
scala> df.withColumn("s", toEpochDayUdf($"value")).collect
res0: Array[org.apache.spark.sql.Row] = Array([2018-10-28,17832])

How to JOIN 3 RDD's using Spark Scala

I want to join 3 tables using Spark RDDs. I achieved my objective using Spark SQL, but when I tried to join them using RDDs I did not get the desired results. Below is my query using Spark SQL and its output:
scala> actorDF.as("df1").join(movieCastDF.as("df2"),$"df1.act_id"===$"df2.act_id").join(movieDF.as("df3"),$"df2.mov_id"===$"df3.mov_id").
filter(col("df3.mov_title")==="Annie Hall").select($"df1.act_fname",$"df1.act_lname",$"df2.role").show(false)
+---------+---------+-----------+
|act_fname|act_lname|role |
+---------+---------+-----------+
|Woody |Allen |Alvy Singer|
+---------+---------+-----------+
Then I created paired RDDs for the three datasets, as shown below:
scala> val actPairedRdd=actRdd.map(_.split("\t",-1)).map(p=>(p(0),(p(1),p(2),p(3))))
scala> actPairedRdd.take(5).foreach(println)
(101,(James,Stewart,M))
(102,(Deborah,Kerr,F))
(103,(Peter,OToole,M))
(104,(Robert,De Niro,M))
(105,(F. Murray,Abraham,M))
scala> val movieCastPairedRdd=movieCastRdd.map(_.split("\t",-1)).map(p=>(p(0),(p(1),p(2))))
movieCastPairedRdd: org.apache.spark.rdd.RDD[(String, (String, String))] = MapPartitionsRDD[318] at map at <console>:29
scala> movieCastPairedRdd.foreach(println)
(101,(901,John Scottie Ferguson))
(102,(902,Miss Giddens))
(103,(903,T.E. Lawrence))
(104,(904,Michael))
(105,(905,Antonio Salieri))
(106,(906,Rick Deckard))
scala> val moviePairedRdd=movieRdd.map(_.split("\t",-1)).map(p=>(p(0),(p(1),p(2),p(3),p(4),p(5),p(6))))
moviePairedRdd: org.apache.spark.rdd.RDD[(String, (String, String, String, String, String, String))] = MapPartitionsRDD[322] at map at <console>:29
scala> moviePairedRdd.take(2).foreach(println)
(901,(Vertigo,1958,128,English,1958-08-24,UK))
(902,(The Innocents,1961,100,English,1962-02-19,SW))
Here actPairedRdd and movieCastPairedRdd are linked to each other, and movieCastPairedRdd and moviePairedRdd are linked, since they share a common column.
But when I join all three datasets I get no data:
scala> actPairedRdd.join(movieCastPairedRdd).join(moviePairedRdd).take(2).foreach(println)
I am getting blank records. So where am I going wrong? Thanks in advance.
JOINs like this with RDDs are painful; that's another reason why DataFrames are nicer.
You get no data because a pair RDD joins on its key K, and the result of the first join keeps the actor keys (101, 102, ...), which have nothing in common with the keys of the last RDD (901, 902, ...). You need to re-key the intermediate result, as in this more limited example:
val rdd1 = sc.parallelize(Seq(
(101,("James","Stewart","M")),
(102,("Deborah","Kerr","F")),
(103,("Peter","OToole","M")),
(104,("Robert","De Niro","M"))
))
val rdd2 = sc.parallelize(Seq(
(101,(901,"John Scottie Ferguson")),
(102,(902,"Miss Giddens")),
(103,(903,"T.E. Lawrence")),
(104,(904,"Michael"))
))
val rdd3 = sc.parallelize(Seq(
(901,("Vertigo",1958 )),
(902,("The Innocents",1961))
))
val rdd4 = rdd1.join(rdd2)
val new_rdd4 = rdd4.keyBy(x => x._2._2._1) // Redefine Key for join with rdd3
val rdd5 = rdd3.join(new_rdd4)
rdd5.collect
returns:
res14: Array[(Int, ((String, Int), (Int, ((String, String, String), (Int, String)))))] = Array((901,((Vertigo,1958),(101,((James,Stewart,M),(901,John Scottie Ferguson))))), (902,((The Innocents,1961),(102,((Deborah,Kerr,F),(902,Miss Giddens))))))
You will need to strip out the data you want via a map; I leave that to you (see the sketch below). The join is an INNER join by default.
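For that final projection, a minimal sketch (continuing the example above and matching the nested tuple structure shown in the result) that keeps only first name, last name and role:
// Pattern-match the nested tuples produced by the two joins and keep only
// the fields of interest.
val projected = rdd5.map {
  case (_, ((_, _), (_, ((fname, lname, _), (_, role))))) =>
    (fname, lname, role)
}
projected.collect
// e.g. Array((James,Stewart,John Scottie Ferguson), (Deborah,Kerr,Miss Giddens))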

Too many arguments for method filter

What I am trying to do here is filter the rows where log_type = "1".
This is my code:
val sc1Rdd=parDf.select(parDf("token"),parDf("log_type")).rdd
val sc2Rdd=sc1Rdd.filter(x=>x,log_type=="1")
but the following error was shown:
parDf: org.apache.spark.sql.DataFrame = [action_time: bigint, action_type: bigint ... 21 more fields]
sc1Rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[865] at rdd at <console>:186
<console>:188: error: too many arguments for method filter: (f: org.apache.spark.sql.Row => Boolean)org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
val sc2Rdd=sc1Rdd.filter(x=>x,log_type=="1")
Any help will be appreciated.
You don't need to convert the DataFrame to an RDD;
you can simply use filter or where, as follows:
val result = parDf.select(parDf("token"), parDf("log_type")).filter(parDf("log_type") === "1")
parDf.select(parDf("token"), parDf("log_type")).where(parDf("log_type") === "1")
Hope this helps!
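If you do want to keep the RDD of Row from the original attempt, a minimal sketch (assuming log_type is stored as a string) of how the filter could be written:
// RDD.filter takes a single Row => Boolean function; look the column up by name.
val sc1Rdd = parDf.select(parDf("token"), parDf("log_type")).rdd
val sc2Rdd = sc1Rdd.filter(row => row.getAs[String]("log_type") == "1")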

Not able to convert an Array RDD into a List RDD in Spark

How do I convert an Array[String] RDD into a List[String] RDD?
scala> val linesRDD = sc.textFile("/user/inputfiles/records.txt")
linesRDD: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[9] at textFile at <console>:21
scala> linesRDD.collect
res17: Array[String] = Array(100,surender,CTS,CHN, 101,ajay,CTS,BNG, 102,kumar,TCS,BNG, 103,Ankit,CTS,CHN, 104,Sukanya,TCS,BNG)
scala> linesRDD.toList
<console>:24: error: value toList is not a member of org.apache.spark.rdd.RDD[String]
linesRDD.toList
As you can see above, it throws an error.
But as you can see below, if I apply a take action first and then apply toList, it works:
scala> linesRDD.take(2).toList
res19: List[String] = List(100,surender,CTS,CHN, 101,ajay,CTS,BNG)
How do I convert an Array[String] RDD into a List[String] RDD?
The error is pretty clear: you are trying to call a method that doesn't exist on the RDD class.
error: value toList is not a member of
org.apache.spark.rdd.RDD[String]
linesRDD.toList
However, to solve this you can collect and then use toList. Keep in mind that when the data is collected, all of it is moved to the driver, and if it doesn't fit there you will get an exception.
linesRDD.collect.toList

How can I check whether my RDD or dataframe is cached or not?

I have created a DataFrame, say df1, and cached it using df1.cache(). How can I check whether it has been cached or not?
Also, is there a way to see all my cached RDDs or DataFrames?
You can check storageLevel.useMemory on a DataFrame, or getStorageLevel.useMemory on an RDD, to find out whether the dataset is in memory.
For the DataFrame do this:
scala> val df = Seq(1, 2).toDF()
df: org.apache.spark.sql.DataFrame = [value: int]
scala> df.storageLevel.useMemory
res1: Boolean = false
scala> df.cache()
res0: df.type = [value: int]
scala> df.storageLevel.useMemory
res1: Boolean = true
For the RDD do this:
scala> val rdd = sc.parallelize(Seq(1,2))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:21
scala> rdd.getStorageLevel.useMemory
res9: Boolean = false
scala> rdd.cache()
res10: rdd.type = ParallelCollectionRDD[1] at parallelize at <console>:21
scala> rdd.getStorageLevel.useMemory
res11: Boolean = true
@Arnab,
Did you find the function in Python?
Here is an example for a DataFrame DF:
DF.cache()
print DF.is_cached
Hope this helps.
Ram
Since Spark (Scala) 2.1.0, this can be checked for a DataFrame as follows:
dataframe.storageLevel.useMemory
You can retrieve the storage level of an RDD since Spark 1.4, and of a DataFrame since Spark 2.1:
val storageLevel = rdd.getStorageLevel
val storageLevel = dataframe.storageLevel
Then you can check whether it is cached anywhere as follows:
val isCached: Boolean = storageLevel.useMemory || storageLevel.useDisk || storageLevel.useOffHeap
In Java and Scala, the following method can be used to find all the persisted RDDs:
sparkContext.getPersistentRDDs()
Here is a link to the documentation.
It looks like this method is not available in Python yet:
https://issues.apache.org/jira/browse/SPARK-2141
But one could use this short-term hack:
sparkContext._jsc.getPersistentRDDs().items()
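As a minimal sketch in Scala (assuming an active SparkContext sc, as in spark-shell), listing every persisted RDD with its id and storage level:
// getPersistentRDDs returns a Map[Int, RDD[_]] keyed by RDD id.
sc.getPersistentRDDs.foreach { case (id, rdd) =>
  println(s"id=$id name=${rdd.name} storageLevel=${rdd.getStorageLevel.description}")
}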
