Syntax for using toEpochDay with a DataFrame with Spark Scala - elegantly

The following is nice and easy with an RDD in terms of epochDate derivation:
val rdd2 = rdd.map(x => (x._1, x._2, x._3,
LocalDate.parse(x._2.toString).toEpochDay, LocalDate.parse(x._3.toString).toEpochDay))
The RDD fields are all of String type, and the desired result is obtained. For example:
...(Mike,2018-09-25,2018-09-30,17799,17804), ...
Trying to do the same when the date is a String column in a DataFrame appears too tricky for me, and I would like to see something elegant, if possible. Something like this, and variations of it, does not work:
val df2 = df.withColumn("s", $"start".LocalDate.parse.toString.toEpochDay)
I get:
notebook:50: error: value LocalDate is not a member of org.apache.spark.sql.ColumnName
I understand the error, but what is an elegant way of doing the conversion?

You can define to_epoch_day as datediff since the beginning of the epoch:
import org.apache.spark.sql.functions.{datediff, lit, to_date}
import org.apache.spark.sql.Column
def to_epoch_day(c: Column) = datediff(c, to_date(lit("1970-01-01")))
and apply it directly on a Column:
df.withColumn("s", to_epoch_day(to_date($"start")))
As long as the string format complies with ISO 8601 you could even skip the date conversion (it will be done implicitly by datediff):
df.withColumn("s", to_epoch_day($"start"))

$"start" is of type ColumnName not String.
You will need to define a UDF
Example below:
scala> import java.time._
import java.time._
scala> def toEpochDay(s: String) = LocalDate.parse(s).toEpochDay
toEpochDay: (s: String)Long
scala> val toEpochDayUdf = udf(toEpochDay(_: String))
toEpochDayUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,LongType,Some(List(StringType)))
scala> val df = List("2018-10-28").toDF
df: org.apache.spark.sql.DataFrame = [value: string]
scala> df.withColumn("s", toEpochDayUdf($"value")).collect
res0: Array[org.apache.spark.sql.Row] = Array([2018-10-28,17832])
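If you also need the function in SQL expressions, you can register it with the session (a sketch building on the definitions above):
spark.udf.register("toEpochDay", toEpochDay _)
df.selectExpr("value", "toEpochDay(value) AS s").show()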

Related

Null Pointer Exception accessing Dataframe

val opt2: Option[DataFrame] = None
val result: Option[DataFrame] = opt2.getOrElse(None)
if (!result.isEmpty) {
  result.show()
}
If I don't use getOrElse and use get instead, I get a NullPointerException.
If I use getOrElse, I get a type mismatch on result. How do I fix this? Also, is there any method for creating an empty DataFrame?
Spark provides a pre-defined method to create an empty DataFrame.
Please note, in the code below spark is a SparkSession.
scala> import org.apache.spark.sql.SparkSession
scala> val df = spark.emptyDataFrame
df: org.apache.spark.sql.DataFrame = []
scala> df.show()
++
||
++
++
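Applied to the question's code, spark.emptyDataFrame gives getOrElse a default of the right type, so result can be a plain DataFrame rather than an Option. A sketch (Dataset.isEmpty is available from Spark 2.4; on older versions use result.head(1).isEmpty):
import org.apache.spark.sql.DataFrame
val opt2: Option[DataFrame] = None
val result: DataFrame = opt2.getOrElse(spark.emptyDataFrame)
if (!result.isEmpty) {
  result.show()
}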

Too many arguments for method filter

What I am trying to do here is filter the rows where log_type = "1".
This is my code:
val sc1Rdd=parDf.select(parDf("token"),parDf("log_type")).rdd
val sc2Rdd=sc1Rdd.filter(x=>x,log_type=="1")
but the error code showed:
parDf: org.apache.spark.sql.DataFrame = [action_time: bigint, action_type: bigint ... 21 more fields]
sc1Rdd: org.apache.spark.rdd.RDD[org.apache.spark.sql.Row] = MapPartitionsRDD[865] at rdd at <console>:186
<console>:188: error: too many arguments for method filter: (f: org.apache.spark.sql.Row => Boolean)org.apache.spark.rdd.RDD[org.apache.spark.sql.Row]
val sc2Rdd=sc1Rdd.filter(x=>x,log_type=="1")
Any help will be appreciated.
You don't need to convert the dataframe to an RDD; you can simply use filter or where:
val result = parDf.select(parDf("token"), parDf("log_type")).filter(parDf("log_type") === "1")
parDf.select(parDf("token"), parDf("log_type")).where(parDf("log_type") === "1")
Hope this helps!
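If you do want to keep the RDD[Row] approach from the question, a corrected filter would read the column by name from each Row. A sketch, assuming log_type is stored as a string:
val sc2Rdd = sc1Rdd.filter(row => row.getAs[String]("log_type") == "1")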

How to add a schema to a Dataset in Spark?

I am trying to load a file into spark.
If I load a normal textFile into Spark like below:
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
The outcome is:
partFile: org.apache.spark.sql.Dataset[String] = [value: string]
I can see a dataset in the output. But if I load a Json file:
val pfile = spark.read.json("hdfs://quickstart:8020/user/cloudera/pjson")
The outcome is a dataframe with a ready-made schema:
pfile: org.apache.spark.sql.DataFrame = [address: struct<city: string, state: string>, age: bigint ... 1 more field]
JSON/Parquet/ORC files carry a schema, so I understand this is a Spark 2.x feature: things are easier because we directly get a DataFrame in that case, while a normal textFile gives a Dataset with no schema, which makes sense.
What I'd like to know is how can I add a schema to a dataset that is a resultant of loading a textFile into spark. For an RDD, there is case class/StructType option to add the schema and convert it to a DataFrame.
Could anyone let me know how I can do it?
When you use textFile, each line of the file will be a string row in your Dataset. To convert it to a DataFrame with a schema, you can use toDF:
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
import spark.implicits._
val df = partFile.toDF("string_column")
In this case, the DataFrame will have a schema of a single column of type StringType.
If your file contains a more complex schema, you can either use the csv reader (if the file is in a structured csv format):
val partFile = spark.read.option("header", "true").option("delimiter", ";").csv("hdfs://quickstart:8020/user/cloudera/partfile")
Or you can process your Dataset using map, then use toDF to convert to a DataFrame. For example, suppose you want one column to be the first character of the line (as an Int) and another column to be the fourth character (also as an Int):
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
import spark.implicits._ // the Encoder for (Int, Int) must be in scope before map
val processedDataset: Dataset[(Int, Int)] = partFile.map {
  // note: line(0) is a Char, so .toInt yields its character code; use .asDigit for the digit value
  line: String => (line(0).toInt, line(3).toInt)
}
val df = processedDataset.toDF("value0", "value3")
Also, you can define a case class, which will represent the final schema for your DataFrame:
case class MyRow(value0: Int, value3: Int)
val partFile = spark.read.textFile("hdfs://quickstart:8020/user/cloudera/partfile")
import spark.implicits._ // the Encoder for MyRow must be in scope before map
val processedDataset: Dataset[MyRow] = partFile.map {
  line: String => MyRow(line(0).toInt, line(3).toInt)
}
val df = processedDataset.toDF
In both cases above, calling df.printSchema would show:
root
|-- value0: integer (nullable = true)
|-- value3: integer (nullable = true)
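If you prefer the explicit StructType route the question mentions, you can also drop to an RDD[Row] and attach the schema with createDataFrame. A sketch for the same two-column layout (column names are illustrative):
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}
val schema = StructType(Seq(
  StructField("value0", IntegerType, nullable = true),
  StructField("value3", IntegerType, nullable = true)
))
// each line becomes a Row matching the schema
val rowRdd = partFile.rdd.map(line => Row(line(0).toInt, line(3).toInt))
val df = spark.createDataFrame(rowRdd, schema)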

Spark.ml DataFrame containing SparseVector

I have a spark.ml DataFrame that contains many columns, each of these columns containing a SparseVector per row. I would like to apply MultivariateStatisticalSummary.colStats to each column, and colStats signature is:
def colStats(X: RDD[Vector]): MultivariateStatisticalSummary
which seems perfect... except that I can't seem to select a column from that DataFrame and get it as an RDD[Vector]. Here is my attempt:
val df: DataFrame = data.select(shardId)
val col = df.as[(org.apache.spark.mllib.linalg.Vector)].rdd
val s: MultivariateStatisticalSummary = Statistics.colStats(col)
which doesn't compile with the message (in Scala):
Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing sqlContext.implicits._ Support for serializing other types will be added in future releases.
val col = df.as[(org.apache.spark.mllib.linalg.Vector)].rdd
I also tried:
val df = data.select(shardId)
val col: RDD[Vector] = df.map(x => x.asInstanceOf[org.apache.spark.mllib.linalg.Vector])
val s: MultivariateStatisticalSummary = Statistics.colStats(col)
which fails at runtime with error:
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to org.apache.spark.mllib.linalg.Vector
How can I bridge the gap between DataFrame and colStats?
I found the answer after all:
val df = data.select(shardId)
val col: RDD[Vector] = df.map { _.get(0).asInstanceOf[org.apache.spark.mllib.linalg.Vector] }
val s: MultivariateStatisticalSummary = Statistics.colStats(col)
The trick was only to extract the first element of each row before casting it.
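On Spark 2.x, DataFrame.map returns a Dataset and would require an Encoder for the Vector type; a variant that drops to the RDD first avoids that. A sketch, assuming the column really holds mllib vectors as in the question:
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics}
import org.apache.spark.rdd.RDD
val col: RDD[Vector] = df.rdd.map(_.getAs[Vector](0))
val s: MultivariateStatisticalSummary = Statistics.colStats(col)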

How can I add a timestamp as an extra column to my dataframe?

Hi all,
I have an easy question for you.
I have an RDD, created from Kafka streaming using the createStream method.
Now I want to add a timestamp as a value to this RDD before converting it into a dataframe.
I have tried adding a value to the dataframe using withColumn(), but it returns the error below.
val topicMaps = Map("topic" -> 1)
val now = java.util.Calendar.getInstance().getTime()
val messages = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaConf, topicMaps, StorageLevel.MEMORY_ONLY_SER)
messages.foreachRDD { rdd =>
  val sqlContext = new org.apache.spark.sql.SQLContext(sc)
  import sqlContext.implicits._
  val dataframe = sqlContext.read.json(rdd.map(_._2))
  val d = dataframe.withColumn("timeStamp_column", dataframe.col("now"))
}
org.apache.spark.sql.AnalysisException: Cannot resolve column name "now" among (action, device_os_ver, device_type, event_name,
item_name, lat, lon, memberid, productUpccd, tenantid);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:15
I understand that DataFrames cannot be altered since they are immutable, but RDDs are immutable as well.
So what is the best way to do this? How do I add a value to the RDD (i.e., add a timestamp to an RDD dynamically)?
Try the current_timestamp function.
import org.apache.spark.sql.functions.current_timestamp
df.withColumn("time_stamp", current_timestamp())
To add a new column with a constant value like a timestamp, you can use the lit function:
import org.apache.spark.sql.functions._
val newDF = oldDF.withColumn("timeStamp_column", lit(System.currentTimeMillis))
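Note that lit(System.currentTimeMillis) is evaluated once on the driver when the expression is built, so every row gets the same constant, whereas current_timestamp() is evaluated at query time. A small sketch making that constant explicit:
import org.apache.spark.sql.functions.lit
// computed once on the driver; all rows share this load time
val loadTimeMs = System.currentTimeMillis
val newDF = oldDF.withColumn("timeStamp_column", lit(loadTimeMs))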
This works for me. I usually perform a write after this.
val d = dataframe.withColumn("SparkLoadedAt", current_timestamp())
In Scala/Databricks:
import org.apache.spark.sql.functions._
val newDF = oldDF.withColumn("Timestamp",current_timestamp())
I see in the comments that some folks are having trouble getting the timestamp as a string. Here is a way to do that using the Spark 3 datetime format patterns:
import org.apache.spark.sql.functions._
val d = dataframe
  .withColumn("timeStamp_column", date_format(current_timestamp(), "y-M-d'T'H:m:sX"))
