how can i add a timestamp as an extra column to my dataframe - apache-spark

*Hi all,
I have an easy question for you all.
I have an RDD created from Kafka streaming using the createStream method.
Now I want to add a timestamp as a value to this RDD before converting it into a dataframe.
I have tried adding a value to the dataframe using withColumn(), but it returns this error:*
val topicMaps = Map("topic" -> 1)
val now = java.util.Calendar.getInstance().getTime()
val messages = KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](ssc, kafkaConf, topicMaps, StorageLevel.MEMORY_ONLY_SER)
messages.foreachRDD(rdd =>
{
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val dataframe = sqlContext.read.json(rdd.map(_._2))
val d = dataframe.withColumn("timeStamp_column", dataframe.col("now"))
org.apache.spark.sql.AnalysisException: Cannot resolve column name "now" among (action, device_os_ver, device_type, event_name,
item_name, lat, lon, memberid, productUpccd, tenantid);
at org.apache.spark.sql.DataFrame$$anonfun$resolve$1.apply(DataFrame.scala:15
As I came to know, DataFrames cannot be altered because they are immutable, but RDDs are immutable as well.
Then what is the best way to do it?
How do I add a value to the RDD (adding a timestamp to an RDD dynamically)?

Try the current_timestamp function.
import org.apache.spark.sql.functions.current_timestamp
df.withColumn("time_stamp", current_timestamp())

To add a new column with a constant value such as a timestamp, you can use the lit function:
import org.apache.spark.sql.functions._
val newDF = oldDF.withColumn("timeStamp_column", lit(System.currentTimeMillis))
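Note that lit(System.currentTimeMillis) is evaluated once on the driver, so every row in that batch gets the same constant Long. If you want it stored as a proper timestamp rather than epoch milliseconds, one possible variation (a sketch, not the only way) is to divide by 1000 and cast:
import org.apache.spark.sql.functions.lit
// constant load time per batch, cast from epoch seconds to TimestampType
val newDF = oldDF.withColumn("timeStamp_column", (lit(System.currentTimeMillis) / 1000).cast("timestamp"))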

This works for me. I usually perform a write after this.
val d = dataframe.withColumn("SparkLoadedAt", current_timestamp())

In Scala/Databricks:
import org.apache.spark.sql.functions._
val newDF = oldDF.withColumn("Timestamp",current_timestamp())

I see in comments that some folks are having trouble converting the timestamp to a string. Here is a way to do that using the Spark 3 datetime format patterns:
import org.apache.spark.sql.functions._
val d = dataframe
  .withColumn("timeStamp_column", date_format(current_timestamp(), "y-M-d'T'H:m:sX"))

Related

How do I apply multiple columns in window PartitionBy in Spark scala

val partitionsColumns = "idnum,monthnum"
val partitionsColumnsList = partitionsColumns.split(",").toList
val loc = "/data/omega/published/invoice"
val df = sqlContext.read.parquet(loc)
val windowFunction = Window.partitionBy (partitionsColumnsList:_*).orderBy(df("effective_date").desc)
<console>:38: error: overloaded method value partitionBy with alternatives:
(cols: org.apache.spark.sql.Column*) org.apache.spark.sql.expressions.WindowSpec <and>
(colName: String,colNames: String*)org.apache.spark.sql.expressions.WindowSpec
cannot be applied to (String)
val windowFunction = Window.partitionBy(partitionsColumnsList:_*).orderBy(df("effective_date").desc)
Is it possible to send a List of columns to the partitionBy method in Spark/Scala?
I have implemented passing one column to the partitionBy method, which worked. I don't know how to pass multiple columns to the partitionBy method.
Basically, I want to pass a List of columns to the partitionBy method.
Spark version is 1.6.
Window.partitionBy has the following definitions:
static WindowSpec partitionBy(Column... cols)
Creates a WindowSpec with the partitioning defined.
static WindowSpec partitionBy(scala.collection.Seq<Column> cols)
Creates a WindowSpec with the partitioning defined.
static WindowSpec partitionBy(String colName, scala.collection.Seq<String> colNames)
Creates a WindowSpec with the partitioning defined.
static WindowSpec partitionBy(String colName, String... colNames)
Creates a WindowSpec with the partitioning defined.
With your example,
val partitionsColumnsList = partitionsColumns.split(",").toList
You can use it like:
Window.partitionBy(partitionsColumnsList.map(col(_)):_*).orderBy(df("effective_date").desc)
Or
Window.partitionBy(partitionsColumnsList.head, partitionsColumnsList.tail: _*).orderBy(df("effective_date").desc)
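Putting it together, a small sketch using the column names and the df from the question (row_number is only there to show the WindowSpec being used; it is not part of the original code):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}
val partitionsColumnsList = "idnum,monthnum".split(",").toList
// build the spec from a list of column names
val windowSpec = Window
  .partitionBy(partitionsColumnsList.map(col(_)): _*)
  .orderBy(df("effective_date").desc)
// example use: number rows within each (idnum, monthnum) partition
val ranked = df.withColumn("rn", row_number().over(windowSpec))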
You could also apply multiple columns in partitionBy by assigning the column names to a list and passing that list as the partitionBy argument, as below:
import org.apache.spark.sql.functions.col
val partitioncolumns = List("idnum","monthnum")
val w = Window.partitionBy(partitioncolumns.map(col(_)): _*).orderBy(df("effective_date").desc)
The code below worked for me:
Window.partitionBy(partitionsColumnsList.map(col(_)):_*).orderBy(df("effective_date").desc)

Syntax for using toEpochDate with a Dataframe with Spark Scala - elegantly

The following is nice and easy with an RDD in terms of epochDate derivation:
val rdd2 = rdd.map(x => (x._1, x._2, x._3,
LocalDate.parse(x._2.toString).toEpochDay, LocalDate.parse(x._3.toString).toEpochDay))
The RDD fields are all of String type. The desired result is obtained; for example:
...(Mike,2018-09-25,2018-09-30,17799,17804), ...
Trying to do the same when the date is a String column in the DataFrame seems too tricky for me, and I would like to see something elegant, if possible. Something like this, and variations of it, does not work:
val df2 = df.withColumn("s", $"start".LocalDate.parse.toString.toEpochDay)
Get:
notebook:50: error: value LocalDate is not a member of org.apache.spark.sql.ColumnName
I understand the error, but what is an elegant way of doing the conversion?
You can define to_epoch_day as datediff since the beginning of the epoch:
import org.apache.spark.sql.functions.{datediff, lit, to_date}
import org.apache.spark.sql.Column
def to_epoch_day(c: Column) = datediff(c, to_date(lit("1970-01-01")))
and apply it directly on a Column:
df.withColumn("s", to_epoch_day(to_date($"start")))
As long as the string format complies with ISO 8601, you could even skip the date conversion (it will be done implicitly by datediff):
df.withColumn("s", to_epoch_day($"start"))
$"start" is of type ColumnName not String.
You will need to define a UDF
Example below:
scala> import java.time._
import java.time._
scala> def toEpochDay(s: String) = LocalDate.parse(s).toEpochDay
toEpochDay: (s: String)Long
scala> val toEpochDayUdf = udf(toEpochDay(_: String))
toEpochDayUdf: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,LongType,Some(List(StringType)))
scala> val df = List("2018-10-28").toDF
df: org.apache.spark.sql.DataFrame = [value: string]
scala> df.withColumn("s", toEpochDayUdf($"value")).collect
res0: Array[org.apache.spark.sql.Row] = Array([2018-10-28,17832])

Converting CassandraRow obtained from joinWithCassandraTable to DataFrame

case class SourcePartition(id: String, host:String ,bucket: Int)
val joinedRDDs = partitions.joinWithCassandraTable("db_name", "table_name")
joinedRDDs.values.foreach(println)
I have to use joinWithCassandraTable. How do I convert the resulting CassandraRow into a DataFrame? Or is there any equivalent of joinWithCassandraTable for DataFrames?
I have to read a lot of partitions in one go. I'm aware of the Datastax Cassandra connector's predicate pushdown, but it only allows pulling one partition at a time (it doesn't seem to allow the IN operator; only = seems to be supported).
val spark: SparkSession = SparkSession.builder().master("local[4]").appName("RDD2DF").getOrCreate()
val sc: SparkContext = spark.sparkContext
import spark.implicits._
val internalJoinRDD = spark.sparkContext.cassandraTable("test", "test_table_1").joinWithCassandraTable("test", "table_table_2")
internalJoinRDD.toDebugString
internalJoinRDD.toDF()
Can you try the above code snippet?
If you have a schema for your data, you can use
def createDataFrame(internalJoinRDD: RDD[Row], schema: StructType): DataFrame
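A hedged sketch of that route; the getters and the schema fields below are assumptions based on the SourcePartition case class in the question, not the actual table layout:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
// map each joined pair to a plain Row; the getters must match your Cassandra columns
val rowRDD = internalJoinRDD.map { case (_, cassandraRow) =>
  Row(cassandraRow.getString("id"), cassandraRow.getString("host"), cassandraRow.getInt("bucket"))
}
val schema = StructType(Seq(
  StructField("id", StringType),
  StructField("host", StringType),
  StructField("bucket", IntegerType)
))
val joinedDF = spark.createDataFrame(rowRDD, schema)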

Write dataframe to csv with datatype map<string,bigint> in Spark

I have a file, file1.snappy.parquet. It has a complex data structure, with a map and an array inside it. After processing it I got the final result; while writing that result to CSV I am getting an error saying:
"Exception in thread "main" java.lang.UnsupportedOperationException: CSV data source does not support map<string,bigint> data type."
Code which I have used:
val conf=new SparkConf().setAppName("student-example").setMaster("local")
val sc = new SparkContext(conf)
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val datadf = sqlcontext.read.parquet("C:\\file1.snappy.parquet")
def sumaggr=udf((aggr: Map[String, collection.mutable.WrappedArray[Long]]) => if (aggr.keySet.contains("aggr")) aggr("aggr").sum else 0)
datadf.select(col("neid"),sumaggr(col("marks")).as("sum")).filter(col("sum") =!= 0).show(false)
datadf.write.format("com.databricks.spark.csv").option("header", "true").save("C:\\myfile.csv")
I tried converting with datadf.toString(), but I am still facing the same issue.
How can I write that result to CSV?
Spark version: 2.1.1
The Spark CSV source supports only atomic types. You cannot store any columns that are non-atomic.
I think the best option is to convert the column that has map<string,bigint> as its datatype to a JSON string and save it in the CSV, as below:
import spark.implicits._
import org.apache.spark.sql.functions._
datadf.withColumn("column_name_with_map_type", to_json(struct($"column_name_with_map_type"))).write.csv("outputpath")
Hope this helps!
You are trying to save the output of
val datadf = sqlcontext.read.parquet("C:\\file1.snappy.parquet")
which I guess is a mistake, as the udf function and all the aggregation you did would go to waste if you do so.
So I think you want to save the output of
datadf.select(col("neid"),sumaggr(col("marks")).as("sum")).filter(col("sum") =!= 0).show(false)
So you need to assign it to a new dataframe variable and use that variable to save:
val finalDF = datadf.select(col("neid"),sumaggr(col("marks")).as("sum")).filter(col("sum") =!= 0)
finalDF.write.format("com.databricks.spark.csv").option("header", "true").save("C:\\myfile.csv")
And you should be fine.

Can I read a CSV represented as a string into Apache Spark using spark-csv

I know how to read a csv file into spark using spark-csv (https://github.com/databricks/spark-csv), but I already have the csv file represented as a string and would like to convert this string directly to dataframe. Is this possible?
Update: starting from Spark 2.2.x, there is finally a proper way to do it, using a Dataset.
import org.apache.spark.sql.{Dataset, SparkSession}
val spark = SparkSession.builder().appName("CsvExample").master("local").getOrCreate()
import spark.implicits._
val csvData: Dataset[String] = spark.sparkContext.parallelize(
"""
|id, date, timedump
|1, "2014/01/01 23:00:01",1499959917383
|2, "2014/11/31 12:40:32",1198138008843
""".stripMargin.lines.toList).toDS()
val frame = spark.read.option("header", true).option("inferSchema",true).csv(csvData)
frame.show()
frame.printSchema()
Old spark versions
Actually you can, though it's using library internals and not widely advertised. Just create and use your own CsvParser instance.
Example that works for me on spark 1.6.0 and spark-csv_2.10-1.4.0 below
import com.databricks.spark.csv.CsvParser
val csvData = """
|userid,organizationid,userfirstname,usermiddlename,userlastname,usertitle
|1,1,user1,m1,l1,mr
|2,2,user2,m2,l2,mr
|3,3,user3,m3,l3,mr
|""".stripMargin
val rdd = sc.parallelize(csvData.lines.toList)
val csvParser = new CsvParser()
.withUseHeader(true)
.withInferSchema(true)
val csvDataFrame: DataFrame = csvParser.csvRdd(sqlContext, rdd)
You can parse your CSV string using, e.g., scala-csv:
val myCSVdata : Array[List[String]] =
myCSVString.split('\n').flatMap(CSVParser.parseLine(_))
Here you can do a bit more processing, data cleaning, verifying that every line parses well and has the same number of fields, etc ...
You can then make this an RDD of records:
val myCSVRDD : RDD[List[String]] = sparkContext.parallelize(myCSVdata)
Here you can massage your lists of Strings into a case class, to reflect the fields of your CSV data better. You should get some inspiration from the creation of Persons in this example:
https://spark.apache.org/docs/latest/sql-programming-guide.html#inferring-the-schema-using-reflection
I omit this step.
You can then convert to a DataFrame:
import spark.implicits._
val myCSVDataframe = myCSVRDD.toDF()
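For the step omitted above, a hedged sketch of turning each parsed line into a case class before calling toDF; the Person fields are purely illustrative and must match your actual CSV layout:
import org.apache.spark.rdd.RDD
// illustrative record type; adjust the fields to your CSV columns
case class Person(name: String, age: Int)
val personRDD: RDD[Person] = myCSVRDD.map {
  case List(name, age) => Person(name, age.trim.toInt)
}
import spark.implicits._
val peopleDF = personRDD.toDF()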
The accepted answer wasn't working for me in Spark 2.2.0, but it led me to what I needed with csvData.lines.toList:
val fileUrl = getClass.getResource(s"/file_in_resources.csv")
val stream = fileUrl.getContent.asInstanceOf[InputStream]
val streamString = Source.fromInputStream(stream).mkString
val csvList = streamString.lines.toList
spark.read
.option("header", "true")
.option("inferSchema", "true")
.csv(csvList.toDS())
.as[SomeCaseClass]
