How to convert Timestamp to Date format in DataFrame? - apache-spark

I have a DataFrame with a Timestamp column that I need to convert to Date format.
Is there any Spark SQL function available for this?

You can cast the column to date:
Scala:
import org.apache.spark.sql.types.DateType
val newDF = df.withColumn("dateColumn", df("timestampColumn").cast(DateType))
PySpark:
df = df.withColumn('dateColumn', df['timestampColumn'].cast('date'))

In SparkSQL:
SELECT
CAST(the_ts AS DATE) AS the_date
FROM the_table
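For completeness, a minimal sketch of running that SQL from Scala, assuming the question's DataFrame is available as df and its timestamp column is named the_ts (names taken from the snippet above):
// Register the DataFrame as a temp view so it can be queried with SQL.
df.createOrReplaceTempView("the_table")
val dated = spark.sql("SELECT CAST(the_ts AS DATE) AS the_date FROM the_table")
dated.printSchema()  // the_date should show up with type date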

Imagine the following input:
import org.apache.spark.sql.functions.{current_timestamp, to_date}
import spark.implicits._  // for the $"ts" column syntax used below

val dataIn = spark.createDataFrame(Seq(
  (1, "some data"),
  (2, "more data")))
  .toDF("id", "stuff")
  .withColumn("ts", current_timestamp())
dataIn.printSchema
root
|-- id: integer (nullable = false)
|-- stuff: string (nullable = true)
|-- ts: timestamp (nullable = false)
You can use the to_date function:
val dataOut = dataIn.withColumn("date", to_date($"ts"))
dataOut.printSchema
root
|-- id: integer (nullable = false)
|-- stuff: string (nullable = true)
|-- ts: timestamp (nullable = false)
|-- date: date (nullable = false)
dataOut.show(false)
+---+---------+-----------------------+----------+
|id |stuff |ts |date |
+---+---------+-----------------------+----------+
|1 |some data|2017-11-21 16:37:15.828|2017-11-21|
|2 |more data|2017-11-21 16:37:15.828|2017-11-21|
+---+---------+-----------------------+----------+
I would recommend preferring these methods over casting and plain SQL.

For Spark 2.4+:
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DateType
import spark.implicits._  // for the $ column syntax

val newDF = df.withColumn("dateColumn", $"timestampColumn".cast(DateType))
OR
val newDF = df.withColumn("dateColumn", col("timestampColumn").cast(DateType))

A tried-and-tested one-liner in PySpark (note the reassignment, since withColumn returns a new DataFrame):
df_join_result = df_join_result.withColumn('order_date', df_join_result['order_date'].cast('date'))

Related

UDF to cast a map<bigint,struct<in1:bigint,in2:string>> column to add more fields to inner struct

I have a Hive table which, when read into Spark as spark.table(<table_name>), has the structure below:
scala> df.printSchema
root
|-- id: long (nullable = true)
|-- info: map (nullable = true)
| |-- key: long
| |-- value: struct (valueContainsNull = true)
| | |-- in1: long (nullable = true)
| | |-- in2: string (nullable = true)
I want to cast the map column to add more fields to the inner struct, e.g. in3 and in4,
in this example: map<bigint,struct<in1:bigint,in2:string,in3:decimal(18,5),in4:string>>
I have tried a normal cast but that doesn't work, so I am checking whether I can achieve this through a UDF.
I will assign defaults to these new fields, like 0 for the decimal and "" for the string.
Below is what I tried, but I can't get it to work. Can anyone suggest how I can achieve this?
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types._
import spark.implicits._

val origStructType = new StructType().add("in1", LongType, nullable = true).add("in2", StringType, nullable = true)
val newStructType = origStructType.add("in1", LongType, nullable = true).add("in2", StringType, nullable = true).add("in3", DecimalType(18,5), nullable = true).add("in4", StringType, nullable = true)
val newColSchema = MapType(LongType, newStructType)
val m = Map(101L->(101L,"val2"),102L->(102L,"val3"))
val df = Seq((100L,m)).toDF("id","info")
val typeUDFNewRet = udf((col1: Map[Long,Seq[(Long,String)]]) => {
col1.mapValues(v => Seq(v(0),v(1),null,"")) //Forced to use null here for another issue
}, newColSchema)
spark.udf.register("typeUDFNewRet",typeUDFNewRet)
df.registerTempTable("op1")
val df2 = spark.sql("select id, typeUDFNewRet(info) from op1")
scala> val df2 = spark.sql("select id, typeUDFNewRet(info) from op1")
df2: org.apache.spark.sql.DataFrame = [id: bigint, UDF(info): map<bigint,struct<in1:bigint,in2:string,in1:bigint,in2:string,in3:decimal(18,5),in4:string>>]
scala> df2.show(false)
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to scala.collection.Seq
at $anonfun$1$$anonfun$apply$1.apply(<console>:43)
at scala.collection.MapLike$MappedValues$$anonfun$iterato
I have also tried returning a Row using this answer, but that gives a different issue.
Try this (the key changes are to add only the new fields to the struct type, and to take the map values as Row, since Spark hands struct values to a Scala UDF as Row rather than a tuple or Seq):
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types._
import spark.implicits._

val origStructType = new StructType().add("in1", LongType, nullable = true).add("in2", StringType, nullable = true)
val newStructType = origStructType.add("in3", DecimalType(18,5), nullable = true).add("in4", StringType, nullable = true)
val newColSchema = MapType(LongType, newStructType)
val m = Map(101L->(101L,"val2"),102L->(102L,"val3"))
val df = Seq((100L,m)).toDF("id","info")
df.show(false)
df.printSchema()
val typeUDFNewRet = udf((col1: Map[Long,Row]) => {
col1.mapValues(r => Row.merge(r, Row(null, ""))) //Forced to use null here for another issue
}, newColSchema)
spark.udf.register("typeUDFNewRet",typeUDFNewRet)
df.registerTempTable("op1")
val df2 = spark.sql("select id, typeUDFNewRet(info) from op1")
df2.show(false)
df2.printSchema()
/**
* +---+----------------------------------------------+
* |id |UDF(info) |
* +---+----------------------------------------------+
* |100|[101 -> [101, val2,, ], 102 -> [102, val3,, ]]|
* +---+----------------------------------------------+
*
* root
* |-- id: long (nullable = false)
* |-- UDF(info): map (nullable = true)
* | |-- key: long
* | |-- value: struct (valueContainsNull = true)
* | | |-- in1: long (nullable = true)
* | | |-- in2: string (nullable = true)
* | | |-- in3: decimal(18,5) (nullable = true)
* | | |-- in4: string (nullable = true)
*/
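A small usage note (my own addition, not part of the original answer): aliasing the UDF call gives the result column a plain name, which is easier to reference downstream than the generated UDF(info):
val df3 = spark.sql("select id, typeUDFNewRet(info) as info_new from op1")
df3.printSchema()  // info_new has the same map<bigint, struct<in1,in2,in3,in4>> type as UDF(info) above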

Aggregate one column, but show all columns in select

I am trying to show the maximum value of a column while grouping rows by a date column,
so I tried this code:
from pyspark.sql.functions import max

maxVal = dfSelect.select('*')\
    .groupBy('DATE')\
    .agg(max('CLOSE'))
But the output looks like this:
+----------+----------+
| DATE|max(CLOSE)|
+----------+----------+
|1987-05-08| 43.51|
|1987-05-29| 39.061|
+----------+----------+
I want the output to look like below:
+------+---+----------+------+------+------+------+------+---+----------+
|TICKER|PER| DATE| TIME| OPEN| HIGH| LOW| CLOSE|VOL|max(CLOSE)|
+------+---+----------+------+------+------+------+------+---+----------+
| CDG| D|1987-01-02|000000|50.666|51.441|49.896|50.666| 0| 50.666|
| ABC| D|1987-01-05|000000|51.441| 52.02|51.441|51.441| 0| 51.441|
+------+---+----------+------+------+------+------+------+---+----------+
So my question is: how do I change the code to get output with all columns plus the aggregated CLOSE column?
The schema of my data looks like this:
root
|-- TICKER: string (nullable = true)
|-- PER: string (nullable = true)
|-- DATE: date (nullable = true)
|-- TIME: string (nullable = true)
|-- OPEN: float (nullable = true)
|-- HIGH: float (nullable = true)
|-- LOW: float (nullable = true)
|-- CLOSE: float (nullable = true)
|-- VOL: integer (nullable = true)
|-- OPENINT: string (nullable = true)
If you want the same aggregation for all the columns in your original dataframe, you can do something like:
import pyspark.sql.functions as F
expr = [F.max(coln).alias(coln) for coln in df.columns if 'date' not in coln]  # df is your dataframe
df_res = df.groupby('date').agg(*expr)
If you want multiple aggregations, you can do something like:
sub_col1 = []  # define the columns to aggregate with max
sub_col2 = []  # define the columns to aggregate with first
expr1 = [F.max(coln).alias(coln) for coln in sub_col1 if 'date' not in coln]
expr2 = [F.first(coln).alias(coln) for coln in sub_col2 if 'date' not in coln]
expr = expr1 + expr2
df_res = df.groupby('date').agg(*expr)
If you want only one of the columns aggregated and added to your original dataframe, you can do a self-join after aggregating:
df_agg = df.groupby('date').agg(F.max('close').alias('close_agg')).withColumn("dummy", F.lit("dummy"))  # dummy column is a workaround for Spark self-join issues
df_join = df.join(df_agg, on='date', how='left')
Or you can use a window function:
from pyspark.sql import Window
w= Window.partitionBy('date')
df_res = df.withColumn("max_close",F.max('close').over(w))

How to JSON-escape a String field in Spark dataFrame with new column

How do I write a new column in JSON format through a DataFrame? I have tried several approaches, but it keeps writing the data as a JSON-escaped String field.
Currently it is written as:
{"test":{"id":1,"name":"name","problem_field": "{\"x\":100,\"y\":200}"}}
Instead I want it to be:
{"test":{"id":1,"name":"name","problem_field": {"x":100,"y":200}}}
problem_field is a new column that is created based on the values read from other fields:
val dataFrame = oldDF.withColumn("problem_field", s)
I have tried the following approaches:
dataFrame.write.json(<<outputPath>>)
dataFrame.toJSON.map(value => value.replace("\\", "").replace("{\"value\":\"", "").replace("}\"}", "}")).write.json(<<outputPath>>)
I tried converting to a Dataset as well, but no luck. Any pointers are greatly appreciated.
I have already tried the logic mentioned here: How to let Spark parse a JSON-escaped String field as a JSON Object to infer the proper structure in DataFrames?
For starters, make sure the example data does not contain an extraneous comma after "y\":200, since that would make it invalid JSON and prevent it from being parsed.
From there, you can use from_json to parse the field, assuming you know the schema. In this example, I'm parsing the field separately to first get the schema:
scala> val json = spark.read.json(Seq("""{"test":{"id":1,"name":"name","problem_field": "{\"x\":100,\"y\":200}"}}""").toDS)
json: org.apache.spark.sql.DataFrame = [test: struct<id: bigint, name: string ... 1 more field>]
scala> json.printSchema
root
|-- test: struct (nullable = true)
| |-- id: long (nullable = true)
| |-- name: string (nullable = true)
| |-- problem_field: string (nullable = true)
scala> val problem_field = spark.read.json(json.select($"test.problem_field").map{
case org.apache.spark.sql.Row(x : String) => x
})
problem_field: org.apache.spark.sql.DataFrame = [x: bigint, y: bigint]
scala> problem_field.printSchema
root
|-- x: long (nullable = true)
|-- y: long (nullable = true)
scala> val fixed = json.withColumn("test", struct($"test.id", $"test.name", from_json($"test.problem_field", problem_field.schema).as("problem_field")))
fixed: org.apache.spark.sql.DataFrame = [test: struct<id: bigint, name: string ... 1 more field>]
scala> fixed.printSchema
root
|-- test: struct (nullable = false)
| |-- id: long (nullable = true)
| |-- name: string (nullable = true)
| |-- problem_field: struct (nullable = true)
| | |-- x: long (nullable = true)
| | |-- y: long (nullable = true)
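With problem_field now a proper struct, serializing the DataFrame back to JSON gives the unescaped form the question asks for; a minimal check building on fixed above:
fixed.toJSON.collect().foreach(println)
// expected: {"test":{"id":1,"name":"name","problem_field":{"x":100,"y":200}}}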
If the schema of problem_field's contents is inconsistent between rows, this solution will still work, but it may not be an optimal way of handling things, as it will produce a sparse DataFrame where each row contains every field encountered in problem_field anywhere in the dataset. For example:
scala> val json = spark.read.json(Seq("""{"test":{"id":1,"name":"name","problem_field": "{\"x\":100,\"y\":200}"}}""", """{"test":{"id":1,"name":"name","problem_field": "{\"a\":10,\"b\":20}"}}""").toDS)
json: org.apache.spark.sql.DataFrame = [test: struct<id: bigint, name: string ... 1 more field>]
scala> val problem_field = spark.read.json(json.select($"test.problem_field").map{case org.apache.spark.sql.Row(x : String) => x})
problem_field: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 2 more fields]
scala> problem_field.printSchema
root
|-- a: long (nullable = true)
|-- b: long (nullable = true)
|-- x: long (nullable = true)
|-- y: long (nullable = true)
scala> val fixed = json.withColumn("test", struct($"test.id", $"test.name", from_json($"test.problem_field", problem_field.schema).as("problem_field")))
fixed: org.apache.spark.sql.DataFrame = [test: struct<id: bigint, name: string ... 1 more field>]
scala> fixed.printSchema
root
|-- test: struct (nullable = false)
| |-- id: long (nullable = true)
| |-- name: string (nullable = true)
| |-- problem_field: struct (nullable = true)
| | |-- a: long (nullable = true)
| | |-- b: long (nullable = true)
| | |-- x: long (nullable = true)
| | |-- y: long (nullable = true)
scala> fixed.select($"test.problem_field.*").show
+----+----+----+----+
| a| b| x| y|
+----+----+----+----+
|null|null| 100| 200|
| 10| 20|null|null|
+----+----+----+----+
Over the course of hundreds, thousands, or millions of rows, you can see how this would present a problem.

Convert datatype of column from StringType to StructType in dataframe in spark scala

+-------+-----+---------------------------------+
|ID     |CO_ID|DATA                             |
+-------+-----+---------------------------------+
|ABCD123|abc12|[{"month":"Jan","day":"monday"}] |
|BCHG345|wed34|[{"month":"Jul","day":"tuessay"}]|
+-------+-----+---------------------------------+
I have the dataframe above, in which the DATA column is of StringType. I want to convert it to StructType. How can I do this?
Use from_json:
df.withColumn("data_struct", from_json($"data", StructType(Array(StructField("month", StringType), StructField("day", StringType)))))
On Spark 2.4.0, I get the following:
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.{StructType, StructField, StringType}
import spark.implicits._

val df = List("[{\"month\":\"Jan\",\"day\":\"monday\"}]").toDF("data")
val df2 = df.withColumn("data_struct", from_json($"data", StructType(Array(StructField("month", StringType), StructField("day", StringType)))))
df2.show
+--------------------+-------------+
| data| data_struct|
+--------------------+-------------+
|[{"month":"Jan","...|[Jan, monday]|
+--------------------+-------------+
df2.printSchema
root
|-- data: string (nullable = true)
|-- data_struct: struct (nullable = true)
| |-- month: string (nullable = true)
| |-- day: string (nullable = true)
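Once the column is a struct, its fields can be selected directly with dot notation; a small usage sketch building on df2 above:
// Pull the parsed fields out of the struct (values from the example row above).
df2.select($"data_struct.month", $"data_struct.day").show()
// -> month = Jan, day = monday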

Got an error when using DataFrame.schema.fields.update

I want to cast two columns in my DataFrame. Here is my code:
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{StringType, StructField}

val session = SparkSession
  .builder
  .master("local")
  .appName("UDTransform")
  .getOrCreate()
var df: DataFrame = session.createDataFrame(Seq((1, "Spark", 111), (2, "Storm", 112), (3, "Hadoop", 113), (4, "Kafka", 114), (5, "Flume", 115), (6, "Hbase", 116)))
  .toDF("CID", "Name", "STD")
df.printSchema()
df.schema.fields.update(0, StructField("CID", StringType))
df.schema.fields.update(2, StructField("STD", StringType))
df.printSchema()
df.show()
I get these logs from my console:
root
|-- CID: integer (nullable = false)
|-- Name: string (nullable = true)
|-- STD: integer (nullable = false)
root
|-- CID: string (nullable = true)
|-- Name: string (nullable = true)
|-- STD: string (nullable = true)
17/06/28 12:44:32 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 36, Column 31: A method named "toString" is not declared in any enclosing class nor any supertype, nor through a static import
All I want to know is why this error happens and how I can solve it.
Any help is appreciated!
You cannot update the schema of a DataFrame in place, since DataFrames are immutable,
but you can build a new DataFrame with the schema you want.
Here is how you can do it:
val newDF = df.withColumn("CID", col("CID").cast("string"))
.withColumn("STD", col("STD").cast("string"))
newDF.printSchema()
The schema of newDF is
root
|-- CID: string (nullable = true)
|-- Name: string (nullable = true)
|-- STD: string (nullable = true)
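If several columns need the same cast, a small sketch (my own generalization, not part of the original answer) that folds over a list of column names:
// Cast each listed column to string, producing a new immutable DataFrame at each step.
val toCast = Seq("CID", "STD")
val castDF = toCast.foldLeft(df)((acc, c) => acc.withColumn(c, col(c).cast("string")))
castDF.printSchema()  // CID and STD are now strings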
Your code:
df.schema.fields.update(0, StructField("CID", StringType))
df.schema.fields.update(2, StructField("STD", StringType))
df.printSchema()
df.show()
In your code,
df.schema.fields returns an Array[StructField].
Then when you call
df.schema.fields.update(0, StructField("CID", StringType))
it updates the element at position 0 of that array, which is not what you want:
DataFrame.schema.fields.update does not update the DataFrame's schema; it only mutates the array of StructField returned by DataFrame.schema.fields. That is likely why the second printSchema shows string columns while the underlying data and the generated code still treat CID and STD as integers, which triggers the CodeGenerator error when show() runs.
Hope this helps.
