How to convert Timestamp column to milliseconds Long column in Spark SQL - apache-spark

What is the shortest and the most efficient way in Spark SQL to transform Timestamp column to a milliseconds timestamp Long column?
Here is an example of a transformation from timestamp to milliseconds
scala> val ts = spark.sql("SELECT now() as ts")
ts: org.apache.spark.sql.DataFrame = [ts: timestamp]
scala> ts.show(false)
+-----------------------+
|ts |
+-----------------------+
|2019-06-18 12:32:02.41 |
+-----------------------+
scala> val tss = ts.selectExpr(
| "ts",
| "BIGINT(ts) as seconds_ts",
| "BIGINT(ts) * 1000 + BIGINT(date_format(ts, 'SSS')) as millis_ts"
| )
tss: org.apache.spark.sql.DataFrame = [ts: timestamp, seconds_ts: bigint ... 1 more field]
scala> tss.show(false)
+----------------------+----------+-------------+
|ts |seconds_ts|millis_ts |
+----------------------+----------+-------------+
|2019-06-18 12:32:02.41|1560861122|1560861122410|
+----------------------+----------+-------------+
As you can see, the most straightforward method to get milliseconds from timestamp doesn't work - cast to long returns seconds, however milliseconds information in timestamp is preserved.
The only way I found to extract the milliseconds information is by using the date_format function, which is nowhere near as simple as I would expect.
Does anybody know the way to get milliseconds UNIX time out of Timestamp column simpler than that?

According to the code on Spark's DateTimeUtils:
"Timestamps are exposed externally as java.sql.Timestamp and are stored internally as longs, which are capable of storing timestamps with microsecond precision."
Therefore, if you define a UDF that takes a java.sql.Timestamp as input, you can simply call getTime to get a Long in milliseconds.
import org.apache.spark.sql.functions.{col, to_timestamp, udf}
import org.apache.spark.sql.types.LongType
import spark.implicits._

val tsConversionToLongUdf = udf((ts: java.sql.Timestamp) => ts.getTime)
Applying this to a variety of Timestamps:
val df = Seq("2017-01-18 11:00:00.000", "2017-01-18 11:00:00.111", "2017-01-18 11:00:00.110", "2017-01-18 11:00:00.100")
.toDF("timestampString")
.withColumn("timestamp", to_timestamp(col("timestampString")))
.withColumn("timestampConversionToLong", tsConversionToLongUdf(col("timestamp")))
.withColumn("timestampCastAsLong", col("timestamp").cast(LongType))
df.printSchema()
df.show(false)
// returns
root
|-- timestampString: string (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampConversionToLong: long (nullable = false)
|-- timestampCastAsLong: long (nullable = true)
+-----------------------+-----------------------+-------------------------+-------------------+
|timestampString |timestamp |timestampConversionToLong|timestampCastAsLong|
+-----------------------+-----------------------+-------------------------+-------------------+
|2017-01-18 11:00:00.000|2017-01-18 11:00:00 |1484733600000 |1484733600 |
|2017-01-18 11:00:00.111|2017-01-18 11:00:00.111|1484733600111 |1484733600 |
|2017-01-18 11:00:00.110|2017-01-18 11:00:00.11 |1484733600110 |1484733600 |
|2017-01-18 11:00:00.100|2017-01-18 11:00:00.1 |1484733600100 |1484733600 |
+-----------------------+-----------------------+-------------------------+-------------------+
Note that the column "timestampCastAsLong" just shows that a direct cast to a Long will not return the desired result in milliseconds, but only in seconds.
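If you prefer to avoid a UDF, one alternative is to cast the timestamp to a double, which keeps the fractional seconds. A minimal PySpark sketch (assuming a DataFrame df with a timestamp column named "timestamp"; the round guards against floating-point truncation):
import pyspark.sql.functions as F

# cast to double (epoch seconds with fraction), scale to milliseconds, round and cast to long
millis_df = df.withColumn(
    "millis_ts",
    F.round(F.col("timestamp").cast("double") * 1000).cast("long")
)
millis_df.show(truncate=False)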

Related

pyspark: how to partition by date column in format 'yyyy-MM-dd HH'

I have tried the following:
df = (spark.createDataFrame([(1, '2020-12-03 01:01:01'), (2, '2022-11-04 10:10:10'),], ['id', 'txt'])
.withColumn("testCol", to_timestamp(col("txt"), "yyyy-MM-dd HH")))
I basically want a timestamp/datetime column in the format (yyyy-MM-dd HH). The above piece of code gives the expected timestamp column. But when I try to write this to Azure Blob Storage partitioned by this time column, the partition folder names come out as garbage.
Is there any cleaner way to do this, such that the column format remains timestamp/datetime in the format (yyyy-MM-dd HH) and, at the same time, the partition folder names look clean rather than URL-encoded strings like '%3A55%...'?
Thanks.
Use date_format:
import pyspark.sql.functions as F
df = spark.createDataFrame(
[(1, '2020-12-03 01:01:01'), (2, '2022-11-04 10:10:10')],
['id', 'txt']
)
df = df.withColumn("testCol", F.col("txt").cast("timestamp"))
df.withColumn("testCol", F.date_format("txt", "yyyy-MM-dd HH")).write.partitionBy('testCol').csv('output')
df.show()
+---+-------------------+-------------------+
| id| txt| testCol|
+---+-------------------+-------------------+
| 1|2020-12-03 01:01:01|2020-12-03 01:01:01|
| 2|2022-11-04 10:10:10|2022-11-04 10:10:10|
+---+-------------------+-------------------+
df.printSchema()
root
|-- id: long (nullable = true)
|-- txt: string (nullable = true)
|-- testCol: timestamp (nullable = true)
$ ls output
_SUCCESS testCol=2020-12-03 01 testCol=2022-11-04 10
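If you want the DataFrame itself to keep a real timestamp column and still get clean folder names, a small variation (the part_hour column name is hypothetical) is to derive a separate string column used only for partitioning:
import pyspark.sql.functions as F

df = df.withColumn("testCol", F.col("txt").cast("timestamp"))
(df.withColumn("part_hour", F.date_format("testCol", "yyyy-MM-dd HH"))  # string, no ':' to URL-encode
   .write.partitionBy("part_hour")
   .csv("output"))
The written data then still carries testCol as a timestamp, and the partition folders come out like part_hour=2020-12-03 01.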

Is there any way to handle time in pyspark?

I have a string with 6 characters which should be loaded into SQL Server as the TIME data type.
But Spark doesn't have any time data type. I have tried a few ways, but the data type does not come back as a timestamp.
I am reading the data as a string, converting it to a timestamp, and then finally trying to extract the time values, but they are returned as a string again.
df.select('time_col').withColumn("time_col",to_timestamp(col("time_col"),"HHmmss").cast(TimestampType())).withColumn("tim2", date_format(col("time_col"), "HHmmss")).printSchema()
root
|-- time_col: timestamp (nullable = true)
|-- tim2: string (nullable = true)
And the data looks like this but in a different data type.
df.select('time_col').withColumn("time_col",to_timestamp(col("time_col"),"HHmmss").cast(TimestampType())).withColumn("tim2", date_format(col("time_col"), "HHmmss")).show(5)
+-------------------+------+
| time_col| tim2|
+-------------------+------+
|1970-01-01 14:44:51|144451|
|1970-01-01 14:48:37|144837|
|1970-01-01 14:46:10|144610|
|1970-01-01 11:46:39|114639|
|1970-01-01 17:44:33|174433|
+-------------------+------+
Is there any way I can get tim2 column in timestamp column or column equivalent to TIME data type of SQL Server?
I don't think you will get what you are trying to do: there's no type in PySpark to handle "HH:mm:ss"; see: What data type should be used for a time column
I'd suggest you use it as a string.
In my case I convert it to a timestamp in Spark, and just before sending it to SQL Server I turn it back into a string. That worked fine for me.
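A PySpark sketch of that workflow (the connection details and table name are placeholders): keep the value as a timestamp inside Spark, then render it as an 'HH:mm:ss' string just before the JDBC write so SQL Server can store it in a TIME column:
import pyspark.sql.functions as F

out = df.withColumn("time_col", F.date_format("time_col", "HH:mm:ss"))  # timestamp -> "HH:mm:ss" string
(out.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://<host>:1433;databaseName=<db>")  # placeholder connection
    .option("dbtable", "dbo.my_table")                                # placeholder table
    .option("user", "<user>")
    .option("password", "<password>")
    .mode("append")
    .save())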
Maybe this will help you, but it seems to me that this changes the column to a string:
df.withColumn('TIME', date_format('datetime', 'HH:mm:ss'))
In Scala (Python will be similar):
scala> val df = Seq("144451","144837").toDF("c").select('c.cast("INT").cast("TIMESTAMP"))
df: org.apache.spark.sql.DataFrame = [c: timestamp]
scala> df.show()
+-------------------+
| c|
+-------------------+
|1970-01-02 17:07:31|
|1970-01-02 17:13:57|
+-------------------+
scala> df.printSchema()
root
|-- c: timestamp (nullable = true)

Aggregating tuples within a DataFrame together [duplicate]

This question already has answers here:
How to pivot Spark DataFrame?
(10 answers)
Closed 4 years ago.
I currently am trying to do some aggregation on the services column. I would like to group all the similar services and sum the values, and if possible flatten this into a single row.
Input:
+------------------+--------------------+
| cid | Services|
+------------------+--------------------+
|845124826013182686| [112931, serv1]|
|845124826013182686| [146936, serv1]|
|845124826013182686| [32718, serv2]|
|845124826013182686| [28839, serv2]|
|845124826013182686| [8710, serv2]|
|845124826013182686| [2093140, serv3]|
Hopeful Output:
+------------------+--------------------+------------------+--------------------+
| cid | serv1 | serv2 | serv3 |
+------------------+--------------------+------------------+--------------------+
|845124826013182686| 259867 | 70267 | 2093140 |
Below is the code I currently have
from pyspark.sql import SparkSession, functions
spark = SparkSession.builder.appName("Service Aggregation").getOrCreate()
pathToFile = '/path/to/jsonfile'
df = spark.read.json(pathToFile)
df2 = df.select('cid',functions.explode_outer(df.nodes.services))
finaldataFrame = df2.select('cid',(functions.explode_outer(df2.col)).alias('Services'))
finaldataFrame.show()
I am quite new to pyspark and have been looking at resources and trying to create a UDF to apply to that column, but the map function within pyspark only works for RDDs and not DataFrames, and I am unsure how to move forward to get the desired output.
Any suggestions or help would be much appreciated.
Result of printSchema
root
|-- clusterId: string (nullable = true)
|-- col: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- cpuCoreInSeconds: long (nullable = true)
| | |-- name: string (nullable = true)
First, extract the service and the value from the Services column by position. Note this assumes that the value is always in position 0 and the service is always in position 1 (as shown in your example).
import pyspark.sql.functions as f
df2 = df.select(
'cid',
f.col("Services").getItem(0).alias('value').cast('integer'),
f.col("Services").getItem(1).alias('service')
)
df2.show()
#+------------------+-------+-------+
#| cid| value|service|
#+------------------+-------+-------+
#|845124826013182686| 112931| serv1|
#|845124826013182686| 146936| serv1|
#|845124826013182686| 32718| serv2|
#|845124826013182686| 28839| serv2|
#|845124826013182686| 8710| serv2|
#|845124826013182686|2093140| serv3|
#+------------------+-------+-------+
Note that I cast the value to integer, but it may already be an integer depending on how your schema is defined.
Once the data is in this format, it's easy to pivot() it. Group by the cid column, pivot the service column, and aggregate by summing the value column:
df2.groupBy('cid').pivot('service').sum("value").show()
#+------------------+------+-----+-------+
#| cid| serv1|serv2| serv3|
#+------------------+------+-----+-------+
#|845124826013182686|259867|70267|2093140|
#+------------------+------+-----+-------+
Update
Based on the schema you provided, you will have to get the value and service by name, rather than by position:
df2 = df.select(
'cid',
f.col("Services").getItem("cpuCoreInSeconds").alias('value'),
f.col("Services").getItem("name").alias('service')
)
The rest is the same. Also, no need to cast to integer as cpuCoreInSeconds is already a long.
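Putting both steps together, here is a self-contained sketch with hypothetical data that mirrors the schema above (already exploded, so each row holds a single Services struct):
from pyspark.sql import Row
import pyspark.sql.functions as f

rows = [
    Row(cid="845124826013182686", Services=Row(cpuCoreInSeconds=112931, name="serv1")),
    Row(cid="845124826013182686", Services=Row(cpuCoreInSeconds=146936, name="serv1")),
    Row(cid="845124826013182686", Services=Row(cpuCoreInSeconds=32718, name="serv2")),
    Row(cid="845124826013182686", Services=Row(cpuCoreInSeconds=2093140, name="serv3")),
]
df = spark.createDataFrame(rows)

# extract the struct fields by name, then pivot service names into columns and sum the values
df2 = df.select(
    "cid",
    f.col("Services").getItem("cpuCoreInSeconds").alias("value"),
    f.col("Services").getItem("name").alias("service"),
)
df2.groupBy("cid").pivot("service").sum("value").show()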

Can unix_timestamp() return unix time in milliseconds in Apache Spark?

I'm trying to get the unix time from a timestamp field in milliseconds (13 digits) but currently it returns in seconds (10 digits).
scala> var df = Seq("2017-01-18 11:00:00.000", "2017-01-18 11:00:00.123", "2017-01-18 11:00:00.882", "2017-01-18 11:00:02.432").toDF()
df: org.apache.spark.sql.DataFrame = [value: string]
scala> df = df.selectExpr("value timeString", "cast(value as timestamp) time")
df: org.apache.spark.sql.DataFrame = [timeString: string, time: timestamp]
scala> df = df.withColumn("unix_time", unix_timestamp(df("time")))
df: org.apache.spark.sql.DataFrame = [timeString: string, time: timestamp ... 1 more field]
scala> df.take(4)
res63: Array[org.apache.spark.sql.Row] = Array(
[2017-01-18 11:00:00.000,2017-01-18 11:00:00.0,1484758800],
[2017-01-18 11:00:00.123,2017-01-18 11:00:00.123,1484758800],
[2017-01-18 11:00:00.882,2017-01-18 11:00:00.882,1484758800],
[2017-01-18 11:00:02.432,2017-01-18 11:00:02.432,1484758802])
Even though 2017-01-18 11:00:00.123 and 2017-01-18 11:00:00.000 are different, I get the same unix time back 1484758800
What am I missing?
The milliseconds hide in the fractional part of the timestamp.
Try this:
df = df.withColumn("time_in_milliseconds", col("time").cast("double"))
You'll get something like 1484758800.792, where 792 is the milliseconds.
At least it works for me (Scala, Spark, Hive).
Implementing the approach suggested in Dao Thi's answer
import pyspark.sql.functions as F
df = spark.createDataFrame([('22-Jul-2018 04:21:18.792 UTC', ),('23-Jul-2018 04:21:25.888 UTC',)], ['TIME'])
df.show(2,False)
df.printSchema()
Output:
+----------------------------+
|TIME |
+----------------------------+
|22-Jul-2018 04:21:18.792 UTC|
|23-Jul-2018 04:21:25.888 UTC|
+----------------------------+
root
|-- TIME: string (nullable = true)
Convert the string time format (including milliseconds) to unix_timestamp (double). Extract the milliseconds from the string using the substring method (start_position = -7, length_of_substring = 3) and add them separately to the unix_timestamp (cast the substring to float before adding).
df1 = df.withColumn("unix_timestamp",F.unix_timestamp(df.TIME,'dd-MMM-yyyy HH:mm:ss.SSS z') + F.substring(df.TIME,-7,3).cast('float')/1000)
Converting unix_timestamp(double) to timestamp datatype in Spark.
df2 = df1.withColumn("TimestampType",F.to_timestamp(df1["unix_timestamp"]))
df2.show(n=2,truncate=False)
This will give you the following output:
+----------------------------+----------------+-----------------------+
|TIME |unix_timestamp |TimestampType |
+----------------------------+----------------+-----------------------+
|22-Jul-2018 04:21:18.792 UTC|1.532233278792E9|2018-07-22 04:21:18.792|
|23-Jul-2018 04:21:25.888 UTC|1.532319685888E9|2018-07-23 04:21:25.888|
+----------------------------+----------------+-----------------------+
Checking the Schema:
df2.printSchema()
root
|-- TIME: string (nullable = true)
|-- unix_timestamp: double (nullable = true)
|-- TimestampType: timestamp (nullable = true)
unix_timestamp() returns the unix timestamp in seconds.
The last 3 digits of the timestamp string are the milliseconds (1.999 sec = 1999 milliseconds), so you can just take those last 3 digits and append them to the seconds value.
It cannot be done with unix_timestamp() but since Spark 3.1.0 there is a built-in function called unix_millis():
unix_millis(timestamp) - Returns the number of milliseconds since 1970-01-01 00:00:00 UTC. Truncates higher levels of precision.
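A minimal PySpark sketch of that built-in (Spark 3.1+), assuming a DataFrame df with a timestamp column named time as in the question:
# unix_millis is a SQL built-in since Spark 3.1.0
df.selectExpr("time", "unix_millis(time) AS unix_time_millis").show(truncate=False)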
Up to Spark version 3.0.1 it is not possible to convert a timestamp into unix time in milliseconds using the SQL built-in function unix_timestamp.
According to the code on Spark's DateTimeUtils
"Timestamps are exposed externally as java.sql.Timestamp and are stored internally as longs, which are capable of storing timestamps with microsecond precision."
Therefore, if you define a UDF that takes a java.sql.Timestamp as input, you can call getTime to get a Long in milliseconds. If you apply unix_timestamp you will only get unix time with a precision in seconds.
val tsConversionToLongUdf = udf((ts: java.sql.Timestamp) => ts.getTime)
Applying this to a variety of Timestamps:
val df = Seq("2017-01-18 11:00:00.000", "2017-01-18 11:00:00.111", "2017-01-18 11:00:00.110", "2017-01-18 11:00:00.100")
.toDF("timestampString")
.withColumn("timestamp", to_timestamp(col("timestampString")))
.withColumn("timestampConversionToLong", tsConversionToLongUdf(col("timestamp")))
.withColumn("timestampUnixTimestamp", unix_timestamp(col("timestamp")))
df.printSchema()
df.show(false)
// returns
root
|-- timestampString: string (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampConversionToLong: long (nullable = false)
|-- timestampUnixTimestamp: long (nullable = true)
+-----------------------+-----------------------+-------------------------+----------------------+
|timestampString        |timestamp              |timestampConversionToLong|timestampUnixTimestamp|
+-----------------------+-----------------------+-------------------------+----------------------+
|2017-01-18 11:00:00.000|2017-01-18 11:00:00    |1484733600000            |1484733600            |
|2017-01-18 11:00:00.111|2017-01-18 11:00:00.111|1484733600111            |1484733600            |
|2017-01-18 11:00:00.110|2017-01-18 11:00:00.11 |1484733600110            |1484733600            |
|2017-01-18 11:00:00.100|2017-01-18 11:00:00.1  |1484733600100            |1484733600            |
+-----------------------+-----------------------+-------------------------+----------------------+
Wow, same as Тимур Залимов's answer: just cast it
>>> import pyspark.sql.functions as F
>>> df2 = df_msg.withColumn("datetime", F.col("timestamp").cast("timestamp")).withColumn("timestamp_back", F.col("datetime").cast("double"))
>>> r = df2.rdd.take(1)[0]
>>> r.timestamp_back
1666509660.071501
>>> r.timestamp
1666509660.071501
>>> r.datetime
datetime.datetime(2022, 10, 23, 15, 21, 0, 71501)

Convert date from String to Date format in Dataframes

I am trying to convert a column which is in String format to Date format using the to_date function, but it's returning null values.
df.createOrReplaceTempView("incidents")
spark.sql("select Date from incidents").show()
+----------+
| Date|
+----------+
|08/26/2016|
|08/26/2016|
|08/26/2016|
|06/14/2016|
spark.sql("select to_date(Date) from incidents").show()
+---------------------------+
|to_date(CAST(Date AS DATE))|
+---------------------------+
| null|
| null|
| null|
| null|
The Date column is in String format:
|-- Date: string (nullable = true)
Use to_date with Java SimpleDateFormat.
TO_DATE(CAST(UNIX_TIMESTAMP(date, 'MM/dd/yyyy') AS TIMESTAMP))
Example:
spark.sql("""
SELECT TO_DATE(CAST(UNIX_TIMESTAMP('08/26/2016', 'MM/dd/yyyy') AS TIMESTAMP)) AS newdate"""
).show()
+----------+
|   newdate|
+----------+
|2016-08-26|
+----------+
I solved the same problem without the temp table/view and with dataframe functions.
Of course I found that only one format works with this solution, and that's yyyy-MM-dd.
For example:
val df = sc.parallelize(Seq("2016-08-26")).toDF("Id")
val df2 = df.withColumn("Timestamp", (col("Id").cast("timestamp")))
val df3 = df2.withColumn("Date", (col("Id").cast("date")))
df3.printSchema
root
|-- Id: string (nullable = true)
|-- Timestamp: timestamp (nullable = true)
|-- Date: date (nullable = true)
df3.show
+----------+--------------------+----------+
| Id| Timestamp| Date|
+----------+--------------------+----------+
|2016-08-26|2016-08-26 00:00:...|2016-08-26|
+----------+--------------------+----------+
The timestamp of course has 00:00:00.0 as a time value.
Since your main aim was to convert the type of a column in a DataFrame from String to Timestamp, I think this approach would be better.
import org.apache.spark.sql.functions.{to_date, to_timestamp}
val modifiedDF = DF.withColumn("Date", to_date($"Date", "MM/dd/yyyy"))
You could also use to_timestamp (I think this is available from Spark 2.x) if you require fine grained timestamp.
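In PySpark, a sketch of that to_timestamp variant might look like this (assuming a DataFrame df with a string column Date in MM/dd/yyyy format; the output column name is arbitrary):
import pyspark.sql.functions as F

# to_timestamp keeps the time-of-day part when the source string carries one
modifiedDF = df.withColumn("DateTs", F.to_timestamp(F.col("Date"), "MM/dd/yyyy"))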
You can also do this query:
sqlContext.sql("""
select from_unixtime(unix_timestamp('08/26/2016', 'MM/dd/yyyy'), 'yyyy:MM:dd') as new_format
""").show()
You can also pass date format
df.withColumn("Date",to_date(unix_timestamp(df.col("your_date_column"), "your_date_format").cast("timestamp")))
For Example
import org.apache.spark.sql.functions._
val df = sc.parallelize(Seq("06 Jul 2018")).toDF("dateCol")
df.withColumn("Date",to_date(unix_timestamp(df.col("dateCol"), "dd MMM yyyy").cast("timestamp")))
I have personally found some errors when using unix_timestamp-based date conversions from the dd-MMM-yyyy format to yyyy-MM-dd, using Spark 1.6, but this may extend to more recent versions. Below I explain a way to solve the problem using java.time that should work in all versions of Spark:
I've seen errors when doing:
from_unixtime(unix_timestamp(StockMarketClosingDate, 'dd-MMM-yyyy'), 'yyyy-MM-dd') as FormattedDate
Below is code to illustrate the error, and my solution to fix it.
First I read in stock market data, in a common standard file format:
import sys.process._
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType, DateType}
import sqlContext.implicits._
val EODSchema = StructType(Array(
StructField("Symbol" , StringType, true), //$1
StructField("Date" , StringType, true), //$2
StructField("Open" , StringType, true), //$3
StructField("High" , StringType, true), //$4
StructField("Low" , StringType, true), //$5
StructField("Close" , StringType, true), //$6
StructField("Volume" , StringType, true) //$7
))
val textFileName = "/user/feeds/eoddata/INDEX/INDEX_19*.csv"
// below is code to read using later versions of spark
//val eoddata = spark.read.format("csv").option("sep", ",").schema(EODSchema).option("header", "true").load(textFileName)
// here is code to read using 1.6, via, "com.databricks:spark-csv_2.10:1.2.0"
val eoddata = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "true") // Use first line of all files as header
.option("delimiter", ",") //.option("dateFormat", "dd-MMM-yyyy") failed to work
.schema(EODSchema)
.load(textFileName)
eoddata.registerTempTable("eoddata")
And here are the date conversions having issues:
%sql
-- notice there are errors around the turn of the year
Select
e.Date as StringDate
, cast(from_unixtime(unix_timestamp(e.Date, "dd-MMM-yyyy"), 'YYYY-MM-dd') as Date) as ProperDate
, e.Close
from eoddata e
where e.Symbol = 'SPX.IDX'
order by cast(from_unixtime(unix_timestamp(e.Date, "dd-MMM-yyyy"), 'YYYY-MM-dd') as Date)
limit 1000
A chart made in zeppelin shows spikes, which are errors.
and here is the check that shows the date conversion errors:
// shows the unix_timestamp conversion approach can create errors
val result = sqlContext.sql("""
Select errors.* from
(
Select
t.*
, substring(t.OriginalStringDate, 8, 11) as String_Year_yyyy
, substring(t.ConvertedCloseDate, 0, 4) as Converted_Date_Year_yyyy
from
( Select
Symbol
, cast(from_unixtime(unix_timestamp(e.Date, "dd-MMM-yyyy"), 'YYYY-MM-dd') as Date) as ConvertedCloseDate
, e.Date as OriginalStringDate
, Close
from eoddata e
where e.Symbol = 'SPX.IDX'
) t
) errors
where String_Year_yyyy <> Converted_Date_Year_yyyy
""")
//df.withColumn("tx_date", to_date(unix_timestamp($"date", "M/dd/yyyy").cast("timestamp")))
result.registerTempTable("SPX")
result.cache()
result.show(100)
result: org.apache.spark.sql.DataFrame = [Symbol: string, ConvertedCloseDate: date, OriginalStringDate: string, Close: string, String_Year_yyyy: string, Converted_Date_Year_yyyy: string]
res53: result.type = [Symbol: string, ConvertedCloseDate: date, OriginalStringDate: string, Close: string, String_Year_yyyy: string, Converted_Date_Year_yyyy: string]
+-------+------------------+------------------+-------+----------------+------------------------+
| Symbol|ConvertedCloseDate|OriginalStringDate| Close|String_Year_yyyy|Converted_Date_Year_yyyy|
+-------+------------------+------------------+-------+----------------+------------------------+
|SPX.IDX| 1997-12-30| 30-Dec-1996| 753.85| 1996| 1997|
|SPX.IDX| 1997-12-31| 31-Dec-1996| 740.74| 1996| 1997|
|SPX.IDX| 1998-12-29| 29-Dec-1997| 953.36| 1997| 1998|
|SPX.IDX| 1998-12-30| 30-Dec-1997| 970.84| 1997| 1998|
|SPX.IDX| 1998-12-31| 31-Dec-1997| 970.43| 1997| 1998|
|SPX.IDX| 1998-01-01| 01-Jan-1999|1229.23| 1999| 1998|
+-------+------------------+------------------+-------+----------------+------------------------+
After this result, I switched to java.time conversions with a UDF like this, which worked for me:
// now we will create a UDF that uses the very nice java.time library to properly convert the silly stockmarket dates
// start by importing the specific java.time libraries that superceded the joda.time ones
import java.time.LocalDate
import java.time.format.DateTimeFormatter
// now define a specific data conversion function we want
def fromEODDate (YourStringDate: String): String = {
val formatter = DateTimeFormatter.ofPattern("dd-MMM-yyyy")
var retDate = LocalDate.parse(YourStringDate, formatter)
// this should return a proper yyyy-MM-dd date from the silly dd-MMM-yyyy formats
// now we format this true local date with a formatter to the desired yyyy-MM-dd format
val retStringDate = retDate.format(DateTimeFormatter.ISO_LOCAL_DATE)
return(retStringDate)
}
Now I register it as a function for use in sql:
sqlContext.udf.register("fromEODDate", fromEODDate(_:String))
and check the results, and rerun test:
val results = sqlContext.sql("""
Select
e.Symbol as Symbol
, e.Date as OrigStringDate
, Cast(fromEODDate(e.Date) as Date) as ConvertedDate
, e.Open
, e.High
, e.Low
, e.Close
from eoddata e
order by Cast(fromEODDate(e.Date) as Date)
""")
results.printSchema()
results.cache()
results.registerTempTable("results")
results.show(10)
results: org.apache.spark.sql.DataFrame = [Symbol: string, OrigStringDate: string, ConvertedDate: date, Open: string, High: string, Low: string, Close: string]
root
|-- Symbol: string (nullable = true)
|-- OrigStringDate: string (nullable = true)
|-- ConvertedDate: date (nullable = true)
|-- Open: string (nullable = true)
|-- High: string (nullable = true)
|-- Low: string (nullable = true)
|-- Close: string (nullable = true)
res79: results.type = [Symbol: string, OrigStringDate: string, ConvertedDate: date, Open: string, High: string, Low: string, Close: string]
+--------+--------------+-------------+-------+-------+-------+-------+
| Symbol|OrigStringDate|ConvertedDate| Open| High| Low| Close|
+--------+--------------+-------------+-------+-------+-------+-------+
|ADVA.IDX| 01-Jan-1996| 1996-01-01| 364| 364| 364| 364|
|ADVN.IDX| 01-Jan-1996| 1996-01-01| 1527| 1527| 1527| 1527|
|ADVQ.IDX| 01-Jan-1996| 1996-01-01| 1283| 1283| 1283| 1283|
|BANK.IDX| 01-Jan-1996| 1996-01-01|1009.41|1009.41|1009.41|1009.41|
| BKX.IDX| 01-Jan-1996| 1996-01-01| 39.39| 39.39| 39.39| 39.39|
|COMP.IDX| 01-Jan-1996| 1996-01-01|1052.13|1052.13|1052.13|1052.13|
| CPR.IDX| 01-Jan-1996| 1996-01-01| 1.261| 1.261| 1.261| 1.261|
|DECA.IDX| 01-Jan-1996| 1996-01-01| 205| 205| 205| 205|
|DECN.IDX| 01-Jan-1996| 1996-01-01| 825| 825| 825| 825|
|DECQ.IDX| 01-Jan-1996| 1996-01-01| 754| 754| 754| 754|
+--------+--------------+-------------+-------+-------+-------+-------+
only showing top 10 rows
which looks ok, and I rerun my chart to see if there are errors/spikes.
As you can see, no more spikes or errors. I now use a UDF as I've shown to apply my date format transformations to a standard yyyy-MM-dd format, and have not had spurious errors since. :-)
You could simply do df.withColumn("date", date_format(col("string"), "yyyy-MM-dd HH:mm:ss.SSSSSS")).show()
dateid is an int column that contains the date in yyyyMMdd format
spark.sql("SELECT from_unixtime(unix_timestamp(cast(dateid as varchar(10)), 'yyyyMMdd'), 'yyyy-MM-dd') from XYZ").show(50, false)
Find the below-mentioned code; it might be helpful for you.
val stringDate = spark.sparkContext.parallelize(Seq("12/16/2019")).toDF("StringDate")
val dateConversion = stringDate.withColumn("dateColumn", to_date(unix_timestamp($"StringDate", "MM/dd/yyyy").cast("Timestamp")))
dateConversion.show(false)
+----------+----------+
|StringDate|dateColumn|
+----------+----------+
|12/16/2019|2019-12-16|
+----------+----------+
This works in Spark SQL:
TO_DATE(date_string_or_column, 'yyyy-MM-dd') AS date_column_name. You can replace the second argument with however your date string is formatted, e.g. yyyy/MM/dd. The return type is date.
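A quick check of that form from PySpark (the literal and format here are only an example):
spark.sql("SELECT TO_DATE('2016/08/26', 'yyyy/MM/dd') AS date_column_name").show()
# should show 2016-08-26, with date_column_name of type date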
Use the function below in PySpark to convert a column's datatype into your required datatype.
Here I'm converting all date columns into timestamp columns.
from pyspark.sql.functions import col

def change_dtype(df):
    for name, dtype in df.dtypes:
        if dtype == "date":
            df = df.withColumn(name, col(name).cast('timestamp'))
    return df
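Hypothetical usage, assuming df has one or more date columns:
df = change_dtype(df)
df.printSchema()  # the date columns now show up as timestamp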
When you try to change a string data type to date format, when the string data is in the 'dd/MM/yyyy' format (with slashes) and you are using a Spark version greater than 3.0, it converts the value to null.
In order for that to work you can set the Spark configuration property below, which will allow you to get the output that you want.
spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")
and then we can use the below code to get the output that we want
df.withColumn("tx_date", to_date(unix_timestamp($"date", "dd/MM/yyyy").cast("timestamp")))
The solution proposed above by Sai Kiriti Badam worked for me.
I'm using Azure Databricks to read data captured from an EventHub. This contains a string column named EnqueuedTimeUtc with the following format...
12/7/2018 12:54:13 PM
I'm using a Python notebook and used the following...
import pyspark.sql.functions as func
sports_messages = sports_df.withColumn("EnqueuedTimestamp", func.to_timestamp("EnqueuedTimeUtc", "MM/dd/yyyy hh:mm:ss aaa"))
... to create a new column EnqueuedTimestamp of type "timestamp" with data in the following format...
2018-12-07 12:54:13
