I am exploring different date formats and trying to convert between them. Currently, I'm stuck in a scenario where I have input dates and times as below:
I was able to combine the date and the time into a single date-time string using concatenation:
concat_ws(' ',new_df.transaction_date,new_df.Transaction_Time)
However, when I use
withColumn("date_time2", F.to_date(col('date_time'), "MMM d yyyy hh:mmaa"))
with ('spark.sql.legacy.timeParserPolicy', 'LEGACY') set, the result is displayed as 'undefined'.
I am looking for pointers or code snippets to extract YYYY-MM-DD HH:MM:SS in CET (the input is in PST), as below:
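+---------------------------+-------------------+
|input_date_time            |output (in CET)    |
+---------------------------+-------------------+
|Mar 1, 2022 01:00:00 PM PST|2022-03-01 22:00:00|
+---------------------------+-------------------+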
Parse the PST string into a timestamp (Spark displays it in UTC here), then convert it to "CET" time:
import pyspark.sql.functions as F
df = spark.createDataFrame(data=[["Mar 1, 2022 01:00:00 PM PST"]], schema=["input_date_time_pst"])
df = df.withColumn("input_date_time_pst", F.to_timestamp("input_date_time_pst", format="MMM d, yyyy hh:mm:ss a z"))
df = df.withColumn("output_cet", F.from_utc_timestamp("input_date_time_pst", "CET"))
df.show(truncate=False)
[Out]:
+-------------------+-------------------+
|input_date_time_pst|output_cet         |
+-------------------+-------------------+
|2022-03-01 21:00:00|2022-03-01 22:00:00|
+-------------------+-------------------+
Note - The 2022-03-01 21:00:00 above is Mar 1, 2022 01:00:00 PM PST displayed in UTC.
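The arithmetic behind the two columns can be checked outside Spark. The following is a minimal sketch using only the Python standard library, not the Spark code path. It assumes fixed offsets of UTC-8 for PST and UTC+1 for CET, which hold for 1 March 2022 but not for dates when daylight saving time is in effect.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets, valid for 1 March 2022 (no daylight saving in effect).
PST = timezone(timedelta(hours=-8), "PST")
CET = timezone(timedelta(hours=1), "CET")

# Parse the wall-clock part; the trailing "PST" is matched as literal text.
naive = datetime.strptime("Mar 1, 2022 01:00:00 PM PST", "%b %d, %Y %I:%M:%S %p PST")
pst_time = naive.replace(tzinfo=PST)

# Same instant, shown in UTC and in CET.
utc_time = pst_time.astimezone(timezone.utc)
cet_time = pst_time.astimezone(CET)

print(utc_time.strftime("%Y-%m-%d %H:%M:%S"))  # 2022-03-01 21:00:00
print(cet_time.strftime("%Y-%m-%d %H:%M:%S"))  # 2022-03-01 22:00:00
```

The two printed values match the input_date_time_pst and output_cet columns above.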
I have a string field with datetime in this format:
Thu Jul 01 09:26:47 UTC 2021
I have gone through spark documentation for date_format function and tried this one:
E L dd hh:mm:ss UTC yyyy
I still get NULL and can't parse the string into a datetime. Any help would be appreciated.
df.select('Created Time', F.to_timestamp(F.col('Created Time'), 'E L dd hh:mm:ss UTC yyyy')).show(1, False)
You need to escape UTC and consider it as "text", not part of the date pattern.
from pyspark.sql.functions import unix_timestamp
df = spark.createDataFrame(["Thu Jul 01 09:26:47 UTC 2021"], "string").toDF("TS_String")
df\
.withColumn("TS_Converted", unix_timestamp(df.TS_String, "E MMM d h:m:s 'UTC' yyyy")\
.cast("timestamp"))\
.show()
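The same idea, treating "UTC" as fixed text rather than as a pattern field, can be illustrated with plain Python. This is only an analogy using the standard library's strptime, where any characters that are not % directives are matched literally. It is not Spark's pattern syntax, and attaching timezone.utc afterwards is an assumption based on the literal text.

```python
from datetime import datetime, timezone

raw = "Thu Jul 01 09:26:47 UTC 2021"

# "UTC" appears in the format as literal text, so it is matched but not interpreted.
parsed = datetime.strptime(raw, "%a %b %d %H:%M:%S UTC %Y")

# Assumption: since the literal says UTC, attach the UTC zone explicitly.
parsed = parsed.replace(tzinfo=timezone.utc)

print(parsed.isoformat())  # 2021-07-01T09:26:47+00:00
```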
I want to convert a string field to timestamp with Impala.
I tried using this expression which works fine in Hive:
select from_unixtime(unix_timestamp('Fri Mar 02 00:00:00 GMT 2018','EEE MMM dd HH:mm:ss z yyyy')) AS Date_Call_Year;
but I get the error "Bad date/time conversion format: EEE MMM dd HH:mm:ss z yyyy". What is wrong with this format?
I have a date string from a source in the format 'Fri May 24 00:00:00 BST 2019' that I want to convert to a date and store in my dataframe as '2019-05-24'. The code in my example below works for me under Spark 2.0:
from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime
df = spark.createDataFrame([("Fri May 24 00:00:00 BST 2019",)], ['date_str'])
df2 = df.select('date_str', to_date(from_unixtime(unix_timestamp('date_str', 'EEE MMM dd HH:mm:ss zzz yyyy'))).alias('date'))
df2.show(1, False)
In my sandbox environment I've updated to Spark 3.0 and now get the following error for the above code. Is there a new method in 3.0 for converting my string to a date?
: org.apache.spark.SparkUpgradeException: You may get a different
result due to the upgrading of Spark 3.0: Fail to recognize 'EEE MMM
dd HH:mm:ss zzz yyyy' pattern in the DateTimeFormatter.
You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the
behavior before Spark 3.0.
You can form a valid datetime pattern with the guide from
https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
If you want to use the legacy format in a newer version of Spark (3.0 or later), you need to set spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY") or
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY"), which will resolve the issue.
Thanks for the responses, excellent advice. For the moment I'll be going with the LEGACY setting. I have a workaround for Spark 3.0 that substrings out the EEE element, but I've noticed a bug: the BST timezone converts incorrectly, offsetting by 10 hours, while under LEGACY the time correctly remains the same, as I'm currently in the BST zone. I can do something with this but will wait until the clocks change in the autumn to confirm.
from pyspark.sql.functions import to_timestamp
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
df = spark.createDataFrame([('Fri May 24 00:00:00 BST 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime'),
to_timestamp(df.mydate, 'EEE MMM dd HH:mm:ss zzz yyyy').alias('LEGACYdatetime')
).show(1, False)
df = spark.createDataFrame([('Fri May 24 00:00:00 GMT 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime'),
to_timestamp(df.mydate, 'EEE MMM dd HH:mm:ss zzz yyyy').alias('LEGACYdatetime')
).show(1, False)
spark.sql("set spark.sql.legacy.timeParserPolicy=CORRECTED")
df = spark.createDataFrame([('Fri May 24 00:00:00 BST 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime')
).show(1, False)
df = spark.createDataFrame([('Fri May 24 00:00:00 GMT 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime')
).show(1, False)
+----------------------------+-------------------+-------------------+
|mydate                      |datetime           |LEGACYdatetime     |
+----------------------------+-------------------+-------------------+
|Fri May 24 00:00:00 BST 2019|2019-05-24 00:00:00|2019-05-24 00:00:00|
+----------------------------+-------------------+-------------------+
+----------------------------+-------------------+-------------------+
|mydate                      |datetime           |LEGACYdatetime     |
+----------------------------+-------------------+-------------------+
|Fri May 24 00:00:00 GMT 2019|2019-05-24 01:00:00|2019-05-24 01:00:00|
+----------------------------+-------------------+-------------------+
+----------------------------+-------------------+
|mydate                      |datetime           |
+----------------------------+-------------------+
|Fri May 24 00:00:00 BST 2019|2019-05-23 14:00:00|
+----------------------------+-------------------+
+----------------------------+-------------------+
|mydate                      |datetime           |
+----------------------------+-------------------+
|Fri May 24 00:00:00 GMT 2019|2019-05-24 01:00:00|
+----------------------------+-------------------+
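One possible explanation for the 10-hour shift, which the post does not confirm: the abbreviation "BST" is ambiguous. Besides British Summer Time (UTC+1) it is also used for zones such as Bougainville Standard Time (UTC+11). The sketch below uses only the Python standard library with fixed offsets and assumes the session displays times at UTC+1 (London in May). It shows that reading the wall clock as UTC+11 reproduces the 2019-05-23 14:00:00 value in the CORRECTED output. Which zone Spark's parser actually picked is an assumption here.

```python
from datetime import datetime, timedelta, timezone

# Two zones that share the abbreviation "BST" (fixed offsets, assumed for May 2019).
british_summer = timezone(timedelta(hours=1), "BST")
bougainville = timezone(timedelta(hours=11), "BST")

wall_clock = datetime(2019, 5, 24, 0, 0, 0)

def shown_in_london(tz):
    """Interpret the wall clock in tz, then display it at UTC+1 (London in May)."""
    instant = wall_clock.replace(tzinfo=tz)
    return instant.astimezone(british_summer).strftime("%Y-%m-%d %H:%M:%S")

print(shown_in_london(british_summer))  # 2019-05-24 00:00:00
print(shown_in_london(bougainville))    # 2019-05-23 14:00:00
```

If this is the cause, replacing the abbreviation with an unambiguous region ID or a numeric offset before parsing would avoid the ambiguity.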
The difference between the legacy parser and the one in the current version of Spark is subtle.
For example:
from pyspark.sql.functions import col, to_timestamp
spark.sql("set spark.sql.legacy.timeParserPolicy=EXCEPTION")
df = spark.createDataFrame([('12/25/2019 01:30:00 PM',),], ['Christmas'])
df.select(to_timestamp(col('Christmas'),'MM/dd/yyyy hh:mm:ss a')).show()
Outputs the following:
+----------------------------------------------+
|to_timestamp(Christmas, MM/dd/yyyy hh:mm:ss a)|
+----------------------------------------------+
| 2019-12-25 13:30:00|
+----------------------------------------------+
However
spark.sql("set spark.sql.legacy.timeParserPolicy=EXCEPTION")
df = spark.createDataFrame([('12/25/2019 01:30:00 PM',),], ['Christmas'])
df.select(to_timestamp(col('Christmas'),'MM/dd/yyyy hh:mm:ss aa')).show()
This raises a SparkUpgradeException.
Notice that the time format contains 'aa' rather than a single 'a'.
According to the Java docs for the formatter that the to_timestamp function relies on, 'aa' was always wrong; I guess the earlier version of Spark was more lenient.
So either fix the date formats or set the timeParserPolicy to 'LEGACY' as Shivam suggested.
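If many existing patterns contain a doubled AM/PM marker, the fix can be applied mechanically. Below is a minimal sketch of a hypothetical helper, normalize_ampm. It is not part of Spark or PySpark. It collapses runs of 'a' into a single 'a' while leaving single-quoted literal text untouched, and it assumes quotes in the pattern are balanced.

```python
import re

def normalize_ampm(pattern: str) -> str:
    """Collapse runs of 'a' outside single-quoted literals into one 'a'."""
    parts = pattern.split("'")
    # After splitting on quotes, even-indexed parts lie outside quoted literals.
    for i in range(0, len(parts), 2):
        parts[i] = re.sub(r"a{2,}", "a", parts[i])
    return "'".join(parts)

print(normalize_ampm("MM/dd/yyyy hh:mm:ss aa"))  # MM/dd/yyyy hh:mm:ss a
print(normalize_ampm("hh 'aaa' aa"))             # hh 'aaa' a
```

The cleaned pattern string can then be passed to to_timestamp in place of the original one.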
I face this challenge: I have a complex date format that comes as a string, so I use the unix_timestamp function to parse it.
However, I cannot find the proper pattern to use. I do not know the proper pattern letters for the timezone, day of week and month, and I have not found a single link that clarifies them.
from pyspark.sql.functions import unix_timestamp
d = spark.createDataFrame([(1,"Mon Jan 14 11:43:20 EET 2019"),\
(2,"Wed Jun 27 16:26:46 EEST 2018")],\
["id","time_str"])
pattern = "aaa bbb dd HH:mm:ss ZZZ yyyy"
d= d.withColumn("timestampCol", unix_timestamp(d["time_str"], pattern).cast("timestamp"))
d.show()
>>>
+---+------------------------------+------------+
|id |time_str                      |timestampCol|
+---+------------------------------+------------+
|1  |Mon Jan 14 11:43:20 EET 2019  |null        |
|2  |Wed Jun 27 16:26:46 EEST 2018 |null        |
+---+------------------------------+------------+
Does someone know how to correctly convert these strings to timestamps?
You can try the following code:
from pyspark.sql.functions import col, unix_timestamp
d = spark.createDataFrame([(1,"Mon Jan 14 11:43:20 EET 2019"),\
(2,"Wed Jun 27 16:26:46 EEST 2018")],\
["id","time_str"])
pattern = "EEE MMM dd HH:mm:ss z yyyy"
d.withColumn("timestamp", unix_timestamp(col("time_str"), pattern).cast("timestamp")).show(truncate=False)
It produces the output below. For further documentation, refer to https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html, which I used to look up EEE and MMM. These are required because the weekday and the month each appear as three characters.
+---+-----------------------------+-------------------+
|id |time_str                     |timestamp          |
+---+-----------------------------+-------------------+
|1  |Mon Jan 14 11:43:20 EET 2019 |2019-01-14 09:43:20|
|2  |Wed Jun 27 16:26:46 EEST 2018|2018-06-27 13:26:46|
+---+-----------------------------+-------------------+
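The timestamps in the output are shifted from the input by two and three hours, which is what you would expect if the result is displayed in UTC: EET is UTC+2 and EEST is UTC+3. The sketch below checks that arithmetic with the Python standard library only. The OFFSETS table and the to_utc helper are hypothetical names introduced for this illustration, and the display timezone being UTC is an assumption inferred from the output.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets for the two abbreviations used in the sample data.
OFFSETS = {"EET": timedelta(hours=2), "EEST": timedelta(hours=3)}

def to_utc(time_str: str) -> str:
    """Parse 'EEE MMM dd HH:mm:ss z yyyy'-style text and render it in UTC."""
    parts = time_str.split()
    abbr = parts[4]  # the timezone abbreviation, e.g. "EET"
    # Parse everything except the abbreviation as a naive wall-clock time.
    naive = datetime.strptime(" ".join(parts[:4] + parts[5:]), "%a %b %d %H:%M:%S %Y")
    aware = naive.replace(tzinfo=timezone(OFFSETS[abbr], abbr))
    return aware.astimezone(timezone.utc).strftime("%Y-%m-%d %H:%M:%S")

print(to_utc("Mon Jan 14 11:43:20 EET 2019"))   # 2019-01-14 09:43:20
print(to_utc("Wed Jun 27 16:26:46 EEST 2018"))  # 2018-06-27 13:26:46
```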