Convert StringType to TimeStamp on Pyspark - apache-spark

How can I convert a column with string values in this format "Dec 25 2022 6:31AM" to Timestamp?
No matter what I do, I still get null values in the new column.
I've tried:
import pyspark.sql.functions as fn
df.withColumn('new_ts', fn.col('SendTime').cast("timestamp"))
df.withColumn("new_ts",fn.to_timestamp(fn.col("SendTime")).cast('string'))
df.withColumn('new_ts', (fn.to_timestamp('SendTime', 'yyyy-MM-dd HH:mm:ss.SSS-0300')).cast('date'))
among other attempts.

You were close; to_timestamp is the correct function in your case, but you need to fix your datetime pattern.
I was able to figure out something like this:
import pyspark.sql.functions as F
data1 = [
    ["Dec 25 2022 6:31AM"],
    ["Nov 11 2022 02:31AM"],
    ["Jun 03 2022 08:31PM"]
]
df = spark.createDataFrame(data1).toDF("time")
tmp = df.withColumn("test", F.to_timestamp(F.col("time"), "MMM dd yyyy h:mma"))
tmp.show(truncate=False)
And the output is:
+-------------------+-------------------+
|time |test |
+-------------------+-------------------+
|Dec 25 2022 6:31AM |2022-12-25 06:31:00|
|Nov 11 2022 02:31AM|2022-11-11 02:31:00|
|Jun 03 2022 08:31PM|2022-06-03 20:31:00|
+-------------------+-------------------+
So I think you may try this format: MMM dd yyyy h:mma
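Applied to your SendTime column (the name taken from your attempts above), a minimal sketch would be:
import pyspark.sql.functions as fn

# Parse the "Dec 25 2022 6:31AM" style strings with the pattern above
df = df.withColumn("new_ts", fn.to_timestamp(fn.col("SendTime"), "MMM dd yyyy h:mma"))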

The to_timestamp() function in Apache PySpark is commonly used to convert a string to a timestamp (TimestampType). If no pattern is supplied, it falls back to Spark's cast rules, which expect ISO-style input such as "yyyy-MM-dd HH:mm:ss", and any value that does not match the given (or default) pattern returns null.
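A minimal sketch of that behaviour (the column name is just for illustration): without an explicit pattern, only the ISO-style string parses and the other row comes back null.
import pyspark.sql.functions as F

df = spark.createDataFrame([("2022-12-25 06:31:00",), ("Dec 25 2022 6:31AM",)], ["raw"])
# The first row matches the default pattern and parses; the second returns null
df.withColumn("ts", F.to_timestamp("raw")).show(truncate=False)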

Related

Extract YYYY-MM-DD HH:MM:SS and convert to a different time zone

I am exploring different date formats and trying to convert between them. Currently I'm stuck in a scenario where I have input dates and times as below:
I was able to convert them to a single datetime string using concatenation:
concat_ws(' ', new_df.transaction_date, new_df.Transaction_Time)
While I'm trying to use
withColumn("date_time2", F.to_date(col('date_time'), "MMM d yyyy hh:mmaa")) with ('spark.sql.legacy.timeParserPolicy','LEGACY')
it is displayed as 'undefined'.
I am looking for pointers/code snippets to extract YYYY-MM-DD HH:MM:SS in CET (input is in PST) as below
input_date_time               output (in CET)
Mar 1, 2022 01:00:00 PM PST   2022-03-01 22:00:00
Parse the PST string to a timestamp (stored as UTC), then convert it to "CET" time:
import pyspark.sql.functions as F
df = spark.createDataFrame(data=[["Mar 1, 2022 01:00:00 PM PST"]], schema=["input_date_time_pst"])
df = df.withColumn("input_date_time_pst", F.to_timestamp("input_date_time_pst", format="MMM d, yyyy hh:mm:ss a z"))
df = df.withColumn("output_cet", F.from_utc_timestamp("input_date_time_pst", "CET"))
[Out]:
+-------------------+-------------------+
|input_date_time_pst|output_cet |
+-------------------+-------------------+
|2022-03-01 21:00:00|2022-03-01 22:00:00|
+-------------------+-------------------+
Note - The 2022-03-01 21:00:00 above is Mar 1, 2022 01:00:00 PM PST displayed in UTC.
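How the parsed timestamp is rendered by show() depends on spark.sql.session.timeZone. To reproduce the UTC display above on a cluster in a different zone, one option (a sketch) is to pin the session time zone before showing the result:
# Display TimestampType values in UTC instead of the driver's local zone
spark.conf.set("spark.sql.session.timeZone", "UTC")
df.show(truncate=False)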

Need Pyspark String to Date Conversion Pattern

I have a string field with datetime in this format:
Thu Jul 01 09:26:47 UTC 2021
I have gone through spark documentation for date_format function and tried this one:
E L dd hh:mm:ss UTC yyyy
Still I get NULL and can't parse the string into a datetime. Any help would be appreciated.
df.select('Created Time', F.to_timestamp(F.col('Created Time'), 'E L dd hh:mm:ss UTC yyyy')) \
    .show(1, False)
You need to escape UTC and treat it as literal "text", not as part of the date pattern.
from pyspark.sql.functions import unix_timestamp

df = spark.createDataFrame(["Thu Jul 01 09:26:47 UTC 2021"], "string").toDF("TS_String")
df \
    .withColumn("TS_Converted",
                unix_timestamp(df.TS_String, "E MMM d h:m:s 'UTC' yyyy").cast("timestamp")) \
    .show()
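One caveat: the Spark 3+ datetime pattern guide lists 'E' as a formatting-only symbol, so parsing with it can fail under the default (CORRECTED) parser. A hedged workaround for Spark 3+ is to skip the weekday prefix instead of parsing it, for example:
from pyspark.sql.functions import to_timestamp

# Sketch for Spark 3+: drop the leading "Thu " and keep UTC as escaped literal text
df.withColumn(
    "TS_Converted",
    to_timestamp(df.TS_String.substr(5, 24), "MMM dd HH:mm:ss 'UTC' yyyy")
).show(truncate=False)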

Impala convert string to timestamp in format 'Fri Mar 02 00:00:00 GMT 2018'

I want to convert a string field to timestamp with Impala.
I tried using this expression which works fine in Hive:
select from_unixtime(unix_timestamp('Fri Mar 02 00:00:00 GMT 2018','EEE MMM dd HH:mm:ss z yyyy')) AS Date_Call_Year;
but I get the error "Bad date/time conversion format: EEE MMM dd HH:mm:ss z yyyy". What is wrong with this format?

String to Date migration from Spark 2.0 to 3.0 gives Fail to recognize 'EEE MMM dd HH:mm:ss zzz yyyy' pattern in the DateTimeFormatter

I have a date string from a source in the format 'Fri May 24 00:00:00 BST 2019' that I want to convert to a date and store in my dataframe as '2019-05-24', using code like the example below, which works for me under Spark 2.0:
from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime
df = spark.createDataFrame([("Fri May 24 00:00:00 BST 2019",)], ['date_str'])
df2 = df.select('date_str', to_date(from_unixtime(unix_timestamp('date_str', 'EEE MMM dd HH:mm:ss zzz yyyy'))).alias('date'))
df2.show(1, False)
In my sandbox environment I've updated to Spark 3.0 and now get the following error for the above code. Is there a new method of doing this in 3.0 to convert my string to a date?
: org.apache.spark.SparkUpgradeException: You may get a different
result due to the upgrading of Spark 3.0: Fail to recognize 'EEE MMM
dd HH:mm:ss zzz yyyy' pattern in the DateTimeFormatter.
You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the
behavior before Spark 3.0.
You can form a valid datetime pattern with the guide from
https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
If you want to use the legacy format in a newer version of Spark (3.0+), you need to set spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY") or run
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY"), which will resolve the issue.
Thanks for the responses, excellent advice; for the moment I'll be going with the LEGACY setting. I have a workaround for Spark 3.0 by substringing out the EEE element, but I've noticed a bug where the BST timezone converts incorrectly, offsetting by 10 hours, while under LEGACY it correctly remains the same, as I'm currently in the BST zone. I can do something with this, but will wait till the clocks change in the autumn to confirm.
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
df = spark.createDataFrame([('Fri May 24 00:00:00 BST 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime'),
to_timestamp(df.mydate, 'EEE MMM dd HH:mm:ss zzz yyyy').alias('LEGACYdatetime')
).show(1, False)
df = spark.createDataFrame([('Fri May 24 00:00:00 GMT 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime'),
to_timestamp(df.mydate, 'EEE MMM dd HH:mm:ss zzz yyyy').alias('LEGACYdatetime')
).show(1, False)
spark.sql("set spark.sql.legacy.timeParserPolicy=CORRECTED")
df = spark.createDataFrame([('Fri May 24 00:00:00 BST 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime')
).show(1, False)
df = spark.createDataFrame([('Fri May 24 00:00:00 GMT 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime')
).show(1, False)
+----------------------------+-------------------+-------------------+
|mydate |datetime |LEGACYdatetime |
+----------------------------+-------------------+-------------------+
|Fri May 24 00:00:00 BST 2019|2019-05-24 00:00:00|2019-05-24 00:00:00|
+----------------------------+-------------------+-------------------+
+----------------------------+-------------------+-------------------+
|mydate |datetime |LEGACYdatetime |
+----------------------------+-------------------+-------------------+
|Fri May 24 00:00:00 GMT 2019|2019-05-24 01:00:00|2019-05-24 01:00:00|
+----------------------------+-------------------+-------------------+
+----------------------------+-------------------+
|mydate |datetime |
+----------------------------+-------------------+
|Fri May 24 00:00:00 BST 2019|2019-05-23 14:00:00|
+----------------------------+-------------------+
+----------------------------+-------------------+
|mydate |datetime |
+----------------------------+-------------------+
|Fri May 24 00:00:00 GMT 2019|2019-05-24 01:00:00|
+----------------------------+-------------------+
The difference between the LEGACY parser and the current version of Spark is subtle.
For example:
spark.sql("set spark.sql.legacy.timeParserPolicy=EXCEPTION")
df = spark.createDataFrame([('12/25/2019 01:30:00 PM',),], ['Christmas'])
df.select(to_timestamp(col('Christmas'),'MM/dd/yyyy hh:mm:ss a')).show()
Outputs the following:
+----------------------------------------------+
|to_timestamp(Christmas, MM/dd/yyyy hh:mm:ss a)|
+----------------------------------------------+
| 2019-12-25 13:30:00|
+----------------------------------------------+
However
spark.sql("set spark.sql.legacy.timeParserPolicy=EXCEPTION")
df = spark.createDataFrame([('12/25/2019 01:30:00 PM',),], ['Christmas'])
df.select(to_timestamp(col('Christmas'),'MM/dd/yyyy hh:mm:ss aa')).show()
Will raise a SparkUpgradeException
Notice we have 'aa' in the time format, not just a single 'a'.
According to the Java docs for the formatter that to_timestamp uses, 'aa' was always wrong; I guess the earlier version of Spark was just more lenient.
So either fix the date formats or set the timeParserPolicy to 'LEGACY' as Shivam suggested.

Pyspark parse custom date format

I face this challenge: I have a complex date format that comes as a string, so I use the unix_timestamp function to parse it.
However, I cannot find the proper pattern to use. I do not know the proper abbreviations for the time zone, day of week, and month, and I have not found a single link that clarifies them.
from pyspark.sql.functions import unix_timestamp

d = spark.createDataFrame([(1, "Mon Jan 14 11:43:20 EET 2019"),
                           (2, "Wed Jun 27 16:26:46 EEST 2018")],
                          ["id", "time_str"])
pattern = "aaa bbb dd HH:mm:ss ZZZ yyyy"
d = d.withColumn("timestampCol", unix_timestamp(d["time_str"], pattern).cast("timestamp"))
d.show()
>>>
+---+------------------------------+------------+
|id |time_str |timestampCol|
+---+------------------------------+------------+
|1 |Mon Jan 14 11:43:20 EET 2019 |null |
|2 |Wed Jun 27 16:26:46 EEST 2018 |null |
+---+------------------------------+------------+
Does someone know how to correctly convert these strings to timestamps?
You can try the following code:
from pyspark.sql.functions import unix_timestamp, col

d = spark.createDataFrame([(1, "Mon Jan 14 11:43:20 EET 2019"),
                           (2, "Wed Jun 27 16:26:46 EEST 2018")],
                          ["id", "time_str"])
pattern = "EEE MMM dd HH:mm:ss z yyyy"
d.withColumn("timestamp", unix_timestamp(col("time_str"), pattern).cast("timestamp")).show(truncate=False)
It produces the output below. For further documentation you can refer to https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html, which I used to look up EEE and MMM; these are required because you have three characters for the weekday and the month respectively.
+---+-----------------------------+-------------------+
|id |time_str |timestamp |
+---+-----------------------------+-------------------+
|1 |Mon Jan 14 11:43:20 EET 2019 |2019-01-14 09:43:20|
|2 |Wed Jun 27 16:26:46 EEST 2018|2018-06-27 13:26:46|
+---+-----------------------------+-------------------+
