Pyspark parse custom date format - apache-spark

I face this challenge: I have a complex date format that comes as a string, so I use the unix_timestamp function to parse it.
However, I cannot find the proper pattern to use. I do not know the correct pattern letters for time zone, day of week, and month, and I have not found a single link that clarifies them.
from pyspark.sql.functions import unix_timestamp
d = spark.createDataFrame([(1,"Mon Jan 14 11:43:20 EET 2019"),\
(2,"Wed Jun 27 16:26:46 EEST 2018")],\
["id","time_str"])
pattern = "aaa bbb dd HH:mm:ss ZZZ yyyy"
d= d.withColumn("timestampCol", unix_timestamp(d["time_str"], pattern).cast("timestamp"))
d.show()
>>>
+---+------------------------------+------------+
|id |time_str |timestampCol|
+---+------------------------------+------------+
|1 |Mon Jan 14 11:43:20 EET 2019 |null |
|2 |Wed Jun 27 16:26:46 EEST 2018 |null |
+---+------------------------------+------------+
Does someone know how to correctly convert this string to timestamps?

You can try the following code:
from pyspark.sql.functions import *
d = spark.createDataFrame([(1,"Mon Jan 14 11:43:20 EET 2019"),\
(2,"Wed Jun 27 16:26:46 EEST 2018")],\
["id","time_str"])
pattern = "EEE MMM dd HH:mm:ss z yyyy"
d.withColumn("timestamp", unix_timestamp(col("time_str"), pattern).cast("timestamp")).show(truncate=False)
It produces the output below. For further documentation you can refer to https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html, which I used to look up EEE and MMM; those letters are required because you have three characters for the weekday and the month respectively.
+---+-----------------------------+-------------------+
|id |time_str |timestamp |
+---+-----------------------------+-------------------+
|1 |Mon Jan 14 11:43:20 EET 2019 |2019-01-14 09:43:20|
|2 |Wed Jun 27 16:26:46 EEST 2018|2018-06-27 13:26:46|
+---+-----------------------------+-------------------+
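As a side note, since Spark 2.2 the same conversion can be written in one step with to_timestamp, skipping the intermediate unix seconds. A minimal equivalent sketch (note that on Spark 3.0+ a pattern containing EEE raises a SparkUpgradeException unless spark.sql.legacy.timeParserPolicy is set to LEGACY, as discussed further down this page):
from pyspark.sql.functions import to_timestamp

# Equivalent one-step conversion (to_timestamp is available since Spark 2.2).
d.withColumn(
    "timestampCol",
    to_timestamp(d["time_str"], "EEE MMM dd HH:mm:ss z yyyy")
).show(truncate=False)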

Related

Convert StringType to TimeStamp on Pyspark

How can I convert a column with string values in this format "Dec 25 2022 6:31AM" to Timestamp?
No matter what I do, I still get null values in the new column.
I've tried:
import pyspark.sql.functions as fn
df.withColumn('new_ts', fn.col('SendTime').cast("timestamp"))
df.withColumn("new_ts",fn.to_timestamp(fn.col("SendTime")).cast('string'))
df.withColumn('new_ts', (fn.to_timestamp('SendTime', 'yyyy-MM-dd HH:mm:ss.SSS-0300')).cast('date'))
among other attempts.
You were close: to_timestamp is the correct function in your case, but you need to fix your datetime pattern.
I was able to figure out something like this:
import pyspark.sql.functions as F
data1 = [
["Dec 25 2022 6:31AM"],
["Nov 11 2022 02:31AM"],
["Jun 03 2022 08:31PM"]
]
df = spark.createDataFrame(data1).toDF("time")
tmp = df.withColumn("test", F.to_timestamp(F.col("time"), "MMM dd yyyy h:mma"))
tmp.show(truncate = False)
And the output is:
+-------------------+-------------------+
|time |test |
+-------------------+-------------------+
|Dec 25 2022 6:31AM |2022-12-25 06:31:00|
|Nov 11 2022 02:31AM|2022-11-11 02:31:00|
|Jun 03 2022 08:31PM|2022-06-03 20:31:00|
+-------------------+-------------------+
So I think you may try to use this format: MMM dd yyyy h:mma
The to_timestamp() function in PySpark is commonly used to convert a string to TimestampType. Without an explicit pattern it expects an ISO-like format such as "yyyy-MM-dd HH:mm:ss", and if the input is not in that form, it returns null.
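A quick sketch to illustrate that default behaviour (spark is assumed to be an active SparkSession, as elsewhere on this page): without an explicit format, to_timestamp parses ISO-like strings and returns null for anything else.
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [("2022-12-25 06:31:00",), ("Dec 25 2022 6:31AM",)], ["time"]
)
# The ISO-like first row parses; the second comes back null without a pattern.
df.withColumn("parsed", F.to_timestamp("time")).show(truncate=False)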

Extract YYYY-MM-DD HH:MM:SS and convert to a different time zone

I am exploring different date formats and trying to convert between them. Currently I'm stuck in a scenario where the date and time arrive in separate columns. I was able to combine them into one date-time string using concatenation:
concat_ws(' ',new_df.transaction_date,new_df.Transaction_Time)
But when I try
withColumn("date_time2", F.to_date(col('date_time'), "MMM d yyyy hh:mmaa")) with ('spark.sql.legacy.timeParserPolicy','LEGACY')
the result is displayed as 'undefined'.
I am looking for pointers/code snippets to extract YYYY-MM-DD HH:MM:SS in CET (the input is in PST), as below:
+---------------------------+-------------------+
|input_date_time            |output (in CET)    |
+---------------------------+-------------------+
|Mar 1, 2022 01:00:00 PM PST|2022-03-01 22:00:00|
+---------------------------+-------------------+
Parse the PST string to a timestamp, rendered in UTC. Then convert it to "CET" time:
import pyspark.sql.functions as F
df = spark.createDataFrame(data=[["Mar 1, 2022 01:00:00 PM PST"]], schema=["input_date_time_pst"])
df = df.withColumn("input_date_time_pst", F.to_timestamp("input_date_time_pst", format="MMM d, yyyy hh:mm:ss a z"))
df = df.withColumn("output_cet", F.from_utc_timestamp("input_date_time_pst", "CET"))
[Out]:
+-------------------+-------------------+
|input_date_time_pst|output_cet |
+-------------------+-------------------+
|2022-03-01 21:00:00|2022-03-01 22:00:00|
+-------------------+-------------------+
Note - The 2022-03-01 21:00:00 above is Mar 1, 2022 01:00:00 PM PST displayed in UTC.
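One caveat: how the parsed instant is displayed (and whether it reads as UTC, as above) depends on the session time zone. A minimal sketch, pinning it explicitly before running the snippet above so the result is reproducible:
# Pin the session time zone so parsed timestamps are rendered in UTC,
# making the from_utc_timestamp(..., "CET") step behave predictably.
spark.conf.set("spark.sql.session.timeZone", "UTC")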

String to Date migration from Spark 2.0 to 3.0 gives Fail to recognize 'EEE MMM dd HH:mm:ss zzz yyyy' pattern in the DateTimeFormatter

I have a date string from a source in the format 'Fri May 24 00:00:00 BST 2019' that I want to convert to a date and store in my dataframe as '2019-05-24', using code like the example below, which works for me under Spark 2.0:
from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime
df = spark.createDataFrame([("Fri May 24 00:00:00 BST 2019",)], ['date_str'])
df2 = df.select('date_str', to_date(from_unixtime(unix_timestamp('date_str', 'EEE MMM dd HH:mm:ss zzz yyyy'))).alias('date'))
df2.show(1, False)
In my sandbox environment I've updated to Spark 3.0 and now get the following error for the above code. Is there a new method of doing this in 3.0 to convert my string to a date?
: org.apache.spark.SparkUpgradeException: You may get a different
result due to the upgrading of Spark 3.0: Fail to recognize 'EEE MMM
dd HH:mm:ss zzz yyyy' pattern in the DateTimeFormatter.
You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the
behavior before Spark 3.0.
You can form a valid datetime pattern with the guide from
https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
If you want to use the legacy format in a newer version of Spark (3.0+), you need to set spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY") or
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY"), which will resolve the issue.
Thanks for the responses, excellent advice. For the moment I'll be going with the LEGACY setting. I have a workaround for Spark 3.0 by substringing out the EEE element, but I've noticed a bug in how the BST time zone converts: it is incorrectly offset by 10 hours (likely because the CORRECTED parser resolves the ambiguous 'BST' abbreviation to a different zone, such as Bougainville Standard Time at UTC+11), while under LEGACY it correctly remains the same, as I'm currently in the BST zone. I can do something with this, but will wait till the clocks change in the autumn to confirm.
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
df = spark.createDataFrame([('Fri May 24 00:00:00 BST 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime'),
to_timestamp(df.mydate, 'EEE MMM dd HH:mm:ss zzz yyyy').alias('LEGACYdatetime')
).show(1, False)
df = spark.createDataFrame([('Fri May 24 00:00:00 GMT 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime'),
to_timestamp(df.mydate, 'EEE MMM dd HH:mm:ss zzz yyyy').alias('LEGACYdatetime')
).show(1, False)
spark.sql("set spark.sql.legacy.timeParserPolicy=CORRECTED")
df = spark.createDataFrame([('Fri May 24 00:00:00 BST 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime')
).show(1, False)
df = spark.createDataFrame([('Fri May 24 00:00:00 GMT 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime')
).show(1, False)
+----------------------------+-------------------+-------------------+
|mydate |datetime |LEGACYdatetime |
+----------------------------+-------------------+-------------------+
|Fri May 24 00:00:00 BST 2019|2019-05-24 00:00:00|2019-05-24 00:00:00|
+----------------------------+-------------------+-------------------+
+----------------------------+-------------------+-------------------+
|mydate |datetime |LEGACYdatetime |
+----------------------------+-------------------+-------------------+
|Fri May 24 00:00:00 GMT 2019|2019-05-24 01:00:00|2019-05-24 01:00:00|
+----------------------------+-------------------+-------------------+
+----------------------------+-------------------+
|mydate |datetime |
+----------------------------+-------------------+
|Fri May 24 00:00:00 BST 2019|2019-05-23 14:00:00|
+----------------------------+-------------------+
+----------------------------+-------------------+
|mydate |datetime |
+----------------------------+-------------------+
|Fri May 24 00:00:00 GMT 2019|2019-05-24 01:00:00|
+----------------------------+-------------------+
The difference between the legacy and the current parser in Spark is subtle. For example:
spark.sql("set spark.sql.legacy.timeParserPolicy=EXCEPTION")
df = spark.createDataFrame([('12/25/2019 01:30:00 PM',),], ['Christmas'])
df.select(to_timestamp(col('Christmas'),'MM/dd/yyyy hh:mm:ss a')).show()
Outputs the following:
+----------------------------------------------+
|to_timestamp(Christmas, MM/dd/yyyy hh:mm:ss a)|
+----------------------------------------------+
| 2019-12-25 13:30:00|
+----------------------------------------------+
However
spark.sql("set spark.sql.legacy.timeParserPolicy=EXCEPTION")
df = spark.createDataFrame([('12/25/2019 01:30:00 PM',),], ['Christmas'])
df.select(to_timestamp(col('Christmas'),'MM/dd/yyyy hh:mm:ss aa')).show()
This will raise a SparkUpgradeException.
Notice we have 'aa' in the time format, not just one 'a'.
According to the Java docs, which describe the formatter that to_timestamp uses, 'aa' was always wrong; I guess the earlier version of Spark was just more lenient.
So either fix the date formats or set timeParserPolicy to 'LEGACY' as Shivam suggested.

spark date format MMM dd, yyyy hh:mm:ss AM to timestamp in df

I need to convert a descriptive date format from a log file, "MMM dd, yyyy hh:mm:ss AM/PM", to the Spark timestamp datatype. I tried something like below, but it is giving null.
val df = Seq(("Nov 05, 2018 02:46:47 AM"),("Nov 5, 2018 02:46:47 PM")).toDF("times")
df.withColumn("time2",date_format('times,"MMM dd, yyyy HH:mm:ss AM")).show(false)
+------------------------+-----+
|times |time2|
+------------------------+-----+
|Nov 05, 2018 02:46:47 AM|null |
|Nov 5, 2018 02:46:47 PM |null |
+------------------------+-----+
Expected output
+------------------------+--------------------------+
|times                   |time2                     |
+------------------------+--------------------------+
|Nov 05, 2018 02:46:47 AM|2018-11-05 02:46:47.000000|
|Nov 5, 2018 02:46:47 PM |2018-11-05 14:46:47.000000|
+------------------------+--------------------------+
What is the proper format for converting this? Note that dd may or may not have leading zeroes.
Here is your answer
val df = Seq(("Nov 05, 2018 02:46:47 AM"),("Nov 5, 2018 02:46:47 PM")).toDF("times")
scala> df.withColumn("times2", from_unixtime(unix_timestamp(col("times"), "MMM d, yyyy hh:mm:ss a"),"yyyy-MM-dd HH:mm:ss.SSSSSS")).show(false)
+------------------------+--------------------------+
|times |times2 |
+------------------------+--------------------------+
|Nov 05, 2018 02:46:47 AM|2018-11-05 02:46:47.000000|
|Nov 5, 2018 02:46:47 PM |2018-11-05 14:46:47.000000|
+------------------------+--------------------------+
Please use hh for the hour instead of HH if you want to parse the 12-hour format. Also, AM/PM is indicated by the pattern letter "a" while parsing.
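To see the difference in PySpark (a small sketch mirroring the Scala answer above; hh is the 1-12 clock hour that pairs with the "a" marker, while HH is the 0-23 hour and would ignore AM/PM):
from pyspark.sql.functions import to_timestamp

df = spark.createDataFrame(
    [("Nov 05, 2018 02:46:47 AM",), ("Nov 5, 2018 02:46:47 PM",)], ["times"]
)
# hh + a parses both rows correctly, turning 02:46 PM into 14:46.
df.select(
    "times", to_timestamp("times", "MMM d, yyyy hh:mm:ss a").alias("parsed")
).show(truncate=False)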
Hope this helps!!
Using to_timestamp and date_format functions
scala> df.withColumn("times2",to_timestamp('times,"MMM d, yyyy hh:mm:ss a")).show(false)
+------------------------+-------------------+
|times |times2 |
+------------------------+-------------------+
|Nov 05, 2018 02:46:47 AM|2018-11-05 02:46:47|
|Nov 5, 2018 02:46:47 PM |2018-11-05 14:46:47|
+------------------------+-------------------+
scala> df.withColumn("times2",date_format(to_timestamp('times,"MMM d, yyyy hh:mm:ss a"),"yyyy-MM-dd HH:mm:ss.SSSSSS")).show(false)
+------------------------+--------------------------+
|times |times2 |
+------------------------+--------------------------+
|Nov 05, 2018 02:46:47 AM|2018-11-05 02:46:47.000000|
|Nov 5, 2018 02:46:47 PM |2018-11-05 14:46:47.000000|
+------------------------+--------------------------+
Using SQL syntax:
select date_format(to_timestamp(ColumnTimestamp, "MM/dd/yyyy hh:mm:ss aa"), "yyyy-MM-dd") as ColumnDate
from database_name.table_name
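Note that the two-letter aa in this pattern will trip Spark 3's CORRECTED parser (see the SparkUpgradeException discussion above). A sketch of the same query with the single-letter marker, keeping the placeholder table and column names:
# Same query via spark.sql with the Spark 3-safe single "a" AM/PM marker.
spark.sql("""
    select date_format(to_timestamp(ColumnTimestamp, 'MM/dd/yyyy hh:mm:ss a'),
                       'yyyy-MM-dd') as ColumnDate
    from database_name.table_name
""").show()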
We can use split:
select date.split('-')[2] || '-' ||
  case when length(date.split('-')[0]) = 1 then '0' || date.split('-')[0] else date.split('-')[0] end || '-' ||
  case when length(date.split('-')[1]) = 1 then '0' || date.split('-')[1] else date.split('-')[1] end
Here date is the date column and the output format is yyyy-MM-dd. The delimiter can be different.
This works without any date format functions.
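A runnable PySpark reading of the same idea, as a sketch: assuming the input arrives as month-day-year with a '-' delimiter (matching the index order above; date_col is a placeholder name), split the string, zero-pad one-digit parts with lpad, and reassemble as yyyy-MM-dd.
from pyspark.sql.functions import col, concat_ws, lpad, split

df = spark.createDataFrame([("3-5-2022",), ("12-25-2022",)], ["date_col"])
parts = split(col("date_col"), "-")
# Reorder to year-month-day and left-pad single digits with "0".
df.withColumn(
    "iso_date",
    concat_ws("-", parts.getItem(2),
              lpad(parts.getItem(0), 2, "0"),
              lpad(parts.getItem(1), 2, "0"))
).show()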

Problems with converting text to date (YYYY-MM-DD) in Excel (Mac)

I have been able to convert this kind of date to YYYY-MM-DD HH:mm before, but not anymore. What can I do to convert this date?
Sep 15, 2014 9:30:32 AM
You need to know that I'm using a Swedish keyboard, date format, and region.
Example:
Order # Purchased On
100026881 Sep 15, 2014 9:30:32 AM
100026880 Sep 15, 2014 9:10:56 AM
100026879 Sep 15, 2014 9:09:10 AM
100026878 Sep 15, 2014 9:03:27 AM
100026877 Sep 15, 2014 8:57:02 AM
100026876 Sep 15, 2014 8:38:37 AM
100026875 Sep 15, 2014 6:54:29 AM
100026874 Sep 15, 2014 5:03:23 AM
100026873 Sep 15, 2014 2:45:50 AM
100026872 Sep 15, 2014 1:42:26 AM
100026871 Sep 14, 2014 11:20:31 PM
100026870 Sep 14, 2014 11:16:29 PM
100026869 Sep 14, 2014 11:11:15 PM
100026868 Sep 14, 2014 11:10:06 PM
100026867 Sep 14, 2014 10:42:56 PM
100026866 Sep 14, 2014 10:41:22 PM
100026865 Sep 14, 2014 10:36:43 PM
100026863 Sep 14, 2014 10:26:13 PM
Formatting a date in Excel 2011 for Mac
You have at least three different ways to apply a date format. Perhaps the fastest is to select a cell or cell range, and then click the Home tab of the Ribbon. In the Number group, click the pop-up button under the Number group title and choose Date to display the date as m/d/yy, where m represents the month's number, d represents the day number, and yy represents a two-digit year.
Excel has many more built-in date formats, which you can apply by displaying the Format Cells dialog by pressing Command-1 and then clicking the Number tab. You can also display the Number tab of the Format Cells dialog by clicking the Home tab on the Ribbon. Then click the pop-up button under the Number group title and choose Custom from the pop-up menu.
When the Format Cells dialog displays, select the Date category. Choose a Type from the list. Choosing a different Location (language) or Calendar type changes the date types offered.
I hope this helps.
This should be a comment, since I have neither Swedish settings nor a Mac, but I am suggesting a lookup table:
+-----+----+
| Jan | 1 |
| Feb | 2 |
| Mar | 3 |
| Apr | 4 |
| May | 5 |
| Jun | 6 |
| Jul | 7 |
| Aug | 8 |
| Sep | 9 |
| Oct | 10 |
| Nov | 11 |
| Dec | 12 |
+-----+----+
say named Marray, along with:
=TEXT(DATE(MID(B2,9,4),VLOOKUP(LEFT(B2,3),Marray,2,0),MID(B2,5,2))+VALUE(TRIM(RIGHT(B2,11))),"[$-41D]mmmm dd, yyyy h:mm:ss AM/PM")
in C2 and copied down to suit (assuming Sep 15, 2014 9:30:32 AM is in B2).
For single digit dates, perhaps:
=TEXT(DATE(TRIM(MID(B2,8,5)),VLOOKUP(LEFT(B2,3),Marray,2,0),SUBSTITUTE(TRIM(MID(B2,4,3)),",",""))+VALUE(TRIM(RIGHT(B2,11))),"[$-41D]mmmm dd, yyyy h:mm:ss AM/PM")
For me (Windows, Excel 2013, English!) this returns the expected formatted dates.
It may be necessary to replace all the commas with semicolons, except for the literal "," inside SUBSTITUTE.
I think that Jeeped might be close to the problem.
My guess is that the data may now have been pasted as text instead of being recognized as dates. (pnuts had an answer, but it's a lot more work than using the built-in Excel functions.)
If the dates are in their own column, like:
Sep 15, 2014 9:30:32 AM
Sep 15, 2014 9:10:56 AM
Sep 15, 2014 9:09:10 AM
etc
Then you might have to get Excel to parse the text dates.
If the date text is in B2, put this formula in another cell (say B3):
=DATEVALUE(B2) + TIMEVALUE(B2)
Then you can format with this custom formatting string: yyyy-mm-dd h:mm:ss AM/PM
Which will give you:
2014-09-15 9:30:32 AM
2014-09-15 9:10:56 AM
2014-09-15 9:09:10 AM
etc.
Hope this helps.
