How can I convert a column with string values in this format "Dec 25 2022 6:31AM" to Timestamp?
No matter what I do, I still get null values in the new column.
I've tried:
import pyspark.sql.functions as fn
df.withColumn('new_ts', fn.col('SendTime').cast("timestamp"))
df.withColumn("new_ts",fn.to_timestamp(fn.col("SendTime")).cast('string'))
df.withColumn('new_ts', (fn.to_timestamp('SendTime', 'yyyy-MM-dd HH:mm:ss.SSS-0300')).cast('date'))
among other attempts.
You were close. to_timestamp is the correct function in your case, but you need to fix your datetime pattern.
I was able to figure out something like this:
import pyspark.sql.functions as F
data1 = [
["Dec 25 2022 6:31AM"],
["Nov 11 2022 02:31AM"],
["Jun 03 2022 08:31PM"]
]
df = spark.createDataFrame(data1).toDF("time")
tmp = df.withColumn("test", F.to_timestamp(F.col("time"), "MMM dd yyyy h:mma"))
tmp.show(truncate = False)
And the output is:
+-------------------+-------------------+
|time |test |
+-------------------+-------------------+
|Dec 25 2022 6:31AM |2022-12-25 06:31:00|
|Nov 11 2022 02:31AM|2022-11-11 02:31:00|
|Jun 03 2022 08:31PM|2022-06-03 20:31:00|
+-------------------+-------------------+
So I think you can try using this format: MMM dd yyyy h:mma
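Applied to your own dataframe (a minimal sketch, assuming the column is still named SendTime as in your attempts), that would be:
import pyspark.sql.functions as F

df = df.withColumn("new_ts", F.to_timestamp(F.col("SendTime"), "MMM dd yyyy h:mma"))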
The to_timestamp() function in PySpark is commonly used to convert a string to a timestamp (i.e., TimestampType). If no format is given, it falls back to Spark's cast rules, which expect ISO-style strings such as "yyyy-MM-dd HH:mm:ss[.SSS]"; if the input does not match the specified (or default) pattern, it returns null.
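For example, a quick sketch of that behaviour (the column names below are only illustrative): parsing without a pattern only works for ISO-style strings, while the custom pattern from the answer above handles the original format:
import pyspark.sql.functions as F

df = spark.createDataFrame(
    [("2022-12-25 06:31:00", "Dec 25 2022 6:31AM")],
    ["iso_str", "custom_str"],
)
df.select(
    F.to_timestamp("iso_str").alias("default_ok"),                           # parses fine
    F.to_timestamp("custom_str").alias("default_null"),                      # null: does not match the default pattern
    F.to_timestamp("custom_str", "MMM dd yyyy h:mma").alias("explicit_ok"),  # parses with an explicit pattern
).show(truncate=False)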
I am exploring different date formats and trying to convert between them. Currently, I'm stuck in a scenario where I have input dates and times as below:
I was able to combine the date and time into a single string using concatenation:
concat_ws(' ', new_df.transaction_date, new_df.Transaction_Time)
but when I try to use
withColumn("date_time2", F.to_date(col('date_time'), "MMM d yyyy hh:mmaa")) with ('spark.sql.legacy.timeParserPolicy', 'LEGACY')
the result is displayed as 'undefined'.
I am looking for pointers/code snippets to extract YYYY-MM-DD HH:MM:SS in CET (input is in PST) as below
input_date_time                output (in CET)
Mar 1, 2022 01:00:00 PM PST    2022-03-01 22:00:00
Parse the PST string to a timestamp (stored in UTC), then convert it to "CET" time:
import pyspark.sql.functions as F
df = spark.createDataFrame(data=[["Mar 1, 2022 01:00:00 PM PST"]], schema=["input_date_time_pst"])
df = df.withColumn("input_date_time_pst", F.to_timestamp("input_date_time_pst", format="MMM d, yyyy hh:mm:ss a z"))
df = df.withColumn("output_cet", F.from_utc_timestamp("input_date_time_pst", "CET"))
[Out]:
+-------------------+-------------------+
|input_date_time_pst|output_cet |
+-------------------+-------------------+
|2022-03-01 21:00:00|2022-03-01 22:00:00|
+-------------------+-------------------+
Note - The 2022-03-01 21:00:00 above is Mar 1, 2022 01:00:00 PM PST displayed in UTC.
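If you also need the result as a plain 'yyyy-MM-dd HH:mm:ss' string (as in the requested output), a small follow-up sketch on top of the dataframe above could use date_format:
df = df.withColumn("output_cet_str", F.date_format("output_cet", "yyyy-MM-dd HH:mm:ss"))
df.show(truncate=False)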
I want to convert a string field to timestamp with Impala.
I tried using this expression which works fine in Hive:
select from_unixtime(unix_timestamp('Fri Mar 02 00:00:00 GMT 2018','EEE MMM dd HH:mm:ss z yyyy')) AS Date_Call_Year;
but I get the error "Bad date/time conversion format: EEE MMM dd HH:mm:ss z yyyy". What is wrong with this format?
I have a date string from a source in the format 'Fri May 24 00:00:00 BST 2019' that I would like to convert to a date and store in my dataframe as '2019-05-24', using code like the example below, which works for me under Spark 2.0:
from pyspark.sql.functions import to_date, unix_timestamp, from_unixtime
df = spark.createDataFrame([("Fri May 24 00:00:00 BST 2019",)], ['date_str'])
df2 = df.select('date_str', to_date(from_unixtime(unix_timestamp('date_str', 'EEE MMM dd HH:mm:ss zzz yyyy'))).alias('date'))
df2.show(1, False)
In my sandbox environment I've updated to Spark 3.0 and now get the following error for the above code. Is there a new way of doing this in 3.0 to convert my string to a date?
: org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'EEE MMM dd HH:mm:ss zzz yyyy' pattern in the DateTimeFormatter. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
If you want to use the legacy format in a newer version of Spark (>= 3.0), you need to set spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY") or
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY"), which will resolve the issue.
Thanks for the responses, excellent advice; for the moment I'll be going with the LEGACY setting. I have a workaround for Spark 3.0 by substringing out the EEE element, but I've noticed a bug in how the BST timezone is converted: it is incorrectly offset by 10 hours, while under LEGACY it correctly stays the same (I'm currently in the BST zone). I can do something with this, but will wait until the clocks change in the autumn to confirm.
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
df = spark.createDataFrame([('Fri May 24 00:00:00 BST 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime'),
to_timestamp(df.mydate, 'EEE MMM dd HH:mm:ss zzz yyyy').alias('LEGACYdatetime')
).show(1, False)
df = spark.createDataFrame([('Fri May 24 00:00:00 GMT 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime'),
to_timestamp(df.mydate, 'EEE MMM dd HH:mm:ss zzz yyyy').alias('LEGACYdatetime')
).show(1, False)
spark.sql("set spark.sql.legacy.timeParserPolicy=CORRECTED")
df = spark.createDataFrame([('Fri May 24 00:00:00 BST 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime')
).show(1, False)
df = spark.createDataFrame([('Fri May 24 00:00:00 GMT 2019',)], ['mydate'])
df = df.select('mydate',
to_timestamp(df.mydate.substr(5, 28), 'MMM dd HH:mm:ss zzz yyyy').alias('datetime')
).show(1, False)
+----------------------------+-------------------+-------------------+
|mydate |datetime |LEGACYdatetime |
+----------------------------+-------------------+-------------------+
|Fri May 24 00:00:00 BST 2019|2019-05-24 00:00:00|2019-05-24 00:00:00|
+----------------------------+-------------------+-------------------+
+----------------------------+-------------------+-------------------+
|mydate |datetime |LEGACYdatetime |
+----------------------------+-------------------+-------------------+
|Fri May 24 00:00:00 GMT 2019|2019-05-24 01:00:00|2019-05-24 01:00:00|
+----------------------------+-------------------+-------------------+
+----------------------------+-------------------+
|mydate |datetime |
+----------------------------+-------------------+
|Fri May 24 00:00:00 BST 2019|2019-05-23 14:00:00|
+----------------------------+-------------------+
+----------------------------+-------------------+
|mydate |datetime |
+----------------------------+-------------------+
|Fri May 24 00:00:00 GMT 2019|2019-05-24 01:00:00|
+----------------------------+-------------------+
The difference between the legacy parser and the one in current versions of Spark is subtle. For example:
spark.sql("set spark.sql.legacy.timeParserPolicy=EXCEPTION")
df = spark.createDataFrame([('12/25/2019 01:30:00 PM',),], ['Christmas'])
df.select(to_timestamp(col('Christmas'),'MM/dd/yyyy hh:mm:ss a')).show()
Outputs the following:
+----------------------------------------------+
|to_timestamp(Christmas, MM/dd/yyyy hh:mm:ss a)|
+----------------------------------------------+
| 2019-12-25 13:30:00|
+----------------------------------------------+
However
spark.sql("set spark.sql.legacy.timeParserPolicy=EXCEPTION")
df = spark.createDataFrame([('12/25/2019 01:30:00 PM',),], ['Christmas'])
df.select(to_timestamp(col('Christmas'),'MM/dd/yyyy hh:mm:ss aa')).show()
This will raise a SparkUpgradeException.
Notice we have 'aa' in the time format, not just a single 'a'.
According to the Java docs for DateTimeFormatter, which is what the to_timestamp function uses, 'aa' was always wrong; I guess earlier versions of Spark were just more lenient.
So either fix the date formats or set the timeParserPolicy to 'LEGACY' as Shivam suggested.
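For completeness, a small sketch of the LEGACY route with the same example: the legacy parser (SimpleDateFormat) treats repeated text letters such as 'aa' leniently, so the original format parses once the policy is switched:
from pyspark.sql.functions import col, to_timestamp

spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")
df = spark.createDataFrame([('12/25/2019 01:30:00 PM',)], ['Christmas'])
df.select(to_timestamp(col('Christmas'), 'MM/dd/yyyy hh:mm:ss aa')).show()
# remember to switch the policy back (e.g. to CORRECTED) if other jobs rely on the new parser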
df3 = pd.concat([dataframe0, dataframe1])
df3.shape
Out[13]: (29807, 11)
df3['Created Date'].dtype
Out[22]: dtype('O')
df3['Created Date'] = df3['Created Date'].astype('datetime64[ns]')
ValueError: ('Unknown string format:', 'Tue Jun 25 2019 00:13:23 GMT-0700 (Pacific Daylight Time)')
The 'Created Date' column contains dates in the format 'Fri Jun 28 2019 00:01:12 GMT-0700 (Pacific Daylight Time)'.
One problem is the extra text "(Pacific Daylight Time)", so I wrote a regex to remove it. The regex captures everything up to the whitespace before the opening parenthesis and keeps only that part.
From there you can use a function with dt.datetime.strptime() to convert your date string to a pandas datetime64. Please note that there is a call to .strip() in the date converter function. This is because I didn't see a way to remove the trailing whitespace within the regex.
import datetime as dt
import pandas as pd
def date_converter(date_str):
    # .strip() removes the trailing space left over after the parenthetical is dropped
    return dt.datetime.strptime(date_str.strip(), '%a %b %d %Y %H:%M:%S GMT%z')

df = pd.DataFrame(['Tue Jun 25 2019 00:13:23 GMT-0700 (Pacific Daylight Time)'], columns=['a'])
# keep capture group 1 (everything before the " (...)" suffix) and drop group 2
pattern = r'^(.*?)((?:(?!\s\().)*)$'
df['b'] = df['a'].str.replace(pattern, r'\1', regex=True)  # regex=True is required in newer pandas
df['b'] = df['b'].apply(date_converter)
>>> print(df)
a b
0 Tue Jun 25 2019 00:13:23 GMT-0700 (Pacific Day... 2019-06-25 00:13:23-07:00
>>> df['b'].dtype
datetime64[ns, UTC-07:00]
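For what it's worth, an equivalent approach with pd.to_datetime (same assumptions about the input format, with the "(Pacific Daylight Time)" suffix stripped by a simpler regex) would be:
import pandas as pd

df = pd.DataFrame(['Tue Jun 25 2019 00:13:23 GMT-0700 (Pacific Daylight Time)'], columns=['a'])
# drop the trailing "(...)" annotation, then parse with an explicit format string
df['b'] = pd.to_datetime(
    df['a'].str.replace(r'\s*\(.*\)$', '', regex=True),
    format='%a %b %d %Y %H:%M:%S GMT%z',
)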