Pyspark parse datetime field with day and month names into timestamp - apache-spark

I'm not even sure where to start. I want to parse a column that is currently a string into a timestamp. The records look like the following:
Thu, 28 Jan 2021 02:54:17 +0000
What is the best way to parse this as a timestamp? I wasn't even sure where to start, since it's not a super common way to store dates.

You could probably start from the docs Datetime Patterns for Formatting and Parsing:
import pyspark.sql.functions as F
df = spark.createDataFrame([("Thu, 28 Jan 2021 02:54:17 +0000",)], ['timestamp'])
df.withColumn(
    "timestamp",
    F.to_timestamp("timestamp", "E, dd MMM yyyy HH:mm:ss Z")
).show()
#+-------------------+
#|          timestamp|
#+-------------------+
#|2021-01-28 02:54:17|
#+-------------------+
However, since Spark version 3.0, you can no longer use some symbols like E while parsing to timestamp:
Symbols of ‘E’, ‘F’, ‘q’ and ‘Q’ can only be used for datetime
formatting, e.g. date_format. They are not allowed used for datetime
parsing, e.g. to_timestamp.
You can either set the time parser to legacy:
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
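With the legacy policy in place, the original pattern containing E should parse again; a quick sketch reusing the sample dataframe from above:
df.withColumn(
    "timestamp",
    F.to_timestamp("timestamp", "E, dd MMM yyyy HH:mm:ss Z")
).show()
# expected: 2021-01-28 02:54:17, same as the output shown earlier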
Or use some string functions to remove the day part from string before using to_timestamp:
df.withColumn(
    "timestamp",
    F.to_timestamp(F.split("timestamp", ",")[1], " dd MMM yyyy HH:mm:ss Z")
).show()

Related

How can I convert a specific string date to date or datetime in Spark?

I have this string pattern in my Spark dataframe: 'Sep 14, 2014, 1:34:36 PM'.
I want to convert this to date or datetime format, using Databricks and Spark.
I've already tried the cast and to_date functions, but nothing works and I get a null return every time.
How can I do that?
Thanks in advance!
If we have a table created like this:
var ds = spark.sparkContext.parallelize(Seq(
  "Sep 14, 2014, 01:34:36 PM"
)).toDF("date")
Through the following statement:
ds = ds.withColumn("casted", to_timestamp(col("date"), "MMM dd, yyyy, hh:mm:ss aa"))
You get this result:
+-------------------------+-------------------+
|date                     |casted             |
+-------------------------+-------------------+
|Sep 14, 2014, 01:34:36 PM|2014-09-14 13:34:36|
+-------------------------+-------------------+
which should be useful to you. You can also use to_date or other APIs that require a datetime format, as sketched below. Good luck!
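If only the calendar date is needed, to_date accepts the same kind of pattern. A PySpark sketch of the idea (written here with a single a for the AM/PM marker):
from pyspark.sql.functions import col, to_date

df = spark.createDataFrame([("Sep 14, 2014, 01:34:36 PM",)], ["date"])
df.withColumn("casted", to_date(col("date"), "MMM dd, yyyy, hh:mm:ss a")).show()
# expected casted value: 2014-09-14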
Your date/time stamp string is incorrect. You have 1 instead of 01.
#
# 1 - Create sample dataframe + view
#
# required library
from pyspark.sql.functions import *
# array of tuples - data
dat1 = [
    ("1", "Sep 14, 2014, 01:34:36 pm")
]
# array of names - columns
col1 = ["row_id", "date_string1"]
# make data frame
df1 = spark.createDataFrame(data=dat1, schema=col1)
# convert the date string into a timestamp
df1 = df1.withColumn("time_stamp1", to_timestamp(col("date_string1"), "MMM dd, yyyy, hh:mm:ss a"))
# show schema
df1.printSchema()
# show data
display(df1)
This code produces the correct answer.
If the data has 1:34:36 instead, that format fails. You can use a when clause to pick the correct conversion.
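A minimal sketch of that idea, reusing the sample dataframe above; the 25-character length check is just an assumption that matches this particular layout:
from pyspark.sql.functions import col, length, to_timestamp, when

# pick the hour pattern (hh vs h) based on the length of the string
df1 = df1.withColumn(
    "time_stamp1",
    when(
        length(col("date_string1")) == 25,  # e.g. "Sep 14, 2014, 01:34:36 pm"
        to_timestamp(col("date_string1"), "MMM dd, yyyy, hh:mm:ss a")
    ).otherwise(
        to_timestamp(col("date_string1"), "MMM dd, yyyy, h:mm:ss a")
    )
)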

Change the timestamp from UTC to given format in Pyspark

I have a timestamp value, i.e. "2021-08-18T16:49:42.175-06:00". How can I convert this to the "2021-08-18T16:49:42.175Z" format in PySpark?
You can use the PySpark function date_format to reformat your timestamp into any other format.
Example:
df = df.withColumn("ts_column", date_format("ts_column", "yyyy-MM-dd'T'HH:mm:ss.SSSZ"))
date_format expects a TimestampType column, so you might need to cast the column to timestamp first if it is currently a StringType.
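For example, a sketch that parses the string into a proper timestamp first and then reformats it (column name as in the snippet above):
from pyspark.sql.functions import col, date_format, to_timestamp

# parse the ISO-8601 string (offset included) into TimestampType, then reformat it
df = df.withColumn("ts_column", to_timestamp(col("ts_column")))
df = df.withColumn("ts_column", date_format("ts_column", "yyyy-MM-dd'T'HH:mm:ss.SSSZ"))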
Set the session timeZone to "UTC" and read only the first 23 characters.
Try below:
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql(""" select to_timestamp('2021-08-18T16:49:42.175-06:00') as ts,
date_format(to_timestamp(substr('2021-08-18T16:49:42.175-06:00',1,23)),'yyyy-MM-dd HH:mm:ss.SSSZ') as ts2 from range(1) """).show(false)
+-----------------------+----------------------------+
|ts                     |ts2                         |
+-----------------------+----------------------------+
|2021-08-18 22:49:42.175|2021-08-18 16:49:42.175+0000|
+-----------------------+----------------------------+
Note that +0000 is UTC.
If you want to get "Z", then use X:
spark.conf.set("spark.sql.session.timeZone", "UTC")
spark.sql("""
with t1 ( select to_timestamp('2021-08-18T16:49:42.175-06:00') as ts,
to_timestamp(substr('2021-08-18T16:49:42.175-06:00',1,23)) as ts2 from range(1) )
select *, date_format(ts2,'YYYY-MM-d HH:MM:ss.SX') ts3 from t1
""").show(false)
+-----------------------+-----------------------+------------------------+
|ts |ts2 |ts3 |
+-----------------------+-----------------------+------------------------+
|2021-08-18 22:49:42.175|2021-08-18 16:49:42.175|2021-08-18 16:08:42.175Z|
+-----------------------+-----------------------+------------------------+
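The same idea through the DataFrame API, as a PySpark sketch (the 'T' separator in the output pattern is an assumption to match the asker's desired format):
from pyspark.sql.functions import col, date_format, substring, to_timestamp

spark.conf.set("spark.sql.session.timeZone", "UTC")
df = spark.createDataFrame([("2021-08-18T16:49:42.175-06:00",)], ["ts_string"])
df.select(
    date_format(
        to_timestamp(substring(col("ts_string"), 1, 23)),
        "yyyy-MM-dd'T'HH:mm:ss.SSSX"
    ).alias("ts3")
).show(truncate=False)
# expected: 2021-08-18T16:49:42.175Z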

convert string with UTC offset to spark timestamp

How do I store the string 2018-03-21 08:15:00 +03:00 as a TimestampType, preserving the UTC offset, in Spark?
I tried the below:
from pyspark.sql.functions import *
df = spark.createDataFrame([("2018-03-21 08:15:00 +03:00",)], ["timestamp"])
newDf = df.withColumn(
    "newtimestamp",
    to_timestamp(col('timestamp'), "yyyy-MM-dd HH:mm:ss XXX")
)
This prints the newtimestamp column with the value converted to UTC time, i.e. 2018-03-21 05:15:00.
How can I store this string as a timestamp column in the dataframe while preserving the offset, i.e. store the same string as a timestamp, or store it like 2018-03-21 08:15:00 +0300?
You need to format the timestamp you obtain from the conversion into the desired pattern using date_format:
newDf = df.withColumn(
    "newtimestamp",
    to_timestamp(col('timestamp'), "yyyy-MM-dd HH:mm:ss XXX")
).withColumn(
    "newtimestamp_formatted",
    date_format("newtimestamp", "yyyy-MM-dd HH:mm:ss Z")
)
newDf.show(truncate=False)
#+--------------------------+-------------------+-------------------------+
#|timestamp                 |newtimestamp       |newtimestamp_formatted   |
#+--------------------------+-------------------+-------------------------+
#|2018-03-21 08:15:00 +03:00|2018-03-21 06:15:00|2018-03-21 06:15:00 +0100|
#+--------------------------+-------------------+-------------------------+

Convert UTC timestamp to local time based on time zone in PySpark

I have a PySpark DataFrame, df, with some columns as shown below. The hour column is in UTC time and I want to create a new column that has the local time based on the time_zone column. How can I do that in PySpark?
df
+-------------------------+------------+
| hour                    | time_zone  |
+-------------------------+------------+
|2019-10-16T20:00:00+0000 | US/Eastern |
|2019-10-15T23:00:00+0000 | US/Central |
+-------------------------+------------+
#What I want:
+-------------------------+------------+---------------------+
| hour                    | time_zone  | local_time          |
+-------------------------+------------+---------------------+
|2019-10-16T20:00:00+0000 | US/Eastern | 2019-10-16T15:00:00 |
|2019-10-15T23:00:00+0000 | US/Central | 2019-10-15T17:00:00 |
+-------------------------+------------+---------------------+
You can use the in-built from_utc_timestamp function. Note that the hour column needs to be passed in as a string without timezone to the function.
The code below works for Spark versions 2.4 and later.
from pyspark.sql.functions import *
df.select(from_utc_timestamp(split(df.hour, r'\+')[0], df.time_zone).alias('local_time')).show()
For Spark versions before 2.4, you have to pass in a constant string representing the time zone as the second argument to the function.
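For example, on older versions the call might look like this (a sketch; a literal zone applies to every row, so it only helps when all rows share one time zone):
from pyspark.sql.functions import from_utc_timestamp, split

# pre-2.4: the second argument must be a literal time zone string, not a column
df.select(
    from_utc_timestamp(split(df.hour, r'\+')[0], "US/Eastern").alias("local_time")
).show()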
Documentation
pyspark.sql.functions.from_utc_timestamp(timestamp, tz)
This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE. This function takes a timestamp which is timezone-agnostic, and interprets it as a timestamp in UTC, and renders that timestamp as a timestamp in the given time zone.
However, timestamp in Spark represents the number of microseconds from the Unix epoch, which is not timezone-agnostic. So in Spark this function just shifts the timestamp value from the UTC timezone to the given timezone.
This function may return confusing result if the input is a string with timezone, e.g. ‘2018-03-13T06:18:23+00:00’. The reason is that, Spark firstly cast the string to timestamp according to the timezone in the string, and finally display the result by converting the timestamp to string according to the session local timezone.
Parameters
timestamp – the column that contains timestamps
tz – a string that has the ID of timezone, e.g. “GMT”, “America/Los_Angeles”, etc
Changed in version 2.4: tz can take a Column containing timezone ID strings.
You should also be able to use a Spark UDF.
from pytz import timezone
from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def mytime(x, y):
    # parse the UTC string (including its offset) and shift it to the row's time zone
    dt = datetime.strptime(x, "%Y-%m-%dT%H:%M:%S%z")
    return dt.astimezone(timezone(y)).isoformat()

mytimeUDF = udf(mytime, StringType())
df = df.withColumn('local_time', mytimeUDF("hour", "time_zone"))

Timestamp abbreviated date format natively in Spark

I'm successfully converting numerical date formats (y-m-d, yyyyMMdd, etc.) into timestamps in Spark using sql.functions.unix_timestamp.
The problem is when the date uses an abbreviated name of a month or a day, like
1991-Aug-09 Fri
Is there any way to achieve the conversion using only native spark functions?
(Disclaimer: I know I can do it using python functions, it's just curiosity)
You can use the yyyy-MMM-dd EEE format (see SimpleDateFormat for reference) with unix_timestamp:
spark.sql("SELECT CAST(unix_timestamp('1991-Aug-09 Fri', 'yyyy-MMM-dd EEE') AS TIMESTAMP)").show()
+-------------------------------------------------------------------+
|CAST(unix_timestamp(1991-Aug-09 Fri, yyyy-MMM-dd EEE) AS TIMESTAMP)|
+-------------------------------------------------------------------+
|                                                1991-08-09 00:00:00|
+-------------------------------------------------------------------+
or to_date / to_timestamp (Spark 2.2 or later):
spark.sql("SELECT to_date('1991-Aug-09 Fri', 'yyyy-MMM-dd EEE')").show()
+---------------------------------------------+
|to_date('1991-Aug-09 Fri', 'yyyy-MMM-dd EEE')|
+---------------------------------------------+
|                                   1991-08-09|
+---------------------------------------------+
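For completeness, the same parse through the Python DataFrame API (a sketch; note that since Spark 3.0 the E symbol is no longer allowed for parsing, as mentioned in the first answer, so this assumes Spark 2.x or the legacy parser policy):
from pyspark.sql.functions import unix_timestamp

df = spark.createDataFrame([("1991-Aug-09 Fri",)], ["d"])
df.select(
    unix_timestamp("d", "yyyy-MMM-dd EEE").cast("timestamp").alias("ts")
).show()
# expected: 1991-08-09 00:00:00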
