Timestamp abbreviated date format natively in Spark - apache-spark

I'm successfully converting numeric date formats (y-m-d, yyyyMMdd, etc.) to timestamps in Spark using sql.functions.unix_timestamp.
The problem is when the date uses an abbreviated name of a month or a day, like
1991-Aug-09 Fri
Is there any way to achieve the conversion using only native spark functions?
(Disclaimer: I know I can do it using python functions, it's just curiosity)

You can use the
yyyy-MMM-dd EEE
format (reference: SimpleDateFormat) with unix_timestamp:
spark.sql("SELECT CAST(unix_timestamp('1991-Aug-09 Fri', 'yyyy-MMM-dd EEE') AS TIMESTAMP)").show()
+-------------------------------------------------------------------+
|CAST(unix_timestamp(1991-Aug-09 Fri, yyyy-MMM-dd EEE) AS TIMESTAMP)|
+-------------------------------------------------------------------+
|                                                1991-08-09 00:00:00|
+-------------------------------------------------------------------+
or to_date / to_timestamp (Spark 2.2 or later):
spark.sql("SELECT to_date('1991-Aug-09 Fri', 'yyyy-MMM-dd EEE')").show()
+---------------------------------------------+
|to_date('1991-Aug-09 Fri', 'yyyy-MMM-dd EEE')|
+---------------------------------------------+
|                                   1991-08-09|
+---------------------------------------------+
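As a quick sanity check outside Spark, the same pattern maps onto Python's strptime directives: %b is the abbreviated month name and %a the abbreviated weekday name (assuming an English locale):

```python
from datetime import datetime

# SimpleDateFormat 'yyyy-MMM-dd EEE' roughly corresponds to '%Y-%b-%d %a'
# (%b = abbreviated month, %a = abbreviated weekday, English locale assumed)
parsed = datetime.strptime("1991-Aug-09 Fri", "%Y-%b-%d %a")
print(parsed.date())  # 1991-08-09
```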

Related

How to convert unix timestamp in Hive to unix timestamp in Spark for format "yyyy-MM-ddTHH:mm:ss.SSSZ"

One of my tables contains date columns with the format yyyy-MM-ddTHH:mm:ss.SSSZ and I need to convert this into yyyy-MM-dd HH:mm:ss format.
I'm able to convert this in Hive, but when I do the same in Spark, it throws an error.
Hive:
select order.admit_date, from_unixtime(unix_timestamp(order.ADMIT_DATE,"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"),'yyyy-MM-dd HH:mm:ss') as ADMIT_DATE
from daily_orders order;
admit_date                admit_date
------------------------  -------------------
2021-12-20T00:00:00.000Z  2021-12-20 00:00:00
Spark
spark.sql("select order.admit_date, from_unixtime(to_timestamp(order.ADMIT_DATE,"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"),'yyyy-MM-dd HH:mm:ss') as modified_date from daily_orders order).show();
Output:
:1: error: ')' expected but character literal found.
I have also tried to escape the quotes, but did not get through.
spark.sql("select order.admit_date, from_unixtime(unix_timestamp(order.ADMIT_DATE,"yyyy-MM-dd\'T\'HH:mm:ss.SSS\'Z\'"),'yyyy-MM-dd HH:mm:ss'),'yyyy-MM-dd HH:mm:ss') as modified_date from daily_orders order limit 10").show()
Output:
:1: error: ')' expected but ':' found.
Is there a common syntax that works in both Hive and Spark? Please suggest.
You have some escaping problems in your query (using " inside another "). You can use a multi-line string to escape them.
However, this can actually be done using only to_timestamp function:
spark.sql("""
select '2021-12-20T00:00:00.000Z' as admit_date,
to_timestamp('2021-12-20T00:00:00.000Z') as modified_date
""").show()
//+------------------------+-------------------+
//|admit_date |modified_date |
//+------------------------+-------------------+
//|2021-12-20T00:00:00.000Z|2021-12-20 00:00:00|
//+------------------------+-------------------+
See docs Spark Datetime Patterns for Formatting and Parsing
Edit:
If you want to keep the same syntax as Hive:
spark.sql("""
select '2021-12-20T00:00:00.000Z' as admit_date,
from_unixtime(unix_timestamp('2021-12-20T00:00:00.000Z', "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"), 'yyyy-MM-dd HH:mm:ss') as modified_date
""").show()
//+------------------------+-------------------+
//|admit_date |modified_date |
//+------------------------+-------------------+
//|2021-12-20T00:00:00.000Z|2021-12-20 00:00:00|
//+------------------------+-------------------+
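The same parse can be sanity-checked in plain Python: since Python 3.7, strptime's %z directive accepts a literal Z as UTC, so the Hive pattern "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'" maps onto "%Y-%m-%dT%H:%M:%S.%f%z":

```python
from datetime import datetime

# %f matches the .000 millisecond part, %z matches the trailing 'Z' (UTC)
dt = datetime.strptime("2021-12-20T00:00:00.000Z", "%Y-%m-%dT%H:%M:%S.%f%z")
print(dt.strftime("%Y-%m-%d %H:%M:%S"))  # 2021-12-20 00:00:00
```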

Pyspark parse datetime field with day and month names into timestamp

I'm not even sure where to start. I want to parse a column that is currently a string into a timestamp. The records look like the following:
Thu, 28 Jan 2021 02:54:17 +0000
What is the best way to parse this as a timestamp? I wasn't even sure where to start since it's not a super common way to store dates
You could probably start from the docs Datetime Patterns for Formatting and Parsing:
import pyspark.sql.functions as F
df = spark.createDataFrame([("Thu, 28 Jan 2021 02:54:17 +0000",)], ['timestamp'])
df.withColumn(
    "timestamp",
    F.to_timestamp("timestamp", "E, dd MMM yyyy HH:mm:ss Z")
).show()
#+-------------------+
#| timestamp|
#+-------------------+
#|2021-01-28 02:54:17|
#+-------------------+
However, since Spark version 3.0, you can no longer use some symbols like E while parsing to timestamp:
Symbols of ‘E’, ‘F’, ‘q’ and ‘Q’ can only be used for datetime
formatting, e.g. date_format. They are not allowed used for datetime
parsing, e.g. to_timestamp.
You can either set the time parser to legacy:
spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
Or use some string functions to remove the day part from string before using to_timestamp:
df.withColumn(
    "timestamp",
    F.to_timestamp(F.split("timestamp", ",")[1], " dd MMM yyyy HH:mm:ss Z")
).show()

Convert UTC timestamp to local time based on time zone in PySpark

I have a PySpark DataFrame, df, with some columns as shown below. The hour column is in UTC time and I want to create a new column that has the local time based on the time_zone column. How can I do that in PySpark?
df
+-------------------------+------------+
| hour | time_zone |
+-------------------------+------------+
|2019-10-16T20:00:00+0000 | US/Eastern |
|2019-10-15T23:00:00+0000 | US/Central |
+-------------------------+------------+
#What I want:
+-------------------------+------------+---------------------+
| hour | time_zone | local_time |
+-------------------------+------------+---------------------+
|2019-10-16T20:00:00+0000 | US/Eastern | 2019-10-16T15:00:00 |
|2019-10-15T23:00:00+0000 | US/Central | 2019-10-15T17:00:00 |
+-------------------------+------------+---------------------+
You can use the in-built from_utc_timestamp function. Note that the hour column needs to be passed in as a string without timezone to the function.
The code below works for Spark versions 2.4 and above.
from pyspark.sql.functions import *
df.select(from_utc_timestamp(split(df.hour,'\+')[0],df.time_zone).alias('local_time')).show()
For Spark versions before 2.4, you have to pass a constant string representing the time zone as the second argument to the function.
Documentation
pyspark.sql.functions.from_utc_timestamp(timestamp, tz)
This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE. This function takes a timestamp which is timezone-agnostic, and interprets it as a timestamp in UTC, and renders that timestamp as a timestamp in the given time zone.
However, timestamp in Spark represents number of microseconds from the Unix epoch, which is not timezone-agnostic. So in Spark this function just shift the timestamp value from UTC timezone to the given timezone.
This function may return confusing result if the input is a string with timezone, e.g. ‘2018-03-13T06:18:23+00:00’. The reason is that, Spark firstly cast the string to timestamp according to the timezone in the string, and finally display the result by converting the timestamp to string according to the session local timezone.
Parameters
timestamp – the column that contains timestamps
tz – a string that has the ID of timezone, e.g. “GMT”, “America/Los_Angeles”, etc
Changed in version 2.4: tz can take a Column containing timezone ID strings.
You should also be able to use a Spark UDF.
from pytz import timezone
from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def mytime(x, y):
    dt = datetime.strptime(x, "%Y-%m-%dT%H:%M:%S%z")
    return dt.astimezone(timezone(y)).strftime("%Y-%m-%dT%H:%M:%S")

mytimeUDF = udf(mytime, StringType())
df = df.withColumn('local_time', mytimeUDF("hour", "time_zone"))
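To see what the row-wise conversion should produce, here is the same shift in plain Python with zoneinfo (Python 3.9+; America/New_York is the canonical IANA name for US/Eastern). Note that on 2019-10-16 US Eastern observes daylight time (UTC-4), so the DST-aware local time for 20:00 UTC is 16:00:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

# Interpret a UTC instant and render it in a target zone, mimicking what
# from_utc_timestamp does per row ('America/New_York' == 'US/Eastern')
utc = datetime(2019, 10, 16, 20, 0, 0, tzinfo=timezone.utc)
local = utc.astimezone(ZoneInfo("America/New_York"))
print(local.strftime("%Y-%m-%dT%H:%M:%S"))  # 2019-10-16T16:00:00 (EDT, UTC-4)
```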

PySpark - to_date format from column

I am currently trying to figure out, how to pass the String - format argument to the to_date pyspark function via a column parameter.
Specifically, I have the following setup:
sc = SparkContext.getOrCreate()
df = sc.parallelize([('a', '2018-01-01', 'yyyy-MM-dd'),
                     ('b', '2018-02-02', 'yyyy-MM-dd'),
                     ('c', '02-02-2018', 'dd-MM-yyyy')]).toDF(
    ["col_name", "value", "format"])
I am currently trying to add a new column, where each of the dates from the column F.col("value"), which is a string value, is parsed to a date.
Separately for each format, this can be done with
df = df.withColumn("test1", F.to_date(F.col("value"), "yyyy-MM-dd"))\
       .withColumn("test2", F.to_date(F.col("value"), "dd-MM-yyyy"))
This however gives me 2 new columns - but I want to have 1 column containing both results - but calling the column does not seem to be possible with the to_date function:
df = df.withColumn("test3",F.to_date(F.col("value"),F.col("format")))
This throws a "Column object is not callable" error.
Is it possible to have a generic approach for all possible formats (so that I do not have to manually add new columns for each format)?
You can use a column value as a parameter without a udf using the spark-sql syntax:
Spark version 2.2 and above
from pyspark.sql.functions import expr
df.withColumn("test3",expr("to_date(value, format)")).show()
#+--------+----------+----------+----------+
#|col_name| value| format| test3|
#+--------+----------+----------+----------+
#| a|2018-01-01|yyyy-MM-dd|2018-01-01|
#| b|2018-02-02|yyyy-MM-dd|2018-02-02|
#| c|02-02-2018|dd-MM-yyyy|2018-02-02|
#+--------+----------+----------+----------+
Or equivalently using pyspark-sql:
df.createOrReplaceTempView("df")
spark.sql("select *, to_date(value, format) as test3 from df").show()
Spark version 1.5 and above
Older versions of Spark do not support a format argument to the to_date function, so you'll have to use unix_timestamp and from_unixtime:
from pyspark.sql.functions import expr
df.withColumn(
"test3",
expr("from_unixtime(unix_timestamp(value,format))").cast("date")
).show()
Or equivalently using pyspark-sql:
df.createOrReplaceTempView("df")
spark.sql(
"select *, cast(from_unixtime(unix_timestamp(value,format)) as date) as test3 from df"
).show()
As far as I know, your problem requires a udf (user defined function) to apply the correct format. But inside a udf you cannot directly use Spark functions like to_date, so I created a small workaround. First, the udf parses the date with the appropriate Python format taken from the column and converts it to ISO format. Then another withColumn converts the ISO date to the correct format in column test3. However, you have to adapt the formats in the original column to match Python's date format strings, e.g. yyyy -> %Y, MM -> %m, ...
import datetime
from pyspark.sql.functions import udf, col, to_date

test_df = spark.createDataFrame([
    ('a', '2018-01-01', '%Y-%m-%d'),
    ('b', '2018-02-02', '%Y-%m-%d'),
    ('c', '02-02-2018', '%d-%m-%Y')
], ("col_name", "value", "format"))

def map_to_date(s, fmt):
    return datetime.datetime.strptime(s, fmt).isoformat()

myudf = udf(map_to_date)

test_df.withColumn("test3", myudf(col("value"), col("format")))\
    .withColumn("test3", to_date("test3")).show(truncate=False)
Result:
+--------+----------+--------+----------+
|col_name|value |format |test3 |
+--------+----------+--------+----------+
|a |2018-01-01|%Y-%m-%d|2018-01-01|
|b |2018-02-02|%Y-%m-%d|2018-02-02|
|c |02-02-2018|%d-%m-%Y|2018-02-02|
+--------+----------+--------+----------+
You don't need the format column at all. You can use coalesce to try all possible formats:
from pyspark.sql import functions as F

def get_right_date_format(date_string):
    return F.coalesce(
        F.to_date(date_string, 'yyyy-MM-dd'),
        F.to_date(date_string, 'dd-MM-yyyy'),
        F.to_date(date_string, 'yyyy-dd-MM')
    )

df = sc.parallelize([('a', '2018-01-01'),
                     ('b', '2018-02-02'),
                     ('c', '2018-21-02'),
                     ('d', '02-02-2018')]).toDF(
    ["col_name", "value"])

df = df.withColumn("formatted_data", get_right_date_format(df.value))
The issue with this approach though is a date like 2020-02-01 would be treated as 1st Feb 2020, when it is likely that 2nd Jan 2020 is also possible.
Just an alternative approach !!!
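The coalesce trick amounts to trial parsing: try each format in order and keep the first that succeeds. A plain-Python sketch of the same policy (the format list here is illustrative):

```python
from datetime import datetime

def parse_with_fallbacks(s, formats=("%Y-%m-%d", "%d-%m-%Y", "%Y-%d-%m")):
    # Try candidate formats in order, like coalesce over to_date calls;
    # the first format that parses wins, so the order encodes the
    # ambiguity policy (e.g. year-month-day beats year-day-month)
    for fmt in formats:
        try:
            return datetime.strptime(s, fmt).date()
        except ValueError:
            continue
    return None  # no format matched

print(parse_with_fallbacks("2018-21-02"))  # 2018-02-21
print(parse_with_fallbacks("02-02-2018"))  # 2018-02-02
```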

Spark sql - Pyspark string to date conversion

I have a column with the data 20180501 in string format and I want to convert it to date format. I tried using
to_date(cast(unix_timestamp('20180501', 'YYYYMMDD') as timestamp))
but it still didn't work. I'm using Spark SQL with dataframes.
The format should be yyyyMMdd:
spark.sql("SELECT to_date(cast(unix_timestamp('20180501', 'yyyyMMdd') as timestamp))").show()
# +------------------------------------------------------------------+
# |to_date(CAST(unix_timestamp('20180501', 'yyyyMMdd') AS TIMESTAMP))|
# +------------------------------------------------------------------+
# | 2018-05-01|
# +------------------------------------------------------------------+
As pointed out in the other answer the format you use is incorrect. But you can also use to_date directly:
spark.sql("SELECT to_date('20180501', 'yyyyMMdd')").show()
+-------------------------------+
|to_date('20180501', 'yyyyMMdd')|
+-------------------------------+
| 2018-05-01|
+-------------------------------+
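The case-sensitivity is the whole issue: in SimpleDateFormat, yyyy is the year and dd the day-of-month, while YYYY and DD mean week-based year and day-of-year. A quick Python analogue of the corrected pattern:

```python
from datetime import datetime

# 'yyyyMMdd' corresponds to Python's '%Y%m%d'
d = datetime.strptime("20180501", "%Y%m%d").date()
print(d)  # 2018-05-01
```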
