Spark sql - Pyspark string to date conversion - apache-spark

I have a column with the value 20180501 stored as a string and I want to convert it to a date. I tried
to_date(cast(unix_timestamp('20180501', 'YYYYMMDD') as timestamp))
but it still didn't work. I'm using Spark SQL with DataFrames.

The format should be yyyyMMdd:
spark.sql("SELECT to_date(cast(unix_timestamp('20180501', 'yyyyMMdd') as timestamp))").show()
# +------------------------------------------------------------------+
# |to_date(CAST(unix_timestamp('20180501', 'yyyyMMdd') AS TIMESTAMP))|
# +------------------------------------------------------------------+
# |                                                        2018-05-01|
# +------------------------------------------------------------------+

As pointed out in the other answer the format you use is incorrect. But you can also use to_date directly:
spark.sql("SELECT to_date('20180501', 'yyyyMMdd')").show()
+-------------------------------+
|to_date('20180501', 'yyyyMMdd')|
+-------------------------------+
|                     2018-05-01|
+-------------------------------+
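Since the question mentions DataFrames, the same conversion can be done through the DataFrame API as well. A minimal sketch, assuming a DataFrame with a hypothetical string column named date_str:
from pyspark.sql import functions as F

# hypothetical DataFrame holding the string value from the question
df = spark.createDataFrame([("20180501",)], ["date_str"])

# to_date accepts an explicit pattern since Spark 2.2
df.withColumn("date", F.to_date("date_str", "yyyyMMdd")).show()
# +--------+----------+
# |date_str|      date|
# +--------+----------+
# |20180501|2018-05-01|
# +--------+----------+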

Related

How to convert unix timestamp in Hive to unix timestamp in Spark for format "yyyy-MM-ddTHH:mm:ss.SSSZ"

One of my tables contains date columns in the format yyyy-MM-ddTHH:mm:ss.SSSZ and I need to convert them to the yyyy-MM-dd HH:mm:ss format.
I'm able to do the conversion in Hive, but when I do the same in Spark, it throws an error.
Hive:
select order.admit_date, from_unixtime(unix_timestamp(order.ADMIT_DATE,"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"),'yyyy-MM-dd HH:mm:ss') as ADMIT_DATE
from daily_orders order;
admit_date                admit_date
------------------------  -------------------
2021-12-20T00:00:00.000Z  2021-12-20 00:00:00
Spark
spark.sql("select order.admit_date, from_unixtime(to_timestamp(order.ADMIT_DATE,"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"),'yyyy-MM-dd HH:mm:ss') as modified_date from daily_orders order).show();
Output:
:1: error: ')' expected but character literal found.
I have also tried escaping the quotes, but that did not get through either.
spark.sql("select order.admit_date, from_unixtime(unix_timestamp(order.ADMIT_DATE,"yyyy-MM-dd\'T\'HH:mm:ss.SSS\'Z\'"),'yyyy-MM-dd HH:mm:ss'),'yyyy-MM-dd HH:mm:ss') as modified_date from daily_orders order limit 10").show()
Output:
:1: error: ')' expected but ':' found.
Is there a common syntax that works in both Hive and Spark? Please suggest.
You have some escaping problems in your query (you use " inside another "). You can use a multi-line string to avoid them.
However, this can actually be done using only the to_timestamp function:
spark.sql("""
select '2021-12-20T00:00:00.000Z' as admit_date,
to_timestamp('2021-12-20T00:00:00.000Z') as modified_date
""").show()
//+------------------------+-------------------+
//|admit_date |modified_date |
//+------------------------+-------------------+
//|2021-12-20T00:00:00.000Z|2021-12-20 00:00:00|
//+------------------------+-------------------+
See the docs: Spark Datetime Patterns for Formatting and Parsing.
Edit:
If you want to keep the same syntax as Hive:
spark.sql("""
select '2021-12-20T00:00:00.000Z' as admit_date,
from_unixtime(unix_timestamp('2021-12-20T00:00:00.000Z', "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"), 'yyyy-MM-dd HH:mm:ss') as modified_date
""").show()
//+------------------------+-------------------+
//|admit_date |modified_date |
//+------------------------+-------------------+
//|2021-12-20T00:00:00.000Z|2021-12-20 00:00:00|
//+------------------------+-------------------+
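To apply the same pattern to the actual column rather than a literal, here is a minimal PySpark sketch, assuming a DataFrame standing in for daily_orders with a string column ADMIT_DATE (names taken from the question):
from pyspark.sql import functions as F

# hypothetical stand-in for the daily_orders table
orders = spark.createDataFrame([("2021-12-20T00:00:00.000Z",)], ["ADMIT_DATE"])

# to_timestamp parses the ISO 8601 string directly; the displayed value
# depends on the session timezone
orders.withColumn("modified_date", F.to_timestamp("ADMIT_DATE")).show(truncate=False)
# +------------------------+-------------------+
# |ADMIT_DATE              |modified_date      |
# +------------------------+-------------------+
# |2021-12-20T00:00:00.000Z|2021-12-20 00:00:00|
# +------------------------+-------------------+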

Convert a string column to timestamp when read into spark

I'm trying to read a CSV file into Spark on Databricks, but my time column is in string format. The entries look like 2019-08-01 23:59:05-07:00, and I want to convert the column to timestamp type. Here's what I tried:
df = (spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv(path_to_file)
      .withColumn("observed", unix_timestamp("dt", "yyyy-MM-dd hh:mm:ss.SSSZ")
                  .cast("double")
                  .cast("timestamp"))
      )
But I got the error message cannot resolve '`dt`' given input columns. I'm guessing I didn't get the "yyyy-MM-dd hh:mm:ss.SSSZ" format right?
Assuming your csv looks like this:
df = spark.createDataFrame([('2019-08-01 23:59:05-07:00',)], ['dt'])
df.show()
+--------------------+
| dt|
+--------------------+
|2019-08-01 23:59:...|
+--------------------+
You can simply parse the timestamp with the to_timestamp function:
from pyspark.sql.functions import to_timestamp
df.withColumn('observed', to_timestamp('dt', "yyyy-MM-dd HH:mm:ssXXX")).show()
+--------------------+-------------------+
| dt| observed|
+--------------------+-------------------+
|2019-08-01 23:59:...|2019-08-02 08:59:05|
+--------------------+-------------------+
So, as @HristoIliev mentioned, the reason behind cannot resolve '`dt`' is that 'dt' is supposed to be the name of a column already in your dataframe, and 'observed' is supposed to be the name of a new column. Even if you adjust the names, though, it still won't work, because of a format mismatch: yyyy-MM-dd hh:mm:ss.SSSZ won't parse 2019-08-01 23:59:05-07:00, but "yyyy-MM-dd HH:mm:ssXXX" will.
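Putting that together with the original read, a sketch of the corrected pipeline, assuming the CSV column is really named dt (adjust to your actual header):
from pyspark.sql.functions import to_timestamp

df = (spark.read
      .option("header", "true")
      .csv(path_to_file)  # path_to_file as in the question
      .withColumn("observed", to_timestamp("dt", "yyyy-MM-dd HH:mm:ssXXX")))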

spark data frame convert a string column to timestamp with given format

When I execute
sparkSession.sql("SELECT to_timestamp('2018-08-04.11:18:29 AM', 'yyyy-MM-dd.hh:mm:ss a') as timestamp")
the AM/PM is missing from the result:
+-------------------+
| timestamp|
+-------------------+
|2018-08-04 11:18:29|
+-------------------+
But if AM/PM is not present in the input, it gives the correct answer.
Using unix_timestamp,
sparkSession.sql("select from_unixtime(unix_timestamp('08-04-2018.11:18:29 AM','dd-MM-yyyy.HH:mm:ss a'), 'dd-MM-yyyy.HH:mm:ss a') as timestamp")
gives the correct answer, but the datatype becomes string, whereas my requirement is to convert to the timestamp datatype without data loss.
Does anyone have suggestions?
Thanks in advance.
The AM/PM is not missing from the Timestamp datatype. It's just showing the time in 24-hour format. You don't lose any information.
For example,
scala> spark.sql("SELECT to_timestamp('2018-08-04.11:18:29 PM', 'yyyy-MM-dd.hh:mm:ss a') as timestamp").show(false)
+-------------------+
|timestamp |
+-------------------+
|2018-08-04 23:18:29|
+-------------------+
Whenever you want your timestamp represented with AM/PM, just use a date/time formatter function
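For instance, a minimal sketch (written in PySpark for consistency with the rest of the page) that uses date_format to render the parsed timestamp back with an AM/PM marker; the result of date_format is a string, which is fine here since it is only for display:
spark.sql("""
    SELECT date_format(
             to_timestamp('2018-08-04.11:18:29 PM', 'yyyy-MM-dd.hh:mm:ss a'),
             'yyyy-MM-dd hh:mm:ss a') AS formatted
""").show(truncate=False)
# +----------------------+
# |formatted             |
# +----------------------+
# |2018-08-04 11:18:29 PM|
# +----------------------+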
The format of the printed representation is fixed (an ISO 8601 compliant string in the local timezone) and cannot be modified.
There is no conversion that can help you here, because any conversion that satisfied the output format would have to convert the data to a string.

PySpark - to_date format from column

I am currently trying to figure out how to pass the string format argument to the to_date PySpark function via a column parameter.
Specifically, I have the following setup:
sc = SparkContext.getOrCreate()
df = sc.parallelize([('a','2018-01-01','yyyy-MM-dd'),
                     ('b','2018-02-02','yyyy-MM-dd'),
                     ('c','02-02-2018','dd-MM-yyyy')]).toDF(
    ["col_name","value","format"])
I am trying to add a new column in which each of the dates from the column F.col("value"), which is a string, is parsed to a date.
Separately for each format, this can be done with
df = df.withColumn("test1",F.to_date(F.col("value"),"yyyy-MM-dd")).\
withColumn("test2",F.to_date(F.col("value"),"dd-MM-yyyy"))
This however gives me two new columns, but I want one column containing both results. Passing the format as a column does not seem to be possible with the to_date function:
df = df.withColumn("test3",F.to_date(F.col("value"),F.col("format")))
Here an error "Column object not callable" is thrown.
Is it possible to have a generic approach for all possible formats (so that I do not have to manually add new columns for each format)?
You can use a column value as a parameter without a udf using the spark-sql syntax:
Spark version 2.2 and above
from pyspark.sql.functions import expr
df.withColumn("test3",expr("to_date(value, format)")).show()
#+--------+----------+----------+----------+
#|col_name| value| format| test3|
#+--------+----------+----------+----------+
#| a|2018-01-01|yyyy-MM-dd|2018-01-01|
#| b|2018-02-02|yyyy-MM-dd|2018-02-02|
#| c|02-02-2018|dd-MM-yyyy|2018-02-02|
#+--------+----------+----------+----------+
Or equivalently using pyspark-sql:
df.createOrReplaceTempView("df")
spark.sql("select *, to_date(value, format) as test3 from df").show()
Spark version 1.5 and above
Older versions of Spark do not support a format argument to the to_date function, so you'll have to use unix_timestamp and from_unixtime:
from pyspark.sql.functions import expr
df.withColumn(
    "test3",
    expr("from_unixtime(unix_timestamp(value,format))").cast("date")
).show()
Or equivalently using pyspark-sql:
df.createOrReplaceTempView("df")
spark.sql(
    "select *, cast(from_unixtime(unix_timestamp(value,format)) as date) as test3 from df"
).show()
As far as I know, your problem requires a udf (user defined function) to apply the correct format. But inside a udf you cannot directly use Spark functions like to_date, so I created a little workaround. First, the udf performs the Python date conversion with the appropriate format from the column and converts the result to an ISO format string. Then another withColumn converts the ISO date to the correct format in column test3. However, you have to adapt the format in the original column to match the Python date format strings, e.g. yyyy -> %Y, MM -> %m, ...
import datetime
from pyspark.sql.functions import udf, col, to_date

test_df = spark.createDataFrame([
    ('a','2018-01-01','%Y-%m-%d'),
    ('b','2018-02-02','%Y-%m-%d'),
    ('c','02-02-2018','%d-%m-%Y')
], ("col_name","value","format"))

def map_to_date(s, format):
    return datetime.datetime.strptime(s, format).isoformat()

myudf = udf(map_to_date)

test_df.withColumn("test3", myudf(col("value"), col("format")))\
    .withColumn("test3", to_date("test3")).show(truncate=False)
Result:
+--------+----------+--------+----------+
|col_name|value |format |test3 |
+--------+----------+--------+----------+
|a |2018-01-01|%Y-%m-%d|2018-01-01|
|b |2018-02-02|%Y-%m-%d|2018-02-02|
|c |02-02-2018|%d-%m-%Y|2018-02-02|
+--------+----------+--------+----------+
You don't need the format column at all. You can use coalesce to check for all possible options:
from pyspark.sql import functions as F

def get_right_date_format(date_string):
    return F.coalesce(
        F.to_date(date_string, 'yyyy-MM-dd'),
        F.to_date(date_string, 'dd-MM-yyyy'),
        F.to_date(date_string, 'yyyy-dd-MM')
    )

df = sc.parallelize([('a','2018-01-01'),
                     ('b','2018-02-02'),
                     ('c','2018-21-02'),
                     ('d','02-02-2018')]).toDF(["col_name","value"])

df = df.withColumn("formatted_data", get_right_date_format(df.value))
The issue with this approach, though, is that a date like 2020-02-01 would be treated as 1st Feb 2020, when it is likely that 2nd Jan 2020 is also possible.
Just an alternative approach!

Timestamp abbreviated date format natively in Spark

I'm successfully converting numerical date formats (y-m-d, yyyyMMdd, etc.) to timestamps in Spark using sql.functions.unix_timestamp.
The problem is when the date uses an abbreviated name of a month or a day, like
1991-Aug-09 Fri
Is there any way to achieve the conversion using only native Spark functions?
(Disclaimer: I know I can do it with Python functions, it's just curiosity.)
You can use the
yyyy-MMM-dd EEE
format (reference: SimpleDateFormat) with unix_timestamp:
spark.sql("SELECT CAST(unix_timestamp('1991-Aug-09 Fri', 'yyyy-MMM-dd EEE') AS TIMESTAMP)").show()
+-------------------------------------------------------------------+
|CAST(unix_timestamp(1991-Aug-09 Fri, yyyy-MMM-dd EEE) AS TIMESTAMP)|
+-------------------------------------------------------------------+
|                                                1991-08-09 00:00:00|
+-------------------------------------------------------------------+
or to_date / to_timestamp (Spark 2.2 or later):
spark.sql("SELECT to_date('1991-Aug-09 Fri', 'yyyy-MMM-dd EEE')").show()
+---------------------------------------------+
|to_date('1991-Aug-09 Fri', 'yyyy-MMM-dd EEE')|
+---------------------------------------------+
|                                   1991-08-09|
+---------------------------------------------+
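The same thing can be done through the PySpark DataFrame API. A sketch, with the caveat that on Spark 3.x the day-of-week letters (E) are not allowed for parsing by the default parser, so this assumes Spark 2.x or spark.sql.legacy.timeParserPolicy=LEGACY:
from pyspark.sql import functions as F

df = spark.createDataFrame([("1991-Aug-09 Fri",)], ["raw"])

# parse the abbreviated month and day names (may require the legacy parser on Spark 3+)
df.withColumn("parsed", F.to_date("raw", "yyyy-MMM-dd EEE")).show()
# +---------------+----------+
# |            raw|    parsed|
# +---------------+----------+
# |1991-Aug-09 Fri|1991-08-09|
# +---------------+----------+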
