spark - detailed explanation of the date_format function - apache-spark

Where is date_format explained in detail, such as what format is accepted for the timestamp or expr argument?
Spark documentation - date_format
date_format(timestamp, fmt) - Converts timestamp to a value of string in the format specified by the date format fmt.
timestamp - A date/timestamp or string to be converted to the given format.
fmt - Date/time format pattern to follow. See Datetime Patterns for valid date and time format patterns.
Databricks documentation - date_format
date_format(expr, fmt)
expr: A DATE, TIMESTAMP, or a STRING in a valid datetime format.
fmt: A STRING expression describing the desired format.
2007-11-13 is OK for the timestamp expression but 2007-NOV-13 is not. Where is the explanation of this behavior?
spark.sql("select date_format(date '2007-11-13', 'MMM') AS month_text").show()
+----------+
|month_text|
+----------+
|       Nov|
+----------+
spark.sql("select date_format(date '2007-JAN-13', 'MMM') AS month_text").show()
...
ParseException:
Cannot parse the DATE value: 2007-JAN-13(line 1, pos 19)
I suppose the timestamp expression needs to be ISO 8601, but it should be documented somewhere.
The timestamp expression can be date '2007-11-13' or timestamp '2007-11-13', but where can I find information about this expression format?

date '...' is a datetime literal and it has to follow a specific pattern. It's mostly used when you need to hardcode a date or when you already have a datetime in ISO 8601 format.
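As a rough illustration (a minimal sketch in PySpark; the exact grammar of date and timestamp literals is documented on the Literals page of the Spark SQL reference), ISO-8601-style literals parse, while month names do not:
# date/timestamp literals must be ISO-8601-like: year-month-day, with an optional time part
spark.sql("SELECT date '2007-11-13' AS d, timestamp '2007-11-13 10:30:00' AS ts").show()
# a month name is not part of the literal grammar, so this raises a ParseException:
# spark.sql("SELECT date '2007-JAN-13' AS d").show()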
To parse a date from a string you use the to_date function. To convert a string representation from one format to another, first parse it with to_date, then format it with date_format. In your case:
scala> spark.sql("select date_format(to_date('2007-JAN-13', 'yyyy-MMM-dd'), 'MMM') AS month_text").show()
+----------+
|month_text|
+----------+
|       Jan|
+----------+

Related

spark dataframe: date formatting not working

I have a CSV file in which a date column has values like 01080600, basically MM-dd-HH-mm.
I want to add a column to the dataframe which shows this in a more readable format.
I do:
spark.sql("SELECT date...")
.withColumn("readable date", to_date(col("date"), "MM:dd HH:mm"))
.show(10)
But readable date comes back as null.
What am I missing here?
When parsing or converting to a date or timestamp, the format you provide has to follow the pattern of your input. In your case you need to modify the format as shown below; the result can then be rendered in whatever final shape you want for your date column using date_format.
References to the various patterns and parsing rules can be found in the Spark Datetime Patterns documentation.
To Timestamp
spark.sql("""
SELECT
TO_TIMESTAMP('01080600','ddMMhhmm') as date,
DATE_FORMAT(TO_TIMESTAMP('01080600','ddMMhhmm'),'MM/dd hh:mm') as formated_date
""").show()
+-------------------+-------------+
|               date|formated_date|
+-------------------+-------------+
|1970-08-01 06:00:00|  08/01 06:00|
+-------------------+-------------+
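For completeness, the same fix applied to the asker's DataFrame approach might look like the sketch below (PySpark; the column name date is taken from the question, the ddMMHHmm reading of the input follows the answer above, and HH is used instead of hh since this is a 24-hour value):
from pyspark.sql.functions import col, to_timestamp, date_format
# parse the raw 'ddMMHHmm' string, then render it in the desired display format
df = df.withColumn("parsed", to_timestamp(col("date"), "ddMMHHmm"))
df = df.withColumn("readable_date", date_format(col("parsed"), "MM/dd HH:mm"))
df.show(10, truncate=False)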

How to convert unix timestamp in Hive to unix timestamp in Spark for format "yyyy-MM-ddTHH:mm:ss.SSSZ"

One of my tables contains date columns with the format yyyy-MM-ddTHH:mm:ss.SSSZ and I need to convert this into yyyy-MM-dd HH:mm:ss format.
I'm able to convert this in Hive, but when I do the same in Spark it throws an error.
Hive:
select order.admit_date, from_unixtime(unix_timestamp(order.ADMIT_DATE,"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"),'yyyy-MM-dd HH:mm:ss') as ADMIT_DATE
from daily_orders order;
admit_date                 admit_date
------------------------   -------------------
2021-12-20T00:00:00.000Z   2021-12-20 00:00:00
Spark
spark.sql("select order.admit_date, from_unixtime(to_timestamp(order.ADMIT_DATE,"yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"),'yyyy-MM-dd HH:mm:ss') as modified_date from daily_orders order).show();
Output:
:1: error: ')' expected but character literal found.
I have also tried escaping the quotes, but that did not work either.
spark.sql("select order.admit_date, from_unixtime(unix_timestamp(order.ADMIT_DATE,"yyyy-MM-dd\'T\'HH:mm:ss.SSS\'Z\'"),'yyyy-MM-dd HH:mm:ss'),'yyyy-MM-dd HH:mm:ss') as modified_date from daily_orders order limit 10").show()
Output:
:1: error: ')' expected but ':' found.
Is there a common syntax that works in both Hive and Spark? Please suggest.
You have some escaping problems in your query (using " inside another "). You can use a multi-line string (triple quotes) to avoid them.
However, this can actually be done using only the to_timestamp function:
spark.sql("""
select '2021-12-20T00:00:00.000Z' as admit_date,
to_timestamp('2021-12-20T00:00:00.000Z') as modified_date
""").show()
//+------------------------+-------------------+
//|admit_date              |modified_date      |
//+------------------------+-------------------+
//|2021-12-20T00:00:00.000Z|2021-12-20 00:00:00|
//+------------------------+-------------------+
See the docs: Spark Datetime Patterns for Formatting and Parsing.
Edit:
If you want to keep the same syntax as Hive:
spark.sql("""
select '2021-12-20T00:00:00.000Z' as admit_date,
from_unixtime(unix_timestamp('2021-12-20T00:00:00.000Z', "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'"), 'yyyy-MM-dd HH:mm:ss') as modified_date
""").show()
//+------------------------+-------------------+
//|admit_date              |modified_date      |
//+------------------------+-------------------+
//|2021-12-20T00:00:00.000Z|2021-12-20 00:00:00|
//+------------------------+-------------------+
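Applied to the actual table and column from the question (daily_orders and ADMIT_DATE, both taken from the question; a sketch only, shown as PySpark), this becomes:
spark.sql("""
select admit_date,
       to_timestamp(admit_date) as modified_date
from daily_orders
""").show(truncate=False)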

Convert UTC timestamp to local time based on time zone in PySpark

I have a PySpark DataFrame, df, with some columns as shown below. The hour column is in UTC time and I want to create a new column that has the local time based on the time_zone column. How can I do that in PySpark?
df
+-------------------------+------------+
| hour                    | time_zone  |
+-------------------------+------------+
|2019-10-16T20:00:00+0000 | US/Eastern |
|2019-10-15T23:00:00+0000 | US/Central |
+-------------------------+------------+
#What I want:
+-------------------------+------------+---------------------+
| hour                    | time_zone  | local_time          |
+-------------------------+------------+---------------------+
|2019-10-16T20:00:00+0000 | US/Eastern | 2019-10-16T15:00:00 |
|2019-10-15T23:00:00+0000 | US/Central | 2019-10-15T17:00:00 |
+-------------------------+------------+---------------------+
You can use the built-in from_utc_timestamp function. Note that the hour column needs to be passed to the function as a string without the timezone suffix.
The code below works for Spark versions 2.4 and above.
from pyspark.sql.functions import from_utc_timestamp, split
# strip the '+0000' suffix, then convert from UTC to the per-row time zone (Spark 2.4+)
df.select(from_utc_timestamp(split(df.hour, r'\+')[0], df.time_zone).alias('local_time')).show()
For spark versions before 2.4, you have to pass in a constant string representing the time zone, as the second argument, to the function.
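A minimal sketch of the pre-2.4 form (the zone ID literal here is only an illustration):
from pyspark.sql.functions import from_utc_timestamp, split
# Spark < 2.4: the second argument must be a constant zone ID string, not a column
df.select(from_utc_timestamp(split(df.hour, r'\+')[0], "US/Eastern").alias('local_time')).show()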
Documentation
pyspark.sql.functions.from_utc_timestamp(timestamp, tz)
This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE. This function takes a timestamp which is timezone-agnostic, and interprets it as a timestamp in UTC, and renders that timestamp as a timestamp in the given time zone.
However, timestamp in Spark represents the number of microseconds from the Unix epoch, which is not timezone-agnostic. So in Spark this function just shifts the timestamp value from the UTC timezone to the given timezone.
This function may return a confusing result if the input is a string with a timezone, e.g. '2018-03-13T06:18:23+00:00'. The reason is that Spark first casts the string to a timestamp according to the timezone in the string, and finally displays the result by converting the timestamp to a string according to the session local timezone.
Parameters
timestamp – the column that contains timestamps
tz – a string that has the ID of timezone, e.g. “GMT”, “America/Los_Angeles”, etc
Changed in version 2.4: tz can take a Column containing timezone ID strings.
You should also be able to use a Spark UDF.
from pytz import timezone
from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def mytime(x, y):
    # parse the UTC string (e.g. 2019-10-16T20:00:00+0000), then shift it to the target zone
    dt = datetime.strptime(x, "%Y-%m-%dT%H:%M:%S%z")
    return dt.astimezone(timezone(y)).strftime("%Y-%m-%dT%H:%M:%S")

mytimeUDF = udf(mytime, StringType())
df = df.withColumn('local_time', mytimeUDF("hour", "time_zone"))

check if the column is having a valid data or not in spark

I have a date column which has the date in YYYYMM format when I take the data from a file, but after I convert it to a dataframe I have to check whether the data is valid, which means checking whether it is in YYYYMMDD format or not; otherwise I have to replace it with the default date 9999-12-31. Here is an example. This is how my case statement has to look:
case when is_valid_date(date) is not null then date else 9999-12-31
I need to create a simple function called is_valid_date to check whether the date value is valid or not.
Input table:
ID  date
1   12345

Expected output:
ID  date
1   9999-12-31
If I understood your question properly, below is my approach. You are not required to build a function; rather, you can build an expression with built-in functions and pass that expression.
val df = Seq("12345", "20190312", "3", "4", "5").toDF("col1")
import org.apache.spark.sql.functions._
/*
 * Checks whether the given raw value is in the expected date format.
 * If it is not, the value is replaced with the default value.
 *
 * Note: change the date format according to your requirement.
 */
val condExp = when(to_date(col("col1"), "yyyyMMdd").isNull, lit("9999-12-31")).otherwise(col("col1"))
df.withColumn("col2", condExp).show()
Result
+--------+----------+
|    col1|      col2|
+--------+----------+
|   12345|9999-12-31|
|20190312|  20190312|
|       3|9999-12-31|
|       4|9999-12-31|
|       5|9999-12-31|
+--------+----------+
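If you also want the valid values normalized to the same yyyy-MM-dd shape as the default, one possible variation (a sketch only, not part of the original answer; written with spark.sql so it reads the same from Scala or PySpark) is:
spark.sql("""
select col1,
       coalesce(date_format(to_date(col1, 'yyyyMMdd'), 'yyyy-MM-dd'), '9999-12-31') as col2
from values ('12345'), ('20190312'), ('3') as t(col1)
""").show()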

spark data frame convert a string column to timestamp with given format

When I execute:
sparkSession.sql("SELECT to_timestamp('2018-08-04.11:18:29 AM', 'yyyy-MM-dd.hh:mm:ss a') as timestamp")
AM/PM is missing from the output:
+-------------------+
|          timestamp|
+-------------------+
|2018-08-04 11:18:29|
+-------------------+
but if AM/PM is not present, then it gives the correct answer.
Using unix_timestamp:
sparkSession.sql("select from_unixtime(unix_timestamp('08-04-2018.11:18:29 AM','dd-MM-yyyy.HH:mm:ss a'), 'dd-MM-yyyy.HH:mm:ss a') as timestamp")
gives the correct answer, but the datatype becomes string, whereas my requirement is to convert to the timestamp datatype without data loss.
Does anyone have suggestions?
Thanks in advance.
The AM/PM is not missing from the Timestamp datatype. It's just showing the time in 24-hour format. You don't lose any information.
For example,
scala> spark.sql("SELECT to_timestamp('2018-08-04.11:18:29 PM', 'yyyy-MM-dd.hh:mm:ss a') as timestamp").show(false)
+-------------------+
|timestamp          |
+-------------------+
|2018-08-04 23:18:29|
+-------------------+
Whenever you want your timestamp represented with AM/PM, just use a date/time formatter function such as date_format.
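For example, a minimal sketch (PySpark shown; the equivalent Scala call differs only in show(false)) that keeps the column as a timestamp and only formats it for display:
spark.sql("""
SELECT to_timestamp('2018-08-04.11:18:29 PM', 'yyyy-MM-dd.hh:mm:ss a') AS ts,
       date_format(to_timestamp('2018-08-04.11:18:29 PM', 'yyyy-MM-dd.hh:mm:ss a'), 'yyyy-MM-dd hh:mm:ss a') AS ts_am_pm
""").show(truncate=False)
# ts stays a timestamp (shown as 2018-08-04 23:18:29); ts_am_pm is a string such as '2018-08-04 11:18:29 PM'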
The format of the printed representation is fixed (an ISO 8601 compliant string in the local timezone) and cannot be modified.
There is no conversion that can help you here, because any conversion that satisfied the output format would have to convert the data to a string.
