Casting TIMESTAMP data type into DATE only - Databricks

I have a table which includes a Date column and a Time column. Both columns have the TIMESTAMP data type by default, but I'd like to change the data types of the date and time columns to match the formats yyyy-mm-dd and hh:mm:ss respectively, so that the Date column only displays the date and the Time column only displays the time.
I've used cast() to change the Date column to DATE datatype as follows:
cast(occ.OCCURRENCEDATE as DATE)
However, I cannot do the same for the Time column, as there is no specific TIME data type. I've tried passing extra parameters to the TIMESTAMP data type so that it only stores the time value, but I cannot figure it out. Any help would be highly appreciated. The solution can be in SQL or Python, thanks!
Edit: Just to clarify, the Date and Time fields are being used for a dashboard. Each field only comes with its respective information, i.e. a date value for the Date column and a time value for the Time column. So it's important that each field only displays that specific information, without the auto-assigned values they would get as TIMESTAMP data.

You could use the INTERVAL type to store the time; this is convenient for further processing.
SELECT ts as timestamp,
       cast(ts as date) as date,
       ts - cast(ts as date) as time
FROM (SELECT current_timestamp() as ts);
+-----------------------+----------+----------------------------------+
|timestamp |date |time |
+-----------------------+----------+----------------------------------+
|2023-02-01 19:41:24.068|2023-02-01|19 hours 41 minutes 24.068 seconds|
+-----------------------+----------+----------------------------------+
root
|-- timestamp: timestamp (nullable = false)
|-- date: date (nullable = false)
|-- time: interval (nullable = false)
Or, if you only care about formatting the values as strings, just use the date_format function.
SELECT ts as timestamp,
       date_format(ts, 'yyyy-MM-dd') as date,
       date_format(ts, 'HH:mm:ss') as time
FROM (SELECT current_timestamp() as ts);
+-----------------------+----------+--------+
|timestamp |date |time |
+-----------------------+----------+--------+
|2023-02-03 14:42:17.845|2023-02-03|14:42:17|
+-----------------------+----------+--------+
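If the dashboard is fed from PySpark rather than SQL, the same split can be done with the DataFrame API. A minimal sketch (the ts column name and the one-row frame are illustrative only):
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp, date_format, col

spark = SparkSession.builder.getOrCreate()

# Illustrative frame with a single TIMESTAMP column named "ts".
df = spark.range(1).select(current_timestamp().alias("ts"))

result = df.select(
    col("ts").cast("date").alias("date"),              # DATE value only
    date_format(col("ts"), "HH:mm:ss").alias("time")   # time of day as a string
)
result.show(truncate=False)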

Just another way to get the time information as a string:
select
  transform(
    array(current_timestamp()),
    ts -> concat_ws(
      ':',
      transform(
        array(hour(ts), minute(ts), second(ts)),
        hms -> right('0' || hms, 2)
      )
    )
  )[0] as time

Related

Convert interval type to string in PySpark

I want to write a Spark dataframe to a file, but I don't like the format of my column containing an interval:
INTERVAL '0 01:02:10.237' DAY TO SECOND
I would rather have:
01:02:10.237
How can I format/cast the column to return my preferred format as a string?
The column is of type
interval day to second (nullable = true)
The date_format function unfortunately requires a timestamp type.
Use regex
df.show()
+-----------------------------------+
|duration |
+-----------------------------------+
|INTERVAL '0 06:18:05' DAY TO SECOND|
+-----------------------------------+
from pyspark.sql.functions import regexp_extract

df.withColumn('duration_new', regexp_extract('duration', r'\d{2}:\d{2}:\d{2}', 0)).show(truncate=False)
+-----------------------------------+------------+
|duration |duration_new|
+-----------------------------------+------------+
|INTERVAL '0 06:18:05' DAY TO SECOND|06:18:05 |
+-----------------------------------+------------+
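If the fractional seconds from the question (01:02:10.237) need to be kept, a slightly wider pattern works too. A hedged sketch in PySpark, reusing the duration column from the example above and casting the interval to string explicitly before matching:
from pyspark.sql.functions import col, regexp_extract

df.withColumn(
    'duration_new',
    # cast the interval to string first, then grab hh:mm:ss with an optional fraction
    regexp_extract(col('duration').cast('string'), r'\d{2}:\d{2}:\d{2}(?:\.\d+)?', 0)
).show(truncate=False)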

Spark DataFrame: date formatting not working

I have a CSV file in which a date column has values like 01080600, basically MM-dd-HH-mm.
I want to add a column in the dataframe which shows this in a more readable format.
I do:
spark.sql("SELECT date...")
.withColumn("readable date", to_date(col("date"), "MM:dd HH:mm"))
.show(10)
But readable date is returned as null.
What am I missing here?
When formatting or converting to a date or timestamp, you need to provide a pattern that matches the layout of your input. In your case you need to modify the format as shown below; the result can then be reshaped into whatever final form you want your date column to take, using date_format.
References to the various datetime patterns and their parsing behaviour can be found in the Spark datetime-pattern documentation.
To Timestamp
sql.sql("""
SELECT
TO_TIMESTAMP('01080600','ddMMhhmm') as date,
DATE_FORMAT(TO_TIMESTAMP('01080600','ddMMhhmm'),'MM/dd hh:mm') as formated_date
""").show()
+-------------------+-------------+
| date|formated_date|
+-------------------+-------------+
|1970-08-01 06:00:00| 08/01 06:00|
+-------------------+-------------+
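For the DataFrame API route, here is a hedged sketch that follows the question's stated MM-dd-HH-mm layout instead (the column names are illustrative, and the pattern would need adjusting if the real layout differs):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp, date_format

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("01080600",)], ["date"])

readable = (
    df.withColumn("parsed", to_timestamp(col("date"), "MMddHHmm"))             # year defaults to 1970
      .withColumn("readable_date", date_format(col("parsed"), "MM/dd HH:mm"))
)
readable.show(truncate=False)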

Convert UTC timestamp to local time based on time zone in PySpark

I have a PySpark DataFrame, df, with some columns as shown below. The hour column is in UTC time and I want to create a new column that has the local time based on the time_zone column. How can I do that in PySpark?
df
+-------------------------+------------+
| hour | time_zone |
+-------------------------+------------+
|2019-10-16T20:00:00+0000 | US/Eastern |
|2019-10-15T23:00:00+0000 | US/Central |
+-------------------------+------------+
#What I want:
+-------------------------+------------+---------------------+
| hour | time_zone | local_time |
+-------------------------+------------+---------------------+
|2019-10-16T20:00:00+0000 | US/Eastern | 2019-10-16T15:00:00 |
|2019-10-15T23:00:00+0000 | US/Central | 2019-10-15T17:00:00 |
+-------------------------+------------+---------------------+
You can use the in-built from_utc_timestamp function. Note that the hour column needs to be passed in as a string without timezone to the function.
The code below works for Spark versions starting with 2.4.
from pyspark.sql.functions import *
df.select(from_utc_timestamp(split(df.hour, r'\+')[0], df.time_zone).alias('local_time')).show()
For Spark versions before 2.4, you have to pass a constant string representing the time zone as the second argument to the function.
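For example, a minimal sketch of the pre-2.4 form with a hard-coded zone (illustrative only; every row would get the same zone):
from pyspark.sql.functions import from_utc_timestamp, split

df.select(from_utc_timestamp(split(df.hour, r'\+')[0], 'US/Eastern').alias('local_time')).show()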
Documentation
pyspark.sql.functions.from_utc_timestamp(timestamp, tz)
This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE. This function takes a timestamp which is timezone-agnostic, interprets it as a timestamp in UTC, and renders that timestamp as a timestamp in the given time zone.
However, a timestamp in Spark represents the number of microseconds from the Unix epoch, which is not timezone-agnostic. So in Spark this function just shifts the timestamp value from the UTC time zone to the given time zone.
This function may return a confusing result if the input is a string with a timezone, e.g. ‘2018-03-13T06:18:23+00:00’. The reason is that Spark first casts the string to a timestamp according to the timezone in the string, and finally displays the result by converting the timestamp to a string according to the session-local timezone.
Parameters
timestamp – the column that contains timestamps
tz – a string that has the ID of timezone, e.g. “GMT”, “America/Los_Angeles”, etc
Changed in version 2.4: tz can take a Column containing timezone ID strings.
You should also be able to use a Spark UDF.
from pytz import timezone
from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def mytime(x, y):
    dt = datetime.strptime(x, "%Y-%m-%dT%H:%M:%S%z")
    # convert to the row's time zone and return a string, matching the StringType return type
    return dt.astimezone(timezone(y)).strftime("%Y-%m-%dT%H:%M:%S")

mytimeUDF = udf(mytime, StringType())
df = df.withColumn('local_time', mytimeUDF("hour", "time_zone"))

Check if a column has valid data or not in Spark

I have a date column which contains dates in YYYYMMDD format when I read the data from a file, but after I convert it to a DataFrame I have to check whether the data is valid, i.e. whether it really is in YYYYMMDD form or not. If it is not, I have to replace it with the default date 9999-12-31. Here is an example; this is how my case statement has to look:
case when is_valid_date(date) is not null then date else '9999-12-31' end
I need to create a simple function called is_valid_date to check whether the date value is valid or not.
Input table
+---+-----+
| ID| date|
+---+-----+
|  1|12345|
+---+-----+
Expected output
+---+----------+
| ID|      date|
+---+----------+
|  1|9999-12-31|
+---+----------+
If I understood your question properly, below is my approach. You don't need to build a function; you can build an expression with built-in functions and pass that expression.
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq("12345", "20190312", "3", "4", "5").toDF("col1")

/*
 * Checks whether the given raw value is in the expected date format;
 * if not, it is replaced with the default value.
 *
 * Note: change the date format ("yyyyMMdd") according to your requirement.
 */
val condExp = when(to_date(col("col1"), "yyyyMMdd").isNull, lit("9999-12-31")).otherwise(col("col1"))
df.withColumn("col2", condExp).show()
Result
+--------+----------+
| col1| col2|
+--------+----------+
| 12345|9999-12-31|
|20190312| 20190312|
| 3|9999-12-31|
| 4|9999-12-31|
| 5|9999-12-31|
+--------+----------+
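For reference, a rough PySpark equivalent of the same expression (illustrative only; it assumes the default non-ANSI setting, where values that do not parse simply come back as null):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, when, lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("12345",), ("20190312",)], ["col1"])

# Replace anything that does not parse as a yyyyMMdd date with the default.
cond_exp = when(to_date(col("col1"), "yyyyMMdd").isNull(), lit("9999-12-31")).otherwise(col("col1"))
df.withColumn("col2", cond_exp).show()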

Parsing a non-ISO datetime string to just the date part in Presto

I have a table which stores datetimes as varchar.
The format looks like this: 2018-07-16 15:00:00.0.
I want to parse this to extract only the date part, so that I can use the date part to compare against a date in string format such as '2018-07-20' in a where clause. What is the best way to achieve this in Presto?
This particular format (based on the example value 2018-07-16 15:00:00.0 in the question) is understood by a cast from varchar to timestamp. You then need to extract the date part with another cast:
presto> SELECT CAST(CAST('2018-07-16 15:00:00.0' AS timestamp) AS date);
_col0
------------
2018-07-16
(1 row)
