Convert UTC timestamp to local time based on time zone in PySpark - apache-spark

I have a PySpark DataFrame, df, with some columns as shown below. The hour column is in UTC time and I want to create a new column that has the local time based on the time_zone column. How can I do that in PySpark?
df
+-------------------------+------------+
| hour | time_zone |
+-------------------------+------------+
|2019-10-16T20:00:00+0000 | US/Eastern |
|2019-10-15T23:00:00+0000 | US/Central |
+-------------------------+------------+
#What I want:
+-------------------------+------------+---------------------+
| hour | time_zone | local_time |
+-------------------------+------------+---------------------+
|2019-10-16T20:00:00+0000 | US/Eastern | 2019-10-16T15:00:00 |
|2019-10-15T23:00:00+0000 | US/Central | 2019-10-15T17:00:00 |
+-------------------------+------------+---------------------+

You can use the built-in from_utc_timestamp function. Note that the hour column needs to be passed to the function as a string without the timezone offset.
The code below works for Spark 2.4 and later.
from pyspark.sql.functions import *

# Strip the '+0000' offset so the value is timezone-agnostic, then convert it to the zone in time_zone.
df.select(from_utc_timestamp(split(df.hour, r'\+')[0], df.time_zone).alias('local_time')).show()
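If you want to keep the original columns and add local_time as in the desired output above, a minimal sketch (same assumptions, Spark 2.4+ since the zone comes from a column):
from pyspark.sql.functions import from_utc_timestamp, split

# Add local_time alongside the existing columns.
df = df.withColumn('local_time', from_utc_timestamp(split(df.hour, r'\+')[0], df.time_zone))
df.show(truncate=False)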
For Spark versions before 2.4, you have to pass a constant string representing the time zone as the second argument to the function.
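For example, a sketch for Spark < 2.4, assuming every row uses the same zone ('US/Eastern' here is only an illustration):
from pyspark.sql.functions import from_utc_timestamp, split

# The zone must be a literal string in Spark < 2.4, so per-row zones are not possible this way.
df.select(from_utc_timestamp(split(df.hour, r'\+')[0], 'US/Eastern').alias('local_time')).show()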
Documentation
pyspark.sql.functions.from_utc_timestamp(timestamp, tz)
This is a common function for databases supporting TIMESTAMP WITHOUT TIMEZONE. This function takes a timestamp which is timezone-agnostic, and interprets it as a timestamp in UTC, and renders that timestamp as a timestamp in the given time zone.
However, timestamp in Spark represents the number of microseconds from the Unix epoch, which is not timezone-agnostic. So in Spark this function just shifts the timestamp value from the UTC timezone to the given timezone.
This function may return a confusing result if the input is a string with a timezone, e.g. '2018-03-13T06:18:23+00:00'. The reason is that Spark first casts the string to a timestamp according to the timezone in the string, and finally displays the result by converting the timestamp to a string according to the session-local timezone.
Parameters
timestamp – the column that contains timestamps
tz – a string that has the ID of timezone, e.g. “GMT”, “America/Los_Angeles”, etc
Changed in version 2.4: tz can take a Column containing timezone ID strings.

You should also be able to use a Spark UDF.
from pytz import timezone
from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def mytime(x, y):
    # Parse the UTC string (including its +0000 offset), convert to the target zone, and return a string.
    dt = datetime.strptime(x, "%Y-%m-%dT%H:%M:%S%z")
    return dt.astimezone(timezone(y)).strftime("%Y-%m-%dT%H:%M:%S")

mytimeUDF = udf(mytime, StringType())
df = df.withColumn('local_time', mytimeUDF("hour", "time_zone"))

Related

spark - detailed explanation of the date_format function

Where is date_format explained in detail, such as what formats are accepted for the timestamp or expr argument?
Spark documentation - date_format
date_format(timestamp, fmt) - Converts timestamp to a value of string in the format specified by the date format fmt.
timestamp - A date/timestamp or string to be converted to the given format.
fmt - Date/time format pattern to follow. See Datetime Patterns for valid date and time format patterns.
Databricks documentation - date_format
date_format(expr, fmt)
expr: A DATE, TIMESTAMP, or a STRING in a valid datetime format.
fmt: A STRING expression describing the desired format.
2007-11-13 is OK for the timestamp expression but 2007-NOV-13 is not. Where is the explanation of this behavior?
spark.sql("select date_format(date '2007-11-13', 'MMM') AS month_text").show()
+----------+
|month_text|
+----------+
| Nov|
+----------+
spark.sql("select date_format(date '2007-JAN-13', 'MMM') AS month_text").show()
...
ParseException:
Cannot parse the DATE value: 2007-JAN-13(line 1, pos 19)
I suppose the timestamp expression needs to be ISO 8601, but it should be documented somewhere.
The timestamp expression can be date '2007-11-13' or timestamp '2007-11-13', but where can I find documentation for this expression format?
date '...' is a datetime literal and it has to follow a specific pattern. It's mostly used when you need to hardcode a date or already have a datetime in ISO 8601 format.
To parse a date from a string, use the to_date function. To convert a string representation from one format to another, first parse it with to_date, then format it with date_format. In your case:
scala> spark.sql("select date_format(to_date('2007-JAN-13', 'yyyy-MMM-dd'), 'MMM') AS month_text").show()
+----------+
|month_text|
+----------+
| Jan|
+----------+
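For reference, a PySpark sketch of the same idea using the DataFrame API (this assumes an active SparkSession named spark; parsing of non-ISO month names can vary slightly across Spark versions and parser policies):
from pyspark.sql.functions import to_date, date_format, lit

# Parse the non-ISO string with to_date, then reformat it with date_format.
spark.range(1).select(
    date_format(to_date(lit('2007-JAN-13'), 'yyyy-MMM-dd'), 'MMM').alias('month_text')
).show()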

Filter pyspark by time difference

I have a dataframe in pyspark that looks like this:
+----------+-------------------+-------+-----------------------+-----------------------+--------+
|Session_Id|Instance_Id |Actions|Start_Date |End_Date |Duration|
+----------+-------------------+-------+-----------------------+-----------------------+--------+
|14252203 |i-051fc2d21fbe001e3|2 |2019-12-17 01:07:30.000|2019-12-17 01:07:30.000|0 |
|43024091 |i-051fc2d21fbe001e3|2 |2019-12-17 01:08:00.000|2019-12-17 01:08:00.000|0 |
|50961995 |i-0c733c7e356bc1615|2 |2019-12-17 01:07:30.000|2019-12-17 01:07:30.000|0 |
|56308963 |i-0c733c7e356bc1615|2 |2019-12-17 01:08:00.000|2019-12-17 01:08:00.000|0 |
|60120472 |i-0c733c7e356bc1615|2 |2019-12-17 01:07:30.000|2019-12-17 01:07:30.000|0 |
|69132492 |i-051fc2d21fbe001e3|2 |2019-12-17 01:07:30.000|2019-12-17 01:07:30.000|0 |
+----------+-------------------+-------+-----------------------+-----------------------+--------+
I'm trying to filter any rows that are too recent with this:
now = datetime.datetime.now()
filtered = grouped.filter(f.abs(f.unix_timestamp(now) - f.unix_timestamp(datetime.datetime.strptime(f.col('End_Date')[:-4], '%Y-%m-%d %H:%M:%S'))) > 100)
which transforms End_Date into a timestamp, calculates the difference between now and End_Date, and filters out anything less than 100 seconds. I got this from Filter pyspark dataframe based on time difference between two columns.
Every time I run this, I get this error:
TypeError: Invalid argument, not a string or column: 2019-12-19 18:55:13.268489 of type <type 'datetime.datetime'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
How can I filter by comparing timestamps?
I think you're confusing Python functions with Spark functions. The unix_timestamp function requires a string or Column object, but you're passing a Python datetime object; that's why you get that error.
Instead, use the Spark built-in functions: current_date, which gives you a column with the current date, and to_date to convert the End_Date column to a date.
This should work fine for you:
from pyspark.sql.functions import abs, col, current_date, to_date, unix_timestamp

filtered = grouped.filter(abs(unix_timestamp(current_date()) - unix_timestamp(to_date(col('End_Date'), 'yyyy-MM-dd HH:mm:ss'))) > 100)
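If you need second-level precision rather than comparing calendar dates (the question filters on a 100-second difference), a variant sketch using current_timestamp and to_timestamp instead (not from the original answer; the format string assumes the millisecond suffix shown in the sample data):
from pyspark.sql import functions as F

# Keep only rows whose End_Date is more than 100 seconds away from the current timestamp.
filtered = grouped.filter(
    F.abs(
        F.unix_timestamp(F.current_timestamp())
        - F.unix_timestamp(F.to_timestamp(F.col('End_Date'), 'yyyy-MM-dd HH:mm:ss.SSS'))
    ) > 100
)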

Check if a column has valid data or not in Spark

I have a date column that contains dates in YYYYMMDD format when I take the data from a file, but after I convert it to a DataFrame I have to check whether the data is valid, i.e. whether it is really in YYYYMMDD format. Otherwise I have to replace it with the default date 9999-12-31. Here is an example of how my case statement has to look:
case when is_valid_date(date) is not null then date else 9999-12-31.
I need to create a simple function called is_valid_date to check whether the date value is valid or not.
Input table:
ID  date
1   12345
Expected output:
ID  date
1   9999-12-31
If I understood your question properly, below is my approach. You don't need to build a function; you can build an expression with built-in functions and pass that expression in.
val df = Seq("12345", "20190312", "3", "4", "5").toDF("col1")
import org.apache.spark.sql.functions._
/*
 * Checks whether the given raw data is in the expected date format.
 * If it is not, the value is replaced with the default date.
 *
 * Note: change the date format according to your requirement
 * ("yyyyMMdd" here -- capital MM is months; lowercase mm would mean minutes)
 */
val condExp = when(to_date(col("col1"), "yyyyMMdd") isNull, lit("9999-12-31")).otherwise(col("col1"))
df.withColumn("col2", condExp).show()
Result
+--------+----------+
| col1| col2|
+--------+----------+
| 12345|9999-12-31|
|20190312| 20190312|
| 3|9999-12-31|
| 4|9999-12-31|
| 5|9999-12-31|
+--------+----------+
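For reference, a PySpark sketch of the same expression (my own translation, not from the original answer; the exact null-vs-error behavior for unparseable strings can vary with Spark version and ANSI settings):
from pyspark.sql import functions as F

df = spark.createDataFrame([("12345",), ("20190312",), ("3",), ("4",), ("5",)], ["col1"])

# Rows that fail to parse as yyyyMMdd get the default date string.
cond_exp = F.when(F.to_date(F.col("col1"), "yyyyMMdd").isNull(), F.lit("9999-12-31")).otherwise(F.col("col1"))
df.withColumn("col2", cond_exp).show()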

Convert Timestamp to date using Cassandra query

I need to convert the timestamp '1998/02/12 00:00:00' to the date '1998-02-12' using a Cassandra query. Can anyone help me with this?
Is it possible or not?
You can use the toDate function in CQL to get a date out of a datetime.
For example, if your table entry looks like:
id | datetime | value
-------------+---------------------------------+-------
22170825421 | 2018-02-15 14:06:01.000000+0000 | 50
You can run the following query:
select id, datetime, toDate(datetime) as day, value from datatable;
and it will give you:
id | datetime | day | value
-------------+---------------------------------+------------+-------
22170825421 | 2018-02-15 14:06:01.000000+0000 | 2018-02-15 | 50
You can't do it directly in Cassandra, as it accepts dates as YYYY-MM-dd, so you need to use some other method (depending on the language you're using) to convert your string into this format.
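For example, a minimal Python sketch of that conversion, done client-side outside of CQL:
from datetime import datetime

# Convert the slash-separated timestamp string into the YYYY-MM-dd form Cassandra accepts.
raw = '1998/02/12 00:00:00'
day = datetime.strptime(raw, '%Y/%m/%d %H:%M:%S').strftime('%Y-%m-%d')
print(day)  # 1998-02-12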

CQLSH: Converting unix timestamp to datetime

I am performing a CQL query on a column that stores the values as a unix timestamp, but I want the results output as a datetime. Is there a way to do this?
i.e. something like the following:
select convertToDateTime(column) from table;
I'm trying to remember if there's an easier, more direct route. But if you have a table with a UNIX timestamp and want to show it in a datetime format, you can combine the dateOf and min/maxTimeuuid functions together, like this:
aploetz@cqlsh:stackoverflow2> SELECT datetime,unixtime,dateof(mintimeuuid(unixtime)) FROM unixtime;
datetimetext | unixtime | dateof(mintimeuuid(unixtime))
----------------+---------------+-------------------------------
2015-07-08 | 1436380283051 | 2015-07-08 13:31:23-0500
(1 rows)
aploetz@cqlsh:stackoverflow2> SELECT datetime,unixtime,dateof(maxtimeuuid(unixtime)) FROM unixtime;
datetimetext | unixtime | dateof(maxtimeuuid(unixtime))
----------------+---------------+-------------------------------
2015-07-08 | 1436380283051 | 2015-07-08 13:31:23-0500
(1 rows)
Note that timeuuid stores greater precision than either a UNIX timestamp or a datetime, so you'll need to first convert it to a TimeUUID using either the min or maxtimeuuid function. Then you'll be able to use dateof to convert it to a datetime timestamp.
