Apache Spark: parse ISO-8601 duration (e.g. PT2H5M) in minutes - apache-spark

In ISO 8601, durations are written in formats such as PT5M (5 minutes) or PT2H5M (2 hours 5 minutes). I have a JSON file that contains values in this format. I wanted to know if Spark can extract the duration in minutes. I tried to read it as "DateType" and used the "minute" function to get the minutes, but it returned null values.
Example json
{"name": "Fennel Mushrooms","cookTime":"PT30M"}
Currently, I am reading it as a string and using the "regexp_extract" function. I wanted to know if there is a more efficient way.
https://www.digi.com/resources/documentation/digidocs/90001437-13/reference/r_iso_8601_duration_format.htm
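For reference, the string-plus-regexp_extract approach described above might look roughly like this (a hypothetical sketch; the file path and the output column name are assumptions, only cookTime comes from the example JSON):
from pyspark.sql import functions as F

df = spark.read.json("recipes.json")  # assumed input path
df = df.withColumn(
    "cook_time_minutes",
    F.coalesce(F.regexp_extract("cookTime", r"(\d+)H", 1).cast("int"), F.lit(0)) * 60
    + F.coalesce(F.regexp_extract("cookTime", r"(\d+)M", 1).cast("int"), F.lit(0))
)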

Spark does not provide a way to convert an ISO 8601 duration into an interval, and neither does timedelta in Python's datetime library.
However, pd.Timedelta can parse ISO 8601 durations into time deltas. To support a wider range of ISO 8601 durations, we can wrap pd.Timedelta in a pandas_udf:
from pyspark.sql import functions as F
import pandas as pd

df = spark.createDataFrame([("PT5M",), ("PT50M",), ("PT2H5M",)], ("duration",))

@F.pandas_udf("int")
def parse_iso8601_duration(str_duration: pd.Series) -> pd.Series:
    # parse each ISO 8601 duration string and return its length in minutes
    return str_duration.apply(lambda duration: pd.Timedelta(duration).seconds / 60)

df.withColumn("duration_in_minutes", parse_iso8601_duration(F.col("duration"))).show()
Output
+--------+-------------------+
|duration|duration_in_minutes|
+--------+-------------------+
| PT5M| 5|
| PT50M| 50|
| PT2H5M| 125|
+--------+-------------------+
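A small caveat (my addition, not part of the answer above): pd.Timedelta.seconds only covers the within-day part of a duration, so something like P1DT2H would lose the day component. A variant using total_seconds() avoids that:
from pyspark.sql import functions as F
import pandas as pd

@F.pandas_udf("double")
def parse_iso8601_duration_total(str_duration: pd.Series) -> pd.Series:
    # total_seconds() includes whole days, unlike .seconds which wraps at 24 hours
    return str_duration.apply(lambda d: pd.Timedelta(d).total_seconds() / 60)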

Related

Wrong sequence of months in PySpark sequence interval month

I am trying to create an array of dates containing all months from a minimum date to a maximum date.
Example:
min_date = "2021-05-31"
max_date = "2021-11-30"
.withColumn('array_date', F.expr('sequence(to_date(min_date), to_date(max_date), interval 1 month)'))
But it gives me the following Output:
['2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31', '2021-09-30', '2021-10-31']
Why doesn't the upper limit, 2021-11-30, appear? The documentation says that both bounds are inclusive.
My desired output is:
['2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31', '2021-09-30', '2021-10-31', '2021-11-30']
Thank you!
I think this is related to the timezone. I can reproduce the same behavior in my timezone (Europe/Paris), but when setting the timezone to UTC it gives the expected result:
from pyspark.sql import functions as F

spark.conf.set("spark.sql.session.timeZone", "UTC")

df = spark.createDataFrame([("2021-05-31", "2021-11-30")], ["min_date", "max_date"])

df.withColumn(
    "array_date",
    F.expr("sequence(to_date(min_date), to_date(max_date), interval 1 month)")
).show(truncate=False)
#+----------+----------+------------------------------------------------------------------------------------+
#|min_date |max_date |array_date |
#+----------+----------+------------------------------------------------------------------------------------+
#|2021-05-31|2021-11-30|[2021-05-31, 2021-06-30, 2021-07-31, 2021-08-31, 2021-09-30, 2021-10-31, 2021-11-30]|
#+----------+----------+------------------------------------------------------------------------------------+
Alternatively, you can use TimestampType for start and end parameters of the sequence instead of DateType:
df.withColumn(
    "array_date",
    F.expr("sequence(to_timestamp(min_date), to_timestamp(max_date), interval 1 month)").cast("array<date>")
).show(truncate=False)

How to convert String to Time in PYSPARK?

I am trying to convert a string to a time, but I am getting NULL.
For example:
val Start = '080000'
I used the steps below:
1) unix_timestamp(col('Start'), 'HH:mm:ss')
2) to_timestamp(lit('Start'), 'HH:mm:ss')
3) to_timestamp(col('Start'), 'HH:mm:ss')
4) from_unixtime(unix_timestamp(col('Start'), 'HH:mm:ss'))
Expected Output :
08:00:00 (HH:MM:SS)
Could someone please suggest a method?
Spark does not have a TimeType. The latest version (3.1.1) only has DateType and TimestampType, so the simple answer to your request of converting a String to a Time is: it's impossible.
However, it's possible to convert from 080000 (StringType) to 2000-01-01 08:00:00 (TimestampType) – or any other date, as the date doesn't matter – and you can then perform any kind of comparison you want:
(df
    .withColumn('from_timestamp', F.regexp_replace(F.col('from'), r'(\d{2})(\d{2})(\d{2})', '2000-01-01 $1:$2:$3'))
    .withColumn('to_timestamp', F.regexp_replace(F.col('to'), r'(\d{2})(\d{2})(\d{2})', '2000-01-01 $1:$2:$3'))
    .withColumn('diff', F.to_timestamp(F.col('to_timestamp')) - F.to_timestamp(F.col('from_timestamp')))
    .show()
)
# +------+------+-------------------+-------------------+----------+
# | from| to| from_timestamp| to_timestamp| diff|
# +------+------+-------------------+-------------------+----------+
# |080000|083000|2000-01-01 08:00:00|2000-01-01 08:30:00|30 minutes|
# +------+------+-------------------+-------------------+----------+
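As an alternative sketch (not from the original answer), the same dummy-date idea can be written with substring and to_timestamp instead of a regex, assuming the same from column and df as above:
from pyspark.sql import functions as F

df2 = df.withColumn(
    "from_ts",
    F.to_timestamp(
        F.concat(
            F.lit("2000-01-01 "),
            F.substring("from", 1, 2), F.lit(":"),
            F.substring("from", 3, 2), F.lit(":"),
            F.substring("from", 5, 2),
        ),
        "yyyy-MM-dd HH:mm:ss",
    ),
)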

Convert date time timestamp in Spark dataframe to epoch timestamp

I have a parquet file with a timestamp column in this format, 2020-07-07 18:30:14.500000+00:00, written from pandas. When I read the same parquet file in Spark, it is read as 2020-07-08 00:00:14.5.
I want to convert this into an epoch timestamp in milliseconds, which is 1594146614500.
I have tried using the Java datetime format:
val dtformat = new java.text.SimpleDateFormat("yyyy-MM-dd HH:mm:ss.SSS")
dtformat.parse(r2.getAs[Long]("date_time").toString).getTime
It converts, but to the wrong value (1594146614005) instead of 1594146614500.
To make it correct I have to use dtformat.parse(r2.getAs[Long]("date_time").toString+"00").getTime.
Is there any other, cleaner approach than this?
Is there any function available in Spark to read it as milliseconds?
Update 1:
After using the answer below:
df.withColumn("timestamp", to_timestamp($"date_time", "yyyy-MM-dd HH:mm:ss.SSSSSSXXX")).withColumn("epoch", ($"timestamp".cast("decimal(20, 10)") * 1000).cast("bigint")).show()
+-------------+--------------------+-------------------+-------------+
|expected_time| original_time| timestamp| epoch|
+-------------+--------------------+-------------------+-------------+
|1597763904500|2020-08-18 20:48:...|2020-08-18 20:48:24|1597763904000|
|1597763905000| 2020-08-18 20:48:25|2020-08-18 20:48:25|1597763905000|
|1597763905500|2020-08-18 20:48:...|2020-08-18 20:48:25|1597763905000|
The drawback is that, if the data has 500 ms granularity, two different timestamps end up with the same epoch value, which is not expected.
I recommend you switch from the outdated, error-prone date/time API in java.util and the corresponding formatting API (java.text.SimpleDateFormat) to the modern date/time API in java.time and its formatting API (java.time.format). Learn more about the modern date-time API from Trail: Date Time.
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

public class Main {
    public static void main(String[] args) {
        OffsetDateTime odt = OffsetDateTime.parse("2020-07-07 18:30:14.500000+00:00",
                DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss.SSSSSSZZZZZ"));
        System.out.println(odt.toInstant().toEpochMilli());
    }
}
Output:
1594146614500
With the Spark DataFrame functions:
df.withColumn("timestamp", to_timestamp($"time", "yyyy-MM-dd HH:mm:ss.SSSSSSXXX"))
  .withColumn("epoch", ($"timestamp".cast("decimal(20, 10)") * 1000).cast("bigint"))
  .show(false)
+--------------------------------+---------------------+-------------+
|time |timestamp |epoch |
+--------------------------------+---------------------+-------------+
|2020-07-07 18:30:14.500000+00:00|2020-07-07 18:30:14.5|1594146614500|
+--------------------------------+---------------------+-------------+
This is another possible way to do it.
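For PySpark users, a rough equivalent of the DataFrame snippet above might look like this (a sketch only, reusing the time column name from the example):
from pyspark.sql import functions as F

(df
    .withColumn("timestamp", F.to_timestamp("time", "yyyy-MM-dd HH:mm:ss.SSSSSSXXX"))
    .withColumn("epoch", (F.col("timestamp").cast("decimal(20, 10)") * 1000).cast("bigint"))
    .show(truncate=False))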

pyspark to_timestamp does not include milliseconds

I'm trying to format my timestamp column to include milliseconds without success. How can I format my time to look like this - 2019-01-04 11:09:21.152 ?
I have looked at the documentation and followed SimpleDateFormat, which the pyspark docs say is used by the to_timestamp function.
This is my dataframe.
+--------------------------+
|updated_date |
+--------------------------+
|2019-01-04 11:09:21.152815|
+--------------------------+
I tried using a millisecond format without any success, as below:
>>> df.select('updated_date').withColumn("updated_date_col2",
to_timestamp("updated_date", "YYYY-MM-dd HH:mm:ss:SSS")).show(1,False)
+--------------------------+-------------------+
|updated_date |updated_date_col2 |
+--------------------------+-------------------+
|2019-01-04 11:09:21.152815|2019-01-04 11:09:21|
+--------------------------+-------------------+
I expect updated_date_col2 to be formatted as 2019-01-04 11:09:21.152
I think you can use a UDF and Python's standard datetime module, as below.
import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType

def _to_timestamp(s):
    return datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')

udf_to_timestamp = udf(_to_timestamp, TimestampType())

df.select('updated_date').withColumn("updated_date_col2", udf_to_timestamp("updated_date")).show(1, False)
This is not a solution using to_timestamp, but it lets you easily keep your column in a timestamp format.
The following code is one example of converting a numeric epoch value (seconds with a fractional part) to a timestamp.
from datetime import datetime
ms = datetime.now().timestamp() # ex) ms = 1547521021.83301
df = spark.createDataFrame([(1, ms)], ['obs', 'time'])
df = df.withColumn('time', df.time.cast("timestamp"))
df.show(1, False)
+---+--------------------------+
|obs|time |
+---+--------------------------+
|1 |2019-01-15 12:15:49.565263|
+---+--------------------------+
If you use new Date().getTime() or Date.now() in JS you get epoch milliseconds; datetime.datetime.now().timestamp() in Python gives an epoch value in seconds as a float.
The reason: pyspark's to_timestamp parses only up to seconds, while TimestampType can hold milliseconds.
The following workaround may work:
If the timestamp pattern contains S, invoke a UDF to get the string 'INTERVAL MILLISECONDS' to use in an expression.
from pyspark.sql.functions import to_timestamp, expr

ts_pattern = "YYYY-MM-dd HH:mm:ss:SSS"
my_col_name = "time_with_ms"

# get the time up to seconds
df = df.withColumn(my_col_name, to_timestamp(df["updated_date_col2"], ts_pattern))

# add the milliseconds as an interval
if 'S' in ts_pattern:
    df = df.withColumn(my_col_name, df[my_col_name] + expr("INTERVAL 256 MILLISECONDS"))
To get INTERVAL 256 MILLISECONDS dynamically, we may use a Java UDF:
df = df.withColumn(my_col_name, df[my_col_name] + expr(getIntervalStringUDF(df[my_col_name], ts_pattern)))
Inside the UDF getIntervalStringUDF(String timeString, String pattern):
use SimpleDateFormat to parse the date according to the pattern,
return the formatted date as a string using the pattern "'INTERVAL 'SSS' MILLISECONDS'",
and return 'INTERVAL 0 MILLISECONDS' on parse/format exceptions.
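For illustration only, a hypothetical Python equivalent of the helper described above (using datetime instead of SimpleDateFormat; the function name and pattern are assumptions):
from datetime import datetime

def get_interval_string(time_string, pattern="%Y-%m-%d %H:%M:%S.%f"):
    # build an "INTERVAL n MILLISECONDS" expression from the fractional part of the string
    try:
        parsed = datetime.strptime(time_string, pattern)
        return "INTERVAL {} MILLISECONDS".format(parsed.microsecond // 1000)
    except ValueError:
        # mirror the described fallback on parse/format failures
        return "INTERVAL 0 MILLISECONDS"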

change Unix(Epoch) time to local time in pyspark

I have a dataframe in Spark which contains Unix (epoch) times and also timezone names. I want to convert the epoch time to local time according to the different timezone names.
Here is what my data looks like:
data = [
    (1420088400, 'America/New_York'),
    (1420088400, 'America/Los_Angeles'),
    (1510401180, 'America/New_York'),
    (1510401180, 'America/Los_Angeles')]
df = spark.createDataFrame(data, ["epoch_time", "tz_name"])
df.createOrReplaceTempView("df")

df1 = spark.sql("""select *, from_unixtime(epoch_time) as gmt_time,
    from_utc_timestamp(from_unixtime(epoch_time), tz_name) as local_time
    from df""")
df1.show(truncate=False)
Here is the result:
+----------+-------------------+-------------------+---------------------+
|epoch_time|tz_name |gmt_time |local_time |
+----------+-------------------+-------------------+---------------------+
|1420088400|America/New_York |2015-01-01 05:00:00|2015-01-01 00:00:00.0|
|1420088400|America/Los_Angeles|2015-01-01 05:00:00|2014-12-31 21:00:00.0|
|1510401180|America/New_York |2017-11-11 11:53:00|2017-11-11 06:53:00.0|
|1510401180|America/Los_Angeles|2017-11-11 11:53:00|2017-11-11 03:53:00.0|
+----------+-------------------+-------------------+---------------------+
I'm not quite sure if this conversion is right, but it seems daylight saving time has been taken care of.
Should I first change the epoch time to a time string using from_unixtime, then change it to a UTC timestamp using to_utc_timestamp, and finally change this UTC timestamp to local time with tz_name? I tried this but got an error:
df2 = spark.sql("""select *, from_unixtime(epoch_time) as gmt_time,
from_utc_timestamp(from_unixtime(epoch_time), tz_name) as local_time,
from_utc_timestamp(to_utc_timestamp(from_unixtime(epoch_time),from_unixtime(unix_timestamp(), 'z')), tz_name) as newtime from df""")
How could I check my EMR server's timezone?
I tried the following; is this the server timezone?
spark.sql("select from_unixtime(unix_timestamp(), 'z')").show()
which gave me:
+--------------------------------------------------------------------------+
|from_unixtime(unix_timestamp(current_timestamp(), yyyy-MM-dd HH:mm:ss), z)|
+--------------------------------------------------------------------------+
| UTC|
+--------------------------------------------------------------------------+
Thank you for your clarification.
When you call from_unixtime it will format the date based on your Java runtime's timezone, since it's just using the default timezone for SimpleDateFormat here. In your case it's UTC. So when you convert the values to local time you would only need to call from_utc_timestamp with the tz_name value passed in. However if you were to change your system timezone then you would need to call to_utc_timestamp first.
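A minimal sketch of that suggestion in the DataFrame API (assuming Spark 2.4+, where the tz argument of from_utc_timestamp may be a column, and the column names from the question):
from pyspark.sql import functions as F

df.withColumn(
    "local_time",
    F.from_utc_timestamp(F.from_unixtime("epoch_time"), F.col("tz_name"))
).show(truncate=False)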
Spark 2.2 introduced a timezone setting, so you can set the timezone for your SparkSession like so:
spark.conf.set("spark.sql.session.timeZone", "GMT")
In that case the time functions will use GMT instead of your system timezone; see the Spark source for details.
