Pyspark date intervals and between dates? - apache-spark

In Snowflake/SQL we can do:
SELECT * FROM myTbl
WHERE date_col
BETWEEN
CONVERT_TIMEZONE('UTC','America/Los_Angeles', some_date_string_col)::DATE - INTERVAL '7 DAY'
AND
CONVERT_TIMEZONE('UTC','America/Los_Angeles', some_date_string_col)::DATE - INTERVAL '1 DAY'
Is there a PySpark translation of this for DataFrames?
I imagine something like this:
myDf.filter(
    col(date_col) >= to_utc_timestamp(...)
)
But how can we do BETWEEN and also the interval?

You can use INTERVAL inside a SQL expression like this:
from pyspark.sql import functions as F

df1 = df.filter(
    F.col("date_col").between(
        F.expr("current_timestamp - interval 7 days"),
        F.expr("current_timestamp - interval 1 days"),
    )
)
However, if you only filter by whole days, you can simply use the date_add (or date_sub) function:
from pyspark.sql import functions as F

df1 = df.filter(
    F.col("date_col").between(
        F.date_add(F.current_date(), -7),
        F.date_add(F.current_date(), -1)
    )
)

Related

How to add hours as a variable to a timestamp in PySpark

Dataframe schema is like this:
["id", "t_create", "hours"]
string, timestamp, int
Sample data is like:
["abc", "2022-07-01 12:23:21.343998", 5]
I want to add hours to the t_create and get a new column t_update: "2022-07-01 17:23:21.343998"
Here is my code:
from pyspark.sql.functions import expr

df_cols = ["id", "t_create", "hour"]
df = spark.read.format("delta").load("blablah path")
df = df.withColumn("t_update", df.t_create + expr("INTERVAL 5 HOURS"))
It works without a problem. However, the hours should come from a column rather than being a constant. I could not figure out how to get the column into the expr f-string and the INTERVAL expression, something like:
df = df.withColumn("t_update", df.t_create + expr(f"INTERVAL {df.hours} HOURS"))
df = df.withColumn("t_update", df.t_create + expr(f"INTERVAL {col(df.hours)} HOURS"))
etc... They don't work. Need help here.
Another way is to write a UDF and build the whole INTERVAL string as the UDF's return value:
@udf
def udf_interval(hours):
    return "INTERVAL " + str(hours) + " HOURS"
Then:
df = df.withColumn("t_update", df.t_create + expr(udf_interval(df.hours)))
Now I get TypeError: Column is not iterable.
Stuck. Need help in either the udf or non-udf way. Thanks!
You can do this without the fiddly unix_timestamp route by using MAKE_INTERVAL in Spark SQL.
SparkSQL - TO_TIMESTAMP & MAKE_INTERVAL
spark.sql("""
WITH INP AS (
    SELECT
        "abc" AS id,
        TO_TIMESTAMP("2022-07-01 12:23:21.343998", "yyyy-MM-dd HH:mm:ss.SSSSSS") AS t_create,
        5 AS t_hour
)
SELECT
    id,
    t_create,
    t_hour,
    t_create + MAKE_INTERVAL(0, 0, 0, 0, t_hour, 0, 0) AS t_update
FROM INP
""").show(truncate=False)
+---+--------------------------+------+--------------------------+
|id |t_create |t_hour|t_update |
+---+--------------------------+------+--------------------------+
|abc|2022-07-01 12:23:21.343998|5 |2022-07-01 17:23:21.343998|
+---+--------------------------+------+--------------------------+
Pyspark API
from io import StringIO
import pandas as pd
from pyspark.sql import functions as F

s = StringIO("""
id,t_create,t_hour
abc,2022-07-01 12:23:21.343998,5
""")
df = pd.read_csv(s, delimiter=',')
sparkDF = spark.createDataFrame(df)\
    .withColumn('t_create',
                F.to_timestamp(F.col('t_create'),
                               'yyyy-MM-dd HH:mm:ss.SSSSSS')
    ).withColumn('t_update',
                 F.expr('t_create + MAKE_INTERVAL(0, 0, 0, 0, t_hour, 0, 0)')
    ).show(truncate=False)
+---+--------------------------+------+--------------------------+
|id |t_create |t_hour|t_update |
+---+--------------------------+------+--------------------------+
|abc|2022-07-01 12:23:21.343998|5 |2022-07-01 17:23:21.343998|
+---+--------------------------+------+--------------------------+
A simple way would be to cast the timestamp to bigint (or to decimal if dealing with fractions of a second) and add the number of seconds to it. Here's an example where I've created a column for every calculation for detailed understanding - you can merge all the calculations into a single column.
from pyspark.sql import functions as func

spark.sparkContext.parallelize([("2022-07-01 12:23:21.343998",)]).toDF(['ts_str']). \
    withColumn('ts', func.col('ts_str').cast('timestamp')). \
    withColumn('hours_to_add', func.lit(5)). \
    withColumn('ts_as_decimal', func.col('ts').cast('decimal(20, 10)')). \
    withColumn('seconds_to_add_as_decimal',
               func.col('hours_to_add').cast('decimal(20, 10)') * 3600
               ). \
    withColumn('new_ts_as_decimal',
               func.col('ts_as_decimal') + func.col('seconds_to_add_as_decimal')
               ). \
    withColumn('new_ts', func.col('new_ts_as_decimal').cast('timestamp')). \
    show(truncate=False)
# +--------------------------+--------------------------+------------+---------------------+-------------------------+---------------------+--------------------------+
# |ts_str |ts |hours_to_add|ts_as_decimal |seconds_to_add_as_decimal|new_ts_as_decimal |new_ts |
# +--------------------------+--------------------------+------------+---------------------+-------------------------+---------------------+--------------------------+
# |2022-07-01 12:23:21.343998|2022-07-01 12:23:21.343998|5 |1656678201.3439980000|18000.0000000000 |1656696201.3439980000|2022-07-01 17:23:21.343998|
# +--------------------------+--------------------------+------------+---------------------+-------------------------+---------------------+--------------------------+

Wrong sequence of months in PySpark sequence interval month

I am trying to create an array of dates containing all months from a minimum date to a maximum date.
Example:
min_date = "2021-05-31"
max_date = "2021-11-30"
.withColumn('array_date', F.expr('sequence(to_date(min_date), to_date(max_date), interval 1 month)'))
But it gives me the following Output:
['2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31', '2021-09-30', '2021-10-31']
Why doesn't the upper limit appear on 11/30/2021? In the documentation, it says that the extremes are included.
My desired output is:
['2021-05-31', '2021-06-30', '2021-07-31', '2021-08-31', '2021-09-30', '2021-10-31', '2021-11-30']
Thank you!
I think this is related to the timezone. I can reproduce the same behavior in my timezone (Europe/Paris), but when setting the timezone to UTC it gives the expected result:
from pyspark.sql import functions as F

spark.conf.set("spark.sql.session.timeZone", "UTC")

df = spark.createDataFrame([("2021-05-31", "2021-11-30")], ["min_date", "max_date"])
df.withColumn(
    "array_date",
    F.expr("sequence(to_date(min_date), to_date(max_date), interval 1 month)")
).show(truncate=False)
#+----------+----------+------------------------------------------------------------------------------------+
#|min_date |max_date |array_date |
#+----------+----------+------------------------------------------------------------------------------------+
#|2021-05-31|2021-11-30|[2021-05-31, 2021-06-30, 2021-07-31, 2021-08-31, 2021-09-30, 2021-10-31, 2021-11-30]|
#+----------+----------+------------------------------------------------------------------------------------+
Alternatively, you can use TimestampType for the start and end parameters of sequence instead of DateType:
df.withColumn(
    "array_date",
    F.expr("sequence(to_timestamp(min_date), to_timestamp(max_date), interval 1 month)").cast("array<date>")
).show(truncate=False)

efficient cumulative pivot in pyspark

Is there a more efficient/idiomatic way of rewriting this query:
from pyspark.sql.functions import col, countDistinct, datediff, lit, when

(spark.table('registry_data')
    .withColumn('age_days', datediff(lit(today), col('date')))
    .withColumn('timeframe',
        when(col('age_days') < 7, '1w')
        .when(col('age_days') < 30, '1m')
        .when(col('age_days') < 92, '3m')
        .when(col('age_days') < 183, '6m')
        .when(col('age_days') < 365, '1y')
        .otherwise('1y+')
    )
    .groupby('make', 'model')
    .pivot('timeframe')
    .agg(countDistinct('id').alias('count'))
    .fillna(0)
    .withColumn('1y+', col('1y+')+col('1y')+col('6m')+col('3m')+col('1m')+col('1w'))
    .withColumn('1y', col('1y')+col('6m')+col('3m')+col('1m')+col('1w'))
    .withColumn('6m', col('6m')+col('3m')+col('1m')+col('1w'))
    .withColumn('3m', col('3m')+col('1m')+col('1w'))
    .withColumn('1m', col('1m')+col('1w')))
The gist of the query is for every make/model combination to return the number of entries seen within a set of time periods from today. The period counts are cumulative, i.e. an entry that registered within the last 7 days would be counted for 1 week, 1 month, 3 months, etc.
If you want to use a cumulative sum instead of summing each column, you can replace the code from .groupby onwards with window functions:
from pyspark.sql.window import Window
import pyspark.sql.functions as F

(spark.table('registry_data')
    .withColumn('age_days', F.datediff(F.lit(today), F.col('date')))
    .withColumn('timeframe',
        F.when(F.col('age_days') < 7, '1w')
        .when(F.col('age_days') < 30, '1m')
        .when(F.col('age_days') < 92, '3m')
        .when(F.col('age_days') < 183, '6m')
        .when(F.col('age_days') < 365, '1y')
        .otherwise('1y+')
    )
    .groupBy('make', 'model', 'timeframe')
    .agg(F.countDistinct('id').alias('count'),
         F.max('age_days').alias('max_days'))  # for the orderBy clause
    .withColumn('cumsum',
        F.sum('count').over(Window.partitionBy('make', 'model')
                            .orderBy('max_days')
                            .rowsBetween(Window.unboundedPreceding, 0)))
    .groupBy('make', 'model').pivot('timeframe').agg(F.first('cumsum'))
    .fillna(0))

pyspark sql: how to count rows with multiple conditions

I have a dataframe and, after some operations, the following:
df_new_1 = df_old.filter(df_old["col1"] >= df_old["col2"])
df_new_2 = df_old.filter(df_old["col1"] < df_old["col2"])
print(df_new_1.count(), df_new_2.count())
>> 10, 15
I can find the number of rows individually, as above, by calling count(). But how can I do this with a single PySpark SQL aggregation, i.e. computing both counts in one row? I want the result to look like this:
Row(check1=10, check2=15)
Since you tagged pyspark-sql, you can do the following:
df_old.createOrReplaceTempView("df_table")
spark.sql("""
SELECT sum(int(col1 >= col2)) as check1
, sum(int(col1 < col2)) as check2
FROM df_table
""").collect()
Or use the API functions:
from pyspark.sql.functions import expr
df_old.agg(
expr("sum(int(col1 >= col2)) as check1"),
expr("sum(int(col1 < col2)) as check2")
).collect()

pyspark to_timestamp does not include milliseconds

I'm trying to format my timestamp column to include milliseconds, without success. How can I format my time to look like this: 2019-01-04 11:09:21.152?
I have looked at the documentation and followed SimpleDateFormat, which the PySpark docs say is used by the to_timestamp function.
This is my dataframe.
+--------------------------+
|updated_date |
+--------------------------+
|2019-01-04 11:09:21.152815|
+--------------------------+
I used the millisecond format without any success, as below:
>>> df.select('updated_date').withColumn("updated_date_col2",
to_timestamp("updated_date", "YYYY-MM-dd HH:mm:ss:SSS")).show(1,False)
+--------------------------+-------------------+
|updated_date |updated_date_col2 |
+--------------------------+-------------------+
|2019-01-04 11:09:21.152815|2019-01-04 11:09:21|
+--------------------------+-------------------+
I expect updated_date_col2 to be formatted as 2019-01-04 11:09:21.152
I think you can use a UDF and Python's standard datetime module, as below.
import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import TimestampType

def _to_timestamp(s):
    return datetime.datetime.strptime(s, '%Y-%m-%d %H:%M:%S.%f')

udf_to_timestamp = udf(_to_timestamp, TimestampType())
df.select('updated_date').withColumn("updated_date_col2", udf_to_timestamp("updated_date")).show(1, False)
This is not a solution with to_timestamp, but you can easily keep your column in a timestamp format.
The following code is an example of converting a numerical Unix timestamp (seconds since the epoch, with a fractional part) to a timestamp column.
from datetime import datetime
ms = datetime.now().timestamp()  # e.g. ms = 1547521021.83301 (seconds, not milliseconds)
df = spark.createDataFrame([(1, ms)], ['obs', 'time'])
df = df.withColumn('time', df.time.cast("timestamp"))
df.show(1, False)
+---+--------------------------+
|obs|time |
+---+--------------------------+
|1 |2019-01-15 12:15:49.565263|
+---+--------------------------+
Note that new Date().getTime() and Date.now() in JS return milliseconds (divide by 1000 before casting), while datetime.datetime.now().timestamp() in Python returns seconds, which is what the timestamp cast expects.
Reason: pyspark's to_timestamp parses only up to seconds, while TimestampType can hold milliseconds.
The following workaround may work:
If the timestamp pattern contains S (fractional seconds), invoke a UDF to build the string 'INTERVAL <n> MILLISECONDS' to use in an expression.
ts_pattern = "yyyy-MM-dd HH:mm:ss.SSS"
my_col_name = "time_with_ms"

# get the time up to seconds
df = df.withColumn(my_col_name, to_timestamp(df["updated_date_col2"], ts_pattern))

# add the milliseconds as an interval
if 'S' in ts_pattern:
    df = df.withColumn(my_col_name, df[my_col_name] + expr("INTERVAL 256 MILLISECONDS"))
To get INTERVAL 256 MILLISECONDS we may use a Java UDF:
df = df.withColumn(my_col_name, df[my_col_name] + expr(getIntervalStringUDF(df[my_col_name], ts_pattern)))
Inside the UDF getIntervalStringUDF(String timeString, String pattern):
- use SimpleDateFormat to parse the date according to the pattern
- return the formatted date as a string using the pattern "'INTERVAL 'SSS' MILLISECONDS'"
- return 'INTERVAL 0 MILLISECONDS' on parse/format exceptions
