Spark Window Functions - rangeBetween dates - apache-spark

I have a Spark SQL DataFrame with a date column, and what I'm trying to get is all of the rows preceding the current row within a given date range. For example, I want all the rows from the 7 days preceding the given row. I figured out that I need to use a window function like:
Window \
    .partitionBy('id') \
    .orderBy('start')
I want to have a rangeBetween of 7 days, but there is nothing in the Spark docs I could find on this. Does Spark even provide such an option? For now I'm just getting all the preceding rows with:
.rowsBetween(-sys.maxsize, 0)
but would like to achieve something like:
.rangeBetween("7 days", 0)

Spark >= 2.3
Since Spark 2.3 it is possible to use interval objects with the SQL API, but DataFrame API support is still a work in progress.
df.createOrReplaceTempView("df")
spark.sql(
    """SELECT *, mean(some_value) OVER (
        PARTITION BY id
        ORDER BY CAST(start AS timestamp)
        RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
     ) AS mean FROM df""").show()
## +---+----------+----------+------------------+
## | id| start|some_value| mean|
## +---+----------+----------+------------------+
## | 1|2015-01-01| 20.0| 20.0|
## | 1|2015-01-06| 10.0| 15.0|
## | 1|2015-01-07| 25.0|18.333333333333332|
## | 1|2015-01-12| 30.0|21.666666666666668|
## | 2|2015-01-01| 5.0| 5.0|
## | 2|2015-01-03| 30.0| 17.5|
## | 2|2015-02-01| 20.0| 20.0|
## +---+----------+----------+------------------+
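If you prefer to stay in the DataFrame API, the same interval frame can be pushed through expr() instead of registering a temp view. A minimal sketch, assuming the same df and columns as above (this is not dedicated DataFrame-level rangeBetween support, which is still missing):
from pyspark.sql import functions as F

# The 7-day interval frame, embedded as a SQL window expression via expr()
df.withColumn("mean", F.expr("""
    mean(some_value) OVER (
        PARTITION BY id
        ORDER BY CAST(start AS timestamp)
        RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
    )
""")).show()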
Spark < 2.3
As far as I know it is not possible directly, either in Spark or in Hive. Both require the ORDER BY clause used with RANGE to be numeric. The closest thing I found is converting to a timestamp and operating on seconds. Assuming the start column contains a date type:
from pyspark.sql import Row
from pyspark.sql.functions import col

row = Row("id", "start", "some_value")
df = sc.parallelize([
    row(1, "2015-01-01", 20.0),
    row(1, "2015-01-06", 10.0),
    row(1, "2015-01-07", 25.0),
    row(1, "2015-01-12", 30.0),
    row(2, "2015-01-01", 5.0),
    row(2, "2015-01-03", 30.0),
    row(2, "2015-02-01", 20.0)
]).toDF().withColumn("start", col("start").cast("date"))
A small helper and window definition:
from pyspark.sql.window import Window
from pyspark.sql.functions import mean, col
# Hive timestamp is interpreted as UNIX timestamp in seconds*
days = lambda i: i * 86400
Finally, the query:
w = (Window()
       .partitionBy(col("id"))
       .orderBy(col("start").cast("timestamp").cast("long"))
       .rangeBetween(-days(7), 0))
df.select(col("*"), mean("some_value").over(w).alias("mean")).show()
## +---+----------+----------+------------------+
## | id| start|some_value| mean|
## +---+----------+----------+------------------+
## | 1|2015-01-01| 20.0| 20.0|
## | 1|2015-01-06| 10.0| 15.0|
## | 1|2015-01-07| 25.0|18.333333333333332|
## | 1|2015-01-12| 30.0|21.666666666666668|
## | 2|2015-01-01| 5.0| 5.0|
## | 2|2015-01-03| 30.0| 17.5|
## | 2|2015-02-01| 20.0| 20.0|
## +---+----------+----------+------------------+
Far from pretty but works.
* Hive Language Manual, Types

Spark 3.3 is released, but...
The answer may be as old as Spark 1.5.0:
datediff.
datediff(col_name, '1000') will return an integer difference of days from 1000-01-01 to col_name.
As the first argument, it accepts dates, timestamps and even strings.
As the second, it even accepts 1000.
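A quick, hypothetical sanity check of what that ordering key looks like (an active SparkSession named spark is assumed; '1000-01-01' is spelled out instead of the bare '1000' shorthand):
import pyspark.sql.functions as F

# datediff returns an integer number of days, so it can feed an integer rangeBetween frame
spark.range(1).select(
    F.expr("datediff('2015-01-08', '1000-01-01')").alias("days_since_1000"),
    F.expr("datediff('2015-01-08', '2015-01-01')").alias("plain_diff")  # 7
).show()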
The answer
Date difference in days, depending on the data type of the order column (a complete worked example follows the list):
date
Spark 3.1+
.orderBy(F.expr("unix_date(col_name)")).rangeBetween(-7, 0)
Spark 2.1+
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
timestamp
Spark 2.1+
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
long - UNIX time in microseconds (e.g. 1672534861000000)
Spark 2.1+
.orderBy(F.col("col_name") / 86400_000000).rangeBetween(-7, 0)
long - UNIX time in milliseconds (e.g. 1672534861000)
Spark 2.1+
.orderBy(F.col("col_name") / 86400_000).rangeBetween(-7, 0)
long - UNIX time in seconds (e.g. 1672534861)
Spark 2.1+
.orderBy(F.col("col_name") / 86400).rangeBetween(-7, 0)
long in format yyyyMMdd
Spark 3.3+
.orderBy(F.expr("unix_date(to_date(col_name, 'yyyyMMdd'))")).rangeBetween(-7, 0)
Spark 3.1+
.orderBy(F.expr("unix_date(to_date(cast(col_name as string), 'yyyyMMdd'))")).rangeBetween(-7, 0)
Spark 2.2+
.orderBy(F.expr("datediff(to_date(cast(col_name as string), 'yyyyMMdd'), '1000')")).rangeBetween(-7, 0)
Spark 2.1+
.orderBy(F.unix_timestamp(F.col("col_name").cast('string'), 'yyyyMMdd') / 86400).rangeBetween(-7, 0)
string in date format of 'yyyy-MM-dd'
Spark 3.1+
.orderBy(F.expr("unix_date(to_date(col_name))")).rangeBetween(-7, 0)
Spark 2.1+
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
string in other date format (e.g. 'MM-dd-yyyy')
Spark 3.1+
.orderBy(F.expr("unix_date(to_date(col_name, 'MM-dd-yyyy'))")).rangeBetween(-7, 0)
Spark 2.2+
.orderBy(F.expr("datediff(to_date(col_name, 'MM-dd-yyyy'), '1000')")).rangeBetween(-7, 0)
Spark 2.1+
.orderBy(F.unix_timestamp("col_name", 'MM-dd-yyyy') / 86400).rangeBetween(-7, 0)
string in timestamp format of 'yyyy-MM-dd HH:mm:ss'
Spark 2.1+
.orderBy(F.expr("datediff(col_name, '1000')")).rangeBetween(-7, 0)
string in other timestamp format (e.g. 'MM-dd-yyyy HH:mm:ss')
Spark 2.2+
.orderBy(F.expr("datediff(to_date(col_name, 'MM-dd-yyyy HH:mm:ss'), '1000')")).rangeBetween(-7, 0)

Fantastic solution, @zero323. If you need to operate with minutes instead of days, as I did, and you don't need to partition by id, you only have to modify a small part of the code, as shown:
df.createOrReplaceTempView("df")
spark.sql(
    """SELECT *, sum(total) OVER (
        ORDER BY CAST(reading_date AS timestamp)
        RANGE BETWEEN INTERVAL 45 MINUTES PRECEDING AND CURRENT ROW
     ) AS sum_total FROM df""").show()

Related

Spark 2.3 timestamp subtract milliseconds

I am using Spark 2.3 and I have read here that it does not support timestamp milliseconds (only in 2.4+), but am looking for ideas on how to do what I need to do.
The data I am processing stores dates as String datatype in Parquet files in this format: 2021-07-09T01:41:58Z
I need to subtract one millisecond from that. If it were Spark 2.4, I think I could do something like this:
to_timestamp(col("sourceStartTimestamp")) - expr("INTERVAL 0.001 SECONDS")
But since it is Spark 2.3, that does not do anything. I confirmed it can subtract 1 second, but it ignores any value less than a second.
Can anyone suggest a workaround for how to do this in Spark 2.3? Ultimately, the result will need to be a String data type, if that makes any difference.
Since millisecond timestamps aren't supported by Spark 2.3 (or below), consider using a UDF that takes a delta in millis and a date format, and uses java.time's plusNanos() to get what you need:
def getMillisTS(delta: Long, fmt: String = "yyyy-MM-dd HH:mm:ss.SSS") = udf{ (ts: java.sql.Timestamp) =>
  import java.time.format.DateTimeFormatter
  ts.toLocalDateTime.plusNanos(delta * 1000000).format(DateTimeFormatter.ofPattern(fmt))
}
Test-running the UDF:
val df = Seq("2021-01-01 00:00:00", "2021-02-15 12:30:00").toDF("ts")
df.withColumn("millisTS", getMillisTS(-1)($"ts")).show(false)
/*
+-------------------+-----------------------+
|ts |millisTS |
+-------------------+-----------------------+
|2021-01-01 00:00:00|2020-12-31 23:59:59.999|
|2021-02-15 12:30:00|2021-02-15 12:29:59.999|
+-------------------+-----------------------+
*/
df.withColumn("millisTS", getMillisTS(5000)($"ts")).show(false)
/*
+-------------------+-----------------------+
|ts |millisTS |
+-------------------+-----------------------+
|2021-01-01 00:00:00|2021-01-01 00:00:05.000|
|2021-02-15 12:30:00|2021-02-15 12:30:05.000|
+-------------------+-----------------------+
*/
val df = Seq("2021-01-01T00:00:00Z", "2021-02-15T12:30:00Z").toDF("ts")
df.withColumn(
  "millisTS",
  getMillisTS(-1, "yyyy-MM-dd'T'HH:mm:ss.SSS'Z'")(to_timestamp($"ts", "yyyy-MM-dd'T'HH:mm:ss'Z'"))
).show(false)
/*
+-------------------+------------------------+
|ts |millisTS |
+-------------------+------------------------+
|2021-01-01 00:00:00|2020-12-31T23:59:59.999Z|
|2021-02-15 12:30:00|2021-02-15T12:29:59.999Z|
+-------------------+------------------------+
*/

Select a next or previous record on a dataframe (PySpark)

I have a Spark dataframe that has a list of timestamps (partitioned by uid, ordered by timestamp). Now, I'd like to query the dataframe to get either the previous or the next record.
df = myrdd.toDF().repartition("uid").sort(desc("timestamp"))
df.show()
+------------+-------------------+
|         uid|          timestamp|
+------------+-------------------+
|Peter_Parker|2020-09-19 02:14:40|
|Peter_Parker|2020-09-19 01:07:38|
|Peter_Parker|2020-09-19 00:04:39|
|Peter_Parker|2020-09-18 23:02:36|
|Peter_Parker|2020-09-18 21:58:40|
So for example if I were to query:
ts=datetime.datetime(2020, 9, 19, 0, 4, 39)
I want to get the previous record on (2020-09-18 23:02:36), and only that one.
How can I get the previous one?
It's possible to do it using withColumn() and a diff, but is there a smarter, more efficient way of doing that? I really don't need to calculate the diff for ALL events, since the data is already ordered. I just want the previous/next record.
You can use a filter and order by, and then limit the results to 1 row:
df2 = (df.filter("uid = 'Peter_Parker' and timestamp < timestamp('2020-09-19 00:04:39')")
         .orderBy('timestamp', ascending=False)
         .limit(1))
df2.show()
+------------+-------------------+
| uid| timestamp|
+------------+-------------------+
|Peter_Parker|2020-09-18 23:02:36|
+------------+-------------------+
Or by using row_number after filtering:
from pyspark.sql import Window
from pyspark.sql import functions as F
df1 = df.filter("timestamp < '2020-09-19 00:04:39'") \
    .withColumn("rn", F.row_number().over(Window.orderBy(F.desc("timestamp")))) \
    .filter("rn = 1").drop("rn")
df1.show()
#+------------+-------------------+
#| uid| timestamp|
#+------------+-------------------+
#|Peter_Parker|2020-09-18 23:02:36|
#+------------+-------------------+

Parsing timestamps from string and rounding seconds in spark

I have a spark DataFrame with a column "requestTime", which is a string representation of a timestamp. How can I convert it to get this format: YY-MM-DD HH:MM:SS, knowing that I have the following value: 20171107014824952 (which means : 2017-11-07 01:48:25)?
The part devoted to the seconds is made up of 5 digits; in the example above the seconds part is 24952, while what was displayed in the log file is 25. So I have to round 24.952 up before applying the to_timestamp function, which is why I'm asking for help.
Assuming you have the following spark DataFrame:
df.show()
#+-----------------+
#| requestTime|
#+-----------------+
#|20171107014824952|
#+-----------------+
With the schema:
df.printSchema()
#root
# |-- requestTime: string (nullable = true)
You can use the techniques described in Convert pyspark string to date format to convert this to a timestamp. Since the solution is dependent on your spark version, I've created the following helper function:
import pyspark.sql.functions as f

def timestamp_from_string(date_str, fmt):
    try:
        """For spark version 2.2 and above, to_timestamp is available"""
        return f.to_timestamp(date_str, fmt)
    except (TypeError, AttributeError):
        """For spark version 2.1 and below, you'll have to do it this way"""
        return f.from_unixtime(f.unix_timestamp(date_str, fmt))
Now call it on your data using the appropriate format:
df.withColumn(
    "requestTime",
    timestamp_from_string(f.col("requestTime"), "yyyyMMddhhmmssSSS")
).show()
#+-------------------+
#| requestTime|
#+-------------------+
#|2017-11-07 01:48:24|
#+-------------------+
Unfortunately, this truncates the timestamp instead of rounding.
Therefore, you need to do the rounding yourself before converting. The tricky part is that the number is stored as a string: you'll have to cast it to a double, divide by 1000, round it, cast it back to a long (to drop the decimal; you can't use int because the number is too big), and finally cast it back to a string.
df.withColumn(
    "requestTime",
    timestamp_from_string(
        f.round(f.col("requestTime").cast("double") / 1000.0).cast('long').cast('string'),
        "yyyyMMddhhmmss"
    )
).show()
#+-------------------+
#| requestTime|
#+-------------------+
#|2017-11-07 01:48:25|
#+-------------------+

Spark SQL dataframe: best way to compute across rowpairs

I have a Spark DataFrame "deviceDF" like so :
ID date_time state
a 2015-12-11 4:30:00 up
a 2015-12-11 5:00:00 down
a 2015-12-11 5:15:00 up
b 2015-12-12 4:00:00 down
b 2015-12-12 4:20:00 up
a 2015-12-12 10:15:00 down
a 2015-12-12 10:20:00 up
b 2015-12-14 15:30:00 down
I am trying to calculate the downtime for each of the IDs. I started simply, by grouping on ID and separately computing the sum of all uptimes and downtimes, then taking the difference of the summed uptime and downtime.
val downtimeDF = deviceDF.filter($"state" === "down")
  .groupBy("ID")
  .agg(sum(unix_timestamp($"date_time")) as "down_time")

val uptimeDF = deviceDF.filter($"state" === "up")
  .groupBy("ID")
  .agg(sum(unix_timestamp($"date_time")) as "up_time")

val updownjoinDF = uptimeDF.join(downtimeDF, "ID")

val difftimeDF = updownjoinDF
  .withColumn("diff_time", $"up_time" - $"down_time")
However, there are a few conditions that cause errors, such as a device that went down but never came back up; in that case, the down_time is the difference between the current_time and the last_time it was down.
Also, if the first entry for a particular device starts with 'up', then the down_time is the difference between the first_entry and the time at the beginning of this analysis, say 2015-12-11 00:00:00. What's the best way to handle these border conditions using DataFrames? Do I need to write a custom UDAF?
The first thing you can try is to use window functions. While this is usually not the fastest possible solution, it is concise and extremely expressive. Taking your data as an example:
import org.apache.spark.sql.functions.unix_timestamp
val df = sc.parallelize(Array(
    ("a", "2015-12-11 04:30:00", "up"), ("a", "2015-12-11 05:00:00", "down"),
    ("a", "2015-12-11 05:15:00", "up"), ("b", "2015-12-12 04:00:00", "down"),
    ("b", "2015-12-12 04:20:00", "up"), ("a", "2015-12-12 10:15:00", "down"),
    ("a", "2015-12-12 10:20:00", "up"), ("b", "2015-12-14 15:30:00", "down")))
  .toDF("ID", "date_time", "state")
  .withColumn("timestamp", unix_timestamp($"date_time"))
Let's define an example window:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{coalesce, lag, when, sum}
val w = Window.partitionBy($"ID").orderBy($"timestamp")
some helper columns:
val previousTimestamp = coalesce(lag($"timestamp", 1).over(w), $"timestamp")
val previousState = coalesce(lag($"state", 1).over(w), $"state")

val downtime = when(
  previousState === "down",
  $"timestamp" - previousTimestamp
).otherwise(0).alias("downtime")

val uptime = when(
  previousState === "up",
  $"timestamp" - previousTimestamp
).otherwise(0).alias("uptime")
and finally a basic query:
val upsAndDowns = df.select($"*", uptime, downtime)
upsAndDowns.show
// +---+-------------------+-----+----------+------+--------+
// | ID| date_time|state| timestamp|uptime|downtime|
// +---+-------------------+-----+----------+------+--------+
// | a|2015-12-11 04:30:00| up|1449804600| 0| 0|
// | a|2015-12-11 05:00:00| down|1449806400| 1800| 0|
// | a|2015-12-11 05:15:00| up|1449807300| 0| 900|
// | a|2015-12-12 10:15:00| down|1449911700|104400| 0|
// | a|2015-12-12 10:20:00| up|1449912000| 0| 300|
// | b|2015-12-12 04:00:00| down|1449889200| 0| 0|
// | b|2015-12-12 04:20:00| up|1449890400| 0| 1200|
// | b|2015-12-14 15:30:00| down|1450103400|213000| 0|
// +---+-------------------+-----+----------+------+--------+
In a similar manner you can look forward, and if there are no more records in a group you can adjust the total uptime / downtime using the current timestamp, as sketched below.
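A hedged PySpark sketch of that look-ahead adjustment (column names mirror the example above; using current_timestamp for "now" is an assumption, as is a PySpark df holding the same data with its unix-seconds timestamp column):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("ID").orderBy("timestamp")
next_ts = F.lead("timestamp", 1).over(w)

# For the last row in each group (no successor): elapsed seconds up to now; 0 otherwise
open_interval = F.when(
    next_ts.isNull(),
    F.unix_timestamp(F.current_timestamp()) - F.col("timestamp")
).otherwise(0)

adjusted = (df
    .withColumn("open_uptime", F.when(F.col("state") == "up", open_interval).otherwise(0))
    .withColumn("open_downtime", F.when(F.col("state") == "down", open_interval).otherwise(0)))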
Window functions provide some other useful features like window definitions with ROWS BETWEEN and RANGE BETWEEN clauses.
Another possible solution is to move your data to an RDD and use low-level operations with RangePartitioner, mapPartitions and sliding windows. For basic things you can even use groupBy. This requires significantly more effort but is also much more flexible.
Finally, there is the spark-timeseries package from Cloudera. Its documentation is close to non-existent, but the tests are comprehensive enough to give you some idea of how to use it.
Regarding custom UDAFs, I wouldn't be too optimistic. The UDAF API is rather specific and not exactly flexible.

How to get the number of elements in partition? [duplicate]

This question already has answers here:
Apache Spark: Get number of records per partition
Is there any way to get the number of elements in a Spark RDD partition, given the partition ID, without scanning the entire partition?
Something like this:
Rdd.partitions().get(index).size()
Except I don't see such an API for Spark. Any ideas? Workarounds?
Thanks
The following gives you a new RDD with elements that are the sizes of each partition:
rdd.mapPartitions(iter => Array(iter.size).iterator, true)
PySpark:
num_partitions = 20000
a = sc.parallelize(range(int(1e6)), num_partitions)
l = a.glom().map(len).collect() # get length of each partition
print(min(l), max(l), sum(l)/len(l), len(l)) # check if skewed
Spark/Scala:
val numPartitions = 20000
val a = sc.parallelize(0 until 1e6.toInt, numPartitions)
val l = a.glom().map(_.length).collect()  // get length of each partition
println((l.min, l.max, l.sum / l.length, l.length))  // check if skewed
The same is possible for a DataFrame, not just for an RDD.
Just add DF.rdd.glom... into the code above.
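For instance, a minimal PySpark sketch of that (df being any DataFrame already in scope):
l = df.rdd.glom().map(len).collect()  # per-partition element counts
print(min(l), max(l), sum(l) / len(l), len(l))  # check if skewed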
Notice that glom() converts elements of each partition into a list, so it's memory-intensive. A less memory-intensive version (pyspark version only):
import statistics

def get_table_partition_distribution(table_name: str):

    def get_partition_len(iterator):
        yield sum(1 for _ in iterator)

    l = spark.table(table_name).rdd.mapPartitions(get_partition_len, True).collect()  # get length of each partition
    num_partitions = len(l)
    min_count = min(l)
    max_count = max(l)
    avg_count = sum(l) / num_partitions
    stddev = statistics.stdev(l)
    print(f"{table_name} each of {num_partitions} partition's counts: min={min_count:,} avg±stddev={avg_count:,.1f} ±{stddev:,.1f} max={max_count:,}")
get_table_partition_distribution('someTable')
outputs something like
someTable each of 1445 partition's counts:
min=1,201,201 avg±stddev=1,202,811.6 ±21,783.4 max=2,030,137
I know I'm a little late here, but I have another approach to get the number of elements in a partition by leveraging Spark's built-in functions. It works for Spark versions above 2.1.
Explanation:
We are going to create a sample dataframe (df), get the partition id, do a group by on partition id, and count each record.
Pyspark:
>>> from pyspark.sql.functions import spark_partition_id, count as _count
>>> df = spark.sql("set -v").unionAll(spark.sql("set -v")).repartition(4)
>>> df.rdd.getNumPartitions()
4
>>> df.withColumn("partition_id", spark_partition_id()).groupBy("partition_id").agg(_count("key")).orderBy("partition_id").show()
+------------+----------+
|partition_id|count(key)|
+------------+----------+
| 0| 48|
| 1| 44|
| 2| 32|
| 3| 48|
+------------+----------+
Scala:
scala> val df = spark.sql("set -v").unionAll(spark.sql("set -v")).repartition(4)
df: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [key: string, value: string ... 1 more field]
scala> df.rdd.getNumPartitions
res0: Int = 4
scala> df.withColumn("partition_id", spark_partition_id()).groupBy("partition_id").agg(count("key")).orderBy("partition_id").show()
+------------+----------+
|partition_id|count(key)|
+------------+----------+
| 0| 48|
| 1| 44|
| 2| 32|
| 3| 48|
+------------+----------+
pzecevic's answer works, but conceptually there's no need to construct an array and then convert it to an iterator. I would just construct the iterator directly and then get the counts with a collect call.
rdd.mapPartitions(iter => Iterator(iter.size), true).collect()
P.S. Not sure if his answer is actually doing more work since Iterator.apply will likely convert its arguments into an array.
