Maintain order after partition by key groupByKey or aggregateByKey


I have data like this
Machine , date , hours
123,2014-06-15,15.4
123,2014-06-16,20.3
123,2014-06-18,11.4
131,2014-06-15,12.2
131,2014-06-16,11.5
131,2014-06-17,18.2
131,2014-06-18,19.2
134,2014-06-15,11.1
134,2014-06-16,16.2
I want to partition by the key Machine and compute the lag of hours by 1 row, with a default value of 0:
Machine , date , hours lag
123,2014-06-15,15.4,0
123,2014-06-16,20.3,15.4
123,2014-06-18,11.4,20.3
131,2014-06-15,12.2,0
131,2014-06-16,11.5,12.2
131,2014-06-17,18.2,11.5
131,2014-06-18,19.2,18.2
134,2014-06-15,11.1,0
134,2014-06-16,16.2,11.1
I am using a pair RDD and the groupByKey method, but it doesn't yield the values in the expected order.

Because there is really no given order here. With some exceptions, RDDs should be considered unordered if any transformations you use require shuffling.
If you need a specific order, you have to sort your data manually:
case class Record(machine: Long, date: java.sql.Date, hours: Double)
case class RecordWithLag(
machine: Long, date: java.sql.Date, hours: Double, lag: Double
)
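// Pair each record (already sorted by date) with the previous record's hours;
// the first record of a machine gets the default 0.0
def getLag(xs: Seq[Record]): Seq[RecordWithLag] =
  (0.0 +: xs.map(_.hours)).zip(xs).map {
    case (lag, r) => RecordWithLag(r.machine, r.date, r.hours, lag)
  }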
val rdd = sc.parallelize(List(
Record(123, java.sql.Date.valueOf("2014-06-15"), 15.4),
Record(123, java.sql.Date.valueOf("2014-06-16"), 20.3),
Record(123, java.sql.Date.valueOf("2014-06-18"), 11.4),
Record(131, java.sql.Date.valueOf("2014-06-15"), 12.2),
Record(131, java.sql.Date.valueOf("2014-06-16"), 11.5),
Record(131, java.sql.Date.valueOf("2014-06-17"), 18.2),
Record(131, java.sql.Date.valueOf("2014-06-18"), 19.2),
Record(134, java.sql.Date.valueOf("2014-06-15"), 11.1),
Record(134, java.sql.Date.valueOf("2014-06-16"), 16.2)
))
rdd
.groupBy(_.machine)
.mapValues(_.toSeq.sortWith((x, y) => x.date.compareTo(y.date) < 0))
.mapValues(getLag)
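The group, sort, then lag logic above can also be sketched without Spark. The following is a minimal plain-Python illustration, not the answer's Scala code: the names `records` and `with_lag` are hypothetical, and it assumes ISO-formatted date strings so that string order matches date order.

```python
from collections import defaultdict

# (machine, date, hours); deliberately shuffled to mimic arbitrary post-shuffle order
records = [
    (131, "2014-06-17", 18.2),
    (123, "2014-06-16", 20.3),
    (131, "2014-06-15", 12.2),
    (123, "2014-06-15", 15.4),
    (131, "2014-06-16", 11.5),
    (123, "2014-06-18", 11.4),
]

def with_lag(rows, default=0.0):
    """Group by machine, sort each group by date, attach the previous hours."""
    groups = defaultdict(list)
    for machine, date, hours in rows:
        groups[machine].append((date, hours))
    out = []
    for machine in sorted(groups):
        prev = default  # lag value for the first row of each machine
        for date, hours in sorted(groups[machine]):
            out.append((machine, date, hours, prev))
            prev = hours
    return out

result = with_lag(records)
```

The explicit sort inside each group is the important step: the grouping itself gives no guarantee about the order of values.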
For performance you should consider updating your Spark distribution to >= 1.4.0 and using a data frame with window functions:
val df = sqlContext.createDataFrame(rdd)
df.registerTempTable("df")
sqlContext.sql(
"""SELECT *, lag(hours, 1, 0) OVER (
PARTITION BY machine ORDER BY date
) lag FROM df"""
)
+-------+----------+-----+----+
|machine| date|hours| lag|
+-------+----------+-----+----+
| 123|2014-06-15| 15.4| 0.0|
| 123|2014-06-16| 20.3|15.4|
| 123|2014-06-18| 11.4|20.3|
| 131|2014-06-15| 12.2| 0.0|
| 131|2014-06-16| 11.5|12.2|
| 131|2014-06-17| 18.2|11.5|
| 131|2014-06-18| 19.2|18.2|
| 134|2014-06-15| 11.1| 0.0|
| 134|2014-06-16| 16.2|11.1|
+-------+----------+-----+----+
or
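import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
import sqlContext.implicits._ // enables the $"column" syntax

df.select(
  $"*",
  lag($"hours", 1, 0).over(
    Window.partitionBy($"machine").orderBy($"date")
  ).alias("lag")
)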

Related

Extract Numeric data from the Column in Spark Dataframe

I have a Dataframe with 20 columns and I want to update one particular column (whose data is null) with the data extracted from another column and do some formatting. Below is a sample input
+------------------------+----+
|col1 |col2|
+------------------------+----+
|This_is_111_222_333_test|NULL|
|This_is_111_222_444_test|3296|
|This_is_555_and_666_test|NULL|
|This_is_999_test |NULL|
+------------------------+----+
and my output should be like below
+------------------------+-----------+
|col1 |col2 |
+------------------------+-----------+
|This_is_111_222_333_test|111,222,333|
|This_is_111_222_444_test|3296 |
|This_is_555_and_666_test|555,666 |
|This_is_999_test |999 |
+------------------------+-----------+
Here is the code I have tried. It works only when the numbers are contiguous. Could you please help me with a solution?
df.withColumn("col2",when($"col2".isNull,regexp_replace(regexp_replace(regexp_extract($"col1","([0-9]+_)+",0),"_",","),".$","")).otherwise($"col2")).show(false)
I can do this by creating a UDF, but I am wondering whether it is possible with Spark's built-in functions. My Spark version is 2.2.0.
Thank you in advance.

A UDF is a good choice here. Performance is similar to that of the withColumn approach given in the OP (see benchmark below), and it works even if the numbers are not contiguous, which is one of the issues mentioned in the OP.
import org.apache.spark.sql.functions.udf
import scala.util.Try
def getNums = (c: String) => {
c.split("_").map(n => Try(n.toInt).getOrElse(0)).filter(_ > 0)
}
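To see why the regex chain from the question stops at the first run of numbers while the split-based function does not, here is a plain-Python sketch of both. It is only an illustration of the logic, not Spark code: `contiguous_prefix` and `get_nums` are hypothetical names, and Python's `re` module stands in for Spark's regex functions.

```python
import re

def contiguous_prefix(s):
    """Mimic the question's regexp_extract / regexp_replace chain."""
    m = re.search(r"([0-9]+_)+", s)           # first run of "digits_" groups only
    matched = m.group(0) if m else ""
    return re.sub(r".$", "", matched.replace("_", ","))  # drop the trailing comma

def get_nums(s):
    """Split on '_' and keep the tokens that parse as positive integers."""
    nums = []
    for tok in s.split("_"):
        try:
            n = int(tok)
        except ValueError:
            continue  # non-numeric token such as "This" or "and"
        if n > 0:
            nums.append(n)
    return nums
```

For `"This_is_555_and_666_test"` the regex version yields only `"555"`, because the match ends at `and`, whereas the split version yields both numbers.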
I recreated your data as follows
val data = Seq(("This_is_111_222_333_test", null.asInstanceOf[Array[Int]]),
("This_is_111_222_444_test",Array(3296)),
("This_is_555_666_test",null.asInstanceOf[Array[Int]]),
("This_is_999_test",null.asInstanceOf[Array[Int]]))
.toDF("col1","col2")
data.createOrReplaceTempView("data")
Register the UDF and run it in a query
spark.udf.register("getNums",getNums)
spark.sql("""select col1,
case when size(col2) > 0 then col2 else getNums(col1) end new_col
from data""").show
Which returns
+--------------------+---------------+
| col1| new_col|
+--------------------+---------------+
|This_is_111_222_3...|[111, 222, 333]|
|This_is_111_222_4...| [3296]|
|This_is_555_666_test| [555, 666]|
| This_is_999_test| [999]|
+--------------------+---------------+
Performance was tested with a larger data set created as follows:
val bigData = (0 to 1000).map(_ => data union data).reduce( _ union _)
bigData.createOrReplaceTempView("big_data")
With that, the solution given in the OP was compared to the UDF solution and found to be about the same.
// With UDF
spark.sql("""select col1,
case when size(col2) > 0 then col2 else getNums(col1) end new_col
from big_data""").count
// OP solution:
bigData.withColumn("col2",when($"col2".isNull,regexp_replace(regexp_replace(regexp_extract($"col1","([0-9]+_)+",0),"_",","),".$","")).otherwise($"col2")).count

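Here is another way that uses only built-in functions; please check the performance. Note that array_join and the higher-order filter function require Spark 2.4 or later.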
df.withColumn("col2", expr("coalesce(col2, array_join(filter(split(col1, '_'), x -> CAST(x as INT) IS NOT NULL), ','))"))
.show(false)
+------------------------+-----------+
|col1 |col2 |
+------------------------+-----------+
|This_is_111_222_333_test|111,222,333|
|This_is_111_222_444_test|3296 |
|This_is_555_666_test |555,666 |
|This_is_999_test |999 |
+------------------------+-----------+
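The expression above combines three steps: keep col2 when it is present, otherwise split col1 on underscores, keep the numeric tokens, and join them with commas. A rough plain-Python equivalent is sketched below. `fill_col2` is a hypothetical name, and the ASCII-digit test only approximates the `CAST(x AS INT) IS NOT NULL` check.

```python
def fill_col2(col1, col2):
    """Return col2 if present, else the numeric tokens of col1 joined by commas."""
    if col2 is not None:          # coalesce: first non-null value wins
        return col2
    tokens = col1.split("_")      # split(col1, '_')
    numeric = [t for t in tokens if t.isascii() and t.isdigit()]  # filter(...)
    return ",".join(numeric)      # array_join(..., ',')

rows = [
    ("This_is_111_222_333_test", None),
    ("This_is_111_222_444_test", "3296"),
    ("This_is_555_and_666_test", None),
    ("This_is_999_test", None),
]
filled = [fill_col2(c1, c2) for c1, c2 in rows]
```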

How to pick latest record in spark structured streaming join

I am using spark-sql 2.4.x and datastax-spark-cassandra-connector for Cassandra 3.x, along with Kafka.
I have currency rates metadata; a sample is below:
val ratesMetaDataDf = Seq(
("EUR","5/10/2019","1.130657","USD"),
("EUR","5/9/2019","1.13088","USD")
).toDF("base_code", "rate_date","rate_value","target_code")
.withColumn("rate_date", to_date($"rate_date" ,"MM/dd/yyyy").cast(DateType))
.withColumn("rate_value", $"rate_value".cast(DoubleType))
The sales records that I receive from the Kafka topic look like this (sample):
val kafkaDf = Seq((15,2016, 4, 100.5,"USD","2021-01-20","EUR",221.4)
).toDF("companyId", "year","quarter","sales","code","calc_date","c_code","prev_sales")
To calculate "prev_sales", I need to get the "rate_value" for its "c_code" whose "rate_date" is nearest to the "calc_date".
I am doing this as follows:
val w2 = Window.orderBy(col("rate_date") desc)
val rateJoinResultDf = kafkaDf.as("k").join(ratesMetaDataDf.as("e"))
.where( ($"k.c_code" === $"e.base_code") &&
($"rate_date" < $"calc_date")
).orderBy($"rate_date" desc)
.withColumn("row",row_number.over(w2))
.where($"row" === 1).drop("row")
.withColumn("prev_sales", (col("prev_sales") * col("rate_value")).cast(DoubleType))
.select("companyId", "year","quarter","sales","code","calc_date","prev_sales")
In the code above, to get the nearest record (i.e. "5/10/2019" from ratesMetaDataDf) for the given "calc_date", I use a window with the row_number function and sort the records in descending order.
But in Spark SQL streaming this causes the following error:
"Sorting is not supported on streaming DataFrames/Datasets, unless it is on aggregated DataFrame/Dataset in Complete output mode;;"
So how can I fetch the first record to join in the code above?

Replace the last part of your code with the code below. It performs a left join and computes the date difference between calc_date and rate_date. A window function then picks the nearest date, and prev_sales is calculated the same way as in your code.
Please note that I have added a filter condition, filter(col("diff") >= 0), which handles the case calc_date < rate_date. I have also added a few more records to illustrate this case.
scala> ratesMetaDataDf.show
+---------+----------+----------+-----------+
|base_code| rate_date|rate_value|target_code|
+---------+----------+----------+-----------+
| EUR|2019-05-10| 1.130657| USD|
| EUR|2019-05-09| 1.12088| USD|
| EUR|2019-12-20| 1.1584| USD|
+---------+----------+----------+-----------+
scala> kafkaDf.show
+---------+----+-------+-----+----+----------+------+----------+
|companyId|year|quarter|sales|code| calc_date|c_code|prev_sales|
+---------+----+-------+-----+----+----------+------+----------+
| 15|2016| 4|100.5| USD|2021-01-20| EUR| 221.4|
| 15|2016| 4|100.5| USD|2019-06-20| EUR| 221.4|
+---------+----+-------+-----+----+----------+------+----------+
scala> val W = Window.partitionBy("companyId","year","quarter","sales","code","calc_date","c_code","prev_sales").orderBy(col("diff"))
scala> val rateJoinResultDf= kafkaDf.alias("k").join(ratesMetaDataDf.alias("r"), col("k.c_code") === col("r.base_code"), "left")
.withColumn("diff",datediff(col("calc_date"), col("rate_date")))
.filter(col("diff") >= 0)
.withColumn("closedate", row_number.over(W))
.filter(col("closedate") === 1)
.drop("diff", "closedate")
.withColumn("prev_sales", (col("prev_sales") * col("rate_value")).cast("Decimal(14,5)"))
.select("companyId", "year","quarter","sales","code","calc_date","prev_sales")
scala> rateJoinResultDf.show
+---------+----+-------+-----+----+----------+----------+
|companyId|year|quarter|sales|code| calc_date|prev_sales|
+---------+----+-------+-----+----+----------+----------+
| 15|2016| 4|100.5| USD|2021-01-20| 256.46976|
| 15|2016| 4|100.5| USD|2019-06-20| 250.32746|
+---------+----+-------+-----+----+----------+----------+
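The selection rule used above (among the rates for the same currency code dated on or before calc_date, take the one with the smallest date difference) can be sketched in plain Python. This is only an illustration with the sample values shown above; `nearest_rate` is a hypothetical helper, and unlike the filter on diff in the answer, which drops rows without an eligible rate, this sketch keeps them as None.

```python
from datetime import date

rates = [  # (base_code, rate_date, rate_value)
    ("EUR", date(2019, 5, 10), 1.130657),
    ("EUR", date(2019, 5, 9), 1.12088),
    ("EUR", date(2019, 12, 20), 1.1584),
]
sales = [  # (c_code, calc_date, prev_sales)
    ("EUR", date(2021, 1, 20), 221.4),
    ("EUR", date(2019, 6, 20), 221.4),
]

def nearest_rate(c_code, calc_date):
    """Latest rate on or before calc_date for the given code, or None."""
    candidates = [
        (rate_date, value)
        for code, rate_date, value in rates
        if code == c_code and (calc_date - rate_date).days >= 0  # diff >= 0
    ]
    # The latest eligible date is the one with the smallest non-negative diff.
    return max(candidates)[1] if candidates else None

converted = []
for c_code, calc_date, prev_sales in sales:
    rate = nearest_rate(c_code, calc_date)
    converted.append(None if rate is None else round(prev_sales * rate, 5))
```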

Poor performance on Window Lag function for large Spark dataframes [duplicate]

My question is triggered by the use case of calculating the differences between consecutive rows in a spark dataframe.
For example, I have:
>>> df.show()
+-----+----------+
|index| col1|
+-----+----------+
| 0.0|0.58734024|
| 1.0|0.67304325|
| 2.0|0.85154736|
| 3.0| 0.5449719|
+-----+----------+
If I choose to calculate these using "Window" functions, then I can do that like so:
>>> winSpec = Window.partitionBy(df.index >= 0).orderBy(df.index.asc())
>>> import pyspark.sql.functions as f
>>> df.withColumn('diffs_col1', f.lag(df.col1, -1).over(winSpec) - df.col1).show()
+-----+----------+-----------+
|index| col1| diffs_col1|
+-----+----------+-----------+
| 0.0|0.58734024|0.085703015|
| 1.0|0.67304325| 0.17850411|
| 2.0|0.85154736|-0.30657548|
| 3.0| 0.5449719| null|
+-----+----------+-----------+
Question: I explicitly placed the whole dataframe in a single partition. What is the performance impact of this and, if there is one, why is that so and how can I avoid it? I ask because when I do not specify a partition, I get the following warning:
16/12/24 13:52:27 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.

In practice the performance impact will be almost the same as if you had omitted the partitionBy clause altogether. All records will be shuffled to a single partition, sorted locally and iterated sequentially one by one.
The only difference is in the total number of partitions created. Let's illustrate that with an example using a simple dataset with 10 partitions and 1000 records:
from pyspark.sql import Window
import pyspark.sql.functions as f

df = spark.range(0, 1000, 1, 10).toDF("index").withColumn("col1", f.randn(42))
If you define a frame without a partitionBy clause
w_unpart = Window.orderBy(f.col("index").asc())
and use it with lag
df_lag_unpart = df.withColumn(
"diffs_col1", f.lag("col1", 1).over(w_unpart) - f.col("col1")
)
there will be only one partition in total:
df_lag_unpart.rdd.glom().map(len).collect()
[1000]
In contrast, a frame definition with a dummy partitioning key (simplified a bit compared to your code):
w_part = Window.partitionBy(f.lit(0)).orderBy(f.col("index").asc())
will use number of partitions equal to spark.sql.shuffle.partitions:
spark.conf.set("spark.sql.shuffle.partitions", 11)
df_lag_part = df.withColumn(
"diffs_col1", f.lag("col1", 1).over(w_part) - f.col("col1")
)
df_lag_part.rdd.glom().count()
11
with only one non-empty partition:
df_lag_part.rdd.glom().filter(lambda x: x).count()
1
Unfortunately there is no universal solution to this problem in PySpark. It is simply an inherent property of the implementation combined with the distributed processing model.
Since the index column is sequential, you could generate an artificial partitioning key with a fixed number of records per block:
rec_per_block = df.count() // int(spark.conf.get("spark.sql.shuffle.partitions"))
df_with_block = df.withColumn(
"block", (f.col("index") / rec_per_block).cast("int")
)
and use it to define frame specification:
w_with_block = Window.partitionBy("block").orderBy("index")
df_lag_with_block = df_with_block.withColumn(
"diffs_col1", f.lag("col1", 1).over(w_with_block) - f.col("col1")
)
This will use the expected number of partitions:
df_lag_with_block.rdd.glom().count()
11
with roughly uniform data distribution (we cannot avoid hash collisions):
df_lag_with_block.rdd.glom().map(len).collect()
[0, 180, 0, 90, 90, 0, 90, 90, 100, 90, 270]
but with a number of gaps on the block boundaries:
df_lag_with_block.where(f.col("diffs_col1").isNull()).count()
12
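Why 12 gaps? With 1000 rows and a block size of 1000 // 11 = 90 there are 12 blocks, and the first row of each block has no predecessor inside its own block. The plain-Python sketch below reproduces that count. It is an illustration only: `lag_diff_by_block` is a hypothetical helper and `col1` is a deterministic stand-in for the random column.

```python
n_rows = 1000
n_partitions = 11
rec_per_block = n_rows // n_partitions  # 90, as in the PySpark code above

# Deterministic stand-in for the randn column (values are arbitrary)
col1 = [((i * 37) % 101) / 101 for i in range(n_rows)]

def lag_diff_by_block(values, block_size):
    """previous - current within each block; None for the first row of a block."""
    diffs = []
    for i, v in enumerate(values):
        same_block = i > 0 and (i // block_size) == ((i - 1) // block_size)
        diffs.append(values[i - 1] - v if same_block else None)
    return diffs

diffs = lag_diff_by_block(col1, rec_per_block)
n_blocks = len({i // rec_per_block for i in range(n_rows)})
n_gaps = sum(d is None for d in diffs)
```

Only the very first gap (index 0) is legitimate; the other 11 are artifacts of the artificial block key.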
Since boundaries are easy to compute:
from itertools import chain
boundary_idxs = sorted(chain.from_iterable(
# Here we depend on sequential identifiers
# This could be generalized to any monotonically increasing
# id by taking min and max per block
(idx - 1, idx) for idx in
df_lag_with_block.groupBy("block").min("index")
.drop("block").rdd.flatMap(lambda x: x)
.collect()))[2:] # The first boundary doesn't carry useful information.
you can always select:
missing = df_with_block.where(f.col("index").isin(boundary_idxs))
and fill these separately:
# We use window without partitions here. Since number of records
# will be small this won't be a performance issue
# but will generate "Moving all data to a single partition" warning
missing_with_lag = missing.withColumn(
"diffs_col1", f.lag("col1", 1).over(w_unpart) - f.col("col1")
).select("index", f.col("diffs_col1").alias("diffs_fill"))
and join:
combined = (df_lag_with_block
.join(missing_with_lag, ["index"], "leftouter")
.withColumn("diffs_col1", f.coalesce("diffs_col1", "diffs_fill")))
to get the desired result, which can be checked against the unpartitioned version:
mismatched = combined.join(df_lag_unpart, ["index"], "outer").where(
combined["diffs_col1"] != df_lag_unpart["diffs_col1"]
)
assert mismatched.count() == 0
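The overall idea (compute the lag per block, then patch only the rows at block boundaries) can be checked in plain Python against a single-pass reference. This is a sketch under the same assumption as the answer, namely sequential indices starting at 0; the function names are hypothetical.

```python
def lag_diff(values):
    """Reference: previous - current over the whole sequence (one partition)."""
    return [None] + [values[i - 1] - values[i] for i in range(1, len(values))]

def lag_diff_blocked_then_filled(values, block_size):
    n = len(values)
    # Step 1: lag within each block; the first row of every block is None.
    diffs = [
        values[i - 1] - values[i] if i % block_size else None
        for i in range(n)
    ]
    # Step 2: recompute only the block starts (except row 0) from the
    # last row of the preceding block, i.e. the (idx - 1, idx) pairs.
    for start in range(block_size, n, block_size):
        diffs[start] = values[start - 1] - values[start]
    return diffs

values = [((i * 37) % 101) / 101 for i in range(1000)]  # arbitrary test data
reference = lag_diff(values)
filled = lag_diff_blocked_then_filled(values, 90)
```

Only the boundary rows need a second pass, which is why the small unpartitioned window in the answer is not a performance concern.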

Order Spark SQL Dataframe with nested values / complex data types

My goal is to collect an ordered list of nested values. It should be ordered based on an element in the nested list. I tried out different approaches but have concerns in terms of performance and correctness.
Order globally
case class Payment(Id: String, Date: String, Paid: Double)
val payments = Seq(
Payment("mk", "10:00 AM", 8.6D),
Payment("mk", "06:00 AM", 12.6D),
Payment("yc", "07:00 AM", 16.6D),
Payment("yc", "09:00 AM", 2.6D),
Payment("mk", "11:00 AM", 5.6D)
)
val df = spark.createDataFrame(payments)
// order globally
df.orderBy(col("Paid").desc)
.groupBy(col("Id"))
.agg(
collect_list(struct(col("Date"), col("Paid"))).as("UserPayments")
)
.withColumn("LargestPayment", col("UserPayments")(0).getField("Paid"))
.withColumn("LargestPaymentDate", col("UserPayments")(0).getField("Date"))
.show(false)
+---+-------------------------------------------------+--------------+------------------+
|Id |UserPayments |LargestPayment|LargestPaymentDate|
+---+-------------------------------------------------+--------------+------------------+
|yc |[[07:00 AM,16.6], [09:00 AM,2.6]] |16.6 |07:00 AM |
|mk |[[06:00 AM,12.6], [10:00 AM,8.6], [11:00 AM,5.6]]|12.6 |06:00 AM |
+---+-------------------------------------------------+--------------+------------------+
This is a naive and straight-forward approach, but I have concerns in terms of correctness. Will the list really be ordered globally or only within a partition?
Window function
// use Window
val window = Window.partitionBy(col("Id")).orderBy(col("Paid").desc)
df.withColumn("rank", rank().over(window))
.groupBy(col("Id"))
.agg(
collect_list(struct(col("rank"), col("Date"), col("Paid"))).as("UserPayments")
)
.withColumn("LargestPayment", col("UserPayments")(0).getField("Paid"))
.withColumn("LargestPaymentDate", col("UserPayments")(0).getField("Date"))
.show(false)
+---+-------------------------------------------------------+--------------+------------------+
|Id |UserPayments |LargestPayment|LargestPaymentDate|
+---+-------------------------------------------------------+--------------+------------------+
|yc |[[1,07:00 AM,16.6], [2,09:00 AM,2.6]] |16.6 |07:00 AM |
|mk |[[1,06:00 AM,12.6], [2,10:00 AM,8.6], [3,11:00 AM,5.6]]|12.6 |06:00 AM |
+---+-------------------------------------------------------+--------------+------------------+
This should work, or am I missing something?
Order in UDF on-the-fly
// order in UDF
val largestPaymentDate = udf((lr: Seq[Row]) => {
lr.max(Ordering.by((l: Row) => l.getAs[Double]("Paid"))).getAs[String]("Date")
})
df.groupBy(col("Id"))
.agg(
collect_list(struct(col("Date"), col("Paid"))).as("UserPayments")
)
.withColumn("LargestPaymentDate", largestPaymentDate(col("UserPayments")))
.show(false)
+---+-------------------------------------------------+------------------+
|Id |UserPayments |LargestPaymentDate|
+---+-------------------------------------------------+------------------+
|yc |[[07:00 AM,16.6], [09:00 AM,2.6]] |07:00 AM |
|mk |[[10:00 AM,8.6], [06:00 AM,12.6], [11:00 AM,5.6]]|06:00 AM |
+---+-------------------------------------------------+------------------+
I guess there is nothing to complain about here in terms of correctness. But for the subsequent operations, I'd prefer the list to be ordered already, so that I don't have to sort it explicitly every time.
I tried to write a UDF which takes the list as input and returns the ordered list, but returning a list was too painful and I gave up.

I'd reverse the order of the struct and aggregate with max:
val result = df
.groupBy(col("Id"))
.agg(
collect_list(struct(col("Date"), col("Paid"))) as "UserPayments",
max(struct(col("Paid"), col("Date"))) as "MaxPayment"
)
result.show
// +---+--------------------+---------------+
// | Id| UserPayments| MaxPayment|
// +---+--------------------+---------------+
// | yc|[[07:00 AM,16.6],...|[16.6,07:00 AM]|
// | mk|[[10:00 AM,8.6], ...|[12.6,06:00 AM]|
// +---+--------------------+---------------+
You can later flatten the struct:
result.select($"id", $"UserPayments", $"MaxPayment.*").show
// +---+--------------------+----+--------+
// | id| UserPayments|Paid| Date|
// +---+--------------------+----+--------+
// | yc|[[07:00 AM,16.6],...|16.6|07:00 AM|
// | mk|[[10:00 AM,8.6], ...|12.6|06:00 AM|
// +---+--------------------+----+--------+
In the same way, you can apply sort_array to the reordered structs:
df
.groupBy(col("Id"))
.agg(
sort_array(collect_list(struct(col("Paid"), col("Date")))) as "UserPayments"
)
.show(false)
// +---+-------------------------------------------------+
// |Id |UserPayments |
// +---+-------------------------------------------------+
// |yc |[[2.6,09:00 AM], [16.6,07:00 AM]] |
// |mk |[[5.6,11:00 AM], [8.6,10:00 AM], [12.6,06:00 AM]]|
// +---+-------------------------------------------------+
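Both tricks rely on structs being compared field by field, so putting Paid first makes it the primary sort key. Python tuples compare the same way, which allows a small sketch of the idea. This is plain Python rather than Spark code, and the variable names are hypothetical.

```python
from collections import defaultdict

payments = [  # (id, date, paid)
    ("mk", "10:00 AM", 8.6),
    ("mk", "06:00 AM", 12.6),
    ("yc", "07:00 AM", 16.6),
    ("yc", "09:00 AM", 2.6),
    ("mk", "11:00 AM", 5.6),
]

by_id = defaultdict(list)
for pid, date, paid in payments:
    # Paid goes first so that it drives the comparison, as in struct(Paid, Date)
    by_id[pid].append((paid, date))

# max over (paid, date) tuples: the largest payment together with its date
max_payment = {pid: max(items) for pid, items in by_id.items()}

# sorted over (paid, date) tuples: ascending by paid, like sort_array
sorted_payments = {pid: sorted(items) for pid, items in by_id.items()}
```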
Finally, regarding your question:
"This is a naive and straight-forward approach, but I have concerns in terms of correctness. Will the list really be ordered globally or only within a partition?"
The data will be ordered globally, but the order will be destroyed by groupBy, so this is not a solution and can work only accidentally.
repartition (by Id) followed by sortWithinPartitions (by Id and Paid) should be a reliable replacement.
