How to get the rows with Max value in Spark DataFrame - apache-spark

I have a dataframe (df1) with the following details
| Date |High|Low |
| -------- |----|----|
| 2021-01-23| 89 | 43 |
| 2021-02-09| 90 | 54 |
|2009-09-19 | 96 | 50 |
I then apply aggregate functions to the High
df1.agg({'High':'max'}).show()
This will give me:
| max(High)|
| -------- |
| 96 |
How can I apply a filter or another method so that the other columns in the same row as max(High) are shown together with the aggregated result?
My desired outcome is -
| Date | High | Low |
| -------- | ---- |------|
|2009-09-19| 96 | 50 |

You can do this easily by extracting the max High value and then filtering the entire DataFrame on that value.
Data Preparation
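import pandas as pd
from pyspark.sql import functions as F
# `sql` is assumed to be an existing SparkSession (or SQLContext)
df = pd.DataFrame({
'Date':['2021-01-23','2021-02-09','2009-09-19'],
'High':[89,90,96],
'Low':[43,54,50]
})
sparkDF = sql.createDataFrame(df)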
sparkDF.show()
+----------+----+---+
| Date|High|Low|
+----------+----+---+
|2021-01-23| 89| 43|
|2021-02-09| 90| 54|
|2009-09-19| 96| 50|
+----------+----+---+
Filter
max_high = sparkDF.select(F.max(F.col('High')).alias('High')).collect()[0]['High']
>>> 96
sparkDF.filter(F.col('High') == max_high).orderBy(F.col('Date').desc()).limit(1).show()
+----------+----+---+
| Date|High|Low|
+----------+----+---+
|2009-09-19| 96| 50|
+----------+----+---+
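To sanity-check the logic without a Spark session, the same steps can be mirrored in plain Python. This is only a sketch over the three sample rows: the names rows, matches and top_row are made up for illustration, and the tie-break on the latest Date is assumed to correspond to the orderBy(...desc()).limit(1) step above.

```python
# Plain-Python mirror of the Spark logic: find the max of High, then keep the matching row.
rows = [
    {"Date": "2021-01-23", "High": 89, "Low": 43},
    {"Date": "2021-02-09", "High": 90, "Low": 54},
    {"Date": "2009-09-19", "High": 96, "Low": 50},
]

# Step 1: the aggregate (plays the role of F.max on the High column).
max_high = max(row["High"] for row in rows)

# Step 2: keep every row whose High equals that maximum.
matches = [row for row in rows if row["High"] == max_high]

# Step 3: if several rows tie, keep the one with the latest Date
# (ISO-formatted date strings sort chronologically).
top_row = max(matches, key=lambda row: row["Date"])
print(top_row)
```

If you want every tied row rather than a single one, stop after step 2 (in Spark, drop the orderBy and limit).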

Related

Mapping column from arrays in Pyspark

I'm new to working with PySpark DataFrames that store arrays in columns, and I'm looking for help mapping a column based on two PySpark DataFrames, one of which is a reference DataFrame.
Reference Dataframe (Number of Subgroups varies for each Group):
| Group | Subgroup | Size | Type |
| ---- | -------- | ------------------| --------------- |
|A | A1 |['Small','Medium'] | ['A','B'] |
|A | A2 |['Small','Medium'] | ['C','D'] |
|B | B1 |['Small'] | ['A','B','C','D']|
Source Dataframe:
| ID | Size | Type |
| ---- | -------- | ---------|
|ID_001 | 'Small' |'A' |
|ID_002 | 'Medium' |'B' |
|ID_003 | 'Small' |'D' |
In the result, each ID belongs to every Group but is exclusive to one of its subgroups, based on the reference df. The result should look something like this:
| ID | Size | Type | A_Subgroup | B_Subgroup |
| ---- | -------- | ---------| ---------- | ------------- |
|ID_001 | 'Small' |'A' | 'A1' | 'B1' |
|ID_002 | 'Medium' |'B' | 'A1' | Null |
|ID_003 | 'Small' |'D' | 'A2' | 'B1' |
You can do a join using array_contains conditions, and pivot the result:
import pyspark.sql.functions as F
result = source.alias('source').join(
ref.alias('ref'),
F.expr("""
array_contains(ref.Size, source.Size) and
array_contains(ref.Type, source.Type)
"""),
'left'
).groupBy(
'ID', source['Size'], source['Type']
).pivot('Group').agg(F.first('Subgroup'))
result.show()
+------+------+----+---+----+
| ID| Size|Type| A| B|
+------+------+----+---+----+
|ID_003| Small| D| A2| B1|
|ID_002|Medium| B| A1|null|
|ID_001| Small| A| A1| B1|
+------+------+----+---+----+
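The join-then-pivot logic can be checked on the sample data with a plain-Python sketch. It is not the Spark implementation: membership tests stand in for array_contains, a None default stands in for the left join, and keeping the first match per group stands in for F.first. The variable names are illustrative.

```python
# Reference rows: each subgroup lists the sizes and types it accepts.
ref = [
    {"Group": "A", "Subgroup": "A1", "Size": ["Small", "Medium"], "Type": ["A", "B"]},
    {"Group": "A", "Subgroup": "A2", "Size": ["Small", "Medium"], "Type": ["C", "D"]},
    {"Group": "B", "Subgroup": "B1", "Size": ["Small"], "Type": ["A", "B", "C", "D"]},
]
source = [
    {"ID": "ID_001", "Size": "Small", "Type": "A"},
    {"ID": "ID_002", "Size": "Medium", "Type": "B"},
    {"ID": "ID_003", "Size": "Small", "Type": "D"},
]

# The pivot columns are the distinct Group values.
groups = sorted({r["Group"] for r in ref})

result = []
for s in source:
    out = dict(s)
    for g in groups:
        out[g] = None  # stays None when no subgroup matches (the left-join case)
    for r in ref:
        # Same condition as the two array_contains checks in the join.
        if s["Size"] in r["Size"] and s["Type"] in r["Type"]:
            if out[r["Group"]] is None:  # keep the first match per group
                out[r["Group"]] = r["Subgroup"]
    result.append(out)

for row in result:
    print(row)
```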

PySpark: Filtering duplicates of a union, keeping only the groupby rows with the maximum value for a specified column

I want to create a DataFrame that contains all the rows from two DataFrames, and where there are duplicates we keep only the row with the max value of a column.
For example, if we have two tables with the same schema, like below, we will merge into one table which includes only the rows with the maximum column value (highest score) for the group of rows grouped by another column ("name" in the below example).
Table A
+--------------------------+
| name | source | score |
+--------+---------+-------+
| Finch | Acme | 62 |
| Jones | Acme | 30 |
| Lewis | Acme | 59 |
| Smith | Acme | 98 |
| Starr | Acme | 87 |
+--------+---------+-------+
Table B
+--------------------------+
| name | source | score |
+--------+---------+-------+
| Bryan | Beta | 93 |
| Jones | Beta | 75 |
| Lewis | Beta | 59 |
| Smith | Beta | 64 |
| Starr | Beta | 81 |
+--------+---------+-------+
Final Table
+--------------------------+
| name | source | score |
+--------+---------+-------+
| Bryan | Beta | 93 |
| Finch | Acme | 62 |
| Jones | Beta | 75 |
| Lewis | Acme | 59 |
| Smith | Acme | 98 |
| Starr | Acme | 87 |
+--------+---------+-------+
Here's what seems to work:
from pyspark.sql import functions as F
schema = ["name", "source", "score"]
rows1 = [("Smith", "Acme", 98),
("Jones", "Acme", 30),
("Finch", "Acme", 62),
("Lewis", "Acme", 59),
("Starr", "Acme", 87)]
rows2 = [("Smith", "Beta", 64),
("Jones", "Beta", 75),
("Bryan", "Beta", 93),
("Lewis", "Beta", 59),
("Starr", "Beta", 81)]
df1 = spark.createDataFrame(rows1, schema)
df2 = spark.createDataFrame(rows2, schema)
df_union = df1.unionAll(df2)
df_agg = df_union.groupBy("name").agg(F.max("score").alias("score"))
df_final = df_union.join(df_agg, on="score", how="leftsemi").orderBy("name", F.col("score").desc()).dropDuplicates(["name"])
The above results in the DataFrame I expect. It seems like a convoluted way to do this, but I don't know as I'm relatively new to Spark. Can this be done in a more efficient, elegant, or "Pythonic" manner?
You can use window functions. Partition by name and choose the record with the highest score.
from pyspark.sql.functions import *
from pyspark.sql.window import Window
w=Window().partitionBy("name").orderBy(desc("score"))
df_union.withColumn("rank", row_number().over(w))\
.filter(col("rank")==1).drop("rank").show()
+-----+------+-----+
| name|source|score|
+-----+------+-----+
|Bryan| Beta| 93|
|Finch| Acme| 62|
|Jones| Beta| 75|
|Lewis| Acme| 59|
|Smith| Acme| 98|
|Starr| Acme| 87|
+-----+------+-----+
I don't see anything wrong with your answer except for the last line: you cannot join on score only; you need to join on the combination of "name" and "score". You can also use an inner join, which eliminates the need to remove rows with lower scores for the same name:
df_final = (df_union.join(df_agg, on=["name", "score"], how="inner")
.orderBy("name")
.dropDuplicates(["name"]))
Notice that there is no need to order by score, and .dropDuplicates(["name"]) is only needed if you want to avoid displaying two rows for name = Lewis who has the same score in both dataframes.
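For comparison, here is a plain-Python sketch of "keep the highest-scoring row per name" on the same sample rows. It illustrates the expected result only and is not a Spark implementation. The tie for Lewis is resolved here by keeping the first row encountered, which is an assumption of this sketch; in Spark, which tied row survives depends on the ordering you specify.

```python
rows1 = [("Smith", "Acme", 98), ("Jones", "Acme", 30), ("Finch", "Acme", 62),
         ("Lewis", "Acme", 59), ("Starr", "Acme", 87)]
rows2 = [("Smith", "Beta", 64), ("Jones", "Beta", 75), ("Bryan", "Beta", 93),
         ("Lewis", "Beta", 59), ("Starr", "Beta", 81)]

# Equivalent of the union of the two tables.
union = rows1 + rows2

# Track the best row seen so far for each name.
best = {}
for name, source, score in union:
    # Strictly greater: on a tie, the row seen first is kept (Lewis stays with Acme).
    if name not in best or score > best[name][2]:
        best[name] = (name, source, score)

# Names are unique at this point, so sorting the tuples sorts by name.
final = sorted(best.values())
for row in final:
    print(row)
```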

Append a monotonically increasing id column that increases on column value match

I am ingesting a dataframe and I want to append a monotonically increasing column that increases whenever another column matches a certain value. For example I have the following table
+------+-------+
| Col1 | Col2 |
+------+-------+
| B | 543 |
| A | 1231 |
| B | 14234 |
| B | 34234 |
| B | 3434 |
| A | 43242 |
| B | 43242 |
| B | 56453 |
+------+-------+
I would like to append a column that increases in value whenever "A" in col1 is present. So the result would look like
+------+-------+------+
| Col1 | Col2 | Col3 |
+------+-------+------+
| B | 543 | 0 |
| A | 1231 | 1 |
| B | 14234 | 1 |
| B | 34234 | 1 |
| B | 3434 | 1 |
| A | 43242 | 2 |
| B | 43242 | 2 |
| B | 56453 | 2 |
+------+-------+------+
Keeping the initial order is important.
I tried zippering but that doesn't seem to produce the right result. Splitting it up into individual seqs manually and doing it that way is not going to be performant enough (think 100+ GB tables).
I looked into trying this with a map function that would keep a counter somewhere but couldn't get that to work.
Any advice or pointer in the right direction would be greatly appreciated.
Spark does not provide a built-in function for this kind of functionality.
I would most probably do it this way:
//inputDF contains Col1 | Col2
import spark.implicits._ // needed for .toDF on an RDD (assumes a SparkSession named spark)
val df = inputDF.select("Col1").distinct.rdd.map(_.getString(0)).zipWithIndex().toDF("Col1","Col3")
val finalDF = inputDF.join(df,df("Col1") === inputDF("Col1"),"left").select(inputDF("*"),df("Col3"))
The problem I can see here is the join, which will result in a shuffle.
You can also check other autoincrement APIs here.
Use window and sum over the window of the value 1 when Col1 = A.
import pyspark.sql.functions as f
from pyspark.sql import Window
w = Window.partitionBy().rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn('Col3', f.sum(f.when(f.col('Col1') == f.lit('A'), 1).otherwise(0)).over(w)).show()
+----+-----+----+
|Col1| Col2|Col3|
+----+-----+----+
| B| 543| 0|
| A| 1231| 1|
| B|14234| 1|
| B|34234| 1|
| B| 3434| 1|
| A|43242| 2|
| B|43242| 2|
| B|56453| 2|
+----+-----+----+
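The window above computes a running count of the rows where Col1 is "A". A plain-Python sketch of the same idea, using itertools.accumulate on the sample column, looks like this. It assumes the rows are already in the desired order; the names flags and col3 are illustrative.

```python
from itertools import accumulate

# Col1 values in their original order.
col1 = ["B", "A", "B", "B", "B", "A", "B", "B"]

# 1 where the value is "A", otherwise 0 (the when/otherwise expression).
flags = [1 if value == "A" else 0 for value in col1]

# Running total from the first row up to the current row
# (the unboundedPreceding..currentRow frame).
col3 = list(accumulate(flags))
print(col3)
```

Because the Spark window has no orderBy, the result there depends on the order in which Spark happens to read the rows, so an explicit ordering column is safer on large or repartitioned data.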

How to calculate rolling sum with varying window sizes in PySpark

I have a spark dataframe that contains sales prediction data for some products in some stores over a time period. How do I calculate the rolling sum of Predictions for a window size of next N values?
Input Data
+-----------+---------+------------+------------+---+
| ProductId | StoreId | Date | Prediction | N |
+-----------+---------+------------+------------+---+
| 1 | 100 | 2019-07-01 | 0.92 | 2 |
| 1 | 100 | 2019-07-02 | 0.62 | 2 |
| 1 | 100 | 2019-07-03 | 0.89 | 2 |
| 1 | 100 | 2019-07-04 | 0.57 | 2 |
| 2 | 200 | 2019-07-01 | 1.39 | 3 |
| 2 | 200 | 2019-07-02 | 1.22 | 3 |
| 2 | 200 | 2019-07-03 | 1.33 | 3 |
| 2 | 200 | 2019-07-04 | 1.61 | 3 |
+-----------+---------+------------+------------+---+
Expected Output Data
+-----------+---------+------------+------------+---+------------------------+
| ProductId | StoreId | Date | Prediction | N | RollingSum |
+-----------+---------+------------+------------+---+------------------------+
| 1 | 100 | 2019-07-01 | 0.92 | 2 | sum(0.92, 0.62) |
| 1 | 100 | 2019-07-02 | 0.62 | 2 | sum(0.62, 0.89) |
| 1 | 100 | 2019-07-03 | 0.89 | 2 | sum(0.89, 0.57) |
| 1 | 100 | 2019-07-04 | 0.57 | 2 | sum(0.57) |
| 2 | 200 | 2019-07-01 | 1.39 | 3 | sum(1.39, 1.22, 1.33) |
| 2 | 200 | 2019-07-02 | 1.22 | 3 | sum(1.22, 1.33, 1.61 ) |
| 2 | 200 | 2019-07-03 | 1.33 | 3 | sum(1.33, 1.61) |
| 2 | 200 | 2019-07-04 | 1.61 | 3 | sum(1.61) |
+-----------+---------+------------+------------+---+------------------------+
There are lots of questions and answers to this problem in Python but I couldn't find any in PySpark.
Similar Question 1
There is a similar question here, but in that one the frame size is fixed to 3. The provided answer uses the rangeBetween function, which only works with fixed-size frames, so I cannot use it for varying sizes.
Similar Question 2
There is also a similar question here. In this one, writing cases for all possible sizes is suggested but it is not applicable for my case since I don't know how many distinct frame sizes I need to calculate.
Solution attempt 1
I've tried to solve the problem using a pandas udf:
rolling_sum_predictions = predictions.groupBy('ProductId', 'StoreId').apply(calculate_rolling_sums)
calculate_rolling_sums is a pandas udf where I solve the problem in Python. This solution works with a small amount of test data. However, when the data gets bigger (in my case, the input df has around 1B rows), the calculations take too long.
Solution attempt 2
I have used a workaround of the answer of Similar Question 1 above. I've calculated the biggest possible N, created the list using it and then calculate the sum of predictions by slicing the list.
predictions = predictions.withColumn('DayIndex', F.rank().over(Window.partitionBy('ProductId', 'StoreId').orderBy('Date')))
# find the biggest period
biggest_period = predictions.agg({"N": "max"}).collect()[0][0]
# calculate rolling predictions starting from the DayIndex
w = (Window.partitionBy(F.col("ProductId"), F.col("StoreId")).orderBy(F.col('DayIndex')).rangeBetween(0, biggest_period - 1))
rolling_prediction_lists = predictions.withColumn("next_preds", F.collect_list("Prediction").over(w))
# calculate rolling forecast sums
pred_sum_udf = udf(lambda preds, period: float(np.sum(preds[:period])), FloatType())
rolling_pred_sums = rolling_prediction_lists \
.withColumn("RollingSum", pred_sum_udf("next_preds", "N"))
This solution also works with the test data. I haven't had a chance to test it with the original data yet, but whether it works or not, I do not like this solution. Is there a smarter way to solve this?
If you're using Spark 2.4+, you can use the new array functions slice and aggregate (a higher-order function) to implement your requirement efficiently without any UDFs:
# collect every prediction from the current row to the end of the partition
w = Window.partitionBy("ProductId", "StoreId").orderBy("Date").rowsBetween(Window.currentRow, Window.unboundedFollowing)
summed_predictions = predictions\
.withColumn("summed", F.collect_list("Prediction").over(w))\
.withColumn("summed", F.expr("aggregate(slice(summed,1,N), cast(0 as double), (acc,d) -> acc + d)"))
summed_predictions.show()
+---------+-------+-------------------+----------+---+------------------+
|ProductId|StoreId| Date|Prediction| N| summed|
+---------+-------+-------------------+----------+---+------------------+
| 1| 100|2019-07-01 00:00:00| 0.92| 2| 1.54|
| 1| 100|2019-07-02 00:00:00| 0.62| 2| 1.51|
| 1| 100|2019-07-03 00:00:00| 0.89| 2| 1.46|
| 1| 100|2019-07-04 00:00:00| 0.57| 2| 0.57|
| 2| 200|2019-07-01 00:00:00| 1.39| 3| 3.94|
| 2| 200|2019-07-02 00:00:00| 1.22| 3| 4.16|
| 2| 200|2019-07-03 00:00:00| 1.33| 3|2.9400000000000004|
| 2| 200|2019-07-04 00:00:00| 1.61| 3| 1.61|
+---------+-------+-------------------+----------+---+------------------+
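The collect-then-slice idea can be verified on the sample data with a plain-Python sketch. This is only an illustration of the logic, not a substitute for the Spark expression. The tuple layout and variable names are assumptions made for the example.

```python
from itertools import groupby

# (ProductId, StoreId, Date, Prediction, N)
rows = [
    (1, 100, "2019-07-01", 0.92, 2), (1, 100, "2019-07-02", 0.62, 2),
    (1, 100, "2019-07-03", 0.89, 2), (1, 100, "2019-07-04", 0.57, 2),
    (2, 200, "2019-07-01", 1.39, 3), (2, 200, "2019-07-02", 1.22, 3),
    (2, 200, "2019-07-03", 1.33, 3), (2, 200, "2019-07-04", 1.61, 3),
]

# Sort by partition key and date so groupby sees each partition in date order.
rows.sort(key=lambda r: (r[0], r[1], r[2]))

rolling_sums = []
for _, group in groupby(rows, key=lambda r: (r[0], r[1])):
    group = list(group)
    for i, (product_id, store_id, date, prediction, n) in enumerate(group):
        # Predictions from the current row to the end of the partition
        # (collect_list over currentRow..unboundedFollowing).
        following = [g[3] for g in group[i:]]
        # Keep only the first N of them and add them up (slice + aggregate).
        rolling_sums.append((product_id, store_id, date, sum(following[:n])))

for row in rolling_sums:
    print(row)
```

Near the end of a partition fewer than N values remain, so the slice simply returns what is left, which matches the expected output in the question.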
It might not be the best, but you can get distinct "N" column values and loop like below.
val arr = df.select("N").distinct.collect
for(n <- arr) df.filter(col("N") === n.get(0))
.withColumn("RollingSum",sum(col("Prediction"))
.over(Window.partitionBy("N").orderBy("N").rowsBetween(Window.currentRow, n.get(0).toString.toLong-1))).show
This gives output like:
+---------+-------+----------+----------+---+------------------+
|ProductId|StoreId| Date|Prediction| N| RollingSum|
+---------+-------+----------+----------+---+------------------+
| 2| 200|2019-07-01| 1.39| 3| 3.94|
| 2| 200|2019-07-02| 1.22| 3| 4.16|
| 2| 200|2019-07-03| 1.33| 3|2.9400000000000004|
| 2| 200|2019-07-04| 1.61| 3| 1.61|
+---------+-------+----------+----------+---+------------------+
+---------+-------+----------+----------+---+----------+
|ProductId|StoreId| Date|Prediction| N|RollingSum|
+---------+-------+----------+----------+---+----------+
| 1| 100|2019-07-01| 0.92| 2| 1.54|
| 1| 100|2019-07-02| 0.62| 2| 1.51|
| 1| 100|2019-07-03| 0.89| 2| 1.46|
| 1| 100|2019-07-04| 0.57| 2| 0.57|
+---------+-------+----------+----------+---+----------+
Then you can do a union of all the dataframes inside the loop.

Finding efficiently all relevant sub ranges for bigdata tables in Hive/ Spark

Following this question, I would like to ask.
I have 2 tables:
The first table - MajorRange
row | From | To | Group ....
-----|--------|---------|---------
1 | 1200 | 1500 | A
2 | 2200 | 2700 | B
3 | 1700 | 1900 | C
4 | 2100 | 2150 | D
...
The second table - SubRange
row | From | To | Group ....
-----|--------|---------|---------
1 | 1208 | 1300 | E
2 | 1400 | 1600 | F
3 | 1700 | 2100 | G
4 | 2100 | 2500 | H
...
The output table should contain all the SubRange groups that overlap with the MajorRange groups. For the example above, the result table is:
row | Major | Sub |
-----|--------|------|-
1 | A | E |
2 | A | F |
3 | B | H |
4 | C | G |
5 | D | H |
If a Major range does not overlap with any Sub range, that Major will not appear.
Both tables are big data tables. How can I do this in Hive/Spark in the most efficient way?
With Spark, maybe a non-equi join like this?
val join_expr = major_range("From") < sub_range("To") && major_range("To") > sub_range("From")
(major_range.join(sub_range, join_expr)
.select(
monotonically_increasing_id().as("row"),
major_range("Group").as("Major"),
sub_range("Group").as("Sub")
)
).show
+---+-----+---+
|row|Major|Sub|
+---+-----+---+
| 0| A| E|
| 1| A| F|
| 2| B| H|
| 3| C| G|
| 4| D| H|
+---+-----+---+
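The join condition is the usual interval-overlap test: two ranges overlap when each one starts before the other ends. A plain-Python sketch over the sample rows shows which pairs it selects. It is a brute-force comparison of every pair, meant only to illustrate the predicate, not an efficient approach for large tables.

```python
# (From, To, Group)
major = [(1200, 1500, "A"), (2200, 2700, "B"), (1700, 1900, "C"), (2100, 2150, "D")]
sub = [(1208, 1300, "E"), (1400, 1600, "F"), (1700, 2100, "G"), (2100, 2500, "H")]

# Same predicate as join_expr: major.From < sub.To and major.To > sub.From.
# The inequalities are strict, so ranges that only touch at an endpoint
# (D ends its comparison at 2100, where G ends) do not count as overlapping.
pairs = [
    (m_group, s_group)
    for m_from, m_to, m_group in major
    for s_from, s_to, s_group in sub
    if m_from < s_to and m_to > s_from
]
print(pairs)
```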
