Find nth row per group in a large dataset with Spark - apache-spark

I have a (very) large dataset partitioned by year, month and day. The partition columns were derived from an updated_at column during ingestion. Here is what it looks like:
id      user  updated_at  year  month  day
1       a     1992-01-19  1992  1      19
2       c     1992-01-20  1992  1      20
3       a     1992-01-21  1992  1      21
...     ...   ...         ...   ...    ...
720987  c     2012-07-20  2012  7      20
720988  a     2012-07-21  2012  7      21
...     ...   ...         ...   ...    ...
I need to use Apache Spark to find the 5th earliest event per user.
A simple window function like the one below is not feasible: I'm on a shared cluster, and given the size of the dataset I won't have enough resources for in-memory processing at any given time.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

window = Window.partitionBy("user").orderBy(F.asc("updated_at"))
ranked = (df.withColumn("rank", F.dense_rank().over(window))
            .filter(F.col("rank") == 5))
I am considering looping through partitions, processing and persisting data to disk, and then merging them back. How would you solve it? Thanks!
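For what it's worth, here is a rough sketch of the chunked approach described above. It assumes the dataset is registered as a catalog table; the table name (events), the intermediate path and the year range are placeholders/assumptions:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

per_user = Window.partitionBy("user").orderBy(F.asc("updated_at"))
candidates_path = "/tmp/five_earliest_candidates"   # placeholder location

# Process one year at a time; filtering on the partition column lets Spark prune
# partitions so each chunk stays small. The year range is taken from the sample data.
for yr in range(1992, 2013):
    chunk = (spark.table("events")                   # placeholder table name
             .where(F.col("year") == yr)
             .withColumn("rn", F.row_number().over(per_user))
             .filter(F.col("rn") <= 5)               # keep at most the 5 earliest rows per user per year
             .drop("rn"))
    chunk.write.mode("append").parquet(candidates_path)

# Merge the per-year candidates and take the 5th earliest event per user overall.
candidates = spark.read.parquet(candidates_path)
fifth_event = (candidates
               .withColumn("rn", F.row_number().over(per_user))
               .filter(F.col("rn") == 5))

This works because a user's 5th earliest event overall must be among the 5 earliest events of whichever chunk contains it.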

I think the code below will be faster, because the data is already partitioned by these columns and Spark can benefit from data locality.
Window.partitionBy("user").orderBy(F.asc("year"), F.asc("month"), F.asc("day"))

Related

How to estimate a Spark DataFrame's row count after joining two or more tables?

I'm developing a feature that accepts dynamic SQL as input and then uses that input to submit a Spark job. But the inputs are unpredictable: some inputs may exceed the limit, which is a danger for me. I want to check the SQL's cost before submitting the job. Is there a way I can estimate the cost accurately?
My Spark conf is:
Spark Version: 3.3.1
conf:
spark.sql.cbo.enabled: true
spark.sql.statistics.histogram.enabled: true
example:
I have a dataFrame df1 like this
n x y z
'A' 1 2 3
'A' 4 5 6
'A' 7 8 9
'A' 10 11 12
'A' 13 14 15
'A' 16 17 18
'A' 19 20 21
'A' 22 23 24
'A' 25 26 27
'A' 28 29 30
row count of df1.join(df1,"n","left").join(df1,"n","left") should be 1000
row count of df1.join(df1,"n","left").join(df1,"n","left") should be 10
but the result of dataFrame.queryExecution.optimizedPlan.stats is always 1000 for the examples above.
I've tried a few approaches:
dataFrame.queryExecution.optimizedPlan.stats, but the estimated row count is much bigger than the actual one, especially when joins are involved.
Using dataFrame.rdd.countApprox. The problem is that it needs a lot of time to get the actual result when the DataFrame is big.
I also tried org.apache.spark.sql.execution.command.CommandUtils#calculateMultipleLocationSizesInParallel; it's better than dataFrame.rdd.countApprox, but in some extreme scenarios it can still take tens of minutes.
First let's calculate the number of rows in each table
df1_count = df1.count()
df2_count = df2.count()
df3_count = df3.count()
Then add up the row counts to get a rough estimate of the total number of rows involved in the joined DataFrame:
estimated_row_count = df1_count + df2_count + df3_count
Finally, when you actually perform the join, you can get the exact count:
joined_df = df1.join(df2, on=..., how=...).join(df3, on=..., how=...)
exact_row_count = joined_df.count()
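As a side note, since spark.sql.cbo.enabled is already on in the question's config, the optimizer's own row-count estimates can be surfaced with explain in cost mode. A small sketch, assuming the inputs are catalog tables with fresh statistics (table names are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Column-level statistics are what the cost-based optimizer uses for row-count estimates.
spark.sql("ANALYZE TABLE mydb.t1 COMPUTE STATISTICS FOR ALL COLUMNS")   # placeholder table name

df1 = spark.table("mydb.t1")
joined = df1.join(df1, "n", "left").join(df1, "n", "left")

# Prints the optimized plan annotated with Statistics(sizeInBytes=..., rowCount=...).
joined.explain(mode="cost")

As noted above, the estimate can still drift badly after several joins, so this is a pre-check rather than a guarantee.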

Spark withWatermark, only store values before and after a gap

I've got data coming in over Kafka to a Spark Structured Streaming application. To simplify, each message on Kafka contains a device ID, a datetime and a value. The purpose of the streaming application is to calculate the difference between consecutive values for each device.
I.e. if the input is
Device-ID  Datetime        Value
1          20210922-15:15  21
1          20210922-15:16  24
1          20210922-15:17  26
I would like the output to be
Device-ID  Datetime        Value
1          20210922-15:16  3
1          20210922-15:17  2
To solve this, and to handle messages that can come in late (up to 10 days), I'm using withWatermark on the Datetime column with 10 days. However, this leads to huge memory usage if I have many devices, since Spark will keep all values for 10 days for all devices in memory.
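For reference, a minimal sketch of that setup; the broker, topic, schema and JSON payload format are all assumptions, and only the withWatermark call reflects the question:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

spark = SparkSession.builder.getOrCreate()

schema = StructType([
    StructField("device_id", StringType()),
    StructField("datetime", TimestampType()),
    StructField("value", DoubleType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")   # placeholder broker
       .option("subscribe", "device-values")               # assumed topic name
       .load())

# Assumes JSON payloads; the real message format may differ.
events = (raw.select(F.from_json(F.col("value").cast("string"), schema).alias("m"))
             .select("m.*"))

# 10-day watermark: Spark keeps up to 10 days of state per device to admit late data,
# which is the source of the memory pressure described above.
watermarked = events.withWatermark("datetime", "10 days")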
In practice, however, I do not need to store, e.g., the value for 15:16 for device X if I have already retrieved the values for 15:15 and 15:17.
So, instead of storing something like this in memory (due to withWatermark)
Device-ID  Datetime
1          20210922-15:15
1          20210922-15:16
1          20210922-15:17
1          20210922-15:18
2          20210922-14:15
2          20210922-14:16
2          20210922-14:17
2          20210922-14:18
I would only need this
Device-ID  Datetime
1          20210922-15:15
1          20210922-15:18
2          20210922-14:15
2          20210922-14:18
Is this doable?

Comparing data across executors in Spark

We have a Spark application in which data is spread across different executors. But we also need to compare data between executors, where some data is present on executor 1 and some on executor 2. We want to know how we can achieve this in Spark.
For example, we have a file with the following details:
Name, Date1, Date2
A, 2019-01-01, 2019-01-23
A, 2019-02-12, 2019-03-21
A, 2019-04-01, 2019-05-31
A, 2019-06-02, 2019-12-30
B, 2019-01-01, 2019-01-21
B, 2019-02-10, 2019-03-21
B, 2019-04-01, 2019-12-31
I need to find the total gap between these rows by checking Date2 of the first row against Date1 of the second row, and so on.
For example, for Name A: (2019-02-12 - 2019-01-23) + (2019-04-01 - 2019-03-21) + (2019-06-02 - 2019-05-31) + (2019-12-31 - 2019-12-30). The year ends on 2019-12-31, so there is a gap of 1 day at the end.
I also need the number of gaps (cases where the difference above is > 0), which here would be 4.
For Name B: (2019-02-10 - 2019-01-21) + (2019-04-01 - 2019-03-21), and the number of gaps would be 2.
One approach is to use collectAsList(), which retrieves all the data to the driver, but is there a more efficient way to compare the rows directly across executors? If yes, how can we do that?
Just write an SQL query with lag windowing, checking adjacent rows for the date against the previous date, with the major key qualification being Name. Sort within Name as well.
You need not worry about Executors, Spark will hash for you automatically based on Name to a Partition serviced by an Executor.
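A PySpark sketch of that lag-based approach, using the sample data from the question (the trailing gap up to 2019-12-31 mentioned for Name A would need one extra step and is left out here):

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("A", "2019-01-01", "2019-01-23"), ("A", "2019-02-12", "2019-03-21"),
     ("A", "2019-04-01", "2019-05-31"), ("A", "2019-06-02", "2019-12-30"),
     ("B", "2019-01-01", "2019-01-21"), ("B", "2019-02-10", "2019-03-21"),
     ("B", "2019-04-01", "2019-12-31")],
    ["Name", "Date1", "Date2"],
).select("Name", F.to_date("Date1").alias("Date1"), F.to_date("Date2").alias("Date2"))

# Compare each row's Date1 with the previous row's Date2 within the same Name.
w = Window.partitionBy("Name").orderBy("Date1")

gaps = (df
        .withColumn("prev_date2", F.lag("Date2").over(w))
        .withColumn("gap_days", F.datediff(F.col("Date1"), F.col("prev_date2")))
        .where(F.col("gap_days") > 0))

result = gaps.groupBy("Name").agg(
    F.sum("gap_days").alias("total_gap_days"),
    F.count("*").alias("num_gaps"),
)
result.show()

Because the lag runs inside a window partitioned by Name, Spark shuffles all rows for the same Name to the same partition, which is exactly the point above about not having to manage executors yourself.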

Pyspark: How do I get today's score and 30 day avg score in a single row

I have a use case where I want to get the rank for today as well as the 30-day average as a column. The data has 30 days of data for a particular ID and type. The data looks like:
Id Type checkInDate avgrank
1 ALONE 2019-04-24 1.333333
1 ALONE 2019-03-31 34.057471
2 ALONE 2019-04-17 1.660842
1 TOGETHER 2019-04-13 19.500000
1 TOGETHER 2019-04-08 5.481203
2 ALONE 2019-03-29 122.449156
3 ALONE 2019-04-07 3.375000
1 TOGETHER 2019-04-01 49.179719
5 TOGETHER 2019-04-17 1.391753
2 ALONE 2019-04-22 3.916667
1 ALONE 2019-04-15 2.459151
As my result, I want to have output like:
Id Type TodayAvg 30DayAvg
1 ALONE 30.0 9.333333
1 TOGETHER 1.0 34.057471
2 ALONE 7.8 99.660842
2 TOGETHER 3 19.500000
.
.
The way I think I can achieve it is by having two dataframes: one filtering on today's date and the second doing an average over 30 days, then joining the two dataframes on ID and Type.
rank = glueContext.create_dynamic_frame.from_catalog(database="testing", table_name="rank", transformation_ctx="rank")
filtered_rank = Filter.apply(frame=rank, f=lambda x: (x["checkInDate"] == curr_dt)).toDF()
rank_avg = glueContext.create_dynamic_frame.from_catalog(database="testing", table_name="rank", transformation_ctx="rank_avg")
rank_avg_f = rank_avg.toDF().groupBy("id", "type").agg(F.mean("avgrank"))
rank_join = filtered_rank.join(rank_avg_f, ["id", "type"], how='inner')
Is there a simpler way to do it i.e. without reading the dataframe twice?
You can convert the dynamic frame to an Apache Spark DataFrame and perform regular SQL.
Check the documentation for toDF() and Spark SQL.
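Building on that, one way to avoid reading the table twice is to convert once with toDF() and compute both numbers in a single aggregation with a conditional average. A sketch, reusing rank and curr_dt from the question's snippet (column names follow the question's code):

from pyspark.sql import functions as F

# Convert the Glue DynamicFrame to a regular Spark DataFrame once.
rank_df = rank.toDF()

result = rank_df.groupBy("id", "type").agg(
    # Average over today's rows only: when() leaves other rows null and avg() ignores nulls.
    F.avg(F.when(F.col("checkInDate") == curr_dt, F.col("avgrank"))).alias("TodayAvg"),
    # Average over the full 30 days present in the data.
    F.avg("avgrank").alias("30DayAvg"),
)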

identifying decrease in values in spark (outliers)

I have a large data set with millions of records which is something like
Movie Likes Comments Shares Views
A 100 10 20 30
A 102 11 22 35
A 104 12 25 45
A *103* 13 *24* 50
B 200 10 20 30
B 205 *9* 21 35
B *203* 12 29 42
B 210 13 *23* *39*
Likes, comments, etc. are rolling totals and they are supposed to increase. If there is a drop in any of these for a movie, then it is bad data and needs to be identified.
My initial thought was to group by movie and then sort within the group. I am using dataframes in Spark 1.6 for processing, and it does not seem to be achievable, as there is no sorting within the grouped data in a dataframe.
Building something for outlier detection could be another approach, but because of time constraints I have not explored it yet.
Is there any way I can achieve this?
Thanks!!
You can use the lag window function to bring the previous values into scope:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

val windowSpec = Window.partitionBy('Movie).orderBy('maybesometemporalfield)
dataset
  .withColumn("lag_likes", lag('Likes, 1) over windowSpec)
  .withColumn("lag_comments", lag('Comments, 1) over windowSpec)
  .show
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-functions.html#lag
Another approach would be to assign a row number (if there isn't one already), lag that column, then join each row to its previous row to allow you to do the comparison.
HTH
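For completeness, a PySpark version of the same lag-and-compare idea applied to the question's DataFrame (here called df), flagging rows where any rolling total drops. The ordering column is an assumption, since the sample has no explicit timestamp:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Order within each movie by whatever temporal or sequence column the real data has.
w = Window.partitionBy("Movie").orderBy("some_order_col")   # assumed ordering column

metrics = ["Likes", "Comments", "Shares", "Views"]

with_lags = df
for m in metrics:
    with_lags = with_lags.withColumn("prev_" + m, F.lag(m, 1).over(w))

# A row is suspect if any rolling total is smaller than its previous value.
is_bad = None
for m in metrics:
    drop_here = F.col(m) < F.col("prev_" + m)
    is_bad = drop_here if is_bad is None else (is_bad | drop_here)

bad_rows = with_lags.where(is_bad)
bad_rows.show()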
