Spark withWatermark, only store values before and after a gap - apache-spark

I've got data coming in over Kafka to a Spark Structured Streaming application. To simplify, each message on Kafka contains a device-id, a datetime and a value. The purpose of the streaming application is to calculate the difference between consecutive values for each device.
For example, if the input is
Device-ID  Datetime        Value
1          20210922-15:15  21
1          20210922-15:16  24
1          20210922-15:17  26
I would like the output to be
Device-ID  Datetime        Value
1          20210922-15:16  3
1          20210922-15:17  2
To solve this, and to handle messages that can arrive late (up to 10 days), I'm using withWatermark on the Datetime column with a 10-day threshold. However, this leads to huge memory usage when I have many devices, since Spark keeps all values for all devices in state for 10 days.
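For reference, a minimal sketch of the kind of setup described above (the topic name, bootstrap servers, payload schema and column names are assumptions made for illustration):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("device-diffs").getOrCreate()

# Read the device messages from Kafka (topic name and brokers are placeholders).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "broker:9092")
       .option("subscribe", "device-values")
       .load())

# Parse the JSON payload into deviceId / datetime / value columns (assumed schema).
parsed = (raw.selectExpr("CAST(value AS STRING) AS json")
          .select(F.from_json("json", "deviceId STRING, datetime TIMESTAMP, value DOUBLE").alias("m"))
          .select("m.*"))

# Allow events to arrive up to 10 days late; this is what forces Spark to keep
# 10 days of per-device values in state for downstream stateful operations.
events = parsed.withWatermark("datetime", "10 days")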
In practice, however, I do not need to store e.g. the value for 15:16 for device X if I have already retrieved the values for 15:15 and 15:17.
So, instead of storing something like this in memory (due to withWatermark)
Device-ID  Datetime        Value
1          20210922-15:15  ...
1          20210922-15:16  ...
1          20210922-15:17  ...
1          20210922-15:18  ...
2          20210922-14:15  ...
2          20210922-14:16  ...
2          20210922-14:17  ...
2          20210922-14:18  ...
I would only need this
Device-ID  Datetime        Value
1          20210922-15:15  ...
1          20210922-15:18  ...
2          20210922-14:15  ...
2          20210922-14:18  ...
Is this doable?

Related

Find nth row per group in a large dataset with Spark

I have a (very) large dataset partitioned by year, month and day. The partition columns were derived from an updated_at column during ingestion. This is what it looks like:
id      user  updated_at  year  month  day
1       a     1992-01-19  1992  1      19
2       c     1992-01-20  1992  1      20
3       a     1992-01-21  1992  1      21
...     ...   ...         ...   ...    ...
720987  c     2012-07-20  2012  7      20
720988  a     2012-07-21  2012  7      21
...     ...   ...         ...   ...    ...
I need to use Apache Spark to find the 5th earliest event per user.
A simple window function like the one below is not feasible: I use a shared cluster, and given the size of the dataset I won't have enough resources for in-memory processing at any given time.
from pyspark.sql import functions as F, Window

window = Window.partitionBy("user").orderBy(F.asc("updated_at"))
ranked = (df.withColumn("rank", F.dense_rank().over(window))
            .filter(F.col("rank") == 5))
I am considering looping through partitions, processing and persisting data to disk, and then merging them back. How would you solve it? Thanks!
I think the code below will be faster, because the data is partitioned by these columns and Spark can benefit from data locality.
Window.partitionBy("user").orderBy(F.asc("year"), F.asc("month"), F.asc("day"))

calculate average difference between dates using pyspark

I have a data frame that looks like this: user ID and dates of activity. I need to calculate the average difference between dates using RDD functions (such as reduce and map) and not SQL.
The dates for each ID need to be sorted before calculating the differences, since I need the difference between each pair of consecutive dates.
ID  Date
1   2020-09-03
1   2020-09-03
2   2020-09-02
1   2020-09-04
2   2020-09-06
2   2020-09-16
The needed outcome for this example would be:
ID  average difference
1   0.5
2   7
Thanks for helping!
You can use datediff with a window function to calculate the difference, then take the average.
lag is one of the window functions; it takes a value from the previous row within the window.
from pyspark.sql import functions as F, Window

# define the window: per ID, ordered by date
w = Window.partitionBy('ID').orderBy('Date')

# datediff takes the date difference from the first arg to the second arg (first - second).
(df.withColumn('diff', F.datediff(F.col('Date'), F.lag('Date').over(w)))
   .groupby('ID')  # aggregate over ID
   .agg(F.avg(F.col('diff')).alias('average difference'))
)
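Since the question specifically asks for RDD functions such as map and reduce rather than SQL, here is a rough sketch of that alternative as well (assuming df has the ID and Date columns shown above, with Date already parsed as a date type):

# Group the dates per ID, sort them, and average the consecutive gaps.
avg_diffs = (df.rdd
    .map(lambda row: (row['ID'], [row['Date']]))
    .reduceByKey(lambda a, b: a + b)
    .mapValues(sorted)
    .mapValues(lambda dates: [(d2 - d1).days for d1, d2 in zip(dates, dates[1:])])
    .mapValues(lambda diffs: sum(diffs) / len(diffs) if diffs else 0.0))

avg_diffs.collect()  # e.g. [(1, 0.5), (2, 7.0)]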

How do I efficiently generate data with random variation in a time series based on existing data points

I have a handful of data points in a csv as follows:
    date        value
0   8/1/2019    0.243902
1   8/17/2019   0.322581
2   9/1/2019    0.476190
3   10/6/2019   0.322581
4   10/29/2019  0.476190
5   11/10/2019  0.526316
6   11/21/2019  1.818182
7   12/8/2019   2.500000
8   12/22/2019  3.076923
9   1/5/2020    3.333333
10  1/12/2020   3.333333
11  1/19/2020   0.000000
12  2/2/2020    0.000000
I want to generate a value for every hour between the first date and the last date (assuming each day starts at 00:00), such that the generated values form a fairly smooth curve between the existing data points. I would also like to add a small amount of random variation to the generated values, so that the curve is not perfectly smooth. I ultimately want to output this new dataset to a CSV with the same two columns, containing the original rows along with the generated values and their associated datetimes (each in its own row).
Is there a way to easily generate these points and output the result to a CSV? I have so far tried using pandas to store the data, but I can't figure out a way to make the generated data take the existing data points into account.
Let's try scipy.interpolate:
import pandas as pd
import matplotlib.pyplot as plt
from scipy import interpolate

# make sure the date column is parsed as datetimes (the CSV loads it as strings)
df['date'] = pd.to_datetime(df['date'])

# the new hourly timestamps, running through 23:00 on the last day
new_date = pd.date_range(df.date.min(), df.date.max() + pd.to_timedelta('23h'),
                         freq='H')

# fit a spline through the existing points and evaluate it at the new timestamps
tck = interpolate.splrep(df['date'].astype('int64'), df['value'], s=0)
new_values = interpolate.splev(new_date.astype('int64'), tck)

# visualize
plt.plot(df.date, df.value, label='raw')
plt.plot(new_date, new_values, label='interpolated')
plt.legend()
Output: a plot of the raw data points together with the interpolated hourly curve.
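The question also asks for a small amount of random variation and a CSV export; a minimal sketch of that last step could look like this (the noise scale of 0.05 is an arbitrary assumption to tune):

import numpy as np

# add a little random variation to the interpolated values
noisy_values = new_values + np.random.normal(0, 0.05, size=len(new_values))

# write the generated hourly series out to a CSV
pd.DataFrame({'date': new_date, 'value': noisy_values}).to_csv('generated_values.csv', index=False)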

Pyspark: How do I get today's score and 30 day avg score in a single row

I have a use-case where I want to get the rank for today as well as the 30-day average as a column. The data holds 30 days of data for a particular Id and Type. The data looks like:
Id  Type      checkInDate  avgrank
1   ALONE     2019-04-24   1.333333
1   ALONE     2019-03-31   34.057471
2   ALONE     2019-04-17   1.660842
1   TOGETHER  2019-04-13   19.500000
1   TOGETHER  2019-04-08   5.481203
2   ALONE     2019-03-29   122.449156
3   ALONE     2019-04-07   3.375000
1   TOGETHER  2019-04-01   49.179719
5   TOGETHER  2019-04-17   1.391753
2   ALONE     2019-04-22   3.916667
1   ALONE     2019-04-15   2.459151
As my result I want output like:
Id  Type      TodayAvg  30DayAvg
1   ALONE     30.0      9.333333
1   TOGETHER  1.0       34.057471
2   ALONE     7.8       99.660842
2   TOGETHER  3         19.500000
.
.
The way I think I can achieve it is with two dataframes: one filtered on today's date, and a second one computing the average over 30 days, then joining the two on Id and Type:
rank = glueContext.create_dynamic_frame.from_catalog(database="testing", table_name="rank", transformation_ctx="rank")
filtered_rank = Filter.apply(frame=rank, f=lambda x: (x["checkInDate"] == curr_dt))

rank_avg = glueContext.create_dynamic_frame.from_catalog(database="testing", table_name="rank", transformation_ctx="rank_avg")
rank_avg_f = rank_avg.toDF().groupBy("id", "type").agg(F.mean("avgrank").alias("30DayAvg"))

rank_join = filtered_rank.toDF().join(rank_avg_f, ["id", "type"], how='inner')
Is there a simpler way to do it, i.e. without reading the dataframe twice?
You can convert the dynamic frame to an Apache Spark data frame and perform regular SQL.
Check the documentation for toDF() and Spark SQL.
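For example, a rough single-pass sketch along those lines, using conditional aggregation on the converted DataFrame instead of a filter plus a join (curr_dt is assumed to hold today's date, as in the question):

from pyspark.sql import functions as F

df = rank.toDF()  # convert the Glue DynamicFrame to a Spark DataFrame

result = (df.groupBy("id", "type")
            .agg(F.avg(F.when(F.col("checkInDate") == curr_dt, F.col("avgrank")))
                  .alias("TodayAvg"),  # avg() ignores the NULLs produced by when()
                 F.avg("avgrank").alias("30DayAvg")))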

find first and last dates from log data in azure table storage

I have about 3 million rows of data in Azure Table Storage which come from log files. Each row in the table is a detection of a certain event (this may be 1 or 100 rows of data per client; we don't know until it's there), and there are a number of different events.
For each event I need to find the duration of the event from the timestamps of its rows, per client. If there is a gap between end and start time, it counts as a new event. EventId is the PartitionKey for the row, and a composite key of the timestamp as epoch plus the client ID makes up the RowKey.
The Azure Table Storage looks like the following, with some example data:
PartitionKey  RowKey        ClientId  Epoch       Additional
1             1370966492_1  1         1370969592  34
1             1370967792_1  1         1370967792  63
2             1370969592_1  1         1370969592  34
1             1370972592_1  2         1370972592  47
1             1370973542_1  1         1370969592  44
2             1370976562_1  1         1370976562  18
1             1370978592_1  2         1370978592  92
3             1370981542_1  2         1370981542  34
2             1370982562_1  1         1370982562  37
1             1370982592_1  1         1370982592  73
And the output I need is (example not related to the data above):
EventId  ClientId  StartTime   EndTime     Max(additional)
1        1         1370966492  1370973492  78
1        2         1370967834  1370979536  29
What would be the most efficient way of processing the data? Would it be to keep the data in Table Storage? Once I have processed these logs, it is possible to change the import procedure into Table Storage if need be.
You may need a different format for your date in the RowKey. The problem is that the RowKey and PartitionKey are both strings, so the comparison is always ordinal. You must provide a format that represents all dates with the same number of characters, so that the ordinal comparison gives the same result as a date comparison, e.g. 20130805122200 (yyyyMMddHHmmss).
The other thing is that Table Storage queries work as follows:
- Search for a given PartitionKey
- For each partition that matches, search for a given RowKey
- For each entity that matches, filter on any other criteria
In the example above you use the date and the event in the RowKey. If you always search by date, I recommend that you include this property in the PartitionKey too.
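To illustrate the fixed-width key idea, a small sketch (the helper name and exact key layout are assumptions based on the question's RowKey format):

from datetime import datetime, timezone

def make_row_key(epoch_seconds, client_id):
    # Fixed-width yyyyMMddHHmmss timestamp, so that ordinal (string) ordering
    # of RowKeys matches chronological ordering.
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{ts}_{client_id}"

make_row_key(1370966492, 1)  # -> '20130611160132_1' (UTC)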
