Pyspark: How do I get today's score and 30 day avg score in a single row - apache-spark

I have a use case where I want to get today's rank as well as the 30-day average as columns. The data contains 30 days of data for a particular Id and Type. The data looks like:
Id Type checkInDate avgrank
1 ALONE 2019-04-24 1.333333
1 ALONE 2019-03-31 34.057471
2 ALONE 2019-04-17 1.660842
1 TOGETHER 2019-04-13 19.500000
1 TOGETHER 2019-04-08 5.481203
2 ALONE 2019-03-29 122.449156
3 ALONE 2019-04-07 3.375000
1 TOGETHER 2019-04-01 49.179719
5 TOGETHER 2019-04-17 1.391753
2 ALONE 2019-04-22 3.916667
1 ALONE 2019-04-15 2.459151
As my result I want to have output like
Id Type TodayAvg 30DayAvg
1 ALONE 30.0 9.333333
1 TOGETHER 1.0 34.057471
2 ALONE 7.8 99.660842
2 TOGETHER 3 19.500000
.
.
The way I think I can achieve it is by having 2 dataframes: one filtering on today's date, the second averaging over the 30 days, and then joining the two on Id and Type:
from awsglue.transforms import Filter
from pyspark.sql import functions as F

# read the table twice: once for today's rows, once for the 30-day average
rank = glueContext.create_dynamic_frame.from_catalog(database="testing", table_name="rank", transformation_ctx="rank")
filtered_rank = Filter.apply(frame=rank, f=lambda x: x["checkInDate"] == curr_dt).toDF()
rank_avg = glueContext.create_dynamic_frame.from_catalog(database="testing", table_name="rank", transformation_ctx="rank_avg").toDF()
rank_avg_f = rank_avg.groupBy("id", "type").agg(F.mean("avgrank").alias("30DayAvg"))
rank_join = filtered_rank.join(rank_avg_f, ["id", "type"], how="inner")
Is there a simpler way to do it, i.e. without reading the dataframe twice?

You can convert the DynamicFrame to an Apache Spark DataFrame with toDF() and then use regular Spark SQL or DataFrame operations.
Check the documentation for toDF() and Spark SQL.
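If you convert once and use conditional aggregation, the whole result comes out of a single read. A minimal sketch, assuming curr_dt holds today's date in the same format as checkInDate and the table already contains only the last 30 days of data:

from pyspark.sql import functions as F

# convert the Glue DynamicFrame to a Spark DataFrame once
df = rank.toDF()

# one pass: conditional average over today's rows, plain average over the full 30 days
result = df.groupBy("Id", "Type").agg(
    F.avg(F.when(F.col("checkInDate") == curr_dt, F.col("avgrank"))).alias("TodayAvg"),
    F.avg("avgrank").alias("30DayAvg"),
)
result.show()

F.when() without otherwise() yields null for the non-matching rows, and avg ignores nulls, so the first aggregate only averages today's values.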

Related

calculate average difference between dates using pyspark

I have a data frame that looks like this: user ID and dates of activity. I need to calculate the average difference between dates using RDD functions (such as reduce and map) and not SQL.
The dates for each ID need to be sorted before calculating the difference, as I need the difference between each pair of consecutive dates.
ID  Date
1   2020-09-03
1   2020-09-03
2   2020-09-02
1   2020-09-04
2   2020-09-06
2   2020-09-16
The needed outcome for this example will be:
ID  average difference
1   0.5
2   7
thanks for helping!
You can use datediff with a window function to calculate the difference, then take the average.
lag is one of the window functions; it returns the value from the previous row within the window.
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# define the window: one partition per ID, rows ordered by Date
w = Window.partitionBy('ID').orderBy('Date')

# datediff returns the difference in days from the second argument to the first (first - second);
# lag('Date') is the previous date within the window
(df.withColumn('diff', F.datediff(F.col('Date'), F.lag('Date').over(w)))
 .groupby('ID')  # aggregate over ID
 .agg(F.avg(F.col('diff')).alias('average difference'))
)
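Since the question specifically asks for an RDD-based solution (map/reduce rather than SQL), here is a minimal sketch, assuming the Date column holds ISO-formatted date strings (or date objects) and every ID has at least two dates:

import datetime

# pair each row up as (ID, date), then collect all dates per ID
pairs = df.select('ID', 'Date').rdd.map(lambda r: (r['ID'], r['Date']))

avg_diff = (
    pairs.groupByKey()
         # parse and sort the dates of each ID
         .mapValues(lambda ds: sorted(datetime.date.fromisoformat(str(d)) for d in ds))
         # average the gaps (in days) between consecutive dates
         .mapValues(lambda ds: sum((b - a).days for a, b in zip(ds, ds[1:])) / (len(ds) - 1))
)
print(avg_diff.collect())  # [(1, 0.5), (2, 7.0)] for the sample data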

How to call a created function with pandas apply to all rows (axis=1) but only to some specific rows of a dataframe?

I have a function which sends automated messages to clients, and takes as input all the columns from a dataframe like the one below.
name    phone    status   date
name_1  phone_1  sending  today
name_2  phone_2  sending  yesterday
I iterate through the dataframe with a pandas apply (axis=1) and use the values in the columns of each row as inputs to my function. At the end, after sending, it changes the status to "sent". The thing is, I only want to send to the clients whose date reference is "today". With pandas.apply(axis=1) this is perfectly doable, but in order to slice out the clients with the "today" value, I need to:
create a new dataframe with today's value,
remove it from the original, and then
reappend it to the original.
I thought about running through the whole dataframe and ignoring the rows with dates different from "today", but if my dataframe keeps growing, I'm afraid the whole process will become slower.
I saw examples of this being done with mask, although usually people only use 1 column, and I need more than just the one. Is there any way to do this with pandas apply?
Thank you.
I think you can use .loc to filter the data and apply the function to it.
In [13]: df = pd.DataFrame(np.random.rand(5,5))
In [14]: df
Out[14]:
0 1 2 3 4
0 0.085870 0.013683 0.221890 0.533393 0.622122
1 0.191646 0.331533 0.259235 0.847078 0.649680
2 0.334781 0.521263 0.402030 0.973504 0.903314
3 0.189793 0.251130 0.983956 0.536816 0.703726
4 0.902107 0.226398 0.596697 0.489761 0.535270
If we want to double the values of the rows where the value in the first column is > 0.3:
In [16]: df.loc[df[0] > 0.3]
Out[16]:
0 1 2 3 4
2 0.334781 0.521263 0.402030 0.973504 0.903314
4 0.902107 0.226398 0.596697 0.489761 0.535270
In [18]: df.loc[df[0] > 0.3] = df.loc[df[0] > 0.3].apply(lambda x: x*2, axis=1)
In [19]: df
Out[19]:
0 1 2 3 4
0 0.085870 0.013683 0.221890 0.533393 0.622122
1 0.191646 0.331533 0.259235 0.847078 0.649680
2 0.669563 1.042527 0.804061 1.947008 1.806628
3 0.189793 0.251130 0.983956 0.536816 0.703726
4 1.804213 0.452797 1.193394 0.979522 1.070540
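Applied to the question's dataframe, a minimal sketch could look like this; send_message is a hypothetical stand-in for the real sending function:

# hypothetical send function that receives one row and returns the new status
def send_message(row):
    # ... call the messaging API with row['name'] and row['phone'] ...
    return "sent"

# boolean mask; combine more columns with & / | if needed
mask = df["date"] == "today"
# apply only to the matching rows and write the new status back in place
df.loc[mask, "status"] = df.loc[mask].apply(send_message, axis=1)

No slicing into a separate dataframe and re-appending is needed: .loc selects the same rows for both the apply and the assignment.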

Pandas: Subtracting Two Mismatched Dataframes

Forgive me if this is a repeat question, but I can't find the answer and I'm not even sure what the right terminology is.
I have two dataframes that don't have completely matching rows or columns. Something like:
Balances = pd.DataFrame({'Name':['Alan','Barry','Carl', 'Debbie', 'Elaine'],
                         'Age Of Debt':[1,4,3,7,2],
                         'Balance':[500,5000,300,100,3000],
                         'Payment Due Date':[1,1,30,14,1]})

Payments = pd.DataFrame({'Name':['Debbie','Alan','Carl'],
                         'Balance':[50,100,30]})
I want to subtract the Payments dataframe from the Balances dataframe based on Name, so essentially a new dataframe that looks like this:
pd.DataFrame({'Name':['Alan','Barry','Carl', 'Debbie', 'Elaine'],
              'Age Of Debt':[1,4,3,7,2],
              'Balance':[400,5000,270,50,3000],
              'Payment Due Date':[1,1,30,14,1]})
I can imagine having to iterate over the rows of Balances, but when both dataframes are very large I don't think it's very efficient.
You can use .merge:
# merge on Name so each balance lines up with its payment; names with no payment get 0
tmp = pd.merge(Balances, Payments, on="Name", how="outer").fillna(0)
# Balance_x is the original balance, Balance_y is the payment
Balances["Balance"] = tmp["Balance_x"] - tmp["Balance_y"]
print(Balances)
Prints:
Name Age Of Debt Balance Payment Due Date
0 Alan 1 400.0 1
1 Barry 4 5000.0 1
2 Carl 3 270.0 30
3 Debbie 7 50.0 14
4 Elaine 2 3000.0 1
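An alternative that skips the intermediate merge, as a minimal sketch using the same column names, is to look each payment up by Name and subtract it directly:

# payments as a Series indexed by Name
payments = Payments.set_index("Name")["Balance"]
# map each name to its payment (NaN where there is none), then subtract
Balances["Balance"] = Balances["Balance"] - Balances["Name"].map(payments).fillna(0)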

Sorting data frames with time in hours days months format

I have one data frame with time values in a column, but not in sorted order. I want to sort it in ascending order. Can someone suggest a direct function or code to sort data frames by this kind of time column?
My input data frame:
Time       data1
1 month    43.391588
13 h       31.548372
14 months  41.956652
3.5 h      31.847388
Expected data frame:
Time       data1
3.5 h      31.847388
13 h       31.548372
1 month    43.391588
14 months  41.956652
You need to replace the units with numbers first using Series.replace, then convert to numeric with pandas.eval, and finally use this column for sorting with DataFrame.sort_values:
import pandas as pd

# map each unit to a multiplier in hours (1 month = 30 days * 24 h);
# the plural ' months' is listed first so '14 months' is not left with a trailing 's'
d = {' months': '*30*24', ' month': '*30*24', ' h': '*1'}
df['sort'] = df['Time'].replace(d, regex=True).map(pd.eval)
df = df.sort_values('sort')
print(df)
Time data1 sort
3 3.5 h 31.847388 3.5
1 13 h 31.548372 13.0
0 1 month 43.391588 720.0
2 14 months 41.956652 10080.0
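If you would rather not keep the helper column, the same idea works with the key argument of sort_values (a sketch assuming pandas 1.1+):

df = df.sort_values('Time', key=lambda s: s.replace(d, regex=True).map(pd.eval))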
First, check what type of data you have in your dataframe, since this will indicate how to proceed: df.dtypes or, in your case, df.index.dtype.
The preferred option for sorting dataframes is df.sort_values().

Using apply to modify different dataframe in pandas

I'm running into some issues regarding the use of apply in Pandas.
I have a dataframe where multiple measurements are made on certain days at different measurement sites. To give an example, Site 1 has 2 measurements to make, every 7 or so days.
We know it has to be 2 measurements, so what I'm trying to do now is check on which days not enough measurements were made.
site measurement expected date
0 1 2 1 01-01-2020
1 2 3 2 01-01-2020
2 3 4 2 01-01-2020
3 3 5 2 01-01-2020
4 2 1 2 08-01-2020
5 2 4 2 08-01-2020
I've made a sorted and aggregated DataFrame that aggregates the measurements, so that I can iterate over them without going over the same day twice when there are multiple measurements.
site measurement expected date
0 1 2 1 01-01-2020
1 2 3 2 01-01-2020
2 3 9 2 01-01-2020
3 2 5 2 08-01-2020
For this I'm now using the function filter_amount(df_sorted, df, group).
group counts the number of measurements actually made: df.groupby(["site_id", "date"]).count()
site measurement date
0 1 2 01-01-2020
1 2 1 01-01-2020
2 3 2 01-01-2020
3 2 2 08-01-2020
The current function basically goes like this:
def filter_amount(df_sorted, df, group):
    for i in df_sorted.index:
        # locate the number of measurements actually made for this day and site, in group
        site, date = df_sorted.at[i, "site"], df_sorted.at[i, "date"]
        made = group.at[(site, date), "measurement"]
        # check how many measurements are expected
        expected = df_sorted.at[i, "expected"]
        if made < expected:
            # find all measurements for this site and day in the normal df and drop them
            df = df.drop(df[(df["site"] == site) & (df["date"] == date)].index)
    return df
So in this example, the measurements from site 2 on 1-1-20 have to be dropped, because not enough measurements were made. The ones from 8-1-20 are valid, because 2 are expected and 2 were made.
The problem is that this is extremely slow with over 500k rows.
The variables I need I get by using .at[], but I'm trying to make it faster with apply so I can parallelise the operations, and I can't figure it out. I'm applying on df_sorted and passing in the arguments it needs, but it doesn't actually drop the measurements from the original df.
I have a feeling it's possible to do it with some sort of groupby on the original df to save operations.
I hope it's clear enough, happy to elaborate any questions.
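Since the question already hints at a groupby on the original df, a vectorised sketch (assuming the column names shown in the tables above: site, measurement, expected, date) can skip the row-by-row loop entirely:

# count how many measurement rows actually exist per site and date
made = df.groupby(["site", "date"])["measurement"].transform("size")
# keep only the rows whose group reaches the expected number of measurements
df = df[made >= df["expected"]]

With the sample data this keeps site 1 and site 3 on 01-01-2020 and site 2 on 08-01-2020, and drops site 2 on 01-01-2020, which matches the rule described above.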
