How to use a PySpark window function on new, unprocessed data? - python-3.x

I have developed window functions on a PySpark DataFrame to calculate the total transaction amount made by each customer on a monthly basis, per transaction.
For example, the input table holds the raw transactions, and the window function processes them and inserts the results into a processed table.
Now, when new transactions arrive today, I want to write code that loads only the last month of transactions into a Spark DataFrame, runs the window function on the new rows, and saves just those rows into the processed table. The current window function reprocesses all the rows, so I then have to manually exclude the records that were already inserted and insert only the new ones. This uses a lot of resources and memory, and it will only get worse once the window covers a whole year. A rough sketch of the incremental idea is shown after the code below.
#Functions to apply the window function
from pyspark.sql import functions as F
from pyspark.sql.functions import col, when
from pyspark.sql.window import Window

def cumulative_total_CR(df, from_column, to_column, window_function):
    # running total of credit ('C') transactions over the window
    intermediate_column = from_column + "_temp"
    df = df.withColumn(from_column, df[from_column].cast("double"))
    df = df.withColumn(intermediate_column, when(col("Flow") == 'C', df[from_column]).otherwise(0))
    df = df.withColumn(to_column, F.sum(intermediate_column).over(window_function))
    return df

def cumulative_total_DR(df, from_column, to_column, window_function):
    # running total of debit ('D') transactions over the window
    intermediate_column = from_column + "_temp"
    df = df.withColumn(from_column, df[from_column].cast("double"))
    df = df.withColumn(intermediate_column, when(col("Flow") == 'D', df[from_column]).otherwise(0))
    df = df.withColumn(to_column, F.sum(intermediate_column).over(window_function))
    return df
#Window function: last 30 days per customer, ordered by the transaction timestamp.
# rangeBetween uses the units of the orderBy expression (seconds, after casting the
# timestamp to long), so 30 days is -30 * 86400 rather than -30.
window_function_30_days = (Window.partitionBy("CUSNO")
                                 .orderBy(F.col("TxnDateTime").cast("long"))
                                 .rangeBetween(-30 * 86400, 0))
df = ...  # load the data from Hive
#append TxnDate and TxnTime into a new column TxnDateTime, cast to timestamp with format 'yyyy-MM-dd HH:mm:ss.SSS'
df = cumulative_total_CR(df, "TXNAMT", "Total_Cr_Monthly_Amt", window_function_30_days)
df = cumulative_total_DR(df, "TXNAMT", "Total_Dr_Monthly_Amt", window_function_30_days)
# save the newly processed records to disk
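A rough sketch of the incremental idea, under stated assumptions (raw_txns and processed_txns are placeholder table names, and the last-processed timestamp is hard-coded here but would really be looked up from the processed table): load the new rows plus a 30-day lookback so the window has its full history, run the same two functions, then keep and append only the rows that are actually new.
from datetime import datetime, timedelta
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Placeholder boundary: in practice, read this from the processed table (e.g. its max TxnDateTime)
last_processed_ts = datetime(2023, 5, 31)
lookback_start = last_processed_ts - timedelta(days=30)

# raw_txns is a placeholder for the Hive source table
raw = spark.table("raw_txns").where(F.col("TxnDateTime") > F.lit(lookback_start))

raw = cumulative_total_CR(raw, "TXNAMT", "Total_Cr_Monthly_Amt", window_function_30_days)
raw = cumulative_total_DR(raw, "TXNAMT", "Total_Dr_Monthly_Amt", window_function_30_days)

# The lookback rows were only needed as window context; append just the new ones.
new_rows = raw.where(F.col("TxnDateTime") > F.lit(last_processed_ts))
new_rows.write.mode("append").saveAsTable("processed_txns")
Because the window only ever looks back 30 days, limiting the load to a 30-day lookback keeps the recomputed totals identical to a full-history run, while the amount of data scanned stays roughly constant per run.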

Related

Does PySpark run operations out of sequence due to optimization?

I'm confused about the result my code is giving me. Here is the code I wrote:
def update_cassandra(df: DataFrame, aggr: str):
    aggr_map_dict = {
        'Giornaliera': 'day',
        'Settimanale': 'week',
        'Bi-Settimanale': 'bi_week',
        'Mensile': 'month'
    }
    max_min_dates = df.agg(F.max(df['data']), F.min(df['data'])).collect()[0]
    upper_date = max_min_dates[0]
    lower_date = max_min_dates[1]
    df = df.select('data', 'punto_di_interesse', 'id_telco', 'presenze', 'presenze_uniche',
                   'presenze_00_06', 'presenze_06_08', 'presenze_08_10', 'presenze_10_12',
                   'presenze_12_14', 'presenze_14_16', 'presenze_16_18', 'presenze_18_20',
                   'presenze_20_22', 'presenze_22_24')
    print('contenuto del csv')
    display(df.where(F.col('punto_di_interesse') == 'CC - Neapolis'))
    telco_day_aggr = (read_from_cassandra_dev(f'telco_{aggr_map_dict[aggr]}_aggr')
                      .where(F.col('data').between(lower_date, upper_date)))
    if telco_day_aggr.count() == 0:
        telco_day_aggr = create_empty_df()
    print('telco_day_aggr as is')
    display(telco_day_aggr.where(F.col('punto_di_interesse') == 'CC - Neapolis'))
    union_df = df.union(telco_day_aggr)
    print('unione del AS-IS e del csv')
    display(union_df.where(F.col('punto_di_interesse') == 'CC - Neapolis'))
    output_df = (union_df.groupBy('data', 'punto_di_interesse', 'id_telco')
                 .agg(
                     F.sum('presenze').alias('presenze'),
                     F.sum('presenze_uniche').alias('presenze_uniche'),
                     F.sum('presenze_00_06').alias('presenze_00_06'),
                     F.sum('presenze_06_08').alias('presenze_06_08'),
                     F.sum('presenze_08_10').alias('presenze_08_10'),
                     F.sum('presenze_10_12').alias('presenze_10_12'),
                     F.sum('presenze_12_14').alias('presenze_12_14'),
                     F.sum('presenze_14_16').alias('presenze_14_16'),
                     F.sum('presenze_16_18').alias('presenze_16_18'),
                     F.sum('presenze_18_20').alias('presenze_18_20'),
                     F.sum('presenze_20_22').alias('presenze_20_22'),
                     F.sum('presenze_22_24').alias('presenze_22_24')
                 ))
    return output_df
aggregate_df = aggregate_table(df_daily, 'Giornaliera')
write_on_cassandra_dev(aggregate_df, 'telco_day_aggr')
What I expect to achieve is a sort of update for Cassandra, because of how the Cassandra drivers handle writes (rows with the same key are overwritten). So the operations in my head are like this:
read the CSV from blob storage and store it in a dataframe (the df variable, the input of the method)
using the max and min dates of this CSV file, query the table in Cassandra and save the result in another variable
concatenate the two dataframes
sum up with the groupBy
write the new dataframe to Cassandra, overwriting the existing rows with the new ones
It seems to me that, somehow, what is in the dataframe "df" is written before "telco_day_aggr" is read, and that the union and groupBy parts have no effect. In other words, my Cassandra table ends up containing only the content of df.
I can provide additional information if needed.
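One thing worth ruling out (an assumption, not something the question confirms): Spark evaluates lazily, so the Cassandra read is not executed where read_from_cassandra_dev appears in the code but when write_on_cassandra_dev finally triggers the job, i.e. while the very table being read is being rewritten. A minimal sketch of pinning the current table contents in memory before the write, assuming the helper returns an ordinary DataFrame:
# Materialize the "as is" snapshot now, so the final write does not trigger a
# fresh read of the same table it is overwriting.
telco_day_aggr = (read_from_cassandra_dev(f'telco_{aggr_map_dict[aggr]}_aggr')
                  .where(F.col('data').between(lower_date, upper_date))
                  .cache())
if telco_day_aggr.count() == 0:   # count() both checks emptiness and fills the cache
    telco_day_aggr = create_empty_df()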

Return a dataframe to a dataframe

I am trying to take a list of IDs from my dataframe dfrep and pass the ID column into a function I created, so the values go into a query whose results come back into dfrep.
My function returns a dataframe, but the results include the header, and when I print dfrep there are two lines. I also cannot write the dataframe to Excel using xlwings, because I get TypeError: must be a pywintypes time object (got DataFrame).
def overrides(id):
    sql = f"select name from sales..rep where id in ({id})"
    mydf = pd.read_sql(sql, conn)
    return mydf

overrides = np.vectorize(overrides)
dfrep['name'] = overrides(dfrep['ID'])
wsData.range('A1').options(pd.DataFrame, index=False).value = dfrep
My goal is to load the column(s) from my function's dataframe into my main dataframe dfrep and then write to Excel via xlwings. Any help is appreciated.
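A sketch of one way around this, assuming sales..rep can be queried through the same conn and that the goal is one name per ID: fetch all the names in a single query and merge them back, so the new column holds plain strings rather than DataFrames (which is also something xlwings can write).
import pandas as pd

# One round trip for every distinct ID instead of one query per row
# (parameter binding would be preferable to string formatting in real code).
ids = ", ".join(str(i) for i in dfrep['ID'].unique())
names = pd.read_sql(f"select id, name from sales..rep where id in ({ids})", conn)

# Attach the name column; each cell is now a scalar value
dfrep = dfrep.merge(names, left_on='ID', right_on='id', how='left').drop(columns='id')

wsData.range('A1').options(pd.DataFrame, index=False).value = dfrep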

How to get specific attributes of a df that has been grouped

I'm printing out the frequency of murders in each state in each particular decade. However, I just want to print the state, the decade, and its victim count. What I have right now prints all the columns with the same frequencies. How do I change it so that I have just 3 columns: State, Decade, and Victim Count?
I'm currently using groupby to group by state and decade and assigning the result to a variable called counts.
xl = pd.ExcelFile('Wyoming.xlsx')
df = xl.parse('Sheet1')
df['Decade'] = (df['Year'] // 10) * 10
counts = df.groupby(['State', 'Decade']).count()
print(counts)
The outcome prints all the columns in the file with the same frequencies, whereas I just want 3 columns: State, Decade, and Victim Count.
You should call reset_index on the groupby result, and then select the columns from the new dataframe.
Something like
xl = pd.ExcelFile('Wyoming.xlsx')
df = xl.parse('Sheet1')
df['Decade'] = (df['Year'] // 10) * 10
counts = df.groupby(['State', 'Decade']).count()
counts = counts.reset_index()[['State', 'Decade', 'Victim Count']]
print(counts)
Select the columns that you want:
counts = df.loc[:, ['State', 'Decade', 'Victim Count']].groupby(['State', 'Decade']).count()
or
print(counts.loc[:, ['State', 'Decade', 'Victim Count']])
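For a quick sanity check of what the reset_index variant returns, here is a tiny made-up frame (the column names follow the question; the rows are invented purely for illustration):
import pandas as pd

df = pd.DataFrame({
    'State': ['Wyoming', 'Wyoming', 'Wyoming'],
    'Year': [1995, 1998, 2003],
    'Victim Count': [1, 1, 1],
    'Weapon': ['A', 'B', 'C'],  # extra column that should not appear in the output
})
df['Decade'] = (df['Year'] // 10) * 10

counts = df.groupby(['State', 'Decade']).count().reset_index()[['State', 'Decade', 'Victim Count']]
print(counts)
# one row per (State, Decade) pair, e.g. Wyoming/1990 -> 2 and Wyoming/2000 -> 1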

Convert a huge number of lists to a pandas dataframe

User-defined function: my_fun(x) returns a list.
XYZ = a file with LOTS of lines.
pandas_frame = pd.DataFrame()  # created an empty data frame
for index in range(0, len(XYZ)):
    pandas_frame = pandas_frame.append(pd.DataFrame(my_fun(XYZ[index])).transpose(), ignore_index=True)
This code takes a very long time to run, as in days. How do I speed it up?
I think you need to apply the function to each row, build the new list with a list comprehension, and then call the DataFrame constructor only once:
L = [my_fun(XYZ[i]) for i in range(len(XYZ))]
df = pd.DataFrame(L)
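For illustration only, since XYZ and my_fun are not shown in the question, here is the same idea with assumed stand-ins (a text file read into lines, and a my_fun that splits each line into fields):
import pandas as pd

def my_fun(line):
    # hypothetical stand-in: split a comma-separated line into fields
    return line.strip().split(',')

with open('data.txt') as f:  # 'data.txt' is a placeholder path
    XYZ = f.readlines()

L = [my_fun(line) for line in XYZ]  # build all rows first...
df = pd.DataFrame(L)                # ...then construct the frame once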

Improving the speed of cross-referencing rows in the same DataFrame in pandas

I'm trying to apply a complex function to a pandas DataFrame, and I'm wondering if there's a faster way to do it. A simplified version of my data looks like this:
UID,UID2,Time,EventType
1,1,18:00,A
1,1,18:05,B
1,2,19:00,A
1,2,19:03,B
2,6,20:00,A
3,4,14:00,A
What I want to do is, for each combination of UID and UID2, check whether there is both a row with EventType = A and a row with EventType = B, then calculate the time difference and add it back as a new column. So the new dataset would be:
UID,UID2,Time,EventType,TimeDiff
1,1,18:00,A,5
1,1,18:05,B,5
1,2,19:00,A,3
1,2,19:03,B,3
2,6,20:00,A,nan
3,4,14:00,A,nan
This is my current implementation, where I group the records by UID and UID2 so there is only a small subset of rows to search when checking whether both EventTypes exist. I can't figure out a faster approach, and profiling in PyCharm hasn't helped uncover where the bottleneck is.
for (uid, uid2), group in df.groupby(["UID", "UID2"]):
    # if there is a row for both A and B for a UID, UID2 combo
    if len(group[group["EventType"] == "A"]) > 0 and len(group[group["EventType"] == "B"]) > 0:
        time_a = group.loc[group["EventType"] == "A", "Time"].iloc[0]
        time_b = group.loc[group["EventType"] == "B", "Time"].iloc[0]
        timediff = time_b - time_a
        timediff_min = timediff.components.minutes
        df.loc[(df["UID"] == uid) & (df["UID2"] == uid2), "TimeDiff"] = timediff_min
First, I need to make sure the Time column is a timedelta:
df.Time = pd.to_datetime(df.Time)
df.Time = df.Time - pd.to_datetime(df.Time.dt.date)
After that I create a helper dataframe:
df1 = df.set_index(['UID', 'UID2', 'EventType']).unstack().Time
Finally, I take the diff and merge it back to df:
df.merge((df1.B - df1.A).rename('TimeDiff').reset_index())
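Put together on the sample data from the question (a sketch; the division by a one-minute Timedelta to get whole minutes is my addition, so the result matches the expected TimeDiff values):
import io
import pandas as pd

csv = """UID,UID2,Time,EventType
1,1,18:00,A
1,1,18:05,B
1,2,19:00,A
1,2,19:03,B
2,6,20:00,A
3,4,14:00,A"""
df = pd.read_csv(io.StringIO(csv))

# Time as a timedelta since midnight
df.Time = pd.to_datetime(df.Time)
df.Time = df.Time - pd.to_datetime(df.Time.dt.date)

# one column per EventType, indexed by (UID, UID2)
df1 = df.set_index(['UID', 'UID2', 'EventType']).unstack().Time

# difference in minutes, merged back onto every row of the pair
diff = ((df1.B - df1.A) / pd.Timedelta(minutes=1)).rename('TimeDiff').reset_index()
df = df.merge(diff, on=['UID', 'UID2'], how='left')
print(df)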
