efficient cumulative pivot in pyspark

Is there a more efficient/idiomatic way of rewriting this query:
from pyspark.sql.functions import col, countDistinct, datediff, lit, when

# `today` is assumed to be defined elsewhere as the current date
(
    spark.table('registry_data')
    .withColumn('age_days', datediff(lit(today), col('date')))
    .withColumn('timeframe',
                when(col('age_days') < 7, '1w')
                .when(col('age_days') < 30, '1m')
                .when(col('age_days') < 92, '3m')
                .when(col('age_days') < 183, '6m')
                .when(col('age_days') < 365, '1y')
                .otherwise('1y+'))
    .groupby('make', 'model')
    .pivot('timeframe')
    .agg(countDistinct('id').alias('count'))
    .fillna(0)
    .withColumn('1y+', col('1y+') + col('1y') + col('6m') + col('3m') + col('1m') + col('1w'))
    .withColumn('1y', col('1y') + col('6m') + col('3m') + col('1m') + col('1w'))
    .withColumn('6m', col('6m') + col('3m') + col('1m') + col('1w'))
    .withColumn('3m', col('3m') + col('1m') + col('1w'))
    .withColumn('1m', col('1m') + col('1w'))
)
The gist of the query is, for every make/model combination, to return the number of entries seen within a set of time periods counted back from today. The period counts are cumulative, i.e. an entry registered within the last 7 days is counted towards 1 week, 1 month, 3 months, and so on.

If you want to use a cumulative sum instead of summing the columns one by one, you can replace the code from .groupby onwards and use window functions:
from pyspark.sql.window import Window
from pyspark.sql.functions import col, datediff, lit, when
import pyspark.sql.functions as F

(
    spark.table('registry_data')
    .withColumn('age_days', datediff(lit(today), col('date')))
    .withColumn('timeframe',
                when(col('age_days') < 7, '1w')
                .when(col('age_days') < 30, '1m')
                .when(col('age_days') < 92, '3m')
                .when(col('age_days') < 183, '6m')
                .when(col('age_days') < 365, '1y')
                .otherwise('1y+'))
    .groupBy('make', 'model', 'timeframe')
    .agg(F.countDistinct('id').alias('count'),
         F.max('age_days').alias('max_days'))  # max_days is only used for the orderBy below
    .withColumn('cumsum',
                F.sum('count').over(Window.partitionBy('make', 'model')
                                    .orderBy('max_days')
                                    .rowsBetween(Window.unboundedPreceding, 0)))
    .groupBy('make', 'model').pivot('timeframe').agg(F.first('cumsum'))
    .fillna(0)
)
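One further tweak worth noting (my addition, not something the answer above relies on): the set of timeframe labels is known up front, so it can be passed explicitly to pivot, which saves Spark the extra job it would otherwise run to discover the distinct pivot values. Assuming cum_df is the dataframe just before the final groupBy above (the name is mine):
timeframes = ['1w', '1m', '3m', '6m', '1y', '1y+']  # known labels, passed explicitly
pivoted = (
    cum_df
    .groupBy('make', 'model')
    .pivot('timeframe', timeframes)  # explicit values: no extra pass to find the labels
    .agg(F.first('cumsum'))
    .fillna(0)
)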

Related

ThreadPooling in pyspark seems to sum the contents of a dataframe

I am using Threadpooling to try to parallelize multiple for loops that I have in my pyspark code.
However, it seems that when using the reduce function at the end, it is simply summing the contents of the resulting list of dataframes.
Here is the code I am using:
from functools import reduce
from multiprocessing.pool import ThreadPool
from multiprocessing import cpu_count
from pyspark.sql import DataFrame
from pyspark.sql import functions as f
from pyspark.sql.functions import col

days = [180, 5]

def run_threads(iterr):
    for a in range(len(datepaths) - iterr, len(datepaths) - (iterr - 1)):
        path = "s3://" + datepaths[a] + "/"
        date_inpath.append(path)
    uva = spark.read.option("basePath", inpath).parquet(*date_inpath)\
        .select(*uvakeepcols)\
        .filter(col('RESULT').isNotNull())\
        .dropDuplicates()
    uva1 = uva.groupBy('SEQ_ID', 'TOOL_ID')\
        .agg(f.count('RESULT').alias('count'))
    return uva1

pool = ThreadPool(cpu_count() - 1)
df_list = pool.map(run_threads, days)
pool.close()
pool.join()
df = reduce(DataFrame.unionByName, df_list)
The above code extracts the results of a particular SEQ_ID, TOOL_ID combination for 2 days and then counts the number of times the RESULT column has values.
When I use thread pooling and compare the counts for the 2 days, I get the following result:
+-------+-------+-----+
| SEQ_ID|TOOL_ID|count|
+-------+-------+-----+
|2945783| 15032| 574|
|2945783| 15032| 574|
+-------+-------+-----+
However, when I extract the data manually from each of the 2 days I get the following result:
From 180 days ago:
+-------+-------+-------------+
| SEQ_ID|TOOL_ID|count(RESULT)|
+-------+-------+-------------+
|2945783| 15032| 285|
+-------+-------+-------------+
From 5 days ago:
+-------+-------+-------------+
| SEQ_ID|TOOL_ID|count(RESULT)|
+-------+-------+-------------+
|2945783| 15032| 289|
+-------+-------+-------------+
Somehow the pooling method is adding the two counts: 289 + 285 = 574.
What exactly is going on here, and is there a way to correct this?
Any insight would be greatly appreciated.
Thank you
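A note on one likely cause (this is my reading of the snippet above, assuming date_inpath, datepaths, inpath and uvakeepcols are module-level names shared by both worker threads): each call to run_threads appends its path to the same date_inpath list before reading, so by the time the reads happen each thread can see both days' paths and therefore counts both days combined. Keeping the path list local to the function removes the shared state, roughly like this:
def run_threads(iterr):
    # Build the path list locally so the two threads don't share it
    date_inpath = []
    for a in range(len(datepaths) - iterr, len(datepaths) - (iterr - 1)):
        date_inpath.append("s3://" + datepaths[a] + "/")
    uva = (spark.read.option("basePath", inpath).parquet(*date_inpath)
           .select(*uvakeepcols)
           .filter(col('RESULT').isNotNull())
           .dropDuplicates())
    return (uva.groupBy('SEQ_ID', 'TOOL_ID')
               .agg(f.count('RESULT').alias('count')))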

pyspark collect_list but limit to max N results

I have the following pyspark logic intended to group on some target columns and then collect another target column into an array:
(
    df
    .groupBy(groupby_cols)
    .agg(F.collect_list(F.col(target_col)).alias(target_col))
)
I would like to limit the results to keep at most N values for each collected list such that the resulting target column is composed of cells with arrays of at most length N.
Right now, I can achieve this in pyspark with a UDF that takes the target_col and applies a lambda (lambda x: x[:N]) to each cell, but this seems an inefficient way of achieving the behavior I seek.
what about:
from pyspark.sql import Window, functions as F

(
    df
    .withColumn("rn", F.row_number().over(
        Window.partitionBy(groupby_cols).orderBy(orderby_cols)
        # orderby_cols can be replaced by F.rand(1) if you don't mind which rows are kept
    ))  # row_number counts from 1 within each group
    .filter(f"rn <= {N}")  # drop every row ranked beyond N
    .groupBy(groupby_cols)
    .agg(F.collect_list(F.col(target_col)).alias(target_col))
)
This should do the trick
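If you are on Spark 2.4+ and don't care which N elements survive, a simpler variant (my suggestion, not part of the answer above) is to collect everything and slice the resulting array, which avoids the extra window pass:
# Hedged sketch: keep only the first N elements of each collected array (Spark 2.4+).
(
    df
    .groupBy(groupby_cols)
    .agg(F.slice(F.collect_list(F.col(target_col)), 1, N).alias(target_col))
)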

Is there a way to add multiple columns to a dataframe, calculated as moving averages over different columns and/or different durations

I have a dataframe with time-series data and I am trying to add a lot of moving average columns to it with different windows of various ranges. When I do this column by column, results are pretty slow.
I have tried just piling up the withColumn calls until I have all of them.
Pseudo code:
from pyspark.sql import Window
from pyspark.sql.functions import col
import pyspark.sql.functions as pysparkSqlFunctions

## working from a data frame with 12 columns:
## - key as a String
## - time as a DateTime
## - col_{1:10} as numeric values

window_1h = Window.partitionBy("key") \
    .orderBy(col("time").cast("long")) \
    .rangeBetween(-3600, 0)
window_2h = Window.partitionBy("key") \
    .orderBy(col("time").cast("long")) \
    .rangeBetween(-7200, 0)

df = df.withColumn("col1_1h", pysparkSqlFunctions.avg("col_1").over(window_1h))
df = df.withColumn("col1_2h", pysparkSqlFunctions.avg("col_1").over(window_2h))
df = df.withColumn("col2_1h", pysparkSqlFunctions.avg("col_2").over(window_1h))
df = df.withColumn("col2_2h", pysparkSqlFunctions.avg("col_2").over(window_2h))
What I would like is the ability to add all 4 columns (or many more) in one call, hopefully traversing the data only once for better performance.
I prefer to import the functions library as F as it looks neater and it is the standard alias used in the official Spark documentation.
The star string, '*', selects all the current columns of the dataframe. Alternatively, you could replace it with *df.columns, where the star unpacks the list into separate arguments for the select method.
from pyspark.sql import functions as F

df = df.select(
    "*",
    F.avg("col_1").over(window_1h).alias("col1_1h"),
    F.avg("col_1").over(window_2h).alias("col1_2h"),
    F.avg("col_2").over(window_1h).alias("col2_1h"),
    F.avg("col_2").over(window_2h).alias("col2_2h"),
)
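To scale this to many value columns and windows, the expressions can be generated rather than written out by hand. A minimal sketch (the windows dict, the column list and the alias pattern are my own, built on the windows defined in the question):
# Hedged sketch: build every moving-average column in one select, one pass over the data.
windows = {"1h": window_1h, "2h": window_2h}
value_cols = [f"col_{i}" for i in range(1, 11)]  # col_1 ... col_10

avg_exprs = [
    F.avg(c).over(w).alias(f"{c}_{label}")
    for c in value_cols
    for label, w in windows.items()
]
df = df.select("*", *avg_exprs)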

Pandas Data Frame, find max value and return adjacent column value, not the entire row

New to Pandas so I'm sorry if there is an obvious solution...
I imported a CSV that only had 2 columns and I created a 3rd column.
Here's a screenshot of the top 10 rows and header: [screenshot of the DataFrame not reproduced here]
I've figured out how to find the min and max values in the ['Amount Changed'] column but also need to pull the date associated with the min and max - but not the index and ['Profit/Loss']. I've tried iloc, loc, read about groupby - I can't get any of them to return a single value (in this case a date) that I can use again.
My goal is to create a new variable 'Gi_Date' that is in the same row as the max value in ['Amount Changed'] but tied to the date in the ['Date'] column.
I'm trying to keep the variables separate so I can use them in print statements, write them to txt files, etc.
import os
import csv
import pandas as pd
import numpy as np
#path for CSV file
csvpath = ("budget_data.csv")
#Read CSV into Pandas and give it a variable name Bank_pd
Bank_pd = pd.read_csv(csvpath, parse_dates=True)
#Number of month records in the CSV
Months = Bank_pd["Date"].count()
#Total amount of money captured in the data converted to currency
Total_Funds = '${:.0f}'.format(Bank_pd["Profit/Losses"].sum())
#Determine the amount of increase or decrease from the previous month
AmtChange = Bank_pd["Profit/Losses"].diff()
Bank_pd["Amount Changed"] = AmtChange
#Identify the greatest positive change
GreatestIncrease = '${:.0f}'.format(Bank_pd["Amount Changed"].max())
Gi_Date = Bank_pd[Bank_pd["Date"] == GreatestIncrease]
#Identify the greatest negative change
GreatestDecrease = '${:.0f}'.format(Bank_pd["Amount Changed"].min())
Gd_Date = Bank_pd[Bank_pd['Date'] == GreatestDecrease]
print(f"Total Months: {Months}")
print(f"Total: {Total_Funds}")
print(f"Greatest Increase in Profits: {Gi_Date} ({GreatestIncrease})")
print(f"Greatest Decrease in Profits: {Gd_Date} ({GreatestDecrease})")
When I run the script in Git Bash I don't get an error anymore, so I think I'm getting close, but rather than showing the date it says:
$ python PyBank.py
Total Months: 86
Total: $38382578
Greatest Increase in Profits: Empty DataFrame
Columns: [Date, Profit/Losses, Amount Changed]
Index: [] ($1926159)
Greatest Decrease in Profits: Empty DataFrame
Columns: [Date, Profit/Losses, Amount Changed]
Index: [] ($-2196167)
I want it to print out like this:
$ python PyBank.py
Total Months: 86
Total: $38382578
Greatest Increase in Profits: Feb-2012 ($1926159)
Greatest Decrease in Profits: Sept-2013 ($-2196167)
Here is one year's worth of the original DataFrame:
bank_pd = pd.DataFrame({'Date': ['Jan-10', 'Feb-10', 'Mar-10', 'Apl-10', 'May-10', 'Jun-10',
                                 'Jul-10', 'Aug-10', 'Sep-10', 'Oct-10', 'Nov-10', 'Dec-10'],
                        'Profit/Losses': [867884, 984655, 322013, -69417, 310503, 522857,
                                          1033096, 604885, -216386, 477532, 893810, -80353]})
The expected output with the sample df would be:
Total Months: 12
Total Funds: $5651079
Greatest Increase in Profits: Oct-10 ($693918)
Greatest Decrease in Profits: Dec-10 ($-974163)
I also had an error in the sample dataframe from above, I was missing a month when I typed it out quickly - it's fixed now.
Thanks!
I'm seeing a few glitches in the variables used.
Bank_pd["Amount Changed"] = AmtChange
The above statement adds the "Amount Changed" column to the dataframe; after it runs you can use that column for any manipulation.
Below is the updated code; you could add further formatting:
import pandas as pd

csvpath = ("budget_data.csv")
Bank_pd = pd.read_csv(csvpath, parse_dates=True)
inp_bank_pd = pd.DataFrame(Bank_pd)
Months = Bank_pd["Date"].count()
Total_Funds = '${:.0f}'.format(Bank_pd["Profit/Losses"].sum())
AmtChange = Bank_pd["Profit/Losses"].diff()
Bank_pd["Amount Changed"] = AmtChange  # assign the diff as a column before filtering on it
GreatestIncrease = Bank_pd["Amount Changed"].max()
Gi_Date = inp_bank_pd.loc[Bank_pd["Amount Changed"] == GreatestIncrease]
print(Months)
print(Total_Funds)
print(Gi_Date['Date'].values[0])
print(GreatestIncrease)
In your example code, Gi_Date and Gd_Date end up initializing new DataFrames instead of pulling out values. Change Gi_Date and Gd_Date:
Gi_Date = Bank_pd.sort_values('Profit/Losses').tail(1).Date
Gd_Date = Bank_pd.sort_values('Profit/Losses').head(1).Date
Check outputs:
Gi_Date
Jul-10
Gd_Date
Sep-10
To print how you want to print using string formatting:
print("Total Months: %s" %(Months))
print("Total: %s" %(Total_Funds))
print("Greatest Increase in Profits: %s %s" %(Gi_Date.to_string(index=False), GreatestIncrease))
print("Greatest Decrease in Profits: %s %s" %(Gd_Date.to_string(index=False), GreatestDecrease))
Note that if you don't use Gd_Date.to_string(index=False), the pandas object information will be included in the print output, as it is in your example where the DataFrame info shows up.
Output for 12 month sample DF:
Total Months: 12
Total: $5651079
Greatest Increase in Profits: Jul-10 $693918
Greatest Decrease in Profits: Sep-10 $-974163
Use Series.idxmin and Series.idxmax with loc:
df.loc[df['Amount Changed'].idxmin(), 'Date']
df.loc[df['Amount Changed'].idxmax(), 'Date']
Full example based on your sample DataFrame:
df = pd.DataFrame({'Date':['Jan-2010', 'Feb-2010', 'Mar-2010', 'Apr-2010', 'May-2010',
'Jun-2010', 'Jul-2010', 'Aug-2010', 'Sep-2010', 'Oct-2010'],
'Profit/Losses': [867884,984655,322013,-69417,310503,522857,
1033096,604885,-216386,477532]})
df['Amount Changed'] = df['Profit/Losses'].diff()
print(df)
Date Profit/Losses Amount Changed
0 Jan-2010 867884 NaN
1 Feb-2010 984655 116771.0
2 Mar-2010 322013 -662642.0
3 Apr-2010 -69417 -391430.0
4 May-2010 310503 379920.0
5 Jun-2010 522857 212354.0
6 Jul-2010 1033096 510239.0
7 Aug-2010 604885 -428211.0
8 Sep-2010 -216386 -821271.0
9 Oct-2010 477532 693918.0
print(df.loc[df['Amount Changed'].idxmin(), 'Date'])
print(df.loc[df['Amount Changed'].idxmax(), 'Date'])
Sep-2010
Oct-2010
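Tying the idxmax/idxmin approach back to the variables in the question (a sketch using the asker's own names and formatting):
# Hedged sketch: idxmax/idxmin give the row labels of the extreme changes,
# and loc pulls the matching dates.
Bank_pd["Amount Changed"] = Bank_pd["Profit/Losses"].diff()

GreatestIncrease = '${:.0f}'.format(Bank_pd["Amount Changed"].max())
GreatestDecrease = '${:.0f}'.format(Bank_pd["Amount Changed"].min())
Gi_Date = Bank_pd.loc[Bank_pd["Amount Changed"].idxmax(), "Date"]
Gd_Date = Bank_pd.loc[Bank_pd["Amount Changed"].idxmin(), "Date"]

print(f"Greatest Increase in Profits: {Gi_Date} ({GreatestIncrease})")
print(f"Greatest Decrease in Profits: {Gd_Date} ({GreatestDecrease})")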

subtract mean from pyspark dataframe

I'm trying to calculate the average of each column in a dataframe and subtract it from each element of that column. I've created a function that attempts to do that, but when I try to implement it using a UDF, I get an error: 'float' object has no attribute 'map'. Any ideas on how I can create such a function? Thanks!
def normalize(data):
    average = data.map(lambda x: x[0]).sum() / data.count()
    out = data.map(lambda x: (x - average))
    return out

mapSTD = udf(normalize, IntegerType())
dats = data.withColumn('Normalized', mapSTD('Fare'))
In your example the problem is with the UDF, which cannot be applied to a whole DataFrame. A UDF works on a single row at a time; Spark also supports UDAFs (User Defined Aggregate Functions) that operate across rows.
To solve your problem you can use below function:
from pyspark.sql.functions import mean

def normalize(df, column):
    average = df.agg(mean(df[column]).alias("mean")).collect()[0]["mean"]
    return df.select(df[column] - average)
Use it like this:
normalize(df, "Fare")
Please note that the above only works on a single column, but it is possible to implement something more generic:
def normalize(df, columns):
    selectExpr = []
    for column in columns:
        average = df.agg(mean(df[column]).alias("mean")).collect()[0]["mean"]
        selectExpr.append(df[column] - average)
    return df.select(selectExpr)
use it like:
normalize(df, ["col1", "col2"])
This works, but it runs a separate aggregation for each column, so with many columns performance could become an issue. It is possible to generate just one aggregation instead:
def normalize(df, columns):
    aggExpr = []
    for column in columns:
        aggExpr.append(mean(df[column]).alias(column))
    averages = df.agg(*aggExpr).collect()[0]
    selectExpr = []
    for column in columns:
        selectExpr.append(df[column] - averages[column])
    return df.select(selectExpr)
Adding onto Piotr's answer. If you need to keep the existing dataframe and add normalized columns with aliases, the function can be modified as:
def normalize(df, columns):
    aggExpr = []
    for column in columns:
        aggExpr.append(mean(df[column]).alias(column))
    averages = df.agg(*aggExpr).collect()[0]
    selectExpr = ['*']
    for column in columns:
        selectExpr.append((df[column] - averages[column]).alias('normalized_' + column))
    return df.select(selectExpr)
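A quick usage sketch of this last variant (the toy data and column names are my own, just to show the shape of the result):
# Hedged sketch: exercise normalize() on a tiny dataframe.
toy = spark.createDataFrame(
    [("a", 10.0, 1.0), ("b", 20.0, 3.0), ("c", 30.0, 5.0)],
    ["id", "Fare", "Age"],
)
normalize(toy, ["Fare", "Age"]).show()
# Keeps id, Fare and Age, and adds normalized_Fare and normalized_Age
# (each value minus that column's mean).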
