How to track number of distinct values incrementally from a spark table? - apache-spark

Suppose we have a very large table that we'd like to process statistics for incrementally.
Date        Amount  Customer
2022-12-20  30      Mary
2022-12-21  12      Mary
2022-12-20  12      Bob
2022-12-21  15      Bob
2022-12-22  15      Alice
We'd like to be able to calculate incrementally how much we made per distinct customer for a date range. So from 12-20 to 12-22 (inclusive), we'd have 3 distinct customers, but from 12-20 to 12-21 there are only 2 distinct customers.
If we want to run this pipeline once a day and there are many customers, how can we keep a rolling count of distinct customers for an arbitrary date range? Is there a way to do this without storing a huge list of customer names for each day?
We'd like to support a frontend that has a date range filter and can quickly calculate results for that date range. For example:
Start Date  End Date    Average Income Per Customer
2022-12-20  2022-12-21  (30+12+12+15)/2 = 34.5
2022-12-20  2022-12-22  (30+12+12+15+15)/3 = 28
The only approach I can think of is to store a set of customer names for each day, and when viewing the results take the size of the union of those sets to get the distinct customer count. This seems inefficient. In this case we'd store the following table, with the customer column being extremely large (a minimal sketch of this approach follows the table).
Date        Total Income  Customers
2022-12-20  42            set(Mary, Bob)
2022-12-21  27            set(Mary, Bob)
2022-12-22  15            set(Alice)
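For concreteness, here's what that would look like in PySpark (illustrative only, assuming the raw data is in a DataFrame df with columns Date, Amount, Customer):

from pyspark.sql import functions as F

# one row per day, with the total income and the full set of customers seen that day
daily = df.groupBy("Date").agg(
    F.sum("Amount").alias("total_income"),
    F.collect_set("Customer").alias("customers"),
)

# answering a range query then means unioning the per-day sets, which is the part that worries us
result = (daily
    .filter(F.col("Date").between("2022-12-20", "2022-12-22"))
    .agg(
        F.sum("total_income").alias("total_income"),
        F.size(F.array_distinct(F.flatten(F.collect_list("customers")))).alias("distinct_customers"),
    ))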

For me, the best solution is to do some pre-calculation on the existing data, and then, for the new data that comes in every day, run the calculation only on that new data and add the results to the previously calculated data. Also partition on the date column, since we filter on dates; this triggers Spark's filter push-down and accelerates the queries.
There are two parts: one to get the total amount between two dates, and another for the distinct customers between two dates:
For the amount, use a prefix sum: add the sum of all previous days to each day's total, so that to get the total between two dates you can simply subtract those two days' values without looping over all the dates in between.
For distinct customers, the best approach I can think of is to save the Date and Customer columns in a new file partitioned by date (which helps optimize the queries), and then use the fast approx_count_distinct.
Here's some code:
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.master("local[*]").getOrCreate()
data = [
    ["2022-12-20", 30, "Mary"],
    ["2022-12-21", 12, "Mary"],
    ["2022-12-20", 12, "Bob"],
    ["2022-12-21", 15, "Bob"],
    ["2022-12-22", 15, "Alice"],
]
df = spark.createDataFrame(data).toDF("Date", "Amount", "Customer")

# build the initial prefix-sum table: for each date, the cumulative amount up to
# that date and the cumulative amount up to the previous date
def init_amount_data(df):
    w = Window.orderBy(F.col("Date"))
    amount_sum_df = df.groupby("Date").agg(F.sum("Amount").alias("Amount")) \
        .withColumn("amount_sum", F.sum(F.col("Amount")).over(w)) \
        .withColumn("prev_amount_sum", F.lag("amount_sum", 1, 0).over(w)) \
        .select("Date", "amount_sum", "prev_amount_sum")
    amount_sum_df.write.mode("overwrite").partitionBy("Date").parquet("./path/amount_data_df")
    amount_sum_df.show(truncate=False)

# keep only the customer data to avoid unnecessary columns when querying; partitioning by Date
# makes the queries faster thanks to Spark's filter push-down mechanism
def init_customers_data(df):
    df.select("Date", "Customer").write.mode("overwrite").partitionBy("Date").parquet("./path/customers_data_df")

# each day (for example at midnight) update the amount dataframe with yesterday's data only:
# take the last amount_sum and add the amount of the last day to it
def update_amount_data(last_partition):
    amountDataDf = spark.read.parquet("./path/amount_data_df")
    maxDate = getMaxDate("./path/amount_data_df")  # implement a Hadoop method to get the last partition date
    lastMaxPartition = amountDataDf.filter(F.col("Date") == maxDate)
    lastPartitionAmountSum = lastMaxPartition.select("amount_sum").first()[0]
    yesterday_amount_sum = last_partition.groupby("Date").agg(F.sum("Amount").alias("amount_sum"))
    newPartition = yesterday_amount_sum.withColumn("amount_sum", F.col("amount_sum") + lastPartitionAmountSum) \
        .withColumn("prev_amount_sum", F.lit(lastPartitionAmountSum))
    newPartition.write.mode("append").partitionBy("Date").parquet("./path/amount_data_df")

def update_customers_data(last_partition):
    last_partition.write.mode("append").partitionBy("Date").parquet("./path/customers_data_df")

# total amount between two dates = cumulative sum at endDate minus cumulative sum before beginDate
def query_amount_date(beginDate, endDate):
    amountDataDf = spark.read.parquet("./path/amount_data_df")
    endDateAmount = amountDataDf.filter(F.col("Date") == endDate).select("amount_sum").first()[0]
    beginDateAmount = amountDataDf.filter(F.col("Date") == beginDate).select("prev_amount_sum").first()[0]
    return endDateAmount - beginDateAmount

def query_customers_date(beginDate, endDate):
    customersDataDf = spark.read.parquet("./path/customers_data_df")
    distinct_customers_nb = customersDataDf.filter(F.col("Date").between(F.lit(beginDate), F.lit(endDate))) \
        .agg(F.approx_count_distinct("Customer").alias("distinct_customers")).first()[0]
    return distinct_customers_nb

# this should be executed the first time only
init_amount_data(df)
init_customers_data(df)

# this should be executed every day at midnight with the data of the last day only
last_day_partition = df.filter(F.col("Date") == yesterday_date)  # yesterday_date computed elsewhere
update_amount_data(last_day_partition)
update_customers_data(last_day_partition)

# optimized queries for an arbitrary date range
beginDate = "2022-12-20"
endDate = "2022-12-22"
answer = query_amount_date(beginDate, endDate) / query_customers_date(beginDate, endDate)
print(answer)
If calculating the distinct customers is not fast enough, there's another approach: keep the same kind of pre-computed running count for distinct customers, plus a second table of the customers seen so far. Each day, if a new customer appears, increment the count and add that customer to the second table; otherwise do nothing.
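A rough sketch of that idea (the path, table layout and helper name are hypothetical, and the running total itself would be stored alongside the amount data):

def update_seen_customers(last_partition):
    # hypothetical parquet path holding the customers already counted
    seen = spark.read.parquet("./path/seen_customers_df")
    new_customers = (last_partition.select("Customer").distinct()
                     .join(seen, on="Customer", how="left_anti")
                     .cache())
    new_count = new_customers.count()  # materialize the anti-join result before appending
    if new_count > 0:
        new_customers.write.mode("append").parquet("./path/seen_customers_df")
    return new_count  # add this to the stored running total of distinct customers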
Finally, there are some tricks for optimizing the groupBy or window functions, such as salting or extended partitioning.
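As an illustration of salting (not part of the pipeline above, and assuming the df and F import from the earlier code), a skewed groupBy can be split into a partial aggregation on (key, salt) followed by a final aggregation on the key alone:

salted = (df
    .withColumn("salt", (F.rand() * 16).cast("int"))
    .groupBy("Customer", "salt")
    .agg(F.sum("Amount").alias("partial_amount"))
    .groupBy("Customer")
    .agg(F.sum("partial_amount").alias("Amount")))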

You can achieve this by filtering rows with dates between start_date and end_date, then grouping by customer and calculating the sum of amounts per customer, and finally taking the average of those sums. This approach works for only one start_date/end_date pair, so you would run this code with different parameters for different date ranges.
from pyspark.sql import functions as F

start_date = '2022-12-20'
end_date = '2022-12-21'
(
    df
    .withColumn('isInRange', F.col('date').between(start_date, end_date))
    .filter(F.col('isInRange'))
    .groupby('customer')
    .agg(F.sum('amount').alias('sum'))
    .agg(F.avg('sum').alias('avg income'))
).show()
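Since each run covers a single range, the same logic can be wrapped in a small helper and called once per range (a sketch, reusing the df from the question):

def avg_income_per_customer(df, start_date, end_date):
    return (df
        .filter(F.col('date').between(start_date, end_date))
        .groupby('customer')
        .agg(F.sum('amount').alias('sum'))
        .agg(F.avg('sum').alias('avg income')))

for start_date, end_date in [('2022-12-20', '2022-12-21'), ('2022-12-20', '2022-12-22')]:
    avg_income_per_customer(df, start_date, end_date).show()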

Related

Trying to groupby two columns in pandas and return a max value based on criteria

I need some help with a dataset that I am trying to perform a .groupby() on to find the .max() consumption value based on Entity and Year.
The consumption column that I am using to perform the max function sometimes has the same value for different Years. When this occurs I would like to return the max Year for that occurrence.
df.groupby(['Entity','Year']).consumption.max().reset_index()
returns the output shown as an image in the original post.
In the end I would like a DataFrame with ['Entity','Year','consumption'] as the columns and when the consumption is the same for a specific Entity Year pair, to return the highest Year of the two.
I worked up a solution
df4 = df.groupby('Entity').consumption.max().reset_index()
df5 = df4.merge(df, left_on=['Entity', 'consumption'] , right_on=['Entity', 'consumption'])
df5 = df5.groupby(['Entity', 'consumption']).max().reset_index()
df5.head(10)
This gives the unique countries with their max consumption rates, and if the consumption rate was the same year-over-year, it returns the max year.
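A possibly more compact equivalent (a sketch, assuming the same df with Entity, Year and consumption columns): sort so that the row with the highest consumption, and the latest Year among ties, ends up last in each group, then take that row.

out = (df.sort_values(['consumption', 'Year'])
         .groupby('Entity')
         .tail(1)                # last row per Entity = max consumption, ties broken by max Year
         .sort_values('Entity')
         .reset_index(drop=True))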

Computing number of business days between start/end columns

I have two Dataframes
facts:
columns: data, start_date and end_date
holidays:
column: holiday_date
What I want is a way to produce another Dataframe that has columns:
data, start_date, end_date and num_holidays
Where num_holidays is computed as: Number of days between start and end that are not weekends or holidays (as in the holidays table).
The solution is here if we wanted to do this in PL/SQL. The crux is this part of the code:
--Calculate and return the number of workdays using the input parameters.
--This is the meat of the function.
--This is really just one formula with a couple of parts that are listed on separate lines for documentation purposes.
RETURN (
    SELECT
        --Start with total number of days including weekends
        (DATEDIFF(dd, #StartDate, #EndDate) + 1)
        --Subtract 2 days for each full weekend
        - (DATEDIFF(wk, #StartDate, #EndDate) * 2)
        --If StartDate is a Sunday, subtract 1
        - (CASE WHEN DATENAME(dw, #StartDate) = 'Sunday'
                THEN 1
                ELSE 0
           END)
        --If EndDate is a Saturday, subtract 1
        - (CASE WHEN DATENAME(dw, #EndDate) = 'Saturday'
                THEN 1
                ELSE 0
           END)
        --Subtract all holidays
        - (SELECT COUNT(*) FROM [dbo].[tblHolidays]
           WHERE [HolDate] BETWEEN #StartDate AND #EndDate)
)
END
I'm new to pyspark and was wondering what the efficient way to do this is. I can post the UDF I'm writing if it helps, though I'm going slowly because I feel it's the wrong approach:
Is there a better way than creating a UDF that reads the holidays table in a Dataframe and joins with it to count the holidays? Can I even join inside a udf?
Is there a way to write a pandas_udf instead? Would it be faster enough?
Are there some optimizations I can apply like cache the holidays table somehow on every worker?
Something like this may work:
from pyspark.sql import functions as F

df_facts = spark.createDataFrame(
    [('data1', '2022-05-08', '2022-05-14'),
     ('data1', '2022-05-08', '2022-05-21')],
    ['data', 'start_date', 'end_date']
)
df_holidays = spark.createDataFrame([('2022-05-10',)], ['holiday_date'])

df = df_facts.withColumn('exploded', F.explode(F.sequence(F.to_date('start_date'), F.to_date('end_date'))))
df = df.filter(~F.dayofweek('exploded').isin([1, 7]))
df = df.join(F.broadcast(df_holidays), df.exploded == df_holidays.holiday_date, 'anti')
df = df.groupBy('data', 'start_date', 'end_date').agg(F.count('exploded').alias('business_days'))
df.show()
# +-----+----------+----------+-------------+
# | data|start_date| end_date|business_days|
# +-----+----------+----------+-------------+
# |data1|2022-05-08|2022-05-14| 4|
# |data1|2022-05-08|2022-05-21| 9|
# +-----+----------+----------+-------------+
Answers:
Is there a better way than creating a UDF...?
This method does not use a udf, so it should perform better.
Is there a way to write a pandas_udf instead? Would it be faster enough?
A pandas_udf performs better than a regular udf, but no-UDF approaches should be even better.
Are there some optimizations I can apply like cache the holidays table somehow on every worker?
The Spark engine performs optimizations itself. However, there are some relatively rare cases when you may help it. In the answer above, I used F.broadcast(df_holidays), which sends the dataframe to all of the workers. But I am fairly sure such a small table would be broadcast automatically anyway.
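If in doubt, one way to check (not part of the original answer) is to print the physical plan of the join and look for a broadcast join node:

joined = (df_facts
    .withColumn('exploded', F.explode(F.sequence(F.to_date('start_date'), F.to_date('end_date'))))
    .join(F.broadcast(df_holidays), F.col('exploded') == F.col('holiday_date'), 'anti'))
joined.explain()  # the plan should contain a broadcast join operator for df_holidays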

How can I split weekly data to monthly using Excel/ Power Pivot?

My Data is in weekly buckets. I want to split the number into a monthly number but, since there is an overlap in days falling in both the months, I want a weighted average of the data in terms of days that fall in each of the months. For example:
Now, in the above picture, I want to split that 200 (5/7*200 in Jan, 2/7 in Feb). How can I do that using Excel/ Power Pivot/ Dax Functions? Any help here is much appreciated.
Thank you!
Assuming your fact table looks something like the one below, where values are associated with the starting date of the week in which they occurred.
Although the actual data may be more granular, with multiple rows per week and additional attributes (such as identifiers of a person or a store, depending on the business), what is shown below will work the same.
What we need to do first is to create a date table. We can do that in "Design" tab, by clicking "Date Table", then "New".
In this date table, we need to add a column for starting date of the week which the date of each row is in. Set the cursor to "Add Column" area, and input following formula. Then rename this column to "Week Start Date".
= [Date] - [Day Of Week Number] + 1
Now we can define the measure that calculates the number allocated to each month with the following formula. What this measure does is:
Iterate over each row of the fact table
Count the number of days of that row's week that are visible in the filter context
Add the corresponding portion of the row's value for those visible days
Value Allocation := SUMX (
    MyData,
    VAR WeekStartDate = MyData[Week]
    VAR NumDaysInSelection =
        COUNTROWS (
            FILTER (
                'Calendar',
                'Calendar'[Week Start Date] = WeekStartDate
            )
        )
    VAR AllocationRate = DIVIDE ( NumDaysInSelection, 7 )
    RETURN AllocationRate * MyData[Value]
)
The result in the pivot table will look like this.

Python Pandas stack by zip code and group by month/year

I have a large data frame with transaction data. What I am trying to do is use python to aggregate the data starting with zip codes, then a year and month, finally the total number of transactions for that month.
My Df:
Date        VAR1  VAR2  ZipCode  Transactions
YYYY-MM-DD  X     Y     12345    1
So the first thing I did was convert the Date column to datetime:
df['Date'] = pd.to_datetime(df['Date'])
df.info()
# Date datetime64[ns]
Then I split the data into year-month and number of transactions:
# grouping the data by year and month
per = df.Date.dt.to_period("M")
g = df.groupby(per)
g.sum() # so now that this works, we need to break it up into zip codes
Which gives an output of:
Date     Transactions
YYYY-MM  X
YYYY-MM  Y
My question is, what am I missing to get the zip codes in front:
ZipCode  Date     Transactions
123345   YYYY-MM  sum()
Any and all help is greatly appreciated.
I believe you need to add the ZipCode column to the groupby if you need grouping per zip and per month:
per = df.Date.dt.to_period("M")
df1 = df.groupby(['ZipCode',per])['Transactions'].sum().reset_index()
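An equivalent variant (a sketch; it keeps Date as a month-end timestamp instead of a period) uses pd.Grouper:

import pandas as pd

df1 = (df.groupby(['ZipCode', pd.Grouper(key='Date', freq='M')])['Transactions']
         .sum()
         .reset_index())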

DAX Measure: IF sum of all users = Max THEN return value of individual user

Hi Everyone,
I am new to DAX measures and I am trying to get my measure to have the following logic:
When sum of all users = Max, THEN return value of individual user
The Data Model has the following columns: CustomerID \ Usage \ Interval (DATETIME). What is tripping me up is that DATETIME is in 15-minute increments. I have approximately 700 unique CustomerIDs and I need to be able to return the usage of each CustomerID during the MAXSUM of all the CustomerIDs.
I am not sure if this would be an IF-THEN statement or if I need to use a time function. I am writing this DAX measure in Power Pivot to send to a Pivot Table within Excel.
Thanks in advance,
This may not be the simplest solution, but it works for me (I've named the table Load).
CustomerUsage =
VAR Summarized = CALCULATETABLE(
SUMMARIZE(Load, Load[Interval], "TotalUsage", SUM(Load[Usage])),
ALLSELECTED(Load))
VAR MaxUsage = MAXX(Summarized, [TotalUsage])
VAR MaxInterval = MAXX(FILTER(Summarized, [TotalUsage] = MaxUsage), [Interval])
RETURN CALCULATE(SUM(Load[Usage]), Load[Interval] = MaxInterval)
First, you generate a table that finds the total usage per interval by summing over all selected users.
The MaxUsage is simply the largest TotalUsage over all of the intervals.
The MaxInterval is the interval for which the TotalUsage = MaxUsage.
Then we can find the Usage for a customer for the interval with maximum usage.
