I have two DataFrames:
facts:
columns: data, start_date and end_date
holidays:
column: holiday_date
What I want is a way to produce another DataFrame that has columns:
data, start_date, end_date and num_holidays
where num_holidays is computed as the number of days between start_date and end_date that are not weekends or holidays (as listed in the holidays table).
The solution is here if we wanted to do this in SQL (the snippet below is SQL Server T-SQL). The crux is this part of the code:
-- Calculate and return the number of workdays using the input parameters.
-- This is the meat of the function.
-- This is really just one formula with a couple of parts that are listed on separate lines for documentation purposes.
RETURN (
    SELECT
        -- Start with the total number of days, including weekends
        (DATEDIFF(dd, @StartDate, @EndDate) + 1)
        -- Subtract 2 days for each full weekend
        - (DATEDIFF(wk, @StartDate, @EndDate) * 2)
        -- If StartDate is a Sunday, subtract 1
        - (CASE WHEN DATENAME(dw, @StartDate) = 'Sunday'
                THEN 1
                ELSE 0
           END)
        -- If EndDate is a Saturday, subtract 1
        - (CASE WHEN DATENAME(dw, @EndDate) = 'Saturday'
                THEN 1
                ELSE 0
           END)
        -- Subtract all holidays
        - (SELECT COUNT(*) FROM [dbo].[tblHolidays]
           WHERE [HolDate] BETWEEN @StartDate AND @EndDate)
)
END
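For orientation, the final "subtract all holidays" subquery above maps naturally onto a range (non-equi) join in PySpark. A minimal sketch of just that piece, assuming the facts and holidays tables are loaded as DataFrames df_facts and df_holidays (it is not weekend-aware, and the holidays_in_range name is illustrative):
from pyspark.sql import functions as F

# Count holidays falling inside each [start_date, end_date] range (sketch only)
holiday_counts = (
    df_facts.join(
        df_holidays,
        F.col('holiday_date').between(F.col('start_date'), F.col('end_date')),
        'left'
    )
    .groupBy('data', 'start_date', 'end_date')
    .agg(F.count('holiday_date').alias('holidays_in_range'))
)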
I'm new to PySpark and was wondering what the efficient way to do this is. I can post the UDF I'm writing if it helps, though I'm going slowly because I feel it's the wrong approach:
Is there a better way than creating a UDF that reads the holidays table into a DataFrame and joins with it to count the holidays? Can I even join inside a UDF?
Is there a way to write a pandas_udf instead? Would it be fast enough?
Are there some optimizations I can apply, like caching the holidays table somehow on every worker?
Something like this may work:
from pyspark.sql import functions as F

df_facts = spark.createDataFrame(
    [('data1', '2022-05-08', '2022-05-14'),
     ('data1', '2022-05-08', '2022-05-21')],
    ['data', 'start_date', 'end_date']
)
df_holidays = spark.createDataFrame([('2022-05-10',)], ['holiday_date'])

# One row per calendar day in the [start_date, end_date] range
df = df_facts.withColumn('exploded', F.explode(F.sequence(F.to_date('start_date'), F.to_date('end_date'))))
# Drop weekends (in Spark, dayofweek is 1 = Sunday, 7 = Saturday)
df = df.filter(~F.dayofweek('exploded').isin([1, 7]))
# Drop holidays via an anti join against the (broadcast) holidays table
df = df.join(F.broadcast(df_holidays), df.exploded == df_holidays.holiday_date, 'anti')
# Count the remaining days per original row
df = df.groupBy('data', 'start_date', 'end_date').agg(F.count('exploded').alias('business_days'))
df.show()
# +-----+----------+----------+-------------+
# | data|start_date| end_date|business_days|
# +-----+----------+----------+-------------+
# |data1|2022-05-08|2022-05-14| 4|
# |data1|2022-05-08|2022-05-21| 9|
# +-----+----------+----------+-------------+
Answers:
Is there a better way than creating a UDF...?
This approach does not use a UDF, so it should perform better than a UDF-based one.
Is there a way to write a pandas_udf instead? Would it be fast enough?
A pandas_udf performs better than a regular UDF, but approaches that avoid UDFs altogether should be better still (a hedged sketch of the pandas_udf route follows after these answers).
Are there some optimizations I can apply, like caching the holidays table somehow on every worker?
The Spark engine performs optimizations itself; there are only relatively rare cases when you may need to help it. In the answer I used F.broadcast(df_holidays), which sends the DataFrame to all of the workers. A table this small would most likely be broadcast automatically anyway, since Spark auto-broadcasts join sides smaller than spark.sql.autoBroadcastJoinThreshold (10 MB by default).
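For the pandas_udf question specifically, here is a minimal sketch of what that route could look like, assuming the holidays table is small enough to collect to the driver; np.busday_count handles the weekend and holiday arithmetic in one call (the business_days / num_business_days names are illustrative):
import numpy as np
import pandas as pd
from pyspark.sql import functions as F

# Small table: collect the holiday dates once and close over them in the UDF.
holidays = np.array([r.holiday_date for r in df_holidays.collect()], dtype='datetime64[D]')

@F.pandas_udf('long')
def business_days(start: pd.Series, end: pd.Series) -> pd.Series:
    begin = start.values.astype('datetime64[D]')
    # np.busday_count excludes the end date, so shift it by one day to include it
    finish = end.values.astype('datetime64[D]') + np.timedelta64(1, 'D')
    return pd.Series(np.busday_count(begin, finish, holidays=holidays))

df_facts.withColumn(
    'num_business_days',
    business_days(F.to_date('start_date'), F.to_date('end_date'))
).show()
Whether this beats the explode-based version depends on the data; the no-UDF approach keeps everything inside Catalyst, which is usually the safer default.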
In a pandas DataFrame df, one can extract a subset of rows and store it in another pandas DataFrame, for example df1 = df[10:20]. Can we do something similar with a Spark DataFrame?
Since we're in Spark, we're considering large datasets for which pandas (and Python) are still catching up. I'm trying to stress that the reason you may have considered PySpark a better fit for your data processing problem is exactly the amount of data: too large for pandas to handle nicely.
With that said, you simply cannot think of a huge dataset as something to index by row position ("rank" the rows), since no single computer could handle it (either because of lack of RAM or time).
In order to answer your question:
one can extract a subset of rows and store it in another pandas data frame.
think of filter or where, which you use to filter out the rows you don't want to include in the result dataset.
That could be as follows (using Scala API):
val cdf: DataFrame = ...
val result: DataFrame = cdf.where("here comes your filter expression")
Use the result data frame however you wish; that's what you wanted to work with, and it is now available. That's a sort of "Spark way".
@chlebek, since your answer works for me, I corrected a typo and post it here as an answer.
from pyspark.sql import Window
from pyspark.sql.functions import row_number

b = cdf.withColumn("id", row_number().over(Window.orderBy("INTERVAL_END_DATETIME")))
b = b.where(b.id >= 10)
b = b.where(b.id <= 20)
You could try row_number; it will add an increasing row-number column. The data will be sorted by the column used in the .orderBy clause. Then you can just select the rows you need.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import spark.implicits._

val new_df = df.withColumn("id", row_number().over(Window.orderBy('someColumnFromDf)))
  .where('id <= 20 and 'id >= 10)
At the moment I have 9 functions which do specific calculations on a data frame: average balance per month, rolling P&L, period-start balances, ratio calculations, and so on.
Each of those functions produces the following:
the first columns are the group-by columns which the function accepts, and the final column is the calculated statistic.
I.e.
Each of those functions produces a Spark data frame that has the same group-by variables (the same first columns: 1 column if there is only 1 group-by variable, 2 columns if there are 2, etc.) and 1 column whose values are the specific calculation, examples of which I listed at the beginning.
Because each of those functions does different calculations, I need to produce a data frame for each one and then join them to produce a report.
I join them on the group-by variables because they are common to all of them (each individual statistic report).
But doing 7-8 or even more joins is very slow.
Is there a way to add those columns together without using join?
Thank you.
I can think of multiple approaches, but this looks like a good use case for the newer pandas UDF Spark API.
You can define one grouped-map UDF. The UDF will receive each group as a pandas DataFrame. You apply the 9 aggregate functions to the group and return a pandas DataFrame with 9 additional aggregated columns. Spark will combine the returned pandas DataFrames into one large Spark DataFrame.
e.g.
from pyspark.sql.functions import pandas_udf, PandasUDFType

# given you want to aggregate average and ratio
@pandas_udf("month long, avg double, ratio double", PandasUDFType.GROUPED_MAP)
def compute(pdf):
    # pdf is a pandas.DataFrame holding one group
    pdf['avg'] = compute_avg(pdf)
    pdf['ratio'] = compute_ratio(pdf)
    # the returned columns must match the declared schema
    return pdf[['month', 'avg', 'ratio']]

df.groupby("month").apply(compute).show()
See Pandas-UDF#Grouped-Map
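As a side note (an assumption about your environment): on Spark 3.x the same grouped-map pattern is usually written with applyInPandas, and the PandasUDFType.GROUPED_MAP form above is deprecated. A minimal sketch, reusing the same hypothetical compute_avg / compute_ratio helpers and an assumed amount column:
import pandas as pd

def compute(pdf: pd.DataFrame) -> pd.DataFrame:
    # pdf holds all rows of one month
    pdf['avg'] = compute_avg(pdf)
    pdf['ratio'] = compute_ratio(pdf)
    return pdf

# The schema must list every column of the returned frame.
df.groupby('month').applyInPandas(
    compute, schema='month long, amount double, avg double, ratio double'
).show()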
If your cluster is on a lower version, you have 2 options:
Stick to the DataFrame API and write custom aggregate functions. See answer. They have a horrible API, but usage would look like this:
df.groupBy(df.month).agg(
    my_avg_func(df.amount).alias('avg'),
    my_ratio_func(df.amount).alias('ratio'),
)
Fall back to the good ol' RDD map-reduce API:
# pseudocode
def _create_tuple_key(record):
    return (record.month, record)

def _compute_stats(acc, record):
    acc['count'] += 1
    acc['avg'] = _accumulate_avg(acc['count'], record)
    acc['ratio'] = _accumulate_ratio(acc['count'], record)
    return acc

df.rdd.map(_create_tuple_key).reduceByKey(_compute_stats)
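Since reduceByKey combines two values of the same type, the accumulator pattern above is really aggregateByKey territory. Here is a runnable sketch of that idea for just the per-month average (month and amount are assumed column names):
# (count, running sum) accumulator per month
zero = (0, 0.0)

def seq_op(acc, amount):
    # fold one record's amount into the accumulator
    return (acc[0] + 1, acc[1] + amount)

def comb_op(a, b):
    # merge two partial accumulators
    return (a[0] + b[0], a[1] + b[1])

monthly_avg = (
    df.rdd
      .map(lambda row: (row['month'], row['amount']))
      .aggregateByKey(zero, seq_op, comb_op)
      .mapValues(lambda acc: acc[1] / acc[0])
)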
I found a way to get the number I need, but I was hoping to get some input on how to accomplish it in a less cumbersome way. I need a running total of transactions to date in order to make it into a plotly plot. The data I have only includes a few columns: id, date, and amount. Here's the code I have so far:
import pandas as pd

fy20 = pd.read_excel('./data/transactions.xlsx', parse_dates=['date'])

def daily_money(df):
    df = df.groupby('date').amount.sum()
    df = df.groupby(df.index.day).cumsum()
    df = df.cumsum().to_frame().reset_index()
    return df

fy20 = daily_money(fy20)
This appears to accomplish the goal, but it seems like there must be a simpler way. Please let me know if you have any suggestions on how to simplify this.
It looks to me like this should work:
df.groupby('date')['amount'].sum().cumsum()
This works because DataFrame.groupby automatically sorts by the group keys, so the cumulative sum is already looking at the data it needs.
If you want it as a DataFrame with a fresh index instead of a Series, you can just call Series.reset_index, which converts the Series to a DataFrame first; but unless you need the date as a normal column (rather than the index) later, you don't need to do that.
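For example, a minimal sketch of that Series-to-DataFrame step, with an assumed running_total name for the cumulative column:
running = (
    fy20.groupby('date')['amount'].sum().cumsum()
        .rename('running_total')  # name the cumulative column
        .reset_index()            # 'date' becomes a normal column again
)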
Given a Spark DataFrame that I have:
val df = Seq(
  ("2019-01-01", 100),
  ("2019-01-02", 101),
  ("2019-01-03", 102),
  ("2019-01-04", 103),
  ("2019-01-05", 102),
  ("2019-01-06", 99),
  ("2019-01-07", 98),
  ("2019-01-08", 100),
  ("2019-01-09", 47)
).toDF("day", "records")
I want to add a new column to this so that I get the average value of the last N records on a given day. For example, if N=3, then on a given day that value should be the average of the last 3 values EXCLUDING the current record.
For example, for day 2019-01-05, it would be (103+102+101)/3
How can I efficiently use the over() clause to do this in Spark?
PySpark solution.
The window definition should be 3 PRECEDING AND 1 PRECEDING, which translates to positions (-3, -1) with both boundaries included.
from pyspark.sql import Window
from pyspark.sql.functions import avg

w = Window.orderBy(df.day).rowsBetween(-3, -1)
df_with_rsum = df.withColumn("rsum_prev_3_days", avg(df.records).over(w))
df_with_rsum.show()
The solution assumes there is one row per date in the dataframe without missing dates in between. If not, aggregate the rows by date before applying the window function.
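That pre-aggregation might look like the following sketch, assuming duplicate dates should have their records summed (swap in whatever aggregate fits your data):
from pyspark.sql import functions as F

# Collapse to one row per day before applying the rolling-average window
df_daily = df.groupBy('day').agg(F.sum('records').alias('records'))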
Can anyone explain to me why these two conditions produce different outputs (even different count() )?
FIRST:
(df
    .where(cond1)
    .where((cond2) | (cond3))
    .groupBy('id')
    .agg(F.avg(F.column('col1')).alias('name'),
         F.avg(F.column('col2')).alias('name'))
).count()
SECOND:
(df
    .groupBy('id')
    .agg(F.avg(F.when(((cond2) | (cond3)) & (cond1),
                      F.column('col1'))).alias('name'),
         F.avg(F.when(((cond2) | (cond3)) & (cond1),
                      F.column('col2'))).alias('name'))
).count()
I just figured it out. when() without otherwise() returns null when it finds no match, but the row is still there, so every id still appears in the grouped result (avg simply ignores the nulls). Compared to the same df grouped by the same column and aggregated with no conditions, the number of groups is the same.
On the other hand, where() filters the DataFrame, so the aggregation is only applied to the filtered rows; ids with no remaining rows disappear entirely, hence the lower number of results.
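A tiny reproduction makes the difference concrete (the toy data and condition below are made up for illustration):
from pyspark.sql import functions as F

toy = spark.createDataFrame([(1, 10.0), (1, -5.0), (2, -3.0)], ['id', 'col1'])
cond = F.col('col1') > 0

# where() drops id=2 before grouping -> 1 group
toy.where(cond).groupBy('id').agg(F.avg('col1')).count()           # 1

# when() only nulls out non-matching values -> both ids survive -> 2 groups
toy.groupBy('id').agg(F.avg(F.when(cond, F.col('col1')))).count()  # 2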
Without knowing what the conditions are, my understanding is that these are different processes. In the first case you first filter the rows you need to process, group by id, and get the averages of the filtered data; that results in, let's say, x rows. In the second case you group by id first, with no row filtering, and tell Spark to add a column named 'name' that holds the conditional average to the grouped df. You don't conditionally filter the rows, so you now have x plus some more rows (depending on your conditions).
(df
    .where(cond1)                 # remove rows by applying cond1
    .where((cond2) | (cond3))     # remove rows by applying cond2, 3
    .groupBy('id')                # group *remaining* rows by id
    .agg(F.avg(F.column('col1')).alias('name'),   # then get the average
         F.avg(F.column('col2')).alias('name'))
).count()
But:
(df
    .groupBy('id')    # group initial data by id
    .agg(F.avg(F.when(((cond2) | (cond3)) & (cond1),    # add a column to the grouped data that computes the average conditionally
                      F.column('col1'))).alias('name'),
         F.avg(F.when(((cond2) | (cond3)) & (cond1),
                      F.column('col2'))).alias('name'))
).count()
# the agg does not change the number of rows.
Hope this helps (I think you've already figured it out though :) ). Good luck!