I have some daily data in a df, which goes back as far as 1st January 2020. It looks similar to the below but with many id1s on each day.
| yyyy_mm_dd | id1 | id2 | cost |
|------------|-----|------|-------|
| 2020-01-01 | 23 | 7253 | 5003 |
| 2020-01-01 | 23 | 7743 | 30340 |
| 2020-01-02 | 23 | 7253 | 450 |
| 2020-01-02 | 23 | 7743 | 4500 |
| ... | ... | ... | ... |
| 2021-01-01 | 23 | 7253 | 5675 |
| 2021-01-01 | 23 | 134 | 1030 |
| 2021-01-01 | 23 | 3445 | 564 |
| 2021-01-01 | 23 | 4534 | 345 |
| ... | ... | ... | ... |
I have grouped and calculated the summed cost like so:
grouped_quarterly = (
df
.withColumn('year_quarter', (F.year(F.col('yyyy_mm_dd')) * 100 + F.quarter(F.col('yyyy_mm_dd'))))
.groupby('id1', 'year_quarter')
.agg(
F.sum('cost').alias('cost')
)
)
I am able to then successfully make a quarter over quarter comparison like so:
w = Window.partitionBy(F.col('id1'), F.expr('substring(string(year_quarter), -2)')).orderBy('year_quarter')
growth = (
grouped_quarterly
.withColumn('prev_value', F.lag(F.col('cost')).over(w))
.withColumn('diff', F.when(F.isnull(F.col('cost') - F.col('prev_value')), 0).otherwise(F.col('cost') - F.col('prev_value')))
).where(F.col('year_quarter') >= 202101)
I would like to modify this to be quarter to date instead of quarter over quarter. For example, the above would compare April 1st 2020 - June 30th 2020 with April 1st 2021 - April 15th 2021 (or whatever the maximum date in df is).
Instead, I would prefer to compare April 1st 2020 - April 15th 2020 with April 1st 2021 - April 15th 2021.
Is it possible to ensure only the same periods are compared within year_quarter?
Edit: Adding sample output:
grouped_quarterly.where(F.col('id1') == 222).sort('year_quarter').show(10,False)
| id1 | year_quarter | cost |
|-----|--------------|-------|
| 222 | 202001 | 49428 |
| 222 | 202002 | 43292 |
| 222 | 202003 | 73928 |
| 222 | 202004 | 12028 |
| 222 | 202101 | 19382 |
| 222 | 202102 | 4282 |
growth.where(F.col('id1') == 222).sort('year_quarter').show(10,False)
| id1 | year_quarter | cost | prev_value | diff | growth |
|-----|--------------|-------|------------|--------|--------|
| 222 | 202101 | 52494 | 49428 | 3066 | 6.20 |
| 222 | 202102 | 4282 | 43292 | -39010 | -90.10 |
The growth calculation from the window is being done correctly. However, since 202102 is in progress, it gets compared to the full 202002. The comparison for 202101 works perfectly as both year_quarters are complete.
Is there any way to ensure the window function only compares the same period within the year_quarter with the previous year, for incomplete quarters? I hope the sample data makes my question a bit clearer.
The idea is to split the task into two parts:
Calculate the growth for the complete quarters; this logic is taken over unchanged from the question.
Calculate the growth for the currently running quarter.
First generate some additional test data for 2019Q2, 2020Q2 and 2021Q2:
data = [('2019-04-01', 23, 1), ('2019-04-01', 23, 2), ('2019-04-02', 23, 3), ('2019-04-15', 23, 4),
('2019-04-16', 23, 5), ('2019-04-17', 23, 6), ('2019-05-01', 23, 7), ('2019-06-30', 23, 8),
('2019-07-01', 23, 9), ('2020-01-01',23,5003),('2020-01-01',23,30340), ('2020-01-02',23,450),
('2020-01-02',23,4500), ('2020-04-01', 23, 10), ('2020-04-01', 23, 20), ('2020-04-02', 23, 30),
('2020-04-15', 23, 40), ('2020-04-16', 23, 50), ('2020-04-17', 23, 60), ('2020-05-01', 23, 70),
('2020-06-30', 23, 80), ('2020-07-01', 23, 90), ('2021-01-01',23,5675), ('2021-01-01',23,1030),
('2021-01-01',23,564), ('2021-01-01',23,345), ('2021-04-01', 23, -10), ('2021-04-01', 23, -20),
('2021-04-02', 23, -30), ('2021-04-15', 23, -40)]
Calculate the year_quarter column and cache the result:
df = spark.createDataFrame(data=data, schema = ["yyyy_mm_dd", "id1", "cost"]) \
.withColumn("yyyy_mm_dd", F.to_date("yyyy_mm_dd", "yyyy-MM-dd")) \
.withColumn('year_quarter', (F.year(F.col('yyyy_mm_dd')) * 100 + F.quarter(F.col('yyyy_mm_dd')))) \
.cache()
Get the maximum date and its corresponding quarter:
max_row = df.selectExpr("max(yyyy_mm_dd)", "max_by(year_quarter, yyyy_mm_dd)").head()
cur_date, cur_quarter = max_row[0], max_row[1]
It is not strictly necessary to set cur_date to the maximum date of the data. Instead, cur_date and cur_quarter could also be set manually.
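For example, pinning the comparison to a fixed cutoff could look like this (an illustrative sketch, not part of the original answer):
import datetime

cur_date = datetime.date(2021, 4, 15)  # compare only dates up to and including April 15th
cur_quarter = 202102                   # the quarter treated as "in progress"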
For all quarters but the current one apply the logic given in the question:
w = Window.partitionBy(F.col('id1'), F.expr('substring(string(year_quarter), -2)')).orderBy('year_quarter')
df_full_quarters = df.filter(f"year_quarter <> {cur_quarter}") \
.groupby('id1', 'year_quarter') \
.agg(F.sum('cost').alias('cost')) \
.withColumn('prev_value', F.lag(F.col('cost')).over(w))
For the current quarter filter out all dates in the previous year that should be ignored:
df_cur_quarter = df.filter(f"year_quarter = {cur_quarter} or (year_quarter = {cur_quarter - 100} and add_months(yyyy_mm_dd, 12) <= '{cur_date}')") \
.groupby('id1', 'year_quarter') \
.agg(F.sum('cost').alias('cost')) \
.withColumn('prev_value', F.lag(F.col('cost')).over(w)) \
.filter(f"year_quarter = {cur_quarter}")
Finally union the two parts and calculate the diff column:
growth = df_full_quarters.union(df_cur_quarter) \
.withColumn('diff', F.when(F.isnull(F.col('cost') - F.col('prev_value')), 0).otherwise(F.col('cost') - F.col('prev_value'))) \
.orderBy("id1", "year_quarter")
The result will be:
+---+------------+-----+----------+------+
|id1|year_quarter| cost|prev_value| diff|
+---+------------+-----+----------+------+
| 23| 201902| 36| null| 0|
| 23| 201903| 9| null| 0|
| 23| 202001|40293| null| 0|
| 23| 202002| 360| 36| 324|
| 23| 202003| 90| 9| 81|
| 23| 202101| 7614| 40293|-32679|
| 23| 202102| -100| 100| -200|
+---+------------+-----+----------+------+
In this example, for the comparison of 2021Q2 with the previous year, the sum for 2020Q2 is given as 100, but the actual value for the full 2020Q2 is 360.
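As a quick sanity check against the test data above (this snippet is only for illustration and is not part of the original answer):
# full 2020Q2 vs. the part of 2020Q2 that lies on or before April 15th (shifted by one year)
df.filter("year_quarter = 202002").agg(F.sum('cost')).show()  # 360
df.filter(f"year_quarter = {cur_quarter - 100} and add_months(yyyy_mm_dd, 12) <= '{cur_date}'") \
  .agg(F.sum('cost')).show()  # 100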
If you want a quarter-to-date comparison year over year while the quarter is still incomplete, another option is to aggregate by dayofmonth(col("input")).alias("dayofmonth") when the quarter being compared lines up with the current month of the current year, possibly combined with a conditional aggregation such as .agg(when(col("date_column") <condition>, <expression>)).
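A rough sketch of that idea, clipping every quarter at the same relative position so every sum is period-to-date (the day-of-quarter logic is my interpretation of the suggestion above, assuming a df with a date column yyyy_mm_dd and a year_quarter column as built earlier, not tested code from this answer):
from pyspark.sql import functions as F

max_d = df.agg(F.max('yyyy_mm_dd')).first()[0]     # latest date in the data
cur_moq = (max_d.month - 1) % 3                    # month offset of that date within its quarter
moq = (F.month('yyyy_mm_dd') - 1) % 3              # month offset of each row within its quarter
in_qtd = (moq < cur_moq) | ((moq == cur_moq) & (F.dayofmonth('yyyy_mm_dd') <= max_d.day))

grouped_qtd = (df
    .groupby('id1', 'year_quarter')
    .agg(F.sum(F.when(in_qtd, F.col('cost'))).alias('cost_qtd')))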
I am trying to implement a YtD measure for my report in Excel with Power Pivot. My source looks roughly like this:
Table 1
| Month | Store | Branch | Article | Value |
|----------|-------|--------|---------|-------|
| January | 1 | A | Sales | 200 |
| January | 1 | A | Costs | 100 |
| January | 1 | A | Rent | 10 |
| February | 1 | A | Costs | 20 |
| February | 1 | A | Sales | 80 |
| March | 1 | A | Costs | 30 |
| March | 1 | A | Sales | 80 |
| February | 2 | B | Sales | 100 |
| February | 2 | B | Costs | 40 |
| February | 2 | B | Rent | 20 |
Linked to it are a table Table 2 of months (name and number from 1 to 12), a table Table 3 of unique articles, and a table Table 4 of unique stores with their branches.
I want to be able to display YtD for every article depending on the chosen month.
I have measures:
Val. := sum(table1[Value])
YtD1:= calculate(Val., all('Table 2'[Name]))
The former sums across all the values, which are filtered by article in my pivot report. The latter calculates a YtD across all months. It works, but I have to rewrite it so that it responds to filtering the last month and sums from the first month to the selected month.
I have tried to format month numbers to process them as dates (e.g. first day of the month), but couldn't appropriately handle the FORMAT function.
I have also tried to do a sum of months, i.e.:
YtD2= calculate(Val., filter(Table2;Table2[Number]<=2))
which, I hoped, would count months from January to February. It doesn't seem to do any good, resulting in numbers I cannot explain.
My desired output should look like this:
| Store | Sales | | Costs | |
|-------|-------|-----|-------|-----|
| | Val. | YtD | Val. | YtD |
| 1 | 80 | 280 | 20 | 120 |
| 2 | 100 | 100 | 40 | 40 |
if data is filtered by February.
Or
| Store | Sales | | Costs | |
|-------|-------|-----|-------|-----|
| | Val. | YtD | Val. | YtD |
| 1 | 160 | 360 | 50 | 150 |
| 2 | 100 | 100 | 40 | 40 |
if February and March are selected (Val. is displayed for February and March, but YtD from January to March).
Is there a way to implement this in DAX? Can this be done without conversion from month names (or numbers) to some dates?
If not, can I get it to work for a month filter instead of a month slicer? That is, if only one month can be selected.
I cannot use variables and similar Power BI features.
Try:
YTD :=
VAR MaxSelectedMonth =
    MAX( Table2[Number] )
RETURN
    CALCULATE(
        [Val.],
        FILTER(
            ALL( Table2 ),
            Table2[Number] <= MaxSelectedMonth
        )
    )
I would like to collapse the rows in a dataframe based on an ID column and count the number of records per ID using window functions. Doing this, I would like to avoid partitioning the window by ID, because this would result in a very large number of partitions.
I have a dataframe of the form
+----+-----------+-----------+-----------+
| ID | timestamp | metadata1 | metadata2 |
+----+-----------+-----------+-----------+
| 1 | 09:00 | ABC | apple |
| 1 | 08:00 | NULL | NULL |
| 1 | 18:00 | XYZ | apple |
| 2 | 07:00 | NULL | banana |
| 5 | 23:00 | ABC | cherry |
+----+-----------+-----------+-----------+
where I would like to keep only the records with the most recent timestamp per ID, such that I have
+----+-----------+-----------+-----------+-------+
| ID | timestamp | metadata1 | metadata2 | count |
+----+-----------+-----------+-----------+-------+
| 1 | 18:00 | XYZ | apple | 3 |
| 2 | 07:00 | NULL | banana | 1 |
| 5 | 23:00 | ABC | cherry | 1 |
+----+-----------+-----------+-----------+-------+
I have tried:
import sys
from pyspark.sql.window import Window
from pyspark.sql.functions import asc, desc, first, count, col, row_number

window = Window.orderBy([asc('ID'), desc('timestamp')])
window_count = Window.orderBy([asc('ID'), desc('timestamp')]).rowsBetween(-sys.maxsize, sys.maxsize)
columns_metadata = ['metadata1', 'metadata2']
df = df.select(
*(first(col_name, ignorenulls=True).over(window).alias(col_name) for col_name in columns_metadata),
count(col('ID')).over(window_count).alias('count')
)
df = df.withColumn("row_tmp", row_number().over(window)).filter(col('row_tmp') == 1).drop(col('row_tmp'))
which is in part based on How to select the first row of each group?
However, without the use of pyspark.sql.Window.partitionBy, this does not give the desired output.
I only read that you wanted to avoid partitioning by ID after I had posted this; a partitioned window is the only approach I could think of.
Your dataframe:
df = sqlContext.createDataFrame(
[
('1', '09:00', 'ABC', 'apple')
,('1', '08:00', '', '')
,('1', '18:00', 'XYZ', 'apple')
,('2', '07:00', '', 'banana')
,('5', '23:00', 'ABC', 'cherry')
]
,['ID', 'timestamp', 'metadata1', 'metadata2']
)
We can use rank and partition by ID over timestamp:
from pyspark.sql.window import Window
import pyspark.sql.functions as F
w1 = Window().partitionBy(df['ID']).orderBy(F.desc('timestamp'))
w2 = Window().partitionBy(df['ID'])
df\
.withColumn("rank", F.rank().over(w1))\
.withColumn("count", F.count('ID').over(w2))\
.filter(F.col('rank') == 1)\
.select('ID', 'timestamp', 'metadata1', 'metadata2', 'count')\
.show()
+---+---------+---------+---------+-----+
| ID|timestamp|metadata1|metadata2|count|
+---+---------+---------+---------+-----+
| 1| 18:00| XYZ| apple| 3|
| 2| 07:00| | banana| 1|
| 5| 23:00| ABC| cherry| 1|
+---+---------+---------+---------+-----+
I want to create a DataFrame that contains all the rows from two DataFrames, and where there are duplicates we keep only the row with the max value of a column.
For example, given two tables with the same schema like those below, we merge them into one table that keeps only the row with the highest score for each group, grouped by another column ("name" in the example below).
Table A
+--------+--------+-------+
| name   | source | score |
+--------+--------+-------+
| Finch  | Acme   | 62    |
| Jones  | Acme   | 30    |
| Lewis  | Acme   | 59    |
| Smith  | Acme   | 98    |
| Starr  | Acme   | 87    |
+--------+--------+-------+
Table B
+--------+--------+-------+
| name   | source | score |
+--------+--------+-------+
| Bryan  | Beta   | 93    |
| Jones  | Beta   | 75    |
| Lewis  | Beta   | 59    |
| Smith  | Beta   | 64    |
| Starr  | Beta   | 81    |
+--------+--------+-------+
Final Table
+--------+--------+-------+
| name   | source | score |
+--------+--------+-------+
| Bryan  | Beta   | 93    |
| Finch  | Acme   | 62    |
| Jones  | Beta   | 75    |
| Lewis  | Acme   | 59    |
| Smith  | Acme   | 98    |
| Starr  | Acme   | 87    |
+--------+--------+-------+
Here's what seems to work:
from pyspark.sql import functions as F
schema = ["name", "source", "score"]
rows1 = [("Smith", "Acme", 98),
("Jones", "Acme", 30),
("Finch", "Acme", 62),
("Lewis", "Acme", 59),
("Starr", "Acme", 87)]
rows2 = [("Smith", "Beta", 64),
("Jones", "Beta", 75),
("Bryan", "Beta", 93),
("Lewis", "Beta", 59),
("Starr", "Beta", 81)]
df1 = spark.createDataFrame(rows1, schema)
df2 = spark.createDataFrame(rows2, schema)
df_union = df1.unionAll(df2)
df_agg = df_union.groupBy("name").agg(F.max("score").alias("score"))
df_final = df_union.join(df_agg, on="score", how="leftsemi").orderBy("name", F.col("score").desc()).dropDuplicates(["name"])
The above results in the DataFrame I expect. It seems like a convoluted way to do this, but I don't know as I'm relatively new to Spark. Can this be done in a more efficient, elegant, or "Pythonic" manner?
You can use window functions. Partition by name and choose the record with the highest score.
from pyspark.sql.functions import *
from pyspark.sql.window import Window
w=Window().partitionBy("name").orderBy(desc("score"))
df_union.withColumn("rank", row_number().over(w))\
.filter(col("rank")==1).drop("rank").show()
+-----+------+-----+
| name|source|score|
+-----+------+-----+
|Bryan| Beta| 93|
|Finch| Acme| 62|
|Jones| Beta| 75|
|Lewis| Acme| 59|
|Smith| Acme| 98|
|Starr| Acme| 87|
+-----+------+-----+
I don't see anything wrong with your answer, except for the last line: you cannot join on score only; you need to join on the combination of "name" and "score". You can also choose an inner join, which eliminates the need to remove rows with lower scores for the same name:
df_final = (df_union.join(df_agg, on=["name", "score"], how="inner")
.orderBy("name")
.dropDuplicates(["name"]))
Notice that there is no need to order by score, and .dropDuplicates(["name"]) is only needed if you want to avoid displaying two rows for name = Lewis, who has the same score in both dataframes.
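For example, without dropDuplicates the tie shows up directly (a quick check using the dataframes defined in the question):
df_union.join(df_agg, on=["name", "score"], how="inner") \
    .filter(F.col("name") == "Lewis").show()
# two rows for Lewis: score 59 from Acme and score 59 from Beta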
In this table, I want to find the average number of days between actions for each user.
What I mean is: I want to group by user_id, subtract each date from the date directly before it (in days) for each user, and then find the average of these gaps per user (the average number of no-action days per user).
+---------+-----------+----------------------+
| User_ID | Action_ID | Action_At |
+---------+-----------+----------------------+
| 1 | 11 | 2019-01-31T23:00:37Z |
+---------+-----------+----------------------+
| 2 | 12 | 2019-01-31T23:11:12Z |
+---------+-----------+----------------------+
| 3 | 13 | 2019-01-31T23:14:53Z |
+---------+-----------+----------------------+
| 1 | 14 | 2019-02-01T00:00:30Z |
+---------+-----------+----------------------+
| 2 | 15 | 2019-02-01T00:01:03Z |
+---------+-----------+----------------------+
| 3 | 16 | 2019-02-01T00:02:32Z |
+---------+-----------+----------------------+
| 1 | 17 | 2019-02-06T11:30:28Z |
+---------+-----------+----------------------+
| 2 | 18 | 2019-02-06T11:30:28Z |
+---------+-----------+----------------------+
| 3 | 19 | 2019-02-07T09:09:16Z |
+---------+-----------+----------------------+
| 1 | 20 | 2019-02-11T15:37:24Z |
+---------+-----------+----------------------+
| 2 | 21 | 2019-02-18T10:02:07Z |
+---------+-----------+----------------------+
| 3 | 22 | 2019-02-26T12:01:31Z |
+---------+-----------+----------------------+
You can do it like this (and next time, please provide the data so that it is easy to help you; it took me much longer to enter the data than to get to the solution):
import pandas as pd

df = pd.DataFrame({'User_ID': [1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 3],
'Action_ID': [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22],
'Action_At': ['2019-01-31T23:00:37Z', '2019-01-31T23:11:12Z', '2019-01-31T23:14:53Z', '2019-02-01T00:00:30Z', '2019-02-01T00:01:03Z', '2019-02-01T00:02:32Z', '2019-02-06T11:30:28Z', '2019-02-06T11:30:28Z', '2019-02-07T09:09:16Z', '2019-02-11T15:37:24Z', '2019-02-18T10:02:07Z', '2019-02-26T12:01:31Z']})
df.Action_At = pd.to_datetime(df.Action_At)
df.groupby('User_ID').apply(lambda x: (x.Action_At - x.Action_At.shift()).mean())
## User_ID
## 1 3 days 13:32:15.666666
## 2 5 days 19:36:58.333333
## 3 8 days 12:15:32.666666
## dtype: timedelta64[ns]
Or, if you want the solution in days:
df.groupby('User_ID').apply(lambda x: (x.Action_At - x.Action_At.shift()).dt.days.mean())
## User_ID
## 1 3.333333
## 2 5.333333
## 3 8.333333
## dtype: float64
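The same per-user gaps can also be computed without apply, for example with a grouped diff (a sketch based on the df built above; gap_days is just an illustrative column name):
df['gap_days'] = df.sort_values('Action_At').groupby('User_ID')['Action_At'].diff().dt.days
df.groupby('User_ID')['gap_days'].mean()
# gives the same values as above: 3.33, 5.33 and 8.33 days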
I have a dataset with the current stock for some products:
+--------------+-------+
| Product | Stock |
+--------------+-------+
| chocolate | 300 |
| coal | 70 |
| orange juice | 400 |
+--------------+-------+
and the sales for every product over the years for the current month and the next month in another dataset:
+--------------+------+-------+-------+
| Product      | Year | Month | Sales |
+--------------+------+-------+-------+
| chocolate    | 2017 | 05    | 55    |
| chocolate    | 2017 | 04    | 250   |
| chocolate    | 2016 | 05    | 70    |
| chocolate    | 2016 | 04    | 200   |
+--------------+------+-------+-------+
| coal         | 2017 | 05    | 40    |
| coal         | 2017 | 04    | 30    |
| coal         | 2016 | 05    | 50    |
| coal         | 2016 | 04    | 20    |
+--------------+------+-------+-------+
| orange juice | 2017 | 05    | 400   |
| orange juice | 2017 | 04    | 350   |
| orange juice | 2016 | 05    | 400   |
| orange juice | 2016 | 04    | 300   |
+--------------+------+-------+-------+
I want to compute the stock that I will need to order for the next month, by computing the expected sales over the current month and the next month, using the following formula:
ExpectedSales = max(salesMaxCurrentMonth) + max(salesMaxNextMonth)
The orders will then be
Orders = ExpectedSales * (1 + margin) - Stock
Where margin is, for example, 10%.
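As a worked example (my reading of the formula, assuming the current month is April (04) and the next month is May (05)), chocolate would give:
ExpectedSales = max(250, 200) + max(55, 70) = 250 + 70 = 320
Orders = 320 * (1 + 0.10) - 300 = 352 - 300 = 52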
I tried to group by several columns using GroupBy, as in the following, but it seems to aggregate by Stock instead of Product:
salesDataset
  .groupBy(Columns.col("Month"), Columns.col("Product"))
  .agg(Columns.max("Sales").as("SalesMaxPerMonth"))
  .agg(Columns.sum("SalesMaxPerMonth").as("SalesPeriod"))
  .withColumn(
    "SalesExpected",
    Columns.col("SalesPeriod").multiply(Columns.literal(1 + margin)))
  .withColumn(
    "Orders",
    Columns.col("SalesExpected").minus(Columns.col("Stock")))
  .withColumn(
    "Orders",
    Columns.col("Orders").map((Double a) -> a >= 0 ? a : 0))
  .doNotAggregateAbove()
  .toCellSet()
  .show();
Your aggregation logic is correct, but there is another way to build your CellSet, in which you provide a map describing the location of the query that generates it.
salesDataset
  .groupBy(Columns.col("Month"), Columns.col("Product"))
  .agg(Columns.max("Sales").as("SalesMaxPerMonth"))
  .agg(Columns.sum("SalesMaxPerMonth").as("SalesPeriod"))
  .withColumn(
    "SalesExpected",
    Columns.col("SalesPeriod").multiply(Columns.literal(1 + margin)))
  .withColumn("Orders", Columns.col("SalesExpected").minus(Columns.col("Stock")))
  .withColumn("Orders", Columns.col("Orders").map((Double a) -> a >= 0 ? a : 0))
  .doNotAggregateAbove()
  .toCellSet(
    Empty.<String, Object>map()
      .put("Product", null)
      .put("Stock", null))
  .show();
Where null in a location represents the wildcard *.