Categorizing Consecutive Months With Zero Values Into Buckets

I've constructed a data model around utilization for my company's fleet in Power Query. We have a number of different columns in the data model, specifically mileage, VIN, start date, and end date (see the example table below).
Mileage | VIN | Start Date | End Date |
0 | 123 | 6/1/18 | 6/30/18 |
0 | 123 | 7/1/18 | 7/31/18 |
0 | 123 | 8/1/18 | 8/31/18 |
0 | 123 | 9/1/18 | 9/30/18 |
0 | 123 | 10/1/18 | 10/31/18 |
What I'm trying to accomplish: if mileage is equal to 0 for one month, the vehicle is categorized into a bucket labeled 0-30 days; if mileage is equal to 0 for two consecutive months, it is categorized as 31-60 days; and 0 mileage for three or more consecutive months is categorized as >60 days. From the example above, this vehicle would be categorized in the ">60 days" bucket. Is there an easy way to do this within the data model using DAX? Please let me know if you have any follow-up questions. Thank you!

Try this as a Calculated Column:
Buckets =
VAR rowDate = 'myTable'[Start Date]
VAR previousDate =
    CALCULATE (
        MAX ( 'myTable'[Start Date] ),
        FILTER (
            ALLEXCEPT ( 'myTable', 'myTable'[VIN] ),
            'myTable'[Start Date] < rowDate
        )
    )
VAR prePreviousDate =
    CALCULATE (
        MAX ( 'myTable'[Start Date] ),
        FILTER (
            ALLEXCEPT ( 'myTable', 'myTable'[VIN] ),
            'myTable'[Start Date] < previousDate
        )
    )
VAR PreviousMileage =
    CALCULATE (
        MAX ( 'myTable'[Mileage] ),
        ALLEXCEPT ( 'myTable', 'myTable'[VIN] ),
        'myTable'[Start Date] = previousDate
    )
VAR PrePreviousMileage =
    CALCULATE (
        MAX ( 'myTable'[Mileage] ),
        ALLEXCEPT ( 'myTable', 'myTable'[VIN] ),
        'myTable'[Start Date] = prePreviousDate
    )
RETURN
    SWITCH (
        TRUE (),
        'myTable'[Mileage]
            + IF ( ISBLANK ( PreviousMileage ), 1, PreviousMileage )
            + IF ( ISBLANK ( PrePreviousMileage ), 1, PrePreviousMileage ) = 0, "> 60 Days",
        'myTable'[Mileage]
            + IF ( ISBLANK ( PreviousMileage ), 1, PreviousMileage ) = 0, "31 to 60 Days",
        'myTable'[Mileage] = 0, "0 to 30 Days",
        "No Days"
    )
I added some values for testing, and each row lands in the expected bucket.
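If you want to sanity-check the bucketing logic outside the model, here is a minimal pandas sketch of the same rules (an assumption-laden cross-check: one row per VIN per month, column names as in the sample table):

import pandas as pd

# Rebuild the sample table from the question.
df = pd.DataFrame({
    'Mileage': [0, 0, 0, 0, 0],
    'VIN': [123] * 5,
    'Start Date': pd.to_datetime(
        ['2018-06-01', '2018-07-01', '2018-08-01', '2018-09-01', '2018-10-01']),
}).sort_values(['VIN', 'Start Date'])

# Mileage one and two months back within each VIN (NaN where there is
# no history, mirroring the ISBLANK handling in the DAX above).
prev1 = df.groupby('VIN')['Mileage'].shift(1)
prev2 = df.groupby('VIN')['Mileage'].shift(2)

def bucket(m, p1, p2):
    if m != 0:
        return 'No Days'
    if p1 == 0 and p2 == 0:   # NaN == 0 is False, like ISBLANK -> 1
        return '> 60 Days'
    if p1 == 0:
        return '31 to 60 Days'
    return '0 to 30 Days'

df['Buckets'] = [bucket(m, p1, p2) for m, p1, p2 in zip(df['Mileage'], prev1, prev2)]
print(df)  # June -> "0 to 30 Days", July -> "31 to 60 Days", Aug-Oct -> "> 60 Days"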

Related

Pyspark Window function: Counting number of categorical variables and calculating percentages

I have a dataframe in the format below. There are different IDs, and each ID has product names and types associated with it.
ID Prod Name Type Total Qty
1 ABC A 200
1 DEF B 350
1 GEH B 120
1 JIK C 100
1 LMO A 40
2 ABC A 10
2 DEF A 20
2 GEH C 30
2 JIK C 40
2 LMO A 50
So I am trying to get the percentage of total A's, B's, and C's for each ID in a separate column. As a first step, I tried a window function, but it gave me the count of "A" across the whole column.
df.withColumn("count_cat", F.count("Type").over(Window.partitionBy("Type")))
But I need something like this:
ID  total Products  Total Qty  % of A  % of B  % of C
1   5               810        0.29    0.58    0.12
Approach 1: Group By Aggregation
Based on your expected output, aggregates based on a GROUP BY ID would be sufficient.
You may achieve this using the following, assuming your initial dataset is stored in a dataframe named input_df.
Using Spark SQL
First, ensure your dataframe is accessible by creating a temporary view:
input_df.createOrReplaceTempView("input_df")
Then run the SQL below on your Spark session:
output_df = sparkSession.sql("""
    SELECT
        ID,
        COUNT(Prod_Name) as `total products`,
        SUM(Total_Qty) as `Total Qty`,
        SUM(CASE WHEN Type = 'A' THEN Total_Qty END) / SUM(Total_Qty) as `% of A`,
        SUM(CASE WHEN Type = 'B' THEN Total_Qty END) / SUM(Total_Qty) as `% of B`,
        SUM(CASE WHEN Type = 'C' THEN Total_Qty END) / SUM(Total_Qty) as `% of C`
    FROM
        input_df
    GROUP BY
        ID
""").na.fill(0)
Using the pyspark API
from pyspark.sql import functions as F

output_df = (
    input_df.groupBy("ID")
    .agg(
        F.count("Prod_Name").alias("total products"),
        F.sum("Total_Qty").alias("Total Qty"),
        (F.sum(F.when(F.col("Type") == "A", F.col("Total_Qty")).otherwise(0))
            / F.sum("Total_Qty")).alias("% of A"),
        (F.sum(F.when(F.col("Type") == "B", F.col("Total_Qty")).otherwise(0))
            / F.sum("Total_Qty")).alias("% of B"),
        (F.sum(F.when(F.col("Type") == "C", F.col("Total_Qty")).otherwise(0))
            / F.sum("Total_Qty")).alias("% of C"),
    )
)
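For reference, the sample data can be reproduced with a minimal sketch like the one below before trying either variant (assumptions: underscored column names Prod_Name and Total_Qty, which is what the SQL above expects, and a SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
input_df = spark.createDataFrame(
    [(1, "ABC", "A", 200), (1, "DEF", "B", 350), (1, "GEH", "B", 120),
     (1, "JIK", "C", 100), (1, "LMO", "A", 40),
     (2, "ABC", "A", 10), (2, "DEF", "A", 20), (2, "GEH", "C", 30),
     (2, "JIK", "C", 40), (2, "LMO", "A", 50)],
    ["ID", "Prod_Name", "Type", "Total_Qty"],
)
# output_df.show() should then return one row per ID, e.g. ID 1 ->
# 5 products, 810 total qty, and roughly 0.30 / 0.58 / 0.12 for A / B / C.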
Approach 2: Using Windows
If you would instead like to add these as 5 additional columns to your dataset, you may use similar aggregations with a window, i.e. OVER (PARTITION BY ID) in SQL or Window.partitionBy("ID") in the pyspark API, as shown below.
Using Spark SQL
First, ensure your dataframe is accessible by creating a temporary view:
input_df.createOrReplaceTempView("input_df")
Then run the SQL below on your Spark session:
output_df = sparkSession.sql("""
    SELECT
        *,
        COUNT(Prod_Name) OVER (PARTITION BY ID) as `total products`,
        SUM(Total_Qty) OVER (PARTITION BY ID) as `Total Qty`,
        SUM(CASE WHEN Type = 'A' THEN Total_Qty END) OVER (PARTITION BY ID)
            / SUM(Total_Qty) OVER (PARTITION BY ID) as `% of A`,
        SUM(CASE WHEN Type = 'B' THEN Total_Qty END) OVER (PARTITION BY ID)
            / SUM(Total_Qty) OVER (PARTITION BY ID) as `% of B`,
        SUM(CASE WHEN Type = 'C' THEN Total_Qty END) OVER (PARTITION BY ID)
            / SUM(Total_Qty) OVER (PARTITION BY ID) as `% of C`
    FROM
        input_df
""").na.fill(0)
(Note there is no GROUP BY here; the PARTITION BY windows do the per-ID aggregation while keeping every row.)
Using the pyspark API
from pyspark.sql import functions as F
from pyspark.sql import Window

agg_window = Window.partitionBy("ID")

output_df = (
    input_df
    .withColumn("total products", F.count("Prod_Name").over(agg_window))
    .withColumn("Total Qty", F.sum("Total_Qty").over(agg_window))
    .withColumn(
        "% of A",
        F.sum(F.when(F.col("Type") == "A", F.col("Total_Qty")).otherwise(0)).over(agg_window)
            / F.sum("Total_Qty").over(agg_window)
    )
    .withColumn(
        "% of B",
        F.sum(F.when(F.col("Type") == "B", F.col("Total_Qty")).otherwise(0)).over(agg_window)
            / F.sum("Total_Qty").over(agg_window)
    )
    .withColumn(
        "% of C",
        F.sum(F.when(F.col("Type") == "C", F.col("Total_Qty")).otherwise(0)).over(agg_window)
            / F.sum("Total_Qty").over(agg_window)
    )
)
Let me know if this works for you.
One approach (without repeating A, B, C, etc.) is to use pivot. The idea is to group first, then pivot the type:
from pyspark.sql import functions as F
from pyspark.sql import Window as W
(df
    .groupBy('ID', 'Type')
    .agg(F.sum('Total Qty').alias('qty'))
    .withColumn('pct', F.col('qty') / F.sum('qty').over(W.partitionBy('ID')))
    .groupBy('ID')
    .pivot('Type')
    .agg(F.first('pct'))
    .show()
)
# Output
# +---+------------------+------------------+-------------------+
# | ID| A| B| C|
# +---+------------------+------------------+-------------------+
# | 1|0.2962962962962963|0.5802469135802469|0.12345679012345678|
# | 2|0.5333333333333333| null| 0.4666666666666667|
# +---+------------------+------------------+-------------------+
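If you need the exact headers and zeros from the expected output, a small follow-up sketch (assuming the chained expression above is assigned to a variable pv instead of ending in .show()):

pv = pv.na.fill(0)  # 0 instead of null for IDs with no rows of a given type
pv = pv.select('ID', *[F.col(c).alias('% of ' + c) for c in pv.columns if c != 'ID'])
pv.show()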

DAX Search a string for multiple values

I need to create a new DAX column that will search a string from another column in the same table. It will search for any of the values in a 2nd table, and return True if any of those values are found. Simplified example:
Let's say I have a table named Sentences with 1 column:
Sentences
Col1
----------------
"The aardvark admitted it was wrong"
"The attractive peanut farmer graded the term paper"
"The awning was too tall to touch"
And another table named FindTheseWords with a list of values
FindTheseWords
Col1
----------------
peanut
aardvark
I'll be creating Col2 in the Sentences table, which should return
Sentences
Col1 Col2
---------------------------------------------------- ------------------------
"The aardvark admitted it was wrong" TRUE
"The attractive peanut farmer graded the term paper" TRUE
"The awning was too tall to touch" FALSE
The list in FindTheseWords is actually pretty long, so I can't just hardcode the values and use an OR; I need to reference the table. I don't care about whole-word matches, so a sentence containing "peanuts" would also return true.
I've seen a good implementation of this in M, but the performance of my load took a pretty good hit, so I'm hoping to find a DAX option for a new column.
The M Solution, for reference: How to search multiple strings in a string?
fact table
| Column1 |
|------------------------------------------------------|
| The aardvark admitted it was wrong |
| The attractive peanut farmer graded the term paper |
| The awning was too tall to touch |
| This is text string |
| Tester is needed |
sentence table
| Column1 |
|------------|
| attractive |
| peanut |
| aardvark |
| Tester |
Calculated column
Column =
VAR _1 =
    ADDCOLUMNS ( 'fact', "newColumn", SUBSTITUTE ( 'fact'[Column1], " ", "|" ) )
VAR _2 =
    GENERATE (
        _1,
        ADDCOLUMNS (
            GENERATESERIES ( 1, PATHLENGTH ( [newColumn] ) ),
            "Words", PATHITEM ( [newColumn], [Value], TEXT )
        )
    )
VAR _3 =
    ADDCOLUMNS (
        _2,
        "test", CONTAINS ( VALUES ( sentence[Column1] ), sentence[Column1], [Words] )
    )
VAR _4 =
    DISTINCT (
        SELECTCOLUMNS (
            FILTER ( _3, [test] = TRUE ),
            "Column1", [Column1] & "",
            "test", [test] & ""
        )
    )
VAR _5 =
    DISTINCT (
        SELECTCOLUMNS (
            FILTER ( _3, [test] = FALSE ),
            "Column1", [Column1] & "",
            "test", [test] & ""
        )
    )
VAR _7 =
    FILTER ( _5, [Column1] = MAXX ( _4, [Column1] ) )
VAR _8 =
    UNION ( _4, _7 )
RETURN
    MAXX (
        FILTER ( _8, [Column1] = CALCULATE ( MAX ( 'fact'[Column1] ) ) ),
        [test]
    )
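For a quick sanity check of the intended Col2 outside the model, a minimal pandas sketch (assuming plain substring matching is acceptable, per the "peanuts" remark in the question):

import re
import pandas as pd

sentences = pd.Series([
    "The aardvark admitted it was wrong",
    "The attractive peanut farmer graded the term paper",
    "The awning was too tall to touch",
])
find_these = ["peanut", "aardvark"]

# Build one alternation pattern; re.escape guards against regex metacharacters.
pattern = "|".join(map(re.escape, find_these))
col2 = sentences.str.contains(pattern)
print(col2.tolist())  # [True, True, False]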

Finding Next Business Day with DAX

=IF(AND(WEEKDAY(AA3,2)<5,(AA3-INT(AA3))<17/24),((INT(AA3)+1)+12/24),
 IF(AND(WEEKDAY(AA3,2)<5,(AA3-INT(AA3))>17/24),((INT(AA3)+2)+12/24),
 IF(WEEKDAY(AA3,2)=5,(INT(AA3)+4)+12/24,
 IF(WEEKDAY(AA3,2)=7,(INT(AA3)+2)+12/24,
 IF(WEEKDAY(AA3,2)=6,(INT(AA3)+3)+12/24,)))))
I am trying to find the next business day depending on the day of the week and the hour of the day. Here is what I have converted to DAX, but it does not work and I have no idea why.
NBD =
IF (
AND (
WEEKDAY ( D2S[Actual Received Time], 2 <= 5 ),
HOUR ( D2S[Actual Received Time] ) < 14
),
(
INT ( D2S[Actual Received Time] ) + 23.99 / 24
),
IF (
AND (
WEEKDAY ( D2S[Actual Received Time], 2 ) = 5,
HOUR ( D2S[Actual Received Time] > 14 )
),
(
INT ( D2S[Actual Received Time] ) + 3 + 12 / 24
),
IF (
AND (
WEEKDAY ( D2S[Actual Received Time], 2 ) <= 5,
HOUR ( D2S[Actual Received Time] ) >= 14
),
INT ( D2S[Actual Received Time] ) + 1 + 12 / 24,
IF (
WEEKDAY ( D2S[Actual Received Time], 2 ) = 6,
INT ( D2S[Actual Received Time] ) + 2 + 12 / 24,
IF (
WEEKDAY ( D2S[Actual Received Time], 2 ) = 7,
INT ( D2S[Actual Received Time] ) + 1 + 12 / 24
)
)
)
)
)
Line 4 of your formula should read:
WEEKDAY ( D2S[Actual Received Time], 2 ) <= 5,
(with the "<= 5" outside the parentheses). Note that the HOUR comparison in your second condition has the same problem in reverse: the > 14 belongs outside the parentheses, i.e. HOUR ( D2S[Actual Received Time] ) > 14.
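As a side note, this kind of next-business-day-with-cutoff logic is easy to prototype outside DAX. A hedged numpy/pandas sketch (assuming a 2 PM cutoff and a noon delivery time, as in your DAX draft, and no holiday calendar):

import numpy as np
import pandas as pd

def next_business_day(ts, cutoff_hour=14):
    # Before the cutoff: roll to the next business day.
    # At or after the cutoff: skip one extra business day.
    offset = 1 if ts.hour < cutoff_hour else 2
    nbd = np.busday_offset(ts.date(), offset, roll='forward')
    return pd.Timestamp(nbd) + pd.Timedelta(hours=12)  # deliver at noon

print(next_business_day(pd.Timestamp('2019-01-30 10:00')))  # Wed before cutoff -> Thu 12:00
print(next_business_day(pd.Timestamp('2019-02-01 15:00')))  # Fri after cutoff -> Tue 12:00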

How to execute operations in dataframe for each unique id?

I have a dataframe which looks like this:
id   purchase_date
1    1-1-19
1    1-4-19
2    1-3-19
3    1-5-19
1    1-10-19
...
I want to add a column by applying the following condition: for each id, subtract that id's maximum purchase date from today's date. This gives the "inactive days". The resulting table should look like this (note that 20 appears 3 times because user 1 appears 3 times in the table):
Today's date = January 30, 2019 (1-30-19)
id   purchase_date   inactivity_days
1    1-1-19          20
1    1-4-19          20
2    1-3-19          27
3    1-5-19          25
1    1-10-19         20
...
How would I do this in pandas?
You can use groupby and transform:
import pandas as pd
# Make sure that purchase date is a proper datetime column:
df['purchase_date'] = pd.to_datetime(df['purchase_date'])
# Define todays_date variable:
todays_date = pd.to_datetime("1-30-19")
# group by id, and transform the `purchase_date` column with an anonymous function
df['inactivity_days'] = df.groupby('id').purchase_date.transform(lambda x: (todays_date - x.max()).days)
In [7]: df
Out[7]:
id purchase_date inactivity_days
0 1 2019-01-01 20
1 1 2019-01-04 20
2 2 2019-01-03 27
3 3 2019-01-05 25
4 1 2019-01-10 20
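To use the actual current date instead of the hardcoded one, the same transform works; a small variant (normalizing to midnight so the subtraction yields whole days):

todays_date = pd.Timestamp.today().normalize()
df['inactivity_days'] = df.groupby('id').purchase_date.transform(lambda x: (todays_date - x.max()).days)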

How do I use the DAX function ParallelPeriod

The ParallelPeriod function allows for the comparison of values between points in time (how do sales compare to a year ago?). I'm doing something wrong in my use of it, but have no idea what that thing may be.
Set up
I created a bog-simple PowerPivot SQL Server 2008+ source query and named it Source. The query generates 168 rows: 6 IDs (100-600) and 28 dates (the first of each month from Jan 2010 to Apr 2012), all cross-applied together.
; WITH SRC (groupKey, eventDate, value) AS
(
    SELECT G.groupKey, D.eventDate, CAST(rand(G.groupKey * year(D.eventDate) * month(D.eventDate)) * 100 AS int)
    FROM
    (
        SELECT 100
        UNION ALL SELECT 200
        UNION ALL SELECT 300
        UNION ALL SELECT 400
        UNION ALL SELECT 500
        UNION ALL SELECT 600
    ) G (groupKey)
    CROSS APPLY
    (
        SELECT CAST('2010-01-01' AS date)
        UNION ALL SELECT CAST('2010-02-01' AS date)
        UNION ALL SELECT CAST('2010-03-01' AS date)
        UNION ALL SELECT CAST('2010-04-01' AS date)
        UNION ALL SELECT CAST('2010-05-01' AS date)
        UNION ALL SELECT CAST('2010-06-01' AS date)
        UNION ALL SELECT CAST('2010-07-01' AS date)
        UNION ALL SELECT CAST('2010-08-01' AS date)
        UNION ALL SELECT CAST('2010-09-01' AS date)
        UNION ALL SELECT CAST('2010-10-01' AS date)
        UNION ALL SELECT CAST('2010-11-01' AS date)
        UNION ALL SELECT CAST('2010-12-01' AS date)
        UNION ALL SELECT CAST('2011-01-01' AS date)
        UNION ALL SELECT CAST('2011-02-01' AS date)
        UNION ALL SELECT CAST('2011-03-01' AS date)
        UNION ALL SELECT CAST('2011-04-01' AS date)
        UNION ALL SELECT CAST('2011-05-01' AS date)
        UNION ALL SELECT CAST('2011-06-01' AS date)
        UNION ALL SELECT CAST('2011-07-01' AS date)
        UNION ALL SELECT CAST('2011-08-01' AS date)
        UNION ALL SELECT CAST('2011-09-01' AS date)
        UNION ALL SELECT CAST('2011-10-01' AS date)
        UNION ALL SELECT CAST('2011-11-01' AS date)
        UNION ALL SELECT CAST('2011-12-01' AS date)
        UNION ALL SELECT CAST('2012-01-01' AS date)
        UNION ALL SELECT CAST('2012-02-01' AS date)
        UNION ALL SELECT CAST('2012-03-01' AS date)
        UNION ALL SELECT CAST('2012-04-01' AS date)
    ) D (eventDate)
)
SELECT
    *
FROM
    SRC;
I added a derived column in PowerPivot using a formula I lifted from MSDN
=CALCULATE(SUM(Source[value]), PARALLELPERIOD(Source[eventDate], -1, year))
There are no errors displayed but there's never any calculated data. I've tried different intervals (-1, +1) and periods (year, month) but to no avail.
The only thing I could observe that was different between my demo and the MSDN example was that theirs had a separate dimension defined for the date. Easy enough to rectify, so I created a Dates query with the following, which generates a row for all the days between 2010-01-01 and 2012-06-01 (1096 rows):
DECLARE
    @start int = 20100101
    , @stop int = 20120601;
WITH L0 AS
(
SELECT
0 AS C
UNION ALL
SELECT
0
)
, L1 AS
(
SELECT
0 AS c
FROM
L0 AS A
CROSS JOIN L0 AS B
)
, L2 AS
(
SELECT
0 AS c
FROM
L1 AS A
CROSS JOIN L1 AS B
)
, L3 AS
(
SELECT
0 AS c
FROM
L2 AS A
CROSS JOIN L2 AS B
)
, L4 AS
(
SELECT
0 AS c
FROM
L3 AS A
CROSS JOIN L3 AS B
)
, L5 AS
(
SELECT
0 AS c
FROM
L4 AS A
CROSS JOIN L4 AS B
)
, NUMS AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS number
FROM
L5
)
, YEARS AS
(
SELECT
Y.number
FROM
NUMS Y
WHERE
Y.number BETWEEN @start / 10000 AND @stop / 10000
)
, MONTHS AS
(
SELECT
Y.number
FROM
NUMS Y
WHERE
Y.number BETWEEN 1 and 12
)
, DAYS AS
(
SELECT
Y.number
FROM
NUMS Y
WHERE
Y.number BETWEEN 1 and 31
)
, CANDIDATES_0 AS
(
SELECT
Y.number * 10000 + M.number * 100 + D.number AS SurrogateKey
, CAST(Y.number * 10000 + M.number * 100 + D.number AS char(8)) AS DateValue
FROM
YEARS Y
CROSS APPLY
MONTHS M
CROSS APPLY
DAYS D
)
, HC AS
(
SELECT
Y.number * 10000 + M.number * 100 + D.number AS SurrogateKey
, CAST(Y.number * 10000 + M.number * 100 + D.number AS char(8)) AS DateValue
FROM
YEARS Y
CROSS APPLY
MONTHS M
CROSS APPLY
DAYS D
WHERE
D.number < 31
AND M.number IN (4,6,9,11)
UNION ALL
SELECT
Y.number * 10000 + M.number * 100 + D.number AS SurrogateKey
, CAST(Y.number * 10000 + M.number * 100 + D.number AS char(8)) AS DateValue
FROM
YEARS Y
CROSS APPLY
MONTHS M
CROSS APPLY
DAYS D
WHERE
D.number < 32
AND M.number IN (1,3,5,7,8,10,12)
UNION ALL
SELECT
Y.number * 10000 + M.number * 100 + D.number AS SurrogateKey
, CAST(Y.number * 10000 + M.number * 100 + D.number AS char(8)) AS DateValue
FROM
YEARS Y
CROSS APPLY
MONTHS M
CROSS APPLY
DAYS D
WHERE
D.number < 29
AND M.number = 2
AND
(
Y.number % 4 > 0
OR Y.number % 100 = 0 AND Y.number % 400 > 0
)
UNION ALL
SELECT
Y.number * 10000 + M.number * 100 + D.number AS SurrogateKey
, CAST(Y.number * 10000 + M.number * 100 + D.number AS char(8)) AS DateValue
FROM
YEARS Y
CROSS APPLY
MONTHS M
CROSS APPLY
DAYS D
WHERE
D.number < 30
AND M.number = 2
AND
(
Y.number % 4 = 0
OR Y.number % 100 = 0 AND Y.number % 400 = 0
)
)
, CANDIDATES AS
(
SELECT
C.SurrogateKey
, CAST(C.DateValue as date) As DateValue
FROM
HC C
WHERE
ISDATE(c.DateValue) = 1
)
, PARTS
(
DateKey
, FullDateAlternateKey
, DayNumberOfWeek
, EnglishDayNameOfWeek
, DayNumberOfMonth
, DayNumberOfYear
, WeekNumberOfYear
, EnglishMonthName
, MonthNumberOfYear
, CalendarQuarter
, CalendarYear
, CalendarSemester
--,FiscalQuarter
--,FiscalYear
--,FiscalSemester
) AS
(
SELECT
CAST(C.SurrogateKey AS int)
, C.DateValue
, DATEPART(WEEKDAY, C.DateValue)
, DATENAME(WEEKDAY, C.DateValue)
, DATEPART(DAY, C.DateValue)
, DATEPART(DAYOFYEAR, C.DateValue)
, DATEPART(WEEK, C.DateValue)
, DATENAME(MONTH, C.DateValue)
, DATEPART(MONTH, C.DateValue)
, DATEPART(QUARTER, C.DateValue)
, DATEPART(YEAR, C.DateValue)
, DATEPART(WEEK, C.DateValue)
FROM
CANDIDATES C
WHERE
C.DateValue IS NOT NULL
)
SELECT
P.*
FROM
--HC P
PARTS P
ORDER BY 1;
With the data generated, I created a relationship between Source and Dates and tried this formula, again with no luck:
=CALCULATE(SUM(Source[value]), PARALLELPERIOD(Dates[FullDateAlternateKey], -1, year))
Any thoughts on what I'm doing wrong?
References
PARALLELPERIOD Function
PowerPivot DAX PARALLELPERIOD vs DATEADD
The DAX expression you used in the derived column should instead be a measure, defined in the calculation area:
MeasurePriorPeriodValue := CALCULATE(SUM(Source[value]), PARALLELPERIOD(Source[eventDate], -1, year))
As long as the column you use in the PARALLELPERIOD function is configured as a date datatype, it should work even without a separate date table. Having the date table separated from the rest is a best practice but not required; it allows you to ensure that there are no gaps (which can cause problems with some DAX time-intelligence functions) and things like that.
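For intuition about what the measure should return: PARALLELPERIOD(dates, -1, year) lines each month up with the value from twelve months earlier. A rough pandas analogue over 28 monthly values (an illustration of the expected output, not of the PowerPivot mechanics):

import pandas as pd

# 28 month-start dates with one value per month, like the Source query.
s = pd.Series(range(28), index=pd.date_range('2010-01-01', periods=28, freq='MS'))
prior = s.shift(12)  # the value from the same month one year earlier
print(pd.DataFrame({'value': s, 'prior year value': prior}).loc['2011-01':'2011-03'])
# 2011 months show their 2010 counterparts; 2010 months show NaN,
# just as the measure is blank where no parallel period exists.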
