Databricks Spark SQL query plan optimization to avoid duplicate reads - apache-spark

I have a spark SQL query as below:
with xxx as (
select pb_id, pb_name, req_id, level, t_val_id_path
from (
select pb_id, pb_name, req_id, explode(req_vals) as t_id
from A
where dt = '2022-11-20') a
join (
select t_val_id, level, t_val_id_path
from B
where dt = '2022-11-20') b
on a.t_id = b.t_val_id)
select distinct
pb_id, req_id, -1 as l1, -1 as l2, -1 as l3
from xxx
union all
select distinct
pb_id, req_id, t_val_id_path[0] as l1, -1 as l2, -1 as l3
from xxx
union all
select distinct
pb_id, req_id, t_val_id_path[0] as l1, t_val_id_path[1] as l2, -1 as l3
from xxx
where level > 0
union all
select distinct
pb_id, req_id, t_val_id_path[0] as l1, if(level > 0, t_val_id_path[1], 1) as l2, if(level > 1, t_val_id_path[2], 1) as l3
from xxx;
In Azure Databricks, the SQL query plan is below:
Question:
From the SQL script, it looks like the query should only need to read the Hive tables A and B. But in the query plan we can see that A is read 4 times and B is read 4 times. Is it possible to read A and B just once, and then do the filtering and transformation in memory instead of reading them again and again?

Maybe you can try to cache your xxx table. Caching can be used in SQL as well, not only in the DataFrame API:
https://spark.apache.org/docs/3.0.0-preview/sql-ref-syntax-aux-cache-cache-table.html
So IMO it may look like this:
spark.sql("CACHE TABLE xxx as (
select pb_id, pb_name, req_id, level, t_val_id_path
from(
select pb_id, pb_name, req_id, explode(req_vals) as t_id
from A
where dt = '2022-11-20') a
join (
select t_val_id, level, t_val_id_path
from B
where dt = '2022-11-20')b
on a.t_id = b.t_val_id)")
spark.sql("""
select distinct
pb_id, req_id, -1 as l1, -1 as l2, -1 as l3
from xxx
union all
select distinct
pb_id, req_id, t_val_id_path[0] as l1, -1 as l2, -1 as l3
from xxx
union all
select distinct
pb_id, req_id, t_val_id_path[0] as l1, t_val_id_path[1] as l2, -1 as l3
from xxx
where level > 0
union all
select distinct
pb_id, req_id, t_val_id_path[0] as l1, if(level > 0, t_val_id_path[1], 1) as l2, if(level > 1, t_val_id_path[2], 1) as l3
from xxx
""")
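Alternatively, if you prefer the DataFrame API over CACHE TABLE, you can build the joined dataset once, cache it, and register it as the xxx view. A minimal sketch, assuming A and B are accessible via spark.table and the column names match the query above:
from pyspark.sql import functions as F

# Build the joined dataset once (same logic as the CTE above)
a = (spark.table("A")
     .where("dt = '2022-11-20'")
     .select("pb_id", "pb_name", "req_id", F.explode("req_vals").alias("t_id")))
b = (spark.table("B")
     .where("dt = '2022-11-20'")
     .select("t_val_id", "level", "t_val_id_path"))

xxx = (a.join(b, a["t_id"] == b["t_val_id"])
        .select("pb_id", "pb_name", "req_id", "level", "t_val_id_path")
        .cache())

# Register it so the UNION ALL query can refer to it as xxx
xxx.createOrReplaceTempView("xxx")
The four-branch UNION ALL query can then run against the cached view; the first action materializes the cache, and the remaining branches read from memory instead of scanning A and B again.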

Related

Pyspark Window function: Counting number of categorical variables and calculating percentages

I have a dataframe in the format below. There are different IDs, with product names and types associated with each ID.
ID Prod Name Type Total Qty
1 ABC A 200
1 DEF B 350
1 GEH B 120
1 JIK C 100
1 LMO A 40
2 ABC A 10
2 DEF A 20
2 GEH C 30
2 JIK C 40
2 LMO A 50
So I am trying to get the percentage of the total quantity that is of type A, B and C for each ID, in separate columns. As a first step, I tried using a window function, but it gave me the count of "A" across the whole column.
df.withColumn("count_cat", F.count("Type").over(Window.partitionBy("Type")))
But I need something like this
ID total Products Total Qty % of A % of B % of C
1 5 810 0.29 0.58 0.12
Approach 1: Group By Aggregation
Based on your expected output, aggregates based on a GROUP BY ID would be sufficient.
You may achieve this using either of the following, assuming your initial dataset is stored in a dataframe input_df.
Using spark sql
ensure your dataframe is accessible by creating a temporary view
input_df.createOrReplaceTempView("input_df")
Running the sql below on your spark session
output_df = sparkSession.sql("""
SELECT
ID,
COUNT(Prod_Name) as `total products`,
SUM(Total_Qty) as `Total Qty`,
SUM(
CASE WHEN Type='A' THEN Total_Qty END
) / SUM(Total_Qty) as `% of A`,
SUM(
CASE WHEN Type='B' THEN Total_Qty END
) / SUM(Total_Qty) as `% of B`,
SUM(
CASE WHEN Type='C' THEN Total_Qty END
) / SUM(Total_Qty) as `% of C`
FROM
input_df
GROUP BY
ID
""").na.fill(0)
Using the pyspark API
from pyspark.sql import functions as F
output_df = (
input_df.groupBy("ID")
.agg(
F.count("Prod_Name").alias("total products"),
F.sum("Total_Qty").alias("Total Qty"),
(F.sum(
F.when(
F.col("Type")=="A",F.col("Total_Qty")
).otherwise(0)
) / F.sum("Total_Qty")).alias("% of A"),
(F.sum(
F.when(
F.col("Type")=="B",F.col("Total_Qty")
).otherwise(0)
)/ F.sum("Total_Qty")).alias("% of B"),
(F.sum(
F.when(
F.col("Type")=="C",F.col("Total_Qty")
).otherwise(0)
)/ F.sum("Total_Qty")).alias("% of C")
)
)
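For the sample data in the question, either variant should produce one summary row per ID. Roughly (row order and float formatting may differ):
output_df.show()
# Expected (approximately):
# +---+--------------+---------+------------------+------------------+-------------------+
# | ID|total products|Total Qty|            % of A|            % of B|             % of C|
# +---+--------------+---------+------------------+------------------+-------------------+
# |  1|             5|      810|0.2962962962962963|0.5802469135802469|0.12345679012345678|
# |  2|             5|      150|0.5333333333333333|               0.0| 0.4666666666666667|
# +---+--------------+---------+------------------+------------------+-------------------+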
Approach 2: Using Windows
If you would like to add these as 5 additional columns to your dataset, you may use similar aggregations with a window, i.e. OVER (PARTITION BY ID) in SQL or Window.partitionBy("ID") in the API, as shown below.
Using spark sql
ensure your dataframe is accessible by creating a temporary view
input_df.createOrReplaceTempView("input_df")
Running the sql below on your spark session
output_df = sparkSession.sql("""
SELECT
*,
COUNT(Prod_Name) OVER (PARTITION BY ID) as `total products`,
SUM(Total_Qty) OVER (PARTITION BY ID) as `Total Qty`,
SUM(
CASE WHEN Type='A' THEN Total_Qty END
) OVER (PARTITION BY ID) / SUM(Total_Qty) OVER (PARTITION BY ID) as `% of A`,
SUM(
CASE WHEN Type='B' THEN Total_Qty END
) OVER (PARTITION BY ID)/ SUM(Total_Qty) OVER (PARTITION BY ID) as `% of B`,
SUM(
CASE WHEN Type='C' THEN Total_Qty END
) OVER (PARTITION BY ID) / SUM(Total_Qty) OVER (PARTITION BY ID) as `% of C`
FROM
input_df
""").na.fill(0)
Using the pyspark API
from pyspark.sql import functions as F
from pyspark.sql import Window
agg_window = Window.partitionBy("Id")
output_df = (
input_df.withColumn(
"total products",
F.count("Prod_Name").over(agg_window)
)
.withColumn(
"Total Qty",
F.sum("Total_Qty").over(agg_window)
)
.withColumn(
"% of A",
F.sum(
F.when(
F.col("Type")=="A",F.col("Total_Qty")
).otherwise(0)
).over(agg_window) / F.sum("Total_Qty").over(agg_window)
)
.withColumn(
"% of B",
F.sum(
F.when(
F.col("Type")=="B",F.col("Total_Qty")
).otherwise(0)
).over(agg_window) / F.sum("Total_Qty").over(agg_window)
)
.withColumn(
"% of C",
F.sum(
F.when(
F.col("Type")=="C",F.col("Total_Qty")
).otherwise(0)
).over(agg_window) / F.sum("Total_Qty").over(agg_window)
)
)
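Note that with the window approach each original row carries its own copy of the aggregated columns. If you ultimately want just one summary row per ID (as in your expected output), you can de-duplicate afterwards, for example:
# Collapse the per-row window results down to one summary row per ID
summary_df = (
    output_df
    .select("ID", "total products", "Total Qty", "% of A", "% of B", "% of C")
    .distinct()
)
summary_df.show()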
Let me know if this works for you.
One approach (without repeating A, B, C, etc.) is using pivot. The idea is to group first, then pivot the type:
from pyspark.sql import functions as F
from pyspark.sql import Window as W
(df
.groupBy('ID', 'Type')
.agg(F.sum('Total Qty').alias('qty'))
.withColumn('pct', F.col('qty') / F.sum('qty').over(W.partitionBy('ID')))
.groupBy('ID')
.pivot('Type')
.agg(F.first('pct'))
.show()
)
# Output
# +---+------------------+------------------+-------------------+
# | ID| A| B| C|
# +---+------------------+------------------+-------------------+
# | 1|0.2962962962962963|0.5802469135802469|0.12345679012345678|
# | 2|0.5333333333333333| null| 0.4666666666666667|
# +---+------------------+------------------+-------------------+
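If you also want the total Products and Total Qty columns from your expected output alongside the pivoted percentages, one option (a sketch building on the snippet above; the alias names are just illustrative) is to compute the per-ID totals separately and join them onto the pivoted result:
from pyspark.sql import functions as F
from pyspark.sql import Window as W

# Per-ID totals
totals = df.groupBy('ID').agg(
    F.count('Prod Name').alias('total Products'),
    F.sum('Total Qty').alias('Total Qty')
)

# Per-ID percentages by type, pivoted into columns
pct = (df
    .groupBy('ID', 'Type')
    .agg(F.sum('Total Qty').alias('qty'))
    .withColumn('pct', F.col('qty') / F.sum('qty').over(W.partitionBy('ID')))
    .groupBy('ID')
    .pivot('Type')
    .agg(F.first('pct')))

# null appears where an ID has no rows of a given type, so fill with 0
totals.join(pct, 'ID').na.fill(0).show()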

How to transform one row into multiple columns in ADF?

I have a table as a source with just 1 row and 336 columns in Azure Data Factory, like this:
1       2       3       4       5       6       7       8       9
value1  value2  value3  value4  value5  value6  value7  value8  value9
And I want to combine every 3 columns into the first 3:
1       2       3
value1  value2  value3
value4  value5  value6
value7  value8  value9
What is the alternative to using Select on every 3 columns and then Join, as that is a long process with this many columns?
If your data source is Azure SQL DB, you could use conventional SQL to transform the row with a combination of UNPIVOT, PIVOT and some of the ranking functions to help group the data. A simple example:
DROP TABLE IF EXISTS #tmp;
CREATE TABLE #tmp (
col1 VARCHAR(10),
col2 VARCHAR(10),
col3 VARCHAR(10),
col4 VARCHAR(10),
col5 VARCHAR(10),
col6 VARCHAR(10),
col7 VARCHAR(10),
col8 VARCHAR(10),
col9 VARCHAR(10)
);
INSERT INTO #tmp
VALUES ( 'value1', 'value2', 'value3', 'value4', 'value5', 'value6', 'value7', 'value8', 'value9' )
SELECT [1], [2], [0] AS [3]
FROM
(
SELECT
NTILE(3) OVER( ORDER BY ( SELECT NULL ) ) nt,
ROW_NUMBER() OVER( ORDER BY ( SELECT NULL ) ) % 3 groupNumber,
newCol
FROM #tmp
UNPIVOT ( newCol for sourceCol In ( col1, col2, col3, col4, col5, col6, col7, col8, col9 ) ) uvpt
) x
PIVOT ( MAX(newCol) For groupNumber In ( [1], [2], [0] ) ) pvt;
Tweak the NTILE value depending on the number of columns you have - it should be the total number of columns divided by 3. For example, if you have 300 columns the NTILE value should be 100; if you have 336 columns it should be 112. A bigger example with 336 columns is available here.
Present the data to Azure Data Factory (ADF) either as a view or use the Query option in the Copy activity for example.
My results:
If you are using Azure Synapse Analytics then another fun way to approach this would be using Synapse Notebooks. With just three lines of code, you can get the table from the dedicated SQL pool, unpivot all 336 columns using the stack function and write it back to the database. This simple example is in Scala:
val df = spark.read.synapsesql("someDb.dbo.pivotWorking")
val df2 = df.select( expr("stack(112, *)"))
// Write it back
df2.write.synapsesql("someDb.dbo.pivotWorking_after", Constants.INTERNAL)
I have to admire the simplicity of it.
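The same stack trick also works outside Synapse. In case it helps, here is a small PySpark sketch of just the reshaping step, using a made-up 9-column, 1-row dataframe in place of the real 336-column table (the first argument to stack is the number of output rows, i.e. the column count divided by 3):
# Hypothetical 1-row, 9-column dataframe standing in for the 336-column source
wide = spark.createDataFrame(
    [tuple("value{0}".format(i) for i in range(1, 10))],
    ["col{0}".format(i) for i in range(1, 10)],
)

# stack(3, *) folds the 9 columns into 3 rows of 3 columns
narrow = wide.selectExpr("stack(3, *)").toDF("c1", "c2", "c3")
narrow.show()
# +------+------+------+
# |    c1|    c2|    c3|
# +------+------+------+
# |value1|value2|value3|
# |value4|value5|value6|
# |value7|value8|value9|
# +------+------+------+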

Pyspark How to create columns and fill True/False if rolling datetime record exists

The data set contains products with a daily record, but sometimes a day is missed, so I want to create extra columns to show whether a record exists for each of the past few days.
I have the conditions below:
Create T-1, T-2 and so on columns and fill them as below
Fill T-1 with 1 if the previous day's record exists, otherwise zero
Original Table :
Item Cat DateTime Value
A C1 1-1-2021 10
A C1 2-1-2021 10
A C1 3-1-2021 10
A C1 4-1-2021 10
A C1 5-1-2021 10
A C1 6-1-2021 10
B C1 1-1-2021 20
B C1 4-1-2021 20
Expect Result :
Item Cat DateTime Value T-1 T-2 T-3 T-4 T-5
A C1 1-1-2021 10 0 0 0 0 0
A C1 2-1-2021 10 1 0 0 0 0 (T-1 is 1 as we have 1-1-2021 record)
A C1 3-1-2021 10 1 1 0 0 0
A C1 4-1-2021 10 1 1 1 0 0
A C1 5-1-2021 10 1 1 1 1 0
A C1 6-1-2021 10 1 1 1 1 1
B C1 1-1-2021 20 0 0 0 0 0
B C1 2-1-2021 0 1 0 0 0 0 (the 2-1-2021 record needs to be created with value zero since it is missing from the original data-set; T-1 is 1 because the 1-1-2021 record exists in the original data-set)
B C1 3-1-2021 0 0 1 0 0 0
B C1 4-1-2021 20 0 0 1 0 0
B C1 5-1-2021 0 1 0 0 1 0
Let's assume you have the original table data stored in original_data. We can:
1. create a temporary view named daily_records to query with spark sql
2. generate the possible dates. This is done by finding the number of days between the min and max dates in the dataset, then generating the dates in between using the table generating function posexplode and space
3. generate all possible item/date records
4. join these records with the actual records to get a complete dataset with values
5. use spark sql to query the view and create the additional columns using left joins and CASE statements
# Step 1
original_data.createOrReplaceTempView("daily_records")
# Step 2-4
daily_records = sparkSession.sql("""
WITH date_bounds AS (
SELECT min(DateTime) as mindate, max(DateTime) as maxdate FROM daily_records
),
possible_dates AS (
SELECT
date_add(mindate,index.pos) as DateTime
FROM
date_bounds
lateral view posexplode(split(space(datediff(maxdate,mindate)),"")) index
),
unique_items AS (
SELECT DISTINCT Item, Cat from daily_records
),
possible__item_dates AS (
SELECT Item, Cat, DateTime FROM unique_items INNER JOIN possible_dates ON 1=1
),
possible_records AS (
SELECT
p.Item,
p.Cat,
p.DateTime,
r.Value
FROM
possible__item_dates p
LEFT JOIN
daily_records r on p.Item = r.Item and p.DateTime = r.DateTime
)
select * from possible_records
""")
daily_records.createOrReplaceTempView("daily_records")
daily_records.show()
# Step 5 - store results in desired_result
# This is optional, but I have chosen to generate the sql to create this dataframe
periods = 5 # Number of periods to check for
period_columns = ",".join(["""
CASE
WHEN t{0}.Value IS NULL THEN 0
ELSE 1
END as `T-{0}`
""".format(i) for i in range(1,periods+1)])
period_joins = " ".join(["""
LEFT JOIN
daily_records t{0} on datediff(to_date(t.DateTime),to_date(t{0}.DateTime))={0} and t.Item = t{0}.Item
""".format(i) for i in range(1,periods+1)])
period_sql = """
SELECT
t.*
{0}
FROM
daily_records t
{1}
ORDER BY
Item, DateTime
""".format(
"" if len(period_columns)==0 else ",{0}".format(period_columns),
period_joins
)
desired_result= sparkSession.sql(period_sql)
desired_result.show()
Actual SQL generated:
SELECT
t.*,
CASE
WHEN t1.Value IS NULL THEN 0
ELSE 1
END as `T-1`,
CASE
WHEN t2.Value IS NULL THEN 0
ELSE 1
END as `T-2`,
CASE
WHEN t3.Value IS NULL THEN 0
ELSE 1
END as `T-3`,
CASE
WHEN t4.Value IS NULL THEN 0
ELSE 1
END as `T-4`,
CASE
WHEN t5.Value IS NULL THEN 0
ELSE 1
END as `T-5`
FROM
daily_records t
LEFT JOIN
daily_records t1 on datediff(to_date(t.DateTime),to_date(t1.DateTime))=1 and t.Item = t1.Item
LEFT JOIN
daily_records t2 on datediff(to_date(t.DateTime),to_date(t2.DateTime))=2 and t.Item = t2.Item
LEFT JOIN
daily_records t3 on datediff(to_date(t.DateTime),to_date(t3.DateTime))=3 and t.Item = t3.Item
LEFT JOIN
daily_records t4 on datediff(to_date(t.DateTime),to_date(t4.DateTime))=4 and t.Item = t4.Item
LEFT JOIN
daily_records t5 on datediff(to_date(t.DateTime),to_date(t5.DateTime))=5 and t.Item = t5.Item
ORDER BY
Item, DateTime
NB. to_date is optional if DateTime is already formatted as a date field or in the format yyyy-mm-dd
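As a side note (this is not the approach above, just an alternative sketch): once daily_records has been gap-filled as in steps 2-4, the T-1..T-5 flags can also be computed without the five self-joins, using a window and lag. This assumes DateTime is a real date column and that there is exactly one row per Item per day after the gap fill:
from pyspark.sql import functions as F
from pyspark.sql import Window

w = Window.partitionBy("Item").orderBy("DateTime")
result = daily_records
for k in range(1, 6):
    # lag(Value, k) is null when the row k days back was only generated (i.e. missing originally)
    result = result.withColumn(
        "T-{0}".format(k),
        F.when(F.lag("Value", k).over(w).isNotNull(), 1).otherwise(0),
    )
result.orderBy("Item", "DateTime").show()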

Add character to string in SQL

I have got two strings:
12, H220, H280
and
11, 36, 66, 67, H225, H319, H336
and I want to add character A to every place where there is no 'H', so the strings should look like
A12, H220, H280
and
A11, A36, A66, A67, H225, H319, H336
-- Work from the inside out:
--   1. remove the space before the H-items so they are not touched by the next step
--   2. replace each remaining ', ' (now only in front of non-H items) with ',A'
--   3. replace the whole string with 'A' + itself, i.e. prefix the first item with A
select REPLACE(Test,Test,'A'+Test) from (
select REPLACE(Test,', ', ',A') Test from (
select REPLACE(Test,', H',',H') Test from (
select '11, 36, 66, 67, H225, H319, H336' as Test) S) S1 ) S2
Try this:
SQL Fiddle demo
--Sample data
DECLARE @T TABLE (ID INT, COL1 VARCHAR(100))
INSERT @T (ID, COL1)
VALUES (1, '12, H220, H280'), (2, '11, 36, 66, 67, H225, H319, H336')
--Query
;WITH CTE AS
(
-- anchor member: insert 'A' at the position of the first character that is not 'H'
SELECT ID, STUFF(COL1, PATINDEX('%[^H]%', COL1), 0, 'A') COL1, 1 NUMBER
FROM @T
UNION ALL
-- recursive member: insert 'A' after each ', ' not followed by 'H' or an already-added 'A'
SELECT CTE.ID, STUFF(CTE.COL1, PATINDEX('%[,][ ][^HA]%', CTE.COL1) + 2, 0, 'A'), NUMBER + 1
FROM CTE JOIN @T T
ON CTE.ID = T.ID
WHERE PATINDEX('%[,][ ][^HA]%', CTE.COL1) > 0
)
,
CTE2 AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY NUMBER DESC) rn
FROM CTE
)
SELECT ID,COL1 FROM CTE2 WHERE RN = 1
Results:
| ID | COL1 |
|----|--------------------------------------|
| 1 | A12, H220, H280 |
| 2 | A11, A36, A66, A67, H225, H319, H336 |

How do I use the DAX function ParallelPeriod

The ParallelPeriod function allows for the comparison of values between points in time (e.g. how do sales compare to a year ago). I'm doing something wrong in my use of it, but have no idea what that thing may be.
Set up
I created a bog simple PowerPivot SQL Server 2008+ source query and named it Source. The query generates 168 rows: 6 IDs (100-600) and 28 dates (first of a month from Jan 2010 to Apr 2012) all cross applied together.
; WITH SRC (groupKey, eventDate, value) AS
(
SELECT G.groupKey, D.eventDate, CAST(rand(G.groupKey * year(D.eventDate) * month(D.eventDate)) * 100 AS int)
FROM
(
SELECT 100
UNION ALL SELECT 200
UNION ALL SELECT 300
UNION ALL SELECT 400
UNION ALL SELECT 500
UNION ALL SELECT 600
) G (groupKey)
CROSS APPLY
(
SELECT CAST('2010-01-01' AS date)
UNION ALL SELECT CAST('2010-02-01' AS date)
UNION ALL SELECT CAST('2010-03-01' AS date)
UNION ALL SELECT CAST('2010-04-01' AS date)
UNION ALL SELECT CAST('2010-05-01' AS date)
UNION ALL SELECT CAST('2010-06-01' AS date)
UNION ALL SELECT CAST('2010-07-01' AS date)
UNION ALL SELECT CAST('2010-08-01' AS date)
UNION ALL SELECT CAST('2010-09-01' AS date)
UNION ALL SELECT CAST('2010-10-01' AS date)
UNION ALL SELECT CAST('2010-11-01' AS date)
UNION ALL SELECT CAST('2010-12-01' AS date)
UNION ALL SELECT CAST('2011-01-01' AS date)
UNION ALL SELECT CAST('2011-02-01' AS date)
UNION ALL SELECT CAST('2011-03-01' AS date)
UNION ALL SELECT CAST('2011-04-01' AS date)
UNION ALL SELECT CAST('2011-05-01' AS date)
UNION ALL SELECT CAST('2011-06-01' AS date)
UNION ALL SELECT CAST('2011-07-01' AS date)
UNION ALL SELECT CAST('2011-08-01' AS date)
UNION ALL SELECT CAST('2011-09-01' AS date)
UNION ALL SELECT CAST('2011-10-01' AS date)
UNION ALL SELECT CAST('2011-11-01' AS date)
UNION ALL SELECT CAST('2011-12-01' AS date)
UNION ALL SELECT CAST('2012-01-01' AS date)
UNION ALL SELECT CAST('2012-02-01' AS date)
UNION ALL SELECT CAST('2012-03-01' AS date)
UNION ALL SELECT CAST('2012-04-01' AS date)
) D (eventDate)
)
SELECT
*
FROM
SRC;
I added a derived column in PowerPivot using a formula I lifted from MSDN
=CALCULATE(SUM(Source[value]), PARALLELPERIOD(Source[eventDate], -1, year))
There are no errors displayed but there's never any calculated data. I've tried different intervals (-1, +1) and periods (year, month) but to no avail.
The only difference I could observe between my demo and the MSDN example was that theirs had a separate dimension defined for the date. Easy enough to rectify, so I created a Dates query with the following. This query generates a row for every day between 2010-01-01 and 2012-06-01 (1096 rows).
DECLARE
@start int = 20100101
, @stop int = 20120601;
WITH L0 AS
(
SELECT
0 AS C
UNION ALL
SELECT
0
)
, L1 AS
(
SELECT
0 AS c
FROM
L0 AS A
CROSS JOIN L0 AS B
)
, L2 AS
(
SELECT
0 AS c
FROM
L1 AS A
CROSS JOIN L1 AS B
)
, L3 AS
(
SELECT
0 AS c
FROM
L2 AS A
CROSS JOIN L2 AS B
)
, L4 AS
(
SELECT
0 AS c
FROM
L3 AS A
CROSS JOIN L3 AS B
)
, L5 AS
(
SELECT
0 AS c
FROM
L4 AS A
CROSS JOIN L4 AS B
)
, NUMS AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS number
FROM
L5
)
, YEARS AS
(
SELECT
Y.number
FROM
NUMS Y
WHERE
Y.number BETWEEN @start / 10000 AND @stop / 10000
)
, MONTHS AS
(
SELECT
Y.number
FROM
NUMS Y
WHERE
Y.number BETWEEN 1 and 12
)
, DAYS AS
(
SELECT
Y.number
FROM
NUMS Y
WHERE
Y.number BETWEEN 1 and 31
)
, CANDIDATES_0 AS
(
SELECT
Y.number * 10000 + M.number * 100 + D.number AS SurrogateKey
, CAST(Y.number * 10000 + M.number * 100 + D.number AS char(8)) AS DateValue
FROM
YEARS Y
CROSS APPLY
MONTHS M
CROSS APPLY
DAYS D
)
, HC AS
(
SELECT
Y.number * 10000 + M.number * 100 + D.number AS SurrogateKey
, CAST(Y.number * 10000 + M.number * 100 + D.number AS char(8)) AS DateValue
FROM
YEARS Y
CROSS APPLY
MONTHS M
CROSS APPLY
DAYS D
WHERE
D.number < 31
AND M.number IN (4,6,9,11)
UNION ALL
SELECT
Y.number * 10000 + M.number * 100 + D.number AS SurrogateKey
, CAST(Y.number * 10000 + M.number * 100 + D.number AS char(8)) AS DateValue
FROM
YEARS Y
CROSS APPLY
MONTHS M
CROSS APPLY
DAYS D
WHERE
D.number < 32
AND M.number IN (1,3,5,7,8,10,12)
UNION ALL
SELECT
Y.number * 10000 + M.number * 100 + D.number AS SurrogateKey
, CAST(Y.number * 10000 + M.number * 100 + D.number AS char(8)) AS DateValue
FROM
YEARS Y
CROSS APPLY
MONTHS M
CROSS APPLY
DAYS D
WHERE
D.number < 29
AND M.number = 2
AND
(
Y.number % 4 > 0
OR Y.number % 100 = 0 AND Y.number % 400 > 0
)
UNION ALL
SELECT
Y.number * 10000 + M.number * 100 + D.number AS SurrogateKey
, CAST(Y.number * 10000 + M.number * 100 + D.number AS char(8)) AS DateValue
FROM
YEARS Y
CROSS APPLY
MONTHS M
CROSS APPLY
DAYS D
WHERE
D.number < 30
AND M.number = 2
AND
(
Y.number % 4 = 0
OR Y.number % 100 = 0 AND Y.number % 400 = 0
)
)
, CANDIDATES AS
(
SELECT
C.SurrogateKey
, CAST(C.DateValue as date) As DateValue
FROM
HC C
WHERE
ISDATE(c.DateValue) = 1
)
, PARTS
(
DateKey
, FullDateAlternateKey
, DayNumberOfWeek
, EnglishDayNameOfWeek
, DayNumberOfMonth
, DayNumberOfYear
, WeekNumberOfYear
, EnglishMonthName
, MonthNumberOfYear
, CalendarQuarter
, CalendarYear
, CalendarSemester
--,FiscalQuarter
--,FiscalYear
--,FiscalSemester
) AS
(
SELECT
CAST(C.SurrogateKey AS int)
, C.DateValue
, DATEPART(WEEKDAY, C.DateValue)
, DATENAME(WEEKDAY, C.DateValue)
, DATEPART(DAY, C.DateValue)
, DATEPART(DAYOFYEAR, C.DateValue)
, DATEPART(WEEK, C.DateValue)
, DATENAME(MONTH, C.DateValue)
, DATEPART(MONTH, C.DateValue)
, DATEPART(QUARTER, C.DateValue)
, DATEPART(YEAR, C.DateValue)
, DATEPART(WEEK, C.DateValue)
FROM
CANDIDATES C
WHERE
C.DateValue IS NOT NULL
)
SELECT
P.*
FROM
--HC P
PARTS P
ORDER BY 1;
With data generated, I created a relationship between the Source and Dates and tried this formula with no luck either
=CALCULATE(SUM(Source[value]), PARALLELPERIOD(Dates[FullDateAlternateKey], -1, year))
The PowerPivot designer looks like this:
Any thoughts on what I'm doing wrong?
References
PARALLELPERIOD Function
PowerPivot DAX PARALLELPERIOD vs DATEADD
The DAX expression you used in the derived column should be a measure and defined in the calculation area...
MeasurePriorPeriodValue := CALCULATE(SUM(Source[value]), PARALLELPERIOD(Source[eventDate], -1, year))
...as long as the column you use in the PARALLELPERIOD function is configured as a date data type, it should still work. Having the date table separated from the rest is "best practice" but not required, because it allows you to ensure that there are no gaps (which can cause problems with some DAX time-intelligence functions) and things like that.
