DAX: Search a string for multiple values (Excel)

I need to create a new DAX calculated column that searches the text of another column in the same table. It should look for any of the values in a second table and return TRUE if any of them are found. Simplified example:
Let's say I have a table named Sentences with 1 column:
Sentences
Col1
----------------
"The aardvark admitted it was wrong"
"The attractive peanut farmer graded the term paper"
"The awning was too tall to touch"
And another table named FindTheseWords with a list of values
FindTheseWords
Col1
----------------
peanut
aardvark
I'll be creating Col2 in the Sentences table, which should return
Sentences
Col1 Col2
---------------------------------------------------- ------------------------
"The aardvark admitted it was wrong" TRUE
"The attractive peanut farmer graded the term paper" TRUE
"The awning was too tall to touch" FALSE
The list of FindTheseWords is actually pretty long, so I can't just hardcode the words and use an OR; I need to reference the table. I don't care about word boundaries, so a sentence containing "peanuts" should also return TRUE for "peanut".
I've seen a good implementation of this in M, but the performance of my load took a pretty good hit, so I'm hoping to find a DAX option for a new column.
The M Solution, for reference: How to search multiple strings in a string?

fact table
| Column1 |
|------------------------------------------------------|
| The aardvark admitted it was wrong |
| The attractive peanut farmer graded the term paper |
| The awning was too tall to touch |
| This is text string |
| Tester is needed |
sentence table
| Column1 |
|------------|
| attractive |
| peanut |
| aardvark |
| Tester |
Calculated column
Column =
// _1: pipe-delimit each sentence so it can be treated as a PATH string
VAR _1 =
    ADDCOLUMNS ( 'fact', "newColumn", SUBSTITUTE ( 'fact'[Column1], " ", "|" ) )
// _2: explode each sentence into one row per word
VAR _2 =
    GENERATE (
        _1,
        ADDCOLUMNS (
            GENERATESERIES ( 1, PATHLENGTH ( [newColumn] ) ),
            "Words", PATHITEM ( [newColumn], [Value], TEXT )
        )
    )
// _3: flag every word that appears in the sentence (word list) table
VAR _3 =
    ADDCOLUMNS (
        _2,
        "test", CONTAINS ( VALUES ( sentence[Column1] ), sentence[Column1], [Words] )
    )
// _4: distinct sentences with at least one hit ("True")
VAR _4 =
    DISTINCT (
        SELECTCOLUMNS (
            FILTER ( _3, [test] = TRUE ),
            "Column1", [Column1] & "",
            "test", [test] & ""
        )
    )
// _5: distinct sentences from the miss rows ("False")
VAR _5 =
    DISTINCT (
        SELECTCOLUMNS (
            FILTER ( _3, [test] = FALSE ),
            "Column1", [Column1] & "",
            "test", [test] & ""
        )
    )
VAR _7 =
    FILTER ( _5, [Column1] = MAXX ( _4, [Column1] ) )
VAR _8 =
    UNION ( _4, _7 )
RETURN
    // text "True" sorts after "False", so MAXX yields "True" whenever any word matched
    MAXX (
        FILTER ( _8, [Column1] = CALCULATE ( MAX ( 'fact'[Column1] ) ) ),
        [test]
    )
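A possibly simpler alternative for the calculated column (a sketch, untested against your model, reusing the 'fact' and sentence table names from the answer above): iterate the word list with SUMX and use DAX's SEARCH, whose fourth argument supplies a not-found fallback. Substring hits count, which matches the question's note that "peanuts" should match "peanut".
Column2 =
SUMX (
    sentence,
    // SEARCH is case-insensitive and returns 0 (the fallback) when the word is absent
    IF ( SEARCH ( sentence[Column1], 'fact'[Column1], 1, 0 ) > 0, 1, 0 )
) > 0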

Related

Pyspark Window function: Counting number of categorical variables and calculating percentages

I have a dataframe in the format below. There are different IDs, each with product names and a type associated with every product.
| ID | Prod Name | Type | Total Qty |
|----|-----------|------|-----------|
| 1  | ABC       | A    | 200       |
| 1  | DEF       | B    | 350       |
| 1  | GEH       | B    | 120       |
| 1  | JIK       | C    | 100       |
| 1  | LMO       | A    | 40        |
| 2  | ABC       | A    | 10        |
| 2  | DEF       | A    | 20        |
| 2  | GEH       | C    | 30        |
| 2  | JIK       | C    | 40        |
| 2  | LMO       | A    | 50        |
So I am trying to get the percentage of the total for A's, B's and C's per ID in separate columns. As a first step I tried a window function, but partitioning by Type gave me the count of each Type across the whole dataframe rather than within each ID:
df.withColumn("count_cat", F.count("Type").over(Window.partitionBy("Type")))
But I need something like this:
| ID | total products | Total Qty | % of A | % of B | % of C |
|----|----------------|-----------|--------|--------|--------|
| 1  | 5              | 810       | 0.29   | 0.58   | 0.12   |
Approach 1: Group By Aggregation
Based on your expected output, aggregates computed with a GROUP BY on ID are sufficient.
You may achieve this as follows, assuming your initial dataset is stored in a dataframe input_df.
Using Spark SQL
First make your dataframe accessible by creating a temporary view:
input_df.createOrReplaceTempView("input_df")
Then run the SQL below on your Spark session:
output_df = sparkSession.sql("""
    SELECT
        ID,
        COUNT(Prod_Name) as `total products`,
        SUM(Total_Qty) as `Total Qty`,
        SUM(CASE WHEN Type='A' THEN Total_Qty END) / SUM(Total_Qty) as `% of A`,
        SUM(CASE WHEN Type='B' THEN Total_Qty END) / SUM(Total_Qty) as `% of B`,
        SUM(CASE WHEN Type='C' THEN Total_Qty END) / SUM(Total_Qty) as `% of C`
    FROM
        input_df
    GROUP BY
        ID
""").na.fill(0)
Using the PySpark API
from pyspark.sql import functions as F

output_df = (
    input_df.groupBy("ID")
    .agg(
        F.count("Prod_Name").alias("total products"),
        F.sum("Total_Qty").alias("Total Qty"),
        (F.sum(F.when(F.col("Type") == "A", F.col("Total_Qty")).otherwise(0))
         / F.sum("Total_Qty")).alias("% of A"),
        (F.sum(F.when(F.col("Type") == "B", F.col("Total_Qty")).otherwise(0))
         / F.sum("Total_Qty")).alias("% of B"),
        (F.sum(F.when(F.col("Type") == "C", F.col("Total_Qty")).otherwise(0))
         / F.sum("Total_Qty")).alias("% of C"),
    )
)
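For the sample data, either variant should yield one row per ID. The figures below are just the arithmetic worked out by hand (for example, 810 = 200 + 350 + 120 + 100 + 40 and 240 / 810 ≈ 0.2963):
# +---+--------------+---------+------+------+------+
# | ID|total products|Total Qty|% of A|% of B|% of C|
# +---+--------------+---------+------+------+------+
# |  1|             5|      810|0.2963|0.5802|0.1235|
# |  2|             5|      150|0.5333|   0.0|0.4667|
# +---+--------------+---------+------+------+------+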
Approach 2: Using Windows
If instead you would like to add these as five additional columns on every row of your dataset, you may use similar aggregations with the window OVER (PARTITION BY ID) or Window.partitionBy("ID"), as shown below.
Using Spark SQL
First make your dataframe accessible by creating a temporary view:
input_df.createOrReplaceTempView("input_df")
Then run the SQL below on your Spark session:
output_df = sparkSession.sql("""
    SELECT
        *,
        COUNT(Prod_Name) OVER (PARTITION BY ID) as `total products`,
        SUM(Total_Qty) OVER (PARTITION BY ID) as `Total Qty`,
        SUM(CASE WHEN Type='A' THEN Total_Qty END) OVER (PARTITION BY ID)
            / SUM(Total_Qty) OVER (PARTITION BY ID) as `% of A`,
        SUM(CASE WHEN Type='B' THEN Total_Qty END) OVER (PARTITION BY ID)
            / SUM(Total_Qty) OVER (PARTITION BY ID) as `% of B`,
        SUM(CASE WHEN Type='C' THEN Total_Qty END) OVER (PARTITION BY ID)
            / SUM(Total_Qty) OVER (PARTITION BY ID) as `% of C`
    FROM
        input_df
""").na.fill(0)
Using the PySpark API
from pyspark.sql import functions as F
from pyspark.sql import Window

agg_window = Window.partitionBy("ID")
output_df = (
    input_df
    .withColumn("total products", F.count("Prod_Name").over(agg_window))
    .withColumn("Total Qty", F.sum("Total_Qty").over(agg_window))
    .withColumn(
        "% of A",
        F.sum(F.when(F.col("Type") == "A", F.col("Total_Qty")).otherwise(0)).over(agg_window)
        / F.sum("Total_Qty").over(agg_window),
    )
    .withColumn(
        "% of B",
        F.sum(F.when(F.col("Type") == "B", F.col("Total_Qty")).otherwise(0)).over(agg_window)
        / F.sum("Total_Qty").over(agg_window),
    )
    .withColumn(
        "% of C",
        F.sum(F.when(F.col("Type") == "C", F.col("Total_Qty")).otherwise(0)).over(agg_window)
        / F.sum("Total_Qty").over(agg_window),
    )
)
Let me know if this works for you.
One approach (without repeating A, B, C, etc.) is to use pivot: group first, then pivot the Type column:
from pyspark.sql import functions as F
from pyspark.sql import Window as W
(df
.groupBy('ID', 'Type')
.agg(F.sum('Total Qty').alias('qty'))
.withColumn('pct', F.col('qty') / F.sum('qty').over(W.partitionBy('ID')))
.groupBy('ID')
.pivot('Type')
.agg(F.first('pct'))
.show()
)
# Output
# +---+------------------+------------------+-------------------+
# | ID| A| B| C|
# +---+------------------+------------------+-------------------+
# | 1|0.2962962962962963|0.5802469135802469|0.12345679012345678|
# | 2|0.5333333333333333| null| 0.4666666666666667|
# +---+------------------+------------------+-------------------+
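If you also need the total products and Total Qty columns from the expected output, one option (a sketch under the same column-name assumptions as the snippet above) is to compute the per-ID totals separately and join them onto the pivoted percentages:
from pyspark.sql import functions as F
from pyspark.sql import Window as W

# Per-ID totals (column names assumed to match the sample data)
totals = df.groupBy('ID').agg(
    F.count('Prod Name').alias('total products'),
    F.sum('Total Qty').alias('Total Qty'),
)

# Pivoted percentages, as above
pcts = (df
    .groupBy('ID', 'Type')
    .agg(F.sum('Total Qty').alias('qty'))
    .withColumn('pct', F.col('qty') / F.sum('qty').over(W.partitionBy('ID')))
    .groupBy('ID')
    .pivot('Type')
    .agg(F.first('pct')))

# Fill the missing Type combinations (e.g. no B for ID 2) with 0
output_df = totals.join(pcts, on='ID').na.fill(0)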

Splitting rows that contain a list of postcode prefixes into multiple rows, based on postcode area

I have a table with several columns of data, one of which contains a semicolon-separated list of postcode prefixes on each row.
Here's an example of the table layout:
+------+-----------------------------+
| Col1 | Col2 |
+------+-----------------------------+
| a | AB10; AB11; DD10; DD9 |
| b | S5; SS7; AA1; AA4 |
| c | AB33; AB34; AB36; GG10; GS9 |
+------+-----------------------------+
I'm looking to split the postcode prefixes into multiple rows, based on the area of the postcode, as below:
+------+------------------+
| Col1 | Col2 |
+------+------------------+
| a | AB10; AB11 |
| a | DD10; DD9 |
| b | S5 |
| b | SS7 |
| b | AA1; AA4 |
| c | AB33; AB34; AB36 |
| c | GG10 |
| c | GS9 |
+------+------------------+
I've found a VBA solution that splits on the semicolon delimiter, but not in the way I need:
Sub splitByColB()
    Dim r As Range, i As Long, ar
    Set r = Worksheets("Sheet1").Range("B4").End(xlUp)
    Do While r.Row > 1
        ar = Split(r.Value, ";")
        If UBound(ar) >= 0 Then r.Value = ar(0)
        For i = UBound(ar) To 1 Step -1
            r.EntireRow.Copy
            r.Offset(1).EntireRow.Insert
            r.Offset(1).Value = ar(i)
        Next
        Set r = r.Offset(-1)
    Loop
End Sub
I could import the table into SQL Server Express, so a SQL solution would also be welcome.
The SQL solution I put together uses a T-SQL function called DelimitedSplit8K, which works like the VBA Split function you are using.
-- Sample Data
DECLARE @table TABLE (Col1 CHAR(1) UNIQUE, Col2 CHAR(200));
INSERT @table (Col1,Col2)
VALUES ('a','AB10; AB11; DD10; DD9'),
       ('b','S5; SS7; AA1; AA4'),
       ('c','AB33; AB34; AB36; GG10; GS9');

WITH xx(Col1,i,Pre) AS
(
    SELECT t2.Col1, ss.Item+'', f.Pre
    FROM @table AS t2
    CROSS APPLY dbo.DelimitedSplit8K(t2.Col2,';') AS s
    CROSS APPLY (VALUES(RTRIM(LTRIM(s.Item)))) AS ss(Item)
    CROSS APPLY (VALUES(SUBSTRING(ss.Item,0,PATINDEX('%[0-9]%',ss.Item)))) AS f(Pre)
)
SELECT xx.Col1,
       Col2 = STUFF((SELECT '; '+i
                     FROM xx AS x2
                     WHERE x2.Col1 = xx.Col1 AND x2.Pre = xx.Pre
                     FOR XML PATH('')),1,2,'')
FROM xx
GROUP BY xx.Col1, xx.Pre;
Returns:
col1 Col2
---- ----------------------
a AB10; AB11
a DD10; DD9
b AA1; AA4
b S5
b SS7
c AB33; AB34; AB36
c GG10
c GS9
I also put together a cleaner solution that works with SQL Server 2017 (in case you upgrade, or for others on 2017 or later; STRING_SPLIT needs compatibility level 130+, and STRING_AGG and TRIM need 2017):
-- Sample Data
DECLARE @table TABLE (Col1 CHAR(1) UNIQUE, Col2 CHAR(200));
INSERT @table (Col1,Col2)
VALUES ('a','AB10; AB11; DD10; DD9'),
       ('b','S5; SS7; AA1; AA4'),
       ('c','AB33; AB34; AB36; GG10; GS9');

SELECT t.Col1, split.item
FROM @table AS t
CROSS APPLY
(
    SELECT STRING_AGG(ss.Item,'; ') WITHIN GROUP (ORDER BY ss.Item)
    FROM @table AS t2
    CROSS APPLY STRING_SPLIT(t2.Col2,';') AS s
    CROSS APPLY (VALUES(TRIM(s.[value]))) AS ss(Item)
    WHERE t.Col1 = t2.Col1
    GROUP BY SUBSTRING(ss.Item,0,PATINDEX('%[0-9]%',ss.Item))
) AS split(item);
You could also use nested Dictionary objects:
Sub splitByColB()
    Dim r As Range, ar, val1, val2, prefix As String
    Dim obj1 As Object, obj2 As Object
    Set obj1 = CreateObject("Scripting.Dictionary")
    With Worksheets("Sheet1")
        ' Outer dictionary: one entry per Col1 key; inner: one entry per prefix area
        For Each r In .Range("B2:B4")
            Set obj2 = CreateObject("Scripting.Dictionary")
            With obj2
                For Each val2 In Split(Replace(r.Value2, " ", vbNullString), ";")
                    prefix = GetLetters(CStr(val2))
                    .Item(prefix) = .Item(prefix) & val2 & " "
                Next
            End With
            Set obj1.Item(r.Offset(, -1).Value2) = obj2
        Next
        ' Rewrite the sheet: one row per (key, prefix area) pair
        .Range("A2:B4").ClearContents
        For Each val1 In obj1.keys
            .Cells(.Rows.Count, 1).End(xlUp).Offset(1).Resize(obj1(val1).Count).Value = val1
            For Each val2 In obj1(val1).keys
                .Cells(.Rows.Count, 2).End(xlUp).Offset(1).Value = obj1(val1)(val2)
            Next
        Next
    End With
End Sub

' Returns the leading letters of s (assumes every item contains a digit)
Function GetLetters(s As String) As String
    Dim i As Long
    Do While Not IsNumeric(Mid(s, i + 1, 1))
        i = i + 1
    Loop
    GetLetters = Left(s, i)
End Function

Cassandra Partition key duplicates?

I am new to Cassandra, so I have a few quick questions. Suppose I do this:
CREATE TABLE my_keyspace.my_table (
    id bigint,
    year int,
    datetime timestamp,
    field1 int,
    field2 int,
    PRIMARY KEY ((id, year), datetime)
);
I imagine Cassandra as something like Map<PartitionKey, SortedMap<ColKey, ColVal>>.
My question is about querying Cassandra with a WHERE clause such as:
SELECT * FROM my_keyspace.my_table WHERE id = 1 AND year = 4;
This could return 2 or more records. How does that fit the data model of Cassandra? If it really is a big HashMap, how come duplicate records for a partition key are allowed?
Thanks!
Each row is stored as a batch of entries in the SortedMap<ColKey, ColVal>, exploiting its sorted nature.
To build on your mental model: while there is only one partition key for id = 1 AND year = 4, there are multiple cells, one per clustering key and column:
(id, year) | ColKey | ColVal
------------------------------------------
1, 4 | datetime(1):field1 | 1 \ Row1
1, 4 | datetime(1):field2 | 2 /
1, 4 | datetime(5):field1 | 1 \
1, 4 | datetime(5):field2 | 2 / Row2
...
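To make the model concrete, here is a minimal CQL sketch (the timestamps are made up): both inserts share the partition key (1, 4), yet they create two distinct rows because the clustering column datetime differs:
INSERT INTO my_keyspace.my_table (id, year, datetime, field1, field2)
VALUES (1, 4, '2018-01-01 00:00:00', 1, 2);

INSERT INTO my_keyspace.my_table (id, year, datetime, field1, field2)
VALUES (1, 4, '2018-01-05 00:00:00', 1, 2);

-- Returns both rows, sorted by datetime within the partition.
-- Re-inserting with the SAME (id, year, datetime) would overwrite instead,
-- because the full primary key would then be identical.
SELECT * FROM my_keyspace.my_table WHERE id = 1 AND year = 4;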

Categorizing Consecutive Months With Zero Values Into Buckets

I've constructed a data model around utilization for my company's fleet in Power Query. We have a number of columns in the data model, specifically mileage, VIN, start date, and end date (see the example table below).
| Mileage | VIN | Start Date | End Date |
|---------|-----|------------|----------|
| 0       | 123 | 6/1/18     | 6/30/18  |
| 0       | 123 | 7/1/18     | 7/31/18  |
| 0       | 123 | 8/1/18     | 8/31/18  |
| 0       | 123 | 9/1/18     | 9/30/18  |
| 0       | 123 | 10/1/18    | 10/31/18 |
What I'm trying to accomplish: if mileage is 0 for one month, the vehicle is categorized into a bucket labeled 0-30 days; if mileage is 0 for two consecutive months, it is categorized as 31-60 days; and 0 mileage for three or more consecutive months is categorized as >60 days. From the example above, this vehicle would fall into the ">60 days" bucket. Is there an easy way to do this within the data model using DAX? Please let me know if you have any follow-up questions. Thank you!
Try this as a Calculated Column:
Buckets =
VAR rowDate = 'myTable'[Start Date]
VAR previousDate =
    CALCULATE (
        MAX ( 'myTable'[Start Date] ),
        FILTER (
            ALLEXCEPT ( 'myTable', 'myTable'[VIN] ),
            'myTable'[Start Date] < rowDate
        )
    )
VAR prePreviousDate =
    CALCULATE (
        MAX ( 'myTable'[Start Date] ),
        FILTER (
            ALLEXCEPT ( 'myTable', 'myTable'[VIN] ),
            'myTable'[Start Date] < previousDate
        )
    )
VAR PreviousMileage =
    CALCULATE (
        MAX ( 'myTable'[Mileage] ),
        ALLEXCEPT ( 'myTable', 'myTable'[VIN] ),
        'myTable'[Start Date] = previousDate
    )
VAR PrePreviousMileage =
    CALCULATE (
        MAX ( 'myTable'[Mileage] ),
        ALLEXCEPT ( 'myTable', 'myTable'[VIN] ),
        'myTable'[Start Date] = prePreviousDate
    )
RETURN
    SWITCH (
        TRUE (),
        'myTable'[Mileage] + IF ( ISBLANK ( PreviousMileage ), 1, PreviousMileage )
            + IF ( ISBLANK ( PrePreviousMileage ), 1, PrePreviousMileage ) = 0, "> 60 Days",
        'myTable'[Mileage] + IF ( ISBLANK ( PreviousMileage ), 1, PreviousMileage ) = 0, "31 to 60 Days",
        'myTable'[Mileage] = 0, "0 to 30 Days",
        "No Days"
    )
The result (I added some non-zero values for testing) buckets each row by how long the zero streak has lasted as of that month. Note the IF ( ISBLANK ( ... ), 1, ... ) guards: when a previous month doesn't exist for the VIN, substituting 1 keeps the sum non-zero, so missing history is not mistaken for another zero-mileage month.

JPQL: SELECT b, count(ts) FROM Branch b JOIN b.tourScheduleList ts WHERE ts.deleted = 0

I get the desired result with this query:
SELECT b, count(ts) FROM Branch b JOIN b.tourScheduleList ts WHERE ts.deleted = 0 GROUP BY b.id ORDER BY b.name ASC
b1 | 2
b2 | 1
but then I also need the count of ts.tourAppliedList, so I updated the query to:
SELECT b, count(ts), count(ta) FROM Branch b JOIN b.tourScheduleList ts JOIN ts.tourAppliedList ta WHERE ts.deleted = 0 GROUP BY b.id ORDER BY b.name ASC
which resulted in:
b1 | 3 | 3
b2 | 2 | 2
The result is wrong, and I don't know why count(ts) equals count(ta).
I tried returning ts and doing the count later, but that returns the whole collection without considering ts.deleted = 0:
SELECT b, ts FROM Branch b JOIN b.tourScheduleList ts WHERE ts.deleted = 0 GROUP BY b.id ORDER BY b.name ASC
Then in the view I just call #{item.ts.tourAppliedList.size()}, which doesn't take ts.deleted = 0 into account.
The problem is that your expectation is wrong.
This join will give you:
b1 | ts1 | ta1
b1 | ts1 | ta2
b1 | ts2 | ta3
b2 | ts3 | ta4
b2 | ts3 | ta5
Or something along these lines...
What happens when you group and count those rows?
Simple: you get 3 entries for b1 and 2 for b2.
What you need there is count(DISTINCT ts).
Since there can be multiple ta for every ts, counting distinct ts then differs from counting ta.
P.S. I don't know offhand whether JPQL permits COUNT(DISTINCT ...); if it doesn't, you'd better run two queries: count ts with a join only on ts, and then count ta with joins on both ts and ta.
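For what it's worth, the JPA specification does allow DISTINCT inside COUNT, so a minimal adjustment to the original query (a sketch, untested against this particular mapping) would be:
SELECT b, COUNT(DISTINCT ts), COUNT(ta)
FROM Branch b
JOIN b.tourScheduleList ts
JOIN ts.tourAppliedList ta
WHERE ts.deleted = 0
GROUP BY b.id
ORDER BY b.name ASC
COUNT(ta) still counts one row per applied entry, while COUNT(DISTINCT ts) collapses the schedule rows that the second join duplicates.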
