I am using the Coalesce function to return a value from my preferred ranking of Columns but I also want to include the name of the column that the value was derived from.
`i.e
Table:
Apples Pears Mangos
4 5
**SQL **
; with CTE as
(
Select
Coalesce(Apples,Pears,Mangos) as QTY_Fruit
from Table
) select *, column name from QTY_Fruit
from CTE
Result:
QTY_Fruit Col Name
4 Pears`
I am trying to avoid a case statement if possible because there are about 12 fields that I will need to use in my Coalesce. Would love for an easy way to pull the column name based on value in QTY_Fruit. I'm all ears if the answer lies outside the use of subqueries but I figured this would be a start.
Related
As the title states, I am trying to do a merge of 2 tables. I want a nested joint where the values from the first table are always there and rows matching the second table are added to the first. I believe this is known as the nested join.
Unfortunately, it only allows for 1 key to 1 key matching where as I need it for 1 key in table 1 to 2 keys in table 2
Here is an example
Table1:
Group
..
..
Time
Date
Table2:
Group 1
Group 2
..
..
..
Other Info
What I want is where "Group = Group 1 OR Group = Group 2" and display the matching row from table 2 nested into Table 1
I looked at the following example but I must be confused by the syntax because it doesn't seem to be working for me.
How to join two tables in PowerQuery with one of many columns matching?
So after further investigation of the answer post I linked earlier, I will add an explanation of it here:
Table.AddColumn(Source, "Name_of_Column",
(Q1) => Table.SelectRows(Query2,
each Q1[Col_from_q1] = [Col_from_q2] or Q1[Col_from_q1] = [2_Col_from_q2]
)
)
So this did work for me and it adds an extra column that needs to be expanded to get all the values from the table. What i would add is that I don't know / haven't tested if there are multiple matches and how it treats it, based on nestedjoin, I would assume that it will duplicate rows in the first table.
I have the following table:
DEST_COUNTRY_NAME ORIGIN_COUNTRY_NAME count
United States Romania 15
United States Croatia 1
United States Ireland 344
Egypt United States 15
The table is represented as a Dataset.
scala> dataDS
res187: org.apache.spark.sql.Dataset[FlightData] = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
The following query to sort a Dataset based on count column works. I am getting the count column, sorting it and showing the result:
scala> dataDS.sort($"count".desc).show;
But if I try to use select then it doesn't work. Why?
scala> dataDS.select(dataDS.col("count").desc).show()
I get the error:
java.lang.UnsupportedOperationException: Cannot evaluate expression: input[0, int, true] DESC NULLS LAST
I have several questions around this:
What is the purpose of sort because to me it seems the ordering is done by col("..").desc? Does sort just converts a Column datatype to a Dataset?
Why doesn't using select work? My logic is (a) create a descending order of column dataDS.col("count").desc, (b) select it and (c) show it. The reason I expected this to work is because a similar sql query will work mysql> select count from flight_data_2015 ORDER BY count DESC;
The reason I expected this to work is because a similar sql query will work mysql> select count from flight_data_2015 ORDER BY count DESC;
But it isn't the same.
select(dataDS.col("count").desc) would be like SELECT count DESC FROM dataDS. Notice there is no ORDER BY clause.
This is what .orderBy or .sort in SparkSQL are doing, i.e. dataDS.sort($"count".desc).show; would be SELECT * FROM dataDS ORDER BY count DESC.
Also, note you could literally write dataDS.sql("SELECT ... ") (after registering the temp view) and it would have the same performance as doing it the other way.
Dataset.sort takes a list of Column objects within that Dataset, but it isn't converting them, only returning a new sorted Dataset
I can rank my data with this formula, which groups by Year, Trust and ID, and ranks the Areas.
rankx(
filter(Table,
[Year]=earlier([Year])&&[Trust]=earlier([Trust])&&[ID]=earlier([ID])),
[Area], ,1,Dense)
This works fine - unless you have data where the same Area appears more than once in the same group, whereupon it gives all rows the rank of 1. Is there any way to force unique rank values? So two rows that have the same Area would be given the rank of 1 and 2 (in an arbitrary order)? Thank you for your time.
Assuming you don't have duplicate rows in your table, you can add another column as a tie-breaker in your expression.
Suppose your table has an additional column, [Name], that is distinct between your multiple [Area] rows. Then you could write your formula like this:
= RANKX(
FILTER(Table,
[Year] = EARLIER([Year]) &&
[Trust] = EARLIER([Trust]) &&
[ID] = EARLIER([ID])),
[Area] & [Name], , 1, Dense)
You can append as many columns as you need to get the tie-breaking done.
I have this table which has foreign keys from several other keys:
Basically, this table shows which students registered in which module run by which teacher in what term.
I want to query the following:
How many students have registered for more than one module run by a given tutor?
It will look something like this:
For example, Vasiliy Kuznetsov runs two modules: FunPro and NO. If one student registers for both of them, he is counted as one.
My sql oriented mind is telling me this: Count all the rows in which student_id and tutor_id are the same. For example, in one row student_id is 5 and tutor_id is 10, and the same is true for the third row. Then, I count it as one.
How can I do that with DAX formulas?
RowCount:=
COUNTROWS( ModuleRegistration )
StudentsWithTwoOrMoreRegistrations:=
COUNTROWS(
FILTER(
VALUES( ModuleRegistration[Student_ID] )
,[RowCount] >= 2
)
)
I refer to arguments positionally, thus the first argument to a function is (1), the second (2), and so on.
So, [RowCount] is trivial.
[StudentsWithTwoOrMoreRegistrations] is a bit more involved. DAX, being a functional language, is best understood inside-out.
FILTER() takes a table expression in (1) and evaluates a boolean predicate, (2), for each row in (1). It returns all rows from (1) for which (2) evaluates to true.
Our FILTER()'s (1) is VALUES( ModuleRegistration[Student_ID] ). VALUES() returns the unique rows from a field based on current filter context (it respects slicers and filters in the pivot table). Thus, we will return some subset of the unique list of [Student_ID]s.
Our FILTER()'s (2) is [RowCount] >= 2. For each [Student_ID] in (1), we'll evaluate [RowCount], checking how many times that student appears in ModuleRegistration. [RowCount] is evaluated in the combination of filter context from the pivot table (the [Faculty Name] field in your sample pivot provides filter context) and row context from FILTER()'s (1). Thus it counts how many times the student appears in ModuleRegistration for the [Faculty Name] on the pivot table row.
We check that [RowCount] is >= 2.
You've not indicated if your measure needs to handle grand totals, or how you might want to see that. If you need more help for the grand total to get it to behave the way you like, let me know.
Edit for grand total
There are a few ways you might want to handle grand totals. I'm gong to assume that you want a unique count of students.
StudentsWithTwoOrMoreRegistrations:=
COUNTROWS(
SUMMARIZE(
FILTER(
SUMMARIZE(
ModuleRegistration
,ModuleRegistration[Tutor_ID]
,ModuleRegistration[Student_ID]
)
,[RowCount] >= 2
)
,ModuleRegistration[Student_ID]
)
)
WTF happened to our measure?
Let's examine:
Starting with the innermost SUMMARIZE(). SUMMARIZE() navigates relationships outward from the table in (1) and groups by the columns listed in (2)-(N) (these don't have to be from the table in (1), but must be reachable by navigating relationships).
This is equivalent to the following in SQL:
SELECT
mr.Tutor_ID
,mr.Student_ID
FROM ModuleRegistration mr
We use FILTER() on this table like earlier. [RowCount] is evaluated in the combination of filter context from the pivot table and the row in the table, defined by our SUMMARIZE() above.
Now our row context is instead of just a student, a student-tutor pair. This pair will have a [RowCount] >= 2 when the student has taken more than one module from a tutor.
Our FILTER() returns the pairs which have a [RowCount] >= 2. This output table has two fields, [Tutor_ID] and [Student_ID], but we want to count distinct [Student_ID]s out of this.
Thus, we use the table from FILTER() as our (1) in the outer SUMMARIZE(). We group only by the values of [Student_ID]. We then count the rows of this table.
When only one [Faculty_Name] is in context, e.g. on a pivot table row, then our inner SUMMARIZE() is grouping by a single value of [Tutor_ID] and whatever [Student_ID]s are associated with it. This is identical to our earlier measure.
When we have many [Tutor_ID]s in context, like in the grand total, then we'll see the appropriate behavior of only counting each [Student_ID] once.
I have been melting my brain trying to work out the formula i need for a multiple conditional lookup.
I have two data sets, one is job data and the other is contract data.
The job data contains customer name, location of job and date of job. I need to find out if the job was contracted when it took place, and if it was return a value from column N in the contract data.
The problem comes when i try to use the date ranges, as there are frequently more than one contract per customer.
So for example, in my job data:-
CUSTOMER | LOCATION | JOB DATE
Cust A | Port A | 01/01/2014
Cust A | Port B | 01/02/2014
Customer A had a contract in port B that expired on 21st Feb 2014, so here i would want it to return the value from column N in my contract data as the job was under contract.
Customer A did not have a contract in port A at the time of the job, so i would want it to return 'no contract'.
Contract data has columns containing customer name, port name, and a start and end date value, as well as my lookup category.
I think i need to be using index / match but i can't seem to get them to work with my date ranges. Is there another type of lookup i can use to get this to work?
Please help, I'm losing the plot!
Thanks :)
You can use two approaches here:
In both result and source tables make a helper column that concatenates all three values like this: =A2&B2&C2. So that you get something like 'Cust APort A01/01/2014'. That is, you get a unique value by which you can identify the row. You can add delimiter if needed: =A2&"|"&B2&"|"&C2. Then you can perform VLOOKUP by this value.
You can add a helper column with row number (1, 2, 3 ...) in source table. Then you can use =SUMIFS(<row_number_column>,<source_condition_column_1>,<condition_1>,<source_condition_column_2>,<condition_2>,...) to return the row number of source table that matches all three conditions. You can use this row number to perform INDEX or whatever is needed. But BE CAREFUL: check that there are only unique combinations of all three columns in source table, otherwise this approach may return wrong results. I.e. if matching conditions are met in rows 3 and 7 it will return 10 which is completely wrong.