How to merge 2 tables into a single (side by side) table in Spotfire

I have two tables
Table 1:
name sex age
snr m 22
kkk f 23
djj m 33
kkk f 66
Table 2:
address country
hyd india
Ny US
london Uk
Neither table has a common key. How can I get a single table by arranging the two tables above side by side, like below?
Expected output:
name sex age address country
snr m 22 hyd india
kkk f 23 Ny US
djj m 33 london Uk
kkk f 66
Thanks in advance..

I don't know how reliable a join like this can be, especially if your table lengths don't match up.
that said, it's definitely possible. before you begin, add both tables to the analysis using whatever method works for you.
Step 1: Create a common key
in order to join tables you'll need some kind of common key. we can create one on the fly using the RowId() function, which returns the number (id) of the row.
from the Insert menu, choose Transformations...
select Calculate new column and click Add..
give the expression RowId() and name the column something like RowId
repeat these steps for each table in the analysis.
note that you need to do this via column transformation. transformations are calculated when a table is added/refreshed to the analysis, whereas calculated columns are evaluated as needed (basically). any join in Spotfire requires the transformation columns' more "static" nature; you will not be able to join on calculated columns.
Step 2: Join the tables
so here we do the actual join.
from the Insert menu, choose Columns...
make sure your left table ('Table 1' above) is selected
select your right table ('Table 2') by clicking Select ▼ and choosing it from From Current Analysis
click Next >
select our RowId column on both sides and click Match Selected, then click Next >
select whichever columns you want to add
choose Full Outer Join as the join method
finally, click Finish
your result matches your expected output.
if you have gaps in your data (empty rows in either the left or right table) your data will almost certainly be misaligned, as I believe Spotfire will completely ignore any blank rows. I don't think it's really recommended to join like this without a common key, so if you have trouble with mismatches, you may want to reevaluate your data situation.
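To make the idea concrete outside Spotfire, here is a minimal Python sketch of the same technique: number the rows of each table, then do a full outer join on that number. The helper names add_row_id and full_outer_join are made up for illustration and are not Spotfire functions; the data is the sample from the question.

```python
from itertools import count

def add_row_id(rows):
    """Mimic a RowId() transformation: number rows 1, 2, 3, ... in load order."""
    return {row_id: row for row_id, row in zip(count(1), rows)}

def full_outer_join(left, right, left_width, right_width):
    """Full outer join of two {row_id: row} mappings, padding missing sides with None."""
    joined = []
    for row_id in sorted(left.keys() | right.keys()):
        left_row = left.get(row_id, (None,) * left_width)
        right_row = right.get(row_id, (None,) * right_width)
        joined.append(left_row + right_row)
    return joined

table1 = [("snr", "m", 22), ("kkk", "f", 23), ("djj", "m", 33), ("kkk", "f", 66)]
table2 = [("hyd", "india"), ("Ny", "US"), ("london", "Uk")]

result = full_outer_join(add_row_id(table1), add_row_id(table2), 3, 2)
for row in result:
    print(row)
```

The fourth row of table 1 has no partner, so its address and country come out empty. The pairing depends purely on row order, which is why a skipped or blank row in either table shifts everything after it.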

Related

Data Filter Expression in ADF

I am trying to filter data in Azure Data Flow.
However, I do not know how to do this.
What I want to do is to extract only the records with the largest value in the "seq_no" column among those with duplicate IDs.
I just don't know what function to use to achieve this.
I await your answer.
Any answer would be appreciated.
Sorry for my bad English, I am Japanese.
Thanks for reading.
You can use an Aggregate transformation, group by id, and take max(seq_no). I reproduced this; the steps are below.
Sample data is taken as input.
id | seq_no | mark
1000 | 1 | 10
1001 | 1 | 10
1001 | 2 | 20
1002 | 1 | 30
1002 | 2 | 20
1002 | 3 | 10
img:1 Source Transformation data preview
Then an Aggregate transformation is added. In the Aggregate settings,
id is given as the group by column, and the aggregate expression for the seq_no column is max(seq_no).
Aggregate transform output data
img:2 Data preview of Aggregate transform.
To get the other column data corresponding to the maximum seq_no, a Join transformation is used.
Left stream: aggregate1
Right stream: source1
Join Type:Inner
Join conditions: aggregate1@id == source1@id
aggregate1@seq_no == source1@seq_no
img:3 Join Transformation settings
img:4 Join transformation data preview
Finally, a Select transformation is used to remove the extra columns.
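The same aggregate-then-join logic can be sketched in plain Python to check what the flow should return. This is only an illustration of the logic on the sample data above, not ADF code; the variable names are made up.

```python
rows = [
    {"id": 1000, "seq_no": 1, "mark": 10},
    {"id": 1001, "seq_no": 1, "mark": 10},
    {"id": 1001, "seq_no": 2, "mark": 20},
    {"id": 1002, "seq_no": 1, "mark": 30},
    {"id": 1002, "seq_no": 2, "mark": 20},
    {"id": 1002, "seq_no": 3, "mark": 10},
]

# Aggregate step: group by id and keep max(seq_no) per group.
max_seq = {}
for row in rows:
    current = max_seq.get(row["id"], row["seq_no"])
    max_seq[row["id"]] = max(current, row["seq_no"])

# Join step: inner join back to the source on both id and seq_no,
# which keeps only the row holding the maximum seq_no for each id.
latest = [row for row in rows if row["seq_no"] == max_seq[row["id"]]]

for row in latest:
    print(row)
```

On the sample data this keeps (1000, 1, 10), (1001, 2, 20) and (1002, 3, 10). If two rows share both the id and the maximum seq_no, both survive the join, so a further tie-break would be needed in that case.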

Merge two datasets without common column in Azure Data Factory

I have two datasets that I need to join/merge in Azure Data Factory, but they have no common identity column. This might be an oversight on my side, as it should be a very trivial task, but I cannot seem to do it via a join or a union.
One dataset has only a couple of rows with a "name" column, let's say rows A, B, C, whereas the other has thousands (1-N).
For each row in the large dataset I want A, B, C rows, so it effectively becomes:
1A
1B
1C
2A
2B
2C
...
Any help is appreciated,
Thank you.
You can use the Custom (cross) join type in the Join transformation to get this result.
Follow the demonstration below:
Sample Large Dataset(Numbers) with numbers up to 15.
Small Dataset(Letters)
Now use a Join with the large dataset as the left stream and the small dataset as the right stream, and choose a custom join with the condition true().
On the Optimize tab of the Join, set Broadcast to Off to get the data in the above format.
You can see the merge of the two datasets below.
If you want the above in a single column with values like 1A, 1B, 1C, ..., first use a Derived Column to concatenate the values and then pick the column you need using Select.
Derived Column
Now use select to select any column above.
Output
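For reference, a join whose condition is always true is a Cartesian product. A minimal Python sketch of what the Join plus Derived Column steps compute, assuming the large dataset holds the numbers 1 to 15 and the small one holds A, B, C as in the demonstration:

```python
from itertools import product

numbers = range(1, 16)       # large dataset (left stream)
letters = ["A", "B", "C"]    # small dataset (right stream)

# Cross join: every left row paired with every right row.
pairs = list(product(numbers, letters))

# Derived column: concatenate both values into one string.
combined = [f"{number}{letter}" for number, letter in pairs]

print(combined[:6])
```

The row count is the product of the two input sizes (15 x 3 = 45 here), so with thousands of rows on the large side the output grows accordingly.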

Spark - partitioning/bucketing of n-tables with overlapping but not identical ids

I'm currently trying to optimize a query on 2 rather large tables, which are characterized like this:
Table 1: id column - alphanumerical, about 300mil unique ids, more than 1bil rows overall
Table 2: id column - identical semantics, about 200mil unique ids, more than 1bil rows overall
Let's say on a given day, 17.03., I want to join those two tables on id.
Table 1 is left, table 2 is right, and I get about 90% matches, meaning table 2 has about 90% of the ids present in table 1.
One week later, table 1 has not changed (it could, but to keep the explanation simple, assume it didn't), while table 2 was updated and now contains more records. I do the join again and now some of the formerly missing ids have shown up, so I get about 95% matches.
In general, table1.id has some matches with table2.id at a given time, and these might change on a day-to-day basis.
I now want to optimize this join and came across the bucketing feature. Is this possible?
Example:
1st join: id "ABC123" is present in table1, not in table2. ABC123 gets sorted into a certain bucket, e.g. "1".
2nd join (a week later): id "ABC123" has now appeared in table2; how can it be ensured that it lands in the bucket of table 2 that is co-located with the one in table 1?
Or do I have a general problem understanding how it works?
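One way to reason about the example: as far as I understand it, Spark picks the bucket from a hash of the bucketing column's value modulo the number of buckets, so the bucket depends only on the id itself, not on when or in which table the id first appears. The sketch below illustrates that idea in plain Python. It uses zlib.crc32 as a stand-in hash (an assumption for illustration, Spark uses its own hash function), and NUM_BUCKETS and bucket_for are hypothetical names.

```python
import zlib

NUM_BUCKETS = 8  # hypothetical; both tables would need the same bucket count

def bucket_for(record_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Assign a bucket from the id value alone (stand-in hash, not Spark's)."""
    return zlib.crc32(record_id.encode("utf-8")) % num_buckets

# Week 1: "ABC123" exists only in table 1.
table1_bucket = bucket_for("ABC123")

# Week 2: "ABC123" shows up in table 2 and is bucketed independently.
table2_bucket = bucket_for("ABC123")

print(table1_bucket == table2_bucket)
```

Under this model the two tables line up bucket by bucket as long as they are bucketed on the same column, with the same type and the same number of buckets; ids missing from one side simply leave nothing to match in that bucket.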

Pivot table showing too many rows with DAX measure

Consider the following two tables
Major
Major Department | Department Name | Department Number
Produce | Produce | 2
Produce | Taxable Produce | 3
Frozen | Frozen | 5
Grocery | Grocery | 1
Grocery | Taxable Grocery | 10
and Sales
UPC | Department | Category | SubCategory | Sales
1125 | 2 | Fruit | Oranges | 20
8256 | 5 | Frozen Treats | Fruit Bars | 15
9230 | 1 | Snacks | Chips | 28
4018 | 2 | Fruit | Bananas | 10
925 | 2 | Vegetables | Onions | 9
A relationship is created between the Department and Department Number columns.
I create the following pivot table:
What I want to do is add a measure which shows the total for the Major Department on each row.
I have tried MajorDepartmentSales:=CALCULATE(SUM(Sales[Sales]),ALL(Sales[SubCategory]),ALL(Sales[Category]))
which should remove the filters on category and subcategory. I would expect this to work, however it adds every category under the major department, but with the correct values.
Note, that the value of this measure is correct. It shows the totals under that particular major department. The problem is that it shows every category and subcategory under each major department whether they belong there or not. Why is this?
I have found two ways around this. The first modifies the measure to IF(COUNT(Sales[UPC])>0,CALCULATE(SUM(Sales[Sales]),ALL(Sales[SubCategory]),ALL(Sales[Category])),BLANK()) which checks if there are any items under that particular row, and blanks it out otherwise. The second method is to pull the major department to the Sales table with a calculated column MajorDepartmentOnSales:=Related(Major[Major Department]) and then using this column in the pivot table instead of the major department from the Major table.
Both produce what I want. The IF method seems a bit sketchy to me, however.
Question
My question is then why do I get these extra rows in the original approach?
It seems that DAX is correctly recognizing which major department is in play as it gets the value correct, but it is not recognizing that when it comes to filtering it out of the pivot table. I am really new to DAX, and it seems that I am not understanding something either in how the relationship is propagated down or how power pivot interacts with the pivot table.
How do I solve this? Is there a way to rewrite the measure to not cause these extra rows, or do I have to use one of these alternative methods? Ideally, I don't want to change the model. (The actual data in the real report has more tables and more (slightly different) columns than this example, but the example recreates the essential issue.)
Internally, the engine produces a cross join for this combination of columns and measure:
SELECT {[Measures].[Sum of Sales],[Measures].[MajorDepartmentSales]}
DIMENSION PROPERTIES PARENT_UNIQUE_NAME,MEMBER_VALUE,HIERARCHY_UNIQUE_NAME ON COLUMNS,
NON EMPTY Hierarchize(DrilldownMember(DrilldownMember(CrossJoin(
{[Table1].[Major Department].[All],[Table1].[Major Department].[Major Department].AllMembers},
{([Table2].[Category].[All],[Table2].[SubCategory].[All])}),
[Table1].[Major Department].[Major Department].AllMembers, [Table2].[Category]),
[Table2].[Category].[Category].AllMembers, [Table2].[SubCategory]))
DIMENSION PROPERTIES PARENT_UNIQUE_NAME,MEMBER_VALUE,HIERARCHY_UNIQUE_NAME ON ROWS
FROM [Model]
CELL PROPERTIES VALUE, FORMAT_STRING, LANGUAGE, BACK_COLOR, FORE_COLOR, FONT_FLAGS
If you add Sales[Department] to the rows, you get the correct results (what you are expecting). The engine still produces a cross join (because of ALL() in the measure), but Department is also added (outside the cross join scope) and the correct relationship is used:
SELECT {[Measures].[Sum of Sales],[Measures].[MajorDepartmentSales]}
DIMENSION PROPERTIES PARENT_UNIQUE_NAME,MEMBER_VALUE,HIERARCHY_UNIQUE_NAME ON COLUMNS,
NON EMPTY Hierarchize(DrilldownMember(DrilldownMember(DrilldownMember(CrossJoin(
{[Table1].[Major Department].[All],[Table1].[Major Department].[Major Department].AllMembers},
{([Table2].[Department].[All],[Table2].[Category].[All],[Table2].[SubCategory].[All])}),
[Table1].[Major Department].[Major Department].AllMembers, [Table2].[Department]),
[Table2].[Department].[Department].AllMembers, [Table2].[Category]),
[Table2].[Category].[Category].AllMembers, [Table2].[SubCategory]))
DIMENSION PROPERTIES PARENT_UNIQUE_NAME,MEMBER_VALUE,HIERARCHY_UNIQUE_NAME ON ROWS
FROM [Model]
CELL PROPERTIES VALUE, FORMAT_STRING, LANGUAGE, BACK_COLOR, FORE_COLOR, FONT_FLAGS
You can use a workaround like this:
=IF(ISBLANK(SUM(Table2[Sales])); BLANK(); CALCULATE(SUM(Table2[Sales]); ALL(Table2[SubCategory]; Table2[Category])) )
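To see why the extra rows appear, it can help to model the pivot as "every Major Department x Category combination, minus the rows where all measures are blank". The Python sketch below is a simplified model on the sample data from the question, not how the engine actually evaluates DAX; sum_sales is a made-up helper and None plays the role of BLANK().

```python
from itertools import product

# Department Number -> Major Department (from the Major table)
major_by_dept = {2: "Produce", 3: "Produce", 5: "Frozen", 1: "Grocery", 10: "Grocery"}

# (UPC, Department, Category, SubCategory, Sales)
sales = [
    (1125, 2, "Fruit", "Oranges", 20),
    (8256, 5, "Frozen Treats", "Fruit Bars", 15),
    (9230, 1, "Snacks", "Chips", 28),
    (4018, 2, "Fruit", "Bananas", 10),
    (925, 2, "Vegetables", "Onions", 9),
]

def sum_sales(major, category=None):
    """Sum Sales for a major department; category=None mimics ALL() on Category."""
    values = [amount for _, dept, cat, _, amount in sales
              if major_by_dept[dept] == major and (category is None or cat == category)]
    return sum(values) if values else None

majors = sorted(set(major_by_dept.values()))
categories = sorted({cat for _, _, cat, _, _ in sales})

unguarded, guarded = {}, {}
for major, category in product(majors, categories):
    total = sum_sales(major)  # category filter removed, so never blank for this major
    if total is not None:
        unguarded[(major, category)] = total
    if sum_sales(major, category) is not None:  # the IF(ISBLANK(...)) guard
        guarded[(major, category)] = total

print(len(unguarded), len(guarded))
```

Because the unguarded measure ignores the category, it is non-blank for every category under every major department, so no row gets dropped. The guard restores the blanks for combinations that have no sales of their own, which is what lets the pivot hide them.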

Creating "Categories" to show on a PivotTable

I have a student database, and I'm trying to show different metrics based on a student's score range in a PivotTable. Specifically (this is a simplified example, so don't worry about the content) I want to show this in my pivot:
StudentGPACat | Avg Post-Grad Salary
3-3.2 | 64,323
3.2-3.4 | 71,225
3.4-3.6 | etc
3.6-3.8 | etc
3.8-4.0 | etc
So I want the rows in my pivot table to show the range the student's average score falls in.
In order to generate that metric, right now, I did 2 things:
(1) Added a new column in my master table in PowerPivot called [avgGrade] that shows the value of the [TableAvgGrade] calculated field from the "Grades" table for each student (i.e., each row in the master table)
=CALCULATE([TableAvgGrade],
FILTER(Grades,Grades[studentID]=Master[studentID]))
(2) Created a new column [StudentGPACat] in PowerPivot and the formula goes:
=If([avgGrade]<3,"3",
If([avgGrade]<3.2,"3-3.2",
If([avgGrade]<3.4,"3.2-3.4",
If([avgGrade]<3.6,"3.4-3.6",
If([avgGrade]<3.8,"3.6-3.8","3.8-4.0")))))
This feels bulky and computationally expensive. Is there an easier way to create these ranges to use as rows in my PivotTable?
EDIT: made some edits to clarify my question
EDIT2: type
What you've done is the appropriate pattern for creating this sort of column. If you're concerned about the gnarly nested IF()s, you can replace with a SWITCH(), which is just syntactic sugar for nested IF()s, but what you've posted is all you need.
In a regular PivotTable (I don't know about PowerPivot), if you use a numeric value as a Row Label, you can right-click the field, choose Group, define the Starting at value, the Ending at value and the By step, and you will get an equivalent result quite easily.
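For comparison, the same banding logic written as a sorted-threshold lookup instead of nested conditions. This is a Python sketch of the idea only, not DAX; THRESHOLDS, LABELS and gpa_category are illustrative names, and the labels are copied from the formula in the question.

```python
from bisect import bisect_right

# Upper bounds from the nested IF chain, in ascending order.
THRESHOLDS = [3.0, 3.2, 3.4, 3.6, 3.8]
# One label per band: below 3.0, then each 0.2-wide range, then 3.8 and above.
LABELS = ["3", "3-3.2", "3.2-3.4", "3.4-3.6", "3.6-3.8", "3.8-4.0"]

def gpa_category(avg_grade: float) -> str:
    """Return the band label; a grade equal to a threshold falls in the upper band."""
    return LABELS[bisect_right(THRESHOLDS, avg_grade)]

for grade in (2.5, 3.0, 3.5, 4.0):
    print(grade, gpa_category(grade))
```

bisect_right counts how many thresholds are less than or equal to the grade, which matches the strict "<" comparisons in the original formula. Adding a band means adding one threshold and one label rather than another level of nesting.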
