I am trying to filter data in an Azure Data Flow, but I do not know how to do this.
What I want to do is extract, among records with duplicate IDs, only the record with the largest value in the "seq_no" column.
I just don't know which function to use to achieve this.
Any answer would be appreciated.
Sorry for my bad English, I am Japanese.
Thanks for reading.
You can use an Aggregate transformation, group by id, and take max(seq_no). I repro'd the same; below are the steps.
Sample data is taken as input.
id      seq_no    mark
1000    1         10
1001    1         10
1001    2         20
1002    1         30
1002    2         20
1002    3         10
img:1 Source Transformation data preview
Then an Aggregate transformation is added. In the Aggregate settings,
id is given as the group-by column, and the aggregate expression for the seq_no column is max(seq_no).
img:2 Data preview of the Aggregate transform output.
To get the other column data corresponding to the maximum seq_no value, a Join transformation is used.
Left stream: aggregate1
Right stream: source1
Join type: Inner
Join conditions: id == id (aggregate1's id matched with source1's id)
seq_no == seq_no (aggregate1's max(seq_no) result matched with source1's seq_no)
img:3 Join Transformation settings
img:4 Join transformation data preview
Finally, a Select transformation is used to remove the extra columns.
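For reference, the dedup logic this data flow implements corresponds roughly to the following SQL; the table name sample is just a placeholder for the source data shown above.

-- Keep, per id, only the row with the largest seq_no:
-- aggregate (GROUP BY id, MAX(seq_no)), then join back to the source.
SELECT s.id, s.seq_no, s.mark
FROM sample s
JOIN (
    SELECT id, MAX(seq_no) AS max_seq_no
    FROM sample
    GROUP BY id
) agg
  ON s.id = agg.id
 AND s.seq_no = agg.max_seq_no;

Note that if two rows share the same maximum seq_no for an id, this join keeps both of them.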
Related
How to convert this CSV file input to the given output in ADLS using ADF?
Input data:
order_id,city,country
L10,Sydney,Australia
L11,Annecy,France
L12,Montceau,France
L13,Paris,France
L14,Montceau,Canada
L15,Ste-Hyacinthe,Canada
Output data:
COUNTRY,CITY,TOTAL_Order
Australia,Sydney,1
Australia,Total,1
Canada,Montceau,1
Canada,Ste-Hyacinthe,1
Canada,Total,2
France,Annecy,1
France,Montceau,1
France,Paris,1
France,Total,3
Total,Total,6
I want to find the count of order IDs city-wise and country-wise using a Data Flow. This is similar to a roll-up aggregation.
Take three Aggregate transformations in the data flow to do this. The first calculates the count of order_id for every country and city combination, the second calculates the count of order_id for every country, and the third calculates the count of order_id for the full table. Below are the detailed steps.
Same input data is taken as source.
img:1 source data preview
Create two additional branches by clicking the + symbol next to the Source transformation and selecting New branch.
In each branch, add an Aggregate transformation.
Aggregate transformation1 settings:
group by : country, city
aggregates: total_order=count(order_id)
img:2 aggregate transform1 data preview
Aggregate transform2 settings:
group by: country
aggregates: total_order=count(order_id)
img:3 aggregate transform 2 data preview.
Aggregate transform3 settings: no column is given in group by.
group by:
aggregates: total_order=count(order_id)
img:4 aggregate transform3 data preview.
The next step is to union all these outputs. Since they do not have the same structure, add a Derived Column transformation to aggregate2 and aggregate3 and create the missing columns with an empty string value.
Then combine the data from aggregate1, derived1, and derived2 using a Union transformation.
img:5 Data preview after all transformations.
img: 6 Complete dataflow with all transformations.
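For reference, the roll-up produced by these three aggregate branches plus the union corresponds roughly to this SQL. The table name orders is a placeholder, and 'Total' is used here as the label to match the expected output (the data flow above uses empty strings in the derived columns instead).

-- Per country+city counts, per-country subtotals, and a grand total, unioned together.
SELECT country, city, COUNT(order_id) AS total_order
FROM orders
GROUP BY country, city
UNION ALL
SELECT country, 'Total' AS city, COUNT(order_id) AS total_order
FROM orders
GROUP BY country
UNION ALL
SELECT 'Total' AS country, 'Total' AS city, COUNT(order_id) AS total_order
FROM orders;

Where the SQL engine supports it, GROUP BY ROLLUP(country, city) would produce the same combinations in a single pass, with NULLs in place of the 'Total' labels.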
I'm currently trying to optimize a query over two rather large tables, which look like this:
Table 1: id column - alphanumerical, about 300 million unique ids, more than 1 billion rows overall
Table 2: id column - identical semantics, about 200 million unique ids, more than 1 billion rows overall
Let's say on a given day, 17.03., I want to join those two tables on id.
Table 1 is the left table, table 2 is the right one, and I get about 90% matches, meaning table 2 contains about 90% of the ids present in table 1.
One week later, table 1 has not changed (it could, but to keep the explanation simple, assume it didn't), while table 2 was updated and now contains more records. I do the join again, and some of the formerly missing ids have now appeared, so I get about 95% matches.
In general, table1.id has some matches with table2.id at a given time, and this can change from day to day.
I now want to optimize this join and came across the bucketing feature. Is this possible?
Example:
1st join: id "ABC123" is present in table1 but not in table2. ABC123 gets sorted into a certain bucket, e.g. "1".
2nd join (a week later): id "ABC123" has now appeared in table2; how can it be ensured that it ends up in the bucket of table 2 that is then co-located with the corresponding bucket of table 1?
Or do I have a general misunderstanding of how this works?
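For what it's worth, this is roughly how bucketing is declared in Hive DDL (the table and column names are taken from the question, the bucket count is an arbitrary example, and the other columns are omitted). Because the bucket for a row is derived from a hash of id modulo the number of buckets, a given id always lands in the same bucket number in both tables no matter when the row shows up, as long as both tables use the same bucketing column and bucket count and are written by the same engine.

-- Both tables bucketed on id with the same number of buckets,
-- so matching ids end up in matching bucket numbers on both sides.
CREATE TABLE table1 (id STRING)          -- other columns omitted
CLUSTERED BY (id) INTO 512 BUCKETS;

CREATE TABLE table2 (id STRING)          -- other columns omitted
CLUSTERED BY (id) INTO 512 BUCKETS;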
There is a Hive table with ~500,000 rows.
It has a single column which holds a JSON string.
JSON stores the measurements from 15 devices organized like this:
company_id=…
device_1:
    array of measurements; every single measurement has 2 attributes:
        value=
        date=
device_2:
    …
device_3:
    …
…
device_15:
    …
There are 15 devices in the JSON, and every device has a nested array of measurements inside. The size of the measurements array is not fixed.
The goal is to get from the measurements only the one with max(date) per device.
The output of SELECT should have the following columns:
company_id
device_1_value
device_1_date
...
device_15_value
device_15_date
I tried to use the LATERAL VIEW to explode the measurements array:
SELECT get_json_object(json_string, '$.company_id'),
       d1.value, d1.date, ... d15.value, d15.date
FROM T
LATERAL VIEW explode(device_1.measurements) t1 AS d1
LATERAL VIEW explode(device_2.measurements) t2 AS d2
…
LATERAL VIEW explode(device_15.measurements) t15 AS d15
I can use the result of this SQL as an input for another SQL which will extract the records with max(date) per device.
My approach does not scale well: with 15 devices and 2 measurements per device, a single row in the input table will generate 2^15 = 32,768 rows using my SQL above.
There are 500,000 rows in the input table.
You are actually in a great position to make a cheaper table/join. Bundling (your JSON string) is an optimization trick used to take horribly ugly joins/tables and optimize them.
The downside is that you should likely be using a Hive user-defined function or a Spark function to pare down the data. SQL is amazing, but it likely isn't the right tool for this job. You likely want to use a programming language to help ingest this data into a format that works for SQL.
To avoid the cartesian product generated by multiple lateral views, I split the original SQL into 15 independent SQLs (one per device), where each single SQL has just one lateral view.
Then I join all 15 SQLs.
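A rough sketch of that shape in Hive SQL, assuming each single-lateral-view SQL is materialized as an intermediate per-device result with columns (company_id, value, `date`) -- the names dev1_measurements and dev2_measurements below are hypothetical:

-- Pick the latest measurement per company for each device, then join the devices on company_id.
WITH dev1_latest AS (
  SELECT company_id,
         value  AS device_1_value,
         `date` AS device_1_date,
         ROW_NUMBER() OVER (PARTITION BY company_id ORDER BY `date` DESC) AS rn
  FROM dev1_measurements
),
dev2_latest AS (
  SELECT company_id,
         value  AS device_2_value,
         `date` AS device_2_date,
         ROW_NUMBER() OVER (PARTITION BY company_id ORDER BY `date` DESC) AS rn
  FROM dev2_measurements
)
SELECT d1.company_id,
       d1.device_1_value, d1.device_1_date,
       d2.device_2_value, d2.device_2_date
       -- ... and so on for device_3 through device_15
FROM dev1_latest d1
JOIN dev2_latest d2
  ON d1.company_id = d2.company_id
WHERE d1.rn = 1
  AND d2.rn = 1;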
I have received an export from a database which contains a huge amount of duplicated records.
There are approx 8000 records with over 100 columns. Issues with data relating to the unique ID being spread across about 5 columns are causing duplications. I expect about 1500 actual unique records.
I have attached a simplified version of what I have and what I'm trying to achieve.
I feel like there could be a solution along the lines of: merge the rows; if the data is the same, OK, otherwise take the non-nulls. Is there something that could be done in Power Query?
Thanks!
Helen
img: simplified version of the current data and the desired merged result
Apply a simple Group By (right-click on the id column and select Group By) in Power Query as shown below.
Here is the final output:
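As a point of comparison, the "merge the rows and take the non-nulls" logic corresponds roughly to this SQL, since MAX skips nulls and therefore returns the single non-null value per id for each column. The table name export and the column names are hypothetical; the Max aggregation per column in Power Query's Group By behaves similarly.

-- One row per id; for each column, the non-null value wins (MAX ignores nulls).
SELECT id,
       MAX(first_name) AS first_name,
       MAX(last_name)  AS last_name,
       MAX(email)      AS email
FROM export
GROUP BY id;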
I have two tables
Table 1:
name sex age
snr m 22
kkk f 23
djj m 33
kkk f 66
Table 2:
address country
hyd india
Ny US
london Uk
Neither table has a common key. How can I get a single table by arranging the above two tables side by side, like below?
Expected output:
name sex age address country
snr m 22 hyd india
kkk f 23 Ny US
djj m 33 london Uk
kkk f 66
Thanks in advance..
I don't know how your join can be very reliable, especially if your table lengths don't match up.
That said, it's definitely possible. Before you begin, add both tables to the analysis using whatever method works for you.
Step 1: Create a common key
In order to join tables you'll need some kind of common key. We can create one on the fly using the RowId() function, which gives the number (id) of the row.
From the Insert menu, choose Transformations...
Select Calculate new column and click Add..
Give the expression RowId() and name the column something like RowId
Repeat these steps for each table in the analysis.
Note that you need to do this via a column transformation. Transformations are calculated when a table is added to or refreshed in the analysis, whereas calculated columns are evaluated as needed (basically). Any join in Spotfire requires the more "static" nature of transformation columns; you will not be able to join on calculated columns.
Step 2: Join the tables
Here we do the actual join.
From the Insert menu, choose Columns...
Make sure your left table ('Table 1' above) is selected
Select your right table ('Table 2') by clicking Select ▼ and choosing it from the From Current Analysis list
Click Next >
Select our RowId column on both sides and click Match Selected, then click Next >
Select whichever columns you want to add
Choose Full Outer Join as the join method
Finally, click Finish
Your result matches your expected output.
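For reference, what this does is essentially a row-number join, roughly like the following SQL. SQL tables have no inherent row order, so the hypothetical load_order column stands in for what RowId() gives you in Spotfire.

-- Number the rows of each table, then stitch them together side by side.
WITH t1 AS (
  SELECT name, sex, age,
         ROW_NUMBER() OVER (ORDER BY load_order) AS row_id
  FROM table1
),
t2 AS (
  SELECT address, country,
         ROW_NUMBER() OVER (ORDER BY load_order) AS row_id
  FROM table2
)
SELECT t1.name, t1.sex, t1.age, t2.address, t2.country
FROM t1
FULL OUTER JOIN t2
  ON t1.row_id = t2.row_id;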
If you have gaps in your data (empty rows in either the left or right table), your data will almost certainly be misaligned, as I believe Spotfire will completely ignore any blank rows. I don't think it's really recommended to join like this without a common key, so if you have trouble with mismatches, you may want to reevaluate your data situation.