How to perform left anti and right anti joins in Data Flow? - azure

In Data Flow I can only see Left, Right, Inner, Full and Cross joins.
I can't see a left anti or right anti join. How can I perform those joins in Azure Data Factory, as I would in SQL?

As mentioned in the documentation, the Join transformation in Data Flow has built-in Full outer, Inner, Left outer, Right outer and Cross joins, as shown in the image below.
Unlike SQL, it has no built-in left anti join or right anti join.
For anti joins in Data Flow, you can work with the join conditions, as shown in the image below.
In the join conditions, select a column from source1 on the left and a column from source2 on the right, and choose the comparison operator between them: ==, !=, <, >, <=, >=, === or <=>. Any of these operators can be used with a left join, depending on your requirement.
I created the data flow shown below with source1 as employee data and source2 as depart, and combined the two sources using a left join.
After choosing the left join, I set column1 from employee data and column2 from depart in the join conditions, with the === operator between them.
Below is the output of the left join with that join condition.
Source1 = employee data input:
Source2 = depart input:
Alternative method:
As of now this is not possible with the Join transformation, but you can get the same result with the Exists transformation.
Source1 in dataflow
Source2 data in dataflow
Next, both sources are combined with the Exists transformation. The image below shows the exist type and the exists conditions; the exist type is set to Doesn't exist.
After validation, you can see the required left anti join output in Data preview, as shown below.
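The logic of a left anti join can also be sketched outside ADF. The Python below is only an illustration, not ADF code: the sample rows and the column names (`emp_id`, `name`, `dept_id`, `dept_name`) are made up, and the "keep only unmatched rows" step after the left outer join is an assumption about how the join route would be completed. The second function mirrors the Exists transformation with the type set to Doesn't exist.

```python
# Hypothetical sample rows; names and values are illustrative only.
employees = [
    {"emp_id": 1, "name": "Asha", "dept_id": 10},
    {"emp_id": 2, "name": "Ben", "dept_id": 20},
    {"emp_id": 3, "name": "Chen", "dept_id": 99},  # no matching department
]
departments = [
    {"dept_id": 10, "dept_name": "Sales"},
    {"dept_id": 20, "dept_name": "HR"},
]


def left_anti_via_outer_join(left, right, key):
    """Left outer join, then keep only left rows that found no partner."""
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(row[key], []).append(row)

    joined = []
    for l_row in left:
        # A left row without a match is paired with None (the "null" side).
        for r_row in right_by_key.get(l_row[key]) or [None]:
            joined.append((l_row, r_row))

    return [l_row for l_row, r_row in joined if r_row is None]


def left_anti_via_not_exists(left, right, key):
    """Keep left rows whose key does not exist on the right side."""
    right_keys = {row[key] for row in right}
    return [row for row in left if row[key] not in right_keys]


unmatched = left_anti_via_not_exists(employees, departments, "dept_id")
```

A right anti join is the same operation with the two inputs swapped, for example `left_anti_via_not_exists(departments, employees, "dept_id")`.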

Related

Count of orders - ADF- aggregation

How to convert this CSV file input to the given output in ADLS using ADF?
Input data:
order_id,city,country
L10,Sydney,Australia
L11,Annecy,France
L12,Montceau,France
L13,Paris,France
L14,Montceau,Canada
L15,Ste-Hyacinthe,Canada
Output data:
COUNTRY,CITY,TOTAL_Order
Australia,Sydney,1
Australia,Total,1
Canada,Montréal,1
Canada,Ste-Hyacinthe,1
Canada,Total,2
France,Annecy,1
France,Montceau,1
France,Paris,1
France,Total,3
Total,Total,6
I want to find the count of order IDs per city and per country using Data Flow. This is similar to a roll-up aggregation.
Use three Aggregate transformations in the data flow. The first calculates the count of order_id for every country and city combination, the second calculates the count of order_id for every country, and the third calculates the count of order_id for the whole table. The detailed steps are below.
The same input data is used as the source.
img:1 source data preview
Create two additional branches by clicking the + symbol next to the Source transformation and selecting New branch.
Add an Aggregate transformation to each branch.
Aggregate transformation1 settings:
group by : country, city
aggregates: total_order=count(order_id)
img:2 aggregate transform1 data preview
Aggregate transform2 settings:
group by: country
aggregates: total_order=count(order_id)
img:3 aggregate transform 2 data preview.
Aggregate transform3 settings: no column in group by.
group by:
aggregates: total_order=count(order_id)
img:4 aggregate transform3 data preview.
The next step is to union all these tables. Since they do not share the same structure, add a Derived Column transformation after aggregate2 and aggregate3 and create the missing columns with an empty string.
Combine the aggregate1, derived1 and derived2 outputs using a Union transformation.
img:5 Data preview after all transformations.
img: 6 Complete dataflow with all transformations.
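The three-aggregates-plus-union approach can be sketched in plain Python to check the logic against the sample input. This is only an illustration, not ADF code. Two assumptions are made: the placeholder for the missing columns is "Total" (to match the requested output) rather than the empty string used in the derived columns above, and the rows are sorted for display, which a Union transformation does not do on its own.

```python
import csv
import io
from collections import Counter

# Input rows exactly as listed in the question.
INPUT_CSV = """order_id,city,country
L10,Sydney,Australia
L11,Annecy,France
L12,Montceau,France
L13,Paris,France
L14,Montceau,Canada
L15,Ste-Hyacinthe,Canada
"""


def rollup(csv_text, placeholder="Total"):
    rows = list(csv.DictReader(io.StringIO(csv_text)))

    # Aggregate 1: group by country, city.
    by_country_city = Counter((r["country"], r["city"]) for r in rows)
    # Aggregate 2: group by country.
    by_country = Counter(r["country"] for r in rows)
    # Aggregate 3: no group by, one count for the whole table.
    grand_total = len(rows)

    # "Union": bring all three results to (country, city, total_order).
    result = []
    for country in sorted(by_country):
        for (c, city), n in sorted(by_country_city.items()):
            if c == country:
                result.append((country, city, n))
        result.append((country, placeholder, by_country[country]))
    result.append((placeholder, placeholder, grand_total))
    return result


for country, city, total in rollup(INPUT_CSV):
    print(f"{country},{city},{total}")
```

The city names in the result follow the input rows as listed above.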

What condition should I supply for a custom cross join in an Azure Data Factory dataflow?

In a dataflow, I have two datasets with one column each. Let's say dataset a with column a and dataset b with column b.
I want to cross join them, but when I select the custom cross join option it asks me to specify a condition. I don't understand what I need to supply here, I just want all the records from column a to be cross joined with all the records from column b. What should I put? I tried checking the official Microsoft documentation but there were no examples there.
The cross join in the Join transformation of an Azure Data Factory data flow requires a condition on which the join is applied. I did the following to demonstrate how to do a cross join on the example you gave.
I have two datasets (one column each). Dataset A has one column a with the following values.
Dataset B has column b with the following values.
I used a Join transformation to join both sources. The Join transformation prompts you to specify a cross join condition. If you have no condition and just want to cross join all the data from both datasets, as in this case, set the cross join condition to true().
This cross joins all the records of column a with all the records of column b.
This is how you can achieve your requirement. If you do have a condition, pass it instead of true() and the cross join is applied based on it. Refer to the official Microsoft documentation to understand more about joins.
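To see why an always-true condition gives the full cross product, here is a small Python sketch. It is an analogy, not ADF code: the `cross_join` helper and the sample values are hypothetical, and the default predicate plays the role of true().

```python
from itertools import product


def cross_join(left, right, condition=lambda a, b: True):
    """Pair every left value with every right value that passes `condition`.

    The default condition always returns True, which corresponds to
    supplying true() as the custom cross join condition.
    """
    return [(a, b) for a, b in product(left, right) if condition(a, b)]


column_a = [1, 2, 3]    # hypothetical values for dataset A
column_b = ["x", "y"]   # hypothetical values for dataset B

# No real condition: all 3 * 2 combinations are returned.
all_pairs = cross_join(column_a, column_b)

# A real condition restricts the pairs that are kept.
filtered_pairs = cross_join(column_a, column_b, condition=lambda a, b: a > 1)
```

With the default predicate every combination survives, so the result size is the product of the two input sizes.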

Azure Data Factory: How to implement a nested SQL query in a data flow transformation

I have two streams, customer and customercontact. I am new to Azure Data Factory. I want to know which data flow transformation will produce the result of the SQL query below.
(SELECT *
FROM customercontact
WHERE customerid IN
(SELECT customerid
FROM customer)
ORDER BY timestamp DESC
LIMIT 1)
I can use the Exists transformation for the inner query, but I need some help with fetching the first row after sorting the customer contact data. So, basically, I am looking for a way to add a LIMIT/TOP/OFFSET clause in a data flow.
You can reproduce this query in a data flow by combining several transformations.
For sorting, use the Sort transformation, where you can select ascending or descending order.
For the top few records, use the Rank transformation.
For the “IN” clause, use the Exists transformation.
Refer - https://learn.microsoft.com/en-us/azure/data-factory/data-flow-rank
Here is my sample data in SQL as the source.
I used the Rank transformation.
After the Rank transformation, one more column, RankColumn, is added.
To select only the top 1 record, I used the Filter row modifier with the expression equals(RankColumn,1).
Finally, add a Sink transformation and run the pipeline.
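The chain of steps (Exists for the IN clause, sort, rank, filter on rank 1) can be sketched in Python to check that it matches the SQL query. This is an illustration only: the sample rows and the `phone` column are made up, the timestamps are assumed to be ISO-formatted strings so that they sort correctly as text, and the rank here is a simple row position, so ties are not handled the way a real Rank transformation might handle them.

```python
# Hypothetical sample data; only customerid and timestamp come from the query.
customers = [{"customerid": 1}, {"customerid": 2}]
customer_contacts = [
    {"customerid": 1, "timestamp": "2023-01-05T10:00:00", "phone": "111"},
    {"customerid": 2, "timestamp": "2023-03-01T09:30:00", "phone": "222"},
    {"customerid": 9, "timestamp": "2023-06-01T00:00:00", "phone": "999"},
]


def latest_contact(contacts, customers):
    # Exists step: WHERE customerid IN (SELECT customerid FROM customer).
    known_ids = {c["customerid"] for c in customers}
    existing = [row for row in contacts if row["customerid"] in known_ids]

    # Sort step: ORDER BY timestamp DESC.
    ordered = sorted(existing, key=lambda row: row["timestamp"], reverse=True)

    # Rank step: add a RankColumn starting at 1.
    ranked = [dict(row, RankColumn=i) for i, row in enumerate(ordered, start=1)]

    # Filter step: keep RankColumn == 1, i.e. LIMIT 1.
    return [row for row in ranked if row["RankColumn"] == 1]


result = latest_contact(customer_contacts, customers)
```

The contact for customerid 9 has the newest timestamp but is dropped by the Exists step, so the newest remaining contact is the one returned.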

What is the difference between the join activity and lookup in Azure Data Factory?

The documentation mentions that the lookup activity in Azure data flow is similar to the join activity with the join type set to left outer join. So I was wondering whether the two can be used interchangeably or whether there are some differences between them.
It depends on how you want to deal with your data.
The Join transformation combines data from two sources or streams in a mapping data flow. Lookup can do this too, and it can also have lookup conditions to filter the input stream data.
In most scenarios, lookup and join can be used interchangeably.
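One place where the two can differ is when a row has several matches. The Python sketch below is an illustration under an assumption that is not stated in the answer above: that the lookup is configured to return a single match per input row (here, the first one), while a left outer join returns one output row per match. The helper names, the sample rows and the `matched` field are hypothetical.

```python
def left_outer_join(left, right, key):
    """Keep every left row; each matching right row yields one output row."""
    out = []
    for l_row in left:
        matches = [r for r in right if r[key] == l_row[key]]
        if not matches:
            out.append({**l_row, "matched": None})
        for r in matches:
            out.append({**l_row, "matched": r})
    return out


def lookup_first_match(left, right, key):
    """Keep every left row exactly once, attaching only the first match."""
    first_by_key = {}
    for r in right:
        first_by_key.setdefault(r[key], r)  # remember the first row per key
    return [{**l_row, "matched": first_by_key.get(l_row[key])} for l_row in left]


# Hypothetical data: customer "a" appears twice on the lookup side.
orders = [{"order": 1, "cust": "a"}, {"order": 2, "cust": "z"}]
customers = [{"cust": "a", "city": "Oslo"}, {"cust": "a", "city": "Rome"}]

joined = left_outer_join(orders, customers, "cust")
looked_up = lookup_first_match(orders, customers, "cust")
```

When every key has at most one match, both functions return the same rows, which fits the point that the two are interchangeable in most scenarios.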

How to understand the query plan of Spark

What does "build right" mean in the query plan text below?
BroadcastHashJoin [i_item_sk#2], [ss_item_sk#25], Inner, BuildLeft
Does that mean the right table is the table that gets broadcast?
Also, could I confirm from the query plan text that the table containing the column ss_item_sk is the right table?
Thanks.
buildSide is the side that is going to be broadcast. In your case (BuildLeft) the left relation is broadcast.
Not every side can be broadcast for every join type:
inner join - either side can be broadcast
full outer join - BHJ is not supported
right outer join - only the left side can be broadcast
left outer, left semi, left anti - only the right side can be broadcast
Also, could I confirm from the query plan text that the table containing the column ss_item_sk is the right table?
Yes
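To make the role of the build side concrete, here is a toy, single-process Python sketch of an inner hash join. It is not Spark's implementation and does no actual broadcasting; it only shows that the build side is the one loaded into a hash table while the other side is streamed past it. The function name, the `build_side` parameter and the sample rows are hypothetical; only the key names `i_item_sk` and `ss_item_sk` come from the plan above.

```python
from collections import defaultdict


def broadcast_hash_join(left, right, left_key, right_key, build_side="left"):
    """Inner equi-join: hash the build side, stream the other side past it."""
    if build_side == "left":
        build, build_key, stream, stream_key = left, left_key, right, right_key
    elif build_side == "right":
        build, build_key, stream, stream_key = right, right_key, left, left_key
    else:
        raise ValueError("build_side must be 'left' or 'right'")

    # Build phase: in Spark this is the relation that would be broadcast.
    hash_table = defaultdict(list)
    for row in build:
        hash_table[row[build_key]].append(row)

    # Probe phase: each streamed row looks up its partners in the hash table.
    joined = []
    for s_row in stream:
        for b_row in hash_table.get(s_row[stream_key], []):
            l_row, r_row = (b_row, s_row) if build_side == "left" else (s_row, b_row)
            joined.append({**l_row, **r_row})
    return joined


# Hypothetical rows: left keys use i_item_sk, right keys use ss_item_sk.
items = [{"i_item_sk": 1}, {"i_item_sk": 2}]
store_sales = [
    {"ss_item_sk": 1, "qty": 5},
    {"ss_item_sk": 1, "qty": 7},
    {"ss_item_sk": 3, "qty": 1},
]

build_left = broadcast_hash_join(items, store_sales, "i_item_sk", "ss_item_sk", "left")
build_right = broadcast_hash_join(items, store_sales, "i_item_sk", "ss_item_sk", "right")
```

For an inner join both choices return the same matches, which is why either side may be used as the build side there; only the side that is hashed changes.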
