What condition should I supply for a custom cross join in an Azure Data Factory dataflow? - azure

In a dataflow, I have two datasets with one column each. Let's say dataset a with column a and dataset b with column b.
I want to cross join them, but when I select the custom cross join option it asks me to specify a condition. I don't understand what I need to supply here, I just want all the records from column a to be cross joined with all the records from column b. What should I put? I tried checking the official Microsoft documentation but there were no examples there.

The cross join option in an Azure Data Factory dataflow join transformation requires a condition on which to apply the join. The following demonstrates a cross join on the example you have given.
I have two datasets (one column each). Dataset A has one column a with the following values.
Dataset B has column b with the following values.
I have used a join transformation to join both sources. The join transformation prompts you to specify a cross join condition. If you don't have any condition and simply want to cross join all the data from both datasets, set the cross join condition to true() (as you want to do in this case).
This would apply cross join on all the records of column a with all the records of column b.
This is how you can achieve your requirement. If you do have a condition, pass it instead of true() to apply the cross join based on it. Refer to this official Microsoft documentation to understand more about joins.
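Conceptually, a join condition of true() accepts every pair of rows, so the result is the full Cartesian product of the two columns. A minimal Python sketch of that semantics (the sample values are made up):

```python
from itertools import product

column_a = [1, 2, 3]          # rows of dataset A, column a
column_b = ["x", "y", "z"]    # rows of dataset B, column b

# A cross join condition of true() accepts every (a, b) pair,
# so the output is the Cartesian product of both columns.
cross_joined = [(a, b) for a, b in product(column_a, column_b)]

print(len(cross_joined))  # 3 rows x 3 rows = 9 pairs
```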

Related

How to perform Left Anti and Right Anti joins in Dataflow?

I can see that I can only do Left, Right, Inner, Full, and Cross joins in Dataflow.
I can't see a Left Anti or Right Anti join in Dataflow. So how do I perform those joins, like in SQL, in Azure Data Factory?
As mentioned in the documentation, you can see the built-in join types and perform Full outer, Inner, Left outer, Right outer, and Cross joins in Data flow, as shown in the image below.
But as per your requirement, Data flow has no built-in anti joins (left anti join, right anti join) like SQL does.
For anti joins in Data flow, we can set up join conditions as shown in the image below.
In the join conditions, from the left column select source1 and from the right column select source2; between them there is a comparison operator such as ==, !=, <, >, <=, >=, ===, or <=>. As per the requirement, we can use any of these operators in the left join.
I created a data flow as shown below, taking source1 as employee data and source2 as depart, and combined these sources using a left join.
After choosing the left join, in the join conditions I set column1 as employee data and column2 as depart, and for the comparison I used the === operator.
After performing the left join with that condition, below is the output I got.
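The === comparison above still keeps every left row, even when there is no match on the right. A rough Python sketch of that left outer join behavior, with made-up employee and department rows:

```python
employees = [("Alice", "d1"), ("Bob", "d2"), ("Carol", "d9")]
departments = {"d1": "HR", "d2": "Sales"}   # d9 has no matching row

# Left outer join: every employee row is kept; unmatched rows
# get None for the joined department column.
left_joined = [(name, dept, departments.get(dept))
               for name, dept in employees]
# Carol's row survives with None, which is what the left join produces
```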
Here is the Source1= employee data input:
Source2 = depart input:
Alternative method:
As of now this is not possible with joins alone, but you can achieve it by using Exists.
Source1 in dataflow
Source2 data in dataflow
Next, both sources are combined with an Exists transformation. In the image below you can find the exist type and exists conditions. In the Exists transformation, the exist type is set to "Doesn't exist".
After validation, you can see the required left anti join output in the Data preview, as shown below.
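The "Doesn't exist" setting keeps only the left rows that have no match on the right, which is exactly a left anti join. A small Python sketch of that semantics (the sample rows are made up):

```python
source1 = [("e1", "d1"), ("e2", "d2"), ("e3", "d3")]   # employee, dept
source2_depts = {"d1", "d2"}                            # depart source keys

# Exists with "Doesn't exist": keep a left row only when no
# right row satisfies the exists condition.
left_anti = [row for row in source1 if row[1] not in source2_depts]
# Only ("e3", "d3") survives, matching the left anti join output
```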

Merge two datasets without common column in Azure Data Factory

I have two datasets where I need to do a join/merge in Azure Data Factory, but without having a common identity column. This might be an oversight on my side, as it should be a very trivial task, but I cannot seem to do it via a join or a union.
One dataset only has a couple of rows with a "name" column, let's say rows A, B, C, whereas the other has thousands (1-N).
For each row in the large dataset I want A, B, C rows, so it effectively becomes:
1A
1B
1C
2A
2B
2C
...
Any help is appreciated,
Thank you.
You can use the Custom (cross) join type in the Join transformation to get the result in this case.
Follow the demonstration below:
Sample Large Dataset(Numbers) with numbers up to 15.
Small Dataset(Letters)
Now, use a Join with the large dataset as the left stream and the small dataset as the right stream, and use a custom join with the condition true().
In the Optimize tab of the Join, set Broadcast to Off to get the above ordering of the data.
You can see the merge of two datasets below.
If you want the above in a single column with values like 1A,1B,1C..., first use a derived column to concatenate the values, and then keep only that column using a select.
Derived Column
Now use select to select any column above.
Output
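The whole flow above (custom cross join with true(), then a derived column concatenating the two values) can be sketched in Python; the sample data is shortened:

```python
numbers = list(range(1, 4))   # large dataset (truncated to 3 rows here)
letters = ["A", "B", "C"]     # small dataset

# Custom cross join with condition true(), then a derived column
# that concatenates the number and the letter into a single value.
merged = [f"{n}{l}" for n in numbers for l in letters]

print(merged[:6])  # ['1A', '1B', '1C', '2A', '2B', '2C']
```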

What is the difference between the Join and Lookup transformations in Azure Data Factory?

In the documentation it is mentioned that the lookup transformation in Azure data flow is similar to a join with the join type set to left outer join. So I was wondering if both can be used interchangeably, or if there are some differences between them.
It depends on how you want to deal with your data.
The Join transformation combines data from two sources or streams in a mapping data flow. A Lookup can not only do this, it can also have lookup conditions to filter the input stream data.
In most scenarios, Lookup and Join can be used interchangeably.
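The documented equivalence is that a Lookup behaves like a left outer join: every row from the primary stream survives, with the looked-up columns attached when a match exists. An inner Join, by contrast, drops unmatched rows. A small Python sketch with made-up order and customer data:

```python
orders = [("o1", "c1"), ("o2", "c9")]   # order id, customer id
customers = {"c1": "Alice"}              # lookup stream: id -> name

# Lookup (like a left outer join): keeps every order row;
# the looked-up column is None when nothing matches.
lookup_result = [(o, c, customers.get(c)) for o, c in orders]

# Inner Join: drops orders with no matching customer row.
inner_result = [(o, c, customers[c]) for o, c in orders if c in customers]
```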

How to join Spark datasets A and B and mark records in A which were not joined?

I have two datasets A and B with TypeA and TypeB respectively. Then I join the datasets based on a column (let's call it "key") to get dataset C. After that, I need to discard events in dataset A which were joined with B and retain only those in A which could not be joined. How do I go about it?
What you are looking for is a left anti join. Check out this post for more details: Left Anti join in Spark?
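In Spark this is the left_anti join type: A.join(B, "key", "left_anti") returns the rows of A whose key has no match in B. A plain-Python sketch of that semantics (the dataset contents are made up; the actual Spark call is shown as a comment):

```python
# Equivalent Spark call (not executed here):
#   unjoined_a = dataset_a.join(dataset_b, "key", "left_anti")
dataset_a = [{"key": 1, "v": "a1"}, {"key": 2, "v": "a2"}, {"key": 3, "v": "a3"}]
dataset_b_keys = {1, 2}   # keys present in dataset B

# left_anti keeps the rows of A whose key never matched a row in B
unjoined_a = [row for row in dataset_a if row["key"] not in dataset_b_keys]
```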

How do we create a generic mapping dataflow in datafactory that will dynamically extract data from different tables with different schema?

I am trying to create an Azure Data Factory mapping dataflow that is generic for all tables. I am going to pass the table name, the primary column for join purposes, and the other columns to be used in groupBy and aggregate functions as parameters to the data flow.
parameters to df
I am unable to reference this parameter in groupBy.
Error: DF-AGG-003 - Groupby should reference atleast one column -
MapDrifted1 aggregate(
) ~> Aggregate1,[486 619]
Has anyone tried this scenario? Please help if you have some knowledge of this, or if it can be handled in a U-SQL script.
We need to first look up your parameter string name in your incoming source data to locate the metadata and assign it.
Just add a Derived Column before your Aggregate and it will work. Call the column 'groupbycol' in your Derived Column and use this formula: byName($group1).
In your Agg, select 'groupbycol' as your groupby column.
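The trick above is resolving a column by name at runtime: byName($group1) looks up the parameter's value in the incoming row, and the Aggregate then groups by the fixed 'groupbycol' name. A Python analogue of those two steps, with a hypothetical parameter group1 and made-up rows:

```python
rows = [
    {"dept": "HR", "salary": 10},
    {"dept": "HR", "salary": 20},
    {"dept": "IT", "salary": 30},
]
group1 = "dept"  # the dataflow parameter: which column to group by

# Derived Column step: materialize the dynamic column under a fixed
# name, like 'groupbycol' = byName($group1) in the dataflow expression.
for row in rows:
    row["groupbycol"] = row[group1]

# Aggregate step: group by the fixed 'groupbycol' column.
totals = {}
for row in rows:
    totals[row["groupbycol"]] = totals.get(row["groupbycol"], 0) + row["salary"]
```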