I have an ADF Data Flow that outputs 2 sets of values (Name, Location) as shown below:
Is there a way to output the count of Names in each Location via ADF Data Flow?
You can do this with the Aggregate transformation. I tested it with your data.
Start with the Aggregate transformation's Group by section and add Location as the group-by column.
In the Aggregates section, enter the aggregated column name under Columns and count(Name) as the aggregate expression.
Verify the result in the Aggregate transformation's Data preview.
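For reference, the script behind such an Aggregate transformation would look roughly like the following Data Flow Script sketch (the source stream name source1 and the output column name NameCount are illustrative):

source1 aggregate(groupBy(Location),
    NameCount = count(Name)) ~> Aggregate1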
How to convert this CSV file input to the given output in ADLS using ADF?
Input data:
order_id,city,country
L10,Sydney,Australia
L11,Annecy,France
L12,Montceau,France
L13,Paris,France
L14,Montréal,Canada
L15,Ste-Hyacinthe,Canada
Output data:
COUNTRY,CITY,TOTAL_Order
Australia,Sydney,1
Australia,Total,1
Canada,Montréal,1
Canada,Ste-Hyacinthe,1
Canada,Total,2
France,Annecy,1
France,Montceau,1
France,Paris,1
France,Total,3
Total,Total,6
I want to find the count of order IDs city-wise and country-wise using a Data Flow. This is similar to a roll-up aggregation.
Use three Aggregate transformations in the data flow to do this. The first calculates the count of order_id for every country and city combination. The second calculates the count of order_id for every country. The third calculates the count of order_id for the full table. Below are the detailed steps.
The same input data is taken as the source.
img:1 source data preview
Create two additional branches by clicking the + symbol next to the Source transformation and selecting New branch.
In each branch, add an Aggregate transformation.
Aggregate transformation1 settings:
Group by: country, city
Aggregates: total_order = count(order_id)
img:2 aggregate transform1 data preview
Aggregate transformation2 settings:
Group by: country
Aggregates: total_order = count(order_id)
img:3 aggregate transform 2 data preview.
Aggregate transformation3 settings: no column in Group by.
Group by: (leave empty)
Aggregates: total_order = count(order_id)
img:4 aggregate transform3 data preview.
The next step is to union all these streams. Since they do not all have the same structure, add a Derived Column transformation after aggregate2 and aggregate3 to create the missing columns (to match the desired output exactly, set them to 'Total' rather than an empty string: city in the country-level branch, and both country and city in the grand-total branch).
Combine the aggregate1, derived1, and derived2 streams using a Union transformation.
img:5 Data preview after all transformations.
img:6 Complete data flow with all transformations.
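For reference, the complete flow corresponds roughly to the following Data Flow Script sketch (approximate syntax; the source stream name source1 is illustrative, and the derived columns here fill the missing columns with 'Total' so the union matches the desired output):

source1 aggregate(groupBy(country, city),
    total_order = count(order_id)) ~> aggregate1
source1 aggregate(groupBy(country),
    total_order = count(order_id)) ~> aggregate2
source1 aggregate(total_order = count(order_id)) ~> aggregate3
aggregate2 derive(city = 'Total') ~> derived1
aggregate3 derive(country = 'Total', city = 'Total') ~> derived2
aggregate1, derived1, derived2 union(byName: true) ~> union1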
I have two streams, customer and customercontact. I am new to Azure Data Factory. I just want to know which data flow transformation will achieve the result of the SQL query below.
(SELECT *
FROM customercontact
WHERE customerid IN
(SELECT customerid
FROM customer)
ORDER BY timestamp DESC
LIMIT 1)
I can use the Exists transformation for the inner query, but I need some help on how to fetch the first row after sorting the customer contact data. So basically, I am looking for a way to add a LIMIT/TOP/OFFSET clause in a data flow.
You can achieve the equivalent of this query in a data flow by combining a few transformations.
For sorting, you can use the Sort transformation, where you can choose ascending or descending order.
For the top few records, you can use the Rank transformation.
For the IN clause, you can use the Exists transformation.
Refer - https://learn.microsoft.com/en-us/azure/data-factory/data-flow-rank
Here is my sample data in SQL as the source.
I have used the Rank transformation.
After the Rank transformation, an additional column, RankColumn, is added.
Now, to select only the top record, I used the Filter row modifier with the expression equals(RankColumn, 1).
Finally, use a Sink and run the pipeline.
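Put together, the pattern corresponds roughly to the following Data Flow Script sketch (the stream names customercontact and customer and the timestamp column come from the question; the exact rank options may differ):

customercontact, customer exists(customercontact@customerid == customer@customerid,
    negate: false,
    broadcast: 'auto') ~> Exists1
Exists1 rank(desc(timestamp, true),
    output(RankColumn as long)) ~> Rank1
Rank1 filter(equals(RankColumn, 1)) ~> Filter1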
I'm having issues using the Exists Transformation within a Data Flow with a generic dataset.
I have two sources (one from a staging table, "sourceStg", and one from a DWH table, "sourceDwh") and want to check whether the UniqueIdentifier column in the staging table exists in the UniqueIdentifier column of the DWH table. For that I have a generic dataset which I query with a SQL statement containing parameters.
When I open the "Exists settings" I cannot choose any column from the source in the conditions, since the source is generic and has no projection until I run the data flow. However, I have a parameter from the parent pipeline which provides the name of the column containing the UniqueIdentifier (the column names in staging and DWH are the same).
I tried adding the statement "byName($UniqueIdentifier)" in the left and right column fields, but the engine resolves both as the sourceStg column, since the source-transformation prefix is missing and it defaults to the first one. What I am basically trying to achieve is a statement like the following, which specifies the correct source transformation and the column containing the unique identifier via a parameter.
exists(sourceStg#$UniqueIdentifier == sourceDwh#$UniqueIdentifier)
But either the expression cannot be parsed, or the result does not retrieve the actual UniqueIdentifier value from the column and instead writes the statement itself (e.g. sourceStg#$UniqueIdentifier) as the column value.
The only workaround I have found so far is to use two Derived Column transformations, which add a suffix to the UniqueIdentifier column in one source, together with a new parameter $UniqueIdentifierDwh which is populated with the parameter $UniqueIdentifier plus the same suffix as used in the derived column.
Any Azure Data Factory experts out there to help?
Thanks in advance!
Hello, I am getting rows like the ones below.
ID, NAME,EMAIL,PHONENUMBER
123,ABC, qwe#poi.com|asd#lkj.com, 3636|7363
234,DEF,sjs#djd.com|sndir#fmei.com|cmrjje#fmcj.com,5845|4958|5959
Each person can have multiple emails and phone numbers, separated by |. The first email and first phone are linked, the second email and second phone are linked, and so on, so they need to stay in the same record. Can I split each record into multiple rows with one email and one phone per record?
We need to use a data flow to achieve that. I created a test; the overall architecture and debug result are as follows:
My source dataset is a text file in Azure Data Lake Storage Gen2. Source1 and Source2 use this same data source.
In the DerivedColumn1 transformation, we can select the EMAIL column and enter the expression split(EMAIL,'|') to split this column into an array.
In the Flatten1 transformation, select EMAIL[] as Unroll by and Unroll root.
In the SurrogateKey1 transformation, enter ROW_NO as the key column and 1 as the start value.
The data preview is as follows:
Source2 is the same as Source1, so we jump to the DerivedColumn2 transformation, where we select the PHONENUMBER column and enter the expression split(PHONENUMBER,'|') to split this column into an array.
In the Flatten2 transformation, select PHONENUMBER[] as Unroll by and Unroll root.
In the SurrogateKey2 transformation, enter ROW_NO as the key column and 1 as the start value. The data preview is as follows:
In the Join1 transformation, we can inner join these two streams on the key column ROW_NO.
The data preview is as follows:
In the Select1 transformation, we can select the columns we need.
The data preview is as follows:
Then we can sink the result to our destination.
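For reference, the whole flow corresponds roughly to the following Data Flow Script sketch (approximate syntax; source1/source2 are the two source streams, and the column mappings in the final Select are illustrative):

source1 derive(EMAIL = split(EMAIL,'|')) ~> DerivedColumn1
DerivedColumn1 foldDown(unroll(EMAIL),
    mapColumn(ID, NAME, EMAIL, PHONENUMBER)) ~> Flatten1
Flatten1 keyGenerate(output(ROW_NO as long), startAt: 1L) ~> SurrogateKey1
source2 derive(PHONENUMBER = split(PHONENUMBER,'|')) ~> DerivedColumn2
DerivedColumn2 foldDown(unroll(PHONENUMBER),
    mapColumn(ID, NAME, EMAIL, PHONENUMBER)) ~> Flatten2
Flatten2 keyGenerate(output(ROW_NO as long), startAt: 1L) ~> SurrogateKey2
SurrogateKey1, SurrogateKey2 join(SurrogateKey1@ROW_NO == SurrogateKey2@ROW_NO,
    joinType: 'inner', broadcast: 'auto') ~> Join1
Join1 select(mapColumn(ID = SurrogateKey1@ID, NAME = SurrogateKey1@NAME,
    EMAIL = SurrogateKey1@EMAIL, PHONENUMBER = SurrogateKey2@PHONENUMBER)) ~> Select1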
That's all.
I have two tables:
The first table contains sales pipeline information for accounts (pipeline ID, account ID, and pipeline value). Each account ID has multiple pipeline IDs.
The second table includes the number of employees per account.
I added these tables to PowerPivot and created a relationship based on account ID.
I would like to create a pivot that shows the number of employees and pipeline value by account ID and pipeline ID.
However, when implemented, it repeats all pipeline IDs for each account, even those pipeline IDs that are not related to the account.
http://i.stack.imgur.com/WY1Ga.png
Could someone point me in the right direction as to how to tweak the pivot to show only the relevant pipeline IDs?
I would appreciate any help you could provide...
thank you!
The numbers repeat in your pivot table because Number of Employees is not related to Pipeline ID. This is Excel's default reaction to missing relationships. To get rid of the repeating numbers and keep this pivot table as is, you need to find a way to relate the number of employees to a pipeline ID.
If I were modeling this, I would have a separate table of just distinct account IDs, to make it its own dimension, and then the two tables you mentioned in the question.
If you were simply writing a query against this data, how would you connect pipeline ID to number of employees? To which pipeline should an employee be attached? If there is a way to do this manually or in a query, adjust your two tables to both include the fields on which you would join. I don't think you would be able to relate Number of Employees to Pipeline ID, so I would remove Pipeline ID from the pivot table. Then the numbers should be correct. You could then create another pivot to show pipeline value by pipeline ID per account.