Dataflow expression builder greatest max integer ADF - Azure

I am using Data Factory's Expression Builder to build data flows (Aggregate transformation) to 1. group movies by year, 2. find the max rating of movies, and 3. return the movie title for that max.
I have already grouped by year so I'm trying to return something like
max(toInteger(Rating)) or greatest(toInteger(Rating))
and also get the 'title' of the movie with that max rating. Can this be done in the Expression Builder?

The Aggregate transformation defines aggregations of columns in your data streams. Using the Expression Builder, you can define different types of aggregations such as SUM, MIN, MAX, and COUNT grouped by existing or computed columns.
I tried to repro the issue with sample data, and I can see that getting the movie title isn't possible with the Aggregate transformation in a mapping data flow.
In the data preview we only get the group-by column and the aggregate column; there is no option to include the movie title column here.
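For illustration only, since mapping data flows execute on Spark, here is a minimal PySpark sketch of the equivalent logic (the column names Year, Title and Rating are assumptions based on the question). It shows why the Aggregate alone is not enough: the grouped output carries only the group-by key and the aggregated value, and recovering the matching title needs a separate join back to the original rows.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, max as spark_max

spark = SparkSession.builder.getOrCreate()

# hypothetical movie data; Year, Title, Rating are assumed column names
movies = spark.createDataFrame(
    [(2008, "The Dark Knight", "9"), (2008, "Iron Man", "7"), (2010, "Inception", "8")],
    ["Year", "Title", "Rating"],
)

# the aggregate alone can only return the group-by key and the aggregated value
max_per_year = movies.groupBy("Year").agg(
    spark_max(col("Rating").cast("int")).alias("MaxRating")
)

# recovering the matching title requires joining back to the original rows
titles = movies.join(
    max_per_year,
    (movies["Year"] == max_per_year["Year"])
    & (movies["Rating"].cast("int") == max_per_year["MaxRating"]),
).select(movies["Year"], movies["Title"], max_per_year["MaxRating"])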

Related

Count of orders - ADF - aggregation

How to convert this CSV file input to the given output in ADLS using ADF?
Input data:
order_id,city,country
L10,Sydney,Australia
L11,Annecy,France
L12,Montceau,France
L13,Paris,France
L14,Montréal,Canada
L15,Ste-Hyacinthe,Canada
Output data:
COUNTRY,CITY,TOTAL_Order
Australia,Sydney,1
Australia,Total,1
Canada,Montréal,1
Canada,Ste-Hyacinthe,1
Canada,Total,2
France,Annecy,1
France,Montceau,1
France,Paris,1
France,Total,3
Total,Total,6
I want to find the count of order IDs city-wise and country-wise using Data Flow. This is similar to a roll-up aggregation.
Use three Aggregate transformations in the data flow to do this. The first calculates the count of order_id for every country and city combination, the second calculates the count of order_id for every country, and the third calculates the count of order_id for the full table. Below are the detailed steps, followed by a rough PySpark sketch of the same logic for reference.
The same input data is taken as the source.
img:1 source data preview
Create two additional branches by clicking the + symbol next to the Source transformation and selecting New branch.
In each branch, add an Aggregate transformation.
Aggregate transformation1 settings:
group by: country, city
aggregates: total_order=count(order_id)
img:2 aggregate transform1 data preview
Aggregate transform2 settings:
group by: country
aggregates: total_order=count(order_id)
img:3 aggregate transform 2 data preview.
Aggregate transform3 settings: no column in group by.
group by:
aggregates: total_order=count(order_id)
img:4 aggregate transform3 data preview.
The next step is to union all these tables. Since they do not all have the same structure, add a Derived Column transformation after aggregate2 and aggregate3 to create the missing columns with an empty string.
Combine the data from the aggregate1, derived1, and derived2 transformations using a Union transformation.
img:5 Data preview after all transformations.
img: 6 Complete dataflow with all transformations.
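For reference only, here is a rough PySpark sketch of the same three-aggregate-plus-union logic. It is a sketch, not the data flow itself; the literal 'Total' used to fill the missing columns is taken from the expected output.
from pyspark.sql import SparkSession
from pyspark.sql.functions import count, lit

spark = SparkSession.builder.getOrCreate()

# the sample orders from the question
orders = spark.createDataFrame(
    [("L10", "Sydney", "Australia"), ("L11", "Annecy", "France"),
     ("L12", "Montceau", "France"), ("L13", "Paris", "France"),
     ("L14", "Montréal", "Canada"), ("L15", "Ste-Hyacinthe", "Canada")],
    ["order_id", "city", "country"],
)

# aggregate 1: count per country and city
agg1 = orders.groupBy("country", "city").agg(count("order_id").alias("total_order"))

# aggregate 2: count per country, with a placeholder city column
agg2 = (orders.groupBy("country").agg(count("order_id").alias("total_order"))
        .withColumn("city", lit("Total")))

# aggregate 3: count over the whole table, with placeholder country and city columns
agg3 = (orders.agg(count("order_id").alias("total_order"))
        .withColumn("country", lit("Total")).withColumn("city", lit("Total")))

# union the three branches on a common column order, then sort for readability
result = (agg1.select("country", "city", "total_order")
          .unionByName(agg2.select("country", "city", "total_order"))
          .unionByName(agg3.select("country", "city", "total_order"))
          .orderBy("country", "city"))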

PySpark Design Pattern for Combining Values Based on Criteria

Hi, I am new to PySpark and want to create a function that takes a table of duplicate rows and a dict of {field_names : ["the source" : "the approach for getting the record"]} as input and creates a new record. The new record will be equal to the first non-null value in the priority list, where each "approach" is a function.
For example, the input table looks like this for a specific component:
And given this priority dict:
The output record should look like this:
The new record looks like this because for each field there is a function selected that dictates how the value is chosen (e.g. phone is equal to 0.75 because Amazon's most complete record is null, so you coalesce to the next approach in the list, which is the value of phone for the most complete record for Google = 0.75).
Essentially, I want to write a PySpark function that groups by component and then applies the appropriate function for each column to get the correct value. While I have a function that "works", the time complexity is terrible, as I am naively looping through each component, then each column, then each approach in the list to build the record.
Any help is much appreciated!
I think you can solve this using pyspark.sql.functions.when. See this blog post for some complicated usage patterns. You're going to want to group by id, and then use when statements to implement your logic. For example, 'title': {'source': 'Google', 'approach': 'first record'} can be implemented as
from pyspark.sql.functions import col, first, when
# take the first non-null title from the Google rows within each id group
df.groupBy('id').agg(
    first(when(col("source") == "Google", col("title")), ignorenulls=True).alias("title")
)
'Most recent' and 'most complete' are more complicated and may require some self-joins, but you should still be able to use when clauses to get the aggregates you need.
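As a rough sketch of the 'most complete' idea, here is one way it might look using a window rank instead of a self-join (the field names are assumptions, not your actual schema; df is the deduplication input from your question):
from pyspark.sql import Window
from pyspark.sql.functions import col, row_number, when

# hypothetical field names to score completeness on
fields = ["title", "phone", "address"]

# score each row by how many of the fields are non-null ("most complete")
completeness = sum(when(col(f).isNotNull(), 1).otherwise(0) for f in fields)

# rank rows within each id group, most complete first, and keep the top row
w = Window.partitionBy("id").orderBy(completeness.desc())
most_complete = df.withColumn("rn", row_number().over(w)).filter(col("rn") == 1).drop("rn")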

Azure Data Factory Calculate Percent Revenue

I need to calculate the billing percentage with respect to the total in an Azure Data Factory workflow. I used the expression:
Total_amount/sum(Total_amount)
But it doesn't work. How could I calculate percentages using the Aggregate transformation inside a data flow?
Create a data flow with 2 sources. They can both be the same source in your example. The first stream will have Source->aggregate [sum(total amount)]->Sink (cached). The second stream will have Source->derived column (total amount / lookup from the cached sink above). My example screenshot below does the exact same thing with loan data. This is what my formula in the derived column looks like: loan_amnt / sink1#output().total_amount
You can also try using the Window transformation in a data flow.
Source:
Add a dummy column using a Derived Column transformation and assign it a constant value.
Using the Window transformation, get the percentage of the total amount.
Expression in window column:
multiply(divide(toInteger(Total_amount),sum(toInteger(Total_amount))),100)
Preview of the Window transformation:
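Since data flows run on Spark, the same idea can be sketched in PySpark for comparison (Total_amount is taken from the question; df stands for your source data, which is an assumption here):
from pyspark.sql import Window
from pyspark.sql.functions import col, lit, sum as spark_sum

# the dummy constant column puts every row in one partition,
# so the window sum is the grand total and each row can be divided by it
w = Window.partitionBy("dummy")
df_pct = (
    df.withColumn("dummy", lit(1))
      .withColumn(
          "pct_of_total",
          col("Total_amount").cast("int") / spark_sum(col("Total_amount").cast("int")).over(w) * 100,
      )
      .drop("dummy")
)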

Azure Data Flow - Can we have dynamic columns or a change in projections for Unpivot functionality

The Excel file consists of 62 columns; 7 columns are fixed and the rest are weeks of the year (week1 to week52).
I have used a data flow to unpivot the 53 week columns into rows with 2 extra columns, year and value.
The problem is that the 52 week column names keep changing on every weekly data load, so how do I handle this change in column names in the data flow? For a single run it gives the exact output.
What you'll want to do here is to implement late-binding of your schema, or what ADF refers to as "schema drift". Instead of setting a hardened "early binding" schema in your Source projection, leave the dataset schema and projection empty.
Next, add a Derived Column after your source and call it "Projection". This is where you'll build your projection using rules to account for your evolving schema.
Build out your canonical model with the column names for your entire year using byName('columnname'). That will tell ADF to look for the existence of the column in single quotes from your source data while also providing a schema that you can use to build out your pivot table.
If you need to cast the values, wrap byName() inside a casting function, e.g. toString(), toDate(), etc.
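For comparison only, the same late-binding idea sketched in PySpark: discover the week columns by name pattern at run time instead of hardcoding them, then unpivot. The fixed column names here are assumptions, not your actual schema, and df stands for the schema-drifted source.
# df is assumed to be the source read without a fixed projection
fixed_cols = ["product", "region"]  # hypothetical fixed columns; the real file has 7
week_cols = [c for c in df.columns if c.lower().startswith("week")]

# build a stack() expression that unpivots whatever week columns arrived this load
stack_expr = "stack({n}, {pairs}) as (week, value)".format(
    n=len(week_cols),
    pairs=", ".join("'{c}', `{c}`".format(c=c) for c in week_cols),
)
unpivoted = df.selectExpr(*fixed_cols, stack_expr)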

SSAS, dimension numeric value filtering

I am using the multidimensional model in SSAS with a seemingly simple requirement.
I have a Product dimension table with a Price attribute. Using an Excel pivot table, I want to filter this Price attribute, for example "greater than $1000". However, the filter in the pivot table is string-only, so I cannot perform any numerical comparison operations, only comparisons against equivalent strings, e.g. "$1,000.00".
My problem is similar to this thread, and I wonder if there is a solution/workaround that I missed?
Best regards,
CT
As suggested in the thread that you link, you could create a measure for the price, and then filter that. The definition of this calculated measure would be something like
[Product].[Product].Properties("Price", TYPED)
assuming the dimension as well as the attribute are named "Product", and the attribute has the price defined as a property named "Price".
(You define a property in BIDS as a relationship from the Product attribute to the Price attribute.)
