ADF Mapping Data Flow - Dirty / Label Replacement / Sink cache and derived column - azure

I am trying to build an ADF mapping data flow for generic label replacement: its purpose is to look at a value in a particular column and replace it with a label. I already have my dataset, which looks like this (Table B):
enter image description here
The goal is to replace the values with their labels. Since my label mapping dataset is in a Cached Sink (Table B), I thought I could use a Derived Column transformation, along with Cached Lookups, to find the clean value, using the current column name and the current (dirty) value as keys. I wrote a rule-based mapping expression to get just the columns that need cleaning:
enter image description here
I tested the derived column transformation using: Each column that matches: libCached#lookup(name).Column_Name
enter image description here
This part lets me pick out the column names whose values need to be replaced by labels, and it works fine.
I need help with the replacement itself. I tried several formulas, but it still doesn't work, and I don't know whether it's achievable.
Thanks a lot.

To replace the actual values in the derived column, you'll need to use the lookup formula with the key that you've set in the cached sink, so that ADF can match on that value. The screenshot you have only shows that you are checking for null; it does not actually return the lookup value.
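The matching logic can be sketched outside ADF in plain Python (this is not data flow expression syntax). The mapping rows stand in for the cached sink, keyed on column name plus dirty value, and each matching column's value is swapped for the looked-up label. The names Dirty_Value and Label and all sample data are assumptions for illustration; only Column_Name appears in the question.

```python
# Mapping rows standing in for the cached sink (hypothetical columns and data).
label_rows = [
    {"Column_Name": "status", "Dirty_Value": "1", "Label": "Active"},
    {"Column_Name": "status", "Dirty_Value": "0", "Label": "Inactive"},
    {"Column_Name": "country", "Dirty_Value": "FR", "Label": "France"},
]

# Lookup keyed on (column name, dirty value), like a cached sink with two key columns.
lookup = {(r["Column_Name"], r["Dirty_Value"]): r["Label"] for r in label_rows}
# Columns that appear in the mapping are the ones that need cleaning.
columns_to_clean = {r["Column_Name"] for r in label_rows}


def replace_labels(row):
    """Return a copy of row with mapped values replaced by their labels."""
    cleaned = {}
    for name, value in row.items():
        if name in columns_to_clean:
            # Return the looked-up label; keep the original value if there is no match.
            cleaned[name] = lookup.get((name, str(value)), value)
        else:
            cleaned[name] = value
    return cleaned


result = replace_labels({"id": 7, "status": "1", "country": "FR", "city": "Lyon"})
print(result)
```

The point to carry over is that the expression has to return the looked-up label, not only test whether a match exists.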

Related

Delete bottom two rows in Azure Data Flow

I would like to delete the bottom two rows of an Excel file in ADF, but I don't know how to do it.
The flow I am thinking of is this.
enter image description here
*I intend to filter out (delete) the rows highlighted in yellow.
The file has over 40,000 rows of data and is updated once a month. (The number of rows changes with each update, so the condition must be specified with a function.)
The contents of the file are also shown here.
The bottom two lines contain spaces and asterisks.
enter image description here
Any help would be appreciated.
I'm new to Azure and having trouble.
I need your help.
Add a surrogate key transformation to put a row number on each row. Add a new branch to duplicate the stream and in that new branch, add an aggregate.
Use the aggregate transformation to find the max() value of the surrogate key counter.
Then subtract 2 from that max number and filter for just the rows up to that max-2.
Let me provide a more detailed answer here ... I think I can get it in here without writing a separate blog.
The simplest way to filter out the final 2 rows is a pattern depicted in the screenshot here. Instead of the new branch, I just created 2 sources both pointing to the same data source. The 2nd stream is there just to get a row count and store it in a cached sink. For the aggregation expression I used this: "count(1)" as the row count aggregator.
In the first stream, that is the primary data processing stream, I add a Surrogate Key transformation so that I can have a row number for each row. I called my key column "sk".
Finally, set the Filter transformation to only allow rows with a row number <= the max row count from the cached sink minus 2.
The Filter expression looks like this: sk <= cachedSink#output().rowcount-2
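As a rough illustration of the same pattern outside ADF, here is a minimal Python sketch. It assumes the surrogate key starts at 1 and that the row count comes from a separate pass over the same data. The function name drop_last_rows is made up for this example.

```python
def drop_last_rows(rows, n=2):
    """Keep every row except the last n, using a row number and a total count."""
    # Second stream: count the rows (the count(1) aggregate stored in the cached sink).
    row_count = len(rows)
    # First stream: attach a surrogate key "sk" starting at 1.
    keyed = list(enumerate(rows, start=1))
    # Filter: sk <= rowcount - n
    return [row for sk, row in keyed if sk <= row_count - n]


data = ["row 1", "row 2", "row 3", "   ", "***"]
print(drop_last_rows(data))
```

Because the count is recomputed on every run, the filter keeps working when the number of rows changes each month.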

Passing the Dataflow Parameter to Sink Key column in Azure Data factory

I want to implement SCD type 2 logic using dynamic tables and dynamic key fields from a config table. My challenge is passing a data flow parameter as the sink key column for my Alter Row activity: it does not take the parameter value and always gives an "invalid key column name" error. I tried picking the data flow parameter in the expression builder at the sink key column and passing the value from the Alter Row transformation, and I have also named the field with the parameter in the select statement. Any help or suggestion is highly appreciated.
Please see the images below.
Sample How I wanted to Pass Dynamic Values in Sink Mapping
Trying to Give the Dynamic Value to Key Value
You have "List of columns" selected, so ADF is looking for a column in your target table that is literally called "$TargetPK1Parameter".
Change the selector to "Custom expression" and enter a string array parameter. The parameter can be an array of strings that represent names of key columns in your target table.
It should look something like this:
I encountered a similar problem when trying to pass a parameterized composite key as part of the update method to the sink. The approach below now allows me to fully parameterise my data flow, and it handles both composite keys and single-column keys.
Here's how the data looks in my config table:
UpsertKeyColumn = DOMNAME,DDLANGUAGE,AS4LOCAL,VALPOS,AS4VERS
A parameter value is set in the dataflow
Upsert_Key_Column = @item().UpsertKeyColumn
Finally, in the Sink settings, Custom Expression is selected for Key columns and the following expression is entered - split($upsert_key_column,',')
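To show what the split produces, here is a small Python sketch of the same step. It is an analogy only, not data flow expression syntax. The helper name key_columns is made up. The config value is the one quoted above.

```python
def key_columns(upsert_key_column):
    """Turn a comma-separated config value into a list of key column names."""
    # Mirrors split($upsert_key_column, ','): no trimming, so the config
    # value should not contain spaces around the commas.
    return upsert_key_column.split(",")


composite = key_columns("DOMNAME,DDLANGUAGE,AS4LOCAL,VALPOS,AS4VERS")
single = key_columns("DOMNAME")
print(composite)
print(single)
```

A single column name still comes back as a one-element list, which is why the same expression covers both composite and single-column keys.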

Azure Data Factory - Exists transformation in Data Flow with generic dataset

I'm having issues using the Exists Transformation within a Data Flow with a generic dataset.
I have two sources (one from the staging table, "sourceStg", and one from the DWH table, "sourceDwh") and want to check whether the value in the UniqueIdentifier column of the staging table exists in the UniqueIdentifier column of the DWH table. For that I have a generic dataset which I query with a SQL statement containing parameters.
When I open the "Exists settings" I cannot choose any column from the source in the conditions, since the source is generic and has no projection until I run the data flow. However, I have a parameter from the parent pipeline which gives me the name of the column containing the UniqueIdentifier (the column name is the same in staging and DWH).
I tried adding the statement "byName($UniqueIdentifier)" in the left and right column fields, but the engine resolves both to the sourceStg column, since the prefix of the source transformation is missing and it defaults to the first one. What I am trying to achieve is a statement like the following, which names the correct source transformation and takes the column containing the unique identifier from a parameter.
exists(sourceStg#$UniqueIdentifier == sourceDwh#$UniqueIdentifier)
But either the expression cannot be parsed, or the result does not retrieve the actual UniqueIdentifier value from the column and instead writes the statement itself (e.g. sourceStg#$UniqueIdentifier) as the column value.
The only workaround I have found so far is to use two derived columns that add a suffix to the UniqueIdentifier column in one source, plus a new parameter $UniqueIdentiferDwh that is populated with the value of $UniqueIdentifier and the same suffix as used in the derived column.
Any Azure Data Factory experts out there to help?
Thanks in advance!

How to pass a Data Flow Parameter in Key Column in Sink Transformation while updating data?

I am implementing SCD Type 2 through a Data Flow. I have created a parameter in it to which I pass a column name, and I am using this parameter as the Key Column in the Sink transformation.
Passing a parameter in Key Column in Data Flow
I have selected the Add Dynamic Content and then Parameter, after that I selected the parameter I have created in Data Flow. Then it shows like "$Key_col".
But when I run the pipeline it gives me this error:
{"message":"at Sink 'sink1'(Line 56/Col 6): Column operands are not allowed in literal expressions. Details:at Sink 'sink1'(Line 56/Col 6): Column operands are not allowed in literal expressions","failureType":"UserError","target":"Update_Existing_Records","errorCode":"DFExecutorUserError"}
Can anyone please tell me how to resolve this error, or suggest a workaround for this problem?
Yes, this works. You just need to put single quotes around the parameter value like this:
"'$Key_col'"
I'm using double-quotes for string interpolation in this solution, so paste it in your expression exactly as that.
The key column doesn't support being set from a parameter. You can only choose an existing column in the sink.
The column name that you pick as the key here will be used by ADF as part of the subsequent update, upsert, delete. Therefore, you must pick a column that exists in the Sink mapping. If you wish to not write the value to this key column, then click "Skip writing key columns".
Please reference: Mapping data flow properties.
The parameter Key_col does not exist in the sink, even if it has the same name as a column.
Update:
Data Flow parameter:
If we want to use update, we must add an Alter Row transformation:
In the sink, choose the existing column 'name' as the key column:
The pipeline runs successfully:
Hope this helps.

How does Apache spark structured streaming 2.3.0 let the sink know that a new row is an update of an existing row?

How does spark structured streaming let the sink know that a new row is an update of an existing row when run in an update mode? Does it look at all the values of all columns of the new row and an existing row for an equality match or does it compute some sort of hash?
Reading the documentation, we see some interesting information about update mode (bold formatting added by me):
Update Mode - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage (available since Spark 2.1.1). Note that this is different from the Complete Mode in that this mode only outputs the rows that have changed since the last trigger. If the query doesn’t contain aggregations, it will be equivalent to Append mode.
So, to use update mode there needs to be some kind of aggregation; otherwise all data will simply be added to the end of the result table. In turn, to use aggregation, the data needs one or more columns to act as a key. Since a key is needed, it is easy to know whether a row has been updated: simply compare the values with the previous iteration of the table (the key tells you which row to compare with). In aggregations that contain a groupby, the columns being grouped on are the keys.
Simple aggregations that return a single value will not require a key. However, since only a single value is returned it will update if that value is changed. An example here could be taking the sum of a column (without groupby).
The documentation contains a picture that gives a good understanding of this, see the "Model of the Quick Example" from the link above.
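The behaviour described above can be mimicked with a toy model in plain Python. This is only an illustration of keyed aggregation with update-style output, not Spark's actual implementation, and the class name UpdateModeCounter is made up. It keeps a running count per grouping key and, after each batch, emits only the keys whose aggregate changed.

```python
from collections import defaultdict


class UpdateModeCounter:
    """Toy model of a grouped count emitted in update mode."""

    def __init__(self):
        # Running aggregate per grouping key, kept across batches.
        self.state = defaultdict(int)

    def process_batch(self, keys):
        """Apply one batch and return only the rows whose count changed."""
        changed = set()
        for key in keys:
            self.state[key] += 1
            changed.add(key)
        # Keys not seen in this batch are unchanged and therefore not emitted.
        return {key: self.state[key] for key in sorted(changed)}


counter = UpdateModeCounter()
first = counter.process_batch(["cat", "dog", "cat"])
second = counter.process_batch(["dog", "owl"])
print(first)   # {'cat': 2, 'dog': 1}
print(second)  # {'dog': 2, 'owl': 1}
```

In the second batch "cat" is not emitted because its count did not change. The sink receives the new row for "dog" and can match it to the earlier one by the grouping key.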
