Aggregation in Azure Data Flow is Returning Invalid Value - azure

I have created a data flow in Data Factory.
Step 1. Read the parquet file.
Step 2. Aggregate the file to get the Max(DateField)
Step 3. Use a derived column to write in a Value.
Step 4. Alter row task with Value and the DateField.
Step 5. Sink select the Watermark table to update.
The flow updates the value, but it isn't putting in the max value. The date value is incorrect. Any ideas?

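The max() aggregate compares values according to the column's data type, so first check how DateField is typed in the source projection. If it is read from the Parquet file as a string, max() compares the values lexicographically, which returns the wrong date for formats that do not sort chronologically (for example MM/dd/yyyy). In that case, convert the column with toDate() or toTimestamp() before aggregating.
Alternatively, you can filter on the timestamp and get the latest or oldest date using ADF.
Refer to this answer by @Leon to see how to implement this.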
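As a hedged illustration of how the column type can affect a max aggregate, the following Python sketch (not ADF code) compares the maximum of date strings with the maximum of parsed dates. The sample values and the MM/dd/yyyy format are assumptions made for the example; they are not taken from the question.

```python
from datetime import datetime

# Hypothetical sample values in MM/dd/yyyy format (not from the original post).
raw_dates = ["12/31/2021", "01/15/2023", "06/30/2022"]

# Comparing as strings is lexicographic: "12/..." sorts after "01/..." and "06/...".
max_as_string = max(raw_dates)

# Parsing first means real calendar dates are compared.
max_as_date = max(datetime.strptime(d, "%m/%d/%Y") for d in raw_dates)

print(max_as_string)       # 12/31/2021
print(max_as_date.date())  # 2023-01-15
```

If the two results differ for your data, the column is probably being aggregated as text rather than as a date.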

Related

Azure Data Factory Validation the Header row and checksum of a fixed length txt file and insert

I am currently using the Azure Data Factory to retrieve a fixed-length file from blob storage and trying to import the record into my database.
Fixed-length.txt
0202212161707
1Tom
1Kelvin
1Michael
23
The first row is the header record, which starts with '0' followed by the creation time.
The following rows are the detail records, which start with '1' followed by the user name.
The last row is the end record, which starts with '2' followed by the count of the detail records.
I want to validate that the file's data is correct before I insert the records. I would like to check that the checksum is correct first, and then insert only the records that start with 1.
Currently, I insert all the records line by line into a SQL DB and run a stored procedure to perform these tasks. Is it possible to use Azure Data Factory to do this? Thank you.
I reproduced your scenario with the following steps.
First, add a Lookup activity to read all the rows from the file so that they can be filtered in later steps.
Then add a Set Variable activity to get the last character of the last row (for example, 3 from 23) with the dynamic expression below.
@last(activity('Lookup1').output.value[sub(length(activity('Lookup1').output.value),1)].Prop_0)
Then add a Filter activity to keep the rows with the prefix 1, using the items value and condition below.
items : @activity('Lookup1').output.value
condition : @startswith(item().Prop_0,'1')
After the Filter, add a ForEach activity to append those values to an array.
Inside the ForEach activity, add an Append Variable activity; it builds an array of the filtered values.
Now add an If Condition activity with an expression that checks whether the value of the Set Variable and the length of the appended array are the same.
@equals(int(variables('sum')),length(variables('username')))
Then, inside the True branch, add your Copy activity so that the data is copied only when the condition is true.
My Sample Output:
0202212161707
1Tom
1Kelvin
23
For the data above, control goes to the False branch.
0202212161707
1Tom
1Kelvin
1Michael
23
For the data above, control goes to the True branch.
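The validation rule used by the pipeline above can be sketched in plain Python to make the logic easier to check against sample files. This is a minimal sketch, not ADF code; the function name validate_fixed_length is hypothetical. One deliberate difference: it parses everything after the '2' prefix as the count, whereas the last() expression above takes only the final character, which would only cover counts from 0 to 9.

```python
def validate_fixed_length(lines):
    """Return the detail values if the trailer count matches, otherwise None.

    Hypothetical helper: '0' marks the header, '1' a detail row, '2' the trailer.
    """
    rows = [ln.strip() for ln in lines if ln.strip()]
    if len(rows) < 2:
        return None
    header, trailer = rows[0], rows[-1]
    if not header.startswith("0") or not trailer.startswith("2"):
        return None

    # Keep only detail rows and drop their '1' prefix.
    details = [r[1:] for r in rows[1:-1] if r.startswith("1")]

    # The trailer carries the expected number of detail rows after the prefix.
    try:
        expected = int(trailer[1:])
    except ValueError:
        return None

    return details if expected == len(details) else None


good = ["0202212161707", "1Tom", "1Kelvin", "1Michael", "23"]
bad = ["0202212161707", "1Tom", "1Kelvin", "23"]
print(validate_fixed_length(good))  # ['Tom', 'Kelvin', 'Michael']
print(validate_fixed_length(bad))   # None
```

The two sample inputs mirror the True and False cases shown in the sample output.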

Passing the Dataflow Parameter to Sink Key column in Azure Data factory

I want to implement SCD Type 2 logic using dynamic tables and dynamic key fields from a config table. My challenge is passing a Data Flow parameter as the sink key column for my Alter Row transformation: it does not take the parameter value and always fails with an invalid key column name error. I tried picking the Data Flow parameter in the expression builder for the sink key column and passing the value from the Alter Row transformation, and I have also named the field with the parameter in the select statement. Any help or suggestions are highly appreciated.
Please see the images below:
Sample How I wanted to Pass Dynamic Values in Sink Mapping
Trying to Give the Dynamic Value to Key Value
You have "List of columns" selected, so ADF is looking for a column in your target table that is literally called "$TargetPK1Parameter".
Change the selector to "Custom expression" and enter a string array parameter. The parameter can be an array of strings that represent names of key columns in your target table.
It should look something like this:
I encountered a similar problem when trying to pass a parameterized composite key as part of the update method to the sink. The approach below lets me fully parameterize my data flow, and it handles both composite keys and single-column keys.
Here's how the data looks in my config table:
UpsertKeyColumn = DOMNAME,DDLANGUAGE,AS4LOCAL,VALPOS,AS4VERS
A parameter value is set in the dataflow
Upsert_Key_Column = @item().UpsertKeyColumn
Finally, in the Sink settings, Custom Expression is selected for Key columns and the following expression is entered - split($upsert_key_column,',')
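To show what the split expression produces, here is a minimal Python sketch of turning the comma-separated config value into a list of key column names. The function name parse_key_columns is hypothetical, and the whitespace trimming is an addition for robustness that the split expression above does not perform.

```python
def parse_key_columns(config_value):
    """Split a comma-separated key list into column names (hypothetical helper)."""
    # Trimming and dropping empty entries is an assumption, not part of the
    # original expression; it guards against values such as "A, B," in the config.
    return [name.strip() for name in config_value.split(",") if name.strip()]


composite = parse_key_columns("DOMNAME,DDLANGUAGE,AS4LOCAL,VALPOS,AS4VERS")
single = parse_key_columns("DOMNAME")
print(composite)  # ['DOMNAME', 'DDLANGUAGE', 'AS4LOCAL', 'VALPOS', 'AS4VERS']
print(single)     # ['DOMNAME']
```

A single column name yields a one-element list, which is why the same setting covers both composite and single-column keys.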

How to pass a Data Flow Parameter in Key Column in Sink Transformation while updating data?

I am implementing SCD Type 2 through a Data Flow. I have created a parameter in it to which I pass a column name, and I am using this parameter as the Key Column in the Sink transformation.
Passing a parameter in Key Column in Data Flow
I have selected the Add Dynamic Content and then Parameter, after that I selected the parameter I have created in Data Flow. Then it shows like "$Key_col".
But when I run the pipeline, it gives me this error:
{"message":"at Sink 'sink1'(Line 56/Col 6): Column operands are not allowed in literal expressions. Details:at Sink 'sink1'(Line 56/Col 6): Column operands are not allowed in literal expressions","failureType":"UserError","target":"Update_Existing_Records","errorCode":"DFExecutorUserError"}
Can anyone please tell me how to resolve this error, or suggest a workaround for this problem?
Yes, this works. You just need to put single quotes around the parameter value like this:
"'$Key_col'"
I'm using double-quotes for string interpolation in this solution, so paste it in your expression exactly as that.
The Key column setting does not support being set with a parameter. You can only choose an existing column in the sink.
The column name that you pick as the key here will be used by ADF as part of the subsequent update, upsert, delete. Therefore, you must pick a column that exists in the Sink mapping. If you wish to not write the value to this key column, then click "Skip writing key columns".
For reference, see: Mapping data flow properties.
The parameter Key_col does not exist in the sink, even if it has the same name as a column.
Update:
Data Flow parameter:
If we want to use update, we must add an Alter Row transformation:
In the sink, choose the existing column 'name' as the key column:
The pipeline runs successfully:
Hope this helps.

Spotfire: how to get the First and last value in a column based on entity and date?

I have a simple table with two entities and values associated with dates.
I want to extract the FIRST and LAST value based on historical dates. In the underlying data table, the dates are not sorted, hence when using FIRST() and LAST(), Spotfire gives incorrect values. What is the best way to solve this?
I tried
First([Value]) OVER (Intersect([Category],[Date]))
Sample of the dataset:
If you're using a cross table, you can use a nested If statement to return the values where the date is the Min and the Max.
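The underlying idea, picking the value at the earliest and latest date per entity regardless of row order, can be sketched in Python. This is not Spotfire expression syntax; the function name first_last_by_date and the sample rows are hypothetical.

```python
from datetime import date


def first_last_by_date(rows):
    """Map each category to (value at earliest date, value at latest date)."""
    earliest, latest = {}, {}
    for category, day, value in rows:
        # Track the row with the minimum and maximum date per category,
        # so the order of the input rows does not matter.
        if category not in earliest or day < earliest[category][0]:
            earliest[category] = (day, value)
        if category not in latest or day > latest[category][0]:
            latest[category] = (day, value)
    return {c: (earliest[c][1], latest[c][1]) for c in earliest}


# Hypothetical unsorted sample data: (category, date, value).
rows = [
    ("A", date(2020, 3, 1), 30),
    ("A", date(2020, 1, 1), 10),
    ("A", date(2020, 2, 1), 20),
    ("B", date(2021, 5, 1), 7),
    ("B", date(2021, 4, 1), 5),
]
result = first_last_by_date(rows)
print(result)  # {'A': (10, 30), 'B': (5, 7)}
```

This is the same comparison the nested If approach performs: match each row's date against the minimum and maximum date within its category.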

How does Apache spark structured streaming 2.3.0 let the sink know that a new row is an update of an existing row?

How does spark structured streaming let the sink know that a new row is an update of an existing row when run in an update mode? Does it look at all the values of all columns of the new row and an existing row for an equality match or does it compute some sort of hash?
Reading the documentation, we see some interesting information about update mode (bold formatting added by me):
Update Mode - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage (available since Spark 2.1.1). Note that this is different from the Complete Mode in that this mode only outputs the rows that have changed since the last trigger. If the query doesn’t contain aggregations, it will be equivalent to Append mode.
So, to use update mode there needs to be some kind of aggregation; otherwise all data will simply be appended to the end of the result table. In turn, an aggregation uses one or more columns as a key. Since there is a key, it is easy to tell whether a row has been updated: compare its values with the previous iteration of the table (the key tells you which row to compare with). In aggregations that contain a groupBy, the columns being grouped on are the keys.
Simple aggregations that return a single value will not require a key. However, since only a single value is returned it will update if that value is changed. An example here could be taking the sum of a column (without groupby).
The documentation contains a picture that gives a good understanding of this, see the "Model of the Quick Example" from the link above.
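The keyed-update idea described above can be simulated in a few lines of plain Python. This is only a conceptual sketch assuming a running count per key; it is not Spark code and does not claim to reflect Spark's internal state store. The function name run_update_mode is hypothetical.

```python
def run_update_mode(batches):
    """Yield, per trigger, only the keys whose running count changed."""
    counts = {}  # running aggregate state, keyed by the group-by key
    for batch in batches:
        changed = {}
        for key in batch:
            counts[key] = counts.get(key, 0) + 1
            # Record the latest aggregate for every key touched in this trigger.
            changed[key] = counts[key]
        yield changed


# Three hypothetical micro-batches of keys.
outputs = list(run_update_mode([["a", "b", "a"], ["b"], ["c", "a"]]))
for trigger, rows in enumerate(outputs):
    print(trigger, rows)
```

In the second trigger only the row for key b is emitted, even though the state still holds a count for a; complete mode would emit both rows instead.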
