How do I add a row to a dataset in ADF dataflows - azure

New to ADF and could use some help. I have a two-column data set and would like to add an additional row to it. I have columns "CHANNEL" and "CHANNEL_ID" with the values shown in the source data set, and I would like to add the values 'blank' and '0' to those columns to produce the desired outcome shown below. Is this kind of transformation possible within my data flow?
I've tried pivoting the columns, adding a derived column for the '0' field, and then pivoting those columns again, but I wasn't certain that what I did was right, and I believe there has to be a simpler way.

In order to add a new row with channel_id = '0' and channel = 'blank', I followed the approach below.
Two source transformations are taken with the same dataset, as in the image below.
Then a filter transformation is added to one of the source transformations to select a single row from the dataset.
Filter condition: channel_id == '52'
A derived column transformation is added with the following settings:
Column: channel, Expression: 'blank'
Column: channel_id, Expression: '0'
Then a union transformation is added, and the derived column output and the source1 output are given as inputs to the union transformation.
Result of the union transformation:
Dataflow script
source(output(
		channel as string,
		channel_id as string
	),
	allowSchemaDrift: true,
	validateSchema: false,
	ignoreNoFilesFound: false) ~> source1
source(output(
		channel as string,
		channel_id as string
	),
	allowSchemaDrift: true,
	validateSchema: false,
	ignoreNoFilesFound: false) ~> source2
source2 filter(channel_id == '52') ~> filter1
filter1 derive(channel = 'blank',
	channel_id = '0') ~> derivedColumn1
source1, derivedColumn1 union(byName: true) ~> union1
union1 sink(allowSchemaDrift: true,
	validateSchema: false,
	skipDuplicateMapInputs: true,
	skipDuplicateMapOutputs: true) ~> sink1
This way, you can add a row in a data flow.

Related

Spark: good practice to check values in column are all same?

I have a dataset ds with a column isInError; the dataset is read from a path.
For each dataset that I read, all values in this column should be the same (all true or all false).
Now I want to call some method based on this column (if all values in the column are true, I will add a new column; if all values are false, I will not).
How can I do this properly? I could do something like dsFiltered = ds.filter(col("isInError").equalTo("true")) and then check whether dsFiltered is empty, but I don't think that's best practice.

What is the industry standard Deduping method in Dataflows?

Deduping is one of the basic and important data-cleaning techniques, and there are a number of ways to do it in a data flow.
For example, I dedupe with the help of an aggregate transformation: I put the key columns that need to be unique (say "FirstName" and "LastName") in Group by, and in the Aggregates tab I use a column pattern like name != 'FirstName' && name != 'LastName' with $$ mapped to first($$).
The problem with this method is that if 200 of my 300 columns need to be treated as unique columns, it is very tedious to include 200 columns in my column pattern.
Can anyone suggest a better, more optimised deduping process in a data flow for the above situation?
I tried to repro the deduplication process using a data flow. Below is the approach.
The list of columns to group by is supplied through a data flow parameter.
In this repro, three columns are given; this can be extended as required.
Parameter Name: Par1
Type: String
Default value: 'col1,col2,col3'
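In the data flow script, that parameter declaration shows up roughly like this (the name and default value are just the ones used in this repro):
parameters{
	Par1 as string ('col1,col2,col3')
}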
The source is taken as in the image below
(Group by columns: col1, col2, col3; aggregate column: col4).
Then an aggregate transformation is added. In Group by, the expression sha2(256,byNames(split($Par1,','))) is given as the column, and it is named groupbycolumn.
In the Aggregates tab, click + Add column pattern next to Column1 and then delete Column1. Enter true() as the matching condition, then enter $$ as the column name expression and first($$) as the value expression.
Output of the aggregate transformation:
Data is grouped by col1, col2 and col3, and the first value of col4 is taken for every col1, col2 and col3 combination.
Then, using a select transformation, groupbycolumn can be removed from the output before copying to the sink (see the sketch below).
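Pieced together, the aggregate might look roughly like the following in data flow script. The stream names (source1, Aggregate1) are illustrative; only the sha2/byNames group-by expression and the true() column pattern come from the steps above:
source1 aggregate(groupBy(groupbycolumn = sha2(256,byNames(split($Par1,',')))),
	each(match(true()), $$ = first($$))) ~> Aggregate1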
Reference: MS document on Mapping data flow script - Azure Data Factory | Microsoft Learn

Azure Data Factory Selecting one item from object

How do you select just one item from a nested object via Select in Azure Data Factory?
{
    "CorrelationId": 123,
    "ComponentInfo": {
        "ComponentId": "1",
        "ComponentName": "testC"
    }
}
I have a join1 step in my ADF data flow, and Inspect shows the results in that step.
But when I select just the two columns I need, the Data Preview errors out with:
Column source1#ComponentInfo not found. The stream is either not connected or column is unavailable
The Select is set as such:
source1#{source1#ComponentInfo}.ComponentName
What is wrong with my selection of ComponentName, given that it sits inside an object? The select method was chosen from a drop-down. I have tried flattening the data, but it is not an array, and I have tried modifying the schema, but I am not sure I am researching the right way to select from an object.
I reproduced this with the above sample data and used a select transformation after the join, and I got the same error.
Here, the select appears to be treating source1#ComponentInfo as a column, whereas it is an object in this case.
You can get the desired result using derived column transformation.
After the join, use a derived column transformation and create two new columns for the required fields (ComponentName and CorrelationId), using data flow expressions that reference the input schema, as in the sketch below.
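A minimal sketch of that derived column in data flow script, assuming the join output stream is named join1 and the struct column is ComponentInfo (dot notation reaches into the complex column; if a column name is ambiguous after the join, qualify it with the stream name):
join1 derive(ComponentName = ComponentInfo.ComponentName,
	CorrelationId = CorrelationId) ~> derivedColumn1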
You can see the result in the Data preview.
Then, you can keep only the required columns using a select transformation.
Result:

How to Flatten a semicolon Array properly in Azure Data Factory?

Context: I have a data flow that extracts data from a SQL DB. The data arrives as just one column containing a tab-separated string, so in order to manipulate it properly I've tried to separate every column with its corresponding data:
Firstly, to 'rebuild' the table properly, I used a 'Derived Column' activity to replace tabs (and line breaks) with semicolons (1):
dropLeft(regexReplace(regexReplace(regexReplace(descripcion,'[\t]',';'),'[\n]',';'),'[\r]',';'),1)
Then I used the split() function to get an array and build the columns (2):
split(descripcion, ';')
Problem: When I try to use the 'Flatten' activity (as described here: https://learn.microsoft.com/en-us/azure/data-factory/data-flow-flatten), it just doesn't work: the data flow gives me only one column, or if I add an additional column in the 'Flatten' activity I just get another column with the same data as the first one:
Expected output:
column1 | column2                            | column3
2000017 | ENVASE CORONA CLARA 24/355 ML GRAB | PC13
2004297 | ENVASE V FAM GRAB 12/940 ML USADO  | PC15
Could you tell me what I'm doing wrong? Thanks by the way.
You can use the derived column activity itself; try as below.
After the first derived column, what you have is a semicolon-separated string, which can simply be split again in another derived column (schema modifier).
Here firstc represents the source column equivalent to your column descripcion:
Column1: split(firstc, ';')[1]
Column2: split(firstc, ';')[2]
Column3: split(firstc, ';')[3]
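Combined into a single derived column transformation, the script might look roughly like this (firstc and the stream names are placeholders; note that array indexes in data flow expressions start at 1):
derivedColumn1 derive(Column1 = split(firstc, ';')[1],
	Column2 = split(firstc, ';')[2],
	Column3 = split(firstc, ';')[3]) ~> derivedColumn2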
Optionally, you can then select only the columns you need to write to the SQL sink.

How to get the data from previous row in Azure data factory

I am working on transforming data in Azure Data Factory.
I have a source file that contains data like this:
ABC Code-01
DEF
GHI
JKL Code-02
MNO
I need to make the data look like this in the sink file:
ABC Code-01
DEF Code-01
GHI Code-01
JKL Code-02
MNO Code-02
You can achieve this using the fill-down concept available in Azure Data Factory. The code snippet is available here.
Note: the code snippet assumes that you have already added a source transformation to the data flow.
Steps:
Add a source and link it to the source file (I generated a file with your sample data).
Edit the data flow script (available at the top right) to add the code.
Add the code snippet after the source as shown.
source1 derive(dummy = 1) ~> DerivedColumn
DerivedColumn keyGenerate(output(sk as long),
	startAt: 1L) ~> SurrogateKey
SurrogateKey window(over(dummy),
	asc(sk, true),
	Rating2 = coalesce(Rating, last(Rating, true()))) ~> Window1
After adding the code to the script, the data flow generates three transformations:
a. A derived column transformation with a new dummy column holding the constant 1.
b. A surrogate key transformation to generate a key value for each row, starting at 1.
c. A window transformation to perform window-based aggregation. Here the code adds the predefined last() clause to take the previous row's non-null value when the current row value is NULL.
For more information on Window transformation refer - https://learn.microsoft.com/en-us/azure/data-factory/data-flow-window
As the values arrive as a single column in the source, additional columns are added in the derived column transformation to split the single source column into two columns.
Substitute NULL when the column value is blank; if it is left blank, the last() clause will not recognize it as NULL and will not substitute the previous value.
case(length(dropLeft(Column_1,4)) > 1, dropLeft(Column_1,4), toString(null()))
Preview of the derived column: Column_1 is the source raw data, dummy is the column generated from the code snippet with constant 1, and Column1Left and Column1Right store the values after splitting the raw data (Column_1).
Note: Column1Right blank values are replaced with NULLs.
In the window transformation:
a. Over – this partitions the source data based on the column provided. As there are no other columns to use as a partition column, add the dummy column generated by the derived column transformation.
b. Sort – sorts the source data based on the sort column. Add the surrogate key column to sort the incoming source data.
c. Window column – here, provide the expression to copy the non-null value from previous rows only when the current value is null:
coalesce(Column1Right, last(Column1Right, true()))
d. Data preview of the window transformation: here, null values in Column1Right are replaced by the previous non-null values based on the expression added in Window columns.
A second derived column is added to concatenate Column1Left and Column1Right into a single column.
Second derived column preview:
A select transformation is added to pass only the required columns to the sink and remove unwanted ones (this is optional).
Sink data output after the fill-down process:
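For reference, the whole flow pieced together might look roughly like the following data flow script. The column names (Column_1, Column1Left, Column1Right) follow the screenshots above; the Column1Left expression, the final concat, and the stream names are illustrative assumptions, since only the Column1Right and window expressions were spelled out:
source1 derive(dummy = 1,
	Column1Left = split(Column_1, ' ')[1],
	Column1Right = case(length(dropLeft(Column_1,4)) > 1, dropLeft(Column_1,4), toString(null()))) ~> DerivedColumn
DerivedColumn keyGenerate(output(sk as long),
	startAt: 1L) ~> SurrogateKey
SurrogateKey window(over(dummy),
	asc(sk, true),
	Column1Right = coalesce(Column1Right, last(Column1Right, true()))) ~> Window1
Window1 derive(Output = concat(Column1Left, ' ', Column1Right)) ~> DerivedColumn2
DerivedColumn2 select(mapColumn(Output),
	skipDuplicateMapInputs: true,
	skipDuplicateMapOutputs: true) ~> Select1
Select1 sink(allowSchemaDrift: true,
	validateSchema: false) ~> sink1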
