How to flatten a semicolon-separated array properly in Azure Data Factory?

Context: I have a data flow that extracts data from a SQL DB. The data arrives as a single column containing a tab-separated string. To manipulate the data properly, I tried to separate it into individual columns, each with its corresponding data.
First, to 'rebuild' the table properly, I used a Derived Column transformation replacing tabs with semicolons (1):
dropLeft(regexReplace(regexReplace(regexReplace(descripcion, '[\t]', ';'), '[\n]', ';'), '[\r]', ';'), 1)
Then I used the split() function to get an array and build the columns (2):
split(descripcion, ';')
Problem: When I try to use the Flatten transformation (as described here: https://learn.microsoft.com/en-us/azure/data-factory/data-flow-flatten), it just doesn't work: the data flow returns only one column, and if I add an additional column in the Flatten transformation I just get another column with the same data as the first one.
Expected output:

column1   column2                              column3
2000017   ENVASE CORONA CLARA 24/355 ML GRAB   PC13
2004297   ENVASE V FAM GRAB 12/940 ML USADO    PC15
Could you tell me what I'm doing wrong? Thanks.

You can use the Derived Column transformation itself; try as below.
After the first derived column, what you have is a semicolon-delimited string, which can simply be split again using a Derived Column schema modifier,
where firstc represents the source column equivalent to your descripcion column:
Column1: split(firstc, ';')[1]
Column2: split(firstc, ';')[2]
Column3: split(firstc, ';')[3]
Optionally, you can use a Select transformation to keep only the columns you need to write to the SQL sink.
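In data flow script terms, this derived column would look roughly like the following (a sketch; ReplaceTabs stands for the stream coming out of your first derived column, and SplitToColumns is an illustrative name):

ReplaceTabs derive(Column1 = split(firstc, ';')[1],
	Column2 = split(firstc, ';')[2],
	Column3 = split(firstc, ';')[3]) ~> SplitToColumns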

Related

What is the industry-standard deduping method in data flows?

Deduping is one of the basic and important data-cleaning techniques.
There are a number of ways to do it in a data flow.
For example, I do deduping with the help of an Aggregate transformation: I put the key columns that need to be unique (consider "Firstname" and "LastName" as the columns) in the Group by, and in the Aggregates tab I add a column pattern like name != 'Firstname' && name != 'LastName' with $$ mapped to first($$) (see the script sketch below).
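In data flow script terms, that manual pattern looks roughly like this (a sketch, assuming Firstname and LastName are the only key columns):

source1 aggregate(groupBy(Firstname, LastName),
	each(match(name != 'Firstname' && name != 'LastName'), $$ = first($$))) ~> DedupAggregate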
The problem with this method is that if 200 out of 300 columns need to be treated as unique columns, it is very tedious to include all 200 in the column pattern.
Can anyone suggest a better, more optimized deduping process in a data flow for the above situation?
I tried to reproduce the deduplication process using a data flow. Below is the approach.
The list of columns to group by is given in a data flow parameter.
In this repro, three columns are given; this can be extended as per requirements.
Parameter Name: Par1
Type: String
Default value: 'col1,col2,col3'
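In data flow script, that parameter declaration would look like this (a sketch of the parameters block):

parameters{
	Par1 as string ('col1,col2,col3')
}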
The source contains three group-by columns (col1, col2, col3) and one aggregate column (col4).
Then an Aggregate transformation is added. In Group by,
sha2(256, byNames(split($Par1, ',')))
is given as the column expression and named groupbycolumn.
In Aggregates, click + Add column pattern next to column1 and then delete column1. Enter true() in the matching condition. Then click the undefined column expression and enter $$ in the column name expression and first($$) in the value expression.
Output of the Aggregate transformation: the data is grouped by col1, col2, and col3, and the first value of col4 is taken for each col1, col2, col3 combination.
Then, using a Select transformation, groupbycolumn can be removed from the output before copying to the sink.
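In data flow script, the aggregate step of this approach looks roughly like this (a sketch; source1 stands for the incoming stream):

source1 aggregate(groupBy(groupbycolumn = sha2(256, byNames(split($Par1, ',')))),
	each(match(true()), $$ = first($$))) ~> DedupAggregate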
Reference: Microsoft documentation on Mapping data flow script - Azure Data Factory | Microsoft Learn.

Azure Data Factory: selecting one item from an object

How do you select just one item from an object nested within another object via the Select transformation in Azure Data Factory?
{
  "CorrelationId": 123,
  "ComponentInfo": {
    "ComponentId": "1",
    "ComponentName": "testC"
  }
}
I have a join1 step in my ADF data flow, and when I inspect that step I can see the results.
But when I select just the two columns I need, the Data Preview errors out:
Column source1#ComponentInfo not found. The stream is either not connected or column is unavailable
The Select is set as such:
source1#{source1#ComponentInfo}.ComponentName
What is wrong with my selection of ComponentName, given that ComponentInfo is an object? The expression was selected from a drop-down. I have tried to flatten the data, but it is not an array, and I have tried modifying the schema, but I am not sure I am researching the right way to select from an object.
I reproduced this with the above sample data and used a Select transformation after the join. I got the same error as above.
Here, the Select transformation may be treating source1#ComponentInfo as a column, when in this case it is an object.
You can get the desired result using a Derived Column transformation.
After the join, add a Derived Column transformation and create two new columns for the required fields, referencing the input schema in the data flow expressions as below.
ComponentName column and CorrelationId column:
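The original answer showed these expressions as screenshots; as a sketch (assuming the join output exposes ComponentInfo as a complex column), they would look something like:

ComponentInfo.ComponentName
CorrelationId

If the join leaves duplicate column names between the two streams, the stream qualifier may be needed, e.g. source1@ComponentInfo.ComponentName.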
You can see the result in the Data Preview.
Then, you can filter down to the required columns using a Select transformation.

In BigQuery, how can I convert a struct of structs of strings to columns?

In the table there are 3 columns; the third one is a RECORD (struct) containing two structs, old and new. Inside those structs there are columns and values.
I can access each leaf column like this: change.old.name. But I want to convert them into normal columns and create another table with that.
I tried UNNEST, but that doesn't work since it's not an array.
UPDATE:
Finally got it sorted. Select and convert all columns by selecting all of the nested fields and setting the alias as desired, e.g. replacing each dot with an underscore. Then create a table with that.
create table abc as
select
    ID,
    Created_on,
    Change.old.add as Change_old_add,
    Change.old.name as Change_old_name,
    Change.old.count_people as Change_old_count_people,
    Change.new.add as Change_new_add,
    Change.new.name as Change_new_name,
    Change.new.count_people as Change_new_count_people
from `project.Table`

How to get data from the previous row in Azure Data Factory

I am working on transforming data in Azure Data Factory.
I have a source file that contains data like this:
ABC Code-01
DEF
GHI
JKL Code-02
MNO
I need the data to look like this in the sink file:
ABC Code-01
DEF Code-01
GHI Code-01
JKL Code-02
MNO Code-02
You can achieve this using the fill-down concept available in Azure Data Factory. The code snippet is shown below.
Note: The code snippet assumes that you have already added a source transformation in the data flow.
Steps:
Add a source and link it to the source file (I generated a file with your sample data).
Edit the data flow script (available at the top-right corner) to add the code.
Add the code snippet after the source as shown.
source1 derive(dummy = 1) ~> DerivedColumn
DerivedColumn keyGenerate(output(sk as long),
startAt: 1L) ~> SurrogateKey
SurrogateKey window(over(dummy),
asc(sk, true),
Rating2 = coalesce(Rating, last(Rating, true()))) ~> Window1
After the code is added to the script, the data flow generates 3 transformations:
a. A Derived Column transformation with a new dummy column set to the constant 1.
b. A Surrogate Key transformation to generate a key value for each row, starting at 1.
c. A Window transformation to perform window-based aggregation. Here the code uses the predefined last() clause to take the previous row's non-NULL value when the current row's value is NULL. (In the snippet, Rating stands for the column to fill down; in this walkthrough it is Column1Right.)
For more information on Window transformation refer - https://learn.microsoft.com/en-us/azure/data-factory/data-flow-window
Since the source provides the values as a single column, additional columns are added in the Derived Column transformation to split the single source column into two columns.
Substitute NULL when a column value is blank; if it stays blank, the last() clause will not recognize it as NULL and will not substitute the previous value:
case(length(dropLeft(Column_1, 4)) > 1, dropLeft(Column_1, 4), toString(null()))
Preview of the Derived Column transformation: Column_1 is the raw source data; dummy is the constant-1 column generated by the code snippet; Column1Left and Column1Right store the values after splitting the raw Column_1 data.
Note: blank Column1Right values are replaced with NULLs.
In the Window transformation:
a. Over – partitions the source data by the column provided. As there are no other columns to use as a partition column, add the dummy column generated in the Derived Column transformation.
b. Sort – sorts the source data by the sort column. Add the surrogate key column to sort the incoming source data.
c. Window columns – provide the expression that copies the non-NULL value from previous rows only when the current value is NULL:
coalesce(Column1Right, last(Column1Right, true()))
d. Data preview of the Window transformation: Column1Right NULL values are replaced by the previous non-NULL values, per the expression added in Window columns.
A second Derived Column transformation is added to concatenate Column1Left and Column1Right into a single column; see the sketch below.
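As a sketch (assuming the filled-down window column kept the name Column1Right and a space separator is wanted):

concat(Column1Left, ' ', Column1Right)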
A Select transformation is added to pass only the required columns to the sink and remove unwanted ones (this is optional).
The sink output then shows the data after the fill-down process.
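Putting the steps together, the full fill-down flow in data flow script would look roughly like this (a sketch: left(Column_1, 3) for the left part and the space separator are assumptions based on the sample data, and the window column overwrites Column1Right):

source1 derive(dummy = 1,
	Column1Left = left(Column_1, 3),
	Column1Right = case(length(dropLeft(Column_1, 4)) > 1, dropLeft(Column_1, 4), toString(null()))) ~> DerivedColumn
DerivedColumn keyGenerate(output(sk as long),
	startAt: 1L) ~> SurrogateKey
SurrogateKey window(over(dummy),
	asc(sk, true),
	Column1Right = coalesce(Column1Right, last(Column1Right, true()))) ~> Window1
Window1 derive(Result = concat(Column1Left, ' ', Column1Right)) ~> DerivedColumn2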

Dynamically rename/map columns according to Azure Table data model convention

How would you dynamically rename/map columns to conform to the Azure Table data model convention, where property key names must be valid C# identifiers? We cannot guarantee that the columns coming to us conform to the standard, or that new columns are automatically fixed when they arrive.
Example:
column_1 (something_in_parens), column with spaces, ...
returned...
column_1 something_in_parens, column_with_spaces, ...
The obvious solution might be to run a Databricks Python step in front of the Copy Data activity, but maybe Copy Data is able to infer the right schema?
import re

columns = ["some Not so nice column Names", "Another ONE", "Last_one"]

# Lowercase, replace spaces with underscores, then strip any character
# that is not valid in a C# identifier (handles names with parentheses).
new_columns = [re.sub(r"[^0-9a-zA-Z_]", "", c.lower().replace(" ", "_")) for c in columns]
# returns ['some_not_so_nice_column_names', 'another_one', 'last_one']
