What is the industry standard Deduping method in Dataflows? - azure

Deduping is one of the basic and important data-cleaning techniques.
There are a number of ways to do that in a data flow.
For example, I do deduping with the help of an aggregate transformation: I put the key columns that need to be unique (consider "Firstname" and "LastName" as the columns) in Group by, and in the Aggregates tab I add a column pattern like name != 'Firstname' && name != 'LastName'
with $$ as the column name expression and first($$) as the value expression.
The problem with this method is that if, say, 200 out of 300 columns need to be treated as unique key columns, it is very tedious to include those 200 columns in the column pattern.
Can anyone suggest a better, more optimized deduping process in a data flow for the above situation?

I tried to repro the deduplication process using a data flow. Below is the approach.
The list of columns to group by is given in a data flow parameter.
In this repro, three columns are given. This can be extended as per requirements.
Parameter Name: Par1
Type: String
Default value: 'col1,col2,col3'
The source contains the columns col1, col2, col3 (the group-by columns) and col4 (the aggregate column).
Then an Aggregate transformation is added. In Group by,
sha2(256,byNames(split($Par1,','))) is given as the column expression, and the column is named groupbycolumn.
In Aggregates, click + Add column pattern near Column1 and then delete Column1. Enter true() in the matching condition, then click the undefined column expression and enter $$ in the column name expression and first($$) in the value expression.
Output of the aggregate transformation:
Data is grouped by col1, col2 and col3, and the first value of col4 is taken for every col1, col2, col3 combination.
Then, using a select transformation, the groupbycolumn column from the above output can be removed before copying to the sink.
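For comparison, since mapping data flows execute on Spark, the same keep-the-first-row-per-key logic can be sketched in PySpark. This is only an illustrative sketch; the column names below are assumptions, not taken from the repro above.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical input with three key columns and one payload column.
df = spark.createDataFrame(
    [("a", "b", "c", 1), ("a", "b", "c", 2), ("x", "y", "z", 3)],
    ["col1", "col2", "col3", "col4"],
)

key_cols = ["col1", "col2", "col3"]

# Equivalent of the aggregate transformation: group by the key columns and
# keep the first value of every remaining column.
other_cols = [c for c in df.columns if c not in key_cols]
deduped = df.groupBy(*key_cols).agg(*[F.first(c).alias(c) for c in other_cols])

# If any surviving row is acceptable, dropDuplicates on the key columns does the same job.
deduped_simple = df.dropDuplicates(key_cols)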
Reference: MS document on Mapping data flow script - Azure Data Factory | Microsoft Learn

Related

How to Flatten a semicolon Array properly in Azure Data Factory?

Context: I have a data flow that extracts data from a SQL DB. When the data comes in, it is just one column with a string separated by tabs. In order to manipulate the data properly, I've tried to separate every single column with its corresponding data:
Firstly, to 'rebuild' the table properly, I used a 'Derived Column' transformation replacing tab, newline and carriage-return characters with semicolons instead (1):
dropLeft(regexReplace(regexReplace(regexReplace(descripcion,[\t],';'),[\n],';'),[\r],';'),1)
So, after that, use the split() function to get an array and build the columns (2):
split(descripcion, ';')
Problem: When I try to use the 'Flatten' transformation (as described here: https://learn.microsoft.com/en-us/azure/data-factory/data-flow-flatten), it just doesn't work: the data flow gives me just one column, or if I add an additional column in the 'Flatten' transformation I just get another column with the same data as the first one:
Expected output:
column1   column2                              column3
2000017   ENVASE CORONA CLARA 24/355 ML GRAB   PC13
2004297   ENVASE V FAM GRAB 12/940 ML USADO    PC15
Could you tell me what I'm doing wrong? Thanks in advance.
You can use the derived column transformation itself; try as below.
After the first derived column, what you have is a semicolon-delimited string, which can just be split again using another derived column schema modifier.
Here firstc represents the source column equivalent to your column descripcion:
Column1: split(firstc, ';')[1]
Column2: split(firstc, ';')[2]
Column3: split(firstc, ';')[3]
Optionally, you can use a select transformation to keep only the columns you need to write to the SQL sink.
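For comparison (mapping data flows run on Spark), the same split-by-index logic can be sketched in PySpark; the sample values and names below are assumptions for illustration only.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical single-column input, after tabs have already been replaced by ';'.
df = spark.createDataFrame(
    [("2000017;ENVASE CORONA CLARA 24/355 ML GRAB;PC13",),
     ("2004297;ENVASE V FAM GRAB 12/940 ML USADO;PC15",)],
    ["descripcion"],
)

parts = F.split(F.col("descripcion"), ";")
result = df.select(
    parts.getItem(0).alias("column1"),  # getItem is 0-based, unlike the 1-based data flow split()
    parts.getItem(1).alias("column2"),
    parts.getItem(2).alias("column3"),
)
result.show(truncate=False)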

Spark: Filter & withColumn using row values?

I need to create a column called sim_count for every row in my Spark dataframe, whose value is the count of all other rows in the dataframe that match some conditions based on the current row's values. Is it possible to access row values while using when?
Is something like this possible? I have already implemented this logic using a UDF, but serialization of the dataframe's RDD map is very costly, and I am trying to see if there is a faster alternative to find this count value.
Edit
<Row's col_1 val> refers to the outer-scope row I am calculating the count for, NOT the inner-scope row inside the df.where. For example, I know this is incorrect syntax, but I'm looking for something like:
df.withColumn('sim_count',
    f.when(
        f.col("col_1").isNotNull(),
        (
            df.where(
                f.col("price_list").between(f.col("col1"), f.col("col2"))
            ).count()
        )
    ).otherwise(f.lit(None).cast(LongType()))
)
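One UDF-free way to get this kind of count (not from the original post, just a sketch under assumed column names id, col1, col2 and price_list) is a non-equi self-join followed by a groupBy count:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hypothetical data: col1/col2 define a range, price_list is the value to test.
df = spark.createDataFrame(
    [(1, 10.0, 20.0, 15.0), (2, 5.0, 9.0, 7.0), (3, None, None, 12.0)],
    ["id", "col1", "col2", "price_list"],
)

# All candidate prices to compare against (also carry an id here if the row itself must be excluded).
prices = df.select(F.col("price_list").alias("other_price"))

# Non-equi join: for each current row, keep the prices that fall inside its [col1, col2] range.
counts = (
    df.alias("cur")
    .join(prices.alias("o"),
          F.col("o.other_price").between(F.col("cur.col1"), F.col("cur.col2")),
          "left")
    .groupBy("cur.id")
    .agg(F.count("o.other_price").alias("sim_count"))
)

# Rows with a NULL col1 never match, so they get sim_count = 0 here;
# map that back to NULL afterwards if needed.
result = df.join(counts, "id", "left")
result.show()

This is still a potentially expensive join (Spark will typically run it as a broadcast nested loop join), but it avoids the per-row UDF serialization cost.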

Selecting a column not in cube in Spark

I have a dataframe which has, say, 3 columns x, y and z.
I want to get all three columns in the result, but I do not want to cube on column z.
Is there a way I can do it?
P.S. - (I have just given an example with 3 columns, but I have quite a long list of columns, so GROUPING SETS is not an option).
Example -
val df = Seq(("1","x","a"),("1","v","b"),("3","x","c")).toDF("col1","col2","col3")
val list = Seq("col1","col2").map(e=>col(e))
// now I want to select col3 non-cubed (basically I do not want to get the combinations for it)
// The statement below will not select col3 at all, since col3 is not part of the cube - but I still want col3 in the result, just not cubed
display(df.select($"col1",$"col2",$"col3").cube(list:_*).agg(sum("col1")))
Cube is an extension of groupBy in which you get the aggregated result for the various combinations of the columns used to group by.
Here is an example of what you can achieve, cubing only col1 and col2 and carrying col3 along with first():
df.cube($"col1",$"col2").agg(first($"col3").as("col3")).show
Please share your expected result as suggested by Shaido.

Using Large Look up table

Problem Statement:
I have two tables - Data (40 cols) and LookUp (2 cols). I need to use col10 in the Data table with the lookup table to extract the relevant value.
However, I cannot do an equi-join. I need a join based on like/contains, as values in the lookup table contain only partial content of the value in the Data table, not the complete value. Hence some regex-based matching is required.
Data Size:
Data Table: approx. 2.3 billion entries (1 TB of data)
Lookup Table: approx. 1.4 million entries (50 MB of data)
Approach 1: Using the database (I am using Google BigQuery) - a join based on LIKE takes close to 3 hours, yet returns no result. I believe a regex-based join leads to a Cartesian join.
Approach 2: Using Apache Beam/Spark - I tried to construct a trie for the lookup table, which is then shared/broadcast to the worker nodes. However, with this approach I am getting OOM errors, as I am creating too many Strings. I tried increasing memory to 4 GB+ per worker node, but to no avail.
I am using the trie to extract the longest matching prefix.
I am open to using other technologies like Apache Spark, Redis, etc.
Please suggest how I can go about handling this problem.
This processing needs to be performed on a day-to-day basis, hence both time and resources need to be optimized.
Regarding "However I cannot make equi join":
Below is just an idea to explore for addressing your equi-join issue in pure BigQuery.
It is based on an assumption I derived from your comments, and covers the use case where you are looking for the longest match from the very right towards the left; matches in the middle do not qualify.
The approach is to reverse both the url (col10) and shortened_url (col2) fields, then SPLIT() them and UNNEST() while preserving positions:
UNNEST(SPLIT(REVERSE(field), '.')) part WITH OFFSET position
With this done, you can now do an equi join, which can potentially address your issue to some extent.
So, you JOIN by parts and positions, then GROUP BY the original url and shortened_url while keeping only those groups HAVING a count of matches equal to the count of parts in shortened_url, and finally you GROUP BY url, keeping only the entry with the highest number of matching parts.
Hope this can help :o)
This is for BigQuery Standard SQL
#standardSQL
WITH data_table AS (
  SELECT 'cn456.abcd.tech.com' url UNION ALL
  SELECT 'cn457.abc.tech.com' UNION ALL
  SELECT 'cn458.ab.com'
), lookup_table AS (
  SELECT 'tech.com' shortened_url, 1 val UNION ALL
  SELECT 'abcd.tech.com', 2
), data_table_parts AS (
  SELECT url, x, y
  FROM data_table, UNNEST(SPLIT(REVERSE(url), '.')) x WITH OFFSET y
), lookup_table_parts AS (
  SELECT shortened_url, a, b, val,
    ARRAY_LENGTH(SPLIT(REVERSE(shortened_url), '.')) len
  FROM lookup_table, UNNEST(SPLIT(REVERSE(shortened_url), '.')) a WITH OFFSET b
)
SELECT url,
  ARRAY_AGG(STRUCT(shortened_url, val) ORDER BY weight DESC LIMIT 1)[OFFSET(0)].*
FROM (
  SELECT url, shortened_url, COUNT(1) weight, ANY_VALUE(val) val
  FROM data_table_parts d
  JOIN lookup_table_parts l
  ON x = a AND y = b
  GROUP BY url, shortened_url
  HAVING weight = ANY_VALUE(len)
)
GROUP BY url
with result as:
Row  url                  shortened_url  val
1    cn457.abc.tech.com   tech.com       1
2    cn456.abcd.tech.com  abcd.tech.com  2
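Since the question also mentions Apache Beam/Spark, the same reversed-split, position-preserving equi-join idea can be sketched in PySpark. This is only a sketch mirroring the sample tables above; the intermediate names (data_parts, lookup_parts, etc.) are illustrative.

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

data = spark.createDataFrame(
    [("cn456.abcd.tech.com",), ("cn457.abc.tech.com",), ("cn458.ab.com",)], ["url"])
lookup = spark.createDataFrame(
    [("tech.com", 1), ("abcd.tech.com", 2)], ["shortened_url", "val"])

# Explode each value into (position, part) pairs, reading the labels right to left
# (reversing the array of labels is equivalent, for matching, to reversing the string).
data_parts = data.select(
    "url", F.posexplode(F.reverse(F.split("url", r"\."))).alias("pos", "part"))
lookup_parts = lookup.select(
    "shortened_url", "val",
    F.size(F.split("shortened_url", r"\.")).alias("len"),
    F.posexplode(F.reverse(F.split("shortened_url", r"\."))).alias("pos", "part"))

# Equi-join on (part, position); a lookup entry qualifies only if all of its parts match.
matches = (
    data_parts.join(lookup_parts, ["part", "pos"])
    .groupBy("url", "shortened_url", "len", "val")
    .agg(F.count("*").alias("weight"))
    .where(F.col("weight") == F.col("len"))
)

# Keep the longest matching suffix per url.
w = Window.partitionBy("url").orderBy(F.col("weight").desc())
best = matches.withColumn("rn", F.row_number().over(w)).where("rn = 1").drop("rn")
best.show(truncate=False)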

Table comprehensions: get subset from internal table into another one

As stated in the topic, I want to get a subset of an internal table, selected by a condition, into another internal table.
Let us first look at what it may look like the old-fashioned way.
DATA: lt_hugeresult TYPE tty_mytype,
lt_reducedresult TYPE tty_mytype.
SELECT "whatever" FROM "wherever"
INTO CORRESPONDING FIELDS OF TABLE lt_hugeresult
WHERE "any_wherecondition".
IF sy-subrc = 0.
lt_reducedresult[] = lt_hugeresult[].
DELETE lt_reducedresult WHERE col1 EQ 'a value'
AND col2 NE 'another value'
AND col3 EQ 'third value'.
.
.
.
ENDIF.
We all may know this.
Now I was reading about the table comprehension features, which were introduced
with ABAP 7.40, apparently SP8.
Table Comprehensions – Building Tables Functionally
Table-driven:
VALUE tabletype( FOR line IN tab WHERE ( … )
( … line-… … line-… … )
)
For each selected line in the source table(s), construct a line in the result table. Generalization of value constructor from static to dynamic number of lines.
I was experimenting with that, but the results do not really seem to fit;
perhaps I am doing it wrong, or I might even need the condition-driven approach.
So, what would it look like if I wanted to write the above statement with table comprehension techniques?
Until now I have this, which does not deliver what I need, and it seems
as if "not equal" is not possible...
DATA(reduced) = VALUE tty_mytype( FOR checkline IN lt_hugeresult
WHERE ( col1 = 'a value' )
( col2 = 'another value' )
( col3 = space )
).
Anyone having some hints ?
EDIT: It still seems not to work. Here is how I do it (screenshots of the executable line, the debugger results, and the wrong reduced table omitted). And what now?
You could use the FILTER operator with the EXCEPT WHERE addition to filter out any rows that match the where clause:
lt_reducedresult = FILTER # ( lt_hugeresult EXCEPT WHERE col1 = 'a value'
AND col2 <> 'another value'
AND col3 = 'a third value' ).
Note that lt_hugeresult would have to be a sorted table, and the col1/col2/col3 need to be key components (you can specify a secondary key using the USING KEY addition).
The documentation for FILTER explicitly notes that:
Table filtering can also be performed using a table comprehension or a table reduction with an iteration expression for table iterations with FOR. The operator FILTER provides a shortened format for this special case and is more efficient to execute.
A table filter constructs the result row by row. If the result contains almost all rows in the source table, this method can be slower than copying the source table and deleting the surplus rows from the target table.
So your approach of using DELETE might actually be appropriate depending on the size of the table.
Table iterations can be quite confusing when you use WHERE, because of the parenthesis groups.
The "NOT EQUAL" condition is very well supported, as I show below in the solution to your first example. The issue you observe is due to improper use of the parenthesis groups.
You must define the whole logical expression after WHERE inside ONE parenthesis group (one, or several, elementary conditions separated by logical operators AND, OR, etc.).
After the parenthesis group for WHERE, you usually define only one parenthesis group, which corresponds to the line to be added to the target internal table. You may define subsequent parenthesis groups if, for each line in the source internal table, you want to add several lines to the target internal table.
In your example, only the first parenthesis group applies to WHERE (either col1 = 'a value' in your first example, or insplot = _ilnum in your second example).
The subsequent parenthesis groups correspond to the lines to be added, i.e. 2 lines are added for each source line in the first example (one line with col2 = 'another value', and one line with col3 = space), and 3 lines are added for each source line in the second example (one line with inspoper = i_evaluation-inspoper, one line with inspchar = i_evaluation-inspchar, and one line corresponding to the line of _single_results).
So, you should write your code as follows.
First example :
DATA(reduced) = VALUE tty_mytype( FOR checkline IN lt_hugeresult
WHERE ( col1 = 'a value'
AND col2 <> 'another value'
AND col3 = 'third value'
)
( checkline )
).
Second example :
DATA(singres) = VALUE tbapi2045d4( FOR checkline IN _single_results
WHERE ( insplot = _ilnum
AND inspoper = i_evaluation-inspoper
AND inspchar = i_evaluation-inspchar
)
( checkline )
).
I compared the old-fashioned syntax of your above example with the table comprehension technique and got exactly the same result.
Actually, your sample is not functional because it lacks the row specification for the constructed table reduced.
Try this one, which worked for me.
DATA(reduced) = VALUE tty_mytype( FOR checkline IN lt_hugeresult
WHERE ( col1 = 'a value' AND
col2 = 'another value' AND
col3 = space )
( checkline )
).
In the above sample we have the most basic type of result row specification, where it is exactly the same as the source table line. More sophisticated examples, where new table rows are built within table iterations, can be found here.
