Azure Data Factory: validating the header row and checksum of a fixed-length txt file before insert

I am currently using Azure Data Factory to retrieve a fixed-length file from blob storage and import its records into my database.
Fixed-length.txt
0202212161707
1Tom
1Kelvin
1Michael
23
The first row is the header record, which starts with '0' and is followed by the creation time.
The following rows are the detail records, which start with '1' and contain a user name.
The last row is the end record, which starts with '2' and contains the count of the detail records.
However, I want to validate that the data in the file is correct before I insert the records. I would like to check that the checksum is correct first, and only then insert the records that start with 1.
Currently, I insert the records line by line into a SQL DB and run a stored procedure to perform these tasks. Is it possible to use Azure Data Factory to do it? Thank you.

I reproduced your scenario with the steps below.
First, take a Lookup activity to read all the data from the file (a Filter will be applied to this data later).
Then take a Set variable activity (variable sum) to get the checksum digit from the last row's value (e.g. 3 from 23) with the dynamic expression below.
@last(activity('Lookup1').output.value[sub(length(activity('Lookup1').output.value),1)].Prop_0)
Then take a Filter activity to keep only the rows with the 1 prefix, using the items value and condition below.
items : @activity('Lookup1').output.value
condition : @startswith(item().Prop_0,'1')
After the Filter, take a ForEach activity over the filtered rows to append those values to an array.
Inside the ForEach, take an Append variable activity; it builds an array (variable username) from the filtered values, as sketched below.
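A minimal sketch of those two settings, assuming the Filter activity is named Filter1 and the array variable is named username:
ForEach items : @activity('Filter1').output.value
Append variable value : @item().Prop_0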
Now take an If Condition activity with an expression that checks whether the value of the sum variable equals the length of the appended username array.
@equals(int(variables('sum')),length(variables('username')))
Then, inside the True branch, add your Copy activity to copy the data when the condition is true.
My Sample Output:
0202212161707
1Tom
1Kelvin
23
For the data above, control goes to the False branch (two detail rows but a checksum of 3).
0202212161707
1Tom
1Kelvin
1Michael
23
For the data above, control goes to the True branch (three detail rows matching the checksum of 3).

Related

Upsert Option in ADF Copy Activity

With the "upsert option" , should I expect to see "0" as "Rows Written" in a copy activity result summary?
My situation is this: The source and sink table columns are not exactly the same but the Key columns to tell it how to know the write behavior are correct.
I have tested and made sure that it does actually do insert or update based on the data I give to it BUT what I don't understand is if I make ZERO changes and just keep running the pipeline , why does it not show "zero" in the Rows Written summary?
The main reason rowsWritten is not shown as 0, even when the source and destination contain the same data, is this:
Upsert inserts a row when the key column value is absent from the target table, and updates the other columns whenever the key column is found in the target table.
Hence it modifies every source row irrespective of whether the data has changed. As with SQL MERGE, there is no way to tell the copy activity to ignore a row that already exists identically in the target table.
So even when the key column matches, the remaining columns are updated, and the row is counted as written. The following is an example of the two cases.
Case 1: the rows in source and sink are the same.
The rows present in both:
id,gname
1,Ana
2,Ceb
3,Topias
4,Jerax
6,Miracle
Since every key matches, all five rows are updated and counted as rows written even though nothing changed.
Case 2: inserting completely new rows.
The rows present in source (the sink data is as above):
id,gname
8,Sumail
9,ATF

Delete bottom two rows in Azure Data Flow

I would like to delete the bottom two rows of an excel file in ADF, but I don't know how to do it.
The flow I am thinking of is: filter for the rows to be deleted, then delete them.
The file has over 40,000 rows of data and is updated once a month. (The number of rows changes with each update, so the condition must be specified with a function.)
As for the contents of the file, the bottom two lines contain spaces and asterisks.
Any help would be appreciated. I'm new to Azure and having trouble.
Add a Surrogate Key transformation to put a row number on each row. Add a new branch to duplicate the stream, and in that new branch add an Aggregate.
Use the Aggregate transformation to find the max() value of the surrogate key counter.
Then subtract 2 from that max number and filter for just the rows up to that max minus 2, roughly as sketched below.
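A rough sketch of those expressions, assuming the surrogate key column is named sk and the max value is made available to the main stream (for example via a join on a dummy key, or a cache sink as in the detailed answer below):
Aggregate : maxsk = max(sk)
Filter : sk <= maxsk - 2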
Let me provide a more detailed answer here ... I think I can get it in here without writing a separate blog.
The simplest way to filter out the final 2 rows is a pattern depicted in the screenshot here. Instead of the new branch, I just created 2 sources both pointing to the same data source. The 2nd stream is there just to get a row count and store it in a cached sink. For the aggregation expression I used this: "count(1)" as the row count aggregator.
In the first stream, that is the primary data processing stream, I add a Surrogate Key transformation so that I can have a row number for each row. I called my key column "sk".
Finally, set the Filter transformation to only allow rows with a row number <= the max row count from the cached sink minus 2.
The Filter expression looks like this: sk <= cachedSink#output().rowcount-2

Aggregation in Azure Data Flow is Returning Invalid Value

I have created a data flow in Data Factory.
Step 1. Read the parquet file.
Step 2. Aggregate the file to get the Max(DateField)
Step 3. Use a derived column to write in a Value.
Step 4. Alter row task with Value and the DateField.
Step 5. Sink select the Watermark table to update.
The flow updates the value, but it isn't putting in the max value. The date value is incorrect. Any ideas?
The max() aggregate function doesn't work on date/string type columns; you must pass a column that contains numerical values. A date is not a valid input on which you can apply the max function, as there is no maximum-date term.
Instead, you can filter on the timestamp and get the latest or oldest date using ADF.
Refer to this answer by @Leon to learn how to implement the same.
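For example, one workaround along those lines (a sketch only; DateField and the output column names are illustrative) is to aggregate on a numeric form of the date and convert it back afterwards:
Aggregate : maxDateKey = max(toInteger(toString(DateField, 'yyyyMMdd')))
Derived column : maxDate = toDate(toString(maxDateKey), 'yyyyMMdd')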

Azure Data Factory - Exists transformation in Data Flow with generic dataset

I'm having issues using the Exists Transformation within a Data Flow with a generic dataset.
I have two sources (one from staging table "sourceStg", one from DWH table "sourceDwh") and want to compare if the UniqueIdentifier-Column in the staging table is existing in the UniqueIdentifier-Column in the DWH table. For that I have a generic data set which I query with a SQL statement containing parameters.
When I open the "Exists settings" I cannot choose any Column from the source in the conditions since the source is generic and has no Projection until I run the data flow. However, I have a parameter which I get from the parent pipeline which provides me the name of the Column containing the UniqueIdentifier (both column names in staging / DWH are the same).
I tried adding the statement "byName($UniqueIdentifier)" in the left and right column fields, but the engine resolves both to the sourceStg column, since the prefix of the source transformation is missing and it defaults to the first one. What I am basically trying to achieve now is a statement like the following, which specifies the correct source transformation and the column containing the unique identifier via a parameter.
exists(sourceStg#$UniqueIdentifier == sourceDwh#$UniqueIdentifier)
But either the expression cannot be parsed or the result does not retrieve the actual UniqueIdentifier value from the column but writes the statement (e.g. sourceStg#$UniqueIdentifier) as column value.
The only workaround I have found so far is to use two derived columns that add a suffix to the UniqueIdentifier column in one source, plus a new parameter $UniqueIdentifierDwh which is populated with the parameter $UniqueIdentifier and the same suffix as used in the derived column, roughly as sketched below.
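Roughly, that workaround could look like this (all names are placeholders, not a verified recipe): in the sourceDwh stream, a derived column pattern matching name == $UniqueIdentifier outputs the value $$ under the name $$ + '_dwh'; the parent pipeline passes $UniqueIdentifierDwh as @concat(pipeline().parameters.UniqueIdentifier, '_dwh'); and the Exists condition then becomes
byName($UniqueIdentifier) == byName($UniqueIdentifierDwh)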
Any Azure Data Factory experts out there to help?
Thanks in advance!

Azure Data Factory - If Condition returns false despite being logically true

I'm trying to do a logical test to compare two activity outputs.
The first one gives back a file name (derived from Get Metadata) and the other one gives the distinct filenames that are already in the database (derived from a Lookup activity).
So the first activity gives X.csv (a file in a blob), while the second one gives a list Y.csv; Z.csv (the result of the lookup SELECT DISTINCT from table X).
Based on this outcome I would say that the logical test is true, so ADF has to start a particular activity. I'm using the expression below, but despite the fact that there are no errors, the outcome is always false. What am I doing wrong? I guess it has something to do with the Lookup activity, because the query will give a list of values, I think.
Please help, thanks in advance!
@equals(activity('GetBlobName').output,activity('LookupBestandsnaam').output)
Output of activity LookupBestandsnaam and output of activity GetBlobName: (shown in screenshots.)
The outputs of Lookup and Get Metadata are different:
The Lookup activity reads and returns the content of a configuration file or table.
The Get Metadata activity retrieves the metadata of any data in Azure Data Factory.
We can't compare these outputs directly; you will always get false in the If Condition expression.
Please try the expression below:
@equals(activity('GetBlobName').output.value.name,activity('LookupBestandsnaam').output.value.bestandsnaam)
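If the Lookup returns multiple rows (i.e. "First row only" is unchecked), a membership-style check may be needed instead. One possible illustration, assuming the Get Metadata field list includes itemName:
@contains(string(activity('LookupBestandsnaam').output.value), activity('GetBlobName').output.itemName)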
Update:
Congratulations on solving it another way:
"I have now replaced the If Condition with a stored procedure that uses an IF EXISTS script, running on the basis of the Lookup activity in ADF."
