Upsert Option in ADF Copy Activity - azure

With the upsert option, should I expect to see "0" as "Rows Written" in a copy activity result summary?
My situation is this: the source and sink table columns are not exactly the same, but the key columns that tell it the write behavior are correct.
I have tested and confirmed that it does insert or update based on the data I give it, BUT what I don't understand is: if I make ZERO changes and just keep re-running the pipeline, why does it not show "zero" in the Rows Written summary?

The main reason rowsWritten is not shown as 0 even when the source and destination contain the same data is:
Upsert inserts a row when the key column value is absent from the target table and updates the remaining columns whenever the key column value is found in the target table.
Hence, it modifies every record regardless of whether the data has changed. As with a SQL MERGE whose WHEN MATCHED clause has no extra condition, there is no way to tell the copy activity to ignore a row when the entire row already exists in the target table.
So even when the key column matches, it updates the values of the rest of the columns, and the row is counted as written. The following two cases illustrate this; a hedged SQL MERGE sketch follows the example data.
Case 1: the rows in source and sink are the same. Both contain:
id,gname
1,Ana
2,Ceb
3,Topias
4,Jerax
6,Miracle
Case 2: inserting completely new rows. The source contains (with the sink as above):
id,gname
8,Sumail
9,ATF
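Below is a minimal SQL MERGE sketch of what the upsert effectively does, using the id/gname columns from the example and assumed table names (staging_table, target_table). The commented-out AND condition is the kind of check you could add in hand-written SQL to skip unchanged rows; the copy activity's upsert exposes no equivalent.

-- Minimal sketch; staging_table and target_table are assumed names.
MERGE INTO target_table AS t
USING staging_table AS s
    ON t.id = s.id
WHEN MATCHED
    -- Upsert has no equivalent of an extra condition like the one below,
    -- so every matched row is updated and counted as written:
    -- AND t.gname <> s.gname
    THEN UPDATE SET t.gname = s.gname
WHEN NOT MATCHED BY TARGET
    THEN INSERT (id, gname) VALUES (s.id, s.gname);

With the condition left out, which is how the upsert behaves, case 1 still updates every matched row, so rowsWritten reflects the source row count rather than 0.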

Related

Updating a table "Y" in file "B" with newly added rows from table "X" in file "A"

I am trying to create an "instant cloud flow" on Power Automate.
Context:
Excel file "A" contains a table "X" which gets updated regularly with new rows at the bottom containing new information.
Excel file "B" contains a table "Y" with the same characteristics, number of columns, headers etc. Except for the number of rows since the table "X" is being updated regularly with new rows of data.
Both files are stored on OneDrive cloud and will possibly move into Sharepoint file storage, so they will be in the cloud, not stored locally on any device.
Challenge/need:
I need table "Y", within file "B", to mirror the changes happening to table "X" in file "A", specifically the new rows of data being added to table "X":
Internet/world > new rows of data at the bottom of table "X" of file "A" > these same new rows get copied to the bottom of table "Y" of file "B". Basically, both tables "X" and "Y" need to stay exactly the same, with a maximum lag of 3 minutes.
Solution tried:
I tried a flow that gets triggered every minute. In this flow, I tried creating an array containing the new rows of data added to table "X". Then, using the Apply to each control with the values from this new array, I tried the actions Add a row into a table, followed by Update a row, for each item inside the array, keeping table "Y" updated as per table "X" this way. This part works: rows are added and updated in table "Y".
My problem:
The Condition that compares the data from the two tables decides that all rows from table "X" are new data, even though some are already present in table "Y". This is a problem because too many rows are added to table "Y" and the tables go out of sync due to the difference in the number of rows/body length. In my understanding, this happens because List rows present in a table generates an item/object called ItemInternalId.
This ItemInternalId gets a different id for the same rows each time they are read, and because of this the condition identifies all rows in table "X" as new data to be written to table "Y".
Questions:
Could someone confirm that this ItemInternalId is the problem here? I am in doubt because I tried removing it by creating another array with the Select action, keeping only the columns/headers I need and excluding ItemInternalId. The problem is that the header is excluded as well (which I need), leaving only the values, and the condition still identifies all rows in "X" as new data anyway...
Or maybe I am doing it wrong and there is a simpler or better way to get an array with the new items from table "X"? Here is the condition that I use to try to feed a new array with the new rows from table "X":
Thank you
I found a workaround. I will not accept this as the right answer because it is just a workaround, not a definitive solution to the problem.
Basically, file "A" needs to have a table "X" with just one blank row. The Power Automate flow will add new rows with the incoming information to this table.
Then, in file "B", table "Y" needs to be created with a certain number of rows depending on how much data comes in per day (it can be something like 100). Then create a Power Automate flow that updates the table; this adds the information from table "X" to table "Y".
Please be aware that you will need a Key column in both tables so that Power Automate knows which rows to update (the matching logic is sketched below). You can just use a basic numerical order for each row in the Key column.
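For clarity, the matching logic this key-column workaround relies on can be sketched in SQL terms. This is purely illustrative, not anything Power Automate exposes; source_x, target_y, key_col and the other column names are invented stand-ins for table "X", table "Y" and their columns.

-- Illustrative only: key-based sync logic expressed in SQL.
-- Add rows whose key does not exist in "Y" yet.
INSERT INTO target_y (key_col, col1, col2)
SELECT s.key_col, s.col1, s.col2
FROM source_x AS s
WHERE NOT EXISTS (SELECT 1 FROM target_y AS t WHERE t.key_col = s.key_col);

-- Update rows whose key already exists in "Y".
UPDATE t
SET t.col1 = s.col1,
    t.col2 = s.col2
FROM target_y AS t
JOIN source_x AS s
    ON s.key_col = t.key_col;

The flow's Add a row into a table and Update a row actions play the same two roles, with the Key column doing the job of key_col.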

Delete bottom two rows in Azure Data Flow

I would like to delete the bottom two rows of an Excel file in ADF, but I don't know how to do it.
The flow I am thinking of is this.
I intend to filter out and delete the rows highlighted in yellow.
The file has over 40,000 rows of data and is updated once a month. (The number of rows changes with each update, so the condition must be specified with a function.)
As for the contents of the file, the bottom two rows contain spaces and asterisks.
I'm new to Azure and having trouble, so any help would be appreciated.
Add a Surrogate Key transformation to put a row number on each row. Add a New Branch to duplicate the stream, and in that new branch add an Aggregate transformation.
Use the Aggregate transformation to find the max() value of the surrogate key counter.
Then subtract 2 from that max number and filter for just the rows with a row number up to max-2.
Let me provide a more detailed answer here ... I think I can get it in here without writing a separate blog.
The simplest way to filter out the final 2 rows is the pattern described below. Instead of the New Branch, I just created 2 sources, both pointing to the same data source. The 2nd stream is there just to get a row count and store it in a cached sink. For the aggregate expression I used count(1) as the row count aggregator.
In the first stream, that is the primary data processing stream, I add a Surrogate Key transformation so that I can have a row number for each row. I called my key column "sk".
Finally, set the Filter transformation to only allow rows with a row number <= the max row count from the cached sink minus 2.
The Filter expression looks like this: sk <= cachedSink#output().rowcount-2
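For readers more comfortable with SQL, the same pattern can be sketched as follows; src is an assumed table name, and the ORDER BY is arbitrary because, unlike the data flow's surrogate key (which follows file order), a SQL table has no inherent row order. The point is only the arithmetic: keep rows whose number is at most the total count minus 2.

-- Illustrative SQL equivalent of the surrogate key + row count + filter pattern.
WITH numbered AS (
    SELECT *,
           ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS sk,        -- surrogate key
           COUNT(*) OVER ()                           AS total_rows -- cached row count
    FROM src
)
SELECT *
FROM numbered
WHERE sk <= total_rows - 2;   -- same idea as sk <= cachedSink#output().rowcount-2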

Dataflow Data Factory Azure - Exists

I have been trying to create a data flow to check for "changed data" between two sources, the staging table and the target one. I want to clean all possible duplicates out of the staging data before sinking it into the target table.
I have tried different ways without success. I am wondering if there is a safe and efficient way to do what I want.
The source table is an external table with 77 columns.
The target table is a physical one with 77 columns.
The data types of the columns are the same in both.
First try:
I tried the "exists / does not exist" transformation.
On the first try, I set a "does not exist" transformation with all the columns of the staging table equal to all the columns of the target table, intending to sink only the rows that do not match that condition.
Example:
STAGING#COLUMNA == TARGET#COLUMNA &&
STAGING#COLUMNB == TARGET#COLUMNB &&
STAGING#COLUMNC == TARGET#COLUMNC
...
Result: it did not work and let all the duplicates into the target table. The comparison seems unreliable for columns that are not varchar. I tried coalesce and trim, but again without success.
Second Try
Therefore, I tried to create a row signature with:
a derived column on both sources (staging and target):
sha2(256,COLUMNA,COLUMNB,COLUMNC ... )
and a "does not exist" transformation with:
STAGING#FINGERPRINT == TARGET#FINGERPRINT
Result: once again it did not work. My test had more than 10k duplicate rows, and all of them were inserted again.
Would anyone have a different approach?
The solutions are here:
Distinct rows: https://www.youtube.com/watch?v=ryYo8UFUgTI
Dedupe: https://www.youtube.com/watch?v=QOi26ETtPTw
Hashing: https://www.youtube.com/watch?v=Id82NZo9hxM
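In SQL terms, the fingerprint idea from the second try looks roughly like the sketch below; HASHBYTES stands in for the data flow's sha2(), the table names come from the description, and everything else is an assumption. One common reason this kind of comparison appears not to work is that NULLs and trailing spaces are normalized differently on the two sides, so the hashes never match; CONCAT_WS is used here because it skips NULL arguments instead of turning the whole expression into NULL.

-- Illustrative only: hash-based "does not exist" check expressed in SQL.
INSERT INTO target_table (COLUMNA, COLUMNB, COLUMNC)
SELECT s.COLUMNA, s.COLUMNB, s.COLUMNC
FROM staging_table AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM target_table AS t
    WHERE HASHBYTES('SHA2_256', CONCAT_WS('|', t.COLUMNA, t.COLUMNB, t.COLUMNC))
        = HASHBYTES('SHA2_256', CONCAT_WS('|', s.COLUMNA, s.COLUMNB, s.COLUMNC))
);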

How to avoid having Power Query reorder my data when merging queries/expanding

I have mismatched data lines in a Power Query, so I am attempting to renumber/reorganize the data and then merge the information to realign it.
Here, I want the data in column Answer 2 to go into column Answer, cells 6, 7, 11, and 12.
I've indexed each of my files and merged the queries. However, when I expand the merged queries, PQ seems to randomize my data.
I'm new to PQ, so I don't really write the 'code', I just use the user interface.
As you can see from the second image, the data comes out in the wrong order.
I merged the two tables, then added an index column and moved it to the beginning, then expanded the merged table and deleted the index column. The order of the rows remained the same as in the source table.

ADFv2 trouble with column mapping (reposting)

I have a source .csv with 21 columns and a destination table with 25 columns.
Not ALL columns within the source have a home in the destination table and not all columns in the destination table come from the source.
I cannot get my CopyData task to let me pick and choose how I want the mapping to be. The only way I can get it to work so far is to load the source data to a "holding" table that has a 1:1 mapping and then execute a stored procedure to insert data from that table into the final destination.
I've tried altering the schemas on both the source and destination to match but it still errors out because the ACTUAL source has more columns than the destination or vice versa.
This can't possibly be the most efficient way to accomplish this but I'm at a loss as to how to make it work.
Yes, I have tried the user interface; yes, I have tried the column schemas; no, I can't modify the source file and shouldn't need to.
The error code that is returned is some variation on:
"errorCode": "2200",
"message": "ErrorCode=UserErrorInvalidColumnMappingColumnCountMismatch,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=Invalid column mapping provided to copy activity: '{LONG LIST OF COLUMN MAPPING HERE}', Detailed message: Different column count between target structure and column mapping. Target column count:25, Column mapping count:16. Check column mapping in table definition.,Source=Microsoft.DataTransfer.Common,'",
"failureType": "UserError",
"target": "LoadPrimaryOwner"
Tim F., please review the statements in the Schema mapping in copy activity documentation:
Column mapping supports mapping all or a subset of columns in the source
dataset "structure" to all columns in the sink dataset "structure".
The following are error conditions that result in an exception:
1. Source data store query result does not have a column name that is specified in the input dataset "structure" section.
2. Sink data store (if with pre-defined schema) does not have a column name that is specified in the output dataset "structure" section.
3. Either fewer columns or more columns in the "structure" of sink dataset than specified in the mapping.
4. Duplicate mapping.
So, you can see that all the columns in the sink dataset need to be mapped. Since you can't change the destination, maybe you shouldn't struggle with an unsupported feature.
Of course, you could use the stored procedure mentioned in your description. That's a perfectly workable workaround and not very troublesome (a hedged sketch follows this answer). For usage details, you could refer to my previous cases:
1. Azure Data Factory activity copy: Evaluate column in sink table with #pipeline().TriggerTime
2. Azure Data Factory copy activity failed mapping strings (from csv) to Azure SQL table sink uniqueidentifier field
In addition, if you really want to avoid the above solution, you could submit feedback to the ADF team about your desired feature.
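A hedged sketch of that holding table + stored procedure workaround, assuming the copy activity first loads the csv 1:1 into a holding table and a Stored Procedure activity then runs something like the procedure below. All object and column names here (dbo.Holding, dbo.PrimaryOwner, owner_id, OwnerName, ...) are invented for illustration; only the overall shape matches what the question describes.

-- Illustrative only: map the subset of csv columns that have a home in the destination.
CREATE PROCEDURE dbo.usp_LoadPrimaryOwner
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.PrimaryOwner (OwnerId, OwnerName, LoadDate)  -- subset of the 25 destination columns
    SELECT h.owner_id,
           h.owner_name,
           GETDATE()                                             -- destination-only columns get defaults here
    FROM dbo.Holding AS h;                                       -- 1:1 image of the 21 csv columns

    TRUNCATE TABLE dbo.Holding;                                  -- reset the holding table for the next run
END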
