I would like to delete the bottom two rows of an Excel file in ADF, but I don't know how to do it.
The flow I am thinking of is this.
*I intend to use a filter to delete the rows highlighted in yellow.
The file has over 40,000 rows of data and is updated once a month. (The number of rows changes with each update, so the condition must be specified with a function.)
The contents of the file are also shown here.
The bottom two lines contain spaces and asterisks.
I'm new to Azure and having trouble, so any help would be appreciated.
Add a surrogate key transformation to put a row number on each row. Add a new branch to duplicate the stream and in that new branch, add an aggregate.
Use the aggregate transformation to find the max() value of the surrogate key counter.
Then subtract 2 from that max number and filter for just the rows up to that max-2.
Let me provide a more detailed answer here ... I think I can get it in here without writing a separate blog.
The simplest way to filter out the final 2 rows is the pattern depicted in the screenshot here. Instead of the new branch, I just created 2 sources, both pointing to the same data source. The 2nd stream is there just to get a row count and store it in a cached sink. For the aggregate expression, I used count(1) as the row-count aggregator.
In the first stream, which is the primary data-processing stream, I add a Surrogate Key transformation so that I have a row number for each row. I called my key column "sk".
Finally, set the Filter transformation to only allow rows with a row number <= the max row count from the cached sink minus 2.
The Filter expression looks like this: sk <= cachedSink#output().rowcount-2
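If it helps to sanity-check the logic outside of ADF, here is a rough pandas sketch of the same pattern (surrogate key as a row number, a cached row count, then a filter). The file names are placeholders, not anything taken from the pipeline above.

import pandas as pd

# Rough, illustrative equivalent of the data flow above; "monthly_extract.xlsx" is a placeholder.
df = pd.read_excel("monthly_extract.xlsx")

# Surrogate Key transformation: a 1-based row number on every row.
df["sk"] = range(1, len(df) + 1)

# Aggregate / cached sink: the total row count.
rowcount = len(df)

# Filter: keep only rows with sk <= rowcount - 2, which drops the last two rows.
trimmed = df[df["sk"] <= rowcount - 2].drop(columns="sk")

trimmed.to_excel("monthly_extract_trimmed.xlsx", index=False)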
Related
I am trying to create a data table with multiple outputs across periods, but for the same scenarios.
Is it possible to create that without inserting an extra column between each output column to deliver the input for the data table (i.e., input column = index 50-110)?
Is this in any way possible? See the picture of what I would usually mark to create the data table (this only covers one period/output, though). But if I were to make the scenario for FY23, I would need to insert a column between FY22 and FY23 where I copy the index 50-110 again. I would like to avoid that.
With the "upsert" option, should I expect to see "0" as "Rows Written" in the copy activity result summary?
My situation is this: the source and sink table columns are not exactly the same, but the key columns that determine the write behavior are set correctly.
I have tested and confirmed that it does insert or update based on the data I give it. What I don't understand is: if I make zero changes and just keep re-running the pipeline, why does it not show "zero" in the Rows Written summary?
The main reason why rowsWritten is not shown as 0, even when the source and destination have the same data, is this:
Upsert inserts a row when the key column value is absent from the target table and updates the remaining columns whenever the key column value is found in the target table.
Hence, it modifies all records irrespective of whether the data has changed. As with SQL MERGE, there is no way to tell the copy activity to ignore a row when the entire row already exists in the target table.
So, even when the key column matches, it is going to update the values of the rest of the columns, and each such row is counted as a row written. The following is an example of two cases.
Case 1: the rows of source and sink are the same.
The rows present in both are:
id,gname
1,Ana
2,Ceb
3,Topias
4,Jerax
6,Miracle
Case 2: inserting completely new rows.
The rows present in the source are (the sink data is as above):
id,gname
8,Sumail
9,ATF
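To make the counting behaviour concrete, here is a small, purely illustrative Python simulation of the logic described above; it is not ADF code, just a sketch of why an unconditional upsert reports every source row as written whether or not its values changed.

# Illustrative simulation only; the sink data matches the example above.
sink = {1: "Ana", 2: "Ceb", 3: "Topias", 4: "Jerax", 6: "Miracle"}

def upsert(source_rows, sink_table):
    rows_written = 0
    for key, name in source_rows:
        # There is no "skip if identical" check: a key match is always an update.
        sink_table[key] = name
        rows_written += 1
    return rows_written

# Case 1: source identical to the sink -> still reports 5 rows written.
same_rows = [(1, "Ana"), (2, "Ceb"), (3, "Topias"), (4, "Jerax"), (6, "Miracle")]
print(upsert(same_rows, dict(sink)))  # 5

# Case 2: completely new rows -> 2 inserts, 2 rows written.
new_rows = [(8, "Sumail"), (9, "ATF")]
print(upsert(new_rows, dict(sink)))   # 2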
I have a delimited file separated by hashes that looks somewhat like this:
value#value#value#value#value#value##value
value#value#value#value##value#####value#####value
value#value#value#value###value#value####value##value
As you can see, when separated by hashes, there are more columns in the 2nd and 3rd rows than there are in the first. I want to be able to ingest this into a database using an ADF Data Flow after some transformations. However, whenever I try to do any kind of mapping, I only ever see 7 columns (the number of columns in the first row).
Is there any way to get all of the values, with as many columns as there are in the row with the most items? I do not mind the nulls.
Note: I do not have a header row for this.
Azure Data Factory will not be able to directly import the schema from the row with the maximum number of columns. Hence, it is important to make sure all rows in your file have the same number of columns.
You can use Azure Functions to validate your file and update it so that every row has an equal number of columns.
You could try keeping a local sample file whose first row has the maximum number of columns and import the schema from that file; otherwise, you will have to go with an Azure Function that converts the file and then triggers the pipeline.
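If you do go the pre-processing route (for example inside an Azure Function), a small script along these lines could pad every row out to the widest row before the Data Flow reads the file. The file names are placeholders.

# Pad every row of a hash-delimited file to the maximum column count (illustrative sketch).
with open("input.txt") as f:
    rows = [line.rstrip("\n").split("#") for line in f]

max_cols = max(len(row) for row in rows)

with open("padded.txt", "w") as f:
    for row in rows:
        padded = row + [""] * (max_cols - len(row))  # empty strings land as nulls on ingest
        f.write("#".join(padded) + "\n")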
I am trying to generate a Batch ID based on Course, Date, and Time. All rows that have the same Course+Date+Time combination should have the same Batch ID. All subsequent combinations should have incremental IDs.
Batch ID = LEFT(C2,3)&TEXT(<code formula>,"000")
No VBA, only Excel 2016 formula, please.
Sample data snapshot
Bit of a stretch but try in F2:
=IF(COUNTIFS(C$2:C2,C2,D$2:D2,D2,E$2:E2,E2)>1,LOOKUP(2,1/((C$1:C1=C2)*(D$1:D1=D2)*(E$1:E1=E2)),F$1:F1),UPPER(LEFT(C2,3))&TEXT(MAX(IFERROR((LEFT(F$1:F1,3)=LEFT(C2,3))*RIGHT(F$1:F1,3),0))+1,"000"))
Enter through Ctrl+Shift+Enter.
It would be easier and more readable to meet your requirement in three steps, rather than a single formula.
Create a unique ID based on the Course, Date and Time.
Formula:
=CONCATENATE(UPPER(LEFT($C3,3)),TEXT($D3,"ddmmyy"),TEXT($E3,"hhmm"))
Breakdown:
LEFT($C3,3) = take the first three characters of the Course
UPPER() = make the first three characters of the Course uppercase
TEXT($D3,"ddmmyy") = take the date, turn it into text and apply a format
TEXT($E3,"hhmm") = take the time, turn it into text and apply a format
Create a lookup table of the unique ID and Batch ID
Copy all the unique Ids that have been created in step 1
Paste them into a new column separate to your data
On the Data menu tab, select Remove Duplicates in the data tools
Add the Batch ID to the lookup table.
This way the Batch ID can be generated via formula if the unique IDs are sorted using Sort A to Z.
See the attached image.
Lookup the unique ID to get the Batch ID
=VLOOKUP($F3,$I$3:$J$7,2,FALSE)
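The question asks for a formula-only solution, but if you ever want to cross-check the logic outside Excel, the same idea (one Batch ID per distinct Course+Date+Time combination, with a counter that restarts per course prefix) looks roughly like this in Python. The sample rows and column names are made up for illustration.

# Illustrative cross-check of the Batch ID logic; not an Excel solution.
import pandas as pd

df = pd.DataFrame({
    "Course": ["Excel", "Excel", "Python", "Excel"],
    "Date":   ["01-01-2023", "01-01-2023", "02-01-2023", "03-01-2023"],
    "Time":   ["09:00", "09:00", "10:00", "09:00"],
})

seen = {}      # (course, date, time) -> Batch ID already assigned
counters = {}  # course prefix -> last sequence number used
batch_ids = []

for course, date, time in zip(df["Course"], df["Date"], df["Time"]):
    key = (course, date, time)
    if key not in seen:
        prefix = course[:3].upper()
        counters[prefix] = counters.get(prefix, 0) + 1
        seen[key] = f"{prefix}{counters[prefix]:03d}"  # e.g. EXC001
    batch_ids.append(seen[key])

df["Batch ID"] = batch_ids
print(df)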
I have a situation where I have data in this kind of format.
There are thousands of rows with this status. What I would like to have is a new table where the 2nd and 3rd rows are removed and only the bottom row is left for reporting.
Currently, I have a VBA macro that first concatenates [sales document and product], then checks for and tags repeating values. For the tagged lines, the concatenated value times the billed price is matched against the next line's value (-1 * concatenated next value * billed price), and both lines are deleted in a loop.
This operation sometimes takes a long time, as the file can be big. I would like to move to Power Query because my other related files and transformations already happen there.
Would be glad if anyone can help me.
BR,
Manoj
I would recommend doing a Group By on the first four columns and using Sum as your aggregation for the billing column. Then simply filter out the 0 rows.
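Purely to illustrate that Group By / Sum / drop-zero logic (in Power Query: Group By with a Sum aggregation on the billing column, then a filter), here is a rough pandas sketch. The column names and file name are assumptions based on the question, not the actual headers.

# Illustrative sketch of the Group By -> Sum -> drop-zero logic; column names are assumed.
import pandas as pd

df = pd.read_excel("billing.xlsx")  # placeholder file name

grouped = (
    df.groupby(["Sales Document", "Product", "Status", "Currency"], as_index=False)
      ["Billed Price"]
      .sum()
)

# Offsetting +/- pairs sum to zero, so dropping zero totals leaves only the rows to report.
result = grouped[grouped["Billed Price"] != 0]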