Removing irrelevant data with Power Query - Excel

I have a situation where I have data in the following format.
There are thousands of rows with this status. What I would like is a new table where rows 2 and 3 are removed and only the bottom row is left for reporting.
Currently, I have a VBA macro that first concatenates [sales document and product], then checks for and tags repeating values. For the tagged lines, the concatenated value times the billed price is matched against the next line (-1 * next concatenated value * billed price) and both lines are deleted in a loop.
This operation can take a long time as the file can be big. I would like to move to Power Query because I already have transformations for other related files happening there.
I would be glad if anyone can help me.
BR,
Manoj

I would recommend doing a Group By on the first four columns and using Sum as your aggregation for the billing column. Then simply filter out the 0 rows.
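A minimal sketch of that in M, assuming the data is loaded as a query named SalesData, the first four columns are Sales Document, Product and two others (replace the placeholder names with your real column names), and the value column is Billed Price:

let
    Source = Excel.CurrentWorkbook(){[Name = "SalesData"]}[Content],
    // group on the first four columns (swap in the real column names)
    Grouped = Table.Group(
        Source,
        {"Sales Document", "Product", "Column3", "Column4"},
        {{"Billed Price", each List.Sum([Billed Price]), type number}}
    ),
    // keep only groups whose billed prices do not cancel out to zero
    Result = Table.SelectRows(Grouped, each [Billed Price] <> 0)
in
    Result

Because each reversal line carries the negated billed price, every cancelling pair sums to zero and is dropped by the filter, leaving only the surviving row per group.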

Related

Delete bottom two rows in Azure Data Flow

I would like to delete the bottom two rows of an Excel file in ADF, but I don't know how to do it.
The flow I am thinking of is this.
I intend to filter out and delete the rows marked in yellow.
The file has over 40,000 rows of data and is updated once a month. (The number of rows changes with each update, so the condition must be specified with a function.)
The bottom two lines of the file contain spaces and asterisks.
Any help would be appreciated.
I'm new to Azure and having trouble.
I need your help.
Add a surrogate key transformation to put a row number on each row. Add a new branch to duplicate the stream and in that new branch, add an aggregate.
Use the aggregate transformation to find the max() value of the surrogate key counter.
Then subtract 2 from that max number and filter for just the rows up to that max-2.
Let me provide a more detailed answer here ... I think I can get it in here without writing a separate blog.
The simplest way to filter out the final 2 rows is the pattern depicted in the screenshot here. Instead of the new branch, I just created 2 sources, both pointing to the same data source. The 2nd stream is there just to get a row count and store it in a cached sink, using count(1) as the aggregation expression for the row count.
In the first stream, which is the primary data processing stream, I add a Surrogate Key transformation so that I have a row number for each row. I called my key column "sk".
Finally, set the Filter transformation to only allow rows with a row number <= the max row count from the cached sink minus 2.
The Filter expression looks like this: sk <= cachedSink#output().rowcount-2

How to merge 2 BIG Tables into 1 adding up existing values with Power Query

I have 2 big tables (the 1st has 690K rows, the 2nd has 890K rows).
They have the same format and columns:
Username - Points - Bonuses - COLUMN D ... COLUMN K.
Let's say in the first table I have the "Original" usernames and in the 2nd table I have "New" usernames plus some of the "Original" usernames (so people who are still playing plus people who are new to the game).
What I'm trying to do is combine them so I can have their values summed up in a single table.
I've already made my tables proper System Tables.
I created their connection in the workbook.
I've tried to merge them but I keep getting fewer rows than I expect, so some records are being left out or not being summed.
I've tried Left Outer, Right Outer, Full Outer with no success.
This is where I'm standing:
As #Jenn said, I had to append the tables instead of merging them, and I also used a filter inside Power Query to remove all blanks/zeros before loading it into Excel. I was left with 500K unique rows instead of 1.6 million. Thanks for the comment!
I would append the tables, as indicated above. First load each table separately into Power Query, and then append one table to the other. The column names look a little long, and it may make sense to simplify them so that the system doesn't read them as different columns due to an inadvertent typo.
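A rough sketch of the append-then-sum step in M, assuming the two tables are loaded as queries named OriginalPlayers and NewPlayers (use your actual query names) and that Points and Bonuses are the numeric columns to be totalled per Username:

let
    OriginalTbl = Excel.CurrentWorkbook(){[Name = "OriginalPlayers"]}[Content],
    NewTbl = Excel.CurrentWorkbook(){[Name = "NewPlayers"]}[Content],
    // append (stack) the two tables rather than merging (joining) them
    Appended = Table.Combine({OriginalTbl, NewTbl}),
    // one row per username, with the numeric columns summed
    Summed = Table.Group(
        Appended,
        {"Username"},
        {
            {"Points", each List.Sum([Points]), type number},
            {"Bonuses", each List.Sum([Bonuses]), type number}
        }
    )
in
    Summed

The remaining numeric columns (D through K) would get their own List.Sum aggregations in the same Table.Group call, and any blank or zero rows can be filtered out afterwards, as in the accepted update above.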

How to invert a merge query in Power Query

I have a single-column table of customer account numbers and a main table containing 400,000 records pulled from an Access database. I want to remove all records from the main table where the customer account number can be found in the single-column table.
The merge query capability in Power Query allows me to return only the records where there is a match on the customer list (in addition to a variety of other variations on this theme), but I would like to know whether there is a way to invert this so that I return all records where the customer number does not appear in that list.
I have achieved this already by using the List.Contains function and adding a custom column to identify the rows to exclude and then filtering them out, but I think this is severely impacting the performance of my workbook. Refreshing the table that initially has 400,000 rows prior to this series of transformations takes a very long time, and all queries that depend on this table then also take a long time to refresh.
Thank you
If you do a Left Anti Join of your main table with the single-column table, this will give you your table filtered down to only the rows that do not match the single-column table.
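A minimal sketch in M, assuming the two existing queries are named MainTable and CustomersToExclude and the shared column is called Customer Account Number (adjust these names to match your workbook):

let
    // MainTable and CustomersToExclude are assumed to be existing queries in this workbook
    // Left Anti join keeps only the rows of MainTable with no match in CustomersToExclude
    AntiJoined = Table.NestedJoin(
        MainTable, {"Customer Account Number"},
        CustomersToExclude, {"Customer Account Number"},
        "Matches", JoinKind.LeftAnti
    ),
    // the nested table column added by the join is not needed afterwards
    Result = Table.RemoveColumns(AntiJoined, {"Matches"})
in
    Result

This also removes the need for the List.Contains custom column, which should help the refresh time.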

Excel Query looking up multiple values for the same name and presenting averages

Apologies if this has been asked before. I would be surprised if it hasn't but I am just not hitting the correct syntax to search and get the answer.
I have a table of raw data for my staff; it contains data on the name of the employee who completed a job and the start and finish times, among other things. I have no unique IDs other than the name, and I can't change that as I'm part of a large organisation and I have to make do with the data I'm given.
What I would like to do is present a table (Table 2) that shows the name of the employee, takes the start/finish times for all of their jobs in Table 1, and presents the average time taken across all of their jobs.
I have used VLOOKUP in the past but I'm not sure it will cut it here. The raw data table contains approximately 6,000 jobs each month.
On Table 1 I work out the time taken for each job with this formula:
=IF(V6>R6,V6-R6,24-R6+V6) (R = start time, V = completion time, in 24-hour clock)
I have gone this route as some jobs are started before midnight and completed afterwards. My raw data also contains the start and completion dates in separate columns, so I am open to an expert's feedback on whether there is a better way to work out the total time from start to completion.
I believe the easiest way to tackle this would be with a Pivot Table. Calculate the time taken for each Name and Job combination in Table 1, then create a pivot table with the Name in the Row Labels and the Time in the Values, and change the Time values to be an average instead of a sum.
Alternatively, you could create a unique list of names, perhaps with Data > Remove Duplicates, and then use an =AVERAGEIF formula.
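If the raw data is ever brought into Power Query instead (the theme of the other questions here), the same result can be produced with a Group By; a minimal sketch, assuming the jobs are loaded as a query named Table1 with a Name column and an already-calculated numeric TimeTaken column:

let
    Source = Excel.CurrentWorkbook(){[Name = "Table1"]}[Content],
    // one row per employee, with the mean time taken across their jobs
    Averaged = Table.Group(
        Source,
        {"Name"},
        {{"Average Time", each List.Average([TimeTaken]), type number}}
    )
in
    Averaged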
Thanks, this gives me the thread to pull on. I have unique names as it's the person's full name, but I'll try pivot tables to hopefully make it a little more future-proof for other things to be reported on later.

MultiGet or multiple Get operations when paging

I have a wide column family used as a 'timeline' index, where column names are timestamps. In order to prevent hotspots, I shard the CF by month so that each month has its own row in the CF.
I query the CF for a slice range between two dates and limit the number of columns returned based on the page's records per page, say to 10.
The problem is that if my date range spans several months, I get 10 columns returned from each row, even if there are 10 matching columns in the first row that already satisfy my paging requirement.
I can see the logic in this, but it strikes me as a real inefficiency if I have to retrieve redundant records from potentially multiple nodes when I only need the first 10 matching columns regardless of how many rows they span.
So my question is, am I better off doing a single Get operation on the first row and then another Get operation on the second row if my first call doesn't return 10 records, continuing until I have the required number of records (or hit the row limit), or should I just accept the redundancy and dump the unneeded records?
I would sample your queries and record how many rows you needed to fetch for each one in order to get your 10 results and build a histogram of those numbers. Then, based on the histogram, figure out how many rows you would need to fetch at once in order to complete, say, 90% of your lookups with only a single query to Cassandra. That's a good start, at least.
If you almost always need to fetch more than one row, consider splitting your timeline by larger chunks than a month. Or, if you want to take a more flexible approach, use different bucket sizes based on the traffic for each individual timeline: http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra (see the "Variable Time Bucket Sizes" section).
