How to remove duplicates from individual powerquery columns without removing entire rows


I have a data table that records cost savings, with one row per project. Each row holds project-level figures such as annual spend and annual savings, and also the months the savings fall into. To pivot on this data, I converted it to a table with Power Query, but the project-level columns now repeat for every month that has savings. For example, I might get 10 rows of savings, which is correct, but the annual spend is duplicated 10 times. Can I remove the duplicates in just those columns while retaining the other data?
I have searched and tried various solutions but haven't found one that works. I am not set on data table format, so am open to anything.
Below is a sample of the data
[Image: Sample of PowerQuery]
As you will see, Baseline Spend, Negotiated Spend, Savings Amount are all shown for each row and I need to use these in a pivot/slicer.
Any help would be appreciated.
Regards,
Keith

I think one solution might be to "only keep the first [1] annual spend per project". More abstractly: "only keep the first value in column(s) X per column(s) Y".
Below is some mock/dummy data. I only want to keep the highlighted values in my annual spend column (as the highlighted values are the first "annual spend" figures per "project").
This is the M code I'm using to achieve this. (To try it, open the Query Editor > Advanced Editor (near top right) > copy-paste code below to there > OK).
let
    // Keeps the first row as-is and replaces the values in columnsToNullify
    // with null in every later row.
    OnlyKeepFirstValueInColumn = (someTable as table, columnsToNullify as list) as table =>
        let
            firstRow = Table.FirstN(someTable, 1), // This assumes first row contains a non-blank value.
            remainingRows = Table.Skip(someTable, 1),
            loopAndNullify = List.Accumulate(columnsToNullify, remainingRows, (tableState, currentHeader) => Table.TransformColumns(tableState, {{currentHeader, each null}})),
            combined = firstRow & loopAndNullify
        in
            combined,
    // Applies the function above to each group of rows sharing the same groupByColumns values.
    FirstValueOfColumnsPerGroup = (someTable as table, groupByColumns as list, columnsToNullify as list) =>
        let
            group = Table.Group(someTable, groupByColumns, {{"toCombine", each OnlyKeepFirstValueInColumn(_, columnsToNullify), type table}}),
            combined = Table.Combine(group[toCombine])
        in
            combined,
    // Mock data: one row per letter, with an annual spend of 1000 on every row.
    aggregatedTable = Table.FromColumns({Text.ToList("aaabbbccccdddeeefg"), List.Repeat({1000}, Text.Length("aaabbbccccdddeeefg"))}, type table [project=text, annual spend=number]),
    transformed = FirstValueOfColumnsPerGroup(aggregatedTable, {"project"}, {"annual spend"})
in
    transformed
The important bit to understand is this line:
transformed = FirstValueOfColumnsPerGroup(aggregatedTable, {"project"}, {"annual spend"})
in which you should replace:
aggregatedTable with whatever variable/expression contains your table
{"project"} with the name of your "project" column (keep the curly braces {} though as they let you pass in several columns if needed)
{"annual spend"} with the names of whichever column(s) you want to keep only the first value in (keep the curly braces {})
This is what I get (which I think is similar to what you want):
[1] To keep things simple, we'll say "first" here means the value in the first row. It could have meant "first non-null value" or "first value satisfying some particular condition or logic", but your data suggests the simpler definition will work okay.
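For readers who want to check the per-group logic outside Power Query, here is a minimal Python sketch of the same idea. It is not a translation of the M code: the function name and the list-of-dicts row format are assumptions made for illustration, and it assumes rows of the same project sit next to each other.

```python
from itertools import groupby


def keep_first_value_per_group(rows, group_key, columns_to_nullify):
    """Return copies of rows in which only the first row of each group
    keeps its values in columns_to_nullify; later rows get None there."""
    result = []
    # groupby only groups consecutive rows, so the input is assumed to be
    # sorted (or at least contiguous) by group_key.
    for _, group in groupby(rows, key=lambda r: r[group_key]):
        for position, row in enumerate(group):
            new_row = dict(row)  # copy so the input is left untouched
            if position > 0:
                for column in columns_to_nullify:
                    new_row[column] = None
            result.append(new_row)
    return result


# Mock data mirroring the answer: one row per letter, annual spend 1000 everywhere.
rows = [{"project": p, "annual spend": 1000} for p in "aaabbbccccdddeeefg"]
transformed = keep_first_value_per_group(rows, "project", ["annual spend"])
```

With this mock data there are seven projects, so summing the remaining annual spend values gives one figure per project instead of one per row.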

Related

Sort data contained in blocks in excel

I have a large amount of reference data in excel, which I am trying to manipulate in a variety of ways. I'm having some problems with the way it is structured and sorting into a more manageable format.
Problem number 1:
I have three columns. Column A contains first a date, and then a designator of high or low. Column B contains times, Column C contains heights.
I would like to sort the data by column B (easy enough) EXCEPT I would like the date headings in Column A preserved. It's almost as though I have 365 tables, each with between 3 and 5 pieces of data - I'm looking to sort the 3 - 5 pieces of data within each date only.
This is what I have currently:
There's no issue with me taking the data and manipulating it some other way first - this is ultimately around me being able to take a batch of data (5x different reference points, each for 365 days) and develop a process to sanitise it and get it displayed in time order, as well as being able to get it into a usable format for problem 2 (I need to adjust some other data points by the sorted data once I have it).
This is what I would like it to look like (I manually went through each of these blocks and sorted them):

It is possible to do it in Excel as follows in cell E2:
=LET(rng, A1:C11, set, FILTER(rng, (INDEX(rng,,1) <>"")),
dates, SCAN("", INDEX(set,,1), LAMBDA(acc, item, IF(ISNUMBER(item), item, acc))),
in, FILTER(HSTACK(dates, set), INDEX(set,,2)<>""), inDates, INDEX(in,,1),
out, REDUCE("", UNIQUE(inDates), LAMBDA(acc, date,
LET(sorted, VSTACK(date, DROP(SORT(FILTER(in, inDates = date),3),,1), {"","",""}),
VSTACK(acc, sorted)
))), IFERROR(DROP(DROP(out,1),-1),"")
)
Here is the output:
You can avoid the clean-up process, except for removing the last row, as follows:
=LET(rng, A1:C11, set, FILTER(rng, (INDEX(rng,,1) <>"")),
dates, SCAN("", INDEX(set,,1), LAMBDA(acc, item, IF(ISNUMBER(item), item, acc))),
in, FILTER(HSTACK(dates, set), INDEX(set,,2)<>""), inDates, INDEX(in,,1),
out, REDUCE("", UNIQUE(inDates), LAMBDA(acc, date,
LET(sorted, VSTACK(HSTACK(date,"",""), DROP(SORT(FILTER(in, inDates = date),3),,1),
{"","",""}), IF(MAX(LEN(acc))=0, sorted, VSTACK(acc, sorted))
))), DROP(out, -1)
)
Explanation
The idea is to carry out the manual steps with Excel functions. The name set holds the input data (rng) with the empty rows removed. The name dates is a column with the same number of rows as set, in which each row carries the date of the block it belongs to. The SCAN condition that identifies a new date is ISNUMBER, because Excel stores dates as numbers (serial numbers), whereas the high/low designators in column A are text. The name in holds the data in the shape needed for sorting and for filtering by date: the date header rows are removed and the dates are added as the first column.
Next we use the DROP/REDUCE/VSTACK pattern (check the answer to the question: how to transform a table in Excel from vertical to horizontal but with different length provided by David Leal) to append the sorted data for each unique date. For each date we add the date as the first row, then the sorted data, and finally an empty row to separate the groups. Finally, a clean-up step via IFERROR/DROP replaces the #N/A values with empty strings and removes the first and the last empty rows.
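The same fill-down, group and sort steps can be sketched in Python, which may help when checking the formula's output. This is only an illustration: the tuple layout, the function name and the sample tide values are assumptions, and the data is assumed to start with a date header row.

```python
from datetime import date, time


def sort_blocks(rows):
    """rows: list of (label, time, height) tuples; a header row is
    (date, None, None). Returns the rows with each date block sorted by time."""
    blocks = {}          # date -> data rows, kept in first-seen date order
    current_date = None
    for label, when, height in rows:
        if isinstance(label, date):      # a date header starts a new block
            current_date = label
            blocks.setdefault(current_date, [])
        elif when is not None:           # a data row belongs to the latest header
            blocks[current_date].append((label, when, height))
    output = []
    for block_date, data in blocks.items():
        output.append((block_date, None, None))           # keep the date heading
        output.extend(sorted(data, key=lambda r: r[1]))   # sort by time only
    return output


# Hypothetical sample: two dates, each with unsorted high/low readings.
sample = [
    (date(2023, 1, 1), None, None),
    ("High", time(14, 5), 4.2),
    ("Low", time(1, 30), 0.8),
    ("High", time(7, 45), 4.0),
    (date(2023, 1, 2), None, None),
    ("Low", time(20, 10), 0.9),
    ("High", time(2, 15), 4.1),
]
result = sort_blocks(sample)
```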

Using tbl.Lookup to match just part of a column value

This question relates to the Schematiq add-in for Microsoft Excel.
I am using =tbl.Lookup(table, columnsToSearch, valuesToFind, resultColumn, [defaultValue]). The values in the valuesToFind column always start with the same 3 characters, followed by varying characters (e.g. 908-123456 or 908-321654, where 908 is the constant part).
How can I tell the function to lookup the value based on the first 3 characters only? The expected answer should be the sum of the results of the above, i.e. 500 + 300 = 800

tbl.Lookup() works by looking for an exact match - this helps ensure it's fast but in this case it means you need an extra step to calculate a column of lookup values, something like this:
A2: =tbl.CalculateColumn(A1, "code", "x => LEFT(x, 3)", "startOfCode")
This will give you a new column that you can use for the columnsToSearch argument, however tbl.Lookup() also looks for just one match - it doesn't know how to combine values together if there is more than one matching row in the table, so I think you also need one more step to group your table by the first 3 chars of the code, like this:
A3: =tbl.Group(A2, "startOfCode", "amount")
Because tbl.Group() adds values together by default, this will give you a table with a row for each distinct value of startOfCode and the subtotal of amount for each of those values. Finally, you can do the lookup exactly as you requested, which for your input table will return 800:
A4: =tbl.Lookup(A3, "startOfCode", "908", "amount")
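The three steps (derive a prefix column, group and sum by it, then do an exact lookup) can be sketched in plain Python to make the data flow concrete. This does not use or emulate the Schematiq API; the function name is hypothetical, and the third sample row is invented so that the grouping has more than one key.

```python
from collections import defaultdict


def sum_by_prefix(rows, prefix_length=3):
    """rows: list of (code, amount). Returns {prefix: total amount}."""
    totals = defaultdict(int)
    for code, amount in rows:
        # Equivalent in spirit to adding a LEFT(code, 3) column and grouping on it.
        totals[code[:prefix_length]] += amount
    return dict(totals)


# The first two rows come from the question; the third is a made-up extra code.
rows = [("908-123456", 500), ("908-321654", 300), ("777-000001", 50)]
totals = sum_by_prefix(rows)
result = totals.get("908", 0)  # exact lookup on the grouped table
```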

How can I add a 1 to the most recent, repeated row in Excel?

I have a dataset with 60+ thousand rows in excel and about 20 columns. The "ID column" sometimes repeats itself and I want to add a column that will return 1 only in the row that is the most recent only IF it repeats itself.
Here is the example. I have…
ID   DATE      ColumnX
AS1  Jan-2013  DATA
AS2  Feb-2013  DATA
AS3  Jan-2013  DATA
AS4  Dec-2013  DATA
AS2  Dec-2013  DATA
I want…
ID   DATE      ColumnX  New Column
AS1  Jan-2013  DATA     1
AS2  Feb-2013  DATA     0
AS3  Jan-2013  DATA     1
AS4  Dec-2013  DATA     1
AS2  Dec-2013  DATA     1
I've been trying with a combination of sort and nested if's, but it depends on my data being always in the same order (so that it looks up the ID in the previous row).
Bonus points: consider that my dataset is fairly large for Excel, so the most efficient code that won't eat up the processor would be appreciated!

An approach you could use is to point MSQuery at your table and use SQL to apply the business rules. On the positive side, this runs very quickly (a couple seconds in my tests against 64k rows). A huge minus is the query engine does not seem to support Excel tables exceeding 64k rows, but there might be ways to work around this. Regardless, I offer the solution in case it gives you some ideas.
To set up first give your data set a named range. I called it MYTABLE. Save. Next select a cell to the right of your table in row 1, and click through Data | From other sources | from Microsoft Query. Choose Excel Files* | OK, browse for your file. The Query Wiz should open, showing MYTABLE available, add all the columns. Click Cancel (really), and click Yes, you want to continue editing.
The MSQuery interface should open, click the SQL button and replace the code with the following. You will need to edit some specifics, such as the file path. (Also, note I used different column names. This was sheer paranoia on my part. The Jet engine is very finicky and I wanted to rule out conflicts with reserved words as I built this.)
SELECT
MYTABLE.ID_X,
MYTABLE.DATE_X,
MYTABLE.COLUMN_X,
IIF(MAXDATES.ID_x IS NULL,0,1) * IIF(DUPTABLE.ID_X IS NULL,0,1) AS NEW_DATA
FROM ((`C:\Users\andy3h\Desktop\SOTEST1.xlsx`.MYTABLE MYTABLE
LEFT OUTER JOIN (
SELECT MYTABLE1.ID_X, MAX(MYTABLE1.DATE_X) AS MAXDATE
FROM `C:\Users\andy3h\Desktop\SOTEST1.xlsx`.MYTABLE MYTABLE1
GROUP BY MYTABLE1.ID_X
) AS MAXDATES
ON MYTABLE.ID_X = MAXDATES.ID_X
AND MYTABLE.DATE_X = MAXDATES.MAXDATE)
LEFT OUTER JOIN (
SELECT MYTABLE2.ID_X
FROM `C:\Users\andy3h\Desktop\SOTEST1.xlsx`.MYTABLE MYTABLE2
GROUP BY MYTABLE2.ID_X
HAVING COUNT(1) > 1
) AS DUPTABLE
ON MYTABLE.ID_X = DUPTABLE.ID_X)
With the code in place MSQuery will complain the query can't be represented graphically. It's OK. The query will execute -- it might take longer than expected to run at this stage. I'm not sure why, but it should run much faster on subsequent refreshes. Once results return, File | Return data to Excel. Accept the defaults on the Import Data dialog.
That's the technique. To refresh the query against new data, simply use Data | Refresh. If you need to tweak the query, you can get back to it through Excel via Data | Connections | Properties | Definition tab.
The code I provided returns your original data plus the NEW_DATA column, which has value 1 if the ID is duplicated and the date is the maximum date for that ID, otherwise 0. This code will not sort out ties if an ID's maximum date is on several rows. All such rows will be tagged 1.
Edit: The code is easily modified to ignore the duplication logic and show most recent row for all IDs. Simply change the last bit of the SELECT clause to read
IIF(MAXDATES.ID_x IS NULL,0,1) AS NEW_DATA
In that case, you could also remove the final LEFT JOIN with alias DUPTABLE.

Sort by ID, then by DATE (ascending). Define entries in new column to be 1 if previous row has the same ID and next row has a different ID or is empty (for last row), 0 otherwise.
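The sort-then-compare-neighbours rule can be checked on the sample data with a short Python sketch. The function name is hypothetical and the "Mon-YYYY" date format is taken from the question's example. Note that this implements the rule as worded in this answer (1 only for the latest row of an ID that repeats), which differs from the asker's sample output, where IDs that appear once are also marked 1.

```python
from datetime import datetime


def flag_latest_of_repeated(rows):
    """rows: list of (id, 'Mon-YYYY') pairs.
    Returns a list of 0/1 flags in the original row order."""
    parsed = [(row_id, datetime.strptime(d, "%b-%Y")) for row_id, d in rows]
    # Sort row positions by ID, then by date, without moving the rows themselves.
    order = sorted(range(len(parsed)), key=lambda i: parsed[i])
    flags = [0] * len(rows)
    for pos, i in enumerate(order):
        prev_same = pos > 0 and parsed[order[pos - 1]][0] == parsed[i][0]
        next_diff = pos == len(order) - 1 or parsed[order[pos + 1]][0] != parsed[i][0]
        flags[i] = 1 if prev_same and next_diff else 0
    return flags


rows = [
    ("AS1", "Jan-2013"),
    ("AS2", "Feb-2013"),
    ("AS3", "Jan-2013"),
    ("AS4", "Dec-2013"),
    ("AS2", "Dec-2013"),
]
flags = flag_latest_of_repeated(rows)
```

Dropping the prev_same condition would mark the latest row of every ID, including IDs that occur only once.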

It could be done in VBA. I'd be interested to know if this is possible just using formulas, I had to do something similar once before.
Sub Macro1()
    Dim rowCount As Long
    Dim counter As Long ' Long rather than Integer: Integer overflows above 32,767 rows
    Sheets("Sheet1").Activate
    rowCount = Cells(Rows.Count, 1).End(xlUp).Row
    ' Turn the autofilter on only if it is off (a plain .AutoFilter call would toggle it)
    If Not ActiveSheet.AutoFilterMode Then Columns("A:D").AutoFilter
    ' Clear the previous 0/1 flags in column D
    Range("D2:D" & rowCount).ClearContents
    With ActiveWorkbook.Worksheets("Sheet1").AutoFilter.Sort
        .SortFields.Clear ' drop sort keys left over from earlier runs
        ' The first key added has the highest priority: sort by ID (column A), then by date (column B)
        .SortFields.Add Key:=Range("A1:A" & rowCount), SortOn:=xlSortOnValues
        .SortFields.Add Key:=Range("B1:B" & rowCount), SortOn:=xlSortOnValues
        .Apply
    End With
    ' Flag each row with 1, then reset it to 0 if the next row has the same ID
    For counter = 2 To rowCount
        Cells(counter, 4) = 1
        If Cells(counter, 1) = Cells(counter + 1, 1) Then Cells(counter, 4) = 0
    Next counter
End Sub
So you activate the sheet and get the count of rows.
Then turn on the autofilter and clear out column D, which holds the 0s and 1s. Then sort on the columns mbroshi suggested (ID, then date), which you say you're already using. Finally, loop over every record, setting the value to 1 but changing it back to 0 if the next row has the same ID.
Depending on your processor, I don't think this would take more than a minute or two to run. If you do find something using formulas I would be interested to see it!

Excel CountifS. Criteria multi-column ranges. Non-ordered comparison test

It's my first question here, so please don't kill me if something is wrong. I have found numerous solutions on this site, but not this time. Unfortunately I can't post images yet. It won't be easy, but I will try.
To the point:
My data has the following headers:
Decision_Id Opponent1 Opponent2 Opponent3 Suitor1 Suitor2 Suitor3 Suitor4
Decision_id is a unique integer identifier. The rest are strings.
Each row represents a particular judicial decision. Each decision can have UP TO 3 opponents (defending party) and UP TO 4 suitors (attacking party). A particular party can be a suitor in one decision and an opponent in another one.
What I want to get :
Cross-table where both rows and columns headers are all distinct parties I encounter in the table. (no problem with that, done.)
Where each cell shows in how many distinct decisions a particular opponent (defined by row header) was attacked by a particular suitor (column header) => All diagonal cells equal ZERO (a party can't attack itself) and table is not symmetric.
I have tried
to apply this to the first cell and then expand:
=COUNTIFS("Fixed range of all opponents :$B$2:$D$6","the wanted opponent value : $A2", "Fixed range of all suitors :$E$2:$H$6", "the wanted suitor value : B$1")
I got an error, and figured out that the criteria ranges have to be the same size. So I created dummy empty columns: no error, BUT the results are clearly underestimated. I think there is a match only if the opponent and the suitor have the same "number". In detail: for each row, Excel tests opponent1 and suitor1 against the corresponding values, then opponent2 and suitor2, then opponent3 and suitor3... This also explains why the ranges have to be the same size.
So, What I would need
Is for Excel, for each row, to test all opponents against the wanted opponent value and all suitors against the wanted suitor value. If at least one opponent and one suitor match, count this decision (even if it was opponent1 and suitor3 that had the wanted values).
Remarks
I have already made a VBA code which does the job, but it's too slow (around 5 hours for the whole table) and I expect to do the same for different tables of this kind and/or modify this one. So I am interested in "pure excel", fast solution.
Thank you very much!

The difficult part here is to separate multi-column ranges into separate rows - one way to do that is with OFFSET within COUNTIF, i.e. this formula
=SUMPRODUCT(COUNTIF(OFFSET($B$2:$D$6,ROW($B$2:$D$6)-ROW($B$2),0,1),$A2),COUNTIF(OFFSET($E$2:$H$6,ROW($E$2:$H$6)-ROW($E$2),0,1),B$1))
That assumes that all suitors are different on any one row and all opponents are different on any one row (although formula can be modified if that isn't the case).
You can extend the ranges to any size you want - although the number of rows must be the same for each part
....or here's another more obscure way using MMULT function
=SUMPRODUCT(MMULT(($B$2:$D$6=$A2)+0,{1;1;1}),MMULT(($E$2:$H$6=B$1)+0,{1;1;1;1}))
The {1;1;1} and {1;1;1;1} arrays must have as many elements as there are columns in each section, so if you have 6 opponent columns and 8 suitor columns they need to be changed accordingly.
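To sanity-check the formula results on a small sample, the counting rule (a decision counts once for a pair if the opponent appears anywhere among its opponents and the suitor anywhere among its suitors) can be sketched in Python. The party names and the function name are invented for illustration; this is not a replacement for the worksheet formulas.

```python
from collections import Counter


def count_attacks(decisions):
    """decisions: list of (opponents, suitors) pairs, one per decision.
    Returns a Counter keyed by (opponent, suitor)."""
    counts = Counter()
    for opponents, suitors in decisions:
        # Sets drop blanks and repeats, so a decision counts at most once per pair.
        for opponent in {o for o in opponents if o}:
            for suitor in {s for s in suitors if s}:
                counts[(opponent, suitor)] += 1
    return counts


# Hypothetical decisions: up to 3 opponents and up to 4 suitors, blanks as "".
decisions = [
    (["Acme", "Bolt", ""], ["Crane", "", "", ""]),
    (["Bolt", "", ""], ["Crane", "Acme", "", ""]),
    (["Crane", "", ""], ["Acme", "", "", ""]),
]
counts = count_attacks(decisions)
```

Because every decision is visited once and fills all of its pairs in one go, the whole cross-table comes out of a single pass over the rows.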

Another possibility is to try this array formula:
=SUM(MMULT(-TRANSPOSE($B$2:$D$6=$A2),-($E$2:$H$6=B$1)))
entered using CTRL+SHIFT+ENTER (or defined as a name and entered normally eg =Total.)

This should do it:
= COUNTIFS($B$2:$B$6,$A2, $E$2:$E$6, B$1)
+ COUNTIFS($C$2:$C$6,$A2, $E$2:$E$6, B$1)
+ COUNTIFS($D$2:$D$6,$A2, $E$2:$E$6, B$1)
+ COUNTIFS($B$2:$B$6,$A2, $F$2:$F$6, B$1)
+ COUNTIFS($C$2:$C$6,$A2, $F$2:$F$6, B$1)
+ COUNTIFS($D$2:$D$6,$A2, $F$2:$F$6, B$1)
+ COUNTIFS($B$2:$B$6,$A2, $G$2:$G$6, B$1)
+ COUNTIFS($C$2:$C$6,$A2, $G$2:$G$6, B$1)
+ COUNTIFS($D$2:$D$6,$A2, $G$2:$G$6, B$1)
+ COUNTIFS($B$2:$B$6,$A2, $H$2:$H$6, B$1)
+ COUNTIFS($C$2:$C$6,$A2, $H$2:$H$6, B$1)
+ COUNTIFS($D$2:$D$6,$A2, $H$2:$H$6, B$1)
These look simpler if you make your data into a table, or define named ranges for the Opponent1, Opponent2, Suitor1 columns etc...

Finding the next result from a MATCH

I am trying to produce a sorted table in Excel, which depends on the selected year and category.
My methodology has been to sequentially find largest values in order, within the selected year and category parameters, doing the following:
Column E
{=LARGE(IF(('Master Data'!A$1:A$500 = $B$1) * ('Master Data'!B$1:B$500 = $B$2),'Master Data'!C$1:C$500), $B10)}
This works fine. $B$1 is where I store the year, $B$2 is where I store the category, and $B10 references a hard-coded rank (1-25) in column B.
Column F
{=MATCH(E10,IF(('Master Data'!A$1:A$500 = $B$1) * ('Master Data'!B$1:B$500 = $B$2),'Master Data'!C$1:C$500),FALSE)}
This returns the row number of the result I need, which I then use in conjunction with INDEX to find related data.
The problem with this is that Match only returns the first row number, and if you have two results with the same value this clearly becomes an issue.
Column G
To resolve this I used an example from dailydoseofexcel which looks like this:
=IF(F10<>F11, F11, G10+MATCH(E11,INDIRECT("'Master Data'!C"&(G10+1)&":C500"),0))
This works to a limited extent, for my purposes, as it is unable to take into account the year and category filter I need to apply, so I tried:
{=IF(F10<>F11, F11, G10+MATCH(E11,IF((INDIRECT("'Master Data'!A"&(G10+1)&":A500") = $C$2) * (INDIRECT("'Master Data'!B"&(G10+1)&":B500") = $C$3), INDIRECT("'Master Data'!C"&(G10+1)&":C500")),0))}
But I am just getting #N/A as a result.

I think SUMPRODUCT may be what you are looking for:
Charley Kyd XL Legend: Use SUMPRODUCT to get the Last item in a list
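Independently of which formula is used, the target behaviour (the n largest values within one year and category, with tied values resolved to different rows) can be written down as a short Python sketch for checking results. The function name and the sample records are assumptions; this is not the SUMPRODUCT approach from the linked article.

```python
def top_rows(records, year, category, n):
    """records: list of (year, category, value) tuples.
    Returns the 1-based row numbers of the n largest values among the rows
    matching year and category; tied values keep their original row order."""
    matching = [
        (value, index)
        for index, (y, c, value) in enumerate(records, start=1)
        if y == year and c == category
    ]
    # Largest value first; the row number breaks ties, so each row appears once.
    matching.sort(key=lambda pair: (-pair[0], pair[1]))
    return [index for _, index in matching[:n]]


# Hypothetical master data with a tie (30 appears twice for 2020 / "A").
records = [
    (2020, "A", 10),
    (2020, "A", 30),
    (2020, "B", 99),
    (2020, "A", 30),
    (2021, "A", 50),
    (2020, "A", 20),
]
rows_found = top_rows(records, 2020, "A", 3)
```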
