I have a question about Excel! I hope that isn't too unconventional for this site...
So I have an Excel table with several thousand rows. It is kind of set up like a db in that the first three of my four columns have numerical values identifying the sequence or order of the content in the fourth column.
I am running into some possible duplication issues, and I remember from my college days something about there being a function for the type of test I need to do. I need to verify that no two rows have the same values in columns 1-3. There should never be a case where all three columns' values exactly match those of another row.
Is VLOOKUP the function I need? Any Excel experts out there who know of a function I could look into? Thanks so much!
The quick one-off solution I use for this kind of task is the following:
Create a single key in one temporary column, say F: "=A2 & B2 & C2 ..." for a combined key. I copy this formula all the way down. If the values can differ in length, a separator such as =A2 & "|" & B2 & "|" & C2 keeps 1 and 23 from colliding with 12 and 3.
Sort the table by F, because the counter in the next step only works when identical keys sit in adjacent rows.
Create a group counter for that single key, say G: "=IF(F2=F1,G1+1,1)". I can safely reference the header row here because the comparison with it falls into the FALSE part.
This formula in G numbers all identical keys from 1 to N and restarts at 1 for each new key. I copy this formula all the way down.
Important: convert the G formulae into values (copy / paste special onto itself).
Sort descending by G and delete/manipulate all rows where the counter <> 1, or use an autofilter.
Later on I delete the F and G columns.
This may sound a bit complicated, but especially in large tables VLOOKUP, COUNTIF etc. can be very time-consuming.
Hope that helps
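For anyone who wants to see the logic of the key column and group counter outside Excel, here is a minimal Python sketch. The function name number_within_groups and the sample rows are made up for illustration; it assumes the first three fields of each row form the key.

```python
def number_within_groups(rows, key_cols=(0, 1, 2)):
    """Sort rows by their key columns and number repeats 1..N within each key."""
    # Sorting puts identical keys next to each other, like sorting by column F.
    ordered = sorted(rows, key=lambda r: tuple(r[c] for c in key_cols))
    numbered = []
    prev_key, counter = None, 0
    for row in ordered:
        key = tuple(row[c] for c in key_cols)
        # Same idea as =IF(F2=F1,G1+1,1): count up on a repeat, else restart at 1.
        counter = counter + 1 if key == prev_key else 1
        numbered.append((row, counter))
        prev_key = key
    return numbered


# Hypothetical sample data: three key columns plus a content column.
rows = [(1, 1, 1, "a"), (1, 1, 2, "b"), (1, 1, 1, "c"), (2, 1, 1, "d")]
duplicates = [row for row, n in number_within_groups(rows) if n > 1]
print(duplicates)
```

Every row with a counter above 1 is a repeat of an earlier key, which is what the sort on G brings to the top.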
You could create another column that concatenates the first 3, then do a countif on that. Let's say the concatenation column is D and your data begins in the second row:
=countif(D:D,D2)
Copy the formula down, then filter on >1.
I think what you need is a countifs function.
Assume you add this formula in a cell in row 4:
=COUNTIFS(A:A,A4,B:B,B4,C:C,C4)
and copy the formula down the whole column.
Then rows with the value 1 are unique, while those with a value larger than 1 have duplicates.
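As a rough illustration of what the COUNTIF / COUNTIFS column computes, here is a small Python sketch. The name count_matches and the sample rows are assumptions for the example only. It keys on a tuple of the three values rather than a concatenated string, so values like 1 and 23 cannot be confused with 12 and 3.

```python
from collections import Counter


def count_matches(rows, key_cols=(0, 1, 2)):
    """For each row, return how many rows share its values in key_cols."""
    keys = [tuple(r[c] for c in key_cols) for r in rows]
    totals = Counter(keys)  # one pass to count every distinct key
    return [totals[k] for k in keys]


# Hypothetical sample data: three key columns plus a content column.
rows = [(1, 2, 3, "x"), (1, 2, 4, "y"), (1, 2, 3, "z")]
counts = count_matches(rows)
print(counts)
```

Rows whose count is greater than 1 are the ones the filter would show.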
If you only need to check the data once, try the "Remove duplicates" functionality. This can be found in the Data tab -> Data Tools -> Remove Duplicates. Just unselect all but the first three columns in the dialog and Excel will do the rest.
Thanks for any help here. I've been racking my brain (and searching online I promise) for a while on this one.
I'm looking at columns A and B that have unrelated information, but sometimes the combination of values in columns A and B is duplicated. For example, cell A2 says "Frank" and B2 says "1", then cell A3 says "Frank" and B3 says "1". The information is a duplicate across two columns. The rest of the names in this example can be anything (Frank, Sally, Robert, etc.) and the rest of the numbers in column B can be anything (image of example attached). If the information is a duplicate I'd like to output a reduced list using two new columns.
These functions need to work in real time: as data is added, the formulas must update. I also can't concatenate because I need the info to stay in two columns. I've seen a lot of examples of doing this for 1 column of data using an array formula (see example 2), but I don't know how to build one that considers two columns. 1 column example: =IFERROR(INDEX($T$2:$T$9, MATCH(0,COUNTIF($X$1:X3, $T$2:$T$9), 0)),"") Any ideas how to build this out so it works for two columns?
I'd like to avoid using VBA if possible, but if it's the only way so be it. I want to avoid using VBA because a lot of people touch the spreadsheet and it's a lot easier for me to fix a function than code. Gotta love humans!
Thanks so much for your help!
Robby
[Images: example, example2]
The simplest way that I can think of is:
a) Select your data, and paste elsewhere either on same sheet or a new one.
b) Select one of the cells in the copied data
c) Go to the Data menu tab, and click Remove Duplicates
d) Click OK
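What Remove Duplicates does on two columns can be sketched in a few lines of Python: keep the first occurrence of each (A, B) pair, drop the rest, and leave the two values in separate columns. The function name unique_pairs and the sample data are made up for illustration.

```python
def unique_pairs(rows):
    """Return the (name, number) pairs in original order, first occurrence only."""
    seen = set()
    result = []
    for name, number in rows:
        if (name, number) not in seen:
            seen.add((name, number))
            result.append((name, number))
    return result


# Hypothetical sample data modelled on the Frank / 1 example in the question.
rows = [("Frank", 1), ("Frank", 1), ("Sally", 2), ("Frank", 2)]
reduced = unique_pairs(rows)
print(reduced)
```

Note that ("Frank", 2) survives because only the pair as a whole has to be unique, not each column on its own.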
Create a new column C that concatenates columns A and B. Then remove duplicates on the basis of column C.
https://support.office.com/en-us/article/CONCATENATE-function-8F8AE884-2CA8-4F7A-B093-75D702BEA31D
I'd like to use column D as Index or lookup value. Then I want to concatenate the values from column B and C into Column E.
I can use =VLOOKUP(D2,A2:C6,2,FALSE) or =INDEX($B$2:$B$6,MATCH("Person 1",$A$2:$A$6,0)) but I don't know how to use it multiple times in the same column. Is there a way I can combine those to search multiple times in the same column?
I'd be open to using vba if that would be a better option, just still not sure about the multiple times per column.
I couldn't figure out a magic one-and-done formula. Richard Tompsett is probably correct with the VBA solution. I'd recommend the following series of steps if VBA is out of the question.
(1) Sort by column A. This will group Thing 1 and Thing 2 together into discrete ranges per person.
(2) In cell F2, type =Transpose(B2:C3) and hit F9. Should convert to look like ={"A","D";1,4}, then delete curly brackets and equal sign. This is the range for Person 1 (manually).
(3) In cell E2, enter =SUBSTITUTE(SUBSTITUTE(F2,"""",""),";",","). Should now appear as ' A,D,1,4 '
Repeat for each Person
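If a script is an option, the manual steps above boil down to grouping the rows by person and joining the values. Below is a minimal Python sketch of that idea. The function name concat_by_person is made up, and the Person 2 row is invented sample data. The output order (all Thing 1 values, then all Thing 2 values) is chosen to match the ' A,D,1,4 ' result from the transpose step.

```python
def concat_by_person(rows):
    """Join each person's column-B values, then column-C values, with commas."""
    grouped = {}
    for person, thing1, thing2 in rows:
        firsts, seconds = grouped.setdefault(person, ([], []))
        firsts.append(str(thing1))
        seconds.append(str(thing2))
    return {person: ",".join(f + s) for person, (f, s) in grouped.items()}


# Person 1 mirrors the example above; Person 2 is hypothetical.
rows = [("Person 1", "A", 1), ("Person 1", "D", 4), ("Person 2", "B", 2)]
combined = concat_by_person(rows)
print(combined["Person 1"])
```

Unlike the manual approach, this does not need the data sorted by person first.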
I'm working on data from a population of people with allergies. Each person has a unique ExceptionID, and each allergen has a unique AllergenID (451 in total).
I have a data table with 2 columns (ExceptionID and AllergenID), where each person's allergies are listed row by row. This means that the ExceptionID column has repeated values for people with multiple allergies, and the AllergenID column has repeated values for the different people who have that allergy.
I am trying to count how many times each pair of allergies is present in this population (e.g. Allergen#107 & Allergen#108, Allergen#107 & Allergen#109,etc). To keep it simple I've created a matrix of 451 rows X 451 columns, representing every pair (twice actually because A/B and B/A are equivalent).
I somehow need to use the row name (allergenID) to lookup the ExceptionID in my data table, and count the cases where that matches the ExceptionIDs from the column name (also AllergenID). I have no problem using Vlookup or Index/Match, but I'm struggling with the correct combination of a lookup and Sumproduct or Countif formula.
Any help is greatly appreciated!
Mike
PS I'm using Excel 2016 if that changes anything.
-=UPDATE=-
So the methods suggested by Dirk and MacroMarc both worked, though I couldn't apply the latter to my full data set (17,000+ rows) because it was taking a long time.
I've since decided to turn this into a VBA macro because we now want to see the counts of triplets instead of pairs.
With just the 2 columns you start with, it is as good as impossible... You would need to check, for every ExceptionID, whether it has both of 2 specific AllergenIDs. Better to use a helper table with ExceptionID as rows and AllergenID as columns (or the opposite... whatever you like). The helper table needs a formula like:
=COUNTIFS($A:$A,$D2,$B:$B,E$1)
Which then can be auto-filled. (The ranges are from my example, you need to change them to your needs).
With this helper-matrix you can easily go for your bigger matrix like this:
=COUNTIFS(E:E,1,INDEX($E:$G,,MATCH($I2,$E$1:$G$1,0)),1)
Again, you can auto-fill with this formula, but you need to change it, so it fits your needs.
Because the columns have the same ID2 (which would be your AllergenID), there is no need to look them up: E:E changes automatically with the auto-fill.
The most important part of the formulas is the $ signs, which must not be messed up, or you cannot auto-fill them.
Picture of my self-made example (formulas are from the upper left cell in each table):
If you still have any questions, just ask :)
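For comparison, here is a minimal Python sketch of the same pair count, in case a script is easier than a 451 x 451 matrix. The function name count_allergen_groups and the sample records are assumptions for illustration. It collects each person's allergens into a set and then counts every combination of a given size, so size=3 gives the triplet counts mentioned in the update.

```python
from collections import Counter, defaultdict
from itertools import combinations


def count_allergen_groups(records, size=2):
    """Count how many people have each combination of `size` allergens."""
    by_person = defaultdict(set)
    for exception_id, allergen_id in records:
        by_person[exception_id].add(allergen_id)
    counts = Counter()
    for allergens in by_person.values():
        # Sorting makes (107, 108) and (108, 107) the same key.
        counts.update(combinations(sorted(allergens), size))
    return counts


# Hypothetical (ExceptionID, AllergenID) rows.
records = [(1, 107), (1, 108), (1, 109), (2, 107), (2, 108), (3, 108)]
pairs = count_allergen_groups(records)
triplets = count_allergen_groups(records, size=3)
print(pairs[(107, 108)], triplets[(107, 108, 109)])
```

Each pair is stored once with the smaller ID first, so only half of the matrix is needed.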
It can be done straight from your original set-up with array formulas:
Please note that array formulas MUST be entered with Ctrl-Shift-Enter, before copying across and down:
In the example pic, I have NAMED the data ranges $A$2:$A$21 as 'People' and $B$2:$B$21 as 'Allergens' to make it a nicer set-up. You can see in the formula bar how that looks as a formula. However you could use the standard references like this in your first matrix cell:
EDIT: silly me, N function is not needed to turn the booleans into 1's and 0's, since multiplying booleans will do the trick. Below formula works...
=SUM(IF(MATCH($A$2:$A$21,$A$2:$A$21,0)=ROW($A$2:$A$21)-1, NOT(ISERROR(MATCH($A$2:$A$21&$E2,$A$2:$A$21&$B$2:$B$21,0)))*NOT(ISERROR(MATCH($A$2:$A$21&F$1, $A$2:$A$21&$B$2:$B$21,0))), 0))
Then copy from F2 across and down. It can be perhaps improved in technique with sumproduct or whatever, but it's just a rough example of the technique....
TOP Table is Input, and bottom table is preview for required output.
For Each ID I need to find earliest datetime. I also need other information from other columns (please see image below).
My current solution is:
In Cell E2 =A2
Cell E3 drag down =IF(E2<>A3,IF(E1=A3,"",A3),"")
In Cell F2 drag down =IF(E2<>"",MIN(IF($A$2:$A$14=E2,$C$2:$C$14)),"") Ctrl+Shift+Enter
One more option without any intermediate calculations:
Select the whole range starting E2 and to the last row where IDs are located - for the sample given it's row 14, so select range E2:E14: =IFERROR(INDEX($A$2:$A$14,SMALL(IF(MATCH($A$2:$A$14,$A$2:$A$14,0)=ROW(INDIRECT("1:"&ROWS($A$2:$A$14))),MATCH($A$2:$A$14,$A$2:$A$14,0),""),ROW(INDIRECT("1:"&ROWS($A$2:$A$14))))),"") and press CTRL+SHIFT+ENTER instead of usual ENTER - this will define a Multicell ARRAY formula and will result in curly {} brackets around it (but do NOT type them manually!).
F2 (ID2): =IF(E2="","",SUMPRODUCT(--(E2=$A$2:$A$14),--(G2=$C$2:$C$14),$B$2:$B$14)) - normal formula.
G2 (Min Date): =IF(E2="","",MIN(IF(E2=$A$2:$A$14,$C$2:$C$14,2^100))) and press CTRL+SHIFT+ENTER instead of usual ENTER - this will define an ARRAY formula and will result in curly {} brackets around it (but do NOT type them manually!).
H2 (InCh): =IF(E2="","",INDEX($D$2:$D$14,SUMPRODUCT(--(E2=$A$2:$A$14),--(F2=$B$2:$B$14),--(G2=$C$2:$C$14),ROW(INDIRECT("1:"&ROWS($D$2:$D$14)))))) - normal formula.
Remarks:
To make the solution more compact and easy to read, define named range for ID column, and then reference other data columns using OFFSET.
ID2 values may not be unique - as they are on the sample for IDs 1...3.
The resulting set for Min Date should be formatted the same way as the source Date column.
The key formula of the solution is the multicell monster, which returns unique IDs without empty rows (as the OP requested).
Sample file: https://www.dropbox.com/s/d2098updfh8djnf/MinDateIDs.xlsx
This is quite a challenge... I think I have found an approach that works. For the sake of clarity, I used a few helper columns. Also, I did not use any named ranges but stuck with the column-row indications. You might want to change that.
It looks like this:
and zooming in to the relevant columns:
Column F contains an array formula to filter out duplicates. An approach is explained here. The formula I used in F2 is
=INDEX($A$2:$A$14, MATCH(MIN(IF(COUNTIF($F$1:F1,$A$2:$A$14)=0, 1, MAX((COUNTIF($A$2:$A$14, "<"&$A$2:$A$14)+1)*2))*(COUNTIF($A$2:$A$14, "<"&$A$2:$A$14)+1)), COUNTIF($A$2:$A$14, "<"&$A$2:$A$14)+1, 0))
Use Ctrl-Shift-Enter to confirm as array formula. Drag this down or copy into column F. Then columns G and H contain the starting and ending indices of the duplicate ID values. This answer helped, please upvote it :-). The two formulas used are:
=MATCH(2,1/FREQUENCY($F2,$A$2:$A$14))
in G2, and
=FREQUENCY($A$2:$A$14,$F2)
in H2. Again, drag them down to get the full column filled. Next, column I is for clarification only -- and for sanity checking. It contains the desired minimum date from each sub-array. Column J substitutes that formula into a MATCH to find the actual index of the desired date.
=MIN(OFFSET($C$2:$C$14,$G2-1,0,1+$H2-$G2,1))
in I2 and
=$G2-1+MATCH(2,1/FREQUENCY(MIN(OFFSET($C$2:$C$14,$G2-1,0,1+$H2-$G2,1)), OFFSET($C$2:$C$14,$G2-1,0,1+$H2-$G2,1)))
in J2. Finally, columns L, M and N index into the original set of data via
=INDEX(B$2:B$14,$J2)
in L2, which you can drag horizontally and then vertically.
When you are done, you can hide the helper columns, or fold everything into big formulas. Good luck with that... There might be an easier way to achieve this, but I did not find it.
If you want the value from column D in G then assuming that column C values are unique you could just use a VLOOKUP, i.e. in G2 copied down
=VLOOKUP(F2,C$2:D$14,2,0)
Per your picture, they're all in the same sheet. Just sort by ID, then Date (ascending). As you work your way down the ID column, each time the ID changes, you know you've found the row with the minimum Date for that specific ID. Create an extra column to signify where ID changes occur, and filter for those rows (hide the column if you so desire).
And... voila.
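The same idea (one row per ID, the one with the earliest date, with the other columns carried along) can be sketched in Python without sorting. The function name earliest_per_id, the column order (ID, ID2, date, InCh) and the sample values are assumptions for illustration.

```python
from datetime import datetime


def earliest_per_id(rows):
    """Keep, for each ID (field 0), the full row with the smallest date (field 2)."""
    best = {}
    for row in rows:
        row_id, when = row[0], row[2]
        if row_id not in best or when < best[row_id][2]:
            best[row_id] = row
    return list(best.values())


# Hypothetical rows: (ID, ID2, date, InCh).
rows = [
    (1, "a", datetime(2014, 1, 2, 9, 0), "x"),
    (1, "b", datetime(2014, 1, 1, 8, 0), "y"),
    (2, "c", datetime(2014, 1, 3, 7, 0), "z"),
]
result = earliest_per_id(rows)
print(result)
```

If two rows of the same ID share the minimum date, this keeps the first one encountered.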
Know this link is old, but there is a much shorter and easier way!
How about using a pivot table using the Minimum as field setting and then do a =GETPIVOTDATA() to get the information back!
Seems a lot simpler than these formulas!
Actually, I just realized I've been overthinking this...Excel keeps the top item and removes all that follow when removing duplicates.
So if you are going to create an extra working table anyway, why not just copy the range/columns you want to keep, then use the basic sort.
Sort first by ID, then by the column you want as the second filter. Be sure the sorts are in the order you want (e.g. newest to oldest, oldest to newest, A to Z, Largest to smallest, etc).
Once the data is sorted, remove duplicates based on ID. You are left with all of your columns of data, filtered by newest/oldest/largest/smallest per individual.
This worked for my table with 30,000+ records, filtered down to 1500 unique individuals with most recent (plus associated amount), and with a second filter, the largest (plus associated date) for each person.
How can I compare records in a table to make sure they are not duplicates? I'm using Excel 2007, and I don't want them deleted after the comparison.
Duplicate rows should be colored. I have a table whose columns run from A to P, with 500 rows. I want to put the condition on columns A, B, E, F, G, I.
If you don't want to sort your column, you can try with a matrix formula (http://www.stanford.edu/~wfsharpe/mia/mat/mia_mat4.htm).
Practically, you can compare your current row to every row above. Something like:
=MIN(ROW(B1)*(IF(A2=A1;1;0))*(IF(B2=B1;1;0)))*(...)
Validated with CTRL-SHIFT-ENTER, this checks whether all the conditions are true and otherwise returns 0.
Please send a file (with anonymous data) if you want a practical example.
Hope that helps
Edit: here is the good solution (provided you want to compare data in the Q column; LIGNE and EQUIV are the French-locale names of ROW and MATCH, and ; is the argument separator in that locale):
=MIN(LIGNE($Q$5:Q6)*EQUIV(Q6;$Q$5:Q6;0))
if you want to have the first line where the value appears, or
=MIN(LIGNE($Q$5:Q5)*EQUIV(Q6;$Q$5:Q5;0))
if you'd rather have #N/A when there is no duplicate before that line.
Still validate with CTRL-SHIFT-ENTER.
Sort by the columns you are interested in then use a formula to compare each row with the one above. You can then use conditional formatting to colour the results.
I may sound stupid here, but the simple answers are usually the best.
I did this recently, by literally using the CONCATENATE() function with the TEXT() function to combine all the columns I wanted to compare into a single cell. So in effect I am creating a cell with a unique "key" that holds all the data I want to be unique.
I then sort that column and create another empty column next to it.
Then use this formula to compare each row with the row above it: =IF(A2=A1,0,1)
This simply puts a 0 where the row is the same as the one above and a 1 where it's different.
I then filter on the 0s and there are my duplicates! (The 1s mark the first occurrence of each key.)
It's also useful as an alternative way of doing a COUNT(DISTINCT ...), where I want to count how many unique references exist in my data. SUBTOTAL(3...) is not enough.
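Here is a minimal Python sketch of that key-and-compare trick, including the distinct count. The function name first_occurrence_flags, the "|" separator and the sample rows are assumptions for illustration. A flag of 1 marks the first row of each key after sorting, so summing the flags gives the number of distinct keys.

```python
def first_occurrence_flags(rows, key_cols):
    """Build a text key per row, sort the keys, and flag each key's first row with 1."""
    # A separator keeps values such as 1 and 23 apart from 12 and 3.
    keys = sorted("|".join(str(r[c]) for c in key_cols) for r in rows)
    flags = [
        1 if i == 0 or key != keys[i - 1] else 0  # same idea as =IF(A2=A1,0,1)
        for i, key in enumerate(keys)
    ]
    return keys, flags


# Hypothetical sample data with two key columns.
rows = [("x", 1), ("y", 2), ("x", 1), ("x", 2)]
keys, flags = first_occurrence_flags(rows, key_cols=(0, 1))
distinct_count = sum(flags)
print(list(zip(keys, flags)), distinct_count)
```

Rows flagged 0 repeat the key directly above them, so those are the ones to color or review.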