Finding Duplicates across a thousand lists - excel

I have over 1,100 lists that each contain no more than 30 items in them. I am trying to see if there are any items within the lists that appear in all lists. I was initially thinking that I would need to compare the list in column A to the list in column B, store the duplicates, then compare the duplicates to the list in Column C, store the new duplicates, compare the new duplicates to the list in Column D, and so on until all the lists have been covered.
My questions are:
1.) Is this the correct way to approach this?
2.) If so, is there a simple VBA code that could be used to do this?

Deduplicate each list using Data > Remove Duplicates
Collate all the lists into one long list
Create a pivot table with the column of items as the Row dimension
Use the same column as the Value displayed in the pivot table, and aggregate using Count.
Sort the pivot table in descending order of that count.
The count shows the number of lists in which each item appears. Any item whose count equals the total number of lists (1,100 if you have exactly 1,100 lists) must occur in every list.
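For anyone who wants to check the idea outside Excel, here is a minimal sketch of the same counting logic in Python. The function name and the sample lists are made up for illustration. It assumes each list is just a sequence of text values.

```python
from collections import Counter

def items_in_every_list(lists):
    """Return the items whose per-list count equals the number of lists."""
    counts = Counter()
    for lst in lists:
        # set() plays the role of Remove Duplicates on a single list,
        # so each list contributes at most 1 to an item's count.
        counts.update(set(lst))
    return sorted(item for item, n in counts.items() if n == len(lists))

# Hypothetical sample data standing in for the columns of the sheet.
example_lists = [
    ["apple", "pear", "kiwi", "apple"],
    ["pear", "apple", "plum"],
    ["fig", "apple", "pear"],
]
common = items_in_every_list(example_lists)
print(common)
```

Deduplicating each list first is what makes the count equal to "number of lists containing the item" rather than "number of occurrences".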

Here's my non VBA solution to this fun problem. The plan is to search each item in any one list and compare to all the other lists in the table.
Start off by inserting a new "A" column to the left of your table. Copy any list and paste it to A35.
If your goal is only to find items occurring in all lists, choose the smallest list.
If you would like to analyse further, choose the largest list or even multiple lists.
You could include all items: copy the entire table and paste it with TRANSPOSE to a new sheet, so that you have no more than 30 columns. Copy and paste each of them into one column, then remove the duplicates from this list with Data > Remove Duplicates.
Now you need to create a formula in cell B35 that searches for the string in A35 in the range B1:B30. You drag the formula all the way right and down.
=COUNTIF(B$1:B$30,$A35)
The results will be the count of each item found in each list. In order to see if any item is in all lists, then all columns within the specific row should count at least 1 item. To the right of the results, see what the minimum value in the row is with:
=MIN(B35:API35)
(assuming your table ends in column API)
If any of your rows has a minimum of 1 or more, then that item is included in all lists.
You could then also sum each row to see which items occur the most, and you could use MAX instead of MIN to see if any list has duplicates: a maximum above 1 means some list contains that item more than once.
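The same count table can be sketched in Python to see what the MIN and MAX checks are doing. This is only an illustration with made-up data and names. Note that Python compares text case-sensitively, whereas COUNTIF does not.

```python
def count_table(candidates, lists):
    """One row per candidate, one entry per list: COUNTIF dragged right and down."""
    return {item: [lst.count(item) for lst in lists] for item in candidates}

# Hypothetical sample data: each inner list is one column of the sheet.
lists = [
    ["a", "b", "c"],
    ["b", "c", "c"],
    ["c", "b", "d"],
]
candidates = min(lists, key=len)  # the smallest list; ties pick the first one
table = count_table(candidates, lists)

# MIN >= 1 across the row: the item appears in every list.
in_all = [item for item, row in table.items() if min(row) >= 1]
# MAX > 1 across the row: some list contains the item more than once.
has_repeat = [item for item, row in table.items() if max(row) > 1]
print(in_all, has_repeat)
```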

Please try to use this
If it does not work, I can help you with VBA macro code.
The logic will be as below:
1. Keep the 1st column as the base to check against all the other columns.
2. In a loop, check each of the 30 cells of the 1st column against the cells of every other column.
3. Stop the loop for a value as soon as you don't find it in an entire column.
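A rough sketch of that loop, written in Python rather than VBA purely to illustrate the logic. The function name and sample values are assumptions, not part of the original answer.

```python
def common_to_all(base, others):
    """Return the values of the base list that appear in every other list."""
    found = []
    for item in base:
        if item in found:
            continue  # skip repeats within the base list
        # all() stops at the first list that lacks the item (step 3).
        if all(item in other for other in others):
            found.append(item)
    return found

# Hypothetical sample data: base is the 1st column, others are the rest.
base = ["x", "y", "z", "x"]
others = [["y", "x", "w"], ["x", "q"]]
result = common_to_all(base, others)
print(result)
```

The early exit matters with 1,100 columns: most values are ruled out after a handful of columns, so the loop rarely has to scan them all.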

Related

Using Excel transpose data from column to row whilst also finding uniques

I have what seems to be an easy task, but at the moment I'm stumped.
I have a list of text values, say A1:A19, with multiple entries some of which are repeated in the list.
I want to take the list in column A and copy it to a row, say B2:B8. However, I only want to move across each individual value once. Can this be done?
UNIQUE returns the unique values from a range (it is available in Excel 2021 and Microsoft 365).
TRANSPOSE rotates cells from rows to columns or vice-versa.
=TRANSPOSE(UNIQUE(A2:A19))
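To make the two steps concrete, here is a small Python sketch of "keep the first occurrence of each value, then turn the column into a row". The helper names and sample values are invented for the example.

```python
def unique_rows(rows):
    """Keep the first occurrence of each row, in original order."""
    return [list(r) for r in dict.fromkeys(tuple(r) for r in rows)]

def transpose(rows):
    """Swap rows and columns."""
    return [list(col) for col in zip(*rows)]

# Hypothetical single-column range: one value per row.
column_range = [["red"], ["blue"], ["red"], ["green"], ["blue"]]
as_row = transpose(unique_rows(column_range))
print(as_row)
```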

Find column first, then value in column

I am having trouble figuring out how to write a function to return a value from a column. Let's say I have a big master list of excluded numbers with columns 1,2,3,4,5,6. In each column is a bunch of values, anywhere from 1-500, and each column can have repeat values or be missing values.
I'll regularly be getting large lists of values and their corresponding columns that I will need to verify are in or not in the master list.
If I get two columns of data, one of values and one of their corresponding columns to cross check in the master list, is there a function or group of functions that will do this?
Sort of like a VLOOKUP, but instead of starting at the left most column, it looks at the column that my list tells it to and then looks for the value my list has. I'm having trouble figuring it out with an INDEX/MATCH because the values can show up on different rows in each column since some columns have omitted numbers.
For the sake of an answer, a comment from @tigeravatar:
=COUNTIF(INDEX(A:F,0,X),Y) where X is the column number and Y is the number you're looking for.
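The idea is "pick the whole column first, then count matches in it". A minimal Python sketch of that two-step lookup, assuming the master list is stored as a mapping from column number to its values (the data and names are hypothetical):

```python
# Hypothetical master list: column number -> values excluded in that column.
master = {
    1: [5, 12, 12, 40],
    2: [7, 12],
    3: [1, 2, 3],
}

def count_in_column(master, column, value):
    """Select the column first (like INDEX), then count matches (like COUNTIF)."""
    return master.get(column, []).count(value)

# Incoming pairs of (value, column to check it in).
checks = [(12, 1), (12, 3), (7, 2)]
results = [(v, c, count_in_column(master, c, v) > 0) for v, c in checks]
print(results)
```

A count greater than zero means the value is in the master list for that column, regardless of which row it sits on.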

Excel Instance Parsing

I have a list of data "instances" within one column within an excel sheet.
Each instance can have numerous copies. Here is an example:
abcsingleinstanceblah0001
cdemultipleinstanceexample0001
cdemultipleinstanceexample0002
cdemultipleinstanceexample0003
cdemultipleinstanceexample0004
....
Unfortunately the numbering scheme was not preserved across all of this data. So in some cases copies will have randomized numbers. However, the root instance name is always the same.
QUESTION: What would be a good strategy for creating a function that will parse a list of these instances and, in a new column, list all duplicates past the second copy? In relation to the example above, the new column would list:
cdemultipleinstanceexample0003
cdemultipleinstanceexample0004
I need to have the two duplicates with the lowest integer values preserved out of each set of duplicates, which is why in the example above 3 and 4 would have to go. So in the case of randomized numbers, the two instances with the lowest integer values.
What I have thought of
I was thinking to first organize the column by alphabetical order, which should automatically put duplicates in ascending order. I could then basically strip the number value from all instances, and find where there are more than 2 exact duplicates from the core instance name, which would give me the instances with more than 2 duplicates so that I could perform a function on the original data set... but I don't know if there is a better way of doing this or where to go from here.
I'm looking for formula-based solutions.
Assuming your sorted list is in Column A and that you have a row of headers you could use the following formulas in the neighboring columns.
In B:
=LEFT(A2,LEN(A2)-4)
In C (although not really necessary):
=RIGHT(A2,4)
In D starting with row 3:
=IF(AND(B3=B2,COUNTIF(B1:B3,B3)>2),"Del","Keep")
This formula doesn't work in row 2, but you can hard code the first result.
Then filter the list on Column D for "Del" and delete all the rows.
How's that?
Sort your list in column A. You'll want column headings for later, so put those in row 1 (or leave it blank). In B2, type =LEFT(A2,LEN(A2)-4) and drag the formula down to strip the integers. In C3 type =VLOOKUP(B3,$B$2:$B2,1,0). Fill the formula in C3 right one cell and then down the length of the data. Now in D3 you'll have a list that has errors for any entry that has only 2 or fewer instances and the name for any that has more than 2. Filtering this list on column D for #N/A will allow you to delete all the rows with two or fewer entries.
Remove your filter. Then re-sort the list in column A in reverse order so the high numbers are first. Replace the contents of C2 and D2 with #N/A. Refilter the list on column D for everything but #N/A and delete all the entries that have an instance listed.
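As a cross-check on either formula approach, the rule "keep the two lowest-numbered copies of each root, list the rest" can be sketched in Python. Like the LEFT/RIGHT formulas above, this assumes the number is always the last 4 characters. The function name is made up.

```python
from collections import defaultdict

def extras_beyond_two(instances, suffix_len=4, keep=2):
    """List the instances to remove: everything past the `keep` lowest numbers per root."""
    groups = defaultdict(list)
    for name in instances:
        # Split into root name and trailing number, like LEFT and RIGHT.
        root, number = name[:-suffix_len], int(name[-suffix_len:])
        groups[root].append((number, name))
    extras = []
    for root in groups:
        ordered = sorted(groups[root])  # lowest numbers first
        extras.extend(name for _, name in ordered[keep:])
    return extras

instances = [
    "abcsingleinstanceblah0001",
    "cdemultipleinstanceexample0003",
    "cdemultipleinstanceexample0001",
    "cdemultipleinstanceexample0004",
    "cdemultipleinstanceexample0002",
]
to_remove = extras_beyond_two(instances)
print(to_remove)
```

Sorting by the numeric value rather than by position means it still picks the two lowest copies when the numbering is randomized.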

Listing duplicate emails in Excel mailing lists

I'm trying to create a list of values in 'sheet 3 column A' consisting of all values that are duplicated across two other sheets.
The duplicates are to be found by looking through each value in 'sheet 1 column P' and checking if that value also exists in 'sheet 2 column A'.
I've tried reading up on this and there seem to be a number of functions I could use, and I'm not sure which I should use.
You need to use the VLOOKUP function, combined with IF. Together they are very very powerful. I really suggest you read up on them.
The following formula in Sheet 3, Column A (starting at row 2) will do what you want:
=IF(ISNA(VLOOKUP(Sheet1!P2,Sheet2!$A$2:$A$99,1,FALSE)),"",Sheet1!P2)
Copy that formula down from A2. I've assumed you have headings in row 1. If you have more than 98 rows of emails (values to check), change $A$99 to be something like $A$9999.
So, let me get this straight. You have two workbook tabs. You want to get the intersection of the set (figure out where they are overlapping, duplicate, however you want to say it).
I would do one of two things, depending on how much you like Excel and moving your data around.
Option 1: Create a PivotTable of the data (assumes no duplicates within lists, only between lists)
Copy the data from the second list after the end of the first list (so both lists are now one list)
Insert a Pivot Table (on the ribbon), choosing your single column for the source
PivotTable options will pop up. Put the email address field in RowLabels and Count in the Summarize Values box.
Click on the count column of the pivot and sort largest to smallest.
All your duplicates will have Count > 1
Option 2 - use CountIf
This does not involve moving your data.
Go to Sheet 1. In the next column over (from your info, it would be Column Q), put the COUNTIF function:
=COUNTIF(Sheet2!A:A,P2)
Then you can sort descending on your new count column to find duplicates.
CountIf performs very well in Excel if your lists are very large.
This can be refined slightly using IFERROR, giving:
=IFERROR(VLOOKUP(Sheet1!P2,Sheet2!$A$2:$A$99,1,FALSE),"")
but it is essentially the same thing.
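All of the lookup formulas above boil down to a membership test: keep each value from Sheet 1 column P that also exists in Sheet 2 column A. A minimal Python sketch of that test, with invented addresses and names. It compares case-insensitively because VLOOKUP's exact match ignores case.

```python
def duplicates_between(sheet1_col_p, sheet2_col_a):
    """Return the values from the first list that also appear in the second."""
    # casefold() mimics the case-insensitive matching of VLOOKUP/COUNTIF.
    lookup = {value.casefold() for value in sheet2_col_a}
    return [value for value in sheet1_col_p if value.casefold() in lookup]

# Hypothetical mailing lists.
sheet1_col_p = ["A@x.com", "b@x.com", "c@x.com"]
sheet2_col_a = ["a@x.com", "d@x.com", "C@X.com"]
both = duplicates_between(sheet1_col_p, sheet2_col_a)
print(both)
```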

Trying to add up values but have multiple entries

I am trying to look up the value in one column and pull the number from another column.
Of course, I could use a simple VLOOKUP or MATCH.
However, the first column of data has multiple entries that are the same. If I use VLOOKUP, it is just going to pull the first matching number from the second column.
I need to pull each number from the second column and add them together, even though there are multiple entries.
If there is a way to consolidate the multiple entries in 1st column while also summing up the numbers in the 2nd, that would be great.
I would recommend a Pivot Table. To create one, select a cell in your data range (which needs to have column names in the first row). Choose Insert / Pivot Table from the Ribbon and select the New Worksheet option for the location.
In the Pivot Table list on the new worksheet, drag the name of the first column to the Row Labels box and the name of the second column to the Values box. The name in the Values box should turn to Sum of <2nd column name>.
The Pivot Table will now show a sorted list of the column 1 values and the summed values of column 2.
Does SUMIF do what you are looking for?
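Both suggestions do the same thing: group the rows by the first column and add up the second. In SUMIF terms that would be something like =SUMIF(A:A,D2,B:B), where the column letters are only assumed for illustration. A minimal Python sketch of the grouping, with made-up data:

```python
from collections import defaultdict

def sum_by_key(rows):
    """Consolidate repeated keys and sum their amounts, sorted by key."""
    totals = defaultdict(int)
    for key, amount in rows:
        totals[key] += amount
    return dict(sorted(totals.items()))

# Hypothetical two-column data: (1st column value, 2nd column number).
rows = [("north", 10), ("south", 5), ("north", 7), ("east", 3), ("south", 1)]
totals = sum_by_key(rows)
print(totals)
```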
