VBA - Remove duplicates which contain less information - excel

First question on Stack, but not my first visit!
Basically I have this huge Excel database (>24 000 rows, merged from different tables) I have been working on for weeks and now that I'm done adding new entries, I have to clean it by removing a lot of duplicates.
The array/table is structured in the following manner :
+---------+-------+--------------------+------------+-------------------+
| Company | Name  | Address            | Phone      | Email             |
+---------+-------+--------------------+------------+-------------------+
| Baij&Co | Steve | 458 Preston avenue | 4156854789 | steve@baij&co.com |
+---------+-------+--------------------+------------+-------------------+
I did search through conventional methods but they don't exactly answer my problem, such as:
Using the "Remove Duplicates" Excel button by selecting all columns to make sure I only keep unique values
Using the filtering method to identify the duplicates and then remove them.
However, my goal is to remove the duplicates for which the given row(s) contains the minimal amount of information, as shown in this example:
+---------+-------+--------------------+------------+-------------------+
| Company | Name  | Address            | Phone      | Email             |
+---------+-------+--------------------+------------+-------------------+
| Baij&Co | Steve | (blank)            | 4156854789 | steve@baij&co.com |
| Baij&Co | Steve | (blank)            | (blank)    | steve@baij&co.com |
| Baij&Co | Steve | 458 Preston avenue | 4156854789 | steve@baij&co.com |
+---------+-------+--------------------+------------+-------------------+
Here, I would like to remove the 1st AND 2nd row as they contain less information (missing address & phone entry) about the same contact.
Does that make sense?
I only know the basics of VBA (like creating a userform to add a new contact and fill out the entered information in the right cells) but I struggle with advanced algorithms.
As far as I know, the built-in VBA method cannot be customized, apart from selecting the columns in which I want to remove the duplicates:
Sheets("Database").Range("ContactsTable").RemoveDuplicates Columns:=Array(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15), Header:=xlNo
Any ideas?

Thanks fellas!
So I followed @Tim Williams's suggestion (which is similar to Scott's actually) and did the following:
I realized that email addresses were the unique identifier (or primary key) and I have to delete rows that don't contain any (as it becomes useless to have a contact file without contact information).
I added a column named "Count" and inserted the following formula:
=COUNTIF(N:N; N2)
--> Here, "N:N" is the column containing all email addresses. "N2" being the first cell.
I then sorted the table in descending order on the new "Count" column to have the most frequent email addresses first.
Then I used Excel's "Remove Duplicates" tool and selected the email address column.
As a result, 10 000 rows were removed (out of 24 000). The table now contains one row per email address. Sadly, I will never know for sure whether the most complete row was kept for each contact (unless I spend days comparing both databases, row by row).
Problem solved I guess! Although I would be interested in a VBA-script to do the same (to learn on the algorithm aspect) if anyone knows anything about it :-)
Thanks again!
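For anyone after the VBA version requested above, here is a minimal sketch rather than a definitive implementation. It assumes a layout the post does not fully confirm: a sheet named "Database", headers in row 1, data in columns A:O, and the email address in column N. It scores each row by its number of non-empty cells, keeps the best-scoring row per email address (the first one on ties), and deletes everything else, including rows with no email address. It relies on Scripting.Dictionary, which is only available on Windows. Run it on a copy of the workbook first.

```vba
Option Explicit

' Sketch only: for each email address, keep the row with the most filled cells
' and delete the rest. Rows without an email address are deleted too.
' Assumed layout (not confirmed by the post): sheet "Database", headers in
' row 1, data in columns A:O, email address in column N.
Sub KeepMostCompleteRowPerEmail()
    Const FIRST_ROW As Long = 2     ' first data row (assumption)
    Const LAST_COL As Long = 15     ' column O (assumption)
    Const EMAIL_COL As Long = 14    ' column N (assumption)

    Dim ws As Worksheet
    Dim data As Variant
    Dim bestRow As Object, bestScore As Object
    Dim lastRow As Long, r As Long, c As Long, score As Long
    Dim key As String
    Dim keepRow As Boolean

    Set ws = ThisWorkbook.Worksheets("Database")
    lastRow = ws.UsedRange.Row + ws.UsedRange.Rows.Count - 1
    If lastRow < FIRST_ROW Then Exit Sub

    ' Read everything into memory once; looping over cells is much slower.
    data = ws.Range(ws.Cells(FIRST_ROW, 1), ws.Cells(lastRow, LAST_COL)).Value
    Set bestRow = CreateObject("Scripting.Dictionary")
    Set bestScore = CreateObject("Scripting.Dictionary")

    ' Pass 1: remember the fullest row seen for each email address.
    For r = 1 To UBound(data, 1)
        key = LCase$(Trim$(CStr(data(r, EMAIL_COL))))
        If Len(key) > 0 Then
            score = 0
            For c = 1 To LAST_COL
                If Len(Trim$(CStr(data(r, c)))) > 0 Then score = score + 1
            Next c
            If Not bestRow.Exists(key) Then
                bestRow(key) = r
                bestScore(key) = score
            ElseIf score > bestScore(key) Then
                bestRow(key) = r
                bestScore(key) = score
            End If
        End If
    Next r

    ' Pass 2: delete every other row, bottom-up so row numbers stay valid.
    Application.ScreenUpdating = False
    For r = UBound(data, 1) To 1 Step -1
        key = LCase$(Trim$(CStr(data(r, EMAIL_COL))))
        keepRow = False
        If Len(key) > 0 Then keepRow = (bestRow(key) = r)
        If Not keepRow Then ws.Rows(r + FIRST_ROW - 1).Delete
    Next r
    Application.ScreenUpdating = True
End Sub
```

Unlike the COUNTIF-and-sort approach, the row that survives here is chosen by how many cells it has filled, not by its position after sorting. It does not merge rows, so if two partial rows each hold information the other lacks, only the fuller one is kept.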

Related

Excel: How do i auto extract updated info from the main sheet to individual sheets

I am currently building a sales summary that covers a lot of customers, and I am trying to find a way to automatically copy values from the main sheet to each customer's individual sheet, as there are too many customers for me to do that by hand.
The main sheet looks like this, with headers:
| Serial | Date   | Customer Name | Product Info |
| 001    | Jan4th | Mike          | Apple        |
I am trying to create a formula so that an individual sheet extracts only the rows that belong to one customer (Mike), with the other sheets doing the same for the other customers.
It would help a lot if the formula also picked up values that are added later. The other methods I found can only distribute the values that already exist, so when new values are added I have to repeat the whole process, which is not efficient given the number of customers I have to summarize.
If not a formula, then any other method would help too, but VBA is a bit above my capability, so if you can explain in detail how I can make use of it, that would be great.
If anyone can come up with anything I would be grateful. Thank you for your attention.
I have tried copying and pasting as a link, but the pasted cells do not pick up new values added afterwards.

Excel: View clients who don't have a product

I have a table of clients and the products that they have purchased.
I'm looking for a simple way of being able to filter to view all clients who don't have a certain product.
Client | Product
------ | ------
John | A
John | B
John | C
Kate | A
Kate | B
Kate | D
Mary | A
Mary | D
With the above example I would want to look for which clients do not have Product -> C, the return I'm after is Kate and Mary.
I've tried looking at this in a few different ways but I feel I'm over complicating it. I was creating a table to return who has the product then doing a lookup from there against another table of all users to then find out who wasn't in the first list.
I tried using a pivot table to get what I was after but I'm only able to return who has the products rather than who doesn't, also filtering product C from the pivot table does not help as the Client still shows up having other products.
I'm hoping there is an easier way to do this.
Your assistance is appreciated.
Dane
COUNTIFS should do the trick here.
You have one cell where you enter the product to look for. Then you add a column to your table that checks if the client does not have that product.
=COUNTIFS([Client],[@Client],[Product],referenceToTheProductToLookFor)=0
This will count the rows where
the entry in the column "Client" is the same as the value in the column "Client" in the current row ([Client] references the whole column, [@Client] only the current row's value of that column)
the entry in the column "Product" is the same as the one entered in your input cell
and checks if the resulting count is 0. If it is 0 the cell value will be TRUE, otherwise it will be FALSE.
If you want to avoid having to make two steps each time you change the product you are looking for (1. enter the product, 2. update the filter on the table), you could use the worksheet's Worksheet_Change event handler to detect changes to the product cell and then automatically reapply the filter on the table.
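A minimal sketch of such an event handler, under assumptions the answer does not state: the product is typed into cell F1, the table is a ListObject named "ClientProducts" on the same sheet, and the filter on the helper column has already been set to TRUE once by hand. Both names are hypothetical and need adjusting. The code goes in the code module of the worksheet itself, not in a standard module.

```vba
Option Explicit

' Sketch only. Reapplies the table's existing filter whenever the product
' cell changes. PRODUCT_CELL and TABLE_NAME are assumptions, not names
' taken from the answer.
Private Sub Worksheet_Change(ByVal Target As Range)
    Const PRODUCT_CELL As String = "F1"
    Const TABLE_NAME As String = "ClientProducts"
    Dim lo As ListObject

    ' Ignore edits anywhere other than the product cell.
    If Intersect(Target, Me.Range(PRODUCT_CELL)) Is Nothing Then Exit Sub

    ' Make sure the COUNTIFS column is up to date before filtering.
    Me.Calculate

    Set lo = Me.ListObjects(TABLE_NAME)
    ' Re-evaluate the criteria already set on the table's filter.
    If lo.ShowAutoFilter Then lo.AutoFilter.ApplyFilter
End Sub
```

Reapplying the existing filter, rather than setting the criteria in code, keeps the macro independent of which column holds the helper formula.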

Searching in Excel for certain values, if found give text from cell to the left of where we found the value

First let me explain what I want to achieve.
I currently have an Excel like this:
Names | Standards
James | Standard 1
James | Standard 2
James | Standard 3
Francis | Standard 1
Francis | Standard 2
Francis | Standard 3
Leon | Standard 2
Leon | Standard 3
Peter | Standard 2
Michael | Standard 3
And I want to create something like this:
Standard | Name 1 | Name 2 | Name 3 | Name 4
Standard 1 | James | Francis | |
Standard 2 | James | Francis | Leon | Peter
Standard 3 | James | Francis | Leon | Michael
My real Excel has more than 300 standards, so I would like to automate this using Excel Formula. I know this is possible, but I haven't used Excel in a while, so I could use a push in the right direction.
Couple of things I need (I think):
Need to count how many times people in the names column mention a standard. So I want to know that I need 2 names for standard 1 and 4 for standard 3. I think I can do this by using the COUNTIF method.
We need to search for the location of the standards. I think I can do this by using the MATCH function. This gives us the location of the first match in my original Excel. By sorting my original Excel a-z on the standards column and combining it with the COUNTIF result, I know where all the matches are (first match + COUNTIF - 1 = location of the last match, and everything in between is also that standard).
For the first name that mentioned a standard, I will reference the cell left of the first match (because the names are in the cell to the left of the standard I found). For the second name I will reference the cell left of the cell below the first match. I keep doing this until I have found as many names as COUNTIF reported. So I need an IF statement that makes sure that, if only 2 people mention standard 1, that row gets only 2 names and 2 cells containing "".
How will I reference the cells? By another if statement that uses this: Excel Reference To Current Cell , Correct me if I am wrong, but can't I then just say THIS.CELL=cell location I found (probably should use INDIRECT here?).
This is just me brainstorming, but I would love to know if people have any other ideas for my problem or have some feedback for my current plan.
An important thing to mention is that I want to do this using Excel formulas. I do realise that this isn't always the best approach, but VBA is not an option at the moment. I am also not worried about performance issues, because I think I'll just copy all the values after I have found all the names using formulas.
Thanks in advance!
Depending on how you want the layout, I think you should use a pivot table. Drag the 'Standards' and 'Names' fields to the 'Rows' box, then right-click on a standard and choose 'Field Settings' - 'Layout & Print' - 'Show item labels in tabular form'.
If you definitely need the data in the format in your question, I would edit the pivot table by dragging the 'Names' field to the 'Columns' box. Then drag the 'Standards' field from the field list a second time and drop it in the 'Values' box, so that each name/standard combination shows a count of 1.
In the space underneath the pivot table, use an IF formula to only copy the name if there is a 1. This kind of approach will obviously be quite fragile, so if you can make do with the first approach, I think you will run into fewer problems in the future.

VBA creating a Userform single cell object

I've developed an application that uses Excel/VBA to interface with a MySQL database in a WAMP configuration. One of the features of this application is the automatic saving of cell values, including the colors of words, font style, font size etc. The application loads a calendar and renders the stored cell text when the workbook opens.
This is achieved by breaking the cell value down into characters and storing them in a cell_value table that references other objects within my database, in the following manner:
+----+-------------+------------+------------+----------------------+--------------------+
| id | employee_id | project_id | date | cell_value | font_color |
+----+-------------+------------+------------+----------------------+--------------------+
| 1 | 1 | 1 | 2015-01-07 | the weather is crazy | 1:14:70,15:20:2000 |
+----+-------------+------------+------------+----------------------+--------------------+
This table stores the text entered by the person with id=1 for the project with id=1 on '2015-01-07'. The cell value is 'the weather is crazy'. The font color is stored as a comma-delimited string, and each element of that string is itself delimited by colons:
+--------------------------+------------------------+------------+
| starting_character_index | ending_character_index | font_color |
+--------------------------+------------------------+------------+
| 1 | 14 | 70 |
| 15 | 20 | 2000 |
+--------------------------+------------------------+------------+
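To make the font_color format concrete, here is a minimal sketch of how such a string could be unpacked and applied to a cell on load. It is not the application's actual code. It assumes the third field of each element is a Long that can be assigned directly to Font.Color, and that the cell already contains the stored text as a plain string. The procedure name is hypothetical.

```vba
Option Explicit

' Sketch only: applies a "start:end:color" list such as "1:14:70,15:20:2000"
' to the text already in a cell. Assumes the color field is a Long usable
' as Font.Color.
Sub ApplyFontColors(ByVal target As Range, ByVal fontColorSpec As String)
    Dim runs() As String
    Dim parts() As String
    Dim i As Long, startIdx As Long, endIdx As Long

    If Len(fontColorSpec) = 0 Then Exit Sub

    runs = Split(fontColorSpec, ",")            ' one element per colored run
    For i = LBound(runs) To UBound(runs)
        parts = Split(Trim$(runs(i)), ":")      ' start, end, color
        If UBound(parts) = 2 Then
            startIdx = CLng(parts(0))
            endIdx = CLng(parts(1))
            ' Characters takes a start position and a length, not an end index.
            target.Characters(Start:=startIdx, _
                              Length:=endIdx - startIdx + 1).Font.Color = CLng(parts(2))
        End If
    Next i
End Sub
```

For the example row, a call might look like `ApplyFontColors Range("B2"), "1:14:70,15:20:2000"` after writing 'the weather is crazy' to B2. Storing runs of identical formatting, as this format does, means one Characters call per run on load instead of one per character.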
I have been using the worksheet's selection change event (Worksheet_SelectionChange) to trigger an automatic write to the database that stores the contents of a global called LAST_SELECTED_CELL. The active cell is set before the user makes changes to it, so on its own it cannot be used to write to the database AFTER the user has changed the contents of the cell.
This method works, but it is prone to bugs, and it is slow for large numbers of cells and complex font/color configurations.
I was looking into using a spreadsheet embedded in a userform as a potentially better way of doing a similar process; however, I've found it to be quite buggy and prone to crashing. Is there a tool out there that allows you to embed a single Excel 'cell' in the userform instead of an entire workbook?
If not, is there anything else that could be modified to suit my needs? Or do you have any suggestions on how to improve my current method of storing/rendering the data?
Kind regards
Jordan
You can't. There is no way to embed a worksheet in the userform. But...
You can imitate it by using a ListBox. It can display data in rows and columns, the way an Excel sheet does. It lets you select a single row or multiple rows at a time. You can make only one column visible and hide the rest (use the ColumnWidths property).
Cheers
Maciej
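A minimal sketch of the ListBox setup described above, assuming a userform with a ListBox named ListBox1 and source data in A1:C10 of a sheet named "Data" (both names are hypothetical). The code goes in the userform's code module.

```vba
Option Explicit

' Sketch only: fills a three-column ListBox from a worksheet range and hides
' all but the first column. "ListBox1", "Data" and "A1:C10" are assumptions.
Private Sub UserForm_Initialize()
    With Me.ListBox1
        .ColumnCount = 3
        .ColumnWidths = "120;0;0"            ' widths in points; 0 hides a column
        .MultiSelect = fmMultiSelectSingle   ' or fmMultiSelectMulti for several rows
        .List = ThisWorkbook.Worksheets("Data").Range("A1:C10").Value
    End With
End Sub
```

Note that a ListBox shows plain text only, so per-character colors and fonts within one entry cannot be displayed this way.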

Cross reference two data sources for matches in Excel 2010

Firstly, thank you for checking my question. I'm new to doing anything advanced in Excel so I'm a bit lost.
I am trying to match names from two different sources that have the same data structure. There are 3 columns: LastName, FirstName, MiddleName. I added a fourth column to denote which organization the record came from, put both sources into one table and made a pivot table out of it. That works well enough, but I'm having a hard time generating any useful data from it.
There are two main objectives once I have them matched.
I need a percentage of matching.
I need to be able to filter out the ones that matched so I can investigate the ones that didn't.
Here is a small example.
+-------------+-----------+------------+------+
| LastName | FirstName | MiddleName | Org. |
+-------------+-----------+------------+------+
| Jones | Mike | Anthony | Org1 |
| Black | Marry | | Org1 |
| Zeek | Winston | E | Org1 |
| Jones | Mike | A | Org2 |
| Black-Smith | Marry | | Org2 |
| Zeek | Winston | E | Org2 |
+-------------+-----------+------------+------+
As you can see out of the list only Winston E Zeek would really match because all three names are exactly the same. Mike Jones won't match because the listed middle names are wrong and Black and Black-Smith won't match because they are technically different last names. These issues with the data are fine at this stage because those are exactly what I'm trying to identify with a larger data set.
Maybe Excel isn't the best for this issue without using VBA? I'm not familiar with VBA which is why I haven't tried it yet and I unfortunately have limited time.
How can I solve this matching problem?
Any assistance and guidance will be appreciated.
Here's a quick idea:
1. Sort the data by last name, first name, middle name. That should put same/similar names next to each other.
2. Add a column that, for each row, has a worksheet function like =IF(A3=A2,1,0). This will indicate if this row matches the one above.
3. Sum the new column... That will tell you the number of matches. Divide by the total number of rows to get your percentage.
You can modify the function in step 2, to indicate as tight of a match that you want.
Advantage: No VBA needed. Disadvantage: It requires some manual work and interpretation.
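If VBA does become an option, the same idea can be sketched without sorting. This is an illustration under assumptions, not part of the answer above: headers in row 1 of the active sheet, columns A:D holding LastName, FirstName, MiddleName and Org., and column E free for the result. A row counts as matched when its full name appears under more than one organization. The comparison ignores case and surrounding spaces, as the worksheet = operator would. It uses Scripting.Dictionary, which is only available on Windows.

```vba
Option Explicit

' Sketch only: flags rows whose LastName/FirstName/MiddleName combination
' occurs under more than one organization. Layout (A:D data, E output) is
' an assumption.
Sub FlagCrossOrgMatches()
    Dim ws As Worksheet
    Dim data As Variant, flags() As Variant
    Dim orgsByName As Object, orgs As Object
    Dim lastRow As Long, r As Long, matched As Long
    Dim key As String

    Set ws = ActiveSheet
    lastRow = ws.Cells(ws.Rows.Count, 1).End(xlUp).Row
    If lastRow < 2 Then Exit Sub

    data = ws.Range("A2:D" & lastRow).Value
    Set orgsByName = CreateObject("Scripting.Dictionary")

    ' Pass 1: collect the set of organizations seen for each full name.
    For r = 1 To UBound(data, 1)
        key = NameKey(data, r)
        If Not orgsByName.Exists(key) Then
            orgsByName.Add key, CreateObject("Scripting.Dictionary")
        End If
        Set orgs = orgsByName(key)
        orgs(CStr(data(r, 4))) = True
    Next r

    ' Pass 2: a row matches when its name was seen under 2+ organizations.
    ReDim flags(1 To UBound(data, 1), 1 To 1)
    For r = 1 To UBound(data, 1)
        Set orgs = orgsByName(NameKey(data, r))
        flags(r, 1) = (orgs.Count > 1)
        If orgs.Count > 1 Then matched = matched + 1
    Next r

    ws.Range("E1").Value = "Matched"
    ws.Range("E2").Resize(UBound(data, 1), 1).Value = flags
    Debug.Print "Share of rows with a match: " & Format(matched / UBound(data, 1), "0.0%")
End Sub

' Builds a comparison key from the three name columns of one row.
Private Function NameKey(ByRef data As Variant, ByVal r As Long) As String
    NameKey = LCase$(Trim$(CStr(data(r, 1)))) & "|" & _
              LCase$(Trim$(CStr(data(r, 2)))) & "|" & _
              LCase$(Trim$(CStr(data(r, 3))))
End Function
```

With the six example rows from the question, only the two Winston E Zeek rows would be flagged. Filtering column E on FALSE then leaves the records that still need investigating.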
