Changing the value based on the number of occurrences in MS Excel - excel-formula

In the following table,
Name |Status
Emily | Rich
Sam | Poor
Emily | Rich
Emma | Poor
Emily | Poor
Emma | Rich
The requirement is, for a particular person, if the count of their status "Rich" is >=2 then change all the status occurrences for that person to "Rich". For others whose status does not satisfy the condition, retain their original values.
For example, in the case of Emily, her status is "Rich" in the first 2 occurrences but the status in the third occurrence is "Poor". I want that to be changed to "Rich" as it satisfies the condition and populate the NEWStatus column with the updated results as shown below.
Name |Status | NEWStatus
Emily | Rich | Rich
Sam | Poor | Poor
Emily | Rich | Rich
Emma | Poor | Poor
Emily | Poor | Rich
Emma | Rich | Rich
I tried using the COUNTIF() function but I am still not getting the desired results. Please help.

Enter the formula in cell C2 (row 2 of column C), since the first row is assumed to hold the column headers.
The formula below assumes Name is in column A and Status is in column B:
=IF(COUNTIFS(A:A,A2,B:B,"Rich")>=2,"Rich",B2)
Drag the formula down.
To go one step further and avoid a 0 result when the column B cell is empty:
=IF(COUNTIFS(A:A,A2,B:B,"Rich")>=2,"Rich",IF(B2="","",B2))
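The rule behind the COUNTIFS formula can also be sketched outside Excel. The following is a minimal Python illustration, not part of the original answer; the function name `new_status` and its parameters are hypothetical. Note that COUNTIFS ignores case, while this sketch compares strings exactly.

```python
from collections import Counter

def new_status(rows, target="Rich", threshold=2):
    """rows is a list of (name, status) pairs; returns the NEWStatus column.

    Hypothetical helper: counts how often each name has the target status
    (the COUNTIFS part) and overrides the status when the count reaches
    the threshold (the IF part).
    """
    counts = Counter(name for name, status in rows if status == target)
    return [target if counts[name] >= threshold else status
            for name, status in rows]

# Sample data from the question
rows = [("Emily", "Rich"), ("Sam", "Poor"), ("Emily", "Rich"),
        ("Emma", "Poor"), ("Emily", "Poor"), ("Emma", "Rich")]
print(new_status(rows))
```

Emily has two "Rich" rows, so all three of her rows become "Rich"; Sam and Emma keep their original values.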

Related

Excel VBA for Data Cleaning

I want to find a way to automate our data cleaning (de-duping) process using excel VBA. Currently, our team gets a patient list from the clinics to de-dup since we find variations of duplicate records.
The reports we get from the clinic come in an Excel spreadsheet. We use only specific columns from the spreadsheet, since we do not need everything in it. I've used multiple functions to remove duplicates; because the patient records are entered manually, the same person can appear in many different ways in the list.
It's okay if they have different addresses and insurance info, because people move and switch insurance. The conditions we focus on are whether someone has the same date of birth, first name, and last name. We're not too strict on the last name either, because last names change. In that case, we make sure that the date of birth, the first name, and the address are the same. We need at least three elements to be the same to consider two records the same person; otherwise we treat them as different people and leave the records alone.
The list we work with consists of the following columns:
First name, last name, DOB, Address, City, State, Zip, Primary Insurance.
| First Name | Last Name | DOB | Address | City | State | Zip | Primary Insurance |
|------------|-----------|--------------|------------------|----------|------------|--------|--------------------|
| John | Smith | 01/01/1990 | 123 ABC Street | Denver | Colorado | 87880 | Humana Insurance |
| Jon | Smith | 01/01/1990 | 123 ABC Street | Denver | Colorado | 87880 | Humana Insurance |
| Anthony | Bowen | 02/02/1992 | 456 ABC Street | Austin | Texas | 78632 | Aetna |
| Tony | Bowen | 02/02/1992 | 456 ACC Str | Austin | Texas | 78632 | Aetna |
Currently I sort the entire sheet of data from DOB oldest to newest, first name a-z, last name a-z.
From there I apply a formula:
if(and(C2=C3, B2=B3, A2=A3),"Check","") and apply through the entire rows. I filter all the blank cells to remove the formulas embedded and un-filter to jump down to the next cell that is flagged down with "Check" and check the next two rows.
I spot check to make sure the formula is picking all the true duplicates and then filter to see all the "Check" and remove all rows at the same time.
Then I move on to applying another formula:
if(and(C2=C3,left(B2,2)=left(B3,2),left(A2,2)=left(A3,2)),"Check","")
This is to match by same DOB, first two letters of first name and last name. Same process by jumping down to the next row that is flagged down with "Check" to review the two consecutive rows.
Again, spot check to make sure they're true duplicates, then filter to see all "check" and mass delete.
I would like to do this by having macro embedded buttons and the button will help grab the flagged down records to another tab of worksheet and removing those duplicated records from the original working sheet. So this way, it's removed from the original list but the user can still go to the other tab to review those removed duplicates if they want to.
I'd appreciate any suggestions anyone can give.
Thank you!
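As a starting point for automating the two passes described above, here is a minimal Python sketch of the flagging logic only. It is an illustration under assumptions, not a VBA macro: the names `flag_checks`, `split_flagged`, and the dictionary keys are hypothetical, and the records are assumed to be already sorted by DOB, first name, and last name. Comparison is case-insensitive, as Excel's `=` is.

```python
def _key(rec, prefix_len):
    # Build the comparison key: DOB plus (optionally truncated) names.
    first, last = rec["first"].lower(), rec["last"].lower()
    if prefix_len is not None:
        first, last = first[:prefix_len], last[:prefix_len]
    return (rec["dob"], first, last)

def flag_checks(records, prefix_len=None):
    """Return "Check" for each row that matches the row directly below it."""
    flags = ["Check" if _key(cur, prefix_len) == _key(nxt, prefix_len) else ""
             for cur, nxt in zip(records, records[1:])]
    flags.append("")  # the last row has nothing below it to compare with
    return flags

def split_flagged(records, flags):
    """Separate flagged rows (for a review tab) from the rows that are kept."""
    kept = [r for r, f in zip(records, flags) if f != "Check"]
    removed = [r for r, f in zip(records, flags) if f == "Check"]
    return kept, removed

# Sample rows from the question, sorted by DOB, first name, last name
records = [
    {"first": "John", "last": "Smith", "dob": "01/01/1990"},
    {"first": "Jon", "last": "Smith", "dob": "01/01/1990"},
    {"first": "Anthony", "last": "Bowen", "dob": "02/02/1992"},
    {"first": "Tony", "last": "Bowen", "dob": "02/02/1992"},
]
exact_flags = flag_checks(records)                  # first pass: full names
prefix_flags = flag_checks(records, prefix_len=2)   # second pass: first two letters
kept, removed = split_flagged(records, prefix_flags)
```

On the sample rows the exact pass flags nothing, and the two-letter pass flags John/Jon but not Anthony/Tony, so a nickname pair like that would still need a manual check or a different rule.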

Fuzzy lookup to find nearest match and location to return ID

I am in the process of matching a number of data sets. These are passenger arrivals from a number of different systems. I need to match these as best as possible. 2% unique in each set, the rest common.
I am not trying to merge, deduplicate, or standardise the data, as is normally the case with fuzzy lookup. I am trying to find the quality, value, and location of the closest match. Other than the common fields, the data sets have a whole bunch of unique fields. Essentially I am trying to find a link between them so that I can create reports from the different data sets, each of which has information I need. These have over 100k rows.
I have combined the common fields into a string to simplify the calculations. The fields are arrival date (in Excel number format), DOB, passport, and full name, i.e. "44250 | 15-JAN-80 | UK1234567 | JOHN AMITH"
Essentially starting with Table1, I want to add 3 columns; the nearest match in text, the ID associated with this value in the second table or the row number so I can index/match the data and finally the percentage similarity as per example.
I have found functions that find the nearest match, but not the location, or associated ID. Any ideas how the below would work or any other ideas.
MADEUP VALUES
TABLE 1 REF TABLE 1 ID
44054 | 29-Aug-1960 | CL-F2944458 | JOHN THOMSON ID1-010739
44054 | 09-Dec-1989 | LM389990 | EDWARD SMITH ID1-010737
44054 | 09-Dec-1991 | LL556699 | RICHARD FREEMAN ID1-010738
44054 | 06-May-1960 | LK9915782 | JEAN HAMILTON ID1-010740
44054 | 05-Nov-1954 | US 9910505 | BEN JONES ID1-010753
TABLE 2 REF TABLE 2 ID
44054 | 05-Nov-1954 | US 9910505 | BENJAMIN JONES ID2-0001
44059 | 19-Aug-1960 | CL-F2944458 | JOHN THOMSON ID2-0002
44054 | 09-Dec-1991 | LL556666 | RICHARD FREEMAN ID2-0003
44054 | 06-May-1960 | LK9915782 | JEAN HAMILTON ID2-0004
44054 | 09-Nov-1989 | AU-LM389990 | EDWARD SMTH ID2-0005
Levenshtein Distance in VBA
Fuzzy matching Mr Excel
github Fuzzy
I needed to use fuzzy matching in Excel for work, and I also needed to know string similarity, be able to partition sentences, etc.
I created a VBA module for doing just this, I think it may help you: https://github.com/kyledeer-32/vba_fuzzymatching.git
Basically, importing it into your workbook will give you access to several UDFs, e.g., fuzzy match, string similarity, etc.
Note: it won't return the cell index of a matched value, but you could configure the scripts to do this fairly easily, e.g., with just a few changes, you could configure the "=fuzzy_match" function to return the array position of the best match instead of the value itself.
Hope this helps!
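To make the idea of returning the position as well as the value concrete, here is a small Python sketch. It is not the code of the linked VBA module; `levenshtein`, `similarity`, and `best_match` are hypothetical names, and the similarity measure (1 minus edit distance divided by the longer length) is just one common choice.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a, b):
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest

def best_match(value, candidates):
    """Return (index, candidate, similarity) of the closest candidate."""
    scores = [similarity(value, c) for c in candidates]
    idx = max(range(len(candidates)), key=scores.__getitem__)
    return idx, candidates[idx], scores[idx]

# Table 2 reference strings from the question
table2 = [
    "44054 | 05-Nov-1954 | US 9910505 | BENJAMIN JONES",
    "44059 | 19-Aug-1960 | CL-F2944458 | JOHN THOMSON",
    "44054 | 09-Dec-1991 | LL556666 | RICHARD FREEMAN",
    "44054 | 06-May-1960 | LK9915782 | JEAN HAMILTON",
    "44054 | 09-Nov-1989 | AU-LM389990 | EDWARD SMTH",
]
idx, text, score = best_match("44054 | 05-Nov-1954 | US 9910505 | BEN JONES", table2)
print(idx, text, round(score, 3))
```

The returned index can be used to pick up the ID from the second table. Comparing every row against every other row is quadratic, so with 100k rows per table it would likely be necessary to compare only within groups first, for example rows sharing an arrival date.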

Excel LOOKUP matches the wrong cell

I am trying to achieve a fairly simple task in Excel, but I do not get the results that I want. I have a simple schedule in which I assign one of a pool of coaches to a series of matches, by filling out a simple table. Here is scaled-down version:
Match | John | Pete | Chris |
-------|------|------|-------|-------
1 | X | | | John
2 | | X | | Pete
3 | A | | X | Chris
4 | X | | A | Chris (!)
5 | | X | A | Pete
Legend: X: will coach; A: is available.
I used the table to register availability and then changed one A to an X in each row, to select the person that will actually coach the match.
For an overview, I decided to add a column in which the selected coach would appear. I used the following formula: =LOOKUP("X"; B2:D2; B$1:D$1) for row 2 and copied it to the other rows so that the row numbers of each row corresponded with the row in which the formula was placed.
To my surprise, match 4 became assigned to Chris, whereas John has an X and Chris only an A.
When I read Microsoft's documentation on LOOKUP, I noticed a few things:
1. LOOKUP has a vector form and an array form. Microsoft recommends using HLOOKUP for the array form, but I use the vector form. I do not think that HLOOKUP is useful for me, as it only looks for values in the first row of the specified array, whereas my first row contains the values to be returned.
2. It reads: "A range that contains only one row or one column.", but also "Important: The values in lookup_vector must be placed in ascending order: ..., -2, -1, 0, 1, 2, ..., A-Z, FALSE, TRUE; otherwise, LOOKUP might not return the correct value. Uppercase and lowercase text are equivalent."
I think that 2. is what causes the issue. I am not sure how to sort a range parameter like A2:A4. Microsoft documents a SORT function, but it's beta. Also, I think that sorting the search row will mess up the match, anyway.
The workaround I found is changing my codes to A: assigned and B: backup, in which A and B are chosen to be alphabetically ascending. If I change the formulas to use lookup value "A", this gives me:
Match | John | Pete | Chris |
-------|------|------|-------|-------
1 | A | | | John
2 | | A | | Pete
3 | B | | A | Chris
4 | A | | B | John
5 | | A | B | Pete
which is the result I want.
Can anyone shed some light on this completely counter-intuitive behavior and/or describe alternative ways to achieve this?
Notes:
I take care of only putting 1 X in each row.
I have seen this in Microsoft Excel for Office 365 MSO (16.0.11328.20362) 64-bit and in Microsoft Excel 2016 MSO (16.0.12130.20232) 32-bit.
LOOKUP is doing exactly what it was designed to, per the documentation. You should use INDEX and MATCH:
=INDEX($B$1:$D$1;MATCH("X";B2:D2;0))
The final 0 argument to MATCH means that you are looking for an exact match so the data doesn't need to be sorted.
LOOKUP does a binary search and therefore it returns A. We have had a long discussion on Chandoo.org forum which you can read here:
https://chandoo.org/forum/threads/how-vlookup-works.18378/
Here's another discussion: http://www.ashishmathur.com/return-an-exact-value-via-the-lookup-function/
Basically, it keeps looking for an equal or lower value and then keeps slicing through data and therefore it needs data to be sorted.
You can still use LOOKUP by tweaking like below.
=LOOKUP(2;1/(B2:D2="X");B$1:D$1)
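The two suggested formulas differ in which match they pick when a row contains more than one X. The Python sketch below illustrates that difference only; it does not model LOOKUP's internal search, and the function names are hypothetical. MATCH with a final 0 finds the first exact match. In the LOOKUP(2;1/(...)) form, the division yields 1 where the cell equals "X" and an error elsewhere; errors are ignored and 2 is larger than every remaining value, so the last 1 wins.

```python
def first_match(row, headers, wanted="X"):
    """Mirrors INDEX(headers; MATCH(wanted; row; 0)): first exact match."""
    for cell, header in zip(row, headers):
        if cell.upper() == wanted.upper():  # Excel text comparison ignores case
            return header
    return None  # Excel would show #N/A here

def last_match(row, headers, wanted="X"):
    """Mirrors LOOKUP(2; 1/(row=wanted); headers): last exact match."""
    result = None
    for cell, header in zip(row, headers):
        if cell.upper() == wanted.upper():
            result = header
    return result

headers = ["John", "Pete", "Chris"]
match4 = ["X", "", "A"]        # row for match 4 in the question
print(first_match(match4, headers), last_match(match4, headers))
```

With exactly one X per row, as in the question, both approaches return the same coach.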

Report generation based on multi lookup and dynamic columns

I am a little stuck with a report I am trying to generate in Excel and was hoping someone could help.
Here is a summary of what I am trying to do:
Table 1 has one column called people (it’s basically a list of employees)
Table 2 has one column called countries (it’s basically a list of relevant countries)
Table 3 has three columns called person, country and date.
There is one entry for every person each time they review a country.
So the data will look something like:
PERSON | COUNTRY | DATE
John | uk | 10/01/2013
Paul | uk | 15/01/2013
John | France | 15/01/2013
Bob | Spain | 16/01/2013
The report I need to produce is one which shows who has/hasn't checked each country.
So the columns will be ‘Person’, uk, France, Spain (and any other unique value from the country table).
There will then be one single row for each person with a Yes/No value in the relevant column if that person has reviewed that country i.e. Table 3 contains a value that matches that value for the person and country.
So to be clear the report should be similar to:
PERSON | UK | FRANCE | SPAIN
John | Yes | Yes | No
Paul | Yes | No | No
Bob | No | No | Yes
In summary I can split this into two problems:
How to generate a table that has a column for every unique value in another table (country in the explanation above)
How to do a double lookup i.e. IF EXISTS in TABLE 3 ‘person’=john & ‘country’=uk then return ‘Yes’, otherwise return ‘No’
I’m happy to keep in Excel or make use of SQL reporting i.e move my data to SQL first.
It's kind of a wonky formula but =sumproduct() will do a dual lookup.
=IF(SUMPRODUCT(--($K$2:$K$5=$K13), --($L$2:$L$5=M$10)),"Yes","No")
The Person/Country/Date table is located in the range K1:M5 the results table is located in range K10:N13. I had a workbook open and put it in the corner. (Nobody puts sumproduct in the corner)
The gist is that -- turns TRUE and FALSE into 1 and 0. SUMPRODUCT multiplies the two results line by line. If both are true on some line, you get 1 x 1, and that total is fed back into the IF, which returns "Yes" for a nonzero sum and "No" otherwise. You'll have to be mindful of the $ signs in the formula when copying it across and down.
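The same dual lookup can be sketched in Python to show what the SUMPRODUCT is computing. This is an illustration only; `reviewed` and `build_report` are hypothetical names, and the comparison ignores case, as Excel's `=` does, so "uk" matches "UK".

```python
def reviewed(person, country, reviews):
    """Multiply the two 0/1 tests row by row and sum them, like SUMPRODUCT."""
    hits = sum((p.lower() == person.lower()) * (c.lower() == country.lower())
               for p, c in reviews)
    return "Yes" if hits else "No"

def build_report(people, countries, reviews):
    """One row per person, one Yes/No value per country column."""
    return {person: [reviewed(person, country, reviews) for country in countries]
            for person in people}

# Table 3 from the question (dates omitted, they do not affect the result)
reviews = [("John", "uk"), ("Paul", "uk"), ("John", "France"), ("Bob", "Spain")]
report = build_report(["John", "Paul", "Bob"], ["UK", "France", "Spain"], reviews)
for person, row in report.items():
    print(person, row)
```

Because the country list is passed in as data, adding a country to Table 2 adds a column without changing the lookup itself.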

Multiple Column vs Multiple Column Lookup

I am after a formula to match a number of columns between two worksheets and return the last reference worksheets final column data. I know this is doable in VBA, but am looking for a formula method.
MainWorksheet:
User | Region | Country | City | Lookup
--------------------------------------------------
User1 | Europe | Italy | Rome | [formula here]
User2 | Americas | Brazil | Rio | [formula here]
ReferenceWorksheet:
Region | Country | City | Data
-----------------------------------
Europe | England | London | some data
Americas | Brazil | Rio | more data
Europe | Italy | Rome | some more data
The formula I am after should match each column in that particular row and add the Data cell value from the ReferenceWorksheet to the MainWorksheet.
e.g. If (MainWorksheet.Region == ReferenceWorksheet.Region) &&
(MainWorksheet.Country == ReferenceWorksheet.Country) &&
(MainWorksheet.City == ReferenceWorksheet.City) Then
MainWorksheet.Column E = ReferenceWorksheet.Current Row:Data Column
I haven't found a clean-cut way to do this across multiple columns using VLOOKUP, INDEX(MATCH()), etc. Is there a way to filter within a function?
Any help is much appreciated!
I agree with vasek1, adding additional columns will simplify the formulas required but if you want to avoid extra columns there are [relatively] simple methods available.
Method 1 - do the same concatenation as vasek1, but within the formula, e.g. in E2 on the Main sheet:
=INDEX(Ref!D$2:D$100,MATCH(B2&"-"&C2&"-"&D2,Ref!A$2:A$100&"-"&Ref!B$2:B$100&"-"&Ref!C$2:C$100,0))
formula needs to be confirmed with CTRL+SHIFT+ENTER
Method 2 - a non-array version with LOOKUP
=LOOKUP(2,1/(Ref!A$2:A$100=B2)/(Ref!B$2:B$100=C2)/(Ref!C$2:C$100=D2),Ref!D$2:D$100)
Note that the first formula finds the first match, the latter the last. I assume that the reference data will only have a single instance of each region/country/city combination in which case they will both give the same results, but that isn't guaranteed in every situation.
To allow C2 to be "<>" meaning "any country" (as per comment) you can use this revised version of the LOOKUP formula
=LOOKUP(2,1/(Ref!A$2:A$100=B2)/((Ref!B$2:B$100=C2)+(C2="<>"))/(Ref!C$2:C$100=D2),Ref!D$2:D$100)
A similar change can be applied to the INDEX/MATCH version
A solution I use for this type of problem is to create an extra column to serve as the unique identifier for each table. So, in your case,
Main table: formula for key, assuming you start with column 1 = A, is
E2 = B2 & "_" & C2 & "_" & D2
User | Region | Country | City | Key | Lookup
--------------------------------------------------
User1 | Europe | Italy | Rome | Europe_Italy_Rome | [formula here]
User2 | Americas | Brazil | Rio | Americas_Brazil_Rio | [formula here]
Reference table: here, insert the extra column to the left so you can do a vlookup on it. Formula for Key in A2 is
A2 = B2 & "_" & C2 & "_" & D2
Key | Region | Country | City | Data
---------------------------------------------------------------------
Europe_England_London | Europe | England | London | some data
Americas_Brazil_Rio | Americas | Brazil | Rio | more data
Europe_Italy_Rome | Europe | Italy | Rome | some more data
Then, the lookup formula in the main table becomes very simple:
F2 = VLOOKUP(E2, ReferenceTable!$A$2:$E$4, 5, 0)
You can then hide the key columns from the user, if necessary. The advantage of this approach is that it keeps the formulas simple and is much easier to understand and update than writing VBA or a complicated formula.
Here's a simple example of multi-column MATCH (the kind of approach which often shows up when searching for this type of formula):
In E10:
=IFERROR(INDEX(E3:E5,MATCH(B10&C10&D10,$B$3:$B$5&$C$3:$C$5&$D$3:$D$5,0),1),"No Match")
Be sure to use Ctrl+Shift+Enter when entering the formula.
Posting this to note that it has an issue you should be aware of: the example above matches on:
B | Two | Blue
but it will also match on:
BT | wo | Blue
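The collision is easy to reproduce outside Excel. The Python sketch below is an illustration with hypothetical helper names; it shows that joining the parts with a delimiter removes this particular ambiguity, provided the delimiter never occurs in the data itself.

```python
def key_plain(*parts):
    """Concatenate without a separator, like B10&C10&D10."""
    return "".join(parts)

def key_delimited(*parts, sep="|"):
    """Concatenate with a separator, like B10&"|"&C10&"|"&D10."""
    return sep.join(parts)

a = ("B", "Two", "Blue")
b = ("BT", "wo", "Blue")
print(key_plain(*a) == key_plain(*b))          # both keys are "BTwoBlue"
print(key_delimited(*a) == key_delimited(*b))  # "B|Two|Blue" vs "BT|wo|Blue"
```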
What you need is typically called a Multiple Lookup.
This was asked a few times, under various forms.
I have compiled here a list of such posts.
(This one is in the list)
There are many possible solutions for that.
The one I found most robust is shown here.
Adapted to this case, the formula in E3 would be
=INDEX(Ref!D:D,SUMPRODUCT(--(Ref!A:A=B3),--(Ref!B:B=C3),--(Ref!C:C=D3),ROW(Ref!D:D)),0)
and copy downward.
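One property of this formula is worth keeping in mind: SUMPRODUCT adds up the row numbers of all matching rows, so it returns a usable row only when the combination is unique in the reference sheet. The Python sketch below illustrates this; the function name and the assumption that data starts in sheet row 2 are hypothetical.

```python
def sumproduct_row(ref_rows, region, country, city, first_row=2):
    """Sum the sheet row numbers of every row matching all three columns.

    Mirrors SUMPRODUCT(--(A=region), --(B=country), --(C=city), ROW(D)).
    """
    return sum(row_number
               for row_number, (r, c, t, _data) in enumerate(ref_rows, start=first_row)
               if (r, c, t) == (region, country, city))

ref = [
    ("Europe", "England", "London", "some data"),    # sheet row 2
    ("Americas", "Brazil", "Rio", "more data"),      # sheet row 3
    ("Europe", "Italy", "Rome", "some more data"),   # sheet row 4
]
unique_row = sumproduct_row(ref, "Europe", "Italy", "Rome")
# A duplicate entry in sheet row 5 turns the result into 4 + 5
duplicate_row = sumproduct_row(ref + [ref[2]], "Europe", "Italy", "Rome")
missing_row = sumproduct_row(ref, "Asia", "Japan", "Tokyo")
print(unique_row, duplicate_row, missing_row)
```

With a duplicate the result points at a row that is not a match at all, and with no match the result is 0 rather than a valid row number, so the approach is safest when the reference data is known to contain each combination exactly once.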
