Finding the most common phrases within a range of cells in Excel - excel-formula

I have a list of data that I'm trying to analyse. It hasn't been tagged, so I am trying to set up a way of doing that. Let's say that my list is ice cream shops, the data is something like this:
Jack's Akron Ice Creamery
Gerry and Benn's Soft Serve
Macco's Strathfield Ice Creams and Treats
Auburn Ice Cream
Macco's Paddington Ice Creams and Treats
Patterson Ice Creamery
Cold Food Soft Serve of London
Jacks Cleveland Ice Creamery
Mrs Whipper Ice Creams Frankston
Mrs Whipper Frozen Treats Cranbourne
What I'd like to do is figure out which of these is the most prevalent in this data set (assuming I can regex and scrub away the punctuation differences).
From the above set, I would hope to surface:
Jack's
Macco's
Mrs Whipper
Soft Serve
Ice Creamery
Ice Creams and Treats
(note, this is a made up list, but I have a similarly random, but grouped list)
I have tried using the =MODE() and =INDEX() and =isnumber() functions with =sumifs() and others like these to no avail. I'd assume that the Google Sheets function of =SPLIT() would be of value, but the trouble is that I can't use Sheets due to our company policy.

Your last sentence makes me think you don’t have the newest version of Excel, in which case this answer won’t help you, but perhaps it will be useful for someone else. I haven’t come across a great workaround for the inability to use COUNTIF inside LET, so it’s a little messy, but it works.
=LET(x, TEXTJOIN(" ", 1, $A$2:$A$11),
y, FILTERXML("<t><s>"&SUBSTITUTE(x, " ", "</s><s>")&"</s></t>", "//s"),
z, UNIQUE(y),
p, CONCAT(y),
mycount, (LEN(p)-LEN(SUBSTITUTE(p,z,"")))/LEN(z),
mylist, SORT(IF(SEQUENCE(1,2)<2, z, mycount),2,-1),
IF(SEQUENCE(COUNTA(mylist),1)<=5, mylist, ""))
The steps are below:
1) Use LET to avoid helper columns and make the formula more readable.
2) Join all the cells into one string with TEXTJOIN, separated by spaces.
3) Use FILTERXML to break the string up into individual cells at every space, and call this vector “y”.
4) Get the unique cells in “y” and call the new vector “z”.
5) Count the matches of each entry of “z” in “y”: take the length of the concatenated string, subtract its length after removing every instance of the word, and divide by the length of the word (this is the workaround for COUNTIF). Note that this counts substring matches, so "Cream" is also counted inside "Creamery".
6) Append these counts to the unique vector using SEQUENCE (thanks @P.b.!)
7) Sort by the number of matches, and finally show only the top 5 words with the most matches.
This gets it in one go, but from a practicality and speed standpoint, I would probably just use steps 2 and 3, paste the unique values, then use COUNTIF and delete the unneeded data.
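For anyone who wants to cross-check the counts outside Excel, here is a minimal Python sketch of the same split, unique and count idea. It is not part of the formula above: the function name top_words and the choice to drop apostrophes (so that "Jack's" and "Jacks" are treated as one word) are my own assumptions. Unlike the SUBSTITUTE trick, it counts whole words only.

```python
import re
from collections import Counter

def top_words(names, n=5):
    """Return the n most frequent whole words across all names."""
    words = []
    for name in names:
        # Assumption: drop apostrophes so "Jack's" and "Jacks" count as one word.
        cleaned = name.replace("'", "")
        # Keep runs of letters only; punctuation and digits act as separators.
        words.extend(re.findall(r"[A-Za-z]+", cleaned))
    return Counter(words).most_common(n)

shops = [
    "Jack's Akron Ice Creamery",
    "Gerry and Benn's Soft Serve",
    "Macco's Strathfield Ice Creams and Treats",
    "Auburn Ice Cream",
    "Macco's Paddington Ice Creams and Treats",
    "Patterson Ice Creamery",
    "Cold Food Soft Serve of London",
    "Jacks Cleveland Ice Creamery",
    "Mrs Whipper Ice Creams Frankston",
    "Mrs Whipper Frozen Treats Cranbourne",
]

for word, count in top_words(shops):
    print(word, count)
```

Like the formula, this ranks single words. Surfacing multi-word phrases such as "Soft Serve" would need the same counting applied to pairs or triples of adjacent words.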

Related

Find and Replace only the first instance of a word in a cell

I have almost 2000 rows of content (all product descriptions for a website rebuild, no data) that I've edited and now need to include a Trademark symbol or Registered symbol on various product names and technologies throughout the worksheet. Using Find and Replace would be the fastest way to accomplish this, but the problem is, I only want the symbol to occur on the first instance of the word in the cell and not any following instances.
For example, if I Find and Replace "Nike shoes" with "Nike® shoes", the result would be:
Nike® shoes are built to last for years. Every pair of Nike® shoes is covered by our full lifetime warranty.
But what I really want is the following:
Nike® shoes are built to last for years. Every pair of Nike shoes is covered by our full lifetime warranty.
Is there any way to create a function for finding and replacing the first instance of a word in a cell?
Just a side note, I've never used Excel before now so I'm new to this.
If I understood correctly you can try:
=SUBSTITUTE(Cell of the text, "Old Text", "New text", 1).
The final argument, 1, is the instance number: only the first occurrence of the old text is replaced.
Example:
Let's say the text is in cell A1:
Nike shoes are built to last for years. Every pair of Nike shoes is covered by our full lifetime warranty.
=SUBSTITUTE(A1, "Nike", "Nike®", 1)
It worked well for me, for exactly the same problem.
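If the descriptions ever need to be processed outside Excel, the same first-occurrence-only replacement can be sketched in Python. The helper name mark_first is hypothetical; the third argument of str.replace limits the number of replacements, which here gives the same result as the instance number 1 in SUBSTITUTE.

```python
def mark_first(text, old, new):
    """Replace only the first occurrence of old with new."""
    # A count of 1 stops str.replace after the first match.
    return text.replace(old, new, 1)

description = (
    "Nike shoes are built to last for years. "
    "Every pair of Nike shoes is covered by our full lifetime warranty."
)
print(mark_first(description, "Nike", "Nike®"))
```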

Extracting the last sequence of numbers in excel

I have a list of addresses from which I need to extract the last sequence of numbers (zip code). I'm looking for a general expression from which I can extract the zip codes from addresses from all over the world. I would have to tweak the expression in order for it to work for each country, or for a group of countries, I assume.
I'm trying to write a formula in Excel that can recognise the last digit in a string and, from there, extract the digits immediately before that last digit, stopping whenever it reaches a non-digit character. Below I have an example of an address (in E26) and the formula I've come up with, but I'm looking for something more compact:
National Institute of Pharmaceutical Education and Research (NIPER), Phase X, Sector 67, SAS Nagar, Punjab, 160062, India.
=MID(E26, MAX(IF(ISNUMBER(VALUE(MID(E26,ROW(INDIRECT("1:" & LEN(E26))),1))),ROW(INDIRECT("1:" & LEN(E26))))+1)-6, 6)
The first part, recognizing the last digit, is working fine. The problem is recognizing the beginning of the sequence, at least in cases where there are also street numbers within the string (such as in this case). This is why I'm subtracting 6 from the position where the last digit was found, since I know the length of the zip code in this particular country. However, that may not be the case for all countries.
Plus, there are cases where there's a space within the sequence, such as: 160 062. Also, the addresses won't always have delimiters that I could use to extract the zip codes, hence the reason why I need an algorithm for this.
I was wondering if there's a neater way to do this? I would be open to VBA. Thanks for your help!
Best regards,
Antonio
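The logic described above (find the last run of digits, allowing a single space inside it) can be sketched with a regular expression. This is in Python rather than Excel or VBA, purely to illustrate the pattern; the function name last_number_run is hypothetical. One assumption to be aware of: digit groups separated by a single space are treated as one number, so a street number directly followed by a zip code ("67 160062") would be merged.

```python
import re

def last_number_run(address):
    """Return the last run of digits in address, or None if there is none.

    Assumption: digit groups separated by single spaces ("160 062")
    belong to the same number; the spaces are removed from the result.
    """
    runs = re.findall(r"\d+(?: \d+)*", address)
    if not runs:
        return None
    return runs[-1].replace(" ", "")

address = (
    "National Institute of Pharmaceutical Education and Research (NIPER), "
    "Phase X, Sector 67, SAS Nagar, Punjab, 160062, India."
)
print(last_number_run(address))
```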

How do you extract the first sentence in Excel

I have around 2000 descriptions that need a short description. Here is an example of a description.
Chloe New, the heir of the Original Chloe is warm, feminine and a great signature scent. Chloe is a flowery scent with peonies, freesia, magnolia, lilies and rose petals along with ripe litchis. A hint of woods at the base makes Chloe New the best everyday fragrance. It's long lasting power keeps you fresh all day long.
The result being this
Chloe New, the heir of the Original Chloe is warm, feminine and a great signature scent.
Sometimes other descriptions will end like this, for example:
Chloe New, the heir of the Original Chloe is warm, feminine and a great signature scent. Chloe is
The function I am currently using is "=left(a1,70)", which takes the first 70 characters starting from the left. However, this doesn't always extract exactly the first sentence; sometimes it runs into the beginning of the second sentence.
So my question is:
Is there a function that only extracts the first sentence of a cell?
Add a "space" right after "." to prevent the sentence from being prematurely trimmed in the case of an abbreviation like "e.g.":
=LEFT(A1,FIND(". ",A1))
Where A1 is the cell you are evaluating:
=LEFT(A1,SEARCH(".",A1))
SEARCH() finds the index of "." and you evaluate your LEFT() function up to that point.
EDIT: For this use case, my solution and Gary's solution are nearly identical. For other use cases, SEARCH might be preferable to FIND because it performs case-insensitive search and also supports wildcard characters.
If you want to cut the text off at the first comma instead, use this:
=LEFT(A1, (SEARCH(",",A1,1))-1)
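For comparison, here is the same "cut at the first period followed by a space" logic as a small Python sketch. The function name first_sentence is hypothetical. Two differences from the formulas are worth noting: when no ". " is found this sketch returns the whole text, whereas FIND returns an error in that case, and an abbreviation that is itself followed by a space still ends the match early.

```python
def first_sentence(text):
    """Return text up to and including the first period followed by a space."""
    index = text.find(". ")
    if index == -1:
        # No sentence break found: return the text unchanged.
        return text
    return text[: index + 1]

description = (
    "Chloe New, the heir of the Original Chloe is warm, feminine and a great "
    "signature scent. Chloe is a flowery scent with peonies, freesia, magnolia, "
    "lilies and rose petals along with ripe litchis."
)
print(first_sentence(description))
```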

Looking up Bigrams in Excel

Suppose I have a list of two-word pairs in a column in Excel. These words are delimited by a space so that a typical pair might look like "extreme happiness". The goal is to search for these 'bigrams' in a larger string located in another column. The issue is that the bigram will only be found if the two words are together and separated by a space. What would be preferable is if Excel could look for both words anywhere in a given larger string. It is crucial that the bigrams occupy one cell each since a score is assigned to each bigram and in fact the function used VLOOKUPs this value based on the bigram cell value. Would it make sense to change the space between any two words to a - or some other character? Is there a way to have Excel look up each value one at a time (perhaps by recognizing this character and passing through the larger string twice, that is, once for each word)?
Example: "The weather last night was extremely cold, but the warm fire gave me some happiness."
Here we would like to find both the word 'extreme' within the word extremely and the word happiness. Currently Excel would not be successful in doing this since it would just look for "extreme happiness" and determine that no such string exists.
If the bigram in the row below "extreme happiness" reads "weather gave" (for some reason) Excel will go check whether that bigram exists in the larger string and return a second score. This is done so that at the end every score can be added together.
This is pretty easy with a couple of formulas. See screenshot below:
The logic is simple. Assuming your bigram is in B2, we can input the following in C2. This will replace the spaces with *, which is Excel's wildcard character.
=SUBSTITUTE(B2," ","*")
Then we concatenate it to give us a wildcarded beginning and end.
=CONCATENATE("*",SUBSTITUTE(B2," ","*"),"*")
We then use a simple COUNTIF against the statement (here in A2) to return a count of occurrences.
=COUNTIF(A2,CONCATENATE("*",SUBSTITUTE(B2," ","*"),"*"))
A simple IF check enclosing the above, with condition >0, can be used to give us either Yes or No.
=IF(COUNTIF(A2,CONCATENATE("*",SUBSTITUTE(B2," ","*"),"*"))>0,"Yes","No")
Let us know if this helps.
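One thing to keep in mind: a wildcard pattern such as *extreme*happiness* only matches when the two words appear in that order. If the order should not matter, each word has to be checked on its own. Below is a rough Python sketch of that check, not a drop-in Excel solution; the function name contains_all_words is hypothetical, and matching is case-insensitive to mirror COUNTIF.

```python
def contains_all_words(text, bigram):
    """True if every word of the bigram occurs somewhere in text.

    Words are matched as substrings, in any order and ignoring case,
    so "extreme" is found inside "extremely".
    """
    lowered = text.lower()
    return all(word.lower() in lowered for word in bigram.split())

sentence = (
    "The weather last night was extremely cold, "
    "but the warm fire gave me some happiness."
)
print(contains_all_words(sentence, "extreme happiness"))
print(contains_all_words(sentence, "weather gave"))
```

In Excel, the equivalent would be one COUNTIF per word, combined with AND, rather than a single pattern for the whole bigram.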

Weighted search algorithm to find like contacts

I need to write an algorithm that returns the closest match for a contact based on the name and address entered by the user. Both of these are troubling, since there are so many ways to enter a company name and address, for instance:
Company A, 123 Any Street Suite 200, Anytown, AK 99012
Comp. A, 123 Any St., Suite 200, Anytown, AK 99012
CA, 123 Any Street Ste 200, Anytown, AK 99012
I have looked at doing a Levenshtein distance on the Name, but that doesn't seem a great tool, since they could abbreviate the name. I am looking for something that matches on the most information possible.
My initial attempt was to limit the results first by the first 5 digits of the postal code and then try to filter down to one based on other information, but there must be a more standard approach to getting this done. I am working in .NET but will look at any code you can provide to get an idea on how to accomplish this.
I don't know exactly how this is accomplished, but all major delivery companies (FedEx, USPS, UPS) seem to have a way of matching an address you input against their database and transforming it to a normalized form. As I've seen this happen on multiple websites (Amazon comes to mind), I am assuming that there is an API for this functionality, but I don't know where to look for it or whether it is suitable for your purposes.
Just a thought though.
EDIT: I found the USPS API
I have solved this problem with a combination of address normalization, Metaphone, and Levenshtein distance. You will need to separate the name from the address since they have different characteristics. Here are the steps you need to do:
1) Narrow down your list of matches by using the (first six characters of the) zip code. Basically you will need to calculate the Levenshtein distance of the two strings and select the ones that have a distance of 1 or 2 at the most. You can potentially precompute a table of zip codes and their "Levenshtein neighbors" if you really need to speed up the search.
http://en.wikipedia.org/wiki/Levenshtein_distance
2) Convert all the address abbreviations to a standard format using the list of official prefix and suffix abbreviations from the USPS. This will help make sure your results for the next step are more uniform:
https://www.usps.com/send/official-abbreviations.htm
3) Convert the address to a short code using the Metaphone algorithm. This will get rid of most common spelling mistakes. Just make sure that your implementation can eliminate all non-word characters, pass numbers through intact, and handle multiple words (make sure each word is separated by a single space):
http://en.wikipedia.org/wiki/Metaphone
4) Once you have the Metaphone results, compare the address strings using the Levenshtein distance. Calculate a percentage-of-change score by dividing the result by the number of characters in the longer string.
5) Repeat steps 3 and 4 but now use the names instead of the addresses.
6) Compute the score for each entry using this formula: (Weight for address * Address score) + (Weight for name * Name score). Pick your weights based on what is more important. I would start with .9 for the address (since the address is more specific) and .1 for the name, but the weights may depend on your application. Pick the entry with the lowest score. If the score is too high (say, over .15), you may declare that there are no matches.
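Steps 4 to 6 can be sketched roughly as follows. This is in Python rather than .NET, and it is only an illustration under stated assumptions: the abbreviation normalization and Metaphone encoding from steps 2 and 3 are assumed to have been applied to the strings already (they are not implemented here), and the names change_score, contact_score and best_match, as well as the dictionary layout, are hypothetical.

```python
def levenshtein(a, b):
    """Edit distance between two strings (dynamic programming, two rows)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]

def change_score(a, b):
    """Step 4: distance divided by the length of the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

def contact_score(query, candidate, address_weight=0.9, name_weight=0.1):
    """Step 6: weighted sum of address and name scores (lower is better)."""
    return (
        address_weight * change_score(query["address"], candidate["address"])
        + name_weight * change_score(query["name"], candidate["name"])
    )

def best_match(query, candidates, threshold=0.15):
    """Return the lowest-scoring candidate, or None if it exceeds threshold."""
    if not candidates:
        return None
    best = min(candidates, key=lambda c: contact_score(query, c))
    return best if contact_score(query, best) <= threshold else None

# Hypothetical, already-normalized records.
query = {"name": "COMPANY A", "address": "123 ANY ST STE 200"}
candidates = [
    {"name": "COMPANY A", "address": "123 ANY ST STE 200"},
    {"name": "OTHER CO", "address": "987 ELM AVE"},
]
print(best_match(query, candidates))
```

The weights and the 0.15 cutoff are the starting values suggested in step 6; they would need tuning against real data.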
I think filtering based on zip code first would be the easiest, as finding it is fairly unambiguous. From there you can probably extract the city and street. I'm not sure how you would go about finding the name, but it seems matching it against the address if you already have a database of (name, address) pairs is feasible.
Dun & Bradstreet do this. They charge money because it's really hard. There's no "standard" solution. It's mostly a painful choice between a service like D&B or roll your own.
As a start, I'd probably do a word-indexed search. That would mean two stages:
Offline stage: Generate an index of all the addresses by their keywords. For example, "Company", "A" and "123" would all become keywords for the address you provided above. You could do some stemming, which would mean that for words like "street" you'd also add the word "st" to the index.
Online stage: The user gives you a search query. Break the search query down into its keywords, and find all possible matches of each keyword in the database. Tally the number of matched keywords for each address. Then sort the results by the number of matched keywords. This should be quick if there aren't too many matches, as it's just a few sorted list merges and increments, followed finally by a sort.
Given that you know the domain of your problem, you could specialise the algorithm to use knowledge about the domain - for example the zip code filtering mentioned before.
Also just to enable me to provide you with a better answer, are you using an SQL database at all? I ask because the way I would do it is I'd store the keyword index in the SQL database, and then the SQL query to search by keyword becomes quite easy, since the database does all the work.
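To make the two stages concrete, here is a minimal in-memory sketch in Python. It uses a dictionary in place of the SQL table, leaves out stemming entirely, and all function names (keywords, build_index, search) are hypothetical.

```python
import re
from collections import Counter, defaultdict

def keywords(text):
    """Lowercase the text and split it into a set of alphanumeric keywords."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def build_index(addresses):
    """Offline stage: map each keyword to the ids of addresses containing it."""
    index = defaultdict(set)
    for address_id, text in enumerate(addresses):
        for word in keywords(text):
            index[word].add(address_id)
    return index

def search(index, query):
    """Online stage: tally matched keywords per address, best match first."""
    tally = Counter()
    for word in keywords(query):
        for address_id in index.get(word, ()):
            tally[address_id] += 1
    return tally.most_common()

addresses = [
    "Company A, 123 Any Street Suite 200, Anytown, AK 99012",
    "Widget Corp, 9 Elm Road, Springfield, IL 62701",
]
index = build_index(addresses)
print(search(index, "Comp. A, 123 Any St., Suite 200, Anytown, AK 99012"))
```

Without stemming, "St." does not match "Street" and "Comp." does not match "Company", which is exactly the gap the stemming step is meant to close.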
Maybe instead of using Levenshtein for the name only, it could be useful when used with the entire string representation of a contact. For instance, the distance of your first example to the second is 7 and to the third 9. Considering the strings have lengths 54, 50 and 45, this seems to be a relatively useful and quite simple similarity measure.
This is what I would do. I am not aware of algorithms, so I just use what makes sense.
I am assuming that the person would provide name, street address, city name, state name, and zipcode.
If the zipcode is provided as 9 digits, or has a hyphen, I would strip it down to 5 digits. I would search the database for all of the addresses that have that zipcode. [query 1]
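As a rough illustration of the stripping step, here is a small Python sketch. The function name zip5 is hypothetical, and returning None for input with fewer than five digits is my own assumption about how to handle bad input.

```python
import re

def zip5(raw):
    """Reduce a US zip code (5 digits, 9 digits, or ZIP+4) to its first 5 digits."""
    digits = re.sub(r"\D", "", raw)  # drop hyphens, spaces and anything else
    if len(digits) < 5:
        return None  # assumption: too short to be a usable zip code
    return digits[:5]

print(zip5("99012-1234"))
```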
Then I would compare the state letter with the one from the database. If it's not a match, then I would tell that to the user. Same goes for the city name.
From what I understand, a street name does not consist of numbers; only the house on a street has a number. Furthermore, the house number is usually at the beginning, unless it is a house or suite number.
So I would use a regex to search for the number and the space or comma next to it, then find the position of the first word that does not have a period (.) or end in a comma. That gives me part of the street name, so I could do a comparison against the rows fetched earlier, or I would change the query to have the street name LIKE %streetName%.
I am guessing the database has a beginning number and ending number of the house on a block. I would check against that street row to see if the provided street number is on that street.
By now you would know the correct data to show, and could look up in a different table which name is associated with that house number. I am not sure why you want to compare it. The only use for comparing names would be if you want to find people whose address was not provided. For ways of comparing strings, you can look here: Similar String algorithm
If you can reliably figure out the general structure of each address (perhaps by the suggestions in the other answers), your best bet would be to run the data through a USPS-certified (meaning: the results are reliable, accurate, and conform to federal standards) address verification service.
@RyanDelucchi, it is a fun problem, but only once you've solved it. So, @SteveBering, I would recommend submitting your list of contacts to a list processing service which will flag duplicates based on the address -- according to USPS guidelines.
Since I work in the address verification field, I would suggest SmartyStreets (which I work for) since it will deliver the most value to your specific need -- however, there are a few CASS-Certified vendors who will do basically similar things.

Resources