Looking up Bigrams in Excel - excel

Suppose I have a list of two-word pairs in a column in Excel. These words are delimited by a space so that a typical pair might look like "extreme happiness". The goal is to search for these 'bigrams' in a larger string located in another column. The issue is that the bigram will only be found if the two words are together and separated by a space. What would be preferable is if Excel could look for both words anywhere in a given larger string. It is crucial that the bigrams occupy one cell each since a score is assigned to each bigram and in fact the function used VLOOKUPs this value based on the bigram cell value. Would it make sense to change the space between any two words to a - or some other character? Is there a way to have Excel look up each value one at a time (perhaps by recognizing this character and passing through the larger string twice, that is, once for each word)?
Example: "The weather last night was extremely cold, but the warm fire gave me some happiness."
Here we would like to find both the word 'extreme' within the word extremely and the word happiness. Currently Excel would not be successful in doing this since it would just look for "extreme happiness" and determine that no such string exists.
If the bigram in the row below "extreme happiness" reads "weather gave" (for some reason) Excel will go check whether that bigram exists in the larger string and return a second score. This is done so that at the end every score can be added together.

This is pretty easy with a couple of formulas.
The logic is simple. Assuming your bigram is in B2, we can enter the following in C2. This replaces the space with *, which is Excel's wildcard character, so the pattern matches the first word followed by the second word with any text in between.
=SUBSTITUTE(B2," ","*")
Then we concatenate it to give us a wildcarded beginning and end.
=CONCATENATE("*",SUBSTITUTE(B2," ","*"),"*")
We then use a simple COUNTIF against the statement (here in A2) to return a count of occurrences.
=COUNTIF(A2,CONCATENATE("*",SUBSTITUTE(B2," ","*"),"*"))
A simple IF check enclosing the above, with condition >0, can be used to give us either Yes or No.
=IF(COUNTIF(A2,CONCATENATE("*",SUBSTITUTE(B2," ","*"),"*"))>0,"Yes","No")
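The wildcard pattern above only matches when the first word comes before the second. As a rough illustration of the order-independent check the question asks for, here is a minimal Python sketch of the logic (not an Excel formula). The function name `bigram_score` and the score values are made up for the example; the matching is case-insensitive and substring-based, so "extreme" also matches inside "extremely".

```python
def bigram_score(text, bigram_scores):
    """Sum the scores of every bigram whose words all occur somewhere in text."""
    haystack = text.lower()
    total = 0
    for bigram, score in bigram_scores.items():
        words = bigram.lower().split()
        # Substring test: each word may appear anywhere, in any order.
        if all(word in haystack for word in words):
            total += score
    return total


sentence = ("The weather last night was extremely cold, "
            "but the warm fire gave me some happiness.")
# Hypothetical scores, standing in for the VLOOKUP table in the question.
scores = {"extreme happiness": 2, "weather gave": 1, "cold rain": 5}
print(bigram_score(sentence, scores))  # 3
```

In Excel itself, the same idea would probably mean testing each word separately and combining the two results, rather than searching for a single wildcarded string.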
Let us know if this helps.

Related

How to make an excel (365) function that recognizes different words in the same cell and changes them individually

I have a list of product names, but unfortunately they are written in uppercase. I now want to make only the first letter of each word uppercase and the rest lowercase, but I also want all words with 3 or fewer characters to stay uppercase.
I'm trying IF functions but nothing is really working.
I use the German Excel version, but I would be happy if someone has any idea how to do it. I have been trying different functions for hours but nothing is working.
=IF(LENGTH(C6)<=3,UPPER(C6),UPPER(LEFT(C6,1))&LOWER(RIGHT(C6,LENGTH(C6)-1)))
But it returns a #NAME error; Excel does not recognize the first and the last bracket.
This is hard! Let me explain:
I do believe there are German words in the mix that are below 4 characters in length that you should exclude. My German isn't great, but there are probably a great many words below 4 characters;
There seem to be substrings that are 3+ characters in length but should probably stay uppercase, e.g. '550E/ER';
There seem to be quite a few characters that could be used as delimiters to split the input into 'words'. It's hard to catch all of them without a full list;
Possible other reasons;
With the above in mind I think it's safe to say that we can only try to accomplish what you want as best as we can. Therefore I'd suggest
To split on multiple characters;
Exclude certain words from being uppercase when length < 4;
Include certain words to be uppercase when length > 3 and digits are present;
Assume 1st character could be made uppercase in any input;
For example:
Formula in B1:
=MAP(A1:A5,LAMBDA(v,LET(x,TEXTSPLIT(v,{"-","/"," ","."},,1),y,TEXTSPLIT(v,x,,1),z,TEXTJOIN(y,,MAP(x,LAMBDA(w,IF(SUM(--(w={"zu","ein","für","aus"})),LOWER(w),IF((LEN(w)<4)+SUM(IFERROR(FIND(SEQUENCE(10,,0),w),)),UPPER(w),LOWER(w)))))),UPPER(LEFT(z))&MID(z,2,LEN(v)))))
You can see how difficult it is to capture each and every possibility;
The minute you exclude a few words, another will pop up (the 'x' between numbers, for example, which should stay upper- or lowercase depending on the context it is found in);
The second you include words containing digits, you notice that some should be excluded ('00SICHERUNGS....');
If the 1st character is a digit, the solution above would not change the 1st alphabetic character to uppercase;
Maybe some characters shouldn't be used as delimiters based on context? Think about hyphenated words;
Possible other reasons.
Point is, this is not just hard, it's extremely hard if not impossible to do on the type of data you are currently working with! Even if one is proficient at writing regular expressions (chuck in all the tokens, quantifiers and methods not available to Excel, if you like), I doubt all edge cases could be covered.
Because you are dealing with any number of words in a cell you'll need to get crafty with this one. Thankfully there is TEXTSPLIT() and TEXTJOIN() that can make short work of splitting the text into words, where we can then test the length, change the capitalization, and then join them back together all in one formula:
=TEXTJOIN(" ", TRUE, IF(LEN(TEXTSPLIT(C6," "))<=3,UPPER(TEXTSPLIT(C6," ")),PROPER(TEXTSPLIT(C6," "))))
This also uses the PROPER() function, which capitalizes the first letter of each word and lowercases the rest.
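To make the split / test / rejoin logic of that formula easier to follow, here is a minimal Python sketch of the same rule. It is not Excel code; the function name `fix_case` and the sample product name are invented for illustration. One known difference: Python's `str.capitalize()` only uppercases the very first character of a word, whereas PROPER() also capitalizes letters that follow a non-letter.

```python
def fix_case(name, max_upper_len=3):
    """Keep words of up to max_upper_len characters uppercase, capitalize the rest."""
    fixed = []
    # split() with no argument drops empty pieces, like TEXTJOIN's ignore_empty=TRUE.
    for word in name.split():
        if len(word) <= max_upper_len:
            fixed.append(word.upper())
        else:
            fixed.append(word.capitalize())
    return " ".join(fixed)


# Hypothetical product name, not taken from the asker's data.
print(fix_case("BOSCH AKKU SCHRAUBER GSR 12V"))  # Bosch Akku Schrauber GSR 12V
```

As the other answer points out, a simple length rule like this will still mishandle short ordinary words and longer codes that contain digits.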

If second two characters equal this then that

I have a string of letters and numbers, where if the second two characters of the string equal a certain value, then a location value should be shown in the corresponding column.
I have used the MID function to essentially extract the characters of the string that I want to use MID(A2,2,2) but now I can't figure out how to compare what is returned to a list of options that those two characters could be without typing in each option in an extremely long formula.
Here are possible strings that are situated in a column:
3PH356969
MSFFACEBUS
MBH0007398
MBH0007402
I am extracting the second two characters of these, to compare to a list similar to this
PH
SF
BH
PG
HR
These values then correspond to location (below), which would optimally be returned:
Philadelphia
Bay Area
Birmingham
Western PA
Hartford
I can write =IF(MID(A2,2,2)="PH","Philadelphia",else...) but then the else-ifs will go on for 76 more 2-character strings to compare against. I'm hoping there is a more optimal way for this.
Expected results should be the location corresponding to the string, or just "error" displayed.
Basically we need to use a lookup/reference table, but instead of the more common VLOOKUP function we can use an INDEX + MATCH combo, which is often faster on large tables.
Formula in B1:
=INDEX($E$1:$E$6,MATCH(MID(A1,2,2),$D$1:$D$6,0))
I would use a VLOOKUP, personally. Although it would require a separate lookup table, just feed your MID result as the VLOOKUP key. Then you could easily add/remove locations, and there will be an #N/A error if the key's not there.
If you don't want a separate lookup table, you may try it this way:
=IFERROR(INDEX({"Philadelphia","Bay Area","Birmingham","Western PA","Hartford"},MATCH(MID(A2,2,2),{"PH","SF","BH","PG","HR"},0)),"Not found")
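All of the answers above boil down to the same idea: extract the two characters, then look them up in a table instead of chaining IFs. A minimal Python sketch of that idea follows, using only the five code/location pairs listed in the question; the names `LOCATIONS` and `location_for` are made up for the example.

```python
# Code-to-location table, using the five pairs given in the question.
LOCATIONS = {
    "PH": "Philadelphia",
    "SF": "Bay Area",
    "BH": "Birmingham",
    "PG": "Western PA",
    "HR": "Hartford",
}


def location_for(value, default="error"):
    """Look up characters 2-3 of value, like MID(A2,2,2) fed into a lookup."""
    key = value[1:3].upper()  # MATCH is case-insensitive, so normalize here
    return LOCATIONS.get(key, default)


for item in ["3PH356969", "MSFFACEBUS", "MBH0007398", "XZZ000"]:
    print(item, "->", location_for(item))
```

Adding the remaining codes only means adding rows to the table, which is the main advantage over a long nested IF.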

Extracting text from complex string in excel

The attached image (link: https://i.stack.imgur.com/w0pEw.png) shows a range of cells (B1:B7) from a table I imported from the web. I need a formula that allows me to extract the names from each cell. In this case, my objective is to generate the following list of names, where each name is in its own cell: Erik Karlsson, P.K. Subban, John Tavares, Matthew Tkachuk, Steven Stamkos, Dustin Brown, Shea Weber.
I have been reading about left, right, and mid functions, but I'm confused by the irregular spacing and special characters (i.e. the box with question mark beside some names).
Can anyone help me extract the names? Thanks
Assuming that your cells follow the same format, you can use a variety of text functions to get the name.
This function requires the following format:
Some initial text, followed by
2 new lines in Excel (represented by CHAR(10))
The name, which consists of a first name, a space, then a last name
A second space on the same line as the name, followed by some additional text.
With this format, you can use the following formula (assuming your data is in an Excel table, with the column of initial data named Text):
=MID([@Text],SEARCH(CHAR(10),[@Text],SEARCH(CHAR(10),[@Text])+1)+1,SEARCH(" ",MID([@Text],SEARCH(CHAR(10),[@Text],SEARCH(CHAR(10),[@Text])+1)+1,LEN([@Text])),SEARCH(" ",MID([@Text],SEARCH(CHAR(10),[@Text],SEARCH(CHAR(10),[@Text])+1)+1,LEN([@Text])))+1)-1)
To come up with this formula, we take the following steps:
First, we figure out where the name starts. We know this occurs after the 2 new lines, so we use:
=SEARCH(CHAR(10),[@Text],SEARCH(CHAR(10),[@Text])+1)+1
The inner (occurring second) SEARCH finds the first new line, and the outer (occurring first) finds the 2nd new line.
Now that we have that value, we can use it to determine the rest of the string (after the 2 new lines). Let's say that the previous formula was stored in a table column called Start of Name. The 2nd formula will then be:
=MID([@Text],[@[Start of Name]],LEN([@Text]))
Note that we're using the length of the entire text, which by definition is more than we need. However, that's not an issue, since Excel returns the smaller amount between the last argument to MID and the actual length of the text.
Once we have the text from the start of the name on (say it is stored in a column called Rest of String), we need to calculate the position of the 2nd space (where the name ends). To do that, we need to calculate the position of the first space. This is similar to how we calculated the start of the name earlier (which starts after 2 new lines). The function we need is:
=SEARCH(" ",[@[Rest of String]],SEARCH(" ",[@[Rest of String]])+1)-1
So now, we know where the name starts (after 2 new lines), and how long it is (up to just before the 2nd space). Assuming we have these numbers stored in columns named Start of Name and To Second Space respectively, we can use the following formula to get the name:
=MID([@Text],[@[Start of Name]],[@[To Second Space]])
This is equivalent to the first formula: The difference is that the first formula doesn't use any "helper columns".
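The same three steps (find the second new line, take the rest of the string, cut at the second space) can be written out as a short Python sketch, which may make the nesting of the formula easier to see. The function name `extract_name` and the sample cell text are assumptions for illustration; the real cell contents are only visible in the question's image.

```python
def extract_name(cell_text):
    """Return the text between the 2nd new line and the 2nd space after it."""
    # Step 1: position just after the second new line (the "Start of Name").
    first_newline = cell_text.index("\n")
    second_newline = cell_text.index("\n", first_newline + 1)
    # Step 2: everything from the start of the name on (the "Rest of String").
    rest = cell_text[second_newline + 1:]
    # Step 3: cut just before the second space (the "To Second Space" length).
    first_space = rest.index(" ")
    second_space = rest.index(" ", first_space + 1)
    return rest[:second_space]


# Hypothetical cell content in the assumed format.
sample = "1\n\nErik Karlsson D | SJS"
print(extract_name(sample))  # Erik Karlsson
```

Like the formula, which returns an error when SEARCH finds nothing, this sketch raises a ValueError for any cell that does not contain two new lines followed by two spaces.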
Of course, if any cell doesn't match this format, then you'll be out of luck. Using Excel formulas to parse text can be finicky and inflexible. For example, if someone has a middle name, or someone has initials with spaces (e.g. P.K. Subban was P. K. Subban), or there was a Jr. or something, your job would be a lot harder.
Another alternative is to use regular expressions to get the data you want. I would recommend this thorough answer as a primer. Although you still have the same issues with name formats.
Finally, there's the obligatory Falsehoods Programmers Believe About Names as a warning against assuming any kind of standardized name format.

List of items find almost duplicates

Within excel I have a list of artists, songs, edition.
This list contains over 15000 records.
The problem is the list does contain some "duplicate" records. I say "duplicate" as they aren't a complete match. Some might have a few typos and I'd like to fix this up and remove those records.
So for example some records:
ABBA - Mamma Mia - Party
ABBA - Mama Mia! - Official
Each dash indicates a separate column (so 3 columns A, B, C are filled in)
How would I mark them as duplicates within Excel?
I've found out about the tool Fuzzy Lookup. Yet I'm working on a Mac, and since it's not available on Mac I'm stuck.
Any regex magic or VBA script that can help me out?
It'd also be alright to see how similar the rows are (say 80% similar).
One of the common methods for fuzzy text matching is the Levenshtein (distance) algorithm. Several nice implementations of this exist here:
https://stackoverflow.com/a/4243652/1278553
From there, you can use the function directly in your spreadsheet to find similarities between instances.
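For readers who want to see what the algorithm does before picking a VBA implementation, here is a minimal Python sketch of the standard dynamic-programming Levenshtein distance, plus a similarity ratio along the lines of the "80% similar" idea in the question. The `similarity` helper (one minus distance divided by the longer length) is just one common normalization, chosen here as an assumption.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def similarity(a, b):
    """1.0 for identical strings, approaching 0.0 as they diverge."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1 - levenshtein(a, b) / longest


print(levenshtein("Mamma Mia", "Mama Mia!"))           # 2
print(round(similarity("Mamma Mia", "Mama Mia!"), 2))  # 0.78
```

With this normalization, the two song titles from the question score just under an 80% cutoff, so any threshold would need tuning against real data.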
You didn't ask, but a database would be really nice here. The reason is you can do a cartesian join (one of the very few valid uses for this) and compare every single record against every other record. For example:
select
s1.group, s2.group, s1.song, s2.song,
levenshtein (s1.group, s2.group) as group_match,
levenshtein (s1.song, s2.song) as song_match
from
songs s1
cross join songs s2
order by
group_match, song_match
Yes, this would be a very costly query, depending on the number of records (in your example 15,000 × 15,000 = 225,000,000 rows), but it would bubble the most likely duplicates / matches to the top. Not only that, but you can incorporate "reasonable" joins to eliminate obvious mismatches, for example limiting it to cases where the group matches, nearly matches, or begins with the same letter, or pre-filtering out pairs where the Levenshtein distance is greater than x.
You could use an array formula to indicate the duplicates, and you could modify the one below to show the row numbers. It checks the rows beneath each entry for possible 80% dupes, where 80% means the first 80% of the entry's characters, read left to right, not a comparison of the whole string. My data is in A1:A15000.
=IF(NOT(ISERROR(FIND(MID($A1,1,INT(LEN($A1)*0.8)),$A2:$A$15000))),1,0)
This version also looks back up the list, to flag the ones already found:
=SUM(IF(ISERROR(FIND(MID($A2,1,INT(LEN($A2)*0.8)),$A3:$A$15000,1)),0,1))+SUM(IF(ISERROR(FIND(MID($A2,1,INT(LEN($A2)*0.8)),$A$1:$A1,1)),0,1))
The first entry, i.e. row 1, only needs the first part of the formula, and the last row only needs the part after the +.
Try this worksheet function in your loop:
=COUNTIF(Range,"*yourtexttofind*")

Excel, Numberplate Clarification

I am working on an Excel document for fuel cards at the moment, and my current issue is to write a formula for validating number plates based on UK standard plates (two letters followed by two numbers then three letters, e.g. BK08JWZ). At this point we are not considering personal plates, just to keep things simple.
Ideally I need Excel to look at the text in the cell and check it against an agreed layout, but I am struggling to find the right formula. The plates are in column 'I' and I have already added another column after it, titled 'approved plates', in column 'J', but this can be deleted if it's not needed.
Results-wise, I can do this one of two ways: either get the Excel document to highlight any number plates that do not match the DVLA standard, or have a column next to the number plate column that gives a boolean response, i.e. whether the plate is valid (true) or not (false).
Either way the plate needs to be able to be seen as it was currently, so if there is something wrong with it, it needs to be visible, not throw up an error message.
Any help would be very welcome.
All the information on UK standard number plates are on this site:
https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/359317/INF104_160914.pdf
I would do it like this:
1) create a lookup sheet with data from the booklet. One column for allowed "memory tag" identifiers (first two letters), one column for the allowed "age identifiers" (first two numbers), and one column for allowed random letters (last three letters, full alphabet except I and Q)
2) strip spaces from the number plate for comparison
3) Use MID(numberplate,1,2), MID(numberplate,3,2) and MID(numberplate,5,3) to compare to each lookup list respectively (using INDEX()>0).
4) when all 3 parts are found in lookup lists the number plate is valid.
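The four steps above can be sketched in Python as follows. This is only an outline of the logic, not Excel code: `MEMORY_TAGS` and `AGE_IDS` are small placeholder sets invented for the example, and the real lists would have to be filled in from the DVLA booklet. Only the "full alphabet except I and Q" rule for the last three letters comes from the answer itself.

```python
import string

# Placeholder lookup lists (assumptions); the full lists come from the DVLA booklet.
MEMORY_TAGS = {"BK", "LA", "YA"}
AGE_IDS = {"08", "58", "14", "64"}
# Random letters: full alphabet except I and Q, as stated in step 1.
RANDOM_LETTERS = set(string.ascii_uppercase) - {"I", "Q"}


def is_valid_plate(plate):
    """Check the three parts of a current-format plate against the lookup lists."""
    cleaned = plate.replace(" ", "").upper()  # step 2: strip spaces
    if len(cleaned) != 7:
        return False
    # Step 3: the same slices as MID(p,1,2), MID(p,3,2) and MID(p,5,3).
    tag, age, letters = cleaned[0:2], cleaned[2:4], cleaned[4:7]
    # Step 4: valid only when all three parts are found.
    return (tag in MEMORY_TAGS
            and age in AGE_IDS
            and all(ch in RANDOM_LETTERS for ch in letters))


print(is_valid_plate("BK08 JWZ"))  # True
print(is_valid_plate("BK08JIZ"))   # False: I is not an allowed random letter
```

Because the function only returns True or False, the original plate text stays visible in its own column, which is what the question asks for.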
Try researching Regular Expressions or RegEx. This is a powerful programming tool to determine whether strings match specific patterns. You can use RegEx expressions to extract the pattern, replace the pattern or test for the pattern. Very efficient but not for the faint-hearted although there is plenty of help on-line. Try this article for starters.
The following RegEx may be what you need..
(^[A-Z]{2}[0-9]{2}[A-Z]{3}$)|(^[A-Z][0-9]{1,3}[A-Z]{3}$)|(^[A-Z]{3}[0-9]{1,3}[A-Z]$)|(^[0-9]{1,4}[A-Z]{1,2}$)|(^[0-9]{1,3}[A-Z]{1,3}$)|(^[A-Z]{1,2}[0-9]{1,4}$)|(^[A-Z]{1,3}[0-9]{1,3}$)
This was copied from this article which gives a very full explanation using DVLA rules.
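As a quick way to try the pattern outside Excel, here is a minimal Python sketch using the standard `re` module. It assumes each alternative is a plain capturing group, and the helper name `matches_plate_format` is made up. Note that the pattern only checks the layout of letters and digits; it does not check whether the memory tag or age identifier actually exists.

```python
import re

# The alternatives from the pattern above, written as plain capturing groups.
PLATE_PATTERN = re.compile(
    r"(^[A-Z]{2}[0-9]{2}[A-Z]{3}$)"   # current format, e.g. BK08JWZ
    r"|(^[A-Z][0-9]{1,3}[A-Z]{3}$)"
    r"|(^[A-Z]{3}[0-9]{1,3}[A-Z]$)"
    r"|(^[0-9]{1,4}[A-Z]{1,2}$)"
    r"|(^[0-9]{1,3}[A-Z]{1,3}$)"
    r"|(^[A-Z]{1,2}[0-9]{1,4}$)"
    r"|(^[A-Z]{1,3}[0-9]{1,3}$)"
)


def matches_plate_format(plate):
    """True when the plate, with spaces removed, fits one of the layouts."""
    return PLATE_PATTERN.match(plate.replace(" ", "")) is not None


print(matches_plate_format("BK08 JWZ"))  # True
print(matches_plate_format("BK08JW"))    # False
```

If only current-format plates should pass, as the question suggests, the first alternative on its own would be enough.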
EDIT:
To use RegEx within Excel: in the VBA editor (IDE), open the Tools menu, select References and add the Microsoft VBScript Regular Expressions 5.5 reference.
With acknowledgement to user3616725's helpful observation.

Resources