Choosing correct word for the given string - nlp

Suppose the given word is" connnggggggrrrraaatsss" and we need to convert it to congrats .
Or for other example is "looooooovvvvvveeeeee" should be changed to "love" .
Here the given words can be repeated for any number of times but it should be changed to correct form. We need to write a java based program.

You cannot really check for every word because there are certain words which have more than 1 alphabets in their spelling. So one way you could go is -
check for each alphabet in the word and restrict its number of consecutive appearances to two
now check the new spelling on the spell checker, you might want to try HUNspell as it is widely used by many word processing softwares.

Related

How to make an excel (365) function that recognizes different words in the same cell and changes them individually

What im working with
I have a list of product names, but unfortunately they are written in uppercase I now want to make only the first letter uppercase and the rest lowercase but I also want all words with 3 or less symbols to stay uppercase
im trying if functions but nothing is really working
i use the german excel version but i would be happy if someone has any idea on how to do it im trying different functions for hours but nothing is working
=IF(LENGTH(C6)<=3,UPPER(C6),UPPER(LEFT(C6,1))&LOWER(RIGHT(C6,LENGTH(C6)-1)))
but its a #NAME error excel does not recognize the first and the last bracket
This is hard! Let me explain:
I do believe there are German words in the mix that are below 4 characters in length that you should exclude. My German isn't great but there would probably be a huge deal of words below 4 characters;
There seems to be substrings that are 3+ characters in length but should probably stay uppercase, e.g. '550E/ER';
There seem to be quite a bunch of characters that could be used as delimiters to split the input into 'words'. It's hard to catch any of them without a full list;
Possible other reasons;
With the above in mind I think it's safe to say that we can try to accomplish something that you want as best as we can. Therefor I'd suggest
To split on multiple characters;
Exclude certain words from being uppercase when length < 3;
Include certain words to be uppercase when length > 3 and digits are present;
Assume 1st character could be made uppercase in any input;
For example:
Formula in B1:
=MAP(A1:A5,LAMBDA(v,LET(x,TEXTSPLIT(v,{"-","/"," ","."},,1),y,TEXTSPLIT(v,x,,1),z,TEXTJOIN(y,,MAP(x,LAMBDA(w,IF(SUM(--(w={"zu","ein","für","aus"})),LOWER(w),IF((LEN(w)<4)+SUM(IFERROR(FIND(SEQUENCE(10,,0),w),)),UPPER(w),LOWER(w)))))),UPPER(LEFT(z))&MID(z,2,LEN(v)))))
You can see how difficult it is to capture each and every possibility;
The minute you exclude a few words, another will pop-up (the 'x' between numbers for example. Which should stay upper/lower-case depending on the context it is found in);
The second you include words containing digits, you notice that some should be excluded ('00SICHERUNGS....');
If the 1st character would be a digit, the whole above solution would not change 1st alpha-char in upper;
Maybe some characters shouldn't be used as delimiters based on context? Think about hypenated words;
Possible other reasons.
Point is, this is not just hard, it's extremely hard if not impossible to do on the type of data you are currently working with! Even if one is proficient with writing a regular expression (chuck in all (non-available to Excel) tokens, quantifiers and methods if you like), I'd doubt all edge-case could be covered.
Because you are dealing with any number of words in a cell you'll need to get crafty with this one. Thankfully there is TEXTSPLIT() and TEXTJOIN() that can make short work of splitting the text into words, where we can then test the length, change the capitalization, and then join them back together all in one formula:
=TEXTJOIN(" ", TRUE, IF(LEN(TEXTSPLIT(C6," "))<=3,UPPER(TEXTSPLIT(C6," ")),PROPER(TEXTSPLIT(C6," "))))
Also used PROPER() formula as well, which only capitalizes the first character of a word.

Best way to find if there is a one-typo word from list of given words

How would you efficiently solve this problem
Suppose we were given a list of words [“apple”, “banana”, “mango”]
If we are given a word in the list that is one typo away,
“Dpple”
“Adple”
“Appld”
We output true
If there is more than one typo, we output false.
For optimizations, I’ve tried storing the list in a hashtable containing the number of letters of each word and looking for the same number of letters upon the given input to reduce the size in which we look for our input. Is there a faster optimization we can make to this problem?
One possible optimisation would be to generate all one-typo words for the given list and put them in a map (or some better string lookup structure). Then lookup the given words - if found output true, else false. The total number of one-typo words is: 25*L, where L is the total number of letters in the input list (assuming case does not matter).

Excel, Numberplate Clarification

I am working on an excel document for fuel cards at the minute and my current issue is to write in a formula for validating number plates based on UK standard plates (two letters followed by two numbers then three letters i.e. BK08JWZ). At this point in time we are not considering personal plates in this just to keep things simple.
Ideally I need excel to look at the text in the box and confirm it to an agreed layout but I am struggling to find the right formula. The plates are in column 'I' and I have already added in another column after titled 'approved plates' in column 'J'but this can be deleted if it's not needed.
Results wise, I can do this one of two ways, to either get the excel document to highlight and number plates that do not match the DVLA standard , or have a column next to the number plate column that registers a boolean response to the recognition i.e. If it is valid (true) or if not (false).
Either way the plate needs to be able to be seen as it was currently, so if there is something wrong with it, it needs to be visible, not throw up an error message.
Any help would be very welcome.
All the information on UK standard number plates are on this site:
https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/359317/INF104_160914.pdf
I would do it like this:
1) create a lookup sheet with data from the booklet. One column for allowed "memory tag" identiffiers (first two letters), one column for the allowed "age identiffiers" (first two numbers), and one column for allowed random letters (last three letters, full alphabet except I and Q)
2) strip spaces from the number plate for comparison
3) Use MID(numberplate,1,2), MID(numberplate,3,2) and MID(numberplate,5,3) to compare to each lookup list repectively (using INDEX()>0).
4) when all 3 parts are found in lookup lists the number plate is valid.
Try researching Regular Expressions or RegEx. This is a powerful programming tool to determine whether strings match specific patterns. You can use RegEx expressions to extract the pattern, replace the pattern or test for the pattern. Very efficient but not for the faint-hearted although there is plenty of help on-line. Try this article for starters.
The following RegEx may be what you need..
(?^[A-Z]{2}[0-9]{2}[A-Z]{3}$)|(?^[A-Z][0-9]{1,3}[A-Z]{3}$)|(?^[A-Z]{3}[0-9]{1,3}[A-Z]$)|(?^[0-9]{1,4}[A-Z]{1,2}$)|(?^[0-9]{1,3}[A-Z]{1,3}$)|(?^[A-Z]{1,2}[0-9]{1,4}$)|(?^[A-Z]{1,3}[0-9]{1,3}$)
This was copied from this article which gives a very full explanation using DVLA rules.
EDIT:
To use RegEx within Excel. In the IDE, Tools menu, select References and add the Microsoft VBScript Regular Expressions 5.5 reference.
With acknowlegement to user3616725s helpful observation.

Looking up Bigrams in Excel

Suppose I have a list of two-word pairs in a column in Excel. These words are delimited by a space so that a typical pair might look like "extreme happiness". The goal is to search for these 'bigrams' in a larger string located in another column. The issue is that the bigram will only be found if the two words are together and separated by a space. What would be preferable is if Excel could look for both words anywhere in a given larger string. It is crucial that the bigrams occupy one cell each since a score is assigned to each bigram and in fact the function used VLOOKUPs this value based on the bigram cell value. Would it make sense to change the space between any two words to a - or some other character? Is there a way to have Excel look up each value one at a time (perhaps by recognizing this character and passing through the larger string twice, that is, once for each word)?
Example: "The weather last night was extremely cold, but the warm fire gave me some happiness."
Here we would like to find both the word 'extreme' within the word extremely and the word happiness. Currently Excel would not be successful in doing this since it would just look for "extreme happiness" and determine that no such string exists.
If the bigram in the row below "extreme happiness" reads "weather gave" (for some reason) Excel will go check whether that bigram exists in the larger string and return a second score. This is done so that at the end every score can be added together.
This is pretty easy with a couple of formulas. See screenshot below:
The logic is simple. Assuming your bigram is in B1, we can input the following in C1. This will replace the spaces with *, which is Excel's wildcard character.
=SUBSTITUTE(B2," ","*")
Then we concatenate it to give us a wildcarded beginning and end.
=CONCATENATE("*",SUBSTITUTE(B2," ","*"),"*")
We then use a simple COUNTIF against the statement (here in A1) to return to us a count of occurence.
=COUNTIF(A2,CONCATENATE("*",SUBSTITUTE(B2," ","*"),"*"))
A simple IF check enclosing the above, with condition >0, can be used to give us either Yes or No.
=IF(COUNTIF(A2,CONCATENATE("*",SUBSTITUTE(B2," ","*"),"*"))>0,"Yes","No")
Let us know if this helps.

String manipulation: randomly swapping out the first letter with one of the other letters of the word

I need to randomly swap out the first letter of a word with one of the other letters. What i am having trouble is with specifying that i only need to randomly generate ONE character. I cant use any conditionals, so can anyone please recommend a method to use?

Resources