Check if string contains consecutive repeated substring - string

I got an interview problem that asks whether a given string contains a substring that is repeated immediately after itself. For example:
ATAYTAYUV contains TAY after TAY
AABCD contains A after A
ABCAB contains two AB, but they are not consecutive, so the answer is negative
My idea was to look at the first letter, find its second occurrence then check letter by letter if the letters after the first occurrence match the letters after the second occurrence. If they all do, the answer is positive. If not, once I get a mismatch, I can repeat the process but starting with the last letter I checked, since I would not be able to get a repeated sequence up to that point.
I am not sure if the approach is correct or if it is the most efficient.

Assume that you are looking for a repeating pattern of length 3. If you write the string, shifted right by three positions and trimmed, underneath itself, a run of 3 consecutive positions where the two rows agree marks a repeat. Here the rows agree on T, A, Y:
ATAYTAYUV
   ATAYTA
Repeat this for all lengths up to N/2.
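The shift-and-compare idea can be sketched in Python as follows. This is only an illustration under my own naming (has_consecutive_repeat is a hypothetical helper, not something from the question); it runs in O(N^2) time overall, since each of the N/2 candidate lengths needs one pass over the string.

```python
def has_consecutive_repeat(s: str) -> bool:
    """Return True if some substring of s is immediately followed by itself."""
    n = len(s)
    for length in range(1, n // 2 + 1):
        run = 0  # number of consecutive positions where s[i] == s[i - length]
        for i in range(length, n):
            if s[i] == s[i - length]:
                run += 1
                if run == length:
                    # 'length' matches in a row: a block of that size repeats.
                    return True
            else:
                run = 0
    return False
```

With the examples from the question, this returns True for ATAYTAYUV and AABCD and False for ABCAB.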

Related

Maximum number of consecutive 1's in a string

A string of length N (can be up to 10^5) is given which consists of only 0 and 1. We have to remove two substrings of length exactly K from the original string to maximize the number of consecutive 1's.
For example, suppose the string is 1100110001 and K=1.
So we can remove two substrings of length 1. The best possible option here is to remove the 0's at 3rd place and 4th place and get the output as 4 (as the new string will be 11110001)
If I try brute force it'll timeout for sure. I don't know if sliding window will work or not. Can anyone give me any hint on how to proceed? I am not demanding the full answer obviously, just some hints will work for me. Thanks in advance :)
This has a pretty straightforward dynamic programming solution.
For each index i, calculate:
(1) The length of the sequence of 1s that immediately precedes it, if nothing has been removed;
(2) The longest sequence of 1s that could immediately precede it, if exactly one substring is removed before it; and
(3) The longest sequence of 1s that could immediately precede it, if exactly two substrings are removed before it.
For each index, these three values are easily calculated in constant time from the values for earlier indexes, so you can do this in a single pass in O(N) time.
For example, let BEST(i,r) be the best length immediately preceding position i after removing r substrings. If i >= K, then you can remove a substring ending at i and have BEST(i,r) = BEST(i-K,r-1) for r > 0. If string[i-1] = '1' then you could extend the sequence from the previous position and have BEST(i,r) = BEST(i-1,r)+1. Choose the best possibility for each i,r.
The largest value you find in step (3) is the answer.
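A minimal Python sketch of this recurrence, under a few assumptions of mine: the function name max_ones_after_removals is hypothetical, states that cannot be reached (not enough characters before i to remove r substrings) are marked with -1, and -1 is also returned when the string is shorter than 2K.

```python
def max_ones_after_removals(s: str, k: int, removals: int = 2) -> int:
    """Longest run of 1s after removing exactly `removals` substrings of length k."""
    n = len(s)
    INFEASIBLE = -1
    # best[i][r]: longest run of 1s ending right before position i
    # after removing exactly r substrings that lie entirely before i.
    best = [[INFEASIBLE] * (removals + 1) for _ in range(n + 1)]
    best[0][0] = 0
    answer = INFEASIBLE
    for i in range(1, n + 1):
        for r in range(removals + 1):
            candidate = INFEASIBLE
            previous = best[i - 1][r]
            if previous != INFEASIBLE:
                # Keep character i-1: extend the run on '1', reset it on '0'.
                candidate = previous + 1 if s[i - 1] == "1" else 0
            if r > 0 and i >= k and best[i - k][r - 1] != INFEASIBLE:
                # Remove the k characters ending at i: the run before them carries over.
                candidate = max(candidate, best[i - k][r - 1])
            best[i][r] = candidate
        answer = max(answer, best[i][removals])
    return answer
```

For the string 1100110001 with K=1 this yields 4, matching the example in the question. The table has N+1 rows of three values, so time and memory are both O(N).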

Delete all occurrences of substring in minimal steps

I want to find the minimum number of deletions I need to make in order for a substring to no longer appear in a given string. Both the string and substring are composed of only lower case letters.
For example, for string "recorerecore" and substring "recore" I would need 2 deletions.
For string "recorecore" and substring "recore" I would need only 1.
For string "recorecorecorecore" and substring "recore" I would need 2, either the first and third or the second and fourth.
For string "rerecorecore" I would need to take out 1, the second occurrence, as taking the first out would lead to having recore again.
I only can think of the brute force solution which involves actually deleting in every combination possible and finding the minimum, but this takes too long.
Does anyone know a way to do this faster?
Search the string for the substring with Boyer–Moore, delete each occurrence as you find it, and repeat recursively on the result.

Finding all words in a paragraph whose first three letters are the same?

What is the best way to solve this problem? Is there an algorithm for it?
"In a paragraph we have to find and print all the words which have starting 3 letters same. Example: we input some paragraph and as a output we get letters like-
a) 1. you 2. your 3. yours 4. yourself
b) 1. early 2. earlier 3. earliest
Like this we get all the words of paragraph which have starting 3 letters common"
A reasonable solution that's not too hard to code up is to maintain a map of some sort where the keys are the first three letters of each word and the values are the sets of words that start with those three letters. You can scan across the words in the paragraph and, for each one you encounter, take its first three letters, look up the map entry corresponding to those letters, and add the word to that set. You can then iterate over the map at the end, find all sets containing at least two words, then print out each cluster you find.
Overall, the runtime of this approach is O(L), where L is the total length of all the words in the paragraph. To see this, notice that for each word, we do a map lookup on a constant-sized prefix of that word, then copy all the characters of the word into the map. Overall, this visits each character at most a constant number of times.
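Here is a rough Python sketch of the map-based approach. The function name group_by_prefix is hypothetical, and two choices are my own assumptions rather than part of the problem statement: words are lower-cased before grouping, and words shorter than three letters are skipped.

```python
import re
from collections import defaultdict


def group_by_prefix(paragraph: str, prefix_len: int = 3) -> dict:
    """Group the words of a paragraph by their first prefix_len letters."""
    groups = defaultdict(set)
    for word in re.findall(r"[A-Za-z]+", paragraph):
        word = word.lower()  # assumption: matching is case-insensitive
        if len(word) >= prefix_len:  # assumption: shorter words are ignored
            groups[word[:prefix_len]].add(word)
    # Keep only prefixes shared by at least two distinct words.
    return {p: sorted(ws) for p, ws in groups.items() if len(ws) >= 2}
```

For instance, group_by_prefix("You hurt yourself; your call. Early or earlier?") returns the clusters you/your/yourself under "you" and earlier/early under "ear".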
Trie with the first three characters and then the word index as the leaf should do the trick.

determining soundex conversion

When converting the name 'Lukasieicz' to Soundex (LETTER,DIGIT,DIGIT,DIGIT,DIGIT), I come up with L2222.
However, I am being told by my lecture slides that the actual answer is supposed to be L2220.
Please explain why my answer is incorrect, or if the lecture answer was just a typo or something.
my steps:
Lukasieicz
remove and keep L
ukasieicz
Remove contiguous duplicate characters
ukasieicz
remove A,E,H,I,O,U,W,Y
KSCZ
convert up to first four remaining letters to soundex (as described in lecture directions)
2222
append beginning letter
L2222
If this is American Soundex as defined by the National Archives, you're both wrong. American Soundex contains one letter and three numbers, so you can't have L2222 or L2220. It's L222.
But let's say they added another number for some reason.
The basic substitution gives L2222. But you're supposed to collapse adjacent letters with the same numbers (step 3 below) and then pad with zeros if necessary (step 4).
3. If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by 'h' or 'w' are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
4. If you have too few letters in your word that you can't assign [four] numbers, append with zeros until there are [four] numbers. If you have more than [4] letters, just retain the first [4] numbers.
Lukasieicz # the original word
L_2_2___22 # replace with numbers, leave the gaps in
L_2_2___2 # apply step 3 and squeeze adjacent numbers
L2220 # apply step 4 and pad to four numbers
We can check how conventional (i.e. three-number) Soundex implementations behave with the shorter Lukacz, which becomes L_2_22. Following rules 3 and 4, it should be L220.
The National Archives recommends an online Soundex calculator which produces L220. So does PostgreSQL and Text::Soundex in both its original flavor and NARA implementations.
$ perl -wle 'use Text::Soundex; print soundex("Lukacz"); print soundex_nara("Lukacz")'
L220
L220
MySQL, predictably, does its own thing and returns L200. Its documentation for SOUNDEX() explains why:
"This function implements the original Soundex algorithm, not the more popular enhanced version (also described by D. Knuth). The difference is that original version discards vowels first and duplicates second, whereas the enhanced version discards duplicates first and vowels second."
In conclusion, you forgot the squeeze step.
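To make the squeeze and pad steps concrete, here is a small Python sketch of American Soundex with a configurable number of digits. The digits parameter is my own addition so that the four-digit variant from the lecture can be reproduced; this is an illustration of the rules quoted above, not a reference implementation.

```python
# Standard American Soundex letter groups.
_GROUPS = {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT", "4": "L", "5": "MN", "6": "R"}
_CODES = {letter: digit for digit, letters in _GROUPS.items() for letter in letters}


def soundex(name: str, digits: int = 3) -> str:
    """Return the first letter followed by `digits` Soundex digits."""
    letters = [ch for ch in name.upper() if ch.isalpha()]
    if not letters:
        return ""
    first, rest = letters[0], letters[1:]
    result = []
    previous = _CODES.get(first, "")  # the squeeze rule also covers the first letter
    for ch in rest:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != previous:
            result.append(code)
        previous = code  # a vowel resets this, so the same code can appear again
    # Pad with zeros, then truncate to the requested number of digits.
    return (first + "".join(result) + "0" * digits)[: digits + 1]
```

With digits=4, soundex("Lukasieicz", 4) gives L2220, and the default three-digit call soundex("Lukacz") gives L220, in line with the results above.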

compare a string to a cell array of strings in matlab and find the most similar

I have a list of images stored in a directory. They are all named. My GUI reads all the images and saves their names in a cell array. Now I have added an editable box where the user can type in a name and the program will show that image. The problem is that I want the program to take into account typos and misspellings by the user and find the file name most similar to the word the user typed. Can you please help me?
Many Thanks,
Hamid
You should read this WP article: Approximate string matching and look at "Calculation of distance between strings" on FEx.
I think you should use the longest common subsequence algorithm to approximately compare strings.
Here is a matlab implementation:
http://www.mathworks.com/matlabcentral/fileexchange/24559-longest-common-subsequence
Then do something like this:
[~,ind] = max(cellfun(@(x) LCS(lower(userInput), lower(x)), allFileNames));
chosenFile = allFileNames{ind};
(LCS is the longest common subsequence function, assumed here to return the length of the common subsequence, so the largest value marks the closest match; the function lower converts to lower case.)
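For readers who want to see the idea end to end, here is a rough sketch in Python rather than MATLAB. Both lcs_length and closest_name are hypothetical names of mine; the sketch assumes that similarity is measured by the raw length of the longest common subsequence, which tends to favour longer file names, so dividing the score by the name lengths may work better in practice.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (classic DP)."""
    previous = [0] * (len(b) + 1)
    for ch_a in a:
        current = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def closest_name(user_input: str, file_names: list) -> str:
    """Pick the file name sharing the longest common subsequence with the input."""
    query = user_input.lower()
    # max() returns the first best match when several names tie.
    return max(file_names, key=lambda name: lcs_length(query, name.lower()))
```

For example, closest_name("sunst", ["beach.png", "sunset.png"]) picks sunset.png despite the missing letter.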
Not exactly what you are looking for, but you can compare the first few characters of the strings ignoring case to find a close match. See the command strncmpi:
strncmpi Compare first N characters of strings ignoring case.
TF = strncmpi(S,C,N) performs a case-insensitive comparison between the
first N characters of string S and the first N characters in each element
of cell array C. Input S is a character vector (or 1-by-1 cell array), and
input C is a cell array of strings. The function returns TF, a logical
array that is the same size as C and contains logical 1 (true) for those
elements of C that are a match, except for letter case, and logical 0
(false) for those elements that are not. The order of the two input
arguments is not important.
