Finding all words in a paragraph whose first three letters are the same? - string

What is the best way to solve this problem? Is there an algorithm for it?
"In a paragraph we have to find and print all the words that start with the same three letters. Example: we input some paragraph and as output we get groups of words like:
a) 1. you 2. your 3. yours 4. yourself
b) 1. early 2. earlier 3. earliest
In this way we get all the words of the paragraph that have their first 3 letters in common."

A reasonable solution that's not too hard to code up is to maintain a map of some sort where the keys are the first three letters of each word and the values are the sets of words that start with those three letters. You can scan across the words in the paragraph and, for each one you encounter, take its first three letters, look up the map entry corresponding to those letters, and add that word to the set. You can then iterate over the map at the end, find all sets containing at least two words, then print out each cluster you find.
Overall, the runtime of this approach is O(L), where L is the total length of all the words in the paragraph. To see this, notice that for each word, we do a map lookup on a constant-sized prefix of that word, then copy all the characters of the word into the map. Overall, this visits each character at most a constant number of times.
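A minimal sketch of the map-based approach described above, in Python. The function name `group_by_prefix`, the lowercasing, and the rule that a "word" is a run of ASCII letters are assumptions made for illustration; words shorter than three letters are skipped.

```python
import re
from collections import defaultdict

def group_by_prefix(paragraph, prefix_len=3):
    """Map each 3-letter prefix to the words that start with it."""
    groups = defaultdict(set)
    # Assumption: a word is a maximal run of ASCII letters, compared case-insensitively.
    for word in re.findall(r"[A-Za-z]+", paragraph.lower()):
        if len(word) >= prefix_len:
            groups[word[:prefix_len]].add(word)
    # Keep only the clusters that contain at least two distinct words.
    return {prefix: sorted(words)
            for prefix, words in groups.items()
            if len(words) >= 2}

if __name__ == "__main__":
    text = "You said your book is yours; the earliest bird came earlier."
    for prefix, words in group_by_prefix(text).items():
        print(prefix, words)
```

Using sets means a word that appears several times in the paragraph is reported once per cluster.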

A trie over the first three characters, with the word indices stored at the leaves, should do the trick.
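One possible reading of the trie suggestion, sketched in Python with nested dictionaries. The names `build_prefix_trie`, `clusters` and the `LEAF` marker are hypothetical; the input is assumed to be an already tokenized list of words.

```python
LEAF = None  # hypothetical marker key under which a depth-3 node stores word indices

def build_prefix_trie(words, depth=3):
    """Insert only the first `depth` characters of each word into the trie."""
    root = {}
    for index, word in enumerate(words):
        if len(word) < depth:
            continue  # too short to have a 3-letter prefix
        node = root
        for ch in word[:depth]:
            node = node.setdefault(ch, {})
        node.setdefault(LEAF, []).append(index)
    return root

def clusters(words, depth=3):
    """Return the groups of words that share their first `depth` characters."""
    found = []

    def walk(node, level):
        if level == depth:
            indices = node.get(LEAF, [])
            if len(indices) >= 2:
                found.append([words[i] for i in indices])
            return
        for ch in sorted(node):
            walk(node[ch], level + 1)

    walk(build_prefix_trie(words, depth), 0)
    return found
```

Because indices are stored rather than the words themselves, a word that occurs twice in the input shows up twice in its cluster.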

Related

How to calculate the smallest repetition of a word?

We have a word "s". We want to build a word "p" by rearranging the letters of "s" in any order.
The repeatability of a word is the length of its longest prefix that also occurs elsewhere in the word.
For example, the repeatability of the word "barbara" is 3, because the prefix of length 3 (bar) occurs again later in the word, and no longer prefix does.
"ababa" has a repeatability of 3, because aba also occurs from the 3rd to the last letter.
We want to find the rearrangement with the least repeatability.
Of all the rearrangements that have the least repeatability, we want to find the lexicographically earliest one.
The length of s is less than 10^5.
Examples:
input: barbara output: aababrr,
input: banana output: baaann,
input: ababa output: aabab,
I pondered several cases and decided that if a letter occurs only once, it is necessary to find the largest one and give it to the beginning and arrange the rest of the letters lexicographically, but I am unable to think of anything else. How to solve it?
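The definition of repeatability above can be checked mechanically. The sketch below uses the Z-function, a technique not mentioned in the question: Z[i] is the length of the longest common prefix of the word and its suffix starting at i, so the repeatability is the largest Z[i] for i >= 1. It only evaluates a given word and does not solve the rearrangement problem; the function names are illustrative.

```python
def z_function(s):
    """z[i] = length of the longest common prefix of s and s[i:] (z[0] is left at 0)."""
    n = len(s)
    z = [0] * n
    left = right = 0  # [left, right) is the rightmost matched window found so far
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and s[i + z[i]] == s[z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z

def repeatability(s):
    """Length of the longest prefix of s that also starts at some later position."""
    return max(z_function(s)[1:], default=0)
```

This runs in O(n), so it is cheap enough to test candidate rearrangements of words near the 10^5 limit.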

Check if string contains consecutive repeated substring

I got an interview problem which asks to determine whether or not a given string contains substring repeated right after it. For example:
ATAYTAYUV contains TAY after TAY
AABCD contains A after A
ABCAB contains two AB, but they are not consecutive, so the answer is negative
My idea was to look at the first letter, find its second occurrence then check letter by letter if the letters after the first occurrence match the letters after the second occurrence. If they all do, the answer is positive. If not, once I get a mismatch, I can repeat the process but starting with the last letter I checked, since I would not be able to get a repeated sequence up to that point.
I am not sure if the approach is correct or if it is the most efficient.
Assume that you are looking for a repeating pattern of length 3. If you write the string shifted right by three positions underneath itself (trimmed to the overlap), a repeat shows up as a run of 3 consecutive positions where the two rows hold the same character. Here the rows agree on T, A, Y:
ATAYTAYUV
   ATAYTA
Repeat this for all lengths up to N/2.
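The shift-and-compare idea can be sketched as follows. The function name `has_adjacent_repeat` is an assumption; the loop counts consecutive positions where the string agrees with its shifted copy and stops once the run is as long as the shift.

```python
def has_adjacent_repeat(s):
    """True if some non-empty substring is immediately followed by a copy of itself."""
    n = len(s)
    for length in range(1, n // 2 + 1):
        run = 0  # consecutive positions where s[i] == s[i + length]
        for i in range(n - length):
            if s[i] == s[i + length]:
                run += 1
                if run == length:
                    return True
            else:
                run = 0
    return False

if __name__ == "__main__":
    for text in ("ATAYTAYUV", "AABCD", "ABCAB"):
        print(text, has_adjacent_repeat(text))
```

Each length costs one linear pass, so trying all lengths up to N/2 takes O(N^2) time in the worst case.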

Finding a word from a list of strings

Say I have a list of strings
["rdfa", "gijn", "ipqd"]
and have a variable containing the string "and", how would I be able to check if "and" is in the list? To make this more clear think of the list as a word search:
rdfa
gijn
ipqd
I see the vertical word "and", but how would I be able to check if the word "and" is in the list? Finding horizontal words was much easier, but finding a vertical word is confusing me. I was thinking possibly that I would need to find if the first letter of "and" is in any element in the list, then I would need to find if the second letter is in the same column as the first, and also in the row above or below the first letter, and the same for subsequent letters (as I'd like this to work for any length word). However I'm not sure how this would be implemented. I hope the question is clear as it's quite difficult to explain without showing a word search.
A pure Python way to transpose your matrix of letters would be
def transpose(l):
    # zip(*l) yields the columns; join each column back into a string
    return list(map(''.join, zip(*l)))
l = ["rdfa", "gijn", "ipqd"]
if 'and' in transpose(l):
    ...
A list element is unaware of the rest of the elements, so you can't compare it vertically. If the word list you have is not huge, a straightforward way is to construct the transpose of your list, splitting each word into characters first so that NumPy sees a 2-D array:
import numpy as np
["".join(col) for col in np.array([list(w) for w in your_word_list]).T]
and then look horizontally. If your words are not the same length just pad them with spaces.
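A small sketch that combines the transpose with the padding advice, using only the standard library. The helper names `columns` and `contains_vertical` are assumptions, and checking the reversed column (bottom to top) is an extra that the question only hints at.

```python
from itertools import zip_longest

def columns(rows):
    """Return the columns of the grid as strings, padding short rows with spaces."""
    return ["".join(col) for col in zip_longest(*rows, fillvalue=" ")]

def contains_vertical(rows, word):
    """True if `word` reads top-to-bottom or bottom-to-top in any column."""
    return any(word in col or word in col[::-1] for col in columns(rows))

if __name__ == "__main__":
    grid = ["rdfa", "gijn", "ipqd"]
    print(columns(grid))
    print(contains_vertical(grid, "and"))
```

Since `word in col` is a substring test, this also finds words shorter than the full column.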
If you are trying to solve a word puzzle, then check this question or this module

determining soundex conversion

when converting the name 'Lukasieicz' to soundex (LETTER,DIGIT,DIGIT,DIGIT,DIGIT), I come up with L2222.
However, I am being told by my lecture slides that the actual answer is supposed to be L2220.
Please explain why my answer is incorrect, or if the lecture answer was just a typo or something.
my steps:
Lukasieicz
remove and keep L
ukasieicz
Remove contiguous duplicate characters
ukasieicz
remove A,E,H,I,O,U,W,Y
KSCZ
convert up to first four remaining letters to soundex (as described in lecture directions)
2222
append beginning letter
L2222
If this is American Soundex as defined by the National Archives, you're both wrong. American Soundex contains one letter and three numbers, so you can't have L2222 or L2220. It's L222.
But let's say they added another number for some reason.
The basic substitution gives L2222. But you're supposed to collapse adjacent letters with the same numbers (step 3 below) and then pad with zeros if necessary (step 4).
3. If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by 'h' or 'w' are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
4. If you have too few letters in your word that you can't assign [four] numbers, append with zeros until there are [four] numbers. If you have more than [4] letters, just retain the first [4] numbers.
Lukasieicz # the original word
L_2_2___22 # replace with numbers, leave the gaps in
L_2_2___2 # apply step 3 and squeeze adjacent numbers
L2220 # apply step 4 and pad to four numbers
We can check how conventional (i.e. three-number) Soundex implementations behave with the shorter Lukacz, which becomes L_2_22. Following rules 3 and 4, it should be L220.
The National Archives recommends an online Soundex calculator which produces L220. So does PostgreSQL and Text::Soundex in both its original flavor and NARA implementations.
$ perl -wle 'use Text::Soundex; print soundex("Lukacz"); print soundex_nara("Lukacz")'
L220
L220
MySQL, predictably, is doing its own thing and returns L200.
This function implements the original Soundex algorithm, not the more popular enhanced version (also described by D. Knuth). The difference is that original version discards vowels first and duplicates second, whereas the enhanced version discards duplicates first and vowels second.
In conclusion, you forgot the squeeze step.
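The rules quoted above (squeeze adjacent equal codes, treat 'h' and 'w' as transparent, pad with zeros) can be sketched in Python as follows. This is an illustrative implementation, not the code of any of the libraries mentioned; the `digits` parameter is a hypothetical knob that allows the four-digit variant from the lecture.

```python
# Standard American Soundex letter groups; letters not listed carry no code.
CODES = {}
for digit, letters in enumerate(("bfpv", "cgjkqsxz", "dt", "l", "mn", "r"), start=1):
    for ch in letters:
        CODES[ch] = str(digit)

def soundex(name, digits=3):
    """First letter followed by `digits` numbers, zero-padded."""
    name = "".join(ch for ch in name.lower() if ch.isalpha())
    if not name:
        return ""
    result = name[0].upper()
    prev = CODES.get(name[0], "")  # the squeeze rule also applies to the first letter
    for ch in name[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so the same code may be emitted again
    return (result + "0" * digits)[:digits + 1]

if __name__ == "__main__":
    print(soundex("Lukasieicz", digits=4))
    print(soundex("Lukacz"))
```

With `digits=4` the name from the question yields L2220, and the three-digit default yields L220 for Lukacz, matching the walkthrough above.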

How should I remove all the repeated words and letters of a String?

I am trying to remove every character repeated over 2 times from an extremely long string. So, for example, the word Terrrrrrific becomes Terrific.
Now my question is, how do I filter out repeats that include more than a single character the same way, i.e. if I have Words words words words words I want to filter it down to words words, however, it might be something less sensible, such as abcdabcdabcdabcdabcd which should become abcdabcd.
I do suspect that I should use a suffix tree, but I'm not sure how to go at the algorithm exactly.
I don't know whether this algorithm is efficient enough for you, but you can do this:
Choose a block length for finding repeats
Then for every start offset from 0 to length-1 go through the string in disjoint blocks of that length
Maintain a stack (push each block onto the stack unless the top two entries of the stack are both equal to it)
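One way to read the steps above is sketched below for a single block length and a single start offset. The name `collapse_blocks` and its parameters are assumptions, and the answer leaves open how to combine the passes over different lengths and offsets, so this shows only the inner stack step.

```python
def collapse_blocks(s, k, offset=0):
    """Keep at most two consecutive copies of each length-k block.

    Only blocks aligned to `offset` are considered; the characters
    before `offset` are copied through unchanged.
    """
    head = s[:offset]
    stack = []
    for i in range(offset, len(s), k):
        block = s[i:i + k]
        # Skip the block if the two most recent kept blocks are identical to it.
        if len(stack) >= 2 and stack[-1] == block and stack[-2] == block:
            continue
        stack.append(block)
    return head + "".join(stack)

if __name__ == "__main__":
    print(collapse_blocks("Terrrrrrific", 1))
    print(collapse_blocks("abcdabcdabcdabcdabcd", 4))
```

A full solution would call this for each block length and each offset from 0 to k-1, which is where the cost concern raised in the answer comes from.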
