How to calculate the smallest repetition of a word? - combinatorics

We are given a word s and want to build a word p by rearranging the letters of s in any order.
The repeatability of a word is the length of the longest prefix that also occurs elsewhere in the word.
For example, the repeatability of "barbara" is 3, because the prefix of length 3 ("bar") occurs again later in the word and no longer prefix does.
The repeatability of "ababa" is also 3, because "aba" occurs again from the 3rd letter to the last.
We want to find the rearrangement with the smallest repeatability.
Among all rearrangements with the smallest repeatability, we want the lexicographically smallest one.
The length of s is less than 10^5.
Examples:
input: barbara output: aababrr,
input: banana output: baaann,
input: ababa output: aabab,
I pondered several cases and decided that if a letter occurs only once, it is necessary to find the largest one and give it to the beginning and arrange the rest of the letters lexicographically, but I am unable to think of anything else. How to solve it?
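To make the definition concrete, here is a minimal sketch that computes the repeatability with a Z-function and checks the three examples by brute force. The function names are my own, and the brute-force search only works for short words; it is a reference for testing ideas, not a solution for lengths near 10^5.

```python
from itertools import permutations


def repeatability(word: str) -> int:
    """Length of the longest prefix that also starts at some later position."""
    n = len(word)
    z = [0] * n  # z[i] = length of the longest common prefix of word and word[i:]
    left = right = 0
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and word[z[i]] == word[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return max(z, default=0)


def best_rearrangement_bruteforce(s: str) -> str:
    """Try every distinct rearrangement; feasible only for short words."""
    candidates = {"".join(p) for p in permutations(s)}
    # Smallest repeatability first, ties broken lexicographically.
    return min(candidates, key=lambda w: (repeatability(w), w))


print(repeatability("barbara"))                  # 3
print(best_rearrangement_bruteforce("barbara"))  # aababrr
```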

Related

Check if string contains consecutive repeated substring

I got an interview problem which asks to determine whether or not a given string contains substring repeated right after it. For example:
ATAYTAYUV contains TAY after TAY
AABCD contains A after A
ABCAB contains two AB, but they are not consecutive, so the answer is negative
My idea was to look at the first letter, find its second occurrence then check letter by letter if the letters after the first occurrence match the letters after the second occurrence. If they all do, the answer is positive. If not, once I get a mismatch, I can repeat the process but starting with the last letter I checked, since I would not be able to get a repeated sequence up to that point.
I am not sure if the approach is correct or if it is the most efficient.
Assume that you are looking for a repeating pattern of length 3. If you write the string shifted right by three positions underneath itself (and trimmed), a run of 3 consecutive positions where the two rows match reveals a repeated substring.
ATAYTAYUV
   ATAYTA
Repeat this for all lengths up to N/2.
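A minimal sketch of this shift-and-compare idea, under the assumption that a quadratic scan is acceptable. The function name is hypothetical. For each candidate length it counts consecutive positions where the string matches itself shifted by that length.

```python
def has_consecutive_repeat(s: str) -> bool:
    """True if some substring is immediately followed by a copy of itself."""
    n = len(s)
    for length in range(1, n // 2 + 1):
        run = 0  # consecutive positions i with s[i] == s[i - length]
        for i in range(length, n):
            if s[i] == s[i - length]:
                run += 1
                if run == length:
                    return True
            else:
                run = 0
    return False


print(has_consecutive_repeat("ATAYTAYUV"))  # True  (TAY after TAY)
print(has_consecutive_repeat("AABCD"))      # True  (A after A)
print(has_consecutive_repeat("ABCAB"))      # False
```

This takes O(N^2) time in the worst case, since there are N/2 lengths and each is checked in one linear pass.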

Maximum number of consecutive 1's in a string

A string of length N (which can be up to 10^5) is given, consisting only of 0s and 1s. We have to remove two substrings of length exactly K from the original string to maximize the number of consecutive 1's.
For example, suppose the string is 1100110001 and K=1.
So we can remove two substrings of length 1. The best possible option here is to remove the 0's at 3rd place and 4th place and get the output as 4 (as the new string will be 11110001)
If I try brute force it'll timeout for sure. I don't know if sliding window will work or not. Can anyone give me any hint on how to proceed? I am not demanding the full answer obviously, just some hints will work for me. Thanks in advance :)
This has a pretty straightforward dynamic programming solution.
For each index i, calculate:
1. The length of the sequence of 1s that immediately precedes it, if nothing has been removed;
2. The longest sequence of 1s that could immediately precede it, if exactly one substring is removed before it; and
3. The longest sequence of 1s that could immediately precede it, if exactly two substrings are removed before it.
For each index, these three values are easily calculated in constant time from the values for earlier indexes, so you can do this in a single pass in O(N) time.
For example, let BEST(i,r) be the best length immediately preceding position i after removing r substrings. If i >= K, then you can remove a substring ending at i and have BEST(i,r) = BEST(i-K,r-1) for r > 0. If string[i-1] = '1' then you could extend the sequence from the previous position and have BEST(i,r) = BEST(i-1,r)+1. Choose the best possibility for each i,r.
The largest value you find in step (3) is the answer.
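The recurrence can be sketched as follows. This is one possible reading of the answer, not a definitive implementation: the function name, the `INFEASIBLE` sentinel, and the choice to raise an error when two removals do not fit are my own assumptions, and K is assumed to be at least 1.

```python
def longest_ones_after_removals(s: str, k: int, removals: int = 2) -> int:
    """Longest run of 1s after removing exactly `removals` substrings of length k."""
    n = len(s)
    INFEASIBLE = -1  # marks states where that many removals cannot fit yet
    # best[i][r]: longest run of 1s ending right before position i,
    # with exactly r substrings removed from s[:i].
    best = [[INFEASIBLE] * (removals + 1) for _ in range(n + 1)]
    best[0][0] = 0
    answer = INFEASIBLE
    for i in range(1, n + 1):
        for r in range(removals + 1):
            value = INFEASIBLE
            # Option 1: keep s[i-1]; it extends the run or resets it to 0.
            if best[i - 1][r] != INFEASIBLE:
                value = best[i - 1][r] + 1 if s[i - 1] == "1" else 0
            # Option 2: remove the substring of length k ending at position i.
            if r > 0 and i >= k and best[i - k][r - 1] != INFEASIBLE:
                value = max(value, best[i - k][r - 1])
            best[i][r] = value
        answer = max(answer, best[i][removals])
    if answer == INFEASIBLE:
        raise ValueError("string is too short for the requested removals")
    return answer


print(longest_ones_after_removals("1100110001", 1))  # 4
```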

Finding all words in a paragraph whose first three letters are the same?

How can we solve this problem in a best way? Is there any algorithm for solving this?
"In a paragraph we have to find and print all the words which have starting 3 letters same. Example: we input some paragraph and as a output we get letters like-
a) 1. you 2. your 3. yours 4. yourself
b) 1. early 2. earlier 3. earliest
Like this we get all the words of paragraph which have starting 3 letters common"
A reasonable solution that's not too hard to code up is to maintain a map of some sort where the keys are the first three letters of each word and the values are the sets of words that start with those three letters. You can scan across the words in the paragraph and, for each one you encounter, take its first three letters, look up the map entry corresponding to those letters, and add that word to the set. You can then iterate over the map at the end, find all sets containing at least two words, then print out each cluster you find.
Overall, the runtime of this approach is O(L), where L is the total length of all the words in the paragraph. To see this, notice that for each word, we do a map lookup on a constant-sized prefix of that word, then copy all the characters of the word into the map. Overall, this visits each character at most a constant number of times.
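A minimal sketch of the map-based approach. The tokenization (runs of ASCII letters, lowercased) and the function name are my own assumptions; the question does not say how words are delimited or whether case matters.

```python
import re
from collections import defaultdict


def group_by_prefix(paragraph: str, prefix_len: int = 3) -> dict[str, list[str]]:
    """Group words by their first `prefix_len` letters; keep groups of 2 or more."""
    groups = defaultdict(set)
    for word in re.findall(r"[A-Za-z]+", paragraph):
        word = word.lower()
        if len(word) >= prefix_len:
            groups[word[:prefix_len]].add(word)
    return {p: sorted(ws) for p, ws in groups.items() if len(ws) >= 2}


text = "You told yourself your plan was early, but yours was earlier."
for prefix, words in group_by_prefix(text).items():
    print(prefix, words)
```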
A trie keyed on the first three characters, with the word index stored at the leaf, should do the trick.

determining soundex conversion

When converting the name 'Lukasieicz' to Soundex (LETTER, DIGIT, DIGIT, DIGIT, DIGIT), I come up with L2222.
However, I am being told by my lecture slides that the actual answer is supposed to be L2220.
Please explain why my answer is incorrect, or if the lecture answer was just a typo or something.
my steps:
Lukasieicz
remove and keep L
ukasieicz
Remove contiguous duplicate characters
ukasieicz
remove A,E,H,I,O,U,W,Y
KSCZ
convert up to first four remaining letters to soundex (as described in lecture directions)
2222
append beginning letter
L2222
If this is American Soundex as defined by the National Archives, you're both wrong. American Soundex contains one letter and three numbers, so you can't have L2222 or L2220. It's L222.
But let's say they added another number for some reason.
The basic substitution gives L2222. But you're supposed to collapse adjacent letters with the same numbers (step 3 below) and then pad with zeros if necessary (step 4).
3. If two or more letters with the same number are adjacent in the original name (before step 1), only retain the first letter; also two letters with the same number separated by 'h' or 'w' are coded as a single number, whereas such letters separated by a vowel are coded twice. This rule also applies to the first letter.
4. If you have too few letters in your word that you can't assign [four] numbers, append with zeros until there are [four] numbers. If you have more than [4] letters, just retain the first [4] numbers.
Lukasieicz # the original word
L_2_2___22 # replace with numbers, leave the gaps in
L_2_2___2 # apply step 3 and squeeze adjacent numbers
L2220 # apply step 4 and pad to four numbers
We can check how conventional (i.e. three-number) Soundex implementations behave with the shorter Lukacz, which becomes L_2_22. Following rules 3 and 4, it should be L220.
The National Archives recommends an online Soundex calculator which produces L220. So does PostgreSQL and Text::Soundex in both its original flavor and NARA implementations.
$ perl -wle 'use Text::Soundex; print soundex("Lukacz"); print soundex_nara("Lukacz")'
L220
L220
MySQL, predictably, is doing its own thing and returns L200.
This function implements the original Soundex algorithm, not the more popular enhanced version (also described by D. Knuth). The difference is that original version discards vowels first and duplicates second, whereas the enhanced version discards duplicates first and vowels second.
In conclusion, you forgot the squeeze step.
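The squeeze and pad steps can be sketched as below. This is a simplified illustration of the rules quoted above, not a replacement for the implementations mentioned. The `digits` parameter is my own addition so that the four-digit variant from the lecture can be reproduced, and Y is treated like a vowel here.

```python
# Standard Soundex letter-to-digit table; vowels, H, W and Y have no code.
CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name: str, digits: int = 3) -> str:
    """First letter followed by `digits` code digits, zero-padded."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    out = []
    prev = CODES.get(first, "")  # the squeeze rule also applies to the first letter
    for c in letters[1:]:
        if c in "HW":
            continue  # H and W do not separate letters with the same code
        code = CODES.get(c, "")
        if code and code != prev:
            out.append(code)
        prev = code  # a vowel resets prev, so the same code may repeat after it
    return (first + "".join(out) + "0" * digits)[: digits + 1]


print(soundex("Lukasieicz", digits=4))  # L2220
print(soundex("Lukacz"))                # L220
```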

Permutations of a string of non-unique characters

While there are a lot of solutions for how to find all the (unique) permutations of a string of unique characters, I haven't found solutions that work when the characters are non-unique. I have listed out my idea below and would appreciate feedback, but also feel free to provide your own ideas.
My idea:
To illustrate my algorithm, I'm using the example of the string ABBC, which I want to find all permutations of. Since there are two B's I will be labelling them B1 and B2.
1. Create a new string by removing all duplicate characters from the original string (e.g. turn AB1B2C into AB1C).
2. Find all possible permutations of the new string (e.g. AB1C, ACB1, B1AC, etc.). There are many algorithms to do this, since the string's characters are all unique.
3. Choose one duplicate character. For each permutation, insert the chosen duplicate character at every "position" of the permutation, except when the character just before the duplicate character has the same value as the duplicate character (e.g. For the permutation AB1C, since the duplicate character is B2, insert it to get B2AB1C, AB2B1C, AB1CB2. Exception: Don't do AB1B2C, since that's just a duplicate of AB2B1C).
4. Continue to do step 3 but now choose a different duplicate character. (Do this until all duplicate characters have been chosen exactly once.)
Prior research: The answer by Prakhar on this SO question claims to work for duplicates: Generate list of all possible permutations of a string. It might, but I suspect there's a bug in the code.
How about this: suppose that the string with duplicates is of length N. Now consider the sequence 0, 1, ..., N-1. Find all its permutations using one of the known algorithms. For each permutation in this list, generate a corresponding string by using the numbers in the permutation as indexes into the original string. For example, if the string is ABBC, then the sequences will be 0,1,2,3; 0,1,3,2; etc. The sequence 3,0,1,2, as an example, is one of the permutations, and it yields the string CABB.
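A minimal sketch of the index-permutation idea. Mapping index permutations back to characters produces the same string more than once when characters repeat (0,1,2,3 and 0,2,1,3 both give ABBC), so this sketch collects the results in a set, which is my own addition to the approach described. It still enumerates all N! index permutations, so it is only practical for short strings.

```python
from itertools import permutations


def unique_permutations(s: str) -> list[str]:
    """All distinct rearrangements of s, built from permutations of its indexes."""
    seen = set()
    for index_order in permutations(range(len(s))):
        seen.add("".join(s[i] for i in index_order))
    return sorted(seen)


perms = unique_permutations("ABBC")
print(len(perms))       # 12, i.e. 4! / 2!
print("CABB" in perms)  # True
```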
