How to write a method that takes a string and returns the longest valid substring

I have been practicing interview questions with a friend, and he threw me this one he made up:
Given a method that tells you if a string is valid, write a method that takes a string, and returns the longest valid substring (without reordering the characters).
My first brute force solution would be to find all of the subsets of the input string, and then plug them through (longest to shortest) the given method till a valid string is found and return that.
But that obviously isn't good enough.
So I was trying to think of it this way:
Check the input string
Check all of the subsets of the inputString, with length == inputString length - 1
And so on and so forth until all of the subsets with length 1 are checked, and then return false
The problem in my head, then, is that in order for this to be optimal, we want to utilize the fact that we only care for the longest valid string. If I were to check each subset recursively, then I would be doing a depth-first traversal of the subsets, when I'm really looking for a breadth-first, so I can find the longest quicker.
Once I realized that, I got stuck. I couldn't even come up with pseudo code to tackle this problem.
Is a "breadth-first" search of the subsets of a string even possible?
The closest solution I could find was a promising-looking answer on Math Stack Exchange: https://math.stackexchange.com/questions/89419/algorithm-wanted-enumerate-all-subsets-of-a-set-in-order-of-increasing-sums
but unfortunately it is pretty hard for me to comprehend.
Would the best solution just be a depth-first recursive iteration through all of the subsets and return the longest valid string from there?

string in
for int sub_len in len(in) down to 1
    // the length of the substring must be at most the length of the input and at least 1
    for int sub_offset in 0 to len(in) - sub_len
        // the offset of the substring must be in [0, n],
        // where n is the number of characters that are not in the substring
        string sub = substring(in, sub_offset, sub_len)
        if isValid(sub)
            return sub
This generates all possible substrings for a given input (in) and returns the first/longest valid substring.
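As a rough illustration, the pseudocode above could look like this in Python. The name longest_valid_substring and the is_valid parameter are placeholders for whatever validity check the interviewer supplies; returning None when nothing is valid is also an assumption, since the question does not say what to return in that case.

```python
from typing import Callable, Optional


def longest_valid_substring(text: str, is_valid: Callable[[str], bool]) -> Optional[str]:
    """Return the first longest contiguous substring accepted by is_valid, or None."""
    n = len(text)
    # Try lengths from longest to shortest so the first hit is a longest one.
    for length in range(n, 0, -1):
        # Slide a window of this length over every possible start offset.
        for start in range(0, n - length + 1):
            candidate = text[start:start + length]
            if is_valid(candidate):
                return candidate
    return None  # no non-empty substring was valid


# Hypothetical validity check used only for the demo: "is a palindrome".
def is_palindrome(s: str) -> bool:
    return s == s[::-1]


print(longest_valid_substring("abacdc", is_palindrome))
```

This makes O(n^2) calls to the validity check in the worst case. Without knowing more about what "valid" means, it is hard to see how to avoid that.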

Related

Unique Substrings in wrap around strings

I have been given an infinite wrap around of the string str="abcdefghijklmnopqrstuvwxyz" so it looks like
"..zabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcd...." and another string p.
I need to find out how many unique non-empty substrings of p are present in the infinite wraparound string str?
For example: "zab"
There are 6 substrings "z", "a", "b", "za", "ab", "zab" of string "zab" in str.
I tried finding all suffixes of p in a particular concatenation of the string str, for example: "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz"
and as soon as I get a suffix which is a part of the above, I add all its substrings to my result:
for (int i=0;i<length;i++) {
    String suffix = p.substring(i,length);
    if(isPresent(suffix)) {
        sum += (suffix.length()*(suffix.length()+1))/2;
        break;
    } else {
        sum++;
    }
}
And my isPresent function is:
private boolean isPresent(String s) {
    if(s.length()==1) {
        return true;
    }
    String main = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz";
    return main.contains(s);
}
If p is longer than the concatenated string hard-coded in the isPresent function, my algorithm fails!
So how should I find the substrings irrespective of the wraparound string str? Is there a better approach for this problem?
Some ideas/suggestions (not a full algo)
You don't need to consider an infinite repetition of the wraparound string, only len(p)/len(repeating-fragment) + 1 (integer division) repetitions. Let's denote this string by S **
If a substring sp of p is a substring of S, then every substring of sp is also a substring of S
So the problem seems to reduce to:
find sp (a substring of both p and S) with the maximal length. This is called the longest common substring problem and admits a dynamic programming solution with complexity O(n*m), where n and m are the lengths of the two strings. The cited reference has a pseudocode algorithm.
repeat the above recursively with the 'remnants' of p after eliminating the longest common substring.
Now, you have a sequence of "longest common substrings". How many do you need to retain? I feel that the "longest common substring" may be used to trim down the need of brute-forcing every substring of any and all the above, but I'd need more time than I have available now.
I hope the sketch above helps.
** I might be wrong on the number of repetitions which need to be considered. If I am, then in any case there will be a maximal number of repetitions to be considered and there will be an S of minimal length that is sufficient for the purpose.
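For comparison, here is a small Python sketch of a different technique from the longest-common-substring idea above. It relies on the observation that a substring of p occurs in the wraparound string exactly when every adjacent pair of its characters is consecutive in the alphabet (with "z" followed by "a"). The function name is made up for illustration, and the code assumes p contains only lowercase a-z.

```python
def count_wraparound_substrings(p: str) -> int:
    """Count distinct non-empty substrings of p that occur in ...zabc...xyzabc...

    Assumes p consists only of lowercase letters a-z.
    """
    # best[c] = length of the longest alphabet-consecutive run in p ending at letter c.
    best = {}
    run = 0
    for i, ch in enumerate(p):
        # Extend the run if ch directly follows the previous character (z -> a wraps).
        if i > 0 and (ord(ch) - ord(p[i - 1])) % 26 == 1:
            run += 1
        else:
            run = 1
        best[ch] = max(best.get(ch, 0), run)
    # A valid substring is determined by its last letter and its length,
    # so letter c contributes exactly best[c] distinct substrings.
    return sum(best.values())


print(count_wraparound_substrings("zab"))
```

Because it never builds a concatenated copy of the alphabet, the length of p does not matter, and the scan is a single pass over p.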

Find the minimal lexicographical string formed by merging two strings

Suppose we are given two strings s1 and s2 (both lowercase). We have to find the lexicographically minimal string that can be formed by merging the two strings.
At first it looks pretty simple, like the merge step of the mergesort algorithm. But let us see what can go wrong.
s1: zyy
s2: zy
Now if we perform merge on these two we must decide which z to pick as they are equal, clearly if we pick z of s2 first then the string formed will be:
zyzyy
If we pick z of s1 first, the string formed will be:
zyyzy which is correct.
As we can see the merge of mergesort can lead to wrong answer.
Here's another example:
s1:zyy
s2:zyb
Now the correct answer is zybzyy, which is obtained only if we pick the z of s2 first.
There are plenty of other cases in which the simple merge fails. My question is: is there a standard algorithm for performing a merge that produces such output?
You could use dynamic programming. In f[x][y], store the lexicographically minimal string obtained by taking x characters from the first string s1 and y characters from the second string s2. You can calculate f bottom-up using the update:
f[x][y] = min(f[x-1][y] + s1[x], f[x][y-1] + s2[y])  // the '+' here represents the concatenation of a string and a character
Here characters are indexed from 1, and when x or y is 0 only one of the two candidates exists.
You start with f[0][0] = "" (empty string).
For efficiency you can store the strings in f as references. That is, you can store in f the objects
class StringRef {
StringRef prev;
char c;
}
To extract the string stored at a certain f[x][y], you just follow the references. To update, you point back to either f[x-1][y] or f[x][y-1], depending on what your update step says.
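A minimal Python sketch of the recurrence, storing full strings in the table instead of the StringRef chain for simplicity (so it uses more memory than the reference-based variant described above). The function name min_merge_dp is made up for illustration.

```python
def min_merge_dp(s1: str, s2: str) -> str:
    """Lexicographically smallest merge of s1 and s2 via the f[x][y] table."""
    n, m = len(s1), len(s2)
    # f[x][y] = smallest string using the first x chars of s1 and first y chars of s2.
    f = [[""] * (m + 1) for _ in range(n + 1)]
    for x in range(n + 1):
        for y in range(m + 1):
            if x == 0 and y == 0:
                continue  # f[0][0] stays the empty string
            candidates = []
            if x > 0:
                # Last character taken from s1 (Python strings are 0-indexed).
                candidates.append(f[x - 1][y] + s1[x - 1])
            if y > 0:
                # Last character taken from s2.
                candidates.append(f[x][y - 1] + s2[y - 1])
            f[x][y] = min(candidates)
    return f[n][m]


print(min_merge_dp("zyy", "zy"))
print(min_merge_dp("zyy", "zyb"))
```

The table has (n+1)*(m+1) cells and each cell compares strings of length up to n+m, so this is best suited to short inputs.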
It seems that the solution can be almost the same as you described (the "mergesort"-like approach), except that with special handling of equality. So long as the first characters of both strings are equal, you look ahead at the second character, 3rd, etc. If the end is reached for some string, consider the first character of the other string as the next character in the string for which the end is reached, etc. for the 2nd character, etc. If the ends for both strings are reached, then it doesn't matter from which string to take the first character. Note that this algorithm is O(N) because after a look-ahead on equal prefixes you know the whole look-ahead sequence (i.e. string prefix) to include, not just one first character.
EDIT: you look ahead so long as the current i-th characters from both strings are equal and alphabetically not larger than the first character in the current prefix.
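One common way to realize this kind of tie-breaking look-ahead is to compare the remaining suffixes of both strings, with a sentinel character larger than any letter appended to each. The sketch below uses that variant rather than the exact end-of-string rule described above, and it assumes lowercase input so that "{" (the character right after "z" in ASCII) can serve as the sentinel. The suffix comparison makes it quadratic in the worst case, not O(N).

```python
def min_merge_greedy(s1: str, s2: str) -> str:
    """Greedy merge: always take from the string whose remaining suffix is smaller.

    Assumes both inputs are lowercase a-z; '{' sorts after 'z' and acts as a sentinel.
    """
    a, b = s1 + "{", s2 + "{"
    i = j = 0
    out = []
    while i < len(s1) or j < len(s2):
        # Comparing whole suffixes is the look-ahead: equal prefixes are skipped
        # until the first differing character (or a sentinel) decides the tie.
        if a[i:] <= b[j:]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return "".join(out)


print(min_merge_greedy("zyy", "zy"))
print(min_merge_greedy("zyy", "zyb"))
```

The sentinel matters: without it, "b" would compare smaller than "ba", and merging "ba" with "b" would give "bba" instead of the smaller "bab".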

String matching without using builtin functions

I want to search for a query (a string) in a subject (another string).
The query may appear in whole or in parts, but will not be rearranged. For instance, if the query is 'da', and the subject is 'dura', it is still a match.
I am not allowed to use string functions like strfind or find.
The constraints make this actually quite straightforward with a single loop. Imagine you have two indices initially pointing at the first character of both strings, now compare them - if they don't match, increment the subject index and try again. If they do, increment both. If you've reached the end of the query at that point, you've found it. The actual implementation should be simple enough, and I don't want to do all the work for you ;)
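For readers who want to check their attempt afterwards, the two-index idea could be sketched in Python as follows. The function name is_in_order_match is made up, and the sketch avoids built-in search functions in the spirit of the question.

```python
def is_in_order_match(query: str, subject: str) -> bool:
    """True if the characters of query appear in subject in the same order."""
    qi = 0  # index of the next query character still to be found
    for ch in subject:  # the subject index advances on every step
        if qi < len(query) and ch == query[qi]:
            qi += 1  # matched: advance the query index as well
    # Every query character was found, in order, iff qi reached the end.
    return qi == len(query)


print(is_in_order_match("da", "dura"))
print(is_in_order_match("ad", "dura"))
```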
If this is homework, I suggest you look at the explanation which precedes the code and then try for yourself, before looking at the actual code.
The code below looks for all occurrences of chars of the query string within the subject string (variables m; and related ii, jj). It then tests all possible orders of those occurrences (variable test). An order is "acceptable" if it contains all desired chars (cond1) in increasing positions (cond2). The result (variable result) is affirmative if there is at least one acceptable order.
subject = 'this is a test string';
query = 'ten';
m = bsxfun(@eq, subject.', query);
% m: test if each char of query equals each char of subject
[ii jj] = find(m);
jj = jj.'; % jj: which char of query is found within subject...
ii = ii.'; % ii: ... and at which position of subject
test = nchoosek(1:numel(jj),numel(query)).'; % test all possible orders
cond1 = all(jj(test) == repmat((1:numel(query)).',1,size(test,2)));
% cond1: for each order, are all chars of query found in subject?
cond2 = all(diff(ii(test))>0);
% cond2: for each order, are the found chars in increasing positions?
result = any(cond1 & cond2); % final result: 1 or 0
The code could be improved with a better way of building test, i.e. not testing all possible orders given by nchoosek.
Matlab allows you to view the source of built-in functions, so you could always try reading the code to see how the Matlab developers did it (although it will probably be very complex). (thanks Luis for the correction)
Finding a string in another string is a basic computer science problem. You can read up on it in any number of resources, such as Wikipedia.
Your requirement of non-rearranging partial matches recalls the bioinformatics problem of mapping splice variants to a genomic sequence.
You may solve your problem by using a sequence alignment algorithm such as Smith-Waterman, modified to work with all English characters and not just DNA bases.
Is this question actually from bioinformatics? If so, you should tag it as such.

Given a word, convert it into a palindrome with minimum addition of letters to it [closed]

Here is a pretty interesting interview question:
Given a word, append the fewest number of letters to it to convert it into a palindrome.
For example, if "hello" is the string given, the result should be "hellolleh." If "coco" is given, the result should be "cococ."
One approach I can think of is to append the reverse of the string to the end of the original string, then try to eliminate the extra characters from the end. However, I can't figure out how to do this efficiently. Does anyone have any ideas?
Okay! Here's my second attempt.
The idea is that we want to find how many of the characters at the end of the string can be reused when appending the extra characters to complete the palindrome. In order to do this, we will use a modification of the KMP string matching algorithm. Using KMP, we search the original string for its reverse. Once we get to the very end of the string, we will have as much a match as possible between the reverse of the string and the original string that occurs at the end of the string. For example:
HELLO
    O
1010
 010
3202
 202
1001
1001
At this point, KMP normally would say "no match" unless the original string was a palindrome. However, since we currently know how much of the reverse of the string was matched, we can instead just figure out how many characters are still missing and then tack them on to the end of the string. In the first case, we're missing LLEH. In the second case, we're missing 1. In the third, we're missing 3. In the final case, we're not missing anything, since the initial string is a palindrome.
The runtime of this algorithm is the runtime of a standard KMP search plus the time required to reverse the string: O(n) + O(n) = O(n).
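A possible Python sketch of this idea: build the KMP failure table for the reversed string, scan the original string with it, and use the match length left at the end to decide what to append. The function name is made up for illustration, and this is one reading of the approach described above rather than a verified reference implementation.

```python
def shortest_palindrome_by_appending(s: str) -> str:
    """Append the fewest characters to s to make it a palindrome (KMP-based)."""
    n = len(s)
    if n == 0:
        return s
    pattern = s[::-1]  # we search s for its own reverse

    # Standard KMP failure table: fail[i] = length of the longest proper
    # prefix of pattern[:i+1] that is also a suffix of it.
    fail = [0] * n
    k = 0
    for i in range(1, n):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k

    # Scan the text; matched can only reach n on the last character,
    # so pattern[matched] is always in range inside the loop.
    matched = 0
    for ch in s:
        while matched > 0 and ch != pattern[matched]:
            matched = fail[matched - 1]
        if ch == pattern[matched]:
            matched += 1

    # matched = length of the overlap; the rest of the reverse is still missing.
    return s + pattern[matched:]


print(shortest_palindrome_by_appending("hello"))
print(shortest_palindrome_by_appending("coco"))
```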
So now to argue correctness. This is going to require some effort. Consider the optimal answer:
| original string | | extra characters |
Let's suppose that we are reading this backward from the end, which means that we'll read at least the reverse of the original string. Part of this reversed string extends backwards into the body of the original string itself. In fact, to minimize the number of characters added, this has to be the largest possible number of characters that ends back into the string itself. We can see this here:
| original string | | extra characters |
| overlap |
Now, what happens in our KMP step? Well, when looking for the reverse of the string inside itself, KMP will keep as long of a match as possible at all times as it works across the string. This means that when the KMP hits the end of the string, the matched portion it maintains will be the longest possible match, since KMP only moves the starting point of the candidate match forward on a failure. Consequently, we have this longest possible overlap, so we'll get the shortest possible number of characters required at the end.
I'm not 100% sure that this works, but it seems like this works in every case I can throw at it. The correctness proof seems reasonable, but it's a bit hand-wavy because the formal KMP-based proof would probably be a bit tricky.
Hope this helps!
To answer I would take this naive approach:
When do we need 0 characters? When the string is a palindrome.
When do we need 1 character? When the string without its first character is a palindrome.
When do we need 2 characters? When the string without its first 2 characters is a palindrome.
And so on...
So an algorithm could be
for index from 0 to length - 1
    if string.substring(index) is palindrome
        return string + reverse(string.left(index))
    end
next
edit
I'm not much of a Python guy, but a simple-minded implementation of the above pseudocode could be
>>> def rev(s): return s[::-1]
...
>>> def pal(s): return s==rev(s)
...
>>> def mpal(s):
...     for i in range(0,len(s)):
...         if pal(s[i:]): return s+rev(s[:i])
...
>>> mpal("cdefedcba")
'cdefedcbabcdefedc'
>>> pal(mpal("cdefedcba"))
True
Simple linear time solution.
Let's call our string S.
Let f(X, P) be the length of the longest common prefix of X and P. Compute f(S[0], rev(S)), f(S[1], rev(S)), ... where S[k] is the suffix of S starting at position k. Obviously, you want to choose the minimum k such that k + f(S[k], rev(S)) = len(S). That means that you just have to append k characters at the end. If k is 0, the string is already a palindrome. If k = len(S), then you need to append the entire reverse.
We need to compute f(S[i], P) for all S[i] quickly. This is the tricky part. Create a suffix tree of S. Traverse the tree and update every node with the length of the longest common prefix with P. The values at the leaves correspond to f(S[i], P).
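As a sketch only: the values f(S[k], rev(S)) can also be obtained with the Z-function instead of a suffix tree, by computing it over rev(S) + separator + S. This swaps in a different technique from the one described above. The function names are made up, and the code assumes the separator character "\x00" does not occur in the input.

```python
def z_array(t: str) -> list:
    """z[i] = length of the longest common prefix of t and t[i:]."""
    n = len(t)
    z = [0] * n
    if n:
        z[0] = n
    left = right = 0  # [left, right) is the rightmost match window found so far
    for i in range(1, n):
        if i < right:
            z[i] = min(right - i, z[i - left])
        while i + z[i] < n and t[z[i]] == t[i + z[i]]:
            z[i] += 1
        if i + z[i] > right:
            left, right = i, i + z[i]
    return z


def min_append_palindrome(s: str) -> str:
    """Pick the smallest k with k + f(S[k], rev(S)) == len(S), then append k chars."""
    n = len(s)
    # Assumption: "\x00" never appears in s, so matches cannot cross the separator.
    z = z_array(s[::-1] + "\x00" + s)
    for k in range(n):
        # z[n + 1 + k] is f(S[k], rev(S)): the suffix of s starting at k
        # compared against the start of rev(s).
        if k + z[n + 1 + k] == n:
            return s + s[:k][::-1]
    return s  # only reached for the empty string


print(min_append_palindrome("coco"))
```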
First make a function to test string for palindrome-ness, keeping in mind that "a" and "aa" are palindromes. They are palindromes, right???
If the input is a palindrome, return it (0 chars needed to be added)
Loop from x[length] down to x[1], checking whether the substring x[i]..x[length] is a palindrome, to find the longest palindromic suffix.
Take the substring from the input string before the longest palindrome, reversing it and adding it to the end should make the shortest palindrome via appending.
coco => c+oco => c+oco+c
mmmeep => mmmee+p => mmmee+p+eemmm

How to find all cyclic shifted strings in a given input?

This is a coding exercise. Suppose I have to decide if one string is created by a cyclic shift of another. For example: cab is a cyclic shift of abc but cba is not.
Given two strings s1 and s2 we can do that as follows:
if (s1.length() != s2.length())
    return false
for (int i = 0; i < s1.length(); i++)
    if ((s1.substring(i) + s1.substring(0, i)).equals(s2))
        return true
return false
Now what if I have an array of strings and want to find all strings that are cyclic shift of one another? For example: ["abc", "xyz", "yzx", "cab", "xxx"] -> ["abc", "cab"], ["xyz", "yzx"], ["xxx"]
It looks like I have to check all pairs of the strings. Is there a "better" (more efficient) way to do that?
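As a start, you can check whether a string s1 is a rotation of a string s2 with a length check and a single call to contains(), like this:
public boolean isRotation(String s1, String s2){
    // without this check, any shorter substring of s2+s2 would also count as a rotation
    if (s1.length() != s2.length()) return false;
    String s2twice = s2+s2;
    return s2twice.contains(s1);
}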
Namely, if s1 is "rotation" and s2 is "otationr", the concat gives you "otationrotationr", which contains s1 indeed.
Now, even if we assume this is linear, or close to it (which is not impossible using Rabin-Karp, for instance), you are still left with O(n^2) pair comparisons, which may be too much.
What you could do is build a hash table where the sorted word is the key, and the posting list contains all the words from your list that, if sorted, give the key (i.e. key("bca") and key("cab") should both return "abc"):
private Map<String, List<String>> index = new HashMap<String, List<String>>();
/* ... */
public void buildIndex(String[] words){
    for(String word : words){
        // sortWord (not shown) returns the letters of word in sorted order
        String sortedWord = sortWord(word);
        if(!index.containsKey(sortedWord)){
            index.put(sortedWord, new ArrayList<String>());
        }
        index.get(sortedWord).add(word);
    }
}
CAVEAT: The hashtable will contain, for each key, all the words that have exactly the same letters occurring the same amount of times (not just the rotations, ie. "abba" and "baba" will have the same key but isRotation("abba", "baba") will return false).
But once you have built this index, you can significantly reduce the number of pairs you need to consider: if you want all the rotations for "bca" you just need to sort("bca"), look it up in the hashtable, and check (using the isRotation method above, if you want) if the words in the posting list are the result of a rotation or not.
If strings are short compared to the number of strings in the list, you can do significantly better by rotating all strings to some normal form (lexicographic minimum, for example). Then sort lexicographically and find runs of the same string. That's O(n log n), I think... neglecting string lengths. Something to try, maybe.
As for how to find the pairs in the table, there may be many better ways, but my first thought is to sort the table and apply the check to each adjacent pair.
This is much better and simpler than checking every string against every other string in the table.
Consider building an automaton for each string against which you wish to test.
Each automaton should have one entry point for each possible character in the string, and transitions for each character, plus an extra transition from the end to the start.
You could improve performance even further if you amalgamated the automata.
I think a combination of the answers by Patrick87 and savinos would make a fair amount of sense. Specifically, in a Java-esque pseudo-code:
List<String> inputs = ["abc", "xyz", "yzx", "cab", "xxx"];
Map<String,List<String>> uniques = new Map<String,List<String>>();
for(String value : inputs) {
    String normalized = normalize(value);
    if(!uniques.contains(normalized)) {
        uniques.put(normalized, new List<String>());
    }
    uniques.get(normalized).add(value);
}
// you now have a Map of normalized strings to every string in the input
// that is "equal to" that normalized version
Normalizing the string, as stated by Patrick87, might be best done by picking the rotation of the string that is lexicographically smallest.
It's worth noting, however, that the "best" algorithm probably relies heavily on the inputs... the number of strings, the length of those string, how many duplicates there are, etc.
You can rotate all the strings to a normalized form using Booth's algorithm (https://en.wikipedia.org/wiki/Lexicographically_minimal_string_rotation) in O(s) time, where s is the length of the string.
You can then use the normalized form as a key in a HashMap (where the value is the set of rotations seen in the input). You can populate this HashMap in a single pass over the data. i.e., for each string
calculate the normalized form
check if the HashMap contains the normalized form as a key - if not insert the empty Set at this key
add the string to the Set in the HashMap
You then just need to output the values of the HashMap. This makes the total runtime of the algorithm O(n * s) - where n is the number of words and s is the average word length. The total space usage is also O(n * s).
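The single-pass grouping could be sketched in Python like this. To keep the example short, normalization here simply takes the minimum over all rotations, which costs O(s^2) per string instead of the O(s) of Booth's algorithm. The function names are made up for illustration.

```python
def canonical_rotation(s: str) -> str:
    """Lexicographically smallest rotation of s (naive stand-in for Booth's algorithm)."""
    if not s:
        return s
    return min(s[i:] + s[:i] for i in range(len(s)))


def group_rotations(words):
    """Group words that are cyclic shifts of one another."""
    groups = {}
    for word in words:
        key = canonical_rotation(word)  # calculate the normalized form
        # Insert an empty list on first sight of this key, then add the word.
        groups.setdefault(key, []).append(word)
    # Dicts keep insertion order (Python 3.7+), so groups come out in first-seen order.
    return list(groups.values())


print(group_rotations(["abc", "xyz", "yzx", "cab", "xxx"]))
```

A list is used for each group rather than a set, so duplicate input words are kept. Switching to a set would match the description above more literally.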

Resources