Algorithm for finding string permutations where each position varies - string

Let's say I have a three character string "ABC". I want to generate all permutations of that string where a single letter can be replaced with his lower-case equivalent. For example, "aBC", "abC", "abc", "AbC", "Abc", etc. In other words, given a regexp like [Aa][Bb][Cc] generate every string that can be matched by it.

The problem can be trivially reduced to generating all binary sequences of length n. This has been previously addressed, for example in Fastest way to generate all binary strings of size n into a boolean array? and all permutations of a binary sequence x bits long.

Related

Unique Substrings in wrap around strings

I have been given an infinite wrap around of the string str="abcdefghijklmnopqrstuvwxyz" so it looks like
"..zabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcd...." and another string p.
I need to find out how many unique non-empty substrings of p are present in the infinite wraparound string str?
For example: "zab"
There are 6 substrings "z", "a", "b", "za", "ab", "zab" of string "zab" in str.
I tried finding all suffixes of p in a particular concatenation of the string str say for example: "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz"
and as soon as i get a suffix which is a part of the above i add all its substrings to my result, as:
for (int i=0;i<length;i++) {
String suffix = p.substring(i,length);
if(isPresent(suffix)) {
sum += (suffix.length()*(suffix.length()+1))/2;
break;
} else {
sum++;
}
}
And my isPresent function is:
private boolean isPresent(String s) {
if(s.length()==1) {
return true;
}
String main = "abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcde
fghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz";
return main.contains(s);
}
If the length of p is greater than my assumed concatenated string assumed in isPresent function, my algorithm fails!!
So how should i find the substrings irrespective of the the wrap around string str? Is there a better approach for this problem?
Some ideas/suggestions (not a full algo)
you don't need to consider an infinite repetition of the wrap around string but only len(p)/len(repeating-fragment) + 1 (integral division) repetitions. Let's denote this string with S **
if a substring sp of p is a substring of S, than any substrings of sp will be substrings of S
So the problem seems to reduce to:
find sp (substring of both p and S) with the maximal length. This is called longest common substring and admits a dynamic programming solution with the complexity of O(n*m) (lengths of the two strings). The cited has a pseudo-code algo.
repeat the above recursively with the 'remnants' of p after eliminating the longest common substring.
Now, you have a sequence of "longest common substrings". How many do you need to retain? I feel that the "longest common substring" may be used to trim down the need of brute-forcing every substring of any and all the above, but I'd need more time than I have available now.
I hope the sketch above helps.
** I might be wrong on the number of repetitions which need to be considered. If I am, then in any case there will be a maximal number of repetitions to be considered and there will be an S of minimal length that is sufficient for the purpose.

Find list of substrings for a given string that are in the dictionary

Input given is basically a dictionary (array of strings) and an InputString.
We want to find out all the possible substrings of that string that are in the dictionary.
Input:
Dictionary: ["hell", "hello", "heaven", "ample", "his", "some", "other", "words"]
String: "hello world, this is an example"
Output: ["hell", "hello", "his", "ample"] //all the substrings that are in dictionary.
Solution that I can think of is building a trie like structure out of dictionary, and then running following loop
for(i= 0 to inputString.length)
substring = inputString.substring(i,length)
lookupInTrie(substring)
lookupInTrie(string)
this function returns list of complete words from trie that match the prefix of string.
i.e, if you pass in string "hello world" to this function and dictionary has word "hell" and "hello" then our lookup will return ["hell","hello"];
So if we do not count dictionary->trie conversion. Finding all substrings of a given string that are in dictionary can be done in O(n^2) time.
I want to know if we can optimize this further and reduce the complexity down from n^2.
What you're describing looks like a perfect spot to use the Aho-Corasick string-matching algorithm, which is essentially an optimized version of the algorithm you're describing above. It works by building a trie from the pattern strings and then running the original string through it, but does so in a way that doesn't require a huge amount of backtracking. The total time complexity is O(m + n + z), where m is the length of the string to search, n is the total length of the pattern strings, and z is the number of matches.
You could also use a suffix tree here. Building a suffix tree for the sentence and then searching for each pattern in it takes time O(m + n + z), where m, n, and z are defined as above, though coding this up from scratch would be pretty tough.

Partial String matching in two large strings

I am looking for an efficient algorithm to find out all partial matches in 2 large strings. For example,
string 1: "Thisismyfirststring"
string 2: "searchismyfirtestring"
This should return "his", "hisismyfir", "string", etc.
Is this possible?
Regards..
Construct a boolean matrix M where M(i,j) tells you if the i'th character of one string matches the j'th character of the other. A matching substring will now be a diagonal line of true in M, so now walk the matrix and look for those.

How to detect palindrome cycle length in a string?

Suppose a string is like this "abaabaabaabaaba", the palindrome cycle here is 3, because you can find the string aba at every 3rd position and you can augment the palindrome by concatenating any number of "aba"s to the string.
I think it's possible to detect this efficiently using Manacher's Algorithm but how?
You can find it easily by searching the string S in S+S. The first index you find is the cycle number you want (may be the entire string). In python it would be something like:
In [1]: s = "abaabaabaabaaba"
In [2]: print (s+s).index(s, 1)
3
The 1 is there to ignore the index 0, that would be a trivial match.

algorithms for fast string approximate matching

Given a source string s and n equal length strings, I need to find a quick algorithm to return those strings that have at most k characters that are different from the source string s at each corresponding position.
What is a fast algorithm to do so?
PS: I have to claim that this is a academic question. I want to find the most efficient algorithm if possible.
Also I missed one very important piece of information. The n equal length strings form a dictionary, against which many source strings s will be queried upon. There seems to be some sort of preprocessing step to make it more efficient.
My gut instinct is just to iterate over each String n, maintaining a counter of how many characters are different than s, but I'm not claiming it is the most efficient solution. However it would be O(n) so unless this is a known performance problem, or an academic question, I'd go with that.
Sedgewick in his book "Algorithms" writes that Ternary Search Tree allows "to locate all words within a given Hamming distance of a query word". Article in Dr. Dobb's
Given that the strings are fixed length, you can compute the Hamming distance between two strings to determine the similarity; this is O(n) on the length of the string. So, worst case is that your algorithm is O(nm) for comparing your string against m words.
As an alternative, a fast solution that's also a memory hog is to preprocess your dictionary into a map; keys are a tuple (p, c) where p is the position in the string and c is the character in the string at that position, values are the strings that have characters at that position (so "the" will be in the map at {(0, 't'), "the"}, {(1, 'h'), "the"}, {(2, 'e'), "the"}). To query the map, iterate through query string's characters and construct a result map with the retrieved strings; keys are strings, values are the number of times the strings have been retrieved from the primary map (so with the query string "the", the key "thx" will have a value of 2, and the key "tee" will have a value of 1). Finally, iterate through the result map and discard strings whose values are less than K.
You can save memory by discarding keys that can't possibly equal K when the result map has been completed. For example, if K is 5 and N is 8, then when you've reached the 4th-8th characters of the query string you can discard any retrieved strings that aren't already in the result map since they can't possibly have 5 matching characters. Or, when you've finished with the 6th character of the query string, you can iterate through the result map and remove all keys whose values are less than 3.
If need be you can offload the primary precomputed map to a NoSql key-value database or something along those lines in order to save on main memory (and also so that you don't have to precompute the dictionary every time the program restarts).
Rather than storing a tuple (p, c) as the key in the primary map, you can instead concatenate the position and character into a string (so (5, 't') becomes "5t", and (12, 'x') becomes "12x").
Without knowing where in each input string the match characters will be, for a particular string, you might need to check every character no matter what order you check them in. Therefore it makes sense to just iterate over each string character-by-character and keep a sum of the total number of mismatches. If i is the number of mismatches so far, return false when i == k and true when there are fewer than k-i unchecked characters remaining in the string.
Note that depending on how long the strings are and how many mismatches you'll allow, it might be faster to iterate over the whole string rather than performing these checks, or perhaps to perform them only after every couple characters. Play around with it to see how you get the fastest performance.
My method if we're thinking out loud :P I can't see a way to do this without going through each n string, but I'm happy to be corrected. On that it would begin with a pre-process to save a second set of your n strings so that the characters are in ascending order.
The first part of the comparison would then be to check each n string a character at a time say n' to each character in s say s'.
If s' is less than n' then not equal and move to the next s'. If n' is less than s' then go to next n'. Otherwise record a matching character. Repeat this until k miss matches are found or the alternate matches are found and mark n accordingly.
For further consideration, an added pre-processing could be done on each adjacent string in n to see the total number of characters that differ. This could then be used when comparing strings n to s and if sufficient difference exist between these and the adjacent n there may not be a need to compare it?

Resources