Algorithm to form a given pattern using some strings

Given are 6 strings of any length. The words are to be arranged in the pattern shown below. They can be arranged either vertically or horizontally.
--------
|      |
|      |
|      |
---------------
       |      |
       |      |
       |      |
       --------
The pattern need not be symmetric, and there need to be two empty areas as shown.
For example:
Given strings
PQF
DCC
ACTF
CKTYCA
PGYVQP
DWTP
The pattern can be
DCC...
W.K...
T.T...
PGYVQP
..C..Q
..ACTF
where dots represent empty areas.
The other example is
RVE
LAPAHFUIK
BIRRE
KZGLPFQR
LLHU
UUZZSQHILWB
Pattern is
LLHU....
A..U....
P..Z....
A..Z....
H..S....
F..Q....
U..H....
I..I....
KZGLPFQR
...W...V
...BIRRE
If multiple patterns are possible, then the pattern with the lexicographically smallest first line, then smallest second line, and so on, is to be formed. What algorithm can be used to solve this?

Find strings which satisfy these constraints:
strlen(a) + strlen(b) - 1 = strlen(c)
strlen(d) + strlen(e) - 1 = strlen(f)
After that, try every possible arrangement and check whether it is valid. For example:
aaa.....
d.f.....
d.f.....
d.f.....
cccccccc
..f....e
..f....e
..bbbbbb
There will be 2*2*2 = 8 different situations.
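A minimal sketch of that length filter, using the same role letters as the schematic above (a = top, b = bottom, c = middle horizontal, d = left, e = right, f = middle vertical); the class and method names are my own, and an assignment that fails this check can be discarded before any grid placement is attempted:
class LengthFilter {
    static boolean lengthsFit(String a, String b, String c,
                              String d, String e, String f) {
        // the two equations above: a + b - 1 = c and d + e - 1 = f
        return a.length() + b.length() - 1 == c.length()
            && d.length() + e.length() - 1 == f.length();
    }
}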

There are a number of heuristics that you can apply, but before that, let's go over some properties of the puzzle.
+aa+
c  f
+ee+eee+
   f   d
   +bbb+
Let us denote the length of each string by the letter used for it in the diagram above. We have:
a + b - 1 = e
c + d - 1 = f
I will refer to the 2 strings for the cross in the middle as middle strings.
We also know that the length of a string cannot be less than 2. Therefore, we can infer:
e > a, e > b
f > c, f > d
From this, we know that the 2 shortest strings cannot be middle strings, due to the inequality above.
The 3 largest strings cannot all be equal either, since after choosing any of the 3 as a middle string, we are left with 2 equal largest strings, which is impossible according to the inequality above.
The puzzle is only tricky when the lengths are regular. When the lengths are irregular, you can do direct mapping from length to position.
If the 2 largest strings are equal then, due to the inequality above, they are the 2 middle strings. The worst case for this one is a "regular" puzzle, where the lengths a, b, c, d are all equal.
If the 2 largest strings are unequal, the largest string's position can be determined immediately (since its length is unique in the puzzle) - as one of the middle strings. In the worst case, there can be 3 candidates for the other middle string - just brute force and check all of them.
Algorithm:
Try to map strings of unique length to their positions.
Brute force the 2 strings in the middle (taking into consideration what I mentioned above), and brute force to fill in the rest.
Even with stupid brute force, there are only 6! = 720 cases, if a string can only go from left to right or top to bottom (no reverse). There will be 46080 cases (* 2^6) if a string is allowed to run in either direction.
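Here is a hedged sketch in Java of that 6! brute force (without the unique-length heuristics): try every assignment of the six words to the six positions of the figure, keep the assignments whose lengths and crossing characters are consistent, and report the lexicographically smallest grid. All class and method names are my own.
import java.util.Arrays;

class PatternBruteForce {
    static String best = null;

    public static void main(String[] args) {
        // the six words from the question's first example
        String[] words = {"PQF", "DCC", "ACTF", "CKTYCA", "PGYVQP", "DWTP"};
        permute(words, 0);
        if (best != null) System.out.print(best); // should print the grid shown in the question
    }

    // enumerate all 6! = 720 assignments of words to roles
    static void permute(String[] w, int k) {
        if (k == w.length) {
            // roles: w[0]=top, w[1]=left, w[2]=middle vertical,
            //        w[3]=middle horizontal, w[4]=right, w[5]=bottom
            String grid = tryBuild(w[0], w[1], w[2], w[3], w[4], w[5]);
            if (grid != null && (best == null || grid.compareTo(best) < 0)) best = grid;
            return;
        }
        for (int i = k; i < w.length; i++) {
            swap(w, k, i);
            permute(w, k + 1);
            swap(w, k, i);
        }
    }

    static String tryBuild(String top, String left, String midV,
                           String midH, String right, String bottom) {
        // every side must have length at least 2, as argued above
        for (String w : new String[]{top, left, midV, midH, right, bottom})
            if (w.length() < 2) return null;

        int rows = midV.length(), cols = midH.length();
        // the two length constraints
        if (left.length() + right.length() - 1 != rows) return null;
        if (top.length() + bottom.length() - 1 != cols) return null;

        char[][] g = new char[rows][cols];
        for (char[] row : g) Arrays.fill(row, '.');

        int crossRow = left.length() - 1;  // row of the middle horizontal word
        int crossCol = top.length() - 1;   // column of the middle vertical word

        boolean ok = place(g, top, 0, 0, 0, 1)              // horizontal words
                && place(g, midH, crossRow, 0, 0, 1)
                && place(g, bottom, rows - 1, crossCol, 0, 1)
                && place(g, left, 0, 0, 1, 0)               // vertical words
                && place(g, midV, 0, crossCol, 1, 0)
                && place(g, right, crossRow, cols - 1, 1, 0);
        if (!ok) return null;

        StringBuilder sb = new StringBuilder();
        for (char[] row : g) sb.append(row).append('\n');
        return sb.toString();
    }

    // lay a word onto the grid; fail on a conflicting crossing character
    static boolean place(char[][] g, String w, int r, int c, int dr, int dc) {
        for (int i = 0; i < w.length(); i++) {
            char cur = g[r + i * dr][c + i * dc];
            if (cur != '.' && cur != w.charAt(i)) return false;
            g[r + i * dr][c + i * dc] = w.charAt(i);
        }
        return true;
    }

    static void swap(String[] w, int i, int j) { String t = w[i]; w[i] = w[j]; w[j] = t; }
}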

Related

most efficient way to sort strings with only 2 distinct characters?

If I have strings that I know have no more than 2 distinct characters,
example set:
aab
abbbbabb
bbbaa
aaaaaaa
aaaa
abab
a
aa
aaaaa
aaabba
aabbbab
What's the most efficient way to put them into alphabetical order?
the resulting sorted set:
a
aa
aaaa
aaaaa
aaaaaaa
aaabba
aab
aabbbab
abab
abbbbabb
bbbaa
edit:
I know I could just use a normal sorting algorithm (quick sort, merge sort), but the question is: Does the fact that there are not more than 2 distinct characters make something else more efficient?
If the maximum length of the string matters, I would like to know the answer for 2 different scenarios:
maximum length of the string is the same as the number of strings (n strings being sorted, n maximum length of the string)
maximum length of the string is log n, with n as the number of strings being sorted
I can also assume that all of the strings are distinct.
The String compareTo or compareToIgnoreCase method will return a negative integer, 0, or a positive integer depending on the alphabetical ordering of the two Strings being compared. Try that.
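A minimal illustration of this suggestion (String implements Comparable, so the built-in sort uses compareTo); this is just the baseline comparison sort the question already mentions:
import java.util.Arrays;

class BuiltInSort {
    public static void main(String[] args) {
        String[] words = {"aab", "abbbbabb", "bbbaa", "aaaaaaa", "aaaa", "abab",
                          "a", "aa", "aaaaa", "aaabba", "aabbbab"};
        Arrays.sort(words);                       // natural order via String.compareTo
        for (String w : words) System.out.println(w);
    }
}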
A general sorting algorithm based only on comparisons cannot asymptotically achieve results better than O(n log n). In your case there is additional information (2 distinct chars) which has the potential to improve this result. A simple approach that will yield an O(n) result:
Check the first character (let's mark it x).
Scan the string till the end:
whenever x is encountered, increase a counter;
when the non-x character (let's mark it y) is encountered for the first time, store it in a dedicated variable.
Compare x and y:
if x < y, fill the string from the beginning with x's according to the counter and the rest with y's;
if x > y, fill the first (string length - number of x's) slots with y's and the rest with x's.
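A small sketch of the counting idea described above, for a single string known to contain at most 2 distinct characters: one pass counts the first character and records the other one, then the characters are rewritten in order without any comparison sort. The class and method names are my own.
import java.util.Arrays;

class TwoCharReorder {
    static String reorder(String s) {
        if (s.isEmpty()) return s;
        char x = s.charAt(0);
        int countX = 0;
        Character y = null;                       // the non-x character, once seen
        for (char ch : s.toCharArray()) {
            if (ch == x) countX++;
            else if (y == null) y = ch;
        }
        char[] out = new char[s.length()];
        if (y == null || x < y) {                 // x comes first
            Arrays.fill(out, 0, countX, x);
            if (y != null) Arrays.fill(out, countX, out.length, y);
        } else {                                  // y comes first
            Arrays.fill(out, 0, out.length - countX, y);
            Arrays.fill(out, out.length - countX, out.length, x);
        }
        return new String(out);
    }
}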

What is the fastest way to sort n strings of length n each?

I have n strings, each of length n. I wish to sort them in ascending order.
The best algorithm I can think of is O(n^2 log n), using quicksort (comparing two strings takes O(n) time). The challenge is to do it in O(n^2) time. How can I do it?
Also, radix sort methods are not permitted, as you do not know the number of letters in the alphabet beforehand.
Assume every letter is from a to z.
Since there is no requirement for in-place sorting, create an array of linked lists of length 26:
List[] sorted = new List[26]; // here each element is a list, where you can append
For a letter x in that string, its bucket index is the ASCII difference x - 'a'.
For example, the index for 'c' is 2, so it is appended with
sorted[2].add('c')
That way, sorting one string takes only n.
So sorting all strings takes n^2.
For example, if you have "zdcbacdca".
z goes to sorted['z'-'a'].add('z'),
d goes to sorted['d'-'a'].add('d'),
....
After sort, one possible result looks like
0  1  2  3  ...  25
a  b  c  d  ...  z
a     c  d
      c
Note: the assumed letter alphabet determines the length of the sorted array.
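A runnable version of the bucket idea in this answer; note that, as described, it sorts the characters of a single string (in O(n) per string). Class and method names are my own.
import java.util.ArrayList;
import java.util.List;

class CharBucketSort {
    static String sortChars(String s) {
        List<List<Character>> buckets = new ArrayList<>();
        for (int i = 0; i < 26; i++) buckets.add(new ArrayList<>());
        for (char ch : s.toCharArray()) buckets.get(ch - 'a').add(ch); // bucket index is ch - 'a'
        StringBuilder out = new StringBuilder();
        for (List<Character> bucket : buckets)
            for (char ch : bucket) out.append(ch);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(sortChars("zdcbacdca")); // prints aabcccddz
    }
}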
For small numbers of strings a regular comparison sort will probably be faster than a radix sort here, since radix sort takes time proportional to the number of bits required to store each character. For a 2-byte Unicode encoding, and making some (admittedly dubious) assumptions about equal constant factors, radix sort will only be faster if log2(n) > 16, i.e. when sorting more than about 65,000 strings.
One thing I haven't seen mentioned yet is the fact that a comparison sort of strings can be enhanced by exploiting known common prefixes.
Suppose our strings are S[0], S[1], ..., S[n-1]. Let's consider augmenting mergesort with a Longest Common Prefix (LCP) table. First, instead of moving entire strings around in memory, we will just manipulate lists of indices into a fixed table of strings.
Whenever we merge two sorted lists of string indices X[0], ..., X[k-1] and Y[0], ..., Y[k-1] to produce Z[0], ..., Z[2k-1], we will also be given 2 LCP tables (LCPX[0], ..., LCPX[k-1] for X and LCPY[0], ..., LCPY[k-1] for Y), and we need to produce LCPZ[0], ..., LCPZ[2k-1] too. LCPX[i] gives the length of the longest prefix of X[i] that is also a prefix of X[i-1], and similarly for LCPY and LCPZ.
The first comparison, between S[X[0]] and S[Y[0]], cannot use LCP information and we need a full O(n) character comparisons to determine the outcome. But after that, things speed up.
During this first comparison, between S[X[0]] and S[Y[0]], we can also compute the length of their LCP -- call that L. Set Z[0] to whichever of S[X[0]] and S[Y[0]] compared smaller, and set LCPZ[0] = 0. We will maintain in L the length of the LCP of the most recent comparison. We will also record in M the length of the LCP that the last "comparison loser" shares with the next string from its block: that is, if the most recent comparison, between two strings S[X[i]] and S[Y[j]], determined that S[X[i]] was smaller, then M = LCPX[i+1], otherwise M = LCPY[j+1].
The basic idea is: After the first string comparison in any merge step, every remaining string comparison between S[X[i]] and S[Y[j]] can start at the minimum of L and M, instead of at 0. That's because we know that S[X[i]] and S[Y[j]] must agree on at least this many characters at the start, so we don't need to bother comparing them. As larger and larger blocks of sorted strings are formed, adjacent strings in a block will tend to begin with longer common prefixes, and so these LCP values will become larger, eliminating more and more pointless character comparisons.
After each comparison between S[X[i]] and S[Y[j]], the string index of the "loser" is appended to Z as usual. Calculating the corresponding LCPZ value is easy: if the last 2 losers both came from X, take LCPX[i]; if they both came from Y, take LCPY[j]; and if they came from different blocks, take the previous value of L.
In fact, we can do even better. Suppose the last comparison found that S[X[i]] < S[Y[j]], so that X[i] was the string index most recently appended to Z. If M ( = LCPX[i+1]) > L, then we already know that S[X[i+1]] < S[Y[j]] without even doing any comparisons! That's because to get to our current state, we know that S[X[i]] and S[Y[j]] must have first differed at character position L, and it must have been that the character x in this position in S[X[i]] was less than the character y in this position in S[Y[j]], since we concluded that S[X[i]] < S[Y[j]] -- so if S[X[i+1]] shares at least the first L+1 characters with S[X[i]], it must also contain x at position L, and so it must also compare less than S[Y[j]]. (And of course the situation is symmetrical: if the last comparison found that S[Y[j]] < S[X[i]], just swap the names around.)
I don't know whether this will improve the complexity from O(n^2 log n) to something better, but it ought to help.
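A small sketch of the core primitive this answer relies on: compare two strings that are already known to agree on their first known characters, start the scan there, and also report the length of their longest common prefix. The names are my own; wiring this into the merge with the LCPX/LCPY/LCPZ bookkeeping described above is left out.
class LcpCompare {
    static final class Result {
        final int cmp;   // negative, zero, or positive, like compareTo
        final int lcp;   // length of the longest common prefix
        Result(int cmp, int lcp) { this.cmp = cmp; this.lcp = lcp; }
    }

    // compare a and b, skipping the first 'known' characters they are known to share
    static Result compareFrom(String a, String b, int known) {
        int i = known;
        int min = Math.min(a.length(), b.length());
        while (i < min && a.charAt(i) == b.charAt(i)) i++;
        if (i == min) return new Result(a.length() - b.length(), i); // one is a prefix of the other
        return new Result(Character.compare(a.charAt(i), b.charAt(i)), i);
    }
}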
You can build a Trie, which will cost O(s*n).
Details:
https://stackoverflow.com/a/13109908
Solving it for all cases should not be possible in better than O(N^2 log N).
However, if there are constraints that relax the string comparison, it can be optimised.
- If the strings have a high repetition rate and are from a finite ordered set, you can use ideas from counting sort and use a map to store their counts; afterwards, sorting just the map keys should suffice: O(NM log M), where M is the number of unique strings. You can even directly use a TreeMap for this purpose (see the sketch after this list).
- If the strings are not random but are suffixes of some superstring, this can well be done in O(N log^2 N): http://discuss.codechef.com/questions/21385/a-tutorial-on-suffix-arrays
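A short sketch of the TreeMap idea in the first bullet above: count duplicates so only the M distinct strings take part in the ordering, then emit them in key order. Names are my own.
import java.util.Map;
import java.util.TreeMap;

class CountingStringSort {
    static String[] sortWithCounts(String[] input) {
        Map<String, Integer> counts = new TreeMap<>();          // keys are kept in sorted order
        for (String s : input) counts.merge(s, 1, Integer::sum);
        String[] out = new String[input.length];
        int pos = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet())
            for (int i = 0; i < e.getValue(); i++) out[pos++] = e.getKey();
        return out;
    }
}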

Deterministic automata to find number of subsequence in string of another string

Deterministic automata to find the number of subsequences in a string?
How can I construct a DFA to find the number of occurrences of a string as a subsequence in another string?
E.g. in "ssstttrrriiinnngggg" we have 3 subsequences which form the string "string".
Also, both the string to be found and the string to be searched contain characters only from a specific character set.
I have some idea about storing characters in a stack and popping them accordingly until we match; if they don't match, push again.
Please explain the DFA solution.
OVERLAPPING MATCHES
If you wish to count the number of overlapping sequences then you simply construct a DFA that matches the string, e.g.
1 -(if see s)-> 2 -(if see t)-> 3 -(if see r)-> 4 -(if see i)-> 5 -(if see n)-> 6 -(if see g)-> 7
and then compute the number of ways of being in each state after seeing each character using dynamic programming. See the answers to this question for more details.
DP[a][b] = number of ways of being in state b after seeing the first a characters
= DP[a-1][b] + DP[a-1][b-1] if character at position a is the one needed to take state b-1 to b
= DP[a-1][b] otherwise
Start with DP[0][b]=0 for b>1 and DP[0][1]=1.
Then the total number of overlapping strings is DP[len(string)][7]
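A direct translation of the DP above, with DP[a][b] rolled into a one-dimensional array (ways[b] = number of ways of being in state b after the characters seen so far); states 1..7 in the prose correspond to indices 0..6 here, and the names are my own.
class SubsequenceCount {
    static long countOccurrences(String text, String pattern) {
        long[] ways = new long[pattern.length() + 1];
        ways[0] = 1;                                   // the empty prefix is matched one way
        for (char ch : text.toCharArray())
            // go from high b to low b so ways[b-1] still holds the value
            // from before this character (i.e. DP[a-1][b-1])
            for (int b = pattern.length(); b >= 1; b--)
                if (pattern.charAt(b - 1) == ch)
                    ways[b] += ways[b - 1];
        return ways[pattern.length()];
    }

    public static void main(String[] args) {
        // 3 * 3 * 3 * 3 * 3 * 4 = 972 overlapping matches
        System.out.println(countOccurrences("ssstttrrriiinnngggg", "string"));
    }
}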
NON-OVERLAPPING MATCHES
If you are counting the number of non-overlapping sequences, then if we assume that the characters in the pattern to be matched are distinct, we can use a slight modification:
DP[a][b] = number of strings being in state b after seeing the first a characters
= DP[a-1][b] + 1 if character at position a is the one needed to take state b-1 to b and DP[a-1][b-1]>0
= DP[a-1][b] - 1 if character at position a is the one needed to take state b to b+1 and DP[a-1][b]>0
= DP[a-1][b] otherwise
Start with DP[0][b]=0 for b>1 and DP[0][1]=infinity.
Then the total number of non-overlapping strings is DP[len(string)][7]
This approach will not necessarily give the correct answer if the pattern to be matched contains repeated characters (e.g. 'strings').

Understanding the Knuth Morris Pratt(KMP) Failure Function

I've been reading the Wikipedia article about the Knuth-Morris-Pratt algorithm and I'm confused about how the values are found in the jump/partial match table.
i | 0 1 2 3 4 5 6
W[i] | A B C D A B D
T[i] | -1 0 0 0 0 1 2
Can someone more clearly explain the shortcut rule? The sentence
"let us say that we discovered a proper suffix which is a proper prefix and ending at W[2] with length 2 (the maximum possible)"
is confusing. If the proper suffix ends at W[2], wouldn't it be of size 3?
Also, I'm wondering why T[4] isn't 1 when there is a prefix and suffix of size 1: the A.
Thanks for any help that can be offered.
Notice that the failure function T[i] does not use i as an index, but rather as a length. Therefore, T[2] represents the length of the longest proper border (a string that is both a prefix and suffix) of the string formed from the first two characters of W, rather than the longest proper border formed by the string ending at character 2. This is why the maximum possible value of T[2] is 2 rather than 3 - the substring formed from the first two characters of W can't have length any greater than 2.
Using this interpretation, it's also easier to see why T[4] is 0 rather than 1. The substring of W formed from the first four characters of W is ABCD, which has no proper prefix that is also a proper suffix.
Hope this helps!
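To make the "length, not index" interpretation concrete, here is a small sketch that builds the table in the convention used above (T[0] = -1, and T[i] is the length of the longest proper border of the first i characters of W); class and method names are my own.
class PartialMatchTable {
    static int[] build(String w) {
        int n = w.length();
        if (n == 0) return new int[0];
        int[] border = new int[n];          // border[i] = longest proper border of w[0..i]
        for (int i = 1; i < n; i++) {
            int k = border[i - 1];
            while (k > 0 && w.charAt(i) != w.charAt(k)) k = border[k - 1];
            if (w.charAt(i) == w.charAt(k)) k++;
            border[i] = k;
        }
        int[] t = new int[n];
        t[0] = -1;                          // convention used in the table above
        for (int i = 1; i < n; i++) t[i] = border[i - 1];
        return t;
    }

    public static void main(String[] args) {
        // prints [-1, 0, 0, 0, 0, 1, 2], matching the table in the question
        System.out.println(java.util.Arrays.toString(build("ABCDABD")));
    }
}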
"let us say that we discovered a proper suffix which is a proper prefix and ending at W[2] with length 2 (the maximum possible)"
Okay, the length can be at most 2; here is why...
One fact: a "proper" prefix can't be the whole string; the same goes for a "proper" suffix (like a proper subset).
Let W[0]=A, W[1]=A, W[2]=A, i.e. the pattern is "AAA". Then the (maximum length) proper prefix can be "AA" (left to right) and the (maximum length) proper suffix can be "AA" (right to left).
// yes, the prefix and suffix overlap (the middle "A")
So the value would be 2 rather than 3; it could have been 3 only if the prefix were not required to be proper.

Given string s, find the shortest string t, such that, t^m=s

Given string s, find the shortest string t, such that, t^m=s.
Examples:
s="aabbb" => t="aabbb"
s="abab" => t = "ab"
How fast can it be done?
Of course, naively, for every m that divides |s|, I can check whether substring(s, 0, |s|/m)^m = s.
One can figure out the solution in O(d(|s|) n) time, where d(x) is the number of divisors of x. Can it be done more efficiently?
This is the problem of computing the period of a string. Knuth, Morris and Pratt's sequential string matching algorithm is a good place to get started. This is in a paper entitled "Fast Pattern Matching in Strings" from 1977.
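As a sketch of that suggestion: the KMP failure (border) function gives the answer directly, since for a string of length n whose longest proper border has length border[n-1], the candidate period is n - border[n-1]; if it divides n, that prefix is the shortest t, otherwise t = s. Names are my own.
class ShortestPeriod {
    static String shortestRepeatingUnit(String s) {
        int n = s.length();
        if (n == 0) return s;
        int[] border = new int[n];          // border[i] = longest proper border of s[0..i]
        for (int i = 1; i < n; i++) {
            int k = border[i - 1];
            while (k > 0 && s.charAt(i) != s.charAt(k)) k = border[k - 1];
            if (s.charAt(i) == s.charAt(k)) k++;
            border[i] = k;
        }
        int p = n - border[n - 1];          // candidate period length
        return (n % p == 0) ? s.substring(0, p) : s;
    }

    public static void main(String[] args) {
        System.out.println(shortestRepeatingUnit("abab"));  // ab
        System.out.println(shortestRepeatingUnit("aabbb")); // aabbb
    }
}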
If you want to get fancy with it, then check out the paper "Finding All Periods and Initial Palindromes of a String in Parallel" by Breslauer and Galil in 1991. From their abstract:
An optimal O(log log n) time CRCW-PRAM algorithm for computing all
periods of a string is presented. Previous parallel algorithms compute
the period only if it is shorter than half of the length of the
string. This algorithm can be used to find all initial palindromes of
a string in the same time and processor bounds. Both algorithms are
the fastest possible over a general alphabet. We derive a lower bound
for finding palindromes by a modification of a previously known lower
bound for finding the period of a string [3]. When p processors are
available the bounds become Θ(⌈n/p⌉ + log log_⌈1+p/n⌉ 2p).
I really like this thing called the z-algorithm: http://www.utdallas.edu/~besp/demo/John2010/z-algorithm.htm
For every position it calculates the length of the longest substring starting there that is also a prefix of the whole string (in linear time, of course).
a a b c a a b x a a a z
1 0 0 3 1 0 0 2 2 1 0
Given this "z-table" it is easy to find all strings that can be exponentiated to the whole thing. Just check for all positions if pos+z[pos] = n.
In our case:
a b a b
0 2 0
Here pos = 2 gives you 2+z[2] = 4 = n hence the shortest string you can use is the one of length 2.
This will even allow you to find cases where only a prefix of the exponentiated string matches, like:
a b c a
0 0 1
Here (abc)^2 can be cut down to your original string. But of course, if you don't want matches like this, just go over the divisors only.
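A sketch of the z-table check described above: compute z[] (z[i] = length of the longest substring starting at i that is also a prefix of s), then the first divisor position pos with pos + z[pos] = n gives the shortest t. Names are my own, and the divisor restriction follows the last remark.
class ZPeriod {
    // z[i] = length of the longest substring starting at i that is also a prefix of s
    static int[] zArray(String s) {
        int n = s.length();
        int[] z = new int[n];
        int l = 0, r = 0;                                   // rightmost z-box [l, r)
        for (int i = 1; i < n; i++) {
            if (i < r) z[i] = Math.min(r - i, z[i - l]);
            while (i + z[i] < n && s.charAt(z[i]) == s.charAt(i + z[i])) z[i]++;
            if (i + z[i] > r) { l = i; r = i + z[i]; }
        }
        return z;
    }

    static String shortestRepeatingUnit(String s) {
        int n = s.length();
        int[] z = zArray(s);
        for (int pos = 1; pos < n; pos++)
            if (n % pos == 0 && pos + z[pos] == n)          // only divisor positions, as suggested
                return s.substring(0, pos);
        return s;
    }

    public static void main(String[] args) {
        System.out.println(shortestRepeatingUnit("abab"));  // ab
    }
}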
Yes you can do it in O(|s|) time.
You can search for a "target" string of length n in a "source" string of length m in O(n+m) time. Build a solution based on that.
Let both the source and the target be s. An additional constraint is that position 1 and any positions in the source that do not divide |s| are not valid starting positions for the match. Of course the search per se will always fail. But if there is a partial match and you have reached the end of the source string, then you have a solution to the original problem.
A modification to Boyer-Moore could possibly handle this in O(n), where n is the length of s:
http://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm
