Given two finite sequences of strings, A and B, each of length n,
for example:
A1: "kk", A2: "ka", A3: "kkk", A4: "a"
B1: "ka", B2: "kakk", B3: "ak", B4: "k"
Give a finite sequence of indices such that the concatenations of the corresponding strings from A and from B give the same string. Repetitions are allowed.
In this example I can't find a solution, but to illustrate: if the list (1,2,2,4) were a solution, then A1 + A2 + A2 + A4 would equal B1 + B2 + B2 + B4. This example uses only two characters, yet it's already very difficult. Actually, it's not even trivial to find the shortest solution over a one-character alphabet!
I tried to think of necessary conditions: for example, the total lengths of the chosen strings must be equal, and the first and last pairs of strings need matching characters at the start and end respectively. But nothing else. I suppose for some sets of strings it's simply impossible. Can anyone think of a good algorithm?
EDIT: Apparently, this is the Post Correspondence Problem
There is no algorithm that can decide whether such an instance has a solution or not. If there were, the halting problem could be solved. Dirty trick...
Very tough question, but I'll give it a shot. This is more of a stream of consciousness than an answer, apologies in advance.
If I understand this correctly, you're given 2 equal-sized sequences of strings, A and B, indexed from 1..n, say. You then have to find a sequence of indices i1..im such that the concatenation A(i1)..A(im) equals the concatenation B(i1)..B(im), where m is the length of the index sequence.
The first thing I would observe is that there could be an infinite number of solutions. For example, given:
A { "x", "xx" }
B { "xx", "x" }
Possible solutions are:
{ 1, 2 }
{ 2, 1 }
{ 1, 2, 1, 2 }
{ 1, 2, 2, 1 }
{ 2, 1, 1, 2 }
{ 2, 1, 2, 1 }
{ 1, 2, 1, 2, 1, 2}
...
So how would you know when to stop? As soon as you had one solution? As soon as one of the solutions is a superset of another solution?
One place you could start would be by taking all the strings of minimum common length from both sets (in my example above, you would take the "x" from both), and searching for two equal strings that share a common index. You could then repeat this for strings of the next size up. For example, if the first set has three strings of length 1, 2 and 3 respectively, and the second set has strings of length 1, 3 and 3 respectively, you would next take the strings of length 3. You would do this until you have no more strings. If you find any match, then you have a solution to the problem.
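In code, that first pass is just a scan for positions where the two sequences already agree (a small Python sketch of my own, assuming A and B are lists of strings):

# one-tile solutions: any index i with A[i] == B[i] works on its own,
# and can also be repeated freely
singles = [i for i, (a, b) in enumerate(zip(A, B)) if a == b]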
It then gets harder when you have to start combining several strings as in my example above. The naive, brute force approach would be to start permuting all strings from both sets that, when concatenated, result in strings of the same length, then compare them. So in the below example:
A { "ga", "bag", "ac", "a" }
B { "ba", "g", "ag", "gac" }
You would start with sequences of length 2:
A { "ga", "ac" }, B { "ba", "ag" } (indices 1, 3)
A { "bag", "a" }, B { "g", "gac" } (indices 2, 4)
Comparing these gives "gaac" vs "baag" and "baga" vs "ggac", neither of which are equal, so there are no solutions there. Next, we would go for sequences of length 3:
A { "ga", "bag", "a" }, B { "ba", "g", "gac" } (indices 1, 2, 4)
A { "bag", "ac", "a" }, B { "g", "ag", "gac" } (indices 2, 3, 4)
Again, no solutions, so then we end up with sequences of size 4, which also yield no solutions.
Now it gets even trickier, as we have to start thinking about perhaps repeating some indices, and now my brain is melting.
I'm thinking looking for common subsequences in the strings might be helpful, and then using the remaining parts in the strings that were not matched. But I don't quite know how.
A very simple way is to just use something like a breadth-first search. This also has the advantage that the first solution found will have minimal size.
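To make that concrete, here is a small Python sketch of such a breadth-first search (my own illustration, not a known library routine). The search state is the "overhang", the suffix by which one concatenation currently leads the other; because the underlying problem is undecidable, the depth has to be capped:

from collections import deque

def find_sequence(A, B, max_depth=12):
    """Shortest 0-based index sequence with equal concatenations, or None."""
    start = ("", 1)                       # (overhang, side); side=1 means A leads
    parent = {start: None}                # state -> (previous state, tile index)
    queue = deque([(start, 0)])
    while queue:
        state, depth = queue.popleft()
        if depth >= max_depth:
            continue
        over, side = state
        for i, (a, b) in enumerate(zip(A, B)):
            # append tile i to both sides; the shorter side must stay a prefix
            lead, chase = (over + a, b) if side > 0 else (over + b, a)
            if lead.startswith(chase):
                nxt = (lead[len(chase):], side)
            elif chase.startswith(lead):
                nxt = (chase[len(lead):], -side)
            else:
                continue                  # mismatch: this branch is dead
            if nxt[0] == "":              # equal concatenations: rebuild the path
                seq = [i]
                while parent[state] is not None:
                    state, j = parent[state]
                    seq.append(j)
                return seq[::-1]
            if nxt not in parent:
                parent[nxt] = (state, i)
                queue.append((nxt, depth + 1))
    return None                           # nothing found within max_depth tiles

# find_sequence(["x", "xx"], ["xx", "x"]) -> [0, 1]

The parent dictionary doubles as a visited set, so each overhang is explored once, and the first solution found uses the fewest tiles.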
It is not clear what the 'solution' you are looking for is: the longest solution? The shortest? All solutions?
Since you allow repetition there will be an infinite number of solutions for some inputs, so I will work on:
Find all sequences under a fixed length.
Written as pseudocode, but in a manner very similar to F# sequence expressions:
// assumed true/false functions
let Eq aList bList =
    // e.g. Eq ["ab"; "c"] ["a"; "bc"] -> true
    // Eq [] [] is _false_
let EitherStartsWith aList bList =
    // e.g. ["ab"; "c"] ["a"; "b"] -> true
    // e.g. ["a"] ["ab"] -> true
    // [] [] is _true_
let rec FindMatches A B aList bList level = seq {
    if level > 0 then
        if Eq aList bList then
            yield aList
        elif EitherStartsWith aList bList then
            // try every index as the next selection
            for (a, b, i) in Seq.zip3 A B (seq {1..}) do
                yield! FindMatches A B (aList @ [(a, i)]) (bList @ [(b, i)]) (level - 1) }
let solution (A: seq<string>) (B: seq<string>) length =
    FindMatches A B [] [] length
Some trivial constraints to reduce the problem:
The first selection pair must have a common start section.
The final selection pair must have a common end section.
Based on this we can quickly eliminate many inputs with no solution:
let solution (A: seq<string>) (B: seq<string>) length =
    let mutable starts = []
    let mutable ends = []
    Seq.zip3 A B (seq {1..})
    |> Seq.iter (fun (a, b, i) ->
        if a.StartsWith(b) || b.StartsWith(a) then
            starts <- (a, b, i) :: starts
        if a.EndsWith(b) || b.EndsWith(a) then
            ends <- (a, b, i) :: ends)
    if List.isEmpty starts || List.isEmpty ends then
        Seq.empty // no solution
    else
        starts
        |> Seq.map (fun (a, b, i) ->
            FindMatches A B [(a, i)] [(b, i)] (length - 1))
        |> Seq.concat
Here's a suggestion for a brute force search. First generate number sequences whose entries range over the indexes of your string lists:
[0,0,..]
[1,0,..]
[2,0,..]
[3,0,..]
[0,1,..]
...
The number sequence length determines how many strings are going to be in any solution found.
Then generate A and B strings by using the numbers as indexes into your string lists:
public class FitSequence
{
private readonly string[] a;
private readonly string[] b;
public FitSequence(string[] a, string[] b)
{
this.a = a;
this.b = b;
}
private static string BuildString(string[] source, int[] indexes)
{
var s = new StringBuilder();
for (int i = 0; i < indexes.Length; ++i)
{
s.Append(source[indexes[i]]);
}
return s.ToString();
}
public IEnumerable<int[]> GetSequences(int length)
{
foreach (var numberSequence in new NumberSequence(length).GetNumbers(a.Length - 1))
{
string a1 = BuildString(a, numberSequence);
string b1 = BuildString(b, numberSequence);
if (a1 == b1)
yield return numberSequence;
}
}
}
This algorithm assumes equal lengths for A and B.
I tested your example with
static void Main(string[] args)
{
var a = new[] {"kk", "ka", "kkk", "a"};
var b = new[] {"ka", "kakk", "ak", "k"};
for (int i = 0; i < 100; ++i)
foreach (var sequence in new FitSequence(a, b).GetSequences(i))
{
foreach (int x in sequence)
Console.Write("{0} ", x);
Console.WriteLine();
}
}
but could not find any solutions, though it seemed to work for simple tests.
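For comparison, here is a minimal Python sketch of the same fixed-length enumeration (my own translation; the NumberSequence helper in the C# version is assumed to enumerate index tuples in a similar way). It is exponential in the sequence length, so only tiny bounds are practical:

import itertools

def fit_sequences(a, b, max_len):
    # try every index tuple of every length up to max_len (0-based indices)
    for length in range(1, max_len + 1):
        for seq in itertools.product(range(len(a)), repeat=length):
            if "".join(a[i] for i in seq) == "".join(b[i] for i in seq):
                yield seq

# the pair from the question; like the C# run above, this prints nothing
# for small bounds
for seq in fit_sequences(["kk", "ka", "kkk", "a"],
                         ["ka", "kakk", "ak", "k"], 8):
    print(seq)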
Related
I want to generate a random string of a fixed length L. However, there is a set of "banned" substrings all of length b that cannot appear in the string. Is there a way to algorithmically generate this parent string?
Here is a small example:
I want a string that is 10 characters long -> XXXXXXXXXX
The banned substrings are {'AA', 'CC', 'AD'}
The string ABCDEFGHIJ is a valid string, but AABCDEFGHI is not.
For a small example it is relatively easy to randomly generate and then check the string, but as the set of banned substrings gets larger (or the length of the banned substrings gets smaller), the probability of randomly generating a valid string rapidly decreases.
This will be a fairly efficient approach, but it requires a lot of theory.
First you can take your list of banned strings, say AA, BB, AD in this example. This can trivially be turned into a regular expression that matches any of them, namely /AA|BB|AD/, which you can then turn into a Nondeterministic Finite Automaton (NFA) and then a Deterministic Finite Automaton (DFA) for matching the regular expression. See these lecture notes for an example of how to do that. In this case the DFA will have the following states:
1. Matched nothing (yet)
2. Matched A
3. Matched B
4. End of match
And the transition rules will be:
1. If A go to state 2, if B go to state 3, else stay in state 1.
2. If A or D go to state 4, if B go to state 3, else go to state 1.
3. If B go to state 4, if A go to state 2, else go to state 1.
4. Match complete, we're done. (Stay in state 4 forever.)
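Written out in Python (a sketch using the state numbers above, assuming an uppercase alphabet), the transition function might look like:

# state 1 = matched nothing, 2 = matched "A", 3 = matched "B", 4 = done
def delta(state, c):
    if state == 1:
        return 2 if c == "A" else 3 if c == "B" else 1
    if state == 2:
        return 4 if c in ("A", "D") else 3 if c == "B" else 1
    if state == 3:
        return 4 if c == "B" else 2 if c == "A" else 1
    return 4          # state 4 absorbs everything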
Now normally a DFA is used to find a match. We're going to use it as a way to find ways to avoid a match. So how will we do that?
The trick here is dynamic programming.
What we will do is create a table, indexed by position in the string and by state of the match, that records:
how many ways there are to get here
how many ways we got here from each previous (position, state) pair
In other words we go forward and create a table which starts like this:
[
{1: {'count': 1}},
{1: {'count': 24, 'prev': {1: 24}},
2: {'count': 1, 'prev': {1: 1}},
3: {'count': 1, 'prev': {1: 1}},
},
{1: {'count': 623, 'prev': {1: 576, 2: 23, 3: 24}},
2: {'count': 25, 'prev': {1: 24, 3: 1}},
3: {'count': 25, 'prev': {1: 24, 2: 1}},
4: {'count': 3, 'prev': {2: 2, 3: 1}},
},
...
]
By the time this is done, we know exactly how many ways we can wind up at the end of the string with a match (state 4), partial match (states 2 or 3) or not currently matching (state 1).
Now we generate the random string backwards. First we randomly pick the final state with odds based on the count. Then from that final state's prev entry we can pick the state we were on before that final one. Then we randomly pick a letter that would have done that. We are picking that letter/state combination completely randomly from all solutions.
Now we know what state we were in at the second to last letter. Pick the previous state, then pick a letter in the same way.
Continue back until finally we know the state we're in after we've picked the first letter, and then pick that letter.
It is a lot of logic to write, but it is all deterministic in time, and will let you pick random strings all day long once you've done the initial analysis.
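As one concrete illustration, here is a hedged Python sketch of the whole pipeline (my own code, not from the answer or a library). It builds the DFA directly from the banned set and, instead of the backward walk described above, counts valid completions and samples forward; both methods draw uniformly from the same set of valid strings:

import random

def build_dfa(banned, alphabet):
    # states are the proper prefixes of the banned words ("" included);
    # a transition that would complete any banned word is simply omitted
    prefixes = {""}
    for w in banned:
        for i in range(1, len(w)):
            prefixes.add(w[:i])
    def live_suffix(s):
        # longest suffix of s that is itself a state ("" always qualifies)
        for k in range(len(s) + 1):
            if s[k:] in prefixes:
                return s[k:]
    trans = {}
    for s in prefixes:
        for c in alphabet:
            t = s + c
            if not any(t.endswith(w) for w in banned):
                trans[s, c] = live_suffix(t)
    return prefixes, trans

def sample_valid(banned, length, alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ"):
    states, trans = build_dfa(banned, alphabet)
    # ways[i][s] = number of valid ways to fill positions i..length-1 from state s
    ways = [dict.fromkeys(states, 0) for _ in range(length + 1)]
    ways[length] = dict.fromkeys(states, 1)
    for i in range(length - 1, -1, -1):
        for s in states:
            ways[i][s] = sum(ways[i + 1][trans[s, c]]
                             for c in alphabet if (s, c) in trans)
    if ways[0][""] == 0:
        return None                      # no valid string of this length
    out, state = [], ""
    for i in range(length):
        # choose each letter with probability proportional to its completions
        opts = [(c, ways[i + 1][trans[state, c]])
                for c in alphabet if (state, c) in trans]
        letters, weights = zip(*opts)
        c = random.choices(letters, weights=weights)[0]
        out.append(c)
        state = trans[state, c]
    return "".join(out)

# e.g. sample_valid({"AA", "BB", "AD"}, 10) -> a uniformly random valid string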
There are two ways.
First way: keep generating a random string until the banned string does not appear in it.
#include <bits/stdc++.h>
using namespace std;

const int N = 1110000;
char ban[] = "AA";
char rnd[N];

int main() {
    srand(time(0));
    int n = 100;
    do {
        // generate a fresh candidate string of length n
        for (int i = 0; i < n; i++) rnd[i] = 'A' + rand() % 26;
        rnd[n] = 0;
    } while (strstr(rnd, ban));   // retry while the banned string occurs
    cout << rnd << endl;
}
I think this is the easiest way to implement.
However, this method has expected cost up to about O((26/25)^n * (n+b)) when the string to be created is very long and the banned string is very short.
For example, if ban="A" and n=10000, it will exceed any time limit!
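To get a feel for the blow-up (hypothetical numbers): with ban="A" every character is independently acceptable with probability 25/26, so a length-n attempt survives with probability (25/26)^n and the expected number of attempts is (26/25)^n:

# expected number of attempts for ban="A", n=10000
print((26 / 25) ** 10000)   # ~2e170 attempts: hopeless, as stated above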
Second way: build the string one character at a time, checking after each new character whether a banned string has just appeared.
If you want to use this approach, you must know about the KMP algorithm.
With this approach we cannot use the system default search function strstr.
#include <bits/stdc++.h>
using namespace std;

const int N = 1110000;
char ban[] = "A";
char rnd[N];
int b = strlen(ban);
int n = 10;

// standard KMP prefix function (failure table) for the banned string
int *pre_kmp() {
    int *pie = new int[b];
    pie[0] = 0;
    int k = 0;
    for (int i = 1; i < b; i++) {
        while (k > 0 && ban[k] != ban[i])
            k = pie[k - 1];
        if (ban[k] == ban[i])
            k = k + 1;
        pie[i] = k;
    }
    return pie;
}

int main() {
    srand(time(0));
    int *pie = pre_kmp();
    int matched_pos = 0;   // length of the banned-string prefix currently matched
    for (int cur = 0; cur < n; cur++) {
        do {
            rnd[cur] = 'A' + rand() % 26;
            // advance the KMP automaton by one character; note that after a
            // full match, ban[matched_pos] is '\0', so the loop falls back
            // through the failure table before the redrawn character is tested
            while (matched_pos > 0 && ban[matched_pos] != rnd[cur])
                matched_pos = pie[matched_pos - 1];
            if (ban[matched_pos] == rnd[cur])
                matched_pos = matched_pos + 1;
        } while (matched_pos == b);   // redraw this character if a match appeared
    }
    cout << rnd << endl;
}
With KMP the matcher state is updated in amortized constant time per character, and each position needs only an expected constant number of redraws, so generation runs in roughly O(n + b) time overall.
Of course you could also search manually without the KMP algorithm, re-checking the tail of the string after each character. In that case the check costs O(b) per character, so the total becomes O(n*b), which gets very slow when both the ban string length and the generated string length are large.
Hope this helps you.
Simple approach:
Create a map where the keys are the banned substrings excluding the final letter, and the values are the lists of allowed letters. Alternatively you can have the values be the lists of banned letters with slight modification.
Generate random letters, using the set of allowed letters for the (b-1)-length substring at the end of the letters already generated, or the full alphabet if that substring doesn't match a key.
This is O(n*b) unless there's a way to update the substring of the last b-1 characters in O(1).
Ruby Code
def randstr(strlen, banned_strings)
b = banned_strings[0].length # All banned strings are length b
alphabet = ('a'..'z').to_a
prefix_to_legal_letters = Hash.new {|h, k| h[k] = ('a'..'z').to_a}
banned_strings.each do |s|
prefix_to_legal_letters[s[0..-2]].delete(s[-1])
end
str = ''
while str.length < strlen
letters_to_use = alphabet
if str.length >= b-1
str_end = str[(-b+1)..-1]
if prefix_to_legal_letters.has_key?(str_end)
letters_to_use = prefix_to_legal_letters[str_end]
end
end
str += letters_to_use.sample()
end
return str
end
A silly example to show this works. The only legal letter after 'a' is 'z'. Every 'a' in the output is followed by a 'z'.
randstr(1000, ['aa', 'ab', 'ac', 'ad', 'ae', 'af', 'ag', 'ah', 'ai', 'aj', 'ak', 'al', 'am', 'an', 'ao',
'ap', 'aq', 'ar', 'as', 'at', 'au', 'av', 'aw', 'ax', 'ay'])
=> "gkowhxkhrknrxkbjxbjwiqohvvazwrjxjdekrujdyprjnmbjuklqsjdwzidhpgzzmnfbyjuptbpyezfeeydgdkpznvjfwziazrzohwvnitnfupdqxivtvkbazpvqzdzzsneslrazmhbjojyqowhvjhsrdgpbejicitprxzmkhgsuvvlyfizmhohorazemyhtbazvhvazdmnjmjzoggwmjjnrqxcmrdhxozbsjjdqgfjorazmtwtvvujpgivdxijowgxnkuxovncnivazmtykliqiielsfixuflfsgqbpevazozfsvfynhxyjpxtuolqooowazpyoukssknxdntzjjbqazxjttdblepsjzqgxmxvtrmjbgvuyfvspdrrohmtwhtdxfcvidswxtzbznszsqorpxdywbytsitxeziudmvlnluwmcqtfydxlocltozovhusbblfutbqjfjeslverzctxazyprazxzmazxwbdfkwxdwdqxnqhbcliwuitsnnpscbsjitoftblgjycpnxqsikpjqysmqiazdazwwjmeazxcbejthnlsskhazxazlrceyjtbmcpazscazvsjkqhiqfbjygjhyqazsbjymsovojfxynygzwmlhkmpvswpweqkkvmbrxhazpmiqrazcgprlbywmqpyvtphydniazovrkolzbslsosjvdqkgrjmcorqtgeazfwskjuhndszliiirtncmzrzhocyazyrhhpbcsmneuiktyswvgqwkzswkjnyuazggnreeccyidvrbxuskrlchjxnrrpljilogxmicjvmoeequbpkursrqsisqtfkruswnyftdgbjhwvcrlcnfecyfdnslmxztlbfxjhgeslqedrflthlhnlwopmsdjgochxwxhfhvqcixvxdjixcazggmexidtlhymkiyyfuhxufvxyfazmmwsbrlooqwfphgfhvthspvmyiazdazggpeuhnpjmzsazfxmsukpd"
Note that this code and approach need modification if the OP's statement that all banned substrings are length b is relaxed: either directly (by giving banned substrings of varying lengths), or implicitly (by having a (b-1)-length substring for which every letter is banned, in which case that (b-1)-length substring is effectively banned too).
The modification is the obvious one of checking all the possible key lengths in the map; it's still O(n*b) if b is taken as the length of the longest banned substring.
We have an array 'A' of strings and an array 'B' of strings. For each string 's' in B, we have to find the number of strings in 'A' that have 's' as a suffix.
1≤size of A ≤10^5
1≤size of B ≤10^5
1≤|Ai|≤10^2
1≤|Bi|≤10^2
The naive approach is simply traversing through 'B' and, for each string in B, iterating over A to count the matches, but this has O(N^2) time complexity. We need a solution with better time complexity.
Construct a prefix tree based on A. In each vertex of the tree also keep information on how many strings 'pass' through it.
Then, for each s in B, find the vertex in the prefix tree that corresponds to s and just read off how many strings from A passed through it (the information is already there).
Add words from A to prefix tree reversed, so you can operate on suffixes, and not prefixes.
Time complexity is O(size(A) + size(B))
Pseudo code:
struct node
{
node* children[ALPHABET_SIZE]
int num_words;
}
func solve(string[] a, string[] b)
{
node* treeRoot = new node()
for (string s in a)
{
string x = s.reverse()
node* currNode = treeRoot
for (char ch in x)
{
currNode.num_words++
if (currNode.children[ch] == null)
    currNode.children[ch] = new node()
currNode = currNode.children[ch]
}
currNode.num_words++
}
int answer[len(b)]
for (int i=0;i<len(b);++i)
{
string x = b[i].reverse()
node* currNode = treeRoot
bool brk = false
for (char ch in x)
{
if (currNode.children[ch] == null)
{
brk = true
break
}
currNode = currNode.children[ch]
}
if (brk)
answer[i] = 0
else
answer[i] = currNode.num_words
}
return answer
}
EDIT:
By size(A) I mean total number of chars in all strings in A
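For reference, here is a runnable Python version of the pseudocode (my own sketch; nodes are plain dicts and the pass-through count lives under a sentinel key):

def count_suffix_matches(A, B):
    CNT = object()                      # sentinel key for the pass-through count
    root = {}
    for s in A:
        node = root
        node[CNT] = node.get(CNT, 0) + 1
        for ch in reversed(s):          # reversed: suffixes become prefixes
            node = node.setdefault(ch, {})
            node[CNT] = node.get(CNT, 0) + 1
    result = []
    for s in B:
        node = root
        for ch in reversed(s):
            node = node.get(ch)
            if node is None:
                break
        result.append(0 if node is None else node[CNT])
    return result

# count_suffix_matches(["abc", "bc", "c", "xb"], ["bc", "c"]) -> [2, 3]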
You can do this in O(N) time, where N is the size of the input (number of strings * average length), and without any complicated data structures.
First reverse all the strings to change this from a suffix problem into a prefix problem. The task is now to find the number of strings in A with each string in B as a prefix.
For each string s in B, remember these two strings:
s_start is just s. This is the highest string such that all strings with s as a prefix are lexicographically >= s_start
s_end is the smallest string such that all strings with s as a prefix are < s_end. You can get this string by incrementing the last character of s.
For example, if s is "abc", then s_start = "abc" and s_end = "abd". Iff we have "abc" <= x < "abd", then "abc" is a prefix of x.
Now sort A, sort the list of B starts, and sort the list of B ends. If you use a radix sort, this takes O(N) time.
Then walk through the 3 lists in order as if you were merging them into one sorted list. If you find the same string in multiple lists, process the B strings before A strings.
Keep track of the number of As you consume, and:
Whenever you see an s_start, set start[s] = As_consumed_so_far
Whenever you see an s_end, set end[s] = As_consumed_so_far
When you're done, for each s in B, end[s]-start[s] is the number of strings in A with s as a prefix.
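Here is a hedged Python sketch of the s_start/s_end idea (my own code; it uses binary search on one sorted copy of A rather than the three-way merge walk, so with a comparison sort it is O(N log N), but the counting principle is the same):

from bisect import bisect_left

def count_suffixes(A, B):
    rev = sorted(s[::-1] for s in A)            # reverse: suffix -> prefix
    counts = []
    for s in B:
        p = s[::-1]
        p_end = p[:-1] + chr(ord(p[-1]) + 1)    # increment the last character
        # strings with prefix p are exactly those in the range [p, p_end)
        counts.append(bisect_left(rev, p_end) - bisect_left(rev, p))
    return counts

# count_suffixes(["abc", "bc", "c", "xbc"], ["bc", "c"]) -> [3, 4]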
I encountered this problem in an interview and I was stuck on the best way to go about it. The question is as follows:
Given a string sequence of words and a string sequence pattern, return true if the sequence of words matches the pattern otherwise false.
Definition of match: A word that is substituted for a variable must always follow that substitution. For example, if "f" is substituted as "monkey" then any time we see another "f" then it must match "monkey" and any time we see "monkey" again it must match "f".
Examples
input: "ant dog cat dog", "a d c d"
output: true
This is true because every variable maps to exactly one word and vice versa.
a -> ant
d -> dog
c -> cat
d -> dog
input: "ant dog cat dog", "a d c e"
output: false
This is false because if we substitute "d" as "dog" then you can not also have "e" be substituted as "dog".
a -> ant
d, e -> dog (Both d and e can't both map to dog so false)
c -> cat
input: "monkey dog eel eel", "e f c c"
output: true
This is true because every variable maps to exactly one word and vice versa.
e -> monkey
f -> dog
c -> eel
Initially, I thought of doing something as follows...
function matchPattern(pattern, stringToMatch) {
var patternBits = pattern.split(" ");
var stringBits = stringToMatch.split(" ");
var dict = {};
if (patternBits.length === 0
|| patternBits.length !== stringBits.length) {
return false;
}
for (var i = 0; i < patternBits.length; i++) {
if (dict.hasOwnProperty(patternBits[i])) {
if (dict[patternBits[i]] !== stringBits[i]) {
return false;
}
} else {
dict[patternBits[i]] = stringBits[i];
}
}
return true;
}
var ifMatches = matchPattern("a e c d", "ant dog cat dog");
console.log("Pattern: " + (ifMatches ? "matches!" : "does not match!"));
However, I realized that this won't work: it fails example #2 by erroneously returning true. One way to deal with this issue is to use a bi-directional dictionary, or two dictionaries, i.e. store both {"a": "ant"} and {"ant": "a"} and check both directions in the if check. However, that seemed like wasted space. Is there a better way to tackle this problem without using regular expressions?
I think a simple choice that is quadratic in the length of the list of words is to verify that every pairing of list indices has the same equality characteristics in the two lists. I'll assume that you get the "words" and "pattern" as lists already and don't need to parse out spaces and whatever -- that ought to be a separate function's responsibility anyway.
function matchesPatternReference(words, pattern) {
if(words.length !== pattern.length) return false;
for(var i = 0; i < words.length; i++)
for(var j = i+1; j < words.length; j++)
if((words[i] === words[j]) !== (pattern[i] === pattern[j]))
return false;
return true;
}
A slightly better approach would be to normalize both lists, then compare the normalized lists for equality. To normalize a list, replace each list element by the number of unique list elements that appear before its first occurrence in the list. This will be linear in the length of the longer list, assuming you believe hash lookups and list appends take constant time. I don't know enough Javascript to know if these are warranted; certainly at worst the idea behind this algorithm can be implemented with suitable data structures in n*log(n) time even without believing that hash lookups are constant time (a somewhat questionable assumption no matter the language).
function normalize(words) {
var next_id = 0;
var ids = {};
var result = [];
for(var i = 0; i < words.length; i++) {
if(!ids.hasOwnProperty(words[i])) {
ids[words[i]] = next_id;
next_id += 1;
}
result.push(ids[words[i]]);
}
return result;
}
function matchesPatternFast(words, pattern) {
return normalize(words) === normalize(pattern);
}
Note: As pointed out in the comments, one should check deep equality of the normalized arrays manually, since === on arrays does an identity comparison in Javascript and does not compare elementwise. See also How to check if two arrays are equal with Javascript?.
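For what it's worth, the same normalization trick in Python sidesteps that pitfall, since Python tuples compare elementwise (a sketch of mine, not part of the original answer):

def normalize(words):
    ids = {}                # maps each word to a small int in first-seen order
    return tuple(ids.setdefault(w, len(ids)) for w in words)

def matches_pattern(words, pattern):
    return normalize(words) == normalize(pattern)

assert matches_pattern("ant dog cat dog".split(), "a d c d".split())
assert not matches_pattern("ant dog cat dog".split(), "a d c e".split())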
Addendum: Below I argue that matchesPatternFast and matchesPatternReference compute the same function -- but use the faulty assumption that === on arrays compares elements pointwise rather than being a pointer comparison.
We can define the following function:
function matchesPatternSlow(words, pattern) {
return matchesPatternReference(normalize(words), normalize(pattern));
}
I observe that normalize(x).length === x.length and normalize(x)[i] === normalize(x)[j] if and only if x[i] === x[j]; therefore matchesPatternSlow computes the same function as matchesPatternReference.
I will now argue that matchesPatternSlow(x,y) === matchesPatternFast(x,y). Certainly if normalize(x) === normalize(y) then we will have this property: matchesPatternFast will manifestly return true. On the other hand, matchesPatternSlow operates by making a number of queries on its two inputs and verifying that these queries always return the same results for both lists: outside the loop, the query is function(x) { return x.length }, and inside the loop, the query is function(x, i, j) { return x[i] === x[j]; }. Since equal objects will respond identically to any query, all queries on the two normalized lists will align, and matchesPatternSlow will also return true.
What if normalize(x) !== normalize(y)? Then matchesPatternFast will manifestly return false. But if they are not equal, then either their lengths do not match -- in which case matchesPatternSlow will also return false from the first check in matchesPatternReference as we hoped -- or else the elements at some index are not equal. Suppose the smallest mismatching index is i. It is a property of normalize that the element at index i will either be equal to an element at index j<i or else it will be one larger than the maximal element from indices 0 through i-1. So we now have four cases to consider:
We have j1<i and j2<i for which normalize(x)[j1] === normalize(x)[i] and normalize(y)[j2] === normalize(y)[i]. But since normalize(x)[i] !== normalize(y)[i] we then know that normalize(x)[j1] !== normalize(y)[i]. So when matchesPatternReference chooses the indices j1 and i, we will find that normalize(x)[j1] === normalize(x)[i] is true and normalize(y)[j1] === normalize(y)[i] is false and immediately return false as we are trying to show.
We have j<i for which normalize(x)[j] === normalize(x)[i] and normalize(y)[i] is not equal to any previous element of normalize(y). Then matchesPatternReference will return false when it chooses the indices j and i, since normalize(x) matches on these indices but normalize(y) doesn't.
We have j<i for which normalize(y)[j] === normalize(y)[i] and normalize(x)[i] is not equal to any previous element of normalize(x). Basically the same as in the previous case.
We have that normalize(x)[i] is one larger than the largest earlier element in normalize(x) and normalize(y)[i] is one larger than the largest earlier element in normalize(y). But since normalize(x) and normalize(y) agree on all previous elements, this means normalize(x)[i] === normalize(y)[i], a contradiction to our assumption that the normalized lists differ at this index.
So in all cases, matchesPatternFast and matchesPatternSlow agree -- hence matchesPatternFast and matchesPatternReference compute the same function.
For this special case, I assume the pattern refers to matching the first character of each word. If so, you can simply zip and compare.
# python2.7
words = "ant dog cat dog"
letters = "a d c d"
letters2 = "a d c e"
def match(ws, ls):
ws = ws.split()
ls = ls.split()
return all(w[0] == l for w, l in zip(ws + [[0]], ls + [0]))
print match(words, letters)
print match(words, letters2)
The funny [[0]] and [0] at the end are there to ensure that the pattern and the words have the same length.
I need to distribute a set of repetitive strings as evenly as possible.
Is there any way to do this better than simple shuffling using unsort? It can't do what I need.
For example if the input is
aaa
aaa
aaa
bbb
bbb
The output I need
aaa
bbb
aaa
bbb
aaa
The number of distinct strings doesn't have any limit, nor does the number of reps of any string.
The input can be changed to a list of string number_of_reps pairs:
aaa 3
bbb 2
... .
zzz 5
Is there an existing tool, Perl module or algorithm to do this?
Abstract: Given your description of how you determine an “even distribution”, I have written an algorithm that calculates a “weight” for each possible permutation. It is then possible to brute-force the optimal permutation.
Weighing an arrangement of items
By "evenly distribute" I mean that intervals between each two occurrences of a string and the interval between the start point and the first occurrence of the string and the interval between the last occurrence and the end point must be as much close to equal as possible where 'interval' is the number of other strings.
It is trivial to count the distances between occurrences of strings. I decided to count in a way that the example combination
A B A C B A A
would give the count
A: 1 2 3 1 1
B: 2 3 3
C: 4 4
I.e. Two adjacent strings have distance one, and a string at the start or the end has distance one to the edge of the string. These properties make the distances easier to calculate, but are just a constant that will be removed later.
This is the code for counting distances:
sub distances {
    my %distances;
    my %last_seen;
    for my $i (0 .. $#_) {
        my $s = $_[$i];
        push @{ $distances{$s} }, $i - ($last_seen{$s} // -1);
        $last_seen{$s} = $i;
    }
    push @{ $distances{$_} }, @_ - $last_seen{$_} for keys %last_seen;
    return values %distances;
}
Next, we calculate the standard variance for each set of distances. The variance of one distance d describes how far it is off from the average a. As it is squared, large anomalies are heavily penalized:
variance(d, a) = (a - d)²
We get the standard variance of a data set by summing the variance of each item, and then calculating the square root:
svar(items) = sqrt ∑_i variance(items[i], average(items))
Expressed as Perl code:
use List::Util qw/sum min/;
sub svar (#) {
my $med = sum(#_) / #_;
sqrt sum map { ($med - $_) ** 2 } #_;
}
We can now calculate how even the occurrences of one string in our permutation are, by calculating the standard variance of the distances. The smaller this value is, the more even the distribution is.
Now we have to combine these weights to a total weight of our combination. We have to consider the following properties:
Strings with more occurrences should have greater weight than strings with fewer occurrences.
Uneven distributions should have greater weight than even distributions, to strongly penalize unevenness.
The following can be swapped out for a different procedure, but I decided to weigh each standard variance by raising it to the power of the number of occurrences, then adding all weighted svariances:
sub weigh_distance {
    return sum map {
        my @distances = @$_; # the distances of one string
        svar(@distances) ** $#distances;
    } distances(@_);
}
This turns out to prefer good distributions.
We can now calculate the weight of a given permutation by passing it to weigh_distance. Therefore, we can decide if two permutations are equally well distributed, or if one is to be preferred:
Selecting optimal permutations
Given a selection of permutations, we can select those that are optimal:
sub select_best {
    my %sorted;
    for my $strs (@_) {
        my $weight = weigh_distance(@$strs);
        push @{ $sorted{$weight} }, $strs;
    }
    my $min_weight = min keys %sorted;
    @{ $sorted{$min_weight} }
}
This will return at least one of the given possibilities. If the exact one is unimportant, an arbitrary element of the returned array can be selected.
Bug: This relies on stringification of floats, and is therefore open to all kinds of off-by-epsilon errors.
Creating all possible permutations
For a given multiset of strings, we want to find the optimal permutation. We can think of the available strings as a hash mapping the strings to the remaining available occurrences. With a bit of recursion, we can build all permutations like this:
use Carp;

# called like make_perms(A => 4, B => 1, C => 1)
sub make_perms {
    my %words = @_;
    my @keys =
        sort # sorting is important for cache access
        grep { $words{$_} > 0 }
        grep { length or carp "Can't use empty strings as identifiers" }
        keys %words;
    my ($perms, $ok) = _fetch_perm_cache(\@keys, \%words);
    return @$perms if $ok;
    # build perms manually, if it has to be.
    # pushing into @$perms directly updates the cached values
    for my $key (@keys) {
        my @childs = make_perms(%words, $key => $words{$key} - 1);
        push @$perms, (@childs ? map [$key, @$_], @childs : [$key]);
    }
    return @$perms;
}
The _fetch_perm_cache returns a ref to a cached array of permutations, and a boolean flag to test for success. I used the following implementation with deeply nested hashes, which stores the permutations on leaf nodes. To mark the leaf nodes, I have used the empty string, hence the test above.
sub _fetch_perm_cache {
    my ($keys, $idxhash) = @_;
    state %perm_cache;
    my $pointer = \%perm_cache;
    my $ok = 1;
    $pointer = $pointer->{$_}[$idxhash->{$_}] //= do { $ok = 0; +{} } for @$keys;
    $pointer = $pointer->{''} //= do { $ok = 0; +[] }; # access empty string key
    return $pointer, $ok;
}
That not all strings are valid input keys is no issue: every collection can be enumerated, so make_perms could be given integers as keys, which are translated back to whatever data they represent by the caller. Note that the caching makes this non-threadsafe (if %perm_cache were shared).
Connecting the pieces
This is now a simple matter of
say "#$_" for select_best(make_perms(A => 4, B => 1, C => 1))
This would yield
A A C A B A
A A B A C A
A C A B A A
A B A C A A
which are all optimal solutions by the used definition. Interestingly, the solution
A B A A C A
is not included. This could be a bad edge case of the weighing procedure, which strongly favours putting occurrences of rare strings towards the center. See Further work.
Completing the test cases
Preferable versions are first: AABAA ABAAA, ABABACA ABACBAA (two 'A' in a row), ABAC ABCA
We can run these test cases by
use Test::More tests => 3;

my @test_cases = (
    [0 => [qw/A A B A A/], [qw/A B A A A/]],
    [1 => [qw/A B A C B A A/], [qw/A B A B A C A/]],
    [0 => [qw/A B A C/], [qw/A B C A/]],
);

for my $test (@test_cases) {
    my ($correct_index, @cases) = @$test;
    my $best = select_best(@cases);
    ok $best ~~ $cases[$correct_index], "[@{$cases[$correct_index]}]";
}
Out of interest, we can calculate the optimal distributions for these letters:
my @counts = (
    { A => 4, B => 1 },
    { A => 4, B => 2, C => 1 },
    { A => 2, B => 1, C => 1 },
);

for my $count (@counts) {
    say "Selecting best for...";
    say " $_: $count->{$_}" for keys %$count;
    say "@$_" for select_best(make_perms(%$count));
}
This brings us
Selecting best for...
A: 4
B: 1
A A B A A
Selecting best for...
A: 4
C: 1
B: 2
A B A C A B A
Selecting best for...
A: 2
C: 1
B: 1
A C A B
A B A C
C A B A
B A C A
Further work
Because the weighing attributes the same importance to the distance to the edges as to the distance between letters, symmetrical setups are preferred. This condition could be eased by reducing the value of the distance to the edges.
The permutation generation algorithm has to be improved. Memoization could lead to a speedup. Done! The permutation generation is now 50× faster for synthetic benchmarks, and can access cached input in O(n), where n is the number of different input strings.
It would be great to find a heuristic to guide the permutation generation, instead of evaluating all possibilities. A possible heuristic would consider whether there are enough different strings available that no string has to neighbour itself (i.e. distance 1). This information could be used to narrow the width of the search tree.
Transforming the recursive perm generation to an iterative solution would allow interleaving the search with the weight calculation, which would make it easier to skip or defer unfavourable solutions.
The standard variances are raised to the power of the occurrences. This is probably not ideal, as a large deviation for a large number of occurrences weighs lighter than a small deviation for few occurrences, e.g.
weight(svar, occurrences) → weighted_variance
weight(0.9, 10) → 0.35
weight(0.5, 1) → 0.5
This should in fact be reversed.
Edit
Below is a faster procedure that approximates a good distribution. In some cases, it will yield the correct solution, but this is not generally the case. The output is bad for inputs with many different strings where most have very few occurrences, but is generally acceptable where only few strings have few occurrences. It is significantly faster than the brute-force solution.
It works by inserting strings at regular intervals, then spreading out avoidable repetitions.
sub approximate {
    my %def = @_;
    my ($init, @keys) = sort { $def{$b} <=> $def{$a} or $a cmp $b } keys %def;
    my @out = ($init) x $def{$init};
    while (my $key = shift @keys) {
        my $visited = 0;
        for my $parts_left (reverse 2 .. $def{$key} + 1) {
            my $interrupt = $visited + int((@out - $visited) / $parts_left);
            splice @out, $interrupt, 0, $key;
            $visited = $interrupt + 1;
        }
    }
    # check if strings should be swapped
    for my $i (0 .. $#out - 2) {
        @out[$i, $i + 1] = @out[$i + 1, $i]
            if $out[$i] ne $out[$i + 1]
            and $out[$i + 1] eq $out[$i + 2]
            and (!$i or $out[$i + 1] ne $out[$i - 1]);
    }
    return @out;
}
Edit 2
I generalized the algorithm for any objects, not just strings. I did this by translating the input to an abstract representation like “two of the first thing, one of the second”. The big advantage here is that I only need integers and arrays to represent the permutations. Also, the cache is smaller, because A => 4, C => 2, C => 4, B => 2 and $regex => 2, $fh => 4 represent the same abstract multisets. The speed penalty incurred by the necessity to transform data between the external, internal, and cache representations is roughly balanced by the reduced number of recursions.
The large bottleneck is in the select_best sub, which I have largely rewritten in Inline::C (still eats ~80% of execution time).
These issues go a bit beyond the scope of the original question, so I won't paste the code in here, but I guess I'll make the project available via github once I've ironed out the wrinkles.
Let's say I have an array defined in Groovy like this:
def int[] a = [1,9]
Now I want to convert this array into an int variable, say a1, such that a1 has the value 19 (the array values concatenated). Is there any way to do this?
I'd go for:
[1, 2, 3, 4].inject(0) { a, h -> a * 10 + h }
1) you don't need the def:
int[] a = [0,9]
2) What do you mean by 09? Isn't that 9? How are you seeing this encoding working?
If you mean you just want to concatenate the numbers together, so:
[ 1, 2, 3, 4 ] == 1234
Then you could do something like:
int b = a.collect { "$it" }.join( '' ) as int
which converts each element into a string, joins them all together, and then parses the resultant String into an int
def sb = new StringBuilder()
[0,9].each{
sb.append(it)
}
assert sb.toString() == "09"
Based on your comments on other answers, this should get you going:
def a = [ 0, 9, 2 ]
int a1 = a.join('') as int
assert a1 == 92
As you can see from the other answers, there are many ways to accomplish what you want. Just use the one that best fits your coding style.
You already have plenty of options, but just to add to the confusion, here's another one:
int[] a = [1,9]
Integer number = a.toList().join().toInteger()
// test it
assert number == 19