Evenly distribute repetitive strings - string

I need to distribute a set of repetitive strings as evenly as possible.
Is there any way to do this better then simple shuffling using unsort? It can't do what I need.
For example if the input is
aaa
aaa
aaa
bbb
bbb
The output I need
aaa
bbb
aaa
bbb
aaa
The number of repetitive strings doesn't have any limit as well as the number of the reps of any string.
The input can be changed to list string number_of_reps
aaa 3
bbb 2
... .
zzz 5
Is there an existing tool, Perl module or algorithm to do this?

Abstract: Given your description of how you determine an “even distribution”, I have written an algorithm that calculates a “weight” for each possible permutation. It is then possible to brute-force the optimal permutation.
Weighing an arrangement of items
By "evenly distribute" I mean that intervals between each two occurrences of a string and the interval between the start point and the first occurrence of the string and the interval between the last occurrence and the end point must be as much close to equal as possible where 'interval' is the number of other strings.
It is trivial to count the distances between occurrences of strings. I decided to count in a way that the example combination
A B A C B A A
would give the count
A: 1 2 3 1 1
B: 2 3 3
C: 4 4
I.e. Two adjacent strings have distance one, and a string at the start or the end has distance one to the edge of the string. These properties make the distances easier to calculate, but are just a constant that will be removed later.
This is the code for counting distances:
sub distances {
my %distances;
my %last_seen;
for my $i (0 .. $#_) {
my $s = $_[$i];
push #{ $distances{$s} }, $i - ($last_seen{$s} // -1);
$last_seen{$s} = $i;
}
push #{ $distances{$_} }, #_ - $last_seen{$_} for keys %last_seen;
return values %distances;
}
Next, we calculate the standard variance for each set of distances. The variance of one distance d describes how far they are off from the average a. As it is squared, large anomalies are heavily penalized:
variance(d, a) = (a - d)²
We get the standard variance of a data set by summing the variance of each item, and then calculating the square root:
svar(items) = sqrt ∑_i variance(items[i], average(items))
Expressed as Perl code:
use List::Util qw/sum min/;
sub svar (#) {
my $med = sum(#_) / #_;
sqrt sum map { ($med - $_) ** 2 } #_;
}
We can now calculate how even the occurrences of one string in our permutation are, by calculating the standard variance of the distances. The smaller this value is, the more even the distribution is.
Now we have to combine these weights to a total weight of our combination. We have to consider the following properties:
Strings with more occurrences should have greater weight that strings with fewer occurrences.
Uneven distributions should have greater weight than even distributions, to strongly penalize unevenness.
The following can be swapped out by a different procedure, but I decided to weigh each standard variance by raising it to the power of occurrences, then adding all weighted svariances:
sub weigh_distance {
return sum map {
my #distances = #$_; # the distances of one string
svar(#distances) ** $#distances;
} distances(#_);
}
This turns out to prefer good distributions.
We can now calculate the weight of a given permutation by passing it to weigh_distance. Therefore, we can decide if two permutations are equally well distributed, or if one is to be prefered:
Selecting optimal permutations
Given a selection of permuations, we can select those permutations that are optimal:
sub select_best {
my %sorted;
for my $strs (#_) {
my $weight = weigh_distance(#$strs);
push #{ $sorted{$weight} }, $strs;
}
my $min_weight = min keys %sorted;
#{ $sorted{$min_weight} }
}
This will return at least one of the given possibilities. If the exact one is unimportant, an arbitrary element of the returend array can be selected.
Bug: This relies on stringification of floats, and is therefore open to all kinds of off-by-epsilon errors.
Creating all possible permutations
For a given multiset of strings, we want to find the optimal permutation. We can think of the available strings as a hash mapping the strings to the remaining avaliable occurrences. With a bit of recursion, we can build all permutations like
use Carp;
# called like make_perms(A => 4, B => 1, C => 1)
sub make_perms {
my %words = #_;
my #keys =
sort # sorting is important for cache access
grep { $words{$_} > 0 }
grep { length or carp "Can't use empty strings as identifiers" }
keys %words;
my ($perms, $ok) = _fetch_perm_cache(\#keys, \%words);
return #$perms if $ok;
# build perms manually, if it has to be.
# pushing into #$perms directly updates the cached values
for my $key (#keys) {
my #childs = make_perms(%words, $key => $words{$key} - 1);
push #$perms, (#childs ? map [$key, #$_], #childs : [$key]);
}
return #$perms;
}
The _fetch_perm_cache returns an ref to a cached array of permutations, and a boolean flag to test for success. I used the following implementation with deeply nested hashes, that stores the permutations on leaf nodes. To mark the leaf nodes, I have used the empty string—hence the above test.
sub _fetch_perm_cache {
my ($keys, $idxhash) = #_;
state %perm_cache;
my $pointer = \%perm_cache;
my $ok = 1;
$pointer = $pointer->{$_}[$idxhash->{$_}] //= do { $ok = 0; +{} } for #$keys;
$pointer = $pointer->{''} //= do { $ok = 0; +[] }; # access empty string key
return $pointer, $ok;
}
That not all strings are valid input keys is no issue: every collection can be enumerated, so make_perms could be given integers as keys, which are translated back to whatever data they represent by the caller. Note that the caching makes this non-threadsafe (if %perm_cache were shared).
Connecting the pieces
This is now a simple matter of
say "#$_" for select_best(make_perms(A => 4, B => 1, C => 1))
This would yield
A A C A B A
A A B A C A
A C A B A A
A B A C A A
which are all optimal solutions by the used definition. Interestingly, the solution
A B A A C A
is not included. This could be a bad edge case of the weighing procedure, which strongly favours putting occurrences of rare strings towards the center. See Futher work.
Completing the test cases
Preferable versions are first: AABAA ABAAA, ABABACA ABACBAA(two 'A' in a row), ABAC ABCA
We can run these test cases by
use Test::More tests => 3;
my #test_cases = (
[0 => [qw/A A B A A/], [qw/A B A A A/]],
[1 => [qw/A B A C B A A/], [qw/A B A B A C A/]],
[0 => [qw/A B A C/], [qw/A B C A/]],
);
for my $test (#test_cases) {
my ($correct_index, #cases) = #$test;
my $best = select_best(#cases);
ok $best ~~ $cases[$correct_index], "[#{$cases[$correct_index]}]";
}
Out of interest, we can calculate the optimal distributions for these letters:
my #counts = (
{ A => 4, B => 1 },
{ A => 4, B => 2, C => 1},
{ A => 2, B => 1, C => 1},
);
for my $count (#counts) {
say "Selecting best for...";
say " $_: $count->{$_}" for keys %$count;
say "#$_" for select_best(make_perms(%$count));
}
This brings us
Selecting best for...
A: 4
B: 1
A A B A A
Selecting best for...
A: 4
C: 1
B: 2
A B A C A B A
Selecting best for...
A: 2
C: 1
B: 1
A C A B
A B A C
C A B A
B A C A
Further work
Because the weighing attributes the same importance to the distance to the edges as to the distance between letters, symmetrical setups are preferred. This condition could be eased by reducing the value of the distance to the edges.
The permutation generation algorithm has to be improved. Memoization could lead to a speedup. Done! The permutation generation is now 50× faster for synthetic benchmarks, and can access cached input in O(n), where n is the number of different input strings.
It would be great to find a heuristic to guide the permutation generation, instead of evaluating all posibilities. A possible heuristic would consider whether there are enough different strings available that no string has to neighbour itself (i.e. distance 1). This information could be used to narrow the width of the search tree.
Transforming the recursive perm generation to an iterative solution would allow to interweave searching with weight calculation, which would make it easier to skip or defer unfavourable solutions.
The standard variances are raised to the power of the occurrences. This is probably not ideal, as a large deviation for a large number of occurrences weighs lighter than a small deviation for few occurrences, e.g.
weight(svar, occurrences) → weighted_variance
weight(0.9, 10) → 0.35
weight(0.5, 1) → 0.5
This should in fact be reversed.
Edit
Below is a faster procedure that approximates a good distribution. In some cases, it will yield the correct solution, but this is not generally the case. The output is bad for inputs with many different strings where most have very few occurrences, but is generally acceptable where only few strings have few occurrences. It is significantly faster than the brute-force solution.
It works by inserting strings at regular intervals, then spreading out avoidable repetitions.
sub approximate {
my %def = #_;
my ($init, #keys) = sort { $def{$b} <=> $def{$a} or $a cmp $b } keys %def;
my #out = ($init) x $def{$init};
while(my $key = shift #keys) {
my $visited = 0;
for my $parts_left (reverse 2 .. $def{$key} + 1) {
my $interrupt = $visited + int((#out - $visited) / $parts_left);
splice #out, $interrupt, 0, $key;
$visited = $interrupt + 1;
}
}
# check if strings should be swapped
for my $i ( 0 .. $#out - 2) {
#out[$i, $i + 1] = #out[$i + 1, $i]
if $out[$i] ne $out[$i + 1]
and $out[$i + 1] eq $out[$i + 2]
and (!$i or $out[$i + 1 ] ne $out[$i - 1]);
}
return #out;
}
Edit 2
I generalized the algorithm for any objects, not just strings. I did this by translating the input to an abstract representation like “two of the first thing, one of the second”. The big advantage here is that I only need integers and arrays to represent the permutations. Also, the cache is smaller, because A => 4, C => 2, C => 4, B => 2 and $regex => 2, $fh => 4 represent the same abstract multisets. The speed penalty incurred by the neccessity to transform data between the external, internal, and cache representations is roughly balanced by the reduced number of recursions.
The large bottleneck is in the select_best sub, which I have largely rewritten in Inline::C (still eats ~80% of execution time).
These issues go a bit beyond the scope of the original question, so I won't paste the code in here, but I guess I'll make the project available via github once I've ironed out the wrinkles.

Related

Looking for a specific combination algorithm to solve a problem

Let’s say I have a purchase total and I have a csv file full of purchases where some of them make up that total and some don’t. Is there a way to search the csv to find the combination or combinations of purchases that make up that total ? Let’s say the purchase total is 155$ and my csv file has the purchases [5.00$,40.00$,7.25$,$100.00,$10.00]. Is there an algorithm that will tell me the combinations of the purchases that make of the total ?
Edit: I am still having trouble with the solution you provided. When I feed this spreadsheet with pandas into the code snippet you provided it only shows one solution equal to 110.04$ when there are three. It is like it is stopping early without finding the final solutions.This is the output that I have from the terminal - [57.25, 15.87, 13.67, 23.25]. The output should be [10.24,37.49,58.21,4.1] and [64.8,45.24] and [57.25,15.87,13.67,23.25]
from collections import namedtuple
import pandas
df = pandas.read_csv('purchases.csv',parse_dates=["Date"])
from collections import namedtuple
values = df["Purchase"].to_list()
S = 110.04
Candidate = namedtuple('Candidate', ['sum', 'lastIndex', 'path'])
tuples = [Candidate(0, -1, [])]
while len(tuples):
next = []
for (sum, i, path) in tuples:
# you may range from i + 1 if you don't want repetitions of the same purchase
for j in range(i+1, len(values)):
v = values[j]
# you may check for strict equality if no purchase is free (0$)
if v + sum <= S:
next.append(Candidate(sum = v + sum, lastIndex = j, path = path + [v]))
if v + sum == S :
print(path + [v])
tuples = next
A dp solution:
Let S be your goal sum
Build all 1-combinations. Keep those which sums less or equal than S. Whenever one equals S, output it
Build all 2-combinations reusing the previous ones.
Repeat
from collections import namedtuple
values = [57.25,15.87,13.67,23.25,64.8,45.24,10.24,37.49,58.21,4.1]
S = 110.04
Candidate = namedtuple('Candidate', ['sum', 'lastIndex', 'path'])
tuples = [Candidate(0, -1, [])]
while len(tuples):
next = []
for (sum, i, path) in tuples:
# you may range from i + 1 if you don't want repetitions of the same purchase
for j in range(i + 1, len(values)):
v = values[j]
# you may check for strict equality if no purchase is free (0$)
if v + sum <= S:
next.append(Candidate(sum = v + sum, lastIndex = j, path = path + [v]))
if abs(v + sum - S) <= 1e-2 :
print(path + [v])
tuples = next
More detail about the tuple structure:
What we want to do is to augment a tuple with a new value.
Assume we start with some tuple with only one value, say the tuple associated to 40.
its sum is trivially 40
the last index added is 1 (it is the number 40 itself)
the used values is [40], since it is the sole value.
Now to generate the next tuples, we will iterate from the last index (1), to the end of the array.
So candidates are 7.25, 100.00, 10.00
The new tuple associated to 7.25 is:
sum: 40 + 7.25
last index: 2 (7.25 has index 2 in array)
used values: values of tuple union 7.25, so [40, 7.25]
The purpose of using the last index, is to avoid considering [7.25, 40] and [40, 7.25]. Indeed they would be the same combination
So to generate tuples from an old one, only consider values occurring 'after' the old one from the array
At every step, we thus have tuples of the same size, each of them aggregates the values taken, the sum it amounts to, and the next values to consider to augment it to a bigger size
edit: to handle floats, you may replace (v+sum)<=S by abs(v+sum - S)<=1e-2 to say a solution is reach when you are very close (here distance arbitrarily set to 0.01) to solution
edit2: same code here as in https://repl.it/repls/DrearyWindingHypertalk (which does give
[64.8, 45.24]
[57.25, 15.87, 13.67, 23.25]
[10.24, 37.49, 58.21, 4.1]

How to implement a sequence of numbers, each consecutive pair adds to 4?

I have a sequence (seq) of numbers.
I want the addition of each consecutive pair of numbers to equal 4.
Below is my attempt at implementing this. But, it is wrong. The Alloy Analyzer showed me it's wrong, by generating this instance:
2, 2, -2, 4
The first pair adds to 4. (2 + 2 = 4)
The second pair does not. (2 + -2 = 0)
What is the correct way to implement this? Note: I need to use sequences (seq), so please don't change the signature or its field. I am hoping that you can show me the correct way to express the fact. Or, tell me that it's impossible to implement given the use of seq.
one sig Test {
numbers: seq Int
}
fact {
all disj n, n': Test.numbers.elems {
(plus[Test.numbers.idxOf[n], 1] = Test.numbers.idxOf[n']) =>
plus[n, n'] = 4
}
}
run {#Test.numbers.indsOf[2] > 1}
To explain why your fact is incorrect, consider the following counterexample: the Test.numbers sequence is 2, 2, 2, 4.
In that counterexample:
Test.numbers.elems evaluates to 2, 4
Test.numbers.idxOf[2] is 0 (the first index of element 2)
Test.numbers.idxOf[4] is 3
there are no two disjoint n and n' in Test.numbers.elems (i.e., {2, 4}) such that plus[Test.numbers.idxOf[n], 1] = Test.numbers.idxOf[n'] so the fact trivially holds.
The following fact should express your desired property correctly:
fact {
all i: Test.numbers.inds - (#Test.numbers).prev |
plus[Test.numbers[i], Test.numbers[i.next]] = 4
}
mySeq.inds evaluates to indexes of the sequence mySeq
i.next evaluates to i + 1
i.prev evaluates to i - 1

Algorithm for generating all string combinations

Say I have a list of strings, like so:
strings = ["abc", "def", "ghij"]
Note that the length of a string in the list can vary.
The way you generate a new string is to take one letter from each element of the list, in order. Examples: "adg" and "bfi", but not "dch" because the letters are not in the same order in which they appear in the list. So in this case where I know that there are only three elements in the list, I could fairly easily generate all possible combinations with a nested for loop structure, something like this:
for i in strings[0].length:
for ii in strings[1].length:
for iii in strings[2].length:
print(i+ii+iii)
The issue arises for me when I don't know how long the list of strings is going to be beforehand. If the list is n elements long, then my solution requires n for loops to succeed.
Can any one point me towards a relatively simple solution? I was thinking of a DFS based solution where I turn each letter into a node and creating a connection between all letters in adjacent strings, but this seems like too much effort.
In python, you would use itertools.product
eg.:
>>> for comb in itertools.product("abc", "def", "ghij"):
>>> print(''.join(comb))
adg
adh
adi
adj
aeg
aeh
...
Or, using an unpack:
>>> words = ["abc", "def", "ghij"]
>>> print('\n'.join(''.join(comb) for comb in itertools.product(*words)))
(same output)
The algorithm used by product is quite simple, as can be seen in its source code (Look particularly at function product_next). It basically enumerates all possible numbers in a mixed base system (where the multiplier for each digit position is the length of the corresponding word). A simple implementation which only works with strings and which does not implement the repeat keyword argument might be:
def product(words):
if words and all(len(w) for w in words):
indices = [0] * len(words)
while True:
# Change ''.join to tuple for a more accurate implementation
yield ''.join(w[indices[i]] for i, w in enumerate(words))
for i in range(len(indices), 0, -1):
if indices[i - 1] == len(words[i - 1]) - 1:
indices[i - 1] = 0
else:
indices[i - 1] += 1
break
else:
break
From your solution it seems that you need to have as many for loops as there are strings. For each character you generate in the final string, you need a for loop go through the list of possible characters. To do that you can make recursive solution. Every time you go one level deep in the recursion, you just run one for loop. You have as many level of recursion as there are strings.
Here is an example in python:
strings = ["abc", "def", "ghij"]
def rec(generated, k):
if k==len(strings):
print(generated)
return
for c in strings[k]:
rec(generated + c, k+1)
rec("", 0)
Here's how I would do it in Javascript (I assume that every string contains no duplicate characters):
function getPermutations(arr)
{
return getPermutationsHelper(arr, 0, "");
}
function getPermutationsHelper(arr, idx, prefix)
{
var foundInCurrent = [];
for(var i = 0; i < arr[idx].length; i++)
{
var str = prefix + arr[idx].charAt(i);
if(idx < arr.length - 1)
{
foundInCurrent = foundInCurrent.concat(getPermutationsHelper(arr, idx + 1, str));
}
else
{
foundInCurrent.push(str);
}
}
return foundInCurrent;
}
Basically, I'm using a recursive approach. My base case is when I have no more words left in my array, in which case I simply add prefix + c to my array for every c (character) in my last word.
Otherwise, I try each letter in the current word, and pass the prefix I've constructed on to the next word recursively.
For your example array, I got:
adg adh adi adj aeg aeh aei aej afg afh afi afj bdg bdh bdi
bdj beg beh bei bej bfg bfh bfi bfj cdg cdh cdi cdj ceg ceh
cei cej cfg cfh cfi cfj

Scala String Similarity

I have a Scala code that computes similarity between a set of strings and give all the unique strings.
val filtered = z.reverse.foldLeft((List.empty[String],z.reverse)) {
case ((acc, zt), zz) =>
if (zt.tail.exists(tt => similarity(tt, zz) < threshold)) acc
else zz :: acc, zt.tail
}._1
I'll try to explain what is going on here :
This uses a fold over the reversed input data, starting from the empty String (to accumulate results) and the (reverse of the) remaining input data (to compare against - I labeled it zt for "z-tail").
The fold then cycles through the data, checking each entry against the tail of the remaining data (so it doesn't get compared to itself or any earlier entry)
If there is a match, just the existing accumulator (labelled acc) will be allowed through, otherwise, add the current entry (zz) to the accumulator. This updated accumulator is paired with the tail of the "remaining" Strings (zt.tail), to ensure a reducing set to compare against.
Finally, we end up with a pair of lists: the required remaining Strings, and an empty list (no Strings left to compare against), so we take the first of these as our result.
The problem is like in first iteration, if 1st, 4th and 8th strings are similar, I am getting only the 1st string. Instead of it, I should get a set of (1st,4th,8th), then if 2nd,5th,14th and 21st strings are similar, I should get a set of (2nd,5th,14th,21st).
If I understand you correctly - you want the result to be of type List[List[String]] and not the List[String] you are getting now - where each item is a list of similar Strings (right?).
If so - I can't see a trivial change to your implementation that would achieve this, as the similar values are lost (when you enter the if(true) branch and just return the acc - you skip an item and you'll never "see" it again).
Two possible solutions I can think of:
Based on your idea, but using a 3-Tuple of the form (acc, zt, scanned) as the foldLeft result type, where the added scanned is the list of already-scanned items. This way we can refer back to them when we find an element that doesn't have preceeding similar elements:
val filtered = z.reverse.foldLeft((List.empty[List[String]],z.reverse,List.empty[String])) {
case ((acc, zt, scanned), zz) =>
val hasSimilarPreceeding = zt.tail.exists { tt => similarity(tt, zz) < threshold }
val similarFollowing = scanned.collect { case tt if similarity(tt, zz) < threshold => tt }
(if (hasSimilarPreceeding) acc else (zz :: similarFollowing) :: acc, zt.tail, zz :: scanned)
}._1
A probably-slower but much simpler solution would be to just groupBy the group of similar strings:
val alternative = z.groupBy(s => z.collect {
case other if similarity(s, other) < threshold => other
}.toSet ).values.toList
All of this assumes that the function:
f(a: String, b: String): Boolean = similarity(a, b) < threshold
Is commutative and transitive, i.e.:
f(a, b) && f(a. c) means that f(b, c)
f(a, b) if and only if f(b, a)
To test both implementations I used:
// strings are similar if they start with the same character
def similarity(s1: String, s2: String) = if (s1.head == s2.head) 0 else 100
val threshold = 1
val z = List("aa", "ab", "c", "a", "e", "fa", "fb")
And both options produce the same results:
List(List(aa, ab, a), List(c), List(e), List(fa, fb))

Interesting strings algorithm

Given two finite sequences of string, A and B, of length n each,
for example:
A1: "kk", A2: "ka", A3: "kkk", A4: "a"
B1: "ka", B2: "kakk", B3: "ak", B4: "k"
Give a finite sequences of indexes so that their concentration for A
and B gives the same string. Repetitions allowed.
In this example I can't find the solution but for example if the list (1,2,2,4) is a solution then A1 + A2 + A2 + A4 = B1 + B2 + B2 + B4. In this example there are only two characters but it's already very difficult. Actually it's not even trivial to find the shortest solution with one character!
I tried to think of things.. for example the total sum of the length of the strings must be equal and the for the first and last string we need corresponding characters. But nothing else. I suppose for some set of strings it's simply impossible. Anyone can think of a good algorithm?
EDIT: Apparently, this is the Post Correspondence Problem
There is no algorithm that can decide whether a such an instance has a solution or not. If there were, the halting problem could be solved. Dirty trick...
Very tough question, but I'll give it a shot. This is more of a stream of consciousness than an answer, apologies in advance.
If I understand this correctly, you're given 2 equal sized sequences of strings, A and B, indexed from 1..n, say. You then have to find a sequence of indices such that the concatenation of strings A(1)..A(m) equals the concatenation of strings B(1)..B(m) where m is the length of the sequence of indices.
The first thing I would observe is that there could be an infinite number of solutions. For example, given:
A { "x", "xx" }
B { "xx", "x" }
Possible solutions are:
{ 1, 2 }
{ 2, 1 }
{ 1, 2, 1, 2 }
{ 1, 2, 2, 1 }
{ 2, 1, 1, 2 }
{ 2, 1, 2, 1 }
{ 1, 2, 1, 2, 1, 2}
...
So how would you know when to stop? As soon as you had one solution? As soon as one of the solutions is a superset of another solution?
One place you could start would be by taking all the strings of minimum common length from both sets (in my example above, you would take the "x" from both, and searching for 2 equal strings that share a common index. You can then repeat this for strings of the next size up. For example, if the first set has 3 strings of length 1, 2 and 3 respectively, and the second set has strings of length 1, 3 and 3 respectively, you would take the strings of length 3. You would do this until you have no more strings. If you find any, then you have a solution to the problem.
It then gets harder when you have to start combining several strings as in my example above. The naive, brute force approach would be to start permuting all strings from both sets that, when concatenated, result in strings of the same length, then compare them. So in the below example:
A { "ga", "bag", "ac", "a" }
B { "ba", "g", "ag", "gac" }
You would start with sequences of length 2:
A { "ga", "ac" }, B { "ba", "ag" } (indices 1, 3)
A { "bag", "a" }, B { "g", "gac" } (indices 2, 4)
Comparing these gives "gaac" vs "baag" and "baga" vs "ggac", neither of which are equal, so there are no solutions there. Next, we would go for sequences of length 3:
A { "ga", "bag", "a" }, B { "ba", "g", "gac" } (indices 1, 2, 4)
A { "bag", "ac", "a" }, B { "g", "ag", "gac" } (indices 2, 3, 4)
Again, no solutions, so then we end up with sequences of size 4, of which we have no solutions.
Now it gets even trickier, as we have to start thinking about perhaps repeating some indices, and now my brain is melting.
I'm thinking looking for common subsequences in the strings might be helpful, and then using the remaining parts in the strings that were not matched. But I don't quite know how.
A very simple way is to just use something like a breadth-first search. This also has the advantage that the first solution found will have minimal size.
It is not clear what the 'solution' you are looking for is, the longest solution? the shortest? all solutions?
Since you allow repetition there will an infinite number of solutions for some inputs so I will work on:
Find all sequences under a fixed length.
Written as a pseudo code but in a manner very similar to f# sequence expressions
// assumed true/false functions
let Eq aList bList =
// eg Eq "ab"::"c" "a" :: "bc" -> true
// Eq {} {} is _false_
let EitherStartsWith aList bList =
// eg "ab"::"c" "a" :: "b" -> true
// eg "a" "ab" -> true
// {} {} is _true_
let rec FindMatches A B aList bList level
= seq {
if level > 0
if Eq aList bList
yield aList
else if EitherStartsWith aList bList
Seq.zip3 A B seq {1..}
|> Seq.iter (func (a,b,i) ->
yield! FindMatches A B aList::(a,i) bList::(b,i) level - 1) }
let solution (A:seq<string>) (B:seq<string>) length =
FindMatches A B {} {} length
Some trivial constraints to reduce the problem:
The first selection pair must have a common start section.
the final selection pair must have a common end section.
Based on this we can quickly eliminate many inputs with no solution
let solution (A:seq<string>) (B:seq<string>) length =
let starts = {}
let ends = {}
Seq.zip3 A B seq {1..}
|> Seq.iter(fun (a,b,i) ->
if (a.StartsWith(b) or b.StartsWith(a))
start = starts :: (a,b,i)
if (a.EndsWith(b) or b.EndsWith(a))
ends = ends :: (a,b,i))
if List.is_empty starts || List.is_empty ends
Seq.empty // no solution
else
Seq.map (fun (a,b,i) ->
FindMatches A B {} :: (a,i) {} :: (b,i) length - 1)
starts
|> Seq.concat
Here's a suggestion for a brute force search. First generate number sequences bounded to the length of your list:
[0,0,..]
[1,0,..]
[2,0,..]
[3,0,..]
[0,1,..]
...
The number sequence length determines how many strings are going to be in any solution found.
Then generate A and B strings by using the numbers as indexes into your string lists:
public class FitSequence
{
private readonly string[] a;
private readonly string[] b;
public FitSequence(string[] a, string[] b)
{
this.a = a;
this.b = b;
}
private static string BuildString(string[] source, int[] indexes)
{
var s = new StringBuilder();
for (int i = 0; i < indexes.Length; ++i)
{
s.Append(source[indexes[i]]);
}
return s.ToString();
}
public IEnumerable<int[]> GetSequences(int length)
{
foreach (var numberSequence in new NumberSequence(length).GetNumbers(a.Length - 1))
{
string a1 = BuildString(a, numberSequence);
string b1 = BuildString(b, numberSequence);
if (a1 == b1)
yield return numberSequence;
}
}
}
This algorithm assumes equal lengths for A and B.
I tested your example with
static void Main(string[] args)
{
var a = new[] {"kk", "ka", "kkk", "a"};
var b = new[] {"ka", "kakk", "ak", "k"};
for (int i = 0; i < 100; ++i)
foreach (var sequence in new FitSequence(a, b).GetSequences(i))
{
foreach (int x in sequence)
Console.Write("{0} ", x);
Console.WriteLine();
}
}
but could not find any solutions, though it seemed to work for simple tests.

Resources