Shortest string containing the most occurrences of strings in a set - string

Define the degree of a string M to be the number of times it appears in another string S. For example, if M = "aba" and S = "ababa", the degree of M is 2. Given a set of strings and an integer N, find the string of minimum length such that the sum of the degrees of all strings in the set is at least N.
For example, given the set {"ab", "bd", "abd", "babd", "abc"} and N = 4, the answer is "babd". It contains "ab", "abd", "babd" and "bd" once each.
N <= 100, M <= 100, length of every string in the set <= 100. Strings in the set only consist of uppercase and lowercase letters.
How can this problem be solved? It looks similar to the shortest superstring problem, which has a dynamic programming solution with exponential complexity. However, the constraints in this problem are much larger, so the same idea won't work here. Is there some string data structure that can be applied here?
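The degree definition above counts overlapping occurrences ("aba" appears twice in "ababa"). A minimal Python sketch of that counting, assuming the set contains only non-empty strings; `degree` and `total_degree` are illustrative names, not part of the original problem:

```python
def degree(m: str, s: str) -> int:
    """Count occurrences of m in s, allowing overlaps."""
    if not m:
        return 0
    count = 0
    start = s.find(m)
    while start != -1:
        count += 1
        # Advance by one character only, so overlapping matches are found too.
        start = s.find(m, start + 1)
    return count


def total_degree(strings, s: str) -> int:
    """Sum of the degrees of every string in the set with respect to s."""
    return sum(degree(m, s) for m in strings)
```

With the sample set, `total_degree(["ab", "bd", "abd", "babd", "abc"], "babd")` evaluates to 4, matching the example in the question.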

I have a polynomial time algorithm, which I'm too lazy to code. But I'll describe it for you.
First, make each string in the set plus the empty string be the nodes of a graph. The empty string is connected to each other string, and vice versa. If the end of one string overlaps with the start of another, they also connect. If two can overlap by different amounts, they get multiple edges. (So it is not exactly a graph...)
Each edge gets a cost and a value. The cost is how many characters you have to extend the string you are building by to move from the old end to the new end (in other words, the length of the second string minus the length of the overlap). The value is how many new strings you complete that cross the barrier between the former and the latter string.
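To make the cost and value definitions concrete, here is a rough Python sketch that computes the (value, cost) pairs for moving from one node to another. The names `transitions` and `count_ending_after` are hypothetical. It assumes that "value" counts every occurrence of a set member that ends past the end of the old string. It also enumerates every possible overlap length, so it may produce zero-cost edges that a hand-written listing leaves out.

```python
def count_ending_after(strings, text, boundary):
    """Count occurrences of set members in text that end past index `boundary`."""
    total = 0
    for m in strings:
        start = text.find(m)
        while start != -1:
            if start + len(m) > boundary:
                total += 1
            start = text.find(m, start + 1)
    return total


def transitions(strings, a, b):
    """Return one (value, cost) pair per way of moving from node a to node b."""
    if a == "":
        overlaps = [0]  # from the empty string there is nothing to overlap with
    else:
        overlaps = [k for k in range(1, min(len(a), len(b)) + 1)
                    if a.endswith(b[:k])]
    edges = []
    for k in overlaps:
        merged = a + b[k:]   # the string after extending a so that it ends with b
        cost = len(b) - k    # characters actually appended
        value = count_ending_after(strings, merged, len(a))
        edges.append((value, cost))
    return edges
```

For the example set from the question, `transitions(S, "ab", "bd")` gives `[(2, 1)]` and `transitions(S, "", "babd")` gives `[(4, 4)]`.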
Your example was {"ab", "bd", "abd", "babd", "abc"}. Here are the (value, cost) pairs for each transition.
from -> to : (value, cost)
"" -> "ab": ( 1, 2)
"" -> "bd": ( 1, 2)
"" -> "abd": ( 3, 3) # we added "ab", "bd" and "abd"
"" -> "babd": ( 4, 4) # we get "ab", "bd", "abd" and "babd"
"" -> "abc": ( 2, 3) # we get "ab" and "abc"
"ab" -> "": ( 0, 0)
"ab" -> "bd": ( 2, 1) # we added "abd" and "bd" for 1 character
"ab" -> "abd": ( 2, 1) # ditto
"ab" -> "abc": ( 1, 1) # we only added "abc"
"bd" -> "": ( 0, 0) # only empty, nothing else starts "bd"
"abd" -> "": ( 0, 0)
"babd" -> "": ( 0, 0)
"babd" -> "abd": ( 0, 0) # overlapped, but added nothing.
"abc" -> "": ( 0, 0)
OK, all of that is setup. Why did we want this graph?
Well note that if we start at "" with a cost of 0 and a value of 0, then take a path through the graph, that constructs a string. It correctly states the cost, and provides a lower bound on the value. The value can be higher. For example if your set were {"ab", "bc", "cd", "abcd"} then the path "" -> "ab" -> "bc" -> "cd" would lead to the string "abcd" with a cost of 4 and a predicted value of 3. But that value estimate missed the fact that we matched "abcd".
However for any given string made up only of substrings from the set, there is a path through the graph that has the correct cost and the correct value. (At each choice you want to pick the earliest starting matching string that you have not yet counted, and of those pick the longest of them. Then you never miss any matches.)
So we've turned our problem from constructing strings to constructing paths through a graph. What we want to do is build up the following data structure:
for each (value, node) combination:
(best cost, previous node, previous value)
Filling in that data structure is a dynamic programming problem. Once filled in we can just trace back through it to find what path in the graph got us to that value with that cost. Given that path, we can figure out the string that did it.
How fast is it? If our set has K strings then we only need to fill in K * N values, each of which we can give a maximum of K candidates for new values. Which makes the path finding a O(K^2 * N) problem.

So here is my approach. At first iteration we construct a pool out of the initial strings.
After that:
1. We select from the pool a string of minimal length whose sum of degrees is at least N. If we find such a string, we return it.
2. We filter out of the pool all strings whose degree is less than the maximum, so that we work only with the best string combinations found so far.
3. We construct all variants from the current pool and the initial strings, taking into account that strings can overlap. For example, the string "aba" and "ab" (from the initial strings) could produce: ababa, abab, abaab (we do not include "aba" because it is already in the pool and we need to move forward).
4. We filter out duplicates, and this becomes our next pool.
5. Repeat everything from step 1.
The FindTarget() method accepts the target sum as a parameter. FindTarget(4) will solve the sample task.
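using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public class Solution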
{
/// <summary>
/// The initial strings.
/// </summary>
string[] stringsSet;
Tuple<string, string>[][] splits;
public Solution(string[] strings)
{
stringsSet = strings;
splits = stringsSet.Select(s => ProduceItemSplits(s)).ToArray();
}
/// <summary>
/// Find the optimal string.
/// </summary>
/// <param name="N">Target degree.</param>
/// <returns></returns>
public string FindTarget(int N)
{
var pool = stringsSet;
while (true)
{
var poolWithDegree = pool.Select(s => new { str = s, degree = GetN(s) })
.ToArray();
var maxDegree = poolWithDegree.Max(m => m.degree);
var optimalString = poolWithDegree
.Where(w => w.degree >= N)
.OrderBy(od => od.str.Length)
.FirstOrDefault();
if (optimalString != null) return optimalString.str; // We found it
var nextPool = poolWithDegree.Where(w => w.degree == maxDegree)
.SelectMany(sm => ExpandString(sm.str))
.Distinct()
.ToArray();
pool = nextPool;
}
}
/// <summary>
/// Get degree.
/// </summary>
/// <param name="candidate"></param>
/// <returns></returns>
public int GetN(string candidate)
{
var N = stringsSet.Select(s =>
{
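// Zero-width lookahead so that overlapping occurrences are counted ("aba" occurs twice in "ababa").
var c = Regex.Matches(candidate, "(?=" + Regex.Escape(s) + ")").Count;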
return c;
}).Sum();
return N;
}
public Tuple<string, string>[] ProduceItemSplits(string item)
{
var substings = Enumerable.Range(0, item.Length + 1)
.Select((i) => new Tuple<string, string>(item.Substring(0, i), item.Substring(i, item.Length - i))).ToArray();
return substings;
}
private IEnumerable<string> ExpandStringWithOneItem(string str, int index)
{
var item = stringsSet[index];
var itemSplits = splits[index];
var startAttachments = itemSplits.Where(w => str.StartsWith(w.Item2) && w.Item1.Length > 0)
.Select(s => s.Item1 + str);
var endAttachments = itemSplits.Where(w => str.EndsWith(w.Item1) && w.Item2.Length > 0)
.Select(s => str + s.Item2);
return startAttachments.Union(endAttachments);
}
public IEnumerable<string> ExpandString(string str)
{
var r = Enumerable.Range(0, splits.Length) // the second argument is a count, so this covers every item
.Select(s => ExpandStringWithOneItem(str, s))
.SelectMany(s => s);
return r;
}
}
static void Main(string[] args)
{
var solution = new Solution(new string[] { "ab", "bd", "abd", "babd", "abc" });
var s = solution.FindTarget(4); // the sample task from the question
Console.WriteLine(s);
}

Related

Valid binary tree from array of string

I have an array of strings. Each character in a string can be r or l only.
I have to check whether it is valid or not, as in these examples:
1. {rlr, l, r, lr, rl}
       *
     /   \
    l     r
     \   /
      r l
         \
          r
A valid tree as all nodes are present.
2. {ll, r, rl, rr}
       *
     /   \
    -     r
   /     / \
  l     l   r
Invalid tree as there is no l node.
From a given input I have to determine whether it creates a valid tree or not.
I have come up with two solutions.
1. Using a trie to store the input and marking each node as valid or not during insertion.
2. Sort the input array according to length.
So for the first case it will be {l, r, lr, rl, rlr}.
Then I create a set of strings to hold all the input.
If a string has length more than 1 (for rlr: r, rl), I consider all its prefixes starting from index 0 and check for them in the set. If any prefix is not present in the set, I return false.
I am wondering if there is a more optimal solution or any modification in the above methods.
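The second idea above (every proper prefix of every path must itself be in the input) can be sketched in a few lines of Python. This is only an illustration under the assumption that the root is implicit and never listed; `is_valid_tree` is a made-up name. With a hash set the sorting step is not actually needed.

```python
def is_valid_tree(paths):
    """True if every non-empty proper prefix of every path is also a path."""
    nodes = set(paths)
    return all(
        path[:i] in nodes
        for path in nodes
        for i in range(1, len(path))  # proper prefixes of length 1 .. len-1
    )
```

For the two examples in the question, `is_valid_tree(["rlr", "l", "r", "lr", "rl"])` is true and `is_valid_tree(["ll", "r", "rl", "rr"])` is false. The work is proportional to the sum of the squared path lengths because each prefix is sliced and hashed.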
A recursive approach with test cases,
public static void main(String[] args) {
System.out.println(new Main().isValid(new String[]{"LRL", "LRR", "LL", "LR"}));
System.out.println(new Main().isValid(new String[]{"LRL", "LRR", "LL", "LR", "L"}));
System.out.println(new Main().isValid(new String[]{"LR", "L"}));
System.out.println(new Main().isValid(new String[]{"L", "R", "LL", "LR"}));
}
public boolean isValid(String[] strs) {
Set<String> set = new HashSet<>();
int maxLength = 0;
for (String str : strs) {
set.add(str);
maxLength = Math.max(str.length(), maxLength);
}
helper(set, "L", 1, maxLength);
helper(set, "R", 1, maxLength);
return set.isEmpty();
}
private void helper(Set<String> set, String current, int len, int maxLength) {
if (!set.contains(current) || current.length() > maxLength) {
return;
}
if (set.contains(current))
set.remove(current);
helper(set, current + "L", len + 1, maxLength);
helper(set, current + "R", len + 1, maxLength);
}
Another possible solution is to actually build the tree (or trie) while maintaining a set of nodes that are still incomplete.
If you finish iterating over the list and you still have incomplete nodes then the tree isn't valid.
If the set is empty then the tree is valid.
For example, in the second tree you gave, for node ll you will also create node l, but you will add it to the incomplete set. If one of the later nodes is l, you will erase it from the set. If not, you will end the iteration with a non-empty set that contains your missing nodes.
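A rough Python sketch of the incomplete-set idea, using plain sets of path strings in place of an explicit trie (an assumption made here for brevity; the function name is hypothetical). It returns both the verdict and the set of missing nodes:

```python
def check_tree_incremental(paths):
    """Process paths in the given order, tracking ancestors not yet seen."""
    seen = set()        # nodes that appeared in the input so far
    incomplete = set()  # ancestors that were implied but not yet seen
    for path in paths:
        seen.add(path)
        incomplete.discard(path)  # this node is no longer missing
        for i in range(1, len(path)):
            prefix = path[:i]
            if prefix not in seen:
                incomplete.add(prefix)
    return (not incomplete), incomplete
```

For the second tree, `check_tree_incremental(["ll", "r", "rl", "rr"])` returns `(False, {"l"})`, which names the missing node directly.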

Generate string permutations recursively; each character appears n times

I'm trying to write an algorithm that will generate all strings of length nm, with exactly n of each number 1, 2, ... m,
For instance all strings of length 6, with exactly two 1's, two 2's and two 3's e.g. 112233, 121233,
I managed to do this with just 1's and 2's using a recursive method, but can't seem to get something that works when I introduce 3's.
When m = 2, the algorithm I have is:
generateAllStrings(int len, int K, String str)
{
if(len == 0)
{
output(str);
}
if(K > 0)
{
generateAllStrings(len - 1, K - 1, str + '2');
}
if(len > K)
{
generateAllStrings(len - 1, K, str + '1');
}
}
I've tried inserting similar conditions for the third number but the algorithm doesn't give a correct output. After that I wouldn't even know how to generalise for 4 numbers and above.
Is recursion the right thing to do? Any help would be appreciated.
One option would be to list all distinct permutations of the string 11...122...2...mm...m (n copies of each digit). There are well-known algorithms that step from one distinct permutation to the next in time proportional to the length of the string, and they'd probably be a good way to go about solving this problem.
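One such approach is the classic "next lexicographic permutation" step, which skips duplicates automatically when the input contains repeated symbols. The Python sketch below is one way to apply it here; the function names are illustrative, and joining digits into a string is only unambiguous for m below 10.

```python
def next_permutation(a):
    """Rearrange list a into its next lexicographic permutation.

    Returns False (leaving a unchanged) when a is already the last one.
    """
    i = len(a) - 2
    while i >= 0 and a[i] >= a[i + 1]:   # find the rightmost ascent
        i -= 1
    if i < 0:
        return False
    j = len(a) - 1
    while a[j] <= a[i]:                  # rightmost element larger than a[i]
        j -= 1
    a[i], a[j] = a[j], a[i]
    a[i + 1:] = reversed(a[i + 1:])      # make the tail as small as possible
    return True


def multiset_strings(n, m):
    """Yield every string with exactly n copies of each digit 1..m, in sorted order."""
    a = [d for d in range(1, m + 1) for _ in range(n)]
    while True:
        yield "".join(map(str, a))
        if not next_permutation(a):
            break
```

For n = 2 and m = 3 this yields 6!/(2!·2!·2!) = 90 strings, starting with 112233 and ending with 332211.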
To use a simple recursive algorithm, give each recursion the permutation so far (variable perm), and the number of occurrences of each digit that is still available (array count).
The snippet below generates all unique permutations for n=2 and m=4 (set: 11223344).
function permutations(n, m) {
var perm = "", count = []; // start with empty permutation
for (var i = 0; i < m; i++) count[i] = n; // set available number for each digit = n
permute(perm, count); // start recursion with "" and [n,n,n...]
function permute(perm, count) {
var done = true;
for (var i = 0; i < count.length; i++) { // iterate over all digits
if (count[i] > 0) { // more instances of digit i available
var c = count.slice(); // create hard copy of count array
--c[i]; // decrement count of digit i
permute(perm + (i + 1), c); // add digit to permutation and recurse
done = false; // digits left over: not the last step
}
}
if (done) document.write(perm + "<BR>"); // no digits left: complete permutation
}
}
permutations(2, 4);
You can easily do this using DFS (or alternatively BFS). We can define a graph in which each node holds one string, and a node is connected to every node whose string differs from it by swapping one pair of characters. This graph is connected, so we can generate the set of all nodes, which contains exactly the strings we are looking for:
set generated_strings
list nodes
nodes.add(generateInitialString(N , M))
generated_strings.add(generateInitialString(N , M))
while(!nodes.empty())
string tmp = nodes.remove(0)
for (int i in [0 , N * M))
for (int j in distinct([0 , N * M) , i))
string new = swap(tmp , i , j)
if (!generated_strings.contains(new))
nodes.add(new)
generated_strings.add(new)
//generated_strings now contains all strings that can possibly be generated.

Determine number of char movement to get word

Suppose you are given a word
"sunflower"
You can perform only one operation type on it, pick a character and move it to the front.
So for instance if you picked 'f', the word would be "fsunlower".
You can have a series of these operations.
fsunlower (moved f to front)
wfsunloer (moved w to front)
fwsunloer (moved f to front again)
The problem is to get the minimum number of operations required, given the derived word and the original word. So if input strings are "fwsunloer", "sunflower", the output would be 3.
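This problem is equivalent to: given strings A (the derived word) and B (the original word), find the longest suffix of A that is a subsequence of B. The characters of that suffix can stay where they are, and every other character has to be moved to the front once, so if k characters need to be moved we need k steps. So what we need to find is the maximum number of characters that don't need to be moved, which is the length of that longest suffix.
So for A = "ewfsunlor" and B = "sunflower" (the input used in the code below), the longest such suffix is "sunlor", which leaves 9 - 6 = 3 characters to move.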
Java code:
public static void main(String[] args) {
System.out.println(minOp("ewfsunlor", "sunflower"));
}
public static int minOp(String A, String B) {
int n = A.length() - 1;//Start from the end of String A;
int pos = B.length();
int result = 0;
while (n >= 0) {
int nxt = -1;
for (int i = pos - 1; i >= 0; i--) {
if (B.charAt(i) == A.charAt(n)) {
nxt = i;
break;
}
}
if (nxt == -1) {
break;
}
result++;
pos = nxt;
n--;
}
return B.length() - result;
}
Result:
3
Time complexity is O(n), where n is the length of the strings.
Note: this algorithm assumes that A and B contain the same multiset of characters. Otherwise, you need to check for that before using the function.

Finding minimum moves required for making 2 strings equal

This is a question from an online coding challenge (which has finished).
I just need some guidance on how to approach it.
Problem Statement:
We have two strings A and B made up of the same multiset of characters. We need to change these strings to obtain two equal strings. In each move we can perform one of the following operations:
1. swap two consecutive characters of a string
2. swap the first and the last characters of a string
A move can be performed on either string.
What is the minimum number of moves that we need in order to obtain two equal strings?
Input Format and Constraints:
The first and second lines of the input contain the two strings A and B. It is guaranteed that their multisets of characters are equal.
1 <= length(A) = length(B) <= 2000
All the input characters are between 'a' and 'z'
Output Format:
Print the minimum number of moves to the only line of the output
Sample input:
aab
baa
Sample output:
1
Explanation:
Swap the first and last character of the string aab to convert it to baa. The two strings are now equal.
EDIT: Here is my first try, but I'm getting the wrong output. Can someone tell me what is wrong with my approach?
int minStringMoves(char* a, char* b) {
int length, pos, i, j, moves=0;
char *ptr;
length = strlen(a);
for(i=0;i<length;i++) {
// Find the first occurrence of b[i] in a
ptr = strchr(a,b[i]);
pos = ptr - a;
// If its the last element, swap with the first
if(i==0 && pos == length-1) {
swap(&a[0], &a[length-1]);
moves++;
}
// Else swap from current index till pos
else {
for(j=pos;j>i;j--) {
swap(&a[j],&a[j-1]);
moves++;
}
}
// If equal, break
if(strcmp(a,b) == 0)
break;
}
return moves;
}
Take a look at this example:
aaaaaaaaab
abaaaaaaaa
Your solution: 8
aaaaaaaaab -> aaaaaaaaba -> aaaaaaabaa -> aaaaaabaaa -> aaaaabaaaa ->
aaaabaaaaa -> aaabaaaaaa -> aabaaaaaaa -> abaaaaaaaa
Proper solution: 2
aaaaaaaaab -> baaaaaaaaa -> abaaaaaaaa
You should check if swapping in the other direction would give you better result.
But sometimes you will also ruin the previous part of the string. eg:
caaaaaaaab
cbaaaaaaaa
caaaaaaaab -> baaaaaaaac -> abaaaaaaac
You need another swap here to put back the 'c' to the first place.
The proper algorithm is probably even more complex, but you can see now what's wrong in your solution.
The A* algorithm might work for this problem.
The initial node will be the original string.
The goal node will be the target string.
Each child of a node will be all possible transformations of that string.
The current cost g(x) is simply the number of transformations thus far.
The heuristic h(x) is half the number of characters in the wrong position.
Since h(x) is admissible (because a single transformation can't put more than 2 characters in their correct positions), the path to the target string will give the least number of transformations possible.
However, an elementary implementation will likely be too slow. Calculating all possible transformations of a string would be rather expensive.
Note that there's a lot of similarity between a node's siblings (its parent's children) and its children. So you may be able to just calculate all transformations of the original string and, from there, simply copy and recalculate data involving changed characters.
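The A* outline above might look roughly like this in Python. It is a sketch under two assumptions that are not spelled out in the answer: because every move is its own inverse, moving either string is treated as transforming A into B, and the heuristic is rounded up to a whole number of moves. The names `neighbours`, `heuristic` and `min_moves` are made up for the example, and the search is only practical for short strings.

```python
import heapq


def neighbours(s):
    """Yield every string reachable from s in one move."""
    chars = list(s)
    n = len(chars)
    for i in range(n - 1):                       # swap two consecutive characters
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        yield "".join(chars)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    if n > 2:                                    # swap the first and last characters
        chars[0], chars[-1] = chars[-1], chars[0]
        yield "".join(chars)
        chars[0], chars[-1] = chars[-1], chars[0]


def heuristic(s, goal):
    """Half the number of misplaced characters, rounded up (one move fixes at most two)."""
    mismatches = sum(1 for x, y in zip(s, goal) if x != y)
    return (mismatches + 1) // 2


def min_moves(start, goal):
    """Minimum number of moves from start to goal, or None if impossible."""
    if sorted(start) != sorted(goal):
        return None
    best = {start: 0}                            # cheapest known cost g(x) per string
    heap = [(heuristic(start, goal), 0, start)]  # entries are (g + h, g, string)
    while heap:
        _, g, s = heapq.heappop(heap)
        if s == goal:
            return g
        if g > best[s]:
            continue                             # stale queue entry
        for t in neighbours(s):
            if g + 1 < best.get(t, float("inf")):
                best[t] = g + 1
                heapq.heappush(heap, (g + 1 + heuristic(t, goal), g + 1, t))
    return None
```

On the examples discussed above, `min_moves("aab", "baa")` is 1, `min_moves("aaaaaaaaab", "abaaaaaaaa")` is 2, and `min_moves("caaaaaaaab", "cbaaaaaaaa")` is 3.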
You can use dynamic programming. Go over all swap possibilities while storing all the intermediate results along with the minimal number of steps it took to get there. In effect, you calculate the minimum number of steps for every string that can be obtained by applying the given rules any number of times. Once you have calculated it all, you can print the minimum number of steps needed to reach the target string. Here's sample code in JavaScript, with its usage for the "aab" and "baa" example:
function swap(str, i, j) {
var s = str.split("");
s[i] = str[j];
s[j] = str[i];
return s.join("");
}
function calcMinimumSteps(start, stepsCount)
{
    // Breadth-first search: strings are reached in order of increasing step
    // count, so the first count stored for a string is already its minimum.
    memory[start] = stepsCount;
    var queue = [start];
    while (queue.length > 0) {
        var current = queue.shift();
        var next = [swap(current, 0, current.length - 1)];
        for (var i = 0; i < current.length - 1; ++i) {
            next.push(swap(current, i, i + 1));
        }
        for (var k = 0; k < next.length; ++k) {
            if (typeof(memory[next[k]]) === "undefined") {
                memory[next[k]] = memory[current] + 1;
                queue.push(next[k]);
            }
        }
    }
}
var memory = {};
calcMinimumSteps("aab", 0);
alert("Minimum steps count: " + memory["baa"]);
Here is the Ruby logic for this problem; copy this code into an .rb file and execute it.
str1 = "education" #Sample first string
str2 = "cnatdeiou" #Sample second string
moves_count = 0
no_swap = 0
count = str1.length - 1
def ends_swap(str1,str2)
str2 = swap_strings(str2,str2.length-1,0)
return str2
end
def swap_strings(str2,cp,np)
current_string = str2[cp]
new_string = str2[np]
str2[cp] = new_string
str2[np] = current_string
return str2
end
def consecutive_swap(str,current_position, target_position)
counter=0
diff = current_position > target_position ? -1 : 1
while current_position!=target_position
new_position = current_position + diff
str = swap_strings(str,current_position,new_position)
# p "-------"
# p "CP: #{current_position} NP: #{new_position} TP: #{target_position} String: #{str}"
current_position+=diff
counter+=1
end
return counter,str
end
while(str1 != str2 && count!=0)
counter = 1
if str1[-1]==str2[0]
# p "cross match"
str2 = ends_swap(str1,str2)
else
# p "No match for #{str2}-- Count: #{count}, TC: #{str1[count]}, CP: #{str2.index(str1[count])}"
str = str2[0..count]
cp = str.rindex(str1[count])
tp = count
counter, str2 = consecutive_swap(str2,cp,tp)
count-=1
end
moves_count+=counter
# p "Step: #{moves_count}"
# p str2
end
p "Total moves: #{moves_count}"
Please feel free to suggest any improvements in this code.
Try this code. Hope this will help you.
import java.util.Scanner;

public class TwoStringIdentical {
static int lcs(String str1, String str2, int m, int n) {
int L[][] = new int[m + 1][n + 1];
int i, j;
for (i = 0; i <= m; i++) {
for (j = 0; j <= n; j++) {
if (i == 0 || j == 0)
L[i][j] = 0;
else if (str1.charAt(i - 1) == str2.charAt(j - 1))
L[i][j] = L[i - 1][j - 1] + 1;
else
L[i][j] = Math.max(L[i - 1][j], L[i][j - 1]);
}
}
return L[m][n];
}
static void printMinTransformation(String str1, String str2) {
int m = str1.length();
int n = str2.length();
int len = lcs(str1, str2, m, n);
System.out.println((m - len)+(n - len));
}
public static void main(String[] args) {
Scanner scan = new Scanner(System.in);
String str1 = scan.nextLine();
String str2 = scan.nextLine();
printMinTransformation(str1, str2); // use the two lines read from the input
}
}

Is there a circular hash function?

Thinking about this question on testing string rotation, I wondered: Is there such a thing as a circular/cyclic hash function? E.g.
h(abcdef) = h(bcdefa) = h(cdefab) etc
Uses for this include scalable algorithms which can check n strings against each other to see where some are rotations of others.
I suppose the essence of the hash is to extract information which is order-specific but not position-specific. Maybe something that finds a deterministic 'first position', rotates to it and hashes the result?
It all seems plausible, but slightly beyond my grasp at the moment; it must be out there already...
I'd go along with your deterministic "first position" - find the "least" character; if it appears twice, use the next character as the tie breaker (etc). You can then rotate to a "canonical" position, and hash that in a normal way. If the tie breakers run for the entire course of the string, then you've got a string which is a rotation of itself (if you see what I mean) and it doesn't matter which you pick to be "first".
So:
"abcdef" => hash("abcdef")
"defabc" => hash("abcdef")
"abaac" => hash("aacab") (tie-break between aa, ac and ab)
"cabcab" => hash("abcabc") (it doesn't matter which "a" comes first!)
Update: As Jon pointed out, the first approach doesn't handle strings with repetition very well. Problems arise when duplicate pairs of letters are encountered and the resulting XOR is 0. Here is a modification that I believe fixes the original algorithm. It uses Euclid-Fermat sequences to generate pairwise coprime integers for each additional occurrence of a character in the string. The result is that the XOR for duplicate pairs is non-zero.
I've also cleaned up the algorithm slightly. Note that the array containing the EF sequences only supports characters in the range 0x00 to 0xFF. This was just a cheap way to demonstrate the algorithm. Also, the algorithm still has runtime O(n) where n is the length of the string.
static int Hash(string s)
{
int H = 0;
if (s.Length > 0)
{
//any arbitrary coprime numbers
int a = s.Length, b = s.Length + 1;
//an array of Euclid-Fermat sequences to generate additional coprimes for each duplicate character occurrence
int[] c = new int[0x100]; // one slot per character in the range 0x00 to 0xFF
for (int i = 1; i < c.Length; i++)
{
c[i] = i + 1;
}
Func<char, int> NextCoprime = (x) => c[x] = (c[x] - x) * c[x] + x;
Func<char, char, int> NextPair = (x, y) => a * NextCoprime(x) * x.GetHashCode() + b * y.GetHashCode();
//for i=0 we need to wrap around to the last character
H = NextPair(s[s.Length - 1], s[0]);
//for i=1...n we use the previous character
for (int i = 1; i < s.Length; i++)
{
H ^= NextPair(s[i - 1], s[i]);
}
}
return H;
}
static void Main(string[] args)
{
Console.WriteLine("{0:X8}", Hash("abcdef"));
Console.WriteLine("{0:X8}", Hash("bcdefa"));
Console.WriteLine("{0:X8}", Hash("cdefab"));
Console.WriteLine("{0:X8}", Hash("cdfeab"));
Console.WriteLine("{0:X8}", Hash("a0a0"));
Console.WriteLine("{0:X8}", Hash("1010"));
Console.WriteLine("{0:X8}", Hash("0abc0def0ghi"));
Console.WriteLine("{0:X8}", Hash("0def0abc0ghi"));
}
The output is now:
7F7D7F7F
7F7D7F7F
7F7D7F7F
7F417F4F
C796C7F0
E090E0F0
A909BB71
A959BB71
First Version (which isn't complete): Use XOR which is commutative (order doesn't matter) and another little trick involving coprimes to combine ordered hashes of pairs of letters in the string. Here is an example in C#:
static int Hash(char[] s)
{
//any arbitrary coprime numbers
const int a = 7, b = 13;
int H = 0;
if (s.Length > 0)
{
//for i=0 we need to wrap around to the last character
H ^= (a * s[s.Length - 1].GetHashCode()) + (b * s[0].GetHashCode());
//for i=1...n we use the previous character
for (int i = 1; i < s.Length; i++)
{
H ^= (a * s[i - 1].GetHashCode()) + (b * s[i].GetHashCode());
}
}
return H;
}
static void Main(string[] args)
{
Console.WriteLine(Hash("abcdef".ToCharArray()));
Console.WriteLine(Hash("bcdefa".ToCharArray()));
Console.WriteLine(Hash("cdefab".ToCharArray()));
Console.WriteLine(Hash("cdfeab".ToCharArray()));
}
The output is:
4587590
4587590
4587590
7077996
You could find a deterministic first position by always starting at the position with the "lowest" (in terms of alphabetical ordering) substring. So in your case, you'd always start at "a". If there were multiple "a"s, you'd have to take two characters into account etc.
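The "rotate to the lowest starting position, then hash" idea can be sketched very simply in Python by taking the lexicographically smallest rotation. This version is quadratic in the string length; linear-time methods such as Booth's algorithm exist but are not shown. `canonical_rotation` and `rotation_hash` are illustrative names, and the built-in `hash` is only stable within one process.

```python
def canonical_rotation(s):
    """Return the lexicographically smallest rotation of s."""
    if not s:
        return s
    return min(s[i:] + s[:i] for i in range(len(s)))


def rotation_hash(s):
    """Hash that is identical for all rotations of the same string."""
    return hash(canonical_rotation(s))
```

For instance, `canonical_rotation("defabc")` is "abcdef" and `canonical_rotation("abaac")` is "aacab", so all rotations of a string share one hash value.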
I am sure that you could find a function that can generate the same hash regardless of character position in the input, however, how will you ensure that h(abc) != h(efg) for every conceivable input? (Collisions will occur for all hash algorithms, so I mean, how do you minimize this risk.)
You'd need some additional checks even after generating the hash to ensure that the strings contain the same characters.
Here's an implementation using LINQ:
public string ToCanonicalOrder(string input)
{
char first = input.OrderBy(x => x).First();
string doubledForRotation = input + input;
string canonicalOrder
= (-1)
.GenerateFrom(x => doubledForRotation.IndexOf(first, x + 1))
.Skip(1) // the -1
.TakeWhile(x => x < input.Length)
.Select(x => doubledForRotation.Substring(x, input.Length))
.OrderBy(x => x)
.First();
return canonicalOrder;
}
assuming generic generator extension method:
public static class TExtensions
{
public static IEnumerable<T> GenerateFrom<T>(this T initial, Func<T, T> next)
{
var current = initial;
while (true)
{
yield return current;
current = next(current);
}
}
}
sample usage:
var sequences = new[]
{
"abcdef", "bcdefa", "cdefab",
"defabc", "efabcd", "fabcde",
"abaac", "cabcab"
};
foreach (string sequence in sequences)
{
Console.WriteLine(ToCanonicalOrder(sequence));
}
output:
abcdef
abcdef
abcdef
abcdef
abcdef
abcdef
aacab
abcabc
then call .GetHashCode() on the result if necessary.
sample usage if ToCanonicalOrder() is converted to an extension method:
sequence.ToCanonicalOrder().GetHashCode();
One possibility is to combine the hash functions of all circular shifts of your input into one meta-hash which does not depend on the order of the inputs.
More formally, consider
for(int i=0; i<string.length; i++) {
result^=string.rotatedBy(i).hashCode();
}
Where you could replace the ^= with any other commutative operation.
As an example, consider the input
"abcd"
to get the hash we take
hash("abcd") ^ hash("dabc") ^ hash("cdab") ^ hash("bcda").
As we can see, taking the hash of any of these rotations will only change the order in which the XOR is evaluated, which won't change its value.
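A small Python sketch of this meta-hash, using `zlib.crc32` as a stand-in for the per-string hash (that choice is an assumption made for reproducibility, and the function names are hypothetical). One caveat worth seeing in code: with XOR, a string made of an even number of repeats of a shorter block, such as "abab", produces each rotation an even number of times, so everything cancels to 0. Using addition instead, as the second function does, avoids that particular cancellation.

```python
import zlib


def rotations(s):
    """All len(s) rotations of s, duplicates included."""
    return [s[i:] + s[:i] for i in range(len(s))]


def xor_rotation_hash(s):
    """XOR of the CRC32 of every rotation; independent of the starting point."""
    result = 0
    for r in rotations(s):
        result ^= zlib.crc32(r.encode("utf-8"))
    return result


def sum_rotation_hash(s):
    """Same idea with addition modulo 2**32 as the commutative operation."""
    return sum(zlib.crc32(r.encode("utf-8")) for r in rotations(s)) % (1 << 32)
```

Both functions cost O(n^2) for a string of length n, because each of the n rotations is hashed in full.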
I did something like this for a project in college. There were 2 approaches I used to try to optimize a Travelling-Salesman problem. I think if the elements are NOT guaranteed to be unique, the second solution would take a bit more checking, but the first one should work.
If you can represent the string as a matrix of associations (which character follows which), abcdef would look like
  a b c d e f
a   x
b     x
c       x
d         x
e           x
f x
But so would any rotation of it, since the associations are the same. It would be trivial to compare those matrices.
Another quicker trick would be to rotate the string so that the "first" letter is first. Then if you have the same starting point, the same strings will be identical.
Here is some Ruby code:
def normalize_string(string)
myarray = string.split(//) # split into an array
index = myarray.index(myarray.min) # find the index of the minimum element
index.times do
myarray.push(myarray.shift) # move stuff from the front to the back
end
return myarray.join
end
p normalize_string('abcdef').eql?normalize_string('defabc') # should return true
Maybe use a rolling hash for each offset (Rabin-Karp-like) and return the minimum hash value? There could be collisions though.
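A minimal Python sketch of that suggestion: slide a polynomial rolling hash over the string concatenated with itself, so that each window of length n is one rotation, and keep the smallest value. The base and modulus are arbitrary choices made for this example, and as noted, unrelated strings can still collide.

```python
def min_rolling_hash(s, base=257, mod=(1 << 61) - 1):
    """Smallest polynomial hash over all rotations of s, computed in O(n)."""
    n = len(s)
    if n == 0:
        return 0
    doubled = s + s
    top = pow(base, n - 1, mod)   # weight of the character leaving the window
    h = 0
    for ch in s:                  # hash of the first window (the string itself)
        h = (h * base + ord(ch)) % mod
    best = h
    for i in range(1, n):
        h = (h - ord(doubled[i - 1]) * top) % mod         # drop the leftmost character
        h = (h * base + ord(doubled[i + n - 1])) % mod    # append the next character
        best = min(best, h)
    return best
```

Every rotation of a string yields the same set of window hashes, so the minimum is the same regardless of where the string starts.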

Resources