combinatorics for programmers? - combinatorics

I started writing a C# Silverlight program to try to find brute-force solutions to the travelling salesman problem, but got stuck trying to enumerate all the possible routes.
For my program I am generating random dots and trying to find the shortest line that joins them all without visiting any dot twice.
So if I have three dots A, B and C, I want to find all the different orderings of A, B and C where each dot is used exactly once and no ordering is the reverse of one already found.
eg:
ABC
ACB
BAC
But how can I compute all the combinations for any number of dots?
I was writing this program for fun, and I am now more interested in finding a good resource for learning how to solve combinatorial problems in programming. Everything I have found for learning combinatorics tells me how to find the number of possible combinations, which is useless for actually enumerating them.

If you're getting interested in this sort of thing, I recommend you try some of the problems on Project Euler, e.g. http://projecteuler.net/problem=15
The documentation for Python's itertools module includes examples with sample code.
You could convert the sample code to the programming language of your choice.
http://docs.python.org/library/itertools.html
sample functions:
product('ABCD', repeat=2) AA AB AC AD BA BB BC BD CA CB CC CD DA DB DC DD
permutations('ABCD', 2) AB AC AD BA BC BD CA CB CD DA DB DC
combinations('ABCD', 2) AB AC AD BC BD CD
combinations_with_replacement('ABCD', 2) AA AB AC AD BB BC BD CC CD DD
sample code:
def combinations(iterable, r):
    # combinations('ABCD', 2) --> AB AC AD BC BD CD
    # combinations(range(4), 3) --> 012 013 023 123
    pool = tuple(iterable)
    n = len(pool)
    if r > n:
        return
    # a list is needed here: in Python 3, range objects cannot be assigned to
    indices = list(range(r))
    yield tuple(pool[i] for i in indices)
    while True:
        # find the rightmost index that has not reached its maximum value
        for i in reversed(range(r)):
            if indices[i] != i + n - r:
                break
        else:
            return
        indices[i] += 1
        for j in range(i+1, r):
            indices[j] = indices[j-1] + 1
        yield tuple(pool[i] for i in indices)
Note that if your problem allows travelling from point (x1, y1) to point (x2, y2) in a straight line, it isn't quite the same problem (you can sort the points and put them into a spatial data structure). I think that in the travelling salesman problem you are supposed to have "windy/hilly roads", so that even if two points are close together in terms of x and y, the edge connecting them may have a large weight.
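To tie this back to the brute-force search in the question, here is a minimal Python sketch that tries every ordering from itertools.permutations and keeps the shortest open path. It assumes straight-line distances between 2-D points; the names path_length and shortest_route are made up for illustration, and the approach is only practical for a handful of points because the number of orderings grows factorially.

```python
from itertools import permutations
from math import dist  # available from Python 3.8

def path_length(route):
    """Total straight-line length of a route that visits the points in order."""
    return sum(dist(a, b) for a, b in zip(route, route[1:]))

def shortest_route(points):
    """Brute force: try every ordering and keep the shortest open path."""
    return min(permutations(points), key=path_length)

points = [(0, 0), (2, 0), (1, 0)]
best = shortest_route(points)
print(best, path_length(best))
```

Each route is examined twice here (once in each direction), so skipping reversed duplicates would halve the work without changing the result.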

Here's my C# class to find permutations or combinations:
public static class IEnumerableExtensions
{
    public static IEnumerable<IEnumerable<T>> Arrange<T>(this IEnumerable<T> elements,
        int places, bool allowRepeats = true, bool orderMatters = true)
    {
        return orderMatters ?
            Permutate(elements, places, allowRepeats) :
            Combine(elements, places, allowRepeats);
    }

    public static IEnumerable<IEnumerable<T>> Permutate<T>(this IEnumerable<T> elements, int places, bool allowRepeats = false)
    {
        foreach (var cur in elements)
        {
            if (places == 1) yield return cur.Yield();
            else
            {
                var sub = allowRepeats ? elements : elements.Where(v => !v.Equals(cur));
                foreach (var res in sub.Permutate(places - 1, allowRepeats))
                {
                    yield return res.Prepend(cur);
                }
            }
        }
    }

    public static IEnumerable<IEnumerable<T>> Combine<T>(this IEnumerable<T> elements, int places, bool allowRepeats = false)
    {
        int i = 0;
        foreach (var cur in elements)
        {
            if (places == 1) yield return cur.Yield();
            else
            {
                var sub = allowRepeats ? elements.Skip(i++) : elements.Skip(i++ + 1);
                foreach (var res in sub.Combine(places - 1, allowRepeats))
                {
                    yield return res.Prepend(cur);
                }
            }
        }
    }

    public static IEnumerable<T> Yield<T>(this T item)
    {
        yield return item;
    }

    static IEnumerable<T> Prepend<T>(this IEnumerable<T> rest, T first)
    {
        yield return first;
        foreach (var item in rest)
            yield return item;
    }
}
Usage:
var places = new char[] { 'A', 'B', 'C' };
var routes = places.Permutate(3).ToArray();
//to remove reverse routes:
var noRev = (from r1 in routes
             from r2 in routes
             where r1.SequenceEqual(r2.Reverse())
             select (r1.First() < r2.First() ? r1 : r2)).Distinct();

Here is a solution in Python. The first function is a recursive function that generates all permutations P(n,n) of the same length n as the input list. The second function runs the first one and filters out any permutation whose reverse already exists.
def all_perms(elements):
    """
    Recursive function to generate all permutations
    :param elements: a list
    """
    if len(elements) <= 1:
        yield elements
    else:
        for perm in all_perms(elements[1:]):
            for i in range(len(elements)):
                yield perm[:i] + elements[0:1] + perm[i:]

def filtered_perms(elements):
    """
    Filters out any permutation whose reverse already exists
    :param elements: a list
    """
    result = []
    for perm in all_perms(elements):
        if list(reversed(perm)) not in result:
            result.append(perm)
    print(result)

filtered_perms(["A", "B", "C"])
#[['A', 'B', 'C'], ['B', 'A', 'C'], ['B', 'C', 'A']]
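The reverse filter can also be done without storing earlier results. A route and its reverse have their endpoints swapped, so keeping only the orderings whose first element sorts before the last keeps exactly one of each pair. A sketch using itertools.permutations, assuming the elements are distinct and comparable (the name routes_without_reverses is illustrative):

```python
from itertools import permutations

def routes_without_reverses(elements):
    """Yield each ordering once, skipping the one that is its mirror image."""
    for perm in permutations(elements):
        # perm and its reverse swap endpoints, so exactly one of them passes
        if len(perm) < 2 or perm[0] < perm[-1]:
            yield perm

print(["".join(p) for p in routes_without_reverses("ABC")])
# ['ABC', 'ACB', 'BAC']
```

Because nothing is accumulated, memory use stays constant however many routes are generated.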

Related

Dynamic Programming - String, Substrings and Minimum Cost

Dear Computer Science experts,
I have a question regarding Dynamic Programming (DP). The problem is I am given a sentence of characters and a cost_list that contains a list of substrings of sentence with their costs, the goal is to find lowest cost. It is assumed that cost_list contains all the substrings in sentence.
For example, suppose I have the below parameters,
sentence = "xxxyyzz"
cost_list = [["x", 1], ["xx", 3], ["y", 3], ["yy", 1], ["z", 2]]
So sentence could be [xx][x][yy][z][z], so the total cost is 3 + 1 + 1 + 2 + 2 = 9
But I could also select the substrings in sentence in a different way and we have [x][x][x][yy][z][z], which gives us 1 + 1 + 1 + 1 + 2 + 2 = 8 and it is the lowest cost.
The question is to construct a Dynamic Programming algorithm find_lowest_cost(sentence, cost_list).
Below is the recursive function I wrote for this problem. I have tested it and it is correct:
def find_lowest_cost(sentence, cost_list):
    if len(sentence) == 0:
        return 0
    else:
        result = []
        possible_substrings = []
        possible_costs = []
        for c in cost_list:
            current_substring = c[0]
            current_cost = c[1]
            if current_substring == sentence[0:len(current_substring)]:
                possible_substrings.append(current_substring)
                possible_costs.append(current_cost)
        for i in range(0, len(possible_substrings)):
            result.append(possible_costs[i] + find_lowest_cost(sentence[len(possible_substrings[i]):], cost_list))
        return min(result)

sentence = "xxxyyzz"
cost_list = [["x", 1], ["xx", 3], ["y", 3], ["yy", 1], ["z", 2]]
print(find_lowest_cost(sentence, cost_list))
I am stuck on how to convert the recursion to Dynamic Programming (DP).
Question 1: For the DP table, my columns are the characters of sentence. What should my rows be? My thinking is that they can't be rows of "x", "xx", "y", "yy" and "z", because how would we compare "yy" with, say, only "y" in sentence?
Question 2: Suppose rows and columns are figured out. What should the current cell be built upon? My notion is that the cell is built upon the lowest value of previous cells, such as cell[row][col-1], cell[row-1][col] and cell[row-1][col-1]?
Thanks!
Once you have the recursive solution, look at how many variables change between calls. Analysing the recursive approach:
We need to find the minimum cost for sub-problems of increasing size: a string of length 1, then 2, and so on. The same sub-problem is reached repeatedly through different splits, so we store each computed result in a single dp array and reuse it whenever that index comes up again.
Below is my Java solution.
import java.util.HashMap;

public class MyClass {
    // dp[st] caches the lowest cost of sentence.substring(st); null means "not computed yet"
    private static Integer[] dp;

    public static void main(String args[]) {
        // cost_list = [["x", 1], ["xx", 3], ["y", 3], ["yy", 1], ["z", 2]]
        HashMap<String, Integer> costmp = new HashMap<>();
        costmp.put("x", 1);
        costmp.put("xx", 3);
        costmp.put("y", 3);
        costmp.put("yy", 1);
        costmp.put("z", 2);
        String sentence = "xxxyyzz";
        // String sentence = "xxyyzzxxxxyyyyxxxyxxyyyxxyyyzzzyyyxxxyyyyzzzyyyyxxxyyyzzzyyxxxxxxxxxxxxxxyyxyxyzzzzxxyyxx";
        // String sentence = "xxxyyzzxxxxyyyyxxxyxxyyyxxyyyzzzyyyxxxyyyyzzzy";
        dp = new Integer[sentence.length()+1];
        int res = find_lowest_cost(sentence, costmp, 0);
        System.out.println("find_lowest_cost = " + res);
    }

    private static int find_lowest_cost(String sentence, HashMap<String, Integer> costmp, int st)
    {
        if(st == sentence.length())
            return 0;
        if(dp[st] != null)
            return dp[st];
        int mincost = Integer.MAX_VALUE;
        String str = new String();
        for(int i = st; i < sentence.length(); i++)
        {
            str += sentence.charAt(i);
            // a longer substring may still be in the map, so keep extending
            if(!costmp.containsKey(str))
                continue;
            int rest = find_lowest_cost(sentence, costmp, i+1);
            // MAX_VALUE marks "no valid split"; skip it to avoid integer overflow
            if(rest != Integer.MAX_VALUE)
                mincost = Math.min(mincost, costmp.get(str) + rest);
        }
        dp[st] = mincost;
        return mincost;
    }
}
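The memoized recursion can also be written bottom-up, which answers the question about table shape: a one-dimensional table indexed by prefix length is enough. The following Python sketch assumes cost_list has the same shape as in the question; the name find_lowest_cost_dp and the use of math.inf for "no valid split" are choices made for this illustration.

```python
import math

def find_lowest_cost_dp(sentence, cost_list):
    costs = dict(cost_list)            # substring -> cost
    n = len(sentence)
    # best[k] = lowest cost of covering the first k characters
    best = [math.inf] * (n + 1)
    best[0] = 0                        # the empty prefix costs nothing
    for end in range(1, n + 1):
        for start in range(end):
            piece = sentence[start:end]
            # extend the best cover of sentence[:start] with this piece
            if piece in costs and best[start] + costs[piece] < best[end]:
                best[end] = best[start] + costs[piece]
    return best[n]

cost_list = [["x", 1], ["xx", 3], ["y", 3], ["yy", 1], ["z", 2]]
print(find_lowest_cost_dp("xxxyyzz", cost_list))  # 8
```

Each cell is built from earlier cells best[start] with start < end, rather than from the three neighbouring cells of a two-dimensional grid.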

Find the even number using given number

I have to find the greatest even number that can be formed using the digits of a given number.
Input : 7876541
Desired output : 8776514
Can anyone help me with the logic?
How about this?
convert it into string
sort the numbers in reverse order
join them and convert it as number
def n = 7876541
def newN = (n.toString().split('').findAll{it}.sort().reverse().join()) as Integer
println newN
EDIT: Updating the answer based on the OP's comments.
Here is what you can do:
- find the permutations of the number
- keep the even numbers
- take the maximum.
There is already a thread on finding permutations, so I am re-using its code with small changes. Credits to JavaHopper.
Of course, it can be simplified and made more Groovy-like.
class Permutations {
    static def list = []

    public static void printPermutation(char[] a, int startIndex, int endIndex) {
        if (startIndex == endIndex)
            list << ((new String(a)) as Integer)
        else {
            for (int x = startIndex; x < endIndex; x++) {
                swap(a, startIndex, x)
                printPermutation(a, startIndex + 1, endIndex)
                swap(a, startIndex, x)
            }
        }
    }

    private static void swap(char[] a, int i, int x) {
        char t = a[i]
        a[i] = a[x]
        a[x] = t
    }
}

def n = 7876541
def cArray = n.toString().toCharArray()
Permutations.printPermutation(cArray, 0, cArray.size())
println Permutations.list.findAll { it.mod(2) == 0}?.max()
There is no need to create permutations.
Try this solution:
convert the source number into a string.
split the string into an array,
sort the numbers, for the time being, in ascending order,
find the index of the first even digit,
remove this number from the array (storing it in a variable),
reverse the array and add the removed number,
join the digits from the array and convert them into integer.
So the whole script looks like below:
def inp = 7876541
def chars1 = inp.toString().split('')
// findAll{it} drops an empty starting element from the split result
def chars2 = chars1.findAll{it}.sort()
// Find index of the 1st even digit
def n = chars2.findIndexOf{it.toInteger() % 2 == 0}
def dig = chars2[n] // Store this digit
chars2.remove(n) // Remove from the array
def chars3 = chars2.reverse() // Descending order
chars3.add(dig) // Add the temporarily deleted number
def out = (chars3.join()) as Integer // result
println out
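For comparison, the same steps can be sketched in Python. The function name largest_even is made up here, and returning None when the number has no even digit is an assumption the original script does not cover.

```python
def largest_even(n):
    digits = sorted(str(n))            # digits as characters, ascending
    # the smallest even digit must go last so the rest can be as large as possible
    for i, d in enumerate(digits):
        if int(d) % 2 == 0:
            last = digits.pop(i)
            break
    else:
        return None                    # no even digit, so no even number exists
    return int("".join(reversed(digits)) + last)

print(largest_even(7876541))  # 8776514
```

The work is dominated by the sort, whereas the permutation approach examines every arrangement of the digits.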

Incomprehensible technical interview

This was a question asked in a recent programming interview.
Given a string "str" and "N" pairs of swap indices, generate the lexicographically largest string. Each pair of indices can be reused any number of times.
Eg:
String = "abdc"
Indices:
(1,4)
(3,4)
Answer:
cbda, cbad, dbac, dbca
You should print only "dbca" which is lexicographically largest.
This might sound naive, but I completely fail to follow the question. Can someone please help me understand what the question means?
I think it's saying that, given the string mystring = "abdc", you are instructed to switch characters at the specified index pairs such that you produce the lexicographically "largest" string (i.e. such that if you lex-sorted all possible strings, it would end up at the last index). So you have two valid operations: (1) switch mystring[1] with mystring[4] ("abdc" --> "cbda"), and (2) switch mystring[3] with mystring[4] ("abdc" --> "abcd"). You can also chain multiple operations: either operation (1) followed by (2) ("abdc" --> "cbda" --> "cbad"), or vice versa ("abdc" --> "abcd" --> "dbca"), and so on ("abdc" --> "cbda" --> "cbad" --> "dbac").
Then you (reverse) lex-sort these and pop off the top index:
>>> allPermutations = ['abcd', 'cbad', 'abdc', 'cbda', 'dbca', 'dbac']
>>> lexSorted = sorted(allPermutations, reverse=True) # ['dbca', 'dbac', 'cbda', 'cbad', 'abdc', 'abcd']
>>> lexSorted.pop(0)
'dbca'
Based on the clarification by @ncemami I came up with this solution.
public static String swap(String str, Pair<Integer, Integer> p1, Pair<Integer, Integer> p2){
    // TreeSet keeps the candidates sorted, so last() is the lexicographically largest
    TreeSet<String> set = new TreeSet<>();
    String s1 = swap(str, p1.getKey(), p1.getValue());
    set.add(s1);
    String s2 = swap(s1, p2.getKey(), p2.getValue());
    set.add(s2);
    String s3 = swap(str, p2.getKey(), p2.getValue());
    set.add(s3);
    String s4 = swap(s3, p1.getKey(), p1.getValue());
    set.add(s4);
    return set.last();
}

// a and b are 1-based positions, as in the question's example
private static String swap(String str, int a, int b){
    StringBuilder sb = new StringBuilder(str);
    char temp1 = str.charAt(a - 1);
    char temp2 = str.charAt(b - 1);
    sb.setCharAt(a - 1, temp2);
    sb.setCharAt(b - 1, temp1);
    return sb.toString();
}
Here is my Java solution:
String swapLexOrder(String str, int[][] pairs) {
    Map<Integer, Set<Integer>> neighbours = new HashMap<>();
    for (int[] pair : pairs) {
        // It contains all the positions that are reachable from the index present in the pairs
        Set<Integer> reachablePositionsL = neighbours.get(pair[0]);
        Set<Integer> temp = neighbours.get(pair[1]); // We use it just to merge the two sets if present
        if (reachablePositionsL == null) {
            reachablePositionsL = (temp == null ? new TreeSet<>() : temp);
        } else if (temp != null) {
            // Changing the reference so every addition to "reachablePositionsL" will reflect on both positions
            for (Integer index: temp) {
                neighbours.put(index, reachablePositionsL);
            }
            reachablePositionsL.addAll(temp);
        }
        reachablePositionsL.add(pair[0]);
        reachablePositionsL.add(pair[1]);
        neighbours.put(pair[0], reachablePositionsL);
        neighbours.put(pair[1], reachablePositionsL);
    }
    StringBuilder result = new StringBuilder(str);
    for (Set<Integer> set : neighbours.values()) {
        Iterator<Character> orderedCharacters = set.stream()
            .map(i -> str.charAt(i - 1))
            .sorted(Comparator.reverseOrder())
            .iterator();
        set.forEach(i -> result.setCharAt(i - 1, orderedCharacters.next()));
    }
    return result.toString();
}
Here is an article that explains the problem.
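The idea behind the Java solution above is that positions linked by swap pairs form groups, and within a group the characters can be rearranged freely by chaining swaps. A compact Python sketch of that idea using a union-find structure follows; the name swap_lex_order and the assumption that pairs are 1-based, as in the question, are mine.

```python
def swap_lex_order(s, pairs):
    parent = list(range(len(s)))

    def find(i):
        # follow parent links to the group representative, shortening the path
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in pairs:                 # pairs are 1-based
        parent[find(a - 1)] = find(b - 1)

    # collect the positions of each connected group, in ascending order
    groups = {}
    for i in range(len(s)):
        groups.setdefault(find(i), []).append(i)

    result = list(s)
    for positions in groups.values():
        # largest characters go to the earliest positions of the group
        chars = sorted((s[i] for i in positions), reverse=True)
        for pos, ch in zip(positions, chars):
            result[pos] = ch
    return "".join(result)

print(swap_lex_order("abdc", [(1, 4), (3, 4)]))  # dbca
```

Unlike enumerating every chain of swaps, this visits each position a small number of times and then sorts each group once.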
String = "abcd"
co_ord = [(1,4),(3,4)]
def find_combinations(co_ord, String):
    l1 = []
    for tup_le in co_ord:
        l1.extend(tup_le)
    l1 = [x-1 for x in l1]
    l1 = list(set(l1))
    l2 = set(range(len(String)))-set(l1)
    return l1,int(''.join(str(i) for i in l2))

def perm1(lst):
    if len(lst) == 0:
        return []
    elif len(lst) == 1:
        return [lst]
    else:
        l = []
        for i in range(len(lst)):
            x = lst[i]
            xs = lst[:i] + lst[i+1:]
            for p in perm1(xs):
                l.append([x]+p)
        return l

lx, ly = find_combinations(co_ord, String)
final = perm1(lx)
print(final)
temp = []
final_list=[]
for i in final:
    for j in i:
        temp.append(String[j])
    final_list.append(''.join(temp))
    temp=[]
final_list = [ i[:ly] + String[ly] + i[ly:] for i in final_list]
print(sorted(final_list,reverse=True)[0])

Reduce a string using grammar-like rules

I'm trying to find a suitable DP algorithm for simplifying a string. For example I have a string a b a b and a list of rules
a b -> b
a b -> c
b a -> a
c c -> b
The goal is to find all single characters that the given string can be reduced to using these rules. For this example the answer is b, c. The length of the given string can be up to 200 symbols. Could you please suggest an efficient algorithm?
Rules always are 2 -> 1. I've got an idea of creating a tree, root is given string and each child is a string after one transform, but I'm not sure if it's the best way.
If you read those rules from right to left, they look exactly like the rules of a context free grammar, and have basically the same meaning. You could apply a bottom-up parsing algorithm like the Earley algorithm to your data, along with a suitable starting rule; something like
start <- start a
| start b
| start c
and then just examine the parse forest for the shortest chain of starts. The worst case remains O(n^3) of course, but Earley is fairly effective, these days.
You can also produce parse forests when parsing with derivatives. You might be able to efficiently check them for short chains of starts.
For a DP problem, you always need to understand how you can construct the answer for a big problem in terms of smaller sub-problems. Assume you have your function simplify which is called with an input of length n. There are n-1 ways to split the input in a first and a last part. For each of these splits, you should recursively call your simplify function on both the first part and the last part. The final answer for the input of length n is the set of all possible combinations of answers for the first and for the last part, which are allowed by the rules.
In Python, this can be implemented like so:
from functools import lru_cache

rules = {'ab': set('bc'), 'ba': set('a'), 'cc': set('b')}
all_chars = set(c for cc in rules.values() for c in cc)

@lru_cache(maxsize=None)  # memoize
def simplify(s):
    if len(s) == 1:  # base case to end recursion
        return set(s)
    possible_chars = set()
    # iterate over all the possible splits of s
    for i in range(1, len(s)):
        head = s[:i]
        tail = s[i:]
        # check all possible combinations of answers of sub-problems
        for c1 in simplify(head):
            for c2 in simplify(tail):
                possible_chars.update(rules.get(c1+c2, set()))
                # speed hack
                if possible_chars == all_chars:  # won't get any bigger
                    return all_chars
    return possible_chars
Quick check:
In [53]: simplify('abab')
Out[53]: {'b', 'c'}
To make this fast enough for large strings (to avoid exponential behavior), you should use a memoize decorator. This is a critical step in solving DP problems; otherwise you are just doing a brute-force calculation. A further tiny speedup can be obtained by returning from the function as soon as possible_chars == set('abc'), since at that point you are already sure that you can generate all possible outcomes.
Analysis of running time: for an input of length n, there are 2 substrings of length n-1, 3 substrings of length n-2, ... n substrings of length 1, for a total of O(n^2) subproblems. Due to the memoization, the function is called at most once for every subproblem. Maximum running time for a single sub-problem is O(n) due to the for i in range(len(s)), so the overall running time is at most O(n^3).
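The same recurrence can be filled in bottom-up over substring lengths, which makes the O(n^2) sub-problems and the O(n) split loop visible as three nested loops. This is a sketch in the style of a CYK table; the function name reduce_to_chars and the table layout are choices made for this illustration.

```python
def reduce_to_chars(s, rules):
    n = len(s)
    if n == 0:
        return set()
    # table[i][j] = single characters that s[i..j] (inclusive) can reduce to
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(s):
        table[i][i].add(ch)
    for length in range(2, n + 1):             # substring length
        for i in range(n - length + 1):        # start index
            j = i + length - 1                 # end index
            for k in range(i, j):              # split between k and k+1
                for c1 in table[i][k]:
                    for c2 in table[k + 1][j]:
                        table[i][j] |= rules.get(c1 + c2, set())
    return table[0][n - 1]

rules = {'ab': set('bc'), 'ba': set('a'), 'cc': set('b')}
print(sorted(reduce_to_chars('abab', rules)))  # ['b', 'c']
```

With no recursion involved, an input of 200 symbols does not run into Python's recursion limit.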
Let N be the length of the given string and R the number of rules.
Expanding a tree in a top-down manner yields computational complexity O(NR^N) in the worst case (an input string of the form aaa... and rules aa -> a).
Proof:
Root of the tree has (N-1)R children, which have (N-1)R^2 children, ..., which have (N-1)R^N children (leaves). So, the total complexity is O((N-1)R + (N-1)R^2 + ... (N-1)R^N) = O(N(R + R^2 + ... + R^N)) = (using binomial theorem) = O(N(R+1)^N) = O(NR^N).
Recursive Java implementation of this naive approach:
public static void main(String[] args) {
    Map<String, Character[]> rules = new HashMap<String, Character[]>() {{
        put("ab", new Character[]{'b', 'c'});
        put("ba", new Character[]{'a'});
        put("cc", new Character[]{'b'});
    }};
    System.out.println(simplify("abab", rules));
}

public static Set<String> simplify(String in, Map<String, Character[]> rules) {
    Set<String> result = new HashSet<String>();
    simplify(in, rules, result);
    return result;
}

private static void simplify(String in, Map<String, Character[]> rules, Set<String> result) {
    if (in.length() == 1) {
        result.add(in);
    }
    for (int i = 0; i < in.length() - 1; i++) {
        String two = in.substring(i, i + 2);
        Character[] rep = rules.get(two);
        if (rep != null) {
            for (Character c : rep) {
                simplify(in.substring(0, i) + c + in.substring(i + 2, in.length()), rules, result);
            }
        }
    }
}
Bas Swinckels's O(RN^3) Java implementation (with HashMap as a memoization cache):
public static Set<String> simplify2(final String in, Map<String, Character[]> rules) {
    Map<String, Set<String>> cache = new HashMap<String, Set<String>>();
    return simplify2(in, rules, cache);
}

private static Set<String> simplify2(final String in, Map<String, Character[]> rules, Map<String, Set<String>> cache) {
    final Set<String> cached = cache.get(in);
    if (cached != null) {
        return cached;
    }
    Set<String> ret = new HashSet<String>();
    if (in.length() == 1) {
        ret.add(in);
        return ret;
    }
    for (int i = 1; i < in.length(); i++) {
        String head = in.substring(0, i);
        String tail = in.substring(i, in.length());
        // pass the shared cache for both halves; a fresh cache here would defeat the memoization
        for (String c1 : simplify2(head, rules, cache)) {
            for (String c2 : simplify2(tail, rules, cache)) {
                Character[] rep = rules.get(c1 + c2);
                if (rep != null) {
                    for (Character c : rep) {
                        ret.add(c.toString());
                    }
                }
            }
        }
    }
    cache.put(in, ret);
    return ret;
}
Output in both approaches:
[b, c]

Is there a circular hash function?

Thinking about this question on testing string rotation, I wondered: is there such a thing as a circular/cyclic hash function? E.g.
h(abcdef) = h(bcdefa) = h(cdefab) etc
Uses for this include scalable algorithms which can check n strings against each other to see where some are rotations of others.
I suppose the essence of the hash is to extract information which is order-specific but not position-specific. Maybe something that finds a deterministic 'first position', rotates to it and hashes the result?
It all seems plausible, but slightly beyond my grasp at the moment; it must be out there already...
I'd go along with your deterministic "first position" - find the "least" character; if it appears twice, use the next character as the tie breaker (etc). You can then rotate to a "canonical" position, and hash that in a normal way. If the tie breakers run for the entire course of the string, then you've got a string which is a rotation of itself (if you see what I mean) and it doesn't matter which you pick to be "first".
So:
"abcdef" => hash("abcdef")
"defabc" => hash("abcdef")
"abaac" => hash("aacab") (tie-break between aa, ac and ab)
"cabcab" => hash("abcabc") (it doesn't matter which "a" comes first!)
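A brute-force Python sketch of this canonical-rotation idea is to take the lexicographically smallest of all rotations and hash that. The names canonical_rotation and circular_hash are illustrative, and zlib.crc32 merely stands in for whatever ordinary hash function you prefer.

```python
import zlib

def canonical_rotation(s):
    """Smallest rotation of s in lexicographic order (O(n^2) brute force)."""
    if not s:
        return s
    return min(s[i:] + s[:i] for i in range(len(s)))

def circular_hash(s):
    # every rotation of s has the same canonical form, hence the same hash
    return zlib.crc32(canonical_rotation(s).encode("utf-8"))

print(canonical_rotation("defabc"))   # abcdef
print(canonical_rotation("abaac"))    # aacab
print(canonical_rotation("cabcab"))   # abcabc
```

Taking the minimum over whole rotations resolves all tie-breaks automatically, at the cost of comparing n strings of length n.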
Update: As Jon pointed out, the first approach doesn't handle strings with repetition very well. Problems arise as duplicate pairs of letters are encountered and the resulting XOR is 0. Here is a modification that I believe fixes the original algorithm. It uses Euclid-Fermat sequences to generate pairwise coprime integers for each additional occurrence of a character in the string. The result is that the XOR for duplicate pairs is non-zero.
I've also cleaned up the algorithm slightly. Note that the array containing the EF sequences only supports characters in the range 0x00 to 0xFF. This was just a cheap way to demonstrate the algorithm. Also, the algorithm still has runtime O(n) where n is the length of the string.
static int Hash(string s)
{
    int H = 0;
    if (s.Length > 0)
    {
        //any arbitrary coprime numbers
        int a = s.Length, b = s.Length + 1;
        //an array of Euclid-Fermat sequences to generate additional coprimes for each duplicate character occurrence
        //(0x100 entries so that every character from 0x00 to 0xFF has a slot)
        int[] c = new int[0x100];
        for (int i = 1; i < c.Length; i++)
        {
            c[i] = i + 1;
        }
        Func<char, int> NextCoprime = (x) => c[x] = (c[x] - x) * c[x] + x;
        Func<char, char, int> NextPair = (x, y) => a * NextCoprime(x) * x.GetHashCode() + b * y.GetHashCode();
        //for i=0 we need to wrap around to the last character
        H = NextPair(s[s.Length - 1], s[0]);
        //for i=1...n we use the previous character
        for (int i = 1; i < s.Length; i++)
        {
            H ^= NextPair(s[i - 1], s[i]);
        }
    }
    return H;
}

static void Main(string[] args)
{
    Console.WriteLine("{0:X8}", Hash("abcdef"));
    Console.WriteLine("{0:X8}", Hash("bcdefa"));
    Console.WriteLine("{0:X8}", Hash("cdefab"));
    Console.WriteLine("{0:X8}", Hash("cdfeab"));
    Console.WriteLine("{0:X8}", Hash("a0a0"));
    Console.WriteLine("{0:X8}", Hash("1010"));
    Console.WriteLine("{0:X8}", Hash("0abc0def0ghi"));
    Console.WriteLine("{0:X8}", Hash("0def0abc0ghi"));
}
The output is now:
7F7D7F7F
7F7D7F7F
7F7D7F7F
7F417F4F
C796C7F0
E090E0F0
A909BB71
A959BB71
First Version (which isn't complete): Use XOR which is commutative (order doesn't matter) and another little trick involving coprimes to combine ordered hashes of pairs of letters in the string. Here is an example in C#:
static int Hash(char[] s)
{
    //any arbitrary coprime numbers
    const int a = 7, b = 13;
    int H = 0;
    if (s.Length > 0)
    {
        //for i=0 we need to wrap around to the last character
        H ^= (a * s[s.Length - 1].GetHashCode()) + (b * s[0].GetHashCode());
        //for i=1...n we use the previous character
        for (int i = 1; i < s.Length; i++)
        {
            H ^= (a * s[i - 1].GetHashCode()) + (b * s[i].GetHashCode());
        }
    }
    return H;
}

static void Main(string[] args)
{
    Console.WriteLine(Hash("abcdef".ToCharArray()));
    Console.WriteLine(Hash("bcdefa".ToCharArray()));
    Console.WriteLine(Hash("cdefab".ToCharArray()));
    Console.WriteLine(Hash("cdfeab".ToCharArray()));
}
The output is:
4587590
4587590
4587590
7077996
You could find a deterministic first position by always starting at the position with the "lowest" (in terms of alphabetical ordering) substring. So in your case, you'd always start at "a". If there were multiple "a"s, you'd have to take two characters into account etc.
I am sure that you could find a function that can generate the same hash regardless of character position in the input, however, how will you ensure that h(abc) != h(efg) for every conceivable input? (Collisions will occur for all hash algorithms, so I mean, how do you minimize this risk.)
You'd need some additional checks even after generating the hash to ensure that the strings contain the same characters.
Here's an implementation using Linq
public string ToCanonicalOrder(string input)
{
    char first = input.OrderBy(x => x).First();
    string doubledForRotation = input + input;
    string canonicalOrder
        = (-1)
        .GenerateFrom(x => doubledForRotation.IndexOf(first, x + 1))
        .Skip(1) // the -1
        .TakeWhile(x => x < input.Length)
        .Select(x => doubledForRotation.Substring(x, input.Length))
        .OrderBy(x => x)
        .First();
    return canonicalOrder;
}
assuming generic generator extension method:
public static class TExtensions
{
    public static IEnumerable<T> GenerateFrom<T>(this T initial, Func<T, T> next)
    {
        var current = initial;
        while (true)
        {
            yield return current;
            current = next(current);
        }
    }
}
sample usage:
var sequences = new[]
    {
        "abcdef", "bcdefa", "cdefab",
        "defabc", "efabcd", "fabcde",
        "abaac", "cabcab"
    };
foreach (string sequence in sequences)
{
    Console.WriteLine(ToCanonicalOrder(sequence));
}
output:
abcdef
abcdef
abcdef
abcdef
abcdef
abcdef
aacab
abcabc
then call .GetHashCode() on the result if necessary.
sample usage if ToCanonicalOrder() is converted to an extension method:
sequence.ToCanonicalOrder().GetHashCode();
One possibility is to combine the hash functions of all circular shifts of your input into one meta-hash which does not depend on the order of the inputs.
More formally, consider
for(int i=0; i<string.length; i++) {
    result^=string.rotatedBy(i).hashCode();
}
where you could replace the ^= with any other commutative operation.
As an example, consider the input
"abcd"
To get the hash we take
hash("abcd") ^ hash("dabc") ^ hash("cdab") ^ hash("bcda").
As we can see, starting from any of these rotations only changes the order in which the XOR terms are evaluated, which does not change the value.
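A runnable Python version of this meta-hash might look like the sketch below. The name rotation_xor_hash is made up, and zlib.crc32 is only a stand-in for the per-rotation hash function.

```python
import zlib

def rotation_xor_hash(s):
    result = 0
    for i in range(len(s)):
        rotated = s[i:] + s[:i]
        # XOR is commutative, so the order of the rotations does not matter
        result ^= zlib.crc32(rotated.encode("utf-8"))
    return result

print(rotation_xor_hash("abcdef") == rotation_xor_hash("defabc"))  # True
print(rotation_xor_hash("abab"))                                   # 0
```

The second line shows a weakness of XOR in particular: in a string such as "abab" every distinct rotation occurs an even number of times, so the terms cancel to 0. Another commutative operation, such as addition, avoids that cancellation. Either way the cost is quadratic in the string length, since each of the n rotations is hashed in full.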
I did something like this for a project in college. There were 2 approaches I used to try to optimize a Travelling-Salesman problem. I think if the elements are NOT guaranteed to be unique, the second solution would take a bit more checking, but the first one should work.
If you can represent the string as a matrix of associations (each row's letter is followed by the marked column's letter), abcdef would look like
  a b c d e f
a   x
b     x
c       x
d         x
e           x
f x
Any rotation of the string produces the same matrix, so it would be trivial to compare those matrices.
Another quicker trick would be to rotate the string so that the "first" letter is first. Then if you have the same starting point, the same strings will be identical.
Here is some Ruby code:
def normalize_string(string)
  myarray = string.split(//)          # split into an array
  index = myarray.index(myarray.min)  # find the index of the minimum element
  index.times do
    myarray.push(myarray.shift)       # move stuff from the front to the back
  end
  return myarray.join
end

p normalize_string('abcdef').eql?(normalize_string('defabc')) # should return true
Maybe use a rolling hash for each offset (RabinKarp like) and return the minimum hash value? There could be collisions though.
