Is there a circular hash function?

Thinking about this question on testing string rotation, I wondered: is there such a thing as a circular/cyclic hash function? E.g.
h(abcdef) = h(bcdefa) = h(cdefab) etc
Uses for this include scalable algorithms which can check n strings against each other to see whether some are rotations of others.
I suppose the essence of the hash is to extract information which is order-specific but not position-specific. Maybe something that finds a deterministic 'first position', rotates to it and hashes the result?
It all seems plausible, but slightly beyond my grasp at the moment; it must be out there already...

I'd go along with your deterministic "first position" - find the "least" character; if it appears twice, use the next character as the tie breaker (etc). You can then rotate to a "canonical" position, and hash that in a normal way. If the tie breakers run for the entire course of the string, then you've got a string which is a rotation of itself (if you see what I mean) and it doesn't matter which you pick to be "first".
So:
"abcdef" => hash("abcdef")
"defabc" => hash("abcdef")
"abaac" => hash("aacab") (tie-break between aa, ac and ab)
"cabcab" => hash("abcabc") (it doesn't matter which "a" comes first!)

Update: As Jon pointed out, the first approach doesn't handle strings with repetition very well. Problems arise as duplicate pairs of letters are encountered and the resulting XOR is 0. Here is a modification that I believe fixes the original algorithm. It uses Euclid-Fermat sequences to generate pairwise coprime integers for each additional occurrence of a character in the string. The result is that the XOR for duplicate pairs is non-zero.
I've also cleaned up the algorithm slightly. Note that the array containing the EF sequences only supports characters in the range 0x00 to 0xFF. This was just a cheap way to demonstrate the algorithm. Also, the algorithm still has runtime O(n) where n is the length of the string.
static int Hash(string s)
{
    int H = 0;
    if (s.Length > 0)
    {
        //any arbitrary coprime numbers
        int a = s.Length, b = s.Length + 1;
        //an array of Euclid-Fermat sequences to generate additional coprimes for each duplicate character occurrence
        int[] c = new int[0x100];
        for (int i = 0; i < c.Length; i++)
        {
            c[i] = i + 1;
        }
        Func<char, int> NextCoprime = (x) => c[x] = (c[x] - x) * c[x] + x;
        Func<char, char, int> NextPair = (x, y) => a * NextCoprime(x) * x.GetHashCode() + b * y.GetHashCode();
        //for i=0 we need to wrap around to the last character
        H = NextPair(s[s.Length - 1], s[0]);
        //for i=1...n we use the previous character
        for (int i = 1; i < s.Length; i++)
        {
            H ^= NextPair(s[i - 1], s[i]);
        }
    }
    return H;
}
static void Main(string[] args)
{
    Console.WriteLine("{0:X8}", Hash("abcdef"));
    Console.WriteLine("{0:X8}", Hash("bcdefa"));
    Console.WriteLine("{0:X8}", Hash("cdefab"));
    Console.WriteLine("{0:X8}", Hash("cdfeab"));
    Console.WriteLine("{0:X8}", Hash("a0a0"));
    Console.WriteLine("{0:X8}", Hash("1010"));
    Console.WriteLine("{0:X8}", Hash("0abc0def0ghi"));
    Console.WriteLine("{0:X8}", Hash("0def0abc0ghi"));
}
The output is now:
7F7D7F7F
7F7D7F7F
7F7D7F7F
7F417F4F
C796C7F0
E090E0F0
A909BB71
A959BB71
First Version (which isn't complete): Use XOR which is commutative (order doesn't matter) and another little trick involving coprimes to combine ordered hashes of pairs of letters in the string. Here is an example in C#:
static int Hash(char[] s)
{
    //any arbitrary coprime numbers
    const int a = 7, b = 13;
    int H = 0;
    if (s.Length > 0)
    {
        //for i=0 we need to wrap around to the last character
        H ^= (a * s[s.Length - 1].GetHashCode()) + (b * s[0].GetHashCode());
        //for i=1...n we use the previous character
        for (int i = 1; i < s.Length; i++)
        {
            H ^= (a * s[i - 1].GetHashCode()) + (b * s[i].GetHashCode());
        }
    }
    return H;
}
static void Main(string[] args)
{
    Console.WriteLine(Hash("abcdef".ToCharArray()));
    Console.WriteLine(Hash("bcdefa".ToCharArray()));
    Console.WriteLine(Hash("cdefab".ToCharArray()));
    Console.WriteLine(Hash("cdfeab".ToCharArray()));
}
The output is:
4587590
4587590
4587590
7077996

You could find a deterministic first position by always starting at the position with the "lowest" (in terms of alphabetical ordering) substring. So in your case, you'd always start at "a". If there were multiple "a"s, you'd have to take two characters into account etc.

I am sure that you could find a function that generates the same hash regardless of character position in the input. However, how will you ensure that h(abc) != h(efg) for every conceivable input? (Collisions will occur for all hash algorithms, so I mean: how do you minimize this risk?)
You'd need some additional checks even after generating the hash to ensure that the strings contain the same characters.
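One cheap way to do that final check, sketched here as an assumption rather than a prescription: once two circular hashes collide, verify the rotation directly. Two equal-length strings are rotations of each other exactly when one appears as a substring of the other doubled:
static bool IsRotationOf(string a, string b)
{
    // "bcdefa" is a rotation of "abcdef" because it occurs inside "abcdefabcdef"
    return a.Length == b.Length && (b + b).Contains(a);
}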

Here's an implementation using Linq
public string ToCanonicalOrder(string input)
{
    char first = input.OrderBy(x => x).First();
    string doubledForRotation = input + input;
    string canonicalOrder
        = (-1)
        .GenerateFrom(x => doubledForRotation.IndexOf(first, x + 1))
        .Skip(1) // the -1
        .TakeWhile(x => x < input.Length)
        .Select(x => doubledForRotation.Substring(x, input.Length))
        .OrderBy(x => x)
        .First();
    return canonicalOrder;
}
assuming generic generator extension method:
public static class TExtensions
{
    public static IEnumerable<T> GenerateFrom<T>(this T initial, Func<T, T> next)
    {
        var current = initial;
        while (true)
        {
            yield return current;
            current = next(current);
        }
    }
}
sample usage:
var sequences = new[]
{
    "abcdef", "bcdefa", "cdefab",
    "defabc", "efabcd", "fabcde",
    "abaac", "cabcab"
};
foreach (string sequence in sequences)
{
    Console.WriteLine(ToCanonicalOrder(sequence));
}
output:
abcdef
abcdef
abcdef
abcdef
abcdef
abcdef
aacab
abcabc
then call .GetHashCode() on the result if necessary.
sample usage if ToCanonicalOrder() is converted to an extension method:
sequence.ToCanonicalOrder().GetHashCode();

One possibility is to combine the hash functions of all circular shifts of your input into one meta-hash which does not depend on the order of the inputs.
More formally, consider
for (int i = 0; i < string.length; i++) {
    result ^= string.rotatedBy(i).hashCode();
}
Where you could replace the ^= with any other commutative operation.
More concretely, consider the input
"abcd"
to get the hash we take
hash("abcd") ^ hash("dabc") ^ hash("cdab") ^ hash("bcda").
As we can see, hashing any rotation of the input only changes the order in which the individual XOR terms are combined, which doesn't change the final value.
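A direct C# rendering of that idea might look like the sketch below (the naming is my own; note that it costs one hash per rotation, so O(n^2) character work per string unless you plug in a rolling hash):
static int RotationXorHash(string s)
{
    int result = 0;
    for (int i = 0; i < s.Length; i++)
    {
        // hash the rotation starting at position i and fold it in with XOR
        string rotation = s.Substring(i) + s.Substring(0, i);
        result ^= rotation.GetHashCode();
    }
    return result;
}
One caveat: if the string has a repeating period (e.g. "abab"), identical rotation hashes appear in pairs and cancel out to 0, so such inputs would need extra handling.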

I did something like this for a project in college. There were 2 approaches I used to try to optimize a Travelling-Salesman problem. I think if the elements are NOT guaranteed to be unique, the second solution would take a bit more checking, but the first one should work.
If you can represent the string as a matrix of associations, so that abcdef would look like
  a b c d e f
a   x
b     x
c       x
d         x
e           x
f x
But so would any combination of those associations. It would be trivial to compare those matrices.
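A sketch of that comparison (my own encoding of the matrix: count each character together with its circular successor and compare the counts; as noted above, with repeated characters this needs a bit more checking, since matching counts do not always guarantee a rotation):
static Dictionary<(char, char), int> AssociationCounts(string s)
{
    var counts = new Dictionary<(char, char), int>();
    for (int i = 0; i < s.Length; i++)
    {
        // pair each character with the one that follows it, wrapping around at the end
        var pair = (s[i], s[(i + 1) % s.Length]);
        counts[pair] = counts.TryGetValue(pair, out int n) ? n + 1 : 1;
    }
    return counts;
}
static bool SameAssociations(string a, string b)
{
    var ca = AssociationCounts(a);
    var cb = AssociationCounts(b);
    if (ca.Count != cb.Count)
    {
        return false;
    }
    foreach (var kv in ca)
    {
        if (!cb.TryGetValue(kv.Key, out int n) || n != kv.Value)
        {
            return false;
        }
    }
    return true;
}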
Another, quicker trick would be to rotate the string so that the "first" (smallest) letter comes first. Once strings share that canonical starting point, strings that are rotations of each other become identical.
Here is some Ruby code:
def normalize_string(string)
  myarray = string.split(//) # split into an array
  index = myarray.index(myarray.min) # find the index of the minimum element
  index.times do
    myarray.push(myarray.shift) # move stuff from the front to the back
  end
  return myarray.join
end
p normalize_string('abcdef').eql? normalize_string('defabc') # should return true

Maybe use a rolling hash for each offset (Rabin-Karp-like) and return the minimum hash value? There could be collisions, though.
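A sketch of that suggestion, hashing all rotations in O(n) total by sliding a simple polynomial hash across the doubled string (the modulus and base are arbitrary choices of mine, and as the answer says, collisions are still possible):
static long MinRotationHash(string s)
{
    const long mod = 1000000007, b = 31;
    int n = s.Length;
    if (n == 0) return 0;
    string d = s + s;
    // b^(n-1) mod p, used to remove the outgoing character from the window
    long pow = 1;
    for (int i = 1; i < n; i++) pow = pow * b % mod;
    // hash of the first window d[0..n-1], i.e. the original string
    long h = 0;
    for (int i = 0; i < n; i++) h = (h * b + d[i]) % mod;
    long min = h;
    // slide the window once per remaining rotation and keep the minimum
    for (int i = n; i < 2 * n - 1; i++)
    {
        h = (h - d[i - n] * pow % mod + mod) % mod; // drop the outgoing character
        h = (h * b + d[i]) % mod;                   // append the incoming character
        if (h < min) min = h;
    }
    return min;
}
Every rotation of the same string sees the same set of n windows, so the minimum is identical for all of them.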


Finding the binary composition of a binary number

Very new to C#, so this could be a silly question.
I am working with a lot of UInt64's. These are expressed as hex, right? If we look at a value's binary representation, can we return an array such that, if we apply the 'or' operation to all of its elements, we arrive back at the original UInt64?
For example, let's say
x = 1011
Then, I am looking for an efficient way to arrive at,
f(x) = {1000, 0010, 0001}
Where these numbers are in hex, rather than binary. Sorry, I am new to hex too.
I have a method already, but it feels inefficient. I first convert to a binary string, and loop over that string to find each '1'. I then add the corresponding binary number to an array.
Any thoughts?
Here is a better example. I have a hexadecimal number x, in the form of,
UInt64 x = 0x00000000000000FF
Where the binary representation of x is
0000000000000000000000000000000000000000000000000000000011111111
I wish to find an array consisting of hexadecimal numbers (UInt64??) such that the or operation applied to all members of that array would result in x again. For example,
f(x) = {0x0000000000000080, // 00000....10000000
0x0000000000000040, // 00000....01000000
0x0000000000000020, // 00000....00100000
0x0000000000000010, // 00000....00010000
0x0000000000000008, // 00000....00001000
0x0000000000000004, // 00000....00000100
0x0000000000000002, // 00000....00000010
0x0000000000000001 // 00000....00000001
}
I think the question comes down to finding an efficient way to find the index of the '1's in the binary expansion...
public static UInt64[] findOccupiedSquares(UInt64 pieces){
    UInt64[] toReturn = new UInt64[BitOperations.PopCount(pieces)];
    if (BitOperations.PopCount(pieces) == 1){
        toReturn[0] = pieces;
    }
    else{
        int i = 0;
        int index = 0;
        while (pieces != 0){
            i += 1;
            pieces = pieces >> 1;
            if (BitOperations.TrailingZeroCount(pieces) == 0){ // One
                int rank = (int)(i / 8);
                int file = i - (rank * 8);
                toReturn[index] = LUTable.MaskRank[rank] & LUTable.MaskFile[file];
                index += 1;
            }
        }
    }
    return toReturn;
}
Your question still confuses me, as you seem to be mixing the concepts of numbers and number representations: there is an integer, and then there is a hexadecimal representation of that integer.
You can very simply break any integer into its base-2 components.
ulong input = 16094009876; // example input
ulong x = 1;
var bits = new List<ulong>();
do
{
    if ((input & x) == x)
    {
        bits.Add(x);
    }
    x <<= 1;
} while (x != 0);
bits is now a list of integers, each of which represents one of the binary 1 bits within the input. This can be verified by adding (or ORing, which is the same thing here since the values are distinct powers of two) all the values. So this expression is true:
bits.Aggregate((a, b) => a | b) == input
If you want hexadecimal representations of those integers in the list, you can simply use ToString():
var hexBits = bits.Select(b => b.ToString("X16"));
If you want the binary representations of the integers, you can use Convert:
var binaryBits = bits.Select(b => Convert.ToString((long)b, 2).PadLeft(64, '0'));
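If you would rather skip the zero bits entirely, a standard alternative (sketched here, not part of the answer above) is to repeatedly isolate the lowest set bit with x & (~x + 1) and then clear it:
ulong input = 16094009876; // example input
var bits = new List<ulong>();
ulong rest = input;
while (rest != 0)
{
    ulong lowest = rest & (~rest + 1); // isolates the lowest set bit (two's-complement trick)
    bits.Add(lowest);
    rest &= rest - 1;                  // clears that bit
}
This produces the same list as the loop above, but runs once per set bit instead of once per bit position; BitOperations.TrailingZeroCount, which the question already uses, would give you the bit's index if you need it.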

Generate string permutations recursively; each character appears n times

I'm trying to write an algorithm that will generate all strings of length nm, with exactly n of each of the numbers 1, 2, ..., m.
For instance, all strings of length 6 with exactly two 1's, two 2's and two 3's, e.g. 112233, 121233, ...
I managed to do this with just 1's and 2's using a recursive method, but can't seem to get something that works when I introduce 3's.
When m = 2, the algorithm I have is:
generateAllStrings(int len, int K, String str)
{
    if (len == 0)
    {
        output(str);
    }
    if (K > 0)
    {
        generateAllStrings(len - 1, K - 1, str + '2');
    }
    if (len > K)
    {
        generateAllStrings(len - 1, K, str + '1');
    }
}
I've tried inserting similar conditions for the third number but the algorithm doesn't give a correct output. After that I wouldn't even know how to generalise for 4 numbers and above.
Is recursion the right thing to do? Any help would be appreciated.
One option would be to list off all distinct permutations of the string 111...1222...2...nnn....n. There are nice algorithms for enumerating all distinct permutations of a string in time proportional to the length of the string, and they'd probably be a good way to go about solving this problem.
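For reference, here is a sketch of that approach in C# (the classic "next lexicographic permutation" step applied to the sorted multiset; the method name is mine):
static IEnumerable<string> DistinctPermutations(string multiset)
{
    char[] a = multiset.ToCharArray();
    Array.Sort(a); // start from the smallest arrangement, e.g. "112233"
    while (true)
    {
        yield return new string(a);
        // find the rightmost position whose character can still be increased
        int i = a.Length - 2;
        while (i >= 0 && a[i] >= a[i + 1]) i--;
        if (i < 0) yield break; // already the largest arrangement
        // swap it with the smallest larger character to its right
        int j = a.Length - 1;
        while (a[j] <= a[i]) j--;
        char tmp = a[i]; a[i] = a[j]; a[j] = tmp;
        // reverse the descending tail to get the next arrangement
        Array.Reverse(a, i + 1, a.Length - i - 1);
    }
}
DistinctPermutations("112233") yields each of the 90 valid strings exactly once, in dictionary order.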
To use a simple recursive algorithm, give each recursion the permutation so far (variable perm), and the number of occurrences of each digit that is still available (array count).
Run the code snippet to generate all unique permutations for n=2 and m=4 (set: 11223344).
function permutations(n, m) {
    var perm = "", count = []; // start with empty permutation
    for (var i = 0; i < m; i++) count[i] = n; // set available number for each digit = n
    permute(perm, count); // start recursion with "" and [n,n,n...]
    function permute(perm, count) {
        var done = true;
        for (var i = 0; i < count.length; i++) { // iterate over all digits
            if (count[i] > 0) { // more instances of digit i available
                var c = count.slice(); // create hard copy of count array
                --c[i]; // decrement count of digit i
                permute(perm + (i + 1), c); // add digit to permutation and recurse
                done = false; // digits left over: not the last step
            }
        }
        if (done) document.write(perm + "<BR>"); // no digits left: complete permutation
    }
}
permutations(2, 4);
You can easily do this using DFS (or BFS, alternatively). We can define a graph such that each node holds one string, and a node is connected to every node whose string can be obtained from it by swapping one pair of positions. This graph is connected, thus we can easily generate the set of all nodes, which will contain all strings that are searched:
set generated_strings
list nodes
nodes.add(generateInitialString(N, M))
generated_strings.add(generateInitialString(N, M))
while (!nodes.empty())
    string tmp = nodes.remove(0)
    for (int i in [0, N * M))
        for (int j in distinct([0, N * M), i))
            string new = swap(tmp, i, j)
            if (!generated_strings.contains(new))
                nodes.add(new)
                generated_strings.add(new)
// generated_strings now contains all strings that can possibly be generated.
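A C# sketch of that search, in case the pseudocode is unclear (breadth-first over single swaps; the names are mine, and it revisits far more states than the recursive answers above, but it does produce the full set):
static HashSet<string> GenerateBySwaps(string initial)
{
    var seen = new HashSet<string> { initial };
    var queue = new Queue<string>();
    queue.Enqueue(initial);
    while (queue.Count > 0)
    {
        char[] chars = queue.Dequeue().ToCharArray();
        for (int i = 0; i < chars.Length; i++)
        {
            for (int j = i + 1; j < chars.Length; j++)
            {
                if (chars[i] == chars[j]) continue; // swapping equal characters changes nothing
                char tmp = chars[i]; chars[i] = chars[j]; chars[j] = tmp;
                string next = new string(chars);
                if (seen.Add(next)) queue.Enqueue(next);
                tmp = chars[i]; chars[i] = chars[j]; chars[j] = tmp; // swap back
            }
        }
    }
    return seen;
}
Starting it from e.g. "112233" fills seen with the same 90 strings as the other approaches.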

Dynamic character generator; Generate all possible strings from a character set

I want to make a dynamic string generator that will generate all possible unique strings from a character set with a dynamic length.
I can make this very easily using for loops, but then the length is fixed rather than dynamic.
// Prints all possible strings with the length of 3
for a in allowedCharacters {
    for b in allowedCharacters {
        for c in allowedCharacters {
            println(a+b+c)
        }
    }
}
But when I try to make the length dynamic, so I can just call generate(length: 5), I get confused.
I found this Stack Overflow question, but the accepted answer generates strings of every length from 1 to maxLength, and I want every string to have length maxLength.
As noted above, use recursion. Here is how it can be done with C#:
static IEnumerable<string> Generate(int length, char[] allowed_chars)
{
    if (length == 1)
    {
        foreach (char c in allowed_chars)
            yield return c.ToString();
    }
    else
    {
        var sub_strings = Generate(length - 1, allowed_chars);
        foreach (char c in allowed_chars)
        {
            foreach (string sub in sub_strings)
            {
                yield return c + sub;
            }
        }
    }
}
private static void Main(string[] args)
{
    string chars = "abc";
    List<string> result = Generate(3, chars.ToCharArray()).ToList();
}
Please note that the run time of this algorithm, and the amount of data it returns, grow exponentially as the length increases, which means that for large lengths you should expect the code to take a long time and to return a huge amount of data.
Translation of @YacoubMassad's C# code to Swift:
func generate(length: Int, allowedChars: [String]) -> [String] {
    if length == 1 {
        return allowedChars
    }
    else {
        let subStrings = generate(length - 1, allowedChars: allowedChars)
        var arr = [String]()
        for c in allowedChars {
            for sub in subStrings {
                arr.append(c + sub)
            }
        }
        return arr
    }
}
println(generate(3, allowedChars: ["a", "b", "c"]))
Prints:
aaa, aab, aac, aba, abb, abc, aca, acb, acc, baa, bab, bac, bba, bbb, bbc, bca, bcb, bcc, caa, cab, cac, cba, cbb, cbc, cca, ccb, ccc
While you can (obviously enough) use recursion to solve this problem, it's quite an inefficient way to do the job.
What you're really doing is just counting. In your example, with "a", "b" and "c" as the allowed characters, you're counting in base 3, and since you're allowing three character strings, they're three digit numbers.
An N-digit number in base M can represent M^N different possible values, going from 0 through M^N-1. So, for your case, that's limit = pow(3, 3). To generate all those values, you just count from 0 up to (but not including) the limit, and convert each number to base M, using the specified characters as the "digits". For example, in C++ the code can look like this:
#include <cmath>
#include <cstddef>
#include <iostream>
#include <string>
int main() {
    std::string letters = "abc";
    std::size_t base = letters.length();
    std::size_t digits = 3;
    int limit = static_cast<int>(std::pow(base, digits));
    for (int i = 0; i < limit; i++) {
        int in = i;
        for (std::size_t j = 0; j < digits; j++) {
            std::cout << letters[in % base];
            in /= base;
        }
        std::cout << "\t";
    }
}
One minor note: as I've written it here, this produces the output in basically a little-endian format. That is, the "digit" that varies the fastest is on the left, and the one that changes the slowest is on the right.
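If you want the most significant digit on the left instead, the same counting idea can fill a buffer from the right; here is a quick sketch in C# (the names are my own):
static IEnumerable<string> CountingStrings(string letters, int length)
{
    int radix = letters.Length;
    long limit = (long)Math.Pow(radix, length);
    for (long i = 0; i < limit; i++)
    {
        char[] buffer = new char[length];
        long value = i;
        // fill from the right so the slowest-changing digit ends up on the left
        for (int j = length - 1; j >= 0; j--)
        {
            buffer[j] = letters[(int)(value % radix)];
            value /= radix;
        }
        yield return new string(buffer);
    }
}
CountingStrings("abc", 3) then enumerates aaa, aab, aac, ..., ccc in dictionary order.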

Reduce a string using grammar-like rules

I'm trying to find a suitable DP algorithm for simplifying a string. For example I have a string a b a b and a list of rules
a b -> b
a b -> c
b a -> a
c c -> b
The purpose is to get all single chars that can be obtained from the given string using these rules. For this example it will be b, c. The length of the given string can be up to 200 symbols. Could you please suggest an efficient algorithm?
The rules are always 2 -> 1 (two symbols on the left, one on the right). I've got the idea of creating a tree whose root is the given string and where each child is a string obtained after one transformation, but I'm not sure whether it's the best way.
If you read those rules from right to left, they look exactly like the rules of a context-free grammar, and have basically the same meaning. You could apply a chart-parsing algorithm like the Earley algorithm to your data, along with a suitable starting rule; something like
start <- a | b | c
       | start a
       | start b
       | start c
and then just examine the parse forest for the shortest chain of starts. The worst case remains O(n^3) of course, but Earley parsers are fairly efficient in practice.
You can also produce parse forests when parsing with derivatives. You might be able to efficiently check them for short chains of starts.
For a DP problem, you always need to understand how you can construct the answer for a big problem in terms of smaller sub-problems. Assume you have your function simplify which is called with an input of length n. There are n-1 ways to split the input into a first and a last part. For each of these splits, you should recursively call your simplify function on both the first part and the last part. The final answer for the input of length n is the set of all possible combinations of answers for the first and for the last part which are allowed by the rules.
In Python, this can be implemented like so:
rules = {'ab': set('bc'), 'ba': set('a'), 'cc': set('b')}
all_chars = set(c for cc in rules.values() for c in cc)

# memoize (e.g. with functools.lru_cache, see the note below)
def simplify(s):
    if len(s) == 1:  # base case to end recursion
        return set(s)
    possible_chars = set()
    # iterate over all the possible splits of s
    for i in range(1, len(s)):
        head = s[:i]
        tail = s[i:]
        # check all possible combinations of answers of sub-problems
        for c1 in simplify(head):
            for c2 in simplify(tail):
                possible_chars.update(rules.get(c1 + c2, set()))
                # speed hack
                if possible_chars == all_chars:  # won't get any bigger
                    return all_chars
    return possible_chars
Quick check:
In [53]: simplify('abab')
Out[53]: {'b', 'c'}
To make this fast enough for large strings (to avoid exponential behavior), you should use a memoize decorator (e.g. functools.lru_cache). This is a critical step in solving DP problems; otherwise you are just doing a brute-force calculation. A further tiny speedup can be obtained by returning from the function as soon as possible_chars == set('abc'), since at that point you are already sure that you can generate all possible outcomes.
Analysis of running time: for an input of length n, there are 2 substrings of length n-1, 3 substrings of length n-2, ... n substrings of length 1, for a total of O(n^2) subproblems. Due to the memoization, the function is called at most once for every subproblem. Maximum running time for a single sub-problem is O(n) due to the for i in range(len(s)), so the overall running time is at most O(n^3).
Let N be the length of the given string and R the number of rules.
Expanding a tree in a top down manner yields computational complexity O(NR^N) in the worst case (input string of type aaa... and rules aa -> a).
Proof:
The root of the tree has (N-1)R children, which have (N-1)R^2 children, ..., which have (N-1)R^N children (the leaves). So the total complexity is O((N-1)R + (N-1)R^2 + ... + (N-1)R^N) = O(N(1 + R + R^2 + ... + R^N)) = (using the binomial theorem) = O(N(R+1)^N) = O(NR^N).
Recursive Java implementation of this naive approach:
public static void main(String[] args) {
    Map<String, Character[]> rules = new HashMap<String, Character[]>() {{
        put("ab", new Character[]{'b', 'c'});
        put("ba", new Character[]{'a'});
        put("cc", new Character[]{'b'});
    }};
    System.out.println(simplify("abab", rules));
}

public static Set<String> simplify(String in, Map<String, Character[]> rules) {
    Set<String> result = new HashSet<String>();
    simplify(in, rules, result);
    return result;
}

private static void simplify(String in, Map<String, Character[]> rules, Set<String> result) {
    if (in.length() == 1) {
        result.add(in);
    }
    for (int i = 0; i < in.length() - 1; i++) {
        String two = in.substring(i, i + 2);
        Character[] rep = rules.get(two);
        if (rep != null) {
            for (Character c : rep) {
                simplify(in.substring(0, i) + c + in.substring(i + 2, in.length()), rules, result);
            }
        }
    }
}
Bas Swinckels's O(RN^3) Java implementation (with HashMap as a memoization cache):
public static Set<String> simplify2(final String in, Map<String, Character[]> rules) {
    Map<String, Set<String>> cache = new HashMap<String, Set<String>>();
    return simplify2(in, rules, cache);
}

private static Set<String> simplify2(final String in, Map<String, Character[]> rules, Map<String, Set<String>> cache) {
    final Set<String> cached = cache.get(in);
    if (cached != null) {
        return cached;
    }
    Set<String> ret = new HashSet<String>();
    if (in.length() == 1) {
        ret.add(in);
        return ret;
    }
    for (int i = 1; i < in.length(); i++) {
        String head = in.substring(0, i);
        String tail = in.substring(i, in.length());
        for (String c1 : simplify2(head, rules, cache)) {
            for (String c2 : simplify2(tail, rules, cache)) {
                Character[] rep = rules.get(c1 + c2);
                if (rep != null) {
                    for (Character c : rep) {
                        ret.add(c.toString());
                    }
                }
            }
        }
    }
    cache.put(in, ret);
    return ret;
}
Output in both approaches:
[b, c]

Determine number of char movement to get word

Suppose you are given a word
"sunflower"
You can perform only one type of operation on it: pick a character and move it to the front.
So for instance if you picked 'f', the word would be "fsunlower".
You can have a series of these operations.
fsunlower (moved f to front)
wfsunloer (moved w to front)
fwsunloer (moved f to front again)
The problem is to get the minimum number of operations required, given the derived word and the original word. So if the input strings are "fwsunloer" and "sunflower", the output would be 2: the sequence above happens to use three moves, but moving w to the front and then f to the front reaches the same word in two.
This problem is equivalent to: given strings A (the derived word) and B (the original word), find the longest suffix of A that is a subsequence of B. If n characters have to be moved, we need exactly n operations, so what we really want is the maximum number of characters that don't need to be moved; those characters keep their relative order from B and form a suffix of A.
So for the example used in the code below ("ewfsunlor" against "sunflower"), the longest such suffix is sunlor.
Java code:
public static void main(String[] args) {
    System.out.println(minOp("ewfsunlor", "sunflower"));
}

public static int minOp(String A, String B) {
    int n = A.length() - 1; // start from the end of String A
    int pos = B.length();
    int result = 0;
    while (n >= 0) {
        int nxt = -1;
        for (int i = pos - 1; i >= 0; i--) {
            if (B.charAt(i) == A.charAt(n)) {
                nxt = i;
                break;
            }
        }
        if (nxt == -1) {
            break;
        }
        result++;
        pos = nxt;
        n--;
    }
    return B.length() - result;
}
Result:
3
The time complexity is O(n), where n is the length of String A.
Note: this algorithm is based on the assumption that A and B contain the same multiset of characters. Otherwise, you need to check for that before using the function.
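That precheck is cheap: sort the characters of both strings and compare (sketched below in C#; any language works the same way):
static bool SameCharacters(string a, string b)
{
    if (a.Length != b.Length) return false;
    char[] x = a.ToCharArray(), y = b.ToCharArray();
    Array.Sort(x);
    Array.Sort(y);
    // equal multisets of characters sort to identical sequences
    for (int i = 0; i < x.Length; i++)
    {
        if (x[i] != y[i]) return false;
    }
    return true;
}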
