Efficient string sorting algorithm

Sorting strings by comparison (e.g. standard quicksort plus a strcmp-like comparison function) may be a bit slow, especially for long strings sharing a common prefix (the comparison function takes O(s) time, where s is the length of the string), thus a standard solution has complexity O(s * n log n). Are there any known faster algorithms?

If you know that the strings consist only of certain characters (which is almost always the case), you can use a variant of BucketSort or RadixSort.
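For illustration, here is a minimal MSD radix sort sketch (an editor's sketch, not from the answer), assuming 8-bit characters and std::string input:

#include <string>
#include <vector>

// Minimal MSD (most-significant-digit) radix sort sketch for std::string.
// Strings are distributed into 257 buckets on the character at depth d:
// bucket 0 for strings that end at depth d, one bucket per byte otherwise.
void msd_radix_sort(std::vector<std::string>& a, size_t d = 0)
{
    if (a.size() < 2) return;
    std::vector<std::vector<std::string>> buckets(257);
    for (auto& s : a) {
        size_t b = (d < s.size()) ? 1 + (unsigned char)s[d] : 0;
        buckets[b].push_back(std::move(s));
    }
    a.clear();
    for (size_t b = 0; b < buckets.size(); ++b) {
        // bucket 0 holds strings exhausted at this depth; they sort first
        // and need no recursion
        if (b > 0) msd_radix_sort(buckets[b], d + 1); // sort on next character
        for (auto& s : buckets[b]) a.push_back(std::move(s));
    }
}

Each character is inspected once per level, so the total work is proportional to the total number of characters, at the cost of extra bucket memory.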

You could build a trie, which should be O(s*n), I believe.
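A rough sketch of the trie idea (an editor's illustration of the details; with a std::map per node each character step costs O(log alphabet), so strictly this is O(s*n*log sigma)):

#include <map>
#include <memory>
#include <string>
#include <vector>

// Trie-based sort sketch: insert every string, then walk the trie in
// character order; the in-order walk emits the strings sorted.
struct TrieNode {
    std::map<char, std::unique_ptr<TrieNode>> next; // ordered by character
    int ends = 0;                                   // strings ending here
};

void collect(const TrieNode& node, std::string& prefix, std::vector<std::string>& out)
{
    for (int k = 0; k < node.ends; ++k) out.push_back(prefix);
    for (const auto& [c, child] : node.next) {
        prefix.push_back(c);
        collect(*child, prefix, out);
        prefix.pop_back();
    }
}

std::vector<std::string> trie_sort(const std::vector<std::string>& input)
{
    TrieNode root;
    for (const auto& s : input) {
        TrieNode* node = &root;
        for (char c : s) {
            auto& child = node->next[c];
            if (!child) child = std::make_unique<TrieNode>();
            node = child.get();
        }
        ++node->ends;
    }
    std::vector<std::string> out;
    std::string prefix;
    collect(root, prefix, out);
    return out;
}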

Please search for "Sedgewick multikey quicksort" (Sedgewick wrote famous algorithms textbooks in C and Java). His algorithm is relatively easy to implement and quite fast, and it avoids the problem you describe above. There is also the burstsort algorithm, which claims to be faster, but I don't know of any implementation.
There is an article, Fast String Sort in C# and F#, that describes the algorithm and has a reference to Sedgewick's code as well as to C# code (disclosure: it's an article and code that I wrote based on Sedgewick's paper).

Summary
I found the string_sorting repo by Tommi Rantala comprehensive; it includes many known efficient string sorting algorithms, e.g. MSD radix sort, burstsort and multikey quicksort. In addition, most of them are cache-efficient.
My Experience
It appears to me that three-way radix/string quicksort is one of the fastest string sorting algorithms. MSD radix sort is also a good one. Both are introduced in Sedgewick's excellent Algorithms book.
Here are some results for sorting leipzig1M.txt (taken from here):
$ wc leipzig1M.txt
# lines words characters
1'000'000 21'191'455 129'644'797 leipzig1M.txt
Method               Time
------               ----
Hoare                7.8792s
Quick3Way            7.5074s
Fast3Way             5.78015s
RadixSort            4.86149s
Quick3String         4.3685s
Heapsort             32.8318s
MergeSort            16.94s
std::sort/introsort  6.10666s
MSD+Q3S              3.74214s
The charming thing about three-way radix/string quicksort is that it is really simple to implement: the core is effectively only about ten source lines of code.
#include <algorithm>
#include <cstring>
#include <iterator>
#include <type_traits>

template<typename RandomIt>
void insertion_sort(RandomIt first, RandomIt last, size_t d)
{
    const int len = last - first;
    for (int i = 1; i < len; ++i) {
        // insert a[i] into the sorted sequence a[0..i-1]
        for (int j = i; j > 0 && std::strcmp(&(*(first+j))[d], &(*(first+j-1))[d]) < 0; --j)
            std::iter_swap(first + j, first + j - 1);
    }
}

template<typename RandomIt>
void quick3string(RandomIt first, RandomIt last, size_t d)
{
    if (last - first < 2) return;
#if 0 // seems not to help much
    if (last - first <= 8) { // change the threshold as you like
        insertion_sort(first, last, d);
        return;
    }
#endif
    typedef typename std::iterator_traits<RandomIt>::value_type String;
    typedef typename String::value_type CharT;
    typedef std::make_unsigned_t<CharT> UCharT;
    RandomIt lt = first, i = first + 1, gt = last - 1;
    /* make lo = median of {lo, mid, hi} */
    RandomIt mid = lt + ((gt - lt) >> 1);
    if ((*mid)[d] < (*lt)[d]) std::iter_swap(lt, mid);
    if ((*mid)[d] < (*gt)[d]) std::iter_swap(gt, mid);
    // now mid is the largest of the three, then make lo the median
    if ((*lt)[d] < (*gt)[d]) std::iter_swap(lt, gt);
    UCharT pivot = (*first)[d];
    while (i <= gt) {
        int diff = (UCharT) (*i)[d] - pivot;
        if (diff < 0) std::iter_swap(lt++, i++);
        else if (diff > 0) std::iter_swap(i, gt--);
        else ++i;
    }
    // Now a[lo..lt-1] < pivot = a[lt..gt] < a[gt+1..hi].
    quick3string(first, lt, d);      // sort a[lo..lt-1]
    if (pivot != '\0')
        quick3string(lt, gt+1, d+1); // sort a[lt..gt] on the following character
    quick3string(gt+1, last, d);     // sort a[gt+1..hi]
}

/*
 * Three-way string quicksort.
 * Similar to MSD radix sort, we first sort the array on the leading character
 * (using quicksort), then apply this method recursively on the subarrays. At
 * each step, a pivot v is chosen, then the range is partitioned into 3 parts:
 * strings whose d-th character is less than v, equal to v, and greater than
 * v. This is just like the partitioning in classic quicksort, but comparing
 * only one character instead of the whole string. After partitioning, only
 * the middle (equal-to-v) part moves on to the following character (index
 * d+1). The other two parts recurse on the same depth (index d), because
 * they haven't been fully sorted on the d-th character yet (we only know
 * they are <v or >v).
 *
 * Time complexity: between O(N) and O(N*lgN) character comparisons (instead
 * of string comparisons in normal quicksort); space complexity: O(lgN).
 * Explanation: N * string length (for partitioning, to find the equal-to-v
 * part) + O(N*lgN) for the quicksort recursion itself.
 */
template<typename RandomIt>
void str_qsort(RandomIt first, RandomIt last)
{
    quick3string(first, last, 0);
}
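A quick usage sketch (assuming the definitions above are in scope):

#include <iostream>
#include <string>
#include <vector>

int main()
{
    std::vector<std::string> v = {"banana", "apple", "applet", "band", "ape"};
    str_qsort(v.begin(), v.end());
    for (const auto& s : v) std::cout << s << '\n';
    // prints: ape, apple, applet, banana, band
}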
NOTE: but if, like me, you search Google for "fastest string sorting algorithm", chances are you'll find burstsort, a cache-aware MSD radix sort variant (paper). I also found this paper by Bentley and Sedgewick helpful, which uses multikey quicksort.

Related

Longest common prefix - comparing time complexity of two algorithms

Comparing these two solutions, the time complexity of the first solution is O(array-len * longest-string-len), which you may shorten to O(n*m) or even O(n^2). And the second one seems to be O(n * log n), as it has a sort call and then compares the first and the last items, which is O(n) and has no effect on the big O.
But what happens when comparing the string items in the list? Sorting a list of integer values is O(n * log n), but don't we need to compare the characters in the strings to be able to sort them? So, am I wrong if I say the time complexity of the second solution is O(n * log n * longest-string-len)?
Also, as it does not check the prefixes while it is sorting, it would do the full sort (the majority of the time) anyway, so its best case is far worse than the other option's? And for the worst-case scenario, if you consider the point I mentioned, it would still be worse than the first solution?
public string longestCommonPrefix(List<string> input) {
    if(input.Count == 0) return "";
    if(input.Count == 1) return input[0];
    var sb = new System.Text.StringBuilder();
    for(var charIndex = 0; charIndex < input[0].Length; charIndex++)
    {
        for(var itemIndex = 1; itemIndex < input.Count; itemIndex++)
        {
            // this string is too short: the common prefix ends here
            if(input[itemIndex].Length <= charIndex)
                return sb.ToString();
            if(input[0][charIndex] != input[itemIndex][charIndex])
                return sb.ToString();
        }
        sb.Append(input[0][charIndex]);
    }
    return sb.ToString();
}
static string longestCommonPrefix(String[] a)
{
    int size = a.Length;
    /* if size is 0, return empty string */
    if (size == 0)
        return "";
    if (size == 1)
        return a[0];
    /* sort the array of strings */
    Array.Sort(a);
    /* find the minimum length from first and last string */
    int end = Math.Min(a[0].Length, a[size-1].Length);
    /* find the common prefix between the first and last string */
    int i = 0;
    while (i < end && a[0][i] == a[size-1][i])
        i++;
    string pre = a[0].Substring(0, i);
    return pre;
}
First of all, unless I am missing something obvious, the first method runs in O(N * shortest-string-length); shortest, not longest.
Second, you may not reduce O(n*m) to O(n^2): the number of strings and their length are unrelated.
Finally, you are absolutely right. Sorting indeed takes O(n*log(n)*m), so in no case would it improve the performance.
As a side note, it may be beneficial to find the shortest string beforehand. This would make the input[itemIndex].Length <= charIndex check unnecessary, as shown in the sketch below.
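To illustrate (an editor's sketch, written in C++ rather than the question's C#, with hypothetical names), the first approach with the shortest length precomputed:

#include <algorithm>
#include <string>
#include <vector>

// Vertical scan bounded by the shortest string: O(n * shortest-string-length).
std::string longest_common_prefix(const std::vector<std::string>& input)
{
    if (input.empty()) return "";
    size_t shortest = input[0].size();
    for (const auto& s : input) shortest = std::min(shortest, s.size());
    for (size_t col = 0; col < shortest; ++col)
        for (const auto& s : input)
            if (s[col] != input[0][col])      // mismatch: the prefix ends here
                return input[0].substr(0, col);
    return input[0].substr(0, shortest);
}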

find the number of ways you can form a string of size N, given an unlimited number of 0s and 1s

The question below was asked in the Atlassian company online test. I don't have test cases; this is the question as I took it from this link:

find the number of ways you can form a string of size N, given an unlimited number of 0s and 1s. But you cannot have D consecutive 0s or T consecutive 1s. N, D, T are given as inputs.

Please help me with this problem; any approach on how to proceed with it is welcome.
My approach for the above question: I simply applied recursion, tried all possibilities, and then memoized using a hash map.
But it seems to me there must be some combinatoric approach that can solve this question in less time and space. For debugging purposes I am also printing the strings generated during recursion; if there is a flaw in my approach, please do tell me.
#include <bits/stdc++.h>
using namespace std;

unordered_map<string,int> dp;

// d, t: remaining budget of consecutive 0s / 1s before the forbidden run;
// oldd, oldt: the original limits, used to reset the budget on a switch
int recurse(int d, int t, int n, int oldd, int oldt, string s)
{
    if (d <= 0)
        return 0;
    if (t <= 0)
        return 0;
    cout << s << "\n"; // debug: print the partial string built so far
    if (n == 0)        // d > 0 and t > 0 are already guaranteed here
        return 1;
    string h = to_string(d) + " " + to_string(t) + " " + to_string(n);
    if (dp.find(h) != dp.end())
        return dp[h];
    int ans = 0;
    ans += recurse(d-1, oldt, n-1, oldd, oldt, s+'0')   // append '0'
         + recurse(oldd, t-1, n-1, oldd, oldt, s+'1');  // append '1'
    return dp[h] = ans;
}

int main()
{
    int n, d, t;
    cin >> n >> d >> t;
    dp.clear();
    cout << recurse(d, t, n, d, t, "") << "\n";
    return 0;
}
You are right: instead of generating strings, it is worth considering a combinatoric approach using (a kind of) dynamic programming.
A "good" sequence of length K must end with 1..D-1 zeros or 1..T-1 ones.
To make a good sequence of length K+1, you can add a zero to every sequence except those already ending in D-1 zeros, getting 2..D-1 zeros for the first kind of precursor and 1 zero for the second kind.
Similarly, you can add a one to all sequences of the first kind, and to all sequences of the second kind except those already ending in T-1 ones, getting 1 one for the first kind of precursor and 2..T-1 ones for the second kind.
Make two tables,
Zeros[N][D] and Ones[N][T]
Fill the first row with zeros, except for Zeros[1][1] = 1, Ones[1][1] = 1.
Fill row by row using the rules above:
Zeros[K][1] = Sum(Ones[K-1][C=1..T-1])
for C in 2..D-1:
    Zeros[K][C] = Zeros[K-1][C-1]
Ones[K][1] = Sum(Zeros[K-1][C=1..D-1])
for C in 2..T-1:
    Ones[K][C] = Ones[K-1][C-1]
The result is the sum of the last row of both tables.
Also note that you really need only two active rows of the table, so after debugging you can shrink storage to Zeros[2][D]. A sketch of these tables in code follows.
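Here is a C++ sketch of those tables (an editor's illustration of the recurrences above; unsigned long long is assumed large enough, and real inputs may require modular arithmetic):

#include <vector>

// zeros[k][c]: good strings of length k ending in exactly c consecutive 0s
// (c = 1..D-1); ones[k][c] likewise for 1s (c = 1..T-1).
unsigned long long count_good(int n, int d, int t)
{
    if (n <= 0 || d <= 0 || t <= 0) return 0;
    std::vector<std::vector<unsigned long long>> zeros(n + 1, std::vector<unsigned long long>(d, 0));
    std::vector<std::vector<unsigned long long>> ones(n + 1, std::vector<unsigned long long>(t, 0));
    if (d > 1) zeros[1][1] = 1; // the string "0"
    if (t > 1) ones[1][1] = 1;  // the string "1"
    for (int k = 2; k <= n; ++k) {
        // a new run of one 0 may follow any string ending in 1s, and vice versa
        if (d > 1) for (int c = 1; c < t; ++c) zeros[k][1] += ones[k - 1][c];
        if (t > 1) for (int c = 1; c < d; ++c) ones[k][1] += zeros[k - 1][c];
        // extending an existing run by one more of the same character
        for (int c = 2; c < d; ++c) zeros[k][c] = zeros[k - 1][c - 1];
        for (int c = 2; c < t; ++c) ones[k][c] = ones[k - 1][c - 1];
    }
    unsigned long long total = 0;
    for (int c = 1; c < d; ++c) total += zeros[n][c];
    for (int c = 1; c < t; ++c) total += ones[n][c];
    return total;
}

For example, count_good(2, 3, 2) returns 3, counting "00", "01" and "10" but rejecting "11".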
This can be solved using dynamic programming; I'll give a recursive solution. It is similar to generating a binary string.
The states are:
i: the ith character that we need to insert into the string.
cnt: the number of consecutive equal characters before i.
bit: the character which was repeated cnt times before i; the value of bit is either 0 or 1.
Base case: return 1 when we reach n (since we start from 0 and end at n-1).
Define the size of the dp array accordingly. The number of states, and hence the time complexity, is 2 * N * max(D,T).
#include <bits/stdc++.h>
using namespace std;

int dp[1000][1000][2]; // assumes n and max(d,t) are below 1000
int n, d, t;

// i: index of the character being placed; cnt: length of the current run;
// bit: the character that run consists of (0 or 1)
int count_ways(int i, int cnt, int bit) {
    if (i == n) {
        return 1;
    }
    int &ans = dp[i][cnt][bit];
    if (ans != -1) return ans;
    ans = 0;
    if (bit == 0) {
        ans += count_ways(i+1, 1, 1);       // switch to a run of 1s
        if (cnt != d - 1) {                 // extend the run of 0s if allowed
            ans += count_ways(i+1, cnt + 1, 0);
        }
    } else {
        // bit == 1
        ans += count_ways(i+1, 1, 0);       // switch to a run of 0s
        if (cnt != t - 1) {                 // extend the run of 1s if allowed
            ans += count_ways(i+1, cnt + 1, 1);
        }
    }
    return ans;
}

signed main() {
    ios_base::sync_with_stdio(false), cin.tie(nullptr);
    cin >> n >> d >> t;
    memset(dp, -1, sizeof dp);
    cout << count_ways(0, 0, 0);
    return 0;
}

Asymmetric Levenshtein distance

Given two bit strings, x and y, with x longer than y, I'd like to compute a kind of asymmetric variant of the Levenshtein distance between them. Starting with x, I'd like to know the minimum number of deletions and substitutions it takes to turn x into y.
Can I just use the usual Levenshtein distance for this, or do I need to modify the algorithm somehow? In other words, with the usual set of edits of deletion, substitution, and addition, is it ever beneficial to delete more than the difference in lengths between the two strings and then add some bits back? I suspect the answer is no, but I'm not sure. If I'm wrong, and I do need to modify the definition of Levenshtein distance to disallow additions, how do I do so?
Finally, I would intuitively expect to get the same distance if I started with y (the shorter string) and allowed only additions and substitutions. Is this right? I've got a sense of what these answers are, I just can't prove them.
If I understand you correctly, I think the answer is yes: the Levenshtein edit distance can differ from an algorithm that only allows deletions and substitutions to the larger string. Because of this, you would need to modify the algorithm, or create a different one, to get your limited version.
Consider the two strings "ABCD" and "ACDEF". The Levenshtein distance is 3 (ABCD -> ACD -> ACDE -> ACDEF). If we start with the longer string and limit ourselves to deletions and substitutions, we must use 4 edits (1 deletion and 3 substitutions). The reason is that edit paths that apply deletions to the smaller string to reach the larger one efficiently can't be mirrored when starting from the longer string, because the complementary insertion operation is unavailable (since you're disallowing it).
Your last paragraph is true. If the path from shorter to longer uses only insertions and substitutions, then any allowed path can simply be reversed from the longer to the shorter. Substitutions are the same regardless of direction, but the insertions when going from small to large become deletions when reversed.
I haven't tested this thoroughly, but this modification shows the direction I would take, and it appears to work with the values I've tested. It's written in C#, and follows the pseudocode in the Wikipedia entry for Levenshtein distance. There are obvious optimizations that could be made, but I refrained from making them so that the changes from the standard algorithm stay obvious. An important observation is that (under your constraints) if the strings are the same length, then substitution is the only operation allowed.
static int LevenshteinDistance(string s, string t) {
    int i, j;
    int m = s.Length;
    int n = t.Length;
    // for all i and j, d[i,j] will hold the Levenshtein distance between
    // the first i characters of s and the first j characters of t;
    // note that d has (m+1)*(n+1) values
    var d = new int[m + 1, n + 1];
    // each element starts at zero: C# initializes arrays to zero
    // source prefixes can be transformed into the empty string by
    // dropping all characters
    for (i = 0; i <= m; i++) d[i, 0] = i;
    // target prefixes can be reached from the empty source prefix
    // by inserting every character
    for (j = 0; j <= n; j++) d[0, j] = j;
    for (j = 1; j <= n; j++) {
        for (i = 1; i <= m; i++) {
            if (s[i - 1] == t[j - 1])
                d[i, j] = d[i - 1, j - 1]; // no operation required
            else {
                int del = d[i - 1, j] + 1;     // a deletion
                int ins = d[i, j - 1] + 1;     // an insertion
                int sub = d[i - 1, j - 1] + 1; // a substitution
                // the next two lines are the modification I've made
                //int insDel = (i < j) ? ins : del;
                //d[i, j] = (i == j) ? sub : Math.Min(insDel, sub);
                // the following 8 lines are a clearer version of the above 2 lines
                if (i == j) {
                    d[i, j] = sub;
                } else {
                    int insDel;
                    if (i < j) insDel = ins; else insDel = del;
                    // assign the smaller of insDel or sub
                    d[i, j] = Math.Min(insDel, sub);
                }
            }
        }
    }
    return d[m, n];
}

How to find the longest continuous sub-string in a string?

For example, there is a given string consisting of 1s and 0s:
s = "00000000001111111111100001111111110000";
What is the efficient way to get the count of longest 1s substring in s? (11)
What is the efficient way to get the count of longest 0s substring in s? (10)
I would appreciate the question being answered from an algorithmic perspective.
I think the most straightforward way is to walk through the bit string while recording the maximum lengths of the all-0 and all-1 substrings seen so far. This is of O(n) complexity, as suggested by others.
If you can afford some sort of data-parallel computation, you might want to look at parallel patterns, as explained here. Specifically, take a look at parallel reduction. I think this problem can be solved in O(log n) time if you can afford one of those methods.
I'm trying to think of a parallel reduction for this problem:
On the first level of the reduction, each thread will process a chunk of the bit string (size depending on the number of threads and the length of the string) and produce a summary of it, like: 0 -> x, 1 -> y, 0 -> z, ....
On the next level, each thread will merge two of these summaries into one; any possible joins are performed at this phase (basically, if the previous summary ends with a run of 0s (1s) and the next summary begins with a run of 0s (1s), then the last entry of one and the first entry of the other can be collapsed into one).
On the top level there will be just one structure with the overall summary of the bit string, which you'll have to step through to find the largest sequences (but this time it is in summary form, so it should be faster). Or you can make each summary structure keep track of the largest 0 and 1 substrings, which makes walking through the final structure unnecessary. A sequential sketch of the summarize/merge steps follows this answer.
I guess this approach only makes sense in a very limited scope, but since you seem to be very keen on doing better than O(n)...
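Here is a sequential C++ sketch of those summarize and merge steps (an editor's illustration only; in a real data-parallel setting each chunk would be summarized and merged by different threads in a reduction tree):

#include <algorithm>
#include <string>
#include <vector>

// Each chunk is reduced to a run-length summary: a list of (bit, length)
// pairs. Merging two summaries joins the runs at the boundary when the
// bits match, exactly as described above.
struct Run { char bit; int len; };

std::vector<Run> summarize(const std::string& chunk)
{
    std::vector<Run> runs;
    for (char c : chunk) {
        if (!runs.empty() && runs.back().bit == c) ++runs.back().len;
        else runs.push_back({c, 1});
    }
    return runs;
}

std::vector<Run> merge_summaries(std::vector<Run> left, const std::vector<Run>& right)
{
    size_t skip = 0;
    if (!left.empty() && !right.empty() && left.back().bit == right.front().bit) {
        left.back().len += right.front().len; // collapse the boundary runs
        skip = 1;
    }
    left.insert(left.end(), right.begin() + skip, right.end());
    return left;
}

int longest_run(const std::string& s, char bit)
{
    // sequential stand-in for the reduction tree: summarize two halves
    // (each would be a separate thread/chunk) and merge them
    std::vector<Run> summary = merge_summaries(summarize(s.substr(0, s.size() / 2)),
                                               summarize(s.substr(s.size() / 2)));
    int best = 0;
    for (const Run& r : summary)
        if (r.bit == bit) best = std::max(best, r.len);
    return best;
}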
OK, here is one solution I came up with; I'm not sure whether it is bug-free. Correct me if you discover a bug, or suggest a better way to do it. Vote for it if you agree with this solution. Thanks!
#include <iostream>
using namespace std;

int main(){
    int s[] = {0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0,0,0,0};
    int length = sizeof(s) / sizeof(s[0]);
    int one_start = 0;
    int one_n = 0;
    int max_one_n = 0;
    int zero_start = 0;
    int zero_n = 0;
    int max_zero_n = 0;
    for(int i=0; i<length; i++){
        // Track runs of 1s
        if(one_start==0 && s[i]==1){
            one_start = 1;
            one_n++;
        }
        else if(one_start==1 && s[i]==1){
            one_n++;
        }
        else if(one_start==1 && s[i]==0){
            one_start = 0;
            if(one_n > max_one_n){
                max_one_n = one_n;
            }
            one_n = 0; // Reset
        }
        // Track runs of 0s
        if(zero_start==0 && s[i]==0){
            zero_start = 1;
            zero_n++;
        }
        else if(zero_start==1 && s[i]==0){
            zero_n++;
        }
        else if(zero_start==1 && s[i]==1){
            zero_start = 0;
            if(zero_n > max_zero_n){
                max_zero_n = zero_n;
            }
            zero_n = 0; // Reset
        }
    }
    // The final run may extend to the end of the array
    if(one_n > max_one_n){
        max_one_n = one_n;
    }
    if(zero_n > max_zero_n){
        max_zero_n = zero_n;
    }
    cout << "max_one_n: " << max_one_n << endl;
    cout << "max_zero_n: " << max_zero_n << endl;
    return 0;
}
Worst case is always O(n): you can always find input which forces the algorithm to check every bit.
But you can probably get an average slightly better than that (more simply if you scan just for 0s or just for 1s, not both), because you can skip ahead by the length of the currently found longest sequence and scan backwards. At the very least this reduces the constant factor of O(n), and with random input, more items also mean longer sequences, and thus longer and longer skips. But the difference from O(n) will not be much... A sketch of the skip-ahead scan is below.
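Here is a C++ sketch of that skip-ahead scan for runs of 1s (an editor's reading of the idea; the worst case is still O(n), but every failed probe skips best+1 positions):

#include <algorithm>
#include <string>

// To beat the current best, a run must cover the probe index start + best,
// so if that position holds a '0' the whole window can be skipped.
size_t longest_ones_run(const std::string& s)
{
    size_t best = 0, start = 0;
    while (start + best < s.size()) {
        size_t probe = start + best;   // any longer run must cover this index
        if (s[probe] != '1') {         // probe failed: skip the window
            start = probe + 1;
            continue;
        }
        size_t lo = probe, hi = probe; // expand the run around the probe
        while (lo > start && s[lo - 1] == '1') --lo;
        while (hi + 1 < s.size() && s[hi + 1] == '1') ++hi;
        best = std::max(best, hi - lo + 1);
        start = hi + 2;                // s[hi+1] is '0' or past the end
    }
    return best;
}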

Is there a circular hash function?

Thinking about this question on testing string rotation, I wondered: is there such a thing as a circular/cyclic hash function? E.g.
h(abcdef) = h(bcdefa) = h(cdefab) etc
Uses for this include scalable algorithms which can check n strings against each other to see where some are rotations of others.
I suppose the essence of the hash is to extract information which is order-specific but not position-specific. Maybe something that finds a deterministic 'first position', rotates to it and hashes the result?
It all seems plausible, but slightly beyond my grasp at the moment; it must be out there already...
I'd go along with your deterministic "first position" - find the "least" character; if it appears twice, use the next character as the tie breaker (etc). You can then rotate to a "canonical" position, and hash that in a normal way. If the tie breakers run for the entire course of the string, then you've got a string which is a rotation of itself (if you see what I mean) and it doesn't matter which you pick to be "first".
So:
"abcdef" => hash("abcdef")
"defabc" => hash("abcdef")
"abaac" => hash("aacab") (tie-break between aa, ac and ab)
"cabcab" => hash("abcabc") (it doesn't matter which "a" comes first!)
Update: As Jon pointed out, the first approach doesn't handle strings with repetition very well. Problems arise as duplicate pairs of letters are encountered and the resulting XOR is 0. Here is a modification that I believe fixes the original algorithm. It uses Euclid-Fermat sequences to generate pairwise coprime integers for each additional occurrence of a character in the string. The result is that the XOR for duplicate pairs is non-zero.
I've also cleaned up the algorithm slightly. Note that the array containing the EF sequences only supports characters in the range 0x00 to 0xFF. This was just a cheap way to demonstrate the algorithm. Also, the algorithm still has runtime O(n) where n is the length of the string.
static int Hash(string s)
{
    int H = 0;
    if (s.Length > 0)
    {
        // any arbitrary coprime numbers
        int a = s.Length, b = s.Length + 1;
        // an array of Euclid-Fermat sequences to generate additional
        // coprimes for each duplicate character occurrence
        int[] c = new int[0x100];
        for (int i = 1; i < c.Length; i++)
        {
            c[i] = i + 1;
        }
        Func<char, int> NextCoprime = (x) => c[x] = (c[x] - x) * c[x] + x;
        Func<char, char, int> NextPair = (x, y) => a * NextCoprime(x) * x.GetHashCode() + b * y.GetHashCode();
        // for i=0 we need to wrap around to the last character
        H = NextPair(s[s.Length - 1], s[0]);
        // for i=1...n we use the previous character
        for (int i = 1; i < s.Length; i++)
        {
            H ^= NextPair(s[i - 1], s[i]);
        }
    }
    return H;
}

static void Main(string[] args)
{
    Console.WriteLine("{0:X8}", Hash("abcdef"));
    Console.WriteLine("{0:X8}", Hash("bcdefa"));
    Console.WriteLine("{0:X8}", Hash("cdefab"));
    Console.WriteLine("{0:X8}", Hash("cdfeab"));
    Console.WriteLine("{0:X8}", Hash("a0a0"));
    Console.WriteLine("{0:X8}", Hash("1010"));
    Console.WriteLine("{0:X8}", Hash("0abc0def0ghi"));
    Console.WriteLine("{0:X8}", Hash("0def0abc0ghi"));
}
The output is now:
7F7D7F7F
7F7D7F7F
7F7D7F7F
7F417F4F
C796C7F0
E090E0F0
A909BB71
A959BB71
First version (which isn't complete): use XOR, which is commutative (order doesn't matter), plus another little trick involving coprimes to combine ordered hashes of pairs of letters in the string. Here is an example in C#:
static int Hash(char[] s)
{
    // any arbitrary coprime numbers
    const int a = 7, b = 13;
    int H = 0;
    if (s.Length > 0)
    {
        // for i=0 we need to wrap around to the last character
        H ^= (a * s[s.Length - 1].GetHashCode()) + (b * s[0].GetHashCode());
        // for i=1...n we use the previous character
        for (int i = 1; i < s.Length; i++)
        {
            H ^= (a * s[i - 1].GetHashCode()) + (b * s[i].GetHashCode());
        }
    }
    return H;
}

static void Main(string[] args)
{
    Console.WriteLine(Hash("abcdef".ToCharArray()));
    Console.WriteLine(Hash("bcdefa".ToCharArray()));
    Console.WriteLine(Hash("cdefab".ToCharArray()));
    Console.WriteLine(Hash("cdfeab".ToCharArray()));
}
The output is:
4587590
4587590
4587590
7077996
You could find a deterministic first position by always starting at the position with the "lowest" (in terms of alphabetical ordering) substring. So in your case, you'd always start at "a". If there were multiple "a"s, you'd have to take two characters into account etc.
I am sure that you could find a function that generates the same hash regardless of character position in the input; however, how will you ensure that h(abc) != h(efg) for every conceivable input? (Collisions occur with all hash algorithms, so I mean: how do you minimize that risk?)
You'd need some additional checks even after generating the hash to ensure that the strings contain the same characters.
Here's an implementation using Linq:

public string ToCanonicalOrder(string input)
{
    char first = input.OrderBy(x => x).First();
    string doubledForRotation = input + input;
    string canonicalOrder
        = (-1)
        .GenerateFrom(x => doubledForRotation.IndexOf(first, x + 1))
        .Skip(1) // the -1
        .TakeWhile(x => x < input.Length)
        .Select(x => doubledForRotation.Substring(x, input.Length))
        .OrderBy(x => x)
        .First();
    return canonicalOrder;
}

assuming the generic generator extension method:

public static class TExtensions
{
    public static IEnumerable<T> GenerateFrom<T>(this T initial, Func<T, T> next)
    {
        var current = initial;
        while (true)
        {
            yield return current;
            current = next(current);
        }
    }
}

sample usage:

var sequences = new[]
{
    "abcdef", "bcdefa", "cdefab",
    "defabc", "efabcd", "fabcde",
    "abaac", "cabcab"
};
foreach (string sequence in sequences)
{
    Console.WriteLine(ToCanonicalOrder(sequence));
}

output:

abcdef
abcdef
abcdef
abcdef
abcdef
abcdef
aacab
abcabc

then call .GetHashCode() on the result if necessary.
sample usage if ToCanonicalOrder() is converted to an extension method:

sequence.ToCanonicalOrder().GetHashCode();
One possibility is to combine the hash functions of all circular shifts of your input into one meta-hash which does not depend on the order of the inputs.
More formally, consider
for (int i = 0; i < string.length; i++) {
    result ^= string.rotatedBy(i).hashCode();
}
Where you could replace the ^= with any other commutative operation.
As an example, consider the input
"abcd"
to get the hash we take
hash("abcd") ^ hash("dabc") ^ hash("cdab") ^ hash("bcda").
As we can see, taking the hash of any of these permutations will only change the order that you are evaluating the XOR, which won't change its value.
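A C++ sketch of this meta-hash (an editor's illustration; note that because XOR cancels pairs, strings whose rotations contain duplicates, e.g. "abab", all hash to 0, so a different commutative combiner such as addition may behave better):

#include <functional>
#include <string>

size_t rotation_xor_hash(const std::string& s)
{
    std::string d = s + s; // rotation i is the length-n window starting at i
    size_t h = 0;
    for (size_t i = 0; i < s.size(); ++i)
        h ^= std::hash<std::string>{}(d.substr(i, s.size()));
    return h; // n hash computations over length-n strings: O(n^2) total
}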
I did something like this for a project in college. There were 2 approaches I used to try to optimize a Travelling-Salesman problem. I think if the elements are NOT guaranteed to be unique, the second solution would take a bit more checking, but the first one should work.
If you can represent the string as a matrix of associations (row r, column c marked when c immediately follows r, cyclically), then abcdef would look like

  a b c d e f
a   x
b     x
c       x
d         x
e           x
f x

Any rotation of the string produces the same set of associations, so it would be trivial to compare those matrices.
Another, quicker trick would be to rotate the string so that the "first" letter comes first. Then, since both strings have the same starting point, identical cyclic strings become identical strings.
Here is some Ruby code:
def normalize_string(string)
myarray = string.split(//) # split into an array
index = myarray.index(myarray.min) # find the index of the minimum element
index.times do
myarray.push(myarray.shift) # move stuff from the front to the back
end
return myarray.join
end
p normalize_string('abcdef').eql?normalize_string('defabc') # should return true
Maybe use a rolling hash for each offset (Rabin-Karp like) and return the minimum hash value? There could be collisions, though.
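A sketch of that idea in C++ (BASE and MOD are arbitrary editor's choices; a polynomial hash rolled across the doubled string visits every rotation in O(n) total, and collisions remain possible, as the answer notes):

#include <algorithm>
#include <cstdint>
#include <string>

uint64_t min_rotation_hash(const std::string& s)
{
    const uint64_t BASE = 257, MOD = 1000000007ULL;
    const size_t n = s.size();
    if (n == 0) return 0;
    uint64_t top = 1;                    // BASE^(n-1) mod MOD
    for (size_t i = 1; i < n; ++i) top = top * BASE % MOD;
    uint64_t h = 0;                      // hash of rotation 0 (s itself)
    for (char c : s) h = (h * BASE + (unsigned char)c) % MOD;
    uint64_t best = h;
    for (size_t i = 0; i + 1 < n; ++i) { // rotate: move s[i] to the back
        h = (h + MOD - (unsigned char)s[i] * top % MOD) % MOD;
        h = (h * BASE + (unsigned char)s[i]) % MOD;
        best = std::min(best, h);
    }
    return best;
}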
