Algorithm for doing many substring reversals?

Suppose I have a string S of length N, and I want to perform M of the following operations:
choose 1 <= L,R <= N and reverse the substring S[L..R]
I am interested in what the final string looks like after all M operations. The obvious approach is to do the actual swapping, which leads to O(MN) worst-case behavior. Is there a faster way? I'm trying to just keep track of where an index ends up, but I cannot find a way to reduce the running time (though I have a gut feeling O(M lg N + N) -- for the operations and the final reading -- is possible).

Yeah, it's possible. Make a binary tree structure like
struct node {
    struct node *child[2];
    struct node *parent;
    char label;
    bool subtree_flipped;
};
Then you can have a logical getter/setter for left/right child:
struct node *get_child(struct node *u, bool right) {
    return u->child[u->subtree_flipped ^ right];
}

void set_child(struct node *u, bool right, struct node *c) {
    u->child[u->subtree_flipped ^ right] = c;
    if (c != NULL) { c->parent = u; }
}
Rotations have to preserve flipped bits:
struct node *detach(struct node *u, bool right) {
    struct node *c = get_child(u, right);
    if (c != NULL) { c->subtree_flipped ^= u->subtree_flipped; }
    return c;
}

void attach(struct node *u, bool right, struct node *c) {
    set_child(u, right, c);
    if (c != NULL) { c->subtree_flipped ^= u->subtree_flipped; }
}
// rotates one of |p|'s children up.
// does not fix up the pointer to |p|.
void rotate(struct node *p, bool right) {
    struct node *u = detach(p, right);
    struct node *c = detach(u, !right);
    attach(p, right, c);
    attach(u, !right, p);
}
Implement splay using these rotations. It should take a "guard" pointer that is treated as a NULL parent for the purpose of splaying, so that you can splay one node to the root and another to its right child. To reverse S[L..R], splay the node just before position L to the root and the node just after position R to its right child (sentinel nodes at both ends of the sequence avoid the edge cases); the region being reversed is then exactly the left subtree of that right child, so toggling its subtree_flipped bit reverses it lazily while leaving the two unaffected segments alone.
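For concreteness, here is a rough sketch of the reverse operation under that scheme. It assumes sentinel nodes at positions 0 and N+1, plus two helpers not shown above: node_at(root, i), which descends to the i-th node in symmetric order (applying subtree_flipped on the way down), and splay(u, guard), the splay routine just described. The helper names are mine, not part of this answer's code.

void reverse_range(struct node **root, int l, int r) {
    /* with sentinels, positions l-1 and r+1 always exist */
    struct node *a = node_at(*root, l - 1);   /* predecessor of the range */
    struct node *b = node_at(*root, r + 1);   /* successor of the range */
    splay(a, NULL);   /* a becomes the root */
    splay(b, a);      /* b becomes a's right child */
    *root = a;
    /* b's left subtree now holds exactly S[l..r]; flip it lazily */
    struct node *mid = get_child(b, false);
    if (mid != NULL) { mid->subtree_flipped = !mid->subtree_flipped; }
}

Each reversal is then O(log N) amortized, and with the final traversal the total is the O(M log N + N) you were hoping for.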
Traversal looks like this.
void traverse(struct node *u, bool flipped) {
    if (u == NULL) { return; }
    flipped ^= u->subtree_flipped;
    traverse(u->child[flipped], flipped);
    visit(u);
    traverse(u->child[!flipped], flipped);
}

A splay tree may help you: it supports the reverse operation on an array with total complexity O(M log N).

#F. Ju is right, splay trees are one of the best data structures to achieve your goal.
However, if you don't want to implement them, or if a solution in O((N + M) * sqrt(M)) is good enough, you can do the following:
Perform sqrt(M) consecutive queries, then rebuild the array from scratch in O(N) time.
In order to do that, for each query, store the information that the queried segment [a, b] is reversed or not (reversing a range of elements twice leaves it unreversed).
The key is to maintain this information for disjoint segments. Notice that since we perform at most sqrt(M) queries before rebuilding the array, we have at most O(sqrt(M)) disjoint segments, and a query touches at most O(sqrt(M)) of them, so it can be handled in O(sqrt(M)) time. Let me know if you need a more detailed explanation of how to "reverse" these disjoint segments.
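In the meantime, here is a sketch of what that bookkeeping might look like (my illustrative C++, not the solution linked below; Seg, split_at, reverse_range, and rebuild are my names). The current string is kept as a list of slices of the last rebuilt base string, each with a reversed flag:

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct Seg { int lo, hi; bool rev; };  // half-open slice [lo, hi) of the base string

// Split the segment list so that logical position p begins a segment.
void split_at(std::vector<Seg>& segs, int p) {
    int pos = 0;
    for (std::size_t i = 0; i < segs.size(); i++) {
        int len = segs[i].hi - segs[i].lo;
        if (pos < p && p < pos + len) {  // p falls strictly inside segment i
            int off = p - pos;
            Seg left = segs[i], right = segs[i];
            if (segs[i].rev) { left.lo = segs[i].hi - off; right.hi = segs[i].hi - off; }
            else             { left.hi = segs[i].lo + off; right.lo = segs[i].lo + off; }
            segs[i] = left;
            segs.insert(segs.begin() + i + 1, right);
            return;
        }
        pos += len;
    }
}

// Reverse the logical range [a, b] (0-based, inclusive) in O(#segments).
void reverse_range(std::vector<Seg>& segs, int a, int b) {
    split_at(segs, a);
    split_at(segs, b + 1);
    int pos = 0;
    std::size_t i = 0;
    while (pos < a) { pos += segs[i].hi - segs[i].lo; i++; }
    std::size_t j = i;
    while (pos <= b) { pos += segs[j].hi - segs[j].lo; j++; }
    std::reverse(segs.begin() + i, segs.begin() + j);                // reverse the order of the segments...
    for (std::size_t k = i; k < j; k++) segs[k].rev = !segs[k].rev;  // ...and each segment's direction
}

// Rebuild the concrete string; done every ~sqrt(M) queries to cap the segment count.
std::string rebuild(const std::string& base, std::vector<Seg>& segs) {
    std::string out;
    for (const Seg& s : segs) {
        std::string piece = base.substr(s.lo, s.hi - s.lo);
        if (s.rev) std::reverse(piece.begin(), piece.end());
        out += piece;
    }
    segs.assign(1, Seg{0, (int)out.size(), false});
    return out;
}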
This trick is very useful when solving problems like this one, and it is worth knowing.
UPDATE:
I solved exactly this problem on HackerRank, during their contest, using the method I described.
Here is the problem
Here is my solution in C++.
Here is the discussion about the problem and a brief description of my method, please check my 3rd message there.

I'm trying to just keep track of where an index ends up
If you're just trying to follow one entry of the starting array, it's easy to do that in O(M) time.
I was going to just write pseudocode, but no hand-waving was needed so I ended up with what's probably valid C++.
// untested C++, but it does compile to code that looks right.
struct swap {
    int l, r;
    // or make these non-member functions for C
    bool covers(int pos) { return l <= pos && pos <= r; }
    int apply_if_covering(int pos) {
        // startpos - l = r - endpos;
        // endpos = l - startpos + r
        if (covers(pos))
            pos = l - pos + r;
        return pos;
    }
};
int follow_swaps(int pos, int len, struct swap swaps[], int num_swaps)
{
    // pos = starting position of the element we want to track
    // return value = where it will be after all the swaps
    for (int i = 0; i < num_swaps; i++) {
        pos = swaps[i].apply_if_covering(pos);
    }
    return pos;
}
This compiles to very efficient-looking code.


How do I simply remove duplicates in my vector?

I am new to coding and struggling with a section in my code: the part where I want to remove duplicate int values from my vector.
My duplicated vector contains: 1 1 2 1 4
My goal is to get a deduplicated vector: 1, 2, 4.
This is what I have so far. It also needs to be a rather simple solution (no pointers and fancy stuff, as I still need to study those in the future).
for (int i = 0; i < duplicatedVector.size(); i++) {
    int temp = duplicatedVector.at(i);
    int counter = 0;
    if (temp == duplicatedVector.at(i)) {
        counter++;
        if (counter > 1) {
            deduplicatedVector.push_back(temp);
        }
    }
}
Could anyone tell me what I am doing wrong? I genuinely am trying to iterate through the vector and delete the duplicated ints, keeping the given order.
Your algorithm is not well-enough thought out.
Break it up:
for each element of the original vector:
    is it in the result vector?
        yes: do nothing
        no: add it to the result vector

You have the loop; it's the membership test that is confused. The result vector is not the same as the original vector, and is not to be indexed the same.
To determine whether an element is in a vector, you need a loop. Loop through your result vector to see if the element is in it. If you find it, it is, so break the inner loop. If you do not, you don't.
You can tell whether or not you found a duplicate by the final value of your inner loop index (the index into the result vector). If it equals result.size() then no duplicate was found.
Clearer variable naming might help as well. You are calling your original/source vector duplicatedVector, and your result vector deduplicatedVector. Even hasDuplicates and noDuplicates would be easier to mentally parse.
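Put together, the approach described above might look like this (a sketch, sticking to the .at() style you are already using):

#include <cstddef>
#include <vector>

std::vector<int> dedup(const std::vector<int>& source) {
    std::vector<int> result;
    for (std::size_t i = 0; i < source.size(); i++) {
        std::size_t j = 0;
        for (; j < result.size(); j++) {
            if (result.at(j) == source.at(i))
                break;                  // found it: it's a duplicate
        }
        if (j == result.size())         // scanned all of result without finding it
            result.push_back(source.at(i));
    }
    return result;
}

For the input 1 1 2 1 4 this produces 1, 2, 4, in the original order.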
You could use a set, since it eliminates duplicates (note that std::set also keeps its elements sorted, so the original order is not necessarily preserved):
#include <iostream>
#include <set>
#include <vector>
using namespace std;

int main() {
    vector<int> vec;
    vector<int> dedupl;
    vec.push_back(2);
    vec.push_back(4);
    vec.push_back(2);
    vec.push_back(7);
    vec.push_back(34);
    vec.push_back(34);
    set<int> mySet;
    for (int i = 0; i < vec.size(); i++) {
        mySet.insert(vec[i]);
    }
    for (int elem : mySet) {
        dedupl.push_back(elem);
    }
    for (int elem : dedupl) {
        cout << elem << " ";
    }
}

Reduce a string using grammar-like rules

I'm trying to find a suitable DP algorithm for simplifying a string. For example, I have the string a b a b and a list of rules:
a b -> b
a b -> c
b a -> a
c c -> b
The purpose is to get all single chars that can be obtained from the given string using these rules. For this example the answer is b, c. The length of the given string can be up to 200 symbols. Could you please suggest an efficient algorithm?
Rules are always 2 -> 1. I've got an idea of creating a tree, where the root is the given string and each child is a string after one transform, but I'm not sure it's the best way.
If you read those rules from right to left, they look exactly like the rules of a context-free grammar, and have basically the same meaning. You could apply a general context-free parsing algorithm like Earley's to your data, along with a suitable starting rule, something like

    start <- start a
           | start b
           | start c

and then just examine the parse forest for the shortest chain of starts. The worst case remains O(n^3), of course, but Earley parsers are fairly efficient these days.
You can also produce parse forests when parsing with derivatives. You might be able to efficiently check them for short chains of starts.
For a DP problem, you always need to understand how to construct the answer for a big problem in terms of smaller sub-problems. Assume your function simplify is called with an input of length n. There are n-1 ways to split the input into a first part and a last part. For each of these splits, recursively call simplify on both parts. The final answer for the input of length n is the set of all combinations of answers for the first and last parts that are allowed by the rules.
In Python, this can be implemented like so:
from functools import lru_cache

rules = {'ab': set('bc'), 'ba': set('a'), 'cc': set('b')}
all_chars = set(c for cc in rules.values() for c in cc)

@lru_cache(maxsize=None)  # memoize
def simplify(s):
    if len(s) == 1:  # base case to end recursion
        return set(s)
    possible_chars = set()
    # iterate over all the possible splits of s
    for i in range(1, len(s)):
        head = s[:i]
        tail = s[i:]
        # check all possible combinations of answers of sub-problems
        for c1 in simplify(head):
            for c2 in simplify(tail):
                possible_chars.update(rules.get(c1 + c2, set()))
        # speed hack
        if possible_chars == all_chars:  # won't get any bigger
            return all_chars
    return possible_chars
Quick check:
In [53]: simplify('abab')
Out[53]: {'b', 'c'}
To make this fast enough for large strings (to avoid exponential behavior), you should use a memoize decorator (functools.lru_cache in the listing above). This is a critical step in solving DP problems; otherwise you are just doing a brute-force calculation. A further tiny speedup can be obtained by returning from the function as soon as possible_chars == all_chars, since at that point you are already sure that you can generate all possible outcomes.
Analysis of running time: for an input of length n, there are 2 substrings of length n-1, 3 substrings of length n-2, ... n substrings of length 1, for a total of O(n^2) subproblems. Due to the memoization, the function is called at most once for every subproblem. Maximum running time for a single sub-problem is O(n) due to the for i in range(len(s)), so the overall running time is at most O(n^3).
Let N be the length of the given string and R the number of rules.
Expanding a tree in a top-down manner yields computational complexity O(N·R^N) in the worst case (an input string like aaa... with the rule aa -> a).
Proof:
The root of the tree has up to (N-1)R children, which have up to (N-1)R^2 descendants, ..., down to (N-1)R^N leaves. So the total complexity is O((N-1)R + (N-1)R^2 + ... + (N-1)R^N) = O(N(R + R^2 + ... + R^N)) = O(N·R^N), since the geometric series is dominated by its last term.
Recursive Java implementation of this naive approach:
public static void main(String[] args) {
    Map<String, Character[]> rules = new HashMap<String, Character[]>() {{
        put("ab", new Character[]{'b', 'c'});
        put("ba", new Character[]{'a'});
        put("cc", new Character[]{'b'});
    }};
    System.out.println(simplify("abab", rules));
}

public static Set<String> simplify(String in, Map<String, Character[]> rules) {
    Set<String> result = new HashSet<String>();
    simplify(in, rules, result);
    return result;
}

private static void simplify(String in, Map<String, Character[]> rules, Set<String> result) {
    if (in.length() == 1) {
        result.add(in);
    }
    for (int i = 0; i < in.length() - 1; i++) {
        String two = in.substring(i, i + 2);
        Character[] rep = rules.get(two);
        if (rep != null) {
            for (Character c : rep) {
                simplify(in.substring(0, i) + c + in.substring(i + 2, in.length()), rules, result);
            }
        }
    }
}
Bas Swinckels's O(RN^3) Java implementation (with HashMap as a memoization cache):
public static Set<String> simplify2(final String in, Map<String, Character[]> rules) {
    Map<String, Set<String>> cache = new HashMap<String, Set<String>>();
    return simplify2(in, rules, cache);
}

private static Set<String> simplify2(final String in, Map<String, Character[]> rules, Map<String, Set<String>> cache) {
    final Set<String> cached = cache.get(in);
    if (cached != null) {
        return cached;
    }
    Set<String> ret = new HashSet<String>();
    if (in.length() == 1) {
        ret.add(in);
        return ret;
    }
    for (int i = 1; i < in.length(); i++) {
        String head = in.substring(0, i);
        String tail = in.substring(i, in.length());
        for (String c1 : simplify2(head, rules, cache)) {  // pass the cache down both branches
            for (String c2 : simplify2(tail, rules, cache)) {
                Character[] rep = rules.get(c1 + c2);
                if (rep != null) {
                    for (Character c : rep) {
                        ret.add(c.toString());
                    }
                }
            }
        }
    }
    cache.put(in, ret);
    return ret;
}
Output in both approaches:
[b, c]

Count of nodes in BST

I am trying to count the number of nodes in a Binary Search Tree and was wondering what the most efficient means was. These are the options that I have found:
store int count in the BST Class
store int children in each node of the tree which stores the number of children under it
write a method that counts the number of Nodes in the BST
if using option 3, I've written:
int InOrder() {
    Node *cur = root;
    int count = 0;
    std::stack<Node*> s;
    bool done = false;
    while (!done) {
        if (cur != NULL) {
            s.push(cur);
            cur = cur->left;
        }
        else {
            if (!s.empty()) {
                cur = s.top();
                s.pop();
                count++;
                cur = cur->right;
            }
            else {
                done = true;
            }
        }
    }
    return count;
}
but from looking at it, it seems like it would get stuck in an infinite loop between cur = cur->left; and cur = cur->right;
So which option is the most efficient and if it is option 3, then will this method work?
I think the first option is the quickest, and it only requires O(1) extra space. However, whenever you insert or delete an item, you need to keep this value updated.
Getting the number of nodes then takes O(1) time.
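A minimal sketch of option 1 (illustrative names, with the tree logic elided):

struct Node { int key; Node *left, *right; };

class BST {
    Node* root = nullptr;
    int nodeCount = 0;                        // maintained by insert/remove
public:
    int count() const { return nodeCount; }   // option 1: O(1) query
    void insert(int key) {
        // ... normal BST insertion of key under root ...
        ++nodeCount;                          // a new node was linked in
    }
    bool remove(int key) {
        bool removed = false;
        // ... normal BST deletion; set removed = true if a node was unlinked ...
        if (removed) --nodeCount;
        return removed;
    }
};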
The second option would make the program way too complicated, since deleting or inserting a node would require updating the counts in all of its ancestors. Either you add a parent pointer so you can update each ancestor properly, or you have to go through all the nodes in the tree and recompute the numbers. Either way, I think this is the worst option of the three.
The third option is fine if you don't call it often, but it takes O(n), since you need to visit every single node to count it; the first option answers in O(1).
In terms of your code, I think it's easier to write in a recursive way like below:
int getCount(Node* n)
{
    if (!n)
        return 0;
    return 1 + getCount(n->left) + getCount(n->right);
}
Hope this helps!

Scrabble tile checking

For tile checking in Scrabble, you make four 5x5 grids of letters, totalling 100 tiles. I would like to make one where all 40 horizontal and vertical words are valid. The set of available tiles contains:
12 x E
9 x A, I
8 x O
6 x N, R, T
4 x D, L, S, U
3 x G
2 x B, C, F, H, M, P, V, W, Y, blank tile (wildcard)
1 x K, J, Q, X, Z
The dictionary of valid words is available here (700KB). There are about 12,000 valid 5 letter words.
Here's an example where all 20 horizontal words are valid:
Z O W I E|P I N O T
Y O G I N|O C t A D <= blank being used as 't'
X E B E C|N A L E D
W A I T E|M E R L E
V I N E R|L U T E A
---------+---------
U S N E A|K N O S P
T A V E R|J O L E D
S O F T A|I A M B I
R I D G Y|H A I T h <= blank being used as 'h'
Q U R S H|G R O U F
I'd like to create one where all the vertical ones are also valid. Can you help me solve this? It is not homework. It is a question a friend asked me for help with.
Final Edit: Solved! Here is a solution.
GNAWN|jOULE
RACHE|EUROS
IDIOT|STEAN
PINOT|TRAvE
TRIPY|SOLES
-----+-----
HOWFF|ZEBRA
AGILE|EQUID
CIVIL|BUXOM
EVENT|RIOJA
KEDGY|ADMAN
Here's a photo of it constructed with my scrabble set. http://twitpic.com/3wn7iu
This one was easy to find once I had the right approach, so I bet you could find many more this way. See below for methodology.
Construct a prefix tree from the dictionary of 5-letter words, and track a position in it for each row and column. Recursively, a given tile placement is valid if it forms valid prefixes for its column and row, if the tile is available, and if the next tile placement is valid. The base case: the board is valid if there is no tile left to place.
It probably makes sense to just find all valid 5x5 boards, like Glenn said, and see if any four of them can be combined. Recursing to a depth of 100 doesn't sound like fun.
Edit: Here is version 2 of my code for this.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

typedef union node node;
union node {
    node* child[26];
    char string[6];
};

typedef struct snap snap;
struct snap {
    node* rows[5];
    node* cols[5];
    char tiles[27];
    snap* next;
};

node* root;
node* vtrie[5];
node* htrie[5];
snap* head;

char bag[27] = {9,2,2,4,12,2,3,2,9,1,1,4,2,6,8,2,1,6,4,6,4,2,2,1,2,1,2};
const char full_bag[27] = {9,2,2,4,12,2,3,2,9,1,1,4,2,6,8,2,1,6,4,6,4,2,2,1,2,1,2};
const char order[26] = {16,23,9,25,21,22,5,10,1,6,7,12,15,2,24,3,20,13,19,11,8,17,14,0,18,4};

void insert(char* string){
    node* place = root;
    int i;
    for (i = 0; i < 5; i++) {
        if (place->child[string[i] - 'A'] == NULL) {
            int j;
            place->child[string[i] - 'A'] = malloc(sizeof(node));
            for (j = 0; j < 26; j++) {
                place->child[string[i] - 'A']->child[j] = NULL;
            }
        }
        place = place->child[string[i] - 'A'];
    }
    memcpy(place->string, string, 6);
}
void check_four(){
    snap *a, *b, *c, *d;
    char two_total[27];
    char three_total[27];
    int i;
    bool match;
    a = head;
    for (b = a->next; b != NULL; b = b->next) {
        for (i = 0; i < 27; i++)
            two_total[i] = a->tiles[i] + b->tiles[i];
        for (c = b->next; c != NULL; c = c->next) {
            for (i = 0; i < 27; i++)
                three_total[i] = two_total[i] + c->tiles[i];
            for (d = c->next; d != NULL; d = d->next) {
                match = true;
                for (i = 0; i < 27; i++) {
                    if (three_total[i] + d->tiles[i] != full_bag[i]) {
                        match = false;
                        break;
                    }
                }
                if (match) {
                    printf("\nBoard Found!\n\n");
                    for (i = 0; i < 5; i++) {
                        printf("%s\n", a->rows[i]->string);
                    }
                    printf("\n");
                    for (i = 0; i < 5; i++) {
                        printf("%s\n", b->rows[i]->string);
                    }
                    printf("\n");
                    for (i = 0; i < 5; i++) {
                        printf("%s\n", c->rows[i]->string);
                    }
                    printf("\n");
                    for (i = 0; i < 5; i++) {
                        printf("%s\n", d->rows[i]->string);
                    }
                    exit(0);
                }
            }
        }
    }
}
void snapshot(){
    snap* shot = malloc(sizeof(snap));
    int i;
    for (i = 0; i < 5; i++) {
        printf("%s\n", htrie[i]->string);
        shot->rows[i] = htrie[i];
        shot->cols[i] = vtrie[i];
    }
    printf("\n");
    for (i = 0; i < 27; i++) {
        shot->tiles[i] = full_bag[i] - bag[i];
    }
    bool transpose = false;
    snap* place = head;
    while (place != NULL && !transpose) {
        transpose = true;
        for (i = 0; i < 5; i++) {
            if (shot->rows[i] != place->cols[i]) {
                transpose = false;
                break;
            }
        }
        place = place->next;
    }
    if (transpose) {
        free(shot);
    }
    else {
        shot->next = head;
        head = shot;
        check_four();
    }
}
void pick(int x, int y){
    if (y == 5) {
        snapshot();
        return;
    }
    int i, tile, nextx, nexty;
    node* oldv = vtrie[x];
    node* oldh = htrie[y];
    if (x + 1 == 5) {
        nexty = y + 1;
        nextx = 0;
    } else {
        nextx = x + 1;
        nexty = y;
    }
    for (i = 0; i < 26; i++) {
        /* indexing bag[] with i while the tries are indexed with order[i] appears
           to be the tile-tracking bug mentioned in Edit 3 below; the final version
           uses bag[order[i]] here instead */
        if (vtrie[x]->child[order[i]] != NULL &&
            htrie[y]->child[order[i]] != NULL &&
            (tile = bag[i] ? i : bag[26] ? 26 : -1) + 1) {
            vtrie[x] = vtrie[x]->child[order[i]];
            htrie[y] = htrie[y]->child[order[i]];
            bag[tile]--;
            pick(nextx, nexty);
            vtrie[x] = oldv;
            htrie[y] = oldh;
            bag[tile]++;
        }
    }
}
int main(int argc, char** argv){
    root = malloc(sizeof(node));
    FILE* wordlist = fopen("sowpods5letters.txt", "r");
    head = NULL;
    int i;
    for (i = 0; i < 26; i++) {
        root->child[i] = NULL;
    }
    for (i = 0; i < 5; i++) {
        vtrie[i] = root;
        htrie[i] = root;
    }
    char* string = malloc(sizeof(char) * 6);
    while (fscanf(wordlist, "%s", string) != EOF) {
        insert(string);
    }
    free(string);
    fclose(wordlist);
    pick(0, 0);
    return 0;
}
This tries the infrequent letters first, which I'm no longer sure is a good idea. It starts to get bogged down before it makes it out of the boards starting with X. After seeing how many 5x5 blocks there were, I altered the code to just list out all the valid 5x5 blocks. I now have a 150 MB text file with all 4,430,974 5x5 solutions.
I also tried it with just recursing through the full 100 tiles, and that is still running.
Edit 2: Here is the list of all the valid 5x5 blocks I generated. http://web.cs.sunyit.edu/~levyt/solutions.rar
Edit 3: Hmm, seems there was a bug in my tile usage tracking, because I just found a block in my output file that uses 5 Zs.
COSTE
ORCIN
SCUZZ
TIZZY
ENZYM
Edit 4: Here is the final product.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

typedef union node node;
union node {
    node* child[26];
    char string[6];
};

node* root;
node* vtrie[5];
node* htrie[5];
int score;
int max_score;

char block_1[27] = {4,2,0,2,2,0,0,0,2,1,0,0,2,1,2,0,1,2,0,0,2,0,0,1,0,1,0};  // ZEBRA EQUID BUXOM RIOJA ADMAN
char block_2[27] = {1,0,1,1,4,2,2,1,3,0,1,2,0,1,1,0,0,0,0,1,0,2,1,0,1,0,0};  // HOWFF AGILE CIVIL EVENT KEDGY
char block_3[27] = {2,0,1,1,1,0,1,1,4,0,0,0,0,3,2,2,0,2,0,3,0,0,1,0,1,0,0};  // GNAWN RACHE IDIOT PINOT TRIPY
// JOULE EUROS STEAN TRAVE SOLES
char bag[27] = {9,2,2,4,12,2,3,2,9,1,1,4,2,6,8,2,1,6,4,6,4,2,2,1,2,1,2};
const char full_bag[27] = {9,2,2,4,12,2,3,2,9,1,1,4,2,6,8,2,1,6,4,6,4,2,2,1,2,1,2};
const char order[26] = {16,23,9,25,21,22,5,10,1,6,7,12,15,2,24,3,20,13,19,11,8,17,14,0,18,4};
const int value[27] = {244,862,678,564,226,1309,844,765,363,4656,909,414,691,463,333,687,11998,329,218,423,536,1944,1244,4673,639,3363,0};
void insert(char* string){
    node* place = root;
    int i;
    for (i = 0; i < 5; i++) {
        if (place->child[string[i] - 'A'] == NULL) {
            int j;
            place->child[string[i] - 'A'] = malloc(sizeof(node));
            for (j = 0; j < 26; j++) {
                place->child[string[i] - 'A']->child[j] = NULL;
            }
        }
        place = place->child[string[i] - 'A'];
    }
    memcpy(place->string, string, 6);
}
void snapshot(){
    static int count = 0;
    int i;
    for (i = 0; i < 5; i++) {
        printf("%s\n", htrie[i]->string);
    }
    for (i = 0; i < 27; i++) {
        printf("%c%d ", 'A' + i, bag[i]);
    }
    printf("\n");
    if (++count >= 1000) {
        exit(0);
    }
}
void pick(int x, int y){
    if (y == 5) {
        if (score > max_score) {
            snapshot();
            max_score = score;
        }
        return;
    }
    int i, tile, nextx, nexty;
    node* oldv = vtrie[x];
    node* oldh = htrie[y];
    if (x + 1 == 5) {
        nextx = 0;
        nexty = y + 1;
    } else {
        nextx = x + 1;
        nexty = y;
    }
    for (i = 0; i < 26; i++) {
        if (vtrie[x]->child[order[i]] != NULL &&
            htrie[y]->child[order[i]] != NULL &&
            (tile = bag[order[i]] ? order[i] : bag[26] ? 26 : -1) + 1) {
            vtrie[x] = vtrie[x]->child[order[i]];
            htrie[y] = htrie[y]->child[order[i]];
            bag[tile]--;
            score += value[tile];
            pick(nextx, nexty);
            vtrie[x] = oldv;
            htrie[y] = oldh;
            bag[tile]++;
            score -= value[tile];
        }
    }
}
int main(int argc, char** argv){
    root = malloc(sizeof(node));
    FILE* wordlist = fopen("sowpods5letters.txt", "r");
    score = 0;
    max_score = 0;
    int i;
    for (i = 0; i < 26; i++) {
        root->child[i] = NULL;
    }
    for (i = 0; i < 5; i++) {
        vtrie[i] = root;
        htrie[i] = root;
    }
    for (i = 0; i < 27; i++) {
        bag[i] = bag[i] - block_1[i];
        bag[i] = bag[i] - block_2[i];
        bag[i] = bag[i] - block_3[i];
        printf("%c%d ", 'A' + i, bag[i]);
    }
    char* string = malloc(sizeof(char) * 6);
    while (fscanf(wordlist, "%s", string) != EOF) {
        insert(string);
    }
    free(string);
    fclose(wordlist);
    pick(0, 0);
    return 0;
}
After finding out how many blocks there were (nearly 2 billion and still counting), I switched to trying to find certain types of blocks, in particular the difficult-to-construct ones using uncommon letters. My hope was that if I went into the last block with a benign enough set of letters remaining, the vast space of valid blocks would probably contain one for that set of letters.
I assigned each tile a value inversely proportional to the number of 5-letter words it appears in. Then, when I found a valid block, I would sum up the tile values, and if the score was the best I had yet seen, I would print out the block.
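That weighting might be computed along these lines (my reconstruction in C++, not the original code; the scale factor is arbitrary, and I am guessing that "appears in" counts words rather than occurrences):

#include <set>
#include <string>
#include <vector>

// Value of each letter: inversely proportional to how many 5-letter
// words use it (words assumed uppercase A..Z).
std::vector<int> tile_values(const std::vector<std::string>& words) {
    std::vector<int> words_using(26, 0);
    for (const std::string& w : words) {
        std::set<char> seen(w.begin(), w.end());  // count each word once per letter
        for (char ch : seen) words_using[ch - 'A']++;
    }
    std::vector<int> value(26, 0);
    for (int i = 0; i < 26; i++)
        if (words_using[i] > 0)
            value[i] = 1000000 / words_using[i];  // arbitrary scale
    return value;
}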
For the first block I removed the blank tiles, figuring that the last block would need that flexibility the most. After letting it run until I had not seen a better block appear for some time, I selected the best block, and removed the tiles in it from the bag, and ran the program again, getting the second block. I repeated this for the 3rd block. Then for the last block I added the blanks back in and used the first valid block it found.
Here's how I would try this. First construct a prefix tree.
Pick a word and place it horizontally on top. Then pick a word and place it vertically. Alternate until the options are exhausted. By alternating you fix the first letters early, eliminating lots of mismatching words. If you really do find such a square, then check whether it can be made with the available pieces.
For 5x5 squares: after doing some thinking, it can't be worse than O(12000!/11990!) for random text words. But thinking about it a little bit more: every time you fix a letter (in normal text) you eliminate about 90% (an optimistic guess) of your words. This means that after three iterations you're down to 12 words, so the actual cost would be something like

    O(n * n/10 * n/10 * n/100 * n/100 * n/1000 * n/1000 * ...)

which for 12000 elements acts like an n^4 algorithm, which isn't that bad.
Probably someone can do a better analysis of the problem, but the search for words should still converge quite quickly.
More elimination can be done by exploiting the infrequent letters. Essentially: find all words that contain an infrequent letter, try to make the positions match for each of them, and construct a set of valid letters for each position.
For example, let's say we have four words with the letter Q in them:
    AQFED, ZQABE, EDQDE, ELQUO
This means there are two valid positionings of those:

    xZxxx
    AQFED
    xAxxx  ---> this limits our search to words that contain [ABDEFZ] as the second letter
    xBxxx
    xExxx

and the same for the other pair:

    EDQDE  ---> this limits our search to words that contain [EDLU] as the third letter
    ELQUO

All appropriate words are in the union of those two conditions.
So basically, if the matrix uses a word S with the infrequent letter X at position n, then the word crossing S there must be one of the few X-words too, which means the other words in the matrix must have, at position n, a letter that occurs in one of those X-words.
Formula (a code sketch follows):
    1. Find all words that contain the infrequent letter X at position 1 (on the next iteration position 2, 3, ...)
    2. Make a set A out of the letters in those words
    3. Keep only those words from the dictionary that have a letter from set A in position 1
    4. Try to fit those into the matrix (with the first method)
    5. Repeat with position 2
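Here is a sketch of steps 1-3 (in C++, with my illustrative names; positions are 0-based):

#include <array>
#include <set>
#include <string>
#include <vector>

// Steps 1-2: for an infrequent letter x at position p, collect the letters
// that the words containing x there allow at every position.
std::array<std::set<char>, 5> allowed_letters(const std::vector<std::string>& words,
                                              char x, int p) {
    std::array<std::set<char>, 5> allowed;  // one set A per position
    for (const std::string& w : words)
        if (w[p] == x)
            for (int i = 0; i < 5; i++)
                allowed[i].insert(w[i]);
    return allowed;
}

// Step 3: keep only the words whose letter at position p lies in the set.
std::vector<std::string> filter_words(const std::vector<std::string>& dict,
                                      const std::set<char>& a, int p) {
    std::vector<std::string> kept;
    for (const std::string& w : dict)
        if (a.count(w[p])) kept.push_back(w);
    return kept;
}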
I would approach the problem (naively, to be sure) by taking a pessimistic view. I'd try to prove there was no 5x5 solution, and therefore certainly not four 5x5 solutions. To prove there was no 5x5 solution I'd try to construct one from all possibilities. If my conjecture failed and I was able to construct a 5x5 solution, well, then, I'd have a way to construct 5x5 solutions and I would try to construct all of the (independent) 5x5 solutions. If there were at least 4, then I would determine if some combination satisfied the letter count restrictions.
[Edit] Null Set has determined that there are "4,430,974 5x5 solutions". Are these valid?
I mean that we have a limitation on the number of letters we can use. This limitation can be expressed as a boundary vector BV = [9, 2, 2, 4, ...] corresponding to the limits on A, B, C, etc. (you see this vector in Null Set's code). A 5x5 solution is valid if each term of its letter-count vector is at most the corresponding term of BV. It would be easy to check each 5x5 solution for validity as it is created. Perhaps the 4,430,974 number can be reduced, say to N.
Regardless, we can state the problem as: find four letter-count vectors among the N whose sum is equal to BV. There are C(N, 4) possible sums ("N choose 4"). With N equal to 4 million this is still on the order of 10^25, not an encouraging number. Perhaps you could search for four vectors whose first terms sum to 9, and only for those check whether the second terms sum to 2, etc.
I'd remark that after choosing 4 from N the computations are independent, so if you have a multi-core machine you can make this go faster with a parallel solution.
[Edit2] Parallelizing probably wouldn't make much difference, though. At this point I might take an optimistic view: there are certainly more 5x5 solutions than I expected, so there may be more final solutions than expected, too. Perhaps you might not have to get far into the 10^25 to hit one.
I'm starting with something simpler.
Here are some results so far:
3736 2x2 solutions
8812672 3x3 solutions
The 1000th 4x4 solution is
A A H S
A C A I
L A I R
S I R E
The 1000th 5x5 solution is
A A H E D
A B U N A
H U R S T
E N S U E
D A T E D
The 1000th 2x4x4 solution is
A A H S | A A H S
A B A C | A B A C
H A I R | L E K U
S C R Y | S T E D
--------+--------
D E E D | D E E M
E I N E | I N T I
E N O L | O V E R
T E L T | L Y N E
Note that transposing an 'A' and a blank that is being used as an 'A' should be considered the same solution. But transposing the rows with the columns should be considered a different solution. I hope that makes sense.
Here are a lot of precomputed 5x5's. Left as an exercise to the reader to find 4 compatible ones :-)
http://www.gtoal.com/wordgames/wordsquare/all5

Is there a circular hash function?

Thinking about this question on testing string rotation, I wondered: is there such a thing as a circular/cyclic hash function? E.g.
h(abcdef) = h(bcdefa) = h(cdefab) etc
Uses for this include scalable algorithms which can check n strings against each other to see whether some are rotations of others.
I suppose the essence of the hash is to extract information which is order-specific but not position-specific. Maybe something that finds a deterministic 'first position', rotates to it and hashes the result?
It all seems plausible, but slightly beyond my grasp at the moment; it must be out there already...
I'd go along with your deterministic "first position" - find the "least" character; if it appears twice, use the next character as the tie breaker (etc). You can then rotate to a "canonical" position, and hash that in a normal way. If the tie breakers run for the entire course of the string, then you've got a string which is a rotation of itself (if you see what I mean) and it doesn't matter which you pick to be "first".
So:
"abcdef" => hash("abcdef")
"defabc" => hash("abcdef")
"abaac" => hash("aacab") (tie-break between aa, ac and ab)
"cabcab" => hash("abcabc") (it doesn't matter which "a" comes first!)
Update: As Jon pointed out, the first approach doesn't handle strings with repetition very well. Problems arise as duplicate pairs of letters are encountered and the resulting XOR is 0. Here is a modification that I believe fixes the original algorithm. It uses Euclid-Fermat sequences to generate pairwise coprime integers for each additional occurrence of a character in the string. The result is that the XOR for duplicate pairs is non-zero.
I've also cleaned up the algorithm slightly. Note that the array containing the EF sequences only supports characters in the range 0x00 to 0xFF; this was just a cheap way to demonstrate the algorithm. Also, the algorithm still has runtime O(n), where n is the length of the string.
static int Hash(string s)
{
    int H = 0;
    if (s.Length > 0)
    {
        // any arbitrary coprime numbers
        int a = s.Length, b = s.Length + 1;
        // an array of Euclid-Fermat sequences to generate additional coprimes
        // for each duplicate character occurrence (one entry per char 0x00..0xFF)
        int[] c = new int[0x100];
        for (int i = 1; i < c.Length; i++)
        {
            c[i] = i + 1;
        }
        Func<char, int> NextCoprime = (x) => c[x] = (c[x] - x) * c[x] + x;
        Func<char, char, int> NextPair = (x, y) => a * NextCoprime(x) * x.GetHashCode() + b * y.GetHashCode();
        // for i=0 we need to wrap around to the last character
        H = NextPair(s[s.Length - 1], s[0]);
        // for i=1...n we use the previous character
        for (int i = 1; i < s.Length; i++)
        {
            H ^= NextPair(s[i - 1], s[i]);
        }
    }
    return H;
}
static void Main(string[] args)
{
    Console.WriteLine("{0:X8}", Hash("abcdef"));
    Console.WriteLine("{0:X8}", Hash("bcdefa"));
    Console.WriteLine("{0:X8}", Hash("cdefab"));
    Console.WriteLine("{0:X8}", Hash("cdfeab"));
    Console.WriteLine("{0:X8}", Hash("a0a0"));
    Console.WriteLine("{0:X8}", Hash("1010"));
    Console.WriteLine("{0:X8}", Hash("0abc0def0ghi"));
    Console.WriteLine("{0:X8}", Hash("0def0abc0ghi"));
}
The output is now:
7F7D7F7F
7F7D7F7F
7F7D7F7F
7F417F4F
C796C7F0
E090E0F0
A909BB71
A959BB71
First Version (which isn't complete): use XOR, which is commutative (order doesn't matter), plus another little trick involving coprimes to combine ordered hashes of adjacent pairs of letters in the string. Here is an example in C#:
static int Hash(char[] s)
{
    // any arbitrary coprime numbers
    const int a = 7, b = 13;
    int H = 0;
    if (s.Length > 0)
    {
        // for i=0 we need to wrap around to the last character
        H ^= (a * s[s.Length - 1].GetHashCode()) + (b * s[0].GetHashCode());
        // for i=1...n we use the previous character
        for (int i = 1; i < s.Length; i++)
        {
            H ^= (a * s[i - 1].GetHashCode()) + (b * s[i].GetHashCode());
        }
    }
    return H;
}
static void Main(string[] args)
{
    Console.WriteLine(Hash("abcdef".ToCharArray()));
    Console.WriteLine(Hash("bcdefa".ToCharArray()));
    Console.WriteLine(Hash("cdefab".ToCharArray()));
    Console.WriteLine(Hash("cdfeab".ToCharArray()));
}
The output is:
4587590
4587590
4587590
7077996
You could find a deterministic first position by always starting at the position with the "lowest" (in terms of alphabetical ordering) substring. So in your case, you'd always start at "a". If there were multiple "a"s, you'd have to take two characters into account etc.
I am sure that you could find a function that can generate the same hash regardless of character position in the input, however, how will you ensure that h(abc) != h(efg) for every conceivable input? (Collisions will occur for all hash algorithms, so I mean, how do you minimize this risk.)
You'd need some additional checks even after generating the hash to ensure that the strings contain the same characters.
Here's an implementation using LINQ:

public string ToCanonicalOrder(string input)
{
    char first = input.OrderBy(x => x).First();
    string doubledForRotation = input + input;
    string canonicalOrder
        = (-1)
        .GenerateFrom(x => doubledForRotation.IndexOf(first, x + 1))
        .Skip(1) // the -1
        .TakeWhile(x => x < input.Length)
        .Select(x => doubledForRotation.Substring(x, input.Length))
        .OrderBy(x => x)
        .First();
    return canonicalOrder;
}
assuming this generic generator extension method:

public static class TExtensions
{
    public static IEnumerable<T> GenerateFrom<T>(this T initial, Func<T, T> next)
    {
        var current = initial;
        while (true)
        {
            yield return current;
            current = next(current);
        }
    }
}
sample usage:

var sequences = new[]
{
    "abcdef", "bcdefa", "cdefab",
    "defabc", "efabcd", "fabcde",
    "abaac", "cabcab"
};

foreach (string sequence in sequences)
{
    Console.WriteLine(ToCanonicalOrder(sequence));
}
output:
abcdef
abcdef
abcdef
abcdef
abcdef
abcdef
aacab
abcabc
then call .GetHashCode() on the result if necessary.
sample usage if ToCanonicalOrder() is converted to an extension method:
sequence.ToCanonicalOrder().GetHashCode();
One possibility is to combine the hash functions of all circular shifts of your input into one meta-hash which does not depend on the order of the inputs.
More formally, consider
for (int i = 0; i < string.length; i++) {
    result ^= string.rotatedBy(i).hashCode();
}

where you could replace the ^= with any other commutative operation.
As an example, consider the input
    "abcd"
To get the hash we take
    hash("abcd") ^ hash("dabc") ^ hash("cdab") ^ hash("bcda")
As we can see, hashing any of these rotations only changes the order in which the XOR is evaluated, which doesn't change its value.
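A runnable version of the same idea (a C++ sketch; O(n^2) overall because every rotation is hashed from scratch):

#include <cstddef>
#include <string>

std::size_t rotation_xor_hash(const std::string& s) {
    std::size_t h = 0;
    for (std::size_t i = 0; i < s.size(); i++) {
        std::string rot = s.substr(i) + s.substr(0, i);  // the i-th circular shift
        h ^= std::hash<std::string>{}(rot);              // XOR is commutative
    }
    return h;
}

One caveat: a string that equals one of its own non-trivial rotations may visit the same rotation multiple times, and pairs of identical hashes cancel under XOR (for "abab" the result is 0), so such inputs need extra care.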
I did something like this for a project in college. There were two approaches I used to try to optimize a Travelling Salesman problem. I think if the elements are NOT guaranteed to be unique, the second solution would take a bit more checking, but the first one should work.
If you can represent the string as a matrix of associations, then abcdef would look like:
  a b c d e f
a   x
b     x
c       x
d         x
e           x
f x

Any rotation of the string produces the same set of associations, so it would be trivial to compare those matrices.
Another quicker trick would be to rotate the string so that the "first" letter is first. Then if you have the same starting point, the same strings will be identical.
Here is some Ruby code:
def normalize_string(string)
  myarray = string.split(//)          # split into an array
  index = myarray.index(myarray.min)  # find the index of the minimum element
  index.times do
    myarray.push(myarray.shift)       # move stuff from the front to the back
  end
  return myarray.join
end

p normalize_string('abcdef').eql?(normalize_string('defabc'))  # should print true
Maybe use a rolling hash for each offset (Rabin-Karp-like) and return the minimum hash value? There could be collisions, though.
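That might look like this (a C++ sketch; a polynomial rolling hash over the doubled string, with arithmetic mod 2^64 and an arbitrarily chosen base):

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>

// Minimum rolling hash over all n rotations, computed in O(n) total.
std::uint64_t min_rotation_hash(const std::string& s) {
    const std::uint64_t B = 1315423911ULL;  // arbitrary odd base
    const std::size_t n = s.size();
    if (n == 0) return 0;
    std::string d = s + s;                  // every rotation is an n-char window of d
    std::uint64_t pow = 1, h = 0;           // pow = B^n, h = hash of d[0..n)
    for (std::size_t i = 0; i < n; i++) {
        h = h * B + (unsigned char)d[i];
        pow *= B;
    }
    std::uint64_t best = h;
    for (std::size_t i = 1; i < n; i++) {   // slide the window one step right
        h = h * B + (unsigned char)d[i + n - 1] - pow * (unsigned char)d[i - 1];
        best = std::min(best, h);
    }
    return best;
}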
