Is qHash consistent across computers?

I have a database table with multiple text columns that collectively have to be unique, and I don't want to use a multicolumn key, so I was thinking of hashing the strings together into an int and using that as the primary key. I was wondering whether it would be better to take advantage of uint qHash ( const QString & key ) or to write my own function, given that the database will need to be edited by different people in different places. (Also, if the whole approach is bad, please help.)

qHash is implemented as follows:
static uint hash(const uchar *p, int n)
{
uint h = 0;
uint g;
while (n--) {
h = (h << 4) + *p++;
if ((g = (h & 0xf0000000)) != 0)
h ^= g >> 23;
h &= ~g;
}
return h;
}
static uint hash(const QChar *p, int n)
{
uint h = 0;
uint g;
while (n--) {
h = (h << 4) + (*p++).unicode();
if ((g = (h & 0xf0000000)) != 0)
h ^= g >> 23;
h &= ~g;
}
return h;
}
Nothing in that code is platform-specific. However, a hash function does not guarantee uniqueness the way a database key does. It does its best to avoid collisions, but they can still happen. That is why most hash containers use buckets and reallocation algorithms.
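If the key has to stay identical on every machine and across library upgrades, one option is to define the hash yourself instead of relying on a library-internal function. The following is only a sketch under stated assumptions: the columns are already available as UTF-8 std::string values, and the names fnv1a64 and stableRowKey as well as the 0x1F separator are illustrative choices, not part of Qt. It uses 64-bit FNV-1a, which is fully determined by two published constants.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// 64-bit FNV-1a over raw bytes. The result depends only on the input bytes
// and the two constants below, not on the platform or library version.
std::uint64_t fnv1a64(const std::string& bytes,
                      std::uint64_t h = 14695981039346656037ULL) {
    for (unsigned char b : bytes) {
        h ^= b;
        h *= 1099511628211ULL;  // FNV prime; wraps modulo 2^64
    }
    return h;
}

// Hypothetical helper: hash several text columns into one key.
// A separator byte (0x1F, assumed not to occur in the data) is fed in after
// each column so that {"ab", "c"} and {"a", "bc"} give different byte streams.
std::uint64_t stableRowKey(const std::vector<std::string>& columns) {
    std::uint64_t h = 14695981039346656037ULL;
    for (const std::string& col : columns) {
        h = fnv1a64(col, h);
        h = fnv1a64(std::string(1, '\x1F'), h);
    }
    return h;
}
```

Collisions are still possible with any hash, so a key like this would need a uniqueness check on the real columns (or a unique index over them) as a backstop.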

Related

How to incorporate mod in rolling hash of Rabin Karp algorithm?

I am trying to implement the Rabin-Karp algorithm with a modulus. The hash function I am using is:
H1 = c1*a^(k-1) + c2*a^(k-2) + c3*a^(k-3) + … + ck*a^0
Here cx is the ASCII value of the character. To roll the hash, I first drop the leading term by subtracting it, then multiply by a, and finally add the new term multiplied by a^0.
The problem is that, to deal with large values, I have added mod operations, and with them I am no longer able to roll the hash correctly. My code is as follows:
public class RabinKarp {
private static final int base = 26;
private static final int mod = 1180637;
public static void main(String[] args) {
String text = "ATCAAGTTACCAATA";
String pattern = "ATA";
char[] textArr = text.toCharArray();
char[] patternArr = pattern.toCharArray();
System.out.println(getMatchingIndex(textArr, patternArr));
}
public static int getMatchingIndex(char[] textArr, char[] patternArr) {
int n = textArr.length;
int m = patternArr.length;
int patternHash = getHashForPatternSize(patternArr, m);
int textHash = getHashForPatternSize(textArr, m);
for(int i = 0; i < n-m; i++) {
if(patternHash == textHash && checkMatch(textArr, patternArr, i, m))
return i;
textHash = rollingHash(textArr, textHash, i, m);
}
return -1;
}
public static boolean checkMatch(char[] textArr, char[] patternArr, int i, int m) {
for(int j = 0; j < m; j++,i++) {
if(textArr[i] != patternArr[j])
return false;
}
return true;
}
public static int rollingHash(char[] textArr, int textHash, int i, int m) {
return (textHash * base - modularExponentiation(base, m, mod) * (int)textArr[i] + (int) textArr[i+m])%mod;
}
public static int getHashForPatternSize(char[] arr, int m) {
int hash = 0;
for(int i = 0, p = m; i < m; i++, p--) {
hash = (hash%mod + calcHash(arr[i], p)%mod)%mod;
}
return hash;
}
public static int calcHash(char alphabet, int p) {
return (((int) alphabet)%mod * modularExponentiation(base, p, mod)%mod)%mod;
}
public static int modularExponentiation(int base, int p, int mod) {
if(p == 0)
return 1;
if(p%2 == 0)
return modularExponentiation((base*base)%mod, p/2, mod);
else
return (base*modularExponentiation((base*base)%mod, (p-1)/2, mod))%mod;
}
}
The problem is that textHash and patternHash never match. I am sure the mod operations are to blame. Can anyone explain how to apply the modulus and still roll the hash correctly? I would be very thankful.
The usual way to compute a Rabin-Karp rolling hash is to consider the characters in big-endian order, rather than your little-endian solution. This makes the arithmetic much easier since it avoids division. Modular division is non-trivial and you cannot simply implement it as (p/q)%b.
If we take the rolling hash as
H[0..k-1] = (c[0]*a^(k-1) + c[1]*a^(k-2) + c[2]*a^(k-3) + … + c[k-1]*a^0) mod b
then the next hash is:
H[1..k] = (c[1]*a^(k-1) + c[2]*a^(k-2) + … + c[k-1]*a^1 + c[k]*a^0) mod b
and we can easily see that
H[1..k] = (a * H[0..k-1] - c[0]*a^k + c[k]) mod b
If we then precompute m = a^k mod b, that becomes:
H[1..k] = (a * H[0..k-1] - m * c[0] + c[k]) mod b
which is much less work on each iteration, and does not depend on division at all.
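The recurrence above can be sketched in C++ roughly as follows. This is an illustration, not the asker's code ported: the function name rabinKarpFind and the base 256 are assumptions, while the modulus is the one from the question. 64-bit intermediates keep the products from overflowing, and a negative remainder is shifted back into range after the subtraction.

```cpp
#include <cstdint>
#include <string>

// Returns the index of the first occurrence of pattern in text, or -1.
long long rabinKarpFind(const std::string& text, const std::string& pattern) {
    const std::int64_t a = 256;      // base (assumed: one "digit" per byte)
    const std::int64_t b = 1180637;  // modulus taken from the question
    const std::size_t n = text.size(), k = pattern.size();
    if (k == 0) return 0;
    if (k > n) return -1;

    std::int64_t m = 1;              // m = a^k mod b, precomputed once
    std::int64_t hp = 0, ht = 0;     // pattern hash, current window hash
    for (std::size_t i = 0; i < k; ++i) {
        m = m * a % b;
        hp = (hp * a + static_cast<unsigned char>(pattern[i])) % b;
        ht = (ht * a + static_cast<unsigned char>(text[i])) % b;
    }

    for (std::size_t i = 0; ; ++i) {
        // Equal hashes only suggest a match; confirm by comparing characters.
        if (hp == ht && text.compare(i, k, pattern) == 0)
            return static_cast<long long>(i);
        if (i + k >= n) break;       // that was the last window
        const std::int64_t c0 = static_cast<unsigned char>(text[i]);
        const std::int64_t ck = static_cast<unsigned char>(text[i + k]);
        ht = (a * ht - m * c0 + ck) % b;
        if (ht < 0) ht += b;         // % may yield a negative value in C++
    }
    return -1;
}
```

Note that the loop also tests the final window starting at n - k, which a loop bounded by i < n - m would skip.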

NxN matrix is given and we have to find

There are N things to distribute among N people. You are given an NxN matrix with a cost at each element, and you need to find the one combination with the maximum total weight such that each person gets exactly one thing.
I am having difficulty defining the DP state.
Please help, and if possible also provide code for it.
C++ style code:
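#include <algorithm>
#include <vector>

// Assigns columns to rows r..n-1. c receives the chosen column for each row,
// f[i] marks column i as already taken. Returns the best total for those rows.
double max_rec(int n, int r, int* c, double** m, bool* f)
{
    if (r >= n)
        return 0.0; // every row has a column
    double max_v = 0.0;
    int max_i = -1;
    std::vector<int> best(c + r, c + n); // best choices for rows r..n-1 so far
    for (int i = 0; i < n; i++)
    {
        if (f[i])
            continue;
        f[i] = true;
        double value = m[r][i] + max_rec(n, r + 1, c, m, f);
        if (max_i < 0 || value > max_v)
        {
            max_v = value;
            max_i = i;
            c[r] = i;
            std::copy(c + r, c + n, best.begin()); // later calls overwrite c
        }
        f[i] = false;
    }
    std::copy(best.begin(), best.end(), c + r); // restore the best choices
    return max_v;
}
int* max_comb(int n, double** m)
{
    bool* f = new bool[n](); // value-initialized: no column taken yet
    int* c = new int[n]();
    max_rec(n, 0, c, m, f);
    delete [] f;
    return c; // caller must delete[] the result
}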
Call max_comb with N and your NxN matrix (2D array). It returns the column indices of the maximum combination.
Time complexity: O(N!)
I know this is bad, but the problem does not have a greedy structure.
And as @mszalbach said, try to attempt the problem yourself before asking.
EDIT: memoizing on the set of already-used columns reduces this to O(N * 2^N), which is still exponential; polynomial time needs a different algorithm, such as the Hungarian method in O(N^3).
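One way to memoize is a DP over bitmasks of used columns, where the number of set bits tells which row is assigned next. The sketch below only returns the maximum total; the name maxAssignment and the std::vector interface are assumptions made for the example, and it is practical only for small N (roughly up to 20).

```cpp
#include <algorithm>
#include <bitset>
#include <limits>
#include <vector>

// best[mask] = maximum total when rows 0..popcount(mask)-1 have been given
// exactly the columns whose bits are set in mask.
double maxAssignment(const std::vector<std::vector<double>>& m) {
    const int n = static_cast<int>(m.size());
    const unsigned full = (1u << n) - 1;
    std::vector<double> best(full + 1,
                             -std::numeric_limits<double>::infinity());
    best[0] = 0.0;
    for (unsigned mask = 0; mask < full; ++mask) {
        // The next row to assign is the number of columns already used.
        const int r = static_cast<int>(std::bitset<32>(mask).count());
        for (int c = 0; c < n; ++c) {
            if (mask & (1u << c)) continue;  // column c already taken
            const unsigned next = mask | (1u << c);
            best[next] = std::max(best[next], best[mask] + m[r][c]);
        }
    }
    return best[full];
}
```

Masks are visited in increasing order, so every state is final before it is extended. Recovering the actual assignment would additionally require storing, for each mask, which column produced the maximum.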

CodeJam 2014: How to solve task "New Lottery Game"?

I would like to know an efficient approach to the New Lottery Game problem.
The Lottery is changing! The Lottery used to have a machine to generate a random winning number. But due to cheating problems, the Lottery has decided to add another machine. The new winning number will be the result of the bitwise-AND operation between the two random numbers generated by the two machines.
To find the bitwise-AND of X and Y, write them both in binary; then a bit in the result in binary has a 1 if the corresponding bits of X and Y were both 1, and a 0 otherwise. In most programming languages, the bitwise-AND of X and Y is written X&Y.
For example:
The old machine generates the number 7 = 0111.
The new machine generates the number 11 = 1011.
The winning number will be (7 AND 11) = (0111 AND 1011) = 0011 = 3.
With this measure, the Lottery expects to reduce the cases of fraudulent claims, but unfortunately an employee from the Lottery company has leaked the following information: the old machine will always generate a non-negative integer less than A and the new one will always generate a non-negative integer less than B.
Catalina wants to win this lottery and to give it a try she decided to buy all non-negative integers less than K.
Given A, B and K, Catalina would like to know in how many different ways the machines can generate a pair of numbers that will make her a winner.
For the small input we can check all possible pairs, but how can this be done for the large input? My guess is to write the numbers as binary strings first and then count the permutations that give an answer less than K, but I can't figure out how to count the possible permutations of two binary strings.
I used a general DP technique that I described in a lot of detail in another answer.
We want to count the pairs (a, b) such that a < A, b < B and a & b < K.
The first step is to convert the numbers to binary and to pad them to the same size by adding leading zeroes. I just padded them to a fixed size of 40. The idea is to build up the valid a and b bit by bit.
Let f(i, loA, loB, loK) be the number of valid suffix pairs of a and b of size 40 - i. If loA is true, it means that the prefix up to i is already strictly smaller than the corresponding prefix of A. In that case there is no restriction on the next possible bit for a. If loA is false, A[i] is an upper bound on the next bit we can place at the end of the current prefix. loB and loK have an analogous meaning.
Now we have the following transition:
long long f(int i, bool loA, bool loB, bool loK) {
// TODO add memoization
if (i == 40)
return loA && loB && loK;
int hiA = loA ? 1: A[i]-'0'; // upper bound on the next bit in a
int hiB = loB ? 1: B[i]-'0'; // upper bound on the next bit in b
int hiK = loK ? 1: K[i]-'0'; // upper bound on the next bit in a & b
long long res = 0;
for (int a = 0; a <= hiA; ++a)
for (int b = 0; b <= hiB; ++b) {
int k = a & b;
if (k > hiK) continue;
res += f(i+1, loA || a < A[i]-'0',
loB || b < B[i]-'0',
loK || k < K[i]-'0');
}
return res;
}
The result is f(0, false, false, false).
The runtime is O(max(log A, log B)) if memoization is added to ensure that every subproblem is only solved once.
What I did was identify the cases where the answer is simply A * B.
Otherwise, brute-force only the remaining pairs; this code passed the large input.
// for each test cases
long count = 0;
if ((K > A) || (K > B)) {
count = A * B;
continue; // print count and go to the next test case
}
count = A * B - (A-K) * (B-K);
for (int i = K; i < A; i++) {
for (int j = K; j < B; j++) {
if ((i&j) < K) count++;
}
}
I hope this helps!
This follows the approach Niklas B. described.
Here is the complete solution:
#include <algorithm>
#include <cstring>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <map>
#include <sstream>
#include <string>
#include <vector>
using namespace std;
#define MAX_SIZE 32
int A, B, K;
int arr_a[MAX_SIZE];
int arr_b[MAX_SIZE];
int arr_k[MAX_SIZE];
bool flag [MAX_SIZE][2][2][2];
long long matrix[MAX_SIZE][2][2][2];
long long
get_result();
int main(int argc, char *argv[])
{
int case_amount = 0;
cin >> case_amount;
for (int i = 0; i < case_amount; ++i)
{
const long long result = get_result();
cout << "Case #" << 1 + i << ": " << result << endl;
}
return 0;
}
long long
dp(const int h,
const bool can_A_choose_1,
const bool can_B_choose_1,
const bool can_K_choose_1)
{
if (MAX_SIZE == h)
return can_A_choose_1 && can_B_choose_1 && can_K_choose_1;
if (flag[h][can_A_choose_1][can_B_choose_1][can_K_choose_1])
return matrix[h][can_A_choose_1][can_B_choose_1][can_K_choose_1];
int cnt_A_max = arr_a[h];
int cnt_B_max = arr_b[h];
int cnt_K_max = arr_k[h];
if (can_A_choose_1)
cnt_A_max = 1;
if (can_B_choose_1)
cnt_B_max = 1;
if (can_K_choose_1)
cnt_K_max = 1;
long long res = 0;
for (int i = 0; i <= cnt_A_max; ++i)
{
for (int j = 0; j <= cnt_B_max; ++j)
{
int k = i & j;
if (k > cnt_K_max)
continue;
res += dp(h + 1,
can_A_choose_1 || (i < cnt_A_max),
can_B_choose_1 || (j < cnt_B_max),
can_K_choose_1 || (k < cnt_K_max));
}
}
flag[h][can_A_choose_1][can_B_choose_1][can_K_choose_1] = true;
matrix[h][can_A_choose_1][can_B_choose_1][can_K_choose_1] = res;
return res;
}
long long
get_result()
{
cin >> A >> B >> K;
memset(arr_a, 0, sizeof(arr_a));
memset(arr_b, 0, sizeof(arr_b));
memset(arr_k, 0, sizeof(arr_k));
memset(flag, 0, sizeof(flag));
memset(matrix, 0, sizeof(matrix));
int i = 31;
while (i >= 1)
{
arr_a[i] = A % 2;
A /= 2;
arr_b[i] = B % 2;
B /= 2;
arr_k[i] = K % 2;
K /= 2;
i--;
}
return dp(1, 0, 0, 0);
}

Recursive Merge Sort C++

I want to write a recursive merge sort program in C++. The problem is, I don't know how to get the base case idea working recursively. Can anybody please tell me what would be the base case for Merg Function(), Split Function() and MergSort() function. I would be thankful to you.
void Merg(int A[], int s1, int e1, int s2, int e2)
{
int B[8];
int i=0;
while (A[s1] < A[s2])
B[i] = B[s1];
i++;
s1++;
if (s1 == e1)
{
B[i] = A[s2];
i++;
s2++;
}
while (A[s2] < A[s1])
B[i] = B[s2];
i++;
s2++;
if (s2 == e2)
{
B[i] = A[s1];
i++;
s1++;
}
}
void Split(int A[], int s, int e)
{
int mid = (s+e)/2;
if (s < e && mid != 0)
{
Split(A, s, mid);
Split(A, mid+1, e);
}
Merg(A, s, mid, mid+1, e);
}
int main()
{
int A[8] = {10,4,8,12,11,2,7,5};
Split(A, 0, 7);
return 0;
}
The base case is an array that is guaranteed to be sorted, so either an empty array or an array of length 1.
Your merge function is not correct, but it at least contains most of the right ideas. All you need there is a further wrapping loop and a few conditions to prevent the merge from running past the end of the arrays. The split function is totally off: splitting itself is not recursive. Further splits happen inside the recursive mergeSort calls.
if length(A) < 2 return // already sorted
split A in lower half L and upper half H
merge-sort L
merge-sort H
merge the sorted L and H
done
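The outline above could look like this in C++. This is a sketch rather than a repair of the code in the question: the names mergeSort and mergeRuns are made up for the example, and ranges are half-open, so [lo, hi) covers indices lo through hi - 1.

```cpp
#include <vector>

// Merges the sorted runs a[lo, mid) and a[mid, hi) into one sorted run.
void mergeRuns(std::vector<int>& a, int lo, int mid, int hi) {
    std::vector<int> tmp;
    tmp.reserve(hi - lo);
    int i = lo, j = mid;
    while (i < mid && j < hi)                 // take the smaller head element
        tmp.push_back(a[j] < a[i] ? a[j++] : a[i++]);
    while (i < mid) tmp.push_back(a[i++]);    // leftovers of the lower half
    while (j < hi) tmp.push_back(a[j++]);     // leftovers of the upper half
    for (int k = 0; k < static_cast<int>(tmp.size()); ++k)
        a[lo + k] = tmp[k];
}

// Sorts a[lo, hi).
void mergeSort(std::vector<int>& a, int lo, int hi) {
    if (hi - lo < 2) return;                  // base case: 0 or 1 element
    const int mid = lo + (hi - lo) / 2;
    mergeSort(a, lo, mid);                    // merge-sort the lower half
    mergeSort(a, mid, hi);                    // merge-sort the upper half
    mergeRuns(a, lo, mid, hi);                // merge the two sorted halves
}
```

For the array in the question, the call would be mergeSort(v, 0, 8) on a vector holding the eight values. The only base case is in mergeSort; the merge step needs none because it is a plain loop.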

Scrabble tile checking

For tile checking in scrabble, you make four 5x5 grids of letters totalling 100 tiles. I would like to make one where all 40 horizontal and vertical words are valid. The set of available tiles contains:
12 x E
9 x A, I
8 x O
6 x N, R, T
4 x D, L, S, U
3 x G
2 x B, C, F, H, M, P, V, W, Y, blank tile (wildcard)
1 x K, J, Q, X, Z
The dictionary of valid words is available here (700KB). There are about 12,000 valid 5 letter words.
Here's an example where all 20 horizontal words are valid:
Z O W I E|P I N O T
Y O G I N|O C t A D <= blank being used as 't'
X E B E C|N A L E D
W A I T E|M E R L E
V I N E R|L U T E A
---------+---------
U S N E A|K N O S P
T A V E R|J O L E D
S O F T A|I A M B I
R I D G Y|H A I T h <= blank being used as 'h'
Q U R S H|G R O U F
I'd like to create one where all the vertical ones are also valid. Can you help me solve this? It is not homework. It is a question a friend asked me for help with.
Final Edit: Solved! Here is a solution.
GNAWN|jOULE
RACHE|EUROS
IDIOT|STEAN
PINOT|TRAvE
TRIPY|SOLES
-----+-----
HOWFF|ZEBRA
AGILE|EQUID
CIVIL|BUXOM
EVENT|RIOJA
KEDGY|ADMAN
Here's a photo of it constructed with my scrabble set. http://twitpic.com/3wn7iu
This one was easy to find once I had the right approach, so I bet you could find many more this way. See below for methodology.
Construct a prefix tree from the dictionary of 5 letter words for each row and column. Recursively, a given tile placement is valid if it forms valid prefixes for its column and row, and if the tile is available, and if the next tile placement is valid. The base case is that it is valid if there is no tile left to place.
It probably makes sense to just find all valid 5x5 boards, like Glenn said, and see if any four of them can be combined. Recursing to a depth of 100 doesn't sound like fun.
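The prefix-tree check described above can be sketched on its own as follows. This is a simplified C++ illustration, separate from the C program in this answer: the names TrieNode, insertWord, isPrefix and containsWord are hypothetical, and words are assumed to consist of uppercase A-Z only.

```cpp
#include <array>
#include <memory>
#include <string>

// One node per prefix; child[i] is the node reached by appending 'A' + i.
struct TrieNode {
    std::array<std::unique_ptr<TrieNode>, 26> child;
    bool isWord = false;
};

void insertWord(TrieNode& root, const std::string& word) {
    TrieNode* node = &root;
    for (char ch : word) {
        std::unique_ptr<TrieNode>& next = node->child[ch - 'A'];
        if (!next) next.reset(new TrieNode());
        node = next.get();
    }
    node->isWord = true;
}

// Follows s from the root; returns nullptr if some letter has no child.
const TrieNode* walk(const TrieNode& root, const std::string& s) {
    const TrieNode* node = &root;
    for (char ch : s) {
        node = node->child[ch - 'A'].get();
        if (!node) return nullptr;
    }
    return node;
}

// True if at least one inserted word starts with prefix.
bool isPrefix(const TrieNode& root, const std::string& prefix) {
    return walk(root, prefix) != nullptr;
}

// True if word itself was inserted.
bool containsWord(const TrieNode& root, const std::string& word) {
    const TrieNode* node = walk(root, word);
    return node != nullptr && node->isWord;
}
```

During the search, a tile placement at row y, column x would be rejected as soon as either the row prefix or the column prefix stops being a valid prefix.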
Edit: Here is version 2 of my code for this.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
typedef union node node;
union node {
node* child[26];
char string[6];
};
typedef struct snap snap;
struct snap {
node* rows[5];
node* cols[5];
char tiles[27];
snap* next;
};
node* root;
node* vtrie[5];
node* htrie[5];
snap* head;
char bag[27] = {9,2,2,4,12,2,3,2,9,1,1,4,2,6,8,2,1,6,4,6,4,2,2,1,2,1,2};
const char full_bag[27] = {9,2,2,4,12,2,3,2,9,1,1,4,2,6,8,2,1,6,4,6,4,2,2,1,2,1,2};
const char order[26] = {16,23,9,25,21,22,5,10,1,6,7,12,15,2,24,3,20,13,19,11,8,17,14,0,18,4};
void insert(char* string){
node* place = root;
int i;
for(i=0;i<5;i++){
if(place->child[string[i] - 'A'] == NULL){
int j;
place->child[string[i] - 'A'] = malloc(sizeof(node));
for(j=0;j<26;j++){
place->child[string[i] - 'A']->child[j] = NULL;
}
}
place = place->child[string[i] - 'A'];
}
memcpy(place->string, string, 6);
}
void check_four(){
snap *a, *b, *c, *d;
char two_total[27];
char three_total[27];
int i;
bool match;
a = head;
for(b = a->next; b != NULL; b = b->next){
for(i=0;i<27; i++)
two_total[i] = a->tiles[i] + b->tiles[i];
for(c = b->next; c != NULL; c = c->next){
for(i=0;i<27; i++)
three_total[i] = two_total[i] + c->tiles[i];
for(d = c->next; d != NULL; d = d->next){
match = true;
for(i=0; i<27; i++){
if(three_total[i] + d->tiles[i] != full_bag[i]){
match = false;
break;
}
}
if(match){
printf("\nBoard Found!\n\n");
for(i=0;i<5;i++){
printf("%s\n", a->rows[i]->string);
}
printf("\n");
for(i=0;i<5;i++){
printf("%s\n", b->rows[i]->string);
}
printf("\n");
for(i=0;i<5;i++){
printf("%s\n", c->rows[i]->string);
}
printf("\n");
for(i=0;i<5;i++){
printf("%s\n", d->rows[i]->string);
}
exit(0);
}
}
}
}
}
void snapshot(){
snap* shot = malloc(sizeof(snap));
int i;
for(i=0;i<5;i++){
printf("%s\n", htrie[i]->string);
shot->rows[i] = htrie[i];
shot->cols[i] = vtrie[i];
}
printf("\n");
for(i=0;i<27;i++){
shot->tiles[i] = full_bag[i] - bag[i];
}
bool transpose = false;
snap* place = head;
while(place != NULL && !transpose){
transpose = true;
for(i=0;i<5;i++){
if(shot->rows[i] != place->cols[i]){
transpose = false;
break;
}
}
place = place->next;
}
if(transpose){
free(shot);
}
else {
shot->next = head;
head = shot;
check_four();
}
}
void pick(int x, int y){
if(y==5){
snapshot();
return;
}
int i, tile,nextx, nexty, nextz;
node* oldv = vtrie[x];
node* oldh = htrie[y];
if(x+1==5){
nexty = y+1;
nextx = 0;
} else {
nextx = x+1;
nexty = y;
}
for(i=0;i<26;i++){
if(vtrie[x]->child[order[i]]!=NULL &&
htrie[y]->child[order[i]]!=NULL &&
(tile = bag[i] ? i : bag[26] ? 26 : -1) + 1) {
vtrie[x] = vtrie[x]->child[order[i]];
htrie[y] = htrie[y]->child[order[i]];
bag[tile]--;
pick(nextx, nexty);
vtrie[x] = oldv;
htrie[y] = oldh;
bag[tile]++;
}
}
}
int main(int argc, char** argv){
root = malloc(sizeof(node));
FILE* wordlist = fopen("sowpods5letters.txt", "r");
head = NULL;
int i;
for(i=0;i<26;i++){
root->child[i] = NULL;
}
for(i=0;i<5;i++){
vtrie[i] = root;
htrie[i] = root;
}
char* string = malloc(sizeof(char)*6);
while(fscanf(wordlist, "%s", string) != EOF){
insert(string);
}
free(string);
fclose(wordlist);
pick(0,0);
return 0;
}
This tries the infrequent letters first, which I'm no longer sure is a good idea. It starts to get bogged down before it makes it out of the boards starting with x. After seeing how many 5x5 blocks there were I altered the code to just list out all the valid 5x5 blocks. I now have a 150 MB text file with all 4,430,974 5x5 solutions.
I also tried it with just recursing through the full 100 tiles, and that is still running.
Edit 2: Here is the list of all the valid 5x5 blocks I generated. http://web.cs.sunyit.edu/~levyt/solutions.rar
Edit 3: Hmm, seems there was a bug in my tile usage tracking, because I just found a block in my output file that uses 5 Zs.
COSTE
ORCIN
SCUZZ
TIZZY
ENZYM
Edit 4: Here is the final product.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
typedef union node node;
union node {
node* child[26];
char string[6];
};
node* root;
node* vtrie[5];
node* htrie[5];
int score;
int max_score;
char block_1[27] = {4,2,0,2, 2,0,0,0,2,1,0,0,2,1,2,0,1,2,0,0,2,0,0,1,0,1,0};//ZEBRA EQUID BUXOM RIOJA ADMAN
char block_2[27] = {1,0,1,1, 4,2,2,1,3,0,1,2,0,1,1,0,0,0,0,1,0,2,1,0,1,0,0};//HOWFF AGILE CIVIL EVENT KEDGY
char block_3[27] = {2,0,1,1, 1,0,1,1,4,0,0,0,0,3,2,2,0,2,0,3,0,0,1,0,1,0,0};//GNAWN RACHE IDIOT PINOT TRIPY
//JOULE EUROS STEAN TRAVE SOLES
char bag[27] = {9,2,2,4,12,2,3,2,9,1,1,4,2,6,8,2,1,6,4,6,4,2,2,1,2,1,2};
const char full_bag[27] = {9,2,2,4,12,2,3,2,9,1,1,4,2,6,8,2,1,6,4,6,4,2,2,1,2,1,2};
const char order[26] = {16,23,9,25,21,22,5,10,1,6,7,12,15,2,24,3,20,13,19,11,8,17,14,0,18,4};
const int value[27] = {244,862,678,564,226,1309,844,765,363,4656,909,414,691,463,333,687,11998,329,218,423,536,1944,1244,4673,639,3363,0};
void insert(char* string){
node* place = root;
int i;
for(i=0;i<5;i++){
if(place->child[string[i] - 'A'] == NULL){
int j;
place->child[string[i] - 'A'] = malloc(sizeof(node));
for(j=0;j<26;j++){
place->child[string[i] - 'A']->child[j] = NULL;
}
}
place = place->child[string[i] - 'A'];
}
memcpy(place->string, string, 6);
}
void snapshot(){
static int count = 0;
int i;
for(i=0;i<5;i++){
printf("%s\n", htrie[i]->string);
}
for(i=0;i<27;i++){
printf("%c%d ", 'A'+i, bag[i]);
}
printf("\n");
if(++count>=1000){
exit(0);
}
}
void pick(int x, int y){
if(y==5){
if(score>max_score){
snapshot();
max_score = score;
}
return;
}
int i, tile,nextx, nexty;
node* oldv = vtrie[x];
node* oldh = htrie[y];
if(x+1==5){
nextx = 0;
nexty = y+1;
} else {
nextx = x+1;
nexty = y;
}
for(i=0;i<26;i++){
if(vtrie[x]->child[order[i]]!=NULL &&
htrie[y]->child[order[i]]!=NULL &&
(tile = bag[order[i]] ? order[i] : bag[26] ? 26 : -1) + 1) {
vtrie[x] = vtrie[x]->child[order[i]];
htrie[y] = htrie[y]->child[order[i]];
bag[tile]--;
score+=value[tile];
pick(nextx, nexty);
vtrie[x] = oldv;
htrie[y] = oldh;
bag[tile]++;
score-=value[tile];
}
}
}
int main(int argc, char** argv){
root = malloc(sizeof(node));
FILE* wordlist = fopen("sowpods5letters.txt", "r");
score = 0;
max_score = 0;
int i;
for(i=0;i<26;i++){
root->child[i] = NULL;
}
for(i=0;i<5;i++){
vtrie[i] = root;
htrie[i] = root;
}
for(i=0;i<27;i++){
bag[i] = bag[i] - block_1[i];
bag[i] = bag[i] - block_2[i];
bag[i] = bag[i] - block_3[i];
printf("%c%d ", 'A'+i, bag[i]);
}
char* string = malloc(sizeof(char)*6);
while(fscanf(wordlist, "%s", string) != EOF){
insert(string);
}
free(string);
fclose(wordlist);
pick(0,0);
return 0;
}
After finding out how many blocks there were (nearly 2 billion and still counting), I switched to trying to find certain types of blocks, in particular the difficult to construct ones using uncommon letters. My hope was that if I ended up with a benign enough set of letters going in to the last block, the vast space of valid blocks would probably have one for that set of letters.
I assigned each tile a value inversely proportional to the number of 5 letter words it appears in. Then, when I found a valid block I would sum up the tile values, and if the score was the best I had yet seen, I would print out the block.
For the first block I removed the blank tiles, figuring that the last block would need that flexibility the most. After letting it run until I had not seen a better block appear for some time, I selected the best block, and removed the tiles in it from the bag, and ran the program again, getting the second block. I repeated this for the 3rd block. Then for the last block I added the blanks back in and used the first valid block it found.
Here's how I would try this. First construct a prefix tree.
Pick a word and place it horizontally on top. Pick a word and place it vertically. Alternate them until you run out of options. By alternating, you fix the first letters early and eliminate lots of mismatching words. If you do find such a square, then check whether it can be made with the available tiles.
For 5x5 squares: after some thought, it can't be worse than O(12000!/11990!) for random text words. Thinking about it a little more, though: every time you fix a letter (in normal text) you eliminate about 90% (an optimistic guess) of your words. This means that after three iterations you've got 12 words left. So the actual speed would be
O(n * n/10 * n/10 * n/100 * n/100 * n/1000 * n/1000 ...
which for 12000 elements acts something like n^4 algorithm
which isn't that bad.
Probably someone can do a better analysis of the problem. But the search for words should still converge quite quickly.
More elimination can be done by exploiting the infrequent letters. Essentially, find all words that contain infrequent letters. Try to find matching positions for each letter. Construct a set of valid letters for each position.
For example, let's say we have four words with letter Q in it.
AQFED, ZQABE, EDQDE, ELQUO
this means there are two valid positionings of those:
xZxxx
AQFED
xAxxx ---> this limits our search for words that contain [ABDEFZ] as the second letter
xBxxx
xExxx
same for the other
EDQDE ---> this limits our search for words that contain [EDLU] as the third letter
ELQUO
all appropriate words are in union of those two conditions
So basically, if several words contain the infrequent letter X, and a word S has it at position N, then the other words in that matrix must have at position N a letter that also occurs in S.
Formula:
Find all words that contain infrequent letter X at position 1 (next iteration 2, 3... )
Make a set A out of the letters in those words
Keep only those words from the dictionary that have letter from set A in position 1
Try to fit those into the matrix (with the first method)
Repeat with position 2
I would approach the problem (naively, to be sure) by taking a pessimistic view. I'd try to prove there was no 5x5 solution, and therefore certainly not four 5x5 solutions. To prove there was no 5x5 solution I'd try to construct one from all possibilities. If my conjecture failed and I was able to construct a 5x5 solution, well, then, I'd have a way to construct 5x5 solutions and I would try to construct all of the (independent) 5x5 solutions. If there were at least 4, then I would determine if some combination satisfied the letter count restrictions.
[Edit] Null Set has determined that there are "4,430,974 5x5 solutions". Are these valid?
I mean that we have a limitation on the number of letters we can use. This limitation can be expressed as a boundary vector BV = [9, 2, 2, 4, ...] corresponding to the limits on A, B, C, etc. (You see this vector in Null Set's code). A 5x5 solution is valid if each term of its letter count vector is no greater than the corresponding term in BV. It would be easy to check if a 5x5 solution is valid as it was created. Perhaps the 4,430,974 number can be reduced, say to N.
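A per-block check of this kind could be sketched as below. The limits are the tile counts listed in the question. The function name fitsInBag is hypothetical, and treating any excess letters as blanks (at most two) is an assumption of this sketch rather than something stated above.

```cpp
#include <array>
#include <string>
#include <vector>

// Tile limits for A..Z from the tile list in the question.
const std::array<int, 26> kBag = {9, 2, 2, 4, 12, 2, 3, 2, 9, 1, 1, 4, 2,
                                  6, 8, 2, 1, 6, 4, 6, 4, 2, 2, 1, 2, 1};
const int kBlanks = 2;  // wildcard tiles

// True if the block (rows of uppercase A-Z letters) can be built from the
// bag, using blanks for any letters beyond the per-letter limit.
bool fitsInBag(const std::vector<std::string>& rows) {
    std::array<int, 26> used{};  // zero-initialized letter counts
    for (const std::string& row : rows) {
        for (char ch : row) {
            if (ch < 'A' || ch > 'Z') return false;
            ++used[ch - 'A'];
        }
    }
    int blanksNeeded = 0;
    for (int i = 0; i < 26; ++i) {
        if (used[i] > kBag[i]) blanksNeeded += used[i] - kBag[i];
    }
    return blanksNeeded <= kBlanks;
}
```

The block COSTE / ORCIN / SCUZZ / TIZZY / ENZYM mentioned earlier uses five Zs against a limit of one, so it fails this check, whereas ZEBRA / EQUID / BUXOM / RIOJA / ADMAN stays within every limit.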
Regardless, we can state the problem as: find four letter count vectors among the N whose sum is equal to BV. There are (N, 4) possible sums ("N choose 4"). With N equal to 4 million this is still on the order of 10^25---not an encouraging number. Perhaps you could search for four whose first terms sum to 9, and if so checking that their second terms sum to 2, etc.
I'd remark that after choosing 4 from N the computations are independent, so if you have a multi-core machine you can make this go faster with a parallel solution.
[Edit2] Parallelizing probably wouldn't make much difference, though. At this point I might take an optimistic view: there are certainly more 5x5 solutions than I expected, so there may be more final solutions than expected, too. Perhaps you might not have to get far into the 10^25 to hit one.
I'm starting with something simpler.
Here are some results so far:
3736 2x2 solutions
8812672 3x3 solutions
The 1000th 4x4 solution is
A A H S
A C A I
L A I R
S I R E
The 1000th 5x5 solution is
A A H E D
A B U N A
H U R S T
E N S U E
D A T E D
The 1000th 2x4x4 solution is
A A H S | A A H S
A B A C | A B A C
H A I R | L E K U
S C R Y | S T E D
--------+--------
D E E D | D E E M
E I N E | I N T I
E N O L | O V E R
T E L T | L Y N E
Note that transposing an 'A' and a blank that is being used as an 'A' should be considered the same solution. But transposing the rows with the columns should be considered a different solution. I hope that makes sense.
Here are a lot of precomputed 5x5's. Left as an exercise to the reader to find 4 compatible ones :-)
http://www.gtoal.com/wordgames/wordsquare/all5

Resources