Remove occurrences of substring recursively - string

Here's a problem:
Given string A and a substring B, remove the first occurence of substring B in string A till it is possible to do so. Note that removing a substring, can further create a new same substring. Ex. removing 'hell' from 'hehelllloworld' once would yield 'helloworld' which after removing once more would become 'oworld', the desired string.
Write a program for the above for input constraints of length 10^6 for A, and length 100 for B.
This question was asked to me in an interview, I gave them a simple algorithm to solve it that was to do exactly what the statement was and remove it iteratievly(to decresae over head calls), I later came to know there's a better solution for it that's much faster what would it be ? I've thought of a few optimizations but it's still not as fast as the fastest soln for the problem(acc. the company), so can anyone tell me of a faster way to solve the problem ?
P.S> I know of stackoverflow rules and that having code is better, but for this problem, I don't think that having code would be in any way beneficial...

Your approach has a pretty bad complexity. In a very bad case the string a will be aaaaaaaaabbbbbbbbb, and the string b will be ab, in which case you will need O(|a|) searches, each taking O(|a| + |b|) (assuming using some sophisticated search algorithm), resulting in a total complexity of O(|a|^2 + |a| * |b|), which with their constraints is years.
For their constraints a good complexity to aim for would be O(|a| * |b|), which is around 100 million operations, will finish in subsecond. Here's one way to approach it. For each position i in the string a let's compute the largest length n_i, such that the a[i - n_i : i] = b[0 : n_i] (in other words, the longest suffix of a at that position which is a prefix of b). We can compute it in O(|a| + |b|) by using Knuth-Morris-Pratt algorithm.
After we have n_i computed, finding the first occurrence of b in a is just a matter of finding the first n_i that is equal to |b|. This will be the right end of one of the occurrences of b in a.
Finally, we will need to modify Knuth-Morris-Pratt slightly. We will be logically removing occurrences of b as soon as we compute an n_i that is equal to |b|. To account for the fact that some letters were removed from a we will rely on the fact that Knuth-Morris-Pratt only relies on the last value of n_i (and those computed for b), and the current letter of a, so we just need a fast way of retrieving the last value of n_i after we logically remove an occurrence of b. That can be done with a deque, that stores all the valid values of n_i. Each value will be pushed into the deque once, and popped from it once, so that complexity of maintaining it is O(|a|), while the complexity of the Knuth-Morris-Pratt is O(|a| + |b|), resulting in O(|a| + |b|) total complexity.
Here's a C++ implementation. It could have some off-by-one errors, but it works on your sample, and it flies for the worst case that I described at the beginning.
#include <deque>
#include <string>
#include <iostream>
#include <vector>
#include <algorithm>
using namespace std;
int main() {
string a, b;
cin >> a >> b;
size_t blen = b.size();
// make a = b$a
a = b + "$" + a;
vector<size_t> n(a.size()); // array for knuth-morris-pratt
vector<bool> removals(a.size()); // positions of right ends at which we remove `b`s
deque<size_t> lastN;
n[0] = 0;
// For the first blen + 1 iterations just do vanilla knuth-morris-pratt
for (size_t i = 1; i < blen + 1; ++ i) {
size_t z = n[i - 1];
while (z && a[i] != a[z]) {
z = n[z - 1];
}
if (a[i] != a[z]) n[i] = 0;
else n[i] = z + 1;
lastN.push_back(n[i]);
}
// For the remaining iterations some characters could have been logically
// removed from `a`, so use lastN to get last value of n instaed
// of actually getting it from `n[i - 1]`
for (size_t i = blen + 1; i < a.size(); ++ i) {
size_t z = lastN.back();
while (z && a[i] != a[z]) {
z = n[z - 1];
}
if (a[i] != a[z]) n[i] = 0;
else n[i] = z + 1;
if (n[i] == blen) // found a match
{
removals[i] = true;
// kill last |b| - 1 `n_i`s
for (size_t j = 0; j < blen - 1; ++ j) {
lastN.pop_back();
}
}
else {
lastN.push_back(n[i]);
}
}
string ret;
size_t toRemove = 0;
for (size_t pos = a.size() - 1; a[pos] != '$'; -- pos) {
if (removals[pos]) toRemove += blen;
if (toRemove) -- toRemove;
else ret.push_back(a[pos]);
}
reverse(ret.begin(), ret.end());
cout << ret << endl;
return 0;
}
[in] hehelllloworld
[in] hell
[out] oworld
[in] abababc
[in] ababc
[out] ab
[in] caaaaa ... aaaaaabbbbbb ... bbbbc
[in] ab
[out] cc

Related

find the number of ways you can form a string on size N, given an unlimited number of 0s and 1s

The below question was asked in the atlassian company online test ,I don't have test cases , this is the below question I took from this link
find the number of ways you can form a string on size N, given an unlimited number of 0s and 1s. But
you cannot have D number of consecutive 0s and T number of consecutive 1s. N, D, T were given as inputs,
Please help me on this problem,any approach how to proceed with it
My approach for the above question is simply I applied recursion and tried for all possiblity and then I memoized it using hash map
But it seems to me there must be some combinatoric approach that can do this question in less time and space? for debugging purposes I am also printing the strings generated during recursion, if there is flaw in my approach please do tell me
#include <bits/stdc++.h>
using namespace std;
unordered_map<string,int>dp;
int recurse(int d,int t,int n,int oldd,int oldt,string s)
{
if(d<=0)
return 0;
if(t<=0)
return 0;
cout<<s<<"\n";
if(n==0&&d>0&&t>0)
return 1;
string h=to_string(d)+" "+to_string(t)+" "+to_string(n);
if(dp.find(h)!=dp.end())
return dp[h];
int ans=0;
ans+=recurse(d-1,oldt,n-1,oldd,oldt,s+'0')+recurse(oldd,t-1,n-1,oldd,oldt,s+'1');
return dp[h]=ans;
}
int main()
{
int n,d,t;
cin>>n>>d>>t;
dp.clear();
cout<<recurse(d,t,n,d,t,"")<<"\n";
return 0;
}
You are right, instead of generating strings, it is worth to consider combinatoric approach using dynamic programming (a kind of).
"Good" sequence of length K might end with 1..D-1 zeros or 1..T-1 of ones.
To make a good sequence of length K+1, you can add zero to all sequences except for D-1, and get 2..D-1 zeros for the first kind of precursors and 1 zero for the second kind
Similarly you can add one to all sequences of the first kind, and to all sequences of the second kind except for T-1, and get 1 one for the first kind of precursors and 2..T-1 ones for the second kind
Make two tables
Zeros[N][D] and Ones[N][T]
Fill the first row with zero counts, except for Zeros[1][1] = 1, Ones[1][1] = 1
Fill row by row using the rules above.
Zeros[K][1] = Sum(Ones[K-1][C=1..T-1])
for C in 2..D-1:
Zeros[K][C] = Zeros[K-1][C-1]
Ones[K][1] = Sum(Zeros[K-1][C=1..T-1])
for C in 2..T-1:
Ones[K][C] = Ones[K-1][C-1]
Result is sum of the last row in both tables.
Also note that you really need only two active rows of the table, so you can optimize size to Zeros[2][D] after debugging.
This can be solved using dynamic programming. I'll give a recursive solution to the same. It'll be similar to generating a binary string.
States will be:
i: The ith character that we need to insert to the string.
cnt: The number of consecutive characters before i
bit: The character which was repeated cnt times before i. Value of bit will be either 0 or 1.
Base case will: Return 1, when we reach n since we are starting from 0 and ending at n-1.
Define the size of dp array accordingly. The time complexity will be 2 x N x max(D,T)
#include<bits/stdc++.h>
using namespace std;
int dp[1000][1000][2];
int n, d, t;
int count(int i, int cnt, int bit) {
if (i == n) {
return 1;
}
int &ans = dp[i][cnt][bit];
if (ans != -1) return ans;
ans = 0;
if (bit == 0) {
ans += count(i+1, 1, 1);
if (cnt != d - 1) {
ans += count(i+1, cnt + 1, 0);
}
} else {
// bit == 1
ans += count(i+1, 1, 0);
if (cnt != t-1) {
ans += count(i+1, cnt + 1, 1);
}
}
return ans;
}
signed main() {
ios_base::sync_with_stdio(false), cin.tie(nullptr);
cin >> n >> d >> t;
memset(dp, -1, sizeof dp);
cout << count(0, 0, 0);
return 0;
}

Maximum element in array which is equal to product of two elements in array

We need to find the maximum element in an array which is also equal to product of two elements in the same array. For example [2,3,6,8] , here 6=2*3 so answer is 6.
My approach was to sort the array and followed by a two pointer method which checked whether the product exist for each element. This is o(nlog(n)) + O(n^2) = O(n^2) approach. Is there a faster way to this ?
There is a slight better solution with O(n * sqrt(n)) if you are allowed to use O(M) memory M = max number in A[i]
Use an array of size M to mark every number while you traverse them from smaller to bigger number.
For each number try all its factors and see if those were already present in the array map.
Here is a pseudo code for that:
#define M 1000000
int array_map[M+2];
int ans = -1;
sort(A,A+n);
for(i=0;i<n;i++) {
for(j=1;j<=sqrt(A[i]);j++) {
int num1 = j;
if(A[i]%num1==0) {
int num2 = A[i]/num1;
if(array_map[num1] && array_map[num2]) {
if(num1==num2) {
if(array_map[num1]>=2) ans = A[i];
} else {
ans = A[i];
}
}
}
}
array_map[A[i]]++;
}
There is an ever better approach if you know how to find all possible factors in log(M) this just becomes O(n*logM). You have to use sieve and backtracking for that
#JerryGoyal 's solution is correct. However, I think it can be optimized even further if instead of using B pointer, we use binary search to find the other factor of product if arr[c] is divisible by arr[a]. Here's the modification for his code:
for(c=n-1;(c>1)&& (max==-1);c--){ // loop through C
for(a=0;(a<c-1)&&(max==-1);a++){ // loop through A
if(arr[c]%arr[a]==0) // If arr[c] is divisible by arr[a]
{
if(binary_search(a+1, c-1, (arr[c]/arr[a]))) //#include<algorithm>
{
max = arr[c]; // if the other factor x of arr[c] is also in the array such that arr[c] = arr[a] * x
break;
}
}
}
}
I would have commented this on his solution, unfortunately I lack the reputation to do so.
Try this.
Written in c++
#include <vector>
#include <algorithm>
using namespace std;
int MaxElement(vector< int > Input)
{
sort(Input.begin(), Input.end());
int LargestElementOfInput = 0;
int i = 0;
while (i < Input.size() - 1)
{
if (LargestElementOfInput == Input[Input.size() - (i + 1)])
{
i++;
continue;
}
else
{
if (Input[i] != 0)
{
LargestElementOfInput = Input[Input.size() - (i + 1)];
int AllowedValue = LargestElementOfInput / Input[i];
int j = 0;
while (j < Input.size())
{
if (Input[j] > AllowedValue)
break;
else if (j == i)
{
j++;
continue;
}
else
{
int Product = Input[i] * Input[j++];
if (Product == LargestElementOfInput)
return Product;
}
}
}
i++;
}
}
return -1;
}
Once you have sorted the array, then you can use it to your advantage as below.
One improvement I can see - since you want to find the max element that meets the criteria,
Start from the right most element of the array. (8)
Divide that with the first element of the array. (8/2 = 4).
Now continue with the double pointer approach, till the element at second pointer is less than the value from the step 2 above or the match is found. (i.e., till second pointer value is < 4 or match is found).
If the match is found, then you got the max element.
Else, continue the loop with next highest element from the array. (6).
Efficient solution:
2 3 8 6
Sort the array
keep 3 pointers C, B and A.
Keeping C at the last and A at 0 index and B at 1st index.
traverse the array using pointers A and B till C and check if A*B=C exists or not.
If it exists then C is your answer.
Else, Move C a position back and traverse again keeping A at 0 and B at 1st index.
Keep repeating this till you get the sum or C reaches at 1st index.
Here's the complete solution:
int arr[] = new int[]{2, 3, 8, 6};
Arrays.sort(arr);
int n=arr.length;
int a,b,c,prod,max=-1;
for(c=n-1;(c>1)&& (max==-1);c--){ // loop through C
for(a=0;(a<c-1)&&(max==-1);a++){ // loop through A
for(b=a+1;b<c;b++){ // loop through B
prod=arr[a]*arr[b];
if(prod==arr[c]){
System.out.println("A: "+arr[a]+" B: "+arr[b]);
max=arr[c];
break;
}
if(prod>arr[c]){ // no need to go further
break;
}
}
}
}
System.out.println(max);
I came up with below solution where i am using one array list, and following one formula:
divisor(a or b) X quotient(b or a) = dividend(c)
Sort the array.
Put array into Collection Col.(ex. which has faster lookup, and maintains insertion order)
Have 2 pointer a,c.
keep c at last, and a at 0.
try to follow (divisor(a or b) X quotient(b or a) = dividend(c)).
Check if a is divisor of c, if yes then check for b in col.(a
If a is divisor and list has b, then c is the answer.
else increase a by 1, follow step 5, 6 till c-1.
if max not found then decrease c index, and follow the steps 4 and 5.
Check this C# solution:
-Loop through each element,
-loop and multiply each element with other elements,
-verify if the product exists in the array and is the max
private static int GetGreatest(int[] input)
{
int max = 0;
int p = 0; //product of pairs
//loop through the input array
for (int i = 0; i < input.Length; i++)
{
for (int j = i + 1; j < input.Length; j++)
{
p = input[i] * input[j];
if (p > max && Array.IndexOf(input, p) != -1)
{
max = p;
}
}
}
return max;
}
Time complexity O(n^2)

Longest Common Substring non-DP solution with O(m*n)

The definition of the problem is:
Given two strings, find the longest common substring.
Return the length of it.
I was solving this problem and I think I solved it with O(m*n) time complexity. However I don't know why when I look up the solution, it's all talking about the optimal solution being dynamic programming - http://www.geeksforgeeks.org/longest-common-substring/
Here's my solution, you can test it here: http://www.lintcode.com/en/problem/longest-common-substring/
int longestCommonSubstring(string &A, string &B) {
int ans = 0;
for (int i=0; i<A.length(); i++) {
int counter = 0;
int k = i;
for (int j=0; j<B.length() && k <A.length(); j++) {
if (A[k]!=B[j]) {
counter = 0;
k = i;
} else {
k++;
counter++;
ans = max(ans, counter);
}
}
}
return ans;
}
My idea is simple, start from the first position of string A and see what's the longest substring I can match with string B, then start from the second position of string A and see what's the longest substring I can match....
Is there something wrong with my solution? Or is it not O(m*n) complexity?
Good news: your algorithm is O(mn). Bad news: it doesn't work correctly.
Your inner loop is wrong: it's intended to find the longest initial substring of A[i:] in B, but it works like this:
j = 0
While j < len(B)
Match as much of A[i:] against B[j:]. Call it s.
Remember s if it's the longest so far found.
j += len(s)
This fails to find the longest match. For example, when A = "XXY" and B = "XXXY" and i=0 it'll find "XX" as the longest match instead of the complete match "XXY".
Here's a runnable version of your code (lightly transcribed into C) that shows the faulty result:
#include <string.h>
#include <stdio.h>
int lcs(const char* A, const char* B) {
int al = strlen(A);
int bl = strlen(B);
int ans = 0;
for (int i=0; i<al; i++) {
int counter = 0;
int k = i;
for (int j=0; j<bl && k<al; j++) {
if (A[k]!=B[j]) {
counter = 0;
k = i;
} else {
k++;
counter++;
if (counter >= ans) ans = counter;
}
}
}
return ans;
}
int main(int argc, char**argv) {
printf("%d\n", lcs("XXY", "XXXY"));
return 0;
}
Running this program outputs "2".
Your solution is O(nm) complexity and if you look compare the structure to the provided algorithm its the exact same; however, yours does not memoize.
One advantage that the dynamic algorithm provided in the link has is that in the same complexity class time it can recall different substring lengths in O(1); otherwise, it looks good to me.
This is a kind of thing will happen from time to time because storing subspace solutions will not always result in a better run time (on first call) and result in the same complexity class runtime instead (eg. try to compute the nth Fibonacci number with a dynamic solution and compare that to a tail recursive solution. Note that in this case like your case, after the array is filled the first time, its faster to return an answer each successive call.

Asymmetric Levenshtein distance

Given two bit strings, x and y, with x longer than y, I'd like to compute a kind of asymmetric variant of the Levensthein distance between them. Starting with x, I'd like to know the minimum number of deletions and substitutions it takes to turn x into y.
Can I just use the usual Levensthein distance for this, or do I need I need to modify the algorithm somehow? In other words, with the usual set of edits of deletion, substitution, and addition, is it ever beneficial to delete more than the difference in lengths between the two strings and then add some bits back? I suspect the answer is no, but I'm not sure. If I'm wrong, and I do need to modify the definition of Levenshtein distance to disallow deletions, how do I do so?
Finally, I would expect intuitively that I'd get the same distance if I started with y (the shorter string) and only allowed additions and substitutions. Is this right? I've got a sense for what these answers are, I just can't prove them.
If i understand you correctly, I think the answer is yes, the Levenshtein edit distance could be different than an algorithm that only allows deletions and substitutions to the larger string. Because of this, you would need to modify, or create a different algorithm to get your limited version.
Consider the two strings "ABCD" and "ACDEF". The Levenshtein distance is 3 (ABCD->ACD->ACDE->ACDEF). If we start with the longer string, and limit ourselves to deletions and substitutions we must use 4 edits (1 deletion and 3 substitutions. The reason is that strings where deletions are applied to the smaller string to efficiently get to the larger string can't be achieved when starting with the longer string, because it does not have the complimentary insertion operation (since you're disallowing that).
Your last paragraph is true. If the path from shorter to longer uses only insertions and substitutions, then any allowed path can simply be reversed from the longer to the shorter. Substitutions are the same regardless of direction, but the inserts when going from small to large become deletions when reversed.
I haven't tested this thoroughly, but this modification shows the direction I would take, and appears to work with the values I've tested with it. It's written in c#, and follows the psuedo code in the wikipedia entry for Levenshtein distance. There are obvious optimizations that can be made, but I refrained from doing that so it was more obvious what changes I've made from the standard algorithm. An important observation is that (using your constraints) if the strings are the same length, then substitution is the only operation allowed.
static int LevenshteinDistance(string s, string t) {
int i, j;
int m = s.Length;
int n = t.Length;
// for all i and j, d[i,j] will hold the Levenshtein distance between
// the first i characters of s and the first j characters of t;
// note that d has (m+1)*(n+1) values
var d = new int[m + 1, n + 1];
// set each element to zero
// c# creates array already initialized to zero
// source prefixes can be transformed into empty string by
// dropping all characters
for (i = 0; i <= m; i++) d[i, 0] = i;
// target prefixes can be reached from empty source prefix
// by inserting every character
for (j = 0; j <= n; j++) d[0, j] = j;
for (j = 1; j <= n; j++) {
for (i = 1; i <= m; i++) {
if (s[i - 1] == t[j - 1])
d[i, j] = d[i - 1, j - 1]; // no operation required
else {
int del = d[i - 1, j] + 1; // a deletion
int ins = d[i, j - 1] + 1; // an insertion
int sub = d[i - 1, j - 1] + 1; // a substitution
// the next two lines are the modification I've made
//int insDel = (i < j) ? ins : del;
//d[i, j] = (i == j) ? sub : Math.Min(insDel, sub);
// the following 8 lines are a clearer version of the above 2 lines
if (i == j) {
d[i, j] = sub;
} else {
int insDel;
if (i < j) insDel = ins; else insDel = del;
// assign the smaller of insDel or sub
d[i, j] = Math.Min(insDel, sub);
}
}
}
}
return d[m, n];
}

Scrabble tile checking

For tile checking in scrabble, you make four 5x5 grids of letters totalling 100 tiles. I would like to make one where all 40 horizontal and vertical words are valid. The set of available tiles contains:
12 x E
9 x A, I
8 x O
6 x N, R, T
4 x D, L, S, U
3 x G
2 x B, C, F, H, M, P, V, W, Y, blank tile (wildcard)
1 x K, J, Q, X, Z
The dictionary of valid words is available here (700KB). There are about 12,000 valid 5 letter words.
Here's an example where all 20 horizontal words are valid:
Z O W I E|P I N O T
Y O G I N|O C t A D <= blank being used as 't'
X E B E C|N A L E D
W A I T E|M E R L E
V I N E R|L U T E A
---------+---------
U S N E A|K N O S P
T A V E R|J O L E D
S O F T A|I A M B I
R I D G Y|H A I T h <= blank being used as 'h'
Q U R S H|G R O U F
I'd like to create one where all the vertical ones are also valid. Can you help me solve this? It is not homework. It is a question a friend asked me for help with.
Final Edit: Solved! Here is a solution.
GNAWN|jOULE
RACHE|EUROS
IDIOT|STEAN
PINOT|TRAvE
TRIPY|SOLES
-----+-----
HOWFF|ZEBRA
AGILE|EQUID
CIVIL|BUXOM
EVENT|RIOJA
KEDGY|ADMAN
Here's a photo of it constructed with my scrabble set. http://twitpic.com/3wn7iu
This one was easy to find once I had the right approach, so I bet you could find many more this way. See below for methodology.
Construct a prefix tree from the dictionary of 5 letter words for each row and column. Recursively, a given tile placement is valid if it forms valid prefixes for its column and row, and if the tile is available, and if the next tile placement is valid. The base case is that it is valid if there is no tile left to place.
It probably makes sense to just find all valid 5x5 boards, like Glenn said, and see if any four of them can be combined. Recursing to a depth of 100 doesn't sound like fun.
Edit: Here is version 2 of my code for this.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
typedef union node node;
union node {
node* child[26];
char string[6];
};
typedef struct snap snap;
struct snap {
node* rows[5];
node* cols[5];
char tiles[27];
snap* next;
};
node* root;
node* vtrie[5];
node* htrie[5];
snap* head;
char bag[27] = {9,2,2,4,12,2,3,2,9,1,1,4,2,6,8,2,1,6,4,6,4,2,2,1,2,1,2};
const char full_bag[27] = {9,2,2,4,12,2,3,2,9,1,1,4,2,6,8,2,1,6,4,6,4,2,2,1,2,1,2};
const char order[26] = {16,23,9,25,21,22,5,10,1,6,7,12,15,2,24,3,20,13,19,11,8,17,14,0,18,4};
void insert(char* string){
node* place = root;
int i;
for(i=0;i<5;i++){
if(place->child[string[i] - 'A'] == NULL){
int j;
place->child[string[i] - 'A'] = malloc(sizeof(node));
for(j=0;j<26;j++){
place->child[string[i] - 'A']->child[j] = NULL;
}
}
place = place->child[string[i] - 'A'];
}
memcpy(place->string, string, 6);
}
void check_four(){
snap *a, *b, *c, *d;
char two_total[27];
char three_total[27];
int i;
bool match;
a = head;
for(b = a->next; b != NULL; b = b->next){
for(i=0;i<27; i++)
two_total[i] = a->tiles[i] + b->tiles[i];
for(c = b->next; c != NULL; c = c->next){
for(i=0;i<27; i++)
three_total[i] = two_total[i] + c->tiles[i];
for(d = c->next; d != NULL; d = d->next){
match = true;
for(i=0; i<27; i++){
if(three_total[i] + d->tiles[i] != full_bag[i]){
match = false;
break;
}
}
if(match){
printf("\nBoard Found!\n\n");
for(i=0;i<5;i++){
printf("%s\n", a->rows[i]->string);
}
printf("\n");
for(i=0;i<5;i++){
printf("%s\n", b->rows[i]->string);
}
printf("\n");
for(i=0;i<5;i++){
printf("%s\n", c->rows[i]->string);
}
printf("\n");
for(i=0;i<5;i++){
printf("%s\n", d->rows[i]->string);
}
exit(0);
}
}
}
}
}
void snapshot(){
snap* shot = malloc(sizeof(snap));
int i;
for(i=0;i<5;i++){
printf("%s\n", htrie[i]->string);
shot->rows[i] = htrie[i];
shot->cols[i] = vtrie[i];
}
printf("\n");
for(i=0;i<27;i++){
shot->tiles[i] = full_bag[i] - bag[i];
}
bool transpose = false;
snap* place = head;
while(place != NULL && !transpose){
transpose = true;
for(i=0;i<5;i++){
if(shot->rows[i] != place->cols[i]){
transpose = false;
break;
}
}
place = place->next;
}
if(transpose){
free(shot);
}
else {
shot->next = head;
head = shot;
check_four();
}
}
void pick(x, y){
if(y==5){
snapshot();
return;
}
int i, tile,nextx, nexty, nextz;
node* oldv = vtrie[x];
node* oldh = htrie[y];
if(x+1==5){
nexty = y+1;
nextx = 0;
} else {
nextx = x+1;
nexty = y;
}
for(i=0;i<26;i++){
if(vtrie[x]->child[order[i]]!=NULL &&
htrie[y]->child[order[i]]!=NULL &&
(tile = bag[i] ? i : bag[26] ? 26 : -1) + 1) {
vtrie[x] = vtrie[x]->child[order[i]];
htrie[y] = htrie[y]->child[order[i]];
bag[tile]--;
pick(nextx, nexty);
vtrie[x] = oldv;
htrie[y] = oldh;
bag[tile]++;
}
}
}
int main(int argc, char** argv){
root = malloc(sizeof(node));
FILE* wordlist = fopen("sowpods5letters.txt", "r");
head = NULL;
int i;
for(i=0;i<26;i++){
root->child[i] = NULL;
}
for(i=0;i<5;i++){
vtrie[i] = root;
htrie[i] = root;
}
char* string = malloc(sizeof(char)*6);
while(fscanf(wordlist, "%s", string) != EOF){
insert(string);
}
free(string);
fclose(wordlist);
pick(0,0);
return 0;
}
This tries the infrequent letters first, which I'm no longer sure is a good idea. It starts to get bogged down before it makes it out of the boards starting with x. After seeing how many 5x5 blocks there were I altered the code to just list out all the valid 5x5 blocks. I now have a 150 MB text file with all 4,430,974 5x5 solutions.
I also tried it with just recursing through the full 100 tiles, and that is still running.
Edit 2: Here is the list of all the valid 5x5 blocks I generated. http://web.cs.sunyit.edu/~levyt/solutions.rar
Edit 3: Hmm, seems there was a bug in my tile usage tracking, because I just found a block in my output file that uses 5 Zs.
COSTE
ORCIN
SCUZZ
TIZZY
ENZYM
Edit 4: Here is the final product.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>
typedef union node node;
union node {
node* child[26];
char string[6];
};
node* root;
node* vtrie[5];
node* htrie[5];
int score;
int max_score;
char block_1[27] = {4,2,0,2, 2,0,0,0,2,1,0,0,2,1,2,0,1,2,0,0,2,0,0,1,0,1,0};//ZEBRA EQUID BUXOM RIOJA ADMAN
char block_2[27] = {1,0,1,1, 4,2,2,1,3,0,1,2,0,1,1,0,0,0,0,1,0,2,1,0,1,0,0};//HOWFF AGILE CIVIL EVENT KEDGY
char block_3[27] = {2,0,1,1, 1,0,1,1,4,0,0,0,0,3,2,2,0,2,0,3,0,0,1,0,1,0,0};//GNAWN RACHE IDIOT PINOT TRIPY
//JOULE EUROS STEAN TRAVE SOLES
char bag[27] = {9,2,2,4,12,2,3,2,9,1,1,4,2,6,8,2,1,6,4,6,4,2,2,1,2,1,2};
const char full_bag[27] = {9,2,2,4,12,2,3,2,9,1,1,4,2,6,8,2,1,6,4,6,4,2,2,1,2,1,2};
const char order[26] = {16,23,9,25,21,22,5,10,1,6,7,12,15,2,24,3,20,13,19,11,8,17,14,0,18,4};
const int value[27] = {244,862,678,564,226,1309,844,765,363,4656,909,414,691,463,333,687,11998,329,218,423,536,1944,1244,4673,639,3363,0};
void insert(char* string){
node* place = root;
int i;
for(i=0;i<5;i++){
if(place->child[string[i] - 'A'] == NULL){
int j;
place->child[string[i] - 'A'] = malloc(sizeof(node));
for(j=0;j<26;j++){
place->child[string[i] - 'A']->child[j] = NULL;
}
}
place = place->child[string[i] - 'A'];
}
memcpy(place->string, string, 6);
}
void snapshot(){
static int count = 0;
int i;
for(i=0;i<5;i++){
printf("%s\n", htrie[i]->string);
}
for(i=0;i<27;i++){
printf("%c%d ", 'A'+i, bag[i]);
}
printf("\n");
if(++count>=1000){
exit(0);
}
}
void pick(x, y){
if(y==5){
if(score>max_score){
snapshot();
max_score = score;
}
return;
}
int i, tile,nextx, nexty;
node* oldv = vtrie[x];
node* oldh = htrie[y];
if(x+1==5){
nextx = 0;
nexty = y+1;
} else {
nextx = x+1;
nexty = y;
}
for(i=0;i<26;i++){
if(vtrie[x]->child[order[i]]!=NULL &&
htrie[y]->child[order[i]]!=NULL &&
(tile = bag[order[i]] ? order[i] : bag[26] ? 26 : -1) + 1) {
vtrie[x] = vtrie[x]->child[order[i]];
htrie[y] = htrie[y]->child[order[i]];
bag[tile]--;
score+=value[tile];
pick(nextx, nexty);
vtrie[x] = oldv;
htrie[y] = oldh;
bag[tile]++;
score-=value[tile];
}
}
}
int main(int argc, char** argv){
root = malloc(sizeof(node));
FILE* wordlist = fopen("sowpods5letters.txt", "r");
score = 0;
max_score = 0;
int i;
for(i=0;i<26;i++){
root->child[i] = NULL;
}
for(i=0;i<5;i++){
vtrie[i] = root;
htrie[i] = root;
}
for(i=0;i<27;i++){
bag[i] = bag[i] - block_1[i];
bag[i] = bag[i] - block_2[i];
bag[i] = bag[i] - block_3[i];
printf("%c%d ", 'A'+i, bag[i]);
}
char* string = malloc(sizeof(char)*6);
while(fscanf(wordlist, "%s", string) != EOF){
insert(string);
}
free(string);
fclose(wordlist);
pick(0,0);
return 0;
}
After finding out how many blocks there were (nearly 2 billion and still counting), I switched to trying to find certain types of blocks, in particular the difficult to construct ones using uncommon letters. My hope was that if I ended up with a benign enough set of letters going in to the last block, the vast space of valid blocks would probably have one for that set of letters.
I assigned each tile a value inversely proportional to the number of 5 letter words it appears in. Then, when I found a valid block I would sum up the tile values, and if the score was the best I had yet seen, I would print out the block.
For the first block I removed the blank tiles, figuring that the last block would need that flexibility the most. After letting it run until I had not seen a better block appear for some time, I selected the best block, and removed the tiles in it from the bag, and ran the program again, getting the second block. I repeated this for the 3rd block. Then for the last block I added the blanks back in and used the first valid block it found.
Here's how I would try this. First construct a prefix tree.
Pick a word and place it horizontally on top. Pick a word and place it vertically. Alternate them until exhausted options. By alternating you start to fix the first letters and eliminating lots of mismatching words. If you really do find such square, then do a check whether they can be made with those pieces.
For 5x5 squares: after doing some thinking it can't be worse than O(12000!/11990!) for random text words. But thinking about it a little bit more. Every time you fix a letter (in normal text) you eliminate about 90% (an optimistic guess) of your words. This means after three iterations you've got 12 words. So the actual speed would be
O(n * n/10 * n/10 * n/100 * n/100 * n/1000 * n/1000 ...
which for 12000 elements acts something like n^4 algorithm
which isn't that bad.
Probably someone can do a better analysis of the problem. But the search for words should still converge quite quickly.
There can be more eliminating done by abusing the infrequent letters. Essentially find all words that have infrequent letters. Try to make a matching positions for each letters. Construct a set of valid letters for each position.
For example, let's say we have four words with letter Q in it.
AQFED, ZQABE, EDQDE, ELQUO
this means there are two valid positionings of those:
xZxxx
AQFED
xAxxx ---> this limits our search for words that contain [ABDEFZ] as the second letter
xBxxx
xExxx
same for the other
EDQDE ---> this limits our search for words that contain [EDLU] as the third letter
ELQUO
all appropriate words are in union of those two conditions
So basically, if we have multiple words that contain infrequent letter X in word S at position N, means that other words that are in that matrix must have letter that is also in S in position n.
Formula:
Find all words that contain infrequent letter X at position 1 (next iteration 2, 3... )
Make a set A out of the letters in those words
Keep only those words from the dictionary that have letter from set A in position 1
Try to fit those into the matrix (with the first method)
Repeat with position 2
I would approach the problem (naively, to be sure) by taking a pessimistic view. I'd try to prove there was no 5x5 solution, and therefore certainly not four 5x5 solutions. To prove there was no 5x5 solution I'd try to construct one from all possibilities. If my conjecture failed and I was able to construct a 5x5 solution, well, then, I'd have a way to construct 5x5 solutions and I would try to construct all of the (independent) 5x5 solutions. If there were at least 4, then I would determine if some combination satisfied the letter count restrictions.
[Edit] Null Set has determined that there are "4,430,974 5x5 solutions". Are these valid?
I mean that we have a limitation on the number of letters we can use. This limitation can be expressed as a boundary vector BV = [9, 2, 2, 4, ...] corresponding to the limits on A, B, C, etc. (You see this vector in Null Set's code). A 5x5 solution is valid if each term of its letter count vector is less than the corresponding term in BV. It would be easy to check if a 5x5 solution is valid as it was created. Perhaps the 4,430,974 number can be reduced, say to N.
Regardless, we can state the problem as: find four letter count vectors among the N whose sum is equal to BV. There are (N, 4) possible sums ("N choose 4"). With N equal to 4 million this is still on the order of 10^25---not an encouraging number. Perhaps you could search for four whose first terms sum to 9, and if so checking that their second terms sum to 2, etc.
I'd remark that after choosing 4 from N the computations are independent, so if you have a multi-core machine you can make this go faster with a parallel solution.
[Edit2] Parallelizing probably wouldn't make much difference, though. At this point I might take an optimistic view: there are certainly more 5x5 solutions than I expected, so there may be more final solutions than expected, too. Perhaps you might not have to get far into the 10^25 to hit one.
I'm starting with something simpler.
Here are some results so far:
3736 2x2 solutions
8812672 3x3 solutions
The 1000th 4x4 solution is
A A H S
A C A I
L A I R
S I R E
The 1000th 5x5 solution is
A A H E D
A B U N A
H U R S T
E N S U E
D A T E D
The 1000th 2x4x4 solution is
A A H S | A A H S
A B A C | A B A C
H A I R | L E K U
S C R Y | S T E D
--------+--------
D E E D | D E E M
E I N E | I N T I
E N O L | O V E R
T E L T | L Y N E
Note that transposing an 'A' and a blank that is being used as an 'A' should be considered the same solution. But transposing the rows with the columns should be considered a different solution. I hope that makes sense.
Here are a lot of precomputed 5x5's. Left as an exercise to the reader to find 4 compatible ones :-)
http://www.gtoal.com/wordgames/wordsquare/all5

Resources