Probability question: Estimating the number of attempts needed to exhaustively try all possible placements in a word search

Would it be reasonable to systematically try all possible placements in a word search?
Grids commonly have dimensions of 15*15 (15 cells wide, 15 cells tall) and contain about 15 words to be placed, each of which can be placed in 8 possible directions. So in general it seems you can calculate the number of possible placements as follows:
width * height * 8 directions to place a word * number of words
So for such a grid it seems we only need to try 15*15*8*15 = 27,000 placements, which doesn't seem that bad at all. I was expecting some huge number, so either the grid size and number of words are really small or there is something fishy with my math.

Formally speaking, assuming that x is the number of rows and y is the number of columns, you should sum the possibilities for every possible direction and for every possible word.
Inputs are: x, y, l (average length of a word), n (total words)
so you have
horizontally, a word can start from 0 to x-l going right, or from l to x going left, for each row: 2x(x-l)
the same approach is used for vertical words: they can go from 0 to y-l going down or from l to y going up. So it's 2y(y-l)
for diagonal words you should consider all possible start positions x*y and subtract l^2, since a rectangle of the field can't be used. As before, you multiply by 4 since there are 4 possible directions: 4*(x*y - l^2).
Then you multiply the whole result by the number of words included:
total = n*(2*x*(x-l) + 2*y*(y-l) + 4*(x*y - l^2))
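For comparison, an exact per-word count can be obtained by asking, for each of the 8 directions, how many start cells keep the whole word inside the grid. The sketch below assumes a grid of `rows` by `cols` cells and a single word of `length` letters; the name `count_placements` is made up for illustration, and its totals are not expected to match the estimate above term for term.

```python
# All 8 directions as (row step, column step) pairs.
DIRECTIONS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1) if (dr, dc) != (0, 0)]


def count_placements(rows, cols, length):
    """Count (start cell, direction) pairs that keep a word of `length` inside the grid."""
    span = length - 1  # cells the word extends beyond its start cell
    total = 0
    for dr, dc in DIRECTIONS:
        # Start cells that leave room for the word along this direction.
        free_rows = rows - abs(dr) * span
        free_cols = cols - abs(dc) * span
        if free_rows > 0 and free_cols > 0:
            total += free_rows * free_cols
    return total


per_word = count_placements(15, 15, 5)
print(per_word, 15 * per_word)
```

For a 15*15 grid and 5-letter words this gives 1,144 placements per word, or 17,160 for 15 such words. This counts each word on its own; trying every combination of placements for all words together multiplies the per-word counts rather than adding them, which is where a huge number would come from.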

Finding the optimal selections of x number per column and y numbers per row of an NxM array

Given an NxM array of positive integers, how would one go about selecting integers so that the sum of the selected values is maximized, where there is a maximum of x selections in each row and y selections in each column? This is an abstraction of a problem I am facing in making NCAA swimming lineups. Each swimmer has a time in every event that can be converted to an integer using the USA Swimming Power Points Calculator (the higher the better). Once those times are converted, I want to assign no more than 3 swimmers per event and no more than 3 races per swimmer such that the total sum of power scores is maximized. I think this is similar to the weapon-target assignment (WTA) problem, but that problem allows a weapon type to attack the same target more than once (in my case, allowing a single swimmer to race the same event twice), and that does not work for my use case. Does anybody know what this variation on the WTA problem is called, and if so, do you know of any solutions or resources I could look to?
Here is a mathematical model:
Data
Let a[i,j] be the data matrix
and
x: max number of selected cells in each row
y: max number of selected cells in each column
(Note: this is a bit unusual: we normally reserve the names x and y for variables. These conventions can help with readability).
Variables
δ[i,j] ∈ {0,1} are binary variables indicating if cell (i,j) is selected.
Optimization Model
max sum((i,j), a[i,j]*δ[i,j])
sum(j,δ[i,j]) ≤ x ∀i
sum(i,δ[i,j]) ≤ y ∀j
δ[i,j] ∈ {0,1}
This can be fed into any MIP solver.
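As a sanity check on the model (not a replacement for a MIP solver), the constraints can be enumerated by brute force on a tiny instance. This is a minimal sketch using only the standard library; the function name `best_selection` and the 3x3 data are made up for illustration, and the approach is only practical for very small matrices because it tries all 2^(N*M) selections.

```python
from itertools import product


def best_selection(a, x, y):
    """Enumerate every binary matrix delta and keep the best feasible one.

    a: data matrix, x: max selected cells per row, y: max selected cells per column.
    """
    rows, cols = len(a), len(a[0])
    best_value, best_delta = -1, None
    for bits in product((0, 1), repeat=rows * cols):
        delta = [bits[r * cols:(r + 1) * cols] for r in range(rows)]
        # Row constraint: sum(j, delta[i,j]) <= x for every row i.
        if any(sum(row) > x for row in delta):
            continue
        # Column constraint: sum(i, delta[i,j]) <= y for every column j.
        if any(sum(delta[r][c] for r in range(rows)) > y for c in range(cols)):
            continue
        value = sum(a[r][c] * delta[r][c] for r in range(rows) for c in range(cols))
        if value > best_value:
            best_value, best_delta = value, delta
    return best_value, best_delta


a = [[5, 3, 1],
     [4, 6, 2],
     [7, 1, 8]]
print(best_selection(a, 1, 1))
```

A result like this can be compared against what a MIP solver returns for the same small instance before scaling up to real data.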

How to code rule number 4 of the Western Electric rules for quality control charts

I am new to statistics. I have a problem at hand for which I need to code all 4 of the Western Electric rules for quality control. I have been able to code the first and second with the help of a peer. Could anyone help me write rule number 4: "NINE consecutive points fall on the same side of the centerline"?
I plotted rule 1 by getting the data below and above the threshold and then running the matplotlib plot in a single cell.
I am not able to get the data for rule number 4.
Even though no one answered, the question gave me enough motivation to work out a solution.
Although I know my code is not optimal, please suggest any further changes I should make.
from itertools import groupby
from operator import itemgetter

import pandas as pd

# Split the positions into those above the mean and the rest.
# Points exactly on the mean are counted with the lower side here.
temp_up = []
temp_down = []
for i in range(len(data)):
    if arr_data[i] > y_mean:
        temp_up.append(i)
    else:
        temp_down.append(i)

# Now we have index values for the data above and below the mean.
# Group consecutive indices to find any run of length 9 or more.
d_up = []
d_down = []
for k, g in groupby(enumerate(temp_up), lambda ix: ix[0] - ix[1]):
    t_up = list(map(itemgetter(1), g))
    if len(t_up) >= 9:  # the run has at least 9 points
        # From the 9th point of the run onward, each index violates rule 4 (above mean).
        d_up.extend(t_up[8:])
for k, g in groupby(enumerate(temp_down), lambda ix: ix[0] - ix[1]):
    t_down = list(map(itemgetter(1), g))
    if len(t_down) >= 9:  # the run has at least 9 points
        # From the 9th point of the run onward, each index violates rule 4 (below mean).
        d_down.extend(t_down[8:])

# Rows to mark in red on the chart for the above-mean violations.
data_above_r4 = pd.DataFrame(data.iloc[d_up])
Really late follow-up, but you could do something like "if the minimum of the previous 9 points and the maximum of the previous 9 points are both on the same side (> or <) of the mean, failure."
This shows the rolling min; rolling max should work similarly: https://stackoverflow.com/a/33920859/18474867
Generate a "rolling min" and a "rolling max" column over the last 9 rows; anywhere both are above the mean or both are below it, flag it as a failure.
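The min/max idea can be sketched in plain Python without any rolling helper. This is only an illustration: the function name `rule4_violations` and the sample data are made up, and unlike the code in the question, a point lying exactly on the centerline breaks a run here.

```python
def rule4_violations(values, center, run=9):
    """Return every index i where the `run` points ending at i sit on one side of center."""
    flagged = []
    for i in range(run - 1, len(values)):
        window = values[i - run + 1:i + 1]
        # All points above center <=> the window minimum is above center;
        # all points below center <=> the window maximum is below center.
        if min(window) > center or max(window) < center:
            flagged.append(i)
    return flagged


sample = [1, 1, 1] + [5] * 10  # ten consecutive points above a centerline of 3
print(rule4_violations(sample, center=3))
```

With pandas, the same comparison could be made on `rolling(9).min()` and `rolling(9).max()` columns, as the linked answer suggests for the rolling minimum.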

String pre-processing step, to answer further queries in O(1) time

You are given a string consisting of only 3 distinct characters, say x, y and z.
There will be a million queries.
Query format: x z i j
For each query we need to count all the different substrings that begin with x and end with z. i and j denote the lower and upper bounds of the range in which the substring must lie; it must not cross them.
My logic:
Read the string. Keep 3 arrays that store the running counts of x, y and z respectively, for i = 0 up to strlen.
Store the indexes of each character separately in 3 more arrays: xlocation[], ylocation[], zlocation[].
Then, according to the query (a b i j), find all the indices of b within the range i to j.
Calculate the answer for each index of b and sum these to get the result.
Is it possible to pre-process the string before the queries, so that each query takes O(1) time to answer?
As the others suggested, you can do this with a divide and conquer algorithm.
Optimal substructure:
If we are given a left half of the string and a right half, and we know how many substrings there are in the left half and how many in the right half, then we can add the two numbers together. That undercounts by exactly the substrings that begin in the left half and end in the right half, and their number is simply the number of x's in the left half multiplied by the number of z's in the right half.
Therefore we can use a recursive algorithm.
This would be a problem, however, if we tried to solve for every single i and j combination, as the bottom-level subproblems would be solved many, many times.
You should look into implementing this with a dynamic programming algorithm that keeps track of the substrings in range i,j, the x's in range i,j, and the z's in range i,j.
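One way to get O(1) per query is to apply the same counting rule to prefixes instead of halves. The sketch below rests on assumptions not stated in the thread: substrings are counted by position (not by distinct content), the two query characters differ, and i and j are 0-based inclusive bounds. The names `preprocess` and `query` are made up for illustration.

```python
from itertools import permutations


def preprocess(s, alphabet="xyz"):
    """Build prefix tables in O(len(s)) time for a fixed small alphabet."""
    n = len(s)
    # count[c][k] = number of occurrences of c in s[:k]
    count = {c: [0] * (n + 1) for c in alphabet}
    # pairs[(a, b)][k] = number of (p, q) with p < q < k, s[p] == a, s[q] == b
    pairs = {ab: [0] * (n + 1) for ab in permutations(alphabet, 2)}
    for k, ch in enumerate(s):
        for c in alphabet:
            count[c][k + 1] = count[c][k] + (ch == c)
        for (a, b), table in pairs.items():
            # A new b at position k pairs with every earlier a.
            table[k + 1] = table[k] + (count[a][k] if ch == b else 0)
    return count, pairs


def query(tables, a, b, i, j):
    """Count substrings inside s[i..j] that start with a and end with b."""
    count, pairs = tables
    b_inside = count[b][j + 1] - count[b][i]
    # Pairs whose end lies in [i, j], minus those whose start lies before i.
    return pairs[(a, b)][j + 1] - pairs[(a, b)][i] - count[a][i] * b_inside


tables = preprocess("xzxyzz")
print(query(tables, "x", "z", 0, 5), query(tables, "x", "z", 1, 5))
```

The subtraction term is the cross-boundary product from the answer above: the a's before position i multiplied by the b's inside the range.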

CodeJam 2014: Solution for The Repeater

I participated in Code Jam. I successfully solved the small input of The Repeater challenge but can't figure out an approach for multiple strings.
Can anyone give the algorithm used for multiple strings? For 2 strings (small input) I am comparing the strings character by character and doing operations to make them equal. However, this approach would time out for the large input.
Can someone explain the algorithm they used? I can see other users' solutions but can't figure out what they have done.
I can tell you my solution, which worked fine for both small and large inputs.
First, we have to see if there is a solution. You do that by bringing all strings to their "simplest" form. If any of them does not match, then there is no solution.
e.g.
aaabbbc => abc
abbbbbcc => abc
abbcca => abca
If only the first two were given, then a solution would be possible. As soon as the third is thrown into the mix, it's impossible. The algorithm to do the "simplification" is to parse the string and collapse every run of a repeated character into a single character. As soon as a string does not equal the simplified form of the batch, bail out.
As for the actual solution to the problem, I simply converted the strings to a [letter, repeat] format. So for example
qwerty => 1q,1w,1e,1r,1t,1y
qqqwweeertttyy => 3q,2w,3e,1r,3t,2y
(mind you the outputs are internal structures, not actual strings)
Imagine now you have 100 strings, you have already passed the test that there is a solution, and you have all strings in the [letter, repeat] representation. Now go through every letter and find the smallest total number of repetitions you have to add or remove so that every string reaches the same count. So for example
1a, 1a, 1a => 0 diff
1a, 2a, 2a => 1 diff
1a, 3a, 10a => 9 diff (to bring everything to 3)
The way to do this (I'm pretty sure there is a more efficient way) is to go from the min number to the max number and calculate the sum of all diffs for each candidate. I did not assume that the best target is one of the numbers in the set. For the last example, you would calculate the diff to bring everything to 1 (0, 2, 9 = 11), then to 2 (1, 1, 8 = 10), then to 3 (2, 0, 7 = 9) and so on up to 10, and choose the minimum. Strings are limited to 1000 characters, so this is an easy calculation. On my moderate laptop, the results were instant.
Repeat the same for every letter of the strings and sum everything up and that is your solution.
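The steps above can be sketched in Python as follows. This is a rough illustration rather than the answerer's actual code: the names `run_lengths` and `min_moves` are made up, and `None` stands in for the "no solution" case.

```python
from itertools import groupby


def run_lengths(s):
    """Convert a string to its [letter, repeat] form, e.g. 'aab' -> [('a', 2), ('b', 1)]."""
    return [(ch, len(list(group))) for ch, group in groupby(s)]


def min_moves(strings):
    """Return the minimum number of single-character edits, or None if impossible."""
    encoded = [run_lengths(s) for s in strings]
    # Every string must reduce to the same "simplest" form.
    letters = [[ch for ch, _ in enc] for enc in encoded]
    if any(form != letters[0] for form in letters):
        return None
    total = 0
    # zip(*...) lines up the repeat counts of the same letter across all strings.
    for counts in zip(*[[n for _, n in enc] for enc in encoded]):
        # Try every target from the min to the max count and keep the cheapest.
        total += min(
            sum(abs(c - target) for c in counts)
            for target in range(min(counts), max(counts) + 1)
        )
    return total


print(min_moves(["aaabbbc", "abbbbbcc"]))
print(min_moves(["aaabbbc", "abbbbbcc", "abbcca"]))
```

The first call uses the two compatible example strings, the second adds the string whose simplest form is abca and therefore has no solution.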
This answer gives an example to explain why finding the median number of repeats produces the lowest cost.
Suppose we have values:
1 20 30 40 100
And we are trying to find the value which has shortest total distance to all these values.
We might guess the best answer is 50, with cost |50-1|+|50-20|+|50-30|+|50-40|+|50-100| = 159.
Split this into two sums, left and right, where left is the cost of all numbers to the left of our target, and right is the cost of all numbers to the right.
left = |50-1|+|50-20|+|50-30|+|50-40| = 50-1+50-20+50-30+50-40 = 109
right = |50-100| = 100-50 = 50
cost = left + right = 159
Now consider changing the value by x. Providing x is small enough such that the same numbers are on the left, then the values will change to:
left(x) = |50+x-1|+|50+x-20|+|50+x-30|+|50+x-40| = 109 + 4x
right(x) = |50+x-100| = 50 - x
cost(x) = left(x)+right(x) = 159+3x
So if we set x=-1 we will decrease our cost by 3, therefore the best answer is not 50.
The amount our cost changes when we move by one is given by the difference between the number of values to our left (4) and the number of values to our right (1).
Therefore, as long as these are different we can always decrease our cost by moving towards the median.
Therefore the median gives the lowest cost.
If there are an even number of points, such as 1,100 then all numbers between the two middle points will give identical costs, so any of these values can be chosen.
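The argument can be checked numerically on the same five values. This is a small sketch; `total_distance` is a made-up helper name, and `median_low` from the standard library is used so that the target is always one of the values.

```python
from statistics import median_low


def total_distance(values, target):
    """Sum of absolute differences between each value and the target."""
    return sum(abs(v - target) for v in values)


values = [1, 20, 30, 40, 100]
median = median_low(values)
cost_at_median = total_distance(values, median)
# Exhaustively scan every integer target between the min and the max.
best_by_scan = min(total_distance(values, t) for t in range(min(values), max(values) + 1))
print(median, cost_at_median, best_by_scan, total_distance(values, 50))
```

The median (30) costs 119, which matches the exhaustive scan and beats the initial guess of 50 with its cost of 159.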
Since Thanasis already explained the solution, I'm providing my source code in Ruby here. It's really short (only 400B) and follows his algorithm exactly.
def solve(strs)
  form = strs.first.squeeze
  strs.map { |str|
    return 'Fegla Won' if form != str.squeeze
    str.chars.chunk { |c| c }.map { |arr|
      arr.last.size
    }
  }.transpose.map { |row|
    Range.new(*row.minmax).map { |n|
      row.map { |r|
        (r - n).abs
      }.reduce :+
    }.min
  }.reduce :+
end

gets.to_i.times { |i|
  result = solve gets.to_i.times.map { gets.chomp }
  puts "Case ##{i+1}: #{result}"
}
It uses the squeeze method on strings, which collapses every run of repeated characters into a single character. This way, you just compare every squeezed line to the reference (the variable form). If there's an inconsistency, you just return that Fegla Won.
Next you use the chunk method on the char array, which groups consecutive identical characters together. This way you can count them easily.

binary sequence subsum combinations

Given a sequence a1 a2 ... a_{m+n} with n +1s and m -1s, suppose that for every 1 <= i <= m+n the partial sum
a1+a2+...+ai >= 0, i.e.,
a1 >= 0
a1+a2>=0
a1+a2+a3>=0
...
a1+a2+...+a_{m+n}>=0
then the number of sequences that meet the requirement is C(m+n,m) - C(m+n,m-1), where the first term is the total number of sequences and the second term counts those with some partial sum < 0.
I was wondering whether there is a similar formula for the two-sided version, where the partial sums from both ends must be non-negative:
a1 >= 0
a1+a2>=0
a1+a2+a3>=0
...
a1+a2+...+a_{m+n}>=0
a_{m+n}>=0
a_{m+n-1}+a_{m+n}>=0
...
a1+a2+...+a_{m+n}>=0
I feel like it can be derived similarly to the one-sided partial-sum problem, but the number C(m+n,m) - 2 * C(m+n,m-1) is definitely incorrect. Any ideas?
A clue: the first case is the number of paths (with +-1 steps) from (0,0) to the point (n+m, n-m) that never fall below the zero line. (Like Catalan numbers for parenthesis pairs, but without the balance requirement n=m.)
The desired formula is the number of such (+-1) paths that additionally never rise above the (n-m) line. It is possible to get recursive formulas. I hope that a compact formula exists for it.
If we consider lattice paths on an n x m grid, with a horizontal step for +1 and a vertical step for -1, then we need the number of paths confined to a parallelogram with base (n-m).
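For small n and m, both counts can be obtained by brute force, which gives numbers to test any candidate formula against. This is only an enumeration sketch (exponential in n+m) with made-up function names. The `ballot` helper is the standard reflection-principle count for the one-sided case with n +1s and m -1s, and it assumes n >= m >= 1.

```python
from itertools import combinations
from math import comb


def sequences(n, m):
    """Yield every sequence with n entries equal to +1 and m entries equal to -1."""
    length = n + m
    for minus_positions in combinations(range(length), m):
        seq = [1] * length
        for p in minus_positions:
            seq[p] = -1
        yield seq


def prefixes_ok(seq):
    """True if every partial sum a1 + ... + ai is >= 0."""
    total = 0
    for a in seq:
        total += a
        if total < 0:
            return False
    return True


def count_one_sided(n, m):
    return sum(prefixes_ok(s) for s in sequences(n, m))


def count_two_sided(n, m):
    # Sums taken from the right end are the partial sums of the reversed sequence.
    return sum(prefixes_ok(s) and prefixes_ok(s[::-1]) for s in sequences(n, m))


def ballot(n, m):
    """Reflection-principle count for the one-sided case (assumes n >= m >= 1)."""
    return comb(n + m, m) - comb(n + m, m - 1)


for n, m in [(2, 1), (3, 2), (4, 2)]:
    print(n, m, count_one_sided(n, m), ballot(n, m), count_two_sided(n, m))
```

A conjectured closed form for the two-sided case can be compared against `count_two_sided` over a range of small (n, m) pairs before attempting a proof.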
