Valid Sudoku: How to decrease runtime - python-3.x

Problem is to check whether the given 2D array represents a valid Sudoku or not. Given below are the conditions required
Each row must contain the digits 1-9 without repetition.
Each column must contain the digits 1-9 without repetition.
Each of the 9 3x3 sub-boxes of the grid must contain the digits 1-9 without repetition.
Here is the code I prepared for this, please give me tips on how I can make it faster and reduce runtime and whether by using the dictionary my program is slowing down ?
def isValidSudoku(self, boards: List[List[str]]) -> bool:
r = {}
a = {}
for i in range(len(boards)):
c = {}
for j in range(len(boards[i])):
if boards[i][j] != '.':
x,y = r.get(boards[i][j]+f'{j}',0),c.get(boards[i][j],0)
u,v = (i+3)//3,(j+3)//3
z = a.get(boards[i][j]+f'{u}{v}',0)
if (x==0 and y==0 and z==0):
r[boards[i][j]+f'{j}'] = x+1
c[boards[i][j]] = y+1
a[boards[i][j]+f'{u}{v}'] = z+1
else:
return False
return True

Simply optimizing assignment without rethinking your algorithm limits your overall efficiency by a lot. When you make a choice you generally take a long time before discovering a contradiction.
Instead of representing, "Here are the values that I have figured out", try to represent, "Here are the values that I have left to try in each spot." And now your fundamental operation is, "Eliminate this value from this spot." (Remember, getting it down to 1 propagates to eliminating the value from all of its peers, potentially recursively.)
Assignment is now "Eliminate all values but this one from this spot."
And now your fundamental search operation is, "Find the square with the least number of remaining possibilities > 1. Try each possibility in turn."
This may feel heavyweight. But the immediate propagation of constraints results in very quickly discovering constraints on the rest of the solution, which is far faster than having to do exponential amounts of reasoning before finding the logical contradiction in your partial solution so far.
I recommend doing this yourself. But https://norvig.com/sudoku.html has full working code that you can look at at need.

Related

Algorithm, that finds the k-greatest number in O(n*log(k))

was wondering, if you have given an unsorted list of arrays of any length n >= k,
what is your idea, to find the k-greatest number in O(n*log(k)) time. So the k = 2 -greatest number of an Array containing the numbers 1 to 9 would be 8 for example.
I'm trying to code this in python, if you have an idea how in that time complexity :)
My answer is not python-specific, however you should be able to implement the used concepts in python, or find libraries already implementing them.
The basic idea is to iterate over the list and store the current greatest, second greatest, ... , k-greatest number in a separate data structure. Since you will be iterating over all n entries in your array, the complexity of this is in O(n * insertion_step_complexity)
As seen above, the insertion step needs to not exceed a complexity of O(log(k)) to achieve this you can use a AVL-Tree that has a complexity of O(log(m)) for inserting and deleting items, where m is the number of items stored within the avl-tree.
An algorithm would look like this:
def find_k_greatest_number(k, array):
avl_tree = initialize AVL tree here
avl_items = 0
for number in array:
if (number > avl_tree.smallest_number()):
if (avl_itmes >= k):
avl_tree.delete_smallest_number()
else:
avl_items++
avl_tree.insert(number)
return avl_tree.smallest_number()
Finding the smallest number in a sorted tree is dependent on its height. Since the AVL tree can't exceed the height of log(k) the complexity of finding the smallest number is O(log(k)).

How does Duval's algorithm handle odd-length strings?

Finding the Lexicographically minimal string rotation is a well known problem, for which a linear time algorithm was proposed by Jean Pierre Duval in 1983. This blog post is probably the only publicly available resource that talks about the algorithm in detail. However, Duval's algorithms is based on the idea of pairwise comparisons ("duels"), and the blog conveniently uses an even-length string as an example.
How does the algorithm work for odd-length strings, where the last character wouldn't have a competing one to duel with?
One character can get a "bye", where it wins without participating in a "duel". The correctness of the algorithm does not rely on the specific duels that you perform; given any two distinct indices i and j, you can always conclusively rule out that one of them is the start-index of the lexicographically-minimal rotation (unless both are start-indices of identical lexicographically-minimal rotations, in which case it doesn't matter which one you reject). The reason to perform the duels in a specific order is performance: to get asymptotically linear time by ensuring that half the duels only need to compare one character, half of the rest only need to compare two characters, and so on, until the last duel only needs to compare half the length of the string. But a single odd character here and there doesn't change the asymptotic complexity, it just makes the math (and implementation) a little bit more complicated. A string of length 2n+1 still requires fewer "duels" than one of length 2n+1.
OP here: I accepted ruakh's answer as it pertains to my question, but I wanted to provide my own explanation for others that might stumble across this post trying to understand Duval's algorithm.
Problem:
Lexicographically least circular substring is the problem of finding
the rotation of a string possessing the lowest lexicographical order
of all such rotations. For example, the lexicographically minimal
rotation of "bbaaccaadd" would be "aaccaaddbb".
Solution:
A O(n) time algorithm was proposed by Jean Pierre Duval (1983).
Given two indices i and j, Duval's algorithm compares string segments of length j - i starting at i and j (called a "duel"). If index + j - i is greater than the length of the string, the segment is formed by wrapping around.
For example, consider s = "baabbaba", i = 5 and j = 7. Since j - i = 2, the first segment starting at i = 5 is "ab". The second segment starting at j = 7 is constructed by wrapping around, and is also "ab".
If the strings are lexicographically equal, like in the above example, we choose the one starting at i as the winner, which is i = 5.
The above process repeated until we have a single winner. If the input string is of odd length, the last character wins without a comparison in the first iteration.
Time complexity:
The first iteration compares n strings each of length 1 (n/2 comparisons), the second iteration may compare n/2 strings of length 2 (n/2 comparisons), and so on, until the i-th iteration compares 2 strings of length n/2 (n/2 comparisons). Since the number of winners is halved each time, the height of the recursion tree is log(n), thus giving us a O(n log(n)) algorithm. For small n, this is approximately O(n).
Space complexity is O(n) too, since in the first iteration, we have to store n/2 winners, second iteration n/4 winners, and so on. (Wikipedia claims this algorithm uses constant space, I don't understand how).
Here's a Scala implementation; feel free to convert to your favorite programming language.
def lexicographicallyMinRotation(s: String): String = {
#tailrec
def duel(winners: Seq[Int]): String = {
if (winners.size == 1) s"${s.slice(winners.head, s.length)}${s.take(winners.head)}"
else {
val newWinners: Seq[Int] = winners
.sliding(2, 2)
.map {
case Seq(x, y) =>
val range = y - x
Seq(x, y)
.map { i =>
val segment = if (s.isDefinedAt(i + range - 1)) s.slice(i, i + range)
else s"${s.slice(i, s.length)}${s.take(s.length - i)}"
(i, segment)
}
.reduce((a, b) => if (a._2 <= b._2) a else b)
._1
case xs => xs.head
}
.toSeq
duel(newWinners)
}
}
duel(s.indices)
}

Solving math with integers larger than any available integer data type

In some programming competitions where the numbers are larger than any available integer data type, we often use strings instead.
Question 1:
Given these large numbers, how to calculate e and f in the below expression?
(a/b) + (c/d) = e/f
note: GCD(e,f) = 1, i.e. they must be in minimised form. For example {e,f} = {1,2} rather than {2,4}.
Also, all a,b,c,d are large numbers known to us.
Question 2:
Can someone also suggest a way to find GCD of two big numbers (bigger than any available integer type)?
I would suggest using full bytes or words rather than strings.
It is relatively easy to think in base 256 instead of base 10 and a lot more efficient for the processor to not do multiplication and division by 10 all the time. Ideally, choose a word size that is half the processor's natural word size, as that makes carry easy to implement. Of course thinking in base 64K or 4G is slightly more complex, but even better than base 256.
The only downside is generating the initial big numbers from the ascii input, which you get for free in base 10. Using a larger word size you can make this more efficient by processing a number of digits initially into a single word (eg 9 digits at a time into 4G), then performing a long multiply of that single word into the correct offset in your large integer format.
A compromise might be to run your engine in base 1 billion: This will still be 9 or 81 times more efficient than using base 10!
The simplest way to solve this equation is to multiply a/b * d/d and c/d * b/b so they both have the common denominator b*d.
I think you will then need to prime factorise your big numbers e and f to find any common factors. Remember to search again for the same factor squared.
Of course, that means you have to write a prime generating sieve. You only need to generate factors up to the square root, or half the digits of the min value of e and f.
You could prime factorise b and d to get a lower initial denominator, but you will need to do it again anyway after the addition.
I think that the way to solve this is to separate the problem:
Process the input numbers as an array of characters (ie. std::string)
Make a class where each object can store an std::list (or similar) that represents one of the large numbers, and can do the needed arithmetic with your data
You can then solve your problems normally, without having to worry about your large inputs causing overflow.
Here's a webpage that explains how you can have such an arithmetic class (with sample code in C++ showing addition).
Once you have such an arithmetic class, you no longer need to worry about how to store the data or any overflow.
I get the impression that you already know how to find the GCD when you don't have overflow issues, but just in case, here's an explanation of finding the GCD (with C++ sample code).
As for the specific math problem:
// given formula: a/b + c/d = e/f
// = ( ( a*d + b*c ) / ( b*d ) )
// Define some variables here to save on copying
// (I assume that your class that holds the
// large numbers is called "ARITHMETIC")
ARITHMETIC numerator = a*d + b*c;
ARITHMETIC denominator = b*d;
ARITHMETIC gcd = GCD( numerator , denominator );
// because we know that GCD(e,f) is 1, this implies:
ARITHMETIC e = numerator / gcd;
ARITHMETIC f = denominator / gcd;

Simultaneous Subset sums

I am dealing with a problem which is a variant of a subset-sum problem, and I am hoping that the additional constraint could make it easier to solve than the classical subset-sum problem. I have searched for a problem with this constraint but I have been unable to find a good example with an appropriate algorithm either on StackOverflow or through googling elsewhere.
The problem:
Assume you have two lists of positive numbers A1,A2,A3... and B1,B2,B3... with the same number of elements N. There are two sums Sa and Sb. The problem is to find the simultaneous set Q where |sum (A{Q}) - Sa| <= epsilon and |sum (B{Q}) - Sb| <= epsilon. So, if Q is {1, 5, 7} then A1 + A5 + A7 - Sa <= epsilon and B1 + B5 + B7 - Sb <= epsilon. Epsilon is an arbitrarily small positive constant.
Now, I could solve this as two completely separate subset sum problems, but removing the simultaneity constraint results in the possibility of erroneous solutions (where Qa != Qb). I also suspect that the additional constraint should make this problem easier than the two NP complete problems. I would like to solve an instance with 18+ elements in both lists of numbers, and most subset-sum algorithms have a long run time with this number of elements. I have investigated the pseudo-polynomial run time dynamic programming algorithm, but this has the problems that a) the speed relies on a short bit-depth of the list of numbers (which does not necessarily apply to my instance) and b) it does not take into account the simultaneity constraint.
Any advice on how to use the simultaneity constraint to reduce the run time? Is there a dynamic programming approach I could use to take into account this constraint?
If I understand your description of the problem correctly (I'm confused about why you have the distance symbols around "sum (A{Q}) - Sa" and "sum (B{Q}) - Sb", it doesn't seem to fit the rest of the explanation), then it is in NP.
You can see this by making a reduction from Subset sum (SUB) to Simultaneous subset sum (SIMSUB).
If you have a SUB problem consisting of a set X = {x1,x2,...,xn} and a target called t and you have an algorithm that solves SIMSUB when given two sets A = {a1,a2,...,an} and B = {b1,b2,...,bn}, two intergers Sa and Sb and a value for epsilon then we can solve SUB like this:
Let A = X and let B be a set of length n consisting of only 0's. Set Sa = t, Sb = 0 and epsilon = 0. You can now run the SIMSUB algorithm on this problem and get the solution to your SUB problem.
This shows that SUBSIM is as least as hard as SUB and therefore in NP.

Percentage diff b/t two strings of different lengths

I have a problem where I am trying to prevent repeats of a string. So far the best solution is to compare the strings for a percentage and check if it is above a certain fixed point.
I've looked up Levenshtein distance but so far I believe it does not accomplish my goal since it compares strings of the same length. Both of my strings are more than likely to be significantly different lengths (stack trace). I'm looking for content or word comparison rather than char to char comparison. A percentage answer is the most important part of this.
I assume someone has an algorithm or would be willing to point me in the right direction? Thank you for reading and even more so for helping!
An indirect example... think of them as being stacktraces in py.test form.
I have filepaths and am comparing them
/test/opt/somedir/blah/something
def do_something(self, x):
return x
SomeError: do_something in 'filepath' threw some exception or something
vs
/test/opt/somedir/blah2/somethingelse
def do_another_thing(self, y):
return y
SomeError: do_another_thing in 'different filepath' threw some exception
But also when you have the same filepath, but different errors. The traces are hundreds of lines long, so showing a full example isn't reasonable. This example is as close as I can get without the actual trace.
One way of going at this would be through applications of the Jaro-Winkler String Similarity metric. Happily, this has a PyPI package.
Let's start off with three string, your two examples, and the begining of your question:
s1 = u'''
/test/opt/somedir/blah/something
def do_something(self, x):
return x
SomeError: do_something in 'filepath' threw some exception or something'''
s2 = u'''
/test/opt/somedir/blah2/somethingelse
def do_another_thing(self, y):
return y
SomeError: do_another_thing in 'different filepath' threw some exception'''
q = u'''
I have a problem where I am trying to prevent repeats of a string. So far the best solution is to compare the strings for a percentage and check if it is above a certain fixed point.'''
Then the similarities are:
>> jaro.jaro_metric(s1, s2)
0.8059572665529058
>> jaro.jaro_metric(s1, q)
0.6562121541167517
However, since you know something of the problem domain (it is a sequence of lines of stacktraces), you could do better by calculating line differences, perhaps:
import itertools
>> [jaro.jaro_metric(l1, l2) for l1, l2 in itertools.izip(s1.split('\n'), s2.split('\n'))]
[1.0,
0.9353471118177001,
0.8402824228911184,
0.9444444444444443,
0.8043725314852076]
So, you need to experiment with this, but you could try, given two stacktraces, calculating a "distance" which is a matrix - the i-j entry would be the similarity between the i-th string of the first to the j-th of the second. (This is a bit computationally expensive.) See if there's a threshold for a percentage or number of entries obtaining very high scores.

Resources