Percentage diff between two strings of different lengths

I have a problem where I am trying to prevent repeats of a string. So far the best solution is to compare the strings for a percentage and check if it is above a certain fixed point.
I've looked up the Levenshtein distance, but so far I believe it does not accomplish my goal, since it compares strings of the same length. My two strings are more than likely to be of significantly different lengths (stack traces). I'm looking for content or word comparison rather than character-to-character comparison. A percentage answer is the most important part of this.
I assume someone has an algorithm or would be willing to point me in the right direction? Thank you for reading and even more so for helping!
An indirect example... think of them as being stacktraces in py.test form.
I have filepaths and am comparing them:

/test/opt/somedir/blah/something
def do_something(self, x):
    return x
SomeError: do_something in 'filepath' threw some exception or something

vs

/test/opt/somedir/blah2/somethingelse
def do_another_thing(self, y):
    return y
SomeError: do_another_thing in 'different filepath' threw some exception
The issue also arises when you have the same filepath but different errors. The traces are hundreds of lines long, so showing a full example isn't reasonable; this example is as close as I can get without the actual trace.

One way of going at this would be to apply the Jaro-Winkler string similarity metric. Happily, there is a PyPI package for it.
Let's start off with three strings: your two examples, and the beginning of your question:
import jaro

s1 = u'''
/test/opt/somedir/blah/something
def do_something(self, x):
    return x
SomeError: do_something in 'filepath' threw some exception or something'''

s2 = u'''
/test/opt/somedir/blah2/somethingelse
def do_another_thing(self, y):
    return y
SomeError: do_another_thing in 'different filepath' threw some exception'''

q = u'''
I have a problem where I am trying to prevent repeats of a string. So far the best solution is to compare the strings for a percentage and check if it is above a certain fixed point.'''
Then the similarities are:
>>> jaro.jaro_metric(s1, s2)
0.8059572665529058
>>> jaro.jaro_metric(s1, q)
0.6562121541167517
However, since you know something of the problem domain (it is a sequence of lines of stacktraces), you could do better by calculating line differences, perhaps:
>>> [jaro.jaro_metric(l1, l2) for l1, l2 in zip(s1.split('\n'), s2.split('\n'))]
[1.0,
 0.9353471118177001,
 0.8402824228911184,
 0.9444444444444443,
 0.8043725314852076]
So you need to experiment with this. But you could try, given two stacktraces, calculating a "distance" which is a matrix: the i-j entry would be the similarity between the i-th line of the first trace and the j-th line of the second. (This is a bit computationally expensive.) See if there's a threshold for the percentage or number of entries obtaining very high scores.
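A minimal sketch of that matrix idea, using the same jaro package as above (the 0.9 threshold is an assumption to tune on your own traces):

import jaro

def similarity_matrix(trace_a, trace_b):
    # pairwise Jaro similarity between every line of two stacktraces
    lines_a = trace_a.split('\n')
    lines_b = trace_b.split('\n')
    return [[jaro.jaro_metric(la, lb) for lb in lines_b] for la in lines_a]

def fraction_similar(matrix, threshold=0.9):
    # fraction of line pairs scoring at or above the threshold
    hits = sum(score >= threshold for row in matrix for score in row)
    total = sum(len(row) for row in matrix)
    return hits / total

You could then declare two traces repeats whenever fraction_similar(...) exceeds your fixed point.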

Related

Valid Sudoku: How to decrease runtime

The problem is to check whether a given 2D array represents a valid Sudoku or not. The required conditions are:
Each row must contain the digits 1-9 without repetition.
Each column must contain the digits 1-9 without repetition.
Each of the 9 3x3 sub-boxes of the grid must contain the digits 1-9 without repetition.
Here is the code I prepared for this. Please give me tips on how I can make it faster and reduce the runtime, and tell me whether using dictionaries is slowing my program down.
from typing import List

def isValidSudoku(self, boards: List[List[str]]) -> bool:
    r = {}
    a = {}
    for i in range(len(boards)):
        c = {}
        for j in range(len(boards[i])):
            if boards[i][j] != '.':
                x, y = r.get(boards[i][j] + f'{j}', 0), c.get(boards[i][j], 0)
                u, v = (i + 3) // 3, (j + 3) // 3
                z = a.get(boards[i][j] + f'{u}{v}', 0)
                if x == 0 and y == 0 and z == 0:
                    r[boards[i][j] + f'{j}'] = x + 1
                    c[boards[i][j]] = y + 1
                    a[boards[i][j] + f'{u}{v}'] = z + 1
                else:
                    return False
    return True
Simply optimizing assignment without rethinking your algorithm limits your overall efficiency by a lot. When you make a choice you generally take a long time before discovering a contradiction.
Instead of representing, "Here are the values that I have figured out", try to represent, "Here are the values that I have left to try in each spot." And now your fundamental operation is, "Eliminate this value from this spot." (Remember, getting it down to 1 propagates to eliminating the value from all of its peers, potentially recursively.)
Assignment is now "Eliminate all values but this one from this spot."
And now your fundamental search operation is, "Find the square with the least number of remaining possibilities > 1. Try each possibility in turn."
This may feel heavyweight. But the immediate propagation of constraints results in very quickly discovering constraints on the rest of the solution, which is far faster than having to do exponential amounts of reasoning before finding the logical contradiction in your partial solution so far.
I recommend doing this yourself. But https://norvig.com/sudoku.html has full working code that you can look at as needed.
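To make the idea concrete, here is a minimal sketch of the elimination step. The data structures are assumptions for illustration: candidates maps each cell to the set of digits still possible there, and peers maps each cell to the cells sharing its row, column, or box.

def eliminate(candidates, peers, cell, value):
    # remove value from cell's candidates, propagating solved cells to peers
    if value not in candidates[cell]:
        return True  # already eliminated, nothing to do
    candidates[cell].discard(value)
    if not candidates[cell]:
        return False  # contradiction: no possibilities left for this cell
    if len(candidates[cell]) == 1:
        # cell is now solved, so its value cannot appear in any peer
        solved = next(iter(candidates[cell]))
        return all(eliminate(candidates, peers, p, solved) for p in peers[cell])
    return True

def assign(candidates, peers, cell, value):
    # "eliminate all values but this one from this spot"
    others = candidates[cell] - {value}
    return all(eliminate(candidates, peers, cell, other) for other in others)

A real solver would copy the candidates before each trial assignment so a contradiction can be rolled back.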

Very large float in python

I'm trying to construct a neural network for the MNIST database. When computing the softmax function I receive an error along the lines of "you can't store a float that size".
The code is as follows:
import numpy as np

def softmax(vector):  # REQUIRES a one-dimensional numpy array
    adjustedVals = [0] * len(vector)
    totalExp = np.exp(vector)
    print("totalExp equals")
    print(totalExp)
    totalSum = totalExp.sum()
    for i in range(len(vector)):
        adjustedVals[i] = np.exp(vector[i]) / totalSum
    return adjustedVals  # this throws back an error sometimes?!?!
After looking into it, most people recommend using the decimal module. However, when I've experimented with the values in the interpreter using this module, that is:
from decimal import Decimal
import math
test = Decimal(math.exp(720))
I receive a similar error for any call math.exp(x) with x > 709:
OverflowError: (34, 'Numerical result out of range')
My conclusion is that even decimal cannot handle this number. Does anyone know of another method I could use to represent these very large floats?
There is a technique which makes the softmax function computationally feasible for this kind of value distribution in your vector. Namely, you can subtract the maximum value in the vector (call it x_max) from each of its elements. If you recall the softmax formula, this doesn't affect the outcome, since it amounts to multiplying both the numerator and the denominator by the same factor e^(-x_max). This way the highest intermediate value you get is e^(x_max - x_max) = 1, so you avoid the overflow.
For additional explanation I recommend the following article: https://nolanbconaway.github.io/blog/2017/softmax-numpy
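A minimal numpy sketch of that shifted softmax:

import numpy as np

def stable_softmax(vector):
    # subtracting the max makes the largest exponent exp(0) == 1
    shifted = np.asarray(vector) - np.max(vector)
    exps = np.exp(shifted)
    return exps / exps.sum()

This returns the same probabilities as the naive formula whenever the naive formula doesn't overflow.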
With an argument above 709, math.exp exceeds the double-precision floating point range (whose maximum is about 1.8 * 10^308) and throws this overflow error.
If, instead of math.exp, you use numpy.exp for such large exponents you will see that it evaluates to the special value inf (infinity).
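For example:

>>> import numpy as np
>>> np.exp(720)
inf

(numpy emits an overflow RuntimeWarning rather than raising an exception.)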
All this apart, I wonder why you would want to produce such a big number. I'm not sure you are aware how big it is: the number of atoms in the universe is estimated to be in the range of 10 to the power of 80, and e^720 (roughly 10^312) is MUCH larger than that.

Scipy.integrate gives odd results; are there best practices?

I am still struggling with scipy.integrate.quad.
Sparing all the details, I have an integral to evaluate. It is the integral of a product of functions in x, like so:
Z(k) = ∫ f(x) g(k/x) / abs(x) dx
I know for certain the range of integration is between two positive numbers. Oddly, when I pick a wide range that I know must contain all relevant values of x - like integrating from 1 to 10,000,000 - it integrates fast and gives an answer which looks right. But when I figure out the exact limits - which I know, since f(x) is zero over a lot of the real line - and use those, I get a different answer. The two aren't very different, though I know the second is more accurate.
After much fiddling I got it to work OK and was at least getting a 'smooth' answer for the computed function z. But then I needed to add in an exponentiation (which is needed), and now the function that gets generated (z) becomes more and more oscillatory and peculiar.
Any idea what is happening here? I know this code comes from an old Fortran library, so there must be some known issues, but I can't find references.
Here is the core code:
import numpy as np
from math import exp
from scipy.integrate import quad

def normal(x, mu, sigma):
    return (1.0 / ((2.0 * 3.14159 * sigma**2)**0.5)
            * exp(-(x - mu)**2 / (2 * sigma**2)))

def integrand(x, z, mu, sigma, f):
    return np.exp(normal(z/x, mu, sigma)) * getP(x, f._x, f._y) / abs(x)

for _z in range(int(z_min), int(z_max) + 1, 1000):
    z.append(_z)
    pResult = quad(integrand, lb, ub,
                   args=(float(_z), MU - SIGMA**2/2, SIGMA, X),
                   points=[100000.0],
                   epsabs=1, epsrel=.01)
    p.append(pResult[0])  # drop error estimate of tuple
By the way, getP() returns a linearly interpolated, piecewise continuous, but non-smooth function to give the integrator values that smoothly fit between the discrete 'buckets' of the histogram.
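getP itself isn't shown in the question; based on that description, a hypothetical stand-in would be numpy's piecewise-linear interpolation:

def getP(x, xs, ys):
    # linear interpolation through the histogram points (xs, ys);
    # xs must be increasing
    return np.interp(x, xs, ys)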
As with many numerical methods, it can be very sensitive to asymptotes, zeros, etc. The only choice is to keep giving it 'hints' if it will accept them.
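For quad, those hints are the points argument: a list of x values inside the interval where the integrand has kinks, peaks, or other trouble spots. A small sketch with a made-up integrand that has kinks at x = 2 and x = 5:

from scipy.integrate import quad

def kinked(x):
    # piecewise-linear, with kinks at 2.0 and 5.0
    return abs(x - 2.0) + abs(x - 5.0)

result, error = quad(kinked, 0.0, 10.0, points=[2.0, 5.0])

Note that points only works when both integration limits are finite.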

Python 3 - calculate total in if else function using for loop

If anybody can give me some hints to point me in the right direction so I can solve it myself that would be great.
I am trying to calculate the total and average income depending on the number of employees. Do I have to make another list, or iterate over the current list (list1), to solve it?
def get_input():
    Name = input("Enter a name: ")
    Hours = float(input("Enter hours worked: "))
    Rate = float(input("Enter hourly rate: "))
    return Name, Hours, Rate

def calc_pay(Hours, Rate):
    if Hours > 40:
        overtime = (40 * Rate) + (Hours - 40) * (Rate * 1.5)
        print(list1[0], "should be paid", overtime)
    else:
        no_overtime = (Hours * Rate)
        print(list1[0], "should be paid", no_overtime)
    return Hours, Rate

x = int(input("Enter the number of employees: "))
for i in range(x):
    list1 = list(get_input())
    calc_pay(list1[1], list1[2])
    i += 1
If you want to keep track of the total pay for all the employees, you probably need to make two major changes to your code.
The first is to change calc_pay to return the calculated pay amount instead of only printing it (the current return value is pretty useless, since the caller already has those values). You may want to skip printing in the function (since calculating the value and returning it is the function's main job) and let that get done by the caller, if necessary.
The second change is to add the pay values together in your top level code. You could either append the pay values to a list and add them up at the end (with sum), or you could just keep track of a running total and add each employee's pay to it after you compute it.
There are a few other minor things I'd probably change in your code if I was writing it, but they're not problems with its correctness, just style issues.
The first is variable names. Python has a style guide, PEP 8, that makes a bunch of suggestions about coding style. It's only an official rule for the Python code that's part of the standard library, but many other Python programmers use it loosely as a baseline style for all Python projects. It recommends using lowercase_names_with_underscores for most variable and function names, and reserving CapitalizedNames for classes. So I'd use name, hours and rate instead of the capitalized versions of those names.

I'd also strongly recommend that you use meaningful names instead of generic names like x. Some short names like i and x can be useful in some situations (like coordinates and indexes), but I'd avoid using them for any non-generic purpose. You also don't seem to be using your i variable for anything useful, so it might make sense to rename it _, which suggests that it's not going to be used. I'd use num_employees or something similar instead of x.

The name list1 is also bad, but I suggest doing away with that list entirely below. Variable names with numbers in them are often a bad idea. If you're using a lot of numbered names together (e.g. list1, list2, list3, etc.), you probably should be putting your values in a single list (a list of lists) instead of the numbered variables. If you just have a few, they should just have more specific names (e.g. employee_data instead of list1).
My second suggestion is about handling the return value from get_input. You can unpack the tuple of values returned by the function into separate variables, rather than putting them into a list. Just put the names separated by commas on the left side of the = operator:
name, hours, rate = get_input()
calc_pay(hours, rate)
My last minor suggestion is about avoiding repetition in your code. A well known programming suggestion is "Don't Repeat Yourself" (often abbreviated DRY), since repeated (especially copy/pasted) code is hard to modify later and sometimes harbors subtle bugs. Your calc_pay function has a repeated print line that could easily be moved outside of the if/else block so that it doesn't need to be repeated. Just have both branches of the conditional code write the computed pay to the same variable name (instead of different names) and then use that single variable in the print line (and a return line if you follow my suggested fix above for the main issue of your question).
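Putting the main fix and the DRY suggestion together, a sketch of how calc_pay might end up (one possibility, not the only correct one):

def calc_pay(hours, rate):
    # compute pay, with time-and-a-half for hours beyond 40
    if hours > 40:
        pay = 40 * rate + (hours - 40) * rate * 1.5
    else:
        pay = hours * rate
    return pay

The caller can then print the result and add it to a running total.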
Thanks for the help, people. Here was the answer:
payList = []
num_of_emps = int(input("Enter number of employees: "))
for i in range(num_of_emps):
    name, hours, rate = get_input()
    pay = calc_pay(hours, rate)
    payList.append(pay)
total = sum(payList)
avg = total / num_of_emps
print("The total amount to be paid is $", format(total, ",.2f"), sep="")
print("\nThe average employee is paid $", format(avg, ",.2f"), sep="")

(Incremental)PCA's Eigenvectors are not transposed but should be?

When we posted a homework assignment about PCA we told the course participants to pick any way of calculating the eigenvectors they found. They found multiple ways: eig, eigh (our favorite was svd). In a later task we told them to use the PCAs from scikit-learn - and were surprised that the results differed a lot more than we expected.
I toyed around a bit and we posted an explanation to the participants that either solution was correct and probably just suffered from numerical instabilities in the algorithms. However, recently I picked that file up again during a discussion with a co-worker and we quickly figured out that there's an interesting subtle change to make to get all results to be almost equivalent: Transpose the eigenvectors obtained from the SVD (and thus from the PCAs).
A bit of code to show this:
def pca_eig(data):
    """Uses numpy.linalg.eig to calculate the PCA."""
    data = data.T @ data
    val, vec = np.linalg.eig(data)
    return val, vec
versus
versus

def pca_svd(data):
    """Uses numpy.linalg.svd to calculate the PCA."""
    u, s, v = np.linalg.svd(data)
    return s ** 2, v
Does not yield the same result. Changing the return of pca_svd to s ** 2, v.T, however, works! It makes perfect sense following the definition on Wikipedia: the SVD of X follows X = U Σ W^T, where
the right singular vectors W of X are equivalent to the eigenvectors of X^T X
So to get the eigenvectors we need to transpose the output v of np.linalg.svd(...).
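A small numerical check of that statement, as a sketch with random data: np.linalg.eig returns eigenvectors as columns, while np.linalg.svd returns V^T, i.e. the same vectors as rows, hence the transpose:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))

val, vec = np.linalg.eig(X.T @ X)
order = np.argsort(val)[::-1]  # eig does not sort its eigenvalues
vec = vec[:, order]

_, _, vt = np.linalg.svd(X)

for k in range(4):
    col, row = vec[:, k], vt[k]
    assert np.allclose(col, row, atol=1e-6) or np.allclose(col, -row, atol=1e-6)

(Each eigenvector is only determined up to sign, so the check allows either.)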
Unless there is something else going on? Anyway, the PCA and IncrementalPCA both show wrong results (or eig is wrong? I mean, transposing that yields the same equality), and looking at the code for PCA reveals that they are doing it as I did it initially:
U, S, V = linalg.svd(X, full_matrices=False)
# flip eigenvectors' sign to enforce deterministic output
U, V = svd_flip(U, V)
components_ = V
I created a little gist demonstrating the differences (nbviewer), the first with PCA and IncPCA as they are (also no transposition of the SVD), the second with transposed eigenvectors:
Comparison without transposition of SVD/PCAs (normalized data)
Comparison with transposition of SVD/PCAs (normalized data)
As one can clearly see, in the upper image the results are not really great, while the lower image only differs in some signs, thus mirroring the results here and there.
Is this really wrong and a bug in scikit-learn? More likely I am using the math wrong – but what is right? Can you please help me?
If you look at the documentation, components_ has shape (n_components, n_features), so it's pretty clear that the eigenvectors are in the rows, not the columns.
The point of the sklearn PCA is that you can use the transform method to do the correct transformation.
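A quick way to convince yourself, as a sketch with random data: the rows of components_ match the rows of numpy's V^T up to sign:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X = X - X.mean(axis=0)  # PCA centers internally; center here so the raw SVD matches

pca = PCA(n_components=3).fit(X)
_, _, vt = np.linalg.svd(X, full_matrices=False)

for row_pca, row_svd in zip(pca.components_, vt):
    assert np.allclose(row_pca, row_svd, atol=1e-6) or np.allclose(row_pca, -row_svd, atol=1e-6)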
