Clustering of a list of strings of letters according to 2 (and ideally generalized to n) arbitrary grouping rules? - python-3.x

I want to sort a set of strings (of letters) of variable length into n groups, based on whether they contain any/all/none of the letters of n given sets.
For instance, here I am trying to sort all combinations of the letters 'A,B,P,Q,X' into 2 groups, with the following rules: group1 must include all/any of 'A,P' (but none of 'B,Q'), and group2 must include all/any of 'B,Q' (but none of 'A,P'). My final goal is to build a list with the groups as segregated as possible (e.g. at the beginning and the end), with strings containing no members of either group in the middle, followed by strings containing members of both groups, and hybrids between the middle and the extremes. Ideally the order would be: all-1/none-2, some-1/none-2, all-1/some-2, none-1-2/some-1-2, all-2/some-1, some-2/none-1, all-2/none-1.
from itertools import chain

labels_powerset = ['A','B','P','Q','X',
                   'AB','AP','AQ','AX','BP','BQ','BX','PQ','PX','QX',
                   'ABP','ABQ','ABX','APQ','APX','AQX','BPQ','BPX','BQX','PQX',
                   'ABPQ','ABPX','ABQX','APQX','BPQX','ABPQX']

# Bucket the labels by length.
labels_for_order = []
for length in range(1, len(max(labels_powerset, key=len)) + 1):
    order = [label for label in labels_powerset if len(label) == length]
    labels_for_order.append(order)

group1 = ['A', 'P']
group2 = ['B', 'Q']

# For each length bucket, keep only the labels matching each membership rule (dropping empty buckets).
all1 = [bucket for bucket in [[label for label in order if all(x in label for x in group1) and not any(y in label for y in group2)]
                              for order in labels_for_order] if bucket]
any1 = [bucket for bucket in [[label for label in order if any(x in label for x in group1) and not all(x in label for x in group1) and not any(y in label for y in group2)]
                              for order in labels_for_order] if bucket]
all2 = [bucket for bucket in [[label for label in order if all(x in label for x in group2) and not any(y in label for y in group1)]
                              for order in labels_for_order] if bucket]
any2 = [bucket for bucket in [[label for label in order if any(x in label for x in group2) and not all(x in label for x in group2) and not any(y in label for y in group1)]
                              for order in labels_for_order] if bucket]
none = [bucket for bucket in [[label for label in order if not any(x in label for x in group1) and not any(y in label for y in group2)]
                              for order in labels_for_order] if bucket]
both = [bucket for bucket in [[label for label in order if any(x in label for x in group1) and any(y in label for y in group2)]
                              for order in labels_for_order] if bucket]

both1 = both[:len(both) // 2]
both2 = both[len(both) // 2:]

# flatten was undefined in the original snippet; chain the nested lists instead.
sorted_labels = list(chain.from_iterable(any1 + all1 + both1 + none + both2 + all2 + any2))
The objective is to have a list that is as symmetric as possible in terms of membership and string length.
I am pretty new to coding and have read a bit about k-means, but I can't figure out how to apply it to strings of letters.
How do I do this more efficiently, and in a way that generalizes to n groups/rules?

K-means is for multivariate continuous data, and clustering does not attempt to make balanced groups.
What you should consider instead is sorting.
Define a score function. For example, give +1 for each "good" letter, -1 for each "bad" letter, and a bonus of ±100 if the word is pure.
Then sort the words by this score.
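As a rough sketch of that idea (the exact weights, the helper name and the ordering direction are my own choices, with labels_powerset, group1 and group2 taken from the question):

group1, group2 = ['A', 'P'], ['B', 'Q']

def score(label):
    # +1 per group1 letter, -1 per group2 letter, plus a ±100 bonus for "pure" words
    g1 = sum(c in label for c in group1)
    g2 = sum(c in label for c in group2)
    s = g1 - g2
    if g1 and not g2:
        s += 100
    elif g2 and not g1:
        s -= 100
    return s

sorted_labels = sorted(labels_powerset, key=score, reverse=True)

Pure group1 words then sort to the front, pure group2 words to the back, and mixed or neutral words end up in the middle.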

Related

Minimum cost to reach point

Given are positive integers n, a, b, p[1], p[2], p[3] (all <= 10^9). Suppose we are standing at position x = 0 and we want to reach point n. We can only move to the right. In one move, we can move by 1 at cost p[1], by a at cost p[2], or by b at cost p[3]. The task is to find the minimum cost to reach n using these moves. There is no restriction on the number of moves.
The ideas I came up with are dynamic programming and linear equations in 3 variables, but I still can't solve the problem.
UPD: Suppose also that we are given T (T <= 50) independent queries.
Think of the last step: it would be to move right by 1 unit, a units, or b units. If the last step is to move right 1 unit, then the total cost is p[1]+mincost(n-1). Similarly for the other two possibilities.
Thus, we get the recursive equation: mincost(n) = min{p[1]+mincost(n-1), p[2]+mincost(n-a), p[3]+mincost(n-b)}. This can be implemented using top-down recursion with memoization or bottom-up dynamic programming.
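As an illustration, here is a minimal bottom-up sketch of that recurrence (the function name is mine, and p is 0-indexed here, so p[0] is the unit-step cost; note that the O(n) table is only practical for moderate n, not the full 10^9 bound of the problem):

def min_cost(n, a, b, p):
    # dp[k] = minimum cost to reach position k starting from 0
    INF = float('inf')
    dp = [0] + [INF] * n
    for k in range(1, n + 1):
        best = p[0] + dp[k - 1]                  # last step of length 1
        if k >= a:
            best = min(best, p[1] + dp[k - a])   # last step of length a
        if k >= b:
            best = min(best, p[2] + dp[k - b])   # last step of length b
        dp[k] = best
    return dp[n]

print(min_cost(100, 11, 17, [2, 20, 20]))  # 128, matching the PuLP example further below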
You mentioned that you had two ideas:
using a recurrence relation to write a dynamic programming algorithm;
solving a linear equation.
Both ideas are great. Since another answer already focuses on dynamic programming, I will focus on the linear equation.
Here is a reformulation of your problem:
MINIMIZE:
x * p[1] + y * p[2] + z * p[3]
UNDER CONSTRAINTS:
x + y * a + z * b = n
x, y, z ≥ 0
x, y, z are integers
This is an integer linear program. Formulating a problem as a linear program or as an integer linear program is a very useful skill. Congratulations!
If this is an interview question, the interviewer will probably be very happy with you simply formulating the problem as a linear program. Thus the "code" I wrote above would probably satisfy them.
There are many existing solvers for linear programs.
For instance, here is Python code using the PuLP library:
from pulp import LpProblem, LpVariable, LpMinimize

def min_cost_to_reach_point(n, a, b, p):
    x = LpVariable('x', lowBound=0, cat='Integer')
    y = LpVariable('y', lowBound=0, cat='Integer')
    z = LpVariable('z', lowBound=0, cat='Integer')
    P = LpProblem('min_cost_to_reach_point', LpMinimize)
    P += x * p[0] + y * p[1] + z * p[2]   # objective: total cost
    P += x + y * a + z * b == n           # constraint: land exactly on n
    P.solve()
    return {'cost': P.objective.value(),
            'steps_1': x.value(), 'steps_a': y.value(), 'steps_b': z.value()}

n = 100
a, b = 11, 17
p = (2, 20, 20)
print(min_cost_to_reach_point(n, a, b, p))
# {'cost': 128, 'steps_1': 4, 'steps_a': 1, 'steps_b': 5}

Why torch.autograd.grad() returns None with torch.cat?

I know that torch.autograd.grad() returns None if the gradient is stopped somehow; however, I am wondering what is wrong with the following snippet?
x = torch.rand(6, requires_grad=True)
y = x.pow(2).sum()
z = torch.cat([x])
grad1 = torch.autograd.grad(y, x, allow_unused=True)
grad2 = torch.autograd.grad(y, z, allow_unused=True)
print(f'grad1 = {grad1}, grad = {grad2}')
The output is grad1 = (tensor([0.3705, 0.7468, 0.6102, 1.8640, 0.3518, 0.5397]),), grad = (None,).
I am expecting grad2 to be the same as grad1, because z is essentially x. May I know why?
Update: After reading the post and the help from #Ivan, I conclude that the reason is that x is a leaf node of y's computation graph while z is not part of it: x is a leaf of both y and z, but there is no path from z to y, so torch.autograd.grad returns None.
Note: the returned value None does not necessarily mean the gradient values are 0.
Tensor z was not used to compute the value of y, so it is not connected to y's computation graph, and therefore you won't get a gradient on z.
On the other hand, the following will work:
>>> y = x.pow(2).sum()
>>> torch.autograd.grad(y, x, allow_unused=True)
(tensor([0.3134, 1.6802, 0.1989, 0.8495, 1.9203, 1.0905]),)
>>> z = torch.cat([x])
>>> y = z.pow(2).sum()
>>> torch.autograd.grad(y, z, allow_unused=True)
(tensor([0.3134, 1.6802, 0.1989, 0.8495, 1.9203, 1.0905]),)

Why is NMSBoxes not eliminating multiple bounding boxes?

First of all, here is my code:
image = cv2.imread(filePath)
height, width, channels = image.shape
# Using the blob function of OpenCV to preprocess the image
blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416),
                             swapRB=True, crop=False)
# Detecting objects
net.setInput(blob)
outs = net.forward(output_layers)
# Showing information on the screen
class_ids = []
confidences = []
boxes = []
for out in outs:
    for detection in out:
        scores = detection[5:]
        class_id = np.argmax(scores)
        confidence = scores[class_id]
        if confidence > 0.7:
            # Object detected
            center_x = int(detection[0] * width)
            center_y = int(detection[1] * height)
            w = int(detection[2] * width)
            h = int(detection[3] * height)
            # Rectangle coordinates
            x = int(center_x - w / 2)
            y = int(center_y - h / 2)
            boxes.append([x, y, w, h])
            confidences.append(float(confidence))
            class_ids.append(class_id)
indexes = cv2.dnn.NMSBoxes(boxes, confidences, score_threshold=0.4, nms_threshold=0.8, top_k=1)
font = cv2.FONT_HERSHEY_PLAIN
colors = np.random.uniform(0, 255, size=(len(classes), 3))
labels = ['bicycle', 'car', 'motorbike', 'bus', 'truck']
for i in range(len(boxes)):
    if i in indexes:
        label = str(classes[class_ids[i]])
        if label in labels:
            x, y, w, h = boxes[i]
            color = colors[class_ids[i]]
            cv2.rectangle(image, (x, y), (x + w, y + h), color, 2)
            cv2.putText(image, label, (x, y + 30), font, 2, color, 3)
cv2.imshow(fileName, image)
My question is: isn't cv2.dnn.NMSBoxes supposed to eliminate multiple bounding boxes? Then why do I still get output like the sample below?
What I expected is something like this:
Did I do something wrong in my code? Is there a better alternative? Thank you very much for your help.
The process of NMS goes like this:
Input - a list of proposal boxes B, the corresponding confidence scores S, and an overlap threshold N
Output - a list of filtered proposals D
Algorithm/steps:
Select the proposal with the highest confidence score, remove it from B and add it to the final proposal list D. (Initially D is empty.)
Compare this proposal with all the remaining proposals: calculate the IOU (Intersection over Union) of this proposal with every other proposal. If the IOU is greater than the threshold N, remove that proposal from B.
Again take the proposal with the highest confidence among the remaining proposals in B, remove it from B, and add it to D.
Once again calculate the IOU of this proposal with all the proposals in B and eliminate the boxes whose IOU is higher than the threshold.
This process is repeated until there are no more proposals left in B.
The threshold being referred to here is nothing but the nms_threshold.
In the cv2.dnn.NMSBoxes function, nms_threshold is the IOU threshold used in non-maximum suppression.
So if you use a large value, you are requiring two boxes to have a very high overlap (which is usually not the case), and a box will be removed only if its IOU with another box is more than 0.8. Since there is usually not that much overlap, the boxes won't be removed. Reducing this value makes it easier to remove redundant detections.
Hope this makes sense.
You can read more about non-maximum suppression here.
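For reference, here is a minimal sketch of the procedure described above in plain numpy, using the [x, y, w, h] box format from the question (this illustrates the algorithm; it is not the exact cv2.dnn.NMSBoxes implementation):

import numpy as np

def nms(boxes, scores, iou_threshold):
    # boxes: list of [x, y, w, h]; scores: matching confidence values
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    x1, y1 = boxes[:, 0], boxes[:, 1]
    x2, y2 = x1 + boxes[:, 2], y1 + boxes[:, 3]
    areas = boxes[:, 2] * boxes[:, 3]
    order = scores.argsort()[::-1]           # proposals sorted by confidence, best first
    keep = []
    while order.size > 0:
        i = order[0]                         # highest-scoring remaining proposal
        keep.append(int(i))
        # IOU of proposal i with every other remaining proposal
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # drop every proposal whose overlap with i exceeds the threshold
        order = order[1:][iou <= iou_threshold]
    return keep

Called with a lower iou_threshold (say 0.4 instead of 0.8), such a routine discards overlapping detections of the same object much more aggressively, which is the behaviour the question is after.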

How can I calculate the gradient of a vector field from its values?

I'd like some help with numpy and arrays. I want to calculate the gradient of a vector field.
Suppose I have a function foo that takes a tuple of coordinates (x,y,z) and returns a vector (u,v,w).
Then if I have an array of coordinates POS = [[x1,y1,z1],[x2,y2,z2],[x3,y3,z3], etc.], I can generate an array of vectors with origin in POS and direction in DIR = [[u1,v1,w1],[u2,v2,w2],[u3,v3,w3], etc.].
Now how could I calculate the gradient of this vector field at every point of POS? What I need in the end is another array GRAD = [grad1, grad2, grad3, etc.] where every grad is a 3x3 array of the partial derivatives of the vector field at the corresponding point of POS.
PS: I know I could differentiate the function foo by hand and then implement the derivatives in Python, but in my case the function foo is really complex and I want to do the derivatives this way :)
EDIT1: for now I'm coming up with POS this way:
from numpy import array, linspace

parts = 100
limit = 10
xs = linspace(-limit, limit, parts)
ys = linspace(-limit, limit, parts)
zs = linspace(-limit, limit, parts)
POS = array([(x, y, z) for z in zs for y in ys for x in xs])
DIR = array([foo(pos) for pos in POS])
which allows me to do this if necessary:
POS = POS.reshape(parts, parts, parts, 3)
DIR = DIR.reshape(parts, parts, parts, 3)
You can use numpy.gradient for this in the following way:
import numpy as np

N = 100
limit = .1

def vec(x, y, z):  # Example vector field
    return np.array([x, x, z])

x = np.arange(-limit, limit, 2*limit/N)  # np.arange takes the spacing as 3rd arg
y = np.arange(-limit, limit, 2*limit/N)
z = np.arange(-limit, limit, 2*limit/N)
# Create 3D grid from 1D arrays; indexing is important!
X, Y, Z = np.meshgrid(x, y, z, indexing='ij')
V = vec(X, Y, Z)                              # Get vector field, shape: (3,N,N,N)
D = np.gradient(V, x, y, z, axis=(1, 2, 3))   # Get gradient, this is a list!
D = np.array(D).transpose((1, 0, 2, 3, 4))    # Make it an array and swap the first two axes
The resulting array has shape (3,3,N,N,N) where the first axis indexes the component of the vector field and the second axis indexes the coordinate direction w.r.t. which the derivative was computed.
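Assuming that layout, the 3x3 matrix of partial derivatives at a single grid point can be read off directly (the indices below are just example values):

i, j, k = 10, 20, 30    # an arbitrary grid point (example indices)
J = D[:, :, i, j, k]    # J[m, n] = derivative of vector component m w.r.t. coordinate n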

How to modify x axis distance in gnuplot?

I wonder, can the spacing of the x axis in gnuplot be modified?
For example, I have one datum at x = 10 and another at x = 1000.
In this case, the datum at x = 10 will look very small compared to the one at x = 1000, and people will not be able to see it clearly.
Therefore, I want to divide the x axis like this:
one third of the axis for x around 10,
one third of the axis for x around 100,
one third of the axis for x around 1000.
Or, say the x axis is 12 cm long: I want to give 4 cm to x = 10, 4 cm to x = 100, and 4 cm to x = 1000...
I hope my question is clear enough. Thanks!
