AgglomerativeClustering scikit learn connectivity - scikit-learn

I have a matrix x=
[[0,1,1,1,0,0,0,0],
[1,0,1,1,0,0,0,0],
[1,1,0,1,0,0,0,0],
[1,1,1,0,0,0,0,0],
[0,0,0,0,0,1,1,1],
[0,0,0,0,1,0,1,1],
[0,0,0,0,1,1,0,1],
[0,0,0,0,1,1,1,0],]
After calling AgglomerativeClustering I was expecting the data to be divided into 2 clusters, (0-3) and (4-7), i.e. labels_=[0,0,0,0,1,1,1,1], but instead the labels_ list is [0, 0, 0, 1, 0, 0, 0, 1].
My code is as follows:
s=AgglomerativeClustering(affinity='precomputed',n_clusters=2,linkage='complete')
s.fit(x)
Does the code contain any error? Why is the clustering not as expected?

It appears to me, after playing around with a few examples, that AgglomerativeClustering interprets the 'affinity' matrix as a distance matrix, although I can't find this specified anywhere. This means that your 0's and 1's should be switched.
Also it only seems to consider the upper triangular portion of the matrix (everything else being redundant).
I believe that defining x as:
x=
[[0,0,0,0,1,1,1,1],
[ 0,0,0,0,1,1,1,1],
[ 0,0,0,0,1,1,1,1],
[ 0,0,0,0,1,1,1,1],
[ 0,0,0,0,0,0,0,0],
[ 0,0,0,0,0,0,0,0],
[ 0,0,0,0,0,0,0,0],
[ 0,0,0,0,0,0,0,0],]
will give you the expected results.

The error is in how you are specifying the connectivity matrix. From your description, I assume that your matrix indicates linkage between points, where [0/1] indicate [no link/link]. However, the algorithm treats this as a matrix of pairwise distances, which is why you get unexpected results.
You could convert your affinity matrix to a sort of distance matrix with a simple transform; e.g.
>>> import numpy as np
>>> x = np.array(x)
>>> s.fit(np.exp(-x))
>>> s.labels_
array([1, 1, 1, 1, 0, 0, 0, 0])
Better would be to use an actual distance metric on the data used to generate this affinity matrix.
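As a minimal sketch of the conversion idea above (not the original poster's code): if the 0/1 entries mean no link/link, one simple assumption is to take distance = 1 - link and set the diagonal to 0. The example uses SciPy's hierarchical clustering instead of AgglomerativeClustering so that it does not depend on the scikit-learn version; the names link and dist are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# 0/1 link matrix from the question: 1 = linked, 0 = not linked
link = np.array([
    [0, 1, 1, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 1, 0, 0, 0, 0],
    [1, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 1, 0, 1],
    [0, 0, 0, 0, 1, 1, 1, 0],
])

# Assumption: "linked" means distance 0 and "not linked" means distance 1.
dist = 1.0 - link
np.fill_diagonal(dist, 0.0)  # a point is at distance 0 from itself

# linkage() expects the condensed (upper-triangular) form of the matrix
Z = linkage(squareform(dist), method="complete")

# Cut the tree into at most two flat clusters
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

With this distance matrix, points 0-3 end up in one cluster and points 4-7 in the other. The numeric label values themselves are arbitrary.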

Related

How to use "cv2.estimateAffine3D" correctly, to align two 3d coordinate systems?

I have to align two 3D coordinate systems, and I have the real-world coordinates of 32 points in each coordinate system.
I already know that I have to calculate the translation and the rotation, and I think I also understand what the calculation is. Because some smart people have already solved this problem, I just want to use OpenCV for it, so I asked an AI chatbot how to do this calculation without and with OpenCV.
I think "cv2.estimateAffine3D" looks right.
Maybe someone can give me an example of how to use it correctly?
At the moment, the example code I have from the AI chatbot looks like this:
import cv2
import numpy as np
# Define the coordinates of the three points in the first coordinate system
p1 = np.array([[1, 2, 3]], dtype=np.float32)
p2 = np.array([[4, 5, 6]], dtype=np.float32)
p3 = np.array([[7, 8, 9]], dtype=np.float32)
# Define the coordinates of the three points in the second coordinate system
q1 = np.array([[1, 2, 3]], dtype=np.float32)
q2 = np.array([[5, 6, 7]], dtype=np.float32)
q3 = np.array([[9, 10, 11]], dtype=np.float32)
# Calculate the affine transformation that aligns the two coordinate systems
T = cv2.estimateAffine3D(p1, p2, p3)
# Reshape the point arrays to 2D or 3D
q1_reshaped = q1.reshape(1, -1, 3)
q2_reshaped = q2.reshape(1, -1, 3)
q3_reshaped = q3.reshape(1, -1, 3)
# Apply the transformation to the second coordinate system
q1_aligned = cv2.transform(q1_reshaped, T)
q2_aligned = cv2.transform(q2_reshaped, T)
q3_aligned = cv2.transform(q3_reshaped, T)
# Print the transformed points
print(q1_aligned)
print(q2_aligned)
print(q3_aligned)
The error I get is:
error: OpenCV(4.6.0) :-1: error: (-5:Bad argument) in function 'transform'
Overload resolution failed:
m is not a numerical tuple
Expected Ptr<cv::UMat> for argument 'm'
The answer from the AI chatbot to solve this is:
"It looks like the error you are experiencing is caused by passing an incorrect argument to the cv2.transform function. The cv2.transform function expects a 2D or 3D point array as the first argument and a 3x3 or 3x4 transformation matrix as the second argument, but you are passing a 1D point array and a 3x3 or 3x4 transformation matrix as arguments.
To fix this error, you need to pass a 2D or 3D point array as the first argument to the cv2.transform function. You can do this by reshaping the 1D point arrays q1, q2, and q3 using the np.reshape function, as follows:"
But that puts us in a loop, because this "solution" is already in the code.
I cannot find any further information about this. Can anybody help me?
Cheers!
I want to align two coordinate systems.
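As far as I can tell, cv2.estimateAffine3D takes two N x 3 point sets (source and destination) and returns a tuple (retval, 3x4 matrix, inliers) rather than a bare matrix. Passing the whole return value T to cv2.transform would explain the "m is not a numerical tuple" error. The sketch below does not use OpenCV. It is a NumPy-only illustration of the rotation-plus-translation calculation mentioned in the question, using the SVD-based Kabsch method. The function name estimate_rigid_transform and the synthetic 32 points are assumptions made for the example.

```python
import numpy as np

def estimate_rigid_transform(src, dst):
    """Least-squares rotation R (3x3) and translation t (3,) so that
    dst is approximately src @ R.T + t. Both inputs are N x 3 arrays
    of corresponding points."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    src_mean, dst_mean = src.mean(axis=0), dst.mean(axis=0)
    # Cross-covariance of the centred point sets
    H = (src - src_mean).T @ (dst - dst_mean)
    U, _, Vt = np.linalg.svd(H)
    # Guard against a reflection (determinant -1)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst_mean - R @ src_mean
    return R, t

# Synthetic check: 32 points rotated 30 degrees about z and shifted
rng = np.random.default_rng(0)
src = rng.normal(size=(32, 3))
angle = np.deg2rad(30.0)
R_true = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                   [np.sin(angle),  np.cos(angle), 0.0],
                   [0.0,            0.0,           1.0]])
t_true = np.array([1.0, -2.0, 0.5])
dst = src @ R_true.T + t_true

R, t = estimate_rigid_transform(src, dst)
aligned = src @ R.T + t
print(np.allclose(aligned, dst))
```

This estimates a pure rotation and translation. A general affine fit, which is what estimateAffine3D computes, also allows scale and shear, so the two are not interchangeable when the measured points are noisy.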

Is there a way to call Macro-Precision in Hugging Face Trainer?

I'm currently making tests on the DEFT-2015 dataset using Hugging Face models. I would like to compare my results to what has been done.
I checked the output of the list_metrics method from the datasets library, but I did not see macro precision, which was the metric used at the time by the researchers.
Do you have any pointers on how I could tackle the problem?
The Hugging Face library (version 4.20.0) seems to call the scikit-learn library behind the scenes.
If you just run the following without scikit-learn installed:
from datasets import load_metric
metric = load_metric("precision")
precision = metric.compute(predictions=[...],references=[...])
it will throw an error saying that scikit-learn is not installed.
Why this intro?
Well, in fact, you can easily use the datasets metrics to calculate your metric however you want (exactly as scikit-learn does).
You just need to add the 'average' parameter:
from datasets import load_metric
metric = load_metric("precision")
precision = metric.compute(predictions = [0, 0, 0, 0, 1, 1, 2, 2],
                           references = [0, 0, 0, 0, 1, 1, 1, 2],
                           average='macro')
# This prints {'precision': 0.8333333333333334}
print(precision)
The snippet above will print {'precision': 0.8333333333333334}, because the per-class precisions for classes 0, 1 and 2 are 1, 1 and 0.5, and (1 + 1 + 0.5) / 3 = 0.83, which is exactly the definition of macro precision you are searching for.
Conclusion: use the average parameter to set the way you want to calculate your metric (micro/macro/weighted).
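To make the arithmetic above concrete, here is a small NumPy-only sketch that computes macro precision by hand and wraps it in a callback of the shape the Trainer accepts through its compute_metrics argument. The names macro_precision and compute_metrics, and the choice to score a never-predicted class as 0, are assumptions made for the example, not part of the datasets library.

```python
import numpy as np

def macro_precision(predictions, references):
    """Unweighted mean of the per-class precision TP / (TP + FP)."""
    predictions = np.asarray(predictions)
    references = np.asarray(references)
    per_class = []
    for label in np.unique(np.concatenate([predictions, references])):
        predicted = predictions == label
        if predicted.sum() == 0:
            # Assumption: a class that is never predicted scores 0
            per_class.append(0.0)
        else:
            per_class.append(float((references[predicted] == label).mean()))
    return float(np.mean(per_class))

def compute_metrics(eval_pred):
    """Hypothetical callback to pass as Trainer(compute_metrics=...)."""
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {"macro_precision": macro_precision(predictions, labels)}

score = macro_precision([0, 0, 0, 0, 1, 1, 2, 2],
                        [0, 0, 0, 0, 1, 1, 1, 2])
print(round(score, 4))  # 0.8333
```

In practice the library metric with average='macro' is the simpler choice. The hand-written version is only there to show what the number means.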

Get 3D point on directional light ray in Blender Python given Euler angles

I am trying to get a 3D point on the ray of a Sun light in Blender, so that I can use it to specify the directional light's target position in Three.js. I have read "How to convert Euler angles to directional vector?", but I could not work it out. Please let me know how to get it.
I think it is a good question. In blender, since the unit vector starts from the z-axis (the light points down when initialized), I think you could use the last column of the total rotation matrix. The function for calculating the total rotation matrix is given here. Here is a modification of the function that will return a point at unit distance in the direction of the light source:
import numpy as np

def getCosinesFromEuler(roll,pitch,yaw):
    # rotation about the z-axis
    Rz_yaw = np.array([
        [np.cos(yaw), -np.sin(yaw), 0],
        [np.sin(yaw),  np.cos(yaw), 0],
        [          0,            0, 1]])
    # rotation about the y-axis
    Ry_pitch = np.array([
        [ np.cos(pitch), 0, np.sin(pitch)],
        [             0, 1,             0],
        [-np.sin(pitch), 0, np.cos(pitch)]])
    # rotation about the x-axis
    Rx_roll = np.array([
        [1,            0,             0],
        [0, np.cos(roll), -np.sin(roll)],
        [0, np.sin(roll),  np.cos(roll)]])
    # total rotation; applying it to the z unit vector gives its last column
    rotMat = Rz_yaw @ Ry_pitch @ Rx_roll
    return rotMat @ np.array([0,0,1])
And it can be called like this :
# assuming ob is the light object
roll = ob.rotation_euler.x
pitch = ob.rotation_euler.y
yaw = ob.rotation_euler.z
x,y,z = getCosinesFromEuler(roll,pitch,yaw)
And this point (x,y,z) needs to be subtracted from the position of the light object to get a point at unit distance on the ray.
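The final subtraction step can be sketched as follows. This is a self-contained illustration under the assumption that the light uses Blender's default 'XYZ' Euler rotation mode and points along -Z when unrotated. The helper names euler_xyz_to_matrix and light_target are made up for the example.

```python
import numpy as np

def euler_xyz_to_matrix(roll, pitch, yaw):
    """Rotation matrix Rz(yaw) @ Ry(pitch) @ Rx(roll), angles in radians."""
    cx, sx = np.cos(roll), np.sin(roll)
    cy, sy = np.cos(pitch), np.sin(pitch)
    cz, sz = np.cos(yaw), np.sin(yaw)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def light_target(position, roll, pitch, yaw, distance=1.0):
    """Point `distance` along the ray of a light located at `position`."""
    # Rotated z-axis of the light; the ray runs in the opposite direction
    z_axis = euler_xyz_to_matrix(roll, pitch, yaw) @ np.array([0.0, 0.0, 1.0])
    return np.asarray(position, dtype=float) - distance * z_axis

# An unrotated light at height 5 shines straight down
print(light_target([0, 0, 5], 0.0, 0.0, 0.0))  # [0. 0. 4.]
```

Inside Blender the three angles would come from ob.rotation_euler and the position from ob.location. Note that Three.js uses a Y-up convention by default, so the resulting point may still need an axis swap before it is used as a target there.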

Perform matrix multiplication with cosine similarity function

I have two lists:
list_1 = [['flavor', 'flavors', 'fruity_flavor', 'taste'],
['scent', 'scents', 'aroma', 'smell', 'odor'],
['mental_illness', 'mental_disorders','bipolar_disorder'],
['romance', 'romances', 'romantic', 'budding_romance']]
list_2 = [['love', 'eating', 'spicy', 'hand', 'pulled', 'noodles'],
['also', 'like', 'buy', 'perfumes'],
['suffer', 'from', 'clinical', 'depression'],
['really', 'love', 'my', 'wife']]
I would like to compute the cosine similarity between the two lists above such that the first sub-list in list_1 is compared against every sub-list in list_2, then the second sub-list in list_1 against every sub-list in list_2, and so on.
The goal is to create a len(list_2) by len(list_1) matrix, and each entry in that matrix is a cosine similarity score. Currently I've done this the following way:
import gensim
import numpy as np
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin.gz', binary=True)
similarity_mat = np.zeros([len(list_2), len(list_1)])
for i, L2 in enumerate(list_2):
    for j, L1 in enumerate(list_1):
        similarity_mat[i, j] = model.n_similarity(L2, L1)
However, I'd like to implement this with matrix multiplication and no for loops.
My two questions are:
Is there a way to do some sort of element-wise matrix multiplication but with gensim's n_similarity() method to generate the required matrix?
Would it be more efficient and faster using the current method or matrix multiplication?
I hope my question was clear enough, please let me know if I can clarify even further.
Here's an approach, but it's not clear from the question whether you understand the underlying mechanics of the calculation, which might be causing the block.
I've changed the input strings to give more exact word matches, and given the two lists different lengths to make it a bit clearer:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
list_1 = [['flavor', 'flavors', 'fruity_flavor', 'taste'],
['scent', 'my', 'aroma', 'smell', 'odor'],
['mental_illness', 'mental_disorders','bipolar_disorder'],
['romance', 'romances', 'romantic', 'budding_romance']]
list_2 = [['love', 'eating', 'spicy', 'hand', 'pulled', 'noodles'],
['also', 'like', 'buy', 'perfumes'],
['suffer', 'from', 'clinical', 'depression'],
['really', 'love', 'my', 'wife'],
['flavor', 'taste', 'romantic', 'aroma', 'what']]
cnt = CountVectorizer()
# Combine each sublist into single str, and join everything into corpus
combined_lists = ([' '.join(item) for item in list_1] +
[' '.join(item) for item in list_2])
count_matrix = cnt.fit_transform(combined_lists).toarray()
# Split them again into list_1 and list_2 word counts
count_matrix_1 = count_matrix[:len(list_1),]
count_matrix_2 = count_matrix[len(list_1):,]
match_matrix = np.matmul(count_matrix_1, count_matrix_2.T)
Output of match_matrix:
array([[0, 0, 0, 0, 2],
[0, 0, 0, 1, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1]], dtype=int64)
You can see that the 1st string in list_1 has 2 matches with the 5th string in list_2, and so on.
So the first part of the calculation (the dot product) has been calculated. Now we need the magnitudes:
magnitudes = np.array([np.linalg.norm(count_matrix[i,:])
for i in range(len(count_matrix))])
Now we can use matrix multiplication to turn that into a matrix of divisors (we need to reshape magnitudes into n x 1 and 1 x n matrices for this to produce an n x n matrix):
divisor_matrix = np.matmul(magnitudes.reshape(len(magnitudes),1),
magnitudes.reshape(1,len(magnitudes)))
Now since we didn't compare every single sublist, but only the list_1 with the list_2 sublists, we need to take a subsection of this divisor matrix to get the right magnitudes:
divisor_matrix = divisor_matrix[:len(list_1), len(list_1):]
Output:
array([[4.89897949, 4. , 4. , 4. , 4.47213595],
[5.47722558, 4.47213595, 4.47213595, 4.47213595, 5. ],
[4.24264069, 3.46410162, 3.46410162, 3.46410162, 3.87298335],
[4.89897949, 4. , 4. , 4. , 4.47213595]])
Now we can calculate the final matrix of cosine similarity scores:
cos_sim = match_matrix / divisor_matrix
Output:
array([[0. , 0. , 0. , 0. , 0.4472136],
[0. , 0. , 0. , 0.2236068, 0.2 ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.2236068]])
Note that these scores differ from what the example in the question would give: its sub-lists share no words, so every cosine similarity score computed this way would be 0.
There are two problems in the code, in the second-to-last and last lines.
import gensim
import numpy as np
from gensim.models import KeyedVectors
model = KeyedVectors.load_word2vec_format('/root/input/GoogleNews-vectors-negative300.bin.gz', binary=True)
similarity_mat = np.zeros([len(list_2), len(list_1)])
for i, L2 in enumerate(list_2):
    for j, L1 in enumerate(list_1):
        similarity_mat[i, j] = model.n_similarity(L2, L1)
Answers to your questions:
1. You are already using a function that directly calculates the similarity between two sentences (L1 and L2): each is first converted to a vector, and then the cosine similarity of those two vectors is calculated. Everything is already done inside n_similarity(), so you can't apply any kind of matrix multiplication to it.
If you want to do your own matrix multiplication, then instead of using n_similarity() directly, calculate the vectors of the sentences yourself and apply matrix multiplication when calculating the cosine similarity.
2. As I said in (1), everything is done inside n_similarity(), and the creators of gensim take care of efficiency when writing the library, so any other multiplication method will most likely not make a difference.
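The suggestion in (1) can be sketched as below. The assumption here is that n_similarity() amounts to the cosine similarity between the mean word vectors of the two word lists. The dictionary vectors is a hypothetical stand-in for the word2vec model (random vectors, so the scores are meaningless), and the shortened word lists are only for illustration.

```python
import numpy as np

# Hypothetical stand-in for the word2vec model: word -> vector
rng = np.random.default_rng(0)
list_1 = [['flavor', 'taste'], ['scent', 'aroma', 'smell']]
list_2 = [['love', 'spicy', 'noodles'], ['buy', 'perfumes'], ['love', 'my', 'wife']]
vocab = sorted({w for sub in list_1 + list_2 for w in sub})
vectors = {w: rng.normal(size=8) for w in vocab}

def mean_unit_vectors(sentences, vectors):
    """One unit-length row per sentence: the normalised mean of its word vectors."""
    means = np.array([np.mean([vectors[w] for w in s], axis=0) for s in sentences])
    return means / np.linalg.norm(means, axis=1, keepdims=True)

A = mean_unit_vectors(list_2, vectors)  # shape (len(list_2), dim)
B = mean_unit_vectors(list_1, vectors)  # shape (len(list_1), dim)

# Dot products of unit vectors are cosine similarities, so a single
# matrix product fills the whole len(list_2) x len(list_1) matrix
similarity_mat = A @ B.T
print(similarity_mat.shape)  # (3, 2)
```

The per-sentence averaging still loops over the sentences, but the pairwise comparison, which grows with len(list_1) * len(list_2), becomes one matrix product.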

How to find all distances between points in a matrix without duplicates?

I have an Nx3 matrix that contains the x,y,z coordinates of N points in 3D space. I'd like to find the absolute distances between all points without duplicates.
I tried using scipy.spatial.distance.cdist()
[see documentation here: https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cdist.html ]. However, the output matrix contains duplicates of distances. For example, the distance between the points P1 and P2 is calculated twice: as the distance from P1 to P2 and again as the distance from P2 to P1. See code output:
>>> from scipy.spatial import distance
>>> points = [[1, 2, 3],
... [4, 5, 6],
... [7, 8, 9]]
>>> distances = distance.cdist(points, points, 'euclidean')
>>> print(distances)
[[ 0. 5.19615242 10.39230485]
[ 5.19615242 0. 5.19615242]
[10.39230485 5.19615242 0. ]]
I'd like the output to be without duplicates. For example, find the distance between the first point and all other points, then between the second point and the remaining points (excluding the first point), and so on. Ideally, this would be done in an efficient and scalable manner that preserves the order of the points. That is, once I find the distances, I'd like to query them, e.g. find distances within a certain range and be able to output the points that correspond to these distances.
It looks like what you want in general is a KDTree, with query_pairs.
from scipy.spatial import KDTree
points_tree = KDTree(points)
radius = 6.0  # example search radius; choose your own
points_in_radius = points_tree.query_pairs(radius)
This will be much faster than actually computing all of the distances and applying a tolerance.
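If all pairwise distances are really needed, a different approach from the KDTree above is scipy.spatial.distance.pdist, which returns each unordered pair exactly once. The sketch below pairs it with np.triu_indices to recover which points each distance belongs to. The range bounds low and high are arbitrary values chosen for the example.

```python
import numpy as np
from scipy.spatial.distance import pdist

points = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])

# Condensed distances: one entry per unordered pair (i < j), no diagonal
distances = pdist(points, metric="euclidean")

# Index pairs in the same order as `distances`
i, j = np.triu_indices(len(points), k=1)

# Query: pairs whose distance lies in a given range
low, high = 5.0, 6.0
mask = (distances >= low) & (distances <= high)
for a, b, d in zip(i[mask], j[mask], distances[mask]):
    print(a, b, round(float(d), 3))
```

pdist still computes N * (N - 1) / 2 distances, so for large N with a small search radius the KDTree approach is likely to scale better.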
