I am trying to generate Voronoi polygons and am not able to understand the parameter furthest_site=True in SciPy's Voronoi implementation.
import numpy as np
from scipy.spatial import Voronoi, voronoi_plot_2d
import matplotlib.pyplot as plt

points = np.array([[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2],
                   [2, 0], [2, 1], [2, 2]])
vor = Voronoi(points, furthest_site=True)
fig = voronoi_plot_2d(vor)
plt.show()
This gives me the following output:
What is the explanation for the parameter furthest_site=True?
SciPy says it uses Qhull to compute Voronoi diagrams, and the Qhull documentation has this:
The furthest-site Voronoi diagram is the furthest-neighbor map for a set of points. Each region contains those points that are further from one input site than any other input site.
Furthest-site (or "farthest-site") diagrams are described in plenty of other places, including example diagrams; for example, in other Stack Exchange posts: 1, 2.
Your plot looks odd because your point set is somewhat degenerate; only the four corner points ever serve as the furthest reference point.
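A quick brute-force check of that claim (a sketch that does not use the Voronoi diagram itself, only the distance definition): sample the plane, record which input site is furthest from each query location, and see which indices ever appear.

import numpy as np

points = np.array([[0, 0], [0, 1], [0, 2], [1, 0], [1, 1], [1, 2],
                   [2, 0], [2, 1], [2, 2]])

# for each query location, find which input site is furthest away
xs, ys = np.meshgrid(np.linspace(-1, 3, 200), np.linspace(-1, 3, 200))
queries = np.column_stack([xs.ravel(), ys.ravel()])
dists = np.linalg.norm(queries[:, None, :] - points[None, :, :], axis=2)
furthest = dists.argmax(axis=1)

print(np.unique(furthest))  # only the corner indices 0, 2, 6, 8 appear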
I am using trimesh to generate ray intersections from points.
ray_origins = np.array([[0, 0, 2],
[1, 1, 3],
[3, 2, 6]])
ray_directions = np.array([[0, 5, 8],
[0, 0, 1],
[0, 2, 2]])
locations, index_ray, index_tri = mesh.ray.intersects_location(
    ray_origins=ray_origins, ray_directions=ray_directions)
I want to know if there is any way I can generate many rays from each point and cast them at the mesh.
You could try sample_surface_sphere, which randomly picks points (unit vectors) on a sphere.
nPoints = 100000
ray_directions = trimesh.sample.sample_surface_sphere(nPoints)
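If you want many rays per origin, one way (a minimal sketch; the mesh below is a hypothetical placeholder, substitute your own) is to draw a batch of random directions with sample_surface_sphere, repeat each origin once per direction, and tile the directions to match before calling intersects_location:

import numpy as np
import trimesh

mesh = trimesh.creation.icosphere()  # placeholder mesh; use your own

ray_origins = np.array([[0, 0, 2],
                        [1, 1, 3],
                        [3, 2, 6]])

n_directions = 1000
directions = trimesh.sample.sample_surface_sphere(n_directions)  # (n, 3) unit vectors

# repeat each origin once per direction, and tile the directions to match
origins = np.repeat(ray_origins, n_directions, axis=0)
all_directions = np.tile(directions, (len(ray_origins), 1))

locations, index_ray, index_tri = mesh.ray.intersects_location(
    ray_origins=origins, ray_directions=all_directions)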
I'm puzzled about how the cosine metric works in sklearn's clustering algorithms.
For example, DBSCAN has a parameter eps, which specifies the maximum distance when clustering. However, a bigger cosine similarity means two vectors are closer, which is just the opposite of the usual notion of distance.
I found that there are cosine_similarity and cosine_distance (just 1 - cosine_similarity) among sklearn's pairwise metrics, and I assumed that when we specify metric='cosine' we use cosine_similarity.
So, when clustering, how does DBSCAN compare the cosine metric against the parameter eps to decide whether two vectors get the same label?
An example
import numpy as np
from sklearn.cluster import DBSCAN
samples = [[1, 0], [0, 1], [1, 1], [2, 2]]
clf = DBSCAN(metric='cosine', eps=0.1)
result = clf.fit_predict(samples)
print(result)
It outputs [-1, -1, -1, -1], which I took to mean these four points are in the same cluster.
However:
For the pair [1, 1] and [2, 2], the cosine similarity is 4/4 = 1, so the cosine distance is 1 - 1 = 0 and they should be in the same cluster.
For the pair [1, 1] and [1, 0], the cosine similarity is 1/sqrt(2), so the cosine distance is 1 - 1/sqrt(2) ≈ 0.293. This distance is bigger than our eps of 0.1, so why did DBSCAN cluster them into the same cluster?
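(A quick check of those numbers, using sklearn's pairwise cosine_distances, which computes 1 minus the cosine similarity:)

import numpy as np
from sklearn.metrics.pairwise import cosine_distances

samples = np.array([[1, 0], [0, 1], [1, 1], [2, 2]])
# pairwise cosine distances, i.e. 1 - cosine similarity
print(cosine_distances(samples).round(3))
# the [1, 1] / [2, 2] entry is 0.0; the [1, 1] / [1, 0] entry is ~0.293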
Thanks to @Stanislas Morbieu's answer, I finally understand that the cosine metric means cosine_distance, which is 1 - cosine_similarity.
The implementation of DBSCAN in scikit-learn relies on NearestNeighbors (see the implementation of DBSCAN).
Here is an example to see how it works with cosine metric:
import numpy as np
from sklearn.neighbors import NearestNeighbors
samples = [[1, 0], [0, 1], [1, 1], [2, 2]]
neigh = NearestNeighbors(radius=0.1, metric='cosine')
neigh.fit(samples)
rng = neigh.radius_neighbors([[1, 1]])
print([samples[i] for i in rng[1][0]])
It outputs [[1, 1], [2, 2]], i.e. the points which are closest to [1, 1] within a radius of 0.1.
So points which have a cosine distance smaller than eps in DBSCAN tend to be in the same cluster.
The parameter min_samples of DBSCAN plays an important role. Since it is set to 5 by default, no point in this example can be considered a core point.
Setting it to 1, the example code:
import numpy as np
from sklearn.cluster import DBSCAN
samples = [[1, 0], [0, 1], [1, 1], [2, 2]]
clf = DBSCAN(metric='cosine', eps=0.1, min_samples=1)
result = clf.fit_predict(samples)
print(result)
outputs [0 1 2 2] which means that [1, 1] and [2, 2] are in the same cluster (numbered 2).
By the way, the output [-1, -1, -1, -1] doesn't mean that the points are in the same cluster, but that all points are labelled as noise, i.e. they belong to no cluster.
How does numpy's matrix class work? I understand it will likely be removed in the future, so I am trying to understand its behavior so I can do the same with ndarrays.
>>> x=np.matrix([[1,1,1],[2,2,2],[3,3,3]])
>>> x[:,0] + x[0,:]
matrix([[2, 2, 2],
        [3, 3, 3],
        [4, 4, 4]])
Seems like a row of ones got added to every row.
>>> x=np.matrix([[1,2,3],[1,2,3],[1,2,3]])
>>> x[0,:] + x[:,0]
matrix([[2, 3, 4],
        [2, 3, 4],
        [2, 3, 4]])
Now it seems like a column of ones got added to every column. What it does with the identity is even weirder:
>>> x=np.matrix([[1,0,0],[0,1,0],[0,0,1]])
>>> x[0,:] + x[:,0]
matrix([[2, 1, 1],
        [1, 0, 0],
        [1, 0, 0]])
EDIT:
It seems that if you take an (N, 1)-shaped matrix and add it to a (1, N)-shaped matrix, one of these is replicated to form an (N, N) matrix and the other is added to every row or column of this new matrix. It seems to be a convenience restricted to vectors of the right sizes. A nice use case was networkx's implementation of Floyd-Warshall.
Is there an equivalently convenient one-liner for this using standard numpy ndarrays?
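For reference, here is a minimal sketch of how the same replication falls out of plain ndarray broadcasting (which is presumably what np.matrix is doing here): keep both sliced axes so the operands have shapes (3, 1) and (1, 3), and the sum broadcasts to (3, 3).

import numpy as np

x = np.array([[1, 1, 1], [2, 2, 2], [3, 3, 3]])

col = x[:, 0:1]   # shape (3, 1); slicing with a range keeps the axis
row = x[0:1, :]   # shape (1, 3)

print(col + row)  # broadcasts to (3, 3), matching the matrix example
# equivalently: x[:, 0, None] + x[None, 0, :]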
Say we have two matrices A and B, each 2 by 2. Is there a command that stacks them horizontally while adding A[:,1] to B[:,0], so that the resulting matrix C is 2 by 3 with C[:,0] = A[:,0], C[:,1] = A[:,1] + B[:,0], C[:,2] = B[:,1]? One step further, stacking them on the diagonal so that C is 3 by 3 with C[0:2,0:2] = A, C[1:3,1:3] = B, and C[1,1] = A[1,1] + B[0,0]. Hard-coding this routine is not hard, but I'm just curious, since MATLAB has a similar function if my memory serves me well.
A straightforward approach is to copy or add the two arrays into a target:
In [882]: A=np.arange(4).reshape(2,2)
In [883]: C=np.zeros((2,3),int)
In [884]: C[:,:-1]=A
In [885]: C[:,1:]+=A # or B
In [886]: C
Out[886]:
array([[0, 1, 1],
[2, 5, 3]])
Another approach is to pad A at the end, pad B at the start, and sum; while there is a convenient pad function, it won't be any faster.
And for the diagonal
In [887]: C=np.zeros((3,3),int)
In [888]: C[:-1,:-1]=A
In [889]: C[1:,1:]+=A
In [890]: C
Out[890]:
array([[0, 1, 0],
[2, 3, 1],
[0, 2, 3]])
Again the two arrays could be padded and added.
I'm not aware of any specialized function to do this; even if there were, it probably would do the same thing. This isn't a common enough operation to justify a compiled version.
I have built up finite element sparse matrices by adding overlapping element matrices. The sparse formats for both MATLAB and scipy facilitate this (duplicate coordinates are summed).
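For instance, a minimal sketch with scipy.sparse.coo_matrix (reusing the 2-by-2 A from above and a hypothetical B = A + 10): duplicate (row, col) entries are summed when the sparse matrix is assembled, which produces exactly the overlapping diagonal layout.

import numpy as np
from scipy.sparse import coo_matrix

A = np.arange(4).reshape(2, 2)
B = A + 10  # hypothetical second block

# place A at rows/cols 0:2 and B at rows/cols 1:3; the (1, 1) entry overlaps
rows = np.concatenate([np.repeat([0, 1], 2), np.repeat([1, 2], 2)])
cols = np.concatenate([np.tile([0, 1], 2), np.tile([1, 2], 2)])
data = np.concatenate([A.ravel(), B.ravel()])

C = coo_matrix((data, (rows, cols)), shape=(3, 3)).toarray()
print(C)  # C[1, 1] == A[1, 1] + B[0, 0]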
============
In [896]: np.pad(A,[[0,0],[0,1]],mode='constant') + np.pad(A,[[0,0],[1,0]],mode='constant')
Out[896]:
array([[0, 1, 1],
       [2, 5, 3]])
In [897]: np.pad(A,[[0,1],[0,1]],mode='constant') + np.pad(A,[[1,0],[1,0]],mode='constant')
Out[897]:
array([[0, 1, 0],
       [2, 3, 1],
       [0, 2, 3]])
What's the special MATLAB code for doing this?
In Octave I found:
prepad(A,3,0,axis=2)+postpad(A,3,0,axis=2)
I have a training set of data. The Python script that creates the model also calculates the features into a numpy array (it's a bit vector). I then want to use VarianceThreshold to eliminate all features that have zero variance (e.g. all 0 or all 1). I then run get_support(indices=True) to get the indices of the selected columns.
My issue now is how to get only the selected features for the data I want to predict. I first calculate all features and then use array indexing, but it does not work:
x_predict_all = getAllFeatures(suppl_predict)
x_predict = x_predict_all[indices] #only selected features
indices is a numpy array.
The returned array x_predict has the correct length len(x_predict), but the wrong shape: x_predict.shape[1] is still the original number of features. My classifier then throws an error because of the wrong shape:
prediction = gbc.predict(x_predict)
File "C:\Python27\lib\site-packages\sklearn\ensemble\gradient_boosting.py", li
ne 1032, in _init_decision_function
self.n_features, X.shape[1]))
ValueError: X.shape[1] should be 1855, not 2090.
How can I solve this issue?
You can do it like this:
Test data
import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 2, 0, 3],
              [0, 1, 4, 3],
              [0, 1, 1, 3]])
selector = VarianceThreshold()
Alternative 1
>>> selector.fit(X)
>>> idxs = selector.get_support(indices=True)
>>> X[:, idxs]
array([[2, 0],
[1, 4],
[1, 1]])
Alternative 2
>>> selector.fit_transform(X)
array([[2, 0],
[1, 4],
[1, 1]])
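Applied to the names in the question (x_predict_all and indices come from the question's own code), the fix is to index columns rather than rows, or to reuse the fitted selector:

x_predict = x_predict_all[:, indices]  # select columns, not rows
# or, if the fitted selector is still available:
# x_predict = selector.transform(x_predict_all)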