Memory error when running pairwise_distances in sklearn - scikit-learn

My dataset has 40 million records. When I try to calculate the Jaccard distance, I get a memory error. How can I improve my code?
import numpy as np
from sklearn.metrics import pairwise_distances

result = []
for line in open("./raw_data1"):
    # for line in sys.stdin:
    # tagid_result = [0] * max_len
    tagid_result = [0] * 34
    line = line.strip()
    fields = line.split("\t")
    if len(fields) < 6:
        continue
    tagid = fields[3]
    tagids = tagid.split(":")
    try:
        for i in range(0, len(tagids)):
            tagid_result[i] = int(tagids[i])
    except (ValueError, IndexError):
        # skip lines with non-integer tag ids or more than 34 tags
        continue
    result.append(tagid_result)
X = np.array(result)
distance_matrix = pairwise_distances(X, metric='jaccard')
print(distance_matrix)

You are running out of RAM. To compute the distances between N vectors you must store N^2 distance values. For N = 40 million that is 1.6 × 10^15 values, or about 12.8 PB at 8 bytes per value, which is far too much data to fit into memory. There are two options:
1) Split your matrix, X, into subsets. Create a pairwise distance matrix for each subset. Then stitch those pairwise distance matrices together.
2) Create a dataset of all vector pairs. Store each vector in its own file. Create a function that reads two vector files, computes their distance, and returns the distance value. Apply this function over all vector pairs. Concatenate the distance results to create your distance matrix. This function can be run in parallel to compute the distance matrix more efficiently.
I would opt for solution 2.
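As a rough illustration of the block-wise idea, here is a minimal sketch that computes Jaccard distances one block of rows at a time and keeps only a per-row summary (the nearest other row) instead of the full matrix. It assumes the rows are boolean indicator vectors, which is not how X is built in the question; the function names `jaccard_block` and `nearest_neighbors_chunked` and the `chunk_rows` parameter are made up for this example.

```python
import numpy as np

def jaccard_block(A, B):
    """Jaccard distances between the boolean rows of A and the boolean rows of B."""
    A = A.astype(np.int64)
    B = B.astype(np.int64)
    inter = A @ B.T                                   # |a AND b| for every pair
    union = A.sum(axis=1)[:, None] + B.sum(axis=1)[None, :] - inter
    # Two all-False rows have an empty union; treat their distance as 0.
    safe_union = np.where(union == 0, 1, union)
    return np.where(union == 0, 0.0, 1.0 - inter / safe_union)

def nearest_neighbors_chunked(X, chunk_rows=1000):
    """For each row, the index of and distance to its nearest other row.

    Only a (chunk_rows x N) block of distances exists at any time."""
    n = X.shape[0]
    best_idx = np.empty(n, dtype=np.int64)
    best_dist = np.empty(n)
    for start in range(0, n, chunk_rows):
        stop = min(start + chunk_rows, n)
        block = jaccard_block(X[start:stop], X)
        # Mask each row's distance to itself so it is not picked as the minimum.
        block[np.arange(stop - start), np.arange(start, stop)] = np.inf
        best_idx[start:stop] = block.argmin(axis=1)
        best_dist[start:stop] = block.min(axis=1)
    return best_idx, best_dist

# Small hypothetical data set: 200 rows of 34 binary indicators.
rng = np.random.default_rng(0)
X_demo = rng.random((200, 34)) < 0.3
idx, dist = nearest_neighbors_chunked(X_demo, chunk_rows=32)
```

Peak memory is bounded by roughly chunk_rows × N values, but the total work is still N^2 comparisons, so at 40 million rows this remains very expensive even when it no longer runs out of RAM.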

Related

How to shuffle tiny spheres inside a big sphere in a pythonic way?

I have a list of spheres with some known characteristics (ids, radii, masses, and positions) with ids, radii, and masses being 1D arrays with shape (511, ) and positions being 3D array with shape (511, 3) inside some big spherical volume with known center, (0, 0, 0) and radius, distance_max.
hal_ids_data = np.array([19895, 19896, ..., 24249])
hal_radiuss_data = np.array([1.047, 1.078, ..., 3.263])
hal_masss_data = np.array([2.427e+06, 8.268e+06, ..., 8.954e+07])
hal_positions_data = np.array([np.array([-33.78, 10.4, 33.83]), np.array([-33.61, 6.34, 35.64]), ..., np.array([-0.4014, 4.121, 33.05])])
I would like to randomly place these tiny spheres throughout the volume within the big sphere while keeping their individual characteristics intact. In other words, only their positions need to be shuffled, subject to the two constraints shown below.
for hal_id, hal_position, hal_radius, hal_mass in zip(hal_ids_data, hal_positions_data, hal_radiuss_data, hal_masss_data):
    # check if 1) any one of the small spheres are above some mass threshold AND 2) inside the big sphere
    if ((np.sqrt(pow(hal_position[0], 2)+pow(hal_position[1], 2)+pow(hal_position[2], 2)) < distance_max) and (log10(hal_mass)>=1e8)):
        # if so, then do the following stuff down here but to the shuffled populations of small spheres meeting the conditions above rather than to the original population
What is the fastest and shortest way to shuffle my spheres under the last if statement before doing some stuff on them? (I do need my original population info though for later use so I cannot disregard it)
The best approach would be to compute your constraints in a vectorized format (which is very efficient in numpy) instead of using a for loop. Then generate an array of indexes that match your constraints, and then shuffle those indexes.
So using your example data above:
import numpy as np
distance_max = 49 #I chose this so that we have some matching items
hal_ids_data = np.array([19895, 19896, 24249])
hal_radius_data = np.array([1.047, 1.078, 3.263])
hal_mass_data = np.array([2.427e+06, 8.268e+06, 8.954e+07])
hal_positions_data = np.array([np.array([-33.78, 10.4, 33.83]), np.array([-33.61, 6.34, 35.64]), np.array([-0.4014, 4.121, 33.05])])
# Compute the conditions for every sphere at the same time instead of for loop
within_max = np.sqrt(pow(hal_positions_data[:,0],2) + pow(hal_positions_data[:,1],2) + pow(hal_positions_data[:,2],2)) < distance_max
mass_constraint = np.log10(hal_mass_data) >= 1 #I chose this so that we have some matching items
matched_spheres = within_max & mass_constraint
# Get indexes of matching spheres
idx = np.where(matched_spheres)[0] # create array of indexes
np.random.shuffle(idx) #shuffle array of indexes in place
# Generate shuffled data by applying the idx to the original arrays and saving to new 's_' arrays
s_hal_ids_data = hal_ids_data[idx]
s_hal_radius_data = hal_radius_data[idx]
s_hal_mass_data = hal_mass_data[idx]
s_hal_positions_data = hal_positions_data[idx]
# Do stuff with shuffled population of small spheres
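The index shuffle above reorders whole spheres, so each sphere still carries its own position. If the goal is to reassign positions among the selected spheres while every sphere keeps its id, radius and mass, one possible sketch is to permute only the position rows. The helper name `shuffle_positions`, the seed, and the fourth sphere in the sample data are assumptions added for illustration.

```python
import numpy as np

def shuffle_positions(positions, mask, rng):
    """Return a copy of positions in which the rows selected by mask
    are permuted among themselves; all other rows stay where they are."""
    shuffled = positions.copy()
    idx = np.flatnonzero(mask)
    shuffled[idx] = positions[rng.permutation(idx)]
    return shuffled

rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility only
distance_max = 49
positions = np.array([[-33.78, 10.4, 33.83],
                      [-33.61, 6.34, 35.64],
                      [-0.4014, 4.121, 33.05],
                      [60.0, 0.0, 0.0]])        # hypothetical sphere outside the big sphere
masses = np.array([2.427e+06, 8.268e+06, 8.954e+07, 5.0e+06])

# Same two constraints as above, evaluated for all spheres at once.
mask = (np.linalg.norm(positions, axis=1) < distance_max) & (np.log10(masses) >= 1)
new_positions = shuffle_positions(positions, mask, rng)
```

Because a copy is returned, the original positions array is left untouched for later use.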

Optimizing Dunn Index calculation?

The Dunn Index is a method of evaluating clustering. A higher value is better. It is calculated as the lowest intercluster distance (ie. the smallest distance between any two cluster centroids) divided by the highest intracluster distance (ie. the largest distance between any two points in any cluster).
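Written out, with $c_i$ the centroid of cluster $C_i$ and $d$ the chosen distance (symbols introduced here only for illustration), the definition above reads:

```latex
\mathrm{DI} = \frac{\min_{i \neq j} d(c_i, c_j)}{\max_{k} \, \max_{x, y \in C_k} d(x, y)}
```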
I have a code snippet for calculating the Dunn Index:
def dunn_index(pf, cf):
    """
    pf -- all data points
    cf -- cluster centroids
    """
    numerator = inf
    for c in cf: # for each cluster
        for t in cf: # for each cluster
            if t is c: continue # if same cluster, ignore
            numerator = min(numerator, distance(t, c)) # find distance between centroids
    denominator = 0
    for c in cf: # for each cluster
        for p in pf: # for each point
            if p.get_cluster() is not c: continue # if point not in cluster, ignore
            for t in pf: # for each point
                if t.get_cluster() is not c: continue # if point not in cluster, ignore
                if t is p: continue # if same point, ignore
                denominator = max(denominator, distance(t, p))
    return numerator/denominator
The problem is this is exceptionally slow: for an example data set consisting of 5000 instances and 15 clusters, the function above needs to perform just over 375 million distance calculations at worst. Realistically it's much lower, but even a best case, where the data is ordered by cluster already, is around 25 million distance calculations. I want to shave time off of it, and I've already tried rectilinear distance vs. euclidean and it's not good.
How can I improve this algorithm?
TLDR: Importantly, the problem is set up in two dimensions. In higher dimensions, these techniques can be ineffective.
In 2D, we can compute the diameter (intracluster distance) of each cluster in O(n log n) time where n is the cluster size using convex hulls. Vectorization is used to speed up remaining operations. There are two possible asymptotic improvements mentioned at the end of the post, contributions welcome ;)
Setup and fake data:
import numpy as np
from scipy import spatial
from matplotlib import pyplot as plt
# set up fake data
np.random.seed(0)
n_centroids = 1000
centroids = np.random.rand(n_centroids, 2)
cluster_sizes = np.random.randint(1, 1000, size=n_centroids)
# labels from 1 to n_centroids inclusive
labels = np.repeat(np.arange(n_centroids), cluster_sizes) + 1
points = np.zeros((cluster_sizes.sum(), 2))
points[:,0] = np.repeat(centroids[:,0], cluster_sizes)
points[:,1] = np.repeat(centroids[:,1], cluster_sizes)
points += 0.05 * np.random.randn(cluster_sizes.sum(), 2)
Looks somewhat like this:
Next, we define a diameter function for computing the largest intracluster distance, based on this approach using the convex hull.
# compute the diameter based on convex hull
def diameter(pts):
    # need at least 3 points to construct the convex hull
    if pts.shape[0] <= 1:
        return 0
    if pts.shape[0] == 2:
        # Euclidean distance between the two points
        return np.sqrt(((pts[0] - pts[1])**2).sum())
    # the two points which are furthest apart will occur as vertices of the convex hull
    hull = spatial.ConvexHull(pts)
    candidates = pts[hull.vertices]
    return spatial.distance_matrix(candidates, candidates).max()
For the Dunn index calculation, I assume that we have already computed the points, the cluster labels and the cluster centroids.
If the number of clusters is large, the following solution based on Pandas may perform well:
import pandas as pd
def dunn_index_pandas(pts, labels, centroids):
    # O(k n log(n)) with k clusters and n points; better performance with more even clusters
    # apply the convex-hull diameter() helper to the points of each cluster
    max_intracluster_dist = pd.DataFrame(pts).groupby(labels).apply(lambda g: diameter(g.to_numpy())).max()
    # O(k^2) with k clusters; can be reduced to O(k log(k))
    # get pairwise distances between centroids
    cluster_dmat = spatial.distance_matrix(centroids, centroids)
    # fill diagonal with +inf: ignore zero distance to self in "min" computation
    np.fill_diagonal(cluster_dmat, np.inf)
    min_intercluster_dist = cluster_dmat.min()
    return min_intercluster_dist / max_intracluster_dist
Otherwise, we can continue with a pure numpy solution.
def dunn_index(pts, labels, centroids):
    # O(k n log(n)) with k clusters and n points; better performance with more even clusters
    max_intracluster_dist = max(diameter(pts[labels==i]) for i in np.unique(labels))
    # O(k^2) with k clusters; can be reduced to O(k log(k))
    # get pairwise distances between centroids
    cluster_dmat = spatial.distance_matrix(centroids, centroids)
    # fill diagonal with +inf: ignore zero distance to self in "min" computation
    np.fill_diagonal(cluster_dmat, np.inf)
    min_intercluster_dist = cluster_dmat.min()
    return min_intercluster_dist / max_intracluster_dist
%time dunn_index(points, labels, centroids)
# in 2.2 seconds
%time dunn_index_pandas(points, labels, centroids)
# in 885 ms
For 1000 clusters with i.i.d. ~U[1,1000] cluster sizes this takes 2.2 seconds on my machine. This number drops to 0.8 seconds with the Pandas approach for this example (many small clusters).
There are two further optimization opportunities that are relevant when the number of clusters is large:
First, I am computing the minimal intercluster distance with a brute force O(k^2) approach where k is the number of clusters. This can be reduced to O(k log(k)), as discussed here.
Second, max(diameter(pts[labels==i]) for i in np.unique(labels)) requires k passes over an array of size n. With many clusters this can become the bottleneck (as in this example). This is somewhat mitigated with the pandas approach, but I expect that this can be optimized a lot further. For current parameters, roughly one third of compute time is spent outside of computing intercluster or intracluster distances.
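For the first point, one way to avoid the full k × k centroid distance matrix is a nearest-neighbor query on a k-d tree. The sketch below uses `scipy.spatial.cKDTree`; it is offered as one possible approach and is not necessarily the method the linked discussion describes. The function name `min_centroid_distance` is made up for this example.

```python
import numpy as np
from scipy import spatial

def min_centroid_distance(centroids):
    """Smallest distance between any two centroids, via nearest-neighbor queries."""
    tree = spatial.cKDTree(centroids)
    # k=2: the closest hit for each centroid is the centroid itself (distance 0),
    # the second hit is its nearest other centroid.
    dists, _ = tree.query(centroids, k=2)
    return dists[:, 1].min()

np.random.seed(0)
demo_centroids = np.random.rand(1000, 2)
fast_min = min_centroid_distance(demo_centroids)
```

In low dimensions each query is typically logarithmic in k, so the whole computation scales far better than the brute-force matrix when the number of clusters is large.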
This is not about optimizing the algorithm itself, but I think one of the following suggestions can improve performance:
Use multiprocessing's pool of workers.
Move the hot Python code to C/C++. Refer to the official documentation.
There are also Performance Tips on https://www.python.org.

Fastest way to determine if two points are closest to one another

My problem is the following: I am given two pairs of angles (in spherical coordinates), each consisting of two parts: an azimuth and a colatitude angle. If we extend each direction infinitely (thereby increasing its radius) to make a long line pointing in the direction given by the pair of angles, then my goal is to determine
if the lines intersect or come extremely close to one another and
where exactly they intersect.
Currently, I have tried several methods:
The most obvious one is to iteratively compare each pair of radii until there is either a match or a small enough distance between the two. (When I say compare each pair of radii, I am referring to converting each spherical coordinate into Cartesian and then finding the Euclidean distance between the two.) However, the runtime is $O(n^{2})$, which is extremely slow if I am trying to scale this program.
The second most obvious method is to use the optimization package to find this distance. Unfortunately, I cannot run the optimization package iteratively, and after one instance the optimization algorithm repeats the same answer, which is not useful.
The least obvious method is to directly calculate (using calculus) the exact radii from the angles. While this is a fast method, it is not very accurate.
Note: while it might seem simple that the intersection is always at the zero-origin (0,0,0), this is not ALWAYS the case. Some points never intersect.
Code for Method (1)
def match1(azimuth_recon_1,colatitude_recon_1,azimuth_recon_2, colatitude_recon_2,centroid_1,centroid_2 ):
    # Constants: tolerance factor and extremely large distance
    tol = 3e-2
    prevDist = 99999999
    # Initialize a list of radii to loop through
    # Checking iteratively for a solution
    for r1 in list(np.arange(0,5,tol)):
        for r2 in list(np.arange(0,5,tol)):
            # Get the estimates
            estimate_1 = np.array(spher2cart(r1,azimuth_recon_1,colatitude_recon_1)) + np.array(centroid_1)
            estimate_2 = np.array(spher2cart(r2,azimuth_recon_2,colatitude_recon_2))+ np.array(centroid_2)
            # Calculate the euclidean distance between them
            dist = np.array(np.sqrt(np.einsum('i...,i...', (estimate_1 - estimate_2), (estimate_1 - estimate_2)))[:,np.newaxis])
            # Compare the distance to this tolerance
            if dist < tol:
                if dist == 0:
                    return estimate_1, [], True
                else:
                    return estimate_1, estimate_2, False
            ## If the distance is too big break out of the loop
            if dist > prevDist:
                prevDist = 9999999
                break
            prevDist = dist
    return [], [], False
Code for Method (3)
def match2(azimuth_recon_1,colatitude_recon_1,azimuth_recon_2, colatitude_recon_2,centroid_1,centroid_2):
    # Set a Tolerance factor
    tol = 3e-2
    def calculate_radius_2(azimuth_1,colatitude_1,azimuth_2,colatitude_2):
        """Return radius 2 using both pairs of angles (azimuth and colatitude). Equation is provided in the document"""
        return 1/((1-(math.sin(azimuth_1)*math.sin(azimuth_2)*math.cos(colatitude_1-colatitude_2))
                   +math.cos(azimuth_1)*math.cos(azimuth_2))**2)
    def calculate_radius_1(radius_2,azimuth_1,colatitude_1,azimuth_2,colatitude_2):
        """Returns radius 1 using both pairs of angles (azimuth and colatitude) and radius 2.
        Equation provided in document"""
        return (radius_2)*((math.sin(azimuth_1)*math.sin(azimuth_2)*math.cos(colatitude_1-colatitude_2))
                           +math.cos(azimuth_1)*math.cos(azimuth_2))
    # Compute radius 2
    radius_2 = calculate_radius_2(azimuth_recon_1,colatitude_recon_1,azimuth_recon_2,colatitude_recon_2)
    #Compute radius 1
    radius_1 = calculate_radius_1(radius_2,azimuth_recon_1,colatitude_recon_1,azimuth_recon_2,colatitude_recon_2)
    # Get the estimates
    estimate_1 = np.array(spher2cart(radius_1,azimuth_recon_1,colatitude_recon_1))+ np.array(centroid_1)
    estimate_2 = np.array(spher2cart(radius_2,azimuth_recon_2,colatitude_recon_2))+ np.array(centroid_2)
    # Calculate the euclidean distance between them
    dist = np.array(np.sqrt(np.einsum('i...,i...', (estimate_1 - estimate_2), (estimate_1 - estimate_2)))[:,np.newaxis])
    # Compare the distance to this tolerance
    if dist < tol:
        if dist == 0:
            return estimate_1, [], True
        else:
            return estimate_1, estimate_2, False
    else:
        return [], [], False
My question is two-fold:
Is there a faster and more accurate way to find the radii for both points?
If so, how do I do it?
EDIT: I am thinking about just creating two numpy arrays of the two radii and then comparing them via numpy boolean logic. However, I would still be comparing them iteratively. Is there a faster way to perform this comparison?
Use a kd-tree for such situations. It will easily look up the minimal distance:
def match(azimuth_recon_1,colatitude_recon_1,azimuth_recon_2, colatitude_recon_2,centroid_1,centroid_2):
    # r1, r2, outputs_list_1, outputs_list_2 and two_pair_mic_list are assumed to be defined in the enclosing scope
    cartesian_1 = np.array([np.cos(azimuth_recon_1)*np.sin(colatitude_recon_1),np.sin(azimuth_recon_1)*np.sin(colatitude_recon_1),np.cos(colatitude_recon_1)]) #[np.newaxis,:]
    cartesian_2 = np.array([np.cos(azimuth_recon_2)*np.sin(colatitude_recon_2),np.sin(azimuth_recon_2)*np.sin(colatitude_recon_2),np.cos(colatitude_recon_2)]) #[np.newaxis,:]
    # Re-center them via adding the centroid
    estimate_1 = r1*cartesian_1.T + np.array(centroid_1)[np.newaxis,:]
    estimate_2 = r2*cartesian_2.T + np.array(centroid_2)[np.newaxis,:]
    # Add them to the output list
    n = estimate_1.shape[0]
    outputs_list_1.append(estimate_1)
    outputs_list_2.append(estimate_2)
    # Reshape them so that they are in proper format
    a = np.array(outputs_list_1).reshape(len(two_pair_mic_list)*n,3)
    b = np.array(outputs_list_2).reshape(len(two_pair_mic_list)*n,3)
    # Get the difference
    c = a - b
    # Put into a KDtree
    tree = spatial.KDTree(c)
    # Find the indices where the radius (distance between the points) is 3e-3 or less,
    # i.e. the difference vectors that lie within 3e-3 of the origin
    indices = tree.query_ball_point(np.zeros(3), 3e-3)
    return indices
This will output a list of the indices where the distance is 3e-3 or less. Now all you will have to do is use the list of indices with the estimate list to find the exact points. And there you have it, this will save you a lot of time and space!
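For comparison, the closest approach of two straight lines in 3D also has a standard closed form, which needs no search over radii. The sketch below is not taken from either post: it assumes the same angle convention as the kd-tree code above (x = cos(azimuth)·sin(colatitude), y = sin(azimuth)·sin(colatitude), z = cos(colatitude)), and the function names and the parallel-line tolerance of 1e-12 are choices made for this example.

```python
import numpy as np

def direction(azimuth, colatitude):
    """Unit vector for the given angles (assumed convention, see lead-in)."""
    return np.array([np.cos(azimuth) * np.sin(colatitude),
                     np.sin(azimuth) * np.sin(colatitude),
                     np.cos(colatitude)])

def closest_approach(azimuth_1, colatitude_1, azimuth_2, colatitude_2,
                     centroid_1, centroid_2):
    """Closest points of the lines centroid_k + r_k * direction_k.

    Returns (point_1, point_2, r1, r2, distance)."""
    p1 = np.asarray(centroid_1, dtype=float)
    p2 = np.asarray(centroid_2, dtype=float)
    d1 = direction(azimuth_1, colatitude_1)
    d2 = direction(azimuth_2, colatitude_2)
    w0 = p1 - p2
    b = d1 @ d2              # cosine of the angle between the lines (unit vectors)
    d = d1 @ w0
    e = d2 @ w0
    denom = 1.0 - b * b
    if denom < 1e-12:        # (nearly) parallel: fix r1 = 0 and project onto line 2
        r1, r2 = 0.0, e
    else:                    # minimizer of |w0 + r1*d1 - r2*d2|^2
        r1 = (b * e - d) / denom
        r2 = (e - b * d) / denom
    point_1 = p1 + r1 * d1
    point_2 = p2 + r2 * d2
    return point_1, point_2, r1, r2, np.linalg.norm(point_1 - point_2)

# Line 1 runs along x from the origin, line 2 along y from (1, -1, 0).
pt1, pt2, r1, r2, gap = closest_approach(0.0, np.pi / 2, np.pi / 2, np.pi / 2,
                                         [0, 0, 0], [1, -1, 0])
```

The returned distance can be compared against a tolerance such as the 3e-2 used in the question. A negative r1 or r2 means the closest point lies behind the corresponding centroid, which matters if the directions are meant as rays rather than full lines.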

Custom metric for NearestNeighbors sklearn

Hello, I'm working on a project where I use 512-bit hashes to create clusters. I'm using a custom metric, the bitwise Hamming distance. But when I compare two hashes with this function, I obtain different distances than when using NearestNeighbors.
Extending this to DBSCAN with eps=5, the clusters are created consistently and the points appear to be clustered correctly. But when I check the distance between points from the same cluster, I obtain enormous distances. Here is an example.
Example:
This is a list of points from 2 clusters created by DBSCAN. As you can see, calculating the distance with the function gives numbers bigger than 30, but NearestNeighbors gives results consistent with eps=5.
from sklearn.neighbors import NearestNeighbors
hash_list_1 = [2711636196460699638441853508983975450613573844625556129377064665736210167114069990407028214648954985399518205946842968661290371575620508000646896480583712,
2711636396252606881895803338309150146134565539796776390549907030396205082681800682439355456735713892762967881436259141637319066484744271299497977370896760,
2711636396252606881918517135048330084905033589325484952567856239496981859330884970846906663264518266744879431357749780779892124020350824669153434630258784,
2711636396252797418317524490088561493800258861799581574018898781319096107333812163580085003775074676924785748114206505865657620572909617106316367216148512,
2711636196460318585955127494483972276879239064090689852809978361705086216958169367104329622890567955158961917611852516176654399246340379120409329566384160,
2711636396252606881918605860499354102197401318666579124151729671752374458560929422237113300739169875232495266727513833203360007861082211711747836501459040,
2685449071597530523833230885351500532369477539914318172159429043161052628696351016818586542171509728747070238075233795777242761861490021015910382103951968,
2685449271584547381638295372872027557715092296493457397817270817010861872186702795218797216694169625716749654321460983923962566367029011600112932108533792,
2685449071792640184514638654713547133316375160837810451952682241651988724244365461216285304336254942220323815140042850082680124299635209323646382761738272,
1847461275963134712629870519594779049860430827711272857522520377357653173694038204556169999876899727026751811340128091158803029889914422883922033917198368,
2711636396252606881901567718540735842607739343712295416931961674938924754114357607352250040524848697769853213132484145241805622979375000168935113673834592,
2711636396252606881901567718538101947732706353297593371282460773094032493492652041376662823635245997887100968237677157520342076957158825588198798784364576]
hash_list_2 = [1677246762479319235863065539858628614044010438213592493389244703420353559152336301659250128835190166728647823546464421558167523127086351613289685036466208,
1677246762479700308655934218233084077989052614799077817712715603728397519829375248244181345837838956827991047769168833176865438232999821278031784406056992,
1677246762479700314487411751526941880161990070273187005125752885368412445003620183982282356578440274746789782460884881633682918768578649732794162647826464,
1677246762479319238759152196394352786642547660315097253847095508872934279466872914748604884925141826161428241625796765725368284151706959618924400925900832,
1677246762479890853811162999308711253291696853123890392766127782305403145675433285374478727414572392743118524142664546768046227747593095585347134902140960,
1677246765601448867710925237522621090876591539557992237656925108430781026329148912958069241932475038282622646533152559554888274158032061637714105308528752,
1678883457783648388335228538833424204662395277995143067623864457726472665342252064374635323999849241968448535982901839797440478656657327613912450890367008,
1677246765601448864793634462245189770642489500950753120409198344054454862566173176691699195659218600616315451200851360013275424257209428603245704937128032,
1677246762479700314471974894075267160937462491405299015541470373650765401692659096424270522124311243007780041455682577230603077926878181390448030335795232,
1677246762479700317400446530288778920091525622772690226165317385340164047644547471081180880454458397836230795248631079659291423401151022423365062554976288,
1677246762479700317400446530288758590086745084806873060513679821541689120894219120403259478342385343805541797540566045409406476458247878183422733877936160,
2516871453405060707064684111867902766968378200849671168835363528433280949578746081906100803610196553501646503982070255639855643685380535999494563083255776,
1677246762479319230037086118512223039643232176451879100417048497454912234466993748113993020733268935613563596294183318283010061477487433484794582123053088,
1677246762479319235834673272207747972667132521699112379991979781620810490520617303678451683578338921267417975279632387450778387555221361833006151849902112,
1677246762479700305748490595643272813492272250002832996415474372704463760357437926852625171223210803220593114114602433734175731538424778624130491225112608]
def custom_metric(x, y):
    return bin(int(x[0]) ^ int(y[0])).count('1')
objective_hash = hash_list_1[0]
complete_list = hash_list_1 + hash_list_2
distance = [custom_metric([objective_hash], [hash_point]) for hash_point in complete_list]
print("Function iteration distance:")
print(distance)
neighbors_model = NearestNeighbors(radius=100, algorithm='ball_tree',
leaf_size=2,
metric=custom_metric,
metric_params=None,
n_jobs=4)
X = [[x] for x in complete_list]
neighbors_model.fit(X)
distance, neighborhoods = neighbors_model.radius_neighbors(objective_hash, 100, return_distance=True)
print("Nearest Neighbors distance:")
print(distance)
print("Nearest Neighbors index:")
print(neighborhoods)
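The problem:
NumPy can't handle integers this big: a 512-bit value does not fit a 64-bit integer type, so the hashes are converted to float (a float64 keeps only 53 significant bits), losing a lot of precision.
The solution:
Precompute all the distances with your custom metric and feed them to the DBSCAN algorithm.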
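A minimal sketch of the precomputation, keeping the hashes as plain Python integers so that no bits are lost. The helper names and the toy 512-bit values are made up for this example; real hashes such as those in hash_list_1 and hash_list_2 can be passed in the same way.

```python
import numpy as np

def hamming_bits(a, b):
    """Number of differing bits between two arbitrary-size Python ints."""
    return bin(a ^ b).count("1")

def precomputed_hamming(hashes):
    """Symmetric matrix of pairwise bitwise Hamming distances."""
    n = len(hashes)
    dist = np.zeros((n, n), dtype=np.int32)
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = hamming_bits(hashes[i], hashes[j])
    return dist

# Toy 512-bit values (hypothetical, not taken from the question).
base = (1 << 511) | 0b1011
toy_hashes = [base, base ^ 0b1, base ^ 0b111, base ^ (1 << 300)]
D = precomputed_hamming(toy_hashes)

# Round-tripping through float drops the low bits, which is the precision loss described above.
bits_lost = hamming_bits(int(float(base)), base)
```

A matrix like D can then be handed to scikit-learn with metric='precomputed', for example DBSCAN(eps=5, metric='precomputed').fit(D), so that scikit-learn never sees the raw hash values.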

Reducing Memory requirements for distance/adjacency matrix

Based on a subsample of around 5,000 to 100,000 word embeddings (GloVe, 300-dimensional), I need to construct an adjacency matrix, i.e. a matrix of 1's and 0's indicating whether the Euclidean (or cosine) distance between two words is smaller than x.
Currently, I'm using scipy.spatial.distance.pdist:
distances = pdist(common_model, 'euclidean')
adjacency = (distances <= 0.4)
adjacency = csr_matrix(squareform(adjacency), dtype=np.uint8)
With increasing vocabulary size, my memory fills up rather quickly and pdist fails with a MemoryError (when common_model has the shape (91938, 300) and contains float64).
Iterating the model manually and creating the adjacency directly without the distance matrix in between would be a way, but that was extremely slow.
Is there another way to construct the adjacency matrix in a time- and memory-optimal way?
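One possible direction, sketched here under assumptions rather than taken from the thread, is to compute the distances block by block with vectorized NumPy and keep only the index pairs that pass the threshold, so the dense distance matrix never exists in memory. The function name `adjacency_pairs`, the `block` parameter and the toy data are made up for this example.

```python
import numpy as np

def adjacency_pairs(vectors, threshold, block=1024):
    """Return (rows, cols) of all i != j with Euclidean distance <= threshold.

    Only a (block x N) slab of squared distances is held at any time."""
    sq_norms = np.einsum("ij,ij->i", vectors, vectors)
    n = vectors.shape[0]
    rows, cols = [], []
    for start in range(0, n, block):
        stop = min(start + block, n)
        # Squared distances via |a|^2 + |b|^2 - 2 a.b
        d2 = (sq_norms[start:stop, None] + sq_norms[None, :]
              - 2.0 * vectors[start:stop] @ vectors.T)
        r, c = np.nonzero(d2 <= threshold ** 2)
        r = r + start                 # block-local row index -> global row index
        keep = r != c                 # drop self-pairs on the diagonal
        rows.append(r[keep])
        cols.append(c[keep])
    return np.concatenate(rows), np.concatenate(cols)

# Toy stand-in for the embedding matrix (hypothetical data).
rng = np.random.default_rng(0)
toy_model = rng.integers(0, 10, size=(500, 5)).astype(float)
rows, cols = adjacency_pairs(toy_model, threshold=4.0, block=64)
```

The pairs can be turned into a sparse matrix with csr_matrix((np.ones(len(rows), dtype=np.uint8), (rows, cols)), shape=(n, n)). Note that the squared-distance expansion can be slightly inexact in floating point for pairs that sit right at the threshold.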
