sklearn nearest neighbors with grouped data - scikit-learn

I am trying to do a group-based generalization of leave one out cross validation (LOOCV) with a k-nearest neighbor classifier.
Standard LOOCV finds the k-nearest neighbors of each data point, excluding the point itself.
In my scenario, each data point is associated with a group variable (such as collection date) in addition to its features.
I would like to find the k-nearest neighbors of each data point, excluding all points in the same group (e.g., to get nearest neighbors collected on a different date).
This is straightforward to implement (below), but I am looking for a cleaner solution, ideally a few standard library calls, that is also efficient, i.e., that exploits sklearn's optimized NearestNeighbors search algorithms such as the ball tree.
Setup example data and import libraries:
import pandas as pd
import numpy as np
from sklearn.neighbors import NearestNeighbors
np.random.seed(0)
group = pd.Series(['a', 'b', 'a', 'c', 'b', 'c'])
X = pd.DataFrame(np.random.rand(len(group), 2))
The k-nearest-neighbors needed for standard LOOCV can be computed easily with a single library call:
# k nearest neighbor of query point, excluding query point
k = 2
nbrs = NearestNeighbors(n_neighbors=k)
nbrs.fit(X)
nbrs.kneighbors(return_distance=False)
But finding the k-nearest neighbors with group-based exclusion appears to require iteration over groups, which will be slow when there are many groups:
# Inefficient for a large number of groups: uses each group of data number_of_groups^2 times.
def kneighbors(X, group, k):
    """k nearest neighbors of each query point, excluding all points in the query point's group."""
    def kneighbors_group(g):
        """k nearest neighbors of all points in group g, excluding points in group g."""
        Xg = X.loc[group == g, :]
        Xnotg = X.loc[group != g, :]
        # prevent the number of neighbors from exceeding the size of the remaining data
        obj = NearestNeighbors(n_neighbors=min(k, Xnotg.shape[0]))
        obj.fit(Xnotg)
        nbrs_notg_local = obj.kneighbors(X=Xg, return_distance=False)  # locally indexed neighbors
        nbrs_notg = [list(Xnotg.index[a]) for a in nbrs_notg_local]  # globally indexed neighbors
        return pd.Series(nbrs_notg, index=Xg.index)
    # concatenate the neighbors of each group and restore the initial data order
    return pd.concat([kneighbors_group(g) for g in group.unique()], axis=0).loc[X.index]

kneighbors(X, group, k)
Another option I am considering is using a custom distance metric where points in the same group would be defined to have distance infinity. But I think this will be too slow on larger datasets since it cannot exploit sklearn's efficient search algorithms.
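For reference, a minimal sketch of that infinite-distance idea (the numeric group code is appended as an extra column so the metric can see it; the callable metric forces brute-force search, which is exactly why it is slow):
def group_aware_distance(u, v):
    # last element is the group code; same group -> infinite distance
    if u[-1] == v[-1]:
        return np.inf
    return np.linalg.norm(u[:-1] - v[:-1])

codes = group.astype('category').cat.codes.to_numpy()
Xa = np.column_stack([X.to_numpy(), codes])
nbrs_grp = NearestNeighbors(n_neighbors=k, algorithm='brute', metric=group_aware_distance)
nbrs_grp.fit(Xa)
nbrs_grp.kneighbors(Xa, return_distance=False)  # same-group points (and the point itself) never appear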

Related

Causal Inference where the treatment assignment is randomised

I have mostly worked with observational data where the treatment assignment was not randomized. In the past, I have used PSM and IPTW to balance the groups and then calculate the ATE.
My problem is:
Now I am working on a problem where the treatment assignment is randomized, meaning there won't be a confounding effect. But the treatment and control groups have different sizes. There's a bucket imbalance.
Should I just analyze the data as it is and run statistical significance and power tests?
Or should I balance the size imbalance between treatment and control using, say, covariate matching, and then run significance tests?
In general, you don't need equal group sizes to estimate treatment effects.
Unequal groups will not bias the estimate; they only affect its variance, namely reducing the precision (recall that statistical power is driven by the smallest group, so unequal groups are less sample-efficient, but not categorically wrong).
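(For intuition: for a simple difference in means with common outcome variance sigma^2, the standard result is Var(estimate) = sigma^2 * (1/n_treated + 1/n_control), which for a fixed total sample size is minimized when the two groups are equal; the split only changes this variance, never the expectation.)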
You can further convince yourself with a simple simulation (code below), showing that over repeated draws the estimate is unbiased (the two distributions overlay), but that equal groups give improved precision (smaller standard error).
import statsmodels.api as sm
import numpy as np
import pandas as pd
import seaborn as sns
n_trials = 100
balanced = {
    True: (100, 100),
    False: (190, 10),
}
effect = 2.0
res = []
for i in range(n_trials):
    np.random.seed(i)
    noise = np.random.normal(size=sum(balanced[True]))  # total sample size (200 in both designs)
    for is_balanced, ratio in balanced.items():
        t = np.array([0] * ratio[0] + [1] * ratio[1])
        y = effect * t + noise
        m = sm.OLS(y, t).fit()
        res.append((is_balanced, m.params[0], m.bse[0]))
res = pd.DataFrame(res, columns=["is_balanced", "beta", "se"])
g = sns.jointplot(
    x="se", y="beta",
    hue="is_balanced",
    data=res
)
# Annotate the true effect:
g.fig.axes[0].axhline(y=effect, color='grey', linestyle='--')
g.fig.axes[0].text(y=effect, x=res["se"].max(), s="True effect")

Optimizing Dunn Index calculation?

The Dunn Index is a method of evaluating clustering. A higher value is better. It is calculated as the lowest intercluster distance (i.e. the smallest distance between any two cluster centroids) divided by the highest intracluster distance (i.e. the largest distance between any two points in any cluster).
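In symbols (using the centroid-based version described above): DI = min_{i != j} d(c_i, c_j) / max_k diam(C_k), where the c_i are the cluster centroids and diam(C_k) is the largest distance between any two points of cluster C_k.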
I have a code snippet for calculating the Dunn Index:
from math import inf  # used as the initial value for the minimum below

def dunn_index(pf, cf):
    """
    pf -- all data points
    cf -- cluster centroids
    """
    # distance() and the points' get_cluster() method are helpers defined elsewhere in my code
    numerator = inf
    for c in cf:  # for each cluster
        for t in cf:  # for each cluster
            if t is c: continue  # if same cluster, ignore
            numerator = min(numerator, distance(t, c))  # find distance between centroids
    denominator = 0
    for c in cf:  # for each cluster
        for p in pf:  # for each point
            if p.get_cluster() is not c: continue  # if point not in cluster, ignore
            for t in pf:  # for each point
                if t.get_cluster() is not c: continue  # if point not in cluster, ignore
                if t is p: continue  # if same point, ignore
                denominator = max(denominator, distance(t, p))
    return numerator / denominator
The problem is that this is exceptionally slow: for an example data set consisting of 5000 instances and 15 clusters, the function above needs to perform just over 375 million distance calculations at worst (the denominator's triple loop alone visits 15 x 5000 x 5000 combinations). Realistically it's much lower, but even a best case, where the data is already ordered by cluster, is around 25 million distance calculations. I want to shave time off it, and I've already tried rectilinear distance instead of Euclidean, and it doesn't help much.
How can I improve this algorithm?
TL;DR: Importantly, the problem is set up in two dimensions. In high dimensions, these techniques can be ineffective.
In 2D, we can compute the diameter (intracluster distance) of each cluster in O(n log n) time, where n is the cluster size, using convex hulls. Vectorization is used to speed up the remaining operations. There are two possible asymptotic improvements mentioned at the end of the post; contributions welcome ;)
Setup and fake data:
import numpy as np
from scipy import spatial
from matplotlib import pyplot as plt
# set up fake data
np.random.seed(0)
n_centroids = 1000
centroids = np.random.rand(n_centroids, 2)
cluster_sizes = np.random.randint(1, 1000, size=n_centroids)
# labels from 1 to n_centroids inclusive
labels = np.repeat(np.arange(n_centroids), cluster_sizes) + 1
points = np.zeros((cluster_sizes.sum(), 2))
points[:,0] = np.repeat(centroids[:,0], cluster_sizes)
points[:,1] = np.repeat(centroids[:,1], cluster_sizes)
points += 0.05 * np.random.randn(cluster_sizes.sum(), 2)
Looks somewhat like this:
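(The original scatter plot is not reproduced here; a quick sketch with the pyplot module imported above regenerates a similar picture.)
plt.scatter(points[:, 0], points[:, 1], s=1, c=labels)
plt.show()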
Next, we define a diameter function for computing the largest intracluster distance, based on this approach using the convex hull.
# compute the diameter based on the convex hull
def diameter(pts):
    # fewer than 2 points: diameter is 0
    if pts.shape[0] <= 1:
        return 0
    # exactly 2 points: the distance between them
    if pts.shape[0] == 2:
        return np.sqrt(((pts[0] - pts[1]) ** 2).sum())
    # need at least 3 points to construct the convex hull;
    # the two points which are furthest apart occur as vertices of the hull
    hull = spatial.ConvexHull(pts)
    candidates = pts[hull.vertices]
    return spatial.distance_matrix(candidates, candidates).max()
For the Dunn index calculation, I assume that we have already computed the points, the cluster labels and the cluster centroids.
If the number of clusters is large, the following solution based on Pandas may perform well:
import pandas as pd

def dunn_index_pandas(pts, labels, centroids):
    # O(k n log(n)) with k clusters and n points; better performance with more even clusters
    # apply the diameter function to each cluster's block of points
    max_intracluster_dist = pd.DataFrame(pts).groupby(labels).apply(lambda g: diameter(g.values)).max()
    # O(k^2) with k clusters; can be reduced to O(k log(k))
    # get pairwise distances between centroids
    cluster_dmat = spatial.distance_matrix(centroids, centroids)
    # fill diagonal with +inf: ignore zero distance to self in the "min" computation
    np.fill_diagonal(cluster_dmat, np.inf)
    min_intercluster_dist = cluster_dmat.min()
    return min_intercluster_dist / max_intracluster_dist
Otherwise, we can continue with a pure numpy solution.
def dunn_index(pts, labels, centroids):
    # O(k n log(n)) with k clusters and n points; better performance with more even clusters
    max_intracluster_dist = max(diameter(pts[labels == i]) for i in np.unique(labels))
    # O(k^2) with k clusters; can be reduced to O(k log(k))
    # get pairwise distances between centroids
    cluster_dmat = spatial.distance_matrix(centroids, centroids)
    # fill diagonal with +inf: ignore zero distance to self in the "min" computation
    np.fill_diagonal(cluster_dmat, np.inf)
    min_intercluster_dist = cluster_dmat.min()
    return min_intercluster_dist / max_intracluster_dist
%time dunn_index(points, labels, centroids)
# returned value 2.15
# in 2.2 seconds
%time dunn_index_pandas(points, labels, centroids)
# returned 2.15
# in 885 ms
For 1000 clusters with i.i.d. ~U[1, 1000] cluster sizes this takes 2.2 seconds on my machine. It drops to under a second (885 ms) with the Pandas approach for this example (many small clusters).
There are two further optimization opportunities that are relevant when the number of clusters is large:
First, I am computing the minimal intercluster distance with a brute force O(k^2) approach where k is the number of clusters. This can be reduced to O(k log(k)), as discussed here.
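A minimal sketch of that reduction, querying the centroids against themselves with scipy's cKDTree (column 0 of the result is each centroid's zero distance to itself):
tree = spatial.cKDTree(centroids)
d, _ = tree.query(centroids, k=2)
min_intercluster_dist = d[:, 1].min()  # distance of the closest pair of distinct centroids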
Second, max(diameter(pts[labels==i]) for i in np.unique(labels)) requires k passes over an array of size n. With many clusters this can become the bottleneck (as in this example). It is somewhat mitigated with the pandas approach, but I expect it can be optimized a lot further. For the current parameters, roughly one third of compute time is spent outside of computing the intercluster and intracluster distances.
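One possible single-pass alternative (a sketch, not benchmarked here): sort the points by label once and split, instead of building k boolean masks:
order = np.argsort(labels)
_, counts = np.unique(labels, return_counts=True)  # unique labels come back sorted, matching the argsort order
groups = np.split(points[order], np.cumsum(counts)[:-1])
max_intracluster_dist = max(diameter(g) for g in groups)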
It's not about optimizing the algorithm itself, but I think one of the following suggestions can improve performance.
Use multiprocessing's pool of workers.
Extract the Python code to C/C++. Refer to the official documentation.
There are also Performance Tips on https://www.python.org.

Custom metric for NearestNeighbors sklearn

Hello, I'm on a project where I use 512-bit hashes to create clusters. I'm using a custom metric: bitwise Hamming distance. But when I compare two hashes with that function I obtain different distance results than when using NearestNeighbors.
Extending this to DBSCAN with eps=5, the clusters are created with some consistency and appear correctly clustered. But when I check the distance between points from the same cluster, I get enormous distances. Here is an example.
Example:
This is a list of points from 2 clusters created by DBSCAN. As you can see, calling the function directly gives distances bigger than 30, while NearestNeighbors gives results consistent with eps=5.
from sklearn.neighbors import NearestNeighbors
hash_list_1 = [2711636196460699638441853508983975450613573844625556129377064665736210167114069990407028214648954985399518205946842968661290371575620508000646896480583712,
2711636396252606881895803338309150146134565539796776390549907030396205082681800682439355456735713892762967881436259141637319066484744271299497977370896760,
2711636396252606881918517135048330084905033589325484952567856239496981859330884970846906663264518266744879431357749780779892124020350824669153434630258784,
2711636396252797418317524490088561493800258861799581574018898781319096107333812163580085003775074676924785748114206505865657620572909617106316367216148512,
2711636196460318585955127494483972276879239064090689852809978361705086216958169367104329622890567955158961917611852516176654399246340379120409329566384160,
2711636396252606881918605860499354102197401318666579124151729671752374458560929422237113300739169875232495266727513833203360007861082211711747836501459040,
2685449071597530523833230885351500532369477539914318172159429043161052628696351016818586542171509728747070238075233795777242761861490021015910382103951968,
2685449271584547381638295372872027557715092296493457397817270817010861872186702795218797216694169625716749654321460983923962566367029011600112932108533792,
2685449071792640184514638654713547133316375160837810451952682241651988724244365461216285304336254942220323815140042850082680124299635209323646382761738272,
1847461275963134712629870519594779049860430827711272857522520377357653173694038204556169999876899727026751811340128091158803029889914422883922033917198368,
2711636396252606881901567718540735842607739343712295416931961674938924754114357607352250040524848697769853213132484145241805622979375000168935113673834592,
2711636396252606881901567718538101947732706353297593371282460773094032493492652041376662823635245997887100968237677157520342076957158825588198798784364576]
hash_list_2 = [1677246762479319235863065539858628614044010438213592493389244703420353559152336301659250128835190166728647823546464421558167523127086351613289685036466208,
1677246762479700308655934218233084077989052614799077817712715603728397519829375248244181345837838956827991047769168833176865438232999821278031784406056992,
1677246762479700314487411751526941880161990070273187005125752885368412445003620183982282356578440274746789782460884881633682918768578649732794162647826464,
1677246762479319238759152196394352786642547660315097253847095508872934279466872914748604884925141826161428241625796765725368284151706959618924400925900832,
1677246762479890853811162999308711253291696853123890392766127782305403145675433285374478727414572392743118524142664546768046227747593095585347134902140960,
1677246765601448867710925237522621090876591539557992237656925108430781026329148912958069241932475038282622646533152559554888274158032061637714105308528752,
1678883457783648388335228538833424204662395277995143067623864457726472665342252064374635323999849241968448535982901839797440478656657327613912450890367008,
1677246765601448864793634462245189770642489500950753120409198344054454862566173176691699195659218600616315451200851360013275424257209428603245704937128032,
1677246762479700314471974894075267160937462491405299015541470373650765401692659096424270522124311243007780041455682577230603077926878181390448030335795232,
1677246762479700317400446530288778920091525622772690226165317385340164047644547471081180880454458397836230795248631079659291423401151022423365062554976288,
1677246762479700317400446530288758590086745084806873060513679821541689120894219120403259478342385343805541797540566045409406476458247878183422733877936160,
2516871453405060707064684111867902766968378200849671168835363528433280949578746081906100803610196553501646503982070255639855643685380535999494563083255776,
1677246762479319230037086118512223039643232176451879100417048497454912234466993748113993020733268935613563596294183318283010061477487433484794582123053088,
1677246762479319235834673272207747972667132521699112379991979781620810490520617303678451683578338921267417975279632387450778387555221361833006151849902112,
1677246762479700305748490595643272813492272250002832996415474372704463760357437926852625171223210803220593114114602433734175731538424778624130491225112608]
def custom_metric(x, y):
    return bin(int(x[0]) ^ int(y[0])).count('1')
objective_hash = hash_list_1[0]
complete_list = hash_list_1 + hash_list_2
distance = [custom_metric([objective_hash], [hash_point]) for hash_point in complete_list]
print("Function iteration distance:")
print(distance)
neighbors_model = NearestNeighbors(radius=100, algorithm='ball_tree',
                                   leaf_size=2,
                                   metric=custom_metric,
                                   metric_params=None,
                                   n_jobs=4)
X = [[x] for x in complete_list]
neighbors_model.fit(X)
# the query point must be passed as a 2-D array (one row, one feature)
distance, neighborhoods = neighbors_model.radius_neighbors([[objective_hash]], 100, return_distance=True)
print("Nearest Neighbors distance:")
print(distance)
print("Nearest Neighbors index:")
print(neighborhoods)
The problem:
NumPy can't handle integers this big and converts them to float, losing a lot of precision (a 512-bit hash does not fit in float64's 53-bit significand).
The solution:
Precompute all the pairwise distances with your custom metric and feed them to the DBSCAN algorithm with metric='precomputed'.
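A rough sketch of that workaround (eps comes from the question; min_samples is an illustrative choice):
import numpy as np
from sklearn.cluster import DBSCAN

def hamming(a, b):
    return bin(a ^ b).count('1')

hashes = hash_list_1 + hash_list_2
n = len(hashes)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = hamming(hashes[i], hashes[j])  # exact integer arithmetic, no float conversion

labels = DBSCAN(eps=5, min_samples=2, metric='precomputed').fit_predict(dist)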

K modes calculating distance between each point and cluster centroid

I have a set of categorical variables to be clustered, so I am using k-modes taken from a GitHub package. I want to get the distance of each observation (point) to the centroid of the cluster it belongs to.
This is what I have implemented so far:
kmodes_cao = kmodes.KModes(n_clusters=6, init='Cao', verbose=1)
kmodes_cao.fit_predict(data)
# Print cluster centroids of the trained model.
print('k-modes (Cao) centroids:')
print(kmodes_cao.cluster_centroids_)
# Print training statistics
print('Final training cost: {}'.format(kmodes_cao.cost_))
print('Training iterations: {}'.format(kmodes_cao.n_iter_))
I cannot use the Euclidean distance since the variables are categorical. What is the ideal way to calculate the distance of each point to its cluster centroid?
Example: say you have 2 variables, V1 which can take A or B, and V2 which can take C or D.
If your centroid is V1=A and V2=D:
For each variable i, count the cases where Vi != Ci (the centroid's value for variable i).
So if you have an instance with V1=A and V2=C, then its distance from the centroid is 1.
It is a binary (matching) distance.
Hope that helps.
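A hedged sketch of that idea with the model fitted above (assuming `data` is a 2-D array or DataFrame of categorical values; fit_predict is re-run here only to capture the labels):
import numpy as np

labels = kmodes_cao.fit_predict(data)       # cluster assignment of each observation
centroids = kmodes_cao.cluster_centroids_   # one mode vector per cluster

# matching distance = number of attributes where the point differs from its own centroid
dist_to_centroid = (np.asarray(data) != centroids[labels]).sum(axis=1)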
You can use the method matching_dissim() from the kmodes library.
It compares two rows of your dataset; one can be your centroid and the other any observation. First install the kmodes package (it depends on pandas), then import the method with this line:
from kmodes.util.dissim import matching_dissim
https://github.com/nicodv/kmodes/issues/39

How to find the nearest neighbors of 1 Billion records with Spark?

Given 1 billion records containing the following information:
ID x1 x2 x3 ... x100
1 0.1 0.12 1.3 ... -2.00
2 -1 1.2 2 ... 3
...
For each ID above, I want to find the top 10 closest IDs, based on Euclidean distance of their vectors (x1, x2, ..., x100).
What's the best way to compute this?
As it happens, I have a solution to this, involving combining sklearn with Spark: https://adventuresindatascience.wordpress.com/2016/04/02/integrating-spark-with-scikit-learn-visualizing-eigenvectors-and-fun/
The gist of it is:
Use sklearn’s k-NN fit() method centrally
But then use sklearn's k-NN kneighbors() method in a distributed fashion
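A rough sketch of that idea (names such as `spark` and `id_vec_rdd`, an RDD of (id, feature_list) pairs, are illustrative; the reference data has to fit on the driver, or be sampled, for the central fit):
import numpy as np
from sklearn.neighbors import NearestNeighbors

ref = np.array(id_vec_rdd.map(lambda kv: kv[1]).collect())  # gather the reference vectors centrally
nbrs = NearestNeighbors(n_neighbors=10).fit(ref)            # fit once on the driver
bc = spark.sparkContext.broadcast(nbrs)                     # ship the fitted model to the executors

def knn_partition(rows):
    rows = list(rows)
    ids = [r[0] for r in rows]
    X = np.array([r[1] for r in rows])
    _, idx = bc.value.kneighbors(X)                         # kneighbors() runs distributedly on each partition
    return zip(ids, idx.tolist())

top10 = id_vec_rdd.mapPartitions(knn_partition)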
Performing a brute-force comparison of all records against all records is a losing battle. My suggestion would be to go for a ready-made implementation of k-Nearest Neighbor algorithm such as the one provided by scikit-learn then broadcast the resulting arrays of indices and distances and go further.
Steps in this case would be:
1- vectorize the features as Bryce suggested and let your vectorizing method return a list (or numpy array) of floats with as many elements as your features
2- fit your scikit-learn nn to your data:
nbrs = NearestNeighbors(n_neighbors=10, algorithm='auto').fit(vectorized_data)
3- run the trained algorithm on your vectorized data (training and query data are the same in your case)
distances, indices = nbrs.kneighbors(qpa)
Steps 2 and 3 will run on your pyspark node and are not parallelizable in this case. You will need to have enough memory on this node. In my case with 1.5 Million records and 4 features, it took a second or two.
Until we get a good implementation of NN for Spark, I guess we will have to stick to these workarounds. If you'd rather try something new, go for http://spark-packages.org/package/saurfang/spark-knn
You haven't provided a lot of detail, but the general approach I would take to this problem would be to:
Convert the records to a data structure like a LabeledPoint with (ID, x1..x100) as label and features
Map over each record and compare that record to all the other records (lots of room for optimization here)
Create some cutoff logic so that once you start comparing ID = 5 with ID = 1 you interrupt the computation because you have already compared ID = 1 with ID = 5
Some reduce step to get a data structure like {id_pair: [1,5], distance: 123}
Another map step to find the 10 closest neighbors of each record
You've identified pyspark and I generally do this type of work using scala, but some pseudo code for each step might look like:
# 1. vectorize the features
def vectorize_raw_data(record)
    arr_of_features = record[1..99]
    LabeledPoint( record[0] , arr_of_features )

# 2, 3 + 4: map over each record for comparison
broadcast_var = []
def calc_distance(record, comparison)
    # here you want to keep a broadcast variable with a list or dictionary of
    # already compared IDs and break if the key pair already exists
    # then, calc the euclidean distance by mapping over the features of
    # the record and subtracting the values then squaring the result, keeping
    # a running sum of those squares and square rooting that sum
    return {"id_pair" : [1,5], "distance" : 123}

for record in allRecords:
    for comparison in allRecords:
        broadcast_var.append( calc_distance(record, comparison) )

# 5. map for 10 closest neighbors
def closest_neighbors(record, n=10)
    broadcast_var.filter(x => x.id_pair.include?(record.id) ).takeOrdered(n, distance)
The pseudocode is terrible, but I think it communicates the intent. There will be a lot of shuffling and sorting here as you are comparing all records with all other records. IMHO, you want to store the key pair/distance in a central place (like a broadcast variable that gets updated, though this is dangerous) to reduce the total number of Euclidean distance calculations you perform.
