K means with a condition

K means with a condition - scikit-learn

I want to apply K means ( or any other simple clustering algorithm ) to data with two variables, but i want clusters to respect a condition : the sum of a third variable per cluster > some_value.
Is that possible?

Notations :
- K is the number of clusters
- let's say that the first two variables are point coordinnates (x,y)
- V denotes the third variable
- Ci : the sum of V over each cluster i
- S the total sum (sum Ci)
- and the threshold T
Problem definition :
From what I understood, the aim is to run an algorithm that keeps the spirit of kmeans while respecting the constraints.
Task 1 - group points by proximity to centroids [kmeans]
Task 2 - for each cluster i, Ci > T* [constraint]
Regular kmeans limitation for the constraint problem :
A regular kmeans, assign points to centroids by taking them in arbitrary order. In our case, this will lead to uncontrol growth of the Ci while adding points.
For exemple, with K=2, T=40 and 4 points with the third variables equal to V1=50, V2=1, V3=50, V4=50.
Suppose also that point P1, P3, P4 are closer to centroid 1. Point P2 is closer to centroid 2.
Let's run the assignement step of a regular kmeans and keep track of Ci :
1-- take point P1, assign it to cluster 1. C1=50 > T
2-- take point P2, assign it to cluster 2 C2=1
3-- take point P3, assign it to cluster 1. C1=100 > T => C1 grows too much !
4-- take point P4, assign it to cluster 1. C1=150 > T => !!!
Modified kmeans :
In the previous, we want to prevent C1 from growing too much and help C2 grow.
This is like pouring champagne into several glasses : if you see a glass with less champaigne, you go and fill it. You do that because you have constraints : limited amound of champaigne (S is bounded) and because you want every glass to have enough champaign (Ci>T).
Of course this is just a analogy. Our modified kmeans will add new poins to the cluster with minimal Ci until the constraint is achieved (Task2). Now in which order should we add points ? By proximity to centroids (Task1). After all constraints are achieved for all cluster i, we can just run a regular kmeans on remaining unassigned points.
Implementation :
Next, I give a python implementation of the modified algorithm. Figure 1 displays a reprensentation of the third variable using transparency for vizualizing large VS low values. Figure 2 displays the evolution clusters using color.
You can play with the accept_thresh parameter. In particular, note that :
For accept_thresh=0 => regular kmeans (constraint is reached immediately)
For accept_thresh = third_var.sum().sum() / (2*K), you might observe that some points that closer to a given centroid are affected to another one for constraint reasons.
CODE :
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
import time
nb_samples = 1000
K = 3 # for demo purpose, used to generate cloud points
c_std = 1.2
# Generate test samples :
points, classes = datasets.make_blobs(n_features=2, n_samples=nb_samples, \
centers=K, cluster_std=c_std)
third_var_distribution = 'cubic_bycluster' # 'uniform'
if third_var_distribution == 'uniform':
third_var = np.random.random((nb_samples))
elif third_var_distribution == 'linear_bycluster':
third_var = np.random.random((nb_samples))
third_var = third_var * classes
elif third_var_distribution == 'cubic_bycluster':
third_var = np.random.random((nb_samples))
third_var = third_var * classes
# Threshold parameters :
# Try with K=3 and :
# T = K => one cluster reach cosntraint, two clusters won't converge
# T = 2K =>
accept_thresh = third_var.sum().sum() / (2*K)
def dist2centroids(points, centroids):
'''return arrays of ordered points to each centroids
first array is index of points
second array is distance to centroid
dim 0 : centroid
dim 1 : distance or point index
'''
dist = np.sqrt(((points - centroids[:, np.newaxis]) ** 2).sum(axis=2))
ord_dist_indices = np.argsort(dist, axis=1)
ord_dist_indices = ord_dist_indices.transpose()
dist = dist.transpose()
return ord_dist_indices, dist
def assign_points_with_constraints(inds, dists, tv, accept_thresh):
assigned = [False] * nb_samples
assignements = np.ones(nb_samples, dtype=int) * (-1)
cumul_third_var = np.zeros(K, dtype=float)
current_inds = np.zeros(K, dtype=int)
max_round = nb_samples * K
for round in range(0, max_round): # we'll break anyway
# worst advanced cluster in terms of cumulated third_var :
cluster = np.argmin(cumul_third_var)
if cumul_third_var[cluster] > accept_thresh:
continue # cluster had enough samples
while current_inds[cluster] < nb_samples:
# add points to increase cumulated third_var on this cluster
i_inds = current_inds[cluster]
closest_pt_index = inds[i_inds][cluster]
if assigned[closest_pt_index] == True:
current_inds[cluster] += 1
continue # pt already assigned to a cluster
assignements[closest_pt_index] = cluster
cumul_third_var[cluster] += tv[closest_pt_index]
assigned[closest_pt_index] = True
current_inds[cluster] += 1
new_cluster = np.argmin(cumul_third_var)
if new_cluster != cluster:
break
return assignements, cumul_third_var
def assign_points_with_kmeans(points, centroids, assignements):
new_assignements = np.array(assignements, copy=True)
count = -1
for asg in assignements:
count += 1
if asg > -1:
continue
pt = points[count, :]
distances = np.sqrt(((pt - centroids) ** 2).sum(axis=1))
centroid = np.argmin(distances)
new_assignements[count] = centroid
return new_assignements
def move_centroids(points, labels):
centroids = np.zeros((K, 2), dtype=float)
for k in range(0, K):
centroids[k] = points[assignements == k].mean(axis=0)
return centroids
rgba_colors = np.zeros((third_var.size, 4))
rgba_colors[:, 0] = 1.0
rgba_colors[:, 3] = 0.1 + (third_var / max(third_var))/1.12
plt.figure(1, figsize=(14, 14))
plt.title("Three blobs", fontsize='small')
plt.scatter(points[:, 0], points[:, 1], marker='o', c=rgba_colors)
# Initialize centroids
centroids = np.random.random((K, 2)) * 10
plt.scatter(centroids[:, 0], centroids[:, 1], marker='X', color='red')
# Step 1 : order points by distance to centroid :
inds, dists = dist2centroids(points, centroids)
# Check if clustering is theoriticaly possible :
tv_sum = third_var.sum()
tv_max = third_var.max()
if (tv_max > 1 / 3 * tv_sum):
print("No solution to the clustering problem !\n")
print("For one point : third variable is too high.")
sys.exit(0)
stop_criter_eps = 0.001
epsilon = 100000
prev_cumdist = 100000
plt.figure(2, figsize=(14, 14))
ln, = plt.plot([])
plt.ion()
plt.show()
while epsilon > stop_criter_eps:
# Modified kmeans assignment :
assignements, cumul_third_var = assign_points_with_constraints(inds, dists, third_var, accept_thresh)
# Kmeans on remaining points :
assignements = assign_points_with_kmeans(points, centroids, assignements)
centroids = move_centroids(points, assignements)
inds, dists = dist2centroids(points, centroids)
epsilon = np.abs(prev_cumdist - dists.sum().sum())
print("Delta on error :", epsilon)
prev_cumdist = dists.sum().sum()
plt.clf()
plt.title("Current Assignements", fontsize='small')
plt.scatter(points[:, 0], points[:, 1], marker='o', c=assignements)
plt.scatter(centroids[:, 0], centroids[:, 1], marker='o', color='red', linewidths=10)
plt.text(0,0,"THRESHOLD T = "+str(accept_thresh), va='top', ha='left', color="red", fontsize='x-large')
for k in range(0, K):
plt.text(centroids[k, 0], centroids[k, 1] + 0.7, "Ci = "+str(cumul_third_var[k]))
plt.show()
plt.pause(1)
Improvements :
- use the distribution of the third variable for assignments.
- manage divergence of the algorithm
- better initialization (kmeans++)

One way to handle this would be to filter the data before clustering.
>>> cluster_data = df.loc[df['third_variable'] > some_value]
>>> from sklearn.cluster import KMeans
>>> y_pred = KMeans(n_clusters=2).fit_predict(cluster_data)
If by sum you mean the sum of the third variable per cluster then you could use RandomSearchCV to find hyperparameters of KMeans that do or do not meet the condition.

K-means itself is an optimization problem.
Your additional constraint is a rather common optimization constraint, too.
So I'd rather approach this with an optimization solver.

Related

Python calculation of LennardJones 2D interaction pair correlation distribution function in Grand Canonical Ensemble

Edit
I believe there is a problem with the normalization of the histogram, since one must divide with the radius of each element.
I am trying trying to calculate the fluctuations of particle number and the radial distribution function of a 2d LennardJones(LJ) system using python3. Although I believe the particle fluctuations come out right, the pair correlation g(r) come right for small distances but then blow up ( the calculation uses numpy's histogram method).
The thing is, I can' t find out why such a behavior emerges- perhaps of some misunderstanding of a method? As it is, I am posting the relevant code right below, and if needed, I could also upload other parts of the code or the entire script.
Note first, that since we are working with the Grand-Canonical Ensemble, as the number of particles changes, so is the array that stores the particles- and perhaps that's another point where a mistake in implementation could exist.
Particle removal or insertion
def mcex(L,npart,particles,beta,rho0,V,en):
factorin=(rho0*V)/(npart+1)
factorout=(npart)/(V*rho0)
print("factorin=",factorin)
print("factorout",factorout)
# Produce random number and check:
rand = random.uniform(0, 1)
if rand <= 0.5:
# Insert a particle at a random location
x_new_coord = random.uniform(0, L)
y_new_coord = random.uniform(0, L)
new_particle = [x_new_coord,y_new_coord]
new_E = particleEnergy(new_particle,particles, npart+1)
deltaE = new_E
print("dEin=",deltaE)
# Acceptance rule for inserting
if(deltaE>10):
P_in=0
else:
P_in = (factorin) *math.exp(-beta*deltaE)
print("pinacc=",P_in)
rand= random.uniform(0, 1)
if rand <= P_in :
particles.append(new_particle)
en += deltaE
npart += 1
print("accepted insertion")
else:
if npart != 0:
p = random.randint(0, npart-1)
this_particle = particles[p]
prev_E = particleEnergy(this_particle, particles, p)
deltaE = prev_E
print("dEout=",deltaE)
# Acceptance rule for removing
if(deltaE>10):
P_re=1
else:
P_re = (factorout)*math.exp(beta*deltaE)
print("poutacc=",P_re)
rand = random.uniform(0, 1)
if rand <= P_re :
particles.remove(this_particle)
en += deltaE
npart = npart - 1
print("accepted removal")
print()
return particles, en, npart
Monte Carlo relevant part: for 1/10 runs, check the possibility of inserting or removing a particle
# MC
for step in range(0, runTimes):
print(step)
print()
rand = random.uniform(0,1)
if rand <= 0.9:
#----------- change energies-------------------------
#........
#........
else:
particles, en, N = mcex(L,N,particles,beta,rho0,V, en)
# stepList.append(step)
if((step+1)%1000==0):
for i, particle1 in enumerate(particles):
for j, particle2 in enumerate(particles):
if j!= i:
# print(particle1)
# print(particle2)
# print(i)
# print(j)
dist.append(distancesq(particle1, particle2))
NList.append(N)
where we call the function mcex and perhaps the particles array is not updated correctly:
def mcex(L,npart,particles,beta,rho0,V,en):
factorin=(rho0*V)/(npart+1)
factorout=(npart)/(V*rho0)
print("factorin=",factorin)
print("factorout",factorout)
# Produce random number and check:
rand = random.uniform(0, 1)
if rand <= 0.5:
# Insert a particle at a random location
x_new_coord = random.uniform(0, L)
y_new_coord = random.uniform(0, L)
new_particle = [x_new_coord,y_new_coord]
new_E = particleEnergy(new_particle,particles, npart+1)
deltaE = new_E
print("dEin=",deltaE)
# Acceptance rule for inserting
if(deltaE>10):
P_in=0
else:
P_in = (factorin) *math.exp(-beta*deltaE)
print("pinacc=",P_in)
rand= random.uniform(0, 1)
if rand <= P_in :
particles.append(new_particle)
en += deltaE
npart += 1
print("accepted insertion")
else:
if npart != 0:
p = random.randint(0, npart-1)
this_particle = particles[p]
prev_E = particleEnergy(this_particle, particles, p)
deltaE = prev_E
print("dEout=",deltaE)
# Acceptance rule for removing
if(deltaE>10):
P_re=1
else:
P_re = (factorout)*math.exp(beta*deltaE)
print("poutacc=",P_re)
rand = random.uniform(0, 1)
if rand <= P_re :
particles.remove(this_particle)
en += deltaE
npart = npart - 1
print("accepted removal")
print()
return particles, en, npart
and finally, we create the g(r) histogramm
where perhaps the normalization or the use of the histogram method are not as they should
RDF(N,particles,L)
with the function:
def RDF(N,particles, L):
minb=0
maxb=8
nbin=500
skata=np.asarray(dist).flatten()
rDf = np.histogram(skata, np.linspace(minb, maxb,nbin))
prefactor = (1/2/ np.pi)* (L**2/N **2) /len(dist) *( nbin /(maxb -minb) )
# prefactor = (1/(2* np.pi))*(L**2/N**2)/(len(dist)*num_increments/(rMax + 1.1 * dr ))
rDf = [prefactor*rDf[0], 0.5*(rDf[1][1:]+rDf[1][:-1])]
print('skata',len(rDf[0]))
print('incr',len(rDf[1]))
plt.figure()
plt.plot(rDf[1],rDf[0])
plt.xlabel("r")
plt.ylabel("g(r)")
plt.show()
The results are:
Particle N number fluctuations
and
[
but we want

Although I have accepted an answer, I am posting here some more details.
To normalize the pair correlation correctly one must divide each "number of particles found at a certain distance" or mathematically the sum of delta function of the distances , one must divide with the distance it's self.
Understanding first that a numpy.histogram is an array of two elements, first element the array of all counted events and second element the vector of bins, one must take each element of the first array, lets say np.histogram[0] and multiply it pairwise with np.histogram[1] of the second array.
That is, one must do the following:
def RDF(N,particles, L):
minb=0
maxb=25
nbin=200
width=(maxb-minb)/(nbin)
rings=np.linspace(minb, maxb,nbin)
skata=np.asarray(dist).flatten()
rDf = np.histogram(skata, rings ,density=True)
prefactor = (1/( np.pi*(L**2/N**2)))
rDf = [prefactor*rDf[0], 0.5*(rDf[1][1:]+rDf[1][:-1])]
rDf[0]=np.multiply(rDf[0],1/(rDf[1]*( width )))
where before the last multiply line, we are centering the bins so that their numbers equals the number of elements of the first array( you have five fingers, but four intermediate gaps between them)

Your g(r) is not correctly normalised. You need to divide the number of pairs found in each bin by the average density of the system times the area of the annulus associated to that bin, where the latter is just 2 pi r dr, with r being the bin's midpoint and dr the bin size. As far as I can tell, your prefactor does not contain the "r" bit. There is also something else that is missing, but it's hard to tell without knowing what all the other constants contain.
EDIT: here is a link that will guide you the implementation of a routine to compute the radial distribution function in 2D and 3D

How to vectorize a function of two matrices in numpy?

Say, I have a binary (adjacency) matrix A of dimensions nxn and another matrix U of dimensions nxl. I use the following piece of code to compute a new matrix that I need.
import numpy as np
from numpy import linalg as LA
new_U = np.zeros_like(U)
for idx, a in np.ndenumerate(A):
diff = U[idx[0], :] - U[idx[1], :]
if a == 1.0:
new_U[idx[0], :] += 2 * diff
elif a == 0.0:
norm_diff = LA.norm(U[idx[0], :] - U[idx[1], :])
new_U[idx[0], :] += -2 * diff * np.exp(-norm_diff**2)
return new_U
This takes quite a lot of time to run even when n and l are small. Is there a better way to rewrite (vectorize) this code to reduce the runtime?
Edit 1: Sample input and output.
A = np.array([[0,1,0], [1,0,1], [0,1,0]], dtype='float64')
U = np.array([[2,3], [4,5], [6,7]], dtype='float64')
new_U = np.array([[-4.,-4.], [0,0],[4,4]], dtype='float64')
Edit 2: In mathematical notation, I am trying to compute the following:
where u_ik = U[i, k],u_jk = U[j, k], and u_i = U[i, :]. Also, (i,j) \in E corresponds to a == 1.0 in the code.

Leveraging broadcasting and np.einsum for the sum-reductions -
# Get pair-wise differences between rows for all rows in a vectorized manner
Ud = U[:,None,:]-U
# Compute norm L1 values with those differences
L = LA.norm(Ud,axis=2)
# Compute 2 * diff values for all rows and mask it with ==0 condition
# and sum along axis=1 to simulate the accumulating behaviour
p1 = np.einsum('ijk,ij->ik',2*Ud,A==1.0)
# Similarly, compute for ==1 condition and finally sum those two parts
p2 = np.einsum('ijk,ij,ij->ik',-2*Ud,np.exp(-L**2),A==0.0)
out = p1+p2
Alternatively, use einsum for computing squared-norm values and using those to get p2 -
Lsq = np.einsum('ijk,ijk->ij',Ud,Ud)
p2 = np.einsum('ijk,ij,ij->ik',-2*Ud,np.exp(-Lsq),A==0.0)

How do I implement mean shift by using a grid of centroids?

This is for a class and I would really appreciate your help! I made some changes based on a comment I received, but now I get another error..
I need to modify an existing function that implements the mean-shift algorithm, but instead of initializing all the points as the first set of centroids, the function creates a grid of centroids with the grid based on the radius. I also need to delete the centroids that don't contain any data points. My issue is that I don't understand how to fix the error I get!
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
<ipython-input-7-de18ffed728f> in <module>()
49 centroids = initialize_centroids(x)
50
---> 51 new_centroids = update_centroids(x, centroids, r = 1)
52
53 print(len(centroids))
<ipython-input-7-de18ffed728f> in update_centroids(data, centroids, r)
26 #print(len(centroids))
27 #print(range(len(centroids)))
---> 28 centroid = centroids[i]
29 for data_point in data:
30 if np.linalg.norm(data_point - centroid) < r:
IndexError: index 2 is out of bounds for axis 0 with size 2
I tried using the range of the input dataset as boundaries for a grid, with the points separated by the radius.
from sklearn import datasets
import numpy as np
import matplotlib.pyplot as plt
def initialize_centroids(data, r = 1):
'''Creates a grid of centroids with grid based on radius'''
data = np.array(data)
xi,yi = min(range(len(data))), max(range(len(data)))
mx = np.arange(xi,yi,r)
x,y = np.meshgrid(mx,mx)
centroids=np.vstack([x.ravel(), y.ravel()])
return centroids
#update centroids based on mean of points that fall within a specified radius of each centroid
def update_centroids(data, centroids, r = 1):
new_centroids = []
for i in centroids:
in_radius = []
centroid = centroids[i] #this is where the error occurs
for data_point in data:
if np.linalg.norm(data_point - centroid) < radius:
in_radius.append(data_point) #this list is appended by adding the new centroid to it if the above conition is satisfied.
new_centroid = np.mean(in_radius, axis=0)
#maybe another way to do the next part
new_centroids.append(tuple(new_centroid))
unique_centroids = sorted(list(set(new_centroids))) #for element in in_radius, if element in set skip else set.append(element(in_rad)). append does not work with set.
new_centroids = {i:np.array(unique_centroids[i]) for i in range(len(unique_centroids))}
return new_centroids
#test function on:
x, y = datasets.make_blobs(n_samples=300, n_features = 2, centers=[[0, 7], [0, -7], [5,7], [5, 0]])
centroids = initialize_centroids(x)
new_centroids = update_centroids(x, centroids, radius = 2)
print(len(centroids))
print()
print(len(new_centroids))
#code for plotting initially:
plt.scatter(x[:,0], x[:,1], color = 'k')
for i in range(len(new_centroids)):
plt.scatter(new_centroids[i][0], new_centroids[i][1], s=200, color = 'r', marker = "*")
#code for plotting updated centroids:
new_centroids = update_centroids(x, new_centroids, radius = 2)
plt.scatter(x[:,0], x[:,1], color = 'k')
for i in range(len(new_centroids)):
plt.scatter(new_centroids[i][0], new_centroids[i][1], s=200, color = 'r', marker = "*")
#code for iterations:
def iterate_to_conv(data, max_iter=100):
centroids = initialize_centroids(data)
iter_count = 0
while iter_count <= max_iter:
new_centroids = update_centroids(data, centroids, radius = 2)
centroids = new_centroids
iter_count += 1
return centroids
centroids = iterate_to_conv(x)
plt.scatter(x[:,0], x[:,1], color = 'k')
for i in range(len(centroids)):
plt.scatter(centroids[i][0], centroids[i][1], s=200, color = 'r', marker = "*")
The function needs to return the number of final centroids. I haven't gotten ahead far enough to know how the entire implementation of mean-shift would work with this function..

When you are running that loop: for i in centroids the i that is iterated through centroids isn't a number, it is a vector which is why an error is pops up. For example, the first i value might be equal to [0 1 2 0 1 2 0 1 2]. So to take an index of that doesn't make sense. What your code is saying to do is to take centroid = centroid[n1 n2 nk]. To fix it, you really need to change how your initialize centroid function works. Meshgrid also won't create an N dimensional grid, so your meshgrid might work for 2 dimensions but not N. I hope that helps.

Grouping algorithm based on distance

I have a set of 200 points with x and y coordinates. I need to make batches of 20 such that each point in the batch is "n" centimeters away from the other 19 i.e. no two points in one batch are within "n" cm of the other. A point should only belong to one batch. How do I solve this?
I've used trees to draw out branches such that a new node is added only if it is "n" cm away from all other nodes in the branch. This works but is extremely slow.
Input.csv: |Point Name||X coordinate||Y coordinate|
Output: Lists of batches

From what I guess from your question, this should to it:
import math
from random import randrange
test_data = {(randrange(0, 1000), randrange(0, 1000)) for _ in range(200)}
inf = float("inf")
def distance(p1, p2):
return math.hypot(p2[0] - p1[0], p2[1] - p1[1])
def batch(data: set, min_distance, max_distance=inf, count=20, remove=True):
result = [next(iter(data))]
candidates = {t for t in data if max_distance > distance(result[0], t) > min_distance}
while len(result) < count and len(candidates) > 0:
result.append(candidates.pop())
candidates = {t for t in data if all(max_distance > distance(p, t) > min_distance for p in result)}
if len(result)<count:
raise ValueError("Not enough values in data that have great enough distances between each other")
if remove:
data.difference_update(result)
return result
print(len(test_data))
i = 1
while test_data:
print(i,batch(test_data, 10))
i+=1

finding optimum lambda and features for polynomial regression

I am new to Data Mining/ML. I've been trying to solve a polynomial regression problem of predicting the price from given input parameters (already normalized within range[0, 1])
I'm quite close as my output is in proportion to the correct one, but it seems a bit suppressed, my algorithm is correct, just don't know how to reach to an appropriate lambda, (regularized parameter) and how to decide to what extent I should populate features as the problem says : "The prices per square foot, are (approximately) a polynomial function of the features. This polynomial always has an order less than 4".
Is there a way we could visualize data to find optimum value for these parameters, like we find optimal alpha (step size) and number of iterations by visualizing cost function in linear regression using gradient descent.
Here is my code : http://ideone.com/6ctDFh
from numpy import *
def mapFeature(X1, X2):
degree = 2
out = ones((shape(X1)[0], 1))
for i in range(1, degree+1):
for j in range(0, i+1):
term1 = X1**(i-j)
term2 = X2 ** (j)
term = (term1 * term2).reshape( shape(term1)[0], 1 )
"""note that here 'out[i]' represents mappedfeatures of X1[i], X2[i], .......... out is made to store features of one set in out[i] horizontally """
out = hstack(( out, term ))
return out
def solve():
n, m = input().split()
m = int(m)
n = int(n)
data = zeros((m, n+1))
for i in range(0, m):
ausi = input().split()
for k in range(0, n+1):
data[i, k] = float(ausi[k])
X = data[:, 0 : n]
y = data[:, n]
theta = zeros((6, 1))
X = mapFeature(X[:, 0], X[:, 1])
ausi = computeCostVect(X, y, theta)
# print(X)
print("Results usning BFGS : ")
lamda = 2
theta, cost = findMinTheta(theta, X, y, lamda)
test = [0.05, 0.54, 0.91, 0.91, 0.31, 0.76, 0.51, 0.31]
print("prediction for 0.31 , 0.76 (using BFGS) : ")
for i in range(0, 7, 2):
print(mapFeature(array([test[i]]), array([test[i+1]])).dot( theta ))
# pyplot.plot(X[:, 1], y, 'rx', markersize = 5)
# fig = pyplot.figure()
# ax = fig.add_subplot(1,1,1)
# ax.scatter(X[:, 1],X[:, 2], s=y) # Added third variable income as size of the bubble
# pyplot.show()
The current output is:
183.43478288
349.10716957
236.94627602
208.61071682
The correct output should be:
180.38
1312.07
440.13
343.72

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string

K means with a condition - scikit-learn

I want to apply K means ( or any other simple clustering algorithm ) to data with two variables, but i want clusters to respect a condition : the sum of a third variable per cluster > some_value. Is that possible?

K-means itself is an optimization problem. Your additional constraint is a rather common optimization constraint, too. So I'd rather approach this with an optimization solver.

Related

Python calculation of LennardJones 2D interaction pair correlation distribution function in Grand Canonical Ensemble

How to vectorize a function of two matrices in numpy?

How do I implement mean shift by using a grid of centroids?

Grouping algorithm based on distance

finding optimum lambda and features for polynomial regression

Categories

Resources