Hello, can someone help me implement K-means with MapReduce using Spark? I can already run K-means with Spark, but I don't know how to express it as a map and a reduce step.
Thanks.
Below is a proposed pseudo-code for your exercise:

centroids = k randomly sampled points from the dataset

do:
    Map:
        Given a point and the set of centroids
        Calculate the distance between the point and each centroid
        Emit the point and the closest centroid

    Reduce:
        Given a centroid and the points belonging to its cluster
        Calculate the new centroid as the arithmetic mean position of the points
        Emit the new centroid

    prev_centroids = centroids
    centroids = new_centroids
while distance(prev_centroids, centroids) > threshold
The mapper class calculates the distance between the data point and each centroid, then emits the index of the closest centroid and the data point:
class MAPPER
    method MAP(file_offset, point)
        min_distance = POSITIVE_INFINITY
        closest_centroid = -1
        for all centroid in list_of_centroids
            distance = distance(centroid, point)
            if (distance < min_distance)
                closest_centroid = index_of(centroid)
                min_distance = distance
        EMIT(closest_centroid, point)
The reducer calculates the new approximation of the centroid and emits it.
class REDUCER
    method REDUCE(centroid_index, list_of_partial_sums)
        point_sum = 0
        point_sum.number_of_points = 0
        for all partial_sum in list_of_partial_sums:
            point_sum += partial_sum
            point_sum.number_of_points += partial_sum.number_of_points
        centroid_value = point_sum / point_sum.number_of_points
        EMIT(centroid_index, centroid_value)
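To make the map and reduce steps concrete before moving to Spark, here is a minimal, pure-Python sketch of a single iteration on a tiny in-memory example; the sample points and helper names are illustrative only, not part of the Spark code below:

import math
from collections import defaultdict

points = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (9.0, 11.0)]
centroids = [(1.0, 2.0), (8.0, 8.0)]   # pretend these were sampled at random

# Map: emit (index of the closest centroid, point)
def closest(point):
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

mapped = [(closest(p), p) for p in points]

# Reduce: for each centroid index, average the points assigned to it
groups = defaultdict(list)
for idx, p in mapped:
    groups[idx].append(p)

new_centroids = [tuple(sum(c) / len(groups[i]) for c in zip(*groups[i]))
                 for i in sorted(groups)]
print(new_centroids)   # [(1.25, 1.9), (8.5, 9.5)]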
The actual K-Means Spark implementation:
First you read the file with the points and generate the initial centroids with a random sampling, using takeSample(False, k): this function takes k random samples, without replacement, from the RDD, so the application generates the initial centroids in a distributed manner, avoiding moving all the data to the driver. Because the RDD is reused in every iteration, cache it in memory with cache() to avoid re-evaluating it every time an action is triggered:
import time  # used for the timing below

points = sc.textFile(INPUT_PATH).map(Point).cache()
initial_centroids = init_centroids(points, k=parameters["k"])

def init_centroids(dataset, k):
    start_time = time.time()
    initial_centroids = dataset.takeSample(False, k)
    print("init centroid execution:", len(initial_centroids), "in",
          (time.time() - start_time), "s")
    return initial_centroids
After that, you iterate the mapper and the reducer stages until the stopping criterion is met or the maximum number of iterations is reached.
n = 0  # iteration counter
while True:
    print("--Iteration n. {itr:d}".format(itr=n+1), end="\r", flush=True)
    cluster_assignment_rdd = points.map(assign_centroids)
    sum_rdd = cluster_assignment_rdd.reduceByKey(lambda x, y: x.sum(y))
    centroids_rdd = sum_rdd.mapValues(
        lambda x: x.get_average_point()).sortByKey(ascending=True)
    new_centroids = [item[1] for item in centroids_rdd.collect()]
    stop = stopping_criterion(new_centroids, parameters["threshold"])

    n += 1
    if not stop and n < parameters["maxiteration"]:
        centroids_broadcast = sc.broadcast(new_centroids)
    else:
        break
The stopping condition is computed this way:
def stopping_criterion(new_centroids, threshold):
    old_centroids = centroids_broadcast.value
    for i in range(len(old_centroids)):
        check = old_centroids[i].distance(new_centroids[i],
                                          distance_broadcast.value) <= threshold
        if not check:
            return False
    return True
In order to represent the points, a class Point has been defined. It's characterized by the following fields:
a numpy array of components
number of points: a point can be seen as the aggregation of many points, so this variable is used to track the number of points that are represented by the object
It includes the following operations:
distance (the type of distance can be passed as a parameter)
sum
get_average_point: this method returns a point whose components are the current components divided by the number of points represented by the object
import numpy as np
from numpy import linalg

class Point:
    def __init__(self, line):
        values = line.split(",")
        self.components = np.array([round(float(k), 5) for k in values])
        self.number_of_points = 1

    def sum(self, p):
        self.components = np.add(self.components, p.components)
        self.number_of_points += p.number_of_points
        return self

    def distance(self, p, h):
        if h < 0:
            h = 2
        return linalg.norm(self.components - p.components, h)

    def get_average_point(self):
        self.components = np.around(np.divide(self.components,
                                               self.number_of_points), 5)
        return self
The mapper function is invoked, at each iteration, on the RDD that contains the points from the dataset:
cluster_assignment_rdd = points.map(assign_centroids)
The assign_centroids function assigns the closest centroid to each point on which it is invoked. The centroids are taken from the broadcast variable. The function returns the result as a tuple (index of the centroid, point):
def assign_centroids(p):
    min_dist = float("inf")
    centroids = centroids_broadcast.value
    nearest_centroid = 0
    for i in range(len(centroids)):
        distance = p.distance(centroids[i], distance_broadcast.value)
        if distance < min_dist:
            min_dist = distance
            nearest_centroid = i
    return (nearest_centroid, p)
The reduce stage is done using two Spark transformations:
reduceByKey: for each cluster, compute the sum of the points belonging to it. It is mandatory to pass an associative function as a parameter. The function (which accepts two arguments and returns a single element) must be commutative and associative in the mathematical sense:
sum_rdd = cluster_assignment_rdd.reduceByKey(lambda x, y: x.sum(y))
mapValues: it is used to calculate the average point for each cluster at the end of each stage. The points are already divided by key. This transformation works only on the values of each key. The results are sorted to make comparisons easier:
centroids_rdd = sum_rdd.mapValues(
    lambda x: x.get_average_point()).sortBy(lambda x: x[1].components[0])
The get_average_point() function returns the new computed centroid.
def get_average_point(self):
    self.components = np.around(np.divide(self.components,
                                           self.number_of_points), 5)
    return self
You don't need to write map-reduce yourself. You can use the Spark DataFrame API and the Spark ML library.
You can read more about it here:
https://spark.apache.org/docs/latest/ml-clustering.html
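For example, a minimal sketch of that route; the file name "points.csv" and k=4 are placeholders, and the CSV is assumed to contain only numeric columns:

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kmeans-example").getOrCreate()

# read a header-less CSV of numeric columns (they come out as _c0, _c1, ...)
df = spark.read.csv("points.csv", inferSchema=True)
features = VectorAssembler(inputCols=df.columns, outputCol="features").transform(df)

model = KMeans(k=4, seed=1).fit(features)   # featuresCol defaults to "features"
print(model.clusterCenters())               # the learned cluster centres
clustered = model.transform(features)       # adds a "prediction" column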
I wrote Quicksort and Mergesort and a benchmark for them, to see how fast they are.
Here is my code:
#------------------------------Creating a Random List-----------------------------#
def create_random_list(length):
    import random
    random_list = list(range(0, length))
    random.shuffle(random_list)
    return random_list

# Initialize default list length to 0
random_list = create_random_list(0)

# Testing random list function
print("\n" + "That is a randomized list: " + "\n")
print(random_list)
print("\n")
#-------------------------------------Quicksort-----------------------------------#
"""
Recursive Divide and Conquer Algorithm
+ Very efficient for large data set
- Performance Depends largely on Pivot Selection
Time Complexity
--> Worst-Case -----> O (n^2)
--> Best-Case -----> Ω (n log (n))
--> Average Case ---> O (n log (n))
Space Complexity
--> O(log(n))
"""
# Writing the Quick Sort algorithm for sorting the list - recursive method
def qsort(random_list):
    less = []
    equal = []
    greater = []
    if len(random_list) > 1:
        # Initialize starting point
        pivot = random_list[0]
        for x in random_list:
            if x < pivot:
                less.append(x)
            elif x == pivot:
                equal.append(x)
            elif x > pivot:
                greater.append(x)
        return qsort(less) + equal + qsort(greater)
    else:
        return random_list
"""
Build in Python Quick Sort:
def qsort(L):
if len(L) <= 1: return L
return qsort([lt for lt in L[1:] if lt < L[0]]) + L[0:1] + \
qsort([ge for ge in L[1:] if ge >= L[0]])
"""
# Calling Quicksort
sorted_list_qsort = qsort(random_list)
# Testing Quicksort
print ("That is a sorted list with Quicksort: " + "\n")
print (sorted_list_qsort)
print ("\n")
#-------------------------------------FINISHED-------------------------------------#
#-------------------------------------Mergesort------------------------------------#
"""
Recursive Divide and Conquer Algorithm
+
-
Time Complexity
--> Worst-Case -----> O (n l(n))
--> Best-Case -----> Ω (n l(n))
--> Average Case ---> O (n l(n))
Space Complexity
--> O (n)
"""
# Create a merge algorithm
def merge(a, b):                # Let a and b be two sorted arrays
    c = []                      # Final sorted output array
    a_idx, b_idx = 0, 0         # Current index into a and b
    while a_idx < len(a) and b_idx < len(b):
        if a[a_idx] < b[b_idx]:
            c.append(a[a_idx])
            a_idx += 1
        else:
            c.append(b[b_idx])
            b_idx += 1
    if a_idx == len(a):
        c.extend(b[b_idx:])
    else:
        c.extend(a[a_idx:])
    return c
# Create final Mergesort algorithm
def merge_sort(a):
    # A list of zero or one elements is sorted by definition
    if len(a) <= 1:
        return a
    # Split the list in half and call Mergesort recursively on each half
    left, right = merge_sort(a[:len(a)//2]), merge_sort(a[len(a)//2:])
    # Merge the now-sorted sublists with the merge function, which merges two sorted lists
    return merge(left, right)
# Calling Mergesort
sorted_list_mgsort = merge_sort(random_list)
# Testing Mergesort
print ("That is a sorted list with Mergesort: " + "\n")
print (sorted_list_mgsort)
print ("\n")
#-------------------------------------FINISHED-------------------------------------#
#------------------------------Algorithm Benchmarking------------------------------#
# Creating an array for iterations
n = [100, 1000, 10000, 100000]

# Creating a dictionary for times of algorithms
times = {"Quicksort": [], "Mergesort": []}

# Import time for analyzing the running time of the algorithms
from time import time

# Create a for loop which loops through the lengths in n and measures their times
for size in range(len(n)):
    random_list = create_random_list(n[size])
    t0 = time()
    qsort(random_list)
    t1 = time()
    times["Quicksort"].append(t1 - t0)

    random_list = create_random_list(n[size-1])
    t0 = time()
    merge_sort(random_list)
    t1 = time()
    times["Mergesort"].append(t1 - t0)

# Create a table which shows the benchmarking of the algorithms
print("n\tMerge\tQuick")
print("_" * 25)
for i, j in enumerate(n):
    print("{}\t{:.5f}\t{:.5f}\t".format(j, times["Mergesort"][i], times["Quicksort"][i]))
#----------------------------------End of Benchmarking---------------------------------#
The code is well documented and runs perfectly with Python 3.8. You may copy it into a code editor for better readability.
--> My question, as the title states:
Is my benchmarking right? I'm doubting it a little bit, because the running times of my algorithms seem a little odd. Can someone confirm my runtimes?
--> Here is the output of this code:
That is a randomized list:
[]
That is a sorted list with Quicksort:
[]
That is a sorted list with Mergesort:
[]
n Merge Quick
_________________________
100 0.98026 0.00021
1000 0.00042 0.00262
10000 0.00555 0.03164
100000 0.07919 0.44718
--> If someone has another/better code snippet on how to print the table - feel free to share it with me.
The error is in n[size-1]: when size is 0 (the first iteration), this translates to n[-1], which corresponds to your largest size. So in the first iteration you are comparing qsort(100) with merge_sort(100000), which obviously will favour the first a lot. It doesn't help that you call this variable size, as it really isn't the size, but the index in the n list, which contains the sizes.
So remove the -1, or even better: iterate directly over n. And I would also make sure both sorting algorithms get to sort the same list:
for size in n:
    random_list1 = create_random_list(size)
    random_list2 = random_list1[:]
    t0 = time()
    qsort(random_list1)
    t1 = time()
    times["Quicksort"].append(t1 - t0)
    t0 = time()
    merge_sort(random_list2)
    t1 = time()
    times["Mergesort"].append(t1 - t0)
Finally, consider using timeit which is designed for measuring performance.
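For example, a possible rewrite of the benchmark loop with timeit, reusing the names n, create_random_list, qsort and merge_sort from the question:

import timeit

n = [100, 1000, 10000, 100000]
print("n\tMerge\tQuick")
print("_" * 25)
for size in n:
    data = create_random_list(size)
    # number=1: a single call already sorts a whole list; repeat and keep the minimum
    t_quick = min(timeit.repeat(lambda: qsort(data[:]), number=1, repeat=5))
    t_merge = min(timeit.repeat(lambda: merge_sort(data[:]), number=1, repeat=5))
    print("{}\t{:.5f}\t{:.5f}".format(size, t_merge, t_quick))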
Edit: I believe there is a problem with the normalization of the histogram, since one must divide by the radius of each element.
I am trying to calculate the fluctuations of particle number and the radial distribution function of a 2d Lennard-Jones (LJ) system using python3. Although I believe the particle fluctuations come out right, the pair correlation g(r) comes out right for small distances but then blows up (the calculation uses numpy's histogram method).
The thing is, I can't find out why such behavior emerges - perhaps because of some misunderstanding of a method? As it is, I am posting the relevant code right below, and if needed, I could also upload other parts of the code or the entire script.
Note first that, since we are working with the Grand Canonical ensemble, as the number of particles changes, so does the array that stores the particles - and perhaps that's another point where a mistake in the implementation could exist.
Particle removal or insertion
def mcex(L, npart, particles, beta, rho0, V, en):
    factorin = (rho0*V)/(npart+1)
    factorout = npart/(V*rho0)
    print("factorin=", factorin)
    print("factorout", factorout)
    # Produce random number and check:
    rand = random.uniform(0, 1)
    if rand <= 0.5:
        # Insert a particle at a random location
        x_new_coord = random.uniform(0, L)
        y_new_coord = random.uniform(0, L)
        new_particle = [x_new_coord, y_new_coord]
        new_E = particleEnergy(new_particle, particles, npart+1)
        deltaE = new_E
        print("dEin=", deltaE)
        # Acceptance rule for inserting
        if deltaE > 10:
            P_in = 0
        else:
            P_in = factorin * math.exp(-beta*deltaE)
        print("pinacc=", P_in)
        rand = random.uniform(0, 1)
        if rand <= P_in:
            particles.append(new_particle)
            en += deltaE
            npart += 1
            print("accepted insertion")
    else:
        if npart != 0:
            p = random.randint(0, npart-1)
            this_particle = particles[p]
            prev_E = particleEnergy(this_particle, particles, p)
            deltaE = prev_E
            print("dEout=", deltaE)
            # Acceptance rule for removing
            if deltaE > 10:
                P_re = 1
            else:
                P_re = factorout * math.exp(beta*deltaE)
            print("poutacc=", P_re)
            rand = random.uniform(0, 1)
            if rand <= P_re:
                particles.remove(this_particle)
                en += deltaE
                npart = npart - 1
                print("accepted removal")
    print()
    return particles, en, npart
Monte Carlo relevant part: in roughly 1 out of 10 steps, check the possibility of inserting or removing a particle.
# MC
for step in range(0, runTimes):
    print(step)
    print()
    rand = random.uniform(0, 1)
    if rand <= 0.9:
        #----------- change energies-------------------------
        #........
        #........
    else:
        particles, en, N = mcex(L, N, particles, beta, rho0, V, en)
    # stepList.append(step)
    if (step+1) % 1000 == 0:
        for i, particle1 in enumerate(particles):
            for j, particle2 in enumerate(particles):
                if j != i:
                    # print(particle1)
                    # print(particle2)
                    # print(i)
                    # print(j)
                    dist.append(distancesq(particle1, particle2))
        NList.append(N)
where we call the function mcex shown above, and perhaps the particles array is not updated correctly.
And finally, we create the g(r) histogram, where perhaps the normalization or the use of the histogram method is not as it should be:
RDF(N,particles,L)
with the function:
def RDF(N, particles, L):
    minb = 0
    maxb = 8
    nbin = 500
    skata = np.asarray(dist).flatten()
    rDf = np.histogram(skata, np.linspace(minb, maxb, nbin))
    prefactor = (1/2/np.pi) * (L**2/N**2) / len(dist) * (nbin/(maxb - minb))
    # prefactor = (1/(2* np.pi))*(L**2/N**2)/(len(dist)*num_increments/(rMax + 1.1 * dr ))
    rDf = [prefactor*rDf[0], 0.5*(rDf[1][1:] + rDf[1][:-1])]
    print('skata', len(rDf[0]))
    print('incr', len(rDf[1]))
    plt.figure()
    plt.plot(rDf[1], rDf[0])
    plt.xlabel("r")
    plt.ylabel("g(r)")
    plt.show()
The results are: a plot of the particle number N fluctuations, and a plot of the computed g(r), which blows up at larger r instead of the expected behaviour (figures not reproduced here).
Although I have accepted an answer, I am posting here some more details.
To normalize the pair correlation correctly, one must divide each "number of particles found at a certain distance" (mathematically, the sum of delta functions of the distances) by the distance itself.
Understanding first that numpy.histogram returns two arrays - the first element is the array of counted events and the second element is the vector of bin edges - one must take each element of the counts array, say hist[0], and divide it pairwise by the corresponding bin centre derived from hist[1].
That is, one must do the following:
def RDF(N, particles, L):
    minb = 0
    maxb = 25
    nbin = 200
    width = (maxb - minb)/nbin
    rings = np.linspace(minb, maxb, nbin)
    skata = np.asarray(dist).flatten()
    rDf = np.histogram(skata, rings, density=True)
    prefactor = 1/(np.pi*(L**2/N**2))
    rDf = [prefactor*rDf[0], 0.5*(rDf[1][1:] + rDf[1][:-1])]
    rDf[0] = np.multiply(rDf[0], 1/(rDf[1]*width))
where, before the last multiply line, we centre the bins so that their number equals the number of elements of the counts array (you have five fingers, but four intermediate gaps between them).
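As a tiny illustration of that centring step (the numbers here are arbitrary):

import numpy as np

edges = np.linspace(0.0, 1.0, 5)            # 5 bin edges ("fingers")
centers = 0.5 * (edges[1:] + edges[:-1])    # 4 bin centres ("gaps")
print(edges)      # [0.   0.25 0.5  0.75 1.  ]
print(centers)    # [0.125 0.375 0.625 0.875]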
Your g(r) is not correctly normalised. You need to divide the number of pairs found in each bin by the average density of the system times the area of the annulus associated to that bin, where the latter is just 2 pi r dr, with r being the bin's midpoint and dr the bin size. As far as I can tell, your prefactor does not contain the "r" bit. There is also something else that is missing, but it's hard to tell without knowing what all the other constants contain.
EDIT: here is a link that will guide you through the implementation of a routine to compute the radial distribution function in 2D and 3D
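For reference, a minimal sketch of the normalisation described in this answer for a 2-D system; the function and argument names (g_of_r_2d, pair_distances, n_snapshots, N_avg) are illustrative rather than taken from the question's code, and it assumes an array of plain (not squared) pair distances pooled over the sampled snapshots:

import numpy as np

def g_of_r_2d(pair_distances, n_snapshots, N_avg, L, nbins=200, r_max=None):
    """pair_distances: ordered pair distances (i != j) pooled over all snapshots;
    n_snapshots: number of snapshots sampled; N_avg: average particle number;
    L: box side length."""
    if r_max is None:
        r_max = L / 2.0                            # stay below half the box size
    counts, edges = np.histogram(pair_distances, bins=nbins, range=(0.0, r_max))
    r = 0.5 * (edges[1:] + edges[:-1])             # bin midpoints
    dr = edges[1] - edges[0]                       # bin width
    rho = N_avg / L**2                             # average 2-D number density
    ideal = n_snapshots * N_avg * rho * 2.0 * np.pi * r * dr  # ideal-gas pair count per bin
    return r, counts / ideal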
I was trying to take an average over the oscillations of some highly oscillating data. The oscillations are not uniform: there are fewer oscillations in the initial region.
x = np.linspace(0, 1000, 1000001)
y = some oscillating data, say np.sin(x**2)
(The original data file is huge, so I can't upload it)
I want to take a weighted moving average of the function and plot it. Initially the period of the function is larger, so I want to average over a larger time interval, while later I can use a smaller time interval.
I have found a possible elegant solution in the following post:
Weighted moving average in python
However, I want to have a different width in different regions of x: say, when x is between (0, 100) I want width=0.6, while when x is between (101, 300) I want width=0.2, and so on.
This is what I have tried to implement (with my limited knowledge of programming!):
def weighted_moving_average(x, y, step_size=0.05):  # change the width to control average
    bin_centers = np.arange(np.min(x), np.max(x) - 0.5*step_size, step_size) + 0.5*step_size
    bin_avg = np.zeros(len(bin_centers))

    # We're going to weight with a Gaussian function
    def gaussian(x, amp=1, mean=0, sigma=1):
        return amp*np.exp(-(x-mean)**2/(2*sigma**2))

    if x.any() < 100:
        for index in range(0, len(bin_centers)):
            bin_center = bin_centers[index]
            weights = gaussian(x, mean=bin_center, sigma=0.6)
            bin_avg[index] = np.average(y, weights=weights)
    else:
        for index in range(0, len(bin_centers)):
            bin_center = bin_centers[index]
            weights = gaussian(x, mean=bin_center, sigma=0.1)
            bin_avg[index] = np.average(y, weights=weights)

    return (bin_centers, bin_avg)
Needless to say, this is not working! I am getting the plot with the first value of sigma only. Please help...
The following snippet should do more or less what you tried to do. You mainly have a logical problem in your code: x.any() < 100 will always be True, so you'll never execute the second part.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 10, 1000)
y = np.sin(x**2)

def gaussian(x, amp=1, mean=0, sigma=1):
    return amp*np.exp(-(x-mean)**2/(2*sigma**2))

def weighted_average(x, y, step_size=0.3):
    weights = np.zeros_like(x)
    bin_centers = np.arange(np.min(x), np.max(x) - .5*step_size, step_size) + .5*step_size
    bin_avg = np.zeros_like(bin_centers)
    for i, center in enumerate(bin_centers):
        # Select the indices that should count to that bin
        idx = ((x >= center - .5*step_size) & (x <= center + .5*step_size))
        weights = gaussian(x[idx], mean=center, sigma=step_size)
        bin_avg[i] = np.average(y[idx], weights=weights)
    return (bin_centers, bin_avg)

idx = x <= 4
plt.plot(*weighted_average(x[idx], y[idx], step_size=0.6))
idx = x >= 3
plt.plot(*weighted_average(x[idx], y[idx], step_size=0.1))
plt.plot(x, y)
plt.legend(['0.6', '0.1', 'y'])
plt.show()
However, depending on the usage, you could also implement a moving average directly:
x = np.linspace(0, 60, 1000)
y = np.sin(x**2)

z = np.zeros_like(x)
z[0] = x[0]
for i, t in enumerate(x[1:]):
    a = .2
    z[i+1] = a*y[i+1] + (1-a)*z[i]

plt.plot(x, y)
plt.plot(x, z)
plt.legend(['data', 'moving average'])
plt.show()
Of course, you could then change a adaptively, e.g. depending on the local variance. Also note that this a priori has a small bias depending on a and the step size in x.
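A possible sketch of such an adaptive choice of a, using the local variance over a short trailing window; the window length and the bounds on a are arbitrary choices, not something prescribed above:

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 60, 1000)
y = np.sin(x**2)

# local variance of y over a short trailing window
win = 25
local_var = np.array([y[max(0, i - win):i + 1].var() for i in range(len(y))])

# low variance -> small a (heavy smoothing), high variance -> larger a
a = 0.05 + 0.4 * local_var / local_var.max()

z = np.zeros_like(y)
z[0] = y[0]
for i in range(1, len(y)):
    z[i] = a[i] * y[i] + (1 - a[i]) * z[i - 1]

plt.plot(x, y)
plt.plot(x, z)
plt.legend(['data', 'adaptive moving average'])
plt.show()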
I want to apply k-means (or any other simple clustering algorithm) to data with two variables, but I want the clusters to respect a condition: the sum of a third variable per cluster > some_value.
Is that possible?
Notations:
- K is the number of clusters
- let's say that the first two variables are point coordinates (x, y)
- V denotes the third variable
- Ci: the sum of V over cluster i
- S the total sum (sum of Ci)
- and the threshold T
Problem definition:
From what I understood, the aim is to run an algorithm that keeps the spirit of k-means while respecting the constraint.
Task 1 - group points by proximity to centroids [k-means]
Task 2 - for each cluster i, Ci > T [constraint]
Regular k-means limitation for the constraint problem:
A regular k-means assigns points to centroids by taking them in arbitrary order. In our case, this will lead to uncontrolled growth of the Ci while adding points.
For example, take K=2, T=40 and 4 points with the third variable equal to V1=50, V2=1, V3=50, V4=50.
Suppose also that points P1, P3, P4 are closer to centroid 1 and point P2 is closer to centroid 2.
Let's run the assignment step of a regular k-means and keep track of Ci:
1-- take point P1, assign it to cluster 1. C1=50 > T
2-- take point P2, assign it to cluster 2. C2=1
3-- take point P3, assign it to cluster 1. C1=100 > T => C1 grows too much!
4-- take point P4, assign it to cluster 1. C1=150 > T => !!!
Modified k-means:
In the previous example, we want to prevent C1 from growing too much and help C2 grow.
This is like pouring champagne into several glasses: if you see a glass with less champagne, you go and fill it. You do that because you have constraints: a limited amount of champagne (S is bounded) and because you want every glass to have enough champagne (Ci > T).
Of course this is just an analogy. Our modified k-means will add new points to the cluster with minimal Ci until the constraint is satisfied (Task 2). Now, in which order should we add points? By proximity to centroids (Task 1). After the constraint is satisfied for every cluster i, we can just run a regular k-means on the remaining unassigned points.
Implementation:
Next, I give a Python implementation of the modified algorithm. Figure 1 displays a representation of the third variable using transparency for visualizing large vs. low values. Figure 2 displays the evolution of the clusters using color.
You can play with the accept_thresh parameter. In particular, note that:
For accept_thresh=0 => regular k-means (the constraint is reached immediately)
For accept_thresh = third_var.sum().sum() / (2*K), you might observe that some points that are closer to a given centroid are assigned to another one for constraint reasons.
CODE :
import sys
import time

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

nb_samples = 1000
K = 3  # for demo purposes, used to generate cloud points
c_std = 1.2

# Generate test samples :
points, classes = datasets.make_blobs(n_features=2, n_samples=nb_samples,
                                       centers=K, cluster_std=c_std)

third_var_distribution = 'cubic_bycluster'  # 'uniform'

if third_var_distribution == 'uniform':
    third_var = np.random.random((nb_samples))
elif third_var_distribution == 'linear_bycluster':
    third_var = np.random.random((nb_samples))
    third_var = third_var * classes
elif third_var_distribution == 'cubic_bycluster':
    third_var = np.random.random((nb_samples))
    third_var = third_var * classes

# Threshold parameters :
# Try with K=3 and :
#   T = K  => one cluster reaches the constraint, two clusters won't converge
#   T = 2K =>
accept_thresh = third_var.sum().sum() / (2*K)
def dist2centroids(points, centroids):
    '''return arrays of ordered points to each centroid
    first array is index of points
    second array is distance to centroid
    dim 0 : centroid
    dim 1 : distance or point index
    '''
    dist = np.sqrt(((points - centroids[:, np.newaxis]) ** 2).sum(axis=2))
    ord_dist_indices = np.argsort(dist, axis=1)

    ord_dist_indices = ord_dist_indices.transpose()
    dist = dist.transpose()

    return ord_dist_indices, dist
def assign_points_with_constraints(inds, dists, tv, accept_thresh):
    assigned = [False] * nb_samples
    assignements = np.ones(nb_samples, dtype=int) * (-1)
    cumul_third_var = np.zeros(K, dtype=float)
    current_inds = np.zeros(K, dtype=int)

    max_round = nb_samples * K

    for round in range(0, max_round):  # we'll break anyway
        # worst advanced cluster in terms of cumulated third_var :
        cluster = np.argmin(cumul_third_var)

        if cumul_third_var[cluster] > accept_thresh:
            continue  # cluster had enough samples

        while current_inds[cluster] < nb_samples:
            # add points to increase cumulated third_var on this cluster
            i_inds = current_inds[cluster]
            closest_pt_index = inds[i_inds][cluster]

            if assigned[closest_pt_index]:
                current_inds[cluster] += 1
                continue  # pt already assigned to a cluster

            assignements[closest_pt_index] = cluster
            cumul_third_var[cluster] += tv[closest_pt_index]
            assigned[closest_pt_index] = True

            current_inds[cluster] += 1

            new_cluster = np.argmin(cumul_third_var)
            if new_cluster != cluster:
                break

    return assignements, cumul_third_var
def assign_points_with_kmeans(points, centroids, assignements):
    new_assignements = np.array(assignements, copy=True)

    count = -1
    for asg in assignements:
        count += 1

        if asg > -1:
            continue

        pt = points[count, :]

        distances = np.sqrt(((pt - centroids) ** 2).sum(axis=1))
        centroid = np.argmin(distances)

        new_assignements[count] = centroid

    return new_assignements
def move_centroids(points, labels):
    centroids = np.zeros((K, 2), dtype=float)

    for k in range(0, K):
        centroids[k] = points[labels == k].mean(axis=0)

    return centroids
rgba_colors = np.zeros((third_var.size, 4))
rgba_colors[:, 0] = 1.0
rgba_colors[:, 3] = 0.1 + (third_var / max(third_var)) / 1.12

plt.figure(1, figsize=(14, 14))
plt.title("Three blobs", fontsize='small')
plt.scatter(points[:, 0], points[:, 1], marker='o', c=rgba_colors)

# Initialize centroids
centroids = np.random.random((K, 2)) * 10
plt.scatter(centroids[:, 0], centroids[:, 1], marker='X', color='red')

# Step 1 : order points by distance to centroid :
inds, dists = dist2centroids(points, centroids)

# Check if clustering is theoretically possible :
tv_sum = third_var.sum()
tv_max = third_var.max()
if tv_max > 1 / 3 * tv_sum:
    print("No solution to the clustering problem !\n")
    print("For one point : third variable is too high.")
    sys.exit(0)

stop_criter_eps = 0.001
epsilon = 100000
prev_cumdist = 100000

plt.figure(2, figsize=(14, 14))
ln, = plt.plot([])
plt.ion()
plt.show()
while epsilon > stop_criter_eps:
    # Modified kmeans assignment :
    assignements, cumul_third_var = assign_points_with_constraints(inds, dists, third_var, accept_thresh)

    # Kmeans on remaining points :
    assignements = assign_points_with_kmeans(points, centroids, assignements)

    centroids = move_centroids(points, assignements)

    inds, dists = dist2centroids(points, centroids)

    epsilon = np.abs(prev_cumdist - dists.sum().sum())
    print("Delta on error :", epsilon)
    prev_cumdist = dists.sum().sum()

    plt.clf()
    plt.title("Current Assignements", fontsize='small')
    plt.scatter(points[:, 0], points[:, 1], marker='o', c=assignements)
    plt.scatter(centroids[:, 0], centroids[:, 1], marker='o', color='red', linewidths=10)
    plt.text(0, 0, "THRESHOLD T = " + str(accept_thresh), va='top', ha='left', color="red", fontsize='x-large')
    for k in range(0, K):
        plt.text(centroids[k, 0], centroids[k, 1] + 0.7, "Ci = " + str(cumul_third_var[k]))
    plt.show()
    plt.pause(1)
Improvements:
- use the distribution of the third variable for assignments
- manage divergence of the algorithm
- better initialization (k-means++); see the sketch below
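For the last item, here is a minimal sketch of k-means++ seeding; the function name kmeans_pp_init and the use of numpy's default_rng are choices made for this sketch, not part of the code above. It could replace the centroids = np.random.random((K, 2)) * 10 initialization.

import numpy as np

def kmeans_pp_init(points, K, rng=None):
    """k-means++ seeding: pick the first centroid at random, then pick each new
    centroid with probability proportional to the squared distance to the
    closest centroid chosen so far."""
    rng = rng or np.random.default_rng()
    centroids = [points[rng.integers(len(points))]]
    for _ in range(K - 1):
        # squared distance of every point to its closest already-chosen centroid
        d2 = np.min(((points[:, None, :] - np.asarray(centroids)[None, :, :]) ** 2).sum(axis=2), axis=1)
        centroids.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    return np.asarray(centroids)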
One way to handle this would be to filter the data before clustering.
>>> cluster_data = df.loc[df['third_variable'] > some_value]
>>> from sklearn.cluster import KMeans
>>> y_pred = KMeans(n_clusters=2).fit_predict(cluster_data)
If by sum you mean the sum of the third variable per cluster, then you could use RandomizedSearchCV to find hyperparameters of KMeans that do or do not meet the condition.
K-means itself is an optimization problem.
Your additional constraint is a rather common optimization constraint, too.
So I'd rather approach this with an optimization solver.
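To illustrate the idea, here is a minimal sketch of the assignment step phrased as an integer program. PuLP is used only as an example solver (an assumption, not something suggested above), and the centroids are taken as fixed for this step; in practice you would alternate this assignment with a centroid update, as in Lloyd's algorithm.

import numpy as np
import pulp  # any MILP solver interface would do; PuLP is just an example

def constrained_assignment(points, centroids, v, T):
    """Assign each point to exactly one cluster, minimising the total distance
    to the fixed centroids, subject to: sum of v over each cluster >= T."""
    n, K = len(points), len(centroids)
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)

    prob = pulp.LpProblem("constrained_assignment", pulp.LpMinimize)
    x = pulp.LpVariable.dicts("x", (range(n), range(K)), cat="Binary")

    # objective: total point-to-centroid distance
    prob += pulp.lpSum(d[i, k] * x[i][k] for i in range(n) for k in range(K))
    # each point belongs to exactly one cluster
    for i in range(n):
        prob += pulp.lpSum(x[i][k] for k in range(K)) == 1
    # per-cluster constraint on the third variable (infeasible if v.sum() < K*T)
    for k in range(K):
        prob += pulp.lpSum(v[i] * x[i][k] for i in range(n)) >= T

    prob.solve()
    return np.array([max(range(K), key=lambda k: x[i][k].value())
                     for i in range(n)])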