I am trying to solve LeetCode problem 1129. Shortest Path with Alternating Colors:
You are given an integer n, the number of nodes in a directed graph where the nodes are labeled from 0 to n - 1. Each edge is red or blue in this graph, and there could be self-edges and parallel edges.
You are given two arrays redEdges and blueEdges where:
redEdges[i] = [aᵢ, bᵢ] indicates that there is a directed red edge from node aᵢ to node bᵢ in the graph, and
blueEdges[j] = [uⱼ, vⱼ] indicates that there is a directed blue edge from node uⱼ to node vⱼ in the graph.
Return an array answer of length n, where each answer[x] is the length of the shortest path from node 0 to node x such that the edge colors alternate along the path, or -1 if such a path does not exist.
Example 1:
Input: n = 3, redEdges = [[0,1],[1,2]], blueEdges = []
Output: [0,1,-1]
Here is my code for that problem:
from collections import defaultdict
from typing import List

class Solution:
    def shortestAlternatingPaths(self, n: int, redEdges: List[List[int]], blueEdges: List[List[int]]) -> List[int]:
        res = [0] + [-1]*(n-1)
        Red = defaultdict(list)
        Blue = defaultdict(list)
        for x, y in redEdges:
            if x != 0 or y != 0:
                Red[x].append(y)
        for x, y in blueEdges:
            if x != 0 or y != 0:
                Blue[x].append(y)

        def dfs(vertex, color, cost):
            if color == "red":
                for x in Red[vertex]:
                    if res[x] != -1:
                        res[x] = min(cost, res[x])
                    else:
                        res[x] = cost
                    if vertex in Red.keys():
                        del Red[vertex]
                    dfs(x, "blue", cost+1)
            else:
                for x in Blue[vertex]:
                    if res[x] != -1:
                        res[x] = min(cost, res[x])
                    else:
                        res[x] = cost
                    if vertex in Blue.keys():
                        del Blue[vertex]
                    dfs(x, "red", cost+1)

        dfs(0, "red", 1)
        dfs(0, "blue", 1)
        return res
But for the following input:
redEdges=[[2,2],[0,1],[0,3],[0,0],[0,4],[2,1],[2,0],[1,4],[3,4]]
blueEdges=[[1,3],[0,0],[0,3],[4,2],[1,0]]
...my output is:
[0,1,4,1,1]
But the correct solution is:
[0,1,2,1,1]
...because there is a path from node 0 to node 2 like this:
   red      blue
0 -----> 4 -----> 2
I have no idea why my code doesn't give 2, as this path should be found by my DFS algorithm.
I thought that it might be something with the [0,0] edge, but it seems that it doesn't have an impact on the solution.
What is wrong in my code?
The problem is that your code deletes the vertex it has visited, but doesn't restore it when backtracking. There is a possibility that there is another path from vertex 0 to the one you just deleted that still needs to be traversed and is a shorter path.
Here is example input that demonstrates the problem:
redEdges = [[0,1],[2,3],[0,3]]
blueEdges = [[1,2],[3,4]]
Your code will correctly create the following adjacency lists:
Red = {0: [1, 3], 2: [3]}
Blue = {1: [2], 3: [4]}
With dfs the path 0, 1, 2, 3, 4 will be visited and during this traversal all the keys in these dictionaries will be deleted. As the loop over the outgoing edges from 0 is still active, dfs will still follow the red edge from 0 to 3, but there it finds no blue edges as the key 3 is not there anymore. And so the algorithm doesn't see the shorter path from 0 to 4, which is 0, 3, 4.
Not your question, but BFS is more suitable for finding shortest paths. I would suggest you rework the algorithm and use BFS instead.
Just to have a complete answer, here is a spoiler solution using BFS:
class Solution:
    def shortestAlternatingPaths(self, n, redEdges, blueEdges):
        # Build a separate adjacency list for each color
        adj = [[[] for _ in range(n)], [[] for _ in range(n)]]
        for i, edges in enumerate((redEdges, blueEdges)):
            for x, y in edges:
                if x or y:
                    adj[i][x].append(y)
        # Collect shortest distances for each color separately
        res = [[0] + [-1] * (n-1), [0] + [-1] * (n-1)]
        # Start BFS at node 0, traversing with either color
        frontier = [(0, 0), (0, 1)]
        distance = 1
        while frontier:  # BFS loop
            nextfrontier = []
            for node, color in frontier:
                for neighbor in adj[color][node]:
                    # If not yet visited with this color...
                    if res[color][neighbor] == -1:
                        res[color][neighbor] = distance
                        nextfrontier.append((neighbor, 1-color))
            frontier = nextfrontier
            distance += 1
        # Get the minimum distance per node from the two color alternatives
        return [min(a, b) if min(a, b) > -1 else max(a, b)
                for a, b in zip(*res)]
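As a quick sanity check (my own addition, not part of the original answer), running this on the failing input from the question reproduces the expected result:

sol = Solution()
print(sol.shortestAlternatingPaths(
    5,
    [[2,2],[0,1],[0,3],[0,0],[0,4],[2,1],[2,0],[1,4],[3,4]],
    [[1,3],[0,0],[0,3],[4,2],[1,0]],
))
# prints [0, 1, 2, 1, 1]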
I have code that works perfectly well, but I wish to speed up the time it takes to converge. A snippet of the code is shown below:
import time
import numpy as np
from numpy.linalg import norm  # assuming norm is the usual Euclidean norm

def myfunction(x, i):
    y = x + (min(0, target[i] - data[i, :] @ x)) * (data[i] / (norm(data[i])**2))
    return y

rows, columns = data.shape

start = time.time()
iterate = 0
iterate_count = []
norm_count = []

res = 5
x_not = np.ones(columns)
norm_count.append(norm(x_not))
iterate_count.append(0)

while res > 1e-8:
    for row in range(rows):
        y = myfunction(x_not, row)
        x_not = y
    iterate += 1
    iterate_count.append(iterate)
    norm_count.append(norm(x_not))
    res = abs(norm_count[-1] - norm_count[-2])

print('Converge at {} iterations'.format(iterate))
print('Duration: {:.4f} seconds'.format(time.time() - start))
I am relatively new to Python. I would appreciate any hint/assistance.
Ax = b is the problem we wish to solve. Here, 'A' is the 'data' and 'b' is the 'target'.
Ugh! After spending a while on this I don't think it can be done the way you've set up your problem. In each iteration over the row, you modify x_not and then pass the updated result to get the solution for the next row. This kind of setup can't be vectorized easily. You can learn the thought process of vectorization from the failed attempt, so I'm including it in the answer. I'm also including a different iterative method to solve linear systems of equations. I've included a vectorized version -- where the solution is updated using matrix multiplication and vector addition, and a loopy version -- where the solution is updated using a for loop to demonstrate what you can expect to gain.
1. The failed attempt
Let's take a look at what you're doing here.
def myfunction(x, i):
    y = x + (min(0, target[i] - data[i, :] @ x)) * (data[i] / (norm(data[i])**2))
    return y
You subtract the dot product of (the ith row of data and x_not) from the ith element of target, limited at zero. Let's call this part1.
You multiply this result by the ith row of data divided by the norm of that row squared. Let's call that second factor part2.
Then you add this product to x_not.
Now let's look at the shapes of the matrices.
data is (M, N).
target is (M, ).
x_not is (N, )
Instead of doing these operations rowwise, you can operate on the entire matrix!
1.1. Simplifying the dot product.
Instead of doing data[i, :] @ x, you can do data @ x_not and this gives an array whose ith element is the dot product of the ith row with x_not. So now we have data @ x_not with shape (M, ).
Then, you can subtract this from the entire target array, so target - (data @ x_not) has shape (M, ).
So far, we have
part1 = target - (data @ x_not)
Next, if anything is greater than zero, set it to zero.
part1[part1 > 0] = 0
1.2. Finding rowwise norms.
Finally, you want to multiply this by the row of data, and divide by the square of the L2-norm of that row. To get the norm of each row of a matrix, you do
rownorms = np.linalg.norm(data, axis=1)
This is a (M, ) array, so we need to convert it to a (M, 1) array so we can divide each row. rownorms[:, None] does this. Then divide data by this.
part2 = data / (rownorms[:, None]**2)
1.3. Add to x_not
Finally, we're summing the rows of part1[:, None] * part2 (part1 needs the extra axis to broadcast against the rows of part2), adding that to the original x_not and returning the result
result = x_not + (part1[:, None] * part2).sum(axis=0)
Here's where we get stuck. In your approach, each call to myfunction() gives a value of part1 that depends on x_not, which was changed in the last call to myfunction(), so all the row updates cannot be computed from the same x_not in one shot.
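For reference (my own consolidation, which for the reason just described is not equivalent to the sequential algorithm), the assembled vectorized update would look like this, assuming data, target and x_not as defined in the question:

import numpy as np

def myfunction_vectorized(x_not, data, target):
    # one full sweep computed from the same x_not for every row,
    # which is exactly why it does not match the row-by-row update
    part1 = target - data @ x_not
    part1[part1 > 0] = 0
    part2 = data / (np.linalg.norm(data, axis=1)[:, None] ** 2)
    return x_not + (part1[:, None] * part2).sum(axis=0)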
2. Why vectorize?
Using numpy's inbuilt methods instead of looping allows it to offload the calculation to its C backend, so it runs faster. If your numpy is linked to a BLAS backend, you can extract even more speed by using your processor's SIMD registers
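As a small illustration of that point (my own toy example, not from the original code), compare a pure-Python dot product against numpy's BLAS-backed one:

import timeit
import numpy as np

a = np.random.random(100_000)
b = np.random.random(100_000)

def loop_dot(a, b):
    # same arithmetic as a @ b, but done element by element in Python
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total

print(timeit.timeit(lambda: loop_dot(a, b), number=10))  # pure-Python loop
print(timeit.timeit(lambda: a @ b, number=10))           # vectorized dot product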
The conjugate gradient method is a simple iterative method to solve certain systems of equations. There are other more complex algorithms that can solve general systems well, but this should do for the purposes of our demo. Again, the purpose is not to have an iterative algorithm that will perfectly solve any linear system of equations, but to show what kind of speedup you can expect if you vectorize your code.
Given your system
data @ x_not = target
Let's define some variables:
A = data.T @ data
b = data.T @ target
And we'll solve the system A @ x = b
x = np.zeros((columns,))  # Initial guess. Can be anything
resid = b - A @ x
p = resid
while (np.abs(resid) > tolerance).any():
    Ap = A @ p
    alpha = (resid.T @ resid) / (p.T @ Ap)
    x = x + alpha * p
    resid_new = resid - alpha * Ap
    beta = (resid_new.T @ resid_new) / (resid.T @ resid)
    p = resid_new + beta * p
    resid = resid_new + 0
To contrast the fully vectorized approach with one that uses iterations to update the rows of x and resid_new, let's define another implementation of the CG solver that does this.
def solve_loopy(data, target, itermax=100, tolerance=1e-8):
    A = data.T @ data
    b = data.T @ target
    rows, columns = data.shape

    x = np.zeros((columns,))  # Initial guess. Can be anything
    resid = b - A @ x
    resid_new = b - A @ x
    p = resid
    niter = 0
    while (np.abs(resid) > tolerance).any() and niter < itermax:
        Ap = A @ p
        alpha = (resid.T @ resid) / (p.T @ Ap)
        for i in range(len(x)):
            x[i] = x[i] + alpha * p[i]
            resid_new[i] = resid[i] - alpha * Ap[i]
        # resid_new = resid - alpha * A @ p
        beta = (resid_new.T @ resid_new) / (resid.T @ resid)
        p = resid_new + beta * p
        resid = resid_new + 0
        niter += 1
    return x
And our original vector method:
def solve_vect(data, target, itermax=100, tolerance=1e-8):
    A = data.T @ data
    b = data.T @ target
    rows, columns = data.shape

    x = np.zeros((columns,))  # Initial guess. Can be anything
    resid = b - A @ x
    resid_new = b - A @ x
    p = resid
    niter = 0
    while (np.abs(resid) > tolerance).any() and niter < itermax:
        Ap = A @ p
        alpha = (resid.T @ resid) / (p.T @ Ap)
        x = x + alpha * p
        resid_new = resid - alpha * Ap
        beta = (resid_new.T @ resid_new) / (resid.T @ resid)
        p = resid_new + beta * p
        resid = resid_new + 0
        niter += 1
    return x
Let's solve a simple system to see if this works first:
2x1 + x2 = -5
−x1 + x2 = -2
should give a solution of [-1, -3]
data = np.array([[ 2, 1],
                 [-1, 1]])
target = np.array([-5, -2])

print(solve_loopy(data, target))
print(solve_vect(data, target))
Both give the correct solution [-1, -3], yay! Now on to bigger things:
data = np.random.random((100, 100))
target = np.random.random((100, ))
Let's ensure the solution is still correct:
sol1 = solve_loopy(data, target)
np.allclose(data @ sol1, target)
# Output: False
sol2 = solve_vect(data, target)
np.allclose(data @ sol2, target)
# Output: False
Hmm, looks like the CG method doesn't work for the badly conditioned random matrices we created. Well, at least both give the same result.
np.allclose(sol1, sol2)
# Output: True
But let's not get discouraged! We don't really care if it works perfectly, the point of this is to demonstrate how amazing vectorization is. So let's time this:
import timeit
timeit.timeit('solve_loopy(data, target)', number=10, setup='from __main__ import solve_loopy, data, target')
# Output: 0.25586539999994784
timeit.timeit('solve_vect(data, target)', number=10, setup='from __main__ import solve_vect, data, target')
# Output: 0.12008900000000722
Nice! A ~2x speedup simply by avoiding a loop while updating our solution!
For larger systems, this will be even better.
for N in [10, 50, 100, 500, 1000]:
    data = np.random.random((N, N))
    target = np.random.random((N, ))

    t_loopy = timeit.timeit('solve_loopy(data, target)', number=10, setup='from __main__ import solve_loopy, data, target')
    t_vect = timeit.timeit('solve_vect(data, target)', number=10, setup='from __main__ import solve_vect, data, target')
    print(N, t_loopy, t_vect, t_loopy/t_vect)
This gives us:
N t_loopy t_vect speedup
00010 0.002823 0.002099 1.345390
00050 0.051209 0.014486 3.535048
00100 0.260348 0.114601 2.271773
00500 0.980453 0.240151 4.082644
01000 1.769959 0.508197 3.482822
I would like to rotate the following k-partite graph vertically or at 45 degrees. I want to show the plot hierarchically, with the red node set at the top and the green node set at the bottom.
The networkx documentation only has a rotate option for shell_layout and for edge labels (networkx.drawing.nx_pylab.draw_networkx_edge_labels).
Here is the program code:
G = nx.Graph()
G.add_nodes_from(emc["entity"], bipartite=0)
G.add_nodes_from(set(EMM_unique["keys"]).symmetric_difference(set(emc["entity"])), bipartite=1)
G.add_nodes_from(EMM["id"], bipartite=2)
G.add_edges_from(list(emc.itertuples(index=False)))
G.add_edges_from(list(EMM.itertuples(index=False)))
nodes = G.nodes()
# for each of the parts create a set
nodes_0 = set([n for n in nodes if G.nodes[n]['bipartite']==0])
nodes_1 = set([n for n in nodes if G.nodes[n]['bipartite']==1])
nodes_2 = set([n for n in nodes if G.nodes[n]['bipartite']==2])
# set the location of the nodes for each set
pos = dict()
pos.update( (n, (1, i)) for i, n in enumerate(nodes_0) ) # put nodes from nodes_0 at x=1
pos.update( (n, (2, i)) for i, n in enumerate(nodes_1) ) # put nodes from nodes_1 at x=2
pos.update( (n, (3, i)) for i, n in enumerate(nodes_2) ) # put nodes from nodes_2 at x=3
color_map = []
for node in G:
    if node in emc["entity"].values:
        color_map.append("red")
    elif node in EMM["id"].values:
        color_map.append("green")
    else:
        color_map.append("blue")
nx.draw(G, pos, node_color=color_map, width=2, with_labels=True, with_arrows=True)
That solution only flips the positions and is not useful for rotation. Since I am not adding the nodes one by one, that solution is not very helpful either.
Instead of flipping the graph, have you looked into using pygraphviz or the hierarchy_pos solution as in the answer below?
Can one get hierarchical graphs from networkx with python 3?
the hierarchy_pos solution worked well to some extent for me:
## Converting the graph to an oriented tree through depth first search or breadth first search
tree_g = nx.dfs_tree(g, <starting_node>)
## attempt to draw it like a tree
pos = hierarchy_pos(tree_g)
nx.draw(tree_g, pos=pos,....)
https://networkx.github.io/documentation/networkx-1.10/reference/generated/networkx.algorithms.traversal.depth_first_search.dfs_tree.html
I tried changing the coordinates in the position (pos) variable and it worked. The part of the code that differs from what I posted above, and that solves the problem, is here:
# set the location of the nodes for each set
pos = dict()
pos.update( (n, (i, -1)) for i, n in enumerate(nodes_0) )
pos.update( (n, (i, -2)) for i, n in enumerate(nodes_1) )
pos.update( (n, (i, -3)) for i, n in enumerate(nodes_2) )
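For anyone wanting a self-contained example of the same idea, here is a minimal sketch with made-up node sets (the node names and edges are placeholders, not the data from the question): put the position within the layer on the x axis and the negated layer index on the y axis, so the first node set ends up on top.

import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
layers = {0: ["a1", "a2"], 1: ["b1", "b2", "b3"], 2: ["c1", "c2"]}
for layer, nodes in layers.items():
    G.add_nodes_from(nodes, bipartite=layer)
G.add_edges_from([("a1", "b1"), ("a2", "b2"), ("b1", "c1"), ("b3", "c2")])

# x = index within the layer, y = -layer, so layer 0 is drawn at the top
pos = {}
for layer, nodes in layers.items():
    pos.update((n, (i, -layer)) for i, n in enumerate(nodes))

color_map = ["red" if n in layers[0] else "green" if n in layers[2] else "blue" for n in G]
nx.draw(G, pos, node_color=color_map, with_labels=True)
plt.show()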
Edit
I believe there is a problem with the normalization of the histogram, since one must divide by the radius of each element.
I am trying to calculate the fluctuations of the particle number and the radial distribution function of a 2D Lennard-Jones (LJ) system using Python 3. Although I believe the particle fluctuations come out right, the pair correlation g(r) comes out right for small distances but then blows up (the calculation uses numpy's histogram method).
The thing is, I can't find out why such behavior emerges. Perhaps it comes from some misunderstanding of a method? As it is, I am posting the relevant code right below, and if needed, I could also upload other parts of the code or the entire script.
Note first that, since we are working in the Grand Canonical Ensemble, the array that stores the particles changes along with the number of particles, and perhaps that's another point where a mistake in the implementation could exist.
Particle removal or insertion
def mcex(L,npart,particles,beta,rho0,V,en):
    factorin=(rho0*V)/(npart+1)
    factorout=(npart)/(V*rho0)
    print("factorin=",factorin)
    print("factorout",factorout)
    # Produce random number and check:
    rand = random.uniform(0, 1)
    if rand <= 0.5:
        # Insert a particle at a random location
        x_new_coord = random.uniform(0, L)
        y_new_coord = random.uniform(0, L)
        new_particle = [x_new_coord,y_new_coord]
        new_E = particleEnergy(new_particle,particles, npart+1)
        deltaE = new_E
        print("dEin=",deltaE)
        # Acceptance rule for inserting
        if(deltaE>10):
            P_in=0
        else:
            P_in = (factorin) *math.exp(-beta*deltaE)
        print("pinacc=",P_in)
        rand= random.uniform(0, 1)
        if rand <= P_in :
            particles.append(new_particle)
            en += deltaE
            npart += 1
            print("accepted insertion")
    else:
        if npart != 0:
            p = random.randint(0, npart-1)
            this_particle = particles[p]
            prev_E = particleEnergy(this_particle, particles, p)
            deltaE = prev_E
            print("dEout=",deltaE)
            # Acceptance rule for removing
            if(deltaE>10):
                P_re=1
            else:
                P_re = (factorout)*math.exp(beta*deltaE)
            print("poutacc=",P_re)
            rand = random.uniform(0, 1)
            if rand <= P_re :
                particles.remove(this_particle)
                en += deltaE
                npart = npart - 1
                print("accepted removal")
    print()
    return particles, en, npart
Relevant Monte Carlo part: in 1 out of 10 steps, check the possibility of inserting or removing a particle
# MC
for step in range(0, runTimes):
    print(step)
    print()
    rand = random.uniform(0,1)
    if rand <= 0.9:
        #----------- change energies-------------------------
        #........
        #........
    else:
        particles, en, N = mcex(L,N,particles,beta,rho0,V, en)
    # stepList.append(step)
    if((step+1)%1000==0):
        for i, particle1 in enumerate(particles):
            for j, particle2 in enumerate(particles):
                if j!= i:
                    # print(particle1)
                    # print(particle2)
                    # print(i)
                    # print(j)
                    dist.append(distancesq(particle1, particle2))
        NList.append(N)
This is where we call the function mcex (listed above), and perhaps the particles array is not updated correctly there.
and finally, we create the g(r) histogram, where perhaps the normalization or the use of the histogram method is not as it should be:
RDF(N,particles,L)
with the function:
def RDF(N,particles, L):
    minb=0
    maxb=8
    nbin=500
    skata=np.asarray(dist).flatten()
    rDf = np.histogram(skata, np.linspace(minb, maxb,nbin))
    prefactor = (1/2/ np.pi)* (L**2/N **2) /len(dist) *( nbin /(maxb -minb) )
    # prefactor = (1/(2* np.pi))*(L**2/N**2)/(len(dist)*num_increments/(rMax + 1.1 * dr ))
    rDf = [prefactor*rDf[0], 0.5*(rDf[1][1:]+rDf[1][:-1])]
    print('skata',len(rDf[0]))
    print('incr',len(rDf[1]))
    plt.figure()
    plt.plot(rDf[1],rDf[0])
    plt.xlabel("r")
    plt.ylabel("g(r)")
    plt.show()
The results are: [plot: particle number N fluctuations] and [plot: the computed g(r), which blows up at larger r], but we want [plot: the expected g(r)].
Although I have accepted an answer, I am posting here some more details.
To normalize the pair correlation correctly, one must divide each "number of particles found at a certain distance" (mathematically, the sum of delta functions of the distances) by the distance itself.
Understanding first that numpy.histogram returns two arrays, the first element being the array of all counted events and the second element the vector of bin edges, one must take each element of the first array, let's say np.histogram[0], and divide it elementwise by the corresponding bin value from np.histogram[1].
That is, one must do the following:
def RDF(N,particles, L):
    minb=0
    maxb=25
    nbin=200
    width=(maxb-minb)/(nbin)
    rings=np.linspace(minb, maxb,nbin)
    skata=np.asarray(dist).flatten()
    rDf = np.histogram(skata, rings ,density=True)
    prefactor = (1/( np.pi*(L**2/N**2)))
    rDf = [prefactor*rDf[0], 0.5*(rDf[1][1:]+rDf[1][:-1])]
    rDf[0]=np.multiply(rDf[0],1/(rDf[1]*( width )))
where, before the last multiply line, we are centering the bins so that their number equals the number of elements of the first array (you have five fingers, but only four gaps between them).
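As a quick side illustration of that bin-centering step (my own toy example, not part of the original code): np.histogram returns one more bin edge than it returns counts, so taking midpoints lines the two arrays up.

import numpy as np

counts, edges = np.histogram([0.5, 1.5, 1.7, 3.2], bins=4, range=(0, 4))
print(len(counts), len(edges))             # 4 counts, 5 edges
midpoints = 0.5 * (edges[1:] + edges[:-1])
print(midpoints)                           # [0.5 1.5 2.5 3.5]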
Your g(r) is not correctly normalised. You need to divide the number of pairs found in each bin by the average density of the system times the area of the annulus associated to that bin, where the latter is just 2 pi r dr, with r being the bin's midpoint and dr the bin size. As far as I can tell, your prefactor does not contain the "r" bit. There is also something else that is missing, but it's hard to tell without knowing what all the other constants contain.
EDIT: here is a link that will guide you through the implementation of a routine to compute the radial distribution function in 2D and 3D
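For reference, here is a minimal sketch of that normalisation (my own illustration, not the linked routine). It assumes pair_distances holds actual distances (not squared ones), that every ordered pair is counted as in the double loop of the question, that n_samples configurations were accumulated, and that n_particles is the average particle number of the grand-canonical run.

import numpy as np

def rdf_2d(pair_distances, n_particles, box_length, n_samples, nbins=200):
    """2D g(r): pair counts per bin divided by the ideal-gas expectation."""
    r_max = box_length / 2.0
    counts, edges = np.histogram(pair_distances, bins=nbins, range=(0.0, r_max))
    r = 0.5 * (edges[1:] + edges[:-1])      # bin midpoints
    dr = edges[1] - edges[0]                # bin width
    shell_area = 2.0 * np.pi * r * dr       # area of the annulus for each bin
    density = n_particles / box_length**2   # average 2D number density
    # ideal-gas pair count per bin: n_samples * N * rho * 2*pi*r*dr
    ideal = n_samples * n_particles * density * shell_area
    return r, counts / ideal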
I want to apply k-means (or any other simple clustering algorithm) to data with two variables, but I want the clusters to respect a condition: the sum of a third variable per cluster > some_value.
Is that possible?
Notations:
- K is the number of clusters
- let's say that the first two variables are point coordinates (x, y)
- V denotes the third variable
- Ci: the sum of V over cluster i
- S: the total sum (sum of all Ci)
- T: the threshold
Problem definition :
From what I understood, the aim is to run an algorithm that keeps the spirit of kmeans while respecting the constraints.
Task 1 - group points by proximity to centroids [kmeans]
Task 2 - for each cluster i, Ci > T [constraint]
Regular kmeans limitation for the constraint problem :
A regular kmeans assigns points to centroids by taking them in an arbitrary order. In our case, this will lead to uncontrolled growth of the Ci while adding points.
For example, take K=2, T=40 and 4 points whose third-variable values are V1=50, V2=1, V3=50, V4=50.
Suppose also that points P1, P3, P4 are closer to centroid 1, and point P2 is closer to centroid 2.
Let's run the assignment step of a regular kmeans and keep track of Ci:
1-- take point P1, assign it to cluster 1. C1=50 > T
2-- take point P2, assign it to cluster 2 C2=1
3-- take point P3, assign it to cluster 1. C1=100 > T => C1 grows too much !
4-- take point P4, assign it to cluster 1. C1=150 > T => !!!
Modified kmeans :
In the previous example, we want to prevent C1 from growing too much and to help C2 grow.
This is like pouring champagne into several glasses: if you see a glass with less champagne, you go and fill it. You do that because you have constraints: a limited amount of champagne (S is bounded) and because you want every glass to have enough champagne (Ci > T).
Of course this is just an analogy. Our modified kmeans will add new points to the cluster with minimal Ci until the constraint is achieved (Task 2). Now in which order should we add points? By proximity to centroids (Task 1). After the constraint is achieved for every cluster i, we can just run a regular kmeans on the remaining unassigned points.
Implementation :
Next, I give a Python implementation of the modified algorithm. Figure 1 displays a representation of the third variable using transparency to visualize large vs. low values. Figure 2 displays the evolution of the clusters using color.
You can play with the accept_thresh parameter. In particular, note that:
For accept_thresh = 0 => regular kmeans (the constraint is reached immediately)
For accept_thresh = third_var.sum().sum() / (2*K), you might observe that some points that are closer to a given centroid get assigned to another one for constraint reasons.
CODE :
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
import time
import sys  # needed for sys.exit below

nb_samples = 1000
K = 3  # for demo purpose, used to generate cloud points
c_std = 1.2

# Generate test samples :
points, classes = datasets.make_blobs(n_features=2, n_samples=nb_samples, \
                                       centers=K, cluster_std=c_std)

third_var_distribution = 'cubic_bycluster'  # 'uniform'

if third_var_distribution == 'uniform':
    third_var = np.random.random((nb_samples))
elif third_var_distribution == 'linear_bycluster':
    third_var = np.random.random((nb_samples))
    third_var = third_var * classes
elif third_var_distribution == 'cubic_bycluster':
    third_var = np.random.random((nb_samples))
    third_var = third_var * classes

# Threshold parameters :
# Try with K=3 and :
# T = K   => one cluster reaches the constraint, two clusters won't converge
# T = 2K  =>
accept_thresh = third_var.sum().sum() / (2*K)
def dist2centroids(points, centroids):
    '''return arrays of ordered points to each centroid
    first array is index of points
    second array is distance to centroid
    dim 0 : centroid
    dim 1 : distance or point index
    '''
    dist = np.sqrt(((points - centroids[:, np.newaxis]) ** 2).sum(axis=2))
    ord_dist_indices = np.argsort(dist, axis=1)

    ord_dist_indices = ord_dist_indices.transpose()
    dist = dist.transpose()

    return ord_dist_indices, dist
def assign_points_with_constraints(inds, dists, tv, accept_thresh):
    assigned = [False] * nb_samples
    assignements = np.ones(nb_samples, dtype=int) * (-1)
    cumul_third_var = np.zeros(K, dtype=float)
    current_inds = np.zeros(K, dtype=int)

    max_round = nb_samples * K

    for round in range(0, max_round):  # we'll break anyway
        # worst advanced cluster in terms of cumulated third_var :
        cluster = np.argmin(cumul_third_var)

        if cumul_third_var[cluster] > accept_thresh:
            continue  # cluster had enough samples

        while current_inds[cluster] < nb_samples:
            # add points to increase cumulated third_var on this cluster
            i_inds = current_inds[cluster]
            closest_pt_index = inds[i_inds][cluster]

            if assigned[closest_pt_index] == True:
                current_inds[cluster] += 1
                continue  # pt already assigned to a cluster

            assignements[closest_pt_index] = cluster
            cumul_third_var[cluster] += tv[closest_pt_index]
            assigned[closest_pt_index] = True

            current_inds[cluster] += 1

            new_cluster = np.argmin(cumul_third_var)
            if new_cluster != cluster:
                break

    return assignements, cumul_third_var
def assign_points_with_kmeans(points, centroids, assignements):
    new_assignements = np.array(assignements, copy=True)

    count = -1
    for asg in assignements:
        count += 1

        if asg > -1:
            continue

        pt = points[count, :]

        distances = np.sqrt(((pt - centroids) ** 2).sum(axis=1))
        centroid = np.argmin(distances)

        new_assignements[count] = centroid

    return new_assignements
def move_centroids(points, labels):
    centroids = np.zeros((K, 2), dtype=float)
    for k in range(0, K):
        centroids[k] = points[labels == k].mean(axis=0)
    return centroids
rgba_colors = np.zeros((third_var.size, 4))
rgba_colors[:, 0] = 1.0
rgba_colors[:, 3] = 0.1 + (third_var / max(third_var))/1.12
plt.figure(1, figsize=(14, 14))
plt.title("Three blobs", fontsize='small')
plt.scatter(points[:, 0], points[:, 1], marker='o', c=rgba_colors)

# Initialize centroids
centroids = np.random.random((K, 2)) * 10
plt.scatter(centroids[:, 0], centroids[:, 1], marker='X', color='red')

# Step 1 : order points by distance to centroid :
inds, dists = dist2centroids(points, centroids)

# Check if clustering is theoretically possible :
tv_sum = third_var.sum()
tv_max = third_var.max()
if (tv_max > 1 / 3 * tv_sum):
    print("No solution to the clustering problem !\n")
    print("For one point : third variable is too high.")
    sys.exit(0)

stop_criter_eps = 0.001
epsilon = 100000
prev_cumdist = 100000

plt.figure(2, figsize=(14, 14))
ln, = plt.plot([])
plt.ion()
plt.show()
while epsilon > stop_criter_eps:
    # Modified kmeans assignment :
    assignements, cumul_third_var = assign_points_with_constraints(inds, dists, third_var, accept_thresh)

    # Kmeans on remaining points :
    assignements = assign_points_with_kmeans(points, centroids, assignements)

    centroids = move_centroids(points, assignements)

    inds, dists = dist2centroids(points, centroids)

    epsilon = np.abs(prev_cumdist - dists.sum().sum())
    print("Delta on error :", epsilon)
    prev_cumdist = dists.sum().sum()

    plt.clf()
    plt.title("Current Assignements", fontsize='small')
    plt.scatter(points[:, 0], points[:, 1], marker='o', c=assignements)
    plt.scatter(centroids[:, 0], centroids[:, 1], marker='o', color='red', linewidths=10)
    plt.text(0, 0, "THRESHOLD T = "+str(accept_thresh), va='top', ha='left', color="red", fontsize='x-large')
    for k in range(0, K):
        plt.text(centroids[k, 0], centroids[k, 1] + 0.7, "Ci = "+str(cumul_third_var[k]))
    plt.show()
    plt.pause(1)
Improvements:
- use the distribution of the third variable for assignments
- manage divergence of the algorithm
- better initialization (kmeans++); a small seeding sketch follows below
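For that last point, here is a minimal k-means++-style seeding sketch (my own addition) that could replace the purely random centroid initialization above:

def init_centroids_pp(points, K, rng=None):
    '''k-means++-style seeding: pick the first centroid at random, then pick
    each following centroid with probability proportional to the squared
    distance to the nearest centroid chosen so far.'''
    rng = rng or np.random.default_rng()
    centroids = [points[rng.integers(len(points))]]
    for _ in range(K - 1):
        d2 = ((points[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        centroids.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    return np.array(centroids)

# usage: centroids = init_centroids_pp(points, K)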
One way to handle this would be to filter the data before clustering.
>>> cluster_data = df.loc[df['third_variable'] > some_value]
>>> from sklearn.cluster import KMeans
>>> y_pred = KMeans(n_clusters=2).fit_predict(cluster_data)
If by sum you mean the sum of the third variable per cluster, then you could use RandomizedSearchCV to find hyperparameters of KMeans that do or do not meet the condition.
K-means itself is an optimization problem.
Your additional constraint is a rather common optimization constraint, too.
So I'd rather approach this with an optimization solver.
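To make that concrete, here is a rough sketch of one assignment step phrased as a mixed-integer program, using the PuLP library (my choice for illustration; any MILP solver would do). Centroids are held fixed in this step; you would alternate it with a centroid update, Lloyd-style.

import numpy as np
import pulp

def constrained_assignment(points, centroids, v, T):
    '''Assign each point to exactly one cluster, minimizing total
    point-centroid distance subject to sum(v) >= T in every cluster.'''
    n, K = len(points), len(centroids)
    d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    prob = pulp.LpProblem("constrained_kmeans_step", pulp.LpMinimize)
    x = pulp.LpVariable.dicts("x", [(i, j) for i in range(n) for j in range(K)], cat="Binary")
    prob += pulp.lpSum(d[i, j] * x[i, j] for i in range(n) for j in range(K))
    for i in range(n):
        prob += pulp.lpSum(x[i, j] for j in range(K)) == 1          # one cluster per point
    for j in range(K):
        prob += pulp.lpSum(v[i] * x[i, j] for i in range(n)) >= T   # Cj >= T
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return np.array([max(range(K), key=lambda j: x[i, j].value()) for i in range(n)])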