This is my dataset:
4095 546
3213 2059
4897 2661
...
3586 2583
3437 3317
3364 1216
Each line is a pair of nodes that have an edge between them. The whole dataset builds a graph. But I want to get many node pairs which are disconnected from each other. How can I get 1000 (or more) such node pairs from the dataset? For example:
2761 2788
4777 3365
3631 3553
...
3717 4074
3013 2225
Each line is a pair of nodes without an edge between them.
Please see the part under the EDIT!
I think the other answers are more general, and probably nicer from a programmatic point of view. I just had a quick idea of how you could get the list in a very easy way using numpy.
First create the adjacency matrix; here your list of node pairs is a numpy array:
import numpy as np
node_list = np.random.randint(10, size=(10, 2))
A = np.zeros((np.max(node_list) + 1, np.max(node_list) + 1)) # + 1 to account for zero indexing
A[node_list[:, 0], node_list[:, 1]] = 1 # set connected nodes to 1
x, y = np.where(A == 0) # Find disconnected nodes
disconnected_list = np.vstack([x, y]).T # The final list of disconnected nodes
I have no idea, though, how this will work with really large-scale networks.
EDIT: The above solution was me thinking a bit too fast. As it stands, the solution above provides the missing edges between nodes (treating the graph as directed), not the disconnected node pairs. Furthermore, disconnected_list includes each pair twice. Here is a hacky second idea for a solution:
import numpy as np
node_list = np.random.randint(10, size=(10, 2))
A = np.zeros((np.max(node_list) + 1, np.max(node_list) + 1)) # + 1 to account for zero indexing
A[node_list[:, 0], node_list[:, 1]] = 1 # set connected nodes to 1
A[node_list[:, 1], node_list[:, 0]] = 1 # Make the graph symmetric
A = A + np.triu(np.ones(A.shape)) # Add ones to the upper triangular
# matrix, so they are not considered in np.where (set k if you want to consider the diagonal)
x, y = np.where(A == 0) # Find disconnected nodes
disconnected_list = np.vstack([x, y]).T # The final list of disconnected nodes
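To get the 1000 (or more) pairs the question asks for, you could then simply sample rows from disconnected_list. A small usage sketch (my addition; it assumes disconnected_list has at least 1000 rows):
import numpy as np
rng = np.random.default_rng(0)
idx = rng.choice(len(disconnected_list), size=1000, replace=False)  # assumes >= 1000 rows exist
sample_pairs = disconnected_list[idx]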
Just do a BFS or DFS to get the size of every connected component in O(|V| + |E|) time. Once you have the component sizes, you can get the number of disconnected node pairs easily: it is the sum of the products of every pair of sizes.
E.g. if your graph has 3 connected components with sizes 50, 20, and 100, then the number of pairs of disconnected nodes is 50*20 + 50*100 + 20*100 = 8000.
If you want to actually output the disconnected pairs instead of just counting them, you should probably use union-find and then just iterate through all pairs of nodes and output them if they're not in the same component.
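For illustration, here is a rough sketch of that idea using networkx connected components instead of a hand-rolled union-find (my own addition; the file name 'edges.txt' is a placeholder for the dataset shown at the top):
import random
import networkx as nx

# Build the graph from the edge list (two node ids per line, as in the dataset above).
G = nx.Graph()
with open('edges.txt') as f:  # placeholder file name
    for line in f:
        u, v = line.split()
        G.add_edge(u, v)

# Map every node to the id of its connected component.
component_of = {}
sizes = []
for cid, comp in enumerate(nx.connected_components(G)):
    sizes.append(len(comp))
    for node in comp:
        component_of[node] = cid

# Number of disconnected pairs = sum of products of all pairs of component sizes.
total = sum(sizes)
num_pairs = sum(s * (total - s) for s in sizes) // 2
print('disconnected pairs available:', num_pairs)

# Sample 1000 disconnected pairs by rejection sampling.
nodes = list(G.nodes())
pairs = set()
while len(pairs) < min(1000, num_pairs):
    u, v = random.sample(nodes, 2)
    if component_of[u] != component_of[v]:
        pairs.add((u, v))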
I am coding in PyTorch. In between the torch inference code, I add some peripheral code for my own interest. This code works fine, but it is too slow. The reason is probably the for-loop iteration, so I need a parallel and fast way of doing this.
It is okay to do this with tensors, NumPy, or plain Python arrays.
I made a function named selective_max to find the maximum value in arrays. The problem is that I don't want the maximum over the whole array, but over specific candidates designated by a mask array. Let me show the gist of this function (the code itself is below).
Input
x [batch_size, dim, num_points, k]: x is the original input, but it becomes [batch_size, num_points, dim, k] after x.permute(0,2,1,3).
batch_size is the usual notion from deep learning. In every mini-batch there are many points, and a single point is represented by a feature of length dim. For each feature element, there are k potential candidates, which are the target of the max operation later.
mask [batch_size, num_points, k]: This array is shaped like x but without the dim axis. Its elements are either 0 or 1, so I use it as a mask signal: the max operation should only consider values masked with 1.
Kindly read the code below with this explanation in mind. I use 3 nested for loops. Let's say we target a specific batch and a specific point. For that batch and point, x is a [dim, k] array and mask is a [k] array consisting of 0s and 1s. I extract the indices of the non-zero mask entries and use them to select specific elements of x, dim by dim (the 'for k in range(dim)' loop).
Toy example
Let's say we are inside the second for loop, so we now have a [dim, k] slice of x and a [k] mask. For this toy example, I assume k=3 and dim=4: x = [[3,2,1],[5,6,4],[9,8,7],[12,11,10]], mask = [0,1,1]. So the output should be [2,6,8,11], not [3, 6, 9, 12].
Previous attempt
I tried { mask.repeat(0,0,1,0) * x } (element-wise multiplication) followed by the max operation. But 0 might then become the max value, because x might contain only negative values in some array. So this would produce a wrong result.
import numpy as np
import torch

def selective_max2(x, mask):  # x: [batch_size, dim, num_points, k], mask: [batch_size, num_points, k]
    batch_size = x.size(0)
    dim = x.size(1)
    num_points = x.size(2)
    k = x.size(3)
    device = torch.device('cuda')
    x = x.permute(0, 2, 1, 3)  # [batch, num_points, dim, k]
    # print('permuted x dimension: ', x.size())
    x = x.detach().cpu().numpy()
    mask = mask.cpu().numpy()
    output = np.zeros((batch_size, num_points, dim))
    for i in range(batch_size):
        for j in range(num_points):
            query = np.nonzero(mask[i][j])  # among the mask entries, get the indices of the non-zero values
            for k in range(dim):  # for each feature dimension, take the max over the selected candidates
                # query holds the indices of the non-zero mask values, so x[i][j][k][query] are the candidates
                output[i][j][k] = np.max(x[i][j][k][query])
    output = torch.from_numpy(output).float().to(device=device)
    output = output.permute(0, 2, 1).contiguous()
    return output
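For comparison, here is a minimal vectorized sketch (my own addition, not the asker's code) that avoids the zero-multiplication pitfall described under "Previous attempt" by filling the masked-out entries with -inf before taking the max. It assumes x is a floating-point tensor and that every (batch, point) slice has at least one mask entry equal to 1:
import torch

def selective_max_vectorized(x, mask):
    # x: [batch_size, dim, num_points, k], mask: [batch_size, num_points, k] with 0/1 entries
    x = x.permute(0, 2, 1, 3)                  # [batch, num_points, dim, k]
    m = mask.unsqueeze(2).bool()               # [batch, num_points, 1, k], broadcasts over dim
    masked = x.masked_fill(~m, float('-inf'))  # entries with mask == 0 can never win the max
    out = masked.max(dim=-1).values            # [batch, num_points, dim]
    return out.permute(0, 2, 1).contiguous()   # back to [batch, dim, num_points]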
Disclaimer: I've followed your toy example (while retaining generality) to write the following solution.
The first thing is to expand your mask k to the shape of x (treating both as PyTorch tensors):
k_expanded = k.expand_as(x)
Then you select the elements of x where k_expanded is 1, and view the resulting tensor with x.shape[0] rows and as many columns as there are 1s in k (the mask), computed by (k == 1).sum(0). Up to this point, we have selected the candidates we want to query the maximum over. Finally, you take the maximum along the column dimension using max(1):
values, indices = x[k_expanded == 1].view(x.shape[0], (k == 1).sum(0)).max(1)
values
Out[29]: tensor([ 2, 6, 8, 11])
Benchmarks
def find_max_elements_inside_tensor_range(arr, mask, return_indices=False):
    mask_expanded = mask.expand_as(arr)
    values, indices = arr[mask_expanded == 1].view(arr.shape[0], (mask == 1).sum(0)).max(1)
    return (values, indices) if return_indices else values
I just added a third parameter in case you also want the indices of the maximum values.
%timeit find_max_elements_inside_tensor_range(x, k)
38.4 µs ± 534 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Note: the above solution also works for tensors and masks of various shapes.
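As a quick sanity check (my own addition), running the function above on the toy example from the question should reproduce the expected output [2, 6, 8, 11]:
import torch

x = torch.tensor([[3, 2, 1], [5, 6, 4], [9, 8, 7], [12, 11, 10]])
k = torch.tensor([0, 1, 1])  # the mask from the toy example

print(find_max_elements_inside_tensor_range(x, k))  # expected: tensor([ 2,  6,  8, 11])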
I've a random graph created using Networkx, and I want to delete nodes with degree less than 2, except for 2 user-defined nodes that have degree = 1. To remove all nodes with degree < 2, I could use Networkx's k-core. But I am not sure how to retain the 2 user-defined nodes. For example, the following code generates the figures below,
import networkx as nx
import matplotlib.pyplot as plt
# fig 1
G = nx.gnm_random_graph(n=20, m=30, seed=1)
nx.draw(G, with_labels=True, pos=nx.spring_layout(G))
plt.show()
G = nx.k_core(G, k=2)
nx.draw(G, with_labels=True, pos=nx.spring_layout(G))
plt.show()
Figure 1:
Figure 2:
I would like to ask for suggestions on how to retain 2 user-defined nodes:
e.g.
retain_node_ids = [1, 2]
EDIT:
I could use remove_nodes_from as suggested below. But if we delete nodes with degree < 2, nodes that originally had degree >= 2 may end up with degree < 2. To repeat the process until no node with degree < 2 remains, k-core has been used.
Here is how you can do it:
degrees = dict(G.degree())
G.remove_nodes_from([node for node in G.nodes
                     if node not in retain_node_ids and degrees[node] < 2])
Of course this piece of code does not find a maximal subgraph (as the k_core function does): it simply removes, in a single pass, all nodes that have degree less than 2 and are not in the retain_node_ids list.
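If you want the repeated removal that the question's EDIT asks for, but without the fake-node trick shown below, a simple iterative variant could look like this (my sketch, not part of the original answer):
import networkx as nx

def prune_low_degree(G, retain_node_ids, k=2):
    # Repeatedly remove unprotected nodes with degree < k until none remain,
    # mimicking k_core while always keeping the nodes in retain_node_ids.
    H = G.copy()
    while True:
        to_remove = [n for n, d in H.degree() if d < k and n not in retain_node_ids]
        if not to_remove:
            return H
        H.remove_nodes_from(to_remove)

G = prune_low_degree(G, retain_node_ids=[1, 2])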
EDIT:
You can add two fake nodes, connect the nodes you want to retain to them, compute the k-core, and then get rid of the fake nodes:
n = max(G.nodes) + 1  # two unused node ids for the fake nodes: n and n + 1
G.add_edges_from([(u, v) for u in retain_node_ids for v in (n, n + 1)])
G = nx.k_core(G, k=2)
G.remove_nodes_from([n, n + 1])
I've a graph network created using Networkx and plotted using Mayavi.
After the graph is created, I'm deleting nodes with degree < 2 using G.remove_nodes_from(). Once the nodes are deleted, the edges connected to these nodes are deleted, but the nodes still appear in the final output (image below).
import matplotlib.pyplot as plt
from mayavi import mlab
import networkx as nx
import numpy as np
import pandas as pd

# assumed placeholder values: bgcolor, edge_size and edge_color are not defined in the original snippet
bgcolor = (1, 1, 1)
edge_size = 0.2
edge_color = (0.8, 0.8, 0.8)

pos = [[0.1, 2, 0.3], [40, 0.5, -10],
       [0.1, -40, 0.3], [-49, 0.1, 2],
       [10.3, 0.3, 0.4], [-109, 0.3, 0.4]]
pos = pd.DataFrame(pos, columns=['x', 'y', 'z'])

ed_ls = [(x, y) for x, y in zip(range(0, 5), range(1, 6))]
G = nx.Graph()
G.add_edges_from(ed_ls)

remove = [node for node, degree in dict(G.degree()).items() if degree < 2]
G.remove_nodes_from(remove)
pos.drop(pos.index[remove], inplace=True)
print(G.edges)

nx.draw(G)
plt.show()

mlab.figure(1, bgcolor=bgcolor)
mlab.clf()
for i, e in enumerate(G.edges()):
    # ----------------------------------------------------------------------------
    # the x, y, and z co-ordinates are here
    pts = mlab.points3d(pos['x'], pos['y'], pos['z'],
                        scale_mode='none',
                        scale_factor=1)
    # ----------------------------------------------------------------------------
    pts.mlab_source.dataset.lines = np.array(G.edges())
    tube = mlab.pipeline.tube(pts, tube_radius=edge_size)
    mlab.pipeline.surface(tube, color=edge_color)
mlab.show()  # interactive window
I'd like to ask for suggestions on how to remove the deleted nodes and the corresponding positions and display the rest in the output.
Secondly, I would like to know how to delete the nodes and the edges connected to these nodes interactively. For instance, if I want to delete nodes and edges connected to nodes of degree < 2, I would first like to display an interactive graph with all nodes of degree < 2 highlighted. The user can then select the nodes that have to be deleted in an interactive manner: by clicking on a highlighted node, the node and its connected edges are deleted.
EDIT:
I tried to remove the positions of the deleted nodes from the dataframe pos by adding pos.drop(pos.index[remove], inplace=True), as updated in the complete code posted above.
But I still don't get the correct output.
Here is a solution for interactive removal of network nodes and edges in Mayavi
(I think matplotlib might be sufficient and easier but anyways...).
The solution is inspired by this Mayavi example.
However, the example is not directly transferable because a glyph (used to visualize the nodes) consists of many points and when plotting
each glyph/node by itself, the point_id cannot be used to identify the glyph/node. Moreover, it does not include the option to
hide/delete objects. To avoid these problems, I used four ideas:
Each node/edge is plotted as a separate object, so it is easier to adjust its (visibility) properties.
Instead of deleting nodes/edges, they are just hidden when clicked upon.
Moreover, clicking twice makes the node visible again
(this does not work for the edges with the code below but you might be able to implement that if required,
just needs keeping track of visible nodes).
The visible nodes can be collected at the end (see code below).
As in the example, the mouse position is captured using a picker callback.
But instead of using the point_id of the closest point, its coordinates are used directly.
The node to be deleted/hidden is found by computing the minimum Euclidean distance between the mouse position and all nodes.
PS: In your original code, the for-loop is quite redundant because it plots all nodes and edges many times on top of each other.
Hope that helps!
# import modules
from mayavi import mlab
import numpy as np
import pandas as pd
import networkx as nx

# set number of nodes
number = 6

# create random node positions
np.random.seed(5)
pos = 100*np.random.rand(6, 3)
pos = pd.DataFrame(pos, columns=['x', 'y', 'z'])

# create chain graph links
links = [(x, y) for x, y in zip(range(0, number-1), range(1, number))]

# create graph (not strictly needed, link list above would be enough)
graph = nx.Graph()
graph.add_edges_from(links)

# setup mayavi figure
figure = mlab.gcf()
mlab.clf()

# add nodes as individual glyphs
# store glyphs in dictionary to allow interactive adjustments of visibility
color = (0.5, 0.0, 0.5)
nodes = dict()
texts = dict()
for ni, n in enumerate(graph.nodes()):
    xyz = pos.loc[n]
    n = mlab.points3d(xyz['x'], xyz['y'], xyz['z'], scale_factor=5, color=color)
    label = 'node %s' % ni
    t = mlab.text3d(xyz['x'], xyz['y'], xyz['z']+5, label, scale=(5, 5, 5))
    # each glyph consists of many points
    # arr = n.glyph.glyph_source.glyph_source.output.points.to_array()
    nodes[ni] = n
    texts[ni] = t

# add edges as individual tubes
edges = dict()
for ei, e in enumerate(graph.edges()):
    xyz = pos.loc[np.array(e)]
    edges[ei] = mlab.plot3d(xyz['x'], xyz['y'], xyz['z'], tube_radius=1, color=color)

# define picker callback for figure interaction
def picker_callback(picker):
    # get coordinates of mouse click position
    cen = picker.pick_position
    # compute Euclidean distance between mouse position and all nodes
    dist = np.linalg.norm(pos - cen, axis=1)
    # get closest node
    ni = np.argmin(dist)
    # hide/show node and text
    n = nodes[ni]
    n.visible = not n.visible
    t = texts[ni]
    t.visible = not t.visible
    # hide/show edges
    # must be adjusted if double-clicking should hide/show both nodes and edges in a reasonable way
    for ei, edge in enumerate(graph.edges()):
        if ni in edge:
            e = edges[ei]
            e.visible = not e.visible

# add picker callback
picker = figure.on_mouse_pick(picker_callback)
picker.tolerance = 0.01

# show interactive window
# mlab.show()

# collect visibility/deletion status of nodes, e.g.
# [(0, True), (1, False), (2, True), (3, True), (4, True), (5, True)]
[(key, node.visible) for key, node in nodes.items()]
I am doing clustering and therefore scaled my data. I now want my visualization (cluster chart) to use the original data points, i.e. the values before scaling. I have not come across a good solution yet. I hope someone can help.
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# convert df='data' to numpy array for clustering
data = data.values
X = data

# Scale
X = StandardScaler().fit_transform(X)

# Compute DBSCAN
db = DBSCAN(eps=0.25, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

# Internal indices measure for performance
print("Silhouette Coefficient: %0.3f" % metrics.silhouette_score(X, labels))

# Plot result
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters, excluding noise cluster: %d' % n_clusters_)
plt.xlabel('A', fontsize=18)
plt.ylabel('B', fontsize=16)
plt.ylim(-0.5, 5)
plt.xlim(-0.5, 5)
plt.show()
Output: it shows the cluster chart, but with the scaled values on the axes.
Questions:
1. How can I plot it with its original values?
2. Am I missing anything in general for doing DBSCAN clustering? I.e., how do I ensure that my clustering performance is good? I do not have a ground truth, so I only used the Silhouette metric, but I am not confident that my model's performance is really good. What is the purpose of a ground truth if I am NOT trying to predict in my case, and rather only describe the current state?
Just plot the original data then.
I.e., plot data, not X, if that is what you want.
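A minimal sketch of that (my own construction, using made-up stand-in data): run DBSCAN on the scaled X, but index the unscaled data with the resulting labels when plotting.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

data = np.random.rand(200, 2) * 4               # stand-in for the original, unscaled data
X = StandardScaler().fit_transform(data)        # scaled copy, used only for clustering
labels = DBSCAN(eps=0.25, min_samples=10).fit(X).labels_

for k in set(labels):
    pts = data[labels == k]                     # plot the ORIGINAL values, not X
    plt.plot(pts[:, 0], pts[:, 1], 'o', label='noise' if k == -1 else 'cluster %d' % k)
plt.legend()
plt.show()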
Cluster performance is inherently subjective. A clustering is good if you learn something about your data that you did not know before. Since what you "know" or what is "useful" cannot be captured in equations, it cannot be reliably evaluated; any evaluation is just a heuristic. Silhouette is not a good choice, because it punishes noise and non-convex clusters. Internal measures are just like clustering algorithms; external measures compute how well they find something you already know - neither is good for actual data. External measures are popular in scientific papers, though, to demonstrate that an algorithm isn't complete garbage: you pretend you do not know what you do know, and then check whether the algorithm can still find that pattern.
So what do you need to do? Investigate: does it look useful, is it worth trying to use this? Then proceed: try to use the clustering to solve your problem. It is good if it helps you solve your problem.
I have the speed of feature points at every frame. There are 165 frames in a video, and every frame contains the speeds of the feature points. This is my data:
TrajDbscanData
array([[ 1. , 0.51935178],
[ 1. , 0.52063496],
[ 1. , 0.54598193],
...,
[165. , 0.47198981],
[165. , 2.2686042 ],
[165. , 0.79044946]])
where the first column is the frame number and the second one is the speed of a feature point at that frame.
Here I want to do density-based clustering for different speed ranges. For this, I use the following code.
import numpy as np
import matplotlib.pyplot as plt
import sklearn.cluster as sklc

figcount = 1  # figure counter (assumed; defined elsewhere in the original script)

core_samples, labels_db = sklc.dbscan(
    TrajDbscanData,  # array has to be (n_samples, n_features)
    eps=0.5,
    min_samples=15,
    metric='euclidean',
    algorithm='auto'
)
core_samples_mask = np.zeros_like(labels_db, dtype=bool)
core_samples_mask[core_samples] = True

unique_labels = set(labels_db)
n_clusters_ = len(unique_labels) - (1 if -1 in labels_db else 0)

colors = plt.cm.Spectral(np.linspace(0, 1, len(unique_labels)))

plt.figure(figcount)
figcount += 1
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = 'k'
    class_member_mask = (labels_db == k)
    xy = TrajDbscanData[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=col, markeredgecolor='k', markersize=6)
    xy = TrajDbscanData[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'x', markerfacecolor=col, markeredgecolor='k', markersize=4)

plt.rcParams["figure.figsize"] = (10, 7)
plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.grid(True)
plt.show()
I got the following result.
The y axis is speed and the x axis is the frame number.
I want to do density-based clustering according to speed: for example, speed up to 1.0 in one cluster, speed from 1.0 to 1.5 as outliers, speed from 1.5 to 2.0 in another cluster, and speed above 2.0 in yet another cluster. This helps to identify common motion pattern types. How can I do this?
Don't use Euclidean distance.
Since your x and y axes have very different meanings, Euclidean distance is the wrong distance function to use.
Your plot is misleading, because the axes have different scales. If you scaled x and y the same way, you would see what has been happening: the y axis is effectively ignored, and you are slicing the data along your discrete integer time axis.
You may need to use Generalized DBSCAN and treat time and value separately!
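One simple way to act on that (a sketch of my own, not necessarily what is meant by Generalized DBSCAN here) is to cluster on the speed column alone, so that the frame number cannot dominate the distance; the eps value is only illustrative:
import numpy as np
from sklearn.cluster import DBSCAN

# Stand-in for TrajDbscanData: column 0 = frame number, column 1 = speed.
rng = np.random.default_rng(0)
TrajDbscanData = np.column_stack([
    rng.integers(1, 166, size=500),
    np.concatenate([rng.normal(0.5, 0.1, 300), rng.normal(2.2, 0.2, 200)]),
])

speeds = TrajDbscanData[:, 1].reshape(-1, 1)  # cluster on the speed values only
labels = DBSCAN(eps=0.05, min_samples=15).fit(speeds).labels_
print(sorted(set(labels)))  # -1 marks outliers; the other labels are dense speed bands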