Graph drawing software - graphics

I need to draw a graph composed of 1876 clusters organized in the following manner:
962 clusters composed of 1 node
651 clusters composed of 2 nodes
144 clusters composed of 3 nodes
52 clusters composed of 4 nodes
24 clusters composed of 5 nodes
8 clusters composed of 6 nodes
8 clusters composed of 7 nodes
2 clusters composed of 8 nodes
4 clusters composed of 9 nodes
3 clusters composed of 10 nodes
1 cluster composed of 11 nodes
1 cluster composed of 12 nodes
4 clusters composed of 13 nodes
1 cluster composed of 16 nodes
1 cluster composed of 21 nodes
1 cluster composed of 22 nodes
1 cluster composed of 24 nodes
1 cluster composed of 25 nodes
1 cluster composed of 26 nodes
1 cluster composed of 29 nodes
2 clusters composed of 31 nodes
1 cluster composed of 43 nodes
1 cluster composed of 65 nodes
1 cluster composed of 4843953 nodes
I tried several programs, including Pajek and SocNet, but they seem to be node-centered (they let you compute statistics and perform advanced operations on the nodes). I don't care about the individual nodes or about their connections; I just want to show the clusters with the nodes inside them. Does anyone know of software that could help me? P.S. This is the LiveJournal graph.

In principle, Mathematica 8 should be able to handle your problem with its new Graph object. I say, "in principle", because I have trouble imagining how a cluster of almost 5 million nodes (or vertices) will look when printed on screen or on paper. It will be crucial that you choose a suitable GraphLayout, as this comparison from Hu shows:
They are 3 depictions of the same graph (936 vertices), with the poorest rendering (of course) on the left. The article contains a rendering of a graph with 225k vertices that has a somewhat discernible structure.
Anyway, it can handle input in the form of adjacency matrix or list of edges, among others. Edges may be directed or not. You can show and label all or some or none of the vertices and edges. You can also remove the clusters (GraphComponents) and display them alone or in combination. It also gives you various GraphLayout options: CircularEmbedding, SpiralEmbedding, HighDimensionalEmbedding, LargeNetwork, etc. There are a variety of GraphStyles.
There is a command called NeighborhoodGraph that you may find useful for that huge cluster. NeighborhoodGraph[g,v,n] generates a subgraph of all nodes within n steps from vertex v. You can also simplify things by asking for a Subgraph with a predetermined list of edges, vertices or both.
Beware that some of the Graph documentation will refer to Combinatorica, which though excellent and useful for many purposes, does not render graphs with as much precision, in my view, as the version 8 Graph object will.
Some of the issues regarding graph layout for huge graphs are discussed here. There is also a SO discussion about plotting large graphs in which various software solutions are compared.
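The neighbourhood idea is not tied to Mathematica. As a rough illustration of what "all nodes within n steps from vertex v" means, here is a minimal Python sketch that uses a breadth-first search over a plain adjacency dictionary. The function name, the toy graph and the adjacency format are my own assumptions and not part of any library:

```python
from collections import deque


def neighborhood(adjacency, start, max_steps):
    """Return the set of vertices reachable from `start` in at most `max_steps` hops."""
    steps = {start: 0}  # vertex -> number of hops from `start`
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if steps[vertex] == max_steps:
            continue  # do not expand beyond the requested radius
        for neighbor in adjacency.get(vertex, ()):
            if neighbor not in steps:
                steps[neighbor] = steps[vertex] + 1
                queue.append(neighbor)
    return set(steps)


# Hypothetical toy graph: an undirected path a - b - c - d - e
toy = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c", "e"],
    "e": ["d"],
}
print(sorted(neighborhood(toy, "c", 1)))  # ['b', 'c', 'd']
```

Drawing only such a neighbourhood keeps the picture small even when the cluster it is taken from has millions of vertices.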

Try AT&T's graphviz.
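For the small clusters, Graphviz draws a box around a group of nodes if the group is written as a subgraph whose name starts with cluster. A minimal Python sketch that writes such a DOT file from a list of clusters follows. The function name, the node names and the example data are made up for illustration, and the 4.8-million-node cluster is assumed to be collapsed into a single labelled node rather than drawn in full:

```python
def clusters_to_dot(clusters, collapse_above=100):
    """Build DOT text with one `subgraph cluster_N` box per cluster.

    Clusters with more than `collapse_above` members are drawn as one
    labelled box-shaped node instead of individual points.
    """
    lines = ["graph G {", "  node [shape=point];"]
    for index, members in enumerate(clusters):
        lines.append(f"  subgraph cluster_{index} {{")
        if len(members) > collapse_above:
            lines.append(f'    big_{index} [shape=box, label="{len(members)} nodes"];')
        else:
            for member in members:
                lines.append(f'    "{member}";')
        lines.append("  }")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical example: two small clusters and one large one.
example = [["a"], ["b", "c"], [f"n{i}" for i in range(500)]]
print(clusters_to_dot(example))
```

The output can be saved to a file and rendered with, for example, dot -Tsvg clusters.dot -o clusters.svg. Whether 1876 boxes end up readable is something you would have to try.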

I don't think you will get a reasonable answer to your question, but here are my 2 cents:
Have a look at examples of how to draw graphs of graphs, and then reformulate your problem.
How should a single cluster be visualized? Most of the clusters are easy to visualize, but the one with 4,843,953 nodes is a killer. I suspect that it won't be practical to visualize that cluster together with the others ...
The best I have seen so far is the graph visualization software from Tom Sawyer. It comes with a high price tag, but it may help you get a clear idea of how to visualize the graph.
When you have answers to the points above, I think there will be a way to visualize the clusters accordingly.
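To put numbers on how lopsided the data is, the size table from the question can be tallied in a few lines of Python. The variable names are mine; the figures are copied from the question:

```python
# (cluster size, number of clusters of that size), copied from the question
histogram = [
    (1, 962), (2, 651), (3, 144), (4, 52), (5, 24), (6, 8), (7, 8), (8, 2),
    (9, 4), (10, 3), (11, 1), (12, 1), (13, 4), (16, 1), (21, 1), (22, 1),
    (24, 1), (25, 1), (26, 1), (29, 1), (31, 2), (43, 1), (65, 1),
    (4843953, 1),
]

total_clusters = sum(count for _, count in histogram)
total_nodes = sum(size * count for size, count in histogram)
largest = max(size for size, _ in histogram)

print(total_clusters)         # 1876
print(total_nodes)            # 4847571
print(total_nodes - largest)  # 3618 nodes outside the largest cluster
```

So all clusters other than the largest hold 3,618 nodes between them, while the largest holds more than 99.9% of the graph, which is why it probably needs a separate treatment.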

Tulip, which comes as a package with every major Linux distro (and is apparently also available for Windows), is said to be "capable of managing graphs with up to 500,000 nodes and edges on relatively modest hardware (eg. 600MHz Pentium III, 256MB RAM)".
That sounds just like what you want.

Related

graph_edit_distance of two graphs using networkx

I have two non-isomorphic graphs:
MultiDiGraph with 779 nodes and 20 edges, MultiDiGraph with 146 nodes and 28 edges
n = nx.graph_edit_distance(g, h, timeout=10)
The above code gives the output None. What does this mean?
My guess is that the edit distance between these two graphs cannot be calculated because the difference in the number of nodes is large. My idea is to compute the graph edit distance using edge transformations instead, since the graphs differ greatly in node count but only slightly in edge count.
To use edge transformations, there are these functions:
edge_subst_cost(G1[u1][v1], G2[u2][v2]), edge_del_cost(G1[u1][v1]), edge_ins_cost(G2[u2][v2])
My question is: how do I supply the parameters G1 and G2 to edge_subst_cost()?
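As far as I understand the networkx API, you do not pass G1 and G2 to these functions yourself. You pass the functions as keyword arguments to graph_edit_distance, and networkx calls them with the attribute dictionaries of the edges it is comparing. A minimal sketch on two tiny graphs follows; the weight attribute, the unit deletion and insertion costs and the graphs themselves are assumptions made up for illustration:

```python
import networkx as nx

# Two small hypothetical directed graphs with a "weight" attribute on each edge.
g = nx.DiGraph()
g.add_edge("a", "b", weight=1.0)
g.add_edge("b", "c", weight=2.0)

h = nx.DiGraph()
h.add_edge("x", "y", weight=1.0)
h.add_edge("y", "z", weight=5.0)


def edge_subst_cost(attrs1, attrs2):
    """Cost of turning an edge of g into an edge of h (arguments are attribute dicts)."""
    return abs(attrs1["weight"] - attrs2["weight"])


def edge_del_cost(attrs):
    """Cost of deleting an edge of g (assumed to be 1 here)."""
    return 1.0


def edge_ins_cost(attrs):
    """Cost of inserting an edge of h (assumed to be 1 here)."""
    return 1.0


distance = nx.graph_edit_distance(
    g,
    h,
    edge_subst_cost=edge_subst_cost,
    edge_del_cost=edge_del_cost,
    edge_ins_cost=edge_ins_cost,
)
print(distance)
```

With timeout set, the function returns the best distance found before the time ran out, so None presumably means that no complete edit path was found within the 10 seconds.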

HDBSCAN Cluster choice

I have been working with HDBSCAN and have a few hundred clusters based on my data. I am trying to select some clusters for further analysis: I am looking for clusters that have a high inter-cluster distance, that is, clusters that are more spread out and behave a bit more like outliers than the rest. So far I have been working with the -1 cluster category, but I realized that cluster.probabilities_ is 0 for these points. I need this value for further analysis.
My questions are:
What does the cluster.probabilities_ score say about a cluster?
And is there any way (other than just choosing the -1 cluster category) to select other clusters that might contain outliers as well, for example by calculating the inter-cluster distance or in some other way?
cluster.probabilities_ gives, for each data point, the strength with which that point belongs to its assigned cluster.
A label of -1 means that the data point has been labeled as noise, which is why its probability is 0. If you want such points to be allocated to clusters, soft clustering might be a solution.
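One way to look for "outlier-ish" clusters without relying on the -1 label is to summarise each cluster by how weakly its points belong to it and how spread out they are. The sketch below is plain Python on made-up data: labels and probabilities stand in for the clusterer's labels_ and probabilities_ arrays, points stands in for the input data, and the function name and the two summary statistics are my own choices, not part of HDBSCAN:

```python
from collections import defaultdict
from math import dist
from statistics import mean


def summarise_clusters(points, labels, probabilities):
    """Per cluster: mean membership strength and mean distance to the centroid."""
    members = defaultdict(list)
    for point, label, probability in zip(points, labels, probabilities):
        if label != -1:  # skip noise points; their probability is 0
            members[label].append((point, probability))

    summary = {}
    for label, rows in members.items():
        coords = [point for point, _ in rows]
        centroid = tuple(mean(axis) for axis in zip(*coords))
        summary[label] = {
            "mean_probability": mean(probability for _, probability in rows),
            "spread": mean(dist(point, centroid) for point in coords),
        }
    return summary


# Hypothetical data: a tight cluster 0, a loose cluster 1 and one noise point.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (9.0, 5.0), (20.0, 20.0)]
labels = [0, 0, 1, 1, -1]
probabilities = [1.0, 0.9, 0.6, 0.4, 0.0]

summary = summarise_clusters(points, labels, probabilities)
loosest = max(summary, key=lambda label: summary[label]["spread"])
print(loosest)  # 1
```

Clusters with a low mean probability or a large spread would then be the candidates for a closer look.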

DFS - comparing the time and space complexity of two connected graphs

Imagine the following graph (it is not a real graph), whose upper level is a directed acyclic graph with n nodes and whose lower level is a directed acyclic graph with m nodes. Each node in the upper level is connected to a few nodes in the lower level (let's say each upper-level node covers a few lower-level nodes). So n is less than m (each upper-level node covers at least 2 lower-level nodes).
My questions are:
1- What are the time and space complexities of using depth-first search to find all sequences starting from a given node in the upper level and in the lower level? And how can the time and space complexities of the two levels be compared (how are they related)?
My answers for the time and space complexity, which I am unsure about, are as follows:
Upper level: time complexity: O(n), space complexity: O(n+e) (e: number of edges).
Lower level: time complexity: O(m), space complexity: O(m+e) (e: number of edges).
But I cannot find the relation between the two.
2- If we want to find all possible sequences from a single node in the lower level of the graph, and an additional node is added to the graph, how does the number of sequences increase (in the worst case)?
Any idea is appreciated!
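For the second question, a small experiment may help. The sketch below enumerates, by depth-first search, every path that starts at a given node, and runs it on a hypothetical worst case of my choosing: a DAG in which every node has an edge to every later node. The helper names are made up for illustration:

```python
def all_paths(dag, start):
    """Yield every path that begins at `start`, using an explicit DFS stack."""
    stack = [[start]]
    while stack:
        path = stack.pop()
        yield path
        for successor in dag.get(path[-1], ()):
            stack.append(path + [successor])


def complete_dag(k):
    """DAG on nodes 0..k-1 with an edge i -> j for every i < j."""
    return {i: list(range(i + 1, k)) for i in range(k)}


# Number of paths starting at node 0 as the DAG grows from 1 to 6 nodes.
counts = [sum(1 for _ in all_paths(complete_dag(k), 0)) for k in range(1, 7)]
print(counts)  # [1, 2, 4, 8, 16, 32]
```

In this worst case each extra node doubles the number of sequences, so enumerating all of them takes exponential time, even though a plain DFS traversal that visits each node once is linear in the number of nodes and edges.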

scikit-learn AgglomerativeClustering and connectivity

I am trying to use AgglomerativeClustering from scikit-learn to cluster points on a plane. The points are defined by coordinates (X,Y) stored in _XY.
Clusters are limited to a few neighbours through the connectivity matrix defined by
_C = kneighbors_graph(_XY, n_neighbors = 20).
I want some points not to be part of the same cluster, even if they are neighbours, so I modified the connectivity matrix to put 0 between these points.
The algorithm runs smoothly but, in the end, some clusters contain points that should not be together, i.e. some pairs for which I imposed _C = 0.
From the children, I can see that the problem arises when a cluster of two points (i, j) has already been formed and k then joins (i, j) even though _C[i,k]=0.
So I was wondering how the connectivity constraint is propagated when some clusters contain several points, since _C is not defined in that case.
Thanks!
What seems to be happening in your case is that, although you actively disconnected the points you do not want in the same cluster, these points are still part of the same connected component, and the data associated with them still imply that they should be merged into the same cluster from a certain level up.
In general, AgglomerativeClustering works as follows: at the beginning, every data point is its own cluster. Then, at each iteration, two adjacent clusters are merged such that the overall increase in discrepancy with the original data is minimal, where the discrepancy is measured as the L2 distance between the original data and the cluster means.
Hence, although you severed the direct link between two nodes, they can still be clustered together one level higher through an intermediate node.
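The effect can be reproduced on a tiny example. In the sketch below, which uses made-up data, points 0, 1 and 2 play the roles of i, j and k: the direct link between 0 and 2 is set to 0, but both are linked to point 1. The data, the matrix and the choice of two clusters are assumptions for illustration only:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import AgglomerativeClustering

# Hypothetical data: three close points (i=0, j=1, k=2) and one distant point (3).
X = np.array([[0.0], [0.1], [0.2], [10.0]])

# Connectivity: 0-1, 1-2 and 2-3 are linked; the direct link 0-2 is severed (0).
C = csr_matrix(np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]))

model = AgglomerativeClustering(n_clusters=2, connectivity=C, linkage="ward")
labels = model.fit_predict(X)

# Points 0 and 2 share a cluster even though C[0, 2] is 0.
print(labels[0] == labels[2])  # True
```

If two points must never share a cluster, zeroing their entry in the connectivity matrix is therefore not enough on its own; the constraint only removes the direct merge between those two points.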

Node graph layout library for incremental graphs

I'm looking for a library to which nodes and edges can be supplied and which generates a list of coordinates for all the nodes, laid out nicely. However, it should be possible to supply fixed positions for some, but not all, of the nodes, and the layout algorithm should respect them.
I have tried graphviz (fdp, neato) so far, which does not seem able to keep the positions of certain nodes and build the layout around them.
The library has to be usable from Python, so it should be written in either Python or C/C++ so that we could write our own binding.
The following picture illustrates exactly what I'm looking for (this is the uDraw project, which does not seem to exist as a library).
You can do this in graphviz in reverse, if that is useful to you. To do that, you would plot the right-hand graph first, and then plot the left-hand graph with nodes 15, 16 and 17 set to style=invis. That would give you much the same layout as is shown here.
A problem that I could perceive with plotting the left-hand graph first is that the software (dot or something else) would naturally try to plot a "nice-looking" graph without nodes 15, 16 and 17 and that might not leave enough space for nodes 15, 16 and 17 to be fitted in if they're needed later. For example, if we tried to insert a node 12a between nodes 11 and 12, there wouldn't be room for that node in the graphs shown above. On the other hand, if node 12a was initially plotted but invisible, then the software would allocate the space for it, where it could be included later.
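A minimal Python sketch of this trick follows: it writes the same graph twice as DOT text, once with the later nodes marked style=invis and once with everything visible, so that both renderings should come out with the same layout. The edge list is hypothetical (I cannot reproduce the graph in the picture), and the function name is made up for illustration:

```python
def to_dot(edges, hidden=()):
    """Return DOT text; hidden nodes and their edges get style=invis."""
    hidden = set(hidden)
    nodes = sorted({node for edge in edges for node in edge})
    lines = ["digraph G {"]
    for node in nodes:
        attrs = " [style=invis]" if node in hidden else ""
        lines.append(f"  {node}{attrs};")
    for source, target in edges:
        invisible = source in hidden or target in hidden
        attrs = " [style=invis]" if invisible else ""
        lines.append(f"  {source} -> {target}{attrs};")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical edges; nodes 15, 16 and 17 are the ones that appear later.
edges = [(11, 12), (12, 15), (15, 16), (15, 17)]
before = to_dot(edges, hidden={15, 16, 17})  # space reserved, nodes not drawn
after = to_dot(edges)                        # same graph, everything drawn
print(before)
```

Because both files describe the same nodes and edges and differ only in the style attribute, the layout engine has no reason to move anything between the two pictures.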
