networkx: node spacing when plotting multipartite graph - python-3.x

I want to plot a multipartite graph using networkx. However, when adding more nodes, the plot becomes very crowded. Is there a way to have more space between nodes and between partitions?
Looking at the documentation of multipartite_layout, I couldn't find parameters for this.
Of course, one could create complicated formulas for the positions, but since the spacing of multipartite_layout already looks so good for small graphs, I was wondering how to scale this to bigger graphs.
Does anyone have an idea how to do this (efficiently)?
Sample code, generating a graph with three partitions:
import matplotlib.pyplot as plt
import networkx as nx

# build graph:
G = nx.Graph()
for i in range(0, 30):
    G.add_node(i, layer=0)
for i in range(30, 50):
    G.add_node(i, layer=1)
    for j in range(0, 30):
        G.add_edge(i, j)
G.add_node(100, layer=2)
G.add_edge(40, 100)

# plot graph
pos = nx.multipartite_layout(G, subset_key="layer")
plt.figure(figsize=(20, 8))
nx.draw(G, pos, with_labels=False)
plt.axis("equal")
plt.show()
The current, crowded plot:

nx.multipartite_layout returns a dictionary with the following format: {node: array([x, y])}
I suggest you try pos = {p: array_op(pos[p]) for p in pos}, where array_op is a function acting on the position of each node, array([x, y]).
In your case, I think a simple scaling along the x-axis suffices, e.g. (with numpy imported as np):
array_op = lambda p, sx=2: np.array([p[0] * sx, p[1]])
(Note the brackets: np.array(p[0] * sx, p[1]) would pass the second value as a dtype and fail.)
For visualization purposes, I guess this should be equivalent to @JPM's comment. However, this approach gives you the advantage of having the actual transformed position data.
In the end, if such a uniform transformation does not satisfy your needs, you can always manipulate the positions manually, with the knowledge of the format of the dict (although it might be less efficient).
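For example, a minimal sketch of this applied to the graph above (the scale factor sx is an arbitrary knob, not something derived from the data):

import numpy as np

sx = 3.0  # horizontal scale factor; increase for more space between layers
pos = nx.multipartite_layout(G, subset_key="layer")
pos = {node: np.array([x * sx, y]) for node, (x, y) in pos.items()}
plt.figure(figsize=(20, 8))
nx.draw(G, pos, with_labels=False)
plt.show()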

Related

Networkx (or Graphviz) layout with fixed y positions

Are there any layout algorithms in networkx (or that I can call in Graphviz) that allow me to fix the Y-position of nodes in a DAG to a potentially different floating point value for each node, but spread out the X positions in some reasonable way (ideally attempting to minimise edge lengths or crossovers, although I suspect this might not be possible)? I can only find layouts that require nodes to be on discrete layers.
Added: Below is an example of the sort of graph topology I have, plotted using nx.kamada_kawai_layout. The thing is that these nodes have a "time" value (not shown here), which I want to plot on the Y axis. The vertices are directed in time, so that a parent node (e.g. 54 here) is always older than its children (here 52 and 53). So I want to lay this out with the Y position given by the node "time", and the X position such that crossings are minimised, in as much as that's possible (I know this is NP-hard in general, but the layout below is actually doing a pretty good job).
p.s. usually all the leaf nodes, e.g. 2, 3, 7 here, are at time 0, so should be laid out at the bottom of the final layout.
p.p.s. Essentially what I would like to do is to imagine this as a spring diagram, "pick up" the root node (54) in the plot above and place it at the top of the page, with the topology dangling down, and then adjust the Y-positions of the children to their internal "time" values.
Edit 2. Thanks to @sroush below, I can get a decent layout with the dot Graphviz engine:
A = nx.nx_agraph.to_agraph(G)
fig = plt.figure(1, figsize=(10, 10))
A.add_subgraph(ts.samples(), rank="same", name="cluster")
A.layout(prog="dot")
pos = {n: [float(x) for x in A.get_node(n).attr["pos"].split(",")] for n in G.nodes()}
nx.draw_networkx(G, pos, with_labels=True)
But I then want to reposition the nodes slightly so instead of ranked times (the numbers) they use their actual, floating point times. Like this:
true_times = nx.get_node_attributes(G, 'time')
reposition = {node_id: np.array([pos[node_id][0], true_times[node_id]]) for node_id in true_times}
nx.draw_networkx(G, reposition, with_labels=True)
As you can see, that squashed the nodes together rather a lot. Is there any way to increase the horizontal spread of those nodes so they don't bump into one another? I could perhaps cluster some onto the same layer and iterate, but that seems quite expensive.
The Graphviz dot engine can get you pretty close. This is usually described as a "timeline" issue. Here is a graph that is part of the Graphviz source that seems to do what you want: https://www.flickr.com/photos/kentbye/1155560169
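As a stopgap for the squashing in the last snippet above, one could also simply stretch the dot x-coordinates before substituting the true times (a rough sketch; the factor sx is an arbitrary knob, and pos and true_times are as defined in the question):

import numpy as np

sx = 5.0  # horizontal stretch factor; tune until nodes stop overlapping
reposition = {n: np.array([pos[n][0] * sx, true_times[n]]) for n in true_times}
nx.draw_networkx(G, reposition, with_labels=True)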

only writing visible points to disk of an overplotted scatterplot

I am creating matplotlib scatterplots of around 10000 points. At the point size I am using, this results in overplotting, i.e. some of the points will be hidden by the points that are plotted over them.
While I don't mind that I cannot see the hidden points, they are redundantly written out when I write the figure to disk as PDF (or another vector format), resulting in a large file.
Is there a way to create a vector image where only the visible points are written to the file? This would be similar to the concept of "flattening" / merging layers in photo-editing software. (I would still like to retain the image as a vector format, to keep the ability to zoom in.)
Example plot:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(15)  # seed numpy's RNG (random.seed would not affect np.random)
df = pd.DataFrame({'x': np.random.normal(10, 1.2, 10000),
                   'y': np.random.normal(10, 1.2, 10000),
                   'color': np.random.normal(10, 1.2, 10000)})
df.plot(kind="scatter", x="x", y="y", c="color", s=80, cmap="RdBu_r")
plt.show()
tl;dr
I don't know of any simple solution such as
RemoveOccludedCircles(C)
The algorithm below requires some implementation, but it shouldn't be too bad.
Problem reformulation
While we could try to remove existing circles when adding new ones, I find it easier to think about the problem the other way round, processing all circles in reverse order and pretending to draw each new circle behind the existing ones.
The main problem then becomes: How can I efficiently determine whether one circle would be completely hidden by another set of circles?
Conditions
In the following, I will describe an algorithm for the case where the circles are sorted by size, such that larger circles are placed behind smaller circles. This includes the special case where all circles have the same size. An extension to the general case would be significantly more complicated, as one would have to maintain a triangulation of the intersection points. In addition, I will assume that no two circles have the exact same properties (radius and position); such identical circles can easily be filtered out beforehand.
Datastructures
C: A set of visible circles
P: A set of control points
Control points will be placed in such a way that no newly placed circle can become visible unless either its center lies outside the existing circles or at least one control point falls inside the new circle.
Problem visualisation
In order to better understand the role of control points, their maintenance and the algorithm, have a look at the following drawing:
Processing 6 circles
In the linked image, active control points are painted in red. Control points that are removed after each step are painted in green or blue, where blue points were created by computing intersections between circles.
In image g), the green area highlights the region in which the center of a circle of same size could be placed such that the corresponding circle would be occluded by the existing circles. This area was derived by placing circles on each control point and subtracting the resulting area from the area covered by all visible circles.
Control point maintenance
Whenever adding one circle to the canvas, we add four active points, which are placed on the border of the circle in an equidistant way. Why four? Because no circle of same or bigger size can be placed with its center inside the current circle without containing one of the four control points.
After placing one circle, the following invariant holds: a new circle is completely hidden by the existing circles if both of the following are true:
Its center falls into a visible circle.
No control point lies strictly inside the new circle.
In order to maintain this assumption while adding new circles, the set of control points needs to be updated after each addition of a visible circle:
Add 4 new control points for the new circle, as described before.
Add new control points at each intersection of the new circle with existing visible circles.
Remove all control points that lie strictly inside any visible circle.
This rule will maintain control points at the outer border of the visible circles in such a dense way that no new visible circle intersecting the existing circles can be placed without 'eating' at least one control point.
Pseudo-Code
AllCircles <- All circles, sorted from front to back
C <- {} // the set of visible circles
P <- {} // the set of control points

for X in AllCircles {
    if (Inside(center(X), C) AND Outside(P, X)) {
        // ignore circle, it is occluded!
    } else {
        C <- C + X
        P <- P + CreateFourControlPoints(X)
        P <- P + AllCuttingPoints(X, C)
        RemoveHiddenControlPoints(P, C)
    }
}

DrawCirclesInReverseOrder(C)
The functions 'Inside' and 'Outside' are a bit abstract here: 'Inside' returns true if a point is contained in one or more circles from a set of circles, and 'Outside' returns true if all points from a set of points lie outside of a circle. But none of the functions used should be hard to write out.
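For concreteness, here is a minimal floating-point sketch of this loop for equal-radius circles (assumed function and variable names; it skips the exact-arithmetic and grid-acceleration refinements discussed below, so borderline tangencies may be misjudged):

import numpy as np

def visible_circles(centers, r, eps=1e-9):
    """Return indices of the circles that stay visible; draw them in
    reverse order. centers: (N, 2) array sorted from front to back."""
    centers = np.asarray(centers, dtype=float)
    C = []  # indices of visible circles
    P = []  # control points on the visible outline

    def inside_any(p):  # 'Inside': p strictly inside some visible circle
        return any(np.hypot(*(p - centers[i])) < r - eps for i in C)

    for k, x in enumerate(centers):
        # 'Outside' fails if some control point lies strictly inside circle k
        eaten = any(np.hypot(*(q - x)) < r - eps for q in P)
        if inside_any(x) and not eaten:
            continue  # circle k is completely occluded
        # four equidistant control points on the new circle's border
        P += [x + r * np.array([np.cos(t), np.sin(t)])
              for t in np.arange(4) * np.pi / 2]
        # control points at intersections with existing visible circles
        for i in C:
            d = np.hypot(*(centers[i] - x))
            if eps < d < 2 * r - eps:  # proper intersection of equal radii
                m = (x + centers[i]) / 2           # midpoint of the two centres
                u = (centers[i] - x) / d           # unit centre-to-centre vector
                h = np.sqrt(r * r - (d / 2) ** 2)  # half the chord length
                perp = np.array([-u[1], u[0]])
                P += [m + h * perp, m - h * perp]
        C.append(k)
        # remove control points strictly inside any visible circle
        P = [q for q in P if not inside_any(q)]
    return C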
Minor problems to be solved
How to determine in a numerically stable way whether a point is strictly inside a circle? -> This shouldn't be too bad to solve, as all control points arise from nothing more complicated than the solution of a quadratic equation. It is important, though, not to rely solely on floating point representations, as these will be numerically insufficient and some control points would probably get lost entirely, effectively leaving holes in the final plot. So keep a symbolic and precise representation of the control point coordinates. I would try SymPy to tackle this problem, as it seems to cover all the required math. The formula for intersecting circles can easily be found online, for example here.
How to efficiently determine whether a circle contains any control point, or whether any visible circle contains the center of a new circle? -> To solve this, I would propose keeping all elements of P and C in grid-like structures, where the width and height of each grid element equals the radius of the circles. On average, the number of active points and visible circles per grid cell should be in O(1), although it is possible to construct artificial setups with arbitrary amounts of elements per grid cell, which would turn the overall algorithm from O(N) to O(N * N).
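A possible shape for that grid index (a sketch assuming equal radii; the helper names are made up): bucket every point into a square cell of side r, after which a strict-containment query for a circle centred at p only has to scan the 3x3 block of cells around p's own cell, because any point strictly inside lies less than r away.

from collections import defaultdict
import numpy as np

def cell(p, r):
    # integer cell coordinates of a point, with cell size r
    return (int(np.floor(p[0] / r)), int(np.floor(p[1] / r)))

grid = defaultdict(list)  # maps cell -> points stored in that cell

def insert_point(p, r):
    grid[cell(p, r)].append(p)

def points_strictly_inside(center, r, eps=1e-9):
    ci, cj = cell(center, r)
    hits = []
    for i in range(ci - 1, ci + 2):      # a 3x3 neighbourhood suffices:
        for j in range(cj - 1, cj + 2):  # any hit is less than r away
            for q in grid[(i, j)]:
                if np.hypot(q[0] - center[0], q[1] - center[1]) < r - eps:
                    hits.append(q)
    return hits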
Runtime thoughts
As mentioned above, I would expect the runtime to scale linearly with the number of circles on average, because the number of visible circles in each grid cell will be in O(1) unless the input is constructed in an evil way.
The data structures should be easy to maintain in memory if the circle radius isn't excessively small, and computing intersections between circles should also be quite fast. I'm curious about the final computation time, but I don't expect it would be much worse than drawing all circles naively a single time.
My best guess would be to use a hexbin. Note that with a scatter plot, the dots plotted last are the only ones visible; with a hexbin, all coinciding dots are averaged.
If interested, the centers of the hexagons can be used to again create a scatter plot showing only the minimum number of dots.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

np.random.seed(15)
df = pd.DataFrame({'x': np.random.normal(10, 1.2, 10000),
                   'y': np.random.normal(10, 1.2, 10000),
                   'color': np.random.normal(10, 1.2, 10000)})

fig, ax = plt.subplots(ncols=4, gridspec_kw={'width_ratios': [10, 10, 10, 1]})
norm = plt.Normalize(df.color.min(), df.color.max())
df.plot(kind="scatter", x="x", y="y", c="color", s=10, cmap="RdBu_r", norm=norm,
        colorbar=False, ax=ax[0])
hexb = ax[1].hexbin(df.x, df.y, df.color, cmap="RdBu_r", norm=norm, gridsize=80)
centers = hexb.get_offsets()   # hexagon centres
values = hexb.get_array()      # averaged color value per hexagon
ax[2].scatter(centers[:, 0], centers[:, 1], c=values, s=10, cmap="RdBu_r", norm=norm)
plt.colorbar(hexb, cax=ax[3])
plt.show()
Here is another comparison. The number of dots is reduced by a factor of 10, and the plot is more "honest", as overlapping dots are averaged.

How to apply a Pearson Correlation Analysis over all pairs of pixels of a DataArray as a Correlation Matrix?

I am facing serious difficulties in generating a correlation matrix (pixel by pixel) of a single NetCDF file with dimensions ('lon', 'lat', 'time'). My final intent is to generate what one calls a Teleconnectivity Map.
This map is composed of correlation coefficients. Each pixel has a value that represents the highest correlation value (in absolute value) found in the correlation matrix over all pairs of pixels in the DataArray.
Therefore, in order to create my Teleconnectivity Map, instead of looping over every longitude ('lon') and every latitude ('lat') and later checking all possible pairs for the correlation highest in magnitude, I was thinking of applying xr.apply_ufunc with a wrapped correlation function inside.
Despite my efforts, I still don't get what is truly happening behind the scenes in xr.apply_ufunc. All I managed to get was a resulting matrix with all pixels equal to 1 (perfect correlation).
See code below:
import numpy as np
import xarray as xr

def correlation(x, y):
    return np.corrcoef(x, y)[0, 0]  # to return a single correlation index, instead of a matrix

def wrapped_correlation(da, x, coord='time'):
    """Finds the correlation along a given dimension of a dataarray."""
    from functools import partial
    fpartial = partial(correlation, x.values)
    return xr.apply_ufunc(fpartial,
                          da,
                          input_core_dims=[[coord]],
                          output_core_dims=[[]],
                          vectorize=True,
                          output_dtypes=[float])
# testing the wrapped correlation on sample data:
ds = xr.tutorial.open_dataset('air_temperature').load()

# testing for a single point in space:
x = ds['air'].sel(dict(lon=1, lat=92), method='nearest')

# over all points in the DataArray:
Corr_over_x = wrapped_correlation(ds['air'], x)

# notice that the resulting DataArray is composed solely of ones (a perfect
# correlation match). This is impossible; I would expect a different value of
# correlation for each pixel, so a plot should show a variety of values:
Corr_over_x.plot()
This is an important asset for meteorologists and remote sensing researchers. It allows the evaluation of potential geophysical patterns over a given area of study.
I thank you for your time, and I hope to hear from you soon.
Sincerely yours,
Firstly, you need to use np.corrcoef(x, y)[0,1] instead of [0,0]. Secondly, you don't need to use partial at all; see below:
def correlation(x1, x2):
    return np.corrcoef(x1, x2)[0, 1]  # to return a single correlation index, instead of a matrix

def wrapped_correlation(da, x, coord='time'):
    """Finds the correlation along a given dimension of a dataarray."""
    return xr.apply_ufunc(correlation,
                          da,
                          x,
                          input_core_dims=[[coord], [coord]],
                          output_core_dims=[[]],
                          vectorize=True,
                          output_dtypes=[float])
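Rerunning the question's own test with the corrected wrapper should now produce a map of varying correlation values instead of all ones:

ds = xr.tutorial.open_dataset('air_temperature').load()
x = ds['air'].sel(dict(lon=1, lat=92), method='nearest')
Corr_over_x = wrapped_correlation(ds['air'], x)
Corr_over_x.plot()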
I have managed to solve my question. The script has become a bit long; nevertheless, it does what was intended.
The code is adapted from this reference.
Since it is too long to show a snippet here, I am posting a link to my GitHub account, where the algorithm (organized in a package named Teleconnection_using_xarray_data) can be checked here.
The package has two modules with similar results.
The first module (teleconnection_with_connecting_pathways) is slower than the second (teleconnection_via_numpy), but it allows one to evaluate the connecting pathways between the partial teleconnection maps.
The second only returns the resulting teleconnection map, without the connecting lines (geopandas LineStrings), though it is much faster.
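For reference, the core reduction behind such a teleconnectivity map, on a domain small enough to hold the full pixel-by-pixel correlation matrix in memory, can be sketched directly in numpy (hypothetical shapes; this is not the package's actual code):

import numpy as np

# hypothetical data: 100 time steps over a 10 x 5 (lat, lon) grid
data = np.random.randn(100, 10 * 5)         # shape (time, npixels)
r = np.corrcoef(data, rowvar=False)         # (npixels, npixels) correlation matrix
np.fill_diagonal(r, 0.0)                    # ignore each pixel's self-correlation
teleconnectivity = np.abs(r).max(axis=1)    # strongest |r| per pixel
tele_map = teleconnectivity.reshape(10, 5)  # back to the (lat, lon) grid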
Feel free to collaborate. If possible, I would like to combine both modules, ensuring both speed and pathway analysis in the teleconnection algorithm.
Sincerely yours,
Philipe Leal

matplotlib.pyplot.imshow awkwardly not plotting all of the data when array is transposed

I am trying to plot an array full of ones and zeros and most of the time it works well and looks like this.
However, when my array becomes too big (I need to plot 60,000x70) the plot only draws part of the data.
At first I thought that this might be some sort of memory issue, but the arrays actually are not that big after all and when looking into memory usage there also was no sign of too heavy lifting.
It becomes really weird, however, when I plot the transposed array, because then it works like a breeze.
I looked around in forums quite a lot but apparently nobody else has had such an issue. Might this be a bug? I really need to plot it in the original orientation. So, any help is highly appreciated. Thanks in advance!
UPDATE
This exactly reproduces my problem:
import numpy as np
import matplotlib.pyplot as plt

# generate fake data
a = np.random.random((60000, 70))
for x in np.nditer(a, op_flags=['readwrite']):
    if x > 0.9:
        x[...] = 1
    else:
        x[...] = 0

# plot fake data
fig, axes = plt.subplots(2, 2)
axes[0][0].imshow(a, interpolation='none', cmap='binary', aspect='auto')
axes[0][1].imshow(a.T, interpolation='none', cmap='binary', aspect='auto')
axes[1][0].imshow(a[:30000], interpolation='none', cmap='binary', aspect='auto')
axes[1][1].imshow(a[:30000].T, interpolation='none', cmap='binary', aspect='auto')
plt.show()
The code yields this. In the upper left subplot everything is plotted. In the plot showing the transposed array (upper right), however, matplotlib only draws the first ~10000 columns. The lower two plots just show the first half of the array (left normal, right transposed) and as you can see, with smaller arrays there is no issue.
SOLVED
This problem does not occur with matplotlib 2.x
[SOLVED]
The problem only occurs with outdated versions of matplotlib; upgrading to 2.x or later resolves it.

Fit a GMM to a 3D histogram in scikit-learn

The mixture model code in scikit-learn works for a list of individual data points, but what if you have a histogram? That is, I have a density value for every voxel, and I want the mixture model to approximate it. Is this possible? I suppose one solution would be to sample values from this histogram, but that shouldn't be necessary.
Scikit-learn has extensive utilities and algorithms for kernel density estimation, which is specifically centered around inferring distributions from things like histograms. See the documentation here for some examples. If you have no expectations for the distribution of your data, KDE might be a more general approach.
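For instance, a minimal sketch of scikit-learn's KernelDensity fitted on raw (unbinned) samples, with made-up data:

import numpy as np
from sklearn.neighbors import KernelDensity

samples = np.random.randn(1000, 3)  # hypothetical raw 3-D observations
kde = KernelDensity(kernel='gaussian', bandwidth=0.5).fit(samples)
log_density = kde.score_samples(samples[:10])  # log-density at query points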
For 2D histogram Z (your 2D array of voxels)
import numpy as np

# create the co-ordinate values
X, Y = np.mgrid[0:Z.shape[0], 0:Z.shape[1]]

# artificially create a list of points from your histogram
data_points = []
for x, y, z in zip(X.ravel(), Y.ravel(), Z.ravel()):
    # add the data point / voxel (x, y) as many times as it occurs
    # in the histogram
    for _ in range(int(z)):
        data_points.append((x, y))

# now fit your GMM (the class is GaussianMixture in current scikit-learn;
# it was called GMM in old versions)
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture()
gmm.fit(data_points)
Though, as @Kyle Kastner points out, there are better methods for achieving this. For a start, your histogram will be 'binned', which already loses you some resolution. Can you get hold of the raw data from before it was binned?
