I'm developing a social network based on the exchange of emails. The dataset is a CSV that can be downloaded from my Google Drive and consists of integers (individuals, column source) connecting to other individuals (integers, column target): https://drive.google.com/file/d/183fIXkGUqDC7YGGdxy50jAPrekaI1273/view?usp=sharing
The point is, my dataframe has 400 rows, but only 21 nodes show up:
Here is the sample code:
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
import pandas as pd
df=pd.read_csv('/home/......./social.csv', sep=',',header=None)
df=df.iloc[0:400,:]
df.columns=['source','target']
nodes=np.arange(0,400)
G=nx.from_pandas_edgelist(df, "source", "target")
G.add_nodes_from(nodes)
pos = nx.spectral_layout(G)
coordinates=np.concatenate(list(pos.values())).reshape(-1,2)
nx.draw_networkx_edges(G, pos, edgelist=[e for e in G.edges],alpha=0.9)
nx.draw_networkx_nodes(G, pos, nodelist=nodes)
plt.show()
Column source has 160 different individuals and target has 260 different individuals.
The rest of the code runs fine; this is the only issue.
I'm wondering what I'm doing wrong. Any insights are welcome.
Your nodes are being drawn, but nx.spectral_layout positions many of them on top of each other.
If you print the positions:
pos = nx.spectral_layout(G)
print(pos)
You get:
{0: array([0.00927318, 0.01464153]), 1: array([0.00927318, 0.01464153]), 2: array([0.00927318, 0.01464153]), 3: array([0.00927318, 0.01464153]), 4: array([0.00927318, 0.01464153]), 5: array([-1. , -0.86684471]), 6: array([-1. , -0.86684471]), ...
And you can already see the overlap by comparing the positions.
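A quick way to confirm this is to count how many distinct coordinates the layout actually produced (a small check, not part of the original code):
import numpy as np
coords = np.array(list(pos.values()))
unique_coords = np.unique(np.round(coords, 6), axis=0)
print(len(pos), "nodes collapse onto", len(unique_coords), "distinct positions")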
You could instead use nx.circular_layout if you want to see all the nodes:
fig=plt.figure(figsize=(16,12))
pos = nx.circular_layout(G)
nx.draw(G, pos, nodelist=nodes,node_size=40)
And you will get:
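If the circle gets too crowded, a force-directed layout is another option; here is a minimal sketch (the seed argument only makes the layout reproducible):
pos = nx.spring_layout(G, seed=42)
nx.draw(G, pos, nodelist=nodes, node_size=40)
plt.show()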
I have a dataframe of XY coordinates which I'm plotting as Markers in a Scatter plot. I'd like to add_trace lines between specific XY pairs, not between every pair. For example, I'd like a line between Index 0 and Index 3 and another between Index 1 and Index 2. This means that just using a line plot won't work as I don't want to show all the connections. Is it possible to do it with a version of iloc or do I need to make my DataFrame in 'Wide-format' and have each XY pair as separate column pairs?
I've read through this but I'm not sure it helps in my case.
Adding specific lines to a Plotly Scatter3d() plot
import pandas as pd
import plotly.graph_objects as go
# sample data
d={'MeanE': {0: 22.448461538460553, 1: 34.78435897435799, 2: 25.94307692307667, 3: 51.688974358974164},
'MeanN': {0: 110.71128205129256, 1: 107.71666666666428, 2: 140.6384615384711, 3: 134.58615384616363}}
# pandas dataframe
df=pd.DataFrame(d)
# set up plotly figure
fig = go.Figure()
fig.add_trace(go.Scatter(x=df['MeanE'],y=df['MeanN'],mode='markers'))
fig.show()
UPDATE:
Adding the accepted answer below to what I had already, I now get the following finished plot.
The approach taken here is to update the dataframe, marking the rows that form the pairs of coordinates you have defined, and then add the traces to the figure as a list comprehension to complete the requirement.
import pandas as pd
import plotly.graph_objects as go
# sample data
d={'MeanE': {0: 22.448461538460553, 1: 34.78435897435799, 2: 25.94307692307667, 3: 51.688974358974164},
'MeanN': {0: 110.71128205129256, 1: 107.71666666666428, 2: 140.6384615384711, 3: 134.58615384616363}}
# pandas dataframe
df=pd.DataFrame(d)
# set up plotly figure
fig = go.Figure()
fig.add_trace(go.Scatter(x=df['MeanE'],y=df['MeanN'],mode='markers'))
# mark the pairs that will become lines
df.loc[[0, 3], "group"] = 1
df.loc[[1, 2], "group"] = 2
# add the lines to the figure
fig.add_traces(
    [
        go.Scatter(
            x=df.loc[df["group"].eq(g), "MeanE"],
            y=df.loc[df["group"].eq(g), "MeanN"],
            mode="lines",
        )
        for g in df["group"].unique()
    ]
)
fig.show()
An alternate solution to the enhanced requirement in the comments:
# mark the pairs that will become lines
lines = [[0, 3], [1, 2], [0, 2], [1, 3]]
# add the lines to the figure
fig.add_traces(
    [
        go.Scatter(
            x=df.loc[pair, "MeanE"],
            y=df.loc[pair, "MeanN"],
            mode="lines",
        )
        for pair in lines
    ]
)
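Another possible variant (a sketch, not part of the answer above): build a single line trace and separate the pairs with None values, so the figure does not accumulate one trace per pair:
xs, ys = [], []
for pair in lines:
    xs += list(df.loc[pair, "MeanE"]) + [None]
    ys += list(df.loc[pair, "MeanN"]) + [None]
fig.add_trace(go.Scatter(x=xs, y=ys, mode="lines", showlegend=False))
fig.show()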
I have a dictionary called "topic_word"
topic_word = {0: [[-0.669712, 0.6868, 0.9821409999999999], [-0.925967, 0.6138399999999999, 1.247525], [-1.09941, 1.0252620000000001, 1.327866]],
1: [[-0.862131, 0.890915, 1.07759], [-0.437658, 0.279271, 0.627497], [-0.437658, 0.279271, 0.627497]],
2: [[-0.671647, 0.670583, 0.937155], [-0.675347, 0.466983, 0.8505440000000001], [-0.706244, 0.612532, 0.762877]],
3: [[-0.8414590000000001, 0.797826, 1.124295], [-0.567535, 0.40820300000000004, 0.811368], [-0.800963, 0.699767, 0.9237989999999999]],
4: [[-0.8560549999999999, 1.0617020000000001, 1.579302], [-0.576105, 0.5029239999999999, 0.9392], [-0.743683, 0.69884, 0.9794930000000001]]
}
where each key represents a topic (here 0 to 4, i.e. 5 topics) and each value contains the embeddings of the words under that topic (here every topic has 3 words). I want to visualize the data using a 2-D scatter plot. If normalization is needed, how can I normalize the "topic_word" data so that it can be represented correctly in Python 3.x?
How can I visualize it with a scatter plot that shows the clusters of words (dots) under their topics?
Something like the following:
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
for key, value in topic_word.items():
    ax.scatter(value[0], value[1], label=key)
plt.legend()
I gather from your post that you want normalized values for each list corresponding to a key, and that each of these normalized lists should be represented as scatter data points. Here's one way to do it:
import numpy as np
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
topic_word = {0: [[-0.669712, 0.6868, 0.9821409999999999], [-0.925967, 0.6138399999999999, 1.247525], [-1.09941, 1.0252620000000001, 1.327866]],
1: [[-0.862131, 0.890915, 1.07759], [-0.437658, 0.279271, 0.627497], [-0.437658, 0.279271, 0.627497]],
2: [[-0.671647, 0.670583, 0.937155], [-0.675347, 0.466983, 0.8505440000000001], [-0.706244, 0.612532, 0.762877]],
3: [[-0.8414590000000001, 0.797826, 1.124295], [-0.567535, 0.40820300000000004, 0.811368], [-0.800963, 0.699767, 0.9237989999999999]],
4: [[-0.8560549999999999, 1.0617020000000001, 1.579302], [-0.576105, 0.5029239999999999, 0.9392], [-0.743683, 0.69884, 0.9794930000000001]]
}
colorkey = {0: 'red', 1: 'blue', 2: 'green', 3: 'black', 4: 'magenta'}  # color map for keys
for key, value in topic_word.items():
    valno = 0  # count of lists under each topic_word key
    for val in value:
        meanval = np.mean(val)
        stdval = np.std(val)
        val = (val - meanval) / stdval  # normalized list
        # label only the first list per topic to avoid duplicate legend entries
        ax.scatter(key * np.ones(len(val)), val, color=colorkey[key],
                   label="Topic " + str(key) if valno == 0 else "")
        handles, labels = ax.get_legend_handles_labels()
        valno = valno + 1
fig.legend(handles, labels, loc='best')
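For reference, the same per-list normalization can also be done in one vectorized step (a small sketch using only numpy; arr has shape topics x words x embedding dimensions):
arr = np.array(list(topic_word.values()))  # shape (5, 3, 3)
normed = (arr - arr.mean(axis=2, keepdims=True)) / arr.std(axis=2, keepdims=True)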
I have a text file with the following data:
192.168.12.22 192.168.12.21 23
192.168.12.21 192.168.12.22 26
192.168.12.23 192.168.12.22 56
There are three nodes and two of them are sending packets to each other. I want to be able to show both weights on two different edges, but it only shows one with a single weight.
This is my code:
import networkx as nx
import matplotlib.pyplot as plt
G = nx.read_weighted_edgelist('test.txt', create_using=nx.DiGraph())
pos = nx.spring_layout(G)
print(nx.info(G))
nx.draw(G, pos, with_labels=True)
nx.draw_networkx_edge_labels(G, pos)
plt.show()
You can use the label_pos parameter (see draw_networkx_edge_labels):
import networkx as nx
import matplotlib.pyplot as plt
edges = [["192.168.12.22", "192.168.12.21", 23],
["192.168.12.21", "192.168.12.22", 26],
["192.168.12.23", "192.168.12.22", 56]]
graph = nx.DiGraph()
graph.add_weighted_edges_from(edges)
pos = nx.spring_layout(graph)
nx.draw(graph, pos, with_labels=True)
nx.draw_networkx_edge_labels(graph,
                             pos,
                             edge_labels={(u, v): d for u, v, d in graph.edges(data="weight")},
                             label_pos=.66)
plt.show()
You may also want to take a look at this answer.
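If you prefer to separate the two opposite edges visually instead of only shifting the labels, curving the edges is an option; a rough sketch, assuming networkx 2.5 or later (edge labels do not follow curved edges in older versions, so label_pos tuning may still be needed):
nx.draw_networkx_nodes(graph, pos)
nx.draw_networkx_labels(graph, pos)
nx.draw_networkx_edges(graph, pos, connectionstyle="arc3,rad=0.15")
nx.draw_networkx_edge_labels(graph,
                             pos,
                             edge_labels={(u, v): d for u, v, d in graph.edges(data="weight")},
                             label_pos=.66)
plt.show()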
Referring to: https://stackoverflow.com/a/44907357/305883
I am using the python-louvain implementation to detect communities in a complete weighted graph.
But I only get one partition, containing all nodes.
Code:
import community # this is pip install python-louvain
import networkx as nx
import matplotlib.pyplot as plt
# Replace this with your networkx graph loading depending on your format!
# Here graph g is a complete graph with weights between 0 and 1.
# first compute the best partition
partition = community.best_partition(g)
#drawing
size = float(len(set(partition.values())))
pos = nx.spring_layout(g)
count = 0.
for com in set(partition.values()):
    count = count + 1.
    list_nodes = [node for node in partition.keys() if partition[node] == com]
    nx.draw_networkx_nodes(g, pos, list_nodes, node_size=20, node_color=str(count / size))
nx.draw_networkx_edges(g, pos, alpha=0.1)
plt.show()
I would like to extract communities from a complete weighted network.
I also tried girvan_newman (https://networkx.github.io/documentation/networkx-2.0/reference/algorithms/generated/networkx.algorithms.community.centrality.girvan_newman.html) but could only detect 2 communities out of a complete graph of 200 nodes (with 198 and 2 nodes).
Is Louvain working correctly for detecting communities in a complete graph?
Any better suggestions?
It is possible that the model selection used for this case returns a single block with all nodes, which means that there is not enough statistical evidence for more blocks.
You could try Peixoto's graph-tool package, which has an implementation of the weighted stochastic block model.
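As a rough sketch of what that could look like (assuming graph-tool is installed, g is a graph-tool Graph, and the weights live in an edge property map g.ep.weight; the rec_types entry depends on how your weights are distributed):
import graph_tool.all as gt
state = gt.minimize_blockmodel_dl(
    g,
    state_args=dict(recs=[g.ep.weight], rec_types=["real-exponential"]),
)
blocks = state.get_blocks()  # block (community) assignment per node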
If you have a weighted network you need to use the weight='weight' argument:
import networkx as nx
import community
import numpy as np
np.random.seed(0)
W = np.random.rand(15,15)
np.fill_diagonal(W,0.0)
G = nx.from_numpy_array(W)
louvain_partition = community.best_partition(G, weight='weight')
modularity2 = community.modularity(louvain_partition, G, weight='weight')
print("The modularity Q based on networkx is {}".format(modularity2))
The modularity Q based on networkx is 0.0849022950503318
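As a quick sanity check (not part of the original output), you can count how many communities the weighted run found:
n_communities = len(set(louvain_partition.values()))
print("Louvain found {} communities".format(n_communities))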
The dataset has two columns, user_id and site_name. It records every site name that every user browsed.
toy_dict = {'site_name': {0: u'\u4eac\u4e1c\u7f51\u4e0a\u5546\u57ce',
1: u'\u963f\u91cc\u4e91',
2: u'\u6dd8\u5b9d\u7f51',
3: u'\u624b\u673a\u6dd8\u5b9d\u7f51',
4: u'\u6211\u4eec\u7684\u70b9\u5fc3\u7f51',
5: u'\u8c46\u74e3\u7f51',
6: u'\u9ad8\u5fb7\u5730\u56fe',
7: u'\u817e\u8baf\u7f51',
8: u'\u70b9\u5fc3',
9: u'\u767e\u5ea6',
10: u'\u641c\u72d7',
11: u'\u8c37\u6b4c',
12: u'AccuWeather\u6c14\u8c61\u9884\u62a5',
13: u'\u79fb\u52a8\u68a6\u7f51',
14: u'\u817e\u8baf\u7f51',
15: u'\u641c\u72d7\u7f51',
16: u'360\u624b\u673a\u52a9\u624b',
17: u'\u641c\u72d0',
18: u'\u767e\u5ea6'},
'user_id': {0: 37924550,
1: 37924550,
2: 37924550,
3: 37924550,
4: 37924550,
5: 37924550,
6: 37924550,
7: 37924550,
8: 37924551,
9: 37924551,
10: 37924551,
11: 37924551,
12: 37924551,
13: 37924552,
14: 45285152,
15: 45285153,
16: 45285153,
17: 45285153,
18: 45285153}}
Now I want to reconstruct a random network and at the same time ensure that a person with n sites in the observed network will also have n sites in the randomized network.
numpy.random.shuffle in Python is inefficient when the amount of data is massive.
I am using the following Python script currently:
import pandas as pd
import numpy as np
import itertools
from collections import Counter

for i in range(10):  # reconstruct the random network 10 times
    name = pd.DataFrame(toy_dict)  # read data
    np.random.shuffle(name['site_name'].values)  # shuffle the site names
    users = name['user_id'].drop_duplicates()
    groups = name.groupby('user_id')
    pairs = []
    for ui in users[:5]:
        userdata = groups.get_group(ui)
        userdata = userdata.drop_duplicates()
        site_list = userdata['site_name'].values
        pair = list(itertools.combinations(site_list, 2))
        for j in pair:
            pairs.append(j)
    site_exp = pd.DataFrame(pairs, columns=['node1', 'node2'], dtype=str)
    site_exp['pair'] = site_exp['node1'] + '<--->' + site_exp['node2']
    counterdict = Counter(site_exp['pair'].values)
    counterdict = pd.DataFrame(list(counterdict.items()), columns=['pair', 'site_obs'])
    counterdict.to_csv('site_exp' + str(i) + '.csv')
I am wondering whether we can use a Monte Carlo algorithm in Python to reduce the computational complexity.
Shuffling complexity
The time complexity of np.shuffle is O(n) as explained here, so at least in the programs below it should not be a bottleneck, but let's explore different aspects of the question below.
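If you want to convince yourself of the linear scaling, a small timing sketch (independent of the data in the question):
import numpy as np
import time
for n in [10**5, 10**6, 10**7]:
    a = np.arange(n)
    t0 = time.time()
    np.random.shuffle(a)
    print(n, round(time.time() - t0, 4), "seconds")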
Problem formalization and complexity
If I understand correctly, your problem can be formulated as a bipartite graph with N_u user nodes, N_s website nodes and N_v edges between them, reflecting the visits, see panel (A) below.
Then counting the number of users who visited the same pairs of websites (your counterdict dictionary) simply corresponds to the
weighted bipartite network projection onto the website nodes, see panel (B) below.
The complexity of the weighted bipartite network projection for the brute-force approach is O(N_u^2*N_s). Consequently, when iterating over multiple randomizations, the O(N_v) from shuffling should be negligible (unless, of course, N_v > N_u^2*N_s). There are also approaches for sampling bipartite network projections in case of very large graphs.
In the small dummy example below, using bipartite network projection is around 150 times faster than your implementation (0.00024 vs 0.03600 seconds) and yields identical results.
The code 1
# import modules
import collections
import itertools
import time
import matplotlib.pyplot as plt
import numpy as np
import networkx as nx
import pandas as pd
import pymc3 as pm
# generate fake data for demonstration purposes
np.random.seed(0)
nvisits = 24
nusers = 12
nsites = 6
userz = np.random.choice(['U'+str(user).zfill(3) for user in range(nusers)], nvisits)
sitez = np.random.choice(range(nsites), nvisits)
users = np.unique(userz)
sites = np.unique(sitez)
# copy original implementation from the question
def get_site_pairs(users, sites, userz, sitez):
    dct = dict()
    dct['user'] = userz
    dct['site'] = sitez
    name = pd.DataFrame(dct)
    groups = name.groupby('user')
    pairs = []
    for ui in users:
        userdata = groups.get_group(ui)
        userdata = userdata.drop_duplicates()
        site_list = userdata['site'].values
        pair = list(itertools.combinations(site_list, 2))
        for j in pair:
            pairs.append(sorted(j))
    site_exp = pd.DataFrame(pairs, columns=['node1', 'node2'], dtype=str)
    site_exp['pair'] = site_exp['node1'] + '<--->' + site_exp['node2']
    counterdict = collections.Counter(site_exp['pair'].values)
    counterdict = pd.DataFrame(list(counterdict.items()), columns=['pair', 'site_obs'])
    return counterdict
temp = time.time()
counterdict = get_site_pairs(users, sites, userz, sitez)
print (time.time() - temp)
# 0.03600 seconds
# implement bipartite-graph based algorithm
def get_site_graph(users, sites, userz, sitez):
    graph = nx.Graph()
    graph.add_nodes_from(users, bipartite=0)
    graph.add_nodes_from(sites, bipartite=1)
    graph.add_edges_from(zip(userz, sitez))
    projection = nx.algorithms.bipartite.projection.weighted_projected_graph(graph, sites)
    return graph, projection
temp = time.time()
graph, projection = get_site_graph(users, sites, userz, sitez)
print (time.time() - temp)
# 0.00024 seconds
# verify equality of results
for idr, row in counterdict.iterrows():
    u, v = np.array(row['pair'].split('<--->')).astype(int)
    pro = projection[u][v]
    assert row['site_obs'] == pro['weight']
# prepare graph layouts for plotting
layers = nx.bipartite_layout(graph, userz)
circle = nx.circular_layout(projection)
width = np.array(list(nx.get_edge_attributes(projection, 'weight').values()))
width = 0.2 + 0.8 * width / max(width)
degrees = graph.degree()
# plot graphs
fig = plt.figure(figsize=(16, 9))
plt.subplot(131)
plt.title('(A)\nbipartite graph', loc='center')
nx.draw_networkx(graph, layers, width=2)
plt.axis('off')
plt.subplot(132)
plt.title('(B)\none-mode projection (onto sites)', loc='center')
nx.draw_networkx(projection, circle, edge_color=plt.cm.Greys(width), width=2)
plt.axis('off')
plt.subplot(133)
plt.title('(C)\nrandomization setup', loc='center')
nx.draw_networkx(graph, layers, width=2)
plt.text(*(layers['U000']-[0.1, 0]), '$n_u=%s$' % degrees['U000'], ha='right')
plt.text(*(layers[0]+[0.1, 0]), '$n_s=%s$' % degrees[0], ha='left')
plt.text(*(layers[1]+[0.1, 0]), '$n_t=%s$' % degrees[1], ha='left')
plt.text(0.3, -1, '$N_v=%s$' % nvisits)
plt.plot([0.3]*2, [-1, 1], lw=160, color='white')
plt.axis('off')
Network randomization and PyMC3 simulation
When randomizing the user list, as mentioned in the question, we can get a distribution of site-site connections. For networks of moderate size this should be reasonably fast, see argument regarding shuffling complexity above and code example below.
If the network is too large, sampling may be an option and the graph formalization helps to set up the sampling scenario, see panel (C) above. For given n_u and n_s edge randomization corresponds to random draws from a multivariate hypergeometric distribution.
Unfortunately, PyMC3 does not yet support hypergeometric distributions. In case this helps, I added a small example using PyMC3 and sampling from a simple binomial distribution below. The black histograms show the distribution of site-site connections n_{s,t} from full network randomization and bipartite projection.
The gray vertical line indicates that the maximum n_{s,t} <= min(N_u, n_s, n_t).
The red dots are from the binomial approximation which assumes there are nvisits*(nvisits-1)/2 pairs of edges to be distributed and the chance of connecting nodes s and t via user u is p_s * p_u * p_t * p_u, with p_x = n_x / N_x. Here, all edges are assumed to be independent and the result obviously yields an approximation only.
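As a side note (an assumption on my part, not used in the code below): newer numpy versions can draw from a multivariate hypergeometric distribution directly, which could replace the binomial approximation for a single user of degree n_u visiting sites without replacement:
rng = np.random.default_rng(0)
site_totals = [5, 3, 2, 2]                               # hypothetical visit counts per site
draw = rng.multivariate_hypergeometric(site_totals, 6)   # one user with n_u = 6 visits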
The code 2
# randomize user visits and store frequencies of site-site connections
niters = 1000
matrix = np.zeros((niters, nsites, nsites))
siten = collections.Counter(sitez)
for i in range(niters):
    np.random.shuffle(userz)
    graph, projection = get_site_graph(users, sites, userz, sitez)
    edges = projection.edges(data=True)
    for u, v, d in edges:
        matrix[i, u, v] = d['weight']
# define PyMC3 function for sampling from binomial distribution
def sample_pymc3(prob, number, bins, draws=1000):
    with pm.Model() as model:
        nst = pm.Binomial('nst', n=number, p=prob)
        trace = pm.sample(draws=draws, step=pm.Metropolis())
    nst = trace.get_values('nst')
    freqs = [np.mean(nst == val) for val in bins]
    return freqs
# define auxiliary variables
# probability to select site s by chance
probs = [np.mean(sitez == s) for s in sites]
# probability to select user u by chance
probu = [np.mean(userz == u) for u in users]
# plot connectivity statistics
nsitez = min(5, nsites)
bins = np.arange(9)
number = nvisits*(nvisits-1)/2
fig, axis = plt.subplots(nrows=nsitez,
                         ncols=nsitez,
                         figsize=(16, 9))
for s in sites[:nsitez]:
    for t in sites[:nsitez]:
        # prepare axis
        axia = axis[s, t]
        if t <= s:
            axia.set_axis_off()
            continue
        # plot histogram
        axia.hist(matrix[:, s, t], bins=bins, histtype='step', density=True,
                  zorder=-10, align='left', color='black', lw=2)
        axia.plot([min(siten[s], siten[t], nusers)+0.5]*2, [0, 0.5], lw=4, color='gray')
        # approximate probabilities using PyMC3
        prob = np.sum([probs[s] * pru * probs[t] * pru for pru in probu])
        freqs = sample_pymc3(prob, number, bins)
        axia.scatter(bins, freqs, color='red')
        # set axes
        nst = '$n_{s=%s,t=%s}$' % (s, t)
        axia.set_xlabel(nst)
        if t == s+1:
            axia.set_ylabel('frequency')
plt.suptitle('distribution of the number $n_{s,t}$\nof connections between site $s$ and $t$')
plt.tight_layout(rect=[-0.2, -0.2, 1, 0.9])