The dataset has two columns, user_id and site_name. It records every site that each user browsed.
toy_dict = {'site_name': {0: u'\u4eac\u4e1c\u7f51\u4e0a\u5546\u57ce',
1: u'\u963f\u91cc\u4e91',
2: u'\u6dd8\u5b9d\u7f51',
3: u'\u624b\u673a\u6dd8\u5b9d\u7f51',
4: u'\u6211\u4eec\u7684\u70b9\u5fc3\u7f51',
5: u'\u8c46\u74e3\u7f51',
6: u'\u9ad8\u5fb7\u5730\u56fe',
7: u'\u817e\u8baf\u7f51',
8: u'\u70b9\u5fc3',
9: u'\u767e\u5ea6',
10: u'\u641c\u72d7',
11: u'\u8c37\u6b4c',
12: u'AccuWeather\u6c14\u8c61\u9884\u62a5',
13: u'\u79fb\u52a8\u68a6\u7f51',
14: u'\u817e\u8baf\u7f51',
15: u'\u641c\u72d7\u7f51',
16: u'360\u624b\u673a\u52a9\u624b',
17: u'\u641c\u72d0',
18: u'\u767e\u5ea6'},
'user_id': {0: 37924550,
1: 37924550,
2: 37924550,
3: 37924550,
4: 37924550,
5: 37924550,
6: 37924550,
7: 37924550,
8: 37924551,
9: 37924551,
10: 37924551,
11: 37924551,
12: 37924551,
13: 37924552,
14: 45285152,
15: 45285153,
16: 45285153,
17: 45285153,
18: 45285153}}
Now I want to reconstruct a random network while ensuring that a person with n sites in the observed network also has n sites in the randomized network.
numpy.random.shuffle in Python is inefficient when the amount of data is massive.
I am using the following Python script currently:
import pandas as pd
import numpy as np
import itertools
from collections import Counter
for i in range(10):  # reconstruct the random network 10 times
    name = 'site_exp' + str(i)
    name = pd.DataFrame(toy_dict)  # read data
    np.random.shuffle(name['site_name'].values)  # shuffle the data
    users = name['user_id'].drop_duplicates()
    groups = name.groupby('user_id')
    pairs = []
    for ui in users[:5]:
        userdata = groups.get_group(ui)
        userdata = userdata.drop_duplicates()
        site_list = userdata['site_name'].values
        pair = list(itertools.combinations(site_list, 2))
        for j in pair:
            pairs.append(j)
    site_exp = pd.DataFrame(pairs, columns=['node1', 'node2'], dtype=str)
    site_exp['pair'] = site_exp['node1'] + '<--->' + site_exp['node2']
    counterdict = Counter(site_exp['pair'].values)
    counterdict = pd.DataFrame(list(counterdict.items()), columns=['pair', 'site_obs'])
    counterdict.to_csv('site_exp' + str(i) + '.csv')
I am wondering if we can use a Monte Carlo algorithm in Python and reduce computational complexity?
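To make the constraint concrete: if only the site column is permuted while the user column stays fixed, every user keeps exactly as many rows as before. The following is a minimal sketch under that reading; the function name randomize_sites and the toy arrays are hypothetical, and the count preserved is the number of visits (rows), so a user may end up with the same site twice.

```python
import numpy as np

def randomize_sites(user_ids, site_names, seed=None):
    """Permute the site column only; user_ids is returned untouched,
    so each user keeps the same number of rows (visits)."""
    rng = np.random.default_rng(seed)
    return user_ids, rng.permutation(site_names)

# hypothetical toy data: user 1 has 3 visits, user 2 has 2, user 3 has 1
user_ids = np.array([1, 1, 1, 2, 2, 3])
site_names = np.array(["a", "b", "c", "a", "d", "b"])

new_users, new_sites = randomize_sites(user_ids, site_names, seed=0)
```

Because rng.permutation returns a shuffled copy, the observed data stay intact and the function can be called once per randomization.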
Shuffling complexity
The time complexity of np.random.shuffle is O(n), so at least in the programs below it should not be a bottleneck. Let's explore different aspects of the question below.
Problem formalization and complexity
If I understand correctly, your problem can be formulated as a bipartite graph with N_u user nodes, N_s website nodes and N_v edges between them, reflecting the visits, see panel (A) below.
Then counting the number of users who visited the same pairs of websites (your counterdict dictionary) simply corresponds to the weighted bipartite network projection onto the website nodes, see panel (B) below.
The complexity of the weighted bipartite network projection for the brute-force approach is O(N_u^2*N_s). Consequently, when iterating over multiple randomizations, the O(N_v) from shuffling should be negligible (unless, of course, N_v > N_u^2*N_s). There are also approaches for sampling bipartite network projections in case of very large graphs.
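As a rough illustration of what the weighted projection computes, it can be written as a matrix product: with a binary user-by-site incidence matrix B, the entry (s, t) of B.T @ B counts the users who visited both s and t. This is only a dense NumPy sketch with a hypothetical helper name (project_onto_sites) and toy indices; a sparse matrix would be needed for large data.

```python
import numpy as np

def project_onto_sites(user_idx, site_idx, n_users, n_sites):
    """Return the site-by-site matrix of shared-user counts."""
    # binary incidence matrix: repeated visits collapse to a single 1
    B = np.zeros((n_users, n_sites), dtype=np.int64)
    B[user_idx, site_idx] = 1
    W = B.T @ B              # W[s, t] = number of users who visited both s and t
    np.fill_diagonal(W, 0)   # drop self-connections
    return W

# hypothetical toy visits: user 0 -> sites 0, 1, 2; user 1 -> sites 0, 1; user 2 -> site 2
user_idx = np.array([0, 0, 0, 1, 1, 2])
site_idx = np.array([0, 1, 2, 0, 1, 2])
W = project_onto_sites(user_idx, site_idx, n_users=3, n_sites=3)
```

Here sites 0 and 1 share two users, while each of them shares one user with site 2.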
In the small dummy example below, using bipartite network projection is around 150 times faster than your implementation (0.00024 vs 0.03600 seconds) and yields identical results.
The code 1
# import modules
import collections
import itertools
import time
import matplotlib.pyplot as plt
import numpy as np
import networkx as nx
import pandas as pd
import pymc3 as pm
# generate fake data for demonstration purposes
np.random.seed(0)
nvisits = 24
nusers = 12
nsites = 6
userz = np.random.choice(['U'+str(user).zfill(3) for user in range(nusers)], nvisits)
sitez = np.random.choice(range(nsites), nvisits)
users = np.unique(userz)
sites = np.unique(sitez)
# copy original implementation from the question
def get_site_pairs(users, sites, userz, sitez):
    dct = dict()
    dct['user'] = userz
    dct['site'] = sitez
    name = pd.DataFrame(dct)
    groups = name.groupby('user')
    pairs = []
    for ui in users:
        userdata = groups.get_group(ui)
        userdata = userdata.drop_duplicates()
        site_list = userdata['site'].values
        pair = list(itertools.combinations(site_list, 2))
        for j in pair:
            pairs.append(sorted(j))
    site_exp = pd.DataFrame(pairs, columns=['node1', 'node2'], dtype=str)
    site_exp['pair'] = site_exp['node1'] + '<--->' + site_exp['node2']
    counterdict = collections.Counter(site_exp['pair'].values)
    counterdict = pd.DataFrame(list(counterdict.items()), columns=['pair', 'site_obs'])
    return counterdict
temp = time.time()
counterdict = get_site_pairs(users, sites, userz, sitez)
print (time.time() - temp)
# 0.03600 seconds
# implement bipartite-graph based algorithm
def get_site_graph(users, sites, userz, sitez):
    graph = nx.Graph()
    graph.add_nodes_from(users, bipartite=0)
    graph.add_nodes_from(sites, bipartite=1)
    graph.add_edges_from(zip(userz, sitez))
    projection = nx.algorithms.bipartite.projection.weighted_projected_graph(graph, sites)
    return graph, projection
temp = time.time()
graph, projection = get_site_graph(users, sites, userz, sitez)
print (time.time() - temp)
# 0.00024 seconds
# verify equality of results
for idr, row in counterdict.iterrows():
    # np.int was removed from recent NumPy releases; the builtin int works everywhere
    u, v = np.array(row['pair'].split('<--->')).astype(int)
    pro = projection[u][v]
    assert row['site_obs'] == pro['weight']
# prepare graph layouts for plotting
layers = nx.bipartite_layout(graph, userz)
circle = nx.circular_layout(projection)
width = np.array(list(nx.get_edge_attributes(projection, 'weight').values()))
width = 0.2 + 0.8 * width / max(width)
degrees = graph.degree()
# plot graphs
fig = plt.figure(figsize=(16, 9))
plt.subplot(131)
plt.title('(A)\nbipartite graph', loc='center')
nx.draw_networkx(graph, layers, width=2)
plt.axis('off')
plt.subplot(132)
plt.title('(B)\none-mode projection (onto sites)', loc='center')
nx.draw_networkx(projection, circle, edge_color=plt.cm.Greys(width), width=2)
plt.axis('off')
plt.subplot(133)
plt.title('(C)\nrandomization setup', loc='center')
nx.draw_networkx(graph, layers, width=2)
plt.text(*(layers['U000']-[0.1, 0]), '$n_u=%s$' % degrees['U000'], ha='right')
plt.text(*(layers[0]+[0.1, 0]), '$n_s=%s$' % degrees[0], ha='left')
plt.text(*(layers[1]+[0.1, 0]), '$n_t=%s$' % degrees[1], ha='left')
plt.text(0.3, -1, '$N_v=%s$' % nvisits)
plt.plot([0.3]*2, [-1, 1], lw=160, color='white')
plt.axis('off')
Network randomization and PyMC3 simulation
When randomizing the user list, as mentioned in the question, we can get a distribution of site-site connections. For networks of moderate size this should be reasonably fast, see argument regarding shuffling complexity above and code example below.
If the network is too large, sampling may be an option and the graph formalization helps to set up the sampling scenario, see panel (C) above. For given n_u and n_s edge randomization corresponds to random draws from a multivariate hypergeometric distribution.
Unfortunately, PyMC3 does not yet support hypergeometric distributions. In case this helps, I added a small example using PyMC3 and sampling from a simple binomial distribution below. The black histograms show the distribution of site-site connections n_{s,t} from full network randomization and bipartite projection.
The gray vertical line indicates that the maximum n_{s,t} <= min(N_u, n_s, n_t).
The red dots are from the binomial approximation which assumes there are nvisits*(nvisits-1)/2 pairs of edges to be distributed and the chance of connecting nodes s and t via user u is p_s * p_u * p_t * p_u, with p_x = n_x / N_x. Here, all edges are assumed to be independent and the result obviously yields an approximation only.
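The multivariate hypergeometric draw mentioned above can be illustrated directly with NumPy instead of PyMC3. This is a sketch under strong assumptions: it looks at a single user in isolation, the site counts and n_u are hypothetical toy numbers, and it relies on Generator.multivariate_hypergeometric, which is available in NumPy 1.18 and later.

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical toy numbers: total visits per site (n_s) and visits of one user (n_u)
site_counts = np.array([5, 3, 2])
n_u = 4

# each row is one random assignment of the user's n_u visits to the sites,
# drawn without replacement from the pool of all visits
draws = rng.multivariate_hypergeometric(site_counts, n_u, size=1000)

# estimated chance that this user connects site 0 and site 1
p_connect = np.mean((draws[:, 0] > 0) & (draws[:, 1] > 0))
```

The draws of different users are not independent in the full randomization, because they share one pool of visits, so this only approximates the per-user contribution to n_{s,t}.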
The code 2
# randomize user visits and store frequencies of site-site connections
niters = 1000
matrix = np.zeros((niters, nsites, nsites))
siten = collections.Counter(sitez)
for i in range(niters):
    np.random.shuffle(userz)
    graph, projection = get_site_graph(users, sites, userz, sitez)
    edges = projection.edges(data=True)
    for u, v, d in edges:
        matrix[i, u, v] = d['weight']
# define PyMC3 function for sampling from binomial distribution
def sample_pymc3(prob, number, bins, draws=1000):
    with pm.Model() as model:
        nst = pm.Binomial('nst', n=number, p=prob)
        trace = pm.sample(draws=draws, step=pm.Metropolis())
    nst = trace.get_values('nst')
    freqs = [np.mean(nst == val) for val in bins]
    return freqs
# define auxiliary variables
# probability to select site s by chance
probs = [np.mean(sitez == s) for s in sites]
# probability to select user u by chance
probu = [np.mean(userz == u) for u in users]
# plot connectivity statistics
nsitez = min(5, nsites)
bins = np.arange(9)
number = nvisits*(nvisits-1)/2
fig, axis = plt.subplots(nrows=nsitez,
                         ncols=nsitez,
                         figsize=(16, 9))
for s in sites[:nsitez]:
    for t in sites[:nsitez]:
        # prepare axis
        axia = axis[s, t]
        if t <= s:
            axia.set_axis_off()
            continue
        # plot histogram
        axia.hist(matrix[:, s, t], bins=bins, histtype='step', density=True,
                  zorder=-10, align='left', color='black', lw=2)
        axia.plot([min(siten[s], siten[t], nusers)+0.5]*2, [0, 0.5], lw=4, color='gray')
        # approximate probabilities using PyMC3
        prob = np.sum([probs[s] * pru * probs[t] * pru for pru in probu])
        freqs = sample_pymc3(prob, number, bins)
        axia.scatter(bins, freqs, color='red')
        # set axes
        nst = '$n_{s=%s,t=%s}$' % (s, t)
        axia.set_xlabel(nst)
        if t == s+1:
            axia.set_ylabel('frequency')
plt.suptitle('distribution of the number $n_{s,t}$\nof connections between site $s$ and $t$')
plt.tight_layout(rect=[-0.2, -0.2, 1, 0.9])
Related
I have generated 2 groups of 1-D data points which are visually clearly separable and I want to use a Bayesian Gaussian Mixture Model (BGMM) to ideally recover 2 clusters.
Since BGMMs maximize a lower bound on the model evidence (ELBO) and given that the ELBO is supposed to combine notions of accuracy and complexity, I would expect more complex models to be penalized.
However, when running a grid search over the number of clusters, I often get a solution with more than 2 clusters. More specifically, I often get the maximal number of clusters on my grid. In the example below, I would expect the best model to define 2 clusters. Instead, the best model defines 4 but assigns minimal weights to 2 out of the 4 clusters.
I am really surprised, since 2 out of 4 clusters are therefore adding little information and this more complex model still gets selected as the best model.
Why is the BGMM then picking 4 clusters for the best model?
If this is indeed the behavior a BGMM should show, how can I then assess how many active components I actually have in my model? Visually? By defining an arbitrary threshold on the weights?
I have added the code to reproduce my example below.
# Import statements
import itertools
import multiprocessing
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
from joblib import Parallel, delayed
from sklearn.mixture import BayesianGaussianMixture
from sklearn.utils import shuffle
def fitmodel(x, params):
    '''
    Instantiates and fits Bayesian GMM
    Used in the parallel for loop
    '''
    # Gaussian mixture model
    clf = BayesianGaussianMixture(**params)
    # Fit
    clf = clf.fit(x, y=None)
    return clf
def plot_results(X, means, covariances, title):
    plt.plot(X, np.random.uniform(low=0, high=1, size=len(X)), 'o', alpha=0.1, color='cornflowerblue', label='data points')
    for i, (mean, covar) in enumerate(zip(
            means, covariances)):
        # Get normal PDF
        # covariances_ holds variances, so take the square root to get the standard deviation
        n_sd = 2.5
        sd = np.sqrt(covar)
        x = np.linspace(mean - n_sd*sd, mean + n_sd*sd, 300)
        x = x.ravel()
        y = stats.norm.pdf(x, mean, sd).ravel()
        if i == 0:
            label = 'Component PDF'
        else:
            label = None
        plt.plot(x, y, color='darkorange', label=label)
    plt.yticks(())
    plt.title(title)
# Generate data
g1 = np.random.uniform(low=-1.5, high=-1, size=(1,100))
g2 = np.random.uniform(low=1, high=1.5, size=(1,100))
X = np.append(g1, g2)
# Shuffle data
X = shuffle(X)
X = X.reshape(-1, 1)
# Define parameters for grid search
parameters = {
'n_components': [1, 2, 3, 4],
'weight_concentration_prior_type':['dirichlet_distribution']
}
# Create permutations of parameter settings
keys, values = zip(*parameters.items())
param_grid = [dict(zip(keys, v)) for v in itertools.product(*values)]
# Run GridSearch using parallel for loop
list_clf = [None] * len(param_grid)
num_cores = multiprocessing.cpu_count()
list_clf = Parallel(n_jobs=num_cores)(delayed(fitmodel)(X, params) for params in param_grid)
# Print best model (based on lower bound on model evidence)
lower_bounds = [x.lower_bound_ for x in list_clf] # Extract lower bounds on model evidence
idx = int(np.argmax(lower_bounds)) # Find best model (first one in case of ties)
best_estimator = list_clf[idx]
print(f'Parameter setting of best model: {param_grid[idx]}')
print(f'Components weights: {best_estimator.weights_}')
# Plot data points and gaussian components
plt.figure(figsize=(8,6))
ax = plt.subplot(2, 1, 1)
if best_estimator.weight_concentration_prior_type == 'dirichlet_process':
    prior_label = 'Dirichlet process'
elif best_estimator.weight_concentration_prior_type == 'dirichlet_distribution':
    prior_label = 'Dirichlet distribution'
plot_results(X, best_estimator.means_, best_estimator.covariances_,
             f'Best Bayesian GMM | {prior_label} prior')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
plt.legend(fontsize='small')
# Plot histogram of weights
ax = plt.subplot(2, 1, 2)
for k, w in enumerate(best_estimator.weights_):
    plt.bar(k, w,
            width=0.9,
            color='#56B4E9',
            zorder=3,
            align='center',
            edgecolor='black'
            )
    plt.text(k, w + 0.01, "%.1f%%" % (w * 100.),
             horizontalalignment='center')
ax.get_xaxis().set_tick_params(direction='out')
ax.yaxis.grid(True, alpha=0.7)
plt.xticks(range(len(best_estimator.weights_)))
plt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=None, hspace=0.4)
plt.ylabel('Component weight')
plt.ylim(0, np.max(best_estimator.weights_)+0.25*np.max(best_estimator.weights_))
plt.yticks(())
plt.savefig('bgmm_clustering.png')
plt.show()
plt.close()
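The thresholding idea raised in the question can be sketched in a few lines. This is only an illustration: the helper name active_components, the cutoff of 0.01 and the weight vector are hypothetical, and the weights stand in for something like best_estimator.weights_ from a fitted model.

```python
import numpy as np

def active_components(weights, threshold=0.01):
    """Return the indices of mixture components whose weight exceeds the threshold."""
    weights = np.asarray(weights, dtype=float)
    return np.flatnonzero(weights > threshold)

# hypothetical weights resembling the case described above:
# four components, two of which carry almost no mass
weights = np.array([0.495, 0.003, 0.499, 0.003])
active = active_components(weights)
n_active = len(active)
```

The cutoff is arbitrary by construction, so it is worth checking how the count changes over a range of thresholds before relying on it.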
I've a graph network created using Networkx and plotted using Mayavi.
After the graph is created, I'm deleting nodes with degree < 2, using G.remove_nodes_from(). Once the nodes are deleted, the edges connected to these nodes are deleted, but the nodes still appear in the final output (image below).
import matplotlib.pyplot as plt
from mayavi import mlab
import networkx as nx
import numpy as np
import pandas as pd
pos = [[0.1, 2, 0.3], [40, 0.5, -10],
[0.1, -40, 0.3], [-49, 0.1, 2],
[10.3, 0.3, 0.4], [-109, 0.3, 0.4]]
pos = pd.DataFrame(pos, columns=['x', 'y', 'z'])
ed_ls = [(x, y) for x, y in zip(range(0, 5), range(1, 6))]
G = nx.Graph()
G.add_edges_from(ed_ls)
remove = [node for node, degree in dict(G.degree()).items() if degree < 2]
G.remove_nodes_from(remove)
pos.drop(pos.index[remove], inplace=True)
print(G.edges)
nx.draw(G)
plt.show()
mlab.figure(1, bgcolor=bgcolor)
mlab.clf()
for i, e in enumerate(G.edges()):
    # ----------------------------------------------------------------------------
    # the x,y, and z co-ordinates are here
    pts = mlab.points3d(pos['x'], pos['y'], pos['z'],
                        scale_mode='none',
                        scale_factor=1)
    # ----------------------------------------------------------------------------
    pts.mlab_source.dataset.lines = np.array(G.edges())
    tube = mlab.pipeline.tube(pts, tube_radius=edge_size)
    mlab.pipeline.surface(tube, color=edge_color)
mlab.show() # interactive window
I'd like to ask for suggestions on how to remove the deleted nodes and the corresponding positions and display the rest in the output.
Secondly, I would like to know how to delete the nodes and the edges connected to these nodes interactively. For instance, if I want to delete nodes of degree < 2 and the edges connected to them, I would first like to display an interactive graph with all nodes of degree < 2 highlighted. The user can then select the nodes to be deleted interactively: clicking on a highlighted node deletes the node and its connected edges.
EDIT:
I tried to remove the positions of the deleted nodes from the dataframe pos by including pos.drop(pos.index[remove], inplace=True) updated in the complete code posted above.
But I still don't get the correct output.
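One likely cause, offered as a guess rather than a confirmed diagnosis: after rows are dropped from pos, the remaining points are numbered 0 to n-1 in the plotted dataset, while G.edges() still refers to the original node labels. A minimal sketch of relabeling the edges before assigning them to the dataset lines follows; the helper name remap_edges and the toy inputs are hypothetical.

```python
import numpy as np

def remap_edges(edges, kept_nodes):
    """Translate edges given in original node labels into row positions
    of the reduced position table (ordered as kept_nodes)."""
    new_index = {node: i for i, node in enumerate(kept_nodes)}
    return np.array([(new_index[u], new_index[v])
                     for u, v in edges
                     if u in new_index and v in new_index])

# toy case matching the chain graph above after removing nodes 0 and 5
kept_nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (3, 4)]
lines = remap_edges(edges, kept_nodes)
```

The resulting array indexes the remaining points consecutively, which is the form the line connectivity expects.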
Here is a solution for interactive removal of network nodes and edges in Mayavi
(I think matplotlib might be sufficient and easier but anyways...).
The solution is inspired by this Mayavi example.
However, the example is not directly transferable because a glyph (used to visualize the nodes) consists of many points, and when plotting each glyph/node by itself, the point_id cannot be used to identify the glyph/node. Moreover, it does not include the option to hide/delete objects. To avoid these problems, I used four ideas:
1. Each node/edge is plotted as a separate object, so it is easier to adjust its (visibility) properties.
2. Instead of deleting nodes/edges, they are just hidden when clicked upon. Moreover, clicking twice makes the node visible again (this does not work for the edges with the code below, but you might be able to implement that if required; it just needs keeping track of visible nodes). The visible nodes can be collected at the end (see code below).
3. As in the example, the mouse position is captured using a picker callback. But instead of using the point_id of the closest point, its coordinates are used directly.
4. The node to be deleted/hidden is found by computing the minimum Euclidean distance between the mouse position and all nodes.
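The nearest-node lookup from the last idea can be tried without Mayavi. This is a small sketch with a hypothetical helper (closest_node) and made-up coordinates; in the real callback the click position would come from the picker.

```python
import numpy as np

def closest_node(click_xyz, node_xyz):
    """Return the row index of the node nearest to the clicked position."""
    diff = np.asarray(node_xyz, dtype=float) - np.asarray(click_xyz, dtype=float)
    dist = np.linalg.norm(diff, axis=1)  # Euclidean distance to every node
    return int(np.argmin(dist))

# hypothetical node positions, one row per node
node_xyz = np.array([[0.0, 0.0, 0.0],
                     [10.0, 0.0, 0.0],
                     [0.0, 10.0, 0.0]])
idx = closest_node((9.0, 1.0, 0.0), node_xyz)
```

Since every click is mapped to some node, a maximum-distance check could be added if clicks far from any node should be ignored.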
PS: In your original code, the for-loop is quite redundant because it plots all nodes and edges many times on top of each other.
Hope that helps!
# import modules
from mayavi import mlab
import numpy as np
import pandas as pd
import networkx as nx
# set number of nodes
number = 6
# create random node positions
np.random.seed(5)
pos = 100*np.random.rand(number, 3)
pos = pd.DataFrame(pos, columns=['x', 'y', 'z'])
# create chain graph links
links = [(x, y) for x, y in zip(range(0, number-1), range(1, number))]
# create graph (not strictly needed, link list above would be enough)
graph = nx.Graph()
graph.add_edges_from(links)
# setup mayavi figure
figure = mlab.gcf()
mlab.clf()
# add nodes as individual glyphs
# store glyphs in dictionary to allow interactive adjustments of visibility
color = (0.5, 0.0, 0.5)
nodes = dict()
texts = dict()
for ni, n in enumerate(graph.nodes()):
    xyz = pos.loc[n]
    n = mlab.points3d(xyz['x'], xyz['y'], xyz['z'], scale_factor=5, color=color)
    label = 'node %s' % ni
    t = mlab.text3d(xyz['x'], xyz['y'], xyz['z']+5, label, scale=(5, 5, 5))
    # each glyph consists of many points
    # arr = n.glyph.glyph_source.glyph_source.output.points.to_array()
    nodes[ni] = n
    texts[ni] = t
# add edges as individual tubes
edges = dict()
for ei, e in enumerate(graph.edges()):
    xyz = pos.loc[np.array(e)]
    edges[ei] = mlab.plot3d(xyz['x'], xyz['y'], xyz['z'], tube_radius=1, color=color)
# define picker callback for figure interaction
def picker_callback(picker):
    # get coordinates of mouse click position
    cen = picker.pick_position
    # compute Euclidean distance between mouse position and all nodes
    dist = np.linalg.norm(pos-cen, axis=1)
    # get closest node
    ni = np.argmin(dist)
    # hide/show node and text
    n = nodes[ni]
    n.visible = not n.visible
    t = texts[ni]
    t.visible = not t.visible
    # hide/show edges
    # must be adjusted if double-clicking should hide/show both nodes and edges in a reasonable way
    for ei, edge in enumerate(graph.edges()):
        if ni in edge:
            e = edges[ei]
            e.visible = not e.visible
# add picker callback
picker = figure.on_mouse_pick(picker_callback)
picker.tolerance = 0.01
# show interactive window
# mlab.show()
# collect visibility/deletion status of nodes, e.g.
# [(0, True), (1, False), (2, True), (3, True), (4, True), (5, True)]
[(key, node.visible) for key, node in nodes.items()]
I am doing a molecular dynamics simulation. It consists of numerical integration, many for loops, and manipulation of large NumPy arrays. I have tried to use NumPy functions and arrays wherever possible, but the code is still too slow. I thought of using numba jit as a speedup, but it always throws an error message.
Here is the code.
# -*- coding: utf-8 -*-
"""
Created on Sat Mar 28 12:10:42 2020
@author: Sandipan
"""
import numpy as np
import matplotlib.pyplot as plt
from numba import jit
import os
import sys
# Setting up the simulation
NSteps =100 # Number of steps
deltat = 0.005 # Time step in reduced time units
temp = 0.851# #Reduced temperature
DumpFreq = 100 # Save the position to file every DumpFreq steps
epsilon = 1.0 # LJ parameter for the energy between particles
DIM =3
N =500
density =0.776
Rcutoff =3
#----------------------Function Definitions---------------------
#------------------Initialise Configuration--------
@jit(nopython=True)
def initialise_config(N,DIM,density):
    velocity = (np.random.randn(N,DIM)-0.5)
    # Set initial momentum to zero
    COM_V = np.sum(velocity)/N #Center of mass velocity
    velocity = velocity - COM_V # Fix any center-of-mass drift
    # Calculate initial kinetic energy
    k_energy=0
    for i in range (N):
        k_energy+=np.dot(velocity[i],velocity[i])
    vscale=np.sqrt(DIM*temp/k_energy)
    velocity*=vscale
    #Initialize with zeroes
    coords = np.zeros([N,DIM]);
    # Get the corresponding box size
    L = (N/density)**(1.0/DIM)
    """ Find the lowest perfect cube greater than or equal to the number of
    particles"""
    nCube = 2
    while (nCube**3 < N):
        nCube = nCube + 1
    # Assign particle positions
    ip=-1
    x=0
    y=0
    z=0
    for i in range(0,nCube):
        for j in range(0,nCube):
            for k in range(0,nCube):
                if(ip<N):
                    x=(i+0.5)*(L/nCube)
                    y=(j+0.5)*(L/nCube)
                    z=(k+0.5)*(L/nCube)
                    coords[ip]=np.array([x,y,z])
                    ip=ip+1
                else:
                    break
    return coords,velocity,L
@jit(nopython=True)
def wrap(pos,L):
    '''Apply periodic boundary conditions.'''
    for i in range (len(pos)):
        for k in range(DIM):
            if (pos[i][k]>0.5):
                pos[i][k]=pos[i][k]-1
            if (pos[i][k]<-0.5):
                pos[i][k]=pos[i][k]+1
    return (pos)
@jit(nopython=True)
def LJ_Forces(pos,acc,epsilon,L,DIM,N):
    # Compute forces on positions using the Lennard-Jones potential
    # Uses double nested loop which is slow O(N^2) time unsuitable for large systems
    Sij = np.zeros(DIM) # Box scaled units
    Rij = np.zeros(DIM) # Real space units
    #Set all variables to zero
    ene_pot = np.zeros(N)
    acc = acc*0
    virial=0.0
    # Loop over all pairs of particles
    for i in range(N-1):
        for j in range(i+1,N): #i+1 to N ensures we do not double count
            Sij = pos[i]-pos[j] # Distance in box scaled units
            for l in range(DIM): # Periodic interactions
                if (np.abs(Sij[l])>0.5):
                    Sij[l] = Sij[l] - np.copysign(1.0,Sij[l]) # If distance is greater than 0.5 (scaled units) then shift by one box length to find the periodic interaction distance.
            Rij = L*Sij # Scale the box to the real units in this case reduced LJ units
            Rsqij = np.dot(Rij,Rij) # Calculate the square of the distance
            if(Rsqij < Rcutoff**2):
                # Calculate LJ potential inside cutoff
                # We calculate parts of the LJ potential at a time to improve the efficiency of the computation (most important for compiled code)
                rm2 = 1.0/Rsqij # 1/r^2
                rm6 = rm2**3 # 1/r^6
                forcefact=(rm2**4)*(rm6-0.5) # force prefactor: r^-8 * (r^-6 - 0.5)
                phi =4*(rm6**2-rm6)
                ene_pot[i] = ene_pot[i]+0.5*phi # Accumulate energy
                ene_pot[j] = ene_pot[j]+0.5*phi # Accumulate energy
                virial = virial-forcefact*Rsqij # Virial is needed to calculate the pressure
                acc[i] = acc[i]+forcefact*Sij # Accumulate forces
                acc[j] = acc[j]-forcefact*Sij # (Fji=-Fij)
    return 48*acc, np.sum(ene_pot)/N, -virial/DIM # return the acceleration vector, potential energy and virial coefficient
@jit(nopython=True)
def Calculate_Temperature(vel,L,DIM,N):
    ene_kin = 0.0
    for i in range(N):
        real_vel = L*vel[i]
        ene_kin = ene_kin + 0.5*np.dot(real_vel,real_vel)
    ene_kin_aver = 1.0*ene_kin/N
    temperature = 2.0*ene_kin_aver/DIM
    return ene_kin_aver,temperature
# Main MD loop
@jit(nopython=True)
def main():
    # Vectors to store parameter values at each step
    ene_kin_aver = np.ones(NSteps)
    ene_pot_aver = np.ones(NSteps)
    temperature = np.ones(NSteps)
    virial = np.ones(NSteps)
    pressure = np.ones(NSteps)
    pos,vel,L = initialise_config(N,DIM,density)
    acc = (np.random.randn(N,DIM)-0.5)
    volume=L**3
    # Open file which we will save the outputs to
    if os.path.exists('energy2'):
        os.remove('energy2')
    f = open('traj.xyz', 'w')
    for k in range(NSteps):
        # Refold positions according to periodic boundary conditions
        pos=wrap(pos,L)
        # r(t+dt) modify positions according to velocity and acceleration
        pos = pos + deltat*vel + 0.5*(deltat**2.0)*acc # Step 1
        # Calculate temperature
        ene_kin_aver[k],temperature[k] = Calculate_Temperature(vel,L,DIM,N)
        # Rescale velocities and take half step
        chi = np.sqrt(temp/temperature[k])
        vel = chi*vel + 0.5*deltat*acc # v(t+dt/2) Step 2
        # Compute forces a(t+dt),ene_pot,virial
        acc, ene_pot_aver[k], virial[k] = LJ_Forces(pos,acc,epsilon,L,DIM,N) # Step 3
        # Complete the velocity step
        vel = vel + 0.5*deltat*acc # v(t+dt) Step 4
        # Calculate temperature
        ene_kin_aver[k],temperature[k] = Calculate_Temperature(vel,L,DIM,N)
        # Calculate pressure
        pressure[k]= density*temperature[k] + virial[k]/volume
        # Print output to file every DumpFreq number of steps
        if(k%DumpFreq==0): # The % symbol is the modulus so if the Step is a whole multiple of DumpFreq then print the values
            f.write("%s\n" %(N)) # Write the number of particles to file
            # Write all of the quantities at this step to the file
            f.write("Energy %s, Temperature %.5f\n" %(ene_kin_aver[k]+ene_pot_aver[k],temperature[k]))
            for n in range(N): # Write the positions to file
                f.write("X"+" ")
                for l in range(DIM):
                    f.write(str(pos[n][l]*L)+" ")
                f.write("\n")
        if (k%5==0):
            # print("\rStep: {0} KE: {1} PE: {2} Energy: {3}".format(k, ene_kin_aver[k], ene_pot_aver[k],ene_kin_aver[k]+ene_pot_aver[k]))
            sys.stdout.write("\rStep: {0} KE: {1} PE: {2} Energy: {3}".format(k, ene_kin_aver[k], ene_pot_aver[k],ene_kin_aver[k]+ene_pot_aver[k]))
            sys.stdout.flush()
    return ene_kin_aver, ene_pot_aver, temperature, pressure, pos
#------------------------------------------------------
ene_kin_aver, ene_pot_aver, temperature, pressure, pos = main()
# Plot all of the quantities
def plot():
    plt.figure(figsize=[7,12])
    plt.rc('xtick', labelsize=15)
    plt.rc('ytick', labelsize=15)
    plt.subplot(4, 1, 1)
    plt.plot(ene_kin_aver,'k-')
    plt.ylabel(r"$E_{K}$", fontsize=20)
    plt.subplot(4, 1, 2)
    plt.plot(ene_pot_aver,'k-')
    plt.ylabel(r"$E_{P}$", fontsize=20)
    plt.subplot(4, 1, 3)
    plt.plot(temperature,'k-')
    plt.ylabel(r"$T$", fontsize=20)
    plt.subplot(4, 1, 4)
    plt.plot(pressure,'k-')
    plt.ylabel(r"$P$", fontsize=20)
    plt.show()
plot()
The error I am getting is:
runfile('E:/Project/LJMD4.py', wdir='E:/Project')
Traceback (most recent call last):
File "<ipython-input-8-aeebce887079>", line 1, in <module>
runfile('E:/Project/LJMD4.py', wdir='E:/Project')
File "C:\Users\Sandipan\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 827, in runfile
execfile(filename, namespace)
File "C:\Users\Sandipan\Anaconda3\lib\site-packages\spyder_kernels\customize\spydercustomize.py", line 110, in execfile
exec(compile(f.read(), filename, 'exec'), namespace)
File "E:/Project/LJMD4.py", line 226, in <module>
ene_kin_aver, ene_pot_aver, temperature, pressure, pos = main()
File "C:\Users\Sandipan\Anaconda3\lib\site-packages\numba\dispatcher.py", line 351, in _compile_for_args
error_rewrite(e, 'typing')
File "C:\Users\Sandipan\Anaconda3\lib\site-packages\numba\dispatcher.py", line 318, in error_rewrite
reraise(type(e), e, None)
File "C:\Users\Sandipan\Anaconda3\lib\site-packages\numba\six.py", line 658, in reraise
raise value.with_traceback(tb)
TypingError: cannot determine Numba type of <class 'builtin_function_or_method'>
When I searched on the internet, I found that numba may not support some function I am using. But I am not using Pandas or any other data frame library. I am just using pure Python loops and NumPy, which, as far as the numba documentation suggests, are well supported. I have tried removing numba from some functions and setting nopython=0, but it still produces different error messages. I can't figure out what is wrong with it. Without numba the code will not be feasible for actual use. Any further tips on speedup will be of great help.
Thank you in advance.
A few common errors
Use of unsupported functions
File operations and many string operations are not supported in nopython mode. These can be placed in an objmode block.
In this example I commented these things out.
Wrong way of initializing arrays
Only tuples are supported as the shape argument, not lists (NumPy accepts both, but its documentation only mentions tuples).
Checking for division by zero and throwing an exception
This is the standard behavior of Python, but not of NumPy. If you want to avoid this overhead (an if/else on every division), turn on the NumPy default behaviour (error_model="numpy").
Use of globals
Globals are hard-coded into the compiled code (as if you had written them directly into the code). They cannot be changed without recompilation.
Wrong indexing of NumPy arrays
pos[i][k] instead of pos[i,k]. Numba may optimize this away, but it has a quite noticeable negative impact in pure Python code.
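Several of the points above (tuple-style indexing, error_model="numpy", passing values as arguments instead of reading globals) can be seen together in one small function. This is a minimal sketch, not the code from the working version: the function name wrap_positions, the box argument and the fallback decorator used when numba is not installed are all assumptions made for illustration.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # hypothetical fallback so the sketch still runs as plain Python without numba
    def njit(*args, **kwargs):
        def decorator(func):
            return func
        return decorator

@njit(error_model="numpy")  # skip Python-style division-by-zero checks
def wrap_positions(pos, box):
    """Fold coordinates back into [-box/2, box/2]; box is an argument, not a global."""
    for i in range(pos.shape[0]):
        for k in range(pos.shape[1]):
            # single tuple index pos[i, k] instead of chained pos[i][k]
            if pos[i, k] > 0.5 * box:
                pos[i, k] -= box
            elif pos[i, k] < -0.5 * box:
                pos[i, k] += box
    return pos

pos = np.array([[0.6, -0.7, 0.1],
                [0.2, 0.5, -0.5]])
wrapped = wrap_positions(pos.copy(), 1.0)
```

Passing box explicitly means the value can change between calls without recompilation, which is not the case for a global.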
Working version
# -*- coding: utf-8 -*-
"""
Created on Sat Mar 28 12:10:42 2020
@author: Sandipan
"""
import numpy as np
import matplotlib.pyplot as plt
from numba import jit
import os
import sys
# All globals are compile time constants
# recompilation needed if you change this values
# Better way: hand a tuple of all needed vars to the functions
# params=(NSteps,deltat,temp,DumpFreq,epsilon,DIM,N,density,Rcutoff)
# Setting up the simulation
NSteps =100 # Number of steps
deltat = 0.005 # Time step in reduced time units
temp = 0.851# #Reduced temperature
DumpFreq = 100 # Save the position to file every DumpFreq steps
epsilon = 1.0 # LJ parameter for the energy between particles
DIM =3
N =500
density =0.776
Rcutoff =3
params=(NSteps,deltat,temp,DumpFreq,epsilon,DIM,N,density,Rcutoff)
#----------------------Function Definitions---------------------
#------------------Initialise Configuration--------
#error_model=True
#Do you really want to search for division by zeros (additional cost)?
@jit(nopython=True,error_model="numpy")
def initialise_config(N,DIM,density):
    velocity = (np.random.randn(N,DIM)-0.5)
    # Set initial momentum to zero
    COM_V = np.sum(velocity)/N #Center of mass velocity
    velocity = velocity - COM_V # Fix any center-of-mass drift
    # Calculate initial kinetic energy
    k_energy=0
    for i in range (N):
        k_energy+=np.dot(velocity[i],velocity[i])
    vscale=np.sqrt(DIM*temp/k_energy)
    velocity*=vscale
    #wrong array initialization (use tuple)
    #Initialize with zeroes
    coords = np.zeros((N,DIM))
    # Get the corresponding box size
    L = (N/density)**(1.0/DIM)
    """ Find the lowest perfect cube greater than or equal to the number of
    particles"""
    nCube = 2
    while (nCube**3 < N):
        nCube = nCube + 1
    # Assign particle positions
    ip=-1
    x=0
    y=0
    z=0
    for i in range(0,nCube):
        for j in range(0,nCube):
            for k in range(0,nCube):
                if(ip<N):
                    x=(i+0.5)*(L/nCube)
                    y=(j+0.5)*(L/nCube)
                    z=(k+0.5)*(L/nCube)
                    coords[ip]=np.array([x,y,z])
                    ip=ip+1
                else:
                    break
    return coords,velocity,L
@jit(nopython=True)
def wrap(pos,L):
    '''Apply periodic boundary conditions.'''
    #correct array indexing
    for i in range (len(pos)):
        for k in range(DIM):
            if (pos[i,k]>0.5):
                pos[i,k]=pos[i,k]-1
            if (pos[i,k]<-0.5):
                pos[i,k]=pos[i,k]+1
    return (pos)
#jit(nopython=True,error_model="numpy")
def LJ_Forces(pos,acc,epsilon,L,DIM,N):
# Compute forces on positions using the Lennard-Jones potential
# Uses double nested loop which is slow O(N^2) time unsuitable for large systems
Sij = np.zeros(DIM) # Box scaled units
Rij = np.zeros(DIM) # Real space units
#Set all variables to zero
ene_pot = np.zeros(N)
acc = acc*0
virial=0.0
# Loop over all pairs of particles
for i in range(N-1):
for j in range(i+1,N): #i+1 to N ensures we do not double count
Sij = pos[i]-pos[j] # Distance in box scaled units
for l in range(DIM): # Periodic interactions
if (np.abs(Sij[l])>0.5):
Sij[l] = Sij[l] - np.copysign(1.0,Sij[l]) # If distance is greater than 0.5 (scaled units) then subtract 0.5 to find periodic interaction distance.
Rij = L*Sij # Scale the box to the real units in this case reduced LJ units
Rsqij = np.dot(Rij,Rij) # Calculate the square of the distance
if(Rsqij < Rcutoff**2):
# Calculate LJ potential inside cutoff
# We calculate parts of the LJ potential at a time to improve the efficiency of the computation (most important for compiled code)
rm2 = 1.0/Rsqij # 1/r^2
rm6 = rm2**3 # 1/r^6
forcefact=(rm2**4)*(rm6-0.5) # Force factor r^-8*(r^-6 - 0.5); the prefactor 48 is applied in the return statement
phi =4*(rm6**2-rm6)
ene_pot[i] = ene_pot[i]+0.5*phi # Accumulate energy
ene_pot[j] = ene_pot[j]+0.5*phi # Accumulate energy
virial = virial-forcefact*Rsqij # Virial is needed to calculate the pressure
acc[i] = acc[i]+forcefact*Sij # Accumulate forces
acc[j] = acc[j]-forcefact*Sij # (Fji=-Fij)
#If you want to get the best performance, sum directly in the loop instead of
#summing at the end with np.sum(ene_pot)
return 48*acc, np.sum(ene_pot)/N, -virial/DIM # return the acceleration vector, potential energy and virial coefficient
#jit(nopython=True,error_model="numpy")
def Calculate_Temperature(vel,L,DIM,N):
ene_kin = 0.0
for i in range(N):
real_vel = L*vel[i]
ene_kin = ene_kin + 0.5*np.dot(real_vel,real_vel)
ene_kin_aver = 1.0*ene_kin/N
temperature = 2.0*ene_kin_aver/DIM
return ene_kin_aver,temperature
# Main MD loop
#jit(nopython=True,error_model="numpy")
def main(params):
NSteps,deltat,temp,DumpFreq,epsilon,DIM,N,density,Rcutoff=params
# Vectors to store parameter values at each step
ene_kin_aver = np.ones(NSteps)
ene_pot_aver = np.ones(NSteps)
temperature = np.ones(NSteps)
virial = np.ones(NSteps)
pressure = np.ones(NSteps)
pos,vel,L = initialise_config(N,DIM,density)
acc = (np.random.randn(N,DIM)-0.5)
volume=L**3
# Open file which we will save the outputs to
# Unsupported operations have to be in an objectmode block
# or simply write the outputs at the end in a pure Python function
"""
if os.path.exists('energy2'):
os.remove('energy2')
f = open('traj.xyz', 'w')
"""
for k in range(NSteps):
# Refold positions according to periodic boundary conditions
pos=wrap(pos,L)
# r(t+dt) modify positions according to velocity and acceleration
pos = pos + deltat*vel + 0.5*(deltat**2.0)*acc # Step 1
# Calculate temperature
ene_kin_aver[k],temperature[k] = Calculate_Temperature(vel,L,DIM,N)
# Rescale velocities and take half step
chi = np.sqrt(temp/temperature[k])
vel = chi*vel + 0.5*deltat*acc # v(t+dt/2) Step 2
# Compute forces a(t+dt),ene_pot,virial
acc, ene_pot_aver[k], virial[k] = LJ_Forces(pos,acc,epsilon,L,DIM,N) # Step 3
# Complete the velocity step
vel = vel + 0.5*deltat*acc # v(t+dt) Step 4
# Calculate temperature
ene_kin_aver[k],temperature[k] = Calculate_Temperature(vel,L,DIM,N)
# Calculate pressure
pressure[k]= density*temperature[k] + virial[k]/volume
# Print output to file every DumpFreq number of steps
"""
if(k%DumpFreq==0): # The % symbol is the modulus so if the Step is a whole multiple of DumpFreq then print the values
f.write("%s\n" %(N)) # Write the number of particles to file
# Write all of the quantities at this step to the file
f.write("Energy %s, Temperature %.5f\n" %(ene_kin_aver[k]+ene_pot_aver[k],temperature[k]))
for n in range(N): # Write the positions to file
f.write("X"+" ")
for l in range(DIM):
f.write(str(pos[n][l]*L)+" ")
f.write("\n")
#Simple prints without formatting are supported
if (k%5==0):
#print("\rStep: {0} KE: {1} PE: {2} Energy: {3}".format(k, ene_kin_aver[k], ene_pot_aver[k],ene_kin_aver[k]+ene_pot_aver[k]))
#sys.stdout.write("\rStep: {0} KE: {1} PE: {2} Energy: {3}".format(k, ene_kin_aver[k], ene_pot_aver[k],ene_kin_aver[k]+ene_pot_aver[k]))
#sys.stdout.flush()
"""
return ene_kin_aver, ene_pot_aver, temperature, pressure, pos
#------------------------------------------------------
ene_kin_aver, ene_pot_aver, temperature, pressure, pos = main(params)
# Plot all of the quantities
def plot():
plt.figure(figsize=[7,12])
plt.rc('xtick', labelsize=15)
plt.rc('ytick', labelsize=15)
plt.subplot(4, 1, 1)
plt.plot(ene_kin_aver,'k-')
plt.ylabel(r"$E_{K}$", fontsize=20)
plt.subplot(4, 1, 2)
plt.plot(ene_pot_aver,'k-')
plt.ylabel(r"$E_{P}$", fontsize=20)
plt.subplot(4, 1, 3)
plt.plot(temperature,'k-')
plt.ylabel(r"$T$", fontsize=20)
plt.subplot(4, 1, 4)
plt.plot(pressure,'k-')
plt.ylabel(r"$P$", fontsize=20)
plt.show()
plot()
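The periodic folding done element by element in wrap above can also be written as a single NumPy expression. The following is only a minimal sketch under stated assumptions, not part of the original code: wrap_loop and wrap_vectorized are hypothetical names, the coordinates are assumed to be in box-scaled units, and the two versions are assumed to agree only for coordinates less than 1.5 box lengths from the origin (the loop shifts by at most one box length).

```python
import numpy as np

def wrap_loop(pos):
    """Reference version: shift a coordinate by one box length if it left [-0.5, 0.5]."""
    out = pos.copy()
    for i in range(out.shape[0]):
        for k in range(out.shape[1]):
            if out[i, k] > 0.5:
                out[i, k] -= 1.0
            if out[i, k] < -0.5:
                out[i, k] += 1.0
    return out

def wrap_vectorized(pos):
    """Same folding as one array expression: subtract the nearest integer."""
    return pos - np.rint(pos)

rng = np.random.default_rng(0)
pos = rng.uniform(-1.4, 1.4, size=(5, 3))   # made-up scaled coordinates

folded_loop = wrap_loop(pos)
folded_vec = wrap_vectorized(pos)
print(np.allclose(folded_loop, folded_vec))
```

The same expression, applied to Sij instead of pos, would give the nearest-image separation used inside the force loop.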
I am trying to plot the results of a PCA of the dataset pima-indians-diabetes.csv. My code fails only in the plotting part:
import numpy
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import pandas as pd
# Dataset Description:
# 1. Number of times pregnant
# 2. Plasma glucose concentration a 2 hours in an oral glucose tolerance test
# 3. Diastolic blood pressure (mm Hg)
# 4. Triceps skin fold thickness (mm)
# 5. 2-Hour serum insulin (mu U/ml)
# 6. Body mass index (weight in kg/(height in m)^2)
# 7. Diabetes pedigree function
# 8. Age (years)
# 9. Class variable (0 or 1)
path = 'pima-indians-diabetes.data.csv'
dataset = numpy.loadtxt(path, delimiter=",")
X = dataset[:,0:8]
Y = dataset[:,8]
features = ['1','2','3','4','5','6','7','8','9']
df = pd.read_csv(path, names=features)
x = df.loc[:, features].values # Separating out the values
y = df.loc[:,['9']].values # Separating out the target
x = StandardScaler().fit_transform(x) # Standardizing the features
pca = PCA(n_components=2)
principalComponents = pca.fit_transform(x)
# principalDf = pd.DataFrame(data=principalComponents, columns=['pca1', 'pca2'])
# finalDf = pd.concat([principalDf, df[['9']]], axis = 1)
plt.figure()
colors = ['navy', 'turquoise', 'darkorange']
lw = 2
for color, i, target_name in zip(colors, [0, 1, 2], ['Negative', 'Positive']):
plt.scatter(principalComponents[y == i, 0], principalComponents[y == i, 1], color=color, alpha=.8, lw=lw,
label=target_name)
plt.legend(loc='best', shadow=False, scatterpoints=1)
plt.title('PCA of pima-indians-diabetes Dataset')
The error is located at the following line:
Traceback (most recent call last):
File "test.py", line 53, in <module>
plt.scatter(principalComponents[y == i, 0], principalComponents[y == i, 1], color=color, alpha=.8, lw=lw,
IndexError: too many indices for array
How can I fix this?
As the error indicates some kind of shape/dimension mismatch, a good starting point is to check the shapes of the arrays involved in the operation:
principalComponents.shape
yields
(768, 2)
while
(y==i).shape
(768, 1)
This leads to a shape mismatch when trying to run
principalComponents[y==i, 0]
because the boolean mask y==i is itself two-dimensional: it already indexes both axes of principalComponents, so the additional 0 is one index too many, which is what the error reports.
You can fix this by forcing the shape of y==i to a 1D array ((768,)), e.g. by changing your call to scatter to
plt.scatter(principalComponents[(y == i).reshape(-1), 0],
principalComponents[(y == i).reshape(-1), 1],
color=color, alpha=.8, lw=lw, label=target_name)
which then creates the plot for me.
For more information on the difference between arrays of shape (R, 1) and (R,), this question on StackOverflow provides a nice starting point.
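The shape issue can be reproduced without the dataset. The snippet below is a minimal sketch on made-up data: components stands in for principalComponents and y is a small column vector of labels, both of which are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
components = rng.normal(size=(6, 2))          # stand-in for principalComponents
y = np.array([[0], [1], [0], [1], [1], [0]])  # column vector of labels, shape (6, 1)

mask_2d = (y == 1)              # shape (6, 1): a 2D boolean mask
mask_1d = mask_2d.reshape(-1)   # shape (6,): one boolean per row
print(mask_2d.shape, mask_1d.shape)

# The 2D mask already uses up both axes, so adding a column index fails.
error_message = None
try:
    components[mask_2d, 0]
except IndexError as exc:
    error_message = str(exc)
print("IndexError:", error_message)

# The 1D mask selects rows, and 0 then selects the first column.
selected = components[mask_1d, 0]
print(selected.shape)
```

Flattening y once, right after extracting it (for example with .ravel()), would avoid repeating the reshape in every indexing expression.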
Referring to : https://stackoverflow.com/a/44907357/305883
I am using the python-louvain implementation to detect communities in a complete weighted graph.
But I only get one partition, containing all nodes.
Code:
import community # this is pip install python-louvain
import networkx as nx
import matplotlib.pyplot as plt
# Replace this with your networkx graph loading depending on your format !
# using graph g as a complete graph, weights between 0 and 1
#first compute the best partition
partition = community.best_partition(g)
#drawing
size = float(len(set(partition.values())))
pos = nx.spring_layout(g)
count = 0.
for com in set(partition.values()) :
count = count + 1.
list_nodes = [nodes for nodes in partition.keys() if partition[nodes] == com]
nx.draw_networkx_nodes(g, pos, list_nodes, node_size = 20, node_color = str(count / size))
nx.draw_networkx_edges(g, pos, alpha=0.1)
plt.show()
I would like to extract communities from a complete weighted network.
I also tried girvan_newman (https://networkx.github.io/documentation/networkx-2.0/reference/algorithms/generated/networkx.algorithms.community.centrality.girvan_newman.html), but it could only detect 2 communities in a complete graph of 200 nodes (with 198 and 2 nodes).
Does Louvain work correctly for detecting communities in a complete graph?
Are there better suggestions?
It is possible that the model selection used in this case returns a single block containing all nodes, which means that there is not enough statistical evidence for more blocks.
You could try Peixoto's graph-tool package, which has an implementation of the weighted stochastic block model.
If you have a weighted network you need to use the weight='weight' argument:
import networkx as nx
import community
import numpy as np
np.random.seed(0)
W = np.random.rand(15,15)
np.fill_diagonal(W,0.0)
G = nx.from_numpy_array(W)
louvain_partition = community.best_partition(G, weight='weight')
modularity2 = community.modularity(louvain_partition, G, weight='weight')
print("The modularity Q based on networkx is {}".format(modularity2))
The modularity Q based on networkx is 0.0849022950503318
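A modularity this close to zero suggests that the random weights carry little community structure. To make the quantity concrete, here is a minimal NumPy sketch of the standard Newman modularity for a weighted graph given as a symmetric matrix with zero diagonal. The function name weighted_modularity and the test matrices are assumptions for illustration. It is not python-louvain's implementation, although it should agree with it for graphs without self-loops. It shows that a single community containing all nodes always has Q = 0, while a planted two-block structure gives a clearly positive Q.

```python
import numpy as np

def weighted_modularity(W, labels):
    """Newman modularity Q for a symmetric weight matrix W with zero diagonal."""
    W = np.asarray(W, dtype=float)
    labels = np.asarray(labels)
    two_m = W.sum()                                 # every undirected edge is counted twice
    strength = W.sum(axis=1)                        # weighted degree of each node
    same = labels[:, None] == labels[None, :]       # True where two nodes share a community
    expected = np.outer(strength, strength) / two_m
    return float(((W - expected) * same).sum() / two_m)

rng = np.random.default_rng(0)
n = 20

# Complete graph with uniform random weights, all nodes in one community.
W = rng.random((n, n))
W = (W + W.T) / 2.0
np.fill_diagonal(W, 0.0)
q_single = weighted_modularity(W, np.zeros(n, dtype=int))

# Complete graph with two planted blocks: strong weights inside, weak weights between.
planted = np.repeat([0, 1], n // 2)
W_planted = np.where(planted[:, None] == planted[None, :], 1.0, 0.1)
np.fill_diagonal(W_planted, 0.0)
q_planted = weighted_modularity(W_planted, planted)

print(q_single, q_planted)
```

If the weights of the real graph do not separate into strongly and weakly connected groups in this way, a result with one community or a very low Q is what modularity maximisation would be expected to return.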