I have built a graph from nodes like this:
import pandas as pd
data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'a': [55, 2123, -19.3, 9, -8],
    'b': [21, -0.1, 0.003, 4, 2.1]
})
import networkx as nx
G = nx.Graph()
# Add one node per row, keyed by id, with the remaining columns as attributes
for i, attr in data.set_index('id').iterrows():
    G.add_node(i, **attr.to_dict())
I have calculated a similarity matrix (excluding the id column):
from sklearn.metrics.pairwise import cosine_similarity
# Calculate the pairwise cosine similarities
S = cosine_similarity(data.drop('id', axis=1))
T = S.tolist()
df = pd.DataFrame.from_records(T)
Here is my adjacency matrix:
adj_mat = pd.DataFrame(df.to_numpy(), index=data['id'], columns=data['id'])
Now, how can I "attach" and connect the nodes using this adj_mat? For example, I want the node with id = 1 to connect to the node with id = 2 through an edge whose similarity attribute equals the similarity calculated in the adjacency matrix.
Please advise how to do this.
I solved this by first building the graph from the adjacency matrix:
G = nx.from_pandas_adjacency(adj_mat)
Then I removed the self-loops and looped over my node data to update the nodes with their attributes:
G.remove_edges_from(nx.selfloop_edges(G))
for i, attr in data.set_index('id').iterrows():
    G.add_node(i, **attr.to_dict())
Hope it will help others :)
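For anyone who wants to check the result, here is a minimal end-to-end sketch of the steps above on the toy data from the question. It assumes that from_pandas_adjacency stores each matrix entry under the default 'weight' edge attribute; treating that weight as the "similarity" is my reading of the question, not something the library names that way.

```python
import networkx as nx
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'a': [55, 2123, -19.3, 9, -8],
    'b': [21, -0.1, 0.003, 4, 2.1],
})

# Pairwise cosine similarities, labelled by node id on both axes
S = cosine_similarity(data.drop('id', axis=1))
adj_mat = pd.DataFrame(S, index=data['id'], columns=data['id'])

# Every non-zero entry becomes an edge; its value is stored as 'weight'
G = nx.from_pandas_adjacency(adj_mat)

# The diagonal (self-similarity) produces self-loops, so drop them
G.remove_edges_from(list(nx.selfloop_edges(G)))

# Adding an existing node again only updates its attributes
for i, attr in data.set_index('id').iterrows():
    G.add_node(i, **attr.to_dict())

print(G[1][2]['weight'])  # similarity between nodes 1 and 2
print(G.nodes[1])         # attributes attached to node 1
```

If you prefer the attribute to be literally called similarity, copying it over after construction (for example with nx.set_edge_attributes) should work, but I have not needed that myself.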
I have the following vectors in my toy example:
data = pd.DataFrame({
'id': [1, 2, 3, 4, 5],
'a': [55, 2123, -19.3, 9, -8],
'b': [21, -0.1, 0.003, 4, 2.1]
})
I have calculated a similarity matrix (excluding the id column):
from sklearn.metrics.pairwise import cosine_similarity
# Calculate the pairwise cosine similarities
S = cosine_similarity(data.drop('id', axis=1))
T = S.tolist()
df = pd.DataFrame.from_records(T)
This returns a matrix/DataFrame with every combination, including self-similarities and duplicates.
Is there an efficient way to calculate the similarities without self-similarities (a vector is 100% similar to itself) and without duplicates (if vectors 1 and 2 have 89% similarity, I don't need the similarity of vectors 2 and 1, as it is the same)?
The best solution I have found so far is to take the upper triangle above the diagonal:
[In] S[np.triu_indices_from(S, k=1)]
[Out] array([ 0.93420158, -0.93416293, 0.99856978, -0.81303909, -0.99999999,
0.91379242, -0.96724292, -0.91374841, 0.96727042, -0.78074903])
This takes only the values strictly above the main diagonal (k=1), so it excludes both the ones on the diagonal and the repeated values. The result is a NumPy array, too.
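If you also need to know which pair each value belongs to, the same indices can be reused to label the result. This is only a sketch on the toy data from the question; the column names id_1, id_2 and similarity are names I made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

data = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'a': [55, 2123, -19.3, 9, -8],
    'b': [21, -0.1, 0.003, 4, 2.1],
})

S = cosine_similarity(data.drop('id', axis=1))

# Row/column positions of the entries strictly above the main diagonal
rows, cols = np.triu_indices_from(S, k=1)

ids = data['id'].to_numpy()
pairs = pd.DataFrame({
    'id_1': ids[rows],          # hypothetical column name
    'id_2': ids[cols],          # hypothetical column name
    'similarity': S[rows, cols],
})
print(pairs)
```

For n vectors this gives n * (n - 1) / 2 rows, so 10 rows for the 5 vectors here. Note that the full matrix is still computed first; this only filters the output.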
I have a Delaunay Triangulation (DT) (scipy) as follows:
from scipy.spatial import Delaunay
# Take first 50 rows with 3 attributes to create a DT
d = data.loc[:50, ['aid', 'x', 'y']].values
dt = Delaunay(points=d)
# List of simplices in the DT (4 vertices each, since the points are 3-D)
dt.simplices
'''
array([[1, 3, 4, 0],
[1, 2, 3, 0],
[1, 2, 3, 4]], dtype=int32)
'''
Now, I want to create a graph using 'networkx' package and add the nodes and edges found using DT from above.
# Create an empty graph with no nodes and no edges.
G = nx.Graph()
The code I have come up with to add the unique nodes from the DT simplices to 'G' is:
# Python3 list to contain nodes
nodes = []
for simplex in dt.simplices.tolist():
    for nde in simplex:
        # Keep each vertex only once, in order of first appearance
        if nde not in nodes:
            nodes.append(nde)
nodes
# [1, 3, 4, 0, 2]
# Add nodes to graph
G.add_nodes_from(nodes)
How do I add edges to 'G' using 'dt.simplices'? For example, the first triangle is [1, 3, 4, 0] and is between the nodes/vertices 1, 3, 4 and 0. How do I figure out which nodes are attached to each other and then add them as edges to 'G'?
Also, is there a better way to add nodes to 'G'?
I am using Python 3.8.
Thanks!
You could add the rows in the array as paths. A path just constitutes a sequence of edges, so the path 1,2,3 translates to the edge list (1,2),(2,3).
So iterate over the rows and use nx.add_path:
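import numpy as np
import networkx as nx
simplices = np.array([[1, 3, 4, 0],
                      [1, 2, 3, 0],
                      [1, 2, 3, 4]])
G = nx.Graph()
# Each row becomes a chain of edges; missing nodes are added automatically
for path in simplices:
    nx.add_path(G, path)
nx.draw(G, with_labels=True, node_size=500, node_color='lightgreen')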
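One caveat: add_path only links consecutive entries of each row, so for [1, 3, 4, 0] you get (1,3), (3,4), (4,0) but not (1,4), (1,0) or (3,0). If you want every pair of vertices within a simplex to be connected, which is how I would read "attached to each other", a possible sketch uses itertools.combinations. Whether that is the graph you want depends on your use case, so treat this as an assumption.

```python
from itertools import combinations

import networkx as nx
import numpy as np

simplices = np.array([[1, 3, 4, 0],
                      [1, 2, 3, 0],
                      [1, 2, 3, 4]])

# Variant 1: consecutive vertices only (what add_path does)
G_path = nx.Graph()
for simplex in simplices.tolist():
    nx.add_path(G_path, simplex)

# Variant 2: every pair of vertices in a simplex shares an edge
G_full = nx.Graph()
for simplex in simplices.tolist():
    G_full.add_edges_from(combinations(simplex, 2))

print(sorted(G_full.nodes))
print(G_path.number_of_edges(), G_full.number_of_edges())
```

In both variants the nodes are created as a side effect of adding edges, so the separate loop that collects unique nodes is not needed.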
I'm looking to better understand the covariance_ attribute returned by scikit-learn's LDA object.
I'm sure I'm missing something, but I expect it to be the covariance matrix associated with the input data. However, when I compare .covariance_ against the covariance matrix returned by numpy.cov(), I get different results.
Can anyone help me understand what I am missing? Thanks and happy to provide any additional information.
Please find a simple example illustrating the discrepancy below.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Sample Data
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 0, 0, 0])
# Covariance matrix via np.cov
print(np.cov(X.T))
# Covariance matrix via LDA
clf = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)
print(clf.covariance_)
In sklearn.discriminant_analysis.LinearDiscriminantAnalysis, the covariance is computed as follows:
In [1]: import numpy as np
   ...: cov = np.zeros(shape=(X.shape[1], X.shape[1]))
   ...: for c in np.unique(y):
   ...:     Xg = X[y == c, :]
   ...:     cov += np.count_nonzero(y == c) / len(y) * np.cov(Xg.T, bias=1)
   ...: print(cov)
[[0.66666667 0.33333333]
 [0.33333333 0.22222222]]
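So covariance_ is not the covariance of the whole data set. It is the weighted sum of the (biased) covariance matrices of the individual classes, where each weight is the class prior, which defaults to the class frequency. Note that this prior is a parameter of LDA (priors).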
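To convince yourself, you can compare the manual computation with the fitted attribute directly. This is a small sketch on the sample data from the question; it uses the fitted classes_ and priors_ attributes so the weights are the ones the estimator actually used.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
y = np.array([1, 1, 1, 0, 0, 0])

clf = LinearDiscriminantAnalysis(store_covariance=True).fit(X, y)

# Prior-weighted sum of the biased per-class covariance matrices
manual = np.zeros((X.shape[1], X.shape[1]))
for c, prior in zip(clf.classes_, clf.priors_):
    Xc = X[y == c]
    manual += prior * np.cov(Xc.T, bias=True)

print(np.allclose(manual, clf.covariance_))

# For comparison: covariance of all samples pooled together
print(np.cov(X.T))
```

The pooled np.cov(X.T) also contains the spread between the class means, which is why it comes out larger than the within-class covariance stored by LDA.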
I am trying to cluster a data frame that was given to me. It has 14 columns. How can I cluster on 8 of them?
Below is the code that I found and followed.
[Figures: elbow-method plot and cluster visualization]
# K-Means Clustering
# importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# importing the customer Expenses Invoices dataset with pandas
dataset=pd.read_csv('Expense_Invoice.csv')
X=dataset.iloc[: , [3,2]].values
# Using the elbow method to find the optimal number of clusters
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11),wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters K')
plt.ylabel('Within-cluster sum of squares (WCSS)')
plt.show()
# Applying k-means to the mall dataset
kmeans=KMeans(n_clusters=3, init='k-means++', max_iter= 300, n_init= 10, random_state= 0)
y_kmeans=kmeans.fit_predict(X)
# Visualizing the clusters
plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 100, c = 'red', label='Careful(c1)')
plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 100, c = 'green', label='Standard(c2)')
plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 100, c = 'blue', label='Target(c3)')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 250, c = 'yellow',
label='Centroids')
plt.title('Clusters of customer Invoices & Expenses')
plt.xlabel('Total Invoices ')
plt.ylabel('Total Expenses')
plt.legend()
plt.show()
This works perfectly, but only for two columns (variables). I want to do it for 8 columns, but I could not work out how.
With X=dataset.iloc[: , [3,2]].values you are specifically selecting the 4th and 3rd columns.
KMeans performs the clustering on all columns you selected.
Therefore you need to change X=dataset.iloc[: , [3,2]] to suit your needs. E.g., to use the first 8 columns of your dataset: X=dataset.iloc[:, 0:8].values.
Take a look at pandas documentation for more options how to select data in dataframes: https://pandas.pydata.org/pandas-docs/stable/indexing.html
Keep in mind that with more than two features you can no longer visualize the clusters directly in a 2D scatter plot as you did before.
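A rough sketch of what that looks like, using a made-up 14-column frame in place of Expense_Invoice.csv (the random data and the column names col0..col13 are placeholders, not your dataset). The PCA step at the end is just one option I would consider if you still want a 2D picture; it is not required for the clustering itself.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Placeholder data standing in for the real 14-column dataset
rng = np.random.default_rng(0)
dataset = pd.DataFrame(rng.normal(size=(60, 14)),
                       columns=[f'col{i}' for i in range(14)])

# Select the first 8 columns; a list such as [0, 2, 3, ...] works too
X = dataset.iloc[:, 0:8].values

kmeans = KMeans(n_clusters=3, init='k-means++', max_iter=300,
                n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

print(X.shape)                        # (rows, 8)
print(kmeans.cluster_centers_.shape)  # one 8-dimensional centre per cluster

# Optional: project to 2 dimensions purely for plotting
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)
```

You could then scatter X_2d[:, 0] against X_2d[:, 1] coloured by labels, keeping in mind that the projection distorts distances, so clusters may overlap in the plot even when they are separate in 8 dimensions.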
I want to implement k-means in Python, but I don't know how to find the minimum of the Euclidean distances.
I have calculated the distances for my data with 3 clusters.
This is my result array:
[array([4, 5], dtype=int64), 4.1231056256176606, 0,
array([4, 8], dtype=int64), 4.4721359549995796, 0,
array([14, 23], dtype=int64), 22.022715545545239, 0,
array([4, 5], dtype=int64), 1.0, 1,
array([4, 8], dtype=int64), 2.0, 1,
array([14, 23], dtype=int64), 19.723082923316021, 1]
Here is my code:
for i in range(len(centroidrandom)):
    for j in range(3):
        jarak_ = euclidean_distances(data[j], centroidrandom[:][i])
        cluster.append(data[j])
        cluster.append(jarak_[0][0])
        cluster.append(i)
print(cluster)
Here is some example code for kmeans clustering with three clusters, modified from the example given in the comment above:
from pylab import plot,show
from numpy import vstack,array
from numpy.random import rand
from scipy.cluster.vq import kmeans,vq
# data generation for three sets of data
data = vstack((rand(150,2) + array([.5,.5]),rand(150,2), rand(150,2) + array([0,.5])))
# computing K-Means with K = 3 (3 clusters)
centroids,_ = kmeans(data,3)
# assign each sample to a cluster
idx,_ = vq(data,centroids)
print(idx)
# some plotting using numpy's logical indexing
plot(data[idx==0,0],data[idx==0,1],'ob',
data[idx==1,0],data[idx==1,1],'or',
data[idx==2,0],data[idx==2,1],'oy')
plot(centroids[:,0],centroids[:,1],'sg',markersize=8)
show()
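If you want to do the minimum-distance step yourself rather than call vq, a common approach is to build the full point-to-centroid distance matrix and take the argmin along the centroid axis. Below is a sketch with plain NumPy. The three data points are taken from the question's output; the first two centroids are values I chose because they reproduce the distances printed there, and the third centroid is purely hypothetical.

```python
import numpy as np

# Data points as they appear in the question's result array
data = np.array([[4, 5], [4, 8], [14, 23]])

# Assumed centroids: the first two reproduce the printed distances,
# the third is a hypothetical value added for illustration
centroids = np.array([[0, 6], [4, 6], [15, 20]])

# distances[p, c] = Euclidean distance from point p to centroid c
diff = data[:, None, :] - centroids[None, :, :]
distances = np.linalg.norm(diff, axis=2)

# Index of the nearest centroid for each point, and that distance
labels = distances.argmin(axis=1)
min_dist = distances.min(axis=1)

print(distances)
print(labels)
print(min_dist)
```

Keeping the distances in a 2D array instead of a flat list of mixed items is what makes the argmin step a one-liner; the rest of k-means is then recomputing each centroid as the mean of its assigned points and repeating.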