Why are some values in Sklearn K-Nearest Neighbors affinity matrix for spectral clustering equal to 0.5?

I am trying to understand how KNN works for spectral clustering. The affinity matrix I get below has a few values equal to 0.5.
I added the affinity matrix to its transpose and took the average of it but there are still a few discrepancies with the result that this code gives.
I specifically want to know how the 0.5's below come about.
from sklearn.cluster import SpectralClustering
import matplotlib.pyplot as plt
import numpy as np
X = np.array([[1, 0], [1, 1], [1, 2], [2, 0], [2, 1],
              [3, 5], [3, 6], [4, 7]])
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', n_neighbors=3, assign_labels='discretize', random_state=0).fit(X)
# Print the sparse affinity matrix that the fitted model stores
print(sc.affinity_matrix_)
(0, 1) 1.0
(0, 3) 1.0
(0, 0) 1.0
(1, 4) 0.5
(1, 0) 1.0
(1, 2) 1.0
(1, 1) 1.0
(2, 4) 0.5
(2, 1) 1.0
(2, 2) 1.0
(3, 0) 1.0
(3, 4) 1.0
(3, 3) 1.0
(4, 2) 0.5
(4, 1) 0.5
(4, 3) 1.0
(4, 4) 1.0
(5, 7) 1.0
(5, 6) 1.0
(5, 5) 1.0
(6, 7) 1.0
(6, 5) 1.0
(6, 6) 1.0
(7, 5) 1.0
(7, 6) 1.0
(7, 7) 1.0
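One plausible reading of the 0.5 entries, assuming scikit-learn builds this affinity as 0.5 * (A + A.T) from a directed k-nearest-neighbors graph A in which each point counts as its own neighbor: an entry is 1.0 when two points pick each other and 0.5 when only one of them picks the other. The sketch below is plain Python, not scikit-learn's code; the helper name knn_connectivity and the tie-breaking rule (lowest index wins) are assumptions made for illustration.

```python
from math import dist

# Points from the question.
X = [(1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (3, 5), (3, 6), (4, 7)]
n_neighbors = 3  # assumed to include the point itself


def knn_connectivity(points, k):
    """Directed 0/1 kNN graph: row i marks the k points nearest to i (self included)."""
    n = len(points)
    rows = []
    for i in range(n):
        # sorted() is stable, so equal distances are broken by lower index (an assumption).
        nearest = sorted(range(n), key=lambda j: dist(points[i], points[j]))[:k]
        rows.append([1.0 if j in nearest else 0.0 for j in range(n)])
    return rows


A = knn_connectivity(X, n_neighbors)
n = len(X)
# Symmetrize: 1.0 where both directions exist, 0.5 where only one does.
affinity = [[0.5 * (A[i][j] + A[j][i]) for j in range(n)] for i in range(n)]

# List the one-directional neighbor relations.
for i in range(n):
    for j in range(n):
        if affinity[i][j] == 0.5:
            print((i, j), affinity[i][j])
```

Under these assumptions, point 4 picks point 1 but not the reverse, and point 2 picks point 4 but not the reverse, which would account for the four 0.5 entries listed above.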

Related

Creating a column of edges

I need to concatenate a uid from uids column to each of the uids in the list of the friends column, as shown in the following example:
Given a pandas.DataFrame object A:
uid friends
0 1 [10, 2, 1, 5]
1 2 [1, 2]
2 3 [5, 4]
3 4 [10, 5]
4 5 [1, 2, 5]
the desired output is:
uid friends in_edges
0 1 [10, 2, 1, 5] [(1, 10), (1, 2), (1, 1), (1, 5)]
1 2 [1, 2] [(2, 1), (2, 2)]
2 3 [5, 4] [(3, 5), (3, 4)]
3 4 [10, 5] [(4, 10), (4, 5)]
4 5 [1, 2, 5] [(5, 1), (5, 2), (5, 5)]
I use the following code to achieve this outcome:
import numpy as np
import pandas as pd
A = pd.DataFrame(dict(uid=[1, 2, 3, 4, 5], friends=[[10, 2, 1, 5], [1, 2], [5, 4], [10, 5], [1, 2, 5]]))
A.loc[:, 'in_edges'] = A.loc[:, 'uid'].apply(lambda uid: [(uid, f) for f in A.loc[A.loc[:, 'uid']==uid, 'friends'].values[0]])
but the A.loc[A.loc[:, 'uid']==uid, 'friends'] part looks cumbersome to me, so I wondered whether there is an easier way to accomplish this task.
Thanks in advance.
You can use .apply() with axis=1 parameter:
df["in_edges"] = df[["uid", "friends"]].apply(
    lambda x: [(x["uid"], f) for f in x["friends"]], axis=1
)
print(df)
Prints:
uid friends in_edges
0 1 [10, 2, 1, 5] [(1, 10), (1, 2), (1, 1), (1, 5)]
1 2 [1, 2] [(2, 1), (2, 2)]
2 3 [5, 4] [(3, 5), (3, 4)]
3 4 [10, 5] [(4, 10), (4, 5)]
4 5 [1, 2, 5] [(5, 1), (5, 2), (5, 5)]
Why not try itertools.product?
import itertools
A['in_edges'] = A.apply(lambda x : [*itertools.product([x['uid']], x['friends'])],axis=1)
A
Out[50]:
uid friends in_edges
0 1 [10, 2, 1, 5] [(1, 10), (1, 2), (1, 1), (1, 5)]
1 2 [1, 2] [(2, 1), (2, 2)]
2 3 [5, 4] [(3, 5), (3, 4)]
3 4 [10, 5] [(4, 10), (4, 5)]
4 5 [1, 2, 5] [(5, 1), (5, 2), (5, 5)]
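Another option, sketched here as an alternative to both answers, is to pair each uid with its own friends list using zip, which avoids the row lookup entirely. The data is kept as plain lists (the names uids and friends are illustrative) so the snippet runs without pandas.

```python
# Same data as the question, as plain Python lists.
uids = [1, 2, 3, 4, 5]
friends = [[10, 2, 1, 5], [1, 2], [5, 4], [10, 5], [1, 2, 5]]

# One list of (uid, friend) tuples per row.
in_edges = [[(uid, f) for f in fs] for uid, fs in zip(uids, friends)]

for row in in_edges:
    print(row)
```

With the DataFrame from the question, the same comprehension should work over zip(A['uid'], A['friends']), with the result assigned to A['in_edges'].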

How to avoid in-place modification of a Networkx graph when removing edges

I have a networkx graph and I'm trying to remove edges from it using remove_edge.
I want to remove each edge in the original graph and post-process H to get further stats like edges connected to the edge that has been removed.
import networkx as nx
import matplotlib.pyplot as plt
# fig 1
n=10
G = nx.gnm_random_graph(n=10, m=10, seed=1)
nx.draw(G, with_labels=True)
plt.show()
for e in [[5, 0], [3, 6]]:
    H = G.remove_edge(e[0], e[1])
    nx.draw(G, with_labels=True)
    plt.show()
In the above, the edge is removed in place in G, so by the second iteration the original graph is no longer present. How can this be avoided? I want to retain the original graph for every iteration and instead store the graph that results after edge removal in another copy, H.
Any suggestions will be highly appreciated.
EDIT: Based on what's suggested below
n=10
G = nx.gnm_random_graph(n=10, m=10, seed=1)
nx.draw(G, with_labels=True)
plt.show()
G_copy = G.copy()
for e in [[5, 0], [3, 6]]:
    print(G_copy.edges())
    H = G_copy.remove_edge(e[0], e[1])
    nx.draw(G_copy, with_labels=True)
    plt.show()
print(G_copy.edges())
Obtained output:
[(0, 6), (0, 7), (0, 5), (1, 4), (1, 7), (1, 9), (2, 9), (3, 6), (3, 4), (6, 9)]
[(0, 6), (0, 7), (1, 4), (1, 7), (1, 9), (2, 9), (3, 6), (3, 4), (6, 9)]
Expected:
[(0, 6), (0, 7), (0, 5), (1, 4), (1, 7), (1, 9), (2, 9), (3, 6), (3, 4), (6, 9)]
[(0, 6), (0, 7), (0, 5), (1, 4), (1, 7), (1, 9), (2, 9), (3, 6), (3, 4), (6, 9)]
Make a copy of the original graph and modify the copy:
H = G.copy()
...
H.remove_edge(e[0], e[1])
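A fuller sketch of the copy-then-modify idea, assuming the copy should be taken inside the loop so that every iteration starts from the untouched original. The graph is rebuilt from the edge list printed in the question rather than from gnm_random_graph, and the reduced dictionary is a hypothetical container added for illustration.

```python
import networkx as nx

# Edge list taken from the question's printed output.
G = nx.Graph([(0, 6), (0, 7), (0, 5), (1, 4), (1, 7), (1, 9),
              (2, 9), (3, 6), (3, 4), (6, 9)])

reduced = {}  # hypothetical: removed edge -> graph without that edge
for u, v in [(5, 0), (3, 6)]:
    H = G.copy()         # fresh copy each iteration, so G is never modified
    H.remove_edge(u, v)  # mutates H in place and returns None
    reduced[(u, v)] = H
    print(G.number_of_edges(), H.number_of_edges())
```

Note that remove_edge returns None, so assigning its result to H (as in the question) discards the graph; the copy has to be bound to H before the edge is removed.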

Preserving order of edges for adding edge weights to a Networkx graph

I have the following graph:
from collections import OrderedDict
import networkx as nx
ed_ls = [(0, 1), (0, 63), (1, 2), (1, 3), (54, 0)]
ed_w = [1, 2, 3, 4, 5]
G = nx.Graph()
G.add_edges_from(ed_ls)
# Pairs weights with G.edges, whose order can differ from ed_ls
edge_w = OrderedDict(zip(G.edges, ed_w))
nx.set_edge_attributes(G, edge_w, 'weight')
print(G.edges)
print(nx.get_edge_attributes(G, 'weight'))
Output obtained:
{(0, 1): 1, (0, 63): 2, (0, 54): 3, (1, 2): 4, (1, 3): 5}
The edge weights in ed_w are in the same order as the edges in ed_ls. Since the order of edges is not preserved, wrong edge weights are assigned. I could use nx.DiGraph to avoid this problem. However, I want to use nx.k_core later on and this doesn't work on directed graphs. Suggestions on how to go ahead will be highly appreciated.
You can simplify this by using Graph.add_weighted_edges_from:
ed_ls = [(0, 1), (0, 63), (1, 2), (1, 3), (54, 0)]
ed_w = [1, 2, 3, 4, 5]
G = nx.Graph()
G.add_weighted_edges_from(((*edge, w) for edge, w in zip(ed_ls, ed_w)))
G.edges(data=True)
EdgeDataView([(0, 1, {'weight': 1}), (0, 63, {'weight': 2}),
(0, 54, {'weight': 5}), (1, 2, {'weight': 3}),
(1, 3, {'weight': 4})])
If you're using Python 3.7 or later, dictionaries maintain insertion order, but the order you see when printing G.edges(data=True) is not necessarily the order in which the edges were added; it follows the order in which the nodes were added. As you can see in this example, (54, 0) is shown before (1, 2) since node 0 was added first.
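To check that each weight landed on the intended edge regardless of the order in which edges are reported, one option is to look weights up by endpoints rather than by position. This is a small sketch reusing the question's data; the lookups use the standard G[u][v] adjacency access.

```python
import networkx as nx

ed_ls = [(0, 1), (0, 63), (1, 2), (1, 3), (54, 0)]
ed_w = [1, 2, 3, 4, 5]

G = nx.Graph()
G.add_weighted_edges_from((u, v, w) for (u, v), w in zip(ed_ls, ed_w))

# Look each weight up by its endpoints instead of relying on iteration order.
for (u, v), w in zip(ed_ls, ed_w):
    assert G[u][v]["weight"] == w
    assert G[v][u]["weight"] == w  # undirected: (54, 0) and (0, 54) are the same edge

print(G[0][54]["weight"])
```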
Why don't you assign the weights to the edges at the time of adding them to the graph?
ed_ls = [(0, 1), (0, 63), (1, 2), (1, 3), (54, 0)]
ed_w = [1, 2, 3, 4, 5]
G = nx.Graph()
for i in range(len(ed_ls)):
    src, dst = ed_ls[i]
    G.add_edge(src, dst, weight=ed_w[i])

How to iterate python windowed() to last element?

According to the more_itertools.windowed specification, you can do:
list(windowed(seq=[1, 2, 3, 4], n=2, step=1))
>>> [(1, 2), (2, 3), (3, 4)]
But what if I want to run it all to the end? Is it possible to get:
>>> [(1, 2), (2, 3), (3, 4), (4, None)]
A workaround, though not the best solution, is to append None to the sequence.
list(windowed(seq=[1, 2, 3, 4,None], n=2, step=1))
I believe you can do this programmatically based on the step= value which I refer to as win_step in the following code. I also removed hardcoding where possible to make it easier to test various sequence_list, win_width, and win_step data sets:
from more_itertools import windowed

sequence_list = [1, 2, 3, 4]
win_width = 2
win_step = 1
# Pad the end with win_step None values so the last element gets its own window
none_list = []
for i in range(win_step):
    none_list.append(None)
sequence_list.extend(none_list)
tuple_list = list(windowed(seq=sequence_list, n=win_width, step=win_step))
print('tuple_list:', tuple_list)
Here are my results based on your original question's data set, and on the current data set:
For original, where:
sequence_list = [1, 2, 3, 4, 5, 6]
win_width = 3
win_step = 2
The result is:
tuple_list: [(1, 2, 3), (3, 4, 5), (5, 6, None), (None, None, None)]
And for the present data set, where:
sequence_list = [1, 2, 3, 4]
win_width = 2
win_step = 1
The result is:
tuple_list: [(1, 2), (2, 3), (3, 4), (4, None)]
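If the trailing all-None window in the first result is unwanted, an alternative is to pad by the window width minus one and start a window at every step-th real element. The sketch below uses only the standard library; padded_windows is a hypothetical helper, not part of more_itertools, and it leaves the input list unmodified.

```python
def padded_windows(seq, n, step=1, fillvalue=None):
    """Hypothetical helper: windows of width n starting at every step-th item of seq."""
    items = list(seq)
    # n - 1 fill values are enough for a window starting at the last real item.
    padded = items + [fillvalue] * (n - 1)
    return [tuple(padded[i:i + n]) for i in range(0, len(items), step)]


print(padded_windows([1, 2, 3, 4], n=2))
# [(1, 2), (2, 3), (3, 4), (4, None)]
print(padded_windows([1, 2, 3, 4, 5, 6], n=3, step=2))
# [(1, 2, 3), (3, 4, 5), (5, 6, None)]
```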

How to convert dask dataframe to scipy csr matrix?

TLDR: I am aware of this question, but it defeats the purpose of lazily evaluating dask collections. Converting a lazily evaluated dask.dataframe to an array with .values is problematic (memory-wise). So, how can I convert a dask dataframe to a scipy.sparse.csr_matrix without first converting the dataframe to an array?
Complete Problem:
I have 3 dask dataframes: one with text features, one with numerical features and one with categorical features. I am vectorizing the dataframe with text features using sklearn.feature_extraction.text.TfidfVectorizer, which returns a scipy.sparse.csr_matrix. I need to concatenate the 3 dataframes into one (horizontally), but they have different dtypes. I also tried dask_ml.feature_extraction.text.HashingVectorizer. It returns a lazily evaluated dask.array, but .compute() returns a scipy.sparse.csr_matrix. Without .compute(), when I try to convert it to a dask.dataframe as below:
import dask.array as da
import dask.dataframe as dd
.
.
.
# Fit on the dask.dataframe; the result is a lazily evaluated dask.array
X = vectorizer.fit_transform(X)
print(X.compute())
X = dd.from_dask_array(X)
print(X.compute())
X = vectorizer.fit_transform(X) returns:
dask.array<_transform, shape=(nan, 1000), dtype=float64, chunksize=(nan, 1000), chunktype=numpy.ndarray>
First print(X.compute()) returns a csr_matrix:
(0, 73) 2.0
(0, 95) 3.0
(0, 286) 1.0
(0, 340) 2.0
(0, 373) 3.0
(0, 379) 3.0
(0, 387) 1.0
(0, 407) 2.0
(0, 421) 1.0
(0, 479) 1.0
(0, 482) 3.0
(0, 515) 1.0
(0, 520) 1.0
(0, 560) 4.0
(0, 596) 1.0
(0, 620) 4.0
(0, 630) 1.0
(0, 648) 2.0
(0, 680) 1.0
(0, 721) 1.0
(0, 760) 3.0
(0, 824) 4.0
(0, 826) 12.0
(0, 880) 2.0
(0, 908) 1.0
: :
(10, 985) 1.0
(11, 95) 3.0
(11, 171) 4.0
(11, 259) 4.0
(11, 276) 1.0
(11, 352) 3.0
(11, 358) 1.0
(11, 436) 1.0
(11, 485) 1.0
(11, 507) 3.0
(11, 553) 1.0
(11, 589) 1.0
(11, 604) 1.0
(11, 619) 3.0
(11, 625) 2.0
(11, 719) 1.0
(11, 826) 6.0
(11, 858) 2.0
(11, 880) 3.0
(11, 908) 1.0
(11, 925) 2.0
(11, 930) 4.0
(11, 968) 1.0
(11, 975) 1.0
(11, 984) 4.0
X = dd.from_dask_array(X) returns:
Dask DataFrame Structure:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Second print(X.compute()) returns the error:
ValueError: Shape of passed values is (6, 1), indices imply (6, 1000)
So I can't convert a csr_matrix to a dask.dataframe either.
UPDATE: I've just realized that using .values on a dask.dataframe actually returns a lazily evaluated dask.array. This is still not what I want, but at least it does not return a fully materialized dataframe or array on my local machine.
The best available approach is probably to convert each block to a csr matrix and then make it dense.
from scipy.sparse import csr_matrix
dd.from_dask_array(X.map_blocks(lambda x: csr_matrix(x).todense()))
