How to convert dask dataframe to scipy csr matrix? - python-3.x

TLDR: I am aware of this question, but it goes against the idea of lazily evaluating dask collections. Converting a lazily evaluated dask.dataframe to an array with .values is problematic memory-wise. So, how can I convert a dask dataframe to a scipy.sparse.csr_matrix without first converting the dataframe to an array?
Complete Problem:
I have 3 dask dataframes. One of them has text features, one has numerical features and one has categorical features. I am vectorizing the dataframe with text features using sklearn.feature_extraction.text.TfidfVectorizer, which returns a scipy.sparse.csr_matrix. I need to concatenate the 3 dataframes into one (horizontally), but they have different dtypes. I also tried dask_ml.feature_extraction.text.HashingVectorizer. It returns a lazily evaluated dask.array, but .compute() returns a scipy.sparse.csr_matrix. When I try to convert it to a dask.dataframe without calling .compute(), as below:
import dask.array as da
import dask.dataframe as dd
.
.
.
# Fit/transform the dask.dataframe; the result is a lazily evaluated dask.array
X = vectorizer.fit_transform(X)
print(X.compute())
X = dd.from_dask_array(X)
print(X.compute())
X = vectorizer.fit_transform(X) returns:
dask.array<_transform, shape=(nan, 1000), dtype=float64, chunksize=(nan, 1000), chunktype=numpy.ndarray>
First print(X.compute()) returns a csr_matrix:
(0, 73) 2.0
(0, 95) 3.0
(0, 286) 1.0
(0, 340) 2.0
(0, 373) 3.0
(0, 379) 3.0
(0, 387) 1.0
(0, 407) 2.0
(0, 421) 1.0
(0, 479) 1.0
(0, 482) 3.0
(0, 515) 1.0
(0, 520) 1.0
(0, 560) 4.0
(0, 596) 1.0
(0, 620) 4.0
(0, 630) 1.0
(0, 648) 2.0
(0, 680) 1.0
(0, 721) 1.0
(0, 760) 3.0
(0, 824) 4.0
(0, 826) 12.0
(0, 880) 2.0
(0, 908) 1.0
: :
(10, 985) 1.0
(11, 95) 3.0
(11, 171) 4.0
(11, 259) 4.0
(11, 276) 1.0
(11, 352) 3.0
(11, 358) 1.0
(11, 436) 1.0
(11, 485) 1.0
(11, 507) 3.0
(11, 553) 1.0
(11, 589) 1.0
(11, 604) 1.0
(11, 619) 3.0
(11, 625) 2.0
(11, 719) 1.0
(11, 826) 6.0
(11, 858) 2.0
(11, 880) 3.0
(11, 908) 1.0
(11, 925) 2.0
(11, 930) 4.0
(11, 968) 1.0
(11, 975) 1.0
(11, 984) 4.0
Then X = dd.from_dask_array(X) returns:
Dask DataFrame Structure:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
Second print(X.compute()) returns the error:
ValueError: Shape of passed values is (6, 1), indices imply (6, 1000)
So I can't convert the csr_matrix to a dask.dataframe either.
UPDATE: I've just realized that using .values on a dask.dataframe actually returns a lazily evaluated dask.array. This is still not what I want, but at least it does not materialize a full dataframe or array on my local machine.

The best possible way is to actually convert each block to a csr matrix and make it dense:
from scipy.sparse import csr_matrix
dd.from_dask_array(X.map_blocks(lambda x: csr_matrix(x).todense()))
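For context, a minimal end-to-end sketch of that idea (the 1000-column width and the names X_dense/X_ddf are assumptions for illustration, not from the original post): densify each sparse chunk, then wrap the resulting dask array in a dataframe with explicit column names. Note that densifying the blocks gives up the memory savings of the sparse format.
import dask.dataframe as dd
import numpy as np
from scipy.sparse import csr_matrix
# X is the lazily evaluated dask.array produced by the vectorizer above
X_dense = X.map_blocks(lambda block: np.asarray(csr_matrix(block).todense()), dtype=np.float64)
X_ddf = dd.from_dask_array(X_dense, columns=['f%d' % i for i in range(1000)])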

Related

Why are some values in Sklearn K-Nearest Neighbors affinity matrix for spectral clustering equal to 0.5?

I am trying to understand how KNN works for spectral clustering. The affinity matrix I am getting below has a few values equal to 0.5.
I added the affinity matrix to its transpose and took the average, but there are still a few discrepancies from the result that this code gives.
I specifically want to know how the 0.5's below come about.
from sklearn.cluster import SpectralClustering
import matplotlib.pyplot as plt
import numpy as np
X = np.array([[1, 0], [1, 1], [1, 2], [2,0], [2, 1],
[3, 5], [3, 6], [4, 7]])
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors', n_neighbors=3, assign_labels='discretize', random_state=0).fit(X)
print(sc.affinity_matrix_)  # the sparse affinity matrix shown below
(0, 1) 1.0
(0, 3) 1.0
(0, 0) 1.0
(1, 4) 0.5
(1, 0) 1.0
(1, 2) 1.0
(1, 1) 1.0
(2, 4) 0.5
(2, 1) 1.0
(2, 2) 1.0
(3, 0) 1.0
(3, 4) 1.0
(3, 3) 1.0
(4, 2) 0.5
(4, 1) 0.5
(4, 3) 1.0
(4, 4) 1.0
(5, 7) 1.0
(5, 6) 1.0
(5, 5) 1.0
(6, 7) 1.0
(6, 5) 1.0
(6, 6) 1.0
(7, 5) 1.0
(7, 6) 1.0
(7, 7) 1.0
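A short sketch of where those 0.5 values can come from (an assumption based on scikit-learn's documented behaviour, not part of the original question): the 'nearest_neighbors' affinity is the symmetrized kNN connectivity graph, 0.5 * (C + C.T), so a pair that is a neighbor in only one direction ends up at 0.5, while mutual neighbors (and the diagonal) stay at 1.0.
import numpy as np
from sklearn.neighbors import kneighbors_graph
X = np.array([[1, 0], [1, 1], [1, 2], [2, 0], [2, 1],
              [3, 5], [3, 6], [4, 7]])
C = kneighbors_graph(X, n_neighbors=3, include_self=True)  # binary kNN connectivity
affinity = 0.5 * (C + C.T)                                 # same symmetrization as SpectralClustering
print(affinity)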

Interpolate seconds to milliseconds in dataset?

I have a dataset sorted by timestamps in seconds. However, I need to somehow convert it to millisecond accuracy.
Example
dataset = [
# UNIX timestamps with reading data
(0, 0.48499),
(2, 0.48475),
(3, 0.48475),
(3, 0.48473),
(3, 0.48433),
(3, 0.48403),
(3, 0.48403),
(3, 0.48403),
(3, 0.48403),
(3, 0.48403),
(5, 0.48396),
(12, 0.48353),
]
Expected output (roughly)
interpolated = [
# Timestamps with millisecond accuracy
(0.0, 0.48499),
(2.0, 0.48475),
(3.0, 0.48475),
(3.14, 0.48473),
(3.28, 0.48433),
(3.42, 0.48403),
(3.57, 0.48403),
(3.71, 0.48403),
(3.85, 0.48403),
(3.99, 0.48403),
(5.0, 0.48396),
(12.0, 0.48353),
]
I don't have much experience with Pandas and I've gone through interpolate and drop_duplicates, but I couldn't figure out how to go about this.
I would think this is a common problem, so any help is appreciated. Ideally I want to spread the numbers evenly.
You can use the groupby and apply methods. I didn't come up with a dedicated method like interpolate for this case, but there might be a more pythonic way.
Code:
import numpy as np
import pandas as pd
# Create a sample dataframe
dataset = [(0, 0.48499), (2, 0.48475), (3, 0.48475), (3, 0.48473), (3, 0.48433), (3, 0.48403), (3, 0.48403), (3, 0.48403), (3, 0.48403), (3, 0.48403), (5, 0.48396), (12, 0.48353)]
df = pd.DataFrame(dataset, columns=['t', 'value'])
# Convert UNIX timestamps into the desired format
df.t = df.groupby('t', group_keys=False).apply(lambda df: df.t + np.linspace(0, 1, len(df)))
Output:
t          value
0          0.48499
2          0.48475
3          0.48475
3.14286    0.48473
3.28571    0.48433
3.42857    0.48403
3.57143    0.48403
3.71429    0.48403
3.85714    0.48403
4          0.48403
5          0.48396
12         0.48353
(Input:)
t     value
0     0.48499
2     0.48475
3     0.48475
3     0.48473
3     0.48433
3     0.48403
3     0.48403
3     0.48403
3     0.48403
3     0.48403
5     0.48396
12    0.48353
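If you prefer the spread-out offsets to stay strictly inside their own second (closer to the 3.99 in the expected output), a small variant of the answer above (a sketch, not the original poster's code) is to exclude the endpoint of the linspace:
import numpy as np
import pandas as pd
dataset = [(0, 0.48499), (2, 0.48475), (3, 0.48475), (3, 0.48473), (3, 0.48433), (3, 0.48403), (3, 0.48403), (3, 0.48403), (3, 0.48403), (3, 0.48403), (5, 0.48396), (12, 0.48353)]
df = pd.DataFrame(dataset, columns=['t', 'value'])
# Spread duplicated timestamps evenly, stopping short of the next whole second
df.t = df.groupby('t', group_keys=False).apply(lambda g: g.t + np.linspace(0, 1, len(g), endpoint=False))
print(df)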

pandas get max threshold values from tuples in list

I am working with a pandas dataframe. One of the columns has a list of tuples with scores in each row. I am trying to get the scores higher than 0.20. How do I apply a threshold instead of max? I tried itemgetter and a lambda with if/else, but it didn't work as I thought. What am I doing wrong?
from operator import itemgetter
import pandas as pd
# sample data
l1 = ['1','2','3']
l2 = ['test1','test2','test3']
l3 = [[(1,0.95),(5,0.05)],[(7,0.10),(1,0.20),(6,0.70)],[(7,0.30),(1,0.70)]]
df = pd.DataFrame({'id':l1,'text':l2,'score':l3})
print(df)
# Preview from print statement above
id text score
1 test1 [(1, 0.95), (5, 0.05)]
2 test2 [(7, 0.1), (1, 0.2), (6, 0.7)]
3 test3 [(7, 0.3), (1, 0.7)]
# Try #1:
print(df['score'].apply(lambda x: max(x,key=itemgetter(0))))
# Preview from print statement above
(5, 0.05)
(7, 0.1)
(7, 0.3)
# Try #2: Gives `TypeError`
df['score'].apply(lambda x: ((x,itemgetter(0)) if x >= 0.20 else ''))
What I am trying to get for output:
id text probability output needed
1 test1 [(1, 0.95), (5, 0.05)] [(1, 0.95)]
2 test2 [(7, 0.1), (1, 0.2), (6, 0.7)] [(1, 0.2), (6, 0.7)]
3 test3 [(7, 0.3), (1, 0.7)] [(7, 0.3), (1, 0.7)]
You can use a pretty straightforward list comprehension to get the desired output. I'm not sure how you would use itemgetter for this:
df['score'] = df['score'].apply(lambda x: ([y for y in x if min(y) >= .2]))
df
id text score
0 1 test1 [(1, 0.95)]
1 2 test2 [(1, 0.2), (6, 0.7)]
2 3 test3 [(7, 0.3), (1, 0.7)]
If you wanted an alternative result (like an empty tuple), you can use:
df['score'] = df['score'].apply(lambda x: ([y if min(y) >= .2 else () for y in x ]))
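Since the question specifically asked about itemgetter, here is a hypothetical variant that filters on the score position explicitly with itemgetter(1) instead of min(y) (min only works above because the ids happen to be >= 1 while the scores are < 1):
from operator import itemgetter
df['score'] = df['score'].apply(lambda pairs: [p for p in pairs if itemgetter(1)(p) >= 0.20])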

how to line plot values column vs groups

I have a dataframe x2 with two columns. I am trying to plot it, but I don't get any xticks.
data:
bins pp
0 (0, 1] 0.155463
1 (1, 2] 1.528947
2 (2, 3] 2.436064
3 (3, 4] 3.507811
4 (4, 5] 4.377849
5 (5, 6] 5.538044
6 (6, 7] 6.577340
7 (7, 8] 7.510983
8 (8, 9] 8.520378
9 (9, 10] 9.721899
I tried this code; the result is fine, but the x-axis ticks are just blank. I want the bins column values to be on the x-axis.
x2.plot(x='bins',y=['pp'])
x2.dtypes
Out[141]:
bins category
pp float64
The following is to show that this problem should not occur with pandas 0.24.1 or higher.
import numpy as np
import pandas as pd
print(pd.__version__) # prints 0.24.2
import matplotlib.pyplot as plt
df = pd.DataFrame({"Age" : np.random.rayleigh(30, size=300)})
s = pd.cut(df["Age"], bins=np.arange(0,91,10)).value_counts().to_frame().sort_index().reset_index()
s.plot(x='index',y="Age")
plt.show()
results in a line plot with the bin intervals shown as the x-axis tick labels (figure omitted).
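If you are stuck on an older pandas version, one possible workaround (an assumption, not part of the original answer) is to plot against the string representation of the categorical bins so that the tick labels can be drawn:
import matplotlib.pyplot as plt
x2 = x2.copy()
x2['bins'] = x2['bins'].astype(str)  # '(0, 1]', '(1, 2]', ...
x2.plot(x='bins', y='pp')
plt.show()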

Get degree of each node in a graph by Networkx in python

Suppose I have a data set like below that shows an undirected graph:
1 2
1 3
1 4
3 5
3 6
7 8
8 9
10 11
I have a Python script like this:
for s in ActorGraph.degree():
    print(s)
which yields pairs of keys and values, where the keys are node names and the values are the degrees of the nodes:
('9', 1)
('5', 1)
('11', 1)
('8', 2)
('6', 1)
('4', 1)
('10', 1)
('7', 1)
('2', 1)
('3', 3)
('1', 3)
The networkx documentation suggests using values() to get the node degrees.
Now I would like to have just the values, which are the degrees of the nodes, but when I use this part of the script it doesn't work and says the object has no attribute 'values':
for s in ActorGraph.degree():
    print(s.values())
how can I do it?
You are using version 2.0 of networkx, which changed G.degree() from returning a dict to returning a dict-like (but not dict) DegreeView. See this guide.
To have the degrees in a list you can use a list-comprehension:
degrees = [val for (node, val) in G.degree()]
I'd like to add the following: if you're initializing the undirected graph with nx.Graph() and adding the edges afterwards, just beware that networkx doesn't guarantee that the order of the nodes will be preserved -- this also applies to degree(). This means that if you use the list comprehension approach and then try to access the degree by list index, the indexes may not correspond to the right nodes. If you'd like them to correspond, you can instead do:
degrees = [val for (node, val) in sorted(G.degree(), key=lambda pair: pair[0])]
Here's a simple example to illustrate this:
>>> edges = [(0, 1), (0, 3), (0, 5), (1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (2, 5)]
>>> g = nx.Graph()
>>> g.add_edges_from(edges)
>>> print(g.degree())
[(0, 3), (1, 4), (3, 3), (5, 2), (2, 4), (4, 2)]
>>> print([val for (node, val) in g.degree()])
[3, 4, 3, 2, 4, 2]
>>> print([val for (node, val) in sorted(g.degree(), key=lambda pair: pair[0])])
[3, 4, 4, 3, 2, 2]
You can also use a dict comprehension to get an actual dictionary:
degrees = {node:val for (node, val) in G.degree()}
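As a further shortcut (a sketch assuming networkx 2.x, where DegreeView iterates as (node, degree) pairs), you can convert the view straight to a dict, after which .values() works the way the original script expected:
import networkx as nx
G = nx.Graph([(1, 2), (1, 3), (1, 4), (3, 5), (3, 6), (7, 8), (8, 9), (10, 11)])
degrees = dict(G.degree())      # {node: degree, ...}
print(list(degrees.values()))   # just the degrees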
