Logical not on a scipy sparse matrix - python-3.x

I have a bag-of-words representation of a corpus stored in an D by W sparse matrix word_freqs. Each row is a document and each column is a word. A given element word_freqs[d,w] represents the number of occurrences of word w in document d.
I'm trying to obtain another D by W matrix not_word_occs where, for each element of word_freqs:
If word_freqs[d,w] is zero, not_word_occs[d,w] should be one.
Otherwise, not_word_occs[d,w] should be zero.
Eventually, this matrix will need to be multiplied with other matrices which might be dense or sparse.
I've tried a number of methods, including:
not_word_occs = (word_freqs == 0).astype(int)
This words for toy examples, but results in a MemoryError for my actual data (which is approx. 18,000x16,000).
I've also tried np.logical_not():
word_occs = sklearn.preprocessing.binarize(word_freqs)
not_word_occs = np.logical_not(word_freqs).astype(int)
This seemed promising, but np.logical_not() does not work on sparse matrices, giving the following error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
Any ideas or guidance would be appreciated.
(By the way, word_freqs is generated by sklearn's preprocessing.CountVectorizer(). If there's a solution that involves converting this to another kind of matrix, I'm certainly open to that.)

The complement of the nonzero positions of a sparse matrix is dense. So if you want to achieve your stated goals with standard numpy arrays you will require quite a bit of RAM. Here's a quick and totally unscientific hack to give you an idea, how many arrays of that sort your computer can handle:
>>> import numpy as np
>>> a = []
>>> for j in range(100):
... print(j)
... a.append(np.ones((16000, 18000), dtype=int))
My laptop chokes at j=1. So unless you have a really good computer even if you can get the complement (you can do
>>> compl = np.ones(S.shape,int)
>>> compl[S.nonzero()] = 0
) memory will be an issue.
One way out may be to not explicitly compute the complement let's call it C = B1 - A, where B1 is the same-shape matrix completely filled with ones and A the adjacency matrix of your original sparse matrix. For example the matrix product XC can be written as XB1 - XA so you have one multiplication with the sparse A and one with B1 which is actually cheap because it boils down to computing row sums. The point here is that you can compute that without computing C first.
A particularly simple example would be multiplication with a one-hot vector. Such a multiplication just selects a column (if multiplying from the right) or row (if multiplying from the left) of the other matrix. Meaning you just need to find that column or row of the sparse matrix and take the complement (for a single slice no problem) and if you do this for a one-hot matrix, as above you needn't compute the complement explicitly.

Make a small sparse matrix:
In [743]: freq = sparse.random(10,10,.1)
In [744]: freq
Out[744]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 10 stored elements in COOrdinate format>
the repr(freq) shows the shape, elements and format.
In [745]: freq==0
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:213: SparseEfficiencyWarning: Comparing a sparse matrix with 0 using == is inefficient, try using != instead.
", try using != instead.", SparseEfficiencyWarning)
Out[745]:
<10x10 sparse matrix of type '<class 'numpy.bool_'>'
with 90 stored elements in Compressed Sparse Row format>
If do your first action, I get a warning and new array with 90 (out of 100) nonzero terms. That not is no longer sparse.
In general numpy functions do not work when applied to sparse matrices. To work they have to delegate the task to sparse methods. But even if logical_not worked it wouldn't solve the memory issue.

Here is an example of using Pandas.SparseDataFrame:
In [42]: X = (sparse.rand(10, 10, .1) != 0).astype(np.int64)
In [43]: X = (sparse.rand(10, 10, .1) != 0).astype(np.int64)
In [44]: d1 = pd.SparseDataFrame(X.toarray(), default_fill_value=0, dtype=np.int64)
In [45]: d2 = pd.SparseDataFrame(np.ones((10,10)), default_fill_value=1, dtype=np.int64)
In [46]: d1.memory_usage()
Out[46]:
Index 80
0 16
1 0
2 8
3 16
4 0
5 0
6 16
7 16
8 8
9 0
dtype: int64
In [47]: d2.memory_usage()
Out[47]:
Index 80
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
dtype: int64
math:
In [48]: d2 - d1
Out[48]:
0 1 2 3 4 5 6 7 8 9
0 1 1 0 0 1 1 0 1 1 1
1 1 1 1 1 1 1 1 1 0 1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1 0 1 1
4 1 1 1 1 1 1 1 1 1 1
5 0 1 1 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1 1 1 1
7 0 1 1 0 1 1 1 0 1 1
8 1 1 1 1 1 1 0 1 1 1
9 1 1 1 1 1 1 1 1 1 1
source sparse matrix:
In [49]: d1
Out[49]:
0 1 2 3 4 5 6 7 8 9
0 0 0 1 1 0 0 1 0 0 0
1 0 0 0 0 0 0 0 0 1 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 1 0 0
4 0 0 0 0 0 0 0 0 0 0
5 1 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0
7 1 0 0 1 0 0 0 1 0 0
8 0 0 0 0 0 0 1 0 0 0
9 0 0 0 0 0 0 0 0 0 0
memory usage:
In [50]: (d2 - d1).memory_usage()
Out[50]:
Index 80
0 16
1 0
2 8
3 16
4 0
5 0
6 16
7 16
8 8
9 0
dtype: int64
PS if you can't build the whole SparseDataFrame at once (because of memory constraints), you can use an approach similar to one used in this answer

Related

How to create new columns according to a label in another columns in pandas

I have a dataframe with 2 columns like:
pd.DataFrame({"action":['asdasd','fgdgddg','dfgdgdfg','dfgdgdfg','nmwaws'],"classification":['positive','negative','neutral','positive','mixed']})
df:
action classification
asdasd positive
fgdgddg negative
dfgdgdfg neutral
sdfsdff positive
nmwaws mixed
What I want to do is to create 4 new columns for each of the unique labels in the columns classification and assign 1 or 0 if the row has or not that label. Like the out put below:
And I need this as outuput:
action classification positive negative neutral mixed
asdasd positive 1 0 0 0
fgdgddg negative 0 1 0 0
dfgdgdfg neutral 0 0 1 0
sdfsdff positive 1 0 0 0
nmwaws mixed 0 0 0 1
I tried the multilabel Binarizer from sklearn but it parsed all letters of each word, not the word.
Cananyone help me?
You can use pandas.get_dummies.
pd.get_dummies(df["classification"])
Output:
mixed negative neutral positive
0 0 0 0 1
1 0 1 0 0
2 0 0 1 0
3 0 0 0 1
4 1 0 0 0
If you want to concat it to the DataFrame:
pd.concat([df, pd.get_dummies(df["classification"])], axis=1)
Output:
action classification mixed negative neutral positive
0 asdasd positive 0 0 0 1
1 fgdgddg negative 0 1 0 0
2 dfgdgdfg neutral 0 0 1 0
3 dfgdgdfg positive 0 0 0 1
4 nmwaws mixed 1 0 0 0

Find minimum operations that make the array 0

Consider an array a of positive integers 1 <= a[i] <= 10^9
We can perform an operation on this array which involves taking any integer x and subtracting or adding it to a subarray of a
The goal is to make the entire array 0
Some examples:
2 1 2 1 => 1 0 1 0 => 0 0 1 0 => 0 0 0 0
This requires 3 operations. First subtract 1 from all the elements (subarray=a[0..3]), then individually subtract 1 from single element subarrays (a[0..0], a[2..2])
Alternative way could've been:
2 1 2 1 => 2 1 1 0 => 2 0 0 0 => 0 0 0 0
=> 1 0 0 0 => 0 0 0 0
Example 2:
4 6 2 => 0 2 2 => 0 0 0
Example 3:
10 3 2 9 => 10 10 9 9 => 0 0 9 9 => 0 0 0 0
I searched a lot, not able to find the exact problem statement. The solution that I came across though (https://stackoverflow.com/a/68789827/4014182) talks about a divide and conquer solution wherein you realize that every time there is a 0 in the array, you can "split" the array. But how to get that 0 in the first place is beyond my understanding.

Writing Function on Data Frame in Pandas

I have data in excel which have two columns 'Peak Value' & 'Label'. I want to add value in 'Label' column based on 'Peak Value' column.
So, Input looks like below
Peak Value 0 0 0 88 0 0 88 0 0 88 0
Label 0 0 0 0 0 0 0 0 0 0 0
Input
Whenever the value in 'Peak Value' is greater than zero then it add 1 in 'Label' and replace all the zeros below it. For the next value greater than zero it should get incremented to 2 and replace all the zeros by 2.
So, the output will look like this:
Peak Value 0 0 0 88 0 0 88 0 0 88 0
Label 0 0 0 1 1 1 2 2 2 3 3
Output
and so on....
I tried writing function but I am only able to add 1 when the value is greater than 0 in 'Peak Value'.
def funct(row):
if row['Peak Value']>0:
val = 1
else:
val = 0
return val
df['Label']= df.apply(funct, axis=1)
May be you could try using cumsum and ffill:
import numpy as np
df['Labels'] = (df['Peak Value'] > 0).groupby(df['Peak Value']).cumsum()
df['Labels'] = df['Labels'].replace(0, np.nan).ffill().replace(np.nan, 0).astype(int)
Output:
Peak Value Labels
0 0 0
1 0 0
2 0 0
3 88 1
4 0 1
5 0 1
6 88 2
7 0 2
8 0 2
9 88 3
10 0 3

Python3.x, Pandas: creating a list of y values depending on the x values

I have a two data sets that are composed of different x values. It looks like the following.
import pandas as pd
data1=pd.csv_read('Data1.csv')
data2=pd.csv_read('Data2.csv')
print(data1)
data1_x data1_y1 data1_y2 data1_y3
-347.2498 0 2 8
-237.528509 0 3 7
-127.807218 0 0 6
-18.085927 11 5 0
print(data2)
data2_x data2_y1 data2_y2 data2_y3
-394.798507 2 0 0
-285.265994 1 0 0
-175.733482 0 0 1
-66.200969 4 0 0
I am creating new x that includes all the values by using the following code. new_x=reduce(np.union1d, (data1.iloc[:,0], data1.iloc[:,0]))
print(new_x)
array([-394.799,-347.25,-285.266,-237.529,-175.733,-127.807,-66.201,-18.0859])
Currently, I am trying to create a new y lists for each data set that keeps the same y values if the corresponding x values are present but fills with blank if there is no corresponding x value initially.
For instance, print(New_data2) would look something like this.
New_x_data2 New_y1_data2 New_y2_data2 New_y3_data2
-394.799 2 0 0
-347.25
-285.266 1 0 0
-237.529
-175.733 0 0 1
-127.807 0 0 6
-66.201 4 0 0
-18.0859 11 5 0
Especially, I am lost in figuring out how to get the new y value. Any ideas?
import pandas as pd
from re import sub
repl = lambda x : sub("data\d_(\w+)", "New_\\1_data2", x)
data1.rename(repl, axis = 'columns').append(data2.rename(repl, axis='columns')).sort_values('New_x_data2')
Out[1024]:
New_x_data2 New_y1_data2 New_y2_data2 New_y3_data2
0 -394.798507 2 0 0
0 -347.249800 0 2 8
1 -285.265994 1 0 0
1 -237.528509 0 3 7
2 -175.733482 0 0 1
2 -127.807218 0 0 6
3 -66.200969 4 0 0
3 -18.085927 11 5 0

Combining pairs in a string (Matlab)

I have a string:
sup_pairs = 'BA CE DF EF AE FC GD DA CG EA AB BG'
How can I combine pairs which have the last character of 1 pair is the first character of the follow pairs into strings? And the new strings must contain all of the character 'A','B','C','D','E','F' , 'G', those characters are appeared in the sup_pairs string.
The expected output should be:
S1 = 'BAEFCGD' % because BA will be followed by AE in sup_pairs string, so we combine BAE, and so on...we continue the rule to generate S1
S2 = 'DFCEABG'
If I have AB, BC and BD, the generated strings should be both : ABC and ABD .
If there is any repeated character in the pairs like : AB BC CA CE . We will skip the second A , and we get ABCE .
This, like all good things in life, is a graph problem. Each letter is a node, and each pair is an edge.
First we must transform your string of pairs into a numeric format so we can use the letters as subscripts. I will use A=2, B=3, ..., G=8:
sup_pairs = 'BA CE DF EF AE FC GD DA CG EA AB BG';
p=strsplit(sup_pairs,' ');
m=cell2mat(p(:));
m=m-'?';
A=sparse(m(:,1),m(:,2),1);
The sparse matrix A is now the adjacency matrix (actually, more like an adjacency list) representing our pairs. If you look at the full matrix of A, it looks like this:
>> full(A)
ans =
0 0 0 0 0 0 0 0
0 0 1 0 0 1 0 0
0 1 0 0 0 0 0 1
0 0 0 0 0 1 0 1
0 1 0 0 0 0 1 0
0 1 0 0 0 0 1 0
0 0 0 1 0 0 0 0
0 0 0 0 1 0 0 0
As you can see, the edge BA, which translates to subscript (3,2) is equal to 1.
Now you can use your favorite implementation of Depth-first Search (DFS) to perform a traversal of the graph from your starting node of choice. Each path from the root to a leaf node represents a valid string. You then transform the path back into your letter sequence:
treepath=[3,2,6,7,4,8,5];
S1=char(treepath+'?');
Output:
S1 = BAEFCGD
Here's a recursive implementation of DFS to get you going. Normally in MATLAB you have to worry about not hitting the default limitation on recursion depth, but you're finding Hamiltonian paths here, which is NP-complete. If you ever get anywhere near the recursion limit, the computation time will be so huge that increasing the depth will be the least of your worries.
function full_paths = dft_all(A, current_path)
% A - adjacency matrix of graph
% current_path - initially just the start node (root)
% full_paths - cell array containing all paths from initial root to a leaf
n = size(A, 1); % number of nodes in graph
full_paths = cell(1,0); % return cell array
unvisited_mask = ones(1, n);
unvisited_mask(current_path) = 0; % mask off already visited nodes (path)
% multiply mask by array of nodes accessible from last node in path
unvisited_nodes = find(A(current_path(end), :) .* unvisited_mask);
% add restriction on length of paths to keep (numel == n)
if isempty(unvisited_nodes) && (numel(current_path) == n)
full_paths = {current_path}; % we've found a leaf node
return;
end
% otherwise, still more nodes to search
for node = unvisited_nodes
new_path = dft_all(A, [current_path node]); % add new node and search
if ~isempty(new_path) % if this produces a new path...
full_paths = {full_paths{1,:}, new_path{1,:}}; % add it to output
end
end
end
This is a normal Depth-first traversal except for the added condition on the length of the path in line 15:
if isempty(unvisited_nodes) && (numel(current_path) == n)
The first half of the if condition, isempty(unvisited_nodes) is standard. If you only use this part of the condition you'll get all paths from the start node to a leaf, regardless of path length. (Hence the cell array output.) The second half, (numel(current_path) == n) enforces the length of the path.
I took a shortcut here because n is the number of nodes in the adjacency matrix, which in the sample case is 8 rather than 7, the number of characters in your alphabet. But there are no edges into or out of node 1 because I was apparently planning on using a trick that I never got around to telling you about. Rather than run DFS starting from each of the nodes to get all of the paths, you can make a dummy node (in this case node 1) and create an edge from it to all of the other real nodes. Then you just call DFS once on node 1 and you get all the paths. Here's the updated adjacency matrix:
A =
0 1 1 1 1 1 1 1
0 0 1 0 0 1 0 0
0 1 0 0 0 0 0 1
0 0 0 0 0 1 0 1
0 1 0 0 0 0 1 0
0 1 0 0 0 0 1 0
0 0 0 1 0 0 0 0
0 0 0 0 1 0 0 0
If you don't want to use this trick, you can change the condition to n-1, or change the adjacency matrix not to include node 1. Note that if you do leave node 1 in, you need to remove it from the resulting paths.
Here's the output of the function using the updated matrix:
>> dft_all(A, 1)
ans =
{
[1,1] =
1 2 3 8 5 7 4 6
[1,2] =
1 3 2 6 7 4 8 5
[1,3] =
1 3 8 5 2 6 7 4
[1,4] =
1 3 8 5 7 4 6 2
[1,5] =
1 4 6 2 3 8 5 7
[1,6] =
1 5 7 4 6 2 3 8
[1,7] =
1 6 2 3 8 5 7 4
[1,8] =
1 6 7 4 8 5 2 3
[1,9] =
1 7 4 6 2 3 8 5
[1,10] =
1 8 5 7 4 6 2 3
}

Resources