Set to 0 x% of the non-zero values in a NumPy 2D array

I tried different approaches, but I can't find an efficient way to do this without looping.
The input is an array y and a fraction x.
E.g. the input is
y=np.random.binomial(1,1,[10,10])
x=0.5
and the output is
[[0 0 0 0 1 1 1 1 0 1]
[1 0 1 0 0 1 0 1 0 1]
[1 0 1 1 1 1 0 0 0 1]
[0 1 0 1 1 0 1 0 1 1]
[0 1 1 0 0 1 1 1 0 0]
[0 0 1 1 1 0 1 1 0 1]
[0 1 0 0 0 0 1 0 1 1]
[0 0 0 1 1 1 1 1 0 0]
[0 1 1 1 1 0 0 1 0 0]
[1 0 1 0 1 0 0 0 0 0]]

Here's one based on masking -
import numpy as np
def set_nonzeros_to_zeros(a, setz_ratio):
    nz_mask = a!=0  # True where a is non-zero
    nz_count = nz_mask.sum()
    z_set_count = int(np.round(setz_ratio*nz_count))  # number of non-zeros to reset
    idx = np.random.choice(nz_count,z_set_count,replace=False)
    mask0 = np.ones(nz_count,dtype=bool)
    mask0.flat[idx] = 0  # False marks the non-zeros chosen for resetting
    nz_mask[nz_mask] = mask0  # keep True only for the non-zeros that survive
    a[~nz_mask] = 0  # note: modifies a in place
    return a
We skip generating all the indices with np.argwhere/np.nonzero in favor of a masking-based approach, to focus on performance.
Sample run -
In [154]: np.random.seed(0)
...: a = np.random.randint(0,3,(5000,5000))
# number of non-0s before using solution
In [155]: (a!=0).sum()
Out[155]: 16670017
In [156]: a_out = set_nonzeros_to_zeros(a, setz_ratio=0.2) #set 20% of non-0s to 0s
# number of non-0s after using solution
In [157]: (a_out!=0).sum()
Out[157]: 13336014
# Verify
In [158]: 16670017 - 0.2*16670017
Out[158]: 13336013.6

There are a few vectorized methods that might help you, depending on what you want to do:
# Flatten the 2D array and get the indices of the non-zero elements
c = y.flatten()
d = c.nonzero()[0]
# Shuffle the indices and set the first 100x % to zero
np.random.shuffle(d)
x = 0.5
c[d[:int(x*len(d))]] = 0
# reshape to the original 2D shape
y = c.reshape(y.shape)
No doubt there are some efficiency improvements to be made here.
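One possible tidy-up, as a sketch rather than a benchmarked improvement: draw the indices to reset with a NumPy Generator instead of shuffling in place, and work on a copy so the input is left untouched. The function name and the rng parameter are made up for illustration.

```python
import numpy as np

def zero_fraction_of_nonzeros(y, x, rng=None):
    """Return a copy of y with a fraction x of its non-zero entries set to 0."""
    rng = np.random.default_rng() if rng is None else rng
    out = y.copy()                     # leave the caller's array untouched
    nz = np.flatnonzero(out)           # flat indices of the non-zero entries
    k = int(x * nz.size)               # how many of them to reset
    drop = rng.choice(nz, size=k, replace=False)
    out.flat[drop] = 0
    return out

y = np.ones((10, 10), dtype=int)
z = zero_fraction_of_nonzeros(y, 0.5, rng=np.random.default_rng(0))
print(np.count_nonzero(z))  # 50
```

Passing a seeded Generator makes the result reproducible; leaving rng as None gives a fresh random draw each call.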

Related

Does 1D or 2D array matter while fitting and prediction of a ML model?

I have developed a text classification model where X_test and X_train are 2D arrays, whereas y_test and y_train are 1D arrays. I did not encounter any error while training, fitting and predicting with my ML model, but I don't know why I am having trouble generating the ROC score. It says AxisError: axis 1 is out of bounds for array of dimension 1.
I am unable to find a solution for this, so I am curious to know whether it matters if a ML model uses 1D or 2D arrays, or whether it should be only one of them, either 1D or 2D.
Can anyone explain this?
Sample code for text classification model(to generate roc score):
from sklearn.metrics import roc_curve, roc_auc_score
r_auc = roc_auc_score(y_test, r_probs, multi_class='OVO')
I had done the following before calculating auroc;
#Prediction probabilities
r_probs = [0 for _ in range(len(y_test))]
rf_probs = RFClass.predict_proba(X_test)
dt_probs = DTClass.predict_proba(X_test)
sgdc_probs = sgdc_model.predict_proba(X_test)
#Probabilities for the positive outcome is kept.
dt_probs = dt_probs[:, 1]
sgdc_probs = sgdc_probs[:, 1]
rf_probs = rf_probs[:, 1]
y_test sample output;
Covid19 - Form
Covid19 - Phone
Covid19 - Email
Covid19 - Email
Covid19 - Phone
r_probs sample output;
[0,
0,
0,
0,
0,
...]
Here is the error;
---------------------------------------------------------------------------
AxisError Traceback (most recent call last)
/tmp/ipykernel_14270/1310904144.py in <module>
4 from sklearn.metrics import roc_curve, roc_auc_score
5
----> 6 r_auc = roc_auc_score(y_test, r_probs, multi_class='OVO')
7 #rf_auc = roc_auc_score(y_test, rf_probs, multi_class='ovr')
8 #dt_auc = roc_auc_score(y_test, dt_probs, multi_class='ovr')
packages/sklearn/metrics/_ranking.py in roc_auc_score(y_true, y_score, average, sample_weight, max_fpr, multi_class, labels)
559 if multi_class == "raise":
560 raise ValueError("multi_class must be in ('ovo', 'ovr')")
--> 561 return _multiclass_roc_auc_score(
562 y_true, y_score, labels, multi_class, average, sample_weight
563 )
There seems to be a mismatch in the shapes of your y_test and r_probs. Also, you seem to have assigned the r_probs to all zeros and never have updated them. Note that you need to have some 1's in the ground truth and predictions in order for the roc_auc_score to work.
First some background:
y_test and the predictions can both be 1-D or 2-D, depending on whether you have formulated the task as a binary, multi-class or multi-label problem. Read more under the y_true and multi_class parameters in the roc_auc_score documentation:
y_true:
True labels or binary label indicators. The binary and multiclass cases expect labels with shape (n_samples,) while the multilabel case expects binary label indicators with shape (n_samples, n_classes).
multi_class:
Only used for multiclass targets. Determines the type of configuration to use. The default value raises an error, so either 'ovr' or 'ovo' must be passed explicitly.
I'd print the shapes of y_test and r_probs just before calling roc_auc_score, just to be sure. Below are samples that work for the binary (1-D labels) and multi-label (2-D labels) cases:
binary (1-D) class labels:
import numpy as np
from sklearn.metrics import roc_auc_score
np.random.seed(42)
n = 100
y_test = np.random.randint(0, 2, (n,))
r_probs = np.random.randint(0, 2, (n,))
r_auc = roc_auc_score(y_test, r_probs)
print(f'Shape of y_test: {y_test.shape}')
print(f'Shape of r_probs: {r_probs.shape}')
print(f'y_test: {y_test}')
print(f'r_probs: {r_probs}')
print(f'r_auc: {r_auc}')
Output:
Shape of y_test: (100,)
Shape of r_probs: (100,)
y_test: [0 1 0 0 0 1 0 0 0 1 0 0 0 0 1 0 1 1 1 0 1 0 1 1 1 1 1 1 1 1 0 0 1 1 1 0 1 0 0 0 0 0 1 1 1 1 1 0 1 1 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 1 1 1 0 1 0 1 1 1 0 1 0 1 0 1 0 0 1 0 1 1 1 1 1 1 1 1 1 1 1 0]
r_probs: [0 1 1 1 1 1 1 1 1 0 1 0 1 1 0 1 0 1 1 0 1 0 1 0 0 1 1 0 1 1 1 0 0 0 0 0 0 0 0 0 1 0 1 1 1 0 0 0 0 1 0 0 0 0 0 1 0 1 0 1 0 0 1 1 1 0 1 0 0 1 1 0 0 1 1 1 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 1 1 0 0]
r_auc: 0.5073051948051948
multi-label (2-D) class labels:
y_test = np.random.randint(0, 2, (n, 4))
r_probs = np.random.randint(0, 2, (n, 4))
r_auc = roc_auc_score(y_test, r_probs, multi_class='ovr')
print(f'Shape of y_test: {y_test.shape}')
print(f'Shape of r_probs: {r_probs.shape}')
print(f'y_test: {y_test}')
print(f'r_probs: {r_probs}')
print(f'r_auc: {r_auc}')
Output:
Shape of y_test: (100, 4)
Shape of r_probs: (100, 4)
y_test: [[0 1 0 0] [1 0 1 1] ... [1 0 0 0] [0 0 1 1]]
r_probs: [[0 1 1 1] [0 0 0 1] ... [1 1 0 1] [1 1 1 0]]
r_auc: 0.5270015526313198
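For the multi-class setup in the question (three string labels), a sketch that should work is to keep y_test 1-D and pass the full predict_proba matrix of shape (n_samples, n_classes) instead of a single column or a list of zeros. The synthetic data and the LogisticRegression model below are assumptions made only for illustration; they are not the asker's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = np.array(["Covid19 - Email", "Covid19 - Form", "Covid19 - Phone"])

# Synthetic 2-D features: one cluster per class (illustrative data only)
y_idx = rng.integers(0, 3, size=300)
centers = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
X = centers[y_idx] + rng.normal(size=(300, 2))
y = labels[y_idx]

X_train, X_test = X[:200], X[200:]
y_train, y_test = y[:200], y[200:]

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
probs = clf.predict_proba(X_test)      # shape (n_samples, n_classes)

# 1-D string labels plus a 2-D probability matrix; columns follow clf.classes_
auc = roc_auc_score(y_test, probs, multi_class="ovr")
print(probs.shape, auc)
```

The columns of predict_proba follow clf.classes_, which is sorted, and that is the same label order roc_auc_score infers from y_test when labels is not given.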

Create a new large matrix by stacking in its diagonal K matrices

I have K (let K here be 7) distinct matrices of dimension (50,50).
I would like to create a new matrix L by filling its diagonal with the K matrices. Hence L is of dimension (50*K,50*K).
What have I tried?
K1=np.random.random((50,50))
N,N=K1.shape
K=7
out=np.zeros((K,N,K,N),K1.dtype)
np.einsum('ijik->ijk', out)[...] = K1
L=out.reshape(K*N, K*N) # L is of dimension (50*7,50*7)=(350,350)
It is indeed creating a new matrix L by stacking K1 seven times along its diagonal. However, I would like to stack K1,K2,K3,K4,K5,K6,K7 respectively rather than K1 seven times.
Inputs :
K1=np.random.random((50,50))
K2=np.random.random((50,50))
K3=np.random.random((50,50))
K4=np.random.random((50,50))
K5=np.random.random((50,50))
K6=np.random.random((50,50))
K7=np.random.random((50,50))
L=np.zeros((50*7,50*7))#
Expected outputs :
L[:50,:50]=K1
L[50:100,50:100]=K2
L[100:150,100:150]=K3
L[150:200,150:200]=K4
L[200:250,200:250]=K5
L[250:300,250:300]=K6
L[300:350,300:350]=K7
You could try scipy.linalg.block_diag. If you look at the source, this function basically just loops over the given blocks the way you have written as your output. It can be used like:
import numpy as np
import scipy.linalg
K1=np.random.random((50,50))
K2=np.random.random((50,50))
K3=np.random.random((50,50))
K4=np.random.random((50,50))
K5=np.random.random((50,50))
K6=np.random.random((50,50))
K7=np.random.random((50,50))
L=scipy.linalg.block_diag(K1,K2,K3,K4,K5,K6,K7)
If you have your K as an ndarray of shape (7,50,50) you can unpack it directly like:
K=np.random.random((7,50,50))
L=scipy.linalg.block_diag(*K)
If you don't want to import scipy, you can always just write a simple loop to do what you have written for the expected output.
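A minimal sketch of such a loop, assuming all blocks are 2-D and that the dtype of the first block is acceptable for the result (block_diag_loop is a made-up name):

```python
import numpy as np

def block_diag_loop(blocks):
    """Place each 2-D block on the diagonal of a zero matrix (hypothetical helper)."""
    rows = sum(b.shape[0] for b in blocks)
    cols = sum(b.shape[1] for b in blocks)
    # Assumption: the first block's dtype is fine for the whole result
    out = np.zeros((rows, cols), dtype=blocks[0].dtype)
    r = c = 0
    for b in blocks:
        out[r:r + b.shape[0], c:c + b.shape[1]] = b
        r += b.shape[0]
        c += b.shape[1]
    return out

Ks = [np.random.random((50, 50)) for _ in range(7)]
L = block_diag_loop(Ks)
print(L.shape)  # (350, 350)
```

Because the offsets are tracked per block, this also handles blocks of different shapes, not only seven 50x50 matrices.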
Here is a way to do that with NumPy:
import numpy as np
def put_in_diagonals(a):
    n, rows, cols = a.shape
    b = np.zeros((n * rows, n * cols), dtype=a.dtype)
    a2 = a.reshape(-1, cols)
    ii, jj = np.indices(a2.shape)
    jj += (ii // rows) * cols
    b[ii, jj] = a2
    return b
# Test
a = np.arange(24).reshape(4, 2, 3)
print(put_in_diagonals(a))
Output:
[[ 0 1 2 0 0 0 0 0 0 0 0 0]
[ 3 4 5 0 0 0 0 0 0 0 0 0]
[ 0 0 0 6 7 8 0 0 0 0 0 0]
[ 0 0 0 9 10 11 0 0 0 0 0 0]
[ 0 0 0 0 0 0 12 13 14 0 0 0]
[ 0 0 0 0 0 0 15 16 17 0 0 0]
[ 0 0 0 0 0 0 0 0 0 18 19 20]
[ 0 0 0 0 0 0 0 0 0 21 22 23]]

how do you replace only a certain number of items in a list randomly?

board = []
for x in range(0,8):
    board.append(["0"] * 8)
def print_board(board):
    for row in board:
        print(" ".join(row))
This code creates a grid of zeros, but I wish to replace five of them with ones and another five with twos.
Does anyone know a way to do this?
If you want to randomly set some coordinates with "1" and "2", you can do it like this:
import random
board = []
for x in range(0, 8):
    board.append(["0"] * 8)
def print_board(board):
    for row in board:
        print(" ".join(row))
def generate_coordinates(x, y, k):
    coordinates = [(i, j) for i in range(x) for j in range(y)]
    random.shuffle(coordinates)
    return coordinates[:k]
coo = generate_coordinates(8, 8, 10)
ones = coo[:5]
twos = coo[5:]
for i, j in ones:
    board[i][j] = "1"
for i, j in twos:
    board[i][j] = "2"
print_board(board)
Output
0 1 0 0 0 0 0 0
0 1 0 0 0 0 0 0
0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0
0 0 2 0 0 0 0 0
1 0 0 0 2 0 0 0
0 0 0 0 2 0 0 2
2 0 0 0 0 0 0 1
Notes:
The code above generates a random sample each time, so the output will be different on each run (to get the same output every time, call random.seed(42) first; you can change 42 to any number you want).
The function generate_coordinates receives x (the number of rows), y (the number of columns) and k (the number of coordinates to pick). It generates the list of all x*y coordinates, shuffles it and picks the first k.
In your specific case x = 8, y = 8 and k = 10 (5 for the ones and 5 for the twos).
Finally, this picks the positions for the ones and twos and changes the values:
ones = coo[:5]
twos = coo[5:]
for i, j in ones:
    board[i][j] = "1"
for i, j in twos:
    board[i][j] = "2"
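As a variation on the same idea, random.sample can draw the distinct cells directly instead of shuffling the full coordinate list. This is only a sketch; place_random and its counts argument are hypothetical names, not part of the code above.

```python
import random

def place_random(board, counts):
    """Write each symbol in counts into that many distinct random cells."""
    rows, cols = len(board), len(board[0])
    total = sum(counts.values())
    # sample() returns distinct flat cell numbers, so no cell is reused
    cells = random.sample(range(rows * cols), total)
    start = 0
    for symbol, n in counts.items():
        for cell in cells[start:start + n]:
            i, j = divmod(cell, cols)   # flat cell number -> (row, column)
            board[i][j] = symbol
        start += n
    return board

board = [["0"] * 8 for _ in range(8)]
place_random(board, {"1": 5, "2": 5})
for row in board:
    print(" ".join(row))
```

Since all ten cells come from a single sample, the ones and the twos can never overwrite each other.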

How to understand synergy in information theory?

In information theory, multivariate mutual information (MMI) can indicate synergy (negative) or redundancy (positive). To simulate these two cases, assume three variables X, Y and Z, each of which takes the value 0 or 1 (binary variables), and sample them 12 times.
Case 1:
X = [ 0 0 0 0 0 0 1 1 1 1 1 1 ]
Y = [ 0 0 0 0 1 1 0 0 1 1 1 1 ]
Z = [ 0 0 1 1 1 1 0 0 0 0 1 1 ]
In this case, we assume a mechanism among X, Y and Z such that when both Y and Z are 0 or 1, X takes 0 or 1 respectively. When Y = 0 and Z = 1, X takes 0, and when Y = 1 and Z = 0, X takes 1.
The mmi(X,Y,Z) = -0.1699 in this case, indicating a synergy effect among the three variables.
Case 2:
X = [ 0 0 0 0 0 0 1 1 1 1 1 1 ]
Y = [ 0 0 0 0 0 1 0 1 1 1 1 1 ]
Z = [ 0 1 1 1 1 1 0 0 0 0 0 1 ]
The mechanism in this case is the same as above. The difference is that there are more samples where XY take different values and fewer samples where both are 0 or 1.
The mmi(X,Y,Z) = 0.0333, indicating a redundancy.
So far, can I say that in these two cases synergy and redundancy reflect a similar mechanism (or relationship) among the three variables? And how should we understand redundancy, and particularly synergy, in real data?
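For reference, the two values quoted above can be reproduced from the samples with plug-in entropy estimates, assuming the sign convention mmi(X,Y,Z) = H(X) + H(Y) + H(Z) - H(X,Y) - H(X,Z) - H(Y,Z) + H(X,Y,Z) in bits. The helper names are made up for this sketch.

```python
from collections import Counter
from math import log2

def entropy(*columns):
    """Plug-in joint entropy (in bits) of one or more sample sequences."""
    samples = list(zip(*columns))
    n = len(samples)
    return -sum(c / n * log2(c / n) for c in Counter(samples).values())

def mmi(x, y, z):
    """Three-way interaction via inclusion-exclusion of entropies."""
    return (entropy(x) + entropy(y) + entropy(z)
            - entropy(x, y) - entropy(x, z) - entropy(y, z)
            + entropy(x, y, z))

X  = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
Y1 = [0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1]
Z1 = [0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1]
Y2 = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1]
Z2 = [0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1]

print(round(mmi(X, Y1, Z1), 4))  # -0.1699 (case 1)
print(round(mmi(X, Y2, Z2), 4))  # 0.0333 (case 2)
```

With only 12 samples these are rough estimates, so a small positive or negative value by itself says little about the underlying mechanism.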

Logical not on a scipy sparse matrix

I have a bag-of-words representation of a corpus stored in an D by W sparse matrix word_freqs. Each row is a document and each column is a word. A given element word_freqs[d,w] represents the number of occurrences of word w in document d.
I'm trying to obtain another D by W matrix not_word_occs where, for each element of word_freqs:
If word_freqs[d,w] is zero, not_word_occs[d,w] should be one.
Otherwise, not_word_occs[d,w] should be zero.
Eventually, this matrix will need to be multiplied with other matrices which might be dense or sparse.
I've tried a number of methods, including:
not_word_occs = (word_freqs == 0).astype(int)
This works for toy examples, but results in a MemoryError for my actual data (which is approx. 18,000x16,000).
I've also tried np.logical_not():
word_occs = sklearn.preprocessing.binarize(word_freqs)
not_word_occs = np.logical_not(word_freqs).astype(int)
This seemed promising, but np.logical_not() does not work on sparse matrices, giving the following error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all().
Any ideas or guidance would be appreciated.
(By the way, word_freqs is generated by sklearn's feature_extraction.text.CountVectorizer. If there's a solution that involves converting this to another kind of matrix, I'm certainly open to that.)
The complement of the nonzero positions of a sparse matrix is dense. So if you want to achieve your stated goals with standard numpy arrays you will require quite a bit of RAM. Here's a quick and totally unscientific hack to give you an idea of how many arrays of that sort your computer can handle:
>>> import numpy as np
>>> a = []
>>> for j in range(100):
...     print(j)
...     a.append(np.ones((16000, 18000), dtype=int))
My laptop chokes at j=1. So unless you have a really good computer, memory will be an issue even if you can get the complement, which you can do like this (S being the sparse matrix):
>>> compl = np.ones(S.shape,int)
>>> compl[S.nonzero()] = 0
One way out may be to not compute the complement explicitly. Let's call it C = B1 - A, where B1 is the same-shape matrix completely filled with ones and A is the adjacency matrix (the 0/1 nonzero pattern) of your original sparse matrix. For example, the matrix product XC can be written as XB1 - XA, so you have one multiplication with the sparse A and one with B1, which is actually cheap because it boils down to computing row sums of X. The point here is that you can compute that without computing C first.
A particularly simple example would be multiplication with a one-hot vector. Such a multiplication just selects a column (if multiplying from the right) or a row (if multiplying from the left) of the other matrix. That means you just need to find that column or row of the sparse matrix and take its complement (for a single slice, no problem). And if you do this for a one-hot matrix, then, as above, you needn't compute the complement explicitly.
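The identity XC = XB1 - XA can be sketched on a toy matrix as follows. The shapes and the random data are made up for illustration, and the dense C is built at the end only to check the result, which is exactly what you would avoid at full size.

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Toy stand-in for word_freqs: mostly zeros, a few positive counts
dense = rng.integers(1, 4, size=(6, 5)) * (rng.random((6, 5)) < 0.3)
S = sparse.csr_matrix(dense)
A = (S != 0).astype(np.int64)            # sparse 0/1 pattern of the non-zeros

X = rng.random((4, 6))                   # some dense left factor

# X @ B1: every column equals the row sums of X, so B1 is never formed
row_sums = X.sum(axis=1, keepdims=True)  # shape (4, 1), broadcasts over columns
XA = np.asarray((A.T @ X.T).T)           # sparse @ dense, transposed back to X @ A
XC = row_sums - XA

# Check against the explicit dense complement (toy size only)
C = 1 - A.toarray()
print(np.allclose(XC, X @ C))  # True
```

The product itself is still a dense array, but its shape is (rows of X, W) rather than D by W, so the full complement never has to sit in memory.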
Make a small sparse matrix:
In [743]: freq = sparse.random(10,10,.1)
In [744]: freq
Out[744]:
<10x10 sparse matrix of type '<class 'numpy.float64'>'
with 10 stored elements in COOrdinate format>
the repr(freq) shows the shape, elements and format.
In [745]: freq==0
/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py:213: SparseEfficiencyWarning: Comparing a sparse matrix with 0 using == is inefficient, try using != instead.
", try using != instead.", SparseEfficiencyWarning)
Out[745]:
<10x10 sparse matrix of type '<class 'numpy.bool_'>'
with 90 stored elements in Compressed Sparse Row format>
If I do your first action, I get a warning and a new array with 90 (out of 100) nonzero terms. That result is no longer sparse.
In general numpy functions do not work when applied to sparse matrices. To work they have to delegate the task to sparse methods. But even if logical_not worked it wouldn't solve the memory issue.
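Here is an example using pandas.SparseDataFrame (this class was deprecated in pandas 0.25 and removed in 1.0, so the session below needs an older pandas):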
In [42]: X = (sparse.rand(10, 10, .1) != 0).astype(np.int64)
In [43]: X = (sparse.rand(10, 10, .1) != 0).astype(np.int64)
In [44]: d1 = pd.SparseDataFrame(X.toarray(), default_fill_value=0, dtype=np.int64)
In [45]: d2 = pd.SparseDataFrame(np.ones((10,10)), default_fill_value=1, dtype=np.int64)
In [46]: d1.memory_usage()
Out[46]:
Index 80
0 16
1 0
2 8
3 16
4 0
5 0
6 16
7 16
8 8
9 0
dtype: int64
In [47]: d2.memory_usage()
Out[47]:
Index 80
0 0
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 0
dtype: int64
math:
In [48]: d2 - d1
Out[48]:
0 1 2 3 4 5 6 7 8 9
0 1 1 0 0 1 1 0 1 1 1
1 1 1 1 1 1 1 1 1 0 1
2 1 1 1 1 1 1 1 1 1 1
3 1 1 1 1 1 1 1 0 1 1
4 1 1 1 1 1 1 1 1 1 1
5 0 1 1 1 1 1 1 1 1 1
6 1 1 1 1 1 1 1 1 1 1
7 0 1 1 0 1 1 1 0 1 1
8 1 1 1 1 1 1 0 1 1 1
9 1 1 1 1 1 1 1 1 1 1
source sparse matrix:
In [49]: d1
Out[49]:
0 1 2 3 4 5 6 7 8 9
0 0 0 1 1 0 0 1 0 0 0
1 0 0 0 0 0 0 0 0 1 0
2 0 0 0 0 0 0 0 0 0 0
3 0 0 0 0 0 0 0 1 0 0
4 0 0 0 0 0 0 0 0 0 0
5 1 0 0 0 0 0 0 0 0 0
6 0 0 0 0 0 0 0 0 0 0
7 1 0 0 1 0 0 0 1 0 0
8 0 0 0 0 0 0 1 0 0 0
9 0 0 0 0 0 0 0 0 0 0
memory usage:
In [50]: (d2 - d1).memory_usage()
Out[50]:
Index 80
0 16
1 0
2 8
3 16
4 0
5 0
6 16
7 16
8 8
9 0
dtype: int64
PS if you can't build the whole SparseDataFrame at once (because of memory constraints), you can use an approach similar to one used in this answer
