I want to recode the values in my label array so that the labels 0, 1, 2 correspond to the center values 1.00162877, 0.74014188, 1.16120161.
import numpy as np
label=np.array([0, 2, 1, 1, 2, 1, 0, 0, 1, 2])
center = np.array([[1.00162877],
                   [0.74014188],
                   [1.16120161]])
Using np.where does not overwrite all the values in a single pass; each iteration returns a separate array in which only one label value has been replaced.
for i in range(len(center)):
    result = np.where(label == [i], center[i], label)
    print(result)
[1.00162877 2. 1. 1. 2. 1. 1.00162877 1.00162877 1. 2.]
[0. 2. 0.74014188 0.74014188 2. 0.74014188 0. 0. 0.74014188 2.]
[0. 1.16120161 1. 1. 1.16120161 1. 0. 0. 1. 1.16120161]
How can I modify the np.where call, or use any other function, so that the outcome looks like this?
Expected = [1.00162877, 1.16120161, 0.74014188, 0.74014188, 1.16120161, 0.74014188,
            1.00162877, 1.00162877, 0.74014188, 1.16120161]
This is not a loop, but I think it works:
center[label].ravel()
Output:
array([1.00162877, 1.16120161, 0.74014188, 0.74014188, 1.16120161,
0.74014188, 1.00162877, 1.00162877, 0.74014188, 1.16120161])
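The mechanics: center[label] uses the integer labels as row indices into center (fancy indexing), and .ravel() flattens the resulting (10, 1) array. An equivalent sketch that flattens first:

flat_center = center.ravel()    # shape (3,): one center value per label
result = flat_center[label]     # fancy indexing: each label picks its center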
The following code is a minimal example of my original code:
batch_size = 10
target_q = np.ones((10, 1))
actions = np.ones((10, ), dtype=int)
batch_index = np.arange(batch_size, dtype=np.int32)
print(target_q[batch_index, actions])
print(target_q.shape)
I get the following error
IndexError: index 1 is out of bounds for axis 1 with size 1.
Can someone please explain what this means and how to rectify it?
Thanks in advance.
In NumPy you can index an array of size N with indices up to N-1 (along a given axis); otherwise you get the IndexError you are seeing. To check how high you can go with an index, print target_q.shape. In your case it will tell you (10, 1), which means that if you index target_q[i, j], then i can be at most 9 and j can be at most 0.
What your line target_q[batch_index, actions] does is use actions as so-called fancy indexing in the second position (j), and actions is full of ones. Thus you are repeatedly trying to index with 1, whereas the highest allowed index value is 0.
What would work would be:
import numpy as np
batch_size = 10
target_q = np.ones((10, 1))
# changed to zeros below
actions = np.zeros((10, ), dtype=int)
batch_index = np.arange(batch_size, dtype=np.int32)
print(actions)
print(target_q.shape)
print(target_q[batch_index, 0])
print(target_q[batch_index, actions])
that prints:
[0 0 0 0 0 0 0 0 0 0]
(10, 1)
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
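If your real target_q is meant to have one column per action, the original fancy indexing works as intended. A sketch under that assumption (n_actions = 4 is made up for illustration):

import numpy as np

batch_size = 10
n_actions = 4                                  # assumed; not from the question
target_q = np.ones((batch_size, n_actions))
actions = np.ones((batch_size,), dtype=int)    # index 1 is now within bounds
batch_index = np.arange(batch_size, dtype=np.int32)
print(target_q[batch_index, actions])          # one Q-value per sample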
For whatever reason this only returns 0 or 1 instead of floats between them.
from sklearn import preprocessing
X = [[1.3, 1.6, 1.4, 1.45, 12.3, 63.01],
     [1.9, 0.01, 4.3, 45.4, 3.01, 63.01]]
minmaxscaler = preprocessing.MinMaxScaler()
X_scale = minmaxscaler.fit_transform(X)
print(X_scale) # returns [[0. 1. 0. 0. 1. 0.] [1. 0. 1. 1. 0. 0.]]
MinMaxScaler scales column-wise on 2D input such as a NumPy array (or a DataFrame). Your list of lists is treated as 2 samples with 6 features each, which I guess is not what you mean, so you also need to reshape:
import numpy
X = numpy.array([[1.3, 1.6, 1.4, 1.45, 12.3, 63.01],
                 [1.9, 0.01, 4.3, 45.4, 3.01, 63.01]]).reshape(-1, 1)
Results after MinMaxScaler:
[[0.02047619]
[0.0252381 ]
[0.02206349]
[0.02285714]
[0.19507937]
[1. ]
[0.03 ]
[0. ]
[0.06809524]
[0.72047619]
[0.04761905]
[1. ]]
I am not exactly sure whether you want to min-max each list separately or all together.
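A sketch of both options, assuming the NumPy array form of X from above (before the reshape):

import numpy
from sklearn import preprocessing

X = numpy.array([[1.3, 1.6, 1.4, 1.45, 12.3, 63.01],
                 [1.9, 0.01, 4.3, 45.4, 3.01, 63.01]])
scaler = preprocessing.MinMaxScaler()

# All 12 values scaled together, as a single column, then reshaped back
all_together = scaler.fit_transform(X.reshape(-1, 1)).reshape(X.shape)

# Each list scaled separately over its own min and max
per_list = numpy.vstack([scaler.fit_transform(row.reshape(-1, 1)).ravel()
                         for row in X])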
The answer you got from MinMaxScaler is the expected answer.
When you have only two data points, you will get only 0s and 1s. See the example here for a three-data-point scenario.
You need to understand that it converts the lowest value in each column to 0 and the highest to 1. When you have more data points, the remaining ones are calculated from the range (max - min); see the formula here.
Also, MinMaxScaler accepts 2D data, which means a list of lists is acceptable. That is why you did not get any error.
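For reference, a hand-written sketch of that per-column formula (X is the 2x6 data from the question; the zero-range guard mirrors how a constant column ends up as 0):

import numpy as np

X = np.array([[1.3, 1.6, 1.4, 1.45, 12.3, 63.01],
              [1.9, 0.01, 4.3, 45.4, 3.01, 63.01]])

col_min = X.min(axis=0)
col_range = X.max(axis=0) - col_min
col_range[col_range == 0] = 1.0   # constant columns (like the last one) map to 0

X_scale = (X - col_min) / col_range
print(X_scale)   # [[0. 1. 0. 0. 1. 0.] [1. 0. 1. 1. 0. 0.]]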
I have two lists:
list_1 = [['flavor', 'flavors', 'fruity_flavor', 'taste'],
          ['scent', 'scents', 'aroma', 'smell', 'odor'],
          ['mental_illness', 'mental_disorders', 'bipolar_disorder'],
          ['romance', 'romances', 'romantic', 'budding_romance']]

list_2 = [['love', 'eating', 'spicy', 'hand', 'pulled', 'noodles'],
          ['also', 'like', 'buy', 'perfumes'],
          ['suffer', 'from', 'clinical', 'depression'],
          ['really', 'love', 'my', 'wife']]
I would like to compute the cosine similarity between the two lists above, so that the cosine similarity between the first sub-list in list_1 and all sub-lists of list_2 is measured, then the same for the second sub-list in list_1 and all sub-lists in list_2, and so on.
The goal is to create a len(list_2) by len(list_1) matrix, and each entry in that matrix is a cosine similarity score. Currently I've done this the following way:
import gensim
import numpy as np
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('./data/GoogleNews-vectors-negative300.bin.gz', binary=True)

similarity_mat = np.zeros([len(list_2), len(list_1)])
for i, L2 in enumerate(list_2):
    for j, L1 in enumerate(list_1):
        similarity_mat[i, j] = model.n_similarity(L2, L1)
However, I'd like to implement this with matrix multiplication and no for loops.
My two questions are:
Is there a way to do some sort of element-wise matrix multiplication, but with gensim's n_similarity() method, to generate the required matrix?
Would the current method or matrix multiplication be more efficient and faster?
I hope my question was clear enough, please let me know if I can clarify even further.
Here's an approach, but it's not clear from the question whether you understand the underlying mechanics of the calculation, which might be causing the block.
I've changed the input strings to give more exact word matches, and given the two strings different dimensions to make it a bit clearer:
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

list_1 = [['flavor', 'flavors', 'fruity_flavor', 'taste'],
          ['scent', 'my', 'aroma', 'smell', 'odor'],
          ['mental_illness', 'mental_disorders', 'bipolar_disorder'],
          ['romance', 'romances', 'romantic', 'budding_romance']]

list_2 = [['love', 'eating', 'spicy', 'hand', 'pulled', 'noodles'],
          ['also', 'like', 'buy', 'perfumes'],
          ['suffer', 'from', 'clinical', 'depression'],
          ['really', 'love', 'my', 'wife'],
          ['flavor', 'taste', 'romantic', 'aroma', 'what']]

cnt = CountVectorizer()

# Combine each sub-list into a single string, and join everything into a corpus
combined_lists = ([' '.join(item) for item in list_1] +
                  [' '.join(item) for item in list_2])

count_matrix = cnt.fit_transform(combined_lists).toarray()

# Split the counts again into list_1 and list_2 word counts
count_matrix_1 = count_matrix[:len(list_1), :]
count_matrix_2 = count_matrix[len(list_1):, :]

match_matrix = np.matmul(count_matrix_1, count_matrix_2.T)
Output of match_matrix:
array([[0, 0, 0, 0, 2],
[0, 0, 0, 1, 1],
[0, 0, 0, 0, 0],
[0, 0, 0, 0, 1]], dtype=int64)
You can see that the 1st string in list_1 has 2 matches with the 5th string in list_2, and so on.
So the first part of the calculation (the dot product) has been calculated. Now we need the magnitudes:
magnitudes = np.array([np.linalg.norm(count_matrix[i, :])
                       for i in range(len(count_matrix))])
Now we can use matrix multiplication to turn that into a matrix of divisors (we need to reshape magnitudes into n x 1 and 1 x n matrices for this to produce an n x n matrix):

divisor_matrix = np.matmul(magnitudes.reshape(len(magnitudes), 1),
                           magnitudes.reshape(1, len(magnitudes)))
Now since we didn't compare every single sublist, but only the list_1 with the list_2 sublists, we need to take a subsection of this divisor matrix to get the right magnitudes:
divisor_matrix = divisor_matrix[:len(list_1), len(list_1):]
Output:
array([[4.89897949, 4. , 4. , 4. , 4.47213595],
[5.47722558, 4.47213595, 4.47213595, 4.47213595, 5. ],
[4.24264069, 3.46410162, 3.46410162, 3.46410162, 3.87298335],
[4.89897949, 4. , 4. , 4. , 4.47213595]])
Now we can calculate the final matrix of cosine similarity scores:
cos_sim = match_matrix / divisor_matrix
Output:
array([[0. , 0. , 0. , 0. , 0.4472136],
[0. , 0. , 0. , 0.2236068, 0.2 ],
[0. , 0. , 0. , 0. , 0. ],
[0. , 0. , 0. , 0. , 0.2236068]])
Note these scores differ from the example given, since in the example every cosine similarity score would be 0.
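For comparison, scikit-learn can collapse the dot-product and magnitude steps into a single call on the count matrices built above (a sketch):

from sklearn.metrics.pairwise import cosine_similarity

cos_sim = cosine_similarity(count_matrix_1, count_matrix_2)
# same len(list_1) x len(list_2) matrix of cosine scores as above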
Both of your questions concern the last two lines of this code, the double loop that computes the similarity matrix:
import gensim
import numpy as np
from gensim.models import KeyedVectors

model = KeyedVectors.load_word2vec_format('/root/input/GoogleNews-vectors-negative300.bin.gz', binary=True)

similarity_mat = np.zeros([len(list_2), len(list_1)])
for i, L2 in enumerate(list_2):
    for j, L1 in enumerate(list_1):
        similarity_mat[i, j] = model.n_similarity(L2, L1)
Answers to your questions:
1. You are already using a direct function to calculate the similarity between two sentences (L1 and L2): each is first converted to a vector, and then the cosine similarity of those two vectors is calculated. Everything is already done inside n_similarity(), so you can't do any kind of matrix multiplication there.
If you want to do your own matrix multiplication, then instead of using n_similarity() directly, compute the sentence vectors yourself and apply matrix multiplication while calculating the cosine similarity; see the sketch below.
2. As said in (1), everything is done inside n_similarity(), and the authors of gensim take care of efficiency when writing the library, so any other multiplication method will most likely not make a difference.
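For illustration, a minimal sketch of point 1. It assumes model is the KeyedVectors instance loaded above and that every word in both lists is in the model's vocabulary (a missing word would raise a KeyError); n_similarity() takes the cosine between the mean word vectors of the two word sets, which is what this reproduces:

import numpy as np

def mean_vectors(word_lists, model):
    # One row per sub-list: the mean of its word vectors
    return np.vstack([np.mean([model[w] for w in words], axis=0)
                      for words in word_lists])

M1 = mean_vectors(list_1, model)   # shape: (len(list_1), vector_dim)
M2 = mean_vectors(list_2, model)   # shape: (len(list_2), vector_dim)

# Normalize each row to unit length so a dot product is a cosine similarity
M1 /= np.linalg.norm(M1, axis=1, keepdims=True)
M2 /= np.linalg.norm(M2, axis=1, keepdims=True)

similarity_mat = M2 @ M1.T         # shape: (len(list_2), len(list_1))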
I am working on semantic segmentation using CNNs. I have an imbalanced number of pixels per class.
Based on this link, I am trying to create the weight matrix H in order to define an Infogain loss layer for my imbalanced classes.
My data has five classes. I wrote the following code in Python:
Read a sample image:

im = imread(sample_img_path)   # imread from e.g. scipy.misc or skimage.io

Count the number of pixels in each class:
cl0=np.count_nonzero(im == 0) #0=background class
.
.
cl4=np.count_nonzero(im == 4) #4=class 4
output:
39817 13751 1091 10460 417
#Inverse class weights
#FORMULA=(total number of samples)/((number of classes)*(number of samples in class i))
no_classes = 5
sum_ = cl0 + cl1 + cl2 + cl3 + cl4   # total number of pixels in the image
w0 = round(sum_/(no_classes*cl0), 3)
w1 = round(sum_/(no_classes*cl1), 3)
w2 = round(sum_/(no_classes*cl2), 3)
w3 = round(sum_/(no_classes*cl3), 3)
w4 = round(sum_/(no_classes*cl4), 3)
print w0,w1,w2,w3,w4
L_1=[w0,w1,w2,w3,w4]
#weighting based on the number of pixel
print L_1
L=[round(i/sum(L_1),2) for i in L_1] #normalizing the weights
print L
print sum(L)
#creating the H matrix
H=np.eye(5)
print H
#H = np.eye( L, dtype = 'f4' )
d=np.diag_indices_from(H)
H[d]=L
print H
blob = caffe.io.array_to_blobproto(H.reshape((1,1,L,L)))
with open('infogainH.binaryproto', 'wb') as f:
    f.write(blob.SerializeToString())
print f
The output, after removing some unimportant lines, is as follows:
(256, 256)
39817 13751 1091 10460 417
0.329 0.953 12.014 1.253 31.432
<type 'list'>
[0.329, 0.953, 12.014, 1.253, 31.432]
[0.01, 0.02, 0.26, 0.03, 0.68]
1.0
[[ 1. 0. 0. 0. 0.]
[ 0. 1. 0. 0. 0.]
[ 0. 0. 1. 0. 0.]
[ 0. 0. 0. 1. 0.]
[ 0. 0. 0. 0. 1.]]
[[ 0.01 0. 0. 0. 0. ]
[ 0. 0.02 0. 0. 0. ]
[ 0. 0. 0.26 0. 0. ]
[ 0. 0. 0. 0.03 0. ]
[ 0. 0. 0. 0. 0.68]]
Traceback (most recent call last):
File "create_class_prob.py", line 59, in <module>
blob = caffe.io.array_to_blobproto(H.reshape((1,1,L,L)))
TypeError: an integer is required
As can be seen, it gives an error. My question has two parts:
How can I solve this error?
I replaced L with 5 as follows:
blob = caffe.io.array_to_blobproto(H.reshape((1,1,5,5)))
Now it does not give an error, and the last line shows this:
<closed file 'infogainH.binaryproto', mode 'wb' at 0x7f94b5775b70>
It created the file infogainH.binaryproto. Is this correct?
Should this matrix H be constant for all the images in the database?
I really appreciate any help.
Thanks
You have a simple "copy-paste" bug. You copied your code from this answer, where L was an integer representing the number of classes. In your code, on the other hand, L is a list with the class weights. Replacing L with 5 in your code does indeed solve the problem.
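If you prefer not to hard-code the 5, a minimal tweak using the names from your code:

n_classes = len(L)   # the weight list has one entry per class
blob = caffe.io.array_to_blobproto(H.reshape((1, 1, n_classes, n_classes)))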
Should H be constant? This is really up to you to decide.
BTW, AFAIK, the current caffe version does not support pixel-wise infogain loss; you might need to use the code in PR #3855.
I can't seem to find out how to vectorize this Python 3 loop:
import numpy as np
a = np.array([-72, -10, -70, 37, 68, 9, 1, -3, 2, 3, -6, -4, ], np.int16)
result = np.array([-72, -10, -111, -23, 1, -2, 1, -3, 1, 2, -5, -5, ], np.int16)
b = np.copy(a)
for i in range(2, len(b)):
    b[i] += int((b[i-1] + b[i-2]) / 2)

assert (b == result).all()
I tried playing with np.convolve and pandas.rolling_apply but couldn't get it working. Maybe this is the time to learn about c-extensions?
It would be great to get the time for this down to something like 50..100ms for input arrays of ~500k elements.
@hpaulj asked in his answer for a closed expression of b[k] in terms of a[:k]. I didn't think it existed, but I worked on it a bit and indeed found that the closed form contains a bunch of Jacobsthal numbers, as @Divakar pointed out.
Here is one closed form (for the real-valued recurrence, before the integer truncation):

b[k] = a[k] + (J_{k-1} / 2^{k-1}) * a[0] + (J_k / 2^{k-1}) * a[1] + sum_{j=2}^{k-1} (J_{k-j+1} / 2^{k-j}) * a[j]

J_n here is the Jacobsthal number; expanding it like this:

J_n = (2^n - (-1)^n) / 3

one ends up with an expression that I can imagine has a vectorized implementation ...
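A small sketch checking that closed form against the float version of the loop (the int() truncation deliberately left out):

import numpy as np

def J(n):
    # Jacobsthal numbers: 0, 1, 1, 3, 5, 11, 21, ...
    return (2**n - (-1)**n) // 3

a = np.array([-72., -10., -70., 37., 68., 9.])
k = 5

# Coefficients of a[0], a[1], a[2:k] and a[k] from the closed form above
coeff = np.array([J(k-1) / 2**(k-1), J(k) / 2**(k-1)] +
                 [J(k-j+1) / 2**(k-j) for j in range(2, k)] +
                 [1.0])
closed = coeff @ a

# Float version of the loop, without the int() truncation
b = a.copy()
for i in range(2, len(b)):
    b[i] += (b[i-1] + b[i-2]) / 2

assert np.isclose(closed, b[k])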
Most numpy code operates on the whole array at once. OK, it iterates in C code, but in a buffered way such that it doesn't matter which element is used first.
Here, changes to b[2] affect the value calculated for b[3] and so on down the line.
add.at and other such ufunc methods do unbuffered calculations. This allows you to add some value repeatedly to one element. I played a bit with them in this case, but no luck so far.
cumsum and cumprod are also handy for problems where values depend on earlier ones.
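For instance, cumsum collapses the simpler recurrence b[i] = b[i-1] + a[i] into one call; the (b[i-1] + b[i-2]) / 2 term here has no such ready-made counterpart:

import numpy as np

a = np.array([1, 2, 3, 4])
print(np.cumsum(a))   # [ 1  3  6 10] -- each value depends on all earlier ones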
Is it possible to generalize the calculation, so as to define b[i] in terms of all of a[:i]? We know b[2] as a function of a[:2], but what of b[3]?
Even if we got this working for floats, it might be off when doing integer divisions.
I think you already have the sane solution. Any other vectorization would rely on floating-point calculations, and it would be really difficult to keep track of the error accumulation. For example, say you want to write it as a matrix-vector multiplication: for the first seven terms the matrix would look like
array([[ 1. , 0. , 0. , 0. , 0. , 0. , 0. ],
[ 0. , 1. , 0. , 0. , 0. , 0. , 0. ],
[ 0.5 , 0.5 , 1. , 0. , 0. , 0. , 0. ],
[ 0.25 , 0.75 , 0.5 , 1. , 0. , 0. , 0. ],
[ 0.375 , 0.625 , 0.75 , 0.5 , 1. , 0. , 0. ],
[ 0.3125 , 0.6875 , 0.625 , 0.75 , 0.5 , 1. , 0. ],
[ 0.34375, 0.65625, 0.6875 , 0.625 , 0.75 , 0.5 , 1. ]])
The relationship can be described by the iterative formula

b[i] = [0.5  0.5  1] · [a[i-2], a[i-1], a[i]]^T
That defines a series of elementary matrices, each of the form of an identity matrix with

[0 ... 0.5 0.5 1 0 ... 0]

on the ith row, and successive multiplication of them gives the matrix above for the first seven terms. There is indeed a subdiagonal structure, but the terms get very small very quickly. As you have shown, 2 to the power 500k is not fun.
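As a quick check, a small sketch that composes those elementary matrices and reproduces the 7x7 matrix shown above (pure float arithmetic):

import numpy as np

n = 7
T = np.eye(n)
for i in range(2, n):
    E = np.eye(n)       # identity with [.., 0.5, 0.5, 1, ..] on row i
    E[i, i-2] = 0.5
    E[i, i-1] = 0.5
    T = E @ T           # apply the elementary update for row i
print(T)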
In order to keep track of the floating-point noise, an iterative solution is required, which is what you have anyway.