checking whether tuple in the list in python - python-3.x

Suppose I have a list of variables as follows:
v = [('d',0),('i',0),('g',0)]
What I want is to obtain a vector of values, that gives the truth value of the presence of the variable inside the list.
So, if have another list say
g = [('g',0)]
The output of that should be
op(v,g) = [False, False, True]
P.S.
I have tried using np.in1d but it gives the following:
array([False, True, False, True, True, True], dtype=bool)

In python you can use a list comprehension like following :
>>> v=[('d', 0), ('i', 0), ('g', 0)]
>>> g=[('t', 0), ('g', 0),('d',0)]
>>> [i in g for i in v]
[True, False, True]

You can convert those lists to numpy arrays and then use np.in1d like so -
import numpy as np
# Convert to numpy arrays
v_arr = np.array(v)
g_arr = np.array(g)
# Slice the first & second columns to get string & numeric parts.
# Use in1d to get matches between first columns of those two arrays;
# repeat for the second columns.
string_part = np.in1d(v_arr[:,0],g_arr[:,0])
numeric_part = np.in1d(v_arr[:,1],g_arr[:,1])
# Perform boolean AND to get the final boolean output
out = string_part & numeric_part
Sample run -
In [157]: v_arr
Out[157]:
array([['d', '0'],
['i', '0'],
['g', '0']],
dtype='<U1')
In [158]: g_arr
Out[158]:
array([['g', '1']],
dtype='<U1')
In [159]: string_part = np.in1d(v_arr[:,0],g_arr[:,0])
In [160]: string_part
Out[160]: array([False, False, True], dtype=bool)
In [161]: numeric_part = np.in1d(v_arr[:,1],g_arr[:,1])
In [162]: numeric_part
Out[162]: array([False, False, False], dtype=bool)
In [163]: string_part & numeric_part
Out[163]: array([False, False, False], dtype=bool)

Related

Getting indices using conditional list comprehension

I have the following np.array:
my_array=np.array([False, False, False, True, True, True, False, True, False, False, False, False, True])
How can I make a list using list comprehension of the indices corresponding to the True elements. In this case the output I'm looking for would be [3,4,5,7,12]
I've tried the following:
cols = [index if feature_condition==True for index, feature_condition in enumerate(my_array)]
But is not working
why specifically a list comprehension?
>>> np.where(my_array==True)
(array([ 3, 4, 5, 7, 12]),)
this does the job and is quicker. The list solution would be:
>>> [index for index, feature_condition in enumerate(my_array) if feature_condition == True]
[3, 4, 5, 7, 12]
Accepted answer of this explains the confusion of the ordering: if/else in a list comprehension
I was curious of the differences:
def np_time(array):
np.where(my_array==True)
def list_time(array):
[index for index, feature_condition in enumerate(my_array) if feature_condition == True]
timeit.timeit(lambda: list_time(my_array),number = 1000)
0.007574789000500459
timeit.timeit(lambda: np_time(my_array),number = 1000)
0.0010812399996211752
The order of if is not correct, should be in last -
$more numpy_1.py
import numpy as np
my_array=np.array([False, False, False, True, True, True, False, True, False, False, False, False, True])
print (my_array)
cols = [index for index, feature_condition in enumerate(my_array) if feature_condition]
print (cols)
$python numpy_1.py
[False False False True True True False True False False False False
True]
[3, 4, 5, 7, 12]

slice tensor of tensors using boolean tensor

Having two tensors :inputs_tokens is a batch of 20x300 of token ids
and seq_A is my model output with size of [20, 300, 512] (512 vector for each of the tokens in the batch)
seq_A.size()
Out[1]: torch.Size([20, 300, 512])
inputs_tokens.size()
torch.Size([20, 300])
I would like to get only the vectors of the token 101 (CLS) as follow:
cls_tokens = (inputs_tokens == 101)
cls_tokens
Out[4]:
tensor([[ True, False, False, ..., False, False, False],
[ True, False, False, ..., False, False, False],
[ True, False, False, ..., False, False, False], ...
How do I slice seq_A to get only the vectors which are true in cls_tokens for each batch?
when I do
seq_A[cls_tokens].size()
Out[7]: torch.Size([278, 512])
but I still need it to bee in the size of [20 x N x 512 ] (otherwise I don't know to which sample it belongs)
TLDR; You can't, all sequences must have the same size along a given axis.
Take this simplified example:
>>> inputs_tokens = torch.tensor([[ 1, 101, 18, 101, 9],
[ 1, 2, 101, 101, 101]])
>>> inputs_tokens.shape
torch.Size([2, 5])
>>> cls_tokens = inputs_tokens == 101
tensor([[False, True, False, True, False],
[False, False, True, True, True]])
Indexing inputs_tokens with the cls_tokens mask comes down to reducing inputs_tokens to cls_tokens's true values. In a general case where there is a different number of true values per batch, keeping the shape is impossible.
Following the above example, here is seq_A:
>>> seq_A = torch.rand(2, 5, 1)
tensor([[[0.4644],
[0.7656],
[0.3951],
[0.6384],
[0.1090]],
[[0.6754],
[0.0144],
[0.7154],
[0.5805],
[0.5274]]])
According to your example, you would expect to have an output shape of (2, N, 1). What would N be? 3? What about the first batch which only as 2 true values? The resulting tensor can't have different sizes (2 and 3 on axis=1). Hence: "all sequences on axis=1 must have the same size".
If however, you are expecting each batch to have the same number of tokens 101, then you could get away with a broadcast of your indexed tensor:
>>> inputs_tokens = torch.tensor([[ 1, 101, 101, 101, 9],
[ 1, 2, 101, 101, 101]])
>>> inputs_tokens.shape
>>> N = cls_tokens[0].sum()
3
Here remember, I'm assuming you have:
>>> assert all(cls_tokens.sum(axis=1) == N)
Therefore the desired output (with shape (2, 3, 1)) is:
>>> seq_A[cls_tokens].reshape(seq_A.size(0), N, -1)
tensor([[[0.7656],
[0.3951],
[0.6384]],
[[0.7154],
[0.5805],
[0.5274]]])
Edit - if you really want to do this though you would require the use of a list comprehension:
>>> [seq_A[i, cls_tokens[i]] for i in range(cls_tokens.size(0))]
[ tensor([[0.7656],
[0.6384]]),
tensor([[0.7154],
[0.5805],
[0.5274]]) ]

Python3 Boolean assignment to multidimension list is wrong?

Trying to do below
visited = [[False] * 3]* 3
print(visited)
visited[0][0] = True
print(visited)
why it prints like this:
[[False, False, False], [False, False, False], [False, False, False]]
[[True, False, False], [True, False, False], [True, False, False]]
shouldn't it be:
[[False, False, False], [False, False, False], [False, False, False]]
[[True, False, False], [False, False, False], [False, False, False]]
When you create a 2D array using
arr = [something * m]*n, all subarrays point to the same memory location. If you modify even one subarray, all other subarrays get modified.
The correct way to initialise the 2D matrix is
arr = [[something for i in range(m)] for j in range(n)]
to create a n x m matrix.
I am going to give an example how the array works. when you create a contiguous array(below is an example to check what is the size of the integer in python)
from sys import getsizeof
getsizeof(bool())
24
In my case boolean is 3 bytes (24 / 8 bits). when you define an array like above you are going to get the start location (which is 1000 as per below example) and when you access an array with index you will be given access to that particular location by the calculation start_location + size of boolean * index
1st Index of array will give 1000 + 3 * 1 -> 1003 for example
In your example visited = [[False] * 3]* 3 we are actually multiplying the reference point.[1000, 1000, 1000] in the example given above. so when you do visited[0][0] = True you are ideally changing the array with reference 1000 and modifying the 0th element to True. since the all array elements point to same array the value becomes like [[True, False, False], [True, False, False], [True, False, False]]
you should be using like below to initialize 2D matrix
array = [[False for i in range(3)] for j in range(3)]
array[0][0] = True
array
[[True, False, False], [False, False, False], [False, False, False]]
You would have noticed that by above example we are creating separate arrays on the go (with different references ofcourse).

Replace all Trues in a boolean dataframe with index valule

How can I replace all cells in a boolean dataframe (True/False) with the index name of that cell, when "True"? For example:
df = pd.DataFrame(
[
[False, True],
[True, False],
],
index=["abc", "def"],
columns=list("ab")
)
comes out as:
df = pd.DataFrame(
[
[False, abc],
[def, False],
],
index=["abc", "def"],
columns=list("ab")
)
Use df.mask:
Replace values where the condition is True.
df.mask(df,df.index)
a b
abc False abc
def def False

print mismatch items in two array

I want to compare two array(4 floating point)and print mismatched items.
I used this code:
>>> from numpy.testing import assert_allclose as np_assert_allclose
>>> x=np.array([1,2,3])
>>> y=np.array([1,0,3])
>>> np_assert_allclose(x,y, rtol=1e-4)
AssertionError:
Not equal to tolerance rtol=0.0001, atol=0
(mismatch 33.33333333333333%)
x: array([1, 2, 3])
y: array([1, 0, 3])
the problem by this code is with big array:
(mismatch 0.0015104228617559556%)
x: array([ 0.440088, 0.35994 , 0.308225, ..., 0.199546, 0.226758, 0.2312 ])
y: array([ 0.44009, 0.35994, 0.30822, ..., 0.19955, 0.22676, 0.2312 ])
I can not find what values are mismatched. how can see them ?
Just use
~np.isclose(x, y, rtol=1e-4) # array([False, True, False], dtype=bool)
e.g.
d = ~np.isclose(x, y, rtol=1e-4)
print(x[d]) # [2]
print(y[d]) # [0]
or, to get the indices
np.where(d) # (array([1]),)

Resources