Copying data from one tensor to another using bit masking - pytorch

import numpy as np
import torch
a = torch.zeros(5)
b = torch.tensor(tuple((0,1,0,1,0)),dtype=torch.uint8)
c= torch.tensor([7.,9.])
print(a[b].size())
a[b]=c
print(a)
torch.Size([2])tensor([0., 7., 0., 9., 0.])
I am struggling to understand how this works. I initially thought the above code was using Fancy indexing but I realised that values from c tensors are getting copied corresponding to the indices marked 1. Also, if I don't specify dtype of b as uint8 then the above code does not work. Can someone please explain me the mechanism of the above code.

Indexing with arrays works the same as in numpy and most other vectorized math packages I am aware of. There are two cases:
When b is of type uint8 (think boolean, pytorch doesn't distinguish bool from uint8), a[b] is a 1-d array containing the subset of values of a (a[i]) for which the corresponding in b (b[i]) was nonzero. These values are aliased to the original a so if you modify them, their corresponding locations will change as well.
The alternative type you can use for indexing is an array of int64, in which case a[b] creates an array of shape (*b.shape, *a.shape[1:]). Its structure is as if each element of b (b[i]) was replaced by a[i]. In other words, you create a new array by specifying from which indexes of a should the data be fetched. Again, the values are aliased to the original a, so if you modify a[b] the values of a[b[i]], for each i, will change. An example usecase is shown in this question.
These two modes are explained for numpy in integer array indexing and boolean array indexing, where for the latter you have to keep in mind that pytorch uses uint8 in place of bool.
Also, if your goal is to copy data from one tensor to another you have to keep in mind that an operation like a[ixs] = b[ixs] is an in-place operation (a is modified in place), which my not play well with autograd. If you want to do out of place masking, use torch.where. An example usecase is shown in this answer.

Related

NumbaNotImplementedError: only one advanced index supported -> how to rewrite a 3D [x,y,x] -> 2D array replacement of values numba can handle

I'm converting some odd code over to be Numba compatible using parallel=True. It has a problematic array assignment that I can't quite figure out how to rewrite in a way numba can handle. I try to decode what the error means, but I get quite lost. The only thing clear is it does not like the line: Averaging_price_3D[leg, :, expired_loc] = last_non_expired_values.T The error is pretty long, included for reference here:
TypingError: No implementation of function Function(<built-in function setitem>) found for signature:
setitem(array(float64, 3d, C), Tuple(int64, slice<a:b>, array(int64, 1d, C)), array(float64, 2d, F))
There are 16 candidate implementations:
- Of which 14 did not match due to:
Overload of function 'setitem': File: <numerous>: Line N/A.
With argument(s): '(array(float64, 3d, C), Tuple(int64, slice<a:b>, array(int64, 1d, C)), array(float64, 2d, F))':
No match.
- Of which 2 did not match due to:
Overload in function 'SetItemBuffer.generic': File: numba\core\typing\arraydecl.py: Line 176.
With argument(s): '(array(float64, 3d, C), Tuple(int64, slice<a:b>, array(int64, 1d, C)), array(float64, 2d, F))':
Rejected as the implementation raised a specific error:
NumbaNotImplementedError: only one advanced index supported
And here is a short code segment to reproduce the error:
import numpy as np
import numba as nb
#nb.jit(nopython=True, parallel=True, nogil=True)
def main(Averaging_price_3D, expired_loc, last_non_expired_values):
for leg in range(Averaging_price_3D.shape[0]):
# line below causes the numba error:
Averaging_price_3D[leg, :, expired_loc] = last_non_expired_values.T
return Averaging_price_3D
if __name__ == "__main__":
Averaging_price_3D=np.random.rand(2,8192,11)*100 # shape (2,8192,11) 3D array float64
expired_loc=np.arange(4,10).astype(np.int64) # shape (6,) 1D array int64
last_non_expired_values = Averaging_price_3D[1,:,0:expired_loc.shape[0]].copy() # shape (8192,6) 2D array float64
result = main(Averaging_price_3D, expired_loc, last_non_expired_values)
Now the best I can interpret this error is that "numba doesn't know how to set values in a 3D matrix using array indexing with values from a 2D array." But I searched online quite a bit and can't find another way to accomplish the same thing, without numba crashing on it.
In other cases like this I resorted to flattening the arrays with a .reshape(-1) before indexing, but I'm having issues with figuring out how to do that in this specific case (that was easy with a 3D array indexed with another 3D array, as they both would flatten in the same order)... Any help is appreciated!
Well interesting enough, I looked at the indexes passed to the 3D array (since the error said "only one advanced index supported," I chose to examine my indexing):
3Darray[int, :, 1Darray]
Seeing numba is quite picky, I tried rewriting it a little bit, so that a 1D array wasn't used as an index (apparently, this is an "advanced index", so use an int index instead). Reading numba errors and solutions, they tend to add loops, so I tried that here. So instead of passing a 1D array as an index, I looped over the elements of the 1D array:
import numpy as np
import numba as nb
#nb.jit(cache=True, nopython=True, parallel=True, nogil=True)
def main(Averaging_price_3D, expired_loc, last_non_expired_values):
for leg in nb.prange(Averaging_price_3D.shape[0]):
# change the indexing for numba to get rid of a 1D array index
for i in nb.prange(expired_loc.shape[0]):
# now we assign values 3Darray[int,:,int] = 1Darray
Averaging_price_3D[leg, :, expired_loc[i]] = last_non_expired_values[:,i].T
return Averaging_price_3D
if __name__ == "__main__":
Averaging_price_3D=np.random.rand(2,8192,11)*100 # shape (2,8192,11) 3D array float64
expired_loc=np.arange(4,10).astype(np.int64) # shape (6,) 1D array int64
last_non_expired_values = Averaging_price_3D[1,:,0:expired_loc.shape[0]] # shape (8192,6) 2D array float64
result = main(Averaging_price_3D, expired_loc, last_non_expired_values)
Now it works no problem at all. So it appears to me if you want to access elements from a 3D array with numba, you should do it with either ints or :. It appears to not like 1D array indexing, so replace it with a loop and it should run in parallel.

Torch mv behavior not understandable

The following screenshots show that torch.mv is unusable in a situation that obviously seem to be correct... how is this possible, any idea what can be the problem?
this first image shows the correct situation, where the vector has 10 rows for a matrix of 10 columns, but I showed the other also just in case. Also swapping w.mv(x) for x.mv(w) does not make a difference.
However, the # operator works... the thing is that for my own reasons I want to use mv, so I would like to know what the problem is.
According to documentation:
torch.mv(input, vec, *, out=None) → Tensor
If input is a (n×m) tensor, vec is a 1-D tensor of size m, out will be 1-D of size n.
The x here should be 1-D, but in your case it's 10x1 (2D). You can remove extra dimension (or create a single dimension x)
>>> w.mv(x.squeeze())
tensor([ 0.1432, -2.0639, -2.1871, -1.8837, 0.7333, -0.4000, 0.4023, -1.1318,
0.0423, -1.2136])
>>> w # x
tensor([[ 0.1432],
[-2.0639],
[-2.1871],
[-1.8837],
[ 0.7333],
[-0.4000],
[ 0.4023],
[-1.1318],
[ 0.0423],
[-1.2136]])

Split a sparse square matrix into blocks

I have a banded sparse square matrix , A, of type <class 'scipy.sparse.csr.csr_matrix'> and size = 400 x 400. I'd like to split this into block square matrices of size 200 x 200 each. For instance, the first block
block1 = A[0:200, 0:200]
block2 = A[100:300, 100:300]
block3 = A[200:400, 200:400]
The same information about the slices is stored in a list of tuples.
[(0,200), (100, 300), (200, 400)]
Suggestions on how to split the spare square matrix will be really helpful.
You can convert to a regular array and then split it:
from scipy.sparse import csr_matrix
import numpy as np
row = np.arange(400)[::2]
col = np.arange(400)[1::2]
data = np.random.randint(1, 10, (200))
compressed_matrix = csr_matrix((data, (row, col)), shape=(400, 400))
# Convert to a regular array
m = compressed_matrix.toarray()
# Split the matrix
sl = [(0,200), (100, 300), (200, 400)]
blocks = [m[i, i] for i in map(lambda x: slice(*x), sl)]
And if you want you can convert back each block to a compressed matrix:
blocks_csr = list(map(csr_matrix, blocks))
CODE EXPLANATION
The creation of the blocks is based on a list comprehension and basic slicing.
Each input tuple is converted to a slice object, only to create a series of row and column indexes, corresponding to that of the elements to be selected; in this answer, this is sufficient to select the requested block squared matrix. Slice objects are generated when extended indexing syntax is used: To be clear, a[start:stop:step] will create a slice object equivalent to slice(start, stop, step). In our case, they are used to dynamically change the indexes to be selected, according to the matrix we want to extract. So, if you consider the first block, m[i, i] is equivalent to m[0:200, 0:200].
Slice objects are a form of basic indexing, so a view of the original array is created, rather than a copy (this means that if you modify the view, also the original array will be modified: you can easily create a copy of the original data using the copy method of the numpy array).
The map object is used to generate slice objects from the input tuples; map applies the function provided as its first argument to all the elements of its second argument.
lambda is used to create an anonymous function, i.e., a function defined without a name. Anonymous functions are useful to accomplish specific tasks that you do not want to code in a standard function, because you are not going to reuse them or you need only for a short period of time, like in the example of this code. They make code more compact rather than defining the correspondent functions.
*x is called unpacking, i.e you extract, unpack elements from the tuple. Suppose you have a function f and a tuple a = (1, 2, 3), then f(*a) is equivalent to f(1, 2, 3) (as you can see, you can think of unpacking as removing a level of parentheses).
So, looking back at the code:
blocks = [ # this is a list comprehension
m[i, i] # basic slicing of the input array
for i in map( # map apply a function to all the item of a list
lambda x: slice(*x), sl # creating a slice object out of the provided index ranges
)
]

TypeError: append() missing 1 required positional argument: 'values'

I have variable 'x_data' sized 360x190, I am trying to select particular rows of data.
x_data_train = []
x_data_train = np.append([x_data_train,
x_data[0:20,:],
x_data[46:65,:],
x_data[91:110,:],
x_data[136:155,:],
x_data[181:200,:],
x_data[226:245,:],
x_data[271:290,:],
x_data[316:335,:]],axis = 0)
I get the following error :
TypeError: append() missing 1 required positional argument: 'values'
where did I go wrong ?
If I am using
x_data_train = []
x_data_train.append(x_data[0:20,:])
x_data_train.append(x_data[46:65,:])
x_data_train.append(x_data[91:110,:])
x_data_train.append(x_data[136:155,:])
x_data_train.append(x_data[181:200,:])
x_data_train.append(x_data[226:245,:])
x_data_train.append(x_data[271:290,:])
x_data_train.append(x_data[316:335,:])
the size of the output is 8 instead of 160 rows.
Update:
In matlab, I will load the text file and x_data will be variable having 360 rows and 190 columns.
If I want to select 1 to 20 , 46 to 65, ... rows of data , I simply give
x_data_train = xdata([1:20,46:65,91:110,136:155,181:200,226:245,271:290,316:335], :);
the resulting x_data_train will be the array of my desired.
How can do that in python because it results array of 8 subsets of array for 20*192 each, but I want it to be one array 160*192
Short version: the most idiomatic and fastest way to do what you want in python is this (assuming x_data is a numpy array):
x_data_train = np.vstack([x_data[0:20,:],
x_data[46:65,:],
x_data[91:110,:],
x_data[136:155,:],
x_data[181:200,:],
x_data[226:245,:],
x_data[271:290,:],
x_data[316:335,:]])
This can be shortened (but made very slightly slower) by doing:
xdata[np.r_[0:20,46:65,91:110,136:155,181:200,226:245,271:290,316:335], :]
For your case where you have a lot of indices I think it helps readability, but in cases where there are fewer indices I would use the first approach.
Long version:
There are several different issues at play here.
First, in python, [] makes a list, not an array like in MATLAB. Lists are more like 1D cell arrays. They can hold any data type, including other lists, but they cannot have multiple dimensions. The equivalent of MATLAB matrices in Python are numpy arrays, which are created using np.array.
Second, [x, y] in Python always creates a list where the first element is x and the second element is y. In MATLAB [x, y] can do one of several completely different things depending on what x and y are. In your case, you want to concatenate. In Python, you need to explicitly concatenate. For two lists, there are several ways to do that. The simplest is using x += y, which modifies x in-place by putting the contents of y at the end. You can combine multiple lists by doing something like x += y + z + w. If you want to keep x, unchanged, you can assign to a new variable using something like z = x + y. Finally, you can use x.extend(y), which is roughly equivalent to x += y but works with some data types besides lists.
For numpy arrays, you need to use a slightly different approach. While Python lists can be modified in-place, strictly speaking neither MATLAB matrices nor numpy arrays can be. MATLAB pretends to allow this, but it is really creating a new matrix behind-the-scenes (which is why you get a warning if you try to resize a matrix in a loop). Numpy requires you to be more explicit about creating a new array. The simplest approach is to use np.hstack, which concatenates two arrays horizontally (or np.vstack or np.dstack for vertical and depth concatenation, respectively). So you could do z = np.hstack([v, w, x, y]). There is an append method and function in numpy, but it almost never works in practice so don't use it (it requires careful memory management that is more trouble than it is worth).
Third, what append does is to create one new element in the target list, and put whatever variable append is called with in that element. So if you do x.append([1,2,3]), it adds one new element to the end of list x containing the list [1,2,3]. It would be more like x = [x, {{1,2,3}}}, where x is a cell array.
Fourth, Python makes heavy use of "methods", which are basically functions attached to data (it is a bit more complicated than that in practice, but those complexities aren't really relevant here). Recent versions of MATLAB has added them as well, but they aren't really integrated into MATLAB data types like they are in Python. So where in MATLAB you would usually use sum(x), for numpy arrays you would use x.sum(). In this case, assuming you were doing appending (which you aren't) you wouldn't use the np.append(x, y), you would use x.append(y).
Finally, in MATLAB x:y creates a matrix of values from x to y. In Python, however, it creates a "slice", which doesn't actually contain all the values and so can be processed much more quickly by lists and numpy arrays. However, you can't really work with multiple slices like you do in your example (nor does it make sense to because slices in numpy don't make copies like they do in MATLAB, while using multiple indexes does make a copy). You can get something close to what you have in MATLAB using np.r_, which creates a numpy array based on indexes and slices. So to reproduce your example in numpy, where xdata is a numpy array, you can do xdata[np.r_[1:20,46:65,91:110,136:155,181:200,226:245,271:290,316:335], :]
More information on x_data and np might be needed to solve this but...
First: You're creating 2 copies of the same list: np and x_data_train
Second: Your indexes on x_data are strange
Third: You're passing 3 objects to append() when it only accepts 2.
I'm pretty sure revisiting your indexes on x_data will be where you solve the current error, but it will result in another error related to passing 2 values to append.
And I'm also sure you want
x_data_train.append(object)
not
x_data_train = np.append(object)
and you may actually want
x_data_train.extend([objects])
More on append vs extend here: append vs. extend

Vectors and Cells/DataFrame

In MATLAB, I could combine a column vector of numbers and string to a 3x1 Cell to produce a single cell (C) of 3x3 dimension as shown below
C =
8 'CNDA-ESP_SIMRAD_MDS_DGPS_2000_2001.xlsx' 415x3 double
2 'CNDA-ESP_SIMRAD_MDS_DGPS_2006_2007.xlsx' 986x3 double
2 'CNDA-ESP_SIMRAD_MDS_DGPS_2010_2011.xlsx' 704x3 double
Is it possible to do this in Python?
Long before MATLAB had cells, Python had lists. They are 1d, but can contain other lists. In fact without numpy nested lists are used to create 'matrices'.
A numpy arrays with object dtype contain pointers to objects, so they are similar to lists. You can't append to them as you can with lists. But they are multidimensional like regular arrays. So depending on your perspective they are enhanced lists or degraded ones.
Constructing a list is trival
alist = [array1, array2, array3, ...]
Constructing an object array can be trickier. If the subarrays or objects differ in size and type, it is easy - just wrap the list version in np.array(alist, dtype=object).
But if the subarrays all have the same shape, np.array creates a higher dimensional array from them. The best way around that is to create 'blank' array of the right shape, and assign values.
I've discussed these issues at:
Can I construct a numpy object zero-d array from its value in a single expression?
There I mention scipy.io.loadmat, which is capable of loading MATLAB files. You might try writing your cells to a .mat, and load them in Python. That will give you ideas of how that function tries to create equivalents.
More on object arrays v lists at:
Irregular Numpy matrix

Resources