Vectors and Cells/DataFrame - python-3.x

In MATLAB, I could combine a column vector of numbers and a column of strings with a 3x1 cell to produce a single cell array (C) of 3x3 dimension, as shown below:
C =
8 'CNDA-ESP_SIMRAD_MDS_DGPS_2000_2001.xlsx' 415x3 double
2 'CNDA-ESP_SIMRAD_MDS_DGPS_2006_2007.xlsx' 986x3 double
2 'CNDA-ESP_SIMRAD_MDS_DGPS_2010_2011.xlsx' 704x3 double
Is it possible to do this in Python?

Long before MATLAB had cells, Python had lists. They are 1d, but can contain other lists. In fact without numpy nested lists are used to create 'matrices'.
A numpy array with object dtype contains pointers to objects, so it is similar to a list. You can't append to it as you can with a list, but it can be multidimensional like a regular array. So depending on your perspective, object arrays are enhanced lists or degraded ones.
Constructing a list is trivial:
alist = [array1, array2, array3, ...]
Constructing an object array can be trickier. If the subarrays or objects differ in size and type, it is easy - just wrap the list version in np.array(alist, dtype=object).
But if the subarrays all have the same shape, np.array creates a higher dimensional array from them. The best way around that is to create a 'blank' object array of the right shape and assign the values to it.
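A minimal sketch of the two cases described above, using made-up sub-arrays (the names subarrays, stacked and cells are only illustrative). It shows that np.array merges equally shaped sub-arrays into one 3D array, while pre-allocating an object array keeps them as separate elements:

```python
import numpy as np

# Three equally shaped sub-arrays (hypothetical example data).
subarrays = [np.zeros((2, 3)), np.ones((2, 3)), np.full((2, 3), 2.0)]

# np.array merges them into a single 3D array, even with dtype=object.
stacked = np.array(subarrays, dtype=object)
print(stacked.shape)   # (3, 2, 3)

# Pre-allocate a 1D object array and assign element by element,
# so that each sub-array stays a separate object (like a MATLAB cell).
cells = np.empty(len(subarrays), dtype=object)
for i, sub in enumerate(subarrays):
    cells[i] = sub

print(cells.shape)     # (3,)
print(cells[1].shape)  # (2, 3)
```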
I've discussed these issues at:
Can I construct a numpy object zero-d array from its value in a single expression?
There I mention scipy.io.loadmat, which is capable of loading MATLAB files. You might try writing your cells to a .mat, and load them in Python. That will give you ideas of how that function tries to create equivalents.
More on object arrays v lists at:
Irregular Numpy matrix

Related

Octave: Differences between struct and cell array

Sparked by this question and posted comments/answers, I came up with another question:
What features are available in Cell arrays that are not in Structures, and viceversa, in Octave?
I could gather the following:
In Cell arrays:
Can operate on full "columns" (fields in the structure lingo) at once.
In Structures:
Have named fields.
I think the best way to answer this, rather than address how they are similar, is to point out how they differ.
Also, since you seem to be drawing equivalents to (and perhaps confusing) concepts from other languages, it may be instructive to point out similarities to constructs from other popular languages, namely R and python.
In all the above languages, there exists the concept of
an "array": a rectangular collection of elements of the same type, which can be one or more dimensions, and typically guaranteed to occupy a contiguous area in memory
a "list": a collection of elements which can be of different types, does not have to be rectangular (i.e. can be 'jagged'), typically only 1D (but can be multidimensional, or contain nested lists), and its elements are not guaranteed to occupy a contiguous area in memory
a "dict": a collection of elements which are like a list, but augmented by the fact that they can be accessed by a 'key', rather than just by index.
a "table" (?): a horizontal concatenation of equal-sized columns, each identified by a 'column header'
Octave
In octave, the closest to the "array" concept is the 'object' array (where 'object' could be anything, but is typically numerical), e.g. [1,2:3,4].
The closest to a "list" concept is the cell array, e.g. { [1,2], true; 'John' }. To index a cell array and obtain the contents of a cell at a particular index, you use {}. However, octave cell-arrays are slightly different, in that they can also be thought of as 'object arrays' where the object elements are 'cells', which you could think of as references to their contained objects. This means you can construct a cell-array and index it with () as a normal array, returning a sub-array of cells (i.e. another cell-array). Also, a cell can contain another cell-array as its contents (i.e. cell-arrays can be nested).
The closest to a "dict" concept is the struct. This allows you to create an object which can have 'fields', such that for each field you can assign value.
Python
By contrast, in python you don't have arrays. You only have lists and dicts. In order to get array functionality you need to rely on external modules (such as numpy) which take a list as an argument to convert to an array type. Python lists are also always 1D (but can be nested).
Python dicts effectively behave the same way as octave structs. There are some tiny conceptual differences, but they're effectively equivalent constructs.
R
R is probably the bit that's causing the most confusion, because R allows you to allocate names to elements of both arrays and lists, and allows you to access both using either an index or the allocated name.
But, R still has a vector type, e.g. c(1,2,3), which despite the fact that it can also be given names, e.g. c( a=1, b=2, c=3 ), it still requires that elements need to be of the same type, otherwise R will convert to the least common denominator. (e.g. c(1, '2') will convert both elements to strings).
Then, you have lists, which are basically something like 'lists' and 'dicts' combined. If you have list(1, 2, 3), you have 'list' functionality, and if you have list(a=1, b=2, c=3) you have 'dict' functionality. If you access a list element using the [] operator, the output is expressed as another list (in a similar way to how cellarrays in octave can be indexed with () ), whereas if you index a list using the [[]] operator, you get the 'contents' only (similar to if you index a cell-array in octave with {} ).
"Tables": dataframes vs dicts vs structs
Now, in R, you also have dataframes. This is effectively a list with names (i.e. 'dict') where all elements are vectors of the same length (but can be different types), e.g. data.frame( list( a=1:3, b=c('one', 'two', 'three') ) ); note that expressing this as a data.frame rather than a plain list simply results in different behaviour (e.g. printing), but otherwise the underlying object is the same (which you can confirm by typing unclass(df)).
In python, we can note that a pandas dataframe behaves the same way (i.e. a pandas dataframe is initialized via a dict whose keys contain values that are equally sized vectors).
Therefore since a dataframe is basically a list of equal vectors, the easiest way to have dataframe functionality in octave is to create a struct whose fields are equal sized vectors. Or, if you don't care about fieldnames and are happy to access your contained arrays by "column index", then you can create a cell array and store in each cell your equally-sized numerical 'data' arrays.
Do cells have "columns" in the way implied in the question?
No. If you want to do vectorised operations, you cannot do it across cell-array columns. You need to perform vectorised operations on arrays.
So actually, if what you're looking for is the equivalent of a dataframe, where each "column" represents a numerical vector, the equivalent of that is a struct, where you assign a numerical vector to a field.
In other words the equivalent of dataframes in the various languages are:
Python: pandas.DataFrame( { 'col1': [1,2], 'col2': [3,4] })
R: data.frame( list( col1=c(1,2), col2=c(3,4) ) )
octave: struct( 'col1', [1,2], 'col2', [3,4] )
Having said that, you may prefer a more 'tabular' output. You can either write your own function for this, or try the dataframe package from octave forge, which provides a class for just that.
As an example here's one snippet you could easily convert to a function, and improve on to add all sorts of bells and whistles like colour etc.
fprintf( '%4s %5s %5s\n', '', fieldnames(S){:} );
for i = 1 : length(S.col1)
fprintf( '%4d %5.3f %5.3f\n', i, num2cell( structfun(@(x) x(i), S) ){:} );
end
col1 col2
1 1.000 3.000
2 2.000 4.000
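For comparison, here is a rough Python sketch of the same idea: a dict whose values are equal-length columns, printed row by row. It is only an illustration under the assumption that every column has the same length; format_table and table are hypothetical names, and a real project would more likely use pandas.DataFrame as mentioned above.

```python
# Hypothetical column data, mirroring struct('col1', [1,2], 'col2', [3,4]).
table = {"col1": [1.0, 2.0], "col2": [3.0, 4.0]}

def format_table(columns):
    """Return the header and rows of a dict of equal-length columns as text lines."""
    names = list(columns)
    # Header: an empty row-number field followed by the column names.
    lines = ["%4s " % "" + " ".join("%5s" % name for name in names)]
    n_rows = len(columns[names[0]])
    for i in range(n_rows):
        values = " ".join("%5.3f" % columns[name][i] for name in names)
        lines.append("%4d %s" % (i + 1, values))
    return lines

lines = format_table(table)
print("\n".join(lines))
```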

ValueError regarding dimensions when declaring PyTorch tensor

I'm currently trying to convert a list of values into a PyTorch tensor and am facing some difficulties.
The exact code that's causing the error is:
input_tensor = torch.cuda.FloatTensor(data)
Here, data is a list with two elements: The first element is another list of NumPy arrays and the second element is a list of tuples. The sizes of both lists differ, and I believe this is causing the following error:
*** ValueError: expected sequence of length x at dim 2 (got y)
Usually y is larger than x. I've tried playing around with an IPython terminal to see what's wrong, and it appears that trying to convert data of this format directly into PyTorch tensors doesn't work. Taking each individual element of the data list and converting those into tensors works, though.
Does anybody know why this doesn't work, and could you perhaps suggest how to achieve my original goal? Thanks in advance.
Let's say that the first sublist of data contains n 1D arrays, each of size m, and the second sublist contains k tuples, each of size p.
When calling torch.FloatTensor(data), each sublist is converted to a 2D tensor, of shape (n, m) and of shape (k, p) respectively; then they are stacked together to form a 3D tensor. This is possible only if n=k and m=p -- think of a 3D tensor as a cuboid.
This is quite obvious I think, so I guess you have m = p and want to create a 2D tensor of shape (n+k, m) by simply concatenating the two sublists:
torch.FloatTensor(np.concatenate(data))
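A small sketch of the shapes involved, using only NumPy and a made-up data list with n=3, k=2 and m=p=4 (these sizes are assumptions, not the asker's real data). It shows that each sublist is rectangular on its own and that np.concatenate produces the (n+k, m) array, which could then be passed to torch.FloatTensor:

```python
import numpy as np

# Hypothetical stand-in for `data`: a list of 3 arrays and a list of 2 tuples,
# all of length 4.
data = [
    [np.arange(4.0), np.arange(4.0) + 10, np.arange(4.0) + 20],
    [(1.0, 2.0, 3.0, 4.0), (5.0, 6.0, 7.0, 8.0)],
]

# Each sublist converts cleanly to a 2D array on its own...
first = np.asarray(data[0])
second = np.asarray(data[1])
print(first.shape, second.shape)  # (3, 4) (2, 4)

# ...and concatenating along the first axis gives one (n + k, m) array.
combined = np.concatenate(data)
print(combined.shape)             # (5, 4)
```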

Efficient way to find unique numpy arrays from a large set having the same shape and dtype

I have a large set (~ 10000) of numpy arrays, (a1, a2, a3,...,a10000). Each array has the same shape (10, 12) and all are of dtype = int. In any row of any array, the 12 values are unique.
Now, there are many doubles, triples, etc. I suspect only about a tenth of the arrays are actually unique (ie: having the same values in the same positions).
Could I get some advice on how I might isolate the unique arrays? I suspect numpy.array_equal will be involved, but I'm new enough to the language that I'm struggling with how to implement it.
numpy.unique can be used to find the unique elements of an array. Supposing your data is contained in a list; first, stack data to generate a 3D array. Then perform np.unique to find unique 2D arrays:
import numpy as np
# dummy list of numpy array to simulate your data
list_of_arrays = [np.stack([np.random.permutation(12) for i in range(10)]) for i in range(10000)]
# stack arrays to form a 3D array
arr = np.stack(list_of_arrays)
# find unique arrays
unq = np.unique(arr, axis = 0)
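If it also matters which arrays were duplicated and how often, np.unique can return that as well. The following is a small sketch on made-up 2x3 arrays rather than the real (10, 12) data; the variable names are illustrative only:

```python
import numpy as np

# Five small arrays, only two of which are distinct (hypothetical data).
base = np.arange(6).reshape(2, 3)
arrays = [base, base + 1, base.copy(), base + 1, base.copy()]

arr = np.stack(arrays)  # shape (5, 2, 3)

# return_index gives the position of the first occurrence of each unique
# array, return_counts gives how many times it appears in the stack.
unq, first_idx, counts = np.unique(
    arr, axis=0, return_index=True, return_counts=True
)

print(unq.shape)           # (2, 2, 3)
print(first_idx.tolist())  # [0, 1]
print(counts.tolist())     # [3, 2]
```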

Copying data from one tensor to another using bit masking

import numpy as np
import torch
a = torch.zeros(5)
b = torch.tensor(tuple((0,1,0,1,0)),dtype=torch.uint8)
c= torch.tensor([7.,9.])
print(a[b].size())
a[b]=c
print(a)
torch.Size([2])
tensor([0., 7., 0., 9., 0.])
I am struggling to understand how this works. I initially thought the above code was using fancy indexing, but I realised that values from the c tensor are getting copied to the positions marked 1 in b. Also, if I don't specify the dtype of b as uint8, the above code does not work. Can someone please explain the mechanism of the above code to me?
Indexing with arrays works the same as in numpy and most other vectorized math packages I am aware of. There are two cases:
When b is a mask of type uint8 (think boolean; older PyTorch versions did not distinguish bool from uint8, while newer versions provide torch.bool and prefer it for masks), a[b] is a 1-d tensor containing the subset of values of a (a[i]) for which the corresponding element of b (b[i]) is nonzero. Assigning through the mask, as in a[b] = c, writes into those locations of a.
The alternative type you can use for indexing is a tensor of int64, in which case a[b] creates a tensor of shape (*b.shape, *a.shape[1:]). Its structure is as if each element of b (b[i]) was replaced by a[b[i]]. In other words, you create a new tensor by specifying from which indices of a the data should be fetched. Again, assigning to a[b] changes the values of a[b[i]] for each i. This is also why the code fails when b is left as int64: a[b] then has five elements, and the two values in c cannot be assigned to it. An example use case is shown in this question.
These two modes are explained for numpy in integer array indexing and boolean array indexing, where for the latter you have to keep in mind that the PyTorch code here uses uint8 in place of bool.
Also, if your goal is to copy data from one tensor to another, you have to keep in mind that an operation like a[ixs] = b[ixs] is an in-place operation (a is modified in place), which may not play well with autograd. If you want to do out-of-place masking, use torch.where. An example use case is shown in this answer.
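The two indexing modes and the out-of-place alternative can be sketched in NumPy, which the answer says behaves the same way. This is an illustration with NumPy arrays rather than PyTorch tensors; np.where stands in for the torch.where call named above, and mask, idx, picked and replaced are illustrative names:

```python
import numpy as np

a = np.zeros(5)
mask = np.array([0, 1, 0, 1, 0], dtype=bool)
c = np.array([7.0, 9.0])

# Boolean mask: assignment writes c into the positions where mask is True.
a[mask] = c
print(a)       # [0. 7. 0. 9. 0.]

# Integer index array: elements are fetched by position, repeats allowed.
idx = np.array([1, 3, 3])
picked = a[idx]
print(picked)  # [7. 9. 9.]

# Out-of-place masking: np.where builds a new array and leaves `a` untouched.
replaced = np.where(mask, -1.0, a)
print(replaced.tolist())
```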

TypeError: append() missing 1 required positional argument: 'values'

I have variable 'x_data' sized 360x190, I am trying to select particular rows of data.
x_data_train = []
x_data_train = np.append([x_data_train,
x_data[0:20,:],
x_data[46:65,:],
x_data[91:110,:],
x_data[136:155,:],
x_data[181:200,:],
x_data[226:245,:],
x_data[271:290,:],
x_data[316:335,:]],axis = 0)
I get the following error :
TypeError: append() missing 1 required positional argument: 'values'
where did I go wrong ?
If I am using
x_data_train = []
x_data_train.append(x_data[0:20,:])
x_data_train.append(x_data[46:65,:])
x_data_train.append(x_data[91:110,:])
x_data_train.append(x_data[136:155,:])
x_data_train.append(x_data[181:200,:])
x_data_train.append(x_data[226:245,:])
x_data_train.append(x_data[271:290,:])
x_data_train.append(x_data[316:335,:])
the size of the output is 8 instead of 160 rows.
Update:
In matlab, I will load the text file and x_data will be variable having 360 rows and 190 columns.
If I want to select 1 to 20 , 46 to 65, ... rows of data , I simply give
x_data_train = xdata([1:20,46:65,91:110,136:155,181:200,226:245,271:290,316:335], :);
and the resulting x_data_train is the array I want.
How can I do that in Python? My attempt results in a list of 8 sub-arrays of 20*192 each, but I want one 160*192 array.
Short version: the most idiomatic and fastest way to do what you want in python is this (assuming x_data is a numpy array):
x_data_train = np.vstack([x_data[0:20,:],
x_data[46:65,:],
x_data[91:110,:],
x_data[136:155,:],
x_data[181:200,:],
x_data[226:245,:],
x_data[271:290,:],
x_data[316:335,:]])
This can be shortened (but made very slightly slower) by doing:
x_data[np.r_[0:20,46:65,91:110,136:155,181:200,226:245,271:290,316:335], :]
For your case where you have a lot of indices I think it helps readability, but in cases where there are fewer indices I would use the first approach.
Long version:
There are several different issues at play here.
First, in python, [] makes a list, not an array like in MATLAB. Lists are more like 1D cell arrays. They can hold any data type, including other lists, but they cannot have multiple dimensions. The equivalent of MATLAB matrices in Python are numpy arrays, which are created using np.array.
Second, [x, y] in Python always creates a list where the first element is x and the second element is y. In MATLAB [x, y] can do one of several completely different things depending on what x and y are. In your case, you want to concatenate. In Python, you need to explicitly concatenate. For two lists, there are several ways to do that. The simplest is using x += y, which modifies x in-place by putting the contents of y at the end. You can combine multiple lists by doing something like x += y + z + w. If you want to keep x unchanged, you can assign to a new variable using something like z = x + y. Finally, you can use x.extend(y), which is roughly equivalent to x += y but works with some data types besides lists.
For numpy arrays, you need to use a slightly different approach. While Python lists can be resized in-place, strictly speaking neither MATLAB matrices nor numpy arrays can be. MATLAB pretends to allow this, but it is really creating a new matrix behind-the-scenes (which is why you get a warning if you try to resize a matrix in a loop). Numpy requires you to be more explicit about creating a new array. The simplest approach is to use np.hstack, which concatenates arrays horizontally (or np.vstack or np.dstack for vertical and depth concatenation, respectively). So you could do z = np.hstack([v, w, x, y]). There is an np.append function in numpy, but it returns a new copy of the whole array on every call and is easy to misuse, so it is rarely the right tool.
Third, what append does is to create one new element in the target list, and put whatever variable append is called with in that element. So if you do x.append([1,2,3]), it adds one new element to the end of list x containing the list [1,2,3]. It would be more like x = [x, {{1,2,3}}], where x is a cell array.
Fourth, Python makes heavy use of "methods", which are basically functions attached to data (it is a bit more complicated than that in practice, but those complexities aren't really relevant here). Recent versions of MATLAB have added them as well, but they aren't really integrated into MATLAB data types like they are in Python. So where in MATLAB you would usually use sum(x), for numpy arrays you would use x.sum(). In this case, assuming you were appending to a list (which you aren't), you wouldn't use np.append(x, y), you would use x.append(y).
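The difference between appending and concatenating lists, as described in the second and third points, can be seen in a few lines. This is a plain-Python sketch with throwaway variable names:

```python
x = [1, 2]
x.append([3, 4])   # adds ONE new element, which is itself a list
print(x)           # [1, 2, [3, 4]]

y = [1, 2]
y.extend([3, 4])   # adds each item of the argument to the end of y
print(y)           # [1, 2, 3, 4]

z = [1, 2]
z += [3, 4]        # same effect as extend for lists, modifies z in place
w = z + [5]        # builds a new list and leaves z unchanged
print(z, w)        # [1, 2, 3, 4] [1, 2, 3, 4, 5]
```

This is why the loop of append calls in the question produced a list of length 8 rather than one 160-row array.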
Finally, in MATLAB x:y creates a matrix of values from x to y. In Python, however, it creates a "slice", which doesn't actually contain all the values and so can be processed much more quickly by lists and numpy arrays. However, you can't really work with multiple slices like you do in your example (nor does it make sense to because slices in numpy don't make copies like they do in MATLAB, while using multiple indexes does make a copy). You can get something close to what you have in MATLAB using np.r_, which creates a numpy array based on indexes and slices. So to reproduce your example in numpy, where xdata is a numpy array, you can do xdata[np.r_[1:20,46:65,91:110,136:155,181:200,226:245,271:290,316:335], :]
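One detail worth checking when porting the MATLAB ranges: MATLAB's a:b is 1-based and includes b, while a Python slice is 0-based and excludes the stop value, so MATLAB 46:65 corresponds to Python 45:65. The sketch below converts the ranges from the question on that assumption, using a made-up x_data of the stated 360x190 size; matlab_ranges and rows are illustrative names.

```python
import numpy as np

# Stand-in for the real data: 360 rows, 190 columns (hypothetical values).
x_data = np.arange(360 * 190).reshape(360, 190)

# Row ranges exactly as written in the MATLAB code (1-based, inclusive).
matlab_ranges = [(1, 20), (46, 65), (91, 110), (136, 155),
                 (181, 200), (226, 245), (271, 290), (316, 335)]

# MATLAB a:b becomes the Python slice (a - 1):b; np.r_ expands the
# slices into one flat index array.
rows = np.r_[tuple(slice(start - 1, stop) for start, stop in matlab_ranges)]

x_data_train = x_data[rows, :]
print(x_data_train.shape)  # (160, 190)
```

With the unshifted slices such as 46:65, each range after the first would contain 19 rows instead of 20.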
More information on x_data and np might be needed to solve this but...
First: You're creating 2 copies of the same list: np and x_data_train
Second: Your indexes on x_data are strange
Third: You're passing 3 objects to append() when it only accepts 2.
I'm pretty sure revisiting your indexes on x_data will be where you solve the current error, but it will result in another error related to passing 2 values to append.
And I'm also sure you want
x_data_train.append(object)
not
x_data_train = np.append(object)
and you may actually want
x_data_train.extend([objects])
More on append vs extend here: append vs. extend

Resources