Octave: Differences between struct and cell array - struct

Sparked by this question and posted comments/answers, I came up with another question:
What features are available in Cell arrays that are not in Structures, and viceversa, in Octave?
I could gather the following:
In Cell arrays:
Can operate on full "columns" (fields in the structure lingo) at once.
In Structures:
Have named fields.

I think the best way to answer this, rather than address how they are similar, is to point out how they differ.
Also, since you seem to be drawing equivalents to (and perhaps confusing) concepts from other languages, it may be instructive to point out similarities to constructs from other popular languages, namely R and python.
In all the above languages, there exists the concept of
an "array": a rectangular collection of elements of the same type, which can be one or more dimensions, and typically guaranteed to occupy a contiguous area in memory
a "list": a collection of elements which can be of different types, does not have to be rectangular (i.e. can be 'jagged'), typically only 1D (but can be multidimensional, or contain nested lists), and its elements are not guaranteed to occupy a contiguous area in memory
a "dict": a collection of elements which are like a list, but augmented by the fact that they can be accessed by a 'key', rather than just by index.
a "table" (?): a horizontal concatenation of equal-sized columns, each identified by a 'column header'
Octave
In octave, the closest to the "array" concept is the 'object' array, (where 'object' could be anything, but is typically numerical) e,g, [1,2:3,4].
The closest to a "list" concept is the cell array, e.g. { [1,2], true; 'John' }. To index a cell array and obtain the contents of a cell at a particular index, you use {}. However, octave cell-arrays are slightly different, in that they can also be thought of as 'object arrays' where the object elements are 'cells', which you could think of as references to their contained objects. This means you can construct a cell-array and index it with () as a normal array, returning a sub-array of cells (i.e. another cell-array). Also, a cell can contain another cell-array as its contents (i.e. cell-arrays can be nested).
The closest to a "dict" concept is the struct. This allows you to create an object which can have 'fields', such that for each field you can assign value.
Python
By contrast, in python you don't have arrays. You only have lists and dicts. In order to get array functionality you need to rely on external modules (such as numpy) which take a list as an argument to convert to an array type. Python lists are also always 1D (but can be nested).
Python dicts effectively behave the same way as octave structs. There are some tiny conceptual differences, but they're effectively equivalent constructs.
R
R is probably the bit that's causing the most confusion, because R allows you to allocate names to elements of both arrays and lists, and allows you to access both using either an index or the allocated name.
But, R still has a vector type, e.g. c(1,2,3), which despite the fact that it can also be given names, e.g. c( a=1, b=2, c=3 ), it still requires that elements need to be of the same type, otherwise R will convert to the least common denominator. (e.g. c(1, '2') will convert both elements to strings).
Then, you have lists, which are basically something like 'lists' and 'dicts' combined. If you have list(1, 2, 3), you have 'list' functionality, and if you have list(a=1, b=2, c=3) you have 'dict' functionality. If you access a list element using the [] operator, the output is expressed as another list (in a similar way to how cellarrays in octave can be indexed with () ), whereas if you index a list using the [[]] operator, you get the 'contents' only (similar to if you index a cell-array in octave with {} ).
"Tables": dataframes vs dicts vs structs
Now, in R, you also have dataframes. This is effectively a list with names (i.e. 'dict') where all elements are vectors of the same length (but can be different types), e.g. data.frame( list( a=1:3, b=c('one', 'two', 'three') ) ); note that expressing this as a data.frame rather than plain list simply results in different behaviour (e.g. printing), but otherwise the underlying object is the same (which you can confirm by typing unclass(df).
In python, we can note that a pandas dataframe behaves the same way (i.e. a pandas dataframe is initialized via a dict whose keys contain values that are equally sized vectors).
Therefore since a dataframe is basically a list of equal vectors, the easiest way to have dataframe functionality in octave is to create a struct whose fields are equal sized vectors. Or, if you don't care about fieldnames and are happy to access your contained arrays by "column index", then you can create a cell array and store in each cell your equally-sized numerical 'data' arrays.
Do cells have "columns" in the way implied in the question?
No. If you want to do vectorised operations, you cannot do it across cell-array columns. You need to performed vectorised operations on arrays.
So actually, if what you're looking for is the equivalent of a dataframe, where each "column" represents a numerical vector, the equivalent of that is a struct, where you assign a numerical vector to a field.
In other words the equivalent of dataframes in the various languages are:
Python: pandas.DataFrame( { 'col1': [1,2], 'col2': [3,4] })
R: data.frame( list( col1=c(1,2), col2=c(3,4) ) )
octave: struct( 'col1', [1,2], 'col2', [3,4] )
Having said that, you may prefer a more 'tabular' output. You can either write your own function for this, or try the dataframe package from octave forge, which provides a class for just that.
As an example here's one snippet you could easily convert to a function, and improve on to add all sorts of bells and whistles like colour etc.
fprintf( '%4s %5s %5s\n', '', fieldnames(S){:} ), for i = 1 : length(S.col1), fprintf( '%4d %5.3f %5.3f\n', i, num2cell( structfun(#(x) x(i), S) ){:} ), end
col1 col2
1 1.000 3.000
2 2.000 4.000

Related

Choose smallest value in dictionary

I have a dictionary given in the form {(i1,r1,m1):w, (i2,r2,m1):w, (i1,r1,m2):w ...} where i is the activity, r the type of resource, m the mode, and w is the resources of type r needed of activity i in mode m.
Now I would like to choose for every activity the mode, that requires the least resources (w). If possible, at the end in a list in the form [(i,m),...] for every i.
My tutor suggested to work with np.argmin(), but for this I have to convert the dictionary into an array. So I tried to convert the dictionary into an array:
w_list = list(w.items())
w_array = np.array(w_list)
print(w_array)
array([[(0, 1, 1), 0],
[(0, 2, 1), 0],
[(1, 1, 1), 9],
[(1, 2, 1), 0], ...
However, this array arrangement cannot be used for np.argmin.
Does anyone have any other idea how I can get the desired list mentioned above?
Here's one trivial non-numpy solution - simply create a new dictionary, and fill it with the mode and lowest cost per activity by iterating over the original dict:
w = {(i1,r1,m1): w1, (i2,r2,m1): w2, (i1,r1,m2): w3} #your original dict
result = {}
for (activity, _, mode), requiredResources in w.items():
if activity not in result or result[activity][1] > requiredResources:
result[activity] = mode, requiredResources
Now result holds a mapping from i to a tuple of m and w for the lowest w. In case of ambiguous entries for some i, the first entry in the iteration order will win (and as dicts are unordered, the iteration order is an implementation detail and dependend on things such as the specific keys and the dict size).
If you want to turn this into a list of i and m tuples, simply use a list comprehension:
resultList = [(k, v[0]) for k, v in result.items()]
An observation on the side: when confronted with any python problem, some people instantly recommend using numpy or similar libraries. IMO this is simply an expression of their own inexperience or ignorance - in many cases numpy is not just unneccessary, but actively detrimental if you don't know what you're doing.
If you're intending to seriously work with python, you would do well to first master the basics of the language (functions, classes, lists, dictionaries, loops, comprehensions, basic variable scoping rules), and get a rough overview of the vast python standard library - at least enough to know how to look up if something you need is readily available in some built-in module. Then next time when you need some functionality, you will be better equipped for deciding if this is something you can easily implement yourself (potentially with help from the standard lib), or if it makes sense to use functionality from external libraries such as numpy.

Split a sparse square matrix into blocks

I have a banded sparse square matrix , A, of type <class 'scipy.sparse.csr.csr_matrix'> and size = 400 x 400. I'd like to split this into block square matrices of size 200 x 200 each. For instance, the first block
block1 = A[0:200, 0:200]
block2 = A[100:300, 100:300]
block3 = A[200:400, 200:400]
The same information about the slices is stored in a list of tuples.
[(0,200), (100, 300), (200, 400)]
Suggestions on how to split the spare square matrix will be really helpful.
You can convert to a regular array and then split it:
from scipy.sparse import csr_matrix
import numpy as np
row = np.arange(400)[::2]
col = np.arange(400)[1::2]
data = np.random.randint(1, 10, (200))
compressed_matrix = csr_matrix((data, (row, col)), shape=(400, 400))
# Convert to a regular array
m = compressed_matrix.toarray()
# Split the matrix
sl = [(0,200), (100, 300), (200, 400)]
blocks = [m[i, i] for i in map(lambda x: slice(*x), sl)]
And if you want you can convert back each block to a compressed matrix:
blocks_csr = list(map(csr_matrix, blocks))
CODE EXPLANATION
The creation of the blocks is based on a list comprehension and basic slicing.
Each input tuple is converted to a slice object, only to create a series of row and column indexes, corresponding to that of the elements to be selected; in this answer, this is sufficient to select the requested block squared matrix. Slice objects are generated when extended indexing syntax is used: To be clear, a[start:stop:step] will create a slice object equivalent to slice(start, stop, step). In our case, they are used to dynamically change the indexes to be selected, according to the matrix we want to extract. So, if you consider the first block, m[i, i] is equivalent to m[0:200, 0:200].
Slice objects are a form of basic indexing, so a view of the original array is created, rather than a copy (this means that if you modify the view, also the original array will be modified: you can easily create a copy of the original data using the copy method of the numpy array).
The map object is used to generate slice objects from the input tuples; map applies the function provided as its first argument to all the elements of its second argument.
lambda is used to create an anonymous function, i.e., a function defined without a name. Anonymous functions are useful to accomplish specific tasks that you do not want to code in a standard function, because you are not going to reuse them or you need only for a short period of time, like in the example of this code. They make code more compact rather than defining the correspondent functions.
*x is called unpacking, i.e you extract, unpack elements from the tuple. Suppose you have a function f and a tuple a = (1, 2, 3), then f(*a) is equivalent to f(1, 2, 3) (as you can see, you can think of unpacking as removing a level of parentheses).
So, looking back at the code:
blocks = [ # this is a list comprehension
m[i, i] # basic slicing of the input array
for i in map( # map apply a function to all the item of a list
lambda x: slice(*x), sl # creating a slice object out of the provided index ranges
)
]

Copying data from one tensor to another using bit masking

import numpy as np
import torch
a = torch.zeros(5)
b = torch.tensor(tuple((0,1,0,1,0)),dtype=torch.uint8)
c= torch.tensor([7.,9.])
print(a[b].size())
a[b]=c
print(a)
torch.Size([2])tensor([0., 7., 0., 9., 0.])
I am struggling to understand how this works. I initially thought the above code was using Fancy indexing but I realised that values from c tensors are getting copied corresponding to the indices marked 1. Also, if I don't specify dtype of b as uint8 then the above code does not work. Can someone please explain me the mechanism of the above code.
Indexing with arrays works the same as in numpy and most other vectorized math packages I am aware of. There are two cases:
When b is of type uint8 (think boolean, pytorch doesn't distinguish bool from uint8), a[b] is a 1-d array containing the subset of values of a (a[i]) for which the corresponding in b (b[i]) was nonzero. These values are aliased to the original a so if you modify them, their corresponding locations will change as well.
The alternative type you can use for indexing is an array of int64, in which case a[b] creates an array of shape (*b.shape, *a.shape[1:]). Its structure is as if each element of b (b[i]) was replaced by a[i]. In other words, you create a new array by specifying from which indexes of a should the data be fetched. Again, the values are aliased to the original a, so if you modify a[b] the values of a[b[i]], for each i, will change. An example usecase is shown in this question.
These two modes are explained for numpy in integer array indexing and boolean array indexing, where for the latter you have to keep in mind that pytorch uses uint8 in place of bool.
Also, if your goal is to copy data from one tensor to another you have to keep in mind that an operation like a[ixs] = b[ixs] is an in-place operation (a is modified in place), which my not play well with autograd. If you want to do out of place masking, use torch.where. An example usecase is shown in this answer.

TypeError: append() missing 1 required positional argument: 'values'

I have variable 'x_data' sized 360x190, I am trying to select particular rows of data.
x_data_train = []
x_data_train = np.append([x_data_train,
x_data[0:20,:],
x_data[46:65,:],
x_data[91:110,:],
x_data[136:155,:],
x_data[181:200,:],
x_data[226:245,:],
x_data[271:290,:],
x_data[316:335,:]],axis = 0)
I get the following error :
TypeError: append() missing 1 required positional argument: 'values'
where did I go wrong ?
If I am using
x_data_train = []
x_data_train.append(x_data[0:20,:])
x_data_train.append(x_data[46:65,:])
x_data_train.append(x_data[91:110,:])
x_data_train.append(x_data[136:155,:])
x_data_train.append(x_data[181:200,:])
x_data_train.append(x_data[226:245,:])
x_data_train.append(x_data[271:290,:])
x_data_train.append(x_data[316:335,:])
the size of the output is 8 instead of 160 rows.
Update:
In matlab, I will load the text file and x_data will be variable having 360 rows and 190 columns.
If I want to select 1 to 20 , 46 to 65, ... rows of data , I simply give
x_data_train = xdata([1:20,46:65,91:110,136:155,181:200,226:245,271:290,316:335], :);
the resulting x_data_train will be the array of my desired.
How can do that in python because it results array of 8 subsets of array for 20*192 each, but I want it to be one array 160*192
Short version: the most idiomatic and fastest way to do what you want in python is this (assuming x_data is a numpy array):
x_data_train = np.vstack([x_data[0:20,:],
x_data[46:65,:],
x_data[91:110,:],
x_data[136:155,:],
x_data[181:200,:],
x_data[226:245,:],
x_data[271:290,:],
x_data[316:335,:]])
This can be shortened (but made very slightly slower) by doing:
xdata[np.r_[0:20,46:65,91:110,136:155,181:200,226:245,271:290,316:335], :]
For your case where you have a lot of indices I think it helps readability, but in cases where there are fewer indices I would use the first approach.
Long version:
There are several different issues at play here.
First, in python, [] makes a list, not an array like in MATLAB. Lists are more like 1D cell arrays. They can hold any data type, including other lists, but they cannot have multiple dimensions. The equivalent of MATLAB matrices in Python are numpy arrays, which are created using np.array.
Second, [x, y] in Python always creates a list where the first element is x and the second element is y. In MATLAB [x, y] can do one of several completely different things depending on what x and y are. In your case, you want to concatenate. In Python, you need to explicitly concatenate. For two lists, there are several ways to do that. The simplest is using x += y, which modifies x in-place by putting the contents of y at the end. You can combine multiple lists by doing something like x += y + z + w. If you want to keep x, unchanged, you can assign to a new variable using something like z = x + y. Finally, you can use x.extend(y), which is roughly equivalent to x += y but works with some data types besides lists.
For numpy arrays, you need to use a slightly different approach. While Python lists can be modified in-place, strictly speaking neither MATLAB matrices nor numpy arrays can be. MATLAB pretends to allow this, but it is really creating a new matrix behind-the-scenes (which is why you get a warning if you try to resize a matrix in a loop). Numpy requires you to be more explicit about creating a new array. The simplest approach is to use np.hstack, which concatenates two arrays horizontally (or np.vstack or np.dstack for vertical and depth concatenation, respectively). So you could do z = np.hstack([v, w, x, y]). There is an append method and function in numpy, but it almost never works in practice so don't use it (it requires careful memory management that is more trouble than it is worth).
Third, what append does is to create one new element in the target list, and put whatever variable append is called with in that element. So if you do x.append([1,2,3]), it adds one new element to the end of list x containing the list [1,2,3]. It would be more like x = [x, {{1,2,3}}}, where x is a cell array.
Fourth, Python makes heavy use of "methods", which are basically functions attached to data (it is a bit more complicated than that in practice, but those complexities aren't really relevant here). Recent versions of MATLAB has added them as well, but they aren't really integrated into MATLAB data types like they are in Python. So where in MATLAB you would usually use sum(x), for numpy arrays you would use x.sum(). In this case, assuming you were doing appending (which you aren't) you wouldn't use the np.append(x, y), you would use x.append(y).
Finally, in MATLAB x:y creates a matrix of values from x to y. In Python, however, it creates a "slice", which doesn't actually contain all the values and so can be processed much more quickly by lists and numpy arrays. However, you can't really work with multiple slices like you do in your example (nor does it make sense to because slices in numpy don't make copies like they do in MATLAB, while using multiple indexes does make a copy). You can get something close to what you have in MATLAB using np.r_, which creates a numpy array based on indexes and slices. So to reproduce your example in numpy, where xdata is a numpy array, you can do xdata[np.r_[1:20,46:65,91:110,136:155,181:200,226:245,271:290,316:335], :]
More information on x_data and np might be needed to solve this but...
First: You're creating 2 copies of the same list: np and x_data_train
Second: Your indexes on x_data are strange
Third: You're passing 3 objects to append() when it only accepts 2.
I'm pretty sure revisiting your indexes on x_data will be where you solve the current error, but it will result in another error related to passing 2 values to append.
And I'm also sure you want
x_data_train.append(object)
not
x_data_train = np.append(object)
and you may actually want
x_data_train.extend([objects])
More on append vs extend here: append vs. extend

Vectors and Cells/DataFrame

In MATLAB, I could combine a column vector of numbers and string to a 3x1 Cell to produce a single cell (C) of 3x3 dimension as shown below
C =
8 'CNDA-ESP_SIMRAD_MDS_DGPS_2000_2001.xlsx' 415x3 double
2 'CNDA-ESP_SIMRAD_MDS_DGPS_2006_2007.xlsx' 986x3 double
2 'CNDA-ESP_SIMRAD_MDS_DGPS_2010_2011.xlsx' 704x3 double
Is it possible to do this in Python?
Long before MATLAB had cells, Python had lists. They are 1d, but can contain other lists. In fact without numpy nested lists are used to create 'matrices'.
A numpy arrays with object dtype contain pointers to objects, so they are similar to lists. You can't append to them as you can with lists. But they are multidimensional like regular arrays. So depending on your perspective they are enhanced lists or degraded ones.
Constructing a list is trival
alist = [array1, array2, array3, ...]
Constructing an object array can be trickier. If the subarrays or objects differ in size and type, it is easy - just wrap the list version in np.array(alist, dtype=object).
But if the subarrays all have the same shape, np.array creates a higher dimensional array from them. The best way around that is to create 'blank' array of the right shape, and assign values.
I've discussed these issues at:
Can I construct a numpy object zero-d array from its value in a single expression?
There I mention scipy.io.loadmat, which is capable of loading MATLAB files. You might try writing your cells to a .mat, and load them in Python. That will give you ideas of how that function tries to create equivalents.
More on object arrays v lists at:
Irregular Numpy matrix

Resources