Choose smallest value in dictionary - python-3.x

I have a dictionary of the form {(i1,r1,m1): w, (i2,r2,m1): w, (i1,r1,m2): w, ...} where i is the activity, r is the type of resource, m is the mode, and w is the amount of resource r needed by activity i in mode m.
Now I would like to choose, for every activity, the mode that requires the least resources (w), and if possible end up with a list of the form [(i,m),...] with one entry per activity i.
My tutor suggested working with np.argmin(), but for that I first have to convert the dictionary into an array, so I tried:
w_list = list(w.items())
w_array = np.array(w_list)
print(w_array)
array([[(0, 1, 1), 0],
       [(0, 2, 1), 0],
       [(1, 1, 1), 9],
       [(1, 2, 1), 0], ...
However, this array arrangement cannot be used for np.argmin.
Does anyone have any other idea how I can get the desired list mentioned above?

Here's one trivial non-numpy solution - simply create a new dictionary, and fill it with the mode and lowest cost per activity by iterating over the original dict:
w = {(i1,r1,m1): w1, (i2,r2,m1): w2, (i1,r1,m2): w3}  # your original dict
result = {}
for (activity, _, mode), requiredResources in w.items():
    if activity not in result or result[activity][1] > requiredResources:
        result[activity] = mode, requiredResources
Now result holds a mapping from i to a tuple of m and w for the lowest w. In case of ties for some i, the first entry encountered in the iteration wins (in Python 3.7+ dicts preserve insertion order, so that is the first such entry that was inserted; in older versions the iteration order is an implementation detail and depends on things such as the specific keys and the dict size).
If you want to turn this into a list of i and m tuples, simply use a list comprehension:
resultList = [(k, v[0]) for k, v in result.items()]
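For concreteness, here is a small self-contained version of the same idea with made-up activities, resources, modes and requirements (the numbers are purely illustrative):
w = {(0, 1, 1): 4, (0, 1, 2): 2, (1, 1, 1): 9, (1, 1, 2): 3, (1, 2, 2): 1}  # (activity, resource, mode) -> required amount
result = {}
for (activity, _, mode), requiredResources in w.items():
    if activity not in result or result[activity][1] > requiredResources:
        result[activity] = mode, requiredResources
resultList = [(k, v[0]) for k, v in result.items()]
print(resultList)  # [(0, 2), (1, 2)]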
An observation on the side: when confronted with any Python problem, some people instantly recommend using numpy or similar libraries. IMO this is simply an expression of their own inexperience or ignorance - in many cases numpy is not just unnecessary, but actively detrimental if you don't know what you're doing.
If you're intending to seriously work with python, you would do well to first master the basics of the language (functions, classes, lists, dictionaries, loops, comprehensions, basic variable scoping rules), and get a rough overview of the vast python standard library - at least enough to know how to look up if something you need is readily available in some built-in module. Then next time when you need some functionality, you will be better equipped for deciding if this is something you can easily implement yourself (potentially with help from the standard lib), or if it makes sense to use functionality from external libraries such as numpy.

Related

np.where issue above a certain value (#Numpy)

I'm facing two issues in the following snippet using np.where (looking for the indexes where A[:, 0] is identical to B):
a Numpy error when n is above a certain value (see the warning below)
it is quite slow
DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
So I'm wondering what I'm missing and/or misunderstanding, how to fix it, and how to speed up the code. This is a basic example I've made to mimic my code, but in reality I'm dealing with arrays having (dozens of) millions of rows.
Thanks for your support
Paul
import numpy as np
import time
n=100_000 # with n=10_000 it is ok but quite slow
m=2_000_000
#matrix A
# A=np.random.random ((n, 4))
A = np.arange(1, 4*n+1, dtype=np.uint64).reshape((n, 4), order='F')
#Matrix B
B=np.random.randint(1, m+1, size=(m), dtype=np.uint64)
B=np.unique(B) # duplicate values are generally generated, so the real size remains lower than m
# use of np.where
t0=time.time()
ind=np.where(A[:, 0].reshape(-1, 1) == B)
# ind2=np.where(B == A[:, 0].reshape(-1, 1))
t1=time.time()
print(f"duration={t1-t0}")
In your current implementation, A[:, 0] is just
np.arange(1, n + 1, dtype=np.uint64)
And if you are interested only in the row indexes where A[:, 0] is in B, then you can get them like this:
row_indices = np.where(np.isin(A[:, 0], B))[0]
If you then want to select the rows of A with these indices, you don't even have to convert the boolean mask to index locations. You can just select the rows with the boolean mask: A[np.isin(A[:, 0], B)]
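For illustration, a minimal sketch of the np.isin approach, using the same construction as in the question but with much smaller (arbitrary) sizes:
import numpy as np
n = 1_000
m = 20_000
A = np.arange(1, 4*n + 1, dtype=np.uint64).reshape((n, 4), order='F')
B = np.unique(np.random.randint(1, m + 1, size=m, dtype=np.uint64))
mask = np.isin(A[:, 0], B)       # boolean mask with one entry per row of A
row_indices = np.where(mask)[0]  # index positions, if you really need them
selected_rows = A[mask]          # or select the matching rows directly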
There are better ways to select random elements from an array. For example, you could use numpy.random.Generator.choice with replace=False. Also, Numpy: Get random set of rows from 2D array.
I feel there is almost certainly a better way to do the whole thing that you are trying to do with these index locations.
I recommend you study the Numpy User Guide and the Pandas User Guide to see what cool things are available there.
Honestly, with your current implementation you don't even need the first column of A at all: since A[:, 0] is just [1, ..., n], the row index of a matching value v is simply v - 1. Here:
row_indices = B[B <= n] - 1
row_indices.sort()
print(row_indices)

Octave: Differences between struct and cell array

Sparked by this question and posted comments/answers, I came up with another question:
What features are available in Cell arrays that are not in Structures, and vice versa, in Octave?
I could gather the following:
In Cell arrays:
Can operate on full "columns" (fields in the structure lingo) at once.
In Structures:
Have named fields.
I think the best way to answer this, rather than address how they are similar, is to point out how they differ.
Also, since you seem to be drawing equivalents to (and perhaps confusing) concepts from other languages, it may be instructive to point out similarities to constructs from other popular languages, namely R and python.
In all the above languages, there exists the concept of
an "array": a rectangular collection of elements of the same type, which can be one or more dimensions, and typically guaranteed to occupy a contiguous area in memory
a "list": a collection of elements which can be of different types, does not have to be rectangular (i.e. can be 'jagged'), typically only 1D (but can be multidimensional, or contain nested lists), and its elements are not guaranteed to occupy a contiguous area in memory
a "dict": a collection of elements which are like a list, but augmented by the fact that they can be accessed by a 'key', rather than just by index.
a "table" (?): a horizontal concatenation of equal-sized columns, each identified by a 'column header'
Octave
In Octave, the closest to the "array" concept is the 'object' array (where 'object' could be anything, but is typically numerical), e.g. [1,2:3,4].
The closest to a "list" concept is the cell array, e.g. { [1,2], true, 'John' }. To index a cell array and obtain the contents of a cell at a particular index, you use {}. However, Octave cell arrays are slightly different, in that they can also be thought of as 'object arrays' where the object elements are 'cells', which you could think of as references to their contained objects. This means you can construct a cell array and index it with () as a normal array, returning a sub-array of cells (i.e. another cell array). Also, a cell can contain another cell array as its contents (i.e. cell arrays can be nested).
The closest to a "dict" concept is the struct. This allows you to create an object which can have 'fields', such that for each field you can assign a value.
Python
By contrast, in python you don't have arrays. You only have lists and dicts. In order to get array functionality you need to rely on external modules (such as numpy) which take a list as an argument to convert to an array type. Python lists are also always 1D (but can be nested).
Python dicts effectively behave the same way as octave structs. There are some tiny conceptual differences, but they're effectively equivalent constructs.
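To make the comparison concrete, here is a small Python sketch (the field names are made up for illustration):
# a dict used much like an Octave struct with fields 'name' and 'scores'
person = {'name': 'John', 'scores': [1.5, 2.0, 3.5]}
person['age'] = 40             # add a new "field" on the fly
print(person['scores'][1])     # access the contents of a field -> 2.0
print(list(person.keys()))     # roughly what fieldnames() gives you in Octave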
R
R is probably the bit that's causing the most confusion, because R allows you to allocate names to elements of both arrays and lists, and allows you to access both using either an index or the allocated name.
But R still has a vector type, e.g. c(1,2,3), which, despite the fact that it can also be given names, e.g. c( a=1, b=2, c=3 ), still requires that elements be of the same type; otherwise R will convert to the least common denominator (e.g. c(1, '2') will convert both elements to strings).
Then, you have lists, which are basically something like 'lists' and 'dicts' combined. If you have list(1, 2, 3), you have 'list' functionality, and if you have list(a=1, b=2, c=3) you have 'dict' functionality. If you access a list element using the [] operator, the output is expressed as another list (in a similar way to how cellarrays in octave can be indexed with () ), whereas if you index a list using the [[]] operator, you get the 'contents' only (similar to if you index a cell-array in octave with {} ).
"Tables": dataframes vs dicts vs structs
Now, in R, you also have dataframes. This is effectively a list with names (i.e. a 'dict') where all elements are vectors of the same length (but can be of different types), e.g. data.frame( list( a=1:3, b=c('one', 'two', 'three') ) ); note that expressing this as a data.frame rather than a plain list simply results in different behaviour (e.g. printing), but otherwise the underlying object is the same (which you can confirm by typing unclass(df)).
In python, we can note that a pandas dataframe behaves the same way (i.e. a pandas dataframe is initialized via a dict whose keys contain values that are equally sized vectors).
Therefore since a dataframe is basically a list of equal vectors, the easiest way to have dataframe functionality in octave is to create a struct whose fields are equal sized vectors. Or, if you don't care about fieldnames and are happy to access your contained arrays by "column index", then you can create a cell array and store in each cell your equally-sized numerical 'data' arrays.
Do cells have "columns" in the way implied in the question?
No. If you want to do vectorised operations, you cannot do it across cell-array columns. You need to perform vectorised operations on arrays.
So actually, if what you're looking for is the equivalent of a dataframe, where each "column" represents a numerical vector, the equivalent of that is a struct, where you assign a numerical vector to a field.
In other words the equivalent of dataframes in the various languages are:
Python: pandas.DataFrame( { 'col1': [1,2], 'col2': [3,4] })
R: data.frame( list( col1=c(1,2), col2=c(3,4) ) )
octave: struct( 'col1', [1,2], 'col2', [3,4] )
Having said that, you may prefer a more 'tabular' output. You can either write your own function for this, or try the dataframe package from octave forge, which provides a class for just that.
As an example, here's one snippet (assuming S is the struct from the example above) that you could easily convert to a function, and improve on to add all sorts of bells and whistles like colour etc.
fprintf( '%4s %5s %5s\n', '', fieldnames(S){:} ), for i = 1 : length(S.col1), fprintf( '%4d %5.3f %5.3f\n', i, num2cell( structfun(@(x) x(i), S) ){:} ), end
col1 col2
1 1.000 3.000
2 2.000 4.000

TypeError: append() missing 1 required positional argument: 'values'

I have a variable 'x_data' of size 360x190, and I am trying to select particular rows of the data.
x_data_train = []
x_data_train = np.append([x_data_train,
                          x_data[0:20,:],
                          x_data[46:65,:],
                          x_data[91:110,:],
                          x_data[136:155,:],
                          x_data[181:200,:],
                          x_data[226:245,:],
                          x_data[271:290,:],
                          x_data[316:335,:]], axis=0)
I get the following error :
TypeError: append() missing 1 required positional argument: 'values'
Where did I go wrong?
If I am using
x_data_train = []
x_data_train.append(x_data[0:20,:])
x_data_train.append(x_data[46:65,:])
x_data_train.append(x_data[91:110,:])
x_data_train.append(x_data[136:155,:])
x_data_train.append(x_data[181:200,:])
x_data_train.append(x_data[226:245,:])
x_data_train.append(x_data[271:290,:])
x_data_train.append(x_data[316:335,:])
the size of the output is 8 instead of 160 rows.
Update:
In MATLAB, I load the text file and x_data will be a variable having 360 rows and 190 columns.
If I want to select rows 1 to 20, 46 to 65, ... of the data, I simply give
x_data_train = xdata([1:20,46:65,91:110,136:155,181:200,226:245,271:290,316:335], :);
and the resulting x_data_train will be the array I want.
How can I do that in Python? The approach above results in an array of 8 subsets of about 20*192 each, but I want it to be one 160*192 array.
Short version: the most idiomatic and fastest way to do what you want in python is this (assuming x_data is a numpy array):
x_data_train = np.vstack([x_data[0:20,:],
                          x_data[46:65,:],
                          x_data[91:110,:],
                          x_data[136:155,:],
                          x_data[181:200,:],
                          x_data[226:245,:],
                          x_data[271:290,:],
                          x_data[316:335,:]])
This can be shortened (but made very slightly slower) by doing:
xdata[np.r_[0:20,46:65,91:110,136:155,181:200,226:245,271:290,316:335], :]
For your case where you have a lot of indices I think it helps readability, but in cases where there are fewer indices I would use the first approach.
Long version:
There are several different issues at play here.
First, in python, [] makes a list, not an array like in MATLAB. Lists are more like 1D cell arrays. They can hold any data type, including other lists, but they cannot have multiple dimensions. The equivalent of MATLAB matrices in Python are numpy arrays, which are created using np.array.
Second, [x, y] in Python always creates a list where the first element is x and the second element is y. In MATLAB [x, y] can do one of several completely different things depending on what x and y are. In your case, you want to concatenate. In Python, you need to explicitly concatenate. For two lists, there are several ways to do that. The simplest is using x += y, which modifies x in-place by putting the contents of y at the end. You can combine multiple lists by doing something like x += y + z + w. If you want to keep x unchanged, you can assign to a new variable using something like z = x + y. Finally, you can use x.extend(y), which is roughly equivalent to x += y but works with some data types besides lists.
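For example, a tiny sketch of those list operations:
x = [1, 2]
y = [3, 4]
x += y            # x is now [1, 2, 3, 4], modified in place
z = x + [5, 6]    # new list [1, 2, 3, 4, 5, 6]; x is unchanged
x.extend([7, 8])  # same effect as x += [7, 8]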
For numpy arrays, you need to use a slightly different approach. While Python lists can be modified in-place, strictly speaking neither MATLAB matrices nor numpy arrays can be. MATLAB pretends to allow this, but it is really creating a new matrix behind the scenes (which is why you get a warning if you try to resize a matrix in a loop). Numpy requires you to be more explicit about creating a new array. The simplest approach is to use np.hstack, which concatenates arrays horizontally (or np.vstack or np.dstack for vertical and depth concatenation, respectively). So you could do z = np.hstack([v, w, x, y]). There is an append function in numpy, but it almost never works well in practice so don't use it (it copies the whole array on every call, which is more trouble than it is worth).
Third, what append does is to create one new element in the target list, and put whatever variable append is called with in that element. So if you do x.append([1,2,3]), it adds one new element to the end of list x containing the list [1,2,3]. It would be more like x = [x, {{1,2,3}}], where x is a cell array.
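A quick sketch of that behaviour:
x = [0]
x.append([1, 2, 3])
print(x)        # [0, [1, 2, 3]] -- one new element that holds the whole list
print(len(x))   # 2, not 4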
Fourth, Python makes heavy use of "methods", which are basically functions attached to data (it is a bit more complicated than that in practice, but those complexities aren't really relevant here). Recent versions of MATLAB have added them as well, but they aren't really integrated into MATLAB data types like they are in Python. So where in MATLAB you would usually use sum(x), for numpy arrays you would use x.sum(). In this case, assuming you were appending to a list (which you aren't), you wouldn't use np.append(x, y), you would use x.append(y).
Finally, in MATLAB x:y creates a matrix of values from x to y. In Python, however, it creates a "slice", which doesn't actually contain all the values and so can be processed much more quickly by lists and numpy arrays. However, you can't really work with multiple slices like you do in your example (nor does it make sense to because slices in numpy don't make copies like they do in MATLAB, while using multiple indexes does make a copy). You can get something close to what you have in MATLAB using np.r_, which creates a numpy array based on indexes and slices. So to reproduce your example in numpy, where xdata is a numpy array, you can do xdata[np.r_[1:20,46:65,91:110,136:155,181:200,226:245,271:290,316:335], :]
More information on x_data and np might be needed to solve this but...
First: You're creating 2 copies of the same list: np and x_data_train
Second: Your indexes on x_data are strange
Third: You're passing everything to np.append() as a single first argument, so its required second argument (values) is missing - which is exactly what the error message says.
I'm pretty sure revisiting your indexes on x_data is worthwhile, but the current error will only go away once you pass the arguments to append correctly.
And I'm also sure you want
x_data_train.append(object)
not
x_data_train = np.append(object)
and you may actually want
x_data_train.extend([objects])
More on append vs extend here: append vs. extend

(Incremental)PCA's Eigenvectors are not transposed but should be?

When we posted a homework assignment about PCA we told the course participants to pick any way of calculating the eigenvectors they found. They found multiple ways: eig, eigh (our favorite was svd). In a later task we told them to use the PCAs from scikit-learn - and were surprised that the results differed a lot more than we expected.
I toyed around a bit and we posted an explanation to the participants that either solution was correct and probably just suffered from numerical instabilities in the algorithms. However, recently I picked that file up again during a discussion with a co-worker and we quickly figured out that there's an interesting subtle change to make to get all results to be almost equivalent: Transpose the eigenvectors obtained from the SVD (and thus from the PCAs).
A bit of code to show this:
def pca_eig(data):
    """Uses numpy.linalg.eig to calculate the PCA."""
    data = data.T @ data  # scatter matrix X^T X
    val, vec = np.linalg.eig(data)
    return val, vec
versus
def pca_svd(data):
    """Uses numpy.linalg.svd to calculate the PCA."""
    u, s, v = np.linalg.svd(data)
    return s ** 2, v
Does not yield the same result. Changing the return of pca_svd to s ** 2, v.T, however, works! It makes perfect sense following the definition on Wikipedia: the SVD of X follows X = UΣW^T, where
the right singular vectors W of X are equivalent to the eigenvectors of X^T X
So to get the eigenvectors in the same (column) orientation as np.linalg.eig, we need to transpose the output v of np.linalg.svd(...).
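To make that concrete, here is a small check (not from the original post) that the rows of v returned by np.linalg.svd match, up to sign, the column eigenvectors of X^T X; I use np.linalg.eigh here since X^T X is symmetric, and X is assumed to be centered:
import numpy as np
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X -= X.mean(axis=0)                              # center the data
val, vec = np.linalg.eigh(X.T @ X)               # eigenvectors in the columns of vec
u, s, v = np.linalg.svd(X, full_matrices=False)  # right singular vectors in the rows of v
order = np.argsort(val)[::-1]                    # eigh returns ascending order, svd descending
vec = vec[:, order]
print(np.allclose(s ** 2, val[order]))           # True: squared singular values are the eigenvalues
print(np.allclose(np.abs(vec), np.abs(v.T)))     # True up to sign: columns of vec are the rows of v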
Unless there is something else going on? Anyway, the PCA and IncrementalPCA both show wrong results (or eig is wrong? I mean, transposing that yields the same equality), and looking at the code for PCA reveals that they are doing it as I did it initially:
U, S, V = linalg.svd(X, full_matrices=False)
# flip eigenvectors' sign to enforce deterministic output
U, V = svd_flip(U, V)
components_ = V
I created a little gist demonstrating the differences (nbviewer), the first with PCA and IncPCA as they are (also no transposition of the SVD), the second with transposed eigenvectors:
Comparison without transposition of SVD/PCAs (normalized data)
Comparison with transposition of SVD/PCAs (normalized data)
As one can clearly see, in the upper image the results are not really great, while the lower image only differs in some signs, thus mirroring the results here and there.
Is this really wrong and a bug in scikit-learn? More likely I am using the math wrong – but what is right? Can you please help me?
If you look at the documentation, it's pretty clear from the shape of components_, which is (n_components, n_features), that the eigenvectors are in the rows, not the columns.
The point of the sklearn PCA is that you can use the transform method to do the correct transformation.
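For example, a minimal sketch (not from the original answer) of what transform does with the default whiten=False:
import numpy as np
from sklearn.decomposition import PCA
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
pca = PCA(n_components=2).fit(X)
print(pca.components_.shape)                  # (2, 5): one eigenvector per row
projected = pca.transform(X)                  # project onto the principal components
manual = (X - pca.mean_) @ pca.components_.T  # the same projection done by hand
print(np.allclose(projected, manual))         # True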

Typed Lists in Theano

Consider the following machine translation problem. Let s be a source sentence and t be a target sentence. Both sentences are conceptually represented as lists of indices, where the indices correspond to the position of the words in the associated dictionaries. Example:
s = [34, 68, 91, 20]
t = [29, 0, 43]
Note that s and t don't necessarily have the same length. Now let S and T be sets of such instances. In other words, they are a parallel corpus. Example:
S = [[34, 68, 91, 20], [4, 7, 1]]
T = [[29, 0, 43], [190, 37, 25, 60]]
Note that not all s's in S have the same length. That is, sentences have variable numbers of words.
I am implementing a machine translation system in Theano, and the first design decision is what kind of data structures to use for S and T. From one of the answers posted on Matrices with different row lengths in numpy, I learnt that typed lists are a good solution for storing variable-length tensors.
However, I realise that they complicate my code a lot. Let me give you one example. Say that we have two typed lists y and p_y_given_x and aim to calculate the negative log likelihood. If they were regular tensors, a simple statement like this would suffice:
loss = t.mean(t.nnet.categorical_crossentropy(p_y_given_x, y))
But categorical_crossentropy can only be applied to tensors, so in case of typed lists I have to iterate over them and apply the function separately to each element:
_loss, _ = theano.scan(fn=lambda i, p, y: t.nnet.categorical_crossentropy(p[i], y[i]),
                       non_sequences=[p_y_given_x, y],
                       sequences=[t.arange(y.__len__(), dtype='int64')])
loss = t.mean(_loss)
On top of making my code more and more messy, these problems propagate. For instance, if I want to calculate the gradient of the loss, the following doesn't work anymore:
grad_params = t.grad(loss, params)
I don't know exactly why it doesn't work. I'm sure it has to do with the type of loss, but I am not interested in investigating any further how I could make it work. The mess is growing exponentially, and what I would like is to know whether I am using typed lists in the wrong way, or if it is time to give up on them because they are not well enough supported yet.
Typed lists aren't used by anybody yet. But the idea behind them is that you iterate over them with scan, one step per sentence, and do everything you need inside that single scan. You don't do one scan per operation.
So scan is only used to iterate over the examples in the minibatch, and the body of the scan is everything that gets done on one example.
We haven't tested typed lists with grad yet. It is possible that some implementations are missing.
