Error when using an awkward array with an index array - python-3.x

I currently have a list of values and an awkward array of integer values. I want the same dimension awkward array, but where the values are the indices of the "values" arrays corresponding with the integer values of the awkward array. For instance:
values = ak.Array(np.random.rand(100))
arr = ak.Array((np.random.randint(0, 100, 33), np.random.randint(0, 100, 125)))
I want something like values[arr], but that gives the following error:
>>> values[arr]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda3\lib\site-packages\awkward\highlevel.py", line 943, in __getitem__
return ak._util.wrap(self._layout[where], self._behavior)
ValueError: cannot fit jagged slice with length 2 into RegularArray of size 100
If I run it with a loop, I get back what I want:
>>> values = ([values[i] for i in arr])
>>> values
[<Array [0.842, 0.578, 0.159, ... 0.726, 0.702] type='33 * float64'>, <Array [0.509, 0.45, 0.202, ... 0.906, 0.367] type='125 * float64'>]
Is there another way to do this, or is this it? I'm afraid it'll be too slow for my application.
Thanks!

If you're trying to avoid Python for loops for performance, note that the first line casts a NumPy array as Awkward with ak.from_numpy (no loop, very fast):
>>> values = ak.Array(np.random.rand(100))
but the second line iterates over data in Python (has a slow loop):
>>> arr = ak.Array((np.random.randint(0, 100, 33), np.random.randint(0, 100, 125)))
because a tuple of two NumPy arrays is not a NumPy array. It's a generic iterable, and the constructor falls back to ak.from_iter.
On your main question, the reason that arr doesn't slice values is because arr is a jagged array and values is not:
>>> values
<Array [0.272, 0.121, 0.167, ... 0.152, 0.514] type='100 * float64'>
>>> arr
<Array [[15, 24, 9, 42, ... 35, 75, 20, 10]] type='2 * var * int64'>
Note the types: values has type 100 * float64 and arr has type 2 * var * int64. There's no rule for values[arr].
Since it looks like you want to slice values with arr[0] and then arr[1] (from your list comprehension), it could be done in a vectorized way by duplicating values for each element of arr, then slicing.
>>> # The np.newaxis is to give values a length-1 dimension before concatenating.
>>> duplicated = ak.concatenate([values[np.newaxis]] * 2)
>>> duplicated
<Array [[0.272, 0.121, ... 0.152, 0.514]] type='2 * 100 * float64'>
Now duplicated has length 2 and one level of nesting, just like arr, so arr can slice it. The resulting array also has length 2, but the length of each sublist is the length of each sublist in arr, rather than 100.
>>> duplicated[arr]
<Array [[0.225, 0.812, ... 0.779, 0.665]] type='2 * var * float64'>
>>> ak.num(duplicated[arr])
<Array [33, 125] type='2 * int64'>
If you're scaling up from 2 such lists to a large number, then this would eat up a lot of memory. Then again, the size of the output of this operation would also scale as "length of values" × "length of arr". If this "2" is not going to scale up (if it will be at most thousands, not millions or more), then I wouldn't worry about the speed of the Python for loop. Python scales well for thousands, but not billions (depending, of course, on the size of the things being scaled!).

Related

Stacking up dataframes in a 3-dimenional numpy array

I have several pandas dataframe that I would like to stack them up using numpy as a three-dimensional numpy array. I could manually do the job using the following code:
arr = np.array([df1.values, df2.values], dtype="object")
However, since I have many dataframes, I can neither write this line for all the dataframes nor automate it.
I tried to use append function (np.append(df1.values, df2['1002'].values)) but it flattens dataframes and ignores their structure. What I want is a three-dimensional numpy array where the first dimension is the number of dataframes (that I have), the second one is the number of rows in each dataframe, and the third one is the number of columns. In the first example that I mentioned earlier, I get a three-dimensional numpy array. In fact when I run arr.shape the result is (2,) and when I run arr[0].shape and arr[1].shape, I get (26, 7) and (24, 7), respectively which are the structure of their corresponding dataframe.
I even ran np.append(df1.values, df2['1002'].values, axis=0) but I received the error of ValueError: all the input array dimensions for the concatenation axis must match exactly. Is there any way that I can fix this problem and stack up all my dataframes in a 3-dimensional numpy array?
Looks like you start with 2 frames with 7 columns, but different numbers of rows. The equivalent of:
In [1]: arr1 = np.ones((26,7)); arr2 = np.zeros((24,7))
...:
In [2]: arr = np.array([arr1, arr2], object)
In [3]: arr.shape
Out[3]: (2,)
In [4]: arr[0].shape
Out[4]: (26, 7)
You probably tried this without the object and got a 'ragged array' warning. In any case, this is not a 3d array. It is 1d (2,), with two arrays. It's roughly the same as the list
[arr1, arr2]
The np.append docs should make it clear that it flattens the arguments, when you don't specify an axis.
In [6]: np.append(arr1,arr2).shape
Out[6]: (350,)
You could specify an axis, and get a 2d array, where the 50 is the sum of 26 and 24.
In [7]: np.append(arr1,arr2,axis=0).shape
Out[7]: (50, 7)
This is the same as:
In [8]: np.concatenate((arr1,arr2), axis=0).shape
Out[8]: (50, 7)
np.append is poorly name cover for np.concatenate. It is not a list append clone. Learn to use concatenate and its stack derivatives. In
With different dataframe shapes, you cannot make a 3d array. Arrays cannot be 'ragged'.
As for working with more than 2 dataframes, if you can make a list of all the frames, you can use the initial syntax.
alist = []
for a in frame_list:
alist.append(a.values)
arr = np.array(alist, object)
But make such array doesn't do much for you.
If the frames are all the same size, then you can make a 3d array
In [10]: np.array([arr1[:10,:],arr2[:10,:]]).shape
Out[10]: (2, 10, 7)
In [11]: np.stack([arr1[:10,:],arr2[:10,:]]).shape
Out[11]: (2, 10, 7)
But if they differ, stack will complain about that:
In [12]: np.stack([arr1, arr2])
Traceback (most recent call last):
File "<ipython-input-12-23d05d0422dc>", line 1, in <module>
np.stack([arr1, arr2])
File "<__array_function__ internals>", line 180, in stack
File "/usr/local/lib/python3.8/dist-packages/numpy/core/shape_base.py", line 426, in stack
raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape

Numpy append 3D vectors without flattening [duplicate]

This question already has answers here:
How do I add an extra column to a NumPy array?
(17 answers)
Closed 5 years ago.
l have the following vector
video_132.shape
Out[64]: (64, 3)
that l would to add to it a new 3D vector of three values
video_146[1][146][45]
such that
video_146[1][146][45].shape
Out[68]: (3,)
and
video_146[1][146][45]
Out[69]: array([217, 207, 198], dtype=uint8)
when l do the following
np.append(video_132,video_146[1][146][45])
l'm supposed to get
video_132.shape
Out[64]: (65, 3) # originally (64,3)
However l get :
Out[67]: (195,) # 64*3+3=195
It seems that it flattens the vector
How can l do the append by preserving the 3D structure ?
For visual simplicity let's rename video_132 --> a, and video_146[1][146][45] --> b. The particular values aren't important so let's say
In [82]: a = np.zeros((64, 3))
In [83]: b = np.ones((3,))
Then we can append b to a using:
In [84]: np.concatenate([a, b[None, :]]).shape
Out[84]: (65, 3)
Since np.concatenate returns a new array, reassign its return value to a to "append" b to a:
a = np.concatenate([a, b[None, :]])
Code for append:
def append(arr, values, axis=None):
arr = asanyarray(arr)
if axis is None:
if arr.ndim != 1:
arr = arr.ravel()
values = ravel(values)
axis = arr.ndim-1
return concatenate((arr, values), axis=axis)
Note how arr is raveled if no axis is provided
In [57]: np.append(np.ones((2,3)),2)
Out[57]: array([1., 1., 1., 1., 1., 1., 2.])
append is really aimed as simple cases like adding a scalar to a 1d array:
In [58]: np.append(np.arange(3),6)
Out[58]: array([0, 1, 2, 6])
Otherwise the behavior is hard to predict.
concatenate is the base operation (builtin) and takes a list, not just two. So we can collect many arrays (or lists) in one list and do one concatenate at the end of a loop. And since it doesn't tweak the dimensions before hand, it forces us to do that ourselves.
So to add a shape (3,) to a (64,3) we have transform that (3,) into (1,3). append requires the same dimension adjustment as concatenate if we specify the axis.
In [68]: np.append(arr,b[None,:], axis=0).shape
Out[68]: (65, 3)
In [69]: np.concatenate([arr,b[None,:]], axis=0).shape
Out[69]: (65, 3)

python 3 round on one numpy float unexpectedly yielding float

I have an np.array startIdx originating from a list of tuples consisting of integer and float fields:
>>> startIdx, someInt, someFloat = np.array(resultList).T
>>> startIdx
array([0.0, 111.0, 333.0]) # 10 to a few 100 positive values of the order of 100 to 10000
>>> round(startIdx[2])
333.0 # oops
>>> help(round)
Round [...] returns an int when called with one argument, otherwise the same type as the number.
>>> round(np.pi)
3
>>> round(np.pi, 2) # the optional argument is the number of decimal digits
3.14
round([0.0, 111.0, 333.0][2]) # to test whether it's specific for numpy arrays.
333
The float currently works (as index into numpy arrays) but yields a warning:
VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
I could avoid the conversion from tuples to arrays (and int to float) by collecting my results in a grossly oversized record array (with an int field ''startIdx'').
I could use something like int(. + 0.1), which is also ugly. Would int(round(.)) or even int(.) safely yield correct results?
In [70]: startIdx=np.array([0.0, 111.0, 333.0])
In [71]: startIdx
Out[71]: array([ 0., 111., 333.])
If you need an integer array, use astype:
In [72]: startIdx.astype(int)
Out[72]: array([ 0, 111, 333])
not round:
In [73]: np.round(startIdx)
Out[73]: array([ 0., 111., 333.])
np.array(resultList) produces a float dtype array because some values are float. arr=np.array(resultList, dtype='i,i,f') should produce a structured array with integer and float fields, assuming resultList is indeed a list of tuples.
startIdx = arr['f0']
should then be an integer dtype array.
I expect the memory use of the structured array to be the same as for the float one.

slicing error in numpy array

I am trying to run the following code
fs = 1000
data = np.loadtxt("trainingdataset.txt", delimiter=",")
data1 = data[:,2]
data2 = data1.astype(int)
X,Y = data2['521']
but it gets me the following error
Traceback (most recent call last):
File "C:\Users\hadeer.elziaat\Desktop\testspec.py", line 58, in <module>
X,Y = data2['521']
IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices
my dataset
1,4,6,10
2,100,125,10
3,100,7216,254
4,100,527,263
5,100,954,13
6,100,954,23
You're using the string '521' rather than the number 521 for indexing. Try X,Y = data2[521] instead.
If you are only given the string, you could cast it to an int first: X,Y = data2[int('521')], but this might result in some errors and/or unexpected behaviour.
Next problem, you are requiring two variable, one for X and one for Y, yet the data2[521] selection only provides you with a single variable (the number in the 3rd column, 522nd row).
You say you want all the data in the 3rd column.
I assume you also want some kind of x-axis, since you are attempting to do X, Y = .... How about using the first column for that? Then your code would be:
import numpy as np
data = np.loadtxt("trainingdataset.txt", delimiter=',', dtype='int')
x = data[:, 0]
y = data[:, 2]
What remains unclear from your question is why you tried to index your data with 521 - which failed because you cannot use strings as indices on plain arrays.

How to make a tuple including a numpy array hashable?

One way to make a numpy array hashable is setting it to read-only. This has worked for me in the past. But when I use such a numpy array in a tuple, the whole tuple is no longer hashable, which I do not understand. Here is the sample code I put together to illustrate the problem:
import numpy as np
npArray = np.ones((1,1))
npArray.flags.writeable = False
print(npArray.flags.writeable)
keySet = (0, npArray)
print(keySet[1].flags.writeable)
myDict = {keySet : 1}
First I create a simple numpy array and set it to read-only. Then I add it to a tuple and check if it is still read-only (which it is).
When I want to use the tuple as key in a dictionary, I get the error TypeError: unhashable type: 'numpy.ndarray'.
Here is the output of my sample code:
False
False
Traceback (most recent call last):
File "test.py", line 10, in <module>
myDict = {keySet : 1}
TypeError: unhashable type: 'numpy.ndarray'
What can I do to make my tuple hashable and why does Python show this behavior in the first place?
You claim that
One way to make a numpy array hashable is setting it to read-only
but that's not actually true. Setting an array to read-only just makes it read-only. It doesn't make the array hashable, for multiple reasons.
The first reason is that an array with the writeable flag set to False is still mutable. First, you can always set writeable=True again and resume writing to it, or do more exotic things like reassign its shape even while writeable is False. Second, even without touching the array itself, you could mutate its data through another view that has writeable=True.
>>> x = numpy.arange(5)
>>> y = x[:]
>>> x.flags.writeable = False
>>> x
array([0, 1, 2, 3, 4])
>>> y[0] = 5
>>> x
array([5, 1, 2, 3, 4])
Second, for hashability to be meaningful, objects must first be equatable - == must return a boolean, and must be an equivalence relation. NumPy arrays don't do that. The purpose of hash values is to quickly locate equal objects, but when your objects don't even have a built-in notion of equality, there's not much point to providing hashes.
You're not going to get hashable tuples with arrays inside. You're not even going to get hashable arrays. The closest you can get is to put some other representation of the array's data in the tuple.
The fastest way to hash a numpy array is likely tostring.
In [11]: %timeit hash(y.tostring())
What you could do is rather than use a tuple define a class:
class KeySet(object):
def __init__(self, i, arr):
self.i = i
self.arr = arr
def __hash__(self):
return hash((self.i, hash(self.arr.tostring())))
Now you can use it in a dict:
In [21]: ks = KeySet(0, npArray)
In [22]: myDict = {ks: 1}
In [23]: myDict[ks]
Out[23]: 1

Resources