I have an np.array startIdx originating from a list of tuples consisting of integer and float fields:
>>> startIdx, someInt, someFloat = np.array(resultList).T
>>> startIdx
array([0.0, 111.0, 333.0]) # 10 to a few 100 positive values of the order of 100 to 10000
>>> round(startIdx[2])
333.0 # oops
>>> help(round)
Round [...] returns an int when called with one argument, otherwise the same type as the number.
>>> round(np.pi)
3
>>> round(np.pi, 2) # the optional argument is the number of decimal digits
3.14
>>> round([0.0, 111.0, 333.0][2]) # to test whether it's specific to numpy arrays
333
The float currently works (as index into numpy arrays) but yields a warning:
VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future
I could avoid the conversion from tuples to arrays (and int to float) by collecting my results in a grossly oversized record array (with an int field 'startIdx').
I could use something like int(. + 0.1), which is also ugly. Would int(round(.)) or even int(.) safely yield correct results?
In [70]: startIdx=np.array([0.0, 111.0, 333.0])
In [71]: startIdx
Out[71]: array([ 0., 111., 333.])
If you need an integer array, use astype:
In [72]: startIdx.astype(int)
Out[72]: array([ 0, 111, 333])
not round:
In [73]: np.round(startIdx)
Out[73]: array([ 0., 111., 333.])
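If the goal is just to use those values as indices into other arrays, converting once with astype and indexing with the result avoids the VisibleDeprecationWarning. A minimal sketch (data is a hypothetical array being indexed, not from the question):
import numpy as np

startIdx = np.array([0.0, 111.0, 333.0])
idx = startIdx.astype(int)   # exact for whole-number floats of this magnitude
data = np.arange(1000)       # hypothetical array to index into
print(data[idx])             # [  0 111 333], no deprecation warning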
np.array(resultList) produces a float dtype array because some values are float. arr=np.array(resultList, dtype='i,i,f') should produce a structured array with integer and float fields, assuming resultList is indeed a list of tuples.
startIdx = arr['f0']
should then be an integer dtype array.
I expect the memory use of the structured array to be the same as for the float one.
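A minimal sketch of that structured-array route, with a hypothetical resultList standing in for the real data (the field names 'f0', 'f1', 'f2' are the defaults generated for the 'i,i,f' dtype):
import numpy as np

resultList = [(0, 4, 0.5), (111, 7, 1.25), (333, 2, 2.0)]   # hypothetical (int, int, float) tuples
arr = np.array(resultList, dtype='i,i,f')    # structured array: two int fields, one float field
startIdx = arr['f0']                         # integer field, no float round-trip
print(startIdx.dtype)                        # int32 (platform dependent)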
Related
I currently have an array of values and an awkward array of integers. I want an awkward array with the same structure, but where each integer is replaced by the element of the "values" array at that index. For instance:
values = ak.Array(np.random.rand(100))
arr = ak.Array((np.random.randint(0, 100, 33), np.random.randint(0, 100, 125)))
I want something like values[arr], but that gives the following error:
>>> values[arr]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda3\lib\site-packages\awkward\highlevel.py", line 943, in __getitem__
return ak._util.wrap(self._layout[where], self._behavior)
ValueError: cannot fit jagged slice with length 2 into RegularArray of size 100
If I run it with a loop, I get back what I want:
>>> values = ([values[i] for i in arr])
>>> values
[<Array [0.842, 0.578, 0.159, ... 0.726, 0.702] type='33 * float64'>, <Array [0.509, 0.45, 0.202, ... 0.906, 0.367] type='125 * float64'>]
Is there another way to do this, or is this it? I'm afraid it'll be too slow for my application.
Thanks!
If you're trying to avoid Python for loops for performance, note that the first line casts a NumPy array as Awkward with ak.from_numpy (no loop, very fast):
>>> values = ak.Array(np.random.rand(100))
but the second line iterates over data in Python (has a slow loop):
>>> arr = ak.Array((np.random.randint(0, 100, 33), np.random.randint(0, 100, 125)))
because a tuple of two NumPy arrays is not a NumPy array. It's a generic iterable, and the constructor falls back to ak.from_iter.
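If building arr itself is a bottleneck, one possibility (an assumption about the goal, using ak.unflatten from Awkward 1.x) is to concatenate the NumPy arrays first and then split them by length, which avoids iterating over the individual integers in Python:
import numpy as np
import awkward as ak

a = np.random.randint(0, 100, 33)
b = np.random.randint(0, 100, 125)

# flat NumPy data plus sublist lengths -> jagged array, no Python-level loop
arr = ak.unflatten(np.concatenate([a, b]), [len(a), len(b)])
print(ak.num(arr))   # [33, 125]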
On your main question, the reason that arr doesn't slice values is that arr is a jagged array and values is not:
>>> values
<Array [0.272, 0.121, 0.167, ... 0.152, 0.514] type='100 * float64'>
>>> arr
<Array [[15, 24, 9, 42, ... 35, 75, 20, 10]] type='2 * var * int64'>
Note the types: values has type 100 * float64 and arr has type 2 * var * int64. There's no rule for values[arr].
Since it looks like you want to slice values with arr[0] and then arr[1] (from your list comprehension), it could be done in a vectorized way by duplicating values for each element of arr, then slicing.
>>> # The np.newaxis is to give values a length-1 dimension before concatenating.
>>> duplicated = ak.concatenate([values[np.newaxis]] * 2)
>>> duplicated
<Array [[0.272, 0.121, ... 0.152, 0.514]] type='2 * 100 * float64'>
Now duplicated has length 2 and one level of nesting, just like arr, so arr can slice it. The resulting array also has length 2, but the length of each sublist is the length of each sublist in arr, rather than 100.
>>> duplicated[arr]
<Array [[0.225, 0.812, ... 0.779, 0.665]] type='2 * var * float64'>
>>> ak.num(duplicated[arr])
<Array [33, 125] type='2 * int64'>
If you're scaling up from 2 such lists to a large number, then this would eat up a lot of memory. Then again, the size of the output of this operation would also scale as "length of values" × "length of arr". If this "2" is not going to scale up (if it will be at most thousands, not millions or more), then I wouldn't worry about the speed of the Python for loop. Python scales well for thousands, but not billions (depending, of course, on the size of the things being scaled!).
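For completeness, here is the same duplicate-and-slice idea without hard-coding the 2, as a sketch (it assumes values and arr are built as in the question and that len(arr) stays small enough that the duplication fits in memory):
import numpy as np
import awkward as ak

values = ak.Array(np.random.rand(100))
arr = ak.Array((np.random.randint(0, 100, 33), np.random.randint(0, 100, 125)))

duplicated = ak.concatenate([values[np.newaxis]] * len(arr))   # one copy of values per sublist of arr
result = duplicated[arr]
print(ak.num(result))   # [33, 125]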
I made a simple function that produces a weighted average of several time series using supplied weights. It is designed to handle missing values (NaNs), which is why I am not using numpy's supplied average function.
However, when I feed it my array containing missing values, the array has its NaN values replaced by 0s! I would have assumed that, since I assign the array to a new name inside the function and it is not a global variable, this should not happen. I want my X array to retain its original form, including the NaN value.
I am a relative novice using python (obviously).
Example:
X = np.array([[1, 2, 3], [1, 2, 3], [1, 2, np.nan]]) # 3 time series to be weighted together
weights = np.array([[1,1,1]]) # simple example with weights for each series as 1
def WeightedMeanNaN(Tseries, weights):
    ## calculates weighted mean
    N_Tseries = Tseries
    Weights = np.repeat(weights, len(N_Tseries), axis=0)  # make a vector of weights matching size of time series
    loc = np.where(np.isnan(N_Tseries))  # get location of nans
    Weights[loc] = 0
    N_Tseries[loc] = 0
    Weights = Weights/Weights.sum(axis=1)[:,None]  # normalize each row so that weights sum to 1
    WeightedAve = np.multiply(N_Tseries,Weights)
    WeightedAve = WeightedAve.sum(axis=1)
    return WeightedAve
WeightedMeanNaN(Tseries = X, weights = weights)
Out[161]: array([2. , 2. , 1.5])
In:X
Out:
array([[1., 2., 3.],
       [1., 2., 3.],
       [1., 2., 0.]]) # no longer nan!!
Where you call
loc = np.where(np.isnan(N_Tseries)) # get location of nans
Weights[loc] = 0
N_Tseries[loc] = 0
You remove all NaNs and set them to zeros.
To reverse this you could iterate over the array and replace zeros with NaNs.
However, this would also set regular zeros to NaNs.
So it turns out this is a mistake caused by me being used to working in Matlab. Python passes arguments to the function as references to the original object. In contrast, Matlab creates copies that are discarded when the function ends.
I solved my problem by adding ".copy()" when assigning variables in the function, so that the first line in the function above becomes:
N_Tseries = Tseries.copy().
However, one thing that puzzles me is that some people have suggested that using Tseries[:] should also create a copy of Tseries rather than a pointer to the original variable. This did not work for me though.
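The reason Tseries[:] does not help here: slicing a Python list produces a shallow copy, but slicing a NumPy array produces a view that shares memory with the original. A small demonstration of the difference:
import numpy as np

lst = [1.0, 2.0, 3.0]
lst_slice = lst[:]          # new list object (shallow copy)
lst_slice[0] = 99.0
print(lst)                  # [1.0, 2.0, 3.0] -- original unchanged

arr = np.array([1.0, 2.0, 3.0])
arr_slice = arr[:]          # view sharing memory with arr
arr_slice[0] = 99.0
print(arr)                  # [99.  2.  3.] -- original changed

arr_copy = arr.copy()       # an independent copy, as used in the fix above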
I found this answer useful:
Python function not supposed to change a global variable
I wrote some code for a 2D list.
row_num = int(input())
col_ = int(input())
arr2=[]
for i in range(row_num):
    arr2.append([])
    a=input()
    a=a.split(" ")
    for j in range(col_):
        arr2[i].append(a[j])
for j in range(2):
    arr2[j][-2]=float(arr2[j][-2])-float(arr2[j][-1])
print(arr2)
First I didn't convert the list into an np array, so my output was:
2
2
2 9
2 9
[[-7.0, '9'], [-7.0, '9']]
but when I convert the list into an np array and do the same operation,
row_num = int(input())
col_ = int(input())
arr2=[]
for i in range(row_num):
    arr2.append([])
    a=input()
    a=a.split(" ")
    for j in range(col_):
        arr2[i].append(a[j])
arr2=np.array(arr2)  # here I am converting list into np array
for j in range(2):
    arr2[j][-2]=float(arr2[j][-2])-float(arr2[j][-1])
print(arr2)
I got different output
2
2
2 9
2 9
[['-' '9']
 ['-' '9']]
I don't know why I am getting different answers.
The documentation for numpy.array() says, for its parameter dtype:
The desired data-type for the array. If not given, then the type will be determined as the minimum type required to hold the objects in the sequence. ...
This is exactly what happened here. You were expecting the array to be able to hold both str and float values, like ordinary Python lists do; for this the dtype should be object. Since you didn't specify the type, it chose single characters instead:
>>> np.array(['2', '9'])
array(['2', '9'],
      dtype='<U1')
And then when you tried to put -7.0 into one slot of an array of single characters, it must have turned it into a string '-7.0' and only used the first character of that.
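That behaviour is easy to verify in isolation (assigning a float into a '<U1' array converts it to a string and keeps only one character):
>>> x = np.array(['2', '9'])
>>> x[0] = -7.0
>>> x
array(['-', '9'], dtype='<U1')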
So specify the dtype you want for your array when you create it. If you're looking to gain some of the performance advantages of Numpy, you probably want to use a floating-point dtype and convert your strings into floats before you put them into the Numpy array. Or you could do the conversion with astype():
>>> np.array(['2', '9']).astype(float)
array([ 2.,  9.])
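Putting that advice together with the code from the question, a sketch that builds the list of strings first and converts everything to a float array in one step (it assumes every entry in the input is numeric):
import numpy as np

row_num = int(input())
col_ = int(input())
arr2 = []
for i in range(row_num):
    arr2.append(input().split(" ")[:col_])

arr2 = np.array(arr2, dtype=float)   # or: np.array(arr2).astype(float)
for j in range(2):
    arr2[j][-2] = arr2[j][-2] - arr2[j][-1]
print(arr2)                          # [[-7.  9.]
                                     #  [-7.  9.]] for the sample input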
I am sure many of you use the numpy log function. How do you deal with NaN and -Inf? Is there any pythonic way to remove them from the array?
a = np.array([[0,1],
              [0,0],
              [1,1]])
b = np.log(a[:,0]/a[:,1])
print(b)
Simply index the array where the values are finite, using np.isfinite().
>>> a = np.array([[0,1],
...               [0,0],
...               [1,1]])
>>> b = np.log(a[:,0]/a[:,1])
>>> b[np.isfinite(b)]
array([ 0.])
The np.isfinite() function will give you a boolean array the same size as the input array that is True wherever the value is finite, i.e. non-NaN and non-inf, and False otherwise:
>>> np.isfinite(b)
array([False, False, True], dtype=bool)
which then can be used as a boolean index, so it will only grab the values out of b where this result is True (in this case, it's the final index, which has a value of 0).
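If you also need to know which inputs produced the finite values, the same mask can index the original rows, and np.errstate can silence the runtime warnings raised by the 0/0 division and the log of zero; a short sketch:
>>> mask = np.isfinite(b)
>>> a[mask]                       # rows of a whose ratio gave a finite log
array([[1, 1]])
>>> with np.errstate(divide='ignore', invalid='ignore'):
...     b = np.log(a[:, 0] / a[:, 1])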
Why is the shape of a single-row numpy structured array empty ('()'), and what's the common "workaround"?
import io
import numpy as np
fileWrapper = io.StringIO("-0.09469 0.032987 0.061009 0.0588")
a = np.loadtxt(fileWrapper, dtype=np.dtype([('min', (float, 2)), ('max', (float, 2))]), delimiter=" ", comments="#")
print(np.shape(a), a)
Output: () ([-0.09469, 0.032987], [0.061009, 0.0588])
Short answer: Add the argument ndmin=1 to the loadtxt call.
Long answer:
The shape is () for the same reason that reading a single floating point value with loadtxt returns an array with shape ():
In [43]: a = np.loadtxt(['1.0'])
In [44]: a.shape
Out[44]: ()
In [45]: a
Out[45]: array(1.0)
By default, loadtxt uses the squeeze function to eliminate trivial (i.e. length 1) dimensions in the array that it returns. In my example above, it means the result is a "scalar array"--an array with shape ().
When you give loadtxt a structured dtype, the structure defines the fields of a single element of the array. It is common to think of these fields as "columns", but structured arrays will make more sense if you consistently think of them as what they are: arrays of structures with fields. If your data file had two lines, the array returned by loadtxt would be an array with shape (2,). That is, it is a one-dimensional array with length 2. Each element of the array is a structure whose fields are defined by the given dtype. When the input file has only a single line, the array would have shape (1,), but loadtxt squeezes that to be a scalar array with shape ().
To force loadtxt to always return a one-dimensional array, even when there is a single line of data, use the argument ndmin=1.
For example, here's a dtype for a structured array:
In [58]: dt = np.dtype([('x', np.float64), ('y', np.float64)])
Read one line using that dtype. The result has shape ():
In [59]: a = np.loadtxt(['1.0 2.0'], dtype=dt)
In [60]: a.shape
Out[60]: ()
Use ndmin=1 to ensure that even an input with a single line results in a one-dimensional array:
In [61]: a = np.loadtxt(['1.0 2.0'], dtype=dt, ndmin=1)
In [62]: a.shape
Out[62]: (1,)
In [63]: a
Out[63]:
array([(1.0, 2.0)],
      dtype=[('x', '<f8'), ('y', '<f8')])
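Applied back to the loadtxt call from the question, a sketch with the original dtype and only ndmin=1 added:
import io
import numpy as np

fileWrapper = io.StringIO("-0.09469 0.032987 0.061009 0.0588")
dt = np.dtype([('min', (float, 2)), ('max', (float, 2))])
a = np.loadtxt(fileWrapper, dtype=dt, delimiter=" ", comments="#", ndmin=1)
print(a.shape)        # (1,) -- a one-dimensional structured array with one element
print(a[0]['min'])    # [-0.09469  0.032987]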