Stacking up dataframes in a 3-dimensional numpy array - python-3.x

I have several pandas dataframes that I would like to stack into a three-dimensional numpy array. I could do the job manually using the following code:
arr = np.array([df1.values, df2.values], dtype="object")
However, since I have many dataframes, writing this line out by hand for all of them is not practical.
I tried the append function (np.append(df1.values, df2['1002'].values)), but it flattens the dataframes and ignores their structure. What I want is a three-dimensional numpy array where the first dimension is the number of dataframes, the second is the number of rows in each dataframe, and the third is the number of columns. In the first example above I seem to get such an array: arr.shape is (2,), and arr[0].shape and arr[1].shape are (26, 7) and (24, 7), respectively, matching the corresponding dataframes.
I also ran np.append(df1.values, df2['1002'].values, axis=0), but received the error ValueError: all the input array dimensions for the concatenation axis must match exactly. Is there any way to fix this and stack all my dataframes into a 3-dimensional numpy array?

Looks like you start with 2 frames with 7 columns, but different numbers of rows. The equivalent of:
In [1]: arr1 = np.ones((26,7)); arr2 = np.zeros((24,7))
...:
In [2]: arr = np.array([arr1, arr2], object)
In [3]: arr.shape
Out[3]: (2,)
In [4]: arr[0].shape
Out[4]: (26, 7)
You probably tried this without the object dtype and got a 'ragged array' warning. In any case, this is not a 3d array. It is 1d with shape (2,), holding two arrays. It's roughly the same as the list
[arr1, arr2]
The np.append docs should make it clear that it flattens the arguments, when you don't specify an axis.
In [6]: np.append(arr1,arr2).shape
Out[6]: (350,)
You could specify an axis and get a 2d array, where 50 is the sum of 26 and 24.
In [7]: np.append(arr1,arr2,axis=0).shape
Out[7]: (50, 7)
This is the same as:
In [8]: np.concatenate((arr1,arr2), axis=0).shape
Out[8]: (50, 7)
np.append is a poorly named cover for np.concatenate. It is not a clone of list append. Learn to use concatenate and its stack derivatives.
With different dataframe shapes, you cannot make a 3d array. Arrays cannot be 'ragged'.
As for working with more than 2 dataframes, if you can make a list of all the frames, you can use the initial syntax.
alist = []
for a in frame_list:
    alist.append(a.values)
arr = np.array(alist, object)
But making such an array doesn't do much for you.
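Note that np.array(alist, object) can still fail with a broadcast error for some shape combinations, and it silently builds a 3d object array when all the shapes happen to match. Pre-allocating the object array avoids both surprises (the arrays below are made-up stand-ins for the frames' .values):

```python
import numpy as np

# Hypothetical stand-ins for the dataframes' .values arrays.
frames = [np.ones((26, 7)), np.zeros((24, 7))]

# Pre-allocate a 1d object array and fill it element by element.
arr = np.empty(len(frames), dtype=object)
for i, a in enumerate(frames):
    arr[i] = a

print(arr.shape)      # (2,)
print(arr[0].shape)   # (26, 7)
```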
If the frames are all the same size, then you can make a 3d array
In [10]: np.array([arr1[:10,:],arr2[:10,:]]).shape
Out[10]: (2, 10, 7)
In [11]: np.stack([arr1[:10,:],arr2[:10,:]]).shape
Out[11]: (2, 10, 7)
But if they differ, stack will complain about that:
In [12]: np.stack([arr1, arr2])
Traceback (most recent call last):
File "<ipython-input-12-23d05d0422dc>", line 1, in <module>
np.stack([arr1, arr2])
File "<__array_function__ internals>", line 180, in stack
File "/usr/local/lib/python3.8/dist-packages/numpy/core/shape_base.py", line 426, in stack
raise ValueError('all input arrays must have the same shape')
ValueError: all input arrays must have the same shape
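If a true 3d array is really needed despite the differing row counts, one workaround (an assumption about what you want, not something stack does for you) is to pad the shorter arrays with NaN rows up to a common length before stacking:

```python
import numpy as np

arr1 = np.ones((26, 7))
arr2 = np.zeros((24, 7))

# Pad each array with NaN rows up to the longest row count.
n_rows = max(a.shape[0] for a in (arr1, arr2))
padded = [
    np.pad(a, ((0, n_rows - a.shape[0]), (0, 0)), constant_values=np.nan)
    for a in (arr1, arr2)
]

# Now all pieces are (26, 7), so stack produces a genuine 3d array.
arr3d = np.stack(padded)
print(arr3d.shape)   # (2, 26, 7)
```

The padding rows then have to be masked or dropped in any downstream computation, which is the price of forcing ragged data into a rectangular array.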

Related

Error when using an awkward array with an index array

I currently have an array of values and an awkward array of integer values. I want an awkward array with the same structure, but where each integer is replaced by the element of the "values" array at that index. For instance:
values = ak.Array(np.random.rand(100))
arr = ak.Array((np.random.randint(0, 100, 33), np.random.randint(0, 100, 125)))
I want something like values[arr], but that gives the following error:
>>> values[arr]
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Anaconda3\lib\site-packages\awkward\highlevel.py", line 943, in __getitem__
return ak._util.wrap(self._layout[where], self._behavior)
ValueError: cannot fit jagged slice with length 2 into RegularArray of size 100
If I run it with a loop, I get back what I want:
>>> values = ([values[i] for i in arr])
>>> values
[<Array [0.842, 0.578, 0.159, ... 0.726, 0.702] type='33 * float64'>, <Array [0.509, 0.45, 0.202, ... 0.906, 0.367] type='125 * float64'>]
Is there another way to do this, or is this it? I'm afraid it'll be too slow for my application.
Thanks!
If you're trying to avoid Python for loops for performance, note that the first line casts a NumPy array as Awkward with ak.from_numpy (no loop, very fast):
>>> values = ak.Array(np.random.rand(100))
but the second line iterates over data in Python (has a slow loop):
>>> arr = ak.Array((np.random.randint(0, 100, 33), np.random.randint(0, 100, 125)))
because a tuple of two NumPy arrays is not a NumPy array. It's a generic iterable, and the constructor falls back to ak.from_iter.
On your main question, the reason that arr doesn't slice values is because arr is a jagged array and values is not:
>>> values
<Array [0.272, 0.121, 0.167, ... 0.152, 0.514] type='100 * float64'>
>>> arr
<Array [[15, 24, 9, 42, ... 35, 75, 20, 10]] type='2 * var * int64'>
Note the types: values has type 100 * float64 and arr has type 2 * var * int64. There's no rule for values[arr].
Since it looks like you want to slice values with arr[0] and then arr[1] (from your list comprehension), it could be done in a vectorized way by duplicating values for each element of arr, then slicing.
>>> # The np.newaxis is to give values a length-1 dimension before concatenating.
>>> duplicated = ak.concatenate([values[np.newaxis]] * 2)
>>> duplicated
<Array [[0.272, 0.121, ... 0.152, 0.514]] type='2 * 100 * float64'>
Now duplicated has length 2 and one level of nesting, just like arr, so arr can slice it. The resulting array also has length 2, but the length of each sublist is the length of each sublist in arr, rather than 100.
>>> duplicated[arr]
<Array [[0.225, 0.812, ... 0.779, 0.665]] type='2 * var * float64'>
>>> ak.num(duplicated[arr])
<Array [33, 125] type='2 * int64'>
If you're scaling up from 2 such lists to a large number, then this would eat up a lot of memory. Then again, the size of the output of this operation would also scale as "length of values" × "length of arr". If this "2" is not going to scale up (if it will be at most thousands, not millions or more), then I wouldn't worry about the speed of the Python for loop. Python scales well for thousands, but not billions (depending, of course, on the size of the things being scaled!).

Can we initialise a numpy array of numpy arrays with different shapes using some constructor?

I want an array that looks like this,
array([array([[1, 1], [2, 2]]), array([3, 3])], dtype=object)
I can make an empty array and then assign elements one by one like this,
z = [np.array([[1,1],[2,2]]), np.array([3,3])]
x = np.empty(shape=2, dtype=object)
x[0], x[1] = z
I thought that if this is possible, then so should be x = np.array(z, dtype=object), but that gets me the error ValueError: could not broadcast input array from shape (2,2) into shape (2).
So is the way given above the only way to make a ragged numpy array? Or is there a nice one-line constructor/function we can call to make the array x from above?
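There is no single numpy constructor guaranteed to handle every shape combination, so one option is to wrap the empty-and-assign idiom in a small helper (the function name here is ours, a sketch rather than a library API):

```python
import numpy as np

def ragged_object_array(items):
    """Build a 1d object array of arrays, sidestepping the
    broadcasting that np.array(z, dtype=object) attempts."""
    out = np.empty(len(items), dtype=object)
    for i, item in enumerate(items):
        out[i] = item
    return out

z = [np.array([[1, 1], [2, 2]]), np.array([3, 3])]
x = ragged_object_array(z)
print(x.shape)      # (2,)
print(x[0].shape)   # (2, 2)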

Apply MinMaxScaler() on a pandas column

I am trying to use the sklearn MinMaxScaler to rescale a pandas column, like below:
scaler = MinMaxScaler()
y = scaler.fit(df['total_amount'])
But got the following errors:
Traceback (most recent call last):
File "/Users/edamame/workspace/git/my-analysis/experiments/my_seq.py", line 54, in <module>
y = scaler.fit(df['total_amount'])
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/preprocessing/data.py", line 308, in fit
return self.partial_fit(X, y)
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/preprocessing/data.py", line 334, in partial_fit
estimator=self, dtype=FLOAT_DTYPES)
File "/Users/edamame/workspace/git/my-analysis/venv/lib/python3.4/site-packages/sklearn/utils/validation.py", line 441, in check_array
"if it contains a single sample.".format(array))
ValueError: Expected 2D array, got 1D array instead:
array=[3.180000e+00 2.937450e+03 6.023850e+03 2.216292e+04 1.074589e+04
:
0.000000e+00 0.000000e+00 9.000000e+01 1.260000e+03].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
Any idea what was wrong?
The input to MinMaxScaler needs to be array-like, with shape [n_samples, n_features]. So you can apply it on the column as a dataframe rather than a series (using double square brackets instead of single):
y = scaler.fit(df[['total_amount']])
Though from your description, it sounds like you want fit_transform rather than just fit (but I could be wrong):
y = scaler.fit_transform(df[['total_amount']])
A little more explanation:
If your dataframe had 100 rows, consider the difference in shape when you transform a column to an array:
>>> np.array(df[['total_amount']]).shape
(100, 1)
>>> np.array(df['total_amount']).shape
(100,)
The first returns a shape that matches [n_samples, n_features] (as required by MinMaxScaler), whereas the second does not.
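The shape requirement can be illustrated without sklearn. A minimal sketch of the same (x - min) / (max - min) rescaling MinMaxScaler performs, on a made-up stand-in for the column (the values are hypothetical):

```python
import numpy as np

# Made-up stand-in for df['total_amount'].to_numpy().
col = np.array([3.18, 2937.45, 6023.85, 90.0, 1260.0])

# reshape(-1, 1) turns the 1d series into the (n_samples, n_features)
# layout that MinMaxScaler expects, with a single feature column.
X = col.reshape(-1, 1)
print(X.shape)   # (5, 1)

# The same (x - min) / (max - min) rescaling MinMaxScaler applies.
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled.min(), X_scaled.max())   # 0.0 1.0
```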
Try it this way:
import pandas as pd
from sklearn import preprocessing
x = df.values #returns a numpy array
min_max_scaler = preprocessing.MinMaxScaler()
x_scaled = min_max_scaler.fit_transform(x)
df = pd.DataFrame(x_scaled)

Value error while generating indexes using PCA in scikit-learn

Using the following function i am trying to generate index from the data:
Function:
import numpy as np
from sklearn.decomposition import PCA

def pca_index(data, components=1, indx=1):
    corrs = np.asarray(data.cov())
    pca = PCA(n_components=components).fit(corrs)
    trns = pca.transform(data)
    index = np.dot(trns[0:indx], pca.explained_variance_ratio_[0:indx])
    return index
Index generation from principal components:
index = pca_index(data=mydata,components=3,indx=2)
The following error is generated when I call the function:
Traceback (most recent call last):
File "<ipython-input-411-35115ef28e61>", line 1, in <module>
index = pca_index(data=mydata,components=3,indx=2)
File "<ipython-input-410-49c0174a047a>", line 15, in pca_index
index=np.dot(trns[0:indx],pca.explained_variance_ratio_[0:indx])
ValueError: shapes (2,3) and (2,) not aligned: 3 (dim 1) != 2 (dim 0)
Can anyone help with this error?
As far as I understand, the error occurs at the point where I pass the subscript indices as a variable (indx):
trns[0:indx], pca.explained_variance_ratio_[0:indx]
In np.dot you are trying to multiply a matrix of shape (2,3) with a matrix of shape (2,), i.e. a vector.
However, you can only multiply NxM by MxP, e.g. (3,2) by (2,1) or (2,3) by (3,1).
In your example the second matrix has shape (2,), which, in numpy terms, is similar but not the same as (2,1). You can reshape a vector into a matrix with vector.reshape([2,1]).
You might also transpose your first matrix, converting its dimensions from (2,3) to (3,2).
However, make sure you multiply the appropriate matrices, as the result will differ from what you might expect.
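A minimal numpy sketch of the alignment rule, using made-up numbers in the shapes from the traceback (stand-ins for trns[0:indx] and the variance ratios, not the actual PCA output):

```python
import numpy as np

# Made-up stand-ins with the shapes from the traceback.
trns = np.arange(6.0).reshape(2, 3)   # plays the role of trns[0:indx]
ratios = np.array([0.6, 0.3])         # plays explained_variance_ratio_[0:indx]

# (2, 3) . (2,) is misaligned: 3 != 2.  Transposing the first operand
# gives (3, 2) . (2,) -> (3,), which is aligned.
result = np.dot(trns.T, ratios)
print(result.shape)   # (3,)
```

Whether the transposed product is the quantity you actually want depends on what the index is supposed to mean; this only shows why the shapes line up.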

numpy structured array no shape information?

Why is the shape of a single-row numpy structured array not defined ('()'), and what's the common workaround?
import io
fileWrapper = io.StringIO("-0.09469 0.032987 0.061009 0.0588")
a = np.loadtxt(fileWrapper, dtype=np.dtype([('min', (float, 2)), ('max', (float, 2))]), delimiter=" ", comments="#")
print(np.shape(a), a)
Output: () ([-0.09469, 0.032987], [0.061009, 0.0588])
Short answer: Add the argument ndmin=1 to the loadtxt call.
Long answer:
The shape is () for the same reason that reading a single floating point value with loadtxt returns an array with shape ():
In [43]: a = np.loadtxt(['1.0'])
In [44]: a.shape
Out[44]: ()
In [45]: a
Out[45]: array(1.0)
By default, loadtxt uses the squeeze function to eliminate trivial (i.e. length 1) dimensions in the array that it returns. In my example above, it means the result is a "scalar array"--an array with shape ().
When you give loadtxt a structured dtype, the structure defines the fields of a single element of the array. It is common to think of these fields as "columns", but structured arrays will make more sense if you consistently think of them as what they are: arrays of structures with fields. If your data file had two lines, the array returned by loadtxt would be an array with shape (2,). That is, it is a one-dimensional array with length 2. Each element of the array is a structure whose fields are defined by the given dtype. When the input file has only a single line, the array would have shape (1,), but loadtxt squeezes that to be a scalar array with shape ().
To force loadtxt to always return a one-dimensional array, even when there is a single line of data, use the argument ndmin=1.
For example, here's a dtype for a structured array:
In [58]: dt = np.dtype([('x', np.float64), ('y', np.float64)])
Read one line using that dtype. The result has shape ():
In [59]: a = np.loadtxt(['1.0 2.0'], dtype=dt)
In [60]: a.shape
Out[60]: ()
Use ndmin=1 to ensure that even an input with a single line results in a one-dimensional array:
In [61]: a = np.loadtxt(['1.0 2.0'], dtype=dt, ndmin=1)
In [62]: a.shape
Out[62]: (1,)
In [63]: a
Out[63]:
array([(1.0, 2.0)],
dtype=[('x', '<f8'), ('y', '<f8')])
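Putting the pieces together with the question's own input and dtype (a sketch using io.StringIO in place of a file):

```python
import io
import numpy as np

# The question's single-line input and two-field structured dtype.
text = io.StringIO("-0.09469 0.032987 0.061009 0.0588")
dt = np.dtype([('min', (float, 2)), ('max', (float, 2))])

# ndmin=1 prevents loadtxt from squeezing the result to shape ().
a = np.loadtxt(text, dtype=dt, ndmin=1)
print(a.shape)       # (1,)
print(a['min'][0])   # the 'min' field of the single row
```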
