groupby when column contains multi-dimensional array that needs to be added - python-3.x

I have a pandas DataFrame in which one column holds 2-dimensional vectors (nested lists). I would like to group by one of the other columns and add the vectors up element-wise.
I tried groupby followed by sum, as shown in the code below, but the result concatenates the vectors (adding rows) instead of adding them element-wise the way np.add would.
import pandas as pd
data = pd.DataFrame({'label': ['A', 'B', 'A'], 'label2' : ['X', 'Y', 'Z'],
'output' : [[[1,2,3,4],[5,6,7,8]] ,[[9,10,11,12],[13,14,15,16]],[[17,18,19,20],[21,22,23,24]]] })
data_grouped = data.groupby('label')['output'].sum()
I would like to group by 'label' and have the outputs aggregated. Since each output is a two-dimensional vector, I would like the vectors to be added, not combined. Therefore, my expectation is to have:
label A: output is [[18,20,22,24],[26,28,30,32]]
label B: output is [[9,10,11,12],[13,14,15,16]]
but I am getting:
label A: [[1, 2, 3, 4], [5, 6, 7, 8], [17, 18, 19, 20],[21,22,23,24]]
label B: [[9, 10, 11, 12], [13, 14, 15, 16]]

The solution
import pandas as pd
import numpy as np
data = pd.DataFrame({'label': ['A', 'B', 'A'], 'label2' : ['X', 'Y', 'Z'],
'output' : [[[1,2,3,4],[5,6,7,8]] ,[[9,10,11,12],[13,14,15,16]],[[17,18,19,20],[21,22,23,24]]] })
data['output'] = data['output'].map(np.array)
data_grouped = data[['label', 'output']].groupby('label').sum()
print(data_grouped)
>>> output
>>> label
>>> A [[18, 20, 22, 24], [26, 28, 30, 32]]
>>> B [[9, 10, 11, 12], [13, 14, 15, 16]]
The explanation
Your output column contains Python lists. The + operator on two lists concatenates them:
print([1, 2] + [3, 4])
>>> [1, 2, 3, 4]
print([[1], [2]] + [[3], [4]])
>>> [[1], [2], [3], [4]]
data['output'].map(np.array) turns your 2D lists into 2D NumPy arrays. The + operator on NumPy arrays (which sum() uses) adds the values that sit at the same position in both arrays.
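To see the difference in isolation, here is a small sketch comparing + on nested lists with + on NumPy arrays. The variable names first, third, concatenated, and added are illustrative and not part of the question; the values are the two 'A' rows from the example.

```python
import numpy as np

# The two 'output' values that belong to label 'A' in the question
first = [[1, 2, 3, 4], [5, 6, 7, 8]]
third = [[17, 18, 19, 20], [21, 22, 23, 24]]

# Python lists: + concatenates, so the result has four rows
concatenated = first + third
print(len(concatenated))

# NumPy arrays: + adds element by element, so the shape stays (2, 4)
added = np.array(first) + np.array(third)
print(added.tolist())
```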

Related

Extract lower off diagonal elements from numpy array

I have below array
import numpy as np
a = np.array([[7412, 33, 2],
[2, 7304, 83],
[3, 101, 7237]])
I would like to extract only the lower off-diagonal elements from the array above and put them in a vector.
I tried np.extract(~a, a), but it extracts all elements.
The desired output for the example above is [2, 3, 101].
Any insight would be helpful.
You can use np.tril_indices or np.tri:
import numpy as np
a = np.array([[7412, 33, 2],
[2, 7304, 83],
[3, 101, 7237]])
n, m = a.shape
# Option 1
out = a[ np.tril_indices(n=n, k=-1, m=m) ]
# Option 2 (should have equivalent output)
out = a[ np.tri(N=n, M=m, k=-1, dtype=bool) ]
out:
array([ 2, 3, 101])
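As a rough illustration of why the two options agree, the sketch below prints the boolean mask that np.tri builds and the index pairs that np.tril_indices returns for the same 3x3 array. The names mask, rows, and cols are illustrative only.

```python
import numpy as np

a = np.array([[7412, 33, 2],
              [2, 7304, 83],
              [3, 101, 7237]])

# k=-1 shifts the triangle one step below the main diagonal,
# so the diagonal itself is excluded.
mask = np.tri(a.shape[0], a.shape[1], k=-1, dtype=bool)
print(mask)

# The same positions expressed as (row, column) index arrays
rows, cols = np.tril_indices(a.shape[0], k=-1, m=a.shape[1])
print(list(zip(rows.tolist(), cols.tolist())))

# Both forms select the elements in row-major order
print(a[mask], a[rows, cols])
```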

Print multiple columns from a matrix

I have a list of column vectors and I want to print only those column vectors from a matrix.
Note: the list can be of random length, and the indices can also be random.
For instance, the following does what I want:
import numpy as np
column_list = [2,3]
a = np.array([[1,2,6,1],[4,5,8,2],[8,3,5,3],[6,5,4,4],[5,2,8,8]])
new_matrix = []
for i in column_list:
    new_matrix.append(a[:,i])
new_matrix = np.array(new_matrix)
new_matrix = new_matrix.transpose()
print(new_matrix)
However, I was wondering if there is a shorter method?
Yes, there's a shorter way. You can pass a list (or numpy array) to an array's indexer. Therefore, you can pass column_list to the columns indexer of a:
>>> a[:, column_list]
array([[6, 1],
[8, 2],
[5, 3],
[4, 4],
[8, 8]])
# This is your new_matrix produced by your original code:
>>> new_matrix
array([[6, 1],
[8, 2],
[5, 3],
[4, 4],
[8, 8]])
>>> np.all(a[:, column_list] == new_matrix)
True

Wilcoxon rank sum test between two data frames in python

I am trying to perform a Wilcoxon rank-sum test between two data frames. I would like to perform the test only between matching rows. For example, the test should be done between row 1 in df1 (A, 1, 2, 3) and row 1 in df2 (A, 10, 12, 13), between row 2 in df1 (B, 4, 5, 6) and row 2 in df2 (B, 14, 15, 16), and so on.
df1=pd.DataFrame(np.array([['A',1, 2, 3], ['B',4, 5, 6], ['C',7, 8, 9]]),
columns=['Details','a', 'b', 'c'])
df2=pd.DataFrame(np.array([['A',10, 12, 13], ['B',14, 15, 16], ['C',17, 18, 19]]),
columns=['Details','a', 'b', 'c'])
This should lead me to a column of p values for the test between the rows of the data frames.
out = pd.DataFrame(np.array([['A',0.05], ['B',0.0002], ['C',1]]),
columns=['details','P'])
One way is to use a for loop, but unfortunately I have 28000 rows in my original dataset and this experiment has to be repeated at least 1000 times. I am wondering if anyone has a better strategy to approach this. Thank you very much for your help in advance.
One way to calculate this is to use ranksums from scipy:
from scipy.stats import ranksums
import numpy as np
import pandas as pd
df1=pd.DataFrame(np.array([['A',1, 2, 3], ['B',4, 5, 6], ['C',7, 8, 9]]),
columns=['Details','a', 'b', 'c'])
df2=pd.DataFrame(np.array([['A',10, 12, 13], ['B',14, 15, 16], ['C',17, 18, 19]]),
columns=['Details','a', 'b', 'c'])
a = df1.loc[0,'a':].values.astype(int) # Select the first row of df1
b = df2.loc[0,'a':].values.astype(int) # Select the first row of df2
ranksums(a, b)
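The call above covers a single pair of rows. To get one p-value per row, one possible sketch is to pair the rows up and call ranksums on each pair. It assumes both frames list the same rows in the same order, and it builds the frames with numeric columns directly instead of going through a string array. The names x, y, p_values, and out are illustrative assumptions.

```python
import pandas as pd
from scipy.stats import ranksums

# Same values as in the question, built with numeric columns (an assumption
# made here so that no string-to-int conversion is needed).
df1 = pd.DataFrame({'Details': ['A', 'B', 'C'],
                    'a': [1, 4, 7], 'b': [2, 5, 8], 'c': [3, 6, 9]})
df2 = pd.DataFrame({'Details': ['A', 'B', 'C'],
                    'a': [10, 14, 17], 'b': [12, 15, 18], 'c': [13, 16, 19]})

# Pull the numeric part out once as plain 2D arrays
x = df1.set_index('Details').to_numpy(dtype=float)
y = df2.set_index('Details').to_numpy(dtype=float)

# One rank-sum test per pair of aligned rows
p_values = [ranksums(row_x, row_y).pvalue for row_x, row_y in zip(x, y)]

out = pd.DataFrame({'Details': df1['Details'], 'P': p_values})
print(out)
```

This still calls ranksums once per row; it only avoids repeated .loc lookups inside the loop.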

Find the First Instances of all Values in a Column of a Numpy Array

I'm trying to find the first occurrence of any row in an array in which either column has a number that has changed since the last time it appeared. Given the array below:
import numpy as np
arr = np.array([[1, 11], [2, 21], [3, 31], [4, 41], [1, 11], [2, 21], [3, 31], [4, 42]])
The output I'm looking for would look like:
subArr = [[1, 11]
[2, 21]
[3, 31]
[4, 41]
[4, 42]]
In the actual problem, the numbers are not as sequential as they appear here and cannot be predicted in advance. I've tried finding the first instance in an array, using multiple conditions, trying to get the first element in a 2-D array, and accessing the ith column. Although some of these were helpful, I can't get them to do everything I want. I tried:
subArr = arr[np.unique(np.logical_and(arr[:,0][0], arr[:,1][0]))]
which didn't work. I also tried:
subArr = arr[(arr[:,0][0]) & (arr[:,1][0])]
I'm sure it's just a matter of getting the syntax right but I can't figure out what I'm missing. Any help would be greatly appreciated.
Using:
Python 3.6
Numpy 1.18.1
Use the axis parameter of numpy.unique:
In [16]: arr
Out[16]:
array([[ 1, 11],
[ 2, 21],
[ 3, 31],
[ 4, 41],
[ 1, 11],
[ 2, 21],
[ 3, 31],
[ 4, 42]])
In [17]: np.unique(arr, axis=0)
Out[17]:
array([[ 1, 11],
[ 2, 21],
[ 3, 31],
[ 4, 41],
[ 4, 42]])
The returned values are copies of the rows from the original array, so it doesn't really make sense to ask if a row in the output corresponds to the first occurrence of the same values in the input.
If you need to know the indices of the first occurrence of each unique row in the input, you can add the argument return_index. When you do this, unique ensures that the index will be that of the first occurrence of the corresponding unique value:
In [51]: values, indices = np.unique(arr, return_index=True, axis=0)
In [52]: values
Out[52]:
array([[ 1, 11],
[ 2, 21],
[ 3, 31],
[ 4, 41],
[ 4, 42]])
In [53]: indices
Out[53]: array([0, 1, 2, 3, 7])
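Note that np.unique returns the unique rows in sorted order. If the rows are instead wanted in the order in which they first appear, one option is to sort the returned indices and index the original array with them. This is a minimal sketch; first_rows is an illustrative name.

```python
import numpy as np

arr = np.array([[1, 11], [2, 21], [3, 31], [4, 41],
                [1, 11], [2, 21], [3, 31], [4, 42]])

# indices[i] is the position of the first occurrence of the i-th unique row
_, indices = np.unique(arr, return_index=True, axis=0)

# Sorting the indices restores the original order of appearance
first_rows = arr[np.sort(indices)]
print(first_rows)
```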

How to convert a grouped pandas dataframe into a numpy 3d array and apply right-padding?

In order to feed data into an LSTM network to predict remaining useful life (RUL), I need to create a 3D numpy array with shape (number of machines, number of sequences, number of variables).
I tried to combine solutions from Stack Overflow and managed to create a prototype (which you can see below).
import numpy as np
import tensorflow as tf
import pandas as pd
df = pd.DataFrame({'ID': [1, 1, 2, 3, 3, 3, 3],
'V1': [1, 2, 2, 3, 3, 4, 2],
'V2': [4, 2, 3, 2, 1, 5, 1],
})
df_desired_result = np.array([[[1, 4], [2, 2], [-99, -99]],
[[2, 3], [-99, -99], [-99, -99]],
[[3, 2], [3, 1], [4, 5]]])
max_len = df['ID'].value_counts().max()
def pad_df(df, cols, max_seq, group_col='ID'):
    array_for_pad = np.array(list(df[cols].groupby(df[group_col]).apply(pd.DataFrame.as_matrix)))
    padded_array = tf.keras.preprocessing.sequence.pad_sequences(array_for_pad,
                                                                 padding='post',
                                                                 maxlen=max_seq,
                                                                 value=-99
                                                                 )
    return padded_array
#testing prototype
pad_df(df, ['V1', 'V2'], max_len)
But when I apply the code above to my data, the right-padding is applied correctly, yet all values are set to 0.0.
I can't fully figure out this behaviour. I noticed that the first line of my function returns an array of nested arrays for 'array_for_pad'.
Here is a screenshot of the result: [image: result padding]
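For comparison, the desired array can also be built with plain NumPy and pandas, without pad_sequences. The sketch below is not the asker's code and does not diagnose the 0.0 values; pad_groups is a hypothetical helper. It assumes that groups longer than max_seq should keep their first max_seq rows, which is what df_desired_result in the question shows for ID 3.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'ID': [1, 1, 2, 3, 3, 3, 3],
                   'V1': [1, 2, 2, 3, 3, 4, 2],
                   'V2': [4, 2, 3, 2, 1, 5, 1]})

def pad_groups(df, cols, group_col='ID', max_seq=None, pad_value=-99):
    """Hypothetical helper: stack each group into a right-padded 3D array."""
    # One 2D array of shape (rows in group, len(cols)) per group
    groups = [g[cols].to_numpy() for _, g in df.groupby(group_col)]
    if max_seq is None:
        max_seq = max(len(g) for g in groups)
    # Start from an array filled with the padding value
    out = np.full((len(groups), max_seq, len(cols)), pad_value,
                  dtype=groups[0].dtype)
    for i, g in enumerate(groups):
        n = min(len(g), max_seq)   # truncate groups longer than max_seq
        out[i, :n] = g[:n]         # real rows first, padding stays at the end
    return out

padded = pad_groups(df, ['V1', 'V2'], max_seq=3)
print(padded)
```

Leaving max_seq as None pads every group to the length of the longest one instead of truncating.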
