Using groupby() with appending additional rows - python-3.x

With the following CSV input file:
ID,Name,Metric,Value
0,K1,M1,200
0,K1,M2,5
1,K2,M1,1
1,K2,M2,10
2,K2,M1,500
2,K2,M2,8
This code groups the rows by the Name column (two groups here), then appends the values as columns for the same Name.
import pandas as pd

df = pd.read_csv('test.csv', usecols=['ID', 'Name', 'Metric', 'Value'])
print(df)

my_array = []
for name, df_group in df.groupby('Name'):
    my_array.append(pd.concat(
        [g.reset_index(drop=True) for _, g in df_group.groupby('ID')['Value']],
        axis=1))
print(my_array)
The output looks like
ID Name Metric Value
0 0 K1 M1 200
1 0 K1 M2 5
2 1 K2 M1 1
3 1 K2 M2 10
4 2 K2 M1 500
5 2 K2 M2 8
[ Value
0 200
1 5, Value Value
0 1 500
1 10 8]
For example, my_array[1] (the K2 group) has two rows, corresponding to M1 and M2. I would also like to keep the IDs in the final data frames in my_array, so I want to add a third row holding the IDs (giving M1, M2 and ID rows). The final my_array should therefore be:
[ Value
0 200
1 5
2 0, Value Value
0 1 500 <-- For K2, there are two M1 (1 and 500)
1 10 8 <-- For K2, there are two M2 (10 and 8)
2 1 2] <-- For K2, there are two ID (1 and 2)
How can I modify the code for that purpose?

You can use DataFrame.pivot on the DataFrame of each group and then append df1.columns with np.vstack:
import numpy as np

my_array = []
for name, df_group in df.groupby('Name'):
    df1 = df_group.pivot(index='Metric', columns='ID', values='Value')
    my_array.append(pd.DataFrame(np.vstack([df1, df1.columns])))
print(my_array)
[ 0
0 200
1 5
2 0, 0 1
0 1 500
1 10 8
2 1 2]
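If you would rather keep the Metric/ID labels instead of the default integer labels that np.vstack produces, one possible variant (a sketch, assuming the same df as above; the ids frame is introduced here purely for illustration) is to append the ID row with pd.concat:

my_array = []
for name, df_group in df.groupby('Name'):
    df1 = df_group.pivot(index='Metric', columns='ID', values='Value')
    # one-row frame labelled 'ID' holding the column (ID) labels themselves
    ids = pd.DataFrame([df1.columns.to_numpy()], index=['ID'], columns=df1.columns)
    my_array.append(pd.concat([df1, ids]))
print(my_array)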

Related

Getting Dummy Back to Categorical

I have a df called X like this:
Index Class Family
1 Mid 12
2 Low 6
3 High 5
4 Low 2
I converted this to dummy variables using the code below:
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

ohe = OneHotEncoder()
X_object = X.select_dtypes('object')
ohe.fit(X_object)
codes = ohe.transform(X_object).toarray()
feature_names = ohe.get_feature_names(['V1', 'V2'])
X = pd.concat([X.select_dtypes(exclude='object'),
               pd.DataFrame(codes, columns=feature_names).astype(int)], axis=1)
Resultant df is like:
V1_Mid V1_Low V1_High V2_12 V2_6 V2_5 V2_2
1 0 0 1 0 0 0
..and so on
Question: How do I convert my resultant df back to the original df?
I have seen this but it gives me NameError: name 'Series' is not defined.
First, we can regroup each dummy column from your resultant df under its original column name, as the first level of a column MultiIndex:
>>> df.columns = pd.MultiIndex.from_tuples(df.columns.str.split('_', 1).map(tuple))
>>> df = df.rename(columns={'V1': 'Class', 'V2': 'Family'}, level=0)
>>> df
Class Family
Mid Low High 12 6 5 2
0 1 0 0 1 0 0 0
Now the second level of the columns holds the values. Thus, within each top-level column we want to get the column name that has a 1, knowing all the other entries are 0. This can be done with idxmax():
>>> orig_df = pd.concat({col: df[col].idxmax(axis='columns') for col in df.columns.levels[0]}, axis='columns')
>>> orig_df
Class Family
0 Mid 12
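Since the dummies came from sklearn's OneHotEncoder, its inverse_transform can also recover the original values directly; a minimal sketch, assuming the fitted ohe, the codes array and X_object from the question are still in scope:

recovered = pd.DataFrame(ohe.inverse_transform(codes),
                         columns=X_object.columns, index=X_object.index)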
An even simpler way is to just stick to pandas.
df = pd.DataFrame({"Index":[1,2,3,4],"Class":["Mid","Low","High","Low"],"Family":[12,6,5,2]})
# Combine features in new column
df["combined"] = list(zip(df["Class"], df["Family"]))
print(df)
Out:
Index Class Family combined
0 1 Mid 12 (Mid, 12)
1 2 Low 6 (Low, 6)
2 3 High 5 (High, 5)
3 4 Low 2 (Low, 2)
You can get the one-hot encoding using pandas directly.
one_hot = pd.get_dummies(df["combined"])
print(one_hot)
Out:
(High, 5) (Low, 2) (Low, 6) (Mid, 12)
0 0 0 0 1
1 0 0 1 0
2 1 0 0 0
3 0 1 0 0
Then, to get back, you can just check the name of the column and select the rows in the original dataframe with the same tuple.
print(df[df["combined"]==one_hot.columns[0]])
Out:
Index Class Family combined
2 3 High 5 (High, 5)

Get all columns per id where the column is equal to a value

Say I have a pandas dataframe:
id A B C D
id1 0 1 0 1
id2 1 0 0 1
id3 0 0 0 1
id4 0 0 0 0
I want to select, for each id, all the column names whose value is equal to 1; this list will then be a new column in the dataframe.
Expected output:
id A B C D Result
id1 0 1 0 1 [B,D]
id2 1 0 0 1 [A,D]
id3 0 0 0 1 [D]
id4 0 0 0 0 []
I tried df.apply(lambda row: row[row == 1].index, axis=1), but the output of 'Result' was not in the form specified above.
You can do what you are trying to do by adding .tolist():
df['Result'] = df.apply(lambda row: row[row == 1].index.tolist(), axis=1)
That said, your approach of using lists as values inside a single column runs against the Pandas approach of keeping data tabular (one scalar value per cell). It will probably be better to use plain Python structures instead of pandas for this, as sketched below.
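A minimal sketch of that plain-Python alternative (assuming id is the index, as in the question's expected output), building an ordinary dict instead of storing lists in cells:

result = {idx: [col for col in df.columns if row[col] == 1]
          for idx, row in df.iterrows()}
# {'id1': ['B', 'D'], 'id2': ['A', 'D'], 'id3': ['D'], 'id4': []}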
Setup
I used a different set of ones and zeros to highlight skipping an entire row.
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[0, 1, 0, 1], [1, 0, 0, 1], [0, 0, 0, 0], [0, 0, 1, 0]],
    ['id1', 'id2', 'id3', 'id4'],
    ['A', 'B', 'C', 'D']
)
df
A B C D
id1 0 1 0 1
id2 1 0 0 1
id3 0 0 0 0
id4 0 0 1 0
Not Your Grampa's Reverse Binarizer
n = len(df)
i, j = np.nonzero(df.to_numpy())
col_names = df.columns[j]
positions = np.bincount(i, minlength=n).cumsum()[:-1]
result = np.split(col_names, positions)
df.assign(Result=[a.tolist() for a in result])
A B C D Result
id
id1 0 1 0 1 [B, D]
id2 1 0 0 1 [A, D]
id3 0 0 0 0 []
id4 0 0 1 0 [C]
Explanations
Ohh, the details!
np.nonzero on a 2-D array will return two arrays of equal length. The first array will have the 1st dimensional position of each element that is not zero. The second array will have the 2nd dimensional position of each element that is not zero. I'll call the first array i and the second array j.
In the figure below, I label the columns with what j they represent and correspondingly, I label the rows with what i they represent.
For each non-zero element of the dataframe, I place above the value a tuple with the (ith, jth) dimensional positions and in brackets the [kth] non-zero element in the dataframe.
# j → 0 1 2 3
# A B C D
# i
# ↓ (0,1)[0] (0,3)[1]
# 0 id1 0 1 0 1
#
# (1,0)[2] (1,3)[3]
# 1 id2 1 0 0 1
#
#
# 2 id3 0 0 0 0
#
# (3,2)[4]
# 3 id4 0 0 1 0
In the figure below, I show what i and j look like. I label each row with the same k in the brackets in the figure above
# i j
# ----
# 0 1 k=0
# 0 3 k=1
# 1 0 k=2
# 1 3 k=3
# 3 2 k=4
Now it's easy to slice the df.columns with the j array to get all the column labels in one big array.
# df.columns[j]
# B
# D
# A
# D
# C
The plan is to use np.split to chop up df.columns[j] into the sub-arrays for each row. It turns out that the information we need is embedded in the array i. I'll use np.bincount to count how many non-zero elements are in each row, and I'll need to tell np.bincount the minimum number of bins to assume. That minimum is the number of rows in the dataframe, which we assign to n with n = len(df).
# np.bincount(i, minlength=n)
# 2 ← Two non-zero elements in the first row
# 2 ← Two more in the second row
# 0 ← None in this row
# 1 ← And one more in the fourth row
However, if we take the cumulative sum of this array, we get the positions we need to split at.
# np.bincount(i, minlength=n).cumsum()
# 2
# 4
# 4 ← This repeated value results in an empty array for the 3rd row
# 5
Let's look at how this matches up with df.columns[j]. We see below that the column slice gets split exactly where we need.
# B  D  A  D  C     ← df.columns[j]
# 2     4  4  5     ← np.bincount(i, minlength=n).cumsum(): split after B D, after A D, again at 4 (the empty 3rd row), and after C
One issue is that the 4 values in this array will result in splitting the df.columns[j] array into 5 sub-arrays. This isn't horrible because the last array will always be empty, so we slice it to the appropriate size with np.bincount(i, minlength=n).cumsum()[:-1].
col_names = df.columns[j]
positions = np.bincount(i, minlength=n).cumsum()[:-1]
result = np.split(col_names, positions)
# result
# [B, D]
# [A, D]
# []
# [C]
The only thing left to do is assign it to a new column and turn the individual sub-arrays into lists.
df.assign(Result=[a.tolist() for a in result])
# A B C D Result
# id
# id1 0 1 0 1 [B, D]
# id2 1 0 0 1 [A, D]
# id3 0 0 0 0 []
# id4 0 0 1 0 [C]

Populate cells based on x by y cell value

I'm trying to populate cells based on values from two different cells.
The values in a column need to run from 0 to (n-1), where n is that cell's input, each repeated according to the value of the other cell.
For example, I have input:
x y
2 5
Output should be:
x should have 0 and 1, each repeated five times
y should cycle through 0, 1, 2, 3, 4 twice
x1 y1
0 0
0 1
0 2
0 3
0 4
1 0
1 1
1 2
1 3
1 4
I used:
=IF(ROW()<=C2+1,K2-1,"")
and
=IF(ROW()<=d2+1,K2-1,"")
but it is not repeating and I only see:
x y
0 0
1 1
__ 2
__ 3
__ 4
(C2 and D2 are where values for x and y are, K is the number of items.)
Are there any suggestions on how I can do this?
In Row2 and copied down to suit:
=IF(ROW()<=1+C$2*D$2,INT((ROW()-2)/D$2),"")
and
=IF(ROW()<=1+C$2*D$2,MOD(ROW()-2,D$2),"")
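For intuition, the same pattern in Python: the INT term behaves like repeating each x value, and the MOD term like cycling the y values (a sketch, not part of the original Excel answer; variable names are illustrative):

import numpy as np

x, y = 2, 5  # the C2 and D2 inputs from the question
x1 = np.repeat(np.arange(x), y)  # 0 0 0 0 0 1 1 1 1 1, like INT((ROW()-2)/D$2)
y1 = np.tile(np.arange(y), x)    # 0 1 2 3 4 0 1 2 3 4, like MOD(ROW()-2, D$2)
print(list(zip(x1, y1)))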

how to sort a pandas dataframe according to elements of list [duplicate]

I have the following example dataframe:
c1 c2
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
Given a template c1 = [3, 2, 5, 4, 1], I want to change the order of the rows based on the new order of column c1, so it will look like:
c1 c2
0 3 c
1 2 b
2 5 e
3 4 d
4 1 a
I found the following thread, but the shuffle there is random. Correct me if I'm wrong.
Shuffle DataFrame rows
If the values are unique both in the list and in the c1 column, use reindex:
df = df.set_index('c1').reindex(c1).reset_index()
print (df)
c1 c2
0 3 c
1 2 b
2 5 e
3 4 d
4 1 a
A general solution that works with duplicates both in the list and in the column:
c1 = [3, 2, 5, 4, 1, 3, 2, 3]
#create df from list
list_df = pd.DataFrame({'c1':c1})
print (list_df)
c1
0 3
1 2
2 5
3 4
4 1
5 3
6 2
7 3
# helper column to count duplicate values
df['g'] = df.groupby('c1').cumcount()
list_df['g'] = list_df.groupby('c1').cumcount()
# merge together and drop the helper column g
df = list_df.merge(df).drop('g', axis=1)
print (df)
c1 c2
0 3 c
1 2 b
2 5 e
3 4 d
4 1 a
5 3 c
merge
You can create a dataframe with the column specified in the wanted order, then merge.
One advantage of this approach is that it gracefully handles duplicates in either df.c1 or the list c1. If duplicates are not wanted, care must be taken to handle them prior to reordering (see the sketch after the output below).
d1 = pd.DataFrame({'c1': c1})
d1.merge(df)
c1 c2
0 3 c
1 2 b
2 5 e
3 4 d
4 1 a
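A possible sketch of handling unwanted duplicates before reordering (assuming the df and c1 from above; drop_duplicates and pd.unique keep the first occurrence):

d1 = pd.DataFrame({'c1': pd.unique(c1)})
d1.merge(df.drop_duplicates('c1'))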
searchsorted
This is less robust, but it will work if df.c1 is:
already sorted
a one-to-one mapping
df.iloc[df.c1.searchsorted(c1)]
c1 c2
2 3 c
1 2 b
4 5 e
3 4 d
0 1 a

Pandas Flag Rows with Complementary Zeros

Given the following data frame:
import pandas as pd
df = pd.DataFrame({'A': [0, 4, 4, 4],
                   'B': [0, 4, 4, 0],
                   'C': [0, 4, 4, 4],
                   'D': [4, 0, 0, 4],
                   'E': [4, 0, 0, 0],
                   'Name': ['a', 'a', 'b', 'c']})
df
A B C D E Name
0 0 0 0 4 4 a
1 4 4 4 0 0 a
2 4 4 4 0 0 b
3 4 0 4 4 0 c
I'd like to add a new field called "Match_Flag" which labels unique combinations of rows if they have complementary zero patterns (as with rows 0, 1, and 2) AND have the same name (true only for rows 0 and 1). The flag uses the name of the rows that match.
The desired result is as follows:
A B C D E Name Match_Flag
0 0 0 0 4 4 a a
1 4 4 4 0 0 a a
2 4 4 4 0 0 b NaN
3 4 0 4 4 0 c NaN
Caveat:
The patterns may vary, but should still be complementary.
Thanks in advance!
UPDATE
Sorry for the confusion.
Here is some clarification:
The reason why rows 0 and 1 are "complementary" is that they have opposite patterns of zeros in their columns: 0,0,0,4,4 vs. 4,4,4,0,0.
The number 4 is arbitrary; it could just as easily be 0,0,0,4,2 and 65,770,23,0,0. So if two such rows are indeed complementary and they have the same name, I'd like them to be flagged with that same name under the "Match_Flag" column.
You can identify a complement if its dot product is zero and its element-wise sum is nowhere zero.
import numpy as np

def complements(df):
    v = df.drop('Name', axis=1).values
    n = v.shape[0]
    row, col = np.triu_indices(n, 1)
    # ensure two rows are complete:
    # their sum contains no zeros
    c = ((v[row] + v[col]) != 0).all(1)
    complete = set(row[c]).union(col[c])
    # ensure two rows do not overlap:
    # their product is zero everywhere
    o = (v[row] * v[col] == 0).all(1)
    non_overlap = set(row[o]).union(col[o])
    # we are a complement iff we do
    # not overlap and we are complete
    complement = list(non_overlap.intersection(complete))
    # return the matching slice of names
    return df.Name.iloc[complement]
Then groupby('Name') and apply our function
df['Match_Flag'] = df.groupby('Name', group_keys=False).apply(complements)
