How to sort a pandas DataFrame according to the elements of a list [duplicate]

I have the following example DataFrame.
c1 c2
0 1 a
1 2 b
2 3 c
3 4 d
4 5 e
Given a template c1 = [3, 2, 5, 4, 1], I want to change the order of the rows based on the new order of column c1, so it will look like:
c1 c2
0 3 c
1 2 b
2 5 e
3 4 d
4 1 a
I found the following thread, "Shuffle DataFrame rows", but the shuffle there is random. Correct me if I'm wrong.

If the values are unique both in the list and in the c1 column, use reindex:
df = df.set_index('c1').reindex(c1).reset_index()
print (df)
c1 c2
0 3 c
1 2 b
2 5 e
3 4 d
4 1 a
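For anyone who wants to run this end to end, here is a minimal self-contained sketch of the reindex approach. The variable names order and result are my own, and the uniqueness check is an added assumption rather than part of the answer above; it is there because reindex only behaves as shown when both the column and the list are free of duplicates.

```python
import pandas as pd

# The frame and template list from the question
df = pd.DataFrame({"c1": [1, 2, 3, 4, 5], "c2": list("abcde")})
order = [3, 2, 5, 4, 1]

# Added guard (assumption): this approach expects unique values on both sides
assert df["c1"].is_unique and len(set(order)) == len(order)

# Move c1 into the index, pick rows in the requested order, restore the column
result = df.set_index("c1").reindex(order).reset_index()
print(result)
```

Note that a value in the list that does not occur in c1 would produce a row of NaN rather than an error.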
A general solution that also works with duplicates in the list and in the column:
c1 = [3, 2, 5, 4, 1, 3, 2, 3]
#create df from list
list_df = pd.DataFrame({'c1':c1})
print (list_df)
c1
0 3
1 2
2 5
3 4
4 1
5 3
6 2
7 3
#helper column that counts duplicate values
df['g'] = df.groupby('c1').cumcount()
list_df['g'] = list_df.groupby('c1').cumcount()
#merge together, create index from column and remove g column
df = list_df.merge(df).drop('g', axis=1)
print (df)
c1 c2
0 3 c
1 2 b
2 5 e
3 4 d
4 1 a
5 3 c
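To see how the cumcount helper pairs up duplicates, here is a small sketch on a toy frame of my own (not the data from the question) where the value 3 occurs twice in the column and three times in the list. The names order, left, right and result are illustrative only.

```python
import pandas as pd

# Toy data (assumption): 3 appears twice in the column, three times in the list
df = pd.DataFrame({"c1": [1, 2, 3, 3], "c2": list("abcd")})
order = [3, 2, 3, 1, 3]

# Number each occurrence of a value: first 3 -> 0, second 3 -> 1, ...
left = pd.DataFrame({"c1": order})
left["g"] = left.groupby("c1").cumcount()
right = df.assign(g=df.groupby("c1").cumcount())

# Inner merge on (value, occurrence number); the left order is preserved
result = left.merge(right, on=["c1", "g"]).drop(columns="g")
print(result)
```

The third 3 in the list has no remaining partner in the frame, so the default inner merge drops it; how="left" would keep it with a NaN in c2 instead.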

merge
You can create a dataframe with the column specified in the wanted order then merge.
One advantage of this approach is that it gracefully handles duplicates in either df.c1 or the list c1. If duplicates are not wanted, they must be handled before reordering.
d1 = pd.DataFrame({'c1': c1})
d1.merge(df)
c1 c2
0 3 c
1 2 b
2 5 e
3 4 d
4 1 a
searchsorted
This is less robust, but it will work if df.c1 is:
already sorted
in a one-to-one mapping with the values in the list
df.iloc[df.c1.searchsorted(c1)]
c1 c2
2 3 c
1 2 b
4 5 e
3 4 d
0 1 a
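Since searchsorted silently returns wrong positions when those two conditions do not hold, a cautious version might check them first. This is only a sketch; the assertions and the names order, positions and result are my additions.

```python
import pandas as pd

df = pd.DataFrame({"c1": [1, 2, 3, 4, 5], "c2": list("abcde")})
order = [3, 2, 5, 4, 1]

# Added guards (assumption): sorted, unique, and every list value is present
assert df["c1"].is_monotonic_increasing and df["c1"].is_unique
assert set(order) <= set(df["c1"])

# searchsorted gives the integer position of each list value in the column
positions = df["c1"].searchsorted(order)
result = df.iloc[positions]
print(result)
```

The original index labels are kept, which is why the printed index reads 2, 1, 4, 3, 0.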

Related

Getting rows with minimum col2 given same col1 [duplicate]

I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
'B': [4, 5, 2, 7, 4, 6],
'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:
a = data.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
data['id'] = data['A'].astype(str) + data['B'].astype('str')
data[data['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
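A self-contained version of the idxmin approach, for reference. The intermediate name idx is mine. The point worth noticing is that idxmin returns index labels, not positions, which is why the lookup goes through .loc.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2, 2],
                   "B": [4, 5, 2, 7, 4, 6],
                   "C": [3, 4, 10, 2, 4, 6]})

# For each group in A, the index label of the row where B is smallest
idx = df.groupby("A")["B"].idxmin()

# Select those rows by label, then renumber to match the desired output
result = df.loc[idx].reset_index(drop=True)
print(result)
```

When several rows in a group share the minimum, idxmin reports only the first one.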
I had a similar situation, but with a more complex column heading (e.g. "B val"), in which case bracket access is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
data.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can be easily expanded to select n rows with smallest values in specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(pd.DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(pd.DataFrame.head, n=1).reset_index(drop=True)
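A slightly different spelling of the same idea, as a sketch: it calls GroupBy.head directly instead of going through apply, and wraps the chain in a helper so it can be passed to DataFrame.pipe. The helper name smallest_per_group and its parameters are hypothetical, not something from pandas.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 1, 1, 2, 2, 2],
                   "B": [4, 5, 2, 7, 4, 6],
                   "C": [3, 4, 10, 2, 4, 6]})

def smallest_per_group(frame, group_col, value_col, n=1):
    """Hypothetical helper: the n rows with the smallest value_col per group."""
    # Sort first, then take the first n rows of every group
    return frame.sort_values(value_col).groupby(group_col).head(n)

result = df.pipe(smallest_per_group, "A", "B").reset_index(drop=True)
print(result)
```

The rows come back in sorted order of B rather than grouped by A, so add a final sort if the group order matters.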
I found an answer that is a little more verbose, but a lot more efficient.
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First we will get the min values on a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then we merge this Series back onto the original data frame:
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we keep only the rows where B is equal to B_min and drop B_min, since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
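drop_duplicates keeps the first row it sees for each value of A, so the sort decides which row wins a tie. A sketch on toy data of my own (df_ties is not from the question), breaking ties on C by adding it as a second sort key:

```python
import pandas as pd

# Toy data (assumption): group 1 has two rows tied on the minimum B
df_ties = pd.DataFrame({"A": [1, 1, 2, 2],
                        "B": [2, 2, 4, 3],
                        "C": [9, 1, 5, 6]})

# Sort by B, then by C to break ties; keep the first row of every A
result = df_ties.sort_values(["B", "C"]).drop_duplicates("A")
print(result)
```

Here row 1 is kept for group 1 because it has the smaller C among the two tied rows.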
The solution is, as written before:
df.loc[df.groupby('A')['B'].idxmin()]
If you use this solution and then get the following error:
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
In my case, there were NaN values in column B. So I added dropna(), and then it worked:
df.loc[df.groupby('A')['B'].idxmin().dropna()]
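A related variant, offered only as a sketch: drop the rows with a missing B before grouping, rather than dropping NaN from the idxmin result afterwards. I am not certain how idxmin treats all-NaN groups in every pandas version, so filtering first sidesteps the question. The toy frame df_nan and the name valid are my own.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): group 2 has one NaN, group 3 has only NaN in B
df_nan = pd.DataFrame({"A": [1, 1, 2, 2, 3],
                       "B": [4.0, 2.0, np.nan, 5.0, np.nan],
                       "C": [3, 4, 10, 2, 7]})

# Remove rows with a missing B first, then look up the minimum per group
valid = df_nan.dropna(subset=["B"])
result = valid.loc[valid.groupby("A")["B"].idxmin()]
print(result)
```

Groups whose B values are all missing (group 3 here) simply do not appear in the result.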
You can also use boolean indexing to select the rows where column B equals the minimum of its group:
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4
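One difference from idxmin that may matter: the boolean mask keeps every row that ties for the minimum. A small sketch with toy data of my own (df_ties is not from the question):

```python
import pandas as pd

# Toy data (assumption): group 1 has two rows tied on the minimum B
df_ties = pd.DataFrame({"A": [1, 1, 2, 2],
                        "B": [2, 2, 4, 3],
                        "C": [9, 1, 5, 6]})

# transform('min') broadcasts each group's minimum back to every row
group_min = df_ties.groupby("A")["B"].transform("min")
out = df_ties[df_ties["B"] == group_min]
print(out)
```

Both tied rows of group 1 survive, so the result can have more than one row per group.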

How to replenish a data frame based on another one?

I have two data frames. One contains a column of repeated values (a, in this case). The other records what each value corresponds to (in this example, some "d" values). How do I efficiently add a new column to the first data frame whose values are looked up from an existing column, according to the mapping recorded in the other data frame? Here is example code that works, but really slowly:
import pandas as pd
import numpy as np
d1 = pd.DataFrame(np.asarray([[1,2,3], [2,4,5], [3,4,5], [2,1,4], [3,4,5]]), columns = ['a', 'b', 'c'])
d2 = pd.DataFrame(np.asarray([[1,7], [2,8], [3,11]]), columns = ['a', 'd'])
d = np.empty((d1.shape[0],))
for i in range(d1.shape[0]):
    temp = d2.loc[d2['a'] == d1.at[i,'a']]
    d[i] = temp['d'].array[0]
d1['d'] = d
This is d1 original:
a b c
0 1 2 3
1 2 4 5
2 3 4 5
3 2 1 4
4 3 4 5
This is d2:
a d
0 1 7
1 2 8
2 3 11
This is the resulting d1:
a b c d
0 1 2 3 7
1 2 4 5 8
2 3 4 5 11
3 2 1 4 8
4 3 4 5 11
You're probably looking for pd.merge.
In your case, d1 = d1.merge(d2, on=['a'], how='left') should do the trick.
Another way is to use map and make only the values you need.
d1['d'] = d1['a'].map(d2.set_index('a')['d'])
d1
Output:
a b c d
0 1 2 3 7
1 2 4 5 8
2 3 4 5 11
3 2 1 4 8
4 3 4 5 11
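Both suggestions side by side on the data from the question, as a runnable sketch. The names merged and mapped are mine, and the validate argument is an optional extra I added so the merge fails loudly if d2 ever contains a repeated key.

```python
import pandas as pd

d1 = pd.DataFrame([[1, 2, 3], [2, 4, 5], [3, 4, 5], [2, 1, 4], [3, 4, 5]],
                  columns=["a", "b", "c"])
d2 = pd.DataFrame([[1, 7], [2, 8], [3, 11]], columns=["a", "d"])

# Left merge: keep every row of d1 in order, pull in d from d2 on key a
merged = d1.merge(d2, on="a", how="left", validate="many_to_one")

# map: turn d2 into a lookup Series indexed by a, then map column a through it
mapped = d1.assign(d=d1["a"].map(d2.set_index("a")["d"]))

print(merged)
```

Either way the loop disappears; values of a that are missing from d2 would come out as NaN in both versions.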

Collapse values from multiple rows of a column into an array when all other columns values are same

I have a table with 7 columns where for every few rows, 6 columns remain same and only the 7th changes. I would like to merge all these rows into one row, and combine the value of the 7th column into a list.
So if I have this dataframe:
A B C
0 a 1 2
1 b 3 4
2 c 5 6
3 c 7 6
I would like to convert it to this:
A B C
0 a 1 2
1 b 3 4
2 c [5, 7] 6
Since the values of column A and C were same in row 2 and 3, they would get collapsed into a single row and the values of B will be combined into a list.
Melt, explode, and pivot don't seem to have such functionality. How can I achieve this using pandas?
Use GroupBy.agg with a custom lambda function, then add DataFrame.reindex to restore the original column order:
f = lambda x: x.tolist() if len(x) > 1 else x
df = df.groupby(['A','C'])['B'].agg(f).reset_index().reindex(df.columns, axis=1)
You can also build the column names dynamically:
changes = ['B']
cols = df.columns.difference(changes).tolist()
f = lambda x: x.tolist() if len(x) > 1 else x
df = df.groupby(cols)[changes].agg(f).reset_index().reindex(df.columns, axis=1)
print (df)
A B C
0 a 1 2
1 b 3 4
2 c [5, 7] 6
If every value in the column should be a list, the solution is simpler:
changes = ['B']
cols = df.columns.difference(changes).tolist()
df = df.groupby(cols)[changes].agg(list).reset_index().reindex(df.columns, axis=1)
print (df)
A B C
0 a [1] 2
1 b [3] 4
2 c [5, 7] 6
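Here is a runnable sketch that combines the two variants above in a slightly different order: aggregate everything to lists first, then unwrap the single-element lists in a separate step. The unwrapping via Series.map is my own substitution for the conditional lambda, and the name out is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": list("abcc"), "B": [1, 3, 5, 7], "C": [2, 4, 6, 6]})

changes = ["B"]                                  # column(s) to collapse
cols = df.columns.difference(changes).tolist()   # everything else: ['A', 'C']

# Collect B into a list per group, then restore the original column order
out = (df.groupby(cols)["B"].agg(list)
         .reset_index()
         .reindex(df.columns, axis=1))

# Unwrap lists that hold a single value, leave longer lists alone
out["B"] = out["B"].map(lambda v: v[0] if len(v) == 1 else v)
print(out)
```

Mixing scalars and lists in one column gives it object dtype, so keeping lists everywhere may be easier to work with downstream.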
Here is another approach using pivot_table and applymap:
(df.pivot_table(index='A',aggfunc=list).applymap(lambda x: x[0] if len(set(x))==1 else x)
.reset_index())
A B C
0 a 1 2
1 b 3 4
2 c [5, 7] 6

Create a new column with the minimum of other columns on same row

I have the following DataFrame
Input:
A B C D E
2 3 4 5 6
1 1 2 3 2
2 3 4 5 6
I want to add a new column that has the minimum of A, B and C for that row
Output:
A B C D E Goal
2 3 4 5 6 2
1 1 2 3 2 1
2 3 4 5 6 2
I have tried to use
df = df[['A','B','C']].min()
but I get errors about hashing lists. I also think this would give the minimum of each whole column, whereas I only want the row-wise minimum across those specific columns.
How can I best accomplish this?
Use min along the columns with axis=1.
An inline solution that produces a copy and doesn't alter the original:
df.assign(Goal=lambda d: d[['A', 'B', 'C']].min(1))
A B C D E Goal
0 2 3 4 5 6 2
1 1 1 2 3 2 1
2 2 3 4 5 6 2
The same answer, put differently.
Add the column to the existing dataframe:
new = df[['A', 'B', 'C']].min(axis=1)
df['Goal'] = new
df
A B C D E Goal
0 2 3 4 5 6 2
1 1 1 2 3 2 1
2 2 3 4 5 6 2
Add axis = 1 to your min
df['Goal'] = df[['A','B','C']].min(axis = 1)
You have to specify the axis along which the min function is applied, which here is 1 (across the columns of each row).
df['ABC_row_min'] = df[['A', 'B', 'C']].min(axis = 1)
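To make the difference between the two axes concrete, a short sketch on the frame from the question (the names col_min and row_min are mine):

```python
import pandas as pd

df = pd.DataFrame({"A": [2, 1, 2], "B": [3, 1, 3], "C": [4, 2, 4],
                   "D": [5, 3, 5], "E": [6, 2, 6]})

# Default axis=0: one minimum per column (what the original attempt computes)
col_min = df[["A", "B", "C"]].min()

# axis=1: one minimum per row, which is what the question asks for
row_min = df[["A", "B", "C"]].min(axis=1)
df["Goal"] = row_min
print(df)
```

col_min has one entry per selected column, while row_min has one entry per row and lines up with the frame's index.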

Python Pandas: copy several columns at specific row from one dataframe to another with different names

I have dataframe1 with columns a,b,c,d with 5 rows.
I also have another dataframe2 with columns e,f,g,h
Let's say I want to copy columns a,b in row 3 from dataframe1 to columns f,g in row 3 at dataframe2.
I tried to use this code:
dataframe2.loc[3,['f','g']] = dataframe1.loc[3,['a','b']]
The result was NaN in dataframe2.
Any ideas how can I solve it?
One idea is to convert to a NumPy array to avoid alignment on column names:
dataframe2.loc[3,['f','g']] = dataframe1.loc[3,['a','b']].values
Sample:
dataframe1 = pd.DataFrame({'a':list('abcdef'),
'b':[4,5,4,5,5,4],
'c':[7,8,9,4,2,3]})
print (dataframe1)
a b c
0 a 4 7
1 b 5 8
2 c 4 9
3 d 5 4
4 e 5 2
5 f 4 3
dataframe2 = pd.DataFrame({'f':list('HIJK'),
'g':[0,0,7,1],
'h':[0,1,0,1]})
print (dataframe2)
f g h
0 H 0 0
1 I 0 1
2 J 7 0
3 K 1 1
dataframe2.loc[3,['f','g']] = dataframe1.loc[3,['a','b']].values
print (dataframe2)
f g h
0 H 0 0
1 I 0 1
2 J 7 0
3 d 5 1
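The same fix on all-numeric toy frames, as a minimal sketch. The data is made up for illustration, and I use .to_numpy(), which is the newer spelling of .values. The NaN in the question arises because assigning a Series aligns its labels ('a', 'b') with the target columns ('f', 'g'), and none of them match; a plain array has no labels, so values are assigned by position.

```python
import pandas as pd

# Toy frames (assumption): five rows each, integer columns only
dataframe1 = pd.DataFrame({"a": [10, 20, 30, 40, 50],
                           "b": [100, 200, 300, 400, 500]})
dataframe2 = pd.DataFrame({"e": [0] * 5, "f": [0] * 5,
                           "g": [0] * 5, "h": [0] * 5})

# Strip the labels so the two values land in f and g by position
dataframe2.loc[3, ["f", "g"]] = dataframe1.loc[3, ["a", "b"]].to_numpy()
print(dataframe2)
```

Only row 3 of columns f and g changes; everything else in dataframe2 is left as it was.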
