I have a DataFrame with columns A, B, and C. For each value of A, I would like to select the row with the minimum value in column B.
That is, from this:
df = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
'B': [4, 5, 2, 7, 4, 6],
'C': [3, 4, 10, 2, 4, 6]})
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
I would like to get:
A B C
0 1 2 10
1 2 4 4
For the moment I am grouping by column A, then creating a value that indicates to me the rows I will keep:
a = data.groupby('A').min()
a['A'] = a.index
to_keep = [str(x[0]) + str(x[1]) for x in a[['A', 'B']].values]
data['id'] = data['A'].astype(str) + data['B'].astype('str')
data[data['id'].isin(to_keep)]
I am sure that there is a much more straightforward way to do this.
I have seen many answers here that use MultiIndex, which I would prefer to avoid.
Thank you for your help.
I feel like you're overthinking this. Just use groupby and idxmin:
df.loc[df.groupby('A').B.idxmin()]
A B C
2 1 2 10
4 2 4 4
df.loc[df.groupby('A').B.idxmin()].reset_index(drop=True)
A B C
0 1 2 10
1 2 4 4
Had a similar situation but with a more complex column heading (e.g. "B val") in which case this is needed:
df.loc[df.groupby('A')['B val'].idxmin()]
The accepted answer (suggesting idxmin) cannot be used with the pipe pattern. A pipe-friendly alternative is to first sort values and then use groupby with DataFrame.head:
data.sort_values('B').groupby('A').apply(DataFrame.head, n=1)
This is possible because by default groupby preserves the order of rows within each group, which is stable and documented behaviour (see pandas.DataFrame.groupby).
This approach has additional benefits:
it can be easily expanded to select n rows with smallest values in specific column
it can break ties by providing another column (as a list) to .sort_values(), e.g.:
data.sort_values(['final_score', 'midterm_score']).groupby('year').apply(DataFrame.head, n=1)
As with other answers, to exactly match the result desired in the question .reset_index(drop=True) is needed, making the final snippet:
df.sort_values('B').groupby('A').apply(DataFrame.head, n=1).reset_index(drop=True)
I found an answer a little bit more wordy, but a lot more efficient:
This is the example dataset:
data = pd.DataFrame({'A': [1,1,1,2,2,2], 'B':[4,5,2,7,4,6], 'C':[3,4,10,2,4,6]})
data
Out:
A B C
0 1 4 3
1 1 5 4
2 1 2 10
3 2 7 2
4 2 4 4
5 2 6 6
First we will get the min values on a Series from a groupby operation:
min_value = data.groupby('A').B.min()
min_value
Out:
A
1 2
2 4
Name: B, dtype: int64
Then, we merge this series result on the original data frame
data = data.merge(min_value, on='A',suffixes=('', '_min'))
data
Out:
A B C B_min
0 1 4 3 2
1 1 5 4 2
2 1 2 10 2
3 2 7 2 4
4 2 4 4 4
5 2 6 6 4
Finally, we get only the lines where B is equal to B_min and drop B_min since we don't need it anymore.
data = data[data.B==data.B_min].drop('B_min', axis=1)
data
Out:
A B C
2 1 2 10
4 2 4 4
I have tested it on very large datasets and this was the only way I could make it work in a reasonable time.
You can sort_values and drop_duplicates:
df.sort_values('B').drop_duplicates('A')
Output:
A B C
2 1 2 10
4 2 4 4
The solution is, as written before ;
df.loc[df.groupby('A')['B'].idxmin()]
If the solution but then if you get an error;
"Passing list-likes to .loc or [] with any missing labels is no longer supported.
The following labels were missing: Float64Index([nan], dtype='float64').
See https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#deprecate-loc-reindex-listlike"
In my case, there were 'NaN' values at column B. So, I used 'dropna()' then it worked.
df.loc[df.groupby('A')['B'].idxmin().dropna()]
You can also boolean indexing the rows where B column is minimal value
out = df[df['B'] == df.groupby('A')['B'].transform('min')]
print(out)
A B C
2 1 2 10
4 2 4 4
How come this be right?
X=np.array([range(1,12)])
A=X>4
B=X<10
C=(X>4) | (X<10)
print (X[A])
print (X[B])
print (X[C])
[ 5 6 7 8 9 10 11]
[1 2 3 4 5 6 7 8 9]
[ 1 2 3 4 5 6 7 8 9 10 11]
I'm going to guess that your concern is because you have every element in the final expression, simply because the first two are obvious (5..11 are all greater than four and 1..9 are all less than ten).
But the third one is also right since every element is either greater than four or less than ten. The numbers 1..9 are all less than ten so they're in. Similarly, 5..11 are all greater than four so they're in as well. The union of those two ranges is the entire set of values.
If you wanted the items that were between four and ten (exclusive at both ends), you should probably have used "and" instead of "or" (& instead of |):
import numpy as np
X=np.array([range(1,12)])
A=X>4
B=X<10
C=(X>4) | (X<10)
D=(X>4) & (X<10)
E=(X<=4) | (X>=10)
print (X[A])
print (X[B])
print (X[C])
print (X[D])
print (X[E])
The output of that is:
[ 5 6 7 8 9 10 11]
[1 2 3 4 5 6 7 8 9]
[ 1 2 3 4 5 6 7 8 9 10 11]
[5 6 7 8 9]
[ 1 2 3 4 10 11]
Because you didn't specify what you wanted in the original question), I've also added the opposite operation (to get values not in that range). That's indicated by the E code.
The rather verbose fork I came up with is
({. , (>:#[ }. ]))
E.g.,
3 ({. , (>:#[ }. ])) 0 1 2 3 4 5
0 1 2 4 5
Works great, but is there a more idiomatic way? What is the usual way to do this in J?
Yes, the J-way is to use a 3-level boxing:
(<<<5) { i.10
0 1 2 3 4 6 7 8 9
(<<<1 3) { i.10
0 2 4 5 6 7 8 9
It's a small note in the dictionary for {:
Note that the result in the very last dyadic example, that is, (<<<_1){m , is all except the last item.
and a bit more in Learning J: Chapter 6 - Indexing: 6.2.5 Excluding Things.
Another approach is to use the monadic and dyadic forms of # (Tally and Copy). This idiom of using Copy to remove an item is something that I use frequently.
The hook (i. i.##) uses Tally (monadic #) and monadic and dyadic i. (Integers and Index of) to generate the filter string:
2 (i. i.##) 'abcde'
1 1 0 1 1
which Copy (dyadic #) uses to omit the appropriate item.
2 ((i. i.##) # ]) 0 1 2 3 4 5
0 1 3 4 5
2 ((i. i.##) # ]) 'abcde'
abde
The i. primitive produces a list of integers:
i. 10
0 1 2 3 4 5 6 7 8 9
If I want to produce several short lists in a row, I do this:
;i."0 each [ 2 3 4
0 1 0 1 2 0 1 2 3
(the result I want)
Boxing (that each) is a crutch here, because without it, i."0 produces a matrix.
i."0 [ 2 3 4
0 1 0 0
0 1 2 0
0 1 2 3
(the result I don't want)
Is there a better way to not have i."0 format the output to a matrix, but an array?
No, I believe you can't do any better than your current solution. There is no way for i."0 to return a vector.
The "0 adverb forces i. to accept scalars, and i. returns vectors. i. has no way of knowing that your input was a vector rather than a scalar. According to The J primer the result shape is the concatenation of the frame of the argument and the result.
The shortest "box-less" solution I've found so far is
(*#$"0~#&,i."0) 2 3 4
which is still longer than just using ;i. each 2 3 4
Imagine that I want to take the numbers from 1 to 3 and form a matrix such that each possible pairing is represented, e.g.,
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
Here is the monadic verb I formulated in J to do this:
($~ (-:## , 2:)) , ,"0/~ 1+i.y
Originally I had thought that ,"0/~ 1+i.y would be sufficient, but unfortunately that produces the following output:
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
In other words, its shape is 3 3 2 and I want something whose shape is 9 2. The only way I could think of to fix it is to pour all of the data into a new shape. I'm convinced there must be a more concise way to do this. Anyone know?
Reshaping your intermediate result can be simplified. Removing the topmost axis is commonly done with ,/ so in your case the completed phrase could be ,/ ,"0/~ 1+i.y
One way (which uses { as a monad in its capacity for permutation cataloguing):
>,{ 2#<1+i.y
EDIT:
Some fun to be had with this scheme:
All possible permutations:
>,{ y#<1+i.y
Configurable number in sequence:
>,{ x#<1+i.y
I realize this question is old, but there is a simpler way to do it: count to 9 in trinary, and add 1.
1 + 3 3 #: i.9
1 1
1 2
1 3
2 1
2 2
2 3
3 1
3 2
3 3
The 3 3 & #: gives you two digits. The general 'base 3' verb is 3 & #.^:_1.