sort pandas value_counts() primarily by descending counts and secondarily by ascending values - python-3.x

When applying value_counts() to a Series in pandas, the counts are sorted in descending order by default, but the values within each identical count are in no particular order.
How can I have the values within each identical count sorted in ascending order? For example:
apples 5
peaches 5
bananas 3
carrots 3
apricots 1

The output of value_counts is a Series itself (just like the input), so all of the standard Series sorting options are available. For example, using sort_values (the old DataFrame.sort method was removed in modern pandas; since pandas 2.0, reset_index names the columns fruit and count):
import pandas as pd

df = pd.DataFrame({'fruit': ['apples'] * 5 + ['peaches'] * 5 +
                   ['bananas'] * 3 + ['carrots'] * 3 + ['apricots']})
df.fruit.value_counts().reset_index().sort_values(['count', 'fruit'],
                                                  ascending=[False, True])
      fruit  count
0    apples      5
1   peaches      5
2   bananas      3
3   carrots      3
4  apricots      1
I actually get the same result by default here, so here's a test with ascending=[False, False] to demonstrate that the secondary sort key really is taking effect:
df.fruit.value_counts().reset_index().sort_values(['count', 'fruit'],
                                                  ascending=[False, False])
      fruit  count
1   peaches      5
0    apples      5
3   carrots      3
2   bananas      3
4  apricots      1
I'm a bit unsure exactly which combination of ascending vs. descending is wanted here, but regardless, there are four possible combinations, and you can get whichever you like by altering the ascending keyword argument.
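Another sketch that stays a Series throughout (assuming NumPy is available alongside pandas): np.lexsort treats its last key as the primary one, so you can pass the negated counts as the primary key and the index as the tie-breaker, without ever calling reset_index.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'fruit': ['apples'] * 5 + ['peaches'] * 5 +
                   ['bananas'] * 3 + ['carrots'] * 3 + ['apricots']})

counts = df['fruit'].value_counts()
# lexsort uses its LAST key as the primary sort key: counts descending
# (negated), with the fruit names ascending as the tie-breaker.
order = np.lexsort((counts.index.to_numpy(), -counts.to_numpy()))
counts = counts.iloc[order]
print(counts)
```

This keeps the result as a plain Series, which is convenient if you want to keep chaining Series methods afterwards.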

Related

Combining the respective columns from 2 separate DataFrames using pandas

I have 2 large DataFrames with the same set of columns but different values. I need to combine the values in the respective columns (A and B here; there may be more in the actual data) into single values in the same columns (see the required output below). I have a quick way of implementing this using np.vectorize and df.to_numpy(), but I am looking for a way to implement this strictly with pandas. The criteria here are first readability of the code, then time complexity.
df1 = pd.DataFrame({'A':[1,2,3,4,5], 'B':[5,4,3,2,1]})
print(df1)
A B
0 1 5
1 2 4
2 3 3
3 4 2
4 5 1
and,
df2 = pd.DataFrame({'A':[10,20,30,40,50], 'B':[50,40,30,20,10]})
print(df2)
A B
0 10 50
1 20 40
2 30 30
3 40 20
4 50 10
I have one way of doing it which is quite fast -
import numpy as np

# This function might change into something more complex
def conc(a, b):
    return str(a) + '_' + str(b)

conc_v = np.vectorize(conc)
required = pd.DataFrame(conc_v(df1.to_numpy(), df2.to_numpy()),
                        columns=df1.columns)
print(required)
#Required Output
A B
0 1_10 5_50
1 2_20 4_40
2 3_30 3_30
3 4_40 2_20
4 5_50 1_10
Looking for an alternate way (strictly pandas) of solving this.
Criteria here is first readability of code
Another simple way is using add and radd:
df1.astype(str).add(df2.astype(str).radd('_'))
      A     B
0  1_10  5_50
1  2_20  4_40
2  3_30  3_30
3  4_40  2_20
4  5_50  1_10
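Perhaps the most readable variant of all (a sketch relying on the fact that arithmetic operators on two DataFrames align on both index and columns): cast both frames to strings and concatenate with plain +.

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [5, 4, 3, 2, 1]})
df2 = pd.DataFrame({'A': [10, 20, 30, 40, 50], 'B': [50, 40, 30, 20, 10]})

# Element-wise string concatenation; '_' is broadcast between the frames.
required = df1.astype(str) + '_' + df2.astype(str)
print(required)
```

Because + aligns on labels, this also behaves sensibly if the two frames' columns arrive in different orders.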

Highest frequency in a dataframe

I am looking for a way to get the most frequent value in the entire DataFrame, not in a particular column. I have looked at value_counts, but it seems to work in a column-specific way. Is there any way to do that?
Use DataFrame.stack with Series.mode to get the top values, and select the first one by position:
import pandas as pd

df = pd.DataFrame({
    'B': [4, 5, 4, 5, 4, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
})
a = df.stack().mode().iat[0]
print (a)
4
Or, if you also need the frequency, you can use Series.value_counts:
s = df.stack().value_counts()
print (s)
4 6
5 4
3 3
9 2
7 2
2 2
1 2
8 1
6 1
0 1
dtype: int64
print (s.index[0])
4
print (s.iat[0])
6
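A less pandas-idiomatic sketch of the same idea, which can be handy on large frames: flatten every cell into one 1-D NumPy array and count with collections.Counter.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({
    'B': [4, 5, 4, 5, 4, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 5, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
})

# Flatten the whole frame into a single 1-D array, then count occurrences;
# most_common(1) returns the (value, frequency) pair with the highest count.
value, freq = Counter(df.to_numpy().ravel()).most_common(1)[0]
print(value, freq)
```

Note this assumes all columns share a dtype that flattens cleanly (all numeric here); with mixed dtypes the stack-based answer above is safer.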

Is there a way with DataFrame objects to grab rows based on single-row conditions AND the preceding rows as well? Like 'grep -B1 ...' does on Linux

I have a pd.DataFrame object df, and I can select some rows, say on a single-column condition, grabbing all the rows matching the condition, but I wish to also grab the preceding row before each matching row. The result should be a pd.DataFrame with these rows.
I can write code to do that, and I am not asking for it (though feel free to illustrate if you think you have a neat and short way of doing it); I was wondering whether pandas has a built-in tool for this that I am not aware of.
An example showing what I'm looking for:
import pandas as pd

df = pd.DataFrame([{'a': 1, 'b': 'apples'}, {'a': 5, 'b': 'pears'},
                   {'a': 2, 'b': '4 plums'}, {'a': 9, 'b': 'bananas'},
                   {'a': 5, 'b': 'cherries'}, {'a': 2, 'b': '100 grapes'},
                   {'a': 3, 'b': 'oranges'}, {'a': 8, 'b': 'cherries'}])
print(df)
# prints: | my markings here, not part of printout, showing
# a b | with a '+' the rows i wish to select and why
# 0 1 apples |
# 1 5 pears | + - because it's a preceding row
# 2 2 4 plums | + - because it has a number
# 3 9 bananas |
# 4 5 cherries | + - because it's a preceding row
# 5 2 100 grapes | + - because it has a number
# 6 3 oranges |
# 7 8 cherries |
# condition would be all the rows where the 'b' column contains an item count:
df[[not x.isalpha() for x in df.b]]
# but this returns only the condition rows, at index 2 and 5, not rows
# 1, 2, 4, 5 as I want.
IIUC, you are looking for shift(-1):
c=~df.b.str.isalpha()
df[c|c.shift(-1)]
a b
1 5 pears
2 2 4 plums
4 5 cherries
5 2 100 grapes

Filter rows based on the count of unique values

I need to count the occurrences of each value in column A and keep only the rows whose value appears at least, say, 2 times.
A C
Apple 4
Orange 5
Apple 3
Mango 5
Orange 1
I have calculated the unique values with df.value_counts() but am not able to figure out how to filter with them.
I want to filter column A to values that appear at least 2 times; expected DataFrame:
A C
Apple 4
Orange 5
Apple 3
Orange 1
value_counts should be called on a Series (single column) rather than a DataFrame:
counts = df['A'].value_counts()
Giving:
A
Apple     2
Orange    2
Mango     1
Name: count, dtype: int64
You can then filter this to only keep those >= 2 and use isin to filter your DataFrame:
filtered = counts[counts >= 2]
df[df['A'].isin(filtered.index)]
Giving:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
Use duplicated with parameter keep=False:
df[df.duplicated(['A'], keep=False)]
Output:
A C
0 Apple 4
1 Orange 5
2 Apple 3
4 Orange 1
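A third variant that skips the intermediate counts Series entirely (a sketch using groupby().transform): broadcast each group's row count back onto its rows and filter on that.

```python
import pandas as pd

df = pd.DataFrame({'A': ['Apple', 'Orange', 'Apple', 'Mango', 'Orange'],
                   'C': [4, 5, 3, 5, 1]})

# transform('size') returns a Series aligned with df, holding each row's
# group size, so it can be compared and used as a boolean mask directly.
filtered = df[df.groupby('A')['A'].transform('size') >= 2]
print(filtered)
```

Unlike duplicated(keep=False), this generalizes to any threshold (>= 3, >= 10, ...) by changing one number.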

How to select and subset rows based on string in pandas dataframe?

My dataset looks like the following. I am trying to subset my pandas DataFrame so that only the responses given by all 3 people are selected. For example, in the DataFrame below the responses answered by all 3 people were "I like to eat" and "You have nice day", so only those should be kept. I am not sure how to achieve this with a pandas DataFrame.
Note: I am new to Python ,please provide explanation with your code.
DataFrame example
import pandas as pd

data = {'Person': ['1', '1', '1', '2', '2', '2', '2', '3', '3'],
        'Response': ['I like to eat', 'You have nice day', 'My name is ',
                     'I like to eat', 'You have nice day', 'My name is',
                     'This is it', 'I like to eat', 'You have nice day']}
df = pd.DataFrame(data)
print (df)
Output:
Person Response
0 1 I like to eat
1 1 You have nice day
2 1 My name is
3 2 I like to eat
4 2 You have nice day
5 2 My name is
6 2 This is it
7 3 I like to eat
8 3 You have nice day
IIUC, you can use transform with nunique:
yourdf=df[df.groupby('Response').Person.transform('nunique')==df.Person.nunique()]
yourdf
Out[463]:
Person Response
0 1 I like to eat
1 1 You have nice day
3 2 I like to eat
4 2 You have nice day
7 3 I like to eat
8 3 You have nice day
Method 2
df.groupby('Response').filter(lambda x : pd.Series(df['Person'].unique()).isin(x['Person']).all())
Out[467]:
Person Response
0 1 I like to eat
1 1 You have nice day
3 2 I like to eat
4 2 You have nice day
7 3 I like to eat
8 3 You have nice day
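Since you asked for an explanation: the condition "answered by all 3 people" can also be spelled out with pd.crosstab, which may be easier to reason about when you're new to pandas. This is a sketch of the same logic, not necessarily the fastest route: build a Response-by-Person table of counts, keep the responses with a nonzero count for every person, then filter the original frame with isin.

```python
import pandas as pd

data = {'Person': ['1', '1', '1', '2', '2', '2', '2', '3', '3'],
        'Response': ['I like to eat', 'You have nice day', 'My name is ',
                     'I like to eat', 'You have nice day', 'My name is',
                     'This is it', 'I like to eat', 'You have nice day']}
df = pd.DataFrame(data)

# Rows: responses; columns: persons; cells: how often each person said it.
table = pd.crosstab(df['Response'], df['Person'])
shared = table.index[(table > 0).all(axis=1)]  # said by every person
yourdf = df[df['Response'].isin(shared)]
print(yourdf)
```

Printing table on its own is also a handy way to see exactly which person is missing which response.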
