select the first n largest groups from grouped data frames - python-3.x

Data frame (df) structure:
col1 col2
x 3131
y 9647
y 9648
z 9217
y 9652
x 23
grouping:
grouped = df.groupby('col1')
I want to select the first 2 largest groups, i.e.,
y 9647
y 9648
y 9652
and
x 3131
x 23
How can I do that using pandas? I've achieved it using a list, but that makes it clumsy because it becomes a list of tuples and I have to convert them back to DataFrames.

Use value_counts to get the index of the top 2 groups, then filter the rows with isin in boolean indexing:
df1 = df[df['col1'].isin(df['col1'].value_counts().index[:2])]
print (df1)
col1 col2
0 x 3131
1 y 9647
2 y 9648
4 y 9652
5 x 23
If you need a separate DataFrame per top group, use a dictionary comprehension with enumerate:
dfs = {i: df[df['col1'].eq(x)] for i, x in enumerate(df['col1'].value_counts().index[:2], 1)}
print (dfs)
{1: col1 col2
1 y 9647
2 y 9648
4 y 9652, 2: col1 col2
0 x 3131
5 x 23}
print (dfs[1])
col1 col2
1 y 9647
2 y 9648
4 y 9652
print (dfs[2])
col1 col2
0 x 3131
5 x 23
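An alternative sketch (assuming "largest" means the groups with the most rows) works from the group sizes directly: take the index of the two biggest groups with nlargest and filter with isin, which also keeps the keys ordered from largest to smallest:
top2 = df.groupby('col1').size().nlargest(2).index
df1 = df[df['col1'].isin(top2)]
# one DataFrame per top group, largest first
dfs = {i: df[df['col1'].eq(k)] for i, k in enumerate(top2, 1)}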

Related

For each row, add column name to list in new column if row value matches a condition

I have a series of columns, each containing either Y or N.
I would like to create a new column that contains a list of columns (for that particular row) that contain Y.
Old DataFrame
>>> df
col1 col2 col3 col4 col5
a Y N N N Y
b Y N Y Y Y
c N N Y N N
New DataFrame
>>> df_new
col1 col2 col3 col4 col5 col6
a Y N N N Y [col1, col5]
b Y N Y Y Y [col1, col3, col4, col5]
c N N Y N N [col3]
So far I can get it working for a single column with:
df["col6"] = ["col1" if val == "Y" else "" for val in df["col1"]]
But ideally I want to do the same for all columns, so I end up with the result above. I could imagine doing some kind of loop, but I'm unsure how to append the result to the list value in col6. Can someone steer me in the right direction, please?
Compare the values with 'Y' first, then use DataFrame.dot with Series.str.split:
df["col6"] = df.eq('Y').dot(df.columns + ',').str[:-1].str.split(',')
print (df)
col1 col2 col3 col4 col5 col6
a Y N N N Y [col1, col5]
b Y N Y Y Y [col1, col3, col4, col5]
c N N Y N N [col3]
Or, if you need better performance, use a list comprehension over numpy arrays:
cols = df.columns.to_numpy()
df["col6"] = [cols[x].tolist() for x in df.eq('Y').to_numpy()]

Applying "percentage of group total" to a column in a grouped dataframe

I have a dataframe from which I generate another dataframe using following code as under:
df.groupby(['Cat','Ans']).agg({'col1':'count','col2':'sum'})
This gives me following result:
Cat Ans col1 col2
A Y 100 10000.00
N 40 15000.00
B Y 80 50000.00
N 40 10000.00
Now, I need the percentage of the group total for each group (level=0, i.e. "Cat") instead of the count or sum.
To get a count percentage instead of the count value, I could do this:
df['Cat'].value_counts(normalize=True)
But here I have the sub-group "Ans" under the "Cat" group, and I need the percentage to be computed within each Cat group, not against the whole total.
So, expectation is:
Cat Ans col1 .. col3
A Y 100 .. 71.43 #(100/(100+40))*100
N 40 .. 28.57
B Y 80 .. 66.67
N 40 .. 33.33
Similarly, col4 will be percentage of group-total for col2.
Is there a function or method available for this?
How do we do this in an efficient way for large data?
You can use the level argument of DataFrame.sum (to perform a groupby) and have pandas take care of the index alignment for the division.
df['col3'] = df['col1']/df['col1'].sum(level='Cat')*100
col1 col2 col3
Cat Ans
A Y 100 10000.0 71.428571
N 40 15000.0 28.571429
B Y 80 50000.0 66.666667
N 40 10000.0 33.333333
For multiple columns you can loop the above, or have pandas align those too. I add a suffix to distinguish the new columns from the original columns when joining back with concat.
df = pd.concat([df, (df/df.sum(level='Cat')*100).add_suffix('_pct')], axis=1)
col1 col2 col1_pct col2_pct
Cat Ans
A Y 100 10000.0 71.428571 40.000000
N 40 15000.0 28.571429 60.000000
B Y 80 50000.0 66.666667 83.333333
N 40 10000.0 33.333333 16.666667
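Note that the level argument of sum has since been deprecated and is removed in recent pandas releases (pandas 2.x); a sketch of the equivalent using groupby with transform, which also sidesteps the reliance on index alignment:
# single column
df['col3'] = df['col1'] / df.groupby(level='Cat')['col1'].transform('sum') * 100
# or all original columns at once, joined back with a suffix as above
df = pd.concat([df, (df / df.groupby(level='Cat').transform('sum') * 100).add_suffix('_pct')], axis=1)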

How to append a column to a dataframe with values from a specified column index [duplicate]

I have a dataframe in which one of the columns is used to designate which of the other columns has the specific value I'm looking for.
df = pd.DataFrame({'COL1':['X','Y','Z'],'COL2':['A','B','C'],'X_SORT':['COL1','COL2','COL1']})
I'm trying to add a new column called 'X_SORT_VALUE' and assign it the value of the column identified in the X_SORT column.
df = df.assign(X_SORT_VALUE=lambda x: (x['X_SORT']))
But I'm getting the value of the X_SORT column:
COL1 COL2 X_SORT X_SORT_VALUE
0 X A COL1 COL1
1 Y B COL2 COL2
2 Z C COL1 COL1
Rather than the value of the column it names, which is what I want:
COL1 COL2 X_SORT X_SORT_VALUE
0 X A COL1 X
1 Y B COL2 B
2 Z C COL1 Z
You need df.lookup here:
df['X_SORT_VALUE'] = df.lookup(df.index,df['X_SORT'])
COL1 COL2 X_SORT X_SORT_VALUE
0 X A COL1 X
1 Y B COL2 B
2 Z C COL1 Z
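DataFrame.lookup has since been deprecated and is removed in pandas 2.x; a sketch of the same row-wise lookup using numpy indexing instead:
import numpy as np

# position of each row's target column, then fancy-index the underlying array
col_pos = df.columns.get_indexer(df['X_SORT'])
df['X_SORT_VALUE'] = df.to_numpy()[np.arange(len(df)), col_pos]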

iterate through rows and columns in excel using pandas - Python 3

I have an excel spreadsheet that I read with this code:
df=pd.ExcelFile('/Users/xxx/Documents/Python/table.xlsx')
ccg=df.parse("CCG")
The sheet that I want inside the spreadsheet is CCG.
The sheet looks like this:
col1 col2 col3
x a 1 2
x b 3 4
x c 5 6
x d 7 8
x a 9 10
x b 11 12
x c 13 14
y a 15 16
y b 17 18
y c 19 20
y d 21 22
y a 23 24
How would I write code that gets the values of col2 and col3 for the rows that contain both a and x? The proposed output for this table would be: col2 = [1, 9], col3 = [2, 10].
Try this:
df = pd.read_excel('/Users/xxx/Documents/Python/table.xlsx', 'CCG', index_col=0, usecols=['col1','col2']) \
.query("index == 'x' and col1 == 'a'")
Demo:
Excel file:
In [243]: fn = r'C:\Temp\.data\41718085.xlsx'
In [244]: pd.read_excel(fn, 'CCG', index_col=0, usecols=['col1','col2']) \
.query("index == 'x' and col1 == 'a'")
Out[244]:
col1 col2
x a 1
x a 9
You can do:
df = pd.read_excel('/Users/xxx/Documents/Python/table.xlsx', sheet_name='CCG', index_col=0)
filter = df[(df.index == 'x') & (df.col1 == 'a')]
Then from here, you can return all the values as numpy arrays with:
filter['col2'].values
filter['col3'].values
I managed to create a counter that iterates until it finds an 'a', adds +1 to the count, and only appends the index to my list if it falls within the range of rows where x appears; once I have the indices, I search through col2 and col3 and pull out the values at those indices.
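For completeness, the same result can usually be obtained with a plain boolean mask instead of a manual counter; a sketch, assuming the sheet is read with the unnamed first column as the index (index_col=0), as in the answers above:
import pandas as pd

ccg = pd.read_excel('/Users/xxx/Documents/Python/table.xlsx', sheet_name='CCG', index_col=0)
# rows whose index is 'x' and whose col1 is 'a'
mask = (ccg.index == 'x') & (ccg['col1'] == 'a').to_numpy()
col2_vals = ccg.loc[mask, 'col2'].tolist()   # [1, 9]
col3_vals = ccg.loc[mask, 'col3'].tolist()   # [2, 10]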

Pandas add empty rows as filler

Given the following data frame:
import pandas as pd
DF = pd.DataFrame({'COL1': ['A', 'A', 'B'],
                   'COL2': [1, 2, 1],
                   'COL3': ['X', 'Y', 'X']})
DF
COL1 COL2 COL3
0 A 1 X
1 A 2 Y
2 B 1 X
I would like to have an additional row for COL1 = 'B' so that both values (COL1 A and B) are represented by the COL3 values X and Y, with a 0 for COL2 in the generated row.
The desired result is as follows:
COL1 COL2 COL3
0 A 1 X
1 A 2 Y
2 B 1 X
3 B 0 Y
This is just a simplified example, but I need a calculation that can handle many such instances (and not just inserting the row of interest manually).
Thanks in advance!
UPDATE:
For a generalized scenario where there are many different combinations of values under 'COL1' and 'COL3', this works but is probably not nearly as efficient as it could be:
#Get unique set of COL3
COL3SET = set(DF['COL3'])
#Get unique set of COL1
COL1SET = set(DF['COL1'])
#Get all possible combinations of unique sets
import itertools
COMB=[]
for combination in itertools.product(COL1SET, COL3SET):
    COMB.append(combination)
#Create dataframe from new set:
UNQ = pd.DataFrame({'COMB':COMB})
#Split tuples into columns
new_col_list = ['COL1unq','COL3unq']
for n, col in enumerate(new_col_list):
    UNQ[col] = UNQ['COMB'].apply(lambda COMB: COMB[n])
UNQ = UNQ.drop('COMB',axis=1)
#Merge original data frame with unique set data frame
DF = pd.merge(DF,UNQ,left_on=['COL1','COL3'],right_on=['COL1unq','COL3unq'],how='outer')
#Fill in empty values of COL1 and COL3 where they did not have records
DF['COL1'] = DF['COL1unq']
DF['COL3'] = DF['COL3unq']
#Replace 'NaN's in column 2 with zeros
DF['COL2'].fillna(0, inplace=True)
#Get rid of COL1unq and COL3unq
DF.drop(['COL1unq','COL3unq'],axis=1, inplace=True)
DF
Something like this?
col1_b_vals = set(DF.loc[DF.COL1 == 'B', 'COL3'])
col1_not_b_col3_vals = set(DF.loc[DF.COL1 != 'B', 'COL3'])
missing_vals = col1_not_b_col3_vals.difference(col1_b_vals)
missing_rows = DF.loc[(DF.COL1 != 'B') & (DF.COL3.isin(missing_vals)), :].copy()  # copy so the assignments below don't modify a view of DF
missing_rows['COL1'] = 'B'
missing_rows['COL2'] = 0
>>> pd.concat([DF, missing_rows], ignore_index=True)
COL1 COL2 COL3
0 A 1 X
1 A 2 Y
2 B 1 X
3 B 0 Y
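The generalized scenario from the UPDATE can also be written more compactly with MultiIndex.from_product and reindex; a sketch, assuming each (COL1, COL3) pair occurs at most once in the original frame:
full = pd.MultiIndex.from_product([DF['COL1'].unique(), DF['COL3'].unique()],
                                  names=['COL1', 'COL3'])
# every (COL1, COL3) combination, with COL2 filled with 0 where it was missing
result = (DF.set_index(['COL1', 'COL3'])
            .reindex(full, fill_value=0)
            .reset_index()[['COL1', 'COL2', 'COL3']])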
