Pandas group by two columns and get top n rows of each value of one of the columns sorted in descending order

I have a pandas DataFrame with many columns (the two columns of interest are a and b). I want to:
1. group by a and b,
2. compute the number of occurrences of each group,
3. sort the groups in descending order of occurrences,
4. for each value of b, take the top n values of a that have the most occurrences.
I could get up to step 3 using the following code:
a_b_count = df.groupby(['a', 'b']).size().reset_index().rename({0:'count'},axis='columns').sort_values('count', ascending = False)
But for each value of b, how do I get the top n values of a with the highest number of occurrences?
Example
df =
a b ...
a1 b1 ...
a2 b1 ...
a1 b1 ...
a1 b2 ...
a2 b2 ...
a2 b2 ...
Expected Output (for n = 1):
b a count
b1 a1 2
b2 a2 2

You can use nlargest rather than a sort. It will be faster when n is small relative to the size of the Series.
df.groupby(['a', 'b']).size().groupby(level=1).nlargest(n).reset_index(-1, drop=True)
b a
b1 a1 2
b2 a2 2
dtype: int64
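As a self-contained sketch of the nlargest approach above, the following rebuilds the example data from the question and turns the result into a DataFrame with a count column. The names counts and top_n are illustrative and do not come from the original answer.

```python
import pandas as pd

# Example data from the question (only columns a and b matter here)
df = pd.DataFrame({
    "a": ["a1", "a2", "a1", "a1", "a2", "a2"],
    "b": ["b1", "b1", "b1", "b2", "b2", "b2"],
})
n = 1

# Number of rows for each (a, b) pair; the result is indexed by (a, b)
counts = df.groupby(["a", "b"]).size()

# Within each value of b, keep the n largest counts. groupby prepends the
# group key, so the index becomes (b, a, b); drop the trailing duplicate b.
top_n = (
    counts.groupby(level="b")
    .nlargest(n)
    .reset_index(level=-1, drop=True)
    .reset_index(name="count")
)
print(top_n)
```

With n = 1 this should give one row per value of b, with the columns in the order b, a, count.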

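Here's one way to do it, using crosstab to get the frequency of each combination of a and b. Note that nlargest is applied here to the whole stacked Series rather than per value of b, so it returns the overall largest counts (with ties kept); it reproduces the expected output for this example because both groups tie at 2: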
pd.crosstab(df.a, df.b).stack().nlargest(1, keep="all").reset_index(name="count")

Related

How to fill na in pandas by the mode of a group

I have a Pandas Dataframe like this:
df =
a b
a1 b1
a1 b2
a1 b1
a1 Nan
a2 b1
a2 b2
a2 b2
a2 Nan
a2 b2
a3 Nan
Each value of a can have multiple values of b corresponding to it. I want to fill all the NaN values of b with the mode of b within the group defined by the corresponding value of a.
The resulting dataframe should look like the following:
df =
a b
a1 b1
a1 b2
a1 b1
a1 b1
a2 b1
a2 b2
a2 b2
a2 b2
a2 b2
a3 b2
Above, b1 is the mode of b for a1 and, similarly, b2 is the mode for a2. Finally, a3 has no data for b, so it is filled with the global mode, b2.
In other words, every NaN in column b should be filled with the mode of b among the rows that share the same value of a.
EDIT:
If there is a group of a with no data for b, fill it with the global mode.

Try:
# lazy grouping
groups = df.groupby('a')
# True for rows whose group has only NaN values in b
all_na = groups['b'].transform(lambda x: x.isna().all())
# fill global mode
df.loc[all_na, 'b'] = df['b'].mode()[0]
# fill with local mode
mode_by_group = groups['b'].transform(lambda x: x.mode()[0])
df['b'] = df['b'].fillna(mode_by_group)
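A compact variant of the same idea, as a minimal sketch: compute each group's mode inside transform and fall back to the global mode when a group has no non-null values. The helper name group_mode and the variable names are illustrative assumptions, and the data is the example from the question.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "a": ["a1"] * 4 + ["a2"] * 5 + ["a3"],
    "b": ["b1", "b2", "b1", np.nan, "b1", "b2", "b2", np.nan, "b2", np.nan],
})

# Mode over the whole column, used when a group has no data at all
global_mode = df["b"].mode()[0]

def group_mode(values):
    """Return the group's mode, or the global mode if the group is all NaN."""
    modes = values.mode()  # mode() ignores NaN by default
    return modes.iloc[0] if not modes.empty else global_mode

# transform broadcasts each group's scalar back to that group's rows
fill_values = df.groupby("a")["b"].transform(group_mode)
filled = df["b"].fillna(fill_values)
print(filled.tolist())
```

When a group has several modes, mode() returns them sorted and this sketch takes the first one; whether that tie-break is acceptable depends on the data.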

You are getting "IndexError: index out of bounds" because the last value in column a, a3, has no corresponding non-null value in column b, so that group has no mode to fill with. One solution is to wrap the fillna in a try/except block and then apply ffill and bfill. Note that ffill and bfill copy a neighbouring row's value rather than the global mode; here the result is b2 only because the preceding row holds b2. Here is the code:
import numpy as np
import pandas as pd

data_stack = [['a1','b1'],['a1','b2'],['a1','b1'],['a1',np.nan],['a2','b1'],
              ['a2','b2'],['a2','b2'],['a2',np.nan],['a2','b2'],['a3',np.nan]]
df_try_stack = pd.DataFrame(data_stack, columns=["a","b"])

# Fill the NaN values of a group with that group's mode
def fillna_group(grp):
    try:
        return grp.fillna(grp.mode()[0])
    except (IndexError, KeyError) as e:
        # No non-null values in this group, so there is no mode; leave it unchanged
        print('No mode for this group: ' + str(e))
        return grp

df_try_stack["b"] = df_try_stack["b"].fillna(
    df_try_stack.groupby(["a"])["b"].transform(fillna_group))
# Fill whatever is still NaN from the neighbouring rows
df_try_stack = df_try_stack.ffill(axis=0)
df_try_stack = df_try_stack.bfill(axis=0)

scan Excel column based on another column value

I want to check each value in one column against an entire second column and, where there is a match, assign a value from a third column to a cell in the matching row.
E.g.:
A B C D
1 10 X
2 3 Y
3 2 Z
4 11 K
What I want to do is take one value at a time from column A and scan through column B; if a cell in column B matches it, assign the value from column C in that row to column D in the same row. E.g. if we check A3 (value 2) against column B and find 2 in B4, then D4 = Z. I want to check all values in column A against column B like this and assign the relevant value from column C to column D.
How can I do this? Can someone please help me?
Thanks.

Try:
= IFERROR(INDEX($C$2:$C$5,MATCH(A3,$B$2:$B$5,0)),"no match")

Try:
=IFERROR(VLOOKUP(A1,$B$1:$C$5,2,0),"")

Average based on criteria then output

I have an Excel table with the first six columns having a value of 1 or 2. The next six columns are associated with the first six columns and have values that will need to be averaged - two averages will be computed based on whether there is a 1 or 2 value in the first six columns. Then depending on the two averages, the last six columns will need to be assigned a value equal to H (high average) or L (low average). This is difficult to explain, so here is an example:
  A B C D E F G H I J K L M N O P Q R
1 2 2 1 2 2 1 8 8 9 8 6 8 L L H L L H
Columns C and F have values equal to 1, so columns I and L need to be averaged. Then because columns A, B, D and E have values equal to 2, columns G, H, J and K need to be averaged. The average of the columns associated with a value of 1 (I and L) is 8.5, and the average of the columns associated with a value of 2 (G, H, J and K) is 7.5. Columns M-R now must be labeled with an H or L depending on whether the corresponding values from columns G-L were part of the high (H) or low (L) average. In this case, since columns I and L had the larger average, then columns O and R need to be assigned an H. The other columns (M, N, P and Q) will be assigned an L because their associated columns (G, H, J, K) had the lower average.

Please consider the following formula placed on the first row of Column M and then copied across to Column R:
=IF(AVERAGEIF($A$1:$F$1,A1,$G$1:$L$1)=MAX(AVERAGEIF($A$1:$F$1,1,$G$1:$L$1),AVERAGEIF($A$1:$F$1,2,$G$1:$L$1)),"H","L")
The logic is: if the average of the values that share the flag in A1 (1 or 2) equals the MAX of the two averages, the cell belongs to the high group and gets "H"; otherwise it gets "L". Note that this does not handle the case where the two averages are equal, in which case all entries are marked as High. You can extend the formula by also checking whether the value equals the MIN of the two averages. Hope this helps.
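To check the formula's logic against the worked example outside Excel, here is a small Python sketch. The function average_if is a hypothetical stand-in for AVERAGEIF, not a real library call, and the two lists hold the values of columns A-F and G-L from the question's row.

```python
flags = [2, 2, 1, 2, 2, 1]    # columns A-F
values = [8, 8, 9, 8, 6, 8]   # columns G-L

def average_if(flags, criterion, values):
    """Mean of the values whose flag equals criterion (mimics AVERAGEIF)."""
    picked = [v for f, v in zip(flags, values) if f == criterion]
    return sum(picked) / len(picked)

avg_1 = average_if(flags, 1, values)   # columns I and L
avg_2 = average_if(flags, 2, values)   # columns G, H, J and K
high = max(avg_1, avg_2)

# Same comparison as the IF(...) formula, one label per column M-R
labels = ["H" if average_if(flags, f, values) == high else "L" for f in flags]
print(avg_1, avg_2, labels)
```

As with the formula, equal averages would label every column "H".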

If you don't want to use Visual Basic, you could use this method, but it might require more columns.
xx A B C D E F G H I J K L
1 2 2 1 2 2 1 8 8 9 8 6 8
For cell M1 type in: =if(A1=1, G1, "")
Note that this is two regular quotes (")s in a row after the G1 term.
Copy this over to cells M1-R1.
Now cells M1-R1 should only contain data for columns marked with a 1.
Next for cell S1 type in: =average(M1:R1)
This shouldn't factor in blank cells, so you should just have the average of the "1" cells.
Now copy the process for the "2" cells:
For cell T1 type in: =if(A1=2, G1, "")
Copy this to cells T1-Y1.
For cell Z1 type in: =average(T1:Y1)
Now for cell AA1 type in: =if(S1 > Z1, 1, 2)
Now AA1 will have the number that has the higher average. So if the "1" cells had a higher average, cell AA1 will be a 1, otherwise it will be a 2.
Now for cell AB1 type in =if(A1=$AA1, "H", "L")
Copy AB1 to cells AB1 through AG1 and you're done.
Cells AB1-AG1 will have your H's and L's. Note that this method has one drawback, apart from being a little complex: if the averages are equal, it will still report the "2"s as having the higher average.
Anyway, hopefully you can find a simpler method, but this one should work if you can't.

I need a count of the number of rows that have a value in each of two columns

Let's say I have two columns, B2:B21 and T2:T21.
I need the total number of rows that have a value in both column B and T
If B2 has a value of 15, and T2 is blank, don't count that row
If B2 is blank and T2 has a value of 5, don't count that row
If B2 is blank and T2 is blank, don't count that row
If B2 is 45 and T2 is 50, then count this row (+1)
Is this clear?

Next time, try Super User when you do not need programming assistance.
=COUNTIFS(B2:B21,">0",T2:T21,">0")

If you want to count a row when, for example, B2 has 0 and T2 has 15 (+1), then use this instead:
=COUNTIFS(B:B,"<>"&"",T:T,"<>"&"")
This also works for cells containing text.
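The difference between the two criteria (">0" versus "<>") can be illustrated with a short Python sketch. The sample rows and the helper is_positive_number are made up for illustration, with None standing for a blank cell; the blank test is only an approximation of how Excel treats empty cells.

```python
# (B, T) pairs; None stands for a blank cell
rows = [(15, None), (None, 5), (None, None), (45, 50), (0, 15), ("x", 7)]

def is_positive_number(cell):
    """True for numeric cells greater than zero (the ">0" criterion)."""
    return isinstance(cell, (int, float)) and not isinstance(cell, bool) and cell > 0

def is_non_blank(cell):
    """True for any cell that holds something (the "<>" criterion)."""
    return cell is not None and cell != ""

# Rows where both cells are positive numbers
both_positive = sum(1 for b, t in rows if is_positive_number(b) and is_positive_number(t))

# Rows where both cells are non-blank, including zeros and text
both_non_blank = sum(1 for b, t in rows if is_non_blank(b) and is_non_blank(t))

print(both_positive, both_non_blank)
```

Only the (45, 50) row passes the ">0" test on both sides, while the rows containing 0 and text also pass the non-blank test.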

Compare 2 Columns values in 2 ExcelSheet

I want to compare two columns on two Excel sheets.
In sheet A, the Gender column has numeric values.
In sheet B, the Gender column has alphabetic values.
A Gender B Gender
----------------------
1 M
2 F
2 F
1 M
2 M
1 F
Here 1 = M and 2 = F.
If A Gender is 1 and B Gender is M then correct.
If A Gender is 1 and B Gender is F then Wrong.
How can I compare the values in Excel?

Assuming Gender data starts from A2 and goes down, type the following formula in C2:
=IF(OR(AND(A2=1,B2="M"),AND(A2=2,B2="F")),"OK","WRONG")
and autofill down as required.

Another alternative formula which might be easier to understand (and extend values if needs be)...
=CHOOSE(A2,"M","F")=B2
This formula basically reevaluates column A to be "M" when 1 and "F" when 2. Then asks if that equals Column B, and you get a TRUE/FALSE return ;)
If you have a third value for 3, just add another comma separated value onto the end of the CHOOSE function.
