How to fill na in pandas by the mode of a group - python-3.x

I have a Pandas Dataframe like this:
df =
a b
a1 b1
a1 b2
a1 b1
a1 Nan
a2 b1
a2 b2
a2 b2
a2 Nan
a2 b2
a3 Nan
Each value of a can have several different values of b corresponding to it. I want to fill every NaN in b with the mode of b, computed within the group defined by the corresponding value of a.
The resulting dataframe should look like the following:
df =
a b
a1 b1
a1 b2
a1 b1
a1 ***b1***
a2 b1
a2 b2
a2 b2
a2 **b2**
a2 b2
a3 b2
Above, b1 is the mode of b for a1. Similarly, b2 is the mode for a2. Finally, a3 has no data, so it is filled with the global mode, b2.
In other words, each NaN in column b should be filled with the mode of b computed only over the rows that share the same value of a.
EDIT:
If there is a group a for which there is no data on b, then fill it by global mode.

Try:
# lazy grouping
groups = df.groupby('a')
# rows that belong to a group where every b is NaN
all_na = groups['b'].transform(lambda x: x.isna().all())
# fill those groups with the global mode
df.loc[all_na, 'b'] = df['b'].mode()[0]
# fill the remaining NaN with each group's own mode
mode_by_group = groups['b'].transform(lambda x: x.mode()[0])
df['b'] = df['b'].fillna(mode_by_group)
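For reference, here is a minimal self-contained sketch of the same idea on the question's data. The helper name group_mode_or_nan is hypothetical, and chaining two fillna calls is just one way to express "group mode first, global mode as fallback"; it is not necessarily how the answer above was meant to be used.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "a": ["a1"] * 4 + ["a2"] * 5 + ["a3"],
    "b": ["b1", "b2", "b1", np.nan, "b1", "b2", "b2", np.nan, "b2", np.nan],
})

def group_mode_or_nan(s):
    # Hypothetical helper: mode() skips NaN, so it is empty for an all-NaN group
    modes = s.mode()
    return modes.iloc[0] if not modes.empty else np.nan

global_mode = df["b"].mode().iloc[0]
# One mode per row, broadcast from the row's group (NaN where the group has no data)
local_mode = df.groupby("a")["b"].transform(group_mode_or_nan)
# Group mode first, then the global mode for groups with no data at all
filled = df["b"].fillna(local_mode).fillna(global_mode)
print(filled.tolist())
```

When a group has several equally frequent values, mode() returns them sorted, so iloc[0] picks the first in sort order.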

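You are getting the IndexError: index out of bounds because the last value in column a, a3, has no corresponding value in column b, so that group has no mode to fill with. One solution is to wrap the fillna in a try/except block and then apply ffill and bfill. Note that ffill and bfill copy the value from the neighboring row; for this data that happens to be the global mode, b2, but in general it need not be. Here is the code: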
import numpy as np
import pandas as pd

data_stack = [['a1','b1'],['a1','b2'],['a1','b1'],['a1',np.nan],['a2','b1'],
              ['a2','b2'],['a2','b2'],['a2',np.nan],['a2','b2'],['a3',np.nan]]
df_try_stack = pd.DataFrame(data_stack, columns=["a","b"])

# Fill the NaN values of a group with that group's mode
def fillna_group(grp):
    try:
        return grp.fillna(grp.mode()[0])
    except (IndexError, KeyError) as e:
        # mode() is empty when the whole group is NaN; leave it for ffill/bfill
        print('No mode for this group: ' + str(e))
        return grp

df_try_stack["b"] = df_try_stack["b"].fillna(
    df_try_stack.groupby(["a"])['b'].transform(fillna_group))
# Fill whatever is still missing from the neighboring rows
df_try_stack = df_try_stack.ffill(axis=0)
df_try_stack = df_try_stack.bfill(axis=0)

Related

Pandas rename one value as other value in a column and add corresponding values in the other column

So, I have a pandas data frame:
df =
a b c
a1 b1 c1
a2 b2 c1
a2 b3 c2
a2 b4 c2
I want to rename a2 into a1 and then group by a and c and add the corresponding values of b
df =
a b c
a1 b1+b2 c1
a1 b3+b4 c2
So, something like this
df =
a value c
a1 10 c1
a2 20 c1
a2 50 c2
a2 60 c2
should become
df =
a value c
a1 30 c1
a1 110 c2
How to do this?
What about
>>> res = df.replace({"a": {"a2": "a1"}}).groupby(["a", "c"], as_index=False).sum()
>>> res
a c value
0 a1 c1 30
1 a1 c2 110
which first replaces "a2" with "a1" in column a only, and then groups by a and c and sums.
To get the original column order back, we can reindex:
>>> res.reindex(df.columns, axis=1)
a value c
0 a1 30 c1
1 a1 110 c2
Try this:
df.groupby([df['a'].replace({'a2':'a1'}),'c']).sum().reset_index()
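As a runnable sketch of the approaches above, using the numeric example from the question (the variable name merged is an assumption for illustration):

```python
import pandas as pd

# Numeric example from the question
df = pd.DataFrame({
    "a": ["a1", "a2", "a2", "a2"],
    "value": [10, 20, 50, 60],
    "c": ["c1", "c1", "c2", "c2"],
})

merged = (
    df.assign(a=df["a"].replace({"a2": "a1"}))   # rename a2 -> a1 in column a only
      .groupby(["a", "c"], as_index=False)["value"]
      .sum()                                      # add up value within each (a, c) pair
)
# Restore the original column order
merged = merged[df.columns.tolist()]
print(merged)
```

Selecting ["value"] before sum() keeps the aggregation limited to the numeric column, which matters if the frame has other columns that should not be summed.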

Pandas group by two columns and get top n rows of each value of one of the columns sorted in descending order

I have a pandas dataframe with many columns (the two columns of interest are a and b). I want to:
1. group by a and b
2. compute the occurrences of each group
3. sort each group in descending order of occurrences
4. for each value of b, take the top n values of a, i.e. those with the most occurrences.
I could get up to step 3 using the following code:
a_b_count = df.groupby(['a', 'b']).size().reset_index().rename({0:'count'},axis='columns').sort_values('count', ascending = False)
But, for each value of b, how to get top-n values of a for which occurrences are the highest?
Example
df =
a b ...
a1 b1 ...
a2 b1 ...
a1 b1 ...
a1 b2 ...
a2 b2 ...
a2 b2 ...
Expected Output (for n = 1):
b a count
b1 a1 2
b2 a2 2
You can use nlargest rather than a sort. It will be faster when n is small relative to the Series size.
df.groupby(['a', 'b']).size().groupby(
level=1).nlargest(n).reset_index(-1, drop=True)
b a
b1 a1 2
b2 a2 2
dtype: int64
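Here's one way to do it, using crosstab to get the frequencies of columns a and b. Note that nlargest here picks the largest counts overall rather than per value of b; it reproduces the expected output because both counts tie at 2:
pd.crosstab(df.a, df.b).stack().nlargest(1, keep="all").reset_index(name="count")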
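Another way to get the top n per value of b, sketched here under the assumption that only columns a and b matter, is to sort the counts and take the first n rows of each b group. The function name top_n_per_b is hypothetical.

```python
import pandas as pd

# Example data from the question (other columns omitted)
df = pd.DataFrame({
    "a": ["a1", "a2", "a1", "a1", "a2", "a2"],
    "b": ["b1", "b1", "b1", "b2", "b2", "b2"],
})

def top_n_per_b(frame, n):
    # Count occurrences of each (a, b) pair
    counts = frame.groupby(["a", "b"]).size().reset_index(name="count")
    # Within each b, put the most frequent a first
    counts = counts.sort_values(["b", "count"], ascending=[True, False])
    # Keep the first n rows of each b group
    top = counts.groupby("b").head(n)
    return top[["b", "a", "count"]].reset_index(drop=True)

result = top_n_per_b(df, 1)
print(result)
```

Unlike a single global nlargest, this keeps n rows for every value of b even when the counts differ a lot between groups.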

Sum only cells with prefix letter in a row

I have data in column A as follows:
A1 = 8
A2 = 9
A3 = CN2.75
A4 = CN3
I would like cell B2 to sum the range A1 to A4, but only the cells with the prefix "CN". The result in cell B2 should be 2.75 + 3 = 5.75.
=SUMPRODUCT(+IF(LEFT(A1:A4,2)="CN",1,0),IFERROR(VALUE(RIGHT(A1:A4,LEN(A1:A4)-2)),0))
JvDV's version from the comments is even more efficient and is preferable.
Enter it as an array formula with Ctrl+Shift+Enter.

Excel Formula || How to count occurrences of a value in column

I need some help figuring out a formula to count the number of times a value is listed in a column. I will try to explain the requirement below.
The image below shows a sample of the data set.
The requirement is to list the issues and actions per customer.
As you can see, even when several values are clustered in one cell, we need to find the individual unique values and then map them against the adjacent column or columns.
It just needs an extra sheet/table to work.
Try (data in rows 2 to 6, to match the formulas below):
A2 = a,b,c
A3 = b,c
A4 = c,b,a
A5 = c,a
A6 = b
B2 = ss
B3 = ss
B4 = dd
B5 = dd
B6 = ss
D1 = a
E1 = b
F1 = c
C7 = ss
C8 = dd
D2 =IF(FIND(D$1,$A2,1)>0,1,"") drag until F6
D7 =COUNTIFS($B$2:$B$6,$C7,D$2:D$6,1) drag until F8
D7:F8 will be your desired results. Happy trying.

Add A1 to C1 if B1 = Specific Number

I will try to be as clear and concise as possible. I am working on a spreadsheet in which I have item prices listed in a range of A1:A40. B1:B40 lists a numerical digit (either 1, 2, 3, etc.) that corresponds with a purchase category type (groceries, gas, etc.). Now I want one cell, such as C1, to add all instances in the A range that equal a specific number in B.
For example:
A1 = $5.00 | B1 = 1 | C1 = The sum of the cells in range A1:A3 whose corresponding B value is equal to 1 (in this case B1 and B3, so C1 = A1 + A3)
A2 = $2.50 | B2 = 2 | C2 = The sum of the cells in range A1:A3 whose corresponding B value is equal to 2 (in this case B2, so C2 = A2)
A3 = $4.00 | B3 = 1 | C3 =
Use the SUMIF function:
SUMIF(range, criteria, [sum_range])
In cell C1, enter the formula =SUMIF(B:B,1,A:A)
