Pandas rename one value as other value in a column and add corresponding values in the other column - python-3.x

So, I have a pandas data frame:
df =
a b c
a1 b1 c1
a2 b2 c1
a2 b3 c2
a2 b4 c2
I want to rename a2 to a1, then group by a and c and sum the corresponding values of b:
df =
a b c
a1 b1+b2 c1
a1 b3+b4 c2
So, starting from something like this:
df =
a value c
a1 10 c1
a2 20 c1
a2 50 c2
a2 60 c2
the expected result is:
df =
a value c
a1 30 c1
a1 110 c2
How can I do this?

What about
>>> res = df.replace({"a": {"a2": "a1"}}).groupby(["a", "c"], as_index=False).sum()
>>> res
a c value
0 a1 c1 30
1 a1 c2 110
which first replaces "a2" with "a1" in column a only, and then groups by a and c and sums.
To get the original column order back, we can reindex:
>>> res.reindex(df.columns, axis=1)
a value c
0 a1 30 c1
1 a1 110 c2

Try this:
df.groupby([df['a'].replace({'a2':'a1'}),'c']).sum().reset_index()
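For anyone who wants to run the two answers above end to end, here is a minimal self-contained sketch. The frame is rebuilt from the numbers in the question; selecting the `value` column explicitly before summing is my own choice, not something either answer does.

```python
import pandas as pd

# Sample frame rebuilt from the question
df = pd.DataFrame({
    "a": ["a1", "a2", "a2", "a2"],
    "value": [10, 20, 50, 60],
    "c": ["c1", "c1", "c2", "c2"],
})

# Map a2 -> a1 in column "a" only, then sum "value" within each (a, c) pair
res = (
    df.assign(a=df["a"].replace({"a2": "a1"}))
      .groupby(["a", "c"], as_index=False)["value"]
      .sum()
)

# Restore the original column order
res = res[df.columns.tolist()]
print(res)
```

Using `assign` leaves the original `df` untouched, so the rename only exists in the grouped result.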

Related

How to split every row in dataframe into two with some features? [closed]

I have this dataframe:
A C1 C2
a1 c1 c3
a2 c2 c4
Columns C1 and C2 have the same type.
And I want to get this:
A C
a1 c1
a1 c3
a2 c2
a2 c4
How can I do this?
UPD:
In the answers I got this:
df_final = df.set_index('A').stack().droplevel(1).rename('C').reset_index()
Out[604]:
A C
0 a1 c1
1 a1 c3
2 a2 c2
3 a2 c4
But what should I do if I want to split it this way?
A B C1 C2 C3 C4
a1 b1 c1 c2 c3 c4
a2 b2 c5 c6 c7 c8
and get this:
A B C1 C2
a1 b1 c1 c2
a1 b1 c3 c4
a2 b2 c5 c6
a2 b2 c7 c8
Edit 2: If you have an even number of Cx columns, you can use numpy to keep it simple:
import numpy as np
import pandas as pd
cols = ['C1','C2','C3','C4']
# repeat() needs an integer count, so use floor division rather than /
df1 = df.loc[df.index.repeat(len(cols) // 2), ['A','B']].reset_index(drop=True)
df_final = df1.join(pd.DataFrame(df[cols].to_numpy().reshape(-1,2), columns=['C1','C2']))
Out[698]:
A B C1 C2
0 a1 b1 c1 c2
1 a1 b1 c3 c4
2 a2 b2 c5 c6
3 a2 b2 c7 c8
Edit for updated sample:
To split multiple Cx columns in pairs, you need wide_to_long. However, before using it, you need to rename the columns into a format that wide_to_long can work with:
df1 = df.set_index(['A','B'])
stub_cols = (np.arange(df1.columns.size) % 2).astype(str)
suff_cols = (np.arange(df1.columns.size) // 2).astype(str)
d = dict(zip(stub_cols, ['C1', 'C2']))
df1.columns = pd.Series(stub_cols) + '_' + suff_cols
df_final = pd.wide_to_long(df1.reset_index(),
i=['A','B'],
j='num',
stubnames=['0','1'],
sep='_').droplevel(-1).rename(d, axis=1).reset_index()
Out[680]:
A B C1 C2
0 a1 b1 c1 c2
1 a1 b1 c3 c4
2 a2 b2 c5 c6
3 a2 b2 c7 c8
Give this a try
df_final = df.set_index('A').stack().droplevel(1).rename('C').reset_index()
Out[604]:
A C
0 a1 c1
1 a1 c3
2 a2 c2
3 a2 c4
print(
pd.concat([df.A, df[['C1', 'C2']].apply(list, axis=1)], axis=1).explode(0).rename(columns={0:'C'})
)
Prints:
A C
0 a1 c1
0 a1 c3
1 a2 c2
1 a2 c4
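As another option for the two-column case, the same reshape can be sketched with `melt`. This is not taken from the answers above; it assumes a pandas version where `melt` accepts `ignore_index` (1.1 or later), and the name `long_df` is just illustrative.

```python
import pandas as pd

# Sample frame rebuilt from the question
df = pd.DataFrame({"A": ["a1", "a2"], "C1": ["c1", "c2"], "C2": ["c3", "c4"]})

long_df = (
    # keep the original row labels so rows can be regrouped afterwards
    df.melt(id_vars="A", value_vars=["C1", "C2"], value_name="C", ignore_index=False)
      .drop(columns="variable")
      # stable sort: rows from the same original row stay in C1, C2 order
      .sort_index(kind="mergesort")
      .reset_index(drop=True)
)
print(long_df)
```

Sorting on the preserved index rather than on column A keeps the original row order even when A is not sorted.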

How to fill na in pandas by the mode of a group

I have a Pandas Dataframe like this:
df =
a b
a1 b1
a1 b2
a1 b1
a1 Nan
a2 b1
a2 b2
a2 b2
a2 Nan
a2 b2
a3 Nan
Each value of a can have several values of b associated with it. I want to fill all the NaN values of b with the mode of b within the group defined by the corresponding value of a.
The resulting dataframe should look like the following:
df =
a b
a1 b1
a1 b2
a1 b1
a1 ***b1***
a2 b1
a2 b2
a2 b2
a2 **b2**
a2 b2
a3 b2
Above, b1 is the mode of b for a1 and b2 is the mode for a2. Finally, a3 has no data, so it is filled with the global mode, b2.
In other words, every NaN in column b should be filled with the mode of b for that row's value of a.
EDIT:
If there is a group of a that has no data in b, fill it with the global mode.
Try:
# lazy grouping
groups = df.groupby('a')
# flag the rows whose whole group is NaN in b
all_na = groups['b'].transform(lambda x: x.isna().all())
# fill those groups with the global mode
df.loc[all_na, 'b'] = df['b'].mode()[0]
# fill the remaining NaN values with the mode of their group
mode_by_group = groups['b'].transform(lambda x: x.mode()[0])
df['b'] = df['b'].fillna(mode_by_group)
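The same idea can also be sketched in a single pass, with the global-mode fallback handled inside a helper. The helper name `group_mode` is hypothetical and the sample frame is rebuilt from the question (using real NaN values), so treat this as an illustration rather than the definitive way to do it.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the question
df = pd.DataFrame({
    "a": ["a1"] * 4 + ["a2"] * 5 + ["a3"],
    "b": ["b1", "b2", "b1", np.nan, "b1", "b2", "b2", np.nan, "b2", np.nan],
})

# Mode over the whole column, used when a group has no data at all
global_mode = df["b"].mode()[0]

def group_mode(s):
    """Return the mode of a group, or the global mode if the group is all NaN."""
    m = s.mode()  # mode() ignores NaN, so it is empty for an all-NaN group
    return m.iloc[0] if not m.empty else global_mode

# One fill value per row, aligned with df, then fill only the missing entries
fill_values = df.groupby("a")["b"].transform(group_mode)
df["b"] = df["b"].fillna(fill_values)
print(df)
```

When a group has several modes, `mode()` returns them sorted and this sketch simply takes the first.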
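You are getting IndexError: index out of bounds because the last value of column a, a3, has no non-null value in column b, so its mode is empty and there is nothing to fill that group with. One solution is to wrap the fillna in a try/except block and then apply ffill and bfill to whatever is still missing (note that these copy the value from a neighbouring row, which is not necessarily the global mode). Here is the code: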
import numpy as np
import pandas as pd
data_stack = [['a1','b1'],['a1','b2'],['a1','b1'],['a1',np.nan],['a2','b1'],
              ['a2','b2'],['a2','b2'],['a2',np.nan],['a2','b2'],['a3',np.nan]]
df_try_stack = pd.DataFrame(data_stack, columns=["a","b"])
# This function fills the NaN values of a group with that group's mode
def fillna_group(grp):
    try:
        return grp.fillna(grp.mode()[0])
    except (IndexError, KeyError) as e:
        # the group has no non-null values, so it has no mode; leave it unchanged
        print('Error as no corresponding group: ' + str(e))
        return grp
df_try_stack["b"] = df_try_stack["b"].fillna(
    df_try_stack.groupby(["a"])['b'].transform(fillna_group))
# fill whatever is still missing from the neighbouring rows
df_try_stack = df_try_stack.ffill(axis=0)
df_try_stack = df_try_stack.bfill(axis=0)

Modify Dataframe based on Priority

I have a df like this:
ID  A1  A2  A3  A4  A5
1   1       2       3
2   1   2       3
3   2   1
4   3       1       2
5
For every ID, I have 5 columns A1 to A5 (in reality I have many more), and the values give the priority of each column for that ID.
For example: ID 1 has A1, A3 and A5 as priorities, ID 3 has only two (A2 and A1), and ID 5 has no priorities.
Resultant DF
ID Priority_1 Priority_2 Priority_3
1 A1 A3 A5
2 A1 A2 A4
3 A2 A1
4 A3 A5 A1
5
I have been trying to do this with melt and pivot, following a few related posts, but I cannot get exactly this resultant df.
Any help on this would be appreciated, and I am happy to clarify anything from my side.
Use DataFrame.melt, sort with DataFrame.sort_values and remove the missing rows with DataFrame.dropna. Then add a counter column per ID and keep only the first N priorities by boolean indexing with Series.le (less than or equal). Finally, reshape with DataFrame.pivot, rename the columns with DataFrame.add_prefix, and use DataFrame.reindex to add back the IDs that have no priorities:
N = 3
df1 = df.melt('ID').sort_values(['ID','value']).dropna(subset=['value'])
df1['new'] = df1.groupby('ID').cumcount().add(1)
df1 = df1[df1['new'].le(N)]
df2 = df1.pivot(index='ID', columns='new', values='variable').add_prefix('Priority_').reindex(df['ID'])
print (df2)
new Priority_1 Priority_2 Priority_3
ID
1 A1 A3 A5
2 A1 A2 A4
3 A2 A1 NaN
4 A3 A5 A1
5 NaN NaN NaN
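To try the approach above without the original data, here is a self-contained sketch. The input frame is my reconstruction of the question's table (inferred from the expected result, with blanks as NaN), the column name `rank` is my own, and `pivot` is called with keyword arguments, which recent pandas versions require.

```python
import numpy as np
import pandas as pd

# Input reconstructed from the question (an assumption: blanks are NaN)
df = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "A1": [1, 1, 2, 3, np.nan],
    "A2": [np.nan, 2, 1, np.nan, np.nan],
    "A3": [2, np.nan, np.nan, 1, np.nan],
    "A4": [np.nan, 3, np.nan, np.nan, np.nan],
    "A5": [3, np.nan, np.nan, 2, np.nan],
})

N = 3  # number of priority columns to keep

# Long format, one row per (ID, column) that actually has a priority
long_df = df.melt("ID").dropna(subset=["value"]).sort_values(["ID", "value"])

# Position of each column within its ID, ordered by priority value
long_df["rank"] = long_df.groupby("ID").cumcount() + 1
long_df = long_df[long_df["rank"] <= N]

# Back to wide format; reindex restores IDs that had no priorities
out = (
    long_df.pivot(index="ID", columns="rank", values="variable")
           .add_prefix("Priority_")
           .reindex(df["ID"])
)
print(out)
```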

Match two column values from 2 data sets, then find associated values

The issue I'm having was hard to title, and hard to search as well.
Here's some example data.
A B C D E F
B1 04/14/16 746 B1 04/25/16 2
B1 04/15/16 180 B1 04/30/16 4
B1 04/16/16 494 B1 05/01/16 5
B1 04/17/16 726 B2 04/01/16 1
B1 04/18/16 206 B2 04/03/16 1
B1 04/19/16 22 B2 04/04/16 2
B1 04/20/16 193 B2 04/05/16 2
B1 04/21/16 739 B2 04/12/16 8
B1 04/22/16 926 B2 04/13/16 1
B1 04/23/16 748 B2 04/14/16 2
B1 04/24/16 830 B2 04/15/16 1
B1 04/25/16 272 B2 04/18/16 9
B1 04/26/16 0 B2 04/19/16 1
B1 04/27/16 0 B2 04/26/16 9
B1 04/28/16 0 B2 04/27/16 3
B1 04/29/16 0 B2 04/30/16 1
B1 04/30/16 685 B2 05/02/16 5
B1 05/01/16 770 B2 05/03/16 2
B1 05/02/16 701 B3 04/03/16 3
B1 05/03/16 181 B3 04/04/16 1
B2 04/01/16 77 B3 04/06/16 2
B2 04/02/16 182 B3 04/07/16 1
B2 04/03/16 53 B3 04/09/16 1
B2 04/04/16 32 B3 04/16/16 7
What I'm trying to do is find rows where column A matches column D and column B matches column E. If both match, I would like to divide column F by column C.
If there is no match for a pair of A and B values, I would like to return that pair with a zero.
So for a match:
B1 04/25/16 =2/272
For a non-match:
B1 04/14/16 0
Thank you.
Two INDEX/MATCH functions will do it:
=IFERROR(INDEX($F$1:$F$24,MATCH(1,INDEX(($E$1:$E$24=J2)*($D$1:$D$24=I2),),0))/INDEX($C$1:$C$24,MATCH(1,INDEX(($B$1:$B$24=J2)*($A$1:$A$24=I2),),0)),0)
This is an array formula. Full column references should be avoided, because the number of calculations grows quickly with the size of the ranges and will increase the calculation time.
If a more dynamic range is wanted then use this formula:
=IFERROR(INDEX($F$1:INDEX(F:F,MATCH(1E+99,F:F)),MATCH(1,INDEX(($E$1:INDEX(E:E,MATCH(1E+99,F:F))=J2)*($D$1:INDEX(D:D,MATCH(1E+99,F:F))=I2),),0))/INDEX($C$1:INDEX(C:C,MATCH(1E+99,C:C)),MATCH(1,INDEX(($B$1:INDEX(B:B,MATCH(1E+99,C:C))=J2)*($A$1:INDEX(A:A,MATCH(1E+99,C:C))=I2),),0)),0)
This will find the last cell with data and use that to set the extents of the range. So now as the data grows or shrinks it will only look at the data and not iterate through any more or any less than what is needed to cover the entire data set.
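For readers working in pandas rather than Excel, the same two-key lookup can be sketched as a left merge followed by a division. This is not part of the answer above; the two small frames and the column names `key`, `date` and `ratio` are hypothetical, with a few rows taken from the sample data.

```python
import pandas as pd

# Columns A, B, C of the sheet (a few rows only)
left = pd.DataFrame({
    "key": ["B1", "B1", "B1"],
    "date": ["04/14/16", "04/25/16", "04/26/16"],
    "C": [746, 272, 0],
})
# Columns D, E, F of the sheet (a few rows only)
right = pd.DataFrame({
    "key": ["B1", "B1"],
    "date": ["04/25/16", "04/30/16"],
    "F": [2, 4],
})

# Left merge keeps every (key, date) pair from the first table
merged = left.merge(right, on=["key", "date"], how="left")

# F / C where a match exists and C is non-zero, otherwise 0
ratio = merged["F"].div(merged["C"])
merged["ratio"] = ratio.where(merged["F"].notna() & merged["C"].ne(0), 0.0)
print(merged)
```

Rows without a match get NaN in F after the merge, which is what the `where` call turns into zero, much like IFERROR does in the formula.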

find last same value in column x and add or deduct column y

Similar to a post back in May, I need to create a sheet that combines three different sources of information and keeps a running total per source, based on the last row with the same source value.
This example shows what I need to happen:
Column A = D (to add) or W (to deduct)
Column B = Source1 or Source2 or Source3
Column C = the value that needs to be added or deducted
Column D = the running total based on the source (Column B)
A1 = D
B1 = Source1
C1 = 100
D1 = 100 (0 + C1)
A2 = W
B2 = Source1
C2 = 25
D2 = 75 (D1 - C2)
A3 = D
B3 = Source2
C3 = 50
D3 = 50 (0 + C3)
A4 = D
B4 = Source1
C4 = 100
D4 = 175 (D2 + C4)
A5 = W
B5 = Source2
C5 = 10
D5 = 40 (D3 - C5)
A6 = D
B6 = Source3
C6 = 20
D6 = 20 (0 + C6)
Any help would be greatly appreciated
I tried inserting a picture, but as I am new to the site I am unable to. Sorry about that.
I am using this, which adds correctly:
=SUMIFS($C$1:$C1, $A$1:$A1, A1, $B$1:$B1, B1)
However, I also need it to deduct if column A = W.
Put this in cell D1 and copy it down: =SUMIFS($C$1:$C1,$B$1:$B1,B1,$A$1:$A1,"D")-SUMIFS($C$1:$C1,$B$1:$B1,B1,$A$1:$A1,"W").
I was able to get started using the formula that you provided. I modified it so that it subtracted a running total of W's for each source from a running total of D's for each source.
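If the same ledger lives in pandas, the running total per source can be sketched with a signed amount and a grouped cumulative sum. This is only an illustration of the logic behind the two SUMIFS terms; the column names `type`, `source`, `amount` and `running_total` are my own, and the rows are the six from the example in the question.

```python
import pandas as pd

# The six example rows from the question (columns A, B, C)
df = pd.DataFrame({
    "type": ["D", "W", "D", "D", "W", "D"],
    "source": ["Source1", "Source1", "Source2", "Source1", "Source2", "Source3"],
    "amount": [100, 25, 50, 100, 10, 20],
})

# Deposits count as positive, withdrawals as negative
signed = df["amount"].where(df["type"].eq("D"), -df["amount"])

# Running total within each source, in row order
df["running_total"] = signed.groupby(df["source"]).cumsum()
print(df)
```

The resulting column matches the D1 to D6 values listed in the question.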
