Python create a column based on the values of each row of another column - python-3.x

I have a pandas dataframe as below:
import pandas as pd
df = pd.DataFrame({'ORDER':["A", "A", "A", "B", "B","B"], 'GROUP': ["A_2018_1B1", "A_2018_1B1H", "A_2018_1M1", "B_2018_I000_1C1", "B_2018_I000_1B1", "B_2018_I000_1C1H"], 'VAL':[1,3,8,5,8,10]})
df
ORDER GROUP VAL
0 A A_2018_1B1 1
1 A A_2018_1B1H 3
2 A A_2018_1M1 8
3 B B_2018_I000_1C1 5
4 B B_2018_I000_1B1 8
5 B B_2018_I000_1C1H 10
I want to create a column "CAL" holding the sum of 'VAL' over all rows whose GROUP names are identical except for an H character at the end. For example, 'VAL' for the first two rows is added together because the only difference between their 'GROUP' values is that the second row ends in H. Row 3 stays as it is, rows 4 and 6 are added together, and row 5 stays the same.
My expected output:
ORDER GROUP VAL CAL
0 A A_2018_1B1 1 4
1 A A_2018_1B1H 3 4
2 A A_2018_1M1 8 8
3 B B_2018_I000_1C1 5 15
4 B B_2018_I000_1B1 8 8
5 B B_2018_I000_1C1H 10 15

Try str.replace, then transform (note that replace('H','') removes every H in the name, which is harmless for this sample data):
df.groupby(df.GROUP.str.replace('H','')).VAL.transform('sum')
0 4
1 4
2 8
3 15
4 8
5 15
Name: VAL, dtype: int64
df['CAL'] = df.groupby(df.GROUP.str.replace('H','')).VAL.transform('sum')
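If group names might contain an H somewhere other than the end, one option is to anchor the replacement to the end of the string. This is a minimal sketch, assuming a pandas version where str.replace accepts regex=True; the variable name key is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'ORDER': ["A", "A", "A", "B", "B", "B"],
    'GROUP': ["A_2018_1B1", "A_2018_1B1H", "A_2018_1M1",
              "B_2018_I000_1C1", "B_2018_I000_1B1", "B_2018_I000_1C1H"],
    'VAL': [1, 3, 8, 5, 8, 10],
})

# Strip a single trailing "H" only; an "H" elsewhere in the name is kept.
key = df['GROUP'].str.replace(r'H$', '', regex=True)

# Sum VAL within each normalized group name and broadcast the sum to every row.
df['CAL'] = df.groupby(key)['VAL'].transform('sum')
print(df)
```

Grouping by a separate Series leaves the GROUP column itself unchanged, so the original names stay available in the result.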

Related

How to replenish a data frame based on another one?

I have two data frames. One contains a column of repeated values (a, in this case). The other records what each value corresponds to (in this example, some "d" values). How do I efficiently add a new column to the first data frame, whose values are looked up from an existing column according to the mapping recorded in the other data frame? Here is example code that works but is really slow:
import pandas as pd
import numpy as np
d1 = pd.DataFrame(np.asarray([[1,2,3], [2,4,5], [3,4,5], [2,1,4], [3,4,5]]), columns = ['a', 'b', 'c'])
d2 = pd.DataFrame(np.asarray([[1,7], [2,8], [3,11]]), columns = ['a', 'd'])
d = np.empty((d1.shape[0],))
for i in range(d1.shape[0]):
    temp = d2.loc[d2['a'] == d1.at[i,'a']]
    d[i] = temp['d'].array[0]
d1['d'] = d
This is d1 original:
a b c
0 1 2 3
1 2 4 5
2 3 4 5
3 2 1 4
4 3 4 5
This is d2:
a d
0 1 7
1 2 8
2 3 11
This is a resultant d1:
a b c d
0 1 2 3 7
1 2 4 5 8
2 3 4 5 11
3 2 1 4 8
4 3 4 5 11
You're probably looking for pd.merge.
In your case, d1 = d1.merge(d2, on=['a'], how='left') should do the trick.
Another way is to use map, which looks up only the values you need:
d1['d'] = d1['a'].map(d2.set_index('a')['d'])
d1
Output:
a b c d
0 1 2 3 7
1 2 4 5 8
2 3 4 5 11
3 2 1 4 8
4 3 4 5 11
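For comparison, here is a self-contained sketch that runs both suggestions side by side. The validate argument is an optional addition, not part of the original answers; it makes merge raise an error if d2 ever contains the same key twice, which would otherwise silently duplicate rows of d1. The names merged and mapped are illustrative.

```python
import numpy as np
import pandas as pd

d1 = pd.DataFrame(np.asarray([[1, 2, 3], [2, 4, 5], [3, 4, 5], [2, 1, 4], [3, 4, 5]]),
                  columns=['a', 'b', 'c'])
d2 = pd.DataFrame(np.asarray([[1, 7], [2, 8], [3, 11]]), columns=['a', 'd'])

# Left merge keeps every row of d1 in its original order.
# validate='many_to_one' checks that each key appears at most once in d2.
merged = d1.merge(d2, on='a', how='left', validate='many_to_one')

# map performs the same lookup through a Series indexed by 'a'.
mapped = d1['a'].map(d2.set_index('a')['d'])

print(merged)
print(mapped)
```

With either approach, a value of a that has no match in d2 ends up as NaN in the new column.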

Why does dropping a column by position remove all columns with the same name in a dataframe?

import pandas as pd
df1 = pd.DataFrame({"A":[14, 4, 5, 4],"B":[1,2,3,4]})
df2 = pd.DataFrame({"A":[14, 4, 5, 4],"C":[5,6,7,8]})
df = pd.concat([df1,df2],axis=1)
Let's look at the concatenated df; the first and third columns share the same column name, A.
df
A B A C
0 14 1 14 5
1 4 2 4 6
2 5 3 5 7
3 4 4 4 8
I want to get the following format.
df
A B C
0 14 1 5
1 4 2 6
2 5 3 7
3 4 4 8
Drop the third column by position:
result = df.drop(df.columns[2],axis=1)
result
B C
0 1 5
1 2 6
2 3 7
3 4 8
I can get what I expect this way:
import pandas as pd
df1 = pd.DataFrame({"A":[14, 4, 5, 4],"B":[1,2,3,4]})
df2 = pd.DataFrame({"A":[14, 4, 5, 4],"C":[5,6,7,8]})
df2 = df2.drop(df2.columns[0],axis=1)
df = pd.concat([df1,df2],axis=1)
It is strange that both the first and third columns are removed when I drop one column by position.
1. What is the reason for this behavior?
2. How can I remove the third column while keeping the first column?
Here's a way using indexes:
index_to_drop = 2
# get indexes to keep
col_idxs = [en for en, _ in enumerate(df.columns) if en != index_to_drop]
# subset the df
df = df.iloc[:,col_idxs]
A B C
0 14 1 5
1 4 2 6
2 5 3 7
3 4 4 8

How to remove repeated rows spanning two dataframe columns in Python

I have a dataframe as follows:
import pandas as pd
d = {'location1': [1, 2, 3, 8, 6], 'location2': [2, 1, 4, 6, 8]}
df = pd.DataFrame(data=d)
Each row of df means there is a road between two locations. It looks like this:
location1 location2
0 1 2
1 2 1
2 3 4
3 8 6
4 6 8
The first row means there is a road between location 1 and location 2; however, the second row encodes the same information. The fourth and fifth rows also repeat each other. I am trying to remove the repeats by keeping only one row of each pair. Either row is okay.
For example, my expected output is
location1 location2
0 1 2
2 3 4
4 6 8
Is there an efficient way to do this? I have a large dataframe with lots of repeated rows.
Thanks a lot.
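It looks like you want every other row in your dataframe. This works for the sample data, where the rows to keep happen to sit at every second position, but it will not work if the repeats are arranged differently.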
import pandas as pd
d = {'location1': [1, 2, 3, 8, 6], 'location2': [2, 1, 4, 6, 8]}
df = pd.DataFrame(data=d)
print(df)
location1 location2
0 1 2
1 2 1
2 3 4
3 8 6
4 6 8
def Every_other_row(a):
    return a[::2]
Every_other_row(df)
location1 location2
0 1 2
2 3 4
4 6 8
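Slicing every other row depends on where the repeats sit. A more general sketch, which is a different technique from the answer above, treats each row as an unordered pair: sort the two values within each row, then drop rows whose sorted pair has already appeared. The names pairs, is_repeat and result are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'location1': [1, 2, 3, 8, 6], 'location2': [2, 1, 4, 6, 8]})

# Sort the two values within each row so (2, 1) and (1, 2) become the same key.
pairs = np.sort(df[['location1', 'location2']].to_numpy(), axis=1)

# Flag rows whose sorted pair already appeared in an earlier row.
is_repeat = pd.DataFrame(pairs, index=df.index).duplicated()

# Keep the first occurrence of each road, with its original values and index.
result = df[~is_repeat]
print(result)
```

This keeps the first row of each pair, so for the sample data it retains row 3 (8, 6) rather than row 4 (6, 8); the question states that either row is acceptable.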

Create a new column with the minimum of other columns on same row

I have the following DataFrame
Input:
A B C D E
2 3 4 5 6
1 1 2 3 2
2 3 4 5 6
I want to add a new column that has the minimum of A, B and C for that row
Output:
A B C D E Goal
2 3 4 5 6 2
1 1 2 3 2 1
2 3 4 5 6 2
I have tried to use
df = df[['A','B','C']].min()
but I get errors about hashing lists, and I also think this would give the min of each whole column. I only want the min across those specific columns within each row.
How can I best accomplish this?
Use min along the columns with axis=1
An inline solution that produces a copy and doesn't alter the original:
df.assign(Goal=lambda d: d[['A', 'B', 'C']].min(1))
A B C D E Goal
0 2 3 4 5 6 2
1 1 1 2 3 2 1
2 2 3 4 5 6 2
The same answer, put differently.
Add the column to the existing dataframe:
new = df[['A', 'B', 'C']].min(axis=1)
df['Goal'] = new
df
A B C D E Goal
0 2 3 4 5 6 2
1 1 1 2 3 2 1
2 2 3 4 5 6 2
Add axis = 1 to your min
df['Goal'] = df[['A','B','C']].min(axis = 1)
You have to specify the axis along which min is applied, which here is 1 (across the columns, giving one value per row).
df['ABC_row_min'] = df[['A', 'B', 'C']].min(axis = 1)
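To make the difference between the two axes concrete, here is a small runnable sketch. The dataframe is rebuilt from the input table in the question; the names col_min and row_min are illustrative.

```python
import pandas as pd

# Rebuild the input table from the question.
df = pd.DataFrame({'A': [2, 1, 2], 'B': [3, 1, 3], 'C': [4, 2, 4],
                   'D': [5, 3, 5], 'E': [6, 2, 6]})

# Default axis=0: one minimum per column, indexed by column name.
col_min = df[['A', 'B', 'C']].min()

# axis=1: one minimum per row, indexed like df, so it can be assigned as a column.
row_min = df[['A', 'B', 'C']].min(axis=1)
df['Goal'] = row_min

print(col_min)
print(df)
```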

Pandas Conditionally Combine (and sum) Rows

Given the following data frame:
import pandas as pd
df=pd.DataFrame({'A':['A','A','A','B','B','B'],
'B':[1,1,2,1,1,1],
'C':[2,4,6,3,5,7]})
df
A B C
0 A 1 2
1 A 1 4
2 A 2 6
3 B 1 3
4 B 1 5
5 B 1 7
Wherever there are duplicate rows per columns 'A' and 'B', I'd like to combine those rows and sum the value under column 'C' like this:
A B C
0 A 1 6
2 A 2 6
3 B 1 15
So far, I can at least identify the duplicates like this:
df['Dup']=df.duplicated(['A','B'],keep=False)
Thanks in advance!
Use groupby() and sum():
In [94]: df.groupby(['A','B']).sum().reset_index()
Out[94]:
A B C
0 A 1 6
1 A 2 6
2 B 1 15
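A variant of the same idea, sketched here as one possible spelling: passing as_index=False keeps the grouping keys as ordinary columns, so no reset_index call is needed, and selecting 'C' explicitly limits the sum to that column. The name out is illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['A', 'A', 'A', 'B', 'B', 'B'],
                   'B': [1, 1, 2, 1, 1, 1],
                   'C': [2, 4, 6, 3, 5, 7]})

# Group on both key columns and sum C within each group.
# as_index=False returns A and B as regular columns instead of an index.
out = df.groupby(['A', 'B'], as_index=False)['C'].sum()
print(out)
```

Note that the result gets a fresh 0..n-1 index, as in the answer above, rather than the original row labels shown in the question's expected output.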
