Creating columns in a pandas dataframe based on a column value in other dataframe - python-3.x

I have two pandas DataFrames:
import pandas as pd
import numpy as np
import datetime
data = {'group' :["A","A","B","B"],
'val': ["AA","AB","B1","B2"],
'cal1' :[4,5,7,6],
'cal2' :[10,100,100,10]
}
df1 = pd.DataFrame(data)
df1
group val cal1 cal2
0 A AA 4 10
1 A AB 5 100
2 B B1 7 100
3 B B2 6 10
data = {'group' :["A","A","A","B","B","B","B", "B", "B", "B"],
'flag' : [1,0,0,1,0,0,0, 1, 0, 0],
'var1': [1,2,3,7,8,9,10, 15, 20, 30]
}
# Create DataFrame
df2 = pd.DataFrame(data)
df2
group flag var1
0 A 1 1
1 A 0 2
2 A 0 3
3 B 1 7
4 B 0 8
5 B 0 9
6 B 0 10
7 B 1 15
8 B 0 20
9 B 0 30
Step 1: Create columns in df2 (with the suffix "_new") based on the unique "val" values in df1, like below:
unique_val = df1['val'].unique().tolist()
new_cols = [t + '_new' for t in unique_val]
for i in new_cols:
    df2[i] = 0
df2
group flag var1 AA_new AB_new B1_new B2_new
0 A 1 1 0 0 0 0
1 A 0 2 0 0 0 0
2 A 0 3 0 0 0 0
3 B 1 7 0 0 0 0
4 B 0 8 0 0 0 0
5 B 0 9 0 0 0 0
6 B 0 10 0 0 0 0
7 B 1 15 0 0 0 0
8 B 0 20 0 0 0 0
9 B 0 30 0 0 0 0
Step 2: For rows where flag = 1, each new column is calculated as var1 (from df2) * cal1 * cal2, where cal1 and cal2 come from the df1 row with the matching group and val. For example, AA_new is var1 * cal1 * cal2 for group "A" and val "AA", and AB_new is var1 * cal1 * cal2 for group "A" and val "AB". Columns belonging to another group, and all rows where flag = 0, stay 0.
My expected output should look like below:
group flag var1 AA_new AB_new B1_new B2_new
0 A 1 1 40.0 500.0 0.0 0.0
1 A 0 2 0.0 0.0 0.0 0.0
2 A 0 3 0.0 0.0 0.0 0.0
3 B 1 7 0.0 0.0 4900.0 420.0
4 B 0 8 0.0 0.0 0.0 0.0
5 B 0 9 0.0 0.0 0.0 0.0
6 B 0 10 0.0 0.0 0.0 0.0
7 B 1 15 0.0 0.0 10500.0 900.0
8 B 0 20 0.0 0.0 0.0 0.0
9 B 0 30 0.0 0.0 0.0 0.0
The solution below, based on another Stack Overflow question, works only partially:
df2.assign(**df1.assign(mul_cal = df1['cal1'].mul(df1['cal2']))
.pivot_table(columns='val',
values='mul_cal',
index = ['group', df2.index])
.add_suffix('_new')
.groupby(level=0)
.apply(lambda x: x.bfill().ffill())
.reset_index(level='group',drop='group')
.fillna(0)
.mul(df2['var1'], axis=0)
.where(df2['flag'].eq(1), 0)
)

Flexible Columns
If you want this to keep working when more rows are added to df1, you can do this (cal3 is the product of cal1 and cal2):
df1['cal3'] = df1['cal1'] * df1['cal2']
combinations = df1.groupby(['group','val'])['cal3'].sum().reset_index()
for index_, row_ in combinations.iterrows():
    for index, row in df2.iterrows():
        if row['flag'] == 1:
            if row['group'] == row_['group']:
                df2.loc[index, row_['val'] + '_new'] = row['var1'] * df1[(df1['group'] == row_['group']) & (df1['val'] == row_['val'])]['cal3'].values[0]
Hard Code
You can iterate over the DataFrame and update specific columns in each iteration, something like this (you need to add the new column cal3 to df1 first).
df1['cal3'] = df1['cal1'] * df1['cal2']
for index, row in df2.iterrows():
    if row['flag'] == 1:
        if row['group'] == 'A':
            df2.loc[index, 'AA_new'] = row['var1'] * df1[(df1['group'] == 'A') & (df1['val'] == 'AA')]['cal3'].values[0]
            df2.loc[index, 'AB_new'] = row['var1'] * df1[(df1['group'] == 'A') & (df1['val'] == 'AB')]['cal3'].values[0]
        elif row['group'] == 'B':
            df2.loc[index, 'B1_new'] = row['var1'] * df1[(df1['group'] == 'B') & (df1['val'] == 'B1')]['cal3'].values[0]
            df2.loc[index, 'B2_new'] = row['var1'] * df1[(df1['group'] == 'B') & (df1['val'] == 'B2')]['cal3'].values[0]
This is the result I got.
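As a further alternative, here is a minimal vectorized sketch that avoids iterrows. It assumes the df1 and df2 from the question and that each (group, val) pair appears only once in df1. The names factors, aligned, scale and result are illustrative and not from the original posts.

```python
import pandas as pd

df1 = pd.DataFrame({'group': ["A", "A", "B", "B"],
                    'val': ["AA", "AB", "B1", "B2"],
                    'cal1': [4, 5, 7, 6],
                    'cal2': [10, 100, 100, 10]})
df2 = pd.DataFrame({'group': ["A", "A", "A", "B", "B", "B", "B", "B", "B", "B"],
                    'flag': [1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
                    'var1': [1, 2, 3, 7, 8, 9, 10, 15, 20, 30]})

# One row per group, one column per val, holding cal1 * cal2
# (0 where a val does not belong to that group).
factors = (df1.assign(mul_cal=df1['cal1'] * df1['cal2'])
              .pivot_table(index='group', columns='val',
                           values='mul_cal', fill_value=0)
              .add_suffix('_new'))

# Repeat the factor row of each df2 row's group, then reuse df2's index
# so the two frames line up row by row.
aligned = factors.loc[df2['group']]
aligned.index = df2.index

# Multiply by var1 where flag == 1 and by 0 everywhere else.
scale = df2['var1'].where(df2['flag'].eq(1), 0)
result = df2.join(aligned.mul(scale, axis=0))
print(result)
```

Because the new column names come from the pivoted df1, adding more (group, val) rows to df1 should add columns without changing the code.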

Related

Creating python data frame from list of dictionary

I have the following data:
sentences = [{'mary':'N', 'jane':'N', 'can':'M', 'see':'V','will':'N'},
{'spot':'N','will':'M','see':'V','mary':'N'},
{'will':'M','jane':'N','spot':'V','mary':'N'},
{'mary':'N','will':'M','pat':'V','spot':'N'}]
I want to create a data frame that counts how often each key (from the pairs above) is paired with each value: in the expected result below, the keys form the row index and the values form the columns.
The expected result should be:
df = pd.DataFrame([(4,0,0),
(2,0,0),
(0,1,0),
(0,0,2),
(1,3,0),
(2,0,1),
(0,0,1)],
index=['mary', 'jane', 'can', 'see', 'will', 'spot', 'pat'],
columns=('N','M','V'))
Build a DataFrame from the list, count the values per column with value_counts in DataFrame.apply, replace missing values, convert to integers, and finally transpose with DataFrame.T:
df = pd.DataFrame(sentences)
df = df.apply(pd.Series.value_counts).fillna(0).astype(int).T
print (df)
M N V
mary 0 3 1
jane 0 2 0
can 1 0 0
see 0 0 2
will 3 1 0
spot 0 2 1
pat 0 0 1
Or use DataFrame.stack with SeriesGroupBy.value_counts and Series.unstack on the original DataFrame:
df = pd.DataFrame(sentences).stack().groupby(level=1).value_counts().unstack(fill_value=0)
print (df)
M N V
can 1 0 0
jane 0 2 0
mary 0 3 1
pat 0 0 1
see 0 0 2
spot 0 2 1
will 3 1 0
pd.DataFrame(sentences).T.stack().groupby(level=0).value_counts().unstack().fillna(0)
M N V
can 1.0 0.0 0.0
jane 0.0 2.0 0.0
mary 0.0 3.0 1.0
pat 0.0 0.0 1.0
see 0.0 0.0 2.0
spot 0.0 2.0 1.0
will 3.0 1.0 0.0
Cast to int if needed.
pd.DataFrame(sentences).T.stack().groupby(level=0).value_counts().unstack().fillna(0).astype(int)
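Another way to get the same kind of count table is pd.crosstab on a long-format frame. This is only a sketch under the assumption that sentences is the list from the question; the names long_df, word and tag are made up for illustration.

```python
import pandas as pd

sentences = [{'mary': 'N', 'jane': 'N', 'can': 'M', 'see': 'V', 'will': 'N'},
             {'spot': 'N', 'will': 'M', 'see': 'V', 'mary': 'N'},
             {'will': 'M', 'jane': 'N', 'spot': 'V', 'mary': 'N'},
             {'mary': 'N', 'will': 'M', 'pat': 'V', 'spot': 'N'}]

# Long format: one row per (word, tag) pair. Words missing from a
# sentence show up as NaN after melt and are dropped.
long_df = (pd.DataFrame(sentences)
             .melt(var_name='word', value_name='tag')
             .dropna())

# Count how often each word occurs with each tag.
counts = pd.crosstab(long_df['word'], long_df['tag'])
print(counts)
```

crosstab returns integer counts directly, so no fillna or cast is needed afterwards.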

How to split string with the values in their specific columns indexed on their label?

I have the following data
Index Data
0 100CO
1 50CO-50PET
2 98CV-2EL
3 50CV-50CO
. .
. .
. .
I need to split the data into separate columns, each with its own header and values. The result should look like this:
Index Data CO PET CV EL
0 100CO 100 0 0 0
1 50CO-50PET 50 50 0 0
2 98CV-2EL 0 0 98 2
3 50CV-50CO 50 0 50 0
. .
. .
. .
The data is not limited to CO/PET/CV/EL; there should be as many columns as needed, each showing its corresponding value.
The .str.split('-', expand=True) function only splits the data: it keeps all first values in the same column and does not name the columns.
Is there a way to implement this in Python?
You could do:
df.Data.str.split('-').explode().str.split(r'(?<=\d)(?=\D)', expand=True) \
    .reset_index().pivot(index='index', columns=1, values=0).fillna(0).reset_index()
1 Index CO CV EL PET
0 0 100 0 0 0
1 1 50 0 0 50
2 2 0 98 2 0
3 3 50 50 0 0
The idea is to first split the values by -, then extract the numeric and non-numeric parts as tuples, append them to a list, and convert the list to a dictionary. A list comprehension passes these dictionaries to the DataFrame constructor; missing values are then replaced and the result is converted to integers:
import re
def f(x):
    L = []
    for val in x.split('-'):
        k, v = re.findall(r'(\d+)(\D+)', val)[0]
        L.append((v, k))
    return dict(L)
df = df.join(pd.DataFrame([f(x) for x in df['Data']], index=df.index).fillna(0).astype(int))
print (df)
Data CO PET CV EL
0 100CO 100 0 0 0
1 50CO-50PET 50 50 0 0
2 98CV-2EL 0 0 98 2
3 50CV-50CO 50 0 50 0
If the data contains values without a number, or values that are only a number, the solution needs to be more general, like:
print (df)
Data
0 100CO
1 50CO-50PET
2 98CV-2EL
3 50CV-50CO
4 AAA
5 20
def f(x):
    L = []
    for val in x.split('-'):
        extracted = re.findall(r'(\d+)(\D+)', val)
        if len(extracted) > 0:
            k, v = extracted[0]
            L.append((v, k))
        else:
            if val.isdigit():
                L.append(('No match digit', val))
            else:
                L.append((val, 0))
    return dict(L)
df = df.join(pd.DataFrame([f(x) for x in df['Data']], index=df.index).fillna(0).astype(int))
print (df)
Data CO PET CV EL AAA No match digit
0 100CO 100 0 0 0 0 0
1 50CO-50PET 50 50 0 0 0 0
2 98CV-2EL 0 0 98 2 0 0
3 50CV-50CO 50 0 50 0 0 0
4 AAA 0 0 0 0 0 0
5 20 0 0 0 0 0 20
Try this:
import pandas as pd
import re
df = pd.DataFrame({'Data':['100CO', '50CO-50PET', '98CV-2EL', '50CV-50CO']})
split_df = pd.DataFrame(df.Data.apply(lambda x: {re.findall('[A-Z]+', el)[0] : re.findall('[0-9]+', el)[0] \
for el in x.split('-')}).tolist())
split_df = split_df.fillna(0)
df = pd.concat([df, split_df], axis = 1)
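The same split can also be sketched with Series.str.extractall, which returns one row per match. This assumes every piece has the form number followed by letters, as in the sample data; the group names value and label and the variables parts, wide and result are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'Data': ['100CO', '50CO-50PET', '98CV-2EL', '50CV-50CO']})

# One row per "<number><label>" match; the result is indexed by
# (original row, match number).
parts = df['Data'].str.extractall(r'(?P<value>\d+)(?P<label>[A-Za-z]+)')
parts['value'] = parts['value'].astype(int)

# Sum the values per (original row, label) and spread labels into columns.
wide = (parts.set_index('label', append=True)['value']
             .groupby(level=[0, 2]).sum()
             .unstack(fill_value=0))

# Rows of df without any match get all zeros instead of NaN.
result = df.join(wide.reindex(df.index, fill_value=0))
print(result)
```

New labels in the data become new columns automatically, since the columns come from whatever the regular expression captures.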

Add extra columns with default values from a list in a dataframe

I have a dataframe like
df = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}
Now I want the data frame to have additional columns taken from a list, list=['a','b','c'], with 0 as the default value.
So the output will be:
Name Age a b c
Tom 20 0 0 0
nick 21 0 0 0
krish 19 0 0 0
jack 18 0 0 0
Don't use list as a variable name, because it shadows the Python builtin.
For the new columns, you can create a dictionary from the list and pass it to DataFrame.assign:
d = {'Name':['Tom', 'nick', 'krish', 'jack'], 'Age':[20, 21, 19, 18]}
df = pd.DataFrame(d)
L = ['a','b','c']
df1 = df.assign(**dict.fromkeys(L, 0))
Or create new DataFrame and use DataFrame.join:
df1 = df.join(pd.DataFrame(0, columns=L, index=df.index))
print (df1)
Name Age a b c
0 Tom 20 0 0 0
1 nick 21 0 0 0
2 krish 19 0 0 0
3 jack 18 0 0 0
>>> df.join(df.reindex(columns=list('abc'), fill_value=0))
Name Age a b c
0 Tom 20 0 0 0
1 nick 21 0 0 0
2 krish 19 0 0 0
3 jack 18 0 0 0
You can also use reindex to create a new DataFrame with fill_value set to zero, and then combine the columns using join.

Add Column based on information from other dataframe pandas

I am looking for an answer to a question which I would have solved with for loops.
I have two pandas Dataframes:
A =
         ind_1  ind_2  ind_3
prod_id
a            1      0      0
a            0      1      0
b            0      1      0
c            0      0      1
a            0      0      1
B =
         a    b    c
ind_1  0.1  0.2  0.3
ind_2  0.4  0.5  0.6
ind_3  0.7  0.8  0.9
I am looking for a way to solve the following problem with pandas:
I want to use the index and column names of B to look up its entries and add them as a new column in dataframe A, so the result looks like this:
A =
         ind_1  ind_2  ind_3    y
prod_id
a            1      0      0  0.1
a            0      1      0  0.4
b            0      1      0  0.5
c            0      0      1  0.9
a            0      0      1  0.7
Is there a way to not use for loop to solve this problem?
Thank you in advance!
Use DataFrame.stack to get a MultiIndex Series from both DataFrames, keep only the values equal to 1 with a callable, filter b by Index.isin, remove the first level of the MultiIndex, and finally add the new column, which is aligned on the index values of A:
a = A.T.stack().loc[lambda x: x == 1]
b = B.stack()
b = b[b.index.isin(a.index)].reset_index(level=0, drop=True)
A['y'] = b
print (A)
ind_1 ind_2 ind_3 y
prod_id
a 1 0 0 0.1
b 0 1 0 0.5
c 0 0 1 0.9
Or use DataFrame.join with DataFrame.query for filtering, but processing is a bit complicated:
a = A.stack()
b = B.stack()
s = (a.to_frame('a')
.rename_axis((None, None))
.join(b.swaplevel(1,0)
.rename('b'))
.query("a == 1")
.reset_index(level=1, drop=True))
A['y'] = s['b']
print (A)
ind_1 ind_2 ind_3 y
prod_id
a 1 0 0 0.1
b 0 1 0 0.5
c 0 0 1 0.9
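If prod_id contains duplicates, as in the five-row A from the question, a positional lookup with NumPy is one possible approach. This is a sketch that assumes every row of A holds exactly one 1 and that the labels of B are unique; the names ind, rows and cols are illustrative.

```python
import pandas as pd

A = pd.DataFrame({'ind_1': [1, 0, 0, 0, 0],
                  'ind_2': [0, 1, 1, 0, 0],
                  'ind_3': [0, 0, 0, 1, 1]},
                 index=pd.Index(['a', 'a', 'b', 'c', 'a'], name='prod_id'))
B = pd.DataFrame([[0.1, 0.2, 0.3],
                  [0.4, 0.5, 0.6],
                  [0.7, 0.8, 0.9]],
                 index=['ind_1', 'ind_2', 'ind_3'], columns=['a', 'b', 'c'])

# Label of the column that holds the 1 in each row of A.
ind = A.idxmax(axis=1)

# Translate the (indicator, prod_id) labels into integer positions in B.
rows = B.index.get_indexer(ind)
cols = B.columns.get_indexer(A.index)

# Pick one value of B per row of A by position.
A['y'] = B.to_numpy()[rows, cols]
print(A)
```

The assignment is positional, so it does not rely on index alignment and repeated prod_id values each get their own looked-up entry.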

pandas if else only on specific rows

I have a pandas dataframe as below. I want to apply the following condition:
Only for rows where A = 2, update the columns 'C' and 'D' to -99.
I have a function like below which updates the value of C and D to -99.
def func(df):
    for col in df.columns:
        if ("C" in col) or ("D" in col):
            df.loc[:,col] = -99
Now I want to call that function only if A = 2. I tried the code below, but it updates all rows of C and D to -99:
import pandas as pd
import math
import sys
import re
data=[[0,1,0,0, 0],
[1,2,0,0,0],
[2,0,0,0,0],
[2,4,0,0,0],
[1,8,0,0,0],
[3,2, 0,0,0]]
df = pd.DataFrame(data,columns=['A','B','C', 'D','E'])
df
def func(df):
    for col in df.columns:
        if ("C" in col) or ("D" in col):
            df.loc[:,col] = -99
if (df['A'] == 2).any():
    func(df)
print(df)
My expected output:
A B C D E
0 0 1 0 0 0
1 1 2 0 0 0
2 2 0 -99 -99 0
3 2 4 -99 -99 0
4 1 8 0 0 0
5 3 2 0 0 0
You can do that by filtering:
df.loc[df['A'] == 2, ['C', 'D']] = -99
Here the first item of the indexer filters the rows: we select only the rows where the value in column 'A' is 2. The second item selects the columns by a list of names ('C' and 'D'). We then assign -99 to the selected cells.
For the given sample data, we obtain:
>>> df = pd.DataFrame(data,columns=['A','B','C', 'D','E'])
>>> df
A B C D E
0 0 1 0 0 0
1 1 2 0 0 0
2 2 0 0 0 0
3 2 4 0 0 0
4 1 8 0 0 0
5 3 2 0 0 0
>>> df.loc[df['A'] == 2, ['C', 'D']] = -99
>>> df
A B C D E
0 0 1 0 0 0
1 1 2 0 0 0
2 2 0 -99 -99 0
3 2 4 -99 -99 0
4 1 8 0 0 0
5 3 2 0 0 0
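If you would rather keep the function from the question, one option is to pass the row filter in as an argument. This is a sketch only: the extra mask parameter is an assumption and not part of the original func.

```python
import pandas as pd

data = [[0, 1, 0, 0, 0],
        [1, 2, 0, 0, 0],
        [2, 0, 0, 0, 0],
        [2, 4, 0, 0, 0],
        [1, 8, 0, 0, 0],
        [3, 2, 0, 0, 0]]
df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D', 'E'])

def func(df, mask):
    """Set columns whose name contains 'C' or 'D' to -99 on the masked rows."""
    cols = [col for col in df.columns if ("C" in col) or ("D" in col)]
    df.loc[mask, cols] = -99

# Only rows where A equals 2 are updated; df is modified in place.
func(df, df['A'] == 2)
print(df)
```

The original attempt checked (df['A'] == 2).any() once for the whole frame and then updated every row, while here the boolean mask restricts the assignment to the matching rows.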
