Add Column based on information from other dataframe pandas - python-3.x

I am looking for an answer to a question that I would otherwise have solved with for loops.
I have two pandas Dataframes:
A =           ind_1  ind_2  ind_3
     prod_id
     a            1      0      0
     a            0      1      0
     b            0      1      0
     c            0      0      1
     a            0      0      1

B =         a    b    c
     ind_1  0.1  0.2  0.3
     ind_2  0.4  0.5  0.6
     ind_3  0.7  0.8  0.9
I am looking for a way to solve the following problem with pandas:
I want to map the entries of dataframe B using the index and column names and create a new column in dataframe A, so the result will look like this:
A =           ind_1  ind_2  ind_3    y
     prod_id
     a            1      0      0  0.1
     a            0      1      0  0.4
     b            0      1      0  0.5
     c            0      0      1  0.9
     a            0      0      1  0.7
Is there a way to solve this problem without a for loop?
Thank you in advance!

Use DataFrame.stack to get a MultiIndex Series from both DataFrames, then keep only the 1 values with a callable, filter the b values with Index.isin, remove the first level of the MultiIndex, and finally add the new column - it is aligned by the index values of A:
a = A.T.stack().loc[lambda x: x == 1]
b = B.stack()
b = b[b.index.isin(a.index)].reset_index(level=0, drop=True)
A['y'] = b
print (A)
         ind_1  ind_2  ind_3    y
prod_id
a            1      0      0  0.1
b            0      1      0  0.5
c            0      0      1  0.9
Or use DataFrame.join with DataFrame.query for filtering, though the processing is a bit more involved:
a = A.stack()
b = B.stack()
s = (a.to_frame('a')
      .rename_axis((None, None))
      .join(b.swaplevel(1, 0).rename('b'))
      .query("a == 1")
      .reset_index(level=1, drop=True))
A['y'] = s['b']
print (A)
         ind_1  ind_2  ind_3    y
prod_id
a            1      0      0  0.1
b            0      1      0  0.5
c            0      0      1  0.9
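A further alternative sketch (my own, not from the answer above): since each row of A holds exactly one 1, look the (indicator, product) pairs up in the stacked B; .to_numpy() assigns positionally, so it also works when prod_id contains duplicates as in the question (the names cols and pairs are mine):
cols = A.eq(1).idxmax(axis=1)                       # name of the column holding the 1 in each row
pairs = pd.MultiIndex.from_arrays([cols, A.index])  # (ind_i, prod_id) keys into B.stack()
A['y'] = B.stack().reindex(pairs).to_numpy()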

Related

Creating a pandas DataFrame from a list of dictionaries

I have the following data:
sentences = [{'mary': 'N', 'jane': 'N', 'can': 'M', 'see': 'V', 'will': 'N'},
             {'spot': 'N', 'will': 'M', 'see': 'V', 'mary': 'N'},
             {'will': 'M', 'jane': 'N', 'spot': 'V', 'mary': 'N'},
             {'mary': 'N', 'will': 'M', 'pat': 'V', 'spot': 'N'}]
I want to create a data frame where each key (from the pairs above) is a row index and each value (N/M/V) is a column name. Each cell should count how many times that key occurs with that value.
The expected result should be:
df = pd.DataFrame([(4, 0, 0),
                   (2, 0, 0),
                   (0, 1, 0),
                   (0, 0, 2),
                   (1, 3, 0),
                   (2, 0, 1),
                   (0, 0, 1)],
                  index=['mary', 'jane', 'can', 'see', 'will', 'spot', 'pat'],
                  columns=('N', 'M', 'V'))
Use value_counts per column via DataFrame.apply, replace the missing values, convert to integers, and finally transpose with DataFrame.T:
df = pd.DataFrame(sentences)
df = df.apply(pd.value_counts).fillna(0).astype(int).T
print (df)
      M  N  V
mary  0  4  0
jane  0  2  0
can   1  0  0
see   0  0  2
will  3  1  0
spot  0  2  1
pat   0  0  1
Or use DataFrame.stack with SeriesGroupBy.value_counts and Series.unstack:
df = df.stack().groupby(level=1).value_counts().unstack(fill_value=0)
print (df)
      M  N  V
can   1  0  0
jane  0  2  0
mary  0  4  0
pat   0  0  1
see   0  0  2
spot  0  2  1
will  3  1  0
pd.DataFrame(sentences).T.stack().groupby(level=0).value_counts().unstack().fillna(0)
        M    N    V
can   1.0  0.0  0.0
jane  0.0  2.0  0.0
mary  0.0  4.0  0.0
pat   0.0  0.0  1.0
see   0.0  0.0  2.0
spot  0.0  2.0  1.0
will  3.0  1.0  0.0
Cast to int if needed:
pd.DataFrame(sentences).T.stack().groupby(level=0).value_counts().unstack().fillna(0).astype(int)
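A further hedged alternative (mine, not from the answers above): reshape to long form, then let pd.crosstab count the (word, tag) pairs; it relies on the sentences list from the question, and the name long is mine:
long = pd.DataFrame(sentences).stack()   # long form: (sentence, word) -> tag
print(pd.crosstab(long.index.get_level_values(1), long.values))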

Dataframe conversion

I am trying to convert a data frame into a 1/0 matrix format:
data = pd.DataFrame({'Val1': ['A', 'B', 'B'],
                     'Val2': ['C', 'A', 'D'],
                     'Val3': ['E', 'F', 'C'],
                     'Comb': ['Comb1', 'Comb2', 'Comb3']})
data:
Val1 Val2 Val3 Comb
0 A C E Comb1
1 B A F Comb2
2 B D C Comb3
What I need is to convert to below data frame
Comb A C D E B F
0 Comb1 1 1 0 1 0 0
1 Comb2 1 0 0 0 1 1
2 Comb3 0 1 1 0 1 0
I was able to do it with a for loop, but as my dataframe grows, the processing time increases. Is there a better way to do it?
header = set(data[['Val1', 'Val2', 'Val3']].values.ravel())
matrix = pd.DataFrame(columns=header)
for i in range(data.shape[0]):
    temp_dict = {data["Val1"].iloc[i]: 1, data["Val2"].iloc[i]: 1, data["Val3"].iloc[i]: 1}
    matrix = matrix.append(temp_dict, ignore_index=True)
matrix = matrix.loc[:, matrix.columns.notnull()]
matrix = matrix.fillna(0)
matrix = pd.merge(data[["Comb"]], matrix, left_index=True, right_index=True, how='outer')
Thanks!
There may be a better solution, but this is what came to my mind: convert each row to a dictionary of "present" letters, build a Series from the dictionary, and combine the Series into a dataframe.
data.loc[:, 'Val1':'Val3'].apply(
    lambda row: pd.Series({letter: 1 for letter in row}), axis=1
).fillna(0).astype(int).join(data.Comb)
# A B C D E F Comb
#0 1 0 1 0 1 0 Comb1
#1 1 1 0 0 0 1 Comb2
#2 0 1 1 1 0 0 Comb3
There are probably multiple ways to solve this; I used pd.crosstab:
import pandas as pd

data = pd.DataFrame({'Val1': ['A', 'B', 'B'],
                     'Val2': ['C', 'A', 'D'],
                     'Val3': ['E', 'F', 'C'],
                     'Comb': ['Comb1', 'Comb2', 'Comb3']})
data["lst"] = data[['Val1', 'Val2', 'Val3']].values.tolist()
data = data.explode("lst")
print(pd.crosstab(data["Comb"], data["lst"]))
Out[20]:
lst A B C D E F
Comb
Comb1 1 0 1 0 1 0
Comb2 1 1 0 0 0 1
Comb3 0 1 1 1 0 0
I guess this will work; please let me know if it does:
pd.get_dummies(data, columns=['Val1','Val2','Val3'],prefix="",prefix_sep="").groupby(axis=1,level=0).sum()
Here's another way:
data.melt('Comb').set_index('Comb')['value'].str.get_dummies().sum(level=0).reset_index()
Output:
Comb A B C D E F
0 Comb1 1 0 1 0 1 0
1 Comb2 1 1 0 0 0 1
2 Comb3 0 1 1 1 0 0

Merge multiple binary encoded rows into one in pandas dataframe

I have a pandas.DataFrame that looks like this:
   A  B  C  D  E  F
0  0  1  0  0  0  0
1  1  0  0  0  0  0
2  0  0  1  0  0  0
3  0  0  0  1  0  0
4  0  0  0  0  1  0
Several rows share a 1 in the same column, and each row contains exactly one 1. I want to merge the rows so the resulting DataFrame consists of a single row that combines all the 1s of the dataframe, like this:
   A  B  C  D  E  F
0  1  1  1  1  1  0
Is there a smart, easy way to do this with pandas?
Use DataFrame.sum, then compare for greater than or equal to 1 with Series.ge, and finally convert the boolean mask to 0/1 with Series.view:
s = df.sum().ge(1).view('i1')
Another idea, if there are only 0/1 values, is to use DataFrame.any and convert the mask to 0/1:
s = df.any().view('i1')
print (s)
A 1
B 1
C 1
D 1
E 1
F 0
dtype: int8
We can do
df.sum().ge(1).astype(int)
Out[316]:
A 1
B 1
C 1
D 1
E 1
F 0
dtype: int32
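Both answers return a Series; to get the single-row DataFrame shown in the question, a small follow-up sketch (assuming the data is strictly 0/1):
out = df.any().astype(int).to_frame().T
print(out)
   A  B  C  D  E  F
0  1  1  1  1  1  0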

Creating columns in a pandas dataframe based on a column value in another dataframe

I have two pandas dataframes:
import pandas as pd
import numpy as np
import datetime
data = {'group': ["A", "A", "B", "B"],
        'val': ["AA", "AB", "B1", "B2"],
        'cal1': [4, 5, 7, 6],
        'cal2': [10, 100, 100, 10]}
df1 = pd.DataFrame(data)
df1
group val cal1 cal2
0 A AA 4 10
1 A AB 5 100
2 B B1 7 100
3 B B2 6 10
data = {'group': ["A", "A", "A", "B", "B", "B", "B", "B", "B", "B"],
        'flag': [1, 0, 0, 1, 0, 0, 0, 1, 0, 0],
        'var1': [1, 2, 3, 7, 8, 9, 10, 15, 20, 30]}
# Create DataFrame
df2 = pd.DataFrame(data)
df2
group flag var1
0 A 1 1
1 A 0 2
2 A 0 3
3 B 1 7
4 B 0 8
5 B 0 9
6 B 0 10
7 B 1 15
8 B 0 20
9 B 0 30
Step 1: Create columns in df2 (with suffix "_new") based on the unique "val" values in df1, like below:
unique_val = df1['val'].unique().tolist()
new_cols = [t + '_new' for t in unique_val]
for i in new_cols:
    df2[i] = 0
df2
group flag var1 AA_new AB_new B1_new B2_new
0 A 1 1 0 0 0 0
1 A 0 2 0 0 0 0
2 A 0 3 0 0 0 0
3 B 1 7 0 0 0 0
4 B 0 8 0 0 0 0
5 B 0 9 0 0 0 0
6 B 0 10 0 0 0 0
7 B 1 15 0 0 0 0
8 B 0 20 0 0 0 0
9 B 0 30 0 0 0 0
Step 2: For rows where flag = 1, AA_new is calculated as var1 (from df2) * cal1 * cal2 (taken from the df1 row with group "A" and val "AA"); similarly, AB_new is var1 (from df2) * cal1 * cal2 (from the df1 row with group "A" and val "AB"), and likewise for the B1_new and B2_new columns.
My expected output should look like below:
group flag var1 AA_new AB_new B1_new B2_new
0 A 1 1 40.0 500.0 0.0 0.0
1 A 0 2 0.0 0.0 0.0 0.0
2 A 0 3 0.0 0.0 0.0 0.0
3 B 1 7 0.0 0.0 4900.0 420.0
4 B 0 8 0.0 0.0 0.0 0.0
5 B 0 9 0.0 0.0 0.0 0.0
6 B 0 10 0.0 0.0 0.0 0.0
7 B 1 15 0.0 0.0 10500.0 900.0
8 B 0 20 0.0 0.0 0.0 0.0
9 B 0 30 0.0 0.0 0.0 0.0
The solution below, based on another Stack Overflow question, works partially:
df2.assign(**df1.assign(mul_cal=df1['cal1'].mul(df1['cal2']))
                .pivot_table(columns='val',
                             values='mul_cal',
                             index=['group', df2.index])
                .add_suffix('_new')
                .groupby(level=0)
                .apply(lambda x: x.bfill().ffill())
                .reset_index(level='group', drop='group')
                .fillna(0)
                .mul(df2['var1'], axis=0)
                .where(df2['flag'].eq(1), 0)
           )
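For reference, a fully vectorized sketch (my own, not from the answers), assuming each (group, val) pair occurs exactly once in df1 and flag is strictly 0/1: build one weight per pair as cal1 * cal2, pivot it to wide form, align the weights to df2 by group, scale by var1, and let the flag zero out the other rows (the names w and new are mine):
w = (df1.assign(w=df1['cal1'] * df1['cal2'])
        .pivot(index='group', columns='val', values='w')
        .add_suffix('_new'))
new = w.reindex(df2['group']).set_axis(df2.index)   # one row of weights per df2 row
df2[new.columns] = new.mul(df2['var1'] * df2['flag'], axis=0).fillna(0)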
Flexible Columns
If you want this to keep working when more rows are added to df1, you can do this (note it uses the cal3 column created in the hard-coded version below):
combinations = df1.groupby(['group', 'val'])['cal3'].sum().reset_index()
for index_, row_ in combinations.iterrows():
    for index, row in df2.iterrows():
        if row['flag'] == 1:
            if row['group'] == row_['group']:
                df2.loc[index, row_['val'] + '_new'] = row['var1'] * df1[(df1['group'] == row_['group']) & (df1['val'] == row_['val'])]['cal3'].values[0]
Hard Code
You can iterate over the dataframe and change specific columns in each iteration; something like this (but you need to add the new cal3 column to df1 first):
df1['cal3'] = df1['cal1'] * df1['cal2']

for index, row in df2.iterrows():
    if row['flag'] == 1:
        if row['group'] == 'A':
            df2.loc[index, 'AA_new'] = row['var1'] * df1[(df1['group'] == 'A') & (df1['val'] == 'AA')]['cal3'].values[0]
            df2.loc[index, 'AB_new'] = row['var1'] * df1[(df1['group'] == 'A') & (df1['val'] == 'AB')]['cal3'].values[0]
        elif row['group'] == 'B':
            df2.loc[index, 'B1_new'] = row['var1'] * df1[(df1['group'] == 'B') & (df1['val'] == 'B1')]['cal3'].values[0]
            df2.loc[index, 'B2_new'] = row['var1'] * df1[(df1['group'] == 'B') & (df1['val'] == 'B2')]['cal3'].values[0]

Is there any way to make more than one dummy variable at a time? [duplicate]

How can one idiomatically run a function like get_dummies, which expects a single column and returns several, on multiple DataFrame columns?
With pandas 0.19, you can do that in a single line:
pd.get_dummies(data=df, columns=['A', 'B'])
The columns argument specifies which columns to one-hot encode.
>>> df
A B C
0 a c 1
1 b c 2
2 a b 3
>>> pd.get_dummies(data=df, columns=['A', 'B'])
C A_a A_b B_b B_c
0 1 1.0 0.0 0.0 1.0
1 2 0.0 1.0 0.0 1.0
2 3 1.0 0.0 1.0 0.0
Since pandas version 0.15.0, pd.get_dummies can handle a DataFrame directly (before that, it could only handle a single Series; see below for the workaround):
In [1]: df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'],
   ...:                    'C': [1, 2, 3]})
In [2]: df
Out[2]:
A B C
0 a c 1
1 b c 2
2 a b 3
In [3]: pd.get_dummies(df)
Out[3]:
C A_a A_b B_b B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
Workaround for pandas < 0.15.0
You can do it for each column separately and then concat the results:
In [111]: df
Out[111]:
A B
0 a x
1 a y
2 b z
3 b x
4 c x
5 a y
6 b y
7 c z
In [112]: pd.concat([pd.get_dummies(df[col]) for col in df], axis=1, keys=df.columns)
Out[112]:
A B
a b c x y z
0 1 0 0 1 0 0
1 1 0 0 0 1 0
2 0 1 0 0 0 1
3 0 1 0 1 0 0
4 0 0 1 1 0 0
5 1 0 0 0 1 0
6 0 1 0 0 1 0
7 0 0 1 0 0 1
If you don't want the multi-index column, then remove the keys=.. from the concat function call.
Somebody may have something more clever, but here are two approaches, assuming you have a dataframe named df with columns 'Name' and 'Year' that you want dummies for.
First, simply iterating over the columns isn't too bad:
In [93]: for column in ['Name', 'Year']:
    ...:     dummies = pd.get_dummies(df[column])
    ...:     df[dummies.columns] = dummies
Another idea would be to use the patsy package, which is designed to construct data matrices from R-style formulas:
In [94]: patsy.dmatrix(' ~ C(Name) + C(Year)', df, return_type="dataframe")
Unless I don't understand the question, it is supported natively in get_dummies by passing the columns argument.
The simple trick I am currently using is a for loop. First separate the categorical data from the DataFrame using select_dtypes(include="object"), then apply get_dummies to each column iteratively in a for loop, as shown in the code below:
train_cate = train_data.select_dtypes(include="object")
test_cate = test_data.select_dtypes(include="object")
# vectorize categorical data
for col in train_cate:
    cate1 = pd.get_dummies(train_cate[col])
    train_cate[cate1.columns] = cate1
    cate2 = pd.get_dummies(test_cate[col])
    test_cate[cate2.columns] = cate2
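For completeness, a small usage sketch of the native approach on current pandas (recent versions return boolean dummy columns by default; the dtype argument, available since pandas 0.23, controls the output type):
import pandas as pd

df = pd.DataFrame({'A': ['a', 'b', 'a'], 'B': ['c', 'c', 'b'], 'C': [1, 2, 3]})
print(pd.get_dummies(df, columns=['A', 'B'], dtype=int))
#    C  A_a  A_b  B_b  B_c
# 0  1    1    0    0    1
# 1  2    0    1    0    1
# 2  3    1    0    1    0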
