Pandas dataframe merge by function on column names - python-3.x

I have two dataframes.
df_A has columns A__a, B__b, C. (shape 5,3)
df_B has columns A_a, B_b, D. (shape 4,3)
How can I unify them (without having to iterate over all columns) into one df with columns A and B (shape 9,2)? That is, A__a and A_a should be unified into the same column.
I need to merge while applying a function such as lambda x: x.replace("_","") to the column names. Is that possible?

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,5,size=(5, 3)), columns=['A__a', 'B__b', 'C'])
df:
A__a B__b C
0 3 0 2
1 0 3 4
2 0 4 4
3 4 2 1
4 3 4 3
df2 = pd.DataFrame(np.random.randint(0,4,size=(4, 3)), columns=['A__a', 'B__b', 'D'])
df2:
A__a B__b D
0 3 2 0
1 3 1 1
2 0 2 0
3 3 2 0
df3 = pd.concat([df, df2], join='inner', ignore_index=True)
df_final = df3.rename(lambda x: str(x).split("__")[0],axis='columns')
df_final
df_final:
A B
0 3 0
1 0 3
2 0 4
3 4 2
4 3 4
5 3 2
6 3 1
7 0 2
8 3 2

A simple concatenation will do, once both frames use the column names A and B:
pd.concat([df_A, df_B], join='outer')[['A', 'B']].copy()
or
pd.concat([df_A, df_B], join='inner')

You can merge the DataFrames using how='outer':
import pandas as pd
import numpy as np
df_A = pd.DataFrame(np.random.randint(10,size=(5,3)), columns=['A','B','C'])
df_B = pd.DataFrame(np.random.randint(10,size=(4,3)), columns=['A','B','D'])
print(df_A.shape,df_B.shape)
#(5, 3) (4, 3)
new_df = df_A.merge(df_B , how= 'outer', on = ['A','B'])[['A','B']]
print(new_df.shape)
# (9, 2), provided no (A, B) pair occurs in both frames
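One caveat with the outer merge above: rows whose (A, B) values appear in both frames are matched into a single row, so the result can have fewer rows than a plain concatenation. A minimal sketch with made-up data (the values below are only for illustration):

```python
import pandas as pd

# Hypothetical data: the pair (1, 1) occurs in both frames
df_A = pd.DataFrame({"A": [1, 2], "B": [1, 2], "C": [0, 0]})
df_B = pd.DataFrame({"A": [1, 3], "B": [1, 3], "D": [0, 0]})

# Outer merge on the keys: the shared pair (1, 1) becomes one row
merged = df_A.merge(df_B, how="outer", on=["A", "B"])[["A", "B"]]

# Concatenation: every row of both frames is kept
stacked = pd.concat([df_A, df_B], join="inner", ignore_index=True)

print(len(merged), len(stacked))
```

If the goal is to stack all rows (shape 9,2 in the question), concat is the safer choice.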

If you can't change the names of the columns in advance and you want to use lambda x: x.replace("_",""), this is one way:
df = pd.concat([df1.rename(columns=lambda x: str(x).replace("_","")), df2.rename(columns=lambda x: str(x).replace("_",""))], join='inner', ignore_index=True)
Example:
d1 = {'A__a' : ('A', 'B', 'C', 'D', 'E') , 'B__b' : ('a', 'b', 'c', 'd', 'e') ,'C': (1,2,3,4,5)}
df1 = pd.DataFrame(d1)
A__a B__b C
0 A a 1
1 B b 2
2 C c 3
3 D d 4
4 E e 5
d2 = {'A_a' : ('B', 'C', 'D','G') , 'B_b' : ('l','m','n','o') ,'D': (6,7,8,9)}
df2=pd.DataFrame(d2)
A_a B_b D
0 B l 6
1 C m 7
2 D n 8
3 G o 9
Output:
Aa Bb
0 A a
1 B b
2 C c
3 D d
4 E e
5 B l
6 C m
7 D n
8 G o
Alternative with an explicit rename mapping:
df = pd.concat([df1.rename(columns={'A__a':'A', 'B__b':'B'}), df2.rename(columns={'A_a':'A', 'B_b':'B'})], join='inner', ignore_index=True)
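The answers above share one idea: normalise the column names with a function, then concatenate with join='inner'. A small sketch of that pattern for any number of frames. The helper name normalise and the sample values are assumptions made for illustration, not part of the original question:

```python
import pandas as pd

def normalise(name):
    # Hypothetical helper: keep only the part before the first underscore,
    # so both 'A__a' and 'A_a' become 'A'
    return str(name).split("_")[0]

df_A = pd.DataFrame({"A__a": [1, 2], "B__b": [3, 4], "C": [5, 6]})
df_B = pd.DataFrame({"A_a": [7], "B_b": [8], "D": [9]})

# Rename the columns of every frame with the same function
frames = [frame.rename(columns=normalise) for frame in (df_A, df_B)]

# join='inner' keeps only the columns present in all frames (A and B here)
unified = pd.concat(frames, join="inner", ignore_index=True)
print(unified)
```

Any callable that maps an old label to a new one can be passed to rename in the same way, including lambda x: x.replace("_","").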

Related

How to create a weighted edgelist directed from a pandas dataframe with weights in two columns?

I have the following pandas DataFrame (df):
>>> import pandas as pd
>>> df = pd.DataFrame([
['A', 'B', '1'],
['A', 'B', '2'],
['B', 'A', '41'],
['A', 'C', '11'],
['C', 'B', '3'],
['B', 'D', '4'],
['D', 'B','51']
], columns=('station_i', 'station_j','UID'))
I used
>>> df2=df.groupby(by=['station_i', 'station_j']).size().to_frame(name = 'counts_ij').reset_index()
to obtain the dataframe df2:
>>> print(df2)
station_i station_j counts_ij
0 A B 2
1 A C 1
2 B A 1
3 B D 1
4 C B 1
5 D B 1
Now I would like to obtain the dataframe df3, built as shown below, where pairs with the same values in reversed order are dropped and their counts are recorded in an extra column:
>>> print(df3)
station_i station_j counts_ij counts_ji
0 A B 2 1
1 A C 1 0
2 C B 1 0
3 B D 1 1
Would really appreciate some suggestions
import pandas as pd
import numpy as np
# find out indices that are reverse duplicated
dupe = pd.concat([
np.maximum(df2.station_i, df2.station_j),
np.minimum(df2.station_i, df2.station_j)
], axis=1).duplicated()
df2[dupe]
station_i station_j counts_ij
2 B A 1
5 D B 1
df2[~dupe]
station_i station_j counts_ij
0 A B 2
1 A C 1
3 B D 1
4 C B 1
# split by dupe, reverse the dupe station and merge with the non dupe
df2[~dupe].merge(
df2[dupe].rename(
columns={'station_i': 'station_j', 'station_j': 'station_i', 'counts_ij': 'counts_ji'}
), how='left'
).fillna(0).astype({'counts_ji': int})
station_i station_j counts_ij counts_ji
0 A B 2 1
1 A C 1 0
2 B D 1 1
3 C B 1 0
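A variation on the same idea, as a sketch only: look up the reversed-pair count for every row first, then keep the first occurrence of each unordered pair. The column names follow the question; the intermediate names rev, both and key are made up for this example:

```python
import pandas as pd

df2 = pd.DataFrame({
    "station_i": ["A", "A", "B", "B", "C", "D"],
    "station_j": ["B", "C", "A", "D", "B", "B"],
    "counts_ij": [2, 1, 1, 1, 1, 1],
})

# Swap the station columns so each row describes the reversed trip
rev = df2.rename(columns={"station_i": "station_j",
                          "station_j": "station_i",
                          "counts_ij": "counts_ji"})

# Attach the reversed count to every row; pairs without a reverse get 0
both = (df2.merge(rev, on=["station_i", "station_j"], how="left")
           .fillna({"counts_ji": 0})
           .astype({"counts_ji": int}))

# Unordered key: (A, B) and (B, A) map to the same tuple
key = both[["station_i", "station_j"]].apply(lambda r: tuple(sorted(r)), axis=1)
df3 = both[~key.duplicated()].reset_index(drop=True)
print(df3)
```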

How to compare a string of one column of pandas with rest of the columns and if value is found in any column of the row append a new row?

I want to compare the Category column with all the predicted_site columns and, if the value matches any of them, add a column named rank containing 1 if the value is found and 0 otherwise.
Use DataFrame.filter to select the predicted columns, compare them with the category column using DataFrame.eq, convert the result to integers, rename the columns with DataFrame.add_prefix, and finally add the new columns with DataFrame.join:
df = pd.DataFrame({
'category':list('abcabc'),
'B':[4,5,4,5,5,4],
'predicted1':list('adadbd'),
'predicted2':list('cbarac')
})
df1 = df.filter(like='predicted').eq(df['category'], axis=0).astype(int).add_prefix('new_')
df = df.join(df1)
print (df)
category B predicted1 predicted2 new_predicted1 new_predicted2
0 a 4 a c 1 0
1 b 5 d b 0 1
2 c 4 a a 0 0
3 a 5 d r 0 0
4 b 5 b a 1 0
5 c 4 d c 0 1
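If a single rank column is wanted, as the question describes, the per-column comparison above can be collapsed with any(axis=1). A minimal sketch reusing the sample data from this answer (the variable name matches is made up):

```python
import pandas as pd

df = pd.DataFrame({
    'category': list('abcabc'),
    'B': [4, 5, 4, 5, 5, 4],
    'predicted1': list('adadbd'),
    'predicted2': list('cbarac'),
})

# True where a predicted column equals the category of the same row
matches = df.filter(like='predicted').eq(df['category'], axis=0)

# 1 if any predicted column matched, otherwise 0
df['rank'] = matches.any(axis=1).astype(int)
print(df)
```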
This solution is much less elegant than the one proposed by @jezrael, but you can try it.
#sample dataframe
d = {'cat': ['comp-el', 'el', 'comp', 'comp-el', 'el', 'comp'], 'predicted1': ['com', 'al', 'p', 'col', 'el', 'comp'], 'predicted2': ['a', 'el', 'p', 'n', 's', 't']}
df = pd.DataFrame(data=d)
#iterating through rows
for i, row in df.iterrows():
    #assigning values
    cat = df.loc[i,'cat']
    predicted1 = df.loc[i,'predicted1']
    predicted2 = df.loc[i,'predicted2']
    #condition
    if (cat == predicted1 or cat == predicted2):
        df.loc[i,'rank'] = 1
    else:
        df.loc[i,'rank'] = 0
output:
cat predicted1 predicted2 rank
0 comp-el com a 0.0
1 el al el 1.0
2 comp p p 0.0
3 comp-el col n 0.0
4 el el s 1.0
5 comp comp t 1.0

Easily generate edge list from specific structure using pandas

This is a question about how to make things properly with pandas (I use version 1.0).
Let's say I have a DataFrame with missions, each of which contains an origin and one or more destinations:
mid from to
0 0 A [C]
1 1 A [B, C]
2 2 B [B]
3 3 C [D, E, F]
E.g., for the mission (mid=1) people will travel from A to B, then from B to C and finally from C to A. Note that I have no control over the data model of the input DataFrame.
I would like to compute metrics on each travel of the mission. The expected output would be exactly:
tid mid from to
0 0 0 A C
1 1 0 C A
2 2 1 A B
3 3 1 B C
4 4 1 C A
5 5 2 B B
6 6 2 B B
7 7 3 C D
8 8 3 D E
9 9 3 E F
10 10 3 F C
I have found a way to achieve my goal. Please find the MCVE below:
import pandas as pd
# Input:
df = pd.DataFrame(
[["A", ["C"]],
["A", ["B", "C"]],
["B", ["B"]],
["C", ["D", "E", "F"]]],
columns = ["from", "to"]
).reset_index().rename(columns={'index': 'mid'})
# Create chain:
df['chain'] = df.apply(lambda x: [x['from']] + x['to'] + [x['from']], axis=1)
# Explode chain:
df = df.explode('chain')
# Shift to create travel:
df['end'] = df.groupby("mid")["chain"].shift(-1)
# Remove extra row, clean, reindex and rename:
df = df.dropna(subset=['end']).reset_index(drop=True).reset_index().rename(columns={'index': 'tid'})
df = df.drop(['from', 'to'], axis=1).rename(columns={'chain': 'from', 'end': 'to'})
My question is: Is there a better/easier way to do this with Pandas? By better I mean not necessarily more performant (though it can be, of course), but more readable and intuitive.
Your operation is basically explode and concat:
# turn series of lists into a single series
tmp = df[['mid','to']].explode('to')
# new `from` is concatenation of `from` and the list
# (a stable sort keeps the original `from` ahead of the exploded stops)
df1 = pd.concat((df[['mid','from']],
                 tmp.rename(columns={'to':'from'})
                )
               ).sort_index(kind='mergesort')
# new `to` is concatenation of the list and `from`
df2 = pd.concat((tmp,
                 df[['mid','from']].rename(columns={'from':'to'})
                )
               ).sort_index(kind='mergesort')
df1['to'] = df2['to']
Output:
mid from to
0 0 A C
0 0 C A
1 1 A B
1 1 B C
1 1 C A
2 2 B B
2 2 B B
3 3 C D
3 3 D E
3 3 E F
3 3 F C
If you don't mind re-constructing the entire DataFrame then you can clean it up a bit with np.roll to get the pairs of destinations and then assign the value of mid based on the number of trips (length of each sublist in l)
import pandas as pd
import numpy as np
from itertools import chain
l = [[fr]+to for fr,to in zip(df['from'], df['to'])]
df1 = (pd.DataFrame(data=chain.from_iterable([zip(sl, np.roll(sl, -1)) for sl in l]),
columns=['from', 'to'])
.assign(mid=np.repeat(df['mid'].to_numpy(), [*map(len, l)])))
from to mid
0 A C 0
1 C A 0
2 A B 1
3 B C 1
4 C A 1
5 B B 2
6 B B 2
7 C D 3
8 D E 3
9 E F 3
10 F C 3
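For readability, the same edge list can also be built with a plain loop that closes each chain and pairs consecutive stops with zip. This is only a sketch: it rebuilds the frame from scratch, and the names rows, chain and travels are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({
    "mid": [0, 1, 2, 3],
    "from": ["A", "A", "B", "C"],
    "to": [["C"], ["B", "C"], ["B"], ["D", "E", "F"]],
})

rows = []
for mid, start, stops in zip(df["mid"], df["from"], df["to"]):
    # Full round trip: origin, every destination, back to the origin
    chain = [start] + list(stops) + [start]
    # Consecutive pairs of the chain are the individual travels
    rows.extend((mid, a, b) for a, b in zip(chain, chain[1:]))

travels = (pd.DataFrame(rows, columns=["mid", "from", "to"])
             .rename_axis("tid")   # name the index so it becomes the tid column
             .reset_index())
print(travels)
```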

iterating over a list of columns in pandas dataframe

I have a dataframe like below. I want to update the value of column C,D, E based on column A and B.
If column A < B, then C, D, E = A, else B. I tried the code below, but I'm getting this error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
import pandas as pd
import math
import sys
import re
data=[[0,1,0,0, 0],
[1,2,0,0,0],
[2,0,0,0,0],
[2,4,0,0,0],
[1,8,0,0,0],
[3,2, 0,0,0]]
df = pd.DataFrame(data,columns=['A','B','C', 'D','E'])
df
Out[59]:
A B C D E
0 0 1 0 0 0
1 1 2 0 0 0
2 2 0 0 0 0
3 2 4 0 0 0
4 1 8 0 0 0
5 3 2 0 0 0
list_1 = ['C', 'D', 'E']
for i in df[list_1]:
    if df['A'] < df['B']:
        df[i] = df['A']
    else:
        df['i'] = df['B']
I'm expecting below output:
df
Out[59]:
A B C D E
0 0 1 0 0 0
1 1 2 1 1 1
2 2 0 0 0 0
3 2 4 2 2 2
4 1 8 1 1 1
5 3 2 2 2 2
np.where
Return elements are chosen from A or B depending on condition.
df.assign
Assign new columns to a DataFrame.
Returns a new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten.
import numpy as np
nums = np.where(df.A < df.B, df.A, df.B)
df = df.assign(C=nums, D=nums, E=nums)
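Use DataFrame.mask (note that rows where B is not greater than A keep their existing values, so row 5 stays 0 here instead of becoming 2):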
df.loc[:,df.columns != 'B']=df.loc[:,df.columns != 'B'].mask(df['B']>df['A'],df['A'],axis=0)
print(df)
A B C D E
0 0 1 0 0 0
1 1 2 1 1 1
2 2 0 0 0 0
3 2 4 2 2 2
4 1 8 1 1 1
5 3 2 0 0 0
Personally, I always use .apply to modify columns based on other columns:
list_1 = ['C', 'D', 'E']
for i in list_1:
    df[i]=df.apply(lambda x: x.A if x.A<x.B else x.B, axis=1)
I don't know what you are trying to achieve here, because the condition df['A'] < df['B'] gives the same result on every iteration of your loop. Just for the sake of understanding:
When you do if df['A'] < df['B']:
the if statement expects a single Boolean, but df['A'] < df['B'] gives a Series of Boolean values. So the error message tells you to use something like
if (df['A'] < df['B']).all():
OR
if (df['A'] < df['B']).any():
What I would do is I would only create a DataFrame with columns 'A' and 'B', and then create column 'C' in the following way:
df['C'] = df.min(axis=1)
Columns 'D' and 'E' seem to be redundant.
If you have to start with all the columns and need to have all of them as output then you can do the following:
df['C'] = df[['A', 'B']].min(axis=1)
df['D'] = df['C']
df['E'] = df['C']
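To make the explanation above concrete, here is a small sketch that reproduces the error and then applies the condition row by row with Series.where. Only columns A and B from the question are used; the names cond, message and smaller are made up for the example:

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 2, 1, 3], 'B': [1, 2, 0, 4, 8, 2]})

# Element-wise comparison: one Boolean per row, not a single True/False
cond = df['A'] < df['B']

try:
    if cond:  # asks for the truth value of a whole Series
        pass
except ValueError as exc:
    message = str(exc)
    print(message)

# Keep A where the condition holds, otherwise take B
smaller = df['A'].where(cond, df['B'])
for col in ['C', 'D', 'E']:
    df[col] = smaller
print(df)
```

The result is the same as taking the row-wise minimum of A and B.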
You can use the function where in numpy:
df.loc[:,'C':'E'] = np.where(df['A'] < df['B'], df['A'], df['B']).reshape(-1, 1)

Placing n rows of a pandas dataframe into their own dataframe

I have a large dataframe with many rows and columns.
An example of the structure is:
a = np.random.rand(6,3)
df = pd.DataFrame(a)
I'd like to split the DataFrame into separate data frames, each consisting of 3 rows.
You can use groupby:
g = df.groupby(np.arange(len(df)) // 3)
for n, grp in g:
    print(grp)
0 1 2
0 0.278735 0.609862 0.085823
1 0.836997 0.739635 0.866059
2 0.691271 0.377185 0.225146
0 1 2
3 0.435280 0.700900 0.700946
4 0.796487 0.018688 0.700566
5 0.900749 0.764869 0.253200
To get it into a handy dictionary:
mydict = {k: v for k, v in g}
You can use the numpy.split() method (the number of rows must divide evenly into the requested number of chunks):
In [8]: df = pd.DataFrame(np.random.rand(9, 3))
In [9]: df
Out[9]:
0 1 2
0 0.899366 0.991035 0.775607
1 0.487495 0.250279 0.975094
2 0.819031 0.568612 0.903836
3 0.178399 0.555627 0.776856
4 0.498039 0.733224 0.151091
5 0.997894 0.018736 0.999259
6 0.345804 0.780016 0.363990
7 0.794417 0.518919 0.410270
8 0.649792 0.560184 0.054238
In [10]: for x in np.split(df, len(df)//3):
    ...:     print(x)
...:
0 1 2
0 0.899366 0.991035 0.775607
1 0.487495 0.250279 0.975094
2 0.819031 0.568612 0.903836
0 1 2
3 0.178399 0.555627 0.776856
4 0.498039 0.733224 0.151091
5 0.997894 0.018736 0.999259
0 1 2
6 0.345804 0.780016 0.363990
7 0.794417 0.518919 0.410270
8 0.649792 0.560184 0.054238
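When the number of rows is not a multiple of the chunk size, plain iloc slicing is a simple alternative that leaves the remainder in a shorter last chunk. A minimal sketch; the 7-row frame and the names n and chunks are assumptions for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(21).reshape(7, 3))
n = 3  # rows per chunk

# Slice positions 0..n, n..2n, ...; the last slice may be shorter
chunks = [df.iloc[i:i + n] for i in range(0, len(df), n)]

for chunk in chunks:
    print(chunk)
```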
