Split based on comma and create a new data frame in Python (python-3.x)

Let's say I have the following data frame.
df
Nodes Weight
A,B 10
A,C,F 8
B,F,D 6
B,E 4
I would like to split on the comma while keeping the weight. For example, for the Nodes value (A,C,F): A has a connection with C, and C has a connection with F. So I would like to see A>>C and C>>F (no need for A>>F), and their weight should be 8, as shown below.
The final data frame I am looking for looks like this:
Node_1 Node_2 Weight
A B 10
A C 8
C F 8
B F 6
F D 6
B E 4
The goal of creating this data frame is creating a network graph out of it.
There are similar solutions, but I couldn't get the result I want.
I tried the following (incomplete) attempt:
df = (df['Nodes'].str.split(',').groupby(df['Weight']))
Can anyone help with this?

Here is one way to do this:
# From https://docs.python.org/3/library/itertools.html#itertools-recipes
from itertools import tee

def pairwise(iterable):
    "s -> (s0, s1), (s1, s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)

# split each Nodes string and pair up consecutive nodes
df['Node_pairs'] = df['Nodes'].str.split(',').apply(lambda x: list(pairwise(x)))
# one row per consecutive pair
df = df.explode('Node_pairs')
# unpack each pair into two columns
df['Node1'] = df['Node_pairs'].str[0]
df['Node2'] = df['Node_pairs'].str[1]
df
Output:
Nodes Weight Node_pairs Node1 Node2
0 A,B 10 (A, B) A B
1 A,C,F 8 (A, C) A C
1 A,C,F 8 (C, F) C F
2 B,F,D 6 (B, F) B F
2 B,F,D 6 (F, D) F D
3 B,E 4 (B, E) B E
Details:
Use the pairwise recipe from the itertools documentation to create 'Node_pairs'.
Explode the dataframe on the list of 'Node_pairs'.
Assign 'Node1' and 'Node2' using the .str[] indexing shortcut.
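To arrive at the exact Node_1/Node_2/Weight layout asked for, one extra cleanup step would do it. A minimal sketch, assuming the dataframe produced above:
df = (df.drop(columns=['Nodes', 'Node_pairs'])
        .rename(columns={'Node1': 'Node_1', 'Node2': 'Node_2'})
        [['Node_1', 'Node_2', 'Weight']])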

The logic is the same as in the solution provided by Scott.
def grouper(input_list, n=2):
    # yield consecutive windows of length n
    for i in range(len(input_list) - (n - 1)):
        yield input_list[i:i + n]

(df.set_index('Weight')['Nodes']
   .str.split(',')
   .map(grouper)
   .map(list)
   .explode()
   .apply(pd.Series).add_prefix('Node_')
   .reset_index())
Weight Node_0 Node_1
0 10 A B
1 8 A C
2 8 C F
3 6 B F
4 6 F D
5 4 B E
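Since the stated goal is a network graph, either result can be fed straight to networkx. A minimal sketch, assuming a dataframe named edges with columns Node_1, Node_2 and Weight (adjust the names to whichever variant you used):
import networkx as nx
# edges is a hypothetical name for the edge-list dataframe built above
G = nx.from_pandas_edgelist(edges, source='Node_1', target='Node_2', edge_attr='Weight')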

Related

Computing relationship gaps using pandas

Suppose I have a company whose reporting relationships are given by:
ref_pd = pd.DataFrame({'employee':['a','b','c','d','e','f'],'supervisor':['c','c','d','f','f',None]})
e.g.
a and b report to c, who reports to d, who reports to f
e also reports to f
f is the head and reports to no one
Is there a (clever) way to use ref_pd to compute the last column of the following? (I know a priori that there are at most 4 "layers" to the company: f is layer 1, d and e are layer 2, c is layer 3, and a and b are layer 4.)
gap_pd = pd.DataFrame({'supervisor':['d','d','d','d','d','d'],'employee':['a','b','c','d','e','f'],'gap':[2,2,1,0,None,None]})
The logic of this last column is as follows, with supervisor = d for each row:
if the employee is d, then that is also the supervisor, so there is no gap: gap = 0
c reports directly to d, so gap = 1
a and b report to c, who reports to d, so gap = 2
neither e nor f (eventually) reports to d, so the gap is Null
The only thing I can think of is feeding ref_pd to a directed graph then computing path lengths but I struggle for a graph-less (and hopefully pure pandas) solution.
(Graph of the hierarchy: image omitted from the original post.)
It is possible using iterative merges, but it is quite impractical:
target = 'd'

# set up containers
counts = pd.Series(index=ref_pd['employee'], dtype=float)
valid = pd.Series(False, index=ref_pd['employee'])

# set up initial df2 for iterative merging
df2 = ref_pd.rename(columns={'employee': 'origin', 'supervisor': 'employee'})

# identify origin == target
valid[target] = True
counts[target] = 0

while df2['employee'].notna().any():  # or: for _ in range(4):
    # increment count for not-yet-valid items
    idx = valid[~valid].index
    counts[idx] = counts[idx].add(1, fill_value=0)
    # if we reached supervisor == target, set valid
    m2 = df2['employee'].eq(target)
    valid[df2.loc[m2, 'origin']] = True
    # update df2: walk one level up the hierarchy
    df2 = (df2.merge(ref_pd, how='left')
              .drop(columns='employee')
              .rename(columns={'supervisor': 'employee'}))

# set invalid items to NA
counts[~valid] = pd.NA

# craft output
gap_pd = pd.DataFrame({'supervisor': target,
                       'employee': ref_pd['employee'],
                       'gap': ref_pd['employee'].map(counts)})
output:
supervisor employee gap
0 d a 2.0
1 d b 2.0
2 d c 1.0
3 d d 0.0
4 d e NaN
5 d f NaN
In comparison, a networkx solution is much more explicit:
import networkx as nx

G = nx.from_pandas_edgelist(ref_pd.dropna(),
                            source='supervisor', target='employee',
                            create_using=nx.DiGraph)

target = 'd'
gap_pd = ref_pd.assign(supervisor=target)
gap_pd['gap'] = [nx.shortest_path_length(G, target, n)
                 if target in nx.ancestors(G, n) or n == target else pd.NA
                 for n in ref_pd['employee']]
output:
employee supervisor gap
0 a d 2
1 b d 2
2 c d 1
3 d d 0
4 e d <NA>
5 f d <NA>
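If gaps to several different supervisors are needed, the graph can be reused. A hedged sketch wrapping the list comprehension above in a helper (gaps_to is a hypothetical name, not part of the original answer):
def gaps_to(target, ref_pd, G):
    # hypothetical convenience wrapper around the networkx solution above
    return ref_pd.assign(
        supervisor=target,
        gap=[nx.shortest_path_length(G, target, n)
             if target in nx.ancestors(G, n) or n == target else pd.NA
             for n in ref_pd['employee']])

gap_pd = gaps_to('d', ref_pd, G)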

Do I use a loop, df.melt or df.explode to achieve a flattened dataframe?

Can anyone help with some code that will achieve the following transformation? I have tried variations of df.melt, df.explode, and also a looping statement, but I only get errors. I think it might need nesting, but I don't have the experience to do it.
index A B C D
0 X d 4 2
1 Y b 5 2
Where column D represents frequency of column C.
desired output is:
index A B C
0 X d 4
1 X d 4
2 Y b 5
3 Y b 5
If you want to repeat rows, why not use index.repeat?
import pandas as pd

# recreate the sample dataframe from the question
df = pd.DataFrame({"A": ["X", "Y"], "B": ["d", "b"], "C": [4, 5], "D": [2, 2]},
                  columns=list("ABCD"))
df = df.reindex(df.index.repeat(df["D"])).drop(columns="D").reset_index(drop=True)
print(df)
Sample output
A B C
0 X d 4
1 X d 4
2 Y b 5
3 Y b 5
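An equivalent spelling uses .loc with the repeated index, which some find more readable (same behavior, just a matter of taste):
df = (df.loc[df.index.repeat(df["D"])]
        .drop(columns="D")
        .reset_index(drop=True))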

Easily generate edge list from specific structure using pandas

This is a question about how to do things properly with pandas (I use version 1.0).
Let's say I have a DataFrame of missions, each of which contains an origin and one or more destinations:
mid from to
0 0 A [C]
1 1 A [B, C]
2 2 B [B]
3 3 C [D, E, F]
E.g., for mission mid=1, people travel from A to B, then from B to C, and finally from C back to A. Note that I have no control over the data model of the input DataFrame.
I would like to compute metrics on each travel of the mission. The expected output would be exactly:
tid mid from to
0 0 0 A C
1 1 0 C A
2 2 1 A B
3 3 1 B C
4 4 1 C A
5 5 2 B B
6 6 2 B B
7 7 3 C D
8 8 3 D E
9 9 3 E F
10 10 3 F C
I have found a way to achieve my goal. Please find below the MCVE:
import pandas as pd

# Input:
df = pd.DataFrame(
    [["A", ["C"]],
     ["A", ["B", "C"]],
     ["B", ["B"]],
     ["C", ["D", "E", "F"]]],
    columns=["from", "to"]
).reset_index().rename(columns={'index': 'mid'})

# Create chain (origin, destinations, then back to the origin):
df['chain'] = df.apply(lambda x: list(x['from']) + x['to'] + list(x['from']), axis=1)

# Explode chain:
df = df.explode('chain')

# Shift to create travels:
df['end'] = df.groupby("mid")["chain"].shift(-1)

# Remove extra rows, clean, reindex and rename:
df = df.dropna(subset=['end']).reset_index(drop=True).reset_index().rename(columns={'index': 'tid'})
df = df.drop(['from', 'to'], axis=1).rename(columns={'chain': 'from', 'end': 'to'})
My question is: is there a better/easier way to do this with pandas? By better I mean not necessarily more performant (though it can be, of course), but more readable and intuitive.
Your operation is basically explode and concat:
# turn the series of lists into a single series
tmp = df[['mid', 'to']].explode('to')

# the new `from` is the concatenation of `from` and the list
df1 = pd.concat((df[['mid', 'from']],
                 tmp.rename(columns={'to': 'from'}))
               ).sort_index()

# the new `to` is the concatenation of the list and `from` (the return leg)
df2 = pd.concat((tmp,
                 df[['mid', 'from']].rename(columns={'from': 'to'}))
               ).sort_index()

df1['to'] = df2['to']
Output:
mid from to
0 0 A C
0 0 C A
1 1 A B
1 1 B C
1 1 C A
2 2 B B
2 2 B B
3 3 C D
3 3 D E
3 3 E F
3 3 F C
If you don't mind reconstructing the entire DataFrame, you can clean it up a bit with np.roll to get the pairs of destinations, then assign the value of mid based on the number of trips (the length of each sublist in l):
import pandas as pd
import numpy as np
from itertools import chain

# prepend each origin to its destination list
l = [[fr] + to for fr, to in zip(df['from'], df['to'])]

df1 = (pd.DataFrame(data=chain.from_iterable([zip(sl, np.roll(sl, -1)) for sl in l]),
                    columns=['from', 'to'])
         .assign(mid=np.repeat(df['mid'].to_numpy(), [*map(len, l)])))
from to mid
0 A C 0
1 C A 0
2 A B 1
3 B C 1
4 C A 1
5 B B 2
6 B B 2
7 C D 3
8 D E 3
9 E F 3
10 F C 3
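Note that neither snippet emits the tid column from the expected output; a possible finishing touch (a sketch, assuming df1 from either answer) is:
df1 = df1.reset_index(drop=True).rename_axis('tid').reset_index()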

How to create pandas matrix from one column

I'm trying to create a matrix from one column into two columns; I think this is the right terminology. It's really a 2D matrix, I think? I haven't found a lot on this topic, which is why I am coming here.
This is what my starting dataframe looks like:
df:
[1]
A
B
C
This is what I am trying to end up with:
df2:
[1] [2]
A B
A C
B C
B A
C A
C B
You can try using permutations
from itertools import permutations

df = pd.DataFrame({1: ['A', 'B', 'C']})
df_out = pd.DataFrame.from_records(permutations(df[1], 2), columns=[1, 2])
print(df_out)
Output:
1 2
0 A B
1 A C
2 B A
3 B C
4 C A
5 C B
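As an aside: permutations yields both orderings of every pair (A,B and B,A), which is what the desired output shows; if each unordered pair were wanted only once, itertools.combinations would be the drop-in replacement:
from itertools import combinations
df_out = pd.DataFrame.from_records(combinations(df[1], 2), columns=[1, 2])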

Pandas Dynamic Stack

Given the following data frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': ['a', 'b', 'c', 'd'],
                   'bar': ['e', 'f', 'g', 'h'],
                   0: ['i', 'j', 'k', np.nan],
                   1: ['m', np.nan, 'o', 'p']})
df = df[['foo', 'bar', 0, 1]]
df
foo bar 0 1
0 a e i m
1 b f j NaN
2 c g k o
3 d h NaN p
...which resulted from a previous procedure that produced columns 0 and 1 (and may have produced more or fewer such columns, depending on the data).
I want to somehow stack (if that's the correct term) the data so that each value of 0 and 1 (ignoring NaNs) produces a new row like this:
foo bar
0 a e
0 a i
0 a m
1 b f
1 b j
2 c g
2 c k
2 c o
3 d h
3 d p
You probably noticed that the common field is foo.
There will likely be more common fields in my actual data set.
Also, I'm not sure how important it is that the index values repeat in the end result across values of foo. As long as the data is correct, that's my main concern.
Update:
What if I have 2+ common fields like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': ['a', 'a', 'b', 'b'],
                   'foo2': ['a2', 'b2', 'c2', 'd2'],
                   'bar': ['e', 'f', 'g', 'h'],
                   0: ['i', 'j', 'k', np.nan],
                   1: ['m', np.nan, 'o', 'p']})
df = df[['foo', 'foo2', 'bar', 0, 1]]
df
foo foo2 bar 0 1
0 a a2 e i m
1 a b2 f j NaN
2 b c2 g k o
3 b d2 h NaN p
You can use set_index, stack and reset_index:
print(df.set_index('foo').stack().reset_index(level=1, drop=True).reset_index(name='bar'))
foo bar
0 a e
1 a i
2 a m
3 b f
4 b j
5 c g
6 c k
7 c o
8 d h
9 d p
If you need the index, use melt:
print(pd.melt(df.reset_index(),
              id_vars=['index', 'foo'],
              value_vars=['bar', 0, 1],
              value_name='bar')
        .sort_values('index')
        .set_index('index', drop=True)
        .dropna()
        .drop('variable', axis=1)
        .rename_axis(None))
foo bar
0 a e
0 a i
0 a m
1 b f
1 b j
2 c g
2 c k
2 c o
3 d h
3 d p
Or use the not-so-well-known lreshape:
print(pd.lreshape(df.reset_index(), {'bar': ['bar', 0, 1]})
        .sort_values('index')
        .set_index('index', drop=True)
        .rename_axis(None))
foo bar
0 a e
0 a i
0 a m
1 b f
1 b j
2 c g
2 c k
2 c o
3 d h
3 d p
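For the updated example with two or more common fields, the same melt approach extends by listing all of them in id_vars. A minimal sketch, assuming the updated df with foo and foo2 (value_name is changed to 'val' here because newer pandas versions reject a value_name that clashes with an existing column like 'bar'):
print(pd.melt(df.reset_index(),
              id_vars=['index', 'foo', 'foo2'],
              value_vars=['bar', 0, 1],
              value_name='val')
        .sort_values('index')
        .set_index('index', drop=True)
        .dropna()
        .drop('variable', axis=1)
        .rename_axis(None))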
