Pandas Dynamic Stack - python-3.x

Given the following data frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'foo': ['a', 'b', 'c', 'd'],
                   'bar': ['e', 'f', 'g', 'h'],
                   0: ['i', 'j', 'k', np.nan],
                   1: ['m', np.nan, 'o', 'p']})
df = df[['foo', 'bar', 0, 1]]
df
  foo bar    0    1
0   a   e    i    m
1   b   f    j  NaN
2   c   g    k    o
3   d   h  NaN    p
...which resulted from a previous procedure that produced columns 0 and 1 (and which, depending on the data, may have produced more or fewer such columns).
I want to somehow stack (if that's the correct term) the data so that each value of 0 and 1 (ignoring NaNs) produces a new row like this:
foo bar
0 a e
0 a i
0 a m
1 b f
1 b j
2 c g
2 c k
2 c o
3 d h
3 d p
You probably noticed that the common field is foo.
My actual data set will likely have more common fields.
Also, I'm not sure how important it is that the index values repeat in the end result across values of foo. As long as the data is correct, that's my main concern.
Update:
What if I have 2+ common fields like this:
import numpy as np
import pandas as pd
df = pd.DataFrame({'foo': ['a', 'a', 'b', 'b'],
                   'foo2': ['a2', 'b2', 'c2', 'd2'],
                   'bar': ['e', 'f', 'g', 'h'],
                   0: ['i', 'j', 'k', np.nan],
                   1: ['m', np.nan, 'o', 'p']})
df = df[['foo', 'foo2', 'bar', 0, 1]]
df
  foo foo2 bar    0    1
0   a   a2   e    i    m
1   a   b2   f    j  NaN
2   b   c2   g    k    o
3   b   d2   h  NaN    p

You can use set_index, stack and reset_index:
print(df.set_index('foo').stack().reset_index(level=1, drop=True).reset_index(name='bar'))
foo bar
0 a e
1 a i
2 a m
3 b f
4 b j
5 c g
6 c k
7 c o
8 d h
9 d p
If you need the original index preserved, use melt:
print(pd.melt(df.reset_index(),
              id_vars=['index', 'foo'],
              value_vars=['bar', 0, 1],
              value_name='bar')
        .sort_values('index')
        .set_index('index', drop=True)
        .dropna()
        .drop('variable', axis=1)
        .rename_axis(None))
foo bar
0 a e
0 a i
0 a m
1 b f
1 b j
2 c g
2 c k
2 c o
3 d h
3 d p
Or use the lesser-known lreshape:
print(pd.lreshape(df.reset_index(), {'bar': ['bar', 0, 1]})
        .sort_values('index')
        .set_index('index', drop=True)
        .rename_axis(None))
foo bar
0 a e
0 a i
0 a m
1 b f
1 b j
2 c g
2 c k
2 c o
3 d h
3 d p
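For the updated frame with two (or more) common fields, the same melt approach should carry over: the sketch below simply lists every common field in id_vars. A temporary value_name is used because newer pandas versions may reject a value_name that matches an existing column:
# a sketch for the updated frame: keep every common field in id_vars
out = (pd.melt(df.reset_index(),
               id_vars=['index', 'foo', 'foo2'],
               value_vars=['bar', 0, 1],
               value_name='val')
         .sort_values('index')
         .set_index('index', drop=True)
         .dropna()
         .drop('variable', axis=1)
         .rename(columns={'val': 'bar'})
         .rename_axis(None))
print(out)
The lreshape version needs no change at all, since every column that is not part of the reshaped group is kept as an identifier automatically.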

Related

Complete DataFrame with missing steps python

I have a pandas data frame which is missing some rows. It actually has the following format:
id step var1 var2
1 1 a h
2 1 b i
3 1 c g
1 3 d k
2 2 e l
5 2 f m
6 1 g n
...
An observation should pass through every step. For example, id==1 has steps 1 and 3 but misses step 2 (which I don't want). id==2 has steps 1 and 2 and there is no step 3, which is fine because there is no gap. id==5 has step 2 but doesn't have step 1, so I am missing a line there.
I need to add some rows to complete the steps; I would keep var1, var2 and id the same.
I would like to obtain this df :
id step var1 var2
1 1 a h
2 1 b i
3 1 c g
1 3 d k
2 2 e l
5 2 f m
6 1 g n
1 2 a h
5 1 f m
...
It would be awesome if anyone could help with a smooth solution.
You can try pivoting the table, then ffill and bfill:
(df.pivot(index='id', columns='step')
   .groupby(level=0, axis=1)
   .apply(lambda x: x.ffill().bfill())
   .stack()
   .reset_index()
)
Output:
id step var1 var2
0 1 1 a h
1 1 2 e l
2 1 3 d k
3 2 1 b i
4 2 2 e l
5 2 3 d k
6 3 1 c g
7 3 2 e l
8 3 3 d k
9 5 1 c g
10 5 2 f m
11 5 3 d k
12 6 1 g n
13 6 2 f m
14 6 3 d k
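For reference, the sample frame used above can be rebuilt from the rows listed in the question (a sketch; plain string and integer dtypes are assumed):
import pandas as pd
# reconstruction of the question's sample frame, based only on the rows shown
df = pd.DataFrame({'id':   [1, 2, 3, 1, 2, 5, 6],
                   'step': [1, 1, 1, 3, 2, 2, 1],
                   'var1': list('abcdefg'),
                   'var2': ['h', 'i', 'g', 'k', 'l', 'm', 'n']})
Note that groupby(..., axis=1), used in the answer's pipeline, appears to be deprecated in recent pandas releases, so the pipeline is best run on a pandas version contemporary with the answer.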

Do I use a loop, df.melt or df.explode to achieve a flattened dataframe?

Can anyone help with some code that will achieve the following transformation? I have tried variations of df.melt, df.explode, and also a looping statement, but I only get error messages. I think it might need nesting, but I don't have the experience to do so.
index A B C D
0 X d 4 2
1 Y b 5 2
Column D represents the frequency of column C, i.e. how many times the row should be repeated.
desired output is:
index A B C
0 X d 4
1 X d 4
2 Y b 5
3 Y b 5
If you want to repeat rows, why not use index.repeat?
import pandas as pd
# recreate the sample dataframe from the question
df = pd.DataFrame({"A": ["X", "Y"], "B": ["d", "b"], "C": [4, 5], "D": [2, 2]},
                  columns=list("ABCD"))
# repeat each row D times, drop the helper column and rebuild the index
df = df.reindex(df.index.repeat(df["D"])).drop(columns="D").reset_index(drop=True)
print(df)
Sample output
   A  B  C
0  X  d  4
1  X  d  4
2  Y  b  5
3  Y  b  5
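The same repetition can also be written with loc, since selecting repeated index labels returns the rows repeated; a roughly equivalent sketch using the same sample frame:
import pandas as pd
df = pd.DataFrame({"A": ["X", "Y"], "B": ["d", "b"], "C": [4, 5], "D": [2, 2]},
                  columns=list("ABCD"))
# loc with repeated index labels yields each row D times
out = df.loc[df.index.repeat(df["D"])].drop(columns="D").reset_index(drop=True)
print(out)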

Easily generate edge list from specific structure using pandas

This is a question about how to do things properly with pandas (I use version 1.0).
Let's say I have a DataFrame of missions, each of which contains an origin and one or more destinations:
mid from to
0 0 A [C]
1 1 A [B, C]
2 2 B [B]
3 3 C [D, E, F]
E.g., for mission mid=1, people will travel from A to B, then from B to C, and finally from C back to A. Note that I have no control over the data model of the input DataFrame.
I would like to compute metrics on each travel of the mission. The expected output would be exactly:
tid mid from to
0 0 0 A C
1 1 0 C A
2 2 1 A B
3 3 1 B C
4 4 1 C A
5 5 2 B B
6 6 2 B B
7 7 3 C D
8 8 3 D E
9 9 3 E F
10 10 3 F C
I have found a way to achieve my goal. Please find below the MCVE:
import pandas as pd
# Input:
df = pd.DataFrame(
    [["A", ["C"]],
     ["A", ["B", "C"]],
     ["B", ["B"]],
     ["C", ["D", "E", "F"]]],
    columns=["from", "to"]
).reset_index().rename(columns={'index': 'mid'})
# Create chain:
df['chain'] = df.apply(lambda x: list(x['from']) + x['to'] + list(x['from']), axis=1)
# Explode chain:
df = df.explode('chain')
# Shift to create travel:
df['end'] = df.groupby("mid")["chain"].shift(-1)
# Remove extra row, clean, reindex and rename:
df = df.dropna(subset=['end']).reset_index(drop=True).reset_index().rename(columns={'index': 'tid'})
df = df.drop(['from', 'to'], axis=1).rename(columns={'chain': 'from', 'end': 'to'})
My question is: is there a better/easier way to do this with pandas? By better I mean not necessarily more performant (although it can be, of course), but more readable and intuitive.
Your operation is basically explode and concat:
# turn the series of lists into a single series
tmp = df[['mid', 'to']].explode('to')
# new `from` is the concatenation of `from` and the list
df1 = pd.concat((df[['mid', 'from']],
                 tmp.rename(columns={'to': 'from'})
                 )
                ).sort_index()
# new `to` is the concatenation of the list and `to`
df2 = pd.concat((tmp,
                 df[['mid', 'from']].rename(columns={'from': 'to'})
                 )
                ).sort_index()
df1['to'] = df2['to']
Output:
mid from to
0 0 A C
0 0 C A
1 1 A B
1 1 B C
1 1 C A
2 2 B B
2 2 B B
3 3 C D
3 3 D E
3 3 E F
3 3 F C
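If you also want the tid column and the clean RangeIndex from the expected output, the finishing step the question itself already uses should work here as well (a sketch):
# hypothetical finishing step, mirroring the question's own reset/rename pattern
df1 = (df1.reset_index(drop=True)
          .reset_index()
          .rename(columns={'index': 'tid'}))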
If you don't mind reconstructing the entire DataFrame, you can clean it up a bit with np.roll to get the pairs of destinations, and then assign the value of mid based on the number of trips (the length of each sublist in l).
import pandas as pd
import numpy as np
from itertools import chain
l = [[fr]+to for fr,to in zip(df['from'], df['to'])]
df1 = (pd.DataFrame(data=chain.from_iterable([zip(sl, np.roll(sl, -1)) for sl in l]),
                    columns=['from', 'to'])
         .assign(mid=np.repeat(df['mid'].to_numpy(), [*map(len, l)])))
from to mid
0 A C 0
1 C A 0
2 A B 1
3 B C 1
4 C A 1
5 B B 2
6 B B 2
7 C D 3
8 D E 3
9 E F 3
10 F C 3

Column label of max in pandas

I am trying to extract the maximum value in each row, along with the label of the contributing column, from a pandas dataframe. For example,
A B C D
index
x 0 1 2 3
y 3 2 1 0
I expect the following output,
A B C D Maxv Con
index
x 0 1 2 3 3 D
y 3 2 1 0 3 A
I tried the following,
df['Maxv'] = df.apply(max,axis=1)
df['Con'] = df.idxmax(axis='rows')
It returned the max column correctly but NaN for the Con column. What is the error here?
Thanks in Advance.
AP
You need axis='columns' or axis=1 in DataFrame.idxmax:
df['Con'] = df.idxmax(axis='columns')
print (df)
A B C D Maxv Con
index
x 0 1 2 3 3 D
y 3 2 1 0 3 A
Or:
df['Con'] = df.idxmax(axis=1)
print (df)
A B C D Maxv Con
index
x 0 1 2 3 3 D
y 3 2 1 0 3 A
You get NaNs because the data are not aligned to the index:
print (df.idxmax(axis='rows'))
A y
B y
C x
D x
dtype: object
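As a side note, the Maxv column does not need apply either; DataFrame.max with axis=1 should do the same thing:
# row-wise maximum and its column label, computed from the original columns only
df['Maxv'] = df[['A', 'B', 'C', 'D']].max(axis=1)
df['Con'] = df[['A', 'B', 'C', 'D']].idxmax(axis=1)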

Pandas Pivot Table Slice Off Level 0 of Index

Given the following data frame and pivot table:
df = pd.DataFrame({'A': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'],
                   'B': ['x', 'y', 'z', 'x', 'y', 'z', 'x', 'y', 'z'],
                   'C': ['a', 'b', 'a', 'b', 'a', 'b', 'a', 'b', 'a'],
                   'D': [7, 5, 3, 4, 1, 6, 5, 3, 1]})
table = pd.pivot_table(df, index=['A', 'B', 'C'], aggfunc='sum')
table
D
A B C
a x a 7
b 4
y a 1
b 5
z a 3
b x a 5
y b 3
z a 1
b 6
I want the pivot table exactly as it is, minus index level 0, like this:
D
B C
x a 7
b 4
y a 1
b 5
z a 3
x a 5
y b 3
z a 1
b 6
Thanks in advance!
You can selectively drop an index level using reset_index with param drop=True:
In [95]:
table.reset_index('A', drop=True)
Out[95]:
D
B C
x a 7
b 4
y a 1
b 5
z a 3
x a 5
y b 3
z a 1
b 6
You can use droplevel on index:
table.index = table.index.droplevel(0)
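In more recent pandas versions (0.24 and later), DataFrame.droplevel should let you do the same thing in one expression, without touching the index attribute directly:
# drop the outermost index level directly on the frame (by position or by name)
table = table.droplevel(0)    # equivalently: table.droplevel('A')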
