How to split record from a DataFrame cross pairs in pandas? - python-3.x

I have a DataFrame like this:
a b c
0 A B 1
4 B A 1
1 C D -1
3 D C 3
2 E F 3
Rows '0' and '4' are a mirrored pair (the values of a and b are swapped). Which rows I remove depends on column c: if both rows of a mirrored pair have the same value in c, I remove one of them; otherwise I remove both. The expected result is:
a b c
0 A B 1
2 E F 3
I currently use a while loop, but my data set is huge. Any good ideas?

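If I understand correctly, you can use np.sort with duplicated. This keeps the first row of each mirrored pair, regardless of the value in c: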
df1=df.loc[~pd.DataFrame(np.sort(df[['a','b']].values,axis=1)).duplicated().values]
a b c
0 A B 1
1 C D -1
2 E F 3

You can use agg with frozenset, then duplicated and boolean slicing:
s = df[['a', 'b']].agg(frozenset, axis=1)
m = ~s.duplicated(keep=False) | (s.duplicated(keep=False) & df.c.duplicated())
df.loc[m]
Out[165]:
a b c
4 B A 1
2 E F 3

First, select the non-duplicated rows using np.sort and DataFrame.duplicated (see m1 in the details).
Then group the rows with DataFrame.groupby so that each mirrored pair of a, b forms one group (see g in the details).
Finally, use boolean indexing with GroupBy.transform to drop the pairs whose c values do not match:
df2=df.reset_index(drop=True)
m1=~pd.DataFrame(np.sort(df2[['a','b']])).duplicated()
g=m1.cumsum()
m2=~df2.groupby(g,sort=False)['c'].transform(lambda x: (x.nunique()==len(x))&(len(x)>1))
mask=m1&m2
print(mask)
0 True
1 False
2 False
3 False
4 True
dtype: bool
df_filtered=df2[mask]
print(df_filtered)
a b c
0 A B 1
4 E F 3
Details:
m1
0 True
1 False
2 True
3 False
4 True
dtype: bool
m2
0 True
1 True
2 False
3 False
4 True
dtype: bool
g
0 1
1 1
2 2
3 2
4 3
dtype: int64
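The rule stated in the question (keep one row of a mirrored pair when c matches, drop both rows otherwise) can also be written as two boolean masks. The following is a minimal sketch, not a benchmarked solution; it assumes the index labels are unique, and the names pairs, lo, hi, same_c and first_of_pair are made up for this illustration.

```python
import numpy as np
import pandas as pd

# Sample data from the question, including its index order.
df = pd.DataFrame(
    {"a": list("ABCDE"), "b": list("BADCF"), "c": [1, 1, -1, 3, 3]},
    index=[0, 4, 1, 3, 2],
)

# Sort each (a, b) pair so that A/B and B/A produce the same key.
pairs = np.sort(df[["a", "b"]].to_numpy(), axis=1)
lo = pd.Series(pairs[:, 0], index=df.index)
hi = pd.Series(pairs[:, 1], index=df.index)

# True where every row sharing the key has the same value in c.
same_c = df.groupby([lo, hi])["c"].transform("nunique").eq(1)
# True for the first row seen for each key.
first_of_pair = ~pd.DataFrame(pairs, index=df.index).duplicated()

result = df[same_c & first_of_pair]
print(result)
```

Rows without a mirrored partner form a group of size one, so they pass the same_c test and are kept.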

Related

Easily generate edge list from specific structure using pandas

This is a question about how to do things properly with pandas (I use version 1.0).
Let's say I have a DataFrame of missions, each containing an origin and one or more destinations:
mid from to
0 0 A [C]
1 1 A [B, C]
2 2 B [B]
3 3 C [D, E, F]
E.g., for the mission with mid=1, people travel from A to B, then from B to C, and finally from C back to A. Note that I have no control over the data model of the input DataFrame.
I would like to compute metrics on each leg of the mission. The expected output is exactly:
tid mid from to
0 0 0 A C
1 1 0 C A
2 2 1 A B
3 3 1 B C
4 4 1 C A
5 5 2 B B
6 6 2 B B
7 7 3 C D
8 8 3 D E
9 9 3 E F
10 10 3 F C
I have found a way to achieve my goal. Please find the MCVE below:
import pandas as pd
# Input:
df = pd.DataFrame(
[["A", ["C"]],
["A", ["B", "C"]],
["B", ["B"]],
["C", ["D", "E", "F"]]],
columns = ["from", "to"]
).reset_index().rename(columns={'index': 'mid'})
# Create chain:
# wrap the origin in a list; list('AB') would split a longer name into characters
df['chain'] = df.apply(lambda x: [x['from']] + x['to'] + [x['from']], axis=1)
# Explode chain:
df = df.explode('chain')
# Shift to create travel:
df['end'] = df.groupby("mid")["chain"].shift(-1)
# Remove extra row, clean, reindex and rename:
df = df.dropna(subset=['end']).reset_index(drop=True).reset_index().rename(columns={'index': 'tid'})
df = df.drop(['from', 'to'], axis=1).rename(columns={'chain': 'from', 'end': 'to'})
My question is: is there a better or easier way to do this with pandas? By better I mean not necessarily more performant (although it can be, of course), but more readable and intuitive.
Your operation is basically explode and concat:
# turn the series of lists into a single series
tmp = df[['mid','to']].explode('to')
# new `from` is the concatenation of `from` and the exploded list;
# mergesort is stable, so rows sharing an index label keep their order
df1 = pd.concat((df[['mid','from']],
                 tmp.rename(columns={'to':'from'})
                )
               ).sort_index(kind='mergesort')
# new `to` is the concatenation of the exploded list and `from`
df2 = pd.concat((tmp,
                 df[['mid','from']].rename(columns={'from':'to'})
                )
               ).sort_index(kind='mergesort')
df1['to'] = df2['to']
Output:
mid from to
0 0 A C
0 0 C A
1 1 A B
1 1 B C
1 1 C A
2 2 B B
2 2 B B
3 3 C D
3 3 D E
3 3 E F
3 3 F C
If you don't mind reconstructing the entire DataFrame, you can clean it up a bit: use np.roll to get the pairs of consecutive stops, then assign the value of mid based on the number of trips (the length of each sublist in l):
import pandas as pd
import numpy as np
from itertools import chain
l = [[fr]+to for fr,to in zip(df['from'], df['to'])]
df1 = (pd.DataFrame(data=chain.from_iterable([zip(sl, np.roll(sl, -1)) for sl in l]),
columns=['from', 'to'])
.assign(mid=np.repeat(df['mid'].to_numpy(), [*map(len, l)])))
from to mid
0 A C 0
1 C A 0
2 A B 1
3 B C 1
4 C A 1
5 B B 2
6 B B 2
7 C D 3
8 D E 3
9 E F 3
10 F C 3
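Since the question asks for readability rather than speed, a plain loop over the rows is another option. The sketch below assumes the same input as the MCVE; the names records, stops and edges are invented for this example, and a Python loop will likely be slower than the explode-based answers on large inputs.

```python
import pandas as pd

df = pd.DataFrame({
    "mid": [0, 1, 2, 3],
    "from": ["A", "A", "B", "C"],
    "to": [["C"], ["B", "C"], ["B"], ["D", "E", "F"]],
})

records = []
for mid, origin, dests in zip(df["mid"], df["from"], df["to"]):
    # Full round trip: origin, every destination, then back to the origin.
    stops = [origin, *dests, origin]
    # Each pair of consecutive stops is one leg of the mission.
    for start, end in zip(stops, stops[1:]):
        records.append({"mid": mid, "from": start, "to": end})

# The running row number becomes the travel id.
edges = pd.DataFrame(records).rename_axis("tid").reset_index()
print(edges)
```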

pandas transform one row into multiple rows

I have a DataFrame as below.
ID list
1 a, b, c
2 a, s
3 NA
5 f, j, l
I need to break each item in the list column (a string) into its own row, as below:
ID item
1 a
1 b
1 c
2 a
2 s
3 NA
5 f
5 j
5 l
Thanks.
Use str.split to separate your items then explode:
print (df.assign(list=df["list"].str.split(", ")).explode("list"))
ID list
0 1 a
0 1 b
0 1 c
1 2 a
1 2 s
2 3 NaN
3 5 f
3 5 j
3 5 l
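If the spacing around the commas is not consistent, splitting on ", " leaves some items unsplit or padded. A hedged variant is to split on the bare comma and strip whitespace afterwards. The sample values below are invented to show uneven spacing, and the column name item follows the expected output in the question.

```python
import numpy as np
import pandas as pd

# Hypothetical input with uneven spacing around the commas.
df = pd.DataFrame({
    "ID": [1, 2, 3, 5],
    "list": ["a, b, c", "a,s", np.nan, "f ,j, l"],
})

out = (
    df.assign(item=df["list"].str.split(","))  # NaN stays NaN
      .explode("item")                         # one row per list element
      .drop(columns="list")
      .reset_index(drop=True)
)
# Remove the whitespace left over from the split.
out["item"] = out["item"].str.strip()
print(out)
```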
A beginner's approach: another way of doing the same thing, using pd.DataFrame.stack.
# split on ', ' so the items carry no leading space; str(x) turns a missing value into 'nan'
df['list'] = df['list'].map(lambda x : str(x).split(', '))
dfOut = pd.DataFrame(df['list'].values.tolist())
dfOut.index = df['ID']
dfOut = dfOut.stack().reset_index()
del dfOut['level_1']
dfOut.rename(columns = {0 : 'list'}, inplace = True)
Output:
ID list
0 1 a
1 1 b
2 1 c
3 2 a
4 2 s
5 3 nan
6 5 f
7 5 j
8 5 l

Python Pandas: copy several columns at specific row from one dataframe to another with different names

I have dataframe1 with columns a,b,c,d with 5 rows.
I also have another dataframe2 with columns e,f,g,h
Let's say I want to copy columns a,b in row 3 from dataframe1 to columns f,g in row 3 at dataframe2.
I tried to use this code:
dataframe2.loc[3,['f','g']] = dataframe1.loc[3,['a','b']]
The result was NaN in dataframe2.
Any ideas how I can solve it?
One idea is to convert the values to a NumPy array, which avoids aligning the data by column names:
dataframe2.loc[3,['f','g']] = dataframe1.loc[3,['a','b']].values
Sample:
dataframe1 = pd.DataFrame({'a':list('abcdef'),
'b':[4,5,4,5,5,4],
'c':[7,8,9,4,2,3]})
print (dataframe1)
a b c
0 a 4 7
1 b 5 8
2 c 4 9
3 d 5 4
4 e 5 2
5 f 4 3
dataframe2 = pd.DataFrame({'f':list('HIJK'),
'g':[0,0,7,1],
'h':[0,1,0,1]})
print (dataframe2)
f g h
0 H 0 0
1 I 0 1
2 J 7 0
3 K 1 1
dataframe2.loc[3,['f','g']] = dataframe1.loc[3,['a','b']].values
print (dataframe2)
f g h
0 H 0 0
1 I 0 1
2 J 7 0
3 d 5 1
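When the same kind of copy is needed in several places, the .values trick can be wrapped in a small helper that takes an explicit source-to-target column mapping. This is only a sketch: copy_cells and its parameters are hypothetical names, not a pandas API, and it assumes the row label exists in both frames.

```python
import pandas as pd

def copy_cells(src, dst, row, mapping):
    """Copy src.loc[row, k] into dst.loc[row, v] for every k -> v in mapping."""
    src_cols = list(mapping)
    dst_cols = [mapping[c] for c in src_cols]
    # to_numpy() drops the column labels, so nothing is aligned by name.
    dst.loc[row, dst_cols] = src.loc[row, src_cols].to_numpy()
    return dst

df1 = pd.DataFrame({"a": list("abcdef"),
                    "b": [4, 5, 4, 5, 5, 4],
                    "c": [7, 8, 9, 4, 2, 3]})
df2 = pd.DataFrame({"f": list("HIJK"),
                    "g": [0, 0, 7, 1],
                    "h": [0, 1, 0, 1]})

copy_cells(df1, df2, 3, {"a": "f", "b": "g"})
print(df2)
```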

pandas how to convert a two-dimension dataframe to a one-dimension dataframe

Suppose I have a DataFrame with multiple columns:
a b c
1
2
3
How can I convert it to a single-column DataFrame like this?
1 a
2 a
3 a
1 b
2 b
3 b
1 c
2 c
3 c
Please note that the former is a DataFrame rather than a Panel.
Use melt:
df = df.reset_index().melt('index', var_name='col').set_index('index')[['col']]
print (df)
col
index
1 a
2 a
3 a
1 b
2 b
3 b
1 c
2 c
3 c
Or use numpy.repeat and numpy.tile with the DataFrame constructor:
a = np.repeat(df.columns, len(df))
b = np.tile(df.index, len(df.columns))
df = pd.DataFrame(a, index=b, columns=['col'])
print (df)
col
1 a
2 a
3 a
1 b
2 b
3 b
1 c
2 c
3 c
Another way is to use itertools.product:
import itertools
pd.DataFrame(list(itertools.product(df.index, df.columns.values))).set_index([0])
Output:
1
0
1 a
1 b
1 c
2 a
2 b
2 c
3 a
3 b
3 c
For the exact output, use sort_values:
print(pd.DataFrame(list(itertools.product(df.index, df.columns.values))).set_index([0]).sort_values(by=[1]))
1
0
1 a
2 a
3 a
1 b
2 b
3 b
1 c
2 c
3 c
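The same cross product of index and columns can also be built with pd.MultiIndex.from_product, which yields the column-major order directly, so no sorting is needed. This is a sketch under the assumption that the input looks like the empty three-by-three frame in the question; the level names col and index are chosen here for illustration.

```python
import pandas as pd

# Empty frame shaped like the one in the question.
df = pd.DataFrame(index=[1, 2, 3], columns=list("abc"))

# The first level varies slowest, so all rows for 'a' come first, then 'b', then 'c'.
mi = pd.MultiIndex.from_product([df.columns, df.index], names=["col", "index"])
out = mi.to_frame(index=False).set_index("index")[["col"]]
print(out)
```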

Deleting the first instance in a data frame

I was wondering what's the best way to delete the first instance of a particular index in a pandas DataFrame?
In the example below, I want to delete row 0,5 and 9
Use boolean indexing with Index.duplicated:
df = pd.DataFrame({'A':list('abcdef'),
'B':[4,5,4,5,5,4],
'C':[7,8,9,4,2,3],
'D':[1,3,5,7,1,0],
'E':[5,3,6,9,2,4],
'F':list('aaabbb')}, index=[0,0,1,2,2,2])
print (df)
A B C D E F
0 a 4 7 1 5 a
0 b 5 8 3 3 a
1 c 4 9 5 6 a
2 d 5 4 7 9 b
2 e 5 2 1 2 b
2 f 4 3 0 4 b
df = df[df.index.duplicated()]
print (df)
A B C D E F
0 b 5 8 3 3 a
2 e 5 2 1 2 b
2 f 4 3 0 4 b
Detail:
print (df.index.duplicated())
[False True False False True True]
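Here's a way to do it using groupby:
# positional row number, used to tell rows with the same index label apart
df['int_index'] = range(len(df))
# first row of each index label
firsts = df.groupby(df.index).first()
# every row that is not the first occurrence of its label
filt = df[~df['int_index'].isin(firsts['int_index'])]
# rows whose label occurs only once (this variant keeps them)
missing = df[~df.index.duplicated(keep=False)]
res = pd.concat([filt, missing]).sort_index(kind='mergesort').drop('int_index', axis=1)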
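The two answers above differ for labels that occur only once: Index.duplicated drops such rows, while the groupby variant keeps them. Both behaviours can be sketched with GroupBy.cumcount, as below. The frame is a cut-down version of the sample above, and the names position, group_size, later and later_or_single are invented for this example.

```python
import pandas as pd

df = pd.DataFrame({"A": list("abcdef")}, index=[0, 0, 1, 2, 2, 2])

# 0 for the first row of each index label, 1 for the second, and so on.
position = df.groupby(level=0).cumcount().to_numpy()
# Number of rows sharing each label.
group_size = df.groupby(level=0)["A"].transform("size").to_numpy()

# Drop the first occurrence of every label (labels seen once disappear).
later = df[position > 0]
# Same, but keep rows whose label occurs only once.
later_or_single = df[(position > 0) | (group_size == 1)]

print(later)
print(later_or_single)
```

The masks are converted to NumPy arrays so that the duplicated index labels play no part in the boolean selection.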
