pivot pandas dataframe while having multiple rows - python-3.x

I have a dataframe as shown below:
import pandas as pd

d = pd.DataFrame({'name': ['bil', 'bil', 'bil', 'bil', 'jim', 'jim', 'jim', 'jim'],
                  'col2': ['acct1', 'law', 'acct1', 'law', 'acct1', 'law', 'acct1', 'law'],
                  'col3': ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']})
    col2 col3 name
0  acct1    a  bil
1    law    b  bil
2  acct1    c  bil
3    law    d  bil
4  acct1    e  jim
5    law    f  jim
6  acct1    g  jim
7    law    h  jim
I have tried converting it into the format below, but I am not sure how to proceed after this:
d = (d.groupby(['name', 'col2'])['col3']
       .apply(lambda x: x.reset_index(drop=True))
       .unstack()
       .reset_index())

  name   col2  0  1
0  bil  acct1  a  c
1  bil    law  b  d
2  jim  acct1  e  g
3  jim    law  f  h
My expected format is as shown below:
  acct1 law name
0     a   b  bil
1     c   d  bil
2     e   f  jim
3     g   h  jim

Use GroupBy.cumcount to create a counter Series, build a MultiIndex with DataFrame.set_index, and then reshape by the second level (col2) with Series.unstack, passing 1 because Python counts from 0:
g = d.groupby(['name', 'col2'])['col3'].cumcount()
d = (d.set_index(['name', 'col2', g])['col3']
       .unstack(1)
       .reset_index(level=1, drop=True)
       .reset_index()
       .rename_axis(None, axis=1))
print(d)

  name acct1 law
0  bil     a   b
1  bil     c   d
2  jim     e   f
3  jim     g   h
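If the exact column order from the expected output matters, finish with d = d[['acct1', 'law', 'name']]. As an alternative route, below is a sketch (not from the original answer) of the same reshape via DataFrame.pivot_table on the cumcount key; orig simply rebuilds the sample data:
# rebuild the sample data under a fresh name so nothing above is clobbered
orig = pd.DataFrame({'name': ['bil'] * 4 + ['jim'] * 4,
                     'col2': ['acct1', 'law'] * 4,
                     'col3': list('abcdefgh')})
orig['row'] = orig.groupby(['name', 'col2']).cumcount()  # counter within each group
wide = (orig.pivot_table(index=['name', 'row'], columns='col2',
                         values='col3', aggfunc='first')
            .reset_index(level='name')
            .reset_index(drop=True)
            .rename_axis(None, axis=1))
print(wide[['acct1', 'law', 'name']])  # matches the expected column order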

Related

Merge values in columns into one column

I have a dataframe and I'd like to group the columns starting with 'Answer' into one column named 'Answers'. This column already exists, but from around row 3061 onward it has no more values, so I have to add them. Here is what I've tried so far:
columns_with_answer = [col for col in df if col.startswith('Answer')]
df['Answers']= df.columns_with_answer.tolist()
But I got:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-6-399b932d5740> in <module>()
1 columns_with_answer = [col for col in df if col.startswith('Answer')]
----> 2 df['Answers']= df.columns_with_answer.tolist()
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in __getattr__(self, name)
5272 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5273 return self[name]
-> 5274 return object.__getattribute__(self, name)
5275
5276 def __setattr__(self, name: str, value) -> None:
AttributeError: 'DataFrame' object has no attribute 'columns_with_answer'
So, with the sample data:
>>> import numpy as np
>>> df = pd.DataFrame({'A': list('abcdefg'),
...                    'B': [4, 5, 4, 5, 5, 4, np.nan],
...                    'Answer1': ['a', 'b', 'd', np.nan, 'd', np.nan, 'f'],
...                    'Answer2': ['a', 'b', 'd', 'e', 'h', 'd', 'k'],
...                    'Answer3': ['a', 'b', np.nan, 'd', 'r', np.nan, 'l'],
...                    'F': list('aaabbbc'),
...                    'Answers': ['truc', 'machin', 'bidule',
...                                np.nan, np.nan, np.nan, np.nan]})
>>> df.head()
   A    B Answer1 Answer2 Answer3  F Answers
0  a  4.0       a       a       a  a    truc
1  b  5.0       b       b       b  a  machin
2  c  4.0       d       d     NaN  a  bidule
3  d  5.0     NaN       e       d  b     NaN
4  e  5.0       d       h       r  b     NaN
I would like to start from row 3 to get:
   A    B Answer1 Answer2 Answer3  F      Answers
0  a  4.0       a       a       a  a         truc
1  b  5.0       b       b       b  a       machin
2  c  4.0       d       d     NaN  a       bidule
3  d  5.0     NaN       e       d  b  [nan, e, d]
4  e  5.0       d       h       r  b    [d, h, r]
columns_with_answer is a plain Python list, not an attribute of the DataFrame, so select the columns with [] and then convert to a NumPy array before converting to a list:
df['Answers'] = df[columns_with_answer].to_numpy().tolist()
Or use DataFrame.filter with the regex parameter, where ^ anchors the match at the start of the string:
df['Answers'] = df.filter(regex='^Answer').to_numpy().tolist()
EDIT: To apply the solution only to rows where the Answers column has missing values, use:
columns_with_answer = [col for col in df if col.startswith('Answer') and col != 'Answers']
mask = df['Answers'].isna()
print(mask)
0    False
1    False
2    False
3     True
4     True
5     True
6     True
Name: Answers, dtype: bool

L = df.loc[mask, columns_with_answer].to_numpy().tolist()
df.loc[mask, 'Answers'] = pd.Series(L, index=df.index[mask])
print(df)
   A    B Answer1 Answer2 Answer3  F        Answers
0  a  4.0       a       a       a  a           truc
1  b  5.0       b       b       b  a         machin
2  c  4.0       d       d     NaN  a         bidule
3  d  5.0     NaN       e       d  b    [nan, e, d]
4  e  5.0       d       h       r  b      [d, h, r]
5  f  4.0     NaN       d     NaN  b  [nan, d, nan]
6  g  NaN       f       k       l  c      [f, k, l]
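The pd.Series wrapper matters: it keeps each inner list as a single cell value aligned on the masked index, instead of letting pandas coerce the list of lists towards a 2-D array. A hedged alternative sketch under the same sample data, where DataFrame.apply returns an already-aligned Series of lists:
# apply(list, axis=1) yields a Series of row lists indexed like the mask,
# so the assignment aligns without an explicit pd.Series wrapper
df.loc[mask, 'Answers'] = df.loc[mask, columns_with_answer].apply(list, axis=1)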

Easily generate edge list from specific structure using pandas

This is a question about how to do things properly with pandas (I use version 1.0).
Let's say I have a DataFrame of missions, each of which contains an origin and one or more destinations:
   mid from         to
0    0    A        [C]
1    1    A     [B, C]
2    2    B        [B]
3    3    C  [D, E, F]
E.g. for mission mid=1, people travel from A to B, then from B to C, and finally from C back to A. Note that I have no control over the data model of the input DataFrame.
I would like to compute metrics on each travel of the mission. The expected output would be exactly:
    tid  mid from to
0     0    0    A  C
1     1    0    C  A
2     2    1    A  B
3     3    1    B  C
4     4    1    C  A
5     5    2    B  B
6     6    2    B  B
7     7    3    C  D
8     8    3    D  E
9     9    3    E  F
10   10    3    F  C
I have found a way to achieve my goal. Please find below the MCVE:
import pandas as pd

# Input:
df = pd.DataFrame(
    [["A", ["C"]],
     ["A", ["B", "C"]],
     ["B", ["B"]],
     ["C", ["D", "E", "F"]]],
    columns=["from", "to"]
).reset_index().rename(columns={'index': 'mid'})

# Create the closed chain (origin -> destinations -> back to origin):
df['chain'] = df.apply(lambda x: list(x['from']) + x['to'] + list(x['from']), axis=1)

# Explode the chain:
df = df.explode('chain')

# Shift to create each travel's endpoint:
df['end'] = df.groupby("mid")["chain"].shift(-1)

# Remove the extra row, clean, reindex and rename:
df = (df.dropna(subset=['end'])
        .reset_index(drop=True)
        .reset_index()
        .rename(columns={'index': 'tid'}))
df = df.drop(['from', 'to'], axis=1).rename(columns={'chain': 'from', 'end': 'to'})
My question is: is there a better/easier way to do this with pandas? By better I mean not necessarily more performant (though that is welcome, of course), but more readable and intuitive.
Your operation is basically explode and concat:
# turn the series of lists into a single series
tmp = df[['mid', 'to']].explode('to')

# the new `from` is the concatenation of `from` and the exploded list
df1 = pd.concat((df[['mid', 'from']],
                 tmp.rename(columns={'to': 'from'}))
               ).sort_index()

# the new `to` is the exploded list followed by the origin (the return leg)
df2 = pd.concat((tmp,
                 df[['mid', 'from']].rename(columns={'from': 'to'}))
               ).sort_index()

df1['to'] = df2['to']
Output:
   mid from to
0    0    A  C
0    0    C  A
1    1    A  B
1    1    B  C
1    1    C  A
2    2    B  B
2    2    B  B
3    3    C  D
3    3    D  E
3    3    E  F
3    3    F  C
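If the tid column from the expected output is also needed, a small follow-up (a sketch, not part of the original answer) renumbers the travels:
# turn the repeated mission index into a fresh travel id
out = (df1.reset_index(drop=True)
          .reset_index()
          .rename(columns={'index': 'tid'}))
print(out)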
If you don't mind re-constructing the entire DataFrame, then you can clean it up a bit with np.roll to get the pairs of destinations, and then assign the value of mid based on the number of trips (the length of each sublist in l):
import pandas as pd
import numpy as np
from itertools import chain

l = [[fr] + to for fr, to in zip(df['from'], df['to'])]
df1 = (pd.DataFrame(data=chain.from_iterable([zip(sl, np.roll(sl, -1)) for sl in l]),
                    columns=['from', 'to'])
         .assign(mid=np.repeat(df['mid'].to_numpy(), [*map(len, l)])))
   from to  mid
0     A  C    0
1     C  A    0
2     A  B    1
3     B  C    1
4     C  A    1
5     B  B    2
6     B  B    2
7     C  D    3
8     D  E    3
9     E  F    3
10    F  C    3
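Another minimal sketch (not from the original answers), assuming Python 3.10+ for itertools.pairwise and that df is the input frame with mid, from and to columns; it builds each closed chain explicitly and yields tid directly:
from itertools import pairwise  # Python 3.10+
import pandas as pd

rows = []
for mid, fr, to in df[['mid', 'from', 'to']].itertuples(index=False):
    stops = [fr, *to, fr]  # close the loop back to the origin
    rows += [(mid, a, b) for a, b in pairwise(stops)]

out = (pd.DataFrame(rows, columns=['mid', 'from', 'to'])
         .reset_index()
         .rename(columns={'index': 'tid'}))
print(out)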

Conditional operations

I have a dataframe in which I would like to do some operations between columns that meet some criteria.
For example, I have the following table:
What I am interested in is deducting every column that has Mar = P from every column that has Mar = I with the same Type.
In the end, I would like the following:
Note: The values are just indicative.
Thanks in advance.
You could try this:
df = pd.DataFrame({'Size': [*'ABCDEF'],
                   'Com': [*'PGPGPG'],
                   'Mar': [*'IPIPEA'],
                   '0': [1, 2, 3, 4, 5, 6],
                   '1': [2, 2, 2, 2, 2, 2],
                   '2': [3, 3, 3, 3, 3, 3],
                   '3': [4, 4, 4, 4, 4, 4],
                   'Type': ['Lamba1']*2 + ['Lamba2']*2 + ['Lamba1']*2})
df1 = df.set_index(['Size', 'Com', 'Mar', 'Type']).T
print(df1)
Input DataFrame:
Size      A       B       C       D       E       F
Com       P       G       P       G       P       G
Mar       I       P       I       P       E       A
Type Lamba1  Lamba1  Lamba2  Lamba2  Lamba1  Lamba1
0         1       2       3       4       5       6
1         2       2       2       2       2       2
2         3       3       3       3       3       3
3         4       4       4       4       4       4
Use pd.IndexSlice and groupby:
idx = pd.IndexSlice
df_out = (df1.loc[:, idx[:, :, 'I':'P', :]]
             .T.groupby('Type').diff().dropna()
             .T.rename({'B': 'L', 'D': 'L', 'G': 'C', 'P': 'X'}, axis=1))
print(df_out)
Output:
Size      L
Com       C
Mar       X
Type Lamba1  Lamba2
0       1.0     1.0
1       0.0     0.0
2       0.0     0.0
3       0.0     0.0
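The one-liner is dense; here is a step-by-step equivalent (a sketch over the same sample data, matching the answer's logic):
idx = pd.IndexSlice
# 1. keep only the columns whose Mar level falls in the 'I':'P' slice
sel = df1.loc[:, idx[:, :, 'I':'P', :]]
# 2. transpose so columns become rows, then diff within each Type
#    (this computes the P row minus the I row per Type)
per_type = sel.T.groupby('Type').diff().dropna()
# 3. transpose back and relabel the surviving header levels
df_out = per_type.T.rename({'B': 'L', 'D': 'L', 'G': 'C', 'P': 'X'}, axis=1)
print(df_out)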
Unless I misunderstood the question, this seems to be a simple case of masking.
# avoid shadowing the builtin dict
data = {'Size': ['A', 'B', 'C', 'D'],
        'Com': ['P', 'G', 'P', 'G'],
        'Mar': ['I', 'P', 'I', 'P'],
        'Type': ['Lambda1', 'Lambda2', 'Lambda1', 'Lambda2'],
        '0': [1, 2, 3, 4],
        '1': [2, 2, 2, 2],
        '2': [3, 3, 3, 3],
        '3': [4, 4, 4, 4]}
df = pd.DataFrame(data)

# to get Lambda1 & 'I'
df[(df['Type'] == 'Lambda1') & (df['Mar'] == 'I')].T

# to get Lambda2 & 'P'
df[(df['Type'] == 'Lambda2') & (df['Mar'] == 'P')].T
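The deduction itself is not shown above; a hedged completion (an assumption, not part of the answer) that subtracts the two masked blocks positionally:
# align the selections by position and deduct the Mar == 'P' block from the
# Mar == 'I' block; with data pairing one I and one P row per Type, align on
# Type instead of position
value_cols = ['0', '1', '2', '3']
i_vals = df.loc[df['Mar'] == 'I', value_cols].reset_index(drop=True)
p_vals = df.loc[df['Mar'] == 'P', value_cols].reset_index(drop=True)
print(i_vals - p_vals)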

pandas dataframe concatenate strings from a subset of columns and put them into a list

I tried to retrieve strings from a subset of columns of a DataFrame, concatenate the strings of each row into one string, and then put these into a list:
# row_subset is a sub-DataFrame of some DataFrame
sub_columns = ['A', 'B', 'C']
string_list = [""] * row_subset.shape[0]
for x in range(row_subset.shape[0]):
    for y in range(len(sub_columns)):
        string_list[x] += str(row_subset[sub_columns[y]].iloc[x])
so the result looks like:
['row 0 string concatenation', 'row 1 concatenation', 'row 2 concatenation', 'row 3 concatenation']
I am wondering what the best, most efficient way to do this is.
I think you need to select the subset of columns by [] first and then sum, or use join if you need a separator:
df = pd.DataFrame({'A': list('abcdef'),
                   'B': list('qwerty'),
                   'C': list('fertuj'),
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbb')})
print(df)
   A  B  C  D  E  F
0  a  q  f  1  5  a
1  b  w  e  3  3  a
2  c  e  r  5  6  a
3  d  r  t  7  9  b
4  e  t  u  1  2  b
5  f  y  j  0  4  b
sub_columns = ['A', 'B', 'C']
print(df[sub_columns].sum(axis=1).tolist())
['aqf', 'bwe', 'cer', 'drt', 'etu', 'fyj']

print(df[sub_columns].apply(' '.join, axis=1).tolist())
['a q f', 'b w e', 'c e r', 'd r t', 'e t u', 'f y j']

A very similar NumPy solution:
print(df[sub_columns].values.sum(axis=1).tolist())
['aqf', 'bwe', 'cer', 'drt', 'etu', 'fyj']
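For a custom separator without a row-wise Python call, Series.str.cat also works (a vectorized sketch, assuming every selected column holds strings):
# concatenate the first column with the remaining ones, separated by '-'
print(df[sub_columns[0]].str.cat(df[sub_columns[1:]], sep='-').tolist())
['a-q-f', 'b-w-e', 'c-e-r', 'd-r-t', 'e-t-u', 'f-y-j']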

Pandas Dynamic Stack

Given the following data frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': ['a', 'b', 'c', 'd'],
                   'bar': ['e', 'f', 'g', 'h'],
                   0: ['i', 'j', 'k', np.nan],
                   1: ['m', np.nan, 'o', 'p']})
df = df[['foo', 'bar', 0, 1]]
df
  foo bar    0    1
0   a   e    i    m
1   b   f    j  NaN
2   c   g    k    o
3   d   h  NaN    p
...which resulted from a previous procedure that produced columns 0 and 1 (and might have produced more or fewer such columns, depending on the data).
I want to somehow stack (if that's the correct term) the data so that each value of 0 and 1 (ignoring NaNs) produces a new row like this:
  foo bar
0   a   e
0   a   i
0   a   m
1   b   f
1   b   j
2   c   g
2   c   k
2   c   o
3   d   h
3   d   p
You probably noticed that the common field is foo.
My actual data set will likely have more common fields.
Also, I'm not sure how important it is that the index values repeat in the end result across values of foo. As long as the data is correct, that's my main concern.
Update:
What if I have 2+ common fields like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': ['a', 'a', 'b', 'b'],
                   'foo2': ['a2', 'b2', 'c2', 'd2'],
                   'bar': ['e', 'f', 'g', 'h'],
                   0: ['i', 'j', 'k', np.nan],
                   1: ['m', np.nan, 'o', 'p']})
df = df[['foo', 'foo2', 'bar', 0, 1]]
df
  foo foo2 bar    0    1
0   a   a2   e    i    m
1   a   b2   f    j  NaN
2   b   c2   g    k    o
3   b   d2   h  NaN    p
You can use set_index, stack and reset_index:
print(df.set_index('foo')
        .stack()
        .reset_index(level=1, drop=True)
        .reset_index(name='bar'))

  foo bar
0   a   e
1   a   i
2   a   m
3   b   f
4   b   j
5   c   g
6   c   k
7   c   o
8   d   h
9   d   p
If you need the original index preserved, use melt:
# note: newer pandas versions reject a value_name that matches an existing
# column; in that case melt with the default value_name and rename afterwards
print(pd.melt(df.reset_index(),
              id_vars=['index', 'foo'],
              value_vars=['bar', 0, 1],
              value_name='bar')
        .sort_values('index')
        .set_index('index', drop=True)
        .dropna()
        .drop('variable', axis=1)
        .rename_axis(None))

  foo bar
0   a   e
0   a   i
0   a   m
1   b   f
1   b   j
2   c   g
2   c   k
2   c   o
3   d   h
3   d   p
Or use the lesser-known lreshape:
print(pd.lreshape(df.reset_index(), {'bar': ['bar', 0, 1]})
        .sort_values('index')
        .set_index('index', drop=True)
        .rename_axis(None))

  foo bar
0   a   e
0   a   i
0   a   m
1   b   f
1   b   j
2   c   g
2   c   k
2   c   o
3   d   h
3   d   p
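For the update with 2+ common fields, the same melt pattern extends by listing every shared column in id_vars; a sketch against the second sample df, renaming the value column afterwards to sidestep the value_name clash in newer pandas:
out = (pd.melt(df.reset_index(),
               id_vars=['index', 'foo', 'foo2'],
               value_vars=['bar', 0, 1])
         .dropna(subset=['value'])
         .sort_values('index')
         .set_index('index', drop=True)
         .drop('variable', axis=1)
         .rename(columns={'value': 'bar'})
         .rename_axis(None))
print(out)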
