Merge values in columns into one column - python-3.x

I have a dataframe and I'd like to group the columns starting with 'Answer' into one named 'Answers'. This column already exists, but from around row 3061 onward it has no values, so I have to add them. Here is what I've tried so far:
columns_with_answer = [col for col in df if col.startswith('Answer')]
df['Answers']= df.columns_with_answer.tolist()
But I got:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-6-399b932d5740> in <module>()
1 columns_with_answer = [col for col in df if col.startswith('Answer')]
----> 2 df['Answers']= df.columns_with_answer.tolist()
/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py in __getattr__(self, name)
5272 if self._info_axis._can_hold_identifiers_and_holds_name(name):
5273 return self[name]
-> 5274 return object.__getattribute__(self, name)
5275
5276 def __setattr__(self, name: str, value) -> None:
AttributeError: 'DataFrame' object has no attribute 'columns_with_answer'
So, with the sample data:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'A': list('abcdefg'),
...                    'B': [4, 5, 4, 5, 5, 4, np.nan],
...                    'Answer1': ['a', 'b', 'd', np.nan, 'd', np.nan, 'f'],
...                    'Answer2': ['a', 'b', 'd', 'e', 'h', 'd', 'k'],
...                    'Answer3': ['a', 'b', np.nan, 'd', 'r', np.nan, 'l'],
...                    'F': list('aaabbbc'),
...                    'Answers': ['truc', 'machin', 'bidule', np.nan, np.nan, np.nan, np.nan]})
>>> df.head()
   A    B Answer1 Answer2 Answer3  F Answers
0  a  4.0       a       a       a  a    truc
1  b  5.0       b       b       b  a  machin
2  c  4.0       d       d     NaN  a  bidule
3  d  5.0     NaN       e       d  b     NaN
4  e  5.0       d       h       r  b     NaN
Starting from row 3, I would like to get:
   A    B Answer1 Answer2 Answer3  F      Answers
0  a  4.0       a       a       a  a         truc
1  b  5.0       b       b       b  a       machin
2  c  4.0       d       d     NaN  a       bidule
3  d  5.0     NaN       e       d  b  [nan, e, d]
4  e  5.0       d       h       r  b    [d, h, r]

Select the columns by the list of names and convert to a NumPy array before converting to a list of lists:
df['Answers']= df[columns_with_answer].to_numpy().tolist()
Or use DataFrame.filter with the regex parameter, where ^ matches the start of the column name:
df['Answers']= df.filter(regex='^Answer').to_numpy().tolist()
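Note that both versions above also catch the 'Answers' column itself, since 'Answers'.startswith('Answer') is True; the EDIT below excludes it explicitly. Also, .to_numpy() was added in pandas 0.24; on older versions, .values works the same way here:
df['Answers'] = df[columns_with_answer].values.tolist()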
EDIT: To apply the solution only to rows where the Answers column has missing values, use:
columns_with_answer = [col for col in df if col.startswith('Answer') and col != 'Answers']
mask = df['Answers'].isna()
print (mask)
0 False
1 False
2 False
3 True
4 True
5 True
6 True
Name: Answers, dtype: bool
L = df.loc[mask, columns_with_answer].to_numpy().tolist()
df.loc[mask, 'Answers'] = pd.Series(L, index=df.index[mask])
print (df)
A B Answer1 Answer2 Answer3 F Answers
0 a 4.0 a a a a truc
1 b 5.0 b b b a machin
2 c 4.0 d d NaN a bidule
3 d 5.0 NaN e d b [nan, e, d]
4 e 5.0 d h r b [d, h, r]
5 f 4.0 NaN d NaN b [nan, d, nan]
6 g NaN f k l c [f, k, l]
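The pd.Series wrapper in the assignment matters: .loc aligns the right-hand side by label, so giving the list of lists the masked index puts each list on the intended row, whereas assigning the raw list of lists would typically be treated as a 2-D block of scalars rather than one list object per row.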

Related

MELT: multiple values without duplication

Can't be this hard. I have:
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3], 'name': ['j', 'l', 'm'],
                   'mnt': ['f', 'p', 'p'], 'nt': ['b', 'w', 'e'],
                   'cost': [20, 30, 80], 'paid': [12, 23, 45]})
I need:
import numpy as np
df1 = pd.DataFrame({'id': [1, 2, 3, 1, 2, 3],
                    'name': ['j', 'l', 'm', 'j', 'l', 'm'],
                    't': ['f', 'p', 'p', 'b', 'w', 'e'],
                    'paid': [12, 23, 45, np.nan, np.nan, np.nan],
                    'cost': [20, 30, 80, np.nan, np.nan, np.nan]})
I have 45 columns to reshape this way.
I tried:
(df.set_index(['id', 'name'])
   .rename_axis(['paid'], axis=1)
   .stack().reset_index())
EDIT: I think the simplest approach here is to set the missing values based on the variable column created by DataFrame.melt:
df2 = df.melt(['id', 'name','cost','paid'], value_name='t')
df2.loc[df2.pop('variable').eq('nt'), ['cost','paid']] = np.nan
print (df2)
id name cost paid t
0 1 j 20.0 12.0 f
1 2 l 30.0 23.0 p
2 3 m 80.0 45.0 p
3 1 j NaN NaN b
4 2 l NaN NaN w
5 3 m NaN NaN e
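Here df2.pop('variable') both removes the helper column from df2 and returns it, so the mask and the cleanup happen in a single step. Written out for readability (same result):
# pop drops 'variable' from the frame and returns it as a Series
mask = df2.pop('variable').eq('nt')   # rows that came from the 'nt' column
df2.loc[mask, ['cost', 'paid']] = np.nan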
Or use pd.lreshape, which works with a dictionary of lists specifying which columns are 'grouped' together:
df2 = pd.lreshape(df, {'t':['mnt','nt'], 'mon':['cost','paid']})
print (df2)
id name t mon
0 1 j f 20
1 2 l p 30
2 3 m p 80
3 1 j b 12
4 2 l w 23
5 3 m e 45

Replace values on dataset and apply quartile rule by row on pandas

I have a dataset with lots of variables. So I've extracted the numeric ones:
numeric_columns = transposed_df.select_dtypes(np.number)
Then I want to replace all 0 values with 0.0001:
transposed_df[numeric_columns.columns] = numeric_columns.where(numeric_columns.eq(0, axis=0), 0.0001)
And here is the first problem: this line is not replacing the 0 values with 0.0001, it is replacing all the non-zero values with 0.0001.
Also, after replacing the 0 values with 0.0001, I want to replace all values that are less than the 1st quartile of their row with -1 and leave the others as they were. But I can't work out how.
To answer your first question:
In [36]: from pprint import pprint
In [37]: pprint(numeric_columns.where.__doc__)
('\n'
'Replace values where the condition is False.\n'
'\n'
'Parameters\n'
'----------\n'
Because where replaces values where the condition is False, all the values except 0 are getting replaced.
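A minimal illustration of the difference between where and mask, on a throwaway Series s:
s = pd.Series([0, 1, 2])
s.where(s.eq(0), 0.0001)  # keeps values where the condition is True:    0.0, 0.0001, 0.0001
s.mask(s.eq(0), 0.0001)   # replaces values where the condition is True: 0.0001, 1.0, 2.0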
Use DataFrame.mask instead, and for the second condition compare against DataFrame.quantile:
import numpy as np
import pandas as pd

transposed_df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [0, 0.5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 0, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
numeric_columns = transposed_df.select_dtypes(np.number)

# m1 marks the zeros, m2 marks values below the 1st quartile of their row
m1 = numeric_columns.eq(0)
m2 = numeric_columns.lt(numeric_columns.quantile(q=0.25, axis=1), axis=0)
transposed_df[numeric_columns.columns] = numeric_columns.mask(m1, 0.0001).mask(m2, -1)
print (transposed_df)
A B C D E F
0 a -1.0 7 1.0 5 a
1 b -1.0 8 3.0 3 a
2 c 4.0 9 -1.0 6 a
3 d 5.0 -1 7.0 9 b
4 e 5.0 2 -1.0 2 b
5 f 4.0 3 -1.0 4 b
EDIT:
from scipy.stats import zscore
print (transposed_df[numeric_columns.columns].apply(zscore))
B C D E
0 -2.236068 0.570352 -0.408248 0.073521
1 0.447214 0.950586 0.408248 -0.808736
2 0.447214 1.330821 -0.816497 0.514650
3 0.447214 -0.570352 2.041241 1.838037
4 0.447214 -1.330821 -0.408248 -1.249865
5 0.447214 -0.950586 -0.816497 -0.367607
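Note that scipy's zscore standardizes with ddof=0 (population standard deviation), whereas pandas' std defaults to ddof=1. A pandas-only equivalent of the column-wise version would be (a sketch):
# column-wise z-score without scipy; ddof=0 matches scipy.stats.zscore
df_z = (numeric_columns - numeric_columns.mean()) / numeric_columns.std(ddof=0)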
EDIT1:
transposed_df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [0, 1, 1, 1, 1, 1],
    'C': [1, 8, 9, 4, 2, 3],
    'D': [1, 3, 0, 7, 1, 0],
    'E': [1, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
numeric_columns = transposed_df.select_dtypes(np.number)

from scipy.stats import zscore
# zscore returns an array per row, so rebuild a DataFrame; keeping the original
# labels makes the assignment below align correctly
df1 = pd.DataFrame(numeric_columns.apply(zscore, axis=1).tolist(),
                   index=transposed_df.index,
                   columns=numeric_columns.columns)
transposed_df[numeric_columns.columns] = df1
print (transposed_df)
A B C D E F
0 a -1.732051 0.577350 0.577350 0.577350 a
1 b -1.063410 1.643452 -0.290021 -0.290021 a
2 c -0.816497 1.360828 -1.088662 0.544331 a
3 d -1.402136 -0.412393 0.577350 1.237179 b
4 e -1.000000 1.000000 -1.000000 1.000000 b
5 f -0.632456 0.632456 -1.264911 1.264911 b
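A pandas-only equivalent of the row-wise z-score, avoiding the apply-and-rebuild step (a sketch):
# subtract row means and divide by row standard deviations (ddof=0, as in scipy)
z = (numeric_columns.sub(numeric_columns.mean(axis=1), axis=0)
                    .div(numeric_columns.std(axis=1, ddof=0), axis=0))
transposed_df[numeric_columns.columns] = z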

Easily generate edge list from specific structure using pandas

This is a question about how to make things properly with pandas (I use version 1.0).
Let's say I have a DataFrame of missions, each containing an origin and one or more destinations:
mid from to
0 0 A [C]
1 1 A [B, C]
2 2 B [B]
3 3 C [D, E, F]
E.g., for mission mid=1, people will travel from A to B, then from B to C, and finally from C back to A. Note that I have no control over the data model of the input DataFrame.
I would like to compute metrics on each travel of the mission. The expected output would be exactly:
tid mid from to
0 0 0 A C
1 1 0 C A
2 2 1 A B
3 3 1 B C
4 4 1 C A
5 5 2 B B
6 6 2 B B
7 7 3 C D
8 8 3 D E
9 9 3 E F
10 10 3 F C
I have found a way to achieve my goal. Please find the MCVE below:
import pandas as pd

# Input:
df = pd.DataFrame(
    [["A", ["C"]],
     ["A", ["B", "C"]],
     ["B", ["B"]],
     ["C", ["D", "E", "F"]]],
    columns=["from", "to"]
).reset_index().rename(columns={'index': 'mid'})

# Create chain (wrap the origin in a list rather than calling list() on it,
# which would split a multi-character label into characters):
df['chain'] = df.apply(lambda x: [x['from']] + x['to'] + [x['from']], axis=1)

# Explode chain:
df = df.explode('chain')

# Shift to create travels:
df['end'] = df.groupby("mid")["chain"].shift(-1)

# Remove extra rows, clean, reindex and rename:
df = df.dropna(subset=['end']).reset_index(drop=True).reset_index().rename(columns={'index': 'tid'})
df = df.drop(['from', 'to'], axis=1).rename(columns={'chain': 'from', 'end': 'to'})
My question is: Is there a better/easier way to do this with pandas? By better I mean not necessarily more performant (though it can be, of course), but more readable and intuitive.
Your operation is basically explode and concat:
# turn the series of lists into one row per destination
tmp = df[['mid', 'to']].explode('to')

# new `from` is the concatenation of the origins and the exploded list
df1 = pd.concat((df[['mid', 'from']],
                 tmp.rename(columns={'to': 'from'}))).sort_index()

# new `to` is the concatenation of the exploded list and the origins
# (the final leg back to the start)
df2 = pd.concat((tmp,
                 df[['mid', 'from']].rename(columns={'from': 'to'}))).sort_index()

df1['to'] = df2['to']
Output:
mid from to
0 0 A C
0 0 C A
1 1 A B
1 1 B C
1 1 C A
2 2 B B
2 2 B B
3 3 C D
3 3 D E
3 3 E F
3 3 F C
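To match the expected output exactly, you can still renumber the travels and expose the counter as tid afterwards (a sketch):
# renumber the travels 0..n-1 and move the counter into a 'tid' column
df1 = df1.reset_index(drop=True).rename_axis('tid').reset_index()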
If you don't mind reconstructing the entire DataFrame, you can clean it up a bit with np.roll to get the pairs of destinations, and then assign mid based on the number of trips (the length of each sublist in l):
import pandas as pd
import numpy as np
from itertools import chain

# prepend the origin to each destination list to form the full chain
l = [[fr] + to for fr, to in zip(df['from'], df['to'])]

df1 = (pd.DataFrame(data=chain.from_iterable([zip(sl, np.roll(sl, -1)) for sl in l]),
                    columns=['from', 'to'])
         .assign(mid=np.repeat(df['mid'].to_numpy(), [*map(len, l)])))
from to mid
0 A C 0
1 C A 0
2 A B 1
3 B C 1
4 C A 1
5 B B 2
6 B B 2
7 C D 3
8 D E 3
9 E F 3
10 F C 3
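The np.roll trick is what closes the loop back to the origin: for a single chain it pairs each stop with the next one and wraps the last stop around to the first.
sl = ['A', 'B', 'C']
list(zip(sl, np.roll(sl, -1)))  # [('A', 'B'), ('B', 'C'), ('C', 'A')]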

Deleting the first instance in a data frame

I was wondering what's the best way to delete the first instance of a particular index in a Pandas DataFrame.
In the example below, I want to delete rows 0, 5 and 9.
Use boolean indexing with Index.duplicated:
df = pd.DataFrame({'A': list('abcdef'),
                   'B': [4, 5, 4, 5, 5, 4],
                   'C': [7, 8, 9, 4, 2, 3],
                   'D': [1, 3, 5, 7, 1, 0],
                   'E': [5, 3, 6, 9, 2, 4],
                   'F': list('aaabbb')}, index=[0, 0, 1, 2, 2, 2])
print (df)
A B C D E F
0 a 4 7 1 5 a
0 b 5 8 3 3 a
1 c 4 9 5 6 a
2 d 5 4 7 9 b
2 e 5 2 1 2 b
2 f 4 3 0 4 b
df = df[df.index.duplicated()]
print (df)
A B C D E F
0 b 5 8 3 3 a
2 e 5 2 1 2 b
2 f 4 3 0 4 b
Detail:
print (df.index.duplicated())
[False True False False True True]
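Note that Index.duplicated never marks a row whose index value appears only once, so the filter above also drops index 1 ('c'). To delete only the first occurrence of duplicated indices while keeping unique ones (a sketch):
# keep non-first occurrences, plus rows whose index is not duplicated at all
df[df.index.duplicated() | ~df.index.duplicated(keep=False)]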
Here's a way to do it using groupby:
# integer position of every row
df['int_index'] = range(len(df))
# first occurrence of each index value (including its integer position)
firsts = df.groupby(df.index).first()
# keep rows that are not the first occurrence of their index...
filt = df[~df['int_index'].isin(firsts['int_index'])]
# ...plus rows whose index appears only once
missing = df[df.index.map(df.index.value_counts()) == 1]
res = pd.concat([filt, missing]).sort_index().drop('int_index', axis=1)

Pandas Dynamic Stack

Given the following data frame:
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': ['a', 'b', 'c', 'd'],
                   'bar': ['e', 'f', 'g', 'h'],
                   0: ['i', 'j', 'k', np.nan],
                   1: ['m', np.nan, 'o', 'p']})
df = df[['foo', 'bar', 0, 1]]
df
foo bar 0 1
0 a e i m
1 b f j NaN
2 c g k o
3 d h NaN p
...which resulted from a previous procedure that produced columns 0 and 1 (and may have produced more or fewer such columns, depending on the data).
I want to somehow stack (if that's the correct term) the data so that each value of 0 and 1 (ignoring NaNs) produces a new row like this:
foo bar
0 a e
0 a i
0 a m
1 b f
1 b j
2 c g
2 c k
2 c o
3 d h
3 d p
You probably noticed that the common field is foo.
There will likely be more common fields in my actual data set.
Also, I'm not sure how important it is that the index values repeat in the end result across values of foo. As long as the data is correct, that's my main concern.
Update:
What if I have 2+ common fields like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': ['a', 'a', 'b', 'b'],
                   'foo2': ['a2', 'b2', 'c2', 'd2'],
                   'bar': ['e', 'f', 'g', 'h'],
                   0: ['i', 'j', 'k', np.nan],
                   1: ['m', np.nan, 'o', 'p']})
df = df[['foo', 'foo2', 'bar', 0, 1]]
df
foo foo2 bar 0 1
0 a a2 e i m
1 a b2 f j NaN
2 b c2 g k o
3 b d2 h NaN p
You can use set_index, stack and reset_index:
print(df.set_index('foo')
        .stack()
        .reset_index(level=1, drop=True)
        .reset_index(name='bar'))
foo bar
0 a e
1 a i
2 a m
3 b f
4 b j
5 c g
6 c k
7 c o
8 d h
9 d p
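The update asks about two or more common fields; the same approach extends by passing all of them to set_index (a sketch, using the updated df with foo and foo2):
print(df.set_index(['foo', 'foo2'])
        .stack()
        .reset_index(level=2, drop=True)
        .reset_index(name='bar'))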
If you need the original index repeated, use melt (value_name must not collide with an existing column, hence the temporary 'val'):
print(pd.melt(df.reset_index(),
              id_vars=['index', 'foo'],
              value_vars=['bar', 0, 1],
              value_name='val')
        .sort_values('index')
        .set_index('index', drop=True)
        .dropna()
        .drop('variable', axis=1)
        .rename(columns={'val': 'bar'})
        .rename_axis(None))
foo bar
0 a e
0 a i
0 a m
1 b f
1 b j
2 c g
2 c k
2 c o
3 d h
3 d p
Or use the not-so-well-known lreshape:
print(pd.lreshape(df.reset_index(), {'bar': ['bar', 0, 1]})
        .sort_values('index')
        .set_index('index', drop=True)
        .rename_axis(None))
foo bar
0 a e
0 a i
0 a m
1 b f
1 b j
2 c g
2 c k
2 c o
3 d h
3 d p
