MELT: multiple values without duplication - python-3.x

Can't be this hard. I have:
import pandas as pd
df = pd.DataFrame({'id':[1,2,3],'name':['j','l','m'], 'mnt':['f','p','p'],'nt':['b','w','e'],'cost':[20,30,80],'paid':[12,23,45]})
I need
import numpy as np
df1=pd.DataFrame({'id':[1,2,3,1,2,3],'name':['j','l','m','j','l','m'], 't':['f','p','p','b','w','e'],'paid':[12,23,45,np.nan,np.nan,np.nan],'cost':[20,30,80,np.nan,np.nan,np.nan]})
I have 45 columns to invert.
I tried
(df.set_index(['id', 'name'])
   .rename_axis(['paid'], axis=1)
   .stack().reset_index())

EDIT: I think the simplest approach here is to melt with DataFrame.melt and then set missing values based on the variable column:
df2 = df.melt(['id', 'name','cost','paid'], value_name='t')
df2.loc[df2.pop('variable').eq('nt'), ['cost','paid']] = np.nan
print (df2)
id name cost paid t
0 1 j 20.0 12.0 f
1 2 l 30.0 23.0 p
2 3 m 80.0 45.0 p
3 1 j NaN NaN b
4 2 l NaN NaN w
5 3 m NaN NaN e
Use lreshape, which works with a dictionary of lists specifying which columns are 'grouped' together:
df2 = pd.lreshape(df, {'t':['mnt','nt'], 'mon':['cost','paid']})
print (df2)
id name t mon
0 1 j f 20
1 2 l p 30
2 3 m p 80
3 1 j b 12
4 2 l w 23
5 3 m e 45
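The melt-based approach from the EDIT generalizes to many columns (the 45 mentioned in the question) by computing the id/value column lists instead of hard-coding them. A minimal sketch under that assumption, reusing the example's two text columns:
import numpy as np
import pandas as pd

df = pd.DataFrame({'id':[1,2,3],'name':['j','l','m'], 'mnt':['f','p','p'],'nt':['b','w','e'],'cost':[20,30,80],'paid':[12,23,45]})

value_cols = ['mnt', 'nt']                        # columns stacked into 't'
id_cols = [c for c in df.columns if c not in value_cols]

df2 = df.melt(id_vars=id_cols, value_vars=value_cols, value_name='t')
# blank out the measure columns on rows that came from 'nt'
df2.loc[df2.pop('variable').eq('nt'), ['cost', 'paid']] = np.nan
print(df2)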

Related

Join with column having the max sequence number

I have a margin table
item margin
0 a 3
1 b 4
2 c 5
and an item table
item sequence
0 a 1
1 a 2
2 a 3
3 b 1
4 b 2
5 c 1
6 c 2
7 c 3
I want to join the two tables so that the margin is only joined to the product with the maximum sequence number; the desired outcome is:
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
How can I achieve this?
Below is the code for the margin and item tables:
import pandas as pd
df_margin=pd.DataFrame({"item":["a","b","c"],"margin":[3,4,5]})
df_item=pd.DataFrame({"item":["a","a","a","b","b","c","c","c"],"sequence":[1,2,3,1,2,1,2,3]})
One option would be to merge, then replace the extra values with NaN via Series.where:
new_df = df_item.merge(df_margin)
new_df['margin'] = new_df['margin'].where(
    new_df.groupby('item')['sequence'].transform('max').eq(new_df['sequence'])
)
Or with loc:
import numpy as np

new_df = df_item.merge(df_margin)
new_df.loc[new_df.groupby('item')['sequence']
                 .transform('max').ne(new_df['sequence']), 'margin'] = np.nan
Another option would be to assign a temporary column to both frames: in df_item it is True where the sequence is maximal, and in df_margin it is True everywhere; then merge with how='outer' and drop the temporary column:
new_df = (
    df_item.assign(
        t=df_item
        .groupby('item')['sequence']
        .transform('max')
        .eq(df_item['sequence'])
    ).merge(df_margin.assign(t=True), how='outer').drop(columns='t')
)
Both produce new_df:
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
You could do:
df_item.merge(df_item.groupby('item')['sequence'].max()
              .reset_index().merge(df_margin), how='left')
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
Breakdown:
df_new = df_item.groupby('item')['sequence'].max().reset_index().merge(df_margin)
df_item.merge(df_new, 'left')
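Yet another way (a sketch of my own, not taken from the answers above) is to locate each item's max-sequence row with idxmax and assign the margin only there:
import numpy as np
import pandas as pd

df_margin = pd.DataFrame({"item": ["a", "b", "c"], "margin": [3, 4, 5]})
df_item = pd.DataFrame({"item": ["a", "a", "a", "b", "b", "c", "c", "c"],
                        "sequence": [1, 2, 3, 1, 2, 1, 2, 3]})

# index label of the max-sequence row within each item group
idx = df_item.groupby('item')['sequence'].idxmax()

new_df = df_item.copy()
new_df['margin'] = np.nan
# assign each item's margin only on its max-sequence row
new_df.loc[idx, 'margin'] = new_df.loc[idx, 'item'].map(
    df_margin.set_index('item')['margin'])
print(new_df)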

pandas groupby and widen dataframe with ordered columns

I have a long-form dataframe that contains multiple samples and time points for each subject. The number of samples and time points can vary, and the days between time points can also vary:
import pandas as pd

test_df = pd.DataFrame({"subject_id": [1, 1, 1, 2, 2, 3],
                        "sample": ["A", "B", "C", "D", "E", "F"],
                        "timepoint": [19, 11, 8, 6, 2, 12],
                        "time_order": [3, 2, 1, 2, 1, 1]})
subject_id sample timepoint time_order
0 1 A 19 3
1 1 B 11 2
2 1 C 8 1
3 2 D 6 2
4 2 E 2 1
5 3 F 12 1
I need to figure out a way to generalize grouping this dataframe by subject_id and putting all samples and time points on the same row, in time order.
DESIRED OUTPUT:
subject_id sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
0 1 C 8 B 11 A 19
1 2 E 2 D 6 null null
5 3 F 12 null null null null
Pivot gets me close, but I'm stuck on how to proceed from there:
test_df = test_df.pivot(index=['subject_id', 'sample'],
                        columns='time_order', values='timepoint')
Use DataFrame.set_index with DataFrame.unstack for pivoting, sort the MultiIndex in the columns, flatten it, and finally convert subject_id back to a column:
df = (test_df.set_index(['subject_id', 'time_order'])
             .unstack()
             .sort_index(level=[1, 0], axis=1))
df.columns = df.columns.map(lambda x: f'{x[0]}{x[1]}')
df = df.reset_index()
print(df)
subject_id sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
0 1 C 8.0 B 11.0 A 19.0
1 2 E 2.0 D 6.0 NaN NaN
2 3 F 12.0 NaN NaN NaN NaN
# relies on rows within each subject being in descending time_order, so the
# group's last row is time point 1, the second-to-last is time point 2, etc.
# (in older pandas, GroupBy.nth returns one row per group indexed by subject_id,
# which the concat below depends on)
a = test_df.iloc[:, :3].groupby('subject_id').last().add_suffix('1')
b = test_df.iloc[:, :3].groupby('subject_id').nth(-2).add_suffix('2')
c = test_df.iloc[:, :3].groupby('subject_id').nth(-3).add_suffix('3')
pd.concat([a, b, c], axis=1)
sample1 timepoint1 sample2 timepoint2 sample3 timepoint3
subject_id
1 C 8 B 11.0 A 19.0
2 E 2 D 6.0 NaN NaN
3 F 12 NaN NaN NaN NaN
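Since the question mentions getting close with pivot, here is a minimal sketch (my own completion of that route, not one of the answers above) that pivots on subject_id alone with both value columns at once and then flattens the MultiIndex:
import pandas as pd

test_df = pd.DataFrame({"subject_id": [1, 1, 1, 2, 2, 3],
                        "sample": ["A", "B", "C", "D", "E", "F"],
                        "timepoint": [19, 11, 8, 6, 2, 12],
                        "time_order": [3, 2, 1, 2, 1, 1]})

wide = test_df.pivot(index='subject_id', columns='time_order',
                     values=['sample', 'timepoint'])
# order columns by time_order first, then flatten the MultiIndex
wide = wide.sort_index(level=[1, 0], axis=1)
wide.columns = [f'{name}{order}' for name, order in wide.columns]
print(wide.reset_index())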

Replace values on dataset and apply quartile rule by row on pandas

I have a dataset with lots of variables. So I've extracted the numeric ones:
numeric_columns = transposed_df.select_dtypes(np.number)
Then I want to replace all 0 values with 0.0001:
transposed_df[numeric_columns.columns] = numeric_columns.where(numeric_columns.eq(0, axis=0), 0.0001)
And here is the first problem. This line is not replacing the 0 values with 0.0001; it is replacing all non-zero values with 0.0001.
Also, after replacing the 0 values with 0.0001, I want to replace all values that are less than the first quartile of the row with -1 and leave the others as they were, but I can't figure out how.
To answer your first question
In [36]: from pprint import pprint
In [37]: pprint( numeric_columns.where.__doc__)
('\n'
'Replace values where the condition is False.\n'
'\n'
'Parameters\n'
'----------\n'
Because of that, all the values except 0 are getting replaced.
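A tiny illustration of the difference (a sketch): DataFrame.where keeps values where the condition is True and replaces the rest, while DataFrame.mask does the opposite:
import pandas as pd

s = pd.Series([0, 5, 0, 3])

# where: keep values where the condition is True, replace the rest
print(s.where(s.eq(0), 0.0001))   # the non-zero values become 0.0001
# mask: replace values where the condition is True
print(s.mask(s.eq(0), 0.0001))    # the zeros become 0.0001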
Use DataFrame.mask, and for the second condition compare against DataFrame.quantile:
import numpy as np
import pandas as pd

transposed_df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [0, 0.5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 0, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
numeric_columns = transposed_df.select_dtypes(np.number)

m1 = numeric_columns.eq(0)
m2 = numeric_columns.lt(numeric_columns.quantile(q=0.25, axis=1), axis=0)
transposed_df[numeric_columns.columns] = numeric_columns.mask(m1, 0.0001).mask(m2, -1)
print(transposed_df)
A B C D E F
0 a -1.0 7 1.0 5 a
1 b -1.0 8 3.0 3 a
2 c 4.0 9 -1.0 6 a
3 d 5.0 -1 7.0 9 b
4 e 5.0 2 -1.0 2 b
5 f 4.0 3 -1.0 4 b
EDIT:
from scipy.stats import zscore
print (transposed_df[numeric_columns.columns].apply(zscore))
B C D E
0 -2.236068 0.570352 -0.408248 0.073521
1 0.447214 0.950586 0.408248 -0.808736
2 0.447214 1.330821 -0.816497 0.514650
3 0.447214 -0.570352 2.041241 1.838037
4 0.447214 -1.330821 -0.408248 -1.249865
5 0.447214 -0.950586 -0.816497 -0.367607
EDIT1:
transposed_df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [0, 1, 1, 1, 1, 1],
    'C': [1, 8, 9, 4, 2, 3],
    'D': [1, 3, 0, 7, 1, 0],
    'E': [1, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
numeric_columns = transposed_df.select_dtypes(np.number)

from scipy.stats import zscore
df1 = pd.DataFrame(numeric_columns.apply(zscore, axis=1).tolist(), index=transposed_df.index)
transposed_df[numeric_columns.columns] = df1
print(transposed_df)
A B C D E F
0 a -1.732051 0.577350 0.577350 0.577350 a
1 b -1.063410 1.643452 -0.290021 -0.290021 a
2 c -0.816497 1.360828 -1.088662 0.544331 a
3 d -1.402136 -0.412393 0.577350 1.237179 b
4 e -1.000000 1.000000 -1.000000 1.000000 b
5 f -0.632456 0.632456 -1.264911 1.264911 b
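For reference, the same row-wise z-scores can be computed without scipy as a purely vectorized sketch (ddof=0 to match scipy.stats.zscore's default):
import numpy as np
import pandas as pd

transposed_df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [0, 1, 1, 1, 1, 1],
    'C': [1, 8, 9, 4, 2, 3],
    'D': [1, 3, 0, 7, 1, 0],
    'E': [1, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
numeric_columns = transposed_df.select_dtypes(np.number)

# subtract the row mean and divide by the row standard deviation (population std)
z = (numeric_columns.sub(numeric_columns.mean(axis=1), axis=0)
                    .div(numeric_columns.std(axis=1, ddof=0), axis=0))
transposed_df[numeric_columns.columns] = z
print(transposed_df)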

Pandas dataframe: Number of dates in group prior to row date

I would like to add a column in a dataframe that contains for each group G the number of distinct observations in variable x that happened before time t.
Note: t is in datetime format and missing values in the data are possible but can be ignored. The same x can appear multiple times in a group but then it is assigned the same date. The time assigned to x is not the same across groups.
I hope this example helps:
Input:
Group x t
1 a 2013-11-01
1 b 2015-04-03
1 b 2015-04-03
1 c NaT
2 a 2017-03-01
2 c 2013-11-06
2 d 2015-04-26
2 d 2015-04-26
2 d 2015-04-26
2 b NaT
Output:
Group x t Number of unique x before time t
1 a 2013-11-01 0
1 b 2015-04-03 1
1 b 2015-04-03 1
1 c NaT NaN
2 a 2017-03-01 2
2 c 2013-11-06 0
2 d 2015-04-26 1
2 d 2015-04-26 1
2 d 2015-04-26 1
2 b NaT NaN
The dataset is quite large, so I wonder if there is any vectorized way to do this (e.g. using groupby).
Many Thanks
Here's another method.
The initial sort makes it so fillna will work later on.
Create df2, which calculates the number of unique days within each group before that date.
Merge the number of days back into the original df; fillna then takes care of the rows whose day was duplicated (the sort ensures this happens properly).
Dates with NaT were placed at the end for the cumsum, so just reset them to NaN.
If you want to restore the original order at the end, just sort the index: df.sort_index(inplace=True).
import pandas as pd
import numpy as np

df = df.sort_values(by=['Group', 't'])
df['t'] = pd.to_datetime(df.t)

# keep only dated rows and drop exact duplicate observations
df2 = df[df.t.notnull()]
df2 = df2.drop_duplicates()
# running count of prior unique observations within each group
df2['temp'] = 1
df2['num_b4'] = df2.groupby('Group').temp.cumsum() - 1

df = df.merge(df2[['num_b4']], left_index=True, right_index=True, how='left')
df['num_b4'] = df['num_b4'].ffill()
df.loc[df.t.isnull(), 'num_b4'] = np.nan
# Group x t num_b4
#0 1 a 2013-11-01 0.0
#1 1 b 2015-04-03 1.0
#2 1 b 2015-04-03 1.0
#3 1 c NaT NaN
#5 2 c 2013-11-06 0.0
#6 2 d 2015-04-26 1.0
#7 2 d 2015-04-26 1.0
#8 2 d 2015-04-26 1.0
#4 2 a 2017-03-01 2.0
#9 2 b NaT NaN
IIUC, for the new cases you want to change a single line in the above code:
# df2 = df2.drop_duplicates()
df2 = df2.drop_duplicates(['Group', 't'])
With that, the same day that has multiple x values assigned to it does not cause the number of observations to increment. See the output for Group 3 below, in which I added 4 rows to your initial data.
Group x t
3 a 2015-04-03
3 b 2015-04-03
3 c 2015-04-03
3 c 2015-04-04
## Apply the code with the changed drop_duplicates() line
Group x t num_b4
0 1 a 2013-11-01 0.0
1 1 b 2015-04-03 1.0
2 1 b 2015-04-03 1.0
3 1 c NaT NaN
5 2 c 2013-11-06 0.0
6 2 d 2015-04-26 1.0
7 2 d 2015-04-26 1.0
8 2 d 2015-04-26 1.0
4 2 a 2017-03-01 2.0
9 2 b NaT NaN
10 3 a 2015-04-03 0.0
11 3 b 2015-04-03 0.0
12 3 c 2015-04-03 0.0
13 3 c 2015-04-04 1.0
You can do it like this using a custom function that uses merge to do a self-join, plus groupby and nunique to count unique values:
def countunique(x):
    df_out = x.merge(x, on='Group')\
              .query('x_x != x_y and t_y < t_x')\
              .groupby(['x_x','t_x'])['x_y'].nunique()\
              .reset_index()
    result = x.merge(df_out, left_on=['x','t'],
                     right_on=['x_x','t_x'],
                     how='left')
    result = result[['Group','x','t','x_y']]
    result.loc[result.t.notnull(),'x_y'] = result.loc[result.t.notnull(),'x_y'].fillna(0)
    return result.rename(columns={'x_y':'No of unique x before t'})
df.groupby('Group', group_keys=False).apply(countunique)
Output:
Group x t No of unique x before t
0 1 a 2013-11-01 0.0
1 1 b 2015-04-03 1.0
2 1 b 2015-04-03 1.0
3 1 c NaT NaN
0 2 a 2017-03-01 2.0
1 2 c 2013-11-06 0.0
2 2 d 2015-04-26 1.0
3 2 d 2015-04-26 1.0
4 2 d 2015-04-26 1.0
5 2 b NaT NaN
Explanation:
For each group:
1. Perform a self-join using merge on 'Group'.
2. Filter the result of the self-join, keeping only records whose time is before the current record's time.
3. Use groupby with nunique to count only unique values of x from the self-join.
4. Merge the count of x back to the original dataframe, keeping all rows with how='left'.
5. Fill NaN values with zero where the record has a time.
6. Rename the column headings.
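Since the question asks for a vectorized approach, here is a minimal sketch of my own (not taken from either answer): a dense rank of the dates within each group counts the distinct earlier dates directly, which matches the expected output because each x within a group carries a single date:
import pandas as pd

df = pd.DataFrame({'Group': [1, 1, 1, 1, 2, 2, 2, 2, 2, 2],
                   'x': list('abbcacdddb'),
                   't': pd.to_datetime(['2013-11-01', '2015-04-03', '2015-04-03', None,
                                        '2017-03-01', '2013-11-06', '2015-04-26',
                                        '2015-04-26', '2015-04-26', None])})

# dense rank 1 is the earliest distinct date in the group, so rank - 1 equals
# the number of distinct earlier dates; NaT rows stay NaN
df['Number of unique x before time t'] = df.groupby('Group')['t'].rank(method='dense') - 1
print(df)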

pandas equivalent of R's cbind (concatenate/stack vectors vertically)

Suppose I have two dataframes:
import pandas
....
....
test1 = pandas.DataFrame([1,2,3,4,5])
....
....
test2 = pandas.DataFrame([4,2,1,3,7])
....
I tried test1.append(test2) but it is the equivalent of R's rbind.
How can I combine the two as two columns of a dataframe similar to the cbind function in R?
test3 = pd.concat([test1, test2], axis=1)
test3.columns = ['a','b']
(But see the detailed answer by @feng-mai, below)
There is a key difference between concat(axis = 1) in pandas and cbind() in R:
concat attempts to merge/align by index. There is no concept of an index in an R dataframe. If the two pandas dataframes' indexes are misaligned, the results are different from cbind (even if they have the same number of rows). You need to either make sure the indexes align or drop/reset the indexes.
Example:
import pandas as pd
test1 = pd.DataFrame([1,2,3,4,5])
test1.index = ['a','b','c','d','e']
test2 = pd.DataFrame([4,2,1,3,7])
test2.index = ['d','e','f','g','h']
pd.concat([test1, test2], axis=1)
0 0
a 1.0 NaN
b 2.0 NaN
c 3.0 NaN
d 4.0 4.0
e 5.0 2.0
f NaN 1.0
g NaN 3.0
h NaN 7.0
pd.concat([test1.reset_index(drop=True), test2.reset_index(drop=True)], axis=1)
0 1
0 1 4
1 2 2
2 3 1
3 4 3
4 5 7
pd.concat([test1.reset_index(), test2.reset_index(drop=True)], axis=1)
index 0 0
0 a 1 4
1 b 2 2
2 c 3 1
3 d 4 3
4 e 5 7
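If you want the R-like behaviour as a reusable one-liner, a small helper (a sketch; the function name is my own) that drops both indexes before concatenating:
import pandas as pd

def cbind(*frames):
    # column-bind dataframes positionally, ignoring their indexes (like R's cbind)
    return pd.concat([f.reset_index(drop=True) for f in frames], axis=1)

test1 = pd.DataFrame([1, 2, 3, 4, 5], columns=['a'])
test2 = pd.DataFrame([4, 2, 1, 3, 7], columns=['b'])
print(cbind(test1, test2))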
