Joining Time Series Dataframes where duplicate columns contain the same values - python-3.x

I'm trying to combine multiple dataframes that contain time series data. These dataframes can have up to 100 columns and roughly 5000 rows. Two sample dataframes are
df1 = pd.DataFrame({'SubjectID': ['A', 'A', 'B', 'C'], 'Date': ['2010-05-08', '2010-05-10', '2010-05-08', '2010-05-08'], 'Test1':[1, 2, 3, 4], 'Gender': ['M', 'M', 'M', 'F'], 'StudyID': [1, 1, 1, 1]})
df2 = pd.DataFrame({'SubjectID': ['A', 'A', 'A', 'B', 'C'], 'Date': ['2010-05-08', '2010-05-09', '2010-05-10', '2010-05-08', '2010-05-09'], 'Test2': [1, 2, 3, 4, 5], 'Gender': ['M', 'M', 'M', 'M', 'F'], 'StudyID': [1, 1, 1, 1, 1]})
df1
SubjectID Date Test1 Gender StudyID
0 A 2010-05-08 1 M 1
1 A 2010-05-10 2 M 1
2 B 2010-05-08 3 M 1
3 C 2010-05-08 4 F 1
df2
SubjectID Date Test2 Gender StudyID
0 A 2010-05-08 1 M 1
1 A 2010-05-09 2 M 1
2 A 2010-05-10 3 M 1
3 B 2010-05-08 4 M 1
4 C 2010-05-09 5 F 1
My expected output is
SubjectID Date Test1 Gender StudyID Test2
0 A 2010-05-08 1.0 M 1.0 1.0
1 A 2010-05-09 NaN M 1.0 2.0
2 A 2010-05-10 2.0 M 1.0 3.0
3 B 2010-05-08 3.0 M 1.0 4.0
4 C 2010-05-08 4.0 F 1.0 NaN
5 C 2010-05-09 NaN F 1.0 5.0
I'm joining the dataframes by
merged_df = df1.set_index(['SubjectID', 'Date']).join(df2.set_index(['SubjectID', 'Date']), how = 'outer', lsuffix = '_l', rsuffix = '_r').reset_index()
but my output is
SubjectID Date Test1 Gender_l StudyID_l Test2 Gender_r StudyID_r
0 A 2010-05-08 1.0 M 1.0 1.0 M 1.0
1 A 2010-05-09 NaN NaN NaN 2.0 M 1.0
2 A 2010-05-10 2.0 M 1.0 3.0 M 1.0
3 B 2010-05-08 3.0 M 1.0 4.0 M 1.0
4 C 2010-05-08 4.0 F 1.0 NaN NaN NaN
5 C 2010-05-09 NaN NaN NaN 5.0 F 1.0
Is there a way to combine columns while joining the dataframes if all the values in both dataframes are equal? I can do it after the merge, but that will get tedious for my large datasets.

It depends on how you want to resolve information that may not match exactly. If you had merged several frames, I think taking the modal value is appropriate. Taking your merged_df, we can resolve it as:
merged_df = merged_df.groupby([x.split('_')[0] for x in merged_df.columns], axis=1).apply(lambda x: x.mode(axis=1)[0])
Date Gender StudyID SubjectID Test1 Test2
0 2010-05-08 M 1.0 A 1.0 1.0
1 2010-05-09 M 1.0 A NaN 2.0
2 2010-05-10 M 1.0 A 2.0 3.0
3 2010-05-08 M 1.0 B 3.0 4.0
4 2010-05-08 F 1.0 C 4.0 NaN
5 2010-05-09 F 1.0 C NaN 5.0
Or perhaps you want to give priority to the non-null values in the first frame; then this is .combine_first.
df1.set_index(['SubjectID', 'Date']).combine_first(df2.set_index(['SubjectID', 'Date']))
Gender StudyID Test1 Test2
SubjectID Date
A 2010-05-08 M 1.0 1.0 1.0
2010-05-09 M 1.0 NaN 2.0
2010-05-10 M 1.0 2.0 3.0
B 2010-05-08 M 1.0 3.0 4.0
C 2010-05-08 F 1.0 4.0 NaN
2010-05-09 F 1.0 NaN 5.0
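If the overlapping columns (Gender and StudyID here) are guaranteed to hold the same values wherever both frames share a row, another option is to put every common column into the join keys, so the suffixed duplicates never appear at all. A minimal sketch, under that assumption:
# Assumes the shared columns really are identical for matching rows.
common = df1.columns.intersection(df2.columns).tolist()  # SubjectID, Date, Gender, StudyID
merged_df = df1.merge(df2, on=common, how='outer')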
If you have to merge many DataFrames, it may be best to use reduce from functools.
from functools import reduce
merged_df = reduce(lambda l, r: l.merge(r, on=['SubjectID', 'Date'], how='outer', suffixes=['_l', '_r']),
                   [df1, df2, df1, df2, df2])
You'll have lots of overlapping columns, but you can still resolve them:
merged_df.groupby([x.split('_')[0] for x in merged_df.columns], axis=1).apply(lambda x: x.mode(axis=1)[0])
Date Gender StudyID SubjectID Test1 Test2
0 2010-05-08 M 1.0 A 1.0 1.0
1 2010-05-10 M 1.0 A 2.0 3.0
2 2010-05-08 M 1.0 B 3.0 4.0
3 2010-05-08 F 1.0 C 4.0 NaN
4 2010-05-09 M 1.0 A NaN 2.0
5 2010-05-09 F 1.0 C NaN 5.0
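If the first-non-null priority of .combine_first is acceptable, the reduce idea can be combined with it so that suffixed duplicates never appear and no post-merge resolution is needed. A minimal sketch, under that assumption:
from functools import reduce

# Align every frame on the same key index, then let the first non-null value win.
frames = [d.set_index(['SubjectID', 'Date']) for d in [df1, df2, df1, df2, df2]]
merged_df = reduce(lambda l, r: l.combine_first(r), frames).reset_index()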

Related

Get value from grouped data frame maximum in another column

Return and assign the value of column c based on the maximum value of column b in a grouped data frame, grouped on column a.
I'm trying to return and assign (ideally in a different column) the value of column c, determined by the row with the maximum value in column b (which holds dates). The dataframe is grouped by column a, but I keep getting errors. The data frame is relatively large, with over 16,000 rows.
import numpy
import pandas as pd

df = pd.DataFrame({'a': ['a', 'a', 'a', 'b', 'b', 'b', 'c', 'c', 'c'],
                   'b': ['2008-11-01', '2022-07-01', '2017-02-01', '2017-02-01', '2018-02-01', '2008-11-01', '2014-11-01', '2008-11-01', '2022-07-01'],
                   'c': [numpy.NaN, 6, 8, 2, 1, numpy.NaN, 6, numpy.NaN, 7]})
df['b'] = pd.to_datetime(df['b'])
df
a b c
0 a 2008-11-01 NaN
1 a 2022-07-01 6.0
2 a 2017-02-01 8.0
3 b 2017-02-01 2.0
4 b 2018-02-01 1.0
5 b 2008-11-01 NaN
6 c 2014-11-01 6.0
7 c 2008-11-01 NaN
8 c 2022-07-01 7.0
I want the following result:
a b c d
0 a 2008-11-01 NaN 8.0
1 a 2022-07-01 6.0 8.0
2 a 2017-02-01 8.0 8.0
3 b 2017-02-01 2.0 1.0
4 b 2018-02-01 1.0 1.0
5 b 2008-11-01 NaN 1.0
6 c 2014-11-01 6.0 7.0
7 c 2008-11-01 NaN 7.0
8 c 2022-07-01 7.0 7.0
I've tried the following, but a grouped series ('SeriesGroupBy') cannot be cast to another dtype:
df['d'] = df.groupby('a').b.astype(np.int64).max()[df.c].reset_index()
I've set column b to int before running the above code, but get an error saying none of the values in c are in the index.
None of [Index([....values in column c...\n dtype = 'object', name = 'a', length = 16280)] are in the [index]
I've tried expanding out the command, but it errors (which I just realized relates to no alternative value being given in the where function, but I don't know how to give the correct value):
grouped = df.groupby('a')
df['d'] = grouped['c'].where(grouped['b'] == grouped['b'].max())
I've also tried using solutions provided here, here, and here.
Any help or direction would be appreciated.
I'm assuming the first three values should be 6.0 (because the maximum date for group a is 2022-07-01, with value 6):
df["d"] = df["a"].map(
df.groupby("a").apply(lambda x: x.loc[x["b"].idxmax(), "c"])
)
print(df)
Prints:
a b c d
0 a 2008-11-01 NaN 6.0
1 a 2022-07-01 6.0 6.0
2 a 2017-02-01 8.0 6.0
3 b 2017-02-01 2.0 1.0
4 b 2018-02-01 1.0 1.0
5 b 2008-11-01 NaN 1.0
6 c 2014-11-01 6.0 7.0
7 c 2008-11-01 NaN 7.0
8 c 2022-07-01 7.0 7.0
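A hedged alternative that avoids groupby.apply: idxmax picks the row label of the latest date per group, and the matching c values are mapped back onto column a. A sketch of the same logic:
# Row label of the maximum date per group, then map that row's 'c' back to 'a'.
latest_c = df.loc[df.groupby("a")["b"].idxmax()].set_index("a")["c"]
df["d"] = df["a"].map(latest_c)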

Copying columns that have NaN values in them and adding a prefix

I have a number of columns that contain NaN values.
With the following code I can check which ones:
for index, value in df.iteritems():
    if value.isnull().values.any():
This shows me, as Boolean values, which columns have NaN.
If True, I need to create a new column whose name has the prefix 'Interpolation' plus the name of that column.
So, to make it clear: if a column with the name 'XXX' has NaN, I need to create a new column with the name 'Interpolation XXX'.
Any ideas how to do this?
Something like this:
In [80]: df = pd.DataFrame({'XXX':[1,2,np.nan,4], 'YYY':[1,2,3,4], 'ZZZ':[1,np.nan, np.nan, 4]})
In [81]: df
Out[81]:
XXX YYY ZZZ
0 1.0 1 1.0
1 2.0 2 NaN
2 NaN 3 NaN
3 4.0 4 4.0
In [92]: nan_cols = df.columns[df.isna().any()].tolist()
In [94]: for col in df.columns:
...: if col in nan_cols:
...: df['Interpolation ' + col ] = df[col]
...:
In [95]: df
Out[95]:
XXX YYY ZZZ Interpolation XXX Interpolation ZZZ
0 1.0 1 1.0 1.0 1.0
1 2.0 2 NaN 2.0 NaN
2 NaN 3 NaN NaN NaN
3 4.0 4 4.0 4.0 4.0
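A hedged, loop-free variant, starting again from the original df: select the columns that contain NaN, prefix their names with add_prefix, and concatenate the copies back onto the frame.
nan_cols = df.columns[df.isna().any()]
df = pd.concat([df, df[nan_cols].add_prefix('Interpolation ')], axis=1)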

Pandas, how to dropna values using subset with multiindex dataframe?

I have a data frame with multi-index columns.
From this data frame I need to remove the rows with NaN values in a subset of columns.
I am trying to use the subset option of pd.dropna, but I cannot find a way to specify the subset of columns. I have tried using pd.IndexSlice, but this does not work.
In the example below I need to get rid of the last row.
import pandas as pd
# ---
a = [1, 1, 2, 2, 3, 3]
b = ["a", "b", "a", "b", "a", "b"]
col = pd.MultiIndex.from_arrays([a[:], b[:]])
val = [
    [1, 2, 3, 4, 5, 6],
    [None, None, 1, 2, 3, 4],
    [None, 1, 2, 3, 4, 5],
    [None, None, 5, 3, 3, 2],
    [None, None, None, None, 5, 7],
]
# ---
df = pd.DataFrame(val, columns=col)
# ---
print(df)
# ---
idx = pd.IndexSlice
df.dropna(axis=0, how="all", subset=idx[1:2, :])
# ---
print(df)
Using the thresh option is an alternative, but if possible I would like to use subset and how='all'.
When dealing with a MultiIndex, each column of the MultiIndex can be specified as a tuple:
In [67]: df.dropna(axis=0, how="all", subset=[(1, 'a'), (1, 'b'), (2, 'a'), (2, 'b')])
Out[67]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
Or, to select all columns whose first level equals 1 or 2 you could use:
In [69]: df.dropna(axis=0, how="all", subset=df.loc[[], [1,2]].columns)
Out[69]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
df[[1,2]].columns also works, but this returns a (possibly large) intermediate DataFrame. df.loc[[], [1,2]].columns is more memory-efficient since its intermediate DataFrame is empty.
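If you prefer to stay close to the original pd.IndexSlice attempt, here is a hedged sketch: the slice cannot be passed to subset directly, but it can build the column list for dropna (this assumes the column MultiIndex is lexsorted, as it is in this example, so label slicing works).
idx = pd.IndexSlice
cols = df.loc[[], idx[1:2, :]].columns
df.dropna(axis=0, how="all", subset=cols)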
If you want to apply the dropna to the columns whose first level is 1 or 2, you can do it as follows:
cols= [(c0, c1) for (c0, c1) in df.columns if c0 in [1,2]]
df.dropna(axis=0, how="all", subset=cols)
If applied to your data, it results in:
Out[446]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6
1 NaN NaN 1.0 2.0 3 4
2 NaN 1.0 2.0 3.0 4 5
3 NaN NaN 5.0 3.0 3 2
As you can see, the last row (index=4) is gone, because all columns under 1 and 2 were NaN in that row. If you would rather remove every row in which any of these columns has a NaN, you need:
df.dropna(axis=0, how="any", subset=cols)
Which results in:
Out[447]:
1 2 3
a b a b a b
0 1.0 2.0 3.0 4.0 5 6

How do I get nlargest rows without the sorting?

I need to extract the n-smallest rows of a pandas df, but it is very important to me to maintain the original order of rows.
code example:
import pandas as pd
df = pd.DataFrame({
    'a': [1, 10, 8, 11, -1],
    'b': list('abdce'),
    'c': [1.0, 2.0, 1.5, 3.0, 4.0]})
df.nsmallest(3, 'a')
Gives:
a b c
4 -1 e 4.0
0 1 a 1.0
2 8 d 1.5
I need:
a b c
0 1 a 1.0
2 8 d 1.5
4 -1 e 4.0
Any ideas how to do that?
PS: In my real example, the index is not sorted/sortable, as the labels are strings (names).
Simplest approach, assuming the index was sorted to begin with:
df.nsmallest(3, 'a').sort_index()
a b c
0 1 a 1.0
2 8 d 1.5
4 -1 e 4.0
Alternatively, with np.argpartition and iloc. This doesn't depend on sorting the index:
import numpy as np

df.iloc[np.sort(df.a.values.argpartition(3)[:3])]
a b c
0 1 a 1.0
2 8 d 1.5
4 -1 e 4.0
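A further hedged variant, assuming the index labels are unique: build a boolean mask from the nsmallest index. The mask keeps the original row order without sorting anything, so it also works when the index holds unsortable strings.
df[df.index.isin(df.nsmallest(3, 'a').index)]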

Pandas Pivot and Summarize For Multiple Rows Vertically

Given the following data frame:
import numpy as np
import pandas as pd
df = pd.DataFrame({'Site': ['a', 'a', 'a', 'b', 'b', 'b'],
                   'x': [1, 1, 0, 1, 0, 0],
                   'y': [1, np.nan, 0, 1, 1, 0]
                   })
df
Site y x
0 a 1.0 1
1 a NaN 1
2 a 0.0 0
3 b 1.0 1
4 b 1.0 0
5 b 0.0 0
I am looking for the most efficient way, for each numerical column (y and x), to produce a percentage per group, label it with the column name, and stack the results into one column.
Here's how I accomplish this for 'y':
df=df.loc[~np.isnan(df['y'])] #do not count non-numbers
t=pd.pivot_table(df,index='Site',values='y',aggfunc=[np.sum,len])
t['Item']='y'
t['Perc']=round(t['sum']/t['len']*100,1)
t
sum len Item Perc
Site
a 1.0 2.0 y 50.0
b 2.0 3.0 y 66.7
Now all I need is a way to add two more rows to this: the results for 'x', had I pivoted on its values above, like this:
sum len Item Perc
Site
a 1.0 2.0 y 50.0
b 2.0 3.0 y 66.7
a 1 2 x 50.0
b 1 3 x 33.3
In reality, I have 48 such numerical data columns that need to be stacked as such.
Thanks in advance!
First you can use notnull. Then omit the values parameter in pivot_table, stack, and sort_values by the new column Item. Last, you can use the pandas round function:
df = df.loc[df['y'].notnull()]
t = (pd.pivot_table(df, index='Site', aggfunc=[sum, len])
       .stack()
       .reset_index(level=1)
       .rename(columns={'level_1': 'Item'})
       .sort_values('Item', ascending=False))
t['Perc'] = (t['sum'] / t['len'] * 100).round(1)
# reorder columns
t = t[['sum', 'len', 'Item', 'Perc']]
print(t)
sum len Item Perc
Site
a 1.0 2.0 y 50.0
b 2.0 3.0 y 66.7
a 1.0 2.0 x 50.0
b 1.0 3.0 x 33.3
Another solution, if it is necessary to define the values columns in pivot_table:
df = df.loc[df['y'].notnull()]
t = (pd.pivot_table(df, index='Site', values=['y', 'x'], aggfunc=[sum, len])
       .stack()
       .reset_index(level=1)
       .rename(columns={'level_1': 'Item'})
       .sort_values('Item', ascending=False))
t['Perc'] = (t['sum'] / t['len'] * 100).round(1)
# reorder columns
t = t[['sum', 'len', 'Item', 'Perc']]
print(t)
sum len Item Perc
Site
a 1.0 2.0 y 50.0
b 2.0 3.0 y 66.7
a 1.0 2.0 x 50.0
b 1.0 3.0 x 33.3
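Since the real data has 48 such value columns, a hedged sketch that avoids pivoting per column by reshaping to long form first (the column name 'val' is my own):
# Reshape to long form, drop missing values per column, then aggregate once.
long_df = (df.melt(id_vars='Site', var_name='Item', value_name='val')
             .dropna(subset=['val']))
t = (long_df.groupby(['Site', 'Item'])['val']
            .agg(['sum', 'count'])
            .rename(columns={'count': 'len'}))
t['Perc'] = (t['sum'] / t['len'] * 100).round(1)
print(t[['sum', 'len', 'Perc']])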

Resources