Python - Display range of dates by id in separate rows - python-3.x

I have some employee data that shows the list of dates for which they have requested leave:
emp_id,emp_name,from_date,to_date
101,kevin,2018-12-01,2018-12-05
104,scott,2018-12-02,2018-12-02
I am trying to have the above format converted such that each date in the above sample is shown as a separate row, as shown below:
emp_id,emp_name,date
101,kevin,2018-12-01
101,kevin,2018-12-02
101,kevin,2018-12-03
101,kevin,2018-12-04
101,kevin,2018-12-05
104,scott,2018-12-02
Could anyone advise how I could have this done in pandas? Thanks.

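The answers below assume df already holds the sample above with the date columns parsed as datetimes; a minimal setup sketch (not part of the original answers) would be:
import pandas as pd

# sample data from the question; parse the leave dates so they can be resampled
df = pd.DataFrame({'emp_id': [101, 104],
                   'emp_name': ['kevin', 'scott'],
                   'from_date': ['2018-12-01', '2018-12-02'],
                   'to_date': ['2018-12-05', '2018-12-02']})
df['from_date'] = pd.to_datetime(df['from_date'])
df['to_date'] = pd.to_datetime(df['to_date'])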
Solution if emp_id values are unique - reshape by melt and resample with ffill:
df1 = (df.melt(['emp_id','emp_name'], value_name='date')
         .set_index('date')
         .drop('variable', axis=1)
         .groupby(['emp_id', 'emp_name'])
         .resample('d')[[]]
         .ffill()
         .reset_index()
      )
print (df1)
   emp_id emp_name       date
0     101    kevin 2018-12-01
1     101    kevin 2018-12-02
2     101    kevin 2018-12-03
3     101    kevin 2018-12-04
4     101    kevin 2018-12-05
5     104    scott 2018-12-02
Another solution - more general, it only requires a default RangeIndex:
#default RangeIndex
#df = df.reset_index(drop=True)
df1 = (df.reset_index()
         .melt(['emp_id','emp_name','index'], value_name='date')
         .set_index('date')
         .drop('variable', axis=1)
         .groupby(['index'])
         .resample('d')[['emp_id','emp_name']]
         .ffill()
         .reset_index(level=0, drop=True)
         .reset_index()
      )
Or use concat of Series created by date_range with itertuples and then join:
df1 = (pd.concat([pd.Series(r.Index,
                            pd.date_range(r.from_date, r.to_date))
                  for r in df.itertuples()])
         .reset_index())
df1.columns = ['date','idx']
df1 = df1.set_index('idx').join(df[['emp_id','emp_name']]).reset_index(drop=True)
print (df1)
        date  emp_id emp_name
0 2018-12-01     101    kevin
1 2018-12-02     101    kevin
2 2018-12-03     101    kevin
3 2018-12-04     101    kevin
4 2018-12-05     101    kevin
5 2018-12-02     104    scott

You can iterate over each row:
df_dates = pd.concat([pd.DataFrame({'Date': pd.date_range(row.from_date, row.to_date, freq='D'),
                                    'Emp_id': row.emp_id,
                                    'Emp_Name': row.emp_name},
                                   columns=['Date', 'Emp_id', 'Emp_Name'])
                      for i, row in df.iterrows()], ignore_index=True)
print(df_dates)
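On pandas 0.25 or newer you could also reach for DataFrame.explode; this is just an extra sketch along the same lines, not one of the original answers:
# build a DatetimeIndex per row, then explode it into one row per date
df_exploded = (df.assign(date=[pd.date_range(s, e)
                               for s, e in zip(df['from_date'], df['to_date'])])
                 .explode('date')
                 .drop(columns=['from_date', 'to_date'])
                 .reset_index(drop=True))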

Related

Update dataframe cells according to matching cells within another dataframe in pandas [duplicate]

I have two dataframes in Python. I want to update rows in the first dataframe using matching values from the second dataframe, which serves as an override.
Here is an example with sample data and code (DataFrame 1 and DataFrame 2 are built below).
I want to update DataFrame 1 based on matching Code and Name. In this example, the row with Code = 2 and Name = Company2 should end up with value 1000 (coming from DataFrame 2).
import pandas as pd

data1 = {
    'Code': [1, 2, 3],
    'Name': ['Company1', 'Company2', 'Company3'],
    'Value': [200, 300, 400],
}
df1 = pd.DataFrame(data1, columns=['Code', 'Name', 'Value'])

data2 = {
    'Code': [2],
    'Name': ['Company2'],
    'Value': [1000],
}
df2 = pd.DataFrame(data2, columns=['Code', 'Name', 'Value'])
Any pointers or hints?
Using DataFrame.update, which aligns on indices (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.update.html):
>>> df1.set_index('Code', inplace=True)
>>> df1.update(df2.set_index('Code'))
>>> df1.reset_index() # to recover the initial structure
   Code      Name   Value
0     1  Company1   200.0
1     2  Company2  1000.0
2     3  Company3   400.0
You can use concat + drop_duplicates, which updates the common rows and adds the new rows from df2:
pd.concat([df1,df2]).drop_duplicates(['Code','Name'],keep='last').sort_values('Code')
Out[1280]:
   Code      Name  Value
0     1  Company1    200
0     2  Company2   1000
2     3  Company3    400
Update due to the comments below:
df1.set_index(['Code', 'Name'], inplace=True)
df1.update(df2.set_index(['Code', 'Name']))
df1.reset_index(drop=True, inplace=True)
You can merge the data first and then use numpy.where:
import numpy as np

updated = df1.merge(df2, how='left', on=['Code', 'Name'], suffixes=('', '_new'))
updated['Value'] = np.where(pd.notnull(updated['Value_new']), updated['Value_new'], updated['Value'])
updated.drop('Value_new', axis=1, inplace=True)
   Code      Name   Value
0     1  Company1   200.0
1     2  Company2  1000.0
2     3  Company3   400.0
There is an update function available, for example:
df1.update(df2)
for more info:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html
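Keep in mind that update aligns on index labels, not on the Code column, so with the default RangeIndex here df1.update(df2) would overwrite row 0 of df1 with df2's only row. A minimal sketch that makes the alignment explicit:
df1 = df1.set_index('Code')
df1.update(df2.set_index('Code'))   # rows are matched on the 'Code' label, non-NaN values win
df1 = df1.reset_index()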
You can align indices and then use combine_first:
res = df2.set_index(['Code', 'Name'])\
.combine_first(df1.set_index(['Code', 'Name']))\
.reset_index()
print(res)
# Code Name Value
# 0 1 Company1 200.0
# 1 2 Company2 1000.0
# 2 3 Company3 400.0
Assuming the company (Name) and Code are redundant identifiers, you can also do:
import pandas as pd
vdic = pd.Series(df2.Value.values, index=df2.Name).to_dict()
df1.loc[df1.Name.isin(vdic.keys()), 'Value'] = df1.loc[df1.Name.isin(vdic.keys()), 'Name'].map(vdic)
# Code Name Value
#0 1 Company1 200
#1 2 Company2 1000
#2 3 Company3 400
You can use pd.Series.where on the result of left-joining df1 and df2
merged = df1.merge(df2, on=['Code', 'Name'], how='left')
df1.Value = merged.Value_y.where(~merged.Value_y.isnull(), df1.Value)
>>> df1
   Code      Name   Value
0     1  Company1   200.0
1     2  Company2  1000.0
2     3  Company3   400.0
You can change the line to
df1.Value = merged.Value_y.where(~merged.Value_y.isnull(), df1.Value).astype(int)
in order to return the value as an integer.
There's something I often do. I merge 'left' first:
df_merged = pd.merge(df1, df2, how='left', on='Code')
Pandas will create columns with suffix '_x' (for your left dataframe) and '_y' (for your right dataframe). You want the ones that came from the right, so just remove any columns with '_x' and rename the '_y' ones:
for col in df_merged.columns:
    if '_x' in col:
        df_merged.drop(columns=col, inplace=True)
    if '_y' in col:
        new_name = col.strip('_y')
        df_merged.rename(columns={col: new_name}, inplace=True)
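One caveat: str.strip('_y') strips any trailing '_' or 'y' characters rather than the literal suffix, so a column such as 'category_y' would come out as 'categor'. A safer variant of the cleanup, assuming Python 3.9+ for str.removesuffix:
# keep everything that did not come from the left frame's duplicated columns
keep = [c for c in df_merged.columns if not c.endswith('_x')]
df_merged = df_merged[keep].rename(
    columns={c: c.removesuffix('_y') for c in keep if c.endswith('_y')})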
Append the second dataframe, drop the duplicates by Code, and sort the values:
combined_df = df1.append(df2).drop_duplicates(['Code'], keep='last').sort_values('Code')
None of the above solutions worked for my particular example, which I think is rooted in the dtype of my columns, but I eventually came to this solution:
indexes = df1.loc[df1.Code.isin(df2.Code.values)].index
df1.at[indexes,'Value'] = df2['Value'].values

Updating multiple columns of df from another df

I have two dataframes, df1 and df2. I want to update some columns (not all) of df1 from the values in df2's columns (the common columns have the same names in both dataframes), based on a key column. df1 can have multiple entries for that key, but in df2 each key has only one entry.
df2:
   party_id  age person_name    col2
0         1   12       abdjc     abc
1         2   35       fAgBS     sfd
2         3   65        Afdc     shd
3         5   34      Afazbf   qfwjk
4         6   78      asgsdb    fdgd
5         7   35       sdgsd  dsfbds
df1:
   party_id  account_id         product_type  age         dob   status    col2
0         1           1              Current   25  28-01-1994   active    sdag
1         2           2              Savings   31  14-07-1988  pending    asdg
2         3           3                Loans   65  22-07-1954   frozen   sgsdf
3         3           4  Over Draft Facility   93  29-01-1927   active  dsfhgd
4         4           5             Mortgage   93  01-03-1926  pending  sdggsd
In this example I want to update age and col2 in df1 based on the values present in df2. The key column here is party_id.
I tried mapping df2 into a dict by the key (column-wise, one column at a time). Here key_name = 'party_id' and column_name = 'age':
dict_key = df2[key_name]
dict_value = df2[column_name]
temp_dict = dict(zip(dict_key, dict_value))
and then map it to df1:
df1[column_name].map(temp_dict).fillna(df1[column_name])
But the issue here is that it only maps one entry for that key value, not all of them. In this example, party_id == 3 has multiple entries in df1.
Keys which are not in df2 should keep their respective values for that column unchanged.
Can anyone help me with an efficient solution, as my df1 is large (more than 500k rows), so that all columns can be updated at the same time?
df2 is of moderate size, around 3k rows.
Thanks
The idea is to use DataFrame.merge with a left join first, then get the columns which are the same in both DataFrames into cols and replace missing values with the original values by DataFrame.fillna:
df = df1.merge(df2.drop_duplicates('party_id'), on='party_id', suffixes=('','_'), how='left')
cols = df2.columns.intersection(df1.columns).difference(['party_id'])
df[cols] = df[cols + '_'].rename(columns=lambda x: x.strip('_')).fillna(df[cols])
df = df[df1.columns]
print (df)
   party_id   age person_name    col2
0         1  25.0       abdjc    sdag
1         2  31.0       fAgBS    asdg
2         3  65.0        Afdc   sgsdf
3         5  34.0      Afazbf   qfwjk
4         6  78.0      asgsdb    fdgd
5         7  35.0       sdgsd  dsfbds
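If the merge feels heavy for a 500k-row df1, a per-column Series.map keyed on party_id is another sketch; 'age' and 'col2' stand in for whichever columns you want to refresh:
lookup = df2.set_index('party_id')          # one row per key in df2

for col in ['age', 'col2']:
    # map every party_id to df2's value; keep the old value where there is no match
    df1[col] = df1['party_id'].map(lookup[col]).fillna(df1[col])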

How to convert column into row?

Assume I have two rows where, for most of the columns, the values are the same, but not for all. I would like to group these two rows into one wherever the values are the same, and where the values differ, create an extra column and assign it a name like 'column1'.
Step 1: Here, assuming the columns which have the same value in both rows are 'a', 'b', 'c' and the columns which have different values are 'd', 'e', 'f', I group by 'a', 'b', 'c' and then unstack 'd', 'e', 'f'.
Step 2: Then I drop the levels and rename the columns to 'a', 'b', 'c', 'd', 'd1', 'e', 'e1', 'f', 'f1'.
But in my actual case I have 500+ columns and a million rows, and I don't know how to extend this to 500+ columns given constraints like:
1) I don't know which columns will have the same values
2) I don't know which columns will have different values that need to be converted into new columns after grouping by the columns that have the same value
df.groupby(['a','b','c'])['d','e','f'].apply(lambda x: pd.DataFrame(x.values)).unstack().reset_index()
df.columns = df.columns.droplevel()
df.columns = ['a','b','c','d','d1','e','e1','f','f1']
To be more clear, the below code creates the sample dataframe & expected output
df = pd.DataFrame({'Cust_id': [100, 100, 101, 101, 102, 103, 104, 104],
                   'gender': ['M', 'M', 'F', 'F', 'M', 'F', 'F', 'F'],
                   'Date': ['01/01/2019', '02/01/2019', '01/01/2019', '01/01/2019',
                            '03/01/2019', '04/01/2019', '03/01/2019', '03/01/2019'],
                   'Product': ['a', 'a', 'b', 'c', 'd', 'd', 'e', 'e']})

expected_output = pd.DataFrame({'Cust_id': [100, 101, 102, 103, 104],
                                'gender': ['M', 'F', 'M', 'F', 'F'],
                                'Date': ['01/01/2019', '01/01/2019', '03/01/2019', '04/01/2019', '03/01/2019'],
                                'Date1': ['02/01/2019', 'NA', 'NA', 'NA', 'NA'],
                                'Product': ['a', 'b', 'd', 'd', 'e'],
                                'Product1': ['NA', 'c', 'NA', 'NA', 'NA']})
You may do the following to get expected_output from df:
s = df.groupby('Cust_id').cumcount().astype(str).replace('0', '')
df1 = df.pivot_table(index=['Cust_id', 'gender'], columns=s, values=['Date', 'Product'], aggfunc='first')
df1.columns = df1.columns.map(''.join)
Out[57]:
                      Date       Date1 Product Product1
Cust_id gender
100     M       01/01/2019  02/01/2019       a        a
101     F       01/01/2019  01/01/2019       b        c
102     M       03/01/2019         NaN       d      NaN
103     F       04/01/2019         NaN       d      NaN
104     F       03/01/2019  03/01/2019       e        e
Next, replace columns having duplicated values with NA
df_expected = df1.where(df1.ne(df1.shift(axis=1)), 'NA').reset_index()
Out[72]:
   Cust_id gender        Date       Date1 Product Product1
0      100      M  01/01/2019  02/01/2019       a       NA
1      101      F  01/01/2019          NA       b        c
2      102      M  03/01/2019          NA       d       NA
3      103      F  04/01/2019          NA       d       NA
4      104      F  03/01/2019          NA       e       NA
You can try this code - it could be a little cleaner but I think it does the job
df = pd.DataFrame({'a': [100, 100], 'b': ['tue', 'tue'], 'c': ['yes', 'yes'],
                   'd': ['ok', 'not ok'], 'e': ['ok', 'maybe'], 'f': [55, 66]})

df_transformed = pd.DataFrame()
for column in df.columns:
    col_vals = df.groupby(column)['b'].count().index.values
    for ix, col_val in enumerate(col_vals):
        temp_df = pd.DataFrame({column + str(ix): [col_val]})
        df_transformed = pd.concat([df_transformed, temp_df], axis=1)
Output for df_transformed
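(The screenshot of df_transformed did not survive; running the snippet above should print roughly a one-row frame, since groupby sorts the unique values of each column, something like:)
    a0   b0   c0      d0  d1     e0  e1  f0  f1
0  100  tue  yes  not ok  ok  maybe  ok  55  66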

Basic indexing on axis with MultiIndex

This is my MultiIndex dataframe, obtained from a groupby, where I have two index levels ['YearMonth', 'product_id'] and the column ['count']. I've tried examples from the documentation and other Stack Overflow suggestions, but I still cannot select product_id == 6818 for each YearMonth in the index.
df = df.groupby(['YearMonth','product_id'])[['count']].sum()
df.head(5)
Out[54]:
count
YearMonth product_id
2017-05-01 6818 3
7394 1 7394 1
8369 1 8369 1
8504 1 8504 1
8666 1 8666 1
In [55]:
df.columns
Out[55]:
Index(['count'], dtype='object')
In [56]:
df.index.names
Out[56]:
FrozenList(['YearMonth', 'product_id'])
In [59]:
df.loc[('2017-05-01',0),'count']
I've tried simple indexing, df['YearMonth'], but it only works with columns, not indexes, and
df.loc/ix/iloc as given in this Stack Overflow question:
df.loc[('2017-05-01',0)]
I always get a KeyError, such as KeyError: ('2017-05-01', 0) or KeyError: 'YearMonth'.
I also tried the unstack method, df.unstack(level=0), and did the same manipulations as written above.
Can someone explain what I am missing? Thanks in advance.
Your sample DF doesn't look "healthy" - I have fixed it so that it looks as follows now:
In [121]: df
Out[121]:
                       count
YearMonth  product_id
2017-05-01 6818            3
           7394            1
           8369            1
           8504            1
           8666            1
Option 1:
In [122]: df.loc[pd.IndexSlice[:, 6818], :]
Out[122]:
                       count
YearMonth  product_id
2017-05-01 6818            3
Option 2: works for named indices
In [145]: df.query("product_id in [6818]")
Out[145]:
                       count
YearMonth  product_id
2017-05-01 6818            3
Option 3:
In [146]: df.loc[(slice(None), 6818), :]
Out[146]:
                       count
YearMonth  product_id
2017-05-01 6818            3
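A cross-section with DataFrame.xs is another way to select the same rows (not one of the original options, just an extra sketch); drop_level=False keeps product_id in the index:
df.xs(6818, level='product_id', drop_level=False)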

roll off profile stacking data frames

I have a dataframe that looks like:
import pandas as pd
import datetime as dt
df = pd.DataFrame({'date': ['2017-12-31', '2017-12-31'],
                   'type': ['Asset', 'Liab'],
                   'Amount': [100, -100],
                   'Maturity Date': ['2019-01-02', '2018-01-01']})
df
I am trying to build a roll-off profile by checking if the 'Maturity Date' is greater than a 'date' in the future. I am trying to achieve something like:
#First Month
df1=df[df['Maturity Date']>'2018-01-31']
df1['date']='2018-01-31'
#Second Month
df2=df[df['Maturity Date']>'2018-02-28']
df2['date']='2018-02-28'
#third Month
df3=df[df['Maturity Date']>'2018-03-31']
df3['date']='2018-03-31'
#first quarter
qf1=df[df['Maturity Date']>'2018-06-30']
qf1['date']='2018-06-30'
#concatenate
df=pd.concat([df,df1,df2,df3,qf1])
df
I was wondering if there is a way to allow an arbitrarily long list of dates without repeating code.
I think you need numpy.tile to repeat the indices and assign them to a new column, then filter by boolean indexing and sort with sort_values:
import numpy as np

d = '2017-12-31'
df['Maturity Date'] = pd.to_datetime(df['Maturity Date'])
#generate first month and next quarters
c1 = pd.date_range(d, periods=4, freq='M')
c2 = pd.date_range(c1[-1], periods=2, freq='Q')
#join together
c = c1.union(c2[1:])
#repeat rows by indexing the repeated index
df1 = df.loc[np.tile(df.index, len(c))].copy()
#assign column by datetimes
df1['date'] = np.repeat(c, len(df))
#filter by boolean indexing
df1 = df1[df1['Maturity Date'] > df1['date']]
print (df1)
   Amount Maturity Date       date   type
0     100    2019-01-02 2017-12-31  Asset
1    -100    2018-01-01 2017-12-31   Liab
0     100    2019-01-02 2018-01-31  Asset
0     100    2019-01-02 2018-02-28  Asset
0     100    2019-01-02 2018-03-31  Asset
0     100    2019-01-02 2018-06-30  Asset
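An equivalent sketch without numpy, using a cross join of the report dates against df and then filtering (assumes pandas 1.2+ for how='cross'; c and df are the objects built above):
dates = pd.DataFrame({'date': c})                       # the report dates built above
rolled = df.drop(columns='date').merge(dates, how='cross')
rolled = rolled[rolled['Maturity Date'] > rolled['date']]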
You could use a nifty tool in the Pandas arsenal called pd.merge_asof. It works similarly to pd.merge, except that it matches on "nearest" keys rather than equal keys. Furthermore, you can tell pd.merge_asof to look for nearest keys in only the backward or forward direction.
To make things interesting (and help check that things are working properly), let's add another row to df:
df = pd.DataFrame({'date': ['2017-12-31', '2017-12-31'],
                   'type': ['Asset', 'Asset'],
                   'Amount': [100, 200],
                   'Maturity Date': ['2019-01-02', '2018-03-15']})
for col in ['date', 'Maturity Date']:
df[col] = pd.to_datetime(df[col])
df = df.sort_values(by='Maturity Date')
print(df)
#    Amount Maturity Date       date   type
# 1     200    2018-03-15 2017-12-31  Asset
# 0     100    2019-01-02 2017-12-31  Asset
Now define some new dates:
dates = (pd.date_range('2018-01-31', periods=3, freq='M')
.union(pd.date_range('2018-01-1', periods=2, freq='Q')))
result = pd.DataFrame({'date': dates})
#         date
# 0 2018-01-31
# 1 2018-02-28
# 2 2018-03-31
# 3 2018-06-30
Now we can merge rows, matching nearest dates from result with Maturity Dates from df:
result = pd.merge_asof(result, df.drop('date', axis=1),
left_on='date', right_on='Maturity Date', direction='forward')
In this case we want to "match" dates with Maturity Dates which are greater, so we use direction='forward'.
Putting it all together:
import pandas as pd
df = pd.DataFrame({'date': ['2017-12-31', '2017-12-31'],
                   'type': ['Asset', 'Asset'],
                   'Amount': [100, 200],
                   'Maturity Date': ['2019-01-02', '2018-03-15']})
for col in ['date', 'Maturity Date']:
df[col] = pd.to_datetime(df[col])
df = df.sort_values(by='Maturity Date')
dates = (pd.date_range('2018-01-31', periods=3, freq='M')
.union(pd.date_range('2018-01-1', periods=2, freq='Q')))
result = pd.DataFrame({'date': dates})
result = pd.merge_asof(result, df.drop('date', axis=1),
left_on='date', right_on='Maturity Date', direction='forward')
result = pd.concat([df, result], axis=0)
result = result.sort_values(by=['Maturity Date', 'date'])
print(result)
yields
   Amount Maturity Date       date   type
1     200    2018-03-15 2017-12-31  Asset
0     200    2018-03-15 2018-01-31  Asset
1     200    2018-03-15 2018-02-28  Asset
0     100    2019-01-02 2017-12-31  Asset
2     100    2019-01-02 2018-03-31  Asset
3     100    2019-01-02 2018-06-30  Asset
