This is my MultiIndex dataframe obtained from groupby, where I have two index levels ['YearMonth', 'product_id'] and a single column ['count']. I've tried examples from the documentation and other Stack Overflow suggestions, but I still cannot select product_id == 6818 for each YearMonth index value.
df = df.groupby(['YearMonth','product_id'])[['count']].sum()
df.head(5)
Out[54]:
count
YearMonth product_id
2017-05-01 6818 3
7394 1 7394 1
8369 1 8369 1
8504 1 8504 1
8666 1 8666 1
In [55]:
df.columns
Out[55]:
Index(['count'], dtype='object')
In [56]:
df.index.names
Out[56]:
FrozenList(['YearMonth', 'product_id'])
In [59]:
df.loc[('2017-05-01',0),'count']
I've tried:
simple indexing df['YearMonth'], but it only works with columns, not indexes
df.loc / .ix / .iloc, as given in this Stack Overflow question:
df.loc[('2017-05-01', 0)]
I always get a KeyError, such as KeyError: ('2017-05-01', 0) or KeyError: 'YearMonth'.
I also tried the unstack method, df.unstack(level=0), and did the same manipulations as written above.
Can someone explain what I am missing? Thanks in advance.
Your sample DF doesn't look "healthy" - I have fixed it, so it now looks as follows:
In [121]: df
Out[121]:
count
YearMonth product_id
2017-05-01 6818 3
7394 1
8369 1
8504 1
8666 1
Option 1:
In [122]: df.loc[pd.IndexSlice[:, 6818], :]
Out[122]:
count
YearMonth product_id
2017-05-01 6818 3
Option 2: works for named index levels
In [145]: df.query("product_id in [6818]")
Out[145]:
count
YearMonth product_id
2017-05-01 6818 3
Option 3:
In [146]: df.loc[(slice(None), 6818), :]
Out[146]:
count
YearMonth product_id
2017-05-01 6818 3
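Option 4: a minimal sketch using DataFrame.xs, which selects by a value on a named index level (drop_level=False keeps 'product_id' in the result); this is not from the original answer, but it produces the same selection on the df above:
df.xs(6818, level='product_id', drop_level=False)
count
YearMonth product_id
2017-05-01 6818 3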
I have two dataframes in Python. I want to update rows in the first dataframe using matching values from the second dataframe, which serves as an override.
Here is an example with the same data and code:
DataFrame 1:
DataFrame 2:
I want to update dataframe 1 based on matching Code and Name. In this example, dataframe 1 should be updated as below:
Note: the row with Code = 2 and Name = Company2 is updated with the value 1000 (coming from dataframe 2)
import pandas as pd
data1 = {
'Code': [1, 2, 3],
'Name': ['Company1', 'Company2', 'Company3'],
'Value': [200, 300, 400],
}
df1 = pd.DataFrame(data1, columns= ['Code','Name','Value'])
data2 = {
'Code': [2],
'Name': ['Company2'],
'Value': [1000],
}
df2 = pd.DataFrame(data2, columns= ['Code','Name','Value'])
Any pointers or hints?
Using DataFrame.update, which aligns on indices (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.update.html):
>>> df1.set_index('Code', inplace=True)
>>> df1.update(df2.set_index('Code'))
>>> df1.reset_index() # to recover the initial structure
Code Name Value
0 1 Company1 200.0
1 2 Company2 1000.0
2 3 Company3 400.0
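Note that update writes the merged column in as floats, which is why Value shows 200.0 above; if you want integers back, one extra line will do (a follow-up sketch, not part of the original answer):
>>> df1['Value'] = df1['Value'].astype(int)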
You can use concat + drop_duplicates, which updates the common rows and adds the new rows from df2:
pd.concat([df1,df2]).drop_duplicates(['Code','Name'],keep='last').sort_values('Code')
Out[1280]:
Code Name Value
0 1 Company1 200
0 2 Company2 1000
2 3 Company3 400
Update due to below comments
df1.set_index(['Code', 'Name'], inplace=True)
df1.update(df2.set_index(['Code', 'Name']))
df1.reset_index(drop=True, inplace=True)
You can merge the data first and then use numpy.where to keep the overriding value where one exists:
import numpy as np

updated = df1.merge(df2, how='left', on=['Code', 'Name'], suffixes=('', '_new'))
updated['Value'] = np.where(pd.notnull(updated['Value_new']), updated['Value_new'], updated['Value'])
updated.drop('Value_new', axis=1, inplace=True)
Code Name Value
0 1 Company1 200.0
1 2 Company2 1000.0
2 3 Company3 400.0
There is an update function available.
Example:
df1.update(df2)
For more info:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.update.html
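Be aware that update aligns on the index, so with the default RangeIndex the call above would overwrite df1's first row with df2's first row regardless of the Code value. A minimal sketch of the safe pattern (the same idea as the set_index answer above):
df1 = df1.set_index('Code')        # align on Code instead of row position
df1.update(df2.set_index('Code'))  # overwrites matching rows in place
df1 = df1.reset_index()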
You can align indices and then use combine_first:
res = df2.set_index(['Code', 'Name'])\
.combine_first(df1.set_index(['Code', 'Name']))\
.reset_index()
print(res)
# Code Name Value
# 0 1 Company1 200.0
# 1 2 Company2 1000.0
# 2 3 Company3 400.0
Assuming Name and Code are redundant identifiers, you can also do:
import pandas as pd
vdic = pd.Series(df2.Value.values, index=df2.Name).to_dict()
df1.loc[df1.Name.isin(vdic.keys()), 'Value'] = df1.loc[df1.Name.isin(vdic.keys()), 'Name'].map(vdic)
# Code Name Value
#0 1 Company1 200
#1 2 Company2 1000
#2 3 Company3 400
You can use pd.Series.where on the result of left-joining df1 and df2
merged = df1.merge(df2, on=['Code', 'Name'], how='left')
df1.Value = merged.Value_y.where(~merged.Value_y.isnull(), df1.Value)
>>> df1
Code Name Value
0 1 Company1 200.0
1 2 Company2 1000.0
2 3 Company3 400.0
You can change the line to
df1.Value = merged.Value_y.where(~merged.Value_y.isnull(), df1.Value).astype(int)
in order to return the value as an integer.
There's something I often do.
I merge 'left' first:
df_merged = pd.merge(df1, df2, how = 'left', on = 'Code')
Pandas will create columns with the suffix '_x' (for your left dataframe) and '_y' (for your right dataframe).
You want the ones that came from the right, so just remove any columns with '_x' and rename the '_y' ones:
for col in df_merged.columns:
    if '_x' in col:
        df_merged.drop(columns=col, inplace=True)
    if '_y' in col:
        # slice off the trailing '_y'; str.strip('_y') would strip any of
        # those characters from both ends (e.g. 'day_y' -> 'da'), not the suffix
        new_name = col[:-2]
        df_merged.rename(columns={col: new_name}, inplace=True)
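A sketch of the same idea that avoids the renaming loop by choosing the suffixes up front (assuming the same df1/df2 as in the question):
# give the left-hand duplicates a suffix and keep the right-hand names clean
df_merged = pd.merge(df1, df2, how='left', on='Code', suffixes=('_old', ''))
df_merged = df_merged.drop(columns=[c for c in df_merged.columns if c.endswith('_old')])
Like the loop above, this leaves NaN in rows that have no match in df2, so it suits full-override data rather than partial updates.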
Append the dataset, drop the duplicates by Code, and sort the values:
combined_df = df1.append(df2).drop_duplicates(['Code'], keep='last').sort_values('Code')
(On newer pandas, where DataFrame.append has been removed, use pd.concat([df1, df2]) instead.)
None of the above solutions worked for my particular example, which I think is rooted in the dtype of my columns, but I eventually came to this solution:
indexes = df1.loc[df1.Code.isin(df2.Code.values)].index
# .loc is needed here: .at only accepts a single label, not an index of labels.
# Note this assumes df2's rows are in the same order as the matches in df1.
df1.loc[indexes, 'Value'] = df2['Value'].values
I have some employee data that shows the list of dates for which they have requested leave
emp_id,emp_name,from_date,to_date
101,kevin,2018-12-01,2018-12-05
104,scott,2018-12-02,2018-12-02
I am trying to have the above format converted such that each date in the above sample is shown as a separate row, as below:
emp_id,emp_name,date
101,kevin,2018-12-01
101,kevin,2018-12-02
101,kevin,2018-12-03
101,kevin,2018-12-04
101,kevin,2018-12-05
104,scott,2018-12-02
Could anyone advise how this could be done in pandas? Thanks.
Solution if emp_id values are unique - reshape with melt and resample with ffill:
df1 = (df.melt(['emp_id','emp_name'], value_name='date')
.set_index('date')
.drop('variable', axis=1)
.groupby(['emp_id', 'emp_name'])
.resample('d')[[]]
.ffill()
.reset_index()
)
print (df1)
emp_id emp_name date
0 101 kevin 2018-12-01
1 101 kevin 2018-12-02
2 101 kevin 2018-12-03
3 101 kevin 2018-12-04
4 101 kevin 2018-12-05
5 104 scott 2018-12-02
Another solution - more general; it only requires a default RangeIndex:
#default RangeIndex
#df = df.reset_index(drop=True)
df1 = (df.reset_index()
.melt(['emp_id','emp_name','index'], value_name='date')
.set_index('date')
.drop('variable', axis=1)
.groupby(['index'])
.resample('d')[['emp_id','emp_name']]
.ffill()
.reset_index(level=0, drop=True)
.reset_index()
)
Or use concat with Series created by date_range and itertuples, and then join:
df1 = (pd.concat([pd.Series(r.Index,
pd.date_range(r.from_date,r.to_date))
for r in df.itertuples()])
.reset_index())
df1.columns = ['date','idx']
df1 = df1.set_index('idx').join(df[['emp_id','emp_name']]).reset_index(drop=True)
print (df1)
date emp_id emp_name
0 2018-12-01 101 kevin
1 2018-12-02 101 kevin
2 2018-12-03 101 kevin
3 2018-12-04 101 kevin
4 2018-12-05 101 kevin
5 2018-12-02 104 scott
You can iterate over each row:
df_dates = pd.concat([pd.DataFrame({'Date': pd.date_range(row.from_date, row.to_date, freq='D'),
'Emp_id': row.emp_id,
'Emp_Name': row.emp_name}, columns=['Date', 'Emp_id', 'Emp_Name'])
for i, row in df.iterrows()], ignore_index=True)
print(df_dates)
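On pandas 0.25+ there is also a more direct route via DataFrame.explode; a minimal sketch, assuming the same df as in the question:
# build a list of per-row date ranges, then explode to one date per row
df['date'] = [pd.date_range(s, e) for s, e in zip(df['from_date'], df['to_date'])]
df_out = (df.explode('date')
            .drop(columns=['from_date', 'to_date'])
            .reset_index(drop=True))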
My goal is to group the dataframes below based on the ['quantity'] column.
My dataframe df:
ordercode quantity
PMC21-AA1U1FBWBJA 1
PMP23-GR1M1FB3CJ 1
PMC11-AA1U1FJWWJA 1
PMC11-AA1U1FBWWJA+I7 2
PMC11-AA1U1FJWWJA 3
PMC11-AA1L1FJWWJA 3
df1:
ordercode quantity
PMC21-AA1U1FBWBJA 1
PMP23-GR1M1FB3CJ 1
PMC11-AA1U1FJWWJA 1
PMC11-AA1U1FBWWJA+I7 2
df2:
ordercode quantity
My code:
import numpy as np
import pandas as pd

# .as_matrix() was removed in pandas 1.0; .to_numpy() is the replacement
df = pd.DataFrame(np.concatenate(df.apply(lambda x: [x[0]] * x[1], 1).to_numpy()),
                  columns=['ordercode'])
df['quantity'] = 1
df['group'] = sorted(list(range(0, len(df) // 3, 1)) * 4)[0:len(df)]
df = df.groupby(['group', 'ordercode']).sum()
print(df)
With the above code I got my result in df as below.
Group ordercode quantity
0 PMC21-AA1U1FBWBJA 1
PMP23-GR1M1FB3CJ 1
PMC11-AA1U1FJWWJA 1
PMC11-AA1U1FBWWJA+I7 1
1 PMC11-AA1U1FBWWJA+I7 1
PMC11-AA1U1FJWWJA 3
2 PMC11-AA1L1FJWWJA 3
In group0 & group1 the total values (1+1+1+1=4)(1+3=4)(i.e keeping the max vale of quantity as 4). In group2 we can see that no values to add so the group is formed by the left over(here it is 3).in group0 & group1 we can see that PMC11-AA1U1FBWWJA+I7's value splits.
No problem in it.
In df1 & df2 its showing value error.
in df1:
value error: length of values does not match length of index
raise Value error('length of value does not match length of index')
in df2:
value error:need at least one array to concatenate.
I could understand that my df2 is empty and has no index. I used pd.Series but again the same error.
how to solve this problem?
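One way to avoid the second error is to guard against the empty frame before the concatenate step; a minimal sketch of the same approach wrapped in a function (the group size of 4 is taken from the logic above):
def group_orders(df, group_size=4):
    if df.empty:
        # nothing to expand: returning early avoids np.concatenate's
        # "need at least one array to concatenate" ValueError
        return df
    expanded = pd.DataFrame(
        np.concatenate(df.apply(lambda x: [x['ordercode']] * x['quantity'], axis=1).to_numpy()),
        columns=['ordercode'])
    expanded['quantity'] = 1
    expanded['group'] = np.arange(len(expanded)) // group_size
    return expanded.groupby(['group', 'ordercode']).sum()

The df1 error most likely comes from the hand-built 'group' list being shorter than the expanded frame (len(df)//3 groups repeated 4 times covers fewer rows than the expansion produces); computing the label with np.arange(len(expanded)) // group_size sidesteps that as well.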
I have a table below.
I am trying to create an additional column that counts how many of Std_1, Std_2 and Std_3 are greater than the row's mean value.
For example, for the ACCMGR row, only Std_2 is greater than the average, so the new column should be 1.
Not sure how to do it.
You need to be a bit careful with how you specify the axes, but you can just use .gt + .mean + .sum
Sample Data
import pandas as pd
import numpy as np
df = pd.DataFrame({'APPL': ['ACCMGR', 'ACCOUNTS', 'ADVISOR', 'AUTH', 'TEST'],
'Std_1': [106.875, 121.703, np.NaN, 116.8585, 1],
'Std_2': [130.1899, 113.4927, np.NaN, 112.4486, 4],
'Std_3': [107.186, 114.5418, np.NaN, 115.2699, np.NaN]})
Code
df = df.set_index('APPL')
df['cts'] = df.gt(df.mean(axis=1), axis=0).sum(axis=1)
df = df.reset_index()
Output:
APPL Std_1 Std_2 Std_3 cts
0 ACCMGR 106.8750 130.1899 107.1860 1
1 ACCOUNTS 121.7030 113.4927 114.5418 1
2 ADVISOR NaN NaN NaN 0
3 AUTH 116.8585 112.4486 115.2699 2
4 TEST 1.0000 4.0000 NaN 1
Consider the dataframe:
quantity price
0 6 1.45
1 3 1.85
2 2 2.25
Apply a lambda function along axis=1: for each row, check which columns hold a value greater than the row mean, and get the position of the first such column:
df.apply(lambda x:df.columns.get_loc(x[x>np.mean(x)].index[0]),axis=1)
Out:
quantity price > than mean
0 6 1.45 0
1 3 1.85 0
2 2 2.25 1
I have two dataframes like this:
DF1
ID
10C
25Y
66B
100W
DF2
ID
10C
5
25Y
66B
I want to check to see if any of the values in DF1 appear in DF2 and, if so, add a new column with a 1 (if it exists) or 0 (if it doesn't), such as:
ID Appears
10C 1
25Y 1
66B 1
100W 0
I know this is a really simple problem, but it is giving me fits.
I've been trying something like:
df3 = df1.merge(df2, on='ID', how='left')
df3.fillna(0)
df3['Appear'][df3.ID_x > 0] = 1
df3['Appear'][df3.ID_x = 0] = 0
You may simply use np.in1d:
>>> np.in1d(df1['ID'], df2['ID']).astype('int')
array([1, 1, 1, 0])
>>> df1['Appears'] = np.in1d(df1['ID'], df2['ID']).astype('int')
>>> df1
ID Appears
0 10C 1
1 25Y 1
2 66B 1
3 100W 0
A merge-based solution would look like the below, but I think using np.in1d would be faster:
>>> df2['Appears'] = 1
>>> df1.merge(df2, on='ID', how='left').fillna({'Appears':0})
ID Appears
0 10C 1
1 25Y 1
2 66B 1
3 100W 0
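Newer NumPy versions provide np.isin as the successor to np.in1d, and the pandas-native spelling of the same check is Series.isin; a minimal sketch, assuming the same df1/df2:
>>> df1['Appears'] = df1['ID'].isin(df2['ID']).astype(int)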