How to resolve the following KeyError in my code while manipulating a DataFrame? - python-3.x

Let's say I have this data frame. I want to match a date value and compare whether the employee ID at the current index and the next index are the same; if they are, I want to delete the row at the next index. I tried the code below but I'm getting KeyError: 1.
EMPL_ID  AGE  EMPLOYER  END_DATE
12       23   BHU       2022-04-22 00:00:00
12       21   BHU       2022-04-22 00:00:00
34       22   DU        2022-04-22 00:00:00
36       21   BHU       2022-04-22 00:00:00
for index, row in df.iterrows():
    value = row['END_DATE']
    if (value == '2022-04-22 00:00:00'):
        a = index
        if (df.loc[a, 'EMPL_ID'] == df.loc[a+1, 'EMPL_ID']):
            df.drop(a+1, inplace = True)
        else:
            df = df
    else:
        df = df

The KeyError comes from dropping rows while iterating: after df.drop(a+1) removes the row with label 1, iterrows still yields that label, so the subsequent df.loc[1, 'EMPL_ID'] lookup fails (df.loc[a+1] would also fail on the last row, which has no successor). Keeping only the unique values based on two columns can instead be done with drop_duplicates, supplying the columns that make up the subset:
df = pd.DataFrame(data={'id': [1, 2, 3, 3], 'time': [1, 2, 3, 3]})
>>> df
id time
0 1 1
1 2 2
2 3 3
3 3 3
df.drop_duplicates(subset=['id', 'time'], inplace=True)
>>> df
id time
0 1 1
1 2 2
2 3 3

IIUC, here is one approach. Since you only search for consecutive duplicates, I added another row to your example: the second row with EMPL_ID 34 won't be deleted.
df = pd.DataFrame({'EMPL_ID': [12, 12, 34, 36, 34],
                   'AGE': [23, 21, 22, 21, 22],
                   'EMPLOYER': ['BHU', 'BHU', 'DU', 'BHU', 'DU'],
                   'END_DATE': ['2022-04-22 00:00:00', '2022-04-22 00:00:00',
                                '2022-04-22 00:00:00', '2022-04-22 00:00:00',
                                '2022-04-22 00:00:00'],
                   })
print(df)
EMPL_ID AGE EMPLOYER END_DATE
0 12 23 BHU 2022-04-22 00:00:00
1 12 21 BHU 2022-04-22 00:00:00
2 34 22 DU 2022-04-22 00:00:00
3 36 21 BHU 2022-04-22 00:00:00
4 34 22 DU 2022-04-22 00:00:00
grp = ['END_DATE', 'EMPL_ID']
df['group'] = (df[grp] != df[grp].shift(1)).any(axis=1).cumsum()
df = df.drop_duplicates('group', keep='first').drop('group', axis=1)
print(df)
EMPL_ID AGE EMPLOYER END_DATE
0 12 23 BHU 2022-04-22 00:00:00
2 34 22 DU 2022-04-22 00:00:00
3 36 21 BHU 2022-04-22 00:00:00
4 34 22 DU 2022-04-22 00:00:00
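As a side note, if the goal is only to drop the consecutive duplicates, the helper column can be skipped entirely; here is a minimal sketch using the same shift-based comparison directly as a boolean mask:

```python
import pandas as pd

df = pd.DataFrame({'EMPL_ID': [12, 12, 34, 36, 34],
                   'AGE': [23, 21, 22, 21, 22],
                   'EMPLOYER': ['BHU', 'BHU', 'DU', 'BHU', 'DU'],
                   'END_DATE': ['2022-04-22 00:00:00'] * 5})

grp = ['END_DATE', 'EMPL_ID']
# keep a row only if it differs from the previous row on the key columns
out = df[(df[grp] != df[grp].shift(1)).any(axis=1)]
print(out)
```

This keeps rows 0, 2, 3, and 4, the same result as the group-column version above.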

Related

How to sum by month in timestamp Data Frame?

I have a dataframe like this:
trx_date    trx_amount
2013-02-11  35
2014-03-10  26
2011-02-09  10
2013-02-12  5
2013-01-11  21
How do I group that by month and year, so that I can sum the trx_amount?
Example expected output:
trx_monthly  trx_sum
2013-02      40
2013-01      21
2014-03      26
You can convert values to month periods by Series.dt.to_period and then aggregate sum:
df['trx_date'] = pd.to_datetime(df['trx_date'])
df1 = (df.groupby(df['trx_date'].dt.to_period('m').rename('trx_monthly'))['trx_amount']
         .sum()
         .reset_index(name='trx_sum'))
print (df1)
trx_monthly trx_sum
0 2011-02 10
1 2013-01 21
2 2013-02 40
3 2014-03 26
Or convert datetimes to strings in format YYYY-MM by Series.dt.strftime:
df2 = (df.groupby(df['trx_date'].dt.strftime('%Y-%m').rename('trx_monthly'))['trx_amount']
         .sum()
         .reset_index(name='trx_sum'))
print (df2)
trx_monthly trx_sum
0 2011-02 10
1 2013-01 21
2 2013-02 40
3 2014-03 26
Or convert to years and months separately; then the output is different, with 3 columns:
df2 = (df.groupby([df['trx_date'].dt.year.rename('year'),
                   df['trx_date'].dt.month.rename('month')])['trx_amount']
         .sum()
         .reset_index(name='trx_sum'))
print (df2)
year month trx_sum
0 2011 2 10
1 2013 1 21
2 2013 2 40
3 2014 3 26
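Another option (a sketch, not part of the original answer) is pd.Grouper, which bins the datetime column by frequency directly; note that it also emits the empty months between the first and last date, with a sum of 0:

```python
import pandas as pd

df = pd.DataFrame({'trx_date': ['2013-02-11', '2014-03-10', '2011-02-09',
                                '2013-02-12', '2013-01-11'],
                   'trx_amount': [35, 26, 10, 5, 21]})
df['trx_date'] = pd.to_datetime(df['trx_date'])

# bin by calendar month; with freq='MS' the labels are month-start timestamps
df3 = df.groupby(pd.Grouper(key='trx_date', freq='MS'))['trx_amount'].sum()
```

For example, df3 holds 40 at the 2013-02-01 label and 26 at 2014-03-01.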
You can try this (note that it groups by month number only, so the same month from different years would be combined):
df['trx_month'] = df['trx_date'].dt.month
df_agg = df.groupby('trx_month')['trx_amount'].sum()

Tidying dataframe with one parameter per row

I have a dataframe like this:
d = {'Date': ['2020-10-09', '2020-10-09', '2020-10-09', '2020-10-10', '2020-10-10',
              '2020-10-10', '2020-10-11', '2020-10-11', '2020-10-11'],
     'ID': ['T1', 'T2', 'T3', 'T1', 'T2', 'T3', 'T1', 'T2', 'T3'],
     'Value': [13, 12, 11, 14, 15, 16, 20, 21, 22]}
df = pd.DataFrame(data=d)
df
Date ID Value
0 2020-10-09 T1 13
1 2020-10-09 T2 12
2 2020-10-09 T3 11
3 2020-10-10 T1 14
4 2020-10-10 T2 15
5 2020-10-10 T3 16
6 2020-10-11 T1 20
7 2020-10-11 T2 21
8 2020-10-11 T3 22
And I'm trying to get:
d = {'Date': ['2020-10-09', '2020-10-10', '2020-10-11'],
     'Value T1': ['13', '14', '20'],
     'Value T2': ['12', '15', '21'],
     'Value T3': ['11', '16', '22']}
df = pd.DataFrame(data=d)
df
         Date Value T1 Value T2 Value T3
0  2020-10-09       13       12       11
1  2020-10-10       14       15       16
2  2020-10-11       20       21       22
I tried with pivot but I got the error:
"Index contains duplicate entries, cannot reshape"
Use pd.pivot_table as shown below:
pdf = pd.pivot_table(
    df,
    values=['Value'],
    index=['Date'],
    columns=['ID'],
    aggfunc='first'
).reset_index(drop=False)
pdf.columns = ['Date', 'Value T1', 'Value T2', 'Value T3']
Date Value T1 Value T2 Value T3
0 2020-10-09 13 12 11
1 2020-10-10 14 15 16
2 2020-10-11 20 21 22
Note that aggfunc is 'first' here, which means that if there are multiple values for a given ID on a given Date, you'll get the first one in the dataframe. You can change it to min/max/last as per your need.
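For this particular sample, where each (Date, ID) pair is unique, plain pivot would also work; a sketch (the add_prefix call is just one of several ways to rename the columns):

```python
import pandas as pd

d = {'Date': ['2020-10-09', '2020-10-09', '2020-10-09',
              '2020-10-10', '2020-10-10', '2020-10-10',
              '2020-10-11', '2020-10-11', '2020-10-11'],
     'ID': ['T1', 'T2', 'T3'] * 3,
     'Value': [13, 12, 11, 14, 15, 16, 20, 21, 22]}
df = pd.DataFrame(d)

# pivot raises "Index contains duplicate entries" only when a (Date, ID)
# pair occurs more than once; here each pair is unique, so it succeeds
out = (df.pivot(index='Date', columns='ID', values='Value')
         .add_prefix('Value ')
         .reset_index()
         .rename_axis(None, axis=1))
print(out)
```

The pivot_table route above is still the safer default when duplicates may exist, since aggfunc decides how to collapse them.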

Moving aggregate within a specified date range

Using the sample credit card transaction data below:
df = pd.DataFrame({
    'card_id': [1, 1, 1, 2, 2],
    'date': [datetime(2020, 6, random.randint(1, 14)) for i in range(5)],
    'amount': [random.randint(1, 100) for i in range(5)]})
df
card_id date amount
0 1 2020-06-07 11
1 1 2020-06-11 45
2 1 2020-06-14 87
3 2 2020-06-04 48
4 2 2020-06-12 76
I'm trying to take the total amount spent in the past 7 days of a card at the point of the transaction. For example, if card_id 1 made a transaction on June 8, I want to get the total transactions from June 1 to June 7. This is what I was hoping to get:
card_id date amount sum_past_7d
0 1 2020-06-07 11 0
1 1 2020-06-11 45 11
2 1 2020-06-14 87 56
3 2 2020-06-04 48 0
4 2 2020-06-12 76 48
I'm currently using this function and pd.apply to generate my desired column but it's taking too long on the actual data (> 1 million rows).
df['past_week'] = df['date'].apply(lambda x: x - timedelta(days=7))

def myfunction(x):
    return df.loc[(df['card_id'] == x.card_id) &
                  (df['date'] >= x.past_week) &
                  (df['date'] < x.date), :]['amount'].sum()
Is there a faster and more efficient way to do this?
Let's try rolling on date with groupby:
# make sure the data is sorted properly
# your sample is already sorted, so you can skip this
df = df.sort_values(['card_id', 'date'])
df['sum_past_7D'] = (df.set_index('date').groupby('card_id')
                       ['amount'].rolling('7D').sum()
                       .groupby('card_id').shift(fill_value=0)
                       .values
                     )
Output:
card_id date amount sum_past_7D
0 1 2020-06-07 11 0.0
1 1 2020-06-11 45 11.0
2 1 2020-06-14 87 56.0
3 2 2020-06-04 48 0.0
4 2 2020-06-12 76 48.0
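If the window should match the question's own filter exactly, i.e. [t - 7 days, t) measured by time rather than by row, rolling also accepts closed='left'. A sketch (note it differs from the shift-based version whenever the previous transaction is more than 7 days back, as with card 2 here, so pick the semantics you actually want):

```python
import pandas as pd

df = pd.DataFrame({'card_id': [1, 1, 1, 2, 2],
                   'date': pd.to_datetime(['2020-06-07', '2020-06-11', '2020-06-14',
                                           '2020-06-04', '2020-06-12']),
                   'amount': [11, 45, 87, 48, 76]})

df = df.sort_values(['card_id', 'date'])
# closed='left' makes each window cover [t - 7 days, t), so the current
# transaction and anything 7 or more days old are excluded
df['sum_past_7d'] = (df.set_index('date')
                       .groupby('card_id')['amount']
                       .rolling('7D', closed='left')
                       .sum()
                       .fillna(0)
                       .values)
```

Here card 2's 2020-06-12 row gets 0 rather than 48, because 2020-06-04 is 8 days earlier and falls outside the 7-day window.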

Python - sum of each day in multiple Excel sheets

I have Excel sheet data like below:
Sheet1
duration date
10 5/20/2017 08:20
23 5/20/2017 10:20
33 5/21/2017 12:20
56 5/22/2017 23:20
Sheet2
duration date
34 5/20/2017 01:20
12 5/20/2017 03:20
05 5/21/2017 11:20
44 5/22/2017 23:20
Expected output:
day[20] : [33, 46]
day[21] : [33, 12]
day[22] : [56, 44]
I am trying to sum the duration day-wise across all sheets, like the code below:
xls = pd.ExcelFile('reports.xlsx')
report_sheets = []
for sheetName in xls.sheet_names:
    sheet = pd.read_excel(xls, sheet_name=sheetName)
    sheet['date'] = pd.to_datetime(sheet['date'])
    print(sheet.groupby(sheet['date'].dt.strftime('%Y-%m-%d'))['duration'].sum().sort_values())
How can I achieve this?
You can use the parameter sheet_name=None in read_excel to return a dictionary of DataFrames:
dfs = pd.read_excel('reports.xlsx', sheet_name=None)
print (dfs)
OrderedDict([('Sheet1', duration date
0 10 5/20/2017 08:20
1 23 5/20/2017 10:20
2 33 5/21/2017 12:20
3 56 5/22/2017 23:20), ('Sheet2', duration date
0 34 5/20/2017 01:20
1 12 5/20/2017 03:20
2 5 5/21/2017 11:20
3 44 5/22/2017 23:20)])
Then aggregate in dictionary comprehension:
dfs1 = {i:x.groupby(pd.to_datetime(x['date']).dt.strftime('%Y-%m-%d'))['duration'].sum() for i, x in dfs.items()}
print (dfs1)
{'Sheet2': date
2017-05-20 46
2017-05-21 5
2017-05-22 44
Name: duration, dtype: int64, 'Sheet1': date
2017-05-20 33
2017-05-21 33
2017-05-22 56
Name: duration, dtype: int64}
And last concat, create lists, and build the final dictionary by to_dict:
d = pd.concat(dfs1).groupby(level=1).apply(list).to_dict()
print (d)
{'2017-05-22': [56, 44], '2017-05-21': [33, 5], '2017-05-20': [33, 46]}
Make a function that takes the sheet's dataframe and returns a dictionary
def make_goofy_dict(d):
    d = d.set_index('date').duration.resample('D').sum()
    return d.apply(lambda x: [x]).to_dict()
Then use merge_with from either toolz or cytoolz
from cytoolz.dicttoolz import merge_with
merge_with(lambda x: sum(x, []), map(make_goofy_dict, (sheet1, sheet2)))
{Timestamp('2017-05-20 00:00:00', freq='D'): [33, 46],
Timestamp('2017-05-21 00:00:00', freq='D'): [33, 5],
Timestamp('2017-05-22 00:00:00', freq='D'): [56, 44]}
details
print(sheet1, sheet2, sep='\n\n')
duration date
0 10 2017-05-20 08:20:00
1 23 2017-05-20 10:20:00
2 33 2017-05-21 12:20:00
3 56 2017-05-22 23:20:00
duration date
0 34 2017-05-20 01:20:00
1 12 2017-05-20 03:20:00
2 5 2017-05-21 11:20:00
3 44 2017-05-22 23:20:00
For your problem
I'd do this
from cytoolz.dicttoolz import merge_with

def make_goofy_dict(d):
    d = d.set_index('date').duration.resample('D').sum()
    return d.apply(lambda x: [x]).to_dict()

def read_sheet(xls, sn):
    return pd.read_excel(xls, sheet_name=sn, parse_dates=['date'])

xls = pd.ExcelFile('reports.xlsx')
sheet_dict = merge_with(
    lambda x: sum(x, []),
    map(make_goofy_dict, (read_sheet(xls, sn) for sn in xls.sheet_names))
)

Can I create a dataframe from few 1d arrays as columns?

Is it possible to create a dataframe from a few 1d arrays and place them as columns?
If I create a dataframe from one 1d array, everything is OK:
arr1 = np.array([11, 12, 13, 14, 15])
arr1_arr2_df = pd.DataFrame(data=arr1, index=None, columns=None)
arr1_arr2_df
Out:
0
0 11
1 12
2 13
3 14
4 15
But if I make a dataframe from 2 arrays, they are placed as rows:
arr1 = np.array([11, 12, 13, 14, 15])
arr2 = np.array([21, 22, 23, 24, 25])
arr1_arr2_df = pd.DataFrame(data=(arr1,arr2), index=None, columns=None)
arr1_arr2_df
Out:
0 1 2 3 4
0 11 12 13 14 15
1 21 22 23 24 25
I know that I can achieve it by using transpose:
arr1_arr2_df = arr1_arr2_df.transpose()
arr1_arr2_df
Out:
0 1
0 11 21
1 12 22
2 13 23
3 14 24
4 15 25
But is it possible to get it from the start?
You can use a dictionary:
arr1_arr2_df = pd.DataFrame(data={0:arr1,1:arr2})
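Alternatively (a sketch, not from the original answer), numpy.column_stack builds the 2-D array column-wise before it ever reaches pandas, which generalizes to any number of arrays:

```python
import numpy as np
import pandas as pd

arr1 = np.array([11, 12, 13, 14, 15])
arr2 = np.array([21, 22, 23, 24, 25])

# stack the 1-D arrays as columns of a single (5, 2) array
arr1_arr2_df = pd.DataFrame(np.column_stack((arr1, arr2)))
print(arr1_arr2_df)
```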
