Python - sum of each day in multiple Excel sheets - python-3.x

I have Excel sheets with data like below:
Sheet1
duration date
10 5/20/2017 08:20
23 5/20/2017 10:20
33 5/21/2017 12:20
56 5/22/2017 23:20
Sheet2
duration date
34 5/20/2017 01:20
12 5/20/2017 03:20
05 5/21/2017 11:20
44 5/22/2017 23:20
Expected output:
day[20] : [33, 46]
day[21] : [33, 5]
day[22] : [56, 44]
I am trying to sum the duration day-wise across all sheets with the code below:
import pandas as pd

xls = pd.ExcelFile('reports.xlsx')
report_sheets = []
for sheetName in xls.sheet_names:
    sheet = pd.read_excel(xls, sheet_name=sheetName)
    sheet['date'] = pd.to_datetime(sheet['date'])
    print(sheet.groupby(sheet['date'].dt.strftime('%Y-%m-%d'))['duration'].sum().sort_values())
How can I achieve this?

You can pass the parameter sheet_name=None to read_excel to return a dictionary of DataFrames:
dfs = pd.read_excel('reports.xlsx', sheet_name=None)
print (dfs)
OrderedDict([('Sheet1', duration date
0 10 5/20/2017 08:20
1 23 5/20/2017 10:20
2 33 5/21/2017 12:20
3 56 5/22/2017 23:20), ('Sheet2', duration date
0 34 5/20/2017 01:20
1 12 5/20/2017 03:20
2 5 5/21/2017 11:20
3 44 5/22/2017 23:20)])
Then aggregate in a dictionary comprehension:
dfs1 = {i:x.groupby(pd.to_datetime(x['date']).dt.strftime('%Y-%m-%d'))['duration'].sum() for i, x in dfs.items()}
print (dfs1)
{'Sheet2': date
2017-05-20 46
2017-05-21 5
2017-05-22 44
Name: duration, dtype: int64, 'Sheet1': date
2017-05-20 33
2017-05-21 33
2017-05-22 56
Name: duration, dtype: int64}
Finally concat, create the lists, and build the final dictionary with to_dict:
d = pd.concat(dfs1).groupby(level=1).apply(list).to_dict()
print (d)
{'2017-05-22': [56, 44], '2017-05-21': [33, 5], '2017-05-20': [33, 46]}
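For reference, the same three steps can be chained into one expression (identical logic to the code above, just condensed):
dfs = pd.read_excel('reports.xlsx', sheet_name=None)
d = (pd.concat({name: x.groupby(pd.to_datetime(x['date']).dt.strftime('%Y-%m-%d'))['duration'].sum()
                for name, x in dfs.items()})
       .groupby(level=1).apply(list).to_dict())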

Make a function that takes the sheet's dataframe and returns a dictionary
def make_goofy_dict(d):
    d = d.set_index('date').duration.resample('D').sum()
    return d.apply(lambda x: [x]).to_dict()
Then use merge_with from either toolz or cytoolz
from cytoolz.dicttoolz import merge_with
merge_with(lambda x: sum(x, []), map(make_goofy_dict, (sheet1, sheet2)))
{Timestamp('2017-05-20 00:00:00', freq='D'): [33, 46],
Timestamp('2017-05-21 00:00:00', freq='D'): [33, 5],
Timestamp('2017-05-22 00:00:00', freq='D'): [56, 44]}
details
print(sheet1, sheet2, sep='\n\n')
duration date
0 10 2017-05-20 08:20:00
1 23 2017-05-20 10:20:00
2 33 2017-05-21 12:20:00
3 56 2017-05-22 23:20:00
duration date
0 34 2017-05-20 01:20:00
1 12 2017-05-20 03:20:00
2 5 2017-05-21 11:20:00
3 44 2017-05-22 23:20:00
For your problem
I'd do this
from functools import partial
from cytoolz.dicttoolz import merge_with

def make_goofy_dict(d):
    d = d.set_index('date').duration.resample('D').sum()
    return d.apply(lambda x: [x]).to_dict()

def read_sheet(xls, sn):
    return pd.read_excel(xls, sheet_name=sn, parse_dates=['date'])

xls = pd.ExcelFile('reports.xlsx')
sheet_dict = merge_with(
    lambda x: sum(x, []),
    map(make_goofy_dict, map(partial(read_sheet, xls), xls.sheet_names))
)
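A quick way to inspect the merged result (a small sketch, assuming the same sample workbook):
# keys are daily Timestamps, values are the per-sheet duration sums
for day, durations in sorted(sheet_dict.items()):
    print(day.date(), durations)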

Related

How to resolve the following key error in my code while manipulating a dataframe?

Let's say I have this data frame. I want to match a date value and compare whether the employee ID at the current index and at the next index is the same; if they are the same, I want to delete the row at the next index. I tried the code below but I am getting KeyError: 1.
EMPL_ID  AGE  EMPLOYER  END_DATE
12       23   BHU       2022-04-22 00:00:00
12       21   BHU       2022-04-22 00:00:00
34       22   DU        2022-04-22 00:00:00
36       21   BHU       2022-04-22 00:00:00
for index, row in df.iterrows():
    value = row['END_DATE']
    if (value == '2022-04-22 00:00:00'):
        a = index
        if (df.loc[a, 'EMPL_ID'] == df.loc[a+1, 'EMPL_ID']):
            df.drop(a+1, inplace=True)
        else:
            df = df
    else:
        df = df
Keeping only the unique values based on two columns can be done with drop_duplicates, supplying the columns that make up the subset:
df = pd.DataFrame(data={'id': [1, 2, 3, 3], 'time': [1, 2, 3, 3]})
>>> df
id time
0 1 1
1 2 2
2 3 3
3 3 3
df.drop_duplicates(subset=['id', 'time'], inplace=True)
>>> df
id time
0 1 1
1 2 2
2 3 3
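Applied to the question's columns, that would presumably look like the following (note this removes all duplicate EMPL_ID/END_DATE pairs, not only consecutive ones):
df = df.drop_duplicates(subset=['EMPL_ID', 'END_DATE'], keep='first')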
IIUC, here is one approach. Since you only search for consecutive duplicates, I added another row to your example; the second row with EMPL_ID 34 won't be deleted.
df = pd.DataFrame({'EMPL_ID': [12, 12, 34, 36, 34],
                   'AGE': [23, 21, 22, 21, 22],
                   'EMPLOYER': ['BHU', 'BHU', 'DU', 'BHU', 'DU'],
                   'END_DATE': ['2022-04-22 00:00:00', '2022-04-22 00:00:00', '2022-04-22 00:00:00', '2022-04-22 00:00:00', '2022-04-22 00:00:00'],
                   })
print(df)
EMPL_ID AGE EMPLOYER END_DATE
0 12 23 BHU 2022-04-22 00:00:00
1 12 21 BHU 2022-04-22 00:00:00
2 34 22 DU 2022-04-22 00:00:00
3 36 21 BHU 2022-04-22 00:00:00
4 34 22 DU 2022-04-22 00:00:00
grp = ['END_DATE', 'EMPL_ID']
df['group'] = (df[grp] != df[grp].shift(1)).any(axis=1).cumsum()
df = df.drop_duplicates('group', keep='first').drop('group', axis=1)
print(df)
EMPL_ID AGE EMPLOYER END_DATE
0 12 23 BHU 2022-04-22 00:00:00
2 34 22 DU 2022-04-22 00:00:00
3 36 21 BHU 2022-04-22 00:00:00
4 34 22 DU 2022-04-22 00:00:00
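Equivalently, starting from the original dataframe, the helper column can be skipped by keeping only the rows where the key columns change (a minimal variant of the same idea):
grp = ['END_DATE', 'EMPL_ID']
df = df[(df[grp] != df[grp].shift()).any(axis=1)]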

how to split columns by date using python

df.head(7)
df
Month,ward1,ward2,...ward30
Apr-19, 20, 30, 45
May-19, 18, 25, 42
Jun-19, 25, 19, 35
Jul-19, 28, 22, 38
Aug-19, 24, 15, 40
Sep-19, 21, 14, 39
Oct-19, 15, 18, 41
to:
Month, ward1
Apr-19, 20
May-19, 18
Jun-19, 25
Jul-19, 28
Aug-19, 24
Sep-19, 21
Oct-19, 15
Month,ward2
Apr-19, 30
May-19, 25
Jun-19, 19
Jul-19, 22
Aug-19, 15
Sep-19, 14
Oct-19, 18
Month, ward30
Apr-19, 45
May-19, 42
Jun-19, 35
Jul-19, 38
Aug-19, 40
Sep-19, 39
Oct-19, 41
How to group-by date wise in python using pandas?
I have a dataframe df that contains a datetime column and 30 other columns, which I want to split so that the date is attached to each of those columns in pandas, but I am facing some difficulties.
Try using a dictionary comprehension to hold your separate dataframes.
dfs = {col : df.set_index('Month')[[col]] for col in (df.set_index('Month').columns)}
print(dfs['ward1'])
ward1
Month
Apr-19 20
May-19 18
Jun-19 25
Jul-19 28
Aug-19 24
Sep-19 21
Oct-19 15
print(dfs['ward30'])
ward30
Month
Apr-19 45
May-19 42
Jun-19 35
Jul-19 38
Aug-19 40
Sep-19 39
Oct-19 41
One straightforward way would be to set the date column as the index and separate out each of the other columns:
data.set_index('Month', inplace=True)
data_dict = {col: data[col] for col in data.columns}
You have to create new DataFrames:
data1 = pd.DataFrame()
data1['Month'] = df['Month']
data1['ward1'] = df['ward1']
data1.head()
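If you need one frame per ward rather than only ward1, a minimal sketch (assuming the columns are literally named ward1 through ward30) could build them all in a loop:
# one two-column DataFrame per ward, keyed by the ward column name
ward_frames = {col: df[['Month', col]].copy() for col in df.columns.drop('Month')}
print(ward_frames['ward1'].head())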

Tidying dataframe with one parameter per row

I have a dataframe like this:
d = {'Date': ['2020-10-09', '2020-10-09', '2020-10-09', '2020-10-10', '2020-10-10', '2020-10-10', '2020-10-11', '2020-10-11', '2020-10-11'],
'ID': ['T1', 'T2', 'T3', 'T1', 'T2', 'T3','T1', 'T2', 'T3'],
'Value': [13, 12, 11, 14, 15, 16, 20, 21, 22]}
df = pd.DataFrame(data=d)
df
Date ID Value
0 2020-10-09 T1 13
1 2020-10-09 T2 12
2 2020-10-09 T3 11
3 2020-10-10 T1 14
4 2020-10-10 T2 15
5 2020-10-10 T3 16
6 2020-10-11 T1 20
7 2020-10-11 T2 21
8 2020-10-11 T3 22
And I'm trying to get:
d = {'Date': ['2020-10-09', '2020-10-10', '2020-10-11'],
'Value T1': ['13', '14', '20'],
'Value T2': ['12', '15', '21'],
'Value T3': ['11', '16', '22']}
df = pd.DataFrame(data=d)
df
Date Value T1 Value T2 Value T3
0 2020-10-09 13 12 11
1 2020-10-10 14 15 16
2 2020-10-11 20 21 22
I tried with pivot but I got the error:
"Index contains duplicate entries, cannot reshape"
Use pd.pivot_table as shown below:
pdf = pd.pivot_table(
    df,
    values=['Value'],
    index=['Date'],
    columns=['ID'],
    aggfunc='first'
).reset_index(drop=False)
pdf.columns = ['Date', 'Value T1', 'Value T2', 'Value T3']
Date Value T1 Value T2 Value T3
0 2020-10-09 13 12 11
1 2020-10-10 14 15 16
2 2020-10-11 20 21 22
Note that aggfunc is 'first' here, which means that if there are multiple values for a given ID on a given Date, you'll get the first value in the dataframe. You can change it to min/max/last as per your need.
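A small illustration of that aggfunc choice, using a made-up duplicate row (hypothetical data, not from the question):
# hypothetical duplicate reading for T1 on 2020-10-09
dup = pd.DataFrame({'Date': ['2020-10-09', '2020-10-09'],
                    'ID': ['T1', 'T1'],
                    'Value': [13, 99]})

# aggfunc='first' keeps 13; aggfunc='max' would keep 99 instead
print(pd.pivot_table(dup, values='Value', index='Date', columns='ID', aggfunc='first'))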

Moving aggregate within a specified date range

Using the sample credit card transaction data below:
import random
from datetime import datetime, timedelta
import pandas as pd

df = pd.DataFrame({
    'card_id': [1, 1, 1, 2, 2],
    'date': [datetime(2020, 6, random.randint(1, 14)) for i in range(5)],
    'amount': [random.randint(1, 100) for i in range(5)]})
df
card_id date amount
0 1 2020-06-07 11
1 1 2020-06-11 45
2 1 2020-06-14 87
3 2 2020-06-04 48
4 2 2020-06-12 76
I'm trying to take the total amount spent in the past 7 days of a card at the point of the transaction. For example, if card_id 1 made a transaction on June 8, I want to get the total transactions from June 1 to June 7. This is what I was hoping to get:
card_id date amount sum_past_7d
0 1 2020-06-07 11 0
1 1 2020-06-11 45 11
2 1 2020-06-14 87 56
3 2 2020-06-04 48 0
4 2 2020-06-12 76 48
I'm currently using this function and pd.apply to generate my desired column but it's taking too long on the actual data (> 1 million rows).
df['past_week'] = df['date'].apply(lambda x: x - timedelta(days=7))

def myfunction(x):
    return df.loc[(df['card_id'] == x.card_id) &
                  (df['date'] >= x.past_week) &
                  (df['date'] < x.date), :]['amount'].sum()
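Presumably the column is then filled with a row-wise apply like this (the call itself isn't shown in the question):
df['sum_past_7d'] = df.apply(myfunction, axis=1)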
Is there a faster and more efficient way to do this?
Let's try rolling on date with groupby:
# make sure the data is sorted properly
# your sample is already sorted, so you can skip this
df = df.sort_values(['card_id', 'date'])

df['sum_past_7D'] = (df.set_index('date').groupby('card_id')
                       ['amount'].rolling('7D').sum()
                       .groupby('card_id').shift(fill_value=0)
                       .values)
Output:
card_id date amount sum_past_7D
0 1 2020-06-07 11 0.0
1 1 2020-06-11 45 11.0
2 1 2020-06-14 87 56.0
3 2 2020-06-04 48 0.0
4 2 2020-06-12 76 48.0

Pandas: Getting average of columns and rows with crosstab

I have this dataframe (a sample of it at least):
DATETIME_FROM DATETIME_TO MEAS ROW VEHICLE SPEED
1 2020-02-27 05:19:42.750 2020-02-27 05:20:42.750 2.2844 1 26 85
2 2020-02-27 05:30:06.050 2020-02-27 05:31:06.050 2.5256 1 31 69
3 2020-02-27 05:36:02.370 2020-02-27 05:37:02.370 4.8933 1 37 86
4 2020-02-27 05:41:12.005 2020-02-27 05:42:12.005 2.6998 1 27 86
5 2020-02-27 05:46:30.773 2020-02-27 05:47:30.773 2.2720 1 26 86
6 2020-02-27 05:50:53.862 2020-02-27 05:51:53.862 4.6953 1 3 82
7 2020-02-27 05:59:45.381 2020-02-27 06:00:45.381 2.5942 1 31 86
8 2020-02-27 06:04:12.657 2020-02-27 06:05:12.657 4.9136 1 37 86
The result should be a table where I get the mean of every vehicle for each day, but I would also like to have a total mean of MEAS per day and per vehicle.
I am using this:
pd.crosstab([valid1low.DATE,valid1low.ROW], [valid1low.VEHICLE], values=valid1low.MEAS, aggfunc=[np.mean], margins=True)
The total looks like an average, but if I use Excel to compute the average I don't get the same result. Could this be because Excel is not using the same precision for the MEAS values, and how would I get the same result? The end user of this table will be using Excel, so if the total average differs from Excel, I would get questions :)
What I think you are looking for, if I understand correctly, is groupby. I have tried to recreate a similar dataframe with the code below to explain.
import pandas as pd
from datetime import datetime

df = pd.DataFrame()
df['DATETIME_FROM'] = pd.to_datetime(pd.DataFrame({'year': [2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020],
                                                   'month': [2, 2, 2, 2, 2, 2, 2, 2],
                                                   'day': [27, 27, 27, 27, 28, 28, 28, 28],
                                                   'hour': [24, 26, 28, 30, 32, 34, 36, 38],
                                                   'minute': [2, 4, 6, 8, 10, 12, 14, 16],
                                                   'second': [1, 3, 5, 7, 8, 10, 12, 13]}))
df['DATETIME_TO'] = pd.to_datetime(pd.DataFrame({'year': [2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020],
                                                 'month': [2, 2, 2, 2, 2, 2, 2, 2],
                                                 'day': [27, 27, 27, 27, 28, 28, 28, 28],
                                                 'hour': [25, 27, 29, 31, 33, 35, 37, 39],
                                                 'minute': [3, 5, 7, 9, 11, 13, 15, 17],
                                                 'second': [2, 4, 6, 8, 10, 12, 14, 16]}))
df['MEAS'] = [2.2844, 2.5256, 4.8933, 2.6998, 1, 2, 3, 4]
df['ROW'] = [1, 1, 1, 1, 2, 2, 2, 2]
df['VEHICLE'] = [26, 31, 37, 27, 65, 46, 45, 49]
df['VEHICLE_SPEED'] = [85, 69, 86, 86, 90, 91, 92, 93]
The dataframe that this code creates looks like the following.
DATETIME_FROM DATETIME_TO MEAS ROW VEHICLE VEHICLE_SPEED
0 2020-02-28 00:02:01 2020-02-28 01:03:02 2.2844 1 26 85
1 2020-02-28 02:04:03 2020-02-28 03:05:04 2.5256 1 31 69
2 2020-02-28 04:06:05 2020-02-28 05:07:06 4.8933 1 37 86
3 2020-02-28 06:08:07 2020-02-28 07:09:08 2.6998 1 27 86
4 2020-02-29 08:10:08 2020-02-29 09:11:10 1.0000 2 65 90
5 2020-02-29 10:12:10 2020-02-29 11:13:12 2.0000 2 46 91
6 2020-02-29 12:14:12 2020-02-29 13:15:14 3.0000 2 45 92
7 2020-02-29 14:16:13 2020-02-29 15:17:16 4.0000 2 49 93
You said that you need the mean of each vehicle per day and the mean of MEAS per day, so I grouped by day using the groupby function along with Grouper, specifying a daily frequency on the DATETIME_FROM column. Then I took the mean of all the rows for a given column using the mean function, which sums up the values in the column and divides the sum by the number of rows.
means = df.set_index(["DATETIME_FROM"]).groupby(pd.Grouper(freq='D')).mean()
The dataframe means now contains the following. DATETIME_FROM is now the index, as we have grouped by this column.
MEAS ROW VEHICLE VEHICLE_SPEED
DATETIME_FROM
2020-02-27 3.100775 1.0 30.25 81.5
2020-02-28 2.500000 2.0 51.25 91.5
When you say you want the total means of MEAS and VEHICLE, I am assuming you want the mean of the values of those columns in the means dataframe. This can be done by just taking the mean of each of those columns; I then created a new dataframe called total and added these entries.
mean_meas = means['MEAS'].mean()
mean_vehicles = means['VEHICLE'].mean()
total = pd.DataFrame({'MEAN MEAS': [mean_meas], 'MEAN VEHICLE': [mean_vehicles]})
The total dataframe will then contain the following:
   MEAN MEAS  MEAN VEHICLE
0   2.800388         40.75
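For reference, the same two numbers can also be read straight off the means dataframe, skipping the intermediate variables:
print(means[['MEAS', 'VEHICLE']].mean())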
I hope this helps; if you have a question, let me know!
