I have Excel sheet data like below:
Sheet1
duration date
10 5/20/2017 08:20
23 5/20/2017 10:20
33 5/21/2017 12:20
56 5/22/2017 23:20
Sheet2
duration date
34 5/20/2017 01:20
12 5/20/2017 03:20
05 5/21/2017 11:20
44 5/22/2017 23:20
Expected output:
day[20] : [33, 46]
day[21] : [33, 5]
day[22] : [56, 44]
I am trying to sum duration day-wise across all sheets with the code below:
import pandas as pd

xls = pd.ExcelFile('reports.xlsx')
report_sheets = []
for sheetName in xls.sheet_names:
    sheet = pd.read_excel(xls, sheet_name=sheetName)
    sheet['date'] = pd.to_datetime(sheet['date'])
    print(sheet.groupby(sheet['date'].dt.strftime('%Y-%m-%d'))['duration'].sum().sort_values())
How can I achieve this?
You can pass the parameter sheet_name=None to read_excel to return a dictionary of DataFrames:
dfs = pd.read_excel('reports.xlsx', sheet_name=None)
print (dfs)
OrderedDict([('Sheet1', duration date
0 10 5/20/2017 08:20
1 23 5/20/2017 10:20
2 33 5/21/2017 12:20
3 56 5/22/2017 23:20), ('Sheet2', duration date
0 34 5/20/2017 01:20
1 12 5/20/2017 03:20
2 5 5/21/2017 11:20
3 44 5/22/2017 23:20)])
Then aggregate in a dictionary comprehension:
dfs1 = {i:x.groupby(pd.to_datetime(x['date']).dt.strftime('%Y-%m-%d'))['duration'].sum() for i, x in dfs.items()}
print (dfs1)
{'Sheet2': date
2017-05-20 46
2017-05-21 5
2017-05-22 44
Name: duration, dtype: int64, 'Sheet1': date
2017-05-20 33
2017-05-21 33
2017-05-22 56
Name: duration, dtype: int64}
Finally, concat, create lists, and build the dictionary with to_dict:
d = pd.concat(dfs1).groupby(level=1).apply(list).to_dict()
print (d)
{'2017-05-22': [56, 44], '2017-05-21': [33, 5], '2017-05-20': [33, 46]}
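Putting the three steps together, a minimal end-to-end sketch (assuming the same reports.xlsx layout as above):

import pandas as pd

# Read every sheet into a dict of DataFrames
dfs = pd.read_excel('reports.xlsx', sheet_name=None)
# Per-sheet daily sums of duration
dfs1 = {name: df.groupby(pd.to_datetime(df['date']).dt.strftime('%Y-%m-%d'))['duration'].sum()
        for name, df in dfs.items()}
# Combine into one dict of lists keyed by date
d = pd.concat(dfs1).groupby(level=1).apply(list).to_dict()
print(d)  # e.g. {'2017-05-20': [33, 46], '2017-05-21': [33, 5], '2017-05-22': [56, 44]}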
Make a function that takes the sheet's dataframe and returns a dictionary:
def make_goofy_dict(d):
    d = d.set_index('date').duration.resample('D').sum()
    return d.apply(lambda x: [x]).to_dict()
Then use merge_with from either toolz or cytoolz:
from cytoolz.dicttoolz import merge_with
merge_with(lambda x: sum(x, []), map(make_goofy_dict, (sheet1, sheet2)))
{Timestamp('2017-05-20 00:00:00', freq='D'): [33, 46],
Timestamp('2017-05-21 00:00:00', freq='D'): [33, 5],
Timestamp('2017-05-22 00:00:00', freq='D'): [56, 44]}
Details
print(sheet1, sheet2, sep='\n\n')
duration date
0 10 2017-05-20 08:20:00
1 23 2017-05-20 10:20:00
2 33 2017-05-21 12:20:00
3 56 2017-05-22 23:20:00
duration date
0 34 2017-05-20 01:20:00
1 12 2017-05-20 03:20:00
2 5 2017-05-21 11:20:00
3 44 2017-05-22 23:20:00
For your problem, I'd do this:
from cytoolz.dicttoolz import merge_with

def make_goofy_dict(d):
    d = d.set_index('date').duration.resample('D').sum()
    return d.apply(lambda x: [x]).to_dict()

def read_sheet(xls, sn):
    return pd.read_excel(xls, sheet_name=sn, parse_dates=['date'])

xls = pd.ExcelFile('reports.xlsx')

sheet_dict = merge_with(
    lambda x: sum(x, []),
    map(make_goofy_dict, (read_sheet(xls, sn) for sn in xls.sheet_names))
)
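If you prefer day-of-month keys like the day[20] style in the question, a small sketch converting the merged dict (assumes all dates fall within a single month):

by_day = {ts.day: durations for ts, durations in sheet_dict.items()}
# e.g. {20: [33, 46], 21: [33, 5], 22: [56, 44]}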
Related
Let's say I have this data frame. Here I want to match a date value and compare whether the employee ID at the current index and at the next index is the same; if so, I want to delete the row at the next index. I tried the code below but am getting KeyError: 1.
EMPL_ID  AGE  EMPLOYER  END_DATE
12       23   BHU       2022-04-22 00:00:00
12       21   BHU       2022-04-22 00:00:00
34       22   DU        2022-04-22 00:00:00
36       21   BHU       2022-04-22 00:00:00
for index, row in df.iterrows():
    value = row['END_DATE']
    if (value == '2022-04-22 00:00:00'):
        a = index
        if (df.loc[a, 'EMPL_ID'] == df.loc[a+1, 'EMPL_ID']):
            df.drop(a+1, inplace=True)
        else:
            df = df
    else:
        df = df
Keeping only the unique rows based on two columns can be done with drop_duplicates, supplying the columns that make up the subset:
df = pd.DataFrame(data={'id': [1, 2, 3, 3], 'time': [1, 2, 3, 3]})
>>> df
id time
0 1 1
1 2 2
2 3 3
3 3 3
df.drop_duplicates(subset=['id', 'time'], inplace=True)
>>> df
id time
0 1 1
1 2 2
2 3 3
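Applied to the question's columns, a minimal sketch (note this removes all repeated (EMPL_ID, END_DATE) pairs, not just consecutive ones):

# Keep the first occurrence of each (EMPL_ID, END_DATE) pair
df = df.drop_duplicates(subset=['EMPL_ID', 'END_DATE'], keep='first')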
IIUC, here is one approach. Since you only search for consecutive duplicates, I added another row to your example; the second row with EMPL_ID 34 won't be deleted.
df = pd.DataFrame({'EMPL_ID': [12, 12, 34, 36, 34],
                   'AGE': [23, 21, 22, 21, 22],
                   'EMPLOYER': ['BHU', 'BHU', 'DU', 'BHU', 'DU'],
                   'END_DATE': ['2022-04-22 00:00:00', '2022-04-22 00:00:00',
                                '2022-04-22 00:00:00', '2022-04-22 00:00:00',
                                '2022-04-22 00:00:00'],
                   })
print(df)
EMPL_ID AGE EMPLOYER END_DATE
0 12 23 BHU 2022-04-22 00:00:00
1 12 21 BHU 2022-04-22 00:00:00
2 34 22 DU 2022-04-22 00:00:00
3 36 21 BHU 2022-04-22 00:00:00
4 34 22 DU 2022-04-22 00:00:00
grp = ['END_DATE', 'EMPL_ID']
df['group'] = (df[grp] != df[grp].shift(1)).any(axis=1).cumsum()
df = df.drop_duplicates('group', keep='first').drop('group', axis=1)
print(df)
EMPL_ID AGE EMPLOYER END_DATE
0 12 23 BHU 2022-04-22 00:00:00
2 34 22 DU 2022-04-22 00:00:00
3 36 21 BHU 2022-04-22 00:00:00
4 34 22 DU 2022-04-22 00:00:00
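Equivalently, you can filter on the shift comparison directly without the helper column; a sketch using the same grp list on the original df:

# True where both EMPL_ID and END_DATE equal the previous row's values
dup_of_prev = (df[grp] == df[grp].shift()).all(axis=1)
df = df[~dup_of_prev]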
Here is df.head(7):
Month,ward1,ward2,...ward30
Apr-19, 20, 30, 45
May-19, 18, 25, 42
Jun-19, 25, 19, 35
Jul-19, 28, 22, 38
Aug-19, 24, 15, 40
Sep-19, 21, 14, 39
Oct-19, 15, 18, 41
to:
Month, ward1
Apr-19, 20
May-19, 18
Jun-19, 25
Jul-19, 28
Aug-19, 24
Sep-19, 21
Oct-19, 15
Month,ward2
Apr-19, 30
May-19, 25
Jun-19, 19
Jul-19, 22
Aug-19, 15
Sep-19, 14
Oct-19, 18
Month, ward30
Apr-19, 45
May-19, 42
Jun-19, 35
Jul-19, 38
Aug-19, 40
Sep-19, 39
Oct-19, 41
How to group-by date wise in python using pandas?
I have a dataframe df that contains a datetime column and 30 other columns; I want to split it in pandas so that the date is attached to each of those columns, but I am facing some difficulties.
Try using a dictionary comprehension to hold your separate dataframes:
dfs = {col : df.set_index('Month')[[col]] for col in (df.set_index('Month').columns)}
print(dfs['ward1'])
ward1
Month
Apr-19 20
May-19 18
Jun-19 25
Jul-19 28
Aug-19 24
Sep-19 21
Oct-19 15
print(dfs['ward30'])
ward30
Month
Apr-19 45
May-19 42
Jun-19 35
Jul-19 38
Aug-19 40
Sep-19 39
Oct-19 41
One straightforward way would be to set the date column as the index and separate out every other column:
data.set_index('Month', inplace=True)
data_dict = {col: data[col] for col in data.columns}
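If the goal is one file per ward, as the expected output suggests, you could write each Series to its own CSV; a sketch with hypothetical file names (Month becomes the index column of each file):

for col, series in data_dict.items():
    series.to_csv(f'{col}.csv', header=True)  # e.g. ward1.csv ... ward30.csv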
You have to create new DataFrames:
data1 = pd.DataFrame()
data1['Month'] = df['Month']
data1['ward1'] = df['ward1']
data1.head()
I have a dataframe like this:
d = {'Date': ['2020-10-09', '2020-10-09', '2020-10-09', '2020-10-10', '2020-10-10', '2020-10-10', '2020-10-11', '2020-10-11', '2020-10-11'],
     'ID': ['T1', 'T2', 'T3', 'T1', 'T2', 'T3', 'T1', 'T2', 'T3'],
     'Value': [13, 12, 11, 14, 15, 16, 20, 21, 22]}
df = pd.DataFrame(data=d)
df
Date ID Value
0 2020-10-09 T1 13
1 2020-10-09 T2 12
2 2020-10-09 T3 11
3 2020-10-10 T1 14
4 2020-10-10 T2 15
5 2020-10-10 T3 16
6 2020-10-11 T1 20
7 2020-10-11 T2 21
8 2020-10-11 T3 22
And I'm trying to get:
d = {'Date': ['2020-10-09', '2020-10-10', '2020-10-11'],
     'Value T1': ['13', '14', '20'],
     'Value T2': ['12', '15', '21'],
     'Value T3': ['11', '16', '22']}
df = pd.DataFrame(data=d)
df
Date Value T1 Value T2 Value T3
0 2020-10-09 13 12 11
1 2020-10-10 14 15 16
2 2020-10-11 20 21 22
I tried with pivot but I got the error:
"Index contains duplicate entries, cannot reshape"
Use pd.pivot_table as shown below:
pdf = pd.pivot_table(
    df,
    values=['Value'],
    index=['Date'],
    columns=['ID'],
    aggfunc='first'
).reset_index(drop=False)
pdf.columns = ['Date', 'Value T1', 'Value T2', 'Value T3']
Date Value T1 Value T2 Value T3
0 2020-10-09 13 12 11
1 2020-10-10 14 15 16
2 2020-10-11 20 21 22
Note that aggfunc is 'first' here, which means that if there are multiple values for a given ID on a given Date, you'll get the first one in the dataframe. You can change it to min/max/last as needed.
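If the real data has more IDs than the three in the sample, hard-coding pdf.columns won't scale; a sketch that flattens the pivoted MultiIndex generically (same pdf as above):

# ('Date', '') -> 'Date', ('Value', 'T1') -> 'Value T1', etc.
pdf.columns = [' '.join(filter(None, col)) for col in pdf.columns]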
Using the sample credit card transaction data below:
import random
from datetime import datetime, timedelta

import pandas as pd

df = pd.DataFrame({
    'card_id': [1, 1, 1, 2, 2],
    'date': [datetime(2020, 6, random.randint(1, 14)) for i in range(5)],
    'amount': [random.randint(1, 100) for i in range(5)]})
df
card_id date amount
0 1 2020-06-07 11
1 1 2020-06-11 45
2 1 2020-06-14 87
3 2 2020-06-04 48
4 2 2020-06-12 76
I'm trying to get the total amount spent in the past 7 days for a card at the point of each transaction. For example, if card_id 1 made a transaction on June 8, I want the total of its transactions from June 1 to June 7. This is what I'm hoping to get:
card_id date amount sum_past_7d
0 1 2020-06-07 11 0
1 1 2020-06-11 45 11
2 1 2020-06-14 87 56
3 2 2020-06-04 48 0
4 2 2020-06-12 76 48
I'm currently using this function with apply to generate my desired column, but it's taking too long on the actual data (> 1 million rows).
df['past_week'] = df['date'].apply(lambda x: x - timedelta(days=7))

def myfunction(x):
    return df.loc[(df['card_id'] == x.card_id) &
                  (df['date'] >= x.past_week) &
                  (df['date'] < x.date), :]['amount'].sum()
Is there a faster and more efficient way to do this?
Let's try rolling on date with groupby:
# make sure the data is sorted properly
# your sample is already sorted, so you can skip this
df = df.sort_values(['card_id', 'date'])
df['sum_past_7D'] = (df.set_index('date').groupby('card_id')
                       ['amount'].rolling('7D').sum()
                       .groupby('card_id').shift(fill_value=0)
                       .values
                     )
Output:
card_id date amount sum_past_7D
0 1 2020-06-07 11 0.0
1 1 2020-06-11 45 11.0
2 1 2020-06-14 87 56.0
3 2 2020-06-04 48 0.0
4 2 2020-06-12 76 48.0
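The shift approach returns, for each row, the trailing 7-day total as of the previous transaction. If you need a strict window ending just before the current transaction, rolling also supports closed='left', which excludes the window's right endpoint; a sketch (note it gives 0.0 instead of 48 for card 2's 2020-06-12 row, since 2020-06-04 falls more than 7 days earlier):

# Trailing 7-day sum per card, excluding the current transaction's timestamp
df['sum_past_7D'] = (df.set_index('date').groupby('card_id')['amount']
                       .rolling('7D', closed='left').sum()
                       .fillna(0)
                       .values)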
I have this dataframe (a sample of it at least):
DATETIME_FROM DATETIME_TO MEAS ROW VEHICLE SPEED
1 2020-02-27 05:19:42.750 2020-02-27 05:20:42.750 2.2844 1 26 85
2 2020-02-27 05:30:06.050 2020-02-27 05:31:06.050 2.5256 1 31 69
3 2020-02-27 05:36:02.370 2020-02-27 05:37:02.370 4.8933 1 37 86
4 2020-02-27 05:41:12.005 2020-02-27 05:42:12.005 2.6998 1 27 86
5 2020-02-27 05:46:30.773 2020-02-27 05:47:30.773 2.2720 1 26 86
6 2020-02-27 05:50:53.862 2020-02-27 05:51:53.862 4.6953 1 3 82
7 2020-02-27 05:59:45.381 2020-02-27 06:00:45.381 2.5942 1 31 86
8 2020-02-27 06:04:12.657 2020-02-27 06:05:12.657 4.9136 1 37 86
The result should be a table where I get the mean for every vehicle, each day, but I would also like to have a total mean of MEAS per day and per vehicle.
I am using this:
pd.crosstab([valid1low.DATE,valid1low.ROW], [valid1low.VEHICLE], values=valid1low.MEAS, aggfunc=[np.mean], margins=True)
And the total looks like an average, but if I use Excel to compute the average, I don't get the same result.
Could this be because Excel is not using the same precision for the MEAS values? And how would I get the same result?
The end user of this table will be using Excel, so if the total average differs from Excel's, I would get questions :)
What I think you are looking for, if I understand correctly, is groupby. I have tried to recreate a similar dataframe with the code below to explain.
import pandas as pd
from datetime import datetime
df = pd.DataFrame()
df['DATETIME_FROM'] = pd.to_datetime(pd.DataFrame({
    'year': [2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020],
    'month': [2, 2, 2, 2, 2, 2, 2, 2],
    'day': [27, 27, 27, 27, 28, 28, 28, 28],
    'hour': [24, 26, 28, 30, 32, 34, 36, 38],
    'minute': [2, 4, 6, 8, 10, 12, 14, 16],
    'second': [1, 3, 5, 7, 8, 10, 12, 13]}))
df['DATETIME_TO'] = pd.to_datetime(pd.DataFrame({
    'year': [2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020],
    'month': [2, 2, 2, 2, 2, 2, 2, 2],
    'day': [27, 27, 27, 27, 28, 28, 28, 28],
    'hour': [25, 27, 29, 31, 33, 35, 37, 39],
    'minute': [3, 5, 7, 9, 11, 13, 15, 17],
    'second': [2, 4, 6, 8, 10, 12, 14, 16]}))
df['MEAS'] = [2.2844, 2.5256, 4.8933, 2.6998, 1, 2, 3, 4]
df['ROW'] = [1, 1, 1, 1, 2, 2, 2, 2]
df['VEHICLE'] = [26, 31, 37, 27, 65, 46, 45, 49]
df['VEHICLE_SPEED'] = [85, 69, 86, 86, 90, 91, 92, 93]
The dataframe that this code creates looks like the following.
DATETIME_FROM DATETIME_TO MEAS ROW VEHICLE VEHICLE_SPEED
0 2020-02-28 00:02:01 2020-02-28 01:03:02 2.2844 1 26 85
1 2020-02-28 02:04:03 2020-02-28 03:05:04 2.5256 1 31 69
2 2020-02-28 04:06:05 2020-02-28 05:07:06 4.8933 1 37 86
3 2020-02-28 06:08:07 2020-02-28 07:09:08 2.6998 1 27 86
4 2020-02-29 08:10:08 2020-02-29 09:11:10 1.0000 2 65 90
5 2020-02-29 10:12:10 2020-02-29 11:13:12 2.0000 2 46 91
6 2020-02-29 12:14:12 2020-02-29 13:15:14 3.0000 2 45 92
7 2020-02-29 14:16:13 2020-02-29 15:17:16 4.0000 2 49 93
You said that you need the mean for each vehicle per day and the mean of MEAS per day, so I grouped by day using the groupby function along with pd.Grouper, specifying the day as the grouping frequency for the DATETIME_FROM column. Then I took the mean of all the rows for each column with mean(), which sums the values in a column and divides by the number of rows.
means = df.set_index(["DATETIME_FROM"]).groupby(pd.Grouper(freq='D')).mean()
The dataframe means now contains the following. DATETIME_FROM is now the index, as we have grouped by this column.
MEAS ROW VEHICLE VEHICLE_SPEED
DATETIME_FROM
2020-02-27 3.100775 1.0 30.25 81.5
2020-02-28 2.500000 2.0 51.25 91.5
When you say you want the total means of MEAS and VEHICLE, I am assuming that you want the mean of the values of those columns in the means dataframe. This can be done by just taking the means of those columns; I then created a new dataframe called total and added these entries.
mean_meas = means['MEAS'].mean()
mean_vehicles = means['VEHICLE'].mean()
total = pd.DataFrame({'MEAN MEAS': [mean_meas], 'MEAN VEHICLE': [mean_vehicles]})
The total dataframe will then include the following:
   MEAN MEAS  MEAN VEHICLE
0   2.800388         40.75
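One plausible reason the Excel comparison differs: crosstab's margins aggregate over all underlying rows, while averaging a summary table averages the daily means, and the two only match when every day has the same number of rows (four per day in this sample). A quick sketch to check, using the df and means built above:

grand_mean = df['MEAS'].mean()          # mean over all 8 rows
mean_of_means = means['MEAS'].mean()    # mean of the 2 daily means
print(grand_mean, mean_of_means)        # both 2.800388 here, since each day has 4 rows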
I hope this helps; if you have a question, let me know!