Pandas: Getting average of columns and rows with crosstab - python-3.x

I have this dataframe (a sample of it, at least):
DATETIME_FROM DATETIME_TO MEAS ROW VEHICLE SPEED
1 2020-02-27 05:19:42.750 2020-02-27 05:20:42.750 2.2844 1 26 85
2 2020-02-27 05:30:06.050 2020-02-27 05:31:06.050 2.5256 1 31 69
3 2020-02-27 05:36:02.370 2020-02-27 05:37:02.370 4.8933 1 37 86
4 2020-02-27 05:41:12.005 2020-02-27 05:42:12.005 2.6998 1 27 86
5 2020-02-27 05:46:30.773 2020-02-27 05:47:30.773 2.2720 1 26 86
6 2020-02-27 05:50:53.862 2020-02-27 05:51:53.862 4.6953 1 3 82
7 2020-02-27 05:59:45.381 2020-02-27 06:00:45.381 2.5942 1 31 86
8 2020-02-27 06:04:12.657 2020-02-27 06:05:12.657 4.9136 1 37 86
The result should be a table where I get the mean of every vehicle, each day,
but I would also like a total mean of MEAS per day and per vehicle.
I am using this:
pd.crosstab([valid1low.DATE,valid1low.ROW], [valid1low.VEHICLE], values=valid1low.MEAS, aggfunc=[np.mean], margins=True)
And the total looks like an average, but if I use Excel to compute the average, I don't get the same result.
Could this be because Excel is not using the same precision for the MEAS values?
And how would I get the same result?
The end user of this table will be using Excel, so if the total average differs from Excel's, I will get questions :)

If I understand correctly, what you are looking for is groupby. I have tried to recreate a similar dataframe with the code below to explain.
import pandas as pd

df = pd.DataFrame()
# NOTE: hour values above 23 roll over into the next day, which is why
# the assembled dates below land on Feb 28 and Feb 29.
df['DATETIME_FROM'] = pd.to_datetime(pd.DataFrame({
    'year': [2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020],
    'month': [2, 2, 2, 2, 2, 2, 2, 2],
    'day': [27, 27, 27, 27, 28, 28, 28, 28],
    'hour': [24, 26, 28, 30, 32, 34, 36, 38],
    'minute': [2, 4, 6, 8, 10, 12, 14, 16],
    'second': [1, 3, 5, 7, 8, 10, 12, 13]}))
df['DATETIME_TO'] = pd.to_datetime(pd.DataFrame({
    'year': [2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020],
    'month': [2, 2, 2, 2, 2, 2, 2, 2],
    'day': [27, 27, 27, 27, 28, 28, 28, 28],
    'hour': [25, 27, 29, 31, 33, 35, 37, 39],
    'minute': [3, 5, 7, 9, 11, 13, 15, 17],
    'second': [2, 4, 6, 8, 10, 12, 14, 16]}))
df['MEAS'] = [2.2844, 2.5256, 4.8933, 2.6998, 1, 2, 3, 4]
df['ROW'] = [1, 1, 1, 1, 2, 2, 2, 2]
df['VEHICLE'] = [26, 31, 37, 27, 65, 46, 45, 49]
df['VEHICLE_SPEED'] = [85, 69, 86, 86, 90, 91, 92, 93]
The dataframe that this code creates looks like the following.
DATETIME_FROM DATETIME_TO MEAS ROW VEHICLE VEHICLE_SPEED
0 2020-02-28 00:02:01 2020-02-28 01:03:02 2.2844 1 26 85
1 2020-02-28 02:04:03 2020-02-28 03:05:04 2.5256 1 31 69
2 2020-02-28 04:06:05 2020-02-28 05:07:06 4.8933 1 37 86
3 2020-02-28 06:08:07 2020-02-28 07:09:08 2.6998 1 27 86
4 2020-02-29 08:10:08 2020-02-29 09:11:10 1.0000 2 65 90
5 2020-02-29 10:12:10 2020-02-29 11:13:12 2.0000 2 46 91
6 2020-02-29 12:14:12 2020-02-29 13:15:14 3.0000 2 45 92
7 2020-02-29 14:16:13 2020-02-29 15:17:16 4.0000 2 49 93
You said that you need the mean of each vehicle per day and the mean of MEAS per day. So I grouped by day using the groupby function along with pd.Grouper, specifying the day as the grouping frequency on the DATETIME_FROM column. Then I took the mean of every column within each group using the mean function, which sums the values in a column and divides by the number of rows.
means = df.set_index(["DATETIME_FROM"]).groupby(pd.Grouper(freq='D')).mean()
The dataframe means now contains the following. DATETIME_FROM is now the index, as we have grouped by this column.
MEAS ROW VEHICLE VEHICLE_SPEED
DATETIME_FROM
2020-02-28 3.100775 1.0 30.25 81.5
2020-02-29 2.500000 2.0 51.25 91.5
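If you also need the "mean of every vehicle, each day" part of your question, you can group by both the day and the vehicle; a sketch, assuming the df above:
# One mean per (day, vehicle) pair. In this recreated sample every
# vehicle appears only once per day, so each mean equals its raw MEAS value.
per_vehicle = (df.set_index('DATETIME_FROM')
                 .groupby([pd.Grouper(freq='D'), 'VEHICLE'])['MEAS']
                 .mean())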
When you say you want the total means of MEAS and VEHICLE, I am assuming you want the mean of the values in the columns of the means dataframe. This can be done by just taking the means of those columns; I then created a new dataframe called total and added these entries.
mean_meas = means['MEAS'].mean()
mean_vehicle = means['VEHICLE'].mean()
total = pd.DataFrame({'MEAN MEAS': [mean_meas], 'MEAN VEHICLE': [mean_vehicle]})
The total dataframe will then contain the following:
MEAN MEAS MEAN VEHICLE
0 2.800388 40.75
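A caveat that likely explains the Excel discrepancy from the question: MEAN MEAS above is a mean of the daily means, whereas the All margin that crosstab produces with aggfunc=mean is, to my understanding, recomputed over every underlying row. The two only agree when all groups hold the same number of rows, so unequal group sizes, rather than floating-point precision, are the more likely culprit. A quick check, assuming the df above:
mean_of_means = means['MEAS'].mean()  # average of the per-day averages
grand_mean = df['MEAS'].mean()        # pooled mean over all rows (what the margin reports)
# These differ whenever the days contain different numbers of rows.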
I hope this helps, if you have a question let me know!

Related

Moving aggregate within a specified date range

Using the sample credit card transaction data below:
import random
from datetime import datetime, timedelta
import pandas as pd

# random data; the output shown below is one particular draw
df = pd.DataFrame({
    'card_id': [1, 1, 1, 2, 2],
    'date': [datetime(2020, 6, random.randint(1, 14)) for i in range(5)],
    'amount': [random.randint(1, 100) for i in range(5)]})
df
card_id date amount
0 1 2020-06-07 11
1 1 2020-06-11 45
2 1 2020-06-14 87
3 2 2020-06-04 48
4 2 2020-06-12 76
For each transaction, I'm trying to take the card's total amount spent in the past 7 days. For example, if card_id 1 made a transaction on June 8, I want to get the total of its transactions from June 1 to June 7. This is what I was hoping to get:
card_id date amount sum_past_7d
0 1 2020-06-07 11 0
1 1 2020-06-11 45 11
2 1 2020-06-14 87 56
3 2 2020-06-04 48 0
4 2 2020-06-12 76 48
I'm currently using this function with pd.apply to generate my desired column, but it's taking too long on the actual data (> 1 million rows).
df['past_week'] = df['date'].apply(lambda x: x - timedelta(days=7))

def myfunction(x):
    return df.loc[(df['card_id'] == x.card_id) &
                  (df['date'] >= x.past_week) &
                  (df['date'] < x.date), :]['amount'].sum()
Is there a faster and more efficient way to do this?
Let's try rolling on date with groupby:
# make sure the data is sorted properly
# (your sample is already sorted, so you can skip this)
df = df.sort_values(['card_id', 'date'])

df['sum_past_7D'] = (df.set_index('date').groupby('card_id')
                       ['amount'].rolling('7D').sum()           # 7-day sum, includes the current row
                       .groupby('card_id').shift(fill_value=0)  # shift to exclude the current row
                       .values
                     )
Output:
card_id date amount sum_past_7D
0 1 2020-06-07 11 0.0
1 1 2020-06-11 45 11.0
2 1 2020-06-14 87 56.0
3 2 2020-06-04 48 0.0
4 2 2020-06-12 76 48.0
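As an aside, rolling also accepts closed='left', which implements the stated rule ("June 1 to June 7 for a June 8 transaction") by excluding the current timestamp instead of shifting one row back; a sketch, assuming the same sorted df. Note it returns 0 for card 2's June 12 row, since its only earlier transaction (June 4) falls more than 7 days back, while the shift-based version reports 48 there.
df['sum_past_7D'] = (df.set_index('date')
                       .groupby('card_id')['amount']
                       .rolling('7D', closed='left')  # window is [t-7D, t), excludes t
                       .sum()
                       .fillna(0)  # first transaction per card has an empty window
                       .values)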

How to groupby and sum on different column pandas

I have a dataframe df which I want to group by the columns Letter and From. Example df below:
Letter Price From RT To
0 A 4 2020-06-04 11 2020-06-05
1 B 12 2020-06-04 11 2020-06-05
2 A 20 2020-06-04 11 2020-06-05
3 A 5 2020-06-04 11 2020-06-05
4 B 89 2020-06-05 11 2020-06-06
5 A 56 2020-06-05 11 2020-06-06
6 B 1 2020-06-06 11 2020-06-07
In standard SQL I would write the following query to achieve the desired result.
SELECT
    Letter,
    From,
    SUM(Price),
    MAX(RT)
FROM
    some.table
GROUP BY Letter, From
I tried the following, but it didn't work.
df.groupby(['Letter','From']).sum(['RT','To'])
Try this:
df['From'] = pd.to_datetime(df['From'])
df['To'] = pd.to_datetime(df['To'])
df = df.groupby(by=['Letter', 'From'], as_index=False).agg({
    'Price': 'sum',
    'RT': 'max'
})
print(df)
Letter From Price RT
0 A 2020-06-04 29 11
1 A 2020-06-05 56 11
2 B 2020-06-04 12 11
3 B 2020-06-05 89 11
4 B 2020-06-06 1 11
The columns to be aggregated can be selected right after the groupby, just as you would slice a standard df. See the docs for the parameters you can pass to sum. But you need aggregate (or its alias agg) to aggregate each column with a different function:
df.groupby(['Letter', 'From'])[['Price', 'RT']].agg({'Price': 'sum', 'RT': 'max'})
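On pandas 0.25+ you can also use named aggregation, which mirrors the SQL aliases; a sketch, assuming the df above:
out = df.groupby(['Letter', 'From'], as_index=False).agg(
    Price=('Price', 'sum'),  # SUM(Price)
    RT=('RT', 'max'),        # MAX(RT)
)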

Iterate over rows in a dataframe, create a new column, then add more columns based on the new column

I have a data frame as below:
Date Quantity
2019-04-25 100
2019-04-26 148
2019-04-27 124
The output I need is to take the quantity difference between each pair of consecutive dates, spread it evenly over 24 hours, and create 23 columns where each hourly increment is added to the previous column, as below:
Date Quantity Hour-1 Hour-2 ....Hour-23
2019-04-25 100 102 104 .... 146
2019-04-26 148 147 146 .... 123
2019-04-27 124
I'm trying to iterate with a loop, but it's not working; my code is below:
for i in df.index:
    diff = (df.get_value(i+1, 'Quantity') - df.get_value(i, 'Quantity')) / 24
    for j in range(24):
        df[i, [1+j]] = df[i, [j]] * (1 + diff)
I did some research but have not found how to create columns like the above iteratively. I hope you can help me. Thank you in advance.
IIUC, using resample and interpolate, then pivoting the output:
s = df.set_index('Date').resample('1H').interpolate()
s = pd.pivot_table(s, index=s.index.date,
                   columns=s.groupby(s.index.date).cumcount(),
                   aggfunc='mean')
s.columns = s.columns.droplevel(0)
s
Out[93]:
0 1 2 3 ... 20 21 22 23
2019-04-25 100.0 102.0 104.0 106.0 ... 140.0 142.0 144.0 146.0
2019-04-26 148.0 147.0 146.0 145.0 ... 128.0 127.0 126.0 125.0
2019-04-27 124.0 NaN NaN NaN ... NaN NaN NaN NaN
[3 rows x 24 columns]
If I have understood the question correctly, a for-loop approach:
list_of_values = []
for i, row in df.iterrows():
    if i < len(df) - 1:  # the last row has no next row to diff against
        qty = row['Quantity']
        qty_2 = df.at[i+1, 'Quantity']
        diff = (qty_2 - qty) / 24
        list_of_values.append(diff)
    else:
        list_of_values.append(0)
df['diff'] = list_of_values
Output:
Date Quantity diff
2019-04-25 100 2
2019-04-26 148 -1
2019-04-27 124 0
Now create the columns required, i.e.:
df['Hour-1'] = df['Quantity'] + df['diff']
df['Hour-2'] = df['Quantity'] + 2*df['diff']
...
There are other approaches that will work much better; one vectorized variant is sketched below.
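For instance, using shift to compute the per-hour step without iterrows at all (a sketch, assuming the df above):
# Per-hour step between consecutive dates; the last row gets 0.
df['diff'] = (df['Quantity'].shift(-1) - df['Quantity']).fillna(0) / 24
# Build the 23 hourly columns in one pass.
for h in range(1, 24):
    df[f'Hour-{h}'] = df['Quantity'] + h * df['diff']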

Python - sum of each day in multiple Excel sheets

I have Excel sheet data like below:
Sheet1
duration date
10 5/20/2017 08:20
23 5/20/2017 10:20
33 5/21/2017 12:20
56 5/22/2017 23:20
Sheet2
duration date
34 5/20/2017 01:20
12 5/20/2017 03:20
05 5/21/2017 11:20
44 5/22/2017 23:20
Expected output:
day[20] : [33, 46]
day[21] : [33, 5]
day[22] : [56, 44]
I am trying to sum the duration day-wise across all sheets with the code below:
xls = pd.ExcelFile('reports.xlsx')
report_sheets = []
for sheetName in xls.sheet_names:
    sheet = pd.read_excel(xls, sheet_name=sheetName)
    sheet['date'] = pd.to_datetime(sheet['date'])
    print(sheet.groupby(sheet['date'].dt.strftime('%Y-%m-%d'))['duration'].sum().sort_values())
How can I achieve this?
You can use the parameter sheet_name=None in read_excel to return a dictionary of DataFrames:
dfs = pd.read_excel('reports.xlsx', sheet_name=None)
print (dfs)
OrderedDict([('Sheet1', duration date
0 10 5/20/2017 08:20
1 23 5/20/2017 10:20
2 33 5/21/2017 12:20
3 56 5/22/2017 23:20), ('Sheet2', duration date
0 34 5/20/2017 01:20
1 12 5/20/2017 03:20
2 5 5/21/2017 11:20
3 44 5/22/2017 23:20)])
Then aggregate in a dictionary comprehension:
dfs1 = {i:x.groupby(pd.to_datetime(x['date']).dt.strftime('%Y-%m-%d'))['duration'].sum() for i, x in dfs.items()}
print (dfs1)
{'Sheet2': date
2017-05-20 46
2017-05-21 5
2017-05-22 44
Name: duration, dtype: int64, 'Sheet1': date
2017-05-20 33
2017-05-21 33
2017-05-22 56
Name: duration, dtype: int64}
And last, concat, create the lists, and build the final dictionary with to_dict:
d = pd.concat(dfs1).groupby(level=1).apply(list).to_dict()
print (d)
{'2017-05-22': [56, 44], '2017-05-21': [33, 5], '2017-05-20': [33, 46]}
Make a function that takes the sheet's dataframe and returns a dictionary:
def make_goofy_dict(d):
    d = d.set_index('date').duration.resample('D').sum()
    return d.apply(lambda x: [x]).to_dict()
Then use merge_with from either toolz or cytoolz
from cytoolz.dicttoolz import merge_with
merge_with(lambda x: sum(x, []), map(make_goofy_dict, (sheet1, sheet2)))
{Timestamp('2017-05-20 00:00:00', freq='D'): [33, 46],
Timestamp('2017-05-21 00:00:00', freq='D'): [33, 5],
Timestamp('2017-05-22 00:00:00', freq='D'): [56, 44]}
details
print(sheet1, sheet2, sep='\n\n')
duration date
0 10 2017-05-20 08:20:00
1 23 2017-05-20 10:20:00
2 33 2017-05-21 12:20:00
3 56 2017-05-22 23:20:00
duration date
0 34 2017-05-20 01:20:00
1 12 2017-05-20 03:20:00
2 5 2017-05-21 11:20:00
3 44 2017-05-22 23:20:00
For your problem, I'd do this:
from cytoolz.dicttoolz import merge_with

def make_goofy_dict(d):
    d = d.set_index('date').duration.resample('D').sum()
    return d.apply(lambda x: [x]).to_dict()

def read_sheet(xls, sn):
    return pd.read_excel(xls, sheet_name=sn, parse_dates=['date'])

xls = pd.ExcelFile('reports.xlsx')
sheet_dict = merge_with(
    lambda x: sum(x, []),
    map(make_goofy_dict, map(read_sheet, xls.sheet_names))
)
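If you'd rather not depend on cytoolz, a minimal stand-in for merge_with using only the standard library (a sketch, reusing make_goofy_dict and read_sheet from above):
from collections import defaultdict

def merge_lists(dicts):
    # Collect each sheet's per-day lists into one combined dict.
    out = defaultdict(list)
    for d in dicts:
        for k, v in d.items():
            out[k].extend(v)
    return dict(out)

xls = pd.ExcelFile('reports.xlsx')
sheet_dict = merge_lists(
    make_goofy_dict(read_sheet(xls, sn)) for sn in xls.sheet_names
)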

Grouping and Multiindexing a pandas dataframe

Suppose I have a dataframe as follows
In [6]: df.head()
Out[6]:
regiment company name preTestScore postTestScore
0 Nighthawks 1st Miller 4 25
1 Nighthawks 1st Jacobson 24 94
2 Nighthawks 2nd Ali 31 57
3 Nighthawks 2nd Milner 2 62
4 Dragoons 1st Cooze 3 70
I have a dictionary as follows:
army = {'Majors' : 'Nighthawks', 'Captains' : 'Dragoons'}
and I want the dataframe to have a multi-index in the shape of ["army", "company"] only.
How should I proceed?
If I understand correctly:
You can use map to look up values in a dictionary (using a dictionary comprehension to swap the key/value pairs, since they are backwards):
army = {'Majors': 'Nighthawks', 'Captains': 'Dragoons'}
df.assign(army=df.regiment.map({k:v for v, k in army.items()})).set_index(['army', 'company'], drop=True)
regiment name preTestScore postTestScore
army company
Majors 1st Nighthawks Miller 4 25
1st Nighthawks Jacobson 24 94
2nd Nighthawks Ali 31 57
2nd Nighthawks Milner 2 62
Captains 1st Dragoons Cooze 3 70
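Equivalently, step by step (a sketch of the same idea):
# Invert the army dict once so regiments map to ranks.
rev = {regiment: rank for rank, regiment in army.items()}
# rev == {'Nighthawks': 'Majors', 'Dragoons': 'Captains'}
df['army'] = df['regiment'].map(rev)
df = df.set_index(['army', 'company'])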
