Average time without changing format using pandas - python-3.x

Below is my avg_df:
     Date      Model  INumber  Type        TimeDiff  Device
326  20/07/18  TG     I625     Devicetime  0:02:31   RD
328  20/07/18  TG     I5271    Devicetime  0:00:32   RD
332  20/07/18  TG     I660     Devicetime  0:00:31   RD
I want to get the average of "TimeDiff". I know that I can convert the times into seconds, take the average, and format the result back, but I would be interested to know whether there is a way to get it without converting back and forth. Something like below:
print(avg_df.loc[:,"TimeDiff"].mean())
Appreciate any help!

You can get the average if you convert it to timedelta first:
>>> pd.to_timedelta(avg_df['TimeDiff']).mean()
Timedelta('0 days 00:01:11.333333')
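As a minimal, self-contained sketch of the same idea, the snippet below rebuilds only the TimeDiff column from the question (the other columns are left out) and shows one possible way to print the mean in the original H:MM:SS style. The helper variable names are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Only the TimeDiff column from the question; other columns are omitted.
avg_df = pd.DataFrame({"TimeDiff": ["0:02:31", "0:00:32", "0:00:31"]})

# Parse the strings as timedeltas and average them directly.
mean_td = pd.to_timedelta(avg_df["TimeDiff"]).mean()

# Optional: show the result in H:MM:SS form, dropping fractional seconds.
total_seconds = int(mean_td.total_seconds())
hours, remainder = divmod(total_seconds, 3600)
minutes, seconds = divmod(remainder, 60)
print(f"{hours}:{minutes:02d}:{seconds:02d}")  # 0:01:11
```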

Related

How can I read and process 100 bytes at a time from a large CSV file?

The below csv is only a snippet of my main data file.
customer.csv
customer_id,order_id,number_of_items
10,4736,9
5,3049,1
1,4689,3
6,4114,9
1,4524,15
2,3727,16
3,3507,7
7,3988,3
5,4993,16
6,1945,4
7,3081,7
3,3707,2
5,1739,12
9,4167,17
7,3242,12
2,3109,10
10,2197,20
10,3528,13
8,4917,2
5,1713,19
8,4224,4
7,2160,2
10,2044,19
10,2956,8
3,3906,2
5,2288,16
7,1854,20
7,4404,2
9,1622,2
7,3685,2
10,2755,10
3,3390,10
6,1424,6
3,2127,15
4,1221,15
9,2994,14
1,1413,13
7,2771,7
3,4579,13
10,2208,4
CURRENTLY ALL I HAVE
import os
os.path.getsize("customer.csv") # outputs, 424 bytes
HOW I THINK I NEED TO PROCEED
I think I need to do something with open csv and read bytes? Then look at each row bit wise?
Please note, I am not looking specifically for someone to just give me an answer on how to do this (although that would be appreciated). Therefore, if someone could just point me in the right direction or give me some topics to look into that would be great. Side note, I know I am supposed to use encoding and decoding somewhere for this task.
This script uses the csv module to load the data from customer.csv and computes the average number of items per customer with the built-in statistics module:
import csv
from statistics import mean
with open('customer.csv', newline='') as csvfile:
    data = csv.DictReader(csvfile)
    # group the customers by customer_id
    customers = {}
    for order in data:
        customers.setdefault(order['customer_id'], []).append(int(order['number_of_items']))
# print the `average`:
print('{:<15} {}'.format('customer_id', 'average'))
for k, v in customers.items():
    print('{:<15} {:.2f}'.format(k, mean(v)))
Prints:
customer_id     average
10              11.86
5               12.80
1               10.33
6               6.33
2               13.00
3               8.17
7               6.88
9               11.00
8               3.00
4               15.00
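The script above reads the whole file through the csv module. For the literal question in the title (reading 100 bytes at a time), one possible approach is sketched below: open the file in binary mode, read fixed-size chunks, keep any incomplete trailing line in a buffer, and decode only complete lines. The function name iter_rows_in_chunks and the UTF-8 encoding are assumptions for illustration, and the sketch does not handle quoted fields that contain newlines.

```python
import csv
import io


def iter_rows_in_chunks(stream, chunk_size=100, encoding="utf-8"):
    """Yield CSV rows while reading at most chunk_size bytes per read call."""
    buffer = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last newline is a complete line;
        # the remainder stays in the buffer until more bytes arrive.
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            text = line.decode(encoding).rstrip("\r")
            if text:
                yield next(csv.reader([text]))
    # Handle a final line that has no trailing newline.
    if buffer.strip():
        yield next(csv.reader([buffer.decode(encoding).rstrip("\r")]))


# Demonstration on an in-memory copy of the first few lines of customer.csv.
sample = b"customer_id,order_id,number_of_items\n10,4736,9\n5,3049,1\n1,4689,3\n"
rows = list(iter_rows_in_chunks(io.BytesIO(sample), chunk_size=10))
print(rows[1])  # ['10', '4736', '9']
```

With a real file, the stream would come from open('customer.csv', 'rb') with chunk_size=100.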

How to create a time array in python for seasonal data

I am working with paleoclimate data (536-550 CE) in NetCDF format, which I imported with xarray. The time format is a bit strange:
import xarray as xr
ds_tas_01 = xr.open_dataset('ue536a01_temp2_seasmean.nc')
ds_tas_01['time']
<xarray.DataArray 'time' (time: 61)>
array([15360215.25, 15360430.75, 15360731.75, 15361031.75, 15370131.75,
15370430.75, 15370731.75, 15371031.75, 15380131.75, 15380430.75,
15380731.75, 15381031.75, 15390131.75, 15390430.75, 15390731.75,
15391031.75, 15400131.75, 15400430.75, 15400731.75, 15401031.75,
15410131.75, 15410430.75, 15410731.75, 15411031.75, 15420131.75,
15420430.75, 15420731.75, 15421031.75, 15430131.75, 15430430.75,
15430731.75, 15431031.75, 15440131.75, 15440430.75, 15440731.75,
15441031.75, 15450131.75, 15450430.75, 15450731.75, 15451031.75,
15460131.75, 15460430.75, 15460731.75, 15461031.75, 15470131.75,
15470430.75, 15470731.75, 15471031.75, 15480131.75, 15480430.75,
15480731.75, 15481031.75, 15490131.75, 15490430.75, 15490731.75,
15491031.75, 15500131.75, 15500430.75, 15500731.75, 15501031.75,
15501231.75])
Coordinates:
* time (time) float64 1.536e+07 1.536e+07 1.536e+07 ... 1.55e+07 1.55e+07
Attributes:
standard_name: time
bounds: time_bnds
units: day as %Y%m%d.%f
calendar: proleptic_gregorian
axis: T
So I want to make my own time array that I can use to plot the climate data. For monthly data I used:
import numpy as np
time = np.arange('0536-01-31', '0551-01-31', dtype='datetime64[M]')
which gives me an array with the years and months between those two dates.
Now I have grouped my data by season using cdo seasmean ('djf', 'mam', 'jja', 'son') and got 61 values instead of 180. Is there a way to regroup the 'time' array to seasonal values, or to create a new time array that corresponds to the seasonal data?
I made it work by setting the step in np.arange (the keyword is step, not steps):
time = np.arange('0536-01-31', '0551-01-31', step=3, dtype='datetime64[M]')
This gives a time step every three months, so essentially one per 'season'.
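A small sketch of the same idea with explicit datetime64 and timedelta64 objects is shown below. It uses month-precision strings ('0536-01') rather than the day-precision strings from the question, which is an assumption made to keep the units unambiguous. Note that this range yields 60 entries, whereas the dataset shown above has 61 time steps, so the end points would still need adjusting to match the data.

```python
import numpy as np

# Month-resolution start and stop (stop is exclusive, as with any arange).
start = np.datetime64("0536-01", "M")
stop = np.datetime64("0551-01", "M")

# One entry every three months, i.e. one per season.
seasonal_time = np.arange(start, stop, np.timedelta64(3, "M"))

print(len(seasonal_time))  # 60
print(seasonal_time[:4])
```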

Simple datetime conversion from integer or string

Is there a simple way to convert a start and end time into a list of evenly spaced times? The input can be a string or an integer in a form such as 1000, "1000", or "10:00" (24-hour format). I've managed to accomplish this in a messy-looking way; is there a tighter, more efficient way to create this list? As you'll notice, I created an array first and then called .tolist() to make the time-transformation iteration easier. The problem is that an input of 1030 or 1015 needs to be translated into 1050 or 1025 to create the right spacing. Is there a way I could use datetime.timedelta or something similar to build the array cleanly?
start="1000"
end="1600"
total_minutes=(int(end[:2])*60)+int(end[2:])-(int(start[:2])*60)-int(start[2:])
dog=list(range(0,int(total_minutes),25))
walk=dog_df["Walk Length"][dog_df.index[dog_df["Name"]==self.name][0]]
if walk=='half':
    self.dogarr=np.array([(x-25,x,x+25,x+50) for x in dog])
elif walk=='full':
    self.dogarr=np.array([(x-25,x,x+25,x+50,x+75,x+100) for x in dog])
else:
    self.dogarr=np.array([(x,x+25,x+50) for x in dog])
if int(start[2])!=0:
    start=start[:2]+str(int(int(start[2:])*1.667))
self.dogarr+=(int(start))
self.dogarr=self.dogarr.tolist()
z=0
while z<len(self.dogarr):
    for timespot in self.dogarr[z].copy():
        self.dogarr[z][self.dogarr[z].index(timespot)]=time.strftime('%H%M', time.gmtime(self.dogarr[z][self.dogarr[z].index(timespot)]*36))
    z+=1
self.dogarr=np.array(self.dogarr)
array([['1115', '1130', '1145', '1200'],
['1130', '1145', '1200', '1215'],
['1145', '1200', '1215', '1230'],
['1200', '1215', '1230', '1245'],
['1215', '1230', '1245', '1300']], dtype='<U4')
I'm sure you can figure out how to parse times from any number of existing questions. The crux of your question seems to be how to create evenly separated times within a range. Here's a simple way:
import datetime
start = datetime.datetime(2018,12,20,10) # or use strptime etc.
end = datetime.datetime(2018,12,24,18)
count = 10
interval = (end - start) / count
dt = start
while dt <= end:
    print(dt)
    dt += interval
The output is:
2018-12-20 10:00:00
2018-12-20 20:24:00
2018-12-21 06:48:00
2018-12-21 17:12:00
2018-12-22 03:36:00
2018-12-22 14:00:00
2018-12-23 00:24:00
2018-12-23 10:48:00
2018-12-23 21:12:00
2018-12-24 07:36:00
2018-12-24 18:00:00
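Applied to the input formats from the question, the same approach could look like the sketch below. The helper names parse_hhmm and time_slots and the 15-minute default step are assumptions for illustration; the grouping into walk-length windows from the question is not reproduced here.

```python
from datetime import datetime, timedelta


def parse_hhmm(value):
    """Accept 1000, "1000" or "10:00" and return a datetime on a dummy date."""
    digits = str(value).replace(":", "").zfill(4)
    return datetime.strptime(digits, "%H%M")


def time_slots(start, end, step_minutes=15):
    """Return evenly spaced 'HHMM' strings from start to end, inclusive."""
    current, stop = parse_hhmm(start), parse_hhmm(end)
    step = timedelta(minutes=step_minutes)
    slots = []
    while current <= stop:
        slots.append(current.strftime("%H%M"))
        current += step
    return slots


print(time_slots("10:00", 1100))  # ['1000', '1015', '1030', '1045', '1100']
```

Because the arithmetic is done on datetime objects, inputs such as 1030 or 1015 need no conversion to decimal hours.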

Resampling Time Series Data (Pandas Python 3)

Trying to convert data at daily frequency to weekly frequency.
In:
weeklyaapl = pd.DataFrame()
weeklyaapl['Open'] = aapl.Open.resample('W').iloc[0]
#here I am trying to take the first value of the aapl.Open,
#that falls within the week.
Out:
ValueError: .resample() is now a deferred operation
use .resample(...).mean() instead of .resample(...)
I want the true open (the first open that prints for the week) (the open of the first day in that week).
It instead wants me to take the mean of the daily open values for a given week using .mean(), which is not the information I need.
Can't seem to interpret the error, documentation isn't helping either.
I think you need first(), which takes the first value in each weekly bin:
aapl.resample('W').first()
Output:
Open High Low Close Volume
Date
2010-01-10 30.49 30.64 30.34 30.57 123432050
2010-01-17 30.40 30.43 29.78 30.02 115557365
2010-01-24 29.76 30.74 29.61 30.72 182501620
2010-01-31 28.93 29.24 28.60 29.01 266424802
2010-02-07 27.48 28.00 27.33 27.82 187468421
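Note that first() applied to every column returns the first day's High, Low, Close and Volume as well, which may not be what is wanted for a weekly bar. A hedged sketch of a per-column aggregation is shown below; the price values are made up for illustration and are not the asker's data.

```python
import pandas as pd

# Hypothetical daily prices for two business weeks starting Monday 2010-01-04.
index = pd.date_range("2010-01-04", periods=10, freq="B")
daily = pd.DataFrame(
    {
        "Open": [10.0, 11.0, 12.0, 13.0, 14.0, 20.0, 21.0, 22.0, 23.0, 24.0],
        "High": [15.0, 16.0, 17.0, 18.0, 19.0, 25.0, 26.0, 27.0, 28.0, 29.0],
        "Low": [5.0, 6.0, 7.0, 8.0, 9.0, 15.0, 16.0, 17.0, 18.0, 19.0],
        "Close": [11.0, 12.0, 13.0, 14.0, 15.0, 21.0, 22.0, 23.0, 24.0, 25.0],
    },
    index=index,
)

# Each weekly bin is labelled by its Sunday; pick a suitable reducer per column.
weekly = daily.resample("W").agg(
    {"Open": "first", "High": "max", "Low": "min", "Close": "last"}
)
print(weekly)
```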

How to determine a formula for execution time given quantitative data, Excel, trendlines, monte carlo simulation

Can I get your help on some Maths and possibly Excel?
I have benchmarked my app increasing the number of iterations and number of obligors recording the time taken in seconds with the following result:
        200          400          600          800          1000         1200         1400         1600         1800         2000
20000   15.627681    30.0968663   44.7592684   60.9037558   75.8267358   90.3718977   105.8749983  121.0030672  135.9191249  150.3331682
40000   31.7202111   62.3603882   97.2085204   128.8111731  156.2443206  186.6374271  218.324317   249.2699288  279.6008184  310.9970803
60000   47.0708635   92.4599437   138.874287   186.0576007  231.2181381  280.541207   322.9836878  371.3076757  413.4058622  459.6208335
80000   60.7346238   120.3216303  180.471169   241.668982   300.4283548  376.9639188  417.5231669  482.6288981  554.9740194  598.0394434
100000  76.7535915   150.7479245  227.5125656  304.3908046  382.5900043  451.6034296  526.0730786  609.0358776  679.0268121  779.6887277
120000  90.4174626   179.5511355  269.4099593  360.2934453  448.4387573  537.1406039  626.7325734  727.6132992  807.4767327  898.307638
How can I now come up with a function for T (time taken in seconds) in terms of the number of obligors O and the number of iterations I?
Thanks
I'm not quite sure of the data involved due to the question construction/presentation.
Assuming you're looking for y = f(x). If you load the data into Excel, you can use the methods SLOPE and INTERCEPT on the data ranges to derive an expression of the form
y = mx+c
and thus a linear function.
If you want a quadratic or cubic, you can use LINEST with a column of time data squared/cubed etc. to give you quadratic/cubic parameters, and thus derive an appropriate higher order function.
I spoke to one of the quants here; the function is of the form T = KNO, where T is time, K is some constant, N is the number of iterations, and O is the number of obligors.
Rearrange for K = T/(NO), plug in the sample data, take the average over all sample points, and use the standard deviation as the error estimate.
I did this for my data and get:
T = 3.81524E-06 * N * O (with 1.9% error), this is a pretty good approximation.
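The K = T/(NO) estimate can be sketched in a few lines of Python. The example below uses only the four corner cells of the table, so its K will differ slightly from the value quoted above, which was computed over all sample points. Because the formula is symmetric in N and O, it does not matter which table axis is which; the function name predict_seconds is a hypothetical helper.

```python
from statistics import mean, stdev

# (column header, row label, seconds) for the four corner cells of the table.
samples = [
    (200, 20000, 15.627681),
    (2000, 20000, 150.3331682),
    (200, 120000, 90.4174626),
    (2000, 120000, 898.307638),
]

# One estimate of K per sample, then the mean and relative spread.
k_values = [t / (a * b) for a, b, t in samples]
k = mean(k_values)
relative_spread = stdev(k_values) / k
print(f"K ~ {k:.4e} (relative spread {relative_spread:.1%})")


def predict_seconds(iterations, obligors, k=k):
    """Estimate run time from the fitted constant (hypothetical helper)."""
    return k * iterations * obligors
```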
Create a chart in Excel, add a trendline, and select to have the equation displayed on the chart.
To clarify: you have the tabular data from the question, which you want to fit to some function f(O, I) = t?
A rough guess looks like both O & I are linear. So f is in the form t = aO + bI + c. Plug in a few (O,I,t) and see what a,b,c should be.
