How to get a timestamp (for time series analysis) when data is in seconds, using Python?

I have a dataset (see image 1) that I want to analyze for anomalies.
The story of the dataset is the following:
I measured temperature (TPU, CPU), memory (MemUsed) and time (Time) for each inference (an image classification task).
I also made a cumulative sum of the 'Time' column in order to get the time that the whole process of classifying 70000 images will take. This is around 7000 seconds.
The following code is what I use when trying to get a timestamp (the '_time' column), and Image 2 shows what the result looks like.
# Return the index with frequency (RangeIndex to DatetimeIndex)
df1_MAX['_time'] = pd.to_datetime(df1_MAX['TimeTotal'])  # string to DatetimeIndex
df1_MAX = df1_MAX.set_index('_time')
df1_MAX
As I am working with data in seconds, how can I get a proper timestamp? What format do I need to use?
Thank you
[Image 1: the dataset]
[Image 2: the resulting '_time' index]
------ EDIT ------
Using timedelta64:
------ EDIT ------
I changed 'TimeTotal' to milliseconds and used 'timedelta64[ms]' to match.
df1_MAX['_time'] = pd.to_datetime("2020-07-06 10:53:00") + df1_MAX['TimeTotal'].astype('timedelta64[ms]')
df1_MAX=df1_MAX.set_index('_time')
df1_MAX

You can make it a timedelta:
df['TIME'] = df['Time'].astype('timedelta64[s]')
If you want to create a datetime stamp, say you began at 2021-07-06 10:53:02, just add the timedelta to the start datetime.
Data
df = pd.DataFrame({"Time": [121.83, 101.22], "score": [1, 2], "Label": ["trimaran", "trimaran"]})
Solution
df['DateTime'] = pd.to_datetime("2021-07-06 10:53:02") + df['Time'].astype('timedelta64[s]')
Outcome
   Time    score  Label     DateTime
0  121.83  1      trimaran  2021-07-06 10:55:03
1  101.22  2      trimaran  2021-07-06 10:54:43
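Applied back to the original question's cumulative seconds, here is a minimal sketch (toy 'TimeTotal' values, arbitrary start time); pd.to_timedelta is a portable alternative to astype('timedelta64[s]'):
import pandas as pd

# Toy cumulative-seconds data standing in for the question's 'TimeTotal' column.
df1_MAX = pd.DataFrame({"TimeTotal": [0.11, 0.25, 0.36]})

start = pd.Timestamp("2020-07-06 10:53:00")  # arbitrary start time
df1_MAX['_time'] = start + pd.to_timedelta(df1_MAX['TimeTotal'], unit='s')
df1_MAX = df1_MAX.set_index('_time')         # DatetimeIndex, ready for resampling
print(df1_MAX)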

Related

How to aggregate data by period in an rrdtool graph

I have an rrd file with average ping times to a server (GAUGE) every minute, and when the server is offline (which is very frequent, for reasons that don't matter now) it stores a NaN/unknown.
I'd like to create a graph with the percentage the server is offline each hour which I think can be achieved by counting every NaN within 60 samples and then dividing by 60.
For now I get to the point where I define a variable that is 1 when the server is offline and 0 otherwise, but I already read the docs and don't know how to aggregate this:
DEF:avg=server.rrd:rtt:AVERAGE
CDEF:offline=avg,UN,1,0,IF
Is it possible to do this when creating a graph, or will I have to store that info in another RRD?
I don't think you can do exactly what you want, but you have a couple of options.
You can define a sliding-window average that shows the percentage of the previous hour that was unknown, and graph that, using TREND.
DEF:avg=server.rrd:rtt:AVERAGE:step=60
CDEF:offline=avg,UN,100,0,IF
CDEF:pcoffline=offline,3600,TREND
LINE:pcoffline#ff0000:Offline %
This defines avg as the 1-min time series of ping data. Note we use step=60 to ensure we get the best resolution of data even in a smaller graph. Then we define offline as 100 when the ping is unknown (the server is offline) and 0 when it is known. Then, pcoffline is a 1-hour sliding-window average of this, which will in effect be the percentage of time during the previous hour during which the server was offline.
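For context, a sketch of how those fragments fit into a complete command, run from Python via subprocess (the file and image names are hypothetical):
import subprocess

subprocess.run([
    "rrdtool", "graph", "offline.png",
    "--start", "end-1d",                       # e.g. graph the last 24 hours
    "DEF:avg=server.rrd:rtt:AVERAGE:step=60",  # 1-min ping series
    "CDEF:offline=avg,UN,100,0,IF",            # 100 = offline, 0 = online
    "CDEF:pcoffline=offline,3600,TREND",       # 1-hour sliding average
    "LINE:pcoffline#ff0000:Offline %",
], check=True)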
However, there's a problem in that RRDTool will silently consolidate the source data before you get your hands on it if there are many data points per pixel in the graph (this won't happen when doing a fetch, of course). To get around that, you'd need to have the offline calculation done at store time, i.e. have a COMPUTE-type DS that is 100 or 0 depending on whether the rtt DS is unknown. Then, any averaging will preserve the data (normal averaging omits the unknowns, or the xff setting makes the whole CDP unknown).
rrdtool create ...
DS:rtt:GAUGE:120:0:9999
DS:offline:COMPUTE:rtt,UN,100,0,IF
rrdtool graph ...
DEF:offline=server.rrd:offline:AVERAGE:step=3600
LINE:offline#ff0000:Availability
If you are able to modify your RRD, and do not need historical data, then use of a COMPUTE in this way will allow you to display your data in a 1-hour stepped graph as you wanted.
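A sketch of the corresponding create call (the step, heartbeat, and RRAs here are illustrative assumptions, not from the original answer):
import subprocess

subprocess.run([
    "rrdtool", "create", "server.rrd",
    "--step", "60",                        # 1-minute base step
    "DS:rtt:GAUGE:120:0:9999",             # ping time, 2-min heartbeat
    "DS:offline:COMPUTE:rtt,UN,100,0,IF",  # 100 when rtt is unknown
    "RRA:AVERAGE:0.5:1:1440",              # 1 day of 1-min averages
    "RRA:AVERAGE:0.5:60:720",              # 30 days of 1-hour averages
], check=True)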

How to speed up featuretools dfs execution?

I am running featuretools to create new features and have created the entitysets from an existing dataframe.
The training dataframe has ~233K records and 81 columns, split into 3 entities and provided as input to the dfs call, which takes about 2.5 hours of execution time on the train dataset and 1.5 hours on the test dataset. The test dataset has ~120K records with 80 columns.
How can I improve performance in terms of reducing execution time? I am running the code on a Kaggle kernel, and I lose more than 4 of the 9 hours available for a session just running the dfs call.
I have looked at the featuretools documentation on parallel processing and speeding up the code, but it is not very clear how to go about it when the entities are created from a dataframe, or maybe I am not understanding it clearly.
Result: execution time reduced by about 1/4.
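That answer is terse. For what it's worth, the usual levers for speeding up dfs are parallel workers, chunked computation, and shallower feature stacking; a hedged sketch follows, assuming the question's entityset (es here) and a hypothetical target entity name, and not necessarily what the answerer did:
import featuretools as ft

feature_matrix, feature_defs = ft.dfs(
    entityset=es,             # assumed: the 3-entity EntitySet from the question
    target_entity="records",  # hypothetical target entity name
    max_depth=1,              # shallower feature stacking, far fewer features
    n_jobs=2,                 # parallel workers (needs dask distributed installed)
    chunk_size=0.05,          # compute the matrix in 5%-of-rows chunks
    verbose=True,
)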

Record by record timestamp difference calculation

I am working on logic to find the consecutive time difference between two timestamps in the streaming layer (Spark) by comparing the previous time and the current time, and storing the value in the database.
For eg:
2017-08-01 11:00:00
2017-08-01 11:05:00
2017-08-01 11:07:00
So according to the above timestamps my consecutive diffs will be 5 mins (11:00:00 to 11:05:00) and 2 mins respectively, and when I sum the differences I get 7 mins (5+2), which is the actual time difference. Now the real challenge is when I receive a delayed timestamp.
For eg:
2017-08-01 11:00:00
2017-08-01 11:05:00
2017-08-01 11:07:00
2017-08-01 11:02:00
Here when I calculate the differences I get 5 mins, 2 mins, and 5 mins respectively, and when I sum them I get 12 mins (5+2+5), which is greater than the actual time difference (7 mins). That is wrong.
Please help me find a workaround to handle delayed timestamps in record-by-record time difference calculation.
What you are experiencing is the difference between 'event time' and 'processing time'. In the best case, processing time will be nearly identical to event time, but sometimes, an input record is delayed, so the difference will be larger.
When you process streaming data, you define (explicitly or implicitly) a window of records that you look at. If you process records individually, this window has size 1. In your case, your window has size 2. But you could also have a window that is based on time, i.e. you can look at all records that have been received in the last 10 minutes.
If you want to process delayed records in order, you need to wait until the delayed records have arrived and then sort the records within the window. The problem then becomes: how long do you wait? A delayed record may show up 2 days later! How long to wait is a subjective question and depends on your application and its requirements.
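To illustrate the "buffer, sort by event time, then diff" idea on the question's own example, here is a minimal non-streaming sketch in pandas (a real Spark job would do this per window once the wait period expires):
import pandas as pd

ts = pd.to_datetime(pd.Series([
    "2017-08-01 11:00:00",
    "2017-08-01 11:05:00",
    "2017-08-01 11:07:00",
    "2017-08-01 11:02:00",   # the delayed record
]))

diffs = ts.sort_values().diff().dropna()  # consecutive diffs in event-time order
print(diffs.sum())                        # 0 days 00:07:00, the true span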
Note that if your window is time-based, you will need to handle the case where no previous record is available.
I highly recommend this article: https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101 to get to grips with streaming terminology and windows.

How big can the window be when using Spark Streaming?

We have some streaming data that needs to be processed, and we are considering using Spark Streaming to do it.
We need to generate three kinds of reports. The reports are based on
The last 5 minutes data
The last 1 hour data
The last 24 hour data
The frequency of reports is 5 minutes.
After reading the docs, the most obvious way to solve this seems to be to set up a Spark stream with a 5-minute batch interval and two windows of 1 hour and 1 day.
But I am worried that 1-hour and 1-day windows may be too big. I do not have much experience with Spark Streaming, so what window lengths do you use in your environment?
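For reference, a minimal sketch of that setup using the PySpark DStream API (the source, checkpoint path, and the use of simple counts are all illustrative assumptions):
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="reports")
ssc = StreamingContext(sc, batchDuration=300)     # 5-minute batches
ssc.checkpoint("hdfs:///tmp/checkpoints")         # required for windowed ops

events = ssc.socketTextStream("localhost", 9999)  # hypothetical source

five_min = events.count()                         # last 5 min = one batch
hourly = events.countByWindow(3600, 300)          # 1-hour window, 5-min slide
daily = events.countByWindow(86400, 300)          # 24-hour window, 5-min slide

daily.pprint()
ssc.start()
ssc.awaitTermination()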

Calculating the throughput (requests/sec) and plot it

I'm using the JMeter client to test the throughput of a certain workload (PHP+MySQL, 1 page) on a certain server. Basically I'm doing a "capacity test" with an increasing number of threads over time.
I installed the "Statistical Aggregate Report" JMeter plugin and this was the result (ignore the "Response time" line):
At the same time I used the "Simple Data Writer" listener to write a log file ("JMeter.csv"). Then I tried to "manually" calculate the throughput for every second of the test.
Each line of "JMeter.csv" has this format:
timestamp      elapsedtime  responsecode  success  bytes
1385731020607  42           200           true     325
...            ...          ...           ...      ...
The timestamp refers to the time when the request is made by the client, not when the response is served by the server. So I simply did: totaltime = timestamp + elapsedtime.
In the next step I converted the totaltime to a date format, like: 13:17:01.
I have more than 14K samples and with Excel I was able to do this quickly.
Then I counted how many samples there were for each second. Example:
totaltime samples (requestsServed/second)
13:17:01 204
13:17:02 297
... ...
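A minimal sketch of this counting procedure in pandas (the file name and column layout follow the question; a real Simple Data Writer file may also contain a header row):
import pandas as pd

df = pd.read_csv("JMeter.csv",
                 names=["timestamp", "elapsedtime", "responsecode",
                        "success", "bytes"])

# Completion time of each request, in epoch milliseconds.
df["totaltime"] = df["timestamp"] + df["elapsedtime"]

# Bucket completions by wall-clock second and count them.
per_second = (pd.to_datetime(df["totaltime"], unit="ms")
                .dt.floor("s")
                .value_counts()
                .sort_index())
print(per_second.head())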
When I tried to plot the results I obtained the following graphic:
As you can see, it is quite different from the first graph.
Given that the first graph is correct, what is the mistake in my formula/procedure for calculating the throughput?
It turns out that this plugin is plotting something I cannot identify... I tried many times and my considerations were actually correct. Be careful with this plugin (or check its source code).
Throughput can be viewed in JMeter's Summary Report, and you can calculate it yourself by saving your test results file (e.g. as XML) from the Summary Report.
Throughput = (Number of samples / (Max(ts + t) - Min ts)) * 1000
That is, throughput is the number of samples divided by the test span in milliseconds (from the start of the first request to the end of the last response), multiplied by 1000 to convert to requests per second.
With this formula you can calculate the throughput for each HTTP request in the Summary Report.
Example:
Max (ts + t) = 1485538701633 + 569 = 1485538702202
Min ts = 1485538143112
Throughput = (2 / (1485538702202 - 1485538143112)) * 1000
Throughput = (2 / 559090) * 1000
Throughput = 0.0000035772 * 1000
Throughput ≈ 0.0036/sec
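The same arithmetic as a runnable sketch (hypothetical file name, column layout as in the question above):
import pandas as pd

df = pd.read_csv("JMeter.csv",
                 names=["timestamp", "elapsedtime", "responsecode",
                        "success", "bytes"])

span_ms = (df["timestamp"] + df["elapsedtime"]).max() - df["timestamp"].min()
throughput = len(df) / span_ms * 1000   # requests per second
print(f"{throughput:.4f} requests/sec")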
You can read more, with examples, at http://www.wikishown.com/how-to-calculate-throughput-in-jmeter/, where I got a good idea about throughput calculation.
