Trying to convert a string into a time format using pandas - python-3.x

I would like to import a dataset with pandas. This is my code:
data = pd.read_csv(source + 'clustering_CB\\AAA_tableau_jan_oct_19_cardiologia.txt' ,sep='\t' , engine='python')
Column Col10 contains string values that give the duration of a web visit. Here's an example:
00:02:35
That is, 2 minutes and 35 seconds.
What I would like to do is import this column in a time format so that I can measure the duration of the web visit (in seconds or in minutes).

If I understand your question correctly, the column already contains timedelta values (in string format) - in this case, you can apply the pd.to_timedelta function to the column:
timedelta = pd.to_timedelta(data["Col10"])
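A minimal sketch of how the converted column could then be measured in seconds or minutes. The three sample durations and the new column names (duration, seconds, minutes) are made up for illustration; only Col10 comes from the question.

```python
import pandas as pd

# Hypothetical sample standing in for the real file; only "Col10" is from the question.
data = pd.DataFrame({"Col10": ["00:02:35", "00:10:00", "01:00:05"]})

# Parse the "HH:MM:SS" strings into timedelta values.
data["duration"] = pd.to_timedelta(data["Col10"])

# Express each duration as a plain number of seconds, then minutes.
data["seconds"] = data["duration"].dt.total_seconds()
data["minutes"] = data["seconds"] / 60

print(data)
```

Once the column holds timedelta values, the usual aggregations such as data["duration"].mean() or .sum() should work on it as well.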

Related

Widening long table grouped on date

I have run into a problem transforming a dataframe. I'm trying to widen a table grouped on a datetime column, but can't seem to make it work. I have tried to transpose it and to pivot it, but can't get it into the shape I want.
Example table:
datetime value
2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7
What I want to achieve is:
index date 02:00 03:00
1 2022-04-29 5 6
2 2022-05-29 5 7
The real data has one data point per hour from 00:00 to 20:00 for each day, so I guess a loop would be the way to go to generate the columns.
Does anyone know a way to solve this, or can nudge me in the right direction?
Thanks in advance!
Assuming from details you have provided, I think you are dealing with timeseries data and you have data from different dates acquired at 02:00:00 and 03:00:00. Please correct me if I am wrong.
First we replicate your DataFrame object.
import datetime as dt
from io import StringIO
import pandas as pd
data_str = """2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7"""
df = pd.read_csv(StringIO(data_str), sep=" ", header=None)
df.columns = ["date", "value"]
Now we calculate the unique days on which you acquired data:
unique_days = df["date"].apply(lambda x: dt.datetime.strptime(x[:-3], "%Y-%m-%dT%H:%M:%S.%f").date()).unique()
Here I trimmed the last three zeros from each timestamp because strptime's %f directive accepts at most six fractional digits, while your values have nine. We convert each string to a datetime object, take its date, and keep the unique values.
Now we create a new empty df in desired form:
new_df = pd.DataFrame(columns=["date", "02:00", "03:00"])
After this, we can populate the values:
for day in unique_days:
    new_row_data = [day]  # start a row of 3 elements to insert into the empty df
    new_row_data.append(df.loc[df["date"] == f"{day}T02:00:00.000000000", "value"].values[0])  # value at 02:00 on that date
    new_row_data.append(df.loc[df["date"] == f"{day}T03:00:00.000000000", "value"].values[0])  # value at 03:00 on the same day
    new_df.loc[len(new_df)] = new_row_data  # append the row at the last position
This should give you:
date 02:00 03:00
0 2022-04-29 5 6
1 2022-05-29 5 7
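As an alternative to the explicit loop, the same reshaping could probably be done with pivot, which should also cope with the 00:00 to 20:00 columns in the real data without listing them by hand. This is only a sketch: the column name datetime is taken from the question's table, while the helper columns date and time are names I chose for illustration.

```python
from io import StringIO

import pandas as pd

data_str = """2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7"""

df = pd.read_csv(StringIO(data_str), sep=" ", header=None, names=["datetime", "value"])
df["datetime"] = pd.to_datetime(df["datetime"])

# Split each timestamp into a date part (row label) and an HH:MM part (column label),
# then pivot so every distinct time of day becomes its own column.
wide = (
    df.assign(date=df["datetime"].dt.date, time=df["datetime"].dt.strftime("%H:%M"))
    .pivot(index="date", columns="time", values="value")
    .reset_index()
)
wide.columns.name = None  # drop the leftover "time" label on the column axis

print(wide)
```

Note that pivot raises an error if the same date and time combination occurs twice; in that case pivot_table with an aggregation function would be the thing to try.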

Calculate time difference between first and last row in Excel column or .hdf file

I have a "Datetime" column in Excel and also in an .hdf DataFrame. How can I calculate the time difference (in hours, minutes, or seconds) between the first and the last row? My data has a few thousand rows, so I cannot write code that adds these dates manually.
(P.S. I am very new to Python; this is my very first code.)
The table below shows what my data looks like.
As you can see, the date and time are in one column:
Datetime Header: Machine_started
2021-02-02 14:33:09 Data 1
2021-02-02 14:33:09 Data 1
2021-02-02 14:33:11 Data 1
2021-02-02 14:41:36 Data 1
I created a demo dataframe:
import pandas as pd
import numpy as np
data = {"Datetime": ['2021-02-02 14:33:09', '2021-02-02 14:33:09', '2021-02-02 14:33:11', '2021-02-02 14:41:36'],
"Header": ['Data', 'Data','Data','Data'],
"1_2_eBeam_started": [1,1,1,1]}
# create the dataframe
df = pd.DataFrame(data)
df['Datetime'].dtype
# dtype is object
# convert it to datetime
df['Datetime']=pd.to_datetime(df['Datetime'])
df['Datetime'].iloc[0] # this is first row
df['Datetime'].iloc[-1] # this is last row
# difference in seconds:
(df['Datetime'].iloc[-1] - df['Datetime'].iloc[0])/np.timedelta64(1,'s')
#output 507.0
# You can also get the difference in minutes, hours, etc. by replacing 's' with 'm' or 'h' in np.timedelta64(1,'s')
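The same difference can also be read off the resulting Timedelta directly, without NumPy. A small sketch using the demo timestamps from above (the variable names elapsed, seconds, minutes, and hours are mine):

```python
import pandas as pd

df = pd.DataFrame({
    "Datetime": ["2021-02-02 14:33:09", "2021-02-02 14:33:09",
                 "2021-02-02 14:33:11", "2021-02-02 14:41:36"],
})
df["Datetime"] = pd.to_datetime(df["Datetime"])

# Subtracting two Timestamps gives a Timedelta.
elapsed = df["Datetime"].iloc[-1] - df["Datetime"].iloc[0]

seconds = elapsed.total_seconds()  # whole span expressed in seconds
minutes = seconds / 60
hours = seconds / 3600

print(elapsed, seconds, minutes, hours)
```

This works the same way regardless of how many rows the frame has, since only the first and last rows are touched.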

Date stuck as unformattable in pandas dataframe

I am trying to plot time series data and my date column is stuck like this, and I cannot seem to figure out what datatype it is to change it, as adding verbose = True doesn't yield any explanation for the data.
Here is a screenshot of the output Date formatting
Here is the code I have for importing the dataset and all the formatting I've done to it. Ultimately, I want date and time to be separate values, but I cannot figure out why it's auto formatting it and applying today's date.
df = pd.read_csv('dataframe.csv')
df['Avg_value'] = (df['Open'] + df['High'] + df['Low'] + df['Close'])/4
df['Date'] = pd.to_datetime(df['Date'])
df['Time'] = pd.to_datetime(df['Time'])
Any help would be appreciated
The output I'd be looking for would be something like:
Date: 2016-01-04 10:00:00
As one column rather than 2 separate ones.
When you pass a Pandas Series into pd.to_datetime(...), it parses the values and returns a new Series of dtype datetime64[ns] containing the parsed values:
>>> pd.to_datetime(pd.Series(["12:30:00"]))
0 2021-08-24 12:30:00
dtype: datetime64[ns]
The reason you see today's date is that a datetime64 value must have both date and time information. Since the date is missing, the current date is substituted by default.
Pandas does not really have a dtype that supports "just time of day, no date" (there is one that supports time intervals, such as "5 minutes"). The Python standard library does, however: the datetime.time class.
Pandas provides the .dt accessor, whose .time attribute extracts a Series of datetime.time objects from a Series of datetime64 values:
>>> pd.to_datetime(pd.Series(["12:30:00"])).dt.time
0 12:30:00
dtype: object
Note that the dtype is just object which is the fallback dtype Pandas uses for anything which is not natively supported by Pandas or NumPy. This means that working with these datetime.time objects will be a bit slower than working with a native Pandas dtype, but this probably won't be a bottleneck for your application.
Recommended reference: https://pandas.pydata.org/docs/user_guide/timeseries.html
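For the single combined column the question asks for, one option is to join the two string columns before parsing. This is a sketch under assumptions: the sample rows are invented, and it presumes that Date and Time are still plain strings (that is, it runs in place of the two separate to_datetime calls). The column names Datetime and time_only are hypothetical.

```python
import pandas as pd

# Hypothetical rows; the real CSV layout is not shown in the question.
df = pd.DataFrame({
    "Date": ["2016-01-04", "2016-01-04"],
    "Time": ["10:00:00", "10:05:00"],
})

# Join the date and time strings, then parse them once into datetime64 values.
df["Datetime"] = pd.to_datetime(df["Date"] + " " + df["Time"])

# If a separate time-of-day value is still wanted, take it from the combined column.
df["time_only"] = df["Datetime"].dt.time

print(df)
```

Because the date is supplied explicitly, today's date is no longer filled in.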

Calculate average of time with fraction of seconds

I have 3 different columns of timestamps in a pandas dataframe, two of which have fraction of seconds recorded while the third does not have fraction of seconds. I would like to calculate an average of these 3 columns.
I have already tried to compute the average using the mean function on the columns and consistently received nan as the result
import pandas as pd
data = [{'time1': '2018-07-22 04:34:10.8966', 'time2': '2017-07-22 04:34:10.8966', 'time3': '2018-07-27 00:10:04'}]
df = pd.DataFrame(data)
df['estimate'] = df[['time1', 'time2', 'time3']].mean(axis=1)
df
Expected : An average of the 3 timestamps
Actual : While there is no error, it also always evaluates to nan which is not what is desired.
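As far as I know you can't do it directly on datetime values; you need to convert them to integers, average, and then convert back. (The original mean returns nan because the columns are still strings. Also note that pd.np has been removed from recent pandas versions, so NumPy is imported directly here.)
import numpy as np
data = [{'time1': '2018-07-22 04:34:10.8966', 'time2': '2017-07-22 04:34:10.8966', 'time3': '2018-07-27 00:10:04'}]
df = pd.DataFrame(data).apply(pd.to_datetime)
df['estimate'] = pd.to_datetime(df[['time1', 'time2', 'time3']].values.astype(np.int64).mean(axis=1))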
Result:
time1 time2 time3 estimate
0 2018-07-22 04:34:10.896600 2017-07-22 04:34:10.896600 2018-07-27 00:10:04 2018-03-24 03:06:08.597733376
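The mean above is computed in floating point, which cannot represent nanosecond epoch values exactly, so the last digits of the estimate are rounded. If that matters, a variant that stays in integer arithmetic might look like the sketch below. The helper mean_timestamp is a name I made up; it relies on Timestamp.value, which is the number of nanoseconds since the epoch.

```python
import pandas as pd

data = [{'time1': '2018-07-22 04:34:10.8966',
         'time2': '2017-07-22 04:34:10.8966',
         'time3': '2018-07-27 00:10:04'}]
df = pd.DataFrame(data).apply(pd.to_datetime)


def mean_timestamp(row):
    """Average the timestamps in one row using integer nanoseconds (hypothetical helper)."""
    nanos = [pd.Timestamp(value).value for value in row]  # ns since the epoch, as Python ints
    return pd.Timestamp(sum(nanos) // len(nanos))  # floor division keeps everything integral


df['estimate'] = df[['time1', 'time2', 'time3']].apply(mean_timestamp, axis=1)
print(df['estimate'])
```

A row-wise apply is slower than the vectorized version, so this is mainly worth it when the sub-microsecond digits matter.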

What is the most performant way to slice a datetime in a multi-index?

What is the most 'performant' way to filter a DataFrame by time if the DataFrame has a multi-index containing a datetime index?
For example, how to filter for business hours only in a datetime index which is contained in a multi-index.
df.index.get_level_values(1).hour.isin([9,10,11,13,14,15,16])
That's just one example: it filters the second level of a MultiIndex, which holds datetime values, and produces a boolean mask that is True wherever the hour falls within 9 to 5, excluding the lunch break.
Need more precision?
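dt = df.index.get_level_values(1)
# an Index has no .between(), so wrap the minutes in a Series aligned with df
minutes = pd.Series(dt.hour * 60 + dt.minute, index=df.index)
minutes.between(8*60+15, 17*60+45)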
That's 8:15 to 17:45.
Siesta?
minutes.between(9*60+30, 15*60) | minutes.between(17*60+30, 20*60)
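To show how such a mask is applied end to end, here is a small self-contained sketch. The frame, its level names (key, time), and the value column are invented for illustration; only the hour list comes from the answer above.

```python
import pandas as pd

# Hypothetical two-level index: a key plus a timestamp.
times = pd.to_datetime([
    "2022-01-03 08:00", "2022-01-03 09:30", "2022-01-03 12:15",
    "2022-01-03 16:45", "2022-01-03 18:00",
])
index = pd.MultiIndex.from_product([["A", "B"], times], names=["key", "time"])
df = pd.DataFrame({"value": range(len(index))}, index=index)

# Pull out the datetime level and build a boolean mask from its hour component.
hours = df.index.get_level_values("time").hour
mask = hours.isin([9, 10, 11, 13, 14, 15, 16])

# Boolean indexing keeps only the rows that fall inside business hours.
business = df[mask]
print(business)
```

The mask is computed once on the whole level, so no Python-level loop over rows is involved, which is usually what keeps this kind of filter fast.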

Resources