Calculate time difference between first and last row in Excel column or .hdf file - python-3.x

I have a "Datetime" column in Excel and also in a DataFrame stored in an .hdf file. How can I calculate the time difference (in hours, minutes or seconds) between the first and the last row? My data has a few thousand rows, so I cannot type the dates into the code by hand.
(P.S. I am very new to Python; this is my very first code.)
The table below shows what the data looks like. As you can see, the date and time are in one column:
Datetime Header: Machine_started
2021-02-02 14:33:09 Data 1
2021-02-02 14:33:09 Data 1
2021-02-02 14:33:11 Data 1
2021-02-02 14:41:36 Data 1

I created a demo dataframe:
import pandas as pd
import numpy as np
data = {"Datetime": ['2021-02-02 14:33:09', '2021-02-02 14:33:09', '2021-02-02 14:33:11', '2021-02-02 14:41:36'],
"Header": ['Data', 'Data','Data','Data'],
"1_2_eBeam_started": [1,1,1,1]}
df = pd.DataFrame(data)
# creating dataframe
df['Datetime'].dtype
# dtype is object
# convert it to datetime
df['Datetime']=pd.to_datetime(df['Datetime'])
df['Datetime'].iloc[0] # this is first row
df['Datetime'].iloc[-1] # this is last row
# difference in seconds:
(df['Datetime'].iloc[-1] - df['Datetime'].iloc[0])/np.timedelta64(1,'s')
#output 507.0
# You can also get the difference in minutes or hours by replacing 's' with 'm' or 'h' in np.timedelta64(1,'s')
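As a possible alternative to dividing by np.timedelta64, the difference between two timestamps is a pandas Timedelta, which has a total_seconds() method. The sketch below reuses the four demo timestamps; the variable names (elapsed, span) are only illustrative, and the file names in the comment are hypothetical.

```python
import pandas as pd

# Demo data; for real data you would load the frame first, e.g. with
# pd.read_excel("file.xlsx") or pd.read_hdf("file.hdf") (hypothetical file names).
df = pd.DataFrame({"Datetime": ["2021-02-02 14:33:09", "2021-02-02 14:33:09",
                                "2021-02-02 14:33:11", "2021-02-02 14:41:36"]})
df["Datetime"] = pd.to_datetime(df["Datetime"])

# Last row minus first row gives a pandas Timedelta
elapsed = df["Datetime"].iloc[-1] - df["Datetime"].iloc[0]
seconds = elapsed.total_seconds()
minutes = seconds / 60
hours = seconds / 3600

# If the rows are not guaranteed to be sorted, max() - min() gives the full span
span = df["Datetime"].max() - df["Datetime"].min()

print(seconds, minutes, hours)
```

Using max() and min() instead of iloc is only a design choice for unsorted data; on sorted data both give the same result.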

Related

Widening long table grouped on date

I have run into a problem transforming a dataframe. I'm trying to widen a table grouped on a datetime column, but can't seem to make it work. I have tried to transpose it and to pivot it, but can't get it into the shape I want.
Example table:
datetime value
2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7
What I want to achieve is:
index date 02:00 03:00
1 2022-04-29 5 6
2 2022-05-29 5 7
The real data has one data point per hour from 00:00 to 20:00 for each day, so I guess a loop would be the way to go to generate the columns.
Does anyone know a way to solve this, or can nudge me in the right direction?
Thanks in advance!
Going by the details you have provided, I think you are dealing with time series data, with values from different dates acquired at 02:00:00 and 03:00:00. Please correct me if I am wrong.
First we replicate your DataFrame object.
import datetime as dt
from io import StringIO
import pandas as pd
data_str = """2022-04-29T02:00:00.000000000 5
2022-04-29T03:00:00.000000000 6
2022-05-29T02:00:00.000000000 5
2022-05-29T03:00:00.000000000 7"""
df = pd.read_csv(StringIO(data_str), sep=" ", header=None)
df.columns = ["date", "value"]
Now we calculate the unique days on which you acquired data:
unique_days = df["date"].apply(lambda x: dt.datetime.strptime(x[:-3], "%Y-%m-%dT%H:%M:%S.%f").date()).unique()
Here I trimmed the last three zeros from each date string, because %f accepts at most six fractional digits and the strings have nine. We convert each string to a datetime object, take its date, and keep the unique values.
Now we create a new empty df in the desired form:
new_df = pd.DataFrame(columns=["date", "02:00", "03:00"])
After this we can populate the values:
for day in unique_days:
    new_row_data = [day] # this creates a row of 3 elems, which will be inserted into empty df
    new_row_data.append(df.loc[df["date"] == f"{day}T02:00:00.000000000", "value"].values[0]) # here we find data for 02:00 for that date
    new_row_data.append(df.loc[df["date"] == f"{day}T03:00:00.000000000", "value"].values[0]) # here we find data for 03:00 same day
    new_df.loc[len(new_df)] = new_row_data # now we insert row to last pos
this should give you:
date 02:00 03:00
0 2022-04-29 5 6
1 2022-05-29 5 7
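Since the real data has many time slots per day, a loop-free variant may scale better. The sketch below is not part of the answer above; it assumes the column is parsed with pd.to_datetime and then reshaped with DataFrame.pivot, and the helper column names "date" and "time" are chosen just for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "datetime": ["2022-04-29T02:00:00.000000000", "2022-04-29T03:00:00.000000000",
                 "2022-05-29T02:00:00.000000000", "2022-05-29T03:00:00.000000000"],
    "value": [5, 6, 5, 7],
})

# Parse the strings, then split each timestamp into a date part and an HH:MM label
df["datetime"] = pd.to_datetime(df["datetime"])
df["date"] = df["datetime"].dt.date
df["time"] = df["datetime"].dt.strftime("%H:%M")

# One row per date, one column per time of day
wide = df.pivot(index="date", columns="time", values="value").reset_index()
wide.columns.name = None  # drop the leftover "time" label on the column axis

print(wide)
```

pivot raises an error if a (date, time) pair occurs more than once; in that case pivot_table with an aggregation function would be the usual substitute.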

convert datetime to date python --> error: unhashable type: 'numpy.ndarray'

Pandas by default represents dates as datetime64[ns], so my columns contain values like 2016-02-05 00:00:00, but I just want the date 2016-02-05. I applied this code to a few columns:
df3a['MA'] = pd.to_datetime(df3a['MA'])
df3a['BA'] = pd.to_datetime(df3a['BA'])
df3a['FF'] = pd.to_datetime(df3a['FF'])
df3a['JJ'] = pd.to_datetime(df3a['JJ'])
.....
but it gives me this error: TypeError: unhashable type: 'numpy.ndarray'
My question is: why do I get this error, and how do I convert datetime to date for multiple columns (around 50)?
I will be grateful for your help.
One way to achieve what you'd like is with a DatetimeIndex. I've first created an Example DataFrame with 'date' and 'values' columns and tried from there on to reproduce the error you've got.
import pandas as pd
import numpy as np
# Example DataFrame with a DatetimeIndex (dti)
dti = pd.date_range('2020-12-01','2020-12-17') # dates from first of december up to date
values = np.random.choice(range(1, 101), len(dti)) # random values between 1 and 100
df = pd.DataFrame({'date':dti,'values':values}, index=range(len(dti)))
print(df.head())
>>> date values
0 2020-12-01 85
1 2020-12-02 100
2 2020-12-03 96
3 2020-12-04 40
4 2020-12-05 27
In the example, the 'date' column already shows just the dates without the time, I guess because it is a DatetimeIndex.
What I haven't tested, but what might work for you, is:
# Your dataframe
df3a['MA'] = pd.DatetimeIndex(df3a['MA'])
...
# automated transform for all columns (if all columns are datetimes!)
for label in df3a.columns:
    df3a[label] = pd.DatetimeIndex(df3a[label])
Use DataFrame.apply:
cols = ['MA', 'BA', 'FF', 'JJ']
df3a[cols] = df3a[cols].apply(pd.to_datetime)
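To get from datetime64 values to plain dates for many columns at once, one option is to apply the .dt.date accessor column by column. This is only a sketch: the small df3a frame and its values are made up for illustration, and the resulting columns have object dtype (Python date objects), which is slower than datetime64 for later date arithmetic.

```python
import datetime

import pandas as pd

# Hypothetical stand-in for df3a with two of the date columns
df3a = pd.DataFrame({
    "MA": ["2016-02-05 00:00:00", "2016-03-01 00:00:00"],
    "BA": ["2017-01-15 00:00:00", "2017-06-30 00:00:00"],
})
cols = ["MA", "BA"]

# Step 1: parse every listed column to datetime64
converted = df3a[cols].apply(pd.to_datetime)

# Step 2: keep only the calendar date of each value
dates_only = converted.apply(lambda s: s.dt.date)

print(dates_only)
```

If the columns should stay datetime64 and only the display matters, converted.apply(lambda s: s.dt.normalize()) sets the time part to midnight instead.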

Going from monthly average dataframe to an interpolated daily timeseries

I am interested in taking average monthly values, for each month, and set the monthly average values to be the value on the 15th day of each month (within a daily timeseries).
I start with the following (these are the monthly average values I am given):
m_avg = pd.DataFrame({'Month': ['1.527013956', '1.899169054', '1.669356146','1.44920871', '1.188557788', '1.017035727', '0.950243755', '1.022453993', '1.203913739', '1.369545041','1.441827406','1.48621651']})
EDIT: I added one more value to the dataframe so that there are now 12 values.
Next, I want to put each of these monthly values on the 15th day (within each month) for the following time period:
ts = pd.date_range(start='1/1/1950', end='12/31/1999', freq='D')
I know how to pull out the date on 15th day of an already existing daily timeseries by using:
df= df.loc[(df.index.day==15)] # Where df is any daily timeseries
Lastly, I know how to interpolate the values once I have the average monthly values on the 15th day of each month, using:
df.loc[:, ['Col1']] = df.loc[:, ['Col1']].interpolate(method='linear', limit_direction='both', limit=100)
How do I get from the monthly DataFrame to an interpolated daily DataFrame, where I linearly interpolate between the 15th day of each month, which is the monthly value of my original DataFrame by construction?
EDIT:
Your suggestion to use np.tile() was good, but I ended up needing to do this for multiple columns. Instead of np.tile, I used:
index = pd.date_range(start='1/1/1950', end='12/31/1999', freq='MS')
m_avg = pd.concat([month]*49,axis=0).set_index(index)
There may be a better solution out there, but this is working for my needs so far.
Here is one way to do it:
import pandas as pd
import numpy as np
# monthly averages, note these should be cast to float
month = np.array(['1.527013956', '1.899169054', '1.669356146',
'1.44920871', '1.188557788', '1.017035727',
'0.950243755', '1.022453993', '1.203913739',
'1.369545041', '1.441827406', '1.48621651'], dtype='float')
# expand this to 51 years, with the same monthly averages repeating each year
# (obviously not very efficient, probably there are better ways to attack the problem,
# but this was the question)
month = np.tile(month, 51)
# create DataFrame with these values
m_avg = pd.DataFrame({'Month': month})
# set the date index to the desired time period
m_avg.index = pd.date_range(start='1/1/1950', end='12/1/2000', freq='MS')
# shift the index by 14 days to get the 15th of each month
# (tshift was deprecated in pandas 1.1 and removed in 2.0; shift with freq does the same)
m_avg = m_avg.shift(14, freq='D')
# expand the index to daily frequency
daily = m_avg.asfreq(freq='D')
# interpolate (linearly) the missing values
daily = daily.interpolate()
# show result
display(daily) # display() is available in Jupyter/IPython; use print(daily) in a plain script
Output:
Month
1950-01-15 1.527014
1950-01-16 1.539019
1950-01-17 1.551024
1950-01-18 1.563029
1950-01-19 1.575034
... ...
2000-12-11 1.480298
2000-12-12 1.481778
2000-12-13 1.483257
2000-12-14 1.484737
2000-12-15 1.486217
18598 rows × 1 columns
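The result above runs from 1950-01-15 to 2000-12-15 rather than over the ts range from the question. A possible variant, sketched below, places the monthly values directly on the 15th of each month inside ts and reindexes onto ts before interpolating. The names mid_month and n_years are illustrative; as far as I know, limit_direction='both' fills the days before the first 15th and after the last one with the nearest known value.

```python
import numpy as np
import pandas as pd

month = np.array([1.527013956, 1.899169054, 1.669356146, 1.44920871,
                  1.188557788, 1.017035727, 0.950243755, 1.022453993,
                  1.203913739, 1.369545041, 1.441827406, 1.48621651])

# Daily target index from the question
ts = pd.date_range(start="1950-01-01", end="1999-12-31", freq="D")

# The 15th of every month inside that period
mid_month = ts[ts.day == 15]
n_years = len(mid_month) // 12

# Monthly averages repeated for every year, indexed by the 15th of each month
m_avg = pd.DataFrame({"Month": np.tile(month, n_years)}, index=mid_month)

# Spread onto the daily index (NaN on all other days), then fill linearly
daily = m_avg.reindex(ts).interpolate(method="linear", limit_direction="both")

print(daily.head())
```

For several columns, the same reindex and interpolate calls work unchanged on a DataFrame with more than one column.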

How can I loop over rows in my DataFrame, calculate a value and put that value in a new column with this lambda function

./test.csv looks like:
price datetime
1 100 2019-10-10
2 150 2019-11-10
...
import pandas as pd
import datetime as date
import datetime as time
from datetime import datetime
from datetime import timedelta
csv_df = pd.read_csv('./test.csv')
today = datetime.today()
csv_df['datetime'] = csv_df['expiration_date'].apply(lambda x: pd.to_datetime(x)) #convert `expiration_date` to datetime Series
def days_until_exp(expiration_date, today):
    diff = (expiration_date - today)
    return [diff]
csv_df['days_until_expiration'] = csv_df['datetime'].apply(lambda x: days_until_exp(csv_df['datetime'], today))
I am trying to iterate over a specific column in my DataFrame, csv_df['datetime'], which has just one value (a date) in each cell, and do the calculation defined by diff.
Then I want the single value diff to be put into the new Series csv_df['days_until_expiration'].
The problem is that it calculates values for every row (673 rows) and puts all those values in a list in each row of csv_df['days_until_expiration']. I realize it may be due to the brackets around [diff], but without them I get an error.
In Excel, I would just do something like =SUM(datetime - price) and click and drag down the rows to have it populate a new column. However, I want to do this in Pandas as it's part of a bigger application.
csv_df['datetime'] is a Series, so the x in apply is each cell of that Series. You call apply with a lambda and days_until_exp(), but you don't pass x to it; you pass the whole Series csv_df['datetime'] instead. That is why every row ends up holding all the values.
Anyway, without your sample data, my guess is that you want csv_df['datetime'] - today() for each row, or the sum of those differences. You don't need apply for either; just do a direct vectorized operation on the Series, and sum it if needed.
I made a 2-column dataframe as a sample:
csv_df:
datetime days_until_expiration
0 2019-09-01 NaN
1 2019-09-02 NaN
2 2019-09-03 NaN
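The following returns a Series of the differences between csv_df['datetime'] and today, which I guess is what you want:
td = datetime.today() # works with `from datetime import datetime`, as in your imports; after a plain `import datetime` use datetime.datetime.today()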
csv_df['days_until_expiration'] = (csv_df['datetime'] - td).dt.days
csv_df:
datetime days_until_expiration
0 2019-09-01 115
1 2019-09-02 116
2 2019-09-03 117
OR:
To find the sum of all deltas and assign that same sum to every row of csv_df['days_until_expiration']:
csv_df['days_until_expiration'] = (csv_df['datetime'] - td).dt.days.sum()
csv_df:
datetime days_until_expiration
0 2019-09-01 348
1 2019-09-02 348
2 2019-09-03 348
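To make the difference between the two approaches concrete, here is a small sketch that computes the per-row value both ways: vectorized, and with apply where x is actually passed to the function. The reference date is fixed to 2019-10-01 instead of today so that the numbers are reproducible; that date and the two-row frame are assumptions for illustration only.

```python
import pandas as pd

csv_df = pd.DataFrame({"price": [100, 150],
                       "datetime": ["2019-10-10", "2019-11-10"]})
csv_df["datetime"] = pd.to_datetime(csv_df["datetime"])

# Fixed reference date for reproducibility (in real code: pd.Timestamp.today())
today = pd.Timestamp("2019-10-01")

# Vectorized: subtract a scalar from the whole Series, then take whole days
csv_df["days_until_expiration"] = (csv_df["datetime"] - today).dt.days

# Equivalent with apply: x is one cell, so pass x, not the whole Series
def days_until_exp(expiration_date, today):
    return (expiration_date - today).days

via_apply = csv_df["datetime"].apply(lambda x: days_until_exp(x, today))

print(csv_df)
```

The vectorized form is usually preferable on large frames because it avoids a Python-level function call per row.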

P-value normal test for multiple rows

I got the following simple code to calculate normality over an array:
import pandas as pd
df = pd.read_excel(r"directory\file.xlsx") # raw string, otherwise "\f" is read as a form-feed escape
import numpy as np
x=df.iloc[:,1:].values.flatten()
import scipy.stats as stats
from scipy.stats import normaltest
stats.normaltest(x,axis=None)
This gives me nicely a p-value and a statistic.
The only thing I want right now is to:
Add 2 columns to the file with this p-value and statistic and, if I have multiple rows, do it for all of them (calculate the p-value and statistic for each row and add 2 columns holding these values).
Can someone help?
If you want to calculate row-wise normaltest, you should not flatten your data in x and use axis=1 such as
df = pd.DataFrame(np.random.random(105).reshape(5,21)) # to generate data
# calculate normaltest row-wise without the first column like you
df['stat'], df['p'] = stats.normaltest(df.iloc[:,1:], axis=1)
Then df contains two columns 'stat' and 'p' with the values you are looking for, IIUC.
Note: to be able to run normaltest, you need at least 8 values per row (in my experience), so df.iloc[:,1:] needs at least 8 columns; otherwise it raises an error. Even then, it is better to have more than 20 values in each row.
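For completeness, here is a self-contained sketch of the row-wise test with all imports in one place. It uses randomly generated data in place of the Excel file, and the seed and frame shape are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Random stand-in for the Excel data: 5 rows, 21 columns
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 21)))

# Skip the first column, as in the question, and test each row separately
result = stats.normaltest(df.iloc[:, 1:].to_numpy(), axis=1)

# One statistic and one p-value per row
df["stat"] = result.statistic
df["p"] = result.pvalue

print(df[["stat", "p"]])
```

Writing the frame back to a file, for example with df.to_excel, would then add the two columns to the spreadsheet.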
