Compute difference in days between two date variables - Python - python-3.x

I have two date variables, and I tried to compute the difference in days between them with:
from datetime import date, timedelta,datetime
date_format = "%Y/%m/%d"
a = datetime.strptime(df.D1, date_format)
b = datetime.strptime(df.D2, date_format)
df['delta'] = b - a
print delta.days
But I'm getting this error:
TypeError: strptime() argument 1 must be str, not Series
How could I do this? The columns are objects; should I convert them to datetime64?

Since you're working with pandas, you can use pd.to_datetime instead of the datetime package:
# Convert each date column to datetime:
df['D1'] = pd.to_datetime(df.D1,format='%Y/%m/%d')
df['D2'] = pd.to_datetime(df.D2,format='%Y/%m/%d')
# With 2 datetime Series, a simple subtraction will give you a Timedelta column:
df['delta'] = df.D1 - df.D2
For example:
>>> df
           D1          D2
0  2015/05/18  2014/06/21
1  2015/10/18  2014/08/14
df['D1'] = pd.to_datetime(df.D1,format='%Y/%m/%d')
df['D2'] = pd.to_datetime(df.D2,format='%Y/%m/%d')
df['delta'] = df.D1 - df.D2
>>> df
          D1         D2    delta
0 2015-05-18 2014-06-21 331 days
1 2015-10-18 2014-08-14 430 days
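As a follow-up to the example (not part of the original answer): the delta column holds Timedelta values, so if you want the integer day count that the question's print delta.days was after, the .dt accessor gives it:
# the Timedelta column exposes a .dt accessor; .dt.days returns plain integers
print(df['delta'].dt.days)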

Related

convert datetime to date python --> error: unhashable type: 'numpy.ndarray'

Pandas by default represents dates as datetime64[ns], so my columns have the format [2016-02-05 00:00:00], but I just want the date 2016-02-05, so I applied this code to a few columns:
df3a['MA'] = pd.to_datetime(df3a['MA'])
df3a['BA'] = pd.to_datetime(df3a['BA'])
df3a['FF'] = pd.to_datetime(df3a['FF'])
df3a['JJ'] = pd.to_datetime(df3a['JJ'])
.....
but it gives me this error: TypeError: unhashable type: 'numpy.ndarray'
My question is: why do I get this error, and how do I convert datetime to date for multiple columns (around 50)?
I will be grateful for your help.
One way to achieve what you'd like is with a DatetimeIndex. I first created an example DataFrame with 'date' and 'values' columns and tried from there to reproduce the error you got.
import pandas as pd
import numpy as np
# Example DataFrame with a DatetimeIndex (dti)
dti = pd.date_range('2020-12-01','2020-12-17') # dates from first of december up to date
values = np.random.choice(range(1, 101), len(dti)) # random values between 1 and 100
df = pd.DataFrame({'date':dti,'values':values}, index=range(len(dti)))
print(df.head())
         date  values
0  2020-12-01      85
1  2020-12-02     100
2  2020-12-03      96
3  2020-12-04      40
4  2020-12-05      27
In the example, the 'date' column already shows just the dates without the time, I guess because it comes from a DatetimeIndex.
What I haven't tested, but might work for you, is:
# Your dataframe
df3a['MA'] = pd.DatetimeIndex(df3a['MA'])
...
# automated transform for all columns (if all columns are datetimes!)
for label in df3a.columns:
    df3a[label] = pd.DatetimeIndex(df3a[label])
Use DataFrame.apply:
cols = ['MA', 'BA', 'FF', 'JJ']
df3a[cols] = df3a[cols].apply(pd.to_datetime)
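Since the question asks for the date without the 00:00:00 time part, here is a small untested sketch building on the apply approach above: convert the columns, then keep only the date component via the .dt.date accessor.
cols = ['MA', 'BA', 'FF', 'JJ']
df3a[cols] = df3a[cols].apply(pd.to_datetime)        # convert each column to datetime64
df3a[cols] = df3a[cols].apply(lambda s: s.dt.date)   # keep only the date part (datetime.date objects)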

Create a distance matrix from Pandas Dataframe using a bespoke distance function

I have a Pandas dataframe with two columns, "id" (a unique identifier) and "date", that looks as follows:
test_df.head()
   id        date
0  N1  2020-01-31
1  N2  2020-02-28
2  N3  2020-03-10
I have created a custom Python function that, given two date strings, will compute the absolute number of days between those dates (with a given date format string e.g. %Y-%m-%d), as follows:
def days_distance(date_1, date_1_format, date_2, date_2_format):
    """Calculate the number of days between two given string dates

    Args:
        date_1 (str): First date
        date_1_format (str): The format of the first date
        date_2 (str): Second date
        date_2_format (str): The format of the second date

    Returns:
        The absolute number of days between date1 and date2
    """
    date1 = datetime.strptime(date_1, date_1_format)
    date2 = datetime.strptime(date_2, date_2_format)
    return abs((date2 - date1).days)
I would like to create a distance matrix that, for all pairs of IDs, calculates the number of days between the dates associated with those IDs. Using the test_df example above, the final time distance matrix should look as follows:
    N1  N2  N3
N1   0  28  39
N2  28   0  11
N3  39  11   0
I am struggling to find a way to compute a distance matrix using a bespoke distance function, such as my days_distance() function above, as opposed to a standard distance measure provided for example by SciPy.
Any suggestions?
Let us try pdist + squareform to create a square distance matrix representing the pairwise differences between the datetime values, and finally create a new DataFrame from this square matrix:
from scipy.spatial.distance import pdist, squareform
i, d = test_df['id'].values, pd.to_datetime(test_df['date'])
df = pd.DataFrame(squareform(pdist(d.to_numpy()[:, None])), dtype='timedelta64[ns]', index=i, columns=i)
Alternatively you can also calculate the distance matrix using numpy broadcasting:
i, d = test_df['id'].values, pd.to_datetime(test_df['date']).values
df = pd.DataFrame(np.abs(d[:, None] - d), index=i, columns=i)
         N1       N2       N3
N1   0 days  28 days  39 days
N2  28 days   0 days  11 days
N3  39 days  11 days   0 days
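If you prefer the plain integer matrix shown in the question over Timedelta values, one possible follow-up (a sketch, not part of the original answer) is to reduce each column to whole days:
# each column of df is timedelta64[ns]; .dt.days turns it into integer days
df = df.apply(lambda s: s.dt.days)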
You can convert the date column to datetime format, create a numpy array from that column, tile the array into a matrix with three repeated rows, take the absolute difference between that matrix and its transpose, and convert the result to a DataFrame:
import pandas as pd
import numpy as np
from datetime import datetime
test_df = pd.DataFrame({'ID': ['N1', 'N2', 'N3'],
'date': ['2020-01-31', '2020-02-28', '2020-03-10']})
test_df['date_datetime'] = test_df.date.apply(lambda x : datetime.strptime(x, '%Y-%m-%d'))
date_array = np.array(test_df.date_datetime)
date_matrix = np.tile(date_array, (3,1))
date_diff_matrix = np.abs((date_matrix.T - date_matrix))
date_diff = pd.DataFrame(date_diff_matrix)
date_diff.columns = test_df.ID
date_diff.index = test_df.ID
>>> date_diff
ID       N1       N2       N3
ID
N1   0 days  28 days  39 days
N2  28 days   0 days  11 days
N3  39 days  11 days   0 days
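If you specifically want to reuse the bespoke days_distance() function exactly as written, a simple (if O(n²) and slower) sketch, offered only as an illustration and assuming the question's test_df with string 'id' and 'date' columns is in scope, is a nested comprehension:
# brute-force pairwise matrix built with the custom string-based distance function
dates = test_df['date'].tolist()
ids = test_df['id'].tolist()
matrix = [[days_distance(a, '%Y-%m-%d', b, '%Y-%m-%d') for b in dates] for a in dates]
dist_df = pd.DataFrame(matrix, index=ids, columns=ids)
print(dist_df)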

How can I loop over rows in my DataFrame, calculate a value and put that value in a new column with this lambda function

./test.csv looks like:
price datetime
1 100 2019-10-10
2 150 2019-11-10
...
import pandas as pd
import datetime as date
import datetime as time
from datetime import datetime
from datetime import timedelta
csv_df = pd.read_csv('./test.csv')
today = datetime.today()
csv_df['datetime'] = csv_df['expiration_date'].apply(lambda x: pd.to_datetime(x)) #convert `expiration_date` to datetime Series
def days_until_exp(expiration_date, today):
    diff = (expiration_date - today)
    return [diff]
csv_df['days_until_expiration'] = csv_df['datetime'].apply(lambda x: days_until_exp(csv_df['datetime'], today))
I am trying to iterate over a specific column in my DataFrame labeled csv_df['datetime'], which in each cell has just one value, a date, and do the calculation defined by diff.
Then I want the single value diff to be put into the new Series csv_df['days_until_expiration'].
The problem is, it's calculating values for every row (673 rows) and putting all those values in a list in each row of csv_df['days_until_expiration']. I realize it may be due to the brackets around [diff], but without them I get an error.
In Excel, I would just do something like =SUM(datetime - price) and click and drag down the rows to have it populate a new column. However, I want to do this in Pandas as it's part of a bigger application.
csv_df['datetime'] is a Series, so the x in apply is each cell of the Series. You call apply with a lambda and days_until_exp(), but you don't pass x to it; you pass the whole Series instead. Therefore, the result is wrong.
Anyway, without your sample data, I guess you want the deltas between csv_df['datetime'] and today() (or their sum). To do this, you don't need apply; just do a direct vectorized operation on the Series.
I made a two-column DataFrame as a sample:
csv_df:
datetime days_until_expiration
0 2019-09-01 NaN
1 2019-09-02 NaN
2 2019-09-03 NaN
The following returns a Series of deltas between csv_df['datetime'] and today(), which I guess is what you want:
td = datetime.datetime.today()
csv_df['days_until_expiration'] = (csv_df['datetime'] - td).dt.days
csv_df:
datetime days_until_expiration
0 2019-09-01 115
1 2019-09-02 116
2 2019-09-03 117
Or, to find the sum of all deltas and assign that same sum to every row of csv_df['days_until_expiration']:
csv_df['days_until_expiration'] = (csv_df['datetime'] - td).dt.days.sum()
csv_df:
datetime days_until_expiration
0 2019-09-01 348
1 2019-09-02 348
2 2019-09-03 348
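If you do want to keep the question's apply + lambda style rather than the vectorized version, a minimal sketch (assuming csv_df['datetime'] has already been converted with pd.to_datetime) is to pass each cell x to the calculation instead of the whole Series:
today = datetime.today()   # works with the question's `from datetime import datetime` import
csv_df['days_until_expiration'] = csv_df['datetime'].apply(lambda x: (x - today).days)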

How to define a time column in a pandas dataframe?

I have a "pandas.core.frame.DataFrame" "seied_log":
seied_log
Out[155]:
0
0 5.264761
1 5.719328
2 6.420809
3 6.129704
...
What I run is an ARIMA model:
model = ARIMA(seied_log, order=(2, 1, 0))
However, I receive the following error:
ValueError: Given a pandas object and the index does not contain dates
What I need is to define a "date" column. These are yearly observations. How can I define a column with dates starting from 1978?
If your index is 0 through n_obs - 1, then simply:
from datetime import datetime
seied_log["date"] = seied_log.index.map(lambda idx: datetime(year=1978 + idx, month=1, day=1))

Convert pandas dataframe column containing Excel general numbers into datetime object

I have a dataframe that I constructed by pulling data from SQL using pd.read_sql_query(). One column has dates, but in Excel general number format. How do I convert this column into datetime objects?
I can convert one value with the xlrd library, but I am looking for the best way to convert the entire column.
datetime_value = datetime(*xlrd.xldate_as_tuple(42369, 0))
You can use map to apply a lambda function performing that operation to every entry in a column:
import pandas as pd
import xlrd
from datetime import datetime
# Create dummy dataframe
df = pd.DataFrame({
"date": [42369, 42370, 42371, 42372]
})
print(df.to_string())
# Convert values into a new column named "converted"
df["converted"] = df["date"].map(lambda x: datetime(*xlrd.xldate_as_tuple(x, 0)))
print(df.to_string())
Before conversion:
date
0 42369
1 42370
2 42371
3 42372
After:
date converted
0 42369 2015-12-31
1 42370 2016-01-01
2 42371 2016-01-02
3 42372 2016-01-03
Is this what you are looking for?
Update:
To make this work with string entries, you could either tell Pandas to treat the column as ints or floats:
# int
df["converted"] = df["date"].astype(int).map(lambda x: datetime(*xlrd.xldate_as_tuple(x, 0)))
# float
df["converted"] = df["date"].astype(float).map(lambda x: datetime(*xlrd.xldate_as_tuple(x, 0)))
or just cast x to int or float within the lambda function:
# int
df["converted"] = df["date"].map(lambda x: datetime(*xlrd.xldate_as_tuple(int(x), 0)))
# float
df["converted"] = df["date"].map(lambda x: datetime(*xlrd.xldate_as_tuple(float(x), 0)))
