Create a distance matrix from Pandas Dataframe using a bespoke distance function - python-3.x

I have a Pandas dataframe with two columns, "id" (a unique identifier) and "date", that looks as follows:
test_df.head()

  id        date
0 N1  2020-01-31
1 N2  2020-02-28
2 N3  2020-03-10
I have created a custom Python function that, given two date strings, will compute the absolute number of days between those dates (with a given date format string e.g. %Y-%m-%d), as follows:
from datetime import datetime

def days_distance(date_1, date_1_format, date_2, date_2_format):
    """Calculate the number of days between two given string dates.

    Args:
        date_1 (str): First date
        date_1_format (str): The format of the first date
        date_2 (str): Second date
        date_2_format (str): The format of the second date

    Returns:
        The absolute number of days between date_1 and date_2
    """
    date1 = datetime.strptime(date_1, date_1_format)
    date2 = datetime.strptime(date_2, date_2_format)
    return abs((date2 - date1).days)
I would like to create a distance matrix that, for every pair of IDs, contains the number of days between their respective dates. Using the test_df example above, the final time distance matrix should look as follows:

     N1  N2  N3
N1    0  28  39
N2   28   0  11
N3   39  11   0
I am struggling to find a way to compute a distance matrix using a bespoke distance function, such as my days_distance() function above, as opposed to a standard distance measure provided for example by SciPy.
Any suggestions?

Let us try pdist + squareform to create a square distance matrix representing the pairwise differences between the datetime values, and finally create a new dataframe from this square matrix:

from scipy.spatial.distance import pdist, squareform

i = test_df['id'].values
d = pd.to_datetime(test_df['date']).values.astype(float)  # nanoseconds since the epoch
df = pd.DataFrame(squareform(pdist(d[:, None])), dtype='timedelta64[ns]', index=i, columns=i)
Alternatively, you can calculate the distance matrix using NumPy broadcasting:

import numpy as np

i, d = test_df['id'].values, pd.to_datetime(test_df['date']).values
df = pd.DataFrame(np.abs(d[:, None] - d), index=i, columns=i)
         N1      N2      N3
N1   0 days 28 days 39 days
N2  28 days  0 days 11 days
N3  39 days 11 days  0 days
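If you want plain integer day counts (to match the matrix in the question) rather than Timedelta values, the broadcast result can be divided by a one-day timedelta; a small self-contained sketch of that variation:

```python
import numpy as np
import pandas as pd

test_df = pd.DataFrame({'id': ['N1', 'N2', 'N3'],
                        'date': ['2020-01-31', '2020-02-28', '2020-03-10']})

i, d = test_df['id'].values, pd.to_datetime(test_df['date']).values
# Dividing the timedelta matrix by one day yields plain floats, then ints
days = (np.abs(d[:, None] - d) / np.timedelta64(1, 'D')).astype(int)
df = pd.DataFrame(days, index=i, columns=i)
print(df)
```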

You can convert the date column to datetime format, create a NumPy array from the column, tile that array into a 3x3 matrix, subtract the matrix from its transpose, and convert the absolute result back into a dataframe:
import pandas as pd
import numpy as np
from datetime import datetime

test_df = pd.DataFrame({'ID': ['N1', 'N2', 'N3'],
                        'date': ['2020-01-31', '2020-02-28', '2020-03-10']})
test_df['date_datetime'] = test_df.date.apply(lambda x: datetime.strptime(x, '%Y-%m-%d'))
date_array = np.array(test_df.date_datetime)
date_matrix = np.tile(date_array, (3, 1))
date_diff_matrix = np.abs(date_matrix.T - date_matrix)
date_diff = pd.DataFrame(date_diff_matrix)
date_diff.columns = test_df.ID
date_diff.index = test_df.ID
>>> date_diff
ID       N1      N2      N3
ID
N1   0 days 28 days 39 days
N2  28 days  0 days 11 days
N3  39 days 11 days  0 days

Related

Making rows of points in dataframe into POLYGON using geopandas

I am trying to find out if points lie in closed polygons (see this question: Finding if a point in a dataframe is in a polygon and assigning polygon name to point), but I realized that there might be another way to do this:
I have this dataframe
df =
  id     x_zone     y_zone
0 A1  65.422080  48.147850
1 A1  46.635708  51.165745
2 A1  46.597984  47.657444
3 A1  68.477700  44.073700
4 A3  46.635708  54.108190
5 A3  46.635708  51.844770
6 A3  63.309560  48.826878
7 A3  62.215572  54.108190
and I would like to transform this into
  id  Polygon
0 A1  POLYGON((65.422080, 48.147850), (46.635708, 51.165745), (46.597984, 47.657444), (68.477700, 44.073700))
1 A3  POLYGON((46.635708, 54.108190), (46.635708, 51.844770), (63.309560, 48.826878), (62.215572, 54.108190))
and do the same for points:
df1 =
   item   x   y
0     1  50  49
1     2  60  53
2     3  70  30
to
   item  point
0     1  POINT(50, 49)
1     2  POINT(60, 53)
2     3  POINT(70, 30)
I have never used geopandas and am a little at a loss here.
My question is thus: How do I get from a pandas dataframe to a dataframe with geopandas attributes?
Thankful for any insight!
You can achieve this as follows, but you would have to set the right dtype. I know that in ArcGIS you have to set the dtype as geometry:

df.groupby('id').apply(lambda x: 'POLYGON(' + str(tuple(zip(x['x_zone'], x['y_zone']))) + ')')
I'd suggest the following to get a GeoDataFrame directly from your df:

from shapely.geometry import Polygon
import geopandas as gpd

gdf = gpd.GeoDataFrame(geometry=df.groupby('id').apply(
    lambda g: Polygon(gpd.points_from_xy(g['x_zone'], g['y_zone']))))

It first creates a list of points using geopandas' points_from_xy, then creates a Polygon object from this list.
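For the point-in-polygon part of the original question, here is a minimal sketch using only shapely (the library geopandas builds on). The polygon is assembled from the A1 rows of df and the points from df1, so the names are illustrative:

```python
from shapely.geometry import Point, Polygon

# Zone A1 built from its four (x_zone, y_zone) rows
a1 = Polygon([(65.422080, 48.147850), (46.635708, 51.165745),
              (46.597984, 47.657444), (68.477700, 44.073700)])

# The three points from df1, keyed by item
points = {1: Point(50, 49), 2: Point(60, 53), 3: Point(70, 30)}

# contains() is True only for points strictly inside the polygon
inside = {item: a1.contains(p) for item, p in points.items()}
print(inside)
```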

Take the most recent date and the most remote one and calculate the months between them with .groupby

I'd like to get the number of months between these dates (between the max and the min date) and keep the same order in the groupby.
One possible solution is to start from datesac - the result of your grouping (presented in your picture).
I also assume that the ORDER_INST column of your source DataFrame is of datetime type (not string), and hence that this type also appears at level 1 of the MultiIndex in datesac.
To compute the month span separately for each MRN (level 0 of the MultiIndex), define a function to be applied to each group:

def monthSpan(grp):
    dates = grp.index.get_level_values(1)
    return (dates.max().to_period('M') - dates.min().to_period('M')).n

Then add a MonthSpan column to your df, running:

datesac['MonthSpan'] = datesac.groupby(level=0).transform(monthSpan)
The result is:
                    List  MonthSpan
MRN     ORDER_INST
1000031 2010-04-12     0         11
        2010-04-16     0         11
        2010-04-17     0         11
        2010-04-18     0         11
        2011-03-01     0         11
9017307 2018-11-27     0          7
        2019-02-04     0          7
        2019-04-25     0          7
        2019-05-14     0          7
        2019-06-09     0          7
Pandas does not allow item assignment on a groupby object (a new column cannot be added to a groupby object), so the operation has to be split: first calculate the month difference from the groupby object, merge the dataframes together, and then group again.
Create the first groupby object:
datesac = acdates.groupby(['MRN'])
Calculate the difference in months between each group and join it to the original dataframe (or a new dataframe). This method requires NumPy, so import it as necessary:

import numpy as np

acdates_new = pd.merge(
    left=acdates,
    right=((datesac['ORDER_INST'].max() - datesac['ORDER_INST'].min())
           / np.timedelta64(1, 'M')).astype('int').rename('DATE_DIFF'),
    left_on='MRN',
    right_index=True
)
Regroup
datesac = acdates_new.groupby(['MRN'])
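Since the original data isn't shown, here is a small self-contained sketch of the month-span idea; the MRN values and dates below are made up (taken from the output table above), and the groupby aggregates directly on the date column:

```python
import pandas as pd

# Hypothetical stand-in for the asker's data
acdates = pd.DataFrame({
    'MRN': [1000031, 1000031, 9017307, 9017307],
    'ORDER_INST': pd.to_datetime(['2010-04-12', '2011-03-01',
                                  '2018-11-27', '2019-06-09']),
})

# Month span per MRN: difference between the max and min month periods
span = acdates.groupby('MRN')['ORDER_INST'].agg(
    lambda s: (s.max().to_period('M') - s.min().to_period('M')).n)
print(span)
```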

How to bucket/bin the dates in python?

I have a column with 16 days, 256 days, 450 days as values, which was obtained by subtracting 2 date columns (eg. 2010-11-10 - 2010-11-1). I want to bin the dates into 4 categories (0-30 days as 1, 30-90 days as 2, 90-180 days as 3 and greater than 180 days as 4).
I tried converting the column into categorical and then tried to split the (16 days to '16' and 'days') but got an error.
df_merged['Case_Duration'] = df_merged['DateOfResolution'] - df_merged['DateOfRegistration']

DateOfRegistration and DateOfResolution are date fields (e.g. 2010-11-1).

df_merged['Case_Duration'] = df_merged['Case_Duration'].astype('category')

to convert the 'Case_Duration' column to category.

df_Days = df_merged['Case_Duration'].str.split(' ', n=1, expand=True)

to split the 'Case_Duration' column values (e.g. 16 days -> '16' and 'days').
But this step gives an error -> can only use .str accessor with string values, which use np.object_ dtype in pandas
Desired output:
Here I create a pandas DataFrame named data with timestamps in columns a and b (to represent your initial datetime columns). Column bucket holds your desired output:
import numpy as np
import pandas as pd

data_dic = {
    "a": ['2019-07-26 13:21:12', '2019-07-26 13:21:12', '2019-07-26 13:21:12', '2019-07-26 13:21:12'],
    "b": ['2019-03-26 13:21:12', '2019-05-26 13:21:12', '2019-07-23 13:21:12', '2019-02-26 13:21:12'],
}
data = pd.DataFrame(data_dic)
data['a'] = pd.to_datetime(data['a'])
data['b'] = pd.to_datetime(data['b'])
days = (data['a'] - data['b']).dt.days
data['bucket'] = np.select([days < 31, days < 91, days < 181], [1, 2, 3], 4)
Note that
(data['a'] - data['b']).dt.days
computes the time difference in days
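An alternative worth knowing for this kind of binning is pd.cut, which takes the bin edges and labels directly; a short sketch with made-up durations standing in for Case_Duration:

```python
import pandas as pd

# Hypothetical Case_Duration values as Timedeltas
days = pd.Series(pd.to_timedelta([16, 256, 45, 450, 120], unit='D'))

# Bin the day counts into the four requested categories:
# (0, 30] -> 1, (30, 90] -> 2, (90, 180] -> 3, (180, inf) -> 4
buckets = pd.cut(days.dt.days, bins=[0, 30, 90, 180, float('inf')],
                 labels=[1, 2, 3, 4])
print(buckets.tolist())
```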

How can I loop over rows in my DataFrame, calculate a value and put that value in a new column with this lambda function

./test.csv looks like:

   price    datetime
1    100  2019-10-10
2    150  2019-11-10
...
import pandas as pd
from datetime import datetime
from datetime import timedelta

csv_df = pd.read_csv('./test.csv')
today = datetime.today()
csv_df['datetime'] = csv_df['expiration_date'].apply(lambda x: pd.to_datetime(x))  # convert `expiration_date` to a datetime Series

def days_until_exp(expiration_date, today):
    diff = (expiration_date - today)
    return [diff]

csv_df['days_until_expiration'] = csv_df['datetime'].apply(lambda x: days_until_exp(csv_df['datetime'], today))
I am trying to iterate over a specific column in my DataFrame, labeled csv_df['datetime'], in which each cell holds just one value, a date, and do a calculation defined by diff.
Then I want the single value diff to be put into the new Series csv_df['days_until_expiration'].
The problem is, it's calculating values for every row (673 rows) and putting all those values in a list in each row of csv_df['days_until_expiration']. I realize it may be due to the brackets around [diff], but without them I get an error.
In Excel, I would just do something like =SUM(datetime - price) and click and drag down the rows to have it populate a new column. However, I want to do this in Pandas as it's part of a bigger application.
csv_df['datetime'] is a Series, so the x in apply is each cell of that Series. You call apply with a lambda and days_until_exp(), but you aren't passing x to it; therefore the result is wrong.
Anyway, without your sample data, I guess that you want csv_df['datetime'] - today(), either per row or as a sum. To do this, you don't need apply; just do a direct vectorized operation on the Series.
I make a two-column dataframe as a sample:

csv_df:
    datetime  days_until_expiration
0 2019-09-01                    NaN
1 2019-09-02                    NaN
2 2019-09-03                    NaN
The following returns a Series of deltas between csv_df['datetime'] and today(), which I guess is what you want:

td = datetime.datetime.today()
csv_df['days_until_expiration'] = (csv_df['datetime'] - td).dt.days
csv_df:
    datetime  days_until_expiration
0 2019-09-01                    115
1 2019-09-02                    116
2 2019-09-03                    117
Or, to find the sum of all deltas and assign that same sum value to every row of csv_df['days_until_expiration']:
csv_df['days_until_expiration'] = (csv_df['datetime'] - td).dt.days.sum()
csv_df:
    datetime  days_until_expiration
0 2019-09-01                    348
1 2019-09-02                    348
2 2019-09-03                    348
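If you do want to keep apply, the fix implied by the answer is to use the lambda's argument x rather than the whole Series; a sketch with a fixed "today" (an assumption, chosen so the output is reproducible):

```python
import pandas as pd
from datetime import datetime

csv_df = pd.DataFrame({'datetime': pd.to_datetime(['2019-09-01', '2019-09-02'])})
today = datetime(2019, 5, 9)  # fixed date instead of datetime.today(), for reproducibility

# Pass each cell x to the calculation instead of the whole Series
csv_df['days_until_expiration'] = csv_df['datetime'].apply(lambda x: (x - today).days)
print(csv_df['days_until_expiration'].tolist())  # [115, 116]
```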

Compute difference in days between two date variables - Python

I have two date variables, and I tried to compute the difference in days between them with:
from datetime import date, timedelta, datetime

date_format = "%Y/%m/%d"
a = datetime.strptime(df.D1, date_format)
b = datetime.strptime(df.D2, date_format)
df['delta'] = b - a
print(delta.days)
But I'm getting this error:
TypeError: strptime() argument 1 must be str, not Series
How could I do this? The variables are objects; should I transform them to datetime64?
Since you're working with pandas, you can use pd.to_datetime instead of the datetime package:
# Convert each date column to datetime:
df['D1'] = pd.to_datetime(df.D1, format='%Y/%m/%d')
df['D2'] = pd.to_datetime(df.D2, format='%Y/%m/%d')
# With two datetime Series, a simple subtraction gives you a Timedelta column:
df['delta'] = df.D1 - df.D2
For example:
>>> df
           D1          D2
0  2015/05/18  2014/06/21
1  2015/10/18  2014/08/14

df['D1'] = pd.to_datetime(df.D1, format='%Y/%m/%d')
df['D2'] = pd.to_datetime(df.D2, format='%Y/%m/%d')
df['delta'] = df.D1 - df.D2

>>> df
          D1         D2    delta
0 2015-05-18 2014-06-21 331 days
1 2015-10-18 2014-08-14 430 days
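If you need the delta as a plain integer rather than a Timedelta, the .dt.days accessor on the resulting column gives the day count; a small sketch continuing the same example:

```python
import pandas as pd

df = pd.DataFrame({'D1': ['2015/05/18', '2015/10/18'],
                   'D2': ['2014/06/21', '2014/08/14']})
df['D1'] = pd.to_datetime(df.D1, format='%Y/%m/%d')
df['D2'] = pd.to_datetime(df.D2, format='%Y/%m/%d')

# .dt.days extracts an integer day count from the Timedelta column
df['delta_days'] = (df.D1 - df.D2).dt.days
print(df['delta_days'].tolist())  # [331, 430]
```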
