I have a column with values like 16 days, 256 days, 450 days, obtained by subtracting two date columns (e.g. 2010-11-10 - 2010-11-1). I want to bin the durations into 4 categories (0-30 days as 1, 30-90 days as 2, 90-180 days as 3, and greater than 180 days as 4).
I tried converting the column to categorical and then splitting the values (16 days into '16' and 'days'), but got an error.
df_merged['Case_Duration'] = df_merged['DateOfResolution'] -df_merged['DateOfRegistration']
DateOfRegistration and DateOfResolution are date fields (eg. 2010-11-1)
df_merged['Case_Duration'] = df_merged['Case_Duration'].astype('category')
to convert 'Case_Duration' column to category
df_Days = df_merged["Case_Duration"].str.split(" ", n = 1, expand = True)
to split the 'Case_Duration' column values. (eg. 16 days -> '16' and 'days')
But this step gives an error -> can only use .str accessor with string values, which use np.object_ dtype in pandas
Desired output:
Here I create a pandas DataFrame named data with sample timestamps in columns a and b (to represent your initial datetime columns). Column bucket holds your desired output.
import numpy as np
import pandas as pd
data_dic = {
"a": ['2019-07-26 13:21:12','2019-07-26 13:21:12','2019-07-26 13:21:12','2019-07-26 13:21:12'],
"b": ['2019-03-26 13:21:12','2019-05-26 13:21:12','2019-07-23 13:21:12','2019-02-26 13:21:12'],
}
data = pd.DataFrame(data_dic)
data['a'] = pd.to_datetime(data['a'])
data['b'] = pd.to_datetime(data['b'])
data['bucket'] = np.select(
    [(data['a'] - data['b']).dt.days < 31,
     (data['a'] - data['b']).dt.days < 91,
     (data['a'] - data['b']).dt.days < 181],
    [1, 2, 3], 4)
Note that
(data['a'] - data['b']).dt.days
computes the time difference in days
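Since four buckets with fixed edges are wanted, pd.cut is a compact alternative to np.select; a minimal sketch on a similar toy frame (dates shortened to midnight for brevity):

```python
import numpy as np
import pandas as pd

# toy frame standing in for the two datetime columns
data = pd.DataFrame({
    "a": pd.to_datetime(["2019-07-26"] * 4),
    "b": pd.to_datetime(["2019-03-26", "2019-05-26", "2019-07-23", "2019-02-26"]),
})
days = (data["a"] - data["b"]).dt.days
# bins: (0, 30] -> 1, (30, 90] -> 2, (90, 180] -> 3, (180, inf) -> 4
data["bucket"] = pd.cut(days, bins=[0, 30, 90, 180, np.inf],
                        labels=[1, 2, 3, 4]).astype(int)
```

pd.cut keeps the bin edges in one place, so adding or moving a boundary is a one-line change.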
I am not too experienced with programming, and I got stuck in a research project in the asset management field.
My Goal:
I have 2 dataframes: the first contains, among other columns, "European short date", "SP150030after" and "SP1500365before" (screenshot); the second contains the columns "Dates" and "S&P 1500_return" (screenshot). For each row in the first dataframe, I want to calculate the cumulative return of the S&P 1500 for the 365 days before the date in "European short date" and the cumulative return for the 30 days after that date, and put these results in the columns "SP1500365before" and "SP150030after".
These returns are to be calculated using the second dataframe: its "S&P 1500_return" column holds, for each date, the daily return of the S&P 1500 market index + 1. So, for example, to get the cumulative return over the year before 31.12.2020 in the first dataframe, I would calculate the product of the "S&P 1500_return" values in the second dataframe for every trading day present there during the period 31.12.2019 - 30.12.2020.
What I have tried so far:
I made "European short date" the index of DataFrame 1 and "Date" the index of DataFrame 2, and thought about approaching my goal with a "for" loop. I tried turning "European short date" into a list to use it to iterate through dataframe 1, but I get the following error: "/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:18: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead".
Here is my code so far:
Main_set = pd.read_excel('...')
Main_set = pd.DataFrame(Main_set)
Main_set['European short date'] = pd.to_datetime(Main_set['European short date'], format='%d.%m.%y', errors='coerce').dt.strftime('%Y-%m-%d')
Main_set = Main_set.set_index('European short date')
Main_set.head(5)
Indexes = pd.read_excel('...')
Indexes = pd.DataFrame(Indexes)
Indexes['Date'] = pd.to_datetime(Indexes['Date'], format='%d.%m.%y', errors='coerce').dt.strftime('%Y-%m-%d')
Indexes = Indexes.set_index('Date')
SP1500DailyReturns = Indexes[['S&P 1500 SUPER COMPOSITE - PRICE INDEX']]
SP1500DailyReturns['S&P 1500_return'] = (SP1500DailyReturns['S&P 1500 SUPER COMPOSITE - PRICE INDEX'] / SP1500DailyReturns['S&P 1500 SUPER COMPOSITE - PRICE INDEX'].shift(1))
SP1500DailyReturns.to_csv('...')
Main_set['SP50030after'] = np.zeros(326)
import math
dates = Main_set.index.to_list()  # 'European short date' is the index after set_index above
for n in dates:
Main_set['SP50030after'] = math.prod(arr)
Many thanks in advance!
In case it is useful for someone, I solved the problem by using a for loop and breaking the problem into smaller steps:
for n in dates:
    Date = pd.Timestamp(n)
    DateB4 = Date - pd.Timedelta("365 days")
    DateAfter = Date + pd.Timedelta("30 days")
    ReturnsofPeriodsBackwards = SP1500DailyReturns.loc[str(DateB4) : str(Date), 'S&P 1500_return']
    Main_set.loc[str(Date), 'SP500365before'] = np.prod(ReturnsofPeriodsBackwards)
    ReturnsofPeriodsForward = SP1500DailyReturns.loc[str(Date) : str(DateAfter), 'S&P 1500_return']
    Main_set.loc[str(Date), 'SP50030after'] = np.prod(ReturnsofPeriodsForward)
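If the loop ever becomes a bottleneck, the per-window products can be computed without looping over dates: take one running cumprod of the return series, and the product over any window is the ratio of two cumprod values. A minimal sketch with made-up data (the business-day index and the constant 1% daily return are illustrative, not from the question's dataset):

```python
import pandas as pd

# hypothetical daily "return + 1" series indexed by trading day
idx = pd.bdate_range("2020-01-01", periods=10)
ret = pd.Series(1.01, index=idx)  # constant 1% daily return, for illustration only

cum = ret.cumprod()
# product of returns over the window (start, end] equals cum[end] / cum[start]
start, end = idx[2], idx[7]
window_prod = cum.asof(end) / cum.asof(start)
```

Series.asof is used instead of plain indexing so that window edges falling between trading days pick up the nearest earlier value.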
raw_data = {'Event': ['A','B','C','D', 'E'],
'dates': ['08-12-1600','26-09-1400', '04-11-1991','25-03-1991', '10-05-1991']}
df_1 = pd.DataFrame(raw_data, columns = ['Event', 'dates'])
df_1['dates'] = pd.to_datetime(df_1['dates'])
The above code gives an error due to the date 08-12-1600; if that date is removed, it works fine. What could be the possible reason for it?
The error is:
Out of bounds nanosecond timestamp: 1600-08-12 00:00:00
That is because the provided dates are outside the range of Timestamp.
pd.Timestamp.min
Timestamp('1677-09-21 00:12:43.145225')
pd.Timestamp.max
Timestamp('2262-04-11 23:47:16.854775807')
If we need the dates even when they are out of Timestamp range, we can convert them to Period using the code below:
raw_data = {'Event': ['A','B','C','D', 'E'],
'dates': ['08-12-1600','26-09-1400', '04-11-1991','25-03-1991', '10-05-1991']}
df_1 = pd.DataFrame(raw_data, columns = ['Event', 'dates'])
def conv(x):
    day, month, year = tuple(x.split('-'))
    return pd.Period(year=int(year), month=int(month), day=int(day), freq="D")
df_1['dates'] = df_1.dates.apply(conv)
df_1
Output
Event dates
0 A 1600-12-08
1 B 1400-09-26
2 C 1991-11-04
3 D 1991-03-25
4 E 1991-05-10
If we can ignore the dates outside the range (note dayfirst=True, so the day-month-year strings parse consistently with the Period approach above)
df_1['dates'] = pd.to_datetime(df_1.dates, errors='coerce', dayfirst=True)
df_1
Output
Event dates
0 A NaT
1 B NaT
2 C 1991-11-04
3 D 1991-03-25
4 E 1991-05-10
Bonus Fact
Why can a Timestamp only hold values in a roughly 584-year window, 1677-2262?
Timestamps provide nanosecond precision and are stored in a signed 64-bit integer; 2^64 nanoseconds come to roughly 584 years, so, centred on the 1970 epoch, the representable range runs from 1677 to 2262.
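The 584-year figure can be checked with one line of arithmetic: 2^64 nanoseconds divided by the nanoseconds in a (Julian) year.

```python
NS_PER_YEAR = 1e9 * 60 * 60 * 24 * 365.25  # nanoseconds in an average year

span_years = 2**64 / NS_PER_YEAR   # full span of a 64-bit nanosecond counter (~584.5)
half_span = 2**63 / NS_PER_YEAR    # reach on each side of the 1970 epoch (~292.3)
```

1970 − 292 ≈ 1677 and 1970 + 292 ≈ 2262, matching pd.Timestamp.min and pd.Timestamp.max above.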
I have a Pandas dataframe with two columns, "id" (a unique identifier) and "date", that looks as follows:
test_df.head()
id date
0 N1 2020-01-31
1 N2 2020-02-28
2 N3 2020-03-10
I have created a custom Python function that, given two date strings, will compute the absolute number of days between those dates (with a given date format string e.g. %Y-%m-%d), as follows:
from datetime import datetime

def days_distance(date_1, date_1_format, date_2, date_2_format):
    """Calculate the number of days between two given string dates
    Args:
        date_1 (str): First date
        date_1_format (str): The format of the first date
        date_2 (str): Second date
        date_2_format (str): The format of the second date
    Returns:
        The absolute number of days between date1 and date2
    """
    date1 = datetime.strptime(date_1, date_1_format)
    date2 = datetime.strptime(date_2, date_2_format)
    return abs((date2 - date1).days)
I would like to create a distance matrix that, for all pairs of IDs, will calculate the number of days between those IDs. Using the test_df example above, the final time distance matrix should look as follows:
N1 N2 N3
N1 0 28 39
N2 28 0 11
N3 39 11 0
I am struggling to find a way to compute a distance matrix using a bespoke distance function, such as my days_distance() function above, as opposed to a standard distance measure provided for example by SciPy.
Any suggestions?
Let us try pdist + squareform to create a square distance matrix representing the pairwise differences between the dates, then build a new dataframe from this square matrix (the datetimes are viewed as integer nanoseconds so that pdist receives numeric input):
from scipy.spatial.distance import pdist, squareform

i = test_df['id'].values
d = pd.to_datetime(test_df['date']).values.astype('int64')  # nanoseconds since the epoch
df = pd.DataFrame(squareform(pdist(d[:, None])), dtype='timedelta64[ns]', index=i, columns=i)
Alternatively you can also calculate the distance matrix using numpy broadcasting:
i, d = test_df['id'].values, pd.to_datetime(test_df['date']).values
df = pd.DataFrame(np.abs(d[:, None] - d), index=i, columns=i)
N1 N2 N3
N1 0 days 28 days 39 days
N2 28 days 0 days 11 days
N3 39 days 11 days 0 days
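To address the "bespoke distance function" angle directly: for a handful of IDs you can also fill the matrix with two nested loops calling days_distance itself, with no SciPy at all. A minimal sketch using the question's sample data:

```python
import numpy as np
import pandas as pd
from datetime import datetime

def days_distance(date_1, date_1_format, date_2, date_2_format):
    # absolute number of days between two string dates (as in the question)
    date1 = datetime.strptime(date_1, date_1_format)
    date2 = datetime.strptime(date_2, date_2_format)
    return abs((date2 - date1).days)

ids = ["N1", "N2", "N3"]
dates = ["2020-01-31", "2020-02-28", "2020-03-10"]

n = len(dates)
mat = np.zeros((n, n), dtype=int)
for a in range(n):
    for b in range(a + 1, n):  # the distance is symmetric, so compute each pair once
        d = days_distance(dates[a], "%Y-%m-%d", dates[b], "%Y-%m-%d")
        mat[a, b] = mat[b, a] = d

dist = pd.DataFrame(mat, index=ids, columns=ids)
```

This is O(n^2) calls to the custom function, so it only suits small id sets; for large ones, prefer the vectorized subtraction shown above.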
You can convert the date column to datetime format, build a NumPy array from it, tile the array into a 3x3 matrix, subtract the matrix from its transpose, and convert the absolute result to a dataframe:
import pandas as pd
import numpy as np
from datetime import datetime
test_df = pd.DataFrame({'ID': ['N1', 'N2', 'N3'],
'date': ['2020-01-31', '2020-02-28', '2020-03-10']})
test_df['date_datetime'] = test_df.date.apply(lambda x : datetime.strptime(x, '%Y-%m-%d'))
date_array = np.array(test_df.date_datetime)
date_matrix = np.tile(date_array, (3,1))
date_diff_matrix = np.abs((date_matrix.T - date_matrix))
date_diff = pd.DataFrame(date_diff_matrix)
date_diff.columns = test_df.ID
date_diff.index = test_df.ID
ID       N1      N2      N3
ID
N1   0 days 28 days 39 days
N2  28 days  0 days 11 days
N3  39 days 11 days  0 days
I have a large (>32 M rows) Pandas dataframe.
In column 'Time_Stamp' I have a Unix timestamp in seconds. These values are not linear, there are gaps, and some timestamps can be duplicated (ex: 1, 2, 4, 6, 6, 9,...).
I would like to set column 'Result' of current row to the index of the row that is 60 seconds before current row (closest match if there are no rows exactly 60 seconds before current row, and if more than one match, take maximum of all matches).
I've tried this to first get the list of indexes, but it always returns an empty list:
df.index[df['Time_Stamp'] <= df.Time_Stamp-60].tolist()
I cannot use a for loop due to the large number of rows.
Edit 20.01.2020:
Based on comment below, I'm adding a sample dataset, and instead of returning the index I want to return the column Value:
In [2]: df
Out[2]:
Time_Stamp Value
0 1 2.4
1 2 3.1
2 4 6.3
3 6 7.2
4 6 6.1
5 9 6.0
So with the precious help of ALollz, I managed to achieve what I wanted to do in the end. Here's my code:
#gap to look back, in seconds
Time_gap = 60
#make copy of dataframe
df2 = df[['Time_Stamp','Value']].copy()
#add Time_gap to Time_Stamp in df2
df2['Time_Stamp'] = df2.Time_Stamp + Time_gap
#sort df2 on Time_Stamp
df2.sort_values(by = 'Time_Stamp', ascending=True, inplace = True)
df2 = df2.reset_index(drop=True)
df3 = pd.merge_asof(df, df2, on='Time_Stamp', direction='forward')
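Run against the sample frame from the edit, the same merge_asof recipe looks like this; note the gap is shrunk to 2 seconds here (an illustrative value, not the question's 60) so the tiny dataset actually has matches:

```python
import pandas as pd

df = pd.DataFrame({"Time_Stamp": [1, 2, 4, 6, 6, 9],
                   "Value": [2.4, 3.1, 6.3, 7.2, 6.1, 6.0]})

Time_gap = 2  # seconds; illustrative only, the question uses 60

df2 = df[["Time_Stamp", "Value"]].copy()
df2["Time_Stamp"] = df2.Time_Stamp + Time_gap
df2 = df2.sort_values(by="Time_Stamp").reset_index(drop=True)

# for each row, Value_y comes from the earliest row whose timestamp is at least
# (current Time_Stamp - Time_gap), i.e. the closest match looking forward
df3 = pd.merge_asof(df, df2, on="Time_Stamp", direction="forward")
```

Both frames must be sorted on the key for merge_asof, which is why df2 is re-sorted after the shift.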
Initial Note
I already got this running, but it takes a very long time to execute. My DataFrame is around 500MB large. I am hoping to hear some feedback on how to execute this as quickly as possible.
Problem Statement
I want to normalize the DataFrame columns by the mean of the column's values during each month. An added complexity is that I have a column named group which denotes a different sensor in which the parameter (column) was measured. Therefore, the analysis needs to iterate around group and each month.
DF example
X Y Z group
2019-02-01 09:30:07 1 2 1 'grp1'
2019-02-01 09:30:23 2 4 3 'grp2'
2019-02-01 09:30:38 3 6 5 'grp1'
...
Code (Functional, but slow)
This is the code that I used. Coding annotations provide descriptions of most lines. I recognize that the three for loops are causing the runtime issue, but I do not have the foresight to see a way around them. Does anyone know a faster approach?
# Get mean monthly values for each group
mean_per_month_unit = process_df.groupby('group').resample('M').mean()
# Store the monthly dates created in last line into a list called month_dates
month_dates = mean_per_month_unit.index.get_level_values(1)
# Place date on multiIndex columns. future note: use df[DATE, COL_NAME][UNIT] to access mean value
mean_per_month_unit = mean_per_month_unit.unstack().swaplevel(0,1,1).sort_index(axis=1)
divide_df = pd.DataFrame().reindex_like(df)
process_cols.remove('group')
for grp in group_list:
    print(grp)
    # Iterate through month
    for mnth in month_dates:
        # Make mask where month and group
        mask = (df.index.month == mnth.month) & (df['group'] == grp)
        for col in process_cols:
            # Set values of divide_df
            divide_df.iloc[mask.tolist(), divide_df.columns.get_loc(col)] = mean_per_month_unit[mnth, col][grp]
# Divide process_df with divide_df
final_df = process_df / divide_df.values
EDIT: Example data
Here is the data in CSV format.
EDIT2: Current code (according to current answer)
def normalize_df(df):
    df['month'] = df.index.month
    print(df['month'])
    df['year'] = df.index.year
    print(df['year'])

def find_norm(x, df_col_list):  # x is a row in dataframe, col_list is the list of columns to normalize
    agg = df.groupby(by=['group', 'month', 'year'], as_index=True).mean()
    print("###################", x.name, x['month'])
    for column in df_col_list:  # iterate over col list, find mean from aggregations, and divide the value by the mean
        print(column)
        mean_col = agg.loc[(x['group'], x['month'], x['year']), column]
        print(mean_col)
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col  # norm
    return x

normalize_cols = df.columns.tolist()
normalize_cols.remove('group')
#normalize_cols.remove('mode')
df2 = df.apply(find_norm, df_col_list = normalize_cols, axis=1)
The code runs perfectly for one iteration and then it fails with the error:
KeyError: ('month', 'occurred at index 2019-02-01 11:30:17')
As I said, it runs correctly once. However, it iterates over the same row again and then fails. I see according to df.apply() documentation that the first row always runs twice. I'm just not sure why this fails on the second time through.
Assuming the requirement is to normalize each column by its per-group, per-month mean, here is another approach:
Create new columns - month and year from the index. df.index.month can be used for this provided the index is of type DatetimeIndex
type(df.index) # df is the original dataframe
#pandas.core.indexes.datetimes.DatetimeIndex
df['month'] = df.index.month
df['year'] = df.index.year # added year assuming the grouping occurs per grp per month per year. No need to add this column if year is not to be considered.
Now, group over (grp, month, year) and aggregate to find mean of every column. (Added year assuming the grouping occurs per grp per month per year. No need to add this column if year is not to be considered.)
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()
Use a function to calculate the normalized values and use apply() over the original dataframe
def find_norm(x, df_col_list):  # x is a row in dataframe, col_list is the list of columns to normalize
    for column in df_col_list:  # iterate over col list, find mean from aggregations, and divide the value by the mean.
        mean_col = agg.loc[(x['grp'], x['month'], x['year']), column]
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col  # norm
    return x
df2 = df.apply(find_norm, df_col_list = ['A','B','C'], axis=1)
#df2 will now have 3 additional columns - normA, normB, normC
df2:
A B C grp month year normA normB normC
2019-02-01 09:30:07 1 2 3 1 2 2019 0.666667 0.8 1.5
2019-03-02 09:30:07 2 3 4 1 3 2019 1.000000 1.0 1.0
2019-02-01 09:40:07 2 3 1 2 2 2019 1.000000 1.0 1.0
2019-02-01 09:38:07 2 3 1 1 2 2019 1.333333 1.2 0.5
Alternatively, for step 3, one can join the agg and df dataframes and find the norm.
Hope this helps!
Here is how the code would look:
# Step 1
df['month'] = df.index.month
df['year'] = df.index.year # added year assuming the grouping occurs
# Step 2
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()
# Step 3
def find_norm(x, df_col_list):  # x is a row in dataframe, col_list is the list of columns to normalize
    for column in df_col_list:  # iterate over col list, find mean from aggregations, and divide the value by the mean.
        mean_col = agg.loc[(x['grp'], x['month'], x['year']), column]
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col  # norm
    return x
df2 = df.apply(find_norm, df_col_list = ['A','B','C'], axis=1)
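As a faster alternative to the row-wise apply, the same per-group, per-month normalization can be done with groupby(...).transform('mean'), which is vectorized and avoids the fragile row function entirely. A sketch on data matching the answer's df2 output (column and group names follow that example):

```python
import pandas as pd

df = pd.DataFrame(
    {"A": [1, 2, 2, 2], "B": [2, 3, 3, 3], "C": [3, 4, 1, 1], "grp": [1, 1, 2, 1]},
    index=pd.to_datetime(["2019-02-01 09:30:07", "2019-03-02 09:30:07",
                          "2019-02-01 09:40:07", "2019-02-01 09:38:07"]),
)

# group keys: sensor, calendar month, calendar year
keys = [df["grp"], df.index.month, df.index.year]
for col in ["A", "B", "C"]:
    # divide each value by the mean of its (grp, month, year) group
    df["norm" + col] = df[col] / df.groupby(keys)[col].transform("mean")
```

transform broadcasts the group mean back to every row, so the division is a single vectorized operation per column rather than one lookup per row.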