Create a new column based on a date column in DD/MM/YYYY format, but encountering "ValueError: bins must increase monotonically." - python-3.x

I have a pandas data frame column with a general date format that looks like the below. My date format is in DD/MM/YYYY.
dates
0 11/04/2017
1 17/04/2017
2 23/04/2017
3 02/04/2017
4 30/03/2017
I would like to create a new column based on this dates column, e.g.
Expected new column
phase
0 3
1 4
2 5
3 2
4 1
I tried to use the method suggested in this post
Create new column based on date column Pandas
But I am encountering an error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [46], in <cell line: 10>()
1 cutoff = [
2 '24/04/2017',
3 '18/04/2017',
(...)
6 '31/03/2017',
7 ]
9 cutoff = pd.Series(cutoff).astype('datetime64')
---> 10 final_commit['phase'] = pd.cut(final_commit['dates'], cutoff, labels = [ 4, 3, 2, 1])
11 print(final_commit.sort_values('dates'))
File ~/Library/Python/3.8/lib/python/site-packages/pandas/core/reshape/tile.py:290, in cut(x, bins, right, labels, retbins, precision, include_lowest, duplicates, ordered)
288 # GH 26045: cast to float64 to avoid an overflow
289 if (np.diff(bins.astype("float64")) < 0).any():
--> 290 raise ValueError("bins must increase monotonically.")
292 fac, bins = _bins_to_cuts(
293 x,
294 bins,
(...)
301 ordered=ordered,
302 )
304 return _postprocess_for_cut(fac, bins, retbins, dtype, original)
ValueError: bins must increase monotonically.
My cutoff for creating the new column is as below
'24/04/2017' -> phase 5
'18/04/2017' -> phase 4
'12/04/2017' -> phase 3
'06/04/2017' -> phase 2
'31/03/2017' -> phase 1
Code I tried
cutoff = [
'24/04/2017',
'18/04/2017',
'12/04/2017',
'06/04/2017',
'31/03/2017',
]
cutoff = pd.Series(cutoff).astype('datetime64')
final_commit['phase'] = pd.cut(final_commit['dates'], cutoff, labels = [5, 4, 3, 2, 1])
print(final_commit.sort_values('dates'))
Any suggestion is appreciated. Thank you.

As the error suggests, you need to make sure the cutoff is monotonically increasing. You can pre-sort the values using sort_values (note that five edges produce only four intervals, and with ascending edges the labels must ascend as well):
cutoff = pd.to_datetime(cutoff, format='%d/%m/%Y').sort_values()
pd.cut(final_commit['dates'], cutoff, labels=[1,2,3,4])
Example:
final_commit = pd.DataFrame({
'dates': pd.to_datetime(['2017-04-15', '2017-04-03'])
})
pd.cut(final_commit['dates'], cutoff, labels=[1,2,3,4])
#0 3
#1 1
#Name: dates, dtype: category
#Categories (4, int64): [1 < 2 < 3 < 4]
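For reference, a complete sketch against the question's data. Two assumptions worth flagging: the DD/MM/YYYY strings are parsed with dayfirst=True, and an early sentinel edge (an arbitrary date, 1900-01-01 here) is prepended so that dates on or before 31/03 still land in phase 1, since five edges alone give only four intervals:

```python
import pandas as pd

# Rebuild the question's frame; the dates are DD/MM/YYYY, so parse with dayfirst=True
final_commit = pd.DataFrame({'dates': pd.to_datetime(
    ['11/04/2017', '17/04/2017', '23/04/2017', '02/04/2017', '30/03/2017'],
    dayfirst=True)})

# Sort the edges ascending; prepend an early sentinel edge so that dates on or
# before 31/03 still get a phase (five edges alone give only four intervals)
cutoff = pd.to_datetime(
    ['24/04/2017', '18/04/2017', '12/04/2017', '06/04/2017', '31/03/2017'],
    format='%d/%m/%Y').sort_values()
edges = cutoff.insert(0, pd.Timestamp('1900-01-01'))

# With ascending edges, ascending labels line up with the question's phases
final_commit['phase'] = pd.cut(final_commit['dates'], edges, labels=[1, 2, 3, 4, 5])
print(final_commit['phase'].tolist())  # [3, 4, 5, 2, 1]
```

This reproduces the expected phase column from the question (3, 4, 5, 2, 1) in the original row order.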

Related

Locate dataframe rows where values are outside bounds specified for each column

I have a dataframe with k columns and n rows, k ~= 10, n ~= 1000. I have a (2, k) array representing bounds on values for each column, e.g.:
# For 5 columns
bounds = ([0.1, 1, 0.1, 5, 10],
[10, 1000, 1, 1000, 50])
# Example df
a b c d e
0 5 3 0.3 17 12
1 12 50 0.5 2 31
2 9 982 0.2 321 21
3 1 3 1.2 92 48
# Expected output with bounds given above
a b c d e
0 5 3 0.3 17 12
2 9 982 0.2 321 21
Crucially, the bounds on each column are different.
I would like to identify and exclude all rows of the dataframe where any column value falls outside the bounds for that respective column, preferably using array operations rather than iterating over the dataframe. The best I can think of so far involves iterating over the columns (which isn't too bad but still seems less than ideal):
for i in range(len(df.columns)):
    col = df.columns[i]
    df = df[(bounds[0][i] < df[col]) & (df[col] < bounds[1][i])]
Is there a better way to do this? Or alternatively, to select only the rows where all column values are within the respective bounds?
One way using pandas.DataFrame.apply with pandas.Series.between:
bounds = dict(zip(df.columns, zip(*bounds)))
new_df = df[df.apply(lambda x: x.between(*bounds[x.name])).all(axis=1)]
print(new_df)
Output:
a b c d e
0 5 3 0.3 17 12
2 9 982 0.2 321 21
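If you prefer a single vectorised filter over apply, the same result can be had with NumPy broadcasting. A sketch using the question's sample data; the strict inequalities follow the question's query, whereas Series.between is inclusive by default:

```python
import pandas as pd

df = pd.DataFrame({'a': [5, 12, 9, 1],
                   'b': [3, 50, 982, 3],
                   'c': [0.3, 0.5, 0.2, 1.2],
                   'd': [17, 2, 321, 92],
                   'e': [12, 31, 21, 48]})
lower = [0.1, 1, 0.1, 5, 10]
upper = [10, 1000, 1, 1000, 50]

# Broadcast the (k,) bounds against the (n, k) value matrix in one shot
mask = (df.to_numpy() > lower) & (df.to_numpy() < upper)
new_df = df[mask.all(axis=1)]
print(new_df.index.tolist())  # [0, 2]
```

This keeps rows 0 and 2, matching the expected output above.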

Alternative to looping? Vectorisation, cython?

I have a pandas dataframe something like the below:
Total Yr_to_Use First_Year_Del Del_rate 2019 2020 2021 2022 2023 etc
ref1 100 2020 5 10 0 0 0 0 0
ref2 20 2028 2 5 0 0 0 0 0
ref3 30 2021 7 16 0 0 0 0 0
ref4 40 2025 9 18 0 0 0 0 0
ref5 10 2022 4 30 0 0 0 0 0
The 'Total' column shows how many of a product needs to be delivered.
'First_yr_Del' tells you how many will be delivered in the first year. After this the delivery rate reverts to 'Del_rate' - a flat rate that can be applied each year until all products are delivered.
The 'Year to Use' column tells you the first year column to begin delivery from.
EXAMPLE: Ref1 has 100 to deliver. It will start delivering in 2020 and will deliver 5 in the first year, and 10 each year after that until all 100 are accounted for.
Any ideas how to go about this?
I thought I might use something like the below to reference which columns to use in turn, but I'm not even sure if that's helpful or not, as it will depend on the solution (in the proper version, base_date.year is defined as the first column in the table - 2019):
start_index_for_slice = df.columns.get_loc(base_date.year)
end_index_for_slice = start_index_for_slice+no_yrs_to_project
df.columns[start_index_for_slice:end_index_for_slice]
I'm pretty new to Python and am not sure if I'm getting ahead of myself a bit...
The way I would think to go about it would be to use a for loop, or something using iterrows, but other posts seem to say this is a bad idea and that I should be using vectorisation, Cython or lambdas. Of those three I've only managed a very simple lambda so far. The others are a bit of a mystery to me, since the solution seems to require doing one action after another until complete.
Any and all help appreciated!
Thanks
EDIT: Example expected output below (I edited some of the dates so you can better see the logic):
Total Yr_to_Use First_Year_Del Del_rate 2019 2020 2021 2022 2023etc
ref1 100 2020 5 10 0 5 10 10 10
ref2 20 2021 2 5 0 0 2 5 5
ref3 30 2021 7 16 0 0 7 16 7
ref4 40 2019 9 18 9 18 13 0 0
ref5 10 2020 4 30 0 4 6 0 0
Here's another option, which separates the calculation of the rates/years matrix and appends it to the input df later on. Still does looping in the script itself (not "externalized" to some numpy / pandas function). Should be fine for 5k rows I'd guesstimate.
import pandas as pd
import numpy as np
# create the initial df without years/rates
df = pd.DataFrame({'Total': [100, 20, 30, 40, 10],
'Yr_to_Use': [2020, 2021, 2021, 2019, 2020],
'First_Year_Del': [5, 2, 7, 9, 10],
'Del_rate': [10, 5, 16, 18, 30]})
# get number of rates + remainder
n, r = np.divmod((df['Total']-df['First_Year_Del']), df['Del_rate'])
# get the year of the last rate considering all rows
max_year = np.max(n + r.astype(bool) + df['Yr_to_Use'])
# get the offsets for the start of delivery, year zero is 2019
offset = df['Yr_to_Use'] - 2019
# subtracting the year zero lets you use this as an index...
# get a year index; this determines the columns that will be created
yrs = np.arange(2019, max_year+1)
# prepare an n*m array to hold the rates for all years, initialized with all zeros
out = np.zeros((df['Total'].shape[0], yrs.shape[0]))
# n: number of rows of the df, m: number of years where rates will have to be paid
# calculate the rates for each year and insert them into the output array
for i in range(df['Total'].shape[0]):
    # concatenate: year of the first rate, all yearly rates, a final rate if there was a remainder
    if r[i]:  # if the rest is not zero, append it as well
        rates = np.concatenate([[df['First_Year_Del'][i]], n[i]*[df['Del_rate'][i]], [r[i]]])
    else:  # rest is zero, skip it
        rates = np.concatenate([[df['First_Year_Del'][i]], n[i]*[df['Del_rate'][i]]])
    # insert the rates at the appropriate location of the output array:
    out[i, offset[i]:offset[i]+rates.shape[0]] = rates
# add the years/rates matrix to the original df
df = pd.concat([df, pd.DataFrame(out, columns=yrs.astype(str))], axis=1, sort=False)
You can accomplish this using two user-defined function and apply method
import pandas as pd
import numpy as np
df = pd.DataFrame(data={'id': ['ref1','ref2','ref3','ref4','ref5'],
'Total': [100, 20, 30, 40, 10],
'Yr_to_Use': [2020, 2028, 2021, 2025, 2022],
'First_Year_Del': [5,2,7,9,4],
'Del_rate':[10,5,16,18,30]})
def f(r):
    '''
    Computes values per year and the respective years
    '''
    n = (r['Total'] - r['First_Year_Del']) // r['Del_rate']
    leftover = (r['Total'] - r['First_Year_Del']) % r['Del_rate']
    # append the leftover only when it is non-zero, to avoid a trailing zero year
    r['values'] = [r['First_Year_Del']] + [r['Del_rate']] * n + ([leftover] if leftover else [])
    r['years'] = np.arange(r['Yr_to_Use'], r['Yr_to_Use'] + len(r['values']))
    return r
df = df.apply(f, axis=1)
def get_year_range(r):
    '''
    Computes min and max year for each row
    '''
    r['y_min'] = min(r['years'])
    r['y_max'] = max(r['years'])
    return r
df = df.apply(get_year_range, axis=1)
y_min = df['y_min'].min()
y_max = df['y_max'].max()
# Initialize each year's column to zero
for year in range(y_min, y_max+1):
    df[year] = 0
def expand(r):
    '''
    Update value for each year
    '''
    for v, y in zip(r['values'], r['years']):
        r[y] = v
    return r
# Apply and drop temporary columns
df = df.apply(expand, axis=1).drop(['values', 'years', 'y_min', 'y_max'], axis=1)
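For larger frames, the whole schedule can also be built without a Python-level loop. The idea: the cumulative amount delivered by year k after the start is min(Total, First_Year_Del + Del_rate*k); clip it to zero before the start year, then difference along the year axis, and the final-year remainder falls out automatically. A sketch using the edited example's data (the 2035 upper bound on the year axis is an assumption, wide enough for this data):

```python
import numpy as np
import pandas as pd

# Data from the edited example in the question
df = pd.DataFrame({'Total': [100, 20, 30, 40, 10],
                   'Yr_to_Use': [2020, 2021, 2021, 2019, 2020],
                   'First_Year_Del': [5, 2, 7, 9, 4],
                   'Del_rate': [10, 5, 16, 18, 30]})

total = df['Total'].to_numpy()[:, None]
first = df['First_Year_Del'].to_numpy()[:, None]
rate = df['Del_rate'].to_numpy()[:, None]
start = df['Yr_to_Use'].to_numpy()[:, None]

years = np.arange(2019, 2035)            # assumed wide enough for this data
k = years[None, :] - start               # years elapsed since first delivery
# cumulative amount delivered by the end of each year, zero before the start
cum = np.where(k >= 0, np.minimum(total, first + rate * k), 0)
out = np.diff(cum, axis=1, prepend=0)    # per-year deliveries
sched = pd.concat([df, pd.DataFrame(out, columns=years.astype(str))], axis=1)
print(sched[['2019', '2020', '2021', '2022', '2023']])
```

Each row of out sums to Total, and the first delivery year carries First_Year_Del, matching the expected output in the question's edit.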

Binning with pd.cut beyond range (replacing NaN with "<min_val" or ">max_val")

df= pd.DataFrame({'days': [0,31,45,35,19,70,80 ]})
df['range'] = pd.cut(df.days, [0,30,60])
df
The code above uses pd.cut to convert a numerical column into a categorical one, with categories taken from the bin list [0, 30, 60]. Rows 0, 5 and 6 are categorized as NaN because their values fall outside [0, 30, 60]. What I want is for 0 to be categorized as "<0", and for 70 and 80 to be categorized as ">60". If possible, I would also like dynamic text labels A, B, C, D, E depending on the number of categories created.
For the first part, adding -np.inf and np.inf to the bins will ensure that everything gets a bin:
In [5]: df= pd.DataFrame({'days': [0,31,45,35,19,70,80]})
...: df['range'] = pd.cut(df.days, [-np.inf, 0, 30, 60, np.inf])
...: df
...:
Out[5]:
days range
0 0 (-inf, 0.0]
1 31 (30.0, 60.0]
2 45 (30.0, 60.0]
3 35 (30.0, 60.0]
4 19 (0.0, 30.0]
5 70 (60.0, inf]
6 80 (60.0, inf]
For the second, you can use .cat.codes to get the bin index and do some tweaking from there:
In [8]: df['range'].cat.codes.apply(lambda x: chr(x + ord('A')))
Out[8]:
0 A
1 C
2 C
3 C
4 B
5 D
6 D
dtype: object
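You can also get the lettered categories in one step by generating the labels from the bin edges, which keeps the labels in sync if the number of bins changes (a sketch; passing labels this way replaces the interval display):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'days': [0, 31, 45, 35, 19, 70, 80]})
bins = [-np.inf, 0, 30, 60, np.inf]
# one label per interval, i.e. len(bins) - 1 of them: A, B, C, ...
labels = [chr(ord('A') + i) for i in range(len(bins) - 1)]
df['range'] = pd.cut(df.days, bins, labels=labels)
print(df['range'].tolist())  # ['A', 'C', 'C', 'C', 'B', 'D', 'D']
```

This matches the two-step cat.codes result above.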

How do I perform inter-row operations within a pandas.dataframe

How do I write the nested for loop to access every other row with respect to a row within a pandas.dataframe?
I am trying to perform some operations between rows in a pandas.dataframe
The operation for my example code is calculating Euclidean distances between each row with each other row.
The results are then saved into a list of the form
[(row_reference, name, dist)].
I understand how to access each row in a pandas.dataframe using df.iterrows() but I'm not sure how to access every other row with respect to the current row in order to perform the inter-row operation.
import pandas as pd
import numpy
import math
df = pd.DataFrame([{'name': "Bill", 'c1': 3, 'c2': 8}, {'name': "James", 'c1': 4, 'c2': 12},
{'name': "John", 'c1': 12, 'c2': 26}])
# Euclidean distance function where x1=c1_row1, x2=c1_row2, y1=c2_row1, y2=c2_row2
def edist(x1, x2, y1, y2):
    dist = math.sqrt(math.pow((x1 - x2), 2) + math.pow((y1 - y2), 2))
    return dist
# Calculate Euclidean distance for one row (e.g. Bill) against each other row
# (e.g. "James" and "John"). Save results to a list (N_name, dist).
all_results = []
for index, row in df.iterrows():
    results = []
    # secondary loop to look for OTHER rows with respect to the current row
    # results.append([row2['name'], edist()])
    all_results.append([index, results])
I hope to perform some operation edist() on all rows with respect to the current row/index.
I expect the loop to do the following:
In[1]:
result = []
result.append(['James',edist(3,4,8,12)])
result.append(['John',edist(3,12,8,26)])
results_all=[]
results_all.append([0,result])
result2 = []
result2.append(['John',edist(4,12,12,26)])
result2.append(['Bill',edist(4,3,12,8)])
results_all.append([1,result2])
result3 = []
result3.append(['Bill',edist(12,3,26,8)])
result3.append(['James', edist(12,4,26,12)])
results_all.append([2,result3])
results_all
With the following expected resulting output:
OUT[1]:
[[0, [['James', 4.123105625617661], ['John', 20.12461179749811]]],
[1, [['John', 16.1245154965971], ['Bill', 4.123105625617661]]],
[2, [['Bill', 20.12461179749811], ['James', 16.1245154965971]]]]
If your data is not too long, you can check out scipy's distance_matrix:
from scipy.spatial import distance_matrix
all_results = pd.DataFrame(distance_matrix(df[['c1','c2']], df[['c1','c2']]),
                           index=df['name'],
                           columns=df['name'])
Output:
name Bill James John
name
Bill 0.000000 4.123106 20.124612
James 4.123106 0.000000 16.124515
John 20.124612 16.124515 0.000000
Consider shift and avoid any rowwise looping. Because this is straightforward arithmetic, you can run the expression directly on columns, with NumPy handling the vectorized calculation.
import numpy as np
df = (df.assign(c1_shift = lambda x: x['c1'].shift(1),
c2_shift = lambda x: x['c2'].shift(1))
)
df['dist'] = np.sqrt(np.power(df['c1'] - df['c1_shift'], 2) +
np.power(df['c2'] - df['c2_shift'], 2))
print(df)
# name c1 c2 c1_shift c2_shift dist
# 0 Bill 3 8 NaN NaN NaN
# 1 James 4 12 3.0 8.0 4.123106
# 2 John 12 26 4.0 12.0 16.124515
Should you want every pairwise row combination, consider a cross join of the frame on itself, then query out the reverse duplicates:
df = (pd.merge(df.assign(key=1), df.assign(key=1), on="key")
.query("name_x < name_y")
.drop(columns=['key'])
)
df['dist'] = np.sqrt(np.power(df['c1_x'] - df['c1_y'], 2) +
np.power(df['c2_x'] - df['c2_y'], 2))
print(df)
# name_x c1_x c2_x name_y c1_y c2_y dist
# 1 Bill 3 8 James 4 12 4.123106
# 2 Bill 3 8 John 12 26 20.124612
# 5 James 4 12 John 12 26 16.124515
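If you specifically want the nested-list structure from the question, here is a plain-Python sketch (stdlib math only; note the inner lists are ordered by row position, which differs slightly from the question's ordering for row 1):

```python
import math
import pandas as pd

df = pd.DataFrame([{'name': 'Bill', 'c1': 3, 'c2': 8},
                   {'name': 'James', 'c1': 4, 'c2': 12},
                   {'name': 'John', 'c1': 12, 'c2': 26}])

def edist(p, q):
    # Euclidean distance between two (x, y) points
    return math.hypot(p[0] - q[0], p[1] - q[1])

pts = list(zip(df['c1'], df['c2']))
# for each row i, pair every OTHER row's name with its distance to row i
results_all = [
    [i, [[df['name'][j], edist(pts[i], pts[j])]
         for j in range(len(pts)) if j != i]]
    for i in range(len(pts))
]
print(results_all)
```

The distances agree with the expected output in the question (e.g. Bill-James is 4.1231..., Bill-John is 20.1246...).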

Reorder columns in groups by number embedded in column name?

I have a very large dataframe with 1,000 columns. The first few columns occur only once, denoting a customer. The next few columns are representative of multiple encounters with the customer, with an underscore and the number encounter. Every additional encounter adds a new column, so there is NOT a fixed number of columns -- it'll grow with time.
Sample dataframe header structure excerpt:
id dob gender pro_1 pro_10 pro_11 pro_2 ... pro_9 pre_1 pre_10 ...
I'm trying to re-order the columns based on the number after the column name, so all _1 should be together, all _2 should be together, etc, like so:
id dob gender pro_1 pre_1 que_1 fre_1 gen_1 pro_2 pre_2 que_2 fre_2 ...
(Note that the re-order should order the numbers correctly; the current order treats them like strings, which orders 1, 10, 11, etc. rather than 1, 2, 3)
Is this possible to do in pandas, or should I be looking at something else? Any help would be greatly appreciated! Thank you!
EDIT:
Alternatively, is it also possible to re-arrange column names based on the string part AND number part of the column names? So the output would then look similar to the original, except the numbers would be considered so that the order is more intuitive:
id dob gender pro_1 pro_2 pro_3 ... pre_1 pre_2 pre_3 ...
EDIT 2.0:
Just wanted to thank everyone for helping! While only one of the responses worked, I really appreciate the effort and learned a lot about other approaches / ways to think about this.
Here is one way you can try:
# column names copied from your example
example_cols = 'id dob gender pro_1 pro_10 pro_11 pro_2 pro_9 pre_1 pre_10'.split()
# sample DF
df = pd.DataFrame([range(len(example_cols))], columns=example_cols)
df
# id dob gender pro_1 pro_10 pro_11 pro_2 pro_9 pre_1 pre_10
#0 0 1 2 3 4 5 6 7 8 9
# number of columns excluded from sorting
N = 3
# get a list of columns from the dataframe
cols = df.columns.tolist()
# split each column name into a tuple (column_name, prefix, number), sort on the 2nd and 3rd items of the tuple, then retrieve the first item.
# adjust to "key = lambda x: x[2]" to group cols by numbers only
cols_new = cols[:N] + [ a[0] for a in sorted([ (c, p, int(n)) for c in cols[N:] for p,n in [c.split('_')]], key = lambda x: (x[1], x[2])) ]
# get the new dataframe based on the cols_new
df_new = df[cols_new]
# id dob gender pre_1 pre_10 pro_1 pro_2 pro_9 pro_10 pro_11
#0 0 1 2 8 9 3 6 7 4 5
Luckily there is a one-liner in Python that can fix this:
df = df.reindex(sorted(df.columns), axis=1)
For example, let's say you had this dataframe:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name': [2, 4, 8, 0],
'ID': [2, 0, 0, 0],
'Prod3': [10, 2, 1, 8],
'Prod1': [2, 4, 8, 0],
'Prod_1': [2, 4, 8, 0],
'Pre7': [2, 0, 0, 0],
'Pre2': [10, 2, 1, 8],
'Pre_2': [10, 2, 1, 8],
'Pre_9': [10, 2, 1, 8]}
)
print(df)
Output:
Name ID Prod3 Prod1 Prod_1 Pre7 Pre2 Pre_2 Pre_9
0 2 2 10 2 2 2 10 10 10
1 4 0 2 4 4 0 2 2 2
2 8 0 1 8 8 0 1 1 1
3 0 0 8 0 0 0 8 8 8
Then used
df = df.reindex(sorted(df.columns), axis=1)
Then the dataframe will then look like:
ID Name Pre2 Pre7 Pre_2 Pre_9 Prod1 Prod3 Prod_1
0 2 2 10 2 10 10 2 10 2
1 0 4 2 0 2 2 4 2 4
2 0 8 1 0 1 1 8 1 8
3 0 0 8 0 8 8 0 8 0
As you can see, within each prefix the columns without an underscore come first, followed by the underscore variants. However, this is a plain string sort, so column names earlier in the alphabet come first and the numbers are ordered lexicographically (1, 10, 2, ...) rather than numerically.
You need to split your column names on '_' and convert the number part to int:
c = ['A_1','A_10','A_2','A_3','B_1','B_10','B_2','B_3']
df = pd.DataFrame(np.random.randint(0,100,(2,8)), columns = c)
df.reindex(sorted(df.columns, key = lambda x: int(x.split('_')[1])), axis=1)
Output:
A_1 B_1 A_2 B_2 A_3 B_3 A_10 B_10
0 68 11 59 69 37 68 76 17
1 19 37 52 54 23 93 85 3
If you also want to sort by the string part, you need human (natural) sorting:
import re
def atoi(text):
    return int(text) if text.isdigit() else text

def natural_keys(text):
    '''
    alist.sort(key=natural_keys) sorts in human order
    http://nedbatchelder.com/blog/200712/human_sorting.html
    (See Toothy's implementation in the comments)
    '''
    return [atoi(c) for c in re.split(r'(\d+)', text)]
df.reindex(sorted(df.columns, key = lambda x:natural_keys(x)), axis=1)
Output:
A_1 A_2 A_3 A_10 B_1 B_2 B_3 B_10
0 68 59 37 76 11 69 68 17
1 19 52 23 85 37 54 93 3
Try this.
To re-order the columns based on the number after the column name
cols_fixed = df.columns[:3] # change index no based on your df
cols_variable = df.columns[3:] # change index no based on your df
cols_variable = sorted(cols_variable, key=lambda x : int(x.split('_')[1])) # split based on the number after '_'
cols_new = list(cols_fixed) + cols_variable # cols_fixed is an Index, so convert it to a list before concatenating
new_df = pd.DataFrame(df[cols_new])
To re-arrange column names based on the string part AND number part of the column names
cols_fixed = df.columns[:3] # change index no based on your df
cols_variable = df.columns[3:] # change index no based on your df
cols_variable = sorted(cols_variable)
cols_new = list(cols_fixed) + cols_variable # cols_fixed is an Index, so convert it to a list before concatenating
new_df = pd.DataFrame(df[cols_new])
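For the question's original ask, grouping all _1 columns together, then all _2, and so on, here is a small sketch with a two-part sort key (the column names are hypothetical, modelled on the question's header; number first, then prefix; within each number group the prefixes come out alphabetically, so map the prefix through a ranking dict if you need a custom order):

```python
# Hypothetical column names modelled on the question's header
cols = ['id', 'dob', 'gender', 'pro_1', 'pro_10', 'pro_11', 'pro_2',
        'pre_1', 'pre_10', 'que_1', 'que_2']
fixed, variable = cols[:3], cols[3:]

# Primary key: the number after '_'; secondary key: the prefix,
# so all *_1 columns group together, then all *_2, and so on
ordered = fixed + sorted(variable,
                         key=lambda c: (int(c.split('_')[1]), c.split('_')[0]))
print(ordered)
# ['id', 'dob', 'gender', 'pre_1', 'pro_1', 'que_1', 'pro_2', 'que_2',
#  'pre_10', 'pro_10', 'pro_11']
```

Applying it to a frame is then just df[ordered] (or df.reindex(columns=ordered)).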
