Setting a specific bin length for a Python list - python-3.x

I have a straightforward question, but I'm facing issues with the conversion.
I have a pandas dataframe column which I converted to a list; it has both positive and negative values:
bin_length = 5
list = [-200, -112, -115, 0, 50, 120, 250]
I need to group these numbers into bins of length 5.
For example:
-100 to -95 should have a value of -100
-95 to -90 should have a value of -95
Similarly for positive values:
0 to 5 should be 5
5 to 10 should be 10
What I have tried so far:
df = pd.DataFrame(dataframe['rd2'].values.tolist(), columns = ['values'])
bins = np.arange(0, df['values'].max() + 5, 5)
df['bins'] = pd.cut(df['values'], bins, include_lowest = True)
But this doesn't account for negative values, and then I have problems converting the pandas intervals into separate columns / back into a list.
Any help would be amazing.

Setting up the correct lower limit with np.arange:
bins = np.arange(df["values"].min(), df['values'].max() + 5, 5)
df['bins'] = pd.cut(df['values'], bins, include_lowest = True)
print (df)
values bins
0 -200 (-200.001, -195.0]
1 -112 (-115.0, -110.0]
2 -115 (-120.0, -115.0]
3 0 (-5.0, 0.0]
4 50 (45.0, 50.0]
5 120 (115.0, 120.0]
6 250 (245.0, 250.0]
Convert the intervals back to a list:
s = pd.IntervalIndex(df["bins"])
print ([[x,y] for x,y in zip(s.left, s.right)])
[[-200.001, -195.0], [-115.0, -110.0], [-120.0, -115.0], [-5.0, 0.0], [45.0, 50.0], [115.0, 120.0], [245.0, 250.0]]
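If a single representative value per row is needed instead of an interval (the question maps 50 to a bin edge, for example), one option is to take an edge of each interval. This is a minimal sketch reusing s from above; using the right edge (rather than .left) is just an illustrative choice:
# take the upper edge of each row's interval as its representative value
df["bin_edge"] = s.right
print(df["bin_edge"].tolist())
# [-195.0, -110.0, -115.0, 0.0, 50.0, 120.0, 250.0]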

Related

Create a new column based on a general datetime format DD/MM/YYYY, but encountering the error "ValueError: bins must increase monotonically."

I have a pandas dataframe column with a general date format that looks like the below. My date format is DD/MM/YYYY.
dates
0 11/04/2017
1 17/04/2017
2 23/04/2017
3 02/04/2017
4 30/03/2017
I would like to create a new column based on this dates column, e.g.
Expected new column
phase
0 3
1 4
2 5
3 2
4 1
I tried to use the method suggested in this post
Create new column based on date column Pandas
But I am encountering an error
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Input In [46], in <cell line: 10>()
1 cutoff = [
2 '24/04/2017',
3 '18/04/2017',
(...)
6 '31/03/2017',
7 ]
9 cutoff = pd.Series(cutoff).astype('datetime64')
---> 10 final_commit['phase'] = pd.cut(final_commit['dates'], cutoff, labels = [ 4, 3, 2, 1])
11 print(final_commit.sort_values('dates'))
File ~/Library/Python/3.8/lib/python/site-packages/pandas/core/reshape/tile.py:290, in cut(x, bins, right, labels, retbins, precision, include_lowest, duplicates, ordered)
288 # GH 26045: cast to float64 to avoid an overflow
289 if (np.diff(bins.astype("float64")) < 0).any():
--> 290 raise ValueError("bins must increase monotonically.")
292 fac, bins = _bins_to_cuts(
293 x,
294 bins,
(...)
301 ordered=ordered,
302 )
304 return _postprocess_for_cut(fac, bins, retbins, dtype, original)
ValueError: bins must increase monotonically.
My cutoffs for creating the new column are as below:
'24/04/2017' -> phase 5
'18/04/2017' -> phase 4
'12/04/2017' -> phase 3
'06/04/2017' -> phase 2
'31/03/2017' -> phase 1
Code I tried
cutoff = [
'24/04/2017',
'18/04/2017',
'12/04/2017',
'06/04/2017',
'31/03/2017',
]
cutoff = pd.Series(cutoff).astype('datetime64')
final_commit['phase'] = pd.cut(final_commit['dates'], cutoff, labels = [5, 4, 3, 2, 1])
print(final_commit.sort_values('dates'))
Any suggestion is appreciated. Thank you.
As the error suggests, you need to make sure the cutoff is monotonically increasing. You can pre-sort the values using sort_values:
cutoff = pd.to_datetime(cutoff, format='%d/%m/%Y').sort_values()
pd.cut(final_commit['dates'], cutoff, labels=[1,2,3,4])
Example:
final_commit = pd.DataFrame({
'dates': pd.to_datetime(['2017-04-15', '2017-04-03'])
})
pd.cut(final_commit['dates'], cutoff, labels=[1,2,3,4])
#0 3
#1 1
#Name: dates, dtype: category
#Categories (4, int64): [1 < 2 < 3 < 4]
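Putting it together with the DD/MM/YYYY strings from the original question, a minimal sketch (assuming the dates column still holds strings and that the label numbering can be adapted to whatever phase scheme you want) might look like:
import pandas as pd

final_commit = pd.DataFrame({'dates': ['11/04/2017', '17/04/2017', '23/04/2017', '02/04/2017']})
final_commit['dates'] = pd.to_datetime(final_commit['dates'], format='%d/%m/%Y')

cutoff = ['24/04/2017', '18/04/2017', '12/04/2017', '06/04/2017', '31/03/2017']
# sort the edges ascending so pd.cut sees monotonically increasing bins
cutoff = pd.to_datetime(cutoff, format='%d/%m/%Y').sort_values()

# five edges give four intervals, so four labels; dates outside the edges become NaN
final_commit['phase'] = pd.cut(final_commit['dates'], cutoff, labels=[1, 2, 3, 4])
print(final_commit.sort_values('dates'))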

In pandas, how to compute value counts on bins and sum values in one other column

I have Pandas dataframe like:
df =
col1 col2
23 75
25 78
22 120
I want to specify bins 0-100 and 100-200, divide col2 into those bins, compute its value counts, and sum col1 for the values that lie in those bins.
So:
df_output:
col2_range count col1_cum
0-100 2 48
100-200 1 22
Getting the col2_range and count is pretty simple:
import numpy as np
a = np.arange(0, 201, 100)  # bin edges 0, 100, 200
bins = a.tolist()
counts = df['col2'].value_counts(bins=bins, sort=False)
How do I get the summed col1 for each bin, though?
IIUC, try using pd.cut to create bins and groupby those bins:
g = pd.cut(df['col2'],
bins=[0, 100, 200, 300, 400],
labels = ['0-99', '100-199', '200-299', '300-399'])
df.groupby(g, observed=True)['col1'].agg(['count','sum']).reset_index()
Output:
col2 count sum
0 0-99 2 48
1 100-199 1 22
I think I misread the original post:
g = pd.cut(df['col2'],
bins=[0,100,200,300,400],
labels = ['0-99', '100-199', '200-299', '300-399'])
df.groupby(g, observed=True).agg(col1_count=('col1','count'),
col2_sum=('col2','sum'),
col1_sum=('col1','sum')).reset_index()
Output:
col2 col1_count col2_sum col1_sum
0 0-99 2 153 48
1 100-199 1 120 22
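For reference, a minimal self-contained run of the first variant on the data from the question (assuming only pandas is needed) would be:
import pandas as pd

df = pd.DataFrame({'col1': [23, 25, 22], 'col2': [75, 78, 120]})

# bin col2, then count rows and sum col1 per bin
g = pd.cut(df['col2'], bins=[0, 100, 200], labels=['0-100', '100-200'])
print(df.groupby(g, observed=True)['col1'].agg(['count', 'sum']).reset_index())
#       col2  count  sum
# 0    0-100      2   48
# 1  100-200      1   22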

Locate dataframe rows where values are outside bounds specified for each column

I have a dataframe with k columns and n rows, k ~= 10, n ~= 1000. I have a (2, k) array representing bounds on values for each column, e.g.:
# For 5 columns
bounds = ([0.1, 1, 0.1, 5, 10],
[10, 1000, 1, 1000, 50])
# Example df
a b c d e
0 5 3 0.3 17 12
1 12 50 0.5 2 31
2 9 982 0.2 321 21
3 1 3 1.2 92 48
# Expected output with bounds given above
a b c d e
0 5 3 0.3 17 12
2 9 982 0.2 321 21
Crucially, the bounds on each column are different.
I would like to identify and exclude all rows of the dataframe where any column value falls outside the bounds for that respective column, preferably using array operations rather than iterating over the dataframe. The best I can think of so far involves iterating over the columns (which isn't too bad but still seems less than ideal):
for i in range(len(df.columns)):
    col = df.columns[i]
    df = df[(bounds[0][i] < df[col]) & (df[col] < bounds[1][i])]
Is there a better way to do this? Or alternatively, to select only the rows where all column values are within the respective bounds?
One way using pandas.DataFrame.apply with pandas.Series.between:
bounds = dict(zip(df.columns, zip(*bounds)))
new_df = df[~df.apply(lambda x: ~x.between(*bounds[x.name])).any(axis=1)]
print(new_df)
Output:
a b c d e
0 5 3 0.3 17 12
2 9 982 0.2 321 21
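A fully vectorized alternative (a sketch that assumes the order of bounds matches df.columns and uses strict inequalities, like the original query attempt) broadcasts the bounds against the whole frame at once:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [5, 12, 9, 1], 'b': [3, 50, 982, 3],
                   'c': [0.3, 0.5, 0.2, 1.2], 'd': [17, 2, 321, 92],
                   'e': [12, 31, 21, 48]})
lower = np.array([0.1, 1, 0.1, 5, 10])
upper = np.array([10, 1000, 1, 1000, 50])

# compare every row against the per-column bounds in one shot
mask = (df.values > lower) & (df.values < upper)
print(df[mask.all(axis=1)])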

Binning with pd.cut beyond the range (replacing NaN with "<min_val" or ">max_val")

df= pd.DataFrame({'days': [0,31,45,35,19,70,80 ]})
df['range'] = pd.cut(df.days, [0,30,60])
df
The code is reproduced above; pd.cut is used to convert a numerical column to a categorical column, assigning categories according to the list passed, [0,30,60]. Here rows 0, 5 & 6 are categorized as NaN because their values fall outside [0,30,60]. What I want is for 0 to be categorized as <0, and for 70 and 80 to be categorized as >60. If possible, I would also like dynamic text labels A, B, C, D, E depending on the number of categories created.
For the first part, adding -np.inf and np.inf to the bins will ensure that everything gets a bin:
In [5]: df= pd.DataFrame({'days': [0,31,45,35,19,70,80]})
...: df['range'] = pd.cut(df.days, [-np.inf, 0, 30, 60, np.inf])
...: df
...:
Out[5]:
days range
0 0 (-inf, 0.0]
1 31 (30.0, 60.0]
2 45 (30.0, 60.0]
3 35 (30.0, 60.0]
4 19 (0.0, 30.0]
5 70 (60.0, inf]
6 80 (60.0, inf]
For the second, you can use .cat.codes to get the bin index and do some tweaking from there:
In [8]: df['range'].cat.codes.apply(lambda x: chr(x + ord('A')))
Out[8]:
0 A
1 C
2 C
3 C
4 B
5 D
6 D
dtype: object
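If you would rather have the literal "<min" / ">max" style labels from the question instead of letters, a small sketch (the label text here is just illustrative) passes them straight to pd.cut:
import numpy as np
import pandas as pd

df = pd.DataFrame({'days': [0, 31, 45, 35, 19, 70, 80]})
# one label per interval between consecutive edges; 0 lands in (-inf, 0], hence "<=0"
df['range'] = pd.cut(df.days, [-np.inf, 0, 30, 60, np.inf],
                     labels=['<=0', '0-30', '30-60', '>60'])
print(df)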

How to make each bin of the data a column of a dataframe

I have a dataframe with a numeric column. I want to divide the column into bins and have the count of each bin as a column of a dataframe, for example how many points fall in the bin starting at 0, and add this to the dataframe.
I used this code for binning, but I am not sure how to get the count columns into a dataframe.
df=pd.DataFrame({'max':[0.2,0.3,1,1.5,2.5,0.2]})
print(df)
max
0 0.2
1 0.3
2 1.0
3 1.5
4 2.5
5 0.2
bins = [0, 0.5, 1, 1.5, 2, 2.5]
x=pd.cut(df['max'], bins)
desired output
print(df)
0_0.5_count 0.5_1_count
0 3 1
First add the labels parameter to cut, then count with Series.value_counts, and for a DataFrame use Series.to_frame with a transpose via DataFrame.T:
bins = [0, 0.5, 1, 1.5, 2, 2.5]
labels = ['{}_{}_count'.format(i, j) for i, j in zip(bins[:-1], bins[1:])]
x=pd.cut(df['max'], bins, labels=labels).value_counts().sort_index().to_frame(0).T
print (x)
0_0.5_count 0.5_1_count 1_1.5_count 1.5_2_count 2_2.5_count
0 3 1 1 0 1
Details:
print (pd.cut(df['max'], bins, labels=labels))
0 0_0.5_count
1 0_0.5_count
2 0.5_1_count
3 1_1.5_count
4 2_2.5_count
5 0_0.5_count
Name: max, dtype: category
Categories (5, object): [0_0.5_count < 0.5_1_count < 1_1.5_count < 1.5_2_count < 2_2.5_count]
print (pd.cut(df['max'], bins, labels=labels).value_counts())
0_0.5_count 3
2_2.5_count 1
1_1.5_count 1
0.5_1_count 1
1.5_2_count 0
Name: max, dtype: int64
Alternative solution with GroupBy.size:
bins = [0, 0.5, 1, 1.5, 2, 2.5]
labels = ['{}_{}_count'.format(i, j) for i, j in zip(bins[:-1], bins[1:])]
x= df.groupby(pd.cut(df['max'], bins, labels=labels)).size().rename_axis(None).to_frame().T
print (x)
0_0.5_count 0.5_1_count 1_1.5_count 1.5_2_count 2_2.5_count
0 3 1 1 0 1
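Yet another short variant (a sketch, not part of the answer above) gets the same one-row frame via pd.get_dummies, which creates one indicator column per category and sums them into counts:
bins = [0, 0.5, 1, 1.5, 2, 2.5]
labels = ['{}_{}_count'.format(i, j) for i, j in zip(bins[:-1], bins[1:])]
x = pd.get_dummies(pd.cut(df['max'], bins, labels=labels)).sum().to_frame().T
print(x)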
