Converting float into range in Python - python-3.x

I am doing some data analysis with pandas and am struggling to find a nice, clean way of bucketing numbers into ranges. I have a data frame with a column of floats; however, I am not interested in the exact number, only a rough range. Ultimately I want to run a pivot and count how many values are in a certain range. Therefore, ideally, I would want to create a new column in my data frame that converts my column of floats into a range. Say df[number] = 3.5; then df[range] = 0-10.
The ranges should be 0-10, 10-20, ... >100
This may sound very arbitrary, but I've been struggling to find an answer on this. Many thanks

Pandas has a cut function for this
In [12]: s = pd.Series(np.random.uniform(0, 110, 100))
In [13]: s
Out[13]:
0 2.652461
1 46.536276
2 6.455352
3 6.075963
4 40.013378
...
95 39.775493
96 99.688307
97 41.064469
98 91.401904
99 60.580600
dtype: float64
In [14]: cuts = np.arange(0, 101, 10)
In [15]: pd.cut(s, cuts)
Out[15]:
0 (0, 10]
1 (40, 50]
2 (0, 10]
3 (0, 10]
4 (40, 50]
...
95 (30, 40]
96 (90, 100]
97 (40, 50]
98 (90, 100]
99 (60, 70]
dtype: category
Categories (10, object): [(0, 10] < (10, 20] < (20, 30] < (30, 40] ... (60, 70] < (70, 80] < (80, 90] < (90, 100]]
See the docs for controlling what happens with endpoints.
Note that in 0.18 (coming out soonish) the result will be an IntervalIndex instead of a Categorical, which will make things even nicer.
To get your counts per interval, use the value_counts method
In [17]: pd.cut(s, cuts).value_counts()
Out[17]:
(30, 40] 15
(40, 50] 13
(50, 60] 12
(60, 70] 10
(0, 10] 10
(90, 100] 8
(70, 80] 8
(80, 90] 7
(10, 20] 6
(20, 30] 3
dtype: int64
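To get labels like 0-10 and a catch-all >100 bin, as the question asks, pd.cut also accepts explicit labels; a sketch, where the label strings are my own choice:

```python
import numpy as np
import pandas as pd

# Bins 0-10, 10-20, ..., 90-100, plus an open-ended bin for everything above 100
cuts = list(np.arange(0, 101, 10)) + [np.inf]
labels = [f"{i}-{i + 10}" for i in range(0, 100, 10)] + [">100"]

s = pd.Series([3.5, 27.3, 105.0])
# Each value is mapped to the label of the interval it falls into
binned = pd.cut(s, bins=cuts, labels=labels)
```

The result is a categorical column, so it works directly as a pivot/groupby key.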

def get_range_for(x, start=0, stop=100, step=10):
    if x < start:
        return (float('-inf'), start)
    if x >= stop:
        return (stop, float('inf'))
    # Offset by start so the bins line up even when start isn't a multiple of step
    left = start + step * ((x - start) // step)
    right = left + step
    return (left, right)
Examples:
>>> get_range_for(3.5)
(0.0, 10.0)
>>> get_range_for(27.3)
(20.0, 30.0)
>>> get_range_for(75.6)
(70.0, 80.0)
Corner cases:
>>> get_range_for(-100)
(-inf, 0)
>>> get_range_for(1234)
(100, inf)
>>> get_range_for(0)
(0, 10)
>>> get_range_for(10)
(10, 20)
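To use this on a data frame, the helper can be applied to the float column; a self-contained sketch (the range column name and the %g formatting are my own choices):

```python
import pandas as pd

def get_range_for(x, start=0, stop=100, step=10):
    """Map a number to the (left, right) bounds of its bin."""
    if x < start:
        return (float('-inf'), start)
    if x >= stop:
        return (stop, float('inf'))
    left = start + step * ((x - start) // step)
    return (left, left + step)

df = pd.DataFrame({'number': [3.5, 27.3, 1234.0]})
# apply() calls the helper once per value; '%g' drops trailing zeros
df['range'] = df['number'].apply(lambda x: '%g-%g' % get_range_for(x))
```

Unlike pd.cut this is a per-row Python call, so it is slower on large frames, but it handles the open-ended >100 case without extra setup.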

Using the properties of integer division should help. Because you want ranges in units of 10, dividing a number by 10 (13.5 / 10 == 1.35), truncating to an integer (int(1.35) == 1), and then multiplying by 10 (1 * 10 == 10) converts the number to the low end of the range it falls into. Note that int() truncates toward zero, so for negative numbers use np.floor, which always rounds down. You could try something like:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'vals': [3.5, 4.2, 10.5, 19.5, 20.3, 24.2]})
>>> df
vals
0 3.5
1 4.2
2 10.5
3 19.5
4 20.3
5 24.2
>>> df['range_start'] = np.floor(df['vals'] / 10) * 10
>>> df
   vals  range_start
0   3.5          0.0
1   4.2          0.0
2  10.5         10.0
3  19.5         10.0
4  20.3         20.0
5  24.2         20.0
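If you also want a label like 0-10 rather than just the bin floor, the computed floors can be turned into strings; a small sketch, with the range column name being my own choice:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'vals': [3.5, 4.2, 10.5, 19.5, 20.3, 24.2]})
# Floor each value to the low end of its bin, as an int for clean labels
start = (np.floor(df['vals'] / 10) * 10).astype(int)
# Build "low-high" labels from the computed bin floors
df['range'] = start.astype(str) + '-' + (start + 10).astype(str)
```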

Related

Pandas Series value_counts with unexpected result [duplicate]

I have a dataframe of Age and Marital_Status. Age is an int and Marital_Status is a string with 8 unique values, e.g. Married, Single, etc.
After dividing into age groups of interval 10
df['age_group'] = pd.cut(df.Age, bins=[0,10,20,30,40,50,60,70,80,90])
I grouped the marital status by age group:
df.groupby(['age_group'])['Marital_Status'].value_counts()
With error as below return
ValueError: operands could not be broadcast together with shape (9,) (7,)
When I tried with a smaller number of age groups, where the shape is <= (7,), the groupby works fine, e.g.
df['age_group'] = pd.cut(df.Age, bins=[10,20,30,40,50,60,70,80])
I presume that the first shape (x,) must be smaller than the second shape (y,). Can anyone please explain why?
The problem (maybe a bug?) is in DataFrame.groupby: the default parameter is observed=False, but here you need to work only with the categories that actually occur:
observed bool, default False
This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
Sample:
df = pd.DataFrame({'Marital_Status': ['stat1'] * 50,
                   'Age': range(50)})
df['age_group'] = pd.cut(df.Age,bins=[10,20,30,40,50,60,70,80])
# print (df)
print (df.groupby(['age_group'], observed=True)['Marital_Status'].value_counts())
age_group Marital_Status
(10, 20] stat1 10
(20, 30] stat1 10
(30, 40] stat1 10
(40, 50] stat1 9
Name: Marital_Status, dtype: int64
An alternative solution with size() makes the difference easier to see:
print (df.groupby(['age_group', 'Marital_Status']).size())
age_group Marital_Status
(10, 20] stat1 10
(20, 30] stat1 10
(30, 40] stat1 10
(40, 50] stat1 9
(50, 60] stat1 0
(60, 70] stat1 0
(70, 80] stat1 0
dtype: int64
print (df.groupby(['age_group', 'Marital_Status'], observed=True).size())
age_group Marital_Status
(10, 20] stat1 10
(20, 30] stat1 10
(30, 40] stat1 10
(40, 50] stat1 9
dtype: int64
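Another option is to drop the unused categories before grouping with cat.remove_unused_categories, so the default observed=False no longer produces empty groups; a sketch on the same sample data:

```python
import pandas as pd

df = pd.DataFrame({'Marital_Status': ['stat1'] * 50, 'Age': range(50)})
df['age_group'] = pd.cut(df.Age, bins=[10, 20, 30, 40, 50, 60, 70, 80])
# Drop the empty (50, 60], (60, 70], (70, 80] categories before grouping
df['age_group'] = df['age_group'].cat.remove_unused_categories()
counts = df.groupby('age_group', observed=False)['Marital_Status'].value_counts()
```

After pruning, observed=False and observed=True give the same four groups.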

Setting specific bin length in python list

I have a straightforward question, but I'm facing issues with the conversion.
I have a pandas dataframe column that I converted to a list. It has both positive and negative values:
bin_length = 5
list = [-200, -112, -115, 0, 50, 120, 250]
I need to group these numbers into a bin of length 5.
For example:
-100 to -95 should have a value of -100
-95 to -90 should have a value of -95
Similarly for positive values:
0 to 5 should be 5
5 to 10 should be 10
What I have tried until now:
df = pd.DataFrame(dataframe['rd2'].values.tolist(), columns = ['values'])
bins = np.arange(0, df['values'].max() + 5, 5)
df['bins'] = pd.cut(df['values'], bins, include_lowest = True)
But this doesn't account for negative values, and then I have problems converting the pandas intervals back into separate columns or a list.
Any help would be amazing.
Setting up the correct lower limit with np.arange:
bins = np.arange(df["values"].min(), df['values'].max() + 5, 5)
df['bins'] = pd.cut(df['values'], bins, include_lowest = True)
print (df)
values bins
0 -200 (-200.001, -195.0]
1 -112 (-115.0, -110.0]
2 -115 (-120.0, -115.0]
3 0 (-5.0, 0.0]
4 50 (45.0, 50.0]
5 120 (115.0, 120.0]
6 250 (245.0, 250.0]
Convert the intervals back to a list:
s = pd.IntervalIndex(df["bins"])
print ([[x,y] for x,y in zip(s.left, s.right)])
[[-200.001, -195.0], [-115.0, -110.0], [-120.0, -115.0], [-5.0, 0.0], [45.0, 50.0], [115.0, 120.0], [245.0, 250.0]]
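If, as in the question, each value should map to a single number (the bin's upper edge, so 0 to 5 becomes 5), the right edges can be passed as labels; a sketch on the same list:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'values': [-200, -112, -115, 0, 50, 120, 250]})
bins = np.arange(df['values'].min(), df['values'].max() + 5, 5)
# Label each interval with its right edge, matching "0 to 5 -> 5"
df['bin_edge'] = pd.cut(df['values'], bins, labels=bins[1:], include_lowest=True)
```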

How exactly 'abs()' and 'argsort()' work together

#Creating DataFrame
df=pd.DataFrame({'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}); df
output:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
aValue = 43.0
df.loc[(df.CCC-aValue).abs().argsort()]
output:
AAA BBB CCC
1 5 20 50
0 4 10 100
2 6 30 -30
3 7 40 -50
The output is confusing; can you please explain in detail how the line below works?
df.loc[(df.CCC-aValue).abs().argsort()]
With abs flipping negative values and the subtraction shifting values around, it's hard to visualize what's going on, so let's calculate it step by step:
In [97]: x = np.array([100,50,-30,-50])
In [98]: x-43
Out[98]: array([ 57, 7, -73, -93])
In [99]: abs(x-43)
Out[99]: array([57, 7, 73, 93])
In [100]: np.argsort(abs(x-43))
Out[100]: array([1, 0, 2, 3])
In [101]: x[np.argsort(abs(x-43))]
Out[101]: array([ 50, 100, -30, -50])
argsort is the indexing that puts the elements in sorted order. We can see that with:
In [104]: Out[99][Out[100]]
Out[104]: array([ 7, 57, 73, 93])
or
In [105]: np.array([57, 7, 73, 93])[[1, 0, 2, 3]]
Out[105]: array([ 7, 57, 73, 93])
How they work together is determined by the Python syntax; that's straightforward.
(df.CCC - aValue).abs() takes the absolute value of df.CCC - aValue, argsort returns the indexes that would put those values in sorted order, and df.loc then displays the rows in that order.
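A caveat worth knowing: Series.argsort returns positional indexes, so df.loc only works here because the index happens to be 0..3. A sketch of a label-safe alternative using sort_values:

```python
import pandas as pd

df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                   'BBB': [10, 20, 30, 40],
                   'CCC': [100, 50, -30, -50]})
aValue = 43.0
# sort_values keeps the original index labels, so .loc is
# correct even when the frame has a non-default index
ordered = df.loc[(df.CCC - aValue).abs().sort_values().index]
```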

Pandas: Random integer between values in two columns

How can I create a new column that holds a random integer between the values of two columns in each row?
Example df:
import pandas as pd
import numpy as np
data = pd.DataFrame({'start': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                     'end': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
data = data.iloc[:, [1, 0]]
Now I am trying something like this:
data['rand_between'] = data.apply(lambda x: np.random.randint(data.start, data.end))
or
data['rand_between'] = np.random.randint(data.start, data.end)
But of course it doesn't work, because data.start is a Series, not a number.
How can I use numpy.random with data from columns as a vectorized operation?
You are close; you need to specify axis=1 to process the data by rows, and change data.start/end to x.start/end to work with scalars:
data['rand_between'] = data.apply(lambda x: np.random.randint(x.start, x.end), axis=1)
Another possible solution:
data['rand_between'] = [np.random.randint(s, e) for s,e in zip(data['start'], data['end'])]
print (data)
start end rand_between
0 1 10 8
1 2 20 3
2 3 30 23
3 4 40 35
4 5 50 30
5 6 60 28
6 7 70 60
7 8 80 14
8 9 90 85
9 10 100 83
If you want to truly vectorize this, you can generate a random number between 0 and 1 and scale it into your start/end range (note that the + 1 makes end inclusive, whereas np.random.randint excludes the high value):
(
    data['start'] + np.random.rand(len(data)) * (data['end'] - data['start'] + 1)
).astype('int')
Out:
0 1
1 18
2 18
3 35
4 22
5 27
6 35
7 23
8 33
9 81
dtype: int64
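Another truly vectorized option: numpy's newer Generator.integers accepts array-valued low/high and broadcasts them element-wise, with an exclusive high just like np.random.randint; a sketch with a seeded generator:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({'start': [1, 2, 3, 4, 5],
                     'end': [10, 20, 30, 40, 50]})
rng = np.random.default_rng(0)
# low/high broadcast element-wise; high is exclusive by default
data['rand_between'] = rng.integers(data['start'], data['end'])
```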

Getting count of rows using groupby in Pandas

I have two columns in my dataset, col1 and col2. I want to display data grouped by col1.
For that I have written code like:
grouped = df[['col1','col2']].groupby(['col1'], as_index= False)
The above code creates the groupby object.
How do I use the object to display the data grouped as per col1?
To get the counts by group, you can use dataframe.groupby('column').size().
Example:
In [10]: df = pd.DataFrame({'id': [123, 512, 'zhub1', 12354.3, 129, 753, 295, 610],
                            'colour': ['black', 'white', 'white', 'white',
                                       'black', 'black', 'white', 'white'],
                            'shape': ['round', 'triangular', 'triangular', 'triangular', 'square',
                                      'triangular', 'round', 'triangular']
                            }, columns=['id', 'colour', 'shape'])
In [11]:df
Out[11]:
id colour shape
0 123 black round
1 512 white triangular
2 zhub1 white triangular
3 12354.3 white triangular
4 129 black square
5 753 black triangular
6 295 white round
7 610 white triangular
In [12]:df.groupby('colour').size()
Out[12]:
colour
black 3
white 5
dtype: int64
In [13]:df.groupby('shape').size()
Out[13]:
shape
round 2
square 1
triangular 5
dtype: int64
Try the groups attribute and the get_group() method of the object returned by groupby():
>>> import numpy as np
>>> import pandas as pd
>>> anarray=np.array([[0, 31], [1, 26], [0, 35], [1, 22], [0, 41]])
>>> df = pd.DataFrame(anarray, columns=['is_female', 'age'])
>>> by_gender=df[['is_female','age']].groupby(['is_female'])
>>> by_gender.groups # returns indexes of records
{0: [0, 2, 4], 1: [1, 3]}
>>> by_gender.get_group(0)['age'] # age of males
0 31
2 35
4 41
Name: age, dtype: int64
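If the goal is simply to display the per-group counts as a regular DataFrame, size() combined with reset_index is a common idiom; a small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'col1': ['a', 'a', 'b'], 'col2': [1, 2, 3]})
# size() counts rows per group; reset_index turns the result
# back into a two-column frame with a named count column
counts = df.groupby('col1').size().reset_index(name='count')
```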
