I have a dataframe with Age and Marital_Status columns. Age is an int and Marital_Status is a string with 8 unique values, e.g. Married, Single, etc.
After dividing the ages into groups with an interval of 10:
df['age_group'] = pd.cut(df.Age,bins=[0,10,20,30,40,50,60,70,80,90])
I grouped the marital status by age group, but it returned the error below:
ValueError: operands could not be broadcast together with shape (9,) (7,)
When I tried a smaller number of age groups, where the shape is <= (7,), the groupby works fine, e.g.:
df['age_group'] = pd.cut(df.Age,bins=[10,20,30,40,50,60,70,80])
I presume that the first shape (x,) must be smaller than second shape (y,). Can anyone please explain why?

The problem (maybe a bug?) is in DataFrame.groupby: the default parameter is observed=False, but you need to work only with the categories that actually occur in the data:
observed bool, default False
This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
df = pd.DataFrame({'Marital_Status':['stat1'] * 50,
'Age': range(50)})
df['age_group'] = pd.cut(df.Age,bins=[10,20,30,40,50,60,70,80])
# print (df)
print (df.groupby(['age_group'], observed=True)['Marital_Status'].value_counts())
age_group Marital_Status
(10, 20] stat1 10
(20, 30] stat1 10
(30, 40] stat1 10
(40, 50] stat1 9
Name: Marital_Status, dtype: int64
As an alternative, group by both columns and call size(), which makes the difference easy to check:
print (df.groupby(['age_group', 'Marital_Status']).size())
age_group Marital_Status
(10, 20] stat1 10
(20, 30] stat1 10
(30, 40] stat1 10
(40, 50] stat1 9
(50, 60] stat1 0
(60, 70] stat1 0
(70, 80] stat1 0
dtype: int64
print (df.groupby(['age_group', 'Marital_Status'], observed=True).size())
age_group Marital_Status
(10, 20] stat1 10
(20, 30] stat1 10
(30, 40] stat1 10
(40, 50] stat1 9
dtype: int64
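The effect of observed can also be checked on the original nine bins. The following is a minimal sketch, assuming made-up data (two marital statuses, ages 0 to 49); it does not reproduce the ValueError itself, it only shows how many rows each setting produces.

```python
import pandas as pd

# Made-up data for illustration: 50 rows, two statuses, ages 0..49.
df = pd.DataFrame({"Marital_Status": ["Married", "Single"] * 25,
                   "Age": range(50)})
df["age_group"] = pd.cut(df["Age"], bins=[0, 10, 20, 30, 40, 50, 60, 70, 80, 90])

# observed=True keeps only the age groups that actually contain rows.
observed_sizes = df.groupby(["age_group", "Marital_Status"], observed=True).size()

# observed=False also lists the empty age groups, with a count of 0.
all_sizes = df.groupby(["age_group", "Marital_Status"], observed=False).size()

print(len(observed_sizes), len(all_sizes))
```

Age 0 falls outside the first bin (0, 10], so it is left out of both results.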


Pandas qcut ValueError: Input array must be 1 dimensional

I was trying to categorize my values into 10 bins and I met with this error. How can I avoid this error and bin them smoothly?
Attached are samples of the data and code.
2008-01-02 NaN
2008-01-03 NaN
2008-01-04 NaN
2008-01-07 NaN
2008-01-08 NaN
... ...
2009-12-24 -0.054014
2009-12-28 0.002679
2009-12-29 -0.030015
2009-12-30 -0.019058
2009-12-31 -0.010090
505 rows × 1 columns
group_names = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
discretized_roc = pd.qcut(df, 10, labels=group_names)
Pass the column JPM, and to get only integer indicators of the bins use labels=False:
discretized_roc = pd.qcut(df['JPM'], 10, labels=False)
If you need to select the first column by position instead of by label, use DataFrame.iloc:
discretized_roc = pd.qcut(df.iloc[:, 0], 10, labels=False)
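A minimal sketch of the same fix on synthetic data, assuming a single-column frame named JPM with a few leading NaN values like the sample in the question (the random values are made up). qcut expects a 1-dimensional input, so the column is passed rather than the whole DataFrame.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the question's data: one column, first rows missing.
rng = np.random.default_rng(0)
idx = pd.date_range("2008-01-02", periods=500, freq="B")
df = pd.DataFrame({"JPM": rng.normal(size=500)}, index=idx)
df.iloc[:5, 0] = np.nan

# Pass the 1-D column; labels=False returns the bin number (0..9) per row.
discretized_roc = pd.qcut(df["JPM"], 10, labels=False)

print(discretized_roc.value_counts().sort_index())
```

Rows with NaN stay NaN in the result instead of being assigned to a bin.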

find out percentage of duplicates

I have the following data:
id date A Area Price Hol
0 1 2019-01-01 No 80 200 No
1 2 2019-01-02 Yes 100 300 Yes
2 3 2019-01-03 Yes 100 300 Yes
3 4 2019-01-04 No 50 100 No
4 5 2019-01-05 No 20 50 No
5 1 2019-01-01 No 80 200 No
I want to find out duplicates (for the same id).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 1], 'date': ['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
'2019-01-05', '2019-01-01'],
'A': ['No', 'Yes', 'Yes', 'No', 'No', 'No'],
'Area': [80, 100, 100, 50, 20, 80], 'Price': [200, 300, 300, 100, 50, 200],
'Hol': ['No', 'Yes', 'Yes', 'No', 'No', 'No']})
df['date'] = pd.to_datetime(df['date'])
fig, ax = plt.subplots(figsize=(15, 7))
df.groupby(['A', 'Area', 'Price', 'Hol'])['id'].value_counts().plot(ax=ax)
I can see that I have one duplicate (for id 1, all the entries are the same).
Now, I want to find out what percentage those duplicates represent in the whole dataset.
I can't find a way to express this, since I am already using value_counts() in order to find the duplicates and I can't do something like:
df.groupby(['A', 'Area', 'Price', 'Hol'])['id'].value_counts().size()
percentage = (test / test.groupby(level=0).sum()) * 100
I believe you need DataFrame.duplicated with Series.value_counts:
percentage = df.duplicated(keep=False).value_counts(normalize=True) * 100
print (percentage)
False 66.666667
True 33.333333
dtype: float64
Is duplicated what you need?
Out[107]: 0.3333333333333333
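One way to arrive at a fraction like the one above is to take the mean of the boolean mask returned by duplicated. This is a sketch on the question's data, not necessarily the exact call the answer used.

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 1],
                   'date': ['2019-01-01', '2019-01-02', '2019-01-03',
                            '2019-01-04', '2019-01-05', '2019-01-01'],
                   'A': ['No', 'Yes', 'Yes', 'No', 'No', 'No'],
                   'Area': [80, 100, 100, 50, 20, 80],
                   'Price': [200, 300, 300, 100, 50, 200],
                   'Hol': ['No', 'Yes', 'Yes', 'No', 'No', 'No']})

# keep=False marks every row that has an identical twin (rows 0 and 5 here).
dup_mask = df.duplicated(keep=False)
share_all = dup_mask.mean()           # fraction of rows involved in a duplicate

# The default keep='first' marks only the repeated copies (row 5 here).
share_extra = df.duplicated().mean()  # fraction of rows that are redundant

print(share_all * 100, share_extra * 100)
```

Which of the two percentages is wanted depends on whether the first occurrence should count as a duplicate.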

How exactly do 'abs()' and 'argsort()' work together

# Creating the DataFrame
df = pd.DataFrame({'AAA': [4,5,6,7], 'BBB': [10,20,30,40], 'CCC': [100,50,-30,-50]}); df
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
aValue = 43.0
df.loc[(df.CCC-aValue).abs().argsort()]
AAA BBB CCC
1 5 20 50
0 4 10 100
2 6 30 -30
3 7 40 -50
The output is confusing. Can you please explain in detail how the line above works?
With abs flipping the negative values and the subtraction shifting values around, it's hard to visualize what's going on. Instead, calculate it step by step:
In [97]: x = np.array([100,50,-30,-50])
In [98]: x-43
Out[98]: array([ 57, 7, -73, -93])
In [99]: abs(x-43)
Out[99]: array([57, 7, 73, 93])
In [100]: np.argsort(abs(x-43))
Out[100]: array([1, 0, 2, 3])
In [101]: x[np.argsort(abs(x-43))]
Out[101]: array([ 50, 100, -30, -50])
argsort is the indexing that puts the elements in sorted order. We can see that with:
In [104]: Out[99][Out[100]]
Out[104]: array([ 7, 57, 73, 93])
In [105]: np.array([57, 7, 73, 93])[[1, 0, 2, 3]]
Out[105]: array([ 7, 57, 73, 93])
How they work together is determined by the Python syntax; that's straightforward.
(df.CCC-aValue).abs() takes the absolute value of df.CCC-aValue, argsort returns the index positions that would sort those values, and df.loc then shows the rows in that order.
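The same chain can be split into named steps in pandas. This is a minimal sketch using the question's data. It selects rows with iloc because argsort returns positions; with the default 0..n-1 index, loc gives the same rows.

```python
import pandas as pd

df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                   'BBB': [10, 20, 30, 40],
                   'CCC': [100, 50, -30, -50]})
a_value = 43.0

distance = (df['CCC'] - a_value).abs()  # how far each CCC value is from 43
order = distance.argsort()              # positions that sort those distances
closest_first = df.iloc[order.to_numpy()]

print(distance.tolist())  # [57.0, 7.0, 73.0, 93.0]
print(order.tolist())     # [1, 0, 2, 3]
print(closest_first)
```

The row with CCC = 50 comes first because 50 is the value closest to 43.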

Converting float into range in Python

I am doing some data analysis with pandas and am struggling to find a nice, clean way of summing up a range of numbers. I have a data frame with a column of floats, however I am not interested in the exact number, but a rough range. Ultimately I want to run a pivot and count how many values are in a certain range. Therefore ideally I would want to create a new column in my data frame, that converts my column of floats into a range. Say df[number] = 3.5, then df[range] = 0-10
The ranges should be 0-10, 10-20, ... >100
This may sound very arbitrary, but I've been struggling to find an answer on this. Many thanks
Pandas has a cut function for this
In [12]: s = pd.Series(np.random.uniform(0, 110, 100))
In [13]: s
0 2.652461
1 46.536276
2 6.455352
3 6.075963
4 40.013378
95 39.775493
96 99.688307
97 41.064469
98 91.401904
99 60.580600
dtype: float64
In [14]: cuts = np.arange(0, 101, 10)
In [15]: pd.cut(s, cuts)
0 (0, 10]
1 (40, 50]
2 (0, 10]
3 (0, 10]
4 (40, 50]
95 (30, 40]
96 (90, 100]
97 (40, 50]
98 (90, 100]
99 (60, 70]
dtype: category
Categories (10, object): [(0, 10] < (10, 20] < (20, 30] < (30, 40] ... (60, 70] < (70, 80] < (80, 90] < (90, 100]]
See the docs for controlling what happens with endpoints.
Note that since pandas 0.20 the categories of the result are stored as an IntervalIndex rather than as strings, which makes things even nicer; the result itself is still a Categorical.
To get your counts per interval, use the value_counts method
In [17]: pd.cut(s, cuts).value_counts()
(30, 40] 15
(40, 50] 13
(50, 60] 12
(60, 70] 10
(0, 10] 10
(90, 100] 8
(70, 80] 8
(80, 90] 7
(10, 20] 6
(20, 30] 3
dtype: int64
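The bins above stop at 100, so larger values come back as NaN. A possible way to add the open-ended top range from the question is to append infinity as the last edge. This is a sketch with made-up values and made-up label strings; right=False is chosen here so that 10 lands in 10-20, which is an assumption about the desired endpoints.

```python
import numpy as np
import pandas as pd

s = pd.Series([3.5, 10.0, 27.3, 99.9, 100.0, 250.0])  # made-up values

# Edges 0, 10, ..., 100, inf give ten 10-wide bins plus one open-ended bin.
edges = list(range(0, 101, 10)) + [np.inf]
labels = [f"{lo}-{lo + 10}" for lo in range(0, 100, 10)] + [">=100"]

# right=False makes each bin closed on the left: [0, 10), [10, 20), ...
ranges = pd.cut(s, bins=edges, labels=labels, right=False)
counts = ranges.value_counts(sort=False)

print(ranges.tolist())
print(counts)
```

With right=False the last bin starts at exactly 100, hence the ">=100" label rather than ">100".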
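def get_range_for(x, start=0, stop=100, step=10):
    # Values below start or at/above stop fall into open-ended ranges.
    if x < start:
        return (float('-inf'), start)
    if x >= stop:
        return (stop, float('inf'))
    # Floor x to the lower edge of its step-wide bin, measured from start.
    left = start + step * ((x - start) // step)
    right = left + step
    return (left, right)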
>>> get_range_for(3.5)
(0.0, 10.0)
>>> get_range_for(27.3)
(20.0, 30.0)
>>> get_range_for(75.6)
(70.0, 80.0)
Corner cases:
>>> get_range_for(-100)
(-inf, 0)
>>> get_range_for(1234)
(100, inf)
>>> get_range_for(0)
(0, 10)
>>> get_range_for(10)
(10, 20)
Using the properties of integer division should help. Because you want ranges in units of 10, dividing a number by 10 (13.5 / 10 == 1.35), converting it to an integer (int(1.35) == 1), and then multiplying by 10 (1 * 10 == 10) will convert the number to the low-end of the range it falls into. This might need some refinement (especially for negative numbers), but you could try something like:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'vals': [3.5, 4.2, 10.5, 19.5, 20.3, 24.2]})
>>> df
vals
0 3.5
1 4.2
2 10.5
3 19.5
4 20.3
5 24.2
>>> df['range_start'] = np.floor(df['vals'] / 10) * 10
>>> df
vals range_start
0 3.5 0
1 4.2 0
2 10.5 10
3 19.5 10
4 20.3 20
5 24.2 20
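The caveat about negative numbers can be made concrete: int() truncates toward zero, while np.floor always rounds down. A small sketch with made-up values:

```python
import numpy as np
import pandas as pd

vals = pd.Series([-3.5, 3.5, 13.5])  # made-up values, one of them negative

# Truncation toward zero puts -3.5 into the range starting at 0.
truncated = (vals / 10).astype(int) * 10

# Flooring puts -3.5 into the range starting at -10, as expected.
floored = (np.floor(vals / 10) * 10).astype(int)

print(truncated.tolist())  # [0, 0, 10]
print(floored.tolist())    # [-10, 0, 10]
```

This is why the np.floor version above needs no extra handling for values below zero.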

Getting count of rows using groupby in Pandas

I have two columns in my dataset, col1 and col2. I want to display data grouped by col1.
For that I have written code like:
grouped = df[['col1','col2']].groupby(['col1'], as_index= False)
The above code creates the groupby object.
How do I use the object to display the data grouped as per col1?
To get the counts by group, you can use dataframe.groupby('column').size().
In [10]: df = pd.DataFrame({'id': [123, 512, 'zhub1', 12354.3, 129, 753, 295, 610],
                            'colour': ['black', 'white', 'white', 'white',
                                       'black', 'black', 'white', 'white'],
                            'shape': ['round', 'triangular', 'triangular', 'triangular', 'square',
                                      'triangular', 'round', 'triangular'],
                           }, columns=['id', 'colour', 'shape'])
In [11]:df
id colour shape
0 123 black round
1 512 white triangular
2 zhub1 white triangular
3 12354.3 white triangular
4 129 black square
5 753 black triangular
6 295 white round
7 610 white triangular
In [12]:df.groupby('colour').size()
black 3
white 5
dtype: int64
In [13]:df.groupby('shape').size()
round 2
square 1
triangular 5
dtype: int64
Try the groups attribute and the get_group() method of the object returned by groupby():
>>> import numpy as np
>>> import pandas as pd
>>> anarray=np.array([[0, 31], [1, 26], [0, 35], [1, 22], [0, 41]])
>>> df = pd.DataFrame(anarray, columns=['is_female', 'age'])
>>> by_gender=df[['is_female','age']].groupby(['is_female'])
>>> by_gender.groups # returns indexes of records
{0: [0, 2, 4], 1: [1, 3]}
>>> by_gender.get_group(0)['age'] # age of males
0 31
2 35
4 41
Name: age, dtype: int64
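To display every group in turn, the groupby object can also be iterated directly. This is a minimal sketch with made-up col1/col2 values, since the question does not show its data.

```python
import pandas as pd

# Made-up data using the column names from the question.
df = pd.DataFrame({'col1': ['a', 'b', 'a', 'b', 'a'],
                   'col2': [1, 2, 3, 4, 5]})

grouped = df.groupby('col1')

# Each iteration yields the group key and the sub-frame for that key.
for key, group in grouped:
    print(key)
    print(group)

sizes = grouped.size()  # number of rows per group
print(sizes)
```

Grouping by the plain column name (rather than a one-element list) keeps each key a scalar in the loop.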
