Pandas Series value_counts with unexpected result [duplicate] - python-3.x

I have a dataframe of Age and Marital_Status. Age is an int and Marital_Status is a string with 8 unique values, e.g. Married, Single, etc.
After dividing into age groups of interval 10
df['age_group'] = pd.cut(df.Age,bins=[0,10,20,30,40,50,60,70,80,90])
I grouped the marital status by age group:
df.groupby(['age_group'])['Marital_Status'].value_counts()
This returns the error below:
ValueError: operands could not be broadcast together with shapes (9,) (7,)
When I tried with a smaller number of age groups, where the shape is <= (7,), the groupby works fine, e.g.
df['age_group'] = pd.cut(df.Age,bins=[10,20,30,40,50,60,70,80])
I presume that the first shape (x,) must be no larger than the second shape (y,). Can anyone please explain why?

The problem (maybe a bug?) is in DataFrame.groupby: the default parameter is observed=False, but you need to work only with the categories that are actually observed:
observed bool, default False
This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
Sample:
df = pd.DataFrame({'Marital_Status': ['stat1'] * 50,
                   'Age': range(50)})
df['age_group'] = pd.cut(df.Age,bins=[10,20,30,40,50,60,70,80])
# print (df)
print (df.groupby(['age_group'], observed=True)['Marital_Status'].value_counts())
age_group Marital_Status
(10, 20] stat1 10
(20, 30] stat1 10
(30, 40] stat1 10
(40, 50] stat1 9
Name: Marital_Status, dtype: int64
In an alternative solution it is easier to see the difference:
print (df.groupby(['age_group', 'Marital_Status']).size())
age_group Marital_Status
(10, 20] stat1 10
(20, 30] stat1 10
(30, 40] stat1 10
(40, 50] stat1 9
(50, 60] stat1 0
(60, 70] stat1 0
(70, 80] stat1 0
dtype: int64
print (df.groupby(['age_group', 'Marital_Status'], observed=True).size())
age_group Marital_Status
(10, 20] stat1 10
(20, 30] stat1 10
(30, 40] stat1 10
(40, 50] stat1 9
dtype: int64
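Applied back to the question's 9-bin setup, passing observed=True should sidestep the broadcast error; a minimal sketch with hypothetical data:
import pandas as pd

# hypothetical data standing in for the question's dataframe
df = pd.DataFrame({'Age': [25, 34, 34, 52, 67],
                   'Marital_Status': ['Single', 'Married', 'Single',
                                      'Married', 'Widowed']})
df['age_group'] = pd.cut(df.Age, bins=[0, 10, 20, 30, 40, 50, 60, 70, 80, 90])

# observed=True counts only within the bins that actually occur
print(df.groupby(['age_group'], observed=True)['Marital_Status'].value_counts())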

Related

Pandas qcut ValueError: Input array must be 1 dimensional

I was trying to categorize my values into 10 bins and ran into this error. How can I avoid it and bin the values smoothly?
Attached are samples of the data and code.
Data
JPM
2008-01-02 NaN
2008-01-03 NaN
2008-01-04 NaN
2008-01-07 NaN
2008-01-08 NaN
... ...
2009-12-24 -0.054014
2009-12-28 0.002679
2009-12-29 -0.030015
2009-12-30 -0.019058
2009-12-31 -0.010090
505 rows × 1 columns
Code
group_names = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
discretized_roc = pd.qcut(df, 10, labels=group_names)
Pass the column JPM, and for integer indicators of the bins only, use labels=False:
discretized_roc = pd.qcut(df['JPM'], 10, labels=False)
If you need to select the first column by position instead of by label, use DataFrame.iloc:
discretized_roc = pd.qcut(df.iloc[:, 0], 10, labels=False)
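Note that qcut keeps NaN rows as NaN in the output, so the leading NaNs in the JPM column are simply passed through; a minimal sketch with made-up returns:
import numpy as np
import pandas as pd

# hypothetical daily returns in the spirit of the JPM column
s = pd.Series([np.nan, np.nan, -0.054, 0.003, -0.030,
               -0.019, -0.010, 0.021, 0.008, -0.002,
               0.015, 0.040])

# 10 quantile bins; the integer codes come back as floats because of the NaNs
print(pd.qcut(s, 10, labels=False))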

find out percentage of duplicates

I have the following data:
id date A Area Price Hol
0 1 2019-01-01 No 80 200 No
1 2 2019-01-02 Yes 100 300 Yes
2 3 2019-01-03 Yes 100 300 Yes
3 4 2019-01-04 No 50 100 No
4 5 2019-01-05 No 20 50 No
5 1 2019-01-01 No 80 200 No
I want to find out duplicates (for the same id).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 1], 'date': ['2019-01-01', '2019-01-02', '2019-01-03', '2019-01-04',
'2019-01-05', '2019-01-01'],
'A': ['No', 'Yes', 'Yes', 'No', 'No', 'No'],
'Area': [80, 100, 100, 50, 20, 80], 'Price': [200, 300, 300, 100, 50, 200],
'Hol': ['No', 'Yes', 'Yes', 'No', 'No', 'No']})
df['date'] = pd.to_datetime(df['date'])
fig, ax = plt.subplots(figsize=(15, 7))
df.groupby(['A', 'Area', 'Price', 'Hol'])['id'].value_counts().plot(ax=ax)
I can see that I have one duplicate (for id 1, all the entries are the same).
Now, I want to find out what percentage those duplicates represent in the whole dataset.
I can't find a way to express this, since I am already using value_counts() in order to find the duplicates and I can't do something like:
df.groupby(['A', 'Area', 'Price', 'Hol'])['id'].value_counts().size()
percentage = (test / test.groupby(level=0).sum()) * 100
I believe you need DataFrame.duplicated with Series.value_counts:
percentage = df.duplicated(keep=False).value_counts(normalize=True) * 100
print (percentage)
False 66.666667
True 33.333333
dtype: float64
Is duplicated what you need?
df.duplicated(keep=False).mean()
Out[107]: 0.3333333333333333
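Note that the keep parameter decides what gets flagged: keep=False marks every member of a duplicate group, while the default keep='first' marks only the extra copies; a small sketch:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4, 5, 1],
                   'A': ['No', 'Yes', 'Yes', 'No', 'No', 'No']})

# keep=False: both copies of the id-1 row count (2 of 6 rows)
print(df.duplicated(keep=False).mean())    # 0.333...

# keep='first': only the second copy counts (1 of 6 rows)
print(df.duplicated(keep='first').mean())  # 0.166...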

How exactly 'abs()' and 'argsort()' work together

#Creating DataFrame
df=pd.DataFrame({'AAA' : [4,5,6,7], 'BBB' : [10,20,30,40],'CCC' : [100,50,-30,-50]}); df
output:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
aValue = 43.0
df.loc[(df.CCC-aValue).abs().argsort()]
output:
AAA BBB CCC
1 5 20 50
0 4 10 100
2 6 30 -30
3 7 40 -50
The output is confusing; can you please explain in detail how the line below works?
df.loc[(df.CCC-aValue).abs().argsort()]
With abs flipping negative values and the subtraction shifting values around, it's hard to visualize what's going on. Instead, calculate it step by step:
In [97]: x = np.array([100,50,-30,-50])
In [98]: x-43
Out[98]: array([ 57, 7, -73, -93])
In [99]: abs(x-43)
Out[99]: array([57, 7, 73, 93])
In [100]: np.argsort(abs(x-43))
Out[100]: array([1, 0, 2, 3])
In [101]: x[np.argsort(abs(x-43))]
Out[101]: array([ 50, 100, -30, -50])
argsort is the indexing that puts the elements in sorted order. We can see that with:
In [104]: Out[99][Out[100]]
Out[104]: array([ 7, 57, 73, 93])
or
In [105]: np.array([57, 7, 73, 93])[[1, 0, 2, 3]]
Out[105]: array([ 7, 57, 73, 93])
How they work together is determined by the Python syntax; that's straightforward.
(df.CCC-aValue).abs() takes the absolute value of df.CCC-aValue, argsort returns the integer positions that would sort those values, and df.loc uses them to show the rows ordered by closeness to aValue.
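The same step-by-step view in pandas itself, as a minimal sketch rebuilding the question's frame:
import pandas as pd

df = pd.DataFrame({'AAA': [4, 5, 6, 7],
                   'BBB': [10, 20, 30, 40],
                   'CCC': [100, 50, -30, -50]})
aValue = 43.0

dist = (df.CCC - aValue).abs()   # distance of each CCC value from 43: 57, 7, 73, 93
order = dist.argsort()           # positions that sort the distances: 1, 0, 2, 3
print(df.loc[order])             # rows ordered by closeness of CCC to 43
Note that df.loc works here only because the default RangeIndex makes labels and positions coincide; with a non-default index, iloc would be the safer choice.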

Converting float into range in Python

I am doing some data analysis with pandas and am struggling to find a nice, clean way of summing up a range of numbers. I have a data frame with a column of floats; however, I am not interested in the exact number, but in a rough range. Ultimately I want to run a pivot and count how many values fall in each range. Ideally I would therefore create a new column in my data frame that converts my column of floats into a range. Say df[number] = 3.5, then df[range] = 0-10.
The ranges should be 0-10, 10-20, ... >100
This may sound very arbitrary, but I've been struggling to find an answer on this. Many thanks
Pandas has a cut function for this
In [12]: s = pd.Series(np.random.uniform(0, 110, 100))
In [13]: s
Out[13]:
0 2.652461
1 46.536276
2 6.455352
3 6.075963
4 40.013378
...
95 39.775493
96 99.688307
97 41.064469
98 91.401904
99 60.580600
dtype: float64
In [14]: cuts = np.arange(0, 101, 10)
In [15]: pd.cut(s, cuts)
Out[15]:
0 (0, 10]
1 (40, 50]
2 (0, 10]
3 (0, 10]
4 (40, 50]
...
95 (30, 40]
96 (90, 100]
97 (40, 50]
98 (90, 100]
99 (60, 70]
dtype: category
Categories (10, object): [(0, 10] < (10, 20] < (20, 30] < (30, 40] ... (60, 70] < (70, 80] < (80, 90] < (90, 100]]
See the docs for controlling what happens with endpoints.
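For instance, right=False switches the bins to be closed on the left; a quick sketch using the s and cuts defined above:
# right=False yields [0, 10), [10, 20), ... instead of (0, 10], (10, 20]
pd.cut(s, cuts, right=False)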
Note that in 0.18 (coming out soonish) the result will be an IntervalIndex instead of a Categorical, which will make things even nicer.
To get your counts per interval, use the value_counts method
In [17]: pd.cut(s, cuts).value_counts()
Out[17]:
(30, 40] 15
(40, 50] 13
(50, 60] 12
(60, 70] 10
(0, 10] 10
(90, 100] 8
(70, 80] 8
(80, 90] 7
(10, 20] 6
(20, 30] 3
dtype: int64
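If you want the string labels from the question (0-10, 10-20, ..., >100) rather than interval objects, pd.cut also accepts a labels argument; a sketch assuming an open-ended top bin:
import numpy as np
import pandas as pd

s = pd.Series(np.random.uniform(0, 150, 100))

# ten 10-wide bins plus an unbounded '>100' bin (11 bins, 11 labels)
edges = list(np.arange(0, 101, 10)) + [np.inf]
labels = ['{}-{}'.format(i, i + 10) for i in range(0, 100, 10)] + ['>100']

print(pd.cut(s, bins=edges, labels=labels).value_counts())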
An alternative is a plain-Python helper that maps a number to the endpoints of its range:
def get_range_for(x, start=0, stop=100, step=10):
    if x < start:
        return (float('-inf'), start)
    if x >= stop:
        return (stop, float('inf'))
    left = step * ((x - start) // step)
    right = left + step
    return (left, right)
Examples:
>>> get_range_for(3.5)
(0.0, 10.0)
>>> get_range_for(27.3)
(20.0, 30.0)
>>> get_range_for(75.6)
(70.0, 80.0)
Corner cases:
>>> get_range_for(-100)
(-inf, 0)
>>> get_range_for(1234)
(100, inf)
>>> get_range_for(0)
(0, 10)
>>> get_range_for(10)
(10, 20)
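To apply the helper across a DataFrame column, a usage sketch (hypothetical data, get_range_for as defined above):
import pandas as pd

df = pd.DataFrame({'vals': [3.5, 27.3, 1234.0]})

# each value becomes a (left, right) tuple via the helper above
df['range'] = df['vals'].apply(get_range_for)
print(df)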
Using the properties of integer division should help. Because you want ranges in units of 10, dividing a number by 10 (13.5 / 10 == 1.35), converting it to an integer (int(1.35) == 1), and then multiplying by 10 (1 * 10 == 10) converts the number to the low end of the range it falls into. This might need some refinement (especially for negative numbers), but you could try something like:
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({'vals': [3.5, 4.2, 10.5, 19.5, 20.3, 24.2]})
>>> df
vals
0 3.5
1 4.2
2 10.5
3 19.5
4 20.3
5 24.2
>>> df['range_start'] = np.floor(df['vals'] / 10) * 10
>>> df
vals range_start
0 3.5 0.0
1 4.2 0.0
2 10.5 10.0
3 19.5 10.0
4 20.3 20.0
5 24.2 20.0
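If you want string labels like the question's 0-10 instead of the numeric bin start, one hypothetical follow-up:
import numpy as np
import pandas as pd

df = pd.DataFrame({'vals': [3.5, 4.2, 10.5, 24.2]})
start = (np.floor(df['vals'] / 10) * 10).astype(int)

# e.g. 3.5 -> '0-10', 10.5 -> '10-20', 24.2 -> '20-30'
df['range'] = start.astype(str) + '-' + (start + 10).astype(str)
print(df)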

Getting count of rows using groupby in Pandas

I have two columns in my dataset, col1 and col2. I want to display data grouped by col1.
For that I have written code like:
grouped = df[['col1','col2']].groupby(['col1'], as_index=False)
The above code creates the groupby object.
How do I use the object to display the data grouped as per col1?
To get the counts by group, you can use dataframe.groupby('column').size().
Example:
In [10]: df = pd.DataFrame({'id': [123, 512, 'zhub1', 12354.3, 129, 753, 295, 610],
                            'colour': ['black', 'white', 'white', 'white',
                                       'black', 'black', 'white', 'white'],
                            'shape': ['round', 'triangular', 'triangular', 'triangular',
                                      'square', 'triangular', 'round', 'triangular']
                           }, columns=['id', 'colour', 'shape'])
In [11]: df
Out[11]:
id colour shape
0 123 black round
1 512 white triangular
2 zhub1 white triangular
3 12354.3 white triangular
4 129 black square
5 753 black triangular
6 295 white round
7 610 white triangular
In [12]: df.groupby('colour').size()
Out[12]:
colour
black 3
white 5
dtype: int64
In [13]: df.groupby('shape').size()
Out[13]:
shape
round 2
square 1
triangular 5
dtype: int64
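If you'd rather see the counts as a regular two-column DataFrame, a small sketch with hypothetical data:
import pandas as pd

df = pd.DataFrame({'colour': ['black', 'white', 'white', 'black', 'white'],
                   'shape': ['round', 'triangular', 'round', 'square', 'triangular']})

# reset_index(name=...) turns the grouped sizes into an ordinary DataFrame
print(df.groupby('colour').size().reset_index(name='count'))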
Try the groups attribute and the get_group() method of the object returned by groupby():
>>> import numpy as np
>>> import pandas as pd
>>> anarray=np.array([[0, 31], [1, 26], [0, 35], [1, 22], [0, 41]])
>>> df = pd.DataFrame(anarray, columns=['is_female', 'age'])
>>> by_gender=df[['is_female','age']].groupby(['is_female'])
>>> by_gender.groups # returns indexes of records
{0: [0, 2, 4], 1: [1, 3]}
>>> by_gender.get_group(0)['age'] # age of males
0 31
2 35
4 41
Name: age, dtype: int64
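You can also iterate over the groupby object directly; each step yields a (key, sub-DataFrame) pair. Continuing the session above (the key is a scalar here; newer pandas returns a 1-tuple when grouping by a one-element list):
>>> for key, group in by_gender:
...     print(key)
...     print(group)
...
0
   is_female  age
0          0   31
2          0   35
4          0   41
1
   is_female  age
1          1   26
3          1   22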
