Pandas series.groupby().apply( .sum() ), .sum() not summing values

I have the following test code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'MONTH': [1,2,3,1,1,1,1,1,1,2,3,2,2,3,2,1,1,1,1,1,1,1],
                   'HOUR': [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
                   'CIGFT': [np.NaN,12000,2500,73300,73300,np.NaN,np.NaN,np.NaN,np.NaN,12000,100,100,15000,2500,np.NaN,15000,11000,np.NaN,np.NaN,np.NaN,np.NaN,np.NaN]})
cigs = pd.DataFrame()
cigs['cigsum'] = df.groupby(['MONTH','HOUR'])['CIGFT'].apply(lambda c: (c>=0.0).sum())
cigs['cigcount'] = df.groupby(['MONTH','HOUR'])['CIGFT'].apply(lambda c: (c>=0.0).count())
df.fillna(value='-', inplace=True)
cigs['cigminus'] = df.groupby(['MONTH','HOUR'])['CIGFT'].apply(lambda c: (c>=0.0).sum())
tfile = open('test_COUNT_manual.txt', 'a')
tfile.write(cigs.to_string())
tfile.close()
I wind up with the following results:
The dataframe:
CIGFT HOUR MONTH
0 NaN 0 1
1 12000.0 0 2
2 2500.0 0 3
3 73300.0 0 1
4 73300.0 0 1
5 NaN 0 1
6 NaN 0 1
7 NaN 0 1
8 NaN 0 1
9 12000.0 0 2
10 100.0 0 3
11 100.0 0 2
12 15000.0 0 2
13 2500.0 0 3
14 NaN 0 2
15 15000.0 0 1
16 11000.0 0 1
17 NaN 0 1
18 NaN 0 1
19 NaN 0 1
20 NaN 0 1
21 NaN 0 1
The results written to the file:
cigsum cigcount cigminus
MONTH HOUR
1 0 4 14 14
2 0 4 5 5
3 0 3 3 3
My issue is that .sum() is not summing the values; it is doing a count of the non-null values. When I replace the null values with a minus sign, .sum() produces the same result as .count().
So what do I use to get the sum of the values if .sum() does not do it?

Series.sum() returns the sum of the series values, excluding NA/null values by default, as mentioned in the official docs.
Your lambda, however, computes (c >= 0.0).sum(): the comparison produces a boolean Series, and summing booleans counts the True values, which is why your "sum" matches the count of non-null entries.
The lambda receives each group as a Series, so just apply sum to that Series directly:
cigs['cigsum'] = df.groupby(['MONTH','HOUR'])['CIGFT'].apply(lambda c: c.sum())
The result of this code will be:
MONTH HOUR
1 0 172600.0
2 0 39100.0
3 0 5100.0
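As a side note, the apply and lambda are not needed at all here; the GroupBy's built-in sum skips NaN by default and gives the same result faster:
cigs['cigsum'] = df.groupby(['MONTH','HOUR'])['CIGFT'].sum()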


ValueError: could not convert string to float: 'Mme'

When I run the following code in Jupyter Lab
import numpy as np
from sklearn.feature_selection import SelectKBest,f_classif
import matplotlib.pyplot as plt
predictors = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked","FamilySize","Title","NameLength"]
selector = SelectKBest(f_classif,k=5)
selector.fit(titanic[predictors],titanic["Survived"])
Then it raised the error ValueError: could not convert string to float: 'Mme'; the details are as follows:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
C:\Users\ADMINI~1\AppData\Local\Temp/ipykernel_17760/1637555559.py in <module>
5 predictors = ["Pclass","Sex","Age","SibSp","Parch","Fare","Embarked","FamilySize","Title","NameLength"]
6 selector = SelectKBest(f_classif,k=5)
----> 7 selector.fit(titanic[predictors],titanic["Survived"])
......
ValueError: could not convert string to float: 'Mme'
I tried to print titanic[predictors] and titanic["Survived"]; the output is as follows:
Pclass Sex Age SibSp Parch Fare Embarked FamilySize Title NameLength
0 3 0 22.0 1 0 7.2500 0 1 1 23
1 1 1 38.0 1 0 71.2833 1 1 3 51
2 3 1 26.0 0 0 7.9250 0 0 2 22
3 1 1 35.0 1 0 53.1000 0 1 3 44
4 3 0 35.0 0 0 8.0500 0 0 1 24
... ... ... ... ... ... ... ... ... ... ...
886 2 0 27.0 0 0 13.0000 0 0 6 21
887 1 1 19.0 0 0 30.0000 0 0 2 28
888 3 1 28.0 1 2 23.4500 0 3 2 40
889 1 0 26.0 0 0 30.0000 1 0 1 21
890 3 0 32.0 0 0 7.7500 2 0 1 19
891 rows × 10 columns
0 0
1 1
2 1
3 1
4 0
..
886 0
887 1
888 0
889 1
890 0
Name: Survived, Length: 891, dtype: int64
How to Solve this Problem?
When you are trying to fit some algorithm (in your case SelectKBest), you need to be aware of your data, and almost every time you need to preprocess it.
Take a look at your data:
Do you have categorical features, or are they numerical? Or a mix?
Do you have NaN values?
...
Most algorithms don't accept categorical features, and you will need to transform them into numerical ones (evaluate the use of OneHotEncoder).
In your case it seems you have a categorical value called Mme, which is in the feature Title. Check it.
You will have the same problem with NaN values.
In conclusion, before you start fitting, you have to preprocess your data.
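A minimal sketch of how you might locate and encode the offending columns, assuming titanic is the DataFrame printed above (pd.factorize is a quick stand-in here; OneHotEncoder is the more rigorous choice for nominal features):
import pandas as pd
# Predictor columns that are not numeric still hold strings such as 'Mme'
# and must be encoded before SelectKBest can convert the data to floats.
non_numeric = titanic[predictors].select_dtypes(exclude='number').columns
print(non_numeric.tolist())
# Integer-encode each remaining string column in place.
for col in non_numeric:
    titanic[col], _ = pd.factorize(titanic[col])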
Is it printing column labels in the first line? If so, assign the data properly: take the array starting from the second row, array[1:, :].
Otherwise, look into where the "Mme" string is located so you understand how the code is fetching it.

How to count rows when a value changes from greater than a threshold to 0

I have three columns in a dataframe, X1, X2, X3. I want to count the rows where the value changes from a value greater than 1 to 0; if the value before the 0 is less than 1, it should not be counted.
input df:
df1 = pd.DataFrame({'x1': [3,4,7,0,0,0,0,20,15,16,0,0,70],
                    'X2': [3,4,7,0,0,0,0,20,15,16,0,0,70],
                    'X3': [6,3,0.5,0,0,0,0,20,15,16,0,0,70]})
print(df1)
x1 X2 X3
0 3 3 6.0
1 4 4 3.0
2 7 7 0.5
3 0 0 0.0
4 0 0 0.0
5 0 0 0.0
6 0 0 0.0
7 20 20 20.0
8 15 15 15.0
9 16 16 16.0
10 0 0 0.0
11 0 0 0.0
12 70 70 70.0
Desired output:
x1_count X2_count X3_count
0 6 6 2
The idea is to replace the 0s with missing values and forward fill them, then keep only the positions that were originally 0 (converting everything else to NaN), compare those with greater than 1, and count the Trues with sum; the resulting Series is converted to a one-row DataFrame with a transpose:
m = df1.eq(0)
df2 = (df1.mask(m)              # replace 0s with NaN
          .ffill()              # fill each NaN with the last non-zero value
          .where(m)             # keep only positions that were originally 0
          .gt(1)                # True where the preceding value was > 1
          .sum()                # count Trues per column
          .add_suffix('_count')
          .to_frame()
          .T)
print(df2)
x1_count X2_count X3_count
0 6 6 2
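To see why X3 counts only 2, it can help to print the intermediate result (illustrative):
print(df1.mask(m).ffill().where(m))
# In X3, the zeros in rows 3-6 are filled with the preceding 0.5 (not > 1),
# while the zeros in rows 10-11 are filled with 16, so only those two count.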

Subtract from all columns in dataframe row by the value in a Series when indexes match

I am trying to subtract 1 from all columns in the rows of a DataFrame that have a matching index in a list.
For example, if I have a DataFrame like this one:
df = pd.DataFrame({'AMOS Admin': [1,1,0,0,2,2], 'MX Programs': [0,0,1,1,0,0], 'Material Management': [2,2,2,2,1,1]})
print(df)
AMOS Admin MX Programs Material Management
0 1 0 2
1 1 0 2
2 0 1 2
3 0 1 2
4 2 0 1
5 2 0 1
I want to subtract 1 from all columns where index is in [2, 3] so that the end result is:
AMOS Admin MX Programs Material Management
0 1 0 2
1 1 0 2
2 -1 0 1
3 -1 0 1
4 2 0 1
5 2 0 1
Having found no way to do this I created a Series:
sr = pd.Series([1,1], index=['2', '3'])
print(sr)
2 1
3 1
dtype: int64
However, applying the sub method as per this question results in a DataFrame with all NaN and new rows at the bottom.
AMOS Admin MX Programs Material Management
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN
5 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
Any help would be most appreciated.
Thanks,
Juan
Use reindex with your sr, then subtract using the underlying values. Note that sr must be built with an integer index, pd.Series([1, 1], index=[2, 3]); the string index ['2', '3'] in your example never aligns with df's integer index, which is exactly why sub gave you all NaN plus extra rows.
df.loc[:] = df.values - sr.reindex(df.index, fill_value=0).values[:, None]
df
Out[1117]:
AMOS Admin MX Programs Material Management
0 1 0 2
1 1 0 2
2 -1 0 1
3 -1 0 1
4 2 0 1
5 2 0 1
If what you want to do is that specific, why don't you just:
df.loc[[2, 3], :] = df.loc[[2, 3], :].subtract(1)
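Another option, sketched assuming the integer-indexed sr above, is to let pandas handle the alignment itself with sub along the index axis:
# Align sr to df.index (rows not in sr become 0), then subtract row-wise.
df = df.sub(sr.reindex(df.index, fill_value=0), axis=0)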

python dataframe counter on a column

Column x in my dataframe has only 0s and 1s. I want to create a variable y which counts the zeros and resets when a 1 comes in x. I'm getting the error "The truth value of a Series is ambiguous."
count = 1
countList = [0]
for x in df['x']:
    if df['x'] == 0:
        count = count + 1
        df['y'] = count
    else:
        df['y'] = 1
        count = 1
First, don't loop in pandas: it is slow whenever a vectorized solution exists. (The error itself comes from if df['x'] == 0, which tests an entire Series; you meant if x == 0.)
I think you need to count consecutive 0 values:
df = pd.DataFrame({'x':[1,0,0,1,1,0,1,0,0,0,1,1,0,0,0,0,1]})
a = df['x'].eq(0)
b = a.cumsum()
df['y'] = (b-b.mask(a).ffill().fillna(0).astype(int))
print (df)
x y
0 1 0
1 0 1
2 0 2
3 1 0
4 1 0
5 0 1
6 1 0
7 0 1
8 0 2
9 0 3
10 1 0
11 1 0
12 0 1
13 0 2
14 0 3
15 0 4
16 1 0
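An equivalent one-liner, sketched under the assumption that x starts with a 1 as in this sample (leading zeros would all land in a single group and be numbered from 0):
# Each 1 starts a new group via the cumulative sum; cumcount then numbers
# the rows within each group: 0 on the 1 itself, then 1, 2, ... on the 0s.
df['y'] = df.groupby(df['x'].cumsum()).cumcount()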
Detail + explanation (note that this walkthrough uses a slightly different sample whose x starts with zeros):
#compare with zero
a = df['x'].eq(0)
#cumulative sum of the mask
b = a.cumsum()
#replace Trues with NaNs
c = b.mask(a)
#forward fill the NaNs
d = b.mask(a).ffill()
#leading NaNs to 0 and cast to integers
e = b.mask(a).ffill().fillna(0).astype(int)
#subtract from the cumulative sum Series
y = b - e
df = pd.concat([df['x'], a, b, c, d, e, y], axis=1, keys=('x','a','b','c','d','e', 'y'))
print (df)
x a b c d e y
0 0 True 1 NaN NaN 0 1
1 0 True 2 NaN NaN 0 2
2 0 True 3 NaN NaN 0 3
3 1 False 3 3.0 3.0 3 0
4 1 False 3 3.0 3.0 3 0
5 0 True 4 NaN 3.0 3 1
6 1 False 4 4.0 4.0 4 0
7 0 True 5 NaN 4.0 4 1
8 0 True 6 NaN 4.0 4 2
9 0 True 7 NaN 4.0 4 3
10 1 False 7 7.0 7.0 7 0
11 1 False 7 7.0 7.0 7 0
12 0 True 8 NaN 7.0 7 1
13 0 True 9 NaN 7.0 7 2
14 0 True 10 NaN 7.0 7 3
15 0 True 11 NaN 7.0 7 4
16 1 False 11 11.0 11.0 11 0

Using pandas' groupby with shifting

I am looking to use pd.rolling_mean in a groupby operation. I want to have, in each group, a rolling mean of the previous elements within the same group. Here is an example:
id val
0 1
0 2
0 3
1 4
1 5
2 6
Grouping by id, this should be transformed into:
id val
0 nan
0 1
0 1.5
1 nan
1 4
2 nan
I believe you want pd.Series.expanding:
df.groupby('id').val.apply(lambda x: x.expanding().mean().shift())
0 NaN
1 1.0
2 1.5
3 NaN
4 4.0
5 NaN
Name: val, dtype: float64
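The trailing shift() pushes each expanding mean down one row, so every row sees only the elements before it in its group. A sketch for writing the result back to the frame with transform, which preserves row alignment:
df['val'] = df.groupby('id')['val'].transform(lambda x: x.expanding().mean().shift())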
I think you need groupby with shift and rolling; the window size can be set to a scalar:
df['val']=df.groupby('id')['val'].apply(lambda x: x.shift().rolling(2, min_periods=1).mean())
print (df)
id val
0 0 NaN
1 0 1.0
2 0 1.5
3 1 NaN
4 1 4.0
5 2 NaN
Thank you 3novak for the comment - you can set the window size to the maximum group length:
f = lambda x: x.shift().rolling(df['id'].value_counts().iloc[0], min_periods=1).mean()
df['val'] = df.groupby('id')['val'].apply(f)
print (df)
id val
0 0 NaN
1 0 1.0
2 0 1.5
3 1 NaN
4 1 4.0
5 2 NaN
