Column x in my dataframe contains only 0 and 1. I want to create a variable y that counts zeros and resets whenever a 1 appears in x. I'm getting the error "The truth value of a Series is ambiguous."
count = 1
countList = [0]
for x in df['x']:
    if df['x'] == 0:
        count = count + 1
        df['y'] = count
    else:
        df['y'] = 1
        count = 1
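For reference, the error comes from `if df['x'] == 0`: that compares the whole column at once rather than the current element. A minimal sketch of the loop using the scalar loop variable instead (a vectorized approach is still preferable for speed):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1]})

count = 0
counts = []
for x in df['x']:
    if x == 0:           # compare the scalar element, not the whole Series
        count += 1
    else:
        count = 0        # reset whenever a 1 appears
    counts.append(count)
df['y'] = counts
```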
First, don't loop in pandas if a vectorized solution exists, because loops are slow.
I think you need to count consecutive 0 values:
df = pd.DataFrame({'x':[1,0,0,1,1,0,1,0,0,0,1,1,0,0,0,0,1]})
a = df['x'].eq(0)
b = a.cumsum()
df['y'] = (b-b.mask(a).ffill().fillna(0).astype(int))
print (df)
x y
0 1 0
1 0 1
2 0 2
3 1 0
4 1 0
5 0 1
6 1 0
7 0 1
8 0 2
9 0 3
10 1 0
11 1 0
12 0 1
13 0 2
14 0 3
15 0 4
16 1 0
Detail + explanation:
#compare by zero
a = df['x'].eq(0)
#cumulative sum of mask
b = a.cumsum()
#replace Trues to NaNs
c = b.mask(a)
#forward fill NaNs
d = b.mask(a).ffill()
#First NaNs to 0 and cast to integers
e = b.mask(a).ffill().fillna(0).astype(int)
#subtract from cumulative sum Series
y = b - e
df = pd.concat([df['x'], a, b, c, d, e, y], axis=1, keys=('x','a','b','c','d','e', 'y'))
print (df)
x a b c d e y
0 0 True 1 NaN NaN 0 1
1 0 True 2 NaN NaN 0 2
2 0 True 3 NaN NaN 0 3
3 1 False 3 3.0 3.0 3 0
4 1 False 3 3.0 3.0 3 0
5 0 True 4 NaN 3.0 3 1
6 1 False 4 4.0 4.0 4 0
7 0 True 5 NaN 4.0 4 1
8 0 True 6 NaN 4.0 4 2
9 0 True 7 NaN 4.0 4 3
10 1 False 7 7.0 7.0 7 0
11 1 False 7 7.0 7.0 7 0
12 0 True 8 NaN 7.0 7 1
13 0 True 9 NaN 7.0 7 2
14 0 True 10 NaN 7.0 7 3
15 0 True 11 NaN 7.0 7 4
16 1 False 11 11.0 11.0 11 0
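As an aside, the same counter can also be built with a single groupby: each 1 starts a new group, and a cumulative sum of the zero-mask restarts inside each group (a sketch, not part of the original answer):

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1]})

# each non-zero value starts a new group; the cumulative sum of the
# zero-mask then restarts from 0 inside every group
grp = df['x'].ne(0).cumsum()
df['y'] = df['x'].eq(0).astype(int).groupby(grp).cumsum()
```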
Related
I have three columns in a dataframe, X1 X2 X3. I want to count the rows where the value changes from a value greater than 1 to 0; if the value before the 0 is less than 1, it should not be counted.
Input df:
df1=pd.DataFrame({'x1':[3,4,7,0,0,0,0,20,15,16,0,0,70],
'X2':[3,4,7,0,0,0,0,20,15,16,0,0,70],
'X3':[6,3,0.5,0,0,0,0,20,15,16,0,0,70]})
print(df1)
x1 X2 X3
0 3 3 6.0
1 4 4 3.0
2 7 7 0.5
3 0 0 0.0
4 0 0 0.0
5 0 0 0.0
6 0 0 0.0
7 20 20 20.0
8 15 15 15.0
9 16 16 16.0
10 0 0 0.0
11 0 0 0.0
12 70 70 70.0
Desired output:
x1_count X2_count X3_count
0 6 6 2
The idea is to replace 0 with missing values, forward fill them, convert all other values back to NaN, compare with greater than 1, and count the Trues with sum; the resulting Series is then converted to a one-row DataFrame by transposing:
m = df1.eq(0)
df2 = (df1.mask(m)
.ffill()
.where(m)
.gt(1)
.sum()
.add_suffix('_count')
.to_frame()
.T
)
print (df2)
x1_count X2_count X3_count
0 6 6 2
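To see what each chained step does, here is the pipeline unrolled on the x1 column alone (same data; the intermediate names are mine):

```python
import pandas as pd

s = pd.Series([3, 4, 7, 0, 0, 0, 0, 20, 15, 16, 0, 0, 70], name='x1')
m = s.eq(0)

filled = s.mask(m).ffill()      # zeros become NaN, then the last non-zero value is carried forward
only_zeros = filled.where(m)    # keep that carried value only at the original zero positions
count = only_zeros.gt(1).sum()  # zeros whose preceding value was greater than 1
```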
I have a pandas dataframe as below,
flag a b c
0 1 5 1 3
1 1 2 1 3
2 1 3 0 3
3 1 4 0 3
4 1 5 5 3
5 1 6 0 3
6 1 7 0 3
7 2 6 1 4
8 2 2 1 4
9 2 3 1 4
10 2 4 1 4
I want to create a column 'd' based on the conditions below:
1) For the first row of each flag, if a>c, then d = b, else d = NaN
2) For non-first rows of each flag, if (a>c) & ((the previous row of d is NaN) | (b > the previous row of d)), then d = b, else d = the previous row of d
I am expecting the below output:
flag a b c d
0 1 5 1 3 1
1 1 2 1 3 1
2 1 3 0 3 1
3 1 4 0 3 1
4 1 5 5 3 5
5 1 6 0 3 5
6 1 7 0 3 5
7 2 6 1 4 1
8 2 2 1 4 1
9 2 3 1 4 1
10 2 4 1 4 1
Here's how I would translate your logic:
import numpy as np

df['d'] = np.nan
# first row of flag
s = df.flag.ne(df.flag.shift())
# where a > c
a_gt_c = df['a'].gt(df['c'])
# fill the first rows with a > c
df.loc[s & a_gt_c, 'd'] = df['b']
# mask for second fill
mask = ((~s) # not first rows
& a_gt_c # a > c
& (df['d'].shift().isna() # previous d is null
| df['b'].gt(df['d']).shift()) # or previous row had b > d
)
# fill those values:
df.loc[mask, 'd'] = df['b']
# ffill for the rest
df['d'] = df['d'].ffill()
Output:
flag a b c d
0 1 5 1 3 1.0
1 1 2 1 3 1.0
2 1 3 0 3 1.0
3 1 4 0 3 0.0
4 1 5 5 3 5.0
5 1 6 0 3 0.0
6 1 7 0 3 0.0
7 2 6 1 4 1.0
8 2 2 1 4 1.0
9 2 3 1 4 1.0
10 2 4 1 4 1.0
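Because each d depends on the previous row's d, the recurrence is hard to vectorize exactly; a plain per-group loop is a safe cross-check (a sketch that follows the stated conditions literally, not the original answer's method):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'flag': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2],
                   'a': [5, 2, 3, 4, 5, 6, 7, 6, 2, 3, 4],
                   'b': [1, 1, 0, 0, 5, 0, 0, 1, 1, 1, 1],
                   'c': [3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4]})

def fill_d(g):
    out, prev = [], np.nan
    for a, b, c in zip(g['a'], g['b'], g['c']):
        # update d only when a > c and (no previous d yet, or b beats it);
        # otherwise carry the previous d forward
        if a > c and (np.isnan(prev) or b > prev):
            prev = b
        out.append(prev)
    return pd.Series(out, index=g.index)

df['d'] = df.groupby('flag', group_keys=False).apply(fill_d)
```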
I'm new to pandas and python in general. I'm pulling data from an Access database and creating a pivot table.
PTable = TRep.pivot_table(values = ['Students'],
index = ['GradeLevel', 'Class'],
columns = ['Grade'],
aggfunc='count', fill_value=0, margins=True, dropna=True,
margins_name='Grand Total')
Grade will always be A, B, C, D, F - And I want the resulting pivot table to always show columns for those 5 grades even if there are 0 students with that grade.
Currently, if the list of students pulled from Access does not contain a student receiving a C (for example), the resulting pivot table will have the C column omitted.
Is there a way to define constant columns in a pivot table?
What I have tried:
This is my sample data:
GradeLevel Class Student Grade
0 I 1 AAA A
1 I 2 BBB B
2 I 2 CCC D
3 I 3 DDD E
4 I 4 EEE A
5 II 1 FFF B
6 II 2 GGG A
7 II 3 HHH B
8 II 4 KKK D
9 II 1 LLL D
10 II 2 MMM E
11 III 1 NNN E
12 III 2 OOO A
13 III 2 PPP A
14 III 3 QQQ A
Change the Grade column to a categorical type.
df["Grade"] = df["Grade"].astype('category')
Set the categories of the Grade column.
df["Grade"] = df["Grade"].cat.set_categories(["A", "B", "C", "D", "E"])
Pivot the data:
df.pivot_table(values = ["Student"], index = ["GradeLevel", "Class"],
columns = ["Grade"], aggfunc='count', fill_value=0,
margins=True, dropna=False, margins_name='Grand Total')
Result:
Student
Grade A B C D E Grand Total
GradeLevel Class
I 1 1 0 0 0 0 1.0
2 0 1 0 1 0 2.0
3 0 0 0 0 1 1.0
4 1 0 0 0 0 1.0
II 1 0 1 0 1 0 2.0
2 1 0 0 0 1 2.0
3 0 1 0 0 0 1.0
4 0 0 0 1 0 1.0
III 1 0 0 0 0 1 1.0
2 2 0 0 0 0 2.0
3 1 0 0 0 0 1.0
4 0 0 0 0 0 NaN
Grand Total 6 3 0 3 3 15.0
But the pivot table still contains a NaN value (the empty III / 4 row). To remove it:
(df.pivot_table(values = ["Student"], index = ["GradeLevel", "Class"],
columns = ["Grade"], aggfunc='count', fill_value=0,
margins=True, dropna=False, margins_name='Grand Total')).dropna()
Result:
Student
Grade A B C D E Grand Total
GradeLevel Class
I 1 1 0 0 0 0 1.0
2 0 1 0 1 0 2.0
3 0 0 0 0 1 1.0
4 1 0 0 0 0 1.0
II 1 0 1 0 1 0 2.0
2 1 0 0 0 1 2.0
3 0 1 0 0 0 1.0
4 0 0 0 1 0 1.0
III 1 0 0 0 0 1 1.0
2 2 0 0 0 0 2.0
3 1 0 0 0 0 1.0
Grand Total 6 3 0 3 3 15.0
Hope it is useful...
I suppose you can always apply some 'touch-up' to the df once it's been created. For example, you can add a column and fill it with NaN, i.e. df['C'] = np.nan
Simply convert the grades column to categorical, specifying all possible values.
TRep['Grade'] = pd.Categorical(TRep['Grade'], ['A', 'B', 'C', 'D', 'F'])
Then pass dropna=False to pivot_table and it'll include all the columns.
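A minimal, self-contained sketch of this categorical approach (the small frame here just stands in for TRep; margins omitted for brevity):

```python
import pandas as pd

df = pd.DataFrame({'Class': [1, 1, 2],
                   'Grade': ['A', 'B', 'A'],
                   'Students': ['AAA', 'BBB', 'CCC']})

# declare every possible grade up front so empty grades still become columns
df['Grade'] = pd.Categorical(df['Grade'], ['A', 'B', 'C', 'D', 'F'])

out = df.pivot_table(values='Students', index='Class', columns='Grade',
                     aggfunc='count', fill_value=0,
                     dropna=False, observed=False)
```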
I have the following test code:
import pandas as pd
import numpy as np
df = pd.DataFrame({'MONTH': [1,2,3,1,1,1,1,1,1,2,3,2,2,3,2,1,1,1,1,1,1,1],
'HOUR': [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0],
'CIGFT': [np.NaN,12000,2500,73300,73300,np.NaN,np.NaN,np.NaN,np.NaN,12000,100,100,15000,2500,np.NaN,15000,11000,np.NaN,np.NaN,np.NaN,np.NaN,np.NaN]})
cigs = pd.DataFrame()
cigs['cigsum'] = df.groupby(['MONTH','HOUR'])['CIGFT'].apply(lambda c: (c>=0.0).sum())
cigs['cigcount'] = df.groupby(['MONTH','HOUR'])['CIGFT'].apply(lambda c: (c>=0.0).count())
df.fillna(value='-', inplace=True)
cigs['cigminus'] = df.groupby(['MONTH','HOUR'])['CIGFT'].apply(lambda c: (c>=0.0).sum())
tfile = open('test_COUNT_manual.txt', 'a')
tfile.write(cigs.to_string())
tfile.close()
I wind up with the following results:
The dataframe:
CIGFT HOUR MONTH
0 NaN 0 1
1 12000.0 0 2
2 2500.0 0 3
3 73300.0 0 1
4 73300.0 0 1
5 NaN 0 1
6 NaN 0 1
7 NaN 0 1
8 NaN 0 1
9 12000.0 0 2
10 100.0 0 3
11 100.0 0 2
12 15000.0 0 2
13 2500.0 0 3
14 NaN 0 2
15 15000.0 0 1
16 11000.0 0 1
17 NaN 0 1
18 NaN 0 1
19 NaN 0 1
20 NaN 0 1
21 NaN 0 1
The results in the write to file:
cigsum cigcount cigminus
MONTH HOUR
1 0 4 14 14
2 0 4 5 5
3 0 3 3 3
My issue is that .sum() is not summing the values; it is counting the non-null values. When I replace the null values with a minus sign, .sum() produces the same result as count().
So what do I use to get the sum of the values if .sum() does not do it?
Series.sum() returns the sum of the series values, excluding NA/null values by default, as mentioned in the official docs.
Your lambda compares the Series to 0.0 first, which produces a boolean Series, so .sum() counts the Trues. Apply sum to the Series itself inside the lambda and you will get the correct result.
Do this:
cigs['cigsum'] = df.groupby(['MONTH','HOUR'])['CIGFT'].apply(lambda c: c.sum())
Result of this code will be,
MONTH HOUR
1 0 172600.0
2 0 39100.0
3 0 5100.0
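In fact the lambda isn't needed at all; calling .sum() on the grouped column directly skips NaN by default and gives the same numbers:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'MONTH': [1,2,3,1,1,1,1,1,1,2,3,2,2,3,2,1,1,1,1,1,1,1],
                   'HOUR': [0] * 22,
                   'CIGFT': [np.nan, 12000, 2500, 73300, 73300, np.nan, np.nan,
                             np.nan, np.nan, 12000, 100, 100, 15000, 2500,
                             np.nan, 15000, 11000, np.nan, np.nan, np.nan,
                             np.nan, np.nan]})

# NaNs are excluded automatically; no lambda required
sums = df.groupby(['MONTH', 'HOUR'])['CIGFT'].sum()
```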
I am looking to use pd.rolling_mean in a groupby operation. In each group I want a rolling mean of the previous elements within the same group. Here is an example:
id val
0 1
0 2
0 3
1 4
1 5
2 6
Grouping by id, this should be transformed into:
id val
0 nan
0 1
0 1.5
1 nan
1 4
2 nan
I believe you want pd.Series.expanding:
df.groupby('id').val.apply(lambda x: x.expanding().mean().shift())
0 NaN
1 1.0
2 1.5
3 NaN
4 4.0
5 NaN
Name: val, dtype: float64
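A self-contained sketch of the same idea using transform, which guarantees the result aligns with the original index (the helper column name is mine):

```python
import pandas as pd

df = pd.DataFrame({'id': [0, 0, 0, 1, 1, 2], 'val': [1, 2, 3, 4, 5, 6]})

# mean of all prior values within each id group: expanding mean, shifted one row
df['prev_mean'] = df.groupby('id')['val'].transform(
    lambda s: s.expanding().mean().shift())
```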
I think you need groupby with shift and rolling; the window size can be set to a scalar:
df['val']=df.groupby('id')['val'].apply(lambda x: x.shift().rolling(2, min_periods=1).mean())
print (df)
id val
0 0 NaN
1 0 1.0
2 0 1.5
3 1 NaN
4 1 4.0
5 2 NaN
Thank you 3novak for the comment - you can set the window size to the maximum group length:
f = lambda x: x.shift().rolling(df['id'].value_counts().iloc[0], min_periods=1).mean()
df['val'] = df.groupby('id')['val'].apply(f)
print (df)
id val
0 0 NaN
1 0 1.0
2 0 1.5
3 1 NaN
4 1 4.0
5 2 NaN