How to count consecutive value change to zero in column python - python-3.x

from dataframe want to check how many times value change to zero in columns.
here is input df
pd.DataFrame({'value1':[3,4,7,0,11,20,0,20,15,16],
'value2':[2,2,0,8,8,2,2,2,5,5],
'value3':[7,10,20,4008,0,1,4820,1,1,1]})
value1 value2 value3
0 3 2 7
1 4 2 10
2 7 0 20
3 0 8 4008
4 11 8 0
5 20 2 1
6 0 2 4820
7 20 2 1
8 15 5 1
9 16 5 1
desired output:
df_out=pd.DataFrame({'value1_count':[2],
'value2_count':[1],
'value3_ount':[1]})
value1_count value2_count value3_ount
0 2 1 1

Try this
df.eq(0).astype(int).diff().eq(-1).sum()
Out[77]:
value1 2
value2 1
value3 1
dtype: int64
To get exact your output, just add the following
df.eq(0).astype(int).diff().eq(-1).sum().to_frame().T.add_suffix('_count')
Out[85]:
value1_count value2_count value3_count
0 2 1 1

Here's something you can do
df_out=pd.DataFrame({'value1_count':[df['value1'].value_counts()[0]],'value2_count':[df['value2'].value_counts()[0]],'value3_count':[df['value3'].value_counts()[0]]})
Output
value1_count value2_count value3_count
0 2 1 1
.value_counts() returns a pandas.Series object with the frequency of all the values, the index being the value. So at index [0] you find the frequency of zeros in the column.

>>> columns_name = ['value1_count','value2_count','value3_ount']
>>> df_out = pd.DataFrame((df==0).sum().values.reshape(1,-1), columns=columns_name )
>>> df_out
value1_count value2_count value3_ount
0 2 1 1

Related

Pandas dataframe: Count no of rows which meet a set of conditions across multiple columns [duplicate]

I have a dataframe(edata) as given below
Domestic Catsize Type Count
1 0 1 1
1 1 1 8
1 0 2 11
0 1 3 14
1 1 4 21
0 1 4 31
From this dataframe I want to calculate the sum of all counts where the logical AND of both variables (Domestic and Catsize) results in Zero (0) such that
1 0 0
0 1 0
0 0 0
The code I use to perform the process is
g=edata.groupby('Type')
q3=g.apply(lambda x:x[((x['Domestic']==0) & (x['Catsize']==0) |
(x['Domestic']==0) & (x['Catsize']==1) |
(x['Domestic']==1) & (x['Catsize']==0)
)]
['Count'].sum()
)
q3
Type
1 1
2 11
3 14
4 31
This code works fine, however, if the number of variables in the dataframe increases then the number of conditions grows rapidly. So, is there a smart way to write a condition that states that if the ANDing the two (or more) variables result in a zero then perform the sum() function
You can filter first using pd.DataFrame.all negated:
cols = ['Domestic', 'Catsize']
res = df[~df[cols].all(1)].groupby('Type')['Count'].sum()
print(res)
# Type
# 1 1
# 2 11
# 3 14
# 4 31
# Name: Count, dtype: int64
Use np.logical_and.reduce to generalise.
columns = ['Domestic', 'Catsize']
df[~np.logical_and.reduce(df[columns], axis=1)].groupby('Type')['Count'].sum()
Type
1 1
2 11
3 14
4 31
Name: Count, dtype: int64
Before adding it back, use map to broadcast:
u = df[~np.logical_and.reduce(df[columns], axis=1)].groupby('Type')['Count'].sum()
df['NewCol'] = df.Type.map(u)
df
Domestic Catsize Type Count NewCol
0 1 0 1 1 1
1 1 1 1 8 1
2 1 0 2 11 11
3 0 1 3 14 14
4 1 1 4 21 31
5 0 1 4 31 31
how about
columns = ['Domestic', 'Catsize']
df.loc[~df[columns].prod(axis=1).astype(bool), 'Count']
and then do with it whatever you want.
for logical AND the product does the trick nicely.
for logcal OR you can use sum(axis=1) with proper negation in advance.

Counting of elements in list not producing expected result

I have a pandas dataframe dftouse['col a'] which consists of lists :
0 []
1 [carbon, nature]
2 [Lincoln]
3 [CBDC]
4 []
I want to count elements in each list across the rows. When I am executing
dftouse['Col a'].apply(lambda x: (len(x)-1) if not x else len(x))
0 1
1 2
2 1
3 1
4 1
Expected :
0 0
1 2
2 1
3 1
4 0
It would be helpful if I can get help debugging this. TIA
Use len on apply:
>>> dftouse['Col a'].apply(len)
0 0
1 2
2 1
3 1
4 0
Name: Col a, dtype: int64
You can also use str.len() on the column
>>> dftouse['col a'].str.len()
0 0
1 2
2 1
3 1
4 0
Name: col a, dtype: int64

How to take mean of 3 values before flag change 0 to 1python

I have dataframe with columns A,B and flag. I want to calculate mean of 2 values before flag change from 0 to 1 , and record value when flag change from 0 to 1 and record value when flag changes from 1 to 0.
# Input dataframe
df=pd.DataFrame({'A':[1,3,4,7,8,11,1,15,20,15,16,87],
'B':[1,3,4,6,8,11,1,19,20,15,16,87],
'flag':[0,0,0,0,1,1,1,0,0,0,0,0]})
# Expected output
df_out=df=pd.DataFrame({'A_mean_before_flag_change':[5.5],
'B_mean_before_flag_change':[5],
'A_value_before_change_flag':[7],
'B_value_before_change_flag':[6]})
I try to create more general solution:
df=pd.DataFrame({'A':[1,3,4,7,8,11,1,15,20,15,16,87],
'B':[1,3,4,6,8,11,1,19,20,15,16,87],
'flag':[0,0,0,0,1,1,1,0,0,1,0,1]})
print (df)
A B flag
0 1 1 0
1 3 3 0
2 4 4 0
3 7 6 0
4 8 8 1
5 11 11 1
6 1 1 1
7 15 19 0
8 20 20 0
9 15 15 1
10 16 16 0
11 87 87 1
First create groups by mask for 0 with next 1 values of flag:
m1 = df['flag'].eq(0) & df['flag'].shift(-1).eq(1)
df['g'] = m1.iloc[::-1].cumsum()
print (df)
A B flag g
0 1 1 0 3
1 3 3 0 3
2 4 4 0 3
3 7 6 0 3
4 8 8 1 2
5 11 11 1 2
6 1 1 1 2
7 15 19 0 2
8 20 20 0 2
9 15 15 1 1
10 16 16 0 1
11 87 87 1 0
then filter out groups with size less like N:
N = 4
df1 = df[df['g'].map(df['g'].value_counts()).ge(N)].copy()
print (df1)
A B flag g
0 1 1 0 3
1 3 3 0 3
2 4 4 0 3
3 7 6 0 3
4 8 8 1 2
5 11 11 1 2
6 1 1 1 2
7 15 19 0 2
8 20 20 0 2
Filter last N rows:
df2 = df1.groupby('g').tail(N)
And aggregate last with mean:
d = {'mean':'_mean_before_flag_change', 'last': '_value_before_change_flag'}
df3 = df2.groupby('g')['A','B'].agg(['mean','last']).sort_index(axis=1, level=1).rename(columns=d)
df3.columns = df3.columns.map(''.join)
print (df3)
A_value_before_change_flag B_value_before_change_flag \
g
2 20 20
3 7 6
A_mean_before_flag_change B_mean_before_flag_change
g
2 11.75 12.75
3 3.75 3.50
I'm assuming that this needs to work for cases with more than one rising edge and that the consecutive values and averages get appended to the output lists:
# the first step is to extract the rising and falling edges using diff(), identify sections and length
df['flag_diff'] = df.flag.diff().fillna(0)
df['flag_sections'] = (df.flag_diff != 0).cumsum()
df['flag_sum'] = df.flag.groupby(df.flag_sections).transform('sum')
# then you can get the relevant indices by checking for the rising edges
rising_edges = df.index[df.flag_diff==1.0]
val_indices = [i-1 for i in rising_edges]
avg_indices = [(i-2,i-1) for i in rising_edges]
# and finally iterate over the relevant sections
df_out = pd.DataFrame()
df_out['A_mean_before_flag_change'] = [df.A.loc[tpl[0]:tpl[1]].mean() for tpl in avg_indices]
df_out['B_mean_before_flag_change'] = [df.B.loc[tpl[0]:tpl[1]].mean() for tpl in avg_indices]
df_out['A_value_before_change_flag'] = [df.A.loc[idx] for idx in val_indices]
df_out['B_value_before_change_flag'] = [df.B.loc[idx] for idx in val_indices]
df_out['length'] = [df.flag_sum.loc[idx] for idx in rising_edges]
df_out.index = rising_edges

How to check value change in column

in my dataframe have three columns columns value ,ID and distance . i want to check in ID column when its changes from 2 to any other value count rows and record first value and last value when 2 changes to other value and save and also save corresponding value of column distance when change from 2 to other in ID column.
df=pd.DataFrame({'value':[3,4,7,8,11,20,15,20,15,16],'ID':[2,2,8,8,8,2,2,2,5,5],'distance':[0,0,1,0,0,0,0,0,0,0]})
print(df)
value ID distance
0 3 2 0
1 4 2 0
2 7 8 1
3 8 8 0
4 11 8 0
5 20 2 0
6 15 2 0
7 20 2 0
8 15 5 0
9 16 5 0
required results:
df_out=pd.DataFrame({'rows_Count':[3,2],'value_first':[7,15],'value_last':[11,16],'distance_first':[1,0]})
print(df_out)
rows_Count value_first value_last distance_first
0 3 7 11 1
1 2 15 16 0
Use:
#compare by 2
m = df['ID'].eq(2)
#filter out data before first 2 (in sample data not, in real data possible)
df = df[m.cumsum().ne(0)]
#create unique groups for non 2 groups, add misisng values by reindex
s = m.ne(m.shift()).cumsum()[~m].reindex(df.index)
#aggregate with helper s Series
df1 = df.groupby(s).agg({'ID':'size', 'value':['first','last'], 'distance':'first'})
#flatten MultiIndex
df1.columns = df1.columns.map('_'.join)
df1 = df1.reset_index(drop=True)
print (df1)
ID_size value_first value_last distance_first
0 3 7 11 1
1 2 15 16 0
Verify in changed data (not only 2 first group):
df=pd.DataFrame({'value':[3,4,7,8,11,20,15,20,15,16],
'ID':[1,7,8,8,8,2,2,2,5,5],
'distance':[0,0,1,0,0,0,0,0,0,0]})
print(df)
value ID distance
0 3 1 0 <- changed ID
1 4 7 0 <- changed ID
2 7 8 1
3 8 8 0
4 11 8 0
5 20 2 0
6 15 2 0
7 20 2 0
8 15 5 0
9 16 5 0
#compare by 2
m = df['ID'].eq(2)
#filter out data before first 2 (in sample data not, in real data possible)
df = df[m.cumsum().ne(0)]
#create unique groups for non 2 groups, add misisng values by reindex
s = m.ne(m.shift()).cumsum()[~m].reindex(df.index)
#aggregate with helper s Series
df1 = df.groupby(s).agg({'ID':'size', 'value':['first','last'], 'distance':'first'})
#flatten MultiIndex
df1.columns = df1.columns.map('_'.join)
df1 = df1.reset_index(drop=True)
print (df1)
ID_size value_first value_last distance_first
0 2 15 16 0

In Python Pandas using cumsum with groupby and reset of cumsum when value is 0

I'm rather new at python.
I try to have a cumulative sum for each client to see the consequential months of inactivity (flag: 1 or 0). The cumulative sum of the 1's need therefore to be reset when we have a 0. The reset need to happen as well when we have a new client. See below with example where a is the column of clients and b are the dates.
After some research, I found the question 'Cumsum reset at NaN' and 'In Python Pandas using cumsum with groupby'. I assume that I kind of need to put them together.
Adapting the code of 'Cumsum reset at NaN' to the reset towards 0, is successful:
cumsum = v.cumsum().fillna(method='pad')
reset = -cumsum[v.isnull() !=0].diff().fillna(cumsum)
result = v.where(v.notnull(), reset).cumsum()
However, I don't succeed at adding a groupby. My count just goes on...
So, a dataset would be like this:
import pandas as pd
df = pd.DataFrame({'a' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
'b' : [1/15,2/15,3/15,4/15,5/15,6/15,1/15,2/15,3/15,4/15,5/15,6/15],
'c' : [1,0,1,0,1,1,0,1,1,0,1,1,1,1]})
this should result in a dataframe with the columns a, b, c and d with
'd' : [1,0,1,0,1,2,0,1,2,0,1,2,3,4]
Please note that I have a very large dataset, so calculation time is really important.
Thank you for helping me
Use groupby.apply and cumsum after finding contiguous values in the groups. Then groupby.cumcount to get the integer counting upto each contiguous value and add 1 later.
Multiply with the original row to create the AND logic cancelling all zeros and only considering positive values.
df['d'] = df.groupby('a')['c'] \
.apply(lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
print(df['d'])
0 1
1 0
2 1
3 0
4 1
5 2
6 0
7 1
8 2
9 0
10 1
11 2
12 3
13 4
Name: d, dtype: int64
Another way of doing would be to apply a function after series.expanding on the groupby object which basically computes values on the series starting from the first index upto that current index.
Use reduce later to apply function of two args cumulatively to the items of iterable so as to reduce it to a single value.
from functools import reduce
df.groupby('a')['c'].expanding() \
.apply(lambda i: reduce(lambda x, y: x+1 if y==1 else 0, i, 0))
a
1 0 1.0
1 0.0
2 1.0
3 0.0
4 1.0
5 2.0
6 0.0
2 7 1.0
8 2.0
9 0.0
10 1.0
11 2.0
12 3.0
13 4.0
Name: c, dtype: float64
Timings:
%%timeit
df.groupby('a')['c'].apply(lambda x: x * (x.groupby((x != x.shift()).cumsum()).cumcount() + 1))
100 loops, best of 3: 3.35 ms per loop
%%timeit
df.groupby('a')['c'].expanding().apply(lambda s: reduce(lambda x, y: x+1 if y==1 else 0, s, 0))
1000 loops, best of 3: 1.63 ms per loop
I think you need custom function with groupby:
#change row with index 6 to 1 for better testing
df = pd.DataFrame({'a' : [1,1,1,1,1,1,1,2,2,2,2,2,2,2],
'b' : [1/15,2/15,3/15,4/15,5/15,6/15,1/15,2/15,3/15,4/15,5/15,6/15,7/15,8/15],
'c' : [1,0,1,0,1,1,1,1,1,0,1,1,1,1],
'd' : [1,0,1,0,1,2,3,1,2,0,1,2,3,4]})
print (df)
a b c d
0 1 0.066667 1 1
1 1 0.133333 0 0
2 1 0.200000 1 1
3 1 0.266667 0 0
4 1 0.333333 1 1
5 1 0.400000 1 2
6 1 0.066667 1 3
7 2 0.133333 1 1
8 2 0.200000 1 2
9 2 0.266667 0 0
10 2 0.333333 1 1
11 2 0.400000 1 2
12 2 0.466667 1 3
13 2 0.533333 1 4
def f(x):
x.ix[x.c == 1, 'e'] = 1
a = x.e.notnull()
x.e = a.cumsum()-a.cumsum().where(~a).ffill().fillna(0).astype(int)
return (x)
print (df.groupby('a').apply(f))
a b c d e
0 1 0.066667 1 1 1
1 1 0.133333 0 0 0
2 1 0.200000 1 1 1
3 1 0.266667 0 0 0
4 1 0.333333 1 1 1
5 1 0.400000 1 2 2
6 1 0.066667 1 3 3
7 2 0.133333 1 1 1
8 2 0.200000 1 2 2
9 2 0.266667 0 0 0
10 2 0.333333 1 1 1
11 2 0.400000 1 2 2
12 2 0.466667 1 3 3
13 2 0.533333 1 4 4

Resources