Pandas dataframe: Count the number of rows that meet a set of conditions across multiple columns [duplicate] - python-3.x

I have a dataframe (edata) as given below:
Domestic Catsize Type Count
1 0 1 1
1 1 1 8
1 0 2 11
0 1 3 14
1 1 4 21
0 1 4 31
From this dataframe I want to calculate the sum of all counts where the logical AND of the two variables (Domestic and Catsize) is zero (0), i.e. for these combinations:
Domestic  Catsize  AND
1         0        0
0         1        0
0         0        0
The code I use to perform the process is:
g = edata.groupby('Type')
q3 = g.apply(lambda x: x[((x['Domestic'] == 0) & (x['Catsize'] == 0) |
                          (x['Domestic'] == 0) & (x['Catsize'] == 1) |
                          (x['Domestic'] == 1) & (x['Catsize'] == 0))]['Count'].sum())
q3
Type
1 1
2 11
3 14
4 31
This code works fine; however, if the number of variables in the dataframe increases, the number of conditions grows rapidly. So, is there a smart way to express the condition that, whenever ANDing the two (or more) variables results in zero, the sum() should be taken?

You can filter first by negating pd.DataFrame.all, which is True only where every flag column equals 1, so the same line works for any number of columns:
cols = ['Domestic', 'Catsize']
res = df[~df[cols].all(1)].groupby('Type')['Count'].sum()
print(res)
# Type
# 1 1
# 2 11
# 3 14
# 4 31
# Name: Count, dtype: int64

Use np.logical_and.reduce to generalise.
columns = ['Domestic', 'Catsize']
df[~np.logical_and.reduce(df[columns], axis=1)].groupby('Type')['Count'].sum()
Type
1 1
2 11
3 14
4 31
Name: Count, dtype: int64
To add the result back to the original frame, use map to broadcast the per-Type sums (u is indexed by Type):
u = df[~np.logical_and.reduce(df[columns], axis=1)].groupby('Type')['Count'].sum()
df['NewCol'] = df.Type.map(u)
df
Domestic Catsize Type Count NewCol
0 1 0 1 1 1
1 1 1 1 8 1
2 1 0 2 11 11
3 0 1 3 14 14
4 1 1 4 21 31
5 0 1 4 31 31

How about:
columns = ['Domestic', 'Catsize']
df.loc[~df[columns].prod(axis=1).astype(bool), 'Count']
and then do with it whatever you want.
For logical AND, the product does the trick nicely.
For logical OR, you can use sum(axis=1) with proper negation in advance.
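For example, selecting the rows where the OR of the flag columns is zero (i.e. every flag is 0) could look like the following (a minimal sketch along those lines, not from the original answer):
# True where at least one of the flag columns is 1, i.e. the logical OR
mask_or = df[columns].sum(axis=1).astype(bool)
# negate to keep the rows whose OR is 0
df.loc[~mask_or, 'Count']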

How to sort pandas rows based on column values

In this dataframe:
Feat1 Feat2 Feat3 Feat4 Labels
-46.220314 22.862856 -6.1573067 5.6060414 2
-23.80669 20.536781 -5.015675 4.2216353 2
-42.092365 25.680704 -5.0092897 5.665794 2
-35.29639 21.709473 -4.160352 5.578346 2
-37.075096 22.347767 -3.860426 5.6953945 2
-42.8849 28.03802 -7.8572545 3.3361 2
-32.3057 26.568039 -9.47018 3.4532788 2
-24.469942 27.005375 -9.301921 4.3995037 2
-97.89892 -0.38156664 6.4163384 7.234347 1
-81.96325 0.1821717 -1.2870358 4.703838 1
-78.41986 -6.766374 0.8001185 0.83444935 1
-100.68544 -4.5810957 1.6977689 1.8801615 1
-87.05412 -2.9231584 6.817379 5.4460077 1
-64.121056 -3.7892206 -0.283514 6.3084154 1
-94.504845 -0.9999217 3.2884297 6.881124 1
-61.951996 -8.960198 -1.5915259 5.6160254 1
-108.19452 13.909201 0.6966458 -1.956591 0
-97.4037 22.897585 -2.8488266 1.4105041 0
-92.641335 22.10624 -3.5110545 2.467166 0
-199.18787 3.3090565 -2.5994794 4.0802555 0
-137.5976 6.795896 1.6793671 2.2256763 0
-208.0035 -1.33229 -3.2078092 1.5177402 0
-108.225975 14.341716 1.02891 -1.8651972 0
-121.29299 18.274035 2.2891548 2.3360753 0
I want to sort the rows based on the values in the "Labels" column.
I am able to sort in ascending order, so that the labels appear as [0 1 2], via the command
df2 = df1.sort_values(by = 'Labels', ascending = True)
With ascending = False, the labels appear as [2 1 0].
How then do I go about sorting the labels as [1 0 2]?
Any help will be greatly appreciated!
Here's a way using Categorical:
df['Labels'] = pd.Categorical(df['Labels'],
                              categories=[1, 0, 2],
                              ordered=True)
df.sort_values('Labels')
Output:
Feat1 Feat2 Feat3 Feat4 Labels
11 -100.685440 -4.581096 1.697769 1.880162 1
15 -61.951996 -8.960198 -1.591526 5.616025 1
8 -97.898920 -0.381567 6.416338 7.234347 1
9 -81.963250 0.182172 -1.287036 4.703838 1
10 -78.419860 -6.766374 0.800118 0.834449 1
14 -94.504845 -0.999922 3.288430 6.881124 1
12 -87.054120 -2.923158 6.817379 5.446008 1
13 -64.121056 -3.789221 -0.283514 6.308415 1
21 -208.003500 -1.332290 -3.207809 1.517740 0
20 -137.597600 6.795896 1.679367 2.225676 0
19 -199.187870 3.309057 -2.599479 4.080255 0
18 -92.641335 22.106240 -3.511055 2.467166 0
17 -97.403700 22.897585 -2.848827 1.410504 0
16 -108.194520 13.909201 0.696646 -1.956591 0
23 -121.292990 18.274035 2.289155 2.336075 0
22 -108.225975 14.341716 1.028910 -1.865197 0
7 -24.469942 27.005375 -9.301921 4.399504 2
6 -32.305700 26.568039 -9.470180 3.453279 2
5 -42.884900 28.038020 -7.857254 3.336100 2
4 -37.075096 22.347767 -3.860426 5.695394 2
3 -35.296390 21.709473 -4.160352 5.578346 2
2 -42.092365 25.680704 -5.009290 5.665794 2
1 -23.806690 20.536781 -5.015675 4.221635 2
0 -46.220314 22.862856 -6.157307 5.606041 2
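If you'd rather not keep Labels as a Categorical after sorting, it can be cast back to plain integers (a small follow-up, not part of the original answer):
df['Labels'] = df['Labels'].astype(int)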
You can use an ordered Categorical, or, if you don't want to change the DataFrame, the poor man's variant: a mapping Series passed as the key of sort_values (the key argument requires pandas 1.1+):
order = [1, 0, 2]
key = pd.Series({k:v for v,k in enumerate(order)}).get
# or
# pd.Series(range(len(order)), index=order).get
df1.sort_values(by='Labels', key=key)
Example:
df1 = pd.DataFrame({'Labels': [1,0,1,2,0,2,1]})
order = [1, 0, 2]
key = pd.Series({k:v for v,k in enumerate(order)}).get
print(df1.sort_values(by='Labels', key=key))
Labels
0 1
2 1
6 1
1 0
4 0
3 2
5 2
Here is another way to do it: create a new column with map, mapping the labels to the new order, and then sort as usual.
df['sort_label'] = df['Labels'].map({1: 0, 0: 1, 2: 2})
df.sort_values('sort_label')
Feat1 Feat2 Feat3 Feat4 Labels sort_label
11 -100.685440 -4.581096 1.697769 1.880162 1 0
15 -61.951996 -8.960198 -1.591526 5.616025 1 0
8 -97.898920 -0.381567 6.416338 7.234347 1 0
9 -81.963250 0.182172 -1.287036 4.703838 1 0
10 -78.419860 -6.766374 0.800119 0.834449 1 0
14 -94.504845 -0.999922 3.288430 6.881124 1 0
12 -87.054120 -2.923158 6.817379 5.446008 1 0
13 -64.121056 -3.789221 -0.283514 6.308415 1 0
21 -208.003500 -1.332290 -3.207809 1.517740 0 1
20 -137.597600 6.795896 1.679367 2.225676 0 1
19 -199.187870 3.309057 -2.599479 4.080255 0 1
18 -92.641335 22.106240 -3.511054 2.467166 0 1
17 -97.403700 22.897585 -2.848827 1.410504 0 1
16 -108.194520 13.909201 0.696646 -1.956591 0 1
23 -121.292990 18.274035 2.289155 2.336075 0 1
22 -108.225975 14.341716 1.028910 -1.865197 0 1
7 -24.469942 27.005375 -9.301921 4.399504 2 2
6 -32.305700 26.568039 -9.470180 3.453279 2 2
5 -42.884900 28.038020 -7.857254 3.336100 2 2
4 -37.075096 22.347767 -3.860426 5.695394 2 2
3 -35.296390 21.709473 -4.160352 5.578346 2 2
2 -42.092365 25.680704 -5.009290 5.665794 2 2
1 -23.806690 20.536781 -5.015675 4.221635 2 2
0 -46.220314 22.862856 -6.157307 5.606041 2 2
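If the helper column isn't wanted in the result, it can be dropped after sorting (a small follow-up, not part of the original answer):
df.sort_values('sort_label').drop(columns='sort_label')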

Replacing the first column values according to the second column pattern

How do I use regex to replace values in a DataFrame, here the 5th column, according to a pattern in the 1st column? Column 5 consists only of ones for now. However, I would like to start changing this column whenever the pattern 34444 appears in the 1st column. The program is then supposed to replace the ones with 11111, 22222, 33333, etc., incrementing each time the pattern appears, until the end of the file.
Sample of the file:
0 5 1 2 3 4
11 1 1 1 -173.386856 -0.152110 -58.235509
12 2 1 1 -176.102464 -1.020643 -1.217859
13 3 1 1 -175.792961 -57.458357 -58.538891
14 4 1 1 -172.774153 -59.284206 -1.988605
15 5 1 1 -174.974179 -56.371161 -58.406157
16 6 1 3 138.998480 12.596951 0.223780
17 7 1 4 138.333252 11.884713 -0.281429
18 8 1 4 139.498084 13.356891 -0.480091
19 9 1 4 139.710930 11.981460 0.697098
20 10 1 4 138.452807 13.136061 0.990663
21 11 1 3 138.998480 12.596951 0.223780
22 12 1 4 138.333252 11.884713 -0.281429
23 13 1 4 139.498084 13.356891 -0.480091
24 14 1 4 139.710930 11.981460 0.697098
25 15 1 4 138.452807 13.136061 0.990663
Expected result:
0 5 1 2 3 4
11 1 1 1 -173.386856 -0.152110 -58.235509
12 2 1 1 -176.102464 -1.020643 -1.217859
13 3 1 1 -175.792961 -57.458357 -58.538891
14 4 1 1 -172.774153 -59.284206 -1.988605
15 5 1 1 -174.974179 -56.371161 -58.406157
16 6 1 3 138.998480 12.596951 0.223780
17 7 1 4 138.333252 11.884713 -0.281429
18 8 1 4 139.498084 13.356891 -0.480091
19 9 1 4 139.710930 11.981460 0.697098
20 10 1 4 138.452807 13.136061 0.990663
21 11 2 3 138.998480 12.596951 0.223780
22 12 2 4 138.333252 11.884713 -0.281429
23 13 2 4 139.498084 13.356891 -0.480091
24 14 2 4 139.710930 11.981460 0.697098
25 15 2 4 138.452807 13.136061 0.990663
Yeah, if you really want re, there is a way. But I doubt it would really be more efficient than a for-loop.
1. re.finditer
import pandas as pd
import numpy as np
import re

# present col '1' as one string of digits
arr1 = df['1'].values
str1 = "".join([str(i) for i in arr1])
ans = np.ones(len(str1), dtype=int)

# when a pattern is found, increase all later elements by 1
for match in re.finditer('34444', str1):
    e = match.end()
    ans[e:] += 1

# replace column 5
df['5'] = ans
# Output
df[['0', '5', '1']]
Out[50]:
0 5 1
11 1 1 1
12 2 1 1
13 3 1 1
14 4 1 1
15 5 1 1
16 6 1 3
17 7 1 4
18 8 1 4
19 9 1 4
20 10 1 4
21 11 2 3
22 12 2 4
23 13 2 4
24 14 2 4
25 15 2 4
2. naïve for-loop
This checks the array directly, element by element. Compared with re.finditer, no typecasting is involved, but an explicit for-loop is written. The same output is obtained. Please benchmark yourself if efficiency becomes relevant, say with tens of millions of rows.
arr1 = df['1'].values
ans = np.ones(len(arr1), dtype=int)
n = len(arr1)
for i, el in enumerate(arr1):
    # termination: not enough elements left for a full pattern
    if i > n - 5:
        break
    # ignore non-3 elements
    if el != 3:
        continue
    # if the pattern is found, increase all later elements by 1
    if np.all(arr1[i+1:i+5] == 4):
        ans[i+5:] += 1
df['5'] = ans
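If speed ever matters, a fully vectorized variant is also possible with NumPy's sliding_window_view (a sketch, assuming NumPy >= 1.20; not benchmarked):
import numpy as np

arr1 = df['1'].to_numpy()
pattern = np.array([3, 4, 4, 4, 4])
w = len(pattern)

# True at every position where the pattern starts
hits = np.all(np.lib.stride_tricks.sliding_window_view(arr1, w) == pattern, axis=1)

# each completed pattern bumps the counter for all later rows
bumps = np.zeros(len(arr1) + 1, dtype=int)   # +1 guards against a match ending on the last row
bumps[np.flatnonzero(hits) + w] += 1
df['5'] = 1 + np.cumsum(bumps[:-1])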

How to take the mean of 3 values before a flag changes from 0 to 1 - python

I have a dataframe with columns A, B and flag. I want to calculate the mean of the 2 values before the flag changes from 0 to 1, and record the value at the point where the flag changes from 0 to 1 as well as where it changes from 1 to 0.
# Input dataframe
df = pd.DataFrame({'A': [1, 3, 4, 7, 8, 11, 1, 15, 20, 15, 16, 87],
                   'B': [1, 3, 4, 6, 8, 11, 1, 19, 20, 15, 16, 87],
                   'flag': [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0]})
# Expected output
df_out = pd.DataFrame({'A_mean_before_flag_change': [5.5],
                       'B_mean_before_flag_change': [5],
                       'A_value_before_change_flag': [7],
                       'B_value_before_change_flag': [6]})
I'll try to create a more general solution:
df = pd.DataFrame({'A': [1, 3, 4, 7, 8, 11, 1, 15, 20, 15, 16, 87],
                   'B': [1, 3, 4, 6, 8, 11, 1, 19, 20, 15, 16, 87],
                   'flag': [0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1]})
print (df)
A B flag
0 1 1 0
1 3 3 0
2 4 4 0
3 7 6 0
4 8 8 1
5 11 11 1
6 1 1 1
7 15 19 0
8 20 20 0
9 15 15 1
10 16 16 0
11 87 87 1
First create groups using a mask that marks the 0 rows whose next flag value is 1:
m1 = df['flag'].eq(0) & df['flag'].shift(-1).eq(1)
df['g'] = m1.iloc[::-1].cumsum()
print (df)
A B flag g
0 1 1 0 3
1 3 3 0 3
2 4 4 0 3
3 7 6 0 3
4 8 8 1 2
5 11 11 1 2
6 1 1 1 2
7 15 19 0 2
8 20 20 0 2
9 15 15 1 1
10 16 16 0 1
11 87 87 1 0
Then filter out groups whose size is less than N:
N = 4
df1 = df[df['g'].map(df['g'].value_counts()).ge(N)].copy()
print (df1)
A B flag g
0 1 1 0 3
1 3 3 0 3
2 4 4 0 3
3 7 6 0 3
4 8 8 1 2
5 11 11 1 2
6 1 1 1 2
7 15 19 0 2
8 20 20 0 2
Keep the last N rows of each group:
df2 = df1.groupby('g').tail(N)
And aggregate with last and mean:
d = {'mean':'_mean_before_flag_change', 'last': '_value_before_change_flag'}
df3 = df2.groupby('g')[['A','B']].agg(['mean','last']).sort_index(axis=1, level=1).rename(columns=d)
df3.columns = df3.columns.map(''.join)
print (df3)
A_value_before_change_flag B_value_before_change_flag \
g
2 20 20
3 7 6
A_mean_before_flag_change B_mean_before_flag_change
g
2 11.75 12.75
3 3.75 3.50
I'm assuming that this needs to work for cases with more than one rising edge and that the consecutive values and averages get appended to the output lists:
# the first step is to extract the rising and falling edges using diff(), identify sections and length
df['flag_diff'] = df.flag.diff().fillna(0)
df['flag_sections'] = (df.flag_diff != 0).cumsum()
df['flag_sum'] = df.flag.groupby(df.flag_sections).transform('sum')
# then you can get the relevant indices by checking for the rising edges
rising_edges = df.index[df.flag_diff==1.0]
val_indices = [i-1 for i in rising_edges]
avg_indices = [(i-2,i-1) for i in rising_edges]
# and finally iterate over the relevant sections
df_out = pd.DataFrame()
df_out['A_mean_before_flag_change'] = [df.A.loc[tpl[0]:tpl[1]].mean() for tpl in avg_indices]
df_out['B_mean_before_flag_change'] = [df.B.loc[tpl[0]:tpl[1]].mean() for tpl in avg_indices]
df_out['A_value_before_change_flag'] = [df.A.loc[idx] for idx in val_indices]
df_out['B_value_before_change_flag'] = [df.B.loc[idx] for idx in val_indices]
df_out['length'] = [df.flag_sum.loc[idx] for idx in rising_edges]
df_out.index = rising_edges
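For the original 12-row input (a single rising edge at row 4), this should give something like:
   A_mean_before_flag_change  B_mean_before_flag_change  A_value_before_change_flag  B_value_before_change_flag  length
4                        5.5                        5.0                           7                           6       3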

Pandas - Fill N rows for a specific column with an integer value and increment the integer thereafter

I have a dataframe to which I added, say, a column named col_1. I want to fill that column with integer values, starting from the first row, that increment after every 4th row. The resulting column should look like this:
col_1
1
1
1
1
2
2
2
2
The current approach I have is a very brute force one:
for x in range(len(df)):
    if x <= 3:
        df['col_1'][x] = 1
    if x > 3 and x <= 7:
        df['col_1'][x] = 2
This might work for something small but when moving to something larger it will chew up a lot of time.
If there is a default RangeIndex, you can use integer division and add 1:
df['col_1'] = df.index // 4 + 1
Or, for a general solution, use a helper array built from the length of the DataFrame:
df['col_1'] = np.arange(len(df)) // 4 + 1
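As a quick sanity check (a minimal sketch with a hypothetical 10-row frame):
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(10)})
df['col_1'] = np.arange(len(df)) // 4 + 1
print(df['col_1'].tolist())   # [1, 1, 1, 1, 2, 2, 2, 2, 3, 3]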
To repeat a 1-and-2 pattern, also apply modulo 2:
df = pd.DataFrame({'a':range(20, 40)})
df['col_1'] = (np.arange(len(df)) // 4) % 2 + 1
print (df)
a col_1
0 20 1
1 21 1
2 22 1
3 23 1
4 24 2
5 25 2
6 26 2
7 27 2
8 28 1
9 29 1
10 30 1
11 31 1
12 32 2
13 33 2
14 34 2
15 35 2
16 36 1
17 37 1
18 38 1
19 39 1

How to check value change in column

My dataframe has three columns: value, ID and distance. I want to detect when the ID column changes from 2 to some other value: for each such run, count the rows, record the first and last value, and also save the corresponding distance at the point where ID changed away from 2.
df=pd.DataFrame({'value':[3,4,7,8,11,20,15,20,15,16],'ID':[2,2,8,8,8,2,2,2,5,5],'distance':[0,0,1,0,0,0,0,0,0,0]})
print(df)
value ID distance
0 3 2 0
1 4 2 0
2 7 8 1
3 8 8 0
4 11 8 0
5 20 2 0
6 15 2 0
7 20 2 0
8 15 5 0
9 16 5 0
required results:
df_out=pd.DataFrame({'rows_Count':[3,2],'value_first':[7,15],'value_last':[11,16],'distance_first':[1,0]})
print(df_out)
rows_Count value_first value_last distance_first
0 3 7 11 1
1 2 15 16 0
Use:
# compare with 2
m = df['ID'].eq(2)
# filter out data before the first 2 (not needed in the sample data, but possible in real data)
df = df[m.cumsum().ne(0)]
# create unique group ids for the non-2 runs; reindex adds back NaN for the ID == 2 rows
s = m.ne(m.shift()).cumsum()[~m].reindex(df.index)
#aggregate with helper s Series
df1 = df.groupby(s).agg({'ID':'size', 'value':['first','last'], 'distance':'first'})
#flatten MultiIndex
df1.columns = df1.columns.map('_'.join)
df1 = df1.reset_index(drop=True)
print (df1)
ID_size value_first value_last distance_first
0 3 7 11 1
1 2 15 16 0
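The key detail is that s is NaN for the ID == 2 rows, and groupby drops NaN keys by default, so only the non-2 runs are aggregated. As a quick check, for the sample data s should look like this:
print(s.tolist())
# [nan, nan, 2.0, 2.0, 2.0, nan, nan, nan, 4.0, 4.0]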
Verify with changed data (where the first group is not 2):
df = pd.DataFrame({'value': [3, 4, 7, 8, 11, 20, 15, 20, 15, 16],
                   'ID': [1, 7, 8, 8, 8, 2, 2, 2, 5, 5],
                   'distance': [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]})
print(df)
value ID distance
0 3 1 0 <- changed ID
1 4 7 0 <- changed ID
2 7 8 1
3 8 8 0
4 11 8 0
5 20 2 0
6 15 2 0
7 20 2 0
8 15 5 0
9 16 5 0
# compare with 2
m = df['ID'].eq(2)
# filter out data before the first 2 (needed here, since the data does not start with 2)
df = df[m.cumsum().ne(0)]
# create unique group ids for the non-2 runs; reindex adds back NaN for the ID == 2 rows
s = m.ne(m.shift()).cumsum()[~m].reindex(df.index)
#aggregate with helper s Series
df1 = df.groupby(s).agg({'ID':'size', 'value':['first','last'], 'distance':'first'})
#flatten MultiIndex
df1.columns = df1.columns.map('_'.join)
df1 = df1.reset_index(drop=True)
print (df1)
ID_size value_first value_last distance_first
0 2 15 16 0
