pandas drop rows based on condition on groupby - python-3.x

I have a DataFrame like the one sketched below.
I am trying to group by the cell column and drop the NaN values where the group size is > 1, so that a NaN row is kept only when it is the only row for its cell.
How do I get my expected output? How can I filter on a condition and drop rows within a groupby?
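For reference, a sample DataFrame matching the outputs in the answers below can be built like this (the original table is not reproduced here, so the exact values are an assumption reconstructed from those answers):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'cell': ['A', 'A', 'A', 'B', 'D', 'D', 'D'],
    'value': [5.0, 6.0, np.nan, np.nan, 8.0, np.nan, np.nan],
    'kpi': ['thpt', 'ret', 'thpt', 'acc', 'int', 'ps', 'yret'],
})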

From your DataFrame, we first group by cell to get the size of each group:
>>> df_grouped = df.groupby(['cell'], as_index=False).size()
>>> df_grouped
cell size
0 A 3
1 B 1
2 D 3
Then, we merge the result with the original DataFrame like so:
>>> df_merged = pd.merge(df, df_grouped, on='cell', how='left')
>>> df_merged
cell value kpi size
0 A 5.0 thpt 3
1 A 6.0 ret 3
2 A NaN thpt 3
3 B NaN acc 1
4 D 8.0 int 3
5 D NaN ps 3
6 D NaN yret 3
Finally, we filter the DataFrame to get the expected result:
>>> df_filtered = df_merged[~((df_merged['value'].isna()) & (df_merged['size'] > 1))]
>>> df_filtered[['cell', 'value', 'kpi']]
cell value kpi
0 A 5.0 thpt
1 A 6.0 ret
3 B NaN acc
4 D 8.0 int
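The merge step can also be skipped by computing the group size directly on the original frame with transform; a minimal sketch, assuming the same df as above:
# keep rows whose value is not NaN, or whose cell group has only one row
sizes = df.groupby('cell')['cell'].transform('size')
df_filtered = df[df['value'].notna() | sizes.eq(1)]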

Use a boolean mask (note: the output below was generated from slightly different sample data):
>>> df[df.groupby('cell').cumcount().eq(0) | df['value'].notna()]
cell value kpi
0 A crud thpt
1 A 6 ret
3 B NaN acc
4 D hi int
Details:
m1 = df.groupby('cell').cumcount().eq(0)
m2 = df['value'].notna()
df.assign(keep_at_least_one=m1, keep_notna=m2, keep_rows=m1|m2)
# Output:
cell value kpi keep_at_least_one keep_notna keep_rows
0 A crud thpt True True True
1 A 6 ret False True True
2 A NaN thpt False False False
3 B NaN acc True False True
4 D hi int True True True
5 D NaN ps False False False
6 D NaN yret False False False

Related

Join with column having the max sequence number

I have a margin table
item margin
0 a 3
1 b 4
2 c 5
and an item table
item sequence
0 a 1
1 a 2
2 a 3
3 b 1
4 b 2
5 c 1
6 c 2
7 c 3
I want to join the two tables so that the margin is only attached to the row with the maximum sequence number for each item. The desired outcome is:
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
How can I achieve this?
Below is the code to build the margin and item tables:
import pandas as pd
df_margin=pd.DataFrame({"item":["a","b","c"],"margin":[3,4,5]})
df_item=pd.DataFrame({"item":["a","a","a","b","b","c","c","c"],"sequence":[1,2,3,1,2,1,2,3]})
One option would be to merge then replace extra values with NaN via Series.where:
new_df = df_item.merge(df_margin)
new_df['margin'] = new_df['margin'].where(
    new_df.groupby('item')['sequence'].transform('max').eq(new_df['sequence'])
)
Or with loc (this needs import numpy as np):
new_df = df_item.merge(df_margin)
new_df.loc[new_df.groupby('item')['sequence']
               .transform('max').ne(new_df['sequence']), 'margin'] = np.nan
Another option would be to assign a temporary column to both frames: in df_item it is True where the sequence is maximal, and in df_margin it is True everywhere. Then merge with how='outer' and drop the temporary column:
new_df = (
    df_item.assign(
        t=df_item
            .groupby('item')['sequence']
            .transform('max')
            .eq(df_item['sequence'])
    ).merge(df_margin.assign(t=True), how='outer').drop(columns='t')
)
Both produce new_df:
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
You could do:
df_item.merge(
    df_item.groupby('item')['sequence'].max().reset_index().merge(df_margin),
    how='left'
)
item sequence margin
0 a 1 NaN
1 a 2 NaN
2 a 3 3.0
3 b 1 NaN
4 b 2 4.0
5 c 1 NaN
6 c 2 NaN
7 c 3 5.0
Breakdown:
df_new = df_item.groupby('item')['sequence'].max().reset_index().merge(df_margin)
df_item.merge(df_new, 'left')
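An equivalent approach (a sketch, not from the original answers) uses idxmax to pick the row holding the largest sequence per item, assuming the same df_item and df_margin as above:
# rows holding the maximum sequence for each item
max_rows = df_item.loc[df_item.groupby('item')['sequence'].idxmax()]
# attach the margin only to those rows, then left-join back onto df_item
new_df = df_item.merge(max_rows.merge(df_margin), how='left')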

Replace values on dataset and apply quartile rule by row on pandas

I have a dataset with lots of variables. So I've extracted the numeric ones:
numeric_columns = transposed_df.select_dtypes(np.number)
Then I want to replace all 0 values with 0.0001:
transposed_df[numeric_columns.columns] = numeric_columns.where(numeric_columns.eq(0, axis=0), 0.0001)
And here is the first problem: this line is not replacing the 0 values with 0.0001, it is replacing all non-zero values with 0.0001.
After this (replacing the 0 values with 0.0001), I also want to replace all values that are less than the first quartile of their row with -1 and leave the others as they were, but I can't work out how.
To answer your first question
In [36]: from pprint import pprint
In [37]: pprint( numeric_columns.where.__doc__)
('\n'
'Replace values where the condition is False.\n'
'\n'
'Parameters\n'
'----------\n'
Because where keeps values where the condition is True and replaces them where it is False, all of your values except the 0s are getting replaced.
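For instance, to keep the non-zero values and replace only the zeros, the condition passed to where has to be True for the values you want to keep:
# keep values that are not 0; everything else (the zeros) becomes 0.0001
transposed_df[numeric_columns.columns] = numeric_columns.where(numeric_columns.ne(0), 0.0001)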
Use DataFrame.mask instead, and for the second condition compare against DataFrame.quantile:
import numpy as np
import pandas as pd

transposed_df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [0, 0.5, 4, 5, 5, 4],
    'C': [7, 8, 9, 4, 2, 3],
    'D': [1, 3, 0, 7, 1, 0],
    'E': [5, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
numeric_columns = transposed_df.select_dtypes(np.number)

m1 = numeric_columns.eq(0)
m2 = numeric_columns.lt(numeric_columns.quantile(q=0.25, axis=1), axis=0)
transposed_df[numeric_columns.columns] = numeric_columns.mask(m1, 0.0001).mask(m2, -1)
print (transposed_df)
A B C D E F
0 a -1.0 7 1.0 5 a
1 b -1.0 8 3.0 3 a
2 c 4.0 9 -1.0 6 a
3 d 5.0 -1 7.0 9 b
4 e 5.0 2 -1.0 2 b
5 f 4.0 3 -1.0 4 b
EDIT:
from scipy.stats import zscore
print (transposed_df[numeric_columns.columns].apply(zscore))
B C D E
0 -2.236068 0.570352 -0.408248 0.073521
1 0.447214 0.950586 0.408248 -0.808736
2 0.447214 1.330821 -0.816497 0.514650
3 0.447214 -0.570352 2.041241 1.838037
4 0.447214 -1.330821 -0.408248 -1.249865
5 0.447214 -0.950586 -0.816497 -0.367607
EDIT1:
transposed_df = pd.DataFrame({
    'A': list('abcdef'),
    'B': [0, 1, 1, 1, 1, 1],
    'C': [1, 8, 9, 4, 2, 3],
    'D': [1, 3, 0, 7, 1, 0],
    'E': [1, 3, 6, 9, 2, 4],
    'F': list('aaabbb')
})
numeric_columns = transposed_df.select_dtypes(np.number)
from scipy.stats import zscore
df1 = pd.DataFrame(numeric_columns.apply(zscore, axis=1).tolist(),index=transposed_df.index)
transposed_df[numeric_columns.columns] = df1
print (transposed_df)
A B C D E F
0 a -1.732051 0.577350 0.577350 0.577350 a
1 b -1.063410 1.643452 -0.290021 -0.290021 a
2 c -0.816497 1.360828 -1.088662 0.544331 a
3 d -1.402136 -0.412393 0.577350 1.237179 b
4 e -1.000000 1.000000 -1.000000 1.000000 b
5 f -0.632456 0.632456 -1.264911 1.264911 b

How to drop after first consecutive duplicate values in pandas dataframe using python?

I have a DataFrame:
df = pd.DataFrame(['a','a','a','b','b','b','c','d','d','a','a','b','b','e','f','d','d'])
I want to keep the first consecutive run of 'a', 'b' and 'd' values. After that, if the same values appear again, I want to drop them.
So, now my expected output is
['a','a','a','b','b','b','c','d','d','e','f'].
If I use
print(df.drop_duplicates())
it deletes all duplicate values. So, how to get my expected output? Thanks in advance.
Compare each value with its preceding value to find the start of each run:
df['start'] = df[0] != df[0].shift()
Then, grouping by the value, take a cumulative sum of the start flags (taking advantage of the fact that Pandas treats True as 1 and False as 0); the cumulative sum acts as a run number for each value:
df['group'] = df.groupby(0)['start'].cumsum()
Then select all rows which are in the first run of each value:
result = df.loc[df['group'] == 1]
import pandas as pd
df = pd.DataFrame(['a','a','a','b','b','b','c','d','d','a','a','b','b','e','f','d','d'])
df['start'] = df[0] != df[0].shift()
df['group'] = df.groupby(0)['start'].cumsum()
result = df.loc[df['group'] == 1]
print(df)
# 0 start group
# 0 a True 1.0
# 1 a False 1.0
# 2 a False 1.0
# 3 b True 1.0
# 4 b False 1.0
# 5 b False 1.0
# 6 c True 1.0
# 7 d True 1.0
# 8 d False 1.0
# 9 a True 2.0
# 10 a False 2.0
# 11 b True 2.0
# 12 b False 2.0
# 13 e True 1.0
# 14 f True 1.0
# 15 d True 2.0
# 16 d False 2.0
df = result[[0]]
print(df)
yields
0
0 a
1 a
2 a
3 b
4 b
5 b
6 c
7 d
8 d
13 e
14 f
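A more compact variant of the same idea identifies runs with shift/cumsum and keeps only each value's first run; a sketch on the original df:
import pandas as pd

df = pd.DataFrame(['a','a','a','b','b','b','c','d','d','a','a','b','b','e','f','d','d'])
runs = df[0].ne(df[0].shift()).cumsum()          # run id, increments whenever the value changes
first_run = runs.groupby(df[0]).transform('min') # id of the first run seen for each value
result = df[runs == first_run]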

Locate rows with 0 value columns and set them to none pandas

Data:
f a b
5 0 1
5 1 3
5 1 3
5 6 3
5 0 0
5 1 5
5 0 0
I know how to locate the rows where both columns are 0; setting them to None, on the other hand, is a mystery to me.
df_o[(df_o['a'] == 0) & (df_o['b'] == 0)]
# set a and b to None
Expected result:
f a b
5 0 1
5 1 3
5 1 3
5 6 3
5 None None
5 1 5
5 None None
If you are working with numeric values, None is converted to NaN and integer columns are cast to float by design:
df_o.loc[(df_o['a'] == 0) & (df_o['b'] == 0), ['a','b']] = None
print (df_o)
a b
0 0.0 1.0
1 1.0 3.0
2 1.0 3.0
3 6.0 3.0
4 NaN NaN
5 1.0 5.0
6 NaN NaN
Another solution uses DataFrame.all with axis=1 to check whether all values in a row are True:
df_o.loc[(df_o[['a', 'b']] == 0).all(axis=1), ['a','b']] = None
print (df_o)
a b
0 0.0 1.0
1 1.0 3.0
2 1.0 3.0
3 6.0 3.0
4 NaN NaN
5 1.0 5.0
6 NaN NaN
Details:
print ((df_o[['a', 'b']] == 0))
a b
0 True False
1 False False
2 False False
3 False False
4 True True
5 False False
6 True True
print ((df_o[['a', 'b']] == 0).all(axis=1))
0 False
1 False
2 False
3 False
4 True
5 False
6 True
dtype: bool
One way I could think of: create an extra copy of the DataFrame and check each column against it while setting the values to None on the main DataFrame. Not the cleanest solution, but:
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['f'] = [5,5,5,5,5,5,5]
df['a'] = [0,1,1,6,0,1,0]
df['b'] = [1,3,3,3,0,5,0]
df1 = df.copy()
df['a'] = np.where((df.a == 0) & (df.b == 0), None, df.a)
df['b'] = np.where((df1.a == 0) & (df1.b == 0), None, df.b)
print(df)
Output:
f a b
0 5 0 1
1 5 1 3
2 5 1 3
3 5 6 3
4 5 None None
5 5 1 5
6 5 None None
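The copy is only needed because the first assignment overwrites df.a before the second condition is evaluated; computing the condition once avoids it (a small sketch on the same df):
cond = (df.a == 0) & (df.b == 0)
df['a'] = np.where(cond, None, df.a)
df['b'] = np.where(cond, None, df.b)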
df.replace(0, np.nan) -- to get NaNs (possibly more useful)
df.replace(0, 'None') -- what you actually want
It is surely not the most elegant way to do this, but maybe this helps.
import pandas as pd
data = {'a': [0,1,1,6,0,1,0],
'b':[1,3,3,3,0,5,0]}
df_o = pd.DataFrame.from_dict(data)
df_None = df_o[(df_o['a'] == 0) & (df_o['b'] == 0)]
df_o.loc[df_None.index,:] = None
print(df_o)
Out:
a b
0 0.0 1.0
1 1.0 3.0
2 1.0 3.0
3 6.0 3.0
4 NaN NaN
5 1.0 5.0
6 NaN NaN
This is how I would do it:
import pandas as pd
a = pd.Series([0, 1, 1, 6, 0, 1, 0])
b = pd.Series([1, 3, 3, 3, 0, 5 ,0])
data = pd.DataFrame({'a': a, 'b': b})
v = [[data[i][j] for i in data] == [0, 0] for j in range(len(data['a']))] # True for rows where both columns are 0
a = [None if v[i] else a[i] for i in range(len(a))]
b = [None if v[i] else b[i] for i in range(len(b))]
data = pd.DataFrame({'a': a, 'b': b})
print(data)
Output:
a b
0 0.0 1.0
1 1.0 3.0
2 1.0 3.0
3 6.0 3.0
4 NaN NaN
5 1.0 5.0
6 NaN NaN

How to use the "na_values='?'" option in the pd.read_csv() function?

I am trying to understand what the na_values='?' option does in the pd.read_csv() function, so that I can find the rows containing a "?" value and then remove them.
Sample:
import pandas as pd
from io import StringIO
#test data
temp=u"""id,col1,col2,col3
1,13?,15,14
1,13,15,?
1,12,15,13
2,?,15,?
2,18,15,13
2,18?,15,13"""
#in real data use
#df = pd.read_csv('test.csv')
df = pd.read_csv(StringIO(temp))
print (df)
id col1 col2 col3
0 1 13? 15 14
1 1 13 15 ?
2 1 12 15 13
3 2 ? 15 ?
4 2 18 15 13
5 2 18? 15 13
If you want to remove rows whose values contain ? either on its own or as a substring, create a mask with str.contains and then use DataFrame.any to check whether at least one value per row is True:
print (df.astype(str).apply(lambda x: x.str.contains('?', regex=False)))
id col1 col2 col3
0 False True False False
1 False False False True
2 False False False False
3 False True False True
4 False False False False
5 False True False False
m = ~df.astype(str).apply(lambda x: x.str.contains('?', regex=False)).any(axis=1)
print (m)
0 False
1 False
2 True
3 False
4 True
5 False
dtype: bool
df = df[m]
print (df)
id col1 col2 col3
2 1 12 15 13
4 2 18 15 13
If you only want to match cells that are exactly ?, simply compare the values:
print (df.astype(str) == '?')
id col1 col2 col3
0 False False False False
1 False False False True
2 False False False False
3 False True False True
4 False False False False
5 False False False False
m = ~(df.astype(str) == '?').any(axis=1)
print (m)
0 True
1 False
2 True
3 False
4 True
5 True
dtype: bool
df = df[m]
print (df)
id col1 col2 col3
0 1 13? 15 14
2 1 12 15 13
4 2 18 15 13
5 2 18? 15 13
To replace every standalone ? with NaN, the na_values parameter is what you need; then use dropna if you want to remove all rows containing NaNs:
import pandas as pd
from io import StringIO
#test data
temp=u"""id,col1,col2,col3
1,13?,15,14
1,13,15,?
1,12,15,13
2,?,15,?
2,18,15,13
2,18?,15,13"""
#in real data use
#df = pd.read_csv('test.csv', na_values='?')
df = pd.read_csv(StringIO(temp), na_values='?')
print (df)
id col1 col2 col3
0 1 13? 15 14.0
1 1 13 15 NaN
2 1 12 15 13.0
3 2 NaN 15 NaN
4 2 18 15 13.0
5 2 18? 15 13.0
df = df.dropna()
print (df)
id col1 col2 col3
0 1 13? 15 14.0
2 1 12 15 13.0
4 2 18 15 13.0
5 2 18? 15 13.0
Create a list with the values that should be treated as missing and pass it when reading the file:
na_values = ['NO CLUE', 'N/A', '0']
requests = pd.read_csv('some-data.csv', na_values=na_values)
Junk values such as "??" or "####" can be treated as missing values, since in pandas any such placeholder can be replaced with NaN. You can do this by passing them as a list to the na_values parameter:
data_csv = pd.read_csv('test.csv', na_values=["??"])
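na_values also accepts a dict when the placeholders differ per column; a minimal sketch (the file and column names are illustrative):
# treat '?' as missing only in col1, and both '?' and '??' in col3
df = pd.read_csv('test.csv', na_values={'col1': ['?'], 'col3': ['?', '??']})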
If you want to remove the rows which contain "?" in a pandas DataFrame, you can try the following.
Suppose you have df:
import pandas as pd
df = pd.read_csv('test.csv')
df:
A B
0 Maths 4/13/2017
1 Physics 4/15/2016
2 English 4/16/2016
3 test?dsfsa 9/15/2016
Check whether column A contains "?" to generate a new df1:
df1 = df[~df.A.str.contains(r"\?")]
df1 will be:
A B
0 Maths 4/13/2017
1 Physics 4/15/2016
2 English 4/16/2016
This gives you a new df1 whose column A doesn't contain "?".
