replace all values in range of columns based on condition - python-3.x

How can I replace values in multiple columns based on a condition?
Suppose I have a df looking something like this:
df = pd.DataFrame({'A': [1,2,3,4], 'C': [1,2,3,4], 'B': [3,4,6,6]})
With numpy I can change the value of a column based on a condition like this:
df['A'] = np.where((df['B'] < 5), '-', df['A'])
But how can I change the value of many columns based on a condition? Thought I could do something like below but that's not working.
df[['A','C']] = np.where((df['B'] < 5), '-', df[['A', 'C']])
I could do a loop, but that does not feel very pythonic/pandas-like:
cols = ['A', 'C']
for col in cols:
    df[col] = np.where((df['B'] < 5), '-', df[col])

One idea is to use DataFrame.mask:
df[['A','C']] = df[['A', 'C']].mask(df['B'] < 5, '-')
print (df)
A C B
0 - - 3
1 - - 4
2 3 3 6
3 4 4 6
Alternative solution with DataFrame.loc:
df.loc[df['B'] < 5, ['A','C']] = '-'
print (df)
A C B
0 - - 3
1 - - 4
2 3 3 6
3 4 4 6
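Solution with numpy.where and a mask broadcast across the columns (the mask is converted to a NumPy array first, because recent pandas versions no longer support indexing a Series with [:, None]):
df[['A','C']] = np.where((df['B'] < 5).to_numpy()[:, None], '-', df[['A', 'C']])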
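The three approaches above can be put side by side in a small self-contained sketch. Casting the target columns to object first is an assumption made here so that the columns can hold both integers and the '-' placeholder; the names cond, cols, target and out_* are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3, 4], 'C': [1, 2, 3, 4], 'B': [3, 4, 6, 6]})
cond = df['B'] < 5          # boolean Series, one value per row
cols = ['A', 'C']           # columns to overwrite

# Assumption: cast to object so ints and the '-' string can share a column.
target = df[cols].astype(object)

# 1) DataFrame.mask replaces values where cond is True (aligned on the row index).
out_mask = target.mask(cond, '-')

# 2) DataFrame.loc selects rows by the boolean mask and columns by label.
out_loc = df.astype({c: object for c in cols})
out_loc.loc[cond, cols] = '-'

# 3) numpy.where needs the 1-D mask reshaped to (n, 1) to broadcast over both columns.
out_where = pd.DataFrame(
    np.where(cond.to_numpy()[:, None], '-', target.to_numpy()),
    index=df.index,
    columns=cols,
)

print(out_mask)
```

All three variants should produce the same values for A and C; which one is fastest depends on the dtypes involved, as the timings below suggest.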
Performance with mixed values (numeric and string):
df = pd.DataFrame({'A': [1,2,3,4], 'C': [1,2,3,4], 'B': [3,4,6,6]})
#400k rows
df = pd.concat([df] * 100000, ignore_index=True)
In [217]: %timeit df[['A','C']] = df[['A', 'C']].mask(df['B'] < 5, '-')
171 ms ± 13.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [219]: %timeit df[['A','C']] = np.where((df['B'] < 5)[:, None], '-', df[['A', 'C']])
72.5 ms ± 11.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [221]: %timeit df.loc[df['B'] < 5, ['A','C']] = '-'
27.8 ms ± 533 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Performance when replacing with a numeric value:
df = pd.DataFrame({'A': [1,2,3,4], 'C': [1,2,3,4], 'B': [3,4,6,6]})
df = pd.concat([df] * 100000, ignore_index=True)
In [229]: %timeit df[['A','C']] = df[['A', 'C']].mask(df['B'] < 5, 0)
187 ms ± 4.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [231]: %timeit df[['A','C']] = np.where((df['B'] < 5)[:, None], 0, df[['A', 'C']])
20.8 ms ± 455 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [233]: %timeit df.loc[df['B'] < 5, ['A','C']] = 0
61.3 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Related

Python: pandas count row-wise string matches across specific variables

I have a dataset that I am manipulating with Python's pandas library. The data frame has one string variable that is the ID of interest. I have a set of other variables with a shared prefix (i.e. name_). For each row, I want to count the number of variables in this set that have the ID as the value.
Question: Is there a pandas 1 liner that I can use?
Here is an example dataset
import pandas as pd
df = pd.DataFrame({
'ID_var': ['ab?c', 'xyzyy', 'ab?c', 'ghi55'],
'name_01': ['def55', 'abc', 'ab?c', 'def'],
'name_02': ['ab?c', 'jkl123', 'ab?c', 'ghi55'],
'name_03': ['ghi55', 'mn_o', 'ab?c', 'ghi55'],
'not_name': [0, 1, 2, 3],
'other_str': ['str1', 'str2', 'str3', 'str'],
})
and, for each row, I want to count the number of times variables with the prefix name_ equal ID_var. So the desired output is:
import pandas as pd
df_final = pd.DataFrame({
'ID_var': ['ab?c', 'xyzyy', 'ab?c', 'ghi55'],
'name_01': ['def55', 'abc', 'ab?c', 'def'],
'name_02': ['ab?c', 'jkl123', 'ab?c', 'ghi55'],
'name_03': ['ghi55', 'mn_o', 'ab?c', 'ghi55'],
'not_name': [0, 1, 2, 3],
'other_str': ['str1', 'str2', 'str3', 'str'],
'total_name':[1,0,3,2]
})
I haven't been able to find this elsewhere on SO. I suspect that there is a way to use Series.str.contains, but I am not sure how. Thank you in advance.
Here's a pandas one-liner, a bit faster on my computer than @Timeless's answer. filter subsets to just the columns that start with name_, .eq compares each of those columns to ID_var row by row (axis=0), and sum(axis=1) then counts the matches in each row.
df['total_name'] = df.filter(regex='^name_').eq(df['ID_var'],axis=0).sum(axis=1)
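Splitting the one-liner into steps may make it easier to follow. This is only a sketch on the question's sample data; name_cols and matches are names introduced here for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'ID_var': ['ab?c', 'xyzyy', 'ab?c', 'ghi55'],
    'name_01': ['def55', 'abc', 'ab?c', 'def'],
    'name_02': ['ab?c', 'jkl123', 'ab?c', 'ghi55'],
    'name_03': ['ghi55', 'mn_o', 'ab?c', 'ghi55'],
    'not_name': [0, 1, 2, 3],
    'other_str': ['str1', 'str2', 'str3', 'str'],
})

# Step 1: keep only the columns whose label starts with "name_".
name_cols = df.filter(regex='^name_')

# Step 2: compare every remaining column with ID_var, aligned on the row index.
matches = name_cols.eq(df['ID_var'], axis=0)

# Step 3: count the True values in each row.
df['total_name'] = matches.sum(axis=1)

print(df[['ID_var', 'total_name']])
```

Because .eq does an exact equality test, characters such as the ? in 'ab?c' need no escaping, which would not be the case with a regex-based str.contains.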
Timing:
%%timeit
df['total_name'] = df.filter(regex='^name_').eq(df['ID_var'],axis=0).sum(axis=1)
196 µs ± 1.41 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
Compared to:
%%timeit
df["total_name"] = [(df.iloc[i, 0]==df.iloc[i, 1:]).sum() for i in range(len(df))]
353 µs ± 6.61 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
Question: Is there a pandas 1 liner that I can use?
One option is to use a list comprehension with iloc:
df["total_name"] = [(df.iloc[i, 0]==df.iloc[i, 1:]).sum() for i in range(len(df))]
Output:
print(df)
ID_var name_01 name_02 name_03 not_name other_str total_name
0 ab?c def55 ab?c ghi55 0 str1 1
1 xyzyy abc jkl123 mn_o 1 str2 0
2 ab?c ab?c ab?c ab?c 2 str3 3
3 ghi55 def ghi55 ghi55 3 str 2
#%%timeit : 547 µs ± 12.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
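Note that iloc[i, 1:] compares the ID with every column after the first, not only the name_ columns. That happens to give the right counts for this sample, but a variant restricted to the prefixed columns might look like the sketch below. It is an illustration only, and name_cols is a helper name introduced here.

```python
import pandas as pd

df = pd.DataFrame({
    'ID_var': ['ab?c', 'xyzyy', 'ab?c', 'ghi55'],
    'name_01': ['def55', 'abc', 'ab?c', 'def'],
    'name_02': ['ab?c', 'jkl123', 'ab?c', 'ghi55'],
    'name_03': ['ghi55', 'mn_o', 'ab?c', 'ghi55'],
    'not_name': [0, 1, 2, 3],
    'other_str': ['str1', 'str2', 'str3', 'str'],
})

# Select the columns by prefix instead of by position.
name_cols = [c for c in df.columns if c.startswith('name_')]

# For each row, count how many of the name_ values equal that row's ID.
df['total_name'] = [
    sum(value == row_id for value in names)
    for row_id, names in zip(df['ID_var'], df[name_cols].to_numpy())
]

print(df['total_name'].tolist())
```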

How to make a calculation in a pandas dataframe depending on the value of a certain column

I have this dataframe and I want to make a calculation depending on a condition, like below:
count prep result
0 10 100
10 100 100
I want to create a new column, evaluated, defined as:
if df['count']==0:
    df['evaluated'] = df['result'] / df['prep']
else:
    df['evaluated'] = df['result'] / df['count']
expected result is:
count prep result evaluated
0 10 100 10
100 10 100 1
What's the best way to do it? My real dataframe has 30k rows.
You can use where or mask:
df['evaluated'] = df['result'].div(df['prep'].where(df['count'].eq(0), df['count']))
Or:
df['evaluated'] = df['result'].div(df['count'].mask(df['count'].eq(0), df['prep']))
Output (assuming there was an error in the provided input):
count prep result evaluated
0 0 10 100 10.0
1 100 10 100 1.0
You can also use np.where from numpy to do that:
df['evaluated'] = np.where(df['count'] == 0,
                           df['result'] / df['prep'],   # count == 0
                           df['result'] / df['count'])  # count != 0
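A runnable sketch of both ideas follows, using the corrected two-row input assumed in the output above. The intermediate name divisor and the variable alt are introduced here for illustration.

```python
import numpy as np
import pandas as pd

# Corrected sample input (the question's table is assumed to contain a typo).
df = pd.DataFrame({'count': [0, 100], 'prep': [10, 10], 'result': [100, 100]})

# Pick the divisor per row: prep where count is 0, otherwise count.
divisor = df['prep'].where(df['count'].eq(0), df['count'])
df['evaluated'] = df['result'].div(divisor)

# np.where computes both quotients for every row and then selects one,
# so result / count is inf on the zero-count row before it is discarded.
alt = np.where(df['count'] == 0,
               df['result'] / df['prep'],
               df['result'] / df['count'])

print(df)
```

Choosing the divisor first avoids ever dividing by zero, which may be preferable if the intermediate inf values are a concern.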
Performance (not really significant) over 30k rows:
>>> %timeit df['result'].div(df['prep'].where(df['count'].eq(0), df['count']))
652 µs ± 12.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit df['result'].div(df['count'].mask(df['count'].eq(0), df['prep']))
638 µs ± 1.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit np.where(df['count'] == 0, df['result'] / df['prep'], df['result'] / df['count'])
462 µs ± 1.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Pythonic way for calculating complex terms in Pandas (values bigger or equal to a number divided by the length of a list)

I have the following dataframe:
simple_list=[[3.0, [1.1, 2.2, 3.3, 4.4, 5.5]]]
simple_list.append([0.25, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]])
df4=pd.DataFrame(simple_list,columns=['col1','col2'])
I want to create a new column called new_col, which holds the following calculation:
The number of elements in col2 that are greater than or equal to the given number in col1, divided by the length of the list in col2.
For example:
the first value in new_col will be 0.6 (there are 3 numbers greater than or equal to 3.0, and 5 is the length of this list)
the second value in new_col will be 0.6667 (there are 4 numbers greater than or equal to 0.25, and 6 is the length of this list).
Use DataFrame.explode with DataFrame.eval to compare the columns, then take the mean per index level:
df4['new'] = df4.explode('col2').eval('col1 < col2').groupby(level=0).mean()
Or convert the lists to a DataFrame and, before taking the mean, mask the positions that are missing in df1:
df1 = pd.DataFrame(df4['col2'].tolist(), index=df4.index)
df4['new'] = df1.gt(df4['col1'], axis=0).mask(df1.isna()).mean(axis=1)
Slower solutions:
It is also possible to use a list comprehension that converts each list to a NumPy array:
df4['new'] = [(np.array(b) > a).mean() for a, b in df4[['col1','col2']].to_numpy()]
Another idea with DataFrame.apply:
df4['new'] = df4.apply(lambda x: (np.array(x['col2']) > x['col1']).mean(), axis=1)
print (df4)
col1 col2 new
0 3.00 [1.1, 2.2, 3.3, 4.4, 5.5] 0.600000
1 0.25 [0.1, 0.2, 0.3, 0.4, 0.5, 0.6] 0.666667
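The two vectorized ideas can be sketched end to end as follows. This version uses >= to match the "greater than or equal" wording of the question (the snippets above use a strict comparison, which gives the same result on this sample), and it replaces eval with a plain comparison. The names exploded, hits, padded and flags are illustrative.

```python
import pandas as pd

df4 = pd.DataFrame(
    [[3.0, [1.1, 2.2, 3.3, 4.4, 5.5]],
     [0.25, [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]]],
    columns=['col1', 'col2'],
)

# Approach 1: one row per list element, then average the comparison
# over the original row labels (which explode keeps in the index).
exploded = df4.explode('col2')
hits = exploded['col2'].astype(float) >= exploded['col1']
share_explode = hits.groupby(level=0).mean()

# Approach 2: pad the lists into a rectangular frame; NaN marks padding,
# and those cells are blanked out again so they do not count in the mean.
padded = pd.DataFrame(df4['col2'].tolist(), index=df4.index)
flags = padded.ge(df4['col1'], axis=0).astype(float).where(padded.notna())
share_padded = flags.mean(axis=1)

df4['new'] = share_explode
print(df4)
```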
Performance:
df4=pd.DataFrame(simple_list,columns=['col1','col2'])
df4 = pd.concat([df4] * 10000, ignore_index=True)
In [262]: %%timeit
...: df1 = pd.DataFrame(df4['col2'].tolist(), index=df4.index)
...: df4['new'] = df1.gt(df4['col1'], axis=0).mask(df1.isna()).mean(axis=1)
...:
40.9 ms ± 3.03 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [263]: %timeit df4.explode('col2').eval('col1 < col2').mean(level=0)
97.2 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [264]: %timeit [(np.array(b) > a).mean() for a, b in df4[['col1','col2']].to_numpy()]
305 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [265]: %timeit df4.apply(lambda x: (np.array(x['col2']) > x['col1']).mean(), axis=1)
1.23 s ± 32.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

parsing a panda dataframe column from a dictionary data form into new columns for each dictionary key

In python 3, pandas. Imagine there is a dataframe df with a column x
df=pd.DataFrame(
[
{'x':'{"a":"1","b":"2","c":"3"}'},
{'x':'{"a":"2","b":"3","c":"4"}'}
]
)
The column x holds data that looks like a dictionary. How can I parse it into a new dataframe, so that each key becomes a new column?
The desired output dataframe is like
x,a,b,c
'{"a":"1","b":"2","c":"3"}',1,2,3
'{"a":"2","b":"3","c":"4"}',2,3,4
None of the solutions in this post seem to work in this case:
parsing a dictionary in a pandas dataframe cell into new row cells (new columns)
df1=pd.DataFrame(df.loc[:,'x'].values.tolist())
print(df1)
This results in the same dataframe; it doesn't separate the column into one column per key.
Any 2 cents?
Thanks!
You can also map json.loads over the column and convert the result to a dataframe:
import json
df1 = pd.DataFrame(df['x'].map(json.loads).tolist(),index=df.index)
print(df1)
a b c
0 1 2 3
1 2 3 4
This tests faster than evaluating via ast; below is the benchmark for 40K rows:
m = pd.concat([df]*20000,ignore_index=True)
%%timeit
import json
df1 = pd.DataFrame(m['x'].map(json.loads).tolist(),index=m.index)
#256 ms ± 18.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit
import ast
df1 = pd.DataFrame(m['x'].map(ast.literal_eval).tolist(),index=m.index)
#1.32 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%%timeit
import ast
df1 = pd.DataFrame(m['x'].apply(ast.literal_eval).tolist(),index=m.index)
#1.34 s ± 71.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Because the column holds string representations of dictionaries, the values must first be converted to dictionaries:
import ast, json
#performance for repeated sample data, in real data should be different
m = pd.concat([df]*20000,ignore_index=True)
In [98]: %timeit pd.DataFrame([json.loads(x) for x in m['x']], index=m.index)
206 ms ± 1.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#anky_91 solution
In [99]: %timeit pd.DataFrame(m['x'].map(json.loads).tolist(),index=m.index)
210 ms ± 11.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [100]: %timeit pd.DataFrame(m['x'].map(ast.literal_eval).tolist(),index=m.index)
903 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [101]: %timeit pd.DataFrame(m['x'].apply(ast.literal_eval).tolist(),index=m.index)
893 ms ± 2.15 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
print(df1)
a b c
0 1 2 3
1 2 3 4
Finally, join the result to the original:
df = df.join(df1)
print(df)
x a b c
0 {"a":"1","b":"2","c":"3"} 1 2 3
1 {"a":"2","b":"3","c":"4"} 2 3 4
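Putting the pieces together, a minimal sketch of the whole flow might look like this. The numeric conversion step is an assumption added here: the JSON values in the sample are quoted strings, while the desired output in the question shows plain numbers. The name parsed is illustrative.

```python
import json

import pandas as pd

df = pd.DataFrame([
    {'x': '{"a":"1","b":"2","c":"3"}'},
    {'x': '{"a":"2","b":"3","c":"4"}'},
])

# Parse each JSON string into a dict, then build one column per key.
parsed = pd.DataFrame([json.loads(s) for s in df['x']], index=df.index)

# Assumption: the quoted values should become numbers, as in the desired output.
parsed = parsed.apply(pd.to_numeric)

# Attach the new columns to the original frame; join aligns on the index.
out = df.join(parsed)
print(out)
```

json.loads only accepts valid JSON (double-quoted keys and strings); for Python-style dict strings with single quotes, ast.literal_eval would be the fallback, at the speed cost measured above.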

Optimizing pandas operation for column intersection

I have a DataFrame with 2 columns (event and events). The event column contains a particular event ID and the events column contains a list of event IDs.
Example:
df
event events
'a' ['x','y','abc','a']
'b' ['x','y','c','a']
'c' ['a','c']
'd' ['b']
I want to create another column (eventoccured) indicating whether event is in events.
eventoccured
1
0
1
0
I am currently using
df['eventoccured']= df.apply(lambda x: x['event'] in x['events'], axis=1)
which gives the desired result but is slow, I want a faster solution for this.
Thanks
One idea is to use a list comprehension:
#40k rows
df = pd.concat([df] * 10000, ignore_index=True)
In [217]: %timeit df['eventoccured']= df.apply(lambda x: x['event'] in x['events'], axis=1)
1.15 s ± 36.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [218]: %timeit df['eventoccured1'] = [x in y for x, y in zip(df['event'], df['events'])]
15.2 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
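For reference, here is the list comprehension as a self-contained sketch on the question's four sample rows. Converting the boolean to int is an assumption made here to match the 1/0 values shown in the question; the timed line above stores plain booleans.

```python
import pandas as pd

df = pd.DataFrame({
    'event': ['a', 'b', 'c', 'd'],
    'events': [['x', 'y', 'abc', 'a'], ['x', 'y', 'c', 'a'], ['a', 'c'], ['b']],
})

# zip walks both columns in parallel, avoiding the per-row Series
# construction that makes DataFrame.apply(axis=1) slow.
df['eventoccured'] = [
    int(event in events)
    for event, events in zip(df['event'], df['events'])
]

print(df)
```

If the lists are long and membership is tested many times, converting each list to a set beforehand may help further, though that is not measured here.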
