Pandas exclude rows based on dynamic condition set from configuration file - python-3.x

As the title suggests, I have a rule engine in XML format which contains column names and the values to exclude.
<ExclusionSet>
<Exclude Excl="Col1:A" Count="1"/>
<Exclude Excl="Col2:BB,BBB" Count="1"/>
<Exclude Excl="Col3:A1B" Count="1"/>
<Exclude Excl="Col1:A2" Excl="Col2:BC" Count="2"/>
</ExclusionSet>
Based on the above conditions I need to exclude rows, i.e. rows where Col1 has value A, rows where Col2 has value BB or BBB, and rows where Col3 has value A1B.
I was able to get working code for the single-condition rules (the first three) but I am unable to figure out how to implement the last rule (which has more than one condition).
import pandas as pd
from xml.dom import minidom

def exclusionEngine(config, df):
    # parse the xml config
    xml_map = minidom.parse(config)
    value_map = xml_map.getElementsByTagName('Exclude')
    # iterate over the exclusion rules
    for atrb in value_map:
        # not using Count anywhere yet, but thought it might be useful for multiple conditions
        rule_count = atrb.attributes['Count'].value
        for count in range(1, int(rule_count) + 1):
            # column name
            col = atrb.attributes['Excl'].value.split(':')[0]
            # value(s) as a list
            value = atrb.attributes['Excl'].value.split(':')[1].split(',')
            # filter for exclusion; is there a way to build multiple filters dynamically,
            # or to collect a list of filters and apply them together?
            filter1 = df[col].isin(value)
            df = df.loc[~filter1]
    return df
I am expecting something like the following, but dynamic, as there could be more or fewer conditions:
df = df.loc[~(filter1 & filter2)]
EDIT:
To simplify the ask: is it possible to evaluate multiple conditions dynamically?
<Exclude Excl="Col1:A2" Excl="Col2:BC" Count="2"/>

You could use the pandas query method. I am dropping all the unrelated XML handling, as it is not going to work as posted (you have a duplicate attribute, so the supplied text is not valid XML).
import pandas as pd
import re

def exclusionEngine(config: str, df: pd.DataFrame):
    ret_df = df.copy()
    read = False
    with open(config, 'r') as inp:
        for line in inp.readlines():
            if '</ExclusionSet>' in line:
                read = False
                continue
            if 'ExclusionSet' in line:
                read = True
                continue
            if read:
                matches = re.findall(r'Excl="(.*?)"', line)
                query = 'not (' + ' & '.join([f'({col}=="{x}")' for match in matches for col, x in [match.rstrip().split(':')]]) + ')'
                ret_df = ret_df.query(query)
    return ret_df
So, for example, if we have the following dataframe:
df = pd.DataFrame([['A', 'B1', 'C1'], ['O', 'C', 'D'], ['B', 'A', 'C'], ['M', 'BB,BBB', 'A'], ['A2', 'H', 'B'], ['A2', 'BC', 'y']], columns=['Col1', 'Col2', 'Col3'])
Col1 Col2 Col3
0 A B1 C1
1 O C D
2 B A C
3 M BB,BBB A
4 A2 H B
5 A2 BC y
and the config is saved as 'config.xml', then calling exclusionEngine('config.xml', df) returns:
Col1 Col2 Col3
1 O C D
2 B A C
4 A2 H B
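As a side note, if you would rather keep building boolean masks (as in the original attempt) instead of switching to query, an arbitrary number of filters per rule can be combined with functools.reduce. This is only a sketch, assuming each rule has already been parsed into a list of (column, values) pairs:
from functools import reduce
import operator
import pandas as pd

def exclude_rule(df, rule):
    # rule is assumed to look like [('Col1', ['A2']), ('Col2', ['BC'])]
    filters = [df[col].isin(values) for col, values in rule]
    combined = reduce(operator.and_, filters)   # filter1 & filter2 & ...
    return df.loc[~combined]

# hypothetical rules parsed from the XML above
rules = [
    [('Col1', ['A'])],
    [('Col2', ['BB', 'BBB'])],
    [('Col3', ['A1B'])],
    [('Col1', ['A2']), ('Col2', ['BC'])],
]

df = pd.DataFrame({'Col1': ['A', 'O', 'A2', 'A2'],
                   'Col2': ['X', 'Y', 'BC', 'Z'],
                   'Col3': ['A1B', 'C', 'D', 'E']})
for rule in rules:
    df = exclude_rule(df, rule)
print(df)   # rows 1 and 3 remain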

I think what you are looking for is this:
df = df.loc[~(filter1 & filter2)]
By De Morgan's law, this will give you the same result:
df.loc[~filter1 | ~filter2]
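A quick sanity check of that equivalence on a toy frame (not from the question's data):
import pandas as pd

df = pd.DataFrame({'Col1': ['A2', 'A2', 'X'], 'Col2': ['BC', 'Y', 'BC']})
filter1 = df['Col1'] == 'A2'
filter2 = df['Col2'] == 'BC'

# ~(filter1 & filter2) keeps every row except the one where both filters match
print(df.loc[~(filter1 & filter2)])
print(df.loc[~filter1 | ~filter2])   # same rows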

Related

column comprehension robust to missing values

I have only been able to create a two-column data frame from a defaultdict (termed output):
df_mydata = pd.DataFrame([(k, v) for k, v in output.items()],
                         columns=['id', 'value'])
What I would like to be able to do, using this basic format, is initialize the dataframe with three columns: 'id', 'id2' and 'value'. I have a separately defined dict that contains the necessary lookup info, called id_lookup.
So I tried:
df_mydata = pd.DataFrame([(k, id_lookup[k], v) for k, v in output.items()],
                         columns=['id', 'id2', 'value'])
I think I'm doing it right, but I get key errors. I will only know in hindsight whether id_lookup is exhaustive for all possible encounters. For my purposes, simply putting it all together and placing 'N/A' or something similar for those errors will be acceptable.
Would the above be appropriate for calculating a new column of data using a defaultdict and a simple lookup dict, and how might I make it robust to key errors?
Here is an example of how you could do this:
import pandas as pd
from collections import defaultdict

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'value': [10, 20, 30, 40]})
id_lookup = {1: 'A', 2: 'B', 3: 'C'}
new_column = defaultdict(str)
# Loop through the df and populate the defaultdict
for index, row in df.iterrows():
    try:
        new_column[index] = id_lookup[row['id']]
    except KeyError:
        new_column[index] = 'N/A'
# Convert the defaultdict to a Series and add it as a new column in the df
df['id2'] = pd.Series(new_column)
# Print the updated DataFrame
print(df)
which gives:
id value id2
0 1 10 A
1 2 20 B
2 3 30 C
3 4 40 N/A
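As a side note (not part of the answer above), the same robustness to missing keys can be had more concisely with Series.map, which returns NaN for ids absent from the lookup dict:
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'value': [10, 20, 30, 40]})
id_lookup = {1: 'A', 2: 'B', 3: 'C'}

# unmatched ids become NaN, which fillna then replaces with the placeholder
df['id2'] = df['id'].map(id_lookup).fillna('N/A')
print(df)
Equivalently, the original list comprehension could use id_lookup.get(k, 'N/A') instead of id_lookup[k].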

Python using apply function to skip NaN

I am trying to preprocess a dataset to use for XGBoost by mapping the classes in each column to numerical values. A working example looks like this:
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder
import pandas as pd
df1 = pd.DataFrame(data = {'col1': ['A', 'B','C','B','A'], 'col2': ['Z', 'X','Z','Z','Y'], 'col3':['I','J','I','J','J']})
d = defaultdict(LabelEncoder)
encodedDF = df1.apply(lambda x: d[x.name].fit_transform(x))
inv = encodedDF.apply(lambda x: d[x.name].inverse_transform(x))
Where encodedDF gives the output:
   col1  col2  col3
0     0     2     0
1     1     0     1
2     2     2     0
3     1     2     1
4     0     1     1
And inv just reverts it back to the original dataframe. My issue is when null values get introduced:
df2 = pd.DataFrame(data = {'col1': ['A', 'B',None,'B','A'], 'col2': ['Z', 'X','Z',None,'Y'], 'col3':['I','J','I','J','J']})
encodedDF = df2.apply(lambda x: d[x.name].fit_transform(x))
Running the above will throw the error:
"TypeError: ('argument must be a string or number', 'occurred at index col1')"
Basically, I want to apply the encoding, but skip over the individual cell values that are null to get an output like this:
  col1 col2 col3
0    0    2    0
1    1    0    1
2  NaN    2    0
3    1  NaN    1
4    0    1    1
I can't use dropna() before applying the encoding because then I lose data that I will be trying to impute down the line with XGBoost. I can't use conditionals to skip x if null (e.g. using x.notnull() in the lambda function) because fit_transform(x) takes a pandas Series as the argument, and none of the logical operators I could use in the conditional appear to do what I'm trying to do. I'm not sure what else to try in order to get this to work. I hope what I'm trying to do makes sense. Let me know if I need to clarify.
I think I figured out a workaround. I probably should have been using sklearn's OneHotEncoder class from the beginning instead of the LabelEncoder/defaultdict combo. I'm brand new to all this. I replaced NaNs with dummy values, and then dropped those dummy values once I encoded the dataframe.
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame(data = {'col1': ['A', 'B','C',None,'A'], 'col2': ['Z', 'X',None,'Z','Y'], 'col3':['I','J',None,'J','J'], 'col4':[45,67,None,32,94]})
replaceVals = {'col1':'missing','col2':'missing','col3':'missing','col4':-1}
df = df.fillna(value = replaceVals,axis=0)
drop = [['missing'],['missing'],['missing'],[-1]]
enc = OneHotEncoder(drop=drop)
encodeDF = enc.fit_transform(df)
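If you do want to keep per-column integer codes while leaving the NaNs in place (the output sketched in the question), one option, not part of the workaround above, is to encode only the non-null cells and reassemble each column:
import numpy as np
import pandas as pd
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder

df2 = pd.DataFrame(data={'col1': ['A', 'B', None, 'B', 'A'],
                         'col2': ['Z', 'X', 'Z', None, 'Y'],
                         'col3': ['I', 'J', 'I', 'J', 'J']})
d = defaultdict(LabelEncoder)

def encode_keep_nan(col):
    # encode only the non-null cells; null cells stay NaN
    out = pd.Series(np.nan, index=col.index)
    mask = col.notna()
    out[mask] = d[col.name].fit_transform(col[mask])
    return out

encodedDF = df2.apply(encode_keep_nan)
print(encodedDF)
The inverse transform would need the same mask treatment.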

Remove consecutive duplicate entries from pandas in each cell

I have a data frame that looks like
d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
df = pd.DataFrame(data=d)
expected output
d={'col1':['a,b','a,c,b'],'col2':['a,b','a,b,a']}
I have tried this:
from itertools import groupby

arr = ['a', 'a', 'b', 'a', 'a', 'c', 'c']
print([x[0] for x in groupby(arr)])
How do I remove the consecutive duplicate entries in each row and column of the dataframe?
a,a,b,c should be a,b,c
From what I understand, you don't want to include values which repeat consecutively. You can try this custom function:
def myfunc(x):
    s = pd.Series(x.split(','))
    res = s[s.ne(s.shift())]
    return ','.join(res.values)

print(df.applymap(myfunc))
col1 col2
0 a,b a,b
1 a,c,b a,b,a
Another function can be created with itertools.groupby, such as:
from itertools import groupby

def myfunc(x):
    l = [x[0] for x in groupby(x.split(','))]
    return ','.join(l)
You could define a function to help with this, then use .applymap to apply it to all columns (or .apply one column at a time):
d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
df = pd.DataFrame(data=d)
def remove_dups(string):
    split = string.split(',')   # split string into a list
    uniques = set(split)        # remove duplicate list elements
    return ','.join(uniques)    # rejoin the list elements into a string

result = df.applymap(remove_dups)
This returns:
col1 col2
0 a,b a,b
1 a,c,b a,b
Edit: This looks slightly different from your expected output; why do you expect a,b,a for the second row in col2?
Edit 2: to preserve the original order, you can replace the set() call with unique_everseen():
from more_itertools import unique_everseen
.
.
.
uniques = unique_everseen(split)
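For completeness, a version of the helper using unique_everseen might look like this (a sketch combining the pieces above; like the set version, it drops all repeats rather than only consecutive ones):
import pandas as pd
from more_itertools import unique_everseen

d = {'col1': ['a,a,b', 'a,c,c,b'], 'col2': ['a,a,b', 'a,b,b,a']}
df = pd.DataFrame(data=d)

def remove_dups(string):
    split = string.split(',')           # split string into a list
    uniques = unique_everseen(split)    # keep the first occurrence of each element, preserving order
    return ','.join(uniques)            # rejoin into a string

print(df.applymap(remove_dups))
#     col1 col2
# 0    a,b  a,b
# 1  a,c,b  a,b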

Return pieces of strings from separate pandas dataframes based on multi-conditional logic

I'm new to Python and trying to do some work with dataframes in pandas.
I have a primary dataframe (df1) and a second dataframe (df2). The goal is to fill in the df1['vd_type'] column with strings based on several pieces of conditional logic. I can make this work with nested np.where() functions, but as this gets deeper into the hierarchy it gets too long to run at all, so I'm looking for a more elegant solution.
The english version of the logic is this:
For df1['vd_type']: If df1['shape'] == the first two characters in df2['vd_combo'] AND df1['vd_pct'] <= df2['combo_value'], then return the last 3 characters in df2['vd_combo'] on the line where both of these conditions are true. If it can't find a line in df2 where both conditions are true, then return "vd4".
Thanks in advance!
EDIT #2: I want to implement a third condition based on another variable, with everything else the same, except that in df1 there is another column 'log_vsc' with existing values, and the goal is to fill an empty df1 column 'vsc_type' with one of four strings in the same scheme. The extra condition is just that the 'vd_type' we just defined must match the 'vd' column arising from splitting 'vsc_combo'.
df3 = pd.DataFrame()
df3['vsc_combo'] = ['A1_vd1_vsc1','A1_vd1_vsc2','A1_vd1_vsc3','A1_vd2_vsc1','A1_vd2_vsc2' etc etc etc
df3['combo_value'] = [(number), (number), (number), (number), (number), etc etc
df3[['shape','vd','vsc']] = df3['vsc_combo'].str.split('_', expand = True)
def vsc_condition(row, df3):
    df_select = df3[(df3['shape'] == row['shape']) & (df3['vd'] == row['vd_type']) & (row['log_vsc'] <= df3['combo_value'])]
    if df_select.empty:
        return 'vsc4'
    else:
        return df_select['vsc'].iloc[0]
## apply vsc_type
df1['vsc_type'] = df1.apply( vsc_condition, args = ([df3]), axis = 1)
And this works!! Thanks again!
So your inputs are like:
import pandas as pd
df1 = pd.DataFrame({'shape': ['A2', 'A1', 'B1', 'B1', 'A2'],
                    'vd_pct': [0.78, 0.33, 0.48, 0.38, 0.59]})
df2 = pd.DataFrame({'vd_combo': ['A1_vd1', 'A1_vd2', 'A1_vd3', 'A2_vd1', 'A2_vd2', 'A2_vd3', 'B1_vd1', 'B1_vd2', 'B1_vd3'],
                    'combo_value': [0.38, 0.56, 0.68, 0.42, 0.58, 0.71, 0.39, 0.57, 0.69]})
If you are not against creating columns in df2 (you can delete them at the end if it's a problem), you can generate two columns, shape and vd, by splitting the column vd_combo:
df2[['shape','vd']] = df2['vd_combo'].str.split('_',expand=True)
Then you can create a function condition that you will use in apply such as:
def condition(row, df2):
    # row will be a row of df1 in apply
    # select only the rows of df2 that meet your conditions on shape and value
    df_select = df2[(df2['shape'] == row['shape']) & (row['vd_pct'] <= df2['combo_value'])]
    # if empty (your conditions are not met), return vd4
    if df_select.empty:
        return 'vd4'
    # otherwise return the 'vd' of the first matching row (the smallest combo_value)
    else:
        return df_select['vd'].iloc[0]
Now you can create your column vd_type in df1 with:
df1['vd_type'] = df1.apply( condition, args =([df2]), axis=1)
df1 is like:
shape vd_pct vd_type
0 A2 0.78 vd4
1 A1 0.33 vd1
2 B1 0.48 vd2
3 B1 0.38 vd1
4 A2 0.59 vd3
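As an aside (not part of the answer above), if df1 gets large the row-by-row apply can be slow; a vectorised sketch using pd.merge_asof, under the assumption that df2 is sorted by combo_value, would be:
import pandas as pd

df1 = pd.DataFrame({'shape': ['A2', 'A1', 'B1', 'B1', 'A2'],
                    'vd_pct': [0.78, 0.33, 0.48, 0.38, 0.59]})
df2 = pd.DataFrame({'vd_combo': ['A1_vd1', 'A1_vd2', 'A1_vd3', 'A2_vd1', 'A2_vd2', 'A2_vd3', 'B1_vd1', 'B1_vd2', 'B1_vd3'],
                    'combo_value': [0.38, 0.56, 0.68, 0.42, 0.58, 0.71, 0.39, 0.57, 0.69]})
df2[['shape', 'vd']] = df2['vd_combo'].str.split('_', expand=True)

# merge_asof needs both sides sorted on their join keys;
# keep df1's original index in a column so we can align back afterwards
left = df1.sort_values('vd_pct').reset_index()
right = df2.sort_values('combo_value')[['shape', 'combo_value', 'vd']]

# direction='forward' picks the first combo_value >= vd_pct within each shape
merged = pd.merge_asof(left, right,
                       left_on='vd_pct', right_on='combo_value',
                       by='shape', direction='forward')

# rows with no matching combo_value come back as NaN -> 'vd4'
df1['vd_type'] = merged.set_index('index')['vd'].fillna('vd4')
print(df1)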

Merge and then sort columns of a dataframe based on the columns of the merging dataframe

I have two dataframes, both indexed with timestamps. I would like to preserve the order of the columns in the first dataframe that is merged.
For example:
#required packages
import pandas as pd
import numpy as np
# defining stuff
num_periods_1 = 11
num_periods_2 = 4
# create sample time series
dates1 = pd.date_range('1/1/2000 00:00:00', periods=num_periods_1, freq='10min')
dates2 = pd.date_range('1/1/2000 01:30:00', periods=num_periods_2, freq='10min')
column_names_1 = ['C', 'B', 'A']
column_names_2 = ['B', 'C', 'D']
df1 = pd.DataFrame(np.random.randn(num_periods_1, len(column_names_1)), index=dates1, columns=column_names_1)
df2 = pd.DataFrame(np.random.randn(num_periods_2, len(column_names_2)), index=dates2, columns=column_names_2)
df3 = df1.merge(df2, how='outer', left_index=True, right_index=True, suffixes=['_1', '_2'])
print("\nData Frame Three:\n", df3)
The above code generates two data frames: the first with columns C, B, and A; the second with columns B, C, and D. The current output has the columns in the following order: C_1, B_1, A, B_2, C_2, D. What I want is for the columns from the output of the merge to be C_1, C_2, B_1, B_2, A_1, D_2: the column order of the first data frame is preserved, and any column shared with the second data frame is placed next to its counterpart.
Is there a setting in merge, or can I use sort_index, to do this?
EDIT: Maybe a better way to phrase the sorting process would be to call it uncollated, where matching columns are put together, and so on.
Using an OrderedDict, as you suggested.
from collections import OrderedDict
from itertools import chain

c = df3.columns.tolist()
o = OrderedDict()
for x in c:
    o.setdefault(x.split('_')[0], []).append(x)

c = list(chain.from_iterable(o.values()))
df3 = df3[c]
An alternative involves extracting the prefixes and then calling sorted on the column names.
# https://stackoverflow.com/a/46839182/4909087
p = [s[0] for s in c]
c = sorted(c, key=lambda x: (p.index(x[0]), x))
df3 = df3[c]
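With the sample frames above, both approaches should leave the columns in this order (note that A and D carry no suffix, since merge only suffixes names present in both frames):
print(df3.columns.tolist())
# ['C_1', 'C_2', 'B_1', 'B_2', 'A', 'D']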
