col_exclusions = ['numerator', 'Numerator', 'Denominator', 'denominator']
The dataframe:
id  prim_numerator  sec_Numerator  tern_Numerator  tern_Denominator  final_denominator  Result
1   12              23             45              54                56                 Fail
The final output should contain only id and Result.
Using regex:
import re

pat = re.compile('|'.join(col_exclusions), flags=re.IGNORECASE)
final_cols = [c for c in df.columns if not pat.search(c)]
#out:
['id', 'Result']
print(df[final_cols])
id Result
0 1 Fail
If you want to drop the columns instead:
df = df.drop([c for c in df.columns if pat.search(c)], axis=1)
Or the pure pandas approach, thanks to @Anky_91:
df.loc[:, ~df.columns.str.contains('|'.join(col_exclusions), case=False)]
You can be explicit and use del for columns that end with the suffixes in your input list:
for column in list(df.columns):  # iterate over a snapshot, since we delete while looping
    if any(column.endswith(suffix) for suffix in col_exclusions):
        del df[column]
You can also use the following approach, where each column name is split on underscores and its last piece is matched against col_exclusions:
df.drop(columns=[i for i in df.columns if i.split("_")[-1] in col_exclusions], inplace=True)
print(df.head())
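Note that, unlike the regex versions above, this split-based match only catches the exact spellings present in col_exclusions. A minimal case-insensitive sketch, lower-casing both sides before comparing:
exclusions = {c.lower() for c in col_exclusions}
df.drop(columns=[i for i in df.columns if i.split("_")[-1].lower() in exclusions], inplace=True)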
Related
This is what my dataframe looks like:
df = {"a": [[1,2,3], [4,5,6]],
"b": [[11,22,33], [44,55,66]],
"c": [[111,222,333], [444,555,666]]}
df = pd.Dataframe(df)
For each cell, I want to keep only the item at an index I choose.
I've tried this to keep the first item of each list:
df = df.apply(lambda x:x[0])
but it doesn't work.
Can anyone enlighten me on this? Thanks.
If you need the first value for all columns, you need to process the lambda function elementwise with DataFrame.applymap:
df = df.applymap(lambda x:x[0])
print(df)
a b c
0 1 11 111
1 4 44 444
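Side note: if you are on pandas 2.1 or newer, DataFrame.applymap is deprecated in favour of the equivalent DataFrame.map:
# pandas >= 2.1: elementwise map replaces the deprecated applymap
df = df.map(lambda x: x[0])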
I want to check whether each entry in checklst is one of the column names in a df and, if not, create an empty column with that name. I'm not sure what the best approach is for this.
checklst= ["a","b","c","d","e","f"]
for i in checklst:
if not checklst' in df.columns:
df = df.withColumn("checklst", F.lit(None))
checklst= ["a","b","c","d","e","f"]
for x in checklst:
if x not in df.columns:
df = df.withColumn(x, F.lit(None))
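The snippet above is PySpark (df.withColumn, F.lit). If you are working in plain pandas instead, a minimal equivalent sketch would be:
import numpy as np

checklst = ["a", "b", "c", "d", "e", "f"]
for x in checklst:
    if x not in df.columns:
        df[x] = np.nan  # create the missing column as empty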
I have a sample dataframe as given below.
import pandas as pd
import numpy as np

data = {'ID': ['A', 'B', 'C', 'D'],
        'Age': [[20], [21], [19], [24]],
        'Sex': [['Male'], ['Male'], ['Female'], np.nan],
        'Interest': [['Dance','Music'], ['Dance','Sports'], ['Hiking','Surfing'], np.nan]}
df = pd.DataFrame(data)
df
Each of the columns holds lists. I want to unwrap those lists while preserving the datatypes of the values inside them, for all columns.
The final output should look something like the one shown below.
Any help is greatly appreciated. Thank you.
Option 1. You can use the .str column accessor to index the lists stored in the DataFrame values (or strings, or any other iterable):
# Replace columns containing length-1 lists with the only item in each list
df['Age'] = df['Age'].str[0]
df['Sex'] = df['Sex'].str[0]
# Pass the variable-length list into the join() string method
df['Interest'] = df['Interest'].apply(', '.join)
Option 2. explode Age and Sex (passing a list of columns to explode requires pandas >= 1.3), then apply ', '.join to Interest:
df = df.explode(['Age', 'Sex'])
df['Interest'] = df['Interest'].apply(', '.join)
Both options return:
df
ID Age Sex Interest
0 A 20 Male Dance, Music
1 B 21 Male Dance, Sports
2 C 19 Female Hiking, Surfing
EDIT
Option 3. If you have many columns which contain lists with possible missing values as np.nan, you can get the list-column names and then loop over them as follows:
# Get columns which contain at least one python list
list_cols = [c for c in df
             if df[c].apply(lambda x: isinstance(x, list)).any()]
list_cols
['Age', 'Sex', 'Interest']
# Process each column
for c in list_cols:
    # If all lists in column c contain a single item (ignoring NaNs):
    if (df[c].str.len().dropna() == 1).all():
        df[c] = df[c].str[0]
    else:
        # Join lists; leave missing values (np.nan) untouched
        df[c] = df[c].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
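With the sample data above (including row D's missing values), the result should look something like:
print(df)
#   ID  Age     Sex         Interest
# 0  A   20    Male     Dance, Music
# 1  B   21    Male    Dance, Sports
# 2  C   19  Female  Hiking, Surfing
# 3  D   24     NaN              NaN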
I have 2 Pandas Dataframes with 5 columns and about 1000 rows each (working with python3).
I'm interested in making a comparison between the first column in df1 and the first column of df2 as follows:
DF1
[index] [col1]
1 "foobar"
2 "acksyn"
3 "foobaz"
4 "ackfin"
... ...
DF2
[index] [col1]
1 "old"
2 "fin"
3 "new"
4 "bar"
... ...
What I want to achieve is this: for each row of DF1, if DF1.col1 ends with any of the values in DF2.col1, drop the row.
In this example the resulting DF1 should be:
DF1
[index] [col1]
2 "acksyn"
3 "foobaz"
... ...
(the DF2 values at indexes 2 and 4 are the final part of the DF1 values at indexes 1 and 4)
I tried using an internally defined function like:
def check_presence(df1_col1, second_csv):
    for index, row in second_csv.iterrows():
        search_string = "(?P<first_group>^(" + some_string + "))(?P<the_rest>" + row["col1"] + "$)"
        if re.search(search_string, df1_col1):
            return True
    return False
and instructions with this format:
indexes = csv[csv.col1.str.contains(some_regex, regex= True, na=False)].index
but in both cases the python console complains about not being able to compare non-string objects with a string.
What am I doing wrong? I could even try a solution after joining the 2 CSVs, but I think I would need to do the same thing in the end.
Thanks for patience, I'm new to python...
You will need to join your keywords in df2 first if you want to use the str.contains method.
import pandas as pd
df = pd.DataFrame({'col1': {0: 'foobar', 1: 'acksyn', 2: 'foobaz', 3: 'ackfin'}})
df2 = pd.DataFrame({'col1': {0: 'old', 1: 'fin', 2: 'new', 3: 'bar'}})
print (df["col1"].str.contains("|".join(df2["col1"])))
#
0 True
1 False
2 False
3 True
Possible Solution
"" for each row of DF1, if DF1.col1 ends in any values of DF2.col1, drop the row.""
This is a one-liner if I understand properly:
# Search for Substring
# Generate an "OR" statement with a join
# Drop if match.
df[~df.col1.str.contains('|'.join(df2.col1.values))]
This will keep only the rows where DF2.Col1 is NOT found in DF1.Col1.
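One caveat worth noting: str.contains matches anywhere in the string, while the requirement is specifically "ends with". A sketch that anchors the alternation to the end of the string (escaping the values in case they contain regex metacharacters):
import re

# Keep only rows whose col1 does NOT end with any value from df2.col1
pattern = '(?:' + '|'.join(map(re.escape, df2['col1'])) + ')$'
result = df[~df['col1'].str.contains(pattern)]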
pd.Series.str.contains
Take your frames
frame1 = pd.DataFrame({"col1": ["foobar", "acksyn", "foobaz", "ackfin"]})
frame2 = pd.DataFrame({"col1": ["old", "fin", "new", "bar"]})
Then
myList = frame2.col1.values
pattern = '|'.join(myList)
Finally
frame1["col2"]=frame1["col1"].str.contains(pattern)
frame1.loc[frame1["col2"]==True]
col1 col2
0 foobar True
3 ackfin True
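Note that this selects the rows that do match, i.e. the ones the question wants to drop. To keep the remaining rows instead, invert the mask:
frame1.loc[~frame1["col2"]].drop(columns="col2")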
I have written the below code that accepts a pandas series (dataframe column) of strings and a dictionary of terms to replace in the strings.
def phrase_replace(repl_dict, str_series):
    for k, v in repl_dict.items():
        str_series = str_series.str.replace(k, v)
    return str_series
It works correctly, but it seems like I should be able to use some kind of list comprehension instead of the for loop.
I don't want to use str_series = [] or {} because I don't want a list or a dictionary returned, but a pandas.core.series.Series
Likewise, if I want to use the function on every column in a dataframe:
for column in df.columns:
    df[column] = phrase_replace(repl_dict, df[column])
There must be a list comprehension method to do this?
It is possible, but you then need concat to rebuild the DataFrame, because the comprehension produces a list of Series:
df = pd.concat([phrase_replace(repl_dict, df[column]) for column in df.columns], axis=1)
But maybe you just need replace with a dictionary:
df = df.replace(repl_dict)
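One caveat: without regex=True, DataFrame.replace only replaces cell values that match a dictionary key exactly, whereas the str.replace loop above replaces substrings. To mirror the substring behaviour:
# Substring replacement, equivalent to the str.replace loop
df = df.replace(repl_dict, regex=True)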
df = pd.DataFrame({'words':['apple','banana','orange']})
repl_dict = {'an':'foo', 'pp':'zz'}
df.replace({'words':repl_dict}, inplace=True, regex=True)
df
Out[263]:
words
0 azzle
1 bfoofooa
2 orfooge
If you want to apply to all columns:
df2 = pd.DataFrame({'key1':['apple', 'banana', 'orange'], 'key2':['banana', 'apple', 'pineapple']})
df2
Out[13]:
key1 key2
0 apple banana
1 banana apple
2 orange pineapple
df2.replace(repl_dict,inplace=True, regex=True)
df2
Out[15]:
key1 key2
0 azzle bfoofooa
1 bfoofooa azzle
2 orfooge pineazzle
The whole point of pandas is to not use for loops... it's optimized to use the built-in methods for dataframes and series.