col_exclusions = ['numerator', 'Numerator', 'Denominator', 'denominator']
The dataframe:
id  prim_numerator  sec_Numerator  tern_Numerator  tern_Denominator  final_denominator  Result
1   12              23             45              54                56                 Fail
The final output should contain only id and Result.
Using regex:
import re

pat = re.compile('|'.join(col_exclusions), flags=re.IGNORECASE)
final_cols = [c for c in df.columns if not pat.search(c)]
#out:
['id', 'Result']
print(df[final_cols])
id Result
0 1 Fail
If you want to drop the columns instead:
df = df.drop([c for c in df.columns if pat.search(c)], axis=1)
Or the pure pandas approach, thanks to @Anky_91:
df.loc[:, ~df.columns.str.contains('|'.join(col_exclusions), case=False)]
You can be explicit and use del for columns that end with the suffixes in your input list:
for column in list(df.columns):  # iterate over a snapshot, since we delete while looping
    if any(column.endswith(suffix) for suffix in col_exclusions):
        del df[column]
You can also use the following approach, where each column name is split on underscores and its last piece is matched against col_exclusions:
df.drop(columns=[i for i in df.columns if i.split("_")[-1] in col_exclusions], inplace=True)
print(df.head())
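Note that, unlike the regex versions above, this split-based match only catches the exact spellings present in col_exclusions. A minimal case-insensitive sketch, lower-casing both sides before comparing:
exclusions = {c.lower() for c in col_exclusions}
df.drop(columns=[i for i in df.columns if i.split("_")[-1].lower() in exclusions], inplace=True)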
Related
This is what my dataframe looks like:
df = {"a": [[1,2,3], [4,5,6]],
"b": [[11,22,33], [44,55,66]],
"c": [[111,222,333], [444,555,666]]}
df = pd.Dataframe(df)
For each cell, I want to keep only the item at an index I choose.
I've tried this to keep the first item of each list:
df = df.apply(lambda x:x[0])
but it doesn't work.
Can anyone enlighten me on this? Thanks.
If you need the first value for all columns, you need to process the lambda function elementwise with DataFrame.applymap:
df = df.applymap(lambda x:x[0])
print(df)
a b c
0 1 11 111
1 4 44 444
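Side note: if you are on pandas 2.1 or newer, DataFrame.applymap is deprecated in favour of the equivalent DataFrame.map:
# pandas >= 2.1: elementwise map replaces the deprecated applymap
df = df.map(lambda x: x[0])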
I want to check whether each entry in checklst is one of the column names in a df and, if not, create an empty column with that name. I'm not sure what the best approach is for this.
checklst= ["a","b","c","d","e","f"]
for i in checklst:
if not checklst' in df.columns:
df = df.withColumn("checklst", F.lit(None))
checklst= ["a","b","c","d","e","f"]
for x in checklst:
if x not in df.columns:
df = df.withColumn(x, F.lit(None))
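The snippet above is PySpark (df.withColumn, F.lit). If you are working in plain pandas instead, a minimal equivalent sketch would be:
import numpy as np

checklst = ["a", "b", "c", "d", "e", "f"]
for x in checklst:
    if x not in df.columns:
        df[x] = np.nan  # create the missing column as empty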
I have a sample dataframe as given below.
import pandas as pd
import numpy as np

data = {'ID': ['A', 'B', 'C', 'D'],
        'Age': [[20], [21], [19], [24]],
        'Sex': [['Male'], ['Male'], ['Female'], np.nan],
        'Interest': [['Dance','Music'], ['Dance','Sports'], ['Hiking','Surfing'], np.nan]}
df = pd.DataFrame(data)
df
Each of the columns holds lists. I want to unwrap those lists while preserving the datatypes of the values inside them, for all columns.
The final output should look something like the one shown below.
Any help is greatly appreciated. Thank you.
Option 1. You can use the .str column accessor to index the lists stored in the DataFrame values (or strings, or any other iterable):
# Replace columns containing length-1 lists with the only item in each list
df['Age'] = df['Age'].str[0]
df['Sex'] = df['Sex'].str[0]
# Pass the variable-length list into the join() string method
df['Interest'] = df['Interest'].apply(', '.join)
Option 2. explode Age and Sex (passing a list of columns to explode requires pandas >= 1.3), then apply ', '.join to Interest:
df = df.explode(['Age', 'Sex'])
df['Interest'] = df['Interest'].apply(', '.join)
Both options return:
df
ID Age Sex Interest
0 A 20 Male Dance, Music
1 B 21 Male Dance, Sports
2 C 19 Female Hiking, Surfing
EDIT
Option 3. If you have many columns which contain lists with possible missing values as np.nan, you can get the list-column names and then loop over them as follows:
# Get columns which contain at least one python list
list_cols = [c for c in df
             if df[c].apply(lambda x: isinstance(x, list)).any()]
list_cols
['Age', 'Sex', 'Interest']
# Process each column
for c in list_cols:
    # If all lists in column c contain a single item (ignoring NaNs):
    if (df[c].str.len().dropna() == 1).all():
        df[c] = df[c].str[0]
    else:
        # Join lists; leave missing values (np.nan) untouched
        df[c] = df[c].apply(lambda x: ', '.join(x) if isinstance(x, list) else x)
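With the sample data above (including row D's missing values), the result should look something like:
print(df)
#   ID  Age     Sex         Interest
# 0  A   20    Male     Dance, Music
# 1  B   21    Male    Dance, Sports
# 2  C   19  Female  Hiking, Surfing
# 3  D   24     NaN              NaN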
I have 2 Pandas Dataframes with 5 columns and about 1000 rows each (working with python3).
I'm interested in making a comparison between the first column in df1 and the first column of df2 as follows:
DF1
[index] [col1]
1 "foobar"
2 "acksyn"
3 "foobaz"
4 "ackfin"
... ...
DF2
[index] [col1]
1 "old"
2 "fin"
3 "new"
4 "bar"
... ...
What I want to achieve is this: for each row of DF1, if DF1.col1 ends with any of the values in DF2.col1, drop the row.
In this example the resulting DF1 should be:
DF1
[index] [col1]
2 "acksyn"
3 "foobaz"
... ...
(the DF2 values at indexes 2 and 4 are the final part of the DF1 values at indexes 1 and 4)
I tried using an internally defined function like:
def check_presence(df1_col1, second_csv):
    for index, row in second_csv.iterrows():
        search_string = "(?P<first_group>^(" + some_string + "))(?P<the_rest>" + row["col1"] + "$)"
        if re.search(search_string, df1_col1):
            return True
    return False
and instructions with this format:
indexes = csv[csv.col1.str.contains(some_regex, regex= True, na=False)].index
but in both cases the python console complains about not being able to compare non-string objects with a string.
What am I doing wrong? I could even try a solution after joining the 2 CSVs, but I think I would need to do the same thing in the end.
Thanks for patience, I'm new to python...
You will need to join your keywords in df2 first if you want to use the str.contains method.
import pandas as pd
df = pd.DataFrame({'col1': {0: 'foobar', 1: 'acksyn', 2: 'foobaz', 3: 'ackfin'}})
df2 = pd.DataFrame({'col1': {0: 'old', 1: 'fin', 2: 'new', 3: 'bar'}})
print (df["col1"].str.contains("|".join(df2["col1"])))
#
0 True
1 False
2 False
3 True
Possible Solution
"" for each row of DF1, if DF1.col1 ends in any values of DF2.col1, drop the row.""
This is a one-liner if I understand properly:
# Search for Substring
# Generate an "OR" statement with a join
# Drop if match.
df[~df.col1.str.contains('|'.join(df2.col1.values))]
This will keep only the rows where DF2.Col1 is NOT found in DF1.Col1.
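One caveat worth noting: str.contains matches anywhere in the string, while the requirement is specifically "ends with". A sketch that anchors the alternation to the end of the string (escaping the values in case they contain regex metacharacters):
import re

# Keep only rows whose col1 does NOT end with any value from df2.col1
pattern = '(?:' + '|'.join(map(re.escape, df2['col1'])) + ')$'
result = df[~df['col1'].str.contains(pattern)]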
pd.Series.str.contains
Take your frames
frame1 = pd.DataFrame({"col1": ["foobar", "acksyn", "foobaz", "ackfin"]})
frame2 = pd.DataFrame({"col1": ["old", "fin", "new", "bar"]})
Then
myList = frame2.col1.values
pattern = '|'.join(myList)
Finally
frame1["col2"]=frame1["col1"].str.contains(pattern)
frame1.loc[frame1["col2"]==True]
col1 col2
0 foobar True
3 ackfin True
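Note that this selects the rows that do match, i.e. the ones the question wants to drop. To keep the remaining rows instead, invert the mask:
frame1.loc[~frame1["col2"]].drop(columns="col2")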
I have written the below code that accepts a pandas series (dataframe column) of strings and a dictionary of terms to replace in the strings.
def phrase_replace(repl_dict, str_series):
    for k, v in repl_dict.items():
        str_series = str_series.str.replace(k, v)
    return str_series
It works correctly, but it seems like I should be able to use some kind of list comprehension instead of the for loop.
I don't want to use str_series = [] or {} because I don't want a list or a dictionary returned, but a pandas.core.series.Series
Likewise, if I want to use the function on every column in a dataframe:
for column in df.columns:
    df[column] = phrase_replace(repl_dict, df[column])
There must be a list comprehension method to do this?
It is possible, but you then need concat to rebuild the DataFrame, because the comprehension produces a list of Series:
df = pd.concat([phrase_replace(repl_dict, df[column]) for column in df.columns], axis=1)
But maybe you just need replace with a dictionary:
df = df.replace(repl_dict)
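One caveat: without regex=True, DataFrame.replace only replaces cell values that match a dictionary key exactly, whereas the str.replace loop above replaces substrings. To mirror the substring behaviour:
# Substring replacement, equivalent to the str.replace loop
df = df.replace(repl_dict, regex=True)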
df = pd.DataFrame({'words':['apple','banana','orange']})
repl_dict = {'an':'foo', 'pp':'zz'}
df.replace({'words':repl_dict}, inplace=True, regex=True)
df
Out[263]:
words
0 azzle
1 bfoofooa
2 orfooge
If you want to apply to all columns:
df2 = pd.DataFrame({'key1':['apple', 'banana', 'orange'], 'key2':['banana', 'apple', 'pineapple']})
df2
Out[13]:
key1 key2
0 apple banana
1 banana apple
2 orange pineapple
df2.replace(repl_dict,inplace=True, regex=True)
df2
Out[15]:
key1 key2
0 azzle bfoofooa
1 bfoofooa azzle
2 orfooge pineazzle
The whole point of pandas is to not use for loops... it's optimized to use the built-in methods for dataframes and series.