How to check and delete pandas column if all rows contain just punctuation character? - python-3.x

I have a dataframe with multiple columns and I want to delete any column where every row contains only a punctuation character, e.g.
col_1 col_2 col_3 col_4
0 1 _ ab 1,235
1 2 ? cd 8,900
2 3 _ ef 1,235
3 4 - gh 8,900
Here I just want to delete col_2. How can I achieve that?

The idea is to test whether every value in a column contains a digit or a letter, using Series.str.contains inside DataFrame.apply together with DataFrame.all, and then filter with DataFrame.loc:
df = df.loc[:, df.astype(str).apply(lambda x: x.str.contains(r'\d|\w')).all()]
Or, since \w also matches the underscore, restrict to ASCII letters explicitly:
df = df.loc[:, df.astype(str).apply(lambda x: x.str.contains(r'\d|[a-zA-Z]')).all()]
print (df)
col_1 col_3 col_4
0 1 ab 1,235
1 2 cd 8,900
2 3 ef 1,235
3 4 gh 8,900
If it is possible to list all the values to remove in one string, add ^ for start of string and $ for end of string, and then invert the mask with ~:
p = """[!"#$%&\'()*+,-./:;<=>?#[\\]^_`{|}~ ]"""
df = df.loc[:, ~df.astype(str).apply(lambda x: x.str.contains('^' + p + '$')).all()]
print (df)
col_1 col_3 col_4
0 1 ab 1,235
1 2 cd 8,900
2 3 ef 1,235
3 4 gh 8,900
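A minimal, self-contained sketch of that approach, assuming the sample frame from the question and building the character class from string.punctuation instead of typing it out by hand:
import re
import string
import pandas as pd

df = pd.DataFrame({'col_1': [1, 2, 3, 4],
                   'col_2': ['_', '?', '_', '-'],
                   'col_3': ['ab', 'cd', 'ef', 'gh'],
                   'col_4': ['1,235', '8,900', '1,235', '8,900']})

# character class covering every punctuation character plus the space
p = '[' + re.escape(string.punctuation) + ' ]'
# a column is dropped when every cell is exactly one punctuation (or space) character
mask = df.astype(str).apply(lambda x: x.str.contains('^' + p + '$')).all()
df = df.loc[:, ~mask]
print(df)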

You could use Python's string.punctuation constant and filter out the columns whose values are all punctuation with pandas' all method:
import string
#create a list of individual punctuations
punctuation_list = list(string.punctuation)
#filter out unwanted column
df.loc[:,~df.isin(punctuation_list).all()]
col_1 col_3 col_4
0 1 ab 1,235
1 2 cd 8,900
2 3 ef 1,235
3 4 gh 8,900
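Note that isin compares each whole cell against the single characters in punctuation_list, so a cell holding several punctuation characters (for example '--') would not be caught; if that matters, the regex variant above can be extended with a + quantifier ('^' + p + '+$').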

Related

Pandas Conditional Statement and Replace

I have a dataframe that looks like this
col_1 col_2 col_3 col_4
0 1 security team steve@example.com
1 james@example.com chef testing.csv
2 megan@example.com
I want to implement this logic:
If column 1 contains a value that contains an @ sign, then set column 4 to that value
else check column 2 and implement the same logic
else check column 3 and implement the same logic
else column 4 remains an empty string
My output will look like this:
col_1 col_2 col_3 col_4
0 1 security team steve@example.com steve@example.com
1 james@example.com chef testing.csv james@example.com
2 megan@example.com megan@example.com
I thought about trying to use a Numpy select statement somehow, but I believe that may be too difficult.
Any suggestions?
Solution: if some rows have multiple values containing @, the emails are joined by ,:
s = df.stack(dropna=False)
df['col_4'] = s[s.str.contains('@', na=False)].groupby(level=0).agg(','.join)
print (df)
col_1 col_2 col_3 col_4
0 1 security team steve@example.com steve@example.com
1 james@example.com chef testing.csv james@example.com
2 megan@example.com NaN NaN megan@example.com
Another idea with list comprehension:
df['col_4'] = [','.join(y for y in x if '@' in y) for x in df.fillna('').to_numpy()]
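The numpy.select route the asker mentions is also workable; a rough sketch, assuming the three source columns are named col_1, col_2, col_3 as in the example:
import numpy as np
import pandas as pd

df = pd.DataFrame({'col_1': ['1', 'james@example.com', 'megan@example.com'],
                   'col_2': ['security team', 'chef', np.nan],
                   'col_3': ['steve@example.com', 'testing.csv', np.nan]})

# conditions are evaluated in order, mirroring "check col_1, else col_2, else col_3"
conditions = [df[c].str.contains('@', na=False) for c in ['col_1', 'col_2', 'col_3']]
choices = [df[c] for c in ['col_1', 'col_2', 'col_3']]
df['col_4'] = np.select(conditions, choices, default='')
print(df)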

Drop whole groups (grouped by multiple columns) if a specific value does not exist in another column in Pandas

How can I drop a whole city/district group if the date value 2018/11/1 does not exist for it in the following dataframe:
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
3 b d 2018/9/1 3
4 b d 2018/10/1 7
The expected result will like this:
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
Thank you!
Create a helper column with DataFrame.assign, compare the dates, and test whether at least one value per group is True with GroupBy.transform('any'); then filter by boolean indexing:
mask = (df.assign(new=df['date'].eq('2018/11/1'))
          .groupby(['city','district'])['new'].transform('any'))
df = df[mask]
print (df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
If there is an error because of missing values in the mask, one possible idea is to replace the missing values in the columns used for grouping:
mask = (df.assign(new=df['date'].eq('2018/11/1'),
                  city=df['city'].fillna(-1),
                  district=df['district'].fillna(-1))
          .groupby(['city','district'])['new'].transform('any'))
df = df[mask]
print (df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
Another idea is to add possibly missing index values with reindex and also replace missing values with False:
mask = (df.assign(new=df['date'].eq('2018/11/1'))
          .groupby(['city','district'])['new'].transform('any'))
df = df[mask.reindex(df.index, fill_value=False).fillna(False)]
print (df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
There's a special GroupBy.filter() method for this. Assuming date is already datetime:
filter_date = pd.Timestamp('2018-11-01').date()
df = df.groupby(['city', 'district']).filter(lambda x: (x['date'].dt.date == filter_date).any())
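A quick end-to-end sketch of the filter approach, assuming the sample frame above with date still stored as strings:
import pandas as pd

df = pd.DataFrame({'city': ['a', 'a', 'a', 'b', 'b'],
                   'district': ['c', 'c', 'c', 'd', 'd'],
                   'date': ['2018/9/1', '2018/10/1', '2018/11/1', '2018/9/1', '2018/10/1'],
                   'value': [12, 4, 5, 3, 7]})

# convert to datetime first, then keep only the groups that contain 2018-11-01
df['date'] = pd.to_datetime(df['date'])
out = df.groupby(['city', 'district']).filter(
    lambda g: g['date'].eq(pd.Timestamp('2018-11-01')).any())
print(out)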

In python, how to locate the position of the empty rows in the middle of the file and skip some rows at the beginning dynamically

The data in an excel file looks like this
A B C
1 1 1
1 1 1
D E F G H
1 1 1 1 1
1 1 1 1 1
The file is separated into two parts by one empty row in the middle. The parts have different column names and different numbers of columns. I only need the second part, and I want to read it as a pandas dataframe. The number of rows in the first part is not fixed; different files will have different numbers of rows, so using skiprows=4 will not work.
I actually already have a solution for that. But I want to know whether there is a better solution.
import pandas as pd
path = r'C:\Users\'
file = 'test-file.xlsx'
# Read the whole file without skipping
df_temp = pd.read_excel(path + '/' + file)
This is what the data looks like in pandas; the empty row has null values in all columns.
A B C Unnamed: 3 Unnamed: 4
0 1 1 1 NaN NaN
1 1 1 1 NaN NaN
2 NaN NaN NaN NaN NaN
3 D E F G H
4 1 1 1 1 1
5 1 1 1 1 1
I try to find all empty rows and return the index of the first empty row
first_empty_row = df_temp[df_temp.isnull().all(axis=1)].index[0]
del df_temp
Then I read the file again, skipping rows based on the number found above:
df= pd.read_excel(path + '/' + file, skiprows=first_empty_row+2)
print(df)
The drawback of this solution is that I need to read the file twice. If the file has a lot of rows in the first part, reading these useless rows might take a long time. I could also loop over rows with readline until reaching an empty row, but that would be inefficient.
Does anyone have a better solution? Thanks
Find the position of the first empty row:
pos = df_temp[df_temp.isnull().all(axis=1)].index[0]
Then select everything after that position:
df = df_temp.iloc[pos+1:]
df.columns = df.iloc[0]
df.columns.name = ''
df = df.iloc[1:]
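Put together, a single-read sketch (the file name is the hypothetical one from the question):
import pandas as pd

df_temp = pd.read_excel('test-file.xlsx')              # read the whole sheet once
pos = df_temp[df_temp.isnull().all(axis=1)].index[0]   # first fully empty row
df = df_temp.iloc[pos + 1:]                            # keep the second block
df.columns = df.iloc[0]                                # promote "D E F G H" to the header
df.columns.name = ''
df = df.iloc[1:].reset_index(drop=True)
print(df)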
Your first line looks across the entire row for all null. Would it be possible to just look for the first null in the first column?
first_empty_row = df_temp[df_temp.isnull().all(axis=1)].index[0]
How does this compare in performance?
import pandas as pd
import numpy as np
data1 = {'A': [1, 1, np.nan, 'D', 1, 1],
         'B': [1, 1, np.nan, 'E', 1, 1],
         'C': [1, 1, np.nan, 'F', 1, 1],
         'Unnamed: 3': [np.nan, np.nan, np.nan, 'G', 1, 1],
         'Unnamed: 4': [np.nan, np.nan, np.nan, 'H', 1, 1]}
df1 = pd.DataFrame(data1)
print(df1)
A B C Unnamed: 3 Unnamed: 4
0 1 1 1 NaN NaN
1 1 1 1 NaN NaN
2 NaN NaN NaN NaN NaN
3 D E F G H
4 1 1 1 1 1
5 1 1 1 1 1
# create empty list to append the rows that need to be deleted
list1 = []
# loop through the first column of the dataframe and append the index to a list until the row is null
for index, row in df1.iterrows():
    if pd.isnull(row.iloc[0]):
        list1.append(index)
        break
    else:
        list1.append(index)
# drop the rows based on list created from for loop
df1 = df1.drop(df1.index[list1])
# reset index so you can replace the old columns names
# with the secondary column names easier
df1 = df1.reset_index(drop = True)
# create empty list to append the new column names to
temp = []
# loop through dataframe and append the new column names
for label in df1.columns:
    temp.append(df1[label][0])
# replace column names with the desired names
df1.columns = temp
# drop the old column names which are always going to be at row 0
df1 = df1.drop(df1.index[0])
# reset index so it doesn't start at 1
df1 = df1.reset_index(drop = True)
print(df1)
D E F G H
0 1 1 1 1 1
1 1 1 1 1 1
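If the first block can be very large, another option (not shown in the answers above, just a sketch) is to stream the sheet with openpyxl in read-only mode to find the empty row, then let pandas skip straight to the second block:
import pandas as pd
from openpyxl import load_workbook

path = 'test-file.xlsx'   # hypothetical file name from the question

# stream rows without loading the whole sheet into memory
wb = load_workbook(path, read_only=True)
ws = wb.active
empty_row = next(i for i, row in enumerate(ws.iter_rows(values_only=True), start=1)
                 if all(cell is None for cell in row))
wb.close()

# the header of the second block sits right below the empty row
df = pd.read_excel(path, skiprows=empty_row)
print(df)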

Prevent column name from disappearing after using replace on dataframe

So I have a real dataframe that roughly follows this structure:
d = {'col1':['1_ABC','2_DEF','3 GHI']}
df = pd.DataFrame(data=d)
Basically, some entries use "_" as the separator, others use " ".
My goal is to split that first number into a new column and keep the rest. For this, I thought I'd first replace the '_' by ' ' to normalize everything, and then simply split by ' ' to get the new column.
#Replace the '_' for ' '
new_df['Name'] = df['Name'].str.replace('_',' ')
My problem is that new_df has now lost its column name:
0 1 ABC
1 2 DEF
Any way to prevent this from happening?
Thanks!
The function str.replace returns a Series, so there is no column name, only a Series name.
s = df['col1'].str.replace('_',' ')
print (s)
0 1 ABC
1 2 DEF
2 3 GHI
Name: col1, dtype: object
print (type(s))
<class 'pandas.core.series.Series'>
print (s.name)
col1
If you need a new column, assign it back to the same DataFrame as df['Name']:
df['Name'] = df['col1'].str.replace('_',' ')
print (df)
col1 Name
0 1_ABC 1 ABC
1 2_DEF 2 DEF
2 3 GHI 3 GHI
Or overwrite the values of the original column:
df['col1'] = df['col1'].str.replace('_',' ')
print (df)
col1
0 1 ABC
1 2 DEF
2 3 GHI
If you need a new one-column DataFrame, use Series.to_frame to convert the Series to a DataFrame:
df2 = df['col1'].str.replace('_',' ').to_frame()
print (df2)
col1
0 1 ABC
1 2 DEF
2 3 GHI
It is also possible to define a new column name:
df1 = df['col1'].str.replace('_',' ').to_frame('New')
print (df1)
New
0 1 ABC
1 2 DEF
2 3 GHI
As @anky_91 commented, if you need 2 new columns, add str.split:
df1 = df['col1'].str.replace('_',' ').str.split(expand=True)
df1.columns = ['A','B']
print (df1)
A B
0 1 ABC
1 2 DEF
2 3 GHI
If you need to add the columns to the existing DataFrame:
df[['A','B']] = df['col1'].str.replace('_',' ').str.split(expand=True)
print (df)
col1 A B
0 1_ABC 1 ABC
1 2_DEF 2 DEF
2 3 GHI 3 GHI
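An alternative not covered in the answer: pull the leading number and the remainder in one pass with str.extract, skipping the replace step entirely. A sketch, with hypothetical column names A and B:
import pandas as pd

df = pd.DataFrame({'col1': ['1_ABC', '2_DEF', '3 GHI']})
# two capture groups -> two new columns; the separator may be '_' or ' '
df[['A', 'B']] = df['col1'].str.extract(r'^(\d+)[_ ](.*)$')
print(df)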

How to add string to all values in a column of pandas DataFrame

Say you have a DataFrame with columns:
col_1 col_2
1 a
2 b
3 c
4 d
5 e
how would you change the values of col_2 so that new value = current value + 'new'?
Use +:
df.col_2 = df.col_2 + 'new'
print (df)
col_1 col_2
0 1 anew
1 2 bnew
2 3 cnew
3 4 dnew
4 5 enew
Thanks hooy for another solution:
df.col_2 += 'new'
Or assign:
df = df.assign(col_2 = df.col_2 + 'new')
print (df)
col_1 col_2
0 1 anew
1 2 bnew
2 3 cnew
3 4 dnew
4 5 enew
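If col_2 were not already a string column, concatenating with + would raise a TypeError; a small variant that casts first (a sketch, assuming the same frame):
df['col_2'] = df['col_2'].astype(str) + 'new'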
