I have a dataframe with multiple columns and I want to delete the columns where every row contains only a punctuation character, e.g.:
col_1 col_2 col_3 col_4
0 1 _ ab 1,235
1 2 ? cd 8,900
2 3 _ ef 1,235
3 4 - gh 8,900
Here I just want to delete col_2. How can I achieve that?
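A minimal frame to reproduce the example (values copied from the table above):
import pandas as pd

df = pd.DataFrame({'col_1': [1, 2, 3, 4],
                   'col_2': ['_', '?', '_', '-'],
                   'col_3': ['ab', 'cd', 'ef', 'gh'],
                   'col_4': ['1,235', '8,900', '1,235', '8,900']})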
The idea is to test whether every value of a column contains a digit or a letter with Series.str.contains inside DataFrame.apply, reduce with DataFrame.all, and finally filter with DataFrame.loc:
df = df.loc[:, df.astype(str).apply(lambda x: x.str.contains(r'\d|\w')).all()]
Or, because \w also matches the underscore itself, a stricter pattern that accepts only digits and ASCII letters:
df = df.loc[:, df.astype(str).apply(lambda x: x.str.contains(r'\d|[a-zA-Z]')).all()]
print (df)
col_1 col_3 col_4
0 1 ab 1,235
1 2 cd 8,900
2 3 ef 1,235
3 4 gh 8,900
If it is possible to list every character that should be removed, add ^ for start of string and $ for end of string, then invert the mask with ~:
p = """[!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ ]"""
df = df.loc[:, ~df.astype(str).apply(lambda x: x.str.contains('^' + p + '$')).all()]
print (df)
col_1 col_3 col_4
0 1 ab 1,235
1 2 cd 8,900
2 3 ef 1,235
3 4 gh 8,900
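The character class can also be built programmatically instead of typed by hand; a small sketch using string.punctuation and re.escape:
import re
import string

# escape every punctuation character plus the space, then wrap in a class
p = '[' + re.escape(string.punctuation + ' ') + ']'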
You could use Python's string.punctuation constant and filter out the columns whose values are all punctuation with pandas' all method:
import string
#create a list of individual punctuations
punctuation_list = list(string.punctuation)
#filter out unwanted column
df.loc[:,~df.isin(punctuation_list).all()]
col_1 col_3 col_4
0 1 ab 1,235
1 2 cd 8,900
2 3 ef 1,235
3 4 gh 8,900
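Note that isin compares whole cell values, so numeric columns pass through untouched. If the frame mixes types and you want string comparison throughout, casting first is a hedged variant:
df.loc[:, ~df.astype(str).isin(punctuation_list).all()]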
I have a dataframe that looks like this
col_1 col_2 col_3 col_4
0 1 security team steve@example.com
1 james@example.com chef testing.csv
2 megan@example.com
I want to implement this logic:
If column 1 contains a value with an @ sign, then replace column 4 with that value,
else check column 2 and implement the same logic
else check column 3 and implement the same logic
else column 4 remains an empty string
My output will look like this:
col_1 col_2 col_3 col_4
0 1 security team steve@example.com steve@example.com
1 james@example.com chef testing.csv james@example.com
2 megan@example.com megan@example.com
I thought about trying to use a NumPy select statement somehow, but I believe that may be too difficult.
Any suggestions?
A solution which, if some rows have multiple values containing @, joins the emails with ,:
s = df.stack(dropna=False)
df['col_4'] = s[s.str.contains('@', na=False)].groupby(level=0).agg(','.join)
print (df)
col_1 col_2 col_3 col_4
0 1 security team steve@example.com steve@example.com
1 james@example.com chef testing.csv james@example.com
2 megan@example.com NaN NaN megan@example.com
Another idea, with a list comprehension:
df['col_4'] = [','.join(y for y in x if '@' in y) for x in df.fillna('').to_numpy()]
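For completeness, the np.select approach mentioned in the question is also workable; a sketch that picks the first matching column in order (column names as in the example, one email per row assumed):
import numpy as np

cols = ['col_1', 'col_2', 'col_3']
conds = [df[c].astype(str).str.contains('@', na=False) for c in cols]
choices = [df[c] for c in cols]
# the first condition that is True wins, matching the column order
df['col_4'] = np.select(conds, choices, default='')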
How can I drop a whole city and district group if the date value 2018/11/1 does not exist for it in the following dataframe:
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
3 b d 2018/9/1 3
4 b d 2018/10/1 7
The expected result will like this:
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
Thank you!
Create a helper column with DataFrame.assign, compare the date values, and test whether at least one value per group is True with GroupBy.transform('any'); the resulting mask is used for filtering by boolean indexing:
mask = (df.assign(new=df['date'].eq('2018/11/1'))
          .groupby(['city','district'])['new'].transform('any'))
df = df[mask]
print (df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
If this errors out because of missing values, one possible idea is to replace the missing values in the columns used for the groups:
mask = (df.assign(new=df['date'].eq('2018/11/1'),
                  city=df['city'].fillna(-1),
                  district=df['district'].fillna(-1))
          .groupby(['city','district'])['new'].transform('any'))
df = df[mask]
print (df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
Another idea is to add possibly missing index values with reindex and also replace missing values with False:
mask = (df.assign(new=df['date'].eq('2018/11/1'))
          .groupby(['city','district'])['new'].transform('any'))
df = df[mask.reindex(df.index, fill_value=False).fillna(False)]
print (df)
city district date value
0 a c 2018/9/1 12
1 a c 2018/10/1 4
2 a c 2018/11/1 5
There's a special GroupBy.filter() method for this. Assuming date is already datetime:
filter_date = pd.Timestamp('2018-11-01').date()
df = df.groupby(['city', 'district']).filter(lambda x: (x['date'].dt.date == filter_date).any())
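If date is still stored as strings like those above, convert it first (assuming the year/month/day format shown in the question):
df['date'] = pd.to_datetime(df['date'], format='%Y/%m/%d')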
The data in an Excel file looks like this:
A B C
1 1 1
1 1 1
D E F G H
1 1 1 1 1
1 1 1 1 1
The file is separated into two parts by one empty row in the middle. The parts have different column names and different numbers of columns. I only need the second part, and I want to read it as a pandas dataframe. The number of rows in the first part is not fixed; different files will have different numbers of rows, so skiprows=4 will not work.
I actually already have a solution, but I want to know whether there is a better one.
import pandas as pd
path = r'C:\Users'
file = 'test-file.xlsx'
# Read the whole file without skipping
df_temp = pd.read_excel(path + '/' + file)
The data looks like this in pandas; the empty row has null values in all the columns.
A B C Unnamed: 3 Unnamed: 4
0 1 1 1 NaN NaN
1 1 1 1 NaN NaN
2 NaN NaN NaN NaN NaN
3 D E F G H
4 1 1 1 1 1
5 1 1 1 1 1
I then find all empty rows and take the index of the first one:
first_empty_row = df_temp[df_temp.isnull().all(axis=1)].index[0]
del df_temp
Then I read the file again, skipping rows up to the position found above:
df= pd.read_excel(path + '/' + file, skiprows=first_empty_row+2)
print(df)
The drawback of this solution is that I need to read the file twice. If the file has a lot of rows in the first part, reading these useless rows might take a long time. I could also use readline to loop over rows until reaching an empty one, but that would be inefficient.
Does anyone have a better solution? Thanks
Find the position of the first empty row:
pos = df_temp[df_temp.isnull().all(axis=1)].index[0]
Then select everything after that position:
df = df_temp.iloc[pos+1:]    # keep everything after the empty row
df.columns = df.iloc[0]      # promote the first remaining row to the header
df.columns.name = ''
df = df.iloc[1:]             # drop the header row from the data
Your first line looks across the entire row for all nulls:
first_empty_row = df_temp[df_temp.isnull().all(axis=1)].index[0]
Would it be possible to just look for the first null in the first column? See the sketch below.
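A sketch, assuming the separator row is blank in the first column as well:
first_empty_row = df_temp[df_temp.iloc[:, 0].isna()].index[0]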
How does this compare in performance?
import pandas as pd
import numpy as np
data1 = {'A' : [1, 1, np.nan, 'D', 1, 1],
         'B' : [1, 1, np.nan, 'E', 1, 1],
         'C' : [1, 1, np.nan, 'F', 1, 1],
         'Unnamed: 3' : [np.nan, np.nan, np.nan, 'G', 1, 1],
         'Unnamed: 4' : [np.nan, np.nan, np.nan, 'H', 1, 1]}
df1 = pd.DataFrame(data1)
print(df1)
A B C Unnamed: 3 Unnamed: 4
0 1 1 1 NaN NaN
1 1 1 1 NaN NaN
2 NaN NaN NaN NaN NaN
3 D E F G H
4 1 1 1 1 1
5 1 1 1 1 1
# create empty list to append the rows that need to be deleted
list1 = []
# loop through the first column of the dataframe and append the index to a list until the row is null
for index, row in df1.iterrows():
    if pd.isnull(row.iloc[0]):
        list1.append(index)
        break
    else:
        list1.append(index)
# drop the rows based on list created from for loop
df1 = df1.drop(df1.index[list1])
# reset index so you can replace the old columns names
# with the secondary column names easier
df1 = df1.reset_index(drop = True)
# create empty list to append the new column names to
temp = []
# loop through dataframe and append the new column names
for label in df1.columns:
    temp.append(df1[label][0])
# replace column names with the desired names
df1.columns = temp
# drop the old column names which are always going to be at row 0
df1 = df1.drop(df1.index[0])
# reset index so it doesn't start at 1
df1 = df1.reset_index(drop = True)
print(df1)
D E F G H
0 1 1 1 1 1
1 1 1 1 1 1
So I have a real dataframe that roughly follows this structure:
d = {'col1':['1_ABC','2_DEF','3 GHI']}
df = pd.DataFrame(data=d)
Basically, some entries use '_' as the separator, others use ' '.
My goal is to split that first number into a new column and keep the rest. For this, I thought I'd first replace the '_' by ' ' to normalize everything, and then simply split by ' ' to get the new column.
# Replace '_' with ' '
new_df = df['col1'].str.replace('_', ' ')
My problem is that new_df has now lost its column name:
0 1 ABC
1 2 DEF
Any way to prevent this from happening?
Thanks!
Series.str.replace returns a Series, so there is no column name, only the Series name:
s = df['col1'].str.replace('_',' ')
print (s)
0 1 ABC
1 2 DEF
2 3 GHI
Name: col1, dtype: object
print (type(s))
<class 'pandas.core.series.Series'>
print (s.name)
col1
If you need a new column, assign it to the same DataFrame as df['Name']:
df['Name'] = df['col1'].str.replace('_',' ')
print (df)
col1 Name
0 1_ABC 1 ABC
1 2_DEF 2 DEF
2 3 GHI 3 GHI
Or overwrite the values of the original column:
df['col1'] = df['col1'].str.replace('_',' ')
print (df)
col1
0 1 ABC
1 2 DEF
2 3 GHI
If you need a new one-column DataFrame, use Series.to_frame to convert the Series:
df2 = df['col1'].str.replace('_',' ').to_frame()
print (df2)
col1
0 1 ABC
1 2 DEF
2 3 GHI
It is also possible to define a new column name:
df1 = df['col1'].str.replace('_',' ').to_frame('New')
print (df1)
New
0 1 ABC
1 2 DEF
2 3 GHI
As @anky_91 commented, if you need 2 new columns, add str.split:
df1 = df['col1'].str.replace('_',' ').str.split(expand=True)
df1.columns = ['A','B']
print (df1)
A B
0 1 ABC
1 2 DEF
2 3 GHI
If you need to add the columns to the existing DataFrame:
df[['A','B']] = df['col1'].str.replace('_',' ').str.split(expand=True)
print (df)
col1 A B
0 1_ABC 1 ABC
1 2_DEF 2 DEF
2 3 GHI 3 GHI
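If the text part itself may contain spaces, limiting the split keeps it intact; a hedged variant with n=1:
df[['A','B']] = df['col1'].str.replace('_',' ').str.split(' ', n=1, expand=True)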
Say you have a DataFrame with these columns:
col_1 col_2
1 a
2 b
3 c
4 d
5 e
how would you change the values of col_2 so that new value = current value + 'new'?
Use +:
df.col_2 = df.col_2 + 'new'
print (df)
col_1 col_2
0 1 anew
1 2 bnew
2 3 cnew
3 4 dnew
4 5 enew
Thanks to hooy for another solution:
df.col_2 += 'new'
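If col_2 is not guaranteed to hold strings, cast first (an assumption beyond the example data):
df.col_2 = df.col_2.astype(str) + 'new'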
Or use DataFrame.assign:
df = df.assign(col_2 = df.col_2 + 'new')
print (df)
col_1 col_2
0 1 anew
1 2 bnew
2 3 cnew
3 4 dnew
4 5 enew