Prevent column name from disappearing after using replace on dataframe - python-3.x

So I have a real dataframe that roughly follows this structure:
d = {'col1':['1_ABC','2_DEF','3 GHI']}
df = pd.DataFrame(data=d)
Basically, some entries use '_' as the separator, others use ' '.
My goal is to split that first number into a new column and keep the rest. For this, I thought I'd first replace the '_' with ' ' to normalize everything, and then simply split by ' ' to get the new column.
# Replace the '_' with ' '
new_df = df['col1'].str.replace('_',' ')
My problem is that my new_df has now lost its column name:
0 1 ABC
1 2 DEF
Any way to prevent this from happening?
Thanks!

The function Series.str.replace returns a Series, so there is no column name, only a Series name.
s = df['col1'].str.replace('_',' ')
print (s)
0 1 ABC
1 2 DEF
2 3 GHI
Name: col1, dtype: object
print (type(s))
<class 'pandas.core.series.Series'>
print (s.name)
col1
If you need a new column, assign back to the same DataFrame, e.g. df['Name']:
df['Name'] = df['col1'].str.replace('_',' ')
print (df)
col1 Name
0 1_ABC 1 ABC
1 2_DEF 2 DEF
2 3 GHI 3 GHI
Or overwrite the values of the original column:
df['col1'] = df['col1'].str.replace('_',' ')
print (df)
col1
0 1 ABC
1 2 DEF
2 3 GHI
If you need a new one-column DataFrame, use Series.to_frame to convert the Series to a DataFrame:
df2 = df['col1'].str.replace('_',' ').to_frame()
print (df2)
col1
0 1 ABC
1 2 DEF
2 3 GHI
It is also possible to define a new column name:
df1 = df['col1'].str.replace('_',' ').to_frame('New')
print (df1)
New
0 1 ABC
1 2 DEF
2 3 GHI
As @anky_91 commented, if you need two new columns, add str.split:
df1 = df['col1'].str.replace('_',' ').str.split(expand=True)
df1.columns = ['A','B']
print (df1)
A B
0 1 ABC
1 2 DEF
2 3 GHI
If you need to add the columns to the existing DataFrame:
df[['A','B']] = df['col1'].str.replace('_',' ').str.split(expand=True)
print (df)
col1 A B
0 1_ABC 1 ABC
1 2_DEF 2 DEF
2 3 GHI 3 GHI
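As a side note, the original goal (splitting off the leading number while keeping the rest) can also be done in a single step with str.extract and a regex that accepts either separator. A minimal sketch, where the column names num and rest are only illustrative:
import pandas as pd

df = pd.DataFrame({'col1': ['1_ABC', '2_DEF', '3 GHI']})

# capture the leading digits and the remainder, splitting on either '_' or ' '
df[['num', 'rest']] = df['col1'].str.extract(r'(\d+)[_ ](.+)')
print (df)
    col1 num rest
0  1_ABC   1  ABC
1  2_DEF   2  DEF
2  3 GHI   3  GHI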

Related

How to count unique values in one column based on value in another column by group in Pandas

I'm trying to count unique values in one column only when the value meets a certain condition based on another column. For example, the data looks like this:
GroupID ID Value
ABC TX123 0
ABC TX678 1
ABC TX678 2
DEF AG123 1
DEF AG123 1
DEF AG123 1
GHI TE203 0
GHI TE203 0
I want to count the number of unique IDs per GroupID, but only when the Value column is > 0. When all values for a GroupID are 0, it should simply show 0. For example, the result dataset would look like this:
GroupID UniqueNum
ABC 1
DEF 1
GHI 0
I've tried the following, but it simply returns the unique number of IDs regardless of the value. How do I add the condition of Value > 0?
count_df = df.groupby(['GroupID'])['ID'].nunique()
positive counts only
You can use pre-filtering with loc and named aggregation with groupby.agg('nunique'):
(df.loc[df['Value'].gt(0), 'ID']
   .groupby(df['GroupID'])
   .agg(UniqueNum='nunique')
   .reset_index()
)
Output:
GroupID UniqueNum
0 ABC 1
1 DEF 1
all counts (including zero)
If you want the groups with no match to count as zero, you can reindex:
(df.loc[df['Value'].gt(0), 'ID']
   .groupby(df['GroupID'])
   .agg(UniqueNum='nunique')
   .reindex(df['GroupID'].unique(), fill_value=0)
   .reset_index()
)
Or mask:
(df['ID'].where(df['Value'].gt(0))
         .groupby(df['GroupID'])
         .agg(UniqueNum='nunique')
         .reset_index()
)
Output:
GroupID UniqueNum
0 ABC 1
1 DEF 1
2 GHI 0
Used input:
GroupID ID Value
ABC TX123 0
ABC TX678 1
ABC TX678 2
DEF AG123 1
DEF AG123 1
DEF AG123 1
GHI AB123 0
If you need 0 for non-matched groups, use Series.where to set NaN where the condition is not met, then aggregate with GroupBy.nunique:
df = pd.DataFrame({'GroupID': ['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'NEW'],
                   'ID': ['TX123', 'TX678', 'TX678', 'AG123', 'AG123', 'AG123'],
                   'Value': [0, 1, 2, 1, 1, 0]})

df = (df['ID'].where(df["Value"].gt(0)).groupby(df['GroupID'])
              .nunique()
              .reset_index(name='nunique'))
print (df)
GroupID nunique
0 ABC 1
1 DEF 1
2 NEW 0
How it works:
print (df.assign(new=df['ID'].where(df["Value"].gt(0))))
GroupID ID Value new
0 ABC TX123 0 NaN
1 ABC TX678 1 TX678
2 ABC TX678 2 TX678
3 DEF AG123 1 AG123
4 DEF AG123 1 AG123
5 NEW AG123 0 NaN <- NaN set because the condition is not matched
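For comparison, the same result can be sketched with groupby.apply, filtering inside each group; this is usually slower than the vectorized loc/where approaches above, and the lambda is only illustrative:
import pandas as pd

df = pd.DataFrame({'GroupID': ['ABC', 'ABC', 'ABC', 'DEF', 'DEF', 'NEW'],
                   'ID': ['TX123', 'TX678', 'TX678', 'AG123', 'AG123', 'AG123'],
                   'Value': [0, 1, 2, 1, 1, 0]})

# for each group, count unique IDs among the rows where Value > 0
out = (df.groupby('GroupID')
         .apply(lambda g: g.loc[g['Value'].gt(0), 'ID'].nunique())
         .reset_index(name='UniqueNum'))
print (out)
  GroupID  UniqueNum
0     ABC          1
1     DEF          1
2     NEW          0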

line feed inside row in column with pandas

Is there any way in pandas to separate data inside a row in a column? A row can have multiple values. I mean, I group by col1 and the result is that I have a df like this:
col1 Col2
0 1 abc,def,ghi
1 2 xyz,asd
and desired output would be:
Col1 Col2
0 1 abc
def
ghi
1 2 xyz
asd
thanks
Use str.split and explode:
print (df.assign(Col2=df["Col2"].str.split(","))
         .explode("Col2"))
col1 Col2
0 1 abc
0 1 def
0 1 ghi
1 2 xyz
1 2 asd
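If the blank repeats in col1 from the desired output matter for display, one possible follow-up (a cosmetic sketch that assumes the exploded frame shown above) is to blank out the duplicated group labels:
out = df.assign(Col2=df["Col2"].str.split(",")).explode("Col2")

# purely for display: hide repeated col1 values so the layout matches
# the desired output in the question
out['col1'] = out['col1'].mask(out['col1'].duplicated(), '')
print (out)
  col1 Col2
0    1  abc
0       def
0       ghi
1    2  xyz
1       asd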

Pandas dataframe, match column against list of sub-strings, continuous rows, keep only sub-string

In a Pandas dataframe, I want to match Col1 against a list of keywords as follow:
Keywords need to be different, located in the same column and on 3 continuous rows (keyword1 != keyword2 != keyword3 and they are located for example on rows x, x+1 and x+2)
I only want the keywords to be returned as results (in the example below " def" is being removed)
list_keywords = ['abc', 'ghi', 'jkl mnop','blabla']
Index Col1
1 abc def
2 ghi
3 jkl mnop
4 qrstu
5 vw
>>>
1 abc
2 ghi
3 jkl mnop
You could do something like this with df.iterrows().
for _, row in df.iterrows():
    if row['col1'] in list_keywords:
        row['col1'] = row['col1']
    else:
        val = row['col1'].split()
        row['col1'] = ' '.join(str(i) for i in val if i in list_keywords)
df
col1
0 abc
1 ghi
2 jkl mnop
3
4
Based on #HTRS's answer, here is what seems to be a partial answer to my question.
This piece of code filters the column Brand against a list of keywords and filters out the parts of the strings that differ from the keywords.
import pandas as pd

list_filtered = []
list_keywords = ['abc', 'ghi', 'jkl mnop', 'blabla']

for _, row in df.iterrows():
    if row['Brand'] in list_keywords:
        row['Brand'] = row['Brand']
        list_filtered.append(row['Brand'])
    else:
        val = row['Brand'].split()
        row['Brand'] = ' '.join(str(i) for i in val if i in list_keywords)
        list_filtered.append(row['Brand'])

df['Filtered'] = list_filtered
print(df)
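Note that modifying the row objects yielded by df.iterrows() does not reliably write back to the original DataFrame (each row is a copy), which is why the second snippet collects the results in a separate list. A possibly simpler sketch of the same filtering logic uses apply with a helper function; keep_keywords is just an illustrative name:
import pandas as pd

list_keywords = ['abc', 'ghi', 'jkl mnop', 'blabla']
df = pd.DataFrame({'col1': ['abc def', 'ghi', 'jkl mnop', 'qrstu', 'vw']})

def keep_keywords(text):
    # keep the whole cell if it is itself a keyword,
    # otherwise keep only the individual words that are keywords
    if text in list_keywords:
        return text
    return ' '.join(w for w in text.split() if w in list_keywords)

df['col1'] = df['col1'].apply(keep_keywords)
print (df)
       col1
0       abc
1       ghi
2  jkl mnop
3
4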

In python, how to locate the position of the empty rows in the middle of the file and skip some rows at the beginning dynamically

The data in an excel file looks like this
A B C
1 1 1
1 1 1
D E F G H
1 1 1 1 1
1 1 1 1 1
The file is separated into two parts by one empty row in the middle of the file. The parts have different column names and different numbers of columns. I only need the second part of the file, and I want to read it as a pandas dataframe. The number of rows in the first part is not fixed; different files will have different numbers of rows, so using skiprows=4 will not work.
I actually already have a solution for that. But I want to know whether there is a better solution.
import pandas as pd
path = r'C:\Users'
file = 'test-file.xlsx'
# Read the whole file without skipping
df_temp = pd.read_excel(path + '/' + file)
The data looks like this in pandas. The empty row has null values in all columns.
A B C Unnamed: 3 Unnamed: 4
0 1 1 1 NaN NaN
1 1 1 1 NaN NaN
2 NaN NaN NaN NaN NaN
3 D E F G H
4 1 1 1 1 1
5 1 1 1 1 1
I try to find all empty rows and return the index of the first empty row
first_empty_row = df_temp[df_temp.isnull().all(axis=1)].index[0]
del df_temp
Read the file again, but skip rows based on the index found above:
df= pd.read_excel(path + '/' + file, skiprows=first_empty_row+2)
print(df)
The drawback of this solution is that I need to read the file twice. If the file has a lot of rows in the first part, it might take a long time to read these useless rows. I could also use readline to loop through rows until reaching an empty row, but that would be inefficient.
Does anyone have a better solution? Thanks
Find the position of the first empty row:
pos = df_temp[df_temp.isnull().all(axis=1)].index[0]
Then select everything after that position:
df = df_temp.iloc[pos+1:]
df.columns = df.iloc[0]
df.columns.name = ''
df = df.iloc[1:]
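Putting that together into a runnable sketch (using a hand-built df_temp that mimics the single read_excel result from the question, so the Excel file only needs to be read once):
import numpy as np
import pandas as pd

# stand-in for the result of the single pd.read_excel call
df_temp = pd.DataFrame({'A': [1, 1, np.nan, 'D', 1, 1],
                        'B': [1, 1, np.nan, 'E', 1, 1],
                        'C': [1, 1, np.nan, 'F', 1, 1],
                        'Unnamed: 3': [np.nan, np.nan, np.nan, 'G', 1, 1],
                        'Unnamed: 4': [np.nan, np.nan, np.nan, 'H', 1, 1]})

pos = df_temp[df_temp.isnull().all(axis=1)].index[0]   # first fully empty row
df = df_temp.iloc[pos + 1:]                            # keep everything after it
df.columns = df.iloc[0]                                # promote the first kept row to header
df = df.iloc[1:].reset_index(drop=True)
df.columns.name = ''
print (df)
   D  E  F  G  H
0  1  1  1  1  1
1  1  1  1  1  1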
Your first line looks across the entire row for all null. Would it be possible to just look for the first null in the first column?
first_empty_row = df_temp[df_temp.isnull().all(axis=1)].index[0]
How does this compare in performance?
import pandas as pd
import numpy as np
data1 = {'A': [1, 1, np.NaN, 'D', 1, 1],
         'B': [1, 1, np.NaN, 'E', 1, 1],
         'C': [1, 1, np.NaN, 'F', 1, 1],
         'Unnamed: 3': [np.NaN, np.NaN, np.NaN, 'G', 1, 1],
         'Unnamed: 4': [np.NaN, np.NaN, np.NaN, 'H', 1, 1]}
df1 = pd.DataFrame(data1)
print(df1)
A B C Unnamed: 3 Unnamed: 4
0 1 1 1 NaN NaN
1 1 1 1 NaN NaN
2 NaN NaN NaN NaN NaN
3 D E F G H
4 1 1 1 1 1
5 1 1 1 1 1
# create empty list to append the rows that need to be deleted
list1 = []

# loop through the first column of the dataframe and append the index
# to the list until the row is null
for index, row in df1.iterrows():
    if pd.isnull(row[0]):
        list1.append(index)
        break
    else:
        list1.append(index)

# drop the rows based on the list created in the for loop
df1 = df1.drop(df1.index[list1])

# reset index so you can replace the old column names
# with the secondary column names more easily
df1 = df1.reset_index(drop=True)

# create empty list to append the new column names to
temp = []

# loop through the dataframe and append the new column names
for label in df1.columns:
    temp.append(df1[label][0])

# replace column names with the desired names
df1.columns = temp

# drop the old column names, which are always going to be at row 0
df1 = df1.drop(df1.index[0])

# reset index so it doesn't start at 1
df1 = df1.reset_index(drop=True)
print(df1)
D E F G H
0 1 1 1 1 1
1 1 1 1 1 1

Pandas Split on '. '

Given the following data frame:
import pandas as pd
df=pd.DataFrame({'foo':['abc','2. abc','3. abc']})
df
foo
abc
2. abc
3. abc
I'd like to split on '. ' to produce this:
foo bar
abc
1 abc
2 abc
Thanks in advance!
You can do it using the .str.extract() function:
In [163]: df.foo.str.extract(r'(?P<foo>\d*)[\.\s]*(?P<bar>.*)', expand=True)
Out[163]:
foo bar
0 abc
1 2 abc
2 3 abc
You can use str.split, but then you need to swap the values where the mask is True using numpy.where (this needs import numpy as np). Finally, fill NaN in column foo with '':
df1 = (df.foo.str.split('. ', expand=True))
df1.columns = ['foo','bar']
print (df1)
foo bar
0 abc None
1 2 abc
2 3 abc
mask = df1.bar.isnull()
print (mask)
0 True
1 False
2 False
Name: bar, dtype: bool
df1['foo'], df1['bar'] = (np.where(mask, df1['bar'], df1['foo']),
                          np.where(mask, df1['foo'], df1['bar']))
df1.foo.fillna('', inplace=True)
print (df1)
foo bar
0 abc
1 2 abc
2 3 abc
If you have a folder you can put a temporary file into, you can create a csv file and reread it with your new separator:
df.to_csv('yourfolder/yourfile.csv',index = False)
df = pd.read_csv('yourfolder/yourfile.csv',sep = '. ')
