skipping empty list and continuing with function

Background
import pandas as pd
Names = [list(['Jon', 'Smith', 'jon', 'John']),
list([]),
list(['Bob', 'bobby', 'Bobs'])]
df = pd.DataFrame({'Text' : ['Jon J Smith is Here and jon John from ',
'',
'I like Bob and bobby and also Bobs diner '],
'P_ID': [1,2,3],
'P_Name' : Names
})
#rearrange columns
df = df[['Text', 'P_ID', 'P_Name']]
df
Text P_ID P_Name
0 Jon J Smith is Here and jon John from 1 [Jon, Smith, jon, John]
1 2 []
2 I like Bob and bobby and also Bobs diner 3 [Bob, bobby, Bobs]
Goal
I would like to use the following function
df['new']=df.Text.replace(df.P_Name,'**BLOCK**',regex=True)
but skip row 2, since it has an empty list []
Tried
I have tried the following
try:
    df['new']=df.Text.replace(df.P_Name,'**BLOCK**',regex=True)
except ValueError:
    pass
But I get the following output
Text P_ID P_Name
0 Jon J Smith is Here and jon John from 1 [Jon, Smith, jon, John]
1 2 []
2 I like Bob and bobby and also Bobs diner 3 [Bob, bobby, Bobs]
Desired Output
Text P_ID P_Name new
0 `**BLOCK**` J `**BLOCK**` is Here and `**BLOCK**` `**BLOCK**` from
1 []
2 I like `**BLOCK**` and `**BLOCK**` and also `**BLOCK**` diner
Question
How do I get my desired output by skipping row 2 and continuing with my function?

Locate the rows which do not have an empty list and use your replace method only on those rows:
# Boolean indexing the rows which do not have an empty list
m = df['P_Name'].str.len().ne(0)
df.loc[m, 'New'] = df.loc[m, 'Text'].replace(df.loc[m].P_Name,'**BLOCK**',regex=True)
Output
Text P_ID P_Name New
0 Jon J Smith is Here and jon John from 1 [Jon, Smith, jon, John] **BLOCK** J **BLOCK** is Here and **BLOCK** **BLOCK** from
1 2 [] NaN
2 I like Bob and bobby and also Bobs diner 3 [Bob, bobby, Bobs] I like **BLOCK** and **BLOCK** and also **BLOCK**s diner
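As an alternative that does not depend on how Series.replace interprets a Series of lists, the same idea can be sketched row by row with the standard re module. The helper name block_names is hypothetical (it is not part of pandas), and sorting the names longest-first is an assumption made here so that 'Bobs' is matched before 'Bob':

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Text': ['Jon J Smith is Here and jon John from ',
             '',
             'I like Bob and bobby and also Bobs diner '],
    'P_ID': [1, 2, 3],
    'P_Name': [['Jon', 'Smith', 'jon', 'John'], [], ['Bob', 'bobby', 'Bobs']],
})

def block_names(text, names):
    """Hypothetical helper: replace each name with **BLOCK**, or return None for an empty list."""
    if not names:
        # Empty list: skip this row and leave a missing value
        return None
    # Try longer names first so 'Bobs' wins over 'Bob'; escape regex metacharacters
    ordered = sorted(names, key=len, reverse=True)
    pattern = '|'.join(re.escape(name) for name in ordered)
    return re.sub(pattern, '**BLOCK**', text)

df['New'] = [block_names(t, n) for t, n in zip(df['Text'], df['P_Name'])]
print(df['New'].tolist())
```

Rows with an empty list end up as missing values in New, and the rest of the frame is untouched.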

Related

How do you fill uneven pandas dataframe column with first value in column

import pandas as pd
dict = {'Name' : ['John'], 'Last Name': ['Smith'], 'Activity':['Run', 'Jump', 'Hide', 'Swim', 'Eat', 'Sleep']}
df = pd.DataFrame(dict)
How do I make it so 'John' & 'Smith' are populated in each 'Activity' that he does in a dataframe?

Let us try json_normalize, with d as defined under Input below:
out = pd.json_normalize(d,'Activity',['Name','Last Name'])
Out[160]:
0 Name Last Name
0 Run John Smith
1 Jump John Smith
2 Hide John Smith
3 Swim John Smith
4 Eat John Smith
5 Sleep John Smith
Input
d = {'Name' : ['John'], 'Last Name': ['Smith'], 'Activity':['Run', 'Jump', 'Hide', 'Swim', 'Eat', 'Sleep']}

If you strictly have one pair of Name/Last Name, you can modify the dictionary so that pandas reads activity as a list
d = {k: [v] if len(v) > 1 else v for k, v in d.items()}
df = pd.DataFrame(d)
df.explode('Activity')
Name Last Name Activity
0 John Smith Run
0 John Smith Jump
0 John Smith Hide
0 John Smith Swim
0 John Smith Eat
0 John Smith Sleep
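Another possible route, sketched here under the same assumption of exactly one Name/Last Name pair, is to unwrap the single-element lists into scalars and let the DataFrame constructor broadcast them across the activities. The variable name flat is only illustrative:

```python
import pandas as pd

d = {'Name': ['John'], 'Last Name': ['Smith'],
     'Activity': ['Run', 'Jump', 'Hide', 'Swim', 'Eat', 'Sleep']}

# Turn one-element lists into scalars; longer lists are kept as they are
flat = {k: v[0] if len(v) == 1 else v for k, v in d.items()}

# Scalars are repeated to match the length of the list-valued column
df = pd.DataFrame(flat)
print(df)
```

Unlike explode, this gives a fresh 0..5 index rather than a repeated 0.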

including word boundary in string modification to be more specific

Background
The following is a small modification of the question "skipping empty list and continuing with function".
import pandas as pd
Names = [list(['ann']),
list([]),
list(['elisabeth', 'lis']),
list(['his','he']),
list([])]
df = pd.DataFrame({'Text' : ['ann had an anniversery today',
'nothing here',
'I like elisabeth and lis 5 lists ',
'one day he and his cheated',
'same here'
],
'P_ID': [1,2,3, 4,5],
'P_Name' : Names
})
#rearrange columns
df = df[['Text', 'P_ID', 'P_Name']]
df
Text P_ID P_Name
0 ann had an anniversery today 1 [ann]
1 nothing here 2 []
2 I like elisabeth and lis 5 lists 3 [elisabeth, lis]
3 one day he and his cheated 4 [his, he]
4 same here 5 []
The code below works
m = df['P_Name'].str.len().ne(0)
df.loc[m, 'New'] = df.loc[m, 'Text'].replace(df.loc[m].P_Name,'**BLOCK**',regex=True)
And does the following
1) uses the name in P_Name to block the corresponding text in the Text column by placing **BLOCK**
2) produces a new column New
This is shown below
Text P_ID P_Name New
0 **BLOCK** had an **BLOCK**iversery today
1 NaN
2 I like **BLOCK** and **BLOCK** 5 **BLOCK**ts
3 one day **BLOCK** and **BLOCK** c**BLOCK**ated
4 NaN
Problem
However, this code works a little "too well."
Using ['his','he'] from P_Name to block Text:
Example: one day he and his cheated becomes one day **BLOCK** and **BLOCK** c**BLOCK**ated
Desired: one day he and his cheated becomes one day **BLOCK** and **BLOCK** cheated
In this example, I would like cheated to stay as cheated and not become c**BLOCK**ated
Desired Output
Text P_ID P_Name New
0 **BLOCK** had an anniversery today
1 NaN
2 I like **BLOCK** and **BLOCK** 5 lists
3 one day **BLOCK** and **BLOCK** cheated
4 NaN
Question
How do I achieve my desired output?

You need to add a word boundary (\b) on both sides of each string in the lists of df.loc[m].P_Name, as follows:
s = df.loc[m].P_Name.map(lambda x: [r'\b'+item+r'\b' for item in x])
Out[71]:
0 [\bann\b]
2 [\belisabeth\b, \blis\b]
3 [\bhis\b, \bhe\b]
Name: P_Name, dtype: object
df.loc[m, 'Text'].replace(s, '**BLOCK**',regex=True)
Out[72]:
0 **BLOCK** had an anniversery today
2 I like **BLOCK** and **BLOCK** 5 lists
3 one day **BLOCK** and **BLOCK** cheated
Name: Text, dtype: object
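If the names may contain regex metacharacters (a dot in an initial, for instance), it could be safer to escape them before adding the boundaries. A minimal row-wise sketch with the standard re module follows; the helper name block_whole_words is hypothetical, and it builds one alternation pattern per row instead of a list of patterns:

```python
import re
import pandas as pd

df = pd.DataFrame({
    'Text': ['ann had an anniversery today',
             'nothing here',
             'I like elisabeth and lis 5 lists ',
             'one day he and his cheated',
             'same here'],
    'P_Name': [['ann'], [], ['elisabeth', 'lis'], ['his', 'he'], []],
})

def block_whole_words(text, names):
    """Hypothetical helper: block names only where they appear as whole words."""
    if not names:
        # Empty list: leave a missing value for this row
        return None
    # \b(?:a|b)\b matches any of the escaped names as a whole word
    pattern = r'\b(?:' + '|'.join(re.escape(n) for n in names) + r')\b'
    return re.sub(pattern, '**BLOCK**', text)

df['New'] = [block_whole_words(t, n) for t, n in zip(df['Text'], df['P_Name'])]
print(df['New'].tolist())
```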

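Sometimes a for loop is good practice:
df['New'] = [pd.Series(x).replace(dict.fromkeys(y, '**BLOCK**')).str.cat(sep=' ') for x, y in zip(df.Text.str.split(), df.P_Name)]
# keep the result only where P_Name is non-empty; assign back instead of relying on inplace=True
df['New'] = df['New'].where(df.P_Name.astype(bool))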
df
Text ... New
0 ann had an anniversery today ... **BLOCK** had an anniversery today
1 nothing here ... NaN
2 I like elisabeth and lis 5 lists ... I like **BLOCK** and **BLOCK** 5 lists
3 one day he and his cheated ... one day **BLOCK** and **BLOCK** cheated
4 same here ... NaN
[5 rows x 4 columns]

rearrange name order in pandas column

Background
I have the following df
import pandas as pd
df= pd.DataFrame({'Text' : ['Hi', 'Hello', 'Bye'],
'P_ID': [1,2,3],
'Name' :['Bobby,Bob Lee Brian', 'Tuck,Tom T ', 'Mark, Marky '],
})
Name P_ID Text
0 Bobby,Bob Lee Brian 1 Hi
1 Tuck,Tom T 2 Hello
2 Mark, Marky 3 Bye
Goal
1) rearrange the Name column from e.g. Bobby,Bob Lee Brian to Bob Lee Brian Bobby
2) create new column Rearranged_Name
Desired Output
Name P_ID Text Rearranged_Name
0 Bobby,Bob Lee Brian 1 Hi Bob Lee Brian Bobby
1 Tuck,Tom T 2 Hello Tom T Tuck
2 Mark, Marky 3 Bye Marky Mark
Question
How do I achieve my desired output?

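Use Series.str.replace with capture groups for the values before and after the comma; \s* matches optional whitespace after the comma. Pass regex=True explicitly, because pandas 2.0 and later treat the pattern as a literal string by default:
df['Rearranged_Name'] = df['Name'].str.replace(r'(.+),\s*(.+)', r'\2 \1', regex=True)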
print (df)
Text P_ID Name Rearranged_Name
0 Hi 1 Bobby,Bob Lee Brian Bob Lee Brian Bobby
1 Hello 2 Tuck,Tom T Tom T Tuck
2 Bye 3 Mark, Marky Marky Mark
Or use Series.str.split to build a helper DataFrame and join its columns together:
df1 = df['Name'].str.split(r',\s*', expand=True)
df['Rearranged_Name'] = df1[1] + ' ' + df1[0]
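Both variants keep any trailing whitespace from the original Name values (for example 'Tom T ' in the second row). If that matters, one possible sketch is to split once on the first comma and strip each part before joining. The variable name parts is only illustrative:

```python
import pandas as pd

df = pd.DataFrame({'Text': ['Hi', 'Hello', 'Bye'],
                   'P_ID': [1, 2, 3],
                   'Name': ['Bobby,Bob Lee Brian', 'Tuck,Tom T ', 'Mark, Marky ']})

# Split on the first comma only; column 0 is the part before it, column 1 the part after
parts = df['Name'].str.split(',', n=1, expand=True)

# Strip surrounding whitespace from both parts before swapping their order
df['Rearranged_Name'] = parts[1].str.strip() + ' ' + parts[0].str.strip()
print(df)
```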

How to remove first character from the string and store the same into new column in Pandas?

I have a column called Student name, and each row has four or five student names, like this: John mills, Tim Harry, Alex win, Kate marry... I want to take the first two student names and store them in new columns called Student 1 and Student 2. The names are separated by commas.
I created a function and I am able to extract the first student name. The result is stored in my dataframe in a column called student_0.
def find_student(df2):
    for i in range(2):
        df2[f"student name_{i}"] = [x.split(',')[i] for x in df2["student name"]]
    return df2
new_df = find_student(df2)
df2 is my dataframe name.
I am not getting the second student name. Please advise.

Use Series.str.split and select the first 2 columns by position with DataFrame.iloc if you need both first names and surnames:
print (df2)
student name
0 John mills, Tim Harry, Alex win, Kate marry
1 Brando XI, James Caan, Richard S. Castellano
2 Heath Ledger, Aaron Eckhart, Michael Caine
N = 2
df3 = df2["student name"].str.split(', ', expand=True).iloc[:, :N]
#rename columns names
df3.columns = [f"student name_{i+1}" for i in range(len(df3.columns))]
print (df3)
student name_1 student name_2
0 John mills Tim Harry
1 Brando XI James Caan
2 Heath Ledger Aaron Eckhart
Or use a list comprehension:
N = 2
L = [x.split(', ')[:N] for x in df2["student name"]]
df3 = pd.DataFrame(L, columns=[f"student name_{i+1}" for i in range(N)])
print (df3)
student name_1 student name_2
0 John mills Tim Harry
1 Brando XI James Caan
2 Heath Ledger Aaron Eckhart
If you need only the first names:
N = 2
L = [[y.split()[0] for y in x.split(',')[:N]] for x in df2["student name"]]
df3 = pd.DataFrame(L, columns=[f"student name_{i+1}" for i in range(N)])
print (df3)
student name_1 student name_2
0 John Tim
1 Brando James
2 Heath Aaron
#join to original if necessary
df2 = df2.join(df3)

Try this:
def find_student(df2):
    for i in range(2):
        df2[f"student name_{i}"] = pd.Series(map(lambda x: x.split(',')[i], df2["student name"]))
    return df2

Use pandas functionality (str and split); you don't need to write a function.
df = [["John mills, Tim Harry, Alex win, Kate marry"],
["Brando XI, James Caan, Richard S. Castellano"],
["Heath Ledger,Aaron Eckhart, Michael Caine"]]
df2 = pd.DataFrame(df)
df2.columns = ['Student_Name']
# strip the space that follows each comma
df2['student name_1'] = df2.Student_Name.str.split(",").str[0].str.strip()
df2['student name_2'] = df2.Student_Name.str.split(",").str[1].str.strip()
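When the spacing after the commas is inconsistent, as in the third sample row above, a possible variant is to split on the bare comma and strip each resulting column. This is only a sketch. The column name "student name" follows the question, and the variable names parts and first are illustrative:

```python
import pandas as pd

df2 = pd.DataFrame({"student name": [
    "John mills, Tim Harry, Alex win, Kate marry",
    "Brando XI, James Caan, Richard S. Castellano",
    "Heath Ledger,Aaron Eckhart, Michael Caine",
]})

N = 2  # number of leading names to keep

# Split on commas into one column per name, then keep the first N columns
parts = df2["student name"].str.split(",", expand=True).iloc[:, :N]

# Remove stray whitespace around each name, column by column
first = parts.apply(lambda col: col.str.strip())
first.columns = [f"student name_{i + 1}" for i in range(N)]

df2 = df2.join(first)
print(df2)
```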

Split a column into 3 columns in pandas

I have a column called Names which looks like this:
0 ALLEN PHILLIP K
1 BADUM JAMES P
2 BANNANTINE JAMES M
8 BELFER ROBERT
I need to compare it with a column in a different pandas DataFrame that has the last name and first name but not the initials. I am trying to split the initials out into a new column, using a space as the delimiter, but will probably need to do it for the whole string. I tried this:
transpose_enron['lastname'], transpose_enron['firstname'], transpose_enron['middle initial'] = zip(*transpose_enron['Names'].apply(lambda x: x.split(' ', 1)))
and it gives me this error:
"ValueError: need more than 1 value to unpack"
Any ideas on how to do this?

Use the vectorised str.split with expand=True; this unpacks the list into the new columns:
In [17]:
df[['lastname', 'firstname', 'middle initial']] = df['name'].str.split(expand=True)
df
Out[17]:
name lastname firstname middle initial
index
0 ALLEN PHILLIP K ALLEN PHILLIP K
1 BADUM JAMES P BADUM JAMES P
2 BANNANTINE JAMES M BANNANTINE JAMES M
8 BELFER ROBERT BELFER ROBERT None

You can use the DataFrame constructor and, if you need to delete the original column, drop:
print(df)
Names
0 ALLEN PHILLIP K
1 BADUM JAMES P
2 BANNANTINE JAMES M
3 BELFER ROBERT
df[['lastname', 'firstname', 'middle initial']] = pd.DataFrame([ x.split() for x in df['Names'].tolist() ])
# if you want to delete the original column
df = df.drop('Names', axis=1)
print(df)
lastname firstname middle initial
0 ALLEN PHILLIP K
1 BADUM JAMES P
2 BANNANTINE JAMES M
3 BELFER ROBERT None
Timings: len(df) = 10000*4
df = pd.concat([df]*10000).reset_index(drop=True)
print(df.head())
def jez(df):
    df[['lastname', 'firstname', 'middle initial']] = pd.DataFrame([ x.split() for x in df['Names'].tolist() ])
    return df
def edc(df):
    df[['lastname', 'firstname', 'middle initial']] = df['Names'].str.split(expand=True)
    return df
print(jez(df).head())
print(edc(df).head())
My solution is faster than EdChum's if the DataFrame is larger:
In [51]: %timeit jez(df)
10 loops, best of 3: 30.1 ms per loop
In [52]: %timeit edc(df)
10 loops, best of 3: 78 ms per loop
EDIT, following an error reported in a comment:
The problem is with the data: one row contains 3 separators instead of 2, so you need to split into four columns and then delete the temporary column tmp:
print(df)
Names
0 ALLEN PHILLIP K
1 BADUM JAMES P tttt
2 BANNANTINE JAMES M
df[['lastname', 'firstname', 'middle initial', 'tmp']] = pd.DataFrame([ x.split() for x in df['Names'].tolist() ])
print(df)
Names lastname firstname middle initial tmp
0 ALLEN PHILLIP K ALLEN PHILLIP K None
1 BADUM JAMES P tttt BADUM JAMES P tttt
2 BANNANTINE JAMES M BANNANTINE JAMES M None
# if you want to delete the original and temporary columns
df = df.drop(['Names', 'tmp'], axis=1)
print(df)
lastname firstname middle initial
0 ALLEN PHILLIP K
1 BADUM JAMES P
2 BANNANTINE JAMES M
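If the number of extra tokens is not known in advance, a possible alternative to the temporary column is to cap the number of splits with the n parameter of Series.str.split, so that at most three columns are ever produced. This is a sketch on made-up rows modelled on the data above. Note that any extra tokens stay attached to the last column instead of being dropped:

```python
import pandas as pd

df = pd.DataFrame({'Names': ['ALLEN PHILLIP K',
                             'BADUM JAMES P tttt',
                             'BANNANTINE JAMES M',
                             'BELFER ROBERT']})

cols = ['lastname', 'firstname', 'middle initial']

# n=2 means at most two splits on whitespace, hence at most three parts per row;
# rows with fewer parts are padded with missing values
df[cols] = df['Names'].str.split(n=2, expand=True)
print(df)
```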
