Split a column into 3 columns in pandas - string

I have a column called Names, which looks like the sample below. I need to compare it with a column in a different pandas DataFrame that has the last name and first name but not the initials. I am trying to split the initials out into a new column, using a space as the delimiter, but will probably need to split the whole string. I tried this:
transpose_enron['lastname'], transpose_enron['firstname'], transpose_enron['middle initial'] = zip(*transpose_enron['Names'].apply(lambda x: x.split(' ', 1)))
and it gives me this error
"ValueError: need more than 1 value to unpack"
0 ALLEN PHILLIP K
1 BADUM JAMES P
2 BANNANTINE JAMES M
8 BELFER ROBERT
Any ideas on how to do this?

Use the vectorised str.split with expand=True; this unpacks the split result into the new columns, padding rows that have fewer parts with None:
In [17]:
df[['lastname', 'firstname', 'middle initial']] = df['name'].str.split(expand=True)
df
Out[17]:
name lastname firstname middle initial
index
0 ALLEN PHILLIP K ALLEN PHILLIP K
1 BADUM JAMES P BADUM JAMES P
2 BANNANTINE JAMES M BANNANTINE JAMES M
8 BELFER ROBERT BELFER ROBERT None
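As a self-contained sketch of the same idea (the sample frame and its non-contiguous index are rebuilt here by hand from the question, and the column is assumed to be called Names), the split result can be given column names and joined back on the index:

```python
import pandas as pd

# Sample data reconstructed from the question; the index 0, 1, 2, 8 is kept on purpose.
df = pd.DataFrame(
    {'Names': ['ALLEN PHILLIP K', 'BADUM JAMES P', 'BANNANTINE JAMES M', 'BELFER ROBERT']},
    index=[0, 1, 2, 8],
)

# Split on whitespace; expand=True returns one column per token,
# with missing values where a row has fewer tokens.
parts = df['Names'].str.split(expand=True)
parts.columns = ['lastname', 'firstname', 'middle initial']

# join aligns on the index, so the non-contiguous labels are not a problem.
df = df.join(parts)
print(df)
```

Rows without a middle initial, such as BELFER ROBERT, end up with a missing value in that column rather than raising an error.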

You can use the DataFrame constructor, and drop the original column afterwards if you no longer need it:
print(df)
Names
0 ALLEN PHILLIP K
1 BADUM JAMES P
2 BANNANTINE JAMES M
3 BELFER ROBERT
df[['lastname', 'firstname', 'middle initial']] = pd.DataFrame([ x.split() for x in df['Names'].tolist() ])
# if you want to delete the original column
df = df.drop('Names', axis=1)
print(df)
lastname firstname middle initial
0 ALLEN PHILLIP K
1 BADUM JAMES P
2 BANNANTINE JAMES M
3 BELFER ROBERT None
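One caveat with the constructor approach: the new DataFrame gets a default 0..n-1 index, and the assignment aligns on index labels. The following is a minimal sketch, assuming the question's index of 0, 1, 2, 8, that passes the original index to the constructor so the rows line up:

```python
import pandas as pd

# Same sample as the question, including its non-contiguous index.
df = pd.DataFrame(
    {'Names': ['ALLEN PHILLIP K', 'BADUM JAMES P', 'BANNANTINE JAMES M', 'BELFER ROBERT']},
    index=[0, 1, 2, 8],
)

cols = ['lastname', 'firstname', 'middle initial']

# Build the helper frame with the original index so labels match row for row.
parts = pd.DataFrame(
    [x.split() for x in df['Names'].tolist()],
    columns=cols,
    index=df.index,
)
df = df.join(parts)
print(df)
```

With a default index on both sides, as in the answer's example, the extra index argument makes no difference.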
Timings: len(df) = 10000*4
df = pd.concat([df]*10000).reset_index(drop=True)
print(df.head())
def jez(df):
    df[['lastname', 'firstname', 'middle initial']] = pd.DataFrame([ x.split() for x in df['Names'].tolist() ])
    return df
def edc(df):
    df[['lastname', 'firstname', 'middle initial']] = df['Names'].str.split(expand=True)
    return df
print(jez(df).head())
print(edc(df).head())
My solution is faster than EdChum's when the DataFrame is larger:
In [51]: %timeit jez(df)
10 loops, best of 3: 30.1 ms per loop
In [52]: %timeit edc(df)
10 loops, best of 3: 78 ms per loop
EDIT (for the error reported in the comments):
The problem is in the data: some rows contain 3 separators instead of 2, so you need to split into four columns and then delete the temporary column tmp:
print(df)
Names
0 ALLEN PHILLIP K
1 BADUM JAMES P tttt
2 BANNANTINE JAMES M
df[['lastname', 'firstname', 'middle initial', 'tmp']] = pd.DataFrame([ x.split() for x in df['Names'].tolist() ])
print(df)
Names lastname firstname middle initial tmp
0 ALLEN PHILLIP K ALLEN PHILLIP K None
1 BADUM JAMES P tttt BADUM JAMES P tttt
2 BANNANTINE JAMES M BANNANTINE JAMES M None
# if you want to delete the original column and the temporary one
df = df.drop(['Names', 'tmp'], axis=1)
print(df)
lastname firstname middle initial
0 ALLEN PHILLIP K
1 BADUM JAMES P
2 BANNANTINE JAMES M
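If only the first three tokens matter, an alternative to naming a throwaway tmp column is to keep exactly three columns of the split result. This is a sketch under that assumption, not part of the original answer; reindex both drops extra columns and adds missing ones:

```python
import pandas as pd

# Sample with one row that has a fourth token.
df = pd.DataFrame({'Names': ['ALLEN PHILLIP K', 'BADUM JAMES P tttt', 'BANNANTINE JAMES M']})

cols = ['lastname', 'firstname', 'middle initial']

# Split on whitespace, then force the result to exactly len(cols) columns:
# surplus tokens are discarded, short rows are padded with missing values.
parts = df['Names'].str.split(expand=True).reindex(columns=list(range(len(cols))))
parts.columns = cols

df = df.join(parts)
print(df)
```

This works regardless of how many extra tokens a row contains, at the cost of silently discarding them.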

Related

Algo to identify slightly different uniquely identifiable common names in 3 DataFrame columns

The sample DataFrame df has 3 columns that identify a person: name, nick_name, and initials. The values can differ slightly in how they are written, but by looking at the three columns together it is possible to overcome these differences, separate out all the rows for a given person, and normalize these 3 columns to a single value for each person.
>>> import pandas as pd
>>> df = pd.DataFrame({'ID':range(9), 'name':['Theodore', 'Thomas', 'Theodore', 'Christian', 'Theodore', 'Theodore R', 'Thomas', 'Tomas', 'Cristian'], 'nick_name':['Tedy', 'Tom', 'Ted', 'Chris', 'Ted', 'Ted', 'Tommy', 'Tom', 'Chris'], 'initials':['TR', 'Tb', 'TRo', 'CS', 'TR', 'TR', 'tb', 'TB', 'CS']})
>>> df
ID name nick_name initials
0 0 Theodore Tedy TR
1 1 Thomas Tom Tb
2 2 Theodore Ted TRo
3 3 Christian Chris CS
4 4 Theodore Ted TR
5 5 Theodore R Ted TR
6 6 Thomas Tommy tb
7 7 Tomas Tom TB
8 8 Cristian Chris CS
In this case desired output is as follows:
ID name nick_name initials
0 0 Theodore Ted TR
1 1 Thomas Tom TB
2 2 Theodore Ted TR
3 3 Christian Chris CS
4 4 Theodore Ted TR
5 5 Theodore Ted TR
6 6 Thomas Tom TB
7 7 Thomas Tom TB
8 8 Christian Chris CS
The common value can be anything as long as it is normalized to the same value. For example, the name can be Theodore or Theodore R; both are fine.
My actual DataFrame has about 4000 rows. Could someone help specify an optimal algorithm to do this?
You'll want to use Levenshtein distance to identify similar strings. A good Python package for this is fuzzywuzzy. Below I used a basic dictionary approach to collect similar rows together, then overwrite each chunk with a designated master row. Note that this leaves a CSV with many duplicate rows. I don't know if this is what you want, but if not, it is easy enough to take the duplicates out.
import pandas as pd
from itertools import chain
from fuzzywuzzy import fuzz
def cluster_rows(df):
    row_clusters = {}
    threshold = 90
    name_rows = list(df.iterrows())
    for i, nr in name_rows:
        name = nr['name']
        new_cluster = True
        for other in row_clusters.keys():
            if fuzz.ratio(name, other) >= threshold:
                row_clusters[other].append(nr)
                new_cluster = False
        if new_cluster:
            row_clusters[name] = [nr]
    return row_clusters
def normalize_rows(row_clusters):
    for name in row_clusters:
        master = row_clusters[name][0]
        for row in row_clusters[name][1:]:
            for key in row.keys():
                row[key] = master[key]
    return row_clusters
if __name__ == '__main__':
    df = pd.read_csv('names.csv')
    rc = cluster_rows(df)
    normalized = normalize_rows(rc)
    pd.DataFrame(chain(*normalized.values())).to_csv('norm-names.csv')
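For readers without fuzzywuzzy installed, the clustering idea can be sketched with the standard library's difflib.SequenceMatcher as a stand-in similarity measure (it is not Levenshtein distance). The helper name normalize_names and the 0.85 threshold are assumptions chosen for this sample, and only the name column is handled:

```python
from difflib import SequenceMatcher

import pandas as pd


def similar(a, b, threshold=0.85):
    """Case-insensitive similarity test based on SequenceMatcher.ratio()."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def normalize_names(names, threshold=0.85):
    """Map each name to the first previously seen name it is similar to."""
    masters = []      # one representative per cluster, in order of appearance
    normalized = []
    for name in names:
        match = next((m for m in masters if similar(name, m, threshold)), None)
        if match is None:
            masters.append(name)   # start a new cluster with this name
            match = name
        normalized.append(match)
    return normalized


df = pd.DataFrame({'name': ['Theodore', 'Thomas', 'Theodore', 'Christian', 'Theodore',
                            'Theodore R', 'Thomas', 'Tomas', 'Cristian']})
df['name_norm'] = normalize_names(df['name'])
print(df)
```

The first spelling encountered becomes the cluster's representative, so the result depends on row order. Comparing every name against every representative is quadratic in the worst case, which should still be manageable for a few thousand rows.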

Groupby and calculate count and means based on multiple conditions in Pandas

For the given dataframe as follows:
id|address|sell_price|market_price|status|start_date|end_date
1|7552 Atlantic Lane|1170787.3|1463484.12|finished|2019/8/2|2019/10/1
1|7552 Atlantic Lane|1137782.02|1422227.52|finished|2019/8/2|2019/10/1
2|888 Foster Street|1066708.28|1333385.35|finished|2019/8/2|2019/10/1
2|888 Foster Street|1871757.05|1416757.05|finished|2019/10/14|2019/10/15
2|888 Foster Street|NaN|763744.52|current|2019/10/12|2019/10/13
3|5 Pawnee Avenue|NaN|928366.2|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|2025924.16|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|4000000|forward|2019/10/9|2019/10/10
3|5 Pawnee Avenue|2236138.9|1788938.9|finished|2019/10/8|2019/10/9
4|916 W. Mill Pond St.|2811026.73|1992026.73|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|13664803.02|10914803.02|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|3234636.64|1956636.64|finished|2019/9/30|2019/10/1
5|68 Henry Drive|2699959.92|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|5830725.66|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|2668401.36|1903401.36|finished|2019/12/8|2019/12/9
#copy above data and run below code to reproduce dataframe
df = pd.read_clipboard(sep='|')
I would like to group by id and address and calculate mean_ratio and result_count based on the following conditions:
mean_ratio: group by id and address and take the mean of the sell_price/market_price ratio over the rows where status is finished and start_date falls in 2019-09 or 2019-10
result_count: group by id and address and count the rows where status is either finished or failed and start_date falls in 2019-09 or 2019-10
The desired output looks like this:
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0
1 2 888 Foster Street 1.32 1
2 3 5 Pawnee Avenue 1.25 1
3 4 916 W. Mill Pond St. 1.44 3
4 5 68 Henry Drive NaN 2
I have tried so far:
# convert date
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
# calculate ratio
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
To keep only the rows whose start_date falls in 2019-09 or 2019-10:
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
df = df[np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])]
To keep only the rows whose status is finished or failed, I use:
mask = df['status'].str.contains('finished|failed')
df[mask]
But I don't know how to combine these to get the final result. Thanks in advance for your help.
I think you need GroupBy.agg. Because the filtering excludes some groups entirely (such as id=1), add them back with DataFrame.join against df2, which holds all unique pairs of id and address, and finally replace the missing values in the result_count column:
df2 = df[['id','address']].drop_duplicates()
print (df2)
id address
0 1 7552 Atlantic Lane
2 2 888 Foster Street
5 3 5 Pawnee Avenue
9 4 916 W. Mill Pond St.
12 5 68 Henry Drive
df[['start_date', 'end_date']] = df[['start_date', 'end_date']].apply(lambda x: pd.to_datetime(x, format = '%Y/%m/%d'))
df['ratio'] = round(df['sell_price']/df['market_price'], 2)
L = [pd.Period('2019-09'), pd.Period('2019-10')]
c = ['start_date']
mask = df['status'].str.contains('finished|failed')
mask1 = np.logical_or.reduce([df[x].dt.to_period('m').isin(L) for x in c])
df = df[mask1 & mask]
df1 = df.groupby(['id', 'address']).agg(mean_ratio=('ratio','mean'),
                                        result_count=('ratio','size'))
df1 = df2.join(df1, on=['id','address']).fillna({'result_count': 0})
print (df1)
id address mean_ratio result_count
0 1 7552 Atlantic Lane NaN 0.0
2 2 888 Foster Street 1.320000 1.0
5 3 5 Pawnee Avenue 1.250000 1.0
9 4 916 W. Mill Pond St. 1.436667 3.0
12 5 68 Henry Drive NaN 2.0
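A variation that avoids dropping and re-joining groups is to keep every row and blank out the values that should not count. The sketch below is one possible arrangement, not the answer's method: the column names ratio_ok and counted are made up for illustration, the data is the question's table embedded as text, and the ratio is left unrounded, so the mean for id 4 differs slightly from the rounded-first result above.

```python
import io

import numpy as np
import pandas as pd

data = """id|address|sell_price|market_price|status|start_date|end_date
1|7552 Atlantic Lane|1170787.3|1463484.12|finished|2019/8/2|2019/10/1
1|7552 Atlantic Lane|1137782.02|1422227.52|finished|2019/8/2|2019/10/1
2|888 Foster Street|1066708.28|1333385.35|finished|2019/8/2|2019/10/1
2|888 Foster Street|1871757.05|1416757.05|finished|2019/10/14|2019/10/15
2|888 Foster Street|NaN|763744.52|current|2019/10/12|2019/10/13
3|5 Pawnee Avenue|NaN|928366.2|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|2025924.16|current|2019/10/10|2019/10/11
3|5 Pawnee Avenue|NaN|4000000|forward|2019/10/9|2019/10/10
3|5 Pawnee Avenue|2236138.9|1788938.9|finished|2019/10/8|2019/10/9
4|916 W. Mill Pond St.|2811026.73|1992026.73|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|13664803.02|10914803.02|finished|2019/9/30|2019/10/1
4|916 W. Mill Pond St.|3234636.64|1956636.64|finished|2019/9/30|2019/10/1
5|68 Henry Drive|2699959.92|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|5830725.66|NaN|failed|2019/10/8|2019/10/9
5|68 Henry Drive|2668401.36|1903401.36|finished|2019/12/8|2019/12/9"""

df = pd.read_csv(io.StringIO(data), sep='|')
df['start_date'] = pd.to_datetime(df['start_date'], format='%Y/%m/%d')
df['ratio'] = df['sell_price'] / df['market_price']

# One boolean mask per condition from the question.
in_range = df['start_date'].between('2019-09-01', '2019-10-31')
finished = df['status'].eq('finished')
finished_or_failed = df['status'].isin(['finished', 'failed'])

out = (
    df.assign(
        # ratio survives only on rows that should enter the mean
        ratio_ok=df['ratio'].where(in_range & finished),
        # True on rows that should be counted
        counted=in_range & finished_or_failed,
    )
    .groupby(['id', 'address'], sort=False)
    .agg(mean_ratio=('ratio_ok', 'mean'), result_count=('counted', 'sum'))
    .reset_index()
)
print(out.round({'mean_ratio': 2}))
```

Because no rows are removed before grouping, groups with nothing to aggregate (id 1 here) appear automatically with a missing mean and a count of 0, and result_count stays an integer.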
Some helpers:
def mean_ratio(idf):
    # keep rows inside the date range that have a ratio
    idf = idf[
        (idf['start_date'].between('2019-09-01', '2019-10-31')) &
        (idf['mean_ratio'].notnull()) ]
    return np.round(idf['mean_ratio'].mean(), 2)
def result_count(idf):
    # keep finished/failed rows inside the date range
    idf = idf[
        (idf['status'].isin(['finished', 'failed'])) &
        (idf['start_date'].between('2019-09-01', '2019-10-31')) ]
    return idf.shape[0]
# We can calculate `mean_ratio` beforehand
df['mean_ratio'] = df['sell_price'] / df['market_price']
# newer pandas versions reject the unit-less np.datetime64 here, so name the unit
df = df.astype({'start_date': 'datetime64[ns]', 'end_date': 'datetime64[ns]'})
# Group the df
g = df.groupby(['id', 'address'])
mean_ratio = g.apply(lambda idf: mean_ratio(idf)).to_frame('mean_ratio')
result_count = g.apply(lambda idf: result_count(idf)).to_frame('result_count')
# Final result
pd.concat((mean_ratio, result_count), axis=1)

skipping empty list and continuing with function

Background
import pandas as pd
Names = [list(['Jon', 'Smith', 'jon', 'John']),
list([]),
list(['Bob', 'bobby', 'Bobs'])]
df = pd.DataFrame({'Text' : ['Jon J Smith is Here and jon John from ',
'',
'I like Bob and bobby and also Bobs diner '],
'P_ID': [1,2,3],
'P_Name' : Names
})
#rearrange columns
df = df[['Text', 'P_ID', 'P_Name']]
df
Text P_ID P_Name
0 Jon J Smith is Here and jon John from 1 [Jon, Smith, jon, John]
1 2 []
2 I like Bob and bobby and also Bobs diner 3 [Bob, bobby, Bobs]
Goal
I would like to use the following function
df['new']=df.Text.replace(df.P_Name,'**BLOCK**',regex=True)
but skip row 2, since it has an empty list []
Tried
I have tried the following
try:
df['new']=df.Text.replace(df.P_Name,'**BLOCK**',regex=True)
except ValueError:
pass
But I get the following output
Text P_ID P_Name
0 Jon J Smith is Here and jon John from 1 [Jon, Smith, jon, John]
1 2 []
2 I like Bob and bobby and also Bobs diner 3 [Bob, bobby, Bobs]
Desired Output
Text P_ID P_Name new
0 `**BLOCK**` J `**BLOCK**` is Here and `**BLOCK**` `**BLOCK**` from
1 []
2 I like `**BLOCK**` and `**BLOCK**` and also `**BLOCK**` diner
Question
How do I get my desired output by skipping row 2 and continuing with my function?
Locate the rows which do not have an empty list and use your replace method only on those rows:
# Boolean indexing the rows which do not have an empty list
m = df['P_Name'].str.len().ne(0)
df.loc[m, 'New'] = df.loc[m, 'Text'].replace(df.loc[m].P_Name,'**BLOCK**',regex=True)
Output
Text P_ID P_Name New
0 Jon J Smith is Here and jon John from 1 [Jon, Smith, jon, John] **BLOCK** J **BLOCK** is Here and **BLOCK** **BLOCK** from
1 Test 2 [] NaN
2 I like Bob and bobby and also Bobs diner 3 [Bob, bobby, Bobs] I like **BLOCK** and **BLOCK** and also **BLOCK**s diner
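The output above still contains `**BLOCK**s diner`, because Bob is replaced before Bobs gets a chance to match. A different technique, sketched here with a hypothetical helper block_names, builds one escaped alternation per row with the longest names first and applies it only to rows whose list is non-empty:

```python
import re

import pandas as pd

df = pd.DataFrame({
    'Text': ['Jon J Smith is Here and jon John from ',
             '',
             'I like Bob and bobby and also Bobs diner '],
    'P_ID': [1, 2, 3],
    'P_Name': [['Jon', 'Smith', 'jon', 'John'], [], ['Bob', 'bobby', 'Bobs']],
})


def block_names(text, names):
    """Replace every listed name in text with **BLOCK** (hypothetical helper)."""
    # Longest names first so that 'Bobs' is tried before 'Bob'.
    ordered = sorted(names, key=len, reverse=True)
    pattern = '|'.join(re.escape(n) for n in ordered)
    return re.sub(pattern, '**BLOCK**', text)


# Rows whose name list is not empty.
m = df['P_Name'].str.len().ne(0)
sub = df.loc[m]

# Assigning a Series aligns on the index; skipped rows are left as missing values.
df['new'] = pd.Series(
    [block_names(t, n) for t, n in zip(sub['Text'], sub['P_Name'])],
    index=sub.index,
)
print(df['new'].tolist())
```

Rows with an empty list are never passed to re.sub, so an empty pattern cannot match at every position of the text.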

rearrange name order in pandas column

Background
I have the following df
import pandas as pd
df= pd.DataFrame({'Text' : ['Hi', 'Hello', 'Bye'],
'P_ID': [1,2,3],
'Name' :['Bobby,Bob Lee Brian', 'Tuck,Tom T ', 'Mark, Marky '],
})
Name P_ID Text
0 Bobby,Bob Lee Brian 1 Hi
1 Tuck,Tom T 2 Hello
2 Mark, Marky 3 Bye
Goal
1) rearrange the Name column from e.g. Bobby,Bob Lee Brian to Bob Lee Brian Bobby
2) create new column Rearranged_Name
Desired Output
Name P_ID Text Rearranged_Name
0 Bobby,Bob Lee Brian 1 Hi Bob Lee Brian Bobby
1 Tuck,Tom T 2 Hello Tom T Tuck
2 Mark, Marky 3 Bye Marky Mark
Question
How do I achieve my desired output?
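Use Series.str.replace with capture groups for the values before and after the comma; \s* matches optional whitespace after the comma. Recent pandas versions need regex=True for the pattern to be treated as a regular expression:
df['Rearranged_Name'] = df['Name'].str.replace(r'(.+),\s*(.+)', r'\2 \1', regex=True)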
print (df)
Text P_ID Name Rearranged_Name
0 Hi 1 Bobby,Bob Lee Brian Bob Lee Brian Bobby
1 Hello 2 Tuck,Tom T Tom T Tuck
2 Bye 3 Mark, Marky Marky Mark
Or use Series.str.split to build a helper DataFrame and join the columns together:
df1 = df['Name'].str.split(r',\s*', expand=True)
df['Rearranged_Name'] = df1[1] + ' ' + df1[0]
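Some of the sample names carry trailing spaces ('Tuck,Tom T '), which would otherwise end up in the middle of the rearranged string. A small sketch of the regex variant that strips the value first (the str.strip step is an addition, not part of the answer above):

```python
import pandas as pd

df = pd.DataFrame({
    'Text': ['Hi', 'Hello', 'Bye'],
    'P_ID': [1, 2, 3],
    'Name': ['Bobby,Bob Lee Brian', 'Tuck,Tom T ', 'Mark, Marky '],
})

# Strip surrounding whitespace, then swap the parts around the comma.
# Group 1 is the text before the comma, group 2 the text after it.
df['Rearranged_Name'] = (
    df['Name']
    .str.strip()
    .str.replace(r'^(.+?),\s*(.+)$', r'\2 \1', regex=True)
)
print(df[['Name', 'Rearranged_Name']])
```

Names without a comma do not match the pattern and are returned unchanged apart from the stripping.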

How to remove first character from the string and store the same into new column in Pandas?

I have a column called Student name, and each row holds four or five student names, like this: John mills, Tim Harry, Alex win, Kate marry... I want to take the first two student names and store them in new columns called Student 1 and Student 2. The names are separated by commas.
I created a function and I am able to extract the first student name. The result is stored in my dataframe in a column called student name_0:
def find_student(df2):
for i in range(2):
df2[f"student name_{i}"] = [x.split(',')[i] for x in df2["student name"]]
return df2
new_df = find_student(df2)
df2 is the name of my dataframe.
I am not getting the second student name. Please advise.
Use Series.str.split with expand=True and select the first 2 columns by position with DataFrame.iloc if you need both first names and surnames:
print (df2)
student name
0 John mills, Tim Harry, Alex win, Kate marry
1 Brando XI, James Caan, Richard S. Castellano
2 Heath Ledger, Aaron Eckhart, Michael Caine
N = 2
df3 = df2["student name"].str.split(', ', expand=True).iloc[:, :N]
#rename columns names
df3.columns = [f"student name_{i+1}" for i in range(len(df3.columns))]
print (df3)
student name_1 student name_2
0 John mills Tim Harry
1 Brando XI James Caan
2 Heath Ledger Aaron Eckhart
Or use list comprehension:
N = 2
L = [x.split(', ')[:N] for x in df2["student name"]]
df3 = pd.DataFrame(L, columns=[f"student name_{i+1}" for i in range(N)])
print (df3)
student name_1 student name_2
0 John mills Tim Harry
1 Brando XI James Caan
2 Heath Ledger Aaron Eckhart
If you need only the first names:
N = 2
L = [[y.split()[0] for y in x.split(',')[:2]] for x in df2["student name"]]
df3 = pd.DataFrame(L, columns=[f"student name_{i+1}" for i in range(N)])
print (df3)
student name_1 student name_2
0 John Tim
1 Brando James
2 Heath Aaron
#join to original if necessary
df2 = df2.join(df3)
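Putting the split-and-select approach together as a runnable sketch: the sample rows are taken from the answers on this page, one of them deliberately has no space after a comma, and the stripping step is an addition that makes the split tolerant of that.

```python
import pandas as pd

df2 = pd.DataFrame({'student name': [
    'John mills, Tim Harry, Alex win, Kate marry',
    'Brando XI, James Caan, Richard S. Castellano',
    'Heath Ledger,Aaron Eckhart, Michael Caine',   # note: no space after the first comma
]})

N = 2  # how many leading names to keep

# Split on the bare comma, keep the first N columns, then trim whitespace
# so that ', ' and ',' separators give the same result.
df3 = (
    df2['student name']
    .str.split(',', expand=True)
    .iloc[:, :N]
    .apply(lambda col: col.str.strip())
)
df3.columns = [f'student name_{i + 1}' for i in range(df3.shape[1])]

df2 = df2.join(df3)
print(df2)
```

If every row has fewer than N names, the split result has fewer than N columns, which is why the new column names are derived from df3.shape[1] rather than from N.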
Try this, making sure the return sits after the loop rather than inside it:
def find_student(df2):
    for i in range(2):
        df2[f"student name_{i}"] = pd.Series(map(lambda x: x.split(',')[i], df2["student name"]))
    return df2
Use the built-in pandas string methods (str and split); you don't need to write a function.
df = [["John mills, Tim Harry, Alex win, Kate marry"],
["Brando XI, James Caan, Richard S. Castellano"],
["Heath Ledger,Aaron Eckhart, Michael Caine"]]
df2 = pd.DataFrame(df)
df2.columns = ['Student_Name']
df2['student name_1'] = df2.Student_Name.str.split(",").str[0]
df2['student name_2'] = df2.Student_Name.str.split(",").str[1]
