Algorithm to identify slightly different, uniquely identifiable common names in 3 DataFrame columns - python-3.x

Sample DataFrame df has 3 columns used to identify any given person: name, nick_name, and initials. Each can vary slightly in how it is written, but looking at the three columns together it is possible to overcome these differences, separate out all the rows for a given person, and normalize these 3 columns to a single value per person.
>>> import pandas as pd
>>> df = pd.DataFrame({'ID':range(9), 'name':['Theodore', 'Thomas', 'Theodore', 'Christian', 'Theodore', 'Theodore R', 'Thomas', 'Tomas', 'Cristian'], 'nick_name':['Tedy', 'Tom', 'Ted', 'Chris', 'Ted', 'Ted', 'Tommy', 'Tom', 'Chris'], 'initials':['TR', 'Tb', 'TRo', 'CS', 'TR', 'TR', 'tb', 'TB', 'CS']})
>>> df
  ID       name nick_name initials
0  0   Theodore      Tedy       TR
1  1     Thomas       Tom       Tb
2  2   Theodore       Ted      TRo
3  3  Christian     Chris       CS
4  4   Theodore       Ted       TR
5  5 Theodore R       Ted       TR
6  6     Thomas     Tommy       tb
7  7      Tomas       Tom       TB
8  8   Cristian     Chris       CS
In this case the desired output is as follows:
  ID      name nick_name initials
0  0  Theodore       Ted       TR
1  1    Thomas       Tom       TB
2  2  Theodore       Ted       TR
3  3 Christian     Chris       CS
4  4  Theodore       Ted       TR
5  5  Theodore       Ted       TR
6  6    Thomas       Tom       TB
7  7    Thomas       Tom       TB
8  8 Christian     Chris       CS
The common value can be anything as long as each person's rows are normalized to the same value. For example, the name could be either Theodore or Theodore R; both are fine.
My actual DataFrame has about 4000 rows. Could someone suggest an efficient algorithm to do this?

You'll want to use Levenshtein distance to identify similar strings. A good Python package for this is fuzzywuzzy. Below I use a basic dictionary approach to collect similar rows together, then overwrite each cluster with a designated master row. Note this leaves a CSV with many duplicate rows; I don't know if that is what you want, but if not, it's easy enough to take the duplicates out.
import pandas as pd
from itertools import chain
from fuzzywuzzy import fuzz

def cluster_rows(df):
    """Group rows whose 'name' values are within the fuzzy-match threshold."""
    row_clusters = {}
    threshold = 90
    for i, nr in df.iterrows():
        name = nr['name']
        new_cluster = True
        for other in row_clusters.keys():
            if fuzz.ratio(name, other) >= threshold:
                row_clusters[other].append(nr)
                new_cluster = False
                break  # stop at the first matching cluster
        if new_cluster:
            row_clusters[name] = [nr]
    return row_clusters

def normalize_rows(row_clusters):
    """Overwrite every row in a cluster with the cluster's first (master) row."""
    for name in row_clusters:
        master = row_clusters[name][0]
        for row in row_clusters[name][1:]:
            for key in row.keys():
                if key != 'ID':  # keep each row's original ID
                    row[key] = master[key]
    return row_clusters

if __name__ == '__main__':
    df = pd.read_csv('names.csv')
    rc = cluster_rows(df)
    normalized = normalize_rows(rc)
    pd.DataFrame(chain(*normalized.values())).to_csv('norm-names.csv')
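For the sample df built at the top of the question (instead of reading names.csv), a minimal usage sketch, assuming fuzzywuzzy is installed, could look like this; the final drop_duplicates is only needed if you want one row per person rather than the normalized repeats:
rc = cluster_rows(df)
normalized = normalize_rows(rc)
result = pd.DataFrame(chain(*normalized.values())).sort_values('ID')
print(result)  # normalized, one row per original row
print(result.drop_duplicates(subset=['name', 'nick_name', 'initials']))  # one row per person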

Related

Cast topic modeling outcome to dataframe

I have used BERTopic with KeyBERT to extract some topics from some docs:
from bertopic import BERTopic
topic_model = BERTopic(nr_topics="auto", verbose=True, n_gram_range=(1, 4), calculate_probabilities=True, embedding_model='paraphrase-MiniLM-L3-v2', min_topic_size= 3)
topics, probs = topic_model.fit_transform(docs)
Now I can access the topic names:
freq = topic_model.get_topic_info()
print("Number of topics: {}".format( len(freq)))
freq.head(30)
Topic Count Name
0 -1 1 -1_default_greenbone_gmp_manager
1 0 14 0_http_tls_ssl tls_ssl
2 1 8 1_jboss_console_web_application
and inspect the topics
[('http', 0.0855701486234524),
('tls', 0.061977919455444744),
('ssl tls', 0.061977919455444744),
('ssl', 0.061977919455444744),
('tcp', 0.04551718585531556),
('number', 0.04551718585531556)]
[('jboss', 0.14014705432060262),
('console', 0.09285308122803233),
('web', 0.07323749337563096),
('application', 0.0622930523123512),
('management', 0.0622930523123512),
('apache', 0.05032395169459188)]
What I want is a final dataframe that has the topic name in one column and the elements of the topic in another column.
Expected outcome:
  class                          entities
0 http_tls_ssl tls_ssl           HTTP...etc
1 jboss_console_web_application  JBoss, console, etc
and another dataframe with the topic names as columns:
  http_tls_ssl tls_ssl  jboss_console_web_application
0 http                  JBoss
1 tls                   console
2 etc                   etc
I did not find out how to do this. Is there a way?
Here is one way to do it:
Setup
import pandas as pd
from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes"))["data"]
topic_model = BERTopic()
# To keep the example reproducible in a reasonable time, limit to 3,000 docs
topics, probs = topic_model.fit_transform(docs[:3_000])
df = topic_model.get_topic_info()
print(df)
# Output
Topic Count Name
0 -1 23 -1_the_of_in_to
1 0 2635 0_the_to_of_and
2 1 114 1_the_he_to_in
3 2 103 2_the_to_in_and
4 3 59 3_ditto_was__
5 4 34 4_pool_andy_table_tell
6 5 32 5_the_to_game_and
First dataframe
Using Pandas string methods:
df = (
    df.rename(columns={"Name": "class"})
    .drop(columns=["Topic", "Count"])
    .reset_index(drop=True)
)
df["entities"] = [
    [item[0] if item[0] else pd.NA for item in topics]
    for topics in topic_model.get_topics().values()
]
df = df.loc[~df["class"].str.startswith("-1"), :]  # Remove -1 topic
df["class"] = df["class"].replace(
    r"^-?\d+_", "", regex=True
)  # remove prefix '1_', '2_', ...
print(df)
# Output
class entities
1 the_to_of_and [the, to, of, and, is, in, that, it, for, you]
2 the_he_to_in [the, he, to, in, and, that, is, of, his, year]
3 the_to_in_and [the, to, in, and, of, he, team, that, was, game]
4 ditto_was__ [ditto, was, <NA>, <NA>, <NA>, <NA>, <NA>, <NA>, <NA>, <NA>]
5 pool_andy_table_tell [pool, andy, table, tell, us, well, your, about, <NA>, <NA>]
6 the_to_game_and [the, to, game, and, games, espn, on, in, is, have]
Second dataframe
Using Pandas transpose:
other_df = df.T.reset_index(drop=True)
new_col_labels = other_df.iloc[0] # save first row
other_df = other_df[1:] # remove first row
other_df.columns = new_col_labels
other_df = pd.DataFrame({col: other_df.loc[1, col] for col in other_df.columns})
print(other_df)
# Output
the_to_of_and the_he_to_in the_to_in_and ditto_was__ pool_andy_table_tell the_to_game_and
0 the the the ditto pool the
1 to he to was andy to
2 of to in <NA> table game
3 and in and <NA> tell and
4 is and of <NA> us games
5 in that he <NA> well espn
6 that is team <NA> your on
7 it of that <NA> about in
8 for his was <NA> <NA> is
9 you year game <NA> <NA> have
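As a side note, the transpose dance can be skipped: since df already holds the class labels and the entity lists, the second dataframe can be built directly from those two columns. A small sketch using the df produced above:
# Build one column per topic straight from the class/entities pairs
other_df = pd.DataFrame(dict(zip(df["class"], df["entities"])))
print(other_df)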

Using FuzzyWuzzy with pandas

I am trying to calculate the similarity between the cities in my dataframe and one static city name. (Eventually I want to iterate through a dataframe and choose the best matching city name from that dataframe, but I am testing my code on this simplified scenario.)
I am using fuzzywuzzy token set ratio.
For some reason it calculates the first row correctly, but then seems to assign that same value to all rows.
code:
import pandas as pd
from fuzzywuzzy import fuzz

test_df = pd.DataFrame({"City": ["Amsterdam", "Amsterdam", "Rotterdam", "Zurich", "Vienna", "Prague"]})
test_df = test_df.assign(Score=lambda d: fuzz.token_set_ratio("amsterdam", test_df["City"]))
print (test_df.shape)
test_df.head()
Result:
City Score
0 Amsterdam 100
1 Amsterdam 100
2 Rotterdam 100
3 Zurich 100
4 Vienna 100
If I do the comparison one by one it works:
print (fuzz.token_set_ratio("amsterdam","Amsterdam"))
print (fuzz.token_set_ratio("amsterdam","Rotterdam"))
print (fuzz.token_set_ratio("amsterdam","Zurich"))
print (fuzz.token_set_ratio("amsterdam","Vienna"))
Results:
100
67
13
13
Thank you in advance!
I managed to solve it by iterating through the rows:
for index, row in test_df.iterrows():
    test_df.loc[index, "Score"] = fuzz.token_set_ratio("amsterdam", test_df.loc[index, "City"])
The result is:
City Country Code Score
0 Amsterdam NL 100
1 Amsterdam NL 100
2 Rotterdam NL 67
3 Zurich NL 13
4 Vienna NL 13
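For context, the assign version gives every row the same score because fuzz.token_set_ratio is called once with the whole City Series (which it converts to a single string containing "amsterdam"), so it returns one scalar that pandas then broadcasts to all rows. A per-element apply avoids both that and the explicit loop; a short sketch:
# Score each city individually instead of passing the whole Series at once
test_df["Score"] = test_df["City"].apply(lambda city: fuzz.token_set_ratio("amsterdam", city))
print(test_df)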

How to remove rows from a pandas dataframe if two rows contain at least one matching element

I have a pandas dataframe containing many columns like Name, Email, Mobile Number, etc., which looks like this:
Sr No. Name Email Mobile Number
1. John joh***#gmail.com 1234567890,2345678901
2. kylie k.ki**#yahoo.com 6789012345
3. jon null 1234567890
4. kia kia***#gmail.com 6789012345
5. sam b.sam**#gmail.com 4567890123
I want to remove the rows which contain the same Mobile Number. One person can have more than one number. I did this through the drop_duplicates function. I tried this:
newdf = df.drop_duplicates(subset = ['Mobile Number'],keep=False)
Here is output :
Sr No. Name Email Mobile Number
1. John joh***#gmail.com 1234567890,2345678901
3. jon null 1234567890
5. sam b.sam**#gmail.com 4567890123
But the problem is it only removes rows which are exactly the same, whereas I want to remove rows which share at least one number, e.g. Sr. No. 1 and 3 have one number in common. How can I remove them so the final output looks like this:
final output:
Sr No. Name Email Mobile Number
5. sam b.sam**#gmail.com 4567890123
Alright, it is a complicated solution, but I was able to solve it. Here's how I am doing it:
First, I take all the mobile numbers and split them by ,. Then I explode them (this retains the same index).
Then I find the indexes of all rows with duplicates.
Then I exclude rows from the dataframe if their index was part of the duplicates.
This will give you the unique rows that do not have any duplicates.
I modified your dataframe to have a few options.
import pandas as pd

c = ['Name', 'Email', 'Mobile Number']
d = [['John', 'joh***#gmail.com', '1234567890,2345678901,6789012345'],
     ['kylie', 'k.ki**#yahoo.com', '6789012345'],
     ['jon', 'null', '1234567890'],
     ['kia', 'kia***#gmail.com', '6789012345'],
     ['mac', 'mac***#gmail.com', '2345678901,1098765432'],
     ['kfc', 'kfc***#gmail.com', '6237778901,1098765432,3034045050'],
     ['pig', 'pig***#gmail.com', '8007778001,8018765454,5054043030'],
     ['bil', 'bil***#gmail.com', '1098765432'],
     ['jun', 'jun***#gmail.com', '9098785434'],
     ['sam', 'b.sam**#gmail.com', '4567890123']]
df = pd.DataFrame(d, columns=c)
print(df)
temp = df.copy()
temp['Mobile Number'] = temp['Mobile Number'].apply(lambda x: x.split(','))
temp = temp.explode('Mobile Number')
#print (temp)
df2 = df[~df.index.isin(temp[temp['Mobile Number'].duplicated(keep=False)].index)]
print (df2)
The output of this is:
Original DataFrame:
    Name   Email              Mobile Number
0   John   joh***#gmail.com   1234567890,2345678901,6789012345  # duplicated index: 1, 2, 3, 4
1   kylie  k.ki**#yahoo.com   6789012345                        # duplicated index: 0, 3
2   jon    null               1234567890                        # duplicated index: 0
3   kia    kia***#gmail.com   6789012345                        # duplicated index: 0, 1
4   mac    mac***#gmail.com   2345678901,1098765432             # duplicated index: 0, 5, 7
5   kfc    kfc***#gmail.com   6237778901,1098765432,3034045050  # duplicated index: 4, 7
6   pig    pig***#gmail.com   8007778001,8018765454,5054043030  # no duplicate; should output
7   bil    bil***#gmail.com   1098765432                        # duplicated index: 4, 5
8   jun    jun***#gmail.com   9098785434                        # no duplicate; should output
9   sam    b.sam**#gmail.com  4567890123                        # no duplicate; should output
The output of this will be the 3 rows (index: 6, 8, and 9):
Name Email Mobile Number
6 pig pig***#gmail.com 8007778001,8018765454,5054043030
8 jun jun***#gmail.com 9098785434
9 sam b.sam**#gmail.com 4567890123
Since temp is not needed anymore, you can just delete it using del temp.
One possible solution is to do the following. Say your df is given by
Sr No. Name Email Mobile Number
0 1.0 John joh***#gmail.com 1234567890 , 2345678901
1 2.0 kylie k.ki**#yahoo.com 6789012345
2 3.0 jon NaN 1234567890
3 4.0 kia kia***#gmail.com 6789012345
4 5.0 sam b.sam**#gmail.com 4567890123
You can split your Mobile Number column into two (or more) columns mob1, mob2, ... and then drop duplicates:
df[['mob1', 'mob2']]= df["Mobile Number"].str.split(" , ", n = 1, expand = True)
newdf = df.drop_duplicates(subset = ['mob1'],keep=False)
which returns
Sr No. Name Email Mobile Number mob1 mob2
4 5.0 sam b.sam**#gmail.com 4567890123 4567890123 None
EDIT
To handle the possible swapped order of numbers, one can extend the method by dropping duplicates from all created columns:
df[['mob1', 'mob2']] = df["Mobile Number"].str.split(" , ", n=1, expand=True)
newdf = df.drop_duplicates(subset=['mob1'], keep=False)
newdf = newdf.drop_duplicates(subset=['mob2'], keep=False)
which returns:
Sr No. Name Email Mobile Number mob1 \
0 1.0 John joh***#gmail.com 2345678901 , 1234567890 2345678901
mob2
0 1234567890
If there are individuals with more than two numbers, then as many columns as the maximum number of phone numbers need to be created, as sketched below.
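A rough sketch of that generalization, starting again from the original df (before mob1/mob2 were added); the number of mob columns is derived from the data, and, like the two-column version, it only compares numbers that sit in the same position:
# One column per number; the longest comma-separated list sets the column count
mobs = df["Mobile Number"].str.split(" , ", expand=True)
mobs.columns = [f"mob{i + 1}" for i in range(mobs.shape[1])]
df = df.join(mobs)

newdf = df
for col in mobs.columns:
    # keep=False flags every row sharing a number in this column;
    # the notna() guard stops empty cells being treated as duplicates of each other
    dupes = newdf[col].notna() & newdf.duplicated(subset=[col], keep=False)
    newdf = newdf[~dupes]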

List of visited intervals

Given the following dataframe
import pandas as pd
df = pd.DataFrame({'visited': ['2015-3-1', '2015-3-5','2015-3-6','2016-3-4', '2016-3-6', '2016-3-8'],'name':['John','John','John','Mary','Mary','Mary']})
df['visited']=pd.to_datetime(df['visited'])
visited name
0 2015-03-01 John
1 2015-03-05 John
2 2015-03-06 John
3 2016-03-04 Mary
4 2016-03-06 Mary
5 2016-03-08 Mary
I wish to get the list of visit intervals for the two people; in this example, the outcome should be:
avg_visited_interval name
0 [4,1] John
1 [2,2] Mary
How should I achieve this?
(e.g., in the first example there are 4 days between rows 0 and 1 and 1 day between rows 1 and 2, which results in [4, 1])
Use a custom lambda function with Series.diff, remove the first value by position, and convert to integers and lists:
df = (df.groupby('name')['visited']
        .apply(lambda x: x.diff().iloc[1:].dt.days.astype(int).tolist())
        .reset_index(name='intervals'))
print (df)
name intervals
0 John [4, 1]
1 Mary [2, 2]

Convert dataframe from long to wide with custom column names [duplicate]

I have data in long format and am trying to reshape to wide, but there doesn't seem to be a straightforward way to do this using melt/stack/unstack:
Salesman Height product price
Knut 6 bat 5
Knut 6 ball 1
Knut 6 wand 3
Steve 5 pen 2
Becomes:
Salesman Height product_1 price_1 product_2 price_2 product_3 price_3
Knut 6 bat 5 ball 1 wand 3
Steve 5 pen 2 NA NA NA NA
I think Stata can do something like this with the reshape command.
Here's another, more fleshed-out solution, taken from Chris Albon's site.
Create "long" dataframe
import pandas as pd

raw_data = {'patient': [1, 1, 1, 2, 2],
            'obs': [1, 2, 3, 1, 2],
            'treatment': [0, 1, 0, 1, 0],
            'score': [6252, 24243, 2345, 2342, 23525]}
df = pd.DataFrame(raw_data, columns=['patient', 'obs', 'treatment', 'score'])
Make a "wide" data
df.pivot(index='patient', columns='obs', values='score')
A simple pivot might be sufficient for your needs but this is what I did to reproduce your desired output:
df['idx'] = df.groupby('Salesman').cumcount()
Just adding a within-group counter/index will get you most of the way there, but the column labels will not be as you desired:
print(df.pivot(index='Salesman', columns='idx')[['product', 'price']])
product price
idx 0 1 2 0 1 2
Salesman
Knut bat ball wand 5 1 3
Steve pen NaN NaN 2 NaN NaN
To get closer to your desired output I added the following:
df['prod_idx'] = 'product_' + df.idx.astype(str)
df['prc_idx'] = 'price_' + df.idx.astype(str)
product = df.pivot(index='Salesman',columns='prod_idx',values='product')
prc = df.pivot(index='Salesman',columns='prc_idx',values='price')
reshape = pd.concat([product,prc],axis=1)
reshape['Height'] = df.set_index('Salesman')['Height'].drop_duplicates()
print(reshape)
product_0 product_1 product_2 price_0 price_1 price_2 Height
Salesman
Knut bat ball wand 5 1 3 6
Steve pen NaN NaN 2 NaN NaN 5
Edit: if you want to generalize the procedure to more variables I think you could do something like the following (although it might not be efficient enough):
df['idx'] = df.groupby('Salesman').cumcount()
tmp = []
for var in ['product', 'price']:
    df['tmp_idx'] = var + '_' + df.idx.astype(str)
    tmp.append(df.pivot(index='Salesman', columns='tmp_idx', values=var))
reshape = pd.concat(tmp, axis=1)
@Luke said:
I think Stata can do something like this with the reshape command.
You can, but I think you also need a within-group counter for reshape in Stata to get your desired output:
+-------------------------------------------+
| salesman idx height product price |
|-------------------------------------------|
1. | Knut 0 6 bat 5 |
2. | Knut 1 6 ball 1 |
3. | Knut 2 6 wand 3 |
4. | Steve 0 5 pen 2 |
+-------------------------------------------+
If you add idx then you could do reshape in stata:
reshape wide product price, i(salesman) j(idx)
Karl D's solution gets at the heart of the problem. But I find it's far easier to pivot everything (with .pivot_table because of the two index columns) and then sort and assign the columns to collapse the MultiIndex:
df['idx'] = df.groupby('Salesman').cumcount()+1
df = df.pivot_table(index=['Salesman', 'Height'], columns='idx',
                    values=['product', 'price'], aggfunc='first')
df = df.sort_index(axis=1, level=1)
df.columns = [f'{x}_{y}' for x,y in df.columns]
df = df.reset_index()
Output:
Salesman Height price_1 product_1 price_2 product_2 price_3 product_3
0 Knut 6 5.0 bat 1.0 ball 3.0 wand
1 Steve 5 2.0 pen NaN NaN NaN NaN
A bit old but I will post this for other people.
What you want can be achieved, but you probably shouldn't want it ;)
Pandas supports hierarchical indexes for both rows and columns.
In Python 2.7.x ...
import pandas as pd
from StringIO import StringIO
raw = '''Salesman Height product price
Knut 6 bat 5
Knut 6 ball 1
Knut 6 wand 3
Steve 5 pen 2'''
dff = pd.read_csv(StringIO(raw), sep='\s+')
print dff.set_index(['Salesman', 'Height', 'product']).unstack('product')
This produces a representation that is probably more convenient than what you were looking for:
price
product ball bat pen wand
Salesman Height
Knut 6 1 5 NaN 3
Steve 5 NaN NaN 2 NaN
The advantage of using set_index and unstacking vs a single function as pivot is that you can break the operations down into clear small steps, which simplifies debugging.
pivoted = df.pivot(index='Salesman', columns='product', values='price')
(pg. 192, Python for Data Analysis)
An old question; this is an addition to the already excellent answers. pivot_wider from pyjanitor may be helpful as an abstraction for reshaping from long to wide (it is a wrapper around pd.pivot):
# pip install pyjanitor
import pandas as pd
import janitor
idx = df.groupby(['Salesman', 'Height']).cumcount().add(1)
(df.assign(idx=idx)
   .pivot_wider(index=['Salesman', 'Height'], names_from='idx')
)
Salesman Height product_1 product_2 product_3 price_1 price_2 price_3
0 Knut 6 bat ball wand 5.0 1.0 3.0
1 Steve 5 pen NaN NaN 2.0 NaN NaN
