How to find near duplicates for strings?

I have a dataset that consists of over 50000 names. There are a lot of near duplicates inside because of how these names were entered into the system.
The ideal solution would flag clusters of similar names that may belong to the same person for review and merging, or at least take constructive steps towards uncovering duplicates.
Suppose that these 6 persons are listed in a variety of ways in the 30 names in the second list:
Jon Stewart
Sun Yat-Sen
Xi Jinping
Jawaharlal Nehru
Hồ Chí Minh
Elaine Chao
30 Variations of 6 Names
Jon Stewart
John Stuart
Stewart Jon
Jonathon Stuart Leibowitz
Jon Stewart Jr
Stuart Leibowitz Jr
Sun Yat Sen
Sun Yat-Sen
Sun Sen Yat
Sun Sun-Yat
Sen Yat-Sun
Xi Jinping
Xi Jin Ping
Xi Jing Pin
Xin Ji Ping
Jawaharlal Nehru
Jawa Haral Nehru
Jawaharul Neru
Jawaha Rul Naru
Ho Chi Minh
Hồ Chí Minh
Ho Chimin
Ho Chimminh
Elaine Chao
Chao Xiao Lan Elaine
Xiao Lan, Elaine Chao
Xiao Chao Lan, Elaine
Chao Xiao Elaine
Elaine Xiao Lan Chao
Ellaine Chao Xiao-Lan
How do I start cleaning this dataset so that I can flag all of these names for a closer look and make data-cleaning decisions?
I have implemented methods based on Levenshtein distance, but they do not do well with strings of very different lengths. Most people have 3 (plus or minus 1) names, and some names are hyphenated.
I have situations where the same person has been entered very differently, like #1 and #2, where there is no overlap in spelling, but the fact that they refer to the same person can be inferred / triangulated through other entries and information in other columns. I acknowledge that any solution has limitations, as it may require triangulation across different cells. For instance, Asians often have multiple names, people are often unsure which is which, and they often choose only some parts of a name to enter. I am happy to work with solutions that can partially tackle some, but not all, of these problems.
My estimate of the number of unique individuals among these 50000 is about 20000. These are individuals providing professional services in a geographical location, which I can, in a somewhat laborious way, check against an authoritative database as a last resort (as that database does not allow any sort of automated querying).
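One way to soften the word-order problem described above (a sketch, not from the original post; token_sort_ratio is a made-up helper name, not an existing library function): sort the tokens of each name before comparing, so that "Stewart Jon" and "Jon Stewart" become identical before the edit-distance step.
import difflib

def token_sort_ratio(a, b):
    # Lower-case, treat hyphens as spaces, sort the tokens, then compare
    norm = lambda s: ' '.join(sorted(s.lower().replace('-', ' ').split()))
    return difflib.SequenceMatcher(None, norm(a), norm(b)).ratio()

print(token_sort_ratio('Jon Stewart', 'Stewart Jon'))    # 1.0
print(token_sort_ratio('Sun Yat-Sen', 'Sun Yat Sen'))    # 1.0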

I created a tool years ago to do the same with streets. In case it gives you a direction, I used difflib:
https://docs.python.org/3/library/difflib.html
My approach was:
To create a list with the standard names I wanted to use, and to replace a name whenever the matching score was very high.
For those with a lower score, to run some checking loops that showed me the 2-3 closest matches.
After checking several dozen, to do some cleansing with replace over the typical results that were not getting good scores, and to start again with the SequenceMatcher.
An example of the sorted result:
# b is the list of candidate names, c is the name being matched against
result = sorted(b, key=lambda x: difflib.SequenceMatcher(None, x, c).ratio(), reverse=True)
UPDATE:
I've made two examples: the first creates a new_agenda based on the matching results; the second lets you see the scores:
import difflib
#Your list of names (the six provided)
clean_agenda=['Jon Stewart','Sun Yat-Sen','Xi Jinping','Jawaharlal Nehru','Hồ Chí Minh','Elaine Chao']
#Your agenda with any kind of names (the thirty provided)
agenda=['Jon Stewart','John Stuart','Stewart Jon','Jonathon Stuart Leibowitz','Jon Stewart Jr','Stuart Leibowitz Jr','Sun Yat Sen','Sun Yat-Sen','Sun Sen Yat','Sun Sun-Yat','Sen Yat-Sun','Xi Jinping','Xi Jin Ping','Xi Jing Pin','Xin Ji Ping','Jawaharlal Nehru','Jawa Haral Nehru','Jawaharul Neru','Jawaha Rul Naru','Ho Chi Minh','Hồ Chí Minh','Ho Chimin','Ho Chimminh','Elaine Chao','Chao Xiao Lan Elaine','Xiao Lan, Elaine Chao','Xiao Chao Lan, Elaine','Chao Xiao Elaine','Elaine Xiao Lan Chao','Ellaine Chao Xiao-Lan']
#Your new agenda to store your matching clean names
new_agenda=[]
# Creating a new agenda with your standard list of names
for i in range(len(agenda)):
    resultado = sorted(clean_agenda, key=lambda x: difflib.SequenceMatcher(None, x, agenda[i]).ratio(), reverse=True)
    new_agenda.append(resultado[0])
And here is the example to see the ratios:
# An example of a loop to check ratios for each value in agenda
for i in range(len(agenda)):
    for j in range(len(clean_agenda)):
        print(clean_agenda[j], difflib.SequenceMatcher(None, agenda[i], clean_agenda[j]).ratio())
    print('\n')
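A possible refinement (my own suggestion, not part of the original answer; the 0.7 threshold is arbitrary): only accept the best match above a cutoff, and flag everything else for manual review.
for name in agenda:
    best = max(clean_agenda, key=lambda x: difflib.SequenceMatcher(None, x, name).ratio())
    score = difflib.SequenceMatcher(None, best, name).ratio()
    if score >= 0.7:
        new_agenda.append(best)
    else:
        # Low-confidence match: leave it for a human to decide
        print('Review manually:', name, '-> best guess:', best, 'score:', round(score, 2))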

Related

How to collect Postal Codes of Cities and States from the Internet? (web scraping, SSIS)

In an SSIS project, I need to merge my Cities with Postal Codes.
I tried to scrape data from the Internet, but there are many errors, such as special characters that are not readable or that cause errors and warnings in SSIS.
The answer is that you will want to buy a product to clean and standardize your address data. Many exist, and I've dealt with a number of them over my career, but ultimately they derive their power from the USPS master list of addresses.
There are plenty of places on the internet where you can get a list of the approximately 43k ZIP codes. However, here's your first problem.
Let's say you have these two businesses in your spreadsheet:
9012 Blue Ash Rd, Blue Ash, OH
9856 Montgomery Rd, Cincinnati, OH
Blue Ash is zip code 45242 and Cincinnati is also 45242 (and a host of other zip codes) so the first problem is that multiple cities can share the same zip code.
Along those lines, ZIP code 42602 is Eubank, KY. It's also Static, Seventy Six, Snow, Seminary, Highway, Aaron and Browns Crossroads.
Next up, Springfield, MO. Or more appropriately
SPFD, MO
SPGFD, MO
SPNGFLD, MO
SPRNGFLD, MO
SPGFLD, MO
Springfield, MO
Spring field, MO
Data is dirty when entered by dirty people. All of those are the same place; they just reflect the different approaches people take to abbreviating it. How do you expect to make a match there?
Pay for an API call from MelissaData, SmartyStreets, or Google/Bing Maps, and let the professionals deal with it.
If you insist on trying to brute-force this, start by feeding your City & State into the USPS lookup and consuming the results: https://tools.usps.com/zip-code-lookup.htm?bycitystate I have no idea what the rate limiter might be there, so be wary that you could get your IP banned, soft or hard, for making too many calls in a period of time.

How to find what strings (in a big list of strings) are in a text in Python?

I'm trying to find out which names from a list appear in a news text.
I have a big text file (around 100MB) with many place names. Each name is a line in the file.
Part of the file:
Brasiel
Brasier Gap
Brasier Tank
Brasiilia
Brasil
Brasil Colonial
and the news texts are like this:
"It's thought the couple may have contracted the Covid-19 virus in the US or while travelling to Australia, according to Queensland Health officials.
Hanks is not the only celebrity to have tested positive for the virus. British actor Idris Elba also revealed last week he had tested positive."
For instance, in this text the strings Australia and Queensland should be found.
I'm using the NLTK library to create ngrams from the news text. This is what I'm doing:
from nltk.util import ngrams

# reading the place name file (one name per line)
file = open("top-ord.txt", "r")
values = file.readlines()

news = "It's thought the couple may have contracted the Covid-19 virus in the US or while travelling to Australia, according to Queensland Health officials."

# ngrams_list is all ngrams from the news (here: 1 to 3 words)
tokens = news.split()
ngrams_list = [" ".join(g) for n in range(1, 4) for g in ngrams(tokens, n)]

for item in ngrams_list:
    if item in values:
        print(item)
This is too slow. How can I improve it?
Convert values to a set like so:
value_set = {country for country in values}
That should significantly speed things up, as lookup in a set runs in constant time (as opposed to linear time with lists).
Also, make sure you strip away trailing newlines when parsing the file (if needed).
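Putting both suggestions together (a minimal sketch; top-ord.txt and ngrams_list are taken from the question):
# Strip trailing newlines while building the set, so the ngram strings can match
with open("top-ord.txt", "r") as f:
    value_set = {line.strip() for line in f}

for item in ngrams_list:
    if item in value_set:
        print(item)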

Python3: writing article in own words

I am trying to extract a summary from a news article. Here is what I have tried so far:
>>> from newspaper import Article
>>> url = 'http://abcnews.go.com/International/wireStory/north-korea-ready-deploy-mass-produce-missile-47552675'
>>> article = Article(url)
>>> article.download()
>>> article.parse()
>>> article.nlp()
>>> article.keywords
['ready', 'north', 'test', 'missiles', 'deploy', 'tested', 'korea', 'missile', 'launch', 'nuclear', 'capable', 'media', 'massproduce']
>>> article.summary
'North Korea says it\'s ready to deploy and start mass-producing a new medium-range missile capable of reaching Japan and major U.S. military bases there following a test launch it claims confirmed the missile\'s combat readiness and is an "answer" to U.S. President Donald Trump\'s policies.\nPyongyang\'s often-stated goal is to perfect a nuclear warhead that it can put on a missile capable of hitting Washington or other U.S. cities.\nAt the request of diplomats from the U.S., Japan and South Korea, a United Nations\' Security Council consultation on the missile test will take place Tuesday.\nNorth Korea a week earlier had successfully tested a new midrange missile — the Hwasong 12 — that it said could carry a heavy nuclear warhead.\nExperts said that rocket flew higher and for a longer time than any other missile previously tested by North Korea and represents another big advance toward a viable ICBM.'
I have seen that the summary generated above is taken verbatim from the news article itself, whereas I want to achieve human-like summarization (in its own words, spun content, or anything else, as long as it is relevant).
Please advise or suggest what I need to do so that my code does exactly what I want.
There is sumy, which offers several ways to summarize English texts. Most (if not all) of those algorithms will extract sentences from the input document. Based on those sentences, you can postprocess them to split and/or merge sentences and to use synonyms.
Beyond that, this topic is still not so much a matter of engineering as of research. Try AI StackExchange.
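For what it's worth, a minimal extractive-summarization sketch with sumy (my own example, assuming its LsaSummarizer API; NLTK's punkt tokenizer data must be installed for the English tokenizer to work):
from sumy.parsers.plaintext import PlaintextParser
from sumy.nlp.tokenizers import Tokenizer
from sumy.summarizers.lsa import LsaSummarizer

text = article.text  # the article text downloaded with newspaper above

# Parse the plain text and keep the 3 most representative sentences
parser = PlaintextParser.from_string(text, Tokenizer("english"))
summarizer = LsaSummarizer()
for sentence in summarizer(parser.document, 3):
    print(sentence)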

Determining customary distance unit from ISO 3166 country code

ISO 3166 defines country codes such as GB, US, FR or RU.
I would like a reasonably definitive association from these country codes to the customary unit of measure for distances between places in those countries.
Specifically on iOS and OS X, the country code can be retrieved from NSLocale:
[[NSLocale currentLocale] objectForKey: NSLocaleCountryCode];
NSLocale also provides a way to see if a country uses metric or non metric units:
const bool useMetric = [[[NSLocale currentLocale] objectForKey: NSLocaleUsesMetricSystem] boolValue];
However, this is not sufficient. For example, in Great Britain (GB) the metric system is widely used, but distances between places continue to be officially measured in miles rather than kilometres.
I also faced this problem :-)
Countries which use the metric system but still use miles:
1. GB is the only exception which still uses miles instead of metric.
Note: Canada also started using km for road transport, although Canada still uses miles for train and horse transport.
Countries which do not use the metric system:
Liberia, Myanmar and the United States of America.
Note: Myanmar (formerly Burma) is planning to move to the metric system. Currently, Myanmar uses its own system, different from both imperial and metric.
In my app, I check whether the country uses imperial or metric, roughly like this (see the Python sketch below):
if (metric) then assign km for all countries except Britain
if (imperial) then assign miles for all countries except Burma
if Burma then assign the Burmese unit
if Britain then assign miles
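A minimal Python sketch of that decision logic (my own translation of the pseudocode above; the function name and return values are made up for illustration):
def distance_unit(country_code, uses_metric):
    # Special cases first: Britain signs road distances in miles,
    # and Myanmar/Burma uses its own traditional units
    if country_code == "GB":
        return "miles"
    if country_code == "MM":
        return "burmese"
    return "km" if uses_metric else "miles"

print(distance_unit("FR", True))    # km
print(distance_unit("GB", True))    # miles
print(distance_unit("US", False))   # miles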
A chart showing countries using miles per hour for road speeds is available. It cites Wikipedia's articles on miles per hour as its source, which has the following to say:
These include roads in the United Kingdom,[1] the United States,[2] and UK and US territories; American Samoa,[3] the Bahamas,[4] Belize,[5] British Virgin Islands,[6] the Cayman Islands,[7] Dominica,[8] the Falkland Islands,[9] Grenada,[10] Guam,[11] Burma,[12] The N. Mariana Islands,[13] Samoa,[14] St. Lucia,[15] St. Vincent & The Grenadines,[16] St. Helena,[17] St. Kitts & Nevis,[18] Turks & Caicos Islands,[19] the U.S. Virgin Islands,[20][21] Antigua & Barbuda (although km are used for distance),[22] and Puerto Rico (same as former).[22]
I don't see a way to download this as data keyed by ISO 3166 country code, but it's not a huge task to compile one.
I'll leave this answer unaccepted in case a better suggestion is available.
Officially, road distances in the UK are in kilometres, but road signs are in miles. Confusing? Yes! When a road engineer gets a plan of a road, everything is in kilometres, and government statistics are in kilometres, but road signs and car odometers are in miles. See https://en.wikipedia.org/wiki/Driver_location_sign for more info.

Information extraction. Counting mentions to measure relevance

Is it possible to count how many times an entity has been mentioned in an article? For example:
ABC Company is one of the largest car manufacturers in the world. It is also the largest company in terms of annual production. It is also the second largest exporter of luxury cars, after XYZ company. Both ABC and XYZ together produce over n% of total car production in the country.
mentions ABC company 4 times.
Yes, this is possible. It's a combination of:
named-entity recognition (NER), which for English is practically a solved problem, and
coreference resolution, which is the subject of ongoing research (but give this package a try).
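As a starting point, a minimal NER-only sketch with spaCy (my own example, not from the answer; it counts literal entity mentions only and does not resolve pronouns such as "It", which is what coreference resolution would add):
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")
doc = nlp("ABC Company is one of the largest car manufacturers in the world. Both ABC and XYZ together produce cars.")

# Count every organization entity the model recognizes
counts = Counter(ent.text for ent in doc.ents if ent.label_ == "ORG")
print(counts)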
