Heuristics for splitting full names - user-experience

Splitting a full name into first and last names is an unsolvable problem because names are really, really complicated. As a result, my model, which represents authors and other contributors to a book, includes both name and filingName fields, where filingName should usually be "Last, First" (for Western names).
However, as a convenience for my users, I'd like to have my app make a reasonable guess at the filing name when the user fills in the regular name. The user can edit the filing name if the guess is wrong, of course, but if I guess right, I'll have saved them some time. Currently I'm simply assuming the last space-separated "word" is the last name and moving it to the front with a comma:
NSMutableArray * parts = [self.name componentsSeparatedByCharactersInSet:NSCharacterSet.whitespaceCharacterSet].mutableCopy;
if(parts.count < 2) {
return self.name;
}
NSString * lastName = parts.lastObject;
[parts removeLastObject];
return [NSString stringWithFormat:#"%#, %#", lastName, [parts componentsJoinedByString:#" "]];
I can immediately think of one case where this will lead me astray: suffixes like "Jr". But I'm sure there are many others. Are there any good resources explaining common naming caveats, or good examples of code tackling this problem, that I can use to improve my heuristic? I'm using Objective-C on the Mac (in case there's some obscure corner of a framework that could help me), but I'm willing to learn from code written in any language.
This sort of question has been asked before, but most answers either focus on the mechanics of splitting apart a string, or devolve into "design your model differently". I am designing my model differently; I'm just looking to let the computer do most of my users' work for them.
As I said earlier, this code is mainly handling the names of authors and other contributors to books. Some of the specific ramifications of that include:
There should only be one name in name, because I support attaching multiple authors to a book.
Most names will not have titles, but professional titles like "Dr." could show up. Ideally these would be discarded, not treated as part of the first name.
The names will usually be of people, but could sometimes be of organizations. I'm perfectly willing to risk mangling organization names to get better person name handling.
I expect I will mostly be handling European names, although detecting the orthography of the name should not be difficult.
The code should not be particularly sensitive to the user's locale.

When you build a software system, there are always serious problems that consume a lot of time. I wouldn't get stucked with this because there is no worldwide naming conventions nor rules. I don't think asking the user to enter his/her filing name will be a bother, for they'll do it just once.
That seems to be the easier solution IMHO.

Related

Smart search for acronyms in Salesforce

In Salesforce's Service Cloud one can enable the out of the box search function where the user enters a term and the system searches all parts of the database for a match. I would like to enable smart searching of acronyms so that if I spell an organizations name the search functionality will also search for associated acronyms in the database. For example, if I search type in American Automobile Association, I would also get results that contain both "American Automobile Association" and "AAA".
I imagine such a script would involve declaring that if the term being searched contains one or more spaces or periods, take the first letter of the first word and concatenate it with the letters that follow subsequent spaces or periods.
I have unsuccessfully tried to find scripts for this or articles on enabling this functionality in Salesforce. Any guidance would be appreciated.
Interesting question! I don't think there's a straightforward answer but as it's standard search functionality, not 100% programming related - you might want to cross-post it to salesforce.stackexchange.com
Let's start with searchable fields list: https://help.salesforce.com/articleView?id=search_fields_business_accounts.htm&type=0
In Setup there's standard functionality for Synonyms, quite easy to use. It's not a silver bullet though, applies only to certain objects like Knowledge Base (if you use it). Still - it claims to work on Cases too so if there's "AAA" in Case description it should still be good enough?
You could also check out the trick with marking a text field as indexed and/or external ID and adding there all your variations / acronyms: https://success.salesforce.com/ideaView?id=08730000000H6m2 This is more work, to prepare / sanitize your data upfront but it's not a bad idea.
Similar idea would be to use Tags although that could explode in size very quickly. It's ridiculous to create a tag for every single company.
You can do some really smart things in data deduplication rules. Too much to write it all here, check out the trailhead: https://trailhead.salesforce.com/en/modules/sales_admin_duplicate_management/units/sales_admin_duplicate_management_unit_2 No idea if it impacts search though.
If you suffer from bad address data there are State & Country picklists, no more mess with CA / California / SoCal... https://resources.docs.salesforce.com/204/latest/en-us/sfdc/pdf/state_country_picklists_impl_guide.pdf Might not help with Name problem...
Data.com cleanup might help. Paid service I think, no idea if it affects search too. But if enabling it can bring these common abbreviations into your org - might be better than reinventing the wheel.

nlp: alternate spelling identification

Help by editing my question title and tags is greatly appreciated!
Sometimes one participant in my corpus of "conversations" will refer to another participant using a nickname, usually an abbreviation or misspelling, but hereafter I'll just say "nicknames". Let's say I'm willing to manually tell my software whether or not I think various possible nicknames are in fact nicknames, but I want software to come up with a list of possible matches between the handle's that identify people, and the potential nicknames. How would I go about doing that?
Background on me and then my corpus: I have no experience doing natural language processing but I'm a competent data analyst with R. My data is produced by 70 teams, each forecasting the likelihood of 100 distinct events occurring some time in the future. The result that I have 70 x 100 = 7000 text files, containing the stream of forecasts participants make and the comments they include with their forecasts. I'll paste a very short snip of one of these text files below, this one had to do with whether the Malian government would enter talks with the MNLA:
02/12/2013 20:10: past_returns answered Yes: (50%)
I hadn't done a lot of research when I put in my previous
placeholder... I'm bumping up a lot due to DougL's forecast
02/12/2013 19:31: DougL answered Yes: (60%)
Weak President Traore wants talks if MNLA drops territorial claims.
Mali's military may not want talks. France wants talks. MNLA sugggests
it just needs autonomy. But in 7 weeks?
02/12/2013 10:59: past_returns answered No: (75%)
placeholder forecast...
http://www.irinnews.org/Report/97456/What-s-the-way-forward-for-Mali
My initial thoughts: Obviously I can start by providing the names I'm looking to match things up with... in the above example they would be past_returns and DougL (though there is no use of nicknames in the above). I wouldn't think it'd be that hard to get a computer to guess at minor misspellings (though I wouldn't personally know where to start). I can imagine that other tricks could be used, like assuming that a string is more likely to be a nickname if it is used much much more by one team, than by other teams. A nickname is more likely to refer to someone who spoke recently than someone who spoke long ago, or not at all on regarding this question. And they should be used in sentences in a manner similar to the way the full name/screenname is typically used in the corpus. But I'm interested to hear about simple approaches, as well as ones that try to consider more sophisticated techniques.
This could get about as complicated as you want to make it. From the semi-linguistic side of things, research topics would include Levenshtein Distance (for detecting minor misspellings of known names/nicknames) and Named Entity Recognition (for the task of detecting names/nicknames in the first place). Actually, NER's worth reading about, but existing systems might not help you much in your domain of forum handles and nicknames.
The first rough idea that comes to mind is that you could run a tokenized version of your corpus against an English dictionary (perhaps a dataset compiled from Wiktionary or something like WordNet) to find words that are candidates for names, then filter those through some heuristics (do they start with the same letters as known full names? Do they have a low Levenshtein distance from known names? Are they used more than once?).
You could also try some clustering or supervised ML algorithms against the non-word tokens. That might reveal some non-"word" tokens that often occur in the same threads as a given username; again, heuristics could help rule out some false positives.
Good luck; sounds like a fun problem - hope I mentioned at least one thing you hadn't already thought of.

Algorithm to shorten city names to human readable codes

I am using the region and city in urls for this project. Now many regions and cities might have very long names while also combinations of shortened region/city information might lead to ambiguity.
Is there an easy approach to automatically shorten words in a way so they still make sense and are readable but are shortened without just cutting the end of?
Like turning Bremerhaven into Brmhvn or New Haven, Connecticut to NewHvn-Cnctct?
There are 2797245 cities in the list of world cities freely available from http://www.maxmind.com/app/worldcities
I would design the URL pattern similar to
{Country TLD}/{Abbreviation for state, province or prefecture}/{Trim of county or district}/{Trim of city}
Some examples,
www.example.com/US/NY/-/NewYork
www.example.com/US/NY/Westcheste/MtVernon (Mount Vernon in Westchester County, New York)
- county trimmed to 10 first characters. Also common words in city names abbreviated
www.example.com/DE/Bavaria/Munich
www.example.com/JP/Tohoku/Miyagi/Sendai (Sendai)
For the region, you may want to consider using the ISO 3166 code. So, the above examples for Munich and Sendai would look like
www.example.com/JP/Tohoku/JP-04/Sendai
www.example.com/DE/DE-BY/Munich
Other leads
HASC (Hierarchical administrative subdivision codes) which is used to
represent names of country subdivisions, such as states, province,
regions.
http://en.wikipedia.org/wiki/ISO_3166-2:US
http://www.statoids.com/uus.html
Hope it helps.
You're probably better advised to adopt an existing list of codes than to make up your own. For example you could use IATA codes or zip/postal codes, even telephone dialling codes (find your own link for these).
If you want the shorter version to make sense for humans I think this is incredibly complex as its not very obvious what will represent the full name properly.
example: my own city of Helsingborg. Given just this name i would split it down to one letter per Syllable. Hel-Sing-Borg -> HSB.
But I have never once seen anyone use this acronym. Everyone I know uses HBG.
In short, I would say its fairly easy to make a function that makes a logical acronym for any given game, but very hard to try and make one that is recognizable for a human.
If you just want to crop out some letters from a name, that would probably be a lot easier, but you'd probably want to talk to a English professor to understand what parts of names you can cut out without affecting the readability. But it is possible, and there are most likely meany publicly available studies on how we read words that you could reference.

Accurate algorithm for normalizing taxonomy terms?

I'm developing a shopping comparison website, and the project is in a very advanced stage. We index 50 million products daily using merchant feeds from various affiliate networks. Most of the problems I had is already solved, including the majority of the performance bottlenecks.
What is my problem: Please, first of all, we are using apache solr with drupal BUT, this problem IS NOT specific to drupal or solr, if you do not have knowledge of them, it doesn't matter.
We receive product feeds from over 2000 different merchants, and those feeds are a mess. They have no specific pattern, each merchant send the feeds the way they want. We already solved many problems regarding this, but one remains. Normalizing the taxonomy terms for the faceted browsing functionality.
Suppose that I have a "Narrow by Brands" browsing facet on my website. Now suppose that 100 merchants offer products from Microsoft. Now comes the problem. Some merchants put in the "Brands" column of the data feed "Microsoft", others "Microsoft, Inc.", others "Microsoft Corporation" others "Products from Microsoft", etc... there is no specific pattern between merchants and worst, some individual merchants are so sloppy that they have different strings for the same brand IN THE SAME DATA FEED.
We do not want all those different brands appearing in the navigation. We have a manual solution to the problem where we manually map the imported brands to the "good" brands table ("Microsoft Corporation" -> "Microsoft", "Products from Microsoft" -> "Microsoft", etc..). We have something like 10,000 brands in the database and this is doable. The problem is when it comes with bigger things like "Authors". When we import books into the system, there are over 800,000 authors and we have the same problem and this is not doable by hand mapping. The problem is the same: "Tom Mike Apostol", "Tom M. Apostol", "Apostol, Tom M.", etc...
Does anybody know a good way to automatically solve this problem with an acceptable degree of accuracy (85%-95% accuracy)?
Thanks you for the help!
Some idea that comes to my mind, altough it's just a loose thought:
Convert names to initials (in your example: TMA). Treat '-' as spaces, so fe. Antoine de Saint-Exupéry would be ADSE. Problem here is how to treat ",", altough, it's common usage is to have surname before forename, so just swapping positions should work (so A,TM would be TM,A, get rid of comma - TMA).
Filters authors in database by those initials
For each intitial, if you have whole name (Tom, Apostol) check if it match, otherwise (M.) consider it a match automatically.
If you want some tolerance, you can compare names with Levenshtein distance and tolerate some differences (here you have Oracle implementation)
Names that match you treat as the same authors, to find the whole name, for each initial (T, M, A) you look up your filtered authors (after step 2) and try to find one without just initial (M.) but with whole name (Mike), if you can't find one, use initial. Therefore, each of examples you gave would be converted to the same value, which would be full name (Tom Mike Apostol).
Things that are worth to think about:
Include mappings for name synonyms (would be more likely maximally hundred of records, like Thomas <-> Tom
This way is crucial to have valid initials (no M instead of N etc.).
edit: I've coded such thing some time ago, when I had to identify a person by it's signature, ignoring scanning problems, people sometimes sign by Name S. Surname, or N.S. or just by Name Surname (which is another thing maybe you should consider in the solution, to allow the algorithm to ignore second name, altough in your situation it would be rather rare to ommit someone's second name I guess).

I have a list of names, some of them are fake, I need to use NLP and Python 3.1 to keep the real names and throw out the fake names

I have no clue of where to start on this. I've never done any NLP and only programmed in Python 3.1, which I have to use. I'm looking at the site http://www.linkedin.com and I have to gather all of the public profiles and some of them have very fake names, like 'aaaaaa k dudujjek' and I've been told I can use NLP to find the real names, where would I even start?
This is a difficult problem to solve, and one which starts with acquiring valid given name & surname lists.
How large is the set of names that you're evaluating, and where do they come from? These are both important things for you to consider. If you're evaluating a small set of "American" names, your valid name lists will differ greatly from lists of Japanese or Indian names, for instance.
Your idea of scraping LinkedIn is on the right track, but you were right to catch the fake profile/name flaw. A better website would probably be something like IMDB (perhaps scraping names by iterating over different birth years), or Wikipedia's lists of most popular given names and most common surnames.
When it comes down to it, this is a precision vs. recall problem: in order to miss fewer fakes, you're inevitably going to throw out some real names. If you loosen up your restrictions, you'll get more fakes, but you'll also throw out fewer real names.
Several possibilities here, but the most obvious seems to be with HMMs, i.e. Hidden Markov Models. The NLTK kit includes [at least] one module for HMMs, although I must admit I never used it.
Another possible snag is that AFAIK, NTLK is not yet ported to Python 3.0
This said, and while I'm quite keen on using NLP techniques where applicable, I think that a process which would use several paradigms, including some NLP tricks may be a better solution for this particular problem. For example, storing even a reduced dictionary of common family names (and first names) in a traditional database may offer both a more reliable and more computationally efficient way of filtering a significant portion of the input data, leaving precious CPU resources to be spent on less obvious cases.
i am afraid this problem is not solveable if your list is even only minimally ‘open’ — if the names are eg customers from a small traditionally acting population, you might end up with a few hundred names for thousands of people. but generally you can hardly predict what is a real name and what is not, however unusual an arabic, chinese, or bantu name may look in a sample of, say, south english rural neighborhood names. i mean, ‘Ng’ is a common cantonese surname, and ‘O’ is common in korea, so assumptions may fail. there is this place in austria called ‘fucking’, so even looking out for four letter words is no guarantee for success.
what you could do is work through a sufficiently big sample of such names and sort them out manually. then, use all kinds of textprocessing tools and collect metrics. maybe you can derive a certain likelyhood for a name to be recognized as fake, maybe it will not be viable. you will never go beyond likelyhoods here, though.
as an aside, we used to use google maps and the telephone directory for validating customer data years ago. if google maps could find the place, we called the address validated. it is clear that under stricter requirements, true validation must go much further. let’s not forget the validation of such data is much more a social question than a linguistic one.

Resources