Cross Referencing Databases on Fuzzy Data - search

I am currently working on a project where I have to match up a large quantity of user-generated names with a separate list of the same names in a canonical format. The problem is that the user-generated names contain numerous misspellings, abbreviations, and outright invalid data, making it hard to cross-reference them with the canonical data. Any suggestions on methods to do this?
This does not have to be done in real time, and in this case accuracy is more important than speed.
Current ideas for this are:
Do a fuzzy search for the user-entered name in the canonical database using an existing search implementation like Lucene or Sphinx, which I presume use something like the Levenshtein distance for this.
Cross-reference on the SOUNDEX hash (which is supposedly computed from the sound of the name rather than its spelling) instead of using the actual name.
Some combination of the above
Anyone have any feedback on any of these or ideas of their own?
One of my concerns is that none of the above methods will handle abbreviations very well. Can anyone point me in the direction of some machine learning methods to actually search on expanded abbreviations (or tell me I'm crazy)? Thanks in advance.
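To make ideas 1 and 2 above concrete, here's a rough sketch in plain Python: difflib's ratio() stands in for a Levenshtein-style similarity (to stay standard-library only), the Soundex implementation is one of several common variants, and the 0.2 boost is just a placeholder weight to tune.

    import difflib


    def soundex(name):
        """One common variant of the classic four-character Soundex code."""
        groups = {"bfpv": "1", "cgjkqsxz": "2", "dt": "3", "l": "4", "mn": "5", "r": "6"}
        digit = {ch: d for letters, d in groups.items() for ch in letters}
        name = "".join(ch for ch in name.lower() if ch.isalpha())
        if not name:
            return ""
        code = name[0].upper()
        prev = digit.get(name[0], "")
        for ch in name[1:]:
            d = digit.get(ch, "")
            if d and d != prev:
                code += d
            if ch not in "hw":      # vowels reset the previous code; h and w do not
                prev = d
        return (code + "000")[:4]


    def best_matches(user_name, canonical_names, top_n=5):
        """Rank canonical names by string similarity, with a boost when they sound alike."""
        user_code = soundex(user_name)
        scored = []
        for canon in canonical_names:
            score = difflib.SequenceMatcher(None, user_name.lower(), canon.lower()).ratio()
            if soundex(canon) == user_code:
                score += 0.2    # hypothetical weight, to be tuned
            scored.append((score, canon))
        scored.sort(reverse=True)
        return [name for _, name in scored[:top_n]]


    print(best_matches("Jon Smyth", ["John Smith", "Jane Smithson", "Joan Smart"]))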

First, I'd add to your list the techniques discussed in Peter Norvig's post on spelling correction.
Second, I'd ask what kind of "user-generated names" you're talking about. Having dealt with both, I believe that the heuristics you'd use for street names are somewhat different from the heuristics for person names. (As a simple example, does "Dr" expand to "Drive" or "Doctor"?)
Third, I'd look at combining the techniques, using testing to establish the set of coefficients for weighting the results of each.
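As a minimal sketch of that third point: tune the combination coefficients against a small labeled test set. The two scorers and the grid search here are placeholders for whatever techniques and tuning method you actually settle on.

    import difflib
    import itertools


    def edit_sim(a, b):
        """difflib similarity, standing in for a normalized Levenshtein score."""
        return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()


    def initials_sim(a, b):
        """Crude abbreviation scorer: initials of the longer string vs. the shorter string."""
        short, long_ = sorted((a.lower(), b.lower()), key=len)
        initials = "".join(w[0] for w in long_.split() if w)
        return 1.0 if initials == short.replace(".", "").replace(" ", "") else 0.0


    def accuracy(weights, labeled_pairs, canonical_names, scorers):
        """Fraction of user names whose top-ranked canonical match is the known-correct one."""
        hits = 0
        for user_name, truth in labeled_pairs:
            best = max(canonical_names,
                       key=lambda c: sum(w * s(user_name, c) for w, s in zip(weights, scorers)))
            hits += (best == truth)
        return hits / len(labeled_pairs)


    def tune_weights(labeled_pairs, canonical_names, scorers, grid=(0.0, 0.25, 0.5, 0.75, 1.0)):
        """Brute-force grid search over coefficient combinations; fine for a handful of scorers."""
        candidates = itertools.product(grid, repeat=len(scorers))
        return max(candidates, key=lambda w: accuracy(w, labeled_pairs, canonical_names, scorers))


    canonical = ["American Automobile Association", "American Airlines"]
    labeled = [("AAA", "American Automobile Association"),
               ("Amrican Airlines", "American Airlines")]
    weights = tune_weights(labeled, canonical, [edit_sim, initials_sim])
    print(weights, accuracy(weights, labeled, canonical, [edit_sim, initials_sim]))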

Related

Smart search for acronyms in Salesforce

In Salesforce's Service Cloud one can enable the out-of-the-box search function, where the user enters a term and the system searches all parts of the database for a match. I would like to enable smart searching of acronyms, so that if I spell out an organization's name the search functionality will also search for associated acronyms in the database. For example, if I type in "American Automobile Association", I would also get results that contain "AAA" as well as those that contain "American Automobile Association".
I imagine such a script would involve checking whether the term being searched contains one or more spaces or periods, and if so taking the first letter of the first word and concatenating it with the letters that follow each subsequent space or period.
I have unsuccessfully tried to find scripts for this or articles on enabling this functionality in Salesforce. Any guidance would be appreciated.
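To illustrate the rule I have in mind, here is a rough sketch in plain Python (illustrative only; an actual Salesforce implementation would live in Apex or in whatever layer builds the search query):

    import re


    def acronym_candidates(term):
        """Build acronym variants from a multi-word term: the first letters of each
        space- or period-separated word, per the rule sketched above."""
        words = [w for w in re.split(r"[\s.]+", term) if w]
        if len(words) < 2:
            return []
        initials = "".join(w[0].upper() for w in words)
        return [initials, ".".join(initials) + "."]    # e.g. "AAA" and "A.A.A."


    print(acronym_candidates("American Automobile Association"))
    # ['AAA', 'A.A.A.']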
Interesting question! I don't think there's a straightforward answer, but as this is about standard search functionality and not 100% programming-related, you might want to cross-post it to salesforce.stackexchange.com.
Let's start with the list of searchable fields: https://help.salesforce.com/articleView?id=search_fields_business_accounts.htm&type=0
In Setup there's standard functionality for Synonyms, which is quite easy to use. It's not a silver bullet though; it applies only to certain objects like Knowledge Base (if you use it). Still, it claims to work on Cases too, so if there's "AAA" in a Case description it should still be good enough?
You could also check out the trick of marking a text field as indexed and/or external ID and adding all your variations / acronyms there: https://success.salesforce.com/ideaView?id=08730000000H6m2 This is more work, since you have to prepare / sanitize your data upfront, but it's not a bad idea.
A similar idea would be to use Tags, although that could explode in size very quickly; creating a tag for every single company would be ridiculous.
You can do some really smart things with data deduplication rules. It's too much to write it all up here; check out the Trailhead module: https://trailhead.salesforce.com/en/modules/sales_admin_duplicate_management/units/sales_admin_duplicate_management_unit_2 No idea if it impacts search, though.
If you suffer from bad address data, there are State & Country picklists, so no more mess with CA / California / SoCal: https://resources.docs.salesforce.com/204/latest/en-us/sfdc/pdf/state_country_picklists_impl_guide.pdf It might not help with the Name problem, though.
Data.com cleanup might help. It's a paid service, I think, and I have no idea whether it affects search too, but if enabling it can bring these common abbreviations into your org, it might be better than reinventing the wheel.

Proper approach to get words like "dentistry", "dentist" from query like "dental" (and vice versa)

I'm somewhat familiar with stemming, but the stemming library I've been given to use for a project doesn't work very well for a case where I want to find related words. For example, if I do a query for any of these:
"dental", "dentist", "dentistry"
I should get a match for the others. I've been looking into this and I'm learning about parts of speech I didn't even know existed, like pertainyms and troponyms, so I'm wondering: is there a library out there that has a mapping between all of these different parts of speech and could give back the sort of match I'm looking for?
I've been searching on this and haven't found a whole lot that I can make sense of. I probably don't know the right terminology, and I would greatly appreciate it if anyone could point me in the right direction.
One approach common in IR is to stem all words in both the index and the query. That is, a document containing the word 'dentistry' is stemmed and stored in the index under its stem, e.g. 'dentist'; the query keyword 'dental' is stemmed the same way, thereby matching it in the index (provided the stemmer actually conflates these forms).
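A minimal sketch of that approach using NLTK's Porter stemmer (assuming NLTK is installed). One caveat: whether 'dental', 'dentist', and 'dentistry' really end up under the same stem depends on the stemmer; the stock Porter stemmer does not conflate all three, which is exactly the limitation the question describes.

    from collections import defaultdict

    from nltk.stem import PorterStemmer

    stemmer = PorterStemmer()

    documents = {
        1: "pediatric dentistry clinic",
        2: "dental implants and crowns",
        3: "find a dentist near you",
    }

    # Inverted index keyed by stems rather than surface forms.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in text.lower().split():
            index[stemmer.stem(token)].add(doc_id)


    def search(query):
        """Stem each query term and look it up in the stemmed index."""
        results = set()
        for token in query.lower().split():
            results |= index.get(stemmer.stem(token), set())
        return sorted(results)


    print(search("dentists"))    # 'dentists' and 'dentist' share a stem, so doc 3 matches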
Have a look at WordNet. WordNet is an organized ontology of words and concepts with links for various types of relations between words. I'm not sure it will have exactly the relationships you want, but it's probably a good start. There are interfaces in many programming languages (Java and Python are the ones I've used; presumably there are many more).
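For example, NLTK's WordNet interface lets you follow derivational and pertainym links between lemmas (this assumes NLTK with the 'wordnet' corpus downloaded; how much of the dental/dentist/dentistry cluster is actually linked varies by WordNet version):

    # Requires: import nltk; nltk.download('wordnet')
    from nltk.corpus import wordnet as wn


    def related_forms(word):
        """Collect lemma names reachable via derivational and pertainym links."""
        related = set()
        for lemma in wn.lemmas(word):
            for linked in lemma.derivationally_related_forms() + lemma.pertainyms():
                related.add(linked.name())
        return sorted(related)


    print(related_forms("dental"))
    print(related_forms("dentistry"))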

what algorithm does freebase use to match by name?

I'm trying to build a local version of the Freebase search API using their quad dumps. I'm wondering what algorithm they use to match names. As an example, if you go to freebase.com and type in "Hiking", you get:
"Apo Hiking Society"
"Hiking"
"Hiking Georgia"
"Hiking Virginia's national forests"
"Hiking trail"
Wow, a lot of guesses! I hope I don't muddy the waters too much by not guessing too.
The auto-complete box is basically powered by Freebase Suggest, which is powered, in turn, by the Freebase Search service. Strings which are indexed by the search service for matching include: 1) the name, 2) all aliases in the given language, 3) link anchor text from the associated Wikipedia articles and 4) identifiers (called keys by Freebase), which include things like Wikipedia article titles (and redirects).
How the various things are weighted/boosted hasn't been disclosed, but you can get a feel for it by playing with the service for a while. As you can see from the API, there's also the ability to do filtering/weighting by types and other criteria, and this can come into play depending on the context. For example, if you're adding a record label to an album, topics which are typed as record labels will get a boost relative to things which aren't (but you can still get to things of other types, to allow for the use case where your target topic hasn't had the appropriate type applied yet).
So that gives you a little insight into how their service works, but why not build a search service that does what you need since you're starting from scratch anyway?
BTW, pre-Google the Metaweb search implementation was built on top of Lucene, so you could definitely do worse than using that as your starting point. You can read some of the details in the mailing list archive.
They probably use an inverted index over selected fields, such as the English name, aliases, and the Wikipedia snippet displayed. In your application you can achieve that using something like Lucene.
For the algorithm side, I find the following paper a good overview:
Zobel and Moffat (2006): "Inverted Files for Text Search Engines".
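As a toy illustration of an inverted index over a few such fields (the field names and example topics here are made up; Lucene handles all of this, plus ranking, for you):

    from collections import defaultdict

    topics = [
        {"id": "m.1", "name": "Hiking", "aliases": ["hill walking"], "blurb": "outdoor activity"},
        {"id": "m.2", "name": "Apo Hiking Society", "aliases": [], "blurb": "Filipino musical group"},
    ]

    # Map each token to the set of topic ids whose indexed fields contain it.
    index = defaultdict(set)
    for topic in topics:
        for field in [topic["name"], topic["blurb"], *topic["aliases"]]:
            for token in field.lower().split():
                index[token].add(topic["id"])

    print(sorted(index["hiking"]))    # ['m.1', 'm.2']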
Most likely it's a trie with lexicographical order.
There are a number of algorithms available: Boyer-Moore, Smith-Waterman-Gotoh, Knuth-Morris-Pratt, etc. You might also want to check out edit distance algorithms such as Levenshtein. You will need to play around to see which best suits your purpose.
An implementation of such algorithms is the SimMetrics library from the University of Sheffield.
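For instance, here is a compact sketch of the standard dynamic-programming formulation of Levenshtein edit distance (SimMetrics and similar libraries give you tuned implementations of this plus many other metrics):

    def levenshtein(a, b):
        """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            curr = [i]
            for j, cb in enumerate(b, start=1):
                cost = 0 if ca == cb else 1
                curr.append(min(prev[j] + 1,          # deletion
                                curr[j - 1] + 1,      # insertion
                                prev[j - 1] + cost))  # substitution
            prev = curr
        return prev[-1]


    print(levenshtein("Jon Smyth", "John Smith"))    # 2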

I have a list of names, some of them are fake, I need to use NLP and Python 3.1 to keep the real names and throw out the fake names

I have no clue where to start on this. I've never done any NLP and have only programmed in Python 3.1, which I have to use. I'm looking at the site http://www.linkedin.com and I have to gather all of the public profiles, and some of them have very fake names, like 'aaaaaa k dudujjek'. I've been told I can use NLP to find the real names; where would I even start?
This is a difficult problem to solve, and one which starts with acquiring valid given name & surname lists.
How large is the set of names that you're evaluating, and where do they come from? These are both important things for you to consider. If you're evaluating a small set of "American" names, your valid name lists will differ greatly from lists of Japanese or Indian names, for instance.
Your idea of scraping LinkedIn is on the right track, but you were right to catch the fake profile/name flaw. A better website would probably be something like IMDB (perhaps scraping names by iterating over different birth years), or Wikipedia's lists of most popular given names and most common surnames.
When it comes down to it, this is a precision vs. recall problem: in order to miss fewer fakes, you're inevitably going to throw out some real names. If you loosen up your restrictions, you'll get more fakes, but you'll also throw out fewer real names.
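A rough sketch of list-based validation with a strictness knob, just to make that precision/recall trade-off concrete; the tiny name sets below are placeholders for lists built from sources like the ones above:

    GIVEN_NAMES = {"john", "maria", "wei", "fatima"}      # placeholder given-name list
    SURNAMES = {"smith", "garcia", "ng", "o"}             # placeholder surname list


    def looks_real(full_name, min_known_tokens=1):
        """Accept a name if at least min_known_tokens of its tokens are on a known list.

        Raising min_known_tokens buys precision (fewer fakes accepted) at the cost of
        recall (more real but unusual names rejected)."""
        tokens = [t.strip(".").lower() for t in full_name.split()]
        known = sum(t in GIVEN_NAMES or t in SURNAMES for t in tokens)
        return known >= min_known_tokens


    print(looks_real("John Smith"))           # True
    print(looks_real("aaaaaa k dudujjek"))    # False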
There are several possibilities here, but the most obvious seems to be HMMs, i.e. Hidden Markov Models. NLTK includes [at least] one module for HMMs, although I must admit I've never used it.
Another possible snag is that, AFAIK, NLTK has not yet been ported to Python 3.0.
That said, and while I'm quite keen on using NLP techniques where applicable, I think that a process using several paradigms, including some NLP tricks, may be a better solution for this particular problem. For example, storing even a reduced dictionary of common family names (and first names) in a traditional database may offer both a more reliable and a more computationally efficient way of filtering a significant portion of the input data, leaving precious CPU resources to be spent on the less obvious cases.
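One way to combine those ideas short of a full HMM is a plain character-bigram Markov model: train transition probabilities on known-real names, then flag names whose average per-character log-probability is unusually low. This is only a sketch; the training names, alphabet, and whatever threshold you would pick are all placeholders.

    import math
    from collections import defaultdict

    ALPHABET = set("abcdefghijklmnopqrstuvwxyz '^$")


    def train_bigrams(names):
        """Add-one-smoothed log-probabilities of character transitions in real names."""
        counts = defaultdict(lambda: defaultdict(int))
        for name in names:
            padded = "^" + name.lower() + "$"
            for a, b in zip(padded, padded[1:]):
                counts[a][b] += 1
        model = {}
        for a in ALPHABET:
            total = sum(counts[a].values()) + len(ALPHABET)
            model[a] = {b: math.log((counts[a][b] + 1) / total) for b in ALPHABET}
        return model


    def avg_logprob(model, name):
        """Average per-transition log-probability; lower means 'less name-like'."""
        padded = "^" + name.lower() + "$"
        pairs = list(zip(padded, padded[1:]))
        floor = math.log(1e-6)    # fallback for characters outside the alphabet
        return sum(model.get(a, {}).get(b, floor) for a, b in pairs) / len(pairs)


    model = train_bigrams(["john smith", "maria garcia", "wei zhang", "fatima khan"])
    for candidate in ["jane smith", "aaaaaa k dudujjek"]:
        print(candidate, round(avg_logprob(model, candidate), 2))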
I am afraid this problem is not solvable if your list is even only minimally 'open'. If the names are, e.g., customers from a small, traditionally acting population, you might end up with a few hundred names covering thousands of people, but generally you can hardly predict what is a real name and what is not, however unusual an Arabic, Chinese, or Bantu name may look in a sample of, say, southern English rural neighborhood names. I mean, 'Ng' is a common Cantonese surname, and 'O' is common in Korea, so assumptions may fail. There is a place in Austria called 'Fucking', so even looking out for four-letter words is no guarantee of success.
What you could do is work through a sufficiently big sample of such names and sort them out manually. Then use all kinds of text-processing tools and collect metrics. Maybe you can derive a certain likelihood of a name being recognized as fake, maybe it will not be viable. You will never get beyond likelihoods here, though.
As an aside, we used to use Google Maps and the telephone directory for validating customer data years ago. If Google Maps could find the place, we called the address validated. It is clear that under stricter requirements, true validation must go much further. Let's not forget that validating such data is much more a social question than a linguistic one.

Ways to do "related searches" functionality

I've seen a few sites that list related searches when you perform a search, namely they suggest other search queries you may be interested in.
I'm wondering about the best way to model this on a medium-sized site (not enough traffic to rely on visitor stats to infer relationships). My initial thought is to store the top 10 results for each unique query; then, when a new search is performed, find all the historical searches that match some of those top 10 results but ideally not all of them (matching all of them would suggest an equivalent search, and hence not that useful as a suggestion).
I imagine that some people have built this functionality before and may be able to provide some ideas on different ways to do it. I'm not necessarily looking for one winning idea, since the solution will no doubt vary substantially depending on the size and nature of the site.
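To make my initial thought concrete, here is a rough sketch: keep the stored top-10 result sets per query and suggest past queries whose results overlap the new query's results somewhat, but not (nearly) completely. The Jaccard measure and the thresholds are arbitrary choices I would expect to tune.

    def jaccard(a, b):
        return len(a & b) / len(a | b) if a | b else 0.0


    def related_searches(new_results, history, lo=0.2, hi=0.9):
        """history maps past query strings to the set of their stored top-10 result ids."""
        suggestions = []
        for past_query, past_results in history.items():
            overlap = jaccard(set(new_results), past_results)
            if lo <= overlap < hi:    # some shared results, but not an equivalent search
                suggestions.append((overlap, past_query))
        return [q for _, q in sorted(suggestions, reverse=True)]


    history = {
        "hiking trails": {1, 2, 3, 4, 5},
        "hiking boots": {4, 5, 6, 7, 8},
        "hiking": {1, 2, 3, 4, 6},
    }
    print(related_searches({1, 2, 3, 4, 6}, history))
    # ['hiking trails', 'hiking boots']  ('hiking' is excluded as near-identical)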
Have you considered a matrix with keywords on one axis and documents on the other? Once you find the set of vectors representing the keywords, find the set of keywords found in your initial result set, and then find a way to rank the other keywords by how many documents they reference or how many times they intersect the initial result set.
I've tried a number of different approaches to this, with various degrees of success. In the end, I think the best approach is highly dependent on the domain/topics being searched, and how the users form queries.
Your thought about storing previous searches seems reasonable to me. I'd be curious to see how it works in practice (I mean that in the most sincere way -- there are many nuances that can cause these techniques to fail in the "real world", particularly when data is sparse).
Here are some techniques I've used in the past, and seen in the literature:
Thesaurus based approaches: Index into a thesaurus for each term that the user has used, and then use some heuristic to filter the synonyms to show the user as possible search terms.
Stem and search on that: Stem the search terms (e.g. with the Porter stemming algorithm) and then use the stemmed terms instead of the initially provided queries, giving the user the option of searching for exactly the terms they specified. (Or do the opposite: search on the exact terms first, and use stemming to find other terms that stem to the same root. This second approach obviously takes some pre-processing of a known dictionary, or you can collect terms as your indexer finds them.)
Chaining: Parse the results found by the user's query and extract key terms from the top N results (KEA is one library/algorithm that you can look at for keyword extraction techniques.)
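As a very rough stand-in for the chaining idea (a real keyword extractor like KEA will do much better), you can simply count terms across the top-N result texts and suggest the most frequent ones that aren't already in the query or a stoplist:

    from collections import Counter

    STOPWORDS = {"the", "a", "an", "and", "of", "for", "in", "to"}


    def suggest_terms(query, top_n_results, k=5):
        """Suggest frequent terms from the top result texts, minus query terms and stopwords."""
        query_terms = set(query.lower().split())
        counts = Counter(
            token
            for text in top_n_results
            for token in text.lower().split()
            if token not in STOPWORDS and token not in query_terms
        )
        return [term for term, _ in counts.most_common(k)]


    results = ["dental implants and crowns", "pediatric dentistry clinic", "dental hygiene tips"]
    print(suggest_terms("dental", results))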
