Approximate string matching algorithms state-of-the-art [closed]

I am looking for state-of-the-art algorithms for approximate string matching.
Can you point me to references (articles, theses, ...)?
Thank you.

You might have got your answer already, but I want to share my thoughts on approximate string matching so that others might benefit. I am speaking from experience, having worked on cloud services that had to handle really large-scale requirements.
If we just want to talk about approximate string matching algorithms, then there are many.
A few of them are:
Jaro-Winkler, edit distance (Levenshtein), Jaccard similarity, Soundex/phonetics-based algorithms, etc.
A simple Google search will give you all the details.
The irony is that they work when you try to match two given input strings. That is fine in theory, and for demonstrating how fuzzy or approximate string matching works.
However, a grossly understated point is how to use the same in production settings. Not everybody I know who was scouting for an approximate string matching algorithm knew how to make it work in a production environment.
Assume we have a list of millions of names. Searching a given input name against all the entries in the list using one of the standard algorithms above would mean disaster.
A typical edit distance computation runs in time proportional to the product of the two string lengths, roughly O(N^2) for strings of about N characters. To scan a list of M names, the total cost becomes O(M * N^2). This means very high hardware requirements, and it just doesn't work in your favor regardless of how much hardware you stack up.
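To make the cost concrete, here is a minimal sketch of the classic dynamic-programming edit distance in plain Python (no external libraries; the nested loops over both strings are exactly where the quadratic cost comes from):

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance; O(len(a) * len(b)) time."""
    # prev[j] holds the distance between the first i-1 characters of a and the first j of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i] + [0] * len(b)
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution
        prev = curr
    return prev[len(b)]

print(levenshtein("kitten", "sitting"))  # 3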
This is where we have to start thinking about other approaches.
One of the common approaches to solving such a problem in a production environment is to use a standard search engine such as Apache Lucene:
https://lucene.apache.org/
Lucene's indexing engine indexes the reference data (called documents), and an input query can then be fired against the engine. The results come back ranked by how close they are to the input.
This is close to how the Google search engine works: Google crawls and indexes the whole web, and you build a miniature system that mimics what Google does for your own reference data.
This works for most cases, including complicated name matching where the first, middle and last names are interchanged.
You can select your results based on the scores emitted by Lucene.
As you mature in your role, you will start thinking about hosted solutions like Amazon CloudSearch, which wraps Solr and Elasticsearch for you. Of course it uses Lucene underneath, and it keeps you independent of the potential size of the index as the reference data you index grows.
http://aws.amazon.com/cloudsearch/
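Lucene itself is a Java library, but to make the indexing-plus-ranked-query workflow concrete, here is a rough Python sketch against Elasticsearch (which is built on Lucene), assuming a local cluster at localhost:9200, a hypothetical "names" index, and a recent version of the official elasticsearch Python client. Treat it as an illustration of the approach, not a production recipe:

from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

# Index the reference data once (millions of names in practice; bulk APIs exist for that).
for i, name in enumerate(["Jonathan Smith", "John Smyth", "Joan Smithe"]):
    es.index(index="names", id=i, document={"full_name": name})
es.indices.refresh(index="names")

# Fire a fuzzy match query; results come back ranked by relevance score.
resp = es.search(
    index="names",
    query={"match": {"full_name": {"query": "jon smith", "fuzziness": "AUTO"}}},
)
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["full_name"])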

You might want to read about Levenshtein distance.
http://en.wikipedia.org/wiki/Levenshtein_distance

Related

Comparing texts based on their meanings [closed]

We have a pool of documents (Word and plain text) that could include 1000, 2000 or even more items. Each document may contain thousands of words. We are given one reference document and need to find the closest semantic matches to it from the pool.
We first used SQL Server 2017's semantic search feature, but it does not return more than 10 records, which is a limitation. What other technologies or tools are on the market that serve this purpose? We would prefer to leverage Microsoft's cognitive tools and services, but we are open to other options, including open source, that can help.
I would recommend looking into TF-IDF approaches if the documents are of a technical nature. TF-IDF looks at the frequency of a term in a document (TF) and multiplies it by the inverse document frequency (IDF), a measure of how rare the term is in the overall corpus. The thinking is: a word that a document uses often, but that is rarely used in the overall corpus, is likely an important term for the meaning of that document. A similarity measure (such as cosine similarity) is then applied to the TF-IDF vectors to find documents with a similar profile of TF-IDF scores (i.e. a similar over-use of the relatively unique terms).
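As a minimal sketch of that pipeline, assuming scikit-learn is available and the documents have already been extracted to plain text (the variable names and toy corpus are only illustrative):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pool = [
    "database index tuning and query plans",
    "recipes for sourdough bread",
    "sql server query optimizer internals",
]  # your document pool, already extracted to plain text
reference = "optimizing sql query plans in sql server"

vectorizer = TfidfVectorizer(stop_words="english")
pool_vectors = vectorizer.fit_transform(pool)    # one TF-IDF vector per document
ref_vector = vectorizer.transform([reference])   # project the reference into the same space

scores = cosine_similarity(ref_vector, pool_vectors)[0]
ranked = sorted(zip(scores, pool), reverse=True)  # closest matches first
for score, doc in ranked:
    print(f"{score:.3f}  {doc}")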
If the texts are less technical in nature, you could take a look at word embedding approaches such as Doc2Vec. Basically, these use trained sets of multi-dimensional vectors that try to capture the meaning of a word or document, which means you are not dependent on the exact same keywords being used (which is the case with TF-IDF).
Existing implementations are around (especially Python-based), and Azure can probably facilitate these technologies as well (cf. HDInsight: https://learn.microsoft.com/en-us/azure/architecture/data-guide/technology-choices/natural-language-processing). You can also look at Elasticsearch, which does some of these things out of the box.
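One of those Python implementations is gensim's Doc2Vec. A rough sketch with the gensim 4.x API follows; the toy corpus and training parameters are purely illustrative, and a real corpus needs far more data to produce meaningful vectors:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from gensim.utils import simple_preprocess

pool = [
    "the dog was too scared to go out into the storm",
    "the canine was intimidated to venture into the rainy weather",
    "quarterly financial results exceeded expectations",
]
corpus = [TaggedDocument(simple_preprocess(doc), [i]) for i, doc in enumerate(pool)]

# Train document vectors on the pool (real corpora need far more data and tuning).
model = Doc2Vec(corpus, vector_size=50, min_count=1, epochs=100)

reference = "a frightened dog hiding from bad weather"
ref_vec = model.infer_vector(simple_preprocess(reference))

# Rank pool documents by vector similarity to the reference document.
for doc_id, score in model.dv.most_similar([ref_vec], topn=3):
    print(f"{score:.3f}  {pool[doc_id]}")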

String meaning comparison [closed]

Is there some sort of algorithm out there or concept that can help with the following problem?
Say I have two snippets of text, snippet 1 and snippet 2.
Snippet 1 reads as follows:
"The dog was too scared to go out into the storm"
Snippet 2 reads as follows:
"The canine was intimidated to venture into the rainy weather"
Is there a way to compare those snippets using some sort of algorithm, or maybe some sort of string theory system? I want to know if there are any kinds of systems that have solved this problem before I tackle it.
UPDATE:
Okay, to give a more specific example: say I wanted to reduce the number of duplicate bug tickets in a ticketing system, and I wanted to do some sort of scan to see whether there are any related or similar tickets. I want to know the best systematic way of determining the issue based on the body of a ticket. The Levenshtein distance algorithm doesn't work particularly well here, since it wouldn't know the difference between "wet" and "dry".
Well, this is a very famous problem in NLP; to be more precise, you are comparing the semantics of two sentences.
Maybe you can look into libraries like gensim or Wordnet::Similarity, which provide ways to retrieve semantically similar documents.
Here's another semantically similar SO question.
An option here could be the Levenshtein Distance between two strings.
It is a measure of the number of edit operations required to get from one string to the other, so the larger the distance, the less similar the two strings.
This kind of algorithm is great for spell checking or voice recognition, because the given string and the expected string generally differ by only a couple of words or characters.
For your example, the Levenshtein distance is 32 (you can try an online calculator), which indicates that the strings are not very similar (since the strings are not much longer than the distance of 32).
This algorithm is not great for context-sensitive comparisons, but your example is kind of an extreme case. Very likely there would be more words in common, which would result in a smaller Levenshtein distance. You could use this algorithm in conjunction with some other methods (see: What are some algorithms for comparing how similar two strings are?) to try to get a more optimal comparison.
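If you want to experiment with combining surface-level measures, Python's standard library already offers one. Here is a self-contained sketch that scores the two example snippets with difflib's SequenceMatcher ratio plus a simple word-overlap (Jaccard) measure; both only capture surface overlap, not meaning, which is exactly the limitation discussed above.

import difflib

a = "The dog was too scared to go out into the storm"
b = "The canine was intimidated to venture into the rainy weather"

# SequenceMatcher.ratio() returns a similarity in [0, 1] based on matching blocks.
surface_similarity = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
print(f"surface similarity: {surface_similarity:.2f}")

# A crude word-overlap (Jaccard) measure as a second signal to combine with it.
words_a, words_b = set(a.lower().split()), set(b.lower().split())
jaccard = len(words_a & words_b) / len(words_a | words_b)
print(f"word overlap: {jaccard:.2f}")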

Looking for ICD10 API [closed]

Does anybody know of a good ICD10 API for diagnostic code lookups that they can recommend? I am currently building a simple app to tag patients with medical conditions, and the idea is to have a lookup API where one can type "asthma", for example, and get back all the different ICD10 codes for asthma.
My R package, icd, converts ICD-9 and ICD-10 codes to descriptions, in addition to its main function of finding comorbidities. Documentation is at https://jackwasey.github.io/icd/ and code at https://github.com/jackwasey/icd . It does this using the function explain_code. It currently uses ICD-10-CM, i.e. the USA billing-adapted ICD-10 code set, which in general is more specific than the canonical WHO version, but does have some areas of less detail.
E.g. WHO ICD-10 has "HIV disease resulting in Pneumocystis jirovecii pneumonia" as a subdivision of HIV infection, whereas ICD-10-CM just has HIV. On the other hand, ICD-10-CM has "Sucked into jet engine, subsequent encounter", whereas the WHO is happy with the terribly vague "Person on ground injured in air transport accident".
The volume of data for all the descriptions is not very high, just a handful of megabytes, so although an API may seem convenient, you might consider just shipping all the data instead of having to ping some random server.
I'm going to assume you're ignoring all of the usual stuff around variations of spelling of medical terms, proper terms vs. colloquialisms, labels vs. descriptions, etc. that get to be a pain with term / code finders.
If you want to use a hosted option and are OK with the terms of use, you could use UMLS (https://uts.nlm.nih.gov/home.html#apidocumentation). It's a great resource, but the use case you're describing isn't necessarily what it's intended to address.
Personally - and I usually don't like to roll my own stuff - I'd consider doing your own thing. You could build something focused on your needs and tailor it to any specific behaviors you might want (like preferring specific codes based on an organization, e.g. billing preference). You could also probably make it far, far more ... perky ... and address short forms of terms (e.g. synonyms like "DVT") or misspellings ("asthma" vs. "athsma"). If you go that route, I'd suggest getting your hands on the ICD-10 code data and then mashing it into Elasticsearch. You could extend the data by mixing it with other info and really make it hum. And Elastic is wicked fast.
That's just my $0.02, though.
There is a project called the Unified Medical Language System (UMLS), funded by the NIH, and apparently they are working on a RESTful web API for medical terms.
https://documentation.uts.nlm.nih.gov/rest/home.html
I haven't worked with their API yet, and the samples I am seeing on their website suggest they are more SNOMED-CT oriented.
The option I would go for is to get the whole ICD-10-CM data set from CMS and build my own web API.
https://www.cms.gov/Medicare/Coding/ICD10/2016-ICD-10-CM-and-GEMs.html
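If you go the build-your-own route, the lookup service itself can be tiny. Here is a hedged sketch using Flask, with a hypothetical icd10cm_codes.csv file of code/description pairs extracted from the CMS download; the naive substring match is a placeholder you would swap for fuzzy matching or a search engine:

import csv
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical file of "code,description" rows extracted from the CMS ICD-10-CM release.
with open("icd10cm_codes.csv", newline="") as f:
    CODES = [{"code": row[0], "description": row[1]} for row in csv.reader(f)]

@app.route("/icd10/search")
def search():
    q = request.args.get("q", "").strip().lower()
    if not q:
        return jsonify([])
    # Naive substring match; swap in fuzzy matching or a search engine as needed.
    hits = [c for c in CODES if q in c["description"].lower()]
    return jsonify(hits[:50])

if __name__ == "__main__":
    app.run(debug=True)  # GET /icd10/search?q=asthma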
You can also check the full documentation for the WHO ICD API: https://icd.who.int/icdapi

design a search function for high scale [closed]

(not sure if this is the right forum for this question)
I am very curious about how search on major sites, say YouTube/Quora/Stack Exchange, works.
And I'm NOT looking for an answer like "they use the Lucene search engine". I want to understand exactly how the indexing works there.
Is there a different index for text search than for the autocomplete feature?
Is it done in the background, like MapReduce?
How exactly does MapReduce help deliver results? (I know that it counts words in each document, but what happens after that when I search for a keyword?)
I also heard that Google stopped using MapReduce and is now using Cloud Dataflow here - how does that work?
Help please :-)
I voted to close because I think your question is too broad; each bullet could form the basis of an SO question. That stated, I'll take a crack at answering how SolrCloud attempts to solve each of the problems you are asking about:
Is there a different index for text search than for the autocomplete feature?
The short answer is "yes". Solr has several options for implementing an autocomplete feature, and all of them rely on either building a separate index or being supplied a separate dictionary. You can also roll your own in an even more sophisticated fashion, as the blog post "Super flexible AutoComplete with Solr" demonstrates.
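To see why autocomplete usually gets its own structure rather than reusing the main text index, here is a toy Python sketch of a prefix index built ahead of time. Nothing here is Solr-specific - Solr's suggesters are far more sophisticated - it only illustrates the "separate, precomputed index" idea:

from collections import defaultdict

titles = ["approximate string matching", "apache lucene", "apache solr", "autocomplete design"]

# Build a prefix -> completions map offline, separate from the full-text index.
prefix_index = defaultdict(list)
for title in titles:
    for end in range(1, len(title) + 1):
        prefix_index[title[:end]].append(title)

def suggest(prefix: str, limit: int = 5) -> list[str]:
    """Cheap lookup at query time; all the work happened at index time."""
    return prefix_index.get(prefix.lower(), [])[:limit]

print(suggest("apa"))  # ['apache lucene', 'apache solr']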
Is it done in the background, like MapReduce?
Generally speaking, no. SolrCloud is based on the idea of shards with leaders and replicas: a shard is a subset of your overall index, and each shard is made up of a leader and possibly one or more replicas.
Queries are executed against all shard leaders, with one particular shard assigned to act as the aggregator of each shard's response. Unlike MapReduce, where the individual node responses contain all the data the reducing node needs, the aggregating Solr shard may make multiple requests back to the other shards, for example to work out sort order.
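As a toy illustration of that scatter-gather pattern (not Solr's actual implementation, just the shape of it): the aggregator fans the query out to every shard and then merges the partial, already-sorted results.

import heapq

# Each "shard" holds a slice of the index and can answer a query locally, returning
# (score, doc_id) pairs sorted by descending score.
shards = [
    [(0.92, "doc-a1"), (0.40, "doc-a2")],
    [(0.81, "doc-b1"), (0.77, "doc-b2")],
    [(0.95, "doc-c1")],
]

def search_shard(shard, query):
    # A real shard would score its documents against the query; scores are canned here.
    return shard

def aggregate(query, top_k=3):
    # Fan out to all shard leaders, then merge their sorted partial results.
    partials = [search_shard(s, query) for s in shards]
    merged = heapq.merge(*partials, key=lambda hit: hit[0], reverse=True)
    return list(merged)[:top_k]

print(aggregate("fuzzy name match"))  # [(0.95, 'doc-c1'), (0.92, 'doc-a1'), (0.81, 'doc-b1')]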
How exactly does MapReduce help deliver results? (I know that it counts words in each document, but what happens after that when I search for a keyword?)
See my response to your previous question. In short, the query is executed against each shard, aggregated by one of those shards, and returned to the requestor. The useful magic that people most often associate with Solr - Lucene, really - is term frequency / inverse document frequency (TF-IDF) indexing, usually with stemming, for text searches. While this is not exactly what happens under the hood, and you can vary what is actually done via configuration, it gives a fairly good idea of what is being done.
Other searching, on dates, numbers, or simple textual values, is done in a fashion similar to database indexing. That is a simplification; if you want to understand it more fully, read the Javadoc on NumericRangeQuery for an in-depth explanation.
I also heard that Google stopped using MapReduce and is now using Cloud Dataflow here - how does that work?
If I knew the answer to that, I would probably be working for Google and not answering StackOverflow questions :). Seriously, whatever they have built is new PhD-level work that, as far as I know, they haven't even released a research paper on, which is what they did with MapReduce and which led to Yahoo building Hadoop.

Difference between a search engine's relevance rankings and a recommender system [closed]

What's the difference between a search engine's relevance rankings and a recommender system?
Don't both try to achieve the same purpose, i.e. finding the most relevant items for the user?
There is a major difference between a search engine and a recommender system:
In a search engine, the user knows what he is looking for, and he makes the query! For instance, I might wonder whether I should go to see a movie and search for information about it, like its actors and director.
In a recommender system, the user isn't supposed to know what we are recommending to her. We match her tastes with those of neighbours (or whatever algorithm you like) and find things that she wouldn't have looked for herself, like a new movie!
One is more about information retrieval, while the other is more about information filtering and discovery.
No, they operate at two different levels of analysis.
A search engine looks into a collection of data to get the data that match a query, even if all the results are identical or the results don't change from day to day. It is very much like a special form of database.
A recommender system uses information about you to provide specifically improved content around the searched data. It is very much like a servant who knows you well and uses a search engine for you.
Beware: some tools that started out as web search engines are now more like recommender systems.
