I'm having difficulty finding NEST Elasticsearch sample code that treats singular and plural terms as the same, e.g. 'Shoes' and 'Shoe', 'Mouse' and 'Mice'. I need help achieving this. Thanks.
Have you tried using snowball for stemming?
http://www.elasticsearch.org/guide/reference/index-modules/analysis/snowball-analyzer/
Here is a link that discusses using snowball over the standard analyzer for your exact scenario.
Lucene Standard Analyzer vs Snowball
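As a rough illustration of the stemming suggestion, here is a minimal sketch of index settings that wire a Snowball token filter into a custom analyzer. It is written as a Python dict that serializes to the JSON body Elasticsearch accepts when creating an index; a NEST client should end up sending an equivalent structure. The analyzer name `english_stemmed`, the filter name `english_snowball`, and the field `name` are hypothetical, and the typeless `mappings` layout assumes Elasticsearch 7 or later. Note that a stemmer handles regular plurals such as 'Shoes'/'Shoe', while irregular pairs such as 'Mouse'/'Mice' will likely need a synonym filter as well.

```python
import json

# Index body: a custom analyzer that lowercases tokens and then applies
# the Snowball stemmer for English. All names here are illustrative.
index_body = {
    "settings": {
        "analysis": {
            "filter": {
                "english_snowball": {"type": "snowball", "language": "English"}
            },
            "analyzer": {
                "english_stemmed": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "english_snowball"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            # Hypothetical product-name field analyzed with the stemmer
            # at both index time and search time.
            "name": {"type": "text", "analyzer": "english_stemmed"}
        }
    },
}

# JSON request body to send when creating the index.
request_json = json.dumps(index_body, indent=2)
```

Because the same analyzer is applied when indexing and when searching, a query for 'shoe' and a document containing 'Shoes' should reduce to the same stem.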
Related
I want to build a search mechanism similar to Google's using NLP (Natural Language Processing) in Java. The algorithm should provide auto-suggestions and spellcheck, extract the meaning of a sentence, and display the most relevant results.
For example, if I type "laptop", the relevant results shown should be ["laptop bags","laptop deals","laptop prices","laptop services" ,"laptop tablet"]
Is it possible to achieve this with NLP and semantics? I would appreciate any reference links or ideas.
"Get the meaning out of a sentence" is a really difficult task. I don't believe even Google does that in its search engine ;) For search, getting the meaning of the query is not that important, but it really depends on what you mean by "get the meaning". In any case, you can always buy something like the Google Search Appliance, which is a private Google search box.
All the other requirements are quite straightforward. I'm from Java land, so I'd suggest you look at:
Apache Lucene - if you are a developer, it's an indexing library built around full-text search
Elasticsearch - a full-blown, fast, scalable server built around Lucene that can do most of what you are asking.
Solr - another one, equal to Elasticsearch in terms of functionality, IMHO.
First, a little bit of context: I'm trying to identify street addresses in a corpus of documents, and we decided that the obvious solution would be to use an NLP tool (Apache OpenNLP in this case). So far everything looks great, although we still need to train the model with a lot of documents, but that's not really an issue. We improved the solution by adding an extra step for address validation using the USAddress parser from Datamade. My biggest issue is that the addresses by themselves are nothing without a location next to them. Sometimes the location is specified in the text, and we will assume that this happens quite often.
Here comes my question: is there some way to use coreference to associate the entities in the text? Or, better yet, is there a way to annotate arbitrary words in the text and identify them as being one entity?
I've been looking at the Apache OpenNLP documentation but...it's pretty thin and I think it still needs some work.
If you want to use coreference for this problem, you can have a look at this blog
But a simpler solution would be to use a sentence detector + regex, or a location NER + sentence detector (presuming each address is on a single line).
I think US addresses can be identified using a regular expression; once the regex matches, you can use OpenNLP's sentence detector to print the whole address line.
Similarly, you can use the NER model provided by OpenNLP to find locations and print the sentence you want.
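The regex-plus-sentence-detector idea can be sketched roughly as follows. This is a simplified illustration, not OpenNLP itself: the sentence splitter is a naive regex stand-in for a trained sentence detector, and `ADDRESS_RE` is a deliberately small, hypothetical pattern that only covers addresses of the form "number, capitalized street name, common suffix".

```python
import re

# Hypothetical, intentionally narrow pattern: house number, one to three
# capitalized words, then a common street suffix.
ADDRESS_RE = re.compile(
    r"\b\d{1,5}\s+(?:[A-Z][a-z]+\s+){1,3}"
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Drive|Dr|Lane|Ln)\b"
)

# Naive sentence boundary: end punctuation, whitespace, then a capital letter.
# A trained sentence detector would handle abbreviations such as "St." better.
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def sentences_with_addresses(text):
    """Return (address, sentence) pairs for sentences containing an address."""
    results = []
    for sentence in SENTENCE_SPLIT_RE.split(text):
        match = ADDRESS_RE.search(sentence)
        if match:
            results.append((match.group(0), sentence.strip()))
    return results


sample = (
    "The office moved last year. "
    "You can find us at 123 Main Street in Springfield. "
    "Call ahead."
)
for address, sentence in sentences_with_addresses(sample):
    print(address, "|", sentence)
```

Returning the whole sentence keeps nearby context, such as the city name, attached to each matched address.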
Hope this helps!
Edit:
This GitHub repo made it simple for us. Check it out!
OpenNLP does not provide a coreference resolution module. You have to use the Stanford, Illinois, or Berkeley system to accomplish the task. They may not work out of the box; you may have to do some parameter tuning or supervised training to achieve reasonable performance.
Edit:
Thanks @Alaye for pointing out that OpenNLP does have a coref module; for more details, see his answer.
Thanks
OK, several months later! It wasn't coref that I was after. What I was actually looking for was Relation Extraction (Information Extraction). I used MITIE (BinaryRelation) and that did the trick. I trained my own model using the Brat annotation tool and got an F1 score of 0.81. Pretty neat...
I am experimenting with elasticsearch as a search server and my task is to build a "semantic" search functionality. From a short text phrase like "I have a burst pipe" the system should infer that the user is searching for a plumber and return all plumbers indexed in elasticsearch.
Can that be done directly in a search server like Elasticsearch, or do I have to use a natural language processing (NLP) tool such as Maui Indexer? What is the exact terminology for my task at hand: text classification? The given text is very short, though, as it is a search phrase.
There may be several approaches with different implementation complexity.
The easiest one is to create a list of topics (like plumbing), attach a bag of words to each (like "pipe"), identify the topic of a search request by the majority of matching keywords, and search only within that topic (you can add a topic field to your Elasticsearch documents and make it mandatory with + during search).
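A minimal sketch of this keyword-majority approach might look like the following. The topics, keyword sets, and the field names `topic` and `description` are all made up for illustration, and the mandatory topic is expressed as a `bool` filter in the query DSL rather than with `+` in query-string syntax.

```python
import re

# Hypothetical hand-built topic list with a bag of words per topic.
TOPIC_KEYWORDS = {
    "plumbing": {"pipe", "leak", "drain", "faucet", "burst", "toilet"},
    "electrical": {"socket", "fuse", "wiring", "outlet", "breaker"},
}


def classify_query(query):
    """Pick the topic whose keyword set overlaps the query the most."""
    tokens = set(re.findall(r"[a-z]+", query.lower()))
    scores = {topic: len(tokens & words) for topic, words in TOPIC_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def build_search_body(query):
    """Build an Elasticsearch request body restricted to the inferred topic."""
    body = {"query": {"bool": {"should": [{"match": {"description": query}}]}}}
    topic = classify_query(query)
    if topic is not None:
        # The filter makes the topic mandatory; the text match only ranks.
        body["query"]["bool"]["filter"] = [{"term": {"topic": topic}}]
    return body


print(classify_query("I have a burst pipe"))
```

For the term filter to match exactly, the `topic` field would normally be mapped as a `keyword` field.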
Of course, if you have lots of documents, manually creating the topic list and bags of words is very time-consuming. You can use machine learning to automate some of these tasks. Basically, it is enough to have a distance measure between words and/or documents to automatically discover topics (e.g. by data clustering) and classify a query into one of these topics. A mix of these techniques may also be a good choice (for example, you can manually create topics and assign initial documents to them, but use classification for query assignment). Take a look at Wikipedia's article on latent semantic analysis to better understand the idea. Also pay attention to the two linked articles on data clustering and document classification. And yes, Maui Indexer may be a good helper tool here.
Finally, you can try to build an engine that "understands" the meaning of the phrase (rather than just using term frequency) and searches the appropriate topics. Most probably, this will involve natural language processing and ontology-based knowledge bases. But this field is still an area of active research, and without previous experience it will be very hard for you to implement something like this.
You may want to explore https://blog.conceptnet.io/2016/11/03/conceptnet-5-5-and-conceptnet-io/.
It combines semantic networks and distributional semantics.
When most developers need word embeddings, the first and possibly only place they look is word2vec, a neural net algorithm from Google that computes word embeddings from distributional semantics. That is, it learns to predict words in a sentence from the other words around them, and the embeddings are the representation of words that make the best predictions. But even after terabytes of text, there are aspects of word meanings that you just won’t learn from distributional semantics alone.
Some results
The ConceptNet Numberbatch word embeddings, built into ConceptNet 5.5, solve these SAT analogies better than any previous system. It gets 56.4% of the questions correct. The best comparable previous system, Turney’s SuperSim (2013), got 54.8%. And we’re getting ever closer to “human-level” performance on SAT analogies — while particularly smart humans can of course get a lot more questions right, the average college applicant gets 57.0%.
Semantic search is basically search with meaning. Elasticsearch uses JSON serialization by default; to apply search with meaning to JSON, you would need to extend it to support edge relations via JSON-LD. You can then apply your semantic analysis over the JSON-LD schema to disambiguate the plumber entity and the burst pipe context as subject, predicate, object relationships. Elasticsearch has very weak semantic search support, but you can work around it using faceted search and bag of words. You can index a thesaurus schema for plumbing terms, then do semantic matching over the text phrases in your sentences.
"Elasticsearch 7.3 introduced text similarity search with vector fields".
They describe the application of using text embeddings (e.g., word embeddings and sentence embeddings) to implement this sort of semantic similarity measure.
A bit late to the party, but part II of this blog seems to address this through "contextual searches". It basically makes a two-part query to Elasticsearch in order to build a list of "seed" documents and then an expanded query via the more-like-this API. The result is a set of documents most contextually similar to the search query.
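One possible reading of that two-part flow is sketched below as plain request bodies. This is an assumption about the general shape, not the blog's actual code: the field name `body`, the index name, and the tuning values are hypothetical. The first body finds seed documents; the second feeds their IDs to a `more_like_this` query.

```python
def build_seed_query(text, field="body", size=5):
    """First request: a plain match query that finds a few seed documents."""
    return {"size": size, "_source": False, "query": {"match": {field: text}}}


def build_expanded_query(seed_hits, index, field="body", size=10):
    """Second request: ask for documents similar to the seed documents."""
    like = [{"_index": index, "_id": hit["_id"]} for hit in seed_hits]
    return {
        "size": size,
        "query": {
            "more_like_this": {
                "fields": [field],
                "like": like,
                # Illustrative tuning values, not recommendations.
                "min_term_freq": 1,
                "max_query_terms": 25,
            }
        },
    }


# Seed hits would normally come from response["hits"]["hits"] of the
# first search; here they are hard-coded for illustration.
seed_hits = [{"_id": "1"}, {"_id": "7"}]
expanded = build_expanded_query(seed_hits, index="articles")
```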
It's possible. This GitHub repo shows how to integrate Elasticsearch with BERT (Bidirectional Encoder Representations from Transformers), the current state of the art in NLP for semantic representation of language: https://github.com/Hironsan/bertsearch
Good luck.
My suggestion is to use BERT embeddings for your sentences and add an embedding field to your Elasticsearch index, as described in https://www.elastic.co/blog/text-similarity-search-with-vectors-in-elasticsearch
For BERT embeddings, I suggest using the sentence-transformers library from Hugging Face. You can find sample code in https://towardsdatascience.com/how-to-build-a-semantic-search-engine-with-transformers-and-faiss-dcbea307a0e8
There are several options for that:
You can do it in Elasticsearch itself. Elasticsearch supports indexing dense embeddings of documents. From there, you can write your own search pipeline and use your preferred relevance scoring formula, e.g. cosine similarity or something else.
Use a Haystack pipeline; refer to my blog, which describes setting up a semantic search pipeline end to end.
You can use Meta's Faiss
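For the first option, a rough sketch of the mapping and query bodies might look like this. The field names, the toy dimension of 4, and the helper `build_vector_query` are hypothetical, and the query vector would in practice come from an embedding model. The script uses the `cosineSimilarity(params.query_vector, 'field')` form; early 7.x releases expected `doc['field']` instead, so check the syntax against your Elasticsearch version.

```python
EMBEDDING_DIMS = 4  # toy size for illustration; real models use hundreds

# Index mapping with a dense_vector field alongside the raw text.
mapping = {
    "mappings": {
        "properties": {
            "text": {"type": "text"},
            "embedding": {"type": "dense_vector", "dims": EMBEDDING_DIMS},
        }
    }
}


def build_vector_query(query_vector, field="embedding", size=10):
    """Score every document by cosine similarity to the query vector."""
    if len(query_vector) != EMBEDDING_DIMS:
        raise ValueError("query vector has the wrong number of dimensions")
    return {
        "size": size,
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # Adding 1.0 keeps scores non-negative, which
                    # Elasticsearch requires of script scores.
                    "source": f"cosineSimilarity(params.query_vector, '{field}') + 1.0",
                    "params": {"query_vector": list(query_vector)},
                },
            }
        },
    }


query_body = build_vector_query([0.1, 0.2, 0.3, 0.4])
```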
Any recommendations for a small, lightweight, bag-of-words search engine?
I have a set of 'documents' that are each basically a small bag of arbitrary words.
Given a new document, I need to get a list of 'similar' documents along with some weight for how similar they might be. Documents are likely to be small: a couple of paragraphs at most.
Stemming would be great but not highly required.
Word expansion with word nets not required.
Open source or freeware preferred, as this is a prototype, not a full-blown project.
unix/linux platform preferred.
I'd be using it as a subcomponent and expect only to feed it documents with an ID and would later do searches for 'similar' documents to one I currently have.
Whoosh is a pure Python (no C, no external database) indexer / search engine. Check out the documentation for more information. It does support stemming.
I tried it out on an XML dump of a mediawiki instance and it seemed to work pretty well!
Solr or Sphinx. They aren't exactly lightweight, but I wouldn't recommend anything smaller: if the project turns out to be successful and needs to grow, switching the search engine might be painful.
I think that Lucene is an option. It should allow you to build a custom bag of words search engine.
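To make the "custom bag of words search engine" idea concrete, here is a minimal in-memory sketch in plain Python rather than Lucene. It scores documents by cosine similarity over raw term counts; the class name `BagOfWordsIndex` is hypothetical, and there is no stemming or term weighting such as TF-IDF.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class BagOfWordsIndex:
    """Tiny in-memory index: feed it documents by ID, then ask for similar ones."""

    def __init__(self):
        self._docs = {}

    def add(self, doc_id, text):
        self._docs[doc_id] = bag_of_words(text)

    def similar(self, text, top_n=5):
        query = bag_of_words(text)
        scored = [
            (doc_id, cosine_similarity(query, bag))
            for doc_id, bag in self._docs.items()
        ]
        # Drop documents that share no terms, then rank by weight.
        scored = [pair for pair in scored if pair[1] > 0]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


index = BagOfWordsIndex()
index.add("a", "red shoes and blue shoes")
index.add("b", "fresh green apples")
results = index.similar("blue shoes")
```

The returned weight lies between 0 and 1, which matches the "list of similar documents along with some weight" requirement for small documents.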
I wonder about MongoDB http://www.mongodb.org/display/DOCS/Home
It seems like 'full-text-search' may be what I'm after...
and having additional fields to search with may be handy.
I'm wondering if there are any good .NET recommendation algorithms available in open source projects, whether attached to a search engine or not. By recommendation I mean something that accepts a full-text article and recommends other articles from its index based on keyword similarity.
At the high end there are document classification engines like Autonomy; at the low-end spam filters and blog "related posts" widgets. Possibly advertisement-to-article matching, too. I'd like to incorporate one into a project but can't afford the high end and the low end seems to all be LAMP-based.
[Sorry, one answer asked for clarification: what I'm looking for is ideally a standalone library, but I'm willing to adapt good source code as necessary. The end result is that I need to be able to create a C# service that accepts an arbitrary amount of text and returns a list of similar previously indexed articles. Basically, the exact thing that StackOverflow itself does as you are submitting a question!]
Thanks!
Steve
I think that on StackOverflow they remove all common English words from the text and then compare the remaining words with the remaining words of other posts to get the "Related" posts.
The question is not very clear (algorithm or library?), but the only thing that comes to mind is Lucene.NET, the port of the popular Lucene library to the .NET Framework. HTH.
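The idea guessed at above (drop common words, then compare what remains) can be sketched as follows. This only illustrates that guess and makes no claim about how StackOverflow actually works. The stop-word list is a tiny hypothetical sample, and the overlap measure (Jaccard similarity) is one choice among several.

```python
import re

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {
    "a", "an", "the", "and", "or", "of", "to", "in", "is", "it",
    "for", "on", "with", "how", "do", "i", "what", "this", "that",
}


def keywords(text):
    """Return the set of words left after removing common English words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {word for word in words if word not in STOP_WORDS}


def related_posts(new_post, existing_posts, top_n=3):
    """Rank existing posts by keyword overlap (Jaccard) with the new post."""
    new_words = keywords(new_post)
    scored = []
    for post_id, text in existing_posts.items():
        words = keywords(text)
        union = new_words | words
        score = len(new_words & words) / len(union) if union else 0.0
        if score > 0:
            scored.append((post_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


posts = {
    1: "How do I stem words in Lucene",
    2: "What is the best pizza topping",
}
related = related_posts("Stemming words with Lucene", posts)
```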