I'm using Lucene 4.10.4
I would like a search for cheese burger, cheese, or burger to return a document that contains the word cheeseburger.
I tried DictionaryCompoundWordTokenFilter, with the words to match against passed in externally as the dictionary.
Is there any way to match against the indexed words instead, and to do the above more efficiently than with DictionaryCompoundWordTokenFilter?
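For reference, this is roughly how that filter gets wired into an analyzer on Lucene 4.10 - a minimal sketch, with an illustrative two-word dictionary and a plain whitespace tokenizer standing in for whatever analysis chain you actually use:

import java.io.Reader;
import java.util.Arrays;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.util.Version;

public class CompoundAnalyzerSketch {
    static Analyzer compoundAnalyzer() {
        return new Analyzer() {
            @Override
            protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
                // externally supplied dictionary of known sub-words (illustrative values)
                CharArraySet dict = new CharArraySet(Version.LUCENE_4_10_4,
                        Arrays.asList("cheese", "burger"), true);
                Tokenizer source = new WhitespaceTokenizer(Version.LUCENE_4_10_4, reader);
                // keeps "cheeseburger" and additionally emits "cheese" and "burger" at the same position
                TokenStream result = new DictionaryCompoundWordTokenFilter(Version.LUCENE_4_10_4, source, dict);
                return new TokenStreamComponents(source, result);
            }
        };
    }
}

Using this analyzer at index time means cheese and burger end up in the index as terms of their own, so the queries cheese, burger, and cheese burger all hit the document; the cost is that the dictionary has to be maintained externally, which is exactly the part you are asking to avoid.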
I'm using Azure Cognitive Search with QueryType = SearchQueryType.Full. It works fine, but it doesn't find words of 3 characters or fewer, e.g. "the", "AC", etc.
I have some specific words which contain only two characters.
Is it possible to somehow enable searching for all words, even those of 3 characters or fewer?
Update: I believe the problem is not with searching but with highlighting the results.
Having QueryType = SearchQueryType.Full is not a problem.
If you are using standard.lucene, the stopwords list is empty by default.
https://learn.microsoft.com/en-us/azure/search/index-add-custom-analyzers#predefined-analyzers-reference
If you are using an English language analyzer, common filler (stop) words will not be indexed. https://learn.microsoft.com/en-us/azure/search/index-add-language-analyzers#english-analyzers
If you are searching for words by "starts with", you need to use a wildcard at the end of each word, e.g. the* to find theatre.
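Purely as an illustration (the index name is omitted, the API version is assumed, and the query is shown unencoded for readability), such a wildcard search using the full Lucene syntax could be issued as:

docs?api-version=2019-05-06&queryType=full&search=the* OR ac*

Keep in mind that wildcard/prefix terms are generally not run through the analyzer, so lowercase them yourself if the field was indexed in lowercase.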
I'm currently doing a prefix search with Azure Cognitive Search like so:
docs?api-version=2019-05-06&search=Do*
Suppose that my index contains Dog, Big Dog, and Small Dog. The result set seems to be sorted alphabetically by default and looks like:
Big Dog
Dog
Small Dog
How can I change my query string so that the closest exact match appears first and the rest is sorted alphabetically? Here's the output I want:
Dog
Big Dog
Small Dog
So, if the user types D, Do, or Dog, I want to show Dog first to help them short-circuit typing.
The results are ordered by a score, which is the result of the TF×IDF formula. In other words, the results are displayed according to how relevant each document is to the query terms.
That said, I believe you need to use NGram in order to get the most relevant term first.
More info:
https://azure.microsoft.com/en-us/blog/custom-analyzers-in-azure-search/
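As a rough sketch of what that can look like in the index definition (the names, gram sizes and tokenizer choice here are assumptions, not a definitive setup - see the linked post), an edge n-gram analyzer applied at index time turns Dog into the terms d, do and dog, so a prefix like do is matched as a full term and scores accordingly:

"tokenFilters": [
  {
    "name": "front_edge_ngram",
    "@odata.type": "#Microsoft.Azure.Search.EdgeNGramTokenFilterV2",
    "minGram": 1,
    "maxGram": 20,
    "side": "front"
  }
],
"analyzers": [
  {
    "name": "prefix_analyzer",
    "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
    "tokenizer": "standard_v2",
    "tokenFilters": [ "lowercase", "front_edge_ngram" ]
  }
]

The field would then typically set indexAnalyzer to this analyzer and keep a plain searchAnalyzer, so that the query itself is not n-grammed.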
Can you share what your exact document looks like? As Thiago mentioned, Azure Cognitive Search returns a relevance score which shows the relative relevance of the entire document to the input query.
If your documents have only 1 matching field with the exact text you shared, it should return "Dog" with the highest score as it's more relevant to the query.
Situation
I need to create a live search with MongoDB, but I don't know which index is better to use: a normal index or a text index. Yesterday I found the main differences between them. I have the following document:
{
  title: 'What vitamins are found in blueberries'
  // other fields
}
So, when the user enters blue, the system must find this document (... blueberries).
Problem
I found these differences in the article about them:
A text index, on the other hand, will tokenize and stem the content of the field. So it will break the string into individual words or tokens, and will further reduce them to their stems so that variants of the same word will match ("talk" matching "talks", "talked" and "talking", for example, as "talk" is a stem of all three).
So, why is a text index, and its subsequent searches, faster than a regex on a non-indexed text field? It's because text indexes work as a dictionary, a clever one that's capable of discarding words on a per-language basis (defaults to English). When you run a text search query, you run it against the dictionary, saving yourself the time that would otherwise be spent iterating over the whole collection.
That's what I need, but:
The $text operator can search for words and phrases. The query matches on the complete stemmed words. For example, if a document field contains the word blueberry, a search on the term blue will not match the document. However, a search on either blueberry or blueberries will match.
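To make that behaviour concrete, here is a small sketch with the MongoDB Java driver (the database, collection and connection string are made up; the field name comes from the document above):

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

public class TextIndexDemo {
    public static void main(String[] args) {
        MongoCollection<Document> coll = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("test").getCollection("articles");
        coll.createIndex(Indexes.text("title"));
        coll.insertOne(new Document("title", "What vitamins are found in blueberries"));

        // $text matches complete stemmed words only:
        System.out.println(coll.find(Filters.text("blueberry")).first());  // found (same stem as "blueberries")
        System.out.println(coll.find(Filters.text("blue")).first());       // null ("blue" is not that stem)

        // a substring search needs a regex, which cannot use the text index:
        System.out.println(coll.find(Filters.regex("title", "blue")).first());  // found, but via a scan
    }
}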
Question
I need a fast, clever dictionary, but I also need searching by substring. How can I combine these two approaches?
I would like to know how I can search for a row by putting multiple words in the search.
i.e. the text is
The quick brown fox jumps over the lazy dog
I want to search
quick dog
so that I get this row in the result.
If I search
quick elephant
I should still get this row in the result.
The quick brown fox jumps over the lazy dog
The lazy brown fox jumps over the lazy dog
If I search brown, I should get both rows in the result.
If I search quick brown, I should get only the first line.
Is this achievable with Solr?
You can tune the way Solr matches multiple terms by using the mm (minimum should match) parameter of the edismax query parser (it also works with the dismax query parser). For your last example (where the second line should be excluded), the mm parameter lets you adjust exactly how many terms need to be matched for a document to be considered a valid hit for the search.
The second row will be scored lower than the first row in that example, but you won't be able to exclude it.
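As an illustration (the field name text and the request handler defaults are assumptions about your schema), the two behaviours correspond to something like:

q=quick elephant&defType=edismax&qf=text&mm=1
q=quick brown&defType=edismax&qf=text&mm=100%

With mm=1, a document only has to match one of the terms, so quick elephant still returns the row; with mm=100% every term must match, which would return only the first line for quick brown but would also break the quick elephant case - that is the trade-off described above.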
You can use the AND operator (&&) between both words you want to search for in a single document.
For example: "quick" AND "brown" will give you only the first document.
The AND operator matches documents where both terms exist anywhere in the text of a single document.
You can also use the + sign, which makes a word mandatory in the document:
+brown +quick
The "+" or required operator requires that the term after the "+" symbol exist somewhere in a the field of a single document.
Ref : https://lucene.apache.org/core/2_9_4/queryparsersyntax.html
I want to know the best way to rank sentences based on similarity from a set of documents.
For example, let's say:
1. There are 5 documents.
2. Each document contains many sentences.
3. Let's take Document 1 as the primary one, i.e. the output will contain sentences from this document.
4. The output should be a list of sentences ranked in such a way that the FIRST-ranked sentence is the most similar sentence across all 5 documents, then the 2nd, then the 3rd...
Thanks in advance.
I'll cover the basics of textual document matching...
Most document similarity measures work on a word basis, rather than sentence structure. The first step is usually stemming. Words are reduced to their root form, so that different forms of similar words, e.g. "swimming" and "swims" match.
Additionally, you may wish to filter the words you match to avoid noise. In particular, you may wish to ignore occurrences of "the" and "a". In fact, there are a lot of conjunctions and pronouns that you may wish to omit, so usually you will have a long list of such words - this is called a "stop list".
Furthermore, there may be bad words you wish to avoid matching, such as swear words or racial slur words. So you may have another exclusion list with such words in it, a "bad list".
So now you can count similar words in documents. The question becomes how to measure total document similarity. You need to create a score function that takes as input the similar words and gives a value of "similarity". Such a function should give a high value if the same word appears multiple times in both documents. Additionally, such matches are weighted by the total word frequency so that when uncommon words match, they are given more statistical weight.
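A toy sketch of such a score function, just to make the idea concrete (naive tokenization, no real stemming, and the IDF weights are assumed to be computed elsewhere over your five documents):

import java.util.*;

public class DocSimilarity {

    // minimal stop list; a real one would be much longer
    private static final Set<String> STOP_LIST =
            new HashSet<>(Arrays.asList("the", "a", "an", "and", "of", "in", "to"));

    // crude normalization: lowercase, split on non-letters, drop stop words
    static Map<String, Integer> termFrequencies(String text) {
        Map<String, Integer> tf = new HashMap<>();
        for (String token : text.toLowerCase().split("[^a-z]+")) {
            if (!token.isEmpty() && !STOP_LIST.contains(token)) {
                tf.merge(token, 1, Integer::sum);
            }
        }
        return tf;
    }

    // cosine similarity of term-frequency vectors, each term weighted by its inverse document
    // frequency, so matches on uncommon words count for more than matches on common ones
    static double similarity(Map<String, Integer> a, Map<String, Integer> b, Map<String, Double> idf) {
        Set<String> vocab = new HashSet<>(a.keySet());
        vocab.addAll(b.keySet());
        double dot = 0, normA = 0, normB = 0;
        for (String term : vocab) {
            double weight = idf.getOrDefault(term, 1.0);
            double wa = a.getOrDefault(term, 0) * weight;
            double wb = b.getOrDefault(term, 0) * weight;
            dot += wa * wb;
            normA += wa * wa;
            normB += wb * wb;
        }
        return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}

For your ranking problem, you would compute this score between each sentence of Document 1 and the other documents, then sort the sentences by it.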
Apache Lucene is an open-source search engine written in Java that provides practical detail about these steps. For example, here is the information about how they weight query similarity:
http://lucene.apache.org/java/2_9_0/api/all/org/apache/lucene/search/Similarity.html
Lucene combines Boolean model (BM) of Information Retrieval with Vector Space Model (VSM) of Information Retrieval - documents "approved" by BM are scored by VSM.
All of this is really just about matching words in documents. You did specify matching sentences. For most people's purposes, matching words is more useful as you can have a huge variety of sentence structures that really mean the same thing. The most useful information of similarity is just in the words. I've talked about document matching, but for your purposes, a sentence is just a very small document.
Now, as an aside, if you don't care about the actual nouns and verbs in the sentence and only care about grammar composition, you need a different approach...
First you need a link grammar parser to interpret the language and build a data structure (usually a tree) that represents the sentence. Then you have to perform inexact graph matching. This is a hard problem, but there are algorithms to do this on trees in polynomial time.
As a starting point, you can compute the Soundex code for each word and then compare documents based on Soundex code frequencies.
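If you want to try that, Apache Commons Codec ships a Soundex implementation; a tiny sketch (the library choice and the example words are mine, not the answer's):

import org.apache.commons.codec.language.Soundex;

public class SoundexDemo {
    public static void main(String[] args) {
        Soundex soundex = new Soundex();
        // near-homophones collapse to the same code, so they would be counted as the same token
        System.out.println(soundex.encode("Robert"));  // R163
        System.out.println(soundex.encode("Rupert"));  // R163
    }
}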
Tim's overview is very nice. I'd just like to add that for your specific use case, you might want to treat the sentences from Doc 1 as documents themselves, and compare their similarity to each of the four remaining documents. This might give you a quick aggregate similarity measure per sentence without forcing you to go down the route of syntax parsing etc.