Situation
I need to create a live search with MongoDB, but I don't know which index is better to use: a normal index or a text index. Yesterday I found the main differences between them. I have the following document:
{
title: 'What vitamins are found in blueberries'
//other fields
}
So, when a user enters blue, the system must find this document (... blueberries).
Problem
I found these differences in the article about them:
A text index on the other hand will tokenize and stem the content of the field. So it will break the string into individual words or tokens, and will further reduce them to their stems so that variants of the same word will match ("talk" matching "talks", "talked" and "talking" for example, as "talk" is a stem of all three).
So, why is a text index, and its subsequent searches, faster than a regex on a non-indexed text field? It's because text indexes work as a dictionary, a clever one that's capable of discarding words on a per-language basis (defaults to English). When you run a text search query, you run it against the dictionary, saving yourself the time that would otherwise be spent iterating over the whole collection.
That's what I need, but:
The $text operator can search for words and phrases. The query matches on the complete stemmed words. For example, if a document field contains the word blueberry, a search on the term blue will not match the document. However, a search on either blueberry or blueberries will match.
Question
I need a fast, clever dictionary, but I also need searching by substring. How can I combine these two methods?
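For illustration, a minimal pymongo sketch of the two behaviors described above (the database and collection names are assumptions): the text index matches whole stemmed words only, while a regex can match mid-word but, unless it is a case-sensitive prefix pattern, cannot use an ordinary index.

from pymongo import MongoClient

coll = MongoClient().test.articles  # assumed database/collection names

# Text index: matches complete stemmed words only.
coll.create_index([('title', 'text')])
list(coll.find({'$text': {'$search': 'blueberries'}}))  # matches the document
list(coll.find({'$text': {'$search': 'blue'}}))         # does NOT match

# Regex: matches the substring, but an unanchored pattern scans every
# document; only a case-sensitive prefix regex (e.g. ^What) can use a
# normal index on title.
coll.create_index('title')
list(coll.find({'title': {'$regex': 'blue'}}))          # matches '...blueberries'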
Related
I have a master corpus containing thousands of popular novels. I then have a .csv file that's a list containing about 450 different phrases (rhetorical_devices.csv). I am trying to use regex to do two things with these data:
Return a boolean telling me whether or not any phrase from the .csv list is present in the master_corpus.
Search for and then count the number of exact phrase matches between the .csv list and the master_corpus. I don't need to know which phrases matched, just the number of matches.
The .csv list is almost all multi-word phrases, things like:
huffed loudly
felt light-headed
couldn't they?
stop!
Some of the phrases contain pieces of punctuation that are relevant to my search, so, for example, I need to be able to ID "couldn't they?" with the words in that exact order, question mark included. I keep getting all sorts of hits on sentences that contain "couldn't" and "they" and "?" in any random order. For this example, "They couldn't just stop?" returns 2 hits for the count. It seems like my code is looking for all of the words individually rather than for the words in the correct order with the stipulated punctuation.
Right now, this is my attempt at a boolean, where master_corpus is all of the novels:
phrase_list = self.corpora['rhetorical_devices.csv'][0].to_list()
phrase_list = [i.lower() for i in phrase_list]
regex = '|'.join(phrase_list)
return bool(re.search(regex, master_corpus.lower()))
I think the ! and ? from the list are ending up as regex operators, but also I'm not sure how to import the list and make sure I'm looking for those exact matches.
Any help would be greatly appreciated.
Instead of using a regex, you should loop over the phrases like Mike L suggested:
total_matches = 0
corpus = master_corpus.lower()
for phrase in phrase_list:
    # str.count matches the literal phrase, so '?' and '!' need no escaping
    total_matches += corpus.count(phrase)
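The asker's diagnosis is right: the unescaped ! and ? act as regex operators. If a regex is still wanted (e.g. to later add word-boundary anchors), re.escape fixes it. A minimal sketch reusing the names from the question:

import re

# re.escape makes '?' and '!' literal instead of regex operators
pattern = re.compile('|'.join(re.escape(p) for p in phrase_list))
total_matches = len(pattern.findall(master_corpus.lower()))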
Facing an issue with a 'wildcarded' search in an 'unfiltered' cts:search query.
Problem explanation:
I have inserted the documents below into the database:
xdmp:document-insert('/a/a1.xml', <root><aa>123</aa></root>);
xdmp:document-insert('/a/a2.xml', <root><aa>12</aa></root>);
xdmp:document-insert('/a/a3.xml', <root><aa>1</aa></root>);
In the query below I am looking for documents having only one digit in the 'aa' element, but it returns all the documents I inserted above:
cts:search(
doc(),
cts:element-word-query(xs:QName('aa'), '?', ('wildcarded')),
'unfiltered'
)
If I perform a 'filtered' search I get the right result, which is doc '/a/a3.xml'.
The same issue occurs when the search term is '??' (expecting docs that contain a two-digit number in the 'aa' element) and '???' (expecting docs that contain a three-digit number in the 'aa' element).
The following indexes are set to true:
three character searches
three character word positions
fast element character searches
trailing wildcard searches
trailing wildcard word positions
fast element trailing wildcard searches
I am curious to know why this is happening and how I can correct it.
An unfiltered search can only return accurate results if there is an index that can satisfy the query. You can see how your query is formulated for index resolution using xdmp:plan:
xdmp:plan(
  cts:search(doc(), cts:element-word-query(xs:QName("aa"), "?", "wildcarded"))
)
In your case, you have no index that can do this, and the plan will show that you are just asking for all documents with that element in them. The three-character and trailing-wildcard indexes only work if there are three or more non-wildcard characters, and the fast element character index just means to apply whatever character indexes you have with the element context.
We recommend that for wildcards you add a codepoint-collation word lexicon. You can add it to the database as a whole, or, if you know you only need these kinds of wildcards for this particular element, you can add an element word lexicon. Lexicon expansion can then be used to resolve the wildcard.
This happens in a heuristic way automatically (which is to say, depending on the size of your database and the number of lexicon matches, we may formulate the query in more or less accurate ways), but there are also various options to force the handling to behave a certain way. See the API documentation for cts:element-word-query.
I have 4842 documents with the sample format
{"ID":"12345","NAME":"name_value","KIND":"kind_value",...,"Secondary":{...},"Tertiary":{...}} where “...” are a few more varying number of key value pairs per object
I have indexed KIND as a fulltext index using db.collection.ensureFulltextIndex("KIND") before inserting data. Also, KIND is just a one-word string, i.e. without spaces.
Via AQL following queries were executed:
FOR doc IN FULLTEXT(collection, 'KIND', 'DeploymentFile') RETURN doc --> takes 3.54s (avg)
FOR doc IN collection FILTER doc.KIND == 'DeploymentFile' RETURN doc --> takes 1.16s (avg)
2944 Objects returned in both queries
Q1. Assuming that we have used a fulltext index and I haven't hash-indexed KIND, shouldn't the query using the FULLTEXT function be faster than the normal == operation (since == doesn't utilize the fulltext index)? If so, what am I doing wrong here?
Q2. Utilizing the fulltext index, can I perform a query which does a CONTAINS string or LIKE string?
---UPDATE Q2. The requirement is searching for a substring within a parent string (which is only one word). The substring can lie anywhere within the parent string (the SQL equivalent of LIKE '%text%').
Q1: The fulltext index does allow for more complex queries. It splits the text at word breaks and checks whether a word occurs within a larger text. None of these features is needed in your example, so the fulltext index generates more overhead than it saves.
In your example it would be better to create a skiplist or hash index and search for equality.
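A minimal sketch of that suggestion, assuming the python-arango driver and a local server (the connection details, credentials, and the literal collection name are placeholders):

from arango import ArangoClient

db = ArangoClient().db('_system', username='root', password='')
coll = db.collection('collection')

# An equality lookup like doc.KIND == '...' is served by a hash index;
# the fulltext index adds tokenizing overhead that equality never uses.
coll.add_hash_index(fields=['KIND'])
cursor = db.aql.execute(
    'FOR doc IN collection FILTER doc.KIND == @kind RETURN doc',
    bind_vars={'kind': 'DeploymentFile'},
)
print(len(list(cursor)))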
Q2: In the simplest form, a fulltext query contains just the sought word. If multiple search words are given in a query, they should be separated by commas. All search words will be combined with a logical AND by default, and only such documents will be returned that contain all search words. This default behavior can be changed by providing the extra control characters in the fulltext query, which are:
+: logical AND (intersection)
|: logical OR (union)
-: negation (exclusion)
Examples:
"banana": searches for documents containing "banana"
"banana,apple": searches for documents containing both "banana" AND "apple"
"banana,|orange": searches for documents containing either "banana" OR "orange" OR both
"banana,-apple": searches for documents that contains "banana" but NOT "apple".
Logical operators are evaluated from left to right.
Each search word can optionally be prefixed with complete: or prefix:, with complete: being the default. This allows searching for complete words or for word prefixes. Suffix searches or any other form of partial-word matching are currently not supported.
Examples:
"complete:banana": searches for documents containing the exact word "banana"
"prefix:head": searches for documents with words that start with prefix "head"
"prefix:head,banana": searches for documents contain words starting with prefix - "head" and that also contain the exact word "banana".
Complete match and prefix search options can be combined with the logical operators.
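For Q2, a hedged sketch of the prefix form (reusing the db handle from the sketch above); note that this covers prefixes only, since the fulltext index has no LIKE '%text%' equivalent:

# 'prefix:' matches words starting with the given string; there is no
# suffix or infix (LIKE '%text%') form in the fulltext index.
cursor = db.aql.execute(
    "FOR doc IN FULLTEXT(collection, 'KIND', 'prefix:Deploy') RETURN doc"
)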
I am trying to perform exact and stemmed searches on text and get back "compiled" results.
Currently what I have:
There is text being stored in the stem field, and it's copied into the quoted field. Stem queries and exact queries work on their respective fields. When I search for (the problem is only with and)
"word1 word2" and/or word3
which gives me the query
stemmed:word3 &/or quote:"word1 word2"
what I get is results from the two fields respectively. With or this is fine, but with and I get back two or more results for the same text, each with different highlighting.
The question is: what's the best way to do a stemmed/exact search on the same text (I'm guessing multiple fields)? And if I have the right approach, what's the best way to merge the results, and can Solr do it?
Thanks!!
Edit: I checked out edismax but fail to see how to use it properly. My results are in the comments of the answer suggesting it...
Please check out the Edismax Query Parser, which will allow you to define the fields and have the text searched on all of them with variable boosts.
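For illustration, a minimal edismax request over both fields through Solr's HTTP API (the core name, field names, and boosts are assumptions, not the asker's actual schema):

import requests

params = {
    'q': '"word1 word2" word3',
    'defType': 'edismax',       # enable the Extended DisMax parser
    'qf': 'stemmed quoted^2',   # search both fields; boost the exact-match field
    'hl': 'true',               # ask for one merged highlight set per document
    'hl.fl': 'stemmed,quoted',
    'wt': 'json',
}
resp = requests.get('http://localhost:8983/solr/mycore/select', params=params)
print(resp.json()['response']['numFound'])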
I want to know the best way to rank sentences based on similarity from a set of documents.
For example, let's say:
1. There are 5 documents.
2. Each document contains many sentences.
3. Let's take Document 1 as primary, i.e. the output will contain sentences from this document.
4. The output should be a list of sentences ranked in such a way that the FIRST-ranked sentence is the most similar sentence across all 5 documents, then the 2nd, then the 3rd...
Thanks in advance.
I'll cover the basics of textual document matching...
Most document similarity measures work on a word basis, rather than sentence structure. The first step is usually stemming. Words are reduced to their root form, so that different forms of similar words, e.g. "swimming" and "swims" match.
Additionally, you may wish to filter the words you match to avoid noise. In particular, you may wish to ignore occurrences of "the" and "a". In fact, there are a lot of conjunctions and pronouns that you may wish to omit, so usually you will have a long list of such words; this is called a "stop list".
Furthermore, there may be bad words you wish to avoid matching, such as swear words or racial slurs. So you may have another exclusion list with such words in it, a "bad list".
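For instance, a small sketch of the stemming and stop-list steps using NLTK (the library choice is an assumption; any stemmer and stop list would do):

from nltk.corpus import stopwords   # requires: nltk.download('stopwords')
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop = set(stopwords.words('english'))

def normalize(text):
    # lowercase, drop stop-list words, and reduce each word to its stem
    return [stemmer.stem(w) for w in text.lower().split() if w not in stop]

print(normalize('he swims while she is swimming'))  # ['swim', 'swim']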
So now you can count similar words in documents. The question becomes how to measure total document similarity. You need to create a score function that takes the matching words as input and returns a "similarity" value. Such a function should give a high value if the same word appears multiple times in both documents. Additionally, such matches are weighted by overall word frequency, so that matches on uncommon words are given more statistical weight.
Apache Lucene is an open-source search engine written in Java that provides practical detail about these steps. For example, here is the information about how it weights query similarity:
http://lucene.apache.org/java/2_9_0/api/all/org/apache/lucene/search/Similarity.html
Lucene combines Boolean model (BM) of Information Retrieval with Vector Space Model (VSM) of Information Retrieval - documents "approved" by BM are scored by VSM.
All of this is really just about matching words in documents. You did specify matching sentences, but for most people's purposes matching words is more useful, as you can have a huge variety of sentence structures that really mean the same thing. The most useful similarity information is just in the words. I've talked about document matching, but for your purposes a sentence is just a very small document.
Now, as an aside, if you don't care about the actual nouns and verbs in the sentence and only care about grammar composition, you need a different approach...
First you need a link grammar parser to interpret the language and build a data structure (usually a tree) that represents the sentence. Then you have to perform inexact graph matching. This is a hard problem, but there are algorithms to do this on trees in polynomial time.
As a starting point you can compute the Soundex code for each word and then compare documents based on Soundex frequencies.
Tim's overview is very nice. I'd just like to add that for your specific use case, you might want to treat the sentences from Doc 1 as documents themselves and compare their similarity to each of the four remaining documents. This might give you a quick aggregate similarity measure per sentence without forcing you to go down the route of syntax parsing etc., as sketched below.
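A minimal sketch of that per-sentence approach with scikit-learn's TF-IDF vectorizer (the library choice and the toy data are assumptions):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

doc1_sentences = ['The cat sat on the mat.', 'Stocks fell sharply today.']
other_docs = ['Markets dropped as stocks fell.', 'A cat and a dog played.']

# Fit one vocabulary over sentences and documents so the vectors align.
vec = TfidfVectorizer(stop_words='english')
matrix = vec.fit_transform(doc1_sentences + other_docs)
sent_vecs = matrix[:len(doc1_sentences)]
doc_vecs = matrix[len(doc1_sentences):]

# Rank each Doc 1 sentence by its mean cosine similarity to the other docs.
scores = cosine_similarity(sent_vecs, doc_vecs).mean(axis=1)
for score, sentence in sorted(zip(scores, doc1_sentences), reverse=True):
    print(round(float(score), 3), sentence)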