ArangoDB Full Text Index Performance - arangodb

I have 4842 documents with a sample format
{"ID":"12345","NAME":"name_value","KIND":"kind_value",...,"Secondary":{...},"Tertiary":{...}} where “...” are a few more varying number of key value pairs per object
I have indexed KIND as a full text index using - db.collection.ensureFulltextIndex("KIND") before inserting data.Also, KIND is just a one word string. ie. without spaces
Via AQL following queries were executed:
FOR doc IN FULLTEXT(collection, 'KIND', 'DeploymentFile') RETURN doc --> takes 3.54s (avg)
FOR doc IN collection FILTER doc.KIND == 'DeploymentFile' RETURN doc --> takes 1.16s (avg)
2944 Objects returned in both queries
Q1. Assuming that we have used a fulltext index and I haven't hash indexed KIND, shouldn't the query using FULLTEXT function be faster than the normal == operation (since == doesn't utilize the full text index). If so, what am I doing wrong here?
Q2. Utilizing the fulltext index, can i perform a query which does a CONTAINS string or LIKE string?
---UPDATE Q2.The requirement is searching for a substring within a parent string (which is only one word). The substring can lie anywhere within the parent string. (SQL equivalent of LIKE '%text%')

Q1: The fulltext index does allow for more complex query. It splits the text at word breaks and checks if a word occurs within a larger text. All of these features are not needed in your example. Therefore it generates more overhead than it is saving.
In your example it would be better to create a skip-list or hash-index and search for equality.
Q2: In the simplest form, a fulltext query contains just the sought word. If multiple search words are given in a query, they should be separated by commas. All search words will be combined with a logical AND by default, and only such documents will be returned that contain all search words. This default behavior can be changed by providing the extra control characters in the fulltext query, which are:
+: logical AND (intersection)
|: logical OR (union)
-: negation (exclusion)
Examples:
"banana": searches for documents containing "banana"
"banana,apple": searches for documents containing both "banana" AND "apple"
"banana,|orange": searches for documents containing either "banana" OR "orange" OR both
"banana,-apple": searches for documents that contains "banana" but NOT "apple".
Logical operators are evaluated from left to right.
Each search word can optionally be prefixed with complete: or prefix:, with complete: being the default. This allows searching for complete words or for word prefixes. Suffix searches or any other forms are partial-word matching are currently not supported.
Examples:
"complete:banana": searches for documents containing the exact word "banana"
"prefix:head": searches for documents with words that start with prefix "head"
"prefix:head,banana": searches for documents contain words starting with prefix - "head" and that also contain the exact word "banana".
Complete match and prefix search options can be combined with the logical operators.

Related

Search and Search All in COBOL

I want to know the result when we try to search an item in COBOL using SEARCH or SEARCH ALL and this item appears multiple times in the table.
Will any of the two will find all the occurrences ?
Will any of the two will find all the occurrences ?
No. Either SEARCH will identify one, and only one, table element.
The SEARCH statement is used to search a table for a table element that satisfies the specified condition and to adjust the value of the associated index to indicate that table element.
For SEARCH, the final value, of the index or identifier, will be the first table element that matches the conditions.
For SEARCH ALL, the final setting of the search index is equal to one of them, but it is undefined which one.
Neither will, but with the search you can set the initial starting index and do a second search to find subsequent entries.
Search
The Search verb does a linear search through the table. Table entries can be in any sequence.
If there are multiple entries the first after the starting index will be found.
You can use the Set index to verb to set the starting position.
Search All
The Search All does a binary search of the Table. The table must be in Key Sequence.
If there are multiple match's, any one could be found. For large tables, the Search All will be faster option.

cts search returning wrong results for wildcarded search

Facing an issue with ‘wildcarded’ search for ‘unfiltered’ cts search query.
Problem explanation:
I have inserted the below docs in DB.
xdmp:document-insert('/a/a1.xml', <root><aa>123</aa></root>);
xdmp:document-insert('/a/a2.xml', <root><aa>12</aa></root>);
xdmp:document-insert('/a/a3.xml', <root><aa>1</aa></root>);
In the below query I am looking for documents having only one digit in ‘aa’ element.
But the below query returning me all the documents I have inserted above.
cts:search(
doc(),
cts:element-word-query(xs:QName('aa'), '?', ('wildcarded')),
'unfiltered'
)
If I will perform ‘filtered’ search I am getting the right result which is doc ‘/a/a3.xml.
Same issue is when the search term is ‘??’(docs expected which contain two digit number in ‘aa’ element) and
‘???’ (docs expected which contain three digit number in ‘aa’ element)
Below indexes are set to true:
three character searches
three character word positions
fast element character searches
trailing wildcard searches
trailing wildcard word positions
fast element trailing wildcard searches
I am curious to know why this is happening and how can I correct this?
An unfiltered search can only return accurate results if there is an index that can satisfy the query. You can see how your query is being formulated to index resolution using xdmp:plan:
xdmp:plan(
cts:search(doc(),cts:element-word-query(xs:QName("aa"),"?","wildcarded"))
In your case, you have no index that can do this and the plan will show that you are just asking for all documents with that element in them. The three character and trailing wildcard indexes only work if there are three or more non-wildcard characters, and the fast element character index just means to apply whatever character indexes you have with the element context. We recommend that for wildcards you add a codepoint collation word lexicon. You can add it to the database as a whole, or, if you know you only need these kinds of wildcards for this particular element, you can add an element word lexicon. Lexicon expansion can then be used to resolve the wildcard.
This happens in a heuristic way automatically (which is to say, depending on the size of your database and the number of lexicon matches, we may formulate the query in more or less accurate ways), but there are also various options to force the handling to behave a certain way. See the API for cts:element-word-query

MongoDB: Indexing for a live search

Situation
I need to create a live search with MongoDB. But I don't know, which index is better to use normal or text. Yesterday I found main differences between them. I have a following document:
{
title: 'What vitamins are found in blueberries'
//other fields
}
So, when user enter blue, the system must find this document (... blueberries).
Problem
I found these differences in the article about them:
A text index on the other hard will tokenize and stem the content of the field. So it will break the string into individual words or tokens, and will further reduce them to their stems so that variants of the same word will match ("talk" matching "talks", "talked" and "talking" for example, as "talk" is a stem of all three).
So, Why is a text index, and its subsequent searchs faster than a regex on a non-indexed text field? It's because text indexes work as a dictionary, a clever one that's capable of discarding words on a per-language basis (defaults to english). When you run a text search query, you run it against the dictionary, saving yourself the time that would otherwise be spent iterating over the whole collection.
That's what I need, but:
The $text operator can search for words and phrases. The query matches on the complete stemmed words. For example, if a document field contains the word blueberry, a search on the term blue will not match the document. However, a search on either blueberry or blueberries will match.
Question
I need a fast clever dictionary but I also need searching by substring. How can I join these two methods?

lucene index special characters

I am indexing fields with characters like <,{,//," etc in Lucene
Is that fine?
My search query will not contain these special characters but the fields which will be retrieved in response to query might contain one or more of these.
Table: keywords
Fields->keyword,text
When the user will enter a search term, it will be matched against the column keyword which will have special characters.
I think it doesnt matter as long as your search term/search phrase/search query doesn't contain special characters

How to search phrase queries in inverted index structure?

If we want to search a query like this "t1 t2 t3" (t1,t2 ,t3 must be queued) in an inverted index structure ,
which ways should we do ?
1-First we search the "t1" term and find all documents that contains "t1" , then do this work for "t2" and then "t3" . Then find documents that positions of "t1" , "t2" and "t3" are next to each other .
2-First we search the "t1" term and find all documents that contains "t1" , then in all documents that we found , we search the "t2" and next , in the result of this , we find documents that contains "t3" .
I have a full inverted index . I want to know which ways above is optimized , (1) or (2) ?
thanks a lot.
As the wikipedia entry well explains,
There are two main variants of
inverted indexes: A record level
inverted index (or inverted file index
or just inverted file) contains a list
of references to documents for each
word. A word level inverted index (or
full inverted index or inverted list)
additionally contains the positions of
each word within a document. The
latter form offers more functionality
(like phrase searches), but needs more
time and space to be created.
Since you don't tell us which variant you have, we can't really answer your question precisely, but thinking about each possibility will help.
To open and search documents is typically a costly operation, unless your documents are unusually small, so you want to minimize that -- and option (2) doesn't really minimize it. If you have an inverted list, with option (1) you won't even need to open any document; if you only have an inverted file, you'll inevitably need to open documents and scan them (since you otherwise lack information to confirm word adjacency) -- but at least with option (1) you minimize the number of documents you have to open and scan (only those in the intersection of the lists of documents containing each word).
So, in either case, option (1) is more promising (unless your documents are peculiarly small).

Resources