Why is this ArangoDB query so slow?

I am a new ArangoDB user and I am using the following query
FOR i IN meteo
  FILTER i.`POM` == "Maxial"
    && TO_NUMBER(i.`TMP`) < 4.2
    && DATE_TIMESTAMP(i.`DTM`) > DATE_TIMESTAMP("2014-12-10")
    && DATE_TIMESTAMP(i.`DTM`) < DATE_TIMESTAMP("2014-12-15")
  RETURN i.`TMP`
on a 2 million document collection. It has an index on the three fields that are filtered on. It takes approx. 9 secs in the Web Interface.
Is it possible to run it faster?
Thank you
Hugo

I have no access to the underlying data, its distribution, or the exact index definitions, so I can only give rather general advice:
1. Use the explain() command to see whether the query makes use of indexes, and if so, which ones (see the sketch after this list).
2. If explain() shows that no index is used, check whether the attributes contained in the query's FILTER conditions are actually indexed. There is the db.<collection>.getIndexes() command to check which attributes are indexed.
3. If indexes are present but not used by the query, the indexes may have the wrong type. For example, a hash index will only be used for equality comparisons (i.e. ==), but not for other comparison types (<, <=, >, >= etc.). A hash index will only be used if all of the indexed attributes are used in the query's FILTER conditions. A skiplist index will only be used if at least its first attribute is used in a FILTER condition. If further skiplist index attributes are specified in the query (from left to right), they may also be used and allow filtering more documents.
4. Only a single index will be picked when scanning a collection. Having multiple separate indexes on "POM", "TMP", and "DTM" won't help this query, because it will use only one of them per collection it iterates over. Instead, I suggest putting multiple attributes into a single index if the query can benefit from this.
5. The more selective an index is, the better. For example, an index on a single attribute may filter out a lot of documents, but a combined index on multiple attributes may filter out even more. For this particular query, a skiplist index on [ "POM", "DTM" ] may be the right choice (in combination with 6.).
6. The only attribute for which the optimizer may consider an index lookup in the given original query is the "POM" attribute. The reason is that the other attributes are used inside function calls (i.e. TO_NUMBER(), DATE_TIMESTAMP()). In general, indexes will not be used for attributes which are used inside functions (e.g. for TO_NUMBER(i.TMP) < 4.2 no index will be used; the same goes for DATE_TIMESTAMP(i.DTM) > DATE_TIMESTAMP("2014-12-10")). Modifying the conditions so that the indexed attributes are directly compared to some constant or a one-time calculated value can enable more candidate indexes. If possible, try to rewrite the conditions so that only the indexed attributes are present on one side of the comparison. For this particular query, it would be better to use i.DTM > "2014-12-10" instead of DATE_TIMESTAMP(i.DTM) > DATE_TIMESTAMP("2014-12-10").
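Putting points 1., 2., 5. and 6. together, here is a minimal arangosh sketch. The combined index and the rewritten query are suggestions rather than the confirmed schema; they assume DTM is stored as an ISO-8601 string (like "2014-12-10") so that string comparison preserves date order. On older 2.x shells, use ensureSkiplist("POM", "DTM") instead of ensureIndex().

// 1./2. inspect the existing indexes and the execution plan
db.meteo.getIndexes();
db._explain('FOR i IN meteo FILTER i.POM == "Maxial" RETURN i.TMP');

// 5. combined skiplist index on POM (equality) and DTM (range)
db.meteo.ensureIndex({ type: "skiplist", fields: ["POM", "DTM"] });

// 6. rewritten query: indexed attributes are compared directly, and
//    TO_NUMBER() only runs on documents surviving the index lookup
db._query(`
  FOR i IN meteo
    FILTER i.POM == "Maxial"
      && i.DTM > "2014-12-10"
      && i.DTM < "2014-12-15"
      && TO_NUMBER(i.TMP) < 4.2
    RETURN i.TMP
`).toArray();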

Related

MongoDB LIKE search with count is very slow on a 50 million document collection

In my application, I have a collection of 50 million documents. I run a LIKE-style search and then count the results on a particular field (i.e. Patientfirstname). I also created an index on the Patientfirstname field; it improved the performance, but it still takes a lot of time.
db.patients.find({"Patientfirstname":{"$regex":"Testuser"}}).count()   // without index: 40 sec
db.patients.find({"Patientfirstname":{"$regex":"Testuser"}}).count()   // with index on Patientfirstname: 31 sec
I tried a different approach (aggregate), but the response is still very slow:
db.patients.aggregate([
  { $match: { "Patientfirstname": { "$regex": "Testuser" } } },
  { $project: { "Patientfirstname": 1, "_id": 1 } },
  { $group: { _id: "$Patientfirstname", count: { $sum: 1 } } },
  { $sort: { "count": -1 } }
])
This query also takes about the same time to fetch the results: 31 sec.
Another approach was tried, but the results are not correct:
select only that field from the entire collection, then apply the LIKE search and count the result.
db.patients.find({},{Patientfirstname:1,_id:1}).count({"Patientfirstname":{"$regex":"Testuser"}})
Applying a filter inside count() is not working; the count of the entire collection is displayed.
Please help me make this query fetch results faster. Thanks in advance.
So here is the deal:
As rightly pointed out in the comments, $regex is an operator that does not perform well with or without indexes. Here is the reason why:
Queries without indexes are slow because they are executed using a COLLSCAN - essentially an iteration over the whole 50 million documents on disk, one by one, filtering the data and returning only the documents that match. Disks being an inherently slow piece of hardware does not help the situation either.
Now, when indexed, MongoDB builds a B-tree in RAM. And the $regex operator, being not very selective in nature, forces a complete tree scan (as compared to a reduced/partial tree scan in the case of equalities or ranges) of the index B-tree - which is about as bad as a collection scan itself. The only reason you gain those 9 seconds is because this tree scan occurs in RAM and not on disk.
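You can see which of the two happens by asking for the execution stats (a quick sketch; the exact plan output varies by MongoDB version):

db.patients.find({"Patientfirstname":{"$regex":"Testuser"}}).explain("executionStats")
// winningPlan stage: "COLLSCAN" without the index, "IXSCAN" with it;
// totalKeysExamined close to 50 million betrays the complete tree scan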
Having said that, there are a few alternatives to it:
Optimize your $regex (see the sketch after this list). From the MongoDB documentation itself:
For case sensitive regular expression queries, if an index exists for the field, then MongoDB matches the regular expression against the values in the index, which can be faster than a collection scan. Further optimization can occur if the regular expression is a "prefix expression", which means that all potential matches start with the same string. This allows MongoDB to construct a "range" from that prefix and only match against those values from the index that fall within that range.
A regular expression is a "prefix expression" if it starts with a caret (^) or a left anchor (\A), followed by a string of simple symbols. For example, the regex /^abc.*/ will be optimized by matching only against the values from the index that start with abc.
Additionally, while /^a/, /^a./, and /^a.$/ match equivalent strings, they have different performance characteristics. All of these expressions use an index if an appropriate index exists; however, /^a./ and /^a.$/ are slower. /^a/ can stop scanning after matching the prefix.
Case insensitive regular expression queries generally cannot use indexes effectively. The $regex implementation is not collation-aware and is unable to utilize case-insensitive indexes.
Create a Text Index - this would tokenize your text string and enable faster text-based searches (also shown in the sketch after this list)
If you are deployed on MongoDB Atlas - then you can use Atlas Search, which is a Lucene-based text search engine (works almost like Elasticsearch on steroids). It offers significantly greater performance and functionalities like fuzzy text search, text autocomplete, etc.
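Minimal sketches of the first two options, reusing the field and value from the question (untested; note that a text index matches whole words/tokens, not arbitrary substrings):

// 1. prefix expression: the ^ anchor lets MongoDB scan only the
//    index range that starts with "Testuser"
db.patients.find({ "Patientfirstname": { "$regex": "^Testuser" } }).count()

// 2. text index: tokenizes the strings and supports word searches
db.patients.createIndex({ "Patientfirstname": "text" })
db.patients.find({ "$text": { "$search": "Testuser" } }).count()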

AQL: collection not found. non-blocking query

If I run this query:
FOR i IN [
    my_collection[*].my_prop,
    my_other_collection[*].my_prop,
  ]
  RETURN i
I get this error:
Query: AQL: collection not found: my_other_collection (while parsing)
It's true that 'my_other_collection' may not exist, but I still want the result from 'my_collection'.
How can I make this error non-blocking?
A missing collection will cause an error, and this cannot be ignored or suppressed. There is also no concept of late-bound collections which would allow you to evaluate a string as a collection reference at runtime. In short: this is not supported.
The question is why you would want to use such a pattern in the first place. Assuming that both collections exist, the query will materialize the full array before returning anything, which is presumably memory-intensive.
It would be much better to either keep the documents of both collections in a single collection (you may add an extra attribute type to distinguish between them) or to use an ArangoSearch view, so that you can search indexed attributes across collections.
Beyond the two methods already mentioned in the previous answer (ArangoSearch and a single collection), you could do this in JavaScript, probably inside Foxx:
Check if the collections exist with db._collection(collection-name)
Then build the query string using AQL fragments and use UNION() to merge the fragments to pull results from the different collections (a sketch follows after this list).
Note that, if the collections are large, you will probably want to filter the results instead of just pulling all the documents.
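A rough sketch of that approach in arangosh/Foxx-style JavaScript, using the collection and attribute names from the question (UNION() needs at least two arrays, hence the fallback; assumes at least one of the collections exists):

const db = require("@arangodb").db;

const names = ["my_collection", "my_other_collection"];
// keep only the collections that actually exist
const existing = names.filter((n) => db._collection(n) !== null);

// wrap each per-collection query as a subquery fragment
const fragments = existing.map((n) => `(FOR d IN ${n} RETURN d.my_prop)`);

// merge the fragments with UNION() when there is more than one
const query = fragments.length > 1
  ? `FOR x IN UNION(${fragments.join(", ")}) RETURN x`
  : `FOR x IN ${fragments[0]} RETURN x`;

const result = db._query(query).toArray();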

ArangoDB REGEX_TEST index acceleration?

Is there a way to use an index while performing REGEX_TEST() on a string field to retrieve documents in ArangoDB?
Also, if there is any way to optimize this, please let me know.
There is no index acceleration available for the REGEX_TEST() AQL function, and it is unlikely to come in the future. Not because there is no interest from users and developers, but because it's not really possible to build any sort of index data structure that would allow speeding up regular expression evaluation.
Regular expressions as supported by ArangoDB allow for many different types of expressions, but because they can differ so much, there is almost no chance to have a suitable index. For equality comparisons there are hash indexes, which are probably the fastest kind of index. For range queries there are skiplist indexes, and there are of course quite a few more index types known in computer science, but I'm not aware of a single one that could speed up arbitrary regexes.
If your expression allows, maybe there is a chance to add a filter criterion before REGEX_TEST() which can utilize an index? This will mostly be limited to case-sensitive prefix matching, e.g. FILTER REGEX_TEST(doc.str, "a[a-z]*") could be extended to FILTER doc.str >= "a" AND doc.str < "b" AND REGEX_TEST(doc.str, "a[a-z]*"), allowing a skiplist index to be used so that the regex is only evaluated on documents where str starts with a. Or a simple regex like [fm]oo|bar could be rewritten to a set of equality comparisons: FILTER doc.str IN ["foo","moo","bar"]. Also have a look at ArangoSearch.
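As a sketch, with coll and str as placeholder names and assuming a skiplist index on str:

FOR doc IN coll
  /* the range condition narrows the candidates via the skiplist index */
  FILTER doc.str >= "a" AND doc.str < "b"
  /* the regex is then evaluated only on documents starting with "a" */
  FILTER REGEX_TEST(doc.str, "a[a-z]*")
  RETURN doc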

Solr: find groups containing a value

I have a structure of documents indexed in Solr that are grouped together by a property.
Let's say I have a group consisting of three documents:
A -> B -> C
I want to perform a query by a property value V that will return the whole group if A or B or C contains the value V.
For example - the query will return my whole group (A,B,C) if B contains the value V.
Is this possible in Solr?
Thanks!
If I understand correctly, yes, this is possible. You can use the Graph query parser to do this:
you index your docs with the right info on how they are linked to each other in some fields (see the sample in the docs).
then, you query like this:
q={!graph+from=in_edge+to=out_edge}id:A
where id:A is the query to get the starting set of docs, and the {!graph ...} is to get all docs reachable from the starting set.
Some caveats:
works only in standalone Solr or a single SolrCloud shard (though some graph features are also available in streaming expressions; those would work across all shards, but have fewer features for now)
depending on what the graph of docs you want to reach looks like (and how you index the edge info), in some cases you might need to run two queries: one to get the docs 'from' the starting set and another to get the docs 'to' the starting set, for example (see the sketch below).
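To make the caveats concrete, a hedged sketch for the A -> B -> C chain from the question, reusing the in_edge/out_edge field names from the query above (e1 and e2 are arbitrary edge tokens linking A to B and B to C, and prop holds the searched value):

{ "id": "A", "out_edge": ["e1"] }
{ "id": "B", "in_edge": ["e1"], "out_edge": ["e2"], "prop": "V" }
{ "id": "C", "in_edge": ["e2"] }

q={!graph from=in_edge to=out_edge}prop:V
q={!graph from=out_edge to=in_edge}prop:V

The first query walks the edges in one direction starting from the docs matching prop:V, the second walks the other direction; together they return the whole group (A, B, C).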

inverted index sets - querying key prefixes

I'm using Redis in order to build an inverted index system for words and the documents that contain those words.
The setup is really simple: Redis Sets, where the key of the Set is i:word and the members of the Set are the ids of the documents that contain this word.
Let's say I have 2 sets: i:example and i:result.
The query "example result" will intersect i:example and i:result and return all the ids that appear in both sets.
But what I'm looking for is a way to perform (in an efficient manner) a query like "ex res". The result set should contain at least all the ids from the query "example result".
Solutions that I thought of:
Create prefix sets of size 2: p:ex contains {"example", "expertise", "ex", ...}. The lookup running time will not be a problem - O(1) to get the set and O(n) to check all elements in the set for words that start with the prefix (where n = set.size()) - but I worry about the added size price.
Using SCAN: but I'm not sure about the running time - will a query like SCAN 0 MATCH ex* take O(n), where n is the number of keys in the db? I know Redis is fast, but it's probably not an optimized solution for a query like "ex machi cont".
The usual way to go about this is the first approach you mentioned, but usually you'd go with segments that are 3+ chars long. Note that you'll need to have a set for each segment, e.g. i:exa, i:exam, i:examp, i:exampl and of course i:example.
This will naturally take up space in your database (hence the suggestion to start at 3 rather than 2 characters). A possible tweak is to keep in the i:len(3) sets only references to the i:len(4+) sets instead of document ids. This will require more read operations but will bring significant savings in terms of RAM.
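A minimal redis-cli sketch of the segment-set approach, with invented doc ids:

# index the word "example" for doc 42: the full word plus each 3+ char segment
SADD i:example 42
SADD i:exa 42
SADD i:exam 42
SADD i:examp 42
SADD i:exampl 42

# the query "exa res" then stays a plain intersection, exactly as before
SINTER i:exa i:res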
You should explore v2.8.9's addition of lexicographical ranges for Sorted Sets. By calling ZRANGEBYLEX you can get ranges of members (e.g. all the words that start with ex). While this could be useful in this context by itself, consider that you can also use your Sorted Set's members creatively to encode a word and its document reference together. This helps you get over the "loss" of the score (since all scores need to be the same for lexicographical ordering to work). For example, assuming the word "beg" in doc 1 and "bed" in doc 2:
ZADD index 0 "beg:1" 0 "bed:2"
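The prefix lookup against that sorted set then looks like this (the \xff byte caps the range at the end of the "be" prefix):

ZRANGEBYLEX index "[be" "[be\xff"
# returns bed:2 and beg:1; split each member on ":" to recover word and doc id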
Lastly, here's a little something to think about too - adding suffix searching (e.g. everything that ends with "ample"): https://redislabs.com/blog/how-to-use-redis-at-least-x1000-more-efficiently
