ArangoDB Search view is not consistent with collection

I am using ArangoDB version 3.9.3 and I am seeing an inconsistency between a collection and the ArangoSearch view on top of it.
Querying the collection finds only about 60% as many documents as querying the view.
The query on the collection,
RETURN COUNT(FOR n IN collection RETURN n)
returns 4353.
The query on the view,
RETURN COUNT(FOR n IN view SEARCH true RETURN n)
returns 7303.
Because of this inconsistency, the LIMIT operation is not working as expected and queries are also returning extra results. Moreover, it seems that it is returning old documents as well.
I also tested it on older ArangoDB versions.
It happens on 3.7.18, 3.8.7, and the latest 3.9.3. I am using a single-node instance.
My workflow is something like this:
I delete and re-create a document (changing a few of its attributes) many times (around 1000) in a loop, which leads to the inconsistency in the view.
This seems to happen only when primarySort/storedValues is defined.
I create the view on app startup, and it looks like this:
{
  "writebufferSizeMax": 33554432,
  "writebufferIdle": 64,
  "cleanupIntervalStep": 2,
  "commitIntervalMsec": 250,
  "consolidationIntervalMsec": 500,
  "consolidationPolicy": {
    "type": "tier",
    "segmentsBytesFloor": 2097152,
    "segmentsBytesMax": 5368709120,
    "segmentsMax": 10,
    "segmentsMin": 1,
    "minScore": 0
  },
  "primarySortCompression": "none",
  "writebufferActive": 0,
  "links": {
    "FilterNodes": {
      "analyzers": [
        "identity"
      ],
      "fields": {
        "name": {},
        "city": {}
      },
      "includeAllFields": false,
      "storeValues": "none",
      "trackListPositions": false
    }
  },
  "globallyUniqueId": "h9943DEC4CDFB/231",
  "id": "231",
  "storedValues": [],
  "primarySort": [
    {
      "field": "name",
      "asc": true
    },
    {
      "field": "city",
      "asc": true
    }
  ],
  "type": "arangosearch"
}
I am unsure whether I am doing something wrong or whether this is a long-standing bug. Either way, it looks like a serious inconsistency.
Has anyone else encountered this? And can anyone help?
Thanks

Related

How to include two analyzers into a single SEARCH statement?

I have a feeds collection with documents like this:
{
  "created": 1510000000,
  "find": [
    "title of the document",
    "body of the document"
  ],
  "filter": [
    "/example.com",
    "-en"
  ]
}
created contains an epoch timestamp
find contains an array of fulltext snippets, e.g. the title and the body of a text
filter is an array with further search tokens, such as hashtags, domains, locales
The problem is that find contains fulltext snippets, which we want to tokenize, e.g. with a text analyzer, but filter contains final tokens which we want to compare as a whole, e.g. with the identity analyzer.
The goal is to combine find and filter into a single custom analyzer, or to combine two analyzers using two SEARCH statements, or something to that end.
I managed to query by either find or filter successfully, but have not managed to query by both. This is how I query by filter:
I created a feeds_search view:
{
  "writebufferIdle": 64,
  "type": "arangosearch",
  "links": {
    "feeds": {
      "analyzers": [
        "identity"
      ],
      "fields": {
        "find": {},
        "filter": {},
        "created": {}
      },
      "includeAllFields": false,
      "storeValues": "none",
      "trackListPositions": false
    }
  },
  "consolidationIntervalMsec": 10000,
  "writebufferActive": 0,
  "primarySort": [],
  "writebufferSizeMax": 33554432,
  "consolidationPolicy": {
    "type": "tier",
    "segmentsBytesFloor": 2097152,
    "segmentsBytesMax": 5368709120,
    "segmentsMax": 10,
    "segmentsMin": 1,
    "minScore": 0
  },
  "cleanupIntervalStep": 2,
  "commitIntervalMsec": 1000,
  "id": "362444",
  "globallyUniqueId": "hD6FBD6EE239C/362444"
}
and I created a sample query:
FOR feed IN feeds_search
SEARCH ANALYZER(feed.created < 9990000000 AND feed.created > 1500000000
AND (feed.find == "title of the document")
AND (feed.`filter` == "/example.com" OR feed.`filter` == "-uk"), "identity")
SORT feed.created
LIMIT 20
RETURN feed
The sample query works, because find contains the full text (identity analyzer). As soon as I switch to a text analyzer, single word tokens work for find, but filter no longer works.
I tried using a combination of SEARCH and FILTER, which gives me the desired result, but I assume it probably performs worse than having the SEARCH analyzer do the whole thing. I see that analyzers is an array in the view syntax, but I seem not to be able to set individual fields for each analyzer.
An analyzers property can be added to each field in fields. What is specified in the link-level analyzers is the default, which is used when no more specific analyzer is set for a given field.
"analyzers": [
  "identity"
],
"fields": {
  "find": {
    "analyzers": [
      "text_en"
    ]
  },
  "filter": {},
  "created": {}
},
Credits: Simran at ArangoDB
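The fallback rule described in this answer can be sketched in plain JavaScript. This is only an illustration of the lookup order, not ArangoDB's implementation; effectiveAnalyzers and feedsLink are hypothetical names, and the sketch assumes includeAllFields is false.

```javascript
// Resolve which analyzers apply to a field of a view link definition.
// A field-level "analyzers" array takes precedence over the link-level default.
function effectiveAnalyzers(link, fieldName) {
  const field = (link.fields || {})[fieldName];
  if (field === undefined) {
    // Not listed under "fields": treated as not indexed (assumes includeAllFields is false).
    return [];
  }
  if (Array.isArray(field.analyzers)) {
    return field.analyzers;
  }
  return link.analyzers || [];
}

// Link definition mirroring the snippet in the answer.
const feedsLink = {
  analyzers: ["identity"],
  fields: {
    find: { analyzers: ["text_en"] },
    filter: {},
    created: {},
  },
  includeAllFields: false,
};

const resolved = {};
for (const name of Object.keys(feedsLink.fields)) {
  resolved[name] = effectiveAnalyzers(feedsLink, name);
}
console.log(resolved);
```

With a link like this, a query would presumably wrap only the find comparison in ANALYZER(..., "text_en") and leave the filter and created conditions on the identity default.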

Time Series Insights - 'uniqueValues' aggregate not working as expected: does not return any data

I'm trying to execute some aggregate queries against data in TSI. For example:
{
  "searchSpan": {
    "from": "2018-08-25T00:00:00Z",
    "to": "2019-01-01T00:00:00Z"
  },
  "top": {
    "sort": [
      {
        "input": {
          "builtInProperty": "$ts"
        }
      }
    ]
  },
  "aggregates": [
    {
      "dimension": {
        "uniqueValues": {
          "input": {
            "builtInProperty": "$esn"
          },
          "take": 100
        }
      },
      "measures": [
        {
          "count": {}
        }
      ]
    }
  ]
}
The above query, however, does not return any record, although there are many events stored in TSI for that specific searchSpan. Here is the response:
{
  "warnings": [],
  "events": []
}
The query is based on the examples in the documentation, which can be found here. That documentation lacks crucial information about requirements, and some of its examples do not even work.
Any help would be appreciated. Thanks!
@Vladislav,
I'm sorry to hear you're having issues. In reviewing your API call, I see two fixes that should help remedy this issue:
1) It looks like you're using our /events API with a payload for the /aggregates API. Notice the "events" in the response. Additionally, “top” is redundant for the /aggregates API, as we don't support a top-level limit clause there.
2) We do not enforce that the "count" property is present in a limit clause (“take”, “top” or “sample”). It looks like you did not specify it, so the value defaulted to 0, which is why the call returns 0 events.
I would recommend that you use the /aggregates API rather than /events, and that you specify “count” in the limit clause to ensure you get some data back.
Additionally, I'll note your feedback on documentation. We are ramping up a new hire on documentation now, so we hope to improve the quality soon.
I hope this helps!
Andrew
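The two fixes in this answer can be sketched as small JavaScript helpers that reshape the request body. This is a rough sketch only: toAggregatesPayload and withTopCount are hypothetical names, and placing "count" inside the "top" clause is an assumption based on the answer's wording, so check the exact shape against the Time Series Insights query documentation.

```javascript
// Fix 1: drop the top-level "top" clause, which the answer says /aggregates does not support.
function toAggregatesPayload(body) {
  const { top, ...rest } = body;
  return rest;
}

// Fix 2 (assumed shape): give an /events-style "top" clause an explicit count,
// since a missing count is said to default to 0.
function withTopCount(body, count) {
  return { ...body, top: { ...body.top, count } };
}

// The payload from the question.
const original = {
  searchSpan: { from: "2018-08-25T00:00:00Z", to: "2019-01-01T00:00:00Z" },
  top: { sort: [{ input: { builtInProperty: "$ts" } }] },
  aggregates: [
    {
      dimension: {
        uniqueValues: { input: { builtInProperty: "$esn" }, take: 100 },
      },
      measures: [{ count: {} }],
    },
  ],
};

// Body intended for /aggregates: same query without "top".
const aggregatesBody = toAggregatesPayload(original);

// Body intended for /events: search span plus a "top" clause with a count.
const eventsBody = withTopCount(
  { searchSpan: original.searchSpan, top: original.top },
  100
);

console.log(JSON.stringify(aggregatesBody, null, 2));
console.log(JSON.stringify(eventsBody, null, 2));
```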

How to reduce query execution time using mango query in CouchDB?

I am paginating 15,000 records with a Mango query in CouchDB, but the execution time grows as the number of skipped records increases.
Here is my query:
{
  "selector": {
    "name": {"$ne": "null"}
  },
  "fields": ["_id", "_rev", "name", "email" ],
  "sort": [{"name": "asc" }],
  "limit": 10,
  "skip": '.$skip.'
}
The skip value is dynamic and depends on the page number. As it increases, the query execution time increases as well.
CouchDB "Mango" queries that use the $ne (not equal) operator tend to suffer performance issues because of the way the indexing works. One solution is to create an index that only contains documents where name does not equal null, using CouchDB's relatively new partial index feature.
Partial indexes allow the database to be filtered at index time, so that the built index only contains documents that pass the filter test you specify. The index can then be used with a query at query time to further winnow the data set down.
An index is created by calling the /db/_index endpoint:
POST /db/_index HTTP/1.1
Content-Type: application/json
Content-Length: 144
Host: localhost:5984
{
  "index": {
    "partial_filter_selector": {
      "name": {
        "$ne": "null"
      }
    },
    "fields": ["_id", "_rev", "name", "email"]
  },
  "ddoc": "mypartialindex",
  "type" : "json"
}
This creates an index where only documents whose name is not null are included. We can then specify this index at query time:
{
  "selector": {
    "name": {
      "$ne": "null"
    }
  },
  "use_index": "mypartialindex"
}
In the above query, my selector is choosing all records, but the index it is accessing is already filtered. You may add additional clauses to the selector here to further filter the data at query time.
Partial indexing is described in the CouchDB documentation here and in this blog post.
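The idea of filtering at index time and then narrowing further at query time can be illustrated with an in-memory sketch in JavaScript. It does not call CouchDB: buildPartialIndex, queryIndex, and the sample documents are hypothetical, and the filter compares against the string "null" only because the selector above does.

```javascript
// Index time: keep only documents that pass the partial filter, sorted by one field.
function buildPartialIndex(docs, partialFilter, sortField) {
  return docs
    .filter(partialFilter) // filter() returns a new array, so the input is not mutated
    .sort((a, b) =>
      a[sortField] < b[sortField] ? -1 : a[sortField] > b[sortField] ? 1 : 0
    );
}

// Query time: scan the already-filtered index and apply any extra predicate.
function queryIndex(index, predicate, limit) {
  const out = [];
  for (const doc of index) {
    if (predicate(doc)) out.push(doc);
    if (out.length === limit) break;
  }
  return out;
}

const docs = [
  { _id: "1", name: "carol", email: "carol@example.com" },
  { _id: "2", name: "null", email: "x@example.com" },
  { _id: "3", name: "alice", email: "alice@example.com" },
  { _id: "4", name: "bob", email: "bob@example.org" },
];

// Mirrors partial_filter_selector: name $ne "null".
const index = buildPartialIndex(docs, (d) => d.name !== "null", "name");

// An additional query-time clause winnows the indexed set further.
const page = queryIndex(index, (d) => d.email.endsWith("@example.com"), 10);

console.log(index.map((d) => d.name));
console.log(page.map((d) => d.name));
```

The query never sees the excluded document, because it was left out when the index was built.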

Ordering view by document type in couchDB

I have different kinds of documents in my CouchDB, for example:
{
  "_id": "c9f3ebc1-78f4-4dd1-8fc2-ab96f804287c",
  "_rev": "7-1e8fcc048237366e24869dadc9ba54f1",
  "to_customer": false,
  "box_type": {
    "id": 9,
    "name": "ZF3330"
  },
  "erp_creation_date": "16/12/2017",
  "type": "pallet",
  "plantation": {
    "id": 62,
    "name": "FRF"
  },
  "pallet_type": {
    "id": 2565,
    "name": "ZF15324"
  },
  "creation_date": "16/12/2017",
  "article_id": 3,
  "updated": "2017/12/16 19:01",
  "server_status": {
    "in_server": true,
    "errors": null,
    "modified_in_server": false,
    "dirty": false,
    "delete_in_server": false
  },
  "pallet_article": {
    "id": 11,
    "name": "BLUE"
  }
}
So all my documents have a type field. I also have a view that returns all the documents whose type is pallet or shipment.
This is my view:
function(doc) {
  if (doc.completed == true && (doc.type == "shipment" || doc.type == "pallet")) {
    emit([doc.type, doc.device_num, doc.num], doc);
  }
}
This view always returns a list of results. The problem is that the list seems to be ordered by receiving date (I guess), and I need it ordered by document type.
So my question is: how can I order documents by document type in a view?
View results are always sorted by key, so your view is sorted by doc.type: first you will get all pallets, then all the shipments. The pallets are sorted by device_num and then by num. If you emit several rows with the same key, those rows are then sorted by _id. You can find more detailed info in the CouchDB documentation.
So your view should actually work the way you want. ;-)
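The ordering described in this answer can be mimicked with a short JavaScript sketch. It is a simplification: real CouchDB view collation uses its own rules for comparing types and strings, whereas this sketch uses plain JavaScript comparisons. compareKeys, sortRows, and the sample rows are hypothetical.

```javascript
// Compare two array keys element by element; on a tie the shorter array sorts first.
function compareKeys(a, b) {
  for (let i = 0; i < Math.min(a.length, b.length); i++) {
    if (a[i] < b[i]) return -1;
    if (a[i] > b[i]) return 1;
  }
  return a.length - b.length;
}

// Sort rows by key, then by document id when the keys are equal.
function sortRows(rows) {
  return [...rows].sort(
    (x, y) =>
      compareKeys(x.key, y.key) || (x.id < y.id ? -1 : x.id > y.id ? 1 : 0)
  );
}

// Rows shaped like the view's emit([doc.type, doc.device_num, doc.num], doc).
const rows = [
  { id: "d", key: ["shipment", 1, 7] },
  { id: "c", key: ["pallet", 2, 1] },
  { id: "b", key: ["pallet", 1, 5] },
  { id: "a", key: ["pallet", 1, 5] },
];

const sortedIds = sortRows(rows).map((r) => r.id);
console.log(sortedIds);
```

All pallet rows come before the shipment row, and the two rows with identical keys fall back to id order.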

Couchdb 2 _find query not using index

I'm struggling with something that should be easy, but it's making no sense to me. I have these 2 documents in a database:
{ "name": "foo", "type": "typeA" },
{ "name": "bar", "type": "typeB" }
And I'm posting this to _find:
{
  "selector": {
    "type": "typeA"
  },
  "sort": ["name"]
}
Which works as expected but I get a warning that there's no matching index, so I've tried posting various combinations of the following to _index which makes no difference:
{
  "index": {
    "fields": ["type"]
  }
}
{
  "index": {
    "fields": ["name"]
  }
}
{
  "index": {
    "fields": ["name", "type"]
  }
}
If I remove the sort by name and index only the type, it works fine except that the result is not sorted. Is this a limitation of CouchDB's Mango implementation, or am I missing something?
Using a view and map function works fine but I'm curious what mango is/isn't doing here.
With just the type index, I think it will normally be almost as efficient, unless you have many documents of each type (because the sorting stage has to be done in memory).
But since index fields are ordered, it would be necessary to create:
{
  "index": {
    "fields": ["type", "name"]
  }
}
to have a contiguous slice of this index for each type that is already ordered by name. But the query planner may not determine that this index applies.
As an example, the current pouchdb-find (which should be similar) needs the more complicated but equivalent query:
{
  selector: {type: 'typeA', name: {$gte: null} },
  sort: ['type','name']
}
to choose this index and build a plan that doesn't resort to building in memory for any step.
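Why the field order matters can be seen in a small in-memory JavaScript sketch. It only models an index as a sorted array and says nothing about how the Mango query planner actually picks indexes; buildIndex, sliceForType, and the extra sample documents are hypothetical.

```javascript
// Model an index as the documents sorted by the listed fields, in order.
function buildIndex(docs, fields) {
  const cmp = (a, b) => {
    for (const f of fields) {
      if (a[f] < b[f]) return -1;
      if (a[f] > b[f]) return 1;
    }
    return 0;
  };
  return [...docs].sort(cmp);
}

// Return the first contiguous run of entries with the given type.
function sliceForType(index, type) {
  const start = index.findIndex((d) => d.type === type);
  if (start === -1) return [];
  let end = start;
  while (end < index.length && index[end].type === type) end++;
  return index.slice(start, end);
}

const docs = [
  { name: "foo", type: "typeA" },
  { name: "bar", type: "typeB" },
  { name: "baz", type: "typeA" },
  { name: "abc", type: "typeA" },
];

const byTypeName = buildIndex(docs, ["type", "name"]);
const byNameType = buildIndex(docs, ["name", "type"]);

// With ["type", "name"], all typeA entries sit together, already ordered by name.
const contiguous = sliceForType(byTypeName, "typeA").map((d) => d.name);

// With ["name", "type"], typeA entries are interleaved with other types,
// so the first contiguous run does not cover them all.
const broken = sliceForType(byNameType, "typeA").map((d) => d.name);

console.log(contiguous, broken);
```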
