How to reduce query execution time using mango query in CouchDB?

I am paginating 15,000 records with a Mango query in CouchDB, but the further I skip into the result set, the longer the query takes to execute.
Here is my query:
{
  "selector": {
    "name": {"$ne": "null"}
  },
  "fields": ["_id", "_rev", "name", "email"],
  "sort": [{"name": "asc"}],
  "limit": 10,
  "skip": '.$skip.'
}
The number of documents to skip is dynamic, depending on the page requested, and as the skip value increases, the query execution time increases with it.

CouchDB "Mango" queries that use the $ne (not equal) operator tend to suffer performance issues because of the way the indexing works. One solution is to create an index that only contains documents where name does not equal "null", by using CouchDB's relatively new partial index feature.
Partial indexes allow the database to be filtered at index time, so that the built index only contains documents that pass the filter test you specify. The index can then be used with a query at query time to further winnow the data set down.
An index is created by calling the /db/_index endpoint:
POST /db/_index HTTP/1.1
Content-Type: application/json
Content-Length: 144
Host: localhost:5984
{
  "index": {
    "partial_filter_selector": {
      "name": {
        "$ne": "null"
      }
    },
    "fields": ["_id", "_rev", "name", "email"]
  },
  "ddoc": "mypartialindex",
  "type": "json"
}
This creates an index where only documents whose name is not null are included. We can then specify this index at query time:
{
  "selector": {
    "name": {
      "$ne": "null"
    }
  },
  "use_index": "mypartialindex"
}
In the above query, my selector is choosing all records, but the index it is accessing is already filtered. You may add additional clauses to the selector here to further filter the data at query time.
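For example, a query that adds a hypothetical extra clause (the email filter below is purely illustrative, not from the original question) still runs against the smaller, pre-filtered index:
{
  "selector": {
    "name": {"$ne": "null"},
    "email": {"$regex": "@example.com$"}
  },
  "fields": ["_id", "_rev", "name", "email"],
  "use_index": "mypartialindex"
}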
Partial indexing is described in more detail in the CouchDB documentation for the /db/_index endpoint.

Related

Moving specific Index Data into a new Index within Elasticsearch

I have several million docs that I need to move into a new index, but there is a condition on which docs should flow into the index. Say I have a field named offsets that needs to be queried against. The values I need to query for are [1, 7, 99, 32, ....., 10000432] (a very large list) in the offsets field.
Does anyone have thoughts on how I can move the specific docs, with those values in the list, into a new Elasticsearch index? My first thought was reindexing with a query, but there is no pattern to the offsets list.
Would it be a Python loop appending each doc to a new index? Looking for any guidance.
Thanks
Are the documents really large, or can you add them into a jsonl file for bulk ingestion?
In what form is the selector list, the one shown as "[1,7,99,32, ....., 10000432]"?
I'd do it in Pandas, but here is an idea in ES parlance.
Whatever you do, do use the _bulk API, or the job will never finish.
You can run a query based upon a file, as per
GET my_index/_search?_file="myquery_file"
You can put all the ids into a file, myquery_file, as below:
{
  "query": {
    "ids": {
      "values": ["1", "4", "100"]
    }
  },
  "format": "jsonl"
}
and output as jsonl to ingest.
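As a rough sketch of what that bulk ingestion step could look like in Python (the client URL, index name, and file name are assumptions, not from the thread):
import json
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200")  # assumed local cluster

def actions(path, index_name):
    # Each line of the jsonl file is treated as one source document.
    with open(path) as f:
        for line in f:
            yield {"_index": index_name, "_source": json.loads(line)}

helpers.bulk(es, actions("docs.jsonl", "new_index"))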
You can use the same ids query with the reindex API:
{
  "source": {
    "index": "source",
    "query": {
      "match": {
        "company": "cat"
      }
    }
  },
  "dest": {
    "index": "dest",
    "routing": "=cat"
  }
}
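Putting the two pieces together, a sketch of a _reindex body whose source carries the ids query (the values shown are placeholders); if the offsets live in a regular field rather than in _id, a terms query on that field would be the analogue:
{
  "source": {
    "index": "source",
    "query": {
      "ids": {
        "values": ["1", "7", "99", "32"]
      }
    }
  },
  "dest": {
    "index": "dest"
  }
}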
Unfortunately, I was facing a time crunch and had to throw in a personalized loop to query a very specific subset of indices.
import pandas as pd
from elasticsearch import Elasticsearch

client = Elasticsearch()  # client setup was elided in the original post

df = pd.read_csv('C://code//part_1_final.csv')
offsets = df['OFFSET'].tolist()
# Offsets are the "unique" values I need to identify the docs by.
# There is no pattern in these values, thus I must go one by one.
missedDocs = []
for i in offsets:
    print(i)
    try:
        client.reindex(body={
            "source": {
                "index": "<source_index>",
                "query": {
                    "bool": {
                        "must": [
                            {"match": {"<index_field_1>": "1"}},
                            {"match": {"<index_field_that_needs_values_to_match>": i}}
                        ]
                    }
                }
            },
            "dest": {
                "index": "<dest_index>"
            }
        })
    except KeyError:
        print('error')
        # missedDocs.append(query)
        print('DOC ERROR')
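A batched variant of that loop (an untested sketch; the names in angle brackets are placeholders and the chunk size is an assumption) would issue one reindex call per chunk of offsets using a terms query, instead of one call per value:
# Sketch only: reindex the offsets in chunks via a terms query.
CHUNK = 1000
for start in range(0, len(offsets), CHUNK):
    chunk = offsets[start:start + CHUNK]
    client.reindex(body={
        "source": {
            "index": "<source_index>",
            "query": {
                "bool": {
                    "must": [
                        {"match": {"<index_field_1>": "1"}},
                        {"terms": {"<offset_field>": chunk}}
                    ]
                }
            }
        },
        "dest": {"index": "<dest_index>"}
    })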

Mango index "does not contain a valid index for this query" even when specified manually

I'm trying to efficiently query data via Mango (as that seems to be the only option given my requirements: searching for sub-objects with a date range containing the queried date value), but I can't even get a very simple index/query pair to work: although I specify my index manually for the query, I'm told that my index "was not used because it does not contain a valid index for this query. No matching index found, create an index to optimize query time."
(I'm doing all of this via Fauxton on CouchDB v. 3.0.0)
Let's say my documents look like this:
{
  "tenant": "TNNT_a",
  "$doctype": "JobOpening",
  // a bunch of other fields
}
All documents with a $doctype of "JobOpening" are guaranteed to have a tenant property. The searches I wish to perform will only ever be for documents with $doctype of "JobOpening" and a tenant selector will always be provided when querying.
Here's the test index I've configured:
{
  "index": {
    "fields": [
      "tenant",
      "$doctype"
    ],
    "partial_filter_selector": {
      "\\$doctype": {
        "$eq": "JobOpening"
      }
    }
  },
  "ddoc": "job-openings-doctype-index",
  "type": "json"
}
And here's the query
{
  "selector": {
    "tenant": "TNNT_a",
    "\\$doctype": "JobOpening"
  },
  "use_index": "job-openings-doctype-index"
}
Why isn't the index being used for the query?
I've tried not using a partial index, and I think the $doctype escaping is done properly in the requisite places, but nothing seems to keep CouchDB from performing a full scan.
The index isn't being used because the $doctype field is not being recognized by the query planner as expected.
Changing the fields declaration from $doctype to \\$doctype in the design document solves the issue.
{
  "index": {
    "fields": [
      "tenant",
      "\\$doctype"
    ],
    "partial_filter_selector": {
      "\\$doctype": {
        "$eq": "JobOpening"
      }
    }
  },
  "ddoc": "job-openings-doctype-index",
  "type": "json"
}
After that small refactor, the query
{
  "selector": {
    "tenant": "TNNT_a",
    "\\$doctype": "JobOpening"
  },
  "use_index": "job-openings-doctype-index"
}
Returns the expected result, and produces an "explain" which confirms the job-openings-doctype-index was queried:
{
  "dbname": "stack",
  "index": {
    "ddoc": "_design/job-openings-doctype-index",
    "name": "7f5c5cea5acd90f11fffca3e3355b6a03677ad53",
    "type": "json",
    "def": {
      "fields": [
        {
          "tenant": "asc"
        },
        {
          "\\$doctype": "asc"
        }
      ],
      "partial_filter_selector": {
        "\\$doctype": {
          "$eq": "JobOpening"
        }
      }
    }
  },
  // etc etc etc
Whether this change is intuitive or not is unclear, however it is consistent, and it perhaps reveals that field names beginning with a "special" character may not be desirable.
Regarding the indexing of the filtered field, as per the documentation on partial_filter_selector:
Technically, we don't need to include the filter on the "status" [e.g. $doctype here] field in the query selector - the partial index ensures this is always true - but including it makes the intent of the selector clearer and will make it easier to take advantage of future improvements to query planning (e.g. automatic selection of partial indexes).
Despite that, I would not choose to index a field whose value is constant.
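Following that advice, a leaner index that keeps the partial filter but only indexes the non-constant field might look like this (the ddoc name below is illustrative, not from the question):
{
  "index": {
    "fields": ["tenant"],
    "partial_filter_selector": {
      "\\$doctype": {
        "$eq": "JobOpening"
      }
    }
  },
  "ddoc": "job-openings-by-tenant",
  "type": "json"
}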

Cloudant Sorting on a nullable field

I want to sort on a field, let's say name, which is indexed in Cloudant DB. Using the index without a sort, I get all the documents, both those that have the name field and those that don't. But when I try to sort by the name field, I no longer get the documents that lack the name field.
Is there any way to do this using the query indexes? I want all the documents in sorted order, including those that don't have the name field.
For Example :
Below are some documents:
{
  "_id": 1234,
  "classId": "abc",
  "name": "Happa"
}
{
  "_id": 12345,
  "classId": "abc",
  "name": "Prasanth"
}
{
  "_id": 123456,
  "classId": "abc"
}
Below is the query I am trying to execute:
{
  "selector": {
    "classId": "abc",
    "name": {
      "or": [
        {"$exists": true}, {"$exists": false}
      ]
    }
  },
  "sort": [{"classId": "asc"}, {"name": "asc"}],
  "use_index": "idx-classId_name"
}
I am expecting all the documents to be returned in a sorted order including the document which doesn't have that name field.
Your query makes no sense to me as it stands. You're requesting a listing of documents which either have or don't have a specific field (meaning every document), and expecting to sort those on this field that may or may not exist. Such an order isn't defined out of the box.
I'd remove the name clause from the selector, sorting only on the classId field which appears in every document, and then do the secondary partial ordering on the client side, so you can decide how you intend to mix in the documents without the name field with those that have it.
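A minimal sketch of that approach, assuming a separate index on classId alone (the idx-classId name is illustrative); documents without a name field are then still returned, and the ordering on name is applied client-side:
{
  "selector": {
    "classId": "abc"
  },
  "sort": [{"classId": "asc"}],
  "fields": ["_id", "classId", "name"],
  "use_index": "idx-classId"
}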
Another solution is to use a view instead of a Cloudant Query index. I've not tested this, but hopefully the intent is clear:
function(doc) {
  if (doc && doc.classId) {
    var name = doc.name || "[notfound]";
    emit(doc.classId + "-" + name, 1);
  }
}
which will key the docs on "classId-name" and for docs with no name, a specified sentinel value.
Querying the view should return the documents lexicographically ordered on this compound key (which you can reverse with a query parameter if you wish).
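For example, querying the view (the design document and view names here are illustrative) with include_docs to pull back the documents in key order:
GET /db/_design/sorting/_view/by_class_and_name?include_docs=true&descending=false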

Couchdb mango query speed

I have following type of documents:
{
  "_id": "0710b1dd6cc2cdc9c2ffa099c8000f7b",
  "_rev": "1-93687d40f54ff6ca72e66ca7fc99caff",
  "date": "2018-06-04T07:46:08.848Z",
  "topic": "some topic"
}
The collection is not very large. Only 20k documents.
However, the following query is very slow. Takes ca 5 secs!
{
  selector: {
    topic: 'some topic'
  },
  sort: ['date'],
}
I tried various indexes, e.g.
index: {
  fields: ['topic', 'date']
}
but nothing really worked well.
What am I missing here?
When sorting in a Mango query, you need to ensure that the sort order you are asking for matches the index that you are using.
If you are indexing the data set in topic,date order, then you can use the following query on "topic" to get the data out in date order using the index:
{
  "selector": {
    "topic": "some topic"
  },
  "sort": [
    "topic",
    "date"
  ]
}
Because the sort matches the form of the data in the index, the index is used to answer the query which should speed up your query time considerably.
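For completeness, the matching index can be created through the _index endpoint in the same topic,date order (the ddoc name below is illustrative):
POST /db/_index HTTP/1.1
Content-Type: application/json
Host: localhost:5984

{
  "index": {
    "fields": ["topic", "date"]
  },
  "ddoc": "topic-date-index",
  "type": "json"
}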

Elastic Search filtering in facets

I want to simulate a parent-child relation in Elasticsearch and perform some analytics work over it. My use case is something like this:
I have a shop owner like this:
"_source": {
  "shopId": 5,
  "distributorId": 4,
  "stateId": 1,
  "partnerId": 2
}
and now have child records (for each day) like this:
"_source": {
  "shopId": 5,
  "date": "2013-11-13",
  "transactions": 150,
  "amount": 1980
}
The parent is a record per store, while the child is the transactions each store does per day. Now I want to do a more complex query like:
Find out the total transactions for each day over the last 30 days where the distributor is 5
POST /newdb/shopsDaily/_search
{
  "query": {
    "match_all": {}
  },
  "filter": {
    "has_parent": {
      "type": "shop",
      "query": {
        "match": {
          "distributorId": "5"
        }
      }
    }
  },
  "facets": {
    "date": {
      "histogram": {
        "key_field": "date",
        "value_field": "transactions",
        "interval": 100
      }
    }
  }
}
But the results I get do not take into account the filter I applied.
So I changed the query to this:
POST /newdb/shopDaily/_search
{
  "query": {
    "filtered": {
      "query": {"match_all": {}},
      "filter": {
        "has_parent": {
          "type": "shop",
          "query": {
            "match": {
              "distributorId": "13"
            }
          }
        }
      }
    }
  },
  "facets": {
    "date": {
      "histogram": {
        "key_field": "date",
        "value_field": "transactions",
        "interval": 100
      }
    }
  }
}
And then the final histogram facet took the filtering into account.
When I browsed through the documentation, I found out this is due to using filtered (which can only be used inside the query clause, not outside it like filter) rather than filter,
but it also mentioned that for fast search you should use filter. Will searching as I did in the second step (when I used filtered instead of filter) affect the performance of Elasticsearch? If so, how can I make my facets honor filters without hurting performance?
Thanks for your time
Filters in a filtered query (filters in the query clause) are cached, hence faster. These types of filters affect both the search results and the facet counts.
Filters outside the query clause are not considered during facet calculations; they are considered only for search results. Facets are calculated only on the query clause. If you want filtered facets, you need to add a filter to each of the facet clauses.
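With the legacy facets API used in the question, that per-facet filtering is expressed with facet_filter; a rough sketch (untested against that Elasticsearch version, reusing the has_parent filter from above):
{
  "query": {
    "match_all": {}
  },
  "facets": {
    "date": {
      "histogram": {
        "key_field": "date",
        "value_field": "transactions",
        "interval": 100
      },
      "facet_filter": {
        "has_parent": {
          "type": "shop",
          "query": {
            "match": {
              "distributorId": "5"
            }
          }
        }
      }
    }
  }
}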
