Couchdb mango query speed - couchdb

I have following type of documents:
{
"_id": "0710b1dd6cc2cdc9c2ffa099c8000f7b",
"_rev": "1-93687d40f54ff6ca72e66ca7fc99caff",
"date": "2018-06-04T07:46:08.848Z",
"topic": "some topic",
}
The collection is not very large. Only 20k documents.
However, the following query is very slow. Takes ca 5 secs!
{
selector: {
topic: 'some topic'
},
sort: ['date'],
}
I tried various indexes, e.g.
index: {
fields: ['topic', 'date']
}
but nothing really worked well.
What I am missing here?

When sorting in a Mango query, you need to ensure that the sort order you are asking for matches the index that you are using.
If you are indexing the data set in topic,date order then you can use the following query on "topic" to get the data out in data order using the index:
{
"selector": {
"topic": "some topic"
},
"sort": [
"topic",
"date"
]
}
Because the sort matches the form of the data in the index, the index is used to answer the query which should speed up your query time considerably.

Related

Moving specific Index Data into a new Index within Elasticsearch

I have several million docs, that I need to move into a new index, but there is a condition on which docs should flow into the index. Say I have a field named, offsets, that needs to be queried against. The values I need to query for are: [1,7,99,32, ....., 10000432] (very large list) in the offset field..
Does anyone have thoughts on how I can move the specific docs, with those values in the list into a new elasticsearch index.? My first though was reindexing with a query, but there is no pattern for the offsets list..
Would it be a python loop appending each doc to a new index? Looking for any guidance.
Thanks
Are the documents really large, or can you add them into an jsonl file for bulk ingestion?
In what form is the selector list, the one shown as "[1,7,99,32, ....., 10000432]"?
I'd do it in Pandas, but here is an idea in ES parlance.
Whatever you do, do use the _bulk API, or the job will never finish.
You can run a query based upon as file as per
GET my_index/_search?_file="myquery_file"
You can put all the ids into a file, myquery_file, as below:
{
"query": {
"ids" : {
"values" : ["1", "4", "100"]
}
},
"format": "jsonl"
}
and output as jsonl to ingest.
You can do the above for the reindex API.
{
"source": {
"index": "source",
**"query": {
"match": {
"company": "cat"
}
}**
},
"dest": {
"index": "dest",
"routing": "=cat"
}
}
Unfortunately,
I was facing a time crunch, and had to throw in a personalized loop to query a very specific subset of indices..
df = pd.read_csv('C://code//part_1_final.csv')
offsets = df['OFFSET'].tolist()
# Offsets are the "unique" values I need to identify the docs by.. There is no pattern in these values, thus I must go one by one..
missedDocs = []
for i in offsets:
print(i)
try:
client.reindex({
"source": {
"index": "<source_index>,
"query": {
"bool": {
"must": [
{ "match" : {"<index_filed_1>": "1" }},
{ "match" : {"<index_with_that_needs_values_to_match": i }}
]
}
}
},
"dest": {
"index": "<dest_index>"
}
})
except KeyError:
print('error')
#missedDocs.append(query)
print('DOC ERROR')

How to define an index to use in a Mango Query

I am trying to create a CouchDB Mango Query with an index with the hope that the query runs faster. At the moment I have the following Mango Query which returns what I am looking for but it's slow. Therefore, I assume, I need to create an index to make it faster. I need help figuring out how to create that index.
selector: {
categoryIds: {
$in: categoryIds,
},
},
sort: [{ publicationDate: 'desc' }],
You can assume that my documents are let say news articles from different categories. Therefore in each document I have a field that contains one or more categories that the news article belongs to. For that I have an array of categoryIds for each document. My query needs to be optimized for queries like "Give me all news that have categoryId1 in their array of categoryIds sorted by publicationDate". What I don't know how to do is 1. How to define an index 2. What that index should be 3. How to use that index in "use_index" field of the Mango Query. Any help is appreciated.
Update after "Alexis Côté" answer:
If I define the index like this:
{
"_id": "_design/0f11ca4ef1ea06de05b31e6bd8265916c1bbe821",
"_rev": "6-adce50034e870aa02dc7e1e075c78361",
"language": "query",
"views": {
"categoryIds-json-index": {
"map": {
"fields": {
"categoryIds": "asc"
},
"partial_filter_selector": {}
},
"reduce": "_count",
"options": {
"def": {
"fields": [
"categoryIds"
]
}
}
}
}
}
And run the Mango Query like this:
{
"selector": {
"categoryIds": {
"$in": [
"e0bd5f97ac35bdf6893351337d269230"
]
}
},
"use_index": "categoryIds-json-index"
}
It still does return the results but they are not sorted in the order I want by publicationDate. So I am not clear what you are suggesting the solution is.
You can create an index as documented here
In your case, you will need an index on the "categoryIds" field.
You can specify the index using "use_index": "_design/<name>"
Note:The query planner should automatically pick this index if it's compatible.

How to reduce query execution time using mango query in CouchDB?

I am doing pagination of 15000 records using mango query in CouchDB, but as I skip the records in more numbers then the execution time is increasing.
Here is my query:
{
"selector": {
"name": {"$ne": "null"}
},
"fields": ["_id", "_rev", "name", "email" ],
"sort": [{"name": "asc" }],
"limit": 10,
"skip": '.$skip.'
}
Here skip documents are dynamic depends upon the pagination number and as soon as the skip number increases the query execution time also get increase.
CouchDB "Mango" queries that use the $ne (not equal) operator tend to suffer performance issues because of the way the indexing works. One solution is to create and index that *only contains documents where name does not equal null by using CouchDB's relative new partial index feature.
Partial indexes allow the database to be filtered at index time, so that the built index only contains documents that pass the filter test you specify. The index can then be used with a query at query time to further winnow the data set down.
An index is created by calling the /db/_index endpoint:
POST /db/_index HTTP/1.1
Content-Type: application/json
Content-Length: 144
Host: localhost:5984
{
"index": {
"partial_filter_selector": {
"name": {
"$ne": "null"
}
},
"fields": ["_id", "_rev", "name", "email"]
},
"ddoc": "mypartialindex",
"type" : "json"
}
This creates an index where only documents whose name is not null are included. We can then specify this index at query time:
{
"selector": {
"name": {
"$ne": "null"
}
},
"use_index": "mypartialindex"
}
In the above query, my selector is choosing all records, but the index it is accessing is already filtered. You may add additional clauses to the selector here to further filter the data at query time.
Partial indexing is described in the CouchDB documentation here and in this blog post.

Couchdb 2 _find query not using index

I'm struggling with something that should be easy but it's making no sense to me, I have these 2 documents in a database:
{ "name": "foo", "type": "typeA" },
{ "name": "bar", "type": "typeB" }
And I'm posting this to _find:
{
"selector": {
"type": "typeA"
},
"sort": ["name"]
}
Which works as expected but I get a warning that there's no matching index, so I've tried posting various combinations of the following to _index which makes no difference:
{
"index": {
"fields": ["type"]
}
}
{
"index": {
"fields": ["name"]
}
}
{
"index": {
"fields": ["name", "type"]
}
}
If I remove the sort by name and only index the type it works fine except it's not sorted, is this a limitation with couchdbs' mango implementation or am I missing something?
Using a view and map function works fine but I'm curious what mango is/isn't doing here.
With just the type index, I think it will normally be almost as efficient unless you have many documents of each type (as it has to do the sorting stage in memory.)
But since fields are ordered, it would be necessary to do:
{
"index": {
"fields": ["type", "name"]
}
}
to have a contiguous slice of this index for each type that is already ordered by name. But the query planner may not determine that this index applies.
As an example, the current pouchdb-find (which should be similar) needs the more complicated but equivalent query:
{
selector: {type: 'typeA', name: {$gte: null} },
sort: ['type','name']
}
to choose this index and build a plan that doesn't resort to building in memory for any step.

Elastic Search filtering in facets

I want to simulate a parent child relation in elastic search and perform some analytics work over it. My use case is something like this
I have a shop owner like this
"_source": {
"shopId": 5,
"distributorId": 4,
"stateId": 1,
"partnerId": 2,
}
and now have child records (for each day) like this:
"_source": {
"shopId": 5,
"date" : 2013-11-13,
"transactions": 150,
"amount": 1980,
}
The parent is a record per store, while the child is the transactions each store does for
day. Now I want to do some complex query like
Find out total transaction for each day for the last 30 days where distributor is 5
POST /newdb/shopsDaily/_search
{
"query": {
"match_all": {}
},
"filter": {
"has_parent": {
"type": "shop",
"query": {
"match": {
"distributorId": "5"
}
}
}
},
"facets": {
"date": {
"histogram": {
"key_field": "date",
"value_field": "transactions",
"interval": 100
}
}
}
}
But the result I get do not take the filtering into account which I applied.
So I changed the query to this:
POST /newdb/shopDaily/_search
{
"query": {"filtered": {
"query": {"match_all": {}},
"filter": { "has_parent": {
"type": "shop",
"query": {"match": {
"distributorId": "13"
}}
}}
}},
"facets": {
"date": {
"histogram": {
"key_field": "date",
"value_field": "transactions",
"interval": 100
}
}
}
}
And then the final histogram facet took filtering into count.
So, when I browsed though I found out this is due to using filtered(which can only be used inside query clause and not outside like filter) rather than filter,
but it also mentioned that to have fast search you should use filter. Will searching as I did in second step (when I used filtered instead of filter) effect the performance of elastic search? If so, how can I make my facets honor filters and not effect the performance?
Thanks for you time
filters in Filtered query (filters in query clause) are cached, hence faster. These type of filters affect both search result and facet counts.
Filters outside the query clause are not considered during facet calculations. They are considered only for search results. Facet is calculated only on the query clause. If you want filtered facets then you need to set filters to each of the facet clauses.

Resources