How to define an index to use in a Mango Query - couchdb

I am trying to create a CouchDB Mango Query with an index with the hope that the query runs faster. At the moment I have the following Mango Query which returns what I am looking for but it's slow. Therefore, I assume, I need to create an index to make it faster. I need help figuring out how to create that index.
selector: {
  categoryIds: {
    $in: categoryIds,
  },
},
sort: [{ publicationDate: 'desc' }],
You can assume that my documents are, let's say, news articles from different categories. Each document therefore has a field containing the one or more categories that the article belongs to, stored as an array of categoryIds. My query needs to be optimized for requests like "Give me all news that have categoryId1 in their array of categoryIds, sorted by publicationDate". What I don't know is:
1. How to define an index
2. What that index should be
3. How to use that index in the "use_index" field of the Mango Query
Any help is appreciated.
Update after "Alexis Côté" answer:
If I define the index like this:
{
  "_id": "_design/0f11ca4ef1ea06de05b31e6bd8265916c1bbe821",
  "_rev": "6-adce50034e870aa02dc7e1e075c78361",
  "language": "query",
  "views": {
    "categoryIds-json-index": {
      "map": {
        "fields": {
          "categoryIds": "asc"
        },
        "partial_filter_selector": {}
      },
      "reduce": "_count",
      "options": {
        "def": {
          "fields": [
            "categoryIds"
          ]
        }
      }
    }
  }
}
And run the Mango Query like this:
{
  "selector": {
    "categoryIds": {
      "$in": [
        "e0bd5f97ac35bdf6893351337d269230"
      ]
    }
  },
  "use_index": "categoryIds-json-index"
}
It still returns the results, but they are not sorted by publicationDate as I want. So I am not clear on what solution you are suggesting.

You can create an index as documented here
In your case, you will need an index on the "categoryIds" field.
You can specify the index using "use_index": "_design/<name>"
Note: The query planner should automatically pick this index if it's compatible.
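To make the steps above concrete, here is a minimal sketch of the two request bodies involved: one for POST /{db}/_index and one for POST /{db}/_find. The design document name "news-indexes" and the index name "category-date-index" are hypothetical, and adding publicationDate to the index is an assumption based on the CouchDB rule that every sort field must be covered by the index, in the same order and with a single direction. Whether this yields the ordering you want for an array-valued categoryIds field is not settled in this thread; the /{db}/_explain endpoint shows which index the planner actually chose.

```python
import json


def index_definition(fields, ddoc, name):
    """Body for POST /{db}/_index describing a JSON (Mango) index."""
    return {
        "index": {"fields": list(fields)},
        "ddoc": ddoc,   # hypothetical design document name
        "name": name,   # hypothetical index name
        "type": "json",
    }


def find_request(category_ids, ddoc, name):
    """Body for POST /{db}/_find that names the index explicitly."""
    return {
        "selector": {"categoryIds": {"$in": list(category_ids)}},
        # The sort lists the index fields in index order, all in one direction.
        "sort": [{"categoryIds": "desc"}, {"publicationDate": "desc"}],
        # use_index accepts "<design_document>" or
        # ["<design_document>", "<index_name>"].
        "use_index": [ddoc, name],
    }


index_body = index_definition(
    ["categoryIds", "publicationDate"], "news-indexes", "category-date-index"
)
find_body = find_request(
    ["e0bd5f97ac35bdf6893351337d269230"], "news-indexes", "category-date-index"
)
print(json.dumps(index_body, indent=2))
print(json.dumps(find_body, indent=2))
```

Note that use_index refers to the design document (optionally followed by the index name), not to the index name on its own.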

Related

Moving specific Index Data into a new Index within Elasticsearch

I have several million docs, that I need to move into a new index, but there is a condition on which docs should flow into the index. Say I have a field named, offsets, that needs to be queried against. The values I need to query for are: [1,7,99,32, ....., 10000432] (very large list) in the offset field..
Does anyone have thoughts on how I can move the specific docs, with those values in the list into a new elasticsearch index.? My first though was reindexing with a query, but there is no pattern for the offsets list..
Would it be a python loop appending each doc to a new index? Looking for any guidance.
Thanks
Are the documents really large, or can you add them to a JSONL file for bulk ingestion?
In what form is the selector list, the one shown as "[1,7,99,32, ....., 10000432]"?
I'd do it in Pandas, but here is an idea in ES parlance.
Whatever you do, do use the _bulk API, or the job will never finish.
You can run a query based upon a file, as per
GET my_index/_search?_file="myquery_file"
You can put all the ids into a file, myquery_file, as below:
{
  "query": {
    "ids": {
      "values": ["1", "4", "100"]
    }
  },
  "format": "jsonl"
}
and output as jsonl to ingest.
You can do the above for the reindex API.
{
  "source": {
    "index": "source",
    "query": {
      "match": {
        "company": "cat"
      }
    }
  },
  "dest": {
    "index": "dest",
    "routing": "=cat"
  }
}
Unfortunately, I was facing a time crunch and had to throw in a custom loop to query a very specific subset of indices.
import pandas as pd
from elasticsearch import Elasticsearch

# Assumption: the original omits client setup; point this at your own cluster.
client = Elasticsearch("http://localhost:9200")

df = pd.read_csv('C://code//part_1_final.csv')
# Offsets are the "unique" values that identify the docs. There is no pattern
# in these values, so they are processed one by one.
offsets = df['OFFSET'].tolist()
missedDocs = []
for i in offsets:
    print(i)
    try:
        client.reindex(body={
            "source": {
                "index": "<source_index>",
                "query": {
                    "bool": {
                        "must": [
                            {"match": {"<index_field_1>": "1"}},
                            {"match": {"<index_with_that_needs_values_to_match>": i}}
                        ]
                    }
                }
            },
            "dest": {
                "index": "<dest_index>"
            }
        })
    except Exception as exc:
        # Record the offset so the failed documents can be retried later.
        missedDocs.append(i)
        print('DOC ERROR', exc)
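As a possible alternative to one reindex call per offset, the values could be grouped into batches and matched with a terms query, which Elasticsearch limits to 65,536 terms per query by default (index.max_terms_count). The sketch below only builds the request bodies; the field name "offset", the index names, and the batch size are assumptions for illustration, and each body would still have to be passed to the reindex API.

```python
def chunked(values, size):
    """Yield consecutive slices of `values`, each with at most `size` items."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def reindex_bodies(offsets, source_index, dest_index, field="offset", batch_size=10000):
    """Build one reindex request body per batch of offsets."""
    bodies = []
    for batch in chunked(offsets, batch_size):
        bodies.append({
            "source": {
                "index": source_index,
                # A terms query matches documents whose field equals any listed value.
                "query": {"terms": {field: batch}},
            },
            "dest": {"index": dest_index},
        })
    return bodies


# Tiny batch size so the batching is visible on a short list.
bodies = reindex_bodies([1, 7, 99, 32, 10000432], "source", "dest", batch_size=2)
for body in bodies:
    print(body["source"]["query"])
```

With several million offsets this reduces the number of requests by roughly the batch size, at the cost of larger individual queries.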

CouchDB Mango query - Match any key with array item

I have the following documents:
{
  "_id": "doc1",
  "binds": {
    "subject": {
      "Test1": ["something"]
    },
    "object": {
      "Test2": ["something"]
    }
  }
},
{
  "_id": "doc2",
  "binds": {
    "subject": {
      "Test1": ["something"]
    },
    "object": {
      "Test3": ["something"]
    }
  }
}
I need a Mango selector that retrieves documents where any field inside binds (subject, object, etc.) has an object with a key equal to any value from an array passed as a parameter. That is, if the keys under binds contain any value of that array, the document should be returned.
For instance, given the array ["Test2"], my selector should retrieve doc1, since binds["object"]["Test2"] exists; the array ["Test1"] should retrieve doc1 and doc2; and the array ["Test2", "Test3"] should also retrieve doc1 and doc2.
F.Y.I. I am using Node.js with nano lib to access CouchDB API.
I am providing this answer because the luxury of altering document "schema" is not always an option.
With the given document structure this cannot be done with Mango in any reasonable manner. Yes, it can be done, but only when employing very brittle and inefficient practices.
Mango does not provide an efficient means of querying documents for dynamic properties; it does support searching within property values, e.g. arrays (1).
Using worst practices, this selector will find docs whose binds properties subject and object have properties named Test2 or Test3:
{
  "selector": {
    "$or": [
      {
        "binds.subject.Test2": {
          "$exists": true
        }
      },
      {
        "binds.object.Test2": {
          "$exists": true
        }
      },
      {
        "binds.subject.Test3": {
          "$exists": true
        }
      },
      {
        "binds.object.Test3": {
          "$exists": true
        }
      }
    ]
  }
}
Yuk.
The problems
1. The queried property names vary, so a Mango index cannot be leveraged (Test37 anyone?)
2. Because of (1), a full index scan (_all_docs) occurs on every query
3. Requires programmatic generation of the $or clause
4. Requires knowledge of the set of property names to query (Test37 anyone?)
The given document structure is a show stopper for a Mango index and query.
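For illustration only, the programmatic generation of the $or clause mentioned in the list might look like the sketch below. The function name build_or_selector is hypothetical, and the caller must already know every property name under binds, which is exactly the weakness described above.

```python
import json


def build_or_selector(bind_names, keys):
    """Build a Mango selector with one $exists clause per (bind, key) pair."""
    clauses = [
        {f"binds.{bind}.{key}": {"$exists": True}}
        for key in keys
        for bind in bind_names
    ]
    return {"selector": {"$or": clauses}}


selector = build_or_selector(["subject", "object"], ["Test2", "Test3"])
print(json.dumps(selector, indent=2))
```

The number of clauses is the product of the two list lengths, so the selector grows quickly as either list grows.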
This is where map/reduce shines
Consider a view with the map function
function (doc) {
  for (var prop in doc.binds) {
    if (doc.binds.hasOwnProperty(prop)) {
      // prop = subject, object, foo, bar, etc
      var obj = doc.binds[prop];
      for (var objProp in obj) {
        if (obj.hasOwnProperty(objProp)) {
          // objProp = Test1, Test2, Test37, Fubar, etc
          emit(objProp, prop);
        }
      }
    }
  }
}
So the map function creates a view for any docs with a binds property with two nested properties, e.g. binds.subject.Test1, binds.foo.bar.
Given the two documents in the question, this would be the basic view index
id     key     value
doc1   Test1   subject
doc2   Test1   subject
doc1   Test2   object
doc2   Test3   object
And since view queries provide the keys parameter, this query would provide your specific solution using JSON
{
  "include_docs": true,
  "reduce": false,
  "keys": ["Test2", "Test3"]
}
Querying that index with cURL
$ curl -G http://{view endpoint} -d 'include_docs=false' \
    -d 'reduce=false' -d 'keys=["Test2","Test3"]'
would return
{
  "total_rows": 4,
  "offset": 2,
  "rows": [
    {
      "id": "doc1",
      "key": "Test2",
      "value": "object"
    },
    {
      "id": "doc2",
      "key": "Test3",
      "value": "object"
    }
  ]
}
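To see why a keys lookup returns exactly these rows, the view can be simulated in plain Python. This is only a sketch of the idea: CouchDB runs the JavaScript map function itself, and the sorting here is a rough stand-in for its collation. The helper names are hypothetical.

```python
docs = [
    {"_id": "doc1", "binds": {"subject": {"Test1": ["something"]},
                              "object": {"Test2": ["something"]}}},
    {"_id": "doc2", "binds": {"subject": {"Test1": ["something"]},
                              "object": {"Test3": ["something"]}}},
]


def map_doc(doc):
    """Yield (key, value) pairs the way the JavaScript map function emits them."""
    for prop, obj in doc.get("binds", {}).items():
        for obj_prop in obj:
            yield obj_prop, prop


def build_view(documents):
    """Collect every emitted row and order the rows by key, then document id."""
    rows = [
        {"id": doc["_id"], "key": key, "value": value}
        for doc in documents
        for key, value in map_doc(doc)
    ]
    return sorted(rows, key=lambda row: (row["key"], row["id"]))


def query_keys(view, keys):
    """Return the rows whose key equals one of the requested keys."""
    return [row for key in keys for row in view if row["key"] == key]


view = build_view(docs)
rows = query_keys(view, ["Test2", "Test3"])
print(rows)
```

Because the property names become the view keys, no advance knowledge of names such as Test37 is needed when the index is built.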
Of course there are options to expand the form and function of such a view by leveraging collation and complex keys, and there's the handy reduce feature.
I've seen commentary that Mango is great for those new to CouchDB due to its "ease" in creating indexes and the query options, and that map/reduce is for the more seasoned. I believe such comments are well intentioned but misguided; Mango is alluring but has its pitfalls (1). Views do require considerable thought, but hey, that's what we're supposed to be doing anyway.
(1) $elemMatch, for example, requires in-memory scanning, which can be very costly.

Couchdb mango query speed

I have following type of documents:
{
  "_id": "0710b1dd6cc2cdc9c2ffa099c8000f7b",
  "_rev": "1-93687d40f54ff6ca72e66ca7fc99caff",
  "date": "2018-06-04T07:46:08.848Z",
  "topic": "some topic"
}
The collection is not very large. Only 20k documents.
However, the following query is very slow. Takes ca 5 secs!
{
  selector: {
    topic: 'some topic'
  },
  sort: ['date'],
}
I tried various indexes, e.g.
index: {
  fields: ['topic', 'date']
}
but nothing really worked well.
What am I missing here?
When sorting in a Mango query, you need to ensure that the sort order you are asking for matches the index that you are using.
If you are indexing the data set in topic, date order, then you can use the following query on "topic" to get the data out in date order using the index:
{
  "selector": {
    "topic": "some topic"
  },
  "sort": [
    "topic",
    "date"
  ]
}
Because the sort matches the form of the data in the index, the index is used to answer the query which should speed up your query time considerably.
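One way to keep the sort aligned with the index is to derive it from the index's field list instead of writing it by hand. The helper below is a hypothetical sketch that only builds the query body; it applies a single direction to every field because Mango JSON indexes do not support mixed sort directions.

```python
def sorted_query(selector, index_fields, direction="asc"):
    """Build a _find body whose sort lists the index fields in index order."""
    if direction not in ("asc", "desc"):
        raise ValueError("direction must be 'asc' or 'desc'")
    return {
        "selector": selector,
        # One direction for every field, in the same order as the index.
        "sort": [{field: direction} for field in index_fields],
    }


query = sorted_query({"topic": "some topic"}, ["topic", "date"], direction="desc")
print(query)
```

Passing direction="desc" here would return the newest dates first while still matching the field order of the index.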

How would I query keys such that it would partially match?

Let's take this document for example:
{
  "id": 1,
  "planet": "earth-616",
  "data": [
    ["wolverine", "mutant"],
    ["Storm", "mutant"],
    ["Mark Zuckerberg", "human"]
  ]
}
I created a search index to index the name and type; for example, if I search for name:wolverine or type:mutant, I get the document that has it. But as per my requirement I don't want the whole document, I only want ["wolverine","mutant"]. I've created a view that outputs:
{
"id":1,
"key":"earth-616",
"value":["earth-616","wolverine","mutant"]
}
Then I found out I can query only with keys. (Is it possible to create search indexes on views? I couldn't find anything in the documentation.)
Or should I create views along with the one above like this:
{
"id":1,
"key":"wolverine",
"value":["earth-616","wolverine","mutant"]
}
And
{
"id":,
"key":"mutant",
"value":["earth-616","wolverine","mutant"]
}
This way I can query with the keys that I want, but I can't seem to partially match keys. (Am I missing something?)
If you need the output to be exactly as described then I believe you have to use views, and to support wildcard searches I believe you will have to index every substring of a key.
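As a rough illustration of what "index every substring of a key" implies, the sketch below generates the rows such a view would have to contain for the example document. It is written in Python purely to show the idea (a real CouchDB view would do this inside a JavaScript map function), and the lowercasing and the minimum fragment length of 3 are assumptions.

```python
def substrings(key, min_length=1):
    """Return every distinct lowercase substring of `key`, sorted."""
    key = key.lower()
    found = set()
    for start in range(len(key)):
        for end in range(start + min_length, len(key) + 1):
            found.add(key[start:end])
    return sorted(found)


def view_rows(doc, min_length=3):
    """Emit one row per substring of each name in doc["data"]."""
    rows = []
    for name, kind in doc["data"]:
        for fragment in substrings(name, min_length):
            rows.append({
                "id": doc["id"],
                "key": fragment,
                "value": [doc["planet"], name, kind],
            })
    return rows


doc = {
    "id": 1,
    "planet": "earth-616",
    "data": [["wolverine", "mutant"], ["Storm", "mutant"], ["Mark Zuckerberg", "human"]],
}
rows = view_rows(doc)
matches = [row["value"] for row in rows if row["key"] == "zuck"]
print(matches)
```

A key of length n has up to n(n+1)/2 substrings, so the index grows quickly with longer names; that cost is the trade-off for exact-key lookups on partial text.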
One alternative is to use Cloudant Query, although admittedly you cannot get the exact output you are looking for. If you issue a query like so:
{
  "selector": {
    "_id": {
      "$gt": 0
    },
    "data": {
      "$elemMatch": {
        "$elemMatch": {
          "$regex": "(?i)zuck"
        }
      }
    }
  },
  "fields": [
    "data"
  ]
}
The result will be the entire data array:
{
  "data": [
    ["wolverine", "mutant"],
    ["Storm", "mutant"],
    ["Mark Zuckerberg", "human"]
  ]
}

Query data where userID in multiples ID

I am trying to write a query and I don't know the right way to do it.
Each document in the Mongo collection contains multiple user IDs (uid), and I want a query that gets all the data ("Albums") where a given user ID matches one of the uid values.
I do not know if the structure of the collection is good for that, and I would like to know if I should do it differently.
{
  "_id": ObjectId("55814a9799677ba44e7826d1"),
  "album": "album1",
  "pictures": [
    "1434536659272.jpg",
    "1434552570177.jpg",
    "1434552756857.jpg",
    "1434552795100.jpg"
  ],
  "uid": [
    "12814a8546677ba44e745d85",
    "e745d677ba4412814e745d7b",
    "28114a85466e745d677d85qs"
  ],
  "__v": 0
}
I searched the internet and found this documentation http://docs.mongodb.org/manual/reference/operator/query/in/ but I'm not certain that this is the right way.
In short, I need to know whether I am using the right structure for the collection and whether the "$in" operator is the right solution (knowing that there may be a lot of user IDs: between 2 and 2000 at most).
You don't need $in unless you are matching for more than one possible value in a field, and that field does not have to be an array. $in is in fact shorthand for $or.
You just need a simple query here:
Model.find({ "uid": "12814a8546677ba44e745d85" },function(err,results) {
})
If you want "multiple" user id's then you can use $in:
Model.find(
  { "uid": { "$in": [
    "12814a8546677ba44e745d85",
    "e745d677ba4412814e745d7b",
  ] } },
  function(err,results) {
  }
)
Which is short for $or in this way:
Model.find(
  {
    "$or": [
      { "uid": "12814a8546677ba44e745d85" },
      { "uid": "e745d677ba4412814e745d7b" }
    ]
  },
  function(err,results) {
  }
)
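The equivalence between the two forms can be checked without a database. The sketch below builds both filter documents as plain Python dictionaries and evaluates them with a deliberately tiny matcher that handles only these two shapes; the matcher is a hypothetical stand-in for MongoDB's own evaluation, and it mirrors the rule that an equality test against an array field succeeds when any element matches.

```python
def in_filter(user_ids):
    """Filter document using $in."""
    return {"uid": {"$in": list(user_ids)}}


def or_filter(user_ids):
    """Equivalent filter document using $or."""
    return {"$or": [{"uid": user_id} for user_id in user_ids]}


def field_matches(value, wanted):
    """Equality against a scalar, or against any element of an array field."""
    if isinstance(value, list):
        return wanted in value
    return value == wanted


def matches(doc, query):
    """Tiny matcher covering only the two filter shapes built above."""
    if "$or" in query:
        return any(matches(doc, clause) for clause in query["$or"])
    field, condition = next(iter(query.items()))
    if isinstance(condition, dict) and "$in" in condition:
        return any(field_matches(doc.get(field), wanted) for wanted in condition["$in"])
    return field_matches(doc.get(field), condition)


albums = [
    {"album": "album1", "uid": ["12814a8546677ba44e745d85", "e745d677ba4412814e745d7b"]},
    {"album": "album2", "uid": ["28114a85466e745d677d85qs"]},
]
wanted = ["12814a8546677ba44e745d85", "e745d677ba4412814e745d7b"]
by_in = [a["album"] for a in albums if matches(a, in_filter(wanted))]
by_or = [a["album"] for a in albums if matches(a, or_filter(wanted))]
print(by_in, by_or)
```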
Just to answer your question, you can use the below query to get the desired result.
db.mycollection.find( {uid : {$in : ["28114a85466e745d677d85qs"] } } )
However, you need to revisit your data structure: it looks like a many-to-many problem, and you might want to think about introducing an intermediate collection for that.
