Cloudant couchdb changes api and geospatial indexes - couchdb

Currently I'm doing filtered replication by monitoring the below resource:
_changes?filter=_selector&include_docs=true&attachments=true&limit=20
As you can see, I'm using a selector defined by
"selector": {
"type": "Property"
}
and everything is working great. Now I need to add another criterion: a geospatial index. I want to replicate documents whose locations fall within a radius, e.g.
lat=-11.05987446&lon=12.28339928&radius=100
How can I use the above filtered replication technique to replicate only the documents within a radius?
Thanks

The selector filter for _changes is not backed by an index - it just uses the same syntax as Query selectors, which currently does not support geospatial operations.
I think you have 3 options:
1. Use a bounding box
Your selector would then look something like:
"selector": {
"type": "Property",
"lat": {
"$gt": -11
},
"lat": {
"$lt": 11
},
"lon": {
"$gt": 12
},
"lon": {
"$lt": 14
}
}
Perhaps you could then further restrict the results on the client if you need exactly a radial search.
2. Implement a radius search in a JavaScript filter
This means dropping the use of the `selector` filter. It would be relatively slow (anything that involves JavaScript in Couch/Cloudant will be) but gives you exactly the result you want.
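For illustration, a filter function along these lines (a rough sketch; the design doc name, the lat/lon field names and the kilometre unit are assumptions) could compute the haversine distance using parameters passed on the _changes request, e.g. _changes?filter=geo/radius&lat=-11.05987446&lon=12.28339928&radius=100:

// Stored under "filters": { "radius": ... } in a _design/geo document (names assumed)
function (doc, req) {
  if (doc.type !== 'Property' || doc.lat == null || doc.lon == null) {
    return false;
  }
  var lat = parseFloat(req.query.lat);
  var lon = parseFloat(req.query.lon);
  var radius = parseFloat(req.query.radius); // assumed to be kilometres
  var toRad = function (deg) { return deg * Math.PI / 180; };
  // Haversine distance between the query point and the document's location
  var dLat = toRad(doc.lat - lat);
  var dLon = toRad(doc.lon - lon);
  var a = Math.sin(dLat / 2) * Math.sin(dLat / 2) +
          Math.cos(toRad(lat)) * Math.cos(toRad(doc.lat)) *
          Math.sin(dLon / 2) * Math.sin(dLon / 2);
  var distanceKm = 2 * 6371 * Math.asin(Math.sqrt(a));
  return distanceKm <= radius;
}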
3. Run a query and replicate the resulting ids
Use a search or geospatial query to get the set of _ids you need and use a doc_id based replication to fetch them.
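For example, a replication document driven by doc_ids might look roughly like this (the endpoints and ids are placeholders):

{
  "_id": "radius_repl",
  "source": "https://.../<source>",
  "target": "https://.../<target>",
  "doc_ids": ["id-1", "id-2", "id-3"]
}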
Lastly, it's worth considering whether you really want replication (which implies the ability for documents to sync both ways) or if you're just caching / copying data to the client. Replication carries some overhead beyond just copying the data (it needs to figure out the delta between the client and server, retrieve the rev history for each doc, etc) so, if you don't need to write the docs back to the server, maybe you don't need it.
If you do go down the replication route, you may need to handle cases where documents that you previously replicated no longer match the query so updates do not propagate in subsequent replications.
If not, you may be better off just running a query with include_docs=true and manually inserting the documents to a local database.

The _selector filter for the changes feed isn't backed by an index; it's basically a handy shortcut to achieve the same thing as a JavaScript filter, but much faster as it's executed directly in Erlang.
As it's not index-backed, you can't tap into a geo-index that way.
You'd be better off running either a bounding box or radius query to get the ids and then fetching those documents with a _bulk_get or with a POST to _all_docs with the ids in the body.
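As a rough sketch (assuming Node 18+ fetch; the database URL is a placeholder), the _all_docs route could look like:

// Fetch the docs for the ids returned by the bounding box / radius query in one round trip
async function fetchDocsByIds(ids) {
  const resp = await fetch('https://account.cloudant.com/mydb/_all_docs?include_docs=true', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ keys: ids })
  });
  const { rows } = await resp.json();
  return rows.map(row => row.doc); // e.g. insert these into a local database
}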
https://cloudant.com/wp-content/uploads/Cloudant-Geospatial-technical-overview.pdf

Related

Is it faster to use aggregation or manually filter through data with nodejs and mongoose?

I'm at a crossroads trying to decide which methodology to use. Basically, I have a MongoDB collection and I want to query it with specific params provided by the user, then I want to group the response according to the value of some of those parameters. For example, let's say my collection is animals, and if I query all animals I get something like this
[
{type:"Dog",age:3,name:"Kahla"},
{type:"Cat",age:6,name:"mimi"},
...
]
Now I would like to return to the user a response that is grouped by the animal type, so that I end up with something like
{
Dogs: [...dog docs],
Cats: [...cat docs],
Cows: [...],
}
So basically I have 2 ways of doing this. One is to just use Model.find() and fetch all the animals that match my specific queries, such as age or any other field, and then manually filter and format my JSON response before sending it back to the user with res.json({}) (I'm using Express, btw).
Or I can use Mongo's aggregation framework and $group to do this at the query level, hence returning from the DB an already-grouped response to my request. The only inconvenience I've found with this so far is how the response is formatted, which ends up looking something like this
[
{
"_id":"Dog",
"docs":[{dog docs...}]
},
{
"_id":"Cat",
"docs":[{...}]
}
]
The overall result is BASICALLY the same, but the formatting of the response is quite different, and my front-end client needs to adjust to how I'm sending the response. I don't really like the array of objects from the aggregation, and prefer a JSON-like object response with key names corresponding to the arrays as I see fit.
So the real question here is whether there is a significant advantage of one way over the other. Is the aggregation framework so fast that it will scale well if my collection grows to huge numbers? Is filtering through the data with JavaScript and mapping the response so I can shape it to my liking a very inefficient process, and hence is it better to use aggregation and adapt the front end to this response shape?
I'm assuming that by faster you mean the least time needed to serve a request. That said, let's divide the time required to process your request into:
Asynchronous Operations (Network Operations, File read/write etc)
Synchronous Operations
Synchronous operations are usually much faster than asynchronous ones (this also depends on the nature of the operation and the amount of data being processed). For example, if you loop over an iterable (e.g. an Array or Map) with fewer than 1000 elements, it won't take more than a few milliseconds.
On the other hand, asynchronous operations take more time. For example, if you run an HTTP request it will take a couple of milliseconds to get the response.
When you query MongoDB with mongoose, it's an asynchronous call and it will take more time. So, the more queries you run against the database, the slower your API becomes. MongoDB aggregation can help you reduce the total number of queries, which may help you make your APIs faster. But the catch is that aggregations are usually slower than plain find requests.
The summary is: if you can manually filter the data without adding any extra DB queries, it's going to be faster.
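To make the trade-off concrete, here is a minimal sketch of both approaches with mongoose and Express (the Animal model, the route paths and the age filter are assumptions based on the question):

// Sketch only: assumes an existing mongoose model named Animal and an Express app.
const express = require('express');
const Animal = require('./models/animal'); // hypothetical model module

const app = express();

// Approach 1: fetch matching docs, then group them in application code.
app.get('/animals/grouped-in-js', async (req, res) => {
  const animals = await Animal.find({ age: { $gte: 2 } }).lean();
  const grouped = animals.reduce((acc, doc) => {
    (acc[doc.type] = acc[doc.type] || []).push(doc);
    return acc;
  }, {});
  res.json(grouped); // { Dog: [...], Cat: [...], ... }
});

// Approach 2: let MongoDB group with $group, then reshape the returned array.
app.get('/animals/grouped-in-db', async (req, res) => {
  const byType = await Animal.aggregate([
    { $match: { age: { $gte: 2 } } },
    { $group: { _id: '$type', docs: { $push: '$$ROOT' } } }
  ]);
  res.json(Object.fromEntries(byType.map(group => [group._id, group.docs])));
});

Either way, reshaping the grouped array into the keyed object you prefer is a one-liner on the server, so the response format need not dictate the choice.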

Validate documents during CouchDB replication process

During replication, I need to validate documents that a client tries to write to my CouchDB instance. Ideally, I should reject just "invalid" documents allowing all others to pass through. Other possible outcome might be to reject the whole replication process and do not accept any documents altogether. I cannot use the validate_doc_update function because it lacks all the necessary information to make a decision.
I thought about serving all the endpoints needed for replication from behind a service and validating docs at the application level. For example, take all the docs from POST /_bulk_docs and send back a 400 error response if some docs are invalid.
Do I understand it right that such an approach stops the replication process, and the database might be left with partially replicated documents? That's because documents are uploaded in chunks during replication, so there might be a couple of POST /_bulk_docs calls where the first one has all valid docs and the second invalid ones.
Is there another way, how can I discard only invalid docs?
Thanks for your help!
You can apply a Cloudant Query selector via your replication document to specify which documents are valid. The canonical example is to ditch tombstones:
{
  "_id": "repl01",
  "source": "https://.../<source>",
  "target": "https://.../<target>",
  "selector": {
    "_deleted": {
      "$exists": false
    }
  }
}
See https://cloud.ibm.com/docs/Cloudant?topic=Cloudant-replication-api#the-selector-field for more details.

Design a counting measure of the element number in a MongoDB array property in cube.js

I'm using cube.js with MongoDB through the MongoDB Connector for BI and the MongoBI driver, and so far so good. I'd like to have a cube.js numerical measure that counts the number of elements in a nested MongoDB array-of-objects property. Something like:
{
  "nested": {
    "arrayPropertyName": [
      {
        "name": "Leatha Bauch",
        "email": "Leatha.Bauch76#hotmail.com"
      },
      {
        "name": "Pedro Hermiston",
        "email": "Pedro76#hotmail.com"
      }
    ]
  }
}
I wasn't able to figure that out looking at the docs and I was wondering if that is even possible.
I tried with type: count:
MyNestedArrayPropertyCounter: {
sql: `${CUBE}.\`nested.arrayPropertyName\``,
type: `count`,
format: `number`,
},
but I'm getting
Error: Error: Unknown column 'nested.arrayPropertyName' in 'field list'
Any help/advice is really appreciated. Thanks
The BI connector treats nested arrays as separate relational tables. See https://www.mongodb.com/blog/post/introducing-the-mongodb-connector-for-bi-20
That's why you get the unknown column error: it's not part of the parent document's table.
So my guess is you have to build a schema on the nested array and then build a count measure with a dimension on the parent object id.
Hope it helps.
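A schema along those lines might look roughly like this (just a sketch: the flattened table name, the join column and the parent cube name are guesses; check what mongosqld actually exposes and adjust):

// Hypothetical cube over the table the BI connector generates for the nested array
cube(`ArrayPropertyItems`, {
  sql: `SELECT * FROM mydb.\`mycollection_nested.arrayPropertyName\``, // assumed table name

  joins: {
    MyCollection: {
      // MyCollection is an assumed cube over the parent collection
      relationship: `belongsTo`,
      sql: `${CUBE}.\`_id\` = ${MyCollection}.\`_id\``
    }
  },

  measures: {
    count: {
      type: `count`
    }
  },

  dimensions: {
    parentId: {
      sql: `${CUBE}.\`_id\``,
      type: `string`
    }
  }
});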
I followed Michael Parshin's advice and here are my findings and outcomes to overcome the problem:
LEFT JOIN approach with cube.js joins. I found it painfully slow, and most of the time it ended in a timeout even when the query was run through command-line SQL clients;
Launch mongosqld with --prejoin flag. That was a better option since mongosqld automatically adds master table columns/properties to the secondary tables thus enabling you to conveniently query cube.js measures without joining a secondary Cube;
Wrote a mongo script that iterates the collection, precalculates the nested.arrayPropertyName count and persists it in a separate property of each document (see the sketch after this list).
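A sketch of the script behind option 3, runnable in the mongo shell (the collection name and the new field name are assumptions; the pipeline-style update requires MongoDB 4.2+):

// Persist the array length in a separate field for every document that has the array
db.mycollection.updateMany(
  { "nested.arrayPropertyName": { $type: "array" } },
  [
    { $set: { arrayPropertyNameCount: { $size: "$nested.arrayPropertyName" } } }
  ]
);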
Conclusion
Leaving out option 1, option 3 significantly outperforms option 2: typically less than a second versus more than 20 seconds on my local machine. I compared both options with the same measure and different timeDimension ranges and granularities.
Most probably I'll incorporate the array count precalculation into the back-end logic that persists the mongo documents.

Mongodb: data versioning with search

Related to Ways to implement data versioning in MongoDB and structure of documents for versioning of a time series on mongodb
What data structure should I adopt for versioning when I also need to be able to handle queries?
Suppose I have 8500 documents of the form
{ _id: '12345-11',
noFTEs: 5
}
Each month I get details of a change to noFTEs in about 30 docs; I want to store the new data along with the previous value(s), together with a date.
That would seem to result in:
{ _id: '12345-11',
noFTEs: {
'2015-10-28T00:00:00+01:00': 5,
'2015-1-8T00:00:00+01:00': 3
}
}
But I also want to be able to do searches on the most recent data (e.g. noFTEs > 4, where the element should be considered as 5, not 3). At that stage all I know is that I want to use the most recent data, and I will not know the key. So an alternative would be an array
{ _id: '12345-11',
  noFTEs: [
    {date: '2015-10-28T00:00:00+01:00', val: 5},
    {date: '2015-1-8T00:00:00+01:00', val: 3}
  ]
}
Another alternative - as suggested by @thomasbormans in the comments below - would be
{ _id: '12345-11',
  versions: [
    {noFTEs: 5, lastModified: '2015-10-28T00:00:00+01:00', other data...},
    {noFTEs: 3, lastModified: '2015-1-8T00:00:00+01:00', other...}
  ]
}
I'd really appreciate some insights about the considerations I need to make before jumping all the way in; I fear I will end up with a query that is a pretty heavy workload for Mongo. (In practice there are 3 other fields that can be combined for searching, and one of these is also likely to see changes over time.)
When you model a noSQL database, there are some things you need to keep in mind.
First of all is the size of each document. If you use arrays in your documents, be sure they won't exceed the 16 MB size limit per document.
Second, you must model your database so that things can be retrieved easily. Some "denormalization" is acceptable in favor of speed and ease of use for your application.
So if you need to know the current noFTEs value, and you need the history only for audit purposes, you could go with 2 collections:
collection["current"] = [
{
_id: '12345-11',
noFTEs: 5,
lastModified: '2015-10-28T00:00:00+01:00'
}
]
collection["history"] = [
{ _id: ...an object id...
source_id: '12345-11',
noFTEs: 5,
lastModified: '2015-10-28T00:00:00+01:00'
},
{
_id: ...an object id...
source_id: '12345-11',
noFTEs: 3,
lastModified: '2015-1-8T00:00:00+01:00'
}
]
By doing it this way, you keep your most frequently accessed records smaller (I suppose the current version is accessed more often). This makes Mongo more likely to keep the "current" collection in its memory cache, and the documents will be retrieved faster from disk because they are smaller.
This design seems best in terms of memory optimisation. But the decision is directly related to what use you will make of your data.
EDIT: I changed my original response in order to create separate inserts for each history entry. In my original answer, I tried to keep your history entries close to your original solution to focus on the denormalization topic. However, keeping history in an array is a poor design decision and I decided to make this answer more complete.
The reasons to keep separate inserts in the history instead of growing an array are many:
1) Whenever you change the size of a document (for example, by inserting more data into it), Mongo may need to move the document to an empty part of the disk to accommodate the larger document. This way, you end up creating storage gaps that make your collections larger.
2) Whenever you insert a new document, Mongo tries to predict how big it can become based on previous inserts/updates. If your history documents are similar in size, the padding factor will be close to optimal. However, when you maintain growing arrays, this prediction won't be good and Mongo will waste space on padding.
3) In the future, you will probably want to shrink your history collection if it grows too large. Usually, you define a policy for history retention (for example, 5 years) and back up and prune data older than that. If you have kept a separate document for each history entry, this operation is much easier.
I can find other reasons, but I believe those 3 are enough to make the point.
To add versioning without compromising usability and speed of access for the most recent data, consider creating two collections: one with the most recent documents and one to archive the old versions of the documents when they get changed.
You can use currentVersionCollection.findAndModify to update a document while also receiving the previous (or new, depending on parameters) version of said document in one command. You then just need to remove the _id of the returned document, add a timestamp and/or revision number (when you don't have these already) and insert it into the archive collection.
By storing each old version in its own document you also avoid document growth and prevent documents from bursting the 16 MB document limit when they get changed a lot.
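A rough sketch of that flow with the Node.js driver (v4/v5, where findOneAndUpdate resolves to a { value } result; the collection and field names follow the examples above):

// Atomically apply the new value and archive the version it replaces
async function setNoFTEs(db, id, newNoFTEs) {
  const current = db.collection('current');
  const history = db.collection('history');

  // Get back the document as it was *before* the update.
  const result = await current.findOneAndUpdate(
    { _id: id },
    { $set: { noFTEs: newNoFTEs, lastModified: new Date() } },
    { returnDocument: 'before' }
  );

  if (result.value) {
    // Archive the previous version under its own _id, keeping a back-reference.
    const { _id, ...previous } = result.value;
    await history.insertOne({ ...previous, source_id: _id, archivedAt: new Date() });
  }
}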

MongoDB Bulk Insert where many documents already exist

I have a largish (~100) array of smallish documents (maybe 10 fields each) to insert in MongoDB. But many (perhaps all, but typically 80% or so) of them will already exist in the DB. The documents represent upcoming events over the next few months, and I'm updating the database every couple of days. So most of the events are already in there.
Anybody know (or want to guess) if it would be more efficient to:
Do the bulk insert but with continueOnError = true, e.g.
db.collection.insert(myArray, {continueOnError: true}, callback)
Do individual inserts, checking first if the _id exists?
First do a big remove (something like db.collection.remove({_id: {$in: [array of all the IDs in my new documents]}})), then a bulk insert?
I'll probably do #1 as that is the simplest, and I don't think that 100 documents is all that large so it may not matter, but what if there were 10,000 documents? I'm doing this in JavaScript with the node.js driver, if that matters. My background is in Java, where exceptions are time consuming and that's the main reason I'm asking - will the "continueOnError" option be time consuming???
ADDED: I don't think "upsert" makes sense. That is for updating an individual document. In my case, the individual document, representing an upcoming event, is not changing. (well, maybe it is, that's another issue)
What's happening is that a few new documents will be added.
My background is in Java where exceptions are time consuming and that's the main reason I'm asking - will the "continueOnError" option be time consuming???
The ContinueOnError flag for Bulk Inserts only affects the behaviour of the batch processing: rather than stopping processing on the first error encountered, the full batch will be processed.
In MongoDB 2.4 you will only get a single error for the batch, which will be the last error encountered. This means if you do care about catching errors you would be better doing individual inserts.
The main time savings for bulk insert vs single insert is reduced network round trips. Instead of sending a message to the MongoDB server per document inserted, drivers can break down bulk inserts into batches of up to the MaxMessageSizeBytes accepted by the mongod server (currently 48Mb).
Are bulk inserts appropriate for this use case?
Given your use case of only 100s (or even 1000s) of documents to insert where 80% already exist, there may not be a huge benefit in using bulk inserts (especially if this process only happens every few days). Your small inserts will be combined in batches, but 80% of the documents don't actually need to be sent to the server.
I would still favour bulk insert with ContinueOnError over your approach of deletion and re-insertion, but bulk inserts may be an unnecessary early optimisation given the number of documents you are wrangling and the percentage that actually need to be inserted.
I would suggest doing a few runs with the different approaches to see what the actual impact is for your use case.
MongoDB 2.6
As a heads-up, the batch functionality is being significantly improved in the MongoDB 2.5 development series (which will culminate in the 2.6 production release). Planned features include support for bulk upserts and accumulating per-document errors rather than a single error per batch.
The new write commands will require driver changes to support, but may change some of the assumptions above. For example, with ContinueOnError using the new batch API you could end up getting a result back with the 80% of your batch IDs that are duplicate keys.
For more details, see the parent issue SERVER-9038 in the MongoDB issue tracker.
collection.insert(item, {continueOnError: true, safe: true}, function (err, result) {
  // Ignore duplicate-key errors (code 11000); re-throw anything else.
  if (err && err.code !== 11000) {
    throw err;
  }
  db.close();
  callBack();
});
For your case, I'd suggest you consider fetching a list of the existing document _ids, and then only sending the documents that aren't in that list already. While you could use update with upsert to update individually, there's little reason to do so. Unless the list of _ids is extremely long (tens of thousands), it would be more efficient to grab the list and do the comparison than do individual updates to the database for each document (with some large percentage apparently failing to update).
I wouldn't use the continueOnError and send all documents ... it's less efficient.
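A sketch of that approach with the Node.js driver (the 'events' collection name is a placeholder; myArray is the array from the question):

// Insert only the documents whose _id isn't already in the collection
async function insertNewEvents(db, myArray) {
  const ids = myArray.map(doc => doc._id);
  const existing = await db.collection('events')
    .find({ _id: { $in: ids } })
    .project({ _id: 1 })
    .toArray();
  const existingIds = new Set(existing.map(doc => String(doc._id)));
  const newDocs = myArray.filter(doc => !existingIds.has(String(doc._id)));
  if (newDocs.length > 0) {
    await db.collection('events').insertMany(newDocs);
  }
}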
I'd vouch for using an upsert to let Mongo deal with the update-or-insert logic; you can also use multi to update multiple documents that match your criteria:
From the documentation:
upsert
Optional parameter, if set to true, creates a new document when no document matches the query criteria. The default value is false, which does not insert a new document when no match is found. The syntax for this parameter depends on the MongoDB version. See Upsert Parameter.
multi
Optional parameter, if set to true, updates multiple documents that meet the query criteria. If set to false, updates one document. The default value is false. For additional information, see Multi Parameter.
db.collection.update(
<query>,
<update>,
{ upsert: <boolean>, multi: <boolean> }
)
Here is the referenced documentation:
http://docs.mongodb.org/manual/reference/method/db.collection.update/
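For reference, with a modern Node.js driver the upsert route could be batched into a single unordered bulkWrite, roughly like this (the 'events' collection name is a placeholder; myArray is the array from the question):

// One unordered bulkWrite of upserts covers both the new events and the ~80% that already exist
async function upsertEvents(db, myArray) {
  const ops = myArray.map(({ _id, ...fields }) => ({
    updateOne: {
      filter: { _id },
      update: { $set: fields },
      upsert: true
    }
  }));
  await db.collection('events').bulkWrite(ops, { ordered: false });
}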
