MongoDB: data versioning with search - node.js

Related to "Ways to implement data versioning in MongoDB" and "Structure of documents for versioning of a time series on MongoDB".
What data structure should I adopt for versioning when I also need to be able to handle queries?
Suppose I have 8500 documents of the form
{ _id: '12345-11',
  noFTEs: 5
}
Each month I get details of a change to noFTEs in about 30 docs. I want to store the new data along with the previous one(s), together with a date.
That would seem to result in:
{ _id: '12345-11',
  noFTEs: {
    '2015-10-28T00:00:00+01:00': 5,
    '2015-1-8T00:00:00+01:00': 3
  }
}
But I also want to be able to do searches on the most recent data (e.g. noFTEs > 4, and the element should be considered as 5, not 3). At that stage all I know is that I want to use the most recent data, and I will not know the key. So an alternative would be an array
{ _id: '12345-11',
  noFTEs: [
    {date: '2015-10-28T00:00:00+01:00', val: 5},
    {date: '2015-1-8T00:00:00+01:00', val: 3}
  ]
}
Another alternative - as suggested by @thomasbormans in the comments below - would be
{ _id: '12345-11',
  versions: [
    {noFTEs: 5, lastModified: '2015-10-28T00:00:00+01:00', other data...},
    {noFTEs: 3, lastModified: '2015-1-8T00:00:00+01:00', other data...}
  ]
}
I'd really appreciate some insights about the considerations I need to make before jumping all the way in; I fear I will end up with queries that put a pretty high workload on Mongo. (In practice there are 3 other fields that can be combined for searching, and one of these is also likely to see changes over time.)

When you model a noSQL database, there are some things you need to keep in mind.
First of all, consider the size of each document. If you use arrays in your documents, be sure they won't exceed the 16 MB size limit per document.
Second, you must model your database so that data is easy to retrieve. Some "denormalization" is acceptable in favor of speed and ease of use for your application.
So if you need to know the current noFTEs value, and you need to keep the history only for audit purposes, you could go with 2 collections:
collection["current"] = [
{
_id: '12345-11',
noFTEs: 5,
lastModified: '2015-10-28T00:00:00+01:00'
}
]
collection["history"] = [
{ _id: ...an object id...
source_id: '12345-11',
noFTEs: 5,
lastModified: '2015-10-28T00:00:00+01:00'
},
{
_id: ...an object id...
source_id: '12345-11',
noFTEs: 3,
lastModified: '2015-1-8T00:00:00+01:00'
}
]
By doing it this way, you keep your most frequently accessed records smaller (I assume the current version is the most frequently accessed). This makes Mongo more likely to keep the "current" collection in its memory cache, and documents will be retrieved faster from disk because they are smaller.
This design seems best in terms of memory optimisation, but the decision depends directly on how you will use your data.
EDIT: I changed my original response to create separate inserts for each history entry. In my original answer, I tried to keep your history entries close to your original solution to focus on the denormalization topic. However, keeping history in an array is a poor design decision, so I decided to make this answer more complete.
There are several reasons to keep separate inserts in the history instead of creating an array:
1) Whenever you change the size of a document (for example, by inserting more data into it), Mongo may need to move the document to an empty part of the disk to accommodate the larger document. This creates storage gaps and makes your collections larger.
2) Whenever you insert a new document, Mongo tries to predict how big it may become based on previous inserts/updates. If your history documents are of similar size, the padding factor will be close to optimal. However, when you maintain growing arrays, this prediction won't be good and Mongo will waste space on padding.
3) In the future, you will probably want to shrink your history collection if it grows too large. Usually you define a history retention policy (for example, 5 years) and back up and prune data older than that. If you keep a separate document for each history entry, this operation is much easier.
I can find other reasons, but I believe those 3 are enough to make the point.
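A minimal Node.js sketch of this two-collection pattern, assuming a connected driver Db handle; the collection names "current" and "history" and the connection string are illustrative assumptions, not taken from the answer above:
// Sketch only: update the frequently-read "current" doc and append the new
// state to "history" as its own small document.
const { MongoClient } = require('mongodb');

async function setNoFTEs(db, sourceId, newValue, when) {
  // Overwrite the current state (upsert in case the doc doesn't exist yet).
  await db.collection('current').updateOne(
    { _id: sourceId },
    { $set: { noFTEs: newValue, lastModified: when } },
    { upsert: true }
  );
  // Keep every historical state as a separate insert, as discussed above.
  await db.collection('history').insertOne({
    source_id: sourceId,
    noFTEs: newValue,
    lastModified: when
  });
}

// Usage (connection string and database name are assumptions):
// const client = await MongoClient.connect('mongodb://localhost:27017');
// await setNoFTEs(client.db('mydb'), '12345-11', 5, new Date());
Searches on the latest data then stay simple and index-friendly, e.g. db.current.find({ noFTEs: { $gt: 4 } }).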

To add versioning without compromising usability and speed of access for the most recent data, consider creating two collections: one with the most recent documents and one to archive the old versions of the documents when they get changed.
You can use currentVersionCollection.findAndModify to update a document while also receiving the previous (or new, depending on parameters) version of that document in one command. You then just need to remove the _id of the returned document, add a timestamp and/or revision number (if you don't have these already), and insert it into the archive collection.
By storing each old version in its own document, you also avoid document growth and prevent documents from hitting the 16 MB document limit when they get changed a lot.
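A rough sketch of that flow with the Node.js driver, where findOneAndUpdate is the driver-level wrapper around findAndModify; the collection names and the revisedAt field are assumptions for illustration:
// Sketch only: apply an update and archive the previous version in one pass.
// Assumes driver 4.x/5.x, where the result wraps the document as `value`.
async function updateWithArchive(db, id, changes) {
  const result = await db.collection('current').findOneAndUpdate(
    { _id: id },
    { $set: changes },
    { returnDocument: 'before' }          // hand back the pre-update version
  );
  const previous = result && result.value;
  if (previous) {
    const { _id, ...rest } = previous;    // drop _id so the archive gets a fresh one
    await db.collection('archive').insertOne({
      ...rest,
      source_id: _id,
      revisedAt: new Date()
    });
  }
  return previous;
}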

Related

Which method should I use to store data in firebase firestore document?

I have this sample data of two objects. Both can be put in a document with the two structures below, and it's easy to perform CRUD with both methods. But I want to know which one is more efficient.
Structure 1:
key1:{ sr.:1, name:'Raj', city: 'Mumbai'}
key2:{ sr.:2, name:'Aman', city: 'Delhi'}
It's easy to create different objects inside a single document using the merge property, and deletion can be performed using the code below.
db.collection('colName')
  .doc('docName')
  .update({
    [key1]: firebase.firestore.FieldValue.delete(),
  })
Structure 2:
It is basically objects in an array.
arr:[ { sr.:1, name:'Raj', city: 'Mumbai'} ,
{ sr.:2, name:'Aman', city: 'Delhi'} ]
Data can be pushed into the array arr using the code below.
['arr']: firebase.firestore.FieldValue.arrayUnion(object3)
And deletion can be performed like this (note that arrayRemove matches elements by value, not by index).
['arr']: firebase.firestore.FieldValue.arrayRemove(objectToBeDeleted)
Which one is more efficient when it comes to CRUD operations?
CRUD covers 4 different operations, each with its own measurable attributes. Talking about CRUD in the context of Firestore adds even more attributes on top of those.
There are Firestore limits/quotas: https://cloud.google.com/firestore/quotas
And there are Firestore costs: https://firebase.google.com/docs/firestore/pricing
Firestore charges per read.
Storing all your data into one document is cost efficient.
Firestore is optimized for reads.
In the limits/quotas document you may notice that there is a max write rate, to a document, of 1 per second. How frequently would you plan on writing new data into the array of that 1 document? Is 1 document still efficient?
Firestore has a max document size of 1MB.
Are you going to write more than 1 MB to a document? After adding the logic to split your document apart, is it still efficient?
There are many aspects to think about in designing your data structures. An efficiency of one quality is bound to create inefficiencies in another.

MongoDB: is replacing an array with a new version more efficient than adding elements to it?

I have a single /update-user endpoint on my server that triggers an updateUser query on mongo.
Basically, I retrieve the user id from the cookie and inject the received form, which can contain any key allowed in the User model, into the mongo query.
It looks like:
const form = {
  friends: [{id: "1", name: "paul", thumbnail: "www.imglink.com"},
            {id: "2", name: "joe", thumbnail: "www.imglink2.com"}],
  locale: "en",
  age: 77
}
function updateUser(form, _id){
  // $set expects a plain object of field/value pairs, not a JSON string
  return UserDB.findOneAndUpdate({ _id }, { $set: form })
}
So each time, I erase the existing data and replace it with brand new data. Sometimes this data can be an array of 50 objects (say I've removed two people from a 36-friend array like the one described above).
It is very convenient, because I can abstract all the logic both in the front and back with a single update function. However, is this a pure heresy from a performance point of view? Should I rather use 10 endpoints to update each part of the form?
The form is dynamic, I never know what is going to be inside, except that it belongs to the User model, this is why I've used this strategy.
From MongoDB's point of view, it doesn't matter much. MongoDB is a journalled database (particularly with the WiredTiger storage engine), and it probably (re)writes a large part of the document on update. It might make a minor difference under very heavy loads when replicating the oplog between primary and replicas, but if you have performance constraints like these, you'll know. If in doubt, benchmark and monitor your production system - don't over-optimize.
Focus on what's best for the business domain. Is your application collaborative? Do multiple users edit the same documents at the same time? What happens when they overwrite one another's changes? Are the JSONs that the client sends to the back-end large, or do they not clog up the network? These are the most important questions you should ask, and performance should only be optimized once you have the UX, the interaction model and the concurrency issues nailed.
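For reference, a hedged sketch of the two styles being weighed here: replacing the whole array with $set versus targeted $push / $pull updates. The UserDB model and friends field come from the question; everything else is illustrative.
// 1) Replace the whole field with the client's new version (the approach above).
function replaceFriends(_id, friends) {
  return UserDB.findOneAndUpdate({ _id }, { $set: { friends } });
}

// 2) Targeted edits: add or remove one friend without resending the array.
function addFriend(_id, friend) {
  return UserDB.findOneAndUpdate({ _id }, { $push: { friends: friend } });
}
function removeFriend(_id, friendId) {
  return UserDB.findOneAndUpdate({ _id }, { $pull: { friends: { id: friendId } } });
}
As the answer notes, the choice between them is usually driven by concurrency and payload size rather than raw MongoDB performance.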

MongoDB document insertion order

I have a mongodb collection for tracking user audit data. So essentially this will be many millions of documents.
Audits are tracked by loginID (user) and their activities on items. Example: userA modified 'item#13' on date/time.
Case: I need to query with filters based on user and item. That's simple. This returns many thousands of documents per item. I need to list them by latest date/time (descending order).
Problem: How can I insert new documents at the top of the stack (like a capped collection)? Or is it possible to find records from the bottom of the stack (reverse order)? I do NOT like the idea of find-and-sort because, when dealing with thousands and millions of documents, sorting is a bottleneck.
Any solutions?
Stack: mongodb, node.js, mongoose.
Thanks!
the top of the stack?
you're implying there is a stack, but there isn't - there's a tree, or more precisely, a B-Tree.
I do NOT like the idea of find and sorting
So you want to sort without sorting? That doesn't seem to make much sense. Stacks are essentially in-memory data structures; they don't work well on disk because they require huge contiguous blocks (in fact, huge stacks don't even work well in memory, and growing a stack requires copying the entire data set, which would hardly work).
sorting is a bottleneck
It shouldn't be, at least not for data that is stored closely together (data locality). Sorting is an O(m log n) operation, and since the _id field already encodes a timestamp, you already have a field that you can sort on. m is relatively small, so I don't see the problem here. Have you even tried that? With MongoDB 3.0, index intersection has become more powerful, so you might not even need _id in the compound index.
On my machine, getting the top items from a large collection, filtered by an index takes 1ms ("executionTimeMillis" : 1) if the data is in RAM. The sheer network overhead will be in the same league, even on localhost. I created the data with a simple network creation tool I built and queried it from the mongo console.
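For instance, a compound index on the filter fields plus a descending _id lets MongoDB return the newest matching audits without an in-memory sort. The audits collection and field names below are assumptions based on the question:
// Sketch only, mongo shell syntax.
// Equality fields first, then _id descending for newest-first output.
db.audits.createIndex({ loginID: 1, item: 1, _id: -1 })

// Newest 50 audit entries for one user and item; the sort is satisfied by the
// index (verify with .explain()), so no in-memory sort takes place.
db.audits.find({ loginID: "userA", item: "item#13" })
         .sort({ _id: -1 })
         .limit(50)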
I have encountered the same problem. My solution is to create an additional collection which maintains the top 10 records. The good point is that you can query it quickly; the bad point is that you need to update the additional collection.
I found this, which inspired me. I implemented my solution with Ruby + Mongoid.
My solution:
Collection definition:
class TrainingTopRecord
  include Mongoid::Document
  field :training_records, :type => Array
  belongs_to :training
  index({training_id: 1}, {unique: true, drop_dups: true})
end
Maintenance process:
# `t` and `r` come from the surrounding application code: the record whose
# top-10 list is being maintained and the newly inserted result.
if t.training_top_records == nil
  training_top_records = TrainingTopRecord.create! training_id: t.id
else
  training_top_records = t.training_top_records
end
training_top_records.training_records = [] if training_top_records.training_records == nil
top_10_records = training_top_records.training_records
top_10_records.push({
  'id' => r.id,
  'return' => r.return
})
top_10_records.sort_by! { |record| -record['return'] }
# limit training_records' size to 10
top_10_records.slice! 10, top_10_records.length - 10
training_top_records.save
MongoDB's ObjectId is structured in a way that gives it a natural insertion ordering (its leading bytes encode a timestamp).
This means the last inserted item sorts last.
You can reverse that during a fetch by sorting on _id: db.collectionName.find().sort({ _id: -1 }).
Filters can then follow.
You will not need to create any additional indices, since this works on _id, which is indexed by default.
This is possibly the only efficient way you can achieve what you want.
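The ordering comes from the timestamp embedded in the ObjectId, which you can also exploit for range filters; a small sketch with the Node.js driver (the audits collection name is an assumption):
// Sketch only: the leading bytes of an ObjectId encode its creation time.
const { ObjectId } = require('mongodb');

const id = new ObjectId();
console.log(id.getTimestamp());   // Date the id was generated

// An ObjectId built from a date works as a range bound, e.g. to fetch only
// audits inserted after a given moment, newest first:
const since = ObjectId.createFromTime(Date.parse('2015-01-01') / 1000);
// db.collection('audits').find({ _id: { $gt: since } }).sort({ _id: -1 })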

CouchDB View that includes revision history

I am very new to CouchDB. Missing SQL already.
Anyways, I need to create a view that emits a few attributes of my documents along with all the revision IDs.
Something like this:
function(doc) {
  if (doc.type == 'template') {
    emit(doc.owner, {
      _id: doc._id,
      _rev: doc._rev,
      owner: doc.owner,
      meta: doc.meta,
      key: doc.key,
      revisions_ids: /* What goes here? */
    });
  }
}
But how do I tell it to include all the revisions?
I know I can call
http://localhost:5984/main/94c4db9eb51f757ceab86e4a9b00cddf
for each document (from my app), but that really does not scale well.
Is there a batch way to fetch revision info?
Any help would be appreciated!!
CouchDB revisions are not intended to be a version control system. They are only used for ensuring write consistency. (and preventing the need for locks during concurrent writes)
That being said, only the most recent _rev number is useful for any given doc. Not only that, but a database compaction will delete all the old revisions as well. (A compaction is never run automatically, but it should be part of routine maintenance.)
As you may have already noticed, your view outputs the most recent _rev number in the value of your view output. Also, if you are using include_docs=true, then the _rev number is also shown in the doc portion of your view result.
Strategies do exist for using CouchDB for revision history, but they are generally complicated, and not usually recommended. (check out this question and this blogpost for more information on that subject)

MongoDB Bulk Insert where many documents already exist

I have a largish (~100) array of smallish documents (maybe 10 fields each) to insert in MongoDB. But many (perhaps all, but typically 80% or so) of them will already exist in the DB. The documents represent upcoming events over the next few months, and I'm updating the database every couple of days. So most of the events are already in there.
Anybody know (or want to guess) if it would be more efficient to:
Do the bulk insert but with continueOnError = true, e.g.
db.collection.insert(myArray, {continueOnError: true}, callback)
Do individual inserts, checking first if the _id exists?
First do a big remove (something like db.collection.remove({_id: {$in: [array of all the IDs in my new documents]}})), then a bulk insert?
I'll probably do #1 as that is the simplest, and I don't think that 100 documents is all that large so it may not matter, but if there were 10,000 documents? I'm doing this in JavaScript with the node.js driver if that matters. My background is in Java where exceptions are time consuming and that's the main reason I'm asking - will the "continueOnError" option be time consuming???
ADDED: I don't think "upsert" makes sense. That is for updating an individual document. In my case, the individual document, representing an upcoming event, is not changing. (well, maybe it is, that's another issue)
What's happening is that a few new documents will be added.
My background is in Java where exceptions are time consuming and that's the main reason I'm asking - will the "continueOnError" option be time consuming???
The ContinueOnError flag for Bulk Inserts only affects the behaviour of the batch processing: rather than stopping processing on the first error encountered, the full batch will be processed.
In MongoDB 2.4 you will only get a single error for the batch, which will be the last error encountered. This means if you do care about catching errors you would be better doing individual inserts.
The main time savings for bulk insert vs single insert is reduced network round trips. Instead of sending a message to the MongoDB server per document inserted, drivers can break down bulk inserts into batches of up to the MaxMessageSizeBytes accepted by the mongod server (currently 48Mb).
Are bulk inserts appropriate for this use case?
Given your use case of only 100s (or even 1000s) of documents to insert where 80% already exist, there may not be a huge benefit in using bulk inserts (especially if this process only happens every few days). Your small inserts will be combined in batches, but 80% of the documents don't actually need to be sent to the server.
I would still favour bulk insert with ContinueOnError over your approach of deletion and re-insertion, but bulk inserts may be an unnecessary early optimisation given the number of documents you are wrangling and the percentage that actually need to be inserted.
I would suggest doing a few runs with the different approaches to see what the actual impact is for your use case.
MongoDB 2.6
As a heads-up, the batch functionality is being significantly improved in the MongoDB 2.5 development series (which will culminate in the 2.6 production release). Planned features include support for bulk upserts and accumulating per-document errors rather than a single error per batch.
The new write commands will require driver changes to support, but may change some of the assumptions above. For example, with ContinueOnError using the new batch API you could end up getting a result back with the 80% of your batch IDs that are duplicate keys.
For more details, see the parent issue SERVER-9038 in the MongoDB issue tracker.
collection.insert(item, {continueOnError: true, safe: true}, function(err, result) {
  // 11000 is MongoDB's duplicate key error code; ignore it, rethrow anything else
  if (err && err.code !== 11000) {
    throw err;
  }
  db.close();
  callBack();
});
For your case, I'd suggest you consider fetching a list of the existing document _ids, and then only sending the documents that aren't in that list already. While you could use update with upsert to update individually, there's little reason to do so. Unless the list of _ids is extremely long (tens of thousands), it would be more efficient to grab the list and do the comparison than do individual updates to the database for each document (with some large percentage apparently failing to update).
I wouldn't use the continueOnError and send all documents ... it's less efficient.
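A rough Node.js sketch of that suggestion, assuming a hypothetical events collection and an incoming array of event documents that already carry their _id:
// Sketch only: fetch the existing _ids in one query, then insert just the
// documents that are actually new.
async function insertNewEvents(db, incoming) {
  const events = db.collection('events');
  const ids = incoming.map(doc => doc._id);

  // One round trip: which of these _ids are already stored?
  const existing = await events
    .find({ _id: { $in: ids } }, { projection: { _id: 1 } })
    .toArray();
  const existingIds = new Set(existing.map(doc => String(doc._id)));

  const toInsert = incoming.filter(doc => !existingIds.has(String(doc._id)));
  if (toInsert.length > 0) {
    await events.insertMany(toInsert);   // only the ~20% that are new
  }
  return toInsert.length;
}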
I'd vouch for using an upsert to let Mongo deal with the update-or-insert logic; you can also use multi to update multiple documents that match your criteria:
From the documentation:
upsert
Optional parameter, if set to true, creates a new document when no document matches the query criteria. The default value is false, which does not insert a new document when no match is found. The syntax for this parameter depends on the MongoDB version. See Upsert Parameter.
multi
Optional parameter, if set to true, updates multiple documents that meet the query criteria. If set to false, updates one document. The default value is false. For additional information, see Multi Parameter.
db.collection.update(
  <query>,
  <update>,
  { upsert: <boolean>, multi: <boolean> }
)
Here is the referenced documentation:
http://docs.mongodb.org/manual/reference/method/db.collection.update/
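Applied to the events use case from the question, a per-document upsert might look like this sketch (the events collection name and the incoming array are assumptions); $setOnInsert writes the fields only when the document does not exist yet, leaving existing events untouched:
// Sketch only: upsert each incoming event, keyed by its pre-assigned _id.
async function upsertEvents(db, incoming) {
  const events = db.collection('events');
  for (const doc of incoming) {
    const { _id, ...fields } = doc;
    await events.updateOne(
      { _id },
      { $setOnInsert: fields },
      { upsert: true }
    );
  }
}
With newer drivers, the same per-document upserts can be batched into a single events.bulkWrite(...) call to cut down on round trips.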
