Mongoose 4.4 now has an insertMany function which lets you validate an array of documents and insert them if valid all with one operation, rather than one for each document:
var arr = [{ name: 'Star Wars' }, { name: 'The Empire Strikes Back' }];
Movies.insertMany(arr, function(error, docs) {});
If I have a very large array, should I batch these? Or is there no limit on the size or array?
For example, I want to create a new document for every Movie, and I have 10,000 movies.
I'd recommend based on personal experience, batch of 100-200 gives you good blend of performance without putting strain on your system.
insertMany group of operations can have at most 1000 operations. If a group exceeds this limit, MongoDB will divide the group into smaller groups of 1000 or less. For example, if the queue consists of 2000 operations, MongoDB creates 2 groups, each with 1000 operations.
The sizes and grouping mechanics are internal performance details and are subject to change in future versions.
Executing an ordered list of operations on a sharded collection will generally be slower than executing an unordered list since with an ordered list, each operation must wait for the previous operation to finish.
Mongo 3.6 update:
The limit for insertMany() has increased in Mongo 3.6 from 1000 to 100,000.
Meaning that now, groups above 100,000 operations will be divided into smaller groups accordingly.
For example: a queue that has 200,000 operations will be split and Mongo will create 2 groups of 100,000 each.
The method takes that array and starts inserting them through the insertMany method in MongoDB, so the size of the array itself actually depends on how much your machine can handle.
But please note that there is another point, which is not a limitation but something worth keeping into consideration, on how MongoDB deals with multiple operations, by default it handles a batch of 1000 operations at a time and splits whats more than that.
Related
I have several large "raw" collections of documents which are processed in a queue, and the processed results are all placed into a single collection.
The queue only runs when the system isn't otherwise indisposed, and new data is being added into the "raw" collections all the time.
What I need to do is make sure the queue knows which documents it has already processed, so it doesn't either (a) process any documents more than once, or (b) skip documents. Updating each raw record with a "processed" flag as I go isn't a good option because it adds too much overhead.
I'm using MongoDB 4.x, with NodeJS and Mongoose. (I don't need a strictly mongoose-powered answer, but one would be OK).
My initial attempt was to do this by retrieving the raw documents sorted by _id in a smallish batch (say 100), then grabbing the first and last _id values in the return result, and storing those values, so when I'm ready to process the next batch, I can limit my find({}) query to records with an _id greater than what I stored as the last-processed result.
But looking into it a bit more, unless I'm misunderstanding something, it appears I can't really count on a strict ordering by _id.
I've looked into ways to implement an auto-incrementing numeric ID field (SQL style), which would have a strict ordering, but the solutions I've seen look like they add a nontrivial amount of overhead each time I create a record (not dissimilar to what it would take to mark processed records, just would be on the insertion end instead of the processing end), and this system needs to process a LOT of records very fast.
Any ideas? Is there a way to do an auto-incrementing numeric ID that's super efficient? Will default _id properties actually work in this case and I'm misunderstanding? Is there some other way to do it?
As per the documentation of ObjectID:
While ObjectId values should increase over time, they are not
necessarily monotonic. This is because they:
Only contain one second of temporal resolution, so ObjectId values created within the same second do not have a guaranteed ordering, and
Are generated by clients, which may have differing system clocks.
So if you are creating that many records per second then _id ordering is not for you.
However Timestamp within a mongo instance is guaranteed to be unique.
BSON has a special timestamp type for internal MongoDB use and is not
associated with the regular Date type. Timestamp values are a 64 bit
value where:
the first 32 bits are a time_t value (seconds since the Unix epoch)
the second 32 bits are an incrementing ordinal for operations within a
given second.
Within a single mongod instance, timestamp values are always unique.
Although it clearly states that this is for internal use it maybe something for you to consider. Assuming you are dealing with a single mongod instance you can decorate your records when they are getting into the "raw" collections with timestamps ... then you could remember the last processed record only. Your queue would only pick records with timestamps larger that the last processed timestamp.
I am having 20 millions documents in my collection.I was coming across slow response issue.So I wrote a script to divide this collection into multiple collections.So now I am having one collection of 1.3 million documents.I took a random query to check response,so for a same query I am getting response time on bigger collection is around 19000ms & smaller collection is around 13000ms.I am using mongoose lib on node to connect with mongodb.
I am new to mongodb,so please tell me in which direction I should look to these issue.
MongoDB cannot perform JOINS, so there is little gain from combining collections.But With MongoDB, and most other NoSQL databases, Its easy to use separate databases in a single instance. So if your data sets is large you can group them into different collections as per range keys,etc. Further breaking down those collections into different databases to gain performance. I'm little sceptical about the gain in performance but on theory it should.
for instance : Having two database to serve different data over entire million data sets in a single instance
db["collectionSet1"].AppCollection.find({application_status: {$in: ["pending", "approved", "rejected"]}}).sort({_id: -1})
db["collectionSet2"].OrigCollection.find({origination_status: {$in: ["originated", "approved", "rejected"]}}).sort({_id: -1})
I'm using ArangoDB for a Web Application through Strongloop.
I've got some performance problem when I run this query:
FOR result IN Collection SORT result.field ASC RETURN result
I added some index to speed up the query like skiplist index on the field sorted.
My Collection has inside more than 1M of records.
The application is hosted on n1-highmem-2 on Google Cloud.
Below some specs:
2 CPUs - Xeon E5 2.3Ghz
13 GB of RAM
10GB SSD
Unluckly, my query spend a lot of time to ending.
What can I do?
Best regards,
Carmelo
Summarizing the discussion above:
If there is a skiplist index present on the field attribute, it could be used for the sort. However, if its created sparse it can't. This can be revalidated by running
db.Collection.getIndexes();
in the ArangoShell. If the index is present and non-sparse, then the query should use the index for sorting and no additional sorting will be required - which can be revalidated using Explain.
However, the query will still build a huge result in memory which will take time and consume RAM.
If a large result set is desired, LIMIT can be used to retrieve slices of the results in several chunks, which will cause less stress on the machine.
For example, first iteration:
FOR result IN Collection SORT result.field LIMIT 10000 RETURN result
Then process these first 10,000 documents offline, and note the result value of the last processed document.
Now run the query again, but now with an additional FILTER:
FOR result IN Collection
FILTER result.field > #lastValue LIMIT 10000 RETURN result
until there are no more documents. That should work fine if result.field is unique.
If result.field is not unique and there are no other unique keys in the collection covered by a skiplist, then the described method will be at least an approximation.
Note also that when splitting the query into chunks this won't provide snapshot isolation, but depending on the use case it may be good enough already.
I have live time-series data generated by a light sensor, and presented as a rapidly changing (refreshing about every 20 milliseconds) variable in the public javascript file. How can I store them into mongo efficiently? Could anybody give me some suggestions about the best practices?
This sounds like a good case for using mongodb's Capped Collections.
Capped collections are fixed-size collections that support high-throughput operations that insert and retrieve documents based on insertion order. Capped collections work in a way similar to circular buffers: once a collection fills its allocated space, it makes room for new documents by overwriting the oldest documents in the collection.
You could insert each light sensor measurement as a new document in a Capped Collection, then you can efficiently retrieve the measurements in the same order as they were inserted, and also not have to worry about running out of storage space.
I have a mongodb collection for tracking user audit data. So essentially this will be many millions of documents.
Audits are tracked by loginID (user) and their activities on items. example: userA modified 'item#13' on date/time.
Case: I need to query with filters based on user and item. That's Simple. This returns many thousands of documents per item. I need to list them by latest date/time (descending order).
Problem: How can I insert new documents to the top of the stack? (like a capped collection) or Is it possible to find records from the bottom of the stack? (reverse order). I do NOT like the idea of find and sorting because when dealing with thousand and millions of documents sorting is a bottleneck.
Any solutions?
Stack: mongodb, node.js, mongoose.
Thanks!
the top of the stack?
you're implying there is a stack, but there isn't - there's a tree, or more precisely, a B-Tree.
I do NOT like the idea of find and sorting
So you want to sort without sorting? That doesn't seem to make much sense. Stacks are essentially in-memory data structures, they don't work well on disks because they require huge contiguous blocks (in fact, huge stacks don't even work well in memory, and growing stacks requires copying the entire data set, that would hardly work
sorting is a bottleneck
It shouldn't be, at least not for data that is stored closely together (data locality). Sorting is an O(m log n) operation, and since the _id field already encodes a timestamp, you already have a field that you can sort on. m is relatively small, so I don't see the problem here. Have you even tried that? With MongoDB 3.0, index intersectioning has become more powerful, you might not even need _id in the compound index.
On my machine, getting the top items from a large collection, filtered by an index takes 1ms ("executionTimeMillis" : 1) if the data is in RAM. The sheer network overhead will be in the same league, even on localhost. I created the data with a simple network creation tool I built and queried it from the mongo console.
I have encountered the same problem. My solution is to create another additional collection which maintain top 10 records. The good point is that you can query quickly. The bad point is you need update additional collection.
I found this which inspired me. I implemented my solution with ruby + mongoid.
My solution:
collection definition
class TrainingTopRecord
include Mongoid::Document
field :training_records, :type=>Array
belongs_to :training
index({training_id: 1}, {unique: true, drop_dups: true})
end
maintain process.
if t.training_top_records == nil
training_top_records = TrainingTopRecord.create! training_id: t.id
else
training_top_records = t.training_top_records
end
training_top_records.training_records = [] if training_top_records.training_records == nil
top_10_records = training_top_records.training_records
top_10_records.push({
'id' => r.id,
'return' => r.return
})
top_10_records.sort_by! {|record| -record['return']}
#limit training_records' size to 10
top_10_records.slice! 10, top_10_records.length - 10
training_top_records.save
MongoDb's ObjectId is structured in a way that has natural ordering.
This means the last inserted item is fetched last.
You can override that by using: db.collectionName.find().sort({ $natural: -1 }) during a fetch.
Filters can then follow.
You will not need to create any additional indices since this works on _id, which is indexed by default.
This is possibly the only efficient way you can achieve what you want.