Mongoose Results' Default Order [duplicate] - node.js

When we run a Mongo find() query without any sort order specified, what does the database internally use to sort the results?
According to the documentation on the mongo website:
When executing a find() with no parameters, the database returns
objects in forward natural order.
For standard tables, natural order is not particularly useful because,
although the order is often close to insertion order, it is not
guaranteed to be. However, for Capped Collections, natural order is
guaranteed to be the insertion order. This can be very useful.
However for standard collections (non capped collections), what field is used to sort the results?
Is it the _id field or something else?
Edit:
Basically, I guess what I am trying to get at is that if I execute the following search query:
db.collection.find({"x":y}).skip(10000).limit(1000);
At two different points in time: t1 and t2, will I get different result sets:
When there have been no additional writes between t1 & t2?
When there have been new writes between t1 & t2?
There are new indexes that have been added between t1 & t2?
I have run some tests on a temp database and the results I have gotten are the same (Yes) for all the 3 cases - but I wanted to be sure and I am certain that my test cases weren't very thorough.

What is the default sort order when none is specified?
The default internal sort order (or natural order) is an undefined implementation detail. Maintaining order is extra overhead for storage engines and MongoDB's API does not mandate predictability outside of an explicit sort() or the special case of fixed-sized capped collections which have associated usage restrictions. For typical workloads it is desirable for the storage engine to try to reuse available preallocated space and make decisions about how to most efficiently store data on disk and in memory.
Without any query criteria, results will be returned by the storage engine in natural order (aka in the order they are found). Result order may coincide with insertion order but this behaviour is not guaranteed and cannot be relied on (aside from capped collections).
Some examples that may affect storage (natural) order:
WiredTiger uses a different representation of documents on disk versus the in-memory cache, so natural ordering may change based on internal data structures.
The original MMAPv1 storage engine (removed in MongoDB 4.2) allocates record space for documents based on padding rules. If a document outgrows the currently allocated record space, the document location (and natural ordering) will be affected. New documents can also be inserted in storage marked available for reuse due to deleted or moved documents.
Replication uses an idempotent oplog format to apply write operations consistently across replica set members. Each replica set member maintains local data files that can vary in natural order, but will have the same data outcome when oplog updates are applied.
What if an index is used?
If an index is used, documents will be returned in the order they are found (which does necessarily match insertion order or I/O order). If more than one index is used then the order depends internally on which index first identified the document during the de-duplication process.
If you want a predictable sort order you must include an explicit sort() with your query and have unique values for your sort key.
How do capped collections maintain insertion order?
The implementation exception noted for natural order in capped collections is enforced by their special usage restrictions: documents are stored in insertion order but existing document size cannot be increased and documents cannot be explicitly deleted. Ordering is part of the capped collection design that ensures the oldest documents "age out" first.

It is returned in the stored order (order in the file), but it is not guaranteed to be that they are in the inserted order. They are not sorted by the _id field. Sometimes it can be look like it is sorted by the insertion order but it can change in another request. It is not reliable.

Related

Inconsistent results when sorting documents by _doc

I want to fetch elasticsearch hits using the sort+search_after paging mechanism.
The elasticsearch documentation states:
_doc has no real use-case besides being the most efficient sort order. So if you don’t care about the order in which documents are returned, then you should sort by _doc. This especially helps when scrolling.
However, when performing the same query multiple times, I get different results. More specifically, the first hit alternates randomly between two different hits, where the returned sort field is 0 for one hit, and some specific number for the other.
This obviously breaks the paging as it relies on the value returned in sorting to be later fed into sort_after for the next query.
No data is being written to the index while I am querying it, so this is not because of refreshes.
My questions are therefore:
Is it wrong to sort by _doc for paging? Seems the results I get are inconsistent.
How does sorting by _doc work internally? The documentation is lacking in this regard as it simply states the sort is performed by "index order".
The data was written to the index in parallel using Spark. I thought the problem might have been the parallel write combined with the "index order" sorting, however I did not manage to replicate this behavior with other indicies which were also written to in Spark.
es 7, index contains 2 shards, one primary and one replica
cheers.
The reason this happened is that the index consists of 2 shards. One primary and one replica. The documents were not indexed in the same order. Thus, the order of the results depends on the shard they were returned from. This is fine when using scrolling because Elasticsearch keeps an inner state of the results, but not with paging, which is stateless.

Sorting enormous dataset

I have an enormous dataset (over 300 million documents). It is a system for archiving data and rollback capability.
The rollback capability is a cursor which iterates trough the whole dataset and performs few post requests to some external end points, it's a simple piece of code.
The data being iterated over needs to be send ordered by the timestamp (filed in the document). The DB was down for some time, so backup DB was used, but has received older data which has been archived manually, and later all was merged with the main DB.
Older data breaks the order. I need to sort this dataset, but the problem is the size; there is not enough RAM available to perform this operation at once. How I can achieve this sorting?
PS: The documents do not contain any indexed fields.
There's no way to do an efficient sort without an index. If you had an index on the date field then things would already be sorted (in a sense), so getting things in a desired order is very cheap (after the overhead of the index).
The only way to sort all entries without an index is to fetch the field you want to sort for every single document and sort them all in memory.
The only good options I see are to either create an index on the date field (by far the best option) or increase the RAM on the database (expensive and not scalable).
Note: since you have a large number of documents it's possible that even your index wouldn't be super scalable -- in that case you'd need to look into sharding the database.

MongoDB (+ Node/Mongoose): How to track processed records without "marking" all of them?

I have several large "raw" collections of documents which are processed in a queue, and the processed results are all placed into a single collection.
The queue only runs when the system isn't otherwise indisposed, and new data is being added into the "raw" collections all the time.
What I need to do is make sure the queue knows which documents it has already processed, so it doesn't either (a) process any documents more than once, or (b) skip documents. Updating each raw record with a "processed" flag as I go isn't a good option because it adds too much overhead.
I'm using MongoDB 4.x, with NodeJS and Mongoose. (I don't need a strictly mongoose-powered answer, but one would be OK).
My initial attempt was to do this by retrieving the raw documents sorted by _id in a smallish batch (say 100), then grabbing the first and last _id values in the return result, and storing those values, so when I'm ready to process the next batch, I can limit my find({}) query to records with an _id greater than what I stored as the last-processed result.
But looking into it a bit more, unless I'm misunderstanding something, it appears I can't really count on a strict ordering by _id.
I've looked into ways to implement an auto-incrementing numeric ID field (SQL style), which would have a strict ordering, but the solutions I've seen look like they add a nontrivial amount of overhead each time I create a record (not dissimilar to what it would take to mark processed records, just would be on the insertion end instead of the processing end), and this system needs to process a LOT of records very fast.
Any ideas? Is there a way to do an auto-incrementing numeric ID that's super efficient? Will default _id properties actually work in this case and I'm misunderstanding? Is there some other way to do it?
As per the documentation of ObjectID:
While ObjectId values should increase over time, they are not
necessarily monotonic. This is because they:
Only contain one second of temporal resolution, so ObjectId values created within the same second do not have a guaranteed ordering, and
Are generated by clients, which may have differing system clocks.
So if you are creating that many records per second then _id ordering is not for you.
However Timestamp within a mongo instance is guaranteed to be unique.
BSON has a special timestamp type for internal MongoDB use and is not
associated with the regular Date type. Timestamp values are a 64 bit
value where:
the first 32 bits are a time_t value (seconds since the Unix epoch)
the second 32 bits are an incrementing ordinal for operations within a
given second.
Within a single mongod instance, timestamp values are always unique.
Although it clearly states that this is for internal use it maybe something for you to consider. Assuming you are dealing with a single mongod instance you can decorate your records when they are getting into the "raw" collections with timestamps ... then you could remember the last processed record only. Your queue would only pick records with timestamps larger that the last processed timestamp.

DocumentDB Lazy indexing mode: Querying a document before the index has updated

In order to improve write performance, a lazy indexing strategy can be used. From what I understand, this essentially means that the index propagation is eventually consistent after the write (as opposed to the Consistent mode, where the write will only complete once the index update has also completed).
Lazy index on MSDN:
To allow maximum document ingestion throughput, a DocumentDB
collection can be configured with lazy consistency; meaning queries
are eventually consistent. The index is updated asynchronously when a
DocumentDB collection is quiescent i.e. when the collection’s
throughput capacity is not fully utilized to serve user requests. For
"ingest now, query later" workloads requiring unhindered document
ingestion, "lazy" indexing mode may be suitable.
Question:
If a document is queried for (let's say by id) BEFORE the index has updated (it is still lazily propagating), will the document not be found OR will there be some form of "table scan" and the document returned with sacrifice in RU (and performance)?
EDIT: The above is assuming Session Consistency on the database.
Queries that use the "id" and "_rid" properties do not depend on the indexing policy and are always updated consistently. i.e. same as the configured consistency level of the account (in your case, session consistency).
Queries using "id" are always served from the index, are a seek, not a scan.

How concurrent are CQL's collections?

I am taking a look at Cassandra's CQL collections (list, set, map) and I can't find a reliable source stating on their concurrency.
I want to know if having multiple writers adding different elements to the same set is supported.
From what I read of the implementation (http://www.opensourceconnections.com/blog/2013/07/24/understanding-how-cql3-maps-to-cassandras-internal-data-structure-sets-lists-and-maps/) it seems that sets are implemented using columns, so I should be safe.
But on the other hand, I've read here and there that the operations on the collections always triggered a full read (even writes). This suggests that I could get in trouble if multiple writers were using the same collection.
So can I (and should I) use collection from multiple writers? (And also, the documentation mentions that collections should be used for "small amount of data", how much would that be? Tens, hundreds, thousands?
Updates are atomic which should include any collections in the row. So it should be fine to have multiple writers.
"In an UPDATE statement, all updates within the same partition key are applied atomically and in isolation."
http://cassandra.apache.org/doc/cql3/CQL.html#updateStmt
"Values of items in collections are limited to 64K"
http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddlWhenCollections.html
Cheers,

Resources