How can I use indexes in aggregate?
I read the documentation at https://docs.mongodb.com/manual/core/aggregation-pipeline/#pipeline-operators-and-indexes, which says:
The $match and $sort pipeline operators can take advantage of an index when they occur at the beginning of the pipeline.
Is there any way to use an index in a stage that is not at the beginning of the pipeline, such as $sort, $match, or $group?
Please help me
An index works by keeping a record of certain pieces of data that point to a given record in your collection. Think of it like having a novel, and then having a sheet of paper that lists the names of various people or locations in that novel with the page numbers where they're mentioned.
Aggregation is like taking that novel and transforming the different pages into an entirely different stream of information. You don't know where the new information is located until the transformation actually happens, so you can't possibly have an index on that transformed information.
In other words, a pipeline stage that operates on data already transformed by earlier stages cannot use an index, because an index describes the documents as they are stored in the collection, not the pipeline's intermediate results. MongoDB has no way of knowing whether the newly transformed data can be looked up efficiently.
If your aggregation has to process too many documents to run efficiently, then you need to reduce the number of documents that enter the pipeline. Ideally this would mean having a $match stage at the start that limits the documents to a reasonably-sized subset. This isn't always possible, however, so additional effort may be required.
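As a rough sketch of the "limit early" idea, the pipeline below puts $match first so it can use an index, and only then groups. The collection name `transactions`, the fields `createdAt`, `type`, and `amount`, and the index on `createdAt` are all assumptions made for illustration, not something from the question.

```javascript
// Hypothetical "transactions" collection with an index on { createdAt: 1 }.
// $match comes first so it can use that index; $group then runs on the
// already-reduced set of documents and cannot use any index itself.
function buildRecentTotalsPipeline(since) {
  return [
    { $match: { createdAt: { $gte: since } } },
    { $group: { _id: "$type", total: { $sum: "$amount" } } },
  ];
}

// Usage with the Node.js driver (not executed here):
//   db.collection("transactions").aggregate(buildRecentTotalsPipeline(since))
const recentTotalsPipeline = buildRecentTotalsPipeline(new Date("2020-01-01T00:00:00Z"));
console.log(Object.keys(recentTotalsPipeline[0])[0]); // $match
```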
One possibility is generating "summary" documents that are the result of aggregating all new data together, then performing your primary aggregation pipeline using only these summary documents. For example, if you have a log of transactions in your system that you wish to aggregate, then you could generate a daily summary of the quantities and types of the different transactions that have been logged for the day, along with any other additional data you would need. You would then limit your aggregation pipeline to only these daily summary documents and avoid using the normal transaction documents.
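To make the summary idea a bit more concrete, here is a minimal sketch in plain JavaScript that rolls transactions up into one summary per day and type. The field names (`type`, `quantity`, `createdAt`) and the shape of the summary are assumptions; in practice you would probably store these summaries in their own collection and aggregate over that.

```javascript
// Roll raw transactions up into one summary per UTC day and transaction type.
function summarizeByDay(transactions) {
  const summaries = new Map();
  for (const tx of transactions) {
    const day = tx.createdAt.toISOString().slice(0, 10); // "YYYY-MM-DD" (UTC)
    const key = `${day}|${tx.type}`;
    const summary =
      summaries.get(key) ?? { day, type: tx.type, count: 0, totalQuantity: 0 };
    summary.count += 1;
    summary.totalQuantity += tx.quantity;
    summaries.set(key, summary);
  }
  return [...summaries.values()];
}

const sampleTransactions = [
  { type: "sale", quantity: 2, createdAt: new Date("2020-05-01T09:00:00Z") },
  { type: "sale", quantity: 3, createdAt: new Date("2020-05-01T17:30:00Z") },
  { type: "refund", quantity: 1, createdAt: new Date("2020-05-02T08:15:00Z") },
];
console.log(summarizeByDay(sampleTransactions));
```

The primary aggregation then only has to read one small document per day and type instead of every logged transaction.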
An actual solution is beyond the scope of this question, however. Just be aware that the index usage is a limitation that you cannot avoid.
I'm collecting a large amount of data coming from a market data websocket stream. I'm collecting 2 different types of events from this single stream that are to be stored with event date/time and have no parent/child database relation. They're being stored in their own respective MongoDB collections due to the difference in data structures.
Once a certain amount of data has been stored (100k+ events), I will be running analysis on the events, but I'd like to do so in a fashion where I'm simulating the original single stream of events by time (not processing both collection streams individually).
What I'd like to be able to do is make a single query from Mongoose, if possible, that joins both collections, sorts by date, and outputs as a stream for memory-saving purposes. So, performance is important in this case due to the number of events.
All answers I've seen when searching for a solution are regarding a parent/child aggregation of some sort, but since this isn't a user/userData-related segment of an application I'm having trouble finding an answer.
Also, storing the data in 2 separate collections seems necessary since their fields are all different except for time. But... would there be more pros than cons to keep these events in a single collection, if it eliminates the need for this type of solution?
The data structure reasoning is slightly inverted. MongoDB is schemaless, and it's natural to have documents with different structures in the same collection.
That makes it easy to collect and analyse data, but it causes problems at the application level, since devs cannot rely on the data structure and have to validate it on each retrieval.
Mongoose aims to solve this problem by introducing data structures at the application level and taking over all the routine validation tasks. Sometimes a single collection stores multiple models, with a discriminator field to decide which model each document should be unmarshalled to.
Having a single stream from multiple collections is the simplest part of the question. $unionWith (available since MongoDB 4.4) does exactly that:
db.collection1.aggregate( [
{ $unionWith: "collection2" },
{ $sort: { time: 1 } }
] )
Unmarshalling of the documents to mongoose models will be a little bit more complex - you will need to do it manually since the documents will have different structure.
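One way to do that manual unmarshalling, as a sketch only: tag each document with its source inside the pipeline, then dispatch on the tag in the application. The `kind` field, the values `"trade"` and `"quote"`, and the converter functions are all hypothetical; with Mongoose the converters could wrap `Model.hydrate(doc)` for the matching model.

```javascript
// Pipeline variant that tags each document with a hypothetical "kind" field
// so the application can tell the two event types apart after the union.
const taggedUnionPipeline = [
  { $set: { kind: "trade" } },
  { $unionWith: { coll: "collection2", pipeline: [{ $set: { kind: "quote" } }] } },
  { $sort: { time: 1 } },
];

// Hypothetical converters, one per event type.
const converters = {
  trade: (doc) => ({ type: "Trade", time: doc.time, price: doc.price }),
  quote: (doc) => ({ type: "Quote", time: doc.time, bid: doc.bid, ask: doc.ask }),
};

// Pick the converter that matches the document's tag.
function toModel(doc) {
  const convert = converters[doc.kind];
  if (!convert) throw new Error(`Unknown event kind: ${doc.kind}`);
  return convert(doc);
}

console.log(toModel({ kind: "quote", time: 2, bid: 9.5, ask: 10 }).type); // Quote
```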
Sorting might be a problem, though. https://docs.mongodb.com/manual/core/aggregation-pipeline/#pipeline-operators-and-indexes says the query can benefit from an index on the "time" field as long as no $project, $unwind, or $group stage comes before the sort, but I would double-check that the index can still be used after a $unionWith stage.
It would be much simpler to store the whole websocket stream in a single collection and use it straight from there.
I'm trying to write a method using the MongoDB NodeJS driver that will give me a single, random document from a collection as its result.
I've seen people recommend the use of db.collection.aggregate and the $sample pipeline stage to do this. Here is my code:
async findOneRandom(collection) {
  try {
    return await this.db.collection(collection).aggregate([
      { $sample: { size: 1 } }
    ]).toArray();
  } catch (error) {
    console.log(error.stack);
    return null;
  }
}
I have a collection that has 328 documents in it. Each document has an _id field e.g. { _id: 1 } and the IDs run sequentially from 1 to 328.
Whilst testing this I observed that I was never seeing a result above { _id: 250 }.
To investigate further, I ran the code 10,000 times and looked at the distribution of the random results. In 10,000 runs I never got a result with the ID 1 or any number above 251. Here is the distribution visualised:
The mongo documentation says this, but to me, that doesn't explain why the result doesn't ever contain IDs higher than 251:
If all the following conditions are met, $sample uses a pseudo-random cursor to select documents:
$sample is the first stage of the pipeline
N is less than 5% of the total documents in the collection
The collection contains more than 100 documents
If any of the above conditions are NOT met, $sample performs a collection scan followed by a random sort to select N documents. In this case, the $sample stage is subject to the sort memory restrictions.
My use case does not require a perfectly random distribution, but it appears to me as if something is wrong, or I am not fully understanding the terminology in the docs and what this pipeline can and can't do.
Is anyone able to explain/point to documentation that explains why the $sample never appears to select from all the documents?
Are there any changes I can make to my code e.g. supplying options that will address my issue? The fact that the distribution is spread across "250" documents seems very un-coincidental! I am perhaps not understanding how the mongo cursor is working (I am new and learning).
Mongo version I'm using is 4.2.6
Note: I am happy to consider other methods to randomly select a document, but my question is specifically about $sample as it seems like the use of $sample is the solution that is commonly recommended and I haven't yet found an article that references this issue or generally how 'pseudo-random' is implemented.
After more searching I found this answer which adequately describes why the sample is not random: "Random" sample from MongoDB returning heavily skewed results
As of MongoDB 3.4.9, part of the reason for the bias you've observed is that $sample relies almost entirely on the storage engine's random cursor implementation (see SERVER-19183). This is done so that $sample could be performant when the collection contains a lot of data. However, since the storage engine stores documents in a sorted order using a B-tree type implementation, it's not always possible to create a truly random result.
There are currently two feature requests for better $sample mechanics, namely SERVER-22069 and SERVER-22068.
SERVER-22068 describes how the first random sampling strategy (as described in the MongoDB documentation) is statistically biased, but the second strategy (which my code is not triggering) is statistically better distributed.
The $sample stage currently has two algorithms to select a random sample:
Using a random cursor (does a random walk over some B-tree like structure).
A full collection scan, sorting by a random value.
The latter strategy has a better statistical distribution, since it only relies on the random number generator, and doesn't depend on any trees being balanced. It is also better at weighting the results from shards with different amounts of data accordingly. The random walk approach has some special logic to approximate weighting per shard, but it is flawed because it only has an estimate of the number of owned documents on the shard.
We should add an option to the $sample stage to force it to perform the scan + random sort approach. When this option is passed, it should probably use the better random number generator as well.
The proposal is to add an option to force the second strategy, but it has not yet been actioned, so I assume the correct approach for my use case is to use a method other than $sample.
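For this particular collection, one alternative that avoids $sample altogether is to pick the _id at random on the client. This is only a sketch, and it relies on the assumption stated in the question that the _ids are sequential integers from 1 to 328 with no gaps; the helper name `randomIntInclusive` is made up for the example.

```javascript
// Uniform random integer in [min, max], both inclusive.
// `random` is injectable so the function can be tested deterministically.
function randomIntInclusive(min, max, random = Math.random) {
  return min + Math.floor(random() * (max - min + 1));
}

// Usage with the Node.js driver (not executed here), assuming sequential
// integer _ids from 1 to 328 with no gaps:
//   const doc = await db.collection(name).findOne({ _id: randomIntInclusive(1, 328) });
const randomId = randomIntInclusive(1, 328);
console.log(randomId >= 1 && randomId <= 328); // true
```

If the _ids had gaps, the same helper could instead pick an offset between 0 and countDocuments() - 1 for a find().skip(offset).limit(1) query, at the cost of an extra count.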
I have to design a schema that stores a user id together with that user's orders. An order can contain multiple products, such as bread and butter, and I also want to store the quantity ordered of each product. Please guide me.
It is difficult to provide you with a real solution to your problem, as designing a NoSQL DB structure depends on how you want to access your data. You can keep orders as nested/embedded documents in the User model or store them in a separate collection. In the first case, you will have all the data in one request, but you will not be able to query for only the orders that match certain criteria: you will get all of a user's orders, including those that do not match, and then have to filter them out. Alternatively, you could use aggregation to get exactly what you need.
However, there are limitations to keep in mind. A MongoDB document has a size limit of 16 megabytes. Since a user may accumulate very many orders, the embedded approach could reach that limit for some users. Aggregation has a limitation too: pipeline stages have a limit of 100 megabytes of RAM, although you can work around it with the allowDiskUse option.
Having orders in a separate collection would require you to separately load them for users. While it is one more request, it will give you more flexibility in terms of how you query them.
Then, of course, create/update operations are also done differently for both cases.
My advice would be to carefully design your application first: what data you need, where you will show it, and how you create and update it. That will give you a better idea, and chances are that a relational DB will be a better choice for what you need (though not necessarily).
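If you go with the separate-collection option, an order document could look roughly like what the helper below builds. This is only a sketch: the `orders` collection, the field names (`userId`, `items`, `product`, `quantity`, `createdAt`) and the validation rules are assumptions, not a recommendation for your exact case.

```javascript
// Build an order document for a separate, hypothetical "orders" collection.
// Each line item stores the product and the quantity ordered.
function buildOrder(userId, items) {
  for (const { product, quantity } of items) {
    if (!product || !Number.isInteger(quantity) || quantity <= 0) {
      throw new Error(`Invalid line item: ${JSON.stringify({ product, quantity })}`);
    }
  }
  return {
    userId,
    items: items.map(({ product, quantity }) => ({ product, quantity })),
    createdAt: new Date(),
  };
}

const exampleOrder = buildOrder("user-42", [
  { product: "bread", quantity: 2 },
  { product: "butter", quantity: 1 },
]);
console.log(exampleOrder.items.length); // 2
```

With this shape, all orders of one user can be fetched by querying on `userId`, and each order stays far below the document size limit.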
Let's say I have 2 datasets, one for rules, and the other for values.
I need to filter the values based on rules.
I am using a key-value store (Couchbase, Cassandra, etc.). I can use multi-get to retrieve all the values from one table and all the rules from the other, and perform the validation in a loop.
However, I find this very inefficient. I move a massive volume of data (the values) over the network, and the client is kept busy filtering.
What is the common pattern for finding the intersection between two tables with Key-Value store?
The idea behind the NoSQL data model is to write data in a denormalized way, so that each table can answer one precise query. For example, imagine you have reviews made by customers on shops. You need to know the reviews made by a user on shops, and also the reviews received by a shop. This would be modeled using two tables:
ShopReviews
UserReviews
You query the first table by shop id and the second by user id. The data are written twice, but each read is a direct access by key.
In the same way, you should organize values by rules (I can't be more precise without knowing the relation between them), and so on. One more consideration: newer versions of NoSQL databases support collections, which might help to model one-to-many relations.
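The ShopReviews/UserReviews double write can be sketched like this. Two in-memory Maps stand in for the two tables, and the review fields (`shopId`, `userId`, `rating`) are assumptions for the example; a real store would of course use its own client API for the two writes.

```javascript
// Two in-memory Maps stand in for the ShopReviews and UserReviews tables.
const shopReviews = new Map(); // shopId -> reviews received by that shop
const userReviews = new Map(); // userId -> reviews written by that user

function appendTo(table, key, value) {
  if (!table.has(key)) table.set(key, []);
  table.get(key).push(value);
}

// Write the same review twice, once per access pattern.
function addReview(review) {
  appendTo(shopReviews, review.shopId, review);
  appendTo(userReviews, review.userId, review);
}

addReview({ shopId: "s1", userId: "u1", rating: 5 });
addReview({ shopId: "s1", userId: "u2", rating: 3 });
console.log(shopReviews.get("s1").length, userReviews.get("u1").length); // 2 1
```

Each read is then a single lookup by key, and no filtering happens on the client.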
HTH, Carlo
Is there an efficient way to do a range-based query across multiple collections, sorted by an index on timestamps? I basically need to pull in the latest 30 documents from 3 collections and the obvious way would be to query each of the collections for the latest 30 docs and then filter and merge the result. However that's somewhat inefficient.
Even if I were to select only the timestamp field in the first query and then do a second batch of queries for the latest 30 docs, I'm not sure that would be a better approach. That would still be 90 documents (whole or single field) per pagination request.
Essentially the client can be subscribed to articles and each category of article differs by 0 - 2 fields. I just picked 3 since that is the average number of articles that users are subscribed to so far in the beta. Because of the possible field differences, I didn't think it would be very consistent to put all of the articles of different types in a single collection.
A MongoDB find query operates on one and only one collection at a time. Thus you need to structure your schema with collections that match your query needs.
Option A: Get Ids from supporting collection, load full docs, sort in memory
One option is to maintain a supporting collection that combines the ids, main collection names, and timestamps from the 3 collections. You query that collection to get your 30 ID/collection pairs, and then load the corresponding full documents with 3 additional queries (1 to each main collection). Remember that those won't come back in the correct combined order, so you need to sort that page of results in memory before returning it to your client. The supporting documents would look like this:
{
_id: ObjectId,
updated: Date,
type: String
}
This approach lets MongoDB do the pagination for you.
Option B: 3 Queries, Union, Sort, Limit
Or as you said load 30 documents from each collection, sort the union set in memory, drop the extra 60, and return the combined result. This avoids the extra collection overhead and synchronization maintenance.
So I would think your current approach (Option B as I call it) is the lesser of those 2 not-so-great options.
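The in-memory part of Option B is small. As a sketch, assuming each of the 3 queries has already returned its documents as an array and that the timestamp field is called `updated` (as in the supporting document above):

```javascript
// Merge per-collection result pages, newest first, and keep only `limit` items.
function mergeLatest(resultSets, limit) {
  return resultSets
    .flat()
    .sort((a, b) => b.updated - a.updated)
    .slice(0, limit);
}

const mergedPage = mergeLatest(
  [
    [
      { _id: "a1", updated: new Date("2020-03-03") },
      { _id: "a2", updated: new Date("2020-03-01") },
    ],
    [{ _id: "b1", updated: new Date("2020-03-02") }],
    [],
  ],
  2
);
console.log(mergedPage.map((doc) => doc._id)); // [ 'a1', 'b1' ]
```

For a real page you would pass the three 30-document arrays and a limit of 30.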
If your query is really to get the most recent articles based on a selection of categories, then I'd suggest you:
A) Store all of the documents in a single collection so that a single query can fetch a combined, paged result. Unless you have a very consistent date range across collections, you'll need to track date ranges and counts so that you can reasonably fetch a set of documents that can be merged: 30 from one collection may be older than all from another. You can add an index on timestamp and category and then limit the results.
B) Cache everything aggressively so that you rarely need to do the merges
You could use the same idea I explained here. Although that post is about MongoDB text search, it applies to any kind of query:
MongoDB Index optimization when using text-search in the aggregation framework
The idea is to query all your collections ordering them by date and id, then sort/mix the results in order to return the first page. Subsequent pages are retrieved by using last document's date and id from the previous page.
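The "last document's date and id" part can be expressed as a query filter. Below is a minimal sketch, assuming the documents have a `date` field, the sort order is `{ date: -1, _id: -1 }`, and `last` is the final document of the previous page; the helper name is made up.

```javascript
// Filter for the page that follows `last`, when sorting by { date: -1, _id: -1 }.
function nextPageFilter(last) {
  if (!last) return {}; // first page: no lower bound yet
  return {
    $or: [
      { date: { $lt: last.date } },
      // Same date as the last document: fall back to _id as a tie-breaker.
      { date: last.date, _id: { $lt: last._id } },
    ],
  };
}

// Usage per collection with the Node.js driver (not executed here):
//   coll.find(nextPageFilter(last)).sort({ date: -1, _id: -1 }).limit(30)
console.log(JSON.stringify(nextPageFilter({ date: 5, _id: 7 })));
```

The same filter is applied to every collection, and the per-collection results are then merged and cut to the page size as before.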