Question:
I'm using the Node MongoDB driver. I'm trying to determine whether I should write a single query that gets data from three collections, or whether the database needs a single collection (with references, embedded documents, etc.) that joins these three unrelated collections.
Use case:
During search I get an array of objects and take the first 10 from it. Each object is metadata about a document belonging to one of the three collections. The collections are unrelated but share some common fields, and this metadata is the only way to retrieve information at later stages.
For example, during search I get and store this array in React state (see the example below); then, when the user clicks on a search result, I have to loop over this array to grab the relevant metadata so I can retrieve more content.
Example objects inside the array of metadata objects:
[
  {
    collection: 'pmc_test',
    id_field: 'id_int',
    id_type: 'int',
    id_value: 2657156
  },
  {
    collection: 'arxiv',
    id_field: 'id_int',
    id_type: 'int',
    id_value: 2651582
  },
  {
    collection: 'crossref',
    id_field: 'DOI',
    id_type: 'string',
    id_value: "10.1098/rsbm.1955.0005"
  },
  ...
] // different collections, usually passed with 10 objects
However, to display the 10 search results in the first place, I have to loop over each object in the array and build and run a query for each, which could mean 10 separate queries. I can at least reduce this to 3 queries by using the $in operator with three arrays of IDs, one per collection (sketched below).
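A rough sketch of that 3-query approach with the Node driver (searchResults is the metadata array above; db is assumed to be an open Db handle):

const byCollection = {};
for (const m of searchResults) {
  (byCollection[m.collection] = byCollection[m.collection] || []).push(m);
}

const resultsPerCollection = await Promise.all(
  Object.entries(byCollection).map(([name, metas]) => {
    const idField = metas[0].id_field;          // e.g. 'id_int' or 'DOI'
    const ids = metas.map(m => m.id_value);
    return db.collection(name)
      .find({ [idField]: { $in: ids } })
      .toArray();
  })
);
const combined = resultsPerCollection.flat();   // still merged in memory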
This is still multiple queries: I have to go to the 1st collection, then the 2nd, then the 3rd, and then combine all the results for display. That is what I'm trying to avoid. The gists in the edit below show roughly what each of the three collections looks like.
Any suggestions on what querying approach I could use? Would the database benefit from a single collection, or some other approach that avoids having to use the metadata to look in three different collections?
Currently this would be a massive breaking change to the application, with at least 15 features / API calls needing updates. I'd like to keep the ability to query one collection and propose that as the optimal change.
Thanks in advance.
Edit
Example collections here:
Arxiv collection: https://gist.github.com/Natedeploys/6734dffccea7b293ca16b5bd7c73a6b6
Crossref collection:
https://gist.github.com/Natedeploys/9b0d3b02c665d7507ed75c9d5fbff159
Pubmed collection (pmc_test):
https://gist.github.com/Natedeploys/09527e8ceaf5d3f0f70ba28984b87a73
You can do all of these operations with MongoDB aggregation; in your case the $lookup and $group stages are applicable. For further guidance, please share the JSON of one document from each collection.
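For example, a single aggregation along those lines might look like the sketch below. This is only an outline based on the collection and field names in the question (pmcIds, arxivIds and dois are the ID arrays built from the metadata); the exact stages depend on the real schemas, and it assumes at least one pmc_test document matches:

db.collection('pmc_test').aggregate([
  { $match: { id_int: { $in: pmcIds } } },
  // collect the matched pmc_test docs into a single document
  { $group: { _id: null, pmc: { $push: '$$ROOT' } } },
  // uncorrelated $lookup pulls the matching docs from the other collections
  { $lookup: { from: 'arxiv', pipeline: [{ $match: { id_int: { $in: arxivIds } } }], as: 'arxiv' } },
  { $lookup: { from: 'crossref', pipeline: [{ $match: { DOI: { $in: dois } } }], as: 'crossref' } },
  // one result document whose 'results' array holds all the hits
  { $project: { results: { $concatArrays: ['$pmc', '$arxiv', '$crossref'] } } }
]).toArray()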
Related
I'm collecting a large amount of data coming from a market data websocket stream. I'm collecting 2 different types of events from this single stream; they are stored with their event date/time and have no parent/child database relation. They're being stored in their own respective MongoDB collections due to the differences in data structure.
Once a certain amount of data has been stored (100k+ events), I will be running analysis on the events, but I'd like to do so in a fashion where I'm simulating the original single stream of events by time (not processing both collection streams individually).
What I'd like to be able to do is make a single query from Mongoose, if possible, that joins both collections, sorts by date, and outputs as a stream for memory-saving purposes. So, performance is important in this case due to the number of events.
All answers I've seen when searching for a solution are regarding a parent/child aggregation of some sort, but since this isn't a user/userData-related segment of an application I'm having trouble finding an answer.
Also, storing the data in 2 separate collections seems necessary since their fields are all different except for time. But... would there be more pros than cons to keeping these events in a single collection, if it eliminates the need for this type of solution?
Your data-structure reasoning is slightly inverted. MongoDB is schemaless, and it's natural to have documents with different structures in the same collection.
That makes it easy to collect and analyse data, but it causes problems at the application level, since developers cannot rely on the data structure and have to validate it on each retrieval.
Mongoose aims to solve this problem by introducing data structures at the application level and taking over the routine validation tasks. Sometimes a single collection stores multiple models, with a discriminator field to resolve which model a document should be unmarshalled to.
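For instance, mongoose discriminators implement exactly that pattern; a minimal sketch (the event names and fields here are invented for illustration):

const mongoose = require('mongoose');

// base schema: both event types share a 'time' field and the 'events' collection
const eventSchema = new mongoose.Schema({ time: Date },
  { discriminatorKey: 'kind', collection: 'events' });
const Event = mongoose.model('Event', eventSchema);

// two event shapes stored in the same collection, told apart by 'kind'
const Trade = Event.discriminator('Trade',
  new mongoose.Schema({ price: Number, size: Number }));
const Quote = Event.discriminator('Quote',
  new mongoose.Schema({ bid: Number, ask: Number }));

// Event.find().sort({ time: 1 }) then yields Trade and Quote instances,
// resolved automatically from the stored 'kind' field.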
Having a single stream from multiple collections is the simplest part of the question, $unionWith does exactly that:
db.collection1.aggregate([
  { $unionWith: "collection2" },
  { $sort: { time: 1 } }
])
Unmarshalling the documents into mongoose models will be a little more complex - you will need to do it manually, since the documents will have different structures.
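Sketched with the native driver cursor as an async iterable (ModelA / ModelB and the discriminating 'bid' field are placeholders, not from the question):

const cursor = mongoose.connection.collection('collection1').aggregate([
  { $unionWith: 'collection2' },
  { $sort: { time: 1 } }
]);

for await (const doc of cursor) {
  // pick the mongoose model from whatever field reliably tells the shapes apart
  const event = ('bid' in doc) ? ModelB.hydrate(doc) : ModelA.hydrate(doc);
  // ...process events one at a time, in combined time order,
  // without loading the whole result set into memory
}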
Sorting might be a problem, though. https://docs.mongodb.com/manual/core/aggregation-pipeline/#pipeline-operators-and-indexes says the query can benefit from an index on the "time" field as long as there are no $project, $unwind, or $group stages, but I would double-check that the index can still be used after the $unionWith stage.
It would be much simpler to store the whole websocket stream in a single collection and use it straight from there.
I am trying to query my Firestore collection (in Node.js, for my Flutter app) to get the 10 documents that have the most documents in their subcollection called Purchases (to find the best sellers).
Is it possible in Firestore? Or do I have to keep an int field outside of my subcollection to represent its length?
Thank you!
I thought this was answered recently, but can't find it right now, so...
Firestore queries (and other read operations) work on a single collection, or a group of collections with the same name. They don't consider any data in other (nested or otherwise) collections, nor can they query based on aggregates (such as the number of documents), unless those aggregates are stored in a document in the queried collection.
So the solution is indeed to keep a counter in a document in the collection you are querying against, and to update that counter with every add/delete in the subcollection.
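A sketch of maintaining such a counter with a Cloud Function (the products/purchases paths and the purchaseCount field are assumed names, not from the question; a matching onDelete trigger would decrement it):

const functions = require('firebase-functions');
const admin = require('firebase-admin');
admin.initializeApp();

// bump the parent's counter whenever a purchase document is created
exports.onPurchaseCreated = functions.firestore
  .document('products/{productId}/purchases/{purchaseId}')
  .onCreate((snap, context) =>
    admin.firestore().doc(`products/${context.params.productId}`)
      .update({ purchaseCount: admin.firestore.FieldValue.increment(1) })
  );

// the "10 best sellers" then becomes a plain single-collection query:
// admin.firestore().collection('products')
//   .orderBy('purchaseCount', 'desc').limit(10).get()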
I'm currently building a multi-user ecommerce app (like Shopify, but for another sector). I'm using Node.js with MongoDB. What is the best practice for storing the orders?
I currently have this schema:
UserModel {
  username,
  password,
  ownsStore
}

StoreModel {
  storeID,
  owner,
  products: [productarray],
  categories: [],
  orders: [array with all OrderModel {
    orderid,
    orderitems
  }]
}
Will the orders array in the store get too big?
Is it better practice to put the orders in their own collection, assign them to the stores using find, and only save the ObjectIds in the store's orders array?
This schema will create problems for you once the data gets bigger. Instead, you should create a separate document for each order and map it to the user with a unique user ID.
The first problem will come when you want to sort the orders: you will either have to use an aggregation query, which takes more time than a normal query, or sort the orders manually with forEach or map.
Secondly, you will also face problems when updating documents once nested arrays come into play, because MongoDB does not handle deeply nested arrays well, so you would have to keep setting the whole array value by updating it manually.
If you use separate order documents, all of those things become easier and faster.
I have worked on an ecommerce project and ran into these issues, so I later changed it to a separate document for each order.
So define a separate schema for orders and store the user's unique ID in each order, so that the mapping stays easy.
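A minimal sketch of such an Order schema with mongoose (field names are illustrative):

const mongoose = require('mongoose');

const orderSchema = new mongoose.Schema({
  store: { type: mongoose.Schema.Types.ObjectId, ref: 'Store', required: true },
  user:  { type: mongoose.Schema.Types.ObjectId, ref: 'User', required: true },
  items: [{
    product: { type: mongoose.Schema.Types.ObjectId, ref: 'Product' },
    quantity: Number
  }],
  createdAt: { type: Date, default: Date.now }
});

const Order = mongoose.model('Order', orderSchema);

// orders for one store, newest first, without the store document ever growing:
// Order.find({ store: storeId }).sort({ createdAt: -1 }).limit(20)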
I have a Location collection as below. I need to fetch the details of only the drivers whose ratings are above three for that particular location.
Thanks in advance...
[
  {
    "name": "Delhi",
    "cab_details (sub table)": [
      {
        "driver_details (join)": {
          "name": "1111",
          "ratings_above_three": true
        },
        "date_joining": date
      },
      {
        "driver_details (join)": {
          "name": "2222",
          "ratings_above_three": false
        },
        "date_joining": date
      }
    ]
  }
]
It would be much easier if you put your drivers into a separate collection.
MongoDB does not handle ever-growing documents very well.
In your schema design, follow the principle of least cardinality.
Refer to this question: mongodb-performance-with-growing-data-structure
I suspect the model you presented is a City from a cities collection. I would recommend creating a separate collection called cab_drivers and populating it with CabDriver documents. You can maintain the relationship between them using ObjectIds: store an array of CabDriver._id values in your City as cab_drivers, and it will be much easier to query, validate and update them. The cab_drivers array inside a City document will also grow much more slowly than it would if you kept the whole driver documents inside the parent.
If you use mongoose, you can then call city.populate('cab_drivers') to load all of the related documents at the application level.
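A rough sketch of that split (schema and field names are assumptions based on the question):

const cabDriverSchema = new mongoose.Schema({
  name: String,
  ratings_above_three: Boolean,
  date_joining: Date
});

const citySchema = new mongoose.Schema({
  name: String,
  cab_drivers: [{ type: mongoose.Schema.Types.ObjectId, ref: 'CabDriver' }]
});

const CabDriver = mongoose.model('CabDriver', cabDriverSchema);
const City = mongoose.model('City', citySchema);

// only the drivers rated above three for a given city:
// City.findOne({ name: 'Delhi' })
//   .populate({ path: 'cab_drivers', match: { ratings_above_three: true } })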
Is there an efficient way to do a range-based query across multiple collections, sorted by an index on timestamps? I basically need to pull in the latest 30 documents from 3 collections and the obvious way would be to query each of the collections for the latest 30 docs and then filter and merge the result. However that's somewhat inefficient.
Even if I were to select only the timestamp field in the first query and then do a second batch of queries for the latest 30 docs, I'm not sure that would be a better approach. That would be 90 documents (whole or single-field) per pagination request.
Essentially the client can be subscribed to articles and each category of article differs by 0 - 2 fields. I just picked 3 since that is the average number of articles that users are subscribed to so far in the beta. Because of the possible field differences, I didn't think it would be very consistent to put all of the articles of different types in a single collection.
MongoDB operations operate on one and only one collection at a time. Thus you need to structure your schema with collections that match your query needs.
Option A: Get Ids from supporting collection, load full docs, sort in memory
You would need a supporting collection that combines the ids, main collection names, and timestamps of the 3 collections. Query that to get your 30 ID/collection pairs, then load the corresponding full documents with 3 additional queries (1 to each main collection). Of course, those won't come back in the correct combined order, so you need to sort that page of results in memory before returning it to your client. A document in the supporting collection would look something like:
{
  _id: ObjectId,
  updated: Date,
  type: String
}
This way allows mongo to do the pagination for you.
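Roughly, with the Node driver (names like articles_index are illustrative, and this assumes the supporting document reuses the main document's _id):

// page of 30 id/collection pairs, paginated via the supporting collection
const page = await db.collection('articles_index')
  .find({}, { projection: { _id: 1, type: 1, updated: 1 } })
  .sort({ updated: -1 })
  .skip(pageNumber * 30)
  .limit(30)
  .toArray();

// group ids per main collection and load the full docs with one $in each
const byType = {};
for (const p of page) (byType[p.type] = byType[p.type] || []).push(p._id);

const fullDocs = (await Promise.all(
  Object.entries(byType).map(([type, ids]) =>
    db.collection(type).find({ _id: { $in: ids } }).toArray())
)).flat();

// restore the combined order defined by the supporting collection
const order = new Map(page.map((p, i) => [String(p._id), i]));
fullDocs.sort((a, b) => order.get(String(a._id)) - order.get(String(b._id)));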
Option B: 3 Queries, Union, Sort, Limit
Or, as you said, load 30 documents from each collection, sort the union in memory, drop the extra 60, and return the combined result. This avoids the extra collection and the synchronization maintenance that comes with it.
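Something like this, with the three collection names assumed:

const perCollection = await Promise.all(
  ['articlesA', 'articlesB', 'articlesC'].map(name =>
    db.collection(name).find().sort({ timestamp: -1 }).limit(30).toArray())
);

const latest30 = perCollection
  .flat()                                      // up to 90 docs
  .sort((a, b) => b.timestamp - a.timestamp)   // merge-sort in memory
  .slice(0, 30);                               // drop the extra 60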
So I would think your current approach (Option B, as I call it) is the lesser evil of those 2 not-so-great options.
If your query is really to get the most recent articles based on a selection of categories, then I'd suggest you:
A) Store all of the documents in a single collection so they can be fetched with a single query as a combined, paged result. Unless you have a very consistent date range across collections, you'll otherwise need to track date ranges and counts so that you can reasonably fetch a set of documents that can be merged; 30 from one collection may be older than all of those from another. You can add an index on timestamp and category and then limit the results (see the sketch after this list).
B) Cache everything aggressively so that you rarely need to do the merges
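A sketch of suggestion A), assuming a single articles collection with category and timestamp fields:

// compound index so the paged read below can use the index for the sort
await db.collection('articles').createIndex({ category: 1, timestamp: -1 });

const page = await db.collection('articles')
  .find({ category: { $in: subscribedCategories } })
  .sort({ timestamp: -1 })
  .limit(30)
  .toArray();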
You could use the same idea I explained here; although that post is about MongoDB text search, it applies to any kind of query:
MongoDB Index optimization when using text-search in the aggregation framework
The idea is to query each of your collections ordered by date and id, then sort/merge the results to return the first page. Subsequent pages are retrieved by using the last document's date and id from the previous page.
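A sketch of that keyset pagination, with assumed collection names, merged in memory per page:

// documents strictly "after" the previous page in (date desc, _id desc) order
function afterLastSeen(lastDate, lastId) {
  if (!lastDate) return {};                    // first page
  return {
    $or: [
      { date: { $lt: lastDate } },
      { date: lastDate, _id: { $lt: lastId } }
    ]
  };
}

const chunks = await Promise.all(
  ['collA', 'collB', 'collC'].map(name =>
    db.collection(name)
      .find(afterLastSeen(lastDate, lastId))
      .sort({ date: -1, _id: -1 })
      .limit(30)
      .toArray())
);

const page = chunks.flat()
  .sort((a, b) => (b.date - a.date) || String(b._id).localeCompare(String(a._id)))
  .slice(0, 30);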