Create View from multiple collections MongoDB - node.js

I have the following Mongoose schemas (truncated to hide project-sensitive information) from a healthcare project.
let PatientSchema = mongoose.Schema({_id:String})
let PrescriptionSchema = mongoose.Schema({_id:String, patient: { type: Number, ref: 'Patient' }, createdAt:Date })
let ReportSchema = mongoose.Schema({_id:String, patient: { type: Number, ref: 'Patient' }, createdAt:Date })
let EventsSchema = mongoose.Schema({_id:String, patient: { type: Number, ref: 'Patient' }, createdAt:Date })
There is a UI screen in the mobile and web app called Health History, where I need to paginate the entries from prescriptions, reports and events, sorted by createdAt. So I am building a REST endpoint to get this heterogeneous data. How do I achieve this? Is it possible to create a "view" from multiple schema models, so that I don't have to load the contents of all 3 collections to fetch one page of entries? The schema of my "view" should look like the one below, so that I can run additional queries on it (e.g. find the last report):
{recordType:String, /* prescription/report/event */ createdDate:Date, data:Object /* content from any of the 3 tables */}

I can think of three ways to do this.
IMHO the easiest way to achieve this is an aggregation, something like this:
db.Patients.aggregate([
  { $match: { _id: <somePatientId> } },
  {
    $lookup: {
      // collection name (Mongoose pluralizes the model name by default);
      // replicate this stage for reports and events
      from: "prescriptions",
      localField: "_id",
      foreignField: "patient",
      as: "prescriptions" // or reports or events
    }
  },
  { $unwind: "$prescriptions" }, // or reports or events
  { $sort: { "prescriptions.createdAt": -1 } },
  { $skip: <positive integer> },
  { $limit: <positive integer> }
])
You'll have to adapt it further, to also get the correct createdDate. For this, you might want to look at the $replaceRoot operator.
The second option is to create a new "meta"-collection, that holds your actual list of events, but only holds a reference to your patient as well as the actual event using a refPath to handle the three different event types. This solution is the most elegant, because it makes querying your data way easier, and probably also more performant. Still, it requires you to create and handle another collection, which is why I didn't want to recommend this as the main solution, since I don't know if you can create a new collection.
As a last option, you could create virtual populate fields in Patient, that automatically fetch all prescriptions, reports and events. This has the disadvantage that you can not really sort and paginate properly...
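As a hedged sketch of the aggregation idea above: on MongoDB 4.4 or later, the $unionWith stage can merge the three collections into the { recordType, createdDate, data } shape from the question before sorting and paginating. The collection names ("reports", "events"), the top-level createdAt field, and the helper names are assumptions, not something taken from the real project. The function below only builds the pipeline; it would be passed to something like Prescription.aggregate(pipeline).

```javascript
// Stages that filter one collection by patient and reshape each document
// into { recordType, createdDate, data }. Field names are assumptions.
function toHistoryShape(recordType, patientId) {
  return [
    { $match: { patient: patientId } },
    {
      $project: {
        _id: 0,
        recordType: { $literal: recordType },
        createdDate: "$createdAt",
        data: "$$ROOT",
      },
    },
  ];
}

// Pipeline meant to run on the prescriptions collection; reports and events
// are appended with $unionWith, then everything is sorted and paginated.
function buildHealthHistoryPipeline(patientId, page, pageSize) {
  return [
    ...toHistoryShape("prescription", patientId),
    { $unionWith: { coll: "reports", pipeline: toHistoryShape("report", patientId) } },
    { $unionWith: { coll: "events", pipeline: toHistoryShape("event", patientId) } },
    { $sort: { createdDate: -1 } },
    { $skip: page * pageSize },
    { $limit: pageSize },
  ];
}

const pipeline = buildHealthHistoryPipeline("patient-1", 2, 10);
console.log(JSON.stringify(pipeline, null, 2));
```

Because every branch filters by patient before the union, only that patient's documents are sorted, which is the part that avoids loading all three collections.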

Related

What is the most efficient way to perform CRUD operations to millions of documents in MongoDB

I am new to MongoDB and currently doing a project where MongoDB is my primary database management system. I am using Mongoose as the object data modeling (ODM) library. Suppose I have two collections called products and features. Each product may have multiple features, i.e. a one-to-many relationship.
// products schema
const products = mongoose.Schema({
id: Number,
name: String,
description: String,
category: String,
price: Number,
}, {
strict: false,
});
// features schema
const features = mongoose.Schema({
id: Number,
product_id: Number,
feature: String,
value: String,
}, {
strict: false,
});
I have imported documents/records for both collections from external .csv files, and each collection holds more than 3 million records. My client-side application requires data about a particular product with all of its features, like below:
{
productId: 3,
name: 'Denim Jeans',
description: '',
category: 'Cloth',
price: 40.00,
features: [
{
feature: 'Material',
value: 'Cotton',
},
{
feature: 'color',
value: 'blue',
},
....
]
}
Each product will not have more than 5-6 features. So, what I wanted to do is to embed the features document as a subdocument in the products document like the response above. So, I wrote a piece of code like this. It's not the exact same code as I deleted it from my code when it was not working but the logic is the same.
db.products.find({}, (err, product) => {
// product -> array of all documents from products collection
// for each product, I am trying to find the corresponding feature from
// the features collections and embed it to each product document
product.forEach(item => {
db.features.find({product_id: item.id}, (err, feature) => {
// feature -> array of all the features of a product
// embed the feature array to each individual product item
item.features = feature;
})
})
})
Now, the issue is when I run the above piece of code, I got errors like OutOfMemory as it is trying to read from millions of records and my memory is not capable of holding all of this. My question is what is the best way to retrieve all the products and for each individual product write a query to get its corresponding features and embed it inside each product document.
I have a couple of ideas. Kindly correct me if I am wrong. Instead of holding all the products and their features in memory, I want to keep them on disk and update the individual products using the Bulk API of MongoDB. But in that case, how do I achieve this? I am also concerned about the performance. What is the best practice to follow in this case? Or should I keep them in separate collections, make two queries from the application server, and package the response there? Or should I use some kind of aggregation pipeline at the database level? Thanks in advance.
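One way to picture the Bulk API idea is to stream the features collection in small batches (for example with a Mongoose query cursor) and turn each batch into bulkWrite operations, so that memory use is bounded by the batch size. The following is only a sketch under assumptions: buildEmbedOps is a hypothetical helper, and products are assumed to be matched on their numeric id field.

```javascript
// Group a batch of feature rows by product_id and turn each group into one
// updateOne operation in the shape accepted by bulkWrite().
function buildEmbedOps(featureBatch) {
  const byProduct = new Map();
  for (const { product_id, feature, value } of featureBatch) {
    if (!byProduct.has(product_id)) byProduct.set(product_id, []);
    byProduct.get(product_id).push({ feature, value });
  }
  return [...byProduct].map(([productId, features]) => ({
    updateOne: {
      filter: { id: productId },
      // $push with $each appends, so a product whose features are split
      // across two batches still ends up with all of them.
      update: { $push: { features: { $each: features } } },
    },
  }));
}

const ops = buildEmbedOps([
  { id: 1, product_id: 3, feature: "Material", value: "Cotton" },
  { id: 2, product_id: 3, feature: "color", value: "blue" },
  { id: 3, product_id: 7, feature: "Material", value: "Wool" },
]);
console.log(JSON.stringify(ops, null, 2));
```

Each ops array would then be sent with a single bulkWrite(ops) call on the products model. An index on the products id field is probably needed for the filters to be fast, and since $push is not idempotent, a re-run would duplicate features unless the features array is cleared first.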

Mongoose: How to populate 2 level deep population without populating fields of first level? in mongodb

Here is my Mongoose Schema:
var SchemaA = new Schema({
field1: String,
.......
fieldB : { type: Schema.Types.ObjectId, ref: 'SchemaB' }
});
var SchemaB = new Schema({
field1: String,
.......
fieldC : { type: Schema.Types.ObjectId, ref: 'SchemaC' }
});
var SchemaC = new Schema({
field1: String,
.......
.......
.......
});
When I access SchemaA with a find query, I want to get the fields/properties of SchemaA along with those of SchemaB and SchemaC, in the same way a join works in an SQL database.
This is my approach:
SchemaA.find({})
.populate('fieldB')
.exec(function (err, result){
SchemaB.populate(result.fieldC,{path:'fieldB'},function(err, result){
.............................
});
});
The above code is working perfectly, but the problem is:
I want to have the information/properties/fields of SchemaC through SchemaA, and I don't want to populate the fields/properties of SchemaB.
The reason for not wanting the properties of SchemaB is that the extra population slows the query down unnecessarily.
Long story short:
I want to populate SchemaC through SchemaA without populating SchemaB.
Can you please suggest any way/approach?
As an avid mongodb fan, I suggest you use a relational database for highly relational data - that's what it's built for. You are losing all the benefits of mongodb when you have to perform 3+ queries to get a single object.
Buuuuuut, I know that comment will fall on deaf ears. Your best bet is to be as conscious as you can about performance. Your first step is to limit the fields to the minimum required. This is just good practice even with basic queries and any database engine - only get the fields you need (eg. SELECT * FROM === bad... just stop doing it!). You can also try doing lean queries to help save a lot of post-processing work mongoose does with the data. I didn't test this, but it should work...
SchemaA.find({}, 'field1 fieldB', { lean: true })
.populate({
path: 'fieldB',
select: 'fieldC',
options: { lean: true }
}).exec(function (err, result) {
// not sure how you are populating "result" in your example, as it should be an array,
// but you said your code works... so I'll let you figure out what goes here.
});
Also, a very "mongo" way of doing what you want is to save a reference in SchemaC back to SchemaA. When I say "mongo" way of doing it, you have to break away from your years of thinking about relational data queries. Do whatever it takes to perform fewer queries on the database, even if it requires two-way references and/or data duplication.
For example, if I had a Book schema and an Author schema, I would likely save the author's first and last name in the Books collection, along with an _id reference to the full profile in the Authors collection. That way I can load my Books in a single query, still display the author's name, and then generate a hyperlink to the author's profile: /author/{_id}. This is known as "data denormalization", and it has been known to give people heartburn. I try to use it on data that doesn't change very often, like people's names. On the occasion that a name does change, it's trivial to write a function to update all the names in multiple places.
SchemaA.find({})
.populate({
path: "fieldB",
populate:{path:"fieldC"}
}).exec(function (err, result) {
//this is how you can get all key value pair of SchemaA, SchemaB and SchemaC
//example: result.fieldB.fieldC._id(key of SchemaC)
});
Why not add a ref to SchemaC on SchemaA? The way you currently have it, there is no way to bridge from SchemaA to SchemaC without SchemaB, unless you populate SchemaB with no other data than a ref to SchemaC.
As explained in the docs under Field Selection, you can restrict what fields are returned.
.populate('fieldB') becomes populate('fieldB', 'fieldC -_id'). The -_id is required to omit the _id field just like when using select().
I think this is not possible. When a document in A refers to a document in B, and that document refers to another document in C, how can the document in A know which document in C to refer to without any help from B?
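Combining the nested populate and the field-selection answers above gives a middle ground: SchemaB still has to be read (it holds the only link to SchemaC), but only its fieldC is loaded. This is an untested sketch; ModelA stands for whatever model was compiled from SchemaA and is an assumed name.

```javascript
// Options object for Query.populate(): load only fieldC from the SchemaB
// document, then populate that fieldC from SchemaC.
const populateOptions = {
  path: "fieldB",
  select: "fieldC -_id",
  populate: { path: "fieldC" },
};

// Assumed usage, with ModelA being the model compiled from SchemaA:
//   ModelA.find({}, "field1 fieldB").lean().populate(populateOptions).exec();
console.log(populateOptions);
```

The result would still be reached as doc.fieldB.fieldC, so SchemaB acts as a thin bridge rather than being skipped entirely.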

Mongoose Private Chat Message Model

I'm trying to add private messaging between users into my data model. I've been going back and forth between two possible ways of doing this.
1) Each user has an array of user_id, chat_id pairs which correspond to chats they are participating in. Chat model just stores chat_id and array of messages.
2) Don't store chats with user at all and just have the Chat model store a pair of user_ids and array of messages.
The issue with option (1) is whenever a user joins or starts a chat, I would need to look first through the array for the user to see if the user_id, chat_id pair already exists. And then do a second find for the chat_id in Chat. If it doesn't exist, I would need to create the user_id, chat_id pair in two different places for both users who are participating.
With option (2) I would search through the Chat model for the user_id1, user_id2 pair, and if I find it I'm done, if not I would create a new Chat record for that pair and done.
Based on this option (2) does seem like the better way of handling this. However, I'm running into issues figuring out how to model the "pair" of user ids in a way that they are easily searchable in the chat model. i.e. how do I make sure I can find the chat record even if the user_ids are passed in the wrong order, i.e. user_id2, user_id1. What would be the best way to model this in Mongoose?
var chatSchema = mongoose.Schema({
messages: [{
text: {
type: String,
maxlength: 2000
},
sender: {
type: mongoose.Schema.Types.ObjectId,
ref: 'User'
}
}],
participant1: [{
type: mongoose.Schema.Types.ObjectId,
ref: 'User'
}],
participant2: [{
type: mongoose.Schema.Types.ObjectId,
ref: 'User'
}]
});
If it's something like above, how would I search for a participant pair? Could I order the participant IDs in some way so that they are always participant1 < participant2 for example, making search simpler?
Well, there is no single correct answer to this question, but the approaches you have mentioned are definitely not the best.
Firstly, when you are designing a "chat" model, you need to take into account that there may be millions of messages between users, so you need to care about performance when fetching the chats.
Storing the messages in an array is not a good idea: the document will keep growing over time, and MongoDB currently limits a single document to 16 MB.
https://docs.mongodb.com/manual/reference/limits/
Secondly, you have to consider pagination, because it affects performance when the chat is large. When you retrieve the chat between 2 users you won't request every message since the beginning of time; you will request only the most recent ones, and then fetch older ones if the user scrolls up. This aspect is very important and can't be neglected, given its effect on performance.
My approach would be to store each message in a separate document.
Storing each message in its own document makes fetching the chats faster, and each document stays very small.
This is a very simple example, you need to change the model according to your needs, it is just to represent the idea:
const MessageSchema = mongoose.Schema({
  message: {
    text: { type: String, required: true }
    // you can add any other properties to the message here.
    // for example, the message can be an image! so you need to tweak this a little
  },
  // if you want to make a group chat, you can have more than 2 users in this array
  users: [{
    user: { type: mongoose.Schema.Types.ObjectId, ref: 'User', required: true }
  }],
  sender: { type: mongoose.Schema.Types.ObjectId, ref: 'User', required: true },
  read: { type: Date }
},
{
  timestamps: true
});
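You can fetch the chats with this query:
// user1Id and user2Id are the two participants' ObjectIds;
// $all matches messages whose users array contains both of them
Message.find({ "users.user": { $all: [user1Id, user2Id] } })
  .sort({ updatedAt: -1 })
  .limit(20)
Easy and clean!
As you can see, pagination becomes very easy with this approach.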
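To make the pagination point concrete, here is a small sketch of how the filter for "the next older page" could be built. It assumes the message schema above with timestamps enabled (so createdAt exists) and uses a hypothetical helper name, buildMessagePageFilter.

```javascript
// Build the filter for one page of a two-person conversation.
// `before` is the createdAt of the oldest message already shown, or null
// for the first (most recent) page.
function buildMessagePageFilter(userA, userB, before) {
  const filter = { "users.user": { $all: [userA, userB] } };
  if (before) {
    filter.createdAt = { $lt: before };
  }
  return filter;
}

const firstPage = buildMessagePageFilter("u1", "u2", null);
const olderPage = buildMessagePageFilter("u1", "u2", new Date("2020-01-01T00:00:00Z"));
console.log(firstPage, olderPage);
```

The filter would be used as Message.find(filter).sort({ createdAt: -1 }).limit(20). Paging by the last seen timestamp instead of skip() keeps each request cheap even deep into a long conversation.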
A few suggestions.
First - why store Participant1 and 2 as arrays? There is one specific sender, and one (or more) recipients (depending on if you want group messages).
Consider the following Schema:
var ChatSchema = new Schema({
sender : {
type : mongoose.Schema.Types.ObjectId,
ref : 'User'
},
messages : [
{
message : String,
meta : [
{
user : {
type : mongoose.Schema.Types.ObjectId,
ref : 'User'
},
delivered : Boolean,
read : Boolean
}
]
}
],
is_group_message : { type : Boolean, default : false },
participants : [
{
user : {
type : mongoose.Schema.Types.ObjectId,
ref : 'User'
},
delivered : Boolean,
read : Boolean,
last_seen : Date
}
]
});
This schema allows one chat document to store all messages, all participants, and all statuses related to each message and each participant.
The Boolean is_group_message is just a shorter way to filter which chats are direct and which are group messages, maybe for client-side viewing or server-side processing. Direct messages are obviously easier to work with query-wise, but both are pretty simple.
The meta array lists the delivered/read status, etc., for each participant of a single message. If we weren't handling group messages, this wouldn't need to be an array, but we are, so that's fine.
The delivered and read properties on the main document (not the meta subdocument) are also just shorthand ways of telling whether the last message was delivered/read or not. They're updated on each write to the document.
This schema allows us to store everything about a chat in one document. Even group chats.
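On the question's own idea of always keeping participant1 < participant2: if the two participants are stored as single ids rather than arrays, the pair can be normalized before every lookup or insert, so the order in which the ids arrive stops mattering. A minimal sketch, with orderedPair as a hypothetical helper and ids compared as strings:

```javascript
// Return the two ids in a canonical order so that (a, b) and (b, a)
// always produce the same query and the same stored document.
function orderedPair(idA, idB) {
  const [first, second] = [String(idA), String(idB)].sort();
  return { participant1: first, participant2: second };
}

const forward = orderedPair("5f1a000000000000000000aa", "5f1a0000000000000000001b");
const reversed = orderedPair("5f1a0000000000000000001b", "5f1a000000000000000000aa");
console.log(forward, reversed);
```

Both calls yield the same object, which could be used directly as the filter of a findOne or an upsert. A unique compound index on participant1 and participant2 would then also prevent duplicate chats for the same pair.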

How to calculate Rating in my MongoDB design

I'm creating a system where users can write a review about an item and rate it from 0 to 5. I'm using MongoDB for this. My problem is finding the best way to calculate the total rating in the product schema. I don't think querying all comments to get their count and dividing the total rating by it is a good solution. Here is my schema. I appreciate any advice:
Comments:
var commentSchema = new Schema({
Rating : { type: Number, default:0 },
Helpful : { type: Number, default:0 },
User :{
type: Schema.ObjectId,
ref: 'users'
},
Content: String,
});
Here is my Item schema:
var productSchema = new Schema({
//id is barcode
_id : String,
Rating : { type: Number, default:0 },
Comments :[
{
type: Schema.ObjectId,
ref: 'comments'
}
],
});
EDIT: HERE is the solution I got from another topic : calculating average in Mongoose
You can get the total using the aggregation framework. First you use the $unwind operator to turn the comments into a document stream:
{ $unwind: "$Comments" }
The result is that each product document is turned into one product document per entry in its Comments array. That comment entry becomes a single object under the field Comments; all other fields are taken from the originating product document.
Then you use $group to rejoin the documents for each product by their _id, while you use the $avg operator to calculate the average of the rating-field:
{ $group: {
_id: "$_id",
average: { $avg: "$Comments.Rating" }
} }
Putting those two steps into an aggregation pipeline calculates the average rating for every product in your collection. You might want to narrow it down to one or a small subset of products, depending on what the user requested right now. To do this, prepend the pipeline with a $match step. The $match object works just like the one you pass to find().
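Put together, the steps above might look like the sketch below. One caveat: in the schema from the question, Comments holds ObjectId references rather than embedded comments, so this version adds a $lookup stage to pull in the referenced documents first. The collection name "comments" and the helper name are assumptions.

```javascript
// Build the pipeline for the average rating of a single product.
function buildAverageRatingPipeline(productId) {
  return [
    // Narrow down to the requested product first.
    { $match: { _id: productId } },
    // Replace the array of comment ids with the comment documents.
    {
      $lookup: {
        from: "comments",
        localField: "Comments",
        foreignField: "_id",
        as: "Comments",
      },
    },
    // One document per comment.
    { $unwind: "$Comments" },
    // Rejoin per product and average the ratings.
    {
      $group: {
        _id: "$_id",
        average: { $avg: "$Comments.Rating" },
        count: { $sum: 1 },
      },
    },
  ];
}

const ratingPipeline = buildAverageRatingPipeline("0123456789012");
console.log(JSON.stringify(ratingPipeline, null, 2));
```

The array would be passed to aggregate() on the product model. The count field is optional and is included only because a review count is often shown next to the average.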
The underlying question that it would be useful to understand is why you don't think that finding all of the ratings, summing them up, and dividing by the total number is a useful approach. Understanding the underlying reason would help drive a better solution.
Based on the comments below, it sounds like your main concern is performance and the need to run map-reduce (or another aggregation framework) each time a user wants to see total ratings.
This person addressed a similar issue here: http://markembling.info/2010/11/using-map-reduce-in-a-mongodb-app
The solution they identified was to separate out the execution of the map-reduce function from the need in the view to see the total value. In this case, the optimal solution would be to run the map-reduce periodically and store the results in another collection, and have the average rating based on the collection that stores the averages, rather than doing the calculation in real-time each time.
As I mentioned in the previous version of this answer, you can improve performance further by limiting the map-reduce to addressing ratings that were created or updated more recently, or since the last map-reduce aggregation.

Mongoose: populate() / DBref or data duplication?

I have two collections:
Users
Uploads
Each upload has a User associated with it and I need to know their details when an Upload is viewed. Is it best practice to duplicate this data inside the Uploads record, or use populate() to pull in these details from the Users collection referenced by _id?
OPTION 1
var UploadSchema = new Schema({
_id: { type: Schema.ObjectId },
_user: { type: Schema.ObjectId, ref: 'users'},
title: { type: String },
});
OPTION 2
var UploadSchema = new Schema({
_id: { type: Schema.ObjectId },
user: {
name: { type: String },
email: { type: String },
avatar: { type: String },
//...etc
},
title: { type: String },
});
With 'Option 2' if any of the data in the Users collection changes I will have to update this across all associated Upload records. With 'Option 1' on the other hand I can just chill out and let populate() ensure the latest User data is always shown.
Is the overhead of using populate() significant? What is the best practice in this common scenario?
If you need to query on your users, keep users in their own collection. If you need to query on your uploads, keep uploads in their own collection.
Another question you should ask yourself is: every time I need this data, do I need the embedded objects (and vice versa)? How many times will this data be updated? How many times will it be read?
Think about a friendship request:
If each time you need the request you also need the user who made it, then embed the request inside the user document.
You will be able to create an index on the embedded object too, and your search will be a single query: fast and consistent.
Just a link to my previous reply on a similar question:
Mongo DB relations between objects
I think this post will be right for you http://www.mongodb.org/display/DOCS/Schema+Design
Use Cases
Customer / Order / Order Line-Item
Orders should be a collection. customers a collection. line-items should be an array of line-items embedded in the order object.
Blogging system.
Posts should be a collection. post author might be a separate collection, or simply a field within posts if only an email address. comments should be embedded objects within a post for performance.
Schema Design Basics
Kyle Banker, 10gen
http://www.10gen.com/presentation/mongosf2011/schemabasics
Indexing & Query Optimization
Alvin Richards, Senior Director of Enterprise Engineering
http://www.10gen.com/presentation/mongosf-2011/mongodb-indexing-query-optimization
These 2 videos are the best on MongoDB I have ever seen, IMHO.
Populate() is just a query. So the overhead is whatever the query is, which is a find() on your model.
Also, best practice for MongoDB is to embed what you can. It will result in a faster query. It sounds like you'd be duplicating a ton of data though, which puts relations(linking) at a good spot.
"Linking" is just putting an ObjectId in a field from another model.
Here is the Mongo Best Practices http://www.mongodb.org/display/DOCS/Schema+Design#SchemaDesign-SummaryofBestPractices
Linking/DBRefs http://www.mongodb.org/display/DOCS/Database+References#DatabaseReferences-SimpleDirect%2FManualLinking
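If the duplicated approach (Option 2) is chosen, the cost mentioned in the question is that every user change has to be propagated to the uploads. A rough sketch of what that propagation could look like follows; it assumes the embedded user object also stores the original user's _id (which Option 2 as written does not show), and buildUserSyncUpdate is a hypothetical helper.

```javascript
// Translate a set of changed user fields into a filter/update pair that
// rewrites the duplicated copy inside every upload of that user.
function buildUserSyncUpdate(userId, changes) {
  const $set = {};
  for (const [field, value] of Object.entries(changes)) {
    $set[`user.${field}`] = value;
  }
  return { filter: { "user._id": userId }, update: { $set } };
}

const sync = buildUserSyncUpdate("u42", { name: "Ann", avatar: "ann.png" });
console.log(JSON.stringify(sync, null, 2));
```

The pair would be handed to updateMany(sync.filter, sync.update) on the uploads model whenever a user document is saved. Whether that extra write is cheaper than a populate() on every read depends on how often users change compared with how often uploads are viewed.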

Resources