Should I use lots of collections to help structure my data in MongoDB? - node.js

I am making a server-synced diary application with NodeJS, using MongoDB. All my heavily relational data lives in MySQL, but for users' daily memoirs I'm going to use Mongo, because, as you may have realised, there will be a huge number of notes/day diaries, I want to learn MongoDB, and it is supposed to be much better for large amounts of non-relational data.
I have learned how to create DBs and do everything, but the one thing none of the tutorials cover is the most important thing of all: how do I structure my data?
Below are several options I've thought of. As I am pretty inexperienced with Mongo, I would like some advice on which option would be best performance-wise.
Thank you in advance for your time, and any help!
Example 1: My database has one HUGE collection called "Days", and each entry in that collection looks like this (I am sorry, but no matter how much I think about it, this sounds like the least performant option; as said, I am inexperienced with Mongo and might be wrong):
{
  userID: 902, // This user ID will be fetched from MySQL when authenticating the user's request. From what I've read, I need to run a command similar to "db.posts.createIndex( { author_name : 1 } )" on this collection to somehow optimize performance?
  // What day? No, I won't use Date for this, because then I'd have to turn my JSON query data into a Date before querying (maybe I wouldn't have to, as Mongo may store it as a string anyway). BUT, I am not sure whether I should use 3 separate integer fields or one string field. Which would be faster? (EDIT: I know three separate int fields will be WAY faster, as my application also has to query data for one month, etc. MAYBE I'm wrong and this is bad practice, let me know.)
  day: 12,
  month: 5,
  year: 2018,
  // Actual stored data:
  dayTitle: "Lame day at home..",
  dayDescription: "Installed arch..",
  hugeLoadOfIndividualSmallNotesForThisDayWithTimeStamps: [
    { data: "Woke up, start now", time: "9:44" },
    { data: "Finally figured out what fdisk is", time: "21:29" },
    …
  ]
}
Example 2: My database has a collection for each user, named by their userID (this sounds VERY organized to me, and with my common sense it would seem like the most performant option, but from what I googled, people said this wouldn't be good, and that's EXACTLY why I am asking here), and each entry in that collection looks like this:
{
  day: 12,
  month: 5,
  year: 2018,
  dayTitle: "Lame day at home..",
  dayDescription: "Installed arch..",
  hugeLoadOfIndividualSmallNotesForThisDayWithTimeStamps: [
    { data: "Woke up, start now", time: "9:44" },
    { data: "Finally figured out what fdisk is", time: "21:29" },
    …
  ]
}
Example 3: My database has a collection for each day. (This is basically the same as Example 2, but there will be fewer collections. I am very unsure whether this would be better than option 2 performance-wise, and it would also be somewhat harder to implement because days change, etc.) Each entry in that collection looks like this:
{
  userID: 902,
  dayTitle: "Lame day at home..",
  dayDescription: "Installed arch..",
  hugeLoadOfIndividualSmallNotesForThisDayWithTimeStamps: [
    { data: "Woke up, start now", time: "9:44" },
    { data: "Finally figured out what fdisk is", time: "21:29" },
    …
  ]
}
As said before, thanks in advance guys!

It looks like for your case it would likely be best to put everything in one collection. All of the other ways you suggest breaking up the data would be just as well served by a single collection with indexes on the userID and day fields.
I tend to use collections to group together datasets in the same project, but that have different data structures.
If you broke out days or users into different collections, how would that scale? If you want to query all the text for all days, do you really want to connect to a few thousand different collections after your app has been used for ten years? Try writing some test cases for different user experiences and see how easy it would be to write queries to get users their data.
TLDR: Probably best to keep things together in one collection and use indexes to sort things out.
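For example, a rough sketch of what that could look like in the mongo shell, assuming a single collection named "days" with the fields from Example 1 (the collection name and the exact index are illustrative, not prescriptive):

db.days.createIndex({ userID: 1, year: 1, month: 1, day: 1 });
// With that compound index, a query like "all entries for user 902 in May 2018"
// can be answered from the index instead of scanning the whole collection:
db.days.find({ userID: 902, year: 2018, month: 5 }).sort({ day: -1 });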

Related

NodeJS and Mongo live who's online

TL;DR
logging online users and reporting back a count (based on a mongo find)
We've got a SaaS app for schools and students, and as part of this I've been wanting a 'live' who's-online ticker.
Teachers from the schools will see the counter, and the students and parents will trigger it.
I've got a socket.io connection from the web app to a NodeJS app.
Where there is lots of traffic, the Node/Mongo servers can't handle it, and rather than throw more resources at it, I figured it's better to optimise the code - because I don't know what I'm doing :D
with each student page load:
Create a socket.io connection with the following object:
{
'name': 'student or caregiver name',
'studentID': 123456,
'schoolID': 123,
'role': 'student', // ( or 'mother' or 'father' )
'page': window.location
}
in my NODE script:
io.on('connection', function(client) {
    // if it's a student connection..
    if (client.handshake.query.studentID) {
        let student = client.handshake.query; // that student object
        student.online = new Date();
        student.offline = null;
        // build the same reference key used below on disconnect
        student.reference = student.schoolID + student.studentID + student.role;
        db.collection('students').updateOne(
            { "reference": student.reference },
            { $set: student },
            { upsert: true }
        );
    }
    // IF STAFF::: just show count!
    if (client.handshake.query.staffID) {
        db.collection('students').find({ 'offline': null, 'schoolID': client.handshake.query.schoolID }).count(function(err, students_connected) {
            client.emit('online_users', students_connected);
        });
    }
    client.on('disconnect', function() {
        // then if the student leaves the page..
        if (client.handshake.query.studentID) {
            let student = client.handshake.query;
            db.collection('students').updateMany(
                { "reference": student.schoolID + student.studentID + student.role },
                { $set: { "offline": new Date().getTime() } }
            ).catch(function(er) {});
        }
        // IF STAFF::: just show updated count!
        if (client.handshake.query.staffID) {
            db.collection('students').find({ 'offline': null, 'schoolID': client.handshake.query.schoolID }).count(function(err, students_connected) {
                client.emit('online_users', students_connected);
            });
        }
    });
});
What Mongo indexes would you add? Would you store online students differently (and in a different collection) from a 'page tracking' type deal like this?
(This logs the page and duration, so I have another call later that pulls that - but that's not heavily used or causing the issue.)
If stored separately, would I insert on connect and then delete on disconnect?
The EMIT() to staff users, how can I only emit to staff with the same schoolID as the Students?
Thanks!
You have given a brief description of the issue but no diagnosis of why it is happening, so based on a few assumptions I will try to answer your question.
First of all, you mentioned that you'd like suggestions on which indexes can help your cause. From what you've described, this is a write-heavy system, and indexes will in principle only slow the writes, because on every write the B-tree that backs the index has to be updated too - although reads become much faster, especially in the case of a huge collection with a lot of data.
So an index can help you a lot if your collection has, let's say, 1 million documents. It lets you read only the required data without doing a scan over everything, thanks to the B-tree.
An index should be created specifically based on the read calls you make.
For example, given a document like:
{"student_id" : "studentID", "student_fname" : "Fname"}
If the read call here is based on student_id, then create an index on that; and if multiple fields are involved (equality, sort, or range), then create a compound index on those fields, giving priority to the equality field first, with the range and sort fields thereafter.
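As a hedged illustration of that equality-first guideline (the field names below are placeholders based on the question, not a known schema):

// Single-field index when reads only ever filter on student_id.
db.collection('students').createIndex({ student_id: 1 });

// Compound index when reads filter on an equality field and then sort/range:
// equality field (schoolID) first, range/sort field (offline) after it.
db.collection('students').createIndex({ schoolID: 1, offline: 1 });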
Now the second part of the question: what would be better in this scenario?
This is a subjective thing and I'm sure everyone will have a different approach to this. My solution is based on a few assumptions.
Assumption(s)
The system needs to cater to a specific feature where a student's online status is updated at some time interval, and that data is available for reads by parents, teachers, etc.
If the sockets you are using stay connected continuously all the time, then that's as many concurrent connections as you have users. Whether that is required or not, I don't know, but concurrent connections are heavy for the server, as you already know, and unless they're needed 100% of the time, try a mixed approach.
If it would be okay for you to disconnect for a while, or to keep a connection with the server only for a short interval, then please consider that. Which basically means: disconnect from the server gracefully, reconnect, send data, and repeat.
Or just adopt a heartbeat system, where your frontend app calls an API at a set time interval and pings the server; based on that you can tell whether the student is online or not. A little time delay, yes, but easily scalable.
Please use Redis or another in-memory data store for such frequent writes, especially when you don't need to persist the data for long.
For example, let's say we use a Redis list for every class / section of users and only update the timestamp (epoch) when the last heartbeat was received from the frontend.
In a class with 60 students, sort the students by student_id or something like that.
Create a list for that class.
For the student_id that is first in the ascending student list, update the epoch like this:
LSET mylist 0 "1266126162661" // Epoch timestamp
0 is your first student and 59 is your 60th student; update it on every heartbeat, either via an API or the same socket system you have. Depends on your use case.
When a read call is needed:
LRANGE classname/listname 0 59
Now you have the epochs of all users. Maintain the list of students either via the database or another list where you can simply match indexes to a specific student.
LSET studentList 0 "student_id" // Student id of the student or any other data, I am trying to explain the logic
On the frontend, when you have the epochs, take the latest epoch into account and apply your use case - for example, let's say I want a student to count as online if the heartbeat was received up to 5 minutes ago:
If (current timestamp - stored timestamp) is less than 5 minutes (in seconds), then the student is online; otherwise offline.
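A rough sketch of that heartbeat flow in Node, assuming the ioredis client and a class list that has already been seeded with one slot per student (all names, the key layout, and the 5-minute window are illustrative):

const Redis = require('ioredis');
const redis = new Redis();

// Called on every heartbeat for the student at position `index` in the
// class list (the list must already exist, e.g. seeded with RPUSH).
async function recordHeartbeat(className, index) {
  await redis.lset('heartbeats:' + className, index, Date.now().toString());
}

// Called when a teacher asks who's online: compare each stored epoch with
// the current time and count everyone seen within the last 5 minutes.
async function countOnline(className, windowMs = 5 * 60 * 1000) {
  const epochs = await redis.lrange('heartbeats:' + className, 0, -1);
  const now = Date.now();
  return epochs.filter(e => now - Number(e) < windowMs).length;
}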
This won't be a complete answer without discussing the problem some more, but figured I'd post some general suggestions.
First, we should figure out where the performance bottlenecks are. Is it a particular query? Is it too many simultaneous connections to MongoDB? Is it even just too much round trip time per query (if the two servers aren't within the same data center)? There's quite a bit to narrow down here. How many documents are in the collection? How much RAM does the MongoDB server have access to? This will give us an idea of whether you should be having scaling issues at this point. I can edit my answer later once we have more information about the problem.
Based on what we know currently, without making any model changes, you could consider indexing the reference field in order to make the upsert call faster (if that's the bottleneck). That could look something like:
db.collection('students').createIndex(
  { "reference": 1 },
  { background: true }
);
If the querying is the bottleneck, you could create an index like:
db.collection('students').createIndex(
  { "schoolID": 1 },
  { background: true }
);
I'm not confident (without knowing more about the data) that including offline in the index would help, because optimizing for "not null" can be tricky. Depending on the data, that may lead to storing the data differently (like you suggested).

Every 'nth' document from a collection - MongoDB + NodeJS

I am looking for a method to return data at different resolutions that is stored in MongoDB. The most elegant solution I can envision is a query that returns every 'nth' (second, third, tenth, etc.) document from the collection.
I am storing data (say temperature) at a 5 second interval but want to look at different trends in the data.
To find the instantaneous trend, I look at the last 720 entries (1 hour). This part is easy.
If I want to look at a slightly longer trend, say 3 hours, I could retrieve the last 2160 entries, but that is more time to pull from the server, and more time and memory to plot. When looking at larger trends, the small movements are noise, and I would be better off retrieving the same number of documents (720) but only every 3rd one - still giving me 3 hours of results with the same resources used, for a minor sacrifice in detail.
This only gets more extreme when I want to look at weeks (120,960 documents) or months (500,000+ documents).
My current code collects every single document (n = 1):
db.collection(collection).find().sort({$natural:-1}).limit(limit)
I could then loop through the returned array and remove every document when:
index % n != 0
This at least saves the client from dealing with all the data; however, it seems extremely inefficient and I would rather the database handled this part.
Does anyone know a method to accomplish this?
Apparently, there is no built-in solution in Mongo to solve your problem.
The way forward would be to archive your data smartly, in fragments.
So you can store your data in a collection which houses no more than weekly or monthly data. A new month/week means storing your data in a different collection. That way you won't be doing a full collection scan and won't be fetching every single document, as you mentioned in your problem. Your application code will decide which collection to query.
If I were in your shoes, I would use a different tool, as Mongo is more suited to being a general-purpose database. Time-series data (storing something every 5 seconds) can be handled pretty well by a database like Cassandra, which handles frequent writes with ease, just as in your case.
Alternate fragmentation (update):
Always write your current data to collection "week0", and in the background run a weekly scheduler that moves the data from "week0" to history collections "week1", "week2" and so on. The fragmentation logic depends on your requirements.
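A minimal sketch of that rotation, assuming the Node driver and that writes to "week0" are briefly paused or can tolerate the move while it runs (the collection names and the copy-then-delete approach are illustrative):

// Hypothetical weekly job (e.g. triggered by cron): move everything from
// the live collection into the archive collection for the week just ended.
async function rotateWeek(db, archiveName) {
  const docs = await db.collection('week0').find({}).toArray();
  if (docs.length > 0) {
    await db.collection(archiveName).insertMany(docs); // e.g. 'week14'
    await db.collection('week0').deleteMany({});
  }
}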
I think the $bucketAuto stage might help you with it.
You can do something like,
db.collection.aggregate([
  {
    $bucketAuto: {
      groupBy: "$_id", // here you'll put the variable you need, in your example 'temperature'
      buckets: 5 // this is the number of documents you want to return, so if you want a sample of 500 documents, you can put 500 here
    }
  }
])
Each document in the result for the above query would be something like this,
"_id": {
"max": 3,
"min": 1
},
"count": 2
If you had grouped by temperature, then each document will have the minimum and maximum temperature found in that sample
You might have another problem. Docs state not to rely on natural ordering:
This ordering is an internal implementation feature, and you should
not rely on any particular structure within it.
You can instead save the epoch seconds in each document and do your mod arithmetic on it as part of a query, with limit and sort.
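A hedged sketch of that approach, assuming each document stores its timestamp as epoch seconds in a field called ts and that readings are written every 5 seconds (the field name and the numbers are illustrative):

// Keep roughly every 3rd reading by only matching timestamps that are a
// multiple of 15 seconds, then sort and limit as before.
db.collection(collection)
  .find({ ts: { $mod: [15, 0] } })
  .sort({ ts: -1 })
  .limit(720)
  .toArray(function (err, docs) {
    // docs now covers about 3 hours at a third of the full resolution
  });

Note that this only works cleanly if the stored timestamps actually land on those multiples; if they drift, the $bucketAuto approach above is more robust.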

How to deal with huge amount of group-chat messages in MongoDB?

I'm building a chat app with different groups, so I'm using a single collection in MongoDB (one for all groups). This is my message schema:
const MessageSchema = mongoose.Schema({
groupId: Number,
userId: Number,
messageIat: Date,
message: String,
reactions: []
});
Let's say I want to load the last 50 messages of the group with the id 10.
To sort the messages I'm using the default ObjectId.
I'm using the following query. To me, it seems like I'm loading all messages of group 10, then sorting them to ensure the order, and only then limiting the results. That doesn't seem very efficient. If there are a lot of messages it will take quite some time, right?
return Message.find({groupId:10}).sort( {_id: -1 }).limit(50)
My first try was to do the limit operation first, but then I can't rely on the order, so what's the common way to do this?
Or is it more common to split it up, i.e. to have a collection per group?
Thanks for helping.
According to docs:
For queries that include a sort operation without an index, the server
must load all the documents in memory to perform the sort before
returning any results.
So first off, make sure to create an index for whatever field you're going to sort the results by.
Also,
The sort can sometimes be satisfied by scanning an index in order. If
the query plan uses an index to provide the requested sort order,
MongoDB does not perform an in-memory sorting of the result set
Moreover, according to this page, the following queries are equivalent:
db.bios.find().sort( { name: 1 } ).limit( 5 )
db.bios.find().limit( 5 ).sort( { name: 1 } )
Finally, as long as the indexes fit entirely in memory, you should be fine with your current approach. Otherwise you might want to consider doing some manual partitioning.
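A minimal sketch of such an index on the schema above, so the equality filter on groupId and the descending sort on _id can both be satisfied by walking the index (this assumes Mongoose, as in the question):

// Compound index: groupId for the equality filter, _id for the sort, so
// Message.find({ groupId: 10 }).sort({ _id: -1 }).limit(50) never has to
// sort in memory.
MessageSchema.index({ groupId: 1, _id: -1 });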

MongoDB data storage strategy

I'm using MongoDB and I'm quite new to it, so I'd like your help on how to model my data. What is the most efficient way?
Here is my use case.
Let's say I have three income sources, named Income1, Income2, Income3. Tomorrow there might be 4 or 20. Each new income source will require a new integration to be implemented.
Let's say I have ten users, named User1, User2... User10. Tomorrow there might be 1000 (I hope ;-)). Here, no integration is needed for a new user.
And let's say that I'm interested in storing, for each day, how much money User1 got from Income1, Income2, ... User2 from Income1, Income2... and so on. And even some day I'll aggregate all of this.
Still following me?
How should I model this?
First idea: Separate collections and separate documents
3 Collections : Income1, Income2, Income3. If an Income4 comes up, no problem, since I'll have to add some code, I can also create a new collection. Not an issue.
In each collection, the data for a user, with one document per user and per date, like this:
Income 1
{name:'user1', date:'2014-12-07',money:'24.32'}
{name:'user1', date:'2014-12-08',money:'14.20'}
{name:'user2', date:'2014-12-07',money:'0.00'}
{name:'user2', date:'2014-12-08',money:'0.00'}
{name:'user2', date:'2014-12-09',money:'10.00'}
{name:'user3', date:'2014-12-09',money:'124.32'}
Income 2
{name:'user1', date:'2014-12-05',money:'4.00'}
{name:'user2', date:'2014-12-06',money:'0.20'}
Second idea: Separate collections, and same document + embedded document
3 collections as before. In each collection, the data for a user, with ONE document per user:
Income 1
{name:'user1', incomes:
[{date:'2014-12-07',money:'24.32'},{date:'2014-12-08',money:'14.20'}]}
{name:'user2', incomes:
[{date:'2014-12-07',money:'0.00'},{date:'2014-12-08',money:'0.00'},{date:'2014-12-09',money:'10.00'}]}
{name:'user3', incomes:
[{date:'2014-12-09',money:'124.32'}]}
Income 2
{name:'user1', incomes: [{date:'2014-12-05',money:'4.00'}]}
{name:'user2', incomes:[{date:'2014-12-06',money:'0.20'}]}
Third idea: Same collection, and separate documents for everything.
{income_type:1,name:'user1', date:'2014-12-07',money:'24.32'}
{income_type:1,name:'user1', date:'2014-12-08',money:'14.20'}
{income_type:1,name:'user2', date:'2014-12-07',money:'0.00'}
{income_type:1,name:'user2', date:'2014-12-08',money:'0.00'}
{income_type:1,name:'user2', date:'2014-12-09',money:'10.00'}
{income_type:1,name:'user3', date:'2014-12-09',money:'124.32'}
{income_type:2,name:'user1', date:'2014-12-05',money:'4.00'}
{income_type:2,name:'user2', date:'2014-12-06',money:'0.20'}
These are some ideas. I'm sure there are others.
I will often have to query per user, on the most recent documents (i.e. those with the most recent dates). I may from time to time need to aggregate information per week, month... And, finally, I think I'll update the collection from a cron job running every night (to add the corresponding income for each income source and user).
Is this clear? I come from a relational database background (is it so obvious?) so maybe there is something I haven't even considered.
Thanks in advance.
At this point I would recommend the third idea. Rolling the data up per user and/or per income stream is quite simple using the aggregation pipeline. Working with sub-documents is more pain than it's worth, in my experience.
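A hedged sketch of such a roll-up over the third layout, assuming money is stored as a number rather than the string shown in the examples and that the collection is called incomes (both are assumptions, not from the question):

// Total income per user and per income source for December 2014.
db.incomes.aggregate([
  { $match: { date: { $gte: '2014-12-01', $lte: '2014-12-31' } } },
  { $group: {
      _id: { name: '$name', income_type: '$income_type' },
      total: { $sum: '$money' }
  } }
]);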

Aggregating data with CouchDB reduce function

I have a process which posts documents similar to the one below to CouchDB:
{
"timestamp": [2010, 8, 4, 9, 25, 24],
"type": "quote",
"bid": 95.0,
"offer": 96.5
}
Many such documents are posted over the course of a day, each timestamped appropriately.
I want to create a CouchDB view which returns the last quote stored every day.
I've been reading View Cookbook for SQL Jockeys on how to create complex views but I have trouble seeing how to combine map and reduce functions to achieve the desired result. The map function is easy; it's the reduce function I'm having trouble with.
Any pointers gratefully received.
Create a map function that emits all documents for a given time period under the same key. For example, emit all documents in the 17th hour of the day with key 17.
Create a reduce function that returns only the latest bid for that hour. Your view will return 24 documents, and your client-side code will do the final merge.
There are many ways to accomplish this. You could retrieve a single latest bid by emitting a single key from your map function and reducing over all bids, but I'm not sure how that would perform for extremely large sets, such as those you'd encounter with a bidding system.
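A rough sketch of what the view functions could look like, keyed by calendar day rather than by hour since the goal is the last quote per day (untested, just to show the shape of the map and reduce functions):

// Map: key each quote document by its calendar day.
function (doc) {
  if (doc.type === 'quote') {
    emit([doc.timestamp[0], doc.timestamp[1], doc.timestamp[2]], doc);
  }
}

// Reduce: of all quotes sharing a day, keep the one with the latest
// timestamp, comparing the [year, month, day, hour, min, sec] arrays
// element by element.
function (keys, values, rereduce) {
  function later(a, b) {
    for (var i = 0; i < a.timestamp.length; i++) {
      if (a.timestamp[i] !== b.timestamp[i]) {
        return a.timestamp[i] > b.timestamp[i] ? a : b;
      }
    }
    return a;
  }
  return values.reduce(later);
}

Querying the view with group=true (or the appropriate group_level) then returns one row per day holding that day's last quote.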
Update
http://wiki.apache.org/couchdb/View_Snippets#Computing_simple_summary_statistics_.28min.2Cmax.2Cmean.2Cstandard_deviation.29
