Storing data efficiently in MongoLab and in general - node.js

I have an app that listens to a websocket and it stores usernames/userID's (Usernames are 1-20 bytes, UserID's are 17 bytes). This is not a big deal because it's only one document. However, every round they participate in, it pushes the round ID (24 bytes) and a 'score' decimal value (ex: 1190.0015239999999).
The thing is, there is no limit to how many rounds there are and I can't afford to pay so much per month for mongolab. What's the best way to handle this data?
My thoughts:
- If there is a way to replace the _id: field in mongodb, I will replace it with the userID which is 17 bytes long. Not sure if I can do that though.
- Store user data with timestamps and remove OLD data that has a score value less than 200.
- Cut off user names that are more than 10 characters.
- Completely remove Round ID's (Or replace the _id field with roundId). (Won't work since there are multiple roundID's in each document)
- Round the decimal value to two places.
- Remove Round ID's after 30 days
tl;dr
Need to store data efficiently < 500 mb in mongo lab
Documents consists of username(1-20 characters), userid(17 characters), round(Object Array) = [{round Id(24 characters), score(1190.0015239999999)}].
Thanks in advance!
Edit:
Document Schema:
userID: {type: String},
userName: {type: String},
rounds: [{roundID: String, score: String}]

Modelling 1:n relationships as embedded documents is not the best approach except in very rare cases. This is because there is a 16MB size limit for BSON documents at the time of this writing, and an array with no upper bound will eventually hit it.
A better (read: more scalable and efficient) approach is to use document references.
First, you need your player data, of course. Here is an example:
{
_id: "SomeUserId",
name: "SomeName"
}
There is no need for an extra userId field since each document needs to have an _id field with unique values anyway. Contrary to popular belief, this field's value does not have to be an ObjectId. So we have already reduced the size you need for your player data by 1/3, if I am not mistaken.
Next, the results of each round:
{
_id: {
round: "SomeString",
player: "SomeUserId"
},
score: 5,
createdAt: ISODate("2015-04-13T01:03:04.0002Z")
}
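A few things are worth noting here. First and foremost: do NOT use strings to record numeric values. Even grades should rather be stored as corresponding numerical values; otherwise you cannot compute averages and the like. I'll show more of that later. We are using a compound field for _id here, which is perfectly valid. Furthermore, it gives us a free index on _id. Note that this automatic index serves exact matches on the whole _id value (with the fields in the same order); queries on a single subfield such as "_id.round" need an additional index if they are to avoid a collection scan. The most likely queries look like "How did player X score in round Y?"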
db.results.find({"_id.player":"X","_id.round":"Y"})
or "What were the results of round Y?"
db.results.find({"_id.round":"Y"})
or "What were the scores of player X in all rounds?"
db.results.find({"_id.player":"X"})
However, by not using a string to save the score, even some nifty stats become rather cheap, for example "What was the average score of round Y?"
db.results.aggregate([
{ $match: { "_id.round":"Y" } },
{ $group: { "_id":"$_id.round", "averageScore": {$avg:"$score"} } }
])
or "What is the average score of each player in all rounds?"
db.results.aggregate([
{ $group: { "_id":"$_id.player", "averageAll": {$avg:"$score"} } }
])
While you could do these calculations in your application, MongoDB can do them much more efficiently since the data does not have to be sent to your app prior to processing it.
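To make concrete what a $group stage with $avg computes, here is a rough plain-JavaScript sketch over an in-memory array. The sample documents and the helper name averageScoreBy are made up for illustration; this is not how MongoDB implements it, only what the result looks like.

```javascript
// Hypothetical in-memory copies of documents from the results collection.
const results = [
  { _id: { round: "Y", player: "A" }, score: 10 },
  { _id: { round: "Y", player: "B" }, score: 20 },
  { _id: { round: "Z", player: "A" }, score: 40 },
];

// Rough equivalent of { $group: { _id: <key>, average: { $avg: "$score" } } }.
// keyOf picks the grouping key, for example the player or the round.
function averageScoreBy(docs, keyOf) {
  const totals = new Map();
  for (const doc of docs) {
    const key = keyOf(doc);
    const entry = totals.get(key) || { sum: 0, count: 0 };
    entry.sum += doc.score;
    entry.count += 1;
    totals.set(key, entry);
  }
  const averages = {};
  for (const [key, { sum, count }] of totals) {
    averages[key] = sum / count;
  }
  return averages;
}

const byPlayer = averageScoreBy(results, (doc) => doc._id.player);
const byRound = averageScoreBy(results, (doc) => doc._id.round);
```

If the scores were strings, the `+=` in this sketch would concatenate instead of add, which is the same reason string scores are useless for server-side averages.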
Next, for the data expiration. We have a createdAt field, of type ISODate. Now, we let MongoDB take care of the rest by creating a TTL index
db.results.createIndex(
{ "createdAt":1 },
{ expireAfterSeconds: 60*60*24*30}
)
So all in all, this should be pretty much the most efficient way of storing and expiring your data, while improving scalability at the same time.
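As a minimal sketch of the application side, assuming the scores arrive as strings like in the question, a helper along these lines could build one result document per round. The name toResultDoc and its parameters are hypothetical.

```javascript
// Hypothetical helper: turn one incoming score into a document for the
// results collection described above.
function toResultDoc(roundId, playerId, rawScore, now = new Date()) {
  const score = Number(rawScore);
  if (!Number.isFinite(score)) {
    throw new TypeError("score is not numeric: " + rawScore);
  }
  return {
    // Compound _id: one document per (round, player) pair.
    _id: { round: roundId, player: playerId },
    // Stored as a number, rounded to two places as suggested in the question.
    score: Math.round(score * 100) / 100,
    // The TTL index on createdAt needs a real Date, not a string.
    createdAt: now,
  };
}

const doc = toResultDoc("round42", "SomeUserId", "1190.0015239999999");
```

Note that a BSON double takes 8 bytes regardless of how many decimals it has, so the rounding here is about tidiness, not storage. The saving comes from no longer storing the score as a string.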

So currently you are storing three data points in the array for each record.
_id: false will prevent mongoose from automatically creating an id for each array element. If you don't need roundID, then you can use the following, which only stores one data point in the array:
rounds: [{_id: false, score: String}]
Otherwise, if roundID actually has meaning, use the following, which stores two data points in the array:
rounds: [{_id: false, roundID: String, score: String}]
Lastly, if you just need an ID for reference purposes, use the following, which will store two data points in the array - a random id and the score:
rounds: [{score: String}]

Related

MongoDb: Best way to store time range

I need to store something like startTime and endTime in my document. To give some more context, these will reflect the opening and closing times for a shop. So, for example, startTime could be 9AM and endTime could be 9PM. What is the best way to store this? This is what I am doing right now:
timings: {
startTime: {
type: String,
required: [true, "....."]
},
endTime: {
type: String,
required: [true, "....."]
}
}
The idea is to store the values as strings ("9AM", "9PM") and do some sort of time parsing each time I query the database. But I was wondering if there was a better approach to this? Another idea I had is to store it as DateTime and ignore the date part. What else can I do? I'd like to avoid parsing/processing on application level as much as possible and leverage the power of mongodb.
I'm using mongoose and nodeJS.
I would agree that the Date type is not a good fit here (it drags in time zones, and you might not want to go there...).
I would store it as a number, not a string. Why? Because you might want to query it (like "give me all shops that open after 8pm"), and doing that with a string would be annoying.
I'd go with this:
{
startTime: {
value: number;
amOrPm: string; //(if you don't want to use a 24 hours base)
},
endTime: {
value: number;
amOrPm: string; //(if you don't want to use a 24 hours base)
},
timeOffset: number // So you keep track on the offset with the base timezone
}
You could also store the minutes, or even store the time only as "minutes elapsed since midnight" and convert it on every access.
Having the offset this way won't allow you to easily query for a specific moment across different timezones, but I guess that would be useless in your case anyway.
Also, you could store days as numbers ('officially' Sunday is 0, then Monday is 1), but nowadays it is just as easy to store a name, so, well...
Edit: for the days, maybe it's better to go with an array:
{
daysOppenned: [0, 1, 4]
}
And finally, what if each day has different opening times? Then you might have to consider an array of days, each containing its opening and closing time, like above.
If you want to go into even more detail: some shops close at midday, and others (like restaurants) open twice a day. You could then let them tick boxes on a schedule and store that in an array.
Let us know if you need to build something like that!
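The "minutes since midnight" idea mentioned above could be sketched like this. The function names, the accepted input formats and the timings.startTime field path in the filter are assumptions for illustration, not part of the original schema.

```javascript
// Convert strings such as "9AM", "9:30 pm" or "21:15" to minutes since midnight.
function toMinutes(text) {
  const match = /^\s*(\d{1,2})(?::(\d{2}))?\s*(am|pm)?\s*$/i.exec(text);
  if (!match) throw new Error("Unrecognized time: " + text);
  let hours = Number(match[1]);
  const minutes = match[2] ? Number(match[2]) : 0;
  const suffix = match[3] ? match[3].toLowerCase() : null;
  if (suffix) {
    if (hours < 1 || hours > 12) throw new Error("Invalid 12-hour time: " + text);
    // 12AM is hour 0 and 12PM is hour 12.
    hours = (hours % 12) + (suffix === "pm" ? 12 : 0);
  }
  if (hours > 23 || minutes > 59) throw new Error("Invalid time: " + text);
  return hours * 60 + minutes;
}

// Convert back to a 24-hour "HH:MM" string for display.
function toClock(totalMinutes) {
  const hours = Math.floor(totalMinutes / 60);
  const minutes = totalMinutes % 60;
  return String(hours).padStart(2, "0") + ":" + String(minutes).padStart(2, "0");
}

// With numbers stored, "all shops that open after 8pm" becomes a plain range
// filter (the field path is an assumption about where the number is stored).
const opensAfter8pm = { "timings.startTime": { $gt: toMinutes("8PM") } };
```

Parsing happens once when the document is written, so queries only ever compare integers.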

Cloudant 1 to many function

I’ve just started to use Cloudant and I just can’t get my head around the map functions. I’ve been fiddling with the data below but it isn’t working out as I expected.
The relationship is: a user can have many vehicles, and a vehicle belongs to one user. The vehicle's ‘userId’ is the key of the user. There is a bit of redundancy, since in the user document the _id and userId are the same; I guess the latter is not required.
Anyhow, how can I find for a/every user, the vehicles which belong to it? The closest I’ve come through trial and error is a result which displays the owner of every vehicle, but I would like it the other way round, the user and the vehicles belonging to it. All the examples I’ve found use another document which ‘joins’ two or more documents, but I don’t need to do that?
Any point in the right direction appreciated - I really have no idea.
function (doc) {
if (doc.$doctype == "vehicle")
{
emit(doc.userId, {_id: doc.userId});
}
}
EDIT: Getting closer. I'm not sure exactly what I was expecting, but the result seems a bit 'messy'. Row[0] is the user document, row[n > 0] are the vehicle documents. I guess it's fine when a startkey/endkey is used, but without the results are a bit jumbled up.
function (doc) {
if (doc.$doctype == 'user') {
emit([doc._id, 0], doc);
} else if (doc.$doctype == 'vehicle') {
emit([doc.userId, 1, doc._id], doc);
}
}
A user is described as,
{
"_id": "user:10",
"firstname": "firstnamehere",
"secondname": "secondnamehere",
"userId": "user:10",
"$doctype": "user"
}
a vehicle is described as,
{
"_id": "vehicle:4002",
"name": "avehicle",
"userId": "user:10",
"$doctype": "vehicle"
}
You're heading in the right direction! You already got the global IDs right. Having the type of the document as part of the ID in some form is a very good idea, so that you don't get confused later (all documents are in the same "pot").
Here are some minor problems with your current solution (before getting to your actual question):
Don't emit the doc as value in emit(key, value). You can always ask for the document that belongs to a view row by querying with include_docs=true. Having the doc as view value increases the size of the view index a lot. When you don't need a specific value, use emit(key, null).
You also don't need the ID in the emit value. You'll get the ID of the document that belongs to a view row as part of the row anyway.
View Collation
Now to your problem of aggregating the vehicles with their user. You got the basic pattern right. This pattern is called view collation, you can read more about it in the CouchDB docs (ignore that it is in the "Couchapp" section).
The trick with view collation is that you return two or more types of documents, but make sure that they are sorted in a way that allows for direct grouping. Thus it is important to understand how CouchDB sorts the view result. See the collation specification for more information on that one. An important key to understanding view collation is that rows with array keys are sorted by key elements. So when two rows have the same key[0], they sort by key[1]. If that's equal as well, key[2] is considered, and so on.
Your map function first groups users and vehicles by user ID (key[0]). It then uses the fact that 0 sorts before 1 in the second element of the key, so your view will contain the following:
user 1
vehicle of user 1
vehicle of user 1
vehicle of user 1
user 2
user 3
vehicle of user 3
user 4
etc.
As you can see, the vehicles of a user immediately follow their user. Thus you can group this result into aggregates without performing expensive sort or lookup operations.
Note that users are sorted according to their ID, and vehicles within users also according to their ID. This is because you use the IDs in the key array.
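Because each user row is immediately followed by its vehicle rows, the client can fold the view result into aggregates in a single pass. Here is a minimal sketch, assuming the view was queried with include_docs=true and that rows have the usual { id, key, doc } shape; the function name and sample rows are made up.

```javascript
// Fold sorted view rows into [{ user, vehicles: [...] }, ...] in one pass.
function groupUsersWithVehicles(rows) {
  const groups = [];
  let current = null;
  for (const row of rows) {
    const userId = row.key[0];
    const kind = row.key[1]; // 0 = user row, 1 = vehicle row
    if (kind === 0) {
      current = { userId: userId, user: row.doc, vehicles: [] };
      groups.push(current);
    } else if (current && current.userId === userId) {
      current.vehicles.push(row.doc);
    }
    // A vehicle row whose user row is missing is skipped in this sketch.
  }
  return groups;
}

// Hypothetical rows, in the order the view would return them.
const rows = [
  { id: "user:1", key: ["user:1", 0], doc: { _id: "user:1" } },
  { id: "vehicle:7", key: ["user:1", 1, "vehicle:7"], doc: { _id: "vehicle:7", userId: "user:1" } },
  { id: "user:2", key: ["user:2", 0], doc: { _id: "user:2" } },
  { id: "user:3", key: ["user:3", 0], doc: { _id: "user:3" } },
  { id: "vehicle:9", key: ["user:3", 1, "vehicle:9"], doc: { _id: "vehicle:9", userId: "user:3" } },
];

const grouped = groupUsersWithVehicles(rows);
```

No sorting or lookup is needed on the client; the view order does the work.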
Creating Queries
Now that view isn't worth much if you can't query according to your needs. A view as you have it supports the following queries:
Get all users with their vehicles
Get a range of users with their vehicles
Get a single user with its vehicles
Get a single user without vehicles (you could also use the _all_docs view for that though)
Example query for "all users between user 1 and user 3 (inclusive) with their vehicles"
We want to query for a range, so we use startkey and endkey in the query:
startkey=["user:1", 0]
endkey=["user:3", 1, {}]
Note the use of {} as sentinel value, which is required so that the end key is larger than any row that has a key of ["user:3", 1, (anyConceivableVehicleId)]
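View keys are JSON, so they have to be JSON-encoded and then URL-encoded when they go into the query string. A small sketch of building that string follows; the helper name is hypothetical, and the database, design document and view names in the comment are placeholders.

```javascript
// Build the query string for "users from firstUserId to lastUserId (inclusive)
// with their vehicles", using {} as the upper sentinel for vehicle IDs.
function userRangeQuery(firstUserId, lastUserId) {
  const params = {
    startkey: [firstUserId, 0],
    endkey: [lastUserId, 1, {}],
    include_docs: true,
  };
  return Object.keys(params)
    .map((name) => name + "=" + encodeURIComponent(JSON.stringify(params[name])))
    .join("&");
}

const query = userRangeQuery("user:1", "user:3");
// Would be appended to a view URL such as
// /mydb/_design/app/_view/users_with_vehicles? (placeholder names)
```

For a single user with its vehicles, pass the same ID as both arguments.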

mongodb: another "how to add a random record" thread

I've come across many of this same question here on StackOverflow. None providing a valid solid solution, so here we go:
I need to pick a random document from around 5 million documents in my MongoDB database in an efficient way.
I've tried getting the .count and using .skip to get the random document, but it takes almost three seconds and is very, very inefficient.
I can't make changes to the documents (like adding a "random" entry to each document) or change their _id's.
I've tried the solution of adding documents with an incremental _id (to pick a random _id and bypass .skip), but this brought more headaches than it solved when I tried to add many documents in a short amount of time.
Adding data in an incremental way, or picking a random document, should not be this hard. I'm either missing some common knowledge, or doing something wrong, or this is what it really is..
Wanted to bring up the topic and get your responses.
Here is a way using the default ObjectId values for _id and a little math and logic.
// Get the "min" and "max" timestamp values from the _id in the collection and the
// diff between.
// 4-bytes from a hex string is 8 characters
var min = parseInt(db.collection.find()
.sort({ "_id": 1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
max = parseInt(db.collection.find()
.sort({ "_id": -1 }).limit(1).toArray()[0]._id.str.substr(0,8),16)*1000,
diff = max - min;
// Get a random value from diff and divide/multiply by 1000 for the "_id" precision:
var random = Math.floor(Math.floor(Math.random()*diff)/1000)*1000;
// work out a "random" _id value in the range:
var _id = new ObjectId(((min + random)/1000).toString(16) + "0000000000000000")
// Then query for the single document:
var randomDoc = db.collection.find({ "_id": { "$gte": _id } })
.sort({ "_id": 1 }).limit(1).toArray()[0];
That's the general logic in shell representation and easily adaptable.
So in points:
Find the min and max primary key values in the collection
Generate a random number that falls between the timestamps of those documents.
Add the random number to the minimum value and find the first document that is greater than or equal to that value.
This uses "padding" of the timestamp value in hex to form a valid ObjectId value, since that is what we are looking for. Using integers as the _id value is essentially simpler, but the basic idea is the same as in the points above.
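The hex padding step can be isolated into a small function that needs no database, which makes it easy to check. This is a sketch under the same assumption as above (default ObjectId values, whose first 4 bytes are a timestamp in seconds); the function names are hypothetical and the random source is injectable only to make testing possible.

```javascript
// Build the 24-character hex string for an ObjectId whose timestamp lies
// between minSeconds and maxSeconds (inclusive) and whose remaining 8 bytes
// are zero, so that it sorts before any real _id from the same second.
function randomObjectIdHex(minSeconds, maxSeconds, random = Math.random) {
  const offset = Math.floor(random() * (maxSeconds - minSeconds + 1));
  const timestampHex = (minSeconds + offset).toString(16).padStart(8, "0");
  return timestampHex + "0".repeat(16);
}

// Read the timestamp (seconds since the epoch) back out of an ObjectId hex string.
function timestampOf(objectIdHex) {
  return parseInt(objectIdHex.slice(0, 8), 16);
}

const min = timestampOf("552b1a00aaaaaaaaaaaaaaaa");
const max = timestampOf("552b1a0abbbbbbbbbbbbbbbb");
const candidate = randomObjectIdHex(min, max);
// In the shell, the lookup would then be:
// db.collection.find({ _id: { $gte: ObjectId(candidate) } }).sort({ _id: 1 }).limit(1)
```

One caveat: the pick is uniform over time, not over documents, so a document that follows a long gap in insertions is selected more often than one inserted during a burst.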

How to get a list of all CouchDB documents that are valid on a given date?

I have a large collection of documents and each is valid for a range of days. The range could be from 1 week up to 1 year. I want to be able to get all the documents that are valid on a specific day.
How would I do that?
As an example say I have the following two documents:
doc1 = {
// 1 year ago to today
start_at: "2012-03-22T00:00:00Z",
end_at: "2013-03-22T00:00:00Z"
}
doc2 = {
// 2 months ago to today
start_at: "2013-01-22T00:00:00Z",
end_at: "2013-03-22T00:00:00Z"
}
And a map function:
(doc) ->
emit([doc.start_at, doc.end_at], null)
So for a date of 6 months ago I would only get doc1, a date of 1 week ago I would get both documents, and with a date of tomorrow I would receive no documents.
Note that actual resolution needs to be down to the second of the request being made and there are lots of documents, so strategies of emitting a key for every valid second would not be appropriate.
You could call emit for each day in your range, and then you can easily pick out the documents available for a specific day.
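function(doc) {
// Emit one row per calendar day (UTC) on which the document is valid.
// The key is a "YYYY-MM-DD" string, so query with e.g. key="2012-09-22".
var day = new Date(doc.start_at),
end = new Date(doc.end_at).getTime();
do {
emit(day.toISOString().slice(0, 10), null);
day = new Date(Date.UTC(day.getUTCFullYear(), day.getUTCMonth(), day.getUTCDate() + 1));
} while (day.getTime() <= end);
}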
Even though you will have lots of documents, if you leave out the value part (2nd param) of your emit, the index will be as small as it could possibly be.
If you need to get more sophisticated, you could try out couchdb-lucene. You can index date fields as date objects and execute range queries with multiple fields in 1 request.
You can translate the problem into the computational-geometry problem of point location. If each document is a point in the two-dimensional plane [x,y]=[start_at,end_at], then the documents valid at a given date are the points in the rectangle bounded by: left=-infinity, right=date (start_at<date) and bottom=date, top=infinity (end_at>date).
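The rectangle condition boils down to a two-sided comparison per document. As a plain-JavaScript sketch over in-memory documents (the sample data and function names are made up, and the bounds are treated as inclusive here, which is an assumption):

```javascript
// A document is valid on a date when start_at <= date <= end_at.
function isValidOn(doc, date) {
  const t = new Date(date).getTime();
  return new Date(doc.start_at).getTime() <= t && t <= new Date(doc.end_at).getTime();
}

// Return the _ids of all documents valid on the given date.
function validOn(docs, date) {
  return docs.filter((doc) => isValidOn(doc, date)).map((doc) => doc._id);
}

// Hypothetical documents: one valid for a year, one for the last two months of it.
const docs = [
  { _id: "doc1", start_at: "2012-03-22T00:00:00Z", end_at: "2013-03-22T00:00:00Z" },
  { _id: "doc2", start_at: "2013-01-22T00:00:00Z", end_at: "2013-03-22T00:00:00Z" },
];

const sixMonthsAgo = validOn(docs, "2012-09-22T00:00:00Z");
const oneWeekAgo = validOn(docs, "2013-03-15T00:00:00Z");
const tomorrow = validOn(docs, "2013-03-23T00:00:00Z");
```

This is a linear scan, so it only states the condition; the point of a spatial index is to answer the same question without touching every document.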
Unfortunately, the CouchDB team underrates the power of computational geometry and does not support multidimensional queries. There is the GeoCouch extension, which allows you to run this kind of query as easily as:
http://localhost:5984/places/_design/main/_spatial/points?bbox=0,0,180,90
on the view emitting spatial value:
emit({ type: "Point", coordinates: [doc.start_at, doc.end_at] }, doc);
The problem is the different data type. You get floats in the range [-180.0,180.0]/[-90.0,90.0] and need at least ints (UNIX time format). If GeoCouch works for you with ranges bigger than 180.0, and the precision of float operations designed for geographical calculation is sufficient for dates with a precision of seconds, your problem is solved :) I am sure that, with a few tricks and hacks, you could solve this problem efficiently in geo software. If not GeoCouch, then perhaps Elasticsearch (which also supports multidimensional queries), which is easy to use with CouchDB through its River plugin system.

Querying documents containing two tags with CouchDB?

Consider the following documents in a CouchDB:
{
"name":"Foo1",
"tags":["tag1", "tag2", "tag3"],
"otherTags":["otherTag1", "otherTag2"]
}
{
"name":"Foo2",
"tags":["tag2", "tag3", "tag4"],
"otherTags":["otherTag2", "otherTag3"]
}
{
"name":"Foo3",
"tags":["tag3", "tag4", "tag5"],
"otherTags":["otherTag3", "otherTag4"]
}
I'd like to query all documents that contain ALL (not any!) tags given as the key.
For example, if I request using '["tag2", "tag3"]' I'd like to retrieve Foo1 and Foo2.
I'm currently doing this by querying by tag, first for "tag2", then for "tag3", creating the union manually afterwards.
This seems to be awfully inefficient and I assume that there must be a better way.
My second question - but they are quite related, I think - would be:
How would I query for all documents that contain "tag2" AND "tag3" AND "otherTag3"?
I hope a question like this hasn't been asked/answered before. I searched for it and didn't find one.
Do you have a maximum number of:
- tags per document, and
- tags allowed in the query?
If so, you have an upper bound on the number of tag combinations to be indexed. For example, with a maximum of 5 tags per document, and 5 tags allowed in the AND query, you could simply output every 1, 2, 3, 4, and 5-tag combination into your index, for a maximum of 1 (five-tag combo) + 5 (four-tag combos) + 10 (three-tag combos) + 10 (two-tag combos) + 5 (one-tag combos) = 31 rows in the view for that document.
That may be acceptable to you, considering that it's quite a powerful query. The disk usage may be acceptable too, especially if you simply emit(tags, {_id: doc._id}) to minimize data in the view; you can use ?include_docs=true to get the full document later. The final thing to remember is to always emit the key array sorted, and always query it the same way, because you are emitting only tag combinations, not permutations.
That can get you so far, however it does not scale up indefinitely. For full-blown arbitrary AND queries, you will indeed be required to split into multiple queries, or else look into CouchDB-Lucene.
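The combination approach described above could be sketched as follows. The function name is hypothetical; it enumerates every non-empty subset of a document's tags, each one sorted, so the result can be emitted one key per combination.

```javascript
// Return every non-empty combination of the given tags, each sorted
// alphabetically. For n distinct tags this yields 2^n - 1 arrays.
function tagCombinations(tags) {
  const sorted = Array.from(new Set(tags)).sort();
  const combos = [];
  // Each bitmask from 1 to 2^n - 1 selects one non-empty subset.
  for (let mask = 1; mask < (1 << sorted.length); mask++) {
    combos.push(sorted.filter((tag, i) => (mask & (1 << i)) !== 0));
  }
  return combos;
}

const combos = tagCombinations(["tag3", "tag1", "tag2"]);
// Inside a map function, each combination would become one view row, e.g.
// tagCombinations(doc.tags).forEach(function (c) { emit(c, null); });
```

A query for documents carrying both "tag2" and "tag3" then becomes a single lookup with key=["tag2","tag3"], provided the query key is sorted the same way.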
