I have a web page sending heartbeat events to a Node.js backend to track how long users are viewing a particular page (or parts of the page). Events are stored in MongoDB in batches using Mongoose's insertMany:
Event.insertMany(events)
Here events is an array containing multiple events. A single event is structured as follows:
{user_page_id: 1234, time_spent: 30, ...}
Since I'm tracking time spent on the page, only the most recent time_spent value per user_page_id is meaningful, and I don't want to fill MongoDB with unnecessary data. I tried to deal with this by defining user_page_id as a unique index:
user_page_id: {type: Number, index: {unique: true, sparse: true}}
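For context, the full model definition looks roughly like this (a sketch; the fields other than user_page_id and time_spent are just placeholders):

const mongoose = require('mongoose');

const eventSchema = new mongoose.Schema({
  user_page_id: { type: Number, index: { unique: true, sparse: true } },
  time_spent: { type: Number }
  // ...other tracking fields
});

const Event = mongoose.model('Event', eventSchema);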
Now, a couple of questions:
Is it possible to make user_page_id unique so that existing values are always replaced with new values? (The default behaviour seems to just reject duplicates. Something was discussed here, but it didn't help.)
Is it possible to make user_page_id unique so that duplicate null values are allowed? (There are also other events where this field is null. The sparse option seems to apply only to missing fields, not to explicit null values.)
Is a unique index even a viable way to solve this problem, or should I find another approach?
Other possible solutions I could think of are:
Processing heartbeats individually with their own handler using upsert. (Possible, but it adds some unnecessary(?) complexity to the processing pipeline. Also, batch mode is highly preferred; see the sketch after this list for a batched take on this.)
Keeping heartbeats in memory and storing them to the DB after some timeout. (The problem is that heartbeat timers should be able to pause. Also, I would like to keep the server stateless.)
Turning the whole thing upside down and using WebSockets. (Possible, but it adds some unnecessary(?) complexity, since tracking is sometimes related to parts of the page and there can be multiple concurrent heartbeat timers on one page.)
Storing all events as they are and cleaning up unnecessary events later with some batch processing job. (Not exactly sure, but this might cause some performance issues in MongoDB.) I also thought about using a capped collection here, but it doesn't actually solve the problem.
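For the first option, a batched variant could look roughly like this (a sketch only; it assumes a Mongoose version with Model.bulkWrite, and that heartbeats without a user_page_id are handled separately):

// One bulkWrite round trip, one upsert per heartbeat: only the latest
// time_spent per user_page_id survives in the collection.
const ops = events
  .filter(e => e.user_page_id != null)
  .map(e => ({
    updateOne: {
      filter: { user_page_id: e.user_page_id },
      update: { $set: e },
      upsert: true
    }
  }));

Event.bulkWrite(ops, { ordered: false })
  .then(res => console.log('upserted:', res.upsertedCount, 'updated:', res.modifiedCount));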
So, as a recap, the most important question is: how do I effectively and elegantly deal with aggregating this sort of heartbeat data?
Related
For my application I implemented a logical separation of my documents with a type attribute. I have several views, and for every view I implemented a dedicated change feed which gets triggered when a certain document is added or updated. At the moment the performance is quite good; do I have to expect a slowdown in the future?
Well, every filter function associated with your feed is executed once for each new (or updated) document. So you may expect a slowdown with a large number of concurrent inserts and updates. It's not related to the size of the database, but to the number of concurrent updates.
I'm storing key-value documents in a Mongo collection, while multiple clients push updates to this collection (posting to an API endpoint) at a very fast pace (updates will come in faster than once per second).
I need to expose another endpoint so that a watcher can poll all changes, in delta format, since the last poll. Each diff must have a sequence number and/or timestamp.
What I'm thinking is:
For each update I calculate a diff and store it.
I store each diff in a Mongo collection, with the current timestamp (using the Node Date object)
On each poll for changes: get all diffs from the collection, delete them and return.
The questions are:
Is it safe to use timestamps for sequencing changes?
Should I be using Mongo to store all diffs as changes are coming or some kind of message queue would be a better solution?
thanks!
On each poll for changes: get all diffs from the collection, delete them and return.
This sounds terribly fragile. What if the client didn't receive the data (it crashed or the network disappeared in the middle of receiving the response)? It retries the request, but oops, it doesn't see anything. What I would do instead is have the client remember the last version it saw and ask for updates like this:
GET /thing/:id/deltas?after_version=XYZ
When it receives a new batch of deltas, it gets the last version of that batch and updates its cached value.
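A rough sketch of what that endpoint could look like, assuming Express and the Node MongoDB driver, and using the diff document's _id as the version (the diffs collection and thingId field are placeholders):

// Sketch only: return all deltas newer than the last version the client saw.
const { ObjectId } = require('mongodb');

app.get('/thing/:id/deltas', async (req, res) => {
  const after = req.query.after_version;             // last _id the client saw, if any
  const query = { thingId: req.params.id };
  if (after) query._id = { $gt: new ObjectId(after) };

  const deltas = await db.collection('diffs')
    .find(query)
    .sort({ _id: 1 })                                 // _id is time-ordered
    .toArray();

  res.json(deltas);                                   // client remembers the last _id it got
});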
Is it safe to use timestamps for sequencing changes?
I think so. The ObjectId already contains a timestamp, so you might just use that; no need for a separate time field.
Should I be using Mongo to store all diffs as changes are coming or some kind of message queue would be a better solution?
That depends on your requirements. Mongo should work well here, especially if you'll be cleaning up old data.
at a very fast pace (updates will come in faster than once per second)
By modern standards, 1 per second is nothing. 10 per second - same. 10k per second - now we're talking.
I'm receiving periodic reports from a bunch of devices and storing them in a MongoDB database. They come in roughly every 20-30 seconds. I would like to detect when a device has not sent a report for some time (for example, when the last report is more than 3 minutes old) and then send an email or trigger some other mechanism.
So the issue is how to check for the missing event in the most correct manner. I have considered a cron job and a bunch of timers, one per device record.
A cron job looks OK, but I fear that running a full scan query will overload the server/DB and cause performance issues. Is there any kind of database structure that could help with this (some kind of index, maybe)?
Timers are probably the simpler solution, but I worry about how many timers I can create, because I could end up with quite a number of devices.
Can anybody give me an advice what is the best approach to this? Thanks in advance.
Do you use Redis or something similar on this server? Set the device ID as a key with any value, e.g. 1. Expire the key in 2-3 minutes and refresh the expiration every time the device connects. Then run a cron job that checks which IDs are missing. This should be super fast.
You may also use MongoDB's TTL collections instead of Redis, but in that case you will have to do a bunch of round trips to the DB server. http://docs.mongodb.org/manual/tutorial/expire-data/
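A sketch of the Redis idea, assuming the node-redis v4 client (the 3-minute TTL and the device:<id> key pattern are just examples):

// On every report from a device: refresh a key that expires after 3 minutes.
const { createClient } = require('redis');
const client = createClient();

async function recordPing(deviceId) {
  if (!client.isOpen) await client.connect();
  await client.set(`device:${deviceId}`, '1', { EX: 180 });   // expires in 3 minutes
}

// Cron job side: a device is stale if its key has expired.
async function isStale(deviceId) {
  if (!client.isOpen) await client.connect();
  return (await client.exists(`device:${deviceId}`)) === 0;
}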
Update:
Since you do not know in advance which IDs you will be looking for, this rather complicates the matter. Another option is to keep a log in a separate MongoDB collection with a timestamp of the last ping you got from each device.
Index the timestamp field and query .find({timestamp: {$lt: new Date(Date.now() - 60 * 1000)}}) to get a list of stale devices.
It's very important that you update the existing document rather than create a new one on each ping, so that if you have 10 devices connected you have exactly 10 documents in this collection. That's why you need a separate collection for this log.
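A sketch of that last-ping collection with the Node driver (collection and field names are placeholders; an index on { timestamp: 1 } keeps the stale query cheap):

// On every report: keep exactly one document per device, updated in place.
async function recordPing(db, deviceId) {
  await db.collection('device_pings').updateOne(
    { deviceId },
    { $set: { timestamp: new Date() } },
    { upsert: true }
  );
}

// Cron job: find devices whose last ping is older than 3 minutes.
async function findStaleDevices(db) {
  return db.collection('device_pings')
    .find({ timestamp: { $lt: new Date(Date.now() - 3 * 60 * 1000) } })
    .toArray();
}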
There's a great article on time series data which I hope you find useful: http://blog.mongodb.org/post/65517193370/schema-design-for-time-series-data-in-mongodb
An index on deviceid+timestamp handles this neatly.
Use distinct() to get your list of devices
For each device d,
db.events.find({ deviceid: d }).sort({ timestamp : -1 }).limit(1)
gives you the most recent event, whose timestamp you can now compare with the current time.
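Put together, a rough mongo shell sketch of that loop (the 3-minute threshold is just an example, and it assumes timestamp is stored as a Date):

var cutoff = new Date(Date.now() - 3 * 60 * 1000);
db.events.distinct("deviceid").forEach(function (d) {
    var latest = db.events.find({ deviceid: d })
                          .sort({ timestamp: -1 })
                          .limit(1)
                          .toArray()[0];
    if (latest && latest.timestamp < cutoff) {
        print("stale device: " + d);   // trigger email/alert here
    }
});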
I have a largish (~100) array of smallish documents (maybe 10 fields each) to insert into MongoDB, but many of them (perhaps all, though typically 80% or so) will already exist in the DB. The documents represent upcoming events over the next few months, and I'm updating the database every couple of days, so most of the events are already in there.
Anybody know (or want to guess) if it would be more efficient to:
Do the bulk insert, but with continueOnError = true, e.g.
db.collection.insert(myArray, {continueOnError: true}, callback)
Do individual inserts, checking first if the _id exists?
First do a big remove (something like db.collection.remove({_id: {$in: [array of all the _ids in my new documents]}})), then a bulk insert?
I'll probably do #1 as that is the simplest, and I don't think that 100 documents is all that large, so it may not matter. But what if there were 10,000 documents? I'm doing this in JavaScript with the Node.js driver, if that matters. My background is in Java, where exceptions are time consuming, and that's the main reason I'm asking: will the "continueOnError" option be time consuming?
ADDED: I don't think "upsert" makes sense here. That is for updating an individual document. In my case, the individual document, representing an upcoming event, is not changing. (Well, maybe it is, but that's another issue.)
What's happening is that a few new documents will be added.
My background is in Java, where exceptions are time consuming, and that's the main reason I'm asking: will the "continueOnError" option be time consuming?
The ContinueOnError flag for Bulk Inserts only affects the behaviour of the batch processing: rather than stopping processing on the first error encountered, the full batch will be processed.
In MongoDB 2.4 you will only get a single error for the batch, which will be the last error encountered. This means if you do care about catching errors you would be better doing individual inserts.
The main time saving for bulk insert vs. single insert is reduced network round trips. Instead of sending a message to the MongoDB server per document inserted, drivers can break bulk inserts down into batches of up to the MaxMessageSizeBytes accepted by the mongod server (currently 48MB).
Are bulk inserts appropriate for this use case?
Given your use case of only 100s (or even 1000s) of documents to insert where 80% already exist, there may not be a huge benefit in using bulk inserts (especially if this process only happens every few days). Your small inserts will be combined in batches, but 80% of the documents don't actually need to be sent to the server.
I would still favour bulk insert with ContinueOnError over your approach of deletion and re-insertion, but bulk inserts may be an unnecessary early optimisation given the number of documents you are wrangling and the percentage that actually need to be inserted.
I would suggest doing a few runs with the different approaches to see what the actual impact is for your use case.
MongoDB 2.6
As a heads-up, the batch functionality is being significantly improved in the MongoDB 2.5 development series (which will culminate in the 2.6 production release). Planned features include support for bulk upserts and accumulating per-document errors rather than a single error per batch.
The new write commands will require driver changes to support them, but may change some of the assumptions above. For example, with ContinueOnError using the new batch API, you could end up getting a result back listing the 80% of your batch _ids that are duplicate keys.
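To illustrate, the 2.6-style unordered bulk API in the shell looks roughly like this (collection and variable names are just placeholders; unordered means processing continues past individual failures such as duplicate _ids):

var bulk = db.events.initializeUnorderedBulkOp();
myArray.forEach(function (doc) {
    bulk.insert(doc);
});
try {
    var result = bulk.execute();
    printjson(result);
} catch (e) {
    // The thrown error carries the per-document write errors (e.g. duplicate keys).
    printjson(e);
}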
For more details, see the parent issue SERVER-9038 in the MongoDB issue tracker.
collection.insert(item, { continueOnError: true, safe: true }, function (err, result) {
    // Ignore duplicate key errors (error code 11000); rethrow anything else.
    if (err && err.code != 11000) {
        throw err;
    }
    db.close();
    callBack();
});
For your case, I'd suggest that you consider fetching a list of the existing document _ids and then only sending the documents that aren't in that list already. While you could use update with upsert to update individually, there's little reason to do so. Unless the list of _ids is extremely long (tens of thousands), it would be more efficient to grab the list and do the comparison than to do individual updates to the database for each document (with some large percentage apparently failing to update).
I wouldn't use continueOnError and send all documents... it's less efficient.
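A sketch of that approach with the Node.js driver (assuming the incoming documents already carry their _id values; the collection handle is a placeholder):

// Fetch the _ids that already exist, then insert only the missing documents.
async function insertNewOnly(collection, docs) {
  const ids = docs.map(d => d._id);
  const existing = await collection
    .find({ _id: { $in: ids } })
    .project({ _id: 1 })
    .toArray();
  const existingIds = new Set(existing.map(d => String(d._id)));

  const newDocs = docs.filter(d => !existingIds.has(String(d._id)));
  if (newDocs.length > 0) {
    await collection.insertMany(newDocs);
  }
  return newDocs.length;   // how many were actually new
}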
I'd vouch for using an upsert to let Mongo deal with the update-or-insert logic; you can also use multi to update multiple documents that match your criteria:
From the documentation:
upsert
Optional parameter, if set to true, creates a new document when no document matches the query criteria. The default value is false, which does not insert a new document when no match is found. The syntax for this parameter depends on the MongoDB version. See Upsert Parameter.
multi
Optional parameter, if set to true, updates multiple documents that meet the query criteria. If set to false, updates one document. The default value is false. For additional information, see Multi Parameter.
db.collection.update(
<query>,
<update>,
{ upsert: <boolean>, multi: <boolean> }
)
Here is the referenced documentation:
http://docs.mongodb.org/manual/reference/method/db.collection.update/
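For example, a single upcoming event could be upserted like this (collection and field names are just placeholders):

db.events.update(
    { _id: "event-123" },
    { $set: { title: "Some upcoming event", starts_at: ISODate("2014-06-01T18:00:00Z") } },
    { upsert: true }
)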
Suppose I store a list of events in a Cassandra row, implemented with composite columns:
{
event:123 => 'something happened'
event:234 => 'something else happened'
}
It's almost fine by me and, as far as I understand, that's a common pattern. Compared to having a single event column with the JSON-encoded list, this scales better, since it's easy to add a new item to the list without reading it first and then writing it back.
However, now I need to implement these two requirements:
I don't want to add a new event if the last added one is the same,
I want to keep only N last events.
Is there any standard way of doing that with the best possible performance? (Any storage schema changes are ok).
Checking whether or not things already exist, or checking how many exist and removing extra items, are both read-modify-write operations, and they don't fit very well with the constraints of Cassandra.
One way of keeping only the N last events is to make sure they are ordered so that you can do a range query and read the N last (for example, by prefixing the column key with a timestamp/TimeUUID). This doesn't remove the outdated events; you need to do that as a separate process. But by doing it this way, the code that queries the data will only ever see the last N, which is the real requirement if I interpret things correctly. The garbage collection of old events is just an optimization to avoid keeping things that will never be needed again.
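As a sketch of that idea expressed in CQL with the Node cassandra-driver (a different representation from the Thrift-style composite columns above, and all names here are placeholders): clustering on a TimeUUID in descending order makes "read the last N" a single query.

// Schema idea (run once):
//   CREATE TABLE events_by_key (
//       row_key text,
//       added   timeuuid,
//       payload text,
//       PRIMARY KEY (row_key, added)
//   ) WITH CLUSTERING ORDER BY (added DESC);

const cassandra = require('cassandra-driver');
const client = new cassandra.Client({
  contactPoints: ['127.0.0.1'],
  localDataCenter: 'datacenter1',
  keyspace: 'mykeyspace'
});

// Readers only ever see the last N events, regardless of whether cleanup has run yet.
async function lastNEvents(rowKey, n) {
  const result = await client.execute(
    'SELECT added, payload FROM events_by_key WHERE row_key = ? LIMIT ?',
    [rowKey, n],
    { prepare: true }
  );
  return result.rows;
}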
If the requirement isn't a strict N events, but rather events that are not older than T, you can of course use the TTL feature, but I assume that's not an option for you.
The first requirement is trickier. You can do a read before every write and check if you already have the item, but that would be slow, and unless you do some kind of locking outside of Cassandra there is no guarantee that two writers won't both do a read and then both do a write, so that neither sees the other's write. Maybe that's not a problem for you, but there's no good way around it. Cassandra doesn't do CAS.
The way I've handled similar situations when using Cassandra is to keep a cache in the application nodes of what has been written, and to check that before writing. You then need to make sure that each application node sees all events for the same row, and that events for the same row aren't distributed over multiple application nodes. One way of doing that is to put a message queue system in front of your application nodes and divide the event stream over several queues by the same key that you use as the row key in the database.
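As a very rough sketch of that application-side check (an in-memory map keyed by row key; the writeToCassandra callback stands in for the real write, and this only works because each node owns a disjoint set of row keys):

// Skip the write if the event is identical to the last one seen for this row.
const lastEventByRow = new Map();

async function maybeAppendEvent(rowKey, eventPayload, writeToCassandra) {
  if (lastEventByRow.get(rowKey) === eventPayload) {
    return false;                                   // duplicate of the most recent event; skip
  }
  await writeToCassandra(rowKey, eventPayload);     // actual Cassandra write, injected
  lastEventByRow.set(rowKey, eventPayload);
  return true;
}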