User Segmentation Engine using MongoDB - node.js

I have an analytics system that tracks customers and their attributes as well as their behavior in the form of events. It is implemented using Node.js and MongoDB (with Mongoose).
Now I need to implement a segmentation feature that allows to group stored users into segments based on certain conditions. For example something like purchases > 3 AND country = 'Netherlands'
In the frontend this would look something like this:
An important requirement here is that the segments get updated in realtime and not just periodically. This basically means, that every time a user's attributes change or he triggers a new event, I have to check again which segments he does belong to.
My current approach is to store the conditions for the segments as MongoDB queries, that I can then execute on the user collection in order to determine which users belong to a certain segment.
For example a segment to filter out all users that are using Gmail would look like this:
{
_id: '591638bf833f8c843e4fef24',
name: 'Gmail Users',
condition: {'email': { $regex : '.*gmail.*'}}
}
When a user matches the condition I would then store that he belongs to the 'Gmail Users' segment directly on the user's document:
{
username: 'john.doe',
email: 'john.doe#gmail.com',
segments: ['591638bf833f8c843e4fef24']
}
However by doing this, I would have to execute all queries for all segments every time a user's data changes, so I can check if he is part of the segment or not. This feels a bit complicated and cumbersome from a performance point of view.
Can you think of any alternative way to approach this? Maybe use a rule-engine and do the processing in the application and not on the database?

Unfortunately I don't know a better approach but you can optimize this solution a little bit.
I would do the same:
Store the segment conditions in a collection
Once you find a matching user, store the segment id in the user's document (segments)
An important requirement here is that the segments get updated in realtime and not just periodically.
You have no choice, you need to run the segmentation query every times when a segment changes.
I would have to execute all queries for all segments every time a user's data changes
This is where I would change your solution, actually just optimise it a little bit:
You don't need to run the segmentation queries on the whole collection. If you put your user id into the query with an $and, Mongodb will fetch the user first and after that will check the rest of the segmentation conditions. You need to make sure Mongodb uses the user's _id as an index, for this you can use .explain() to check it or .hint() to force it. Unfortunately you need to run N+1 queries if you have N segments (+1 is for the user update)
I would fetch every segments and store them in a cache (redis). If someone changed the segment I would update the cache as well. (Or just invalidate the cache and the next query will handle the rest, depends on the implementation). The point is that I would have every segments without fetching the database and if a user updated a record I would go through every segments with Node.js and validate the user by the conditions and I could update the user's segments array in the original update query so it would not require any extra database operation.
I know it could be a pain in the ass implementing something like this but it doesn't overload the database ...
Update
Let me give you some technical details about my second suggestion:
(This is just a pseudo code!)
Segment cache
module.exporst = function() {
return new Promise(resolve) {
Redis.get('cache:segments', function(err, segments) {
// handle error
// Segments are cached
if(segments) {
segments = JSON.parse(segments);
return resolve(segments);
}
//fetch segments and save it to the cache
Segments.find().exec(function(err, segments) {
// handle error
segments = JSON.stringify(segments);
// Save to the database but set 60 seconds as an expiration
Redis.set('cache:segments', segments, 'EX', 60, function(err) {
// handle error
return resolve(segments);
})
});
})
}
}
User update
// ...
let user = user.findOne(_id: ObjectId(req.body.userId));
// etc ...
// fetch segments from cache or from the database
let segments = yield segmentCache();
let userSegments = [];
segments.forEach(function(segment) {
if(checkSegment(user, segment)) {
userSegments.push(segment._id)
}
});
// Override user's segments with userSegments
This is where the magic happens, somehow you need to define the conditions in a way you can use them in an if statement.
Hint: Lodash has this functions: _.gt, _.gte, _.eq ...
Check segments
module.exports = function(user, segment) {
let keys = Object.keys(segment.condition);
keys.forEach(function(key) {
if(user[key] === segment.condition[key]) {
return false;
}
})
return true;
}

You are already storing an entire segment "query" in a document in segments collection - why not include a field in the same document which will enumerate which fields in the users document impact membership in a particular segment.
Since action of changing user data will know which fields are being changed, it can fetch only the segments which are computed using the fields being changed significantly reducing the size of segmentation "queries" you have to re-run.
Note that a change in user's data may add them to a segment they are not currently a member of, so checking only the segments currently stored in the user is not sufficient.

Related

Use an array of values to query Firestore and setup a snapshot listener

Here is my problem:
I have a firestore collection that has a number of documents. There are about 500 documents generated/updated every hour and saved to the collection.
I would like to query the collection and setup a real-time snapshot listener for a subset of document IDs, that are provided by the client.
I think maybe I could to something like this (this syntax is likely not correct...just trying to get a feel for if it's even possible...but isn't the "in" limited to an array of 10 items? ):
const subbedDocs = ["doc1","doc2","doc3","doc4","doc5"]
docsRef.where('docID', 'in', subbedDocs).onSnapshot((doc) => {
handleSnapshot(doc);
});
I'm sorry, that code probably doesn't make sense....I'm still trying to learn all the ins and outs of Firestore.
Essentially, what I am trying to do is take an array of ID's and setup a .onSnapshot listener for those ID's. This list of IDs could be upwards of 40-50 items. Is this even possible? I am trying to avoid just setting up a listener on the whole collection and filtering out things I am not "subscribed" too as that seems wasteful from a resources perspective.
If you have the doc IDs in your array (it looks like you have) you can loop over them and start a listener during that:
const subbedDocs = ["doc1", "doc2", "doc3", "doc4", "doc5"];
for (let i = 0; i < subbedDocs.length; i++) {
const docID = subbedDocs[i];
docsRef.doc(docID).onSnapshot((doc) => {
handleSnapshot(doc);
});
}
It would be better to listen to a query and all filtered docs at once. But if you want to listen to each of them with a explicit listener that would do the trick.
As you've discovered, Firestore's in operator only allows up to 10 entries in the array. I'm also guessing you've added the docID as a field in the document, since I don't believe 'docID references the actual documentid.
I would not take this approach, because of the 10-entry limitation. What I would do is, as the client is selecting documents to follow, set a field (same in each document) to a unique Id for the client, so your query completely avoids the limitation. You can allow an unlimited number of Client listeners (up to implementation limits of Firestore) if you add that client ID into an array (called something like "ListenerArray") [again, as the client is selecting them]. Your query would be more like:
docsRef.where('ListenerArray', 'array-contains', clientID).onSnapshot((doc) => {
handleSnapshot(doc);
})
array-contains checks a single value against all entries in a document array, without limit. Every client can mark any number of documents to subscribe to.

immutable _id error when performing MongoDB bulkWrite replaceOne on first attempt only

I'm working on a little web application that will crawl and update baseball standings by day and track teams positions (among other things) over time.
I have an API I grab all of this from and a collection in MongoDB that stores all the team data and information for the current day. Right now I just run this manually but eventually it'll be automated to run at like 3am or whenever.
The API supplies a unique ID for each team that never changes. So what I'm doing is I'm taking in the team data from the API. Passing it to a function that then extracts the teams data (there is other data from the response object I don't need), puts it into an object for replacement, and then wherever that team ID exist in the collection its document is replaced in a bulkWite.
async function currentStandings(db,team_standings,callback){
const current_standings = db.collection('current_standings');
let replacePool = [];
for(const single_team of team_standings.data.standing){
let replaceOnePusher = {
replaceOne: {
"filter": {"team_id": single_team.team_id},
"replacement": single_team
}
}
replacePool.push(replaceOnePusher);
}
await current_standings.bulkWrite(replacePool);
callback();
}
However when I execute this code for the first time each day I get an error reading BulkWriteError: After applying the update, the (immutable) field '_id' was found to have been altered to _id: ObjectId('5f26e57b6831761ac840bf1d') (not the same ID every day) and if I look in Compass the data isn't updated. If I immediately run the script again, it goes through successfully without error. Refreshing the data in compass generates the correct data.
Can someone explain to me what is going wrong here? This is actually my first time using MongoDB since I wanted to learn it and this pet project seemed like a good place to start.

Preventing concurrent access to documents in Mongoose

My server application (using node.js, mongodb, mongoose) has a collection of documents for which it is important that two client applications cannot modify them at the same time without seeing each other's modification.
To prevent this I added a simple document versioning system: a pre-hook on the schema which checks if the version of the document is valid (i.e., not higher than the one the client last read). At first sight it works fine:
// Validate version number
UserSchema.pre("save", function(next) {
var user = this
user.constructor.findById(user._id, function(err, userCurrent) { // userCurrent is the user that is currently in the db
if (err) return next(err)
if (userCurrent == null) return next()
if(userCurrent.docVersion > user.docVersion) {
return next(new Error("document was modified by someone else"))
} else {
user.docVersion = user.docVersion + 1
return next()
}
})
})
The problem is the following:
When one User document is saved at the same time by two client applications, is it possible that these interleave between the pre-hook and the actual save operations? What I mean is the following, imagine time going from left to right and v being the version number (which is persisted by save):
App1: findById(pre)[v:1] save[v->2]
App2: findById(pre)[v:1] save[v->2]
Resulting in App1 saving something that has been modified meanwhile (by App2), and it has no way to notice that it was modified. App2's update is completely lost.
My question might boil down to: Do the Mongoose pre-hook and the save method happen in one atomic step?
If not, could you give me a suggestion on how to fix this problem so that no update ever gets lost?
Thank you!
MongoDB has findAndModify which, for a single matching document, is an atomic operation.
Mongoose has various methods that use this method, and I think that they will suit your use case:
Model.findOneAndUpdate()
Model.findByIdAndUpdate()
Model.findOneAndRemove()
Model.findByIdAndRemove()
Another solution (one that Mongoose itself uses as well for its own document versioning) is to use the Update Document if Current pattern.

Referencing external doc in CouchDB view

I am scraping an 90K record database using JSON-RPC and I am trying to put in some basic error checking. I want to start by scraping the database twice using two different settings and adding a prefix to the second scrape. This way I can check to ensure that the two settings are not producing different records (due to dropped updates, etc). I wanted to implement the comparison using a view which compares each document from the first scrape with it's twin produced by the second scrape and then emit the names of records with a difference between them.
However, I cannot quite figure out how to pull in another doc in the view, everything I have read only discusses external docs using the emit() function, which is too late to permit me to compare it. In the example below, the lookup() function would grab the referenced document.
Is this just not possible?
function(doc) {
if(doc._id.slice(0,1)!=='$' && doc._id.slice(0,1)!== "_"){
var otherDoc = lookup('$test" + doc._id);
if(otherDoc){
var keys = doc.value.keys();
var same = true;
keys.forEach(function(key) {
if ((key.slice(0,1) !== '_') && (key.slice(0,1) !=='$') && (key!=='expires')) {
if (!Object.equal(otherDoc[key], doc[key])) {
same = false;
}
}
});
if(!same){
emit(doc._id, 1);
}
}
}
}
Context
You are correct that this is not possible in CouchDB. The whole point of the map function is that it must be idempotent, otherwise you lose all the other nice benefits of a pre-calculated index.
This is why you cannot access external resources in the map function, whether they be other records or the clock. Any time you run a map you must always get the same result if you put the same record into it. Since there are no relationships between records in CouchDB, you cannot promise that this is possible.
Solution
However, you can still achieve your end goal, just be different means. Some possibilities...
Assuming there is some meaningful numeric value in each doc, you could use a view to take the sum of all those values and group them by which import you did ({key: <batch id>, value: <meaningful number>}). Then compare the two numbers in your client or the browser to see if they match.
A brute force approach would be to use a view to pair the docs that should match. Each doc is on a different row, but they're grouped by a common field. Then iterate through the entire index comparing the pairs. This would certainly be the quickest to code and doesn't depend on your application or data.
Implement a validation function to enforce a schema on your data. Just be warned that this will reduce your write throughput since each written record will be piped out of Erlang and into the JS engine. Also, this is only applicable if you're worried about properly formed records instead of their precise content, which might not be the case.
Instead of your different batch jobs creating different docs, have them place them into the same doc. The structure might look like this: { "_id": "something meaningful", "batch_one": { ..data.. }, "batch_two": { ..data.. } } Then your validation function could compare them or you could create a view that indexes all the docs that don't match. All depends on where in your pipeline you want to do the error checking and correction.
Personally I like the last option better, but only if you don't plan to use the database as is in production. Ie., you wouldn't want to carry around all that extra data in each record.
Hope that helps.
Cheers.

Creating a pagination index in CouchDB?

I'm trying to create a pagination index view in CouchDB that lists the doc._id for every Nth document found.
I wrote the following map function, but the pageIndex variable doesn't reliably start at 1 - in fact it seems to change arbitrarily depending on the emitted value or the index length (e.g. 50, 55, 10, 25 - all start with a different file, though I seem to get the correct number of files emitted).
function(doc) {
if (doc.type == 'log') {
if (!pageIndex || pageIndex > 50) {
pageIndex = 1;
emit(doc.timestamp, null);
}
pageIndex++;
}
}
What am I doing wrong here? How would a CouchDB expert build this view?
Note that I don't want to use the "startkey + count + 1" method that's been mentioned elsewhere, since I'd like to be able to jump to a particular page or the last page (user expectations and all), I'd like to have a friendly "?page=5" URI instead of "?startkey=348ca1829328edefe3c5b38b3a1f36d1e988084b", and I'd rather CouchDB did this work instead of bulking up my application, if I can help it.
Thanks!
View functions (map and reduce) are purely functional. Side-effects such as setting a global variable are not supported. (When you move your application to BigCouch, how could multiple independent servers with arbitrary subsets of the data know what pageIndex is?)
Therefore the answer will have to involve a traditional map function, perhaps keyed by timestamp.
function(doc) {
if (doc.type == 'log') {
emit(doc.timestamp, null);
}
}
How can you get every 50th document? The simplest way is to add a skip=0 or skip=50, or skip=100 parameter. However that is not ideal (see below).
A way to pre-fetch the exact IDs of every 50th document is a _list function which only outputs every 50th row. (In practice you could use Mustache.JS or another template library to build HTML.)
function() {
var ddoc = this,
pageIndex = 0,
row;
send("[");
while(row = getRow()) {
if(pageIndex % 50 == 0) {
send(JSON.stringify(row));
}
pageIndex += 1;
}
send("]");
}
This will work for many situations, however it is not perfect. Here are some considerations I am thinking--not showstoppers necessarily, but it depends on your specific situation.
There is a reason the pretty URLs are discouraged. What does it mean if I load page 1, then a bunch of documents are inserted within the first 50, and then I click to page 2? If the data is changing a lot, there is no perfect user experience, the user must somehow feel the data changing.
The skip parameter and example _list function have the same problem: they do not scale. With skip you are still touching every row in the view starting from the beginning: finding it in the database file, reading it from disk, and then ignoring it, over and over, row by row, until you hit the skip value. For small values that's quite convenient but since you are grouping pages into sets of 50, I have to imagine that you will have thousands or more rows. That could make page views slow as the database is spinning its wheels most of the time.
The _list example has a similar problem, however you front-load all the work, running through the entire view from start to finish, and (presumably) sending the relevant document IDs to the client so it can quickly jump around the pages. But with hundreds of thousands of documents (you call them "log" so I assume you will have a ton) that will be an extremely slow query which is not cached.
In summary, for small data sets, you can get away with the page=1, page=2 form however you will bump into problems as your data set gets big. With the release of BigCouch, CouchDB is even better for log storage and analysis so (if that is what you are doing) you will definitely want to consider how high to scale.

Resources