Use an array of values to query Firestore and set up a snapshot listener - node.js

Here is my problem:
I have a firestore collection that has a number of documents. There are about 500 documents generated/updated every hour and saved to the collection.
I would like to query the collection and set up a real-time snapshot listener for a subset of document IDs that are provided by the client.
I think maybe I could do something like this (this syntax is likely not correct... just trying to get a feel for whether it's even possible... but isn't "in" limited to an array of 10 items?):
const subbedDocs = ["doc1","doc2","doc3","doc4","doc5"]
docsRef.where('docID', 'in', subbedDocs).onSnapshot((doc) => {
handleSnapshot(doc);
});
I'm sorry, that code probably doesn't make sense... I'm still trying to learn all the ins and outs of Firestore.
Essentially, what I am trying to do is take an array of IDs and set up a .onSnapshot listener for those IDs. This list of IDs could be upwards of 40-50 items. Is this even possible? I am trying to avoid just setting up a listener on the whole collection and filtering out the things I am not "subscribed" to, as that seems wasteful from a resources perspective.

If you have the doc IDs in your array (it looks like you do), you can loop over them and attach a listener for each one:
const subbedDocs = ["doc1", "doc2", "doc3", "doc4", "doc5"];
for (let i = 0; i < subbedDocs.length; i++) {
const docID = subbedDocs[i];
docsRef.doc(docID).onSnapshot((doc) => {
handleSnapshot(doc);
});
}
It would be better to listen to a query and all filtered docs at once. But if you want to listen to each of them with an explicit listener, that would do the trick.
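If you would rather keep the number of listeners down, a middle ground is to split the IDs into chunks of 10 and attach one 'in' query listener per chunk. A rough sketch (docsRef and handleSnapshot as in the question; whether docID is stored as a field on each document is an assumption carried over from there):
const subbedDocs = ["doc1", "doc2", "doc3", /* ... */ "doc50"];

// Split the IDs into chunks of 10 to stay under the 'in' limit.
const chunks = [];
for (let i = 0; i < subbedDocs.length; i += 10) {
  chunks.push(subbedDocs.slice(i, i + 10));
}

// Attach one query listener per chunk and keep the unsubscribe
// functions so the listeners can be detached later.
const unsubscribers = chunks.map((chunk) =>
  docsRef.where('docID', 'in', chunk).onSnapshot((querySnapshot) => {
    querySnapshot.forEach((doc) => handleSnapshot(doc));
  })
);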

As you've discovered, Firestore's in operator only allows up to 10 entries in the array. I'm also guessing you've added the docID as a field in the document, since I don't believe 'docID' references the actual document ID.
I would not take this approach, because of the 10-entry limitation. What I would do instead is this: as the client selects documents to follow, add a unique ID for that client to an array field on each of those documents (called something like "ListenerArray"). Because the query then matches on a single value rather than a list of document IDs, it avoids the limitation entirely, and you can support as many client listeners as Firestore's implementation limits allow. Your query would look more like:
docsRef.where('ListenerArray', 'array-contains', clientID).onSnapshot((doc) => {
  handleSnapshot(doc);
});
array-contains checks a single value against all entries in a document array, without limit. Every client can mark any number of documents to subscribe to.
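A rough sketch of the subscribe/unsubscribe step plus the listener (assuming the v8-style firebase.firestore.FieldValue.arrayUnion / arrayRemove helpers and the same docsRef / handleSnapshot names as above):
// Subscribe: add this client's ID to the document's ListenerArray field.
function subscribe(docID, clientID) {
  return docsRef.doc(docID).update({
    ListenerArray: firebase.firestore.FieldValue.arrayUnion(clientID)
  });
}

// Unsubscribe: remove the client's ID again.
function unsubscribe(docID, clientID) {
  return docsRef.doc(docID).update({
    ListenerArray: firebase.firestore.FieldValue.arrayRemove(clientID)
  });
}

// One listener covers every document this client has marked.
const detach = docsRef
  .where('ListenerArray', 'array-contains', clientID)
  .onSnapshot((querySnapshot) => {
    querySnapshot.forEach((doc) => handleSnapshot(doc));
  });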

Related

User Segmentation Engine using MongoDB

I have an analytics system that tracks customers and their attributes as well as their behavior in the form of events. It is implemented using Node.js and MongoDB (with Mongoose).
Now I need to implement a segmentation feature that allows grouping stored users into segments based on certain conditions, for example something like purchases > 3 AND country = 'Netherlands'.
An important requirement here is that the segments get updated in real time and not just periodically. This basically means that every time a user's attributes change or he triggers a new event, I have to check again which segments he belongs to.
My current approach is to store the conditions for the segments as MongoDB queries, that I can then execute on the user collection in order to determine which users belong to a certain segment.
For example a segment to filter out all users that are using Gmail would look like this:
{
  _id: '591638bf833f8c843e4fef24',
  name: 'Gmail Users',
  condition: { 'email': { $regex: '.*gmail.*' } }
}
When a user matches the condition I would then store that he belongs to the 'Gmail Users' segment directly on the user's document:
{
  username: 'john.doe',
  email: 'john.doe@gmail.com',
  segments: ['591638bf833f8c843e4fef24']
}
However by doing this, I would have to execute all queries for all segments every time a user's data changes, so I can check if he is part of the segment or not. This feels a bit complicated and cumbersome from a performance point of view.
Can you think of any alternative way to approach this? Maybe use a rule-engine and do the processing in the application and not on the database?
Unfortunately I don't know of a better approach, but you can optimize this solution a little bit.
I would do the same:
Store the segment conditions in a collection
Once you find a matching user, store the segment id in the user's document (segments)
An important requirement here is that the segments get updated in realtime and not just periodically.
You have no choice: you need to run the segmentation query every time a segment changes.
I would have to execute all queries for all segments every time a user's data changes
This is where I would change your solution, actually just optimise it a little bit:
You don't need to run the segmentation queries on the whole collection. If you put the user's id into the query with an $and, MongoDB will fetch the user first and only then check the rest of the segmentation conditions. Make sure MongoDB uses the user's _id index; you can use .explain() to check it or .hint() to force it. Unfortunately you still need to run N+1 queries if you have N segments (+1 for the user update).
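A minimal sketch of that combined query, assuming a Mongoose User model and the segment shape from the question (userId and segment are whatever the update handler already has in scope):
// Re-check one user against one segment in a single query.
// _id narrows the scan to a single document before the rest of the
// segment condition is evaluated; hint() forces the _id index.
User.findOne({ $and: [{ _id: userId }, segment.condition] })
  .hint({ _id: 1 })
  .exec(function (err, match) {
    // handle error
    if (match) {
      // the user satisfies this segment's condition
    }
  });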
I would fetch every segment and store them in a cache (Redis). If someone changes a segment I would update the cache as well (or just invalidate the cache and let the next query handle the rest; it depends on the implementation). The point is that I would have every segment available without hitting the database, and when a user record is updated I would go through every segment in Node.js, validate the user against the conditions, and update the user's segments array in the original update query so it does not require any extra database operation.
I know it could be a pain in the ass implementing something like this but it doesn't overload the database ...
Update
Let me give you some technical details about my second suggestion:
(This is just pseudo code!)
Segment cache
// segmentCache.js: resolve segments from Redis, falling back to the database
const Redis = require('./redis');              // assumed: a connected node_redis client
const Segments = require('./models/segment');  // assumed: the Mongoose Segment model

module.exports = function () {
  return new Promise(function (resolve, reject) {
    Redis.get('cache:segments', function (err, segments) {
      if (err) return reject(err);
      // Segments are cached
      if (segments) {
        return resolve(JSON.parse(segments));
      }
      // Fetch segments and save them to the cache
      Segments.find().exec(function (err, segments) {
        if (err) return reject(err);
        // Save to the cache with a 60-second expiration
        Redis.set('cache:segments', JSON.stringify(segments), 'EX', 60, function (err) {
          if (err) return reject(err);
          return resolve(segments);
        });
      });
    });
  });
};
User update
// ... inside a generator-based (co-style) request handler
let user = yield User.findOne({ _id: ObjectId(req.body.userId) });
// etc ...
// fetch segments from the cache or from the database
let segments = yield segmentCache();
let userSegments = [];
segments.forEach(function (segment) {
  if (checkSegment(user, segment)) {
    userSegments.push(segment._id);
  }
});
// Override the user's segments with userSegments in the same update query
This is where the magic happens: somehow you need to define the conditions in a way that lets you evaluate them in an if statement.
Hint: Lodash has functions for this: _.gt, _.gte, _.eq, ...
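For example, if each condition were stored in an operator/field/value shape (an assumption, not the format used in the question), the check could look like this:
const _ = require('lodash');

// Hypothetical condition shape: { field: 'purchases', op: 'gt', value: 3 }
const ops = { gt: _.gt, gte: _.gte, lt: _.lt, lte: _.lte, eq: _.eq };

function matchesCondition(user, condition) {
  return ops[condition.op](user[condition.field], condition.value);
}

// matchesCondition({ purchases: 5 }, { field: 'purchases', op: 'gt', value: 3 }) === true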
Check segments
// checkSegment: returns true when the user matches every key of the condition
module.exports = function (user, segment) {
  let keys = Object.keys(segment.condition);
  return keys.every(function (key) {
    return user[key] === segment.condition[key];
  });
};
You are already storing an entire segment "query" in a document in the segments collection, so why not include a field in the same document that enumerates which fields of the user document affect membership in that segment.
Since the code changing user data knows which fields are being changed, it can fetch only the segments that are computed from those fields, significantly reducing the number of segmentation "queries" you have to re-run.
Note that a change in a user's data may add them to a segment they are not currently a member of, so checking only the segments currently stored on the user is not sufficient.
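A sketch of that idea, assuming a Mongoose Segments model as in the earlier pseudo code and a fields array added to each segment document (both names, and the update object of changed user fields, are illustrative):
// Hypothetical segment shape: { _id, name, condition, fields: ['email'] }
// On a user update, fetch only the segments whose condition reads at least
// one of the fields that actually changed in this update.
const changedFields = Object.keys(update); // e.g. ['email']
Segments.find({ fields: { $in: changedFields } }).exec(function (err, affected) {
  // handle error
  // Re-evaluate only these segments for the user; segments whose conditions
  // don't touch the changed fields cannot change membership.
});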

couchDB conflicts when supplying own ID with large inserts using _bulk_docs

The same code works fine when letting Couch auto-generate UUIDs. I am starting off with a brand-new, completely empty database, yet I keep getting this:
error: conflict
reason: Document update conflict
To reiterate: I am posting new documents to an empty database, so I'm not sure how I can get update conflicts when nothing is being updated. Even stranger, the conflicting documents still show up in the DB with only a single revision, but overall there are missing records.
I am trying to insert about 38,000 records with _bulk_docs in batches of 100. I am getting these records (100 at a time) from a RETS server; each record already has a unique ID that I want to use for the CouchDB _id instead of CouchDB's auto-generated UUIDs. I am using a promise-based library to get the records and axios to insert them into Couch. After getting the first batch of 100, I run this code to add an _id to each of the 100 records before inserting:
let batch = [];
batch = records.results.map((listing) => {
  let temp = listing;
  temp._id = listing.ListingKey;
  return temp;
});
Then insert:
axios.post('http://127.0.0.1:5984/rets_store/_bulk_docs', { docs: batch })
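(As a diagnostic aid, the _bulk_docs response contains one entry per posted document, { ok, id, rev } on success and { id, error, reason } on failure, so a sketch like the following can log exactly which _ids are conflicting:)
axios.post('http://127.0.0.1:5984/rets_store/_bulk_docs', { docs: batch })
  .then((res) => {
    // res.data has one entry per posted doc:
    // { ok: true, id, rev } on success, { id, error, reason } on failure.
    const conflicts = res.data.filter((r) => r.error === 'conflict');
    if (conflicts.length) {
      console.log('conflicting _ids:', conflicts.map((r) => r.id));
    }
  });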
The fetching, mapping and inserting all happens inside a function that I call recursively.
I know this probably won't be enough to see the issue, but I thought I'd start here. I know for sure it has something to do with my map() and adding the _id = ListingKey.
Thanks!

Referencing external doc in CouchDB view

I am scraping a 90K-record database using JSON-RPC and I am trying to put in some basic error checking. I want to start by scraping the database twice using two different settings and adding a prefix to the second scrape. This way I can check that the two settings are not producing different records (due to dropped updates, etc.). I wanted to implement the comparison using a view which compares each document from the first scrape with its twin produced by the second scrape and then emits the names of records that differ.
However, I cannot quite figure out how to pull another doc into the view; everything I have read only discusses referencing external docs via the emit() function, which is too late to allow the comparison. In the example below, the lookup() function would grab the referenced document.
Is this just not possible?
function(doc) {
  if (doc._id.slice(0, 1) !== '$' && doc._id.slice(0, 1) !== '_') {
    var otherDoc = lookup('$test' + doc._id);
    if (otherDoc) {
      var keys = Object.keys(doc);
      var same = true;
      keys.forEach(function(key) {
        if ((key.slice(0, 1) !== '_') && (key.slice(0, 1) !== '$') && (key !== 'expires')) {
          if (!Object.equal(otherDoc[key], doc[key])) {
            same = false;
          }
        }
      });
      if (!same) {
        emit(doc._id, 1);
      }
    }
  }
}
Context
You are correct that this is not possible in CouchDB. The whole point of the map function is that it must be idempotent, otherwise you lose all the other nice benefits of a pre-calculated index.
This is why you cannot access external resources in the map function, whether they be other records or the clock. Any time you run a map you must always get the same result if you put the same record into it. Since there are no relationships between records in CouchDB, you cannot promise that this is possible.
Solution
However, you can still achieve your end goal, just by different means. Some possibilities...
Assuming there is some meaningful numeric value in each doc, you could use a view to take the sum of all those values and group them by which import you did ({key: <batch id>, value: <meaningful number>}). Then compare the two numbers in your client or the browser to see if they match.
A brute force approach would be to use a view to pair the docs that should match. Each doc is on a different row, but they're grouped by a common field. Then iterate through the entire index comparing the pairs. This would certainly be the quickest to code and doesn't depend on your application or data.
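For the second option, the pairing view could be as small as this (a sketch, assuming the second scrape prefixes its ids with $test as in the question):
function (doc) {
  // Strip the second-scrape prefix so both copies of a record share a key
  // and land on adjacent rows of the index.
  var key = doc._id.indexOf('$test') === 0 ? doc._id.slice('$test'.length) : doc._id;
  emit(key, null);
}
Querying the view with include_docs=true and walking it in key order then lets the client compare each pair of rows that share a key.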
Implement a validation function to enforce a schema on your data. Just be warned that this will reduce your write throughput since each written record will be piped out of Erlang and into the JS engine. Also, this is only applicable if you're worried about properly formed records instead of their precise content, which might not be the case.
Instead of your different batch jobs creating different docs, have them place them into the same doc. The structure might look like this: { "_id": "something meaningful", "batch_one": { ..data.. }, "batch_two": { ..data.. } } Then your validation function could compare them or you could create a view that indexes all the docs that don't match. All depends on where in your pipeline you want to do the error checking and correction.
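And for the last option, a view that flags combined docs whose two copies differ might look like this (a sketch; the JSON.stringify comparison assumes both scrapes serialize fields in the same order):
function (doc) {
  if (doc.batch_one && doc.batch_two &&
      JSON.stringify(doc.batch_one) !== JSON.stringify(doc.batch_two)) {
    emit(doc._id, 1);
  }
}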
Personally I like the last option better, but only if you don't plan to use the database as-is in production, i.e., you wouldn't want to carry around all that extra data in each record.
Hope that helps.
Cheers.

How do I design a couchdb view for following case ?

I am migrating an application from MySQL to CouchDB. (Okay, please don't pass judgement on this.)
There is a function with signature
getUserBy($column, $value)
Now you can see that in the case of SQL it is a trivial job to construct such a query and fire it.
However, as far as CouchDB is concerned, I am supposed to write views with map functions.
Currently I have many views such as
get_user_by_name
get_user_by_email
and so on. Can anyone suggest a better, more scalable way of doing this?
Sure! One of my favorite views, for its power, is by_field. It's a pretty simple map function.
function(doc) {
  // by_field: map function
  // A single view for every field in every document!
  var field, key;
  for (field in doc) {
    key = [field, doc[field]];
    emit(key, 1);
  }
}
Suppose your documents have a .name field for their name, and .email for their email address.
To get users by name (ex. "Alice" and "Bob"):
GET /db/_design/example/_view/by_field?include_docs=true&key=["name","Alice"]
GET /db/_design/example/_view/by_field?include_docs=true&key=["name","Bob"]
To get users by email, from the same view:
GET /db/_design/example/_view/by_field?include_docs=true&key=["email","alice@gmail.com"]
GET /db/_design/example/_view/by_field?include_docs=true&key=["email","bob@gmail.com"]
The reason I like to emit 1 is so you can write reduce functions later to use sum() to easily add up the documents that match your query.
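For example, adding the built-in _sum reduce to the same view lets you count matching documents without fetching them; group=true collapses all rows that share a key:
// design document (sketch): same map function as above, plus the built-in reduce
{
  "_id": "_design/example",
  "views": {
    "by_field": {
      "map": "function(doc) { var field; for (field in doc) { emit([field, doc[field]], 1); } }",
      "reduce": "_sum"
    }
  }
}

// count how many documents have the name "Alice"
GET /db/_design/example/_view/by_field?group=true&key=["name","Alice"]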

Insert/update Doctrine object from Excel

On the project I am currently working on, I have to read an Excel file (with over 1000 rows), extract all of them and insert/update them into a database table.
In terms of performance, is it better to add all the records to a Doctrine_Collection and insert/update them afterwards using the fromArray() method? Another possible approach is to create a new object for each row (an Excel row would be an object) and then save it, but I think that is worse in terms of performance.
Every time the Excel file is uploaded, it is necessary to compare its rows to the existing objects in the database. If a row does not exist as an object, it should be inserted, otherwise updated. My first approach was to turn both the objects and the rows into arrays (or Doctrine_Collections) and then compare them before performing the needed operations.
Can anyone suggest me any other possible approach?
We did a bit of this in a project recently, with CSV data. It was fairly painless. There's a symfony plugin, tmCsvPlugin, but we have extended it quite a bit since, so the version in the plugin repo is pretty out of date. Must add that to the #TODO list :)
Question 1:
I don't explicitly know about performance, but I would guess that adding the records to a Doctrine_Collection and then calling Doctrine_Collection::save() would be the neatest approach. I'm sure it would be handy if an exception were thrown somewhere and you had to roll back your last save.
Question 2:
If you could use a row field as a unique identifier (let's assume a username), then you could search for an existing record. If you find one, and assuming that your imported row is an array, use Doctrine_Record::synchronizeWithArray() to update that record, then add it to a Doctrine_Collection. When complete, just call Doctrine_Collection::save().
A fairly rough 'n' ready implementation:
// set up a new collection
$collection = new Doctrine_Collection('User');

// assuming $row is an associative
// array representing one imported row.
foreach ($importedRows as $row) {
    // try to find an existing record
    // based on a unique identifier.
    $user = Doctrine_Core::getTable('User')
        ->findOneByUsername($row['username']);

    // create a new user record if
    // no existing record is found.
    if (!$user instanceof User) {
        $user = new User();
    }

    // sync record with current data.
    $user->synchronizeWithArray($row);

    // add to collection.
    $collection->add($user);
}

// done. save collection.
$collection->save();
Pretty rough but something like this worked well for me. This is assuming that you can use your imported row data in some way to serve as a unique identifier.
NOTE: be wary of synchronizeWithArray() if you're using sf1.2/Doctrine 1.0; if I remember correctly it was not implemented properly. It works fine in Doctrine 1.2 though.
I have never worked with Doctrine_Collections, but I can answer in terms of database queries and code logic in a broader sense. I would apply the following logic:
1. Fetch all the existing rows from the database in a single query and store them in an array, $storedSheet.
2. Create a single array of all the rows of the uploaded Excel sheet; call it $uploadedSheet. I guess the structures of $uploadedSheet and $storedSheet will be similar (both two-dimensional: rows and cells can be identified and compared).
3. Run foreach loops on $uploadedSheet as follows and only identify which rows need to be inserted and which need to be updated (do the actual queries later):
$rowsToBeUpdated  = array();
$rowsToBeInserted = array();
foreach ($uploadedSheet as $row => $eachRow) {
    if (is_array($storedSheet[$row])) {
        foreach ($eachRow as $column => $value) {
            if ($value != $storedSheet[$row][$column]) {
                // A difference was detected - this row needs updating.
                $rowsToBeUpdated[$row] = true;
                break; // No need to check this row anymore.
            }
        }
    } else {
        $rowsToBeInserted[$row] = true;
    }
}
4. This way you have two arrays. Now perform two database queries:
Bulk insert all the rows of $uploadedSheet whose row numbers are stored in the $rowsToBeInserted array.
Bulk update all the rows of $uploadedSheet whose row numbers are stored in the $rowsToBeUpdated array.
These bulk queries are the key to faster performance.
Let me know if this helped, or you wanted to know something else.
