Node: Check a Firebase db and execute a function when an objects time matches the current time - node.js

Background
I have a Node and React based application. I'm using Firebase for my storage and database. In my application users can fill out a form where they upload an image and select a time for the image to be added to their website. I save each image update as an object in my Firebase database like so. Images are arranged in order of ascending update time.
user-name: {
images: [
{
src: 'image-src-url',
updateTime: 1503953587727
}
{
src: 'image-src-url',
updateTime: 1503958424838
}
]
}
Scale
My applications db could potentially get very large with a lot of users and images. I'd like to ensure scalability.
Issue
How do I check when a specific image objects time has been met then execute a function? (I do not need assistance on the actual function that is being run just the checking of the db for a specific time.)
Attempts
I've thought about doing a cron job using node-cron that checks the entire database every 60s (users can only specify the minute the image will update, not the seconds.) Then if it finds a matching updateTime and executes my function. My concern is at a large scale that cron job will take a while to search the db and potentially miss a time.
I've also thought about when the user schedules a new update then dynamically create a specific cron job for that time. I'm unsure how to accomplish this.
Any other methods that may work? Are my concerns about node-cron not valid?

There are two approaches I can think of:
Keep track of the last timestamp you processed
Keep the "things to process" in a queue
Keep track of the last timestamp you processed
When you process items, you use the current timestamp as the cut-off point for your query. Something like:
var now = Date.now();
var query = ref.orderByChild("updateTime").endAt(now)
Now make sure to store this now somewhere (i.e. in your database) so that you can re-use it next time to retrieve the next batch of items:
var previous = ... previous value of now
var now = Date.now();
var query = ref.orderByChild("updateTime").startAt(previous).endAt(now);
With this you're only processing a single slice at a time. The only tricky bit is that somebody might insert a new node with an updateTime that you've already processed. If this is a concern for your use-case, you can prevent them from doing so with a validation rule on updateTime:
".validate": "newData.val() >= root.child('lastProcessed').val()"
As you add more items to the database, you will indeed be querying more items. So there is a scalability limit to this approach, but this approach should work well for anything up to a few hundreds of thousands of nodes (I haven't tested in a while so ymmv).
For a few previous questions on list size:
Firebase Performance: How many children per node?
Firebase Scalability Limit
How many records / rows / nodes is alot in firebase?
Keep the "things to process" in a queue
An alternative approach is to keep a queue of items that still need to be processed. So the clients add the items that they want processed to the queue with an updateTime of when they want to processed. And your server picks the items from the queue, performs the necessary updates, and removes the item from the queue:
var now = Date.now();
var query = ref.orderByChild("updateTime").endAt(now)
query.once("value").then(function(snapshot) {
snapshot.forEach(function(child) {
// TODO: process the child node
// remove the child node from the queue
child.ref.remove();
});
})
The difference with the earlier approach is that a queue's stable state is going to be empty (or at least quite small), so your queries will run against a much smaller list. That's also why you won't need to keep track of the last timestamp you processed: any item in the queue up to now is eligible for processing.

Related

Batch requests and concurrent processing

I have a service in NodeJS which fetches user details from DB and sends that to another application via http. There can be millions of user records, so processing this 1 by 1 is very slow. I have implemented concurrent processing for this like this:
const userIds = [1,2,3....];
const users$ = from(this.getUsersFromDB(userIds));
const concurrency = 150;
users$.pipe(
switchMap((users) =>
from(users).pipe(
mergeMap((user) => from(this.publishUser(user)), concurrency),
toArray()
)
)
).subscribe(
(partialResults: any) => {
// Do something with partial results.
},
(err: any) => {
// Error
},
() => {
// done.
}
);
This works perfectly fine for thousands of user records, it's processing 150 user records concurrently at a time, pretty faster than publishing users 1 by 1.
But problem occurs when processing millions of user records, getting those from database is pretty slow as result set size also goes to GBs(more memory usage also).
I am looking for a solution to get user records from DB in batches, while keep on publishing those records concurrently in parallel.
I thinking of a solution like, maintain a queue(of size N) of user records fetched from DB, whenever queue size is less than N, fetch next N results from DB and add to this queue.
Then the current solution which I have, will keep on getting records from this queue and keep on processing those concurrently with defined concurrency. But I am not quite able to put this in code. Is there are way we can do this using RxJS?
I think your solution is the right one, i.e. using the concurrent parameter of mergeMap.
The point that I do not understand is why you are adding toArray at the end of the pipe.
toArray buffers all the notifications coming from upstream and will emit only when the upstream completes.
This means that, in your case, the subscribe does not process partial results but processes all of the results you have obtained executing publishUser for all users.
On the contrary, if you remove toArray and leave mergeMap with its concurrent parameter, what you will see is a continuous flow of results into the subscribe due to the concurrency of the process.
This is for what rxjs is concerned. Then you can look at the specific DB you are using to see if it supports batch reads. In which case you can create buffers of user ids with the bufferCount operator and query the db with such buffers.

NodeJS and Mongo line who's online

TL;DR
logging online users and reporting back a count (based on a mongo find)
We've got a saas app for schools and students, as part of this I've been wanting a 'live' who's online ticker.
Teachers from the schools will see the counter, and the students and parents will trigger it.
I've got a socket.io connect from the web app to a NodeJS app.
Where there is lots of traffic, the Node/Mongo servers can't handle it, and rather than trow more resources at it, I figured it's better to optomise the code - because I don't know what I'm doing :D
with each student page load:
Create a socket.io connection with the following object:
{
'name': 'student or caregiver name',
'studentID': 123456,
'schoolID': 123,
'role': 'student', // ( or 'mother' or 'father' )
'page': window.location
}
in my NODE script:
io.on('connection', function(client) {
// if it's a student connection..
if(client.handshake.query.studentID) {
let student = client.handshake.query; // that student object
student.online = new Date();
student.offline = null;
db.collection('students').updateOne({
"reference": student.schoolID + student.studentID + student.role }, { $set: student
}, { upsert: true });
}
// IF STAFF::: just show count!
if(client.handshake.query.staffID) {
db.collection('students').find({ 'offline': null, 'schoolID':client.handshake.query.schoolID }).count(function(err, students_connected) {
emit('online_users' students_connected);
});
}
client.on('disconnect', function() {
// then if the students leaves the page..
if(client.handshake.query.studentID) {
db.collection('students').updateMany({ "reference": student.reference }, { $set: { "offline": new Date().getTime() } })
.catch(function(er) {});
}
// IF STAFF::: just show updated count!
if(client.handshake.query.staffID) {
db.collection('students').find({ 'offline': null, 'schoolID':client.handshake.query.schoolID }).count(function(err, students_connected) {
emit('online_users' students_connected);
});
}
});
});
What Mongo Indexes would you add, would you store online students differently (and in a different collection) to a 'page tracking' type deal like this?
(this logs the page and duration so I have another call later that pulls that - but that's not heavily used or causing the issue.
If separately, then insert, then delete?
The EMIT() to staff users, how can I only emit to staff with the same schoolID as the Students?
Thanks!
You have given a brief about the issue but no diagnosis on why the issue is happening. Based on a few assumptions I will try to answer your question.
First of all you have mentioned that you'd like suggestions on what Indexes can help your cause, based on what you have mentioned it's a write heavy system and indexes in principle will only slow the writes because on every write the Btree that handles the indexes will have to be updated too. Although the reads become way better specially in case of a huge collection with a lot of data.
So an index can help you a lot if your collection has let's say, 1 million documents. It helps you to skim only the required data without even doing a scan on all data, thanks to the Btree.
And Index should be created specifically based on the read calls you make.
For e.g.
{"student_id" : "studentID", "student_fname" : "Fname"}
If the read call here is based on student_id then create and index on that, and if multiple values are involved (equality - sort or anything) then create a compound index on those fields, giving priority to Equality field first and range and sort fields thereafter.
Now the seconds part of question, what would be better in this scenario.
This is a subjective thing and I'm sure everyone will have a different approach to this. My solution is based on a few assumptions.
Assumption(s)
The system needs to cater to a specific feature where student's online status is updated in some time interval and that data is available for reads for parents, teachers, etc.
The sockets that you are using, if they stay connected continuously all the time then it's that many concurrent connections with the server, if that is required or not, I don't know. But concurrent connections are heavy for the server as you would already know and unless that's needed 100 % try a mixed approach.
If it would be okay for you disconnect for a while or keep connection with the server for only a short interval then please consider that. Which basically means, you disconnect from the server gracefully, connect send data and repeat.
Or, just adopt a heartbeat system where your frontend app will call an API after set time interval and ping the server, based on that you can handle if the student is online or not, a little time delay, yes but easily scaleable.
Please use redis or any other in memory data store for such frequent writes and specially when you don't need to persist the data for long.
For example, let's say we use a redis list for every class / section of user and only update the timestamp (epoch) when their last heartbeat was received from the frontend.
In a class with 60 students, sort the students based on student_id or something like that.
Create a list for that class
For student_id which is the first in ascended student's list, update the epoch like this
LSET mylist 0 "1266126162661" //Epoch Time Stamp
0 is your first student and 59 is our 60th student, update it on every heartbeat. Either via API or the same socket system you have. Depends on your use case.
When a read call is needed
LRANGE classname/listname 0 59
Now you have epochs of all users, maintain the list of students either via database or another list where you can simply match the indexes with a specific student.
LSET studentList 0 "student_id" //Student id of the student or any other data, I am trying to explain the logic
On frontend when you have the epochs take the latest epoch in account and based on your use case, for e.g. let's say I want a student to be online if the hearbeat was received 5 minutes back.
Current Timestamp - Timestamp (If less than 5 minutes (in seconds)) then online or else offline.
This won't be a complete answer without discussing the problem some more, but figured I'd post some general suggestions.
First, we should figure out where the performance bottlenecks are. Is it a particular query? Is it too many simultaneous connections to MongoDB? Is it even just too much round trip time per query (if the two servers aren't within the same data center)? There's quite a bit to narrow down here. How many documents are in the collection? How much RAM does the MongoDB server have access to? This will give us an idea of whether you should be having scaling issues at this point. I can edit my answer later once we have more information about the problem.
Based on what we know currently, without making any model changes, you could consider indexing the reference field in order to make the upsert call faster (if that's the bottleneck). That could look something like:
db.collection('students').createIndex({
"reference": 1
},
{ background: true });
If the querying is the bottleneck, you could create an index like:
db.collection('students').createIndex({
"schoolID": 1
},
{ background: true });
I'm not confident (without knowing more about the data) that including offline in the index would help, because optimizing for "not null" can be tricky. Depending on the data, that may lead to storing the data differently (like you suggested).

Nodejs agenda job scheduler - How to group multiple jobs that will run in the same database scan into 1 batch job?

Problem
I have purchase service that users can use to buy/rent digital assets like game, media, movies... When purchase event happened, I create a job and schedule it to run at calculated expired date to remove key for such asset.
Everything works. But it would be better if I can group those jobs that will run in the same agenda db scan into 1 batch job to remove multiple keys.
This will reduce significant amount of db read/write/delete operations in both keys and agenda collection, it also increases the free memory at most of the time as instead of storing 100+ jobs to run in a scan, it stores just 1 job to remove 100+ keys.
Research
The closest feature I found in Agenda repo is unique(). Which allows user to find and modify the existing job that matches the fields defined in unique(). If it can concat new jobs to the existing job, that will solve my case.
Implementation
Before diving in and modify the package I want to check if there are already people solved the problem I have and have some thoughts to share.
Another solution without touching the package is to create an in-memory dictionary to accumulate jobs for a specific db scan with this strategy:
dict = {}
//if key expires in 1597202228 then put to dict slot:
dict = {
1597300000: [jobA]
}
//another key expires in 1597202238 then put to the same slot:
dict = {
1597300000: [jobA,jobB]
}
//the latch condition to put job batch into agenda:
if dict_size == dict_allocated_memory then put the whole dict into db.
if a batch_size = batch_limit then put the batch into db and remove the batch in dict.
if the batch is going to expire in the next db scan then put the batch (it may be empty, has a few jobs...) into db and remove the batch in dict.

Running a repetitive task in Node.js for each row in a postgres table on a different interval for each row

What would be a good approach to running a repetitive task for each row in a large postgres db table on a different per row interval in Node.js.
To give you some more context, here's a quick description of the application:
It's a chat based customer support app.
It consists of teams, which can be either a client team or a support team. Teams have users, which can be either client users or support users.
Client users send messages to a support team and wait for one of that team's users to answer their question.
When there's an unanswered client message waiting for a response, every agent for the receiving support team will receive a notification every n seconds (n being set on a per-team basis by the team admin).
So this task needs to infinitely loop through the rows in the teams table and send notifications if:
The team has messages waiting to be answered.
N seconds have passed since the last notification was sent (N being the number of seconds set by the team admin).
There might be a better approach to this condition altogether.
So my questions are:
What is an efficient way to infinitely loop through a postgres table with no upper limit on the number rows?
Should I load 1 row at a time? Several at a time?
What would be a good way to do this in Node?
I'm using Knex. Does Knex provide a mechanism for lazy loading a table and iterating through the rows?
A) Running a repetitive task via node can be done via a the js built-in function 'setInterval'.
// run the intervalFnc() every 5 seconds
const timerId = setTimeout(intervalFnc, 5000);
function intervalFnc() { console.log("Hello"); }
// to quit running it:
clearTimeout(timerId);
Then your interval function can do the actual work. An alternative would be to use cron (linux), or some OS process scheduler to trigger the function. I would use this method if you want to do it every minute, and a cron job if you want to do it every hour (in between these times becomes more debatable).
B) An efficient way...
B-1) Retrieving a block of records from a DB will be more efficient than one at a time. Knex has .offset and .limit clauses to choose a group of records to retrieve. A sample from the knex doc:
knex.select('*').from('users').limit(10).offset(30)
B-2) Database indexed access is important for performance if your tables are very large. I would recommend including an status flag field in your table to note which records are 'in-process', and also include a "next-review-timestamp" field with both fields being both indexed. Retrieve the records that have status_flag='in-process' AND next_review_timestamp <= now(). Sample:
knex('users').where('status_flag', 'in-process').whereRaw('next_review_timestamp <= now()')
Hope this helps!

How to fix a race condition in node js + redis + mongodb web application

I am building a web application that will process many transactions a second. I am using an Express Server with Node Js. On the database side, I am using Redis to store attributes of a user which will fluctuate continuously based on stock prices. I am using MongoDB to store semi-permanent attributes like Order configuration, User configuration, etc.,
I am hitting a race condition when multiple orders placed by a user are being processed at the same time, but only one would have been eligible as a check on the Redis attribute which stores the margin would not have allowed both the transactions.
The other issue is my application logic interleaves Redis and MongoDB read + write calls. So how would I go about solving race condition across both the DBs
I am thinking of trying to WATCH and MULTI + EXEC on Redis in order to make sure only one transaction happens at a time for a given user.
Or I can set up a Queue on Node / Redis which will process Orders one by one. I am not sure which is the right approach. Or how to go about implementing it.
This is all pseudocode. Application logic is a lot more complex with multiple conditions.
I feel like my entire application logic is a critical section ( Which I think is a bad thing )
//The server receives a request from Client to place an Order
getAvailableMargin(user.username).then((margin) => { // REDIS call to fetch margin of user. This fluctuates a lot, so I store it in REDIS
if (margin > 0) {
const o = { // Prepare an order
user: user.username,
price: orderPrice,
symbol: symbol
}
const order = new Order(o);
order.save((err, o) => { // Create new Order in MongoDB
if (err) {
return next(err);
}
User.findByIdAndUpdate(user._id, {
$inc: {
balance: pl
}
}) // Update balance in MongoDB
decreaseMargin(user.username) // decrease margin of User in REDIS
);
}
});
Consider margin is 1 and with each new order margin decreases by 1.
Now if two requests are received simultaneously, then the margin in Redis will be 1 for both the requests thus causing a race condition. Also, two orders will now be open in MongoDB as a result of this. When in fact at the end of the first order, the margin should have become 0 and the second order should have been rejected.
Another issue is that we have now gone ahead and updated the balance for the User in MongoDB twice, one for each order.
The expectation is that one of the orders should not execute and a retry should happen by checking the new margin in Redis. And the balance of the user should also have updated only once.
Basically, would I need to implement a watch on both Redis and MongoDB
and somehow retry a transaction if any of the watched fields/docs change?
Is that even possible? Or is there a much simpler solution that I might be missing?

Resources