firebase-queue: Multiple workers not running correctly?

I'm pretty new to Node.js, though I've been writing JavaScript for years. I'm more than open to any Node advice on best practices I'm not following, or other rethinks. That said:
I'm building a system in which a user creates a reservation, and simultaneously submits a task for my firebase-queue to pick up. This queue has multiple specs associated with it. In turn, it's supposed to:
Check availability and, in response, confirm or flag an alert on the reservation, updating the Firebase data accordingly.
Update the user's reservations, an index of reservation object keys, removing any redundant ones.
Use node-schedule to create dated functions that send notifications about the reservation's pending expiration.
However, when I run my script, only one of the firebase-queues that I instantiate runs. I can look in the dashboard and see that the progress is at 100, the _state is the new finished_state (which is the next spec's start_state), but that next queue won't pick up the task and process it.
If I quit my script and rerun it, that next queue will work fine. And then the queue after that won't work, until I repeat the act of quitting and rerunning the script. I can continue this until the entire task sequence completes, so I don't think the specs or the code being executed itself are blocking. I don't see any error states spring up, anyway.
From the documentation, it looks like I should be able to write the script this way, with multiple calls to new Queue(queueRef, options, function(data, progress, resolve, reject) {...}), each queue picking up the tasks matching its options (all of which are basically:
var options = {
  'specId': 'process_reservation',
  'numWorkers': 5,
  'sanitize': true,
  'suppressStack': false
};
but, nope. I could spawn a child process for each of the queue instances, but I'm not sure whether that's an extreme reaction to the issue I'm seeing, whether it would complicate the Node structure in terms of shared module exports, or whether it would start eating into my concurrent connection count.
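For reference, the overall shape is roughly this, a minimal sketch with two hypothetical chained specs (the refs, spec names, and the work inside each handler are placeholders):

var Queue = require('firebase-queue');
var Firebase = require('firebase');

var queueRef = new Firebase('https://<my-app>.firebaseio.com/queue');

// Spec 'process_reservation' is configured (under queue/specs) so that its
// finished_state is the start_state of spec 'update_reservation_index'.
var processQueue = new Queue(queueRef, {
  'specId': 'process_reservation',
  'numWorkers': 5,
  'sanitize': true,
  'suppressStack': false
}, function (data, progress, resolve, reject) {
  // ...check availability, confirm/alert the reservation...
  resolve(data); // moves the task into the next spec's start_state
});

var indexQueue = new Queue(queueRef, {
  'specId': 'update_reservation_index',
  'numWorkers': 5,
  'sanitize': true,
  'suppressStack': false
}, function (data, progress, resolve, reject) {
  // ...update the user's reservation index...
  resolve(data);
});

Only the first of these ever picks anything up until I restart the script.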
Thanks!

Related

Backpressuring Snowflake using "rowStreamHighWaterMark" in snowflake-sdk?

I'm using snowflake-sdk and snowflake-promise to stream results (to avoid loading too many objects in memory).
For each streamed row, I want to process the received information (an ETL-like job that performs write-backs). My code is quite basic and similar to this simplistic snowflake-promise example.
My current problem is that .on('data', ...) is called more often than I can handle. (My ETL-like job can't keep up with the incoming rows, and the DB connection pool I use for the write-backs gets exhausted.)
I tried setting rowStreamHighWaterMark to various values (1, 10 [default], 100, 1000, 2000 and 4000) in an effort to slow down/backpressure stream.Readable but, unfortunately, it didn't change anything.
What did I miss? How can I better control when to consume the read data?
If this were written synchronously, you would see that "being pushed more data than you can handle writing" cannot happen, because:

while (data) {
  data.readRow();
  doSomethingAwesome();
  writeDataViaPoolThatBacksUp();
}

simply cannot spin too fast.
Now if you are accepting data on one async thread, pushing that data onto a queue, and draining the queue on another async thread, you will get the problem you describe (that is, your queue explodes). So you need to slow or pause the read side when the write side falls too far behind.
Given the reader is pushing onto the assumed queue: when that queue gets too long, stop reading.
The other way you might be doing this is with no work queue at all, firing an async write each time the conditions are met. This is bad because you have no way to track outstanding work, and you are doing many small updates to the DB, which Snowflake really dislikes. A better approach is to build up a local set of data changes, call it a batch, and when the batch reaches a certain size, flush the change set in one operation (and flush the batch again when input completes, to catch the dregs).
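A minimal sketch of that pause-and-batch pattern, assuming the row stream comes from the statement's streamRows() and a hypothetical flushBatch() does the bulk write-back:

var BATCH_SIZE = 1000; // assumption: tune to what your write pool can absorb
var batch = [];

var stream = statement.streamRows(); // statement from connection.execute(...)

stream.on('data', function (row) {
  batch.push(row);
  if (batch.length >= BATCH_SIZE) {
    stream.pause(); // backpressure: no more 'data' events while we write
    flushBatch(batch, function (err) { // hypothetical single bulk write-back
      if (err) return stream.destroy(err);
      batch = [];
      stream.resume(); // reading continues only after the writes have landed
    });
  }
});

stream.on('end', function () {
  flushBatch(batch, function () {}); // flush the dregs when input completes
});

Because pause() stops the flow of 'data' events, the reader can never get more than one batch ahead of the writer.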
Snowflake support got back to me with an answer.
They told me to create the connection this way:
var connection = snowflake.createConnection({
  account: "testaccount",
  username: "testusername",
  password: "testpassword",
  rowStreamHighWaterMark: 5
});
Full disclosure: my project has changed and I could NOT recreate the problem in my local environment, so I couldn't assess the answer's validity; still, I wanted to share it in case somebody can get some hints from this information.

Camunda Engine behaviour with massive multi-instances processes and ready state

I wonder how Camunda manages multiple instances of a sub-process.
For example, consider this BPMN:
Let's say the multi-instance process iterates over a big collection, spawning 500 instances.
I have a function in a web app that calls the endpoint to complete the common user task and, in that first API call's callback, performs another call to the Camunda engine to get all tasks. I am supposed to get a list of 500 sub-process user tasks (the ones generated by the multi-instance process).
What if the get-tasks call is performed before the Camunda engine has finished instantiating all the sub-processes?
Do I get a partial list of tasks?
How can I detect that the main and sub-processes are ready?
I don't really know if Camunda can manage this by itself, so I thought of the following solution, knowing I can only add code through the Modeler environment with Groovy (JavaScript as well, but the code parts already added are all Groovy):
Use a sub-process throw event that the main process catches, then, for each signal emitted, count the tasks that are ready and compare against the expected number.
Thanks
I would likely spawn the tasks as parallel processes (500 of them) and then go to a next step in which I signal or otherwise set a state indicating that the spawning is complete. I would then join the parallel processes back together and have a task there that signals or sets a state indicating that all the parallel processes are done. See https://docs.camunda.org/manual/7.12/reference/bpmn20/gateways/parallel-gateway/. This way you know exactly at what point (after spawning is done and before the join) you have a chance of getting your 500 spawned sub-processes.
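If you instead poll from the web app, a sketch like this could detect readiness (the base URL and the processInstanceId filter are assumptions; depending on how the sub-processes are modeled you may need a different task query, and global fetch assumes Node 18+):

const BASE = 'http://localhost:8080/engine-rest'; // assumed engine REST base URL

// Poll the engine's task count until the expected number of user tasks exists.
async function waitForTasks(processInstanceId, expected) {
  for (;;) {
    const res = await fetch(BASE + '/task/count?processInstanceId=' + processInstanceId);
    const body = await res.json();
    if (body.count >= expected) return body.count; // e.g. all 500 tasks created
    await new Promise(function (resolve) { setTimeout(resolve, 500); }); // back off, re-poll
  }
}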

Most Ideal Way to Run a Scheduled Job with Angular-Meteor

I'm writing an appointment-scheduling app with angular-meteor. One of the requirements is that a text notification be sent out to the customer who made the appointment; the customer provides a cell number. But basically, all I want to do is send out an email X minutes before the appointment time. Running on the angular-meteor stack, what might be the best way to do this? All of the appointment information is saved to a Mongo DB.
You might be interested in Meteor job-collection package (not specific to angular-meteor):
A persistent and reactive job queue for Meteor, supporting distributed workers that can run anywhere.
job-collection is a powerful and easy to use job manager designed and built for Meteor.js.
It solves the following problems (and more):
Schedule jobs to run (and repeat) in the future, persisting across server restarts
[…]
In particular, job.after(someTimeBeforeAppointment):
// Server
var myJobs = JobCollection('myJobQueue');

// Start the myJobs queue running
myJobs.startJobServer();

// Create a Job (e.g. in a Meteor method)
var job = new Job(myJobs, 'jobType', jobData);

// Specify when it can run and save it.
job.after(someTimeBeforeAppointment).save();

// Server (or could be a different server!)
// How jobs should be processed.
myJobs.processJobs('jobType', function (job, done) {
  var jobData = job.data;
  // Do something… could be asynchronous.
  job.done(); // or job.fail();
  // Call done when work on this job has finished.
  done();
});
The pollInterval can be specified in processJobs options. Default is every 5 seconds.
Write a node script that sends an email to every customer who has an appointment between X minutes and X+10 minutes from the time of running. Once the email is sent, set a boolean flag on the appointment in mongo so it doesn't get sent twice.
Run a cron that triggers it every 5 minutes.
The overlap should make sure that nothing slips through the cracks, and the flag will prevent duplicate emails from being sent.
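A rough sketch of that script, assuming an appointments collection with startsAt and reminderSent fields and a hypothetical sendEmail() helper (all names are placeholders):

const { MongoClient } = require('mongodb');

const LEAD_MINUTES = 30;   // the "X" above -- an assumption
const WINDOW_MINUTES = 10; // matches the 10-minute window in the answer

async function sendReminders() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const appointments = client.db('scheduler').collection('appointments');

  const from = new Date(Date.now() + LEAD_MINUTES * 60 * 1000);
  const to = new Date(from.getTime() + WINDOW_MINUTES * 60 * 1000);

  // Appointments starting in the window that haven't been notified yet.
  const due = await appointments
    .find({ startsAt: { $gte: from, $lt: to }, reminderSent: { $ne: true } })
    .toArray();

  for (const appt of due) {
    await sendEmail(appt); // hypothetical mailer
    await appointments.updateOne(
      { _id: appt._id },
      { $set: { reminderSent: true } } // the flag that prevents duplicates
    );
  }

  await client.close();
}

sendReminders().catch(console.error);

Wire it up with a crontab entry like */5 * * * * node send-reminders.js.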

Good approaches for queuing simultaneous NodeJS processes

I am building a simple application to download a set of XML files and parse them into a database using the async module (https://npmjs.org/package/node-async) for flow control. The overall flow is as follows:
Download list of datasets from API (single Request call)
Download metadata for each dataset to get link to XML file (async.each)
Download XML for each dataset (async.parallel)
Parse XML for each dataset into JSON objects (async.parallel)
Save each JSON object to a database (async.each)
In effect, for each dataset there is a parent process (2) which sets off a series of asynchronous child processes (3, 4, 5). The challenge I am facing is that, because so many parent processes fire before all of the children of a particular process are complete, child processes seem to get queued up in the event loop, and it takes a long time for all of the child processes of a particular parent to resolve and allow garbage collection to clean everything up. The result is that even though the program doesn't appear to have any memory leaks, memory usage is still too high, ultimately crashing the program.
One solution which worked was to make some of the child processes synchronous so that they can be grouped together in the event loop. However, I have also seen an alternative solution discussed here: https://groups.google.com/forum/#!topic/nodejs/Xp4htMTfvYY, which pushes parent processes into a queue and only allows a certain number to be running at once. My question, then, is: does anyone know of a more robust module for handling this type of queueing, or any other viable alternative for this kind of flow control? I have been searching, but so far no luck.
Thanks.
I decided to post this as an answer:
Don't launch all of the processes at once. Let the callback of one request launch the next one. The overall work is still asynchronous, but each request gets run in series. You can then pool up a certain number of the connections to be running simultaneously to maximize I/O throughput. Look at async.eachLimit and replace each of your async.each examples with it.
Your async.parallel calls may be causing issues as well.
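For example, step 2 with async.eachLimit might look like this (the limit of 5 and downloadMetadata are placeholders):

var async = require('async');

// At most 5 datasets are in flight at once; the rest wait their turn.
async.eachLimit(datasets, 5, function (dataset, callback) {
  // hypothetical: downloads the metadata, then kicks off steps 3-5 for this dataset
  downloadMetadata(dataset, callback);
}, function (err) {
  if (err) return console.error(err);
  console.log('All datasets processed');
});

The same substitution works for the async.parallel steps: async.parallelLimit takes the identical task list plus a concurrency cap.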

Node.js, (Hi)Redis and the multi command

I'm playing around with node.js and redis and installed the hiredis library via this command
npm install hiredis redis
I looked at the multi examples here:
https://github.com/mranney/node_redis/blob/master/examples/multi2.js
At line 17 it says
// you can re-run the same transaction if you like
which implies that the internal multi.queue object is never cleared once the commands have finished executing.
My question is: How would you handle the situation in an http environment? For example, tracking the last connected user (this doesn't really need multi as it just executes one command but it's easy to follow)
var http = require('http');
var redis = require('redis');
var client = redis.createClient();
var multi = client.multi();

http.createServer(function (request, response) {
  multi.set('lastconnected', request.ip); // won't work, just an example
  multi.exec(function (err, replies) {
    console.log(replies);
  });
});
In this case, multi.exec would execute 1 transaction for the first connected user, and 100 transactions for the 100th user (because the internal multi.queue object is never cleared).
Option 1: Should I create the multi object inside the http.createServer callback function, which would effectively kill it at the end of the function's execution? How expensive in terms of CPU cycles would creating and destroying this object be?
Option 2: The other option would be to create a new version of multi.exec(), something like multi.execAndClear(), which would clear the queue the moment Redis has executed that bunch of commands.
Which option would you take? I suppose option 1 is better - we're killing one object instead of cherry-picking parts of it - I just want to be sure, as I'm brand new to both Node and JavaScript.
The multi objects in node_redis are very inexpensive to create. As a side-effect, I thought it would be fun to let you re-use them, but this is obviously only useful under some circumstances. Go ahead and create a new multi object every time you need a new transaction.
One thing to keep in mind is that you should only use multi if you actually need all of the operations to execute atomically in the Redis server. If you just want to batch up a series of commands efficiently to save network bandwidth and reduce the number of callbacks you have to manage, just send the individual commands, one after the other. node_redis will automatically "pipeline" these requests to the server in order, and the individual command callbacks, if any, will be invoked in order.
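A sketch of that advice applied to the example above: a fresh multi per request where atomicity matters, plain pipelined commands where it doesn't (the port is arbitrary):

var http = require('http');
var redis = require('redis');
var client = redis.createClient();

http.createServer(function (request, response) {
  // A fresh multi per request: cheap to create, and its queue starts empty.
  var multi = client.multi();
  multi.set('lastconnected', request.connection.remoteAddress);
  multi.exec(function (err, replies) {
    response.end(err ? err.message : JSON.stringify(replies));
  });

  // If atomicity isn't needed, skip multi entirely; node_redis pipelines
  // consecutive commands to the server on its own:
  // client.set('lastconnected', request.connection.remoteAddress);
}).listen(8080);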
