How to manage multiple parallel threads in nodejs?

Scenario:
I am using node.js to write a heavy application that listens to tweets using the streaming API, and for every tweet it calls 3-4 REST APIs, then listens to a web socket for about 5-10 minutes and then calls 2 more REST APIs.
Current Approach:
I have prepared functions for every task, but I am not using callbacks. From tweet function F, I call another function F1, then from inside F1 I call F2, and so on...
Problem:
It makes the code super messy, and the major problem is that the concurrent requests overlap and share data with each other. For example, while listening to a web socket I pass endpoint and auth info:
listenXYZ(endpoint, auth)
But it seems that when two requests hit this function at the same time, the same endpoint is sometimes passed to both.
Why does this happen?
How can I accomplish multithreading with a nice, organized data flow and controlled memory management?
Can I use workers?
Multithreading?
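
Node.js runs your JavaScript on a single event-loop thread, so the overlap described above typically comes from state kept in shared (module-level) variables rather than from actual threads. A minimal sketch of one way to keep each tweet's state local, using async/await; stream, fetchAuth, resolveEndpoint, and finalize are hypothetical stand-ins for the tweet stream and the REST calls, and listenXYZ is assumed to return a promise:

async function handleTweet(tweet) {
    // All state lives in this invocation's scope, so concurrent
    // tweets cannot overwrite each other's endpoint or auth.
    const auth = await fetchAuth(tweet);                 // REST call
    const endpoint = await resolveEndpoint(tweet, auth); // REST call
    await listenXYZ(endpoint, auth);                     // web socket, 5-10 min
    await finalize(tweet);                               // final REST calls
}

stream.on('tweet', tweet => {
    handleTweet(tweet).catch(console.error);             // one isolated flow per tweet
});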

Related

Process long arrays in Cloud Functions

I'm developing an application that lets users broadcast videos. As many social networks do, users need to receive a notification when someone goes live. To do so I'm using Cloud Functions and I pass to the function the array of the users that must receive the notification; for every user in the array I need to extract the FCM TOKEN from the server and then send the notification.
For arrays of 10/20 users the function doesn't take long, but for 150/300 users I sometimes get a timeout or a very slow execution.
So my question is: is it possible to divide the array into groups of 20/30 users and process the groups at the same time?
Thanks
There are two ways to answer this.
From a development point of view, some languages make concurrent processing easier (Go is very handy for this). Because you spend a lot of time in API calls (FCM), performing several calls concurrently is a first solution.
From an architecture point of view, Pub/Sub and/or Cloud Tasks are well designed for this:
Your first function only creates chunks of messages to send and posts them to Cloud Tasks or Pub/Sub.
Your second function receives the chunks and sends the messages. The chunks are processed in parallel on several function instances.
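
A minimal sketch of the first approach inside a single function: split the users into chunks and process the chunks concurrently. sendNotification is a hypothetical per-user helper wrapping the FCM token lookup and send:

function chunk(arr, size) {
    const out = [];
    for (let i = 0; i < arr.length; i += size) out.push(arr.slice(i, i + size));
    return out;
}

async function notifyAll(users) {
    const groups = chunk(users, 25);              // groups of ~25 users
    await Promise.all(groups.map(async group => {
        for (const user of group) {
            await sendNotification(user);         // FCM token lookup + send
        }
    }));                                          // all groups run concurrently
}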

air traffic controller for threads when calling a REST API

DISCLAIMER: If this post is off-topic to this site, please recommend a site where this post would be appropriate.
On Ubuntu 18.04, in bash, I am writing a network-based, threaded application that requires multiple servers. It receives files through the network and processes them, ultimately making an API call that finishes the processing and logs the results to a database for later retrieval and reporting.
So far I have written the application using non-threaded programming models and concepts. That means the files are processed one at a time in real time. This works great if there is no sudden burst of files and/or a backlog of files to process. The main bottleneck has been the way I sequentially send files to the API one after another, waiting until the entire operation has completed for one file and the API returns the results. The API has a rate limit of 8 calls per second. But since each call takes 0.75 to 1 second, my program waits until the operation is done and only processes about 1 file per second through the API. In short, I did not have to worry about scheduling API calls because I could barely do one call per second.
Since the capacity is there to process 8 files per second, and I need more speed, I have been converting my single-threaded, sequential application into a parallel, scalable, multi-threaded application. This new version can spawn enough threads to send 8 files per second to the REST API and much more. So now I have the opposite problem. I am sending too many requests per second to the REST API and am in danger of triggering penalties, etc. Ultimately, when my traffic is higher, I will upgrade my subscription to the API and get more calls per second, but this current dilemma has got me thinking about how to schedule the API calls with different threads.
The purpose of this post is to discuss an idea about how to schedule these REST API calls across various threads. Specifically, I want to discuss how to coordinate timing and usage of the API while maintaining efficiency and yet not overloading the API. In short, I want to coordinate a group of threads so that the API is properly used. Not too fast and not too slow.
Independent of my application, this idea could be useful in a number of generically similar scenarios.
My idea is to create an "air traffic controller" ("ATC") so that the threads of the application have a centralized timing authority to check when they are ready to submit files to the REST API. The ATC would know how many time slots/calls per time period (in this case, calls per second) the API can schedule. The ATC would be listening for the threads to request a time slot ("launch code") which would give them a time slot in the future to perform their API call. The ATC would decide based on the schedule of other launch codes that it has already handed out.
In my case, from the start of the upload of the file to the API, it could take 0.75 to 1 second to complete the processing and receive a response from the API. This does not affect the count of new API calls that can be performed. It is just a consideration of how long the threads will be waiting once they call the API. It may not be relevant to this overall discussion.
Each thread would obviously have to do some error handling. If the API timed out or threw an error, then the thread would have to handle it, get back in line with the ATC (if appropriate), and ask for a new launch code. Maybe it should report the error to the ATC for centralized logging?
In situations where the file processing needs burst above 8 files per second, there would be a scheduling backlog where the threads should wait their turn as assigned by the ATC.
Here are some other considerations:
Function
The ATC would be a lightweight daemon that does the following:
- listens on some TCP port
- receives a request
  (security token (?), thread id, priority)
- authenticates the request (?)
- examines the schedule
- reserves the next available time slot
- returns the launch code
  (security token (?), current time, launch timing offset to current time, URL and auth token for the API)
- expunges expired launch codes
The ATC would need the following:
- to know what port it is supposed to run on
- to know how many slots per time period it was set to schedule (e.g. 8 per second)
- to have super fast read/write access to the schedule (associative array?)
- to know the URL and corresponding auth token for the thread to use
- maybe to know multiple URLs and auth tokens for load balancing
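
What the ATC hands out is essentially a token-bucket style schedule. A minimal sketch of the slot-reservation core, written in Node.js for consistency with the rest of this page (the original post is bash-based), with the TCP listener, auth, and expiry left out:

class Scheduler {
    constructor(slotsPerSecond) {
        this.interval = 1000 / slotsPerSecond; // ms between launch codes
        this.nextSlot = Date.now();            // next free time slot
    }
    reserve() {
        const now = Date.now();
        // Never schedule in the past; otherwise append to the backlog.
        this.nextSlot = Math.max(this.nextSlot, now);
        const launchAt = this.nextSlot;
        this.nextSlot += this.interval;        // consume the slot
        return { launchAt, waitMs: launchAt - now };
    }
}

const atc = new Scheduler(8);                  // 8 calls per second
const { waitMs } = atc.reserve();              // caller sleeps waitMs, then fires

When bursts exceed 8 files per second, waitMs simply grows, which produces exactly the scheduling backlog described above.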
Here are more things to consider:
Security
How could we keep the ATC secure while ensuring high performance?
Network-level security (e.g. firewalls allowing only the IP addresses of the file-processing servers?)
Auth tokens or logins and passwords?
Performance
What would the requirements be for this ATC server? Would this be taxing to a CPU and memory?
Timing
How often would an NTP call be needed? By the ATC server? By the servers which call the API?
Scalability
Being able to provide different URLs and auth tokens would allow the ATC to load balance with different API providers.
Threading of the ATC itself
Would the ATC need to spawn threads to be able to handle each new request?
How does a web server handle requests?
How would the various threads share a common schedule?
In a non-threaded environment, the ATC would possibly keep an associative array in memory to keep performance as high as possible. How would the various threads of the ATC have access to the same schedule?
So here is my question. Does this exist? If not, what are some best practices in trying to build the above?
It seems like a beanstalkd kind of network service, except it only provides permission/scheduling and is extremely dependent on timing.

NodeJS with Redis message queue - How to set multiple consumers (threads)

I have a nodejs project that exposes a simple REST API for an external web application. This webhook must cope with a large number of requests per second and return 200 OK very quickly to the caller. To make that happen, I am investigating a simple Redis queue where each request is enqueued and handled asynchronously later on (via a consumer thread).
The redis simple queue seems like an easy way to achieve this task (https://github.com/smrchy/rsmq)
1) Is rsmq.receiveMessage() { ....... } a blocking method? If this handler is slow, will it impact my server's performance?
2) If the answer to question 1 is yes, is it recommended to extract the consumption of the messages to an external microservice (a dedicated consumer)? What are the best practices for creating multithreaded consumers in such an environment?
You can use the pub/sub feature provided by Redis: https://redis.io/topics/pubsub
You can publish to various channels without any knowledge of the subscribers. Subscribers can subscribe to the channels they wish.
sreeni
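
A minimal sketch of that pub/sub pattern with the redis npm client (v4 API); the channel name "requests" is made up for illustration:

const { createClient } = require('redis');

async function main() {
    const subscriber = createClient();
    await subscriber.connect();
    await subscriber.subscribe('requests', message => {
        console.log('received', message);   // handle the payload off the request path
    });

    const publisher = createClient();
    await publisher.connect();
    await publisher.publish('requests', JSON.stringify({ id: 1 }));
}

main().catch(console.error);

Note that Redis pub/sub is fire-and-forget: a message published while no subscriber is listening is simply lost, which is one reason a persistent queue like rsmq may be a better fit here.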
1) No, it won't block the event loop. However, you will only start processing a second message once you call the "next" method, i.e., you will process one message at a time. To overcome this, you can start multiple workers in parallel. Take a look here: https://stackoverflow.com/a/45984677/7201847
2) That's an architectural decision that depends on the load you have to support and the hardware capacity you have. I would recommend at least two Node.js processes, one for adding the messages to the queue and another one for actually processing them, with the option to start additional worker processes if needed, depending on the results of your performance tests.
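
A sketch of the parallel-workers idea using the rsmq-worker package, which appears to be what the "next" callback above refers to; handle is a hypothetical async function doing the real work:

const RSMQWorker = require('rsmq-worker');

for (let i = 0; i < 4; i++) {               // 4 concurrent consumers
    const worker = new RSMQWorker('myqueue');
    worker.on('message', (msg, next, id) => {
        // Each worker processes one message at a time; calling next()
        // acknowledges the message and fetches the following one.
        handle(msg).then(() => next());
    });
    worker.start();
}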

WCF Multithreading With Scalability Considerations

Working on a stateless WCF REST web service, I have an operation with 3 independent tasks. Each one can be run independently. Each task consists of a web service call to an external API and a follow-up local DB read operation that takes less than 0.25 sec.
The first thing that comes to mind is that I should spawn 3 separate threads, then join and return the result. Using a thread pool would probably not be a good idea here as it is limited to 250 threads max.
Performance is of concern, but not at the expense of scalability.
Should I be concerned about the overhead of starting & joining 3 separate threads for each web service call ?
Wrap the calls to the external service into async Task methods, then call them from your WCF method. It will use the thread pool and will queue your web service calls nicely if the thread pool is exhausted.
You can use async IO to perform the webservice calls. Async IO does not occupy any thread at all while it is running. You can do the same thing for the database calls. This alleviates any threading concern that you might have.
Alternatively, you can rely on the thread pool. You can increase the limits. You can calculate how many threads you need: if 100 requests arrive per second and each one takes 2 seconds to complete, you need 200 threads. This can easily be served by the built-in thread pool assuming you configure appropriate limits.
In case the external service is down and takes 30 seconds to timeout this number now shoots up to 3000 threads which I consider unsafe. So you either need a low timeout, a circuit breaker or async IO.
So in order to decide you need to forecast load and latency.
I'll link to some discussion for why and when to use async IO:
https://stackoverflow.com/a/25087273/122718 Why does the EF 6 tutorial use asynchronous calls?
https://stackoverflow.com/a/12796711/122718 Should we switch to use async I/O by default?

Node/Express: running specific CPU-intensive tasks in the background

I have a site that makes the standard data-bound calls, but also has a few CPU-intensive tasks which are run a few times per day, mainly by the admin.
These tasks include grabbing data from the db, running a few different time-consuming algorithms, then reuploading the data. What would be the best method for making these calls and having them run without blocking the event loop?
I definitely want to keep the calculations on the server so web workers wouldn't work here. Would a child process be enough here? Or should I have a separate thread running in the background handling all /api/admin calls?
The basic answer to this scenario in Node.js land is to use the core cluster module - https://nodejs.org/docs/latest/api/cluster.html
Its API makes it easy to:
- launch worker node.js instances on the same machine (each instance will have its own event loop)
- keep a live communication channel for short messages between instances
This way, any work done in a child instance will not block your master event loop.
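
The cluster module builds on child_process; for a one-off admin job like this, a minimal sketch with child_process.fork gives the same separate-event-loop, message-channel behavior. The Express route and the heavy-task.js script are hypothetical:

const { fork } = require('child_process');

app.post('/api/admin/recalculate', (req, res) => {
    const worker = fork('./heavy-task.js');          // separate process, own event loop
    worker.send({ job: req.body });                  // short message over the IPC channel
    worker.once('message', result => res.json(result));
    worker.once('error', err => res.status(500).send(err.message));
});

// heavy-task.js (sketch): the CPU-bound work happens here
// process.on('message', ({ job }) => {
//     process.send(runAlgorithms(job));             // runAlgorithms is hypothetical
// });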
