How to distribute NodeJS requests to several servers and merge the results


I have a simple NodeJS web app that calls several apis asynchronously and merges the results to return one big result. Now let's say that I want to optimize this. How do I do this?
I am new to NodeJS and also to the concept of scaling systems. I have been reading about load balancing, distributed systems, etc. I think this is the right way to go, but honestly I don't know.
I was thinking of doing something like this -
Set up a system that has several servers, and each has an instance of a NodeJS webapp that makes an api call given a path, and returns the result.
Have a master server that grabs the result from each of these servers, and merge the result and return it to the client.
Is this the right way to go? What technologies should I use? Thank you for your help.

I am guessing you are trying to set up web crawling or API crawling to grab data from a 3rd-party endpoint. If that is true, you would have a list of users, IDs, or something similar that you pass to the web service you call to grab the data.
First off, making a large number of requests very fast and in a stable way is tricky, and depends on several factors:
Whether the 3rd-party API is rate limited.
The network connection on the client machine making the requests.
Error handling for both API and client errors, like connection resets.
The sheer volume of data you are fetching back, for example if you are trying to crawl data on millions of users from a 3rd-party API as fast as possible.
Your instinct is correct that you would have to scale this over several servers, or at least several parallel node processes on a machine with a lot of resources. However, my recommendation would be to start small, test, and then scale. Here are a few steps.
Use a good, robust node HTTP client like axios.
If you are dealing with a huge number of items (usernames, IDs, emails, etc.) you will need a stable way of iterating over them. Put them in a database like PostgreSQL or MySQL.
From here on, figure out the fastest rate at which your API supports being called, and write a stable function to iterate over your 'input' and call the API.
Then you have a couple of options. If the data you are collecting is separate for each request you make, you can save it back in the database for each input. If you literally want to merge the data from multiple API calls, you can use a key-value store like Redis. You can give an ID to each call and create a combination key in input+request_id format; then, when all requests are done, you can merge them.
When you have a small-scale model in place, you can add a good job manager like Kue or Bull to the mix, and split the set of inputs in the database from point (2) over several jobs that can be run in parallel.
Once you have a stable job manager that can repeat this node process for a range of inputs, you are at a point where you can scale.
Deploy this same code on multiple servers that all talk to the same database and Redis. Run the node process using a process manager like PM2.
Finally, the way the setup works is that each copy of the same node program fetches a different set of inputs (usernames, IDs, etc.) from the source database, and writes the results back to the database or Redis, depending on how you want to handle the output.
Optionally, post-process the data in Redis to fetch the key-value pairs and merge the responses grouped by input.
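As a rough illustration of the input+request_id key idea and the final merge step, here is a minimal sketch. It assumes a plain in-memory Map standing in for Redis so that it runs without a server; the key format "<input>:<requestId>" and the helper names saveResult and mergeForInput are hypothetical, not part of any library.

```javascript
// A plain Map stands in for Redis here so the sketch runs without a server.
const store = new Map();

// Hypothetical key format: "<input>:<requestId>".
function saveResult(input, requestId, data) {
  store.set(`${input}:${requestId}`, data);
}

// Collect every stored response for one input and merge them into one object.
function mergeForInput(input) {
  const merged = {};
  for (const [key, value] of store) {
    // The trailing colon keeps "user4" from matching "user42".
    if (key.startsWith(`${input}:`)) {
      Object.assign(merged, value);
    }
  }
  return merged;
}

saveResult('user42', 'profile', { name: 'Ada' });
saveResult('user42', 'orders', { orderCount: 3 });
saveResult('user7', 'profile', { name: 'Bob' });

const mergedUser42 = mergeForInput('user42');
console.log(mergedUser42);
```

With a real Redis instance the same shape would apply, with the writes and the prefix scan done through whichever Redis client you pick.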
Some important things you have to be hyper-aware of when coding this are:
Memory management: Use design patterns, code, and libraries that save you the most memory. Load the absolute minimum of what you need into memory. E.g. iterating over an array of 1 million usernames in memory is more expensive than keeping them in a database and paging over them.
Error handling: There will be lots of errors: API errors, unforeseen exceptions, memory leaks, network drops, etc. Having a robust error handling and recovery mechanism will save the day.
Logging: Good-quality logging will be critical to keep a check on how different parts of the system are doing. Look at winston.
Throttling API calls: Remember that making 10,000 API calls in the same minute will likely crash your machine, or even most APIs. At the very least it will go very slowly due to memory overload. However, adding a slight delay (like 10 milliseconds) between every 10 parallel calls will be a HUGE boost in speed and make the calls much more stable. This strategy is called throttling or rate-limiting the API calls. Finding a sweet spot that works for your problem is important. Yes, going slow can actually make you reach your goal faster!
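To make the throttling point concrete, here is a minimal sketch of "10 parallel calls, then a 10 millisecond pause". The function names runThrottled and fakeApiCall are made up for illustration, and fakeApiCall is only a stand-in for a real HTTP request (for example one made with axios); the batch size and delay are the numbers from the text, not tuned values.

```javascript
// Resolve after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Run `worker` over `inputs` in batches of `batchSize`,
// pausing `delayMs` between batches.
async function runThrottled(inputs, worker, batchSize = 10, delayMs = 10) {
  const results = [];
  for (let i = 0; i < inputs.length; i += batchSize) {
    const batch = inputs.slice(i, i + batchSize);
    // allSettled keeps one failed call from rejecting the whole batch.
    const settled = await Promise.allSettled(batch.map((input) => worker(input)));
    results.push(...settled);
    // No need to wait after the last batch.
    if (i + batchSize < inputs.length) await sleep(delayMs);
  }
  return results;
}

// Hypothetical stand-in for a real API call.
async function fakeApiCall(id) {
  if (id === 13) throw new Error('simulated API error');
  return { id, ok: true };
}

const throttledRun = runThrottled(
  Array.from({ length: 25 }, (_, i) => i),
  fakeApiCall
);

throttledRun.then((results) => {
  const failed = results.filter((r) => r.status === 'rejected').length;
  console.log(`${results.length} calls finished, ${failed} failed`);
});
```

Using Promise.allSettled here is a design choice that fits the error handling point above: failures are collected per input so they can be logged and retried, instead of aborting the run.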
Your question was quite broad, without a specific code question, so this is a general strategy. Hopefully it will give you a good starting point and links to reference materials so you can start building your solution.

Related

Most efficient way to get users country in Next.js/Node.js?

In a Next.js app, what would be the most efficient (fastest) way to retrieve the user's country?
Among other things, I would use it to determine which scripts are loaded using next/script.
I looked into node-geoip and fast-geoip, but even though fast-geoip has a very thorough explanation (quoted below), I do not understand the mechanisms behind Next.js/Node.js well enough to evaluate the methods properly.
Concretely, what geoip-lite does is that, on startup, it reads the whole database from disk, parses it and puts it all on memory, thus this results in the startup time being increased by about ~233 ms along with an increase of memory being used by the process of around ~110 MB, in exchange for any new queries being resolved with low sub-millisecond latencies (~0.02 ms).
This works if you have a long-running process that will need to geolocate a lot of IPs and don't care about the increases in memory usage nor startup time, but if, for example, your use-case requires only geolocating a single IP, these trade-offs don't make much sense as only a small part of the database is needed to answer that query, not all of it.
This library tries to provide a solution for these use-cases by separating the database into chunks and building an indexing tree around them, so that IP lookups only have to read the parts of the database that are needed for the query at hand. This results in the first query taking around 9ms and subsequent ones that hit the disk cache taking 0.7 ms, while memory consumption is kept at around 0.7MB.
Wrapping it up, geoip-lite has huge overhead costs but sub-millisecond queries whereas this library doesn't have any overhead costs but its queries are slower (0.7-9 ms).
As geoip would be called for every visitor, I assume it would have to read the whole database on each initialization, thereby making fast-geoip the best choice?
Or is there some built-in mechanism that makes sure it is accessed from memory across subsequent requests when frequently loaded, hence making node-geoip the best choice?
Or am I focused on solving my problem the wrong way, and should I rather see if there is some way I can get the location via the user's browser?
Would appreciate any feedback, even if there is a completely different path worth exploring:-)

I read the documentation for fast-geoip. It's designed for "serverless" cloud services such as AWS Lambda, GCP Cloud Functions, CF Workers where RAM is limited and expensive.
Note the package author's emphasis on low steady-state RAM use in the graphs below.
In summary, assuming a cloud VM/bare-metal deployment and the need to call the IP to location method on every page request, there is probably no compelling reason to use the above package.
PS: Check if the above packages require you to rotate a DB file on disk every few weeks (or rebuild+redeploy your Node app) to keep data up to date. There are commercial REST APIs such as the one in my bio (I am the developer) that may mitigate this hassle, YMMV.
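On the question of whether the database is re-read for every visitor: in a long-running Node.js server process, module-level state is created once and then shared by all later requests handled by that process. The sketch below shows that behaviour with a hypothetical loadDatabase function and a tiny hard-coded table standing in for a real GeoIP database; it does not use either package's actual API.

```javascript
let loadCount = 0;
let database = null;

// Hypothetical stand-in for an expensive
// "read the whole database into memory" step.
function loadDatabase() {
  loadCount += 1;
  return new Map([
    ['203.0.113.7', 'DK'],
    ['198.51.100.4', 'US'],
  ]);
}

// Load on first use, then reuse the same in-memory object for every request.
function lookupCountry(ip) {
  if (database === null) database = loadDatabase();
  return database.get(ip) || null;
}

// Simulate three incoming requests handled by the same process.
const countries = ['203.0.113.7', '198.51.100.4', '192.0.2.1'].map((ip) =>
  lookupCountry(ip)
);
console.log(countries, `database loaded ${loadCount} time(s)`);
```

In a serverless deployment the picture differs, since each cold start is a fresh process and pays the load cost again, which is the case the low-memory package is aimed at.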

Options for getting a CPU intensive job off my web server?

I have been working on a Web App for visualizing live data. It is crucial that this data is kept up to date on the client side without such updates being invoked directly by the client (e.g. no button presses or refreshing the page). Currently, on page load, I grab the current data set from a database (DynamoDB) via Ajax, and subsequent updates are pushed to any listening clients every 5 minutes via a Websockets connection (using Socket.io).
I have overlooked the computational load of this update job. It has to mine some data, process it, update the database, and send the update out to all clients. As a result, the web server is left unresponsive for about 30 seconds with each update. Furthermore, my current architecture limits me from putting my server behind a load balancer, which is something I anticipate coming up in the future. For both these reasons, I really need to get this update job off my web server.
I am relatively inexperienced in web development, and I don't feel I am knowledgeable enough about these technologies to know the drawbacks of the solutions I have come up with. Currently, I am considering:
Break the update off into a separate process so it does not block the Node event loop. This would solve my issue in the short term, but if I ever want to load balance my application, I can't have the update running on multiple machines.
Drop Websockets entirely and just have the client query the database every 5 minutes, while a separate process (or separate server if I want load balancing) keeps the database up to date without interacting directly with the client. Will this kind of access pattern put too much load on my db?
Have a separate server run the update, and send the result via Websockets (or maybe some other protocol) to my load balanced application servers, which then push that update to all listening clients as usual. Is this even possible?
Perhaps there are other solutions. It seems like this would be a relatively common problem, so I was hoping I could find some guidance here. What are the potential issues with the solutions I have proposed, and are there other possible solutions that may suit my use case better?

It sounds like you want one process sitting somewhere which crunches the data and publishes it to a stream. Clients can then subscribe to the stream as and when they like. Redis handles streams nicely: you could process your data and push it into a Redis stream. You could then create a small node service which subscribes to the Redis stream and pushes the formatted data out over a websocket or via polling.
In this scenario you can then scale up either the publishing process (the one crunching the numbers) if your data load goes up, or scale up your subscribed process (which serves the data over a websocket to browsers) if you get an influx of clients watching the data.
You can also easily distribute the hosting of these services across other machines, and even write them in different languages if you decide the number crunching needs something like threading.
You're then left with the issue of clients (web browsers) consuming this data with a load balancer in between. This can be a hard problem if you use websockets, and it comes with pros and cons. But importantly you'll have separated your data crunching from your result publishing, and that'll isolate your issue to only the load balancing.
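The split between the number-crunching publisher and the client-facing subscriber can be sketched as follows. This is only an outline under stated assumptions: a Node.js EventEmitter stands in for the Redis stream, the "clients" are plain objects instead of websocket connections, and the names publishUpdate and addClient are hypothetical.

```javascript
const { EventEmitter } = require('events');

// Stand-in for a Redis stream or pub/sub channel.
const channel = new EventEmitter();

// Publisher side: crunch the numbers, then publish the result.
// It knows nothing about who is listening.
function publishUpdate(rawValues) {
  const total = rawValues.reduce((sum, value) => sum + value, 0);
  const update = { count: rawValues.length, average: total / rawValues.length };
  channel.emit('update', update);
}

// Subscriber side: the web-facing service that forwards updates to clients.
const connectedClients = [];

function addClient(name) {
  const client = { name, received: [] };
  connectedClients.push(client);
  return client;
}

channel.on('update', (update) => {
  // With real websockets this would be a send or emit call per connection.
  for (const client of connectedClients) client.received.push(update);
});

const alice = addClient('alice');
const bob = addClient('bob');
publishUpdate([2, 4, 6]);
console.log(alice.received, bob.received);
```

Because the two sides only share the channel, each can be moved to its own process or machine and scaled on its own, which is the point the answer makes.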

I have done pretty much the same to check resources on some of our servers.
I have a C# service getting the information on each server that we manage and sending it to a queue (Amq).
From there, I have a stomp client fetching data from amq and emitting it to a websocket.
My main microservice fetches the data to save it into a db.
My visualisation webapp is connected to the same ws and fetches the data as it is sent, in order to display it.
The Amq step isn't mandatory at all, it's just something I had to work with (historical).
I don't know what type of data you are working with, so I don't know if my solution can apply to you.
Don't hesitate to ask if I'm not clear or you have any questions.

This is a big question and I'm not going to try and give you a definitive answer.
For option 2
It really depends on how expensive your queries are. You can make DynamoDB fast if you pay for enough throughput. That said, on the face of it, re-loading your whole dataset, when it sounds like it's probably large, probably isn't good engineering.
For option 3
This option seems best to me if it's achievable, although admittedly it's hard to say with such a complex system. Obviously you can't share your whole project.
Given you are already using AWS, you might want to look into AWS Lambda. If you can move the update process into a standalone job, you can host it on Lambda and move the load off the web server. Lambda is essentially infinitely scalable and you only pay for the compute you use.
This really depends on you being able to split the update task off into a separate service. It's likely you would need a fair bit of refactoring to isolate it as a service. If you can break little bits off at a time, and make the move gradually, even better.
If you consider trying this, and you've not used Lambda before, I would definitely start small with some hello world examples. Then try a very simple service in your application, and build up to taking on the update service.
You might also consider looking into Amazon Simple Queue Service (SQS) to handle the comms between clients and server.
Database tuning
If a lot of your update time is spent waiting for database actions to complete, rather than server processing, you can consider tuning that side of things up. Things to consider are:
Buying more throughput
Using batch operations (as these move load to DynamoDB from your server)
Tuning keys, indexes and database access
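As a small illustration of the batch operations point: DynamoDB's BatchWriteItem accepts at most 25 put or delete requests per call, so a larger update has to be split into chunks first. The sketch below shows only the chunking step in plain JavaScript; the chunk helper and the sample items are made up, and the actual write call through the AWS SDK is left out.

```javascript
// Split `items` into consecutive groups of at most `size` elements.
function chunk(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// BatchWriteItem takes at most 25 put/delete requests per call.
const MAX_BATCH_WRITE_ITEMS = 25;

// Hypothetical set of 60 updated records.
const updatedItems = Array.from({ length: 60 }, (_, i) => ({ id: i }));
const batches = chunk(updatedItems, MAX_BATCH_WRITE_ITEMS);

// Each batch would become one BatchWriteItem request.
console.log(batches.map((batch) => batch.length));
```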

When do I need to use worker processes in Heroku

I have a Node.js app with a small set of users that is currently architected with a single web process. I'm thinking about adding an after save trigger that will get called when a record is added to one of my tables. When that after save trigger is executed, I want to perform a large number of IO operations to external APIs. The number of IO operations depends on the number of elements in an array column on the record. Thus, I could be performing a large number of asynchronous operations after each record is saved in this particular table.
I thought about moving this work to a background job as suggested in Worker Dynos, Background Jobs and Queueing. The article gives as a rule of thumb that tasks that take longer than 500 ms be moved to background job. However, after working through the example using RabbitMQ (Asynchronous Web-Worker Model Using RabbitMQ in Node), I'm not convinced that it's worth the time to set everything up.
So, my questions are:
For an app with a limited amount of concurrent users, is it ok to leave a long-running function in a web process?
If I eventually decide to send this work to a background job it doesn't seem like it would be that hard to change my after save trigger. Am I missing something?
Is there a way to do this that is easier than implementing a message queue?

For an app with a limited amount of concurrent users, is it ok to leave a long-running function in a web process?
this is more a question of preference, than anything.
in general i say no - it's not ok... but that's based on experience in building rabbitmq services that run in heroku workers, and not seeing this as a difficult thing to do.
with a little practice, you may find that this is the simpler solution, as I have (it allows simpler code, and more robust code, as it splits the web away from the background processor - allowing each to run without knowing about each other directly)
If I eventually decide to send this work to a background job it doesn't seem like it would be that hard to change my after save trigger. Am I missing something?
are you missing something? not really
as long as you write your current in-the-web-process code in a well structured and modular fashion, moving it to a background process is not usually a big deal
most of the panic that people get from having to move code into the background, comes from having their code tightly coupled to the HTTP request / response process (i know from personal experience how painful it can be)
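A minimal sketch of what "well structured and modular" can mean here: the work itself lives in a function that knows nothing about HTTP, so the same function can be called inline from the after save trigger today and from a background worker later. The names syncRecord, enqueue, and drainQueue, the in-memory array used as a queue, and the fakeApi stand-in are all assumptions for illustration.

```javascript
// Business logic only: no request or response objects in sight.
async function syncRecord(record, callApi) {
  const results = await Promise.all(record.items.map((item) => callApi(item)));
  return { recordId: record.id, synced: results.length };
}

// Stand-in for a real message queue (RabbitMQ, Redis, a database table...).
const queue = [];

function enqueue(record) {
  queue.push(record);
}

// What a background worker would do: pull records and run the same function.
async function drainQueue(callApi) {
  const done = [];
  while (queue.length > 0) {
    done.push(await syncRecord(queue.shift(), callApi));
  }
  return done;
}

// Hypothetical stand-in for an external API call.
const fakeApi = async (item) => `ok:${item}`;

// Today: called inline from the after save trigger in the web process.
const inlineResult = syncRecord({ id: 1, items: ['a', 'b'] }, fakeApi);

// Later: the trigger only enqueues, and a worker process does the work.
enqueue({ id: 2, items: ['c'] });
const workerResult = drainQueue(fakeApi);

Promise.all([inlineResult, workerResult]).then(([inline, worker]) => {
  console.log(inline, worker);
});
```

Moving to a background job then means changing only where syncRecord is called from, not the function itself.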
Is there a way to do this that is easier than implementing a message queue?
there are many options for distributed computing and background processing. i personally like RabbitMQ and the messaging patterns that it uses.
i would suggest giving it a try and seeing if it's something that can work well for you.
other options include redis with pub/sub libraries on top of it, using direct HTTP API calls to another web server, or just using a timer in your background process to check database tables on a given frequency and having the code run based on the data it finds.
p.s. you may find my RabbitMQ For Developers course of interest, if you are wanting to dig deeper into RMQ w/ node: http://rabbitmq4devs.com

How would you do heavy processing in Meteor?

I have a meteor app that is currently pulling data from twitter and is subsequently doing some manipulation and then inserting the documents into a collection. Let's say I run this process forever but don't want to block the event loop, is there any solution for this?
Note: I know node.js is single-threaded, and meteor doesn't support packages such as cluster because it requires sticky sessions. The only solution I can think of is adding a server dedicated to processing the data coming in from twitter and forwarding the requests to that server, but then I no longer have a case to use Meteor or node.
Help would be appreciated.

The truth here is that while javascript/node/meteor might be capable of doing the processing, you really don't want to do it there. Let me give some observations and a personal example:
Your app is all about the latency. If one of your requests takes long to complete because it is stuck in a tight loop it affects every other client connected to your server at that moment. Everybody's latency will increase if this happens. (This is the case for making sure you have no tight loops in your code)
Javascript (the language) has very unsophisticated support for numeric values. (You basically get a double). Things like float, long, int, byte are all meant to allow you to do tight loops as fast as possible. If you can represent a value in a primitive type most closely matched to it you will get a lot of improvement. (This is the case for extracting your data processing to a language suited for data processing)
I was prototyping an app that had to do some aggregations over data. I fired it in meteor using a setInterval callback and it took about 2 seconds to complete each time. On my own development machine I didn't really notice it (because meteor apps hide latency issues very effectively). As soon as I deployed it and started looking at the logs I realized that not a single user had latency on any request below 4 seconds. This is horrible client experience.
I extracted the number crunching to a small clojure app. All integration happens via records inserted and read from the mongo db and the clojure code has some timed events firing every couple of seconds doing exactly the same calcs as was previously done in meteor.
In clojure those calcs now take less than 100ms in total (compared to 2-4 seconds in meteor).
To come back to your question: It doesn't sound like your application has a user interface? If it does, you would do well to keep that in meteor because it's excellent for web UI's. But it's not the right technology for headless apps, which it sounds to me like you have.

You can use this.unblock() at the beginning of your method that does the heavy processing. Meteor will then start another fiber, go on with processing your method, and fire the callback when it is done. More info here: http://docs.meteor.com/#method_unblock

What is the best way to build a scaleable analytics back-end using Heroku? [closed]

I need to build a simple analytics back-end for capturing user behaviour. This will be captured via a Javascript snippet on a webpage, just like Google Analytics or Mixpanel data.
The system needs to capture close-to-realtime browser data (scrolling position of page, mouse position etc.) It will record the state of the users' page every 5 seconds. There are only three attributes on each measurement but they have to be taken frequently.
The data doesn't necessarily need to be sent every 5 seconds, it could be bussed up less frequently however it's imperative that I get all of the data while the user is on the page. i.e. I can't bus it once per minute and lose the last 59 seconds of data for someone who leaves after 119 seconds.
If possible I'd like to build a system that will scale for the foreseeable future which means it working for 10,000 sites, each with 100 concurrent visitors, i.e. 100,000 concurrent users each sending one event every 5 seconds.
I'm not worried about querying the data, that can be done using a separate system. I'm most interested in how to handle the capture of the data itself.
Requirements
Based on the budgeting above, the system needs to handle 20,000 events per second coming from a pool of 100,000 users.
I'd like to host this service on Heroku however while I've done a lot of work with Rails, I'm completely new to high throughput systems (other than knowing you don't process them using Rails).
Questions
Is there a commercial system that would be good for doing this (like Pusher but for data capture as well as distribution)?
Should I be looking to do this using HTTP requests or websockets?
Is node.js the right choice for this or just trendy?
If I were to choose a socket-based solution, how many sockets can a dyno on Heroku handle for each webserver?
What are the pertinent considerations for choosing between Mongo / Redis etc. for storage?
Is this the type of problem which actually requires two solutions - the first to get you to reasonable scale quickly and inexpensively and the second to take you past that scale on lower incremental cost but with more development effort required upfront?

My high level comment for you is to build your system following the 12 factor design, and then worry about scaling as the customers arrive. I'm thrilled with Node.js and the npm ecosystem, but I also think you could build a perfectly acceptable platform with Rails. If it took 3 dynos to support 100 K concurrent users with Node, and double that with Rails, you still might be better off with Rails, if your comfort with Ruby got you to market 3 months faster. Anyway, assuming you go with Node, here are my answers:
Here are some alternatives to Pusher that might work for you and a discussion of Pusher vs. Pubnub. Also see Ably.
Use socket.io. It's largely the standard, because it uses the best transport available and falls back from WebSockets to HTTP methods.
Node is a fantastic choice and is also trendy (see the module growth rate). I suspect you could make your system work fine in Node, Rails or several other frameworks.
A Heroku dyno should be able to support tens of thousands of concurrent connections, depending on how efficient you are with RAM. A server with 16 GB of RAM was able to support a million concurrent connections. Assuming you're RAM-limited, a Heroku dyno with 512 MB of RAM should be able to support ~30 K connections.
You likely want to pick two different systems, one for storage and processing of your data, and one for caching. Here's a great post about picking your core data platform from the creator of Instagram. For core data, I recommend Postgres (on Heroku) using the Sequelize ORM. But, Mongo with SOLR for search would probably work fine too. Note that Postgres 9.2 can be used as a NoSQL datastore if that's the way you want to go. For a caching system I highly recommend Redis.
No, I would try to avoid throwaway engineering. Instead, build something that works, and expect that every time you reach an order of magnitude more traffic, some piece of the system will break and need to be replaced. But, if you follow the 12 Factor principles, you should be in good shape to scale horizontally while you're investing in the replacement.
Good luck.
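For reference, the ~30 K figure in the answer above follows from simple proportional scaling of the quoted benchmark, under the same assumption the answer makes (that the dyno is RAM-limited and memory per connection stays constant):

```javascript
// Figures quoted in the answer: 1,000,000 connections on a 16 GB server.
const benchmarkConnections = 1000000;
const benchmarkRamMb = 16 * 1024;
const dynoRamMb = 512;

// Scale the connection count by the ratio of available RAM.
const dynoConnections = (benchmarkConnections * dynoRamMb) / benchmarkRamMb;

// Implied memory budget per connection, in KB.
const kbPerConnection = (benchmarkRamMb * 1024) / benchmarkConnections;

console.log(dynoConnections, kbPerConnection.toFixed(1));
```

This gives 31,250 connections, which rounds to the ~30 K quoted. It is an upper-bound style estimate, since the application's own memory use also comes out of the 512 MB.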

There are many services for sockets, but Pusher and Pubnub seem to be the market leaders in this space. Whatever you do, don't host your own (like socket.io), because heroku times out requests longer than 30 seconds, including websockets. So a self-hosted socket would definitely be out of the question unless you plan on closing and re-opening the socket every few seconds.
If you were to use a socket service like Pusher, then you will need to implement a http endpoint for the service to send you the data anyway. So I would just cut the middle man out and go with a direct http request. Granted you need to collect constant user interactions, but that can all be recorded on the JavaScript client and sent back to the app periodically through CORS XHR or a tracking image.
node is a great choice, it's light, pretty easy to set up and the npm libraries available will have everything you need to get you started. Rails can be pretty swift too, especially if you cut out the things you don't need. There is a great railscast on this subject. The important thing is to keep it as simple as possible. Maybe split it into two applications; one for collecting data, the other for analysing/processing it. This way you could collect the data in node cause it's fast and analyse/process it in rails cause it's easy.
As I mentioned in 1. sockets just aren't going to work in heroku and even if you used pusher you're still going to have to support the same number of http requests because when pusher receives the data it's going to send it straight on to you. As for how many dynos you will need, this is something that can be easily tested but not something I can estimate. It will depend entirely on the efficiency of the code collecting the data. A simple Apache AB test with the load and concurrency you are expecting will give you a good indication of what you will need. Node comes with its own concurrency but if you were to use rails to collect the data then use unicorn or puma as your server because they support concurrency. Also try different configurations when Apache AB testing; heroku now provide 2x dynos which are 1024mb instead of 512, which will allow you more concurrency.
This stackoverflow thread suggests redis is faster and faster is what you're going to want for collecting the data. Though after collecting it, you'll probably want to process it and store it in more than a key-value store. Mongo is a good option for that but i would go with a graph database like neo4j because of the intricate connections analytics have.
If you're entering new ground here, then you are not going to get it right first time; you will find yourself iterating over it to get the best performance and the most accurate data. Eventually you'll probably delete it and start again with a new architecture and the cycle will continue. Keeping the data collection and the analysis separate means you can focus on getting each bit right separately.
A few additional points I would like to mention: use a CDN for distribution of the JavaScript client, or better yet, provide the full JS to serve from the page. Either way, load fast and load asynchronously. It sounds like a fun project. Good luck!
EDIT In an alternate universe, where you do not have to use heroku, websockets would be an awesome solution.
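The "record on the JavaScript client and send back periodically" idea from point 2 of the answer above can be sketched like this. The EventBuffer class, its send callback, and the batch size are hypothetical; in a browser the callback would do the actual HTTP request, and a final flush would be wired to a page-leave event (for instance with navigator.sendBeacon) so the tail of the session is not lost. Here the callback just collects batches so the sketch runs under Node.

```javascript
// Buffers measurements and hands them to `send` in batches.
class EventBuffer {
  constructor(send, maxEvents = 6) {
    this.send = send;
    this.maxEvents = maxEvents;
    this.events = [];
  }

  // Called every 5 seconds with the current page state.
  record(event) {
    this.events.push(event);
    if (this.events.length >= this.maxEvents) this.flush();
  }

  // Send whatever is buffered; also called when the user leaves the page.
  flush() {
    if (this.events.length === 0) return;
    const batch = this.events;
    this.events = [];
    this.send(batch);
  }
}

// Stand-in for the network: just remember what would have been sent.
const sentBatches = [];
const buffer = new EventBuffer((batch) => sentBatches.push(batch), 3);

// Four measurements, 5 seconds apart, then the user leaves.
for (let t = 0; t < 4; t += 1) {
  buffer.record({ secondsOnPage: t * 5, scrollY: t * 100 });
}
buffer.flush();

console.log(sentBatches.map((batch) => batch.length));
```

Batching this way cuts the request rate by the batch size while the explicit flush on leave covers the requirement of not losing the last measurements.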
