Express and parser - node.js

I want to build a website where a user can press a button and Node.js will start parsing some site (using the PhantomJS headless browser) and then return the result to the user. I'm planning to use a static page with socket support, so the user gets an instant response as soon as the parsing process finishes. Parsing a page with PhantomJS is fairly slow, so it will take some time to run. My question is:
Is it normal to run the parser from the same Node.js process (Express)? What about performance when a bunch of people press the button at the same time; should I be worried about that?
Or should I separate the two processes (parser and Express) and somehow make them communicate with each other?

Maybe you would want to use child processes in Node; see the Child Process documentation.
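A minimal sketch of that approach, assuming the phantomjs binary is on the PATH and a hypothetical Phantom script scrape.js that prints its result to stdout:

```js
const express = require('express');
const { spawn } = require('child_process');

const app = express();
app.use(express.json());

app.post('/parse', (req, res) => {
  // one PhantomJS child per request; scrape.js is a hypothetical script
  const phantom = spawn('phantomjs', ['scrape.js', req.body.url]);
  let output = '';
  phantom.stdout.on('data', (chunk) => { output += chunk; });
  phantom.on('close', (code) => {
    // with sockets you would emit this result to the client instead of
    // holding the HTTP request open for the whole parse
    res.json({ ok: code === 0, result: output });
  });
});

app.listen(3000);
```

Note that N simultaneous button presses spawn N PhantomJS processes, so in practice you would cap this with a simple queue or worker pool.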

Related

Concurrency handling in Express.js

Because of some issues, such as having SSR, SSG, and CSR beside each other, I decided to create my own SSR for React with Express. I'm using Redux and Saga, and I have several API calls to generate the data before rendering it.
So I had to use several promises in my server-side renderer, such as waiting for Redux to finish all the API calls, or waiting for styles and scripts. I'm also using react-ssr-prepass, which navigates through all my components (to dispatch the actions that are required for SSR).
So I have a lot of thread-blocking work in my project.
To handle concurrency I started using node cluster, so I'll have several workers on my server and it will increase the concurrency capacity. But it's not the best solution, because under heavy load even clustering won't be able to respond to all of the requests.
So I started to think about worker threads or child processes in Node.js: I could make an instance of my server-side renderer on each request and do everything in the background, so concurrent requests won't wait for each other to be done.
But the issue is that in a child process or worker thread I can't use "import", since it's ES6.
So I have two questions:
First of all, is there any way to use ES6 in the child process? (I tried babel-esm-plugin but it doesn't support webpack 5.)
Second, is there any better idea than using worker threads or child processes to increase the concurrency capacity?
I found the solution to my first challenge: instead of running my renderer directly with the child process, I had to build it first. I used webpack to produce a CommonJS (CJS) output of the renderer, then ran that output in the child process.
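A sketch of that build step, assuming the renderer's entry point lives at src/ssr-renderer.js (hypothetical); output.library.type: 'commonjs2' is webpack 5's way of emitting CJS:

```js
// webpack.config.js: bundle the SSR renderer to CommonJS so the
// child process can require() it without ESM support
module.exports = {
  target: 'node',
  mode: 'production',
  entry: './src/ssr-renderer.js',
  output: {
    path: __dirname + '/build',
    filename: 'renderer.cjs',
    library: { type: 'commonjs2' },
  },
};
```

The bundle can then be started with child_process.fork('./build/renderer.cjs').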
To increase performance even more, I used a combination of SSR and SSG: on each request I check whether a file mapped to the route exists on the server. If it doesn't, I use the SSR renderer output to create the file and serve the response to the user; on the next request, since the cached file now exists, I use it instead of rendering the result again.
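A minimal sketch of that check, where CACHE_DIR and the renderRoute() wrapper around the child-process renderer are both hypothetical:

```js
const fs = require('fs');
const path = require('path');

app.get('*', async (req, res) => {
  const cached = path.join(CACHE_DIR, encodeURIComponent(req.path) + '.html');
  if (fs.existsSync(cached)) {
    return res.sendFile(cached);            // SSG path: serve the cached render
  }
  const html = await renderRoute(req.url);  // SSR path: render in the background
  fs.writeFileSync(cached, html);           // cache it for subsequent requests
  res.send(html);
});
```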
Finally, I set a cron job on the server to clear the cache every 10 minutes.
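One way to do the periodic clearing from inside Node itself, assuming the node-cron package (the same schedule also works as a plain crontab entry):

```js
const cron = require('node-cron');
const fs = require('fs');

// every 10 minutes, drop the cached renders so stale pages expire
cron.schedule('*/10 * * * *', () => {
  fs.rmSync(CACHE_DIR, { recursive: true, force: true }); // CACHE_DIR as above
  fs.mkdirSync(CACHE_DIR, { recursive: true });
});
```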

Managing node child processes generated with spawn

I have built a service that takes a screenshot of a URL. I built it using Node and PhantomJS.
My Node app works as follows:
A simple app that receives an API request indicating which URL to load and take a screenshot of
The app spawns a child Phantom process which takes the screenshot and saves it to a temp file on the server
The main process uploads the image to S3
The main process fires an API request back to the initial website to say the image is uploaded with the image’s URL
The temporary file is deleted
This works fine for a single request, no problem. However, when I throw multiple, consecutive requests at this service I get strange results. Each request received by the service spawns a PhantomJS process and a screenshot is taken, but the data in the API request sent back to the main website is often not correct; regularly the system will send back the image URL from a screenshot created by another child process.
My hunch is that when a spawned process exits, it sends the API request to the original website with whatever data it has just received, rather than the data for the process that just completed.
I feel like this should be an easy thing to manage, but I haven't quite found the right approach. Does anyone have any tips/tricks for managing child processes created with spawn, especially when they exit? I would like to perform another task based on the exited process.
My initial thought was to keep an array of the child process PIDs along with the related data I had, and do a lookup in this array when a child process exits. This didn't seem to fix the problem though; I still had incorrect data being sent back to the main website. I do wonder if I implemented it correctly, though: I defined the array inside each API request received by the service, so thinking about it, it would have been recreated on each request… I think.
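For reference, a sketch of the PID-lookup idea described above: the map has to live at module scope (defining it inside the request handler recreates it on every request, as suspected), and the per-request data can simply be captured in the exit handler's closure. screenshot.js and the callback step are hypothetical:

```js
const { spawn } = require('child_process');

const jobs = new Map(); // pid -> this request's data, shared across requests

function takeScreenshot(url, callbackUrl) {
  const child = spawn('phantomjs', ['screenshot.js', url]);
  jobs.set(child.pid, { url, callbackUrl });
  child.on('exit', (code) => {
    const job = jobs.get(child.pid); // always this child's own data
    jobs.delete(child.pid);
    // upload the temp file to S3 and POST back to job.callbackUrl here,
    // using `job` rather than any shared mutable state
  });
}
```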
Another thought was that I should be using fork instead of spawn. I think this would allow me to communicate with the child process, but as far as I can see I can only use it to run a JS file, not an executable like Phantom. Is this correct?
I feel a bit like I’m reinventing the wheel at this point but any tips would be much appreciated, thank you.

Is there a way to run a node task in a child process?

I have a node server, which needs to:
Serve the web pages
Keep querying an external REST API, saving data to the database, and sending data to clients for certain updates from the REST API.
Task 1 is a normal Node task, but I don't know how to implement task 2. It won't expose any interface to the outside; it's more like a background task.
Can anybody suggest an approach? Thanks.
To make a second node.js app that runs at the same time as your first one, you can just create another node.js app and then run it from your first one using child_process.spawn(). It can regularly query the external REST API and update the database as needed.
The part about "send data to clients for certain updates from REST API" is not so clear; it's hard to tell exactly what you're trying to do there.
If you're using socket.io to send data to connected browsers, then the browsers have to be connected to your web server which I presume is your first node.js process. To have the second node.js process cause data to be sent through the socket.io connections in the first node.js process, you need some interprocess way to communicate. You can use stdout and stdin via child_process.spawn(), you can use some feature in your database or any of several other IPC methods.
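A sketch of the stdout variant, assuming the child script worker.js (hypothetical) prints one JSON object per line and io is the socket.io server living in the first process:

```js
const { spawn } = require('child_process');
const readline = require('readline');

const worker = spawn('node', ['worker.js']);
const lines = readline.createInterface({ input: worker.stdout });

lines.on('line', (line) => {
  const update = JSON.parse(line); // one JSON message per stdout line
  io.emit('update', update);       // forward to connected browsers
});
```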
Because querying a REST API and updating a database are both asynchronous operations, they don't take much of the CPU of a node.js process. As such, you don't really have to do these in another node.js process. You could just have a setInterval() in your main node.js process, query the API every once in a while, update the database when results are received and then you can directly access the socket.io connections to send data to clients without having to use a separate process and some sort of IPC mechanism.
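A sketch of that single-process version; the endpoint and the saveToDatabase() helper are hypothetical, and the global fetch assumes Node 18+:

```js
const POLL_MS = 60 * 1000;

setInterval(async () => {
  const res = await fetch('https://api.example.com/updates'); // hypothetical API
  const data = await res.json();
  await saveToDatabase(data); // hypothetical DB helper
  io.emit('update', data);    // socket.io server in the same process
}, POLL_MS);
```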
Task 1:
Express is a good way to accomplish this task.
You can explore:
http://expressjs.com/
Task 2:
If you are done with Express.js, then you can write your logic within the Express framework.
This task can then be done with the node module forever. It's a simple tool that runs your background scripts forever. You can use forever to run scripts continuously (whether they are written in Node.js or not).
Have a look:
https://github.com/foreverjs/forever
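Typical usage looks like this (poller.js standing in for whatever background script you write):

```sh
forever start poller.js   # run the script, restarting it if it crashes
forever list              # show the scripts forever is managing
forever stop poller.js
```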

Is it possible to pause cherrypy server in order to update static files / db without stopping it?

I have an internal CherryPy server that serves static files and answers XML-RPC requests. All works fine, but 1-2 times a day I need to update these static files and the database. Of course I can just stop the server, run the update, and start the server again, but this is not very clean: all the other code that communicates with the server via XML-RPC will see disconnects, and users will see "can't connect" in their browsers. It also adds complexity, since I need some external start / stop / update code, while all the updates could perfectly well be done within the CherryPy server itself.
Is it possible to somehow "pause" CherryPy programmatically, so that it serves a static "busy" page while I update the data? Otherwise I'd fear that someone is downloading file A from the server right now, and I then update file B which he wants next, so he gets mismatched file versions.
I have tried to implement this programmatically, but there is a problem. CherryPy is multithreaded (and this is good), so even if I introduce a global "busy" flag, I need some way to wait for all threads to complete their existing tasks before I can update the data. I can't find such a way :(.
CherryPy's engine controls such things. When you call engine.stop(), the HTTP server shuts down, but first it waits for existing requests to complete. This mode is designed to allow for debugging to occur while not serving requests. See this state machine diagram. Note that stop is not the same as exit, which really stops everything and exits the process.
You could call stop, then manually start up an HTTP server again with a different app to serve a "busy" page, then make your edits, then stop the interim server, then call engine.start() and engine.block() again and be on your way. Note that this will mean a certain amount of downtime as the current requests finish and the new HTTP server takes over listening on the socket, but that will guarantee all current requests are done before you start making changes.
Alternately, you could write a bit of WSGI middleware which usually passes requests through unchanged, but when tripped returns a "busy" page. Current requests would still be allowed to complete, so there might be a period in which you're not sure if your edits will affect requests that are in progress. How to write WSGI middleware doesn't fit very well in an SO reply; search for resources like this one. When you're ready to hook it up in CherryPy, see http://docs.cherrypy.org/dev/concepts/config.html#wsgi

NodeJS - Child node process?

I'm using NodeJS to run a socket server (using socket.io). When a client connects, I open and run a module which does a bunch of stuff. Even though I am careful to try and catch as much as possible, when this module throws an error, it obviously takes down the entire socket server with it.
Is there a way I can separate the two, so that if a connected client's module script fails, it doesn't take down the entire server?
I'm assuming this is what child processes are for, but the documentation doesn't mention starting other Node instances.
I'd obviously need to kill the process if the client disconnected, too.
I'm assuming these modules you're talking about are JS code. If so, you might want to try the vm module. This lets you run code in a separate context, and also gives you the ability to do a try / catch around execution of the specific code.
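A minimal sketch, assuming the module's source code is available as a string (clientModuleSource is hypothetical):

```js
const vm = require('vm');

const sandbox = { console, result: null };
try {
  // run the client's module in its own context with a time limit
  vm.runInNewContext(clientModuleSource, sandbox, { timeout: 1000 });
} catch (err) {
  // a throw inside the sandboxed code lands here instead of
  // crashing the whole socket server
  console.error('client module failed:', err);
}
```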
You can run node as a separate process using spawn and watch the data go by, then watch the stderr/stdout/exit events to track progress. kill can then be used to terminate the process if the client disconnects. You're going to have to map clients to spawned processes, though, so that a client's disconnect event triggers the close of the right process.
Finally, the uncaughtException event can be used as a catch-all for any missed exceptions, so that the server doesn't get completely killed (signals are a bit of an exception, of course).
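Putting those pieces together, a rough sketch where io is the socket.io server and client-module.js is a hypothetical script:

```js
const { spawn } = require('child_process');

const procs = new Map(); // socket.id -> ChildProcess

io.on('connection', (socket) => {
  const child = spawn('node', ['client-module.js']);
  procs.set(socket.id, child);

  child.stdout.on('data', (d) => socket.emit('data', d.toString()));
  child.stderr.on('data', (d) => console.error(`[${socket.id}]`, d.toString()));
  child.on('exit', () => procs.delete(socket.id));

  // kill the child when its client goes away
  socket.on('disconnect', () => {
    const p = procs.get(socket.id);
    if (p) p.kill();
  });
});

// catch-all so one missed exception doesn't take the server down
process.on('uncaughtException', (err) => console.error('uncaught:', err));
```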
As the other poster noted, you could leverage the 'vm' module, but as you might be able to tell from the rest of the response, doing so adds significant complexity.
Also, from the 'vm' doc:
Note that running untrusted code is a tricky business requiring great care. To prevent accidental global variable leakage, vm.runInNewContext is quite useful, but safely running untrusted code requires a separate process.
While I'm sure you could run a new nodejs instance in a child process, the best practice here is to understand where your application can and will fail, and then program defensively to handle all possible error conditions.
If some part of your code "take(s) down the entire ... server", then you really need to understand why this occurred and solve that problem, rather than relying on another process to shield you from the work required to design and build a production-quality service.
