Should I use Puppeteer as part of my backend? - node.js

Let's say I want to create a front-end where multiple users can send a request to a server that scrapes some links off a website. Would Puppeteer be able to process it concurrently, or at least fast enough, or should I consider a different method?
Also, is there any possible way to load a page in a headless browser instance (with JS enabled) on a mobile device? How could I go about coding my own headless browser in JavaScript, if that's possible?

You can always deploy your node.js instance via PM2 and spawn multiple processes concurrently to handle the incoming load. You should restrict the total number of processes to the number of cores available on your box, but otherwise this would work fine.
Whether or not this could handle your load depends on your server, the workload, and the expected throughput. You'd need to do some load testing to make that determination for your system.
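As a rough sketch of that setup (the app name and `server.js` here are hypothetical, standing in for the scraper described in the question), a PM2 ecosystem file can cap the worker count at the number of cores:

```js
// ecosystem.config.js - a minimal PM2 sketch; "scraper" and server.js
// are hypothetical names for the app described above.
module.exports = {
  apps: [{
    name: 'scraper',
    script: './server.js',
    exec_mode: 'cluster', // PM2 load-balances incoming connections across workers
    instances: 'max',     // spawn one worker per available CPU core
  }],
};
```

Started with `pm2 start ecosystem.config.js`, PM2 forks the workers and restarts any that crash; the load testing mentioned above still decides whether that capacity is enough.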

Related

Is there any way to limit the concurrent request thread of Selenium Chromedriver?

When 2 tests are running in Chrome, I have observed that too many Google Chrome (32 bit) processes are running in the Task Manager. Is this the correct behavior of ChromeDriver?
When multiple automated tests are executed through Google Chrome, you may observe potentially dozens of Google Chrome processes in the Windows Task Manager's Processes tab.
As per the article SOLVED: Why Google Chrome Has So Many Processes, for a better user experience Google Chrome initiates a lot of background processes for each tab that has been opened by your automated tests. Google tries to keep the browser stable by separating each web page into as many processes as it deems fit, so that if one process fails on a page, that process can be terminated or refreshed without needing to kill or refresh the entire page.
However, from 2018 onwards Google Chrome was redesigned to create a new process for each of the following entities:
Each tab
The HTML/ASP text on the page
Each plugin that is loaded
Each app that is loaded
Each frame within the page
In the Chromium Blog post Multi-process Architecture it is mentioned:
Google Chrome takes advantage of these properties and puts web apps and plug-ins in separate processes from the browser itself. This means that a rendering engine crash in one web app won't affect the browser or other web apps. It means the OS can run web apps in parallel to increase their responsiveness, and it means the browser itself won't lock up if a particular web app or plug-in stops responding. It also means we can run the rendering engine processes in a restrictive sandbox that helps limit the damage if an exploit does occur.
In conclusion, the many processes you are seeing are pretty much in line with the current implementation of google-chrome.
Outro
You can find a relevant discussion in How to quit all the Firefox processes which gets initiated through GeckoDriver and Selenium using Python

When does Google Chrome spawn a new process?

When I look at Google Chrome's task manager, I can see that some tabs run under individual processes while a group of tabs runs under a single process. Out of curiosity, I searched to find out why it runs as multiple processes instead of multiple threads. One thing brought to my attention is that when it runs as a single process and spawns multiple threads, there could be a few limitations/drawbacks, such as:
1) A limit on the number of threads that can be created
2) When a single tab becomes unresponsive, the entire application would become useless and we would have to quit Chrome and restart it because of one misbehaving site.
A few people mentioned that Chrome uses a single process per domain, but that doesn't seem to be true here.
I'm still not clear on:
1) When does Chrome decide to spawn a new process?
2) What are the other advantages of running individual tabs under separate processes?
3) How are cookies shared between tabs when each of them runs under a different process? Does this happen via inter-process communication? If yes, will it be too costly? And will it impact the other tabs' (i.e. the web pages') performance?
After asking this question, I came across this article (Multi Process Architecture of Chromium) and it answered my question (1).
When does Chrome decide to spawn a new process?
Once Google Chrome has created its browser process, it will generally create one renderer process for each instance of a web site you visit. This approach aims to keep pages from different web sites isolated from each other.
You can think of this as using a different process for each tab in the browser, but allowing two tabs to share a process if they are related to each other and are showing the same site. For example, if one tab opens another tab using JavaScript, or if you open a link to the same site in a new tab, the tabs will share a renderer process. This lets the pages in these tabs communicate via JavaScript and share cached objects. Conversely, if you type the URL of a different site into the location bar of a tab, they will swap in a new renderer process for the tab.
They place a limit on the number of renderer processes that they create (20 in most cases). Once they hit this limit, they start reusing existing renderer processes for new tabs.
What are the advantages of running tabs under different processes?
Google Chrome takes advantage of these properties and puts web apps and plug-ins in separate processes from the browser itself. This means that a rendering engine crash in one web app won't affect the browser or other web apps. It means the OS can run web apps in parallel to increase their responsiveness, and it means the browser itself won't lock up if a particular web app or plug-in stops responding. It also means they can run the rendering engine processes in a restrictive sandbox that helps limit the damage if an exploit does occur.
Interestingly, using multiple processes means Google Chrome can have its own Task Manager (shown below), which you can get to by right clicking on the browser's title bar. This Task Manager lets you track resource usage for each web app and plug-in, rather than for the entire browser. It also lets you kill any web apps or plug-ins that have stopped responding, without having to restart the entire browser.

PhantomJS and Node.js very slow

I have created a REST service with Node.js where, to build the response, it visits a certain page and scrapes some data using the PhantomJS module for Node.js.
The whole process is very slow (I had to move to another server because some connections were automatically timed out after 30 seconds).
Another problem (as I understand it) is that the server is single-threaded, so it takes even longer to respond if it is already processing another request.
My questions are:
Is there a way to speed up the whole process?
Is there a way to make Node.js run multithreaded?
Most importantly, would a Java implementation of the same service (with Selenium) be faster or allow multithreading? Thanks

Fewer resources for steady client connections, why?

I have heard that node.js is very suitable for applications where a persistent connection from the browser to the server is needed, using the "long-polling" technique, which allows updates to be sent to the user in real time without needing a lot of server resources. A more traditional server model would need a thread for every single user.
My question is: what is done instead, and how are the requests served differently?
Why doesn't it consume as many resources?
Node.js is event-driven. The node script starts and then loops continuously, waiting for events to be fired, until it is stopped. Once it is running, the overhead associated with loading has already been paid.
Compare this to a more traditional setup such as C#.NET or PHP, where a request causes the server to load and run the script and its dependencies. The script then does its task (often serving a web page) and shuts down. When another page is requested, the whole process starts again.
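To make the event-driven model concrete, here is a minimal long-polling sketch (my own illustration, not part of the original answer; the routes and port are arbitrary). Each waiting client costs only a parked response object in an array, not a dedicated OS thread:

```js
// A single-process long-polling sketch: clients poll /poll and wait;
// a request to /publish flushes an update to every parked client.
const http = require('http');

const waiting = []; // response objects parked until an update arrives

http.createServer((req, res) => {
  if (req.url === '/poll') {
    waiting.push(res); // park the response; the event loop stays free
  } else if (req.url === '/publish') {
    // Flush the update to every parked client in one pass.
    while (waiting.length) {
      waiting.pop().end('update at ' + new Date().toISOString());
    }
    res.end('published');
  } else {
    res.end('ok');
  }
}).listen(3000);
```

Thousands of browsers can sit on `/poll` simultaneously while the single process remains idle, which is the resource saving the question asks about.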

How do I set up routing to multiple instances of a node.js server on one url?

I have a simple node.js server app built that I'm hoping to test out soon. It's single-threaded and works fine without any child processing whatsoever. My problem is that the server box has multiple cores, and the simplest way I can think of to utilize them is by running multiple instances of the server app. However, this would require them all to be on the same domain name, so some sort of request routing is required. I personally don't have much experience with servers in general and don't know if this is a task for node.js to perform or for some other, less complicated program (or a more complicated one). If there is a node.js mechanism to solve this (for example, one running instance sending incoming requests to the next instance), how would I detect when this needs to happen? Conversely, if I use some other program, how will it manage to detect when it needs to start talking to a new instance?
Node.js includes built-in support for managing a cluster of instances of your application to take advantage of multiple cores via the cluster module.
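A minimal sketch of that pattern (the port number is an arbitrary choice for illustration): the primary process forks one worker per core, and the workers all listen on the same port, with incoming connections distributed among them.

```js
// cluster-sketch.js: fork one worker per core; workers share port 8000.
const cluster = require('cluster');
const http = require('http');
const os = require('os');

if (cluster.isMaster) {
  // The master process only forks and supervises workers.
  for (let i = 0; i < os.cpus().length; i++) {
    cluster.fork();
  }
  cluster.on('exit', (worker) => {
    console.log(`worker ${worker.process.pid} died; forking a replacement`);
    cluster.fork();
  });
} else {
  // Each worker runs its own server; connections are distributed for you,
  // so no separate routing program is needed.
  http.createServer((req, res) => {
    res.end(`handled by pid ${process.pid}\n`);
  }).listen(8000);
}
```

This answers the routing concern in the question directly: the cluster module handles handing connections to instances, so you never have to detect when to "start talking to a new instance" yourself.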
