Chromium browser instance as a service - node.js

I'm working on web scraping application that uses node.js, puppeteer and Chromium. When I run the script it launches Chromium browser and executes the script using puppeteer package for scraping a web page.
To improve the performance I want to eliminate the launching of the Chromium browser instance.
I thinking of Chromium browser instance as a service so that I can connect to it and execute the script. Just like a database pooling connection.

I think you are looking for the "browserless/chrome" docker. Take a look at it.
To eliminate the launching of the Chromium instance, I use it like this:
docker run -e "PREBOOT_CHROME=true" -e "EXIT_ON_HEALTH_FAILURE=true" -e "CONNECTION_TIMEOUT=3600000" browserless/chrome
The flag that you are looking for is the "PREBOOT_CHROME=true". The other ones works for my use case, but maybe you don't need it. Take a look at the docs.
Regards.

Related

using puppeteer or puppeteer-core in cli script

I need to write a node cli script that will run some tests on the forms of a website I'm working on. I want to use puppeteer but I'm a bit confused about the difference between the full version and puppeteer-core. What is the best choice if I want to run the tests from a cli script without opening the browser and only simulating it?
To put it simply, puppeteer-core is for when you already have a browser and don't want to download a whole Chromium which the main puppeteer package does, automatically.
It is better to go with the full puppeteer since this way you will be getting the "batteries included, tested and are guaranteed to work" experience.
Official documentation offers a detailed comparison.

Running Selenium in Azure Function

I want to periodically scrape a website with Selenium and a headless PhantomJS driver.
My boss wants me to run it "in the cloud" for reasons, and a serverless Azure Function looks like it could be a useful way to do it, instead of having to run a VM or something.
I've got my VS.net code to do the scraping mostly done, but I just realized that I'm not sure if I can actually deploy it as a function, since it looks like it wants me to include the phantomjs.exe in my project in order to run, which may not work in a Azure Function...
Can I do what I wanted to do, or should I explore other options?
PhantomJS is a known unsupported framework in App Service, which is the same environment Azure Functions runs on.
You can find more information here: https://github.com/projectkudu/kudu/wiki/Azure-Web-App-sandbox#unsupported-frameworks

Headless browser automation app via electron js app

I would like to create an electron app that can do some web automation based on user input into a GUI. In my research it seems my two best bets are Phantom and Selenium+Chromedriver.
The thing I'd like to do is have an app that someone else could download and run without any additional setup. It seems with Chromedriver and Phantom that I'd need to have others download and add these things to their PATH. In order to get things functioning.
Is there a way around this? Or is there another approach I should be taking? Any advice is appreciated. Thanks!
First off, you should have a look at Nightmare.js which is like PhantomJS in many ways, but uses Electron under the hood (and that's good, because Chromium in Electron is very fresh compared to PhantomJS engine).
If you still want to use PhantomJS in Electron that's quite fine too. You may bundle it with your application or install npm module as a dependency and require that in your script. The main thing is - PhantomJS will be installed together with your app and you know the path to that folder.

Native messaging of Chrome extension

I am running an example application that uses native messaging on OS X.
After downloading an example of chrome, I registered an extension and located a native messaging host file at /Library/Google/Chrome/NativeMessagingHosts/com.google.chrome.example.echo.json.
According to the guide, Chrome starts a native messaging host in a separate process.
But I cannot look for that process.
Is there a way for chrome to run host process?
What do I miss?
The sample app provided by Chrome is a packaged app. Once you install it you have to launch it from your browser. And then do ps ax and look for the sample python-script process.
To run the app, start chrome and go to the following URL:
chrome-extension://knldjmfmopnpolahpmmgbagdohdnhkik/main.html
If you've done everything right, it should start the python script and you should see a window.
if its not working, have a look at chrome's console. you could also enable additional chrome logs - see here: http://www.chromium.org/for-testers/enable-logging

Selenium on shared gui-less host

I need to run Selenium (or another webscraping tool that can handle javascript) on a remote linux host (Webfaction). I am using Python.
Is this possible? The server is gui-less so I can't run browsers. Or can I, if I use PyVirtualDisplay?
What about running Selenium with HtmlUnit?
I have tried using Selenium with Selenium/PyVirtualDisplay/ChromeDriver, but keep getting various error messages. So I'm wondering if this is even possible before I continue to debug something impossible.
If you need to handle JavaScript Selenium/Webdriver seems to be a good solution.
If you need to run headless, GhostDriver (instead of ChromeDriver) is an excellent alternative. It is based on PhantomJS, a headless browser based itself on Webkit. It has full JS-support.

Resources