I am looking for a command-line way to fetch a web page and execute its associated JavaScript code; in other words, to call a headless browser via the command line.
I can't use wget, because it does not load and execute the associated JavaScript:
wget --load-cookies cookies.txt -O /dev/null https://example.com/update?run=1
Use case: we have web pages that read Elasticsearch indexes, do some data manipulation, and write updates back to Elasticsearch. We'd like to run the update hourly via a cron job. We don't need to capture anything, e.g. no PNG capture, no HTML capture. We simply need to load the web page and execute its JavaScript from a cron job, ideally with something like run-headless https://example.com/update. The OS is CentOS 7.
I searched Stack Overflow and did not find an answer satisfying my needs; Selenium etc. seem like overkill:
Play Framework: Run single selenium test on the command line (headless browser)
Python Headless Browser for GAE
How to run headless browser inside docker? - this looks interesting, but Docker seems like overkill
After some research I found a solution using Puppeteer to drive headless Chrome. Ideally I wanted a single command like run-headless https://example.com/update, but a login was required, hence driving the headless browser from a Puppeteer script.
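As an aside, for pages that do not sit behind a login, headless Chrome alone can already do this from the command line. A minimal sketch using standard Chrome flags (the URL is the placeholder from above): --dump-dom serializes the DOM after the page has loaded and run its JavaScript, and redirecting it to /dev/null discards the output, since only the server-side effects matter here.
$ google-chrome --headless --disable-gpu --dump-dom 'https://example.com/update?run=1' > /dev/null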
Installation steps for CentOS 7.6:
1. Install chrome
# cd
# mkdir install
# cd install/
# wget http://mirror.centos.org/centos/7/os/x86_64/Packages/vulkan-filesystem-1.1.97.0-1.el7.noarch.rpm
# yum localinstall vulkan-filesystem-1.1.97.0-1.el7.noarch.rpm
# wget http://mirror.centos.org/centos/7/os/x86_64/Packages/vulkan-1.1.97.0-1.el7.x86_64.rpm
# yum localinstall vulkan-1.1.97.0-1.el7.x86_64.rpm
# wget http://mirror.centos.org/centos/7/os/x86_64/Packages/liberation-fonts-1.07.2-16.el7.noarch.rpm
# yum localinstall liberation-fonts-1.07.2-16.el7.noarch.rpm
# vi /etc/yum.repos.d/google-chrome.repo
# cat /etc/yum.repos.d/google-chrome.repo
[google-chrome]
name=google-chrome
baseurl=http://dl.google.com/linux/chrome/rpm/stable/x86_64
enabled=1
gpgcheck=1
gpgkey=https://dl.google.com/linux/linux_signing_key.pub
# yum install google-chrome-stable
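A quick smoke test of the Chrome install before wiring anything else up (my addition, not part of the original recipe). Run it as a regular user, since Chrome refuses to start as root without --no-sandbox:
$ google-chrome --version
$ google-chrome --headless --disable-gpu --dump-dom about:blank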
2. Install node.js
# curl -sL https://rpm.nodesource.com/setup_14.x | sudo bash -
# yum install nodejs
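Another quick sanity check (again my addition); both commands should print version numbers:
$ node --version
$ npm --version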
3. Patch /etc/sysctl.conf
This was needed to run puppeteer without disabling the sandbox:
# echo "user.max_user_namespaces=15000" >> /etc/sysctl.conf
# reboot
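To confirm the new limit is active after the reboot (as far as I know, sysctl -p would also apply it without one):
# sysctl user.max_user_namespaces
user.max_user_namespaces = 15000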
4. Create run-hourly.js puppeteer script
This node script has to run as a regular user, not root:
$ cd /path/to/script
$ npm install --save puppeteer
$ npm install --save pending-xhr-puppeteer
$ mkdir userDataDir
$ vi run-hourly.js # (content below)
$ node run-hourly.js
File content of run-hourly.js script:
const config = {
    userDataDir: __dirname + '/userDataDir',
    login: {
        url: 'https://www.example.com/login/',
        username: 'foobar',
        password: 'secret',
    },
    pages: [{
        url: 'https://www.example.com/update/hourly',
        pdfFile: __dirname + '/page.pdf'
    }]
};

const puppeteer = require('puppeteer');
const { PendingXHR } = require('pending-xhr-puppeteer');

(async () => {
    // initialize headless browser
    const browser = await puppeteer.launch({
        headless: true,                 // run headless
        dumpio: true,                   // capture console log to stdout
        userDataDir: config.userDataDir // custom user data dir
    });
    const page = await browser.newPage();
    const pendingXHR = new PendingXHR(page);

    // login
    await page.goto(config.login.url, {waitUntil: 'load'});
    await page.type('#loginusername', config.login.username);
    await page.type('#password', config.login.password);
    await page.click('#signin');
    await page.waitForNavigation();

    // load pages of interest
    await Promise.all(config.pages.map(async (pageCfg) => {
        await page.goto(pageCfg.url, {waitUntil: 'networkidle0'}); // wait for page load
        await page.setRequestInterception(true);  // intercept requests for next line
        await pendingXHR.waitForAllXhrFinished(); // wait for all XHRs to finish
        await page.pdf({path: pageCfg.pdfFile});  // generate PDF from rendered page
    }));

    await browser.close();
})();
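A note on the design: userDataDir gives the headless browser a persistent profile, so cookies and local storage survive between cron runs. The script still logs in on every run, but the site sees a consistent browser profile rather than a brand-new one each hour.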
5. Add hourly job to cron
Install the cron job as the same user that owns the script:
$ crontab -l
$ crontab -e
25 * * * * cd /path/to/script && node run-hourly.js > hourly.log 2>&1
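Optionally (my suggestion, not part of the original setup), if a run ever takes longer than an hour, flock can keep cron jobs from overlapping:
25 * * * * cd /path/to/script && flock -n /tmp/run-hourly.lock node run-hourly.js > hourly.log 2>&1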
Related
I'm trying to test my Chrome extension using GitHub Actions.
This is how I set up the tests and the fixtures:
extensionContext: async ({ exe_path }, use) => {
    const pathToExtension = path.join(__dirname, '..', 'build');
    const extensionContext = await chromium.launch({
        headless: false,
        args: [
            `--disable-extensions-except=${pathToExtension}`,
            `--load-extension=${pathToExtension}`,
        ],
        executablePath: exe_path,
    });
    await use(await extensionContext.newContext());
    await extensionContext.close();
},
I'm running the system test inside GitHub Actions using a Makefile; the target runs the following command:
xvfb-run --auto-servernum --server-args='-screen 0, 1920x1080x24' yarn run test --project=${BROWSER}
The setup is as follows (Also from the Makefile):
yarn install --frozen-lockfile
yarn run playwright install --with-deps
When I run the above code, chromium.launch hangs and the test fails after the 30s timeout.
When I remove the --disable-extensions-except flag, launch succeeds, but the extension is not loaded (although the --load-extension flag remains).
Is there a working example of how to test extensions in headful mode inside any CI/CD framework (preferably GitHub Actions)?
Additional Info:
Playwright Version: 1.28.0
Operating System: Linux (ubuntu-latest runner in GitHub Actions)
Node.js version: 18.12.0
Browser: chromium
I am trying to debug an issue which causes headless Chrome using Puppeteer to behave differently on my local environment and on a remote environment such as AWS or Heroku.
The application tries to search publicly available jobs on LinkedIn without authentication (no need to look at profiles); the url format is something like this: https://www.linkedin.com/jobs/search?keywords=Engineer&location=New+York&redirect=false&position=1&pageNum=0
When I open this url in my local environment I have no problems, but when I try to do the same thing on a remote machine such as an AWS EC2 instance or a Heroku dyno, I am redirected to a login form by LinkedIn. To debug this difference I've built a Docker image (based on this image) to isolate it from my local Chrome/profile:
Dockerfile
FROM buildkite/puppeteer
WORKDIR /app
COPY . .
RUN npm install
CMD node index.js
EXPOSE 9222
index.js
const puppeteer = require("puppeteer-extra");
puppeteer.use(require("puppeteer-extra-plugin-stealth")());

const testPuppeteer = async () => {
    console.log('Opening browser');
    const browser = await puppeteer.launch({
        headless: true,
        slowMo: 20,
        args: [
            '--remote-debugging-address=0.0.0.0',
            '--remote-debugging-port=9222',
            '--single-process',
            '--lang=en-GB',
            '--disable-dev-shm-usage',
            '--no-sandbox',
            '--disable-setuid-sandbox',
            "--proxy-server='direct://'",
            '--proxy-bypass-list=*',
            '--disable-gpu',
            '--allow-running-insecure-content',
            '--enable-automation',
        ],
    });

    console.log('Opening page...');
    const page = await browser.newPage();
    console.log('Page open');

    const url = "https://www.linkedin.com/jobs/search?keywords=Engineer&location=New+York&redirect=false&position=1&pageNum=0";
    console.log('Opening url', url);
    await page.goto(url, {
        waitUntil: 'networkidle0',
    });
    console.log('Url open');

    // page && await page.close();
    // browser && await browser.close();
    console.log("Done! Leaving page open for remote inspection...");
};

(async () => {
    await testPuppeteer();
})();
The docker image used for this test can be found here.
I've run the image on my local environment with the following command:
docker run -p 9222:9222 spinlud/puppeteer-linkedin-test
Then, from the local Chrome browser via chrome://inspect, it should be possible to inspect the GUI of the application (I have deliberately left the page open in the headless browser):
As you can see, even in local Docker the page opens without authentication.
I've done the same test on an AWS EC2 instance (Amazon Linux 2) with Docker installed. It needs to be a public instance with SSH access and an inbound rule to allow traffic on port 9222 (for remote Chrome debugging).
I've run the same command:
docker run -p 9222:9222 spinlud/puppeteer-linkedin-test
Then, again from the local Chrome browser's chrome://inspect, once I added the remote public IP of the EC2 instance, I was able to inspect the GUI of the remote headless Chrome as well:
As you can see, this time LinkedIn requires authentication. We can also see a difference in the cookies:
I can't understand the reason behind this different behaviour between my local and remote environments. In theory Docker should provide isolation, and in both environments the headless browser should start with no cookies and a fresh (empty) session. Still, there is a difference and I can't figure out why.
Does anyone have any clue?
I am getting this error again and again while launching the application. I must have reinstalled puppeteer 8-9 times, and I have even installed all the dependencies listed in the troubleshooting link.
Error: Failed to launch the browser process! spawn /home/......./NodeJs/Scraping/code3/node_modules/puppeteer/.local-chromium/linux-756035/chrome-linux/chrome ENOENT
TROUBLESHOOTING: https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md
This code just takes a screenshot of google.com.
NodeJs Version- 14.0.0
Puppeteer Version- 4.0.1
Ubuntu Version- 20.04
I am using puppeteer, which comes bundled with Chromium.
const puppeteer = require("puppeteer"); // was missing from the snippet; launch() is called below
const chalk = require("chalk");

// MY OCD of colorful console.logs for debugging... IT HELPS
const error = chalk.bold.red;
const success = chalk.keyword("green");

(async () => {
    try {
        // open the headless browser
        var browser = await puppeteer.launch({ headless: false });
        // open a new page
        var page = await browser.newPage();
        // enter url in page
        await page.goto(`https://www.google.com/`);
        // Google Say Cheese!!
        await page.screenshot({ path: "example.png" });
        await browser.close();
        console.log(success("Browser Closed"));
    } catch (err) {
        // Catch and display errors
        console.log(error(err));
        await browser.close();
        console.log(error("Browser Closed"));
    }
})();
As you said, puppeteer 2.x.x works for you perfectly but 4.x.x doesn't: it seems to be a Linux dependency issue which occurs more often since puppeteer 3.x.x (usually libgbm1 is the culprit).
If you are not sure where your Chrome executable is located, first run:
whereis chrome
(e.g.: /usr/bin/chrome)
Then, to find your missing dependencies, run:
ldd /usr/bin/chrome | grep not
and sudo apt-get install the listed dependencies.
After this you should be able to do a clean npm install on your project with the latest puppeteer as well (as of today that is 5.0.0).
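To make the ldd step concrete, a hypothetical session might look like the following (the library name is illustrative and varies by system; libgbm1 is simply the most common offender):
$ ldd /usr/bin/chrome | grep not
        libgbm.so.1 => not found
$ sudo apt-get install libgbm1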
I'm running a puppeteer scraper on a DigitalOcean droplet.
The server is Ubuntu 18.04.
ufw is enabled, with the ssh, http and https ports open.
The scraper is run by pm2.
This is the current output and a code snippet:
0|server | 2019-12-23T09:09:27.266Z: [openPage] Error:
net::ERR_TUNNEL_CONNECTION_FAILED at https://xxxx/xxxx
...
const browser = await puppeteer.launch({
    headless: false,
    args: ["--no-sandbox", "--proxy-server=zproxy.lum-superproxy.io:22225"]
});
page = await browser.newPage()
// set random agent to page
await page.setUserAgent(agents[Math.floor(Math.random() * agents.length)])
await page.authenticate({
    username: process.env.USERNAME,
    password: process.env.PWD
})
....
Also, the env variables are working correctly; I checked this with console.log(process.env.USERNAME).
Ensure that your proxy DOES support HTTPS/SSL if you'd like Puppeteer to scrape HTTPS content.
You can easily test if your proxy supports SSL with:
curl --proxy [ip]:[port] https://ipinfo.io/ip
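It can also help to run the same check from inside Puppeteer itself. A minimal sketch, assuming the same proxy flags and credentials as the snippet above; if the proxy works end-to-end, this should print the proxy's IP:
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({
        headless: true,
        args: ['--no-sandbox', '--proxy-server=zproxy.lum-superproxy.io:22225'],
    });
    const page = await browser.newPage();
    // authenticate against the proxy, same as in the scraper
    await page.authenticate({
        username: process.env.USERNAME,
        password: process.env.PWD
    });
    await page.goto('https://ipinfo.io/ip', { waitUntil: 'networkidle0' });
    console.log(await page.evaluate(() => document.body.innerText)); // the IP the target site sees
    await browser.close();
})();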
I've done some research on the web and SOF, but found nothing really helpful about this error.
I installed Node and Puppeteer under Windows 10 Ubuntu Bash (WSL), but didn't manage to make it work there, although I did make it work on Windows without Bash on another machine.
My command is:
node index.js
My index.js tries to take a screenshot of a page :
const puppeteer = require('puppeteer');

async function run() {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('https://github.com');
    await page.screenshot({ path: 'screenshots/github.png' });
    browser.close();
}

run();
Does anybody know the way I could fix this "Error: kill ESRCH" error?
I had the same issue; this worked for me.
Try updating your script to the following:
const puppeteer = require('puppeteer');

async function run() {
    //const browser = await puppeteer.launch();
    // WSL's Chrome support is very new, and requires the sandbox to be disabled in a lot of cases.
    const browser = await puppeteer.launch({ headless: true, args: ['--no-sandbox'] });
    const page = await browser.newPage();
    await page.goto('https://github.com');
    await page.screenshot({ path: 'screenshots/github.png' });
    await browser.close(); // As #Md. Abu Taher suggested
}

run();
const browser = await puppeteer.launch({ args: ['--no-sandbox'] });
If you want to read all the details on this, this ticket has them (or links to them).
https://github.com/Microsoft/WSL/issues/648
Other puppeteer users with similar issues:
https://github.com/GoogleChrome/puppeteer/issues/290#issuecomment-322851507
I just fixed this issue. What you need to do is the following:
1) Install Debian dependencies
You can find them in this doc:
https://github.com/GoogleChrome/puppeteer/blob/master/docs/troubleshooting.md
sudo apt-get install all of those bad boys (an illustrative one-liner follows this list).
2) Add the '--no-sandbox' flag when launching puppeteer
3) Make sure your Windows 10 is up to date. I was missing an important update that allowed you to launch Chrome.
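For illustration, the install line for step 1) ends up looking roughly like this (an abbreviated sketch; the authoritative package list is in the troubleshooting doc linked above):
sudo apt-get install -y ca-certificates fonts-liberation libasound2 \
    libatk-bridge2.0-0 libatk1.0-0 libcups2 libgbm1 libgtk-3-0 \
    libnss3 libx11-xcb1 libxcomposite1 libxdamage1 libxrandr2 xdg-utils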
Points to consider:
Windows Bash is not a complete drop-in replacement for Ubuntu Bash (yet). There are many cases where GUI-based apps do not work properly. Also, the script might be confused by Bash on Windows 10: it could think the OS is Linux instead of Windows.
Windows 10 Bash only supports 64-bit binaries, so make sure the Node and Chrome versions used inside it are 64-bit. Puppeteer kills its child processes via -child.pid (a process group) rather than child.pid except on the Windows version, so make sure puppeteer is not getting confused by all this Bash/Windows mixing.
Back to your case.
You are using browser.close() in the function, but it should be await browser.close(); otherwise it does not execute in the proper order.
Also, you should try to add await page.close(); before browser.close();.
So the code should be,
await page.close();
await browser.close();
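More defensively, a try/finally shape avoids calling close() on a browser that never launched (a sketch reusing the script from the question, with the WSL-friendly flag from above):
const puppeteer = require('puppeteer');

(async () => {
    let browser;
    try {
        browser = await puppeteer.launch({ args: ['--no-sandbox'] });
        const page = await browser.newPage();
        await page.goto('https://github.com');
        await page.screenshot({ path: 'screenshots/github.png' });
        await page.close();
    } finally {
        if (browser) await browser.close(); // always release the Chromium process
    }
})();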
I worked around it by symlinking chrome.exe to node_modules/puppeteer/.../chrome, as below:
ln -s /mnt/c/Program\ Files\ \(x86\)/Google/Chrome/Application/chrome.exe node_modules/puppeteer/.local-chromium/linux-515411/chrome-linux/chrome