Puppeteer on Heroku: Error R10 (Boot timeout) in a Node web scraping app - node.js

I created a web scraping app which checks for a certain problem on an e-commerce website.
What it does:
Loops through an array of pages
Checks for a condition on every page
If the condition is met, pushes the page to a temporary array
Sends an email with that array as the body
I wrapped that function in a cron job. On my local machine it runs fine.
I deployed it like this:
headless: true,
args: ['--no-sandbox', '--disable-setuid-sandbox']
I added the Puppeteer buildpack link to the app's settings in Heroku; the slug size is 259.6 MiB of 500 MiB.
It didn't work.
I then set the boot timeout to 120s (instead of 60s).
That worked, but it only ran once. Since I want to run the function several times per day, I need to fix the issue.
I have another app running on Heroku which uses the same cron job and notification function, and it works.
Here's my code, if anyone is interested.
const puppeteer = require('puppeteer');
const nodemailer = require("nodemailer");
const CronJob = require('cron').CronJob;

let articleInfo = '';
const mailArr = [];
let body = '';

const testArr = [
  'https://bxxxx..', 'https://b.xxx..', 'https://b.xxxx..',
];

async function sendNotification() {
  let transporter = nodemailer.createTransport({
    host: 'mail.brxxxxx.dxx',
    port: 587,
    secure: false,
    auth: {
      user: 'hey#b.xxxx',
      pass: process.env.heyBfPW2
    }
  });
  let textToSend = 'This is the heading';
  let htmlText = body;
  let info = await transporter.sendMail({
    from: '"BB Checker" <hey#baxxxxx>',
    to: "sxx..xx...xx#gmail.com",
    subject: 'Hi there',
    text: textToSend,
    html: htmlText
  });
  console.log("Message sent: %s", info.messageId);
}

async function boxLookUp(item) {
  const browser = await puppeteer.launch({
    headless: true,
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
    ],
  });
  const page = await browser.newPage();
  await page.goto(item);
  const content = await page.$eval('.set-article-info', div => div.textContent);
  const title = await page.$eval('.product--title', div => div.textContent);
  const orderNumber = await page.$eval('.entry--content', div => div.textContent);
  // Check if deliveryTime is already updated
  try {
    await page.waitForSelector('.delivery--text-more-is-coming');
    // if not
  } catch (e) {
    if (e instanceof puppeteer.errors.TimeoutError) {
      // if not updated, check if all parts of the set are available
      if (content != '3 von 3 Artikeln ausgewählt' && content != '4 von 4 Artikeln ausgewählt' && content != '5 von 5 Artikeln ausgewählt') {
        articleInfo = `${title} ${orderNumber} ${item}`;
        mailArr.push(articleInfo);
      }
    }
  }
  await browser.close();
}

const checkBoxes = async (arr) => {
  for (const i of arr) {
    await boxLookUp(i);
  }
  console.log(mailArr);
  body = mailArr.toString();
  sendNotification();
};

async function startCron() {
  let job = new CronJob('0 */10 8-23 * * *', function () { // run every 10 minutes between 08:00 and 23:59
    checkBoxes(testArr);
  }, null, true, null, null, true);
  job.start();
}

startCron();
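One thing worth noting about the code above, separate from the boot timeout: mailArr and body are module-level, so they accumulate across cron ticks and every later email repeats results from earlier runs. A minimal sketch of keeping that state per run (boxLookUp and sendNotification here are simplified stand-ins for the question's functions, not the real implementations):

```javascript
// Sketch: keep per-run state local so repeated cron ticks start fresh.
// boxLookUp and sendNotification are simplified stand-ins.
async function boxLookUp(item) {
  // stand-in: pretend every page meets the condition
  return `info for ${item}`;
}

async function sendNotification(body) {
  console.log('would send:', body);
}

const checkBoxes = async (arr) => {
  const mailArr = []; // local: nothing carries over to the next tick
  for (const item of arr) {
    const info = await boxLookUp(item);
    if (info) mailArr.push(info);
  }
  await sendNotification(mailArr.join(', '));
  return mailArr;
};
```

Returning the results and passing the body explicitly also makes it natural to await sendNotification(), which the original code doesn't do.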

I've had the same issue for 3 days now. Here's something that might help: https://stackoverflow.com/a/55861535/13735374
It has to be done alongside the Procfile configuration.

Assuming the rest of the code works (nodemailer, etc), I'll simplify the problem to focus purely on running a scheduled Node Puppeteer task in Heroku. You can re-add your mailing logic once you have a simple example running.
Heroku runs scheduled tasks using either simple job scheduling or a custom clock process.
Simple job scheduling doesn't give you much control, but it is easier to set up and potentially less expensive in terms of billable hours if you're running it infrequently. The custom clock, on the other hand, is a continuously-running process and will therefore chew up hours.
A custom clock process can execute your cron schedule exactly, so that's probably the natural fit for this case.
For certain scenarios, you can sometimes work around the simple scheduler's limitations and achieve more complicated schedules by having the task exit early or by deploying multiple apps.
For example, if you want a twice-daily schedule, you could have two apps that run the same task scheduled at different hours of the day. Or, if you wanted to run a task twice weekly, schedule it to run daily using the simple scheduler, then have it check its own time and exit immediately if the current day isn't one of the two desired days.
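The exit-early trick from the twice-weekly example can be sketched as follows; the choice of Monday and Thursday is hypothetical:

```javascript
// Sketch: a daily Heroku Scheduler task that only does real work on
// certain days. Date#getDay() numbering: 0 = Sunday ... 6 = Saturday.
const RUN_DAYS = [1, 4]; // Mondays and Thursdays (example choice)

function shouldRunToday(date = new Date()) {
  return RUN_DAYS.includes(date.getDay());
}

// In the real scheduled task you would exit immediately on off days to
// minimize billable time:
// if (!shouldRunToday()) process.exit(0);
console.log(shouldRunToday() ? 'work day' : 'off day');
```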
Regardless of whether you use a custom clock or simple scheduled task, note that long-running tasks really should be handled by a background task, so the examples below aren't production-ready. That's left as an exercise for the reader and isn't Puppeteer-specific.
Custom clock process
package.json:
{
  "name": "test-puppeteer",
  "version": "1.0.0",
  "description": "",
  "scripts": {
    "start": "echo 'running'"
  },
  "author": "",
  "license": "ISC",
  "dependencies": {
    "cron": "^1.8.2",
    "puppeteer": "^9.1.1"
  }
}
Procfile
clock: node clock.js
clock.js:
const {CronJob} = require("cron");
const puppeteer = require("puppeteer");

// FIXME move to a worker task; see https://devcenter.heroku.com/articles/node-redis-workers
const scrape = async () => {
  const browser = await puppeteer.launch({
    args: ["--no-sandbox", "--disable-setuid-sandbox"]
  });
  const [page] = await browser.pages();
  await page.setContent(`<p>clock running at ${Date()}</p>`);
  console.log(await page.content());
  await browser.close();
};

new CronJob({
  cronTime: "30 * * * * *", // for demonstration: fires at second 30 of every minute
  onTick: scrape,
  start: true,
});
Set up
Install Heroku CLI and create a new app with Node and Puppeteer buildpacks (see this answer):
heroku create
heroku buildpacks:add --index 1 https://github.com/jontewks/puppeteer-heroku-buildpack -a cryptic-dawn-48835
heroku buildpacks:add --index 1 heroku/nodejs -a cryptic-dawn-48835
(replace cryptic-dawn-48835 with your app name)
Deploy:
git init
git add .
git commit -m "initial commit"
heroku git:remote -a cryptic-dawn-48835
git push heroku master
Add a clock process:
heroku ps:scale clock=1
Verify that it's running with heroku logs --tail. heroku ps:scale clock=0 turns off the clock.
Simple scheduler
package.json:
Same as above, but no need for cron. No need for a Procfile either.
task.js:
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({
    args: ["--no-sandbox", "--disable-setuid-sandbox"]
  });
  const [page] = await browser.pages();
  await page.setContent(`<p>scheduled job running at ${Date()}</p>`);
  console.log(await page.content());
  await browser.close();
})();
Set up
Install Heroku CLI and create a new app with Node and Puppeteer buildpacks (see this answer):
heroku create
heroku buildpacks:add --index 1 https://github.com/jontewks/puppeteer-heroku-buildpack -a cryptic-dawn-48835
heroku buildpacks:add --index 1 heroku/nodejs -a cryptic-dawn-48835
(replace cryptic-dawn-48835 with your app name)
Deploy:
git init
git add .
git commit -m "initial commit"
heroku git:remote -a cryptic-dawn-48835
git push heroku master
Add a scheduler:
heroku addons:add scheduler:standard -a cryptic-dawn-48835
Configure the scheduler by running:
heroku addons:open scheduler -a cryptic-dawn-48835
This opens a browser and you can add a command node task.js to run every 10 minutes.
Verify that it worked after 10 minutes with heroku logs --tail. The online scheduler will show the time of next/previous execution.
See this answer for creating an Express-based web app on Heroku with Puppeteer.

Related

Why does Puppeteer cause my test suite to hang for 30 seconds when I use "waitForSelector" even though I'm calling "close" on the page and browser?

I have a Node.js Mocha test suite (I've created a minimal reproduction based on the real world application I was trying to create an automated test for).
package.json:
{
  "name": "puppeteer-mocha-hang-repro",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "keywords": [],
  "author": "",
  "license": "ISC",
  "devDependencies": {
    "chai": "4.3.7",
    "express": "4.18.2",
    "mocha": "10.2.0",
    "puppeteer": "19.6.2"
  }
}
index.spec.js:
const expect = require('chai').expect;
const express = require('express');
const puppeteer = require('puppeteer');

const webServerPort = 3001;

describe('test suite', function () {
  this.timeout(10000);
  let webServer;
  let browser;

  beforeEach(async () => {
    // Start web server using Express
    const app = express();
    app.get('/', (_, res) => {
      res.send('<html>Hello, World from the <span id="source">Express web server</span>!</html>');
    });
    webServer = app.listen(webServerPort, () => {
      console.log(`Web server listening on port ${webServerPort}.`);
    });

    // Start browser using Puppeteer
    browser = await puppeteer.launch();
    console.log('Browser launched.');
  });

  afterEach(async () => {
    // Stop browser
    await browser.close();
    console.log('Browser closed.');

    // Stop web server
    await webServer.close();
    console.log('Web server closed.');
  });

  it('should work', async () => {
    const page = await browser.newPage();
    await page.goto(`http://localhost:${webServerPort}/`);
    console.log('Went to root page of web server via Puppeteer.');
    if (process.env['PARSE_PAGE'] === 'true') {
      const sel = await page.waitForSelector('#source');
      const text = await sel.evaluate(el => el.textContent);
      console.log('According to Puppeteer, the text content of the #source element on the page is:', text);
      expect(text).eql('Express web server');
    }
    await page.close();
    console.log('Page closed.');
  });
});
If I run the test suite with the command npx mocha index.spec.js, which causes the PARSE_PAGE block to be skipped, the test suite passes and the Mocha process ends quickly:
$ time npx mocha index.spec.js
test suite
Web server listening on port 3001.
Browser launched.
Went to root page of web server via Puppeteer.
Page closed.
✔ should work (70ms)
Browser closed.
Web server closed.
1 passing (231ms)
real 0m0.679s
user 0m0.476s
sys 0m0.159s
Note that it finished in 0.679s.
If I instead run it with the command PARSE_PAGE=true npx mocha index.spec.js, which causes none of my code to be skipped, the tests pass quickly but the process hangs for about 30 seconds:
$ time PARSE_PAGE=true npx mocha index.spec.js
test suite
Web server listening on port 3001.
Browser launched.
Went to root page of web server via Puppeteer.
According to Puppeteer, the text content of the #source element on the page is: Express web server
Page closed.
✔ should work (79ms)
Browser closed.
Web server closed.
1 passing (236ms)
real 0m30.631s
user 0m0.582s
sys 0m0.164s
Note that it finished in 30.631s.
I suspected this meant I was leaving things open, forgetting to call functions like close. But I am calling close on the Express web server, the Puppeteer browser, and the Puppeteer page. I also tried calling close on the objects I only use when none of the code is skipped, sel and text, but I get error messages telling me those objects have no such function.
System details:
$ node --version
v18.13.0
$ npm --version
9.4.0
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.1 LTS
Release: 22.04
Codename: jammy
$ uname -r
5.10.16.3-microsoft-standard-WSL
Update: this behavior is a regression fixed by #9612 and deployed as 19.6.3. To fix the problem, upgrade to 19.6.3 (or downgrade to <= 19.6.0 if you're using an older Puppeteer for some reason).
See the original answer below.
I'm able to reproduce the hang, even without Mocha. It seems to be a bug in Puppeteer versions 19.6.1 and 19.6.2. Here's a minimal example:
const puppeteer = require("puppeteer"); // 19.6.1 or 19.6.2

const html = `<!DOCTYPE html><html><body><p>hi</p></body></html>`;

let browser;
(async () => {
  browser = await puppeteer.launch();
  const [page] = await browser.pages();
  await page.setContent(html);
  const el = await page.waitForSelector("p");
  console.log(await el.evaluate(el => el.textContent));
})()
  .catch(err => console.error(err))
  .finally(async () => {
    await browser?.close();
    console.log("browser closed");
  });
The culprit is page.waitForSelector, which seems to run its full 30 second default timeout even after resolving, somehow preventing the process from exiting. I've opened issue #9610 on Puppeteer's GitHub repo.
Possible workarounds:
Downgrade to 19.6.0.
Avoid using waitForSelector, since the data you want is in the static HTML (may not apply to your actual page though).
Call with page.waitForSelector("#source", {timeout: 0}) which seems to fix the problem, with the risk of stalling forever if used in a script (not a problem with mocha since the test will time out).
Call with page.waitForSelector("#source", {timeout: 1000}) which reduces the impact of the delay, with the risk of a false positive if the element takes longer than a second to load. This doesn't seem to stack, so if you use a 1-3 second delay across many tests, mocha should exit within a few seconds of all tests completing rather than the sum of all delays across all waitForSelector calls. This isn't practical in most scripts, though.
Run npx mocha --exit index.spec.js. Not recommended; this merely suppresses the issue.
I'm not sure whether the behavior is specific to waitForSelector or whether it may apply to other waitFor-family methods.
As an aside, your server listen and close calls are technically race conditions, so:
await new Promise(resolve =>
  webServer = app.listen(webServerPort, () => {
    console.log(`Web server listening on port ${webServerPort}.`);
    resolve();
  })
);
and
await new Promise(resolve => webServer.close(() => resolve()));
System details:
$ node --version
v18.7.0
$ npm --version
9.3.0
$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.1 LTS
Release: 22.04
Codename: jammy
$ uname -r
5.15.0-56-generic
I also confirmed the behavior on Windows 10.
I'm not sure how helpful it will be, but you can try this:
if (process.env['PARSE_PAGE'] === 'true') {
  const sel = await page.waitForSelector('#source');
  const text = await page.evaluate(el => el.textContent, sel);
  console.log('According to Puppeteer, the text content of the #source element on the page is:', text);
  expect(text).eql('Express web server');
}
Also, check for global hooks!

run onPrepare function synchronously in Jasmine, before executing specs

Hello stackoverflow community,
The task
I'm running Jasmine tests programmatically using jasmine.execute(). My task is to run an onPrepare function, where I do some setup work like setting up reporters, and I need that function to be synchronous (it has to finish before the specs run).
Approach #1
The first approach I tried was to just declare an async onPrepare function, which also includes the code for specifying reporters, and then do
await onPrepare();
await jasmine.execute();
Problem
As a result I get jasmine.getEnv() is not a function. I assume that's because getEnv() only becomes available once .execute() is run. So I understand this won't work.
Approach #2
The next thing I tried was to create a helper file with my sync code, specify it in the config and run jasmine.execute();.
So, if simplified, I have
// conf.js
(async () => {
  let Jasmine = require('jasmine');
  let jasmine = new Jasmine();
  let variables = require("./variables.json");
  let {spawn} = require("child_process");
  let childProcess = spawn(variables.webdriver);
  console.log(`Webdriver started, pid: ${childProcess.pid}`);

  jasmine.exitOnCompletion = false;
  jasmine.loadConfig({
    'spec_files': ['specs/*.spec.js'],
    'helpers': ['on-jasmine-prepare.js'],
    'stopSpecOnExpectationFailure': false,
    'random': false,
  });

  const result = await jasmine.execute();
  console.log('Test status:', result.overallStatus);
  console.log('Closing library and webdriver process');
  await library.close();
  await childProcess.kill();
  console.log('webdriver killed:', childProcess.killed);
})()
// on-jasmine-prepare.js
(async () => {
  const {SpecReporter} = require("jasmine-spec-reporter");
  const library = require("./library/library");
  const variables = require("./variables.json");
  const errorHandler = require("./modules/on-error-handler");

  jasmine.getEnv().clearReporters();
  jasmine.DEFAULT_TIMEOUT_INTERVAL = 3 * 60 * 1000;
  jasmine.getEnv().addReporter(new SpecReporter({}));

  global.library = new library.Library(variables.IP);
  console.log('library instantiated');
  await library.deleteSessions();
  console.log('sessions deleted');
  await library.launch(library.home);
  console.log('home page launched');

  jasmine.getEnv().addReporter(errorHandler(library));
  console.log('debugger reporter added');
})();
Problem
The problem I noticed is that the helper file is executed asynchronously alongside the specs, and I get a spec error before the helper function finishes (basically a race condition). Below is example output, where you can see that some console.log calls from onPrepare ran after a spec had started:
Webdriver started, pid: 40745
library instantiated
Jasmine started
sessions deleted
Example
✗ App is loaded (7 secs)
- Error: Session already exist
home page launched
debugger reporter added
The question
How do I run the onPrepare function synchronously, before the specs? Preferably natively (using only Jasmine capabilities), otherwise maybe using third-party packages. I know it's possible because Protractor had this feature; however, I couldn't reverse-engineer it.
MacOS
node v16.13.2
jasmine v4.1.0
Thank you

How do I roll to a specific browser version with Playwright?

I need to run some tests using Playwright across different Chromium versions. I have different Chromium folders with different versions, but I don't know how to switch from one version to another using the CLI to run my tests. Some help? Thanks :)
You can use the executablePath option when launching the browser to use a custom executable. See here. Note that this only works with Chromium-based browsers; see here.
const playwright = require('playwright');

(async () => {
  const browser = await playwright.chromium.launch({
    executablePath: '/your/custom/chromium',
    headless: false, // to see the browser
    slowMo: 4000 // to slow it down
  });
  // example for Edge on Windows:
  // executablePath: 'C:/Program Files (x86)/Microsoft/Edge/Application/msedge.exe',
  const page = await browser.newPage();
  await page.goto('http://whatsmyuseragent.org/');
  await page.screenshot({ path: `example.png` });
  await browser.close();
})();
Also, Playwright only tests against the latest stable version, so other Chromium versions might misbehave. See here under the releases.
Max Schmitt is right: the library is not guaranteed to work with non-bundled Chromiums. Still, you can try multiple Chromium-based browsers via the executablePath. Since this is not built into Playwright Test, you will need to implement it yourself.
Note: this way you lose some of the simplicity of Playwright Test.
In my example I used Jest as a test runner, so yarn add --dev jest is required. The last CLI argument, reserved for the browser version, can be retrieved with process.argv.slice(-1)[0] within Node; this way you can tell your tests which browser version to use. Here the values will be edge and chrome, and the default is the bundled chromium.
MS Edge (Chromium)
yarn test chrome.test.js edge
Chrome
yarn test chrome.test.js chrome
Chromium (default, bundled with Playwright) (any other string, or omitting the argument, will also launch this default)
yarn test chrome.test.js chromium_default
chrome.test.js
(with Windows-specific executable paths)
const playwright = require('playwright')

let browser
let page

beforeAll(async function () {
  let chromeExecutablePath
  switch (process.argv.slice(-1)[0]) {
    case 'chrome':
      chromeExecutablePath = 'C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe'
      break
    case 'edge':
      chromeExecutablePath = 'C:\\Program Files (x86)\\Microsoft\\Edge\\Application\\msedge.exe'
      break
    default:
      chromeExecutablePath = ''
  }
  browser = await playwright.chromium.launch({
    headless: false,
    executablePath: chromeExecutablePath
  })
  page = await browser.newPage()
})

describe('Google Search', function () {
  test('should respond with HTTP 200 - OK', async function () {
    const response = await page.goto('https://google.com')
    const responseCode = response.status()
    expect(responseCode).toBe(200)
  })

  afterAll(async function () {
    await browser.close()
  })
})

Pushing process to background causes high kswapd0

I have a CPU-intensive process running on a Raspberry Pi that's executed by running a Node.js file. Running the first command (below) and then running the file in another tab works just fine. However, when I run the process via a bash shell script, the process stalls.
Looking at the processes using top, I see that kswapd0 and kworker/2:1+ take over most of the CPU. What could be causing this?
FYI, the first command starts the Ethereum discovery protocol via HTTP and IPC:
geth --datadir $NODE --syncmode 'full' --port 8080 --rpc --rpcaddr 'localhost' --rpcport 30310 --rpcapi 'personal,eth,net,web3,miner,txpool,admin,debug' --networkid 777 --allow-insecure-unlock --unlock "$HOME_ADDRESS" --password ./password.txt --mine --maxpeers 100 2> results/log.txt &
sleep 10
# create storage contract and output result
node performanceContract.js
UPDATE:
performanceContract.js
const ethers = require('ethers');
const fs = require('fs');

const provider = new ethers.providers.IpcProvider('./node2/geth.ipc');
const walletJson = fs.readFileSync('./node2/keystore/keys', 'utf8');
const pwd = fs.readFileSync('./password.txt', 'utf8').trim();
const PerformanceContract = require('./contracts/PerformanceContract.json');

(async function () {
  try {
    const wallet = await ethers.Wallet.fromEncryptedJson(walletJson, pwd);
    const connectedWallet = wallet.connect(provider);
    const factory = new ethers.ContractFactory(PerformanceContract.abi, PerformanceContract.bytecode, connectedWallet);
    const contract = await factory.deploy();
    const deployedInstance = new ethers.Contract(contract.address, PerformanceContract.abi, connectedWallet);
    let tx = await deployedInstance.loop(6000);
    fs.writeFile(`./results/contract_result_xsmall_${new Date()}.txt`, JSON.stringify(tx, null, 4), () => {
      console.log('file written');
    });
    ...
Where loop is a method that repeatedly runs the keccak256 hash. Its purpose is to test different gas costs by varying the loop count.
Solved by increasing the sleep time to 1 min. I assume it was just a memory issue that needed more time before executing the contract.

How to use forever-monitor with Electron-Angular project?

I am using Angular 2 with Electron and want to keep a process running in the background to show notifications. I am using forever-monitor for that; it works in development mode, but when I package my app using electron-packager, it stops working. My code looks like this:
main.ts
exports.runBackgroundProcess = () => {
  // Run a background process forever
  var forever = require('forever-monitor');
  var child = new (forever.Monitor)('src/assets/notification-process.js', {
    env: {ELECTRON_RUN_AS_NODE: 1},
    options: []
  });
  child.start();
};
I wrote a function in main.ts that runs the background process when called from an Angular component. The code in notification-process.js is the following:
notification-process.js
notifier = require('node-notifier');

notifierFun = (msg) => {
  notifier.notify({
    title: 'Notify Me',
    message: msg,
    wait: true
  });
};

var CronJob = require('cron').CronJob;
new CronJob('* * * * * *', function () {
  notifierFun("Message from notification process");
});
Finally I am calling the function from app.component.ts
let main_js = this.electronService.remote.require("./main.js");
main_js.runBackgroundProcess();
I don't think it is a good idea to keep your script in the assets directory; I would prefer to package it as an extra resource.
The next snippet launches your node process:
var child_process = require('child_process');

var child = child_process.fork('notification-process.js', [], {
  cwd: 'resources'
});
If it does not work once packaged, it may be because your files have not been packaged. To package your script as an extra resource, modify package.json as follows; this packages the webserver folder into the resources/webserver folder:
"win": {
  "target": "nsis",
  "icon": "build/icon.ico"
},
"extraResources": [{
  "from": "webserver",
  "to": "webserver"
}]
for reference, have a look at :
https://nodejs.org/api/child_process.html#child_process_child_process_fork_modulepath_args_options
That's how it worked:
1- Moved the notification-process.js file from the assets folder to the main directory.
2- Changed the file path in main.js:
var child = new (forever.Monitor)(path.join(__dirname, 'notification-process.js')...
Without path.join, it doesn't work after packaging the app.
