Puppeteer: goto() and waitForSelector() always fail on Cloud Functions - node.js

I am using Puppeteer to do some web scraping which is executed in a scheduled pub/sub Cloud Function. The issue I have is that page.goto() and page.waitForSelector() never complete when I deploy my function to Firebase Cloud Functions. The script works fine locally on my machine.
Here is my implementation so far:
//Scheduled pubsub function at ./functions/index.js
exports.scraper = functions.pubsub
.schedule('every 60 minutes') //schedule string not shown in the original; placeholder value
.onRun((context) => {
var scraper = new ScraperManager();
return scraper.start();
})
//Entry function
ScraperManager.prototype.start = async function() {
var webpagePromises = []
for (const agency of agencies) {
for (let page_num = 0; page_num < num_of_pages; page_num++) {
const url = setupUrl(agency, page_num); //Returns a url
const webpagePromise = getWebpage(agency, page_num, url)
webpagePromises.push(webpagePromise)
}
}
return Promise.all(webpagePromises)
}
async function getWebpage(agency, page_num, url) {
var data = {}
const browser = await puppeteer.launch(constants.puppeteerOptions);
try {
const page = await browser.newPage();
await page.setUserAgent(constants.userAgent);
await page.goto(url, {timeout: 0});
console.log("goto completes")
await page.waitForSelector('div.main_container', {timeout: 0});
console.log("waitFor completes")
const body = await page.$eval('body', el => el.innerHTML);
data['html'] = body;
return data;
} catch (err) {
console.log("Puppeteer error", err);
return;
} finally {
await browser.close();
}
}
//PuppeteerOptions in constants file
puppeteerOptions: {
headless: true,
args: [
'--disable-gpu',
'--no-sandbox',
]
}
Note that the {timeout: 0} is necessary because page.goto() and page.waitForSelector() take more time than the default timeout of 30000 ms. Even with the timeout disabled, neither goto() nor waitForSelector() ever completes, i.e. neither console.log() statement gets logged. The script works fine when run locally (the console.log() output does print), but it never works when deployed to Cloud Functions. The entire Cloud Function always times out (presently set to 300 s) without any logs being printed.
Would anybody be able to advise?

I had the same kind of issue: a trivial page.$eval() to get a simple node would never return (it actually took more than 3 minutes) or would crash the page after more than 5 minutes on a virtual private server, while the same script was doing fine on my local computer.
Looking at the virtual server's RAM usage (around 99%), I came to the conclusion that this kind of script cannot run on a server with only 2 GB of RAM.
:-(
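If memory pressure is indeed the cause, one option on Cloud Functions is to allocate more memory and a longer timeout to the function itself. A rough sketch, assuming the firebase-functions v1 API used in the question (the schedule, memory and timeout values are just examples):
// Give the scheduled function more memory and a longer timeout (example values)
exports.scraper = functions
  .runWith({ memory: '2GB', timeoutSeconds: 540 })
  .pubsub.schedule('every 60 minutes')
  .onRun((context) => {
    const scraper = new ScraperManager();
    return scraper.start();
  });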

Related

playwright - get content from multiple pages in parallel

I am trying to get the page content from multiple URLs using playwright in a nodejs application. My code looks like this:
const getContent = async (url: string): Promise<string> => {
const browser = await firefox.launch({ headless: true });
const page = await browser.newPage();
try {
await page.goto(url, {
waitUntil: 'domcontentloaded',
});
return await page.content();
} finally {
await page.close();
await browser.close();
}
}
const items = [
{
urls: ["https://www.google.com", "https://www.example.com"]
// other props
},
{
urls: ["https://www.google.com", "https://www.example.com"]
// other props
},
// more items...
]
await Promise.all(
items.map(async (item) => {
const contents = [];
for (const url of item.urls) {
contents.push(await getContent(url))
}
return contents;
})
)
I am getting errors like error (Page.content): Target closed., but I noticed that if I just run it without the loop:
const content = getContent('https://www.example.com');
it works.
It looks like each iteration of the loops shares the same instance of browser and/or page, so they are closing/navigating away from each other.
To test this I built a web API around the getContent function: when I send 2 requests at (almost) the same time one of them fails, whereas if I send one request at a time it always works.
Is there a way to make Playwright work in parallel?
I don't know if that solves it, but I noticed there are two missing awaits: both firefox.launch(...) and browser.newPage() are asynchronous and need an await in front.
Also, you don't need to launch a new browser so many times. Playwright has the feature of isolated browser contexts, which are created much faster than launching a browser. It's worth experimenting with launching the browser once, before the getContent function, and using
const context = await browser.newContext();
const page = await context.newPage();
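A minimal sketch of how that could look, with one shared browser and an isolated context per call (the wiring here is my assumption, not code from the question):
const { firefox } = require('playwright');

(async () => {
  // Launch the browser once and reuse it for every URL
  const browser = await firefox.launch({ headless: true });

  const getContent = async (url) => {
    // Each call gets its own isolated context, so parallel calls don't interfere
    const context = await browser.newContext();
    const page = await context.newPage();
    try {
      await page.goto(url, { waitUntil: 'domcontentloaded' });
      return await page.content();
    } finally {
      await context.close(); // also closes the page
    }
  };

  const contents = await Promise.all([
    getContent('https://www.example.com'),
    getContent('https://www.google.com'),
  ]);
  console.log(contents.map((c) => c.length));

  await browser.close();
})();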

Puppeteer: Staying on page too long closes the browser (Protocol error (Runtime.callFunctionOn): Session closed. Most likely the page has been closed)

My plan is to connect to a page, interact with its elements for a while, and then wait and start over. Since the process of accessing the page is complicated, I would ideally log in only once, and then permanently stay on page.
index.js
const puppeteer = require('puppeteer');
const creds = require("./creds.json");
(async () => {
try {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.online-messenger.com');
await goToChats(page);
await page.waitForSelector('div[aria-label="Chats"]');
setInterval(async () => {
let index = 1;
while (index <= 5) {
if (await isUnread(page, index)) {
await page.click(`#local-chat`);
await page.waitForSelector('div[role="main"]');
let conversationName = await getConversationName(page);
if (isChat(conversationName)) {
await writeMessage(page);
}
}
index++;
}
}, 30000);
} catch (e) { console.log(e); }
await page.close();
await browser.close();
})();
Again, I do not want to close the connection, so I thought adding setInterval() would help with the problem. The core code works absolutely fine, but every single time I run the code with the interval function I get this error:
Error: Protocol error (Runtime.callFunctionOn): Session closed. Most likely the page has been closed.
I timed the main part of my code and it typically takes around 20-25 seconds. I thought the problem was the delay being set to 30 seconds, but I get the same error even when I increase it to e.g. 60000 (60 seconds).
What am I doing wrong? Why is setInterval not working, and is there possibly a different way of tackling my problem?
Okay, after spending some more time on the problem, I realised it was indeed the setInterval function that caused all the errors.
The code is asynchronous, and to make it all work I had to use an 'async' version of setInterval(): I wrapped my code in an endless loop that ends each iteration by awaiting a promise which resolves after a specified time.
...
await goToChats(page);
await page.waitForSelector('div[aria-label="Chats"]');
while(true) {
let index = 1;
while (index <= 5) {
if (await isUnread(page, index)) {
await page.click(`#local-chat`);
await page.waitForSelector('div[role="main"]');
let conversationName = await getConversationName(page);
if (isChat(conversationName)) {
await writeMessage(page);
}
}
await waitBeforeNextIteration(10000);
index++;
}
...
function waitBeforeNextIteration(ms) {
return new Promise(resolve => setTimeout(resolve, ms))
}
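The same idea as a small reusable helper, for reference (a sketch; task stands in for whatever work each iteration does):
// 'async setInterval': each iteration fully finishes before the pause starts
async function runForever(task, pauseMs) {
  while (true) {
    await task();
    await new Promise((resolve) => setTimeout(resolve, pauseMs));
  }
}

// e.g. runForever(() => checkChats(page), 30000); // checkChats is a hypothetical name for the chat-scanning logic above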

NodeJS async function doesn't run when inside of a while true loop

I'm trying to create a cookie generator, using Puppeteer to capture cookies and add them to a local JSON file. Everything works fine except that when I try to have the function run every 5 seconds it hangs and never completes. In Python I used to do
while True:
main()
time.sleep(5)
But in Node I'm doing this, and it hangs. There are no errors; it just hangs.
while (true){
main()
}
My function never runs and it just hangs. Here's the main function, simplified.
function sleep(ms) {
return new Promise((resolve) => {
setTimeout(resolve, ms);
});
}
main = async () => {
let start = now();
puppeteer.launch( {headless:true} ).then( async (browser) => {
console.log('Loaded Browser Successfully!')
const page = await browser.newPage()
await page.goto('some link')
const cookies = await page.cookies();
await sleep(5000)
for(let cookie of cookies){
if (cookie.name == 'some cookie')
console.log(cookie)
}
await browser.close()
return cookies
}).then( async (cookies) => {
let rawData = fs.readFileSync(path.join(__dirname,'Cookies.json'))
let cookieJar = JSON.parse(rawData)
cookieJar.push(cookies)
console.log(cookieJar.length)
await fs.writeFileSync(path.join(__dirname,'Cookies.json'), JSON.stringify(cookieJar))
let end = now();
console.log(`It took ${end - start}ms`)
return
})
}
What am I doing wrong here?
JS never runs your code in parallel, only asynchronously. Your asynchronous Promise callback can only run after the current synchronous execution finishes, and since you used a while (true), that execution never finishes.
For your problem, setInterval is ideal:
setInterval(main, 5000)
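Alternatively, if you want behaviour closer to the Python while True / sleep(5) pattern, an async loop that awaits each run works too. A sketch, assuming main is changed to return (or await) its Puppeteer promise chain so the loop actually waits for it:
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

(async () => {
  while (true) {
    await main();      // wait for the whole scrape/write cycle to finish
    await sleep(5000); // then pause 5 seconds before the next run
  }
})();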

How to group multiple calls to function that creates headless Chrome instance in GraphQL

I have a NodeJS server running GraphQL. One of my queries gets a list of "projects" from an API and returns a URL. This URL is then passed to another function which gets a screenshot of that website (using a NodeJS package which is a wrapper around Puppeteer).
{
projects {
screenshot {
url
}
}
}
My issue is that when I run this with more than, say, a couple of projects that need a screenshot, it runs the screenshot function for each data response object (see below) and therefore creates a separate headless browser on the server for each one, so my server rapidly runs out of memory and crashes.
{
"data": {
"projects": [
{
"screenshot": {
"url": "https://someurl.com/randomid/screenshot.png"
}
},
{
"screenshot": {
"url": "https://someurl.com/randomid/screenshot.png"
}
}
]
}
}
This is a simplified version of my screenshot logic, for context:
const webshotScreenshot = (title, url) => {
return new Promise(async (resolve, reject) => {
/** Create screenshot options */
const options = {
height: 600,
scaleFactor: 2,
width: 1200,
launchOptions: {
headless: true,
args: ['--no-sandbox']
}
};
/** Capture website */
await captureWebsite.base64(url.href, options)
.then(async response => {
/** Create filename and location */
let folder = `output/screenshots/${_.kebabCase(title)}`;
/** Create directory */
createDirectory(folder);
/** Create filename */
const filename = 'screenshot.png';
const fileOutput = `${folder}/${filename}`;
return await fs.writeFile(fileOutput, response, 'base64', (err) => {
if (err) {
// handle error
}
/** File saved successfully */
resolve({
fileOutput
});
});
})
.catch(err => {
// handle error
});
});
};
What I'd like to know is how I could modify this logic to:
Avoid creating a headless instance for every call to the function, essentially grouping/batching every URL provided in the response and processing them in one go.
And is there anything I can do to reduce the load on the server while this processing is happening, so that I don't run out of memory?
I have already done a lot with Node args and memory limits etc., but the main thing now, I think, is making this as efficient as possible.
You can utilize dataloader to batch your calls to whatever function gets the screenshots. This function should take an array of URLs and return a Promise that resolves with the array of resulting images.
const DataLoader = require('dataloader')
const screenshotLoader = new DataLoader(async (urls) => {
// see below
})
// Inject a new DataLoader instance into your context, then inside your resolver
screenshotLoader.load(yourUrl)
It doesn't look like capture-website supports passing in multiple URLs. That means each call to captureWebsite.base64 will spin up a new puppeteer instance. So Promise.all is out, but you have a couple of options:
Handle the screen captures sequentially. This will be slow, but should ensure only one instance of puppeteer is up at a time.
const images = []
for (const url of urls) {
const image = await captureWebsite.base64(url, options)
images.push(image)
}
return images
Utilize bluebird or a similar library to run the requests concurrently but with a limit:
const concurrency = 3 // 3 at a time
return Bluebird.map(urls, (url) => {
return captureWebsite.base64(url, options)
}, { concurrency })
Switch to using puppeteer directly, or some different library that supports taking multiple screenshots.
const browser = await puppeteer.launch({args: ['--no-sandbox', '--disable-setuid-sandbox']});
const page = await browser.newPage();
const images = []
for (const url of urls) {
await page.goto(url);
images.push(await page.screenshot(/* path and other screenshot options */))
}
await browser.close();
return images
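Putting the pieces together, a rough sketch of option 3 used as the DataLoader batch function (the takeScreenshots name and the base64 encoding are my assumptions, not part of the original answer):
const DataLoader = require('dataloader');
const puppeteer = require('puppeteer');

// Batch function: one browser, one page, screenshots taken sequentially
async function takeScreenshots(urls) {
  const browser = await puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox'] });
  const page = await browser.newPage();
  const images = [];
  try {
    for (const url of urls) {
      await page.goto(url, { waitUntil: 'networkidle2' });
      images.push(await page.screenshot({ encoding: 'base64' }));
    }
  } finally {
    await browser.close();
  }
  return images; // DataLoader expects results in the same order as the input urls
}

// One loader per request/context, so batching happens per GraphQL operation
const screenshotLoader = new DataLoader(takeScreenshots);
// In a resolver: return screenshotLoader.load(project.url)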

How to wait for all downloads to complete with Puppeteer?

I have a small web-scraping application that downloads multiple files from a web application where the URLs require visiting the page.
It works fine if I keep the browser instance alive in between runs, but I want to close the instance in between runs. When I call browser.close() my downloads are stopped because the chrome instance is closed before the downloads have finished.
Does puppeteer provide a way to check if downloads are still active, and wait for them to complete? I've tried page.waitForNavigation({ waitUntil: "networkidle0" }) and "networkidle2", but those seem to wait indefinitely.
node.js 8.10
puppeteer 1.10.0
Update:
It's 2022. Use Playwright to get away from this mess: it can manage downloads for you.
It also has 'smarter' locators, which re-examine selectors each time before click().
Old version, for Puppeteer:
My solution is to use Chrome's own chrome://downloads/ page to manage downloaded files. This approach also makes it very easy to auto-restart a failed download using Chrome's own retry feature.
This example is 'single threaded' for now, because it only monitors the first item that appears in the download manager page. You can adapt it to watch several downloads by iterating through all the download items (#frb0 to #frbn) on that page; just mind your network bandwidth :)
dmPage = await browser.newPage()
await dmPage.goto('chrome://downloads/')
await your_download_button.click() // start download
await dmPage.bringToFront() // this is necessary
await dmPage.waitForFunction(
() => {
// monitor the state of the first download item
// if it has finished, return true; if it failed, click retry
const dm = document.querySelector('downloads-manager').shadowRoot
const firstItem = dm.querySelector('#frb0')
if (firstItem) {
const thatArea = firstItem.shadowRoot.querySelector('.controls')
const atag = thatArea.querySelector('a')
if (atag && atag.textContent === '在文件夹中显示') { // '在文件夹中显示' is the Chinese Chrome label for 'Show in folder'; adjust to your locale
return true
}
const btn = thatArea.querySelector('cr-button')
if (btn && btn.textContent === '重试') { // '重试' means 'Retry'; adjust to your locale
btn.click()
}
}
},
{ polling: 'raf', timeout: 0 }, // poll on every animation frame; 'mutation' polling is also available
)
console.log('finish')
An alternative, if you have the file name, plus a suggestion for other ways to check:
async function waitFile (filename) {
// poll every 3 seconds until the file exists
while (!fs.existsSync(filename)) {
await delay(3000);
}
}
function delay(time) {
return new Promise(function(resolve) {
setTimeout(resolve, time)
});
}
Implementation:
var filename = `${yyyy}${mm}_TAC.csv`;
var pathWithFilename = `${config.path}\\${filename}`;
await waitFile(pathWithFilename);
You need to check the request's response.
page.on('response', (response) => { console.log(response.status(), response.url()); });
Check what comes back in the response and look at its status; a completed download typically comes back with status 200.
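Building on that idea, you can also wait for one specific response instead of just logging them all. A sketch, assuming the download URL contains a known substring (the '/export/' fragment is hypothetical):
// Wait until a response for the download URL arrives with status 200
const response = await page.waitForResponse(
  (res) => res.url().includes('/export/') && res.status() === 200,
  { timeout: 0 }
);
console.log('Download response received:', response.url());
Note that this resolves when the response arrives, not necessarily when the file has finished being written to disk.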
Using Puppeteer and Chrome, I have one more solution which might help you.
While Chrome is still downloading a file, the file has a '.crdownload' extension; when the download completes, that extension disappears.
So I use a recursive function with a maximum number of iterations: I keep checking the folder for that extension, and if the file hasn't finished downloading within that time, I delete it.
async checkFileDownloaded(path, timer) {
return new Promise(async (resolve, reject) => {
let noOfFile;
try {
noOfFile = await fs.readdirSync(path);
} catch (err) {
return resolve("null");
}
for (let i in noOfFile) {
if (noOfFile[i].includes('.crdownload')) {
await this.delay(20000);
if (timer == 0) {
fs.unlink(path + '/' + noOfFile[i], (err) => {
});
return resolve("Success");
} else {
timer = timer - 1;
await this.checkFileDownloaded(path, timer);
}
}
}
return resolve("Success");
});
}
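Usage would look something like this (a sketch; the folder and retry count are illustrative, and it assumes the method is called from the same object that provides delay):
// Wait for downloads in ./downloads to finish, re-checking up to 15 times (20 s apart)
const result = await this.checkFileDownloaded('./downloads', 15);
console.log(result); // "Success", or "null" if the folder could not be read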
Here is another function; it just waits for the Pause button to disappear:
async function waitForDownload(browser: Browser) {
const dmPage = await browser.newPage();
await dmPage.goto("chrome://downloads/");
await dmPage.bringToFront();
await dmPage.waitForFunction(() => {
try {
const donePath = document.querySelector("downloads-manager")!.shadowRoot!
.querySelector(
"#frb0",
)!.shadowRoot!.querySelector("#pauseOrResume")!;
if ((donePath as HTMLButtonElement).innerText != "Pause") {
return true;
}
} catch {
//
}
}, { timeout: 0 });
console.log("Download finished");
}
I didn't like the solutions that check the DOM or the file system for the file.
In the Chrome DevTools Protocol documentation (https://chromedevtools.github.io/) I found two events, Page.downloadProgress and Browser.downloadProgress. (Though Page.downloadProgress is marked as deprecated, it's the one that worked for me.)
The event has a property called state which tells you about the state of the download; state can be inProgress, completed, or canceled.
You can wrap this event in a Promise and await it until the state changes to completed:
async function waitUntilDownload(page, fileName = '') {
return new Promise((resolve, reject) => {
page._client().on('Page.downloadProgress', e => {
if (e.state === 'completed') {
resolve(fileName);
} else if (e.state === 'canceled') {
reject();
}
});
});
}
and await it as follows,
await waitUntilDownload(page, fileName);
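For example (a sketch; the button selector and file name are hypothetical), attach the listener before triggering the download, then await it:
const downloadFinished = waitUntilDownload(page, 'report.csv'); // start listening first
await page.click('#download-button'); // hypothetical element that starts the download
await downloadFinished; // resolves when the CDP event reports state 'completed'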
I created a simple await function that checks for the file every 200 ms, or times out after 10 seconds:
import fs from "fs";
awaitFileDownloaded: async (filePath) => {
let timeout = 10000
const delay = 200
return new Promise(async (resolve, reject) => {
while (timeout > 0) {
if (fs.existsSync(filePath)) {
resolve(true);
return
} else {
await HelperUI.delay(delay)
timeout -= delay
}
}
reject("awaitFileDownloaded timed out")
});
},
You can use node-watch to report updates to the target directory. When the file download is complete you will receive an update event with the name of the new file that has been downloaded.
Run npm to install node-watch:
npm install node-watch
Sample code:
const puppeteer = require('puppeteer');
const watch = require('node-watch');
const path = require('path');
// Add code to initiate the download ...
const watchDir = '/Users/home/Downloads'
const filepath = path.join(watchDir, "download_file");
(async () => {
watch(watchDir, function (event, name) {
if (event == "update" && name === filepath) {
browser.close(); // use case specific
process.exit(); // use case specific
}
});
})();
Try doing an await page.waitFor(50000); with a delay as long as the download should take.
Or look into watching for file changes to detect when the transfer has completed.
You could watch the download location for the '.crdownload' extension that files have while they are still downloading; when a download completes, the file is renamed back to its original extension, e.g. 'video_audio_file.mp4.crdownload' becomes 'video_audio_file.mp4' without the '.crdownload' at the end.
const fs = require('fs');
const path = require('path');
const myPath = path.resolve('/your/file/download/folder');
let siNo = 0;
function stillWorking(myPath) {
siNo = 0;
const filenames = fs.readdirSync(myPath);
filenames.forEach(file => {
if (file.includes('crdownload')) {
siNo = 1;
}
});
return siNo;
}
Then you use it in an infinite loop like this, checking at a fixed interval; here I check every 3 seconds, and when stillWorking() returns 0 there are no files still being downloaded.
while (true) {
execSync('sleep 3');
if (stillWorking(myPath) == 0) {
await browser.close();
break;
}
}
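Note that execSync('sleep 3') blocks the Node event loop (and relies on child_process being imported). A non-blocking sketch of the same loop, assuming it runs inside an async function:
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

while (stillWorking(myPath) == 1) {
  await sleep(3000); // re-check every 3 seconds without blocking the event loop
}
await browser.close();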
