playwright - get content from multiple pages in parallel - node.js

I am trying to get the page content from multiple URLs using Playwright in a Node.js application. My code looks like this:
const getContent = async (url: string): Promise<string> => {
  const browser = await firefox.launch({ headless: true });
  const page = await browser.newPage();
  try {
    await page.goto(url, {
      waitUntil: 'domcontentloaded',
    });
    return await page.content();
  } finally {
    await page.close();
    await browser.close();
  }
};
const items = [
  {
    urls: ["https://www.google.com", "https://www.example.com"]
    // other props
  },
  {
    urls: ["https://www.google.com", "https://www.example.com"]
    // other props
  },
  // more items...
]
await Promise.all(
  items.map(async (item) => {
    const contents = [];
    for (const url of item.urls) {
      contents.push(await getContent(url));
    }
    return contents;
  })
);
I am getting errors like "error (Page.content): Target closed", but I noticed that if I just run it without the loop:
const content = getContent('https://www.example.com');
It works.
It looks like each iteration of the loop shares the same browser and/or page instance, so the calls end up closing or navigating away from each other.
To test this I built a web API around the getContent function: when I send two requests at (almost) the same time, one of them fails, whereas if I send one request at a time it always works.
Is there a way to make Playwright work in parallel?

I don't know if that solves it, but I noticed there are two missing awaits: both firefox.launch(...) and browser.newPage() are asynchronous and need an await in front of them.
Also, you don't need to launch a new browser so many times. Playwright has isolated browser contexts, which are much faster to create than launching a whole browser. It's worth launching the browser once, outside the getContent function, and then using
const context = await browser.newContext();
const page = await context.newPage();
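Something like this, for example (a rough sketch of that idea, not a drop-in fix; it assumes your existing items array and that this runs somewhere top-level await is available):
import { firefox, Browser } from 'playwright';

const getContent = async (browser: Browser, url: string): Promise<string> => {
  // Each call gets its own isolated context, so parallel calls cannot
  // close or navigate each other's pages.
  const context = await browser.newContext();
  const page = await context.newPage();
  try {
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    return await page.content();
  } finally {
    await context.close(); // also closes the page
  }
};

const browser = await firefox.launch({ headless: true });
try {
  const results = await Promise.all(
    items.map((item) => Promise.all(item.urls.map((url) => getContent(browser, url))))
  );
  // results[i] holds the contents for items[i].urls, in order
} finally {
  await browser.close();
}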

Related

Puppeteer iterates through all pages but browser won't close after

After the program iterates through all pages, it doesn't break out of the while loop and close the browser. Instead, it runs through the while loop one extra time and gives me an error: "TimeoutError: waiting for selector `.pager .next a` failed: timeout 30000ms exceeded". What went wrong?
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto("http://books.toscrape.com/");
  let isLastPage = false;
  while (!isLastPage) {
    await page.waitForSelector(".pager .next a");
    isLastPage = (await page.$(".pager .next a")) === null;
    await page.click(".pager .next a");
  }
  console.log("done");
  await browser.close();
})();
Your last-page detection logic is flawed. While you're on a page, you're trying both to check whether ".pager .next a" exists AND to click it; obviously, if it doesn't exist, you can't click it.
What you want to do is make sure the page is loaded by waiting for .pager .current, which is part of the navigation footer and present on every page. Then check whether .pager .next a is there BEFORE you click it; if it's not there, you can just break out of the while loop. If the page is dynamic and you need to use Puppeteer, you can do something like this:
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  try {
    const page = await browser.newPage();
    await page.goto("http://books.toscrape.com/");
    let cntr = 0;
    while (true) {
      await page.waitForSelector(".pager .current");
      console.log(`page ${++cntr}`);
      // process the page content here
      if ((await page.$(".pager .next a")) === null) {
        break;
      }
      await page.click(".pager .next a");
    }
    console.log("done");
  } finally {
    await browser.close();
  }
})();
And, to make sure you always close the browser even when errors occur, you need to catch any errors and close the browser in those cases too; here you can use try/finally.
If the page is not dynamic, you can also just use plain GET requests and cheerio to examine what's in the page, which is simpler and doesn't involve loading the whole Chromium browser engine.
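For example, something along these lines (just a sketch; it assumes Node 18+ for the global fetch, and the book-title selector is a guess at the site's markup):
const cheerio = require("cheerio");

(async () => {
  let url = "http://books.toscrape.com/";
  while (url) {
    const html = await (await fetch(url)).text(); // plain GET, no browser
    const $ = cheerio.load(html);
    // process the page content here, e.g. print the book titles
    $("article.product_pod h3 a").each((_, el) => console.log($(el).attr("title")));
    // follow the "next" link if there is one, otherwise stop
    const next = $(".pager .next a").attr("href");
    url = next ? new URL(next, url).href : null;
  }
})();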

How to tell Puppeteer to finish everything and give me what it has so far

I am struggling to get Puppeteer to just finish all activity so I can get the page content. It just fails silently at await page.content();.
This is for pages that often have a few resources that never really finish loading, so the browser is stuck in a "page not finished loading" state (the Chromium tab spinner keeps spinning).
So I am looking for a way to force the page to conclude the rendering process with everything it has done so far. I assumed await client.send("Page.stopLoading"); would do that for me, but it's pretty much like pressing X in the browser; it doesn't tell Chromium to finish loading.
(async () => {
  const browser = await p.launch({
    defaultViewport: { width: 800, height: 600, isMobile: true },
  });
  const page = await browser.newPage();
  const client = await page.target().createCDPSession();
  await page.setUserAgent(ua);
  try {
    await page.goto(url, { waitUntil: "networkidle2", timeout: 5000 }).catch(e => void e);
    await client.send("Page.stopLoading");
    console.log("tried to stop");
    const content = await page.content();
    console.log(content);
  } catch (e) {
    console.log("fail");
  }
  await browser.close();
})();
I also tried building an array of all requests and looking for the ones the interceptor had not handled, in order to abort hanging requests, but that approach produced nothing either.

Puppeteer page.$$ returns empty array

I'm working on a simple scraper but I can't get past this issue.
It returns an empty array every time I run it; however, the site does contain the elements and returns a NodeList when I run querySelectorAll in the console.
Is there anything I might be overlooking? I've already tried waitForSelector to wait for it, but no luck, it just times out.
Thank you
const scraper = async () => {
  try {
    const browser = await puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox'] });
    const page = await browser.newPage();
    await page.goto('https://randomtodolistgenerator.herokuapp.com/library');
    const elements = await page.$$(".card-body");
    console.log(elements);
    await browser.close();
  } catch (error) {
    console.log(error);
  }
}
It turned out that WSL was not able to run Chromium for some reason.
I ended up installing Linux on a VM and it is working now.
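If someone hits something similar, one way to surface this kind of environment problem (my own suggestion, not part of the original fix) is to let Puppeteer pipe the browser process output so launch failures aren't silent:
const browser = await puppeteer.launch({
  args: ['--no-sandbox', '--disable-setuid-sandbox'],
  dumpio: true, // pipe Chromium's stdout/stderr into this process to see why it fails to start
});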

How to group multiple calls to function that creates headless Chrome instance in GraphQL

I have a NodeJS server running GraphQL. One of my queries gets a list of "projects" from an API and returns a URL. This URL is then passed to another function which gets a screenshot of that website (using a NodeJS package which is a wrapper around Puppeteer).
{
  projects {
    screenshot {
      url
    }
  }
}
My issue is that when I run this with more than a couple of projects needing screenshots, it runs the screenshot function for each object in the data response (see below) and therefore creates a separate headless browser for each one, so my server rapidly runs out of memory and crashes.
{
  "data": {
    "projects": [
      {
        "screenshot": {
          "url": "https://someurl.com/randomid/screenshot.png"
        }
      },
      {
        "screenshot": {
          "url": "https://someurl.com/randomid/screenshot.png"
        }
      }
    ]
  }
}
This is a simplified version of the code I have for the screenshot logic for context:
const webshotScreenshot = (title, url) => {
  return new Promise(async (resolve, reject) => {
    /** Create screenshot options */
    const options = {
      height: 600,
      scaleFactor: 2,
      width: 1200,
      launchOptions: {
        headless: true,
        args: ['--no-sandbox']
      }
    };
    /** Capture website */
    await captureWebsite.base64(url.href, options)
      .then(async response => {
        /** Create filename and location */
        let folder = `output/screenshots/${_.kebabCase(title)}`;
        /** Create directory */
        createDirectory(folder);
        /** Create filename */
        const filename = 'screenshot.png';
        const fileOutput = `${folder}/${filename}`;
        return await fs.writeFile(fileOutput, response, 'base64', (err) => {
          if (err) {
            // handle error
          }
          /** File saved successfully */
          resolve({
            fileOutput
          });
        });
      })
      .catch(err => {
        // handle error
      });
  });
};
What I'd like to know is how I could modify this logic to:
Avoid creating a headless instance for every call to the function, essentially grouping/batching every URL provided in the response and processing them in one go.
Reduce the load on the server while this processing is happening, so that I don't run out of memory.
I have already done a lot with Node args, memory limits, etc., but I think the main thing now is making this as efficient as possible.
You can utilize dataloader to batch your calls to whatever function gets the screenshots. This function should take an array of URLs and return a Promise that resolves with the array of resulting images.
const DataLoader = require('dataloader')
const screenshotLoader = new DataLoader(async (urls) => {
  // see below
})
// Inject a new DataLoader instance into your context, then inside your resolver
screenshotLoader.load(yourUrl)
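Roughly, the wiring might look like this (a sketch only; the Apollo-style context setup, the Project.screenshot resolver shape, and the takeScreenshots batch function are my assumptions based on your schema above):
const DataLoader = require('dataloader')

// Build a fresh loader per request so URLs from a single query get batched together
const createContext = () => ({
  screenshotLoader: new DataLoader(takeScreenshots) // takeScreenshots(urls) -> Promise of images, same order
})

const resolvers = {
  Project: {
    screenshot: async (project, args, context) => {
      const image = await context.screenshotLoader.load(project.url)
      return { url: saveScreenshot(project.title, image) } // saveScreenshot: your existing file-writing logic
    }
  }
}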
It doesn't look like capture-website supports passing in multiple URLs, which means each call to captureWebsite.base64 will spin up a new puppeteer instance. So Promise.all is out, but you have a couple of options:
Handle the screen captures sequentially. This will be slow, but it should ensure that only one instance of puppeteer is up at a time.
const images = []
for (const url of urls) {
  const image = await captureWebsite.base64(url, options)
  images.push(image)
}
return images
Utilize bluebird or a similar library to run the requests concurrently but with a limit:
const concurrency = 3 // 3 at a time
return Bluebird.map(urls, (url) => {
  return captureWebsite.base64(url, options)
}, { concurrency })
Switch to using puppeteer directly, or some different library that supports taking multiple screenshots.
const browser = await puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox'] });
const page = await browser.newPage();
for (const url of urls) {
  await page.goto(url);
  await page.screenshot(/* path and other screenshot options */);
}
await browser.close();
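Putting the last option together with dataloader, the batch function might look roughly like this (a sketch; the viewport values mirror your options above, and the base64 handling is an assumption, not tested code):
const puppeteer = require('puppeteer');

// DataLoader batch function: takes an array of URLs, resolves with base64 images in the same order
async function takeScreenshots(urls) {
  const browser = await puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox'] });
  try {
    const page = await browser.newPage();
    await page.setViewport({ width: 1200, height: 600, deviceScaleFactor: 2 });
    const images = [];
    for (const url of urls) {
      await page.goto(url, { waitUntil: 'networkidle2' });
      images.push(await page.screenshot({ encoding: 'base64' }));
    }
    return images;
  } finally {
    await browser.close();
  }
}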

How to pass dynamic page automation commands to puppeteer from external file?

I'm trying to pass dynamic page automation commands to puppeteer from an external file. I'm new to puppeteer and node so I apologize in advance.
// app.js
// ========
app.get('/test', (req, res) =>
  (async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('http://testurl.com');
    var events = require('./events.json');
    for (var i = 0; i < events.length; i++) {
      var tmp = events[i];
      await page.evaluate((tmp) => { return Promise.resolve(tmp.event); }, tmp);
    }
    await browser.close();
  })());
My events json file looks like:
// events.json
// ========
[
{
"event":"page.waitFor(4000)"
},
{
"event":"page.click('#aLogin')"
},
{
"event":"page.waitFor(1000)"
}
]
I've tried several variations of the above as well as importing a module that passes the page object to one of the module function, but nothing has worked. Can anyone tell me if this is possible and, if so, how to better achieve this?
The solution is actually very simple and straightforward; you just have to understand how this works.
First of all, you cannot pass page commands to evaluate as strings like that. Instead, you can do the following.
In a separate file:
module.exports = async function getCommands(page) {
return Promise.all([
await page.waitFor(4000),
await page.click("#aLogin"),
await page.waitFor(1000)
]);
};
Now, in your main file:
await require('./events.js').getCommands(page);
There, it's done! It'll execute all commands for you one by one just as you wanted.
Here is a complete example with some adjustments:
const puppeteer = require("puppeteer");

async function getCommands(page) {
  return Promise.all([
    await page.title(),
    await page.waitFor(1000)
  ]);
}

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com");
  let data = await getCommands(page);
  console.log(data);
  await page.close();
  await browser.close();
})();
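If you'd still prefer to keep the steps in an external JSON file rather than a JS module, another option (my own sketch, not part of the answer above) is to make the JSON declarative data and map each entry onto a whitelisted page call:
// events.json becomes data instead of code strings, e.g.
// [ { "action": "wait", "ms": 4000 }, { "action": "click", "selector": "#aLogin" } ]
const events = require('./events.json');

async function runEvents(page, events) {
  for (const ev of events) {
    switch (ev.action) {
      case 'wait':
        await new Promise((resolve) => setTimeout(resolve, ev.ms));
        break;
      case 'click':
        await page.click(ev.selector);
        break;
      default:
        throw new Error(`Unknown action: ${ev.action}`);
    }
  }
}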
