Puppeteer DevTools Programmatically - node.js

I can open the DevTools that ship with Puppeteer, but how do I write data to the console section and export that console log to the cmd screen?
In Puppeteer, I want to print to the console as below and capture the output shown below.

You are asking for two things here:
Capture console.log messages to the command prompt
Run a JavaScript command inside Puppeteer
For the first point you can set dumpio: true as a launch option.
For the second point you can jump into the page using evaluate and make a call to console.log.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    dumpio: true
  });
  const page = await browser.newPage();
  const url = "https://stackoverflow.com";
  await page.goto(url);
  await page.waitForSelector('h1'); // waitFor is deprecated; waitForSelector works across versions
  await page.evaluate(() => {
    console.log(document.getElementsByTagName("h1")[0].innerText);
  });
  console.log("Done.");
  await browser.close();
})();
Also, for brevity, if you are getting too much output you can omit dumpio and instead catch the log as an event, e.g.
page.on('console', (msg) => console.log('PAGE LOG:', msg.text())); // msg.text() is the public API; the private msg._text may break across versions
await page.waitForSelector('h1');
await page.evaluate(() => {
  console.log(1 + 2);
  console.log(document.getElementsByTagName("h1")[0].innerText);
});
The second script prints:
PAGE LOG: 3
PAGE LOG: We <3 people who code
Done.


Puppeteer iterates through all pages but browser won't close after

After the program iterates through all pages, it doesn't break out of the while loop and close the browser. Instead, it runs through the while loop one extra time and gives me an error: "TimeoutError: waiting for selector `.pager .next a` failed: timeout 30000ms exceeded". What went wrong?
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  const page = await browser.newPage();
  await page.goto("http://books.toscrape.com/");
  let isLastPage = false;
  while (!isLastPage) {
    await page.waitForSelector(".pager .next a");
    isLastPage = (await page.$(".pager .next a")) === null;
    await page.click(".pager .next a");
  }
  console.log("done");
  await browser.close();
})();
Your last-page detection logic is flawed. While you're on a page, you're trying both to check whether ".pager .next a" exists AND to click it. Obviously, if it doesn't exist, you can't click it.
What you want to do is make sure the page is loaded by waiting for .pager .current, which is part of the navigation footer present on every page. Then check whether .pager .next a is there BEFORE you click, and if it's not, just break out of the while loop. If the page is dynamic and you need to use puppeteer, you can do something like this:
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch({ headless: false });
  try {
    const page = await browser.newPage();
    await page.goto("http://books.toscrape.com/");
    let cntr = 0;
    while (true) {
      await page.waitForSelector(".pager .current");
      console.log(`page ${++cntr}`);
      // process the page content here
      if ((await page.$(".pager .next a")) === null) {
        break;
      }
      await page.click(".pager .next a");
    }
    console.log("done");
  } finally {
    await browser.close();
  }
})();
And to make sure that you always close the browser, even upon errors, you need to catch any errors and close the browser in those conditions. In this case, you can use try/finally.
If the page is not dynamic, you can also just use plain GET requests and use cheerio to examine what's in the page, which is simpler and doesn't involve loading the whole Chromium browser engine.
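For example, here is a minimal sketch of that static approach (assuming cheerio is installed and Node 18+, so the built-in fetch is available):
const cheerio = require("cheerio");

(async () => {
  // Plain GET request, no browser engine involved.
  const res = await fetch("http://books.toscrape.com/");
  const $ = cheerio.load(await res.text());
  // On this site the book titles live in the title attribute of h3 > a.
  $("h3 a").each((i, el) => console.log($(el).attr("title")));
})();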

How to tell Puppeteer to finish everything and give me what it has so far

I am struggling to get Puppeteer to just finish all activity so I can get the page content. It fails silently on await page.content();.
This is for pages that often have a few resources that never really finish loading, so the browser is stuck in a "page not finished loading" state (the Chromium tab spinner keeps spinning).
So I am looking for a way to force the page to conclude the rendering process with everything it has done so far. I assumed await client.send("Page.stopLoading"); would do that for me, but it's pretty much like pressing X in the browser, and it's not telling Chromium to finish loading.
const p = require("puppeteer"); // ua and url are defined elsewhere

(async () => {
  const browser = await p.launch({
    defaultViewport: { width: 800, height: 600, isMobile: true },
  });
  const page = await browser.newPage();
  const client = await page.target().createCDPSession();
  await page.setUserAgent(ua);
  try {
    await page.goto(url, { waitUntil: "networkidle2", timeout: 5000 }).catch(e => void e);
    await client.send("Page.stopLoading");
    console.log("tried to stop");
    const content = await page.content();
    console.log(content);
  } catch (e) {
    console.log("fail");
  }
  await browser.close();
})();
I also tried creating an array of all requests and looking for ones the interceptor had not handled, in order to abort hanging requests, but that approach produced nothing either.
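For reference, a rough sketch of that request-tracking idea (the inflight set below is illustrative, not from the original post; note that Puppeteer's interception API only lets you abort a request inside the 'request' handler, before continue() is called, so requests that were already continued cannot be aborted later, which may be why this produced nothing):
// Track which requests are still in flight.
const inflight = new Set();
page.on("request", (req) => inflight.add(req.url()));
page.on("requestfinished", (req) => inflight.delete(req.url()));
page.on("requestfailed", (req) => inflight.delete(req.url()));
// Later, inspect which requests are still hanging:
// console.log("still pending:", [...inflight]);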

playwright - get content from multiple pages in parallel

I am trying to get the page content from multiple URLs using playwright in a nodejs application. My code looks like this:
const getContent = async (url: string): Promise<string> => {
  const browser = await firefox.launch({ headless: true });
  const page = await browser.newPage();
  try {
    await page.goto(url, {
      waitUntil: 'domcontentloaded',
    });
    return await page.content();
  } finally {
    await page.close();
    await browser.close();
  }
};
const items = [
  {
    urls: ["https://www.google.com", "https://www.example.com"]
    // other props
  },
  {
    urls: ["https://www.google.com", "https://www.example.com"]
    // other props
  },
  // more items...
];
await Promise.all(
  items.map(async (item) => {
    const contents = [];
    for (const url of item.urls) {
      contents.push(await getContent(url));
    }
    return contents;
  })
);
I am getting errors like error (Page.content): Target closed., but I noticed that if I just run it without the loop:
const content = await getContent('https://www.example.com');
it works.
It looks like the iterations of the loop share the same instance of browser and/or page, so they are closing or navigating away from each other.
To test it I built a web API around the getContent function, and when I send 2 requests (almost) at the same time one of them fails, whereas if I send one request at a time it always works.
Is there a way to make playwright work in parallel?
I don't know if that solves it, but make sure nothing is missing an await: both firefox.launch(...) and browser.newPage() are asynchronous and need an await in front.
Also, you don't need to launch a new browser so many times. Playwright has the feature of isolated browser contexts, which are created much faster than launching a browser. It's worth experimenting with launching the browser once, before the getContent function, and using
const context = await browser.newContext();
const page = await context.newPage();
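Putting that together, a minimal sketch of the shared-browser approach, assuming one Firefox instance launched up front and one isolated context per URL:
const { firefox } = require("playwright");

(async () => {
  const browser = await firefox.launch({ headless: true });

  // One isolated context (and page) per URL; the browser itself is shared.
  const getContent = async (url) => {
    const context = await browser.newContext();
    const page = await context.newPage();
    try {
      await page.goto(url, { waitUntil: "domcontentloaded" });
      return await page.content();
    } finally {
      await context.close(); // also closes the page
    }
  };

  const urls = ["https://www.example.com", "https://www.google.com"];
  const contents = await Promise.all(urls.map(getContent));
  console.log(contents.map((c) => c.length));

  await browser.close();
})();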

Puppeteer page.$$ returns empty array

I'm working on a simple scraper but I can't get past this issue.
It returns an empty array every time I run it; however, the site does contain the elements and returns a NodeList when I run querySelectorAll in the console.
Is there anything I might be overlooking? I've already tried waitForSelector to wait for it, but no luck; it just times out.
Thank you
const puppeteer = require('puppeteer');

const scraper = async () => {
  try {
    const browser = await puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox'] });
    const page = await browser.newPage();
    await page.goto('https://randomtodolistgenerator.herokuapp.com/library');
    const elements = await page.$$(".card-body");
    console.log(elements);
    await browser.close();
  } catch (error) {
    console.log(error);
  }
};
It turned out that WSL was not able to run Chromium for some reason.
I ended up installing Linux on a VM and it is working now.

How to pass dynamic page automation commands to puppeteer from external file?

I'm trying to pass dynamic page automation commands to puppeteer from an external file. I'm new to puppeteer and node so I apologize in advance.
// app.js
// ========
app.get('/test', (req, res) =>
  (async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.goto('http://testurl.com');
    var events = require('./events.json');
    for (var i = 0; i < events.length; i++) {
      var tmp = events[i];
      await page.evaluate((tmp) => { return Promise.resolve(tmp.event); }, tmp);
    }
    await browser.close();
  })());
My events.json file looks like:
// events.json
// ========
[
  {
    "event": "page.waitFor(4000)"
  },
  {
    "event": "page.click('#aLogin')"
  },
  {
    "event": "page.waitFor(1000)"
  }
]
I've tried several variations of the above, as well as importing a module that passes the page object to one of the module's functions, but nothing has worked. Can anyone tell me if this is possible and, if so, how to better achieve this?
The solution is actually very simple and straightforward; you just have to understand how this works.
First of all, you cannot pass puppeteer commands as strings to evaluate like that: evaluate runs inside the browser, where the page object does not exist, so your snippet only resolves the string instead of executing it. Instead you can do the following.
In a separate file:
// events.js
const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

module.exports = async function getCommands(page) {
  // Each await finishes before the next starts, so the steps run in order.
  await delay(4000); // page.waitFor is deprecated in newer Puppeteer
  await page.click("#aLogin");
  await delay(1000);
};
Now in your main file:
await require('./events.js').getCommands(page);
There, it's done! It'll execute all commands for you one by one just as you wanted.
Here is complete code with some adjustments:
const puppeteer = require("puppeteer");

const delay = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function getCommands(page) {
  const title = await page.title();
  await delay(1000);
  return title;
}

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://example.com");
  let data = await getCommands(page);
  console.log(data); // e.g. "Example Domain"
  await page.close();
  await browser.close();
})();
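If the steps must stay in an external JSON file, one hypothetical variant (the { "method": ..., "args": ... } shape and the runCommands helper below are illustrative, not from the original post) is to store method names plus arguments and dispatch them against a whitelist instead of evaluating strings:
// steps.json: [{ "method": "click", "args": ["#aLogin"] }, ...]
const allowed = new Set(["goto", "click", "waitForSelector"]);

async function runCommands(page, steps) {
  for (const { method, args = [] } of steps) {
    // Refuse anything outside the whitelist so the JSON can't call arbitrary methods.
    if (!allowed.has(method)) throw new Error(`Command not allowed: ${method}`);
    await page[method](...args); // e.g. page.click("#aLogin")
  }
}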
