How to pass dynamic page automation commands to puppeteer from external file? - node.js

I'm trying to pass dynamic page automation commands to puppeteer from an external file. I'm new to puppeteer and node so I apologize in advance.
// app.js
// ========
app.get('/test', (req, res) =>
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('http://testurl.com');
var events = require('./events.json');
for(var i=0;i<events.length;i++){
var tmp = events[i];
await page.evaluate((tmp) => { return Promise.resolve(tmp.event); }, tmp);
}
await browser.close();
})());
My events json file looks like:
// events.json
// ========
[
{
"event":"page.waitFor(4000)"
},
{
"event":"page.click('#aLogin')"
},
{
"event":"page.waitFor(1000)"
}
]
I've tried several variations of the above, as well as importing a module that passes the page object to one of the module's functions, but nothing has worked. Can anyone tell me if this is possible and, if so, how to better achieve it?

The solution is actually very simple and straightforward. You just have to understand how this works.
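First of all, you cannot pass page commands to evaluate like that. page.evaluate runs its function inside the browser page, where the Node-side page object does not exist, and the string in tmp.event is simply returned rather than executed. Instead, you can do the following.
In a separate file,
// events.js
// Export getCommands as a named property so require('./events.js').getCommands works.
module.exports.getCommands = async function getCommands(page) {
// Each await finishes before the next one starts, so the commands run in order.
// Promise.all only collects the already-resolved results into an array.
return Promise.all([
await page.waitFor(4000),
await page.click("#aLogin"),
await page.waitFor(1000)
]);
};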
Now on your main file,
await require('./events.js').getCommands(page);
There, it's done! It'll execute all commands for you one by one just as you wanted.
Here is a complete example with some adjustments:
const puppeteer = require("puppeteer");
async function getCommands(page) {
return Promise.all([
await page.title(),
await page.waitFor(1000)
]);
};
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://example.com");
let data = await getCommands(page);
console.log(data);
await page.close();
await browser.close();
})();
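If the commands really do need to live in a JSON file, one option is to store action names and arguments instead of code strings, and map them to page calls through a whitelist. The following is only a sketch under assumptions: the `action`/`args` field names, the `actions` table and `runEvents` are hypothetical and not part of Puppeteer, and the wait is a plain timer rather than page.waitFor.

```javascript
// Hypothetical whitelist: each action name maps to a function of (page, ...args).
const actions = {
  // Plain timer, so it does not depend on any particular Puppeteer wait helper.
  wait: (page, ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  click: (page, selector) => page.click(selector),
};

// Runs the events strictly one after another and collects their results.
async function runEvents(page, events) {
  const results = [];
  for (const { action, args = [] } of events) {
    const handler = actions[action];
    if (!handler) {
      throw new Error(`Unknown action: ${action}`);
    }
    results.push(await handler(page, ...args));
  }
  return results;
}

// Assumed shape of events.json for this sketch:
// [{ "action": "wait", "args": [4000] }, { "action": "click", "args": ["#aLogin"] }]
```

With a real page object this would be called as `await runEvents(page, require('./events.json'));`. Unknown action names fail loudly instead of being evaluated as code.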

Related

Puppeteer - ExpressJS infinite loop a function

I have a problem with my Express JS app: when I try to call a function, it is called endlessly. It opens a lot of Chromium browsers and causes performance issues.
I just want to call this function once.
I've found a way to make it work (so that it is called just once), but in this situation I can't pass any parameters:
const farm = (async () => {
const browser = await puppeteer.launch({headless: true});
const page = await browser.newPage();
await page.goto("https://www.example.com/?s=" + term);
await page.waitForSelector("div");
const postLinks = await page.evaluate(() => {
let postLinks = [];
let elements = document.querySelectorAll('div.article');
for (element of elements) {
postLinks.push({
title: element.querySelector('div.meta-info > h3 > a')?.textContent,
url: element.querySelector('div.meta-info > h3 > a')?.href
})
}
return postLinks;
});
console.log(postLinks);
await browser.close();
})();
app.get('/', (req, res) => {
var term = "Drake";
res.send(farm);
});
With the code below, I can pass parameters but I can't return the result in "res.send", and the function is called endlessly :
const farm = async (term) => {
const browser = await puppeteer.launch({headless: true});
const page = await browser.newPage();
await page.goto("https://www.example.com/?s=" + term);
await page.waitForSelector("div");
const postLinks = await page.evaluate(() => {
let postLinks = [];
let elements = document.querySelectorAll('div.article');
for (element of elements) {
postLinks.push({
title: element.querySelector('div.meta-info > h3 > a')?.textContent,
url: element.querySelector('div.meta-info > h3 > a')?.href
})
}
return postLinks;
});
console.log(postLinks);
await browser.close();
}
app.get('/', (req, res) => {
var term = "Drake";
var results = farm(term);
res.send(results);
});
Did I miss something ?
Thanks !
It's not an infinite loop but an unresolved promise. farm returns a promise, and you send that pending promise instead of waiting for it to resolve, i.e. before Puppeteer is done.
You need to wait for farm's promise to resolve: make the route handler async and await the farm call. (farm as posted only logs postLinks, so it also needs a return postLinks at the end; otherwise the awaited result is undefined.)
app.get('/', async(req, res) => {
var term = "Drake";
// farm returns a promise, so wait for it to resolve before sending the response;
// without await, res.send receives the pending promise (await pauses only this handler, not the server)
var results = await farm(term);
res.send(results);
});
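The difference between sending the promise and sending its result can be seen without Express or Puppeteer. In this minimal sketch, `fakeFarm` and `demo` are hypothetical stand-ins: `fakeFarm` resolves with a small array after a short delay, in place of the real scraping function.

```javascript
// Hypothetical stand-in for farm(): resolves with results after a short delay.
const fakeFarm = async (term) => {
  await new Promise((resolve) => setTimeout(resolve, 10));
  return [{ title: `Result for ${term}` }];
};

async function demo() {
  // Not awaited: this is a Promise object, not the data.
  const pending = fakeFarm("Drake");
  // Awaited: this is the array the async function returned.
  const results = await fakeFarm("Drake");
  return { pendingIsPromise: pending instanceof Promise, results };
}
```

Note that `fakeFarm` returns its data explicitly; an async function without a return statement resolves with undefined.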

playwright - get content from multiple pages in parallel

I am trying to get the page content from multiple URLs using playwright in a nodejs application. My code looks like this:
const getContent = async (url: string): Promise<string> => {
const browser = await firefox.launch({ headless: true });
const page = await browser.newPage();
try {
await page.goto(url, {
waitUntil: 'domcontentloaded',
});
return await page.content();
} finally {
await page.close();
await browser.close();
}
}
const items = [
{
urls: ["https://www.google.com", "https://www.example.com"]
// other props
},
{
urls: ["https://www.google.com", "https://www.example.com"]
// other props
},
// more items...
]
await Promise.all(
items.map(async (item) => {
const contents = [];
// for...of iterates over the URLs themselves; for...in would yield the array indices
for (const url of item.urls) {
contents.push(await getContent(url))
}
return contents;
})
)
I am getting errors like "error (Page.content): Target closed.", but I noticed that if I just run it without the loop:
const content = getContent('https://www.example.com');
It works.
It looks like the iterations of the loop share the same instance of browser and/or page, so they close or navigate away from each other's pages.
To test this I built a web API around the getContent function. When I send 2 requests at (almost) the same time, one of them fails, whereas if I send one request at a time it always works.
Is there a way to make playwright work in parallel?
I don't know if that solves it, but noticed there are two missing awaits. Both the firefox.launch(...) and the browser.newPage() are asynchronous and need an await in front.
Also, you don't need to launch a new browser so many times. Playwright has isolated browser contexts, which are created much faster than launching a browser. It's worth experimenting with launching the browser once, outside the getContent function, and creating a new context per call:
const context = await browser.newContext();
const page = await context.newPage();
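Putting the suggestion together, a sketch might look like the following. It assumes a single browser launched once elsewhere (for example with `firefox.launch()`) and passed in; `getAll` is a hypothetical helper, and this is not confirmed to resolve the "Target closed" error in the question.

```javascript
// Sketch: one shared browser, one isolated context per URL.
// `browser` is assumed to be a Playwright Browser launched once by the caller.
async function getContent(browser, url) {
  const context = await browser.newContext();
  try {
    const page = await context.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    return await page.content();
  } finally {
    // Closing the context also closes the pages that belong to it.
    await context.close();
  }
}

// Hypothetical helper: fetch several URLs in parallel against the same browser.
async function getAll(browser, urls) {
  return Promise.all(urls.map((url) => getContent(browser, url)));
}
```

The caller stays responsible for closing the browser once all work has finished, so no call can close it underneath another.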

Simple web scraping with puppeteer / cheerio not working with params

I am trying to scrape https://www.premierleague.com/clubs/38/Wolverhampton-Wanderers/stats?se=274
The results being returned are for the page minus the ?se=274
This is applied by using the filter dropdown on the page and selecting 2019/20 season. I can navigate directly to the page and it works fine, but through code it does not work.
I have tried in cheerio and puppeteer. I was going to try nightmare too but this seems overkill I think. I am clearly not an expert! ;)
function getStats(callback){
var url = "https://www.premierleague.com/clubs/38/Wolverhampton-Wanderers/stats?se=274";
request(url, function (error, response, html) {
//console.log(html);
var $ = cheerio.load(html);
if(!error){
$('.allStatContainer.statontarget_scoring_att').filter(function(){
var data = $(this);
var vSOT = data.text();
//console.log(data);
console.log(vSOT);
});
}
});
callback;
}
This will return 564 instead of 2
It seems like you're calling callback before request returns. Move the callback call into the internal block, where the task you need is completed (in your case, it looks like the filter block).
It also looks like you're missing the () on the callback call.
Also, a recommendation: return the value you need through the callback.
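As a rough illustration of those three points, the sketch below calls the callback from inside the asynchronous handler and passes the value through it. `fetchHtml` is a hypothetical stand-in for request(url, cb) that returns canned markup, and the regular expression replaces the cheerio selector purely to keep the example self-contained.

```javascript
// Hypothetical stand-in for request(url, cb): delivers canned HTML asynchronously.
function fetchHtml(url, cb) {
  setTimeout(() => cb(null, '<span class="stat">2</span>'), 10);
}

function getStats(callback) {
  fetchHtml('https://example.com/stats', (error, html) => {
    if (error) {
      return callback(error);
    }
    // Extract the value, then hand it to the caller through the callback.
    const match = html.match(/<span class="stat">(\d+)<\/span>/);
    callback(null, match ? Number(match[1]) : null);
  });
}
```

The callback is invoked with parentheses, only after the response has arrived, and it receives the extracted value instead of relying on console.log.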
So this code works....$10 from a rent-a-coder did the trick. Easy when you know how!
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch()
const page = await browser.newPage()
await page.goto('https://www.premierleague.com/clubs/4/Chelsea/stats?se=274')
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms))
await sleep(4000)
const element = await page.$(".allStatContainer.statontarget_scoring_att");
const text = await page.evaluate(element => element.textContent, element);
console.log("Shots on Target:"+text)
await browser.close()
})()

Getting DOM node text with Puppeteer and headless Chrome

I'm trying to use headless Chrome and Puppeteer to run our Javascript tests, but I can't extract the results from the page. Based on this answer, it looks like I should use page.evaluate(). That section even has an example that looks like what I need.
const bodyHandle = await page.$('body');
const html = await page.evaluate(body => body.innerHTML, bodyHandle);
await bodyHandle.dispose();
As a full example, I tried to convert that to a script that will extract my name from my user profile on Stack Overflow. Our project is using Node 6, so I converted the await expressions to use .then().
const puppeteer = require('puppeteer');
puppeteer.launch().then(function(browser) {
browser.newPage().then(function(page) {
page.goto('https://stackoverflow.com/users/4794').then(function() {
page.$('h2.user-card-name').then(function(heading_handle) {
page.evaluate(function(heading) {
return heading.innerText;
}, heading_handle).then(function(result) {
console.info(result);
browser.close();
}, function(error) {
console.error(error);
browser.close();
});
});
});
});
});
When I run that, I get this error:
$ node get_user.js
TypeError: Converting circular structure to JSON
at Object.stringify (native)
at args.map.x (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/helper.js:30:43)
at Array.map (native)
at Function.evaluationString (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/helper.js:30:29)
at Frame.<anonymous> (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:376:31)
at next (native)
at step (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:355:24)
at Promise (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:373:12)
at fn (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:351:10)
at Frame._rawEvaluate (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:375:3)
The problem seems to be with serializing the input parameter to page.evaluate(). I can pass in strings and numbers, but not element handles. Is the example wrong, or is it a problem with Node 6? How can I extract the text of a DOM node?
I found three solutions to this problem, depending on how complicated your extraction is. The simplest option is a related function that I hadn't noticed: page.$eval(). It basically does what I was trying to do: combines page.$() and page.evaluate(). Here's an example that works:
const puppeteer = require('puppeteer');
puppeteer.launch().then(function(browser) {
browser.newPage().then(function(page) {
page.goto('https://stackoverflow.com/users/4794').then(function() {
page.$eval('h2.user-card-name', function(heading) {
return heading.innerText;
}).then(function(result) {
console.info(result);
browser.close();
});
});
});
});
That gives me the expected result:
$ node get_user.js
Don Kirkby top 2% overall
I wanted to extract something more complicated, but I finally realized that the evaluation function is running in the context of the page. That means you can use any tools that are loaded in the page, and then just send strings and numbers back and forth. In this example, I use jQuery in a string to extract what I want:
const puppeteer = require('puppeteer');
puppeteer.launch().then(function(browser) {
browser.newPage().then(function(page) {
page.goto('https://stackoverflow.com/users/4794').then(function() {
page.evaluate("$('h2.user-card-name').text()").then(function(result) {
console.info(result);
browser.close();
});
});
});
});
That gives me a result with the whitespace intact:
$ node get_user.js
Don Kirkby
top 2% overall
In my real script, I want to extract the text of several nodes, so I need a function instead of a simple string:
const puppeteer = require('puppeteer');
puppeteer.launch().then(function(browser) {
browser.newPage().then(function(page) {
page.goto('https://stackoverflow.com/users/4794').then(function() {
page.evaluate(function() {
return $('h2.user-card-name').text();
}).then(function(result) {
console.info(result);
browser.close();
});
});
});
});
That gives the exact same result. Now I need to add error handling, and maybe reduce the indentation levels.
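One way to flatten the nesting and add error handling without async/await is to return each promise so the chain stays at one level. This is only a sketch: `getHeadingText` is a hypothetical helper, and taking `puppeteer` as a parameter is an assumption made so the function can be exercised without a real browser.

```javascript
// Hypothetical helper: flat promise chain that always closes the browser.
function getHeadingText(puppeteer, url, selector) {
  let browser;
  return puppeteer.launch()
    .then((launched) => {
      browser = launched;
      return browser.newPage();
    })
    .then((page) => page.goto(url).then(() => page.$eval(selector, (el) => el.innerText)))
    .then(
      // Success: close the browser, then pass the text on.
      (text) => browser.close().then(() => text),
      // Failure: close the browser if it was launched, then rethrow.
      (error) => (browser ? browser.close() : Promise.resolve()).then(() => {
        throw error;
      })
    );
}
```

A caller would use `getHeadingText(require('puppeteer'), url, 'h2.user-card-name').then(console.info, console.error)`.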
Using await/async and $eval, the syntax looks like the following:
await page.goto('https://stackoverflow.com/users/4794')
const nameElement = await page.$eval('h2.user-card-name', el => el.textContent)
console.log(nameElement)
I use page.$eval
const text = await page.$eval('h2.user-card-name', el => el.innerText );
console.log(text);
I had success using the following:
const browser = await puppeteer.launch();
try {
const page = await browser.newPage();
await page.goto(url);
await page.waitFor(2000);
let html_content = await page.evaluate(el => el.innerHTML, await page.$('.element-class-name'));
console.log(html_content);
} catch (err) {
console.log(err);
} finally {
// Always close the browser, even if navigation or evaluation throws.
await browser.close();
}
Hope it helps.

async/await issues with Chrome remote interface

I'd like to test this piece of code and wait until it's done to assert the results. Not sure where the issue is, it should return the Promise.resolve() at the end, but logs end before the code is executed.
Should Page.loadEventFired also be preceded by await?
const CDP = require('chrome-remote-interface')
async function x () {
const protocol = await CDP()
const timeout = ms => new Promise(resolve => setTimeout(resolve, ms))
// See API docs: https://chromedevtools.github.io/devtools-protocol/
const { Page, Runtime, DOM } = protocol
await Promise.all([Page.enable(), Runtime.enable(), DOM.enable()])
Page.navigate({ url: 'http://example.com' })
// wait until the page says it's loaded...
return Page.loadEventFired(async () => {
console.log('Page loaded! Now waiting a few seconds for all the JS to load...')
await timeout(3000) // give the JS some time to load
protocol.close()
console.log('Processing page source...')
console.log('Doing some fancy stuff here ...')
console.log('All done.')
return Promise.resolve()
})
}
(async function () {
console.log('start')
await x()
console.log('end')
})()
Yes, you should await Page.loadEventFired. Example:
async function x () {
const protocol = await CDP()
const timeout = ms => new Promise(resolve => setTimeout(resolve, ms))
// See API docs: https://chromedevtools.github.io/devtools-protocol/
const { Page, Runtime, DOM } = protocol
await Promise.all([Page.enable(), Runtime.enable(), DOM.enable()])
await Page.navigate({ url: 'http://example.com' })
// wait until the page says it's loaded...
await Page.loadEventFired()
console.log('Page loaded! Now waiting a few seconds for all the JS to load...')
await timeout(3000) // give the JS some time to load
protocol.close()
console.log('Processing page source...')
console.log('Doing some fancy stuff here ...')
console.log('All done.')
}
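The reason the original version logged "end" too early is that passing a callback only registers a listener and returns straight away, so there is nothing for the outer await to wait on. The same distinction can be shown with Node's own EventEmitter; in this sketch `onceEvent` and `waitForLoad` are hypothetical names and the emitter merely stands in for the Page domain.

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: turn a one-shot event into a promise that can be awaited.
function onceEvent(emitter, name) {
  return new Promise((resolve) => emitter.once(name, resolve));
}

async function waitForLoad(emitter, log) {
  log.push('waiting');
  // Execution of this function pauses here until the event actually fires.
  await onceEvent(emitter, 'load');
  log.push('loaded');
}
```

Anything written after the await runs only once the event has fired, which is the ordering the question was after.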
BTW you might also want to wrap your code with try-finally to always close protocol.
async function x () {
let protocol
try {
protocol = await CDP()
...
} finally {
if(protocol) protocol.close()
}
}

Resources