Getting DOM node text with Puppeteer and headless Chrome - node.js

I'm trying to use headless Chrome and Puppeteer to run our JavaScript tests, but I can't extract the results from the page. Based on this answer, it looks like I should use page.evaluate(). That section even has an example that looks like what I need:
const bodyHandle = await page.$('body');
const html = await page.evaluate(body => body.innerHTML, bodyHandle);
await bodyHandle.dispose();
As a full example, I tried to convert that to a script that will extract my name from my user profile on Stack Overflow. Our project is using Node 6, so I converted the await expressions to use .then().
const puppeteer = require('puppeteer');
puppeteer.launch().then(function(browser) {
browser.newPage().then(function(page) {
page.goto('https://stackoverflow.com/users/4794').then(function() {
page.$('h2.user-card-name').then(function(heading_handle) {
page.evaluate(function(heading) {
return heading.innerText;
}, heading_handle).then(function(result) {
console.info(result);
browser.close();
}, function(error) {
console.error(error);
browser.close();
});
});
});
});
});
When I run that, I get this error:
$ node get_user.js
TypeError: Converting circular structure to JSON
at Object.stringify (native)
at args.map.x (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/helper.js:30:43)
at Array.map (native)
at Function.evaluationString (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/helper.js:30:29)
at Frame.<anonymous> (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:376:31)
at next (native)
at step (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:355:24)
at Promise (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:373:12)
at fn (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:351:10)
at Frame._rawEvaluate (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:375:3)
The problem seems to be with serializing the input parameter to page.evaluate(). I can pass in strings and numbers, but not element handles. Is the example wrong, or is it a problem with Node 6? How can I extract the text of a DOM node?

I found three solutions to this problem, depending on how complicated your extraction is. The simplest option is a related function that I hadn't noticed: page.$eval(). It basically does what I was trying to do: combines page.$() and page.evaluate(). Here's an example that works:
const puppeteer = require('puppeteer');
puppeteer.launch().then(function(browser) {
browser.newPage().then(function(page) {
page.goto('https://stackoverflow.com/users/4794').then(function() {
page.$eval('h2.user-card-name', function(heading) {
return heading.innerText;
}).then(function(result) {
console.info(result);
browser.close();
});
});
});
});
That gives me the expected result:
$ node get_user.js
Don Kirkby top 2% overall
I wanted to extract something more complicated, but I finally realized that the evaluation function is running in the context of the page. That means you can use any tools that are loaded in the page, and then just send strings and numbers back and forth. In this example, I use jQuery in a string to extract what I want:
const puppeteer = require('puppeteer');
puppeteer.launch().then(function(browser) {
browser.newPage().then(function(page) {
page.goto('https://stackoverflow.com/users/4794').then(function() {
page.evaluate("$('h2.user-card-name').text()").then(function(result) {
console.info(result);
browser.close();
});
});
});
});
That gives me a result with the whitespace intact:
$ node get_user.js
Don Kirkby
top 2% overall
In my real script, I want to extract the text of several nodes, so I need a function instead of a simple string:
const puppeteer = require('puppeteer');
puppeteer.launch().then(function(browser) {
browser.newPage().then(function(page) {
page.goto('https://stackoverflow.com/users/4794').then(function() {
page.evaluate(function() {
return $('h2.user-card-name').text();
}).then(function(result) {
console.info(result);
browser.close();
});
});
});
});
That gives the exact same result. Now I need to add error handling, and maybe reduce the indentation levels.
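One way to add the error handling and flatten the nesting is sketched below. It is only a sketch: fetchText and its parameters are hypothetical names, and the puppeteer module is passed in as an argument. It avoids async/await and Promise.prototype.finally so that it should still run on Node 6.

```javascript
// Sketch only: `fetchText` is a hypothetical helper, not part of Puppeteer.
// `puppeteer` is passed in, e.g. fetchText(require('puppeteer'), url, selector).
function fetchText(puppeteer, url, selector) {
  return puppeteer.launch().then(function (browser) {
    // Close the browser first, then continue with the given action.
    function closeThen(action) {
      return browser.close().then(action);
    }
    return browser.newPage()
      .then(function (page) {
        return page.goto(url).then(function () {
          // The callback runs inside the page; only the string comes back.
          return page.$eval(selector, function (el) {
            return el.innerText;
          });
        });
      })
      .then(
        function (text) {
          return closeThen(function () { return text; });
        },
        function (error) {
          // Re-throw after cleanup so the caller still sees the failure.
          return closeThen(function () { throw error; });
        }
      );
  });
}
```

A caller would then write something like fetchText(require('puppeteer'), 'https://stackoverflow.com/users/4794', 'h2.user-card-name').then(console.info, console.error), and the browser is closed on both the success and the failure path.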

Using async/await and $eval, the syntax looks like the following:
await page.goto('https://stackoverflow.com/users/4794')
const name = await page.$eval('h2.user-card-name', el => el.textContent)
console.log(name)

I use page.$eval
const text = await page.$eval('h2.user-card-name', el => el.innerText );
console.log(text);
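If you need the text of several matching nodes, page.$$eval() passes an array of elements to the callback. A minimal sketch, assuming page is an open Puppeteer page (getAllTexts is a hypothetical helper name):

```javascript
// Sketch only: `getAllTexts` is a hypothetical helper name.
async function getAllTexts(page, selector) {
  // The callback runs in the browser and receives every matching element.
  // Only the resulting array of strings is sent back to Node.
  return page.$$eval(selector, (elements) =>
    elements.map((el) => el.textContent.trim())
  );
}
```

For example, const names = await getAllTexts(page, 'h2.user-card-name'); resolves to an array with one trimmed string per matching heading.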

I had success using the following:
const browser = await puppeteer.launch();
try {
const page = await browser.newPage();
await page.goto(url);
await page.waitFor(2000);
let html_content = await page.evaluate(el => el.innerHTML, await page.$('.element-class-name'));
console.log(html_content);
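} catch (err) {
console.log(err);
} finally {
// Always close the browser so the Chrome process does not linger.
await browser.close();
}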
Hope it helps.
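A fixed two-second pause can be too short on a slow page and wasteful on a fast one. As a possible alternative, here is a small sketch that waits for the element itself with page.waitForSelector(). The helper name waitAndGetHtml and the 5000 ms default are assumptions for illustration.

```javascript
// Sketch only: `waitAndGetHtml` and the default timeout are assumptions.
async function waitAndGetHtml(page, selector, timeoutMs = 5000) {
  // Resolves once the element is in the DOM; rejects after timeoutMs.
  await page.waitForSelector(selector, { timeout: timeoutMs });
  // The callback runs in the page and returns a plain string to Node.
  return page.$eval(selector, (el) => el.innerHTML);
}
```

It would replace the waitFor/evaluate pair above with a single call such as const html_content = await waitAndGetHtml(page, '.element-class-name');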

Related

playwright - get content from multiple pages in parallel

I am trying to get the page content from multiple URLs using playwright in a nodejs application. My code looks like this:
const getContent = async (url: string): Promise<string> => {
const browser = await firefox.launch({ headless: true });
const page = await browser.newPage();
try {
await page.goto(url, {
waitUntil: 'domcontentloaded',
});
return await page.content();
} finally {
await page.close();
await browser.close();
}
}
const items = [
{
urls: ["https://www.google.com", "https://www.example.com"]
// other props
},
{
urls: ["https://www.google.com", "https://www.example.com"]
// other props
},
// more items...
]
await Promise.all(
items.map(async (item) => {
const contents = [];
for (const url of item.urls) {
contents.push(await getContent(url))
}
return contents;
})
);
I am getting errors like error (Page.content): Target closed. However, I noticed that if I just run it without the loop:
const content = await getContent('https://www.example.com');
it works.
It looks like the loop iterations share the same browser and/or page instance, so they close or navigate away from each other's pages.
To test this, I built a web API around the getContent function. When I send two requests at (almost) the same time, one of them fails; when I send one request at a time, it always works.
Is there a way to make playwright work in parallel?
I don't know if that solves it, but I noticed there are two missing awaits. Both firefox.launch(...) and browser.newPage() are asynchronous and need an await in front.
Also, you don't need to launch a new browser so many times. Playwright has isolated browser contexts, which are created much faster than a browser is launched. It's worth experimenting with launching the browser once, outside the getContent function, and creating each page like this:
const context = await browser.newContext();
const page = await context.newPage();
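Put together, that suggestion might look like the sketch below. It assumes browser is a Playwright browser that was launched once elsewhere (for example with await firefox.launch()); getContent and getContents are illustrative names, and the code is plain JavaScript rather than TypeScript.

```javascript
// Sketch only: assumes `browser` was launched once by the caller.
async function getContent(browser, url) {
  // Each call gets its own isolated context, so parallel calls
  // cannot close or navigate each other's pages.
  const context = await browser.newContext();
  try {
    const page = await context.newPage();
    await page.goto(url, { waitUntil: 'domcontentloaded' });
    return await page.content();
  } finally {
    // Closing the context also closes the pages that belong to it.
    await context.close();
  }
}

// Fetch all URLs in parallel against the single shared browser.
async function getContents(browser, urls) {
  return Promise.all(urls.map((url) => getContent(browser, url)));
}
```

The caller stays responsible for the browser: launch it once, await getContents(browser, item.urls) for each item, and call browser.close() when everything has finished.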

How do you call an external API from Selenium script to populate a field using Node JS?

I have a use case where I need to call an external API, parse the JSON that is returned and populate a form field in a web page all within a Selenium script written using Node JS.
Something like this:
// in Selenium script get the form field
let inputElement = await getElementById(driver, "my-id");
// then call an API including callback function
// in the callback function with the JSON response from the API
const myText = response.data.text;
await inputElement.sendKeys(myText,Key.ENTER);
I'm actually not even sure where to start with this, because I would be adding asynchronous code (the API call and waiting for the response in the callback) to the existing asynchronous code that is running as part of the Selenium script. And I need to not lose references to the web driver and the input element.
Some advice and recommendations to get me going would be very helpful.
If you are using Node's built-in https module, then you can do something like this:
const { Builder, By, Key, until } = require("selenium-webdriver");
const https = require("https");
(async function example() {
let driver = await new Builder().forBrowser("chrome").build();
try {
await driver.get("http://www.google.com/ncr");
// https.get is callback-based and does not return a promise,
// so wrap it in one to make the await meaningful.
const data = await new Promise((resolve, reject) => {
https
.get("https://jsonplaceholder.typicode.com/users/1", (resp) => {
let body = "";
resp.on("data", (chunk) => {
body += chunk;
});
resp.on("end", () => resolve(body));
})
.on("error", reject);
});
await driver
.findElement(By.name("q"))
.sendKeys(JSON.parse(data)["name"], Key.RETURN);
await driver.wait(until.titleContains("- Google Search"), 1000);
} finally {
await driver.quit();
}
})();
Or, if you are already using a library like axios, you can do something like this:
const { Builder, By, Key, until } = require("selenium-webdriver");
const axios = require("axios");
(async function example() {
let driver = await new Builder().forBrowser("chrome").build();
try {
await driver.get("http://www.google.com/ncr");
const { data } = await axios.get(
"https://jsonplaceholder.typicode.com/users/1"
);
await driver.findElement(By.name("q")).sendKeys(data["name"], Key.RETURN);
await driver.wait(until.titleContains("- Google Search"), 1000);
} finally {
await driver.quit();
}
})();
Hope this is what you are looking for..
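To keep the API call and the form fill separate, the pattern can be pulled into a small helper. This is only a sketch: fillFromApi and fetchText are hypothetical names, element is assumed to be a selenium-webdriver WebElement, and fetchText is any async function that resolves to the string to type.

```javascript
// Sketch only: `fillFromApi` and `fetchText` are hypothetical names.
async function fillFromApi(element, fetchText, ...extraKeys) {
  // `element` stays in scope across the await, so no reference is lost.
  const text = await fetchText();
  if (typeof text !== 'string' || text.length === 0) {
    throw new Error('API returned no text to enter');
  }
  // sendKeys accepts several arguments, e.g. the text followed by Key.ENTER.
  await element.sendKeys(text, ...extraKeys);
  return text;
}
```

With axios, the call from the question could then look like await fillFromApi(inputElement, async () => (await axios.get(url)).data.text, Key.ENTER); where url stands for your API endpoint.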

Use post variable with querySelector

I'm facing an issue trying to scrape data from the web with Puppeteer and querySelector.
I have a Node.js web server that handles a POST request and then calls a function to scrape the data. I'm sending two parameters (postBlogUrl and postDomValue).
postDomValue contains, as a string, the selector I'm trying to fetch data from, for example:
[itemprop='articleBody'].
If I hard-code the selector ([itemprop='articleBody']), everything works and I'm able to retrieve the data, but if I use the postDomValue variable, nothing is returned.
I already tried to escape the variable using CSS.escape(postDomValue), but no luck.
fetchBlogContent: async function(postBlogUrl, postDomValue) {
try {
const puppeteer = require('puppeteer');
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(postBlogUrl, {
waitUntil: 'load'
})
let description = await page.evaluate(() => {
//This works return document.querySelector("[itemprop='articleBody']").innerHTML;
//This won't return document.querySelector(postDomValue).innerHTML;
})
return description
} catch (err) {
// handle err
return err;
}
}
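Pass the selector as an extra argument to page.evaluate(). Puppeteer serializes arguments itself, so the string needs no JSON.stringify() or escaping:
const description = await page.evaluate((value) =>
document.querySelector(value).innerHTML, postDomValue);
See the docs on how to pass args to page.evaluate() in Puppeteer.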
If I understand correctly, the issue may be that you try to use a variable declared in the Node.js context inside an argument function of page.evaluate() that is executed in the browser context. In such cases, you need to transfer the value of a variable as an additional argument:
let description = await page.evaluate((selector) => {
return document.querySelector(selector).innerHTML;
}, postDomValue);
See more in page.evaluate().
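Since the selector now comes from a request, it may not match anything on the page. Here is a minimal sketch that returns null instead of throwing in that case (getInnerHtmlOrNull is a hypothetical helper name):

```javascript
// Sketch only: `getInnerHtmlOrNull` is a hypothetical helper name.
async function getInnerHtmlOrNull(page, selector) {
  return page.evaluate((sel) => {
    // This body runs in the browser; `sel` is a copy of `selector`.
    const el = document.querySelector(sel);
    return el ? el.innerHTML : null;
  }, selector);
}
```

Inside fetchBlogContent it would be used as const description = await getInnerHtmlOrNull(page, postDomValue); so the caller can tell "no match" apart from an empty element.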

Page.evaluate() returns undefined, but statement works in Chrome devTools

I'm trying to get the src values for all images on Bing image search for a search term. I am using Puppeteer for it. I wrote a selector to grab each image tag and it works in the Chrome DevTools. However, it isn't working when I run it from my code:
const puppeteer = require("puppeteer");
(async () => {
try{
let url = `https://www.bing.com/images/search?q=cannabis`
const browser = await puppeteer.launch({headless: false})
const page = await browser.newPage()
await page.goto(url)
await page.waitForSelector("ul.dgControl_list li img.mimg")
console.log(await page.evaluate(() => {
Array.from(document.querySelectorAll("ul.dgControl_list>li img.mimg"), img => img.src)
}))
} catch(err){
console.log("error - " + err)
}
})()
I get the output as an object containing arrays of 10 items each in the devTools, but when I run it in the console through my code, it is undefined. How do I read this object?
You are not returning any data from the page.evaluate call. To return the data you have to use the return statement or use the short syntax (as explained below):
console.log(await page.evaluate(() => {
return Array.from(document.querySelectorAll("ul.dgControl_list>li img.mimg"), img => img.src)
}))
Explanation: Arrow function
There are two ways to write an arrow function. One is the short syntax, which you can use like this:
const func = () => 1; // func() will simply return 1
You can only put a single expression in there (although that expression may call other functions). Alternatively, you can use the long form:
const func = () => { return 1; }; // Same function as above
You can use variable declarations and any kind of code inside this function (just as in a normal function() { ... }), but this time you have to use return to return a value.
Therefore, as an alternative, you could also write this (short syntax):
console.log(await page.evaluate(
() => Array.from(document.querySelectorAll("ul.dgControl_list>li img.mimg"), img => img.src)
))
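A related pitfall appears when the short syntax should return an object, for example one entry per image. The braces are then read as a function body, not as an object literal. A small self-contained sketch (the img object is made up for illustration):

```javascript
// Braces directly after => start a function body, so this returns
// undefined: `src:` is parsed as a label, not as an object key.
const broken = (img) => { src: img.src };

// Wrapping the literal in parentheses turns it back into an expression.
const fixed = (img) => ({ src: img.src });

// Made-up stand-in for a DOM image element.
const img = { src: 'https://example.com/a.png' };
console.log(broken(img)); // undefined
console.log(fixed(img)); // { src: 'https://example.com/a.png' }
```

The same rule applies to the function passed to page.evaluate(): either wrap the object in parentheses or use an explicit return.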

How to pass dynamic page automation commands to puppeteer from external file?

I'm trying to pass dynamic page automation commands to puppeteer from an external file. I'm new to puppeteer and node so I apologize in advance.
// app.js
// ========
app.get('/test', (req, res) =>
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('http://testurl.com');
var events = require('./events.json');
for(var i=0;i<events.length;i++){
var tmp = events[i];
await page.evaluate((tmp) => { return Promise.resolve(tmp.event); }, tmp);
}
await browser.close();
})());
My events json file looks like:
// events.json
// ========
[
{
"event":"page.waitFor(4000)"
},
{
"event":"page.click('#aLogin')"
},
{
"event":"page.waitFor(1000)"
}
]
I've tried several variations of the above, as well as importing a module and passing the page object to one of its functions, but nothing has worked. Can anyone tell me if this is possible and, if so, how best to achieve it?
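The solution is actually very simple and straightforward. You just have to understand how this works.
First of all, page.evaluate() runs its function inside the browser page, where the Node-side page object does not exist, and a string such as "page.click('#aLogin')" is simply returned, never executed. Instead, you can do the following.
In a separate file,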
module.exports = async function getCommands(page) {
return Promise.all([
await page.waitFor(4000),
await page.click("#aLogin"),
await page.waitFor(1000)
]);
};
Now on your main file,
await require('./events.js')(page);
There, it's done! It'll execute all commands for you one by one just as you wanted.
Here is the complete code, with some adjustments:
const puppeteer = require("puppeteer");
async function getCommands(page) {
return Promise.all([
await page.title(),
await page.waitFor(1000)
]);
};
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://example.com");
let data = await getCommands(page);
console.log(data);
await page.close();
await browser.close();
})();
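If the commands really must come from a JSON file, one option is to store an action name plus arguments instead of code strings, and map each name to a known page method. The sketch below makes assumptions throughout: the { action, args } shape, the action names, and the sleep helper are invented for illustration and are not part of Puppeteer. Only page.click(), page.type() and page.waitForSelector() are real page methods.

```javascript
// Sketch only: the JSON shape and the names below are assumptions.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Whitelist: each allowed action maps to a function that drives the page.
const actions = {
  wait: (page, ms) => sleep(ms),
  click: (page, selector) => page.click(selector),
  type: (page, selector, text) => page.type(selector, text),
  waitForSelector: (page, selector) => page.waitForSelector(selector),
};

async function runCommands(page, events) {
  const results = [];
  for (const { action, args = [] } of events) {
    const known = Object.prototype.hasOwnProperty.call(actions, action);
    if (!known) {
      throw new Error(`Unknown action: ${action}`);
    }
    // Awaiting inside the loop keeps the commands strictly sequential.
    results.push(await actions[action](page, ...args));
  }
  return results;
}
```

events.json would then hold entries such as {"action": "wait", "args": [4000]} and {"action": "click", "args": ["#aLogin"]}, and the route handler would call await runCommands(page, require('./events.json')). Because only whitelisted names are dispatched, nothing from the file is ever evaluated as code.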
