I'm facing an issue trying to scrape datas on the web with puppeteer and querySelector.
I have a nodeJS WebServer that handle a post query, and then call a function to scrape the datas. I'm sending 2 parameters (postBlogUrl & postDomValue).
PostDomValue will contains as string the selector I'm trying to fetch datas from, for example:
[itemprop='articleBody'].
If I manually suggest the selector ([itemprop='articleBody']), everything is working well, I'm able to retrieve datas, but if i use the postDomValue var, nothing is returned.
I already tried to escape the var using CSS.escape(postDomValue), but no luck.
fetchBlogContent: async function(postBlogUrl, postDomValue) {
try {
const puppeteer = require('puppeteer');
const browser = await puppeteer.launch();
page = await browser.newPage();
await page.goto(postBlogUrl, {
waitUntil: 'load'
})
let description = await page.evaluate(() => {
//This works return document.querySelector("[itemprop='articleBody']").innerHTML;
//This won't return document.querySelector(postDomValue).innerHTML;
})
return description
} catch (err) {
// handle err
return err;
}
}
const description = await page.evaluate((value) =>
document.querySelector(value).innerHTML, JSON.stringify(postDomValue));
See docs on how to pass args to page.evaluate() in puppeteer
If I understand correctly, the issue may be that you try to use a variable declared in the Node.js context inside an argument function of page.evaluate() that is executed in the browser context. In such cases, you need to transfer the value of a variable as an additional argument:
let description = await page.evaluate((selector) => {
return document.querySelector(selector).innerHTML;
}, postDomValue);
See more in page.evaluate().
Related
I am trying to get the page content from multiple URLs using playwright in a nodejs application. My code looks like this:
const getContent = async (url: string): Promise<string> {
const browser = await firefox.launch({ headless: true });
const page = await browser.newPage();
try {
await page.goto(url, {
waitUntil: 'domcontentloaded',
});
return await page.content();
} finally {
await page.close();
await browser.close();
}
}
const items = [
{
urls: ["https://www.google.com", "https://www.example.com"]
// other props
},
{
urls: ["https://www.google.com", "https://www.example.com"]
// other props
},
// more items...
]
await Promise.all(
items.map(async (item) => {
const contents = [];
for (url in item.urls) {
contents.push(await getContent(url))
}
return contents;
}
)
I am getting errors like error (Page.content): Target closed. but I noticed that if I just run without loop:
const content = getContent('https://www.example.com');
It works.
It looks like each iteration of the loops share the same instance of browser and/or page, so they are closing/navigating away each other.
To test it I built a web API with the getContent function and when I send 2 requests (almost) at the same time one of them fails, instead if send one request at the time it always works.
Is there a way to make playwright work in parallel?
I don't know if that solves it, but noticed there are two missing awaits. Both the firefox.launch(...) and the browser.newPage() are asynchronous and need an await in front.
Also, you don't need to launch a new browser so many times. PlayWright has the feature of isolated browser contexts, which are created much faster than launching a browser. It's worth experimenting with launching the browser before the getContent function, and using
const context = await browser.newContext();
const page = await context.newPage();
document.querySelectorAll('.summary').innerText;
This throws an error in the below snippet saying "document.querySelector is not a function" in my Puppeteer page's exposed fucntion docTest.
I want to pass a specific node to each method and get the result inside evaluate.
Same with document.getElemenetbyId.
const puppeteer = require('puppeteer');
//var querySelectorAll = require('query-selector');
let docTest = (document) => {
var summary = document.querySelectorAll(.summary).innerText;
console.log(summary);
return summary;
}
let scrape = async () => {
const browser = await puppeteer.launch({
headless: false
});
const page = await browser.newPage();
await page.goto('http://localhost.com:80/static.html');
await page.waitFor(5000)
await page.exposeFunction('docTest', docTest);
var result = await page.evaluate(() => {
var resultworking = document.querySelector("tr");
console.log(resultworking);
var summary = docTest(document);
console.log(resultworking);
return summary;
});
console.log(result);
await page.waitFor(7000);
browser.close();
return {
result
}
};
scrape().then((value) => {
console.log(value); // Success!
});
I just had the same question. The problem is that the page.evaluate() function callback has to be an async function and your function docTest() will return a Promise when called inside the page.evaluate(). To fix it, just add the async and await keywords to your code:
await page.exposeFunction('docTest', docTest);
var result = await page.evaluate(async () => {
var summary = await docTest(document);
console.log(summary);
return summary;
});
Just remember that page.exposeFunction() will make your function return a Promise, then, you need to use async and await. This happens because your function will not be running inside your browser, but inside your nodejs application.
exposeFunction() does not work after goto()
Why can't I access 'window' in an exposeFunction() function with Puppeteer?
How to use evaluateOnNewDocument and exposeFunction?
exposeFunction remains in memory?
Puppeteer: pass variable in .evaluate()
Puppeteer evaluate function
allow to pass a parameterized funciton as a string to page.evaluate
Functions bound with page.exposeFunction() produce unhandled promise rejections
How to pass a function in Puppeteers .evaluate() method?
How can I dynamically inject functions to evaluate using Puppeteer?
I'm trying to get the src values for all images on Bing image search for a search term. I am using puppeteer for it. I wrote a selector to grab each image tag and it works in the Chrome DevTools. It, however, isn't working when I write it in the code-
const puppeteer = require("puppeteer");
(async () => {
try{
let url = `https://www.bing.com/images/search?q=cannabis`
const browser = await puppeteer.launch({headless: false})
const page = await browser.newPage()
await page.goto(url)
await page.waitForSelector("ul.dgControl_list li img.mimg")
console.log(await page.evaluate(() => {
Array.from(document.querySelectorAll("ul.dgControl_list>li img.mimg"), img => img.src)
}))
} catch(err){
console.log("error - " + err)
}
})()
I get the output as an object containing arrays of 10 items each in the devTools, but when I run it in the console through my code, it is undefined. How do I read this object?
You are not returning any data from the page.evaluate call. To return the data you have to use the return statement or use the short syntax (as explained below):
console.log(await page.evaluate(() => {
return Array.from(document.querySelectorAll("ul.dgControl_list>li img.mimg"), img => img.src)
}))
Explanation: Arrow function
The arrow function has two ways to write them. One is the short syntax, you can use it like this:
const func = () => 1; // func() will simply return 1
You can only put in one statement in there (which might call other statements though). Alternatively, you can use the long form:
const func = () => { return 1; }; // Same function as above
You can use variable declarations and any kind of code inside this function (just as in a normal function() { ... }, but this time you have to use return to return a value.
Therefore, as an alternative, you could also write this (short syntax):
console.log(await page.evaluate(
() => Array.from(document.querySelectorAll("ul.dgControl_list>li img.mimg"), img => img.src)
))
I'm trying to return the whole windows object from a page, and then traversing the object outside of puppeteer.
I'm trying to access the data in Highcharts property, for which I need to access the window object. The normal javascript code being something like window.Highcharts.charts[0].series[0].data.
I thought the easiest way would be to use puppeteer to access the site, and just send me back the windows object, which I could then use outside of puppeteer like any other JS object.
After reading the documentation, I'm finding it difficult to return the object as it would appear just putting 'window' into the chrome console. I'm not sure what I'm missing?
I've read through the documentation, and the following two methods seem like they should work?
(async () => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto('https://www.example.com', {waitUntil: 'networkidle2'});
// METHOD 1
// Create a Map object
await page.evaluate(() => window.map = new Map());
// Get a handle to the Map object prototype
const mapPrototype = await page.evaluateHandle(() => Map.prototype);
// Query all map instances into an array
const mapInstances = await page.queryObjects(mapPrototype);
console.log(mapInstances);
await mapInstances.dispose();
await mapPrototype.dispose();
// METHOD 2
const handle = await page.evaluateHandle(() => ({window, document}));
const properties = await handle.getProperties();
const windowHandle = properties.get('window');
const documentHandle = properties.get('document');
var result = await page.evaluate(win => win, windowHandle);
console.log(result)
await handle.dispose();
await browser.close();
})();
However, it only returns the following in the console, and not the simple object I would like;
Not sure if I'm going about this the right way, so any help/advice is much appreciated.
I'm trying to use headless Chrome and Puppeteer to run our Javascript tests, but I can't extract the results from the page. Based on this answer, it looks like I should use page.evaluate(). That section even has an example that looks like what I need.
const bodyHandle = await page.$('body');
const html = await page.evaluate(body => body.innerHTML, bodyHandle);
await bodyHandle.dispose();
As a full example, I tried to convert that to a script that will extract my name from my user profile on Stack Overflow. Our project is using Node 6, so I converted the await expressions to use .then().
const puppeteer = require('puppeteer');
puppeteer.launch().then(function(browser) {
browser.newPage().then(function(page) {
page.goto('https://stackoverflow.com/users/4794').then(function() {
page.$('h2.user-card-name').then(function(heading_handle) {
page.evaluate(function(heading) {
return heading.innerText;
}, heading_handle).then(function(result) {
console.info(result);
browser.close();
}, function(error) {
console.error(error);
browser.close();
});
});
});
});
});
When I run that, I get this error:
$ node get_user.js
TypeError: Converting circular structure to JSON
at Object.stringify (native)
at args.map.x (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/helper.js:30:43)
at Array.map (native)
at Function.evaluationString (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/helper.js:30:29)
at Frame.<anonymous> (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:376:31)
at next (native)
at step (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:355:24)
at Promise (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:373:12)
at fn (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:351:10)
at Frame._rawEvaluate (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:375:3)
The problem seems to be with serializing the input parameter to page.evaluate(). I can pass in strings and numbers, but not element handles. Is the example wrong, or is it a problem with Node 6? How can I extract the text of a DOM node?
I found three solutions to this problem, depending on how complicated your extraction is. The simplest option is a related function that I hadn't noticed: page.$eval(). It basically does what I was trying to do: combines page.$() and page.evaluate(). Here's an example that works:
const puppeteer = require('puppeteer');
puppeteer.launch().then(function(browser) {
browser.newPage().then(function(page) {
page.goto('https://stackoverflow.com/users/4794').then(function() {
page.$eval('h2.user-card-name', function(heading) {
return heading.innerText;
}).then(function(result) {
console.info(result);
browser.close();
});
});
});
});
That gives me the expected result:
$ node get_user.js
Don Kirkby top 2% overall
I wanted to extract something more complicated, but I finally realized that the evaluation function is running in the context of the page. That means you can use any tools that are loaded in the page, and then just send strings and numbers back and forth. In this example, I use jQuery in a string to extract what I want:
const puppeteer = require('puppeteer');
puppeteer.launch().then(function(browser) {
browser.newPage().then(function(page) {
page.goto('https://stackoverflow.com/users/4794').then(function() {
page.evaluate("$('h2.user-card-name').text()").then(function(result) {
console.info(result);
browser.close();
});
});
});
});
That gives me a result with the whitespace intact:
$ node get_user.js
Don Kirkby
top 2% overall
In my real script, I want to extract the text of several nodes, so I need a function instead of a simple string:
const puppeteer = require('puppeteer');
puppeteer.launch().then(function(browser) {
browser.newPage().then(function(page) {
page.goto('https://stackoverflow.com/users/4794').then(function() {
page.evaluate(function() {
return $('h2.user-card-name').text();
}).then(function(result) {
console.info(result);
browser.close();
});
});
});
});
That gives the exact same result. Now I need to add error handling, and maybe reduce the indentation levels.
Using await/async and $eval, the syntax looks like the following:
await page.goto('https://stackoverflow.com/users/4794')
const nameElement = await context.page.$eval('h2.user-card-name', el => el.text())
console.log(nameElement)
I use page.$eval
const text = await page.$eval('h2.user-card-name', el => el.innerText );
console.log(text);
I had success using the following:
const browser = await puppeteer.launch();
try {
const page = await browser.newPage();
await page.goto(url);
await page.waitFor(2000);
let html_content = await page.evaluate(el => el.innerHTML, await page.$('.element-class-name'));
console.log(html_content);
} catch (err) {
console.log(err);
}
Hope it helps.