Puppeteer page.$$ returns empty array - node.js

I'm working on a simple scraper but I can't get past this issue.
It returns an empty array every time I run it, even though the site does contain the elements and querySelectorAll returns a NodeList when I run it in the browser console.
Is there anything I might be overlooking? I've already tried waitForSelector to wait for the elements, but no luck: it just times out.
Thank you
const puppeteer = require('puppeteer');
const scraper = async () => {
try {
const browser = await puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox'] });
const page = await browser.newPage();
await page.goto('https://randomtodolistgenerator.herokuapp.com/library');
const elements = await page.$$(".card-body");
console.log(elements);
await browser.close();
} catch (error) {
console.log(error)
}
}

It turned out that WSL was not able to run Chromium for some reason.
I ended up installing Linux on a VM, and it is working now.
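When page.$$ comes back empty and waitForSelector times out, it can help to check what the headless browser actually loaded before blaming the selector. Below is a minimal sketch of such a check. The function name diagnosePage and the shape of the returned object are my own, not part of Puppeteer; it only uses page.goto, page.content, page.title and page.$$.

```javascript
// Collects a few facts about what the browser actually loaded.
// `page` is a Puppeteer Page; diagnosePage and its return shape are illustrative.
async function diagnosePage(page, url, selector) {
  // goto resolves to the main response, or null in some cases (e.g. same-URL hash navigation)
  const response = await page.goto(url, { waitUntil: "networkidle2" });
  const html = await page.content();
  const matches = await page.$$(selector);
  return {
    status: response ? response.status() : null,
    title: await page.title(),
    htmlLength: html.length,
    matchCount: matches.length,
  };
}
```

If the status is missing or the HTML is nearly empty, the problem is more likely the browser or the environment than the selector.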

Related

Puppeteer iterates through all pages but browser won't close after

After the program iterates through all pages, it doesn't break out of the while loop and close the browser. Instead, it runs through the while loop one extra time and gives me an error: "TimeoutError: waiting for selector `.pager .next a` failed: timeout 30000ms exceeded". What went wrong?
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch({ headless: false });
const page = await browser.newPage();
await page.goto("http://books.toscrape.com/");
let isLastPage = false;
while (!isLastPage) {
await page.waitForSelector(".pager .next a");
isLastPage = (await page.$(".pager .next a")) === null;
await page.click(".pager .next a");
}
console.log("done");
await browser.close();
})();
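Your last-page detection logic is flawed. On every page, you wait for ".pager .next a", check whether it exists, and then click it regardless of the result. On the last page that element doesn't exist, so the waitForSelector call times out before your check even runs, and if it doesn't exist, you can't click it either.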
What you want to do is make sure the page is loaded by waiting for .pager .current which is a part of the navigation footer that will be there on every page. Then, check if .pager .next a is there BEFORE you click and if it's not there, then you can just break out of the while loop. If the page is dynamic and you need to use puppeteer, then you can do something like this:
const puppeteer = require("puppeteer");
(async () => {
const browser = await puppeteer.launch({ headless: false });
try {
const page = await browser.newPage();
await page.goto("http://books.toscrape.com/");
let cntr = 0;
while (true) {
await page.waitForSelector(".pager .current");
console.log(`page ${++cntr}`);
// process the page content here
if ((await page.$(".pager .next a")) === null) {
break;
}
await page.click(".pager .next a");
}
console.log("done");
} finally {
await browser.close();
}
})();
And, to make sure that you always close the browser even upon errors, you need to catch any errors and make sure you close the browser in those conditions. In this case, you can use try/finally.
If the page is not dynamic, then you can also just use plain GET requests and use cheerio to examine what's in the page which is simpler and doesn't involve loading the whole chromium browser engine.
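For the non-dynamic case, here is a rough sketch of the same "follow next until it's gone" loop without a browser. The names crawl and fetchWithCheerio are made up for this example, and it assumes cheerio is installed and Node 18+ (for the global fetch). The selector is the same one used above.

```javascript
// Follows "next" links until there are none left.
// `fetchPage(url)` must resolve to { nextUrl }, with nextUrl === null on the last page.
async function crawl(startUrl, fetchPage) {
  const visited = [];
  let url = startUrl;
  while (url !== null) {
    visited.push(url);
    const { nextUrl } = await fetchPage(url);
    url = nextUrl;
  }
  return visited;
}

// One possible fetchPage built on cheerio (assumption: `npm install cheerio`, Node 18+).
async function fetchWithCheerio(url) {
  const cheerio = require("cheerio"); // loaded lazily so crawl() has no hard dependency
  const html = await (await fetch(url)).text();
  const $ = cheerio.load(html);
  // process the page content here
  const href = $(".pager .next a").attr("href");
  // Resolve relative links against the current page; no link means last page.
  return { nextUrl: href ? new URL(href, url).href : null };
}
```

You would call it as crawl("http://books.toscrape.com/", fetchWithCheerio). Keeping the loop separate from the fetching makes the stop condition easy to check on its own.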

playwright - get content from multiple pages in parallel

I am trying to get the page content from multiple URLs using playwright in a nodejs application. My code looks like this:
const getContent = async (url: string): Promise<string> => {
const browser = await firefox.launch({ headless: true });
const page = await browser.newPage();
try {
await page.goto(url, {
waitUntil: 'domcontentloaded',
});
return await page.content();
} finally {
await page.close();
await browser.close();
}
}
const items = [
{
urls: ["https://www.google.com", "https://www.example.com"]
// other props
},
{
urls: ["https://www.google.com", "https://www.example.com"]
// other props
},
// more items...
]
await Promise.all(
items.map(async (item) => {
const contents = [];
for (const url of item.urls) {
contents.push(await getContent(url))
}
return contents;
})
);
I am getting errors like "error (Page.content): Target closed.", but I noticed that if I just run it without the loop:
const content = getContent('https://www.example.com');
It works.
It looks like the iterations of the loop share the same browser and/or page instance, so they close or navigate away from each other's pages.
To test this, I built a web API around the getContent function. When I send 2 requests at (almost) the same time, one of them fails, whereas if I send one request at a time, it always works.
Is there a way to make playwright work in parallel?
I don't know if that solves it, but I noticed there are two missing awaits. Both firefox.launch(...) and browser.newPage() are asynchronous and need an await in front.
Also, you don't need to launch a new browser so many times. Playwright has isolated browser contexts, which are created much faster than launching a browser. It's worth experimenting with launching the browser once, before the getContent function, and using
const context = await browser.newContext();
const page = await context.newPage();
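To make that suggestion concrete, here is a rough sketch in plain JavaScript of how the pieces could fit together: one shared browser, one isolated context per URL. getAllContents is a name I made up, and getContent now takes the browser as a parameter, which differs from the question's signature. I haven't verified that this removes the "Target closed" error in your setup.

```javascript
// Opens `url` in its own isolated context on a shared browser and returns the HTML.
// `browser` is a Playwright Browser, e.g. from `await firefox.launch()`.
async function getContent(browser, url) {
  const context = await browser.newContext();
  try {
    const page = await context.newPage();
    await page.goto(url, { waitUntil: "domcontentloaded" });
    return await page.content();
  } finally {
    // Closing the context also closes the pages created in it.
    await context.close();
  }
}

// Fetches all URLs concurrently on the same browser (hypothetical helper).
async function getAllContents(browser, urls) {
  return Promise.all(urls.map((url) => getContent(browser, url)));
}
```

The browser itself would be launched once with firefox.launch({ headless: true }) and closed once, after all the getAllContents calls have finished.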

Use post variable with querySelector

I'm facing an issue trying to scrape data from the web with Puppeteer and querySelector.
I have a Node.js web server that handles a POST request and then calls a function to scrape the data. I'm sending 2 parameters (postBlogUrl and postDomValue).
postDomValue contains, as a string, the selector I'm trying to fetch data from, for example:
[itemprop='articleBody'].
If I hard-code the selector ([itemprop='articleBody']), everything works and I'm able to retrieve the data, but if I use the postDomValue variable, nothing is returned.
I already tried to escape the variable using CSS.escape(postDomValue), but no luck.
fetchBlogContent: async function(postBlogUrl, postDomValue) {
try {
const puppeteer = require('puppeteer');
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(postBlogUrl, {
waitUntil: 'load'
})
let description = await page.evaluate(() => {
//This works return document.querySelector("[itemprop='articleBody']").innerHTML;
//This won't return document.querySelector(postDomValue).innerHTML;
})
return description
} catch (err) {
// handle err
return err;
}
}
const description = await page.evaluate((value) =>
document.querySelector(value).innerHTML, postDomValue);
See docs on how to pass args to page.evaluate() in puppeteer
If I understand correctly, the issue may be that you are using a variable declared in the Node.js context inside the function passed to page.evaluate(), and that function is executed in the browser context, where the variable does not exist. In such cases, you need to pass the value of the variable as an additional argument:
let description = await page.evaluate((selector) => {
return document.querySelector(selector).innerHTML;
}, postDomValue);
See more in page.evaluate().
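Since the selector comes from a POST request, it may also match nothing, in which case querySelector returns null and reading .innerHTML throws inside the page. A small sketch of a null-safe variant follows; getInnerHtml is a hypothetical helper name, and returning null for "no match" is my own choice.

```javascript
// Returns the innerHTML of the first element matching `selector`, or null if none matches.
// `page` is a Puppeteer Page; `selector` is a plain string from the Node.js side.
async function getInnerHtml(page, selector) {
  return page.evaluate((sel) => {
    // This callback runs in the browser; `sel` is the serialized copy of `selector`.
    const el = document.querySelector(sel);
    return el ? el.innerHTML : null;
  }, selector);
}
```

In the question's function this would be called as getInnerHtml(page, postDomValue).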

Puppeteer DevTools programmatically

I can open the DevTools in Puppeteer, but I cannot write data to the console section. How can I also get the log of this data printed to the cmd screen?
In Puppeteer, I want to print to console as below and get the output below.
Screenshot
You are asking for two things here:
1. Capture console.log messages to the command prompt
2. Run a JavaScript command inside Puppeteer
For the first point, you can set the launch option dumpio: true.
For the second point, you can jump into the page using evaluate and make a call to console.log.
const puppeteer = require('puppeteer');
(async () => {
const browser = await puppeteer.launch({
dumpio: true
});
const page = await browser.newPage();
const url = "https://stackoverflow.com";
await page.goto(url);
await page.waitForSelector('h1');
await page.evaluate(() => {
console.log(document.getElementsByTagName("h1")[0].innerText);
});
console.log("Done.")
await browser.close();
})();
Also, if you are getting too much output, you can omit dumpio and instead catch the log as an event, e.g.
page.on('console', (msg) => console.log('PAGE LOG:', msg.text()));
await page.waitForSelector('h1');
await page.evaluate(() => {
console.log(1 + 2);
console.log(document.getElementsByTagName("h1")[0].innerText);
});
The second script prints:
PAGE LOG: 3
PAGE LOG: We <3 people who code
Done.
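If you want page errors and warnings to keep their level on the Node side, the event handler can look at msg.type() as well. This is only a sketch: forwardConsole and the FORWARDED allowlist are my own names, and anything not on the list falls back to log so that message types like "table" or "clear" don't trigger the matching console method locally.

```javascript
// Console methods we are willing to call on the Node side (illustrative allowlist).
const FORWARDED = ["log", "info", "warn", "error", "debug"];

// Forwards browser console messages to `sink` (defaults to Node's console).
// `page` is a Puppeteer Page, or anything that emits "console" events.
function forwardConsole(page, sink = console) {
  page.on("console", (msg) => {
    const type = msg.type();
    const method =
      FORWARDED.includes(type) && typeof sink[type] === "function" ? type : "log";
    sink[method]("PAGE LOG:", msg.text());
  });
}
```

Call forwardConsole(page) right after browser.newPage(), before page.goto, so that messages logged during page load are captured too.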

Getting DOM node text with Puppeteer and headless Chrome

I'm trying to use headless Chrome and Puppeteer to run our Javascript tests, but I can't extract the results from the page. Based on this answer, it looks like I should use page.evaluate(). That section even has an example that looks like what I need.
const bodyHandle = await page.$('body');
const html = await page.evaluate(body => body.innerHTML, bodyHandle);
await bodyHandle.dispose();
As a full example, I tried to convert that to a script that will extract my name from my user profile on Stack Overflow. Our project is using Node 6, so I converted the await expressions to use .then().
const puppeteer = require('puppeteer');
puppeteer.launch().then(function(browser) {
browser.newPage().then(function(page) {
page.goto('https://stackoverflow.com/users/4794').then(function() {
page.$('h2.user-card-name').then(function(heading_handle) {
page.evaluate(function(heading) {
return heading.innerText;
}, heading_handle).then(function(result) {
console.info(result);
browser.close();
}, function(error) {
console.error(error);
browser.close();
});
});
});
});
});
When I run that, I get this error:
$ node get_user.js
TypeError: Converting circular structure to JSON
at Object.stringify (native)
at args.map.x (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/helper.js:30:43)
at Array.map (native)
at Function.evaluationString (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/helper.js:30:29)
at Frame.<anonymous> (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:376:31)
at next (native)
at step (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:355:24)
at Promise (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:373:12)
at fn (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:351:10)
at Frame._rawEvaluate (/mnt/data/don/git/Kive/node_modules/puppeteer/node6/FrameManager.js:375:3)
The problem seems to be with serializing the input parameter to page.evaluate(). I can pass in strings and numbers, but not element handles. Is the example wrong, or is it a problem with Node 6? How can I extract the text of a DOM node?
I found three solutions to this problem, depending on how complicated your extraction is. The simplest option is a related function that I hadn't noticed: page.$eval(). It basically does what I was trying to do: combines page.$() and page.evaluate(). Here's an example that works:
const puppeteer = require('puppeteer');
puppeteer.launch().then(function(browser) {
browser.newPage().then(function(page) {
page.goto('https://stackoverflow.com/users/4794').then(function() {
page.$eval('h2.user-card-name', function(heading) {
return heading.innerText;
}).then(function(result) {
console.info(result);
browser.close();
});
});
});
});
That gives me the expected result:
$ node get_user.js
Don Kirkby top 2% overall
I wanted to extract something more complicated, but I finally realized that the evaluation function is running in the context of the page. That means you can use any tools that are loaded in the page, and then just send strings and numbers back and forth. In this example, I use jQuery in a string to extract what I want:
const puppeteer = require('puppeteer');
puppeteer.launch().then(function(browser) {
browser.newPage().then(function(page) {
page.goto('https://stackoverflow.com/users/4794').then(function() {
page.evaluate("$('h2.user-card-name').text()").then(function(result) {
console.info(result);
browser.close();
});
});
});
});
That gives me a result with the whitespace intact:
$ node get_user.js
Don Kirkby
top 2% overall
In my real script, I want to extract the text of several nodes, so I need a function instead of a simple string:
const puppeteer = require('puppeteer');
puppeteer.launch().then(function(browser) {
browser.newPage().then(function(page) {
page.goto('https://stackoverflow.com/users/4794').then(function() {
page.evaluate(function() {
return $('h2.user-card-name').text();
}).then(function(result) {
console.info(result);
browser.close();
});
});
});
});
That gives the exact same result. Now I need to add error handling, and maybe reduce the indentation levels.
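As a possible next step, here is a sketch of a flatter chain with error handling that still avoids async/await. getText is a name invented for this example, and it takes the puppeteer module as a parameter, which the scripts above don't do. The browser is closed on both the success and the failure path without relying on Promise.prototype.finally.

```javascript
// Resolves with the innerText of the first element matching `selector` on `url`.
// `puppeteer` is passed in (e.g. require('puppeteer')) so the chain is easy to test.
function getText(puppeteer, url, selector) {
  return puppeteer.launch().then(function (browser) {
    function closeBrowser() {
      return browser.close();
    }
    return browser.newPage()
      .then(function (page) {
        return page.goto(url).then(function () {
          return page.$eval(selector, function (el) {
            return el.innerText;
          });
        });
      })
      .then(
        // Success: close the browser, then pass the result through.
        function (result) {
          return closeBrowser().then(function () { return result; });
        },
        // Failure: close the browser, then rethrow the original error.
        function (error) {
          return closeBrowser().then(function () { throw error; });
        }
      );
  });
}
```

It would be used as getText(require('puppeteer'), 'https://stackoverflow.com/users/4794', 'h2.user-card-name').then(console.info, console.error).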
Using await/async and $eval, the syntax looks like the following:
await page.goto('https://stackoverflow.com/users/4794')
const nameElement = await page.$eval('h2.user-card-name', el => el.innerText)
console.log(nameElement)
I use page.$eval
const text = await page.$eval('h2.user-card-name', el => el.innerText );
console.log(text);
I had success using the following:
const browser = await puppeteer.launch();
try {
const page = await browser.newPage();
await page.goto(url);
// wait for the element itself rather than a fixed delay
await page.waitForSelector('.element-class-name');
let html_content = await page.evaluate(el => el.innerHTML, await page.$('.element-class-name'));
console.log(html_content);
} catch (err) {
console.log(err);
} finally {
await browser.close();
}
Hope it helps.

Resources