How to get Document.readyState in puppeteer / headless Chrome? - node.js

Using puppeteer, I cannot figure out how to get the document.readyState. I need to wait until the page is loaded before rendering a PDF.
const browser = await puppeteer.launch({
  headless: true,
  args: ['--no-sandbox']
});
const page = await browser.newPage();
console.log('Setting HTML content...');

// Can't POST data with headless chrome, so we have to get the HTML and
// set the content of the page, then render that to a PDF
await page.setContent(html);

// Generates a PDF with 'screen' media type.
await page.emulateMedia('screen');

var renderPage = function () {
  return new Promise(async resolve => {
    await page.evaluate((document) => {
      console.log(document);
      const handleDocumentLoaded = () => {
        console.log('readyState: ', document.readyState);
        console.log('Rendering PDF...');
        Promise.resolve(resolve(page.pdf({ path: thisPDFfileName, format: 'Letter' })));
      };
      if (document.readyState === "loading") {
        document.addEventListener("DOMContentLoaded", handleDocumentLoaded);
      } else {
        handleDocumentLoaded();
      }
    });
    // I also tried this... no luck
    // setTimeout(async function () {
    //   console.log('Awaiting document...');
    //
    //   const handle = await page.evaluateHandle(() => ({ window, document }));
    //   const properties = await handle.getProperties();
    //   const windowHandle = properties.get('window');
    //   const documentHandle = properties.get('document');
    //   await handle.dispose();
    //
    //   console.log('readyState: ', documentHandle.readyState);
    //   if ("complete" === documentHandle.readyState) {
    //     await documentHandle.dispose();
    //     console.log('readyState: ', doc.readyState);
    //     console.log('Rendering PDF...');
    //     resolve(page.pdf({ path: thisPDFfileName, format: 'Letter' }));
    //   } else {
    //     renderPage();
    //   }
    // }, 250);
  });
};
// Delay required to allow page to render JS before creating PDF
await renderPage();
await browser.close();
sendPdfToClient();
I tried evaluateHandle and could only get the innerHTML, not the document object itself.
What's the correct way to get the document object containing readyState?
Lastly, should I set a listener for load or DOMContentLoaded? I need to wait until the Google Maps JS renders the map. I can send a custom event if need be, since I control the page being rendered.

If you are using page.goto(), you can use the waitUntil option to specify when to consider navigation as complete:
The waitUntil events include:
load - consider navigation to be finished when the load event is fired.
domcontentloaded - consider navigation to be finished when the DOMContentLoaded event is fired.
networkidle0 - consider navigation to be finished when there are no more than 0 network connections for at least 500 ms.
networkidle2 - consider navigation to be finished when there are no more than 2 network connections for at least 500 ms.
Alternatively, you can use page.on() to wait for the 'domcontentloaded' event or 'load' event.
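For example, a minimal sketch of rendering a PDF only after the network has gone quiet (the URL and output path here are placeholders):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true, args: ['--no-sandbox'] });
  const page = await browser.newPage();
  // Consider navigation finished once there have been no network
  // connections for at least 500 ms.
  await page.goto('https://example.com', { waitUntil: 'networkidle0' });
  await page.pdf({ path: 'output.pdf', format: 'Letter' });
  await browser.close();
})();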

I guess I was overcomplicating it. Apparently there is already
page.once('load', () => console.log('Page loaded!'));
which does exactly this. :-D
See detailed documentation here:
https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#event-load
There are two events relevant to your problem:
event: 'domcontentloaded'
event: 'load'
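Since the question uses page.setContent() rather than page.goto(), one more option worth noting: page.waitForFunction() can poll document.readyState directly inside the page. A minimal sketch (html and thisPDFfileName are the asker's variables):

await page.setContent(html);
// Poll inside the page until the document has fully loaded.
await page.waitForFunction(() => document.readyState === 'complete');
await page.pdf({ path: thisPDFfileName, format: 'Letter' });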

Related

Can't click link using puppeteer - Thingiverse

I'm trying to automate away downloading multiple files on Thingiverse. I chose an object at random, but I'm having a hard time locating the link I need, clicking it, and then downloading. Has someone run into this before? Can I get some help?
I've tried several other variations.
import puppeteer from 'puppeteer';

async function main() {
  const browser = await puppeteer.launch({
    headless: true,
  });
  const page = await browser.newPage();
  const response = await page.goto('https://www.thingiverse.com/thing:2033856/files');
  const buttons = await page.$x(`//a[contains(text(), 'Download')]`);
  if (buttons.length > 0) {
    console.log(buttons.length);
  } else {
    console.log('no buttons');
  }
  await wait(5000);
  await browser.close();
  return 'Finish';
}

async function wait(time: number) {
  return new Promise(function (resolve) {
    setTimeout(resolve, time);
  });
}

function start() {
  main()
    .then((test) => console.log('DONE'))
    .catch((reason) => console.log('Error: ', reason));
}

start();
I was able to get it to work.
The selector is: a[class^="ThingFile__download"]
Puppeteer is: const puppeteer = require('puppeteer-extra');
Before the await page.goto() I always recommend setting the viewport:
await page.setViewport({width: 1920, height: 720});
After that is set, change the await page.goto() to have a waitUntil option:
const response = await page.goto('https://www.thingiverse.com/thing:2033856/files', { waitUntil: 'networkidle0' }); // wait until page load
Next, this is a very important part: you have to use waitForSelector() or waitForFunction().
I added both of these lines of code after the const response:
await page.waitForSelector('a[class^="ThingFile__download"]', {visible: true})
await page.waitForFunction("document.querySelector('a[class^=\"ThingFile__download\"]') && document.querySelector('a[class^=\"ThingFile__download\"]').clientHeight != 0");
Next, get the buttons. For my testing I just grabbed the button href.
const buttons = await page.$eval('a[class^="ThingFile__download"]', anchor => anchor.getAttribute('href'));
Lastly, do not check the .length of this variable: page.$eval here returns the href value, which is a string. You get an ElementHandle (or null) when you grab just the button:
const button = await page.$('a[class^="ThingFile__download"]');
console.log(button)
if (button) { ... }
Now if you change that page.$ to page.$$, you get an Array<ElementHandle>, and you can use .length there.
const buttonsAll = await page.$$('a[class^="ThingFile__download"]');
console.log(buttonsAll)
if (buttonsAll.length > 0) { ... }
Hopefully this helps, and if you can't figure it out I can post my full source later if I have time to make it look better.
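Putting those pieces together, a rough end-to-end sketch assembled from the snippets above (untested against the live site; the ThingFile__download class may have changed since):

const puppeteer = require('puppeteer-extra');

async function main() {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.setViewport({ width: 1920, height: 720 });
  // Wait until page load before touching the DOM.
  await page.goto('https://www.thingiverse.com/thing:2033856/files', { waitUntil: 'networkidle0' });
  // Wait for the download links to be rendered and visible.
  await page.waitForSelector('a[class^="ThingFile__download"]', { visible: true });
  const buttonsAll = await page.$$('a[class^="ThingFile__download"]');
  console.log(buttonsAll.length);
  await browser.close();
}

main().catch((reason) => console.log('Error: ', reason));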

puppeteer / node.js - enter page, click load more until all comments load, save page as mhtml

What I'm trying to accomplish is to enter this site https://www.discoverpermaculture.com/permaculture-masterclass-video-1, wait until it loads, load all comments from Disqus (click the 'Load more comments' button until it's no longer present), and save the page as MHTML for offline use.
I found a similar question here: Puppeteer / Node.js to click a button as long as it exists -- and when it no longer exists, commence action, but unfortunately trying to detect the "Load more comments" button doesn't work for some reason.
It seems waitForSelector('a.load-more__button') is not working, because all it prints out is "not visible".
Here's my code:
const puppeteer = require('puppeteer');

const url = "https://www.discoverpermaculture.com/permaculture-masterclass-video-1";

const isElementVisible = async (page, cssSelector) => {
  let visible = true;
  await page
    .waitForSelector(cssSelector, { visible: true, timeout: 4000 })
    .catch(() => {
      console.log('not visible');
      visible = false;
    });
  return visible;
};

async function run () {
  let browser = await puppeteer.launch({
    headless: true,
    defaultViewport: null,
    args: [
      '--window-size=1920,10000',
    ],
  });
  const page = await browser.newPage();
  const fs = require('fs');
  await page.goto(url);
  await page.waitForNavigation();
  await page.waitForTimeout(4000);

  const selectorForLoadMoreButton = 'a.load-more__button';
  let loadMoreVisible = await isElementVisible(page, selectorForLoadMoreButton);
  while (loadMoreVisible) {
    console.log('load more visible');
    await page
      .click(selectorForLoadMoreButton)
      .catch(() => {});
    await page.waitForTimeout(4000);
    loadMoreVisible = await isElementVisible(page, selectorForLoadMoreButton);
  }

  const cdp = await page.target().createCDPSession();
  const { data } = await cdp.send('Page.captureSnapshot', { format: 'mhtml' });
  fs.writeFileSync('page.mhtml', data);
  browser.close();
}

run();
You're just waiting for an AJAX request to be processed. You could simply save the total number of comments (top left of the DISQUS plugin) and compare it to the array of loaded comments; once the array length equals the total, you've retrieved every comment.
I've posted something a while back on waiting for AJAX requests; you can see it here: https://stackoverflow.com/a/66092889/3645650.
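As a rough sketch of that count-comparison idea with Puppeteer's waitForFunction (the selectors here are guesses and would need to be checked against the actual DISQUS markup; note that, per the asker's update below, this has to run against the DISQUS iframe's content frame, not the top-level page):

// Poll until the number of rendered comments reaches the advertised total.
await frame.waitForFunction(() => {
  const counter = document.querySelector('.comment-count'); // assumed selector
  const total = counter ? parseInt(counter.textContent, 10) : NaN;
  const loaded = document.querySelectorAll('div.post').length; // assumed selector
  return Number.isFinite(total) && loaded >= total;
}, { timeout: 0 });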
Alternatively, a simpler approach would be to just use the DISQUS API.
Comments are publicly accessible. You can just use the API key from the website:
https://disqus.com/api/3.0/threads/listPostsThreaded?limit=50&thread=7187962034&forum=pdc2018&order=popular&cursor=1%3A0%3A0&api_key=E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F
parameter   options
limit       Defaults to 50. Maximum is 100.
thread      Thread number, e.g. 7187962034.
forum       Forum id, e.g. pdc2018.
order       desc, asc, popular.
cursor      Probably the page number. Format is 1:0:0, e.g. page 2 would be 2:0:0.
api_key     The platform API key. Here the API key is E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F.
If you have to iterate through different pages, you would need to intercept the XHR responses to retrieve the thread number.
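As a minimal sketch, the endpoint above can be called from Node directly, with no Puppeteer involved (the query values are the ones from the example URL; the response shape is an assumption based on the usual DISQUS API envelope):

const https = require('https');

const apiUrl = 'https://disqus.com/api/3.0/threads/listPostsThreaded'
  + '?limit=50&thread=7187962034&forum=pdc2018&order=popular'
  + '&cursor=1%3A0%3A0'
  + '&api_key=E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F';

https.get(apiUrl, (res) => {
  let body = '';
  res.on('data', (chunk) => { body += chunk; });
  res.on('end', () => {
    const json = JSON.parse(body);
    // json.response should hold the array of comment posts;
    // the cursor field (if present) tells you how to page further.
    console.log(json.response.length, 'comments on this page');
  });
});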
It turned out the problem was that the Disqus comments were inside an iframe:
// needed to add those 2 lines
const elementHandle = await page.waitForSelector('iframe');
const frame = await elementHandle.contentFrame();

// and change 'page' to 'frame' below
let loadMoreVisible = await isElementVisible(frame, selectorForLoadMoreButton);
while (loadMoreVisible) {
  console.log('load more visible');
  await frame
    .click(selectorForLoadMoreButton)
    .catch(() => {});
  await frame.waitForTimeout(4000);
  loadMoreVisible = await isElementVisible(frame, selectorForLoadMoreButton);
}
After this change it works perfectly.

Have to download files from a "Click Here" link which opens the PDF in a separate window, then click the save button to download the file (puppeteer)

await page.goto('https://www.website.com/Dashboard.aspx');

// Dynamically adding IDs to get to required page
await page.type('#searchBox_ID', ids.RequestID);
await page.click('#buttonCLick');
await page.waitForNavigation();

// Now page has loaded
await page.waitForSelector('#PdfFile_Selector');
await page.click('#PdfFile_Selector');
// this will open the file in a new window, and I'd have to click the download button on it
// how else can I do it?
According to the link I posted in the comment, I adapted this example of how to download a PDF. Please refer to this post for a more detailed explanation.
const fs = require('fs');
const http = require('http');

async function downloadPdf(page, url, targetFile, timeout_ms) {
  return new Promise(async (res, rej) => {
    // setup a timeout in case something goes wrong:
    const timeout = setTimeout(() => {
      rej(new Error('Timeout after ' + timeout_ms));
    }, timeout_ms);

    // the download listener:
    const listener = req => {
      console.log('on request! url:', req.url());
      if (req.url() === url) {
        const file = fs.createWriteStream(targetFile);
        // clear timeout and listener
        page.removeListener('request', listener);
        clearTimeout(timeout);
        // #todo: add proper error handling here..
        http.get(req.url(), response => {
          response.pipe(file).on('finish', () => {
            res();
          });
        });
      }
    };
    page.on('request', listener);

    // open the given url!
    // either timeout or request listener will kick in.
    page.goto(url);
  });
}
And you could call it like this:
// ...
const page = await browser.newPage();
const url = 'http://www.africau.edu/images/default/sample.pdf';
const destFile = 'file.pdf';
console.log('lets download', url);
await downloadPdf(page, url, destFile, 20 * 1000);
console.log('File should be downloaded under', destFile);
So when your PDF page has opened, copy its URL and open it again in another tab using the downloadPdf function. Then close all tabs and you are done!
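As an aside, another commonly used workaround (depending on your Puppeteer version) is to ask Chrome, via a DevTools Protocol session, to save downloads to a directory instead of opening the PDF viewer. A sketch using the Page.setDownloadBehavior CDP command (deprecated in newer protocol versions; #PdfFile_Selector is the asker's selector):

const client = await page.target().createCDPSession();
// Tell the browser to save downloads into ./downloads instead of
// opening them in the built-in viewer.
await client.send('Page.setDownloadBehavior', {
  behavior: 'allow',
  downloadPath: './downloads',
});
await page.click('#PdfFile_Selector');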

Puppeteer unable to find element using xPath contains(text()) until after element has been selected in chrome dev tools

I am trying to click the "Create File" button on fakebook's "Download Your Information" page. I am currently able to go to the page, and I wait for the login process to finish. However, when I try to detect the button using
page.$x("//div[contains(text(),'Create File')]")
nothing is found. The same thing occurs when I try to find it in the Chrome DevTools console, both in a Puppeteer window and in a regular window outside of the instance of Chrome that Puppeteer is controlling.
I am able to find the element, however, after I have clicked on it using the Chrome DevTools inspector tool (the second print statement is from after I clicked on it with the element inspector tool).
How should I select this element? I am new to puppeteer and to xpath so I apologize if I just missed something obvious.
A few links I remember looking at previously:
Puppeteer can't find selector
puppeteer cannot find element
puppeteer: how to wait until an element is visible?
My Code:
const puppeteer = require("puppeteer-extra");
const StealthPlugin = require("puppeteer-extra-plugin-stealth");
(async () => {
  let browser;
  try {
    puppeteer.use(StealthPlugin());
    browser = await puppeteer.launch({
      headless: false,
      // path: "C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe",
      args: ["--disable-notifications"],
    });
    const pages = await browser.pages();
    const page = pages[0];
    const url = "https://www.facebook.com/dyi?referrer=yfi_settings";
    await page.goto(url);

    // Handle the login process. Since the login page is different from the url we want,
    // I am going to assume the user has logged in if they return to the desired page.

    // Wait for the login page to process
    await page.waitForFunction(
      (args) => {
        return window.location.href !== args[0];
      },
      { polling: "mutation", timeout: 0 },
      [url]
    );

    // Since multifactor auth can resend the user temporarily to the desired url,
    // use a little debouncing to make sure the user is completely done signing in.
    // Make sure there is no redirect for mfa.
    await page.waitForFunction(
      async (args) => {
        // function to make sure there is a debouncing delay between checking the url
        // Taken from: https://stackoverflow.com/a/49813472/11072972
        function delay(delayInms) {
          return new Promise((resolve) => {
            setTimeout(() => {
              resolve(2);
            }, delayInms);
          });
        }
        if (window.location.href === args[0]) {
          await delay(2000);
          return window.location.href === args[0];
        }
        return false;
      },
      { polling: "mutation", timeout: 0 },
      [url]
    );

    // await page.waitForRequest(url, { timeout: 100000 });

    const requestArchiveXpath = "//div[contains(text(),'Create File')]";
    await page.waitForXPath(requestArchiveXpath);
    const [requestArchiveSelector] = await page.$x(requestArchiveXpath);
    await requestArchiveSelector.click();
    await page.waitForTimeout(3000);
  } catch (e) {
    console.log("End Error: ", e);
  } finally {
    if (browser) {
      await browser.close();
    }
  }
})();
Resolved using the comment above by @vsemozhebuty and source. Only the last few lines inside the try must change:
const iframeXpath = "//iframe[not(@hidden)]";
const requestArchiveXpath = "//div[contains(text(),'Create File')]";

// Wait for and get iframe
await page.waitForXPath(iframeXpath);
const [iframeHandle] = await page.$x(iframeXpath);

// content frame for iframe => https://devdocs.io/puppeteer/index#elementhandlecontentframe
const frame = await iframeHandle.contentFrame();

// Wait for and get button
await frame.waitForXPath(requestArchiveXpath);
const [requestArchiveSelector] = await frame.$x(requestArchiveXpath);

// click button
await requestArchiveSelector.click();
await page.waitForTimeout(3000);

Adding functions to p-queue to handle concurrency stops queue

I am using p-queue with Puppeteer. The goal is to run X Chrome instances, where p-queue limits the amount of concurrency. When an exception occurs within a task in the queue, I would like to requeue it. But when I do that, the queue stops.
I have the following:
getAccounts is simply a helper method to parse a JSON file. For every entry, I create a task and submit it to the queue.
async init() {
  let accounts = await this.getAccounts();
  accounts.map(async () => {
    await queue.add(() => this.test());
  });
  await queue.onIdle();
  console.log("ended, with count: " + this._count);
}
The test method:
async test() {
  this._count++;
  const browser = await puppeteer.launch({ headless: false });
  try {
    const page = await browser.newPage();
    await page.goto(this._url);
    if (Math.floor(Math.random() * 10) > 4) {
      throw new Error("Simulate error");
    }
    await browser.close();
  } catch (error) {
    await browser.close();
    await queue.add(() => this.test());
    console.log(error);
  }
}
If I run this without the await queue.add(() => this.test()); in the catch, it runs fine and limits the concurrency to 3. But with it, whenever it goes into the catch, the current Chrome instance stops.
It also does not log the error, nor does it ever reach the console.log("ended, with count: " + this._count).
Is this a bug in the node module, or am I doing something wrong?
I recommend checking out the Apify SDK package, where you can simply use one of the helper classes to manage Puppeteer pages/browsers.
PuppeteerPool:
It manages browser instances for you. If you set one page per browser, each new page will create a new browser instance.
const puppeteerPool = new PuppeteerPool({
  maxOpenPagesPerInstance: 1,
});

const page1 = await puppeteerPool.newPage();
const page2 = await puppeteerPool.newPage();
const page3 = await puppeteerPool.newPage();

// ... do something with the pages ...

// Close all browsers.
await puppeteerPool.destroy();
Or PuppeteerCrawler, which is more powerful, with several options and helpers. You can manage the whole crawler in Puppeteer there. You can check the PuppeteerCrawler example.
Edit: example of using PuppeteerCrawler with a concurrency of 10:
const Apify = require('apify');

Apify.main(async () => {
  // Apify.openRequestQueue() is a factory to get a preconfigured RequestQueue instance.
  // We add our first request to it - the initial page the crawler will visit.
  const requestQueue = await Apify.openRequestQueue();
  await requestQueue.addRequest({ url: 'https://news.ycombinator.com/' }); // Adds URLs you want to process

  // Create an instance of the PuppeteerCrawler class - a crawler
  // that automatically loads the URLs in headless Chrome / Puppeteer.
  const crawler = new Apify.PuppeteerCrawler({
    requestQueue,
    maxConcurrency: 10, // Set max concurrency
    puppeteerPoolOptions: {
      maxOpenPagesPerInstance: 1, // Set up just one page for one browser instance
    },
    // The function accepts a single parameter, which is an object with the following fields:
    // - request: an instance of the Request class with information such as URL and HTTP method
    // - page: Puppeteer's Page object (see https://pptr.dev/#show=api-class-page)
    handlePageFunction: async ({ request, page }) => {
      // Code you want to process on each page
    },
    // This function is called if the page processing failed more than maxRequestRetries+1 times.
    handleFailedRequestFunction: async ({ request }) => {
      // Code you want to process when handlePageFunction failed
    },
  });

  // Run the crawler and wait for it to finish.
  await crawler.run();
  console.log('Crawler finished.');
});
Example of using RequestList:
const Apify = require('apify');

Apify.main(async () => {
  const requestList = new Apify.RequestList({
    sources: [
      // Separate requests
      { url: 'http://www.example.com/page-1' },
      { url: 'http://www.example.com/page-2' },
      // Bulk load of URLs from file `http://www.example.com/my-url-list.txt`
      { requestsFromUrl: 'http://www.example.com/my-url-list.txt', userData: { isFromUrl: true } },
    ],
    persistStateKey: 'my-state',
    persistSourcesKey: 'my-sources',
  });

  // This call loads and parses the URLs from the remote file.
  await requestList.initialize();

  const crawler = new Apify.PuppeteerCrawler({
    requestList,
    maxConcurrency: 10, // Set max concurrency
    puppeteerPoolOptions: {
      maxOpenPagesPerInstance: 1, // Set up just one page for one browser instance
    },
    // The function accepts a single parameter, which is an object with the following fields:
    // - request: an instance of the Request class with information such as URL and HTTP method
    // - page: Puppeteer's Page object (see https://pptr.dev/#show=api-class-page)
    handlePageFunction: async ({ request, page }) => {
      // Code you want to process on each page
    },
    // This function is called if the page processing failed more than maxRequestRetries+1 times.
    handleFailedRequestFunction: async ({ request }) => {
      // Code you want to process when handlePageFunction failed
    },
  });

  // Run the crawler and wait for it to finish.
  await crawler.run();
  console.log('Crawler finished.');
});
