I am using p-queue with Puppeteer. The goal is to run X Chrome instances, with p-queue limiting the concurrency. When an exception occurs within a task in the queue, I would like to requeue it. But when I do that, the queue stops.
I have the following:
getAccounts is simply a helper method that parses a JSON file. For every entry, I create a task and submit it to the queue.
async init() {
    let accounts = await this.getAccounts();
    accounts.map(async () => {
        await queue.add(() => this.test());
    });
    await queue.onIdle();
    console.log("ended, with count: " + this._count);
}
The test method:
async test() {
    this._count++;
    const browser = await puppeteer.launch({ headless: false });
    try {
        const page = await browser.newPage();
        await page.goto(this._url);
        if (Math.floor(Math.random() * 10) > 4) {
            throw new Error("Simulate error");
        }
        await browser.close();
    } catch (error) {
        await browser.close();
        await queue.add(() => this.test());
        console.log(error);
    }
}
If I run this without the await queue.add(() => this.test()); in the catch block, it runs fine and limits the concurrency to 3. But with it, whenever execution goes into the catch, the current Chrome instance stops.
It also does not log the error, nor the final console.log("ended, with count: " + this._count).
Is this a bug with the node module, or am I doing something wrong?
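One pattern worth trying is to keep the retry logic outside the task, so the task rejects normally and the caller decides whether to requeue: queue.add() returns a promise that settles with the task's outcome. A minimal sketch, untested against your exact setup:

// Assumes test() is changed to close the browser and rethrow on failure,
// instead of requeueing inside its own catch block.
const addWithRetry = (task, retriesLeft = 3) =>
    queue.add(task).catch((error) => {
        console.log(error);
        if (retriesLeft > 0) {
            return addWithRetry(task, retriesLeft - 1);
        }
    });

// In init():
// accounts.map(() => addWithRetry(() => this.test()));
// await queue.onIdle();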
Beyond that, I recommend checking out the Apify SDK package, where you can simply use one of its helper classes to manage Puppeteer pages/browsers.
PuppeteerPool:
It manages browser instances for you. If you set one page per browser, each new page will create a new browser instance.
const puppeteerPool = new PuppeteerPool({
    maxOpenPagesPerInstance: 1,
});

const page1 = await puppeteerPool.newPage();
const page2 = await puppeteerPool.newPage();
const page3 = await puppeteerPool.newPage();

// ... do something with the pages ...

// Close all browsers.
await puppeteerPool.destroy();
Or there is PuppeteerCrawler, which is more powerful, with several options and helpers. You can manage the whole crawl with Puppeteer there. You can check the PuppeteerCrawler example.
Edit:
Example of using PuppeteerCrawler with a concurrency of 10:
const Apify = require('apify');

Apify.main(async () => {
    // Apify.openRequestQueue() is a factory to get a preconfigured RequestQueue instance.
    // We add our first request to it - the initial page the crawler will visit.
    const requestQueue = await Apify.openRequestQueue();
    await requestQueue.addRequest({ url: 'https://news.ycombinator.com/' }); // Adds URLs you want to process

    // Create an instance of the PuppeteerCrawler class - a crawler
    // that automatically loads the URLs in headless Chrome / Puppeteer.
    const crawler = new Apify.PuppeteerCrawler({
        requestQueue,
        maxConcurrency: 10, // Set max concurrency
        puppeteerPoolOptions: {
            maxOpenPagesPerInstance: 1, // Set up just one page per browser instance
        },
        // The function accepts a single parameter, which is an object with the following fields:
        // - request: an instance of the Request class with information such as URL and HTTP method
        // - page: Puppeteer's Page object (see https://pptr.dev/#show=api-class-page)
        handlePageFunction: async ({ request, page }) => {
            // Code you want to run on each page
        },
        // This function is called if page processing fails more than maxRequestRetries+1 times.
        handleFailedRequestFunction: async ({ request }) => {
            // Code you want to run when handlePageFunction fails
        },
    });

    // Run the crawler and wait for it to finish.
    await crawler.run();

    console.log('Crawler finished.');
});
Example of using RequestList:
const Apify = require('apify');

Apify.main(async () => {
    const requestList = new Apify.RequestList({
        sources: [
            // Separate requests
            { url: 'http://www.example.com/page-1' },
            { url: 'http://www.example.com/page-2' },
            // Bulk load of URLs from file `http://www.example.com/my-url-list.txt`
            { requestsFromUrl: 'http://www.example.com/my-url-list.txt', userData: { isFromUrl: true } },
        ],
        persistStateKey: 'my-state',
        persistSourcesKey: 'my-sources',
    });

    // This call loads and parses the URLs from the remote file.
    await requestList.initialize();

    const crawler = new Apify.PuppeteerCrawler({
        requestList,
        maxConcurrency: 10, // Set max concurrency
        puppeteerPoolOptions: {
            maxOpenPagesPerInstance: 1, // Set up just one page per browser instance
        },
        // The function accepts a single parameter, which is an object with the following fields:
        // - request: an instance of the Request class with information such as URL and HTTP method
        // - page: Puppeteer's Page object (see https://pptr.dev/#show=api-class-page)
        handlePageFunction: async ({ request, page }) => {
            // Code you want to run on each page
        },
        // This function is called if page processing fails more than maxRequestRetries+1 times.
        handleFailedRequestFunction: async ({ request }) => {
            // Code you want to run when handlePageFunction fails
        },
    });

    // Run the crawler and wait for it to finish.
    await crawler.run();

    console.log('Crawler finished.');
});
Related
What I'm trying to accomplish is to enter this site https://www.discoverpermaculture.com/permaculture-masterclass-video-1, wait until it loads, load all comments from Disqus (click the 'Load more comments' button until it's no longer present), and save the page as MHTML for offline use.
I found a similar question here, Puppeteer / Node.js to click a button as long as it exists -- and when it no longer exists, commence action, but unfortunately trying to detect the 'Load more comments' button doesn't work for some reason.
It seems like waitForSelector('a.load-more__button') is not working, because all it prints out is "not visible".
Here's my code:
const puppeteer = require('puppeteer');
const fs = require('fs');

const url = "https://www.discoverpermaculture.com/permaculture-masterclass-video-1";

const isElementVisible = async (page, cssSelector) => {
    let visible = true;
    await page
        .waitForSelector(cssSelector, { visible: true, timeout: 4000 })
        .catch(() => {
            console.log('not visible');
            visible = false;
        });
    return visible;
};

async function run() {
    let browser = await puppeteer.launch({
        headless: true,
        defaultViewport: null,
        args: [
            '--window-size=1920,10000',
        ],
    });
    const page = await browser.newPage();
    await page.goto(url);
    await page.waitForNavigation();
    await page.waitForTimeout(4000);

    const selectorForLoadMoreButton = 'a.load-more__button';
    let loadMoreVisible = await isElementVisible(page, selectorForLoadMoreButton);
    while (loadMoreVisible) {
        console.log('load more visible');
        await page
            .click(selectorForLoadMoreButton)
            .catch(() => {});
        await page.waitForTimeout(4000);
        loadMoreVisible = await isElementVisible(page, selectorForLoadMoreButton);
    }

    const cdp = await page.target().createCDPSession();
    const { data } = await cdp.send('Page.captureSnapshot', { format: 'mhtml' });
    fs.writeFileSync('page.mhtml', data);

    await browser.close();
}

run();
You're just waiting for an AJAX request to be processed. You could simply save the total number of comments (top left of the Disqus plugin) and compare it to the array of loaded comments; once the array length equals the total, you've retrieved every comment.
I posted something a while back on waiting for AJAX requests; you can see it here: https://stackoverflow.com/a/66092889/3645650.
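For illustration, a rough sketch of that counting idea; both selectors below are hypothetical placeholders, so inspect the actual Disqus markup for the real class names:

// Read the advertised total, then keep clicking until that many comments are loaded.
// '.comment-count' and '.post' are hypothetical placeholder selectors.
const total = await page.$eval('.comment-count', el => parseInt(el.textContent, 10));

let loaded = 0;
while (loaded < total) {
    // Click "Load more comments" if present, then give the AJAX call time to land.
    await page.click('a.load-more__button').catch(() => {});
    await page.waitForTimeout(2000);
    loaded = (await page.$$('.post')).length;
}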
Alternatively, a simpler approach would be to just use the Disqus API.
Comments are publicly accessible. You can just use the API key from the website:
https://disqus.com/api/3.0/threads/listPostsThreaded?limit=50&thread=7187962034&forum=pdc2018&order=popular&cursor=1%3A0%3A0&api_key=E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F
The parameters and their options:
limit - Defaults to 50. Maximum is 100.
thread - Thread number, e.g. 7187962034.
forum - Forum id, e.g. pdc2018.
order - desc, asc, popular.
cursor - Probably the page number. The format is 1:0:0, e.g. page 2 would be 2:0:0.
api_key - The platform API key. Here the API key is E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F.
If you have to iterate through different pages, you will need to intercept the XHR responses to retrieve the thread number.
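For example, a minimal sketch of calling that endpoint from Node with node-fetch, using the same values as in the URL above (page through by bumping the cursor):

const fetch = require('node-fetch');

// Fetch one page of comments from the Disqus API, using the values
// taken from the URL above.
async function listPosts(cursor = '1:0:0') {
    const params = new URLSearchParams({
        limit: '50',
        thread: '7187962034',
        forum: 'pdc2018',
        order: 'popular',
        cursor,
        api_key: 'E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F',
    });
    const res = await fetch(`https://disqus.com/api/3.0/threads/listPostsThreaded?${params}`);
    return res.json();
}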
It turned out the problem was that the Disqus comments were inside an iframe:
// needed to add those 2 lines
const elementHandle = await page.waitForSelector('iframe');
const frame = await elementHandle.contentFrame();

// and change 'page' to 'frame' below
let loadMoreVisible = await isElementVisible(frame, selectorForLoadMoreButton);
while (loadMoreVisible) {
    console.log('load more visible');
    await frame
        .click(selectorForLoadMoreButton)
        .catch(() => {});
    await frame.waitForTimeout(4000);
    loadMoreVisible = await isElementVisible(frame, selectorForLoadMoreButton);
}
After this change, it works perfectly.
I am trying to click the "Create File" button on Facebook's "Download Your Information" page. I am currently able to go to the page, and I wait for the login process to finish. However, when I try to detect the button using
page.$x("//div[contains(text(),'Create File')]")
nothing is found. The same thing occurs when I try to find it in the Chrome DevTools console, both in a Puppeteer window and in a regular window outside of the instance of Chrome that Puppeteer is controlling:
This is the HTML info for the element:
I am able to find the element, however, only after I have clicked on it using the Chrome DevTools inspector tool:
(the second print statement is from after I clicked on it with the element inspector tool)
How should I select this element? I am new to Puppeteer and to XPath, so I apologize if I just missed something obvious.
A few links I remember looking at previously:
Puppeteer can't find selector
puppeteer cannot find element
puppeteer: how to wait until an element is visible?
My Code:
const puppeteer = require("puppeteer-extra"); // assuming puppeteer-extra, since puppeteer.use() is called below
const StealthPlugin = require("puppeteer-extra-plugin-stealth");

(async () => {
    let browser;
    try {
        puppeteer.use(StealthPlugin());
        browser = await puppeteer.launch({
            headless: false,
            // path: "C:\\Program Files (x86)\\Google\\Chrome\\Application\\chrome.exe",
            args: ["--disable-notifications"],
        });
        const pages = await browser.pages();
        const page = pages[0];
        const url = "https://www.facebook.com/dyi?referrer=yfi_settings";
        await page.goto(url);

        // Handle the login process. Since the login page is different from the url we want,
        // I am going to assume the user has logged in if they return to the desired page.

        // Wait for the login page to process
        await page.waitForFunction(
            (args) => {
                return window.location.href !== args[0];
            },
            { polling: "mutation", timeout: 0 },
            [url]
        );

        // Since multifactor auth can resend the user temporarily to the desired url,
        // use a little debouncing to make sure the user is completely done signing in.
        // Make sure there is no redirect for mfa.
        await page.waitForFunction(
            async (args) => {
                // Function to make sure there is a debouncing delay between checking the url.
                // Taken from: https://stackoverflow.com/a/49813472/11072972
                function delay(delayInms) {
                    return new Promise((resolve) => {
                        setTimeout(() => {
                            resolve(2);
                        }, delayInms);
                    });
                }
                if (window.location.href === args[0]) {
                    await delay(2000);
                    return window.location.href === args[0];
                }
                return false;
            },
            { polling: "mutation", timeout: 0 },
            [url]
        );

        // await page.waitForRequest(url, { timeout: 100000 });

        const requestArchiveXpath = "//div[contains(text(),'Create File')]";
        await page.waitForXPath(requestArchiveXpath);
        const [requestArchiveSelector] = await page.$x(requestArchiveXpath);
        await page.click(requestArchiveSelector);
        page.waitForTimeout(3000);
    } catch (e) {
        console.log("End Error: ", e);
    } finally {
        if (browser) {
            await browser.close();
        }
    }
})();
Resolved using the comment above by @vsemozhebuty and source. Only the last few lines inside the try must change:
const iframeXpath = "//iframe[not(@hidden)]";
const requestArchiveXpath = "//div[contains(text(),'Create File')]";

// Wait for and get iframe
await page.waitForXPath(iframeXpath);
const [iframeHandle] = await page.$x(iframeXpath);

// Content frame for iframe => https://devdocs.io/puppeteer/index#elementhandlecontentframe
const frame = await iframeHandle.contentFrame();

// Wait for and get button
await frame.waitForXPath(requestArchiveXpath);
const [requestArchiveSelector] = await frame.$x(requestArchiveXpath);

// Click button
await requestArchiveSelector.click();
await page.waitForTimeout(3000);
I'm trying to crawl several web pages to check for broken links and write the results to JSON files. However, after the first file is completed, the app crashes with no error popping up...
I'm using Puppeteer to crawl, Bluebird to run each link concurrently, and fs to write the files.
WHAT I'VE TRIED:
Switching the file type to '.txt' or '.php'. This works, but I need to create another loop outside the current workflow to convert the files from '.txt' to '.json'. Renaming the file right after writing to it also causes the app to crash.
Using try/catch statements around fs.writeFile, but it never throws an error (see the note after this list).
Running the entire app outside of Express. This worked at some point, but I'm trying to use it within the framework.
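(On the try/catch point: the callback form of fs.writeFile reports errors through its callback rather than by throwing, so a surrounding try/catch never fires. A sketch of the promise form, which does reject and can be awaited inside try/catch, reusing the fileName and linksJson from the code below:)

const { promises: fsp } = require('fs');

// The promise API rejects on failure, so try/catch works here.
try {
    await fsp.writeFile(`./tmp/${fileName}.json`, linksJson);
} catch (err) {
    console.log(err);
}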
const express = require('express');
const router = express.Router();
const puppeteer = require('puppeteer');
const bluebird = require("bluebird");
const fs = require('fs');

router.get('/', function(req, res, next) {
    (async () => {
        // Our (multiple) URLs.
        const urls = ['https://www.testing.com/allergy-test/', 'https://www.testing.com/genetic-testing/'];

        const withBrowser = async (fn) => {
            const browser = await puppeteer.launch();
            try {
                return await fn(browser);
            } finally {
                await browser.close();
            }
        };

        const withPage = (browser) => async (fn) => {
            const page = await browser.newPage();

            // Turn the request interceptor on.
            await page.setRequestInterception(true);

            // Ignore all the asset requests, just get the document.
            page.on('request', request => {
                if (request.resourceType() === 'document') {
                    request.continue();
                } else {
                    request.abort();
                }
            });

            try {
                return await fn(page);
            } finally {
                await page.close();
            }
        };

        const results = await withBrowser(async (browser) => {
            return bluebird.map(urls, async (url) => {
                return withPage(browser)(async (page) => {
                    await page.goto(url, {
                        waitUntil: 'domcontentloaded',
                        timeout: 0 // Removes the timeout.
                    });

                    // Search for urls we want to "crawl".
                    const hrefs = await page.$$eval('a[href^="https://www.testing.com/"]', as => as.map(a => a.href));

                    // Predefine our arrays.
                    let links = [];
                    let redirect = [];

                    // Loop through each /goto/ url on the page.
                    for (const href of Object.entries(hrefs)) {
                        const response = await page.goto(href[1], {
                            waitUntil: 'domcontentloaded',
                            timeout: 0 // Removes the timeout.
                        });

                        const chain = response.request().redirectChain();
                        const link = {
                            'source_url': href[1],
                            'status': response.status(),
                            'final_url': response.url(),
                            'redirect_count': chain.length,
                        };

                        // Loop through the redirect chain for each href.
                        for (const ch of chain) {
                            redirect = {
                                status: ch.response().status(),
                                url: ch.url(),
                            };
                        }

                        // Push all info about the target link into links.
                        links.push(link);
                    }

                    // JSONify the data.
                    const linksJson = JSON.stringify(links);

                    let fileName = url.replace('https://www.testing.com/', '');
                    fileName = fileName.replace(/[^a-zA-Z0-9\-]/g, '');

                    // Write data to a file in the /tmp directory.
                    fs.writeFile(`./tmp/${fileName}.json`, linksJson, (err) => {
                        if (err) {
                            return console.log(err);
                        }
                    });
                });
            }, { concurrency: 4 }); // How many pages to run at a time.
        });
    })();
});

module.exports = router;
UPDATE:
So there is nothing wrong with my code... I realized nodemon was stopping the process after each file was saved. Since nodemon would detect a "file change", it kept restarting my server after the first item.
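A common fix, for anyone who hits the same thing: tell nodemon to ignore the output directory (assuming the JSON files land in ./tmp as in the code above), either on the command line:

nodemon --ignore 'tmp/*'

or in nodemon.json:

{
    "ignore": ["tmp/*"]
}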
Comparing the response time of Node to Chrome, Node is slower. I'm making a request to the same page: two requests, one from Node and the second one from the Chrome console.
Chrome v: Latest
Node v: 12
OS: Windows x64
Node:
const fetch = require("node-fetch");
const url = "https://poshmark.com/search?query=t%20shirts&availability=sold_out&department=All";

(async () => {
    console.time("Load Time: ");
    await fetch(url);
    console.timeEnd("Load Time: ");
})();
Chrome:
Go to the URL and then run this in the console.
(async () => {
    console.time("Load Time: ");
    var request = await fetch(location);
    console.timeEnd("Load Time: ");
})();
Results:
Node: ~3.746s
Chrome: 1030.53515625ms
Is there anything that we can do to fix this?
Thanks for the help.
With this code in node.js that loads the whole response with two different libraries:
"use strict";
const got = require('got');
const fetch = require("node-fetch");
async function run1(silent) {
let url = "https://poshmark.com/search?query=t%20shirts&availability=sold_out&department=All";
url += `&random=${Math.floor(Math.random() * 1000000)}`;
if (silent) {
let {body} = await got(url);
} else {
console.time("Load Time Got");
let {body} = await got(url);
console.timeEnd("Load Time Got");
}
}
async function run2(silent) {
let url = "https://poshmark.com/search?query=t%20shirts&availability=sold_out&department=All";
url += `&random=${Math.floor(Math.random() * 1000000)}`;
if (silent) {
let body = await fetch(url).then(res => res.text());
} else {
console.time("Load Time Fetch");
let body = await fetch(url).then(res => res.text());
console.timeEnd("Load Time Fetch");
}
}
function delay(t) {
return new Promise(resolve => setTimeout(resolve, t));
}
async function go() {
await delay(1000);
// one throw away run for each to make sure everything is fully initialized
await run1(true);
await run2(true);
// let garbage collector settle down
await delay(1000);
await run1(false);
await delay(1000);
await run2(false);
}
go();
And, run four separate times, I get these results:
Load Time Got: 967.574ms
Load Time Fetch: 921.211ms
Load Time Got: 872.823ms
Load Time Fetch: 858.379ms
Load Time Got: 802.700ms
Load Time Fetch: 930.276ms
Load Time Got: 819.646ms
Load Time Fetch: 966.878ms
So, I'm not seeing multiple-second responses here. Note that I'm generating a unique query parameter for each URL to attempt to defeat any caching anywhere.
Using a similar randomized URL in similar code in the browser console that loads the whole response body, I get runs of 748.2ms, 647.9ms, 738.2ms.
Using Puppeteer, I cannot figure out how to get document.readyState. I need to wait until the page is loaded before rendering a PDF.
const browser = await puppeteer.launch({
    headless: true,
    args: ['--no-sandbox']
});
const page = await browser.newPage();

console.log('Setting HTML content...');

// Can't POST data with headless Chrome, so we have to get the HTML and set the
// content of the page, then render that to a PDF.
await page.setContent(html);

// Generates a PDF with 'screen' media type.
await page.emulateMedia('screen');

var renderPage = function () {
    return new Promise(async resolve => {
        await page.evaluate((document) => {
            console.log(document);
            const handleDocumentLoaded = () => {
                console.log('readyState: ', document.readyState);
                console.log('Rendering PDF...');
                Promise.resolve(resolve(page.pdf({ path: thisPDFfileName, format: 'Letter' })));
            };
            if (document.readyState === "loading") {
                document.addEventListener("DOMContentLoaded", handleDocumentLoaded);
            } else {
                handleDocumentLoaded();
            }
        });

        // I also tried this... no luck
        // setTimeout(async function () {
        //     console.log('Awaiting document...');
        //
        //     const handle = await page.evaluateHandle(() => ({ window, document }));
        //     const properties = await handle.getProperties();
        //     const windowHandle = properties.get('window');
        //     const documentHandle = properties.get('document');
        //     await handle.dispose();
        //
        //     console.log('readyState: ', documentHandle.readyState);
        //     if ("complete" === documentHandle.readyState) {
        //         await documentHandle.dispose();
        //         console.log('readyState: ', doc.readyState);
        //         console.log('Rendering PDF...');
        //         resolve(page.pdf({ path: thisPDFfileName, format: 'Letter' }));
        //     } else {
        //         renderPage();
        //     }
        // }, 250);
    });
};

// Delay required to allow page to render JS before creating PDF.
await renderPage();
await browser.close();
sendPdfToClient();
I tried evaluateHandle and could only get the innerHTML, not the document object itself.
What's the correct way to get the document object containing readyState?
Lastly, should I set a listener for load or DOMContentLoaded? I need to wait until the Google Maps JS renders the map. I can send a custom event if need be, since I control the page being rendered.
If you are using page.goto(), you can use the waitUntil option to specify when to consider navigation as complete:
The waitUntil events include:
load - consider navigation to be finished when the load event is fired.
domcontentloaded - consider navigation to be finished when the DOMContentLoaded event is fired.
networkidle0 - consider navigation to be finished when there are no more than 0 network connections for at least 500 ms.
networkidle2 - consider navigation to be finished when there are no more than 2 network connections for at least 500 ms.
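For example, a minimal sketch (networkidle0 is often the safest choice before rendering a PDF of a JS-heavy page, though the right event depends on how the page loads):

// Consider navigation finished only once the network has been idle for 500 ms.
await page.goto(url, { waitUntil: 'networkidle0' });

// Multiple events can also be combined:
await page.goto(url, { waitUntil: ['load', 'networkidle0'] });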
Alternatively, you can use page.on() to wait for the 'domcontentloaded' event or 'load' event.
I guess I was overcomplicating it. Apparently there is already
page.once('load', () => console.log('Page loaded!'));
which does exactly this. :-D
See detailed documentation here:
https://github.com/GoogleChrome/puppeteer/blob/master/docs/api.md#event-load
There are 2 events which are relevant to your problem:
event: 'domcontentloaded'
event: 'load'
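If you specifically want to check document.readyState, one straightforward option (a small sketch reusing the page and thisPDFfileName from the question) is page.waitForFunction, which polls inside the page:

// Poll inside the page until the document reports it is fully loaded.
await page.waitForFunction(() => document.readyState === 'complete');

// Then render the PDF from Node, outside page.evaluate().
await page.pdf({ path: thisPDFfileName, format: 'Letter' });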