I have a PDF generation API that takes JSON data, renders an HTML page using EJS templates, and generates a PDF. For a 45-page PDF it takes around 2 minutes. When I try it with a large amount of data that results in 450+ pages, it just hangs. Previously it threw TimeoutError: waiting for Page.printToPDF failed: timeout 30000ms exceeded, but I added the timeout: 0 option. Now it just hangs with no response at all. I have looked for solutions online with no luck.
Here's the code:
ejs.renderFile(filePath, { data }, async (err, html) => {
  if (err) {
    // error handler
  } else {
    const browser = await puppeteer.launch({
      headless: true,
      args: ["--no-sandbox", "--disable-dev-shm-usage", "--disable-gpu"]
    });
    const page = await browser.newPage();
    await page.setContent(html, { waitUntil: "domcontentloaded" });
    const fileStream = await page.pdf({
      displayHeaderFooter: true,
      footerTemplate:
        '<div class="footer" style="font-size: 10px; color: #000; margin: 10px auto; clear: both; position: relative;"><span class="pageNumber"></span></div>',
      margin: { top: 30, bottom: 50 },
    });
    await browser.close();
    res.send(fileStream);
  }
});
I added some console logs. It gets stuck on page.pdf().
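For what it's worth, here is a minimal sketch of one direction to try, assuming a recent Puppeteer (v19.2+ added a protocolTimeout launch option): timeout: 0 on page.pdf() only disables Puppeteer's own wait for that call, while the protocol-level timeout on the underlying CDP message can still expire or hang silently. buildLaunchOptions and renderPdf below are illustrative names, not from the original code.

```javascript
// Sketch of a possible fix, assuming Puppeteer v19.2+ where launch()
// accepts `protocolTimeout`. This is a hedge, not a confirmed fix.
function buildLaunchOptions() {
  // pure helper: returns the launch config so it is easy to inspect
  return {
    headless: true,
    protocolTimeout: 0, // disable the CDP-level timeout as well
    args: ['--no-sandbox', '--disable-dev-shm-usage', '--disable-gpu'],
  };
}

async function renderPdf(html) {
  const puppeteer = require('puppeteer'); // loaded lazily; needs puppeteer installed
  const browser = await puppeteer.launch(buildLaunchOptions());
  try {
    const page = await browser.newPage();
    await page.setContent(html, { waitUntil: 'domcontentloaded' });
    return await page.pdf({ timeout: 0 }); // Buffer with the PDF bytes
  } finally {
    await browser.close(); // release Chromium even if page.pdf() throws
  }
}
```

If the hang persists, rendering the document in smaller batches and merging the PDFs afterwards is another hedge, since a single huge Page.printToPDF call is memory-hungry.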
I have been experiencing this issue for a long time now. I have a web scraper on a Windows VM, set to run every few hours. It works most of the time, but quite often Puppeteer just opens this page 👇 and not the site or page I want to open.
Why does that happen and what can be the fix for this?
A simple reproduction of this issue is this code:
import puppeteer from 'puppeteer';
import { scheduleJob } from 'node-schedule';

async function run() {
  const browser = await puppeteer.launch({
    headless: false,
    executablePath: chromePath, // chromePath and hours are defined elsewhere
    defaultViewport: null,
    timeout: 0,
    args: ['--no-sandbox', '--start-maximized'],
  });
  const page = await browser.newPage();
  await page.evaluateOnNewDocument(() => {
    Object.defineProperty(navigator, 'webdriver', {
      get: () => false,
    });
  });
  await page.goto('https://aliexpress.com', {
    waitUntil: 'networkidle0',
    timeout: 0,
  });
}

run();
scheduleJob('scrape aliexpress', `0 */${hours} * * *`, run);
Imagine keeping track of a page like this (open it with Chrome, then right-click and select Translate to English):
http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=35366681030756042
When you press F12 and open the Network tab, you can see responses arriving at roughly one per second, containing the latest prices and trades, with HTTP headers like these:
{
  ...
  connection: keep-alive
  cookies: fooCookie
  ...
}
I have tried the got package with a keep-alive config:
const got = require('got');
// HttpAgent is the default export of agentkeepalive; HttpsAgent is a named property
const HttpAgent = require('agentkeepalive');
const { HttpsAgent } = require('agentkeepalive');

const gotOption = {
  keepAlive: true,
  maxSockets: 10,
};

await got.get(url, {
  agent: {
    http: new HttpAgent(gotOption),
    https: new HttpsAgent(gotOption),
  },
});
I get just the first response, but how can I get new responses?
Is it possible to use Puppeteer for this purpose?
Well, there is a new XHR request being made every 3 to 5 seconds.
You could run a function that triggers on that specific event, intercepting .aspx responses and running your script on each one. Here is a minimal snippet.
let puppeteer = require('puppeteer');

(async () => {
  let browser = await puppeteer.launch({
    headless: true,
  });
  let page = await browser.newPage();
  (await browser.pages())[0].close(); // close the default blank tab
  let res = 0;
  page.on('response', async (response) => {
    if (response.url().includes('.aspx')) {
      res++;
      console.log('\u001b[1;36m' + `Response ${res}: ${new Date(Date.now())}` + '\u001b[0m');
    }
  });
  await page.goto('http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=35366681030756042');
  // await browser.close();
})();
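Building on the snippet above, here is a sketch (not tested against the live site) of also reading the payload of each intercepted poll via response.text(); isAspxResponse and watchPrices are names invented for illustration:

```javascript
// pure helper: decide whether a response URL is one of the .aspx polls
function isAspxResponse(url) {
  return url.includes('.aspx');
}

async function watchPrices(targetUrl) {
  const puppeteer = require('puppeteer'); // loaded lazily; needs puppeteer installed
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  page.on('response', async (response) => {
    if (!isAspxResponse(response.url())) return;
    try {
      const body = await response.text(); // raw payload of each poll
      console.log(new Date().toISOString(), body.slice(0, 80));
    } catch (_) {
      // body may be unavailable for redirects or cached responses; skip those
    }
  });
  await page.goto(targetUrl);
}
// watchPrices('http://www.tsetmc.com/Loader.aspx?ParTree=151311&i=35366681030756042');
```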
I'm trying to search the Kwik Trip website for daily deals using Node.js, but I keep getting a timeout error when I try to crawl it. I'm not quite sure what could be happening. Does anyone know what may be going on?
Below is my code. I'm waiting for .agendaItemWrap to load before grabbing the HTML, because it's a SPA.
function getQuickStar(req, res) {
  (async () => {
    try {
      const browser = await puppeteer.launch();
      const page = await browser.newPage();
      const navigationPromise = page.waitForNavigation({ waitUntil: "domcontentloaded" });
      await page.goto('https://www.kwiktrip.com/savings/daily-deals');
      await navigationPromise;
      await page.waitForSelector('.agendaItemWrap', { timeout: 30000 });
      const body = await page.evaluate(() => {
        return document.querySelector('body').innerHTML;
      });
      console.log(body);
      await browser.close();
    } catch (error) {
      console.log(error);
    }
  })();
}
Here's a link to the web page I'm trying to crawl https://www.kwiktrip.com/savings/daily-deals
It appears your desired selector is located inside an iframe, not in the page's main frame.
You then need to wait for that iframe, and perform the waitForSelector on that particular frame.
Quick tip: you don't need any page.waitForNavigation with a page.goto, because you can set the waitUntil condition in the options. By default it waits for the page's load event.
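Here is a sketch of what that could look like, assuming the deals iframe can be identified by a substring of its URL (the 'kwiktrip' fragment below is a placeholder guess — inspect the page to find the real iframe URL or name); findFrameByUrlPart and getDeals are illustrative names:

```javascript
// pure helper: pick the first frame whose URL contains the given fragment
function findFrameByUrlPart(frames, urlPart) {
  return frames.find((f) => f.url().includes(urlPart)) || null;
}

async function getDeals() {
  const puppeteer = require('puppeteer'); // loaded lazily; needs puppeteer installed
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://www.kwiktrip.com/savings/daily-deals', {
    waitUntil: 'networkidle0', // replaces the separate waitForNavigation call
  });
  // 'kwiktrip' is a placeholder — substitute part of the real iframe URL
  const frame = findFrameByUrlPart(page.frames(), 'kwiktrip');
  if (!frame) throw new Error('deals iframe not found');
  await frame.waitForSelector('.agendaItemWrap', { timeout: 30000 });
  const html = await frame.evaluate(() => document.body.innerHTML);
  await browser.close();
  return html;
}
```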
I am using Puppeteer on Node.js to render a PDF document via a page. However, images hosted on AWS S3 cannot be loaded, whereas images stored locally on the server itself load fine.
I tried both waitUntil: "networkidle0" and waitUntil: "networkidle2", but it still does not work. I also tried adding printBackground: true.
The images load perfectly fine on the page as seen here.
However, on the PDF generated by Puppeteer, the images do not load.
This is my code:
(async () => {
  const browser = await puppeteer.launch({
    args: ["--no-sandbox"]
  });
  const page = await browser.newPage();
  await page.setExtraHTTPHeaders({
    authorization: req.session.token
  });
  await page.goto(config.url + "/project/download/" + permalink, {
    waitUntil: "networkidle0"
  });
  const buffer = await page.pdf({
    filename: permalink + "_ProjectBrief" + ".pdf",
    format: "A4",
    margin: {
      top: 60,
      bottom: 60,
      left: 60,
      right: 50
    },
  });
  res.type("application/pdf");
  res.send(buffer);
  await browser.close();
})();
Any idea what I should do to get over this issue?
Thanks in advance!
I think I solved the issue.
After adding headless: false to
const browser = await puppeteer.launch({
args: ["--no-sandbox"]
});
I realised that the images did not load because of a 400 error. My hypothesis is that Chromium did not have enough time to download the images, thus throwing this error.
What I did was to edit the HTML file that I want Puppeteer to render, adding this code to it:
data.ProjectImages.forEach(e => {
  window.open(e.imageUrl, '_blank');
});
What this does is open each image URL in a new tab. This ensures the images are downloaded by the Chromium instance (Puppeteer runs Chromium).
The rendered PDF can now display the images.
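An alternative to the window.open() workaround, sketched under the assumption that the failures are a timing issue: explicitly wait until every <img> on the page has finished decoding before calling page.pdf(). allImagesComplete and waitForImages are illustrative names; the predicate is duplicated inside waitForFunction because that callback is serialized into the browser context and cannot reference outer functions.

```javascript
// pure helper mirroring the in-page check: an image is done when it is
// marked complete and actually decoded to a non-zero size
function allImagesComplete(imgs) {
  return imgs.every((img) => img.complete && img.naturalWidth > 0);
}

// polls inside the page until every <img> has loaded (or 30s elapse)
async function waitForImages(page) {
  await page.waitForFunction(
    () =>
      Array.from(document.images).every(
        (img) => img.complete && img.naturalWidth > 0
      ),
    { timeout: 30000 }
  );
}
// usage sketch: await waitForImages(page); then call page.pdf()
```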
I'm trying to click a button on this website, but when I enter the site an alert box shows up and I don't know how to close it.
I just started experimenting with Puppeteer; this is the simple code I'm using right now:
const ptr = require('puppeteer');

ptr.launch().then(async browser => {
  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 800 });
  await page.goto('https://portaleperiti.grupporealemutua.it/PPVET/VetrinaWebPortalePeriti/');
  // This is the alert button selector
  await page.click("#BoxAlertBtnOk");
  // This is the button on the page I want to click on
  await page.click("input[value='Perito RE / G.F.']");
  await page.screenshot({
    path: 'screenshot.png',
    fullPage: true
  });
  await browser.close();
});
This is the error I get: UnhandledPromiseRejectionWarning: Error: Node is either not visible or not an HTMLElement
at ElementHandle._clickablePoint
Any help would be really appreciated, thanks!
There are a few things going on with that page:
The alert box only loads after the page is loaded (it has an onload property on the body tag), so you should wait until the network is idle.
Clicking those "Perito" buttons creates a new window/tab, due to the window.open() call in the onclick handler.
The new tab redirects multiple times and shows a login page if the user is not already logged in.
Solution:
1. Make sure to load the page properly.
Just add { waitUntil: "networkidle0" } to .goto or .waitForNavigation.
await page.goto(
  "https://portaleperiti.grupporealemutua.it/PPVET/VetrinaWebPortalePeriti/",
  { waitUntil: "networkidle0" }
  // <-- make sure the whole page is completely loaded
);
2. Wait for the element before clicking
As already suggested in other answers, wait for the element using waitFor.
// wait and click the alert button
await page.waitFor("#BoxAlertBtnOk");
await page.click("#BoxAlertBtnOk");
3. Optional: add a few seconds before taking the screenshot after clicking the button.
// optional, add few seconds before taking this screenshot
// just to make sure it works even on slow machine
await page.waitFor(2000);
await page.screenshot({
  path: "screenshot_before.png",
  fullPage: true
});
4. Use page.evaluate and document.querySelector to get the element
page.click will not handle all kinds of clicks. Sometimes different events are bound to an element and you have to treat them separately.
// click using querySelector and the element's native click() method;
// plain page.click does not trigger the onclick handler on this page
await page.evaluate(() =>
  document.querySelector("input[value='Perito RE / G.F.']").click()
);
5. Treat the new tab separately
Together with browser.once('targetcreated'), a new Promise, and browser.pages(), you can catch the newly created tab and work with it.
Note: Read final code at end of the answer before using this.
// this is the final page after clicking the input on previous page
// https://italy.grupporealemutua.it/FIM/sps/IDPRMA/saml20/login
function newTabCatcher(browser) {
  // we resolve this promise after doing everything we need to do on the new page,
  // or reject on error
  return new Promise((resolve, reject) => {
    // set the listener before clicking the button to have proper interaction
    // we listen for only one new tab
    browser.once("targetcreated", async function () {
      console.log("New Tab Created");
      try {
        // get the newly created window
        const tabs = await browser.pages();
        const lastTab = tabs[tabs.length - 1];
        // wait for navigation to finish as well as the specific login form
        await Promise.all([
          lastTab.waitForNavigation({ waitUntil: "networkidle0" }),
          lastTab.waitFor("#div_login")
        ]);
        // the browser switches to this tab just when it takes the screenshot
        await lastTab.screenshot({
          path: "screenshot_newtab.png",
          fullPage: true
        });
        resolve(true);
      } catch (error) {
        reject(error);
      }
    });
  });
}
Final Code:
Just for clarity, here is how I used all code snippets specified above.
const ptr = require("puppeteer");

ptr.launch({ headless: false }).then(async browser => {
  const page = await browser.newPage();
  await page.setViewport({ width: 1280, height: 800 });
  await page.goto(
    "https://portaleperiti.grupporealemutua.it/PPVET/VetrinaWebPortalePeriti/",
    { waitUntil: "networkidle0" }
    // <-- make sure the whole page is completely loaded
  );
  // wait for and click the alert button
  await page.waitFor("#BoxAlertBtnOk");
  await page.click("#BoxAlertBtnOk");
  // optional: add a few seconds before taking this screenshot,
  // just to make sure it works even on a slow machine
  await page.waitFor(2000);
  await page.screenshot({
    path: "screenshot_before.png",
    fullPage: true
  });
  // click using querySelector and the element's native click() method;
  // plain page.click does not trigger the onclick handler on this page
  await page.evaluate(() =>
    document.querySelector("input[value='Perito RE / G.F.']").click()
  );
  // here we go and process the new tab
  // (screenshot, filling the form, etc.)
  await newTabCatcher(browser);
  // rest of your code
  // ...
  await browser.close();
});
Result:
It worked flawlessly!
Note:
Notice how I used new Promise and async/await together. This might not be best practice, but now you have a lead on what to look for when creating a scraper for some older websites.
If it's relevant to anyone else facing dialog boxes, the following code solved it for me:
this.page.on('dialog', async dialog => {
  await dialog.dismiss();
});
Your button #BoxAlertBtnOk appears on the webpage only after a moment; when you call await page.click("#BoxAlertBtnOk"); the button is still invisible. Try waiting until it is visible before taking the action:
await page.waitForSelector("#BoxAlertBtnOk");
await page.click("#BoxAlertBtnOk");
await page.waitForSelector("input[value='Perito RE / G.F.']");
await page.click("input[value='Perito RE / G.F.']");