I have a Node.js service running on Ubuntu that uses Puppeteer to take screenshots of pages, but on some random URLs the method page.screenshot({fullPage: true, type: 'jpeg'}) doesn't work and no errors are shown in the log. The code is:
async takeScreenshot() {
    console.log('trying take Screenshot [...]');
    let image = await this.page.screenshot({fullPage: true, type: 'jpeg'});
    console.log('Completed!');
    return image;
}
An example of page that I had this problem is: https://nuevo.jumbo.cl/mayonesa-hellmanns-751-g-supreme-light/p
It seems you are not waiting until the page is fully loaded before taking the screenshot. You need the waitUntil option (in current Puppeteer versions the valid values are 'networkidle0' and 'networkidle2'):
const page = await browser.newPage();
await page.goto(inputImgUrl, {waitUntil: 'networkidle0'});
So the whole thing should look something like this:
(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.setViewport(viewPortData); // set the viewport before navigating
    await page.goto(inputImgUrl, {waitUntil: 'networkidle0'});
    await page.screenshot({path: outputFile, type: 'jpeg', quality: 50, clip: cropData});
    await browser.close();
})();
You did not explain how it failed or what you expected to see, but I'm assuming it captured something like the following image.
When you first load the page, it renders something like this. That is not related to Puppeteer: you are not telling the code to wait for any amount of time, or if you are, I don't see it specified anywhere.
For me, a plain screenshot worked, apart from the region settings. I used version 0.13 and the following code:
await page.goto("https://nuevo.jumbo.cl/mayonesa-hellmanns-751-g-supreme-light/p", {waitUntil: "networkidle0"});
await page.screenshot({
    path: "test_google.png",
    fullPage: true
});
I'm using crawlee#3.0.3 (not yet released, installed from GitHub), and I'm trying to block specific resources from loading with playwrightUtils.blockRequests (which isn't available in previous versions). When I try the code suggested in the official repo, it works as expected:
import { launchPlaywright, playwrightUtils } from 'crawlee';

const browser = await launchPlaywright();
const page = await browser.newPage();
await playwrightUtils.blockRequests(page, {
    // extraUrlPatterns: ['adsbygoogle.js'],
});
await page.goto('https://cnn.com');
await page.screenshot({ path: 'cnn_no_images.png' });
await browser.close();
I can see in the screenshot that the images aren't loaded. My problem has to do with the fact that I'm using PlaywrightCrawler:
const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 3,
    async requestHandler({ page, request }) {
        console.log(`Processing: ${request.url}`);
        await playwrightUtils.blockRequests(page);
        await page.screenshot({ path: 'cnn_no_images2.png' });
    },
});
This way I'm not able to block specific resources. My guess is that blockRequests needs launchPlaywright to work, and I don't see a way to pass that to PlaywrightCrawler. blockRequests has been available for Puppeteer for a while, so maybe someone has tried this before.
Also, I've tried route interception, but again I couldn't make it work with PlaywrightCrawler.
You can set up listeners or run code before navigation by using preNavigationHooks, like this:
const crawler = new PlaywrightCrawler({
    maxRequestsPerCrawl: 3,
    preNavigationHooks: [async ({ page }) => {
        await playwrightUtils.blockRequests(page);
    }],
    async requestHandler({ page, request }) {
        console.log(`Processing: ${request.url}`);
        await page.screenshot({ path: 'cnn_no_images2.png' });
    },
});
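To block specific resources through the hook (the original goal), the options object from the question's own snippet can be passed along. Below is a hedged sketch that factors the hook into a small helper; the pattern list is a made-up example, and `utils` is expected to be crawlee's playwrightUtils.

```javascript
// Sketch: a reusable preNavigationHook factory. The pattern list passed in is
// hypothetical; blockRequests and extraUrlPatterns come from the question's
// own working snippet.
function makeBlockHook(utils, extraUrlPatterns) {
    return async ({ page }) => {
        await utils.blockRequests(page, { extraUrlPatterns });
    };
}

// Usage with PlaywrightCrawler (sketch):
// const crawler = new PlaywrightCrawler({
//     preNavigationHooks: [makeBlockHook(playwrightUtils, ['adsbygoogle.js'])],
//     async requestHandler({ page, request }) { /* ... */ },
// });
```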
I'm trying to learn how to track changes in a div. I found a post that showed the following code:
(async () => {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();
    await page.exposeFunction('onCustomEvent', text => console.log(text));
    await page.goto('https://www.time.ir', {waitUntil: 'networkidle0'});
    await page.evaluate(() => {
        $('#digitalClock').bind("DOMSubtreeModified", function(e) {
            window.onCustomEvent(e.currentTarget.textContent.trim());
        });
    });
})();
When I run this, it pulls the time from the webpage and console.logs the new time every second, which is exactly what I'm looking for. However, I'm having issues with any other page for some reason. For example, the very similar code below gives me an error:
(node:1801) UnhandledPromiseRejectionWarning: Error: Evaluation failed: ReferenceError: $$ is not defined
await page.exposeFunction('onCustomEvent', text => console.log(text));
await page.goto('https://www.clocktab.com', {waitUntil: 'networkidle0'});
await page.evaluate(() => {
    $('#digit2').bind("DOMSubtreeModified", function(e) {
        window.onCustomEvent(e.currentTarget.textContent.trim());
    });
});
I'm not sure what the difference is between them, other than the page I navigate to and the element whose changing value I'm watching. Additionally, I read somewhere that DOMSubtreeModified is deprecated now, so if there's a better way to get what I'm looking for, that would be great!
Thanks in advance
The difference is that the second website does not load jQuery, so when you send the evaluation function, $ is not defined.
Replace it with vanilla JS:
document.querySelector('#digit2').addEventListener("DOMSubtreeModified", function(e) {
    window.onCustomEvent(e.currentTarget.textContent.trim());
});
Suggestion: when I debug a Puppeteer evaluate function, I copy-paste it into the browser console on the target page to check that it works there first.
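Since DOMSubtreeModified is deprecated, a MutationObserver is the modern replacement. Below is a hedged sketch of a helper that wires a Node-side callback to mutations of one element; the exposed name 'onCustomEvent' follows the question's snippet, and the observer options are one reasonable choice, not the only one.

```javascript
// Sketch: watch one element for changes using a MutationObserver instead of
// the deprecated DOMSubtreeModified event.
async function watchSelector(page, selector, onChange) {
    // Expose a Node-side function the page can call on every change.
    await page.exposeFunction('onCustomEvent', onChange);
    // Install the observer inside the page context.
    await page.evaluate((sel) => {
        const target = document.querySelector(sel);
        new MutationObserver(() => {
            window.onCustomEvent(target.textContent.trim());
        }).observe(target, { childList: true, characterData: true, subtree: true });
    }, selector);
}

// Usage (sketch):
// await page.goto('https://www.clocktab.com', { waitUntil: 'networkidle0' });
// await watchSelector(page, '#digit2', text => console.log(text));
```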
I'm trying to search the Kwik Trip website for daily deals using Node.js, but I keep getting a timeout error when I try to crawl it. I'm not quite sure what could be happening. Does anyone know what may be going on?
Below is my code. I'm trying to wait for .agendaItemWrap to load before bringing back all of the HTML, because the site is a SPA.
function getQuickStar(req, res) {
    (async () => {
        try {
            const browser = await puppeteer.launch();
            const page = await browser.newPage();
            const navigationPromise = page.waitForNavigation({waitUntil: "domcontentloaded"});
            await page.goto('https://www.kwiktrip.com/savings/daily-deals');
            await navigationPromise;
            await page.waitForSelector('.agendaItemWrap', { timeout: 30000 });
            const body = await page.evaluate(() => {
                return document.querySelector('body').innerHTML;
            });
            console.log(body);
            await browser.close();
        } catch (error) {
            console.log(error);
        }
    })();
}
Here's a link to the web page I'm trying to crawl https://www.kwiktrip.com/savings/daily-deals
It appears your desired selector is located inside an iframe, not in page.mainFrame().
You then need to wait for that iframe, and perform the waitForSelector on that particular frame.
Quick tip: you don't need page.waitForNavigation together with page.goto, because you can set the waitUntil condition in the goto options. By default it waits for the page load event.
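A minimal sketch of the iframe approach, assuming a Puppeteer-style frame API: pick the frame out of page.frames() by a URL fragment, then run waitForSelector on that frame. The URL fragment used here is hypothetical; inspect page.frames() on the real site to find the right one.

```javascript
// Sketch: locate the frame that hosts the selector by matching part of its URL.
function findFrame(frames, urlPart) {
    return frames.find(frame => frame.url().includes(urlPart));
}

// Usage (sketch):
// await page.goto('https://www.kwiktrip.com/savings/daily-deals', { waitUntil: 'networkidle0' });
// const frame = findFrame(page.frames(), 'daily-deals'); // hypothetical URL fragment
// await frame.waitForSelector('.agendaItemWrap', { timeout: 30000 });
```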
I am using Puppeteer on Node.js to render a PDF document from a page. However, images hosted on AWS S3 do not load, whereas images stored locally on the server itself load fine.
I tried adding both waitUntil: "networkidle0" and waitUntil: "networkidle2", but it still does not work. I also tried adding printBackground: true.
The images load perfectly fine on the page, as seen here.
However, on the PDF generated by Puppeteer, the images do not load.
This is my code:
(async () => {
    const browser = await puppeteer.launch({
        args: ["--no-sandbox"]
    });
    const page = await browser.newPage();
    await page.setExtraHTTPHeaders({
        authorization: req.session.token
    });
    await page.goto(config.url + "/project/download/" + permalink, {
        waitUntil: "networkidle0"
    });
    // note: page.pdf has no "filename" option ("path" is the file-writing
    // option), and here only the returned buffer is needed
    const buffer = await page.pdf({
        format: "A4",
        margin: {
            top: 60,
            bottom: 60,
            left: 60,
            right: 50
        }
    });
    res.type("application/pdf");
    res.send(buffer);
    await browser.close();
})();
Any idea what I should do to get past this issue?
Thanks in advance!
I think I solved the issue.
After adding headless: false to
const browser = await puppeteer.launch({
args: ["--no-sandbox"]
});
I realised that the images did not load because of a 400 error. My hypothesis is that Chromium did not have enough time to download the images and thus threw this error.
What I did was to edit the HTML file that I want Puppeteer to render, adding this code to it:
data.ProjectImages.forEach(e => {
    window.open(e.imageUrl, '_blank');
});
What this does is open each image URL in a new tab. This ensures the images are downloaded by the Chromium instance (Puppeteer runs Chromium).
The rendered PDF can now display the images.
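An alternative to opening tabs, under the assumption that the real problem is images finishing after networkidle0 fires: block PDF generation until every <img> on the page has actually completed. The predicate below is the check one would pass to page.waitForFunction; it is factored out here so it can be verified on its own.

```javascript
// Sketch: true once every image object reports it finished downloading with
// non-zero dimensions (a failed image has naturalWidth 0).
const allImagesLoaded = (images) =>
    images.every(img => img.complete && img.naturalWidth > 0);

// Usage in Puppeteer (sketch):
// await page.goto(url, { waitUntil: 'networkidle0' });
// await page.waitForFunction(() =>
//     Array.from(document.images).every(i => i.complete && i.naturalWidth > 0)
// );
// const buffer = await page.pdf({ format: 'A4' });
```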
I'm trying to click a button on this website, but when I enter the site an alert box shows up and I don't know how to close it.
I just started experimenting with Puppeteer; this is the simple code I'm using right now:
const ptr = require('puppeteer');

ptr.launch().then(async browser => {
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 800 });
    await page.goto('https://portaleperiti.grupporealemutua.it/PPVET/VetrinaWebPortalePeriti/');
    // This is the alert button selector
    await page.click("#BoxAlertBtnOk");
    // This is the button on the page I want to click
    await page.click("input[value='Perito RE / G.F.']");
    await page.screenshot({
        path: 'screenshot.png',
        fullPage: true
    });
    await browser.close();
});
This is the error I get:
UnhandledPromiseRejectionWarning: Error: Node is either not visible or not an HTMLElement
    at ElementHandle._clickablePoint
Any help would be really appreciated, thanks!
There are a few things going on with that page:
The alert box only loads after the page is loaded (there is an onload property on the body tag), so you should wait until the network is idle.
Clicking those "Perito" buttons creates a new window/tab, due to the window.open() call in the onclick handler.
The new tab redirects multiple times and shows a login page if the user is not already logged in.
Solution:
1. Make sure the page loads properly.
Just add { waitUntil: "networkidle0" } to .goto or .waitForNavigation.
await page.goto(
    "https://portaleperiti.grupporealemutua.it/PPVET/VetrinaWebPortalePeriti/",
    { waitUntil: "networkidle0" } // <-- make sure the whole page is completely loaded
);
2. Wait for the element before clicking
As already suggested in other answers, wait for the element using waitForSelector.
// wait for and click the alert button
await page.waitForSelector("#BoxAlertBtnOk");
await page.click("#BoxAlertBtnOk");
3. Optional: pause a few seconds before taking the screenshot after clicking the button. (page.waitFor is deprecated in newer Puppeteer versions, so a plain timeout promise is used here.)
// optional: pause briefly before taking this screenshot,
// just to make sure it works even on a slow machine
await new Promise(resolve => setTimeout(resolve, 2000));
await page.screenshot({
    path: "screenshot_before.png",
    fullPage: true
});
4. Use page.evaluate and document.querySelector to click the element
page.click will not handle every kind of click. Sometimes different events are bound to an element and you have to treat them separately.
// we can click using querySelector inside the page context;
// plain page.click does not trigger the onclick handler on this page
await page.evaluate(() =>
    document.querySelector("input[value='Perito RE / G.F.']").click()
);
5. Treat the new tab separately
Together with browser.once('targetcreated'), a new Promise, and browser.pages(), you can catch the newly created tab and work on it.
Note: read the final code at the end of the answer before using this.
// this is the final page after clicking the input on the previous page
// https://italy.grupporealemutua.it/FIM/sps/IDPRMA/saml20/login
function newTabCatcher(browser) {
    // we resolve this promise after doing everything we need to do on this page,
    // or reject on error
    return new Promise((resolve, reject) => {
        // set the listener before clicking the button to have proper interaction
        // we listen for only one new tab
        browser.once("targetcreated", async function() {
            console.log("New Tab Created");
            try {
                // get the newly created window
                const tabs = await browser.pages();
                const lastTab = tabs[tabs.length - 1];

                // wait for navigation to finish as well as the specific login form
                await Promise.all([
                    lastTab.waitForNavigation({ waitUntil: "networkidle0" }),
                    lastTab.waitForSelector("#div_login")
                ]);

                // the browser switches to this tab just when it takes the screenshot
                await lastTab.screenshot({
                    path: "screenshot_newtab.png",
                    fullPage: true
                });
                resolve(true);
            } catch (error) {
                reject(error);
            }
        });
    });
}
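Newer Puppeteer versions also offer browser.waitForTarget, which can replace the listener-plus-Promise pattern above. A hedged sketch follows; the 'login' URL fragment matches the redirect target quoted in this answer, and the predicate is factored out so it can be checked on its own.

```javascript
// Sketch: a target filter for browser.waitForTarget, matching the login page
// this site redirects to ('login' appears in the URL quoted above).
const isLoginTarget = target => target.url().includes('login');

// Usage (sketch):
// const [target] = await Promise.all([
//     browser.waitForTarget(isLoginTarget),
//     page.evaluate(() =>
//         document.querySelector("input[value='Perito RE / G.F.']").click()),
// ]);
// const newTab = await target.page();
// await newTab.waitForSelector('#div_login');
```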
Final Code:
Just for clarity, here is how I used all code snippets specified above.
const ptr = require("puppeteer");

ptr.launch({ headless: false }).then(async browser => {
    const page = await browser.newPage();
    await page.setViewport({ width: 1280, height: 800 });
    await page.goto(
        "https://portaleperiti.grupporealemutua.it/PPVET/VetrinaWebPortalePeriti/",
        { waitUntil: "networkidle0" } // <-- make sure the whole page is completely loaded
    );

    // wait for and click the alert button
    await page.waitForSelector("#BoxAlertBtnOk");
    await page.click("#BoxAlertBtnOk");

    // optional: pause briefly before taking this screenshot,
    // just to make sure it works even on a slow machine
    await new Promise(resolve => setTimeout(resolve, 2000));
    await page.screenshot({
        path: "screenshot_before.png",
        fullPage: true
    });

    // we can click using querySelector inside the page context;
    // plain page.click does not trigger the onclick handler on this page
    await page.evaluate(() =>
        document.querySelector("input[value='Perito RE / G.F.']").click()
    );

    // here we go and process the new tab
    // (screenshot, fill the form, etc.)
    await newTabCatcher(browser);

    // rest of your code
    // ...
    await browser.close();
});
Result:
It worked flawlessly!
Note:
Notice how I used new Promise and async/await together. This might not be the best practice, but it gives you a lead on what to look for when creating a scraper for older websites.
If it's relevant to anyone else facing dialog boxes, the following code solved it for me:
this.page.on('dialog', async dialog => {
    await dialog.dismiss();
});
Your button #BoxAlertBtnOk appears on the webpage only after a moment; when you call await page.click("#BoxAlertBtnOk"); the button is still invisible. Try waiting until it is visible before taking the action:
await page.waitForSelector("#BoxAlertBtnOk");
await page.click("#BoxAlertBtnOk");
await page.waitForSelector("input[value='Perito RE / G.F.']");
await page.click("input[value='Perito RE / G.F.']");