Getting changes in object with puppeteer - node.js

I'm trying to learn how to track changes in a div. I found a post that showed the following code:
(async() => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.exposeFunction('onCustomEvent', text => console.log(text));
await page.goto('https://www.time.ir', {waitUntil: 'networkidle0'});
await page.evaluate(() => {
$('#digitalClock').bind("DOMSubtreeModified", function(e) {
window.onCustomEvent(e.currentTarget.textContent.trim());
});
});
})();
When running this it pulls the time from the webpage and every second console.logs the new time - exactly what I'm looking for. However I'm having issues with any other page for some reason. For example, the very similar code below gives me an error:
'node:1801) UnhandledPromiseRejectionWarning: Error: Evaluation failed: ReferenceError: $$ is not defined'
await page.exposeFunction('onCustomEvent', text => console.log(text));
await page.goto('https://www.clocktab.com', {waitUntil: 'networkidle0'});
await page.evaluate(() => {
$('#digit2').bind("DOMSubtreeModified", function(e) {
window.onCustomEvent(e.currentTarget.textContent.trim());
});
});
I'm not sure the difference between them other than the page I navigate to, and the element that I'm looking at to find the changing value. Additionally, I did read somewhere that DOMSubtreeModified is deprecated now, so if there's a better way to get what I'm looking for that would be great!
Thanks in advance

The difference is that in the second website there is not jquery, and when you send the evaluation function, $ is not defined.
Replace with vanilla js:
document.querySelector('#digit2').addEventListener ("DOMSubtreeModified", function(e) {
window.onCustomEvent(e.currentTarget.textContent.trim());
})
Suggestion: when i debug with the puppeeter evaluation function i copy-paste this on my browser console in the page. For example:

Related

Blocking specific resources (css, images, videos, etc) using crawlee and playwright

I'm using crawlee#3.0.3 (not released yet, from github), and I'm trying to block specific resources from loading with playwrightUtils.blockRequests (which isn't available in previous versions). When I try the code suggested in the official repo, it works as expected:
import { launchPlaywright, playwrightUtils } from 'crawlee';
const browser = await launchPlaywright();
const page = await browser.newPage();
await playwrightUtils.blockRequests(page, {
// extraUrlPatterns: ['adsbygoogle.js'],
});
await page.goto('https://cnn.com');
await page.screenshot({ path: 'cnn_no_images.png' });
await browser.close();
I can see that the images aren't loaded from the screenshot. My problem has to do with the fact that I'm using PlaywrightCrawler:
const crawler = new PlaywrightCrawler({
maxRequestsPerCrawl: 3,
async requestHandler({ page, request }) {
console.log(`Processing: ${request.url}`);
await playwrightUtils.blockRequests(page);
await page.screenshot({ path: 'cnn_no_images2.png' });
},
});
This way, I'm not able to block specific resources, and my guess is that blockRequests needs launchPlaywright to work, and I don't see a way to pass that to PlaywrightCrawler.blockRequests has been available for puppeteer, so maybe someone has tried this before.
Also, i've tried "route interception", but again, I couldn't make it work with PlaywrightCrawler.
you can set any listeners or code before navigation by using preNavigationHooks like this:
const crawler = new PlaywrightCrawler({
maxRequestsPerCrawl: 3,
preNavigationHooks: [async ({ page }) => {
await playwrightUtils.blockRequests(page);
}],
async requestHandler({ page, request }) {
console.log(`Processing: ${request.url}`);
await page.screenshot({ path: 'cnn_no_images2.png' });
},
});

Trying to crawl a website using puppeteer but getting a timeout error

I'm trying to search the Kwik Trip website for daily deals using nodeJs but I keep getting a timeout error when I try to crawl it. Not quite sure what could be happening. Does anyone know what may be going on?
Below is my code, I'm trying to wait for .agendaItemWrap to load before it brings back all of the HTML because it's a SPA.
function getQuickStar(req, res){
(async () => {
try {
const browser = await puppeteer.launch();
const page = await browser.newPage();
const navigationPromise = page.waitForNavigation({waitUntil: "domcontentloaded"});
await page.goto('https://www.kwiktrip.com/savings/daily-deals');
await navigationPromise;
await page.waitForSelector('.agendaItemWrap', { timeout: 30000 });
const body = await page.evaluate(() => {
return document.querySelector('body').innerHTML;
});
console.log(body);
await browser.close();
} catch (error) {
console.log(error);
}
})();
}
Here's a link to the web page I'm trying to crawl https://www.kwiktrip.com/savings/daily-deals
It appear your desired selector is located into an iframe, and not into the page.mainframe.
You then need to wait for your iframe, and perform the waitForSelector on this particular iframe.
Quick tip : you don't need any page.waitForNavigation with a page.goto, because you can set the waitUntil condition into the options. By default it waits for the page onLoad event.

Puppeteer close javascript alert box

I'm trying to click on a page button on this website but when I enter the site an alert box shows up and I don't know how to close it.
I just started experimenting with Puppeteer, this is the code I'm using this simple code right now:
const ptr = require('puppeteer');
ptr.launch().then(async browser => {
const page = await browser.newPage();
await page.setViewport({ width: 1280, height: 800 });
await page.goto('https://portaleperiti.grupporealemutua.it/PPVET/VetrinaWebPortalePeriti/');
//This is the alert button selector
await page.click("#BoxAlertBtnOk");
//This is the button on the page i want to click on
await page.click("input[value='Perito RE / G.F.']");
await page.screenshot({
path: 'screenshot.png',
fullPage: true
});
await browser.close();
});
This is the error I get: UnhandledPromiseRejectionWarning: Error: Node is either not visible or not an HTMLElement
at ElementHandle._clickablePoint
Any help would be really appreciated, thanks!
There are few things going on that page,
The alert box only loads after page is loaded (It has a onload property on body tag). So you should wait until network is idle.
Clicking those "Perito" buttons creates a new window/tab due to the window.open() code put into onclick handler.
The new tab redirects multiple times and shows a login page if the user is not logged in already.
Solution:
1. Make sure to load the page properly.
Just add { waitUntil: "networkidle0" } to .goto or .waitForNavigation.
await page.goto(
"https://portaleperiti.grupporealemutua.it/PPVET/VetrinaWebPortalePeriti/",
{ waitUntil: "networkidle0" }
// <-- Make sure the whole page is completely loaded
);
2. Wait for the element before clicking
Already suggested on other answers, wait for the element using waitFor.
// wait and click the alert button
await page.waitFor("#BoxAlertBtnOk");
await page.click("#BoxAlertBtnOk");
3. Optional, add few seconds before taking screenshot after clicking the button.
// optional, add few seconds before taking this screenshot
// just to make sure it works even on slow machine
await page.waitFor(2000);
await page.screenshot({
path: "screenshot_before.png",
fullPage: true
});
4. Use the page.evaluate and document.querySelector to get element
page.click will not handle all kind of clicks. Sometimes there are different events bound to some elements and you have to treat that separately.
// we can click using querySelector and the native
// just page.click does not trigger the onclick handler on this page
await page.evaluate(() =>
document.querySelector("input[value='Perito RE / G.F.']").click()
);
5. Treat the new tab separately
Together with browser.once('targetcreated'), new Promise, and browser.pages() you can catch the newly created tab and work on it.
Note: Read final code at end of the answer before using this.
// this is the final page after clicking the input on previous page
// https://italy.grupporealemutua.it/FIM/sps/IDPRMA/saml20/login
function newTabCatcher(browser) {
// we resolve this promise after doing everything we need to do on this page
// or in error
return new Promise((resolve, reject) => {
// set the listener before clicking the button to have proper interaction
// we listen for only one new tab
browser.once("targetcreated", async function() {
console.log("New Tab Created");
try {
// get the newly created window
const tabs = await browser.pages();
const lastTab = tabs[tabs.length - 1];
// Wait for navigation to finish as well as specific login form
await Promise.all([
lastTab.waitForNavigation({ waitUntil: "networkidle0" }),
lastTab.waitFor("#div_login")
]);
// browser will switch to this tab just when it takes the screenshot
await lastTab.screenshot({
path: "screenshot_newtab.png",
fullPage: true
});
resolve(true);
} catch (error) {
reject(error);
}
});
});
}
Final Code:
Just for clarity, here is how I used all code snippets specified above.
const ptr = require("puppeteer");
ptr.launch({ headless: false }).then(async browser => {
const page = await browser.newPage();
await page.setViewport({ width: 1280, height: 800 });
await page.goto(
"https://portaleperiti.grupporealemutua.it/PPVET/VetrinaWebPortalePeriti/",
{ waitUntil: "networkidle0" }
// <-- Make sure the whole page is completely loaded
);
// wait and click the alert button
await page.waitFor("#BoxAlertBtnOk");
await page.click("#BoxAlertBtnOk");
// optional, add few seconds before taking this screenshot
// just to make sure it works even on slow machine
await page.waitFor(2000);
await page.screenshot({
path: "screenshot_before.png",
fullPage: true
});
// we can click using querySelector and the native
// just page.click does not trigger the onclick handler on this page
await page.evaluate(() =>
document.querySelector("input[value='Perito RE / G.F.']").click()
);
// here we go and process the new tab
// aka get screenshot, fill form etc
await newTabCatcher(browser);
// rest of your code
// ...
await browser.close();
});
Result:
It worked flawlessly!
Note:
Notice how I used new Promise and async await together. This might not be the best practice, but now you have a lead of what to look for when creating a scraper for some old websites.
If it's relevant to anyone else who facing dialog boxes, the following code solved it for me:
this.page.on('dialog', async dialog => {
await dialog.dismiss();
});
Your button - #BoxAlertBtnOk will be appear on the webpage after a moment, when you call await page.click("#BoxAlertBtnOk"); the button is invisible. Try to wait until it visible then take an action:
await page.waitForSelector("#BoxAlertBtnOk");
await page.click("#BoxAlertBtnOk");
await page.waitForSelector("input[value='Perito RE / G.F.']");
await page.click("input[value='Perito RE / G.F.']");

Incredibly weird puppeteer behaviour

I use this code:
// <-- add event on top of file
process.on("unhandledRejection", (reason, p) => {
console.error("Unhandled Rejection at: Promise", p, "reason:", reason);
// browser.close(); // <-- no need to close the browser here
});
const puppeteer = require('puppeteer');
async function getPic() {
try{ // <-- wrap the whole block in try catch
const browser = await puppeteer.launch(/*{headless: false}*/);
const page = await browser.newPage();
await page.setViewport({width: 1000, height: 500}); // <-- add await here so it sets viewport after it creates the page
//await page.goto('https://www.google.com'); //Old way of doing. It doesn't work for some reason...
page.goto('https://www.google.com/').catch(error => console.log("error on page.goto()", error));
// wait for either of events to trigger
await Promise.race([
page.waitForNavigation({waitUntil: 'domcontentloaded'}),
page.waitForNavigation({waitUntil: 'load'})
]);
await page.screenshot({path: 'pic.png'});
await browser.close(); // <-- close browser after everything is done
} catch (error) {
console.log(error);
}
}
getPic();
Then the program hangs. After 30 seconds, i get this error:
error on page.goto() Error: Navigation T imeout Exceeded: 30000ms exceeded at Promise.then (C:\...\pupet test\node_modules\puppeteer\lib\NavigatorWatcher.js:71:21) at <anonymous>
But i also get the picture i requested!
1.So how is it that page.goto() fails but it still gets the picture, which mean that page.goto() actually worked!?
2. What can i do to mitigate this weird error?
The program hangs because you called goto without async-await or promises, then you put it in a race for waitForNavigation, this makes the browser confused because all three line of code is mostly doing same thing on the back. It is trying to navigate and wait for it.
Use async await for promises. Do not call async methods synchronous ways. No matter what, this is how you must use it in your example case.
await page.goto('https://www.google.com');
If you want to wait until page load, then the goto function has that covered too. You don't need to use the waitForNavigation after goto.
await page.goto('https://www.google.com', {waitUntil: 'load'});
There is also domcontentloaded, networkidle2, networkidle0 for the waitUntil property. You can read more about it in the docs with full explanation.
The reason why screenshot is working properly is because it's getting executed asynchronously but then you are awaiting for the navigation later on.
Here is the code without much complexity and promise race.
try{ // <-- wrap the whole block in try catch
const browser = await puppeteer.launch(/*{headless: false}*/);
const page = await browser.newPage();
await page.setViewport({width: 1000, height: 500}); // <-- add await here so it sets viewport after it creates the page
await page.goto('https://www.google.com/', {waitUntil: 'load'})
await page.screenshot({path: 'pic.png'});
await browser.close(); // <-- close browser after everything is done
} catch (error) {
console.log(error);
}
Here is how it works perfectly on the sandbox.
The puppeteer docs is a good place to start to learn about this.

Puppeteer screenshot stop in random URL

I have a service nodejs working in Ubuntu, using puppeteer to take screenshots of pages, but in some pages the method page.screenshot({fullPage: true, type: 'jpeg'}) doesn't works in some random URLs and no errors are displayed in the log. The code is:
async takeScreenshot() {
console.log('trying take Screenshot [...]');
let image = await this.page.screenshot({fullPage: true, type: 'jpeg'});
console.log('Completed!');
return image;
}
An example of page that I had this problem is: https://nuevo.jumbo.cl/mayonesa-hellmanns-751-g-supreme-light/p
It seems you are not waiting until the page is fully loaded before you take the screenshot. You need 'waitUntil':
const page = await browser.newPage();
await page.goto(inputImgUrl, {waitUntil: 'networkidle'});
So the whole thing should look like something like this:
(async() => {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(inputImgUrl, {waitUntil: 'networkidle'});
await page.setViewport(viewPortData);
await page.screenshot({path: outputFile, type: 'jpeg', quality: 50, clip: cropData});
await browser.close();
})();
You did not explain how it did not work, or how you expected it to show, but I'm assuming it captured something like the following image.
When you first load the page, it'll load something like this, which is not related to puppeteer, since you are not telling the code to wait for certain amount, if you are doing so, I don't see it specified anywhere.
For me, just normal screenshot worked, except the region settings. I used version 0.13 and the following code,
await page.goto("https://nuevo.jumbo.cl/mayonesa-hellmanns-751-g-supreme-light/p", {waitUntil: "networkidle0"});
await page.screenshot({
path: "test_google.png",
fullPage: true
});
});

Resources