Puppeteer Cluster Inconsistent Results - node.js

I recently converted a script from Puppeteer to Puppeteer Cluster, and during testing I've observed some odd results when processing multiple pages concurrently.
Effectively, I'm loading a single page, iterating over the product options on the page, and gathering the price for each product variant.
One particular product has around 9 variants. Sometimes I accurately capture all 9, whereas on the next test cycle it may return only 2 or 3.
Any help would be greatly appreciated!
const puppeteer = require('puppeteer');
const { Cluster } = require('puppeteer-cluster');
const Product = require('../utils/product')
const config = require('../config/config.json')
const selectors = config.productData;
(async () => {
const urls = [
{link: ...},
{link: ...},
{link: ...}
]
const cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_CONTEXT,
maxConcurrency: 5,
puppeteerOptions: {
headless: false
},
});
await cluster.task(async ({ page, data: url }) => {
//instantiate a new product object
const product = new Product();
await page.goto(url, { waitUntil: 'load' });
const skuprice = await page.$eval(selectors.price, element => element.innerText);
console.log('Sku Price:' + skuprice)
//deal with variants
const options = await page.$$eval(selectors.variant, elements => elements.map(element=>element.id))
if (options.length > 0) {
//set up a variants array
for (let index = 0; index < options.length; index++) {
const element = options[index];
await page.waitForSelector(`#${element}`);
await page.$eval(`#${element}`, radio => radio.click());
await page.waitForTimeout(500);
const variantprice = await page.$eval(selectors.price, element => element.innerText);
console.log('Variant Price:' + variantprice)
}
}
});
urls.forEach(url => {
cluster.queue(url.link);
})
// many more pages
await cluster.idle();
await cluster.close();
})();

A dynamic JavaScript page should be scraped only once the elements you need are visible.
You can use either of these approaches:
[1] Wait until the selector is visible, using await page.waitForSelector(selector, {visible: true, timeout: 0}). Note that timeout: 0 disables the timeout, so the call waits indefinitely if the element never appears.
[2] Wait for a fixed amount of time, but this is flakier and more error-prone.
You can simplify and rewrite your code like this:
await page.waitForSelector(`#${element}`, {visible: true, timeout: 0})
await page.click(`#${element}`)
/* await page.waitForTimeout(500) <= prone to error, use line below */
await page.waitForSelector(selectors.price, {visible: true, timeout: 0})
const variantprice = await page.$eval(selectors.price, element => element.innerText)
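One caveat: if the price element is already visible before the click, waiting for selectors.price to be visible resolves immediately, so it may not wait for the new price at all. A minimal sketch of waiting on page state instead, assuming each variant is a radio input that reports checked once selected (the question's callback names it radio, but that is a guess about the markup). collectVariantPrices is a hypothetical helper name, and even this wait does not prove the price has finished re-rendering:

```javascript
// Sketch: collect one price per variant, waiting on page state rather than a fixed delay.
// `selectors.variant` and `selectors.price` are the config values from the question.
async function collectVariantPrices(page, selectors) {
  const ids = await page.$$eval(selectors.variant, els => els.map(el => el.id));
  const prices = [];
  for (const id of ids) {
    const radio = `#${id}`;
    await page.waitForSelector(radio, { visible: true });
    await page.click(radio);
    // Assumption: the variant is a radio input whose `checked` flag flips on selection.
    await page.waitForFunction(
      sel => {
        const el = document.querySelector(sel);
        return Boolean(el && el.checked);
      },
      {},
      radio
    );
    prices.push(await page.$eval(selectors.price, el => el.innerText.trim()));
  }
  return prices;
}
```

Inside the cluster task this would be called as const prices = await collectVariantPrices(page, selectors). Returning the array from the task, rather than only logging it, makes it easier to compare runs.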

For anyone else searching for an answer: it looked like some of my CSS selectors weren't working on page refresh.
Re-reading the project documentation, it's worth including the following:
// Event handler to be called in case of problems
cluster.on('taskerror', (err, data) => {
console.log(`Error crawling ${data}: ${err.message}`);
});
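Building on that handler, here is a small sketch that records failures instead of only logging them, so a run that returns 2 or 3 variants instead of 9 can be traced to specific URLs afterwards. trackFailures and the failures array are hypothetical names; the third willRetry argument is only meaningful if a retryLimit was passed to Cluster.launch:

```javascript
// Sketch: remember every failed task so missing results are visible after the run.
// `cluster` is expected to be a puppeteer-cluster instance.
function trackFailures(cluster, failures = []) {
  cluster.on('taskerror', (err, data, willRetry) => {
    failures.push({
      data,
      message: err.message,
      willRetry: Boolean(willRetry),
    });
  });
  return failures;
}

// Usage (after Cluster.launch, before queueing):
//   const failures = trackFailures(cluster);
//   ...queue URLs, await cluster.idle()...
//   console.log(`${failures.length} task(s) failed`, failures);
```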

Related

Puppeteer click parent node of a element with no id

I'm trying to select a certain size on this website. I have tried multiple approaches that have worked for me so far in Puppeteer, but none of them seems to work in this instance. I can get the size tab open but cannot figure out how to select a specific size.
My code:
await page.goto(data[ii][0]), { //the website link
waitUntil: 'load',
timeout: 0
};
//part 1
await page.click('span[class="default-text__21bVM"]'); //opens size menu
let size = data[ii][1]; //gets size from an array, for example 9
// const xp = `//div[contains(@class, "col-3") and text()="${size}"]`;
// await page.waitForXPath(xp);
// const [sizeButton] = await page.$x(xp);
// await sizeButton.evaluate(btn => {
// btn.parentNode.dispatchEvent(new Event("mousedown"));
// });
await delay(1500);
await page.evaluate((size) => {
document.querySelector(`div > div[class="col-3"][text="${size}"]`).parentElement.click()
});
await page.click('span[class="text__1S19c"]'); // click on submit button
Neither of my approaches worked. I get Error: Evaluation failed: TypeError: Cannot read property 'parentElement' of null, meaning the div wasn't found for whatever reason.
This is the HTML of the div I'm trying to click on:
I tried different variations of the querySelector, but none of them worked, so I'm posting the problem here to see if this is even possible or if I just made a mistake along the way.
This seems to work:
import puppeteer from 'puppeteer';
const browser = await puppeteer.launch({ headless: false, defaultViewport: null });
try {
const [page] = await browser.pages();
await page.goto('https://releases.footshop.com/nike-air-force-1-07-lx-wmns-5iBRxXsBHBhvh4GFc9ge');
await page.click('span[class="default-text__21bVM"]');
const size = 9;
const xp = `//div[contains(@class, "col-3") and text()="${size}"]`;
await page.waitForXPath(xp);
const [sizeButton] = await page.$x(xp);
await sizeButton.click();
await page.click('span[class="text__1S19c"]');
} catch (err) { console.error(err); }
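For context on why the CSS attempt in the question returns null: [text="9"] is an attribute selector, so it looks for a text attribute on the element, and CSS has no selector for an element's text content. XPath can match on text, which is why the code above uses it. If the click really has to land on the parent, as the title asks, appending /.. to the expression selects it. A minimal sketch, where sizeXPath is a hypothetical helper and the col-3 class comes from the question:

```javascript
// Sketch: build the XPath for a size cell, optionally targeting its parent node.
function sizeXPath(size, { parent = false } = {}) {
  const cell = `//div[contains(@class, "col-3") and text()="${size}"]`;
  // "/.." steps up to the parent element of the matched div
  return parent ? `${cell}/..` : cell;
}

// Usage with the API used in the answer above:
//   const [target] = await page.$x(sizeXPath(9, { parent: true }));
//   await target.click();
```

In recent Puppeteer releases page.$x and page.waitForXPath are no longer available; as far as I know the same expression can be passed to page.waitForSelector with an xpath/ prefix there, but check the documentation for your version.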

Puppeteer waitForSelector not working as expected

I've this simple code chunk:
const BUSINESS = 'lalala';
await page.waitForSelector('#searchboxinput').then(
page.evaluate( (BUSINESS) => {
document.querySelector('#searchboxinput').value = BUSINESS
}, BUSINESS)
),
If I chain waitForSelector -> then, I would expect then to be executed once the selector exists, but I'm getting a Cannot set property value of 'null'.
So I understand that document.querySelector('#searchboxinput') == undefined, which I thought was impossible because it should only run once the waitForSelector promise has finished...
Is that a bug, or am I missing something?
I'm not sure I understand correctly, as your chunk is syntactically incomplete and not reproducible. But what if you use the value returned by page.waitForSelector()? This seems to work:
import puppeteer from 'puppeteer';
const browser = await puppeteer.launch({ headless: false, defaultViewport: null });
try {
const [page] = await browser.pages();
await page.goto('https://example.org/');
const BUSINESS = 'lalala';
await page.waitForSelector('a').then((element) => {
// Return the promise so the outer await also waits for the evaluation
return page.evaluate((element, BUSINESS) => {
element.textContent = BUSINESS;
}, element, BUSINESS);
});
} catch (err) { console.error(err); }

Puppeteer Failing for more than 11 Urls

I would like to ask: what's the best way to capture more than 20 screenshots of different URLs?
I have tried the following code.
async function sCapture(url, site_name) {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.setViewport({ width: 1280, height: 720 })
await page.goto(url);
await page.screenshot({
path:`statusImage/${site_name}.jpg`
});
await browser.close();
}
I'm getting the URLs from my DB like this.
db_connection.promise()
.execute("SELECT * FROM `urls`")
.then(([rows]) => {
rows.forEach(user => {
const url = user.link;
const name = user.link_name;
console.log(name);
sCapture(url, name)
});
db_connection.end();
}).catch(err => {
console.log(err);
});
My DB table contains more than 50 URLs.
Before, I was getting this error:
MaxListenersExceededWarning: Possible EventEmitter memory leak detected. 11 exit listeners added. Use emitter.setMaxListeners() to increase limit
After I added the line below, it just kills my server, and I have to reboot manually for my site to work again.
require('events').EventEmitter.prototype._maxListeners = 100;
I would appreciate any help.
I think your current code actually starts a new browser instance for each URL you want to fetch, and I don't think you need to do that. A separate page is enough. Also, you are currently making all those requests in parallel, which will tax your machine more than doing them in sequence. Putting these two changes together gives you something like this:
let browser;
async function sCapture(url, site_name) {
const page = await browser.newPage();
await page.setViewport({ width: 1280, height: 720 })
await page.goto(url);
await page.screenshot({
path:`statusImage/${site_name}.jpg`
});
// Close the page so open tabs don't pile up across 50+ URLs
await page.close();
}
const doit = async () => {
try {
const [rows] = await db_connection.promise().execute("SELECT * FROM `urls`");
// for...of awaits each capture before starting the next;
// forEach with an async callback would start them all at once
for (const user of rows) {
const url = user.link;
const name = user.link_name;
console.log(name);
await sCapture(url, name);
}
} catch (err) {
console.log(err);
} finally {
db_connection.end();
}
}
(async () => {
browser = await puppeteer.launch();
// Wait for all captures to finish before closing the browser
await doit();
await browser.close();
})();
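If strictly sequential capture turns out to be too slow, a middle ground is to run a few captures at a time. The sketch below is one way to do that in plain JavaScript. runInBatches is a hypothetical helper, and the batch size of 5 in the usage comment is an arbitrary assumption to tune for your machine:

```javascript
// Sketch: process items in fixed-size batches so only `batchSize` tasks run at once.
async function runInBatches(items, batchSize, worker) {
  const results = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // allSettled keeps one failed capture from aborting the rest of the run
    const settled = await Promise.allSettled(batch.map(item => worker(item)));
    results.push(...settled);
  }
  return results;
}

// Usage with the rows from the query above:
//   const results = await runInBatches(rows, 5, user => sCapture(user.link, user.link_name));
//   const failed = results.filter(r => r.status === 'rejected');
```

Each batch waits for its slowest page before the next one starts, which is simpler than a true worker pool but leaves some capacity idle.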

Why am I not able to navigate through iFrames using Apify/Puppeteer?

I'm trying to manipulate forms on sites that contain iFrames using Puppeteer. I tried different ways to reach a specific iFrame, or even to count the iFrames on a website, with no success.
Why isn't Puppeteer's object recognizing the iFrames / child frames of the page I'm trying to navigate through?
It's happening with other pages as well, such as https://www.veiculos.itau.com.br/simulacao
const Apify = require('apify');
const sleep = require('sleep-promise');
Apify.main(async () => {
// Launch the web browser.
const browser = await Apify.launchPuppeteer();
// Create and navigate new page
console.log('Open target page');
const page = await browser.newPage();
await page.goto('https://www.credlineitau.com.br/');
await sleep(15 * 1000);
for (const frame in page.mainFrame().childFrames()) {
console.log('test');
}
await browser.close();
});
Perhaps you'll find some helpful inspiration below.
const waitForIframeContent = async (page, frameSelector, contentSelector) => {
await page.waitForFunction((frameSelector, contentSelector) => {
const frame = document.querySelector(frameSelector);
const node = frame.contentDocument.querySelector(contentSelector);
return node && node.innerText;
}, {
timeout: TIMEOUTS.ten,
}, frameSelector, contentSelector);
};
const $frame = await waitForSelector(page, SELECTORS.frame.iframeNode).catch(() => null);
if ($frame) {
const frame = page.frames().find(frame => frame.name() === 'content-iframe');
const $cancelStatus = await waitForSelector(frame, SELECTORS.frame.membership.cancelStatus).catch(() => null);
await waitForIframeContent(page, SELECTORS.frame.iframeNode, SELECTORS.frame.membership.cancelStatus);
}
Give it a shot.
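Two details in the question's loop are worth checking as well. for...in iterates over array indices rather than the frames themselves, so for...of is the usual choice. And page.mainFrame().childFrames() only returns direct children, so an iframe nested inside another iframe will not show up there. A minimal sketch that walks the whole frame tree, with collectFrames as a hypothetical helper name:

```javascript
// Sketch: list every frame under a starting frame, including nested ones.
function collectFrames(frame, depth = 0, out = []) {
  out.push({ depth, name: frame.name(), url: frame.url() });
  for (const child of frame.childFrames()) {
    collectFrames(child, depth + 1, out);
  }
  return out;
}

// Usage after page.goto(...):
//   for (const info of collectFrames(page.mainFrame())) {
//     console.log(`${'  '.repeat(info.depth)}${info.name || '(unnamed)'} ${info.url}`);
//   }
```

If this prints only the main frame, the iframes are probably being attached after the point where the script inspects the page, which is the case the waitForFunction approach above is meant to handle.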

Click anywhere on page using Puppeteer

Currently I'm using Puppeteer to fetch cookies and headers from a page. However, the site uses a bot prevention system that is only bypassed after clicking on the page; I don't want to keep this sequential, since that would make it "detectable".
How can I have Puppeteer click anywhere on the page at random, regardless of whether it clicks a link, button, etc.?
I've currently got this code:
const getCookies = async (state) => {
try {
state.browser = await launch_browser(state);
state.context = await state.browser.createIncognitoBrowserContext();
state.page = await state.context.newPage();
await state.page.authenticate({
username: proxies.username(),
password: proxies.password(),
});
await state.page.setViewport(functions.get_viewport());
state.page.on('response', response => handle_response(response, state));
await state.page.goto('https://www.website.com', {
waitUntil: 'networkidle0',
});
await state.page.waitFor('.unlockLink a', {
timeout: 5000
});
await state.page.click('.unlockLink a');
await state.page.waitFor('input[id="nondevice"]', {
timeout: 5000
});
state.publicIpv4Address = await state.page.evaluate(() => {
return sessionStorage.getItem("publicIpv4Address");
});
state.csrfToken = await state.page.evaluate(() => {
return sessionStorage.getItem("csrf-token");
});
//I NEED TO CLICK HERE! CAN BE WHITESPACE, LINK, IMAGE
state.browser_cookies = await state.page.cookies();
state.browser.close();
for (const cookie of state.browser_cookies) {
if(cookie.name === "dtPC") {
state.dtpc = cookie.value;
}
await state.jar.setCookie(
`${cookie.name}=${cookie.value}`,
'https://www.website.com'
)
}
return state;
} catch(error) {
if(state.browser) {
state.browser.close();
}
throw new Error(error);
}
};
The simplest way I can think of off the top of my head to choose a random element from the DOM is to use querySelectorAll(), which returns a NodeList of all elements matching a selector (all <div>s, all <p>s, or, as here, everything with a given class). You can then call click() on a random one from the result. Note that this throws if nothing matches the selector. For example:
await page.evaluate(() => {
const allDivs = document.querySelectorAll('.left-sidebar-toggle');
const randomElement = allDivs[Math.floor(Math.random() * allDivs.length)];
randomElement.click();
});
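Since the question also mentions whitespace, an alternative is to click a random coordinate inside the viewport rather than a random element. A minimal sketch using page.viewport() and page.mouse.click(); randomPoint and clickRandomPoint are hypothetical helper names, and the 800x600 fallback is an assumption for the case where no viewport is set:

```javascript
// Sketch: pick a random point inside the viewport and click it.
function randomPoint(viewport, rng = Math.random) {
  return {
    x: Math.floor(rng() * viewport.width),
    y: Math.floor(rng() * viewport.height),
  };
}

async function clickRandomPoint(page, rng = Math.random) {
  // page.viewport() can return null; fall back to an assumed default size
  const viewport = page.viewport() || { width: 800, height: 600 };
  const { x, y } = randomPoint(viewport, rng);
  await page.mouse.click(x, y);
  return { x, y };
}

// Usage at the marked spot in the question's code:
//   await clickRandomPoint(state.page);
```

Keep in mind that a random point may land on a link and trigger navigation, so read any values you need from the page before clicking.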
