Why can't Puppeteer find this link element on the page? - node.js

^^UPDATE^^
Willing to pay someone to walk me through this, issue posted on codeMentor.io: https://www.codementor.io/u/dashboard/my-requests/9j42b83f0p
I've been looking to click on the element:
<a id="isc_LinkItem_1$20j" href="javascript:void" target="javascript" tabindex="2"
onclick="if(window.isc_LinkItem_1) return isc_LinkItem_1.$30i(event);"
$9a="$9d">Reporting</a>
In: https://stackblitz.com/edit/js-nzhhbk
(I haven't included the actual page because it's behind a username & password.)
Seems easy enough.
----------------------------------------------------------------------
solution1:
page.click('[id=isc_LinkItem_1$20j]') //not a valid selector
solution2:
const linkHandlers = await frame.$x("//a[contains(text(), 'Reporting')]");
if (linkHandlers.length > 0) {
  await linkHandlers[0].click();
} else {
  throw new Error('Link not found');
} // link not found
----------------------------------------------------------------------
I have looked at every which way to select and click it, and it says it isn't in the document even though it clearly is (verified by inspecting the HTML in Chrome DevTools and calling page.evaluate(() => document.body.innerHTML)).
**tried to see if it was in an iframe
**tried to select by id
**tried to select by inner text
**tried to console log the body in the browser (console logging not working, verified on the inspected element) // nothing happens
**tried to create an alert with the body text by using: page.evaluate(() => alert(document)) // nothing happens
**tried to create an alert to test whether JavaScript can be injected by: page.evaluate(() => alert('works')) // nothing happens
**also tried this: How to select elements within an iframe element in Puppeteer // doesn't work
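One more diagnostic that can help here: enumerate every frame and browser target to see where the element actually lives (a minimal sketch, assuming page and browser are already set up as in the code below):
// List every frame on the current page and every target in the browser.
// If the link lives in another frame or tab, its URL should show up here.
for (const frame of page.frames()) {
  console.log('frame:', frame.url());
}
for (const target of browser.targets()) {
  console.log('target:', target.type(), target.url());
}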
Here is the code I have built so far:
const page = await browser.newPage();
const login1url =
  'https://np3.nextiva.com/NextOSPortal/ncp/landing/landing-platform';
await page.goto(login1url);
await page.waitFor(1000);
await page.type('[name=loginUserName]', 'itsaSecretLol');
await page.type('[name=loginPassword]', 'nopeHaha');
await page.click('[type=submit]');
await page.waitForNavigation();
const login3url = 'https://np3.nextiva.com/NextOSPortal/ncp/admin/dashboard';
await page.goto(login3url);
await page.click('[id=hdr_users]');
await page.goto('https://np3.nextiva.com/NextOSPortal/ncp/user/manageUsers');
await page.goto('https://np3.nextiva.com/NextOSPortal/ncp/user/garrettmrg');
await page.waitFor(2000);
await page.click('[id=loginAsUser]');
await page.waitFor(2000);
await page.click('[id=react-select-5--value]');
await page.waitFor(1000);
await page.click('[id=react-select-5--option-0]');
await page.waitFor(20000);
const elementHandle = await page.$('iframe[id=callcenter]');
const frame = await elementHandle.contentFrame();
const linkHandlers = await frame.$x("//a[contains(text(), 'Reporting')]");
if (linkHandlers.length > 0) {
  await linkHandlers[0].click();
} else {
  throw new Error('Link not found');
}

Since isc_LinkItem_1$20j is not a valid selector, maybe you can try finding elements whose id starts with isc_LinkItem_1, like this:
await page.waitForSelector("[id^=isc_LinkItem_1]", { visible: true, timeout: 30000 });
await page.click("[id^=isc_LinkItem_1]");

On your solution 1:
await page.click('a[id=isc_LinkItem_1\\$20j]');
Or try:
await page.click('#isc_LinkItem_1\\$20j');
I have the slight impression that you must provide what kind of element you're trying to select before the brackets, in this case an <a> element.
On the second solution, the # character means we're selecting an element by its id.

It turns out that the previous click triggered a new tab. Puppeteer doesn't move to the new tab, so all the previous code was being executed on the old tab. To fix it, all we had to do was find the new tab, select it, and execute code on it. Here is the function we wrote to select the tab:
async function getTab(regex, browser, targets) {
  let pages = await browser.pages();
  if (targets) pages = await browser.targets();
  let newPage;
  for (let i = 0; i < pages.length; i++) {
    const url = await pages[i].url();
    console.log(url);
    if (url.search(regex) !== -1) {
      newPage = pages[i];
      console.log('***');
      console.log(url);
      console.log('***');
      break;
    }
  }
  console.log('finished');
  return newPage;
}
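For example, a hypothetical usage in the flow above (the /callcenter/ pattern and the triggering click are assumptions about this particular site):
await page.click('[id=loginAsUser]'); // the click that spawns the new tab
await page.waitFor(2000);             // give the tab a moment to open
// Find the tab whose URL matches the pattern and continue there.
const newTab = await getTab(/callcenter/, browser);
if (!newTab) throw new Error('New tab not found');
// Continue the original flow on the new tab instead of the old one.
const elementHandle = await newTab.$('iframe[id=callcenter]');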

Related

waitForSelector suddenly no longer working in puppeteer

I have a working puppeteer script that I'd like to make into an API but I'm having problems with waitForSelector.
Background:
I wrote a Puppeteer script that successfully searches for and scrapes the result of a query I specify in the code, e.g. let address = xyz;. Now I'd like to make it into an API so that a user can query something. I managed to code everything necessary for the local API (working with Express), and everything works as well. By that I mean I coded all the server-side stuff: I can make a request, the scraper function is called, Puppeteer starts up and carries out my search (I need to type in an address, choose from a dropdown and press enter).
The status:
The result of my query is a form (basically 3 columns and some rows) in an iFrame and I want to scrape all the rows (I modify them into a specific json later on). The way it works is I use waitForSelector on the form's selector and then I use frame.evaluate.
Problem:
When I run my normal scraper everything works well, but when I run the (slightly modified but essentially the same) code within the API framework, waitForSelector suddenly always times out. I have tried all the usual workarounds (waitForNavigation, taking a screenshot and inspecting, etc.) but nothing helped. I've been reading quite a bit: could it be that I'm screwing something up in terms of async/await when I call my scraper from within the context of the API? I'm still quite new to this, so please bear with me. This is the code of the working script - I indicated the important part.
const puppeteer = require("puppeteer");
const chalk = require("chalk");
const fs = require('fs');
const error = chalk.bold.red;
const success = chalk.keyword("green");
let address = 'Gumpendorfer Straße 12, 1060 Wien';
(async () => {
  try {
    // open the headless browser
    var browser = await puppeteer.launch();
    // open a new page
    var page = await browser.newPage();
    // enter url in page
    await page.goto(`https://mein.wien.gv.at/Meine-Amtswege/richtwert?subpage=/lagezuschlag/`, { waitUntil: 'networkidle2' });
    // continue without newsletter
    await page.click('#dss-modal-firstvisit-form > button.btn.btn-block.btn-light');
    // let everything load
    await page.waitFor(1000);
    console.log('waiting for iframe with form to be ready.');
    // wait until selector is available
    await page.waitForSelector('iframe');
    console.log('iframe is ready. Loading iframe content');
    // choose the relevant iframe
    const elementHandle = await page.$(
      'iframe[src="/richtwertfrontend/lagezuschlag/"]',
    );
    // go into frame in order to input info
    const frame = await elementHandle.contentFrame();
    // enter address
    console.log('filling form in iframe');
    await frame.type('#input_adresse', address, { delay: 100 });
    // choose first option from dropdown
    console.log('Choosing from dropdown');
    await frame.click('#react-autowhatever-1--item-0');
    console.log('pressing button');
    // press button to search
    await frame.click('#next-button');
    // scraping data
    console.log('scraping');
    await frame.waitForSelector('#summary > div > div > br ~ div'); // This keeps failing in the API
    const res = await frame.evaluate(() => {
      const rows = [...document.querySelectorAll('#summary > div > div > br ~ div')];
      const cells = rows.map(
        row => [...row.querySelectorAll('div')]
          .map(cell => cell.innerText)
      );
      return cells;
    });
    await browser.close();
    console.log(success("Browser Closed"));
    const mapFields = (arr1, arr2) => {
      const mappedArray = arr2.map((el) => {
        const mappedArrayEl = {};
        el.forEach((value, i) => {
          if (arr1.length < (i + 1)) return;
          mappedArrayEl[arr1[i]] = value;
        });
        return mappedArrayEl;
      });
      return mappedArray;
    };
    const Arr1 = res[0];
    const Arr2 = res.slice(1, 3);
    let dataObj = {};
    dataObj[address] = [];
    // dataObj['lagezuschlag'] = mapFields(Arr1, Arr2);
    // dataObj['adresse'] = address;
    dataObj[address] = mapFields(Arr1, Arr2);
    console.log(dataObj);
  } catch (err) {
    // Catch and display errors
    console.log(error(err));
    await browser.close();
    console.log(error("Browser Closed"));
  }
})();
I just can't understand why it would work in the one case and not in the other, even though I barely changed anything. For the API I basically changed the name of the async function to const search = async (address) => { so that I can call it with the query from my server-side script.
Thanks in advance. I'm not attaching the API code because I don't want to clutter the question; I can update it if necessary.
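(For context, the Express wiring described above presumably looks something like this; the route, port, and response shape are assumptions, not the actual API code:)
const express = require('express');
const app = express();

app.get('/search', async (req, res) => {
  try {
    // `search` is the renamed scraper function described above
    const result = await search(req.query.address);
    res.json(result);
  } catch (err) {
    res.status(500).send(String(err));
  }
});

app.listen(3000);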
I solved this myself. It turns out the problem wasn't as complicated as I thought, and it was annoyingly simple to solve. The problem wasn't with the selector that was timing out, but with the previous selectors, specifically the typing and the choosing from the dropdown. Essentially, things were going too fast: before the search query was typed in, the dropdown was already pressed and nonsense came out. How I solved it: I included a waitFor(1000) call before the dropdown is selected, and everything went perfectly. An interesting realisation was that even though that one selector timed out, it wasn't actually the source of the problem. But like I said, annoyingly simple, and I feel dumb for asking this :) but maybe someone will see this and learn from my mistake.
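In terms of the script above, the fix amounts to something like this (a sketch using the question's own selectors):
// enter address
await frame.type('#input_adresse', address, { delay: 100 });
// give the autosuggest time to populate before clicking it
await frame.waitFor(1000);
// choose first option from dropdown
await frame.click('#react-autowhatever-1--item-0');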

Click event does nothing when triggered

When I trigger a .click() event in non-headless mode in Puppeteer, nothing happens, not even an error ("non-headless mode so I could visually monitor what is being clicked").
const scraper = {
  test: async () => {
    let browser, page;
    try {
      browser = await puppeteer.launch({
        headless: false,
        args: ["--no-sandbox", "--disable-setuid-sandbox"]
      });
      page = await browser.newPage();
    } catch (err) {
      console.log(err);
    }
    try {
      await page.goto("https://www.betking.com/sports/s/eventOdds/1-840-841-0-0,1-1107-1108-0-0,1-835-3775-0-0,", {
        waitUntil: "domcontentloaded"
      });
      console.log("scraping, wait...");
    } catch (err) {
      console.log(err);
    }
    console.log("waiting....");
    try {
      await page.waitFor('.eventsWrapper');
    } catch (err) {
      console.log(err, err.response);
    }
    try {
      let oddsListData = await page.evaluate(async () => {
        let regionAreaContainer = document.querySelectorAll('.areaContainer.region .regionGroup > .regionAreas > div:first-child > .area:nth-child(5)');
        regionAreaContainer = Array.prototype.slice.call(regionAreaContainer);
        let t = []; // Used to monitor the element being clicked
        regionAreaContainer.forEach(async (region) => {
          let dat = await region.querySelector('div');
          dat.innerHTML === "GG/NG" ? t.push(dat.innerHTML) : false; // Used to confirm that the right element is being clicked
          dat.innerHTML === "GG/NG" ? dat.click() : false;
        });
        return t;
      });
      console.log(oddsListData);
    } catch (err) {
      console.log(err);
    }
  }
}
I expect it to click the specified button and load in some dynamic data on the page.
In Chrome's console, I get the error
Transition Rejection($id: 1 type: 2, message: The transition has been superseded by a different transition, detail: Transition#3( 'sportsMultipleEvents'{"eventMarketIds":"1-840-841-0-0,1-1107-1108-0-0,1-835-3775-0-0,"} -> 'sportsMultipleEvents'{"eventMarketIds":"1-840-841-0-0,1-1107-1108-0-0,1-835-3775-535-14,"} ))
Problem
Behaving non-human-like by executing code like element.click() (inside the page context) or element.value = '..' (see this answer for a similar problem) seems to be problematic for Angular applications. You want to try to behave more human-like by using puppeteer functions like page.click() as they simulate a "real" mouse click instead of just triggering the element's click event.
In addition, the page seems to rebuild parts of itself whenever one of the items is clicked. Therefore, you need to execute the selector query again after each click.
Code sample
To behave more human-like and requery the elements after each click you can change the latter part of your code to something like this:
let list = await page.$x("//div[div/text() = 'GG/NG']");
for (let i = 0; i < list.length; i++) {
  await list[i].click();
  // give the page some time and then query the selectors again
  await page.waitFor(500);
  list = await page.$x("//div[div/text() = 'GG/NG']");
}
This code uses an XPath expression to query the div elements which contain another div element with the given text. After that, a click is simulated on the element, and then the contents of the page are queried another time to account for the changed DOM elements.
Here might be a less confusing way to click those:
for (const div of document.querySelectorAll('div')) {
  if (div.innerHTML === 'GG/NG') div.click();
}
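Note that this loop touches the DOM directly, so it would have to run inside the page context, for example (with the caveat from the answer above that in-page clicks can be problematic for Angular apps):
await page.evaluate(() => {
  for (const div of document.querySelectorAll('div')) {
    if (div.innerHTML === 'GG/NG') div.click();
  }
});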

How can I get all the comments/reviews from a Google Maps place using Puppeteer? (I don't get all of them because the page is scrollable)

I am trying to scrape comments/reviews from a place I search using Puppeteer. I have 2 problems:
I only get 16 comments/reviews from the current page, when in reality I want ALL the comments/reviews (in this case 62 comments, or even more depending on my search), but I think the problem comes from the page being scrollable.
I am getting an error when I scrape reviews that have no comments in Google Maps, saying:
"(node:13184) UnhandledPromiseRejectionWarning: Error: Evaluation failed: TypeError: Cannot read property 'innerText' of null
at __puppeteer_evaluation_script__:9:38"
and I am not sure how to get rid of that every time there is a review that has a NULL comment (I have some code almost at the end trying to handle the NULL comments, but it doesn't work, and I tried a few other ways that didn't work either).
Below is my code:
const puppeteer = require('puppeteer'); // Require the package we need...
let scrape = async () => { // Prepare scrape...
  const browser = await puppeteer.launch({ args: ['--no-sandbox', '--disable-setuid-sandbox'] }); // Prevent non-needed issues for *NIX
  const page = await browser.newPage(); // Create request for the new page to obtain...
  const busqueda = 'Alitas+del+Cadillac+Tumbaco';
  const Url = `https://www.google.com/maps/search/${busqueda}`;
  const buscar = '.section-result';
  const click1 = '.widget-pane-link';
  const cajaTexto = '#searchboxinput';
  const comentarioLength = 'section-review-text';
  const comentarios = 'div.section-review:nth-child(Index) > div:nth-child(1) > div:nth-child(3) > div:nth-child(2) > div:nth-child(1) > span:nth-child(4)';
  console.log(comentarioLength);
  //const comentario = 'div.section-review:nth-child(INDEX) > div:nth-child(1) > div:nth-child(3) > div:nth-child(2) > div:nth-child(1) > span:nth-child(4)';
  // Replace with your Google Maps URL... Or test the Microsoft one...
  //await page.goto('https://www.google.com/maps/place/Microsoft/#36.1275216,-115.1728651,17z/data=!3m1!5s0x80c8c416a26be787:0x4392ab27a0ae83e0!4m7!3m6!1s0x80c8c4141f4642c5:0x764c3f951cfc6355!8m2!3d36.1275216!4d-115.1706764!9m1!1b1');
  await page.goto(Url); // Define the Maps URL to scrape...
  await page.waitFor(2 * 1000); // In case the server has JS that needs to be loaded...
  await page.click(buscar); // find the text box
  await page.waitForNavigation();
  await page.waitFor(2 * 1000);
  await page.click(click1);
  await page.waitForNavigation();
  await page.waitFor(2 * 1000);
  console.log(page.url());
  console.log("3");
  await page.evaluate(_ => { // This is just a test, don't really need this!
  });
  await page.waitFor(2 * 1000);
  console.log('how many?', (await page.$$('.section-review-text')).length);
  //div.section-result:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(1) > div:nth-child(2) > h3:nth-child(1) > span:nth-child(1)
  let listLength = await page.evaluate((sel) => {
    window.scrollBy(0, window.innerHeight);
    return document.getElementsByClassName(sel).length;
  }, comentarioLength);
  console.log(listLength);
  let result; // declared outside the loop so the final return works
  for (let i = 1; i <= listLength; i++) {
    let selectorComentarios = comentarios.replace("Index", i);
    result = await page.evaluate((sel) => { // Let's create variables and store values...
      return document.querySelector(sel).innerText;
    }, selectorComentarios);
    if (!result) {
      continue;
    }
    console.log(i + result);
  }
  /*await page.evaluate(_ => {
    window.scrollBy(0, window.innerHeight)
  })*/
  await browser.close(); // Close the browser...
  return result; // Return the results with the review...
};
scrape().then((value) => { // Scrape and output the results...
  console.log(value); // Yay, output the results...
});
To solve the first problem you need to handle an infinite scroll by adding a function like:
async function scrollPage(page, scrollContainer) {
  let lastHeight = await page.evaluate(`document.querySelector("${scrollContainer}").scrollHeight`);
  while (true) {
    await page.evaluate(`document.querySelector("${scrollContainer}").scrollTo(0, document.querySelector("${scrollContainer}").scrollHeight)`);
    await page.waitForTimeout(2000);
    let newHeight = await page.evaluate(`document.querySelector("${scrollContainer}").scrollHeight`);
    if (newHeight === lastHeight) {
      break;
    }
    lastHeight = newHeight;
  }
}
In this function, the first argument is the Puppeteer page and the second is a selector for the scrollable HTML element (in this case it has the class name .DxyBCb). The function checks the element's current scrollHeight, scrolls to that height, and checks scrollHeight again. If it has changed because new elements were loaded, the function repeats the scroll.
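Usage would then look something like this (the .DxyBCb class is the one mentioned above and may change whenever Google updates the page):
// scroll the reviews pane until no new reviews load, then count them
await scrollPage(page, '.DxyBCb');
const reviewCount = await page.$$eval('.section-review-text', els => els.length);
console.log('reviews loaded:', reviewCount);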
The second problem is in this part of the code:
return document.querySelector(sel).innerText;
You need to handle cases in which an element with the selector sel isn't found on the page. You can use optional chaining, which returns undefined instead of throwing an error, and add some fallback text for when that happens:
return document.querySelector(sel)?.innerText || `Element with selector ${sel} not found on the page`;
And now, if the selector you are searching for is absent, your evaluate function returns the text after ||.
A blog post with more detailed information that is beyond the scope of your question: Web Scraping Google Maps reviews with Node.js.
This is against the Terms of Service of the Google Maps Platform.
Have a look at paragraph 3.2.4 (Restrictions Against Misusing the Services). It reads:
(a) No Scraping. Customer will not extract, export, scrape, or cache Google Maps Content for use outside the Services. For example, Customer will not:(i) pre-fetch, index, store, reshare, or rehost Google Maps Content outside the services; (ii) bulk download geocodes; (iii) copy business names, addresses, or user reviews; or (iv) use Google Maps Content with text-to-speech services. Caching is permitted for certain Services as described in the Maps Service Specific Terms.
source: https://cloud.google.com/maps-platform/terms/#3-license
Sorry to be the bearer of bad news.

Puppeteer in NodeJS reports 'Error: Node is either not visible or not an HTMLElement'

I'm using 'puppeteer' for NodeJS to test a specific website. It seems to work fine in most cases, but in some places it reports:
Error: Node is either not visible or not an HTMLElement
The following code picks a link that in both cases is off the screen.
The first link works fine, while the second link fails.
What is the difference?
Both links are off the screen.
Any help appreciated,
Cheers, :)
Example code
const puppeteer = require('puppeteer');
const initialPage = 'https://website.com/path';
const selectors = [
  'div[id$="-bVMpYP"] article a',
  'div[id$="-KcazEUq"] article a'
];
(async () => {
  let selector, handles, handle;
  const width = 1024, height = 1600;
  const browser = await puppeteer.launch({
    headless: false,
    defaultViewport: { width, height }
  });
  const page = await browser.newPage();
  await page.setViewport({ width, height });
  page.setUserAgent('UA-TEST');
  // Load first page
  let stat = await page.goto(initialPage, { waitUntil: 'domcontentloaded' });
  // Click on selector 1 - works ok
  selector = selectors[0];
  await page.waitForSelector(selector);
  handles = await page.$$(selector);
  handle = handles[12]
  console.log('Clicking on: ', await page.evaluate(el => el.href, handle));
  await handle.click(); // OK
  // Click that selector 2 - fails
  selector = selectors[1];
  await page.waitForSelector(selector);
  handles = await page.$$(selector);
  handle = handles[12]
  console.log('Clicking on: ', await page.evaluate(el => el.href, handle));
  await handle.click(); // Error: Node is either not visible or not an HTMLElement
})();
I'm trying to emulate the behaviour of a real user clicking around the site, which is why I use .click(), and not .goto(), since the a tags have onclick events.
Instead of
await button.click();
do this:
await button.evaluate(b => b.click());
The difference is that button.evaluate(b => b.click()) runs the JavaScript HTMLElement.click() method on the given element in the browser context, which will fire a click event on that element even if it's hidden, off-screen, or covered by a different element, whereas button.click() clicks using Puppeteer's ElementHandle.click(), which:
1. scrolls the page until the element is in view
2. gets the bounding box of the element (this step is where the error happens) and finds the screen x and y pixel coordinates of the middle of that box
3. moves the virtual mouse to those coordinates and sets the mouse to "down" then back to "up", which triggers a click event on the element under the mouse
First and foremost, your defaultViewport object that you pass to puppeteer.launch() has no keys, only values.
You need to change this to:
'defaultViewport' : { 'width' : width, 'height' : height }
The same goes for the object you pass to page.setViewport().
You need to change this line of code to:
await page.setViewport( { 'width' : width, 'height' : height } );
Third, the function page.setUserAgent() returns a promise, so you need to await this function:
await page.setUserAgent( 'UA-TEST' );
Furthermore, you forgot to add a semicolon after handle = handles[12].
You should change this to:
handle = handles[12];
Additionally, you are not waiting for the navigation to finish (page.waitForNavigation()) after clicking the first link.
After clicking the first link, you should add:
await page.waitForNavigation();
I've noticed that the second page sometimes hangs on navigation, so you might find it useful to increase the default navigation timeout (page.setDefaultNavigationTimeout()):
page.setDefaultNavigationTimeout( 90000 );
Once again, you forgot to add a semicolon after handle = handles[12], so this needs to be changed to:
handle = handles[12];
It's important to note that you are using the wrong selector for your second link that you are clicking.
Your original selector was attempting to select elements that were only visible to xs extra small screens (mobile phones).
You need to gather an array of links that are visible to your viewport that you specified.
Therefore, you need to change the second selector to:
div[id$="-KcazEUq"] article .dfo-widget-sm a
You should wait for the navigation to finish after clicking your second link as well:
await page.waitForNavigation();
Finally, you might also want to close the browser (browser.close()) after you are done with your program:
await browser.close();
Note: You might also want to look into handling unhandledRejection errors.
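(A minimal sketch of such a handler; how you log or exit on a rejection is up to you:)
// surface unhandled promise rejections instead of letting them pass silently
process.on('unhandledRejection', (reason, promise) => {
  console.error('Unhandled rejection at:', promise, 'reason:', reason);
  process.exit(1);
});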
Here is the final solution:
'use strict';
const puppeteer = require( 'puppeteer' );
const initialPage = 'https://statsregnskapet.dfo.no/departementer';
const selectors = [
  'div[id$="-bVMpYP"] article a',
  'div[id$="-KcazEUq"] article .dfo-widget-sm a'
];
( async () =>
{
  let selector;
  let handles;
  let handle;
  const width = 1024;
  const height = 1600;
  const browser = await puppeteer.launch(
  {
    'defaultViewport' : { 'width' : width, 'height' : height }
  });
  const page = await browser.newPage();
  page.setDefaultNavigationTimeout( 90000 );
  await page.setViewport( { 'width' : width, 'height' : height } );
  await page.setUserAgent( 'UA-TEST' );
  // Load first page
  let stat = await page.goto( initialPage, { 'waitUntil' : 'domcontentloaded' } );
  // Click on selector 1 - works ok
  selector = selectors[0];
  await page.waitForSelector( selector );
  handles = await page.$$( selector );
  handle = handles[12];
  console.log( 'Clicking on: ', await page.evaluate( el => el.href, handle ) );
  await handle.click(); // OK
  await page.waitForNavigation();
  // Click that selector 2 - fails
  selector = selectors[1];
  await page.waitForSelector( selector );
  handles = await page.$$( selector );
  handle = handles[12];
  console.log( 'Clicking on: ', await page.evaluate( el => el.href, handle ) );
  await handle.click();
  await page.waitForNavigation();
  await browser.close();
})();
For anyone still having trouble, this worked for me:
await page.evaluate(()=>document.querySelector('#sign-in-btn').click())
Basically just get the element in a different way, then click it.
The reason I had to do this was that I was trying to click a button in a notification window, which sits outside the rest of the app (and Chrome seemed to think it was invisible even though it was not).
I know I'm late to the party, but I discovered an edge case that gave me a lot of grief before I found this thread, so I figured I'd post my findings.
The culprit: the CSS rule
scroll-behavior: smooth
If you have this you will have a bad time.
The solution:
await page.addStyleTag({ content: "* { scroll-behavior: auto !important; }" });
Hope this helps some of you.
My way
async function getVisibleHandle(selector, page) {
  const elements = await page.$$(selector);
  let hasVisibleElement = false,
    visibleElement = '';
  if (!elements.length) {
    return [hasVisibleElement, visibleElement];
  }
  let i = 0;
  for (let element of elements) {
    const isVisibleHandle = await page.evaluateHandle((e) => {
      const style = window.getComputedStyle(e);
      return (style && style.display !== 'none' &&
        style.visibility !== 'hidden' && style.opacity !== '0');
    }, element);
    var visible = await isVisibleHandle.jsonValue();
    const box = await element.boxModel();
    if (visible && box) {
      hasVisibleElement = true;
      visibleElement = elements[i];
      break;
    }
    i++;
  }
  return [hasVisibleElement, visibleElement];
}
Usage
let selector = "a[href='https://example.com/']";
let visibleHandle = await getVisibleHandle(selector, page);
if (visibleHandle[1]) {
  await Promise.all([
    visibleHandle[1].click(),
    page.waitForNavigation()
  ]);
}

How to avoid being detected as a bot with Puppeteer and PhantomJS?

Puppeteer and PhantomJS are similar. The issue I'm having happens with both, and the code is also similar.
I'd like to fetch some information from a website which requires authentication to view it. I can't even access the home page because it's detected as "suspicious activity", as in this screenshot: https://i.imgur.com/p69OIjO.png
I discovered that the problem doesn't happen when I tested on Postman using a header named Cookie with its value copied from the browser, but this cookie expires after some time. So I guess Puppeteer/PhantomJS are both not catching cookies, because the site is denying the headless browser access.
What could I do to bypass this?
// Simple JavaScript example
var page = require('webpage').create();
var url = 'https://www.expertflyer.com';
page.open(url, function (status) {
  if (status === "success") {
    page.render("home.png");
    phantom.exit();
  }
});
If anyone needs this in the future for the same problem: use puppeteer-extra.
I have tested the code on a server. On the 2nd run there is a Google captcha. You can solve it yourself and restart the bot, or use a captcha-solving service.
I ran the code more than 10 times and there was no IP ban, and I did not get the captcha again on my continuous runs.
But you can get the captcha again!
//sudo npm install puppeteer puppeteer-extra puppeteer-extra-plugin-stealth puppeteer-extra-plugin-adblocker readline
var headless_mode = process.argv[2]
const readline = require('readline');
const puppeteer = require('puppeteer-extra')
const StealthPlugin = require('puppeteer-extra-plugin-stealth')
puppeteer.use(StealthPlugin())
const AdblockerPlugin = require('puppeteer-extra-plugin-adblocker')
puppeteer.use(AdblockerPlugin({ blockTrackers: true }))
async function run () {
  const browser = await puppeteer.launch({
    headless: (headless_mode !== 'true') ? false : true,
    ignoreHTTPSErrors: true,
    slowMo: 0,
    args: ['--window-size=1400,900',
      '--remote-debugging-port=9222',
      "--remote-debugging-address=0.0.0.0", // You know what you're doing?
      '--disable-gpu', "--disable-features=IsolateOrigins,site-per-process", '--blink-settings=imagesEnabled=true'
    ]})
  const page = await browser.newPage();
  console.log(`Testing expertflyer.com`)
  //await page.goto('https://www.expertflyer.com')
  await goto_Page('https://www.expertflyer.com')
  await waitForNetworkIdle(page, 3000, 0)
  //await page.waitFor(7000)
  await checking_error(do_2nd_part)
  async function do_2nd_part() {
    try { await page.click('#yui-gen2 > a') } catch {}
    await page.waitFor(5000)
    var seat = '#headerTitleContainer > h1'
    try { console.log(await page.$eval(seat, e => e.innerText)) } catch {}
    await page.screenshot({ path: 'expertflyer1.png' })
    await checking_error(do_3nd_part)
  }
  async function do_3nd_part() {
    try { await page.click('#yui-gen1 > a') } catch {}
    await page.waitFor(5000)
    var pro = '#headerTitleContainer > h1'
    try { console.log(await page.$eval(pro, e => e.innerText)) } catch {}
    await page.screenshot({ path: 'expertflyer2.png' })
    console.log(`All done, check the screenshots?`)
  }
  async function checking_error(callback) {
    try {
      try { var error_found = await page.evaluate(() => document.querySelectorAll('a[class="text yuimenubaritemlabel"]').length) } catch (error) { console.log(`catch error ${error}`) }
      if (error_found === 0) {
        console.log(`Error found`)
        var captcha_msg = "Due to suspicious activity from your computer, we have blocked your access to ExpertFlyer. After completing the CAPTCHA below, you will immediately regain access unless further suspicious behavior is detected."
        var ip_blocked = "Due to recent suspicious activity from your computer, we have blocked your access to ExpertFlyer. If you feel this block is in error, please contact us using the form below."
        try { var error_msg = await page.$eval('h2', e => e.innerText) } catch {}
        try { var error_msg_details = await page.$eval('body > p:nth-child(2)', e => e.innerText) } catch {}
        if (error_msg_details == captcha_msg) {
          console.log(`Google Captcha found, you have to solve the captcha manually here or with some automated recaptcha service`)
          await verify_User_answer()
          await callback()
        } else if (error_msg_details == ip_blocked) {
          console.log(`The current ip address is blocked. The only way is to change the ip address.`)
        } else {
          console.log(`Waiting for error page load... Waiting for 10 sec before rechecking...`)
          await page.waitFor(10000)
          await checking_error()
        }
      } else {
        console.log(`Page loaded successfully! You can do things here.`)
        await callback()
      }
    } catch {}
  }
  async function goto_Page(page_URL) {
    try {
      await page.goto(page_URL, { waitUntil: 'networkidle2', timeout: 30000 });
    } catch {
      console.log(`Error in loading page, re-trying...`)
      await goto_Page(page_URL)
    }
  }
  async function verify_User_answer(call_back) {
    user_Answer = await readLine();
    if (user_Answer == 'yes') {
      console.log(`user_Answer is ${user_Answer}, Processing...`)
      // Not working how I want; will fix later.
      // Have to restart the bot after solving.
      await call_back()
    } else {
      console.log(`answer does not match, try again...`)
      var user_Answer = await readLine();
      console.log(`user_Answer is ${user_Answer}`)
      await verify_User_answer(call_back)
    }
  }
  async function readLine() {
    const rl = readline.createInterface({
      input: process.stdin,
      output: process.stdout
    });
    return new Promise(resolve => {
      rl.question('Solve the captcha and type yes to continue: ', (answer) => {
        rl.close();
        resolve(answer)
      });
    })
  }
  async function waitForNetworkIdle(page, timeout, maxInflightRequests = 0) {
    console.log('waitForNetworkIdle called')
    page.on('request', onRequestStarted);
    page.on('requestfinished', onRequestFinished);
    page.on('requestfailed', onRequestFinished);
    let inflight = 0;
    let fulfill;
    let promise = new Promise(x => fulfill = x);
    let timeoutId = setTimeout(onTimeoutDone, timeout);
    return promise;
    function onTimeoutDone() {
      page.removeListener('request', onRequestStarted);
      page.removeListener('requestfinished', onRequestFinished);
      page.removeListener('requestfailed', onRequestFinished);
      fulfill();
    }
    function onRequestStarted() {
      ++inflight;
      if (inflight > maxInflightRequests)
        clearTimeout(timeoutId);
    }
    function onRequestFinished() {
      if (inflight === 0)
        return;
      --inflight;
      if (inflight === maxInflightRequests)
        timeoutId = setTimeout(onTimeoutDone, timeout);
    }
  }
  await browser.close()
}
run();
Please note: the "Solve the captcha and type yes to continue:" method is not working as expected and needs some fixing.
Edit: Re-running the bot after 10 minutes got the captcha again. I solved the captcha on chrome://inspect/#devices, restarted the bot, and everything worked again. No IP ban.
Things that can help in general:
Headers should be similar to common browsers, including:
User-Agent: use a recent one (see https://developers.whatismybrowser.com/useragents/explore/), or better, use a random recent one if you make multiple requests (see https://github.com/skratchdot/random-useragent)
Accept-Language: something like "en,en-US;q=0.5" (adapt for your language)
Accept: a standard one would be like "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
If you make multiple requests, put a random timeout between them
If you open links found in a page, set the Referer header accordingly
Images should be enabled
Javascript should be enabled
Check that "navigator.plugins" and "navigator.language" are set in the client JavaScript page context
Use proxies
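A minimal sketch applying several of the hints above in Puppeteer (the user agent string and header values are illustrative assumptions, not prescriptive):
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // a recent, common user agent (swap in a randomized one for multiple requests)
  await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36');
  // extra headers such as Accept-Language
  await page.setExtraHTTPHeaders({ 'Accept-Language': 'en,en-US;q=0.5' });
  await page.goto('https://example.com');
  // random pause between requests, as suggested above
  await new Promise(r => setTimeout(r, 1000 + Math.floor(Math.random() * 2000)));
  await browser.close();
})();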
If you think from the website's perspective, you are indeed doing suspicious work. So whenever you want to bypass something like this, make sure to think about how they are thinking.
Set cookies properly
Puppeteer, PhantomJS, etc. use real browsers, and the cookies used there work better than those sent via Postman or similar. You just need to use the cookies properly.
You can use page.setCookie(...cookies) to set the cookies. Cookies are serialized, so if cookies is an array of objects, you can simply do this:
const cookies = [{name: 'test', value: 'foo'}, {name: 'test2', value: 'foo'}]; // just as example, use real cookies here;
await page.setCookie(...cookies);
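To get real cookies in the first place, one approach is to log in manually once (e.g. with headless: false), save the session cookies, and restore them on later runs (a sketch; the file name is arbitrary):
const fs = require('fs');

// after logging in manually, save the current session cookies...
const cookies = await page.cookies();
fs.writeFileSync('cookies.json', JSON.stringify(cookies, null, 2));

// ...and on a later run, restore them before navigating
const saved = JSON.parse(fs.readFileSync('cookies.json', 'utf8'));
await page.setCookie(...saved);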
Try to tweak the behaviors
Turn off the headless mode and see the behavior of the website.
await puppeteer.launch({headless: false})
Try proxies
Some websites monitor based on IP address; if multiple hits come from the same IP, they block the request. It's best to use rotating proxies in that case.
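For example (the proxy address is a placeholder; --proxy-server is a standard Chromium flag):
const browser = await puppeteer.launch({
  // route all browser traffic through the given proxy
  args: ['--proxy-server=http://my.rotating.proxy:3128'],
});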
The website you are trying to visit uses Distil Networks to prevent web scraping.
People have had success in the past bypassing Distil Networks by substituting the $cdc_ variable found in Chromium's call_function.js (which is used in Puppeteer).
For example:
function getPageCache(opt_doc, opt_w3c) {
  var doc = opt_doc || document;
  var w3c = opt_w3c || false;
  // var key = '$cdc_asdjflasutopfhvcZLmcfl_'; <-- This is the line that is changed.
  var key = '$something_different_';
  if (w3c) {
    if (!(key in doc))
      doc[key] = new CacheWithUUID();
    return doc[key];
  } else {
    if (!(key in doc))
      doc[key] = new Cache();
    return doc[key];
  }
}
Note: According to this comment, if you have been blacklisted before you make this change, you face another set of challenges, so you must "implement fake canvas fingerprinting, disable flash, change IP, and change request header order (swap language and Accept headers)."
