How to overcome sites that detecting Automated ChromeDriver? [duplicate]

I'm trying to automate a very basic task on a website using Selenium and Chrome, but somehow the website detects when Chrome is driven by Selenium and blocks every request. I suspect that the website relies on an exposed DOM variable like this one https://stackoverflow.com/a/41904453/648236 to detect a Selenium-driven browser.
My question is: is there a way I can make the navigator.webdriver flag false? I am willing to go so far as to recompile the Selenium source after making modifications, but I cannot seem to find the NavigatorAutomationInformation source anywhere in the repository https://github.com/SeleniumHQ/selenium
Any help is much appreciated.
P.S.: I also tried the following from https://w3c.github.io/webdriver/#interface
Object.defineProperty(navigator, 'webdriver', {
  get: () => false,
});
But it only updates the property after the initial page load. I think the site detects the variable before my script is executed.

First, the update [1]
execute_cdp_cmd(): With the availability of the execute_cdp_cmd(cmd, cmd_args) command, you can now easily execute Chrome DevTools Protocol commands using Selenium. Using this feature you can modify navigator.webdriver to prevent Selenium from getting detected.
Preventing Detection [2]
To prevent a Selenium-driven WebDriver from getting detected, one approach is to apply any or all of the steps mentioned below:
Adding the argument --disable-blink-features=AutomationControlled
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument('--disable-blink-features=AutomationControlled')
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.get("https://www.website.com")
You can find a relevant detailed discussion in Selenium can't open a second page
Rotating the user-agent through execute_cdp_cmd() command as follows:
#Setting up Chrome/83.0.4103.53 as useragent
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36'})
Change the property value of the navigator for webdriver to undefined
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
Exclude the collection of enable-automation switches
options.add_experimental_option("excludeSwitches", ["enable-automation"])
Turn-off useAutomationExtension
options.add_experimental_option('useAutomationExtension', False)
Sample Code [3]
Combining all the steps mentioned above, an effective code block would be:
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_argument("start-maximized")
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\WebDrivers\chromedriver.exe')
driver.execute_script("Object.defineProperty(navigator, 'webdriver', {get: () => undefined})")
driver.execute_cdp_cmd('Network.setUserAgentOverride', {"userAgent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.53 Safari/537.36'})
print(driver.execute_script("return navigator.userAgent;"))
driver.get('https://www.httpbin.org/headers')
History
As per the W3C Editor's Draft the current implementation strictly mentions:
The webdriver-active flag is set to true when the user agent is under remote control which is initially set to false.
Further,
Navigator includes NavigatorAutomationInformation;
It is to be noted that:
The NavigatorAutomationInformation interface should not be exposed on WorkerNavigator.
The NavigatorAutomationInformation interface is defined as:
interface mixin NavigatorAutomationInformation {
  readonly attribute boolean webdriver;
};
which returns true if webdriver-active flag is set, false otherwise.
Finally, the navigator.webdriver defines a standard way for co-operating user agents to inform the document that it is controlled by WebDriver, so that alternate code paths can be triggered during automation.
Caution: Altering/tweaking the above mentioned parameters may block the navigation and get the WebDriver instance detected.
Update (6-Nov-2019)
As of the current implementation an ideal way to access a web page without getting detected would be to use the ChromeOptions() class to add a couple of arguments to:
Exclude the collection of enable-automation switches
Turn-off useAutomationExtension
through an instance of ChromeOptions as follows:
Java Example:
System.setProperty("webdriver.chrome.driver", "C:\\Utility\\BrowserDrivers\\chromedriver.exe");
ChromeOptions options = new ChromeOptions();
options.setExperimentalOption("excludeSwitches", Collections.singletonList("enable-automation"));
options.setExperimentalOption("useAutomationExtension", false);
WebDriver driver = new ChromeDriver(options);
driver.get("https://www.google.com/");
Python Example
from selenium import webdriver
options = webdriver.ChromeOptions()
options.add_experimental_option("excludeSwitches", ["enable-automation"])
options.add_experimental_option('useAutomationExtension', False)
driver = webdriver.Chrome(options=options, executable_path=r'C:\path\to\chromedriver.exe')
driver.get("https://www.google.com/")
Ruby Example
options = Selenium::WebDriver::Chrome::Options.new
options.add_argument("--disable-blink-features=AutomationControlled")
driver = Selenium::WebDriver.for :chrome, options: options
Legends
1: Applies to Selenium's Python clients only.
2: Applies to Selenium's Python clients only.
3: Applies to Selenium's Python clients only.

ChromeDriver:
Finally discovered the simple solution for this with a simple flag! :)
--disable-blink-features=AutomationControlled
navigator.webdriver=true will no longer show up with that flag set.
For a list of things you can disable, check them out here

Do not use a CDP command to change the webdriver value, as it leads to an inconsistency that can later be used to detect WebDriver. Use the code below instead; it removes any traces of webdriver.
options.add_argument("--disable-blink-features")
options.add_argument("--disable-blink-features=AutomationControlled")

Before (in browser console window):
> navigator.webdriver
true
Change (in selenium):
// C#
var options = new ChromeOptions();
options.AddExcludedArguments(new List<string>() { "enable-automation" });
// Python
options.add_experimental_option("excludeSwitches", ['enable-automation'])
After (in browser console window):
> navigator.webdriver
undefined
This will not work for version ChromeDriver 79.0.3945.16 and above. See the release notes here
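The version cutoff above can be turned into a small guard that decides whether the excludeSwitches trick is still worth trying. This is only a minimal sketch: parse_version and exclude_switches_still_works are hypothetical helper names (not part of Selenium), and the version string is assumed to look like "79.0.3945.16", optionally followed by a build hash in parentheses.

```python
# First ChromeDriver release where excluding "enable-automation" stopped
# hiding navigator.webdriver, according to the note above.
CUTOFF = (79, 0, 3945, 16)


def parse_version(version: str) -> tuple:
    """Turn '79.0.3945.16 (build hash)' into (79, 0, 3945, 16)."""
    numeric_part = version.split()[0]  # drop a trailing "(hash...)" if present
    return tuple(int(part) for part in numeric_part.split("."))


def exclude_switches_still_works(chromedriver_version: str) -> bool:
    """True only for ChromeDriver releases older than the cutoff."""
    return parse_version(chromedriver_version) < CUTOFF
```

With a live session the string can usually be read from driver.capabilities["chrome"]["chromedriverVersion"]; for anything at or above the cutoff, the other answers here suggest --disable-blink-features=AutomationControlled instead.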

Excluding the collection of enable-automation switches, as mentioned in the 6-Nov-2019 update of the top-voted answer, no longer works as of April 2020. Instead I was getting the following error:
ERROR:broker_win.cc(55)] Error reading broker pipe: The pipe has been ended. (0x6D)
Here's what's working as of 6th April 2020 with Chrome 80.
Before (in the Chrome console window):
> navigator.webdriver
true
Python example:
options = webdriver.ChromeOptions()
options.add_argument("--disable-blink-features")
options.add_argument("--disable-blink-features=AutomationControlled")
After (in the Chrome console window):
> navigator.webdriver
undefined
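The before/after console check can also be run from the Selenium script itself, which makes it easier to confirm that a given set of options took effect. A minimal sketch, assuming driver is any Selenium WebDriver instance; webdriver_flag_hidden is a hypothetical helper name, not a Selenium API.

```python
def webdriver_flag_hidden(driver) -> bool:
    """Return True when the page does not report navigator.webdriver as true.

    Selenium maps a JavaScript `undefined` result to None in Python, so both
    None and False are treated as "not flagged" here.
    """
    value = driver.execute_script("return navigator.webdriver")
    return value is None or value is False
```

Call it after driver.get(...) on the target page; the flag is evaluated per document, so a check made before navigation says little about the page you care about.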

Nowadays you can accomplish this with a CDP command:
driver.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
    "source": """
        Object.defineProperty(navigator, 'webdriver', {
            get: () => undefined
        })
    """
})
driver.get(some_url)
By the way, you want to return undefined; false is a dead giveaway.

This finally solved the problem for ChromeDriver and Chrome versions above 79.
ChromeOptions options = new ChromeOptions();
options.addArguments("--disable-blink-features");
options.addArguments("--disable-blink-features=AutomationControlled");
ChromeDriver driver = new ChromeDriver(options);
Map<String, Object> params = new HashMap<String, Object>();
params.put("source", "Object.defineProperty(navigator, 'webdriver', { get: () => undefined })");
driver.executeCdpCommand("Page.addScriptToEvaluateOnNewDocument", params);

Since this question is about Selenium, a cross-browser solution for overriding navigator.webdriver would be useful. This could be done by patching the browser environment before any JS of the target page runs, but unfortunately no browser other than Chromium allows one to evaluate arbitrary JavaScript code after document load and before any other JS runs (Firefox is close with its Remote Protocol).
Before patching, we need to check what the default browser environment looks like. Before changing a property, we can see its default definition with Object.getOwnPropertyDescriptor():
Object.getOwnPropertyDescriptor(navigator, 'webdriver');
// undefined
So with this quick test we can see that the webdriver property is not defined on navigator itself. It's actually defined on Navigator.prototype:
Object.getOwnPropertyDescriptor(Navigator.prototype, 'webdriver');
// {set: undefined, enumerable: true, configurable: true, get: ƒ}
It's highly important to change the property on the object that owns it, otherwise the following can happen:
navigator.webdriver; // true if webdriver controlled, false otherwise
// this lazy patch is commonly found on the internet, it does not even set the right value
Object.defineProperty(navigator, 'webdriver', {
get: () => undefined
});
navigator.webdriver; // undefined
Object.getOwnPropertyDescriptor(Navigator.prototype, 'webdriver').get.apply(navigator);
// true
A less naive patch would first target the right object and use the right property definition, but digging deeper we can find more inconsistencies:
const defaultGetter = Object.getOwnPropertyDescriptor(Navigator.prototype, 'webdriver').get;
defaultGetter.toString();
// "function get webdriver() { [native code] }"
Object.defineProperty(Navigator.prototype, 'webdriver', {
set: undefined,
enumerable: true,
configurable: true,
get: () => false
});
const patchedGetter = Object.getOwnPropertyDescriptor(Navigator.prototype, 'webdriver').get;
patchedGetter.toString();
// "() => false"
A perfect patch leaves no traces. Instead of replacing the getter function, it would be better to just intercept the call to it and change the returned value. JavaScript has native support for that through the Proxy apply handler:
const defaultGetter = Object.getOwnPropertyDescriptor(Navigator.prototype, 'webdriver').get;
defaultGetter.apply(navigator); // true
defaultGetter.toString();
// "function get webdriver() { [native code] }"
Object.defineProperty(Navigator.prototype, 'webdriver', {
set: undefined,
enumerable: true,
configurable: true,
get: new Proxy(defaultGetter, { apply: (target, thisArg, args) => {
// emulate getter call validation
Reflect.apply(target, thisArg, args);
return false;
}})
});
const patchedGetter = Object.getOwnPropertyDescriptor(Navigator.prototype, 'webdriver').get;
patchedGetter.apply(navigator); // false
patchedGetter.toString();
// "function () { [native code] }"
The only inconsistency now is in the function name; unfortunately, there is no way to override the function name shown in the native toString() representation. Even so, it can pass generic regular expressions that search for spoofed native browser functions by looking for { [native code] } at the end of the string representation. To remove this inconsistency, you can patch Function.prototype.toString and make it return valid native string representations for all native functions you patched.
To sum up, in selenium it could be applied with:
chrome.execute_cdp_cmd('Page.addScriptToEvaluateOnNewDocument', {'source': """
Object.defineProperty(Navigator.prototype, 'webdriver', {
set: undefined,
enumerable: true,
configurable: true,
get: new Proxy(
Object.getOwnPropertyDescriptor(Navigator.prototype, 'webdriver').get,
{ apply: (target, thisArg, args) => {
// emulate getter call validation
Reflect.apply(target, thisArg, args);
return false;
}}
)
});
"""})
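The Function.prototype.toString patch mentioned above is not spelled out in this answer, so here is one way it might look, again injected through Page.addScriptToEvaluateOnNewDocument. Treat it as a sketch under assumptions rather than a proven recipe: STEALTH_SOURCE and install_stealth_patch are hypothetical names, the WeakMap bookkeeping is just one possible design, and driver is assumed to be a Selenium Chrome driver that exposes execute_cdp_cmd.

```python
# JavaScript evaluated in every new document before the page's own scripts.
STEALTH_SOURCE = """
(() => {
  const nativeToString = Function.prototype.toString;
  // Maps each patched function to the string a native function would report.
  const spoofed = new WeakMap();

  const originalGetter =
    Object.getOwnPropertyDescriptor(Navigator.prototype, 'webdriver').get;
  const patchedGetter = new Proxy(originalGetter, {
    apply: (target, thisArg, args) => {
      // Keep the native receiver validation, then override the result.
      Reflect.apply(target, thisArg, args);
      return false;
    }
  });
  Object.defineProperty(Navigator.prototype, 'webdriver', {
    set: undefined,
    enumerable: true,
    configurable: true,
    get: patchedGetter
  });
  spoofed.set(patchedGetter, nativeToString.call(originalGetter));

  // Make toString() answer with the recorded native string for patched functions.
  const patchedToString = new Proxy(nativeToString, {
    apply: (target, thisArg, args) =>
      spoofed.has(thisArg) ? spoofed.get(thisArg) : Reflect.apply(target, thisArg, args)
  });
  spoofed.set(patchedToString, nativeToString.call(nativeToString));
  Function.prototype.toString = patchedToString;
})();
"""


def install_stealth_patch(driver):
    """Register STEALTH_SOURCE so it runs before any script of later pages."""
    return driver.execute_cdp_cmd(
        "Page.addScriptToEvaluateOnNewDocument", {"source": STEALTH_SOURCE}
    )
```

Call install_stealth_patch(driver) once, before the first driver.get(); the script then applies to every document loaded afterwards. Other detection vectors are not covered by this.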
The Playwright project maintains forks of Firefox and WebKit to add features for browser automation, one of which is equivalent to Page.addScriptToEvaluateOnNewDocument. At the time of writing there was no Python implementation of the communication protocol, but it could be implemented from scratch.

Simple hack for python:
options = webdriver.ChromeOptions()
options.add_argument("--disable-blink-features=AutomationControlled")

As mentioned in the answer above (https://stackoverflow.com/a/60403652/2923098), the following option totally worked for me (in Java):
ChromeOptions options = new ChromeOptions();
options.addArguments("--incognito", "--disable-blink-features=AutomationControlled");

I would like to add a Java alternative to the cdp command method mentioned by pguardiario
Map<String, Object> params = new HashMap<String, Object>();
params.put("source", "Object.defineProperty(navigator, 'webdriver', { get: () => undefined })");
driver.executeCdpCommand("Page.addScriptToEvaluateOnNewDocument", params);
In order for this to work you need to use the ChromiumDriver from the org.openqa.selenium.chromium.ChromiumDriver package. From what I can tell that package is not included in Selenium 3.141.59 so I used the Selenium 4 alpha.
Also, the excludeSwitches/useAutomationExtension experimental options do not seem to work for me anymore with ChromeDriver 79 and Chrome 79.

For those of you who've tried these tricks, please make sure to also check that the user-agent that you are using is the user agent that corresponds to the platform (mobile / desktop / tablet) your crawler is meant to emulate. It took me a while to realize that was my Achilles heel ;)
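To make that check less error-prone, the user-agent string can be classified before it is handed to the driver. The sketch below is a rough heuristic only; ua_platform is a hypothetical helper, and the keyword rules (for example, treating an Android user agent without "Mobile" as a tablet) are assumptions that will not hold for every browser.

```python
def ua_platform(user_agent: str) -> str:
    """Guess which platform a user-agent string advertises.

    Returns 'tablet', 'mobile' or 'desktop' based on simple keyword checks.
    """
    ua = user_agent.lower()
    if "ipad" in ua or "tablet" in ua:
        return "tablet"
    if "android" in ua and "mobile" not in ua:
        # Assumption: Android tablets usually omit the "Mobile" token.
        return "tablet"
    if "mobile" in ua or "iphone" in ua or "android" in ua:
        return "mobile"
    return "desktop"


def ua_matches_target(user_agent: str, target: str) -> bool:
    """True when the advertised platform equals the one the crawler emulates."""
    return ua_platform(user_agent) == target
```

For example, the Chrome/83 Windows user agent used earlier in this thread classifies as "desktop", so pairing it with a mobile viewport would be exactly the kind of mismatch described above.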

Python
I tried most of the things mentioned in this post and I was still facing issues.
What saved me for now is https://pypi.org/project/undetected-chromedriver
pip install undetected-chromedriver
import undetected_chromedriver.v2 as uc
from time import sleep
from random import randint
driver = uc.Chrome()
driver.get('https://www.your_url.here')  # the URL must include the scheme
driver.maximize_window()
sleep(randint(3,9))
A bit slow, but I will take slow over non-working.
Anyone interested could go over the source code and see what makes it work.

If you use a Remote WebDriver, the code below will set navigator.webdriver to undefined.
It works for ChromeDriver 81.0.4044.122.
Python example:
options = webdriver.ChromeOptions()
# options.add_argument("--headless")
options.add_argument('--disable-gpu')
options.add_argument('--no-sandbox')
driver = webdriver.Remote(
    'http://localhost:9515', desired_capabilities=options.to_capabilities())
script = '''
Object.defineProperty(navigator, 'webdriver', {
get: () => undefined
})
'''
driver.execute_script(script)

Use --disable-blink-features=AutomationControlled to disable navigator.webdriver

Related

How to solve selenium webdriver: ElementNotInteractableError: element not interactable in nodejs

I have started learning Selenium,
but I'm stuck trying to upload and download through an element like this:
I want to upload a DWG file to this site and convert it into a text file, so I'm using Selenium.
I encountered a problem while uploading the file.
This is my error message:
I have tried several solutions to this problem:
Adding driver.manage().window().maximize();
Using click() instead of sendKeys()
Checking that the element is enabled
However, none of them solved the problem.
This is the entire code:
const { Builder, Browser, By, Key, until } = require("selenium-webdriver");
const chromeDriver = require("selenium-webdriver/chrome");
const chromeOptions = new chromeDriver.Options();
const chromeExample = async () => {
const driver = await new Builder()
.forBrowser(Browser.CHROME)
.setChromeOptions(chromeOptions.headless())
.build();
driver.manage().window().maximize();
await driver.get("https://products.aspose.app/cad/text-extractor/dwg");
await driver.wait(
until.elementLocated(By.className("filedrop-container width-for-mobile")),
10 * 1000
);
await driver.wait(
until.elementIsEnabled(
driver.findElement(By.className("filedrop-container width-for-mobile"))
),
10 * 1000
);
const tmp = await driver
.findElement(By.className("filedrop-container width-for-mobile"))
.sendKeys("/home/yeongori/workspace/Engineering-data-search-service/macro/public/images/testfile1.dwg");
console.log(tmp);
};
The same error occurs when I change the code as below.
await driver
.findElement(By.className("filedrop-container width-for-mobile"))
.sendKeys(Key.ENTER);
One strange thing is that if I change sendKeys to click and check tmp with console.log, it is null.
This is the project directory:
How can I solve this problem? I'm sorry if this is too basic a question, but I'd be happy with any hint. Thank you.
WSL2(Ubuntu-20.04), Node.js v18.12.1

You need to use sendKeys on the input element, which is nested deeper inside the element you are currently trying to send keys to.
You can reach the element with this XPath:
"//*[contains(@id,'UploadFileInput')]"

Selenium RemoteWebDriver() ERR_BLOCKED_BY_CLIENT on Chrome Extension in Selenium Grid, while without ChromeExtension works fine

We are using an extension in chrome for wiremock.
This Chrome extension works beautifully when running locally on both Windows (ref. 1) and Mac computers.
However, when instantiating RemoteWebDriver() in a Selenium test grid flavor named Moon, loading the extension fails catastrophically (on Linux) with ERR_BLOCKED_BY_CLIENT.
(ref. 1)
On Windows localhost, execution actually fails with the message "That unpacked extensions is disabled by the admin", but this can easily be worked around by manually deleting a key at:
Computer\HKEY_LOCAL_MACHINE\SOFTWARE\Policies\Google\Chrome\ExtensionInstallBlacklist
https://support.leapwork.com/hc/en-us/articles/360003031952-How-to-resolve-Error-Loading-of-unpacked-extensions-is-disabled-by-the-administrator
So I guess one needs to do something similar on the test grid node running Linux, but what exactly does that code look like?
We have a switch statement in the code, so we see that code works generally fine in the test grid when tests are instantiated without the chrome extension.
Junit4.x
com.github.tomakehurst wiremock-jre8 2.32.0
jdk8 (we have so many indirect dependencies, on about 77 different artifact versions, that updating is almost impossible; I suspect I will get the Nobel Prize in physics before that happens)
Chrome version 91.0.4472 is instantiated in the grid.
Error message :
"idgpnmonknjnojddfkpgkljpfnnfcklj is blocked. This page has been blocked by Chrome. ERR_BLOCKED_BY_CLIENT"
We have tried a few different code constructs which all failed to unblock the extension from loading.
Here is one such code example we unsuccessfully tried with:
private RemoteWebDriver createDriver( Scenario scenario) {
RemoteWebDriver driver = null;
ChromeOptions options = mock ?
new ChromeOptions()
.addExtensions(new File("src/test/resources/modheader/idgpnmonknjnojddfkpgkljpfnnfcklj.crx")) :
new ChromeOptions();
options.setPageLoadStrategy(PageLoadStrategy.EAGER);
DesiredCapabilities capabilities = new DesiredCapabilities();
capabilities.setCapability(ChromeOptions.CAPABILITY, options);
Proxy proxy = new Proxy();
proxy.setHttpProxy("http://proxy.aaaaaaaaaaaaaaaa.com:8080");
proxy.setSslProxy("http://proxy.aaaaaaaaaaaaaaaa.com:8080");
proxy.setNoProxy("");
capabilities.setCapability("proxy", proxy);
capabilities.setCapability("enableVNC", true);
capabilities.setCapability("name", String.format("TEST - %s", scenario.getName()));
System.setProperty("selenide.remote","https://grid.aaaaaaaaaaaaaaaa.com/wd/hub");
System.setProperty("selenide.browser", "chrome");
System.setProperty("selenide.proxyHost", "proxy.aaaaaaaaaaaaaaaa.com ");
System.setProperty("selenide.proxyPort", "8080");
System.setProperty("selenide.proxyEnabled", "true");
if (System.getProperty("debugMoon") != null && System.getProperty("debugMoon").length()>0) {
capabilities.setCapability("devtools", true);
capabilities.setCapability("noExit", true);
System.setProperty("devtools", "true");
System.setProperty("noExit", "true");
System.setProperty("moon_debugged", "true");
}
// options.merge(capabilities);
try {
if(mock) { addHeaderMock(); } // adding header before Chromedriver new () when started in the MOON test Grid
logger.info("*********** Grid ***********");
final URL url = new URL("https://grid.aaaaaaaaaaaaaaaa.com/wd/hub");
logger.info(">>> Connecting to Moon test grid url: " + url.toString());
driver = new RemoteWebDriver(url, capabilities); // have tried both the capabilities object and options.merge(capabilities), with no joy
} catch (final Exception e) {
logger.info("Unable to create driver " + e);
}
logger.info("Connection established\n");
return driver;
}
Any hints or pointers in the right direction are much appreciated!

How do I extract an interface from a library in typescript?

I am using the puppeteer library; however, I would like to instantiate only one browser. So I'm using a top-level function to create the browser and passing it as a parameter to helper functions, like so:
import * as puppeteer from 'puppeteer';
export async function scrape() {
const browser = await puppeteer
.launch({
//product:'chrome',
//executablePath: '/usr/bin/chromium-browser',
args: ['--no-sandbox', '--disable-setuid-sandbox'],
})
.catch(() => {});
scrapeVmb(browser)
scrapeChase(browser)
}
The problem is that I lose IntelliSense for the puppeteer library inside the helper functions. This could be solved by setting the type of the browser parameter, but I do not know where to find the type of browser.
TLDR
How can I get the browser parameter to have the browser type inside the "scrapeVmb" and "scrapeChase" functions?

After fiddling around with it some more (I don't understand HOW this works so if someone knows it might be worthy of an answer) you can do:
type browserType = puppeteer.Browser
That will give you the browser interface which you can then use like so:
function scrapeEtc(browser: browserType) {
  //...
}

selenium-webdriver issue with Error: The geckodriver.exe executable could not be found on the current PATH

Hey, I want to take a screenshot with Node.js selenium-webdriver and Firefox.
I am getting an error like this: Error: The geckodriver.exe executable could not be found on the current PATH.
I set up the environment variable, but no luck.

You need to set the path to the geckodriver.exe prior to creating the driver instance.
In Java:
System.setProperty("webdriver.gecko.driver", "./drivers/geckodriver.exe"); // replace with the path to your geckodriver.exe

I had some success with this process:
1. Check that your webdriver (geckodriver, chromedriver, etc.) is in the correct path (if you don't know how to do it, check this link: https://www.youtube.com/watch?v=fj0Ud16YJJw). I think this video is a little old, because the code information on npm has since been updated, but it still serves as instruction.
2. I changed my license type (in package.json) from "ISC" to "MIT"; you can do this manually. To my surprise, this change made my code work.
const webdriver = require("selenium-webdriver");
const firefox = require("selenium-webdriver/firefox");
const { Builder, Browser, By, Key, until } = require("selenium-webdriver");
async function example() {
let driver = await new Builder().forBrowser(Browser.FIREFOX).build();
try {
await driver.get("https://www.google.com/ncr");
await driver.findElement(By.name("q")).sendKeys("webdriver", Key.RETURN);
await driver.wait(until.titleIs("webdriver - Google Search"), 1000);
} finally {
await driver.quit();
}
}
example();
Here is another link, to the selenium-webdriver package on the npm website (https://www.npmjs.com/package/selenium-webdriver), for more information.

How do I set proxy in phantomjs

This https://www.npmjs.com/package/phantom#functionality-details page says:
You can also pass command line switches to the phantomjs process by specifying additional args to phantom.create(), eg:
phantom.create '--load-images=no', '--local-to-remote-url-access=yes', (page) ->
or by specifying them in the options* object:
phantom.create {parameters: {'load-images': 'no', 'local-to-remote-url-access': 'yes'}}, (page) ->
These examples are only in CoffeeScript, and they also imply that the create function can take
create('string',function)
or
create([object object],function)
but really the first parameter expected is the function!
I really wanted to try the options at http://phantomjs.org/api/command-line.html. I might have the wrong idea, but to me it looks like they can be used in the create function (right before you do the createPage). Am I wrong?
I have tried several things, the most logical one is this:
var phantom = require('phantom');
phantom.create(function(browser){
browser.createPage(function(page){
page.open('http://example.com/req.php', function() {
});},{parameters:{'proxy':'98.239.198.83:21320'}});});
So the page gets opened. I know this because I am making req.php save the $_SERVER object to a text file, but the REMOTE_ADDR and REMOTE_PORT values are not those of the proxy I have set. The parameters have no effect. I have also tried:
{options:{'proxy':'98.239.198.83:21320'}}
since the docs call that object the options* object (see above),
and
'--proxy=98.239.198.83:21320'
I have also dug through the phantom module to find the create function. It is not written in JS, or at least I can't see it; it must be in C++. It looks like this module has been updated, but the examples deep inside the module look like old code.
How do I do this?
EDIT:
var phantom = require('phantom');
phantom.create(function(browser){
browser.createPage(function(page){
browser.setProxy('98.239.198.83','21320','http', null, null, function(){
page.open(
'http://example.com/req.php', function() {
});});});});
This produces no error and the page gets scraped but the proxy is ignored.

As of phantom 2.0.10, the following code runs very well on my Windows machine:
const phantom = require('phantom');
let phInstance, sitepage; // shared across the promise chain below
phantom.create(["--proxy=201.172.242.184:15124", "--proxy-type=socks5"])
.then((instance) => {
phInstance = instance;
return instance.createPage();
})
.then((page) => {
sitepage = page;
return page.open('http://newsdaily.online');
})
.then((status) => {
console.log(status);
return sitepage.property('title');
})
.then((content) => {
console.log(content);
sitepage.close();
phInstance.exit();
})
.catch((error) => {
console.log(error);
phInstance.exit();
});

{ parameters: { 'proxy': 'socks://98.239.198.83:21320' } }
They didn't update their docs.

Time has moved on, and PhantomJS is now able to set a proxy "on the fly" (even on a per-page basis); see this commit: https://github.com/ariya/phantomjs/commit/efd8dedfb574c15ddaac26ae72690fc2031e6749
Here is a sample usage of the new setProxy function (I did not find an example of the per-page setting; this is general usage of a proxy on the phantom instance):
https://github.com/ariya/phantomjs/blob/master/examples/openurlwithproxy.js
If you want a per-page proxy, use the full URL for the proxy (scheme, user name, password, host, port, all in the URL).

As a side effect of trying to figure out an issue on Github for phantomjs-nodejs I was able to set a proxy as follows:
phantom = require 'phantom'
parameters = {
loadimages: '--load-images=no',
websecurity: '--web-security=no',
ignoresslerrors: '--ignore-ssl-errors=yes',
proxy: '--proxy=10.0.1.235:8118',
}
urls = {
checktor: "https://check.torproject.org/",
google: "https://google.com",
}
phantom.create parameters.websecurity, parameters.proxy, (ph) ->
ph.createPage (page) ->
page.open urls.checktor, (status) ->
console.log "page opened? ", status
page.evaluate (-> document.title), (result) ->
console.log 'Page title is ' + result
ph.exit()
The result where the proxy uses Tor was:
page opened? success
Page title is Congratulations. This browser is configured to use Tor.

Use the phantom npm package together with the co npm package.
co(function*() {
const phantomInstance = yield phantom.create(["--proxy=171.13.36.64:808"]);
crawScheduler.start(phantomInstance);
});

I'm running PhantomJS from the Windows cmd, and the syntax it uses looks a bit different from what is shown here. I noticed that if you don't put http:// in front, PhantomJS won't recognize the value. This is a complete example:
var page = require('webpage').create();
page.settings.loadImages = false;
page.settings.proxy = 'http://192.168.1.5:8080' ;
page.settings.userAgent = 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36';
page.open('http://some.com/page', function() {
page.render('some.png');
phantom.exit();
});

Yet another solution for nodejs:
const phantomInstance = await require('phantom').create();
const page = await phantomInstance.createPage();
// get current settings:
var pageSettings = await page.property('settings');
/*
{
XSSAuditingEnabled: false,
javascriptCanCloseWindows: true,
javascriptCanOpenWindows: true,
javascriptEnabled: true,
loadImages: true,
localToRemoteUrlAccessEnabled: false,
userAgent: 'Mozilla/5.0 (Unknown; Linux x86_64) ... PhantomJS/2.1.1 Safari/538.1',
webSecurityEnabled: true
}
*/
pageSettings.proxy = 'https://78.40.87.18:808';
// update settings (return value is undefined):
await page.property('settings', pageSettings);
const status = await page.open('https://2ip.ru/');
// show IP:
var ip = await page.evaluate(function () {
var el = document.getElementById('d_clip_button');
return !el ? '?' : el.textContent;
});
console.log('IP:', ip);
This is one way to set a proxy for a specific page.

The CoffeeScript example is a little strange, because it is the browser that is passed into the callback of phantom.create and not page, but otherwise it must be compatible judging by the code.
var phantom = require('phantom');
phantom.create({
parameters: {
proxy: '98.239.198.83:21320'
}
}, function(browser){
browser.createPage(function(page){
page.open('http://example.com/req.php', function() {
...
});
});
});
Proxy settings are set during process creation, not during page opening, although PhantomJS contains an undocumented phantom.setProxy() function that lets you change the proxy settings in the middle of the script. The phantom module also seems to support it.

var phantom = require('phantom');
phantom.create(function (browser) {
  browser.setProxy(proxyIP, proxyPort);
  // create the page before opening the URL
  browser.createPage(function (page) {
    page.open(url, function (status) {
      console.log(status);
    });
  });
},{dnodeOpts:{weak: false}});
It works fine on my Windows machine.
