I am writing a web scraper whose traffic I want to send through a proxy, but I can't quite figure out how to do it in Elixir.
I am using Hound on top of ChromeDriver running headless Chrome. I purchased some proxy IPs through https://luminati.io, which offers both a Chrome extension and a username/password-based proxy server.
The scraper's actions live in a GenServer that represents a user scraping the web. The app has no front end; it accepts commands sent through a Telegram bot I built, so when a user sends the login command, for instance, it triggers the GenServer's login function.
At that point the GenServer will change the ChromeDriver session using Hound.change_session_to/2 and then log the user in.
This works great, but now I want to send every request through the proxy server using the username and password. When changing the session, Hound allows chromeOptions to be set as well:
ua = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36"
change_session_to(String.to_atom(account.username), %{browserName: "chrome", chromeOptions: %{"args" => ["--user-agent=#{ua}", "--proxy-server=http://user:password@proxy.luminati.io:22225"]}})
navigate_to "https://www.website.com/"
Another thing I have tried is loading Luminati's Chrome extension and proxying the traffic through it, but I can't get the extension to load for each session. I downloaded the packed CRX extension and placed it within my priv folder. When the session starts, the user agent is set just fine, but the extension never loads. (When trying to load the extension I am not running headless.)
ua = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.120 Safari/537.36"
priv_dir = :code.priv_dir(:boost_buddy)
change_session_to(String.to_atom(account.username), %{
  browserName: "chrome",
  chromeOptions: %{
    "extensions" => ["#{priv_dir}/luminati/3.2_1"],
    "args" => ["--user-agent=#{ua}", "--proxy-server=http://user:password@proxy.luminati.io:22225"]
  }
})
navigate_to "https://www.website.com/"
Does anyone have experience using ChromeDriver with Elixir? With Ruby and Java, setting up the extension is typically no problem.
https://github.com/GoogleChrome/puppeteer/issues/659
-1 because this was the top result for googling "chrome headless extension"
Regarding sending each request through the proxy, I think you either need to interface with ChromeDriver yourself (hijacking Hound) or skip Hound and drive Chrome directly or through a Selenium grid.
I think the issue stems from the fact that hound will initiate one single chrome instance, where the proxy settings will be defined. Further requests are done using that proxy.
So in order to achieve multiple proxy connections for different sessions, you either need a way to set them through navigational steps (visiting a proxy website that then serves as a hard proxy) or use different browser instances altogether. (I might be wrong, though, and perhaps there's an easier way of proxying the requests.)
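For illustration, here is roughly what the one-browser-instance-per-proxy approach looks like with the Node selenium-webdriver client (a sketch, not Hound's API; the package and option names come from that library, and the proxy address is a placeholder):

import { Builder } from "selenium-webdriver";
import * as chrome from "selenium-webdriver/chrome";

// One Chrome instance per proxy: each Builder call starts a fresh
// browser whose --proxy-server flag is fixed for its lifetime.
async function sessionWithProxy(proxy: string) {
  const options = new chrome.Options();
  // Note: Chrome ignores credentials embedded in --proxy-server, so
  // user/password proxies need IP whitelisting or an auth extension.
  options.addArguments(`--proxy-server=${proxy}`);
  return new Builder().forBrowser("chrome").setChromeOptions(options).build();
}

async function main() {
  const driver = await sessionWithProxy("http://proxy.example.com:22225");
  await driver.get("https://www.website.com/");
  await driver.quit();
}
main();

The same idea should carry over to Hound: one ChromeDriver session per proxy, each created with its own chromeOptions.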
Related
I want to download a personal Instagram page automatically. I thought I would use wget to download the entire page, but it doesn't work.
I set the header (the same one used by the browser) and the cookies (taken with the cookies.txt extension), so the entire command line is:
wget -x -U "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36" --load-cookies cookies.txt -r "https://instagram.com/username"
But the result is a white page with the Instagram logo.
Do you have any other ideas? Is there another way to achieve this?
I think the request itself is correct. Maybe Instagram builds the page dynamically with JavaScript or something similar and I'm going about this the wrong way; but if that's true, then when I open the page in a browser, it must be the browser that executes the JavaScript code. Is this correct?
wget is not a web browser. In particular, it doesn't understand JavaScript, and Instagram's user page has most of its content generated via JavaScript, so that's your first problem.
Your second problem is that Instagram's bot policy forbids the use of wget, and it's very conceivable that they have measures to detect wget even if you change the user agent - there are companies which specialize in that.
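Setting the policy issue aside (it applies to any client you use), the JavaScript problem is normally solved with a headless browser that actually executes scripts. A minimal sketch with Puppeteer in TypeScript (assumes npm install puppeteer; the URL is the one from the question):

import puppeteer from "puppeteer";

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Wait until network activity settles so client-side rendering finishes.
  await page.goto("https://instagram.com/username", { waitUntil: "networkidle2" });
  const html = await page.content(); // the DOM after JavaScript has run
  console.log(html);
  await browser.close();
})();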
I buy web traffic from several sources (including the major names in the industry) and recently got reports from advertisers that there's quite a bit of "invalid" traffic. They won't share which filter they use so that I could block that traffic on my end. I tested all the navigator properties, resolution, window size, Modernizr features, etc., and the bad traffic seems to be spoofing everything.
After some testing, I found that using this code:
document.addEventListener('click', function() {
  // report the user agent as seen by the page that handled the click
  window.open('/save?' + navigator.userAgent, '_blank');
});
In some cases, the saved user agent is different from the one seen in the top window. Meaning: a visit hits a page, and in that page the user agent could be something like:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763
Then that page uses window.open() to open a new window and reads the user agent again, and it will read something like this:
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/72.0.3617.0 Safari/537.36
I tried all the usual methods: window.chrome, webdriver, permissions, plugins, fonts, reading those variables in an iframe, etc. The visits pass all those tests; the only thing that works is the window.open trick, but I obviously can't open a popup just to filter traffic.
Is there any way to detect this type of traffic?
Is there a way for Passport to check whether a request came from a mobile or web app when doing authentication? Because if the request came from the web I want to return a view; otherwise, return a JSON payload.
In my opinion, you can check the User-Agent header in the request. It looks like this (from Windows):
user-agent:Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36
and this one came from my iPhone:
User-Agent:Mozilla/5.0 (iPhone; CPU iPhone OS 10_3 like Mac OS X) AppleWebKit/602.1.50 (KHTML, like Gecko) CriOS/56.0.2924.75 Mobile/14E5239e Safari/602.1
and this one is from Android:
User-Agent:Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Mobile Safari/537.36
so you can figure out from the User-Agent whether a request came from a mobile device or a PC.
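For example, with Express middleware that check might look roughly like this (a sketch; the regex is illustrative rather than exhaustive, and a configured view engine is assumed):

import express from "express";

const app = express();

// Crude classification: treat any User-Agent that mentions a common
// mobile platform token as a mobile client. UA strings are spoofable.
function isMobile(userAgent: string): boolean {
  return /iPhone|iPad|Android|Mobile/i.test(userAgent);
}

app.post("/login", (req, res) => {
  if (isMobile(req.get("User-Agent") ?? "")) {
    res.json({ status: "ok" });  // mobile app: JSON payload
  } else {
    res.render("dashboard");     // web: a rendered view
  }
});

app.listen(3000);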
If you have two different clients expecting different results, then you should explicitly send different requests, not try to guess which response is wanted from some header that isn't necessarily reliable. Plus, there's nothing keeping a mobile device from also accessing the web interface. You can either vary the path or vary a query string.
So, from web, you might use /login and from mobile, you might use /login-json or some different path that indicates you want json.
Or from web, you might use /login and from mobile, you might use /login?type=json.
I would NOT recommend using the user-agent header to detect the intent of the request. Instead, specify the intent directly in the request.
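With the same Express setup as in the earlier sketch (replacing the sniffing route), the explicit-request variant could look like this; the route names follow the suggestion above and the response bodies are placeholders:

// The client states its intent in the path instead of being sniffed.
app.post("/login-json", (req, res) => {
  res.json({ status: "ok" });    // the mobile app calls this path
});

app.post("/login", (req, res) => {
  res.render("dashboard");       // the web login form posts here
});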
Can we use the Mobile Tools module in Drupal 6 with Varnish?
I suspect Varnish will cache the index page and will not allow redirection to the mobile version of the page.
Any workaround?
You want to make your server return different responses based on the device/browser used. This means your pages 'vary' based on the User-Agent HTTP request header, and in theory you should instruct any HTTP proxy/cache in between to only use a cached version if the User-Agent string is the same, by adding an HTTP response header:
Vary: User-Agent
However, because browsers like Internet Explorer (unlike Chrome) use many slightly different User-Agent headers, this will completely kill your cache hit ratio. You need a smarter cache to understand that Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0) for your purposes is equal to Mozilla/5.0 (compatible; MSIE 8.0; Windows NT 5.0; Trident/4.0; InfoPath.1; SV1; .NET CLR 3.0.4506.2152; .NET CLR 3.5.30729; .NET CLR 3.0.04506.30), or any other user-agent string used by a desktop browser.
There are two options for you to solve this with Varnish:
1: Do the mobile user-agent detection yourself in VCL, the same way Mobile Tools does it, and make the device class part of the cache key. Roughly:
sub vcl_recv {
  if (req.http.User-Agent ~ "(?i)ipad|ipod|iphone|android|mini opera|blackberry|up.browser|up.link|mmp|symbian|smartphone|midp|wap|vodafone|o2|pocket|kindle|mobile|pda|psp|treo") {
    set req.http.X-Device = "mobile";  # classify the request
  }
}
sub vcl_hash {
  hash_data(req.http.X-Device);  # Varnish 3+ syntax; on Varnish 2: set req.hash += req.http.X-Device;
}
2: Or, always set a session cookie mobile=true or mobile=false after you've seen the first request, and only serve cached pages for requests with this cookie.
And after googling a bit, you should read: http://fangel.github.com/mobile-detection-varnish-drupal/
This question already has answers here: Why do all browsers' user agents start with "Mozilla/"? (6 answers). Closed 4 years ago.
While sending many requests to the server myself, I noticed something odd: in IE, if I choose the Opera user string, its value is
User-Agent Opera/9.80 (Windows NT 6.1; U; en) Presto/2.2.15 Version/10.00
But if I choose any other browser in Internet Explorer, it puts Mozilla/5.0 first in the user string.
When I send an AJAX request from Chrome, I see the same thing; the user string is
Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20
I found that Mozilla is an organization that doesn't have anything to do with Google or Microsoft; perhaps it was a competitor to both. Why do Microsoft and Google both put Mozilla in their user agent? Why do Chrome and IE both put Mozilla in the user string when they send a request? Is there any specific reason for that?
See: user-agent-string-history
It all goes back to browser sniffing and making sure that the browsers are not blocked from getting content they can support. From the above article:
And Internet Explorer supported frames, and yet was not Mozilla, and so was not given frames. And Microsoft grew impatient, and did not wish to wait for webmasters to learn of IE and begin to send it frames, and so Internet Explorer declared that it was “Mozilla compatible” and began to impersonate Netscape, and called itself Mozilla/1.22 (compatible; MSIE 2.0; Windows 95), and Internet Explorer received frames, and all of Microsoft was happy, but webmasters were confused.