Splash does not connect to proxy using any of the 3 ways described in documentation - python-3.x

The Splash browser does not send anything through the HTTP proxy; pages are fetched even when the proxy is not running.
I am using Scrapy with Splash in Python 3 to fetch pages after authentication for an Angular.js website. The script is able to fetch pages, authenticate, and fetch pages after authentication. However, it does not use the proxy set up at localhost:8090, and Wireshark confirms that traffic coming from port 8050 goes to some port in the 50k range.
The setup is
- splash running locally on a docker image (latest) on port 8050
- Python 3 running locally on a Mac
- ZAP proxy running locally on the Mac at port 8090
- Web page accessed through VPN
I have tried to specify the proxy host:port through the Splash UI in Chrome with a Lua script. The page is fetched without the proxy.
I have tried to specify the proxy in the Python script, both with Lua and with the API (args={'proxy': 'host:port'}), and the page is fetched without using the proxy.
I have tried using the proxy profile file, and I get status 502.
Proxy set through Lua on Chrome (no error, not proxied):
function main(splash, args)
  splash:on_request(function(request)
    request:set_proxy{
      host = "127.0.0.1",
      port = 8090,
      username = "",
      password = "",
      type = "HTTP"
    }
  end)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end
req = SplashRequest("http://mysite/home", self.log_in,
                    endpoint='execute', args={'lua_source': script})
Proxy set through api (status 502):
req = SplashRequest("http://mysite/home", self.log_in,
                    args={'proxy': 'http://127.0.0.1:8090'})
Proxy set through Lua in Python (no error, not proxied):
def start_requests(self):
    script = """
    function main(splash, args)
      assert(splash:go(args.url))
      assert(splash:wait(0.5))
      splash:on_request(function(request)
        request:set_proxy{
          host = "127.0.0.1",
          port = 8090,
          username = "",
          password = "",
          type = "HTTP"
        }
      end)
      return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
      }
    end
    """
    req = SplashRequest("http://mysite/home", self.log_in,
                        endpoint='execute', args={'lua_source': script})
    # req.meta['proxy'] = 'http://127.0.0.1:8090'
    yield req
Proxy set through a proxy profile in the Docker image (status 502):
Proxy profile file:
[proxy]
; required
host=127.0.0.1
port=8090
Shell command:
docker run -it -p 8050:8050 -v ~/Documents/proxy-profile:/etc/splash/proxy-profiles scrapinghub/splash --proxy-profiles-path=/etc/splash/proxy-profiles
All of the above should display the page in ZAP proxy at port 8090.
Some of the above seem to set the proxy, but Splash can't reach the proxy at localhost:8090 (status 502). Some don't work at all (no error, not proxied). I think this may be related to the fact that a Docker image is being used.
I am not looking to use Selenium, because that is what this is replacing.

All methods returning status 502 were actually configured correctly. The cause of this issue is that a Docker container cannot reach the host machine via localhost. To resolve it on a Mac host, use http://docker.for.mac.localhost:8090 as the proxy host:port. On Linux, run the container with docker run -it --network host scrapinghub/splash and keep localhost:port; note that -p is redundant with --network host, since all of the container's services already bind on the host's localhost.
Method 2 (the 'proxy' argument) is best for a single proxy without rules. Method 4 (proxy profiles) is best for multiple proxies with rules.
I did not try the other methods to see what they would return with these changes.
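The host-alias fix boils down to rewriting the proxy URL before handing it to Splash. A minimal sketch (the helper name `proxy_for_container` is hypothetical; the right alias depends on platform, e.g. `docker.for.mac.localhost` or `host.docker.internal` on Docker Desktop):

```python
from urllib.parse import urlsplit, urlunsplit

def proxy_for_container(proxy_url, host_alias="host.docker.internal"):
    """Rewrite a proxy URL that points at the host's loopback so that it
    is reachable from inside a Docker container."""
    parts = urlsplit(proxy_url)
    if parts.hostname in ("127.0.0.1", "localhost"):
        # Keep the port, swap only the host part
        netloc = host_alias if parts.port is None else f"{host_alias}:{parts.port}"
        parts = parts._replace(netloc=netloc)
    return urlunsplit(parts)
```

The result can then be passed as args={'proxy': ...} in the SplashRequest; URLs that already point at a non-loopback host are returned unchanged.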

Alright, I have been struggling with the same problem for a while now, but I found the solution for your first method on GitHub, based on what the Docker docs state:
The host has a changing IP address (or none if you have no network access). From 18.03 onwards our recommendation is to connect to the special DNS name host.docker.internal, which resolves to the internal IP address used by the host.
The gateway is also reachable as gateway.docker.internal.
This means you should use "host.docker.internal" as the host for your proxy instead, e.g.:
splash:on_request(function(request)
  request:set_proxy{
    host = "host.docker.internal",
    port = 8090
  }
end)
Here is the link to the explanation: https://github.com/scrapy-plugins/scrapy-splash/issues/99#issuecomment-386158523

Related

Nodejs app in Docker not accessible on localhost:3000

I am trying to run a Node application in a Docker container. The installation instructions said that after adding host: '0.0.0.0' to config/local.js and running docker-compose up, the app should be accessible at localhost:3000, but instead I get an error message in the browser: "The connection was reset - The connection to the server was reset while the page was loading."
I have tried adding host: '0.0.0.0' in different places, removing it entirely, accessing https://localhost:3000 instead of http, connecting to localhost:3000/webpack-dev-server/index.html, etc.
What could be going wrong?

Passing tor proxy to splash in WSL2

I'm currently trying to pass a proxy to a Splash instance running on Docker Desktop, launched from WSL.
I start tor using sudo service tor start.
To make sure my WSL Tor service is communicating with Windows, I passed it as a proxy to Firefox with the following parameters:
IP: 127.0.0.1
Port: 9050
Proxy Type: SOCKS5
Then I go to https://check.torproject.org/ and tadaa it works.
I run my container using the following command:
sudo docker run -p 8050:8050 -p 5023:5023 scrapinghub/splash --disable-browser-caches
The easiest way I found to test it was to go to localhost:8050 and type in the following lines:
function main(splash, args)
  splash:on_request(function(request)
    request:set_proxy{
      host = "127.0.0.1",
      port = 9050,
      username = "",
      password = "",
      type = "SOCKS5"
    }
  end)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end
I query https://check.torproject.org/ , and I get error 99.
Am I missing something important here?
Have you looked at proxy profiles? They look to be the preferred way to attach proxies to a dockerized Splash container.
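Following that suggestion, a minimal profile for this setup might look like the file below. This is a sketch under two assumptions: Splash supports type=SOCKS5 in profiles (newer releases), and the host's Tor service is reachable from the container via Docker Desktop's host.docker.internal alias, since 127.0.0.1 inside the container points at the container itself, not at the host:

```ini
[proxy]
; Tor SOCKS proxy on the host, reached via Docker Desktop's host alias
host=host.docker.internal
port=9050
type=SOCKS5
```

The profile directory is then mounted with -v and enabled with --proxy-profiles-path, as shown in the docker run command of the first question above.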

How to use puppeteer with browserless and proxy

I can't figure out how to use puppeteer with browserless and proxy. I keep getting proxy connection errors.
I run browserless in docker like so:
docker run -p 3000:3000 -e "MAX_CONCURRENT_SESSIONS=5" -e "MAX_QUEUE_LENGTH=0" -e "PREBOOT_CHROME=true" -e "CONNECTION_TIMEOUT=300000" --restart always browserless/chrome
Puppeteer options in config I tried to connect with:
const args = [
  '--no-sandbox',
  '--disable-setuid-sandbox',
  '--disable-infobars',
  '--window-position=0,0',
  '--ignore-certificate-errors',
  '--window-size=1400,900',
  '--ignore-certificate-errors-spki-list',
];
const options = {
  args,
  headless: true,
  ignoreHTTPSErrors: true,
  defaultViewport: null,
  browserWSEndpoint: `ws://localhost:3000?--proxy-server=socks5://127.0.0.1:9055`,
};
How I connect:
const browser = await puppeteer.connect(config.options);
const page = await browser.newPage();
await page.goto('http://example.com', { waitUntil: 'networkidle0' });
Error I get:
Error: net::ERR_PROXY_CONNECTION_FAILED at http://example.com
at navigate (C:\...\node_modules\puppeteer\lib\cjs\puppeteer\common\FrameManager.js:115:23)
at processTicksAndRejections (internal/process/task_queues.js:94:5)
at async FrameManager.navigateFrame (C:\...\node_modules\puppeteer\lib\cjs\puppeteer\common\FrameManager.js:90:21)
at async Frame.goto (C:\...\node_modules\puppeteer\lib\cjs\puppeteer\common\FrameManager.js:417:16)
at async Page.goto (C:\...\node_modules\puppeteer\lib\cjs\puppeteer\common\Page.js:825:16)
The proxy I'm using in the example above is the Tor browser's, running in the background. I can connect through it when I'm not using browserless and call puppeteer.launch() instead: I put this proxy in args and everything works fine; the requests go through the Tor proxy. I can't figure out why it doesn't work with browserless and websockets, though.
Of course I tried different proxies. I created a local proxy in Node similar to How to create a simple http proxy in node.js? (the proxy-server option is then --proxy-server=http://127.0.0.1:3001), but the error is the same, and I can't even see incoming requests in the server's terminal; it looks like they don't even reach the proxy.
I tried public proxy addresses: same error.
Changing the website I'm trying to connect to in the page.goto() function doesn't change anything; I still get the same error.
I'm a beginner at web scraping and have run out of options here. Any idea would be helpful.
In order to fix the issue specifically with Tor, you need to make sure that the torrc file has 0.0.0.0:9050 open, so that the proxy can be used from any network IP; otherwise it only works from localhost. Once you set that, you can pass socks5://172.17.0.1:9050 to your browserless Docker container, and it can access the Tor proxy from the host system. Keep in mind that the docker0 IP may be different; run ip addr show docker0 to find the host's address on that bridge and use the right one when passing it as the proxy.
Ok, it looks like a Docker issue. Apparently there are problems when trying to connect from browserless inside the container to Tor on the host. I used host.docker.internal instead of localhost in the connection string and it worked.

How to connect to docker container's link alias from host

I have 3 separate pieces to my dockerized application:
nodeapp: A node:latest docker container running an expressjs app that returns a JSON object when accessed from /api. This server is also CORs enabled according to this site.
nginxserver: A nginx:latest static server that simply hosts an index.html file that allows the user to click a button which would make the XMLHttpRequest to the node server above.
My host machine
The node:latest has its port exposed to the host via 3000:80.
The nginx:latest has its port exposed to the host via 8080:80.
From host I am able to access both nodeapp and nginxserver individually: I can make requests and see the JSON object returned from the node server using curl from the command line, and the button (index.html) is visible on the screen when I hit localhost:8080.
However, when I click the button, the call to XMLHttpRequest('GET', 'http://nodeapp/api', true) fails without seemingly hitting the nodeapp server (no log is present). I assume this is because the host does not understand http://nodeapp/api.
Is there a way to tell docker that while a container is running to add its container linking alias to my hosts file?
I don't know if my question is the proper solution to my problem. It looks as though I'm getting a CORs error returned but I don't think it is ever hitting my server. Does this have to do with accessing the application from my host machine?
Here is a link to an example repo
Edit: I've noticed that when using the stack, clicking the button returns a response from my nginx container. I'm confused as to why it routes through that server, since nodeapp is in my hosts file, so the correlation should be recognized there.
Problem:
nodeapp exists on an internal network, which is visible to your nginxserver only; you can check this by entering nginxserver:
docker exec -it nginxserver bash
# cat /etc/hosts
Most importantly, your service setup is not correct: nginxserver should act as a reverse proxy in front of nodeapp:
host (client) -> nginxserver -> nodeapp
Dirty Quick Solution:
If you really want your client (the host) to access the internal application nodeapp, then simply change the code below
XMLHttpRequest('GET', 'http://nodeapp/api', true)
To
XMLHttpRequest('GET', 'http://localhost:3000/api', true)
This works because in your docker-compose.yml, nodeapp's port 80 is exposed on the host network as 3000, which can be accessed directly.
Better solution
You need to redesign your service stack to make nginxserver the frontend node; see this sample: http://schempy.com/2015/08/25/docker_nginx_nodejs/
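The reverse-proxy layout can be sketched in the nginx config (paths and ports here are assumptions for illustration, not taken from the example repo): the browser only ever talks to nginxserver, which forwards /api to nodeapp over the internal Docker network.

```nginx
server {
    listen 80;

    # Serve the static index.html with the button
    location / {
        root /usr/share/nginx/html;
        index index.html;
    }

    # Forward API calls to the linked nodeapp container over the internal
    # Docker network; the browser never needs to resolve "nodeapp" itself
    location /api {
        proxy_pass http://nodeapp:80;
        proxy_set_header Host $host;
    }
}
```

With this in place the page can call XMLHttpRequest('GET', '/api', true) with a relative URL, which also sidesteps the CORS problem entirely.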

cntlm proxy with phantomjs

I'm trying to use the cntlm proxy on my Windows machine to talk, from PhantomJS, to a local web application on IIS that uses Windows Authentication. To create the proxy, I'm running: cntlm -v -u username@domain -p password -l 1456 localhost:80
My app lives at localhost/myapp
To test whether or not this works, I try to browse to localhost:1456/myapp but I always get an auth challenge and no sensible username/password combination seems to work. Any thoughts on why this setup might not be working as expected?
When I hit the proxied endpoint in a browser, this is the output from cntlm:
http://pastebin.com/xvvmfsGV
After wrestling with the concept for a while I finally figured out how to get this set up.
After installing cntlm, I ran the following from a command prompt:
"c:\Program Files (x86)\Cntlm\cntlm.exe" -u <user_name> -d <domain_name> -H
This asks for your password and spits out three hashes to use in the configuration file.
I whittled down the required configuration in cntlm.ini to:
Username <user_name>
Domain <domain_name>
PassLM <LM_hash>
PassNT <NT_hash>
PassNTLMv2 <NTLMv2_hash>
Proxy 192.168.7.1:80 #random proxy
NoProxy *
Listen 3133 # unused port
cntlm forces you to specify a top-level proxy even if you don't need or have one, so any valid value for that option will do. Setting NoProxy to * ensures that no request ever gets passed on to the bogus proxy.
Run "c:\Program Files (x86)\Cntlm\cntlm.exe" -f in a console to verify that everything is working; otherwise, start and stop it as a service.
To test with phantomjs I used the following script:
var page = require('webpage').create();
page.open('http://<machine_name>/myapp', function(status) {
  console.log("Status: " + status);
  if (status === "success") {
    page.render('example.png');
  }
  phantom.exit();
});
<machine_name> cannot be localhost, because PhantomJS bypasses proxies when the host is localhost; use your machine name or IP address instead.
To run it: phantomjs --proxy=localhost:3133 test.js
