How to use puppeteer with browserless and proxy - node.js

I can't figure out how to use puppeteer with browserless and proxy. I keep getting proxy connection errors.
I run browserless in docker like so:
docker run -p 3000:3000 -e "MAX_CONCURRENT_SESSIONS=5" -e "MAX_QUEUE_LENGTH=0" -e "PREBOOT_CHROME=true" -e "CONNECTION_TIMEOUT=300000" --restart always browserless/chrome
Puppeteer options in config I tried to connect with:
const args = [
  '--no-sandbox',
  '--disable-setuid-sandbox',
  '--disable-infobars',
  '--window-position=0,0',
  '--ignore-certificate-errors',
  '--window-size=1400,900',
  '--ignore-certificate-errors-spki-list',
];
const options = {
  args,
  headless: true,
  ignoreHTTPSErrors: true,
  defaultViewport: null,
  browserWSEndpoint: `ws://localhost:3000?--proxy-server=socks5://127.0.0.1:9055`,
};
How I connect:
const browser = await puppeteer.connect(config.options);
const page = await browser.newPage();
await page.goto('http://example.com', { waitUntil: 'networkidle0' });
Error I get:
Error: net::ERR_PROXY_CONNECTION_FAILED at http://example.com
at navigate (C:\...\node_modules\puppeteer\lib\cjs\puppeteer\common\FrameManager.js:115:23)
at processTicksAndRejections (internal/process/task_queues.js:94:5)
at async FrameManager.navigateFrame (C:\...\node_modules\puppeteer\lib\cjs\puppeteer\common\FrameManager.js:90:21)
at async Frame.goto (C:\...\node_modules\puppeteer\lib\cjs\puppeteer\common\FrameManager.js:417:16)
at async Page.goto (C:\...\node_modules\puppeteer\lib\cjs\puppeteer\common\Page.js:825:16)
The proxy in the example above is the Tor Browser, running in the background. I can connect through it when I use the puppeteer.launch() function instead of browserless: I put the proxy in args and everything works fine, with the requests going through the Tor proxy. I can't figure out why it doesn't work with browserless and websockets, though.
Of course, I tried different proxies. I created a local proxy in Node similar to the one in "How to create a simple http proxy in node.js?" (the proxy-server option is then --proxy-server=http://127.0.0.1:3001), but the error is the same and I can't even see incoming requests in the server's terminal; it looks like they don't even reach the proxy.
I tried public proxy addresses and got the same error.
Changing the website I'm trying to connect to in the page.goto() function doesn't change anything; I still get the same error.
I'm a beginner at web scraping and have run out of options here. Any idea would be helpful.

To fix the issue specifically with Tor, make sure the torrc file opens the SOCKS port on 0.0.0.0:9050 (SocksPort 0.0.0.0:9050) so that it is reachable from any network interface; otherwise it only accepts connections from localhost. Once you set that, you can pass socks5://172.17.0.1:9050 to your browserless Docker container and it can reach the Tor proxy from the host system. Keep in mind that the docker0 IP may be different: run ip addr show docker0 to find the host's address on that bridge and use it when passing the proxy.
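As a rough sketch of the lookup described above, the bridge address can also be read from Node rather than from ip addr show docker0. This assumes the script runs on the Linux host and that the bridge interface is named docker0; the helper names findInterfaceIPv4 and torProxyUrl are hypothetical, not part of any library.

```javascript
const os = require('os');

// Return the first IPv4 address of the named interface, or null if absent.
// `interfaces` defaults to the live os.networkInterfaces() result and is a
// parameter only so the function can be exercised with sample data.
function findInterfaceIPv4(name, interfaces = os.networkInterfaces()) {
  const entries = interfaces[name] || [];
  // Node reports family as 'IPv4' (a few 18.x releases used the number 4).
  const match = entries.find((e) => e.family === 'IPv4' || e.family === 4);
  return match ? match.address : null;
}

// Build the proxy URL to hand to browserless; 9050 is the Tor SOCKS port
// mentioned in the answer above.
function torProxyUrl(hostIp, port = 9050) {
  return `socks5://${hostIp}:${port}`;
}

const dockerHostIp = findInterfaceIPv4('docker0');
if (dockerHostIp) {
  console.log(torProxyUrl(dockerHostIp));
} else {
  console.log('docker0 interface not found on this machine');
}
```

The printed URL is what would go after --proxy-server= in the browserless connection string.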

OK, it looks like a Docker issue. Apparently, the connection fails when browserless, running inside the container, tries to reach Tor, which runs on the host. I used host.docker.internal instead of localhost in the connection string and it worked.
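A minimal sketch of the corrected connection, assuming browserless listens on ws://localhost:3000 and the proxy on host port 9055 as in the question, and that host.docker.internal resolves inside the container (on Linux this may require starting the container with --add-host host.docker.internal:host-gateway). The helper names buildBrowserlessEndpoint and connectViaProxy are hypothetical.

```javascript
// Build the browserless WebSocket endpoint with a --proxy-server launch flag,
// in the same query-string form used in the question.
function buildBrowserlessEndpoint(browserlessWsUrl, proxyUrl) {
  return `${browserlessWsUrl}?--proxy-server=${proxyUrl}`;
}

// The proxy address is resolved inside the container, so it must name the
// host as the container sees it, not 127.0.0.1.
const endpoint = buildBrowserlessEndpoint(
  'ws://localhost:3000',
  'socks5://host.docker.internal:9055'
);

// Not called automatically: needs puppeteer installed and browserless running.
async function connectViaProxy() {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.connect({ browserWSEndpoint: endpoint });
  const page = await browser.newPage();
  await page.goto('http://example.com', { waitUntil: 'networkidle0' });
  await browser.close();
}

console.log(endpoint);
```

The ws://localhost:3000 part is still resolved on the host, where port 3000 is published; only the proxy address is resolved from inside the container.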

Related

Nodejs app in Docker not accessible on localhost:3000

I am trying to run a Node application in a Docker container. The installation instructions specified that after adding host: '0.0.0.0' to config/local.js and running docker-compose up the app should be accessible at localhost:3000, but I get an error message in the browser saying "The connection was reset - The connection to the server was reset while the page was loading."
I have tried to add the host: '0.0.0.0' in different places, or remove it entirely, access https://localhost:3000 as opposed to at http, connecting to localhost:3000/webpack-dev-server/index.html, etc.
What could be going wrong?

Launch browser using puppeteer on docker container

I am trying to launch the browser using puppeteer in a Docker container.
However, when I try to load the browser by hitting the API, I see the following error
localhost:3000 is my client running locally. I am not sure whether Docker can access this address, and I think this could be the reason for the connection failure. Please correct me if I am wrong.
When I try the same scenario without Docker, it works fine: I can see puppeteer opening Chromium and showing the page. What should I do to make it work in the Docker container?
The localhost on your host, where your application is running on port 3000, is in a different network namespace from the localhost of your puppeteer instance running in Docker.
To fix this, you can either:
- Make them be in the same namespace (Host Mode)
- Create a bridge between them (Bridge Mode)
Host Mode
This puts your container in the same network namespace as the host, so localhost refers to the same thing inside and outside the container. Add --network host to your docker run command.
Bridge Mode
You can form a bridge between the container and the host by adding --add-host host.docker.internal:host-gateway to the docker run command and changing puppeteer to use host.docker.internal:3000.
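For the Bridge Mode option, the address change on the puppeteer side could look like the sketch below. It assumes the container was started with the --add-host mapping above; rewriteForContainer is a hypothetical helper, not a puppeteer API.

```javascript
// Rewrite loopback URLs so they point at the Docker host instead of the
// container itself. Other hosts are returned unchanged (after URL
// normalization by the WHATWG URL parser).
function rewriteForContainer(urlString, hostAlias = 'host.docker.internal') {
  const url = new URL(urlString);
  if (url.hostname === 'localhost' || url.hostname === '127.0.0.1') {
    url.hostname = hostAlias;
  }
  return url.toString();
}

const target = rewriteForContainer('http://localhost:3000');
console.log(target);
// Inside the container, this value would then be passed to page.goto(target).
```

In Host Mode no rewrite is needed, since localhost already refers to the host.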

How to correctly POST with axios to Node server with localhost in baseURL?

I am running a Node server locally on port 3000. I am using axios in ReactNative app to get and post data from the server.
I am setting the baseURL as:
const axiosObj = axios.create({
baseURL: 'http://localhost:3000'
});
This is not working and resulting in Network Error. However, if I use baseURL like this then it works
const axiosObj = axios.create({
baseURL: 'http://172.XX.X.X:3000'
});
I have also tried with baseURL:http://127.0.0.1:3000 which also doesn't work
I really need it to work with http://localhost:3000 or http://127.0.0.1:3000 without having to provide the actual IP address. Help would be appreciated.
The React Native app runs on a separate device, so it can't reach your machine's localhost or http://127.0.0.1 directly. For an app running on Android, use
adb reverse tcp:3000 tcp:3000
For iOS, there is no way to do a port reverse.
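One way to act on this is to choose the base URL per platform. The sketch below assumes adb reverse tcp:3000 tcp:3000 has already been run for Android; resolveBaseUrl is a hypothetical helper, and 192.168.1.20 is a made-up LAN address standing in for the development machine. In an app, the platform value would presumably come from React Native's Platform.OS; it is a plain parameter here so the snippet runs in Node.

```javascript
// Pick the server address the device can actually reach.
// - android: localhost works once `adb reverse` forwards the port
// - anything else: fall back to the development machine's LAN address
function resolveBaseUrl(platform, lanIp, port = 3000) {
  const host = platform === 'android' ? 'localhost' : lanIp;
  return `http://${host}:${port}`;
}

console.log(resolveBaseUrl('android', '192.168.1.20'));
console.log(resolveBaseUrl('ios', '192.168.1.20'));
// The result would be passed as baseURL to axios.create({ baseURL }).
```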

Node js can't upload files to FTP when deployed and running on production server

I'm using Node JS (12.13.0) and NPM (6.13.19) with basic-ftp. Everything works fine and I can upload files to the remote FTP (without SSL, my remote FTP doesn't allow this) when I run the code on my development machine from localhost.
The production server is hosted on Digital Ocean (Ubuntu 18.04.3). I have tried disabling the firewall, because I thought it might be the cause of the problem. I used sudo ufw disable and, to make sure it was disabled, checked the current status with sudo ufw status, which returns Status: inactive.
This is my code
// basic-ftp provides the Client class used below
const ftp = require("basic-ftp")
async function uploadImageToFtp(fileName, path) {
  const client = new ftp.Client()
  client.ftp.verbose = true
  try {
    await client.access({
      host: process.env.FTP_HOST,
      user: process.env.FTP_USER,
      password: process.env.FTP_PASSWORD,
      secure: false
    })
    await client.uploadFrom(path, "images/bd/" + fileName)
  } catch (err) {
    console.log(err)
  }
  client.close()
}
Response on production
Connected to EXTERNAL_IP_ADDRESS
< 220 server ready - login please
Login security: No encryption
> USER username
< 331 password required
> PASS ###
Again, on localhost everything works: we get past this step and start uploading the file(s) to the same server with the same credentials.
On production, I never get any response after this point, except for a timeout with a 502 Bad Gateway from my request.
I don't know the library, but the problem sounds like the FTP session is running in active mode. That can often be a problem, so if it is in active mode, I'd recommend trying setting your client to ask for passive mode.
There is an issue with AWS dynamic routing: your instance is inside a VPC, and the NAT is not able to resolve the address back from the FTP server to your instance.
You can try adding a route entry to the IP table. Check here
This tells the NAT to resolve that particular FTP server to a specific address. I hope this helps.

Splash does not connect to proxy using any of the 3 ways described in documentation

The Splash browser does not send anything through the HTTP proxy. The pages are fetched even when the proxy is not running.
I am using scrapy with splash in Python 3 to fetch pages after authentication for an Angular.js website. The script is able to fetch pages, authenticate, and fetch pages after authentication. However, it does not use the proxy set up at localhost:8090, and Wireshark confirms that traffic coming from port 8050 goes to some port in the 50k range.
The setup is
- splash running locally on a docker image (latest) on port 8050
- python 3 running locally on a mac
- Zap proxy running locally on a mac at port 8090
- Web page accessed through VPN
I have tried to specify the proxy host:port through the server using Chrome with a Lua script. The page is fetched without the proxy.
I have tried to specify the proxy in the Python script, both with Lua and with the API (args={'proxy':'host:port'}), and the page is fetched without using the proxy.
I have tried using the proxy-profile file and I get status 502.
Proxy set through Lua on Chrome (no error, not proxied):
function main(splash, args)
  splash:on_request(function(request)
    request:set_proxy{
      host = "127.0.0.1",
      port = 8090,
      username = "",
      password = "",
      type = "HTTP"
    }
  end)
  assert(splash:go(args.url))
  assert(splash:wait(0.5))
  return {
    html = splash:html(),
    png = splash:png(),
    har = splash:har(),
  }
end
req = SplashRequest("http://mysite/home", self.log_in,
                    endpoint='execute', args={'lua_source': script})
Proxy set through api (status 502):
req = SplashRequest("http://mysite/home",
self.log_in, args={'proxy': 'http://127.0.0.1:8090'})
Proxy set through Lua in Python (no error, not proxied):
def start_requests(self):
    script = """
    function main(splash, args)
      assert(splash:go(args.url))
      assert(splash:wait(0.5))
      splash:on_request(function(request)
        request:set_proxy{
          host = "127.0.0.1",
          port = 8090,
          username = "",
          password = "",
          type = "HTTP"
        }
      end)
      return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
      }
    end
    """
    req = SplashRequest("http://mysite/home", self.log_in,
                        endpoint='execute', args={'lua_source': script})
    # req.meta['proxy'] = 'http://127.0.0.1:8090'
    yield req
Proxy set through proxy file in docker image (status 502):
proxy file:
[proxy]
; required
host=127.0.0.1
port=8090
Shell command:
docker run -it -p 8050:8050 -v ~/Documents/proxy-profile:/etc/splash/proxy-profiles scrapinghub/splash --proxy-profiles-path=/etc/splash/proxy-profiles
All of the above should make the page show up in the Zap proxy at port 8090.
Some of the above seem to set the proxy, but the proxy can't be reached at localhost:8090 (status 502). Some don't work at all (no error, not proxied). I think this may be related to the fact that a Docker image is being used.
I am not looking to use Selenium, because that is what this is replacing.
All of the methods that return status 502 are setting the proxy correctly. The 502 occurs because a Docker container cannot reach localhost on the host: inside the container, localhost is the container itself. To resolve this on a Mac host, use http://docker.for.mac.localhost:8090 as the proxy host:port. On Linux, run docker run -it --network host scrapinghub/splash and keep localhost:port. With --network host, -p is ignored, since all services in the container are exposed directly on the host's localhost.
Method 2 is best for a single proxy without rules. Method 4 is best for multiple proxies with rules.
I did not try the other methods to see what they would return with these changes, or why.
Alright I have been struggling with the same problem for a while now, but I found the solution for your first method on GitHub, which is based on what the Docker docs state:
The host has a changing IP address (or none if you have no network access). From 18.03 onwards our recommendation is to connect to the special DNS name host.docker.internal, which resolves to the internal IP address used by the host.
The gateway is also reachable as gateway.docker.internal.
This means that you can use "host.docker.internal" as the host for your proxy instead, e.g.
splash:on_request(function (request)
  request:set_proxy{
    host = "host.docker.internal",
    port = 8090
  }
end)
Here is the link to the explanation: https://github.com/scrapy-plugins/scrapy-splash/issues/99#issuecomment-386158523
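The same host substitution applies when the proxy is passed as a request argument, as in the args={'proxy': ...} attempt from the question. Below is a minimal Node sketch of building such a request URL, assuming Splash is published on localhost:8050, that its render.html endpoint accepts the url and proxy arguments used in the question, and that host.docker.internal resolves inside the container. buildSplashRenderUrl is a hypothetical helper.

```javascript
// Build a Splash render.html URL that routes the page fetch through a proxy.
// The proxy address is resolved inside the Splash container, so it names the
// host via host.docker.internal rather than 127.0.0.1.
function buildSplashRenderUrl(splashBase, targetUrl, proxyUrl) {
  const url = new URL('/render.html', splashBase);
  url.searchParams.set('url', targetUrl);     // page Splash should render
  url.searchParams.set('proxy', proxyUrl);    // proxy Splash should use
  return url.toString();
}

const renderUrl = buildSplashRenderUrl(
  'http://localhost:8050',
  'http://mysite/home',
  'http://host.docker.internal:8090'
);
console.log(renderUrl);
```

The query values are percent-encoded by URLSearchParams, so the printed URL looks different from the raw proxy address but decodes back to it.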

Resources