Resolving net::ERR_TUNNEL_CONNECTION_FAILED on an Ubuntu server - node.js

I'm running a Puppeteer scraper on a DigitalOcean droplet.
The server is Ubuntu 18.04.
ufw is enabled, with the ssh, http, and https ports open.
The scraper is managed by pm2.
This is the current output and a code snippet:
0|server | 2019-12-23T09:09:27.266Z: [openPage] Error:
net::ERR_TUNNEL_CONNECTION_FAILED at https://xxxx/xxxx
...
const browser = await puppeteer.launch({
  headless: false,
  args: ["--no-sandbox", "--proxy-server=zproxy.lum-superproxy.io:22225"]
});
page = await browser.newPage()
// set a random user agent on the page
await page.setUserAgent(agents[Math.floor(Math.random() * agents.length)])
await page.authenticate({
  username: process.env.USERNAME,
  password: process.env.PWD
})
....
The env variables are also set correctly; I verified this with console.log(process.env.USERNAME).

Ensure that your proxy DOES support HTTPS/SSL if you'd like Puppeteer to scrape HTTPS content.
You can easily test if your proxy supports SSL with:
curl --proxy [ip]:[port] https://ipinfo.io/ip
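The same check can also be run from inside Puppeteer itself; a minimal sketch, assuming the same Luminati endpoint and credentials as in the question:
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    args: ['--no-sandbox', '--proxy-server=zproxy.lum-superproxy.io:22225'],
  });
  const page = await browser.newPage();
  await page.authenticate({
    username: process.env.USERNAME,
    password: process.env.PWD,
  });
  // If the tunnel works, this prints the proxy's exit IP instead of
  // failing with net::ERR_TUNNEL_CONNECTION_FAILED.
  await page.goto('https://ipinfo.io/ip');
  console.log(await page.evaluate(() => document.body.innerText));
  await browser.close();
})();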

Related

Is there any way to use puppeteer's userDataDir when the project is running on heroku?

like:
browser = await puppeteer.launch({
  userDataDir: './test/myChromeSession',
})
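For reference, userDataDir can be pointed at any writable path; on Heroku the dyno filesystem is ephemeral, so a profile stored this way will not survive a restart. A minimal sketch (the /tmp path is an assumption):
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    args: ['--no-sandbox'],
    // /tmp is writable on a Heroku dyno, but the profile is lost
    // whenever the dyno restarts or cycles. (hypothetical path)
    userDataDir: '/tmp/myChromeSession',
  });
  const page = await browser.newPage();
  await page.goto('https://example.com');
  await browser.close();
})();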

Headless Chrome (Puppeteer) different behaviour running in local docker and remote docker (AWS EC2)

I am trying to debug an issue which causes headless Chrome using Puppeteer to behave differently in my local environment and in a remote environment such as AWS or Heroku.
The application searches publicly available jobs on LinkedIn without authentication (no need to look at profiles); the url format is something like this: https://www.linkedin.com/jobs/search?keywords=Engineer&location=New+York&redirect=false&position=1&pageNum=0
When I open this url in my local environment I have no problems, but when I try to do the same thing on a remote machine such as an AWS EC2 instance or a Heroku dyno, LinkedIn redirects me to a login form. To debug this difference I've built a Docker image (based on this image) to have isolation from my local Chrome/profile:
Dockerfile
FROM buildkite/puppeteer
WORKDIR /app
COPY . .
RUN npm install
CMD node index.js
EXPOSE 9222
index.js
const puppeteer = require("puppeteer-extra");
puppeteer.use(require("puppeteer-extra-plugin-stealth")());

const testPuppeteer = async () => {
  console.log('Opening browser');
  const browser = await puppeteer.launch({
    headless: true,
    slowMo: 20,
    args: [
      '--remote-debugging-address=0.0.0.0',
      '--remote-debugging-port=9222',
      '--single-process',
      '--lang=en-GB',
      '--disable-dev-shm-usage',
      '--no-sandbox',
      '--disable-setuid-sandbox',
      "--proxy-server='direct://'",
      '--proxy-bypass-list=*',
      '--disable-gpu',
      '--allow-running-insecure-content',
      '--enable-automation',
    ],
  });
  console.log('Opening page...');
  const page = await browser.newPage();
  console.log('Page open');
  const url = "https://www.linkedin.com/jobs/search?keywords=Engineer&location=New+York&redirect=false&position=1&pageNum=0";
  console.log('Opening url', url);
  await page.goto(url, {
    waitUntil: 'networkidle0',
  });
  console.log('Url open');
  // page && await page.close();
  // browser && await browser.close();
  console.log("Done! Leaving page open for remote inspection...");
};

(async () => {
  await testPuppeteer();
})();
The docker image used for this test can be found here.
I've run the image on my local environment with the following command:
docker run -p 9222:9222 spinlud/puppeteer-linkedin-test
Then from the local Chrome browser's chrome://inspect it should be possible to inspect the GUI of the application (I have deliberately left the page open in the headless browser).
As you can see, even in local Docker the page opens without authentication.
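As a side note, the remote-debugging endpoint can also be attached to programmatically instead of through chrome://inspect; a minimal sketch, assuming Puppeteer is installed on the inspecting machine:
const puppeteer = require('puppeteer');

(async () => {
  // Attach to the already-running headless Chrome inside the container.
  const browser = await puppeteer.connect({
    browserURL: 'http://localhost:9222',
  });
  const pages = await browser.pages();
  console.log(pages.map(p => p.url()));
  browser.disconnect();
})();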
I've done the same test on an AWS EC2 (Amazon Linux 2) with Docker installed. It needs to be a public instance with SSH access and an inbound rule to allow traffic through port 9222 (for remote Chrome debugging).
I've run the same command:
docker run -p 9222:9222 spinlud/puppeteer-linkedin-test
Then again from the local Chrome browser's chrome://inspect, once the remote public IP of the EC2 instance was added, I was able to inspect the GUI of the remote headless Chrome as well.
As you can see, this time LinkedIn requires authentication. We can also see a difference in the cookies.
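To compare the two sessions programmatically, the cookies can also be dumped in both environments with Puppeteer's page.cookies(), e.g. right after the page.goto call:
// Log whatever cookies LinkedIn has set for the current page.
const cookies = await page.cookies();
console.log(JSON.stringify(cookies, null, 2));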
I can't understand the reasons behind this different behaviour between my local and remote environments. In theory Docker should provide isolation, and in both environments the headless browser should start with no cookies and a fresh (empty) session. Still, there is a difference and I can't figure out why.
Does anyone have any clue?

Puppeteer scrapes news from Google properly on a local server but not on Heroku

I have added the required buildpacks. There are also no errors shown in the Heroku logs. Locally the deployed application works completely fine and scrapes the required news, but on Heroku the page just refreshes and displays nothing.
app.post("/news", function(req, res) {
  var pla = req.body.place;
  var url = 'https://www.google.com/search?q=covid+19+' + pla + '&sxsrf=ALeKk02SupK-SO625SAtNAmqA5CHUj5xjg:1586447007701&source=lnms&tbm=nws&sa=X&ved=2ahUKEwikieXS19voAhXAxzgGHV5bCcQQ_AUoAXoECBwQAw&biw=1536&bih=535';
  (async () => {
    const browser = await puppeteer.launch({ args: ['--no-sandbox'] });
    const page = await browser.newPage();
    await page.goto(url);
    var data = await page.evaluate(() =>
      Array.from(document.querySelectorAll('div.g'))
        .map(compact => ({
          headline: compact.querySelector('h3').innerText.trim(),
          img: compact.querySelector("img") === null ? 'https://upload.wikimedia.org/wikipedia/commons/thumb/6/6c/No_image_3x4.svg/1280px-No_image_3x4.svg.png' : compact.querySelector("img.th.BbeB2d").src,
          url: compact.querySelector("h3.r.dO0Ag>a").href,
          source: compact.querySelector("div.gG0TJc>div.dhIWPd>span.xQ82C.e8fRJf").innerText.trim(),
          time: compact.querySelector("div.gG0TJc>div.dhIWPd>span.f.nsa.fwzPFf").innerText.trim(),
          desc: compact.querySelector("div.st").innerText.trim()
        }))
    );
    console.log(data);
    res.render('news.ejs', { data: data });
    await browser.close();
  })();
});
I'd suggest you add the '--disable-setuid-sandbox' flag to your Puppeteer launch command:
const browser = await puppeteer.launch({args: ['--no-sandbox', '--disable-setuid-sandbox']});
I had a similar problem in the past, and if I recall correctly, this flag helped.
Maybe this could help (copied from the official Puppeteer website), because I had a similar problem and it worked for me.
Running Puppeteer on Heroku (https://github.com/puppeteer/puppeteer/blob/main/docs/troubleshooting.md#running-puppeteer-on-heroku)
Running Puppeteer on Heroku requires some additional dependencies that aren't included on the Linux box that Heroku spins up for you. To add the dependencies on deploy, add the Puppeteer Heroku buildpack to the list of buildpacks for your app under Settings > Buildpacks.
The url for the buildpack is https://github.com/jontewks/puppeteer-heroku-buildpack
Ensure that you're using '--no-sandbox' mode when launching Puppeteer. This can be done by passing it as an argument to your .launch() call: puppeteer.launch({ args: ['--no-sandbox'] });.
When you click add buildpack, simply paste that url into the input, and click save. On the next deploy, your app will also install the dependencies that Puppeteer needs to run.
If you need to render Chinese, Japanese, or Korean characters you may need to use a buildpack with additional font files like https://github.com/CoffeeAndCode/puppeteer-heroku-buildpack
There's also another simple guide from @timleland that includes a sample project: https://timleland.com/headless-chrome-on-heroku/.
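As a side note, the buildpack can also be added from the command line instead of the Settings page (assuming the Heroku CLI is installed):
heroku buildpacks:add https://github.com/jontewks/puppeteer-heroku-buildpack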

Getting error when attempting to use proxy server in Node.js / Puppeteer

I am attempting to use a proxy within my Node.js / Puppeteer application and receiving errors.
If I remove the proxy code, the application runs as intended.
const browser = await puppeteer.launch({args: ['--proxy-server=socks5://127.0.0.1:9050'], headless: false});
I expect the application to run as usual, but with a different IP.
Error received: ERR_PROXY_CONNECTION_FAILED
Either your proxy is not working, or Puppeteer is rejecting it because it is most likely using a self-signed cert. To fix a cert issue, add the following args:
args: [
  '--proxy-server=socks5://127.0.0.1:9050',
  '--ignore-certificate-errors',
  '--ignore-certificate-errors-spki-list'
]
See: https://github.com/GoogleChrome/puppeteer/issues/1159
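To verify the proxy end to end, here is a minimal sketch that prints the exit IP (assuming Tor, or another SOCKS5 proxy, is listening on 127.0.0.1:9050):
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({
    headless: false,
    args: [
      '--proxy-server=socks5://127.0.0.1:9050',
      '--ignore-certificate-errors',
      '--ignore-certificate-errors-spki-list'
    ]
  });
  const page = await browser.newPage();
  // If the proxy works, this prints the proxy's exit IP, not the machine's.
  await page.goto('https://api.ipify.org?format=text');
  console.log(await page.evaluate(() => document.body.innerText));
  await browser.close();
})();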

Headless chrome proxy server settings

Could anyone help me with setting a proxy server for headless Chrome while using the Lighthouse Chrome launcher in Node.js, as mentioned here?
const launcher = new ChromeLauncher({
  port: 9222,
  autoSelectChrome: true, // False to manually select which Chrome install.
  additionalFlags: [
    '--window-size=412,732',
    '--disable-gpu',
    '--proxy-server="IP:PORT"',
    headless ? '--headless' : ''
  ]
});
However, the above script does not hit my proxy server at all. Chrome seems to fall back to DIRECT:// connections to the target website.
One other resource that talks about using an HTTP/HTTPS proxy server in the context of headless Chrome is this, but it does not give any example of how to use it from Node.js.
I tried it using regular exec and it works just fine; here is my snippet:
const exec = require('child_process').exec;

function launchHeadlessChrome(url, callback) {
  // Assuming MacOSx.
  const CHROME = '/Users/h0x91b/Desktop/Google\\ Chrome\\ Beta.app/Contents/MacOS/Google\\ Chrome';
  exec(`${CHROME} --headless --disable-gpu --remote-debugging-port=9222 --proxy-server=127.0.0.1:8888 ${url}`, callback);
}

launchHeadlessChrome('https://www.chromestatus.com', (err, stdout, stderr) => {
  console.log('callback', err, stderr, stdout);
});
Then I navigated to http://localhost:9222 and in Developer Tools I saw a proxy connection error, which is fine, because I don't have a proxy on this port, but it means that Chrome tried to connect via the proxy...
BTW, the Chrome version is 59.
I have checked the source code (https://github.com/GoogleChrome/lighthouse/blob/master/chrome-launcher/chrome-launcher.ts#L38-L44).
I see no additionalFlags option there; there is only chromeFlags. Try using that instead...
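For reference, a minimal sketch using the standalone chrome-launcher package (an assumption; at the time of the question the launcher still shipped inside the Lighthouse repo):
const chromeLauncher = require('chrome-launcher');

(async () => {
  const chrome = await chromeLauncher.launch({
    port: 9222,
    chromeFlags: [
      '--headless',
      '--disable-gpu',
      '--window-size=412,732',
      '--proxy-server=IP:PORT' // replace with your proxy's address
    ]
  });
  console.log(`Chrome debugging port: ${chrome.port}`);
  // Call chrome.kill() when done.
})();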
